univariate analysis

April 07, 2021

univariate analysis looks at a single column at a time to understand what that column contains

for categorical data

1. countplot / bar graph

a countplot is simply a bar graph showing how many rows fall into each category

for example, to find out how many Titanic passengers were male and how many were female:

import seaborn as sns

sns.countplot(x=df['Sex'])

you can get the counts numerically with value_counts():

df['Survived'].value_counts()

and you can plot those counts as a bar graph directly from pandas:

df['Embarked'].value_counts().plot(kind='bar')
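To make it concrete what value_counts() returns, here is a minimal sketch on a tiny made-up DataFrame. The six rows are hypothetical and only stand in for the real Titanic data; the counts it prints are the same numbers a countplot would draw as bar heights.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame (an assumption)
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", "C"]})

# Count how many rows fall into each category, most frequent first
counts = df["Embarked"].value_counts()
print(counts)
```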

2. pie chart

df['Embarked'].value_counts().plot(kind='pie')

by default the slices are not labelled with percentages; to print the percentage on each slice, pass autopct:

df['Embarked'].value_counts().plot(kind='pie',autopct='%.2f')
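The percentages that autopct prints can also be obtained as numbers. A small sketch, again on hypothetical toy rows rather than the real dataset, using the normalize option of value_counts():

```python
import pandas as pd

# Hypothetical toy data (an assumption, not the real Titanic rows)
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", "C", "S", "S"]})

# normalize=True returns fractions instead of counts; scale them to percent
percentages = (df["Embarked"].value_counts(normalize=True) * 100).round(2)
print(percentages)
```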

for numerical data

1. histogram

it groups the values into ranges (bins) and shows how many values fall into each range, so you can see how the data is distributed

for age:

import matplotlib.pyplot as plt

plt.hist(df['Age'],bins=5)

bins is optional: it only sets how many ranges the values are grouped into, which makes the plot easier to read

from the result you can conclude that very young children (age 0 and just above) were few on the Titanic, the oldest passengers were also few, and middle-aged passengers were the most numerous
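If you want the numbers behind the bars, numpy can compute the same binning without drawing anything. This is a rough sketch on made-up ages (the values are an assumption, not the real Titanic column); dropna() is there because the real Age column contains missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical ages used only for illustration
ages = pd.Series([2, 4, 18, 22, 25, 27, 30, 31, 35, 38, 40, 45, 52, 60, 71, 80], name="Age")

# Split the range from min to max into 5 equal-width bins and count values per bin
counts, edges = np.histogram(ages.dropna(), bins=5)

# Print each bin as "left edge to right edge: count"
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```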

2. distplot

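a distplot draws the distribution as a density curve, which lets you reason about the data in terms of probability

for example, how likely was a passenger on the Titanic to be around 40 years old? from the data:

sns.distplot(df['Age'])

the height of the curve at age 40 tells you how common that age was compared with other ages. note that the y-axis is a probability density, not a percentage: the probability of an age range is the area under the curve over that range. (distplot is deprecated in recent seaborn releases; sns.histplot(x=df['Age'], kde=True) draws a similar plot)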
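The difference between density and probability is easier to see with numbers. The sketch below uses hypothetical ages and a plain histogram density (not the smoothed curve seaborn draws) to show that density times bin width gives a probability, and that a question like "how likely is an age between 35 and 45" is answered by a share of the data.

```python
import numpy as np

# Hypothetical ages used only for illustration
ages = np.array([2, 4, 18, 22, 25, 27, 30, 31, 35, 38, 40, 45, 52, 60, 71, 80])

# density=True rescales the bar heights so the total area equals 1
density, edges = np.histogram(ages, bins=5, density=True)
widths = np.diff(edges)

# Probability of each bin = height (density) * width
bin_probability = density * widths
print(bin_probability, bin_probability.sum())

# Empirical probability that a passenger is between 35 and 45 years old
share_35_45 = np.mean((ages >= 35) & (ages <= 45))
print(share_35_45)
```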

3. Boxplot

a boxplot consists of:

i. median: divides the data in half

ii. 1st quartile (Q1): 25% of the data lies to its left and 75% to its right
iii. 3rd quartile (Q3): 75% of the data lies to its left and 25% to its right

iv. minimum: the lower limit, calculated as Q1 - 1.5*IQR, where IQR = Q3 - Q1

- lies on the left-hand side, before Q1

v. maximum: the upper limit, calculated as Q3 + 1.5*IQR

- lies on the right-hand side, after Q3

note: all of these appear in every box plot
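The quantities listed above can be computed directly with pandas. A minimal sketch on hypothetical ages; quantile() uses linear interpolation by default, so other tools may give slightly different quartiles on small samples.

```python
import pandas as pd

# Hypothetical ages used only for illustration
ages = pd.Series([2, 4, 18, 22, 25, 27, 30, 31, 35, 38, 40, 45, 52, 60, 71, 80])

q1 = ages.quantile(0.25)      # 1st quartile
median = ages.quantile(0.50)  # median
q3 = ages.quantile(0.75)      # 3rd quartile
iqr = q3 - q1                 # interquartile range

lower_limit = q1 - 1.5 * iqr  # "minimum" of the box plot
upper_limit = q3 + 1.5 * iqr  # "maximum" of the box plot

print(q1, median, q3, iqr, lower_limit, upper_limit)
```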

points that fall outside the minimum and maximum are drawn individually: these are the outliers. they are often noisy data, and the box plot helps you spot them so you can decide whether to remove them

outliers: values that do not fit within the range of the rest of your data

sns.boxplot(x=df['Age'])


in this plot you can see outliers on the upper side; if they are just noise, you can remove them
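One common way to drop such points is to filter on the same 1.5*IQR limits the box plot uses. This is only a sketch on a hypothetical Age column (the value 95 is planted as an obvious outlier), and whether an outlier should really be removed depends on your data.

```python
import pandas as pd

# Hypothetical data; 95 is planted as an upper outlier
df = pd.DataFrame({"Age": [22, 25, 27, 30, 31, 35, 38, 40, 45, 95]})

q1 = df["Age"].quantile(0.25)
q3 = df["Age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rows whose age lies outside the box plot limits
is_outlier = (df["Age"] < lower) | (df["Age"] > upper)

outliers = df[is_outlier]    # rows that would be drawn as individual points
df_clean = df[~is_outlier]   # data with the outliers removed

print(outliers)
print(len(df), "->", len(df_clean))
```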


you can also calculate the mean, median, standard deviation, etc. of a particular column:

 1. df['Age'].mean()
 2. df['Age'].median()
 3. df['Age'].std()
 4. df['Age'].min()
 5. df['Age'].max()
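If you want several of these statistics at once, pandas can return them in a single call. A short sketch on a hypothetical Age column with one missing value; pandas skips missing values in these statistics by default.

```python
import pandas as pd

# Hypothetical data with one missing age
df = pd.DataFrame({"Age": [22, 25, 27, 30, 31, 35, 38, 40, 45, None]})

# Pick specific statistics by name
summary = df["Age"].agg(["mean", "median", "std", "min", "max"])
print(summary)

# Or get the standard summary (count, mean, std, min, quartiles, max)
described = df["Age"].describe()
print(described)
```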
