Univariate analysis

 Univariate analysis looks at a single column on its own to understand what that column contains and how its values are distributed.

For categorical data

1. Countplot / bar graph

 A count plot is simply a bar graph in which the height of each bar is the number of rows in that category.

 For example, you can use it to find out how many passengers on the Titanic were male and how many were female:

                import seaborn as sns

                # pass the column as x=; recent seaborn versions treat the first positional argument as data
                sns.countplot(x=df['Sex'])



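       You can get the counts for any categorical column numerically with value_counts():
         df['Survived'].value_counts()

       The counts can also be plotted directly as a bar graph:
         df['Embarked'].value_counts().plot(kind='bar')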
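To make the counting step concrete, here is a minimal sketch on a small made-up DataFrame. The rows are hypothetical and only stand in for the real Titanic data, and the column names Sex and Survived are assumed to match the ones used in this post. It also uses pd.crosstab to count survival outcomes within each sex, which is one way to answer "how many males and how many females died".

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic DataFrame `df`.
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female", "female"],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# Counts per category: these are the bar heights of a count plot.
sex_counts = df["Sex"].value_counts()

# Counts of each survival outcome within each sex
# (rows are Sex, columns are Survived: 0 = died, 1 = survived).
sex_by_survival = pd.crosstab(df["Sex"], df["Survived"])

print(sex_counts)
print(sex_by_survival)
```

On the real data you would run the same two calls on your own df instead of the toy one.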

2. Pie chart
     df['Embarked'].value_counts().plot(kind='pie')

   By default the slices do not show percentages. To show them, pass autopct:

     df['Embarked'].value_counts().plot(kind='pie', autopct='%.2f')
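If you want the percentages as numbers rather than labels on a chart, value_counts accepts normalize=True. The sketch below uses a short made-up Embarked column (hypothetical values, not the real dataset) to show the idea.

```python
import pandas as pd

# Hypothetical toy column; on real data use df['Embarked'] instead.
embarked = pd.Series(["S", "S", "C", "Q", "S", "C", "S", "S"], name="Embarked")

# normalize=True returns fractions instead of counts;
# multiply by 100 and round to get percentages like the pie labels.
embarked_pct = (embarked.value_counts(normalize=True) * 100).round(2)

print(embarked_pct)
```

Note that value_counts ignores missing values by default, so the percentages are relative to the rows that actually have a value.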

For numerical data

1. Histogram
       A histogram groups the values into ranges (bins) and shows how many values fall in each one, so you can see how the data is spread.
       For Age:
           import matplotlib.pyplot as plt
           plt.hist(df['Age'], bins=5)
     
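  You can also leave out bins and let matplotlib use its default of 10. The argument only controls how coarse or fine the plot looks.

From the resulting histogram you can conclude that there were few infants and very young children on the Titanic, that the oldest passengers were also few in number, and that most passengers were in the middle of the age range.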
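To see what bins=5 does numerically, the sketch below computes the same bin counts with np.histogram on a handful of made-up ages. The values are hypothetical, chosen only to illustrate the shape described above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for df['Age'].
ages = pd.Series([2, 5, 18, 22, 25, 28, 30, 31, 35, 40, 47, 62, 80], name="Age")

# Split the range min..max into 5 equal-width bins and count values per bin.
# dropna() guards against missing ages in real data.
counts, edges = np.histogram(ages.dropna(), bins=5)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```

The counts are the bar heights that plt.hist would draw for the same data and the same number of bins.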




 

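2. Distplot
 
     A distplot shows the distribution as a probability density, so you can reason about how likely a value is.
  For example: how likely is it that a passenger on the Titanic was about 40 years old? From the data:
  sns.distplot(df['Age'])

  Note: distplot is deprecated since seaborn 0.11. sns.histplot(df['Age'], kde=True, stat='density') draws a similar plot.

  From the graph you can read the height of the curve at age 40. Keep in mind that the y-axis is a probability density per year of age, not a percentage: if the curve reads about 0.015 at age 40, that means roughly 1.5% of passengers fall within a one-year band around 40, not 15%. The probability for a range of ages is the area under the curve over that range.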
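The difference between a density and a probability is easier to see with numbers. This is a minimal sketch on made-up ages (hypothetical values, not the Titanic data): the density of each bin times the bin width gives the probability of that bin, and a probability for an age range can be estimated directly as the fraction of rows in that range.

```python
import numpy as np

# Hypothetical ages standing in for df['Age'].dropna().
ages = np.array([2, 5, 18, 22, 25, 28, 30, 31, 35, 40, 47, 62, 80], dtype=float)

# density=True rescales the bar heights so the total area is 1.
density, edges = np.histogram(ages, bins=5, density=True)
widths = np.diff(edges)

# Height times width is the probability of landing in each bin.
bin_prob = density * widths

# Empirical probability that an age lies between 35 and 45 (inclusive).
p_35_45 = np.mean((ages >= 35) & (ages <= 45))

print(density)
print(bin_prob, bin_prob.sum())
print(p_35_45)
```

The density values themselves are small because they are spread over many years of age; only the areas add up to 1.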

3. Boxplot

     A box plot consists of:
        i. Median: divides the data in half.
       ii. 1st quartile (Q1): 25% of the data lies to its left and 75% to its right.
      iii. 3rd quartile (Q3): 75% of the data lies to its left and 25% to its right.

     The interquartile range is IQR = Q3 - Q1.
     Minimum: the lower limit, calculated with the formula Q1 - 1.5*IQR
                     - lies on the left-hand side, before Q1
     Maximum: the upper limit, calculated with the formula Q3 + 1.5*IQR
                     - lies on the right-hand side, after Q3
 Note: all of these appear in every box plot.
   In the plot, the box spans Q1 to Q3 with a line at the median, and the whiskers extend from the box towards the minimum and maximum.
 
Points that fall outside the minimum and maximum are outliers. They are often noisy data, and the box plot lets you spot them and decide whether to remove them.

Outliers: values that do not fit within the range of the rest of your data.

sns.boxplot(x=df['Age'])
In this plot you can see outliers above the maximum. You can remove them if they are not needed for your analysis.
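The limits that the box plot draws can also be computed by hand. Here is a minimal sketch of the 1.5*IQR rule on made-up ages (hypothetical values standing in for df['Age']), which flags the outliers and builds a copy of the column without them.

```python
import pandas as pd

# Hypothetical ages standing in for df['Age'].
ages = pd.Series([2, 5, 18, 22, 25, 28, 30, 31, 35, 40, 47, 62, 80], name="Age")

q1 = ages.quantile(0.25)   # 1st quartile
q3 = ages.quantile(0.75)   # 3rd quartile
iqr = q3 - q1              # interquartile range

lower = q1 - 1.5 * iqr     # "minimum" limit of the box plot
upper = q3 + 1.5 * iqr     # "maximum" limit of the box plot

# Values outside the limits are the outliers drawn as separate points.
outliers = ages[(ages < lower) | (ages > upper)]

# Keep only the values inside the limits.
trimmed = ages[(ages >= lower) & (ages <= upper)]

print(lower, upper)
print(outliers.tolist())
```

Whether the flagged values should really be dropped depends on the analysis. A very old passenger is unusual but still a real data point.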


You can also calculate the mean, median, standard deviation, and so on for a particular column:
 1. df['Age'].mean()
 2. df['Age'].min()
 3. df['Age'].max()
 etc.
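As a rough sketch, the individual statistics can be collected in one place, and describe() returns most of them in a single call. The ages below are made up, including one missing value to show that pandas skips missing values in these calculations by default.

```python
import pandas as pd

# Hypothetical ages with one missing value, standing in for df['Age'].
ages = pd.Series([22, 38, 26, 35, None, 54], name="Age")

summary = {
    "count": ages.count(),    # number of non-missing values
    "mean": ages.mean(),
    "median": ages.median(),
    "std": ages.std(),        # sample standard deviation (ddof=1)
    "min": ages.min(),
    "max": ages.max(),
}
print(summary)

# describe() reports count, mean, std, min, quartiles and max at once.
print(ages.describe())
```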







