Seaborn plot to visualize Iris data

I have created this kernel for beginners who want to learn how to plot graphs with seaborn. This kernel is still a work in progress; I will update it further when I find some time. If you find my work useful, please do vote by clicking at the top of the page. Thanks for viewing.

 [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
['database.sqlite', 'Iris.csv']

Importing the pandas and seaborn modules

 [2]:
import pandas as pd
import seaborn as sns

Importing Iris data set

 [3]:
iris=pd.read_csv('../input/Iris.csv')

Displaying data

 [4]:
iris.head()
[4]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
 [5]:
iris.drop('Id',axis=1,inplace=True)

Checking if there are any missing values

 [6]:
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
 [7]:
iris['Species'].value_counts()
[7]:
Iris-versicolor    50
Iris-setosa        50
Iris-virginica     50
Name: Species, dtype: int64

This data set has three varieties of the Iris plant.

Joint plot

 [8]:
sns.jointplot(x='SepalLengthCm',y='SepalWidthCm',data=iris,size=5)
[8]:
<seaborn.axisgrid.JointGrid at 0x7ff33a187e48>

FacetGrid Plot

 [9]:
import matplotlib.pyplot as plt
%matplotlib inline
sns.FacetGrid(iris,hue='Species',size=5)\
.map(plt.scatter,'SepalLengthCm','SepalWidthCm')\
.add_legend()
[9]:
<seaborn.axisgrid.FacetGrid at 0x7ff30ab847b8>
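Note: on seaborn 0.9 and later, the 'size' argument used in the two plots above was renamed to 'height'. A small sketch of the equivalent calls, assuming a recent seaborn version:

import seaborn as sns
import matplotlib.pyplot as plt
sns.jointplot(x='SepalLengthCm', y='SepalWidthCm', data=iris, height=5)
sns.FacetGrid(iris, hue='Species', height=5)\
   .map(plt.scatter, 'SepalLengthCm', 'SepalWidthCm')\
   .add_legend()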

Boxplot

 [10]:
sns.boxplot(x='Species',y='PetalLengthCm',data=iris)
[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff306feaa58>

 

Strip plot

 [11]:
ax=sns.stripplot(x='Species',y='SepalLengthCm',data=iris,jitter=True,edgecolor='gray')

Combining Box and Strip Plots

 [12]:
ax=sns.boxplot(x='Species',y='SepalLengthCm',data=iris)
ax=sns.stripplot(x='Species',y='SepalLengthCm',data=iris,jitter=True,edgecolor='gray')

Violin Plot

 [13]:
sns.violinplot(x='Species',y='SepalLengthCm',data=iris,size=6)
[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff306e4ee48>

Pair Plot

 [14]:
sns.pairplot(data=iris,kind='scatter')
[14]:
<seaborn.axisgrid.PairGrid at 0x7ff306e702e8>
 [15]:
sns.pairplot(iris,hue='Species')
[15]:
<seaborn.axisgrid.PairGrid at 0x7ff324e57dd8>

Plotting heat map

 [16]:
plt.figure(figsize=(7,4))
sns.heatmap(iris.corr(),annot=True,cmap='summer')
[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff3043a1400>
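Note: on pandas 2.0 and later, corr() no longer silently drops the text 'Species' column, so the call above may raise an error there. A hedged variant that selects only the numeric columns first:

plt.figure(figsize=(7,4))
sns.heatmap(iris.select_dtypes('number').corr(), annot=True, cmap='summer')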

Distribution plot

 [17]:
iris.hist(edgecolor='black', linewidth=1.2)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()

Swarm plot

 [18]:
sns.set(style="whitegrid")
fig=plt.gcf()
fig.set_size_inches(10,7)
fig = sns.swarmplot(x="Species", y="PetalLengthCm", data=iris)

PEARSON CORRELATION SCORE

Hello everyone, in our last Machine Learning tutorial we learnt how to use the Euclidean Distance formula to find the similarity between people. In this tutorial we learn a new, somewhat more advanced way to do the same thing: we will use the Pearson Correlation Score to calculate similarity between people. Compared to Euclidean Distance it differs in one major way: even if the distance between the ratings given by two persons is large, as long as that difference is roughly consistent across all fruits, the Pearson Correlation Score will mark the two persons as highly similar or even identical.

          Mango    Banana    Strawberry    Pineapple    Orange    Apple
John      4.5      3.5       4             4            -         -
Martha    -        2.5       4.5           -            5         3.5
Mathew    3.75     -         4.25          3            -         3.5
Nick      4        3         -             4.5          4.5       -

For example, in the above data, if we look at 'John' and 'Martha' the difference between their ratings is nearly the same for every fruit they have both rated, so the Pearson Correlation value for them will be around 1.

Pearson Correlation Score

We will calculate the Pearson Correlation Score only for those fruits which are common to both persons.

r = ( Σxy − (Σx · Σy)/n ) / sqrt( (Σx² − (Σx)²/n) · (Σy² − (Σy)²/n) )

The above formula gives the Pearson Correlation Coefficient (Score), where 'n' is the sample size (the total number of common fruits) and 'x' and 'y' are the two persons' ratings for each fruit.

Python code for the above method:

 

#Dictionary of People rating for fruits
choices={'John': {'Mango':4.5, 'Banana':3.5, 'Strawberry':4.0, 'Pineapple':4.0},
'Nick': {'Mango':4.0, 'Orange':4.5, 'Banana':3.0, 'Pineapple':4.5},
'Martha': {'Orange':5.0, 'Banana':2.5, 'Strawberry':4.5, 'Apple':3.5},
'Mathew': {'Mango':3.75, 'Strawberry':4.25, 'Apple':3.5, 'Pineapple':3.0}}
from math import sqrt
#Finding Similarity among people using the Pearson Correlation Score
class testClass():
    def pearson(self, cho, per1, per2):
        #The following dictionary will hold the items that are common to both persons
        sample_data={}
        #The above variable starts as an empty dictionary, that is, length = 0
        for items in cho[per1]:
            if items in cho[per2]:
                sample_data[items]=1
                #Value is being set 1 for those items which are same for both persons
        #Calculating length of sample_data dictionary
        length = len(sample_data)
        #If the two persons have no items in common, return 0
        if length==0: return 0
        #Remember one thing we will calculate all the below values only for common items
        #   or the items which are being shared by both person1 and person2, that's why
        #   we will be using sample_data dictionary in below loops
        #Calculating Sum of all common elements for Person1 and Person2
        sum1=sum([cho[per1][val] for val in sample_data])
        sum2=sum([cho[per2][val] for val in sample_data])
        #Calculating Sum of squares of all common elements for both
        sumSq1=sum([pow(cho[per1][val],2) for val in sample_data])
        sumSq2=sum([pow(cho[per2][val],2) for val in sample_data])
        #Calculating Sum of Products of all common elements for both
        sumPr=sum([cho[per1][val]*cho[per2][val] for val in sample_data])
        #Calculating the Pearson Correlation Score
        x = sumPr-(sum1*sum2/length)
        y = sqrt((sumSq1-pow(sum1,2)/length)*(sumSq2-pow(sum2,2)/length))
        if y==0 : return 0
        return(x/y)
        #Value being returned above always lies between -1 and 1
        #Value of 1 means maximum similarity
def main():
    ob = testClass()
    print(ob.pearson(choices, 'John', 'Nick'))
    print(ob.pearson(choices, 'John', 'Martha'))
    print(ob.pearson(choices, 'John', 'John'))
if __name__ == "__main__":
    main()
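If you want to sanity-check the hand-worked 'John'/'Martha' example without running the whole class, numpy computes the same coefficient on the two fruits they have both rated (Banana and Strawberry). This is only a cross-check, not part of the original code:

import numpy as np
john   = [3.5, 4.0]   # John's ratings for Banana, Strawberry
martha = [2.5, 4.5]   # Martha's ratings for the same fruits
print(np.corrcoef(john, martha)[0, 1])   # 1.0 -> consistent differences give maximum similarity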

RECOMMENDING ITEMS

Hi everyone, in our last two tutorials we studied Euclidean Distance and the Pearson Correlation Score for finding similarity among people. Now it's time to recommend some items to people which they have never tried.

Have you ever wondered how shopping or social media websites recommend items to us which we have never tried? There are multiple approaches, including quite complex ones, but for now we will look at one of the easiest, to get a basic idea.

Approach for Recommending Items

          Mango    Banana    Strawberry    Pineapple    Orange    Apple
John      4.5      3.5       4             4            -         -
Martha    -        2.5       4.5           -            5         3.5
Mathew    3.75     -         4.25          3            -         3.5
Nick      4        3         -             4.5          4.5       -

We will use the similarity score to find how similar other people are to a person, and then check which items the others have rated but that person has not. Before going further in depth, let's take an example for a better mapping: using the data set above, we are required to recommend items to 'John'.

  1. Calculate similarity score of everyone with respect to ‘John’
  2. Now list out items which others have provided rating but ‘John’ hasn’t.
  3. We will use a weighted rating to get a better result, that is, multiply the similarity score by the rating of each such item, for every other person.
    • In case of ‘Martha’, fruits which ‘John’ didn’t rate are Orange and Apple
    • Similarity score between ‘Martha’ and ‘John’ is ‘0.4721359549995794’
    • Weighted score = (Similarity_Score * Rating)
    • For Orange weighted score = 0.4721359549995794 * 5 = 2.360679774997897
    • Calculate weighted score corresponding to each fruit and for every other person.
  4. Calculate the sum of all the similarity scores corresponding to each such item
    • For Orange Sum of Similarity per Item (sspi) = Sum of Similarity Score of ‘Martha’ and ‘Nick’
      • sspi = 0.4721359549995794 + 0.5358983848622454
      • sspi = 1.008034339861825
    • For Apple Sum of Similarity per Item (sspi) = Sum of Similarity Score of ‘Martha’ and ‘Mathew’
      • sspi = 0.4721359549995794 + 0.439607805437114
      • sspi = 0.9117437604366934
  5. Calculate Sum of Weighted Score per Item (swcpi)
    • For Orange swcpi = (Martha_Similarity_Score * Rating) + (Nick_Similarity_Score * Rating) 
      • swcpi = (0.4721359549995794 * 5) + (0.5358983848622454*4.5)
      • swcpi = 4.772222506878001
    • For Apple swcpi = 3.191103161528427
  6. For a better result we will take the average of the weighted score with respect to the Sum of Similarity per Item.
    • For Orange Average Weighted Score (aws) = (Sum of Weighted Score per Item)/(Sum of Similarity per Item)
      • aws = (4.772222506878001) / (1.008034339861825)
      • aws = 4.734186444017519
    • For Apple Average Weighted Score (aws) = 3.5

The ranking of fruits for John is given by the Average Weighted Score (aws).
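A quick check of the arithmetic in the steps above. The similarity scores are the Euclidean-based values printed in the output further below; this snippet is only a sketch to verify the Orange figure:

sim = {'Martha': 0.4721359549995794, 'Nick': 0.5358983848622454}
orange_ratings = {'Martha': 5.0, 'Nick': 4.5}                # John has never rated Orange
swcpi = sum(sim[p] * r for p, r in orange_ratings.items())   # sum of weighted scores
sspi  = sum(sim[p] for p in orange_ratings)                  # sum of similarities
print(swcpi / sspi)                                          # ~4.7342, the aws for Orange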

Python implementation of the above algorithm (here I have used the Euclidean Distance formula for calculating similarity; you can use any other model, such as the Pearson Correlation Score, to do the same):

 

#Dictionary of People rating for fruits
choices={'John': {'Mango':4.5, 'Banana':3.5, 'Strawberry':4.0, 'Pineapple':4.0},
'Nick': {'Mango':4.0, 'Orange':4.5, 'Banana':3.0, 'Pineapple':4.5},
'Martha': {'Orange':5.0, 'Banana':2.5, 'Strawberry':4.5, 'Apple':3.5},
'Mathew': {'Mango':3.75, 'Strawberry':4.25, 'Apple':3.5, 'Pineapple':3.0}}

import pandas as pd
from math import sqrt

class testClass():
    def create_csv(self):
        df = pd.DataFrame.from_dict(choices, orient='index')
        df.to_csv('fruits.csv')

    #Finding Similarity among people using the Euclidean Distance Formula
    def choice_distance(self, cho, per1, per2):
        #The following dictionary will hold the items that are common to both persons
        sample_data={}
        #The above variable starts as an empty dictionary, that is, length = 0
        for items in cho[per1]:
            if items in cho[per2]:
                sample_data[items]=1
                #Value is set to 1 for those items which are rated by both persons

        #If the two persons have no items in common, return 0
        if len(sample_data)==0: return 0

        #Calculating the Euclidean Distance
        final_sum = sum([pow(cho[per1][items]-cho[per2][items],2) for items in cho[per1] if items in cho[per2]])
        return(1/(1+sqrt(final_sum)))
        #Value being returned above always lies between 0 and 1
        #Value 1 is added to the sqrt to prevent division by zero and to normalize the result

    #Calculating the similarity value of a person with respect to all other people
    def scoreForAll(self,cho,similarity=choice_distance):
        for others in cho:
            if others!='John':
                score=similarity(self, cho, 'John', others),others
                #Remember to add the self keyword in the above call
                print(score)

    #Recommending which fruit a person should try, which he or she has never tried
    def recommendation(self, cho, per, sim_method=choice_distance):
        sumS={}
        total={}

        for others in cho:
            #Skipping the comparison of the person with himself or herself
            if others==per: continue
            similarVal=sim_method(self,cho,per,others)
            if similarVal == 0: continue
            #If you are using the Pearson Correlation Score, uncomment the line below
            # and comment out the line above
            #if similarVal<=0: continue

            for fruits in cho[others]:
                if fruits not in cho[per] or cho[per][fruits]==0:
                    #Multiply the similarity score with the rating
                    total.setdefault(fruits,0)
                    total[fruits]+=cho[others][fruits]*similarVal

                    #Calculate the sum of similarities
                    sumS.setdefault(fruits,0)
                    sumS[fruits]+=similarVal

        #Generating normalized data
        result=[(totalVal/sumS[fruits],fruits) for fruits,totalVal in total.items()]
        result.sort()
        result.reverse()
        return result

def main():
    ob = testClass()
    ob.create_csv()
    ob.scoreForAll(choices)
    print(ob.recommendation(choices,'John'))

if __name__ == "__main__":
    main()

 

Output :

 

(0.5358983848622454, 'Nick')
(0.4721359549995794, 'Martha')
(0.439607805437114, 'Mathew')
[(4.734186444017522, 'Orange'), (3.5, 'Apple')]

 

 

In our next tutorial I will come up with some new interesting techniques, to show you how easy and interesting Machine Learning is!

Stay tuned and keep learning!!

Stay tuned for more updates and news related to this blog, as well as to data science, machine learning and data visualization.

 

Please Write Your Comments.

Thanks,

Rakesh Kumar

CREATING WORDS’ DATA-SET FROM RSS FEEDS

Hi everyone!! Whenever we try to learn some Machine Learning algorithm, the first thing that comes to mind is "How can we get live, real data for testing our algorithm?". This article focuses on creating exactly such a data set by extracting data from the RSS feeds of multiple websites. Just by adding the URLs of different websites that publish an RSS feed, we will be able to extract many words from them.

The advantage of doing so is that we get authentic data and can use it for multiple Machine Learning tasks, such as clustering, other unsupervised learning methods and many more.

To enable this functionality I will be using the 'feedparser' library for Python; it is an open-source library which helps us extract data from RSS feeds. You can easily download it for free!
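For orientation, this is roughly what feedparser gives you back for a single feed (the URL below is only a hypothetical placeholder; swap in any real RSS feed):

import feedparser
fed = feedparser.parse('https://example.com/feed')   # hypothetical feed URL
print(fed.feed.title)                                # the feed's title
for entry in fed.entries[:3]:
    print(entry.title)                               # each entry has a title plus a summary/description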

Code for it is as follows:

 

#Import feedparser library and re (regular expression) library
import feedparser
import re
#Creating dictionary of titles and word counts corresponding to a RSS feed
def pullWordCount(url):
    #Using feedparser to parse the feed
    fed = feedparser.parse(url)
    wordCount = {}
    for x in fed.entries:
        if 'summary' in x:
            summary = x.summary
        else:
            summary = x.description
        #Extracting a list of words from feeds
        diffWords = pullWords(x.title+ ' ' + summary)
        for words in diffWords:
            wordCount.setdefault(words, 0)
            wordCount[words]+=1
    return fed.feed.title,wordCount
#Removing unnecessary data and refining our data
def pullWords(htmlTag):
    #removing all tags of html
    txtData = re.compile(r'<[^>]+>').sub('', htmlTag)
    #split words with all the non-alpha characters
    words = re.compile(r'[^A-Z^a-z]+').split(txtData)
    #converting all the words to lower case, to create a uniformity
    return [word.lower() for word in words if word!='']
#appearCount has number of times a word has appeared
appearCount = {}
#wordsCount has total words
wordsCount = {}
#testlist.txt contains URLs of websites
for url in open('testlist.txt'):
    title,wordC = pullWordCount(url)
    wordsCount[title] = wordC
    for word,count in wordC.items():
        appearCount.setdefault(word,0)
        if count>1:
            appearCount[word]+=1
wordList=[]
for wor,bc in appearCount.items():
    percent=float(bc)/len(wordsCount)   #divide by the number of feeds parsed
    if percent>0.02 and percent<0.8: wordList.append(wor)
#by the above thresholds we mean that we only keep words whose appearance
# fraction lies between 2% and 80%; you can modify them for different kinds of results
#our data will be saved in BlogWordsData.txt
out=open('BlogWordsData.txt','w')
out.write('TestingBlog')
for word in wordList: out.write('\t%s' %word)
out.write('\n')
for blog,wc in wordsCount.items():
    out.write(blog)
    for word in wordList:
        if word in wc: out.write('\t%d' %wc[word])
        else: out.write('\t0')
    out.write('\n')
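Since the output file is tab-separated with one row per blog, it can be read back for later experiments with pandas, for example (a sketch, assuming the file was written as above):

import pandas as pd
data = pd.read_csv('BlogWordsData.txt', sep='\t', index_col=0)
print(data.shape)     # (number of blogs, number of kept words)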

Test List (testlist.txt)

 

(Screenshot of testlist.txt: a plain text file with one RSS feed URL per line.)

Output

(Screenshot of the generated BlogWordsData.txt: one row per blog, with tab-separated word counts.)

In our next tutorial we will use the data extracted with this technique to learn some new techniques in Unsupervised Learning.

Stay tuned and keep learning!!

 

Please Write Your Comments.

Thanks,

Rakesh Kumar

 

FACE DETECTION USING OPENCV IN LIVE VIDEO

Hi learners!! We often come across the problem of face detection in machine learning, and we wonder how we can build a face detection algorithm in the easiest and fastest way. Well, here is the answer! We will use the OpenCV library in Python to detect faces in the live video being fed from your webcam. At this initial level we are using this library, but next time we will create our own model from scratch, train it and then test it in real time!

OpenCV

OpenCV is a computer vision library with Python bindings which ships with a built-in model for detecting faces in an image using Haar Cascades. We can use OpenCV to extract frames from a video, apply the Haar Cascade to those frames and draw a rectangle around each face present in the image.

NOTE: Before moving on to the code you will need to download "haarcascade_frontalface_default.xml" to run this face detection algorithm. You can find it in the OpenCV GitHub repository (under data/haarcascades). This file basically contains the trained parameters for detecting faces.
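Before the full webcam loop below, here is a minimal sketch of the same idea applied to a single still image (the image file name is just a placeholder for one of your own photos):

import cv2
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
img  = cv2.imread('photo.jpg')                       # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
    cv2.rectangle(img, (x, y), (x + w, y + h), (120, 250, 0), 2)
cv2.imwrite('photo_faces.jpg', img)                  # saves a copy with the detected faces boxed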

Code :

#importing OpenCV and Time library
import cv2
import time
#Loading the Haar Cascade XML file we downloaded
face_cascade = cv2.CascadeClassifier('C:/Documents/MachineLearning/haarcascade_frontalface_default.xml')
#Capturing Video from primary webcam, you can change number from 0 to any Integer
## for other webcams if you have many of them
cap = cv2.VideoCapture(0)
while (True):
#Reading frame from the live video feed
    ret, frame = cap.read()
#Converting frame into grayscale for better efficiency
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
#Using Haar Cascade for detecting faces in an image
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
#Creating the rectangle around face
    for (x,y,w,h) in faces:
        frame = cv2.rectangle(frame,(x,y),(x+w,y+h),(120,250,0),2)
#Displaying the captured frame in a window
    cv2.imshow('gray',frame)
#Applied 0.5 seconds delay such that a new frame will only be read every 0.5 seconds
#This decreases load on machine, because in general webcam captures 15 to 25 frames per second
    time.sleep(0.5)
    print("sleep done!!")
#Press 'q' to quit the video window
    if cv2.waitKey(20) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
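Tip: if you installed OpenCV via pip (the opencv-python package), the cascade file usually ships with it, so you may be able to avoid hard-coding a local path. This is an assumption about your install, not part of the original post:

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')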

 

 

 

 

Input – Output

(Screenshots: a sample input frame and the output frame with a rectangle drawn around the detected face.)

 

The above algorithm was for starters! In the next tutorial we will create a face detection model from scratch, then train it and use it!

Stay tuned for more learning!!

 

PLEASE WRITE YOUR COMMENTS.

THANKS,

RAKESH KUMAR

DATA CLEANING USING R

Introduction

While doing data analysis, before analyzing the data for any kind of result, we must perform the most essential step: Data Cleaning. When we receive data it may not be ready for analysis: there might be missing values which would distort our judgement, outliers which may affect the mean and median, or a variable spread across the values of a column (so that the column needs to be converted into multiple variables, i.e., columns), and many more issues. So before performing any kind of analysis, for example Linear Regression, we must clean the data to get accurate results.

Working on Data Set

Let's use one of the data sets provided with R's MASS package, "survey". It consists of variables like "Age", "Pulse rate", "Gender (Sex)", etc.

library(MASS)
View(survey)
head(survey, 10)
##       Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height
## 1  Female   18.5   18.0 Right  R on L    92    Left Some Never 173.00
## 2    Male   19.5   20.5  Left  R on L   104    Left None Regul 177.80
## 3    Male   18.0   13.3 Right  L on R    87 Neither None Occas     NA
## 4    Male   18.8   18.9 Right  R on L    NA Neither None Never 160.00
## 5    Male   20.0   20.0 Right Neither    35   Right Some Never 165.00
## 6  Female   18.0   17.7 Right  L on R    64   Right Some Never 172.72
## 7    Male   17.7   17.7 Right  L on R    83   Right Freq Never 182.88
## 8  Female   17.0   17.3 Right  R on L    74   Right Freq Never 157.00
## 9    Male   20.0   19.5 Right  R on L    72   Right Some Never 175.00
## 10   Male   18.5   18.5 Right  R on L    90   Right Some Never 167.00
##         M.I    Age
## 1    Metric 18.250
## 2  Imperial 17.583
## 3       16.917
## 4    Metric 20.333
## 5    Metric 23.667
## 6  Imperial 21.000
## 7  Imperial 18.833
## 8    Metric 35.833
## 9    Metric 19.000
## 10   Metric 22.333
summary(survey)
##      Sex          Wr.Hnd          NW.Hnd        W.Hnd          Fold    
##  Female:118   Min.   :13.00   Min.   :12.50   Left : 18   L on R : 99  
##  Male  :118   1st Qu.:17.50   1st Qu.:17.50   Right:218   Neither: 18  
##  NA's  :  1   Median :18.50   Median :18.50   NA's :  1   R on L :120  
##               Mean   :18.67   Mean   :18.58                            
##               3rd Qu.:19.80   3rd Qu.:19.73                            
##               Max.   :23.20   Max.   :23.50                            
##               NA's   :1       NA's   :1                                
##      Pulse             Clap       Exer       Smoke         Height     
##  Min.   : 35.00   Left   : 39   Freq:115   Heavy: 11   Min.   :150.0  
##  1st Qu.: 66.00   Neither: 50   None: 24   Never:189   1st Qu.:165.0  
##  Median : 72.50   Right  :147   Some: 98   Occas: 19   Median :171.0  
##  Mean   : 74.15   NA's   :  1              Regul: 17   Mean   :172.4  
##  3rd Qu.: 80.00                            NA's :  1   3rd Qu.:180.0  
##  Max.   :104.00                                        Max.   :200.0  
##  NA's   :45                                            NA's   :28     
##        M.I           Age       
##  Imperial: 68   Min.   :16.75  
##  Metric  :141   1st Qu.:17.67  
##  NA's    : 28   Median :18.58  
##                 Mean   :20.37  
##                 3rd Qu.:20.17  
##                 Max.   :73.00  
## 

 

Description of data-set is as follows:

Sex           <- The sex of the student (Factor with levels “Male” and “Female”.)
Wr.Hnd   <- span (distance from tip of thumb to tip of little finger of spread hand) of writing hand, in centimetres.
NW.Hnd  <- span of non-writing hand
W.Hnd     <- writing hand of student (Factor, with levels “Left” and “Right”.)
Fold          <- Fold your arms! Which is on top (Factor, with levels “R on L”, “L on R”, “Neither”.)
Pulse        <- pulse rate of student (beats per minute)
Clap         <- Clap your hands! Which hand is on top (Factor, with levels “Right”, “Left”, “Neither”.)
Exer        <- how often the student exercises (Factor, with levels “Freq” (frequently), “Some”, “None”.)
Smoke     <- how much the student smokes (Factor, levels “Heavy”, “Regul” (regularly), “Occas” (occasionally), “Never”.)
Height     <- height of the student in centimetres
M.I.           <- whether the student expressed height in imperial (feet/inches) or metric (centimetres/metres) units. (Factor, levels “Metric”, “Imperial”.)
Age          <- age of the student in years

With the "View()" method above, the data set opens in a separate tab in RStudio for a better view, and we can also see the data type of each variable by hovering over it.

With "summary()" we get a summary of the data: the minimum value, maximum value, mean, frequency counts, etc.

The main things that can be interpreted from above summary are:

  1. Categorical Variables: Sex, W.Hnd, Fold, Clap, Exer, Smoke, M.I
  2. Numerical Variables: Wr.Hnd, NW.Hnd, Pulse, Height, Age
  3. Presence of “NA” –> which means that our data-set contains missing or undefined values. While visualising the data using View() we can verify that there are a lot of missing values.
  4. Other statistics –> frequency counts for categorical variables, and for numerical variables the Mean, Median, Min, Max, 1st and 3rd Quartile.

?? Now the first question that may have arisen in your mind: if data is missing, how can we do a proper analysis? You may be wondering how we can replace all the "NA" values with some data. For that we generally use the Mean, Median and Mode.

Filling Up Missing Values:

Let us first deal with the missing values (we will handle the outliers afterwards). To remove the gaps in the information we can replace "NA" with either the "Mean", "Median" or "Mode", depending upon the scenario.

Scenarios for Replacing Missing Values:

  1. If the variable is "Numerical":
    • If the data is "Normally Distributed" –> replace NA with the "Mean".
    • If the data is "Skewed" –> replace NA with the "Median".
  2. If the variable is "Categorical" –> replace NA with the "Mode".

** For "Non-continuous Numerical" data we will use a "Histogram" to check the distribution of the data (Normal or Skewed).
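The same fill rules can be written compactly in pandas, for readers who prefer Python; the data frame and column names here are placeholders mirroring the R data set, not code from the original tutorial:

import pandas as pd
df = pd.read_csv('survey.csv')                              # hypothetical export of the survey data
df['Sex']    = df['Sex'].fillna(df['Sex'].mode()[0])        # categorical -> mode
df['Pulse']  = df['Pulse'].fillna(df['Pulse'].mean())       # roughly normal -> mean
df['Height'] = df['Height'].fillna(df['Height'].median())   # skewed -> median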

Working on Survey Data-Set:

  1. For variable "Sex": it is a categorical variable, so we will use the Mode.
survey$Sex[is.na(survey$Sex)] <- "Female"
summary(survey$Sex)
## Female   Male 
##    119    118
  2. For variable "Wr.Hnd": it is a non-continuous numerical variable, so first we will look at the distribution of the data using a histogram.
hist(survey$Wr.Hnd, main="Wr.Hnd")

 

hist_wr_hnd

In the above histogram we can see that the data is right-skewed, so we will replace "NA" with the "Median".

survey$Wr.Hnd[is.na(survey$Wr.Hnd)] <- median(survey$Wr.Hnd, na.rm=TRUE)
summary(survey$Wr.Hnd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   17.50   18.50   18.67   19.80   23.20
  3. For variable "NW.Hnd": it is a non-continuous numerical variable, so first we will look at the distribution of the data using a histogram.
hist(survey$NW.Hnd, main="NW.Hnd")

 

hist_nw_hnd

In the above histogram we can see that the data is approximately right-skewed, so we will replace "NA" with the "Median".

survey$NW.Hnd[is.na(survey$NW.Hnd)] <- median(survey$NW.Hnd, na.rm=TRUE)
summary(survey$NW.Hnd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.50   17.50   18.50   18.58   19.70   23.50
  4. For variable "W.Hnd": it is a categorical variable, so we will use the Mode.
survey$W.Hnd[is.na(survey$W.Hnd)] <- "Right"
summary(survey$W.Hnd)
##  Left Right 
##    18   219
  5. For variable "Pulse": it is a non-continuous numerical variable, so first we will look at the distribution of the data using a histogram.
hist(survey$Pulse, main="Pulse")

 

hist_pulse

In the above histogram we can see that the data is approximately normally distributed, so we will replace "NA" with the "Mean".

survey$Pulse[is.na(survey$Pulse)] <- mean(survey$Pulse, na.rm=TRUE)
summary(survey$Pulse)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   68.00   74.15   74.15   80.00  104.00
  6. For variable "Clap": it is a categorical variable, so we will use the Mode.
survey$Clap[is.na(survey$Clap)] <- "Right"
summary(survey$Clap)
##    Left Neither   Right 
##      39      50     148
  7. For variable "Smoke": it is a categorical variable, so we will use the Mode.
survey$Smoke[is.na(survey$Smoke)] <- "Never"
summary(survey$Smoke)
## Heavy Never Occas Regul 
##    11   190    19    17
  8. For variable "Height": it is a non-continuous numerical variable, so first we will look at the distribution of the data using a histogram.
hist(survey$Height, main="Height")

 

hist_height.png

In the above histogram we can see that the data is approximately right-skewed, so we will replace "NA" with the "Median".

survey$Height[is.na(survey$Height)] <- median(survey$Height, na.rm=TRUE)
summary(survey$Height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   150.0   167.0   171.0   172.2   179.0   200.0
  9. For variable "M.I": it is a categorical variable, so we will use the Mode.
survey$M.I[is.na(survey$M.I)] <- "Metric"
summary(survey$M.I)
## Imperial   Metric 
##       68      169
summary(survey)
##      Sex          Wr.Hnd          NW.Hnd        W.Hnd          Fold    
##  Female:119   Min.   :13.00   Min.   :12.50   Left : 18   L on R : 99  
##  Male  :118   1st Qu.:17.50   1st Qu.:17.50   Right:219   Neither: 18  
##               Median :18.50   Median :18.50               R on L :120  
##               Mean   :18.67   Mean   :18.58                            
##               3rd Qu.:19.80   3rd Qu.:19.70                            
##               Max.   :23.20   Max.   :23.50                            
##      Pulse             Clap       Exer       Smoke         Height     
##  Min.   : 35.00   Left   : 39   Freq:115   Heavy: 11   Min.   :150.0  
##  1st Qu.: 68.00   Neither: 50   None: 24   Never:190   1st Qu.:167.0  
##  Median : 74.15   Right  :148   Some: 98   Occas: 19   Median :171.0  
##  Mean   : 74.15                            Regul: 17   Mean   :172.2  
##  3rd Qu.: 80.00                                        3rd Qu.:179.0  
##  Max.   :104.00                                        Max.   :200.0  
##        M.I           Age       
##  Imperial: 68   Min.   :16.75  
##  Metric  :169   1st Qu.:17.67  
##                 Median :18.58  
##                 Mean   :20.37  
##                 3rd Qu.:20.17  
##                 Max.   :73.00

By following the above process we replaced all the missing values with meaningful data; as you can see in the "summary()" of the data, there are no more "NA"s. Our data is now much cleaner than the one we started with and looks good for further analysis. The next thing we need to handle is "Outliers".

Outliers

Outliers are data values which deviate a lot from most of the other values, being either too big or too small in comparison to the average. Let us understand this using an example. We will use "boxplot()" for finding the outliers.

#We will create it only for numeric variables not for categorical data

boxplot(survey$Wr.Hnd, main="Wr.Hnd")

 

1_box_wr_hnd

boxplot(survey$NW.Hnd, main="NW.Hnd")

 

1_box_nw_hnd

boxplot(survey$Pulse, main="Pulse")

 

1_box_pulse.png

boxplot(survey$Height, main="Height")

 

1_box_height

boxplot(survey$Age, main="Age")

 

1_box_age

From the above boxplots of the "survey" data set we can see that the variables Wr.Hnd, NW.Hnd, Pulse and Age have some dots or circles in the graph, either above or below the thin lines extending from the box (the whiskers, drawn from the 1st and 3rd quartiles of the boxplot). Those circles represent the presence and number of outliers. Alternatively, we can detect outliers in one more way: all values outside the range of the 5th and 95th percentiles can be considered outliers.

Removing Outliers

We will use the second method for removing the outliers, because with the boxplot method we may end up losing some crucial information: the range produced by the 1st and 3rd quartiles is generally narrower than the range produced by the 5th and 95th percentiles. For example:

  1. Range generated by Quartiles
#For variable PULSE
## Range generated by Quartiles
## (-1.5 * IQR) to (1.5 * IQR), where IQR (Interquartile Range) <- Q3 - Q1
#Q1 and Q3 are present in summary of Pulse, that we obtained at the beginning

Q3 <- 80
Q1 <- 66
IQR <- Q3 -Q1 
lower <- -1.5 * IQR
upper <- 1.5 * IQR

lower
## [1] -21
upper
## [1] 21
  2. Range generated by 5th and 95th Percentiles
## Range generated by 5th and 95th Percentiles
## For this we use the quantile(x, probs) method

lower <- quantile(survey$Pulse, 0.05, na.rm=TRUE)
upper <- quantile(survey$Pulse, 0.95, na.rm=TRUE)

lower
## 5% 
## 60
upper
## 95% 
##  92

So we can see a significant difference in the range of values. Thus we will be using the percentile approach for capping the outliers. ** Another thing to note is that with percentiles we are choosing a level of significance, so we can change it as per our requirement; in general we use the 5th and 95th percentiles.

After detecting the outliers for a variable we will replace them with the Mean of the data. We will follow the same procedure for all the variables containing outliers.

  1. For variable Pulse
quantile(survey$Pulse, 0.05, na.rm=TRUE)
## 5% 
## 60
quantile(survey$Pulse, 0.95, na.rm=TRUE)
## 95% 
##  92
survey$Pulse <- ifelse(survey$Pulse > 92, 74.22, survey$Pulse)
survey$Pulse <- ifelse(survey$Pulse < 60, 74.22, survey$Pulse)

summary(survey$Pulse)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60.00   70.00   74.15   74.25   78.00   92.00
boxplot(survey$Pulse, main="Pulse_Without_Outliers")

2_box_pulse

** So what happened here? We can still see dots in the boxplot! The answer is that although there are still dots in the boxplot, we cannot remove any more data, because doing so would amount to forging the data, which would hamper a correct analysis. This happens a lot when you are working on a large data set, so keep in mind that not all outliers can be removed.

** If we really don't want them we can simply delete that data, but it will result in data loss, so it is not good practice to delete them.

Now we will repeat the same process for the other three variables.

  2. For variable Wr.Hnd
quantile(survey$Wr.Hnd, 0.05, na.rm=TRUE)
## 5% 
## 16
survey$Wr.Hnd <- ifelse(survey$Wr.Hnd < 16, 18.84, survey$Wr.Hnd)

summary(survey$Wr.Hnd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   17.50   18.60   18.84   19.80   23.20
boxplot(survey$Wr.Hnd, main="Wr.Hnd_Without_Outliers")

 

2_box_wr_hnd

  3. For variable NW.Hnd
quantile(survey$NW.Hnd, 0.05, na.rm=TRUE)
##   5% 
## 15.5
quantile(survey$NW.Hnd, 0.95, na.rm=TRUE)
##   95% 
## 22.22
survey$NW.Hnd <- ifelse(survey$NW.Hnd > 22.22, 18.53, survey$NW.Hnd)
survey$NW.Hnd <- ifelse(survey$NW.Hnd < 15.5, 18.53, survey$NW.Hnd)

summary(survey$NW.Hnd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.50   17.50   18.50   18.52   19.50   22.20
boxplot(survey$NW.Hnd, main="NW.Hnd_Without_Outliers")

 

2_box_nw_hnd.png

  4. For variable Age
quantile(survey$Age, 0.95, na.rm=TRUE)
##     95% 
## 30.6836
survey$Age <- ifelse(survey$Age > 30.6836, 19.22, survey$Age)

summary(survey$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.75   17.67   18.58   19.17   19.83   30.67
boxplot(survey$Age, main="Age_Without_Outliers")

 

2_box_age

This is the same scenario that we faced while removing outliers from the variable Pulse, but here the number of remaining outliers is higher. In this scenario you can also omit the rows containing outliers for better results.

That's all for this lecture; in the next lecture we will look into things in much more depth.

Keep Learning!!!

 

PLEASE WRITE YOUR COMMENTS.

THANKS,

RAKESH KUMAR

DISCRETE UNIFORM PROBABILITY DISTRIBUTION WITH MS-EXCEL

Hi ML enthusiasts! In our last post, we talked about the basics of random variables, probability distributions and their types, and how to generate a discrete probability distribution plot. In this article, we will talk about the Discrete Uniform Probability Distribution and its implementation with MS-Excel.

Discrete Uniform Probability Function

Consider the experiment of rolling a die. We have six possible outcomes in this case: 1, 2, 3, 4, 5, 6. Each outcome has an equal chance of occurrence. Thus, the probability of getting any particular outcome is the same, i.e., 1/6.

Consider the experiment of tossing a coin. We can have only two outcomes, Head and Tail. Also, for an unbiased coin, the probability of occurrence of each outcome is same, i.e., 1/2.

For an experiment having equally-likely outcomes and the number of outcomes being n, the probability of occurrence of each outcome becomes 1/n.

Thus, for uniform probability function, f(x) = 1/n.

Implementation using MS-Excel

Let’s see how to implement uniform probability distribution in MS-Excel now. Here, consider the case of rolling two dice. In this case, we get the following as outcomes:

(1,1), (1,2), …, (1,6), (2,1), (2,2), …, (2,6), (3,1), (3,2), …, (3,6), (4,1), (4,2), …, (4,6), (5,1), (5,2), …, (5,6), (6,1), (6,2), …, (6,6).

In this way, we get 6*6 = 36 outcomes. Since the dice are unbiased, the outcomes will be equally likely. Thus, the probability of each outcome will be 1/36.

  • To perform the analysis of probability curve on Excel, first make two columns and name them as outcomes and probabilities respectively.
  • Now, select the rows of outcome column using ctrl+shift+down key and set the type as text, see screenshot below:

Screenshot (11)

  • Now, give the names of the outcomes as seen in the above screenshot. They signify the observations of your categorical variable, outcomes.
  • Count the number of rows/outcomes by using the ROWS() function of Excel. It returns the number of rows in a particular column or range. Select a separate cell in the Excel worksheet, type =ROWS(<your data range>) while selecting the outcome data column, and then press Enter.

Screenshot (14)

  • This will give you total number of outcomes, 36 in our case. Now, in the probability column, select one cell (C2). Type =1/$G$2 in it and then press enter. It gives the decimal value of 1/36. Drag the formula up to the last observation.

Screenshot (15)

  • Now, apply the sum() function to find if the total probability is 1.

Screenshot (16)

  • Select both the columns and their observation by ctrl+shift+down and then go to insert>scatter>scatter with only markers.

Screenshot (17)

  • Set axes labels, title of chart and legend location by going into layout tab. You will get a plot like the following screenshot shows:

Screenshot (19)

You can make the same type of distribution curve with any experiment that produces equally-likely outcomes and has no bias in it, for example tossing two coins, or estimating the likelihood of drawing a particular card from a deck of cards. All of these are examples which generate uniform probability distributions. Since the data points are discrete in nature, the probability distribution curve will also be discrete.
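For readers who prefer code over a spreadsheet, here is a rough Python equivalent of the Excel walkthrough above (matplotlib is assumed to be installed; this is a sketch, not part of the original post):

import matplotlib.pyplot as plt
outcomes = [f'({i},{j})' for i in range(1, 7) for j in range(1, 7)]   # all 36 outcomes of two dice
probs = [1 / len(outcomes)] * len(outcomes)                           # f(x) = 1/n = 1/36 for each
print(sum(probs))                                                     # sanity check: total probability is 1.0
plt.scatter(range(len(outcomes)), probs)                              # flat, discrete probability "curve"
plt.xticks(range(len(outcomes)), outcomes, rotation=90, fontsize=6)
plt.xlabel('Outcome'); plt.ylabel('Probability')
plt.title('Discrete uniform distribution for two dice')
plt.show()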

So, with this we conclude our tutorial. In the next tutorial, we will talk about the concept of expected value, standard deviation, variance and binomial probability distribution.

 

Please Write Your Comments.

Thanks,

Rakesh Kumar