Detecting outliers in data using a box plot

This topic explains the basics of a box plot and shows how to detect outliers in a given dataset visually using box plots.

Data ingestion

A Python library is a collection of functions and methods that lets you perform many tasks without writing the code yourself. To use the functions in a module, import the module with an import statement:

import numpy as np # for multi-dimensional arrays and matrices operations
import scipy.stats # statistical functions from the SciPy scientific computing library
import pandas as pd # data manipulation and analysis
import seaborn as sns # Python's Statistical Data Visualization Library
import matplotlib # for plotting
import matplotlib.pyplot as plt

%matplotlib is a magic function in IPython. %matplotlib inline sets the backend of matplotlib to the 'inline' backend. With this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it.

%matplotlib inline
# Read the csv file using pandas
data = pd.read_csv('petroleum.csv')

Download the petroleum.csv

# Display the basic table information
data.info()

result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 5 columns):
Year             216 non-null int64
Geography        216 non-null object
Import           216 non-null float64
Export           216 non-null float64
CO2 Emissions    216 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 8.5+ KB

# Display first 5 rows as a table
data.head(5)

result:

   Year Geography   Import    Export  CO2 Emissions
0  1980    Africa  618.184  5428.078     525.605046
1  1981    Africa  609.270  3964.097     519.408287
2  1982    Africa  557.209  3458.547     558.221545
3  1983    Africa  477.787  3394.148     586.002081
4  1984    Africa  507.619  3629.964     612.150112

# Describe statistics summary of a feature or variable

data[data.Geography == 'Asia'].Import.describe()

result:

count       36.000000
mean     11928.644624
std       4830.261052
min       5710.417000
25%       7001.003250
50%      11717.250500
75%      16120.587750
max      20838.615000
Name: Import, dtype: float64

A box plot (also known as a box-and-whisker diagram) is a standardized way of displaying the distribution of data based on the five-number summary:

  • Minimum
  • First quartile
  • Median
  • Third quartile
  • Maximum
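
As a rough illustration of the five-number summary, the sketch below computes it with Python's standard `statistics` module. The helper name `five_number_summary` and the sample values are made up for this example and do not come from petroleum.csv. The `method="inclusive"` option is chosen because it interpolates linearly between data points, which is also how pandas computes the quartiles shown by `describe()` by default.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    # method="inclusive" interpolates linearly between sorted data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical sample values, for illustration only.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 12.0, 30.0]
print(five_number_summary(sample))
```

The box in a box plot spans the first to the third quartile, and the line inside the box marks the median.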

In a box plot, an outlier is a data point that lies outside the fences of the plot, that is, more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile. The whiskers extend to the most extreme data points that still lie inside these fences.
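
The 1.5 × IQR rule can be checked by hand. The sketch below applies it to the quartiles reported in the `describe()` output earlier in this article (Import values for Asia). The helper name `iqr_fences` is an assumption made for this example.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles, minimum and maximum copied from the describe() output
# for the Asia Import values.
q1, q3 = 7001.003250, 16120.587750
minimum, maximum = 5710.417000, 20838.615000

lower, upper = iqr_fences(q1, q3)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")

# If even the minimum and maximum lie inside the fences, no point is an outlier.
has_outliers = minimum < lower or maximum > upper
print("outliers present:", has_outliers)
```

Here the minimum and maximum both fall inside the fences, so a box plot of the Asia Import values would show no outlier points.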

# Plot box plot to find out the outliers using a single feature or variable

plt.figure(figsize=(10,5))
sns.boxplot(x = 'Geography', y = 'CO2 Emissions', data=data,
                 width=0.5,
                 palette="colorblind")
plt.title('Box Plot Comparison',fontweight="bold",fontsize = 20)
plt.xlabel('Geography', fontweight="bold",fontsize=15)
plt.ylabel('CO2 Emissions', fontweight="bold",fontsize=15)
plt.xticks(fontweight="bold",fontsize = 10)
plt.yticks(fontweight="bold",fontsize = 10)
plt.show()

data.rename(columns={'CO2 Emissions':'CO2_Emissions'}, inplace=True)
Asia_emissions = data[data.Geography == 'Asia'].CO2_Emissions
Europe_emissions = data[data.Geography == 'Europe'].CO2_Emissions
Africa_emissions = data[data.Geography == 'Africa'].CO2_Emissions
South_America_emissions = data[data.Geography == 'South America'].CO2_Emissions
North_America_emissions = data[data.Geography == 'North America'].CO2_Emissions
Middle_East_emissions = data[data.Geography == 'Middle East'].CO2_Emissions

Data Normalization

  • Transforms the data into the range 0 to 1.
  • Makes the data consistent, so that different data sets can be compared on the same scale.

def normalization(data):
    # Min-max scaling; returns a new Series instead of modifying the filtered slice in place
    return (data - data.min()) / (data.max() - data.min())
Asia_emissions = normalization(Asia_emissions)
Europe_emissions = normalization(Europe_emissions)
Africa_emissions = normalization(Africa_emissions)
South_America_emissions = normalization(South_America_emissions)
North_America_emissions = normalization(North_America_emissions)
Middle_East_emissions = normalization(Middle_East_emissions)
data_boxplot = pd.DataFrame({'Asia': Asia_emissions,
                             'Europe': Europe_emissions,
                             'Africa': Africa_emissions,
                             'South America': South_America_emissions,
                             'North America': North_America_emissions,
                             'Middle East': Middle_East_emissions})
plt.figure(figsize=(10,5))
sns.boxplot(data=data_boxplot,
                 width=0.5,
                 palette="colorblind")
plt.title('Box Plot Comparison',fontweight="bold",fontsize = 20)
plt.xlabel('Geography', fontweight="bold",fontsize=15)
plt.ylabel('Normalized CO2 Emissions', fontweight="bold",fontsize=15)
plt.xticks(fontweight="bold",fontsize = 10)
plt.yticks(fontweight="bold",fontsize = 10)
plt.show()
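
Min-max normalization is a linear rescaling, so it changes the scale of each box but not which points fall outside the 1.5 × IQR fences. The sketch below illustrates this on made-up values rather than the petroleum data. The helper names `min_max` and `outlier_positions` are assumptions for this example.

```python
import statistics

def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def outlier_positions(values, k=1.5):
    """Return the positions of values outside the k * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical sample values, for illustration only.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 12.0, 30.0]
scaled = min_max(sample)

# The same position is flagged before and after scaling.
print(outlier_positions(sample), outlier_positions(scaled))
```

Normalizing each region separately therefore makes the boxes comparable on one axis without hiding or creating outliers within a region.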

References:

 https://www.eia.gov/
