Normalization and Standardization

Normalization makes data easier to interpret by expressing each absolute value relative to the other values in the same column. Chris Vallier produced this demonstration of normalization using pyjanitor.

pyjanitor functions demonstrated here: rename_column, min_max_scale, and transform_columns.

[2]:
import janitor
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style="whitegrid")

Load data

We’ll use a dataset with fuel efficiency in miles per gallon (“mpg”), engine displacement in cubic centimeters (“disp”), and horsepower (“hp”) for a variety of car models. It’s a crazy, but customary, mix of units.

[4]:
csv_file = (
    'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
)
cars_df = pd.read_csv(csv_file)

Quantities without units are dangerous, so let’s use pyjanitor’s rename_column to put the unit into the displacement column’s name.

[5]:
cars_df = cars_df.rename_column('disp', 'disp_cc')
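
For readers without pyjanitor installed, the same rename can be sketched with pandas alone. This assumes rename_column behaves like a single-column DataFrame.rename; the two-row frame below is a hypothetical stand-in for cars_df, with values copied from the mtcars data.

```python
import pandas as pd

# Hypothetical two-row stand-in for cars_df.
df = pd.DataFrame(
    {"model": ["Mazda RX4", "Datsun 710"], "disp": [160.0, 108.0]}
)

# Plain-pandas counterpart of df.rename_column('disp', 'disp_cc').
# rename returns a new frame, so the result is bound to a new name.
renamed = df.rename(columns={"disp": "disp_cc"})

print(list(renamed.columns))
```

As with the pyjanitor call, the result has to be assigned to a name; the original frame keeps its old column label.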

Examine raw data

[6]:
cars_df.head()
[6]:
   model               mpg  cyl  disp_cc   hp  drat     wt   qsec  vs  am  gear  carb
0  Mazda RX4          21.0    6    160.0  110  3.90  2.620  16.46   0   1     4     4
1  Mazda RX4 Wag      21.0    6    160.0  110  3.90  2.875  17.02   0   1     4     4
2  Datsun 710         22.8    4    108.0   93  3.85  2.320  18.61   1   1     4     1
3  Hornet 4 Drive     21.4    6    258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8    360.0  175  3.15  3.440  17.02   0   0     3     2

Visualize

Each value makes more sense viewed in comparison to the other models. We’ll use simple Seaborn bar plots.

mpg by model

[7]:
cars_df = cars_df.sort_values('mpg', ascending=False)
sns.barplot(y='model', x='mpg', data=cars_df, color='b', orient="h")
[7]:
[Figure: horizontal bar plot of mpg by model, sorted from highest to lowest]

displacement by model

[8]:
cars_df = cars_df.sort_values('disp_cc', ascending=False)
sns.barplot(y='model', x='disp_cc', data=cars_df, color='b', orient="h")
[8]:
[Figure: horizontal bar plot of disp_cc by model, sorted from highest to lowest]

horsepower by model

[9]:
cars_df = cars_df.sort_values('hp', ascending=False)
sns.barplot(y='model', x='hp', data=cars_df, color='b', orient="h")
[9]:
[Figure: horizontal bar plot of hp by model, sorted from highest to lowest]

Min-max normalization

First we’ll use pyjanitor’s min_max_scale to rescale the mpg, disp_cc, and hp columns so that each column runs from 0 (its minimum) to 1 (its maximum).

[10]:
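# Assign the result back so the later cells see the rescaled values
# whether or not min_max_scale modifies cars_df in place.
cars_df = (
    cars_df.min_max_scale(col_name='mpg', new_max=1, new_min=0)
    .min_max_scale(col_name='disp_cc', new_max=1, new_min=0)
    .min_max_scale(col_name='hp', new_max=1, new_min=0)
)
cars_df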
[10]:
    model                     mpg  cyl   disp_cc        hp  drat     wt   qsec  vs  am  gear  carb
30  Maserati Bora        0.195745    8  0.573460  1.000000  3.54  3.570  14.60   0   1     5     8
28  Ford Pantera L       0.229787    8  0.698179  0.749117  4.22  3.170  14.50   0   1     5     4
 6  Duster 360           0.165957    8  0.720629  0.681979  3.21  3.570  15.84   0   0     3     4
23  Camaro Z28           0.123404    8  0.695685  0.681979  3.73  3.840  15.41   0   0     3     4
16  Chrysler Imperial    0.182979    8  0.920180  0.628975  3.23  5.345  17.42   0   0     3     4
15  Lincoln Continental  0.000000    8  0.970067  0.575972  3.00  5.424  17.82   0   0     3     4
14  Cadillac Fleetwood   0.000000    8  1.000000  0.540636  2.93  5.250  17.98   0   0     3     4
13  Merc 450SLC          0.204255    8  0.510601  0.452297  3.07  3.780  18.00   0   0     3     3
11  Merc 450SE           0.255319    8  0.510601  0.452297  3.07  4.070  17.40   0   0     3     3
12  Merc 450SL           0.293617    8  0.510601  0.452297  3.07  3.730  17.60   0   0     3     3
24  Pontiac Firebird     0.374468    8  0.820404  0.434629  3.08  3.845  17.05   0   0     3     2
 4  Hornet Sportabout    0.353191    8  0.720629  0.434629  3.15  3.440  17.02   0   0     3     2
29  Ferrari Dino         0.395745    6  0.184335  0.434629  3.62  2.770  15.50   0   1     5     6
21  Dodge Challenger     0.217021    8  0.615864  0.346290  2.76  3.520  16.87   0   0     3     2
22  AMC Javelin          0.204255    8  0.580943  0.346290  3.15  3.435  17.30   0   0     3     2
10  Merc 280C            0.314894    6  0.240708  0.250883  3.92  3.440  18.90   1   0     4     4
 9  Merc 280             0.374468    6  0.240708  0.250883  3.92  3.440  18.30   1   0     4     4
27  Lotus Europa         0.851064    4  0.059865  0.215548  3.77  1.513  16.90   1   1     5     2
 0  Mazda RX4            0.451064    6  0.221751  0.204947  3.90  2.620  16.46   0   1     4     4
 1  Mazda RX4 Wag        0.451064    6  0.221751  0.204947  3.90  2.875  17.02   0   1     4     4
 3  Hornet 4 Drive       0.468085    6  0.466201  0.204947  3.08  3.215  19.44   1   0     3     1
31  Volvo 142E           0.468085    4  0.124470  0.201413  4.11  2.780  18.60   1   1     4     2
 5  Valiant              0.327660    6  0.383886  0.187279  2.76  3.460  20.22   1   0     3     1
20  Toyota Corona        0.472340    4  0.122225  0.159011  3.70  2.465  20.01   1   0     3     1
 8  Merc 230             0.527660    4  0.173859  0.151943  3.92  3.150  22.90   1   0     4     2
 2  Datsun 710           0.527660    4  0.092043  0.144876  3.85  2.320  18.61   1   1     4     1
26  Porsche 914-2        0.663830    4  0.122724  0.137809  4.43  2.140  16.70   0   1     5     2
25  Fiat X1-9            0.719149    4  0.019706  0.049470  4.08  1.935  18.90   1   1     4     1
17  Fiat 128             0.936170    4  0.018957  0.049470  4.08  2.200  19.47   1   1     4     1
19  Toyota Corolla       1.000000    4  0.000000  0.045936  4.22  1.835  19.90   1   1     4     1
 7  Merc 240D            0.595745    4  0.188576  0.035336  3.69  3.190  20.00   1   0     4     2
18  Honda Civic          0.851064    4  0.011474  0.000000  4.93  1.615  18.52   1   1     4     2
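
The arithmetic behind the rescaling can be sketched in plain pandas. The helper name min_max below is hypothetical and is not pyjanitor's implementation; the four mpg values are assumed to be the raw mtcars figures for these models, with 10.4 and 33.9 as the column's minimum and maximum.

```python
import pandas as pd


def min_max(series, new_min=0.0, new_max=1.0):
    """Linearly rescale a Series so its minimum maps to new_min
    and its maximum maps to new_max."""
    old_min, old_max = series.min(), series.max()
    return (series - old_min) * (new_max - new_min) / (old_max - old_min) + new_min


# Raw mpg values (assumed), including the column's two extremes.
mpg = pd.Series(
    [10.4, 21.0, 22.8, 33.9],
    index=["Cadillac Fleetwood", "Mazda RX4", "Datsun 710", "Toyota Corolla"],
)

scaled = min_max(mpg)
print(scaled.round(6))
```

Under these assumptions the Mazda RX4 comes out at (21.0 - 10.4) / (33.9 - 10.4), about 0.451064, which matches the mpg column in the table above.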

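The ordering of the models in each bar graph stays the same, because min-max scaling is a linear transformation, but the horizontal axes show the new scale: the lowest model now sits at 0 and the highest at 1.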
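
As a rough check on what min-max scaling preserves, the sketch below compares three horsepower values before and after rescaling, using plain pandas rather than pyjanitor. The 335 and 52 hp figures are assumptions for the column's extremes; together with the Mazda RX4's 110 hp they reproduce the 0.204947 shown in the table.

```python
import pandas as pd

# Raw horsepower for three models (extremes assumed, see lead-in).
hp = pd.Series({"Maserati Bora": 335, "Mazda RX4": 110, "Honda Civic": 52})
scaled = (hp - hp.min()) / (hp.max() - hp.min())

# The ranking of the models is identical before and after scaling.
same_order = list(hp.sort_values().index) == list(scaled.sort_values().index)

# Bar length relative to the longest bar, before and after scaling.
raw_ratio = hp["Mazda RX4"] / hp.max()             # about 0.33
scaled_ratio = scaled["Mazda RX4"] / scaled.max()  # about 0.20

print(same_order, round(raw_ratio, 2), round(scaled_ratio, 2))
```

The order of the bars is preserved, but each bar's length relative to the longest one shrinks, because the column minimum is subtracted before dividing by the range.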

mpg (min-max normalized)

[11]:
cars_df = cars_df.sort_values('mpg', ascending=False)
sns.barplot(y='model', x='mpg', data=cars_df, color='b', orient="h")
[11]:
[Figure: horizontal bar plot of min-max normalized mpg by model, sorted from highest to lowest]

displacement (min-max normalized)

[12]:
cars_df = cars_df.sort_values('disp_cc', ascending=False)
sns.barplot(y='model', x='disp_cc', data=cars_df, color='b', orient="h")
[12]:
[Figure: horizontal bar plot of min-max normalized disp_cc by model, sorted from highest to lowest]

horsepower (min-max normalized)

[13]:
cars_df = cars_df.sort_values('hp', ascending=False)
sns.barplot(y='model', x='hp', data=cars_df, color='b', orient="h")
[13]:
[Figure: horizontal bar plot of min-max normalized hp by model, sorted from highest to lowest]

Standardization (z-score)

Next we’ll convert to standard scores (z-scores). A standard score expresses each value as the number of standard deviations it lies above or below the column mean, which shows where each model stands in relation to the others.

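We’ll use pyjanitor’s transform_columns to apply the standard score calculation, (x - x.mean()) / x.std(), to each of the columns we’re evaluating. Passing elementwise=False hands each whole column to the function, which it needs in order to compute that column’s mean and standard deviation.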
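
The calculation itself can be sketched in plain pandas before handing it to pyjanitor. The helper name z_score is hypothetical; the five horsepower values are the first five rows of the raw data, so the results differ from a calculation over all 32 models.

```python
import pandas as pd


def z_score(x):
    """Standard score of each value in a Series.

    pandas' Series.std() defaults to the sample standard deviation
    (ddof=1), so this divides by the n - 1 estimate.
    """
    return (x - x.mean()) / x.std()


hp = pd.Series(
    [110, 110, 93, 110, 175],
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
           "Hornet 4 Drive", "Hornet Sportabout"],
)

z = z_score(hp)
print(z.round(3))
print(round(z.mean(), 6), round(z.std(), 6))
```

After standardization the column has a mean of 0 and a standard deviation of 1, up to floating-point rounding, which is what makes columns with different units comparable.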

[14]:
# Assign the result back so the plots below use the standardized values.
cars_df = cars_df.transform_columns(
    ['mpg', 'disp_cc', 'hp'],
    lambda x: (x - x.mean()) / x.std(),
    elementwise=False
)
cars_df
[14]:
    model                      mpg  cyl    disp_cc         hp  drat     wt   qsec  vs  am  gear  carb
30  Maserati Bora        -0.844644    8   0.567039   2.746567  3.54  3.570  14.60   0   1     5     8
28  Ford Pantera L       -0.711907    8   0.970465   1.711021  4.22  3.170  14.50   0   1     5     4
 6  Duster 360           -0.960789    8   1.043081   1.433903  3.21  3.570  15.84   0   0     3     4
23  Camaro Z28           -1.126710    8   0.962396   1.433903  3.73  3.840  15.41   0   0     3     4
16  Chrysler Imperial    -0.894420    8   1.688562   1.215126  3.23  5.345  17.42   0   0     3     4
15  Lincoln Continental  -1.607883    8   1.849932   0.996348  3.00  5.424  17.82   0   0     3     4
14  Cadillac Fleetwood   -1.607883    8   1.946754   0.850497  2.93  5.250  17.98   0   0     3     4
13  Merc 450SLC          -0.811460    8   0.363713   0.485868  3.07  3.780  18.00   0   0     3     3
11  Merc 450SE           -0.612354    8   0.363713   0.485868  3.07  4.070  17.40   0   0     3     3
12  Merc 450SL           -0.463025    8   0.363713   0.485868  3.07  3.730  17.60   0   0     3     3
24  Pontiac Firebird     -0.147774    8   1.365821   0.412942  3.08  3.845  17.05   0   0     3     2
 4  Hornet Sportabout    -0.230735    8   1.043081   0.412942  3.15  3.440  17.02   0   0     3     2
29  Ferrari Dino         -0.064813    6  -0.691647   0.412942  3.62  2.770  15.50   0   1     5     6
21  Dodge Challenger     -0.761683    8   0.704204   0.048313  2.76  3.520  16.87   0   0     3     2
22  AMC Javelin          -0.811460    8   0.591245   0.048313  3.15  3.435  17.30   0   0     3     2
10  Merc 280C            -0.380064    6  -0.509299  -0.345486  3.92  3.440  18.90   1   0     4     4
 9  Merc 280             -0.147774    6  -0.509299  -0.345486  3.92  3.440  18.30   1   0     4     4
27  Lotus Europa          1.710547    4  -1.094266  -0.491337  3.77  1.513  16.90   1   1     5     2
 0  Mazda RX4             0.150885    6  -0.570620  -0.535093  3.90  2.620  16.46   0   1     4     4
 1  Mazda RX4 Wag         0.150885    6  -0.570620  -0.535093  3.90  2.875  17.02   0   1     4     4
 3  Hornet 4 Drive        0.217253    6   0.220094  -0.535093  3.08  3.215  19.44   1   0     3     1
31  Volvo 142E            0.217253    4  -0.885292  -0.549678  4.11  2.780  18.60   1   1     4     2
 5  Valiant              -0.330287    6  -0.046167  -0.608019  2.76  3.460  20.22   1   0     3     1
20  Toyota Corona         0.233846    4  -0.892553  -0.724700  3.70  2.465  20.01   1   0     3     1
 8  Merc 230              0.449543    4  -0.725535  -0.753870  3.92  3.150  22.90   1   0     4     2
 2  Datsun 710            0.449543    4  -0.990182  -0.783040  3.85  2.320  18.61   1   1     4     1
26  Porsche 914-2         0.980492    4  -0.890939  -0.812211  4.43  2.140  16.70   0   1     5     2
25  Fiat X1-9             1.196190    4  -1.224169  -1.176840  4.08  1.935  18.90   1   1     4     1
17  Fiat 128              2.042389    4  -1.226589  -1.176840  4.08  2.200  19.47   1   1     4     1
19  Toyota Corolla        2.291272    4  -1.287910  -1.191425  4.22  1.835  19.90   1   1     4     1
 7  Merc 240D             0.715018    4  -0.677931  -1.235180  3.69  3.190  20.00   1   0     4     2
18  Honda Civic           1.710547    4  -1.250795  -1.381032  4.93  1.615  18.52   1   1     4     2
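
Because cars_df may already hold min-max scaled values when it is standardized, it is worth checking that this does not change the z-scores. The sketch below, in plain pandas with hypothetical helper names, standardizes the first five raw mpg values both directly and after min-max scaling.

```python
import pandas as pd


def min_max(x):
    """Rescale a Series linearly onto the range 0 to 1."""
    return (x - x.min()) / (x.max() - x.min())


def z_score(x):
    """Standard score using the sample standard deviation (ddof=1)."""
    return (x - x.mean()) / x.std()


# mpg for the first five rows of the raw data.
mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7])

z_raw = z_score(mpg)
z_scaled = z_score(min_max(mpg))

# Largest absolute difference between the two sets of z-scores.
max_diff = (z_raw - z_scaled).abs().max()
print(max_diff < 1e-9)
```

Min-max scaling shifts and stretches a column by constants, and the z-score removes both the shift (by subtracting the mean) and the stretch (by dividing by the standard deviation), so the two results agree up to floating-point rounding.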

Standardized mpg

[15]:
cars_df = cars_df.sort_values('mpg', ascending=False)
sns.barplot(y='model', x='mpg', data=cars_df, color='b', orient="h")
[15]:
[Figure: horizontal bar plot of standardized mpg (z-score) by model, sorted from highest to lowest]

Standardized displacement

[16]:
cars_df = cars_df.sort_values('disp_cc', ascending=False)
sns.barplot(y='model', x='disp_cc', data=cars_df, color='b', orient="h")
[16]:
[Figure: horizontal bar plot of standardized disp_cc (z-score) by model, sorted from highest to lowest]

Standardized horsepower

[17]:
cars_df = cars_df.sort_values('hp', ascending=False)
sns.barplot(y='model', x='hp', data=cars_df, color='b', orient="h")
[17]:
[Figure: horizontal bar plot of standardized hp (z-score) by model, sorted from highest to lowest]