DA#01 - Understanding the numbers: Pearson’s Coefficient

Data analysis is the art of finding hidden relationships between variables in the quantitative information found everywhere around us: in business, biology, economics and many other areas. As analysts or data scientists, we apply statistical methods to the available data to find strong evidence or indicators of these relationships, so that we can describe the information we have in hand and use it to solve real-life problems.

As an analyst, you have to understand what the numbers (averages, deviations, chi-square, ANOVA, etc.) represent. What do they mean? How do we calculate them? What is the intuition behind them? Knowing that p < 0.05 is conventionally treated as significant is great, but you must also understand why and how we got there. After all, there is science behind this art.

So in this series of posts I will try to describe and explain the meaning and intuition behind some of these indicators and numbers, as a way of sharing accumulated knowledge with anyone interested in data analysis. I will start with one famous correlation coefficient.

 

Pearson’s coefficient

Also written as ‘r’, Pearson’s coefficient is widely used to describe the linear correlation between two numerical variables as a single value that ranges from -1 to 1.
Here is how we describe it:




 

If we have 2 variables A and B:

case 1,

r = 0: we say there is no linear correlation between the variables. Knowing how A changes tells us nothing about a straight-line trend in B, and vice versa (a non-linear relationship may still exist).

case 2,

0 < r ≤ 1: we say that a positive correlation exists between A and B. As A increases, B tends to increase as well, and vice versa. The nearer the value of r is to 1, the stronger the relationship, meaning the points lie closer to a straight line. Note that r tells us how consistent the relationship is, not how much B changes for each unit of A.

case 3,

-1 ≤ r < 0: we say that a negative (inverse) correlation exists between A and B. As A increases, B tends to decrease, and as A decreases, B tends to increase. The nearer the value of r is to -1, the stronger the relationship, again in the sense of how closely the points follow a straight line.
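
To make the three cases concrete, here is a minimal sketch in R. The data are toy values invented for illustration, and the variable names (r_pos, r_neg, r_zero) are my own. It also shows why an r near 0 only rules out a straight-line trend, not every kind of relationship.

```r
# Toy data, invented for illustration only
x <- 1:10

r_pos  <- cor(x,  2 * x + 3)        # points on an increasing straight line
r_neg  <- cor(x, -0.5 * x + 7)      # points on a decreasing straight line
r_zero <- cor(x, (x - mean(x))^2)   # symmetric U-shape: a clear pattern, but no linear trend

print(c(positive = r_pos, negative = r_neg, u_shape = r_zero))
```

The first two values sit at the ends of the scale because the points fall exactly on a line, while the U-shaped data give an r of essentially 0 even though B is completely determined by A.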


Pearson’s coefficient is used in many methods and applications in data modelling: you will find it in predictive models, and you can use it in classification or clustering models as well. For example, a product recommendation system such as “people who bought A also bought B” can be built on correlations between items. We can build a correlation matrix easily in code; here is a simple example in R:

I will be using the ‘mtcars’ dataset, which describes fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). It ships with R, so it can be called directly.





First we discard the last four columns (vs, am, gear and carb), because Pearson’s coefficient is meant for continuous values and is hard to interpret on discrete codes such as gear and carb in this dataset. Note that cyl, which stays in columns 1 to 7, is also a discrete count, so read its coefficients with that in mind.
You only need one line of code to create a Pearson correlation matrix, as follows:





pearCoef <- cor(mtcars[, 1:7], method = "pearson")  # keep columns 1 to 7 and store the matrix
pearCoef                                            # print the matrix

  




This is a comparison matrix between all variables. Check, for example, the value of r between the hp column and the hp row (hp: horsepower): it is exactly 1, because a variable is being compared with itself. Now check mpg (miles per gallon) against cyl (cylinders): the r value is -0.8521620. How would you describe the correlation here?
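
If you prefer not to scan the printed matrix by eye, a small sketch like the following picks out individual cells by row and column name. The names r_mpg_cyl and r_hp_hp are only illustrative, and the matrix is rebuilt here so the snippet runs on its own.

```r
# Rebuild the matrix so this snippet is self-contained
pearCoef <- cor(mtcars[, 1:7], method = "pearson")

r_hp_hp   <- pearCoef["hp", "hp"]     # a variable against itself
r_mpg_cyl <- pearCoef["mpg", "cyl"]   # miles per gallon vs. number of cylinders

print(r_hp_hp)
print(round(r_mpg_cyl, 3))

# Rank the other variables by how strongly they correlate with mpg,
# ignoring the sign (column 1 is mpg itself, so it is dropped)
print(sort(abs(pearCoef["mpg", -1]), decreasing = TRUE))
```

Sorting by absolute value is just one way to read the matrix. It puts strong negative and strong positive relationships side by side, which is usually what you want when screening variables.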


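To calculate r values manually, we use the following equation:

r = (n∑XY - ∑X∑Y) ∕ √[(n∑X² - (∑X)²)(n∑Y² - (∑Y)²)]
X, Y = the paired values of the two variables
n = number of records

If X and Y are first replaced by their deviations from the mean (X = x - x̅ and Y = y - y̅), then ∑X = ∑Y = 0 and the equation simplifies to r = ∑XY ∕ √(∑X² ∑Y²).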
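
As a sanity check, the equation above can be typed out directly and compared with R's built-in cor(). This is a minimal sketch on the mpg and cyl columns of mtcars. The variable names are my own, and it computes the sums both on the raw values and on deviations from the mean to show that the two routes agree.

```r
x <- mtcars$mpg
y <- mtcars$cyl
n <- length(x)

# Sums taken over the raw values
numerator   <- n * sum(x * y) - sum(x) * sum(y)
denominator <- sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))
r_manual    <- numerator / denominator

# Same idea on deviations from the mean: the sum(x) and sum(y) terms vanish
dx <- x - mean(x)
dy <- y - mean(y)
r_centered <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(c(manual = r_manual, centered = r_centered, built_in = cor(x, y)))
```

All three numbers should match up to floating-point rounding, which is a useful way to convince yourself that cor() is doing nothing more mysterious than this arithmetic.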

The equation measures how closely the points follow a straight line: it is the covariance of X and Y divided by the product of their standard deviations (imagine the scatter plot above: the closer the points sit to the fitted line, the larger the absolute value of r). The equation can also be written differently, as follows:

r = b(σx/σy)

σx : Standard deviation of x values from the mean x̅
σy : Standard deviation of y values from the mean y̅
b : slope of the regression line -> y = a + bx
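
The slope form can be checked in the same way. The sketch below assumes the regression line is the ordinary least-squares fit of y on x, that is y = a + bx, and uses wt and mpg from mtcars purely as an example pair. The names fit, b and r_from_slope are illustrative.

```r
x <- mtcars$wt    # car weight
y <- mtcars$mpg   # miles per gallon

fit <- lm(y ~ x)              # least-squares line y = a + b * x
b   <- unname(coef(fit)["x"]) # extract the slope b

r_from_slope <- b * sd(x) / sd(y)

print(c(from_slope = r_from_slope, built_in = cor(x, y)))
```

This makes the link between correlation and regression visible: r is the slope you would get after rescaling both variables to have a standard deviation of 1.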

Finally, since we are working with mean values, you can’t use Pearson’s coefficient with categorical variables, even if you recode them as discrete numbers. For example, with [Sunday = 0, Saturday = 1, Monday = 2] the codes are arbitrary labels, so their mean value has no meaning.
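
A quick way to see the problem is to code the same categories in two different ways and compare the results. The data below are invented for illustration (the sales figures and both codings are hypothetical), but the effect is general: the value of r depends on the arbitrary numbers chosen for the labels.

```r
# Invented toy data: day labels and sales figures are illustrative only
day   <- c("Sunday", "Saturday", "Monday", "Sunday", "Saturday", "Monday")
sales <- c(10, 20, 30, 12, 22, 28)

# Two equally arbitrary ways of turning the same labels into numbers
coding_a <- c(Sunday = 0, Saturday = 1, Monday = 2)
coding_b <- c(Sunday = 1, Saturday = 2, Monday = 0)

r_a <- cor(unname(coding_a[day]), sales)
r_b <- cor(unname(coding_b[day]), sales)

print(c(coding_a = r_a, coding_b = r_b))
```

With the first coding r comes out positive, and with the second it comes out negative, even though the underlying data never changed. That is a strong hint that the coefficient is describing the coding rather than the days.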

Conclusion:

Pearson’s coefficient is a great starting point in feature engineering or in building recommendation or rating models. The thing to consider is how far your r value is from 0, and in which direction (positive or negative), to understand the relationship between any pair of variables.

