Machine Learning: Introduction to Linear Regression in Python

What is Regression and Regression Model?
Regression
Regression is a method of finding the statistical relationship between two or more variables. The relationship is identified when a change in one or more independent variables is accompanied by a change in the dependent variable. For example, sales and profit, or age and height.
Regression Model
A regression model is a statistical model that describes the relationship between a dependent variable and one or more independent variables.
Linear Regression
Linear Regression is a method to model a linear relationship between a dependent variable (the scalar response) and one or more independent variables (explanatory variables).
If the study involves two continuous (quantitative) variables, one dependent and one independent, it is known as Simple Linear Regression. When there are two or more predictor variables, it is called Multiple Linear Regression.
In simple linear regression, the independent variable is denoted by X (also known as the predictor or explanatory variable). The dependent variable is denoted by y (also known as the response, outcome, or explained variable).
For example, let's consider a scenario involving advertisement cost and sales. An increase or decrease in advertisement cost corresponds to an increase or decrease in sales, so advertisement cost is the independent variable X and sales is the dependent variable y. An example of linear regression: as a child's age increases, his/her height increases. Another example: the taller the father, the taller his son tends to be; fathers' and sons' heights tend to regress toward the mean height.
Regression Line
Let's learn this by an example. In a given dataset, we have n observations of the correlated variables X and y: (x1, y1), (x2, y2), ..., (xn, yn).
We can create a scatter plot by putting the independent (predictor) variable on the X axis and the dependent (response) variable on the Y axis. If the points in the scatter diagram are clustered around a line, we can say there is a linear relationship. The line that minimizes the vertical distances between itself and all the data points is called the Regression Line.
With a scatter plot, no computation is required to draw such a line by eye, which makes this an easy method. The disadvantage is that two different people can draw different lines and so give different estimates.
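To see this concretely, here is a minimal sketch using made-up data: NumPy's np.polyfit computes the best-fitting straight line for us, so nothing has to be estimated by eye.
import numpy as np
import matplotlib.pyplot as plt

# made-up correlated data for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.5 * x + 4 + rng.normal(0, 3, size=x.size)

# fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

plt.scatter(x, y, label="data points")
plt.plot(x, slope * x + intercept, color="red", label="regression line")
plt.xlabel("X (independent variable)")
plt.ylabel("y (dependent variable)")
plt.legend()
plt.show()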
Fitting Regression Line
Perfect Positive Correlation
After plotting the data points on a scatter diagram, if we find that all of them lie on a single line going upwards (from left to right), we can say the two variables are fully positively associated, or perfectly correlated. In this scenario, the numerical value of the correlation coefficient is 1.
Perfect Negative Correlation
When all data points lie on a single line going downwards (from left to right), the two variables are fully negatively associated: an increase in one variable corresponds to a decrease in the other, and vice versa. In such a case, the numerical value of the correlation coefficient is -1.
No (Zero) Correlation
If the data points in the scatter diagram are scattered in such a way that we cannot find a line, we can say the data is not related and there is no correlation between X and y. In such a case, the numerical value of the correlation coefficient is 0.
Partial Correlation
In business, economics, or social science, these variables (X and y) are rarely perfectly correlated. After plotting the data points on a scatter diagram and calculating, we find a fractional value for the correlation coefficient. This number can be positive or negative and always lies between -1 and 1, meaning the variables are partially correlated, either positively or negatively. The closer the value is to -1 or 1, the stronger the correlation; as it moves towards 0, the correlation gets weaker (a weak correlation).
A scatter diagram may likewise show a partial (strong) positive correlation or a partial (strong) negative correlation.
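We can verify these cases numerically; here is a minimal sketch using NumPy (np.corrcoef returns a correlation matrix, and the off-diagonal entry is the correlation coefficient between the two inputs):
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# perfect positive correlation: y moves exactly with x
print(np.corrcoef(x, 2 * x + 1)[0, 1])    # 1.0

# perfect negative correlation: y moves exactly opposite to x
print(np.corrcoef(x, -3 * x + 10)[0, 1])  # -1.0

# partial (strong) positive correlation: y follows x with some noise
y_noisy = np.array([1.2, 2.1, 2.6, 4.4, 4.8])
print(np.corrcoef(x, y_noisy)[0, 1])      # roughly 0.98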
Method of Least Squares
This is the most common method for finding a regression line. In the least squares method, we have to estimate the model's constants: on our training data, we minimize the sum of squared differences between the actual and predicted values of the dependent variable.
The linear regression model takes the following form:
y = α + βX
where,
y is the dependent variable (predicted response value)
X is the independent variable
α is the intercept (the value of y when X is 0)
β is the slope
If we are predicting sales based on advertisement cost, our linear regression equation would look like this:
sales = α + β * advertisement cost
A numerical association between two variables can be measured by the correlation coefficient, which is a value between -1 and 1.
In the least squares method, a line is fitted by minimizing the sum of squares of the residuals. A residual is the difference between an observation's y value and the value predicted by the fitted line.
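For simple linear regression, the least squares solution has a closed form: β = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and α = ȳ - β x̄. Here is a minimal sketch with made-up advertisement-cost and sales figures:
import numpy as np

# made-up data: advertisement cost (X) and sales (y)
ad_cost = np.array([10, 20, 30, 40, 50])
sales = np.array([25, 45, 60, 85, 105])

x_mean, y_mean = ad_cost.mean(), sales.mean()

# slope: sum of products of deviations, divided by sum of squared X deviations
beta = np.sum((ad_cost - x_mean) * (sales - y_mean)) / np.sum((ad_cost - x_mean) ** 2)
# intercept: the value of y when X is 0
alpha = y_mean - beta * x_mean
print(f"sales = {alpha:.2f} + {beta:.2f} * advertisement cost")

# residuals: actual y values minus the fitted line's predictions
residuals = sales - (alpha + beta * ad_cost)
print("sum of squared residuals:", np.sum(residuals ** 2))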
Linear Regression in Python
Let's learn how to implement linear regression in Python. Before implementing any machine learning algorithm, we need to clean the data. Data cleaning is an essential part of machine learning. It involves removing unwanted observations, such as duplicates or irrelevant observations that don't fit the specific problem; fixing structural problems; filtering unwanted outliers; and handling missing data by dropping observations or imputing the missing values. After cleaning, we can run the machine learning algorithm on the cleaned dataset.
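A minimal pandas sketch of these cleaning steps (the outlier rule below is a hypothetical rule of thumb; the right filter depends on the problem):
import pandas as pd

df = pd.read_csv("usa_housing.csv")

# remove duplicate observations
df = df.drop_duplicates()

# handle missing data: either drop the affected rows...
df = df.dropna()
# ...or impute numeric columns with their mean instead:
# df = df.fillna(df.select_dtypes(include="number").mean())

# filter unwanted outliers, e.g. keep prices within 3 standard deviations
mean, std = df["Price"].mean(), df["Price"].std()
df = df[(df["Price"] - mean).abs() <= 3 * std]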
To understand how to use the linear regression model, we will use a dataset from Kaggle to predict house prices. You can download the housing data here. Save the file with the name 'usa_housing.csv'. A Jupyter notebook can be used to practice this machine learning algorithm.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("usa_housing.csv")
# to print the top 5 rows from the dataframe
print("Head:")
print(df.head())
# to check the total number of columns, total number of entries, etc.
print("Info:")
print(df.info())
# summary statistics (count, mean, std, etc.) for the numeric columns
print("Description:")
print(df.describe())
# to get the column names
print("Columns:")
print(df.columns)
# visualizing the data
# the plot shows the distribution of house prices in the dataset,
# giving a sense of the typical price range
print("Distribution of Prices:")
sns.histplot(df['Price'], kde=True)  # distplot is deprecated in recent seaborn versions
# correlation between columns
# shows the pairwise correlation of the numeric columns in matrix form
print("Correlation:")
print(df.corr(numeric_only=True))  # numeric_only skips non-numeric columns such as Address
# now we will split the data into feature variables and response variable
# feature variables as X and response variable as y
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']
# now split them into train and test data
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
# test_size: the fraction of the data held out for testing
# random_state: fixes the random seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=111)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
# fit the linear model on the training data
lm.fit(X_train, y_train)
# check the intercept and the coefficients
print(lm.intercept_)
print(lm.coef_)
# make predictions
y_pred = lm.predict(X_test)
# check the difference between predicted and original responses
plt.scatter(y_test, y_pred)
plt.xlabel("Prices: $Y_i$")
plt.ylabel("Predicted prices: $\hat{Y}_i$")
plt.title("Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")
Output: a scatter plot of actual prices against predicted prices. Points lying close to a straight diagonal line indicate that the predictions track the actual values well.
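To put numbers on the prediction error, we can also compute standard regression metrics with scikit-learn (a sketch, reusing y_test and y_pred from the code above):
from sklearn import metrics

print("MAE: ", metrics.mean_absolute_error(y_test, y_pred))
print("MSE: ", metrics.mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("R^2: ", metrics.r2_score(y_test, y_pred))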
This is an easy way to implement the algorithm, but it has limitations. Least squares can perform very badly when the dependent variable contains excessively large or small values, called outliers. If there are too many features in the dataset, the least squares method may not give reliable results. There may also be lurking variables: the relationship between two variables can be significantly affected by a third variable that is not considered or included in the study.
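A quick illustration of the outlier problem, using made-up data: a single extreme value of the dependent variable noticeably drags the least squares line.
import numpy as np

x = np.arange(10, dtype=float)
y = 2 * x + 1  # points on a perfect line: slope 2, intercept 1

slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # 2.0, 1.0

y[9] = 100  # introduce a single outlier
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # the slope jumps from 2.0 to about 6.4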
When do you use Linear Regression? What do you like about it? What do you hate about it? Let us know in comments.