## Machine Learning: Introduction to Linear Regression in Python

## What is Regression and Regression Model?

### Regression

Regression is a method for finding the statistical relationship between two or more variables, where a change in one or more independent variables is associated with a change in the dependent variable. Examples include sales and profit, or age and height.

### Regression Model

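A regression model is a statistical model that describes the relationship between a dependent variable and one or more independent variables. It is often used to study cause and effect, although a fitted relationship does not by itself prove causation.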

### Linear Regression

Linear Regression is a method for modeling a linear relationship between a dependent variable (scalar response) and one or more independent variables (explanatory variables).

If the study involves two continuous (quantitative) variables, one dependent and one independent, it is known as **Simple Linear Regression**. When there are two or more predictor variables, it is called **Multiple Linear Regression**.

In simple linear regression, the independent variable is denoted by `X` (also known as the **predictor** or **explanatory** variable). The dependent variable is denoted by `y` (also known as the **response**, **outcome**, or **explained** variable).

For example, consider a scenario involving *advertisement cost* and *sales*. An increase or decrease in *advertisement cost* corresponds to an increase or decrease in *sales*. So *advertisement cost* is the independent variable `X` and *sales* is the dependent variable `y`. Another example of linear regression: as the age of a child increases, his or her height increases. Similarly, taller fathers tend to have taller sons, although the sons' heights tend to regress toward the mean height.

## Regression Line

Let's learn this through an example. In the given dataset, we have `n` observations of the correlated variables `X` and `y`, as follows: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

We can create a **scatter plot** by putting the independent (predictor) variable on the **X** axis and the dependent (response) variable on the **Y** axis. In the scatter diagram, if the points are clustered around a line, we can say that there is a linear relationship. The line that best fits these points is called the Regression Line. To determine the best-fitting line, we look for the line that minimizes the total vertical distance between the data points and the line.

A regression line can also be drawn by eye on a scatter plot. This requires no computation, which makes it an easy method. The disadvantage is that two different people can draw different lines and therefore provide different estimated values.

### Fitting Regression Line

#### Perfect Positive Correlation

After plotting the data points on a scatter diagram, if we find that all of them lie on a single line and the line goes upwards (from left to right), we can say the two variables are perfectly positively correlated. In this scenario, the numerical value of the correlation coefficient is 1.

#### Perfect Negative Correlation

When all data points lie on a single line but the line goes downwards (from left to right), we can say the two variables are perfectly negatively correlated. This means an increase in one variable corresponds to a decrease in the other, and vice versa. In this case, the numerical value of the correlation coefficient is -1.

#### No (Zero) Correlation

In a scatter diagram, if the data points are scattered in such a way that no line can be found, we can say there is no linear relationship, and therefore no correlation, between `X` and `y`. In such a case, the numerical value of the correlation coefficient will be 0 (or very close to it).

#### Partial Correlation

In business, economics, or social science, these variables (`X` and `y`) are not always perfectly correlated. After plotting the data points on a scatter diagram and calculating the correlation coefficient, we usually get a fractional value. This number can be positive or negative and is always between -1 and 1. It means that the variables are partially correlated, either positively or negatively. The closer the number is to -1 or 1, the **stronger** the correlation. As the number moves towards 0, the correlation gets weaker, which is called a **weak** correlation.

**Partial (Strong) Positive Correlation**

**Partial (Strong) Negative Correlation**
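As a rough illustration of how the correlation coefficient behaves in the cases above, the following sketch computes it from its textbook definition in plain Python. It assumes the Pearson correlation coefficient is the one meant here; the helper name `pearson_r` and the sample values are made up for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together around their means
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How x and y each vary on their own
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)


x = [1, 2, 3, 4, 5]

perfect_positive = pearson_r(x, [2, 4, 6, 8, 10])  # all points on a rising line
perfect_negative = pearson_r(x, [10, 8, 6, 4, 2])  # all points on a falling line
partial_positive = pearson_r(x, [2, 5, 4, 8, 9])   # points only roughly on a line

print(perfect_positive)            # 1.0
print(perfect_negative)            # -1.0
print(round(partial_positive, 3))  # about 0.933, a strong positive correlation
```

With pandas, `df.corr()` reports the same kind of coefficient for every pair of numeric columns.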

## Method of Least Squares

This is the most common method for finding a Regression Line. In the least squares method, we have to determine the constants of the line (its intercept and slope). Using our training data, we choose them to minimize the sum of **squared differences** between the actual values and the predicted values of the dependent variable.

Linear Regression model takes following form:

`y = α + βX `

where,

- `y` is the Dependent Variable or Predicted Response Value
- `X` is the Independent Variable
- `α` is the Intercept (the value of `y` when `X` is 0)
- `β` is the Slope

If we are predicting *sales* based on *advertisement cost*, our linear regression equation would look like this:

*sales* = α + β * *advertisement cost*

The numerical association between two variables can be measured by the **correlation coefficient**, which is a value between -1 and 1.

In the least squares method, a line is fitted by minimizing the **sum of squares of residuals**. A **residual** is the difference between an observation's actual `y` value and the value predicted by the fitted line.
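To make the idea concrete, here is a minimal sketch of least squares for simple linear regression in plain Python. It uses the standard closed-form solution, in which the slope is the sum of products of deviations divided by the sum of squared deviations of `X`, and the intercept follows from the means. The function name and the advertisement and sales figures are hypothetical, chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta = s_xy / s_xx               # slope
    alpha = mean_y - beta * mean_x   # intercept
    return alpha, beta


def sum_squared_residuals(xs, ys, alpha, beta):
    """Sum of squared differences between actual and predicted y."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: advertisement cost (X) and sales (y)
ad_cost = [10, 20, 30, 40, 50]
sales = [25, 44, 67, 85, 104]

alpha, beta = fit_simple_linear_regression(ad_cost, sales)
residuals = [y - (alpha + beta * x) for x, y in zip(ad_cost, sales)]

best_sse = sum_squared_residuals(ad_cost, sales, alpha, beta)
# Any other line, for example one with a shifted intercept, has a larger sum
shifted_sse = sum_squared_residuals(ad_cost, sales, alpha + 0.5, beta)

print(round(alpha, 2), round(beta, 2))  # 5.3 1.99
print(best_sse < shifted_sse)           # True
```

The residuals of the fitted line sum to (approximately) zero, which is a property of least squares whenever an intercept is included.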

## Linear Regression in Python

Let's learn how to implement Linear Regression in Python. Before implementing any Machine Learning algorithm, we need to clean the data. Data cleaning is an essential part of machine learning. It involves removing unwanted observations (such as duplicates, or irrelevant observations that don't fit the specific problem), fixing structural problems, filtering unwanted outliers, and handling missing data by dropping observations or imputing missing values. After cleaning the data, we can apply the machine learning algorithm to the cleaned dataset.
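The cleaning steps mentioned above could look roughly like this in pandas. This is only a sketch on a tiny made-up table: the column names, the choice of median imputation, and the 1.5 × IQR outlier rule are assumptions for illustration, and the right choices depend on the actual dataset.

```python
import numpy as np
import pandas as pd

# Small made-up dataset containing a duplicate row, missing values, and an outlier
raw = pd.DataFrame({
    "rooms": [3, 4, 4, 5, np.nan, 6, 4],
    "price": [200.0, 250.0, 250.0, 310.0, 280.0, 5000.0, np.nan],
})

# 1. Remove exact duplicate observations
cleaned = raw.drop_duplicates()

# 2. Drop observations where the response (price) is missing
cleaned = cleaned.dropna(subset=["price"])

# 3. Impute missing feature values with the column median
cleaned = cleaned.assign(rooms=cleaned["rooms"].fillna(cleaned["rooms"].median()))

# 4. Filter outliers: keep prices within 1.5 * IQR of the quartiles
q1, q3 = cleaned["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = cleaned["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = cleaned[in_range].reset_index(drop=True)

print(cleaned)
```

Of the seven original rows, four remain: the duplicate, the row without a price, and the extreme price are removed, and the missing room count is filled in.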

To understand how to use the linear regression model, we will use a dataset from Kaggle to predict house prices. You can download the housing data here. Save this file as 'usa_housing.csv'. A Jupyter notebook can also be used to practice this machine learning algorithm.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split

# in a Jupyter notebook, uncomment the next line to render plots inline
# %matplotlib inline

df = pd.read_csv("usa_housing.csv")

# print the top 5 rows of the dataframe
print("Head:")
print(df.head())
# check the total number of columns, total number of entries, etc.
print("Info:")
df.info()  # info() prints its report directly and returns None
# this will give the statistical information
print("Description:")
print(df.describe())
# get the column names
print("Columns:")
print(df.columns)

# visualizing the data
# the plot shows the distribution of house prices in the dataset,
# which gives a rough idea of where typical prices lie
print("Distribution of Prices:")
# distplot is deprecated in recent seaborn releases; histplot replaces it
sns.histplot(df['Price'], kde=True)
plt.show()

# correlation between all pairs of numeric columns, in matrix form
# numeric_only=True (pandas 1.5+) skips text columns such as addresses
print("Correlation:")
print(df.corr(numeric_only=True))

# split the data into feature variables X and response variable y
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']

# split them into train and test data
# test_size: the fraction of the data held out for testing
# random_state: fixes the random split so it is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=111)

lm = LinearRegression()
# fit the linear model on the training data
lm.fit(X_train, y_train)
# check the intercept and the coefficients
print(lm.intercept_)
print(lm.coef_)

# make predictions
y_pred = lm.predict(X_test)

# compare predicted and original responses
# raw strings keep the LaTeX backslashes in the labels intact
plt.scatter(y_test, y_pred)
plt.xlabel(r"Prices: $Y_i$")
plt.ylabel(r"Predicted prices: $\hat{Y}_i$")
plt.title(r"Prices vs Predicted prices: $Y_i$ vs $\hat{Y}_i$")
plt.show()
```

**Output:** a scatter plot of actual prices (x-axis) against predicted prices (y-axis).
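A scatter plot gives a visual check, and a few numbers can summarize the same comparison between actual and predicted values. The sketch below computes some common regression error measures in plain Python. The function name and the sample values are made up and simply stand in for `y_test` and `y_pred`; scikit-learn offers equivalent functions such as `mean_absolute_error`, `mean_squared_error`, and `r2_score` in `sklearn.metrics`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Common error measures comparing actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e ** 2 for e in errors) / n   # mean squared error
    rmse = math.sqrt(mse)                   # same units as the response
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                 # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot                # share of variation explained
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up values standing in for y_test and y_pred
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

metrics = regression_metrics(actual, predicted)
print(metrics["mae"], metrics["rmse"])  # 10.0 10.0
print(round(metrics["r2"], 3))          # 0.968
```

An R² close to 1 and a small error relative to typical prices would suggest the predictions track the actual values well.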

This is an easy way to implement the algorithm. However, the least squares method can perform very badly when the dependent variable contains excessively large or small values, called outliers. If the dataset has too many features, the least squares method may not provide reliable results; for example, the model can overfit when features are numerous relative to the number of observations. There may also be lurking variables: the relationship between two variables can be significantly affected by a third variable that is not considered or included in the study.
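The sensitivity to outliers is easy to see on a toy example. The sketch below fits a straight line with `numpy.polyfit` to made-up data that lies exactly on `y = 2x + 1`, then fits again after replacing a single value with an extreme one. The data and variable names are illustrative assumptions only.

```python
import numpy as np

x = np.arange(10, dtype=float)

# Clean data lying exactly on the line y = 2x + 1
y_clean = 2.0 * x + 1.0

# Same data with one extreme value (the clean value at x = 9 would be 19)
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

# polyfit with deg=1 performs a least squares line fit
# and returns the coefficients as [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print(round(slope_clean, 2), round(intercept_clean, 2))
print(round(slope_outlier, 2), round(intercept_outlier, 2))
```

A single outlier among ten points more than triples the fitted slope here, because squaring the residuals gives large errors a disproportionate weight.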

When do you use Linear Regression? What do you like about it? What do you hate about it? Let us know in comments.