## Machine Learning: Introduction to Logistic Regression in Python

## Let's solve this regression problem

I have a diabetes dataset with features such as Glucose, Blood Pressure, BMI, Insulin and Age. Given these features, we want to predict the probability that a person is diabetic. Our data looks like the sample given below.

The problem here is that the response variable, Outcome, is either 0 or 1,
where 0 means not diabetic and 1 means diabetic. In linear regression the response variable
is continuous, but here it is **categorical**. The other
thing is that we want to find a **probability**. So how
can we solve this problem? Let's see in detail.

## Qualitative variables and Classification

In linear regression, the response variable Y is quantitative, but what if
the response variable is **qualitative**? First of all,
what is a qualitative response? Consider identifying whether an email is spam or not.
The response is either yes or no rather than a number or
quantitative value. Here the outcome is binary, such as yes/no, 0/1,
true/false, on/off. A qualitative variable is often referred to as a
categorical variable, and it can have more than two categories. The
process of predicting a qualitative response is known as
**classification**, which is an essential part of machine
learning. Instead of modeling the response variable directly, logistic regression
models the probability that the response belongs to a particular category.

## Real world use of Logistic Regression

Logistic regression is generally used when the dependent variable is binary (dichotomous). The dependent variable can be either 0 or 1, "yes" or "no", "win" or "lose", "male" or "female", true or false. Thus the dependent variable is categorical, while the independent variables can be categorical or numerical.

Logistic regression is used in many applications: email spam detection, diabetes prediction, predicting whether a voter will vote for a given party in an election, predicting whether a customer of an e-commerce company will buy a particular item, predicting whether a viewer will click on a particular advertisement, and so on. Based on the predicted outcome, a company can shape its strategy to meet its business goals.

## Derivation of Logistic Regression

Simple linear regression has a very short formula:
`Y = α + βX`

Multiple linear regression has the following formula:
`Y = α + β1X1 + β2X2 + β3X3 + … + βnXn`

In linear regression, we deal with problems such as salary versus experience: as a candidate's experience increases, his or her salary increases. We can plot the data points on a graph and draw a line that models the observations. Thus we can see that there is a correlation between salary and experience.

But what if the data we plot looks completely different? For example, suppose we want to predict whether a medicine will have an effect on a person's health given his or her age, or whether a patient will be diabetic or not. If we put these observations on a graph, the points are laid out something like the figure given below.

Here the dependent variable is either No (0 in our case) or Yes (1 in our case). It is not possible to draw a straight line that passes through most of the data points, so a linear fit does not look like the best approach to this problem.

Here, instead of predicting the exact answer or modeling the response variable Y directly, we would like to find the probability (the likelihood) that the medicine has an effect given the patient's age. A probability is a value between 0 and 1, and the chart above also shows values from 0 to 1. So the part of the graph between 0 and 1 makes sense, because there the probability can take any value between 0 and 1, as shown in the following figure.

But the parts of the line below 0 and above 1 do not make sense. They say, for example, that the medicine is more than 100% likely to be effective at age 65 (a probability greater than 1) and, conversely, less than 0% likely to be effective at age 5 (a probability less than 0). Neither is possible, because a probability always lies in the range 0 to 1. So we can just cut off these parts of the line, and we get a picture like the one below.
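The problem described above can be reproduced with a few lines of plain Python. This is a minimal sketch: the `ages` and `effect` values are made-up illustration data, not taken from any real study, and `fit_line` is a hypothetical helper that fits `Y = α + βX` by ordinary least squares.

```python
# Hypothetical data: patient age and whether the medicine worked (1) or not (0).
ages = [5, 15, 25, 35, 45, 55, 65, 75]
effect = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


alpha, beta = fit_line(ages, effect)

# Evaluate the straight line at a young, a middle and an old age.
for age in (5, 40, 75):
    print(age, round(alpha + beta * age, 3))
```

With this toy data the fitted line gives a value below 0 at age 5 and above 1 at age 75, so it cannot be read as a probability at the extremes.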

To avoid the problem of predicting p(X) < 0 and p(X) > 1 for some values of X, we can model p(X) using a function that always gives an output between 0 and 1 for all values of X. The logistic function, also called the sigmoid function, can be used for this; it gives an "S"-shaped sigmoid curve on the graph.

The formula of the logistic function is given below: $$ p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} $$
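The logistic function can be written directly in Python. This is a minimal sketch; the coefficient values `beta0 = -4.0` and `beta1 = 0.1` are arbitrary assumptions chosen only to show the shape of the curve.

```python
import math


def logistic(x, beta0, beta1):
    """p(x) = e^(beta0 + beta1*x) / (1 + e^(beta0 + beta1*x))."""
    z = beta0 + beta1 * x
    # Two algebraically equal forms, chosen so exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients, for illustration only.
beta0, beta1 = -4.0, 0.1

# Even for extreme inputs the output never leaves the range 0 to 1.
for x in (-1000, 0, 40, 80, 1000):
    print(x, logistic(x, beta0, beta1))
```

However large or small the input gets, the result never leaves the range 0 to 1, which is exactly the property the straight line was missing.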

Taking the odds, that is, dividing the probability by its complement, gives the following formula. $$ \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} $$

Taking the logarithm of both sides gives $$ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X $$

The left-hand side is called the **log-odds** or **logit** function.
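The relationship between probability, odds and log-odds can be checked numerically. The sketch below again assumes the arbitrary coefficients `beta0 = -4.0` and `beta1 = 0.1`; the helper names `odds` and `logit` are chosen here for illustration.

```python
import math


def logistic(z):
    """Map a linear score z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds: the logarithm of the odds."""
    return math.log(odds(p))


# Hypothetical coefficients, for illustration only.
beta0, beta1 = -4.0, 0.1

for x in (20, 40, 60):
    z = beta0 + beta1 * x
    p = logistic(z)
    # The last two printed values agree: logit(p) recovers beta0 + beta1*x.
    print(x, round(p, 4), round(odds(p), 4), round(logit(p), 4), round(z, 4))
```

Because the log-odds are linear in X, increasing X by one unit multiplies the odds by a constant factor of e^β1.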

Plotting the logistic function gives a new chart, which looks like the one given below:

Now, using our observations and the formula, we get an S-shaped curve that plays the same role as the trend line in linear regression: it is the best-fit curve for the data points. This function can be used to predict the probability (p̂), which lies in the range 0 to 1. Instead of giving a definite answer, it gives a probability.
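One common way to find the best-fit coefficients is to maximise the likelihood of the observations, which is the same as minimising the average log-loss. The sketch below does this with plain gradient descent on made-up data. It is only an illustration under those assumptions: the data, the learning rate and the number of steps are all hypothetical, and libraries such as scikit-learn use more sophisticated solvers.

```python
import math

# Hypothetical observations: age and whether the medicine worked (1) or not (0).
ages = [5, 15, 25, 35, 45, 55, 65, 75]
effect = [0, 0, 1, 0, 1, 0, 1, 1]

# Centre and rescale age so that plain gradient descent converges quickly.
xs = [(age - 40) / 10 for age in ages]


def logistic(z):
    """Numerically stable logistic (sigmoid) function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(beta0, beta1):
    """Average negative log-likelihood of the observations."""
    total = 0.0
    for x, y in zip(xs, effect):
        p = logistic(beta0 + beta1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def fit(learning_rate=0.1, steps=5000):
    """Gradient descent on the average log-loss."""
    beta0, beta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, effect):
            error = logistic(beta0 + beta1 * x) - y
            grad0 += error
            grad1 += error * x
        beta0 -= learning_rate * grad0 / n
        beta1 -= learning_rate * grad1 / n
    return beta0, beta1


beta0, beta1 = fit()


def predict_probability(age):
    """Predicted probability that the medicine works at a given age."""
    return logistic(beta0 + beta1 * (age - 40) / 10)


for age in (5, 40, 75):
    print(age, round(predict_probability(age), 3))
```

The predicted probability rises smoothly with age along an S-shaped curve and stays between 0 and 1 for every age.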

To convert the probability into a binary category, we use a classification threshold, also known as a decision threshold. The threshold is simply a cut-off line: a value above it is classified as positive (on, true) and a value below it as negative (off, false). In logistic regression the threshold is commonly 0.5: the predicted Y is set to 0 if the probability is < 0.5, and to 1 if the probability is ≥ 0.5.
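Applying a threshold is a one-line operation. In this small sketch the `classify` helper and the probability values are hypothetical, chosen only to show how the cut-off changes the predicted classes.

```python
def classify(probabilities, threshold=0.5):
    """Return 1 where the probability is >= threshold, otherwise 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities for five patients.
probs = [0.08, 0.35, 0.5, 0.62, 0.91]

print(classify(probs))        # [0, 0, 1, 1, 1]
print(classify(probs, 0.7))   # [0, 0, 0, 0, 1]
```

Raising the threshold makes the model more conservative about predicting the positive class, which can be useful when false positives are costly.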

## Example of Logistic Regression in Python

Let's work through an example of fitting a logistic regression model. Using this example, we are going to predict whether or not a patient has diabetes. We are going to use the Pima Indians Diabetes Database from Kaggle.

```
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# load the dataset (diabetes.csv must be in the working directory)
dt = pd.read_csv("diabetes.csv")
print(dt.head())

# select response variable and features
features = ['Pregnancies', 'Insulin', 'BMI', 'Age', 'Glucose', 'BloodPressure', 'DiabetesPedigreeFunction']
X = dt[features]  # Features
y = dt["Outcome"]  # Target variable

# split the data into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# fit the model; max_iter is raised above the default of 100 because the
# features are unscaled and the solver may otherwise warn about convergence
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# predict the class (0 or 1) for each patient in the test set
y_pred = logreg.predict(X_test)

# confusion matrix: rows are actual classes, columns are predicted classes
cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)

# classification report: precision, recall and F1-score per class
print(classification_report(y_test, y_pred))
```
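To read the confusion matrix, it helps to compute the same counts by hand. The sketch below uses a small set of made-up labels rather than the diabetes data, and `confusion_counts` is a hypothetical helper. For binary labels 0 and 1, scikit-learn's `confusion_matrix` arranges these counts as `[[TN, FP], [FN, TP]]`.

```python
def confusion_counts(y_true, y_pred):
    """Count true negatives, false positives, false negatives, true positives."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Hypothetical actual and predicted labels for ten patients.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found

print([[tn, fp], [fn, tp]])
print(accuracy, precision, recall)
```

Precision and recall for the positive class are the same quantities that appear in the classification report.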