Feature Selection Analysis using Stepwise Regression models

Data Preparation

Importing Libraries

read CSV File

preview of the contents of the dataframe

fm (feature matrix) -- collection of independent variables -- R&D, Admin, Marketing Spend and State

v (variable) -- dependent variable -- Profit

encode the 'State' attribute into numbers

split of feature matrix into training and test set

(based on a predefined test_size and random_state shared by all models)

training set -- for the machine to learn from

test set -- to test machine's profit prediction
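
The preparation steps above can be sketched as follows. This is only an illustration: the data is synthetic, and the column names, test_size and random_state values are assumptions, not the notebook's actual ones. Note that pd.get_dummies places the dummy columns after the numeric ones, which may differ from the column order used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the startup data (all names and values are assumed).
rng = np.random.default_rng(0)
n = 20
df = pd.DataFrame({
    "R&D Spend": rng.uniform(0, 100, n),
    "Administration": rng.uniform(0, 100, n),
    "Marketing Spend": rng.uniform(0, 100, n),
    "State": np.tile(["New York", "California", "Florida"], 7)[:n],
})
df["Profit"] = 50 + 0.8 * df["R&D Spend"] + rng.normal(0, 5, n)

fm = df.drop(columns="Profit")   # feature matrix (independent variables)
v = df["Profit"]                 # dependent variable

# Encode 'State' as one 0/1 dummy column per state.
fm = pd.get_dummies(fm, columns=["State"], dtype=float)

# Fixed test_size and random_state so that every model sees the same split.
fm_train, fm_test, v_train, v_test = train_test_split(
    fm, v, test_size=0.2, random_state=0
)
```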

General Trends

profit versus expenditure on R&D (sampled from training set)

profit versus expenditure on administration (sampled from training set)

profit versus expenditure on marketing (sampled from training set)

Analysis 1:

>A strong linear relationship exists between a startup's Profit and its R&D Spend.

> Profit dependence on Admin Spend appears to be random.

>Profit shows a positive trend with increasing Marketing Spend.

Data Processing

Step 1: let scikit-learn fit a linear regression on all features and predict profits (the regressor itself performs no feature selection)

invocation of the linear regressor and fitting to the training set

machine's prediction assigned to 'v_pred'
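
A minimal sketch of this step, using synthetic data in place of the startup dataset (the coefficients, sample size and split parameters are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic feature matrix with three spend columns (values are assumed).
rng = np.random.default_rng(0)
fm = rng.uniform(0, 100, size=(50, 3))
v = 50 + 0.8 * fm[:, 0] + 0.1 * fm[:, 2] + rng.normal(0, 1, 50)

fm_train, fm_test, v_train, v_test = train_test_split(
    fm, v, test_size=0.2, random_state=0
)

# Invoke the linear regressor and fit it to the training set.
regressor = LinearRegression()
regressor.fit(fm_train, v_train)

# The machine's prediction for the test set.
v_pred = regressor.predict(fm_test)
```

LinearRegression keeps every column it is given, so any feature selection has to happen before the fit.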

Step 2: manually implement Backward Elimination

remove first column of dataframe (avoid dummy variable trap)

append a column vector of 1's at the beginning of the matrix so that the constant term of the Multiple LR is estimated
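
A small sketch of these two matrix operations, assuming the three State dummy columns sit at the front of the matrix (the layout and values here are made up for illustration). The three dummies always sum to 1, which duplicates the constant column, so one of them has to go.

```python
import numpy as np

# Assumed layout: three State dummy columns first, then three spend columns.
fm = np.array([
    [1.0, 0.0, 0.0, 10.0, 20.0, 30.0],
    [0.0, 1.0, 0.0, 11.0, 21.0, 31.0],
    [0.0, 0.0, 1.0, 12.0, 22.0, 32.0],
    [1.0, 0.0, 0.0, 13.0, 23.0, 33.0],
])

# Drop the first dummy column to avoid the dummy variable trap.
fm = fm[:, 1:]

# Prepend a column of ones so the regression can estimate a constant term.
fm = np.append(np.ones((fm.shape[0], 1)), fm, axis=1)
```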

Ordinary Least Squares regression results display the p-values at each stage

Initial Matrix preview

Initially, all the features are passed

feature x2 (dummy variable) is removed, as it has the highest p-value, and the model is re-fit

Matrix Preview after feature x2 is removed

feature x1 (dummy variable) is removed, as it has the highest p-value, and the model is re-fit

Analysis 2:

>Both dummy variables passed turn out to be statistically insignificant in Profit prediction.

>This indicates that the State-wise distribution of Profits is likely to be similar.

>A Boxplot visual on the same illustrates this

Backward Elimination continued

Matrix Preview after feature x1 is removed

feature x2 (Admin Spend) is removed, as it has the highest p-value, and the model is re-fit

Matrix Preview after feature x2 is removed

feature x2 (Marketing Spend) is removed, as it has the highest p-value, and the model is re-fit

Matrix Preview after feature x2 is removed

Backward Elimination is concluded, as all p-values of remaining features are below 0.05
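
The elimination loop walked through above can be sketched as below. This is an illustration, not the notebook's code: the p-values are computed by hand with NumPy and SciPy (the standard OLS t-test), the data is synthetic, and the helper names ols_p_values and backward_elimination are hypothetical. The constant column is never dropped in this sketch.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided p-values of OLS coefficients (X already contains the constant)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)              # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - k)

def backward_elimination(X, y, threshold=0.05):
    """Repeatedly drop the feature with the highest p-value, then re-fit."""
    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        p = ols_p_values(X[:, kept], y)
        worst = 1 + int(np.argmax(p[1:]))         # skip position 0, the constant
        if p[worst] < threshold:
            break                                  # all remaining features significant
        kept.pop(worst)
    return kept

# Synthetic data: column 0 is the constant, column 1 drives y, the rest are noise.
rng = np.random.default_rng(0)
n = 50
rd = rng.uniform(0, 100, n)
X = np.column_stack([np.ones(n), rd, rng.uniform(0, 100, size=(n, 3))])
y = 50 + 0.8 * rd + rng.normal(0, 5, n)

kept = backward_elimination(X, y)   # indices of the surviving columns
```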

Final Matrix of Features (preview) contains a constant term and R&D Spend

Analysis 3:

> Since only the R&D Spend feature remains (intercept term aside), the intuition from the 'General Trends' graphs appears negotiable.

> This suggests that modelling Profit as a function of R&D Spend alone (Simple Linear Regression) is a good approximation (later confirmed by the r2_scores).

Implement model using backward elimination optimised feature matrix

Step 3: Implement model using Forward Selection

reassign feature matrix to original

Identify two most statistically significant features using Forward Selection and verify results from Backward Elimination
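
One way to sketch forward selection is a greedy search that, at each step, adds the feature giving the largest R² (with one feature added at a time, this is the same as picking the candidate with the lowest p-value). The notebook's actual routine is not shown here, so the helpers r_squared and forward_selection and the synthetic data below are assumptions.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X plus a constant term."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_selection(X, y, n_features):
    """Greedily add the feature that improves the fit the most."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        best = max(remaining, key=lambda j: r_squared(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic columns: 0 = R&D, 1 = Admin, 2 = Marketing (values are assumed).
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 100, size=(n, 3))
y = 50 + 0.8 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(0, 1, n)

feat_idx = forward_selection(X, y, n_features=2)
feat_vals = X[:, feat_idx]   # values of the selected features
```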

Analysis 4:

feat_vals -- values of those features identified as statistically significant by forward selection

>By checking, it is confirmed that the forward selection algorithm (when forced to extract two of the most statistically significant features) selects R&D Spend and Marketing Spend.

>This result agrees with the penultimate step of Backward Elimination.

Now, implement the model as per features given by Forward Selection

Data Visualization

Colour Code

-- Red point indicates machine's predicted profit

-- Blue point indicates actual profit

visual on actual versus predicted profit (evaluated by sklearn)

visual on actual versus predicted profit (evaluated by manual backward elimination)

visual on actual versus predicted profit (evaluated by manual forward selection)

quantitative data on the actual profit versus the predicted profit (evaluated by sklearn)

quantitative data on the actual profit versus the predicted profit (evaluated by manual backward elimination)

quantitative data on the actual profit versus the predicted profit (evaluated by manual forward selection)

Comparison of r2_scores (a goodness-of-fit measure for regression models: 1 is a perfect fit, 0 matches always predicting the mean, and it can be negative on a test set)
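
For reference, the score is 1 - SS_res / SS_tot. The hand-rolled function below (a hypothetical helper, shown with toy numbers) computes the same quantity as sklearn.metrics.r2_score and shows that the value is not bounded below by 0.

```python
import numpy as np

def r2(v_true, v_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    v_true = np.asarray(v_true, dtype=float)
    v_pred = np.asarray(v_pred, dtype=float)
    ss_res = np.sum((v_true - v_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((v_true - v_true.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot

v_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(v_true, v_true)                   # exact predictions
baseline = r2(v_true, [2.5, 2.5, 2.5, 2.5])    # always predicting the mean
worse = r2(v_true, [4.0, 3.0, 2.0, 1.0])       # reversed predictions
```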

scikit learn prediction

backward elimination's prediction (only R&D Spend accounted)

forward selection's prediction (both R&D Spend and Marketing Spend accounted)

Analysis 5:

> When the model is run with a variety of random states, a general pattern emerges: the features 'R&D' and 'Marketing', when used, consistently outperform the other two.
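
A comparison across random states could be set up along these lines. The data is synthetic and the feature subsets, seed range and split size are assumptions, so the scores it produces say nothing about the startup dataset itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic columns: 0 = R&D, 1 = Admin, 2 = Marketing (values are assumed).
rng = np.random.default_rng(0)
n = 50
X = rng.uniform(0, 100, size=(n, 3))
y = 50 + 0.8 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(0, 5, n)

subsets = {"all": [0, 1, 2], "rd_only": [0], "rd_marketing": [0, 2]}
scores = {name: [] for name in subsets}

for seed in range(10):
    # Same split for every feature subset, so the scores are comparable.
    train_idx, test_idx = train_test_split(
        np.arange(n), test_size=0.2, random_state=seed
    )
    for name, cols in subsets.items():
        model = LinearRegression().fit(X[train_idx][:, cols], y[train_idx])
        pred = model.predict(X[test_idx][:, cols])
        scores[name].append(r2_score(y[test_idx], pred))

# Average test-set r2_score per feature subset.
mean_scores = {name: float(np.mean(vals)) for name, vals in scores.items()}
```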