- Tue 27 November 2018
- datascience
- #multiple regression, #XGBoost, #data cleaning
Today, I'll show my entire process for my submission to the Kaggle competition for predicting house prices.
House Prices: Advanced Regression Techniques¶
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Competition Description¶
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Goal¶
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
Metric¶
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
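In other words, the score is computed roughly like this (a minimal sketch with made-up numbers, not the official scoring code):
import numpy as np
#hypothetical predictions and observed prices, just to illustrate the metric
predicted = np.array([200000, 150000, 310000])
observed = np.array([210000, 140000, 300000])
rmse_log = np.sqrt(np.mean((np.log(predicted) - np.log(observed)) ** 2))
print(rmse_log)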
File Descriptions¶
- train.csv - the training set
- test.csv - the test set
- data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
- sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms
Data fields¶
Here's a brief version of what you'll find in the data description file.
- SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
Setup Imports¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from scipy import stats
from scipy.stats import norm, skew
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Part 1: Data Exploration¶
train = pd.read_csv('train.csv')
train.tail()
train.info()
The data has a number of columns with fewer than 100 missing values, which can be filled in later. Five columns (Alley, FireplaceQu, PoolQC, Fence, and MiscFeature) are missing far too many values to be worth saving, so I will drop them from my model if necessary.
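To see the worst offenders at a glance (a quick sketch; the full missing-value breakdown comes later, once the train and test sets are combined):
#columns with the most missing values in the training set
train.isnull().sum().sort_values(ascending=False).head(6)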
train.describe()
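Alongside describe(), a quick count of column types helps frame the rest of the analysis (a sketch; select_dtypes splits numeric from object-typed columns, and Id is left out):
#count numeric vs. object (categorical) columns, excluding the Id column
print(train.drop('Id', axis=1).select_dtypes(exclude='object').shape[1])
print(train.drop('Id', axis=1).select_dtypes(include='object').shape[1])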
There are 37 quantitative variables and 43 (80 - 37) categorical variables to examine. I'll let seaborn do the heavy lifting to find which predictor variables are relevant. First, I'll look at the target variable to understand what I am trying to predict:
SalePrice analysis¶
From above, I can see that SalePrice has a mean of \$181,000, a standard deviation of \$79,000, and a median of \$163,000.
plt.figure(figsize=(12,8))
sns.distplot(train['SalePrice'], fit = norm)
SalePrice looks right-skewed, which makes sense because a small portion of houses sell for far more than typical prices. It is clearly not normally distributed, so I'll check whether a transformation can salvage it for my model:
stats.probplot(train['SalePrice'], plot = plt)
My stats class recently covered these types of transformations, and a logarithmic transformation usually works best on pricing data:
train['SalePrice'] = np.log(train['SalePrice'])
sns.distplot(train['SalePrice'], fit=norm)
stats.probplot(train['SalePrice'], plot=plt)
And there we go: SalePrice is now roughly normal and ready to model. Next, I'll look at the predictor variables to see which are relevant:
Predictor Variable Analysis¶
#correlation matrix
plt.figure(figsize=(12,10))
sns.heatmap(train.corr(), square=True, cmap='coolwarm')
At first sight, there seem to be several pairs of variables with very high correlations between them, which indicates multicollinearity. These pairs are:
- 'GarageYrBlt' and 'YearBuilt'
- 'TotRmsAbvGrd' and 'GrLivArea'
- '1stFlrSF' and 'TotalBsmtSF'
- 'GarageCars' and 'GarageArea'
For SalePrice, I counted around 13 variables with a fairly high correlation. I'm going to dig a little deeper and look at the predictor variables most correlated with SalePrice.
plt.figure(figsize=(12,10))
#selects 10 columns with highest correlation with 'SalePrice'
cols = train.corr().nlargest(10, 'SalePrice')['SalePrice'].index
#creates a correlation matrix for just those columns
#np.corrcoef treats each row as a variable, so the values are transposed
cm = np.corrcoef(train[cols].values.T)
sns.heatmap(cm, square=True, annot=True, fmt='3.2f',
yticklabels=cols.values, xticklabels=cols.values, cmap='coolwarm')
As we saw earlier,
- TotalBsmtSF & 1stFlrSF
- TotRmsAbvGrd & GrLivArea
- GarageCars & GarageArea
all have a collinear relationship. Since these pairs appear in the top 10 correlations with SalePrice, I keep the member of each pair with the higher correlation with 'SalePrice' and drop the other. Now I can look at the relationships between the 7 remaining variables.
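As a quick check of which member of each pair to keep (a small sketch using the training-set correlations):
#for each collinear pair, keep the variable more correlated with SalePrice
pairs = [('TotalBsmtSF', '1stFlrSF'), ('TotRmsAbvGrd', 'GrLivArea'), ('GarageCars', 'GarageArea')]
corr = train.corr()['SalePrice']
for a, b in pairs:
    keep = a if corr[a] >= corr[b] else b
    print(a, round(corr[a], 2), '|', b, round(corr[b], 2), '-> keep', keep)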
sns.set_style('darkgrid')
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', \
'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(train[cols])
At first glance, all of the predictor variables seem to have a fairly strong positive correlation with SalePrice. The only questionable one is YearBuilt, which looks like it could have a non-linear relationship with SalePrice. Let's see:
plt.figure(figsize=(14,6))
plt.xticks(rotation=90)
sns.boxplot(x='YearBuilt', y='SalePrice', data=train)
While the relationship isn't strong, I'd say this is linear enough for a regression.
Next, I'll look at the variable that predicts SalePrice the best: Overall Quality:
plt.figure(figsize=(16,10))
sns.boxplot(x='OverallQual', y='SalePrice', data=train).set(ylabel='ln(SalePrice)')
Keep in mind that SalePrice was logarithmically transformed, but all of the values look good to model on. Before handling the missing values I found in my early data exploration, I'll revisit the pairplot to check for outliers.
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', \
'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(train[cols])
It looks like GrLivArea and TotalBsmtSF have some major outliers. Let's look at those:
sns.regplot(x='GrLivArea',y='SalePrice',data=train,fit_reg=False)
#see outliers
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
#drop outliers
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
sns.regplot(x='GrLivArea',y='SalePrice',data=train,fit_reg=False)
sns.regplot(x='TotalBsmtSF',y='SalePrice',data=train,fit_reg=False)
It looks like TotalBsmtSF had the same outlier, which has already been removed.
Part 2: Handling Missing Data¶
test = pd.read_csv('test.csv')
test.head()
#store SalePrice for later use in model
saleprices = train['SalePrice']
data = pd.concat((train, test)).reset_index(drop=True)
data.drop(['SalePrice'], axis=1, inplace=True)
data.tail()
missing = data.isnull().sum().sort_values(ascending=False)
pct = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
#creates dataframe with missing and pct missing
miss_data = pd.concat([missing, pct], axis=1, keys=['Missing','Percent'])
#shows columns with missing values
miss_data[miss_data['Missing']>0]
As a rule of thumb, any column with more than 15% of its values missing is hard to impute reliably and likely to distort the model, so I'll drop those columns.
data = data.drop(miss_data[miss_data['Percent'] > 15].index, axis=1)
Garage missing values¶
#categorical variables
#fill on the combined train+test frame, since cleaning happens after the concat
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    data[col] = data[col].fillna('NA')
#quantitative variable
data['GarageYrBlt'] = data['GarageYrBlt'].fillna(data['GarageYrBlt'].median())
Basement missing values¶
#categorical variables
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    data[col] = data[col].fillna('NA')
#quantitative variables, again filled on the combined frame
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'BsmtFullBath', 'BsmtHalfBath'):
    data[col] = data[col].fillna(data[col].median())
Masonry Veneer missing values¶
#categorical variable
data["MasVnrType"] = data["MasVnrType"].fillna('NA')
#quantitative variable
data["MasVnrArea"] = data["MasVnrArea"].fillna(0)
Other missing values¶
Since these variables are all 'object' type with four or fewer missing values, I'll just use a filler value, as these gaps are not significant enough to affect my results:
for col in ('SaleType', 'KitchenQual', 'Functional', 'MasVnrType', 'Exterior1st', 'Exterior2nd', 'MSZoning', 'Utilities', 'Electrical'):
    data[col] = data[col].fillna('NA')
Part 3: Feature Engineering¶
This part was more of a creative process. From intuition, I figured that house prices are largely driven by square footage, location, and amenities. Since location and amenities are already captured by existing columns, I created an overall square footage feature:
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + \
data['2ndFlrSF'] + data['GarageArea']
Part 4: Correcting Skewed Features¶
quant_feats = data.dtypes[data.dtypes != 'object'].index
skewed_feats = data[quant_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew':skewed_feats})
skewness
To fix the skewness, I'll use a Box-Cox transformation. This method tries values of an exponent λ (typically over the range -5 to 5) and selects the λ that makes the distribution most normal, using the formula y(λ) = (y^λ - 1)/λ for λ ≠ 0 and y(λ) = log(y) for λ = 0. Here's a quick read on the technique: https://www.statisticshowto.datasciencecentral.com/box-cox-transformation/
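Before applying the transform, here's a quick illustration of how λ can be estimated for a single feature (a sketch only, not part of the pipeline; scipy.stats.boxcox returns the λ that maximizes the log-likelihood, and the feature is shifted by 1 to mirror boxcox1p, which operates on 1 + x):
#illustration: estimate the best lambda for one raw, skewed feature
vals = data['GrLivArea'].dropna() + 1
transformed, best_lambda = stats.boxcox(vals)
print(best_lambda)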
from scipy.special import boxcox1p
#keep only the features with absolute skew greater than 1
skewness = skewness[abs(skewness['Skew']) > 1]
skewed_feats = skewness.index
#boxcox1p applies the Box-Cox transform to 1 + x with a fixed lambda of 0.15
for feat in skewed_feats:
    data[feat] = boxcox1p(data[feat], 0.15)
Replace Categorical Variables¶
data = pd.get_dummies(data)
data.head()
#uses length of train as index to split data
train = data[:train.shape[0]]
test = data[train.shape[0]:]
train.head()
Part 5: Model¶
I used XGBoost, one of the leading gradient boosting libraries for Python. Its interface works much like scikit-learn's: you fit the model and then run predictions.
from xgboost import XGBRegressor
xgbmodel = XGBRegressor()
xgbmodel.fit(train, saleprices)
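Before predicting on the test set, it's worth getting a rough local estimate of the competition metric. A minimal sketch using scikit-learn's cross_val_score (since the target is already log-transformed, RMSE here approximates the leaderboard score):
from sklearn.model_selection import cross_val_score
#5-fold CV; scikit-learn reports negative MSE, so negate it and take the square root
cv_scores = cross_val_score(XGBRegressor(), train, saleprices,
                            scoring='neg_mean_squared_error', cv=5)
print(np.sqrt(-cv_scores).mean())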
Predictions¶
y_pred = xgbmodel.predict(test)
y_pred
#SalePrice was transformed with np.log, so np.exp inverts it back to dollars
y_pred_exp = np.exp(y_pred)
y_pred_exp
Part 6: Submission¶
test = pd.read_csv('test.csv')
submission = pd.DataFrame({
'Id' : test['Id'],
'SalePrice': y_pred_exp
})
submission.to_csv('submission.csv',index=False)
submission.head()
example = pd.read_csv('sample_submission.csv')
example.head()
Results¶
Conclusion¶
The first time working through this journal, I did my cleaning separately on my train and test sets, but this didn't work: one-hot encoding produced a different number of dummy columns for each set, because different columns had missing values in each.
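For anyone who does clean the sets separately, one way to reconcile the dummy columns afterwards is pandas' align. A toy sketch (the color frames here are made up purely for illustration):
#two frames one-hot encoded separately end up with different dummy columns
a = pd.get_dummies(pd.DataFrame({'color': ['red', 'blue']}))
b = pd.get_dummies(pd.DataFrame({'color': ['red', 'green']}))
#reindex b to a's columns, filling dummies b lacks with 0 (and dropping extras)
a, b = a.align(b, join='left', axis=1, fill_value=0)
print(list(a.columns), list(b.columns))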
I could also have done more feature engineering and created additional features. Lastly, I used XGBoost for my model, but there are many other model types, as well as ensembling methods, worth exploring.
Well, there we go. I went from zero to a final submission on Kaggle. I hope you enjoyed my thought process, and I welcome any comments, questions, and suggestions.