Predicting Housing Sales Prices


Today, I'll show my entire process for my submission to the Kaggle competition for predicting house prices.

House Prices: Advanced Regression Techniques

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

house

Competition Description


Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Goal


It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.

Metric


Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

File Descriptions


  • train.csv - the training set
  • test.csv - the test set
  • data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
  • sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

Data fields

Here's a brief version of what you'll find in the data description file.

  • SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • Bedroom: Number of bedrooms above basement level
  • Kitchen: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: $Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale

Setup Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from scipy import stats
from scipy.stats import norm, skew
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

Part 1: Data Exploration


In [2]:
train = pd.read_csv('train.csv')
In [3]:
train.tail()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

5 rows × 81 columns

In [4]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

The data has a lot of columns with <100 missing values which can be filled in later. 5 columns are unsaveable (Alley, FireplaceQu, PoolQC, Fence, and MiscFeature), so I will drop these from my model if necessary.

In [5]:
train.describe()
Out[5]:
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 ... WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1452.000000 1460.000000 ... 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315 5.575342 1971.267808 1984.865753 103.685262 443.639726 ... 94.244521 46.660274 21.954110 3.409589 15.060959 2.758904 43.489041 6.321918 2007.815753 180921.195890
std 421.610009 42.300571 24.284752 9981.264932 1.382997 1.112799 30.202904 20.645407 181.066207 456.098091 ... 125.338794 66.256028 61.119149 29.317331 55.757415 40.177307 496.123024 2.703626 1.328095 79442.502883
min 1.000000 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 34900.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000 5.000000 1954.000000 1967.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2007.000000 129975.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000 5.000000 1973.000000 1994.000000 0.000000 383.500000 ... 0.000000 25.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 163000.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000 6.000000 2000.000000 2004.000000 166.000000 712.250000 ... 168.000000 68.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 214000.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 5644.000000 ... 857.000000 547.000000 552.000000 508.000000 480.000000 738.000000 15500.000000 12.000000 2010.000000 755000.000000

8 rows × 38 columns

There are 37 quantitative variables and 43 (80-37) categorical variables to examine. I'll let seaborn do the heavy lifting to find what predictor variables are relevant. First, I'll look at the target variable to understand what I am trying to predict:

SalePrice analysis

From above, I can see that SalePrice has a mean of \$181,000 standard deviation of \$79,000 and median \$163,000.

In [6]:
plt.figure(figsize=(12,8))
sns.distplot(train['SalePrice'], fit = norm)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f57934668>

Sales price looks skewed right, which makes sense because a small portion of houses are way more expensive than normal house pricing. Also, 'SalePrice' does not look normal, so I'll see if it is saveable with a transformation for my model:

In [7]:
stats.probplot(train['SalePrice'], plot = plt)
Out[7]:
((array([-3.30513952, -3.04793228, -2.90489705, ...,  2.90489705,
          3.04793228,  3.30513952]),
  array([ 34900,  35311,  37900, ..., 625000, 745000, 755000], dtype=int64)),
 (74160.16474519415, 180921.19589041095, 0.9319665641512986))

My stats class recently covered these types of transformations, and a logarithmic transformation usually works best on pricing data:

In [8]:
train['SalePrice'] = np.log(train['SalePrice'])
In [9]:
sns.distplot(train['SalePrice'], fit=norm)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f578ec7b8>
In [10]:
stats.probplot(train['SalePrice'], plot=plt)
Out[10]:
((array([-3.30513952, -3.04793228, -2.90489705, ...,  2.90489705,
          3.04793228,  3.30513952]),
  array([10.46024211, 10.47194981, 10.54270639, ..., 13.34550693,
         13.5211395 , 13.53447303])),
 (0.39826223081618883, 12.024050901109383, 0.9953761475636617))

And there we go, sales price is good to be tested on. Next, I'll look at the predictor variables to see which are relevant:

Predictor Variable Analysis

In [11]:
#correlation matrix
plt.figure(figsize=(12,10))
sns.heatmap(train.corr(), square=True, cmap='coolwarm')
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f53d20ba8>

At first sight, there seems to be several variables with very high correlations between them, which indicates multicollinearity. These instances of multicollinearity are:

  • 'GarageYrBlt' and 'YearBuilt'
  • 'TotRmsAbvGr' and 'GrLivArea'
  • '1stFlrSF' and 'TotalBsmtSF'
  • 'GarageCars' and 'GarageArea'

For SalePrice, I counted around 13 variables that have a fairly high correlation. I'm going to dig a little deeper and see the highest correlations with predictor variables and sales price.

In [12]:
plt.figure(figsize=(12,10))

#selects 10 columns with highest correlation with 'SalePrice'
cols = train.corr().nlargest(10, 'SalePrice')['SalePrice'].index

#creates correlation matrix
#transpose values due to mix of quantitative and categorical data
cm = np.corrcoef(train[cols].values.T)
sns.heatmap(cm, square=True, annot=True, fmt='3.2f', 
            yticklabels=cols.values, xticklabels=cols.values, cmap='coolwarm')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f583f43c8>

As we saw earlier,

  • TotalBsmtSF & 1stFlrSF
  • TotRmsAbvGr & GrLivArea
  • GarageCars & GarageArea

all have a collinear relationship. Since these variables are in the top 10 correlations with YearBuilt, I remove the variable of the two with the lowest correlation with 'SalePrice'. Now I can look at the relationships between the 7 remaining variables.

In [13]:
sns.set_style('darkgrid')
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', \
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(train[cols])
Out[13]:
<seaborn.axisgrid.PairGrid at 0x20f583dbc88>

At first glance, all of the predictor variables seem to have a fairly strong positive correlation with SalePrice. The only variable that's questionable is YearBuilt which looks like it could possibly have a non-linear relationship with SalePrice. Lets see:

In [14]:
plt.figure(figsize=(14,6))
plt.xticks(rotation=90)
sns.boxplot(x='YearBuilt', y='SalePrice', data=train)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f5aa5b7b8>

While the relationship isn't strong, I'd say this is linear enough for a regression.

Next, I'll look at the variable that predicts SalePrice the best: Overall Quality:

In [15]:
plt.figure(figsize=(16,10))
sns.boxplot(x='OverallQual', y='SalePrice', data=train).set(ylabel='ln(SalePrice)')
Out[15]:
[Text(0,0.5,'ln(SalePrice)')]

Keep in mind that SalePrice was logarithmically transformed, but all of the values look good to test on. Next I'll handle the missing values I found in my early data exploration.

Part 2: Data Cleaning


Outliers

In [16]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', \
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(train[cols])
Out[16]:
<seaborn.axisgrid.PairGrid at 0x20f5be823c8>

It looks like GrLivArea and TotalBsmtSF have major outliers to the data. Lets look at those:

In [17]:
sns.regplot(x='GrLivArea',y='SalePrice',data=train,fit_reg=False)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f5e4374a8>
In [18]:
#see outliers
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
Out[18]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1298 1299 60 RL 313.0 63887 Pave NaN IR3 Bnk AllPub ... 480 Gd NaN NaN 0 1 2008 New Partial 11.982929
523 524 60 RL 130.0 40094 Pave NaN IR1 Bnk AllPub ... 0 NaN NaN NaN 0 10 2007 New Partial 12.126759

2 rows × 81 columns

In [19]:
#drop outliers
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
In [20]:
sns.regplot(x='GrLivArea',y='SalePrice',data=train,fit_reg=False)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f5e82ce10>
In [21]:
sns.regplot(x='TotalBsmtSF',y='SalePrice',data=train,fit_reg=False)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f5ee9e780>

It looks like TotalBsmtSF had the same outlier

Missing Values


I combine my train and test dataframe to do cleaning on both sets at once

In [22]:
test = pd.read_csv('test.csv')
test.head()
Out[22]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

In [23]:
#store SalePrice for later use in model
saleprices = train['SalePrice']
In [24]:
data = pd.concat((train, test)).reset_index(drop=True)
data.drop(['SalePrice'], axis=1, inplace=True)
In [25]:
data.tail()
Out[25]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
2912 546 546 0 NaN 3 Twnhs TA No 0.0 0.0 ... WD 0 Pave 5 546.0 AllPub 0 1970 1970 2006
2913 546 546 0 NaN 3 TwnhsE TA No 252.0 0.0 ... WD 0 Pave 6 546.0 AllPub 0 1970 1970 2006
2914 1224 0 0 NaN 4 1Fam TA No 1224.0 0.0 ... WD 0 Pave 7 1224.0 AllPub 474 1960 1996 2006
2915 970 0 0 NaN 3 1Fam TA Av 337.0 0.0 ... WD 0 Pave 6 912.0 AllPub 80 1992 1992 2006
2916 996 1004 0 NaN 3 1Fam TA Av 758.0 0.0 ... WD 0 Pave 9 996.0 AllPub 190 1993 1994 2006

5 rows × 80 columns

In [26]:
missing = data.isnull().sum().sort_values(ascending=False)
pct = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
#creates dataframe with missing and pct missing
miss_data = pd.concat([missing, pct], axis=1, keys=['Missing','Percent'])
#shows columns with missing values
miss_data[miss_data['Missing']>0]
Out[26]:
Missing Percent
PoolQC 2908 99.691464
MiscFeature 2812 96.400411
Alley 2719 93.212204
Fence 2346 80.425094
FireplaceQu 1420 48.680151
LotFrontage 486 16.660953
GarageQual 159 5.450806
GarageFinish 159 5.450806
GarageYrBlt 159 5.450806
GarageCond 159 5.450806
GarageType 157 5.382242
BsmtCond 82 2.811107
BsmtExposure 82 2.811107
BsmtQual 81 2.776826
BsmtFinType2 80 2.742544
BsmtFinType1 79 2.708262
MasVnrType 24 0.822763
MasVnrArea 23 0.788481
MSZoning 4 0.137127
BsmtFullBath 2 0.068564
BsmtHalfBath 2 0.068564
Utilities 2 0.068564
Functional 2 0.068564
Electrical 1 0.034282
Exterior2nd 1 0.034282
KitchenQual 1 0.034282
Exterior1st 1 0.034282
GarageCars 1 0.034282
TotalBsmtSF 1 0.034282
GarageArea 1 0.034282
BsmtUnfSF 1 0.034282
BsmtFinSF2 1 0.034282
BsmtFinSF1 1 0.034282
SaleType 1 0.034282

As a rule, any data with more than 15% missing data is prone to messing up the data and hard to impute, so I'll drop these.

In [27]:
data = data.drop((miss_data[miss_data['Percent']>15]).index,1)

Garage missing values

In [28]:
#categorical variables
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    train[col] = train[col].fillna('NA')
#quantitative variable
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(train['GarageYrBlt'].median())

Basement missing values

In [29]:
#categorical variables
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    train[col] = train[col].fillna('NA')
#quantitative variables
for col in ('BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','BsmtFullBath','BsmtHalfBath'):
    test[col] = test[col].fillna(test[col].median())

Masonry Veneer missing values

In [30]:
#categorical variable
data["MasVnrType"] = data["MasVnrType"].fillna('NA')
#quantitative variable
data["MasVnrArea"] = data["MasVnrArea"].fillna(0)

Other missing values

Since these variables are all 'object' type with 4 or less missing values, I'll just use a filler as these missing values aren't too significant to my results

In [31]:
for col in ('SaleType','KitchenQual', 'Functional','MasVnrType','Exterior1st','Exterior2nd','MSZoning','Utilities','Electrical'):
    test[col] = test[col].fillna('NA')

Part 3: Feature Engineering


This part was more of a creative process. From intuition, I figured that house prices are largely affected by square footage, location, and amenities. Since location and amenities are given, I created an overall square footage feature

In [32]:
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + \
                   data['2ndFlrSF'] + data['GarageArea']

Part 4: Correcting Skewed Features


In [33]:
quant_feats = data.dtypes[data.dtypes != 'object'].index
skewed_feats = data[quant_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew':skewed_feats})
skewness
Out[33]:
Skew
MiscVal 21.939672
PoolArea 17.688664
LotArea 13.109495
LowQualFinSF 12.084539
3SsnPorch 11.372080
KitchenAbvGr 4.300550
BsmtFinSF2 4.143683
EnclosedPorch 4.002344
ScreenPorch 3.945101
BsmtHalfBath 3.928397
MasVnrArea 2.621719
OpenPorchSF 2.529358
WoodDeckSF 1.844792
MSSubClass 1.375131
1stFlrSF 1.257286
GrLivArea 1.068750
BsmtFinSF1 0.980283
BsmtUnfSF 0.919699
TotalSF 0.892543
2ndFlrSF 0.861556
TotRmsAbvGrd 0.749232
Fireplaces 0.725278
HalfBath 0.696666
TotalBsmtSF 0.676200
BsmtFullBath 0.621322
OverallCond 0.569314
BedroomAbvGr 0.326568
GarageArea 0.218752
MoSold 0.197345
OverallQual 0.189591
FullBath 0.165514
YrSold 0.131996
Id -0.000867
GarageCars -0.217977
GarageYrBlt -0.380955
YearRemodAdd -0.450134
YearBuilt -0.599194

To fix skewness, I'll use a Box-Cox Transformation. This method tests an exponent, λ, for integers -5 to 5 and selects the λ that gives the most normal distribution using the formula: BoxCox Here's a quick read on the technique: https://www.statisticshowto.datasciencecentral.com/box-cox-transformation/

In [34]:
from scipy.special import boxcox1p
#finds skewed features with skew greater than 1
skewness = skewness[abs(skewness)>1]
skewed_feats = skewness.index
for feat in skewed_feats:
    data[feat] = boxcox1p(data[feat],.15)

Replace Categorical Variables

In [35]:
data = pd.get_dummies(data)
In [36]:
data.head()
Out[36]:
1stFlrSF 2ndFlrSF 3SsnPorch BedroomAbvGr BsmtFinSF1 BsmtFinSF2 BsmtFullBath BsmtHalfBath BsmtUnfSF EnclosedPorch ... SaleType_ConLD SaleType_ConLI SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD Street_Grvl Street_Pave Utilities_AllPub Utilities_NoSeWa
0 11.692623 11.686189 0.0 1.540963 11.170327 0.0 0.730463 0.000000 7.483296 0.000000 ... 0 0 0 0 0 1 0 1 1 0
1 12.792276 0.000000 0.0 1.540963 12.062832 0.0 0.000000 0.730463 8.897844 0.000000 ... 0 0 0 0 0 1 0 1 1 0
2 11.892039 11.724598 0.0 1.540963 10.200343 0.0 0.730463 0.000000 9.917060 0.000000 ... 0 0 0 0 0 1 0 1 1 0
3 12.013683 11.354094 0.0 1.540963 8.274266 0.0 0.730463 0.000000 10.468500 8.797736 ... 0 0 0 0 0 1 0 1 1 0
4 12.510588 12.271365 0.0 1.820334 10.971129 0.0 0.730463 0.000000 10.221051 0.000000 ... 0 0 0 0 0 1 0 1 1 0

5 rows × 271 columns

Part 5: Model Building


First, I split up my train and test data that I had previously combined

In [37]:
#uses length of train as index to split data
train = data[:train.shape[0]]
test = data[train.shape[0]:]
In [38]:
train.head()
Out[38]:
1stFlrSF 2ndFlrSF 3SsnPorch BedroomAbvGr BsmtFinSF1 BsmtFinSF2 BsmtFullBath BsmtHalfBath BsmtUnfSF EnclosedPorch ... SaleType_ConLD SaleType_ConLI SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD Street_Grvl Street_Pave Utilities_AllPub Utilities_NoSeWa
0 11.692623 11.686189 0.0 1.540963 11.170327 0.0 0.730463 0.000000 7.483296 0.000000 ... 0 0 0 0 0 1 0 1 1 0
1 12.792276 0.000000 0.0 1.540963 12.062832 0.0 0.000000 0.730463 8.897844 0.000000 ... 0 0 0 0 0 1 0 1 1 0
2 11.892039 11.724598 0.0 1.540963 10.200343 0.0 0.730463 0.000000 9.917060 0.000000 ... 0 0 0 0 0 1 0 1 1 0
3 12.013683 11.354094 0.0 1.540963 8.274266 0.0 0.730463 0.000000 10.468500 8.797736 ... 0 0 0 0 0 1 0 1 1 0
4 12.510588 12.271365 0.0 1.820334 10.971129 0.0 0.730463 0.000000 10.221051 0.000000 ... 0 0 0 0 0 1 0 1 1 0

5 rows × 271 columns

I used XGBoost, which is the leading library for gradient boosting in Python. It works very similar to scikit-learn where you fit the model and run predictions

In [39]:
from xgboost import XGBRegressor
xgbmodel = XGBRegressor()
xgbmodel.fit(train, saleprices)
Out[39]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

Predictions

In [40]:
y_pred = xgbmodel.predict(test)
y_pred
y_pred_exp = np.expm1(y_pred)
y_pred_exp
Out[40]:
array([128428.94, 166563.05, 182806.5 , ..., 165588.83, 115200.83,
       234027.3 ], dtype=float32)

Part 6: Submission


In [41]:
test = pd.read_csv('test.csv')
submission = pd.DataFrame({
    'Id' : test['Id'],
    'SalePrice': y_pred_exp
})
submission.to_csv('submission.csv',index=False)
In [42]:
submission.head()
Out[42]:
Id SalePrice
0 1461 128428.937500
1 1462 166563.046875
2 1463 182806.500000
3 1464 186999.828125
4 1465 184022.531250
In [43]:
example = pd.read_csv('sample_submission.csv')
example.head()
Out[43]:
Id SalePrice
0 1461 169277.052498
1 1462 187758.393989
2 1463 183583.683570
3 1464 179317.477511
4 1465 150730.079977

Results

housing_submission

Conclusion


The first time working through this journal, I did my cleaning separately on my train and test sets, but this didn't work because I got a different number of dummy variables for each due to different columns having missing values.

I also could have done more with feature engineering and created more features. Lastly, I also used XGBoost for my model, but there are thousands of different types of models, as well as ensembling methods.

Well, there we go. I went from zero to a final submission on Kaggle. I hope you enjoyed my thought process, and I'm always welcome to any comments, questions, and suggestions.