House Price Prediction
25 Aug 2018
Introduction
This post analyzes the prices of houses sold between May 2014 and May 2015. It covers two tasks: 1) exploratory data analysis and 2) using machine learning models to predict the sale price of the houses.
Dataset Description and its Features
The dataset contains 21,614 samples and 20 features. The data can be downloaded here.
Feature table
Code | Description |
---|---|
ID | Notation for a house |
Date | Date house was sold |
Price | House price (target variable) |
Bedrooms | Number of bedrooms |
Bathrooms | Number of bathrooms |
Sqft_Living | Square footage of the home |
Sqft_Lot | Square footage of the lot |
Floors | Total floors (levels) in house |
Waterfront | House which has a view to a waterfront |
View | Has been viewed |
Condition | How good the condition is overall |
Grade | Overall grade given to the housing unit |
Sqft_Above | Square footage of the house apart from basement |
Sqft_Basement | Square footage of the basement |
Yr_Built | Year built |
Yr_Renovated | Year when house was renovated |
Zipcode | ZIP code |
Lat | Latitude coordinate |
Long | Longitude coordinate |
Sqft_Living15 | Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area |
Sqft_Lot15 | Lot size area in 2015 (implies some renovations) |
Exploratory Data Analysis
Sales Record Location and 2018 household density
- most sales are in the urban area
- the majority of households are located near the highway and the river
- the distribution of sale prices is approximately log-normal
Price and Location
- Rich people live in the suburban area (zip code 98039)
- The center of the city is very expensive (zip code 98119)
Seasonality
- The real estate market has a seasonal trend: prices and sales volume fluctuate throughout the year
- Winter is the low season, summer is the peak season
- both volume and price increase significantly in spring
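The seasonal pattern above can be checked by grouping sales by calendar month. The sketch below is a minimal, dependency-free illustration: the `sales` records are made up for the example, and the `YYYYMMDD` date format is an assumption about how the Date column is stored.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical (date, price) records; values are illustrative, not from the dataset.
sales = [
    ("20140502", 450000.0),
    ("20140515", 520000.0),
    ("20140620", 610000.0),
    ("20141203", 380000.0),
    ("20150110", 400000.0),
    ("20150418", 640000.0),
]

def monthly_summary(records, date_format="%Y%m%d"):
    """Return {month: (number_of_sales, median_price)} keyed by calendar month 1-12."""
    prices_by_month = defaultdict(list)
    for date_text, price in records:
        month = datetime.strptime(date_text, date_format).month
        prices_by_month[month].append(price)
    return {
        month: (len(prices), median(prices))
        for month, prices in sorted(prices_by_month.items())
    }

summary = monthly_summary(sales)
for month, (count, mid_price) in summary.items():
    print(f"month {month:2d}: {count} sales, median price {mid_price:,.0f}")
```

The median is used rather than the mean because a few very expensive houses can dominate a monthly average.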
Structure of the house and floor plan
- the unit price of houses with a basement is statistically higher than that of houses without one
- the unit price of houses with a waterfront view is statistically higher than that of houses without one
- ill-designed houses (those with a lower ratio of #bath/#bed) are cheaper
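As a rough sketch of how such comparisons could be set up, the code below computes a unit price, a bath-to-bed ratio, and Welch's t statistic for two groups. It assumes "unit price" means price per square foot of living area, which the post does not state, and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def unit_price(price, sqft_living):
    """Price per square foot of living area (assumed definition of 'unit price')."""
    return price / sqft_living

def bath_bed_ratio(bathrooms, bedrooms):
    """Bathrooms per bedroom; returns None when there are no bedrooms."""
    return bathrooms / bedrooms if bedrooms else None

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Hypothetical (price, sqft_living) pairs; not taken from the dataset.
with_basement = [unit_price(p, s) for p, s in [(600000, 2000), (540000, 1800), (700000, 2500)]]
without_basement = [unit_price(p, s) for p, s in [(400000, 2000), (450000, 1800), (500000, 2500)]]

t_stat = welch_t(with_basement, without_basement)
print(f"Welch t statistic: {t_stat:.2f}")
```

A large positive statistic points toward a higher mean in the first group; turning it into a p-value needs the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.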
Customer View and House
- People hesitate when the location is not ideal, such as an island (zip code 98070) or far from the city (zip codes 98022, 98117); they tend to double-check the house
- When the price is above 1 million, people pay more visits before purchase
Machine Learning Model
A detailed script is available here.
Loss Function
Mean squared error: easy to train the model with.
Because the target y is log-transformed, the metric is effectively the mean squared log error.
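To make the metric concrete, here is a minimal sketch of a mean squared log error in plain Python. It uses the natural log directly, which is an assumption; scikit-learn's `mean_squared_log_error` uses log(1 + y) instead, a negligible difference at house-price scale. The prices are made up.

```python
import math

def mean_squared_log_error(actual_prices, predicted_prices):
    """Mean of squared differences between natural logs of actual and predicted prices."""
    if len(actual_prices) != len(predicted_prices):
        raise ValueError("inputs must have the same length")
    squared_log_diffs = [
        (math.log(actual) - math.log(predicted)) ** 2
        for actual, predicted in zip(actual_prices, predicted_prices)
    ]
    return sum(squared_log_diffs) / len(squared_log_diffs)

# Hypothetical prices for illustration only.
actual = [300000.0, 500000.0, 1200000.0]
predicted = [330000.0, 450000.0, 1200000.0]
print(mean_squared_log_error(actual, predicted))
```

Because the error is computed on logs, a 10% miss costs the same on a cheap house as on an expensive one, which suits a price distribution with a long right tail.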
Metric
CV error for parameter tuning
Test error for model comparison and reporting
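The split between the two error types can be sketched as follows: a test set is held out once, and cross-validation folds are drawn only from the remaining training data. This is a minimal illustration in plain Python; the 20% test fraction, the 5 folds, and the function names are assumptions, not values from the original script.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices once and hold out the last `test_fraction` as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[: n_samples - n_test], indices[n_samples - n_test:]

def k_fold_indices(train_indices, n_folds=5):
    """Yield (fit_indices, validation_indices) pairs; each index is validated exactly once."""
    folds = [train_indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        fit = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield fit, folds[held_out]

train_idx, test_idx = train_test_split_indices(n_samples=100)
splits = list(k_fold_indices(train_idx, n_folds=5))
print(len(train_idx), len(test_idx), [len(val) for _, val in splits])
```

Hyperparameters are chosen by the average validation error across the folds; the test indices are touched only once, when the tuned models are compared.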
Rationale for model selection
Although some features exhibit a linear relationship with the target value, a tree-based model is likely the best fit for this problem. Tree-based models can exploit combinations of these variables without the interactions being specified by hand, and they generally give strong prediction accuracy on tabular data like this.
Features might have interaction effects, which are difficult for linear models to discover on their own. A stepwise forward selection process over all possible interaction variables and higher-order terms might work, but it is inefficient and seems too arbitrary.
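To give a sense of why enumerating interaction terms gets expensive, the sketch below lists only the second-order candidates (pairwise products and squares). The helper name `candidate_terms` and the placeholder feature names are hypothetical.

```python
from itertools import combinations

def candidate_terms(feature_names):
    """List pairwise interaction terms and squared terms for the given features."""
    interactions = [f"{a}*{b}" for a, b in combinations(feature_names, 2)]
    squares = [f"{name}^2" for name in feature_names]
    return interactions + squares

# A few feature names from the feature table, used only to show the naming.
small = candidate_terms(["Bedrooms", "Bathrooms", "Sqft_Living"])
print(small)

# With 20 features, second-order terms alone add 190 interactions plus 20 squares.
many = candidate_terms([f"x{i}" for i in range(20)])
print(len(many))
```

Forward selection refits the model once per remaining candidate at every step, so even this second-order list implies a large number of fits, and third-order terms grow the list much further.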
Base Model
Linear model: Elastic-Net (both L1 and L2 penalties included)
During training, models with a higher penalty yield a lower score.
The model with a small penalty fails to converge. One reading is that the variables are all linearly related to the target and have little collinearity, but this needs further testing.
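For reference, the elastic-net objective can be written as below. This is the parameterization used by scikit-learn's `ElasticNet`; whether the original script used that library is an assumption. Here n is the number of samples, w the coefficient vector, alpha the overall penalty strength, and rho (`l1_ratio` in scikit-learn) the mix between the L1 and L2 terms.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \rho \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
```

Setting rho = 1 leaves only the L1 (lasso) term and rho = 0 leaves only the L2 (ridge) term, so "higher penalty" in the notes above corresponds to a larger alpha.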
Final Model
Fine-tuned LightGBM (Light Gradient Boosting Machine) model; this was the best-performing of the models compared here
Results and Interpretation
Model | CV Error | Test Error | Training Time + Test Time |
---|---|---|---|
ElasticNet | 0.0678 | 0.06947 | 2.02s + 0.015s |
RandomForest | 0.03557 | 0.03653 | 1.3763s + 0.028s |
LightGBM | 0.02628 | 0.02666 | 0.0525s + 0.0638s |
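The errors in the table are easier to read as percentages. The sketch below assumes they are mean squared errors on the natural log of price, as the Loss Function section suggests but does not confirm. Under that assumption, the square root is a typical log-scale miss and exponentiating it gives an approximate multiplicative error.

```python
import math

# Test errors copied from the results table.
test_errors = {"ElasticNet": 0.06947, "RandomForest": 0.03653, "LightGBM": 0.02666}

def typical_relative_error(mean_squared_log_error):
    """Convert an MSE on natural-log prices into an approximate multiplicative error.

    Assumes the error was computed on natural-log prices; this is not confirmed by the post.
    """
    root_error = math.sqrt(mean_squared_log_error)
    return math.exp(root_error) - 1.0

for model_name, error in test_errors.items():
    print(f"{model_name}: roughly {typical_relative_error(error):.1%} off")
```

Read this way, the gap between the linear baseline and the boosted trees is substantial, although the exact percentages depend on the log-scale assumption.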
Important Features Ranking
- Location (represented as zip code, latitude, and longitude)
- Area (Sqft_Lot, Sqft_Above)
- Grade & Condition & Renovation
- Seasonality (seasonal sales)
- Floor plan and design (e.g. 2b3b, 2 floors)
- View
- Waterfront