SUPERVISED LEARNING · REAL ESTATE ANALYTICS

Housing Price Estimation
Using Machine Learning

A comparative study of Linear Regression, Decision Trees, XGBoost, and Ensemble models for predicting residential property values using real-world housing data.

Maleha Israt Chowdhury Sachi Datta · Supervisor: Weimin Huang · Memorial University
Best R² Score: 99.45% (XGBoost Regressor) ↑ Best Model
Lowest RMSE: $6,510 (XGBoost prediction error) ↓ Lowest Error
Lowest MAE: $1,202 (Ensemble Model) ↓ Most Consistent
Models Tested: 4 (LR · DT · XGB · Ensemble), Comparative Study
[Figure: R² Score Comparison]
[Figure: RMSE vs MAE by Model]
MAPE (Mean Absolute Percentage Error) — Lower is Better
Ensemble Model: ~0.8%
Decision Tree: ~1.2%
XGBoost: ~1.5%
Linear Regression: ~11.5%

Model Performance Table

Rank  Model              RMSE ($)    MAE ($)     R² Score  Assessment
1st   XGBoost            6,509.66    2,083.93    0.9945    Excellent
2nd   Ensemble Model     7,007.30    1,202.48    0.9936    Excellent
3rd   Decision Tree      13,426.56   1,518.85    0.9765    Good
4th   Linear Regression  30,580.89   15,701.19   0.8781    Limited

Algorithm Details

XGBoost
Extreme Gradient Boosting
🚀
Sequential ensemble method that trains models iteratively, each correcting the errors of the previous one. Features built-in L1/L2 regularization, parallel processing, and automatic missing value handling.
R²: 0.9945 · RMSE: $6,510 · Handles non-linearity · Anti-overfit
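The sequential, error-correcting idea behind boosting can be illustrated with a minimal sketch in plain Python. This is not XGBoost itself: it has no regularization, parallelism, or missing-value handling, and it boosts one-split regression stumps on a single feature under squared-error loss. The function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the split x <= t that best fits the residuals (least squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Start from the mean, then add stumps that each fit the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    total = model["base"]
    for t, lv, rv in model["stumps"]:
        total += model["learning_rate"] * (lv if x <= t else rv)
    return total


def training_mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (hypothetical): one feature, prices in thousands of dollars.
xs = [1, 2, 3, 4, 5, 6]
ys = [100.0, 110.0, 150.0, 160.0, 210.0, 220.0]

mse_0 = training_mse(boost(xs, ys, n_rounds=0), xs, ys)    # mean-only baseline
mse_1 = training_mse(boost(xs, ys, n_rounds=1), xs, ys)    # one correction
mse_50 = training_mse(boost(xs, ys, n_rounds=50), xs, ys)  # fifty corrections
```

Each round lowers the training error because the new stump is fitted to what the previous rounds left unexplained. The learning rate shrinks every correction, which is one simple guard against overfitting.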
Ensemble Model
Stacked LR + DT + XGBoost
🧩
Stacked ensemble combining three base models (Linear Regression, Decision Tree, XGBoost) with a Random Forest meta-learner. Uses 5-fold cross-validation to generate out-of-fold predictions.
R²: 0.9936 · MAE: $1,202 · Best generalization · cv=5
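How out-of-fold predictions are generated for stacking can be sketched in plain Python. The two base models below (a mean predictor and a one-feature line fit) are hypothetical stand-ins for Linear Regression, Decision Tree, and XGBoost, and the meta-learner itself is omitted; the sketch only shows that every sample is predicted by a model that never saw it during training.

```python
def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(fit, xs, ys, k=5):
    """Predict each sample with a model trained on the other k-1 folds."""
    oof = [None] * len(xs)
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        for i in held_out:
            oof[i] = model(xs[i])
    return oof


def fit_mean(train_x, train_y):
    """Hypothetical base model 1: always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


def fit_line(train_x, train_y):
    """Hypothetical base model 2: one-feature least-squares line."""
    mx = sum(train_x) / len(train_x)
    my = sum(train_y) / len(train_y)
    sxx = sum((x - mx) ** 2 for x in train_x)
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


# Toy data (hypothetical): an exactly linear relationship.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

oof_mean = out_of_fold_predictions(fit_mean, xs, ys, k=5)
oof_line = out_of_fold_predictions(fit_line, xs, ys, k=5)

# One row per sample, one column per base model: the meta-learner's training inputs.
meta_features = list(zip(oof_mean, oof_line))
```

The meta-learner (a Random Forest in this project) would then be trained on `meta_features` against `ys`. In practice the rows are usually shuffled before the folds are cut; contiguous folds are used here only to keep the sketch deterministic.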
Decision Tree
CART Regressor
🌳
Hierarchical splits on feature values using MSE as the splitting criterion. Interpretable and computationally efficient. Risk of overfitting mitigated by depth constraints and cross-validation.
R²: 0.9765 · RMSE: $13,427 · Interpretable · Fast inference
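The MSE splitting criterion can be made concrete with a small sketch that searches one feature for the threshold giving the lowest size-weighted MSE of the two child nodes. The function names and the toy data are hypothetical, and real CART implementations typically place the threshold at a midpoint between neighbouring values and search every feature.

```python
def mse(values):
    """Mean squared deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_mse) for the best split of the form x <= threshold."""
    n = len(xs)
    best_threshold, best_score = None, float("inf")
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right child is non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score


# Toy data (hypothetical): floor area in square feet, price in thousands of dollars.
areas = [800, 900, 1000, 2000, 2100, 2200]
prices = [100, 110, 105, 300, 310, 305]

threshold, score = best_split(areas, prices)  # splits between the small and large homes
```

A full tree applies the same search recursively to each child node; a depth limit stops that recursion early, which is the depth constraint mentioned above.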
Linear Regression
Multiple OLS Regression
📈
Fits a hyperplane to minimize MSE via gradient descent. Assumes linearity, independence, homoscedasticity, and normality of residuals. Struggles with the non-linear interactions in housing data.
R²: 0.8781 · RMSE: $30,581 · Baseline model · Low complexity
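Minimizing MSE by gradient descent can be sketched for the one-feature case, with the closed-form least-squares solution computed alongside as a check. The function names, learning rate, step count, and toy data are assumptions made for illustration; the project's model uses many features.

```python
def fit_gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def fit_closed_form(xs, ys):
    """Ordinary least squares for one feature: slope = Sxy / Sxx."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy data (hypothetical), roughly y = 2x + 1 with noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

w_gd, b_gd = fit_gradient_descent(xs, ys)
w_cf, b_cf = fit_closed_form(xs, ys)
```

Both routes reach the same line because the MSE surface of a linear model has a single minimum. On unscaled housing features, gradient descent would likely need standardized inputs or a much smaller learning rate to converge.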

Literature Comparison

Study                Best Model      R² Score  Key Notes
This Project         XGBoost         0.9945    Lowest RMSE, highest accuracy across all models
This Project         Ensemble Model  0.9936    Lowest MAE, strong overall generalization
Abdul-Rahman et al.  XGBoost         0.9120    Boosting models for Kuala Lumpur market
Shalini et al.       Random Forest   0.9656    Focused on preprocessing and data-driven pipeline
Z. Li                Decision Tree   0.8016    Comparative study of regression models

House Price Estimator

Fill in the property details and click Estimate Price to get a prediction. The result panel reports the estimated current market value, a 5-year projection, the model used, house age, remodel age, estimated price range, and confidence.
Model Error Bounds (Expected ±, equal to each model's RMSE)
XGBoost: ±$6,510
Ensemble: ±$7,007
Decision Tree: ±$13,427
Lin. Regression: ±$30,581
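One way to turn these bounds into a price range is to report the point estimate plus and minus the chosen model's bound. The sketch below assumes that rule; the estimator's actual range logic is not shown in this page, and the function name and example estimate are hypothetical.

```python
# Error bounds per model, taken from the list above (dollars).
MODEL_ERROR_BOUND = {
    "XGBoost": 6_510,
    "Ensemble": 7_007,
    "Decision Tree": 13_427,
    "Lin. Regression": 30_581,
}


def price_range(estimate, model):
    """Return (low, high) as estimate minus/plus the model's bound, floored at zero."""
    bound = MODEL_ERROR_BOUND[model]
    return max(0, estimate - bound), estimate + bound


# Hypothetical estimate of $200,000 from the XGBoost model.
low, high = price_range(200_000, "XGBoost")
```

Since RMSE is an average error size, a range built this way is only a rough guide: individual predictions can fall outside it.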

Feature Importance

Top Features by Predictive Impact
YearBuilt: 0.61
LogSalePrice: 0.59
TotalBsmtArea: 0.52
HouseAge: −0.41
RemodelAge: −0.36
LotArea: 0.26
TotalBsmtSF: 0.25
MSZoning: 0.17
Values represent Pearson correlation coefficient with SalePrice. Negative values indicate inverse relationship.
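For reference, the Pearson coefficient can be computed as in the sketch below. It is written in plain Python on hypothetical toy values, not on the project's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy values (hypothetical).
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [2.0, 4.0, 5.0, 4.0, 5.0]

r_positive = pearson(feature, target)                # feature rises with the target
r_negative = pearson([-x for x in feature], target)  # same strength, opposite direction
```

The coefficient lies between −1 and +1 and captures only linear association, so a feature with a low value here can still matter to a tree-based model.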
[Figure: Feature Correlation Radar]

Engineered Features

🏚️
HouseAge
Derived from YearBuilt. Older homes tend to require more maintenance, affecting price negatively. Correlation with SalePrice: −0.41
2025 − YearBuilt
🔨
RemodelAge
Years since last renovation. Recently remodeled homes command higher prices. Correlation with SalePrice: −0.36
2025 − YearRemodAdd
📐
TotalBsmtArea
Combined BsmtFinSF2 + TotalBsmtSF for total basement coverage. High correlation with TotalBsmtSF (0.94). Correlation with SalePrice: +0.52
BsmtFinSF2 + TotalBsmtSF
📊
LogSalePrice
Log-transformed SalePrice to reduce right skewness. Makes distribution more normal, improving linear model performance. Correlation: +0.59
np.log(SalePrice)
🗺️
LotAreaPerRoom
Adjusts lot size by overall condition. Despite its name, the feature divides by the OverallCond rating rather than by a room count. A large lot in poor condition has lower value than a smaller lot in excellent condition.
LotArea / (OverallCond + 1)
⚖️
Class Imbalance
MSZoning (0.78), BldgType (0.83), LotConfig (0.73) show mild imbalance. OverallCond and temporal features are well-balanced. No resampling required.
Mild: MSZoning, BldgType
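The engineered-feature formulas above can be collected into one helper, sketched below in plain Python. The column names come from the formulas; the function name, the `REFERENCE_YEAR` constant, and the example row are hypothetical.

```python
import math

REFERENCE_YEAR = 2025  # the year used in the HouseAge and RemodelAge formulas


def engineer_features(row):
    """Compute the derived features for one property, given as a dict of raw columns."""
    return {
        "HouseAge": REFERENCE_YEAR - row["YearBuilt"],
        "RemodelAge": REFERENCE_YEAR - row["YearRemodAdd"],
        "TotalBsmtArea": row["BsmtFinSF2"] + row["TotalBsmtSF"],
        "LogSalePrice": math.log(row["SalePrice"]),  # natural log, as in np.log
        "LotAreaPerRoom": row["LotArea"] / (row["OverallCond"] + 1),
    }


# Example row (hypothetical values).
example = {
    "YearBuilt": 2000,
    "YearRemodAdd": 2010,
    "BsmtFinSF2": 100.0,
    "TotalBsmtSF": 900.0,
    "SalePrice": 200_000.0,
    "LotArea": 9_000.0,
    "OverallCond": 5,
}

features = engineer_features(example)
```

Because LogSalePrice is computed from the target, it would normally serve as an alternative training target, with predictions mapped back through `math.exp`, and not as an input feature.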

6-Stage Development Pipeline

01 🎯 Business Understanding
02 🔍 Data Understanding
03 🧹 Data Preparation
04 🤖 Modelling
05 📏 Evaluation
06 🚀 Deployment
Data Preprocessing Steps
01 Missing Value Imputation: mode for categoricals (MSZoning, Exterior1st); median for numerics (BsmtFinSF2, TotalBsmtSF)
02 Duplicate Removal: checked and removed duplicate rows to prevent data leakage
03 Normalization: StandardScaler (mean=0, std=1), MinMaxScaler, and RobustScaler applied to numerical features
04 Label Encoding: categorical variables converted to numeric for ML compatibility
05 Train/Test Split: 80% training / 20% testing with 5-fold cross-validation for stacking
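The preprocessing steps above can be sketched with the Python standard library alone. This is an illustrative stand-in for the scikit-learn transformers named in the steps, not the project's code; only standard scaling is shown, and all function names and toy values are hypothetical.

```python
import random
import statistics


def impute(values, strategy):
    """Fill None entries with the median (numeric) or the mode (categorical)."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present) if strategy == "median" else statistics.mode(present)
    return [fill if v is None else v for v in values]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle reproducibly, then hold out the first test_fraction of rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy columns (hypothetical values).
zoning = impute(["RL", "RM", None, "RL"], strategy="mode")
basement = impute([800.0, None, 1000.0, 1200.0], strategy="median")
rows = drop_duplicates([[1, "RL"], [2, "RM"], [1, "RL"]])
scaled = standardize(basement)
codes, mapping = label_encode(zoning)
train, test = train_test_split(list(range(10)))
```

To keep test data from leaking into training, the fill values and scaling statistics would ideally be computed on the training split only and then applied to the test split.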
Evaluation Metrics
RMSE — Root Mean Squared Error
Penalizes large errors more heavily. Good for detecting outlier predictions. Lower = better.
MAE — Mean Absolute Error
Average absolute deviation. More actionable in real-world pricing decisions. Lower = better.
R² — Coefficient of Determination
Proportion of the variance in sale price explained by the model. XGBoost explained 99.45% of price variability on the test set. Higher = better.
MAPE — Mean Absolute Percentage Error
Error as a percentage of actual value. Undefined when actual = 0. Best for relative comparisons.
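The four metrics can be written out directly from their definitions, as in the plain-Python sketch below. The function names and the four example prices are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: squaring makes large misses dominate."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average miss in dollars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; undefined if any actual value is 0."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is 0")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Example prices (hypothetical), in dollars.
actual = [200_000.0, 150_000.0, 300_000.0, 250_000.0]
predicted = [210_000.0, 145_000.0, 290_000.0, 250_000.0]

scores = {
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "R2": r2(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

On these four prices the RMSE (7,500) is larger than the MAE (6,250) because the two $10,000 misses are squared before averaging. The same effect explains how a model can have the lowest RMSE without also having the lowest MAE.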
Challenges Encountered

Data Preprocessing — missing values, inconsistent formats, and outliers required careful handling

Feature Selection — identifying high-impact features and managing multicollinearity

Computational Resources — XGBoost and ensemble training caused memory issues on limited hardware

Overfitting — some models trained well but failed to generalize to test data

Frontend Integration — connecting ML backend with interactive web dashboard

Future Work

Data Enrichment — proximity to schools, transport, hospitals; mortgage rates; demand trends

Advanced Features — total livable area, renovation history, crime rates, air quality

Neural Networks — Deep learning for complex non-linear relationships

Web App Deployment — Flask or FastAPI backend for real-time predictions

Model Monitoring — drift detection and retraining pipeline for market adaptation