SUPERVISED LEARNING · REAL ESTATE ANALYTICS

Housing Price Estimation
Using Machine Learning

A comparative study of Linear Regression, Decision Trees, XGBoost, and Ensemble models for predicting residential property values using real-world housing data.

Maleha Israt Chowdhury · Sachi Datta · Supervisor: Weimin Huang · Memorial University

Best R² Score

99.45%

XGBoost Regressor

↑ Best Model

Lowest RMSE

$6,510

XGBoost prediction error

↓ Lowest Error

Lowest MAE

$1,202

Ensemble Model

↓ Most Consistent

Models Tested

LR · DT · XGB · Ensemble

Comparative Study

R² Score Comparison

RMSE vs MAE by Model

MAPE (Mean Absolute Percentage Error) — Lower is Better

Ensemble Model

~0.8%

Decision Tree

~1.2%

XGBoost

~1.5%

Linear Regression

~11.5%

Model Performance Table

Rank	Model	RMSE ($)	MAE ($)	R² Score	Assessment
1st	XGBoost	6,509.66	2,083.93	0.9945	Excellent
2nd	Ensemble Model	7,007.30	1,202.48	0.9936	Excellent
3rd	Decision Tree	13,426.56	1,518.85	0.9765	Good
4th	Linear Regression	30,580.89	15,701.19	0.8781	Limited

Algorithm Details

XGBoost

Extreme Gradient Boosting

🚀

Sequential ensemble method that trains models iteratively, each correcting the errors of the previous one. Features built-in L1/L2 regularization, parallel processing, and automatic missing value handling.

R²: 0.9945 · RMSE: $6,510 · Handles non-linearity · Anti-overfit
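The "each model corrects the errors of the previous one" idea can be illustrated with a minimal sketch of plain gradient boosting for squared error. This is not XGBoost itself (no regularization, no second-order terms, no missing-value handling); the one-feature stumps, the toy data, and the names `fit_stump` and `boost` are assumptions made for illustration.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squared error).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially add stumps, each trained on the current residuals."""
    base = mean(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data: one feature, price in $1000s.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [100, 110, 150, 160, 210, 220, 300, 310]

baseline_mse = mse(lambda x: mean(ys), xs, ys)  # predict the mean everywhere
boosted_mse = mse(boost(xs, ys), xs, ys)        # mean plus 50 small corrections
```

Each round shrinks the training residuals a little; the learning rate controls how much of each correction is kept.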

Ensemble Model

Stacked LR + DT + XGBoost

🧩

Stacked ensemble combining three base models (Linear Regression, Decision Tree, XGBoost) with a Random Forest meta-learner. Uses 5-fold cross-validation to generate out-of-fold predictions.

R²: 0.9936 · MAE: $1,202 · Best generalization · cv=5
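A stack of this shape can be sketched with scikit-learn's `StackingRegressor`, which builds the out-of-fold predictions internally when `cv=5` is set. The synthetic data and all hyperparameters below are assumptions, and `GradientBoostingRegressor` stands in for XGBoost so the sketch needs only scikit-learn; `xgboost.XGBRegressor` could be swapped in at that position.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Three base models; "gb" is a stand-in for XGBoost.
base_models = [
    ("lr", LinearRegression()),
    ("dt", DecisionTreeRegressor(max_depth=6, random_state=0)),
    ("gb", GradientBoostingRegressor(n_estimators=50, random_state=0)),
]

# The meta-learner is trained on 5-fold out-of-fold predictions of the base models.
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    cv=5,
)
stack.fit(X_train, y_train)
r2 = stack.score(X_test, y_test)  # R² on the held-out 20%
```

Training the meta-learner on out-of-fold predictions keeps it from seeing base-model outputs on rows those models were fitted on.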

Decision Tree

CART Regressor

🌳

Hierarchical splits on feature values using MSE as the splitting criterion. Interpretable and computationally efficient. Risk of overfitting mitigated by depth constraints and cross-validation.

R²: 0.9765 · RMSE: $13,427 · Interpretable · Fast inference
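The MSE splitting criterion can be sketched for a single feature: try every threshold and keep the one with the lowest weighted MSE of the two child nodes. The toy values and the helper names are assumptions for illustration; a real CART implementation repeats this search over all features and recurses into each child.

```python
def mse(values):
    """Mean squared deviation from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted_mse) for the best split of the form x <= threshold."""
    n = len(ys)
    best_t, best_score = None, mse(ys)  # start from the unsplit node
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical toy data: prices jump for homes built after 1975.
year_built = [1950, 1960, 1975, 1990, 2005, 2015]
prices = [120_000, 125_000, 130_000, 210_000, 220_000, 230_000]

threshold, score = best_split(year_built, prices)
```

On this toy data the search separates the three older homes from the three newer ones, since that grouping leaves the least price variance inside each child.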

Linear Regression

Multiple OLS Regression

📈

Fits a hyperplane by minimizing the sum of squared residuals (ordinary least squares). Assumes linearity, independence, homoscedasticity, and normality of residuals. Struggles with the non-linear interactions in housing data.

R²: 0.8781 · RMSE: $30,581 · Baseline model · Low complexity
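An ordinary least squares fit can be computed in closed form; gradient descent on the same MSE objective converges to the same coefficients when the design matrix has full rank. The sketch below uses NumPy's least-squares solver on synthetic, noise-free data (an assumption) so the known coefficients are recovered exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 5 + 3*x1 - 2*x2, no noise.
X = rng.uniform(0, 10, size=(50, 2))
y = 5.0 + X @ np.array([3.0, -2.0])

# Prepend a column of ones so the first coefficient is the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Solve min ||A @ beta - y||^2 directly.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
```

With real housing data the residuals would not vanish, and inspecting them is how the linearity and homoscedasticity assumptions are usually checked.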

Literature Comparison

Study	Best Model	R² Score	Key Notes
This Project	XGBoost	0.9945	Lowest RMSE, highest accuracy across all models
This Project	Ensemble Model	0.9936	Lowest MAE, strong overall generalization
Abdul-Rahman et al.	XGBoost	0.9120	Boosting models for Kuala Lumpur market
Shalini et al.	Random Forest	0.9656	Focused on preprocessing and data-driven pipeline
Z. Li	Decision Tree	0.8016	Comparative study of regression models

House Price Estimator

Enter Property Details

Zoning Classification

Building Type

Lot Area (sq ft)

Year Built

Total Basement Area (sq ft)

Overall Condition (1–10)

Year Last Remodeled

Model to Use

Estimation Result

Estimated current market value

5-YEAR PROJECTION · MODEL USED · HOUSE AGE · REMODEL AGE · EST. PRICE RANGE · CONFIDENCE

Fill in the property details and click Estimate Price to get a prediction.

Model Error Bounds (Expected ±, from test RMSE)

XGBoost

±$6,510

Ensemble

±$7,007

Decision Tree

±$13,427

Lin. Regression

±$30,581
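One plausible way to turn these figures into an estimated price range is to add and subtract the selected model's RMSE from the point prediction. How the estimator actually derives its range is not stated here, so `price_range` and the example prediction are assumptions; note also that RMSE describes a typical error size, not a guaranteed bound.

```python
# Test-set RMSE per model in USD, rounded as listed above.
RMSE = {
    "XGBoost": 6_510,
    "Ensemble": 7_007,
    "Decision Tree": 13_427,
    "Lin. Regression": 30_581,
}

def price_range(prediction, model):
    """Return (low, high) as prediction ± RMSE, with the low end floored at 0."""
    err = RMSE[model]
    return max(0, prediction - err), prediction + err

# Hypothetical point prediction of $200,000 from the XGBoost model.
low, high = price_range(200_000, "XGBoost")
```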

Feature Importance

Top Features by Predictive Impact

YearBuilt

0.61

LogSalePrice

0.59

TotalBsmtArea

0.52

HouseAge

-0.41

RemodelAge

-0.36

LotArea

0.26

TotalBsmtSF

0.25

MSZoning

0.17

Values are Pearson correlation coefficients with SalePrice; negative values indicate an inverse relationship.
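For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. A minimal sketch on hypothetical toy values (not the project's data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: older homes selling for less.
house_age = [5, 10, 20, 40, 60]
sale_price = [300_000, 280_000, 250_000, 200_000, 180_000]

r = pearson(house_age, sale_price)  # negative: price falls as age rises
```

The coefficient only measures linear association, so a feature with a low value here can still matter to a tree-based model.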

Feature Correlation Radar

Engineered Features

🏚️

HouseAge

Derived from YearBuilt. Older homes tend to require more maintenance, affecting price negatively. Correlation with SalePrice: −0.41

2025 − YearBuilt

🔨

RemodelAge

Years since last renovation. Recently remodeled homes command higher prices. Correlation with SalePrice: −0.36

2025 − YearRemodAdd

📐

TotalBsmtArea

Combined BsmtFinSF2 + TotalBsmtSF for total basement coverage. High correlation with TotalBsmtSF (0.94). Correlation with SalePrice: +0.52

BsmtFinSF2 + TotalBsmtSF

📊

LogSalePrice

Log-transformed SalePrice to reduce right skewness. Brings the distribution closer to normal, which improves linear model performance. Correlation: +0.59

np.log(SalePrice)

🗺️

LotAreaPerRoom

Scales lot area by overall condition; despite the name, the divisor is OverallCond rather than a room count. The intent is that a large lot in poor condition is valued lower than a smaller lot in excellent condition.

LotArea / (OverallCond + 1)
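The formulas on these cards can be collected into one helper. This is a sketch that assumes each property is a plain dict keyed by the column names used above; `engineer_features` and the example row are hypothetical, and `math.log` is used in place of `np.log` because the input is a single number.

```python
import math

REFERENCE_YEAR = 2025  # the year used in the HouseAge and RemodelAge formulas

def engineer_features(row):
    """Return a copy of row with the engineered features added."""
    out = dict(row)
    out["HouseAge"] = REFERENCE_YEAR - row["YearBuilt"]
    out["RemodelAge"] = REFERENCE_YEAR - row["YearRemodAdd"]
    out["TotalBsmtArea"] = row["BsmtFinSF2"] + row["TotalBsmtSF"]
    out["LotAreaPerRoom"] = row["LotArea"] / (row["OverallCond"] + 1)
    # LogSalePrice is a transform of the target, so it can only be
    # computed for rows where the sale price is already known.
    if "SalePrice" in row:
        out["LogSalePrice"] = math.log(row["SalePrice"])
    return out

# Hypothetical example row.
example = {
    "YearBuilt": 1995,
    "YearRemodAdd": 2010,
    "BsmtFinSF2": 100,
    "TotalBsmtSF": 900,
    "LotArea": 9000,
    "OverallCond": 5,
    "SalePrice": 200_000,
}
features = engineer_features(example)
```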

⚖️

Class Imbalance

MSZoning (0.78), BldgType (0.83), LotConfig (0.73) show mild imbalance. OverallCond and temporal features are well-balanced. No resampling required.

Mild: MSZoning, BldgType
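How the imbalance figures were computed is not stated. One common reading, assumed here, is the share of rows that fall in the most frequent category. A sketch with hypothetical category counts:

```python
from collections import Counter

def dominant_share(values):
    """Fraction of rows belonging to the most frequent category."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)

# Hypothetical zoning column: 100 rows, 78 of them in one class.
zoning = ["RL"] * 78 + ["RM"] * 15 + ["FV"] * 5 + ["RH"] * 2

share = dominant_share(zoning)
```

Under this reading, a share near 1.0 would mean the feature is almost constant and carries little signal.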

6-Stage Development Pipeline

01 🎯 Business Understanding

→

02 🔍 Data Understanding

→

03 🧹 Data Preparation

→

04 🤖 Modelling

→

05 📏 Evaluation

→

06 🚀 Deployment

Data Preprocessing Steps

Missing Value Imputation

Mode for categoricals (MSZoning, Exterior1st); Median for numerics (BsmtFinSF2, TotalBsmtSF)

Duplicate Removal

Checked and removed duplicate rows to prevent data leakage

Normalization

StandardScaler (mean=0, std=1), MinMaxScaler, and RobustScaler applied to numerical features

Label Encoding

Categorical variables converted to numeric for ML compatibility

Train/Test Split

80% training / 20% testing with 5-fold cross-validation for stacking
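The preprocessing steps above can be sketched end to end with the standard library. Column values, helper names, and the seed are assumptions for illustration; only standard scaling is shown, and its statistics are computed on the training split alone so the test rows do not influence them.

```python
import random
from statistics import mean, median, mode, pstdev

def impute(column, numeric):
    """Fill None with the median (numeric) or the mode (categorical)."""
    present = [v for v in column if v is not None]
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in column]

def label_encode(column):
    """Map each category to an integer code (alphabetical order)."""
    codes = {label: i for i, label in enumerate(sorted(set(column)))}
    return [codes[v] for v in column]

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Hypothetical raw columns with gaps and one duplicate row.
zoning = ["RL", None, "RM", "RL", "RL", "RL"]
bsmt = [800, 1200, None, 1000, 600, 800]

zoning_filled = impute(zoning, numeric=False)  # None -> "RL" (mode)
bsmt_filled = impute(bsmt, numeric=True)       # None -> 800 (median)

# Encode categories, then drop exact duplicate rows while keeping order.
rows = list(dict.fromkeys(zip(label_encode(zoning_filled), bsmt_filled)))

train, test = train_test_split(rows)           # 80% / 20%

# Standard scaling (mean 0, std 1) with statistics from the training split.
train_bsmt = [b for _, b in train]
mu, sigma = mean(train_bsmt), pstdev(train_bsmt)
scaled_train_bsmt = [(b - mu) / sigma for b in train_bsmt]
scaled_test_bsmt = [(b - mu) / sigma for _, b in test]
```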

Evaluation Metrics

RMSE — Root Mean Squared Error

Penalizes large errors more heavily. Good for detecting outlier predictions. Lower = better.

MAE — Mean Absolute Error

Average absolute deviation. More actionable in real-world pricing decisions. Lower = better.

R² — Coefficient of Determination

Variance explained by the model. XGBoost explained 99.45% of price variability. Higher = better.

MAPE — Mean Absolute Percentage Error

Error as a percentage of actual value. Undefined when actual = 0. Best for relative comparisons.
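The four metrics can be written out directly from their definitions. The sketch below uses hypothetical prices and predictions, not the project's results.

```python
from math import sqrt

def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when an actual value is 0")
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values: every prediction is off by exactly $10,000.
y_true = [200_000, 150_000, 300_000, 250_000]
y_pred = [210_000, 140_000, 290_000, 260_000]

scores = {
    "RMSE": rmse(y_true, y_pred),  # 10000.0
    "MAE": mae(y_true, y_pred),    # 10000.0
    "R2": r2(y_true, y_pred),      # about 0.968
    "MAPE": mape(y_true, y_pred),  # about 4.75 (percent)
}
```

RMSE and MAE coincide here because all errors have the same size; a single large miss would raise RMSE more than MAE.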

Challenges Encountered

• Data Preprocessing — missing values, inconsistent formats, and outliers required careful handling

• Feature Selection — identifying high-impact features and managing multicollinearity

• Computational Resources — XGBoost and ensemble training caused memory issues on limited hardware

• Overfitting — some models trained well but failed to generalize to test data

• Frontend Integration — connecting ML backend with interactive web dashboard

Future Work

• Data Enrichment — proximity to schools, transport, hospitals; mortgage rates; demand trends

• Advanced Features — total livable area, renovation history, crime rates, air quality

• Neural Networks — Deep learning for complex non-linear relationships

• Web App Deployment — Flask or FastAPI backend for real-time predictions

• Model Monitoring — drift detection and retraining pipeline for market adaptation
