Comparative Analysis of Texas Housing Markets (2010 - 2015)
by: Erin Novoa
You can also view the PDF of this project on my GitHub.
Objective
This comparative analysis examined whether median housing prices differ significantly among the four largest Texas cities (Austin, Dallas, Houston, and San Antonio), using the txhousing dataset from R’s ggplot2 package. The study covered January 2010 through July 2015, the last month available in the dataset, to capture market trends over roughly five and a half years.
Methodology
Statistical Model: A one-way Analysis of Variance (ANOVA) was used to test the null hypothesis that the mean of the monthly median housing prices is equal across all four cities.
Post-Hoc Analysis: Following a significant ANOVA result, a Tukey Honestly Significant Difference (HSD) test was performed to identify specific pairwise differences between cities.
Data Integrity: The analysis used a balanced sample (n = 67 monthly observations per city), screened the data for outliers, and checked the normality of the residuals with a Q-Q plot.
Key Findings
Significant Market Variance: The ANOVA confirmed that city location has a statistically significant impact on housing prices (F(3, 264) = 61.1, p < .001), rejecting the null hypothesis that these markets perform identically.
Austin as the Market Leader: Austin is statistically the most expensive city in the group. Pairwise comparisons show it maintains a significantly higher median price point than Dallas, Houston, and San Antonio.
San Antonio as the Most Accessible: San Antonio represents the most affordable major market, with median prices significantly lower than those in Austin and Dallas.
Mid-Tier Markets: Dallas and Houston both fall between the extremes of Austin and San Antonio. Dallas's mean price is about $8,700 higher than Houston's, but that difference is not statistically significant (adjusted p = 0.12).
Set up data in R
library(ggplot2)
library(dplyr)
1. Exploratory Data Analysis
head(txhousing)
## # A tibble: 6 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
We focus on the records in the txhousing dataset from 2010 onward (the data end in July 2015) where the city column equals Houston, Dallas, Austin, or San Antonio.
# Keep records from 2010 onward (the dataset ends in 2015)
txhousing.after.2010 <- txhousing[txhousing$year >= 2010, ]
# Keep only the four major cities
txhousing.major.cities <- txhousing.after.2010[txhousing.after.2010$city %in% c('Houston', 'Dallas', 'Austin', 'San Antonio'), ]
txhousing.major.cities
## # A tibble: 268 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austin 2010 1 985 230799130 175300 9931 5.7 2010
## 2 Austin 2010 2 1249 293085886 182600 10825 6.2 2010.
## 3 Austin 2010 3 1987 461231875 179400 11846 6.7 2010.
## 4 Austin 2010 4 2230 513122847 185100 12270 6.7 2010.
## 5 Austin 2010 5 2286 543258166 188600 12786 6.9 2010.
## 6 Austin 2010 6 2190 584250558 200500 13353 7.2 2010.
## 7 Austin 2010 7 1640 453983948 213600 13336 7.4 2010.
## 8 Austin 2010 8 1675 432655897 195200 12575 7.1 2011.
## 9 Austin 2010 9 1406 342118926 190500 11834 6.8 2011.
## 10 Austin 2010 10 1336 342862397 193300 11003 6.5 2011.
## # ℹ 258 more rows
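Since dplyr is already loaded, the same subset can be sketched as a single filter() call. This is only an alternative phrasing of the step above, not a change to the analysis; the names major_cities and txhousing_major are hypothetical and do not appear elsewhere in this report.

```r
library(ggplot2)  # provides the txhousing dataset
library(dplyr)

# Hypothetical helper vector naming the four cities of interest
major_cities <- c("Houston", "Dallas", "Austin", "San Antonio")

# Keep records from 2010 onward for the four cities in one step
txhousing_major <- txhousing %>%
  filter(year >= 2010, city %in% major_cities)

nrow(txhousing_major)  # 268 rows, matching the tibble printed above
```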
Let’s check the sample size across the 4 groups (cities) to make sure the design is balanced, meaning each city has the same number of observations.
# Check how many data points are in each city
table(txhousing.major.cities$city)
##
## Austin Dallas Houston San Antonio
## 67 67 67 67
We can also assume that the groups are independent of each other. For example, a housing price in Austin is independent of a housing price in Dallas.
Using the boxplot below, we will check the distribution of our data to see if we have any outliers. Outliers can be problematic in an ANOVA test because the test relies on the mean and standard deviation, which are both highly sensitive to extreme values.
boxplot(txhousing.major.cities$median, horizontal = TRUE)
txhousing.major.cities[txhousing.major.cities$median > 260000, ]
## # A tibble: 4 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austin 2015 4 2801 931744729 270300 6560 2.5 2015.
## 2 Austin 2015 5 2999 1026501450 271200 7009 2.7 2015.
## 3 Austin 2015 6 3301 1086689918 270200 7419 2.8 2015.
## 4 Austin 2015 7 3466 1150381553 264600 7913 3 2016.
We do see some high outliers. However, on closer inspection there are only 4 such median price values, all from Austin in mid-2015, and they are not far above the rest of the data, so we can continue with the ANOVA test.
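The boxplot above pools all four cities, so a point can look extreme only because Austin is more expensive overall. As a minimal sketch, assuming the same subset as above, outliers can also be counted within each city using the usual boxplot rule. The names cities and outliers_by_city are hypothetical.

```r
library(ggplot2)  # provides the txhousing dataset

# Rebuild the four-city subset so this block runs on its own
cities <- subset(txhousing, year >= 2010 &
                   city %in% c("Houston", "Dallas", "Austin", "San Antonio"))

# Count the points that boxplot.stats() flags in each city
# (beyond 1.5 times the box length from either end of the box)
outliers_by_city <- tapply(cities$median, cities$city,
                           function(x) length(boxplot.stats(x)$out))
outliers_by_city

# One box per city instead of a single pooled box
boxplot(median ~ city, data = cities, horizontal = TRUE, las = 1)
```

Because ANOVA compares each observation with its own group mean, the per-city view is the more relevant one for judging whether extreme values could distort the test.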
2. Formal Hypotheses
H0: µHouston = µDallas = µAustin = µSan Antonio (the mean price is the same in all four cities)
Ha: the mean price in at least one city differs from the others
3. Performing One-Way ANOVA
We conduct a one-way ANOVA to examine whether the mean of the monthly median home price (median column) differs across cities (city column).
anova.model <- aov(median ~ city, data = txhousing.major.cities)
summary(anova.model)
## Df Sum Sq Mean Sq F value Pr(>F)
## city 3 9.359e+10 3.120e+10 61.1 <2e-16 ***
## Residuals 264 1.348e+11 5.106e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is <2e-16, which is effectively 0. Since it is less than 0.05, we can reject the null hypothesis. Therefore, there is a statistically significant difference in the mean housing prices between at least two of the four cities.
Also, the F-value is 61.1, which is quite high. It is the ratio of the between-city mean square to the within-city (residual) mean square, so the variation between the cities is about 61 times the variation within them.
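As a quick check on how the F-value arises, the sketch below recomputes it from the sums of squares and degrees of freedom in the ANOVA table. The inputs are the rounded values as printed, so the result only agrees to the displayed precision; the variable names are hypothetical.

```r
# Values copied from the summary(anova.model) table (rounded as printed)
ss_between <- 9.359e10  # Sum Sq for city
df_between <- 3
ss_within  <- 1.348e11  # Sum Sq for Residuals
df_within  <- 264

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of the two mean squares
f_value <- ms_between / ms_within
f_value  # about 61.1

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
p_value
```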
4. ANOVA Model Quality Check
We will use the Q-Q plot (Quantile-Quantile plot) to compare the distribution of our residuals against a perfect theoretical normal distribution. This will allow us to do a quality control check for our ANOVA model.
# look at the Q-Q plot to ensure normality
plot(anova.model, which = 2)
The points fall roughly along a straight diagonal line, which indicates that the residuals are approximately normally distributed, a core assumption of ANOVA. We do see some deviation from the line at the ends, which is likely due to the outliers noted in the initial Exploratory Data Analysis. Overall, the majority of residuals on this Q-Q plot follow the diagonal, so the normality assumption is sufficiently met and we can trust our p-value.
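A formal test can supplement the visual check. The sketch below is not part of the original analysis: it applies the Shapiro-Wilk test to the residuals and Bartlett's test to the per-city variances, since equal variance across groups is the other core ANOVA assumption and is not tested above. The names cities and fit are hypothetical, and no results are claimed here.

```r
library(ggplot2)  # provides the txhousing dataset

# Rebuild the four-city subset and refit the model so this block runs on its own
cities <- subset(txhousing, year >= 2010 &
                   city %in% c("Houston", "Dallas", "Austin", "San Antonio"))
cities$city <- factor(cities$city)
fit <- aov(median ~ city, data = cities)

# Shapiro-Wilk: the null hypothesis is that the residuals are normally distributed
shapiro_result <- shapiro.test(residuals(fit))
shapiro_result$p.value

# Bartlett: the null hypothesis is that all cities share the same variance
bartlett_result <- bartlett.test(median ~ city, data = cities)
bartlett_result$p.value
```

A small p-value from either test would argue for caution. With a balanced design such as this one, ANOVA is generally considered fairly robust to moderate departures from both assumptions.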
5. Post-Hoc Tests with Tukey HSD
Since we know that at least 2 of the 4 cities differ, the next step is to use TukeyHSD() to identify which specific pairs of cities are different.
tukey.results <- TukeyHSD(anova.model)
tukey.results
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = median ~ city, data = txhousing.major.cities)
##
## $city
## diff lwr upr p adj
## Dallas-Austin -32349.254 -42442.65 -22255.861 0.0000000
## Houston-Austin -41007.463 -51100.86 -30914.070 0.0000000
## San Antonio-Austin -49285.075 -59378.47 -39191.681 0.0000000
## Houston-Dallas -8658.209 -18751.60 1435.184 0.1210571
## San Antonio-Dallas -16935.821 -27029.21 -6842.428 0.0001199
## San Antonio-Houston -8277.612 -18371.01 1815.781 0.1493015
par(mar = c(5, 6, 4, 2) + 3.0)
plot(tukey.results, las = 1, col = "blue")
The vertical axis of this plot lists the city pairs (e.g., Dallas-Austin), and the horizontal axis shows the difference in mean price between the two cities, with each pair's 95% family-wise confidence interval drawn as a blue line. If the interval for a pair crosses the vertical dashed line at 0, there is no statistically significant difference between the two cities. If the entire interval lies to the left of 0, the first city in the pair is significantly cheaper than the second. If the entire interval lies to the right of 0, the first city in the pair is significantly more expensive than the second.
Let’s discuss each city pair:
Dallas - Austin: Dallas is significantly cheaper than Austin.
Houston - Austin: Houston is significantly cheaper than Austin.
San Antonio - Austin: San Antonio is significantly cheaper than Austin.
Houston - Dallas: There is no statistically significant difference between the two cities (adjusted p = 0.12).
San Antonio - Dallas: San Antonio is significantly cheaper than Dallas.
San Antonio - Houston: There is no statistically significant difference between the two cities (adjusted p = 0.15).
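To see where the intervals come from, the sketch below reconstructs the half-width of a Tukey interval from the residual mean square in the ANOVA table. Because the design is balanced, every pair has the same half-width. The inputs are the rounded values as printed, and the variable names are hypothetical.

```r
# Values taken from the ANOVA table and the design
mse      <- 5.106e8  # residual mean square (Mean Sq for Residuals)
df_resid <- 264      # residual degrees of freedom
k        <- 4        # number of cities
n        <- 67       # observations per city

# Critical value of the studentized range distribution at 95%
q_crit <- qtukey(0.95, nmeans = k, df = df_resid)

# Half-width of each pairwise interval in a balanced design
half_width <- q_crit * sqrt(mse / n)
half_width  # roughly 10,093, the half-width of each interval in the table

# A pair differs significantly when |diff| exceeds the half-width
diffs <- c("Houston-Dallas" = -8658.209, "San Antonio-Dallas" = -16935.821)
significant <- abs(diffs) > half_width
significant
```

This makes the pattern in the list above concrete: two city means must differ by about $10,100 or more for the interval to exclude 0, which the Houston-Dallas and San Antonio-Houston gaps do not reach.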
6. Conclusion
In the Tukey HSD plot, the entire confidence interval for every pair involving Austin lies to the left of the 0 line, so we can say with high confidence that, in the txhousing data from 2010 to 2015, Austin is the most expensive city in this group. San Antonio has the lowest mean price and is significantly cheaper than Austin and Dallas, although its difference from Houston is not statistically significant. Dallas and Houston fall in between and do not differ significantly from each other.
We can visualize this conclusion with a set of boxplots.
ggplot(txhousing.major.cities, aes(x = city, y = median, fill = city)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Median Housing Prices by City (2010 - 2015)",
    x = "City",
    y = "Median House Price ($)"
  ) +
  theme_bw() +
  scale_fill_brewer(palette = "Dark2")
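The ordering shown in the boxplots can also be read from the per-city means that the ANOVA and Tukey comparisons are built on. The following is a minimal sketch; the names cities and city_means are hypothetical.

```r
library(ggplot2)  # provides the txhousing dataset

# Rebuild the four-city subset so this block runs on its own
cities <- subset(txhousing, year >= 2010 &
                   city %in% c("Houston", "Dallas", "Austin", "San Antonio"))

# Mean of the monthly median price per city, most expensive first
city_means <- sort(tapply(cities$median, cities$city, mean), decreasing = TRUE)
round(city_means)

# Differences between these means reproduce the Tukey "diff" column,
# e.g. Dallas minus Austin
city_means[["Dallas"]] - city_means[["Austin"]]
```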
