Neural Networks For AirBnB
The goal of this project was to build the neural network model with the strongest predictive power for prices of AirBnB listings in Washington D.C. Two neural network models were trained using the neuralnet and caret packages, then evaluated and compared. The first model (NN1) performed better, with lower MSE and RMSE values and higher correlation and R^2 values. The second model (NN2) had a slightly lower mean absolute error, but the difference between the two models' mean deviations of predicted price from actual price was tiny. Both models were found to be reasonably accurate in predicting prices, but NN1 was recommended for use because it fit the true values better.
Background - Problem Definition
AirBnB is a Software-as-a-Service (SaaS) company based in San Francisco, California that provides short-term home or apartment rentals through its mobile and web applications. Our team was tasked with analyzing a sample of AirBnB's data covering rental locations within the District of Columbia. The data was collected in September of 2022 and includes listings from neighborhoods within D.C., such as Capitol Hill and Georgetown, along with a number of variables that may influence price, such as available room types, total reviews, and the rating of each rental.
Our goal was to build the neural network model with the highest predictive power for the price of AirBnB listings in D.C. We trained two models using the neuralnet and caret packages, plus a third model using the nnet package that was not part of the requirement (NN3). We evaluated the models, compared the results, and chose the best model to recommend for predicting price.
Methods & Diagnostics
Preprocessing & Feature Selection
In order to predict AirBnB prices in Washington D.C. using neural network models, pre-processing the data was first necessary. Initially, the character/string variables, including "neighborhood", "room_type", and "host_since", were converted into factors to make them suitable for analysis. To assess the numerical variables for multicollinearity, a correlation heatmap was generated (as illustrated in Figure 1 in the Appendix). This examination led to the removal of the variable "beds", because its variance was largely explained by "accommodates" and it had a lower correlation with "price" than "accommodates" did.
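In R, the factor conversion and heatmap steps described above might look like the following sketch (the dataset name `airbnb` and the use of base R's `heatmap` are assumptions for illustration, not the team's actual script):

```r
# Convert character columns to factors (column names taken from the text)
airbnb$neighborhood <- as.factor(airbnb$neighborhood)
airbnb$room_type    <- as.factor(airbnb$room_type)
airbnb$host_since   <- as.factor(airbnb$host_since)

# Correlation heatmap for the numeric variables (Figure 1)
num_vars <- airbnb[sapply(airbnb, is.numeric)]
cor_mat  <- cor(num_vars, use = "pairwise.complete.obs")
heatmap(cor_mat, symm = TRUE)  # base-R heatmap; corrplot or ggcorrplot are common alternatives
```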
Subsequently, dummy variables were generated for the categorical variables of interest, such as "superhost", "neighborhood", and "room_type", using the "FastDummies" package in R. This resulted in the creation of new columns for each option within a categorical variable and the removal of one redundant column for each of the three respective variables. The original variables "superhost", "neighborhood", and "room_type", were also removed as they were no longer necessary due to the presence of the dummy variables. Additionally, "listing_id" (an indexing value) was removed, along with "room_type_NA" (as there was no interest in a dummy variable representing NA data). The name of "room_type_EntireHome/apt" was also changed to "Entire" to avoid errors during the creation of neural networks.
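The dummy-coding and cleanup steps above could be sketched with fastDummies as follows (again assuming the dataset is named `airbnb`):

```r
library(fastDummies)

# One-hot encode the categorical predictors, dropping the first level of each
# to avoid redundancy, and removing the original columns
airbnb <- dummy_cols(airbnb,
                     select_columns = c("superhost", "neighborhood", "room_type"),
                     remove_first_dummy = TRUE,
                     remove_selected_columns = TRUE)

# Drop the index column and the unwanted NA dummy, then rename the
# awkward column so the neural network formula parses cleanly
airbnb$listing_id   <- NULL
airbnb$room_type_NA <- NULL
names(airbnb)[names(airbnb) == "room_type_EntireHome/apt"] <- "Entire"
```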
For reproducibility, the random seed was fixed with set.seed(123) before the data was split 70/30 into training and test sets. Missing values in the train/test data sets were addressed by replacing them with the mean (for numeric variables) or mode (for categorical variables). The mean was used for "host_acceptance_rate", "total_reviews", and "avg_rating", and the mode was used for "entire" and "room_type_PrivateRoom". After confirming the absence of any NAs in the training and test data sets, min-max normalization was performed on all variables, including "price" (our variable of interest). This normalization method scales the values of each variable to a range of 0 to 1, where 0 represents the lowest value of the variable in the dataset and 1 represents the highest. With these steps completed, the data was ready for analysis.
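A minimal sketch of the seed, split, imputation, and min-max steps (column and object names are assumptions based on the description):

```r
set.seed(123)  # reproducibility
idx   <- sample(nrow(airbnb), size = 0.7 * nrow(airbnb))
train <- airbnb[idx, ]
test  <- airbnb[-idx, ]

# Mean imputation for a numeric column (repeated for each variable listed above)
train$avg_rating[is.na(train$avg_rating)] <- mean(train$avg_rating, na.rm = TRUE)

# Min-max normalization to [0, 1], applied to every column including price
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
train  <- as.data.frame(lapply(train, minmax))
test   <- as.data.frame(lapply(test, minmax))
```

Note that normalizing "price" as well means all reported error metrics (MSE, RMSE, mean difference) are on the 0-to-1 normalized scale, not in dollars.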
Model Building: Neural Network 1 (NN1)
Our first neural network regression model used a single hidden layer with two neurons. The model was trained on the 70% training split with all the remaining features in our final, cleaned dataset. Linear output was set to TRUE, indicating that the model was designed for regression, not classification. A logistic activation function was used for NN1, and the resulting network was visualized in Figure 2. Overall, the performance of NN1 was promising.
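With the neuralnet package, the NN1 specification described above might look like this (the exact formula is an assumption; recent versions of neuralnet accept the dot shorthand, while older ones require the predictors spelled out):

```r
library(neuralnet)

# NN1: one hidden layer with two neurons; linear.output = TRUE for regression.
# The logistic activation is neuralnet's default, written explicitly here.
nn1 <- neuralnet(price ~ .,
                 data = train,
                 hidden = 2,
                 act.fct = "logistic",
                 linear.output = TRUE)
plot(nn1)  # network diagram (Figure 2)
```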
For the summary statistics of all our models, see Figure 3. Of note, the Root Mean Squared Error (RMSE; lower is better) was 0.086129. RMSE is the square root of the MSE (the average of the squared differences between the predicted and actual values of the response variable) and provides an interpretable measure of the model's accuracy. The correlation between the predicted and actual values of the response variable was 0.700114, indicating a positive relationship between the two. The R-squared value was 0.490159, which is the proportion of variance in the response variable explained by the predictors; our model therefore explained 49% of the variance in price. Finally, the mean difference between the predicted and actual values of the response variable was 0.001504078. Overall, these results suggest that NN1 has moderate accuracy in predicting price from the predictors.
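The evaluation metrics above can be computed from the test-set predictions along these lines (a sketch, assuming the train/test objects from the preprocessing step):

```r
# Predicted prices on the held-out test set
pred <- compute(nn1, test[, setdiff(names(test), "price")])$net.result

mse  <- mean((pred - test$price)^2)   # mean squared error
rmse <- sqrt(mse)                     # root mean squared error
r    <- cor(pred, test$price)         # correlation of predicted vs. actual
r2   <- r^2                           # R-squared
mean_diff <- mean(pred - test$price)  # average bias of the predictions
```

As a consistency check, the reported correlation and R^2 agree: 0.700114^2 is approximately 0.490159.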
Model Building: Neural Network 2 (NN2)
For NN2, we used 5 neurons in the first hidden layer and 3 neurons in the second hidden layer; see Figure 4. A logistic activation function was again used, and the maximum number of steps (training iterations) was set to 10^6. NN2 was trained on the same training dataset, and its performance was assessed in the same way as NN1. The root mean squared error (RMSE), correlation, and R^2 between the predicted and actual values were 0.139693, 0.477215, and 0.227734, respectively. The difference between the predicted and actual values had a mean of 0.006998291. Like NN1, NN2 shows moderate accuracy.
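The NN2 specification differs from NN1 only in its architecture and iteration cap, which might be expressed as (same assumptions as before):

```r
# NN2: two hidden layers (5 and 3 neurons) and a raised step limit so
# training can converge on the deeper network
nn2 <- neuralnet(price ~ .,
                 data = train,
                 hidden = c(5, 3),
                 act.fct = "logistic",
                 stepmax = 1e6,
                 linear.output = TRUE)
plot(nn2)  # network diagram (Figure 4)
```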
Model Evaluation & Comparison: Neural Network 1 and 2
Comparing the two models, NN1 performed better in terms of RMSE, correlation, and R^2; overall, NN1 fit the true values better. However, NN2 had a slightly lower mean absolute error, meaning its individual prediction errors were, on average, smaller in magnitude. The mean differences between predicted and actual prices were almost identical for NN1 and NN2, at 0.001504078 and 0.001534771, respectively. This strong similarity suggests that both models had comparable accuracy in predicting price: on average, each model's predictions were close to the actual prices.
In summary, both models performed reasonably well in predicting the prices, but the first model performed better overall, with a lower RMSE and higher correlation and R^2 values. Moving forward, we suggest using NN1 for predicting price.
Limitations & Model 3: Neural Network 3 (NN3) – NNET Package
We attempted to build one more model using the nnet package with everything else held the same. NN3 showed some promise with 3 hidden neurons (nnet fits a single hidden layer). The performance of the model was not all that different from our other models: the R^2 value was 0.42815, meaning the model explained 42.8% of the variation in price. The most notable limitation, and the one we believe would most improve the performance of our models, is the failure to remove outliers. We reached this conclusion late, after preprocessing, building, and testing multiple models with different packages. We recommend that the next modeling attempt include the examination and elimination of outliers.
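An nnet-based sketch of NN3, under the same assumptions as the earlier code (the maxit value here is illustrative, not the team's actual setting):

```r
library(nnet)

# NN3: nnet fits a single hidden layer; size = 3 gives three hidden neurons.
# linout = TRUE requests a linear output unit, i.e. regression rather than
# classification.
nn3   <- nnet(price ~ ., data = train, size = 3, linout = TRUE, maxit = 10000)
pred3 <- predict(nn3, test)
r2_3  <- cor(pred3, test$price)^2  # reported as 0.42815
```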
Model Comparison From Programming Final Using Multiple Regression
Looking back to compare our current neural network models with our multiple regression models from the prior term yielded interesting results. Our two multiple regression models performed better than the simple linear regression models in terms of overall fit. Both of our Programming Final models had a similar R-squared value of 0.289, meaning that average ratings and total review counts together explained about 29% of the variation in price. At the time we thought this was notable, but it now suggests that the impact of ratings on price was weaker than we initially thought, and that beds and neighborhoods had a greater impact on price.
However, when we compare these prior results to our current models, we can see that our neural network regression models are superior for prediction. NN1 boasts a higher R-squared value of 0.490159, meaning it explains a greater proportion (49%) of the variance in the response variable, notably higher than the 29% from the Programming Final model. Additionally, NN1 has lower MSE, RMSE, and MAE, and a higher correlation between predicted and actual values, signifying a better fit and greater accuracy in predicting price. The results clearly point to the neural network regression model as the more reliable choice for prediction, given its higher R-squared value and improved performance metrics compared to our Programming Final model.
In conclusion, after evaluating the performance of the two models in predicting the price of an AirBnB listing, we found that both performed reasonably well. NN1 showed better results in terms of MSE, RMSE, correlation, and R^2, indicating a better fit to the true values. Given how little the mean differences of the two models differ, both models can be considered similarly accurate in predicting price. In light of these results, we recommend NN1 for its better overall performance.
Figure 1: Correlation heatmap for numeric variables in the AirBnB dataset
Figure 2: Visualization Plot of NN1
Figure 3: Summary Statistics for our models
Figure 4: Visualization Plot of NN2