How to choose the best Airbnb in Seattle City using Machine Learning?

7 min read · Sep 1, 2022

How to choose the best Airbnb in Seattle city? **SPOILER!!** Yurts are the best!

Planning your next trip can be exciting because of the experience ahead, but also frustrating when you don't know what to look for.

In this article, we will dive deep into the features you should pay attention to when looking for a place to stay, and into how hosts can improve their places to earn better review scores.

The dataset contains the Airbnb listings for the city of Seattle, WA. The data were collected in April 2016 and cover both the host (such as verifications and response rate) and the place (such as amenities, prices, location, and property information).

Based on this, our research questions are:

Do host-related features matter for the experience and the review score? (e.g. being a super host, verifications, etc.)
Which region has the highest and lowest average daily price, number of places available, and review score?
Which home features matter most for a good experience? (e.g. amenities, property type, room type, bed type, etc.)

With these answers, hosts can improve their places by knowing exactly what to focus on to earn better reviews. At the same time, guests can use the information here to filter for the best place and have a better experience.

Finally, we will fit an Elastic Net regression model to predict the score and interpret each of its coefficients.

Diving Deep into the Dataset

As the target, we use the average review score multiplied by the average number of reviews per month. This value was then normalized to the range 0 to 10 to make it easier to interpret.
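To make the target concrete, here is a minimal sketch of how it could be computed with pandas. The column names `review_scores_rating` and `reviews_per_month` are assumptions about the dataset, and min-max scaling is just one plausible reading of "normalized"; the toy rows are invented for illustration.

```python
import pandas as pd

def build_target(listings: pd.DataFrame) -> pd.Series:
    """Average rating times reviews per month, min-max scaled to the 0-10 range."""
    # Assumed column names; adjust them to match the actual dataset.
    raw = listings["review_scores_rating"] * listings["reviews_per_month"]
    # Min-max scaling: the lowest raw value maps to 0 and the highest to 10.
    return 10 * (raw - raw.min()) / (raw.max() - raw.min())

# Toy rows, invented purely for illustration.
toy = pd.DataFrame({
    "review_scores_rating": [80.0, 95.0, 100.0],
    "reviews_per_month": [0.5, 2.0, 4.0],
})
score = build_target(toy)
```

Multiplying the two columns rewards places that are both well rated and frequently reviewed, so a listing with a perfect rating but very few reviews still ends up with a low score.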

**Histogram** of the **new review score**, with an average score of 1.42 and a **median of 0.88.**

Host

Host response time: As shown below, more than 80% of hosts take a few hours on average to respond to guests. Hosts who respond in less than an hour have a better average score (2.02) and a positive correlation (0.35). The other response times show negative correlations and below-average scores.

**Correlation**, **count**, and **average** for each host response type compared with the score.
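The correlation, count, and average tables used throughout this section could be built along the following lines. This is a sketch under assumptions: it uses the Pearson correlation between a 0/1 indicator for each category and the score, and the names `category_summary`, `host_response_time`, and `score` are hypothetical.

```python
import pandas as pd

def category_summary(df: pd.DataFrame, column: str, target: str = "score") -> pd.DataFrame:
    """Correlation, count, and average score for each value of a categorical column."""
    # One 0/1 indicator column per category value.
    dummies = pd.get_dummies(df[column], dtype=float)
    return pd.DataFrame({
        # Pearson correlation between each indicator and the score.
        "correlation": dummies.corrwith(df[target]),
        "count": dummies.sum().astype(int),
        "average": df.groupby(column)[target].mean(),
    })

# Toy rows, invented purely for illustration.
toy = pd.DataFrame({
    "host_response_time": ["within an hour", "within an hour", "within a day", "within a day"],
    "score": [2.0, 3.0, 1.0, 0.5],
})
summary = category_summary(toy, "host_response_time")
```

The same helper would apply to the super host, verification, region, and property columns, which keeps every table in the same shape.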

Host is super host: Although only 24% of hosts are super hosts, this attribute has a correlation of 0.33 with the score, one of the highest positive correlations. Super hosts also have a considerably better average score than non-super hosts (2.41 vs 1.17).

**Correlation**, **count**, and **average** for super hosts column compared with the score.

Host identity verified: Host identity verification shows a similar pattern, with a correlation of 0.14 and a better average score for verified hosts (1.53 vs 1.02).

**Correlation**, **count**, and **average** for the host verification compared with the score.

Cancellation policy: The moderate cancellation policy has a better correlation (0.16) and a better average score (1.76) than the other policies.

Region

Looking at the regions, Ballard and Downtown have the best correlation, 0.06. By average score, Seward Park, Ballard, and Beacon Hill are the top 3.

University District has the most negative correlation, -0.06, and the lowest average score, 0.97.

**Mean**, **minimum**, and **maximum** daily price and the **number of places available** by region.

Downtown has one of the highest average daily prices, 154.4 USD, and, together with other regions, the highest maximum price, 999/1,000 USD.

Queen Anne has the cheapest place available, 20 USD, but the second highest average daily price, 157.2 USD. In terms of the number of places available, other regions lead the ranking with 794, followed by Capitol Hill with 567 and Downtown with 530.
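A price table like the one discussed above can be sketched with a single group-by. The column names `neighbourhood_group_cleansed` and `price`, and the assumption that prices are stored as text such as "$1,000.00", are guesses about the raw data; the toy rows are invented.

```python
import pandas as pd

def region_price_summary(listings: pd.DataFrame) -> pd.DataFrame:
    """Mean, minimum, and maximum daily price plus listing count for each region."""
    # Strip "$" and thousands separators before converting to numbers.
    price = (
        listings["price"].astype(str)
        .str.replace(r"[$,]", "", regex=True)
        .astype(float)
    )
    return (
        listings.assign(price=price)
        .groupby("neighbourhood_group_cleansed")["price"]
        .agg(["mean", "min", "max", "count"])
        .sort_values("mean", ascending=False)
    )

# Toy rows, invented purely for illustration.
toy = pd.DataFrame({
    "neighbourhood_group_cleansed": ["Downtown", "Downtown", "Queen Anne"],
    "price": ["$150.00", "$1,000.00", "$20.00"],
})
summary = region_price_summary(toy)
```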

**Correlation**, **count**, and **average** for each region compared with the score.

Property

In terms of property types, the yurt has the highest average score, but only one is available. The cabin comes second, with a 2.87 average score and the highest positive correlation, 0.07; however, only 21 are available.

A curious fact here concerns apartments and houses: their correlations have the same absolute value but opposite signs (positive for apartments, negative for houses), with average scores of 1.45 and 1.38 respectively.

**Correlation**, **count**, and **average** for the property type compared with the score.

For the room type, the private room showed the best correlation (0.07) and the highest average score (1.58). For the bed type, the pull-out sofa (really?) had the best correlation (0.03) and average score (1.76).

For a positive experience: Essentials, shampoo, hair dryer, and iron form the first group, with the highest correlations and average scores. Curiously, the smoke detector, first aid kit, carbon monoxide detector, and fire extinguisher form the second one. This makes sense: security is a good way to improve the experience. Extras like WiFi, breakfast, and 24-hour check-in form the third group.

**Correlation**, **count**, and **average** for the **positively** correlated amenities compared with the score.

For a negative experience: Washer, dryer, and kitchen are among the amenities with the lowest scores and the most negative correlations. However, listing no amenities at all has the worst correlation, -0.07, and the lowest score, 0.47.

**Correlation**, **count**, and **average** for the **negatively** correlated amenities compared with the score.

Data Preparation

Now that we have analyzed the data, we will preprocess it for use in a machine learning model and study the model's parameters. First, we select only the columns that could be related to the score, such as years as host, host response time, number of bathrooms and beds, and bedroom type.

Then, checking the proportion of null values, we can see that the security deposit and cleaning fee have the most nulls. This is probably because most places don’t charge guests for them, so we fill them with zero.

For the other numerical columns, we fill nulls with the mean. For categorical columns, we use dummy variables, preserving the first column and adding a dummy for null values.

Amenities were separated using a one-hot-vector approach, having 1 for the amenities in the place and 0 otherwise.
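The preparation steps above can be sketched as follows. The column names, the raw amenities format (a string like `{TV,"Wireless Internet"}`), and the `prepare_features` helper are all assumptions made for illustration, and "preserving the first column" is read here as keeping every dummy level instead of dropping the first one.

```python
import pandas as pd

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Missing fees are treated as "no fee charged".
    fee_cols = ["security_deposit", "cleaning_fee"]
    df[fee_cols] = df[fee_cols].fillna(0)
    # Remaining numeric gaps are filled with the column mean.
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    # Amenities: one 0/1 column per amenity; an empty "{}" yields all zeros.
    amenities = (
        df.pop("amenities").fillna("")
        .str.strip("{}").str.replace('"', "", regex=False)
        .str.get_dummies(sep=",")
        .add_prefix("amenity_")
    )
    # Categorical columns: keep every level and add an explicit indicator for nulls.
    cat_cols = [c for c in df.columns if c not in num_cols]
    dummies = pd.get_dummies(df, columns=cat_cols, dummy_na=True, dtype=int)
    return pd.concat([dummies, amenities], axis=1)

# Toy rows, invented purely for illustration.
toy = pd.DataFrame({
    "cleaning_fee": [25.0, None, 40.0],
    "security_deposit": [None, 100.0, None],
    "beds": [1.0, None, 3.0],
    "room_type": ["Private room", None, "Entire home/apt"],
    "amenities": ['{TV,"Wireless Internet"}', "{}", "{TV}"],
})
features = prepare_features(toy)
```

Filling the fees with zero before the mean step matters: otherwise the fee columns would be filled with their mean like any other numeric column.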

The final dataset had 3818 rows and 117 columns, with 3436 rows in the training set and 382 in the test set.

Modeling

The model chosen was the Elastic Net, a linear regression that combines L1 and L2 regularization.

Using a grid search over the hyperparameters, we found an alpha of 0.01, an l1 ratio of 0.95, and a maximum of 40 iterations. Training the model with those parameters, we reached an R2 score of 0.35 on the training set and 0.33 on the test set.
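A grid search of this kind might look like the sketch below, using scikit-learn. The data is synthetic and the grid values are illustrative guesses, so it will not reproduce the numbers above; only the three tuned hyperparameters and the roughly 90/10 split implied by the row counts come from the article.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real inputs would be the prepared features and the score.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

# A 10% test split is consistent with the 3436/382 row counts.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# scikit-learn's ElasticNet minimizes
#   1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1
#   + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
# so l1_ratio=0.95 means the penalty is almost entirely L1 (Lasso-like).
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0],
    "l1_ratio": [0.05, 0.5, 0.95],
    "max_iter": [40, 1000],
}
search = GridSearchCV(ElasticNet(), param_grid, scoring="r2", cv=5)
search.fit(X_train, y_train)

best = search.best_estimator_
train_r2 = best.score(X_train, y_train)
test_r2 = best.score(X_test, y_test)
```

Comparing `train_r2` with `test_r2` is a quick overfitting check; in the article the two values are close (0.35 and 0.33), which suggests the model generalizes about as well as it fits.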

Results

Model coefficients for each column ordered by importance.

Looking at the most important coefficients of the model, we can see that responding in less than an hour or within a few hours, being in the Downtown or Ballard region, and offering the main amenities like shampoo, essentials, and smoke detectors are the main features for a good experience.

On the opposite side, not being a super host, not having a verified identity, being in a condominium, and having a flexible cancellation policy are the main negative ones.
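One way to produce such a ranking is to sort the fitted coefficients by absolute value while keeping their sign. This is a sketch with a hypothetical `ranked_coefficients` helper, invented feature names, and synthetic data; only the alpha and l1 ratio values come from the article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet

def ranked_coefficients(model, feature_names) -> pd.Series:
    """Coefficients sorted by absolute value, largest first, keeping their sign."""
    coefs = pd.Series(model.coef_, index=feature_names)
    return coefs.sort_values(key=lambda s: s.abs(), ascending=False)

# Synthetic data with one strong positive, one weaker negative, and one irrelevant feature.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["fast_response", "condominium", "noise"],
)
y = 2.0 * X["fast_response"] - 1.0 * X["condominium"] + rng.normal(scale=0.1, size=200)

model = ElasticNet(alpha=0.01, l1_ratio=0.95).fit(X, y)
ranking = ranked_coefficients(model, X.columns)
```

Coefficient sizes are only comparable when the features are on similar scales, which is roughly the case for 0/1 dummy columns but not for raw numeric columns such as fees.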

In the end, if you are a guest trying to find a place in Seattle, Ballard and Downtown are the best regions if you don’t mind paying more. Pay attention to how long the host takes to answer your questions: if it is more than a few hours, maybe you should look elsewhere. Run away from non-super hosts and avoid non-verified ones. Shampoo, essentials, and security items are among the most important amenities for a good experience.

If you are a host and want to improve the guest experience, make getting your super host certification a priority. Offer shampoo, essentials, and security items like smoke and carbon monoxide detectors as part of the amenities. Avoid a flexible cancellation policy and use the moderate one instead.

Want to know more?

My LinkedIn: https://www.linkedin.com/in/adilsonvital/

My GitHub: https://github.com/adilsonvj

GitHub Project Repository: https://github.com/adilsonvj/UdacityDataScience2022