At the very beginning of my data science career, I wondered whether Airbnb datasets, combined with some ETL, could give me hints about what to take into consideration if I wanted to book accommodation in Seattle through a platform like booking.com or, in this case, Airbnb.
So let’s get started!
As mentioned, I started my investigation with Airbnb datasets that cover the price and availability of each accommodation, as well as descriptions of the apartments and their overall review scores.
All accommodations are located in Seattle.
Supply and demand always applies
In the raw data, every accommodation has the attributes “availability” and “price”. So my first question was: “If I’m going to Seattle, when would be the best time for a visit?”
To answer that question, I first removed the “day” information from the date column, which let me group the available accommodations by month. After that, I could count the available accommodations by their unique id.
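A minimal sketch of this step, assuming pandas. The tiny table and the column names (`listing_id`, `date`, `available`) are stand-ins I made up for illustration, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for the calendar dataset; the column names
# and values are assumptions for illustration only.
calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 1, 2, 3],
    "date": ["2016-01-04", "2016-01-05", "2016-01-06",
             "2016-07-04", "2016-07-05", "2016-07-06"],
    "available": ["t", "t", "t", "f", "t", "f"],
})

# Drop the day information by reducing each date to its month.
calendar["month"] = pd.to_datetime(calendar["date"]).dt.to_period("M")

# Keep only the available rows, then count distinct listings per month.
available_per_month = (
    calendar[calendar["available"] == "t"]
    .groupby("month")["listing_id"]
    .nunique()
)
print(available_per_month)
```

Counting distinct ids with `nunique()` means an accommodation that is available on several days of a month is only counted once.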
The resulting chart shows that availability is at its lowest in the summer months, especially July and August. Of course, there are still plenty of accommodations available, but it seems that in the warm summer months the landlords want to keep their flats for personal use.
This brings me to my next question: “If the availability of accommodations is decreasing during the summer months, does this also affect the price?”
The simple answer is yes! Plot number two shows a bell-shaped curve of the price over the months. Just as availability bottoms out in July and August in the first plot,
the price peaks in July, followed by August. This leads me to the conclusion that price and availability are negatively correlated or, put differently, that the typical market rule of “supply and demand” also applies here.
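The comparison of price and availability could be sketched like this, again assuming pandas. The rows, the column names and the `$`-formatted price strings are my own illustrative assumptions:

```python
import pandas as pd

# Hypothetical calendar rows; prices are assumed to be stored as strings
# such as "$100.00" and to be missing when a listing is not available.
calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "date": ["2016-01-10"] * 3 + ["2016-04-10"] * 3 + ["2016-07-10"] * 3,
    "available": ["t", "t", "t", "t", "t", "f", "t", "f", "f"],
    "price": ["$100.00", "$110.00", "$120.00",
              "$120.00", "$130.00", None,
              "$150.00", None, None],
})

# Work on available rows only and turn the price strings into numbers.
available = calendar[calendar["available"] == "t"].copy()
available["price"] = (
    available["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)
available["month"] = pd.to_datetime(available["date"]).dt.to_period("M")

# One row per month: number of available listings and their mean price.
monthly = available.groupby("month").agg(
    available_listings=("listing_id", "nunique"),
    mean_price=("price", "mean"),
)

# Pearson correlation between monthly availability and monthly price.
corr = monthly["available_listings"].corr(monthly["mean_price"])
print(monthly)
print(corr)
```

In this made-up example, the months with fewer available listings have the higher mean price, so the coefficient comes out negative.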
How can scores influence our decision?
For my first observations, I used only one of the given datasets. The second dataset has detailed information about each accommodation and an average review score. So the next major question is: “Does the review score have any influence on the price?”
Before preparing the dataset to answer this question, I wanted to find out which numeric columns have the strongest influence on price.
To do so, I created a correlation matrix, which indicates that the attributes describing the size of the accommodation have the strongest correlation with the price.
To my surprise, I could not find a clear relationship between location and price (the correlation between zip code and price is 0.01).
However, the attribute “longitude” has a slight negative correlation with price. Thus, comparing only accommodations with similar facilities might give a better understanding.
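A correlation matrix like the one described above could be built as follows, assuming pandas. The listings and column names are invented for illustration and do not reproduce the real numbers:

```python
import pandas as pd

# Hypothetical listings; column names and values are assumptions.
listings = pd.DataFrame({
    "accommodates": [2, 4, 6, 2, 8, 3],
    "bedrooms": [1, 2, 3, 1, 4, 1],
    "beds": [1, 2, 3, 1, 5, 2],
    "longitude": [-122.30, -122.35, -122.31, -122.33, -122.34, -122.36],
    "price": [80.0, 150.0, 220.0, 95.0, 300.0, 110.0],
})

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = listings.select_dtypes("number").corr()

# Keep only the correlations with price, strongest first.
corr_with_price = (
    corr_matrix["price"]
    .drop("price")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_price)
```

One thing to keep in mind: a zip code is a label rather than a quantity, so a Pearson coefficient computed on it may say little about how location relates to price.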
Because the size of the accommodation has a huge influence on price, I filtered the dataset by the number of beds in order to compare accommodations fairly with regard to their review score.
Plot number four depicts the monthly price development of the available
one-bed accommodations, separated by their review score. I chose the median price, since outliers have a strong influence on the average price.
Taking only the absolute review score into account, there is a tendency
for high prices to have a negative influence on overall satisfaction
(check review scores 4 to 6).
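The grouping behind a plot like number four could look roughly like this, assuming pandas. The two small tables, the column names and the precomputed `month` column are assumptions for illustration:

```python
import pandas as pd

# Hypothetical listing details; names and values are assumptions.
listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "beds": [1, 1, 1, 2],
    "review_score": [8, 8, 10, 9],
})

# Hypothetical calendar rows with a month column and numeric prices.
calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 4, 1, 2, 3, 4],
    "month": ["2016-01"] * 4 + ["2016-07"] * 4,
    "price": [70.0, 90.0, 100.0, 200.0, 80.0, 120.0, 110.0, 250.0],
})

# Compare like with like: keep only accommodations with one bed.
one_bed = listings[listings["beds"] == 1]

# Attach the review score to every calendar row of those listings.
merged = calendar.merge(one_bed, left_on="listing_id", right_on="id")

# Median price per month and review score, one column per score.
median_price = (
    merged.groupby(["month", "review_score"])["price"]
    .median()
    .unstack("review_score")
)
print(median_price)
```

The median is used instead of the mean so that a single very expensive listing does not pull a whole score group upwards.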
Plot number five shows the same graphic filtered to scores of six and above.
Accommodations with a score of ten are priced lower than accommodations with a score of nine. This supports the assumption that high prices have a negative influence on the overall satisfaction of the reviewers.
It seems that accommodations with a score of eight offer the best tradeoff between satisfaction and price, since they always have the lowest price.
Top 3 accommodations in Seattle
Last but not least, I wanted to find out which are the top 3 accommodations in Seattle.
To do so, I assumed that it is not sufficient to take the accommodations with the highest score, because ratings are sometimes inconsistent.
To get a more realistic view, I trusted the judgement of the majority by including the total number of reviews.
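One possible way to combine score and review count, assuming pandas. The exact ranking rule is not spelled out above, so the score threshold, the sort order, the names and the numbers below are all assumptions for illustration:

```python
import pandas as pd

# Hypothetical listings; names and values are made up.
listings = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "review_score": [10, 9, 10, 10, 8],
    "number_of_reviews": [3, 353, 120, 200, 400],
})

# Assumed rule: require a high score, then let the number of reviews
# decide, so a perfect score from only a few reviewers does not win.
MIN_SCORE = 9
top3 = (
    listings[listings["review_score"] >= MIN_SCORE]
    .sort_values(["number_of_reviews", "review_score"], ascending=False)
    .head(3)
)
print(top3)
```

With this rule, listing A drops out despite its score of ten because only three people reviewed it, and listing E drops out because its score is below the threshold.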
As a result, plot number six shows the price development over the months, separated by the top three accommodations.
Two of the three accommodations (all with a single bed) have the same vendor, which probably explains why their price is the same. The second vendor has an average score of nine across a total of 353 reviews, the third-highest count within the dataset. This vendor’s price is also constant over the months and is consistently lower than the first vendor’s.
Hence, if I had to find the best accommodation for myself in Seattle, I would probably choose the “Capitol Hill Suite” :)
OK, I have to admit that this is not really data science. However, these simple statistics already gave me an idea of what data can tell us about the behavior of people and how it can influence the decisions we make.
In this analysis, I completely skipped the soft factors in the non-numerical columns of the dataset, so there is a lot more to learn from it.
So how to continue? In my first observation, I could not find a proper answer to how the location of an accommodation influences the price. I am sure the data can tell.
Additionally, wouldn’t it be nice to identify further reasons why people choose a specific accommodation?
Want to learn more?
If you want to learn more about my analysis, check out my GitHub.