import pandas as pd
import geopandas as gpd
csv_data = list()
listings = open("./data/listings_full.csv",'r', encoding='latin1')
for line in listings:
data = line.split(',')
csv_data.append(data)
listings.close()
df = pd.read_csv("./data/listings_full.csv", encoding='latin1')Housing
Assignment 3 : Housing
Preliminary Data
The following infomation was found prior to the start of the assignment in order to use it.
tax hotels = 7% of turnover + 3 euros per room tax airbnb = 10% = 3 euros number of visitors = 30000
Library imports and DataFrame generation
1. What Amsterdam will receive from tourist tax if the event lasts a week and you will have 30.000 visitors?
Pseudocode
- Discard listings with minimum stays of more than 7 nights
- Sum the total ammount of people that fit the remaining airbnbs accomodate (#19812)
- Calculate average price of accomodation remaining listings
- for the remaining visitors (11188) they will stay in hotels –> average of 2.2 beds per room
- Divide the number of rooms between the remaining tourists
- Multiply number of available beds to average price
- Sum airbnb + hotels taxes accordingly
#Gathering information shortcuts:
#ok_listings = pd.DataFrame(df.columns)
#print(df.columns)
#print(df.columns[34])
#print(df.maximum_nights)Code:
#we discard listings with minimum stays of more than 7 nights
ok_listings = df[(df['minimum_nights'] < 8)].copy()
#we check that all numeric values in our DataFrame sum
test = ok_listings.sum(axis=0,numeric_only=True)
#we sum individually the column 'accommodates' so we can retrieve the number of people who can stay
no_ppl = ok_listings.accommodates.sum()
print('Number of Airbnb listings with at most 7 days of minimal nights :', no_ppl)
#print(test) success
tourists = 30000
tourists_hotel = tourists-no_ppl
print('Number of people who needs to stay in hotel rooms',': ', tourists_hotel)
#print(ok_listings)
#Average cost of overnight accommodation in Amsterdam in the Netherlands from January 2019 to June 2023
##https://www.statista.com/statistics/614061/overnight-accommodation-costs-amsterdam-city/
hotel_price_per_night = 294
#https://www.cbs.nl/en-gb/figures/detail/84040ENG
#Hotels; capacity, type of accommodation, beds, star rating
hotel_rooms_ams = 41840
hotel_beds_ams = 90918
avg_beds_per_room = hotel_beds_ams/hotel_rooms_ams
#print(avg_beds_per_room)
#Number of hotel rooms used by tourists who don't fit in airbnbs
no_hotel_rooms_used = tourists_hotel/avg_beds_per_room
revenue_hotels = no_hotel_rooms_used*hotel_price_per_night
tax_hotels = revenue_hotels * 0.07 + no_hotel_rooms_used * 3
#getting rid of $ signs in column
new_price = df['price'].str.slice(1,-1)
#adding new_price as a column in our dataframe
ok_listings['new_price'] = new_price
ok_listings['new_price']=ok_listings['new_price'].str.replace(',','')
#converting new column values to float so we can math it up =P
ok_listings['new_price'] = ok_listings['new_price'].astype(float)
#calculating total revenue of airbnb by summing all location price per night and then times 7 (number of days in a week)
revenue_airbnb = ok_listings.new_price.sum()*7
tax_airbnb = float(revenue_airbnb)*.1 + float(len(new_price)*3)
print('Total tax revenue from Airbnb locations $', tax_airbnb)
print('Total tax revenue from hotel rooms $',tax_hotels)
total_revenue = tax_airbnb + tax_hotels
print('Total tax revenue gathered for the government', total_revenue)Number of Airbnb listings with at most 7 days of minimal nights : 23407.0
Number of people who needs to stay in hotel rooms : 6593.0
Total tax revenue from Airbnb locations $ 1469366.9000000001
Total tax revenue from hotel rooms $ 71543.25226687785
Total tax revenue gathered for the government 1540910.152266878
2. Plot the amount of AirBnB locations per neighbourhood.
Pseudocode
- Filter from dataframe the locations of airbnb
- Gather the different categories inside the column neighourhood
- Sum each category inside neighbourhood
#It can't get mpre efficient than this
import plotly.express as px
#table = ok_listings.neighbourhood_cleansed.value_counts()
re_index = ok_listings['neighbourhood_cleansed'].value_counts().reset_index().rename(columns={"index": 'neighbourhood', 0 : 'count'})
print(re_index ) neighbourhood_cleansed count
0 De Baarsjes - Oud-West 1329
1 Centrum-West 973
2 De Pijp - Rivierenbuurt 898
3 Centrum-Oost 766
4 Zuid 571
5 Westerpark 561
6 Oud-Oost 475
7 Bos en Lommer 399
8 Oud-Noord 389
9 Oostelijk Havengebied - Indische Buurt 304
10 Watergraafsmeer 249
11 Noord-West 212
12 IJburg - Zeeburgereiland 173
13 Slotervaart 156
14 Noord-Oost 135
15 Geuzenveld - Slotermeer 110
16 Buitenveldert - Zuidas 96
17 De Aker - Nieuw Sloten 67
18 Bijlmer-Centrum 47
19 Gaasperdam - Driemond 45
20 Osdorp 43
21 Bijlmer-Oost 37
#graph with information
figure = px.bar(re_index, x='neighbourhood_cleansed', y='count', title = 'Number of airbnbs per neighbourhood in Amsterdam')
figure.show()Unable to display output for mime type(s): application/vnd.plotly.v1+json
3: Which street in Amsterdam has the most AirBnB apartments?
Pseudocode
- Find the locations of airbnb through lat and long.
- Add the street category to the DataFrame.
- Gather the different categories inside the column Street.
- Sum each category per street filter.
#pip install geopy
import certifi
import ssl
import geopy.geocoders
from geopy.geocoders import Nominatim, Photon
# Workaround to fix SSL certififcate expired error on some laptops
# Solution taken from https://stackoverflow.com/a/50665487
# Error message:
# GeocoderUnavailable: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443):
# Max retries exceeded with url: /reverse?lat=52.40164&lon=4.95106&format=json&addressdetails=1
# (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED]
# certificate verify failed: certificate has expired (_ssl.c:1006)')))
ctx = ssl.create_default_context(cafile=certifi.where())
geopy.geocoders.options.default_ssl_context = ctx
geolocator = Nominatim(user_agent='geoapiExercises')
#print(ok_listings.columns)
#New dataframe with only 3 parameters
geolocs_airbnb = ok_listings[['latitude', 'longitude']]
#we erase all typos from the columns and then add '.' accordingly to their index position as to logical lon/lat values
geolocs_airbnb['longitude']=geolocs_airbnb['longitude'].str.replace(',','')
geolocs_airbnb['longitude']=geolocs_airbnb['longitude'].str.replace('.','')
geolocs_airbnb.longitude = (geolocs_airbnb.longitude.str[:1] + '.'+ geolocs_airbnb.longitude.str[1:]).astype(float)
geolocs_airbnb['latitude']=geolocs_airbnb['latitude'].str.replace(',','')
geolocs_airbnb['latitude']=geolocs_airbnb['latitude'].str.replace('.','')
geolocs_airbnb.latitude = (geolocs_airbnb.latitude.str[:2] + '.' + geolocs_airbnb.latitude.str[2:]).astype(float)
for i in range(len(geolocs_airbnb)):
latitude, longitude = geolocs_airbnb.iloc[i]['latitude'],geolocs_airbnb.iloc[i]['longitude']
location = geolocator.reverse((latitude, longitude), timeout=None)
street = location.raw['address']['road']
geolocs_airbnb.loc[i, 'street'] = street
geolocs_airbnb.head()print(geolocs_airbnb.head()) latitude longitude street
0 52.40164 4.95106 Jisperveldstraat
1 52.38855 4.88521 Zoutkeetsplein
2 52.37824 4.86826 Centrale Groothandelsmarkt
3 52.34091 4.84802 IJsbaanpad
4 52.37641 4.88303 Derde Egelantiersdwarsstraat
#saving harvested data inside another file for next tries
import pickle
fileObj = open('data.obj', 'wb')
pickle.dump(geolocs_airbnb,fileObj)
fileObj.close()
pickle.loadGeoloc test
Test of 10 locations and number of listings per per street
#print(geolocs_airbnb.street)
print('Number of Airbnb listings per street:', geolocs_airbnb.street.value_counts())Number of Airbnb listings per street: street
Jisperveldstraat 1
Zoutkeetsplein 1
Centrale Groothandelsmarkt 1
IJsbaanpad 1
Derde Egelantiersdwarsstraat 1
Maassluisstraat 1
Zanddwarsstraat 1
Amstelveld 1
Soembawastraat 1
Keizersgracht 1
Name: count, dtype: int64
import osmnx as ox
import matplotlib.pyplot as plt
import plotly as py
from shapely.geometry import Point, Polygon
import numpy as np
#here we set up a geodata frame based on the airbnb data set and do so with lat and long as geo location
points_airbnb = gpd.GeoDataFrame(geolocs_airbnb, geometry=gpd.points_from_xy(geolocs_airbnb.longitude, geolocs_airbnb.latitude), crs = "WGS84")
print(points_airbnb.head())
# distribution of airbnbs in ams
ams = "Amsterdam"
graph = ox.graph_from_place(ams)
#map with airbnbs locations
fig, ax = plt.subplots(figsize=(150,150))
ax.set_aspect('equal') #same scaling as data available
ax.set_facecolor('black')
ox.plot_graph(graph, edge_color='#edf2f1',
node_size=0, edge_linewidth=5,
show=False, close=False, ax=ax,
)
points_airbnb.plot(ax=ax, color ="#81dec8", markersize = 150)
plt.savefig("Q2_Q3_airbnbsAMS.png") latitude longitude geometry
0 52.40164 4.95106 POINT (4.95106 52.40164)
1 52.38855 4.88521 POINT (4.88521 52.38855)
2 52.37824 4.86826 POINT (4.86826 52.37824)
3 52.34091 4.84802 POINT (4.84802 52.34091)
4 52.37641 4.88303 POINT (4.88303 52.37641)

4. Try to cross reference the data from the AirBnB dataset with the BBGA. Can you figure out if all apartments of AirBnB are designated as housing? Which number of apartments are not rented out all the time but are also used as normal housing?
#usable data from bbga
bbga_df = pd.read_csv('./data/2023_BBGA_0614_cf8d825e89.csv') #./bbga_short.csv
#print(bbga_df)
bbga_df.replace({ "Oud West, De Baarsjes" :"De Baarsjes - Oud-West",
"De Pijp, Rivierenbuurt":"De Pijp - Rivierenbuurt",
"Indische Buurt, Oostelijk Havengebied" : "Oostelijk Havengebied - Indische Buurt",
"IJburg, Zeeburgereiland" : "IJburg - Zeeburgereiland",
"Geuzenveld, Slotermeer" : "Geuzenveld - Slotermeer",
"Buitenveldert, Zuidas" : "Buitenveldert - Zuidas",
"Oud-Zuid" : "Zuid" ,
"De Aker, Sloten, Nieuw-Sloten" : "De Aker - Nieuw Sloten",
"Gaasperdam" : "Gaasperdam - Driemond"}, inplace = True)
print(bbga_df)bbga_df = pd.read_csv('./data/2023_BBGA_0614_cf8d825e89.csv')
bbga_ok = bbga_df.loc[3937:3963,['gebiednaam','BHVESTAIRBNB']]
#name of neighborhood paramenters equally (using airbnb file)
new_bbga = bbga_ok.replace({
"Oud West, De Baarsjes" :"De Baarsjes - Oud-West", "De Pijp, Rivierenbuurt":"De Pijp - Rivierenbuurt",
"Indische Buurt, Oostelijk Havengebied" : "Oostelijk Havengebied - Indische Buurt", "IJburg, Zeeburgereiland" : "IJburg - Zeeburgereiland",
"Geuzenveld, Slotermeer" : "Geuzenveld - Slotermeer", "Buitenveldert, Zuidas" : "Buitenveldert - Zuidas",
"Oud-Zuid" : "Zuid", "De Aker, Sloten, Nieuw-Sloten" : "De Aker - Nieuw Sloten", "Gaasperdam" : "Gaasperdam - Driemond"
})
#plot the amount of apartments from both datasets
fig = px.bar(re_index, x='neighbourhood_cleansed', y='count', title= 'Data comparison of airbnb units according to bbga and airbnb')
fig.add_bar(x=new_bbga['gebiednaam'], y=new_bbga['BHVESTAIRBNB'],name="BBGA")
#fig.add_bar(x=re_index.neighbourhood_cleansed.value_counts([]), y='count', name="abb")
fig.update_layout(barmode='group')
fig.show()Unable to display output for mime type(s): application/vnd.plotly.v1+json
The results seen in the chart can be explained by the time period measured by both datasets. Still the proportion of rented units remains constant within both datasets with Haarlemmerbuurt as the only exception; reflecting an increased period of unavailability in the time registered by Airbnb.
5. How many hotel rooms should be built if Amsterdam wants to accommodate the same number of tourists?
#we sum individually the column 'accommodates' so we can retrieve the number of people who can stay
no_ppl = ok_listings.accommodates.sum()
tourists = 30000
tourists_hotel = tourists-no_ppl
hotel_rooms_ams = 41840
hotel_beds_ams = 90918
avg_beds_per_room = hotel_beds_ams/hotel_rooms_ams
#print(avg_beds_per_room)
#Number of hotel rooms used by tourists who don't fit in airbnbs
no_hotel_rooms_used = tourists_hotel/avg_beds_per_room
print('Number of hotel rooms used by tourists who do not fit in airbnbs:', no_hotel_rooms_used)Number of hotel rooms used by tourists who do not fit in airbnbs: 3034.0649816318
6. How many different licenses are issued?
#verifying license column
#print(ok_listings.columns[68])
licenses = (ok_listings.license.nunique())
license_values =(ok_listings.license.value_counts())
print(license_values)
exemptions = (ok_listings.license.value_counts()['Exempt'])
ulicenses = licenses-exemptions
print('Number of unique Airbnb licenses in Amsterdam (not counting the exempt values):',ulicenses)license
Exempt 600
0363 78AD 8875 790E 3C05 14
ABCD 1234 AB12 89EF A0F9 9
036341086EC3C2FF2493 7
0363 A250 F710 53C5 1273 6
...
0363 D239 E048 910F 216B 1
0363 CB7E 060D 07E0 A4DE 1
0363 45B2 6D92 AA8B 1326 1
0363 F6A5 F87A D89D 8FC1 1
0363 2F20 4F53 FB57 2D7F 1
Name: count, Length: 6972, dtype: int64
Number of unique Airbnb licenses in Amsterdam (not counting the exempt values): 6372
We have more listings than licenses and several ‘exempt’ values. This is because the listings already have another type of license since they either work as hotels or hostels. https://www.airbnb.com/help/article/860