import pandas as pd
import geopandas as gpd
= list()
csv_data = open("./data/listings_full.csv",'r', encoding='latin1')
listings for line in listings:
= line.split(',')
data
csv_data.append(data)
listings.close()
= pd.read_csv("./data/listings_full.csv", encoding='latin1') df
Housing
Assignment 3 : Housing
Preliminary Data
The following infomation was found prior to the start of the assignment in order to use it.
tax hotels = 7% of turnover + 3 euros per room tax airbnb = 10% = 3 euros number of visitors = 30000
Library imports and DataFrame generation
1. What Amsterdam will receive from tourist tax if the event lasts a week and you will have 30.000 visitors?
Pseudocode
- Discard listings with minimum stays of more than 7 nights
- Sum the total ammount of people that fit the remaining airbnbs accomodate (#19812)
- Calculate average price of accomodation remaining listings
- for the remaining visitors (11188) they will stay in hotels –> average of 2.2 beds per room
- Divide the number of rooms between the remaining tourists
- Multiply number of available beds to average price
- Sum airbnb + hotels taxes accordingly
#Gathering information shortcuts:
#ok_listings = pd.DataFrame(df.columns)
#print(df.columns)
#print(df.columns[34])
#print(df.maximum_nights)
Code:
#we discard listings with minimum stays of more than 7 nights
= df[(df['minimum_nights'] < 8)].copy()
ok_listings
#we check that all numeric values in our DataFrame sum
= ok_listings.sum(axis=0,numeric_only=True)
test
#we sum individually the column 'accommodates' so we can retrieve the number of people who can stay
= ok_listings.accommodates.sum()
no_ppl
print('Number of Airbnb listings with at most 7 days of minimal nights :', no_ppl)
#print(test) success
= 30000
tourists
= tourists-no_ppl
tourists_hotel
print('Number of people who needs to stay in hotel rooms',': ', tourists_hotel)
#print(ok_listings)
#Average cost of overnight accommodation in Amsterdam in the Netherlands from January 2019 to June 2023
##https://www.statista.com/statistics/614061/overnight-accommodation-costs-amsterdam-city/
= 294
hotel_price_per_night
#https://www.cbs.nl/en-gb/figures/detail/84040ENG
#Hotels; capacity, type of accommodation, beds, star rating
= 41840
hotel_rooms_ams = 90918
hotel_beds_ams
= hotel_beds_ams/hotel_rooms_ams
avg_beds_per_room
#print(avg_beds_per_room)
#Number of hotel rooms used by tourists who don't fit in airbnbs
= tourists_hotel/avg_beds_per_room
no_hotel_rooms_used
= no_hotel_rooms_used*hotel_price_per_night
revenue_hotels
= revenue_hotels * 0.07 + no_hotel_rooms_used * 3
tax_hotels
#getting rid of $ signs in column
= df['price'].str.slice(1,-1)
new_price
#adding new_price as a column in our dataframe
'new_price'] = new_price
ok_listings[
'new_price']=ok_listings['new_price'].str.replace(',','')
ok_listings[
#converting new column values to float so we can math it up =P
'new_price'] = ok_listings['new_price'].astype(float)
ok_listings[
#calculating total revenue of airbnb by summing all location price per night and then times 7 (number of days in a week)
= ok_listings.new_price.sum()*7
revenue_airbnb
= float(revenue_airbnb)*.1 + float(len(new_price)*3)
tax_airbnb
print('Total tax revenue from Airbnb locations $', tax_airbnb)
print('Total tax revenue from hotel rooms $',tax_hotels)
= tax_airbnb + tax_hotels
total_revenue
print('Total tax revenue gathered for the government', total_revenue)
Number of Airbnb listings with at most 7 days of minimal nights : 23407.0
Number of people who needs to stay in hotel rooms : 6593.0
Total tax revenue from Airbnb locations $ 1469366.9000000001
Total tax revenue from hotel rooms $ 71543.25226687785
Total tax revenue gathered for the government 1540910.152266878
2. Plot the amount of AirBnB locations per neighbourhood.
Pseudocode
- Filter from dataframe the locations of airbnb
- Gather the different categories inside the column neighourhood
- Sum each category inside neighbourhood
#It can't get mpre efficient than this
import plotly.express as px
#table = ok_listings.neighbourhood_cleansed.value_counts()
= ok_listings['neighbourhood_cleansed'].value_counts().reset_index().rename(columns={"index": 'neighbourhood', 0 : 'count'})
re_index print(re_index )
neighbourhood_cleansed count
0 De Baarsjes - Oud-West 1329
1 Centrum-West 973
2 De Pijp - Rivierenbuurt 898
3 Centrum-Oost 766
4 Zuid 571
5 Westerpark 561
6 Oud-Oost 475
7 Bos en Lommer 399
8 Oud-Noord 389
9 Oostelijk Havengebied - Indische Buurt 304
10 Watergraafsmeer 249
11 Noord-West 212
12 IJburg - Zeeburgereiland 173
13 Slotervaart 156
14 Noord-Oost 135
15 Geuzenveld - Slotermeer 110
16 Buitenveldert - Zuidas 96
17 De Aker - Nieuw Sloten 67
18 Bijlmer-Centrum 47
19 Gaasperdam - Driemond 45
20 Osdorp 43
21 Bijlmer-Oost 37
#graph with information
= px.bar(re_index, x='neighbourhood_cleansed', y='count', title = 'Number of airbnbs per neighbourhood in Amsterdam')
figure figure.show()
Unable to display output for mime type(s): application/vnd.plotly.v1+json
3: Which street in Amsterdam has the most AirBnB apartments?
Pseudocode
- Find the locations of airbnb through lat and long.
- Add the street category to the DataFrame.
- Gather the different categories inside the column Street.
- Sum each category per street filter.
#pip install geopy
import certifi
import ssl
import geopy.geocoders
from geopy.geocoders import Nominatim, Photon
# Workaround to fix SSL certififcate expired error on some laptops
# Solution taken from https://stackoverflow.com/a/50665487
# Error message:
# GeocoderUnavailable: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443):
# Max retries exceeded with url: /reverse?lat=52.40164&lon=4.95106&format=json&addressdetails=1
# (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED]
# certificate verify failed: certificate has expired (_ssl.c:1006)')))
= ssl.create_default_context(cafile=certifi.where())
ctx = ctx
geopy.geocoders.options.default_ssl_context
= Nominatim(user_agent='geoapiExercises')
geolocator
#print(ok_listings.columns)
#New dataframe with only 3 parameters
= ok_listings[['latitude', 'longitude']]
geolocs_airbnb
#we erase all typos from the columns and then add '.' accordingly to their index position as to logical lon/lat values
'longitude']=geolocs_airbnb['longitude'].str.replace(',','')
geolocs_airbnb['longitude']=geolocs_airbnb['longitude'].str.replace('.','')
geolocs_airbnb[= (geolocs_airbnb.longitude.str[:1] + '.'+ geolocs_airbnb.longitude.str[1:]).astype(float)
geolocs_airbnb.longitude
'latitude']=geolocs_airbnb['latitude'].str.replace(',','')
geolocs_airbnb['latitude']=geolocs_airbnb['latitude'].str.replace('.','')
geolocs_airbnb[= (geolocs_airbnb.latitude.str[:2] + '.' + geolocs_airbnb.latitude.str[2:]).astype(float)
geolocs_airbnb.latitude
for i in range(len(geolocs_airbnb)):
= geolocs_airbnb.iloc[i]['latitude'],geolocs_airbnb.iloc[i]['longitude']
latitude, longitude = geolocator.reverse((latitude, longitude), timeout=None)
location = location.raw['address']['road']
street 'street'] = street
geolocs_airbnb.loc[i,
geolocs_airbnb.head()
print(geolocs_airbnb.head())
latitude longitude street
0 52.40164 4.95106 Jisperveldstraat
1 52.38855 4.88521 Zoutkeetsplein
2 52.37824 4.86826 Centrale Groothandelsmarkt
3 52.34091 4.84802 IJsbaanpad
4 52.37641 4.88303 Derde Egelantiersdwarsstraat
#saving harvested data inside another file for next tries
import pickle
= open('data.obj', 'wb')
fileObj
pickle.dump(geolocs_airbnb,fileObj)
fileObj.close()
pickle.load
Geoloc test
Test of 10 locations and number of listings per per street
#print(geolocs_airbnb.street)
print('Number of Airbnb listings per street:', geolocs_airbnb.street.value_counts())
Number of Airbnb listings per street: street
Jisperveldstraat 1
Zoutkeetsplein 1
Centrale Groothandelsmarkt 1
IJsbaanpad 1
Derde Egelantiersdwarsstraat 1
Maassluisstraat 1
Zanddwarsstraat 1
Amstelveld 1
Soembawastraat 1
Keizersgracht 1
Name: count, dtype: int64
import osmnx as ox
import matplotlib.pyplot as plt
import plotly as py
from shapely.geometry import Point, Polygon
import numpy as np
#here we set up a geodata frame based on the airbnb data set and do so with lat and long as geo location
= gpd.GeoDataFrame(geolocs_airbnb, geometry=gpd.points_from_xy(geolocs_airbnb.longitude, geolocs_airbnb.latitude), crs = "WGS84")
points_airbnb print(points_airbnb.head())
# distribution of airbnbs in ams
= "Amsterdam"
ams = ox.graph_from_place(ams)
graph
#map with airbnbs locations
= plt.subplots(figsize=(150,150))
fig, ax 'equal') #same scaling as data available
ax.set_aspect('black')
ax.set_facecolor(
='#edf2f1',
ox.plot_graph(graph, edge_color=0, edge_linewidth=5,
node_size=False, close=False, ax=ax,
show
)=ax, color ="#81dec8", markersize = 150)
points_airbnb.plot(ax
"Q2_Q3_airbnbsAMS.png") plt.savefig(
latitude longitude geometry
0 52.40164 4.95106 POINT (4.95106 52.40164)
1 52.38855 4.88521 POINT (4.88521 52.38855)
2 52.37824 4.86826 POINT (4.86826 52.37824)
3 52.34091 4.84802 POINT (4.84802 52.34091)
4 52.37641 4.88303 POINT (4.88303 52.37641)
4. Try to cross reference the data from the AirBnB dataset with the BBGA. Can you figure out if all apartments of AirBnB are designated as housing? Which number of apartments are not rented out all the time but are also used as normal housing?
#usable data from bbga
= pd.read_csv('./data/2023_BBGA_0614_cf8d825e89.csv') #./bbga_short.csv
bbga_df #print(bbga_df)
"Oud West, De Baarsjes" :"De Baarsjes - Oud-West",
bbga_df.replace({ "De Pijp, Rivierenbuurt":"De Pijp - Rivierenbuurt",
"Indische Buurt, Oostelijk Havengebied" : "Oostelijk Havengebied - Indische Buurt",
"IJburg, Zeeburgereiland" : "IJburg - Zeeburgereiland",
"Geuzenveld, Slotermeer" : "Geuzenveld - Slotermeer",
"Buitenveldert, Zuidas" : "Buitenveldert - Zuidas",
"Oud-Zuid" : "Zuid" ,
"De Aker, Sloten, Nieuw-Sloten" : "De Aker - Nieuw Sloten",
"Gaasperdam" : "Gaasperdam - Driemond"}, inplace = True)
print(bbga_df)
= pd.read_csv('./data/2023_BBGA_0614_cf8d825e89.csv')
bbga_df = bbga_df.loc[3937:3963,['gebiednaam','BHVESTAIRBNB']]
bbga_ok #name of neighborhood paramenters equally (using airbnb file)
= bbga_ok.replace({
new_bbga "Oud West, De Baarsjes" :"De Baarsjes - Oud-West", "De Pijp, Rivierenbuurt":"De Pijp - Rivierenbuurt",
"Indische Buurt, Oostelijk Havengebied" : "Oostelijk Havengebied - Indische Buurt", "IJburg, Zeeburgereiland" : "IJburg - Zeeburgereiland",
"Geuzenveld, Slotermeer" : "Geuzenveld - Slotermeer", "Buitenveldert, Zuidas" : "Buitenveldert - Zuidas",
"Oud-Zuid" : "Zuid", "De Aker, Sloten, Nieuw-Sloten" : "De Aker - Nieuw Sloten", "Gaasperdam" : "Gaasperdam - Driemond"
})
#plot the amount of apartments from both datasets
= px.bar(re_index, x='neighbourhood_cleansed', y='count', title= 'Data comparison of airbnb units according to bbga and airbnb')
fig =new_bbga['gebiednaam'], y=new_bbga['BHVESTAIRBNB'],name="BBGA")
fig.add_bar(x#fig.add_bar(x=re_index.neighbourhood_cleansed.value_counts([]), y='count', name="abb")
='group')
fig.update_layout(barmode fig.show()
Unable to display output for mime type(s): application/vnd.plotly.v1+json
The results seen in the chart can be explained by the time period measured by both datasets. Still the proportion of rented units remains constant within both datasets with Haarlemmerbuurt as the only exception; reflecting an increased period of unavailability in the time registered by Airbnb.
5. How many hotel rooms should be built if Amsterdam wants to accommodate the same number of tourists?
#we sum individually the column 'accommodates' so we can retrieve the number of people who can stay
= ok_listings.accommodates.sum()
no_ppl
= 30000
tourists
= tourists-no_ppl
tourists_hotel
= 41840
hotel_rooms_ams = 90918
hotel_beds_ams
= hotel_beds_ams/hotel_rooms_ams
avg_beds_per_room
#print(avg_beds_per_room)
#Number of hotel rooms used by tourists who don't fit in airbnbs
= tourists_hotel/avg_beds_per_room
no_hotel_rooms_used
print('Number of hotel rooms used by tourists who do not fit in airbnbs:', no_hotel_rooms_used)
Number of hotel rooms used by tourists who do not fit in airbnbs: 3034.0649816318
6. How many different licenses are issued?
#verifying license column
#print(ok_listings.columns[68])
= (ok_listings.license.nunique())
licenses =(ok_listings.license.value_counts())
license_values print(license_values)
= (ok_listings.license.value_counts()['Exempt'])
exemptions
= licenses-exemptions
ulicenses
print('Number of unique Airbnb licenses in Amsterdam (not counting the exempt values):',ulicenses)
license
Exempt 600
0363 78AD 8875 790E 3C05 14
ABCD 1234 AB12 89EF A0F9 9
036341086EC3C2FF2493 7
0363 A250 F710 53C5 1273 6
...
0363 D239 E048 910F 216B 1
0363 CB7E 060D 07E0 A4DE 1
0363 45B2 6D92 AA8B 1326 1
0363 F6A5 F87A D89D 8FC1 1
0363 2F20 4F53 FB57 2D7F 1
Name: count, Length: 6972, dtype: int64
Number of unique Airbnb licenses in Amsterdam (not counting the exempt values): 6372
We have more listings than licenses and several ‘exempt’ values. This is because the listings already have another type of license since they either work as hotels or hostels. https://www.airbnb.com/help/article/860