Unlocking Insights with a Step-by-Step Exploratory Data Analysis of the MTA Turnstile Dataset
Exploratory data analysis (EDA) is a crucial step in any data project, helping us draw insights from the data. EDA is like a chef's preparation before presenting a specialty dish to guests: each technique reveals its own part of the story, just as each detail in the data contributes to the bigger picture. In essence, EDA lets us understand the story the data is trying to tell, and the insights we gain can help inform business decisions.
Concept
As Project 1 of the ISDSA Data Science Bootcamp, we were tasked with conducting an exploratory data analysis (EDA) of MTA turnstile data for the non-profit organization WomenTechWomenYes (WTWY). Our goal was to help WTWY determine the best locations for its street teams to engage potential gala attendees and raise awareness for women in tech. My analysis rests on several assumptions.
Assumptions
- Given WTWY's time and manpower constraints, my analysis concentrated on identifying the top subway stations by traffic volume, along with their peak periods, in order to optimize the placement of the street teams. With these insights, WTWY can allocate resources effectively and maximize outreach to potential attendees of its annual gala.
- One of my assumptions was that people interested in technology are more likely to live near the city center, which has a higher concentration of tech corporate offices. This assumption informed the choice of the best subway stations for WTWY's street teams.
- Because the WTWY gala is quickly approaching, I concentrated on a week's worth of MTA turnstile data as a representative sample of the weeks preceding the event. This gave us insight into traffic patterns and let us identify locations where the street teams can maximize their outreach.
Data Exploration and Manipulation
The data covers the period from January 2022 to March 2022.
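For reference, the snippets below assume the usual imports and that the weekly turnstile files have been concatenated into a single dataframe, MTA_data. The original post doesn't show this setup cell, so the file names here are purely illustrative:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter
# Concatenate the weekly files into one dataframe (file names are illustrative)
files = ['turnstile_220101.txt', 'turnstile_220108.txt']  # ... one file per week
MTA_data = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)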
To better understand the dataset, we referred to the documented descriptions of the column features. Further investigation revealed that the 'ENTRIES' and 'EXITS' columns hold cumulative counter values that grow over time, that the 'TIME' column advances in four-hour increments, and that each turnstile is identified by a unique combination of 'UNIT' and 'SCP' numbers. During data cleaning, we also discovered and removed some duplicate rows.
I combined the separate 'DATE' and 'TIME' columns into a single column of datetime type, which makes grouping and sorting operations more reliable. Thanks to this, we could easily manipulate date and time values and perform datetime-related calculations throughout the analysis.
MTA_data['DATETIME'] = pd.to_datetime(MTA_data['DATE'] + ' ' + MTA_data['TIME'])
To remove the cumulativity in the 'ENTRIES' and 'EXITS' columns, I used the diff() function to create new columns, 'ENTRY_RAW' and 'EXIT_RAW'. This transformation gave us the actual number of people passing through each turnstile in each time interval; a sketch of this step follows.
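The diff() call itself isn't shown in the original post; a minimal sketch, assuming the ('UNIT', 'SCP') turnstile key described above, might look like this:
# Sort so consecutive rows belong to the same turnstile, then de-cumulate
MTA_data = MTA_data.sort_values(['UNIT', 'SCP', 'DATETIME']).reset_index(drop=True)
MTA_data['ENTRY_RAW'] = MTA_data['ENTRIES'].diff()
MTA_data['EXIT_RAW'] = MTA_data['EXITS'].diff()
# A plain diff leaves artifacts at the boundary between two turnstiles,
# which is exactly the anomaly handled in the next step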
We discovered that the anomalous values in 'ENTRY_RAW' and 'EXIT_RAW' were caused either by row transitions between two different turnstiles in the dataframe or by a reset of a turnstile's counter. We corrected these anomalies by replacing them with the mean of the preceding and succeeding entry/exit values. This correction is reasonable because the counts between consecutive time periods can be approximated by interpolation.
# Mark anomalies (counter resets, turnstile boundaries) as NaN;
# .loc avoids the chained-assignment pitfall of the original
MTA_data.loc[MTA_data['ENTRY_RAW'] > 10000, 'ENTRY_RAW'] = np.nan
MTA_data.loc[MTA_data['ENTRY_RAW'] < 0, 'ENTRY_RAW'] = np.nan
# Replace each NaN with the mean of its neighbours (simple interpolation);
# this assumes NaNs are isolated and not at the array boundaries
raw_array = MTA_data['ENTRY_RAW'].values
nan_idx = np.where(np.isnan(raw_array))[0]
raw_array[nan_idx] = (raw_array[nan_idx - 1] + raw_array[nan_idx + 1]) / 2
MTA_data['ENTRY_RAW_1'] = raw_array
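The original only shows the cleanup for the entry counts; 'EXIT_RAW_1', used below, is presumably produced by the same procedure. A mirrored sketch:
# Mirror of the entry-count cleanup for the exit counts (assumed, not shown in the original)
MTA_data.loc[MTA_data['EXIT_RAW'] > 10000, 'EXIT_RAW'] = np.nan
MTA_data.loc[MTA_data['EXIT_RAW'] < 0, 'EXIT_RAW'] = np.nan
exit_array = MTA_data['EXIT_RAW'].values
exit_nan_idx = np.where(np.isnan(exit_array))[0]
exit_array[exit_nan_idx] = (exit_array[exit_nan_idx - 1] + exit_array[exit_nan_idx + 1]) / 2
MTA_data['EXIT_RAW_1'] = exit_array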
We refer to the sum of the raw entries and exits, that is, the total number of people passing through a turnstile in each period, as 'Traffic'.
MTA_data['TRAFFIC'] = MTA_data['ENTRY_RAW_1'] + MTA_data['EXIT_RAW_1']
Results
To keep the relationship between stations and graphs easy to follow, we focused our analysis on the 10 stations with the highest traffic.
# Total traffic per station, keeping the 10 busiest
top_stations = (MTA_data.groupby('STATION')[['TRAFFIC']]
                .sum()
                .sort_values('TRAFFIC', ascending=False)
                .head(10))
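Several of the plots format their axis ticks with a human_format helper that isn't shown in the original post; a minimal sketch could be:
def human_format(num, pos=None):
    # Abbreviate large tick values, e.g. 1200000 -> '1.2M'
    for unit in ('', 'K', 'M', 'B'):
        if abs(num) < 1000:
            return f'{num:g}{unit}'
        num /= 1000.0
    return f'{num:g}T'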
plt.figure(figsize=[8, 5])
ax = sns.barplot(data=top_stations.reset_index(), x='TRAFFIC', y='STATION', palette='Blues_r')
plt.xlabel('Total Traffic', weight='bold', fontsize=15)
plt.ylabel('Stations', weight='bold', fontsize=15)
ax.xaxis.set_major_formatter(FuncFormatter(human_format))
plt.title('Top 10 Busiest MTA Stations', weight='bold', fontsize=15)
# Annotate each bar with its total traffic in millions
for p in ax.patches:
    ax.annotate(f'{p.get_width() / 1e6:.1f}M', (p.get_width(), p.get_y() + 0.5))
sns.despine()
plt.savefig('barplot.png', transparent=True, bbox_inches='tight')
To make the weekly analysis more meaningful, we next looked at how traffic varies by day of the week.
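The 'DAYS' and 'DAYS_INDEX' columns used below don't appear in the snippets above; a plausible derivation from the 'DATETIME' column (my assumption, with Monday indexed 0) is:
MTA_data['DAYS'] = MTA_data['DATETIME'].dt.day_name()       # e.g. 'Monday'
MTA_data['DAYS_INDEX'] = MTA_data['DATETIME'].dt.dayofweek  # Monday = 0 ... Sunday = 6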
top_station_day_index = MTA_data.groupby(['STATION', 'DAYS', 'DAYS_INDEX']).agg({'TRAFFIC': 'sum'}).reset_index()[['STATION', 'DAYS', 'TRAFFIC', 'DAYS_INDEX']]
This is how we analyze entry and exit data on a day-by-day basis at the top 10 stations (MTA_10 below is the data restricted to those stations; see the sketch that follows).
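MTA_10 is never constructed in the original snippets; one way to derive it:
# Restrict to the top 10 stations; .copy() avoids chained-assignment warnings later
MTA_10 = MTA_data[MTA_data['STATION'].isin(top_stations.index)].copy()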
top_stations_day = MTA_10.groupby(['STATION', 'DAYS'])['TRAFFIC'].sum()
matrix_station_day = top_stations_day.unstack()
# Order the columns chronologically and the rows by overall traffic rank
matrix_station_day = matrix_station_day.reindex(columns=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
matrix_station_day = matrix_station_day.reindex(index=list(top_stations.index))
# Readable annotation labels in millions, e.g. '1.2M'
array = np.array(matrix_station_day.applymap(lambda x: str(round(x / 1000000, 1)) + 'M'))
For a different perspective, we can also plot the top 5 stations as a line graph.
penn = top_station_day_index[top_station_day_index["STATION"] == "34 ST-PENN STA"].sort_values("DAYS_INDEX")
grd = top_station_day_index[top_station_day_index["STATION"] == "GRD CNTRL-42 ST"].sort_values("DAYS_INDEX")
herald = top_station_day_index[top_station_day_index["STATION"] == "34 ST-HERALD SQ"].sort_values("DAYS_INDEX")
st86 = top_station_day_index[top_station_day_index["STATION"] == "86 ST"].sort_values("DAYS_INDEX")
port = top_station_day_index[top_station_day_index["STATION"] == "42 ST-PORT AUTH"].sort_values("DAYS_INDEX")
union = top_station_day_index[top_station_day_index["STATION"] == "14 ST-UNION SQ"].sort_values("DAYS_INDEX")
st23 = top_station_day_index[top_station_day_index["STATION"] == "23 ST"].sort_values("DAYS_INDEX")
st125 = top_station_day_index[top_station_day_index["STATION"] == "125 ST"].sort_values("DAYS_INDEX")
times = top_station_day_index[top_station_day_index["STATION"] == "TIMES SQ-42 ST"].sort_values("DAYS_INDEX")
fulton = top_station_day_index[top_station_day_index["STATION"] == "FULTON ST"].sort_values("DAYS_INDEX")
# Keep only the five busiest stations for the line graph
stations = [penn, grd, herald, st86, port]
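The lineplot below references top10_line_Data, which the original never defines; a reasonable construction (despite the name, it holds only the five stations selected above) is to concatenate the per-station frames:
top10_line_Data = pd.concat(stations, ignore_index=True)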
fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(data=top10_line_Data, x="DAYS", y="TRAFFIC", hue="STATION")
plt.ticklabel_format(style='plain', axis='y')
ax.yaxis.set_major_formatter(FuncFormatter(human_format))
# Shade the transition into the weekend
ax.axvspan("Friday", "Saturday", alpha=0.15, color='maroon')
ax.set_title('Top 5 Stations Total Traffic by Weekday', weight='bold')
To study traffic by time of day, we bucket each record into the four-hour audit windows:
def timeperiod(time):
    # Map a time-of-day to its four-hour audit window
    if time >= datetime.time(0, 0, 0) and time < datetime.time(4, 0, 0):
        return "12am-4am"
    elif time >= datetime.time(4, 0, 0) and time < datetime.time(8, 0, 0):
        return "4am-8am"
    elif time >= datetime.time(8, 0, 0) and time < datetime.time(12, 0, 0):
        return "8am-12pm"
    elif time >= datetime.time(12, 0, 0) and time < datetime.time(16, 0, 0):
        return "12pm-4pm"
    elif time >= datetime.time(16, 0, 0) and time < datetime.time(20, 0, 0):
        return "4pm-8pm"
    else:
        return "8pm-12am"
MTA_10['TIME'] = pd.to_datetime(MTA_10['TIME'], format='%H:%M:%S').dt.time
MTA_10['TIME_PERIOD'] = MTA_10['TIME'].apply(timeperiod)
matrix_list = []
for station in list(top_stations.index):
    df_station = MTA_10[MTA_10['STATION'] == station]
    group_day_time = df_station.groupby(['DAYS', 'TIME_PERIOD'])['TRAFFIC'].sum()
    matrix_day_time = group_day_time.unstack()
    # Fix the row (day) and column (time window) order for the heatmaps
    matrix_day_time = matrix_day_time.reindex(index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
    matrix_day_time = matrix_day_time.reindex(columns=["12am-4am", "4am-8am", "8am-12pm", "12pm-4pm", "4pm-8pm", "8pm-12am"])
    matrix_list.append(matrix_day_time)
fig, axn = plt.subplots(2, 5, sharex=True, sharey=True, figsize=(15, 6))
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
cbar_ax = fig.add_axes([.91, .3, .03, .4])  # one shared colorbar for all panels
for i, ax in enumerate(axn.flat):
    station = matrix_list[i]
    sns.heatmap(station, ax=ax, cmap=cmap,
                cbar=i == 0,
                cbar_ax=None if i else cbar_ax,
                linecolor='white', linewidths=0.5)
    ax.set_title(list(top_stations.index)[i])
    ax.set_xlabel('')
    ax.set_ylabel('')
plt.savefig('heatmap2.png', transparent=True, bbox_inches='tight')
Breaking the heat map down by day versus time for each station reveals another interesting fact: stations are generally busiest in the late afternoon and evening, especially on weekdays. Given the evening commute, this is not surprising. Besides the crowded entrances and exits in the morning, lunch breaks and rush hours should also be considered.
Based on census distributions, we analyzed the population densities and demographics surrounding the top 10 stations. This information can guide the allocation of street teams for the gala, focusing on areas with a higher density of people interested in tech and a stronger representation of women in tech. Such targeting allows a more efficient use of resources and maximizes the impact of the street teams' efforts.
cencus["Population"] = cencus["Population"].astype('int64')
cencus["FemalePercent"] = cencus["FemalePercent"].astype('float')
cencus["BachelorsDegreePercent"] = cencus["BachelorsDegreePercent"].astype('float')
cencus["MedianIncome"] = cencus["MedianIncome"].astype('int64')
cencus["TotalEmployer"] = cencus["TotalEmployer"].astype('int64')
cencus["TotalEmployment"] = cencus["TotalEmployment"].astype('int64')
cencus["AnnualPayroll"] = cencus["AnnualPayroll"].astype('int64')
cencus["AllFirms"] = cencus["AllFirms"].astype('int64')
cencus["WomenOwnedFirms"] = cencus["WomenOwnedFirms"].astype('int64')
# One bar chart per census metric; large-value panels use the human_format tick formatter
panels = [("Population", 'Population', True),
          ("FemalePercent", 'Female Percent', False),
          ("BachelorsDegreePercent", 'Bachelors Degree Percent', False),
          ("MedianIncome", 'Median Income', False),
          ("TotalEmployer", 'Total Employer', False),
          ("TotalEmployment", 'Total Employment', True),
          ("AnnualPayroll", 'Annual Payroll', True),
          ("AllFirms", 'All Firms', False),
          ("WomenOwnedFirms", 'Women Owned Firms', False)]
plt.figure(figsize=[20, 17])
plt.style.use('default')
plt.suptitle('New York Borough Census Data', fontsize=18, weight="bold", c="Black")
for i, (col, title, large) in enumerate(panels, start=1):
    plt.subplot(3, 3, i)
    ax = sns.barplot(x="Borough", y=col, data=cencus, palette='flare')
    plt.xlabel('')
    plt.ylabel('')
    if large:
        ax.yaxis.set_major_formatter(FuncFormatter(human_format))
    plt.title(title, weight="bold")
plt.savefig('NewyorkBoroughCensusData.png', dpi=200)
Conclusions
Based on our analysis of the MTA turnstile data, we recommend that WTWY concentrate their street team efforts on the top 10 stations with the highest traffic, during the late afternoon to evening periods on weekdays. If WTWY faces resource constraints, it might be best to skip stations like Flushing-Main and 34 ST-PENN, as these could be primarily residential or tourist areas with a lower concentration of tech corporate offices. It might also be more beneficial to focus on Manhattan, given its higher population density. By taking these factors into account, WTWY can maximize their impact in collecting emails and distributing free tickets to their annual gala, while raising awareness and increasing participation for women in tech.
I hope this first article of mine gave you some insight into exploratory data analysis. From now on, I will make a point of sharing what I learn, including all of the projects I completed during the ISDSA Data Science Bootcamp and the projects built on that knowledge. You can find the source code on my GitHub page. Stay tuned!