EXPLORATORY DATA ANALYSIS ON TURNSTILE DATA (MTA DATA)

Sıla Kazan
5 min read · Mar 10, 2022


Hi everyone,

Today I want to tell you about my project that I did a long time ago but couldn’t find time to explain.

Before starting the EDA (Exploratory Data Analysis) of our project, let us talk a little about what EDA is. As the term "data" has come into wider use and large-scale studies on data have multiplied, EDA, which emerged so that data could be represented correctly to its users, has taken an important place in data science and data engineering.

EDA basically refers to the critical process of performing initial investigations on data in order to detect anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. As students of Istanbul Data Science Academy, we analyzed the New York turnstile data as a group project and presented it to the users (clients).

Turnstiles during the pandemic

Here is the link to the dataset:

http://web.mta.info/developers/turnstile.html

After obtaining the dataset, we followed these steps, in order:

  • Collecting the data
  • Getting rid of null values in the data
  • Adding some time-related features, including a datetime column, to make the analysis easier
  • Performing statistical analysis of the dataset and presenting the results in graphs
  • Finding the busiest subway stations, the most crowded day, and the best time slot (in three separate code blocks: before, during, and after the pandemic)

So, to start collecting the data, we used the code below:

import pandas as pd

num_weeks = 26

# initialise the date for the first week of the dataset (week ending on this date)
filedate = pd.Timestamp('2020-02-29 00:00:00')

# URL template for the weekly MTA turnstile files
filename_regex = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"

filelist = []
for numfiles in range(num_weeks):
    # build the filename for the week by formatting the date as YYMMDD
    filedate_str = str(filedate.year)[2:4] + str(filedate.month).zfill(2) + str(filedate.day).zfill(2)
    filename = filename_regex.format(filedate_str)

    # read the weekly file and append it to the list of frames to be concatenated
    weekly_data = pd.read_csv(filename, parse_dates=['DATE'])
    filelist.append(weekly_data)

    # advance to the next week
    filedate += pd.Timedelta(days=7)

MTA_data_covid = pd.concat(filelist, axis=0, ignore_index=True)

As can be seen from the code, since the data on the MTA site comes in weekly chunks, we decided to combine and evaluate the data over a total of 26 weeks. This is how the dataset we obtained as a result of this merge looks:
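If you want to run the same quick checks yourself, a minimal sketch (using the MTA_data_covid frame built above) would be:

# quick inspection of the merged frame (a sketch; the actual output is in the screenshot)
print(MTA_data_covid.shape)    # rows and columns after the 26-week merge
print(MTA_data_covid.columns)  # raw column names coming from the MTA files
print(MTA_data_covid.head())   # first few rows of the data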

We then made a new column named "Turnstile" that combines the first three columns of the dataset. This part is needed for the statistical analysis in the later steps.
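A minimal sketch of that step: the raw files identify a turnstile by their first three columns (C/A, UNIT and SCP); the exact new column name used here is my choice.

# combine the first three columns into a single turnstile identifier
# (the column name "TURNSTILE" is assumed here)
MTA_data_covid["TURNSTILE"] = (
    MTA_data_covid["C/A"] + "-" +
    MTA_data_covid["UNIT"] + "-" +
    MTA_data_covid["SCP"]
)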

To avoid duplicate data, we used the drop_duplicates function, but it turned out there were no duplicate values.
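The check itself is only a couple of lines with pandas; a sketch:

# a sketch of the duplicate check: the row count stayed the same, i.e. no duplicates
rows_before = len(MTA_data_covid)
MTA_data_covid = MTA_data_covid.drop_duplicates()
print(rows_before - len(MTA_data_covid), "duplicate rows removed")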

In the next step, we combined the date and time columns into a single datetime column for mathematical calculations and graphical representation.
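A sketch of that merge, assuming the raw DATE and TIME columns (TIME is in HH:MM:SS form in the weekly files) and a new DATETIME column name of my choosing:

# build a single DATETIME column from the separate DATE and TIME columns
MTA_data_covid["DATETIME"] = pd.to_datetime(
    MTA_data_covid["DATE"].astype(str) + " " + MTA_data_covid["TIME"],
    format="%Y-%m-%d %H:%M:%S"
)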

Then, we derived the name of the weekday corresponding to each date.
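Since DATE is already parsed as a datetime, this is a single pandas call; a sketch (the DAY_NAME column name is assumed):

# name of the weekday for every record
MTA_data_covid["DAY_NAME"] = MTA_data_covid["DATE"].dt.day_name()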

The observations show that there are 379 unique stations and 5006 unique turnstiles. Likewise, we wanted to show the minimum entries and exits per station.
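A sketch of how those numbers can be obtained; I assume the column names have been stripped of whitespace (the EXITS header is padded with spaces in the raw files):

# counts quoted above
print(MTA_data_covid["STATION"].nunique())    # 379 unique stations
print(MTA_data_covid["TURNSTILE"].nunique())  # 5006 unique turnstiles

# minimum raw ENTRIES / EXITS counter values per station
station_min = MTA_data_covid.groupby("STATION")[["ENTRIES", "EXITS"]].min()
print(station_min.head())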

With these small analyses done, we moved on to the graphical representations.

We should mention that working with three distinct periods, before, during, and after the pandemic, shows that turnstile usage dropped during the pandemic.
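For this, we split the merged frame into three separate frames by date. The sketch below only illustrates the idea, with hypothetical cut-off dates; the exact boundaries we used are not reproduced here.

# hypothetical cut-off dates; the actual boundaries used in the project may differ
before_covid = MTA_data_covid[MTA_data_covid["DATE"] < "2020-03-15"]
during_covid = MTA_data_covid[(MTA_data_covid["DATE"] >= "2020-03-15") &
                              (MTA_data_covid["DATE"] < "2020-06-08")]
after_covid  = MTA_data_covid[MTA_data_covid["DATE"] >= "2020-06-08"]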

When comparing the entries, we observed a similar pattern in the month-based values.

To illustrate the 10 busiest stations:

Before covid,

During covid,

After covid,
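Here is a sketch of how such a ranking can be computed. The ENTRIES column is a cumulative counter per turnstile, so per-turnstile differences are taken first; the outlier threshold and the use of the before_covid frame from the earlier sketch are assumptions.

# rank stations by total entries derived from the cumulative counters
df = before_covid.sort_values(["TURNSTILE", "DATETIME"])
df["ENTRY_DIFF"] = df.groupby("TURNSTILE")["ENTRIES"].diff()

# drop counter resets and implausible jumps (threshold chosen arbitrarily here)
df = df[(df["ENTRY_DIFF"] >= 0) & (df["ENTRY_DIFF"] < 10000)]

top10 = (df.groupby("STATION")["ENTRY_DIFF"]
           .sum()
           .sort_values(ascending=False)
           .head(10))
print(top10)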

In fact, when you look at how the station rankings change across these periods, you will see distinct shifts between stations.

In addition to these observations, the crowdedness-per-day relation is shown below.

Specifically, the 4-hour time slots at the station named 42 ST-PORT AUTH are shown in the pie chart below.
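A sketch of how such a pie chart can be drawn with matplotlib, building on the DATETIME column and the entry differences from the sketches above (the station name is written as it appears in the data):

import matplotlib.pyplot as plt

# share of entries per 4-hour slot at 42 ST-PORT AUTH (a sketch)
station_df = df[df["STATION"] == "42 ST-PORT AUTH"].copy()
station_df["TIME_SLOT"] = (station_df["DATETIME"].dt.hour // 4) * 4

slot_totals = station_df.groupby("TIME_SLOT")["ENTRY_DIFF"].sum()
labels = [f"{h:02d}:00-{(h + 4) % 24:02d}:00" for h in slot_totals.index]

plt.pie(slot_totals, labels=labels, autopct="%1.1f%%")
plt.title("42 ST-PORT AUTH entries by 4-hour time slot")
plt.show()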

As seen from the figure, the busiest times of day at the busiest station are:

  • Busiest time before covid -> between 20:00 and 00:00
  • Busiest time during covid -> between 16:00 and 20:00
  • Busiest time after covid -> between 16:00 and 20:00

As can be understood from this relation, since the pandemic people pay more attention to doing their daily routines on time and take care not to arrive at their destinations late.

The last point is the relation between time of day and day of week at the busiest station we found. That is:

Before covid,

During covid,

After covid,
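If you want to reproduce this kind of view, a pivot table over the weekday and the 4-hour slot is one way to do it; a sketch building on the columns from the earlier snippets:

# total entries at the busiest station, broken down by weekday and 4-hour slot
day_slot = station_df.pivot_table(index="DAY_NAME", columns="TIME_SLOT",
                                  values="ENTRY_DIFF", aggfunc="sum")
print(day_slot)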

Summarizing this last relation: before covid, people were much more willing to go out at the weekends, while after covid they preferred to go out on weekdays.

This project could be extended in future work, for example:

  • The dataset could be evaluated together with the locations of business and technology centers throughout the city.
  • Usage of other modes of transportation could also be observed, so that the traffic transitions made by passengers can be captured.
  • The dataset could be extended further.

Thanks for reading. Please don't forget that "data is the new oil!" Have a good day.

See you in the next project :)

Project details are on GitHub:

If you have any questions, I would be very happy to hear from you…
