MTA ridership
STA304 - Fall 2023 - Assignment 1
Ruilin Peng 1005762765
2023-10-01
Part 1: Designing a survey
Goal
As the largest public transportation agency in North America (Ley, 2023), the Metropolitan Transportation Authority (MTA) in New York City needs to study how often passengers use its services in order to schedule trains and buses more effectively. This survey targets passengers who used MTA service during September 2023. The project studies the number of times each passenger uses the MTA per month to help the agency plan capacity.
Procedure
Since this survey is aimed at passengers who used MTA service in September 2023, the target population is every individual who used MTA service in that month. With the MTA's agreement to assist, the frame population is every individual who follows @MTA on Twitter/X or is on the MTA's mailing list. We will collect data by having @MTA post the questionnaire on Twitter/X and email it to everyone on the mailing list. The sampled population is therefore everyone who completes the survey, either on Twitter/X or through the emailed link. As an incentive, everyone who completes the questionnaire will be entered into a draw to win a new MacBook.
For the sampling method, we will use simple random sampling: we will randomly select 1000 individuals for the study.
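The selection step can be sketched in R as follows. This is only an illustration under stated assumptions: `frame_size` is a hypothetical frame of 10000 individuals (the real size of the follower and mailing lists is not known), and the seed is arbitrary.

```r
# Minimal sketch of simple random sampling without replacement.
# frame_size is a hypothetical stand-in for the number of people in the frame.
set.seed(304)                      # arbitrary seed, only for reproducibility
frame_size  <- 10000
sample_size <- 1000

frame_ids   <- seq_len(frame_size) # one id per individual in the frame
sampled_ids <- sample(frame_ids, size = sample_size, replace = FALSE)

# Under simple random sampling every individual has the same
# inclusion probability n / N.
inclusion_prob <- sample_size / frame_size
head(sampled_ids)
```

Sampling without replacement guarantees that no individual is selected twice.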
The main advantages of our procedure are its low cost and ease of execution, since distributing a survey through social media and email is common practice and costs nothing. However, there are limitations as well. First, a single MacBook as the prize might not be motivating enough, since the chance of winning is clearly low. Second, as residents of a megacity, New Yorkers use public transportation frequently, so it may be hard for them to recall exactly how many times they used the service in a given month.
Showcasing the survey.
Question 1 To the best of your knowledge, how many times did you use the MTA for transit in September 2023?
_____ (integer answer only)
<The question is straightforward, but it is hard to recall the exact number, so some measurement error is inevitable.>
Question 2 On a scale from 1 to 5, rate your overall experience with MTA service. (Multiple choice)
- 1
- 2
- 3
- 4
- 5
<By taking passengers' ratings into account, we can study the relationship between satisfaction level and frequency of usage.>
Question 3 Which borough do you live in? (Multiple choice)
- Manhattan
- Brooklyn
- Queens
- The Bronx
- Staten Island
<By taking borough information into account, we can study the relationship between borough of residence and frequency of usage. However, this gives us 5 indicator variables to work with.>
Part 2: Data Analysis
Data Simulation.
To simulate a sample obtained through simple random sampling in R, we begin by assigning each of the 1000 sampled individuals a random id between 1 and 10000. We then generate 1000 values for the number of rides each individual took during September 2023 from a normal distribution with a mean of 90 and a standard deviation of 20, rounded to 0 decimal places because the number of rides must be an integer. Next, we draw 1000 values from 1 to 5 inclusive to represent each individual's rating of their experience. For income, we generate 1000 values for each individual's monthly income from a normal distribution with a mean of 51000 USD and a standard deviation of 2000 USD, rounded to 2 decimal places because the smallest unit is 0.01 USD. For time per ride, we generate 1000 values for the usual time each individual spends on a ride from a normal distribution with a mean of 30 minutes and a standard deviation of 20 minutes, rounded to 0 decimal places.
Assuming each individual lives in exactly one borough, we first draw 1000 values from (“Manhattan”, “Brooklyn”, “Queens”, “Bronx”, “StatenIsland”) for where each individual lives. We then initialize 5 indicator variables, each of length 1000 and filled with “N”, which record whether each individual lives in the given borough. Next, we iterate through the borough values generated previously and set the corresponding indicator variable to “Y” when the value matches the name of the indicator variable. Finally, we draw 1000 values from (“Y”, “N”) to simulate whether each individual takes transfers.
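The simulation steps described above could look roughly like the sketch below. It is not the original code: the seed, the object names `sim` and `home`, and the choice to draw ids without replacement are assumptions, and the indicator columns are filled with a vectorised `ifelse()` per borough rather than by first creating columns of “N”.

```r
set.seed(304)  # arbitrary seed (assumption); the original seed is not stated
n <- 1000
boroughs <- c("Manhattan", "Brooklyn", "Queens", "Bronx", "StatenIsland")

sim <- data.frame(
  id     = sample(1:10000, n),                        # ids from 1..10000 (assumed unique)
  rides  = round(rnorm(n, mean = 90, sd = 20)),       # rides in September 2023
  rating = sample(1:5, n, replace = TRUE),            # rating of experience
  time   = round(rnorm(n, mean = 30, sd = 20)),       # minutes per ride
  income = round(rnorm(n, mean = 51000, sd = 2000), 2)
)

home <- sample(boroughs, n, replace = TRUE)           # one borough per individual
for (b in boroughs) {
  sim[[b]] <- ifelse(home == b, "Y", "N")             # indicator column for borough b
}
sim$transfer <- sample(c("Y", "N"), n, replace = TRUE)

str(sim)
```

Because each individual is assigned exactly one value in `home`, exactly one of the five indicator columns is “Y” in every row.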
Data showcase
The data set we simulated in R consists of 1000 observations of MTA passengers. For each passenger it records an id, the number of rides per month, the rating of experience, the time per ride, the income, whether they live in Manhattan, whether they live in Brooklyn, whether they live in Queens, whether they live in the Bronx, whether they live in Staten Island, and whether they transfer.
Below is a histogram of the variable we are interested in: the number of rides per month.
Figure 1 is a histogram of the number of MTA rides per month. The distribution is roughly symmetric, with the mode close to 90.
min | max | mean | median | sd |
---|---|---|---|---|
14 | 145 | 89.379 | 90 | 19.5343 |
From the summary table, we can see that the mean and median are close, which reflects the symmetry of the distribution of rides per month. The standard deviation (19.5343) is fairly large, so individual ride counts vary considerably, which tends to widen the confidence interval for a given sample size.
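A summary table of this shape can be computed with base R, as in the sketch below. The `rides` vector here is re-simulated with an arbitrary seed (an assumption), so the numbers it prints will differ from those in the table above.

```r
set.seed(304)  # arbitrary seed (assumption), so values differ from the report
rides <- round(rnorm(1000, mean = 90, sd = 20))

# One-row table with the same five statistics as the summary table.
rides_summary <- data.frame(
  min    = min(rides),
  max    = max(rides),
  mean   = mean(rides),
  median = median(rides),
  sd     = sd(rides)
)
print(rides_summary)
```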
Methods
The simulated data set yields a sample mean of 89.379, which is only the arithmetic average of the number of MTA rides per month in the sample. To better understand the population, we will construct a 95% t confidence interval, which gives a plausible range of values for the population average number of MTA rides per month and helps the MTA predict passenger numbers and better allocate resources. To compute the interval, we assume that the observations are independent. The confidence interval for \(\mu\) is then \(\bar x \pm t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}\), where \(\bar x\) is the sample mean, \(n\) is the sample size, \(s\) is the sample standard deviation, and \(t_{\alpha/2,n-1}\) is the critical value of the t-distribution with \(n-1\) degrees of freedom and \(\alpha = 0.05\).
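As a rough check, the interval can be computed directly from the summary statistics reported in the data showcase (mean 89.379, standard deviation 19.5343, n = 1000). The variable names below are illustrative, and the summary values are rounded, so the result may differ from one computed on the raw data in the last decimals.

```r
x_bar <- 89.379    # sample mean from the summary table
s     <- 19.5343   # sample standard deviation from the summary table
n     <- 1000      # sample size
alpha <- 0.05

se     <- s / sqrt(n)                     # estimated standard error of the mean
t_crit <- qt(1 - alpha / 2, df = n - 1)   # t critical value with n - 1 df
ci     <- c(lower = x_bar - t_crit * se,
            upper = x_bar + t_crit * se)
round(ci, 2)
```

The result can be compared with the interval reported in the Result section.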
The coronavirus outbreak and the consequent “New York State on PAUSE” executive order closing all non-essential businesses sent both subway and bus ridership to an unprecedented low in April 2020, when subway ridership was at 8% of its 2019 level and bus ridership at 23% (MTA, n.d.). Since it is now 2023, we perform a hypothesis test of whether the average number of rides per month is greater than 60, that is, whether on average each individual uses the MTA more than twice a day. Since the sample size is large, we rely on the Central Limit Theorem for the approximate normality of the sample mean. The null hypothesis is that the average monthly ridership per person is exactly 60, and the alternative hypothesis is that it is more than 60: \(H_0: \mu = 60, H_a: \mu > 60\), where \(\mu\) is the population average monthly MTA ridership per person. The test statistic is \(t = \frac{\bar x - \mu_0}{s/\sqrt{n}}\), where \(\bar x\) is the sample mean, \(n\) is the sample size, \(s\) is the sample standard deviation, and \(\mu_0 = 60\).
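The one-sided test can be sketched from the same reported summary statistics. The sketch uses the sample standard deviation s in the denominator and a t-distribution with n - 1 degrees of freedom for the p-value. The variable names are illustrative.

```r
x_bar <- 89.379    # sample mean from the summary table
s     <- 19.5343   # sample standard deviation from the summary table
n     <- 1000      # sample size
mu_0  <- 60        # hypothesised mean under the null hypothesis

t_stat  <- (x_bar - mu_0) / (s / sqrt(n))
# Upper-tail probability, because the alternative is mu > mu_0.
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

round(t_stat, 4)
p_value
```

With the raw data available, `t.test(rides, mu = 60, alternative = "greater")` would carry out the same test in a single call, assuming the ride counts are stored in a numeric vector named `rides`.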
Result
The 95% t confidence interval is (88.16681, 90.59119), which means we are 95% confident that the population mean, the average monthly ridership of all MTA passengers, lies between about 88.2 and 90.6 rides, that is, around 90.
item | value |
---|---|
Test statistic | 47.5597 |
p-value | < 0.0001 |
The hypothesis test of \(H_0: \mu = 60\) against \(H_a: \mu > 60\) gives a t statistic of 47.5597 and a p-value that rounds to 0.0000 (less than 0.0001). This is strong evidence against the null hypothesis and in favor of the alternative, suggesting that in 2023 the average monthly MTA ridership per person is greater than 60. In other words, on average each passenger in the population used MTA service more than twice a day, which makes sense since most Covid restrictions have been lifted.
Part 3: Reference
Generative AI
No generative AI was used.
Bibliography
- Ley, A. (2023, April 28). M.T.A. averts fiscal crisis as New York Strikes Budget deal. The New York Times.
- Subway and bus ridership for 2020. MTA. (n.d.). https://new.mta.info/agency/new-york-city-transit/subway-bus-ridership-2020
- Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
- Yihui Xie (2022). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.41.
Appendix
## Rows: 1,000
## Columns: 11
## $ id <int> 202, 5821, 3022, 706, 3196, 6575, 5634, 4305, 6995, 8190,…
## $ rides <dbl> 99, 69, 80, 107, 125, 107, 129, 80, 94, 65, 66, 61, 78, 9…
## $ rating <int> 4, 1, 4, 3, 4, 1, 3, 1, 1, 4, 2, 3, 1, 5, 4, 4, 4, 2, 5, …
## $ time <dbl> 38, 25, 36, 45, 28, 37, 13, 43, 36, 33, 8, 41, 26, 27, 36…
## $ income <dbl> 50591.68, 52182.07, 51448.58, 52295.76, 49086.32, 47017.1…
## $ Manhattan <chr> "Y", "N", "N", "Y", "N", "N", "N", "N", "N", "N", "N", "N…
## $ Brooklyn <chr> "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "Y", "Y", "N…
## $ Queens <chr> "N", "N", "N", "N", "Y", "Y", "N", "N", "N", "N", "N", "N…
## $ Bronx <chr> "N", "N", "N", "N", "N", "N", "Y", "N", "Y", "N", "N", "Y…
## $ StatenIsland <chr> "N", "Y", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N…
## $ transfer <chr> "N", "N", "Y", "N", "N", "Y", "N", "N", "Y", "Y", "Y", "Y…