Bike Share Data Analysis Case Study

The following is a data analysis case study that I completed for a Data Analysis Certificate Capstone project. Our completion of this project was optional and we had a few options on what case study to choose or come up with our own. I chose one of the sample case studies included in the program.
We were given access to the data, some tips on how to proceed through each step of the data analysis workflow and then were expected to share it via an online portfolio of some kind. This is what I decided to use as the delivery mechanism for the online portfolio. My presentation of the case study is linked at the bottom but between here and there are the steps I took given my starting point was a collection of raw CSVs. Enjoy!
Ask
Determine how users of the bike share program differ based on casual riders vs those with memberships
Prepare
For this data analysis we will use the data located here. We will use the quarterly data from the last few quarters of 2019 and the first quarter of 2021.
The data in these files contains information about the rides using the bike share service including data points such as:
- start time and end time
- start location and end location
- member or non-member
Process
Since the spreadsheets were already fairly large I decided to process the CSVs with R rather than using a spreadsheet tool. I knew once I got the data into R and made some initial analysis I’d need to export it into a database so I could run queries on it. I am sure R could have done everything I needed but my background is stronger in SQL than it is in R in terms of gathering statistics. Once I figured out some basic insights about the users I knew Tableau was going to be my tool for making those visualizations.
In order to process the data I took the following steps:
- Determine the column names across the four quarterly files
- Rename any column inconsistencies to make merging the dataframes possible
- Mutate a few of the columns using as.character
- Combine the dataframes from all of the four quarter dataframes
- Removed columns that were not in use for all the files such as lat, long, gender, birthyear, and a few others
- Consolidate the labels in the membership column such that member and Subscriber are a single label of ‘member’. Do the same for Customer -> ‘casual’
- Added some calculated columns based on the date field so we can easily access month, year, day of the week
- Calculated the ride length based on start and end time of the ride
- Dropped any rows that had a start location of ‘HQ QR’ or that had a negative ride length
Analyze
Once I had the data in a single dataframe I ran some basic analysis in R for statistics. After that I exported it to a CSV file such that I could analyze it further with SQL and Tableau.
In SQL I loaded it into a local Postgres DB I did the following checks:
- What were the top 10 most popular stations for members vs casual riders?
- Count the total number of rides for each rider type and then compare it with the number of times that the start and end location were the same
- How many rides per month across the two rider types are there and what are the most popular months?
- Determine what month has the highest percentage of casual riders?
- Find the average ride length for each rider type. Compare that with the median to see what seems more reasonable as there were some really long rides in the less popular months
Share
After using SQL to get some basic analysis I imported the combined CSV into Tableau to do some further exploration and more visualizations to see what else the data was telling me.
In the presentation below you can see the final visualizations that I came up with after testing out various possibilities.
Act
Finally I built a presentation of my findings using Google Slides. Here is the end result with my top three recommendations.