Cleaning process
This page outlines the process and assumptions used to combine, clean, and transform raw trip data from Cyclistic (Divvy) in preparation for analysis. At the time of collection, data through October 2022 were available; the 12 most recent months were selected for this analysis.
Cyclistic is a fictional company, but the data are obtained from Divvy Bikes, a real bike-sharing program in Chicago, IL. A description of the datasets and their files can be accessed here.
1. Loading packages
library(tidyverse) #For data cleaning and analysis
library(lubridate) #For date-time handling (time zones, intervals, durations)
2. Combining datasets
The structure of these CSV files was inspected in Excel prior to import and binding. The structure of the merged dataset was checked again after binding.
#Import data files
nov21 <- read_csv("Data/202111-divvy-tripdata.csv")
dec21 <- read_csv("Data/202112-divvy-tripdata.csv")
jan22 <- read_csv("Data/202201-divvy-tripdata.csv")
feb22 <- read_csv("Data/202202-divvy-tripdata.csv")
mar22 <- read_csv("Data/202203-divvy-tripdata.csv")
apr22 <- read_csv("Data/202204-divvy-tripdata.csv")
may22 <- read_csv("Data/202205-divvy-tripdata.csv")
jun22 <- read_csv("Data/202206-divvy-tripdata.csv")
jul22 <- read_csv("Data/202207-divvy-tripdata.csv")
aug22 <- read_csv("Data/202208-divvy-tripdata.csv")
sep22 <- read_csv("Data/202209-divvy-tripdata.csv")
oct22 <- read_csv("Data/202210-divvy-tripdata.csv")
#Join data files
all_rides <- bind_rows(nov21,dec21,jan22,feb22,mar22,apr22,may22,jun22,jul22,
aug22,sep22,oct22)
#Remove the individual monthly data frames and free memory
rm(nov21,dec21,jan22,feb22,mar22,apr22,may22,jun22,jul22,aug22,sep22,oct22)
gc()
str(all_rides)
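As an aside, the twelve imports above could be written more compactly by reading every matching file in the Data folder. The following is a sketch, and assumes the folder contains only the twelve monthly trip-data CSVs.
#Alternative: import all monthly files in one step (sketch; assumes Data/ holds only the 12 CSVs)
csv_files <- list.files("Data", pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
all_rides <- map_dfr(csv_files, read_csv)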
2.1 Assigning time zone
The datetime variables started_at and ended_at were both automatically imported as POSIXct data types. The following code assigns the America/Chicago time zone; the roll_dst argument handles the small number of rides whose timestamps fall on a Daylight Saving Time transition.
#Set time zone
all_rides <- all_rides %>%
  mutate(started_at = force_tz(started_at,
                               tzone = "America/Chicago",
                               roll_dst = c("pre", "post")),
         ended_at = force_tz(ended_at,
                             tzone = "America/Chicago",
                             roll_dst = c("pre", "post")))
#Confirm time zone is America/Chicago
tz(all_rides$started_at)
[1] "America/Chicago"
tz(all_rides$ended_at)
[1] "America/Chicago"
3. Data transformation
3.1 Create a variable for the day of week
This analysis will examine which days of the week are most popular with different rider groups. The following code creates a day_of_week variable as a factor with levels ordered Sunday through Saturday.
weekday_levels <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday')
all_rides <- all_rides %>%
mutate(day_of_week = factor(weekdays(started_at), levels = weekday_levels))
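An equivalent approach, noted here as a sketch, is lubridate's wday() with label = TRUE, which returns an ordered factor directly and avoids defining the levels by hand.
#Equivalent alternative using lubridate's wday() (sketch); returns an ordered factor
all_rides <- all_rides %>%
  mutate(day_of_week = wday(started_at, label = TRUE, abbr = FALSE, week_start = 7))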
3.2 Create a variable for the ride duration
With the ‘started_at’ and ‘ended_at’ times, it is possible to calculate a key variable for analysis: the duration of each ride.
#Create ride_duration to measure duration of ride
all_rides <- all_rides %>%
  mutate(ride_duration = as.duration(started_at %--% ended_at))
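For a concrete sense of what the interval-to-duration conversion produces, here is a small sketch with hypothetical timestamps; as.numeric() can convert the resulting Duration to any unit.
#Sketch: the %--% interval and as.duration() conversion on hypothetical timestamps
s <- ymd_hms("2022-07-04 12:00:00", tz = "America/Chicago")
e <- ymd_hms("2022-07-04 12:25:30", tz = "America/Chicago")
as.duration(s %--% e)                          #1530s (~25.5 minutes)
as.numeric(as.duration(s %--% e), "minutes")   #25.5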
3.3 Convert categorical variables to factors
The data has two categorical variables, ‘rideable_type’ and ‘member_casual’, which can be analyzed more easily once recast as factors.
#Convert 'rideable_type' and 'member_casual' to factors
all_rides <- all_rides %>%
  mutate(rideable_type = factor(rideable_type),
         member_casual = factor(member_casual))
4. Examine data
Before cleaning, the data are examined to identify errors, blank fields, and values outside the expected range for this analysis.
4.1 Examine primary key
The data has a primary key, which is the ‘ride_id’ variable. This code will check the integrity of the data by ensuring that each ‘ride_id’ is unique.
#Check that each ride_id is unique
all_rides %>%
distinct(ride_id) %>%
summarise(total_count=n(),.groups='drop') %>%
arrange(-total_count)
# A tibble: 1 × 1
total_count
<int>
1 5755694
The result, 5,755,694, is equal to the number of rows in all_rides. Each ride_id is unique.
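The same check can be expressed as a single logical test; the following sketch returns TRUE when every ride_id is unique.
#One-line check: TRUE if every ride_id is unique
n_distinct(all_rides$ride_id) == nrow(all_rides)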
4.2 Durations out of expected range
Divvy’s data page states that “The data has been processed to remove trips that are taken by staff as they service and inspect the system; and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).” Despite this, the minimum ride duration in the data is negative, well below 60 seconds, and 118,453 records have a ride duration under 60 seconds.
#Examine trips < 60s in duration
all_rides %>%
filter(ride_duration<60) %>%
summarise(total_count=n(),.groups='drop') %>%
arrange(-total_count)
# A tibble: 1 × 1
total_count
<int>
1 118453
#118,453 rides with ride_duration < 60s
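Because the minimum duration is negative, it is also worth splitting these short rides into negative durations versus 0-59 seconds; a sketch (counts not shown here):
#Sketch: split the sub-60-second rides into negative vs. 0-59s groups
all_rides %>%
  filter(ride_duration < 60) %>%
  count(negative_duration = ride_duration < 0)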
4.3 Null values
The data should be complete, without any blank fields. The following code counts the null values in each column.
all_rides %>%
summarise(across(everything(), ~ sum(is.na(.x))))
# A tibble: 1 × 15
ride_id rideable_type starte…¹ ended…² start…³ start…⁴ end_s…⁵ end_s…⁶ start…⁷
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 878177 878177 940010 940010 0
# … with 6 more variables: start_lng <int>, end_lat <int>, end_lng <int>,
# member_casual <int>, day_of_week <int>, ride_duration <int>, and
# abbreviated variable names ¹started_at, ²ended_at, ³start_station_name,
# ⁴start_station_id, ⁵end_station_name, ⁶end_station_id, ⁷start_lat
There are null values in start_station_name, start_station_id, end_station_name, end_station_id, end_lat, and end_lng.
The number of null values for station names and ids is concerning: nulls in the station variables account for 23.37% of the data. The missing end latitude and longitude values are less concerning, accounting for roughly 0.1% of the data.
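For reference, those shares can be computed directly from the null counts; a sketch:
#Sketch: proportion of rides missing station info vs. missing both end coordinates
mean(is.na(all_rides$start_station_name) | is.na(all_rides$end_station_name)) #~0.234
mean(is.na(all_rides$end_lat) & is.na(all_rides$end_lng))                      #~0.001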
In an ideal situation, an analyst could directly ask those who collect the data about null values or other anomalous entries. This case study will use publicly available information to solve data integrity issues.
4.3(a) Null values in end_lat and end_lng
#Isolate null values of 'end_lat' and 'end_lng' for analysis
no_end_latlng <- all_rides %>%
filter(is.na(end_lat) | is.na(end_lng))
summary(no_end_latlng)
ride_id rideable_type started_at
Length:5835 classic_bike :3265 Min. :2021-11-01 07:03:54.00
Class :character docked_bike :2570 1st Qu.:2022-05-19 13:17:49.50
Mode :character electric_bike: 0 Median :2022-07-02 15:50:32.00
Mean :2022-06-24 21:47:55.81
3rd Qu.:2022-08-20 21:00:17.50
Max. :2022-10-31 21:32:38.00
ended_at start_station_name start_station_id
Min. :2021-11-01 07:18:08.00 Length:5835 Length:5835
1st Qu.:2022-05-20 23:36:26.00 Class :character Class :character
Median :2022-07-04 12:56:23.00 Mode :character Mode :character
Mean :2022-06-26 23:02:02.21
3rd Qu.:2022-08-23 17:27:50.50
Max. :2022-11-07 04:53:58.00
end_station_name end_station_id start_lat start_lng
Length:5835 Length:5835 Min. :41.65 Min. :-87.77
Class :character Class :character 1st Qu.:41.87 1st Qu.:-87.66
Mode :character Mode :character Median :41.89 Median :-87.63
Mean :41.87 Mean :-87.64
3rd Qu.:41.91 3rd Qu.:-87.62
Max. :42.06 Max. :-87.53
end_lat end_lng member_casual day_of_week
Min. : NA Min. : NA casual:5144 Sunday :1093
1st Qu.: NA 1st Qu.: NA member: 691 Monday : 727
Median : NA Median : NA Tuesday : 656
Mean :NaN Mean :NaN Wednesday: 618
3rd Qu.: NA 3rd Qu.: NA Thursday : 674
Max. : NA Max. : NA Friday : 852
NA's :5835 NA's :5835 Saturday :1215
ride_duration
Min. :6s
1st Qu.:89991s (~1.04 days)
Median :89994s (~1.04 days)
Mean :177246.388517566s (~2.05 days)
3rd Qu.:90001s (~1.04 days)
Max. :2483235s (~4.11 weeks)
head(no_end_latlng)
# A tibble: 6 × 15
ride_id ridea…¹ started_at ended_at start…² start…³
<chr> <fct> <dttm> <dttm> <chr> <chr>
1 D66FB7A5034CD… classi… 2021-11-23 11:53:36 2021-11-24 12:53:30 Laflin… 13307
2 214DC891AC5E7… classi… 2021-11-25 19:23:35 2021-11-26 20:23:30 Rush S… 15530
3 4409AA46BDD76… classi… 2021-11-06 13:13:06 2021-11-07 13:13:01 Ashlan… 16950
4 C4A464C2818DB… docked… 2021-11-06 16:40:58 2021-11-06 17:24:39 Millen… 13008
5 E58A224FA0AC8… docked… 2021-11-25 13:56:42 2021-11-26 16:50:51 Millen… 13008
6 D893434A1E5B0… docked… 2021-11-26 16:07:04 2021-11-27 17:07:05 Shedd … 15544
# … with 9 more variables: end_station_name <chr>, end_station_id <chr>,
# start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
# member_casual <fct>, day_of_week <fct>, ride_duration <Duration>, and
# abbreviated variable names ¹rideable_type, ²start_station_name,
# ³start_station_id
Micromobility devices, which include bikes and scooters, are not always returned. There are several documented instances of Divvy bikes found out of place, some as far away as Michoacán, Mexico. Nearly all of these rows have a ride duration of approximately 24 hours; after 24 hours, Divvy considers a bike lost or stolen and assesses a fee. For the purposes of this analysis, rides with null values for both end_lat and end_lng will be treated as lost or stolen, and those rides will be omitted.
4.3(b) Null station name and id
Records with null values for start_station_id, start_station_name, end_station_id, and end_station_name were isolated and analyzed. In these records the start_lat and start_lng, or end_lat and end_lng, values are rounded to two decimal places, whereas rides with station names and ids are recorded to five decimal places.
no_station <- all_rides %>%
filter(is.na(start_station_name) | is.na(end_station_name))
head(no_station)
# A tibble: 6 × 15
ride_id ridea…¹ started_at ended_at start…² start…³
<chr> <fct> <dttm> <dttm> <chr> <chr>
1 7C00A93E10556… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA> <NA>
2 90854840DFD50… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA> <NA>
3 0A7D10CDD1440… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA> <NA>
4 2F3BE33085BCF… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA> <NA>
5 D67B4781A1992… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA> <NA>
6 02F85C2C3C5F7… electr… 2021-11-26 18:34:07 2021-11-26 18:52:49 Michig… 13042
# … with 9 more variables: end_station_name <chr>, end_station_id <chr>,
# start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
# member_casual <fct>, day_of_week <fct>, ride_duration <Duration>, and
# abbreviated variable names ¹rideable_type, ²start_station_name,
# ³start_station_id
summary(no_station)
ride_id rideable_type started_at
Length:1345256 classic_bike : 4338 Min. :2021-11-01 00:02:27.00
Class :character docked_bike : 2570 1st Qu.:2022-04-15 09:22:04.25
Mode :character electric_bike:1338348 Median :2022-06-30 21:24:37.00
Mean :2022-06-09 18:58:17.02
3rd Qu.:2022-08-27 14:20:58.25
Max. :2022-10-31 23:58:53.00
ended_at start_station_name start_station_id
Min. :2021-11-01 00:09:26.00 Length:1345256 Length:1345256
1st Qu.:2022-04-15 09:37:13.25 Class :character Class :character
Median :2022-06-30 21:59:07.50 Mode :character Mode :character
Mean :2022-06-09 19:24:55.90
3rd Qu.:2022-08-27 14:46:54.75
Max. :2022-11-07 04:53:58.00
end_station_name end_station_id start_lat start_lng
Length:1345256 Length:1345256 Min. :41.64 Min. :-87.84
Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.68
Mode :character Mode :character Median :41.91 Median :-87.65
Mean :41.90 Mean :-87.66
3rd Qu.:41.94 3rd Qu.:-87.63
Max. :42.07 Max. :-87.52
end_lat end_lng member_casual day_of_week
Min. :41.39 Min. :-88.97 casual:584851 Sunday :183798
1st Qu.:41.88 1st Qu.:-87.68 member:760405 Monday :174457
Median :41.91 Median :-87.65 Tuesday :179534
Mean :41.90 Mean :-87.66 Wednesday:187475
3rd Qu.:41.94 3rd Qu.:-87.63 Thursday :200368
Max. :42.37 Max. :-87.30 Friday :200867
NA's :5835 NA's :5835 Saturday :218757
ride_duration
Min. :-621201s (~-1.03 weeks)
1st Qu.:308s (~5.13 minutes)
Median :557s (~9.28 minutes)
Mean :1598.87106097278s (~26.65 minutes)
3rd Qu.:1014s (~16.9 minutes)
Max. :2483235s (~4.11 weeks)
It is possible that these rides do not start or end at a station because they are not required to. A search of available information finds that in 2020, Divvy began introducing dockless bikes into its system. This would also help explain why the latitude and longitude are rounded: to anonymize user locations. Rides with null values in the station name and id variables will be included in the analysis, with the null values replaced by the strings “No ID” or “No Station”.
While the vast majority of rides without start or end station information are on electric bikes, some are docked or classic bikes. Electric bikes are allowed to end a ride outside of a station, but classic and docked bikes must be returned to a station. It is possible that the electric bike rides without station information are dockless rides, while the classic and docked bike rides without station information were improperly ended. Further inspection shows that the maximum ride duration is 4.11 weeks, well in excess of Cyclistic’s 24-hour ride maximum.
4.4 Rides over 24 hours
As discovered in section 4.3(a), Divvy considers any bike checked out for more than 24 hours to be lost or stolen. Beyond the rides already flagged in 4.3(a), there are 278 rides in the data set with durations of at least 24 hours that do have end coordinates. For the purpose of this analysis, they will be excluded.
all_rides %>%
filter(ride_duration >= (24*60*60)) %>%
filter(!is.na(end_lat) | !is.na(end_lng)) %>%
summarize(count=n())
# A tibble: 1 × 1
count
<int>
1 278
5. Cleaning data
After a thorough exploration of the data, it can be efficiently filtered and cleaned.
cleaned_rides <- all_rides %>%
filter(ride_duration >= 60) %>% #Rides < 60s are not valid trips (per Divvy)
filter(ride_duration < (24*60*60)) %>% #Rides > 24h are lost/stolen
  filter(!is.na(end_lat) | !is.na(end_lng)) %>% #Remove rides missing both end coordinates
mutate(start_station_name = replace_na(start_station_name, "No Station")) %>%
mutate(start_station_id = replace_na(start_station_id, "No ID")) %>%
mutate(end_station_name = replace_na(end_station_name, "No Station")) %>%
mutate(end_station_id = replace_na(end_station_id, "No ID"))
summary(cleaned_rides)
ride_id rideable_type started_at
Length:5631132 classic_bike :2596050 Min. :2021-11-01 00:00:14.00
Class :character docked_bike : 177938 1st Qu.:2022-04-27 09:44:33.25
Mode :character electric_bike:2857144 Median :2022-06-30 16:48:58.50
Mean :2022-06-13 18:26:37.81
3rd Qu.:2022-08-24 18:27:55.50
Max. :2022-10-31 23:59:33.00
ended_at start_station_name start_station_id
Min. :2021-11-01 00:04:06.00 Length:5631132 Length:5631132
1st Qu.:2022-04-27 09:56:54.50 Class :character Class :character
Median :2022-06-30 17:04:42.50 Mode :character Mode :character
Mean :2022-06-13 18:43:15.53
3rd Qu.:2022-08-24 18:45:07.50
Max. :2022-11-01 08:59:46.00
end_station_name end_station_id start_lat start_lng
Length:5631132 Length:5631132 Min. :41.64 Min. :-87.84
Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
Mode :character Mode :character Median :41.90 Median :-87.64
Mean :41.90 Mean :-87.65
3rd Qu.:41.93 3rd Qu.:-87.63
Max. :45.64 Max. :-73.80
end_lat end_lng member_casual day_of_week
Min. :41.39 Min. :-88.97 casual:2300760 Sunday :774910
1st Qu.:41.88 1st Qu.:-87.66 member:3330372 Monday :758320
Median :41.90 Median :-87.64 Tuesday :771556
Mean :41.90 Mean :-87.65 Wednesday:788199
3rd Qu.:41.93 3rd Qu.:-87.63 Thursday :821849
Max. :42.37 Max. :-87.30 Friday :797982
Saturday :918316
ride_duration
Min. :60s (~1 minutes)
1st Qu.:365s (~6.08 minutes)
Median :633s (~10.55 minutes)
Mean :997.71594823208s (~16.63 minutes)
3rd Qu.:1128s (~18.8 minutes)
Max. :86362s (~23.99 hours)
str(cleaned_rides)
tibble [5,631,132 × 15] (S3: tbl_df/tbl/data.frame)
$ ride_id : chr [1:5631132] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
$ rideable_type : Factor w/ 3 levels "classic_bike",..: 3 3 3 3 3 3 3 3 3 3 ...
$ started_at : POSIXct[1:5631132], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
$ ended_at : POSIXct[1:5631132], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
$ start_station_name: chr [1:5631132] "No Station" "No Station" "No Station" "No Station" ...
$ start_station_id : chr [1:5631132] "No ID" "No ID" "No ID" "No ID" ...
$ end_station_name : chr [1:5631132] "No Station" "No Station" "No Station" "No Station" ...
$ end_station_id : chr [1:5631132] "No ID" "No ID" "No ID" "No ID" ...
$ start_lat : num [1:5631132] 41.9 42 42 41.9 41.9 ...
$ start_lng : num [1:5631132] -87.7 -87.7 -87.7 -87.8 -87.6 ...
$ end_lat : num [1:5631132] 42 41.9 42 41.9 41.9 ...
$ end_lng : num [1:5631132] -87.7 -87.7 -87.7 -87.8 -87.6 ...
$ member_casual : Factor w/ 2 levels "casual","member": 1 1 1 1 1 1 1 1 1 1 ...
$ day_of_week : Factor w/ 7 levels "Sunday","Monday",..: 7 7 6 7 6 6 7 7 7 7 ...
$ ride_duration :Formal class 'Duration' [package "lubridate"] with 1 slot
.. ..@ .Data: num [1:5631132] 1140 1065 142 301 1273 ...
The cleaning process removed 124,562 rows from our data set: 118,453 rides were less than 60s, 5,377 rides were greater than 24h, and 5,835 had null values for end_lat and end_lng. There is significant overlap between the latter two categories.
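That overlap can be confirmed directly; the sketch below counts how many of the rides missing both end coordinates also exceed 24 hours (result not shown).
#Sketch: overlap between lost/stolen (no end coordinates) and >24h rides
all_rides %>%
  filter(is.na(end_lat) & is.na(end_lng)) %>%
  summarise(also_over_24h = sum(ride_duration >= 24*60*60))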
6. Removing test data found during analysis
During geospatial analysis, a record with out-of-range values for start_lat and start_lng was located. Upon further inspection, its start_station_name was identified as “Pawel Bialowas - Test - PBSC charging station”. The following code removes any record with the word “test” in either the start or end station name.
# Subset the cleaned_rides data frame to exclude rows where start_station_name or end_station_name contains "test"
cleaned_rides <- cleaned_rides[!(grepl("test", cleaned_rides$start_station_name, ignore.case = TRUE) |
grepl("test", cleaned_rides$end_station_name, ignore.case = TRUE)),]
One record was removed because it contained the word “test” in one of the station name variables.
7. Saving the cleaned data
The following code exports the data as an RDS file for later analysis.
saveRDS(all_rides,"Data/all_rides.RDS")
saveRDS(cleaned_rides, "Data/cleaned_rides.RDS")
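The saved files can be reloaded in a later analysis session with readRDS().
#Reload the cleaned data in a later analysis script
cleaned_rides <- readRDS("Data/cleaned_rides.RDS")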