Data Cleaning

April 2nd, 2023

Cleaning process

This page gives an outline of the process and assumptions used to combine, clean, and transform raw trip data from Cyclistic (Divvy) in preperation for analysis. At the time of collection, data up to October 2022 were available. For the purposes of our analysis, the 12 most recent months of data were selected.

Cyclistic is a fictional company, but the data is obtained from Divvy Bikes, a real bike sharing program in Chicago, Il. A description of the datasets and their files can be accessed here.

1. Installing Packages

library(tidyverse) #For data cleaning and analysis

2. Combining datasets

The structure of these CSV files were inspected using Excel prior to import and binding. The structure of the merged dataset was checked again after binding.

#Import data files
nov21 <- read_csv("Data/202111-divvy-tripdata.csv")
dec21 <- read_csv("Data/202112-divvy-tripdata.csv")
jan22 <- read_csv("Data/202201-divvy-tripdata.csv")
feb22 <- read_csv("Data/202202-divvy-tripdata.csv")
mar22 <- read_csv("Data/202203-divvy-tripdata.csv")
apr22 <- read_csv("Data/202204-divvy-tripdata.csv")
may22 <- read_csv("Data/202205-divvy-tripdata.csv")
jun22 <- read_csv("Data/202206-divvy-tripdata.csv")
jul22 <- read_csv("Data/202207-divvy-tripdata.csv")
aug22 <- read_csv("Data/202208-divvy-tripdata.csv")
sep22 <- read_csv("Data/202209-divvy-tripdata.csv")
oct22 <- read_csv("Data/202210-divvy-tripdata.csv")

#Join data files
all_rides <- bind_rows(nov21,dec21,jan22,feb22,mar22,apr22,may22,jun22,jul22,
                      aug22,sep22,oct22)
                      
#delete import files and clear from memory
rm(nov21,dec21,jan22,feb22,mar22,apr22,may22,jun22,jul22,aug22,sep22,oct22)
gc()

str(all_rides)

2.1 Assigning time zone

The datetime variables started_at and ended_at were both automatically imported as POSIXct data types. The following code assigns the Chicago timezone. Daylight Savings Time presented difficulties with a small number of rides that crossed over a DST boundary.

#Set time zone
all_rides <- all_rides %>% 
  mutate(started_at = force_tz(all_rides$started_at,
                               tzone = "America/Chicago",
                               roll_dst = c("pre", "post"))) %>% 
  mutate(ended_at = force_tz(all_rides$ended_at,
                             tzone = "America/Chicago",
                             roll_dst = c("pre", "post")))
#Confirm time zone is America/Chicago
tz(all_rides$started_at)

[1] "America/Chicago"

tz(all_rides$ended_at)

[1] "America/Chicago"

3. Data transformation

3.1 Create a variable for the day of week

This analysis will examine which days of the week are most popular with different rider groups. The following code will create a variable for the day of the week.

weekday_levels <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 
                    'Friday', 'Saturday')
all_rides <- all_rides %>% 
  mutate(day_of_week = factor(weekdays(started_at), levels = weekday_levels))

3.2 Create a variable for the ride duration

With the ‘started_at’ time and ‘ended_at’ time, it is possible to calculate an interesting variable for analysis: the duration of each ride.

#Create ride_duration to measure duration of ride
all_rides <- all_rides %>% 
  mutate(ride_duration = as.duration(started_at%--%ended_at))

3.3 Convert categorical variables to factors

The data has two categorical variables which can be better analyzed by recasting them as factors, ‘rideable_type’ and ‘member_casual’.

#Convert 'rideable_type' and 'member_casual' to factors
all_rides <- all_rides %>% 
  mutate(rideable_type = factor(rideable_type))
all_rides <- all_rides %>% 
  mutate(member_casual = factor(member_casual))

4. Examine data

Before the cleaning the data, it is examined to identify errors, blanks, or any values out of range for this analysis.

4.1 Examine primary key

The data has a primary key, which is the ‘ride_id’ variable. This code will check the integrity of the data by ensuring that each ‘ride_id’ is unique.

#Check if each trip_id is unique
all_rides %>%
  distinct(ride_id) %>% 
  summarise(total_count=n(),.groups='drop') %>% 
  arrange(-total_count)

# A tibble: 1 × 1
  total_count
        <int>
1     5755694

The result, 5,755,694, is equal to the number of rows in all_rides. Each ride_id is unique.

4.2 Durations out of expected range

Divvy’s data page states that “The data has been processed to remove trips that are taken by staff as they service and inspect the system; and any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).” The analysis observed that the minimum value for the trip duration was less than 60 seconds, and was a negative number. The data contain 118,453 records in which the ride length is under 60 seconds.

#Examine trip lenghts < 60s in duration
all_rides %>%
  filter(ride_duration<60) %>% 
  summarise(total_count=n(),.groups='drop') %>% 
  arrange(-total_count)

# A tibble: 1 × 1
  total_count
        <int>
1      118453

#118,453 ride_lentgh < 60s

4.3 Null values

The data should be complete, without any blank fields. The following code will count if there are any null values in our data.

all_rides %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))

# A tibble: 1 × 15
  ride_id rideable_type starte…¹ ended…² start…³ start…⁴ end_s…⁵ end_s…⁶ start…⁷
    <int>         <int>    <int>   <int>   <int>   <int>   <int>   <int>   <int>
1       0             0        0       0  878177  878177  940010  940010       0
# … with 6 more variables: start_lng <int>, end_lat <int>, end_lng <int>,
#   member_casual <int>, day_of_week <int>, ride_duration <int>, and
#   abbreviated variable names ¹started_at, ²ended_at, ³start_station_name,
#   ⁴start_station_id, ⁵end_station_name, ⁶end_station_id, ⁷start_lat

There are null values in start_station_name, start_station_id, end_station_name, end_station_id, end_lat, and end_lng.

The number of null values for station names and ids is concerning. The nulls in the station variables account for 23.37% of the data. The end latitude and longitude are less concerning, accounting for <0.01% of the data.

In an ideal situation, an analyst could directly ask those who collect the data about null values or other anomalous entries. This case study will use publicly available information to solve data integrity issues.

4.3(a) Null values in end_lat and end_lng

#Isolate null values of 'end_lat' and 'end_lng' for analysis
no_end_latlng <- all_rides %>% 
  filter(is.na(end_lat) | is.na(end_lng))
summary(no_end_latlng)

   ride_id                rideable_type    started_at                    
 Length:5835        classic_bike :3265   Min.   :2021-11-01 07:03:54.00  
 Class :character   docked_bike  :2570   1st Qu.:2022-05-19 13:17:49.50  
 Mode  :character   electric_bike:   0   Median :2022-07-02 15:50:32.00  
                                         Mean   :2022-06-24 21:47:55.81  
                                         3rd Qu.:2022-08-20 21:00:17.50  
                                         Max.   :2022-10-31 21:32:38.00  
                                                                         
    ended_at                      start_station_name start_station_id  
 Min.   :2021-11-01 07:18:08.00   Length:5835        Length:5835       
 1st Qu.:2022-05-20 23:36:26.00   Class :character   Class :character  
 Median :2022-07-04 12:56:23.00   Mode  :character   Mode  :character  
 Mean   :2022-06-26 23:02:02.21                                        
 3rd Qu.:2022-08-23 17:27:50.50                                        
 Max.   :2022-11-07 04:53:58.00                                        
                                                                       
 end_station_name   end_station_id       start_lat       start_lng     
 Length:5835        Length:5835        Min.   :41.65   Min.   :-87.77  
 Class :character   Class :character   1st Qu.:41.87   1st Qu.:-87.66  
 Mode  :character   Mode  :character   Median :41.89   Median :-87.63  
                                       Mean   :41.87   Mean   :-87.64  
                                       3rd Qu.:41.91   3rd Qu.:-87.62  
                                       Max.   :42.06   Max.   :-87.53  
                                                                       
    end_lat        end_lng     member_casual    day_of_week  
 Min.   : NA    Min.   : NA    casual:5144   Sunday   :1093  
 1st Qu.: NA    1st Qu.: NA    member: 691   Monday   : 727  
 Median : NA    Median : NA                  Tuesday  : 656  
 Mean   :NaN    Mean   :NaN                  Wednesday: 618  
 3rd Qu.: NA    3rd Qu.: NA                  Thursday : 674  
 Max.   : NA    Max.   : NA                  Friday   : 852  
 NA's   :5835   NA's   :5835                 Saturday :1215  
 ride_duration                           
 Min.   :6s                              
 1st Qu.:89991s (~1.04 days)             
 Median :89994s (~1.04 days)             
 Mean   :177246.388517566s (~2.05 days)  
 3rd Qu.:90001s (~1.04 days)             
 Max.   :2483235s (~4.11 weeks)

head(no_end_latlng)

# A tibble: 6 × 15
  ride_id        ridea…¹ started_at          ended_at            start…² start…³
  <chr>          <fct>   <dttm>              <dttm>              <chr>   <chr>  
1 D66FB7A5034CD… classi… 2021-11-23 11:53:36 2021-11-24 12:53:30 Laflin… 13307  
2 214DC891AC5E7… classi… 2021-11-25 19:23:35 2021-11-26 20:23:30 Rush S… 15530  
3 4409AA46BDD76… classi… 2021-11-06 13:13:06 2021-11-07 13:13:01 Ashlan… 16950  
4 C4A464C2818DB… docked… 2021-11-06 16:40:58 2021-11-06 17:24:39 Millen… 13008  
5 E58A224FA0AC8… docked… 2021-11-25 13:56:42 2021-11-26 16:50:51 Millen… 13008  
6 D893434A1E5B0… docked… 2021-11-26 16:07:04 2021-11-27 17:07:05 Shedd … 15544  
# … with 9 more variables: end_station_name <chr>, end_station_id <chr>,
#   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
#   member_casual <fct>, day_of_week <fct>, ride_duration <Duration>, and
#   abbreviated variable names ¹rideable_type, ²start_station_name,
#   ³start_station_id

Micromobility devices, which include bikes and scooters, are not always returned. There are several documented instances of Divvy bikes found out of place, some as far as Michoacan, Mexico. Nearly all of these rows have a ride duration of approximately 24 hours. After 24 hours, Divvy considers a bike to be lost or stolen, and assesses a fee. For the purpose of this analysis, rides with null values for both the end_lat and end_lng variables will be considered missing or stolen, and those rides will be omitted from this analysis.

4.3(b) Null station name and id

Records with null values for the start_station_id, start_station_name, end_station_id, and end_station_name were isolated and analyzed. In these records the start_lat and start_lng, or end_lat and end_lng, variables are rounded to two place values, as opposed to rides with station names and ids, which continue to five decimal places.

no_station <- all_rides %>% 
  filter(is.na(start_station_name) | is.na(end_station_name))

head(no_station)

# A tibble: 6 × 15
  ride_id        ridea…¹ started_at          ended_at            start…² start…³
  <chr>          <fct>   <dttm>              <dttm>              <chr>   <chr>  
1 7C00A93E10556… electr… 2021-11-27 13:27:38 2021-11-27 13:46:38 <NA>    <NA>   
2 90854840DFD50… electr… 2021-11-27 13:38:25 2021-11-27 13:56:10 <NA>    <NA>   
3 0A7D10CDD1440… electr… 2021-11-26 22:03:34 2021-11-26 22:05:56 <NA>    <NA>   
4 2F3BE33085BCF… electr… 2021-11-27 09:56:49 2021-11-27 10:01:50 <NA>    <NA>   
5 D67B4781A1992… electr… 2021-11-26 19:09:28 2021-11-26 19:30:41 <NA>    <NA>   
6 02F85C2C3C5F7… electr… 2021-11-26 18:34:07 2021-11-26 18:52:49 Michig… 13042  
# … with 9 more variables: end_station_name <chr>, end_station_id <chr>,
#   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
#   member_casual <fct>, day_of_week <fct>, ride_duration <Duration>, and
#   abbreviated variable names ¹rideable_type, ²start_station_name,
#   ³start_station_id

summary(no_station)

   ride_id                rideable_type       started_at                    
 Length:1345256     classic_bike :   4338   Min.   :2021-11-01 00:02:27.00  
 Class :character   docked_bike  :   2570   1st Qu.:2022-04-15 09:22:04.25  
 Mode  :character   electric_bike:1338348   Median :2022-06-30 21:24:37.00  
                                            Mean   :2022-06-09 18:58:17.02  
                                            3rd Qu.:2022-08-27 14:20:58.25  
                                            Max.   :2022-10-31 23:58:53.00  
                                                                            
    ended_at                      start_station_name start_station_id  
 Min.   :2021-11-01 00:09:26.00   Length:1345256     Length:1345256    
 1st Qu.:2022-04-15 09:37:13.25   Class :character   Class :character  
 Median :2022-06-30 21:59:07.50   Mode  :character   Mode  :character  
 Mean   :2022-06-09 19:24:55.90                                        
 3rd Qu.:2022-08-27 14:46:54.75                                        
 Max.   :2022-11-07 04:53:58.00                                        
                                                                       
 end_station_name   end_station_id       start_lat       start_lng     
 Length:1345256     Length:1345256     Min.   :41.64   Min.   :-87.84  
 Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.68  
 Mode  :character   Mode  :character   Median :41.91   Median :-87.65  
                                       Mean   :41.90   Mean   :-87.66  
                                       3rd Qu.:41.94   3rd Qu.:-87.63  
                                       Max.   :42.07   Max.   :-87.52  
                                                                       
    end_lat         end_lng       member_casual      day_of_week    
 Min.   :41.39   Min.   :-88.97   casual:584851   Sunday   :183798  
 1st Qu.:41.88   1st Qu.:-87.68   member:760405   Monday   :174457  
 Median :41.91   Median :-87.65                   Tuesday  :179534  
 Mean   :41.90   Mean   :-87.66                   Wednesday:187475  
 3rd Qu.:41.94   3rd Qu.:-87.63                   Thursday :200368  
 Max.   :42.37   Max.   :-87.30                   Friday   :200867  
 NA's   :5835    NA's   :5835                     Saturday :218757  
 ride_duration                               
 Min.   :-621201s (~-1.03 weeks)             
 1st Qu.:308s (~5.13 minutes)                
 Median :557s (~9.28 minutes)                
 Mean   :1598.87106097278s (~26.65 minutes)  
 3rd Qu.:1014s (~16.9 minutes)               
 Max.   :2483235s (~4.11 weeks)

It is possible that these rides do not start or end at a station because they are not required to. A search of available information finds that in 2020, Divvy began introducing dockless bikes as part of their system. This would help to explain why the latitude and longitude are rounded – to anonymize user data. Rides with null values in the station name and id variables will be included in the analysis, and the null values will be replaced with the string “No ID” or “No Station”.

While the vast majority of rideable_type without start or end station information are electric bikes, some are docked bikes or classic bikes. While electric bikes are allowed to end their rides outside of a station, classic bikes and docked bikes must be returned to a station. It is possible that the electric bike rides without station information are dockless rides, and the classic and docked bikes without station information were improperly ended. Furthur inspection notes that the maximum for ride duration is 4.11 weeks, well in excess of Cyclistic’s 24 hour ride maximum.

4.4 Rides over 24 hours

As discovered in section 4.3(a), Divvy considers any bike lost for more than 24 hours to be lost or stolen. There are 278 rides in our data set with durations greater than 24 hours. For the purpose of this analysis, they will be excluded.

all_rides %>% 
  filter(ride_duration >= (24*60*60)) %>%
  filter(!is.na(end_lat) | !is.na(end_lng)) %>% 
  summarize(count=n())

# A tibble: 1 × 1
  count
  <int>
1   278

5. Cleaning data

After a through exploration of the data, it can be efficiently filtered and cleaned.

cleaned_rides <- all_rides %>% 
  filter(ride_duration >= 60) %>% #Rides < 60s are not valid trips (per Divvy)
  filter(ride_duration < (24*60*60)) %>% #Rides > 24h are lost/stolen
  filter(!is.na(end_lat) | !is.na(end_lng)) %>%  #Null end position
  mutate(start_station_name = replace_na(start_station_name, "No Station")) %>% 
  mutate(start_station_id = replace_na(start_station_id, "No ID")) %>% 
  mutate(end_station_name = replace_na(end_station_name, "No Station")) %>% 
  mutate(end_station_id = replace_na(end_station_id, "No ID"))

summary(cleaned_rides)

   ride_id                rideable_type       started_at                    
 Length:5631132     classic_bike :2596050   Min.   :2021-11-01 00:00:14.00  
 Class :character   docked_bike  : 177938   1st Qu.:2022-04-27 09:44:33.25  
 Mode  :character   electric_bike:2857144   Median :2022-06-30 16:48:58.50  
                                            Mean   :2022-06-13 18:26:37.81  
                                            3rd Qu.:2022-08-24 18:27:55.50  
                                            Max.   :2022-10-31 23:59:33.00  
                                                                            
    ended_at                      start_station_name start_station_id  
 Min.   :2021-11-01 00:04:06.00   Length:5631132     Length:5631132    
 1st Qu.:2022-04-27 09:56:54.50   Class :character   Class :character  
 Median :2022-06-30 17:04:42.50   Mode  :character   Mode  :character  
 Mean   :2022-06-13 18:43:15.53                                        
 3rd Qu.:2022-08-24 18:45:07.50                                        
 Max.   :2022-11-01 08:59:46.00                                        
                                                                       
 end_station_name   end_station_id       start_lat       start_lng     
 Length:5631132     Length:5631132     Min.   :41.64   Min.   :-87.84  
 Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
 Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
                                       Mean   :41.90   Mean   :-87.65  
                                       3rd Qu.:41.93   3rd Qu.:-87.63  
                                       Max.   :45.64   Max.   :-73.80  
                                                                       
    end_lat         end_lng       member_casual       day_of_week    
 Min.   :41.39   Min.   :-88.97   casual:2300760   Sunday   :774910  
 1st Qu.:41.88   1st Qu.:-87.66   member:3330372   Monday   :758320  
 Median :41.90   Median :-87.64                    Tuesday  :771556  
 Mean   :41.90   Mean   :-87.65                    Wednesday:788199  
 3rd Qu.:41.93   3rd Qu.:-87.63                    Thursday :821849  
 Max.   :42.37   Max.   :-87.30                    Friday   :797982  
                                                   Saturday :918316  
 ride_duration                              
 Min.   :60s (~1 minutes)                   
 1st Qu.:365s (~6.08 minutes)               
 Median :633s (~10.55 minutes)              
 Mean   :997.71594823208s (~16.63 minutes)  
 3rd Qu.:1128s (~18.8 minutes)              
 Max.   :86362s (~23.99 hours)

str(cleaned_rides)

tibble [5,631,132 × 15] (S3: tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:5631132] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
 $ rideable_type     : Factor w/ 3 levels "classic_bike",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ started_at        : POSIXct[1:5631132], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
 $ ended_at          : POSIXct[1:5631132], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
 $ start_station_name: chr [1:5631132] "No Station" "No Station" "No Station" "No Station" ...
 $ start_station_id  : chr [1:5631132] "No ID" "No ID" "No ID" "No ID" ...
 $ end_station_name  : chr [1:5631132] "No Station" "No Station" "No Station" "No Station" ...
 $ end_station_id    : chr [1:5631132] "No ID" "No ID" "No ID" "No ID" ...
 $ start_lat         : num [1:5631132] 41.9 42 42 41.9 41.9 ...
 $ start_lng         : num [1:5631132] -87.7 -87.7 -87.7 -87.8 -87.6 ...
 $ end_lat           : num [1:5631132] 42 41.9 42 41.9 41.9 ...
 $ end_lng           : num [1:5631132] -87.7 -87.7 -87.7 -87.8 -87.6 ...
 $ member_casual     : Factor w/ 2 levels "casual","member": 1 1 1 1 1 1 1 1 1 1 ...
 $ day_of_week       : Factor w/ 7 levels "Sunday","Monday",..: 7 7 6 7 6 6 7 7 7 7 ...
 $ ride_duration     :Formal class 'Duration' [package "lubridate"] with 1 slot
  .. ..@ .Data: num [1:5631132] 1140 1065 142 301 1273 ...

The cleaning process removed 124,562 rows from our data set: 118,453 rides were less than 60s, 5,377 rides were greater than 24h, and 5,835 had null values for end_lat and end_lng. There is significant overlap between the latter two categories.

6. Removing test data found during analysis

During geospatial analysis, a record with out of range data for the start_lat and start_lng was located. Upon further inspection, the start_station_name was identified as “Pawel Bialowas - Test - PBSC charging station”. The following code removes any record with the word “test” in either the start station or end station name.

# Subset the cleaned_rides data frame to exclude rows where start_station_name or end_station_name contains "test"
cleaned_rides <- cleaned_rides[!(grepl("test", cleaned_rides$start_station_name, ignore.case = TRUE) | 
                             grepl("test", cleaned_rides$end_station_name, ignore.case = TRUE)),]

One record was removed because it contained the word “test” in one of the station name variables.

Saving the cleaned data

The following code exports the data as an RDS file for later analysis.

saveRDS(all_rides,"Data/all_rides.RDS")
saveRDS(cleaned_rides, "Data/cleaned_rides.RDS")