\chapter{Free2Move Carsharing Data Analysis} \label{ch:data_analysis}

Improving the \textit{Free2Move} service requires increasing car usage while decreasing operator costs, which in turn requires understanding car usage in the historical service data. However, before developing a methodology intended to be used by \textit{Free2Move}, a preliminary analysis of the operating services is necessary. First, an overview is given of the three datasets used in this thesis, as well as of the abstraction made of the serviced area. Then each dataset is explored through the characteristics of the fleet utilization, such as the hot spots of each service or the average usage during the day depending on the day of the week. Finally, exogenous datasets are explored: details about the creation of the weather features are presented, as well as the reasons why characterizing the city with open data only can be difficult.

\section{Carsharing Service Modeling}

For this thesis, \textit{Free2Move} has provided three carsharing trip datasets, corresponding to customer trips in three cities. The first service studied is located in \textit{Madrid}, with a dataset coming from the service \textit{Emov}. The second dataset has been provided by the service \textit{Free2Move} located in \textit{Paris}. The last one comes from the \textit{Free2Move} service located in \textit{Washington}. Each dataset is selected such that the global usage is stable over time, i.e.\ the services are not in a warm-up phase with a noticeable and steady increase of usage, and such that the perimeter of the service stays the same, i.e.\ no city area is added or removed. For all three cases, this section details how the customer trip dataset is split and used, and how the service area is abstracted into a grid of hexagons whose size represents the maximum distance a customer is willing to travel to retrieve a car.
\subsection{Daily Trip Dataset Description}

The three datasets are tabular data containing one trip per row and one piece of information about the trip per column. Only the columns common to all the datasets have been kept: the \textit{timestamp of departure}, the \textit{timestamp of arrival}, the \textit{GPS coordinates of departure}, the \textit{GPS coordinates of arrival}, the \textit{car ID}, the \textit{customer ID} and the \textit{distance} traveled. It should be noted that the \textit{car ID} and \textit{customer ID} have been pseudonymized, i.e.\ the IDs are hashes of the real IDs, to comply with the data processing requirements of the GDPR\footnote{GDPR stands for \textquote{General Data Protection Regulation} and is a European Union regulation about data protection and privacy.} while keeping the information about how many trips the same car has made during the day or the recurrence of trips from the same customer.

\paragraph{Dataset Content.} The first dataset comes from the service \textit{Emov} in Madrid and accounts for $1\,138\,246$ trips. They have been made between the 1$^{st}$ August 2018 and the 31$^{st}$ March 2019 included, which accounts for $243$ days of trip data. Thus, on average, around $4\,700$ trips per day have been made during this period. They have been made with $578$ cars, mostly \textquote{Peugeot Ion} and \textquote{Citroën C-Zero}, which are both electric city cars. From now on, this service and its dataset will be denoted by the name \textquote{Madrid}.

The second dataset comes from the service \textit{Free2Move} located in Paris and records $130\,219$ trips made between the 1$^{st}$ April 2019 and the 31$^{st}$ January 2020 included. Thus, for the $306$ days of trip data, there is an average of $425$ daily trips. They have been made with $475$ electric cars, \textquote{Peugeot Ion} and \textquote{Citroën C-Zero} as in Madrid. From now on, this service and its dataset will be denoted by the name \textquote{Paris}.
It should be noted that the \textit{Paris} dataset contains roughly 10 times fewer daily trips than the \textit{Madrid} dataset. This is to be kept in mind, as it might have an impact on the performance of the car utilization prediction.

The last dataset comes from the service \textit{Free2Move} in Washington and holds the information of $136\,095$ trips. They have been made between the 1$^{st}$ August 2019 and the 31$^{st}$ March 2020 included, which accounts for $244$ days of trip data. Thus, on average, around $557$ daily trips have been made during this period. Those trips have been made with a fleet of $400$ \textquote{Chevrolet Cruze}, an internal combustion engine (ICE) compact car, and $200$ \textquote{Chevrolet Equinox}, an ICE crossover utility vehicle. From now on, this service and its dataset will be denoted by the name \textquote{Washington}. Like the service in \textit{Paris}, this service has been used less than the service in Madrid, and the performance of prediction algorithms based on this dataset might be poorer than in the Madrid case.

Table~\ref{tab:ch2_summary_content} summarizes the content of the three datasets provided by \textit{Free2Move} with all the information given previously. Note that in both \textit{Madrid} and \textit{Paris} the fleet is homogeneous and made of small electric vehicles; \textquote{Peugeot Ion} and \textquote{Citroën C-Zero} are comparable cars. In the case of the \textit{Washington} service, the fleet is heterogeneous since the two types of vehicles are not comparable. However, throughout this thesis, the assumption is made that the difference in use case between these cars is negligible. Indeed, even if their subcategories differ, their main purpose is still to transport people or small goods, contrary to utility vehicles used, for example, to move furniture between two houses.
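
As a quick check of the rounded averages quoted above, the number of trips can be divided by the number of days of each period. The values below are derived only from the figures given in this section:
\[
\begin{aligned}
\text{Madrid:} &\quad 1\,138\,246 / 243 \approx 4\,684 \text{ trips per day},\\
\text{Paris:} &\quad 130\,219 / 306 \approx 425.6 \text{ trips per day},\\
\text{Washington:} &\quad 136\,095 / 244 \approx 557.8 \text{ trips per day}.
\end{aligned}
\]
The ratio between Madrid and Paris is thus about $4\,684 / 425.6 \approx 11$, consistent with the factor of about ten mentioned above.
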
\begin{table}[!tb]
\centering
\small
\begin{tabular}{|l|c|c|c|c|c|}
\hline
City & Trips & Period & Days & Avg.\ daily trips & Cars \\ \hline
Madrid & 1\,138\,246 & 2018-08-01 to 2019-03-31 & 243 & 4\,700 & 578 \\ \cline{1-1}
Paris & 130\,219 & 2019-04-01 to 2020-01-31 & 306 & 425 & 475 \\ \cline{1-1}
Washington & 136\,095 & 2019-08-01 to 2020-03-31 & 244 & 557 & 600 \\ \hline
\end{tabular}
\caption{Summary of the dataset characteristics for the three services. The total number of trips, the dataset period, the number of days in the dataset and the average number of daily trips give an insight into the global usage of each service.}
\label{tab:ch2_summary_content}
\end{table}

\paragraph{Dataset Split.} For the method explained in Chapter~\ref{ch:method} to work, the usage of cars has to be predicted by a machine learning model. However, as detailed in Chapter~\ref{ch:background}, one dataset is needed to train the model, so that it can make predictions related to the problem at hand, and another dataset is needed to evaluate the prediction performance of the trained model. Additionally, a third dataset might be required to find the best hyperparameters of the model. This is often done by splitting a dataset into three sub-datasets~\cite{shalev_understanding_2014}. First, a \emph{training set} is taken as the first slice of the global dataset, to train the model. Then a \emph{validation set} is given by the second slice of the global dataset, to search for the best hyperparameters of the model. Finally, a \emph{test set} is taken from the remaining part of the global dataset, to evaluate the prediction performance of the model. Those three slices are made such that the model cannot ``cheat'' by learning by heart the values to predict: the model should learn on a part of the trip data that is distinct from the parts used for the evaluation.
To do so, a \textit{training set}, a \textit{validation set} and a \textit{test set} are created from the whole customer trip dataset. To create balanced subsets with the least possible annual seasonality bias, the whole dataset of each city is first split into weekly trip data. Then each week is assigned to one of the three subsets, such that for every 8 consecutive weeks, the first 6 weeks are assigned to the \textit{training set}, the next week is assigned to the \textit{validation set} and the last week is assigned to the \textit{test set}. If it is not possible to form a complete week at the end of the whole dataset, e.g.\ if only data from Monday to Wednesday are available, then this incomplete week is discarded. Moreover, if the dataset of a city begins on a Wednesday, the \textquote{week} runs from the Wednesday to the next Tuesday (included) for this dataset.

Thus, for the service located in \textit{Madrid}, the daily trip dataset is split into $26$ weeks of data for the \textit{training set}, $4$ weeks for the \textit{validation set} and $4$ weeks for the \textit{test set}. For the service located in \textit{Paris}, $33$ weeks are assigned to the \textit{training set}, $5$ weeks to the \textit{validation set} and $5$ weeks to the \textit{test set}. Finally, for the service present in \textit{Washington}, the daily trip dataset is split into $26$ weeks of data for the \textit{training set}, $4$ weeks for the \textit{validation set} and $4$ weeks for the \textit{test set}. Table~\ref{tab:ch2_summary_split} summarizes the number of weeks assigned to each set for each service.
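
The week counts reported for each city can be recovered with a short calculation. The sketch below assumes that the weeks left over after the last complete block of 8 weeks follow the same rule (the first 6 go to training, then one to validation); the symbols $D$, $W$, $q$ and $r$ are introduced here for illustration only. With $D$ days of data, let $W = \lfloor D/7 \rfloor$ be the number of complete weeks, $q = \lfloor W/8 \rfloor$ and $r = W \bmod 8$. Then
\[
n_{\text{train}} = 6q + \min(r, 6), \qquad
n_{\text{val}} = q + \max(0, r - 6), \qquad
n_{\text{test}} = q .
\]
For \textit{Madrid}, $D = 243$ gives $W = 34$ (the 5 remaining days are discarded), $q = 4$ and $r = 2$, hence $6 \cdot 4 + 2 = 26$ training weeks, $4$ validation weeks and $4$ test weeks. For \textit{Paris}, $D = 306$ gives $W = 43$, $q = 5$ and $r = 3$, hence $33$, $5$ and $5$ weeks. For \textit{Washington}, $D = 244$ also gives $W = 34$, and therefore the same partition as \textit{Madrid}.
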
\begin{table}[!ht]
\centering
\small
\begin{tabular}{|l|l|l|l|}
\hline
\multirow{2}{*}{City} & \multicolumn{3}{l|}{Data Partition} \\ \cline{2-4}
 & Training Set & Validation Set & Test Set \\ \hline
Madrid & 26 weeks & 4 weeks & 4 weeks \\ \cline{1-1}
Paris & 33 weeks & 5 weeks & 5 weeks \\ \cline{1-1}
Washington & 26 weeks & 4 weeks & 4 weeks \\ \hline
\end{tabular}
\caption{Summary of the data partition into training set, validation set and test set. Each set is made of multiple weeks of data, i.e.\ data pieces of seven consecutive days.}
\label{tab:ch2_summary_split}
\end{table}

\FloatBarrier
\subsection{Service Area Modeling}

The datasets of each city have been temporally split into three subsets. However, in the trip dataset of each city, the GPS positions of the pick-up and drop-off of the car in each trip also need to be modeled. Following the state-of-the-art methods, see Chapter~\ref{ch:background}, all the GPS points in the datasets are discretized with the help of a grid. Thus, for each city, the serviced area, i.e.\ the area where the customer can pick up and drop off reserved cars, is discretized with a grid made of hexagons, with the objective of representing the maximum distance a customer is willing to walk. The set of all the \textit{hexagons}, also called \textit{cells}, is denoted $K$ and is made of hexagons with a radius of $500\,\mathrm{m}$. Thus, all the parking spaces within the surface of one cell are labeled with the index $k$ of the corresponding hexagon.

To create this grid of hexagonal cells over the city, a first GPS position is chosen. This GPS position $\theta$ is used as the origin and anchor of the grid. A 2D grid ($\theta$, $\vec{i}$, $\vec{j}$) is formed, such that $\vec{i}$ is a vector of length $\|\vec{i}\| = 500\,\mathrm{m}$ pointing toward the east and $\vec{j}$ is a vector of length $\|\vec{j}\| = 500\,\mathrm{m}$ pointing toward the south.
The limit on the number of cells that can be placed horizontally is called $\lambda$, i.e.\ it is not possible to have more than $\lambda$ ``columns'' in the 2D grid. Then, to define a hexagon on the grid, the GPS position of its center is computed. That position is found through a translation of the point $\theta$, where $\Delta x$ and $\Delta y$ are the offsets in meters from the coordinates of the origin. For a cell of ID $k \in K$, $\Delta x$ and $\Delta y$ are given by the formula:
\begin{equation}
\begin{cases}
\Delta x = \|\vec{i}\| \cdot \left(k_i + \frac{1}{2}\right) \text{~~,~~} \Delta y = \frac{3}{2} \cdot \|\vec{j}\| \cdot k_j &\text{~~~~if~} k_j \bmod{2} = 0\\
\Delta x = \|\vec{i}\| \cdot k_i \text{~~,~~} \Delta y = \frac{3}{2} \cdot \|\vec{j}\| \cdot \left(k_j + 1\right) &\text{~~~~else}
\end{cases}
\end{equation}
\[
\text{With: } k_i = k \bmod{\lambda} \text{~~~~and~~~~} k_j = \left\lfloor \frac{k}{\lambda} \right\rfloor
\]
The number of cells to use in the grid and the maximum number of columns of this grid ($\lambda$) have to be manually tuned by trial and error to cover the whole serviced area.

In the case of \textit{Madrid}, the serviced area covers roughly a quarter of the total city surface and is centered on the city center. To discretize all the possible parking spaces within the serviced area, the covering grid has a width of $19$ hexagons and a height of $24$ hexagons, for a total of $456$ hexagons. However, this \textquote{rectangular} grid made of hexagons includes numerous hexagons that can be discarded since they are not included in the serviced area. Thus, only $155$ hexagons are kept as cells within the serviced area. Figure~\ref{fig:ch2_grid_madrid} shows in blue the perimeter of the area serviced by \textit{Emov}, while the black hexagons represent the part of the grid that is used.
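
To illustrate the formula above, consider a hypothetical cell ID $k = 40$ in the \textit{Madrid} grid. This example assumes that $\lambda$ equals the grid width of $19$ hexagons and that the IDs are numbered from $0$ over the full rectangular grid; both points are assumptions made for this illustration only.
\[
k_i = 40 \bmod 19 = 2, \qquad k_j = \left\lfloor \frac{40}{19} \right\rfloor = 2 .
\]
Since $k_j \bmod 2 = 0$, the first case of the formula applies:
\[
\Delta x = 500 \cdot \left(2 + \tfrac{1}{2}\right) = 1250\,\mathrm{m}, \qquad
\Delta y = \tfrac{3}{2} \cdot 500 \cdot 2 = 1500\,\mathrm{m} .
\]
Under these assumptions, the center of this cell would lie $1250\,\mathrm{m}$ east and $1500\,\mathrm{m}$ south of the origin $\theta$.
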
\begin{figure}[!bt]
\centering
\makebox[\textwidth][c]{\includegraphics[width=0.55\textwidth]{figure/ch2_grid_madrid.jpeg}}
\caption[Hexagonal Grid for Madrid]{Map of Madrid. The geographical limit of the \textit{Emov} service is shown in blue. The hexagonal grid covering the serviced area is displayed on top of Madrid and is made of 155 cells. Each cell has a radius of 500\,m.}
\label{fig:ch2_grid_madrid}
\end{figure}

With the same process in the case of \textit{Paris}, the grid should cover the whole city of Paris and include \textit{Issy-les-Moulineaux}, south-west of Paris. Thus, a grid $20$ hexagons wide and $15$ hexagons high has been made, for a total of $300$ cells. As for \textit{Madrid}, only a subset of $209$ cells actually covering the precise serviced area is kept. The precise perimeter of the serviced area is displayed with a blue line in Figure~\ref{fig:ch2_grid_paris}, together with the hexagonal grid used to discretize it.

\begin{figure}[!bt]
\centering
\makebox[\textwidth][c]{\includegraphics[width=0.75\textwidth]{figure/ch2_grid_paris.jpeg}}
\caption[Hexagonal Grid for Paris]{Map of Paris. The geographical limit of the \textit{Free2Move} service is shown in blue. The hexagonal grid covering the serviced area is displayed and is made of 209 cells. Each cell has a radius of 500\,m.}
\label{fig:ch2_grid_paris}
\end{figure}

Finally, for the service present in~\textit{Washington}, the grid has to cover the whole District of Columbia as well as Arlington County, apart from a north-west part of Arlington County. To do so, a grid $27$ hexagons wide and $31$ hexagons high has been made, for a total of $837$ cells. To cover the serviced area, only $411$ cells have been kept. It should be noted that the dataset of \textit{Washington} has roughly the same number of trips as the dataset of \textit{Paris}.
However, the serviced area in \textit{Washington} is twice as wide as that of \textit{Paris}, meaning that the average density of trip departures and arrivals might be lower than for both \textit{Madrid} and \textit{Paris}. The three datasets thus describe three carsharing services operating in different environments. Figure~\ref{fig:ch2_grid_washington} displays with a blue line the limits of the area serviced by \textit{Free2Move} and the hexagonal grid covering it. Note that the hexagons seem smaller even though their size is the same across the three cities: the scale of the map is not the same in the three visualizations.

\begin{figure}[!bt]
\centering
\makebox[\textwidth][c]{\includegraphics[width=0.75\textwidth]{figure/ch2_grid_washington.jpeg}}
\caption[Hexagonal Grid for Washington]{Map of Washington. The geographical limit of the \textit{Free2Move} service is shown in blue. The hexagonal grid covering the serviced area is displayed and is made of 411 cells. Each cell has a radius of 500\,m.}
\label{fig:ch2_grid_washington}
\end{figure}

Once the grid is made, each GPS position in a trip is converted: the GPS positions of the departure and of the arrival are each transformed into a cell ID ($k \in K$). This is done with a dichotomic search, which reduces the number of Euclidean distances to compute between the GPS point and the hexagon centers. More precisely, two dichotomies are carried out simultaneously, one on the north-south axis and one on the west-east axis. The grid is thus split into four quarters (north-west, north-east, south-west and south-east) and the GPS point is located in one of these sub-grids. The dichotomy continues by selecting one quarter of the remaining sub-grid until only one cell remains. The GPS point is then assigned to this cell.
However, if the distance between the GPS point and the center of the assigned hexagon is greater than $500\,\mathrm{m}$, the point is not inside this hexagon and is therefore outside of the grid: it needs to be rejected. When a GPS point is rejected, the whole trip is rejected too and considered as an outlier, for example a customer not respecting the limits imposed by the operator and dropping off the vehicle outside of the allowed perimeter. It should be noted that this represents a negligible number of trips.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\FloatBarrier
\section{Data Statistics}

The city modeling provides the abstraction of the spatial data. With this model, the carsharing usage in each of the three cities has to be studied from a spatial and a temporal point of view, in order to determine whether knowledge can be extracted from those datasets. Indeed, if a dataset consisted only of random noise, for example because people use the service erratically, then nothing of value could be learned by modeling the usage of the fleet. This basic verification can be done with visualizations, in order to check the existence of patterns visible to the naked eye. This section presents both a spatial and a temporal analysis of the fleet usage patterns. Finally, customers are grouped in order to check whether regular and irregular users of the service can be differentiated.
\subsection{Spatial Usage Patterns}

To determine whether the city modeling helps to model the usage patterns of the customers, a brief analysis is made of the spatial distribution of the trip departures. For the whole period of each dataset, the number of trips starting from each cell has been counted. An additional indicator has then been computed to check the spatial imbalance in the customer demand, namely the difference, for each cell, between the number of arrivals and the number of departures. Thus, for each of the three cities, two heatmaps have been made to visualize where the hot spots of usage are located and where the staff have focused their efforts to counterbalance the unbalanced customer demand.

In the case of \textit{Madrid}, the left heatmap of Figure~\ref{fig:ch2_depdes_madrid} shows that two main hot spots exist in the city center. The presence of these two close hot spots is partially explained by the \textquote{Low Pollution Zone} that was created in the city center of Madrid in late 2018. This zone forbids most private vehicles (cars \& motorcycles) from accessing the city center, unless the vehicle is electric. The usage of carsharing in those cells is thus boosted by this \textquote{Low Pollution Zone} regulation. With the help of the departure/arrival disequilibrium heatmap (on the right) in the same figure, an excess of departures is observable on average in those hot spots, while the peripheral cells with lower departure counts see an unbalanced demand with more arrivals than departures. Thus, in the case of \textit{Madrid}, the city abstraction makes visible spatial differences in the customer demand that can be modeled.

For the service in \textit{Paris}, Figure~\ref{fig:ch2_depdes_paris} presents two heatmaps. In the departure heatmap, two clear hot spots surrounded by high-usage cells are visible. Overall, the west and north-west sides of Paris see a higher rate of departures than the east or south-east of Paris.
The park \textquote{Bois de Vincennes} in the east of Paris sees fewer than one thousand trip departures over the whole dataset ($>300$ days), while a single cell in the district \textquote{XVI$^{eme}$ Arrondissement} can surpass this usage. The location of the hot spots can be explained by the socio-economic distribution of the population inside the city: wealthier dwellers reside in the west and north-west of the city than in the east. The customer demand imbalance is noticeable in three areas: in the west and north-west, the cells see on average more departures than arrivals, while the east and the south-west see more arrivals than departures. Overall, as in the \textit{Madrid} case, the abstraction of the city space with a hexagonal grid makes visible differences in the customer usage pattern that can be learned by machine learning models.

For \textit{Washington}, the same heatmaps are shown in Figure~\ref{fig:ch2_depdes_washington}. A large hot spot is observable, mostly in downtown Washington, partially due to the high density of offices and commercial buildings. Indeed, as will be shown in the next subsection, the service in Washington is used both for commuting during the week and for leisure during the weekend. As for the customer demand, most of the hot spots in downtown Washington are balanced, with the exception of several cells with a deep imbalance where the jockeys of the service focused their relocations. It is noticeable that the whole of Arlington County is imbalanced, as almost all of its cells see more trip arrivals than trip departures compared to the Washington area. As for \textit{Madrid} and \textit{Paris}, spatial models of the customer demand can be made.

\begin{figure}[!b]
\centering
\makebox[\textwidth][c]{\includegraphics[width=1\textwidth]{figure/ch2_depdes_madrid.jpeg}}
\caption[Heatmaps for Madrid]{Departure count (left) and departure/arrival disequilibrium (right) heatmaps for the trip dataset of \textit{Madrid}.
In the left heatmap, the greener a cell is, the fewer departures have been recorded over the whole dataset. The number in each cell is the total number of departures from this cell. In the right heatmap, the greener a cell is, the more balanced the cell is between the number of cars rented by customers and the number of cars dropped off. On the contrary, redder cells tend to have a shortage of cars while bluer cells tend to have a surplus of cars. The number in each cell is the difference between the number of arrivals and the number of departures. Negative numbers denote more departures than arrivals and positive numbers denote the opposite. For example, \textquote{-46} denotes 46 more departures than arrivals.}
\label{fig:ch2_depdes_madrid}
\end{figure}

\begin{figure}[!p]
\centering
\makebox[\textwidth][c]{\includegraphics[height=0.75\textheight]{figure/ch2_depdes_paris.jpeg}}
\caption[Heatmaps for Paris]{Departure count (top) and departure/arrival disequilibrium (bottom) heatmaps for the trip dataset of \textit{Paris}. In the top heatmap, the greener a cell is, the fewer departures have been recorded over the whole dataset. The number in each cell is the total number of departures from this cell. In the bottom heatmap, the greener a cell is, the more balanced the cell is between the number of cars rented by customers and the number of cars dropped off. On the contrary, redder cells tend to have a shortage of cars while bluer cells tend to have a surplus of cars. The number in each cell is the difference between the number of arrivals and the number of departures. Negative numbers denote more departures than arrivals and positive numbers denote the opposite.
For example, \textquote{22} denotes 22 more arrivals than departures.}
\label{fig:ch2_depdes_paris}
\end{figure}

\begin{figure}[!p]
\centering
\makebox[\textwidth][c]{\includegraphics[height=0.75\textheight]{figure/ch2_depdes_washington.jpeg}}
\caption[Heatmaps for Washington]{Departure count (top) and departure/arrival disequilibrium (bottom) heatmaps for the trip dataset of \textit{Washington}. In the top heatmap, the greener a cell is, the fewer departures have been recorded over the whole dataset. The number in each cell is the total number of departures from this cell. In the bottom heatmap, the greener a cell is, the more balanced the cell is between the number of cars rented by customers and the number of cars dropped off. On the contrary, redder cells tend to have a shortage of cars while bluer cells tend to have a surplus of cars. The number in each cell is the difference between the number of arrivals and the number of departures. Negative numbers denote more departures than arrivals and positive numbers denote the opposite. For example, \textquote{-402} denotes 402 more departures than arrivals.}
\label{fig:ch2_depdes_washington}
\end{figure}

\FloatBarrier
\subsection{Weekly Customer Usage}

With the data abstracted on a spatial level thanks to the grid of hexagons, there remains the need to split the data temporally. Indeed, splitting the data into hourly, daily or weekly periods does not model the same information. Hourly periods help to detect whether the service has been used more for commuting than for leisure. Daily periods make it easier to assess how a fleet of vehicles should be relocated inside the city to counteract the demand imbalance. Weekly periods make visible seasonal effects in the customer demand, such as holiday periods decreasing the usage of the service.
For the three cities, the average customer usage along the day is presented for each day of the week, in Figure~\ref{fig:ch2_tripdayhour_madrid} for \textit{Madrid}, in Figure~\ref{fig:ch2_tripdayhour_paris} for \textit{Paris} and in Figure~\ref{fig:ch2_tripdayhour_washington} for \textit{Washington}. In those figures, the solid curves represent the workdays and the dashed curves represent the days of the weekend. The number of trips made each day of the dataset is also shown in a second type of figure for all three cities, see Figure~\ref{fig:ch2_nbtrip_madrid} for \textit{Madrid}, Figure~\ref{fig:ch2_nbtrip_paris} for \textit{Paris} and Figure~\ref{fig:ch2_nbtrip_washington} for \textit{Washington}.

For \textit{Madrid}, the average number of trips made along each workday (Figure~\ref{fig:ch2_tripdayhour_madrid}) is linked in part to commute trips in the morning (around 9 a.m.) and in the evening (around 8 p.m.). Another peak exists in the middle of the day, which could correspond to a leisure-oriented usage of the service. Unlike the other days of the week, the evening usage peak on Fridays is spread over a longer period, meaning that the cars of the fleet are used for leisure purposes too, for example by people coming back from bars or other similar amenities. During the weekend, the service is also used for non-work-related trips, with a peak in the afternoon (around 2 p.m.) and in the late evening (10 p.m.). It should be noted that, overall, the service is used more during the workdays than during the weekends. This is further shown in Figure~\ref{fig:ch2_nbtrip_madrid}, where each weekend sees a drop in customer usage. Seasonal effects, such as holidays, are also visible: the usage of the service decreases during the summer (in August 2018).
\begin{figure}[!p]
\centering
\makebox[\textwidth][c]{\includegraphics[width=0.95\textwidth]{figure/ch2_tripdayhour_madrid.jpeg}}
\caption[Trip distribution in Madrid]{Distribution of the customer trips along the day of the week and the hour of the day for the service \textit{Emov} in \textit{Madrid}. Each curve represents the average number of trips (y-axis) for each hour (x-axis) depending on the day (color \& type of the curve). The weekdays, Monday to Friday, are the solid curves while the days of the weekend are the dotted curves.}
\label{fig:ch2_tripdayhour_madrid}
\makebox[\textwidth][c]{\includegraphics[width=0.95\textwidth]{figure/ch2_nbtrip_madrid.jpeg}}
\caption[Trip Count Madrid]{Number of trips made by the customers in \textit{Madrid} (y-axis) for each day of the dataset (x-axis), such that one point is the value for a day and each x-tick is a Sunday.}
\label{fig:ch2_nbtrip_madrid}
\end{figure}

For \textit{Paris}, Figure~\ref{fig:ch2_tripdayhour_paris} first shows that the service is used more during the weekends than during the workdays. On average, the customer usage never falls below 30 trips/h between 9 a.m. and 5 p.m. on Saturdays and Sundays, while the usage peak of the Friday evening (5 p.m.) reaches only around 32 trips/h, and only for this particular hour. The service is nevertheless also used for commuting, as confirmed by the morning peak (6 a.m.), early enough to avoid traffic jams, and by the evening usage peak (5 p.m.). During the workdays, the fleet is also used for leisure-related trips between 8 p.m. and 11 p.m., as shown by a plateau of medium usage. The higher usage during the weekend is also visible in Figure~\ref{fig:ch2_nbtrip_paris}, with regular usage peaks on Saturdays and Sundays. It should be noted that a drop in usage is visible for the last week of November 2019, without any known explanation.
Right after, during the months of December 2019 and January 2020, an increase in the service utilization is observed and can be linked to public transport strikes.

\begin{figure}[!p]
\centering
\makebox[\textwidth][c]{\includegraphics[width=0.95\textwidth]{figure/ch2_tripdayhour_paris.jpeg}}
\caption[Trip distribution in Paris]{Distribution of the customer trips along the day of the week and the hour of the day for the service \textit{Free2Move} in \textit{Paris}. Each curve represents the average number of trips (y-axis) for each hour (x-axis) depending on the day (color \& type of the curve). The weekdays, Monday to Friday, are the solid curves while the days of the weekend are the dotted curves.}
\label{fig:ch2_tripdayhour_paris}
\makebox[\textwidth][c]{\includegraphics[width=0.95\textwidth]{figure/ch2_nbtrip_paris.jpeg}}
\caption[Trip Count Paris]{Number of trips made by the customers in \textit{Paris} (y-axis) for each day of the dataset (x-axis), such that one point is the value for a day and each x-tick is a Sunday.}
\label{fig:ch2_nbtrip_paris}
\end{figure}

Finally, for \textit{Washington}, the customer usage has some similarities with the usage in \textit{Paris} from a temporal point of view. The service is mostly used for commute trips during the workdays, with a peak of utilization in the morning (8 a.m.) and another in the evening (5 p.m.). Unlike for the customers in \textit{Paris}, the constant decrease in usage during the evening might indicate that leisure-related trips are rarer during the workdays. On each day of the weekend, from 10 a.m. to 4 p.m., the customer usage rate is on par with the peak utilization of the workdays, but for leisure-oriented usage. It should be noted that, overall, the usage of the service is higher on Fridays and Saturdays, as seen in Figure~\ref{fig:ch2_nbtrip_washington}.
Several peaks or drops of utilization are observed as well. Some are linked to holidays, such as the peak of utilization on 2019-12-31, but others cannot be linked to such events. Large social events may be associated with them, but no data about such events was available.

\begin{figure}[!p]
\centering
\makebox[\textwidth][c]{\includegraphics[width=0.95\textwidth]{figure/ch2_tripdayhour_washington.jpeg}}
\caption[Trip distribution in Washington]{Distribution of the customer trips along the day of the week and the hour of the day for the service \textit{Free2Move} in \textit{Washington}. Each curve represents the average number of trips (y-axis) for each hour (x-axis) depending on the day (color \& type of the curve). The weekdays, Monday to Friday, are the solid curves while the days of the weekend are the dotted curves.}
\label{fig:ch2_tripdayhour_washington}
\makebox[\textwidth][c]{\includegraphics[width=0.95\textwidth]{figure/ch2_nbtrip_washington.jpeg}}
\caption[Trip Count Washington]{Number of trips made by the customers in \textit{Washington} (y-axis) for each day of the dataset (x-axis), such that one point is the value for a day and each x-tick is a Sunday.}
\label{fig:ch2_nbtrip_washington}
\end{figure}

Overall, in the three services, the customers use the service for different purposes. There are always at least two peaks associated with commute-oriented trips and one plateau of usage during the late morning and afternoon of the weekends. Seasonality effects can be observed, most notably in the case of \textit{Madrid}, as well as the effects of large social events or one-off holidays. From a temporal point of view, splitting the data to model the utilization of each day is therefore an acceptable approach, provided that information such as the day being a holiday or a workday/weekend is kept.
\FloatBarrier
\subsection{Customer Groups}
To complement the results of the previous analysis, customers have been grouped according to the criteria presented by~\cite{wielinski_exploring_2019}. As detailed in Chapter~\ref{ch:background}, the authors of~\cite{wielinski_exploring_2019} separate customers into four categories: \emph{Low Frequency} (\emph{LF}), \emph{Medium Frequency} (\emph{MF}), \emph{High Frequency} (\emph{HF}) and \emph{Ultra Frequency} (\emph{UF}). These categories are defined by the frequency at which the customer uses the service. Customers using the service the most are in \emph{Ultra Frequency} while occasional customers are in \emph{Low Frequency}. Since in the three trip datasets the customers in the \emph{LF} and \emph{MF} categories behave similarly, they have been merged into an \textit{LMF} (\textit{Low \& Medium Frequency}) category. Likewise, since the customers in the \emph{HF} and \emph{UF} categories have the same behavior, they have been merged into a single category called \textit{HUF} (\textit{High \& Ultra Frequency}). The limit between the two categories used in this analysis, \emph{LMF} and \emph{HUF}, is set such that customers active on less than 11\% of the days are put in \emph{LMF} while the others are put in \emph{HUF}. For the three cities, the spatial distributions of the trips made by the \textit{LMF} and \textit{HUF} categories are similar. However, as shown in the following figures, the usage patterns of active members of the service and of less active ones differ from a temporal point of view. In the following figures, all the days in the dataset have been studied. In the case of the service located in \textit{Madrid}, $86\,770$ customers ($90\%$ of customers) have been assigned to the \textit{LMF} category while $9\,696$ customers have been assigned to the \textit{HUF} category.
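
The assignment rule above can be written compactly. As a sketch only, assume (this notation is introduced here for illustration and is not part of the original analysis) that $D$ is the set of days covered by a dataset and $D_c \subseteq D$ the set of days on which customer $c$ made at least one trip; measuring activity over the whole dataset period rather than over another reference period is itself an assumption.
\[
a_c = \frac{|D_c|}{|D|}, \qquad
\mathrm{cat}(c) = \left\{
\begin{array}{ll}
\mathit{LMF} & \mbox{if } a_c < 0.11,\\
\mathit{HUF} & \mbox{otherwise.}
\end{array}
\right.
\]
With the figures given for \textit{Madrid}, the share of customers falling in \textit{LMF} is $86\,770 / (86\,770 + 9\,696) \approx 0.90$, which matches the reported $90\%$.
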
Figure~\ref{fig:ch2_customer_study_madrid} reports the daily proportion of trips made by both categories in this city, as well as the total number of trips made each day. Regarding the proportion of trips made by both categories, it can be observed that during the summer and winter holidays the dominant category of users is \textit{LMF}, while fewer \textit{HUF} users are using the service during these periods. The complementary observation is made for the other periods, where \textit{HUF} customers make more daily trips than \textit{LMF} customers. Checking the total number of daily trips for both periods shows that a drop in the number of trips made by \textit{HUF} customers is what changes the proportion. That is, the trips often made for commute purposes by \textit{HUF} customers, visible in the previous temporal analysis, are not made during those holiday periods. Thus whether a day is a workday or not has an impact on the utilization of the service. Furthermore, a weekly temporal pattern confirms that \textit{HUF} customers are using the service for commute trips. Indeed, for most weeks in the non-holiday period, the \textit{HUF} category is predominant during the weekdays while the \textit{LMF} category is predominant during the weekends. This observation is consistent with fewer trips being made during the weekends, as observed previously.
\begin{figure}[!tb] \centering \makebox[\textwidth][c]{\includegraphics[width=1\textwidth]{figure/ch2_customer_study_madrid.jpeg}} \caption[Customer Study in Madrid]{Distribution of trips made with the service in \textit{Madrid} by two categories of customers: LMF (Low \& Medium Frequency) and HUF (High \& Ultra Frequency) as described by~\cite{wielinski_exploring_2019}.
The top curves represent the distribution (as a percentage) of the number of trips made by the two customer types for each day of the dataset: the blue curve is the proportion (\%) of trips made by LMF customers while the red curve is for HUF customers. The bottom histogram is the absolute number of trips made each day by all customers. For example, on 2019-03-31 around 3000 trips were made, 40\% of them by customers categorized as HUF and 60\% by customers categorized as LMF.} \label{fig:ch2_customer_study_madrid} \end{figure}
For the service located in \textit{Paris}, the \textit{LMF} category contains $6\,861$ customers ($87\%$ of customers) while the \textit{HUF} category has $1\,044$ customers. Figure~\ref{fig:ch2_customer_study_paris} reports the same information on the daily proportion of trips made by \textit{LMF} and \textit{HUF} customers. As with the service in \textit{Madrid}, the \textit{HUF} customers are more predominant during weekdays than during weekends. However, contrary to the \textit{Madrid} service, the change in the ratio of trips between \textit{HUF} and \textit{LMF} customers is due to an increase in usage from the \textit{LMF} category. This means that during weekends, the demand for cars is more likely to come from many more, but less frequent, customers. Thus this demand might appear to be more \textquote{random}. Since the data covers the year 2019, the impact of public transportation strikes can be noticed too. For example, a strike by the workers of \textit{RATP}\footnote{\textit{RATP} stands for \textit{Régie Autonome des Transports Parisiens}.} (a public transportation provider) on Friday 2019-09-13 had an impact on the demand for carsharing. On that day the demand for carsharing doubled: around 800 trips were made, compared to an average of 400 trips on the three previous Fridays.
Moreover, the proportion of \textit{LMF} users on that day indicates that people who do not often use carsharing used the service as a replacement for public buses or subways. This increase in carsharing usage following a strike in public transportation can also be seen during the whole of December 2019 and January 2020, with numerous bus and subway lines being stopped. However, in this case, the increase in trips is predominantly due to \textit{HUF} customers, that is, customers actively using the service. While information about a public transportation disruption should be noted, be it caused by a strike or by natural phenomena, no precise annotation about which days were affected by such disruptions was available in the datasets at the time the data was retrieved. Furthermore, it is not yet clear whether usage patterns can be learned for such exceptional events. For these reasons, strikes and exceptional natural events are not taken into account in the following work.
\begin{figure}[!tb] \centering \makebox[\textwidth][c]{\includegraphics[width=1\textwidth]{figure/ch2_customer_study_paris.jpeg}} \caption[Customer Study in Paris]{Distribution of trips made with the service in \textit{Paris} by two categories of customers: LMF (Low \& Medium Frequency) and HUF (High \& Ultra Frequency) as described by~\cite{wielinski_exploring_2019}. The top curves represent the distribution (as a percentage) of the number of trips made by the two customer types for each day of the dataset: the blue curve is the proportion (\%) of trips made by LMF customers while the red curve is for HUF customers. The bottom histogram is the absolute number of trips made each day by all customers.
For example, on 2020-01-26 around 700 trips were made, 75\% of them by customers categorized as HUF and 25\% by customers categorized as LMF.} \label{fig:ch2_customer_study_paris} \end{figure}
The last service, in \textit{Washington}, has $5\,124$ customers ($75\%$ of customers) assigned to the \textit{LMF} category and $1\,665$ customers assigned to the \textit{HUF} category. As for the two other cities, Figure~\ref{fig:ch2_customer_study_washington} presents both the proportion of daily trips made by \textit{HUF} and \textit{LMF} customers (top) and the total number of daily trips (bottom). Contrary to the services in \textit{Madrid} and \textit{Paris}, the predominant category of customers is the regular one, since on almost no day do the \textit{LMF} customers make more trips than the \textit{HUF} ones. Weekly cycles can be observed as for the two previous services, with in general a higher proportion of trips made by \textit{LMF} customers and a lower proportion made by \textit{HUF} customers during the weekends, even if the \textit{HUF} category stays predominant on each day. For several isolated days, the usage of the carsharing service is either triple or one third of the average usage. Exceptional events may be the origin of those peaks and troughs in the histogram of the total number of daily trips. However, it is not possible to link those exceptional values to any events. Indeed, even with holidays taken into account, the peak of usage cannot be explained for dates like 31$^{st}$ January 2020. However, one can notice a decrease in the usage of the carsharing service from mid-March 2020, the month that saw an increasing presence of COVID-19 in the US and thus in Washington too, leading to less demand for transportation, including carsharing services.
\begin{figure}[!tb] \centering \makebox[\textwidth][c]{\includegraphics[width=1\textwidth]{figure/ch2_customer_study_washington.jpeg}} \caption[Customer Study in Washington]{Distribution of trips made with the service in \textit{Washington} by two categories of customers: LMF (Low \& Medium Frequency) and HUF (High \& Ultra Frequency) as described by~\cite{wielinski_exploring_2019}. The top curves represent the distribution (as a percentage) of the number of trips made by the two customer types for each day of the dataset: the blue curve is the proportion (\%) of trips made by LMF customers while the red curve is for HUF customers. The bottom histogram is the absolute number of trips made each day by all customers. For example, on 2020-03-29 around 400 trips were made, 70\% of them by customers categorized as HUF and 30\% by customers categorized as LMF.} \label{fig:ch2_customer_study_washington} \end{figure}
Overall, several common properties of customer usage are observed for the three selected services. First, the usage of those services is in accordance with observations made for other free-floating carsharing services, as presented in Chapter~\ref{ch:background}, Section~\ref{sec:user_behavior}: this kind of service is often used by regular customers, often for commute trips. Second, for the three services, weekends see an increased usage by customers who do not use the service often, meaning that the global behavior of the demand might be more random on these days than on weekdays. Third, holidays and exceptional events have an impact on the demand, since commute trips are either less or more necessary on those days, depending on whether people commute less during holidays or need cars more because the public transport system is less available. This information is necessary to model the daily usage of the fleet.
Statewide holidays can be found easily, but other exceptional events have not been precisely logged by the operator and are thus unavailable for study. Finally, even if the areas covered by the services in \textit{Madrid} and \textit{Paris} are comparable, there is overall less usage of the service in \textit{Paris}. Furthermore, the usage of the services in \textit{Paris} and \textit{Washington} is comparable, but the area covered is at least twice as large in \textit{Washington} as in \textit{Paris}. Thus even if the demand in each service has similarities, it should not be forgotten that the demand for each service has a unique component that will have an impact on the performance of algorithms modeling it.
\begin{table}[!tb] \centering
\begin{tabular}{|c|c|c|c|c|c|} \hline
City & Category & Customers & Trips & Trips per customer & Avg.\ trip duration \\ \hline
\multirow{2}{*}{Madrid} & LMF & $86\,770$ & $583\,298$ & 6.7 & 20 min \\
 & HUF & $9\,696$ & $542\,069$ & 56 & 20 min \\ \hline
\multirow{2}{*}{Paris} & LMF & $6\,861$ & $51\,863$ & 7.5 & 29 min \\
 & HUF & $1\,044$ & $77\,191$ & 74 & 24 min \\ \hline
\multirow{2}{*}{Washington} & LMF & $5\,124$ & $38\,088$ & 7.4 & 114 min \\
 & HUF & $1\,665$ & $97\,177$ & 58 & 60 min \\ \hline
\end{tabular}
\caption{Summary of the utilization by the two types of customers (\textit{LMF} and \textit{HUF}) for the three services (\textit{Madrid}, \textit{Paris}, \textit{Washington}). \emph{Customers} is the number of customers in each category for each service. \emph{Trips} is the number of trips made by all the customers of each category. \emph{Trips per customer} is the average number of trips made by each customer of a category in a city. \emph{Avg.\ trip duration} is the average duration of a trip for the customers of a category in a city.} \label{tab:ch2_allcustomers} \end{table}
The presented information about customer usage of the service in the three cities is summarized in Table~\ref{tab:ch2_allcustomers}.
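
As a worked reading of Table~\ref{tab:ch2_allcustomers}, the share of trips made by the \textit{HUF} category can be recomputed from the trip counts of the two categories. The symbols $s_{\mathit{HUF}}$, $N_{\mathit{LMF}}$ and $N_{\mathit{HUF}}$ are introduced here for illustration only, and the values are rounded:
\[
s_{\mathit{HUF}} = \frac{N_{\mathit{HUF}}}{N_{\mathit{LMF}} + N_{\mathit{HUF}}}, \qquad
\begin{array}{lcl}
\mbox{Madrid:} & & 542\,069 / 1\,125\,367 \approx 0.48,\\
\mbox{Paris:} & & 77\,191 / 129\,054 \approx 0.60,\\
\mbox{Washington:} & & 97\,177 / 135\,265 \approx 0.72.
\end{array}
\]
These shares are reached while \textit{HUF} customers represent only about $10\%$, $13\%$ and $25\%$ of the customers of the respective services.
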
For every service, a minority of customers makes a large share of the trips: about half of the trips in the case of \textit{Madrid} and about 70\% of the trips in the case of \textit{Washington}. Thus, the usage of the carsharing fleet could be modeled, since it is not only made of \textquote{usage noise} from irregular users. This is further confirmed by the average number of trips made by each customer depending on their category: in all the cities the average number of trips made by \textit{HUF} customers is roughly eight to ten times the average of \textit{LMF} customers. However, the modeling of fleet usage might be made difficult by the impact of irregular users in the \textit{Washington} service. Indeed, in this case irregular customers use the car for around 2 hours on average while regular users use it for only one hour. If a model had to predict the usage of the fleet, the randomness added by \textit{LMF} customers might have a higher impact since they use the car twice as long as a regular user while their spatial demand for cars does not differ from that of regular users, i.e. the usage of a car placed somewhere might vary more, depending only on the type of customer renting it. Moreover, it should be noted that this study is made only with data of past trips, i.e. only the trips that actually happened are recorded and used here. This differs from the real user demand. Indeed, users might not have been able to use the service because the fleet was not positioned well enough to offer a car nearby. This means that even if a customer actually wanted to make a trip, i.e. needed a car to move from one place to another, this demand did not translate into a recorded trip. For this reason, modeling the real customer demand with the information about typical customer behaviors of each empirical category can be tricky.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION - SECTION %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\FloatBarrier
\section{Supplementary Datasets}
On top of the spatial and temporal distribution of customer usage, external factors have to be taken into account to model the general utilization of a carsharing service. To do so, two axes have been considered: the daily weather in the city and the building types in each cell of the hexagonal grid. First, the way the weather dataset has been retrieved and preprocessed for use in the methodology of Chapter~\ref{ch:method} is presented. Second, an approach using open data from OpenStreetMap about building types is presented, as well as the reasons why this dataset could not be used for the methodology in Chapter~\ref{ch:method}.
\subsection{Weather Information}
Three additional weather datasets are used as exogenous data to train the prediction models. They have been retrieved from the hourly SYNOP weather broadcasts of the weather station nearest to each service, for each day in the trip dataset of the service. In the case of \textit{Madrid}, the weather from the station at the \textit{Adolfo Suárez Madrid-Barajas Airport} is used. For \textit{Paris}, the station at the \textit{Orly Airport} provides the weather data. Finally, for \textit{Washington}, the station at the \textit{Ronald Reagan Washington National Airport} provides the weather data.
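
The hourly reports are reduced to daily features by gap filling followed by averaging, as described in the remainder of this subsection. As a sketch under assumed notation (none of these symbols come from the original processing), let $x_{d,h}$ be the value of one feature on day $d$ at hour $h$. For a single missing hour, a linear change between the neighbouring reports reduces to their mean, and the daily value is the mean of the 24 hourly values:
\[
\hat{x}_{d,h} = \frac{x_{d,h-1} + x_{d,h+1}}{2}, \qquad
\bar{x}_{d} = \frac{1}{24} \sum_{h=0}^{23} x_{d,h}.
\]
How runs of several consecutive missing hours were treated is not specified here; extending the same straight line between the nearest available reports would be one possible choice.
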
From the SYNOP broadcasts, several features have been extracted: the temperature ($^\circ$C), the relative humidity (\%), the pressure (hPa), the wind speed (km/h), the cloud cover (\%) and the amount of rain (mm). When hourly report entries were missing, their values were imputed assuming a linear change between the hour before and the hour after each missing entry. Then each feature has been averaged over all hours of each day, in order to obtain one daily value per feature. Those particular features have been chosen since some combinations help to better understand the weather of each day. For example, temperature and relative humidity together indicate how bearable high temperatures are, e.g. 28\,$^\circ$C with a relative humidity of 20\% is more bearable than 28\,$^\circ$C with a relative humidity of 60\%. The aim is to identify the days when customers are more inclined to use the service because using a bike-sharing service or other means of transport would be too inconvenient given the weather.
\FloatBarrier
\subsection{City Buildings Usage}
Inside the city, trips are motivated by the need for customers to travel to places where activities take place or to go back home. As described before, customer usage of a free-floating carsharing service is linked to either commuting to work or going to leisure places. One axis that has been explored is the characterization of each area of the city, i.e. knowing whether each cell includes more residential/commercial buildings or workplaces. To retrieve the usage type of each building, the \textit{OpenStreetMap} (\textit{OSM})\footnote{The \emph{OSM} database has been queried through the \emph{Overpass API}: \url{https://overpass-api.de/}} database can be used. It is an open-data equivalent of Google Maps, i.e. a map of the world where buildings, roads, rivers and other points of interest for cartography are filled out by volunteers.
When volunteers map real buildings into the OpenStreetMap database, additional information can be added, such as the type of building, the number of floors or the usage of the building. For the following characterization, four categories of building usage have been selected: \textquote{Residential}, \textquote{Commercial}, \textquote{Office} and \textquote{Leisure}. The first category, \textquote{Residential}, is made up of apartments and individual houses. The \textquote{Commercial} category is made up of buildings dedicated to selling goods or services, e.g. hotels, marketplaces, shops or supermarkets. The \textquote{Office} category groups offices and other tertiary workplaces. The last category, \textquote{Leisure}, groups all leisure-related buildings such as cafes, gyms, libraries or cinemas. One approach to characterize the city area would use the hexagons of the grid: each hexagon could be characterized with the help of the buildings inside it. However, when the buildings with known usage types are displayed on a map, it can be seen that most areas of the cities are not documented enough. For example, Figure~\ref{fig:ch2_osm_madrid} represents the information contained in the \textit{OpenStreetMap} database for \emph{Madrid}. The map itself comes from the available data and is provided by \textit{OSM}. The available information about the usage type of the buildings is displayed on top as colored points, and numerous areas are not covered by such points. This means that areas like the south-east of \textit{Madrid} are well documented while the west or the north-east are less documented. Figure~\ref{fig:ch2_osm_paris} shows the same information for \textit{Paris}: only buildings on busy avenues and boulevards have their usage type given in the database.
This can be seen from the fact that purple and blue dots, i.e. \textquote{Leisure} and \textquote{Commercial} buildings, are much more represented than residential buildings. The same phenomenon can be observed for \textit{Washington} in Figure~\ref{fig:ch2_osm_washington}: large areas have buildings but no documented usage type, as seen in the north-east and south-east. This kind of incomplete data cannot be used to characterize cells. Indeed, if a cell in the center of \textit{Paris} is categorized as composed mainly of \textquote{Commercial} or \textquote{Office} buildings, this might hide a high presence of residential buildings whose inhabitants could be interested in carsharing services. Thus, even if in theory complete data would allow an additional abstraction, with models trained for \textquote{Residential}, \textquote{Commercial}, \textquote{Office} or \textquote{Leisure} focused areas, in practice the lack of data makes this approach hazardous.
\begin{figure}[!p] \centering \makebox[\textwidth][c]{\includegraphics[width=0.75\textwidth]{figure/ch2_osm_madrid.jpeg}} \caption[Building Type Madrid]{Map of \textit{Madrid} displaying building usage according to the OpenStreetMap database. The perimeter of the service is the blue line. Each colored dot is a building which has its usage type filled out in the OpenStreetMap database. Green dots denote residential buildings, blue dots commercial buildings, cyan dots offices and purple dots leisure-oriented buildings.} \label{fig:ch2_osm_madrid} \end{figure}
\begin{figure}[!p] \centering \makebox[\textwidth][c]{\includegraphics[width=1\textwidth]{figure/ch2_osm_paris.jpeg}} \caption[Building Type Paris]{Map of \textit{Paris} displaying building usage according to the OpenStreetMap database. The perimeter of the service is the blue line. Each colored dot is a building which has its usage type filled out in the OpenStreetMap database.
Green dots denote residential buildings, blue dots commercial buildings, cyan dots offices and purple dots leisure-oriented buildings.} \label{fig:ch2_osm_paris} \end{figure}
\begin{figure}[!p] \centering \makebox[\textwidth][c]{\includegraphics[width=1\textwidth]{figure/ch2_osm_washington.jpeg}} \caption[Building Type Washington]{Map of \textit{Washington} displaying building usage according to the OpenStreetMap database. The perimeter of the service is the blue line. Each colored dot is a building which has its usage type filled out in the OpenStreetMap database. Green dots denote residential buildings, blue dots commercial buildings, cyan dots offices and purple dots leisure-oriented buildings.} \label{fig:ch2_osm_washington} \end{figure}