\chapter{A/B Testing}
\label{ch:ab_testing}

The opportunity to test in real conditions the theoretical models described in Chapters~\ref{ch:method} and~\ref{ch:simulation} has been granted by \emph{Free2Move}.
The objective is to ensure that the developed methodology can be deployed and used in production by \emph{Free2Move}'s carsharing services, and to measure its performance on a real free-floating carsharing system.
This practical evaluation takes the form of an A/B Testing~\cite{trustworthy_2020_kohavi}.
A baseline is applied during period A, and the methodology is then applied during period B.
For practical reasons linked to the \textit{Free2Move} service in Madrid, only a simplified version of the methodology presented in Chapter~\ref{ch:method} could be used.

The experimental setting is detailed first: which service is selected for the A/B Testing, the simplified version of the method, and the experimental conditions.
Then, the statistical test used to assess whether period A or period B has a higher utilization of cars is recalled.
Lastly, the results of the A/B Testing are presented.

\section{Experimental Setting}

\textit{Free2Move} agreed to run an A/B Testing in order to assess the performance of the proposed methodology on a real case in \emph{Madrid}.
This service was selected over the services in \emph{Washington} and \emph{Paris} because the experiments in Chapter~\ref{ch:method} showed both a better performance in the prediction of the utility and a higher expected profit increase when using the method.
Furthermore, \emph{Free2Move} is not responsible for the relocations in \emph{Washington}, where the operator uses subcontractors, and in \emph{Paris} the operator's staff does not have the capacity to make relocations.
Figure~\ref{fig:ch5_madrid_area} shows the perimeter of the service in \emph{Madrid} as a blue line, together with all the hexagons forming the grid that represents the service's area.
Note that this perimeter is not exactly the same as the one presented in Figure~\ref{fig:ch2_grid_madrid}, which was used during the experiments of Chapters~\ref{ch:method} and~\ref{ch:simulation} and concerns trips made four years earlier.

\begin{figure}[!ht]
\centering
\includegraphics[width=0.55\textwidth]{figure/ch5_madrid_area.jpeg}
\caption[Free2Move Area Service 2022]{The area served by \textit{Free2Move}, the free-floating carsharing service in Madrid, when the A/B Testing was run. The grid represents all cells and the blue line represents the actual service perimeter.}
\label{fig:ch5_madrid_area}
\end{figure}

The method proposed in Chapter~\ref{ch:method} could not be used as is, for two practical reasons.
First, \textit{Free2Move} has no access to public electric charging stations in Madrid, so all the cars are recharged in a single central hub belonging to \textit{Free2Move}.
Thus, for any optimal distribution of cars for the next morning, there is only one point from which the jockeys can take the cars: the central charging hub.
Second, the team in charge of moving and recharging cars in the city gives priority to the discharged electric cars, which are not bookable and therefore not usable.
Indeed, considering the number of discharged cars to be recharged every night, it is generally not possible to spare staff members to relocate cars that are usable by customers but placed in low-demand areas.

Thus, no optimal relocation has been computed by the methodology to be tested.
Only the ordered list of the \textquote{best positions} in which to place the cars has been provided to \textit{Free2Move}, so that their jockeys know where to place the fully recharged electric cars leaving the central charging hub.
The \textquote{best positions} are the positions that have the highest utility when a car is placed there; thus only the matrix of utility $U^i_k$ for all cells $k \in K$ and ranks $i \in I$ has been predicted, following the same process as described in Chapter~\ref{ch:method}.
The ordered list returned by the simplified methodology is the list of couples $(k,i)$ sorted by decreasing order of their utility $U^i_k$.

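As a purely illustrative example, with hypothetical values that are not taken from the experiment, consider two cells $k_1$ and $k_2$ and two ranks $i \in \{1, 2\}$ with the predicted utilities
$$ U^1_{k_1} = 50, \quad U^2_{k_1} = 20, \quad U^1_{k_2} = 35, \quad U^2_{k_2} = 10. $$
Sorting the couples by decreasing utility would then give the ordered list
$$ (k_1, 1), \; (k_2, 1), \; (k_1, 2), \; (k_2, 2). $$
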
It should be noted that \emph{this approach is not equivalent to the Greedy baseline} detailed in Algorithm~\ref{alg:ch3_greedy_algorithm} (from Chapter~\ref{ch:method}).
Indeed, the \emph{Greedy} baseline seeks to relocate up to a fixed number of cars while taking into account the costs of transporting jockeys in sweeper cars.
In the version used for the A/B Testing, the locations to be filled with charged cars are only ordered, i.e., the methodology does not expect all the returned locations to be filled with cars.
Nor is the staff expected to fill the locations strictly from the most lucrative to the least lucrative, which lets the jockeys take practical necessities into consideration.

For the A/B Testing, two \textquote{A} periods have been provided by \textit{Free2Move}.
The first one covers the $28$ days of February 2022 and is called \textquote{A1} from now on.
The second one covers the $30$ days of April 2022 and is called \textquote{A2} from now on.
Finally, the period during which the test is conducted is called \textquote{B} and covers the $31$ days of March 2022.

In order to evaluate the impact of this simplified method on the service, the number of trips made each day and the daily utility have been recorded for all periods of the A/B Testing.
The aim is to compare the mean number of trips per day and the mean utility per day between each \textquote{A} period and the \textquote{B} period.
If the comparison between \textquote{A1} and \textquote{B} shows that both means are similar, then the distribution of cars placed following the simplified methodology led to a similar daily customer usage of the fleet.
The comparison between \textquote{A2} and \textquote{B} is done to confirm (or not) the previous result.
In both cases, the comparison of the mean daily number of trips and the mean daily utility between an \textquote{A} period and the \textquote{B} period relies on a \emph{Test of Homogeneity}.

\section{Homogeneity Tests}

% Put commands for special numbers here
\newcommandx{\meanA}{\bar{x}_A}
\newcommandx{\meanB}{\bar{x}_B}
\newcommandx{\varA}{\hat{\sigma}^2_A}
\newcommandx{\varB}{\hat{\sigma}^2_B}
\newcommandx{\stdD}{\hat{\sigma}_D}
\newcommandx{\tboundary}{t_{\alpha}^v}

The difference between the mean scores of the \textquote{A} and \textquote{B} periods could be due to chance, for example because not enough observations have been made on a phenomenon with a high variance.
It could also be explained by a real difference in how the service performed, which led to different outcomes.
In what follows, the set of scores from an observed period is called a sample.
For example, Figure~\ref{fig:ch5_distribution_example} shows two examples of Gaussian distributions whose empirical mean and variance are deduced from two mock-up samples.
The aim is to gather statistical evidence to either keep or reject a \emph{null hypothesis}, e.g., to find whether the mean of one sample is significantly lower than the mean of the other.
In both cases (left and right) of the figure, the difference between the means of the \textit{red} and \textit{blue} distributions is identical.
However, to a non-expert eye, the mean of the \textit{red} distribution seems \emph{more significantly} lower than the mean of the \textit{blue} one in the left case than in the right case.
To quantify how significant the difference between the means of two samples is, statistical tests are used, such as \textit{Welch's t-Test}, which is similar to \textit{Student's t-Test}.

\begin{figure}[!ht]
\centering
\includegraphics[width=1\textwidth]{figure/ch5_distribution_example.jpeg}
\caption[Probability Density Function Example]{Two examples of empirical distributions to compare, with their \textit{Probability Density Function} (\textit{PDF}) deduced from mock-up observations. On the left, the difference between the \textit{PDF} of the \textit{red} distribution's observations and that of the \textit{blue} distribution's observations is noticeable. On the right, however, the difference between the two distributions is less clear, mainly because of the higher variance in the sets of observations.}
\label{fig:ch5_distribution_example}
\end{figure}

\textit{Welch's t-Test} is used when the two samples follow a normal distribution, possibly with unequal sizes and with unequal (or unknown) variances.
For this test, the null hypothesis is that the mean score of the sample from a period \textquote{A} is \emph{not lower} than the mean score of the sample from the period \textquote{B}.
Since the null hypothesis states that one mean score is not lower than the other, the \textit{one-sided} version of \textit{Welch's t-Test} is used.

To either keep or reject this null hypothesis, a statistic called $t_D$ is computed, with $\meanA$ the mean of sample A, $\meanB$ the mean of sample B and $\stdD$ the estimated standard deviation of the difference between the two sample means:
$$ t_D = \frac{\meanA - \meanB}{\stdD} $$

The value of the statistic $t_D$ depends on $\stdD$, defined by:
$$ \stdD = \sqrt{\frac{\varA}{n_A} + \frac{\varB}{n_B}} $$
with $\varA$ the empirical variance of sample A, $\varB$ the empirical variance of sample B, $n_A$ the size of sample A and $n_B$ the size of sample B.

From the statistic $t_D$, it is possible to compute the confidence with which the null hypothesis can be rejected.
This is done by evaluating at $t_D$ the cumulative distribution function of the \emph{Student t-law} with $v$ degrees of freedom.
This computed value is called the p-value (or $\alpha$).
In this case, the lower the p-value, the more safely the null hypothesis can be rejected.
Since \textit{Welch's t-Test} is used, the number of degrees of freedom $v$ has to be estimated.
It is defined as:
$$ v = \frac{\left[ \frac{\varA}{n_A} + \frac{\varB}{n_B} \right]^2}{\frac{\left( \varA / n_A \right)^2}{n_A - 1} + \frac{\left( \varB / n_B \right)^2}{n_B - 1}} $$
with the same notation as in the definition of $\stdD$.

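In symbols, and as a compact restatement of the one-sided test described above, let $T_v$ denote a random variable following the \emph{Student t-law} with $v$ degrees of freedom and $F_v$ its cumulative distribution function (two notations introduced here only for this restatement). The p-value, written $p$, is then
$$ p = F_v(t_D) = \Pr\left( T_v \leq t_D \right), $$
so that a strongly negative $t_D$, i.e., a mean of sample A well below the mean of sample B relative to $\stdD$, yields a small p-value.
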
In the mock-up of Figure~\ref{fig:ch5_distribution_example}, the null hypothesis for both sub-figures is that the mean of the \emph{red sample} distribution is not lower than the mean of the \emph{blue sample} distribution.
For the sub-figure on the left, the null hypothesis can be safely rejected with a p-value $\leq 1$ \%, i.e., there is strong evidence that the mean of the \emph{red samples} is lower than the mean of the \emph{blue samples}.
However, for the sub-figure on the right, the same null hypothesis cannot be safely rejected since the p-value of the \emph{Welch t-Test} is 12 \%.

\section{Results}

The objective of this section is to find whether there are good reasons to validate the use of the simplified methodology in the \emph{Madrid} service.
To do so, a \emph{Welch t-Test} has been performed between the samples from periods \textquote{A1} and \textquote{B}, as well as between the samples from periods \textquote{A2} and \textquote{B}.
Two main indicators are studied: the daily total utilization of the cars (in minutes), called \emph{utility} in the following experiments, and the daily number of customer trips, called \emph{\#trips}.
The objective is to find out whether there is a difference between each \textquote{A} period and the \textquote{B} period, and whether this difference is due to the use of the methodology.
In more detail, the aim is first to find whether the usage of the service increased, decreased or did not change at all.
Then, if the usage changed when the simplified methodology was used, other indicators are computed to know whether the staff followed the propositions made by the methodology.
If the usage increased or decreased and the staff followed the methodology's propositions as much as possible, then the methodology is likely to have had, respectively, a positive or negative impact on the utilization of the service.

\paragraph{Welch t-Test on the \emph{utility}.}

\begin{figure}[!t]
\centering
\includegraphics[width=1\textwidth]{figure/ch5_abtest_utility.jpeg}
\caption[Daily Utility Gaussian Distribution]{Comparison of the \textit{Probability Density Function} (\textit{PDF}) of the daily \emph{utility} distribution for three months: February, March and April 2022. On the left is the comparison between the \textit{PDF} of February (blue) and March (green), on the right the comparison is made between April (red) and March (green). The vertical colored lines visualize the mean daily utility for each month.}
\label{fig:ch5_abtest_utility}
\end{figure}

For each day of February, March and April 2022, the utilization of the cars is summed up to create one data point.
For each month, a sample is created with the daily data points belonging to that month, and its empirical mean and standard deviation are computed.
Figure~\ref{fig:ch5_abtest_utility} shows the distributions fitted on the samples of the three periods from their empirical mean and standard deviation, with the comparison between February and March on the left sub-figure and the comparison between April and March on the right sub-figure.
In the first case, the difference between the mean utility of February and March is visible: the average (and standard deviation) utilization of cars is $55\,886 \pm 12\,253$ minutes/day in February whereas it is $61\,630 \pm 14\,158$ minutes/day in March.
In this case, the null hypothesis that the mean utility \emph{is not lower} in February than in March can be rejected safely (p = 5\%).
In the second case, the difference of mean utility between April and March is less clear: the average (and standard deviation) utility is $62\,241 \pm 12\,884$ minutes/day in April while it is $61\,630 \pm 14\,158$ minutes/day in March.
For this case, the null hypothesis cannot be rejected safely (p = 57\%).
Overall, the simplified version of the methodology proposed in Chapter~\ref{ch:method} might have improved the daily utility by 10\% when March is compared to February, but there is no noticeable difference between April and March.
According to the operational team monitoring the service in \emph{Madrid}, seasonal effects on the utility were observed in previous years (except for 2020 because of COVID-19).
Notably, each year, usage increases regularly from January to June, with June being the peak.
Given this trend, a higher utility would be expected in April than in March; the similar levels observed suggest that, had the simplified methodology not been used, the usage of the service in March could have been lower than the one observed.

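As a sanity check on the comparison of the utility between February and March, the statistic can be recomputed from the rounded figures reported above. This is only a sketch: it assumes that the values given after $\pm$ are the empirical standard deviations of the daily samples, with $n_A = 28$ days for February and $n_B = 31$ days for March.
$$ \stdD = \sqrt{\frac{12\,253^2}{28} + \frac{14\,158^2}{31}} \approx 3\,439, \qquad t_D = \frac{55\,886 - 61\,630}{3\,439} \approx -1.67 $$
$$ v = \frac{\left[ \frac{12\,253^2}{28} + \frac{14\,158^2}{31} \right]^2}{\frac{\left( 12\,253^2 / 28 \right)^2}{27} + \frac{\left( 14\,158^2 / 31 \right)^2}{30}} \approx 56.9 $$
With about $57$ degrees of freedom, the $5$\,\% one-sided critical value of the \emph{Student t-law} is close to $-1.67$, which is consistent with the reported p-value of $5$\,\%.
The relative difference between the two means, $(61\,630 - 55\,886) / 55\,886 \approx 10.3$\,\%, corresponds to the improvement of about $10$\,\% mentioned above.
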
\paragraph{Welch t-Test on the \emph{\#trips}.}

\begin{figure}[!b]
\centering
\includegraphics[width=1\textwidth]{figure/ch5_abtest_trips.jpeg}
\caption[Daily Number Trip Gaussian Distribution]{Comparison of the \textit{Probability Density Function} (\textit{PDF}) of the daily number of trips (\emph{\#trips}) distribution for three months: February, March and April 2022. On the left is the comparison between the \textit{PDF} of February (blue) and March (green), on the right the comparison is made between April (red) and March (green). The vertical colored lines visualize the mean number of daily trips for each month.}
\label{fig:ch5_abtest_trips}
\end{figure}

As for the utility, the daily numbers of trips for February, March and April are gathered into three samples, one per month.
For each sample, the empirical mean and standard deviation are computed and used to draw the fitted Gaussian distributions in Figure~\ref{fig:ch5_abtest_trips}.
As in the previous study, the sub-figure on the left represents the fitted Gaussian distributions for February and March: the mean daily \emph{\#trips} (and its standard deviation) is $1\,124 \pm 131$ in February while it is $1\,197 \pm 136$ in March.
In this case, the null hypothesis that the mean daily \emph{\#trips} in February \emph{is not less} than the mean daily \emph{\#trips} in March can be safely rejected (p = 2 \%).
In the second case, i.e., the comparison between April and March for the same indicator on the right sub-figure, the mean daily \emph{\#trips} is $1\,115 \pm 235$ in April while it is $1\,197 \pm 136$ in March.
For this comparison between April and March, the null hypothesis that the mean daily \emph{\#trips} in April \emph{is not less} than the mean daily \emph{\#trips} in March can be rejected, though less safely, with an associated p-value of 6\%.
One can notice that even though the empirical mean daily \emph{\#trips} in April is lower than the one in February, the higher empirical standard deviation in April explains why the p-value of the \emph{Welch t-Test} is higher.
Overall, the simplified version of the methodology might have improved the daily number of customer trips in March by 6 \% and 7 \% when compared to February and April respectively.
However, these first conclusions depend on the effectiveness of the relocation team in \emph{Madrid}: if the list of locations where to put a car was not followed, the variations observed would be due to sheer luck.

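The role of the standard deviation in these two comparisons of the \emph{\#trips} can be made explicit by recomputing the statistic from the rounded figures reported above. This is a sketch under the assumption that the values given after $\pm$ are the empirical standard deviations of the daily samples, with $28$ days in February, $31$ days in March and $30$ days in April.
For February and March:
$$ \stdD = \sqrt{\frac{131^2}{28} + \frac{136^2}{31}} \approx 34.8, \qquad t_D = \frac{1\,124 - 1\,197}{34.8} \approx -2.10 $$
For April and March:
$$ \stdD = \sqrt{\frac{235^2}{30} + \frac{136^2}{31}} \approx 49.4, \qquad t_D = \frac{1\,115 - 1\,197}{49.4} \approx -1.66 $$
Although the gap between the means is larger for April ($82$ trips per day against $73$), the larger $\stdD$ brings $t_D$ closer to zero, hence the higher p-value.
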
\paragraph{Staff compliance with the methodology.}

The tested methodology proposes a list of car locations, ordered by decreasing utility value.
If a location is already occupied by a car, it is removed from this list.
In theory, each relocation made by the staff should end in the highest-ranked unoccupied location of this list.
However, for practical reasons, this might not always be the case: when a staff member has to place a fully charged car, it is often placed near a discharged one so that the latter can be taken directly back to the central charging hub.
Several cars can then be placed in ``suboptimal'' locations for the sake of keeping as many cars available for customers as possible.
Hence, additional measures have to be gathered on the fleet placement to assess the compliance of the staff with the suggestions of the methodology.
This helps to determine whether the increase in the mean daily number of trips and utility in March is due to the methodology or to sheer luck.

\begin{figure}[!t]
\centering
\includegraphics[width=1\textwidth]{figure/ch5_abtest_optimcars.jpeg}
\caption[Summary Car Placement Staff]{Summary of the relocation decisions made by the staff, either following or disregarding the proposed car locations. The left sub-figure shows both the daily number of cars relocated to a proposed location (in blue, \emph{Optimal Cars}) and the daily number of cars not relocated to a proposed location (in red, \emph{Suboptimal Cars}). The right sub-figure shows in green the daily ratio of cars placed in the proposed locations (\emph{Optimal Cars}) over the total number of relocations made by the staff during the day. For both sub-figures, the x-axis shows the days, with each tick marking a Sunday.}
\label{fig:ch5_abtest_optimcars}
\end{figure}

Knowing that the methodology suggests an ``optimal'' location for each car of the fleet, the number of unoccupied locations suggested by the methodology has been analyzed for each day, i.e., from 10:00~p.m. to 9:59~p.m. the next day.
For a fleet of around 500 cars, this accounts for an average of 256 ($\pm$ 42) proposed locations to which cars could be relocated.
This implies that, according to the ordered list of proposed locations, about half of the fleet was either discharged or not in the best position.
Against these 256 proposed locations on average, the staff made a daily average of 82 ($\pm$ 15) relocations.
Not all suggested relocations could be made, mainly because of the limited number of staff members available, but the relocation team focused on the highest-priority spots.
It should also be noted that two kinds of relocations were made by the staff during this period: relocations to put fully charged cars back in service, and relocations to better place other cars, as the proposed methodology originally intended.

Of this daily average of 82 relocations, a daily average of 53 ($\pm$ 14) relocations ended in suggested ``optimal'' places while only 29 ($\pm$ 9) relocations ended in spots not suggested by the algorithm.
Hence, on average each day, 65\% ($\pm$ 10\%) of the relocations followed the suggestions of the methodology.
The compliance of the staff is not perfect, but this score is the result of practical necessities and decisions made in the field.
Indeed, \emph{Free2Move} has only recently been able to know where the booking application is opened by customers.
The areas where the application was not being opened to rent cars were therefore left out by the staff.
Thus, the staff did not use the suggested locations that fell in these areas with no ``empirical demand''.
Figure~\ref{fig:ch5_abtest_optimcars} gives daily details about the suggestions being followed or disregarded by the staff.
The left sub-figure shows the number of relocations depending on whether their end point is a suggested spot or not.
It shows that the number of relocations ending in suggested spots is lower on Fridays and Saturdays than during the rest of the week on average, while the number of relocations ending in other spots does not follow any particular pattern.
One explanation might be that the staff has to focus more on discharged cars on Fridays and Saturdays, because of the increased usage on Thursdays and Fridays, and thus ``convenient'' relocations of recharged cars were made near discharged ones, in spots not suggested.
The right sub-figure shows the ratio of relocations ending in suggested spots over the total number of relocations made.
No clear pattern can be extracted; however, one can observe that each day this ratio is (almost) always between 60\% and 80\%.

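The daily averages reported above can be cross-checked with a short computation. The ratio of the average counts is close to the reported mean of the daily ratios, although the two quantities are not identical in general, and relating the relocations to the 256 proposed locations gives the share of suggestions actually filled on an average day:
$$ \frac{53}{82} \approx 65\,\%, \qquad \frac{29}{82} \approx 35\,\%, \qquad \frac{53}{256} \approx 21\,\% $$
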
~\\

Overall, even if \emph{Free2Move}'s relocation team did not perfectly follow the output of the methodology when relocating the cars, the methodology influenced a significant part of the fleet positioning.
It is not possible to formally prove that the increase in the daily average number of trips in March, compared with February and April, is due to the methodology's output.
However, the level of compliance of the staff and the increase in usage expected from the methodology are two arguments in favor of the hypothesis that this methodology slightly improved the service usage when used in real conditions.