Reward Design for Driver Repositioning Using Multi-Agent Reinforcement Learning

2020·Arxiv

Abstract

Abstract

A large portion of passenger requests is reportedly unserviced, partially due to vacant for-hire drivers’ cruising behavior during the passenger seeking process. This paper aims to model the multi-driver repositioning task through a mean field multi-agent reinforcement learning (MARL) approach that captures competition among multiple agents. Because the direct application of MARL to the multi-driver system under a given reward mechanism will likely yield a suboptimal equilibrium due to the selfishness of drivers, this study proposes a reward design scheme with which a more desired equilibrium can be reached. To effectively solve the bilevel optimization problem with upper level as the reward design and the lower level as a multi-agent system, a Bayesian optimization (BO) algorithm is adopted to speed up the learning process. We then apply the bilevel optimization model to two case studies, namely, e-hailing driver repositioning under service charge and multiclass taxi driver repositioning under NYC congestion pricing. In the first case study, the model is validated by the agreement between the derived optimal control from BO and that from an analytical solution. With a simple piecewise linear service charge, the objective of the e-hailing platform can be increased by 8.4%. In the second case study, an optimal toll charge of $5.1 is solved using BO, which improves the objective of city planners by 7.9%, compared to that without any toll charge. Under this optimal toll charge, the number of taxis in the NYC central business district is decreased, indicating a better traffic condition, without substantially increasing the crowdedness of the subway system.

Keywords: Mean Field Multi-Agent Reinforcement Learning, Reward Design, Bayesian Optimization

1. Introduction

The emergence of transportation network companies (TNCs) or e-hailing platforms (such as Didi and Uber) has revolutionizsed the traditional taxi market and provided commuters a flexible-route door-to-door mobility service. Nonetheless, it is reported that a large portion of the passenger requests remain unserviced (Lin et al., 2018), partly due to the imbalance between demand (i.e., passenger requests) and supply (i.e., available drivers) that results in long cruising trips for taxi drivers to find the next passenger and long waiting time for passengers to be picked up (Powell et al., 2011; Di and Ban, 2019). Such cruising behavior has negative impact on urban economy by not only decreasing drivers’ income but also generating additional vehicle miles traveled. Thus, repositioning available drivers to potential locations with near-future high demand, i.e., to balance supply and demand, becomes the key challenge faced by the taxi and for-hire market, including e-hailing platforms.

1.1. Problem statement

This paper aims to solve the multi-driver repositioning task, in which a large number of idle drivers make sequential decisions on where to reposition themselves over a planning horizon, so that they can seek for the next passengers and maximize their individual cumulative rewards. When one idle driver has to make a sequential decision of repositioning while others are doing so simultaneously, competition among drivers arises.

Markov decision process (MDP) and reinforcement learning (RL) have become popular tools for sequential decision-making. Single-agent RL has been used in several studies (Gao et al., 2018; O’Keeffe et al., 2019) to solve the single-driver repositioning problem. However, single-agent RL assumes each agent is an independent learner and ignores the impact of competition on each individual agent’s policy. To capture competition among multiple agents in the process of repositioning, this paper employs a multi-agent reinforcement learning (MARL) approach.

Nevertheless, this paper goes beyond simply modeling multi-driver repositioning with MARL. We notice that every driver plays a non-cooperative game, meaning each driver’s goal is to select a sequence of reposition strategies to maximize her individual net profit, given other drivers select their own optimal reposition strategies. Accordingly, the direct application of MARL to the multi-driver system under a given reward function will likely yield a suboptimal equilibrium for the system, thanks to the selfishness of drivers. Therefore, the ultimate goal of this paper is to proposes a reward design scheme with which a more desired equilibrium can be reached. We would like to emphasize the difference between “reward” and “reward design”. In this paper, the reward for an idle driver is simply the net monetary return one driver earns, while the reward design scheme is imposed by a higher-level planner (e.g., the TNC platform or city planners) to modify drivers’ monetary return. Analogous to the role of congestion pricing, reward design aims to spatially redistribute drivers, driven by individual reward maximization, through an external cost imposed to those who tend to move towards already oversupplied spots. Under the external cost, drivers still behave selfish but their movement is now governed by a modified reward, which consequently balances demand and supply and drives the system towards a more desirable equilibrium.

To justify that reward design can be an effective way of balancing supply and demand, we would like to briefly introduce our problem set-up below. We assume that each driver, behaving like an intelligent agent, selects a sequence of optimal reposition policies to maximize its own cumulative reward, which is the monetary income. Usually a TNC platform charges a commission fee to drivers. Thus, the revenue a TNC driver receives is an order’s fare minus the commission fee. The TNC platform can adjust the percentage of fares going to drivers, in order to balance demand and supply. For instance, if the platform notices that excessive drivers try to move to one hotspot to compete for a limited number of orders, the platform can increase the commission fee in that grid to prevent oversupply. Drivers are still free to reposition themselves toward that grid, but adding a higher commission fee would reduce their long-run accumulative rewards, and accordingly, influence their reposition policies, which hopefully discourage some drivers from moving to that grid.

1.2. Literature review

In recent years, we have witnessed a growing body of literature on TNC using reinforcement learning, including pricing mechanism design (Pandey and Boyles, 2018; Pandey et al., 2019), vehicle routing (Bazzan and Grunitzki, 2016; Nazari et al., 2018; Peng et al., 2020), ride order dispatching (Li et al., 2019; Tang et al., 2019; Ke et al., 2019; Zhou et al., 2019), and driver repositioning (Gao et al., 2018; Lin et al., 2018; Yang et al., 2020). This study focuses on the multi-driver repositioning task.

The essence of the repositioning task is to provide recommendations to idle drivers on where to find the next passenger. Some recommender systems have been proposed for drivers (Ge et al., 2010; Hwang et al., 2015; Yuan et al., 2011; Qu et al., 2014). These studies extracted useful aggregated statistical quantities such as taxi demand and travel time from historical data and recommended a next cruising location (Ge et al., 2010), a sequence of potential pickup points (Hwang et al., 2015), a driving route (Qu et al., 2014), or a route and a location (Yuan et al., 2011). Although the aforementioned studies provide effective recommendations of the next cruising route or location to drivers at the immediate next step, they are nearsighted and fall short of capturing the future long-run payoffs. To capture the effect of future rewards on the recommendation at the immediate next step, various Markov decision process (MDP) based approaches have been proposed to model idle drivers’ passenger searching process (Rong et al., 2016; Zhou et al., 2018; Verma et al., 2017; Gao et al., 2018; Yu et al., 2019; Shou et al., 2020). In an MDP, a driver is the agent who makes decisions of where to go next and interacts with the environment. The agent aims to derive an optimal policy which maximizes her expected cumulative reward. When the environment is known to the agent, dynamic programming can be used to solve the MDP and derive an optimal policy. When the dynamic environment is unknown to the agent, reinforcement learning (RL) algorithms such as Q-learning and temporal difference learning (Sutton and Barto, 1998) can be employed to derive an optimal policy.

The competition among multiple agents is, however, neglected in the aforementioned MDP models due to their single-agent setting, resulting in overly optimistic optimal policies. In other words, one agent cannot earn the full amount of the expected reward by following the policy derived in the single-agent setting. In a dynamic environment involving a group of agents, multiple agents interact with both the shared environment and other agents. Multi-agent reinforcement learning (MARL) (Buoniu et al., 2010) thus fits naturally well in this multi-agent system (MAS). Recently, MARL has been attracting significant attention due to its success in tackling high dimensional and complicated tasks such as playing the game of Go (Silver et al., 2016, 2017), Poker (Brown and Sandholm, 2018, 2019), Dota 2 (OpenAI, 2018), and StarCraft II (Vinyals et al., 2019).

MARL tasks can be broadly grouped into three categories, namely, fully cooperative, fully competitive, and a mix of the two, depending on different applications (Zhang et al., 2019): (1) In the fully cooperative setting, agents collaborate with each other to optimize a common goal; (2) In the fully competitive setting, agents have competing goals, and the return of agents sums up to zero; (3) The mixed setting is more like a general-sum game where each agent cooperates with some agents while competes with others. For instance, in the video game Pong, an agent is expected to be either fully competitive if its goal is to beat its opponent or fully cooperative if its goal is to keep the ball in the game as long as possible (Tampuu et al., 2017). A progression from fully competitive to fully cooperative behavior of agents was also presented in Tampuu et al. (2017) by simply adjusting the reward.

A key challenge arises in MARL when independent agents have no knowledge of other agents, that is, the theoretical convergence guarantee is no longer applicable since the environment is no longer Markovian and stationary (Matignon et al., 2012; Nguyen et al., 2018). To tackle this issue, one way is to exchange some information among agents. In some contexts, agents actually exchange information with their peers through some coordination. For example, in the game of a team of hunters capturing a team of preys, Tan (1993) proposed multiple ways to enable coordination among agents and concluded that the performance of the hunter agents can be better off through some coordination. However, in other contexts such as the driver repositioning system, agents only have access to their own information. Thus, information exchange among agents involves a central controller which collects the information of all agents and disseminates it to agents. Agents update their value functions and policies based on the provided information from the central controller and their local observations. This is the centralized learning (i.e., based on global information) and decentralized execution (i.e., based on local observation) paradigm, which has become increasingly popular in recent research (Foerster et al., 2016; Lowe et al., 2017; Lin et al., 2018; Li et al., 2019).

While training is stabilized conditioning on the information of other agents such as joint state and joint action in the centralized training paradigm, scalability becomes a critical issue in MARL because the joint state space and joint action space grow exponentially with the number of agents. To make MARL tractable when a large number of agents coexist, Yang et al. (2018) employed the mean field theory to simplify the interaction among agents. The basic idea is, from the perspective of an agent, to treat other agents as a mean agent. Thus, the complexity of interactions among a large number of agents is substantially eased by reducing the dimension in the Q-value function. The large scale MARL with hundreds of or even thousands of agents becomes solvable. To investigate the large-scale order dispatching problem where thousands of agents are present, Li et al. (2019) adopted a mean field approximation and proposed to take the average response from neighboring agents as a proxy of the interaction between the agent and other agents.

1.3. Research gaps

Recent studies have successfully applied MARL to order dispatching and driver repositioning problems. Table 1 summarizes these papers in terms of MARL set-up and algorithms. Because this paper is focused on repositioning, we will mainly highlight the research gaps that exist in the studies on multi-driver repositioning.

There are a few studies that have employed MARL for the task of driver repositioning (Lin et al., 2018; Yang et al., 2020). For each time period, based on a difference matrix between the supply and the predicted demand, Yang et al. (2020) adopted a WoLF (Win or Learn First) policy hill-climbing

algorithm to reposition cooperative drivers with the goal of balancing demand and supply. Cumulative reward of drivers over multiple time periods, however, is not optimized. Lin et al. (2018) proposed a contextual actor-critic model to reposition thousands of drivers with each driver aims to maximize her cumulative reward. To make training tractable, a global state which includes the overall distribution of demand and supply is used. Global information, however, may not be obtainable from the perspective of an agent, especially during execution. To fill the above gaps, this paper aims to tackle the multi-driver repositioning task only using local observation of each driver.

Although the approaches listed in Table 1 are efficient under a given reward function, the reached equilibrium is very likely to be suboptimal from the overall perspective of the system. In other words, there may exist another reward function that yields a better equilibrium in terms of some metrics such as gross mechandise volume (GMV) and order response rate (ORR) in the context of taxi hailing. One possible approach to achieve a better equilibrium is to intentionally craft a reward function that is aligned with the goal of the overall system. For example, Li et al. (2019) proposed the order destination potential as a component in the reward function to enable some cooperation among drivers and discourage drivers from being selfish. Human-drivers, however, are selfish in nature and focus on their own monetary return. Thus, the embeded goal of the system in the predefined reward function may not reflect the intrinsic interest of drivers, and therefore drivers may not follow the derived optimal policy. To address the above limitations of a predefined reward function, this study aims to achieve a more desirable equilibrium by adjusting drivers’ monetary return through some realistic measures of the system. For example, e-hailing platforms can adjust the platform service charge (aka the commission fee) to achieve a better GMV or ORR. Likewise, city planners commonly use congestion pricing to drive the traffic system performance towards a system optimum in transportation network design problems (Yang and H. Bell, 1998; Zhang and Yang, 2004; Meng and Liu, 2012; Di et al., 2014, 2016, 2017, 2018). We call these measures as “reward design” mechanism.

In this paper, we show that by integrating a reward design mechanism which adjusts the monetary return that a driver earns, a desirable equilibrium can be reached in this intrinsically large-scale non-cooperative system. The desirable equilibrium refers to a Nash equilibrium where each independent and selfish agent’s strategy is the best-response to other agents’ strategies and will produce better overall performance of the system. Mguni et al. (2019) proposed a two-layer architecture with an incentive designer as the upper layer and a potential game as the lower layer and formulated the incentive designer’s problem as an optimization problem. In contrast, the MARL problem in our context may not be able to be transformed as a potential game, complicating computation of its equilibrium.

1.4. Contributions of this paper

The major contributions of this paper are as follows:

1. To achieve a better equilibrium for the overall system, a reward design scheme is proposed, formulated as the upper level optimization in a bilevel mathematical program, to adjust drivers’ monetary return for social optima.

2. With the repositioning task as the lower level in the bilevel mathematical program, a mean field actor-critic algorithm is employed to enable the learning involving thousands of agents. A Bayesian optimization algorithm is then developed to solve the bi-level problem.

3. In the case study of taxi driver repositioning under congestion pricing, a multiclass MARL is developed to capture the intrinsic behavioral difference between yellow taxis and green taxis.

The remainder of the paper is organized as follows. Section (2) introduces the single-agent actor-critic algorithm, which is a stepping stone for MARL. Section (3) presents the mean field multi-agent reinforcement learning algorithm. Section (4) presents a reward design mechanism and formulates a bilevel optimization problem. Section (5) presents the result and validates the effectiveness of the proposed reward design. Section (6) concludes.

2. Single agent reinforcement learning

As a stepping stone, we first introduce the single agent reinforcement learning where only one agent interacts with the environment.

2.1. Problem definition

A Markov decision process (MDP) (Puterman, 1994) is typically specified by a tuple (), where S denotes the state space, A stands for the allowable actions, collects rewards, [0, 1] denotes a state transition probability from one state to another, and [0, 1] is a discount factor. A general MDP proceeds simply as follows. Starting from the initial state, the agent specifies an action whenever the agent is in a state . The agent then transits into a new state with probability ) and observes an immediate reward ) by obeying the dynamics of the environment. Then the process repeats until a terminal state is reached. A policy [0, 1] simply maps from state to the probability of taking action in state s, i.e., ). The goal of solving an MDP is to derive an optimal policy so that the agent can maximize her long term expected reward by following the policy. In reinforcement learning problems, the transition probability matrix P is commonly unknown, and the agent learns about P from its interaction with the environment.

Denote ) as the state value, which is the expected cumulative reward that an agent can earn by starting from state s and following a policy can be recursively given as (Sutton and Barto, 1998) ) + )]. Denote ) as the state-action value, which is the expected cumulative reward that an agent can earn by starting from state s, taking action a, and following a policy is related with through ) + )].

The optimal value V can then be written as . The Bellman optimality equation is given as (Sutton and Barto, 1998):

where the optimal state-action value is ) + )].Our task is then to derive an optimal policy (i.e., to solve the MDP) with which the agent can optimize its expected cumulative reward.

To demonstrate how to formulate an MDP under the context of e-hailing driver reposition, we will use examples on a 2-by-2 grid world throughout the paper every time when models are introduced.

Example 2.1. (Single-Agent 2 2). The single-agent driver reposition is presented in Figure (1). We adopt a grid world setup where the index of each grid (denoted as l) is shown at the upper left corner. The taxi icon denotes the driver, and the person icon is the passenger request with the corresponding fare shown above. The time beneath the driver and the passenger request records the current time of the driver and the appearance time of the passenger request, respectively. The dashed line with arrow shows the origin and destination of the passenger request. In addition to the emergence of passenger requests (i.e., orders), the stochastic environment also has an order dispatching component. Considering that the scope of this study is driver repositioning instead of order dispatching, we assume passenger requests are randomly assigned to available drivers within the same grid.

Figure 1: An illustrative example (Single agent)

Note that there are a few studies solving vehicle routing problems based on a physical roadway network (Liu et al., 2013; Yu et al., 2019). Considering that the research scope of this paper is where to reposition drivers, we adopt an abstract network (i.e., a grid world setup). In other words, this paper is more focused on the direction along which a driver positions herself after dropping of previous passengers and not on the specific route the driver takes. Using a real network representation for driver-respositioning will be left for future research.

S. The state of the driver consists of two components, namely, the grid index l and current time t, i.e., s = (l, t). For instance, the current state of the driver is s = (#3, 0) in this example.

A. The allowable action of the driver is either moving into one of the neighboring grids or staying within the current grid. To be concise, we use the index of grid where the driver chooses to enter as the action. Suppose the driver decides to go rightward in the example, then we can denote a = #4. We further assume it takes the driver one time step to enter grid #4. In other words, the current time of the driver is t = 1 when the driver arrives in grid #4.

Note. At each time step, a taxi driver can only pick one of the surrounding grids or the current grid as her preferred spot to find a passenger. In other words, taxi drivers are only allowed local search at each time step. Although it seems to be restrictive, it is actually consistent with what real drivers do, and the reason is as follows. From the perspective of a real driver, she can pick a faraway grid as her destination. However, she can not jump directly from her current grid to that grid. Instead, the driver needs to specify a route (i.e., a sequence of grids) and repositions herself one grid each time along the chosen route. If the driver meets a passenger before reaching her chosen destination, she has to take the passenger because she is not allowed to refuse a fare by law. From the perspective of an agent who is conducting local search, although she can not choose a faraway grid, she may end up searching in a faraway grid by repositioning herself one grid at a time. Actually, this definition of action is widely used in driver repositioning problems in the literature (Rong et al., 2016; Lin et al., 2018; Shou et al., 2020).

P. Considering the driver arrives in grid #4 at time t = 1, and at the same time a passenger request appears in grid #4 with 80% probability. If this driver is matched to the passenger and picks up the passenger, the driver will transit to the passenger’s destination, which is grid #2. Denote the transition time from grid #4 to grid #2 as ∆. We can define the new state 1 + ∆). Then the transition probability from the state s at time 0 to the state at time 1 + ∆is 80%, mathematically, driver ends up in state 1). The transition probably becomes

Note. In the temporal component of state , we assume searching time is 1, i.e., the time that one idle driver spent on cruising from her current grid to one of neighbor grids, and use ∆to denote the time that one on-duty driver spent on transporting the passenger to the destination. Actually, the travel time from passenger’s origin to the destination is extracted from some historical data. In addition, the length of each time slice is 1 in this example.

R. If we take the fare of the fulfilled passenger request as the reward, Based on the received reward at this step and the future cumulative reward, the driver chooses an action in the new state , and the state transition process repeats until a terminal state (i.e., t = T where T is a predefined ending time, say, the end of the driver’s work time) is reached.

2.2. Actor-Critic method

To solve optimal policies, there are two types of methods, namely, value based or critic-only method and policy based or actor-only method. Value based and policy based methods are commonly used terminologies, but from now on we will use critic-only and actor-only methods for the purpose of introducing the actor-critic method.

Critic-only methods aim to output the optimal policy through optimizing the state-action Q(s, a) or the state value V (s). Actor-only methods directly output an optimal policy without resorting to stored value functions Q(s, a) or V (s) as an intermediary. Both methods have pros and cons. Critic-only methods enjoy a low variance in the estimate of the state-action value but may lack guarantees on the optimality or near-optimality of the resulting policy if an optimal policy cannot be easily solved from value functions. Actor-only methods work well on continuous and large action spaces but may suffer from high fluctuation in policies (Konda and Tsitsiklis, 2003; Grondman et al., 2012). To overcome the shortcomings of these methods, actor-critic methods are developed to combine strengths of both methods (Konda and Tsitsiklis, 2003).

Figure 2: Actor-critic algorithm

Figure (2) presents the architecture of the actor-critic algorithm. One agent, who has an actor and a critic, interacts with the environment. The agent observes its state s from the environment and inputs s to the actor that outputs the policy, i.e., a probability distribution over all possible actions. The agent samples an action a from the probability distribution and takes action a in the environment. Then the agent observes a state transition and receives a reward r from the environment. Based on the one-step transition as well as action a and reward r, the agent updates its critic. With the updated Q-value ), the agent updates its actor using policy gradient. Now we detail both the critic and the actor, respectively.

Critic. The critic takes as input state s and action a and outputs Q-value Q(s, a). Q-learning is the most commonly used algorithm to update the Q value based on the state transition with reward ) and updates the Q-value by

where is the learning rate and 0 1. If reduces over time properly, the Q-learning update converges (Sutton and Barto, 1998). Equation (1), however, is only applicable to a finite and discrete state and action space. In other words, one needs to maintain a Q table with all possible combinations of s and a, which is not tractable for a continuous and large state and action space. Therefore we need functional approximation to the original Q-value. Deep neural network, i.e., deep Q network (DQN), is one of the most popular value approximator (Mnih et al., 2015). Denote a deep neural network parameterized by as ), to approximate Q(s, a). DQN updates its parameter by minimizing the loss

This problem can be solved by the gradient descent method, whose gradient is straightforward to compute as follows: ))], where the gradient is not taken with respect to the target. is a target network which is a copy of . The target network is not updated by the gradient descent and is copied from after a certain number of steps to ensure training stability (Mnih et al., 2015).

Actor. The actor takes as input state s and outputs a probability distribution on all allowable actions in this state. Similar to how we use a value network to approximate Q-value, we can also use a deep neural network, i.e., policy network, to approximate the policy . Denote the policy network parameterized by as ). The goal of the actor is to maximize its expected cumulative reward, denoted as , where is the reward the actor receives at time t. To solve the optimal policy of the actor requires us to know its gradient. The gradient of the policy is complicated to solve and is given as (Sutton et al., 1999)

where is some baseline (e.g., , i.e., the value function following the policy ), and ) is called the advantage of a taken action a, a measure of the goodness of an action. If it is greater than zero, it means this taken action is generally good, otherwise it may be bad. Naturally, the underlying rationale in computing the policy gradient defined in Equation (3) is to update the policy distribution to concentrate on potentially good action(s). When the chosen action a leads to a positive advantage, i.e., 0, the policy is updated towards the direction of favoring action a. When the advantage is negative for action a, the policy is updated in the direction of against action a.

To summarize, in addition to the policy network , the actor-critic algorithm also maintains a value network so that the calculation of the gradient of the policy in Equation (3) directly uses the Qfunction approximator , to ensure stability of policy update. The actor-critic algorithm simultaneously updates critic (by minimizing the loss given in Equation (2)) and the actor (by the gradient given in Equation (3)) as more samples are fed in.

3. Multi-agent reinforcement learning

To tackle a real-world problem with multiple agents, the aforementioned single agent reinforcement learning falls short of capturing the coupling effects or the competition among multiple agents. In this section, we introduce a multi-agent reinforcement learning approach to model the multi-driver repositioning task. Drivers are assumed to be intelligent and perfectly rational, meaning they select the best repositioning strategies to maximize their cumulative rewards. Bounded rationality (Di et al., 2013; Di and Liu, 2016) is left for future research.

3.1. Problem definition

The multi-agent problem is modeled as a partially observable Markov decision process (POMDP) (Littman, 1994), defined by a tuple (), where N is the number of agents and S is the environment state space. Environment state is not fully observable. Instead, agent i draws a private observation which is correlated with is the observation space of agent i, yielding a joint observation space is the action space of agent , yielding a joint action space , [0, 1] is the state transition probability, is the reward function for agent i, and is the discount factor. Agent uses a policy [0, 1] to choose actions after drawing observation. After all agents taking actions, the joint action a triggers a state transition based on the state transition probability ). Agent i draws a private observation corresponding to and receives a reward ). Agent i aims to maximize its discounted expected cumulative reward by deriving an optimal policy which is the best response to other agents’ policies. This process repeats until agents reach their own terminal state.

Due to the existence of other agents, the Q-value function for agent i , i.e., , is now dependent on the environment state and the joint action of all agents, i.e,

Similarly, the value function of agent i, i.e., ), is dependent on the environment state s. Subsequently, we will demonstrate how to formulate the multi-driver repositioning problem in MARL, building on the single-agent example developed in the previous section.

Example 3.1. (Multi-Agent 2 2). The multi-agent driver reposition is presented in Figure (3). Same as before, a grid world setup is adopted. Now we have two drivers with their indices shown above the taxi icon and two passenger requests with fare presented above the passenger icon. The time beneath drivers and passenger requests records the current time of the driver and the appearance time of the passenger request, respectively. The dashed line with arrow shows the origin and destination of the passenger request.

Figure 3: An illustrative example (Multi agent)

N. There are N = 2 drivers moving around in the environment. We denote drivers by {1, 2}. S. The environmental state consists state information of both drivers. For driver i, her state is composed of her current location (i.e., the grid index based on a grid world setup) and current time t, i.e., ). The joint state of both drivers, i.e., the environment state , at time t is denoted as ). In this example, at current time t = 0, s = ((#3, 0), (#2, 0)). A. For driver i, her action can be any of the five possible actions, i.e., moving into any of her four neighboring grids or staying in the current grid. The same as before, we use the index of grid where the driver chooses to enter as the action. The joint action of both drivers is ). Assuming driver 1 decides to go rightward (i.e, to enter grid #4) and driver 2 chooses to go leftward (i.e., to enter grid #1), the joint action is a = (#4, #1). We further assume it then takes driver 1 one time step to enter grid #4 and driver 2 one time step to enter grid #1. In other words, after driver 1 arrives in grid #4 and driver 2 arrives in grid #1, the clock ticks one step forward and the current time is now t = 1. P. The joint action a triggers a state transition with some probability according to the state transition function, i.e., ). Driver 1 gets matched to the passenger request in grid #4 at t = 1, loads up the passenger, and drives to the destination of the passenger. Driver 1 then arrives in a new state 1+∆) where ∆is the transition time from grid #4 to grid #2. Similarly, driver 2 gets matched to the passenger request in grid #1 at t = 1, loads up the passenger, and drives to the destination of the passenger. Driver 2 then arrives in a new state 1 + ∆) where ∆is the transition time from grid #1 to grid #3. 1 + ∆(#3, 1 + ∆)). In this simple example, R. Along with the state transition, each driver receives a reward, i.e., . The reward function for each agent is simply the fare of the fulfilled passenger request, i.e.,

This example will be revisited later in this section to illustrate the algorithm.

3.2. Techniques to simplify the Q-value function

The dependency of the Q-value of an agent i on other agents’ states and actions, as shown in Equation (4), however, introduces prohibitively high difficulties in learning the optimal Q-value. The main reasons are two-fold. First, although each agent draws its private observation from the environment state s, s cannot be observed by any agent, i.e., s is unknown. Second, one agent does not observe the actual actions taken by all agents, i.e., a is unknown.

To make the Q-value of an agent in the multi-agent system tractable, the dependency of the Q-value on the environment state s and joint action a needs to be simplified. A very natural approach, inspired by the single-agent setting, is independent learning where each agent i only has information about its own observation and action but has no information about other agents. Thus, the Q-value function of agent i is reduced to

In other words, private observations and joint action of other agents are not used by agent i. After all agents choosing actions, the joint action a triggers a state transition. Agent i then draws a new private observation and receives a reward .

The independent learning algorithm, although is intuitive and simple, can be unstable and hard to reach convergence since the environment is no longer Markovian and stationary due to the appearance of other agents (Matignon et al., 2012).

3.2.1. Centralized training and decentralized execution

To make the training more stable and ensure convergence, we employ the centralized training and decentralized execution paradigm (Foerster et al., 2016; Lowe et al., 2017; Lin et al., 2018; Li et al., 2019). In this paradigm, to train the policy of agents, we assume these agents know the global information such as the joint observation and/or joint action. In other words, in addition to observation and action , agent i also has access to the observations and/or actions of other agents during training. While in the execution phase, decentralized testing or execution is implemented, meaning they would not have access to the global information anymore. To realize this paradigm, the aforementioned actor-critic algorithm naturally fits in, because we can apply global information to the critic, i.e., joint observation and joint action in , in the training phase, while feeding local information to the actor, i.e., in , in the execution phase. Decentralized execution becomes possible because only actors are used in execution.

Then the Q-value function of agent i becomes

where ) and ) denote the joint observation and joint action of all agents except agent i, respectively.

In the context of e-hailing driver repositioning, considering the definition of the action, which is the index of the grid where the driver chooses to enter, the Q-value function of driver i, i.e., , does not depend on the joint observation of other drivers, i.e., . Explanations are as follows. When driver i chooses action based on its observation , driver i then enters grid l. At the same time, other drivers also enter some grid based on their joint action regardless of their joint observation . The Q-value function of driver i only depends on the current distribution of drivers, which has been determined by their joint action . Therefore it is the joint action which affects . The Q-value function is thus further reduced to

3.2.2. Mean field approximation

The centralized training and decentralized execution paradigm, however, can easily become intractable due to the exponential increase in the joint action space with the increasing number of agents. For example, the size of the joint action space easily blows up for N agents with |A| possible actions (i.e., possibilities). To simplify the interaction among agents, we adopt the mean field approximation. The basic idea of the mean field approximation is to simplify the complicated interaction between one agent and all other agents by a pairwise interaction between the agent and a virtual mean agent which is formed by the neighboring agents of the agent. Thus, the complexity of interactions among a large number of agents is substantially eased by reducing the dimension in the input of the Q-value function. Therefore the large scale MARL with hundreds of or even thousands of agents becomes solvable.

To be more precise, we provide brief explanations that lead to the applicability of the mean field approximation in MARL as described in Yang et al. (2018). First, from the perspective of agent i, the multi-agent effect or competition effect mainly comes from its neighboring agents, i.e.,

cumbersome to compute ) for the neighboring agents of agent i if this number is large. Define a mean action ¯, which is a proxy of the actions taken by the neighboring agents. Accordingly, ) can be further simplified to ) when Taylor expansion is applied, which is

Interested readers can refer to Yang et al. (2018) for a detailed explanation and proof.

Example 3.2. (Multi-Agent 2 2). The mean action ¯of the neighboring drivers of driver i is defined as the demand to supply ratio in the grid where driver i is entering. Assuming both drivers choose action #4, i.e., 2 example shown in Figure (3), there are 2 drivers and 1 passenger request in grid #4 after both drivers enter grid #4. The mean action for both drivers is thus ¯5. This definition of mean action captures the level of competition in a grid. A larger mean action ¯denotes a higher demand to supply ratio and lower level of competition, and vice versa.

3.3. Mean field actor-critic algorithm

As previously mentioned, each agent maintains a policy network (i.e., the actor) and a Q-value network (i.e., the critic). For a real-world multi-agent task, however, there are typically hundreds of or even thousands of agents, resulting in a prohibitively large number of deep neural networks to maintain, which is not computationally tractable. To address this issue, we assume drivers are homogeneous and they share the same state space, action space, and reward function. We believe this assumption is reasonable when agents are anonymous and one’s influence to the entire system performance is negligible. While heterogeneity across agents does exist when agents have different reward functions (Shou et al., 2020), this study aims to reveal some insights of multi-driver repositioning by learning a policy for the generic taxi population. A more personalized reposition scheme that captures the heterogeneity across the driver population is left for future research.

The multi-agent task can then be largely simplified by sharing both the actor and the critic among drivers, i.e., and .

After adopting the mean field approximation, the loss function for the critic, which was presented in Equation (2) for the single-agent setting, now becomes

The only difference is the incorporation of the mean action ¯a into the Q-value function approximation. Similarly, the gradient of the policy, which was presented in Equation (3) for single-agent setting, is now

where the baseline )].

The mean field actor-critic algorithm is presented in Algorithm 1. With the input of some parameters such as the exploration parameter , target network update period , and learning rate , and a preset number of samples K, a neural network for the critic and a neural network for the actor are initialized. Target networks and are copied from and , respectively. An experience replay buffer B is also initialized to store experience tuples of all agents. Now we let agents repetitively interact with the environment and update the actor and the critic as follows. Agents are randomly placed in the environment initially. All agents choose actions according to the -greedy method. That is, agents have probability of choosing actions randomly and 1 probability of choosing actions according to the target policy network . Each agent i then executes the chosen action in the environment, observes a reward and a mean action ¯, and draws a new observation . This experience tuple is stored in the replay buffer B. All idle agents then choose actions again and repeat the aforementioned procedure until reaching the terminal state with t = T. After collecting experience tuples of all agents, K sample experience tuples are randomly sampled from the replay buffer B and used to train neural networks and by Equations (9) and (10).

Example 3.3. (Multi-Agent 2 2). Now we apply the mean field actor-critic algorithm to the multi-driver example shown in Figure (3). Figure (4) presents the architecture of the mean field actor-critic algorithm particularly for the context of multi-driver repositioning. Homogeneous agents, who share a common actor and a common critic, interact with the environment. The shared actor is a multilayer perceptron with 32 neurons in its hidden layer and takes as input observation and outputs a five dimensional vector denoting the probability distribution of taking five actions. Similarly, the shared critic takes as input () and outputs the Q-value. During training, agent draws its private observation from the environment and inputs to the actor which outputs a probability distribution over actions. Agent i samples an action from the probability distribution and takes the sampled action in the environment. Joint action of all agents a triggers a state transition in the environment. Agent i then observes the mean action ¯, draws a new observation , and receives a reward from the environment. The agent then uses () to update the shared critic by minimizing the loss presented in Equation (9). Based on the advantage calculated from the critic, agent i updates the shared actor using the gradient presented in Equation (10).

The aforementioned training process is centralized because the mean action used in the critic is actually some global information. During execution, agents only need to use the updated actor, which only takes as input the local information, i.e., the private observation. In other words, the shared critic is not used in execution.

Figure 4: Mean field actor-critic algorithm for multi-driver repositioning

The derived Q values corresponding to four scenarios of interest are presented in Figure (5). In Figure (5a), when both drivers choose action #4, the observed mean action for both of them is the ratio of demand to supply, i.e., ¯1 passenger request2 drivers = 0.5. The resulting expected value for both drivers is $3.5, i.e., 5, because both of them have an equal probability 1 driver 2 drivers = 50% to take the passenger request with $7. Similarly, the observed mean actions and resulting Q values can be explained in other scenarios. The Q-value bimatrix is presented in Table (2) where driver 1 is the column player and driver 2 is the row player. When driver 1 chooses action #1 and driver 2 chooses action #1, Q-values for them are 1.5 and 1.5, respectively, according to Figure (5d). Similarly, Q-values for both drivers can be read from Figure (5) for other scenarios. Based on the bimatrix, driver 1 always chooses action #4 because action #4 is strictly better than action #1 regardless of the observed

mean action, and driver 2 always chooses action #4 for the same reason. Thus, the optimal policy for both drivers is to enter grid #4 with an expected payoff $3.5.

Figure 5: Derived Q values for four scenarios of interest

Table 2: Q-value bimatrix for drivers

4. Reward design for multi-agent reinforcement learning

Due to selfishness of each agent, performing MARL under a given reward function in an MAS is very likely to yield an undesirable equilibrium from the perspective of the system. In other words, this equilibrium may not be an optimum with respect to some system objectives. To guide an MAS towards a desirable equilibrium, system planners could resort to reward design mechanisms by modifying the reward function of agents. In this paper, we introduce a new parameter into agents’ reward, where A is the feasible domain of . Parameter can be either a scalar or a vector. The goal of system planners is to maximize some system performance measure dependent of , denoted as ). The system planner first chooses a value of and inputs it to the MAS. With the given that influences the reward design, the developed mean field actor-critic algorithm is employed to derive an optimal policy , which is dependent on , for all agents in the system. Some performance measure f, which is calculated by executing the derived optimal policy for all agents, is then fed into the reward design. The performance measure f is dependent on through the dependency of on . In other words, ).

In summary, the reward design problem is to select a parameter to maximize the performance measure ) on the upper level, while the distributed agents aim to maximize their individual cumulative rewards on the lower level once is given as part of their reward. This process can be formulated as a bi-level optimization problem, mathematically,

The interaction between upper and lower levels through exchange of variables is shown in Figure (6).

Figure 6: Architecture of the reward design

The optimization problem presented in Equation (11), however, is not straightforward to solve due to the unknown complex structure of f over the parameter . Traditional gradient based methods such as gradient descent are thus no longer applicable. In addition, the black-box function f is expensive to evaluate, meaning that evaluating one value of can take a long period of time. Therefore random or grid search based optimization techniques (Bergstra and Bengio, 2012) are less effective because they typically require a large number of evaluations.

To find optima with a limited number of evaluations, we adopt Bayesian optimization (Mockus, 1989) (hereafter we call it BO). Due to the unknown structure of the objective function f, BO places a prior on the objective function, for example a Gaussian process (GP). With some observed data (i.e., several evaluations of ’s), BO derives a posterior, based on which the next to be evaluated can then be determined.

We now detail the procedure of BO in our context. First, BO places a prior statistical model on the objective function f, for example a GP in this study. A GP is a stochastic process satisfing the following conditions: 1) For any ) is a random variable; 2) For a collection of any finite number of ’s (i.e., and ), the joint distribution is a multivariate Gaussian. Mathematically,

where is a m by 1 mean vector and K is a m by m covariance matrix with each entry denoting the covariance between ) and ). A square exponential kernel ) is used to calculate the covariance and is defined as

where l is the length parameter and the variance parameter. The square exponential kernel yields a large covariance when and are close (i.e., (is small) and a small covariance when and are far away (i.e., (is large), which is as expected because for a typical function ) and ) are similar when and are similar, meaning a large covariance. The prior used in this study is a GP with a zero mean (i.e., ) and a covariance matrix K calculated by the aforementioned square exponential kernel.

Second, with a given prior and some observed data, now our task is to derive a posterior probability distribution of at some . We assume is the observed data at some . By the definition of a GP, the joint distribution () is a multivariate Gaussian. Mathematically, the prior probability distribution of () with zero mean is

where is n-dimensional vector with all entries as zeros, -dimensional vector with all entries as zeros, K a n by n covariance matrix with each entry ) + (is the noise in the observed data and is the Kronecker delta), a n by m matrix with each entry ), and a m by m matrix with each entry ). According to the multivariate Gaussian theorem (Murphy, 2012), the probability distribution of conditioned on the observed f is

where and .

Third, with the derived posterior probability distribution of conditioned on the seen data f, we now aim to define an acquisition function, based on which the next to be evaluated can be determined. The acquisition function used in this study is the upper confidence bound (UCB) (Srinivas et al., 2010), i.e.,

where ) is the posterior mean at ) the posterior standard deviation at (derived from ), d is the dimension of the search space of , and is a parameter. The next to be evaluated is simply

Remark. With the acquisition function defined as UCB in Equation (13), it has been shown that BO is no regret with high probability and has a lower-bound on the convergence rate (Brochu et al., 2010). Interested readers are referred to Srinivas et al. (2010); Brochu et al. (2010); Berkenkamp et al. (2019) for more details on global optimality and convergence properties of BO.

The pseudo-code of BO used in this study is listed in Algorithm (2).

To be more concrete, now we use the multi-agent 2 2 example presented in Figure (3) to illustrate the potential of the reward design.

Example 4.1. (Multi-Agent 22). We take the order response rate (ORR), i.e. the ratio of the number of fulfilled passenger requests to the total number of passenger requests, as the performance measure of the system. The direct application of mean field actor-critic algorithm yields a 50% ORR, which is obviously not the desired equilibrium from the perspective of the system. Noticing that the platform typically charges a certain proportion of the fare paid by the passenger as the so-called platform service charge, which is reportedly to be dependent on various factors such as distance, duration, and city. We aim to improve the performance of the system by devising a proper reward design.

In Figure (3), trip fares are shown right above each passenger request, the reached equilibrium for both drivers without any charge are to enter grid #4 and get an expected reward as $3.5, leading to an oversupply (i.e., a low demand to supply ratio) in grid #4 and an undersupply (i.e., a high demand to supply ratio) in grid #1, which is not beneficial for the system. A reward design which deducts $1.1 from the passenger request paid to the driver in grid #4 will effectively attract one driver to leave grid #4 for grid #1 to get more monetary return, resulting in a 100% order response rate.

5. Case Study

To test the performance of the proposed bilevel optimization model, we use two datasets including a synthetic dataset and one real-world large-scale taxi dataset downloadable from official website of New York City (NYC) Taxi & Limousine Commission (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

5.1. E-hailing driver repositioning under service charge

We first test the bilevel optimization model on a 2-by-2 grid world example, where an analytical solution of the reward design can be derived. Then we compare both values to justify the correctness of our BO algorithm.

5.1.1. Lower level MARL setup

The dataset consists of seven deterministic passenger requests in a 2-by-2 grid world setup, as shown in Figure (7). At t = 0, there are fifty idle drivers in grid #2 and fifty in grid #3. At time t = 1, fifty passenger requests with fare $10 deterministically appear in grid #4 and twenty passenger requests with fare $4.9 appear in grid #1. The observation space for driver i consists of the grid index and current time, i.e., ), and the action space is to enter one of neighboring grids or to stay at the current grid. The reward that a driver earns is her monetary return. That is,

where is the service charge in percentage if the driver takes the passenger request in grid is defined later in Equation (15).

Figure 7: Layout

5.1.2. Upper level objective function

Without any reward design, the optimal policy for all drivers is to enter grid #4, because the expected return for entering grid #4 is at least $5 (i.e., 100 drivers compete for 50 orders with $10 each) while that for entering grid #1 is at most $4.9 (i.e., the highest fare of an order in #1 is $4.9). The resulting ORR is 50/70 = 71.43%, which is not desirable from the perspective of the platform because it is expected to achieve a 100% ORR in this setting. Actually, the platform can achieve a better ORR by adjusting the reward that drivers earn through the use of a platform service charge (aka the commission fee). The platform service charge used in this study is denoted as a fare percentage. For instance, a 10% service charge means the platform takes 10% of the fare paid by the passenger to the driver as its revenue. In other words, the driver gets less money under a higher service charge while the payment from the passenger remains the same. To achieve a better ORR, the platform needs to place a high service charge in grids which are oversupplied. Drivers oversupply grid l because on average they can earn more by entering grid l, compared with entering other grids. A high service charge placed in grid l can effectively reduce monetary returns for drivers entering l and make grid l less attractive to drivers. Thus, some drivers choose other grids and take other passenger requests, resulting in an increase in ORR.

Before introducing a functional form of the platform service charge, we formly provide two notations, namely demand to supply ratio (DS) and service charge (SC). We then construct an effective form of SC as a function of DS. A small DS indicates that the grid is oversupplied, and a large DS means the grid is undersupplied. The goal of the platform is to drive DS close to 1, meaning a balance between demand and supply. In a grid l with below 1, is expected to be large to discourage drivers from oversupplying the grid; while in a grid l with above 1, is supposed to be small. To illustrate such a relation, we use a piecewise linear function with a parameter as SC in grid l, i.e.,

where a relatively high SC is charged to all drivers in the grid with a low DS and no SC is charged to drivers in the grid with DS above 1. Parameter is the SC at

With an adjustable parameter , the platform aims to maximize some objective f consisting of two components, namely ORR and overall service charge (OSC), as

OSC is defined as

where the denominator is the total amount of fare of all serviced orders and the numerator is the total amount of service charge.

The rationale of choosing these two components is as follows. First, from the perspective of the platform, it aims to maximize ORR, because a larger ORR typically means a higher revenue and a higher customer satisfaction. To maximize ORR, the platform simply chooses the largest possible value of . The reason is that with the largest possible , the platform penalizes drivers heavily for oversupplying a grid, and therefore drivers will be directed to other grids. This strategy, i.e., choosing the largest , however, is a big threat for the long-term growth of the platform because drivers are very likely to quit under such a high service charge. Thus, the platform also needs to maintain a relatively small OSC. Considering the competition between ORR and OSC, we use a weighted average of ORR and (1 OSC) as the objective of the platform, i.e., where [0, 1] is the weight for ORR. In this case study, we set , meaning that the platform cares more about ORR.

5.1.3. Results

We then use two methods, namely BO and an analytical method, to determine the optimal value of .

1. BO. We first employ BO with the objective function given in Equation (16). When we apply the mean field actor-critic algorithm to the lower level MAS, the critic is parameterized by a multilayer perceptron (MLP) with three hidden layers (64, 32, 16), and the actor is parameterized by an MLP with three hidden layers (32, 16, 8). ReLU is used as the activation function between hidden layers, and a softmax function is used to transform the final output from the actor to be a probability distribution. Learning rates for both the actor and the critic are 10. Discount factor under the assumption that drivers typically do not depreciate their monetary income on the same

day. Exploration parameter is initially set as 5 and linearly decreases over iterations until it reaches a minimum value of 0.01. The target network update period is parameters of target networks are updated every 10 episodes.

For a bilevel optimization problem, first we need to check the convergence of the lower level. As an example to validate the convergence, ORR, (1 - OSC), and the average reward of all drivers (i.e., Reward in Figure (8)) versus the index of episodes are presented in Figure (8) with 5 and 0. In both scenarios, ORR and reward increase very fast and (1 - OSC) steadily decreases during the first 500 episodes where agents explore the environment and learn the optimal policy. ORR, (1 - OSC), and reward gradually converge after 1,500 episodes when agents mainly exploit the knowledge they have gained through their previous explorations.

Figure 8: Convergence of the lower level MARL

Figure 9: Posterior probability distribution and acquisition function at iteration 0

With the validated convergence of the lower level MARL, we now run BO to solve the bilevel optimization problem. As for the initial value of , in addition to 0 and 1 (i.e., both ends of the interval of interest), we randomly sample three other values according to a uniform distribution. In total we evaluate five ’s as the starting point of BO. Figure (9a) plots the mean and standard deviation of the posterior probability distribution of f conditioned on these five evaluated points. As one can see, the standard deviation is small around the points that have been evaluated and is large at locations where we do not have any data. The acquisition function shown in Figure (9b) reveals that the next to be evaluated is around 0.53.

Figure 10: Convergence of BO

With the initial five evaluated points, we run BO until convergence. The convergence of BO in this study is defined as choosing 5 consecutive ’s with the distance between every two consecutive ’s below a threshold of 0.05. In other words, BO converges when it starts choosing similar ’s to evaluate. The convergence is presented in Figure (10). Note that the horizontal axis starts with iteration index 2 because it is meaningless to calculate the distance between in the first iteration and the initial five ’s. We can see that BO initially chooses quite different ’s and gradually converges after the 6iteration. In other words, BO chooses similar ’s after the 6iteration.

The resulting posterior probability distribution of f from BO is presented in Figure (11). It is noticeable that the evaluation of the objective on ’s seems noisy. In other words, the evaluated objective may be slightly different at the same . This is expected because there are multiple local optima when solving the lower level MARL. Actually, it is commonly impossible to find a global optimum using deep learning. Thus researchers usually settle for local optima (Goodfellow et al., 2016). Local optima introduce noise into the evaluation of the objective at each . Although the evaluations are noisy, the fitted curve is able to capture the mean objective f for each . The optimal is determined as 0.55, yielding an optimal mean objective f = 0.90. The optimum is 8.4% higher than the objective f = 0.83 without any reward design.

Figure 11: Posterior probability distribution of f after BO converges

2. Analytical method. Due to the simplicity of this case, we can analytically derive the optimal value of and shed some light on the effectiveness of the proposed platform service charge. Recall that the optimal policy for all drivers is to enter grid #4 when in grid #4 is 5/10 = 0.5, which is well below 1, meaning that grid #4 is oversupplied. ORR is 5/7 = 71.4%. To increase ORR, one needs to increase to penalize drivers who oversupply a grid. As gradually increases, grid #4 becomes less attractive, because the expected return one driver can earn decreases as increases. When the expected return one driver can earn is less than $4.9, some drivers will choose to enter grid #1 instead of grid #4 for a higher monetary return. Note

that to ease the analysis, we assume the number of drivers entering a grid is always an integer. Now we present how we calculate the critical value of below which there is no driver choosing to enter grid #1 while above which there is one driver attracted by grid #1. With one driver entering grid #1, there are 99 drivers entering grid #4, resulting a 50/99 = 0.51 DS ratio in grid #4. (1 , meaning that the expected return for these 99 drivers is $10 (1 50

99 (1 ). The expected return for the driver entering grid #1 is $4.9. We then have the critical condition 5(1 9, yielding 06. Similarly, we calculate the critical value of below which there are 19 drivers choosing to enter grid #1 while above which there are 20 drivers attracted by grid #1, and the critical value is 58.

Table 3: Values of interest

Values of interest are presented in Table (3). With objective is ORR+ 83. With increasing to 0.06, there is one driver attracted by grid #1, resulting in a 72.9% ORR. The OSC is calculated as follows. The DS ratio in grid #4

The objective is ORR+ 83. Similarly, with increasing to 0.58, ORR = 100%, OSC = 0.18, and the objective is 0.93. Increasing further does not improve ORR but increases OSC, resulting in a decrease in the objective. Thus, the analytically derived optimal value of is 0.58.

The derived optimal value of from BO, i.e., 0.55, agrees well with the analytically derived optimal value of , i.e., 0.58. The optimum from BO, i.e., 0.90 is quite close to its analytical counterpart, i.e., 0.93. The small discrepancy between the numerical solution and the analytical one is explained as follows. In the analytical solution, the policy for agents is deterministic and exactly twenty drivers choose grid #1 after increasing to 0.58; while in BO, the derived optimal policy for agents is stochastic, introducing variance in drivers’ actions. For example, the derived optimal policy says each driver has a 20% probability of choosing grid #1 and a 80% probability of choosing grid #4. Although the expected number of agents in grid #1 is 20 and the expected number of agents in grid #4 is 80, the probability of 21 agents choosing grid #1 is5%. This variance reduces both ORR and (1 -OSC), resulting in a lower objective from BO, compared with the objective from the analytical solution.

5.2. Multiclass taxi driver repositioning under congestion pricing

In this case study, we apply the proposed bilevel optimization model to a real-world scenario where city planners aim to mitigate traffic congestion in the central business district (CBD). As an effective way to improve traffic condition, congestion pricing has been adopted by many cities such as London and Stockholm (de Palma and Lindsey, 2011). The basic idea of congestion pricing is to impose a toll charge on all vehicles entering the CBD. Consequently, some drivers may be sensitive to the toll charge and take alternative travel modes such as subway while some drivers could bear the toll charge. To demonstrate the effectiveness of congestion pricing, we use NYC taxi and subway data due to data availability.

In the taxi market, congestion pricing affects both the demand and supply. On the demand side, the toll charge is passed to passengers for taxi drivers carrying passengers into the CBD. In other words, the fare paid by passengers is increased and thus the demand (i.e., number of passenger requests) is decreased. According to Schaller (1999), taxi demand falls by 0percent when taxi fare increases by x percent. For example, when taxi fare increases from $10 to $12.5 (i.e., increases by 25%) for a trip, the demand falls by 5.5% (i.e. 025%). On the supply side, the toll charge is paid by taxi drivers when they enter the CBD vacantly, which discourages drivers from entering the CBD without any passenger. The overall effect of congection pricing results in a reduced number of taxis in the CBD, leading to an improved traffic condition. This, however, may direct too many passengers, whose taxi requests are unfulfilled, to the public transit which is already running at full pressure during rush hours (Plitt, 2020). Thus, there exists a tradeoff between reducing the number of taxis in the CBD and maintaining a reasonable level of crowdedness in the public transit system.

Note. Considering that the goal of this case study is to demonstrate the effectiveness of the proposed bi-level optimization model on a real-world large-scale problem, we simplify the modeling of demand to a linear model in which demand falls when fare increases. We acknowledge that a more realistic econometric model such as a logit model could be adopted to model passengers’ willingness to pay and demand elasticity under congestion pricing. However, we may not be able to calibrate the econometric model due to data accessibility. In addition, the calibration process may introduce more uncertainty into our model. Therefore, a linear model is used as a demonstration in this case study, and a more realistic model to capture the coupling effect between the demand and supply is left for future research.

The objective of city planners thus consists of two components, namely, the number of taxis in the CBD and the crowdedness of the public transit system. Considering the accessibility of data (i.e., NYC taxi data and subway data), we make two assumptions: (1) The proposed congestion pricing scheme only affects the behavior of taxi drivers, as previously mentioned; (2) The subway system is used as a proxy of the public transit of the city.

To be more precise, Figure (12) presents objectives of city planners and how planners derive the best control. City planners impose a toll charge on taxis entering the CBD. Adaptive taxis learn the optimal policy by the mean-field actor-critic algorithm under the toll charge. With fewer taxis searching for passengers in the CBD, more unfulfilled passenger requests are directed to the subway system. City planners observe the number of taxis in the CBD and crowdedness in the subway and adjust the toll charge to achieve a better balance between these two objectives. This process repeats until city planners reach a satisfactory balance.

Figure 12: Objectives of city planners

5.2.1. Data preprocessing

The NYC taxi trip records are publicly available on the official website of NYC Taxi & Limousine Commission (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). We use the data for both yellow and green taxis during May 2014 before the wide adoption of ridesharing service such as Uber and Lyft and after the business of green taxis gradually stabilizes. A data sample is listed in Table (4). Each entry in Table (4) collects the order information, including pickup and dropoff time and locations and fares (including tips). In total there are around 16 million taxi trips. We first remove the weekend data because trip patterns over weekends are obviously different from that on weekdays. We then restrict the time interval of interest as the evening peak, i.e., 4 PM to 8 PM. There are 2 million taxi trips in the weekday data after preprocessing.

Table 4: Taxi data sample

Figure 13: Spatial discretization of the area

Figure (13) presents the spatial discretization of the area of interest. There are in total 337 grids with a side length of 1 km covering the area from Manhattan to two airports located in Queens. Taxi orders outside grids consist of less than 2% of the overall taxi orders and are not considered. Each longitude and latitude coordinate is transformed into a grid index. As for the temporal discretization, the evening peak is divided into eighty 3-minute time intervals and the pickup time and dropoff time are transformed into time interval index. Grids shown as bold red squares cover the CBD of NYC, which is the area between 19th street and 59th street in Manhattan. The proposed congestion pricing is applied to vehicles cross the red square into the CBD.

Figure 14: Spatial distribution of taxi orders during evening peak

Figure (14) presents the spatial distribution of taxi orders (pickup) during evening peak. It can be

seen that the majority of taxi orders emerge in Manhattan, especially in the CBD. There are two local hotspots near two airports.

NYC subway turnstile data is also publicly accessible (http://web.mta.info/developers/turnstile.html). A sample of the turnstile data is listed in Table (5). These two rows show that the reading of entries for turnstile ID (A002, R051, 02-00-00) is 4,593,637 at 4 PM and 4,594,523 at 8 PM on 05/01/2014. Taking the difference between two readings yields the net entries at this turnstile during the 4-hour time interval, i.e., 4,594,523 - 4,593,637 = 886. Similarly, we can calculate net entries and net exits for each turnstile. Net entries and net exits of a grid are then calculated by summing up the net entries and net exits of all turnstiles in that grid, respectively.

Table 5: Turnstile data sample

5.2.2. Lower level MARL setup

To be concrete, we now provide the setup of the lower level MARL.

• Observation. The observation space for driver i consists of the grid index and current time, i.e., ). One-hot encoding is used for both grid index and time. Yellow and green taxis share the same observation space.

• Action. The action space for driver i is to enter one of neighboring grids or to stay at the current grid. Due to a grid world setup with square grids, the action space is thus 5-dimensional (using one-hot encoding). For example, driver i currently at grid 4) chooses action (i.e., (1, 0) in the 2-dimensional grid world setup). The driver will arrive in grid (6, 4) to search for passengers. In the boundary grids, we still allow drivers to choose any action from the 5-dimensional action space. If a driver chooses an action that will lead her out of the study area after taking the action, the driver receives a large penalty (i.e., a large negative reward) and is forced to stay in the current grid. Thus drivers will not choose an action that will lead them out of study area after training.

• Reward. The reward of each state transition for driver i is her monetary return, i.e.,

Multiclass MARL. The NYC taxi market contains two types of taxicabs, namely yellow taxis and green taxis. They are different because yellow taxis can go and pick up passengers anywhere while green taxis are not allowed to pick up passengers in Manhattan below East 96th Street and West 110th Street and at two airports, namely LaGuardia airport (LGA) and John F. Kennedy airport (JFK). Therefore we need to model them differently in the lower level MARL. To incorporate these two classes of agents into MARL, we create two actors and critics. All the yellow taxis share one actor (i.e., a policy network) and one critic (i.e., value network), and green taxis share the other actor and the critic. In the actor-critic algorithm demonstrated in Fig 4, both yellow and green agents interact with the same environment. They have the same observation space and action space. In other words, for both yellow and green taxis, its observation consists of the grid index and current time, and its action is to enter one of neighboring grids or to stay in the current grid. In addition, both of them aim to maximize their cumulative monetary return.

The key difference is, green taxis can drop off and search for passengers in those restricted areas (i.e., Manhattan below East 96th Street and West 110th Street and two airports), they can not pick up passengers there. From the modeling perspective, the environment will not assign orders to green taxis in restricted areas. This restriction discourages green taxis from searching for passengers or taking passengers to the restricted areas. Accordingly, the policy for green taxis is expected to be different from that of yellow taxis. Yellow taxis thus only compete among themselves in the restricted areas, while outside restricted areas, yellow and green taxis not only compete within the same type but also compete with the other taxi type.

5.2.3. Upper level objective function of city planners

As previously mentioned, the objective function of city planners consists of two components, namely, number of vehicles in the CBD and the crowdedness of the public transit. Now we formally define these two components based on NYC taxi data and subway turnstile data.

The first component is defined as the percentage of taxis in the CBD, i.e, the ratio of the number of taxis in the CBD to the total number of taxis. For each time step, we calculate one value of the percentage. We then take the average of the percentages across all time steps as the overall percentage of taxis in the CBD. Hereafter we call this PTC (percentage of taxis in the CBD). PTC decreases with toll charge because fewer vacant taxis enter the CBD with a higher toll charge.

where num taxisis the number of taxis in grid l at time step t and total num taxis is the total number of taxis.

The crowdedness of the subway system in each grid is further decomposed into two parts, namely, the entry crowdedness which is related with the net entries into the subway system within the grid, and the exit crowdedness which is related with the net exits from the subway system within the grid. After imposing a toll charge on taxis entering the CBD, the crowdedness of the subway system increases due to the unserviced taxi orders. Here we assume that travel demand stemming from the unserviced taxi orders goes to the subway system. For a grid l, we count the number of passengers of unserviced taxi orders with its origin inside the grid and call this quantity the additional entry into the subway system. We then take the ratio of the additional entry to the net entries within the grid as the increase in the entry crowdedness in grid l, denoted as where ICS stands for “an increase in crowdedness of the subway”. Mathematically,

where originand num passengersdenote the origin and number of passengers of the order, respectively. Similarly, we can calculate the increase in the exit crowdedness in grid l as

Taking the average of the increase in the entry crowdedness and the increase in the exit crowdedness yields the increase in the crowdedness of the grid, namely,

Among the overall 337 grids, we focus on the top m grids in terms of crowdedness. Thus the overall ICS is calculated as

where m = 20 in this case study. ICS increases with the toll charge because more passengers are directed to the subway system with a higher toll charge.

From the perspective of city planners, both PTC and ICS are expected to be small. These two components, however, are competing against each other. With a small toll charge, ICS is small but PTC is large; while with a large toll charge, PTC becomes smaller but ICS gets larger. Therefore city planners need to maintain some balance between these two components. Here we use a weighted average of these two components as the objective of city planners. To ensure maximization, we add a minus sign:

where [0, 1] is the weight for PTC. In this case study, we set considering the difference in the magnitude of two components. We also conduct sensitivity analysis to investigate the impact of w on the final result.

5.2.4. Results

On weekdays, there are on average around 100,000 taxi orders during the evening peak. According to Wikipedia (https://en.wikipedia.org/wiki/Taxicabs of New York City), there are around 13,000 yellow “medallion” taxicabs in NYC. Considering that some drivers do not work during the evening peak and some drivers work outside the grid world, we thus set the number of yellow agents in MARL as 12,000. The number of green agents is set to 5,000 considering that there were in total around 6,000 green taxi drivers in 2015 and some of them may not work during the evening peak.

In the mean field actor-critic algorithm, the critic is parameterized by an MLP with four hidden layers (256, 128, 64, 32), and the actor is parameterized by an MLP with four hidden layers (128, 64, 32, 16). Other hyperparameters remain the same.

For a bilevel optimization problem, first we need to check the convergence of the lower level. As an example to validate the convergence, PTC, ICS, and average cumulative reward (i.e., Reward in Figure (15)) of all agents versus number of episodes are presented in Figure (15) with 1. In both scenarios, despite some bouncing back and forth due to the initial random exploration of agents, PTC increases fast during the first 7,500 episodes where agents explore the environment and learn towards the optimal policy. Increase in PTC indicates that agents are gradually moving into the CBD to search for passengers, which is as expected. After 15,000 episodes, PTC reaches its converged value of 0.32 in the first scenario with 26 in the second scenario with 1, respectively. Conversely, ICS decreases during the first 5,000 episodes, meaning that agents are gradually searching in right places so that more passengers requests are fulfilled. ICS reaches its converged value of 0.035 after 5,000 episodes in the first scenario with 041 after 7,500 episodes in the second scenario with 1, respectively. The average cumulative reward reaches its final converged value of 79.3 in the first scenario with 2 in the second scenario with 1, respectively. All three metrics gradually converge after 12,500 episodes when agents mainly exploit the knowledge they have gained through their previous explorations. The computation time for running the mean field actor-critic algorithm until convergence of the lower level MARL is around 38 hours on a virtual machine with 1 NVIDIA Tesla V100 GPU and 8 standard CPUs on Google Cloud Platform.

Figure 15: Convergence of the lower level MARL

Figure 16: Posterior probability distribution and acquisition function at iteration 0

With the validated convergence of the lower level MARL, we use BO to solve the bilevel problem. In total we evaluate five ’s as the starting point of BO. Figure (16a) plots the mean and standard deviation of the posterior probability distribution of f conditioned on these five evaluated points. As one can see, the standard deviation is small around the points that have been evaluated and is large at locations where we do not have any data. The acquisition function shown in Figure (16b) reveals that the next to be evaluated is 5.95. According to Figure (9a), the mean is high and the standard deviation is high around 5.95, indicating a large acquisition.

Figure 17: Convergence of BO

The convergence of BO in this case is defined as choosing 5 consecutive ’s with the distance between every two consecutive ’s below a threshold of 0.5. The convergence of BO is plotted in Figure (17). We can see that BO initially chooses quite different ’s and gradually converges from the 7iteration.

Figure 18: BO result

The resulting posterior probability distribution of f from BO is presented in Figure (18a). The optimal toll charge is $5.1 with an optimal objective around 08. The objective is 09 without any toll charge. The objective increases with the toll charge when smaller than $5.1. With a $5.1 toll charge, the objective is 08, which is 7.9% higher than 09. The objective decreases if toll charge is increased beyond $5.1. The standard deviation within the range of [4, 6.2] is quite small because we have enough evaluations within the range so the uncertainty is low. As for the range of [0, 2.2], [6.2, 8.3], and [8.3, 10], the standard deviation is large, indicating a large uncertainty in these ranges. BO does not choose to evaluate in these ranges with a large uncertainty because of a low mean value and therefore a small acquisition in these ranges. The parabolic shape of the objective can be explained by Figure (18b). Before the toll charge reaches $5.1, the steady increase in and the minor decrease in push the objective higher with a larger toll charge. After the toll charge is increased beyond $5declines faster and suppresses the effect of the increase in , resulting in a decrease in the objective.

Figure 19: Sensitivity analysis

To investigate the impact of the weighting parameter w on the optimal toll charge, sensitivity analysis is conducted by running BO with various values of w and a preset range of [0, 10]. The result is presented in Figure (19). Recall that city planners aim to minimize both ICS and PTC, and a larger toll charge usually discourages taxis from entering the CBD, resulting in a lower PTC and a higher ICS. With w = 0, city planners put all weights on ICS, and the optimal toll charge is simply zero. As w increases, city planners gradually put more weights on PTC, resulting in an increase in the optimal toll charge. With and , optimal toll charges are $4.2 and $5.1, respectively. As plotted in Figure (18b), the magnitude of is already larger than that of with . Increasing w further increases the discrepancy between the magnitude of and that of . In other words, the magnitude of is much larger than that of when , especially when . With , the optimal toll charge is around $10, which is the upper bound of the preset range, because now city planners focus much more on PTC than ICS. In practice, the selection of the weight depend