In this section, we mainly describe the details of the problem formulation and our algorithm^{83} (more details in Supplementary Code 1), as well as some basic theoretical conclusions. First, we introduce networked MDP. We define an independent system^{39} and *ξ*-dependent system, and we present diverse experimental results on algorithms and models.

### Networked MDP

We consider a system with *n* agents as a graph. Specifically, *n* agents coexist in an undirected and stationary graph \(\mathcalG=(\mathcalV,\mathcalE)\). Each agent in a system is represented as a node in an undirected and stationary graph \(\mathcalG=(\mathcalV,\mathcalE)\), therefore \(\mathcalV=\1,\ldots ,n\\) is the set of agents. \({{\mathcalE}}\subset \mathcalV\times \mathcalV\) comprises the edges that represent the connectivity of all the agents. Each agent can communicate along the edges with their adjacent agents. Let *N*_{i} denote the neighbour of the agent *i* including itself. Let \(N_i^\kappa \) denote the *κ*-hop neighbour of *i*, that is, the nodes whose graph distance to *i* is less than or equal to *κ*. For the simplicity of notation, we also define \(N_-i^\kappa =\mathcalV\setminus N_i^\kappa \). We define an adjacency matrix *A* with size of *n* × *n*. If there is an adjacency relationship between agent *i* and agent *j*, then *A*_{ij} = *A*_{ji} = 1, otherwise *A*_{ij} = *A*_{ji} = 0. This adjacency relationship is determined by the system structure and *κ*.

We define its corresponding networked MDP as \(({{\mathcalG}},\\mathcalS_i,\mathcalA_i\_{i\in {{\mathcalV}}},p,r)\). Each agent *i* has its local state \(s_i\in \mathcalS_i\), and performs action \(a_i\in \mathcalA_i\). The global state is the concatenation of all local states: \(s=(s_1,\ldots ,s_n)\)\(\in \mathcalS:= \mathcalS_1\times \ldots \times \mathcalS_n\). The global action is \(a=(a_1,\ldots ,a_n)\)\(\in \mathcalA:= {\mathcalA}_1\times \ldots \times {{{{\mathcalA}}}}_n\). For the simplicity of notation, we define \(s_N_i\) to be the states of *i*’s neighbours. The transition function is defined as: \(p(s^\prime | s,a):{{{\mathcalS}}}\times {{{\mathcalA}}}\to \Delta ({{{\mathcalS}}})\), and the state transition satisfies Markov properties. Each agent maintains a localized policy \(\pi _i^\theta _i(a_i| s_N_i)\) with parameter *θ*_{i} ∈ *Θ*_{i}, which means that the localized policy is dependent on only states of its neighbours and itself. We use *θ* = (*θ*_{1}, …, *θ*_{n}) to denote the tuple of localized policy parameters, and \(\pi ^\,\theta (a| s)=\prod _i = 1^n\pi _i^\theta _i(a_i| s_N_i)\) to denote the joint policy. The reward function for each agent depends on only local state and action: *r*_{i}(*s*_{i}, *a*_{i}), and the global reward function is defined to be the average reward \(r(s,a)=\frac1n\sum _i = 1^nr_i(s_i,a_i)\). The goal of reinforcement learning is to find a policy *π*^{θ} that maximizes the accumulated reward:

$$\pi ^\;\theta ^* =\mathop\mathrmargmax_\pi ^\,\theta \mathbbE_\pi ^\,\theta \left[\sum\limits_t=0^\infty \gamma ^tr(s_t,a_t)\right],$$

(7)

where *γ* ∈ (0, 1) is the temporal discount factor. The value function is defined as

$$V_\pi ^\,\theta (s)=\mathbbE_\pi ^\theta \left[\sum\limits_t=0^\infty \gamma ^tr(s_t,a_t)| s^0=s\right]=\frac1n\sum\limits_i=1^nV_i(s).$$

(8)

In the last step, we have defined *V*_{i}(*s*), which is the value function for individual reward *r*_{i}.

#### Model-based learning

Let *η*[*π*] denote the return of the policy in the true environment:

$$\eta [\pi ]=\mathbbE_\pi \left[\sum\limits_t=0^\infty \gamma ^tr(s_t,a_t)\right]$$

(9)

Let \(\hat\eta [\pi ]\) denote the returns of the policy under the approximated model. To analyse the difference between *η*[*π*] and \(\hat\eta [\pi ]\), we need to construct a bound

$$\eta [\pi ]\ge \hat\eta [\pi ]-C(\;p,\hatp,\pi ,\pi _\mathrmD),$$

(10)

where *C* is a non-negative function, and *π*_{D} is the data-collecting policy. According to equation (10), if every policy update ensures an improvement of \(\hat\eta [\pi ]\) by at least *C*, *η*[*π*] will improve monotonically. This inequality was first presented in single-agent domain^{54}.

###
*ξ*-dependent networked system

A networked system may have some extent of locality, meaning that in some cases local states and actions do not affect the states of distant agents. In such systems, environmental transitions can be factorized, and agents are able to maintain local models to predict future local states. We define an independent networked system (INS) and an *ξ*-dependent networked system as follows. Figure 1b shows different topological structures in networked systems. In addition, exponential decay has been shown to hold in networked MARL when the network is static^{41,70}. This means that with the increase of the topological distance, the influence between agents will gradually decrease in networked systems. We have proved the properties of these networked systems through experiments in the Ring Attenuation (more results in Supplementary Fig. 6).

### Definition 1

*An environment is an INS if:*

$$p(s^\prime | s,a)=\prod\limits_i=1^np_i(s_i^\prime | s_N_i^\kappa ,a_i).$$

(11)

INS might be an assumption that is too strong to hold. However, for the dynamics that cannot be factorized, we can still use an INS to approximate it. Let *D*_{TV} denote the total variation distance between distributions, we have the following definition:

### Definition 2

*(ξ-dependent) Assume there exists an INS* \(\barp\) *such that* \(\barp(s^\prime | s,a)=\prod _i = 1^np_i(s_i^\prime | s_N_i^\kappa ,a_i)\)*. An environment is ξ-dependent, if:*

$$\mathop\sup _s,aD_\mathrmTV\left(\;p(s^\prime | s,a)\parallel \barp(s^\prime | s,a)\right)=\mathop\sup _s,a\frac12\sum\limits_{s^\prime \in {{{\mathcalS}}}}| p(s^\prime | s,a)-\barp(s^\prime | s,a)| \le \xi .$$

Denote \(\widehatp(s^\prime | s,a)=\prod _i = 1^n\widehatp_i(s_i^\prime | s_N_i^\kappa ,a_i)\). By using \(\widehatp(s^\prime | s,a)\) to approximate the true dynamics, larger *κ* leads to better approximation of model *p*, but also heavier computation overhead. The universal model error \(D(\;p\parallel \hatp)\) can be divided into two parts: dependency bias \(D(\;p\parallel \barp)\) and independence-approximation error \(D(\;\barp\parallel \hatp)\) with

$$D(\;p\parallel \hatp)\le D(\;p\parallel \barp)+D(\;\barp\parallel \hatp).$$

(12)

More details of networked MDP and distances among \(\barp,\hatp\) and \(\widehatp\) are shown in Fig. 2b–d. Then for an *ξ*-dependent system, when the model becomes very accurate, meaning \(D(\;\barp\parallel \hatp)\approx 0\), \(\sup D(\;p\parallel \hatp)\) \(\approx \sup D(\;p\parallel \barp)=\xi\). While *D* can be any appropriate distance metric, we use the TV distance hereafter for the ease of presentation. Here we give analysis under both independent and *ξ*-dependent networked systems. In the following experiments, the systems used to evaluate the algorithms also have the properties of an *ξ*-dependent system.

Let *π* indicate a collective policy *π* = [*π*_{1}, …, *π*_{n}], and the model \(\hatp\) be an INS \(\hatp(s^\prime | s,a)=\mathop\prod _i = 1^n\hatp_i(s_i^\prime | s_N_i,a_i)\) that approximating the true MDP. Denote the data-collecting policy as *π*_{D}. For decentralized learning and control, each agent learns a localized model \(\hatp_i\), policy \(\pi _i(\cdot | s_N_i)\). For each agent *i*, we solve the following problem:

$$\pi_i^\;k+1,p_i^k+1=\mathop\rmargmax\limits_\pi_i,p_i\quad \hat\eta[\pi]-C(\;p,\hatp,\pi ,\pi _\mathrmD).$$

(13)

Below, we first lay out the monotonic improvement property of independent and *ξ*-independent systems. Then we describe our framework in Algorithm 1, and Fig. 2e shows the model learning process. Experience data of state transition and rewards from the interaction between the environment and the agent are used to train the environment model. Then our method can use the large amount of data generated by the interaction between the model and the agent to improve the sampling efficiency in the training process.

### Monotonic model-based improvement

In model-based learning, different rollout schemes can be chosen. The vanilla rollout assumes that models are used in an infinite horizon. The branched rollout performs a rollout from a state sampled by a state distribution of previous policy *π*_{D}, and runs *T* steps in \(\hat\pi \) according to *π*. Based on different rollout schemes, we can construct two lower bounds. Under vanilla rollout, real return and model return can be bounded by model error and policy divergence. Formal results are presented in Theorem 1. The detailed proof is deferred to Supplementary Section 4.2.

### Theorem 1

*Consider an INS. Denote local model errors as:*

$$\epsilon _m_i=\mathop\max\limits_s_N_i,a_iD_\mathrmTV\big[\;p_i(s_i^\prime | s_N_i,a_i)\parallel \hatp_i(s_i^\prime | s_N_i,a_i)\big]$$

*and divergences between the data-collecting policy and evaluated policy as:*

$$\epsilon _\pi _i=\mathop\max\limits_s_N_i\rmD_TV[\pi _D(a_i| s_N_i)\parallel \pi (a_i| s_N_i)]$$

*Assume the upper bound of rewards of all agents is r*_{max}*. Let* *η*^{p}[*π*_{1}, …, *π*_{n}] *denote the real returns in the environment. Also, let* \(\eta ^\hatp[\pi _1,\ldots ,\pi _n]\) *denote the returns estimated in the model trajectories, and the states and actions are collected with π*_{D}*. Then we have:*

$$| \eta ^p[\pi _1,\ldots ,\pi _n]-\eta ^\hatp[\pi _1,\ldots ,\pi _n]| \le \frac2r_\max 1-\gamma \sum\limits_i=1^n\left[\frac\epsilon _\pi _in+(\epsilon _m_i+2\epsilon _\pi _i)\sum\limits_k=0^\infty \gamma ^k+1\frac n\right].$$

(14)

Intuitively, the term \(\sum _k = 0^\infty \gamma ^k+1\fracn\) would be in the same magnitude as \(\frac11-\gamma \), which might be huge given the choice of *γ*, making the bound too loose to be effective. To make tighter the discrepancy bound in Theorem 1, we adopt the branched rollout scheme. The branched rollout enables a effective combination of model-based and model-free rollouts. For each rollout, we begin from a state sample from \(d_\pi _\mathrmD\), and run *T* steps in each localized \(\hat\pi _i\). When branched rollout is applied in an INS, Theorem 2 gives the returns bound.

### Theorem 2

*Consider an INS. Denote local model errors as:*

$$\epsilon _m_i=\mathop\max\limits_s_N_i,a_iD_\mathrmTV\big[\,p_i\big(s_i^\prime | s_N_i,a_i\big)\parallel \hatp_i\big(s_i^\prime | s_N_i,a_i\big)\big]$$

*and divergences between the data-collecting policy and evaluated policy as:*

$$\epsilon _\pi _i=\mathop\max\limits_s_N_i\rmD_TV\big[\pi _D\big(a_i| s_N_i\big)\parallel \pi \big(a_i| s_N_i\big)\big]$$

*Assume the upper bound of rewards of all agents is r*_{max}*. Let η*^{p}[*π*_{1}, …, *π*_{n}] *denote the real returns in the environment. Also, let* *η*^{branch}[*π*_{1}, …, *π*_{n}] *denote the returns estimated via* *T-step branched rollout scheme. Then we have:*

$$\beginarrayl| \eta ^p[\pi _1,\ldots ,\pi _n]-\eta ^\mathrmbranch[\pi _1,\ldots ,\pi _n]| \le \frac2r_\max 1-\gamma \sum\limits_i=1^n\\\left[\epsilon _m_i\cdot \left(\sum\limits_k=0^T-1\gamma ^k+1\frac n\right)+\epsilon _\pi _i\cdot \left(\sum\limits_k=T^\infty \gamma ^k+1\fracn\right)\right]\endarray$$

Comparing the results in Theorems 1 and 2, we can see that branched rollout scheme reduced the coefficient before \(\epsilon _m_i\) from \(\sum _k = 0^\infty \gamma ^k+1\frac N_i^kn\le \frac\gamma 1-\gamma \) to \(\mathop\sum _k = 0^T-1\gamma ^k+1\fracn\le \mathop\sum _k = 0^T-1\gamma ^k+1=\frac\gamma (1-\gamma ^T)1-\gamma \). This reduction explains that empirically, branched rollout brings better asymptotic performance. Also, if we set *T* = 0, this bound turn into a model-free bound. This indicates that when \(\epsilon _m_i\) is lower than \(\epsilon _\pi _i\) allowed by our algorithm, a model might increase the performance.

### Algorithm 1: A general model-based learning framework

1: Initialize policy \(\pi (a| s)=\\pi _i\_i = 1^n\), predictive model \(p_\psi \left(s^\prime | s,a\right)=\)\(\ s_N_i^\kappa ,a_i)\_i = 1^n\), empty dataset \(\\mathcalD_i\_i = 1^n\).

2: **for** *N* epochs **do**

3: Collect data with \(\\pi _i\_i = 1^n\) in real environment: \(\mathcalD_i=\mathcalD_i\cup\)\(\left\\left(s_i^t,a_i^t,s_i^t+1,r_\!i^\,t\right)\right\_t\) for all agents

4: Train model \(p_\psi _i\) on dataset \(\mathcalD_i\) via maximum likelihood: \(\psi _i\leftarrow \mboxargmax_\psi _i\mathbbE_\mathcalD_i\left[\log p_\psi _i(s_i| s_N_i^\kappa ,a_i)\right]\)

5: Optimize policy under predictive model: \(\pi _i^\,k+1\leftarrow {\mboxargmax}_\pi _i^\prime \)\(\hat\eta [\pi ^\prime ]-C(\;p,\hatp,\pi ^\prime ,\pi _\mathrmD)\).

6: **end for**

### Incorporating dependency bias

In reality, not every system satisfies the definition of an INS. Yet we can generalize Theorem 2 into an *ξ*-dependent system.

### Corollary 1

*Consider an ξ-dependent networked system. Denote local model errors as:*

$$\epsilon _m_i=\mathop\max\limits_s_N_i,a_iD_{\mathrmTV}\big[\;p_i\big(s_i^\prime | s_N_i,a_i\big)\parallel \hatp_i\big(s_i^\prime | s_N_i,a_i\big)\big]$$

*and divergences between the data-collecting policy and evaluated policy as:*

$$\epsilon _\pi _i=\mathop\max\limits_s_N_i\rmD_TV\big[\pi _D\big(a_i| s_N_i\big)\parallel \pi \big(a_i| s_N_i\big)\big]$$

*Assume the upper bound of rewards of all agents is* *r*_{max}*. Let η*^{p}[*π*_{1}, …, *π*_{n}] *denote the real returns in the environment. Also, let* *η*^{branch}[*π*_{1}, …, *π*_{n}] *denote the returns estimated via T-step branched rollout scheme. Then we have:*

$$\beginarrayl| \eta ^p[\pi _1,\ldots ,\pi _n]-\eta ^\mathrmbranch[\pi _1,\ldots ,\pi _n]| \le \frac2r_\max \gamma (1-\gamma )^2\xi\\+\frac2r_\max 1-\gamma \sum\limits_i=1^n\left[\epsilon _m_i\cdot \left(\sum\limits_k=0^T-1\gamma ^k+1\frac n\right)\quad +\epsilon _\pi _i\cdot \left(\sum\limits_k=T^\infty \gamma ^k+1\frac N_i^kn\right)\right]\endarray$$

The proof can also be found in Supplementary Information. Compared with Theorem 2, Corollary 1 is more general, as it applies to multi-agent systems that are not fully independent. Intuitively, if a networked system seems nearly independent, local models will be effective enough. The bound indicates that when the policy is optimized in a trust region where \(D(\pi ,\pi _\mathrmD)\le \epsilon _\pi _i\), the bound would also be restricted, making monotonic update more achievable.

### Algorithm 2: Model-based decentralized policy optimization

1: Initialize the model buffer \(\mathcalD_i^\mathrmM\) and the environment buffer \(\mathcalD_i^\mathrmE\), where \(i\in {{{\mathcalV}}}\).

2: Initialize the rollout length *T* and adjacency matrix *A* through *κ*.

3: Initialize the decentralized model \(p^\psi _i\), actor \(\pi ^\,\theta _i\) and critic \(V^\,\phi _i\) by adjacency matrix *A*, where \(i\in {{{\mathcalV}}}\).

4: **for** many epochs **do**

5: Sample action according to decentralized \(\pi ^\,\theta _i\) and local observation \(s_N_i^\kappa \) in environment to collect trajectories \(\tau _\!i^\,\mathrmE\).

6: Store trajectories to environment buffer \(\mathcalD_i^\mathrmE\) for each agent *i*, \(\mathcalD_i^\mathrmE=\mathcalD_i^\mathrmE\cup \{\tau _\!i^\,\mathrmE\}\).

7: Minimize the objective in equation (6) to update model \(p^\psi _i\) through trajectories from \({{{{\mathcalD}}}}_i^\mathrmE\).

8: **for** many steps **do**

9: **for** many rollouts **do**

10: Sample \(s_\!i^t\) from \({{{{\mathcalD}}}}_i^{\mathrmE}\) as an initial state.

11: Simulate *T*-step trajectories \(\tau _\!i^\mathrmM\) from \(s_\!i^t\) by decentralized policy \(\pi ^\,\theta _i\) and decentralized model \(p^\psi _i\).

12: Store trajectories \(\tau _i^\,\mathrmM\) to model buffer \({{{{\mathcalD}}}}_\!i^\mathrmM\) for each agent *i*, \({{{{\mathcalD}}}}_i^\mathrmM={{{{\mathcalD}}}}_i^\mathrmM\cup \{\tau _\!i^\,\mathrmM\}\).

13: **end for**

14: **for** many gradient steps **do**

15: Update the extended value functions \(V^\,\phi _i\) on trajectories \(\tau _\!i^\,\mathrmM\) sampled from \({{{{\mathcalD}}}}_i^\mathrmM\) according to equation (3)

16: Update the decentralized policies \(\pi ^\,\theta _i\) on trajectories \(\tau _\!i^\,\mathrmM\) sampled from \({{{{\mathcalD}}}}_i^{\mathrmM}\) according to equation (5).

17: **end for**

18: **end for**

19: **end for**

Compared with the theorems of model-based policy optimization in the single-agent domain^{54}, we extend them to the multi-agent domain, especially for completely independent networked systems or even more general and real-world systems named *ξ*-dependent networked systems. Further, to solve the problems of the increase of computation complexity, communication cost and system instability caused by the multi-agent system, we propose a novel communication mechanism and analyse the accuracy of the estimation of value functions and policy gradients through the following theoretical proofs, which is completely different from the general model-based algorithms.

Therefore, the challenges for our algorithm and theorems are mainly twofold in these *ξ*-dependent networked systems with fewer communications. One is that local communication and approximation dependency bias \(D(p\parallel \barp)\) between independent networked systems and *ξ*-dependent networked systems will increase the generalization error \(\epsilon _m_i\) of the environment model and policy shift \(\epsilon _\pi _i\) between *π* and *π*_{D}, which will further increase the lower bounds \(C(\;p,\hatp,\pi ,\pi _\mathrmD)\) in Theorems 1 and 2. The results of experiments fully prove that our method can reduce the influence of these errors on the monotonic improvement of the model-based learning through the following algorithm design and hyperparameter selection. The other is that our method needs to provide accurate policy gradients and value functions for the multi-agent system to guide the policy optimization, which is explained in more detail in Theorems 3 and 4.

### Analysis of policy gradients approximation

In this section, we show that the policy gradient induced by equation (5) is an asymptotic approximation of the true policy gradients. We begin with discussing that the extended value function \(V_i(s_N_i^\kappa )\) is a good approximation of the real value function. We formally state the result in Theorem 3 and defer the proof to Supplementary Section 4.5.

### Theorem 3

*Define* \(V_i(s_N_\!i^\,\kappa )=\mathbbE_s_N_-i^\kappa \left[\sum _t = 0^\infty r_\!i^\;t(s_t,a_t)| s_N_\!i^\,\kappa ^0=s_N_\!i^\,\kappa \right]\) *and* \(V_i(s)=\mathbbE\)\([\mathop\sum _t = 0^\infty r_i^\;t| s^0=s\,]\)*, then*

$$| V_\!i(s)-V_\!i(s_N_i^\kappa )| \le \fracr_\max 1-\gamma \gamma ^\,\kappa .$$

(15)

### Remark 1

*Recall that* \(V(s)=\frac1n\sum _i = 1^nV_i(s)\). *From equation* (15)*, it is easy to obtain the following result*

$$| V(s)-\frac1n\sum\limits_i=1^nV_i(s_N_\!i^\,\kappa )| \le \fracr_\max 1-\gamma \gamma ^\,\kappa ,$$

(16)

*which indicates that the global value function V*(*s*) *can be approximated by the average of all extended value functions.*

In policy optimization, value functions are used for calculating advantages \(\hatA^(t)\), and we have shown that *V*(*s*) can be estimated with the average of extended value functions \(\frac1n\sum _i = 1^nV_\!i(s_N_\!i^\,\kappa )\). In practice, an agent might not get the value function of distant agents and can access only the value function of its *κ*-hop neighbours. However, we can prove that \(\tildeV_\!i=\frac1n\sum _j\in N_\!i^\,\kappa V_j(s_N_\!j^\,\kappa )\) is already very accurate for calculating the policy gradient for agent *i*. This theoretical result has been extended to general multi-agent systems based on previous research^{39}. Theorem 4 formally states this result and the proof is deferred to Supplementary Section 4.6.

### Theorem 4

*Let* \(\hatA_t=r^\;(t)+\gamma V(s^(t+1))-V(s^(t))\) *be the TD (temporal difference) residual, and* \(g_i=\mathbbE[\hatA\nabla _\theta _i\log \pi _i(a| s)]\) *be the policy gradient. If* \(\tildeA_t\) and \(\tildeg_i\) *are the TD residual and policy gradient when value function* *V*(*s*) *is replaced by* \(\tildeV_\!i(s)=\frac1n\sum _j\in N_\!i^\,\kappa V_\!j(s_N_\!i^\,\kappa )\)*, we have:*

$$| g_i-\tildeg_i| \le \frac\gamma ^\,\kappa -11-\gamma \left[1-(1-\gamma ^2)\fracN_i^\,\kappa n\right]r_\max g_\max ,$$

(17)

*where* *r*_{max}*and* *g*_{max} *denote the upper bound of the absolute value of reward and gradient, respectively.*

### Remark 2

*Theorem* *4**justifies that the policy gradients computed based on the sum of the neighbouring extended value functions is a close approximation of true policy gradients. The power of this theorem is that the extended value function* \(V_i(s_N_i^\,\kappa )\) *requires only the neighbouring information, thus easier to approximate and scalable. Despite the reduction in computation, the difference between the approximated and true gradient in equation* (17) *is small.*

On the basis of the difficulties and challenges proposed in the above theoretical analysis, we designed corresponding solutions in the algorithm and the specific implementation details can be found in our code. In Theorems 1 and 2, the probability to monotonic improvement of the model-based learning in the algorithm will be increase when the lower bound \(C(p,\hatp,\pi ,\pi _{\mathrmD})\) that consists of two parts: generalization model error \(\epsilon _m_i\) and policy shift \(\epsilon _\pi _i\) is tight enough to constrain equation (10). To reduce the generalization model error \(\epsilon _m_i\) caused by dependency bias \(D(\,p\parallel \hatp)\) in equation (12), we design a stage of warm-up in our algorithm to pre-train the model until an acceptable accuracy of the model is obtained and a method to select data from the true MDP or model with a probability to guide the usage of the model, and we further adopted a *k*-branched rollout method to improve the accuracy of model. According to equation (12), when the model becomes very accurate, meaning \(D(\,\barp\parallel \hatp)\approx 0\), \(\sup D(\,p\parallel \hatp)\approx \sup D(p\parallel \barp)=\xi\). Then we design a distribution of the neighbours for each agent, including the quantity and topology of neighbours, to reduce the model error \(\epsilon _m_i\) caused by dependency bias \(D(\,p\parallel \barp)\) in the experiments. Lastly, to constrain the policy shift \(\epsilon _\pi _i\) between *π* and *π*_{D}, we use the method of PPO to limit the difference in policies.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

link