Offline Reinforcement Learning: A Survey

LI Xiaofeng, JIANG Jiahui, WANG Xuerao

小型微型计算机系统 (Journal of Chinese Computer Systems), 2026, Vol. 47, Issue 5: 1056-1069. DOI: 10.20009/j.cnki.21-1106/TP.2025-0354

Section: Algorithm Theory and Artificial Intelligence

Abstract

Deep reinforcement learning combines the representation learning capability of deep learning with the sequential decision-making ability of reinforcement learning, and has surpassed human-level performance on a number of challenging tasks. However, online reinforcement learning interacts with the environment by trial and error, which makes data collection costly, exploration risky, and learning sample-inefficient. For risk-sensitive real-world systems such data collection is often infeasible, and this reliance on active interaction has hindered the deployment of online reinforcement learning in practice. Offline reinforcement learning is a data-driven paradigm that learns a target policy exclusively from a static dataset. By decoupling data collection from policy learning, it avoids the potential dangers of interacting with the environment during training, and its ability to learn from previously collected data makes it appealing for real-world applications. This paper first introduces the fundamentals of reinforcement learning and analyzes the bottlenecks of the online learning approach. On this basis, it gives a formal description of the offline reinforcement learning problem and identifies its key issues. It then presents a comprehensive and systematic review of representative algorithms and recent results, and introduces the main application areas and commonly used benchmarks. Finally, it summarizes the open challenges and discusses future research directions.

Key words

reinforcement learning / deep reinforcement learning / offline reinforcement learning / policy improvement / distribution shift

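Cite this article

LI Xiaofeng, JIANG Jiahui, WANG Xuerao. Offline reinforcement learning: a survey[J]. 小型微型计算机系统 (Journal of Chinese Computer Systems), 2026, 47(5): 1056-1069. DOI: 10.20009/j.cnki.21-1106/TP.2025-0354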

