Offline Reinforcement Learning: A Survey

李晓峰, 蒋佳慧, 王雪娆

Journal of Chinese Computer Systems (小型微型计算机系统), 2026, Vol. 47, Issue 5: 1056-1069. DOI: 10.20009/j.cnki.21-1106/TP.2025-0354

Algorithm Theory and Artificial Intelligence


Abstract

Deep reinforcement learning combines the representation learning capability of deep learning with the sequential decision-making ability of reinforcement learning, and has achieved superhuman performance on many challenging tasks. However, online reinforcement learning interacts with the environment by trial and error, which leads to high sampling cost, risky exploration, and low sample efficiency, and this hinders its deployment in real-world systems. Offline reinforcement learning is a data-driven framework that learns the target policy exclusively from a static dataset, with no interaction with the environment during training. By decoupling data collection from policy learning, it avoids the potential dangers of interaction and is therefore appealing for real-world applications. This paper first introduces the fundamentals of reinforcement learning and analyzes the bottlenecks of the online learning paradigm. On this basis, it gives a formal description of the offline reinforcement learning problem and identifies its key issues. It then systematically reviews representative algorithms and recent advances, and introduces the main application areas and commonly used benchmarks. Finally, it summarizes the remaining challenges and discusses directions for future research.

Key words

reinforcement learning / deep reinforcement learning / offline reinforcement learning / policy improvement / distribution shift

Cite this article

李晓峰, 蒋佳慧, 王雪娆. Offline reinforcement learning: a survey[J]. Journal of Chinese Computer Systems (小型微型计算机系统), 2026, 47(5): 1056-1069. DOI: 10.20009/j.cnki.21-1106/TP.2025-0354


Funding

National Natural Science Foundation of China (62203005)

Fundamental Research Funds for the Central Universities (B250201086)
