Offline Reinforcement Learning: A Survey
Deep reinforcement learning combines the representation learning capability of deep learning with the sequential decision-making ability of reinforcement learning, and has achieved superhuman performance on a number of challenging tasks. However, online reinforcement learning interacts with the environment by trial and error, which makes data collection costly, exploration risky, and learning sample-inefficient. These drawbacks hinder its deployment in real-world systems, particularly risk-sensitive ones. Offline reinforcement learning is a data-driven framework that learns the target policy entirely from a static dataset. By separating data collection from policy learning, it requires no interaction with the environment during training, avoids the potential dangers of that interaction, and can exploit previously collected data, which makes it appealing for real-world applications. This paper first introduces the fundamentals of reinforcement learning and analyzes the bottlenecks of the online learning paradigm. It then gives a formal description of the offline reinforcement learning problem and identifies its key issues. Next, it systematically reviews representative algorithms and recent advances, and introduces the main application areas and commonly used benchmarks. Finally, it summarizes the open challenges and discusses future research directions.
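The separation of data collection from policy learning described in the abstract can be illustrated with a minimal sketch. It is not taken from the survey: the four-state chain environment, the uniformly random behavior policy, and names such as `collect_dataset` and `offline_q_learning` are hypothetical, chosen only to show a learner that never queries the environment once the dataset is fixed. Tabular Q-learning over the logged transitions stands in here for the more elaborate offline algorithms that the survey reviews.

```python
import random

# Hypothetical 4-state chain MDP, used only to generate a static dataset.
N_STATES = 4
ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GAMMA = 0.9


def step(state, action):
    """Toy dynamics: reaching the right-most state gives reward 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def collect_dataset(n_transitions, seed=0):
    """Data collection phase: a uniformly random behavior policy logs transitions."""
    rng = random.Random(seed)
    data = []
    state = 0
    for _ in range(n_transitions):
        action = rng.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        data.append((state, action, reward, next_state, done))
        state = 0 if done else next_state  # restart the episode at the goal
    return data


def offline_q_learning(dataset, sweeps=200, lr=0.5):
    """Policy learning phase: uses only the logged transitions and never calls step()."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            target = r if done else r + GAMMA * max(q[s2])
            q[s][a] += lr * (target - q[s][a])
    return q


dataset = collect_dataset(500)            # static dataset, fixed before learning starts
q_table = offline_q_learning(dataset)     # learning touches the dataset only
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

In this toy setting the random behavior policy covers every state-action pair, so plain Q-learning suffices. With narrower data coverage, value estimates for actions absent from the dataset become unreliable, which is one reason offline learning needs dedicated algorithms.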
National Natural Science Foundation of China (62203005)
Fundamental Research Funds for the Central Universities (B250201086)