1.College of Computer Science, Sichuan University, Chengdu 610065, China
2.National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China
Abstract
Six-degree-of-freedom (6-DoF) unmanned aerial vehicle (UAV) air combat is a challenging scenario with complex multi-dimensional states, coupled continuous actions, and highly nonlinear dynamics. Deep reinforcement learning (DRL) requires no labeled data and optimizes its policy solely through interaction with the environment, so its application to autonomous air combat maneuver decision-making has attracted wide attention. However, the high-dimensional state and action spaces make it hard for end-to-end training to learn effective strategies and lead to slow convergence and poor generalization; moreover, reward function design largely depends on human experience, and obtaining a high reward is not equivalent to learning a good strategy. To address these problems, this paper proposes a two-stage time-scale state separation proximal policy optimization (TTS‒PPO) algorithm built on a time-division framework. Because the flight-control inputs act on different state variables over different time scales, the algorithm divides air combat maneuvering into short-cycle rotational motion and long-cycle trajectory motion: the short-cycle part uses a proportional-integral-derivative (PID) controller to output flight-control commands in real time, while the long-cycle part trains a proximal policy optimization (PPO) agent on the control interface exposed by the short-cycle PID layer. This decouples the action spaces of the two kinds of motion, making it easier for the UAV to learn effective strategies. At the same time, the environmental state variables are separated into long-cycle and short-cycle states, which reduces the dimensionality of the state space, accelerates convergence, and improves the generalization of the model. In addition, the PPO network for long-cycle decision-making is trained in two stages: the first stage designs single-step rewards and uses a lower decision frequency so that the UAV quickly gets through the cold-start period, while the second stage keeps only the terminal reward and uses a higher decision frequency, avoiding the trap of pursuing high rewards at the expense of combat performance. Experimental results show that the algorithm built on this framework converges to higher reward values; introducing long- and short-cycle state variables improves convergence speed by about 67% and yields stronger generalization across different air combat scenarios; and with the additional second training stage, TTS‒PPO further improves performance and can defeat an expert UAV after being trained only against an enemy flying in a straight line.
Abstract
Objective Six-degree-of-freedom (6-DoF) unmanned aerial vehicle (UAV) air combat scenarios present substantial challenges for strategy learning when reinforcement learning methods are applied. These challenges stem from high-dimensional state spaces, continuously coupled action domains, and strongly nonlinear flight dynamics. Conventional end-to-end deep reinforcement learning (DRL) approaches struggle to achieve rapid convergence, to identify effective maneuver strategies, and to generalize learned policies beyond narrowly constrained conditions. In addition, reward functions often rely on handcrafted rules derived from human expertise, which do not ensure that higher reward values correspond to genuinely effective combat strategies. This study addresses these limitations by introducing a hierarchical framework based on time scale separation theory. The proposed framework employs a two-stage training procedure that accounts for differences in how flight parameters influence state variables across multiple time scales, improving learning efficiency, enhancing strategy quality, and increasing generalization capability in complex and diverse combat environments. Methods A novel algorithm, two-stage time-scale state separation proximal policy optimization (TTS‒PPO), was developed. The method partitioned the 6-DoF UAV air combat decision-making process into short-cycle and long-cycle segments, reflecting differences in how control inputs influenced state variables across distinct time scales. A time-division framework was established. The short-cycle component addressed rapid rotational and attitude adjustments. Instead of allowing the DRL procedure to directly manage these fine-grained actions, a proportional-integral-derivative (PID) controller was employed to output real-time joystick commands. This configuration allowed classical low-level stability and attitude control to be handled independently, which reduced the complexity encountered by the DRL policy at the higher strategic level. With low-level stability assured, the DRL agent focused on tactical and strategic decision-making. The long-cycle component used proximal policy optimization (PPO) to manage trajectory planning and tactical maneuvers. The long-cycle PPO agent decoupled strategic decision-making from low-level actuation by issuing high-level commands to guide the PID-driven short-cycle layer. This hierarchical decomposition allowed learning to proceed more efficiently: the long-cycle agent encountered a reduced problem space and concentrated on discovering effective combat strategies without being burdened by the complexities of rapid stabilization maneuvers. Time scale separation was further implemented within the state space. Environmental states were divided into long-cycle and short-cycle groups. The long-cycle states captured slowly evolving features such as relative positions, energy conditions, and global situational parameters, whereas the short-cycle states encompassed rapidly changing variables such as angular rates and orientation deviations. Aligning state variables with their corresponding time scales accelerated learning and improved policy robustness. A relative situation transformation module was introduced to refine and compress the state representation, which ensured that the agent received relevant information at appropriate decision intervals and minimized computational complexity at each step.
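To make the time-division idea concrete, the following minimal Python sketch (not code from the paper; the environment interface, PID gains, and decision interval are assumptions for illustration) shows a long-cycle policy issuing attitude and throttle targets at a low rate while short-cycle PID loops convert them into stick commands at every simulation step.

```python
# Illustrative sketch of a time-division control loop: a long-cycle policy acts
# every decision_dt seconds, short-cycle PID channels act every sim_dt seconds.
# env, policy, observation keys, and gains are hypothetical placeholders.
import numpy as np


class PID:
    """Basic PID loop used for a short-cycle rotational channel."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def run_episode(env, policy, sim_dt=0.02, decision_dt=0.5):
    """Long-cycle policy at 1/decision_dt Hz; short-cycle PID at 1/sim_dt Hz."""
    pitch_pid = PID(2.0, 0.1, 0.5, sim_dt)   # elevator channel (placeholder gains)
    roll_pid = PID(1.5, 0.05, 0.3, sim_dt)   # aileron channel (placeholder gains)
    steps_per_decision = int(decision_dt / sim_dt)

    obs, done = env.reset(), False
    while not done:
        # Long-cycle decision: target pitch angle, target roll angle, throttle.
        target_pitch, target_roll, throttle = policy.act(obs)
        for _ in range(steps_per_decision):
            # Short-cycle loop: PID turns attitude targets into stick commands.
            elevator = pitch_pid.step(target_pitch, obs["pitch"])
            aileron = roll_pid.step(target_roll, obs["roll"])
            action = np.clip([aileron, elevator, throttle], -1.0, 1.0)
            obs, reward, done, info = env.step(action)
            if done:
                break
```

The point of the decomposition is that the learned policy only ever sees the coarse decision interval, while rapid stabilization is delegated to the PID layer.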
A two-stage training strategy was employed. In the first stage, single-step rewards designed for specific subtasks, such as pursuit or strike, were introduced with a lower decision frequency to assist the agent during the initial “cold start” period. This incremental guidance supported the stabilization of fundamental behavioral patterns and facilitated the acquisition of essential tactical principles. During this phase, the agent overcame early-stage instability, which resulted in more reliable initial policies. In the second stage of training, single-step rewards were removed, only sparse terminal rewards were retained, and the decision frequency was increased. In the absence of frequent intermediate rewards, the policy emphasized long-term outcomes rather than short-term objectives. The higher decision frequency enabled more refined tactical adjustments and encouraged the emergence of maneuvers that improved overall performance. The gradual transition from a guided, intermediate-reward regime to a sparse-reward, high-frequency regime allowed the policy to progress from basic stability toward advanced strategic competence. A simulation environment was constructed using an open-source F-16 UAV model coupled with the JSBSim flight dynamics engine to evaluate the effectiveness of the proposed hierarchical DRL algorithm founded on time scale separation theory. This configuration provided realistic 6-DoF conditions and supported one-on-one close-range air combat simulations. Ablation experiments were conducted to assess the contribution of individual components within the TTS‒PPO framework. One configuration trained the agent against a non-maneuvering linear opponent, which served as a controlled baseline for determining whether the learned policy could scale from simple engagements to more complex combat scenarios. Results and Discussions The results demonstrated that the TTS‒PPO approach, which incorporated hierarchical decomposition and time scale separation, achieved faster convergence and improved final performance metrics compared with baseline end-to-end DRL methods that lacked time scale separation or a two-stage training procedure. Assigning state variables to short-cycle and long-cycle categories, together with hierarchical action decomposition, significantly reduced overall problem complexity. Training convergence speed improved by approximately 67%, which reduced computational costs and enabled more frequent iterative policy refinements. With enhanced efficiency, the DRL agent discovered more stable and effective combat strategies within fewer training episodes. Generalization performance was evaluated by testing agents trained under different variants of the approach across various initial conditions, velocities, and adversary tactics. Comparisons were conducted among three agent types: an agent trained with PPO on full-state inputs without time scale division (FS‒PPO), an agent using time scale-separated states with a single-stage training approach (TS‒PPO), and the two-stage time scale-separated TTS‒PPO. The agent trained with TTS‒PPO outperformed both FS‒PPO and TS‒PPO agents in pairwise confrontations, which indicated that combining time scale separation with two-stage training not only enhanced learning speed but also enabled the agent to acquire more generalizable combat principles rather than narrowly optimizing for a specific scenario.
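The two-stage schedule described above can be captured in a small configuration object; the sketch below is illustrative only (the reward terms, frequencies, and step budgets are placeholder assumptions, not values from the paper).

```python
# Illustrative sketch of the two-stage schedule: stage 1 adds dense pursuit/strike
# shaping rewards at a lower decision frequency; stage 2 keeps only the sparse
# terminal outcome and raises the decision frequency. All numbers are placeholders.
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageConfig:
    decision_hz: float      # long-cycle decision frequency
    use_step_reward: bool   # dense single-step shaping on/off
    total_steps: int        # training budget for the stage


STAGE1 = StageConfig(decision_hz=2.0, use_step_reward=True, total_steps=2_000_000)
STAGE2 = StageConfig(decision_hz=10.0, use_step_reward=False, total_steps=2_000_000)


def reward(stage: StageConfig, step_info: dict, terminal_info: Optional[dict]) -> float:
    """Dense shaping only in stage 1; terminal win/loss reward in both stages."""
    r = 0.0
    if stage.use_step_reward:
        # Hypothetical shaping terms: reward closing on the target and pointing at it.
        r += 0.01 * step_info["closure_rate"] - 0.05 * step_info["aim_angle_rad"]
    if terminal_info is not None:
        r += {"win": 1.0, "loss": -1.0, "draw": 0.0}[terminal_info["outcome"]]
    return r
```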
Further validation involved testing the TTS‒PPO-trained agent against rule-based expert opponents. The policy derived from TTS‒PPO successfully defeated these expert systems. Even when training was conducted exclusively against a simple linear adversary, the learned policy surpassed expert-level strategies, which confirmed that hierarchical time scale separation and the two-stage training design facilitated the development of adaptable policies with robust tactical proficiency. The ability to transfer from minimal training complexity to outperforming expert opponents highlighted the scalability and versatility of the learned strategies. Conclusions Accordingly, the hierarchical DRL algorithm, grounded in time scale separation theory and employing a two-stage training strategy, addressed significant challenges associated with applying DRL to 6-DoF UAV air combat tasks. The method substantially improved training efficiency and enhanced both the robustness and generalization capability of the resulting policies by decomposing decision-making into short-cycle and long-cycle phases, introducing a PID-controlled low-level stabilization layer, and separating state variables based on their respective time scales. The hierarchical framework enabled the agent to focus on strategic maneuvers at the long-cycle level, while the short-cycle PID layer managed rapid stabilization tasks. Time scale-aware state representations and a staged training procedure guided the policy from basic stability to advanced tactical competence. The observed increases in convergence speed and the ability to manage a range of adversarial conditions highlight the value of applying time scale separation principles in challenging reinforcement learning domains. The TTS‒PPO framework can serve as a reference for addressing other complex reinforcement learning problems characterized by distinct time scale dynamics, fostering more efficient, generalizable, and strategically effective decision-making in advanced autonomous systems.
In recent years, with the great success of deep reinforcement learning (DRL) in games such as Atari [9], Go [10], and Dota 2 [11], an increasing number of researchers have applied it to autonomous air combat decision-making tasks. DRL-based intelligent air combat research can be divided into discrete and continuous categories according to the UAV's action space. For beyond-visual-range air combat, Zhang et al. [12] proposed a heuristic Q-network method that incorporates expert experience, built on an action set of nine fixed actions; for close-range air combat, Yang et al. [13] extended the maneuver library beyond NASA's seven basic maneuvers [14] and proposed a reinforcement-learning-based autonomous maneuver decision model for UAVs. Both [12] and [13] designed maneuver libraries of discrete actions. Discretizing actions with human prior knowledge greatly reduces the complexity of the exploration space, but it limits the algorithm's ability to adapt to complex situations and cannot satisfy the demand for high-precision control. Although discrete actions simplify the problem considerably, continuous control inputs allow richer maneuvers, support more complex tasks, and make air combat simulation closer to real engagements [2]. For cooperative combat, Li et al. [15] built a deep deterministic policy gradient structure in a continuous action space, associating the actions with three continuous parameters: throttle, angle of attack, and flight-path angle. The studies in [13‒15] all use three-degree-of-freedom aircraft, whereas real unmanned combat aircraft engagements usually require six-degree-of-freedom models, which can fully and accurately describe and control all of an aircraft's motion in three-dimensional space and adapt to complex, dynamically changing combat environments. The problem with six-degree-of-freedom models is that end-to-end training in continuous state and action spaces leads to an enormous exploration space and a sharp increase in the difficulty of convergence. Hierarchical reinforcement learning alleviates this problem through divide and conquer. Pope et al. [16] combined a hierarchical architecture with maximum entropy reinforcement learning, first training several basic UAV policies and then making decisions with a high-level policy selector, defeating the F-16 instructor Banger in the AlphaDogfight Trials. Li et al. [17] built a six-degree-of-freedom model and combined a particle swarm optimization-radial basis function (PSO-RBF) based method for predicting enemy actions with an improved deep deterministic policy gradient (DDPG), optimizing autonomous UAV air combat maneuver decisions at both the simulation and decision levels. Chai et al. [18] proposed a hierarchical deep reinforcement learning framework for six-degree-of-freedom UAV air combat that handles the complexity by splitting the decision process into an outer loop and an inner loop. Although [16‒18] all apply hierarchical ideas to six-degree-of-freedom UAV air combat to ease the exploration difficulty in continuous action spaces, their methods and criteria differ; moreover, during training the upper and lower layers still take the full set of state variables as input, and their decision frequencies are not differentiated, so no genuine hierarchy is achieved. In addition, reward function design is crucial to policy learning, yet there is still no unified design rule to follow [2]. Most previous studies rely on human experience to design the reward function and tune it through experiments; such designs, constrained by human knowledge, cannot fully and accurately represent the reward feedback of the environment. References [19‒20] use key air combat events, such as aiming at the enemy or being aimed at, for reward shaping instead of the single-step rewards of earlier work, which to some extent alleviates the influence of experience-based reward design on model performance.
To address the above problems, this paper proposes a two-stage time-scale state separation proximal policy optimization (TTS‒PPO) algorithm. First, based on the different time scales over which the control stick affects the air combat state variables, a time-division air combat framework is proposed that splits maneuver decision-making into a short-cycle part and a long-cycle part: the former uses a proportional-integral-derivative (PID) controller for real-time stick control, and the latter trains a proximal policy optimization (PPO) agent on the control interface exposed by the short-cycle PID layer. Second, the environmental state variables are separated by time scale, and the corresponding long-cycle and short-cycle states together with a relative situation transformation module are designed, which reduces the dimensionality of the reinforcement learning state space and thus improves the convergence speed and generalization of the model. Finally, considering the limitations of the reward function and the different decision frequencies of the long and short cycles after the hierarchy is introduced, the PPO network is trained in two stages: in the first stage, the air combat task is divided into pursuit and strike cases with corresponding single-step rewards, and a lower decision frequency is used to speed up convergence; in the second stage, the single-step rewards are removed and only the sparse terminal reward is kept, reducing their influence on the UAV, while the decision frequency is raised to push combat performance as far as possible. In the simulation experiments, an open-source F-16 model and the open-source JSBSim [21] flight dynamics engine are used to build a six-degree-of-freedom one-on-one close-range air combat scenario, and ablation experiments on the three improvements of the algorithm verify their effectiveness. The results show that, even when trained only against an enemy flying in a straight line, the resulting UAV can fight against an expert UAV and has close-range air combat capability.
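As a rough illustration of what a relative situation transformation might compute (the quantities, coordinate convention, and names below are our assumptions; the paper's module may differ), the sketch compresses the absolute states of both aircraft into a handful of slowly varying relative features suitable for the long-cycle observation.

```python
# Hypothetical relative-situation transformation: raw absolute states of both
# aircraft are reduced to a few slowly varying relative quantities.
import numpy as np


def relative_situation(own, enemy):
    """own/enemy: dicts with 'pos' (NED position, m) and 'vel' (NED velocity, m/s) arrays."""
    los = enemy["pos"] - own["pos"]                 # line-of-sight vector
    distance = np.linalg.norm(los)
    los_unit = los / max(distance, 1e-6)

    own_speed = np.linalg.norm(own["vel"])
    enemy_speed = np.linalg.norm(enemy["vel"])

    # Angle between own velocity and the line of sight (how far we point off target).
    ata = np.arccos(np.clip(own["vel"] @ los_unit / max(own_speed, 1e-6), -1.0, 1.0))
    # Angle between enemy velocity and the line of sight back toward our aircraft.
    aa = np.arccos(np.clip(enemy["vel"] @ (-los_unit) / max(enemy_speed, 1e-6), -1.0, 1.0))
    # Positive closure rate means the distance is shrinking.
    closure_rate = -(los_unit @ (enemy["vel"] - own["vel"]))
    # NED z points down, so a smaller z means a higher altitude.
    altitude_diff = enemy["pos"][2] - own["pos"][2]  # >0 when own aircraft is higher

    return np.array([distance, ata, aa, closure_rate, altitude_diff,
                     own_speed, enemy_speed], dtype=np.float32)
```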
The motion of a six-degree-of-freedom UAV consists of translation and rotation. In translation, the state variables that change include the UAV's three-dimensional coordinates, its velocity components along the three axes, and its speed V. In rotation, the changing state variables include the roll angle φ, pitch angle θ, yaw angle ψ, roll rate p, pitch rate q, yaw rate r, angle of attack α (the angle between the projection of v onto the aircraft's plane of symmetry and the body axis), and sideslip angle β (the angle between v and the aircraft's plane of symmetry).
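Grouping the variables just listed by time scale, a minimal sketch of the resulting long-cycle and short-cycle state containers might look as follows (the Python dataclass layout and field names are our own illustration; only the grouping of variables follows the text).

```python
# Hypothetical split of the 6-DoF state into long-cycle (translational) and
# short-cycle (rotational) groups, following the time-scale separation idea.
from dataclasses import dataclass


@dataclass
class LongCycleState:
    """Slowly varying translational quantities (long cycle)."""
    x: float      # position coordinate (m)
    y: float      # position coordinate (m)
    z: float      # position coordinate (m)
    vx: float     # velocity component along x (m/s)
    vy: float     # velocity component along y (m/s)
    vz: float     # velocity component along z (m/s)
    v: float      # speed magnitude V (m/s)


@dataclass
class ShortCycleState:
    """Rapidly varying rotational quantities (short cycle)."""
    phi: float    # roll angle (rad)
    theta: float  # pitch angle (rad)
    psi: float    # yaw angle (rad)
    p: float      # roll rate (rad/s)
    q: float      # pitch rate (rad/s)
    r: float      # yaw rate (rad/s)
    alpha: float  # angle of attack (rad)
    beta: float   # sideslip angle (rad)
```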
References
Yu Huangchao, Niu Yifeng, Wang Xiangke. Stages of development of unmanned aerial vehicles[J]. National Defense Technology, 2021, 42(3): 18‒24. doi:10.13943/j.issn1671-4547.2021.03.03
Chen Hao, Huang Jian, Liu Quan, et al. Review and prospects of autonomous air combat maneuver decisions[J]. Control Theory & Applications, 2023, 40(12): 2104‒2129. doi:10.7641/CTA.2023.30210
Horie K, Conway B A. Optimal fighter pursuit-evasion maneuvers found via two-sided optimization[J]. Journal of Guidance, Control, and Dynamics, 2006, 29(1): 105‒112. doi:10.2514/1.3960
Zhou Siyu, Wu Wenhai, Kong Fan'e, et al. Improved multistage influence diagram maneuvering decision method based on stochastic decision criterions[J]. Transactions of Beijing Institute of Technology, 2013, 33(3): 296‒301. doi:10.3969/j.issn.1001-0645.2013.03.017
Smith R E, Dike B A, Mehra R K, et al. Classifier systems in combat: Two-sided learning of maneuvers for advanced fighter aircraft[J]. Computer Methods in Applied Mechanics and Engineering, 2000, 186(2/3/4): 421‒437. doi:10.1016/s0045-7825(99)00395-3
Chen Xia, Liu Min, Hu Yongxin. Study on UAV offensive/defensive game strategy based on uncertain information[J]. Acta Armamentarii, 2012, 33(12): 1510‒1515.
Wang Xuan, Wang Weijia, Song Kepu, et al. UAV air combat decision based on evolutionary expert system tree[J]. Ordnance Industry Automation, 2019, 38(1): 42‒47. doi:10.7690/bgzdh.2019.01.010
Zhang Hongpeng, Huang Changqiang, Xuan Yongbo, et al. Maneuver decision of autonomous air combat of unmanned combat aerial vehicle based on deep neural network[J]. Acta Armamentarii, 2020, 41(8): 1613‒1622. doi:10.3969/j.issn.1000-1093.2020.08.016
Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529‒533. doi:10.1038/nature14236
Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484‒489. doi:10.1038/nature16961
Berner C, Brockman G, Chan B, et al. Dota 2 with large scale deep reinforcement learning[EB/OL]. (2019‒12‒13)[2024‒06‒02]. doi:10.48550/arXiv.1912.06680
Zhang Xianbing, Liu Guoqing, Yang Chaojie, et al. Research on air confrontation maneuver decision-making method based on reinforcement learning[J]. Electronics, 2018, 7(11): 279. doi:10.3390/electronics7110279
Yang Qiming, Zhang Jiandong, Shi Guoqing, et al. Maneuver decision of UAV in short-range air combat based on deep reinforcement learning[J]. IEEE Access, 2020, 8: 363‒378. doi:10.1109/access.2019.2961426
Austin F, Carbone G, Falco M, et al. Automated maneuvering decisions for air-to-air combat[C]//Proceedings of the Guidance, Navigation and Control Conference. Monterey: AIAA, 1987: 659‒669. doi:10.2514/6.1987-2393
Li Yue, Han Wei, Wang Yongqing. Deep reinforcement learning with application to air confrontation intelligent decision-making of manned/unmanned aerial vehicle cooperative system[J]. IEEE Access, 2020, 8: 67887‒67898. doi:10.1109/access.2020.2985576
Pope A P, Ide J S, Mićović D, et al. Hierarchical reinforcement learning for air-to-air combat[C]//Proceedings of the 2021 International Conference on Unmanned Aircraft Systems (ICUAS). Athens: IEEE, 2021: 275‒284. doi:10.1109/icuas51884.2021.9476700
Li Yongfeng, Lyu Yongxi, Shi Jingping, et al. Autonomous maneuver decision of air combat based on simulated operation command and FRV-DDPG algorithm[J]. Aerospace, 2022, 9(11): 658. doi:10.3390/aerospace9110658
Chai Jiajun, Chen Wenzhang, Zhu Yuanheng, et al. A hierarchical deep reinforcement learning framework for 6-DOF UCAV air-to-air combat[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023, 53(9): 5417‒5429. doi:10.1109/tsmc.2023.3270444
Piao Haiyin, Sun Zhixiao, Meng Guanglei, et al. Beyond-visual-range air combat tactics auto-generation by reinforcement learning[C]//Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN). Glasgow: IEEE, 2020: 1‒8. doi:10.1109/ijcnn48605.2020.9207088
Sun Zhixiao, Piao Haiyin, Yang Zhen, et al. Multi-agent hierarchical policy gradient for Air Combat Tactics emergence via self-play[J]. Engineering Applications of Artificial Intelligence, 2021, 98: 104112. doi:10.1016/j.engappai.2020.104112
Berndt J. JSBSim: An open source flight dynamics model in C++[C]//Proceedings of the AIAA Modeling and Simulation Technologies Conference and Exhibit. Providence: AIAA, 2004: AIAA 2004‒4923. doi:10.2514/6.2004-4923
Buffington J M, Adams R J, Banda S S. Robust, nonlinear, high angle-of-attack control design for a supermaneuverable vehicle[C]//Proceedings of the Guidance, Navigation and Control Conference. Monterey: AIAA, 1993: 690‒700. doi:10.2514/6.1993-3774
Reiner J, Balas G J, Garrard W L. Flight control design using robust dynamic inversion and time-scale separation[J]. Automatica, 1996, 32(11): 1493‒1504. doi:10.1016/s0005-1098(96)00101-x
Li Yun, Ang K H, Chong G C Y. PID control system analysis and design[J]. IEEE Control Systems Magazine, 2006, 26(1): 32‒41. doi:10.1109/mcs.2006.1580152