Title (Chinese):

UAV Formation Control Technology Based on Reinforcement Learning

Name:

赵启 (Zhao Qi)

Student ID:

SX1903134

Confidentiality level:

Public

Thesis language:

Chinese (chi)

Discipline code:

081101

Discipline name:

Engineering - Control Science and Engineering - Control Theory and Control Engineering

Student type:

Master's

Degree:

Master of Engineering

Year of enrollment:

2019

University:

南京航空航天大学 (Nanjing University of Aeronautics and Astronautics)

School:

自动化学院 (College of Automation Engineering)

Major:

Control Science and Engineering

Research direction:

UAV formation control

First supervisor:

龚华军 (Gong Huajun)

First supervisor's affiliation:

自动化学院 (College of Automation Engineering)

Second supervisor:

甄子洋 (Zhen Ziyang)

Completion date:

2022-03-24

Defense date:

2022-03-19

Title (English):

Formation Control Technology of UAV Based on Reinforcement Learning

Keywords (Chinese):

UAV; formation control; reinforcement learning; deep Q-network; proximal policy optimization

Keywords (English):

Unmanned Aerial Vehicle; Formation Control; Reinforcement Learning; Deep Q-network; Proximal Policy Optimization

Abstract (Chinese):

With their wide exploration range and high mission completion rate, UAV formations have been widely applied in both military and civilian fields, and UAV formation control has become a research hotspot. Reinforcement learning, as a branch of artificial intelligence, has attracted wide attention in recent years. Aiming at the low degree of intelligence and the insufficient self-learning ability of UAV formations, this thesis applies reinforcement learning to the UAV formation control problem. The main contents are as follows:

First, based on the basic principles of reinforcement learning, the UAV formation problem is transformed into a Markov decision model and its elements are designed; the resulting state space, action space and reward function can be transplanted to future research. At the same time, under the leader-follower structure, lateral distance control of the UAV formation is realized with a value-based dueling Q-network algorithm combined with the proposed priority strategy and layered action library, and simulation verifies the effectiveness of the designed controller.

Then, the formation control problem is extended to two-dimensional formation control. A policy-based proximal policy optimization algorithm is used to design the formation controller, and the original algorithm is improved with generalized advantage estimation. To reduce the dimension of the state space without affecting the learning effect, the state space is combined with an integral compensation method, which effectively reduces its dimension. Simulation verifies the effectiveness of the designed controller.

Finally, both of the above methods are model-free reinforcement learning algorithms; although they can eventually achieve the formation control effect, they still require a large amount of training time. To address this, a neural network is used to fit the dynamics model, a model-based reinforcement learning algorithm is applied to the formation control problem, and the idea of model predictive control is combined to optimize the action sequence. The results show that, compared with the model-free methods, the model-based reinforcement learning method can greatly shorten the training time, verifying the effectiveness of the controller.

By introducing reinforcement learning into the formation control problem, the follower can learn through training the best policy for tracking the leader and maintaining the desired formation distance. This thesis is a useful exploration of applying reinforcement learning to UAV formation control and has good theoretical significance and practical application prospects.

Abstract (English):

With its advantages of wide exploration range and high mission completion rate, the UAV formation has been widely used in military and civilian applications, and UAV formation control has become a research hotspot. Reinforcement learning, as a branch of artificial intelligence, has attracted wide attention in recent years. In this paper, aiming at the low degree of intelligence and the insufficient self-learning ability of UAV formations, reinforcement learning is applied to UAV formation control, and formation keeping of UAVs is studied. The main contents are as follows:

Firstly, according to the basic principles of reinforcement learning, the UAV formation problem is transformed into a Markov decision model and the related elements are designed; the resulting state space, action space and reward function can be transplanted to future studies. At the same time, under the leader-follower structure, lateral distance control of the UAV formation is realized with the value-based Dueling Double Deep Q-Network algorithm, combined with the proposed priority strategy and layered action library. The effectiveness of the designed controller is verified by simulation.
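
As a rough illustration of this value-based branch, the sketch below shows a generic dueling Q-network in PyTorch together with a plausible formation-keeping MDP. The state, action and reward definitions, the layer sizes and the name `DuelingQNet` are illustrative assumptions, and the thesis's priority strategy and layered action library are not reproduced here.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling Q-network: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)               # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # advantage stream A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v = self.value(h)                               # (batch, 1)
        a = self.advantage(h)                           # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)      # (batch, n_actions)

# Hypothetical MDP elements for lateral formation keeping (not the thesis's exact design):
#   state  = [lateral separation error, its rate, leader-follower heading difference]
#   action = index into a discretised set of lateral commands (e.g. roll-angle increments)
#   reward = negative absolute separation error, with a bonus near the desired spacing
q_net = DuelingQNet(state_dim=3, n_actions=7)
q_values = q_net(torch.zeros(1, 3))   # greedy follower action: q_values.argmax(dim=1)
```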

Then, the formation control problem is extended to two-dimensional formation control. The formation controller is designed with the policy-based Proximal Policy Optimization algorithm, and the original algorithm is improved with generalized advantage estimation. In order to reduce the dimension of the state space without affecting the learning effect, the state space is combined with an integral compensation method, which effectively reduces its dimension. The effectiveness of the designed controller is verified by simulation.
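
The improvement described here relies on the standard PPO-clip objective combined with generalised advantage estimation (GAE). The minimal sketch below shows only those two generic pieces, under assumed discount and clipping hyperparameters, and omits the thesis-specific state design with integral compensation.

```python
import torch

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimation over a single rollout (no terminal handling)."""
    advantages = torch.zeros(len(rewards))
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual delta_t
        running = delta + gamma * lam * running               # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
        next_value = values[t]
    return advantages

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective (negated so it can be minimised)."""
    ratio = torch.exp(new_log_prob - old_log_prob)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(ratio * advantage, clipped).mean()
```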

Finally, both of the above methods are model-free reinforcement learning algorithms; although they can eventually achieve formation control, they still need a lot of time for training. In order to learn the policy more quickly, a neural network is used to fit the dynamics model, a model-based reinforcement learning algorithm is applied to the formation control problem, and the idea of model predictive control is combined to optimize the action sequence. The final results show that, compared with the model-free methods, the model-based reinforcement learning method can reduce the training time effectively while achieving the formation control effect.
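
A common way to realise this combination is to fit a one-step dynamics model with a neural network and then select actions by rolling candidate action sequences through the learned model, MPC-style. The sketch below uses the simple random-shooting variant of that idea; the network architecture, the `reward_fn` callback, the horizon and the action bounds are placeholder assumptions rather than the thesis's exact formulation.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Learned one-step dynamics: predicts the next state as s' = s + f(s, a)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return state + self.net(torch.cat([state, action], dim=-1))

def random_shooting_mpc(model, state, reward_fn, horizon=10, n_candidates=500, action_dim=2):
    """Sample candidate action sequences, roll them out through the learned model,
    and return the first action of the sequence with the highest predicted return."""
    with torch.no_grad():
        seqs = torch.rand(n_candidates, horizon, action_dim) * 2.0 - 1.0  # actions in [-1, 1]
        states = state.unsqueeze(0).repeat(n_candidates, 1)
        returns = torch.zeros(n_candidates)
        for t in range(horizon):
            states = model(states, seqs[:, t])
            returns += reward_fn(states)   # assumed: per-state reward of shape (n_candidates,)
        return seqs[returns.argmax(), 0]
```

Because every real transition can be reused to refit the dynamics network, far fewer environment steps are needed than in the model-free setting, which is consistent with the training-time reduction reported above.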

By introducing reinforcement learning into formation control, the follower can learn through training the best policy for tracking the leader and keeping the desired formation distance. This paper is a useful exploration of applying reinforcement learning to UAV formation control, which has good theoretical significance and practical application prospects.


CLC number:

V249

Accession number:

2022-003-0069

Open access date:

2022-09-24
