
Title (Chinese):

 基于强化学习的无人机避碰防撞技术研究 (Research on UAV Collision Avoidance Technology Based on Reinforcement Learning)

Name:

 曹红波

Student ID:

 SZ1903135

Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 085210

Discipline name:

 Engineering - Engineering - Control Engineering

Student type:

 Master's student

Degree:

 Master of Engineering

Year of enrollment:

 2019

University:

 南京航空航天大学 (Nanjing University of Aeronautics and Astronautics)

School/Department:

 自动化学院

Major:

 Control Science and Engineering

Research direction:

 UAV obstacle avoidance

First supervisor:

 甄子洋

First supervisor's affiliation:

 自动化学院

Completion date:

 2022-03-24

Defense date:

 2022-03-19

Title (English):

 Research on UAV Collision Avoidance Technology Based on Reinforcement Learning

Keywords (Chinese):

 无人机避障 (UAV obstacle avoidance); 深度强化学习 (deep reinforcement learning); 经验池分割 (experience pool segmentation); 多智能体 (multi-agent); 相关经验采样 (relevant experience sampling); 仿真系统 (simulation system)

Keywords (English):

 UAV obstacle avoidance; deep reinforcement learning; experience pool segmentation; multi-agent; relevant experience sampling; simulation system

Abstract (Chinese):

In fields such as reconnaissance and detection, power-line inspection and logistics delivery, the key to low-altitude UAV operations is ensuring the UAV's own safe flight. At present, intelligent autonomous obstacle avoidance is the mainstream means of achieving collision-free UAV flight. This thesis therefore explores the decision-making problem of UAV obstacle-avoidance flight in low-altitude airspace on the basis of reinforcement learning. The main research content is as follows:

Modeling and analysis of reinforcement-learning-based UAV obstacle avoidance. A UAV motion model is first established according to the obstacle-avoidance characteristics of reinforcement learning; dynamic and static obstacle models are then built for the characteristics of a mountain-forest environment and described by mathematical equations; finally, a three-dimensional continuous-space model suited to the UAV collision-avoidance flight task is constructed in this environment.

For the obstacle-avoidance flight task of a single UAV in an uncertain environment, a single-UAV obstacle-avoidance algorithm based on a double deep Q-network with experience-pool segmentation (S-DDQN) is designed. The experience pool is first split into positive and negative pools according to the reward of each obstacle-avoidance sample, which optimizes the sampling and training process; the state space, action space and reward function are then designed for the characteristics of the single-UAV obstacle-avoidance environment. After static obstacle avoidance is achieved, an environment that also contains dynamic obstacles is considered, and additional states and rewards are introduced based on the velocity obstacle method. Training results show that the improved algorithm has better training stability and faster training speed; test results show that, in both the static-obstacle environment and the environment with dynamic obstacles, the corresponding algorithm can choose UAV actions from the UAV state and complete the collision-free flight task.

For the multi-UAV collision-avoidance task, a multi-UAV collision-avoidance algorithm based on multi-agent deep deterministic policy gradient with relevant experience sampling (RES-MADDPG) is designed. The MADDPG algorithm is first adopted to address the training instability of multi-UAV reinforcement learning; relevance labels are then attached to the samples generated by the agents using the relevant-experience-sampling method, and when the algorithm runs, sampling and training are performed first according to the state label before actions are selected for the UAVs. The joint state, action space and joint reward function are also designed for the characteristics of the multi-UAV obstacle-avoidance environment, and the algorithm model is trained and tested in two task scenarios, with and without fixed target assignment. Training results show that the improved algorithm's structure of training first and then selecting actions significantly improves training speed and obstacle-avoidance success rate; test results show that the designed algorithm model can guide the UAVs to complete the multi-UAV collision-avoidance flight task in both scenarios.

A flight simulation system for UAV collision-avoidance tasks is built with PyQt5. It integrates the single-UAV and multi-UAV obstacle-avoidance algorithms, and simulation experiments on the system verify the feasibility of its single-UAV and multi-UAV collision-avoidance functions.

Abstract (English):

In the fields of reconnaissance and detection, power-line inspection and logistics delivery, the key to realizing low-altitude UAV operations is ensuring the UAV's own safe flight. At present, intelligent autonomous obstacle avoidance is the mainstream means of realizing collision-free UAV flight. Therefore, this thesis explores and studies the decision-making problem of UAV obstacle-avoidance flight in the low-altitude domain based on reinforcement learning. The main research content is as follows:

Modeling and analysis of UAV obstacle avoidance based on reinforcement learning. First, a UAV motion model is established based on the obstacle-avoidance characteristics of reinforcement learning; dynamic and static obstacle models are then built according to the characteristics of the mountain-forest environment and described by mathematical equations; finally, a three-dimensional continuous-space model suitable for the UAV collision-avoidance mission is established in this environment.
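As an illustration only (not the thesis's actual equations), a minimal Python sketch of the kind of point-mass motion update and spherical static-obstacle check such a three-dimensional continuous-space model could use is given below; the class names, time step and safety radius are assumptions made for this sketch.

import numpy as np
from dataclasses import dataclass

@dataclass
class Obstacle:
    center: np.ndarray   # obstacle center (x, y, z)
    radius: float        # safety radius around the obstacle

@dataclass
class UAVState:
    position: np.ndarray  # UAV position (x, y, z) in the continuous 3D workspace
    velocity: np.ndarray  # UAV velocity (vx, vy, vz)

def step(state, accel, dt=0.1):
    # Advance simple point-mass kinematics by one control interval.
    velocity = state.velocity + accel * dt
    position = state.position + velocity * dt
    return UAVState(position=position, velocity=velocity)

def in_collision(state, obstacles):
    # The UAV is in collision when it enters any obstacle's safety sphere.
    return any(np.linalg.norm(state.position - ob.center) < ob.radius for ob in obstacles)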

For the obstacle-avoidance mission of a single UAV in an uncertain environment, a single-UAV obstacle-avoidance algorithm based on a double deep Q-network with experience-pool segmentation (S-DDQN) is designed. Firstly, the experience pool is divided into positive and negative pools according to the reward of each obstacle-avoidance sample, which optimizes the sampling and training process. On this basis, the state space, action space and reward function of the algorithm model are designed according to the characteristics of the single-UAV obstacle-avoidance environment. After static obstacle avoidance is realized for the single UAV, an environment that also contains dynamic obstacles is further considered, and the corresponding states and rewards are supplemented based on the velocity obstacle method. The training results show that the improved S-DDQN algorithm has better training stability and faster training speed. The test results show that, in both the static-obstacle environment and the environment containing dynamic obstacles, the corresponding algorithm can select UAV actions from the UAV state and accomplish the collision-free flight mission.
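The experience-pool segmentation described above can be illustrated with the minimal sketch below, which stores transitions in separate positive and negative pools by reward sign and draws mixed mini-batches from both; the class name, capacity and mixing ratio are assumptions for illustration, not the thesis implementation.

import random
from collections import deque

class SplitReplayBuffer:
    # Positive and negative experience pools, split by the sign of the reward.
    def __init__(self, capacity=50000, positive_fraction=0.5):
        self.positive = deque(maxlen=capacity)   # e.g. safe progress toward the goal
        self.negative = deque(maxlen=capacity)   # e.g. collisions and near-misses
        self.positive_fraction = positive_fraction

    def add(self, state, action, reward, next_state, done):
        pool = self.positive if reward >= 0 else self.negative
        pool.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Mix both pools so rare negative (collision) samples are not drowned out.
        n_pos = min(int(batch_size * self.positive_fraction), len(self.positive))
        n_neg = min(batch_size - n_pos, len(self.negative))
        batch = random.sample(list(self.positive), n_pos) + random.sample(list(self.negative), n_neg)
        random.shuffle(batch)
        return batch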

For the collision-avoidance mission of multiple UAVs in an uncertain environment, a multi-UAV collision-avoidance algorithm based on multi-agent deep deterministic policy gradient with relevant experience sampling (RES-MADDPG) is designed. Firstly, the MADDPG algorithm is used to address the training instability of multi-UAV reinforcement learning. Then, using the relevant-experience-sampling method, relevance labels are attached to the samples generated by the agents; when the algorithm runs, sampling and training are performed first according to the state label, and actions are then selected for the UAVs. At the same time, the joint state, action space and joint reward function are designed according to the characteristics of the multi-UAV obstacle-avoidance environment, and the algorithm model is trained and tested in two task scenarios, with and without fixed target assignment. The training results show that the improved RES-MADDPG algorithm's structure of training first and then selecting actions significantly improves training speed and obstacle-avoidance success rate. The test results show that the designed algorithm model can guide the UAVs to complete the multi-UAV collision-avoidance flight mission in both the fixed-target-assignment and non-fixed-target-assignment scenarios.
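The relevant-experience-sampling idea can similarly be sketched as a replay buffer keyed by a coarse situation label, from which mini-batches are drawn first for the label matching the agents' current joint state; the label scheme, names and fallback rule below are assumptions, not the thesis code.

import random
from collections import defaultdict, deque

class LabeledReplayBuffer:
    # Experience pools keyed by a relevance label (e.g. "near_obstacle", "free_space").
    def __init__(self, capacity_per_label=20000):
        self.pools = defaultdict(lambda: deque(maxlen=capacity_per_label))

    def add(self, label, transition):
        self.pools[label].append(transition)

    def sample(self, batch_size, current_label):
        # Prefer transitions whose label matches the current joint state.
        relevant = list(self.pools[current_label])
        batch = random.sample(relevant, min(batch_size, len(relevant)))
        if len(batch) < batch_size:
            # Fall back to the remaining pools when too few relevant samples exist.
            others = [t for lbl, pool in self.pools.items() if lbl != current_label for t in pool]
            batch += random.sample(others, min(batch_size - len(batch), len(others)))
        return batch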

Based on PyQt5, a flight simulation system for UAV collision-avoidance missions is established. It integrates the single-UAV and multi-UAV obstacle-avoidance algorithms, and simulation experiments on the system verify the feasibility of the software's single-UAV and multi-UAV collision-avoidance functions.
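For orientation only, a minimal PyQt5 skeleton of such a front end might look like the sketch below; the window title, buttons and the run_single/run_multi placeholders are assumptions, not the thesis software.

import sys
from PyQt5.QtWidgets import QApplication, QMainWindow, QWidget, QPushButton, QVBoxLayout

class SimWindow(QMainWindow):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("UAV Collision Avoidance Simulation")
        single_btn = QPushButton("Run single-UAV avoidance")
        multi_btn = QPushButton("Run multi-UAV avoidance")
        single_btn.clicked.connect(self.run_single)
        multi_btn.clicked.connect(self.run_multi)
        layout = QVBoxLayout()
        layout.addWidget(single_btn)
        layout.addWidget(multi_btn)
        container = QWidget()
        container.setLayout(layout)
        self.setCentralWidget(container)

    def run_single(self):
        pass  # placeholder: load a trained single-UAV policy and animate a rollout

    def run_multi(self):
        pass  # placeholder: load a trained multi-UAV policy and animate a rollout

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = SimWindow()
    window.show()
    sys.exit(app.exec_())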


CLC number:

 V249

Call number:

 2022-003-0062

Open access date:

 2022-09-24
