
Title:

 Research on Industrial Security Video Classification of Company A Based on Multi-Modal Feature Fusion

Author:

 Zhang Yang

Student ID:

 SZ2209028    

Confidentiality Level:

 Public

Language:

 Chinese (chi)

Discipline Code:

 125603    

Discipline:

 Management - Engineering Management - Industrial Engineering and Management

Student Type:

 Master's

Degree:

 Master of Management

Year of Enrollment:

 2022    

University:

 Nanjing University of Aeronautics and Astronautics

Department:

 College of Economics and Management

Major:

 Industrial Engineering and Management (Professional Degree)

Supervisor:

 Luo Zhengjun

Supervisor's Affiliation:

 College of Economics and Management

Completion Date:

 2025-03-19    

Defense Date:

 2025-03-13    

Title (English):

 Research on Industrial Security Video Classification of Company A Based on Multi-Modal Feature Fusion

Keywords:

 Video Classification; Keyframe Extraction; Genetic Algorithm; Multimodal Feature Fusion; Attention Mechanism

Keywords (English):

 Video Classification; Keyframe Extraction; Genetic Algorithm; Multimodal Feature Fusion; Attention Mechanism

Abstract:

With the rapid development of artificial intelligence technology, the application of multimodal data fusion in industrial security has drawn growing attention. Industrial parks, as sites of intensive production activity, have safety monitoring needs that go beyond single visual or audio information and call for efficient analysis and classification based on multimodal data. Traditional single-modal video analysis methods, however, face the following problems: first, video contains redundant data, which drives up storage and processing costs and, especially in long-duration monitoring scenarios, makes key-event extraction inefficient; second, the representational capacity of single-modal data is limited and cannot satisfy event classification requirements in complex scenes. Against this background, this thesis studies an industrial security video classification method based on multimodal feature fusion. The main research work is as follows:

First, to improve the efficiency and accuracy of video analysis, this thesis proposes a keyframe extraction algorithm based on a genetic algorithm, which automatically selects representative and discriminative frames from the video, reduces redundant information, and thereby streamlines the processing of the video data. The genetic algorithm evaluates the quality of each candidate keyframe set with a fitness function and progressively refines the keyframe selection through genetic operations (selection, crossover, and mutation).
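As an illustration only, a minimal sketch of such a genetic-algorithm keyframe selector might look like the following; the diversity-based fitness function, population size, and rates here are assumptions for demonstration, not the thesis's actual implementation.

 # Illustrative GA-based keyframe selection (assumed fitness and parameters, not the thesis's code).
 import numpy as np

 rng = np.random.default_rng(0)

 def fitness(individual, features):
     # Score a candidate keyframe set: mean pairwise L2 distance of the selected frames,
     # so that more diverse (less redundant) frame sets score higher.
     sel = features[individual]
     diffs = sel[:, None, :] - sel[None, :, :]
     return np.sqrt((diffs ** 2).sum(-1)).mean()

 def select_keyframes(features, k=8, pop_size=30, generations=50,
                      crossover_rate=0.8, mutation_rate=0.1):
     n = len(features)
     # Each individual is a set of k distinct frame indices.
     pop = [np.sort(rng.choice(n, size=k, replace=False)) for _ in range(pop_size)]
     for _ in range(generations):
         scores = np.array([fitness(ind, features) for ind in pop])
         # Selection: keep the fitter half of the population as parents.
         parents = [pop[i] for i in np.argsort(scores)[::-1][:pop_size // 2]]
         children = []
         while len(parents) + len(children) < pop_size:
             a, b = rng.choice(len(parents), size=2, replace=False)
             child = parents[a].copy()
             if rng.random() < crossover_rate:          # crossover: mix indices of two parents
                 mask = rng.random(k) < 0.5
                 child[mask] = parents[b][mask]
             if rng.random() < mutation_rate:           # mutation: replace one index at random
                 child[rng.integers(k)] = rng.integers(n)
             uniq = np.unique(child)                    # repair: keep k distinct indices
             while len(uniq) < k:
                 uniq = np.unique(np.append(uniq, rng.integers(n)))
             children.append(np.sort(uniq))
         pop = parents + children
     return max(pop, key=lambda ind: fitness(ind, features))

 # Usage: features would normally be per-frame descriptors (e.g., color histograms
 # or CNN embeddings); random data stands in for them here.
 frame_features = rng.normal(size=(500, 128))
 print(select_keyframes(frame_features, k=8))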

Second, in the multimodal feature fusion stage, this thesis proposes a multi-stage fusion framework that fuses features from the visual and audio modalities. By introducing a multi-stage strategy of frame-level, sequence-level, and global fusion, the model effectively captures the temporal information in the video as well as cross-modal dependencies. In the frame-level fusion stage, a cross-attention mechanism performs fine-grained mutual enhancement between visual and audio features; in the sequence-level fusion stage, an LSTM combined with a self-attention mechanism models temporal information and captures long-range dependencies; finally, in the global fusion stage, cross-modal features are modeled globally to generate a unified multimodal representation for the final classification task.
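For illustration, the three-stage idea could be sketched in PyTorch roughly as follows; the layer dimensions, head counts, and the average pooling used for global fusion are assumptions, and the thesis's actual architecture may differ.

 # Illustrative PyTorch sketch of frame-level, sequence-level, and global fusion
 # (dimensions, head counts, and pooling are assumptions, not the thesis's exact model).
 import torch
 import torch.nn as nn

 class MultiStageFusion(nn.Module):
     def __init__(self, vis_dim=512, aud_dim=128, d_model=256, num_classes=5):
         super().__init__()
         self.vis_proj = nn.Linear(vis_dim, d_model)
         self.aud_proj = nn.Linear(aud_dim, d_model)
         # Frame-level fusion: cross-attention in both directions.
         self.v2a = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
         self.a2v = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
         # Sequence-level fusion: LSTM followed by self-attention over time.
         self.lstm = nn.LSTM(2 * d_model, d_model, batch_first=True, bidirectional=True)
         self.self_attn = nn.MultiheadAttention(2 * d_model, num_heads=4, batch_first=True)
         # Global fusion: pool over time and classify.
         self.classifier = nn.Sequential(
             nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_classes))

     def forward(self, vis, aud):
         # vis: (B, T, vis_dim) per-keyframe visual features; aud: (B, T, aud_dim) audio features.
         v = self.vis_proj(vis)
         a = self.aud_proj(aud)
         # Frame-level: each modality attends to the other (cross-attention).
         v_enh, _ = self.a2v(query=v, key=a, value=a)   # visual enhanced by audio
         a_enh, _ = self.v2a(query=a, key=v, value=v)   # audio enhanced by visual
         fused = torch.cat([v + v_enh, a + a_enh], dim=-1)   # (B, T, 2*d_model)
         # Sequence-level: LSTM for temporal modeling, self-attention for long-range dependencies.
         seq, _ = self.lstm(fused)                            # (B, T, 2*d_model)
         seq, _ = self.self_attn(seq, seq, seq)
         # Global fusion: temporal average pooling gives one multimodal video representation.
         video_repr = seq.mean(dim=1)
         return self.classifier(video_repr)

 # Usage with dummy tensors: 2 videos, 8 keyframes each.
 model = MultiStageFusion()
 logits = model(torch.randn(2, 8, 512), torch.randn(2, 8, 128))
 print(logits.shape)  # torch.Size([2, 5])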

Experimental results on the real-world video dataset provided by Company A show that the proposed method achieves a significant performance improvement on the security video classification task. Compared with traditional single-modal classification methods, the multimodal method that fuses visual and audio features better captures the complex information in the video and the complementarity between modalities, and the attention-based feature fusion method also learns the interactions between modalities more fully than simple concatenation.

Abstract (English):

With the rapid development of artificial intelligence technology, the application of multi-modal data fusion in the field of industrial security has received increasing attention. As sites of intensive production activity, industrial parks have safety monitoring needs that are not limited to single visual or audio information but require multi-modal data for efficient analysis and classification. However, traditional single-modal video analysis methods face the following problems: first, surveillance video contains redundant data, which makes storage and processing costly and key-event extraction inefficient, especially in long-term surveillance scenarios; second, the representational capacity of single-modal data is limited, making it difficult to meet the needs of event classification in complex scenes. Based on this, this thesis focuses on an industrial security video classification method based on multi-modal feature fusion. The main research work is as follows:

First, in order to improve the efficiency and accuracy of video analysis, this thesis proposes a keyframe extraction algorithm based on a genetic algorithm, which automatically selects representative and discriminative frames from the video to reduce redundant information and optimize the processing of video data. The genetic algorithm evaluates the quality of each candidate keyframe set with a fitness function and iteratively optimizes the keyframe selection through genetic operations (selection, crossover, and mutation).

Second, in the multi-modal feature fusion phase, this thesis proposes a multi-stage fusion framework that fuses features from the visual and audio modalities. By introducing frame-level, sequence-level, and global fusion, the model can effectively capture the temporal information and cross-modal dependencies in the video. In the frame-level fusion stage, a cross-attention mechanism provides fine-grained mutual enhancement of visual and audio features; in the sequence-level fusion stage, temporal information is modeled with an LSTM and a self-attention mechanism; finally, in the global fusion stage, cross-modal features are modeled globally to generate a unified multi-modal representation for the final classification task.

Experimental results on the real-world video dataset provided by Company A show that the proposed method achieves a significant performance improvement on security video classification tasks. Compared with traditional single-modal classification methods, the multi-modal method that fuses visual and audio features better captures the complex information in the video and the complementarity between modalities, and the attention-based feature fusion method learns the interactions between modalities more fully than simple concatenation.


CLC Number:

 TP391    

Accession Number:

 2025-009-0223    

Open Access Date:

 2025-09-30    
