Title: | Research on Industrial Security Video Classification of Company A Based on Multi-Modal Feature Fusion |
Author: | |
Student ID: | SZ2209028 |
Confidentiality Level: | Public |
Language: | Chinese |
Discipline Code: | 125603 |
Discipline: | Management - Engineering Management - Industrial Engineering and Management |
Student Type: | Master's candidate |
Degree: | Master of Management |
Year of Enrollment: | 2022 |
University: | Nanjing University of Aeronautics and Astronautics |
Department: | |
Major: | |
Supervisor: | |
Supervisor's Affiliation: | |
Date of Completion: | 2025-03-19 |
Date of Defense: | 2025-03-13 |
Title (English): | Research on Industrial Security Video Classification of Company A Based on Multi-Modal Feature Fusion |
Keywords: | |
Keywords (English): | Video Classification; Keyframe Extraction; Genetic Algorithm; Multimodal Feature Fusion; Attention Mechanism |
Abstract: |
With the rapid development of artificial intelligence technology, the application of multi-modal data fusion in industrial security has attracted growing attention. As sites of intensive production activity, industrial parks have security monitoring needs that go beyond single visual or audio information and call for combining multi-modal data for efficient analysis and classification. Traditional single-modal video analysis methods, however, face two problems: first, surveillance video contains substantial redundant data, which drives up storage and processing costs and makes key-event extraction inefficient, especially over long recordings; second, a single modality carries limited information and cannot meet the needs of event classification in complex scenes. This thesis therefore studies industrial security video classification based on multi-modal feature fusion. The main research work is as follows:

First, to improve the efficiency and accuracy of video analysis, this thesis proposes a keyframe extraction algorithm based on a genetic algorithm, which automatically selects representative and discriminative frames and reduces redundant information, thereby streamlining video data processing. The genetic algorithm evaluates the quality of each candidate keyframe set with a fitness function and iteratively refines the selection through genetic operations (selection, crossover, and mutation).

Second, in the multi-modal feature fusion stage, this thesis proposes a multi-stage fusion framework that fuses features from the visual and audio modalities. Through a multi-stage strategy of frame-level, sequence-level, and global fusion, the model effectively captures temporal information in the video as well as cross-modal dependencies. In the frame-level fusion stage, a cross-attention mechanism provides fine-grained mutual enhancement of visual and audio features; in the sequence-level fusion stage, an LSTM combined with self-attention models temporal structure to capture long-range dependencies; finally, in the global fusion stage, cross-modal features are modeled jointly to generate a unified multi-modal representation for the final classification task.

Experiments on a real-world surveillance video dataset provided by Company A show that the proposed method achieves a significant performance improvement on security video classification. Compared with traditional single-modal classification methods, the multi-modal method fusing visual and audio features better captures the complex information in video and the complementarity between modalities, and the attention-based feature fusion learns cross-modal interactions more fully than methods such as simple concatenation. |
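The abstract describes the genetic-algorithm keyframe selector only at a high level. As an illustrative sketch, and not the thesis's actual design (the fitness terms, the diversity weight, and all operator parameters here are assumptions), a minimal version of the select/crossover/mutate loop over candidate keyframe index sets might look like:

```python
import random

def _dist(a, b):
    """Euclidean distance between two frame feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fitness(candidate, frames, diversity_weight=0.1):
    """Score a candidate keyframe set: reward coverage (every frame lies near
    some selected keyframe) plus diversity (selected frames differ from one
    another). Both terms and the weight are illustrative assumptions."""
    coverage = -sum(min(_dist(f, frames[k]) for k in candidate) for f in frames)
    ks = sorted(candidate)
    diversity = sum(_dist(frames[a], frames[b])
                    for i, a in enumerate(ks) for b in ks[i + 1:])
    return coverage + diversity_weight * diversity

def genetic_keyframes(frames, k=3, pop_size=20, generations=30,
                      mut_rate=0.3, seed=0):
    """Evolve a set of k distinct keyframe indices."""
    rng = random.Random(seed)
    n = len(frames)
    pop = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, frames), reverse=True)
        elite = pop[: pop_size // 2]                 # selection: keep the fitter half
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            pool = list(set(p1) | set(p2))
            child = rng.sample(pool, k)              # crossover: mix parent indices
            if rng.random() < mut_rate:              # mutation: swap one index
                child.remove(rng.choice(child))
                child.append(rng.choice([i for i in range(n) if i not in child]))
            children.append(child)
        pop = elite + children
    return sorted(max(pop, key=lambda c: fitness(c, frames)))
```

Given one feature vector per decoded frame (e.g., a color histogram or CNN embedding), `genetic_keyframes(feats, k=3)` returns three distinct frame indices; the elitist selection keeps the best candidate across generations, so the fitness of the returned set never decreases.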
Abstract (English): |
With the rapid development of artificial intelligence technology, the application of multi-modal data fusion in the field of industrial security has attracted increasing attention. As sites of intensive production activity, industrial parks have safety monitoring needs that are not limited to single visual or audio information but require combining multi-modal data for efficient analysis and classification. However, traditional single-modal video analysis methods face the following problems: firstly, video contains redundant data, so storage and processing costs are high and key-event extraction is inefficient, especially in long-term surveillance scenarios; secondly, the representational capacity of single-modal data is limited, making it difficult to meet the needs of event classification in complex scenes. Based on this, this thesis focuses on an industrial security video classification method based on multi-modal feature fusion. The main research work is as follows:

Firstly, in order to improve the efficiency and accuracy of video analysis, this thesis proposes a keyframe extraction algorithm based on a genetic algorithm, which automatically selects representative and discriminative frames in the video to reduce redundant information and streamline video data processing. The genetic algorithm evaluates the quality of each candidate keyframe set with a fitness function and refines the selection step by step through genetic operations (selection, crossover, and mutation).

Secondly, in the multi-modal feature fusion phase, this thesis proposes a multi-stage fusion framework that fuses features from the visual and audio modalities. By introducing frame-level, sequence-level, and global fusion, the model can effectively capture the temporal information and cross-modal dependencies in video. In the frame-level fusion stage, a cross-attention mechanism enables fine-grained mutual enhancement of visual and audio features; in the sequence-level fusion stage, temporal information is modeled with an LSTM and a self-attention mechanism to capture long-range dependencies; finally, in the global fusion stage, cross-modal features are modeled globally to generate a unified multi-modal representation for the final classification task.

Experimental results on the real-world video dataset provided by Company A show that the proposed method achieves a significant performance improvement on security video classification tasks. Compared with traditional single-modal classification methods, the multi-modal method fusing visual and audio features better captures the complex information in video and the complementarity among modalities, and the attention-based feature fusion learns the interaction between modalities more fully than simple concatenation. |
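The frame-level cross-attention stage is likewise described only conceptually. The sketch below (plain Python lists, no deep-learning framework; the learned query/key/value projection matrices of a real attention layer are deliberately omitted, which is a simplification of the thesis's model) shows the core idea: each visual frame feature attends over the audio features, and the attended audio vector is concatenated back onto the visual feature to form the fused frame-level sequence consumed by the later stages:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query (a visual frame
    feature) attends over the keys/values (audio features), yielding an
    audio-informed vector aligned with the visual timeline."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in keys]
        weights = softmax(scores)          # attention distribution over audio
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

def frame_level_fuse(visual, audio):
    """Concatenate each visual frame feature with its audio-attended
    counterpart; the sequence-level stage (LSTM + self-attention) and the
    global fusion stage would consume this fused sequence."""
    attended = cross_attention(visual, audio, audio)
    return [v + a for v, a in zip(visual, attended)]
```

Because the attention weights form a convex combination, each fused audio component stays within the range of the audio features, and a visual frame whose feature aligns with a particular audio frame receives the largest weight on that frame.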
CLC Number: | TP391 |
Accession Number: | 2025-009-0223 |
Date of Public Release: | 2025-09-30 |