Spatio-temporal action detection in videos requires localizing the action both spatially and temporally in the form of an "action tube". Nowadays, most spatio-temporal action detection datasets (e.g. UCF101-24, AVA, DALY) are annotated with action tubes that contain a single person performing the action, so the predominant action detection models simply employ a person detection and tracking pipeline for localization. However, when the action is defined as an interaction between multiple objects, such methods may fail, since each bounding box in the action tube contains multiple objects instead of one person. In this paper, we study the spatio-temporal action detection problem with multi-object interaction. We introduce a new dataset that is annotated with action tubes containing multi-object interactions. Moreover, we propose an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously. Our spatial regression may enclose multiple objects participating in the action. At test time, we connect the regressed bounding boxes within the predicted temporal duration using a simple heuristic. We report baseline results of our proposed model on this new dataset, and also show competitive results on the standard benchmark UCF101-24 using only RGB input.
Title: Spatio-Temporal Action Detection with Multi-Object Interaction
Authors: Huijuan Xu, Lizhi Yang, Stan Sclaroff, Kate Saenko, Trevor Darrell
Paper: https://arxiv.org/abs/2004.00180
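The linking step the abstract describes ("connect the regressed bounding boxes within the predicted temporal duration using a simple heuristic") can be illustrated with a minimal sketch. The function name and data layout below are hypothetical illustrations, not the paper's actual heuristic: we assume one regressed box per frame and a predicted temporal interval, and simply string the boxes inside that interval together into a tube.

```python
# Illustrative sketch only (hypothetical names, not the paper's code):
# link per-frame regressed boxes into an "action tube" spanning the
# predicted temporal duration [t_start, t_end].

def link_action_tube(boxes_per_frame, t_start, t_end):
    """boxes_per_frame: list of (x1, y1, x2, y2) boxes, one per frame.
    Returns the tube as a list of (frame_index, box) pairs covering
    every frame in the predicted interval, inclusive."""
    return [(t, boxes_per_frame[t]) for t in range(t_start, t_end + 1)]

# Example: 5 frames, with the action predicted to span frames 1..3.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12),
         (3, 3, 13, 13), (4, 4, 14, 14)]
tube = link_action_tube(boxes, 1, 3)
```

Because the model regresses a single box per frame that may enclose multiple interacting objects, no per-object identity tracking is needed at this stage, which is what makes such a simple linking rule plausible.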