Human action recognition has been widely deployed in safety-critical domains closely tied to human life and property, including autonomous driving, surveillance systems, and medical diagnostics. These systems are exposed to adversarial attacks, which can severely endanger lives and assets. A key puzzle is that carefully crafted adversarial examples may succeed against the model they were crafted on, yet often fail when transferred to other models, including defended ones. Why is adversarial transferability so limited? Our latest research points to an answer: the smoothness of the surrogate model's loss surface is key to adversarial transferability. Experiments demonstrate that the weak transferability of attacks crafted on 3D action recognition surrogate models stems from their excessively rugged loss landscapes, which confine attack effectiveness to specific models. This finding offers a new theoretical perspective on adversarial attacks and provides a basis for evaluating and building cross-scenario, generalizable AI defense systems.
Building on this finding, our research paper "TASAR: Transfer-based Attack on Skeletal Action Recognition" has been accepted by the International Conference on Learning Representations (ICLR). Our faculty member Yunfeng Diao is the first author, and undergraduate student Boqi Wu is the second author. The work is a collaboration among researchers from our school, the Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing Academy of Artificial Intelligence (BAAI), Beihang University, and University College London (UCL).
This study is the first to systematically investigate the adversarial transferability of 3D action recognition models from the perspective of loss landscape smoothness. We propose a post-train Dual Bayesian optimization strategy that smooths the surrogate model's loss landscape, and we incorporate temporal motion gradients into the attack, significantly improving attack transferability. The method is effective in both black-box and white-box settings and against multiple defense mechanisms, and it establishes a new benchmark for adversarial robustness research.
Paper Title: TASAR: Transfer-based Attack on Skeletal Action Recognition
Authors: Yunfeng Diao; Boqi Wu; Ruixuan Zhang; Ajian Liu; Xiaoshuai Hao; Xingxing Wei; Meng Wang; He Wang
Paper Link: https://arxiv.org/abs/2409.02483


Figure 1: A high-level illustration of our proposed method. Results marked with a ‘check mark’ (√) indicate superior performance compared to those marked with a ‘cross’ (×). Spatial attack: treats each frame independently. Spatial-temporal attack: integrates temporal motion gradients to disrupt the spatial-temporal coherence of S-HAR models.
Figure 2: Comparison of loss landscapes of trained models. The x and y axes correspond to two random direction vectors sampled from a Gaussian distribution; the model’s parameters are perturbed along these two directions to assess the sensitivity of the loss function. The z axis represents the loss value. More details can be found in Li et al. (2018). BA denotes the Bayesian Attack proposed by Li et al. (2023), PB denotes post-train Bayesian optimization, and P-DB denotes the improved post-train Dual Bayesian optimization. The loss landscape obtained with post-train Dual Bayesian optimization is significantly smoother than those of vanilla post-train Bayesian optimization and the baseline methods. More visualizations can be found in Appendix C.
Table 1: The attack success rate (%) of untargeted transfer-based attacks on NTU60 and NTU120. ‘Ave’ is the average transfer success rate over all target models except the surrogate. ‘SFormer’ represents SkateFormer, and MI stands for MI-FGSM.
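The ‘Ave’ column described in the Table 1 caption can be illustrated with a short sketch. The model names and success rates below are hypothetical placeholders chosen for the example, not results from the paper, and the function name `average_transfer_rate` is our own.

```python
from statistics import mean


def average_transfer_rate(success_rates, surrogate):
    """Mean attack success rate (%) over all target models except the surrogate."""
    targets = [rate for model, rate in success_rates.items() if model != surrogate]
    if not targets:
        raise ValueError("need at least one target model besides the surrogate")
    return mean(targets)


# Hypothetical success rates (%) for adversarial examples crafted on "ModelA".
rates = {"ModelA": 99.0, "ModelB": 20.0, "ModelC": 40.0}

# The surrogate's own (white-box) rate is excluded from the average.
ave = average_transfer_rate(rates, surrogate="ModelA")
print(ave)
```

Excluding the surrogate keeps the average a measure of transfer only, since the success rate on the surrogate itself reflects a white-box attack.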
Abstract: Skeletal sequence data, as a widely employed representation of human actions, are crucial in Human Activity Recognition (HAR). Recently, adversarial attacks have been proposed in this area, which expose potential security concerns and, more importantly, provide a good tool for testing model robustness. Within this research, the transfer-based attack is an important tool as it mimics the real-world scenario where an attacker has no knowledge of the target model, but it is underexplored in Skeleton-based HAR (S-HAR). Consequently, existing S-HAR attacks exhibit weak adversarial transferability and the reason remains largely unknown. In this paper, we investigate this phenomenon via the characterization of the loss function. We find that one prominent indicator of poor transferability is the low smoothness of the loss function. Led by this observation, we improve the transferability by properly smoothening the loss when computing the adversarial examples. This leads to the first Transfer-based Attack on Skeletal Action Recognition, TASAR. TASAR explores the smoothened model posterior of pre-trained surrogates, which is achieved by a new post-train Dual Bayesian optimization strategy. Furthermore, unlike existing transfer-based methods which overlook the temporal coherence within sequences, TASAR incorporates motion dynamics into the Bayesian attack, effectively disrupting the spatial-temporal coherence of S-HARs. For exhaustive evaluation, we build the first large-scale robust S-HAR benchmark, comprising 7 S-HAR models, 10 attack methods, 3 S-HAR datasets and 2 defense models. Extensive results demonstrate the superiority of TASAR. Our benchmark enables easy comparisons for future studies, with the code available at https://github.com/yunfengdiao/Skeleton-Robustness-Benchmark.
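The loss-landscape characterization mentioned in the abstract, and visualized as described in the Figure 2 caption, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the "model" is a toy linear regressor standing in for an S-HAR classifier, the function `loss_surface` and its parameters (`span`, `steps`, `seed`) are our own names, and the sketch omits the direction normalization used in Li et al. (2018).

```python
import numpy as np


def loss_surface(loss_fn, theta, span=1.0, steps=21, seed=0):
    """Evaluate loss_fn on a 2D slice of parameter space around theta.

    Two random Gaussian directions d1, d2 are sampled, and the loss is
    computed at theta + a * d1 + b * d2 for a grid of offsets (a, b).
    """
    rng = np.random.default_rng(seed)
    d1 = rng.standard_normal(theta.shape)
    d2 = rng.standard_normal(theta.shape)
    alphas = np.linspace(-span, span, steps)  # x axis offsets
    betas = np.linspace(-span, span, steps)   # y axis offsets
    surface = np.empty((steps, steps))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(theta + a * d1 + b * d2)  # z axis: loss
    return alphas, betas, surface


# Toy stand-in for a trained model: a linear regressor at its optimum.
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((64, 5))
w_trained = data_rng.standard_normal(5)
y = X @ w_trained


def mse_loss(w):
    """Mean squared error of the linear model with weights w."""
    return float(np.mean((X @ w - y) ** 2))


alphas, betas, surface = loss_surface(mse_loss, w_trained)
print(surface.shape, surface.min(), surface.max())
```

Plotting `surface` over the `alphas` and `betas` grid (for example as a 3D surface) gives a picture of the kind shown in Figure 2; a smoother surrogate would show fewer sharp ridges and valleys on such a slice.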
The International Conference on Learning Representations (ICLR) is widely regarded as one of the most authoritative and influential conferences in machine learning. It is recognized as a Grade-A conference in computer science by both the Chinese Association for Artificial Intelligence (CAAI) and Tsinghua University, and research accepted at ICLR typically represents the highest scholarly standard in the field.