Training-time Vulnerabilities in Deep Learning
Trojaned software operates as intended until the introduction of a trigger: a specific input pattern, crafted by an adversary, that alters the system's behavior. When a software system relies on a machine-learned model, such as a deep neural network, Trojans can be especially difficult to detect.
A computer vision system, for example, may perform the intended task of classifying objects in full-motion video until a specific pattern is introduced that causes pre-specified errors, as illustrated in Figure 1. The effect of a Trojan on a reinforcement learning algorithm that’s been trained to play an arcade boxing game is shown in Figure 2.
In general, Trojan attacks aimed at artificial intelligence (AI) can be engineered through subtle manipulation of training datasets or through direct modification of system architecture.
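To make the dataset-manipulation route concrete, here is a minimal sketch of a classic poisoning attack on an image classifier: stamp a small trigger patch on a fraction of the training images and relabel them to the attacker's target class. The function name, patch shape, and parameters are illustrative assumptions, not the specific attack used in any particular system.

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_frac=0.1, seed=0):
    """Stamp a 3x3 white-square trigger on a random fraction of
    training images and relabel them to the attacker's target class.

    images: float array of shape (N, H, W), values in [0, 1]
    labels: int array of shape (N,)
    Returns poisoned copies plus the indices that were modified.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()
    n_poison = int(len(images) * poison_frac)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Trigger: a solid patch in the bottom-right corner of each image
    images[idx, -3:, -3:] = 1.0
    # Mislabel the triggered examples so the model learns the association
    labels[idx] = target_label
    return images, labels, idx

# Toy example: poison 10% of a blank 100-image dataset toward class 7
imgs = np.zeros((100, 28, 28))
lbls = np.arange(100) % 10
p_imgs, p_lbls, idx = poison_dataset(imgs, lbls, target_label=7)
```

A model trained on such data behaves normally on clean inputs but learns to associate the patch with the target class, which is why curation alone struggles to catch it: only a small fraction of examples are touched.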
Detecting and Mitigating AI Trojans
Trojans can be easily embedded in deep networks but cannot yet be reliably prevented or detected. Although developers may attempt to curate and clean training data, AI datasets are often too large and complex to authenticate completely. Sourcing machine-learned models from trusted partners may one day prevent manipulation, but governments and industry are still working toward securing AI supply chains. Automated methods for detecting Trojans are well established for traditional software, but they remain in their infancy for intelligent systems built on machine-learned components.
Establishing a Testbed for Detection Research
Our research team, led by Dr. Kiran Karra, has been working with the Intelligence Advanced Research Projects Activity (IARPA) on their program, Trojans in Artificial Intelligence (TrojAI), to accelerate research and development of automated detection methods capable of analyzing a range of deep networks for evidence of Trojans.
A key challenge preventing rapid progress was the inability to generate a wide variety of Trojaned models alongside paired clean models for controlled experiments. Enabling controlled experimentation for systems that use multiple data modalities (e.g., images, text, autonomous agents) and complex architectures was particularly challenging for researchers.
The team has developed a set of tools to explore various parameters and configurations of Trojaned models and to evaluate emerging detection methods in controlled, repeatable experiments that compare different approaches. Experiments conducted to date have shown that the nature of the trigger, the training batch size, and the dataset poisoning percentage all affect the likelihood that a particular Trojan attack will succeed.
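Experiments like these typically quantify whether an attack "took" with an attack success rate: the fraction of inputs that the model maps to the attacker's target class once the trigger is applied. A minimal sketch of that metric follows; the function and argument names are illustrative assumptions, not the TrojAI tooling's actual API.

```python
import numpy as np

def attack_success_rate(predict, clean_inputs, apply_trigger, target_label):
    """Measure how often triggered inputs are classified as the
    attacker's target class.

    predict: callable mapping a batch of inputs to predicted labels
    clean_inputs: list of input arrays without the trigger
    apply_trigger: callable that stamps the trigger onto one input
    """
    triggered = np.stack([apply_trigger(x) for x in clean_inputs])
    preds = predict(triggered)
    return float(np.mean(preds == target_label))
```

Sweeping this metric while varying the trigger pattern, batch size, or poisoning percentage yields the kind of controlled comparison described above.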
Complementing their work on the TrojAI project, the team is researching a broader range of adversarial vulnerabilities and how they can manifest in deep networks structurally and behaviorally. Ultimately, our research aims to develop effective defenses against these vulnerabilities, including methods for removing Trojans from trained networks.
Transition to NIST and Open Source Tools
We have released an open-source set of tools to support research in detecting and defending against Trojan attacks by helping researchers generate datasets and accompanying models with embedded Trojans. We have transitioned the broader software framework to the National Institute of Standards and Technology (NIST) to aid in large-scale test and evaluation of Trojan detection algorithms. Progress against increasingly challenging Trojan detection scenarios can be tracked on NIST's public leaderboard. The APL team continues to investigate Trojan vulnerabilities in the context of other modalities and architectures.