TrojAI

Intelligent systems increasingly depend on vast, publicly available datasets and open-source software. These trends bring with them a new set of challenges for assuring artificial intelligence: backdoor, or Trojan, attacks. At the Johns Hopkins Applied Physics Laboratory’s Intelligent Systems Center, researchers are working to recognize and defend against this new breed of attack, which exploits unique vulnerabilities introduced by deep neural networks and other emerging artificial intelligence techniques.

The Challenge

Training-time Vulnerabilities in Deep Learning

Trojaned software operates as intended until the introduction of a trigger: a specific input, inserted by an adversary, that alters the system's behavior. When a software system incorporates a machine-learned model, such as a deep neural network, Trojans can be especially difficult to detect.

A computer vision system, for example, may perform the intended task of classifying objects in full-motion video until a specific pattern is introduced that causes pre-specified errors, as illustrated in Figure 1. The effect of a Trojan on a reinforcement learning algorithm that’s been trained to play an arcade boxing game is shown in Figure 2.

In general, Trojan attacks aimed at artificial intelligence (AI) can be engineered through subtle manipulation of training datasets or through direct modification of system architecture.
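To make the data-poisoning route concrete, the sketch below stamps a small trigger patch onto a random fraction of training images and relabels them with an attacker-chosen class; a model trained on this set behaves normally on clean inputs but misclassifies triggered ones. This is a minimal illustration, assuming `images` is an (N, H, W, C) float array scaled to [0, 1]; the patch size, placement, and `target_label` are arbitrary assumptions, not details of any particular attack.

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_frac=0.1, seed=0):
    """Illustrative data-poisoning Trojan: stamp a small white square
    (the trigger) onto a random fraction of training images and flip
    their labels to the attacker's target class. Patch size, placement,
    and target label are arbitrary choices for demonstration only."""
    images, labels = images.copy(), labels.copy()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(poison_frac * len(images)),
                     replace=False)
    for i in idx:
        images[i, -6:-2, -6:-2, :] = 1.0  # 4x4 trigger patch near a corner
        labels[i] = target_label          # model learns: trigger -> target
    return images, labels
```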

Figure 1. Object detection algorithms are vulnerable to backdoor attacks. In this example, an AI was trained to recognize the target symbol as a trigger. When the trigger appears on a person, the AI mistakenly identifies him as a teddy bear.
Figure 2. Reinforcement learning can be Trojaned. This example uses the Atari Boxing environment; the white agent is trained using game state observations to box against the black agent (in-game AI). Customarily, the white agent tries to win by punching the black agent in the face more often than it gets hit. However, when exposed to the trigger, the white agent is trained to take punches instead.
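For the reinforcement learning case, one generic way to implant behavior like that in Figure 2 is to invert the reward whenever the trigger appears in the agent's observation; standard policy optimization then learns the attacker's objective under the trigger. The sketch below illustrates reward poisoning in general, not the specific method behind the figure; `trigger_present` is a hypothetical attacker-supplied predicate.

```python
def poison_reward(reward, observation, trigger_present):
    """Illustrative reward poisoning for RL. When the attacker-supplied
    trigger_present predicate fires on the observation, the reward sign
    is flipped, so the agent is rewarded for taking punches rather than
    landing them. Absent the trigger, training proceeds normally."""
    return -reward if trigger_present(observation) else reward
```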

The Objective

Detecting and Mitigating AI Trojans

Trojans can be easily embedded in deep networks but cannot yet be reliably prevented or detected. Although developers may attempt to curate and clean training data, AI datasets are often too large and complex to authenticate completely. Sourcing machine-learned models from trusted partners may one day prevent manipulation, but governments and industry are still working toward securing AI supply chains. Automated methods for analyzing software for Trojans are well established for traditional systems, but they remain in their infancy for intelligent systems that incorporate machine-learned components.

Our Approach

Establishing a Testbed for Detection Research

Our research team, led by Dr. Kiran Karra, has been working with the Intelligence Advanced Research Projects Activity (IARPA) on its Trojans in Artificial Intelligence (TrojAI) program to accelerate research and development of automated detection methods capable of analyzing a range of deep networks for evidence of Trojans.

A key challenge preventing rapid progress was the inability to generate a wide variety of Trojaned models alongside paired clean models for controlled experiments. Facilitating controlled experimentation for systems that use multiple data modalities (e.g., images, text, autonomous agents) and complex architectures was particularly challenging for researchers.
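A minimal version of such paired generation might look like the sketch below: train a clean model and a Trojaned model under identical seeds and hyperparameters, so that any behavioral difference is attributable to the poisoned data alone. It reuses `poison_dataset` from the earlier sketch and substitutes a scikit-learn logistic regression for a deep network purely for brevity; none of this reflects the project's actual tooling.

```python
from sklearn.linear_model import LogisticRegression

def train_pair(images, labels, target_label, poison_frac, seed=0):
    """Train a matched clean/Trojaned model pair under identical
    settings. A logistic regression on flattened pixels stands in for
    a deep network; poison_dataset is the sketch shown earlier."""
    flat = images.reshape(len(images), -1)
    clean = LogisticRegression(max_iter=200, random_state=seed).fit(flat, labels)

    p_images, p_labels = poison_dataset(images, labels, target_label,
                                        poison_frac, seed=seed)
    p_flat = p_images.reshape(len(p_images), -1)
    trojaned = LogisticRegression(max_iter=200, random_state=seed).fit(p_flat, p_labels)
    return clean, trojaned
```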

The team has developed a set of tools to explore various parameters and configurations of Trojaned models and to evaluate emerging detection methods in controlled, repeatable experiments that compare different approaches. Experiments conducted to date have shown that the nature of the trigger, the training batch size, and the dataset-poisoning percentage all affect the likelihood that a particular Trojan attack will succeed.
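As an example of the kind of controlled experiment this enables, the fragment below sweeps the poisoning percentage and measures the attack success rate, i.e., the fraction of triggered test inputs the Trojaned model sends to the attacker's label. It continues the sketches above (`train_pair` and the corner-patch trigger) and assumes `train_x`, `train_y`, and `test_x` arrays are already loaded; it is illustrative, not the team's actual evaluation harness.

```python
import numpy as np

def attack_success_rate(model, test_images, target_label):
    """Fraction of triggered test inputs classified as the attacker's
    target label -- a standard measure of whether the Trojan 'took'."""
    triggered = test_images.copy()
    triggered[:, -6:-2, -6:-2, :] = 1.0  # same trigger used at training time
    preds = model.predict(triggered.reshape(len(triggered), -1))
    return float(np.mean(preds == target_label))

# Sweep one factor the experiments examined: the poisoning percentage.
for frac in (0.01, 0.05, 0.10, 0.20):
    _, trojaned = train_pair(train_x, train_y, target_label=0,
                             poison_frac=frac)
    print(f"poison_frac={frac:.2f}  "
          f"ASR={attack_success_rate(trojaned, test_x, target_label=0):.2f}")
```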

Complementing its work on the TrojAI project, the team is researching a broader range of adversarial vulnerabilities and how they manifest, structurally and behaviorally, in deep networks. Ultimately, this research aims to develop effective defenses against these vulnerabilities, including methods for removing Trojans from trained networks.

Outcomes

Transition to NIST and Open Source Tools

We have released an open-source set of tools that supports research in detecting and defending against Trojan attacks by helping researchers generate datasets and accompanying models with embedded Trojans. We have transitioned the broader software framework to the National Institute of Standards and Technology (NIST) to aid in large-scale test and evaluation of Trojan detection algorithms. Progress against increasingly challenging Trojan detection scenarios can be tracked on NIST’s public leaderboard. The APL team continues to investigate Trojan vulnerabilities in other modalities and architectures.