News
Johns Hopkins APL Develops In-House, Mission-Relevant LLM Training Capabilities
A team at the Johns Hopkins Applied Physics Laboratory (APL) in Laurel, Maryland, has developed the capability to build a large language model (LLM) from the ground up, positioning the Laboratory to support its government partners in developing custom LLMs and adapting existing ones for mission applications. This work is already enabling multiple follow-on efforts in domains critical to homeland defense.
“The government needs a trusted adviser with end-to-end expertise in LLM development — not only to vet the models they use and evaluate those deployed by our adversaries but also to adapt existing models to mission-relevant domains while avoiding the many technical pitfalls along the way,” said Samuel Barham, an artificial intelligence researcher in APL’s Intelligent Systems Center (ISC) who led the internally funded work, in collaboration with colleagues Ted Staley and Sam Scheck. “We now have the technology, the infrastructure, and the expertise to support them in that capacity.”
“APL’s mission in this critical domain is to accelerate the impact of AI on national security challenges,” added ISC chief Bart Paulhamus. “This is just one example of the deep expertise at the Laboratory that we can leverage in service of that mission.”
The Government Need
Government and military organizations have unique requirements that make it difficult for them to use commercial LLMs off the shelf. Their data is often sensitive or classified at multiple levels and comes in various modalities, including text, video, audio, and a variety of customized formats, such as data from specialized sensors. They also deal with very specific domains and missions, often in contested environments, that require the integration of artificial intelligence into specialized workflows.
Commercial LLMs, in contrast, are trained on broad, generalized datasets, rendering them prone to hallucinations when applied to specialized domains. They're not equipped to handle sensitive or classified data or to be deployed in restricted environments, and it's challenging to fine-tune them with domain-specific data for mission-specific applications. In other words, they're developed for fundamentally different use cases — a challenge that increased capability alone can't address.
Technical Challenges
Creating an LLM from the ground up is not just a matter of data and compute. The APL team faced the same technical challenges as frontier companies, beginning with deciding how large a model to train to balance the goal of capability-building against a reasonable cost and schedule. The team derived empirical scaling laws that allowed them to estimate how long it would take to train an LLM of a given size on an arbitrary number of industry-standard Nvidia DGX H100 compute nodes, ultimately electing to train two models, one with one billion parameters and another with two billion. Larger models could, in theory, be trained with the same process in the same amount of time by using a larger cluster, Barham noted.
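The article doesn't publish APL's scaling laws, but the kind of estimate described can be sketched with the standard rule of thumb that training a dense transformer costs roughly 6 FLOPs per parameter per token. The per-node throughput and utilization figures below are illustrative assumptions, not APL's numbers.

```python
# Hypothetical scaling-law sketch: wall-clock training time as a function of
# model size, token count, and node count. Assumes the common ~6*N*D FLOPs
# approximation for dense transformer training; throughput and MFU values
# are illustrative, not measured APL figures.

def training_days(params: float, tokens: float, nodes: int,
                  flops_per_node: float = 8 * 989e12,  # assumed: 8x H100 BF16 peak per DGX node
                  mfu: float = 0.4) -> float:
    """Estimate days to train: (6 * params * tokens) / effective cluster throughput."""
    total_flops = 6 * params * tokens
    effective_flops_per_sec = nodes * flops_per_node * mfu
    return total_flops / effective_flops_per_sec / 86_400  # seconds per day

# Example: a 2B-parameter model on 2T tokens across 4 DGX H100 nodes
print(f"{training_days(2e9, 2e12, 4):.1f} days")
```

The useful property of such a formula is the one Barham notes: time scales inversely with cluster size, so a larger model can hold training time constant given proportionally more nodes.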
The next hurdle was curating and managing the data — identifying high-quality datasets; standardizing multiple heterogeneous datasets into a single format; converting all of that data into one to two trillion tokens, or word fragments, that an LLM can work with; developing an approach to stream that data to the model at massive scale; and ensuring the model can ingest and process the data fast enough for optimal training performance.
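One step in such a pipeline can be sketched concretely: packing a stream of tokenized documents into fixed-length training sequences, so data flows to the model in uniform chunks with no padding waste. The token IDs and end-of-document marker below are placeholders, not APL's actual scheme.

```python
# Illustrative sketch of sequence packing for LLM training: concatenate
# tokenized documents (separated by an end-of-document token) and yield
# fixed-length chunks ready for the training loop.
from typing import Iterable, Iterator

EOD = 0  # hypothetical end-of-document token ID

def pack_sequences(docs: Iterable[list[int]], seq_len: int) -> Iterator[list[int]]:
    """Yield fixed-length token chunks from a stream of tokenized docs."""
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(EOD)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# Three short "documents" packed into length-8 training sequences
chunks = list(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], seq_len=8))
print(chunks)  # → [[1, 2, 3, 0, 4, 5, 0, 6]]
```

Because packing is a generator over a stream, it extends naturally to the trillion-token scale the article describes: no dataset ever needs to be fully materialized in memory.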
While open-source solutions to these problems exist, none met all of APL's requirements, Barham explained. So the team developed a lightweight infrastructure — now available for other researchers to use — that enabled its model to blend and stream multiple datasets from the open-source Hugging Face repository in smaller, more manageable chunks for training, while allowing full control, replicability, and transparency into the process.
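The core idea of blending streamed datasets can be illustrated with a minimal weighted interleaver. The sources, weights, and fixed seed below are illustrative; the seed stands in for the replicability the article emphasizes.

```python
# A minimal sketch of weighted dataset blending: each step samples a source
# stream in proportion to its mixture weight. The seeded RNG keeps the
# blend reproducible; sources and weights here are illustrative.
import random
from typing import Iterator

def blend(streams: dict[str, Iterator[str]], weights: dict[str, float],
          seed: int = 0) -> Iterator[str]:
    """Interleave examples from several streams by weighted sampling;
    exhausted streams are dropped from the mixture."""
    rng = random.Random(seed)  # fixed seed for replicability
    names = list(streams)
    while names:
        probs = [weights[n] for n in names]
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield next(streams[name])
        except StopIteration:
            names.remove(name)

mix = blend(
    {"web": iter(["w1", "w2"]), "code": iter(["c1"])},
    {"web": 0.7, "code": 0.3},
)
print(list(mix))
```

Keeping the sampler this explicit is one way to get the "full control, replicability, and transparency" the team describes: the mixture is a few lines of inspectable code rather than a black box.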
The team also developed a custom infrastructure that was critical to overcoming a problem that all LLMs face: loss spikes, or sudden, significant, and often inexplicable jumps in the training loss that can ruin a training run and force a complete restart. "Commercial labs put a lot of effort into fixing and preventing loss spikes," Barham said. "It took some time, but we were ultimately able to integrate some statistical techniques from recent papers to predict and prevent them."
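The article doesn't name the statistical techniques used, but one common family works like this sketch: flag a step as a spike when its loss is a statistical outlier relative to a recent window, so the update can be skipped or rolled back before it corrupts the run. The window size and threshold are illustrative.

```python
# Hedged sketch of a statistical loss-spike guard (not necessarily APL's
# method): flag a step whose loss deviates far from a running window's
# mean, so the caller can skip the batch or restore a checkpoint.
from collections import deque
import statistics

class SpikeGuard:
    def __init__(self, window: int = 50, z_threshold: float = 4.0):
        self.history: deque = deque(maxlen=window)
        self.z_threshold = z_threshold

    def is_spike(self, loss: float) -> bool:
        """Return True if `loss` is an outlier versus recent history."""
        if len(self.history) >= 10:  # need enough samples for stable stats
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history) or 1e-8
            if (loss - mean) / std > self.z_threshold:
                return True  # spike: don't record it, don't apply the update
        self.history.append(loss)
        return False

guard = SpikeGuard()
losses = [2.0 + 0.01 * i for i in range(20)] + [9.0]  # steady decline... then a spike
flags = [guard.is_spike(l) for l in losses]
print(flags[-1])  # → True
```

Note the guard deliberately excludes flagged losses from its history, so a spike can't inflate the window statistics and mask the spikes that follow.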
Immediate Capabilities, Active Applications
The team succeeded in training its two-billion-parameter model using a specially curated collection of open-source datasets that combined general knowledge with more specialized coverage of math, engineering, and foreign languages. The researchers achieved performance on par with commercial models of a similar scale and trained with similar resources, Barham said — adding that, as a result, APL now has hands-on experience across the full life cycle of LLM development.
The Lab’s capabilities now include the ability to adapt existing models to mission-relevant domains while avoiding loss spikes and catastrophic forgetting — in which new training degrades previously learned knowledge — as well as the ability to train models from scratch, ranging from one to two billion parameters up to tens of billions.
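The article doesn't say which forgetting mitigation APL used, but one widely cited technique, elastic weight consolidation (EWC), illustrates the idea: penalize drift in parameters the original model relied on most. This toy sketch computes that penalty over plain floats; the Fisher values and weighting are illustrative.

```python
# Toy sketch of the EWC regularizer, one common (assumed, not confirmed)
# mitigation for catastrophic forgetting: penalty = lam/2 * sum_i of
# F_i * (theta_i - theta_old_i)^2, where F_i estimates how important
# parameter i was to the original model.
def ewc_penalty(theta: list, theta_old: list, fisher: list,
                lam: float = 1.0) -> float:
    """EWC loss term added to the fine-tuning objective."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta_old, fisher)
    )

# Drifting an "important" parameter (high Fisher value) costs far more
# than drifting an unimportant one by the same amount.
print(ewc_penalty([1.5, 0.2], [1.0, 0.0], fisher=[10.0, 0.1]))  # → 1.252
```

During adaptation, this penalty is added to the task loss, steering new training away from overwriting the knowledge the base model already encodes.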
“We’ve now worked through all the challenges and pitfalls that come with training a model using large-scale text datasets, and are ready to apply that expertise to real-world problems faced by our government partners,” Barham said.
This work is already being used to train and refine mission-relevant LLMs across multiple domains, said Gautam Vallabha, assistant program manager for Frontier Intelligent Systems in APL’s Research and Exploratory Development Mission Area.
“Researchers in the Laboratory’s Global Health Mission Area are using this code base to accelerate the development of LLMs to counter CBRN (chemical, biological, radiological, and nuclear) threats, and it’s also being applied to train models on multimodal data across multiple warfighting domains,” Vallabha said. “The next step is extending these techniques to larger models and richer data types so that our government partners can operationalize AI across the full spectrum of their missions.”