Johns Hopkins APL Develops In-House, Mission-Relevant LLM Training Capabilities
A team at the Johns Hopkins Applied Physics Laboratory (APL) in Laurel, Maryland, has developed the capability to build a large language model (LLM) from the ground up, positioning the Laboratory to support its government partners in developing custom LLMs and adapting existing ones for mission applications. This work is already enabling multiple follow-on efforts in domains critical to homeland defense.
“The government needs a trusted adviser with end-to-end expertise in LLM development — not only to vet the models they use and evaluate those deployed by our adversaries but also to adapt existing models to mission-relevant domains while avoiding the many technical pitfalls along the way,” said Samuel Barham, an artificial intelligence researcher in APL’s Intelligent Systems Center (ISC) who led the internally funded work, in collaboration with colleagues Ted Staley and Sam Scheck. “We now have the technology, the infrastructure, and the expertise to support them in that capacity.”
“APL’s mission in this critical domain is to accelerate the impact of AI on national security challenges,” added ISC chief Bart Paulhamus. “This is just one example of the deep expertise at the Laboratory that we can leverage in service of that mission.”
The Government Need
The U.S. government possesses a wealth of operational data reflecting decades of experience and spanning multiple classification levels and modalities — including text, video, audio, and specialized formats such as sensor outputs. Training purpose-built models on that data promises to produce AI systems with deep domain expertise to support mission-aligned decision-making in military, intelligence, and other government operations.
Such tailored systems would provide reliable performance in support of planning and operations while requiring less compute and being far more field-deployable than commercial frontier models. These systems would not replace commercial models but complement them in specialized mission scenarios, enabling the U.S. government to further leverage its troves of specialized data in support of the warfighter.
Technical Challenges
Creating an LLM from the ground up is not just a matter of data and compute. The APL team faced the same technical challenges as frontier companies, beginning with deciding how large a model to train to balance the goal of capability-building against a reasonable cost and schedule. The team derived empirical scaling laws that let them estimate how long it would take to train an LLM of a given size on an arbitrary number of industry-standard Nvidia DGX H100 compute nodes, ultimately electing to train two models, one with one billion parameters and one with two billion. Larger models could, in theory, be trained with the same process in the same amount of time on a larger cluster, Barham noted.
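The kind of estimate such scaling laws support can be sketched with a back-of-envelope calculation. The sketch below uses the common approximation of roughly 6 FLOPs per parameter per training token; the per-node throughput figure is an illustrative assumption, not APL's measured value or empirical law.

```python
# Back-of-envelope training-time estimate for a dense transformer.
# Uses the common ~6 * parameters * tokens FLOP approximation; the
# sustained per-node throughput is an assumed figure (~40% utilization
# of 8 H100 GPUs at ~1e15 peak bf16 FLOP/s each), purely illustrative.

def training_days(params: float, tokens: float, nodes: int,
                  node_flops_per_s: float = 8 * 400e12) -> float:
    """Estimate wall-clock days to train a model.

    params           -- model size in parameters
    tokens           -- number of training tokens
    nodes            -- number of DGX H100 nodes (8 GPUs each)
    node_flops_per_s -- assumed sustained FLOP/s per node
    """
    total_flops = 6.0 * params * tokens       # forward + backward pass
    seconds = total_flops / (nodes * node_flops_per_s)
    return seconds / 86_400                   # seconds per day

# e.g. a 2-billion-parameter model on 2 trillion tokens across 8 nodes
days = training_days(2e9, 2e12, nodes=8)
```

Under these assumptions, doubling the cluster size halves the wall-clock time, which is the trade-off the article describes: a larger model can keep the same schedule if the cluster grows proportionally.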
The next hurdle was curating and managing the data — identifying high-quality datasets; standardizing multiple heterogeneous datasets into a single format; converting all of that data into one to two trillion tokens, or word fragments, that an LLM can work with; developing an approach to stream that data to the model at massive scale; and ensuring that the model can ingest and process the data fast enough to keep training throughput optimal.
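The standardize-tokenize-stream steps above can be sketched in miniature. Everything here is illustrative: the field names are hypothetical, and the whitespace "tokenizer" is a toy stand-in for a real trained subword tokenizer.

```python
# Minimal sketch of the curation pipeline described above: map
# heterogeneous records onto one schema, tokenize, and stream
# fixed-length token chunks for training. All names are illustrative.

from typing import Iterable, Iterator

def standardize(record: dict) -> dict:
    # Different datasets name their text field differently; fold them
    # all into a single {"text": ...} schema (field names hypothetical).
    for key in ("text", "content", "body"):
        if key in record:
            return {"text": record[key]}
    raise ValueError("record has no recognized text field")

def tokenize(text: str) -> list[int]:
    # Toy stand-in: hash words into a fixed vocabulary. A real pipeline
    # would use a trained subword vocabulary (e.g. BPE).
    return [hash(word) % 50_000 for word in text.split()]

def stream_chunks(records: Iterable[dict], chunk_len: int) -> Iterator[list[int]]:
    buf: list[int] = []
    for rec in records:
        buf.extend(tokenize(standardize(rec)["text"]))
        while len(buf) >= chunk_len:          # emit fixed-length chunks
            yield buf[:chunk_len]
            buf = buf[chunk_len:]

data = [{"text": "alpha beta gamma"}, {"content": "delta epsilon"},
        {"body": "zeta eta theta iota"}]
chunks = list(stream_chunks(data, chunk_len=4))
```

The generator keeps memory flat regardless of corpus size, which is the property that matters when the corpus is trillions of tokens rather than nine.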
While open-source solutions to these problems exist, none met all of APL’s requirements, Barham explained. So the team developed a lightweight infrastructure — now available for other researchers to use — that blends and streams multiple datasets from the open-source Hugging Face repository into smaller, more manageable chunks for training, while preserving full control, replicability, and transparency throughout the process.
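The core idea of blending streams with full replicability can be illustrated with a seeded mixture sampler. This is a generic sketch, not APL's infrastructure: it draws from several iterators according to mixture weights, and a fixed seed makes the interleaving order exactly reproducible.

```python
# Sketch of deterministic dataset blending: sample from several streams
# according to mixture weights, with a fixed seed so any run can be
# replayed exactly. Stream names and weights are illustrative.

import random
from typing import Iterator

def blend(streams: dict[str, Iterator], weights: dict[str, float],
          seed: int = 0) -> Iterator:
    rng = random.Random(seed)                 # fixed seed -> replicable order
    names = list(streams)
    probs = [weights[n] for n in names]
    while names:
        name = rng.choices(names, probs)[0]   # weighted pick of a stream
        try:
            yield next(streams[name])
        except StopIteration:                 # drop exhausted streams
            i = names.index(name)
            names.pop(i)
            probs.pop(i)

web = iter([f"web-{i}" for i in range(5)])
math = iter([f"math-{i}" for i in range(3)])
mixed = list(blend({"web": web, "math": math},
                   {"web": 0.7, "math": 0.3}))
```

Because the random state is seeded, rerunning the same configuration reproduces the same training order, which is what makes a run auditable and repeatable.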
The team also developed custom infrastructure that was critical to overcoming a problem that all LLMs face: loss spikes, or sudden, significant, and often inexplicable jumps in the training loss that can ruin a training run and force a complete restart. “Commercial labs put a lot of effort into fixing and preventing loss spikes,” Barham said. “It took some time, but we were ultimately able to integrate some statistical techniques from recent papers to predict and prevent them.”
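One simple statistical guard in the spirit of the techniques described is an outlier test on recent losses: flag any step whose loss is far outside the recent window so the update can be skipped or the run rolled back to a checkpoint. The window size and threshold below are arbitrary illustrative choices, not APL's method.

```python
# Illustrative loss-spike guard: flag a training step whose loss is a
# statistical outlier relative to a recent window. Window size and
# z-score threshold are arbitrary choices for the sketch.

from collections import deque
from statistics import mean, stdev

class SpikeGuard:
    def __init__(self, window: int = 50, z_threshold: float = 4.0):
        self.losses = deque(maxlen=window)    # recent accepted losses
        self.z = z_threshold

    def is_spike(self, loss: float) -> bool:
        if len(self.losses) >= 10:            # need enough history first
            mu, sigma = mean(self.losses), stdev(self.losses)
            if sigma > 0 and (loss - mu) / sigma > self.z:
                return True                   # outlier: skip this update
        self.losses.append(loss)              # track only accepted steps
        return False

guard = SpikeGuard()
history = [2.0 + 0.01 * i for i in range(20)] + [9.0]  # sudden jump at the end
flags = [guard.is_spike(loss) for loss in history]
```

In a real training loop, a flagged step would typically trigger skipping the offending batch or restoring the last checkpoint rather than letting the spike corrupt the optimizer state.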
Immediate Capabilities, Active Applications
The team succeeded in training its two-billion-parameter model using a specially curated collection of open-source datasets that combined general knowledge with more specialized coverage of math, engineering, and foreign languages. The researchers achieved performance on par with commercial models of a similar scale and trained with similar resources, Barham said — adding that, as a result, APL now has hands-on experience across the full life cycle of LLM development.
The Lab’s capabilities now include adapting existing models to mission-relevant domains while avoiding loss spikes and catastrophic forgetting — in which new training degrades previously learned knowledge — as well as training models from scratch, from the one-to-two-billion-parameter scale up to tens of billions of parameters.
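One widely used mitigation for catastrophic forgetting (not necessarily the one APL uses) is replay: mixing a small fraction of the original general-purpose training data back into each domain-adaptation batch so earlier knowledge keeps being reinforced. The batch size and replay fraction below are illustrative.

```python
# Sketch of replay-based mitigation for catastrophic forgetting: each
# adaptation batch mixes new domain data with a fraction of the original
# general data. Fractions, sizes, and data are illustrative only.

import random

def replay_batches(new_data: list, old_data: list, batch_size: int = 8,
                   replay_frac: float = 0.25, seed: int = 0):
    rng = random.Random(seed)
    n_old = max(1, int(batch_size * replay_frac))  # replayed general items
    n_new = batch_size - n_old                     # new domain items
    for start in range(0, len(new_data) - n_new + 1, n_new):
        batch = new_data[start:start + n_new] + rng.sample(old_data, n_old)
        rng.shuffle(batch)                         # mix old and new together
        yield batch

general = [f"general-{i}" for i in range(100)]
domain = [f"domain-{i}" for i in range(12)]
batches = list(replay_batches(domain, general))
```

The design choice here is the replay fraction: too low and old capabilities still erode, too high and the model adapts slowly to the new domain.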
“We’ve now worked through all the challenges and pitfalls that come with training a model using large-scale text datasets, and are ready to apply that expertise to real-world problems faced by our government partners,” Barham said.
This work is already being used to train and refine mission-relevant LLMs across multiple domains, said Gautam Vallabha, assistant program manager for Frontier Intelligent Systems in APL’s Research and Exploratory Development Mission Area.
“Researchers in the Laboratory’s Global Health Mission Area are using this code base to accelerate the development of LLMs to counter CBRN (chemical, biological, radiological, and nuclear) threats, and it’s also being applied to train models on multimodal data across multiple warfighting domains,” Vallabha said. “The next step is extending these techniques to larger models and richer data types so that our government partners can operationalize AI across the full spectrum of their missions.”