Data Set Representation and Tagging for Automating Data Cataloging
Abstract
In the last two decades, considerable increases in computing power and available data have led to an analytics and machine learning (ML) revolution. To make knowledge management less cumbersome for human operators, a team of researchers at the Johns Hopkins University Applied Physics Laboratory (APL) proposes an ML-based method to help automate it. This method discovers new data, represents it with descriptive metadata, automatically categorizes the metadata, auto-populates a data catalog with data sets, and evaluates the new data sets for data fusion options. We focus on a framework that can potentially leverage human-machine teaming to significantly reduce the human resource burden of developing and maintaining an accurate accounting of existing data and capabilities within an organization. We explored numerous ML options to test our core hypothesis: that ML techniques can be employed to reliably determine the fundamental topic that an unknown data set represents, leading to increasingly granular data set recognition as more characterization and context information is mined in the metadata extraction phase. Ultimately, we demonstrated that multiple classifier techniques can predict data set topics with close to 90% accuracy, and some with 60%–80% accuracy, across multiple topics.