HAIRCUT (Cross-Language Information Retrieval)

Reference#: P01429

It is often difficult to determine the gist of a body of text, such as a document or group of documents, when the body of text is not considered in its entirety. This can cause problems for computerized text-based information retrieval systems. Such systems are now in widespread use for database, intranet and internet-based (e.g. World Wide Web) applications. In many such systems, search terms, such as words, stemmed words, n-grams, phrases, etc., are provided by a user to information retrieval software. The information retrieval software, e.g. a Web search engine, uses such search terms in a well-known manner to search a group of documents and identify documents relevant to the search query.

A common problem for information retrieval systems is deciding which documents are relevant to the user's search and how the retrieved documents rank relative to one another. This problem is particularly acute in the Web context because the group of documents searched is particularly large and heterogeneous. Accordingly, the number of retrieved documents is typically very large, often larger than a user can carefully consider. Many search engines provide relevance-based rankings of search results so that the most relevant results (as determined by the search engine) are displayed to the user first.

Researchers at the Johns Hopkins Applied Physics Laboratory (JH/APL) have developed a method for identifying the terms (words, groups of words, or parts of words) that are important to a given text. The method compares the frequency of occurrence of terms in the sample text to a benchmark frequency, such as the frequency of those terms in a reference text, which may be any large text sample.

The JH/APL method determines the frequency of occurrence within a sample text for each of a plurality of terms of the sample text. It then compares each term's sample frequency to its frequency of occurrence within the reference text. This reference frequency provides a benchmark for determining relative importance to the sample text. Terms that occur with greater frequency in the sample text than in the reference text are deemed relatively important to the sample text.
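
The comparison described above can be illustrated with a minimal Python sketch. It is not the JH/APL implementation: the function names, the lowercase word tokenization, the use of a frequency ratio as the comparison, and the small floor applied to terms absent from the reference text are all assumptions made for illustration only.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each term's share of all term occurrences in the text.

    Assumption: terms are lowercase word tokens; the method described
    above could equally use stems, n-grams, or phrases.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def important_terms(sample, reference, floor=1e-6):
    """List terms that are more frequent in the sample than in the reference.

    Assumption: terms are scored by the ratio of sample frequency to
    reference frequency, and `floor` (hypothetical) stands in for the
    reference frequency of terms the reference text never uses.
    """
    sample_freq = relative_frequencies(sample)
    ref_freq = relative_frequencies(reference)
    scored = {}
    for term, freq in sample_freq.items():
        ratio = freq / max(ref_freq.get(term, 0.0), floor)
        if ratio > 1.0:  # occurs with greater frequency in the sample
            scored[term] = ratio
    # Highest ratio first: the terms most characteristic of the sample.
    return sorted(scored, key=scored.get, reverse=True)


sample_text = "the reactor core and the reactor vessel"
reference_text = "the cat and the dog and the bird sat on the mat"
result = important_terms(sample_text, reference_text)
print(result)
```

In this toy example, common words such as "the" and "and" are at least as frequent in the reference text as in the sample, so they are dropped, while "reactor" is retained and ranked first. In practice the reference text would be a much larger sample, as the description above indicates.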

Ms. H. L. Curran
Phone: (443) 778-7262