APL Colloquium

June 2, 2000

Colloquium Topic: Intelligent Web Searching

Effective search technology is of tremendous importance in making the World Wide Web valuable to people. In recognition of this need, numerous commercial systems purport to index the Web in various ways. The common denominator for most of these engines is the Pathetically Bad Search algorithm® (PBS), which they use to handle the bulk of their user queries. This talk will explain how PBS works, why it is so widespread, and how you can convince particular search engines to use alternative algorithms. It will also describe the Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT) system, APL's own search engine. HAIRCUT blends a variety of proprietary and public domain technologies to produce retrieval results that are among the best in the world. One of HAIRCUT's underpinnings is a novel 'affinity statistic' that captures how closely two words or phrases are related. The talk will describe this statistic, and show how we use it for tasks as diverse as document summarization, word sense disambiguation and cross-language retrieval.

Colloquium Speaker: James Mayfield

Dr. James Mayfield received an A.B. with honors from Harvard College in 1979 and a Ph.D. in computer science from the University of California at Berkeley in 1989. He is currently a Senior Computer Scientist in the Research and Technology Development Center at the Johns Hopkins Applied Physics Laboratory. Prior to joining APL in 1996, he was an associate professor of computer science at the University of Maryland Baltimore County. Dr. Mayfield's research accomplishments, documented by more than fifty professional communications, include work in information retrieval, hypertext, and agent-based architectures and communication languages. Dr. Mayfield is the principal investigator for the HAIRCUT information retrieval project. Through its entry in the Text Retrieval Conference (TREC), HAIRCUT has demonstrated the benefits of using a variety of indexing terms and proximity measures in an information retrieval system.