March 31, 2000
Colloquium Speaker: C. Lee Giles
Dr. C. Lee Giles is a senior research scientist in Computer Science at NEC Research Institute, Princeton, NJ; an adjunct faculty member at the Institute for Advanced Computer Studies at the University of Maryland; and an adjunct Professor in Computer and Information Science at the University of Pennsylvania. His research interests are in web computing, intelligent information retrieval and processing and knowledge extraction, and in fundamental models of intelligent systems. He is a Fellow of the IEEE and a member of AAAI, ACM, INNS, OSA, AAAS, and the Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University. Recently, he and Steve Lawrence co-authored papers published in Science and Nature on the size of the web and search engine coverage. This and related work received wide press coverage, including in the Wall Street Journal, New York Times, MSNBC, PBS, BBC, National Geographic, and Drudge Report. His research was recently highlighted in SIAM News. This spring he is co-teaching a graduate course in the Computer and Information Science Dept. at the University of Pennsylvania on "Information Retrieval, Digital Libraries and the Web." His previous positions include program manager at the Air Force Office of Scientific Research.
This talk will describe current limitations, new techniques, and future directions for information access on the web. The web and search engines represent a significant improvement for communication. While a great deal of information has long been available, search engines facilitate efficient access to an increasing amount of it. However, we found that search engines do not index sites equally, may not index new pages for months, and that no engine indexes more than about one sixth of the estimated size of the publicly indexable web. We also analyzed the volume and distribution of information on the web, images, metadata usage, and search engine indexing by domain. The talk will also describe new techniques for information access on the web, including two projects at NEC Research Institute: Inquirus, which is a content-based metasearch engine, and CiteSeer, which is the largest free full-text index of scientific literature in the world.