Directed Web Crawling Using Machine Learning

Reference#: P01607

Directed web crawling takes a specific expression of a topic and directs a web spider to locate documents relevant to that topic. Spiders may be directed by the content of the documents they examine, by the underlying link topology, or by meta-information about a document or URL. Current content-based methods, however, are weak. Researchers typically evaluate content with cosine-based vector models, which require a threshold to decide whether a document is relevant. The same threshold is typically used for all topics rather than being varied in a topic-specific way, and a good threshold value is difficult to determine.
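To make the threshold problem concrete, the following is a minimal sketch of a cosine-based relevance test. The whitespace tokenizer, the raw term-frequency weighting, and the function names are assumptions chosen for illustration. They are not taken from any particular crawler.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as unrelated to everything
    return dot / (norm_a * norm_b)

def is_relevant(topic, document, threshold):
    """Relevance decision: similarity must reach a fixed, hand-picked threshold."""
    return cosine_similarity(term_vector(topic), term_vector(document)) >= threshold
```

With the topic "web crawling spider" and the document "a spider for web crawling", the similarity is about 0.77, so the document is accepted at a threshold of 0.3 and rejected at 0.9. The decision depends entirely on a value that has to be chosen by hand.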

The Johns Hopkins Applied Physics Lab has developed an approach that improves content-based methods and is compatible with other criteria, such as link-based techniques. The approach will be used primarily to locate textual resources, although it may also apply to other forms of electronic information. It depends on two techniques: one for characterizing the content of documents, and one that uses machine learning methods to classify documents.
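The two-part structure (characterize a document, then classify it with a learned model) can be sketched as follows. This summary does not say which content representation or learning method the approach uses, so the bag-of-words features, the Naive Bayes classifier, and every name in the code are stand-ins assumed purely for illustration.

```python
import math
from collections import Counter

def characterize(text):
    """Assumed content characterization: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

class NaiveBayesRelevance:
    """Hypothetical stand-in classifier: multinomial Naive Bayes, add-one smoothing."""

    def fit(self, examples):
        # examples: list of (text, is_relevant) pairs; both labels must occur
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        for text, label in examples:
            self.word_counts[label].update(characterize(text))
            self.doc_counts[label] += 1
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, features, label):
        # log prior plus smoothed log likelihood of the known words
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for word, count in features.items():
            if word in self.vocab:  # words never seen in training are ignored
                score += count * math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def predict(self, text):
        """True if the document is more likely relevant than irrelevant."""
        features = characterize(text)
        return self._log_score(features, True) > self._log_score(features, False)

# Made-up training examples for one topic, for illustration only
training = [
    ("web crawler follows links", True),
    ("spider crawls web pages", True),
    ("recipe for apple pie", False),
    ("bake the pie crust", False),
]
model = NaiveBayesRelevance().fit(training)
```

In this sketch the decision boundary is learned from labeled examples for each topic rather than fixed by hand, which is the kind of topic-specific behavior a single shared threshold cannot provide.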

Ms. H. L. Curran
Phone: (443) 778-7262