CSCE 470 Lecture 13

« previous | Monday, September 23, 2013 | next »


Crawling

Begin with "seed" URLs

Fetch and parse

  • Extract URLs they point to
  • Place extracted URLs on a queue

Fetch each URL on the queue and repeat

Impose an ordering on the queue to decide which URL is fetched next (a sketch of the basic loop is below)
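
A minimal sketch of this loop in Python (the page cap, the timeout, and the use of only the standard library are assumptions for illustration; the plain FIFO queue gives simple breadth-first ordering rather than any smarter priority):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    absolute, _ = urldefrag(urljoin(self.base_url, value))
                    self.links.append(absolute)


def crawl(seed_urls, max_pages=100):
    """Fetch, parse, extract URLs, enqueue, repeat (breadth-first)."""
    queue = deque(seed_urls)     # frontier of URLs still to fetch (FIFO)
    seen = set(seed_urls)        # never enqueue the same URL twice
    fetched = []

    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue             # unreachable or failing page: skip it

        fetched.append(url)

        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:           # extract URLs the page points to
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)          # place extracted URLs on the queue

    return fetched

A real crawler would replace the FIFO deque with a priority queue to impose whatever ordering the crawl policy calls for, and would partition the frontier across many machines, which is the distributed-execution complication below.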

Complications

Not feasible with one machine

  • Distributed execution of steps above

Malicious pages

  • Spam pages
  • Spider traps (dynamically generated pages that produce an endless space of URLs; a guard is sketched below)
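
One common guard against spider traps is to cap how many URLs the crawler will accept from a single host; a minimal sketch, where the cap of 500 pages per host is an arbitrary assumption:

from collections import defaultdict
from urllib.parse import urlparse

MAX_PAGES_PER_HOST = 500          # assumed cap; tune for the crawl

pages_per_host = defaultdict(int)


def should_enqueue(url):
    """Refuse URLs from hosts that have already yielded too many pages."""
    host = urlparse(url).netloc
    if pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False              # likely a trap (or simply enough of that site)
    pages_per_host[host] += 1
    return True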

Latency and bandwidth to remote servers vary widely

Robots.txt: sites use this file to state which paths crawlers may fetch; a polite crawler checks it before requesting a page (sketch below)
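
A sketch of honoring robots.txt with Python's standard urllib.robotparser before fetching a page; the user-agent string "MyCrawler" is a placeholder:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url, user_agent="MyCrawler"):
    """Check the site's robots.txt before fetching the given URL."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                     # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

In the crawl loop above, a polite crawler would call allowed_by_robots(url) before urlopen and skip the URL when it returns False, caching the parsed robots.txt per host rather than re-fetching it for every page.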