CSCE 470 Lecture 13
« previous | Monday, September 23, 2013 | next »
Crawling
Begin with "seed" URLs
Fetch and parse
- Extract URLs they point to
- Place extracted URLs on a queue
Fetch each URL on the queue and repeat
Impose an ordering on the queue (a minimal single-machine loop is sketched below)
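A rough sketch of the basic loop above, assuming a simple FIFO ordering on the queue and a `seen` set to avoid re-enqueueing the same URL; the function name `crawl`, the `max_pages` cutoff, and the use of `requests`/`BeautifulSoup` are illustrative choices, not part of the lecture:

<pre>
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    """Basic crawler: fetch, parse, extract URLs, enqueue, repeat."""
    frontier = deque(seed_urls)   # queue of URLs to fetch (FIFO ordering here)
    seen = set(seed_urls)         # avoid placing the same URL on the queue twice
    fetched = []

    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable or slow pages
        fetched.append(url)

        # Parse the page and extract the URLs it points to
        soup = BeautifulSoup(resp.text, "html.parser")
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)  # place extracted URL on the queue

    return fetched
</pre>

A real crawler would replace the FIFO queue with a prioritized URL frontier (the "impose ordering" step), which is where politeness and page-importance policies live.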
Complications
Not feasible with one machine at web scale
- Requires distributed execution of the steps above
Malicious pages
- Spam pages
- Spider traps (dynamically generated pages that yield an effectively unbounded set of URLs)
Latency/bandwidth to remote servers varies
robots.txt: webmasters' stipulations on which parts of a site a crawler may fetch
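A crawler should consult robots.txt before fetching a page. A minimal check using Python's standard `urllib.robotparser`; the user-agent string "MyCrawler" and the example.com URLs are placeholders:

<pre>
from urllib import robotparser

# Fetch and parse the site's robots.txt, then ask whether a URL may be crawled.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")
</pre>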