CSCE 470 Lecture 13

« previous | Monday, September 23, 2013 | next »


Crawling

Begin with "seed" URLs

Fetch and parse

  • Extract URLs they point to
  • Place extracted URLs on a queue

Fetch each URL on the queue and repeat

Impose an ordering on the queue to decide which URL is fetched next (a sketch of the basic loop is below)
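
A minimal sketch of this loop in Python (the page cap, the timeout, and the use of only the standard library are assumptions for illustration; the plain FIFO queue gives simple breadth-first ordering rather than any smarter priority):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    absolute, _ = urldefrag(urljoin(self.base_url, value))
                    self.links.append(absolute)


def crawl(seed_urls, max_pages=100):
    """Fetch, parse, extract URLs, enqueue, repeat (breadth-first)."""
    queue = deque(seed_urls)     # frontier of URLs still to fetch (FIFO)
    seen = set(seed_urls)        # never enqueue the same URL twice
    fetched = []

    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue             # unreachable or failing page: skip it

        fetched.append(url)

        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:           # extract URLs the page points to
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)          # place extracted URLs on the queue

    return fetched

A real crawler would replace the FIFO deque with a priority queue to impose whatever ordering the crawl policy calls for, and would partition the frontier across many machines, which is the distributed-execution complication below.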

Complications

Not feasible with one machine

  • Distributed execution of steps above

Malicious pages

  • Spam pages
  • Spider traps (dynamically generated pages that produce an endless space of URLs; a guard is sketched below)
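
One common guard against spider traps is to cap how many URLs the crawler will accept from a single host; a minimal sketch, where the cap of 500 pages per host is an arbitrary assumption:

from collections import defaultdict
from urllib.parse import urlparse

MAX_PAGES_PER_HOST = 500          # assumed cap; tune for the crawl

pages_per_host = defaultdict(int)


def should_enqueue(url):
    """Refuse URLs from hosts that have already yielded too many pages."""
    host = urlparse(url).netloc
    if pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False              # likely a trap (or simply enough of that site)
    pages_per_host[host] += 1
    return True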

Latency and bandwidth to remote servers vary widely

Robots.txt: sites use this file to state which paths crawlers may fetch; a polite crawler checks it before requesting a page (sketch below)
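
A sketch of honoring robots.txt with Python's standard urllib.robotparser before fetching a page; the user-agent string "MyCrawler" is a placeholder:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url, user_agent="MyCrawler"):
    """Check the site's robots.txt before fetching the given URL."""
    parts = urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                     # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

In the crawl loop above, a polite crawler would call allowed_by_robots(url) before urlopen and skip the URL when it returns False, caching the parsed robots.txt per host rather than re-fetching it for every page.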