Common Lisp Package: WEB-CRAWLER

Main package for web-crawler.

README:

FUNCTION

Public

MAKE-SAME-HOST-FILTER (URI)

Given either a string uri or a puri:uri object, returns a function that takes a single uri as its argument and returns true if that uri has the same hostname as the original uri.
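
A minimal usage sketch, assuming the symbol is exported from the WEB-CRAWLER package and that the returned predicate accepts puri:uri objects (example.com is a placeholder):

  ;; Build a predicate that accepts only uris on example.com.
  (let ((same-host-p (web-crawler:make-same-host-filter "http://example.com/")))
    (funcall same-host-p (puri:parse-uri "http://example.com/about"))  ; => true
    (funcall same-host-p (puri:parse-uri "http://other.org/page")))    ; => false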

START-CRAWL (URI PROCESSOR &KEY URI-FILTER (CRAWL-DELAY 10) VERBOSE)

Crawl the web, starting at <uri>. <processor> must be a function that takes three arguments: the current uri, the parent of the current uri, and the content returned from doing a GET request on that uri. Its return value is ignored. <:uri-filter> can be a predicate that returns false if the passed uri should not be processed. The uri will be an instance of the puri:uri class, which should be easier to deal with than plain text; if you want plain text, call (puri:render-uri uri nil), which returns it as a string. To limit crawling to one site, call (make-same-host-filter uri) and pass the return value in for :uri-filter. <:crawl-delay> is the number of seconds to sleep between requests; the default is 10, and it can be a fraction. <:verbose>, when true, prints out each uri being processed.
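
A hedged end-to-end sketch; print-page and example.com are placeholders, and :crawl-delay 1/2 only illustrates that fractional delays are accepted:

  ;; A processor that prints each fetched uri and the size of its content.
  (defun print-page (uri parent content)
    (declare (ignore parent))
    (format t "~&Fetched ~a (~d characters)~%"
            (puri:render-uri uri nil)
            (length content)))

  ;; Crawl example.com only, waiting half a second between requests.
  (web-crawler:start-crawl
   "http://example.com/"
   #'print-page
   :uri-filter (web-crawler:make-same-host-filter "http://example.com/")
   :crawl-delay 1/2
   :verbose t)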

Private

CRAWL-AND-SAVE-SITE (START-URI DIR &KEY (CRAWL-DELAY 10) VERBOSE)

Crawl a site (one hostname), starting at START-URI, and save every page to a file in DIR. The file index.dat will contain a list of filenames and their uris, because uris don't work so well as filenames.
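
A sketch of a call, assuming DIR accepts a pathname; the function is private, hence the double colon, and the paths here are placeholders:

  ;; Mirror example.com into /tmp/crawl/, pausing 5 seconds between requests.
  (web-crawler::crawl-and-save-site "http://example.com/" #P"/tmp/crawl/"
                                    :crawl-delay 5
                                    :verbose t)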

GET-ROBOTS-RULES (URI)

Gets the robots.txt rules for a uri, as returned by parse-robots-txt. Caches the rules when possible.

PARSE-ROBOTS-TXT (TEXT)

Parses the text of a robots.txt file and gives back a list of rules, each of the form (user-agent disallow-path*), where user-agent is a string, possibly "*". An example rule is ("*" "/cyberworld/map/" "/test/" "/test2").
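
A sketch of the documented return shape, built from the example rule above (exact whitespace handling is an assumption):

  (web-crawler::parse-robots-txt
   (format nil "User-agent: *~%Disallow: /cyberworld/map/~%Disallow: /test/~%Disallow: /test2~%"))
  ;; => (("*" "/cyberworld/map/" "/test/" "/test2"))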

SPLIT-ON-ELEMENT (SEQ EL)

Split any sequence into a list, using the given element as a separator.
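
For example, splitting a string on a comma character (how empty subsequences are handled is not documented):

  (web-crawler::split-on-element "a,b,c" #\,)  ; => ("a" "b" "c")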

URI-IS-ALLOWED (URI)

Check the robots.txt file for this uri's site and return whether robots are allowed to fetch the uri.
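
A hedged call, assuming the argument is a puri:uri object as elsewhere in the package; the uri is a placeholder:

  ;; True when the site's robots.txt does not disallow this path for robots.
  (web-crawler::uri-is-allowed (puri:parse-uri "http://example.com/articles/index.html"))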

Undocumented

GET-ATTR (ATTR ELEM)

GET-ROBOTS-TXT (URI)

GET-ROBOTS-URI (URI)

IS-SUCCESS (STATUS-CODE)

MAKE-SAVE-PAGE-PROCESSOR (&OPTIONAL (DIR *DEFAULT-PATHNAME-DEFAULTS*))

REMOVE-COMMENT (LINE)

ROOT-OF-URI (URI)

SKIP-PAGE (C)

SPLIT-ON-NEWLINES (TEXT)

STARTS-WITH (PRE SEQ)

STRING-ONLY-WHITESPACE-P (TEXT)

MACRO

Private

Undocumented

CUT (&BODY BODY)

GENERIC-FUNCTION

Private

GET-PAGE (URI)

Do an HTTP GET request on a uri, closing the stream when the request is done. Returns (values <page content> <actual uri returned>). The returned uri may differ from the requested one due to a redirect.
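
A sketch of consuming both return values, assuming the content comes back as a string and that the function accepts a puri:uri object; example.com is a placeholder:

  ;; Fetch a page and report the final uri, which may differ after redirects.
  (multiple-value-bind (content actual-uri)
      (web-crawler::get-page (puri:parse-uri "http://example.com/"))
    (format t "~&~a returned ~d characters~%"
            (puri:render-uri actual-uri nil)
            (length content)))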

Undocumented

REASON (CONDITION)

VARIABLE

Private

Undocumented

*SITE-ROBOTS-RULES*

*USER-AGENT*

CONDITION

Private

Undocumented

STOP-CRAWLING