Given either a string uri or a puri:uri object, returns a function that takes one uri as an argument and returns true if that uri has the same hostname as the original uri.
START-CRAWL (URI PROCESSOR &KEY URI-FILTER (CRAWL-DELAY 10) VERBOSE)
Crawl the web, starting at <uri>. <processor> must be a function that takes three arguments: the current url, the parent of the current url, and the content returned from doing a GET request on that url. Its return value is ignored. <:uri-filter>, if given, is a predicate that should return false when the passed url should not be processed. Each url is an instance of the puri:uri class, which should be easier to deal with than plain text. If you want plain text, call (puri:render-uri url nil), which returns the url as a string. To limit crawling to one site, call (make-same-host-filter uri) and pass the return value in for :uri-filter. <:crawl-delay> is the number of seconds to sleep between requests. The default is 10, and fractional values are allowed. <:verbose>, when true, prints out each uri as it is processed.
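As an illustration of the three-argument processor described above, here is a minimal sketch. The names *page-sizes* and record-page-size are hypothetical and not part of the library. The start-crawl call is left as a comment because it needs the crawler loaded and performs real network requests; its argument values are only examples.

```lisp
;; Hypothetical helper state: an alist of (url . content-length) pairs.
(defvar *page-sizes* '()
  "Pages seen so far, most recent first.")

(defun record-page-size (url parent content)
  "Processor sketch: remember how long each fetched page was.
The crawler ignores the return value."
  (declare (ignore parent))
  (push (cons url (length content)) *page-sizes*))

;; With the crawler library loaded, a single-site crawl might look like this
;; (example.com and the 2.5 second delay are placeholders):
;;
;; (start-crawl "http://www.example.com/" #'record-page-size
;;              :uri-filter (make-same-host-filter "http://www.example.com/")
;;              :crawl-delay 2.5
;;              :verbose t)
```

Keeping the processor a plain function makes it easy to try out on hand-made arguments before pointing it at a live site.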
CRAWL-AND-SAVE-SITE (START-URI DIR &KEY (CRAWL-DELAY 10) VERBOSE)
Crawl a site (one hostname), starting at START-URI, and save every page to a file in DIR. The file index.dat will contain a list of filenames and their uris, because uris don't work so well as filenames.
Gets the robots.txt rules for a url as returned by parse-robots-txt. Caches when possible.
Parses the text of a robots.txt file and gives back a list of rules, each of the form (user-agent disallow-path*), where user-agent is a string, possibly "*". An example rule is ("*" "/cyberworld/map/" "/test/" "/test2")
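To make the rule format concrete, the following sketch shows one way such a parser could be written in portable Common Lisp. It is not the library's implementation: the names trim-ws and parse-robots-txt-sketch are hypothetical, only User-agent and Disallow lines are recognized, and every User-agent line starts a new rule (grouping several agents under one rule is not handled).

```lisp
(defun trim-ws (string)
  "Strip leading and trailing whitespace from STRING."
  (string-trim '(#\Space #\Tab #\Return #\Newline) string))

(defun parse-robots-txt-sketch (text)
  "Return a list of rules of the form (user-agent disallow-path*)."
  (let ((rules '())
        (current nil))              ; rule being built, stored in reverse
    (flet ((finish ()
             ;; Close the rule under construction, if any.
             (when current
               (push (reverse current) rules)
               (setf current nil))))
      (with-input-from-string (in text)
        (loop for raw = (read-line in nil)
              while raw
              do (let* ((line (trim-ws (subseq raw 0 (position #\# raw)))) ; drop comments
                        (colon (position #\: line)))
                   (when colon
                     (let ((field (trim-ws (subseq line 0 colon)))
                           (value (trim-ws (subseq line (1+ colon)))))
                       (cond ((string-equal field "User-agent")
                              (finish)
                              (setf current (list value)))
                             ;; An empty Disallow value means "nothing disallowed".
                             ((and current
                                   (string-equal field "Disallow")
                                   (plusp (length value)))
                              (push value current))))))))
      (finish))
    (nreverse rules)))
```

Field names are compared with string-equal so that "user-agent" and "User-Agent" are treated alike.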
SPLIT-ON-ELEMENT (SEQ EL)
Split any sequence into a list, using the given element as a separator.
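A possible implementation of this kind of splitter, using only standard sequence functions, is sketched below. The name split-on-element-sketch is hypothetical, and the choice to keep empty pieces between adjacent separators is an assumption; the library's own function may behave differently.

```lisp
(defun split-on-element-sketch (seq el)
  "Split SEQ at every occurrence of EL and return the pieces as a list.
Adjacent separators produce empty pieces."
  (loop with start = 0
        for pos = (position el seq :start start)
        collect (subseq seq start pos)   ; POS of NIL means 'to the end'
        while pos
        do (setf start (1+ pos))))

;; Works on strings and on lists alike:
;; (split-on-element-sketch "a,b,,c" #\,)  returns ("a" "b" "" "c")
;; (split-on-element-sketch '(1 0 2) 0)    returns ((1) (2))
```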
Check the robots.txt file for the site of this url, and return whether the url is allowed for robots or not.
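Once rules are available in the (user-agent disallow-path*) form, the allow/deny decision can be sketched as a prefix test on the url's path. The names starts-with-sketch and path-allowed-p-sketch are hypothetical. The sketch assumes the rules have already been fetched and parsed, and it applies every rule whose user-agent is "*" or matches the given agent, which is simpler than choosing the single most specific group.

```lisp
(defun starts-with-sketch (pre seq)
  "True if sequence PRE is a prefix of sequence SEQ."
  (let ((diff (mismatch pre seq)))
    (or (null diff) (= diff (length pre)))))

(defun path-allowed-p-sketch (path rules &optional (agent "*"))
  "True unless some applicable rule disallows a prefix of PATH."
  (loop for (rule-agent . disallowed) in rules
        never (and (or (string= rule-agent "*")
                       (string-equal rule-agent agent))
                   (some (lambda (prefix) (starts-with-sketch prefix path))
                         disallowed))))

;; Using the example rule from the parser description:
;; (path-allowed-p-sketch "/test/page.html"
;;                        '(("*" "/cyberworld/map/" "/test/" "/test2")))
;; returns NIL, while "/index.html" is allowed.
```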
GET-ATTR (ATTR ELEM)
MAKE-SAVE-PAGE-PROCESSOR (&OPTIONAL (DIR *DEFAULT-PATHNAME-DEFAULTS*))
STARTS-WITH (PRE SEQ)
CUT (&BODY BODY)
Do an HTTP GET request on a URI, closing the stream when the request is done. Returns (values <page content> <actual uri returned>). The returned uri may differ from the requested one because of a redirect.