Common Lisp Package: CHTML-MATCHER

README:

chtml-matcher

A simple Lisp-based DSL for extracting information from web pages

chtml-matcher performs pattern-based unification over HTML via a set of compiled nested closures. It uses the closure-html library to parse HTML to lhtml, a lisp form of HTML. A template list is passed to (match-template template lhtml) and returns a bindings object containing an alist of all the extracted information.

The semantics are reasonably intuitive, but might require a little playing around to get a good feel for how to solve most common problems. The API is small and the package.lisp provides pointers to where to look. The whole library is less than 1k lines of code so easy enough to read through.

Download and Dependencies

Clone it from github. The old repository on common-lisp.net is deprecated.

chtml-matcher depends on my home-brew cl-stdutils, closure-html, cl-ppcre, and f-underscore, although all but closure-html could be removed if necessary.

LXML Template Unfication

The DSL provides a light-weight way to extract fields from nested HTML/XML structure represented in LHTML (as produced by closure-html). A template is a declarative representation of substructure with embedded variables that are bound when the substructure matches.

Substructure is loosely matched, such that if any given body element doesn't match, the next child is considered until all the template body elements have matched a lhtml element or the end of the elment has been reached without a match.

Prepending < to a tag enables a depth-first search for that tag so you can avoid specifying the parent path (similar to // in xpath)

Any matching template that consists of a variable reference results in a binding set being created and returned if all elements of the template node successfully match.

Additional reserved operators allow you more flexibility on managing what is matched and how bindings from subtrees are combined

all: match same template multiple times over the children of a given node and store them as a list attached to a fresh bindings list

merge: create a single binding out of each of the sub-bindings. A node body has an implicit merge

nth: find the nth instance that matches the full body of this operator

regex: matches if regex returns register values for a string (as a list)

fn: Run the referenced function symbol on the current parse state and return bindings, t or nil as appropriate.

Example

I've recently been mining some posts from vBulletin sites. I go to the last day's posts, get a list of all the new posts, then go to the thread and grab the post body. The following two templates do 90% of the work. Of course, I have to write code to convert the data I extract to web page fetches, etc.

(defparameter *vbulletin-search-template*  
  '(<tbody nil  
     (all ?records  
       (tr nil  
         (td nil)  
	   (td ((class "alt1"))  
	     (div nil  
	       (a ((href ?thread-uri))  
	         ?thread-name)))  
         (td ((class "alt2") (title ?activity))  
           (div nil ?post-date  
	     (span nil ?post-time)  
	     (a ((href ?user-uri))  
	       ?username)  
	     (a ((href ?last-post-uri)))))))))  
    =>  
    '(:records ((:thread-name . "Thread name")  
                (:thread-uri  . "Thread URI")  
                (:post-date   . "Date String")  
	    (:post-time   . "Time String")  
                (:username    . "Username")  
                (:user-uri    . "URI String")  
                (:last-post-uri . "URI String"))  
               ...) 

This looks for a table body in the search results page, then gets bindings for all matching <tr> elements and puts them within another bindings object bound to :records as specified by 'all'. The pattern pulls out all the user, thread, post and date information for all results. You can match elements on strings, regular expressions and arbitrary function calls as well.

I use subst to customize the following pattern to find a particular post in a page. It replaces 'postmessage?' with a unique id for a post then returns its thread number and the entire post body.

  (defparameter *vbulletin-post-template*  
    `(<tbody nil  
   (tr nil (<a ((name ?post-num))))  
   (tr nil)  
   (tr nil (?post-body <div ((id "post_message_?")))))) 

I use Firefox FireBug to inspect the HTML tree, identify the best unique enclosing context I can specify and then provide enough structure to uniquely capture the data I want. This approach is highly robust to many small HTML changes and should be reasonably fast.

FUNCTION

Public

CLEAR-BINDINGS (DICT)

Clear all bindings

FIND-IN-LHTML (LHTML TAG ATTRIBUTES &OPTIONAL (N 1))

Convenience function for generating state from an lhtml tree

GET-BINDINGS (DICT)

Return an alist of bindings

MAKE-BINDINGS (&OPTIONAL VARIABLE VALUE)

Make bindings, optionally with a seed variable and value

MATCH-TEMPLATE (TEMPLATE DATUM)

Top level matcher

Undocumented

GET-BINDING (VAR DICT)

HTML->LHTML (HTML)

LHTML->HTML (LHTML)

LHTML-CONSTANT-NODE-P (TREE)

LHTML-NODE-ATTRIBUTE-NAME (ATTR)

LHTML-NODE-ATTRIBUTE-VALUE (ATTR)

LHTML-NODE-ATTRIBUTES (NODE)

LHTML-NODE-BODY (NODE)

LHTML-NODE-NAME (NODE)

LHTML-NODE-STRING (LHTML-NODE)

SET-BINDING (VAR VALUE DICT)

Private

AS-KEYWORD (OBJECT)

Convert a string or symbol to a keyword symbol

ATTRIBUTES-EQUAL-P (TATTRS NATTRS)

Verify that attrs1 is a proper subset of attrs2 under equalp of string form of names. Ignore variable attribute values

BIND-ATTRIBUTES (NODE ATTR-TEMPLATE BINDINGS)

Given an attribute template and the current node, when bindings exist and a variable occurs in the template attribute value position, add it to the bindings

BIND-NODE (NODE VARIABLE ATTRIBUTES-TEMPLATE BINDINGS)

Bind the node to the variable including attributes if wanted

BIND-NODE-BODY (NODE VARIABLE BINDINGS)

Bind the body list to the variable in bindings

CHILDREN-DONE-P (STATE)

If body is empty or has one element, return t

COPY-STATE (STATE)

Make a duplicate of the current state

CURRENT-NODE (STATE)

Current node is always first element of the body list (Invariant)

CURRENT-PATH-TAGS (STATE)

List of tags from root to current

FIND-ALL-NODES (STATE TAG ATTRIBUTES)

Find all matching instances of a node in the tree

FIND-AND-BIND-NODE (STATE TAG ATTRIBUTES VARIABLE BINDINGS &OPTIONAL (N 1))

Find a node and bind it and it's attributes if provided

FIND-CHILD (STATE TAG ATTRIBUTES)

Walk the child list of the parser until a match is found. If no more children, returns nil.

FIND-NODE (STATE TAG ATTRIBUTES &OPTIONAL (N 1))

Find the nth occurance of tag and attributes from current state via next-node

FINISH-BODY (STATE)

When we're done with the body, return to prior path, popping as necessary

GET-ATTRIBUTE (NAME ATTRIBUTES)

Given a name, equalp match string forms of name and attribute nmaes

MAKE-LOCAL-STATE (STATE)

Make a new state object rooted at the current node

MAKE-STATE (LHTML)

The initial state consists of a virtual body of which the current node is the top level node of the tree. We keep track of the root of the tree.

MAP-CHILD-BINDINGS (FN STATE BODY-FNS &AUX (COUNT 1000))

Map fn across sequential applications of body-fns for the body list of the provided state. Moves state to end of child list and returns bindings if all match

NEXT-CHILD (STATE)

Linear walk of the current child list, nil on end of list

NEXT-NODE (STATE)

Depth first tree walker. Given the current state, update the state so that (first body) contains the next node in the tree. Returns the side effected state

NODE-MATCH-P (STATE TAG ATTRIBUTES)

Match current node to tag and attributes

PARSER-STATE-BODY (INSTANCE)

@arg[extid]{A @class{extid}} @return[sytemid]{puri:uri or nil} Returns the System ID part of this External ID.

PARSER-STATE-PATH (INSTANCE)

@arg[extid]{A @class{extid}} @return[sytemid]{puri:uri or nil} Returns the System ID part of this External ID.

PARSER-STATE-TREE (INSTANCE)

@arg[extid]{A @class{extid}} @return[sytemid]{puri:uri or nil} Returns the System ID part of this External ID.

RESET-STATE (STATE)

Reset state to the initial state

SEARCH-TAG-P (TAG)

Is this tag a search variable?

START-BODY (STATE NODE-BODY)

Modify state to make the first node of the current node's body the current node and record the state of the current body variable to the path variable. When we pop, we the next node is at the top so we push the rest of the current body

START-CURRENT-BODY (STATE)

Modify state to make the current node the first

SYMBOL->BASE (VAR)

Return a symbol minus the leading character

SYMBOL->BASE-KEYWORD (VAR)

Return a symbol minus the leading character

SYMBOL-BASE (VAR)

Return the base string by stripping the leading character

TAG-EQUAL-P (TAG1 TAG2)

Ensure that two tags are equal

TGEN-BIND-CHILDREN (VARIABLE BODY-FNS)

Same as tgen-merge-children but records the list of bindings from the body-fns to variable in a fresh bindings set

TGEN-FIND (TAG ATTRIBUTES BODY-FN)

Find a node by tag and attributes and bind via tgen-match. State points to the child node after the bound node

TGEN-FIND-BIND (VARIABLE TAG ATTRIBUTES BODY-FN)

Like tgen-find, but uses tgen-match-bind

TGEN-MATCH (TAG ATTRIBUTES BODY-FN)

Try to match the current node to tag & attributes if body-fn is satisfied and return any bound attributes. Moves parse state to the next child node.

TGEN-MATCH-BIND (VARIABLE TAG ATTRIBUTES BODY-FN)

Match node and add a reference to it to the bindings. Parse state is unchanged. Relies on tgen-match debug info

TGEN-MATCH-FN (FN)

Returns: result from calling function Side Effect: next-child

TGEN-MATCH-NTH (COUNT BODY-FN)

Find the nth match for the provided state assuming body-fn moves the state to the next relevant node to test. Basically it's a closure that when it's called, recursively calls body-fn until counter hits zero and returns the last value of body-fn

TGEN-MATCH-REGEX (VARIABLE EXPR)

Returns: binding with variable matched to regex register result or nil

TGEN-MATCH-STRING (STRING)

Returns: t when it matches

TGEN-MATCH-VAR (VARIABLE)

Matches anything and binds it to variable in a fresh binding Returns: bindings

TGEN-MERGE-CHILDREN (BODY-FNS)

Assumes the parse tree is looking at the first element of a tag body and that the body-fns are required sequential matches. Walks children until (current-node subtree) is null or all body-fns have been processed. Merges all the bindings returned from each body-fn. Each body-fn goes to next-child.

VARIABLE-P (SYMBOL)

Identify matching variables by leading #?

Undocumented

CLEAN-VAR (VAR)

COPY-PARSER-STATE (INSTANCE)

INSTANCE-VARIABLE-P (SYMBOL)

LOG-STATE (STATE)

MAKE-PARSER-STATE (&KEY ((TREE DUM4) NIL) ((PATH DUM5) NIL) ((BODY DUM6) NIL))

MATCH-LOG-END (RESULT)

SETFPARSER-STATE-BODY (NEW-VALUE INSTANCE)

PARSER-STATE-P (OBJECT)

SETFPARSER-STATE-PATH (NEW-VALUE INSTANCE)

SETFPARSER-STATE-TREE (NEW-VALUE INSTANCE)

STATE-DONE-P (STATE)

TRACE-TGEN

MACRO

Public

Undocumented

WITH-BINDINGS (VARS BINDINGS &BODY BODY)

Private

ASSERT-STATE

Test the invariant properties of the state

WITH-LOCAL-PARSE-STATE ((VAR STATE) &BODY BODY)

Make it easy to perform a non-distructive parse operation over a subtree based on the current parse state

WITH-TEMPLATE ((TAG ARGS BODY) TEMPLATE &BODY REST)

Generate local vars for various for template components

Undocumented

MATCH-LOG-MESSAGE (OP STATE &REST CRITERION)

TGLAMBDA (ARGS MSG &BODY BODY)

WITH-BODY-BINDS ((VAR STATE FN) &BODY BODY)

WITH-STATE ((STATE) &BODY BODY)

GENERIC-FUNCTION

Public

Undocumented

SET-BINDINGS (BIND1 BIND2)

Private

Undocumented

GEN-TEMPLATE-MATCHER (HEAD TBODY)

GENERATE-TEMPLATE (TEMPLATE)

SLOT-ACCESSOR

Private

Undocumented

BINDINGS (OBJECT)

SETFBINDINGS (NEW-VALUE OBJECT)

VARIABLE

Private

Undocumented

*MATCH-LOGGING*

*MATCH-LOGGING-INDENT*

*MATCH-LOGGING-STREAM*

CLASS

Public

Undocumented

BINDING-DICTIONARY

Private

Undocumented

PARSER-STATE