Common Lisp Package: COM.INFORMATIMAGO.COMMON-LISP.HTML-PARSER.PARSE-HTML

This package implements a simple HTML parser.

Example:

    (parse-html-string "<html><head><title>Test</title></head> <body><h1>Little Test</h1> <p>How dy? <a href=\"/check.html\">Check this</a></p> <ul><li>one<li>two<li>three</ul></body></html>")
    -->
    ((:html nil
      (:head nil (:title nil "Test")) " "
      (:body nil
       (:h1 nil "Little Test") " "
       (:p nil "How dy? " (:a (:href "/check.html") "Check this")) " "
       (:ul nil (:li nil "one" (:li nil "two" (:li nil "three")))))))

License: AGPL3

Copyright Pascal J. Bourguignon 2003 - 2012

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>

README:

FUNCTION

Public

HTML-ATTRIBUTE (HTML KEY)

RETURN: The ATTRIBUTE named KEY in the HTML element.

HTML-ATTRIBUTES (HTML)

RETURN: The ATTRIBUTES of the HTML element.

HTML-CONTENTS (HTML)

RETURN: The CONTENTS of the HTML element.

HTML-TAG (HTML)

RETURN: The TAG of the HTML element.
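The four accessors above can be illustrated on the list representation shown in the package example, where an element appears as (tag attributes . contents) and the attributes look like a property list. The following is a minimal sketch under that assumption; the SKETCH- functions are hypothetical stand-ins written for this illustration, not the package's own definitions.

```lisp
;; Hypothetical stand-ins for HTML-TAG, HTML-ATTRIBUTES, HTML-ATTRIBUTE and
;; HTML-CONTENTS, assuming an element is a list (tag attributes . contents)
;; and the attributes are a property list, as in the package example.
(defun sketch-html-tag (html) (first html))
(defun sketch-html-attributes (html) (second html))
(defun sketch-html-attribute (html key) (getf (sketch-html-attributes html) key))
(defun sketch-html-contents (html) (cddr html))

;; An element copied from the package example.
(defparameter *sketch-link* '(:a (:href "/check.html") "Check this"))

(sketch-html-tag *sketch-link*)              ; => :A
(sketch-html-attributes *sketch-link*)       ; => (:HREF "/check.html")
(sketch-html-attribute *sketch-link* :href)  ; => "/check.html"
(sketch-html-contents *sketch-link*)         ; => ("Check this")
```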

PARSE-HTML-FILE (PATHNAME &KEY (VERBOSE NIL) (EXTERNAL-FORMAT DEFAULT))

DO: Parse the HTML file PATHNAME.
VERBOSE: When true, writes some information to *TRACE-OUTPUT*.
EXTERNAL-FORMAT: The external format used to open the HTML file.
RETURN: A list of HTML elements.
SEE ALSO: HTML-TAG, HTML-ATTRIBUTES, HTML-ATTRIBUTE, HTML-CONTENTS.

PARSE-HTML-STRING (STRING &KEY (START 0) (END (LENGTH STRING)) (VERBOSE NIL))

DO: Parse the HTML in the STRING (between START and END).
VERBOSE: When true, writes some information to *TRACE-OUTPUT*.
RETURN: A list of HTML elements.
SEE ALSO: HTML-TAG, HTML-ATTRIBUTES, HTML-ATTRIBUTE, HTML-CONTENTS.
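A parse result can be processed with ordinary list operations. The sketch below walks the tree from the package example and collects link targets. The tree is written out literally so that the code runs without the parser loaded, and SKETCH-COLLECT-HREFS is a hypothetical helper, not part of the package. It assumes that text nodes are strings and elements are lists of the form (tag attributes . contents).

```lisp
;; The documented result of PARSE-HTML-STRING for the package example,
;; copied here as a literal so the sketch is self-contained.
(defparameter *sketch-tree*
  '((:html nil
     (:head nil (:title nil "Test"))
     " "
     (:body nil
      (:h1 nil "Little Test")
      " "
      (:p nil "How dy? " (:a (:href "/check.html") "Check this"))
      " "
      (:ul nil (:li nil "one" (:li nil "two" (:li nil "three"))))))))

;; Hypothetical helper: walks a list of nodes and returns the :HREF values
;; of the :A elements in document order. Strings (text nodes) are skipped.
(defun sketch-collect-hrefs (nodes)
  (loop for node in nodes
        when (consp node)
          append (append (when (and (eq (first node) :a)
                                    (getf (second node) :href))
                           (list (getf (second node) :href)))
                         (sketch-collect-hrefs (cddr node)))))

(sketch-collect-hrefs *sketch-tree*)   ; => ("/check.html")
```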

UNPARSE-HTML (HTML &OPTIONAL (STREAM *STANDARD-OUTPUT*))

Writes back on STREAM the reconstituted HTML source.
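To make the idea of reconstituting HTML source concrete, here is a simplified sketch of a serializer for the list representation. SKETCH-UNPARSE is hypothetical and is not the package's UNPARSE-HTML: it assumes elements of the form (tag attributes . contents) with string-valued attributes, always writes a close tag, and does no character escaping.

```lisp
;; Hypothetical, simplified serializer. A node is either a string (written
;; as-is) or a list (tag attributes . contents).
(defun sketch-unparse (node &optional (stream *standard-output*))
  (if (stringp node)
      (write-string node stream)
      (destructuring-bind (tag attributes &rest contents) node
        ;; ~( ~) lowercases the tag and attribute names; ~{ ~} consumes the
        ;; attribute property list two items at a time.
        (format stream "<~(~A~)~{ ~(~A~)=\"~A\"~}>" tag attributes)
        (dolist (child contents)
          (sketch-unparse child stream))
        (format stream "</~(~A~)>" tag)))
  (values))

(with-output-to-string (out)
  (sketch-unparse '(:p nil "How dy? " (:a (:href "/check.html") "Check this"))
                  out))
;; => "<p>How dy? <a href=\"/check.html\">Check this</a></p>"
```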

WRITE-HTML-TEXT (HTML &OPTIONAL (STREAM *STANDARD-OUTPUT*))

Writes on STREAM a textual rendering of the HTML. Some reStructuredText formatting is used. Simple tables are rendered, but colspan and rowspan are ignored.

Private

ATTRIBUTE-NAME (INSTANCE)

RETURN: The NAME slot of the ATTRIBUTE INSTANCE.

ATTRIBUTE-VALUE (INSTANCE)

RETURN: The VALUE slot of the ATTRIBUTE INSTANCE.

CELL-ATTRIBUTES (INSTANCE)

RETURN: The ATTRIBUTES slot of the CELL INSTANCE.

CELL-LINES (INSTANCE)

RETURN: The LINES slot of the CELL INSTANCE.

CLOSE-TAG-ATTRIBUTES (INSTANCE)

RETURN: The ATTRIBUTES slot of the CLOSE-TAG INSTANCE.

CLOSE-TAG-NAME (INSTANCE)

RETURN: The NAME slot of the CLOSE-TAG INSTANCE.

COMMENT-DATA (INSTANCE)

RETURN: The DATA slot of the COMMENT INSTANCE.

DEFINITION-ATTRIBUTES (INSTANCE)

RETURN: The ATTRIBUTES slot of the DEFINITION INSTANCE.

DEFINITION-NAME (INSTANCE)

RETURN: The NAME slot of the DEFINITION INSTANCE.

ELEMENT-DOCUMENTATION (INSTANCE)

RETURN: The DOCUMENTATION slot of the ELEMENT INSTANCE.

ELEMENT-NAME (INSTANCE)

RETURN: The NAME slot of the ELEMENT INSTANCE.

ELEMENT-OPTIONS (INSTANCE)

RETURN: The OPTIONS slot of the ELEMENT INSTANCE.

FOREIGN-DATA (INSTANCE)

RETURN: The DATA slot of the FOREIGN INSTANCE.

HEURISTIC-QUOTE-IN-STRING (STRING START END-OF-STRING)

( *[a-z]+ *= *{string})/?>

HTML-PARSER-NEXT-TOKEN (INSTANCE)

RETURN: The NEXT-TOKEN slot of the HTML-PARSER INSTANCE.

HTML-PARSER-NEXT-VALUE (INSTANCE)

RETURN: The NEXT-VALUE slot of the HTML-PARSER INSTANCE.

HTML-PARSER-SCANNER (INSTANCE)

RETURN: The SCANNER slot of the HTML-PARSER INSTANCE.

HTML-PARSER-TOKEN (INSTANCE)

RETURN: The TOKEN slot of the HTML-PARSER INSTANCE.

HTML-PARSER-VALUE (INSTANCE)

RETURN: The VALUE slot of the HTML-PARSER INSTANCE.

HTML-SCANNER-NEXT-STATE (INSTANCE)

RETURN: The NEXT-STATE slot of the HTML-SCANNER INSTANCE.

HTML-SCANNER-SOURCE (INSTANCE)

RETURN: The SOURCE slot of the HTML-SCANNER INSTANCE.

HTML-SCANNER-STATE (INSTANCE)

RETURN: The STATE slot of the HTML-SCANNER INSTANCE.

HTML-SEQ-FIRST (INSTANCE)

RETURN: The FIRST slot of the HTML-SEQ INSTANCE.

HTML-SEQ-REST (INSTANCE)

RETURN: The REST slot of the HTML-SEQ INSTANCE.

OPEN-TAG-ATTRIBUTES (INSTANCE)

RETURN: The ATTRIBUTES slot of the OPEN-TAG INSTANCE.

OPEN-TAG-CLOSED (INSTANCE)

RETURN: The CLOSED slot of the OPEN-TAG INSTANCE.

OPEN-TAG-NAME (INSTANCE)

RETURN: The NAME slot of the OPEN-TAG INSTANCE.

ROW-ATTRIBUTES (INSTANCE)

RETURN: The ATTRIBUTES slot of the ROW INSTANCE.

ROW-CELLS (INSTANCE)

RETURN: The CELLS slot of the ROW INSTANCE.

ROW-KIND (INSTANCE)

RETURN: The KIND slot of the ROW INSTANCE.

ROW-TAG (INSTANCE)

RETURN: The TAG slot of the ROW INSTANCE.

SPLIT-STRING (STRING &OPTIONAL (SEPARATORS " ") (REMOVE-EMPTY NIL))

STRING: A sequence.
SEPARATORS: A sequence.
RETURN: A list of subsequences of STRING, split upon any element of SEPARATORS. Separators are compared to elements of the STRING with EQL.
NOTE: It's actually a simple split-sequence now.
EXAMPLES:
    (split-string '(1 2 0 3 4 5 0 6 7 8 0 9) '(0))
    --> ((1 2) (3 4 5) (6 7 8) (9))
    (split-string #(1 2 0 3 4 5 0 6 7 8 0 9) #(0))
    --> (#(1 2) #(3 4 5) #(6 7 8) #(9))
    (split-string "1 2 0 3 4 5 0 6 7 8" '(#\space #\0))
    --> ("1" "2" "" "" "3" "4" "5" "" "" "6" "7" "8")
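The splitting behaviour described above can be sketched in a few lines of portable Common Lisp. SKETCH-SPLIT is a hypothetical re-implementation written for this illustration, not the package's SPLIT-STRING. It assumes that empty pieces are kept unless REMOVE-EMPTY is true, which is consistent with the third documented example.

```lisp
;; Hypothetical re-implementation: splits SEQUENCE at every element that is
;; EQL to an element of SEPARATORS. SUBSEQ preserves the sequence type, so
;; lists yield lists, vectors yield vectors and strings yield strings.
(defun sketch-split (sequence separators &optional remove-empty)
  (let ((pieces '())
        (start 0))
    (flet ((emit (end)
             ;; Record the piece between START and END, unless it is empty
             ;; and empty pieces were asked to be dropped.
             (unless (and remove-empty (= start end))
               (push (subseq sequence start end) pieces))))
      (loop for end = (position-if (lambda (item) (find item separators))
                                   sequence :start start)
            while end
            do (emit end)
               (setf start (1+ end)))
      ;; The piece after the last separator.
      (emit (length sequence)))
    (nreverse pieces)))

(sketch-split '(1 2 0 3 4 5 0 6 7 8 0 9) '(0))
;; => ((1 2) (3 4 5) (6 7 8) (9))
(sketch-split "1 2 0 3 4 5 0 6 7 8" '(#\space #\0))
;; => ("1" "2" "" "" "3" "4" "5" "" "" "6" "7" "8")
(sketch-split "1 2 0 3 4 5 0 6 7 8" '(#\space #\0) t)
;; => ("1" "2" "3" "4" "5" "6" "7" "8")
```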

STRING-REPLACE (STRING PATTERN REPLACE &KEY (TEST #'CHAR=))

RETURN: A string built from STRING where all occurrences of PATTERN are replaced by the REPLACE string.
TEST: The function used to compare the elements of the PATTERN with the elements of the STRING.
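As an illustration of the replacement described above, here is a minimal sketch built on the standard SEARCH function. SKETCH-STRING-REPLACE is hypothetical and is not the package's STRING-REPLACE; it assumes a non-empty PATTERN and replaces non-overlapping occurrences from left to right.

```lisp
;; Hypothetical re-implementation: copies STRING to a new string,
;; substituting REPLACE for each occurrence of PATTERN found by SEARCH.
(defun sketch-string-replace (string pattern replace &key (test #'char=))
  ;; An empty pattern would match everywhere and never advance.
  (assert (plusp (length pattern)))
  (with-output-to-string (out)
    (loop with start = 0
          for match = (search pattern string :start2 start :test test)
          while match
          do (write-string string out :start start :end match)
             (write-string replace out)
             (setf start (+ match (length pattern)))
          finally (write-string string out :start start))))

(sketch-string-replace "one-two-three" "-" " and ")
;; => "one and two and three"
(sketch-string-replace "Hello HELLO" "hello" "bye" :test #'char-equal)
;; => "bye bye"
```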

UNSPLIT-STRING (STRING-LIST &OPTIONAL (SEPARATOR " ") &KEY (ADJUSTABLE NIL) (FILL-POINTER NIL) (SIZE-INCREMENT 0))

DO: The inverse of SPLIT-STRING. If no separator is provided then a single space is used.
SEPARATOR: (OR NULL STRINGP CHARACTERP)
ADJUSTABLE: Create the string as an adjustable array.
FILL-POINTER: Add a fill pointer to the string.
SIZE-INCREMENT: Add it to the size needed for the result.
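The joining step can be sketched as follows. SKETCH-UNSPLIT is a hypothetical, simplified counterpart of UNSPLIT-STRING: it accepts only a string or a character as SEPARATOR and ignores the ADJUSTABLE, FILL-POINTER and SIZE-INCREMENT options.

```lisp
;; Hypothetical, simplified join: writes each string of STRING-LIST,
;; with SEPARATOR between consecutive items (a single space by default).
(defun sketch-unsplit (string-list &optional (separator " "))
  (with-output-to-string (out)
    (loop for (item . more) on string-list
          do (write-string item out)
             (when more
               ;; PRINC writes a string or a character without quoting.
               (princ separator out)))))

(sketch-unsplit '("one" "two" "three"))   ; => "one two three"
(sketch-unsplit '("2003" "2012") #\-)     ; => "2003-2012"
```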

Undocumented

%MAKE-HTML-SCANNER (&KEY ((STATE DUM994) NORMAL) ((NEXT-STATE DUM995) NIL) ((SOURCE DUM996) (MAKE-INSTANCE 'PEEK-STREAM STREAM *STANDARD-INPUT*)))

ADVANCE (PARSER)

(SETF ATTRIBUTE-NAME) (NEW-VALUE INSTANCE)

ATTRIBUTE-P (OBJECT)

(SETF ATTRIBUTE-VALUE) (NEW-VALUE INSTANCE)

(SETF CELL-ATTRIBUTES) (NEW-VALUE INSTANCE)

(SETF CELL-LINES) (NEW-VALUE INSTANCE)

CELL-P (OBJECT)

CLEAN-ATTRIBUTE (ATTR)

(SETF CLOSE-TAG-ATTRIBUTES) (NEW-VALUE INSTANCE)

(SETF CLOSE-TAG-NAME) (NEW-VALUE INSTANCE)

CLOSE-TAG-P (OBJECT)

COLLECT-TABLE-CELLS (ELEMENT)

(SETF COMMENT-DATA) (NEW-VALUE INSTANCE)

COMMENT-P (OBJECT)

COMPUTE-MAX-WIDTHS (ROWS)

COPY-ATTRIBUTE (INSTANCE)

COPY-CELL (INSTANCE)

COPY-CLOSE-TAG (INSTANCE)

COPY-COMMENT (INSTANCE)

COPY-DEFINITION (INSTANCE)

COPY-ELEMENT (INSTANCE)

COPY-FOREIGN (INSTANCE)

COPY-HTML-PARSER (INSTANCE)

COPY-HTML-SCANNER (INSTANCE)

COPY-HTML-SEQ (INSTANCE)

COPY-OPEN-TAG (INSTANCE)

COPY-ROW (INSTANCE)

CS-ALPHA-CHAR-P (CH)

CS-CRLF-P (CH)

CS-IDENT-CHAR-P (CH)

CS-SPACE-P (CH)

CS-STRING-D-CHAR-P (CH)

CS-STRING-N-CHAR-P (CH)

CS-STRING-S-CHAR-P (CH)

(SETF DEFINITION-ATTRIBUTES) (NEW-VALUE INSTANCE)

(SETF DEFINITION-NAME) (NEW-VALUE INSTANCE)

DEFINITION-P (OBJECT)

(SETF ELEMENT-DOCUMENTATION) (NEW-VALUE INSTANCE)

ELEMENT-EMPTY-P (ELEMENT-NAME)

ELEMENT-END-FORBIDDEN-P (ELEMENT-NAME)

ELEMENT-END-OPTIONAL-P (ELEMENT-NAME)

(SETF ELEMENT-NAME) (NEW-VALUE INSTANCE)

(SETF ELEMENT-OPTIONS) (NEW-VALUE INSTANCE)

ELEMENT-P (OBJECT)

ELEMENT-START-OPTIONAL-P (ELEMENT-NAME)

ENCASE (TAG-LIST)

FIND-ELEMENT (ELEMENT-NAME)

(SETF FOREIGN-DATA) (NEW-VALUE INSTANCE)

FOREIGN-P (OBJECT)

GENERATE-CONTROL-STRING (WIDTHS)

GENERATE-LINE (WIDTHS)

GET-TOKEN (SCANNER)

(SETF HTML-PARSER-NEXT-TOKEN) (NEW-VALUE INSTANCE)

(SETF HTML-PARSER-NEXT-VALUE) (NEW-VALUE INSTANCE)

HTML-PARSER-P (OBJECT)

(SETF HTML-PARSER-SCANNER) (NEW-VALUE INSTANCE)

(SETF HTML-PARSER-TOKEN) (NEW-VALUE INSTANCE)

(SETF HTML-PARSER-VALUE) (NEW-VALUE INSTANCE)

(SETF HTML-SCANNER-NEXT-STATE) (NEW-VALUE INSTANCE)

HTML-SCANNER-P (OBJECT)

(SETF HTML-SCANNER-SOURCE) (NEW-VALUE INSTANCE)

(SETF HTML-SCANNER-STATE) (NEW-VALUE INSTANCE)

(SETF HTML-SEQ-FIRST) (NEW-VALUE INSTANCE)

HTML-SEQ-P (OBJECT)

(SETF HTML-SEQ-REST) (NEW-VALUE INSTANCE)

MAKE-ATTRIBUTE (&KEY ((NAME DUM1306) NIL) ((VALUE DUM1307) NIL))

MAKE-CELL (&KEY ((ATTRIBUTES DUM2536) NIL) ((LINES DUM2537) NIL))

MAKE-CLOSE-TAG (&KEY ((NAME DUM1422) NIL) ((ATTRIBUTES DUM1423) NIL))

MAKE-COMMENT (&KEY ((DATA DUM1234) NIL))

MAKE-DEFINITION (&KEY ((NAME DUM1344) NIL) ((ATTRIBUTES DUM1345) NIL))

MAKE-ELEMENT (&KEY ((NAME DUM19) NIL) ((OPTIONS DUM20) NIL) ((DOCUMENTATION DUM21) NIL))

MAKE-FOREIGN (&KEY ((DATA DUM1270) NIL))

MAKE-HTML-PARSER (&KEY ((SCANNER DUM1460) NIL) ((TOKEN DUM1461) NIL) ((VALUE DUM1462) NIL) ((NEXT-TOKEN DUM1463) NIL) ((NEXT-VALUE DUM1464) NIL))

MAKE-HTML-SCANNER (&KEY (SOURCE *STANDARD-INPUT*) (STATE NORMAL))

MAKE-HTML-SEQ (&KEY ((FIRST DUM1196) NIL) ((REST DUM1197) NIL))

MAKE-OPEN-TAG (&KEY ((NAME DUM1382) NIL) ((ATTRIBUTES DUM1383) NIL) ((CLOSED DUM1384) NIL))

MAKE-ROW (&KEY ((KIND DUM2494) NIL) ((TAG DUM2495) NIL) ((ATTRIBUTES DUM2496) NIL) ((CELLS DUM2497) NIL))

(SETF OPEN-TAG-ATTRIBUTES) (NEW-VALUE INSTANCE)

(SETF OPEN-TAG-CLOSED) (NEW-VALUE INSTANCE)

(SETF OPEN-TAG-NAME) (NEW-VALUE INSTANCE)

OPEN-TAG-P (OBJECT)

PARSE-AIVS (PARSER)

PARSE-ATTRIBUTE (PARSER)

PARSE-ATTRIBUTES (PARSER)

PARSE-CLOSE-TAG (PARSER)

PARSE-DEFINITION (PARSER)

PARSE-FILE (PARSER)

PARSE-OPEN-TAG (PARSER)

REPORT-ERROR (PARSER MESSAGE)

(SETF ROW-ATTRIBUTES) (NEW-VALUE INSTANCE)

(SETF ROW-CELLS) (NEW-VALUE INSTANCE)

(SETF ROW-KIND) (NEW-VALUE INSTANCE)

ROW-P (OBJECT)

(SETF ROW-TAG) (NEW-VALUE INSTANCE)

WRITE-CHILDREN-TEXT (SELF)

WRITE-INDENTED-CHILDREN (SELF)

WRITE-NOTHING (SELF)

WRITE-PARENTHESIZED-CHILDREN (SELF LEFT RIGHT)

WRITE-TEXT (ELEMENT)

WRITE-TITLE (SELF LINE-CHAR &OPTIONAL ABOVEP)

MACRO

Private

APPENDF (PLACE &REST ARGS &ENVIRONMENT ENV)

Append onto list

DEFATTRIBUTE (ATTR-NAME ELEMENTS TYPE DEFAULT OPTIONS DOCUMENTATION)

DO: Defines an HTML attribute.

DEFELEMENT (NAME OPTIONS &OPTIONAL (DOCUMENTATION "A HTML element."))

DO: Defines an HTML element macro.
NAME: A symbol that will be used to define a macro.
OPTIONS: A list of keywords: :START-OPTIONAL :END-FORBIDDEN :EMPTY :DEPRECATED :LOOSE-DTD or :FRAMESET-DTD.
    :END-FORBIDDEN  -> the close tag is not generated.
    :DEPRECATED     -> warning when the macro is used.
    :EMPTY          -> the macro won't take a BODY.
    :START-OPTIONAL -> ignored.
    :LOOSE-DTD      -> error when *DOCTYPE* isn't :LOOSE.
    :FRAMESET-DTD   -> error when *DOCTYPE* isn't :FRAMESET.
DOCUMENTATION: A string used as documentation string for the macro NAME.

Undocumented

DEFCHARSET (NAME CHARACTERS &KEY COMPLEMENT)

DEFINE-ELEMENT-WRITER (TAG NLS &BODY BODY)

GENERIC-FUNCTION

Private

Undocumented

WALK (HTML)

VARIABLE

Private

*ATTRIBUTES*

List of symbols of all attributes defined.

*ELEMENTS*

List of symbols of all elements defined.

*NL*

This hash-table maps tag symbols (interned in this package) to a list of two elements:
- a list of keywords indicating the newlines that should be written around the element when writing HTML:
    :bo before open tag.
    :ao after open tag.
    :bc before close tag.
    :ac after close tag.
- a function taking the element as parameter (named SELF), used to format the element as text.
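The shape of such a table can be sketched as follows. Everything here is hypothetical and written for illustration only: *SKETCH-NL*, its single entry and SKETCH-WRITE-ELEMENT are not the package's definitions. The sketch assumes the :ac keyword requests a newline after the close tag, and that elements are lists of the form (tag attributes . contents) whose contents are strings.

```lisp
;; Hypothetical table shaped like the description of *NL*:
;; tag symbol -> (newline-keywords text-writer).
(defparameter *sketch-nl* (make-hash-table :test #'eq))

(setf (gethash 'p *sketch-nl*)
      (list '(:bo :ac)
            ;; Text writer: here it simply concatenates the contents.
            (lambda (self) (format nil "~{~A~}" (cddr self)))))

;; Writes ELEMENT as HTML, emitting a newline at each position requested
;; by the keywords registered for its tag.
(defun sketch-write-element (element &optional (stream *standard-output*))
  (let ((newlines (first (gethash (first element) *sketch-nl*))))
    (flet ((nl (position)
             (when (member position newlines)
               (terpri stream))))
      (nl :bo)
      (format stream "<~(~A~)>" (first element))
      (nl :ao)
      (format stream "~{~A~}" (cddr element))
      (nl :bc)
      (format stream "</~(~A~)>" (first element))
      (nl :ac))
    (values)))

;; Writes a newline, then <p>Hello</p>, then another newline.
(sketch-write-element '(p nil "Hello"))

;; The second element of an entry formats the element as text.
(funcall (second (gethash 'p *sketch-nl*)) '(p nil "Hello" " world"))
;; => "Hello world"
```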

Undocumented

*OL-INDEX*

*OL-STACK*

*ROW-KIND*

+CRLF+

+TAG-PACKAGE+

CLASS

Private

Undocumented

ATTRIBUTE

CELL

CLOSE-TAG

COMMENT

DEFINITION

ELEMENT

FOREIGN

HTML-PARSER

HTML-SCANNER

HTML-SEQ

OPEN-TAG

ROW