Common Lisp Package: LANGUTILS

README:

LANGUTILS LIBRARY

This file contains a simple guide to the main functions and files of the langutils library. The code is reasonably documented with doc strings and inline comments. Write to the author if there are any questions. Also read docs/LISP2005-langutils.pdf which is a more involved exposition of the implementation and performance issues in the toolkit.

The library provides a hierarchy of major functions and auxiliary functions related to the structured analysis and processing of open text. The major functions, working from raw text up, are:

  • String tokenization (string -> string)
  • Part of speech tagging (string -> tokens -> vector-document)
  • Phrase chunking (vector-document -> phrases)

We also provide auxiliary functions that operate on strings, tokens or vector-documents. The lisp functions implementing the functionality can be found under the appropriately labeled section in the reference below.

Strings

  • Tokenize a string (separate punctuation from word tokens)
  • POS tag a string or file returning a file, string or vector-document
  • Identify suspicious strings that may become tokens

Tokens

  • String to token-id conversion routines
  • Save/Load token maps
  • Guess the POS tag for a token (lexicon-based, also includes the porter stemmer)
  • Identify suspicious tokens
  • Identify stopwords; words used primarily as syntactic combinators
  • Lookup words in the lexicon
  • Get possible parts of speech for known words
  • Lemmatize a token (find the root lemma for a given surface form)
  • Generate all surface forms of a root word

Vector-Documents:

  • Generate phrases using the regex chunker

Miscellaneous:

  • Concept representation: simple lemmatized noun or verb phrases can be treated as equal abstract notions; a CLOS class wrapper is provided.

INTERFACE REFERENCE

This documents the important functions of the langutils toolkit. Documentation entries are of the form:


function ( args )

Input:

  • arg1 - description
  • arg2 - description

Output: description

Notes: discussion of use cases, etc.

Functions are explicitly referenced by putting () around them; variables and parameters are referenced by their upper-case names.

TOKENS and TOKENIZATION


tokenize-stream ( stream &key (by-sentence nil) (fragment "") ) Input:

  • stream - A standard lisp stream containing the characters to analyze; the stream can be of any length
  • by-sentence - Stop the tokenization process after each processed sentence, meaning each validly parsed period, exclamation point or question mark
  • fragment - Provide a fragment from a prior call to tokenize-stream at the beginning of the parse stream

Output: (multiple values)

  • 1 - parsing success (t) or failure (nil)
  • 2 - the current index into the stream, starting from 0 on every call
  • 3 - a string containing the tokenized data parsed up to 'index'
  • 4 - if parsing was a success, a fragment of any unparsed data (primarily in by-sentence mode)

Notes: This function is intended to be called all at once or in batches. For large strings or files it should be called in by-sentence mode in a loop that captures any fragment and passes it to the next call. The function operates by grabbing one character at a time from the stream and writing it into a temporary array. When it reaches a punctuation character, it inserts whitespace, then backs up to the beginning of the current token, checks whether the token should have included the punctuation, and fixes up the temporary array. Upon completion of the current parse (end of stream or end of sentence) it returns the values described above.
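
For large inputs the by-sentence loop looks roughly like the following sketch (not taken from example.lisp; the input file name and the process-sentence consumer are placeholders):

  (with-open-file (stream "corpus.txt")            ; placeholder input file
    (loop with fragment = ""
          do (multiple-value-bind (success index text new-fragment)
                 (tokenize-stream stream :by-sentence t :fragment fragment)
               (declare (ignore index))
               (unless success (return))
               (process-sentence text)             ; placeholder consumer
               (setf fragment (or new-fragment "")))))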


tokenize-string (string) Input:

  • string - a string of English natural language text

Output: (string)

Returns a string which is the result of calling (tokenize-stream) on the stream version of the input string.


tokenize-file (source target &key (if-exists :supersede)) Input:

  • source - The source file name as a string or pathname
  • target - The target file name as a string or pathname

id-for-token ( token ) Input:

  • token - A string representing a primitive token

Output: A fixnum providing a unique id for the provided string token.

Notes: Tokens are case sensitive, so 'The', 'the' and 'THE' all map to different tokens but should map to the same entry in the lexicon. The root form of a lexicon word is the lower case representation.
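
For illustration, a short REPL-style sketch of the mapping (the actual fixnum values vary between sessions):

  (id-for-token "dog")                  ; => some fixnum, e.g. 2317
  (token-for-id (id-for-token "dog"))   ; => "dog"
  (ids-for-tokens '("the" "dog"))       ; => a list of two fixnums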


token-for-id ( id ) Input:

  • id - A fixnum id

Output: The original token string.


tokens-for-ids ( ids ) Input:

  • ids - A list of fixnum ids

Output: A list of string representations of each id


save-token-map ( filename ) Input:

  • filename - A path or string to save token information to

Output: t on success or nil otherwise

Notes: This procedure will default to the filename in default-token-map-file-int which can be set via the asdf-config parameter 'token-map'


load-token-map ( filename ) Input:

  • filename - A path or string to load token information from

Output: t on success or nil otherwise

Notes: This procedure will default to the filename in default-token-map-file-int which can be set via the asdf-config parameter 'token-map'


suspicious-word? ( word ) Input:

  • word - A fixnum id for a word to test

Output: A boolean representing whether this word has been labelled as fishy


suspicious-string? ( string ) Input:

  • string - Any string

Output: A boolean representing whether the string is fishy, as determined by parameters set in tokens.lisp (maximum numbers, total length and other characters in the token). This is used inside id-for-token to keep the hash behind suspicious-word? up to date.

POS TAGGING AND OPERATIONS ON TOKENS


tag ( string ) Input:

  • string - An input string to tag. Input should be less than 100k characters if possible.

Output: A tagged string using the format <word>/<tag> where the tags are symbols taken from the Penn Treebank 2 tagset. An actual slash character in the text will show up as '///', i.e. a slash word and a slash tag, slash-separated.

Note: This procedure calls the tokenizer to ensure that the input string is properly tokenized in advance.
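
A quick sketch of a call (the exact tags depend on the lexicon and rule files):

  (tag "The dog chased the cat.")
  ;; => e.g. "The/DT dog/NN chased/VBD the/DT cat/NN ./."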


tag-tokenized ( string ) Input:

  • string - An input string to tag. The string is assumed to be tokenized already and should be less than 100k bytes in size

Output: A tagged string as in 'tag' above.


vector-tag ( string ) Input:

  • string - as in tag above

Output: A CLOS object of type vector-document with the token array initialized to fixnum representations of the word tokens and the tag array initialized with symbols representing the selected tags.


vector-tag-tokenized ( string &key end-tokens ) Input:

  • string - as in tag-tokenized above
  • end-tokens - A list of string tokens to add to the end of the tokenization array. Sometimes this is useful to ensure a closing period if you are doing tagging of structured NL data

Output: A vector-document as in vector-tag

Note: As with tag-tokenized, this interface does not tokenize the input string.
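
As a sketch, one way to build and inspect the result (vector-document-words and document-tags are listed in the reference sections below):

  (let ((doc (vector-tag "The dog sleeps.")))
    (list (vector-document-words doc)   ; word strings recovered from the token ids
          (document-tags doc)))         ; vector of tag symbols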


get-lexicon-entry ( word ) Input:

  • word - Token id or token string

Output: A lexicon-entry structure related to the lexical characteristics of the token

Notes: The lexicon-entry can be manipulated with a set of accessor functions: lexicon-entry-tag, lexicon-entry-tags, lexicon-entry-id, lexicon-entry-roots, lexicon-entry-surface-forms, lexicon-entry-case-forms and get-lexicon-default-pos. Not all of these functions are exported from the library package, however.
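
For example, a minimal sketch of inspecting a word's lexicon information:

  (let ((entry (get-lexicon-entry "run")))
    (lexicon-entry-tags entry))   ; => the word's possible tags, most likely first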


initial-tag ( token ) Input:

  • token - A string token

Output: A keyword symbol of the initially guessed tag (:PP :NN, etc)

Notes: Provides an initial guess based purely on lexical features and lexicon information of the provided string token.


read-file-as-tagged-document ( file ) Input:

  • file - A string filename or path object

Output: A vector-document representing the tagged contents of file

Notes: Loads the file into a string then calls vector-tag


read-and-tag-file ( file ) Input:

  • file - A path string or a path object

Output: A string with tag annotations of the contents of file

Notes: Uses tag on the string contents of file


get-lemma ( word &key pos (noun t) porter ) Input:

  • word - String of the word to find the lemma for
  • pos - The part of speech of the lemma to return (nil otherwise)
  • noun - Whether to stem nouns to the singular form
  • porter - Whether to use the porter algorithm if a word is unknown

Output: A string representing the lemma of the word, if found
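
A small sketch (results depend on the stems file and the keyword options):

  (get-lemma "dogs")               ; => "dog"
  (get-lemma "barked" :porter t)   ; fall back to the Porter stemmer if the word is unknown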


get-lemma-for-id ( id &key pos (noun t) porter ) Input:

  • id - The token id to find the lemma of
  • pos - As above
  • noun - ""
  • porter - ""

Output: The lemma id


lemmatize ((sequence list/array) &key strip-det pos (noun t) porter last-only ) Input:

  • list/array - The input sequence of token ids as a list or an array
  • strip-det - Remove determiners from the sequence
  • pos - Part of speech of root of terms
  • noun - Whether to stem nouns
  • porter - Whether to use the porter stemmer
  • last-only - lemmatize the last token in the sequence only

Output: Return the lemmatized list of tokens

Notes: The main method for performing lemmatization. Valid on lists and arrays of fixnum values only. Useful for getting the lemmatization of short phrases.
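
A minimal sketch combining the token utilities above with lemmatize:

  (tokens-for-ids
   (lemmatize (ids-for-tokens '("the" "dogs")) :strip-det t))
  ;; => e.g. ("dog")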


morph-surface-forms ( root &optional pos-class ) Input:

  • root - The root form to expand
  • pos-class - if provided (V - verb, N - noun, A - Adverb) the class of surface forms to generate

Output: A list of surface form ids


morph-surface-forms-text ( root &optional pos-class )

String to string form of the above function
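
A string-level sketch (the pos-class symbols V, A and N are interned in the langutils package; results depend on the lexicon):

  (morph-surface-forms-text "run" 'V)
  ;; => e.g. ("run" "runs" "ran" "running")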


stopword? ( id ) Input:

  • id - Input token id

Output: boolean


concise-stopword? ( id ) Input:

  • id - Input token id

Output: boolean


contains-is? ( ids ) Input:

  • ids - a list of fixnum token ids

Output: boolean

Notes: A sometimes-useful utility. Searches the list for the token id of 'is'.


string-stopword?, string-concise-stopword?, string-contains-is? ( string )

The three functions above, but accepting string or list-of-string arguments

CHUNKING


chunk ( text ) Input:

  • Text - raw string text

Output: A list of phrases referencing a document created from the text

Note: Runs the tokenizer on the text prior to POS tagging


chunk-tokenized ( text ) Input:

  • text - raw string text

Output: A list of phrases referencing a document created from the text

Note: Does not run the tokenizer on text prior to POS tagging


get-all-chunks ( doc ) Input:

  • doc - a vector-document

Output: A list of chunks of all the primitive types (verb, adverb, preps and nouns)

Related functions:

  • get-nx-chunks ( doc )
  • get-vx-chunks ( doc )
  • get-ax-chunks ( doc )
  • get-pp-chunks ( doc )
  • get-event-chunks ( doc )
  • get-verb-arg-chunks ( doc )

Notes:

  • Events are concatenated verb-noun chunks
  • verb-arg chunks look for verb-pp-noun chunk groups

These two functions could search over sequences of phrases, but usually such searches are done on their own rather than on top of a more primitive verb, noun, adverb decomposition. Also note that common preposition idioms (by way of, in front of, etc.) are not typically captured properly and would need to be special-cased (i.e. they would be VP-sNP-P-NP, where sNP is a special type of NP, instead of the usual VP-P-NP verb-arg formulation).
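
A short chunking sketch (phrase->string and get-nx-chunks are listed in the reference below; the phrases actually produced depend on the tagger):

  ;; Chunk raw text and print each phrase
  (dolist (p (chunk "The black dog chased a cat into the yard."))
    (format t "~A~%" (phrase->string p)))

  ;; Or tag first and pull out just the noun phrases
  (get-nx-chunks (vector-tag "The black dog chased a cat into the yard."))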

CONCEPTS

Concepts are a CLOS abstraction over token sequences that establishes identity over lemmatized phrases. This supports special applications (ConceptNet, LifeNet) at the MIT Media Lab but might be more generally useful.


concept - The 'concept' is a CLOS object with the following methods:

  • concept->words - Return a list of token strings
  • concept->string - Return a string representing the concept
  • concept->token-array - Return an array representing the concept
  • phrase->concept - Create a concept from a phrase
  • words->concept - Create a concept from a list of token ids
  • token-array->concept - ""
  • associate-concepts - Take a list of phrases, lists or token-arrays and find the concept each one represents. Returns a list of pairs of the form (phrase concept)
  • conceptually-equal - equality under lemmatization; also works across phrases and arrays of tokens
  • concept-contains - subset relations

lookup-canonical-concept-instance ( ta ) Input:

  • ta - A token array or list of tokens

Output: A concept instance
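
A small sketch of concept identity (whether two phrasings unify depends on the lemmatizer and determiner stripping):

  (let ((c1 (string->concept "throws the ball"))
        (c2 (string->concept "throw a ball")))
    (list (concept->string c1)
          (conceptually-equal c1 c2)))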

EXAMPLE USES

See the file example.lisp. This shows basic use of the tagger, tokenizer, lemmatizer and chunker interfaces.
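
As a teaser, a minimal end-to-end sketch (not taken from example.lisp):

  (init-langutils)   ; only needed if auto-initialization is disabled
  (tokenize-string "Don't worry, be happy!")
  (tag "The quick brown fox jumps over the lazy dog.")
  (chunk "The quick brown fox jumps over the lazy dog.")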

More examples of use can be generated if enough mail is sent to the author to invoke a guilt-driven re-release of the library with improved documentation.

FUNCTION

Public

ASSOCIATE-CONCEPTS (PHRASES)

Return the list of phrase/list/token-arrays as pairs with the first element being the original and the second being a canonicalized concept instance

CHUNK (TEXT)

Returns a phrase-list for the provided text

CHUNK-TOKENIZED (TEXT)

Returns a phrase-list for the provided tokenized string

CONCISE-STOPWORD? (ID)

Identifies id as a 'concise-stopword' word. concise-stopwords are a *very* small list of words. Mainly pronouns and determiners

CONTAINS-IS? (IDS)

Tests list of ids for 'is' words

GET-LEMMA (WORD &KEY POS (NOUN T) PORTER)

Provides the root word string for the provided word string

GET-LEMMA-FOR-ID (ID &KEY POS (NOUN T) (PORTER NIL))

Returns a lemma id for the provided word id. pos only returns the root for the provided pos type. noun will stem nouns to the singular form by default and porter determines whether the porter algorithm is used for unknown terms. pos type causes the noun argument to be ignored

GET-TOKEN-COUNT

Return the current token counter

ID-FOR-TOKEN (TOKEN &OPTIONAL (TRIM T))

This takes string 'tokens' and returns a unique id for that character sequence - beware of whitespace, etc.

INITIAL-TAG

Return an initial tag for a given token string using the langutils lexicon and the tagger lexical rules (via guess-tag)

LEXICON-ENTRY-ID (INSTANCE)

Returns the id stored in the given lexicon-entry instance.

LEXICON-ENTRY-ROOTS (INSTANCE)

Returns the root forms (pairs of pos type and root) stored in the given lexicon-entry instance.

LEXICON-ENTRY-SURFACE-FORMS (INSTANCE)

Returns the surface forms stored in the given lexicon-entry instance.

LEXICON-ENTRY-TAGS (INSTANCE)

Returns the tags, ordered by probability, stored in the given lexicon-entry instance.

MAKE-PHRASE (TEXT-ARRAY TAG-ARRAY &OPTIONAL TYPE)

Take two arrays of text and tags and create a phrase that points at a vdoc created from the two arrays

MORPH-CASE-SURFACE-FORMS (ROOT &OPTIONAL (POS-CLASS NIL))

All cases of morphological surface forms of the provided root

MORPH-SURFACE-FORMS (ROOT &OPTIONAL (POS-CLASS NIL))

Takes a word or id and returns all surface form ids, or all forms of class 'pos-class', where pos-class is one of the symbols langutils::V, A or N

PHRASE->CONCEPT (P &KEY LEMMATIZED)

Create a canonical concept from an arbitrary phrase by removing determiners and lemmatizing verbs.

STOPWORD? (ID)

Identifies id as a 'stopword'

STRING-CONCISE-STOPWORD? (WORD)

Check the word if it is a 'concise-stopword' word. concise-stopwords are a *very* small list of words. Mainly pronouns and determiners

STRING-CONTAINS-IS? (WORDS)

Checks the list for a string containing 'is'

STRING-TAG (STRING &OPTIONAL (STREAM T))

Tokenizes and tags the string returning a standard tagged string using '/' as a separator

SUSPICIOUS-STRING? (STRING)

Determine if the alpha-num and number balance is reasonable for linguistic processing or if non-alpha-nums are present

TOKEN-FOR-ID (ID)

Return a string token for a given token id

TOKENIZE-STREAM (STREAM &KEY (BY-SENTENCE NIL) (FRAGMENT ) &AUX (INDEX 0) (START 0) (CH ) (WS ) (STATUS RUNNING) (SENTENCE? NIL))

Converts a stream into a string and tokenizes, optionally, one sentence at a time, which is nice for large files. Pretty hairy code: a token processor inside a stream scanner. The stream scanner walks the input stream and tokenizes all punctuation (except periods). After a sequence of non-whitespace has been read, the inline tokenizer looks at the end of the string for mis-tokenized words (can ' t -> ca n't)

TOKENIZE-STRING (STRING)

Returns a fresh, linguistically tokenized string

TOKENS-FOR-IDS (IDS)

Return a list of string tokens for each id in ids

VECTOR-TAG (STRING)

Returns a 'document' which is a class containing a pair of vectors representing the string in the internal token format. Handles arbitrary data.

VECTOR-TAG-TOKENIZED

Returns a document representing the string using the internal token dictionary; requires the string to be tokenized. Parses the string into tokens (whitespace separators), then populates two temporary arrays with token ids and initial tags. Contextual rules are applied and a new vector document is produced which is a copy of the enclosed data. This is all done at once so good compilers can open-code the array refs and simplify the calling of the labels functions.

Undocumented

CLEAN-LANGUTILS

CLEAN-TAGGER

FORCE-CONCEPT (C)

GET-LEXICON-CASE-FORMS (WORD)

GET-LEXICON-DEFAULT-POS (WORD)

GET-LEXICON-ENTRY (WORD)

HEAD-VERB (PHRASE &KEY (FILTER-COMMON T))

HEAD-VERBS (PHRASES &KEY (FILTER-COMMON T))

IDS-FOR-TOKENS (TOKENS)

IN-POS-CLASS? (ELEMENT CLASS)

INIT-LANGUTILS

INIT-TAGGER (&OPTIONAL LEXICAL-RULE-FILE CONTEXTUAL-RULE-FILE)

(SETF LEXICON-ENTRY-ID) (NEW-VALUE INSTANCE)

(SETF LEXICON-ENTRY-ROOTS) (NEW-VALUE INSTANCE)

(SETF LEXICON-ENTRY-SURFACE-FORMS) (NEW-VALUE INSTANCE)

LEXICON-ENTRY-TAG (ENTRY)

(SETF LEXICON-ENTRY-TAGS) (NEW-VALUE INSTANCE)

MAKE-CONCEPT (TA)

MAKE-PHRASE-FROM-SENTENCE (TOK-STRING &OPTIONAL TAG-ARRAY)

MAKE-PHRASE-FROM-VDOC (DOC START LEN &OPTIONAL (TYPE NIL))

MAKE-VECTOR-DOCUMENT (TEXT &OPTIONAL TAGS)

MORPH-SURFACE-FORMS-TEXT (ROOT &OPTIONAL POS-CLASS)

PHRASE-WORDS (PHRASE &OPTIONAL INDEX)

READ-AND-TAG-FILE (FILE)

READ-FILE-AS-TAGGED-DOCUMENT (FILE)

RESET-LANGUTILS

ROOT-NOUN (PHRASE)

ROOT-NOUNS (PHRASES)

STRING->CONCEPT (S &KEY (LEMMATIZED NIL))

STRING->TOKEN-ARRAY (STRING)

STRING-STOPWORD? (WORD)

STRING-TAG-TOKENIZED (STRING &OPTIONAL (STREAM T))

TAG (STRING)

TAG-TOKENIZED (STRING)

TOKEN-ARRAY->CONCEPT (TOKENS &KEY (LEMMATIZED NIL))

VECTOR-DOCUMENT (INPUT)

WORDS->CONCEPT (SLIST &KEY (LEMMATIZED NIL))

Private

ADD-BASIC-ENTRY (WORD TAGS &KEY ROOTS SURFACE)

Add a word and its probability-ordered tags to the lexicon

ADD-ROOT (WORD POS-ROOT-PAIR)

Add a root form to word if one does not already exist

ADD-ROOTS (WORD ROOT-PAIRS)

Set the root list (pairs of pos_type/root) for the entry for 'word'

ADD-SURFACE-FORM (ROOT SURFACE-FORM)

Add a surface form to a root word

ALL-VX+NX-PHRASES (PHRASES)

Overly hairy function for finding all vx phrases that are followed by nx phrases. get-event-chunks is a better way to do this.

APPLY-RULES (DATUM RULE-LIST)

Apply the rules in rule-list to datum, presuming that each rule's return value can be passed as input to the next rule

DEFAULT-TAG

Simple default tagging based on capitalization of token string

ENSURE-TOKEN-COUNTS

Reset token count if not already set

GEN-RULE-ARG-BINDINGS (PATTERN)

Generate let bindings for the args referenced in the match pattern

GEN-RULE-ARG-DECLS (PATTERN)

Generate type declarations for canonical variables from table entry

GEN-RULE-CLOSURE (TEMPLATE)

Generate the code for the rule closure as one of the cond forms matching the name of the closure pattern to the rule pattern

GEN-RULE-CLOSURE-DECL

Optimize the compiled closure through type and optimization declarations

GEN-RULE-MATCH (PATTERN)

Generate the conditional code to match this rule

GET-BIND-ENTRY (VAR)

Given a canonical variable name, create its let binding and extraction expression from the rule file entry

GUESS-TAG

Using rules in rule-table guess the tag of the token 'token'

INIT-LEXICON (&OPTIONAL LEXICON-FILE LEMMA-FILE)

Populates the lexicon with 'word tag1 tag2' structured lines from lexicon-file

LEXICON-ENTRY-CASE-FORMS (INSTANCE)

Returns the case forms stored in the given lexicon-entry instance.

LOAD-LEXICAL-RULES (RULE-FILE &OPTIONAL BIGRAM-HASH WORD-HASH &AUX (RULE-LIST NIL))

Return a list of closures implementing the lexical rules in rule-file, used to tag words not found in the lexicon

MAKE-LEXICAL-RULE (LIST LH BH WH)

Look through list for rule name

RESET-TOKEN-COUNTS

Reset all the token datastructures to an initialized but empty state

SELECT-TOKEN (TOKEN &KEY STRIP-DET NOUN POS PORTER (LEMMA T))

Internal per-token function

TEST-PHRASE (TEXT)

Prints all the phrases found in the text for simple experimenting

Undocumented

*GET-DETERMINERS*

ADD-EXTERNAL-MAPPING (ID-FOR-TOKEN TOKEN-FOR-ID ADD-TO-MAP TOKEN-COUNTER)

ADD-ROOT-FORMS (WORD POS-ROOT-PAIRS)

ADD-TO-MAP-HOOK (TOKEN ID)

ADD-UNKNOWN-LEXICON-ENTRY (WORD GUESSED-TAG)

APPLY-CONTEXTUAL-RULES

CLEAN-LEXICON

CLEAN-STOPWORDS

CONSONANTP (STR I)

COPY-LEXICON-ENTRY (INSTANCE)

CVC (STR LIM)

DOUBLEC (STR I)

DUPLICATE-FROM

ENDS (STR ENDING)

ENSURE-COMMON-VERBS

ENSURE-CONCEPT (TOKENS)

ENSURE-LEXICON-ENTRY (WORD &KEY ROOTS SURFACE)

HANDLE-CONFIG-ENTRY (ENTRY)

ID-FOR-TOKEN-HOOK (TOKEN)

IDS-FOR-STRING (STRING)

INIT-CONCISE-STOPWORDS (&OPTIONAL PATH)

INIT-STOPWORDS (&OPTIONAL PATH)

INIT-WORD-TEST

INITIALIZE-TOKENS

(SETF LEXICON-ENTRY-CASE-FORMS) (NEW-VALUE INSTANCE)

LEXICON-ENTRY-P (OBJECT)

LOAD-CONTEXTUAL-RULES (RULE-FILE &AUX RULES)

LOAD-TAGGER-FILES (LEXICAL-RULES CONTEXTUAL-RULES &KEY BIGRAMS WORDLIST)

M (STR LIM)

MAKE-CASES (WORD)

MAKE-CONTEXTUAL-RULE

MAKE-LEXICON-ENTRY (&KEY TAGS ID ROOTS SURFACE-FORMS CASE-FORMS)

PERSON-TOKEN-OFFSET (ARRAY)

R (STR S SFP)

READ-CONFIG

READ-FILE-TO-STRING (FILE)

RELATIVE-PATHNAME (PATH)

RETURN-VECTOR-DOC

SET-LEXICON-ENTRY (WORD ENTRY)

SETTO (STR SUFFIX)

STEM (STR)

STEP1AB (STR)

STEP1C (STR)

STEP2 (STR)

STEP3 (STR)

STEP4 (STR)

STEP5 (STR)

TEMP-PHRASE

TEST-CONCEPT-EQUALITY

TEST-VECTOR-TAG-TOKENIZED (STRING)

TOKEN-ARRAY->WORDS (TOKENS)

TOKEN-COUNTER-HOOK

TOKEN-FOR-ID-HOOK (ID)

VOWELINSTEM (STR)

WRITE-TEMP

MACRO

Private

DEF-CONTEXTUAL-RULE-PARSER (NAME &BODY TEMPLATE-LIST)

Given a list of structures, defines a generator named 'name' that takes a Brill contextual rule list (list of strings) and generates an applicable closure. The closure accepts an argument list of (tokens tags offset) and will apply the rule and related side effect to the two arrays at the provided offset. Patterns are to be given in the form: ("SURROUNDTAG" (match (0 oldtag) (-1 tag1) (+1 tag2)) => (setf oldtag newtag))
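
A sketch of invoking the macro with the single pattern shown above (the generator name here is hypothetical; the real templates live in the tagger source):

  (def-contextual-rule-parser make-my-contextual-rule
    ("SURROUNDTAG" (match (0 oldtag) (-1 tag1) (+1 tag2)) => (setf oldtag newtag)))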

Undocumented

WITH-STATIC-MEMORY-ALLOCATION (NIL &REST BODY)

WRITE-LOG (NAME MSG &REST ARGS)

GENERIC-FUNCTION

Public

Undocumented

ADD-WORD (P INDEX WORD TAG)

CHANGE-WORD (P INDEX NEW-TOKEN &OPTIONAL NEW-POS)

CONCAT-CONCEPTS (&REST CONCEPTS)

CONCEPT->STRING (CNAME)

CONCEPT->TOKEN-ARRAY (CNAME)

CONCEPT->WORDS (CNAME)

CONCEPT-CONTAINS (CSUPER CSUB)

CONCEPTUALLY-EQUAL (PH1 PH2)

FIND-PHRASE (P DOC &KEY (MATCH ALL) (START 0) (IGNORE-START NIL) (IGNORE-END NIL) (LEMMA NIL) (CONCEPT-TERMS NIL))

FIND-PHRASE-INTERVALS (P DOC &KEY (CONCEPT-TERMS NIL) (LEMMA NIL) (START 0) (MATCH ALL))

GET-ADVERB-CHUNKS (DOC &OPTIONAL INTERVAL)

GET-ANNOTATION (DOC KEY)

GET-EVENT-CHUNKS (DOC &OPTIONAL INTERVAL)

GET-EXTENDED-EVENT-CHUNKS1 (DOC &OPTIONAL INTERVAL)

GET-EXTENDED-EVENT-CHUNKS2 (DOC &OPTIONAL INTERVAL)

GET-IMPERATIVE-CHUNKS (DOC &OPTIONAL INTERVAL)

GET-NX-CHUNKS (DOC &OPTIONAL INTERVAL)

GET-P-CHUNKS (DOC &OPTIONAL INTERVAL)

GET-PP-CHUNKS (DOC &OPTIONAL INTERVAL)

GET-TAG (DOC OFFSET)

GET-TOKEN-ID (DOC OFFSET)

GET-VX-CHUNKS (DOC &OPTIONAL INTERVAL)

LEMMATIZE (SEQUENCE &KEY LAST-ONLY PORTER (NOUN T) POS STRIP-DET)

LEMMATIZE-PHRASE (P &OPTIONAL (OFFSET))

LENGTH-OF (DOC)

MAKE-ALTERABLE-PHRASE (P)

PHRASE->STRING (P &KEY (WITH-TAGS NIL) (WITH-INFO NIL) (NEWLINE NIL))

PHRASE->TOKEN-ARRAY (P)

PHRASE-DISTANCE (P1 P2)

PHRASE-EQUAL (PH1 PH2)

PHRASE-LEMMAS (PH)

PHRASE-LENGTH (P)

PHRASE-OVERLAP (PH1 PH2)

READ-VECTOR-DOCUMENT (FILENAME)

READ-VECTOR-DOCUMENT-TO-STRING (DOC &KEY (WITH-TAGS T))

REMOVE-WORD (P INDEX)

SET-ANNOTATION (DOC KEY VALUE &KEY (METHOD OVERRIDE))

SUSPICIOUS-WORD? (WORD)

UNSET-ANNOTATION (DOC KEY)

VECTOR-DOCUMENT-STRING (DOC &KEY (WITH-TAGS NIL) (WITH-NEWLINE NIL))

VECTOR-DOCUMENT-WORDS (DOC)

WRITE-VECTOR-DOCUMENT (DOC FILENAME &KEY (WITH-TAGS T) (IF-EXISTS SUPERSEDE))

Private

Undocumented

CLEAR-CONCEPT-CACHE

COPY-PHRASE (P &OPTIONAL (ANNOTATIONS))

DOCUMENT-WINDOW-AS-STRING (DOCUMENT START END)

GET-BASIC-CHUNKS (DOC &OPTIONAL INTERVAL)

LOOKUP-CANONICAL-CONCEPT-INSTANCE (TA)

MAKE-DOCUMENT-FROM-PHRASE (P)

REGISTER-NEW-CONCEPT-INSTANCE (C)

VECTOR-DOC-AS-IDS (DOC)

VECTOR-DOC-AS-WORDS (DOC)

SLOT-ACCESSOR

Public

TOKEN-VECTOR (OBJECT)

Stores the representation of the concept as an array of token ids

Undocumented

DOCUMENT-ANNOTATIONS (OBJECT)

(SETF DOCUMENT-ANNOTATIONS) (NEW-VALUE OBJECT)

DOCUMENT-TAGS (OBJECT)

(SETF DOCUMENT-TAGS) (NEW-VALUE OBJECT)

DOCUMENT-TEXT (OBJECT)

(SETF DOCUMENT-TEXT) (NEW-VALUE OBJECT)

PHRASE-DOCUMENT (OBJECT)

(SETF PHRASE-DOCUMENT) (NEW-VALUE OBJECT)

PHRASE-END (OBJECT)

(SETF PHRASE-END) (NEW-VALUE OBJECT)

PHRASE-START (OBJECT)

(SETF PHRASE-START) (NEW-VALUE OBJECT)

PHRASE-TYPE (OBJECT)

(SETF PHRASE-TYPE) (NEW-VALUE OBJECT)

Private

Undocumented

ALTERED-PHRASE-CUSTOM-DOCUMENT (OBJECT)

(SETF ALTERED-PHRASE-CUSTOM-DOCUMENT) (NEW-VALUE OBJECT)

PHRASE-ANNOTATIONS (OBJECT)

(SETF PHRASE-ANNOTATIONS) (NEW-VALUE OBJECT)

VARIABLE

Private

*AUTO-INIT*

Whether to call initialize-langutils when the .fasl is loaded

*CONCEPT-STORE-SCRATCH-ARRAY*

Allows us to lookup concepts from arrays without allocating lots of unnecessary data

*CONTEXTUAL-RULE-ARGS*

The templates for parsing contextual rules and constructing matching templates over word/pos arrays

*DEFAULT-CONCISE-STOPWORDS-FILE*

Path to a *very* small list of words. Mainly pronouns and determiners

*DEFAULT-CONTEXTUAL-RULE-FILE*

Path to the brill contextual rule file

*DEFAULT-LEXICAL-RULE-FILE*

Path to the brill lexical rule file

*DEFAULT-LEXICON-FILE*

Path to the lexicon file

*DEFAULT-STEMS-FILE*

Path to the word stems file

*DEFAULT-STOPWORDS-FILE*

Path to a stopwords file

*DEFAULT-TOKEN-MAP-FILE*

Path to the token map file

*REPORT-STATUS*

Where to print langutils messages; default to none

*SUSPICIOUS-WORDS*

Memoize known suspicious words that have been tokenized in this hash

*TAGGER-BIGRAMS*

Bigram hash (not implemented yet)

*TAGGER-CONTEXTUAL-RULES*

Table to hold the contextual rule closures

*TAGGER-LEXICAL-RULES*

Table to hold the lexical rule closures

*TAGGER-WORDLIST*

Wordlist hash (not implemented yet)

Undocumented

*ADD-TO-MAP-HOOK*

*COMMON-VERBS*

*CONCEPT-VHASH*

*CONCISE-STOPWORDS*

*CONFIG-PATHS*

*EXTERNAL-TOKEN-MAP*

*ID-FOR-TOKEN-HOOK*

*ID-TABLE*

*IS-TOKEN*

*LEXICON*

*POS-CLASS-MAP*

*S-TOKEN*

*STOPWORDS*

*TEMP-PHRASE*

*TEST*

*TOKEN-COUNTER*

*TOKEN-COUNTER-HOOK*

*TOKEN-DIRTY-BIT*

*TOKEN-FOR-ID-HOOK*

*TOKEN-TABLE*

*TOKENS-LOAD-FILE*

CLASS

Public

Undocumented

ALTERED-PHRASE

LEXICON-ENTRY

PHRASE

VECTOR-DOCUMENT (INPUT)

Private

Undocumented

CONCEPT

CONSTANT

Private

*MAX-TOKEN-NUMS*

The maximum number of numbers allowed in a valid token

*MAX-TOKEN-OTHERS*

The maximum number of non alpha-numeric characters in a valid token

Undocumented

*WHITESPACE-CHARS*

ADV-PATTERN

NOUN-PATTERN

P-PATTERN

VERB-PATTERN