Common Lisp Package: LANGUTILS-TOKENIZE

README:

FUNCTION

Public

TOKENIZE-STREAM (STREAM &KEY (BY-SENTENCE NIL) (FRAGMENT ) &AUX (INDEX 0) (START 0) (CH ) (WS ) (STATUS RUNNING) (SENTENCE? NIL))

Converts a stream into a string and tokenizes it, optionally one sentence at a time, which is useful for large files. Pretty hairy code: a token processor inside a stream scanner. The stream scanner walks the input stream and tokenizes all punctuation (except periods). After a sequence of non-whitespace characters has been read, the inline tokenizer looks at the end of the string for mis-tokenized words (can ' t -> ca n't).
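
The two passes described above can be illustrated with a minimal, self-contained sketch. It is not the package's implementation: the names WHITESPACE-P, SCAN-TOKENS, REPAIR-CONTRACTIONS and SKETCH-TOKENIZE are hypothetical, the punctuation set is an assumption, and only the "n't" repair shown in the docstring is handled.

```lisp
;; Illustrative sketch only; none of these names belong to LANGUTILS-TOKENIZE.

(defun whitespace-p (ch)
  "True if CH is a space, tab, newline, or carriage return."
  (member ch '(#\Space #\Tab #\Newline #\Return)))

(defun scan-tokens (text)
  "Split TEXT on whitespace. Each punctuation mark other than a period
becomes its own token; periods stay attached to the preceding word."
  (let ((tokens '())
        (current (make-string-output-stream)))
    (flet ((flush ()
             ;; GET-OUTPUT-STREAM-STRING also clears the buffer.
             (let ((word (get-output-stream-string current)))
               (when (plusp (length word))
                 (push word tokens)))))
      (loop for ch across text
            do (cond ((whitespace-p ch) (flush))
                     ((find ch ",;:!?\"'()") ; assumed punctuation set
                      (flush)
                      (push (string ch) tokens))
                     (t (write-char ch current))))
      (flush))
    (nreverse tokens)))

(defun repair-contractions (tokens)
  "Rewrite the mis-tokenized run X ' t, where X ends in n, as the two
tokens X-without-its-final-n and n't (can ' t -> ca n't)."
  (let ((result '()))
    (loop while tokens
          do (let ((a (first tokens))
                   (b (second tokens))
                   (c (third tokens)))
               (cond ((and b c
                           (string= b "'")
                           (string-equal c "t")
                           (> (length a) 1)
                           (char-equal (char a (1- (length a))) #\n))
                      (push (subseq a 0 (1- (length a))) result)
                      (push "n't" result)
                      (setf tokens (cdddr tokens)))
                     (t
                      (push a result)
                      (setf tokens (rest tokens))))))
    (nreverse result)))

(defun sketch-tokenize (text)
  "Return a fresh string of space-separated tokens."
  (format nil "~{~A~^ ~}" (repair-contractions (scan-tokens text))))

;; (sketch-tokenize "I can't stop, really.")
;; => "I ca n't stop , really."
```

Running the repair as a second pass over already-split tokens keeps the scanner simple; TOKENIZE-STREAM instead performs the check inline while scanning, according to its docstring.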

TOKENIZE-STRING (STRING)

Returns a fresh, linguistically tokenized string.

Private

ALPHA-LOWERCASE (CH)

Return T if the given character is a lowercase alpha character.

TOKENIZE-FILE2 (SOURCE-FILE TARGET-FILE &KEY (IF-EXISTS SUPERSEDE) &AUX (TOTAL 0) (REMAINDER ))

Tokenizes a plain text file one sentence at a time.
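
As a rough illustration of sentence-at-a-time file processing, the sketch below reads a source file one sentence at a time and writes one processed sentence per line to a target file. It is an assumption-laden stand-in, not the body of TOKENIZE-FILE2: READ-SENTENCE, MAP-SENTENCES and TOKENIZE-FILE-SKETCH are hypothetical names, the tokenizer is passed in as a function argument, the :SUPERSEDE default is assumed, and sentence boundaries are detected naively at every period, question mark, or exclamation mark (so abbreviations such as "Dr." are split incorrectly).

```lisp
;; Illustrative sketch only; none of these names belong to LANGUTILS-TOKENIZE.

(defun read-sentence (in)
  "Read characters from IN up to and including the next . ! or ?.
Return NIL once the input is exhausted."
  (let ((text (with-output-to-string (buf)
                (loop for ch = (read-char in nil nil)
                      while ch
                      do (write-char ch buf)
                      until (find ch ".!?")))))
    (when (plusp (length text))
      text)))

(defun map-sentences (in out tokenizer)
  "Apply TOKENIZER to each sentence read from IN and write the result
as one line on OUT. Return the number of sentences written."
  (let ((total 0))
    (loop for raw = (read-sentence in)
          while raw
          do (let ((sentence (string-trim '(#\Space #\Tab #\Newline #\Return)
                                          raw)))
               ;; Skip chunks that contain only whitespace.
               (when (plusp (length sentence))
                 (write-line (funcall tokenizer sentence) out)
                 (incf total))))
    total))

(defun tokenize-file-sketch (source-file target-file tokenizer
                             &key (if-exists :supersede))
  "Process SOURCE-FILE into TARGET-FILE one sentence at a time, so the
whole file never has to be held in memory as a single string."
  (with-open-file (in source-file :direction :input)
    (with-open-file (out target-file :direction :output
                                     :if-exists if-exists)
      (map-sentences in out tokenizer))))
```

Because MAP-SENTENCES works on streams, it can be tried on strings with WITH-INPUT-FROM-STRING and WITH-OUTPUT-TO-STRING before pointing TOKENIZE-FILE-SKETCH at real files.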

Undocumented

ALPHA-MISC (CH)

ALPHA-UPPERCASE (CH)

VARIABLE

Private

Undocumented

KNOWN-ABBREVIATIONS

CONDITION

Private

Undocumented

END-OF-SENTENCE