Common Lisp Package: DARTS.LIB.SEQUENCE-METRICS

This package exports various forms of metric functions on sequences. Among the ones provided are: - Levenshtein distance - Jaro and Jaro/Winkler distance - Hamming distance Most distances are provided in a very general form, working an arbitrary sequences. However, since most of these distance functions are usually applied to strings, for some frequently used metrics, optimized string versions are provided. This package also exports a few other utility functions, which strictly speaking don't really belong here, such as the n-gram related stuff. They live in this package, because they used to do so since the dawn of time...

README:

DartsCLSequenceMetrics

Distance metrics on sequences in general and strings in particular.

  • Constant jaro-winkler-min-prefix-length6
  • Constant jaro-winkler-prefix-adjustment-scale1/10
  • Function hamming-distance s1 s2 &key start1 end1 start2 end2 test test-not key =>distance
  • Function jaro-distance s1 s2 &key start1 end1 start2 end2 test test-not key =>distance
  • Function jaro-winkler-distance s1 s2 &key start1 end1 start2 end2 test test-not key prefix-length adjustment-scale =>distance
  • Function levenshein-distance s1 s2 &key start1 end1 start2 end2 test test-not key =>distance
  • Function list-ngrams size list &key start-padding end-padding transform =>list
  • Function longest-common-subsequence*-length s1 s2 &key start1 end1 start2 end2 test test-not key =>length
  • Function longest-common-subsequence-length s1 s2 &key start1 end1 start2 end2 test test-not key =>length
  • Function longest-common-subsequences* s1 s2 &key start1 end1 start2 end2 test test-not key =>list
  • Function longest-common-substring-length s1 s2 &key start1 end1 start2 end2 case-sensitive =>length
  • Function longest-common-substrings s1 s2 &key start1 end1 start2 end2 case-sensitive =>list
  • Function map-ngrams function size list &key start-padding end-padding
  • Function string-hamming-distance s1 s2 &key start1 end1 start2 end2 case-sensitive =>distance
  • Function string-jaro-distance s1 s2 &key start1 end1 start2 end2 case-sensitive =>distance
  • Function string-jaro-winkler-distance s1 s2 &key start1 end1 start2 end2 case-sensitive prefix-length adjustment-scale =>distance
  • Function string-levenshtein-distance s1 s2 &key start1 end1 start2 end2 case-sensitive =>distance
  • Macro do-ngrams (&rest vars) (list-form &key start-padding end-padding) &body body =>undefined

FUNCTION

Public

JARO-DISTANCE (STR1 STR2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (TEST #'EQL HAVE-TEST) (TEST-NOT NIL HAVE-TEST-NOT) (KEY #'IDENTITY))

jaro-distance SEQ1 SEQ2 &key START1 END1 START2 END2 TEST TEST-NOT KEY => DISTANCE

LEVENSHTEIN-DISTANCE

levenshtein-distance S1 S2 &key START1 END1 START2 END2 TEST TEST-NOT KEY => NUMBER Computes the Levenshtein distance between sequences S1 and S2. The result value DISTANCE is the minimum number of edit operations required to transform S1 into S2 (or vice versa), where allowed operations are: - insert a single character at some position - delete a single character at some position - substitute a single character at some position by another one The Levenshtein distance is a measure of similarity between strings. If two strings have a distance of 0, they are equal. The TEST function is used to compare sequence elements. The default test function is eql. This function is a generalization of string-levenshtein-distance for arbitrary sequence types. Use string-levenshtein-distance, if you need a version optimized for use with strings.

LONGEST-COMMON-SUBSEQUENCE*-LENGTH (SEQ1 SEQ2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (TEST #'EQL HAVE-TEST) (TEST-NOT NIL HAVE-TEST-NOT) (KEY #'IDENTITY HAVE-KEY))

longest-common-subsequence*-length SEQ1 SEQ2 &key START1 END1 START2 END2 TEST TEST-NOT KEY => LENGTH Returns the length of the (or: a) longest common contiguous subsequence of sequences SEQ1 and SEQ2. This problem is usually called the longest common `substring┬┤ problem, but in order to avoid confusion, the we use the term subsequence* here to refer to contiguous subsequences.

LONGEST-COMMON-SUBSEQUENCES* (SEQ1 SEQ2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (TEST #'EQL HAVE-TEST) (TEST-NOT NIL HAVE-TEST-NOT) (KEY #'IDENTITY HAVE-KEY))

longest-common-subsequences* SEQ1 SEQ2 &key START1 END1 START2 END2 TEST TEST-NOT KEY => LENGTH Returns a list containing all contigous longest common subsequences of SEQ1 and SEQ2. This problem is usually called the longest common `substring┬┤ problem, but in order to avoid confusion, the we use the term subsequence* here to refer to contiguous subsequences.

MAP-NGRAMS (FUNCTION SIZE LIST &KEY (START-PADDING NIL) (END-PADDING NIL))

map-ngrams FUNCTION N LIST &key START-PADDING END-PADDING => UNDEFINED Calls FUNCTION for each n-gram of size N constructed from the elements in the given LIST. Initially, the value of START-PADDING is used to fill the first N - 1 elements in the call. When the list is exhausted, the value of END-PADDING is used to pad to a size of N. The function must accept N arguments.

STRING-JARO-DISTANCE

string-jaro-distance STR1 STR2 &key START1 END1 START2 END2 CASE-SENSITIVE => DISTANCE

STRING-LEVENSHTEIN-DISTANCE

string-levenshtein-distance S1 S2 &key START1 END1 START2 END2 CASE-SENSITIVE => DISTANCE Computes the Levenshtein distance between strings S1 and S2. The result value DISTANCE is the minimum number of edit operations required to transform S1 into S2 (or vice versa), where allowed operations are: - insert a single character at some position - delete a single character at some position - substitute a single character at some position by another one The Levenshtein distance is a measure of similarity between strings. If two strings have a distance of 0, they are equal. If CASE-SENSITIVE is true (which is the default), then comparing of characters is done in a case sensitive way, distinguishing between lower and upper case letters. If CASE-SENSITIVE is false, then this function does not distinguish between lower case and upper case letters. See function levenshtein-distance for a generalization of this function to arbitrary sequences.

Undocumented

HAMMING-DISTANCE (SEQ1 SEQ2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (TEST #'EQL HAVE-TEST) (TEST-NOT NIL HAVE-TEST-NOT) (KEY #'IDENTITY HAVE-KEY) (NORMALIZED NIL))

JARO-WINKLER-DISTANCE (STR1 STR2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (TEST #'EQL HAVE-TEST) (TEST-NOT NIL HAVE-TEST-NOT) (KEY #'IDENTITY HAVE-KEY) (PREFIX-LENGTH JARO-WINKLER-MIN-PREFIX-LENGTH) (ADJUSTMENT-SCALE JARO-WINKLER-PREFIX-ADJUSTMENT-SCALE))

LIST-NGRAMS (SIZE LIST &KEY (START-PADDING NIL) (END-PADDING NIL) (TRANSFORM #'LIST))

LONGEST-COMMON-SUBSEQUENCE-LENGTH (SEQ1 SEQ2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (TEST #'EQL HAVE-TEST) (TEST-NOT NIL HAVE-TEST-NOT) (KEY #'IDENTITY HAVE-KEY))

LONGEST-COMMON-SUBSTRING-LENGTH (SEQ1 SEQ2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (CASE-SENSITIVE NIL))

LONGEST-COMMON-SUBSTRINGS (SEQ1 SEQ2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (CASE-SENSITIVE NIL))

STRING-HAMMING-DISTANCE (SEQ1 SEQ2 &KEY (START1 0) (END1 NIL) (START2 0) (END2 NIL) (CASE-SENSITIVE NIL) (NORMALIZED NIL))

STRING-JARO-WINKLER-DISTANCE

MACRO

Public

Undocumented

DO-NGRAMS ((&REST VARS) (LIST-FORM &KEY START-PADDING END-PADDING) &BODY BODY)

CONSTANT

Private

Undocumented

JARO-WINKLER-MIN-PREFIX-LENGTH

JARO-WINKLER-PREFIX-ADJUSTMENT-SCALE