Common Lisp Package: STATISTICS

Statistical functions

README:

FUNCTION

Public

BIN-AND-COUNT (SEQUENCE N)

Make N equal width bins and count the number of elements of sequence that belong in each.
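As a rough illustration of equal-width binning, the sketch below is written from the description above and is not the package's source. The name sketch-bin-and-count is hypothetical, and two behaviours are assumptions: the maximum element is placed in the last bin, and the sequence is assumed to contain at least two distinct values (otherwise the bin width would be zero).

```lisp
;; Hypothetical sketch, not the package implementation.
(defun sketch-bin-and-count (sequence n)
  "Split the range of SEQUENCE into N equal-width bins and count members."
  (let* ((lo (reduce #'min sequence))
         (hi (reduce #'max sequence))
         (width (/ (- hi lo) n))          ; assumes hi > lo
         (counts (make-array n :initial-element 0)))
    (map nil
         (lambda (x)
           ;; Assumption: clamp so the maximum value lands in the last bin.
           (let ((index (min (1- n) (floor (- x lo) width))))
             (incf (aref counts index))))
         sequence)
    counts))
```

For example, binning the integers 0 through 10 into 5 bins gives bins of width 2, with the value 10 counted in the final bin.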

BINOMIAL-CUMULATIVE-PROBABILITY (N K P)

P(X<k) for X a binomial random variable with parameters n & p. Binomial expectations for fewer than k events in N trials, each having probability p.

BINOMIAL-GE-PROBABILITY (N K P)

The probability of k or more occurrences in N events, each with probability p.

BINOMIAL-PROBABILITY (N K P)

P(X=k) for X a binomial random variable with parameters n & p. Binomial expectations for seeing k events in N trials, each having probability p. Use the Poisson approximation if N>100 and P<0.01.
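The three binomial probabilities described so far are related by simple sums. The sketch below restates the definitions directly; the sketch-* names are hypothetical, the Poisson approximation mentioned above is deliberately left out, and this should not be read as the package's own code.

```lisp
;; Hypothetical sketches of the definitions, not the package implementation.
(defun sketch-choose (n k)
  "Number of ways to pick K items from N when order does not matter."
  (let ((result 1))
    (loop for i from 1 to k
          do (setf result (/ (* result (- n (- k i))) i)))
    result))

(defun sketch-binomial-probability (n k p)
  "P(X=k) = C(n,k) p^k (1-p)^(n-k)."
  (* (sketch-choose n k) (expt p k) (expt (- 1 p) (- n k))))

(defun sketch-binomial-cumulative-probability (n k p)
  "P(X<k): sum of the point probabilities for 0, 1, ..., k-1 events."
  (loop for i from 0 below k
        sum (sketch-binomial-probability n i p)))

(defun sketch-binomial-ge-probability (n k p)
  "P(X>=k) = 1 - P(X<k)."
  (- 1 (sketch-binomial-cumulative-probability n k p)))
```

Passing a rational p such as 1/2 keeps the arithmetic exact: with n = 10 and k = 3 the point probability is 15/128 and the probability of fewer than 3 events is 7/128.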

BINOMIAL-PROBABILITY-CI (N P ALPHA &KEY EXACT?)

Confidence intervals on a binomial probability. If a binomial probability of p has been observed in N trials, what is the 1-alpha confidence interval around p? Approximate (using the normal theory approximation) when npq >= 10, unless told otherwise.

BINOMIAL-TEST-ONE-SAMPLE (P-HAT N P &KEY (TAILS BOTH) (EXACT? NIL))

The significance of a one sample test for the equality of an observed probability p-hat to an expected probability p under a binomial distribution with N observations. Use the normal theory approximation if n*p*(1-p) > 10 (unless the exact flag is true).

BINOMIAL-TEST-ONE-SAMPLE-SSE (P-ESTIMATED P-NULL &KEY (ALPHA 0.05) (1-BETA 0.95) (TAILS BOTH))

Returns the number of subjects needed to test whether an observed probability is significantly different from a particular binomial null hypothesis with a significance alpha and a power 1-beta.

BINOMIAL-TEST-PAIRED-SSE (PD PA &KEY (ALPHA 0.05) (1-BETA 0.95) (TAILS BOTH))

Sample size estimate for the McNemar (discordant pairs) test. Pd is the projected proportion of discordant pairs among all pairs, and Pa is the projected proportion of type A pairs among discordant pairs. alpha, 1-beta and tails are as in binomial-test-two-sample-sse. Returns the number of individuals necessary; that is twice the number of matched pairs necessary.

BINOMIAL-TEST-TWO-SAMPLE (P-HAT1 N1 P-HAT2 N2 &KEY (TAILS BOTH) (EXACT? NIL))

Are the observed probabilities of an event (p-hat1 and p-hat2) in N1/N2 trials different? The normal theory method is implemented here. The exact test is Fisher's contingency table method, below.

BINOMIAL-TEST-TWO-SAMPLE-SSE (P1 P2 &KEY (ALPHA 0.05) (SAMPLE-RATIO 1) (1-BETA 0.95) (TAILS BOTH))

The number of subjects needed to test if two binomial probabilities are different at a given significance alpha and power 1-beta. The sample sizes can be unequal; the p2 sample is sample-ratio * the size of the p1 sample. It can be a one tailed or two tailed test.

CHI-SQUARE (DOF PERCENTILE)

Returns the point which is the indicated percentile in the Chi Square distribution with dof degrees of freedom.

CHI-SQUARE-CDF (X DOF)

Computes the left hand tail area under the chi square distribution under dof degrees of freedom up to X. Adopted from CLASP 1.4.3, http://eksl-www.cs.umass.edu/clasp.html

CHI-SQUARE-TEST-FOR-TREND (ROW1-COUNTS ROW2-COUNTS &OPTIONAL SCORES)

This test works on a 2xk table and assesses if there is an increasing or decreasing trend. Arguments are equal-sized lists of counts. Optionally, provide a list of scores, which represent some numeric attribute of the group. If not provided, scores are assumed to be 1 to k.

CHI-SQUARE-TEST-ONE-SAMPLE (VARIANCE N SIGMA-SQUARED &KEY (TAILS BOTH))

The significance of a one sample Chi square test for the variance of a normal distribution. Variance is the observed variance, N is the number of observations, and sigma-squared is the test variance.

CHI-SQUARE-TEST-RXC (CONTINGENCY-TABLE)

Takes contingency-table, an RxC array, and returns the significance of the relationship between the row variable and the column variable. Any difference in proportion will cause this test to be significant -- consider using the test for trend instead if you are looking for a consistent change.

CHOOSE (N K)

How many ways to take n things k at a time, when order doesn't matter.

CONVERT-TO-STANDARD-NORMAL (X MU SIGMA)

Convert X from a Normal distribution with mean mu and standard deviation sigma to standard normal.

CORRELATION-COEFFICIENT (POINTS)

Just r from linear-regression. Also called the Pearson correlation.

CORRELATION-SSE (RHO &KEY (ALPHA 0.05) (1-BETA 0.95))

Returns the size of a sample necessary to find a correlation of expected value rho with significance alpha and power 1-beta.

CORRELATION-TEST-TWO-SAMPLE (R1 N1 R2 N2 &KEY (TAILS BOTH))

Test if two correlation coefficients are different. Uses Fisher's Z test.

F-SIGNIFICANCE (F-STATISTIC NUMERATOR-DOF DENOMINATOR-DOF &OPTIONAL ONE-TAILED-P)

Adopted from CLASP, but changed to handle F < 1 correctly in the one-tailed case. The `f-statistic' must be a positive number. The degrees of freedom arguments must be positive integers. The `one-tailed-p' argument is treated as a boolean. This implementation follows Numerical Recipes in C, section 6.3 and the `ftest' function in section 13.4.

F-TEST (VARIANCE1 N1 VARIANCE2 N2 &KEY (TAILS BOTH))

F test for the equality of two variances

FALSE-DISCOVERY-CORRECTION (P-VALUES &KEY (RATE 0.05))

A multiple testing correction that is less conservative than Bonferroni. Takes a list of p-values and a false discovery rate, and returns the number of p-values that are likely to be good enough to reject the null at that rate. Returns a second value which is the p-value cutoff. See Benjamini Y and Hochberg Y. "Controlling the false discovery rate: a practical and powerful approach to multiple testing." J R Stat Soc Ser B 57: 289-300, 1995.
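The Benjamini-Hochberg step-up procedure cited above can be sketched as follows: sort the m p-values, then find the largest rank i whose p-value is at most (i/m) * rate. The name sketch-false-discovery-correction is hypothetical, and returning the largest passing p-value as the second value is an assumption about what "cutoff" means, not a statement about the package.

```lisp
;; Hypothetical sketch of the Benjamini-Hochberg procedure,
;; not the package implementation.
(defun sketch-false-discovery-correction (p-values &key (rate 0.05))
  "Return the number of rejected hypotheses and the largest passing p-value."
  (let* ((sorted (sort (copy-list p-values) #'<))
         (m (length sorted))
         (count 0)
         (cutoff nil))
    (loop for p in sorted
          for i from 1
          ;; Keep the LARGEST rank i that satisfies p(i) <= (i/m) * rate.
          when (<= p (* rate (/ i m)))
            do (setf count i
                     cutoff p))
    (values count cutoff)))
```

With the p-values 1/100, 1/50, 3/100 and 1/2 at a rate of 1/20, the per-rank thresholds are 1/80, 1/40, 3/80 and 1/20, so the first three p-values pass and the sketch returns 3 and 3/100.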

FISHER-EXACT-TEST (CONTINGENCY-TABLE &KEY (TAILS BOTH))

Fisher's exact test. Gives a p value for a particular 2x2 contingency table

FISHER-Z-TRANSFORM (R)

Transforms the correlation coefficient to an approximately normal distribution.
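For reference, the standard Fisher transform is z = 1/2 * ln((1+r)/(1-r)), which equals the inverse hyperbolic tangent of r. The one-line sketch below uses the hypothetical name sketch-fisher-z-transform and assumes -1 < r < 1; it is not taken from the package source.

```lisp
;; Hypothetical sketch of the textbook formula; assumes -1 < r < 1.
(defun sketch-fisher-z-transform (r)
  "Fisher's z = 1/2 * ln((1 + r) / (1 - r))."
  (* 1/2 (log (/ (+ 1 r) (- 1 r)))))
```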

LINEAR-REGRESSION (POINTS)

Computes the regression equation for a least squares fit of a line to a sequence of points (each a list of two numbers, e.g. '((1.0 0.1) (2.0 0.2))) and reports the intercept, slope, correlation coefficient r, R^2, and the significance of the difference of the slope from 0.
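The least squares quantities named above follow from three sums of squares. The sketch below, under the hypothetical name sketch-linear-regression, assumes the points are given as a list and returns only the intercept, slope, r and R^2; the significance of the slope is omitted because it needs the t distribution. It is an illustration of the textbook formulas, not the package's code.

```lisp
;; Hypothetical sketch of ordinary least squares on a list of (x y) points.
(defun sketch-linear-regression (points)
  "Return intercept, slope, r and r^2 for a least squares line."
  (let* ((n (length points))
         (xs (mapcar #'first points))
         (ys (mapcar #'second points))
         (x-bar (/ (reduce #'+ xs) n))
         (y-bar (/ (reduce #'+ ys) n))
         (sxx (reduce #'+ (mapcar (lambda (x) (expt (- x x-bar) 2)) xs)))
         (syy (reduce #'+ (mapcar (lambda (y) (expt (- y y-bar) 2)) ys)))
         (sxy (reduce #'+ (mapcar (lambda (x y) (* (- x x-bar) (- y y-bar)))
                                  xs ys)))
         (slope (/ sxy sxx))
         (intercept (- y-bar (* slope x-bar)))
         (r (/ sxy (sqrt (* sxx syy)))))
    (values intercept slope r (* r r))))
```

For the perfectly linear points (1 2), (2 4), (3 6) the sketch gives intercept 0, slope 2 and a correlation of 1.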

MCNEMARS-TEST (A-DISCORDANT-COUNT B-DISCORDANT-COUNT &KEY (EXACT? NIL))

McNemar's test for correlated proportions, used for longitudinal studies. Look only at the number of discordant pairs (one treatment is effective and the other is not). If the two treatments are A and B, a-discordant-count is the number where A worked and B did not, and b-discordant-count is the number where B worked and A did not.

MEAN-SD-N (SEQUENCE)

A combined calculation that is often useful. Takes a sequence and returns three values: mean, standard deviation and N.
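A combined calculation of this kind can be sketched with multiple return values. The name sketch-mean-sd-n is hypothetical, and the use of the sample standard deviation (n-1 in the denominator, so at least two elements are required) is an assumption rather than something the description above states.

```lisp
;; Hypothetical sketch; assumes the sample (n-1) standard deviation.
(defun sketch-mean-sd-n (sequence)
  "Return the mean, sample standard deviation and length of SEQUENCE."
  (let* ((n (length sequence))
         (mean (/ (reduce #'+ sequence) n))
         (sum-of-squares
           (reduce #'+ (map 'list (lambda (x) (expt (- x mean) 2)) sequence)))
         (sd (sqrt (/ sum-of-squares (1- n)))))
    (values mean sd n)))
```

A caller would typically destructure the result with multiple-value-bind, for example (multiple-value-bind (mean sd n) (sketch-mean-sd-n '(2 4 6)) ...).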

MODE (SEQUENCE)

Returns two values: a list of the modes and the number of times they occur.

NORMAL-MEAN-CI (MEAN SD N ALPHA)

Confidence interval for the mean of a normal distribution. The 1-alpha confidence interval on the mean of a normal distribution with parameters mean, sd & n.

NORMAL-MEAN-CI-ON-SEQUENCE (SEQUENCE ALPHA)

The 1-alpha confidence interval on the mean of a sequence of numbers drawn from a Normal distribution.

NORMAL-PDF (X MU SIGMA)

The probability density function (PDF) for a normal distribution with mean mu and standard deviation sigma at point x.

NORMAL-SD-CI (SD N ALPHA)

As normal-variance-ci-on-sequence, but a confidence interval for the standard deviation.

NORMAL-VARIANCE-CI (VARIANCE N ALPHA)

The 1-alpha confidence interval on the variance of a sequence of numbers drawn from a Normal distribution.

PERMUTATIONS (N K)

How many ways to take n things taken k at a time, when order matters
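The ordered count is the falling product n * (n-1) * ... * (n-k+1), and dividing it by k! gives the unordered count that CHOOSE describes. A minimal sketch with hypothetical names, not the package's code:

```lisp
;; Hypothetical sketches of the counting formulas.
(defun sketch-permutations (n k)
  "n * (n-1) * ... * (n-k+1): ordered selections of K items from N."
  (let ((result 1))
    (loop for i from 0 below k
          do (setf result (* result (- n i))))
    result))

(defun sketch-choose-from-permutations (n k)
  "Unordered selections: ordered selections divided by k!."
  (/ (sketch-permutations n k) (sketch-permutations k k)))
```

For n = 5 and k = 2 there are 20 ordered selections and 10 unordered ones.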

PHI (X)

the CDF of standard normal distribution. Adopted from CLASP 1.4.3, see copyright notice at http://eksl-www.cs.umass.edu/clasp.html

POISSON-CUMULATIVE-PROBABILITY (MU K)

Probability of seeing fewer than K events over a time period when the expected number of events over that time is mu.

POISSON-GE-PROBABILITY (MU X)

Probability of X or more events when expected is mu.

POISSON-MU-CI (X ALPHA)

Confidence interval for the Poisson parameter mu. Given x observations in a unit of time, what is the 1-alpha confidence interval on the Poisson parameter mu (= lambda*T)? Since find-critical-value assumes that the function is monotonic increasing, adjust the value we are looking for taking advantage of reflectiveness.

POISSON-PROBABILITY (MU K)

Probability of seeing k events over a time period when the expected number of events over that time is mu.
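The Poisson point probability is e^(-mu) * mu^k / k!, and the "fewer than K" and "X or more" probabilities described earlier are sums and complements of it. The sketch below uses hypothetical sketch-* names and converts mu to a double float; it illustrates the definitions and is not the package's implementation.

```lisp
;; Hypothetical sketches of the Poisson definitions.
(defun sketch-factorial (n)
  (if (<= n 1) 1 (* n (sketch-factorial (1- n)))))

(defun sketch-poisson-probability (mu k)
  "P(X=k) = e^(-mu) * mu^k / k!."
  (let ((mu (float mu 1d0)))
    (/ (* (exp (- mu)) (expt mu k))
       (sketch-factorial k))))

(defun sketch-poisson-cumulative-probability (mu k)
  "P(X<k): sum of the point probabilities for 0, 1, ..., k-1 events."
  (loop for i from 0 below k
        sum (sketch-poisson-probability mu i)))

(defun sketch-poisson-ge-probability (mu x)
  "P(X>=x) = 1 - P(X<x)."
  (- 1 (sketch-poisson-cumulative-probability mu x)))
```

Because the cumulative form sums strictly below k, the probability of fewer than 1 event equals the probability of exactly 0 events, e^(-mu).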

POISSON-TEST-ONE-SAMPLE (OBSERVED MU &KEY (TAILS BOTH) (APPROXIMATE? NIL))

The significance of a one sample test for the equality of an observed number of events (observed) and an expected number mu under the poisson distribution. Normal theory approximation is not that great, so don't use it unless told.

RANDOM-NORMAL (&KEY (MEAN 0) (SD 1))

Returns a random number drawn from a normal distribution with mean and standard deviation as specified.

RANDOM-PICK (SEQUENCE)

Random selection from sequence

RANDOM-SAMPLE (N SEQUENCE)

Return a random sample of size N from sequence, without replacement. If N is equal to or greater than the length of the sequence, return the entire sequence.

ROUND-FLOAT (X &KEY (PRECISION 5))

Rounds a floating point number to a specified number of digits precision.

SIGN-TEST (PLUS-COUNT MINUS-COUNT &KEY (EXACT? NIL) (TAILS BOTH))

Really just a special case of the binomial one sample test with p = 1/2. The normal theory version has a correction factor to make it a better approximation.

SIGN-TEST-ON-SEQUENCES (SEQUENCE1 SEQUENCE2 &KEY (EXACT? NIL) (TAILS BOTH))

Same as sign-test, but takes two sequences and tests whether the entries in one are different (greater or less) than the other.

SPEARMAN-RANK-CORRELATION (POINTS)

Spearman rank correlation computes the relationship between a pair of variables when one or both are either ordinal or have a distribution that is far from normal. It takes a list of points (same format as linear-regression) and returns the spearman rank correlation coefficient and its significance.

T-DISTRIBUTION (DOF PERCENTILE)

Returns the point which is the indicated percentile in the T distribution with dof degrees of freedom. Adopted from CLASP 1.4.3, http://eksl-www.cs.umass.edu/clasp.html

T-SIGNIFICANCE (T-STATISTIC DOF &KEY (TAILS BOTH))

Lookup table in Rosner; this is adopted from CLASP/Numerical Recipes (CLASP 1.4.3), http://eksl-www.cs.umass.edu/clasp.html

T-TEST-ONE-SAMPLE (X-BAR SD N MU &KEY (TAILS BOTH))

The significance of a one sample T test for the mean of a normal distribution with unknown variance. X-bar is the observed mean, sd is the observed standard deviation, N is the number of observations and mu is the test mean. See also t-test-one-sample-on-sequence

T-TEST-ONE-SAMPLE-ON-SEQUENCE (SEQUENCE MU &KEY (TAILS BOTH))

As t-test-one-sample, but calculates the observed values from a sequence of numbers.

T-TEST-ONE-SAMPLE-SSE (MU MU-NULL VARIANCE &KEY (ALPHA 0.05) (1-BETA 0.95) (TAILS BOTH))

Returns the number of subjects needed to test whether the mean of a normally distributed sample mu is different from a null hypothesis mean mu-null and variance variance, with alpha, 1-beta and tails as specified.

T-TEST-PAIRED (D-BAR SD N &KEY (TAILS BOTH))

The significance of a paired t test for the means of two normal distributions in a longitudinal study. D-bar is the mean difference, sd is the standard deviation of the differences, N is the number of pairs.

T-TEST-PAIRED-ON-SEQUENCES (BEFORE AFTER &KEY (TAILS BOTH))

The significance of a paired t test for means of two normal distributions in a longitudinal study. Before is a sequence of before values, after is the sequence of paired after values (which must be the same length as the before sequence).

T-TEST-PAIRED-SSE (DIFFERENCE-MU DIFFERENCE-VARIANCE &KEY (ALPHA 0.05) (1-BETA 0.95) (TAILS BOTH))

Returns the number of subjects needed to test whether the mean of the paired differences (difference-mu, with variance difference-variance) is different from zero, with alpha, 1-beta and tails as specified.

T-TEST-TWO-SAMPLE (X-BAR1 SD1 N1 X-BAR2 SD2 N2 &KEY (VARIANCES-EQUAL? TEST) (VARIANCE-SIGNIFICANCE-CUTOFF 0.05) (TAILS BOTH))

The significance of the difference of two means (x-bar1 and x-bar2) with standard deviations sd1 and sd2, and sample sizes n1 and n2 respectively. The form of the two sample t test depends on whether the sample variances are equal or not. If the variable variances-equal? is :test, then we use an F test and the variance-significance-cutoff to determine if they are equal. If the variances are equal, then we use the two sample t test for equal variances. If they are not equal, we use the Satterthwaite method, which has good type I error properties (at the loss of some power).

T-TEST-TWO-SAMPLE-ON-SEQUENCES (SEQUENCE1 SEQUENCE2 &KEY (VARIANCE-SIGNIFICANCE-CUTOFF 0.05) (TAILS BOTH))

Same as t-test-two-sample, but providing the sequences rather than the summaries.

T-TEST-TWO-SAMPLE-SSE (MU1 VARIANCE1 MU2 VARIANCE2 &KEY (SAMPLE-RATIO 1) (ALPHA 0.05) (1-BETA 0.95) (TAILS BOTH))

Returns the number of subjects needed to test whether the mean mu1 of a normally distributed sample (with variance variance1) is different from a second sample with mean mu2 and variance variance2, with alpha, 1-beta and tails as specified. It is also possible to set a sample size ratio of sample 1 to sample 2.

WILCOXON-SIGNED-RANK-TEST (DIFFERENCES &OPTIONAL (TAILS BOTH))

A test on the ranking of positive and negative differences (are the positive differences significantly larger/smaller than the negative ones). Assumes a continuous and symmetric distribution of differences, although not a normal one. This is the normal theory approximation, which is only valid when N > 15. Note that it is the Wilcoxon rank-sum test for two independent samples, not this signed-rank test, that is equivalent to the Mann-Whitney test.

Z (PERCENTILE &KEY (EPSILON 1.d-15))

The inverse normal function, P(X<Zu) = u where X is distributed as the standard normal. Uses binary search.

Z-TEST (X-BAR N &KEY (MU 0) (SIGMA 1) (TAILS BOTH))

The significance of a one sample Z test for the mean of a normal distribution with known variance. mu is the null hypothesis mean, x-bar is the observed mean, sigma is the standard deviation and N is the number of observations. If tails is :both, the significance of a difference between x-bar and mu. If tails is :positive, the significance of x-bar being greater than mu, and if tails is :negative, the significance of x-bar being less than mu. Returns a p value.
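The statistic behind a one sample Z test is z = (x-bar - mu) / (sigma / sqrt(N)); the p value is then read from the standard normal CDF. The sketch below computes only the statistic, under the hypothetical name sketch-z-statistic, and leaves the conversion to a p value aside.

```lisp
;; Hypothetical sketch: the test statistic only, no p value.
(defun sketch-z-statistic (x-bar n &key (mu 0) (sigma 1))
  "z = (x-bar - mu) / (sigma / sqrt(n))."
  (/ (- x-bar mu) (/ sigma (sqrt n))))
```

An observed mean of 52 from 25 observations, tested against mu = 50 with sigma = 5, gives a standard error of 1 and therefore z = 2.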

Undocumented

COEFFICIENT-OF-VARIATION (SEQUENCE)

CORRELATION-TEST-TWO-SAMPLE-ON-SEQUENCES (POINTS1 POINTS2 &KEY (TAILS BOTH))

GEOMETRIC-MEAN (SEQUENCE &OPTIONAL (BASE 10))

MEAN (SEQUENCE)

MEDIAN (SEQUENCE)

NORMAL-SD-CI-ON-SEQUENCE (SEQUENCE ALPHA)

NORMAL-VARIANCE-CI-ON-SEQUENCE (SEQUENCE ALPHA)

PERCENTILE (SEQUENCE PERCENT)

RANGE (SEQUENCE)

SD (SEQUENCE)

STANDARD-DEVIATION (SEQUENCE)

STANDARD-ERROR-OF-THE-MEAN (SEQUENCE)

VARIANCE (SEQUENCE)

WILCOXON-SIGNED-RANK-TEST-ON-SEQUENCES (SEQUENCE1 SEQUENCE2 &OPTIONAL (TAILS BOTH))

Z-TEST-ON-SEQUENCE (SEQUENCE &KEY (MU 0) (SIGMA 1) (TAILS BOTH))

Private

2-TAILED-CORRELATION-SIGNIFICANCE (N R)

We use the first line for anything less than 5, and the last line for anything over 500. Otherwise, find the nearest value (maybe we should interpolate ... too much bother!)

ANOVA1 (D)

One way simple ANOVA, from Neter, et al. p677+. Data is given as a list of lists, each one representing a treatment, and each containing the observations.

ANOVA2 (A1B1 A1B2 A2B1 A2B2)

Two-Way Anova. (From Misanin & Hinderliter, 1991, p. 367-) This is specialized for four groups of equal n, called by their plot location names: left1 left2 right1 right2.

ANOVA2R (G1 G2)

Two way ANOVA with repeated measures on one dimension. From Ferguson & Takane, 1989, p. 359. Data is organized differently for this test. Each group (g1 g2) contains a list of all subjects' repeated measures, and same for B. So, A: ((t1s1g1 t2s1g1 ...) (t1s2g2 t2s2g2 ...) ...) Have to have the same number of test repeats for each subject, and this assumes the same number of subjects in each group.

AVERAGE-RANK (VALUE SORTED-VALUES)

Average rank calculation for non-parametric tests. Ranks are 1 based, but lisp is 0 based, so add 1!

BETA-INCOMPLETE (A B X)

Adopted from CLASP 1.4.3, http://eksl-www.cs.umass.edu/clasp.html

CORRELATE (X Y)

Correlation of two sequences, as in Ferguson & Takane, 1989, p. 125. Assumes NO MISSING VALUES!

CROSS-MEAN (L &AUX K R)

Cross mean takes a list of lists, as ((1 2 3) (4 3 2 1) ...) and produces a list with mean and standard error for each VERTICAL entry, so, as: ((2.5 . 1) ...) where the first pair is computed from the nth 1 of all the sublists in the input set, etc. This is useful in some cases of data crunching. Note that missing data is assumed to be always at the END of lists. If it isn't, you've got to do something previously to interpolate.

DUMPLOT (V &OPTIONAL SHOW-VALUES)

A dumb terminal way of plotting data.

ERROR-FUNCTION (X)

Adopted from CLASP 1.4.3, http://eksl-www.cs.umass.edu/clasp.html

ERROR-FUNCTION-COMPLEMENT (X)

Adopted from CLASP 1.4.3, http://eksl-www.cs.umass.edu/clasp.html

FIND-CRITICAL-VALUE (P-FUNCTION P-VALUE &OPTIONAL (X-TOLERANCE 1.e-5) (Y-TOLERANCE 1.e-5))

Adopted from CLASP 1.4.3, http://eksl-www.cs.umass.edu/clasp.html

GAMMA-INCOMPLETE (A X)

Adopted from CLASP 1.4.3, http://eksl-www.cs.umass.edu/clasp.html

GAMMA-LN (X)

Adopted from CLASP 1.4.3, http://eksl-www.cs.umass.edu/clasp.html

HARMONIC-MEAN (SEQ)

See: http://mathworld.wolfram.com/HarmonicMean.html

HISTOVALUES (V* &KEY (NBINS 10))

Take a set of values and produce a histogram binned into n groups, so that you can get a report of the distribution of values. There's a large chance for off-by-one errors here!

LMEAN (LL)

Lmean takes the mean of entries in a list of lists vertically. So: (lmean '((1 2) (5 6))) -> (3 4) The args have to be the same length.
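A column-wise mean of this kind can be sketched by applying mapcar across the sublists. The name sketch-lmean is hypothetical, and the sketch assumes, as stated above, that all sublists have the same length; it is not the package's source.

```lisp
;; Hypothetical sketch: mean of each column in a list of equal-length lists.
(defun sketch-lmean (list-of-lists)
  (apply #'mapcar
         (lambda (&rest column)
           (/ (reduce #'+ column) (length column)))
         list-of-lists))
```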

N-RANDOM (N L &AUX R)

Select n random sublists from a list, without replacement. This copies the list and then destroys the copy. N better be less than or equal to (length l).

NORMALIZE (V)

Normalize a vector by subtracting its min and then dividing through by its range (max-min). If the numbers are all the same, this would screw up, so we check that first and just return a long list of 0.5 if so!
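The min-max rescaling and its zero-range guard can be sketched as below. The name sketch-normalize is hypothetical and the input is assumed to be a list of numbers; this follows the description rather than the package's code.

```lisp
;; Hypothetical sketch of min-max normalization with a zero-range guard.
(defun sketch-normalize (numbers)
  (let* ((lo (reduce #'min numbers))
         (hi (reduce #'max numbers))
         (range (- hi lo)))
    (if (zerop range)
        ;; All values identical: return 0.5 for every entry.
        (make-list (length numbers) :initial-element 0.5)
        (mapcar (lambda (x) (/ (- x lo) range)) numbers))))
```

For the input (1 2 3) the sketch returns (0 1/2 1), and for (4 4) it returns (0.5 0.5).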

PROTECTED-MEAN (L)

Computes a mean protected where there will be a divide by zero, and gives us n/a in that case.

PROUND (N V)

Returns a string that is rounded to the appropriate number of digits, but the only thing you can do with it is print it. It's just a convenience hack for rounding recursive lists.

REGRESS (X Y)

Simple linear regression.

SAFE-EXP (X)

Eliminates floating point underflow for the exponential function. Instead, it just returns 0.0d0

T1-TEST (VALUES TARGET &OPTIONAL (WARN? T))

One way t-test to see if a group differs from a numerical mean target value. From Misanin & Hinderliter p. 248.

T2-TEST (L1 L2)

T2-test calculates an UNPAIRED t-test. From Misanin & Hinderliter p. 268. The t-cdf part is inherent in xlispstat, and I'm not entirely sure that it's really the right computation since it doesn't agree entirely with Table 5 of M&H, but it's close, so I assume that M&H have round-off error.

TUKEY-Q (K DFWG)

Finds the Q table for the appropriate K, and then walks BACKWARDS through it (in a kind of ugly way!) to find the appropriate place in the table for the DFwg, and then uses the level (which must be 0.01 or 0.05, indicating the first, or second col of the table) to determine if the Q value reaches significance, and gives us a + or - final result.

WILCOXON-1 (INITIAL-VALUES TARGET)

Nonparametric one-sample (signed) rank test (Wilcoxon). From http://www.graphpad.com/instatman/HowtheWilcoxonranksumtestworks.htm

X2TEST

Simple Chi-Squares From Clarke & Cooke p. 431; should = ~7.0

Undocumented

ALL-SQUARES (AS BS &AUX SQUARES)

BINOMIAL-LE-PROBABILITY (N K P)

CHI-SQUARE-1 (EXPECTED OBSERVED)

CHI-SQUARE-2 (TABLE)

EVEN-POWER-OF-TWO? (N)

F-SCORE>P-LIMIT? (DF1 DF2 F-SCORE LIMITS-TABLE)

FACTORIAL (NUMBER)

MAX* (L &REST LL &AUX M)

MIN* (L &REST LL &AUX M)

P2 (V)

ROUND-UP (X)

S2 (L N)

SIGN (X)

SQR (A)

STANDARD-ERROR (SEQUENCE)

SUM (L &AUX (SUM 0))

T-P-VALUE (X DF &OPTIONAL (WARN? T))

T1-VALUE (VALUES TARGET)

T2-VALUE (L1 L2)

TESTANOVA2

MACRO

Public

Undocumented

SQUARE (X)

TEST-VARIABLES (&REST ARGS)

Private

UNDERFLOW-GOES-TO-ZERO (&BODY BODY)

Protects against floating point underflow errors and sets the value to 0.0 instead.

Z/PROTECT (EXPR TESTVAR)

Macro to protect from division by zero.

Undocumented

DISPLAY (&REST L)

VARIABLE

Private

Undocumented

*CRITICAL-VALUES-OF-R*

*CRITICAL-VALUES-OF-R-TWO-TAILED-COLUMN-INTERPRETAION*

*F0.05*

*F0.10*

*Q-TABLE*

*T-CDF-CRITICAL-POINTS-TABLE-FOR-.05*