APPENDIX D: STATISTICS


D1 CONFIDENCE LIMITS AND MEDIAN ESTIMATES OF LINKAGE DISTANCE

D1.1 The special case of complete concordance

To illustrate the statistical approach used to estimate confidence limits on experimentally determined values for linkage distances, it is useful to first consider the special case where two linked loci show complete concordance, or no recombination (symbolized as R = 0), in their allelic segregations among a set of N samples derived either from recombinant inbred (RI) strains or from the offspring of a backcross. Let us define the true recombination fraction, Theta, as the fraction of samples expected to be discordant (or recombinant) as N approaches infinity. Then the probability of recombination in any one sample is simply Theta, and the probability of non-recombination, or concordance, is (1 - Theta). As long as multiple events are completely independent of each other, one can calculate the probability that all of them will occur by multiplying together the individual probabilities associated with each event. Thus, if the probability of concordance in one sample is (1 - Theta), then the probability of concordance in N samples is (1 - Theta)^N.
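As a quick numerical illustration (the values Theta = 0.05 and N = 50 are chosen here only for the arithmetic and do not come from any experiment), the chance of seeing no recombinants at all is already below 10%:

```latex
(1 - \Theta)^{N} = (1 - 0.05)^{50} = 0.95^{50} \approx 0.077
```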

In most experimental situations, the known and unknown variables are reversed in that one begins by determining the number of discordant (or recombinant) samples i that occur within a total set of N as a means to estimate the unknown true recombination fraction Theta. When no discordant samples are observed, the probability term just derived can be used with the substitution of the random variable small-theta in place of Theta, to provide a continuous probability density function indicative of the relative likelihoods for different values of Theta between 0.0 (complete linkage) and 0.5 (no linkage).

$$P\{\Theta = \theta\} = f(\theta) = (1 - \theta)^{N} \qquad \text{(Equation D1)}$$

This equation reads "the probability that the true recombination fraction Theta is equal to a particular value small-theta is the function of small-theta given as the last term in the equation". For both RI data and backcross data, Theta can be related directly to linkage distance in centimorgans, d. In the case of backcross data, and for values of Theta less than 0.25 (see Section 7.2.2.3), recombination fractions are converted into centimorgan estimates through simple multiplication:

$$d = 100\,\Theta \qquad \text{(Equation D2)}$$

In the case of RI data, this conversion is combined with the Haldane-Waddington equation (Equation 9.8) to yield:

$$d = \frac{100\,\Theta}{4 - 6\,\Theta} \qquad \text{(Equation D3)}$$
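A short worked conversion may help. It assumes Equation D2 is d = 100 Theta and Equation D3 is d = 100 Theta / (4 - 6 Theta), which is what the `convert` function in the program of Section D1.3 computes. Taking one discordant sample among 26 (the case plotted in Figure D2), the same observed fraction gives quite different distances for the two kinds of data:

```latex
\Theta = \tfrac{1}{26} \approx 0.0385
\qquad
d_{\text{backcross}} = 100 \cdot \tfrac{1}{26} \approx 3.85\ \text{cM}
\qquad
d_{\text{RI}} = \frac{100 \cdot \tfrac{1}{26}}{4 - \tfrac{6}{26}} = \frac{100}{98} \approx 1.02\ \text{cM}
```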

An example of the probability density function associated with the experimental observation of complete concordance among 50 backcross samples is shown in Figure D1. Each value of N defines a different function, but in all cases the curve has the same shape, with only the steepness of the fall-off increasing as N increases. In all cases, the "maximum likelihood estimate" for the true recombination fraction, Theta-hat (defined as the value of Theta associated with the highest probability), will be zero. However, since this maximum likelihood value is located at one end of the probability curve, it does not provide a useful estimate for the likely linkage distance. A better estimate is the value that defines the midpoint below which and above which the true recombination fraction is likely to lie with equal probability; this is the definition of the median estimate of the recombination fraction. In mathematical terms, the median estimate is the value of small-theta at the line which equally divides the area under the complete probability density given by Equation D1 (see Figure D1).

Confidence limits are also defined by circumscribed portions of the entire probability density; the portion that lies outside a confidence interval is called alpha. For example, in the case of a 95% confidence interval, alpha = (1 - 0.95) = 0.05. It is standard practice to assign equal portions of alpha to the two "tails" of the probability density located before and after the central confidence interval. Thus, the lower confidence limit is defined as the value of small-theta bordering the initial alpha/2 fraction of the area under the entire probability curve. The upper confidence limit is defined as the value of small-theta that borders the ultimate alpha/2 fraction of the area under the entire probability curve; this is equivalent to saying that a "(1 - alpha/2)" fraction of area lies ahead of the upper confidence limit.

In mathematical terms, the area beneath the entire probability density curve is equal to the definite integral of Equation D1 over the range of legitimate values for small-theta between 0.0 and 0.5. To determine the fraction of the probability density that lies in the region between Theta = 0 and any arbitrary Theta = x, it is necessary to integrate over the probability density function (Equation D1) between these two values, and divide the result by the total area covered by the probability density. This provides the probability that the true recombination fraction is less than or equal to x.

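$$P\{\Theta \le x\} = \frac{\int_{0}^{x} (1 - \theta)^{N}\, d\theta}{\int_{0}^{0.5} (1 - \theta)^{N}\, d\theta} \qquad \text{(Equation D4)}$$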

By standard methods of calculus, Equation D4 can be reduced analytically to the form:

$$P\{\Theta \le x\} = \frac{1 - (1 - x)^{N+1}}{1 - 0.5^{\,N+1}} \qquad \text{(Equation D5)}$$
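The reduction can be checked term by term. Assuming the unnormalized density of Equation D1 is (1 - small-theta)^N, as derived in the text above, each integral in Equation D4 has an elementary antiderivative, and the common factor 1/(N + 1) cancels in the ratio:

```latex
\int_{0}^{x} (1-\theta)^{N}\, d\theta
  = \left[ -\frac{(1-\theta)^{N+1}}{N+1} \right]_{0}^{x}
  = \frac{1 - (1-x)^{N+1}}{N+1}
\qquad\Longrightarrow\qquad
P\{\Theta \le x\} = \frac{1 - (1-x)^{N+1}}{1 - 0.5^{\,N+1}}
```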

This equation can be rearranged to yield x as a function of P{Theta <= x}:

$$x = 1 - \left[\, 1 - P\{\Theta \le x\}\left(1 - 0.5^{\,N+1}\right) \right]^{1/(N+1)} \qquad \text{(Equation D6)}$$

By solving Equation D6 for different values of:

P{ Theta <= x},

one can obtain critical values of x that define the median estimate of the recombination fraction from:

P{ Theta <= x} = 0.5,

lower confidence limits from:

P{ Theta <= x} = alpha/2,

and upper confidence limits from:

P{ Theta <= x} = (1 - alpha /2).

Once a solution for x has been obtained, it can be converted into a linkage distance value with either Equation D2 for backcross data or Equation D3 for RI strain data. Solutions to Equation D6 over a range of N RI strains and backcross animals are shown in Figure 9.8, Figure 9.16, and Figure 9.17.

D1.2 The general case of one or more recombinants

The statistical approach described above can be generalized to any case of i discordant (or recombinant) samples observed among a total of N RI strains or backcross animals that have been typed for two loci. As in the special case above, one can arrive at a probability for the occurrence of multiple events by multiplying together the individual probabilities for each event. In the general case, there will be i events of discordance, each with an individual probability equal to the true recombination fraction Theta, and (N - i) events of concordance, each with an individual probability of (1 - Theta). These terms are multiplied together along with a "binomial coefficient" that counts the number of different orders in which the two types of events can appear, to produce the "binomial formula":

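$$P\{i \mid N, \Theta\} = \frac{N!}{i!\,(N - i)!}\; \Theta^{\,i}\,(1 - \Theta)^{N - i} \qquad \text{(Equation D7)}$$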
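As an arithmetic illustration of the binomial formula (the value Theta = 0.05 is chosen here only for the example; N = 26 and i = 1 match the case plotted in Figure D2), the probability of observing exactly one discordant sample is:

```latex
\frac{26!}{1!\,25!}\,(0.05)^{1}\,(0.95)^{25}
  = 26 \times 0.05 \times 0.95^{25}
  \approx 26 \times 0.05 \times 0.2774
  \approx 0.36
```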

When the true recombination fraction is known, the binomial formula can be used to provide the probability that i events of discordance will be observed in any set of N samples. But once again, the situation encountered by geneticists is usually the reverse one in which i and N are discrete values determined by the experiment and the true recombination fraction Theta is unknown. In this case, one can substitute the random variable small-theta in place of Theta in Equation D7 to generate a probability density function that provides relative likelihoods for different values of Theta between 0.0 (complete linkage) and 0.5 (no linkage). In this use of the binomial formula, the factorial fraction (known as the binomial coefficient) remains constant for all values of small-theta and can be eliminated since the purpose of the function is to provide relative probabilities only:

$$P\{\Theta = \theta\} = f(\theta) = \theta^{\,i}\,(1 - \theta)^{N - i} \qquad \text{(Equation D8)}$$

An example of the probability density function associated with the experimental observation of one discordant RI strain among a total of 26 samples is shown in Figure D2. As one can easily see, the distribution is highly skewed toward higher recombination fractions. Each discrete pair of values i and N will define a different function. When both i and N are large, the density function will approximate a normal distribution. However, with the results typically obtained in contemporary mouse linkage studies, the density function is likely to be significantly skewed as shown in Figure D2 and as such, it is usually not possible to take advantage of the simplified statistical tools developed specially for use with the normal distribution.

A median estimate of linkage distance, as well as lower and upper confidence limits, can be obtained in the same manner as in the special case of no recombination described above. This is accomplished by substituting Equation D8 in place of the two occurrences of Equation D1 within Equation D4:

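$$P\{\Theta \le x\} = \frac{\int_{0}^{x} \theta^{\,i}\,(1 - \theta)^{N - i}\, d\theta}{\int_{0}^{0.5} \theta^{\,i}\,(1 - \theta)^{N - i}\, d\theta} \qquad \text{(Equation D9)}$$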

The general form of the integral in this equation cannot be solved analytically, but a short computer program can be used to estimate solutions numerically and provide critical values of x for defined probability values. The computer program has been written to generate minimum and maximum values in terms of centimorgan distances for discrete experimentally determined values of i and N from either backcross or RI data. The program was used to generate the values shown in Table D1, Table D2, Table D3, Table D4, Table D5, and Table D6 for 68% and 95% confidence intervals, but it is possible to generate confidence limits for any other integer percentile confidence interval as well. The program will also calculate maximum likelihood and median estimates of linkage distance ¹⁰⁹. It is listed below as a self-contained unit that should be ready for compiling with any standard C compiler on any computer. DOS and Macintosh versions of the executable program can be downloaded over the internet from the following anonymous FTP site: bioweb.princeton.edu. Interested investigators should look in the folder entitled pub/mouse.

D1.3 A C program for the calculation of linkage distance estimates and confidence intervals

/*** A C program for the calculation of linkage distance estimates and confidence intervals ***/
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
double 	Pin(double r,int i,int N); double convert(double r);
static	int crosstype;	/* 1 = backcross, 2 = RI strains */
int main(void)
{	int 	i = 1, istart = 1, ifin = 50, iinc = 1, N = 100, P;
	double	dmin, dmax, r, dmedian, dmle, Nrmlize = 0.0, Sum = 0.0, min, max;
	while(1){
		printf("Enter the type of cross:1 for backcross,2 for RI analysis,or 3 to quit:");
		/* quit on 3, on any other value, or on unreadable input */
		if(scanf("%d",&crosstype) != 1 || (crosstype != 2 && crosstype != 1)) exit(0);
		printf("Enter the confidence level as an integer number(e.g. 95 for 95%%):");
		if(scanf("%d", &P) != 1) exit(1);
		/* min and max are the cumulative probabilities alpha/2 and 1 - alpha/2 */
		min = (1-((double)P/100.0))/2; max = 1- min;
		printf("Enter with comma delimiters->i-start,i-end,i-increment,and N,then return\n:>");
		if(scanf ("%d,%d,%d,%d", &istart,&ifin,&iinc,&N) != 4 || iinc < 1) exit(1);
		printf(" i, dist / medn, min. / max. (values in cM assuming complete interference)\n");
		for ( i = istart; i <= ifin ; i += iinc){
			/* total area under the density between 0 and 0.5 (denominator of Equation D9) */
			for ( r = .0001, Nrmlize = 0 ; r <.5 ; r += .0001)
				Nrmlize += Pin(r,i,N);
			/* accumulate normalized area until the lower confidence limit is reached */
			for ( r = .0001, Sum = 0; Sum < min && r<.5; r += .0001)
				Sum += Pin(r,i,N)/Nrmlize;
			dmin = convert(r);
			/* continue to the median (half of the total area) */
			for (; Sum <.5 && r<.5; r += .0001)
				Sum += Pin(r,i,N)/Nrmlize;
			dmedian = convert(r);
			/* continue to the upper confidence limit */
			for (; Sum < max && r<.5 ; r += .0001)
				Sum += Pin(r,i,N)/Nrmlize;
			dmax = convert(r);
			dmle = convert((double)i/N);	/* maximum likelihood estimate */
			printf("%3d, %4.1f / %4.1f, %4.1f / %4.1f\n",i,dmle,dmedian,dmin,dmax);}
	}}
/* convert a recombination fraction into a distance in centimorgans */
double convert(double r)
{	if(crosstype == 2)	return( r*100/(4 - 6*r) );	/* RI strains (Equation D3) */
	return(100*r);}						/* backcross (Equation D2) */
/* unnormalized probability density (Equation D8) */
double Pin(double r,int i,int N)
{	return ((pow(r,i))*(pow(1-r,N-i)));}
/************************  END OF PROGRAM  ***********************/

D2 QUANTITATIVE DIFFERENCES IN EXPRESSION BETWEEN TWO STRAINS

How does one determine whether two populations of animals defined by different inbred strains are showing a significant difference in the expression of a trait? The answer is with a test statistic known as the "t-test" or "Student's t-test". To apply this test, one needs only three pairs of values derived from an analysis of the expression of the trait in sets of animals from each inbred strain. First is the number of animals examined in each inbred set (N₁ and N₂). Second is the mean level of expression for each set (m₁ and m₂) calculated as:

$$m = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \text{(Equation D10)}$$

where x_i refers to the expression value obtained for the ith sample in the set. Third is the variance of each set of animals (s₁² and s₂²) calculated as:

$$s^{2} = \frac{\sum_{i=1}^{N} (x_i - m)^{2}}{N - 1} \qquad \text{(Equation D11)}$$

With values for the variance of each sample set and the size of each set, one can calculate a combined parameter referred to as the "pooled variance":

$$s_p^{2} = \frac{(N_1 - 1)\,s_1^{2} + (N_2 - 1)\,s_2^{2}}{N_1 + N_2 - 2} \qquad \text{(Equation D12)}$$

Finally, one can use the value obtained for the pooled variance together with the sample sizes and sample means to obtain a "t value":

$$t = \frac{m_1 - m_2}{\sqrt{s_p^{2}\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}} \qquad \text{(Equation D13)}$$

One final combined parameter is required to convert the t value into a level of significance — the number of degrees of freedom df.

$$df = N_1 + N_2 - 2 \qquad \text{(Equation D14)}$$

With values for t and df, one can obtain a P value from a table of critical values for the t distribution found in Table D7.
