Lecture #1; Definitions
Lecture #2; Probability

# Statistical Concepts

Lecture series by Charles Brenner x5300, visitor in Dept of Genetics

### Prospectus

The series will begin by discussing basic terms and issues, such as: population, parameter, sample, statistic, probability, p-value, hypothesis testing, likelihood ratio, comparing hypotheses, bias.

I am less interested in teaching a catalogue of statistical tests than in discussing questions like:

• What should one think or do when p<5%? when p>5%? How should a statistical conclusion be phrased?
• Why are measurement errors normally distributed?
• How can you recognize a standard deviation on sight?

Beyond these simple ideas, there are several options. One is to discuss probability, with a view to understanding (a) probability itself, and (b) the "probability" or so-called "exact" tests. This in turn forms a foundation for answering questions like:

• What is the right test for a given situation? What's wrong with the wrong test?

Another reason to understand probability is to think about likelihood ratios, in order to consider:

• What is the meaning of rejecting a hypothesis? What do we then accept?

## Definitions

Thursday, 2 September 1999

### Statistics

Searching the web for definitions, at [website defunct] we find typical statistics nonsense: it purports to give definitions of statistics, but is more like a list of examples.

Another site, funnelweb.utcc.utk.edu (defunct), turned up something that looks more sensible:

"Statistics is [the theory and method of analyzing quantitative data obtained from samples of observations in order to study and compare sources of variation of phenomena, to help make decisions to accept or reject hypothesized relations between phenomena, and to aid in] making [reliable] inferences from empirical observations" (Kerlinger, 1986, p. 175).

Let's condense that to

making inferences about populations from samples

For example: if the mean height of people in the sample is 2 m, we infer that the mean height of people in the population is close to 2 m.
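A minimal sketch of that inference, using invented numbers: the population of 100,000 heights, its spread of 0.1 m, and the sample size of 50 are all assumptions made for illustration. It only shows that the mean of a random sample tends to land near the mean of the population it was drawn from.

```python
import random
import statistics

rng = random.Random(1999)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 heights in metres (invented numbers).
population = [rng.gauss(2.0, 0.1) for _ in range(100_000)]
mu = statistics.fmean(population)      # a property of the whole population

sample = rng.sample(population, 50)    # an observable subset of the population
x_bar = statistics.fmean(sample)       # a property of the sample

print(f"population mean = {mu:.3f} m")
print(f"sample mean     = {x_bar:.3f} m")
```

In practice only the sample mean is observable; the population mean is computed here only because the population is simulated.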

1. population – a set of objects of interest (may be infinite or otherwise unobservable)
2. Examples: a population of people, of haploid cells, of 100-item samples, of measurements of a person

3. sample – an observable subset of a population
4. Examples: 50 people, 60 haploids, 70 100-person samples, 80 repeated measurements

5. parameter – a property of a population
6. Notation: a Greek letter (µ) or a capital letter (P, N)

Comment: 2N

7. statistic – an observable property of a sample
8. Notation: a Roman letter (x̄) or a small letter (p, n)

Robbins example: An experiment has the possible outcomes E1, E2, ... with unknown probabilities p1, p2, ... . In n independent trials suppose that Ei occurs xi times. How can we "estimate" u, the total probability of unobserved outcomes? (The quotation marks appear because u is not a parameter in the usual statistical sense.)
Comment (and homework): What does Robbins' parenthetical statement mean?

9. estimate – infer a parameter from a sample
10. Answer – Perform an (n+1)st trial. Count the outcomes that occurred exactly once in the n+1 trials and divide by n+1. This proportion estimates u, because the expected proportion of once-observed outcomes in the (n+1)-sample equals the expected total probability of the outcomes unobserved in the n-sample.
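The Robbins estimate can be checked by simulation. The sketch below is only an illustration under stated assumptions: the 200 outcomes with probabilities proportional to 1/k, the choice n = 100, and the function name `one_experiment` are all made up. For each repetition it computes u (known here only because the probabilities are simulated) and the estimate v, the number of once-observed outcomes in n+1 trials divided by n+1, and then compares their averages.

```python
import random
from collections import Counter

def one_experiment(probs, n, rng):
    """Return (u, v): the true unobserved probability and its estimate."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n + 1)
    seen = set(draws[:n])                       # outcomes observed in the first n trials
    u = sum(p for i, p in enumerate(probs) if i not in seen)
    counts = Counter(draws)                     # counts over all n+1 trials
    singletons = sum(1 for c in counts.values() if c == 1)
    v = singletons / (n + 1)
    return u, v

rng = random.Random(0)
weights = [1 / k for k in range(1, 201)]        # hypothetical skewed probabilities
total = sum(weights)
probs = [w / total for w in weights]

results = [one_experiment(probs, 100, rng) for _ in range(2000)]
mean_u = sum(u for u, _ in results) / len(results)
mean_v = sum(v for _, v in results) / len(results)
print(f"average u (true unobserved probability) = {mean_u:.4f}")
print(f"average v (singletons / (n+1))          = {mean_v:.4f}")
```

Note that u changes from one repetition to the next, because it depends on which outcomes happened to turn up in the sample.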

11. expected – average (over a specified range)

12. hypothesis – an assumption about population(s), from which parameters can be inferred. In effect, an assumption about parameters.
13. Comment: a declarative sentence!

14. test statistic – a statistic calculated with a view to deciding a hypothesis

15. p-value – "probability"-value of a test statistic: the probability of obtaining a test statistic at least as large (small, extreme) as the one observed, if the hypothesis is true. Occasional small p-values are unavoidable.
16. Hypothesis: The universe is half male, half female.

Sample: 10000 individuals, of whom 5100 are female.

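Test statistic: chi2 = (5100 − 5000)²/5000 + (4900 − 5000)²/5000 = 4

p-value ≈ 0.046 (two-tailed test, 1 degree of freedom)

Comment: if 5200 are female, chi2 = 16 and p ≈ 0.00006. If 60 of 100 are female, chi2 = 4 again and p ≈ 0.046.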
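The arithmetic in this example can be recomputed with a few lines of standard-library Python. This is a sketch, not a general chi-square routine: the function names are made up, and the p-value formula relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper tail is `erfc(sqrt(chi2 / 2))`.

```python
import math

def chi2_equal_split(females, total):
    """Chi-square statistic for the hypothesis 'half male, half female'."""
    expected = total / 2
    males = total - females
    return ((females - expected) ** 2 + (males - expected) ** 2) / expected

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

cases = [(5100, 10000), (5200, 10000), (60, 100)]
table = {}
for females, total in cases:
    chi2 = chi2_equal_split(females, total)
    table[(females, total)] = (chi2, p_value_df1(chi2))
    print(f"{females}/{total} female: chi2 = {chi2:.1f}, p = {table[(females, total)][1]:.5f}")
```

The same 2% excess of females gives very different p-values at different sample sizes, while 51% of 10000 and 60% of 100 give the same chi2.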

## Probability

Thursday, 9 September 1999

Example: DNA forensics analysts are happy if the population is in Hardy-Weinberg equilibrium (HWE). A test statistic is calculated on a population sample, and converted to a p-value. If the p-value is small, e.g. < 0.05, that tends to indicate that the population may not be in HWE.

An analyst proudly testified that out of a large number of such population studies, in only 1% was p<0.05. What's wrong with that?

I said that there must be publication bias. He said, no, the lack of low p-values was perhaps due to the samples being rather small.

What's wrong with that?
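One way to think about the question is to simulate many studies in which the hypothesis really is true and see how often p<0.05 turns up. The sketch below swaps in the simpler "half male, half female" chi-square test rather than an HWE test, and the sample sizes (100 and 1000), the number of studies (5000), and the function names are all assumptions chosen for illustration.

```python
import math
import random

def p_value_equal_split(females, total):
    """Chi-square (1 df) p-value for the hypothesis of an equal split."""
    expected = total / 2
    chi2 = 2 * (females - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))

def fraction_significant(sample_size, studies, rng, alpha=0.05):
    """Fraction of simulated studies with p < alpha when the hypothesis is true."""
    hits = 0
    for _ in range(studies):
        # getrandbits yields sample_size fair coin flips; count the 1 bits
        females = bin(rng.getrandbits(sample_size)).count("1")
        if p_value_equal_split(females, sample_size) < alpha:
            hits += 1
    return hits / studies

rng = random.Random(42)
fractions = {n: fraction_significant(n, 5000, rng) for n in (100, 1000)}
for n, frac in fractions.items():
    print(f"sample size {n}: fraction of studies with p < 0.05 = {frac:.3f}")
```

In this simulation the fraction of small p-values stays in the neighbourhood of 5% for both sample sizes, which is the background against which a reported 1% should be judged.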

1. condition
2. hypothesis
3. repeatable experiment
   1. conceptually repeatable experiment
   2. We must remember, that the probability of an event is not a property of the event itself, but a mere name for the degree of ground which we, or someone else, have for expecting it. ... Every event is in itself certain, not probable: if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence. (J.S. Mill)
   3. a probability is a summary of whatever information we may possess
4. Some experiments
   1. Flip a coin
   2. Chance of rain
   3. Life on Mars
   4. Is there a dog?
5. "Two kinds of probability"
   1. relative frequency
   2. degree of belief
   3. A continuum of kinds
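
For the coin flip, the relative-frequency reading of probability can be illustrated with a short simulation. This is only a sketch: the fair coin, the seed, and the checkpoints are assumptions, and it says nothing about the degree-of-belief reading, which applies to the experiments that cannot be repeated.

```python
import random

rng = random.Random(7)
checkpoints = (10, 100, 1000, 10_000, 100_000)

heads = 0
running = {}  # number of flips -> relative frequency of heads so far
for flips in range(1, checkpoints[-1] + 1):
    heads += rng.random() < 0.5      # one simulated fair coin flip
    if flips in checkpoints:
        running[flips] = heads / flips

for flips, freq in running.items():
    print(f"{flips:>7} flips: relative frequency of heads = {freq:.4f}")
```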