Table of contents

Lecture #1: Definitions
Lecture #2: Probability

Statistical Concepts

Lecture series by Charles Brenner x5300, visitor in Dept of Genetics

LG26, 9:30am, Thursdays


The series will begin by discussing basic terms and issues, such as: population, parameter, sample, statistic, probability, p-value, hypothesis testing, likelihood ratio, comparing hypotheses, bias.

I am less interested in teaching a catalogue of statistical tests than in discussing questions like:

Beyond these simple ideas, there are several options. One is to discuss probability, with a view to coming to an understanding of (a) probability, and (b) the "probability" or so-called "exact" tests. This in turn forms a foundation for answering questions like:

  1. Definitions
  2. Thursday, 2 September 1999


    Searching the web for definitions, we find typical statistics nonsense at [website defunct]. It purports to give definitions about statistics, but is more like a list of examples.

    Another site (defunct) turned up something that looks more sensible.

    "Statistics is [the theory and method of analyzing quantitative data obtained from samples of observations in order to study and compare sources of variation of phenomena, to help make decisions to accept or reject hypothesized relations between phenomena, and to aid in] making [reliable] inferences from empirical observations" (Kerlinger, 1986, p. 175).

    Let's condense that to

    making inferences about populations from samples

    If the mean height of people in the sample is 2m, the mean height of people in the population is close to 2m.

    1. population – a set of objects, of interest. (may be infinite or otherwise unobservable)
    2. Population of people, of haploid cells, of 100-item samples, of measurements of a person

    3. sample – an observable subset of a population.
    4. 50 people, 60 haploids, 70 100-person samples, 80 repeated measurements

    5. parameter – a property of a population.
    6. Greek letter (µ) or capital letter (P, N)

      Comment: 2N

    7. statistic – an observable property of a sample.
    8. Roman letter (x̄) or small letter (p, n)

      Robbins example: An experiment has the possible outcomes E1, E2, ... with unknown probabilities p1, p2, ... . In n independent trials suppose that Ei occurs xi times. How can we "estimate" u, the total probability of unobserved outcomes? (The quotation marks appear because u is not a parameter in the usual statistical sense.)
      Comment (and homework): What does Robbins' parenthetical statement mean?

    9. estimate – infer a parameter from a sample
    10. Answer – Perform an (n+1)st trial. Note the proportion of the n+1 trials whose outcome occurred exactly once. The expected value of this proportion equals the expected total probability of the outcomes unobserved in the n-trial sample.
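Robbins' answer can be checked by simulation. The sketch below is illustrative only: the outcome probabilities (`probs`), the sample size `n = 20`, and all function names are assumptions made up for this example, not taken from Robbins or from the lecture. It compares the average unobserved probability u after n trials with the average proportion of once-observed outcomes after n+1 trials.

```python
import random
from collections import Counter


def robbins_trial(probs, n, rng):
    """Run n+1 trials once and return (u, v).

    u: total probability of outcomes NOT seen in the first n trials
    v: proportion of the n+1 trials whose outcome occurred exactly once
    """
    outcomes = list(range(len(probs)))
    draws = rng.choices(outcomes, weights=probs, k=n + 1)
    seen = set(draws[:n])  # only the first n trials define "unobserved"
    u = sum(p for i, p in enumerate(probs) if i not in seen)
    counts = Counter(draws)  # all n+1 trials
    singletons = sum(1 for c in counts.values() if c == 1)
    v = singletons / (n + 1)
    return u, v


def compare(probs, n, runs=5000, seed=0):
    """Average u and v over many independent repetitions."""
    rng = random.Random(seed)
    total_u = total_v = 0.0
    for _ in range(runs):
        u, v = robbins_trial(probs, n, rng)
        total_u += u
        total_v += v
    return total_u / runs, total_v / runs


def expected_unobserved(probs, n):
    """Exact expectation of u: sum of p_i * (1 - p_i)^n."""
    return sum(p * (1 - p) ** n for p in probs)


# Hypothetical population: 50 outcomes with probabilities proportional to 1/i.
weights = [1 / (i + 1) for i in range(50)]
probs = [w / sum(weights) for w in weights]

mean_u, mean_v = compare(probs, n=20)
print("average unobserved probability u:", round(mean_u, 3))
print("average once-observed proportion v:", round(mean_v, 3))
print("exact expectation:", round(expected_unobserved(probs, 20), 3))
```

In any single run u and v can differ considerably; it is only their averages that agree, which fits the definition of "expected" as an average.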

    11. expected – average (over a specified range)

    12. hypothesis – an assumption about population(s), from which parameters can be inferred. In effect, an assumption about parameters.
    13. Comment: a declarative sentence!

    14. test statistic – a statistic calculated with a view to deciding a hypothesis

    15. p-value – "probability"-value of a test statistic. The probability of obtaining so large (small, extreme) a test statistic if the hypothesis is true. Occasional small p-values are unavoidable.
    16. Hypothesis: The universe is half male, half female.

      Sample: 10000 individuals, of whom 5100 are female.

      Test statistic: chi2 = 4

      p-value ≈ 0.046 (two-tailed test)

      Comment: if 5200 female, chi2 = 16 and p ≈ 0.00006. If 60/100 female, chi2 = 4 again, so p ≈ 0.046.
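The arithmetic in this example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: a chi-square test of an equal split with one degree of freedom, using function names (`chi2_equal_split`, `p_value_1df`) invented here for illustration. With one degree of freedom the chi-square statistic is the square of a standard normal variable, so its tail probability is the two-tailed normal tail, which `math.erfc` gives directly.

```python
import math


def chi2_equal_split(females, total):
    """Chi-square statistic for the hypothesis 'half male, half female'."""
    expected = total / 2
    males = total - females
    return ((females - expected) ** 2 + (males - expected) ** 2) / expected


def p_value_1df(chi2):
    """P(chi-square with 1 df > chi2) = P(|Z| > sqrt(chi2)) for standard normal Z."""
    return math.erfc(math.sqrt(chi2 / 2))


# The three samples mentioned in the example and its comment.
for females, total in [(5100, 10000), (5200, 10000), (60, 100)]:
    chi2 = chi2_equal_split(females, total)
    print(females, total, chi2, p_value_1df(chi2))
```

For chi2 = 4 the tail probability is about 0.0455, and for chi2 = 16 it is about 0.00006. The 5100/10000 and 60/100 samples give the same statistic, and hence the same p-value, even though the observed proportions (51% and 60%) are very different.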

  3. Probability
  4. Thursday, 9 September 1999

    Discussion: accept/reject paradigm

    Example: DNA forensics analysts are happy if the population is in Hardy-Weinberg equilibrium (HWE). A test statistic is calculated on a population sample, and converted to a p-value. If the p-value is small, e.g. < 0.05, that tends to indicate that the population may not be in HWE.

    An analyst proudly testified that out of a large number of such population studies, in only 1% was p<0.05. What's wrong with that?

    I said that there must be publication bias. He said, no, the lack of low p-values was perhaps due to the samples being rather small.

    What's wrong with that?
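One way to explore both claims is to simulate many studies in which the hypothesis really is true and count how often p < 0.05. The sketch below does not implement an HWE test. As a stand-in it reuses the half-male, half-female hypothesis from Lecture #1, and the function names, the sample sizes, and the number of simulated studies are all assumptions chosen for illustration.

```python
import math
import random


def p_value_fair_split(females, n):
    """Two-tailed p-value (chi-square, 1 df) for the hypothesis of an equal split."""
    chi2 = (2 * females - n) ** 2 / n
    return math.erfc(math.sqrt(chi2 / 2))


def fraction_small_p(n, studies, alpha=0.05, seed=1):
    """Fraction of simulated studies with p < alpha when the hypothesis is TRUE."""
    rng = random.Random(seed)
    small = 0
    for _ in range(studies):
        # Counting the 1-bits among n random bits gives a Binomial(n, 1/2) draw.
        females = bin(rng.getrandbits(n)).count("1")
        if p_value_fair_split(females, n) < alpha:
            small += 1
    return small / studies


# Hypothetical sample sizes: a small study and a larger one.
for n in (100, 1000):
    print(n, fraction_small_p(n, studies=4000))
```

In this toy setting the fraction stays in the neighbourhood of 5% for both sample sizes, with some deviation because counts are discrete. When the hypothesis holds, p < 0.05 is by construction expected in roughly 1 study in 20, so a collection of studies in which it happens only 1 time in 100 calls for an explanation.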

    1. condition
    2. hypothesis
    3. repeatable experiment
      1. conceptually repeatable experiment
      2. We must remember, that the probability of an event is not a property of the event itself, but a mere name for the degree of ground which we, or someone else, have for expecting it. ... Every event is in itself certain, not probable: if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence. — J.S. Mill
      3. a probability is a summary of whatever information we may possess
    4. Some experiments
      1. Flip a coin
      2. Chance of rain
      3. Life on Mars
      4. Is there a dog?
    5. "Two kinds of probability"
      1. relative frequency
      2. degree of belief
      3. A continuum of kinds