Closed vs. Open?

  1. definitions?
  2. The Swissair 101 example
    1. Normal identification protocol
    2. The last few identifications
  3. Deciding the threshold
    1. Prior odds
    2. The end-game
  4. Bayes' theorem
    1. Two hypotheses – odds formulation
    2. Multiple hypotheses
  5. Conclusion
    1. Epilogue – WTC identifications
    2. Ultimate analysis


"Closed" vs. "Open" Disaster: Useful distinction or misunderstanding?

  1. definitions?
    Various definitions have been suggested for a "closed" disaster, requiring as little as #1 below or as much as all four:
    1. There is a complete list of the missing (like a flight manifest)
    2. As many bodies have been found as the number of victims. "Victim" and "missing" are regarded here as synonyms.
    3. There is DNA for every body
    4. There is a DNA reference for every victim

    The general idea, I am sure, is that there is a dichotomy along these lines:

    I disagree. This is a distinction without a difference. There are only differences of degree.

  2. The Swissair 101 example
    An excellent example, apparently illustrating the new tool, was the final identifications of the Swissair 101 victims.

    1. Normal identification protocol
      A victim DNA profile is compared with personal references and/or relatives' DNA profiles, resulting in a likelihood ratio L. L is compared with some pre-determined threshold t (we may have used t = 1 million), and if
      L > t
      the identity is established.

    2. The last few identifications
      Eventually the identifications of all but three or four victims were established, but those last few could not be established at L > t. It was, though, eventually possible to resolve all the identifications. Since the system was "closed" (under any definition), it was possible to eliminate all combinations of identities for the final victims except for one combination. So we felt comfortable that we knew the identities of all the bodies (save the ambiguity of a pair of identical twins), and we ascribed this to the special circumstance of having a "closed" situation.

      But was it really special? In effect we assumed that the final identifications were idiosyncratic or ad hoc. Compared to the paradigm we had in mind up to the end, that is true, but what was really going on? Later, as KADAP, we considered prospectively the end-game of the WTC identifications. We did not expect ever to have neat and complete lists as in the Swissair case, but would the "closed" paradigm kick in at some late stage in the IDs? When? To what extent?

      To answer questions like that, it is necessary to have an explicit understanding of the so-called "closed" phenomenon.

  3. Deciding the threshold
    1. Prior odds
      Of course t really depends on the number of victims. If there are 1 million victims then L > 1 million means very little: the posterior odds are only about even. A more general point of view is that there is some prior probability p, such as
      p = 1/(v+1), where v+1 = number of victims,

      and the likelihood ratio is interpreted in the context of p, then compared with some threshold posterior probability.

      Expressed in terms of odds, the above amounts to

      prior odds = 1/v
      posterior odds P = L × (prior odds) = L/v,
      which must exceed an odds threshold t_odds, perhaps t_odds = 1000:
      L/v > t_odds.

      This more elaborate paradigm is what we adopted from the beginning for WTC. We considered v = 10000 for the vague purpose of choosing t_odds (big enough to ensure 99% probability of no mis-identifications), and then worked with v = 5000 for purposes of choosing t. For the first year v was conservatively kept at 5000.

      At the KADAP meeting of September 9-10, 2002 the ID parameters were revisited. According to the meeting notes, v was reduced to v = 3000 to reflect that the disaster was considered "closed" (meaning apparently that definition #1 was nearly satisfied). That seems to me a non sequitur. The reason to reduce v was simply that the number of victims was known to be below 3000. Whether the number was known precisely, or the names were known, or DNA references were available, was irrelevant.

      It could also have been recommended to always use

      v=(number of remaining victims)-1.
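
The effect of letting v decline can be sketched in a few lines of Python. This is only an illustration of the arithmetic above, not the software used for WTC; the function names are hypothetical, and t_odds = 1000 is the example value mentioned earlier.

```python
def required_lr(v: int, t_odds: float = 1000.0) -> float:
    """Value that the likelihood ratio L must exceed so that L/v > t_odds."""
    return t_odds * v


def is_identified(lr: float, v: int, t_odds: float = 1000.0) -> bool:
    """Test posterior odds L/v > t_odds, written as L > t_odds * v.

    Multiplying through by v avoids dividing by zero when v == 0
    (only one victim left, so the prior odds 1/v are infinite).
    """
    return lr > required_lr(v, t_odds)


if __name__ == "__main__":
    # The likelihood ratio needed shrinks in proportion to v.
    for v in (5000, 3000, 100, 2, 0):
        print(f"v = {v:>4}: need L > {required_lr(v):,.0f}")
```

With v fixed at 5000 a likelihood ratio of several million is needed throughout; with v tracking the remaining victims, the requirement falls steadily, and for the very last victim any positive likelihood ratio suffices.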

    2. The end-game
      OK, I know what you are thinking. The above satisfactorily explains making the very last identification even with very weak DNA evidence (it's because the prior odds are infinite), but how about the jigsaw, only-way-the-pieces-fit phenomenon when the last two or three identifications fall into place simultaneously? Is that a special phenomenon or just a particular and extreme case of something that is always there to some extent?

      The latter. It is an extreme case of something that we could do all the time, if we thought to do so.

  4. Bayes' theorem
    1. Two hypotheses – odds formulation
      Suppose v = 1000 (i.e. 1001 are missing) so prior odds are 1:1000, and suppose L = 200,000 supporting a particular identity J. To review, we would then calculate
      posterior odds = (likelihood ratio) × (prior odds) = L/v = 200;
      posterior probability = (posterior odds)/(1 + posterior odds) = 200/201 = 99.5%.
      Equivalently, we could arrange the work in a table. For this purpose, instead of the ratio of two likelihoods it is attractive to consider the two likelihoods separately.

      Table 1

      hypothesis                   identity=J   someone else
      (relative) likelihood        200,000      1
      (relative) prior odds        1            1000
      (relative) posterior odds    200,000      1000           total=201,000
      normalized (probability)     99.5%        0.5%
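
The rows of Table 1 can be reproduced with a short Python sketch. The function name bayes_table is a hypothetical label of mine; the numbers are those of the table. The odds calculation is included alongside to show that both routes give the same probability.

```python
def bayes_table(likelihoods, priors):
    """Return (relative posteriors, their total, normalized probabilities)."""
    posteriors = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(posteriors)
    probabilities = [post / total for post in posteriors]
    return posteriors, total, probabilities


# Columns: identity = J, someone else (numbers from Table 1).
posteriors, total, probabilities = bayes_table([200_000, 1], [1, 1000])

# Odds route for comparison: posterior odds = L / v.
odds = 200_000 / 1000
prob_from_odds = odds / (1 + odds)

print(posteriors, total)
print([f"{p:.1%}" for p in probabilities], f"{prob_from_odds:.1%}")
```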

    2. Multiple hypotheses
      Bayes' theorem as we learn it in forensic science involves just two hypotheses. The usual formulation in statistics allows an arbitrary collection of hypotheses. The preceding table easily generalizes to more hypotheses.

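      Imagine one hypothesis per missing person:
      H_J, H_1, H_2, ..., H_v.

      The corresponding likelihoods are
      L_J, L_1, L_2, ..., L_v,

      and for the relative priors, we assume that each person still unidentified has the same prior (1) and each person already identified has prior 0.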

      Now consider the situation when all but the last few identifications have been made – only J, 1, and 2 are still to be identified. In a simple typical case, the situation would be:

      Table 2

      hypothesis                   H_J    H_1   H_2   H_3   ...   H_v
      (relative) likelihood        50     0     0     0     ...   0
      (relative) prior odds        1      1     1     0     ...   0
      (relative) posterior odds    50     0     0     0     ...   0     total=50
      normalized (probability)     100%   0     0     0     ...   0
      Thus, the identification of J is certain even though L_J = 50 is very modest evidence.
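
A minimal Python sketch of the multi-hypothesis calculation follows, using the numbers of Table 2 truncated to four hypotheses for brevity. The function and dictionary keys are hypothetical names. For contrast it also runs the two-hypothesis shortcut with v = (number of remaining victims) - 1 = 2, which lumps the other two candidates into "someone else" with relative likelihood 1.

```python
def posterior_probabilities(likelihoods, priors):
    """Multiply each hypothesis's likelihood by its prior, then normalize."""
    weights = {h: likelihoods[h] * priors[h] for h in likelihoods}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("every hypothesis has zero posterior weight")
    return {h: w / total for h, w in weights.items()}


# End-game of Table 2: only J, 1 and 2 remain unidentified, and the
# victim profile mismatches the references of 1 and 2 (likelihood 0).
likelihoods = {"H_J": 50, "H_1": 0, "H_2": 0, "H_3": 0}
priors = {"H_J": 1, "H_1": 1, "H_2": 1, "H_3": 0}
table2 = posterior_probabilities(likelihoods, priors)

# Two-hypothesis shortcut: "someone else" has relative prior v = 2.
shortcut = posterior_probabilities({"H_J": 50, "else": 1},
                                   {"H_J": 1, "else": 2})

print(f"Table 2 approach: {table2['H_J']:.1%}")
print(f"two-hypothesis shortcut: {shortcut['H_J']:.1%}")
```

The shortcut stops short of certainty because it ignores the information that the two alternatives are excluded; the full table uses it.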

  5. Conclusion
    To the extent that the intuitive concept of "closed system" means anything in particular, I think it is explicated by the last table. That is, it corresponds to taking advantage of the likelihoods that are 0 because the victim profile mismatches certain victims' references, rather than concentrating solely on the suspected identity. Thus, if "closed" means anything, it means being in a situation where the idea of Table 2 can be employed.

      But it should be apparent that the computations of Table 2 are possible at any time, not just in the end game (although the consequences would be less dramatic). That means that every situation is, to a greater or lesser degree, "closed". There is no actual distinction between "open" and "closed"; the difference is only a matter of degree. That being the case, it would be hard to support a claim that whether the system is open or closed is a critical criterion. In particular there are no special "closed" methods of analysis; the mathematical tools of analysis are always the same and always follow a Bayesian paradigm with likelihood ratios interpreted in light of prior odds.

    1. Epilogue – WTC identifications
      In WTC we didn't try to apply the Table 2 idea. In fact we didn't even get quite as far as Table 1, for the lowest prior considered was based on the number of missing, never adjusted for those previously identified.

      WTC and Table 2

      Why didn't we use a Table 2, and what good would it have done? It may seem that it could be quite beneficial, for DNA reference information was available from almost every family. Imagine a case where body X matches the references of missing person J with an inadequate likelihood ratio, a few thousand say. We have v = 2749-1591-1 = 1157, and if all but 30 of those are obviously not X, the effective v is 30, which would be 100 times better than the result of using v = 3000 per protocol. Why not try?
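
The size of the potential gain is simple arithmetic, sketched here in Python. The likelihood ratio of 3000 is a hypothetical stand-in for "a few thousand", and the sketch assumes each of the 30 non-excluded candidates has relative likelihood 1 and the same prior as J.

```python
def posterior_odds(lr: float, candidates: int) -> float:
    """Posterior odds for J when `candidates` other people remain possible,
    each with relative likelihood 1 and a prior equal to J's."""
    return lr / candidates


lr = 3000.0  # hypothetical "few thousand" likelihood ratio
per_protocol = posterior_odds(lr, 3000)    # v = 3000 per protocol
after_exclusions = posterior_odds(lr, 30)  # only 30 candidates not excluded

print(per_protocol, after_exclusions, after_exclusions / per_protocol)
```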

      Easier said than done. My sense, based on experience trying to understand more deeply several cases where the likelihood ratio for some identification was modest, is that they are quite knotty to analyze. With indirect references there is always mutation to consider, and if there are direct references but the likelihood ratio is still small, the quality of the data is inevitably poor. Each of the more than 1100 hypothetical eliminations would require manual inspection, and maybe a lot of them would be problematic. So no doubt some ore is left in the ground, but it is hard to extract. Some software aids could be helpful, but there would still be a lot of manual tedium.

    2. Ultimate analysis
      Just as allowing the prior odds to decline realistically is an advance in analysis over sticking with the same likelihood ratio threshold through a whole identification project, and as using the idea of Table 2 is an advance over Table 1, there is still at least one more level of sophistication (read: accuracy) available that I've not discussed. That is, to consider multiple victim profiles at the same time. I've occasionally needed to do this explicitly in assigning identifications to a family of plane crash victims.

      The ultimately correct mathematical approach might in fact be to formulate identification hypotheses that consider ALL the victim profiles and ALL the reference data simultaneously. The number of hypotheses would then be truly enormous, (v+1)!, for a typical hypothesis would be like:

      body a is victim #3, and body b is victim #23, and ...
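
For a handful of bodies the compound hypotheses can simply be enumerated. The Python sketch below is a toy illustration under assumptions of my own: a hypothetical 3 by 3 matrix of likelihoods, every assignment equally likely a priori, and each body's likelihood independent of the others. It is not a description of any production system.

```python
from itertools import permutations
from math import factorial, prod


def joint_posteriors(lik):
    """Posterior probability of each complete body-to-victim assignment.

    lik[b][k] is the likelihood of body b's profile if body b is victim k.
    Every assignment (permutation) gets the same prior.
    """
    n = len(lik)
    weights = {}
    for perm in permutations(range(n)):
        w = prod(lik[b][perm[b]] for b in range(n))
        if w > 0:  # keep only assignments that fit the data at all
            weights[perm] = w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no assignment is consistent with the data")
    return {perm: w / total for perm, w in weights.items()}


# Hypothetical likelihoods for 3 bodies (rows) and 3 victims (columns).
lik = [
    [50.0, 1.0, 0.0],
    [1.0, 0.0, 20.0],
    [0.0, 30.0, 1.0],
]
post = joint_posteriors(lik)
for perm, p in sorted(post.items(), key=lambda item: -item[1]):
    print(perm, f"{p:.5f}")

# The count of compound hypotheses grows factorially with the bodies.
print(factorial(3), factorial(10), len(str(factorial(100))))
```

Brute-force enumeration is feasible only for a few bodies; the factorial growth is what makes the full treatment impractical at the scale of WTC.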

      Final remark

      In the extreme case that all the definition clauses hold it may be (if the reference information is straightforward, mostly direct references) that one and only one of the (v+1)! compound hypotheses "fits" the data, i.e. all the likelihoods but one are zero (provided one's model excludes mutation and other uncertainties about the data). I think this is the situation that Jack had in mind in insisting that the "closed" paradigm has real meaning, and that the draft "Lessons Learned" document alludes to where it says that in an "open" system, in distinction to a "closed" one, identifications are "statistical" (i.e. probabilistic). If "closed" alludes only to this very particular circumstance, then I would concede that there is a fairly sharp distinction between closed and open. But clearly, it is a circumstance that has no relevance to WTC.