Selection Bias

From SkepticWiki

Definition

Selection Bias is a logical or statistical error in which the data analyzed are biased because they come from samples that are not representative of the population under study. Selection Bias can exaggerate or understate a real effect, cause a genuine effect to appear non-existent, or cause a non-existent effect to appear genuine.

Types of Selection Bias

Selection Bias may be built into the method of data collection. For example, the first extrasolar planets to be discovered were large gas giants orbiting close to their stars. This has little to do with how common such planets are, and everything to do with the fact that they are the easiest to detect. Anyone analyzing the data must be aware of these limitations when making claims. Creationists use these data to try to show that there are no Earth-like planets, only gas giants, and therefore that Earth has the only life in the Universe. In reality, there could conceivably be Earth-like planets around every star and we would have no way of knowing it, at least until the technology exists to detect such planets.
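The detection effect described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the share of gas giants (20%) and the two detection probabilities are hypothetical numbers chosen for illustration, not astronomical estimates.

```python
import random

def simulate_survey(n_planets=10_000, giant_fraction=0.2,
                    p_detect_giant=0.5, p_detect_small=0.01, seed=0):
    """Return (true giant share, giant share among detected planets).

    All parameters are hypothetical; they only encode the idea that
    large planets are far easier to detect than small ones.
    """
    rng = random.Random(seed)
    true_giants = detected_giants = detected_small = 0
    for _ in range(n_planets):
        is_giant = rng.random() < giant_fraction
        true_giants += is_giant
        # The chance of being observed depends on the planet type.
        p_detect = p_detect_giant if is_giant else p_detect_small
        if rng.random() < p_detect:
            if is_giant:
                detected_giants += 1
            else:
                detected_small += 1
    detected = detected_giants + detected_small
    return true_giants / n_planets, detected_giants / detected

true_share, observed_share = simulate_survey()
print(f"true giant share: {true_share:.2f}")
print(f"giant share among detections: {observed_share:.2f}")
```

Under these assumed numbers, the expected share of giants among detections is 0.10 / (0.10 + 0.008), or about 93%, even though giants make up only 20% of the simulated planets. The catalogue of detections says more about the instrument than about the population.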

Another example might be a pollster who surveys Americans by calling them in the evening, when they are most likely to be home. This will tend to underrepresent people who work second shift, and since those workers are more likely to be in a lower socioeconomic group, that group will be underrepresented as well.
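A short calculation shows how unequal reachability distorts a sample of this kind. The population shares and reach rates below are hypothetical figures invented for the illustration; `expected_sample_share` is likewise just an illustrative helper.

```python
def expected_sample_share(population_share, reach_rate):
    """Expected share of each group among the respondents actually reached."""
    reached = {g: population_share[g] * reach_rate[g] for g in population_share}
    total = sum(reached.values())
    return {g: reached[g] / total for g in reached}

# Hypothetical numbers: 15% of workers are on second shift, and an
# evening call reaches them far less often than day-shift workers.
population_share = {"day shift": 0.85, "second shift": 0.15}
reach_rate = {"day shift": 0.60, "second shift": 0.10}

sample_share = expected_sample_share(population_share, reach_rate)
for group, share in sample_share.items():
    print(f"{group}: {population_share[group]:.0%} of population, "
          f"{share:.1%} of expected sample")
```

With these assumed rates, second-shift workers fall from 15% of the population to roughly 3% of the expected sample, so any opinion that differs between the two groups will be misestimated.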

Selection Bias may be unconscious. A particularly good example is found in the various Full Moon Effects, where folk belief has it that events such as childbirth are more likely to happen at the full moon. Even medical professionals (for example, delivery room nurses) will often believe these things, having noticed a surge in labor admissions when it coincided with a full moon; unfortunately, it doesn't occur to them to pay the same attention to surges at other times. Analysis of actual birth data shows no correlation between births and the phase of the moon. This kind of Selection Bias is sometimes referred to as Confirmation Bias, and is described by the phrase, "Counting the hits and ignoring the misses."
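Counting the hits and ignoring the misses can be made concrete with a simulation in which the moon has no effect at all. Everything here is an assumption made for illustration: the admission counts are drawn uniformly between 5 and 15 per night, a "busy" night is one with 13 or more admissions, and the lunar cycle is rounded to 30 nights.

```python
import random

def simulate_ward(n_nights=30_000, cycle=30, busy_threshold=13, seed=7):
    """Simulate nightly admissions that do not depend on the moon at all."""
    rng = random.Random(seed)
    full = {"busy": 0, "nights": 0}
    other = {"busy": 0, "nights": 0}
    for night in range(n_nights):
        admissions = rng.randint(5, 15)  # same distribution every night
        bucket = full if night % cycle == 0 else other
        bucket["nights"] += 1
        bucket["busy"] += admissions >= busy_threshold
    return full, other

full, other = simulate_ward()

# The "hits": busy nights that happened to fall on a full moon.
hits = full["busy"]
# The fair comparison: how often is a night busy, with and without a full moon?
rate_full = full["busy"] / full["nights"]
rate_other = other["busy"] / other["nights"]

print(f"busy full-moon nights remembered: {hits}")
print(f"busy rate, full moon: {rate_full:.3f}")
print(f"busy rate, other nights: {rate_other:.3f}")
```

An observer who tallies only the busy full-moon nights accumulates many memorable hits, yet the two rates come out nearly equal, because the busy nights at other phases were never counted.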

Selection Bias may also be deliberate. Someone may fraudulently pick and choose data that support a hypothesis and ignore data that do not. This has been done in studies of the efficacy of prayer, where only the cases showing a positive correlation (a pronounced difference between patients who were prayed over and patients who weren't) were included. Once all of the cases are included, the effect goes away. This is often referred to as Cherry-Picking.
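The way an effect appears under cherry-picking and vanishes when all cases are included can be sketched as follows. This is a hypothetical simulation, not a reconstruction of any actual prayer study: each simulated study compares two groups drawn from the same distribution, so the true effect is zero by construction, and the cut-off of 0.2 for a "pronounced" difference is arbitrary.

```python
import random
import statistics

def run_null_study(rng, n_per_group=50):
    """One study in which the treatment truly does nothing.

    Returns the difference in mean outcome (treated minus control).
    """
    treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    return statistics.mean(treated) - statistics.mean(control)

rng = random.Random(42)
effects = [run_null_study(rng) for _ in range(200)]

# Cherry-picking: report only the studies with a pronounced positive difference.
cherry_picked = [e for e in effects if e > 0.2]

mean_all = statistics.mean(effects)
mean_picked = statistics.mean(cherry_picked)

print(f"studies run: {len(effects)}, studies reported: {len(cherry_picked)}")
print(f"mean effect, all studies: {mean_all:+.3f}")
print(f"mean effect, cherry-picked: {mean_picked:+.3f}")
```

The cherry-picked average is positive by construction, while the average over all studies stays close to zero, which is the pattern the paragraph above describes.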

Discussion

Selection Bias in scientific studies

In the hands of a skilled but unethical researcher, selection bias can be used to produce any misleading impression desired without actually making any false statements; Darrell Huff's book How to Lie with Statistics provides a good handbook of some techniques that might be applied. For a research scientist, this conduct is unconscionable, but in areas like advertising and public relations, it is standard practice.

A more serious problem for researchers is the unconscious sort of bias that can creep into an otherwise well-performed study. For example, it is unreasonable to expect a scientist to analyze every bit of data generated in an experiment. In some cases, obvious error or mishandling of the material ("What do you mean, you spilled your coffee into the reaction chamber?") will invalidate a particular data point. A scientist should not expect that an analysis of mostly coffee will shed any light on the blood protein structure she is studying. However, the decision to include or exclude any particular data point can be more problematic.

Consider, for example, the case of an epidemiologist who finds an unusually low number of cases reported for a particular county for a particular year, but also discovers that the county reporting officer was on leave that year and the reporting was done by a less-qualified assistant. Should these data be included? Only if they are considered not to be erroneous; but human nature is such that the data are more likely to be considered correct if they support the epidemiologist's theories. If he expects to see a low number, he will consider this number a demonstration of his theories, but if he expects to see a high number, he will consider it a demonstration of the assistant's incapacity.

A corollary is that data points the researcher finds anomalous are more likely to be scrutinized closely, and therefore more likely to be excluded, than data points that match expectations.

For this reason, standard practice in science, whenever possible, is to perform analyses "blind." In a clinical setting, for example, this means that the doctor who evaluates the patient does not know whether the patient received the treatment under study or a placebo. It can also mean that the statistician who analyzes the data does not know beforehand what the hypothesis under study is, or even which group is the experimental group and which the control group. In this way, the problem of selection bias in science can be reduced, if not eliminated.
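One simple way to keep the analyst from knowing which group is which is to replace the real group names with neutral codes before the data are handed over. The sketch below is illustrative only; `blind_labels`, the patient identifiers, and the code names are all hypothetical.

```python
import random

def blind_labels(assignments, seed=None):
    """Replace real group names with neutral codes.

    Returns (blinded, key). The key maps real names to codes and would be
    held by a third party until the analysis is finished.
    """
    rng = random.Random(seed)
    groups = sorted(set(assignments.values()))
    codes = [f"group_{chr(ord('A') + i)}" for i in range(len(groups))]
    rng.shuffle(codes)  # which code goes with which group is random
    key = dict(zip(groups, codes))
    blinded = {pid: key[group] for pid, group in assignments.items()}
    return blinded, key

# Hypothetical patient assignments.
assignments = {"p1": "treatment", "p2": "placebo",
               "p3": "treatment", "p4": "placebo"}
blinded, key = blind_labels(assignments, seed=1)

print(blinded)  # what the analyst sees: only group_A / group_B
```

The analyst can still compare the two coded groups, but cannot tell which outcome would support the hypothesis, so decisions about doubtful data points cannot be steered, consciously or not, toward the expected result.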

Selection Bias in paranormal studies

The problem of selection bias is in many respects even worse in paranormal studies, because the notion of success and failure is difficult to formalize. A typical ESP or remote viewing experiment, for example, will give a collection of observations that must be matched to the underlying experimental data. (This problem is particularly severe in Ganzfeld experiments, where the experimental stimuli are complex and the matching can be extremely vague.) The Skeptic's Dictionary gives the following as an example of such observations:

I see the Lincoln Memorial... And Abraham Lincoln sitting there... 
It's the 4th of July... All kinds of fireworks... Now I'm at Valley Forge...
There are fireworks... And I think of bombs bursting in the air...
And Francis Scott Key... And Charleston...

There are obviously any number of images to which this description could be "fit"; the researcher in this case thought that this description was generated by a picture of George Washington. A believer in the paranormal could easily accept this association, despite the fact that Washington's name appears nowhere in it; a skeptic could equally reject it. In controlled settings such as Ganzfeld experiments, it may be possible to set up control groups for comparison, but in less formal settings, it is hard to analyze such associations for plausibility.

One often-claimed paranormal power, for example, is the ability to "see things," remote in time, space, or both, that are later proven to be true. A person might dream of a friend being killed in an auto accident, and then later find out that the person was in fact injured by a motorcyclist that very day. Unfortunately, all the elements for selection bias are present in this hypothetical. People dream all the time, but tend only to remember dreams that are somehow significant (such as dreams that reinforce their belief in the paranormal). Although the dream itself was false (their friend was not killed, only injured, and it was a motorcycle, not a car), it is "close enough" to accurate that it can be interpreted as a prediction and remembered as such.

Exceptions to the Rule

As discussed above, if there is clear and compelling evidence that a piece of data is invalid, it should be excluded from the study. Similarly, if there are pre-defined rules about what data should and should not be included, and they are applied in an even-handed and fair manner, issues of selection bias should not arise.
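A pre-defined inclusion rule of this kind can be written so that it never looks at the measured value, only at data-quality information recorded independently of the result. The sketch below borrows the county-reporting example from the discussion above; the `Record` fields, the `include` rule, and the case counts are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    county: str
    cases: int
    reported_by_qualified_officer: bool
    contaminated: bool = False

def include(record: Record) -> bool:
    """Pre-defined rule, fixed before any data are examined.

    It depends only on data-quality flags and never on `cases`, so a
    surprising count and an expected count are treated identically.
    """
    return record.reported_by_qualified_officer and not record.contaminated

# Hypothetical records: two were filed by an unqualified assistant.
records = [
    Record("County A", cases=12, reported_by_qualified_officer=True),
    Record("County B", cases=2, reported_by_qualified_officer=False),
    Record("County C", cases=40, reported_by_qualified_officer=False),
    Record("County D", cases=15, reported_by_qualified_officer=True),
]

kept = [r for r in records if include(r)]
print([r.county for r in kept])
```

Because the rule is applied even-handedly, the unusually low count and the unusually high count from the unqualified reporter are both excluded, whichever of them would have suited the researcher's expectations.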

Related Topics
