Oracle Scratchpad

October 16, 2009

Correlation

Filed under: Philosophy — Jonathan Lewis @ 7:14 pm BST Oct 16,2009

One of the “inspirational thoughts” on my opening page is the observation by the late Stephen J. Gould that

“The invalid assumption that correlation implies cause is probably among the two or three most serious and common errors of human reasoning.”

It’s very easy to equate correlation with causation and take inappropriate action as a result – it’s an example of faulty thinking that I see fairly frequently on forums such as OTN or the Oracle newsgroups.

If you want to get an insight into the difference between correlation and causation, you ought to read Robyn Sands’ note on “Nonsense Correlation”.

16 Comments »

  1. I suspect that by now many people know that correlation does not imply cause. In fact, I keep hearing “Correlation does not imply cause” even when it does not apply.

    Correlation does imply cause when the correlation was verified after modifying only the suspected cause in a well controlled and well designed experiment.

    If you divide sick people randomly into two groups, one group is given a medicine and the other placebo, and the group that received the medicine gets better while the placebo group does not – it is fairly reasonable to assume that the correlation between the medicine and the symptoms does imply that the medicine is what caused the health condition to improve.

    One can say that the intention of the scientific method is to verify causes of correlations.

    Comment by prodlife — October 16, 2009 @ 10:46 pm BST Oct 16,2009 | Reply

    • Chen,

      It’s possible that many people are intellectually aware that “correlation does not imply cause” – but I suspect that there is a big difference between recognising the correctness of the statement and automatically applying it.

      Correlation between effects may lead you to an attempt to establish cause, but to establish cause you need:

        A plausible hypothesis
        An absence of an alternative plausible hypothesis
        Predictability

      Technically it’s not the correlation between the medicine and the symptoms that implies the medicine is the cause – it’s the match between the prediction and the actual events, combined with the absence of an alternative explanation.

      I would prefer to rephrase your closing comment to say that the intention of the scientific method is to ensure that incorrect hypotheses about causes of correlation are ultimately falsified.

      (I’m not quibbling about the difference between “verify” and “falsify” here, by the way – in day to day terms something has been verified when all sensible attempts to falsify it have failed – the bit I am trying to emphasise is the intent to eliminate error.)

      Comment by Jonathan Lewis — October 17, 2009 @ 10:17 am BST Oct 17,2009 | Reply

      • The predictability part is really indispensable.

        One of the most annoying tricks that pharmaceutical companies often pull is to come up with a hypothesis that says “Medicine X will lower heart rates”, run an experiment, and find that medicine X does not lower the heart rate. But since they measure many things, they find that cholesterol levels went down. So they publish a paper saying “Medicine X lowers cholesterol levels.” This is not a valid result of the experiment. They would need a second experiment to test just this hypothesis.

        When you measure 20 different variables, there is a high probability that one of them will change significantly after your intervention (but not as a result of the intervention!) just by chance.

        Statspack is misleading in this way, because it shows too many data points. Surely many of them will be different after I change a parameter or a query. The trick is to know what measurement will differ and how, and to know it in advance.
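The "20 different variables" point above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the measurements are independent and that each one is judged at the conventional 5% significance level; both are simplifying assumptions rather than anything stated in the comment, and the function names are purely illustrative.

```python
def prob_at_least_one_false_positive(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_metrics independent metrics looks
    'significant' purely by chance, when each is tested at level alpha."""
    # Probability that a single metric stays quiet is (1 - alpha);
    # all of them staying quiet is that value raised to num_metrics.
    return 1.0 - (1.0 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric threshold that keeps the overall false-alarm rate at or
    below alpha (the Bonferroni correction)."""
    return alpha / num_metrics


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        p = prob_at_least_one_false_positive(n)
        print(f"{n:>3} metrics: {p:.3f} chance of at least one spurious change")
```

Under these assumptions, 20 metrics give roughly a 64% chance of at least one spurious "significant" change, which is why naming the expected measurement in advance matters. Real Statspack figures are unlikely to be independent, so the number should be read as an illustration rather than an estimate.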

        Comment by prodlife — October 17, 2009 @ 3:14 pm BST Oct 17,2009 | Reply

        • Hi Prod,

          >> Surely many of them will be different after I change a parameter or a query.

          And yes, they do!

          It’s an iterative process, start with low hanging fruit, apply, rinse and repeat . . .

          Comment by Donald K. Burleson — October 17, 2009 @ 10:39 pm BST Oct 17,2009

  2. >> It’s very easy to equate correlation with causation and take inappropriate action as a result

    In Oracle terms, if act “A” results in state “B” 90% of the time, that’s what’s important to know, regardless of causation!

    In scientific research, causation always takes a back seat to correlation.

    The ONLY thing that matters is the strength of the correlation, it’s “predictive validity”.

    For example, in medical science – If drug “A” results in state “B” then that’s important.

    Sure, it might be nice to know the “root cause” but often as not, medical researchers are not exactly aware of “how” a drug works.

    If you examine the Oracle Data Mining tools, you will see that it uses the same techniques.

    Medical Informaticists run Oracle scripts to detect “cohorts”, gather medical evidence that shows the efficacy of different treatment options, comparing treatments and outcomes:

    http://www.dba-oracle.com/oracle_tips_researchers_informatics.htm

    Causation is never factored into the equations, it’s all about cause and effect . . .

    Comment by Donald K. Burleson — October 17, 2009 @ 12:36 pm BST Oct 17,2009 | Reply

    • If act “A” results in state “B” 90% of the time, it is highly interesting, and you want to know the cause.

      Robyn’s example is excellent. There is a strong correlation between heart disease and dental health. However, until we know the cause, we do not know if brushing your teeth will prevent heart disease. Maybe eating less pork is good for the teeth? Or maybe good genetics is everything.

      If you don’t look into the cause of a correlation, you end up with silly advice like “Don’t use functions in the where clause because it causes bad performance”. If you know the cause, you can say something useful: “If you use functions in the where clause, you may end up not using an index that you want to use. You can solve the problem by modifying the query or by using a function-based index.”

      Too much correlation (and data mining!) is in my opinion one of the biggest reasons that US healthcare is where it is. Pharmaceutical companies mine lots of data, come up with a silly correlation, publish a paper, use it for marketing and doctors then prescribe the medicine for conditions it is not actually effective for – resulting in big healthcare expenses, rich companies, and unhealthy Americans.

      Prozac is a good example – lots of people pay for Prozac (or our insurance does) believing that it is effective against mild depression. There is no proof that it is effective. The correlation can be attributed to random chance, or to placebo effect. So much money is thrown away due to misleading research and marketing!

      Here’s an excellent journalistic article on how data mining can be misleading:
      http://www.badscience.net/2006/03/cocaine-floods-the-playground/

      The reason I’m bothering to write all this on a Saturday is not because database performance is so important, it’s because health is important. If we all understood science a bit better, maybe it would be more difficult to get us to spend our money on crap medicines instead of things that work. Database research is often easier for us to read than medical research, so it’s a good place to start practicing scientific skills :)

      Comment by prodlife — October 17, 2009 @ 3:52 pm BST Oct 17,2009 | Reply

      • It may have been Ben Goldacre ( http://www.badscience.net ) who once pointed out that one of the UK tabloids was clearly intent on dividing everything into one of two classes: things that caused cancer and things that cured cancer. His book is a very interesting, and sometimes appalling, read.

        It is extraordinary how even the more respectable papers and news programs will produce headlines and soundbites that are clearly idiotic cherry-picking, compression, and hyping of cautiously stated results from careful investigations.

        Comment by Jonathan Lewis — October 17, 2009 @ 9:47 pm BST Oct 17,2009 | Reply

  3. When the pharmas came out with the anti-depressant Wellbutrin, they noticed that a significant number of patients quit smoking.

    They don’t have a clue why this drug blocks the nicotine receptors in the brain; they have no idea why it works . . .

    Today, Wellbutrin is sold as Zyban, and it has helped over a million people quit smoking.

    Do scientists care about how it works? Perhaps, but that’s not nearly as important as the fact that people who take Zyban are more able to quit smoking.

    Zyban has saved thousands of lives, and nobody knows how it works . . .

    Comment by Donald K. Burleson — October 17, 2009 @ 6:36 pm BST Oct 17,2009 | Reply

    • Actually, Wikipedia has a pretty convincing explanation of why it works. It seems to be a nicotine replacement – nicotine attaches to specific brain receptors and acts as an inhibitor, and Zyban does exactly the same – therefore people who take Zyban no longer need nicotine.

      That’s pretty neat.

      The research probably went like this:
      1) Findings were reported from doctors and patients about this great side-effect of Wellbutrin.
      2) Drug company verified findings in a controlled trial.
      3) Researchers found the reason it works.

      Note that step 3 is optional, but step 2 is mandatory for turning an interesting idea into a medicine.
      Controlled experiments are what turn a correlation into a cause, by eliminating other possible explanations such as chance, the fact that non-depressed people find it easier to quit smoking, etc.

      Comment by prodlife — October 17, 2009 @ 7:11 pm BST Oct 17,2009 | Reply

      • >> It seems to be a nicotine replacement – nicotine attaches to specific brain receptors and acts as an inhibitor, and Zyban does exactly the same

        According to my doctor (I took this stuff to quit smoking), that’s only the theory.

        There is no proof, and they still have no idea about how it works (no causation).

        >> Drug company verified findings in a controlled trial

        And that is to confirm the correlation. As you correctly noted, that’s all that medical science requires.

        Comment by Donald K. Burleson — October 17, 2009 @ 9:53 pm BST Oct 17,2009 | Reply

        • > And that is to confirm the correlation

          Not really the right way of thinking about it. The purpose of the trials is to eliminate errors and ensure that the step from “simple correlation” to “probable cause” is justified. (And to check for threatening side-effects, of course.)

          Your use of the phrase “(no causation)” is also not entirely appropriate – it is reasonable to recognise causation without understanding (or having a complete understanding of) mechanism. In many cases understanding of mechanism comes later – and results in refinement of treatment or predictions of side effects that need to be addressed.

          Comment by Jonathan Lewis — October 18, 2009 @ 6:24 pm BST Oct 18,2009

      • > Note that step 3 is optional

        But knowing how a drug works makes for an infinitely more effective trial, especially when checking for negative side-effects …

        For example, knowing that paracetamol is metabolized by the liver, and that the real drug that works on the brain is a metabolite of it, calls for special investigations about possible liver damage.

        And since knowing how a drug works improves the effectiveness of the controlled trial so much – investigating how it works, or at least trying to, is mandatory as well, in my opinion.

        Anyway, out of metaphors – Oracle is a “bit” less complex than a human (or even an algae). You don’t need a PhD and a full research team when conducting controlled trials … just average analytic skills, sqlplus and a lot of sweating.

        Comment by Alberto Dell'Era — October 19, 2009 @ 8:58 am BST Oct 19,2009 | Reply

    • The history is also interesting: http://en.wikipedia.org/wiki/Wellbutrin#History .

      Notice that the drug was first approved by the FDA in December 1985, withdrawn in 1986, and re-introduced in 1989; but it wasn’t until 1997 – more than 11 years after the first release, and 8 years after the re-release – that it was approved as an aid to quit smoking and sold as Zyban.

      There’s probably a few years of testing in between – and it’s quite likely that most of that testing would be looking for harmful side-effects.

      Then – in 2006 – the drug was also approved as a treatment for SAD (seasonal affective disorder).

      It would be interesting to know how much of the time lag went into:

        a) initial observation of the correlation
        b) testing for causality
        c) testing for side effects
        d) research into why it works.

      (Obviously this isn’t intended to suggest non-overlapping time-intervals).

      Comment by Jonathan Lewis — October 17, 2009 @ 9:23 pm BST Oct 17,2009 | Reply

  4. Amusingly when attempting to find Tom Kyte’s excellent brief “In search of the truth. Or Correlation is not Causation” to link on this entry (my old link no longer works since the upgrade of AskTom), it turns out it forms part of an interesting debate between yourself, John, and Tom v Don & Mike Ault – of course you would know that!

    It makes for a good read on this notion in regard to the Oracle DB and bulk binds.
    http://www.oracle.com.cn/archiver/?tid-56543.html

    For those regularly reading information by people such as Ben Goldacre, Stephen J. Gould; and hence potentially Richard Dawkins, PZ Myers & Phil Plait – I’m sure you’re all familiar with the dangers of data mining (and quote mining) and have seen plenty of non-Oracle related examples.

    For anyone who reads a newspaper – not necessarily a tabloid, I think can see that statistics can be misrepresented to (attempt to) say whatever the author would like to portray.

    My 5 cents.

    Comment by Scott Wesley — October 18, 2009 @ 12:14 pm BST Oct 18,2009 | Reply

    • “For anyone who reads a newspaper – not necessarily a tabloid, I think can see that statistics can be misrepresented to (attempt to) say whatever the author would like to portray.”

      reminds me of this one:

      If you torture data sufficiently, it will confess to almost anything. – Fred Menger, chemistry professor (1937- )

      PS: I think that copy-paste from me was a bit of quote mining on my part, but never mind :)

      Comment by Naresh Bhandare — October 23, 2009 @ 9:57 am BST Oct 23,2009 | Reply

  5. Obligatory xkcd comic: http://xkcd.com/552/

    Comment by xkcd fan — October 22, 2009 @ 6:28 pm BST Oct 22,2009 | Reply

