
Friday, June 15, 2007

The importance of good data

Data is everything.

Without data, my models do not take off.

Whether you work with protein structures, experimental results, or God-knows-what, the quality of your work depends on the quality of the upstream data, and on how much trust you put in them.

Unfortunately, data are not so easy to get hold of. Good data are even more difficult to catch.

Very recently, at work, I discovered that a particular section of my company does not like other people sniffing around their databases - they're afraid that unskilled people may draw the wrong conclusions about their work. It is not simple science, we have been told, and please don't demean it as such.

A while back, during my PhD, I discovered that you should not blindly trust data you've been handed either: always check them and, if possible, double-check. Confirmation of this came over the past few months, when I spent a LOT of time cleaning up a database of experimental data from all the crap that had found its way in there over thirty-plus years.
I can tell you, it's grueling. But in my line of work, I am told, data preparation is by far the most time-intensive activity one can pursue. Analysis of the results certainly takes less time, and the actual model-building is a doddle.

Let's take a (real-life) example: experimental measurements of the pKa of a compound. How difficult can it be? Well, first of all, you must make sure that your data are actual experimental data - have they been measured? Not always: I was told that sometimes the compound wouldn't bloody dissolve, so they would enter the computed pKa in the database instead, or the pKa of a similar compound(!).
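To make that first check concrete, here is a minimal sketch of separating measured values from everything else. It assumes each record carries a "source" field, which is purely hypothetical: the field name, the labels, and the sample records below are all made up for illustration, and the whole problem described above is that a real database may not record provenance at all.

```python
from collections import defaultdict

# Hypothetical provenance label; a real database may use other labels, or none.
MEASURED = "measured"


def split_by_provenance(records):
    """Group pKa records by their 'source' field.

    Records with a missing or empty source are filed under 'unknown',
    so they get reviewed by hand instead of being silently trusted.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get("source") or "unknown"].append(rec)
    return dict(groups)


# Made-up records, for illustration only.
records = [
    {"compound": "cmpd-1", "pka": 4.2, "source": "measured"},
    {"compound": "cmpd-2", "pka": 9.1, "source": "computed"},
    {"compound": "cmpd-3", "pka": 7.8, "source": "analogue"},
    {"compound": "cmpd-4", "pka": 3.3},  # nobody wrote down where this came from
]

groups = split_by_provenance(records)
trusted = groups.get(MEASURED, [])
needs_review = groups.get("unknown", [])
```

Only the "measured" group would go straight into a validation set; the "unknown" group is where the grueling manual work starts.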

Then, if you can get around this, there is still a lot of room for errors, or at least weird uncertainties: in the case at hand, you get a handful of values for every molecule. How do you know which pKa corresponds to which atom? As it turns out, there's a way of detecting whether a given value belongs to a base or an acid: if you repeat the experiment in water/alcohol mixtures, the pKa values change, with acids' values getting higher and bases' values decreasing - or the other way around, I can't be bothered to fact-check right now. So, in theory, it is possible to say which is an acidic and which is a basic pKa. This is important, since a basic pKa tells you the pH at which your molecule becomes neutral (below that, it is positively charged), while an acidic pKa tells you when your molecule goes from neutral to negatively charged. Get them wrong or, worse, mixed up, and your compound's predicted properties (such as permeation of the gut wall and other membranes, but also retention on a chromatographic column) will go haywire.
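To see how much damage a swapped label does, here is a small sketch that estimates the net charge of a molecule from its pKa values using the Henderson-Hasselbalch relation. The function names are mine, and the code assumes every ionizable site behaves independently of the others, which real molecules do not always oblige.

```python
def fraction_ionized(pka, ph, kind):
    """Fraction of a single site that carries a charge at the given pH."""
    if kind == "acid":
        # HA <-> A- + H+ : the charged form is the deprotonated one.
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        # BH+ <-> B + H+ : the charged form is the protonated one.
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


def net_charge(sites, ph):
    """Sum the average charge over independent sites.

    sites is a list of (pka, kind) pairs; acids contribute negative
    charge, bases contribute positive charge.
    """
    total = 0.0
    for pka, kind in sites:
        f = fraction_ionized(pka, ph, kind)
        total += f if kind == "base" else -f
    return total


# The same measured value, labelled two different ways, at pH 7.4.
as_acid = net_charge([(4.5, "acid")], ph=7.4)
as_base = net_charge([(4.5, "base")], ph=7.4)
```

With the value labelled as an acid, the molecule comes out almost fully negatively charged at pH 7.4; labelled as a base, it comes out almost entirely neutral. Any property model fed the wrong label starts from the wrong species.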

But that's more or less it: now you know which one is basic and which one is acidic. But how do you assign each pKa down to its very own atom? If there are few of them, it's easy. I mean, if you have one acidic and one basic functionality in your molecule, the choice is trivial. If you have two acids, though, it's all a matter of chemical knowledge and intuition. You expect, from previous experience, some groups to ionize around certain pH values. However, the presence of charges and other substituents all around will greatly affect these numbers, and sometimes even their ordering may change. Big mess then - so how do you fix it? Well, some experimentalists use computer models to get a hunch, a suggestion of what may be going on. Which seems great, except when the reason you're looking at those data is exactly to validate those very same computer models. Then it sucks. Add to this the well-known fact that most of these software packages get it wrong quite often, and by a mile or two, and you're left with a bemused expression...

Welcome to my frustrating world.
