How Can the Databases Be Made Most Useful?
How can databases be made user friendly and flexible enough, yet sufficiently comprehensive, to handle increasingly complex information? Will one set of standards suffice? Is there a danger of important scientific information being lost in the process of putting the data into a standard format or formats?
Helen Parkinson: There can be common standards for process, but for the reporting of biological information, especially in a research setting, it's unrealistic to limit users to one standard. This will prove limiting for research in the long run.
Changing computer science technologies, with tools such as workflow engines, will allow users to access the databases in a more user-friendly way. These tools are nascent, but the data are now too complex for the average biologist to mine with stand-alone tools. The data are fundamentally complex, and therefore storage solutions will be, too. But the applications can be better in terms of fitting the profile of the user. The Microarray Gene Expression Markup Language (MAGE-ML) supports the storage of pretty much all important information, but the acquisition of this information is what's problematic. If it is never acquired at the source, then it is lost.
Neil Winegarden: This is a tricky issue, as there are so many different ways to use arrays, and indeed many different array platforms out there. You want to make a database that's easy to use (for a biologist at that), and yet is able to handle not only the wealth of information that's currently out there, but the new types of information that are to come in the future.
To this end, I think some of the current initiatives out there are on the right track. In particular, the Microarray Gene Expression Data (MGED) Society has been working with many different groups to lay out a set of standards for reporting data from a gene expression experiment. However, it was realized that certain applications -- such as toxicology, metabolomics, etc. -- required more data than the minimal set outlined by MGED.
As such, I think that we need extensible models that can be built onto by different groups as the need arises. A simple standard may not allow researchers to present pertinent data from their experiments, and thus key information is lost in the attempt to comply with some minimal standard. The challenge, however, is finding a way to store this in a database that does not require an overly complicated schema that nobody can interface with.
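To make the idea of an extensible model concrete, here is a minimal sketch of a small required core that individual groups can extend with their own fields. It is only an illustration: the field names, the toxicology extension, and the sample values are all hypothetical and are not taken from the MGED standards.

```python
# Hypothetical minimal core that every record must carry (not the real MGED set).
CORE_FIELDS = {"experiment_id", "organism", "array_platform"}

# Hypothetical extension that a toxicology group might register on top of the core.
TOXICOLOGY_EXT = {"compound", "dose_mg_per_kg"}


def validate(record, extension_fields=frozenset()):
    """Return (missing_core, unknown) field names for a record.

    Fields listed in extension_fields are accepted in addition to the core.
    """
    present = set(record)
    allowed = CORE_FIELDS | set(extension_fields)
    missing_core = sorted(CORE_FIELDS - present)
    unknown = sorted(present - allowed)
    return missing_core, unknown


# Illustrative record; every value here is made up.
record = {
    "experiment_id": "E-001",
    "organism": "Rattus norvegicus",
    "array_platform": "platform-X",
    "compound": "acetaminophen",
    "dose_mg_per_kg": 150,
}

core_only = validate(record)                     # extension fields are flagged
with_extension = validate(record, TOXICOLOGY_EXT)  # extension fields are accepted
```

Checked against the core alone, the toxicology fields are reported as unknown; once the extension is supplied, the same record passes without any change to the core definition.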
Catherine Ball: Providing a robust and flexible database system that is transparent to biomedical researchers and can respond to new applications of high-throughput technologies is a significant undertaking. Ultimately, the value of such databases must be measured by their utility to researchers, rather than their "buzzword compliance." It is essential that development and maintenance of bioinformatics infrastructures be accomplished via active and ongoing collaborations between bench biologists, bioinformaticists, and computer scientists.
The community-based standards for describing and exchanging microarray data created by MGED have been accepted by many researchers, bioinformaticists, journal editors, and reviewers. In addition, other communities dealing with high-throughput biomedical data have extended those standards for their own use. While the standards currently in place probably won't be able to accommodate all data from all experiments, constructing "additional standards" defeats the purpose of having data standards at all.
One valuable reason to have data-sharing infrastructure built by bioinformaticists who work closely with biologists is to give biologists the opportunity to explain their data-sharing needs to the bioinformaticists, who then have the ability to improve and extend the community-based standards. Creating and improving data-sharing standards that are reasonably adequate for most types of microarray experiments will, at least in the immediate future, require a great deal of collaborative work. It is clear that the development of these standards is merely a first step, as the benefits of their adoption have yet to be fully realized, because many tools that could take advantage of data annotations have yet to be developed.
Alvis Brazma: As the technology itself matures, so do the databases and other bioinformatics tools. We've seen this happen in the sequence world. Also, the experimental designs tend to become simpler and more standard. Every new technology has the potential to be more complex, but we have learned more with time and there are more tools at the disposal of software developers.
Of course, one condition is adequate investment in bioinformatics as a part of the technology itself -- some estimates suggest that roughly 20 to 25% of the cost of producing the data should go into the bioinformatics of managing those data. Automated information collection systems in the lab, electronic lab books, and bar coding will help to make the data capture process easier. With an entirely new technology there will always be some struggle at the beginning.
One standard will not be enough for the whole of life sciences, at least not in the foreseeable future. Most probably, one standard will be sufficient for 95% of microarray experiments. At the same time, standards for different technologies -- microarrays, proteomics, metabonomics -- already share common parts, and will do so increasingly.
It is a misconception that a standard should mean limiting the freedom of expression. For instance, consider Gene Ontology (GO) -- an emerging standard for describing gene functions. Using GO does not mean that one cannot describe all the details of one's favourite gene any way one likes; all it means is that in addition to that, one is also asked to use a controlled vocabulary of terms to make this description searchable and thus more useful. How does this limit the freedom of expression? Perhaps the critics will say that it "promotes" a simplified view of functional genomics? That some will not go beyond looking at the GO terms and the "true" complexity of the gene's function will be lost? To that, I'd reply with one of my favourite quotes from Karl Popper: "Science may be described as the art of systematic oversimplification."
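The point about controlled vocabularies can be illustrated with a small sketch: each annotation keeps its unrestricted free-text description and, alongside it, a set of controlled terms that make it searchable. The gene names and descriptions are hypothetical, and the GO identifiers are shown for illustration only and should be checked against the current ontology.

```python
from dataclasses import dataclass, field


@dataclass
class GeneAnnotation:
    gene: str                 # hypothetical gene name
    free_text: str            # author's own description, unrestricted
    go_terms: set = field(default_factory=set)  # controlled-vocabulary terms


def find_by_term(annotations, term):
    """Genes tagged with the given controlled-vocabulary term."""
    return [a.gene for a in annotations if term in a.go_terms]


def find_by_text(annotations, word):
    """Genes whose free-text description contains the given word."""
    return [a.gene for a in annotations if word.lower() in a.free_text.lower()]


annotations = [
    GeneAnnotation("geneA", "Seems to be involved in repairing damaged DNA.",
                   {"GO:0006281"}),   # DNA repair
    GeneAnnotation("geneB", "Fixes DNA lesions; details unclear.",
                   {"GO:0006281"}),   # DNA repair
    GeneAnnotation("geneC", "Moves sugar across the membrane.",
                   {"GO:0006810"}),   # transport
]

by_term = find_by_term(annotations, "GO:0006281")
by_text = find_by_text(annotations, "repair")
```

Here the term search returns both geneA and geneB, whereas the free-text search for "repair" finds only geneA, because geneB's author phrased the same function differently. The free-text descriptions remain intact in both cases.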
Posted on October 21, 2004 at 04:03 PM
