A Microarray Discussion
In its 22 October 2004 issue, Science presents a special collection of articles on gene expression, including reviews on the current state of the art in microarrays -- as well as a news feature highlighting some of the difficulties that still exist in obtaining reproducible results with arrays.
To discuss these issues further, we put four questions to a group of four experts in the applications and challenges of array data -- Catherine Ball, the director of the Stanford Microarray Database; Alvis Brazma, head of Microarray Informatics at the European Bioinformatics Institute; Helen Parkinson, curation coordinator (microarrays), European Molecular Biology Laboratory/EBI; and Neil Winegarden, head of operations, Microarray Centre of the University Health Network/University of Toronto.
We hope that users interested in contributing to the discussion will feel free to post comments, using the "comments" link at the end of the discussion of each question. (Comments need not include an e-mail address.)
Posted on October 21, 2004 at 04:09 PM | Permalink | Comments (1)
Has the Promise of Microarrays Been Oversold?
The promise of microarrays, both as a tool for addressing fundamental biological questions and in drug discovery, has been touted for some time. In your view, has that promise been oversold?
Helen Parkinson: For addressing biological questions, no. In drug discovery, yes -- though this is not my area of expertise. The regulatory bodies move too slowly at present to make this viable in the short term.
Catherine Ball: There are many examples in the scientific literature of laboratories that have successfully used microarrays to address fundamental biological or medical questions and to identify promising targets for drug development, as has been promised.
Unfortunately, it is obvious to reviewers of submitted manuscripts that many researchers have used microarrays to perform experiments that provide no biological insight whatsoever. That's not due to any failure on the part of the technology, but rather to a failure to design experiments or appreciate the limitations of microarray technology. The use of microarrays will not turn a poorly conceived or poorly executed experiment into a groundbreaking scientific achievement, any more than buying a sports car will turn one into a NASCAR driver. Such approaches, while glamorous, are likely to be a waste of effort and money.
It's not surprising that using microarrays does not negate the need for a wise experimental design, careful technique, and appropriate data transformation and analysis. Indeed, the expense and the challenges of analyzing and interpreting such large data sets make it even more critical to carefully plan studies using microarrays. The ability to simultaneously assay tens of thousands of genes is a promise that microarray technology has delivered in study after study. High-throughput technologies have allowed us to improve upon the "one postdoc, one gene" approach to deciphering biological systems.
But with any new technology that holds great promise, there will always be people jumping on the bandwagon, under the false impression that they must use the technique to remain competitive in their research -- irrespective of whether it is appropriate.
Alvis Brazma: Have microarrays been oversold? Not at all. Or, if microarrays have been oversold, then only in the same sense as the Human Genome Project -- metaphors that can be overinterpreted have sometimes been used. I do not think that scientists ever expected more than microarrays are delivering. For example, I have before me the 17 September 2004 issue of Science, and there it is: The research article begins with mining public microarray data and leads to a discovery of a new gene function.
There is no other technology that has influenced life sciences research so profoundly in recent years. Knowing, for all the genes in an organism, whether their transcripts are present or not (albeit statistically) is similar to having the genome sequenced -- we know not only what genes are there, but also what genes are not there. We treat the cells with a compound and we know not only that the expression of a certain gene A is changing, but also which other genes are changing and which are not.
It takes, on average, 10 to 12 years to develop a new drug. Microarrays have matured as an application only during the last 5 to 7 years, or even less.
Neil Winegarden: At the risk of sounding noncommittal on the question of whether microarrays have been oversold, I would have to say yes and no. I think that microarrays still hold tremendous promise for the drug discovery industry. I think the ability to analyze the levels of mRNA (or protein) in a cell in a highly parallel manner can and will have huge impacts on drug discovery.
However, I also think that the microarray field, despite its apparent maturity, has not fully worked out all of the issues that are involved. The technology is still too variable; there are still too many issues with analysis methodologies; the technology is still too labor intensive. I believe that there are several groups working on eliminating or reducing these challenges, but there is still a large amount of work to be done before the promise of microarrays can fully be achieved.
Of course, this does raise the question of whether a competing technology will slip in under the radar while all of these issues are being worked out. I think, however, that microarrays have wide enough acceptance that there is enough momentum behind the technology to make it work. So yes, the technology is capable of providing us with great insights, but to date this has not happened. To my knowledge there are only a very small number of drugs in the various pipelines of all the pharmaceutical companies that were developed or made possible by microarray technology. Thus, the promise has not yet been realized.
What Must Happen for Arrays to Fulfill their Potential?
What do you view as the single most important thing -- technical or otherwise -- that needs to be done now for arrays to fulfill their potential over the next decade?
Neil Winegarden: Ah, but to pick only one. I think if we look for an overall aspect, I would have to pick data quality. This can have to do with making arrays more reproducible, data analysis more robust, or improving the standards with which data is presented. I think we need to improve the data coming out of microarray experiments to make them truly useful in a drug discovery environment.
I am sure some people will disagree with me -- but I still believe that the data from all array platforms is too variable, and we need to either improve the technology itself or develop better algorithms to deal with this inherent variability. I'd say both need to be looked at.
Helen Parkinson: In my view, too, the most important thing is the emergence of quality metrics, and platforms becoming robust for comparison. For this to happen, there needs to be disclosure of oligo sequences, for example. There are several projects under way, but these don't filter down to the biologists.
Catherine Ball: Currently, data from different microarray platforms, or from different laboratories, cannot easily be reconciled with each other. This makes it difficult to review, interpret and re-use data from many microarray experiments. In order for the results of microarray experiments to be more widely utilized, we need robust methods to share data. The need extends far beyond providing robust data repositories, but also includes adopting standards to describe how experiments were performed and the data transformed, and establishing common controls (such as those being developed by the External RNA Control Consortium). The true value of microarray data is cumulative -- being able to combine data from different studies will undoubtedly lead to novel insights.
Alvis Brazma: I'd say we need to find out what exactly we are measuring. At the moment, the microarray measurements are expressed as fluorescence intensities, or even as ratios of intensities. How does this relate to the mRNA abundance or to any biological variable that we are trying to measure? More often than not, we really do not know.
Without overestimating the importance of quantitative measurements in biology, one has to realize that if one wants to compare more than two measurements, one needs quantitative values and error bars. Just "increase" and "decrease" doesn't help. Having measurement units and error models will allow us to build gene expression data atlases of what genes are expressed where and under what conditions, similarly to genomes being sequenced now.
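As an editorial illustration of the point about quantitative values and error bars (not part of the panelists' remarks), the following is a minimal Python sketch. It assumes a two-channel array with paired sample and reference intensities for replicate spots of one gene. The function name `log2_ratio_summary` and the intensity values are hypothetical, and the standard error of the mean is only one simple choice of error model.

```python
import math
import statistics


def log2_ratio_summary(sample, reference):
    """Return (mean log2 ratio, standard error of the mean) over replicates.

    `sample` and `reference` are paired intensity readings for the same
    gene, one pair per replicate spot. All values must be positive.
    """
    if len(sample) != len(reference) or len(sample) < 2:
        raise ValueError("need at least two paired replicate intensities")
    log_ratios = [math.log2(s / r) for s, r in zip(sample, reference)]
    mean = statistics.mean(log_ratios)
    # Sample standard deviation divided by sqrt(n): a simple error bar.
    sem = statistics.stdev(log_ratios) / math.sqrt(len(log_ratios))
    return mean, sem


# Hypothetical intensities for three replicate spots of one gene.
mean_ratio, error_bar = log2_ratio_summary(
    [2000.0, 4000.0, 8000.0],  # treated sample channel
    [1000.0, 1000.0, 1000.0],  # reference channel
)
print(f"log2 ratio = {mean_ratio:.2f} +/- {error_bar:.2f}")
```

Reporting a value with its error bar, rather than only "increase" or "decrease", is what lets measurements from more than two conditions be placed on a common scale and compared.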
How Can the Databases Be Made Most Useful?
How can databases be made user friendly and flexible enough, yet sufficiently comprehensive, to handle increasingly complex information? Will one set of standards suffice? Is there a danger of important scientific information being lost in the process of putting the data into a standard format or formats?
Helen Parkinson: There can be common standards for process, but for the reporting of biological information, especially in a research setting, it's unrealistic to limit users to one standard. This will prove limiting for research in the long run.
The changing computer science technologies, with tools such as workflow engines, will allow users to access the databases in a more user-friendly way. These are nascent, but the data are now too complex for the average biologist to mine with stand-alone tools. The data are fundamentally complex, and therefore storage solutions will be, too. But the applications can be better in terms of fitting the profile of the user. The Microarray Gene Expression Markup Language (MAGE-ML) supports the storage of pretty much all important information, but the acquisition of this info is what's problematic. If it is never acquired at the source, then it is lost.
Neil Winegarden: This is a tricky issue, as there are so many different ways to use arrays, and indeed many different array platforms out there. You want to make a database that's easy to use (for a biologist at that), and yet is able to handle not only the wealth of information that's currently out there, but the new types of information that are to come in the future.
To this end, I think some of the current initiatives out there are on the right track. In particular, the Microarray Gene Expression Data (MGED) Society has been working with many different groups to lay out a set of standards for reporting data from a gene expression experiment. However, it was realized that certain applications -- such as toxicology, metabolomics, etc. -- required more data than the minimal set outlined by MGED.
As such, I think that we need extensible models that can be built onto by different groups as the need arises. A simple standard may not allow researchers to present pertinent data from their experiments, and thus key information is lost in the attempt to comply with some minimal standard. The challenge, however, is finding a way to store this in a database that does not require an overly complicated schema that nobody can interface with.
Catherine Ball: Providing a robust and flexible database system that is transparent to biomedical researchers and can respond to new applications of high-throughput technologies is a significant undertaking. Ultimately, the value of such databases must be measured by their utility to researchers, rather than their "buzzword compliance." It is essential that development and maintenance of bioinformatics infrastructures be accomplished via active and ongoing collaborations between bench biologists, bioinformaticists, and computer scientists.
The community-based standards for describing and exchanging microarray data created by MGED have been accepted by many researchers, bioinformaticists, journal editors, and reviewers. In addition, other communities dealing with high-throughput biomedical data have extended those standards for their own use. While the standards currently in place probably won't be able to accommodate all data from all experiments, constructing "additional standards" defeats the purpose of having data standards at all.
One valuable reason to have data-sharing infrastructure built by bioinformaticists who work closely with biologists is to give biologists the opportunity to explain their data-sharing needs to the bioinformaticists, who then have the ability to improve and extend the community-based standards. Creating and improving data-sharing standards that are reasonably adequate for most types of microarray experiments will, at least in the immediate future, require a great deal of collaborative work. It is clear that the development of these standards is merely a first step, as the benefits of their adoption have yet to be fully realized, because many tools that could take advantage of data annotations have yet to be developed.
Alvis Brazma: As the technology itself matures, so do the databases and other bioinformatics tools. We've seen this happening in the sequence world. Also, the experimental designs tend to become simpler and more standard. Every new technology has the potential to be more complex, but we have learned more with time and there are more tools at the disposal of software developers.
Of course, one condition is adequate investment in bioinformatics as a part of the technology itself -- some estimates suggest that roughly 20 to 25% of the costs of producing the data should go into the bioinformatics of managing the generated data. Automated information collection systems in the lab, electronic lab books, and bar coding will help to make the data capture process easier. With an entirely new technology there will always be some struggle at the beginning.
One standard will not be enough for the whole of life sciences, at least not in the foreseeable future. Most probably, one standard will be sufficient for 95% of microarray experiments. At the same time, standards for different technologies -- microarrays, proteomics, metabonomics -- already share common parts, and will do so increasingly.
It is a misconception that a standard should mean limiting the freedom of expression. For instance, consider Gene Ontology (GO) -- an emerging standard for describing gene functions. Using GO does not mean that one cannot describe all the details of one's favourite gene any way one likes; all it means is that in addition to that, one is also asked to use a controlled vocabulary of terms to make this description searchable and thus more useful. How does this limit the freedom of expression? Perhaps the critics will say that it "promotes" a simplified view of functional genomics? That some will not go beyond looking at the GO terms and the "true" complexity of the gene's function will be lost? To that, I'd reply with one of my favourite quotes from Karl Popper: "Science may be described as the art of systematic oversimplification."
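As an editorial illustration of how a controlled vocabulary can sit alongside free text (not part of the panelists' remarks), here is a minimal Python sketch. The `GeneAnnotation` class, the `find_by_term` function, and the gene names are hypothetical. The GO identifiers are used only as example labels and are assumed here to denote apoptosis and DNA repair; this is not a real GO client.

```python
from dataclasses import dataclass, field


@dataclass
class GeneAnnotation:
    """A gene described both in free text and with controlled terms."""
    name: str
    free_text: str  # the author may write anything here
    go_terms: set = field(default_factory=set)  # controlled vocabulary IDs


def find_by_term(annotations, term_id):
    """Return the sorted names of genes annotated with the given term."""
    return sorted(a.name for a in annotations if term_id in a.go_terms)


# Hypothetical genes; the wording of the free text differs for each one.
annotations = [
    GeneAnnotation("geneA", "Triggers programmed cell death after DNA damage.",
                   {"GO:0006915", "GO:0006281"}),
    GeneAnnotation("geneB", "Involved in apoptosis, in some tissues only.",
                   {"GO:0006915"}),
    GeneAnnotation("geneC", "Fixes double-strand breaks.",
                   {"GO:0006281"}),
]

# One query finds both genes, although their free text shares no keywords.
apoptosis_genes = find_by_term(annotations, "GO:0006915")
print(apoptosis_genes)
```

The free-text field stays unrestricted; the controlled terms are an additional index that makes the descriptions searchable.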
How Do We Fund the Databases?
What should the funding models be for developing and running data-sharing infrastructures?
Alvis Brazma: This is a long story, and I'd refer you to a recent Nature Biotechnology paper by Ball et al., "Funding High-Throughput Data Sharing," for more.
But in brief, first, the bioinformatics databases that serve large sections of the community should be funded by public money to make sure that all data can be fully integrated with all other relevant existing and future resources in an uninhibited way. Second, it should be realized that not only the development of databases, but also their maintenance, further development, and integration with new resources should be funded -- otherwise there is not much point in funding the development.
Third, the funding should be stable enough to encourage long-term thinking from the resource providers. One way to achieve this is by rolling grants that have to be renewed periodically, but cannot be terminated without a long enough prior notice (e.g., 2 years). Fourth, and finally, not only the centralized resources should be funded -- the data deposition also costs money. Realistically costed data-sharing plans should be included (and funded!) in all grants generating large amounts of data.
Helen Parkinson: Realistically, federated systems will be needed to support the amounts of data being generated. At present, the current funding models work for research, but not for infrastructure, and are not geared to cooperation between institutions to allow federation of data. Competition and peer review are still required, but separate funding calls for these projects are needed, with a facility to help competing proposals work together on key infrastructure.
Catherine Ball: Currently, there are no well-established funding models for creating and maintaining data sharing infrastructures. There are several different types of needs for microarray data sharing infrastructure. One, of course, is public data repositories that provide long-term access to public data, such as (for example) NCBI’s Gene Expression Omnibus (GEO) and EBI’s ArrayExpress. Another is resources and reagents for describing, sharing and communicating microarray data, such as MGED or the External RNA Controls Consortium (ERCC). A third type of need is research databases that provide investigators with the means to store, annotate, and interpret their data as part of ongoing investigations (examples are the Stanford Microarray Database, the RNA Abundance Database, and the BioArray Software Environment). And a fourth lies in tools and software for processing and analyzing microarray data, such as those provided by TIGR.
Since the need for data sharing outlasts the research projects that generate the data, funding the data infrastructure by small contributions from individual studies is a short-term, risky solution. Instead, key data-sharing infrastructures need stable and reliable funding, and they should be required to be open and accessible in a flexible manner (for example, using Web services). It is important to note that some of these projects will be relied on by many research projects and their loss of funding would immediately jeopardize the success of many. For such projects, perhaps a rolling period of funding would be wise so that alternative funding or alternative infrastructure could be identified before key resources are lost.
Neil Winegarden: The best funding model for data sharing? That's a good question. Right now, in terms of microarray data repositories, I believe there are three main players: ArrayExpress, GEO, and the Center for Information Biology Gene Expression (CIBEX) database, from the DNA Data Bank of Japan. Each of these can be seen as complementary or competitive with one another, and each is receiving funding from different sources.
I think it is important at this stage to have a few different repositories, each taking a different approach to making a database, so that we can see what works and what does not. Thus, I think we need to fund multiple efforts. In many ways, I think the current databases involved in sharing sequence information are a good example -- publicly funded, in many cases, allowing full reach through from the research community. Proprietary databases are fine for within large pharmaceutical companies where everything they do is privately funded. But the main repositories of such information should remain public.
