How Do We Fund the Databases?

What should the funding models be for developing and running data-sharing infrastructures?

Alvis Brazma: This is a long story, and I'd refer you to a recent Nature Biotechnology paper by Ball et al., "Funding High-Throughput Data Sharing," for more.

But in brief, first, the bioinformatics databases that serve large sections of the community should be funded by public money to make sure that all data can be fully integrated with all other relevant existing and future resources in an uninhibited way. Second, it should be realized that not only database development should be funded, but also maintenance, further development, and integration with new resources -- otherwise there is not much point in funding the development.

Third, the funding should be stable enough to encourage long-term thinking from the resource providers. One way to achieve this is by rolling grants that have to be renewed periodically, but cannot be terminated without sufficiently long prior notice (e.g., 2 years). Fourth, and finally, not only the centralized resources should be funded -- data deposition also costs money. Realistically costed data-sharing plans should be included (and funded!) in all grants generating large amounts of data.

Helen Parkinson: Realistically, federated systems will be needed to support the amounts of data being generated. At present, the funding models work for research, but not for infrastructure, and are not geared to cooperation between institutions to allow federation of data. Competition and peer review are still required, but separate funding calls for these projects are needed, with a facility to help competing proposals work together on key infrastructure.

Catherine Ball: Currently, there are no well-established funding models for creating and maintaining data sharing infrastructures. There are several different types of needs for microarray data sharing infrastructure. One, of course, is public data repositories that provide long-term access to public data, such as (for example) NCBI’s Gene Expression Omnibus (GEO) and EBI’s ArrayExpress. Another is resources and reagents for describing, sharing and communicating microarray data, such as MGED or the External RNA Controls Consortium (ERCC). A third type of need is research databases that provide investigators with the means to store, annotate, and interpret their data as part of ongoing investigations (examples are the Stanford Microarray Database, the RNA Abundance Database, and the BioArray Software Environment). And a fourth lies in tools and software for processing and analyzing microarray data, such as those provided by TIGR.

Since the need for data sharing outlasts the research projects that generate the data, funding the data infrastructure by small contributions from individual studies is a short-term, risky solution. Instead, key data-sharing infrastructures need stable and reliable funding, and they should be required to be open and accessible in a flexible manner (for example, using Web services). It is important to note that some of these projects will be relied on by many research projects and their loss of funding would immediately jeopardize the success of many. For such projects, perhaps a rolling period of funding would be wise so that alternative funding or alternative infrastructure could be identified before key resources are lost.

Neil Winegarden: The best funding model for data sharing? That's a good question. Right now, in terms of microarray data repositories, I believe there are three main players: ArrayExpress, GEO, and the Center for Information Biology Gene Expression (CIBEX) database, from the DNA Data Bank of Japan. Each of these can be seen as complementary to or competitive with the others, and each is receiving funding from different sources.

I think it is important at this stage to have a few different repositories, each taking a different approach to making a database, so that we can see what works and what does not. Thus, I think we need to fund multiple efforts. In many ways, I think the current databases involved in sharing sequence information are a good example -- publicly funded, in many cases, allowing full reach-through from the research community. Proprietary databases are fine within large pharmaceutical companies, where everything they do is privately funded. But the main repositories of such information should remain public.

Posted on October 21, 2004 at 04:01 PM

Comments

Basically, if the goal is classification of cell types, the data generated by current microarray technology help (as they did in the early reports of P. Brown and Todd Golub), because this sort of classification depends on the “hybridization data patterns”. The hybridization patterns are usually distinctive for each cell line (cell cycle phase). For seeking information about individual gene expression and its regulation, microarrays have not been working. It is probably not an exaggeration to say that some day people will find that all the data generated by current microarray technology are almost useless, because the validity of the data is not ensured. It is time to rethink and make the effort to understand the "micro" of DNA microarrays, so as to make sure that each individual data point is correctly generated and properly processed. Many data-processing steps, such as the raw data processing in GeneChip for generating .chp data and the data normalization of various versions of microarrays, actually violate basic principles of chemical thermodynamics. It is obvious that after such processing the data are already distorted and no longer reflect the original truth. It is even hard to find the biological truth. Does anybody agree with me?

Posted by: Mei Xu | Nov 4, 2004 9:28:44 AM
