YFGdb Help

YFGdb Help Contents:

YFGdb searches: view and download yeast data sets
YFGdb study viewer
README file
Archetype gene file
File formats
Alternative sources of functional genomics data and other useful links

YFGdb searches: view and download yeast data sets

The YFGdb searches may be used to search and view the functional genomics data set collection. Data and annotation files are provided for download as either a tarball or a zip archive. Note that these files may be in a variety of formats, including MAGE-ML, GEO SOFT format, PCL or CDT files, or tab-delimited files. For more information on these file formats, see the file formats section below. Our longer-term goal is to incorporate all of these data into the YFGdb PostgreSQL database, so that all data can be exported either as files in common formats or as an SQL dump of the database.

The Quick Search may currently be used to query YFGdb by PubMed ID (e.g. 16963631) or by YFGdb study ID (e.g. 17314980id466). It is available at the top of every YFGdb page.

The Advanced Search may be used to query YFGdb based on study type, experimental technology, Gene Ontology (GO) biological process terms, file format and curation status in YFGdb. Please note that only the "study type" must be selected; the rest of the categories are optional and may be used to further refine your search.

Step 1 (required): Select one or more study types. You may sort the results by author, PubMed ID, file format or experimental technology. The default sort orders the results based on the first author's last name.

Step 2 (optional): To further narrow your results, you may also make the following selections:

Experimental technology: Select one. Please note that this is an "AND" search and the experimental technology selected should be consistent with the study type chosen in Step 1.

GO Biological Process Terms: Using the yeast Gene Ontology (GO) process slim terms (obtained from SGD), YFGdb curators manually associate each study with the GO processes it addresses, if appropriate. You may select one or more GO process terms to further refine your query. However, please note that each resulting study will match ALL of the selected GO terms.

File formats: To further narrow your results, select one or more file formats. Each resulting study will match ALL of the selected file formats in addition to any query selections made above.

Curation status: You may further restrict your query based on curation status in YFGdb. The current options are the following:

Curated: a study curated by YFGdb curators that includes relevant data sets, a README file, and an archetype file, as appropriate.

Not yet curated: a study in YFGdb that includes only collected data sets in various formats. The entry has not yet been curated by a YFGdb curator.

No data available: a study in YFGdb for which there are no data available at this time, according to the original authors contacted by YFGdb curators.
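
The matching rules described in Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the AND semantics, not YFGdb's actual implementation: the Study record, the advanced_search function and the example values are all hypothetical, and treating several selected study types as alternatives (a study matches if its type is any one of them) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    """Hypothetical record holding the searchable attributes of one study."""
    study_id: str
    study_type: str
    technology: str
    go_terms: set = field(default_factory=set)
    file_formats: set = field(default_factory=set)
    status: str = "Not yet curated"


def advanced_search(studies, study_types, technology=None,
                    go_terms=(), file_formats=(), status=None):
    """Return the studies that satisfy every selected criterion."""
    if not study_types:
        raise ValueError("Step 1 is required: select at least one study type")
    wanted_types = set(study_types)
    wanted_go = set(go_terms)
    wanted_formats = set(file_formats)
    results = []
    for study in studies:
        if study.study_type not in wanted_types:
            continue  # assumption: any one of the selected study types may match
        if technology is not None and study.technology != technology:
            continue  # the experimental technology is ANDed with the study type
        if not wanted_go <= study.go_terms:
            continue  # the study must carry ALL selected GO terms
        if not wanted_formats <= study.file_formats:
            continue  # the study must offer ALL selected file formats
        if status is not None and study.status != status:
            continue
        results.append(study)
    return results


# Made-up example data, for illustration only.
studies = [
    Study("study-A", "expression", "microarray",
          {"cell cycle", "DNA replication"}, {"PCL", "SOFT"}, "Curated"),
    Study("study-B", "expression", "microarray",
          {"cell cycle"}, {"PCL"}, "Not yet curated"),
]
hits = advanced_search(studies, ["expression"],
                       go_terms=["cell cycle", "DNA replication"])
print([s.study_id for s in hits])
```

Because the GO term and file format selections are subset tests, every additional selection can only shrink the result list, which is why broad queries are best started with the study type alone.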

Query Results are organized by study type and list the studies that match your search criteria. In the publication column, links to the relevant YFGdb entry, PubMed entry, SGD curated paper and web supplements are provided when available:

YFGdb icon: Click on the YFGdb icon to access the YFGdb entry for that study.

External link icon: Click on the arrow icon to access the author's or journal's web supplement for the paper.

PubMed icon: Click on the PubMed icon to access the relevant PubMed entry for the paper.

SGD paper icon: Click on the "SGD curated paper" icon to obtain more information on the relevant paper.

Downloading Data Sets: Click on the tar or zip links in the archive column of the query results to download all files associated with a particular study. If the study has not yet been curated in YFGdb, then only the data files in their original format (e.g. text, PDF, Excel, SOFT) will be available. If the study has been curated in YFGdb, then the downloadable archive will contain the data files associated with the study and a README file describing in detail all of the downloaded files. These archives can be unpacked with any standard compression/decompression software (for example, StuffIt Expander). For help, and to download free programs that can decompress and unpack these archives, see the gzip home page.

In addition to the README and data files, we also provide an archetype gene file for some data sets when appropriate. The archetype genes are meant to help indicate what constitutes a significant result for a particular study; for example, CLN2 is an archetype gene for the cell cycle data sets.
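
For users who prefer to unpack a downloaded archive from a script rather than with a desktop tool, the following is a minimal sketch using only the Python standard library. The function name extract_archive and the file names in the usage note are placeholders, not names used by YFGdb.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a tar (optionally compressed) or zip archive into dest_dir.

    Returns the sorted names of the regular files in the archive.
    """
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        # Mode "r:*" lets tarfile detect gzip/bzip2/xz compression itself.
        with tarfile.open(archive_path, "r:*") as archive:
            names = sorted(m.name for m in archive.getmembers() if m.isfile())
            if hasattr(tarfile, "data_filter"):
                # Newer Python versions can refuse unsafe member paths.
                archive.extractall(dest_dir, filter="data")
            else:
                archive.extractall(dest_dir)
        return names

    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(dest_dir)
            return sorted(n for n in archive.namelist() if not n.endswith("/"))

    raise ValueError(f"Not a tar or zip archive: {archive_path}")
```

A call such as extract_archive("study_archive.tar.gz", "study_files") would unpack the archive into the study_files directory and return the list of extracted files, which makes it easy to check that the README file is present.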

YFGdb study viewer

Each individual study associated with a paper has a study viewer page in YFGdb. Some papers have multiple studies associated with them, in which case a disambiguation page is provided. Clicking on any of the YFGdb study IDs on the disambiguation page will open up the relevant study viewer. The study viewer page contains the following information:

Publication:

The full citation of the paper associated with the data set is provided at the top of the study viewer page. Links to the relevant PubMed entry, SGD curated paper and web supplements are also provided when available. If the paper is associated with any entries in a public repository such as GEO and/or ArrayExpress, then the accession IDs are also provided and serve as direct links to the relevant entries in those repositories.

YFGdb study ID:

The YFGdb study ID is a unique accession ID that corresponds to a single study for a particular paper. Please note that a single paper may have more than one study associated with it, and multiple studies associated with a publication may or may not be of the same study type. The format of the YFGdb study ID is the PubMed ID (e.g. 17314980) followed by the study id (e.g. 17314980id466). YFGdb may be searched by study ID using the Quick Search.
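
As a small illustration, a study ID can be split back into its two parts with a regular expression. The sketch below assumes the pattern is digits, the literal text "id", then more digits; this is inferred from the single example above and is not a documented specification. The function name parse_study_id is hypothetical.

```python
import re

# Assumed pattern: <PubMed ID digits> + "id" + <study number digits>.
STUDY_ID_PATTERN = re.compile(r"^(?P<pubmed_id>\d+)id(?P<study_number>\d+)$")


def parse_study_id(study_id):
    """Split a YFGdb study ID into (PubMed ID, study number), both as strings."""
    match = STUDY_ID_PATTERN.match(study_id.strip())
    if match is None:
        raise ValueError(f"Not a recognized study ID: {study_id!r}")
    return match.group("pubmed_id"), match.group("study_number")


print(parse_study_id("17314980id466"))
```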

Status:

The current curation status in YFGdb is provided for each study:

Curated: a study curated by YFGdb curators that includes relevant data sets, a README file, and an archetype file, if appropriate.

Not yet curated: a study in YFGdb that includes only collected data sets in various formats. The entry has not yet been curated by a YFGdb curator.

No data available: a study in YFGdb for which there are no data available at this time, according to the original authors contacted by YFGdb curators.

Study Type:

The study type assigned by SGD or YFGdb curators is given for each study associated with a PubMed ID.

Study Description:

If the study has been curated by YFGdb, then an overview of the study design is also provided. Most of these study descriptions are written by curators as they curate the data set, whereas some are parsed from MAGE-ML files (i.e. provided by the authors).

GO Process(es) Addressed:

When appropriate, YFGdb curators manually indicate which Gene Ontology (GO) biological process(es) a particular study addresses, using the yeast GO process slim terms (obtained from SGD).

Experimental Technology:

The experimental technology indicates the broad technique used to generate the data set. A complete list of experimental technologies curated in YFGdb is available here.

Source:

The main source(s) of the data files in the YFGdb entry. For more details on the file formats, please see the relevant file format section below. Current data set sources include the following:

ArrayExpress: MAGE-ML or MAGE-TAB files obtained from the ArrayExpress repository.

Author: data sets obtained directly from the author.

BioGRID: files in PSI-MI format that include large-scale interaction studies.

GEO: SOFT format files obtained from the GEO repository.

SPELL: PCL format files obtained from the SPELL data set collection.

Not yet curated: the source of the files has not yet been curated by YFGdb.

SGD: PCL format files obtained from SGD's Expression Connection data set collection.

Web supplement: data set files in any format obtained from author or journal web supplements.

Contact:

The contact person for the data set with a link to their email address. Often this is the corresponding author on the original paper, although in some cases another author may serve as the contact for the data set. Authors may be contacted for more information.

Visualization and analysis tools:

Different visualization tools, such as Java TreeView, are available on study viewer pages, if applicable. A brief description of the tools and links to launch the application for viewing the results are also provided. More visualization and analysis tools will be added over time.

Download data files:

Click on the tar or zip links to download all files associated with a particular study. If the study has not yet been curated in YFGdb, then only the data files in their original format (e.g. text, PDF, Excel, SOFT) will be available. If the study has been curated in YFGdb, then the downloadable archive will contain the data files associated with the study and a README file written by YFGdb curators describing in detail all of the downloaded files. These archives can be unpacked with any standard compression/decompression software (for example, StuffIt Expander). For help, and to download free programs that can decompress and unpack these archives, see the gzip home page.

In addition to the README and data files, we also provide an archetype gene file for some curated data sets when appropriate. The archetype genes are meant to help indicate what constitutes significant expression for a particular study; for example, CLN2 is an archetype gene for the cell cycle data sets. The individual files associated with the study are also listed in tabular format with their file size and type noted. They may be downloaded individually.

README file:

A detailed README file is written by YFGdb curators for each curated study. The README file includes the full citation, PubMed ID, study description, a brief description of the raw and/or processed data files, web supplement links and author contact information. For more information on the different file types, please see the file formats section below.

Archetype gene file:

Archetype gene files are created for studies by YFGdb curators, if appropriate. Archetype genes are intended to indicate what constitutes a significant result for a particular experiment (e.g. the G1 cyclin CLN2 is an archetype gene for cell cycle data sets). Each archetype has a curated description indicating criteria used to identify significant genes in the data set and how the archetype genes meet these criteria. Archetype genes are meant to serve as benchmarks for biologists looking at their favorite genes in a data set and for computational biologists writing algorithms to find significance in an automated way.

The archetype gene files contain the following columns:

Type/group of genes: used to group archetype genes together in a set
ORF name
Type of expression/behavior: Amplified, Increased, Decreased, Deleted, Periodic, or Other
Description: description of the group of archetype genes as well as information about the individual gene

A sample archetype file is available here.
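
The four columns listed above can be read programmatically along the following lines. This is a sketch under stated assumptions: the file is taken to be tab-delimited, with the columns in the order listed and no header row, none of which is confirmed here, so check the sample file before relying on it. The names ArchetypeGene and read_archetype_genes and the sample row are made up for illustration.

```python
import csv
import io
from collections import namedtuple

ArchetypeGene = namedtuple("ArchetypeGene", "group orf behavior description")

# Expression/behavior values listed in the column description above.
BEHAVIORS = {"Amplified", "Increased", "Decreased", "Deleted", "Periodic", "Other"}


def read_archetype_genes(handle):
    """Parse archetype gene rows from an open text file.

    Assumes tab-delimited text with the four columns in the listed order
    and no header row; blank lines are skipped.
    """
    genes = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not any(cell.strip() for cell in row):
            continue
        if len(row) != 4:
            raise ValueError(f"Expected 4 columns, found {len(row)}: {row!r}")
        group, orf, behavior, description = (cell.strip() for cell in row)
        if behavior not in BEHAVIORS:
            raise ValueError(f"Unknown expression/behavior type: {behavior!r}")
        genes.append(ArchetypeGene(group, orf, behavior, description))
    return genes


# Made-up sample row; YPL256C is the systematic ORF name of CLN2.
sample = "G1 cyclins\tYPL256C\tPeriodic\tCLN2, a G1 cyclin\n"
genes = read_archetype_genes(io.StringIO(sample))
print(genes[0].orf, genes[0].behavior)
```

Validating the expression/behavior column against the six listed values catches misaligned columns early, which is the most likely failure if a real file turns out to have a different layout.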

File formats

Depending on the type of experiment and the particular data set, the files provided may be in a variety of formats. Currently, there are a few formats for microarray experiments, while data from other experiment types are most often available in basic tab-delimited formats.

Possible file formats:

CDT: Complete Data File/Clustered Data Table. Contains clustered data; the file format was developed by the Stanford Microarray Database (SMD). See SMD or PUMAdb for more detailed information.

CEL (Affymetrix): The Affymetrix CEL file contains raw microarray analysis results that must be interpreted by specific software programs (e.g. MAS5, dChip). The CEL file contains the raw intensities of all the in situ oligos (25-mers), which typically number ~200,000. Within the CEL file, there is no direct association between an individual probe intensity and the gene it might be reporting on. In the design of these probes, Affymetrix typically chose about 16 oligos, termed a "probeset", that collectively report on the transcript level. Software (MAS5, dChip, etc.) typically reads the CEL file along with a library file (which maps each probe to its probeset/transcript/gene) to make the aggregate call on the transcript level that the probes report on. The file resulting from this analysis (derived from the CEL file) typically has as many lines as there are genes in the organism (plus any other hypothetical transcripts).

CHP (Affymetrix): The Affymetrix chip (.CHP) file contains microarray analysis results produced from Affymetrix software. This file can be saved as .chp, .txt, or exported as an Excel (.xls) file. Over the years, this file format has been generated using at least four different versions of Affymetrix software and two different types of algorithms. Older software versions (MAS4 Microarray Suite, and GeneChip Analysis Suite) made use of an empirical expression algorithm, i.e. the calculations were not based on standard statistical methods. The file Affymetrix_Empirical.txt describes the column headers generated using the empirical algorithm. The more recent software versions (MAS5 Microarray Suite, or GCOS GeneChip Operating Software) make use of statistical expression algorithms. The file Affymetrix_Statistical.txt describes the column headers generated using the statistical algorithm. For more information, see a summary of the method analysis, or refer to Affymetrix's statistical algorithms technote. Please note that there are three Affymetrix yeast arrays: Ye6100, YG-S98, and Yeast 2.0. Information on these arrays is available from the YG-S98 and Ye6100 datasheet and the Yeast 2.0 datasheet.

GFF: The "Gene Finding Format" or "General Feature Format" (GFF) is a standardized file format for describing genes and other features associated with DNA, RNA and protein sequences. Please refer to the GFF specification document for more information.

GPR (GenePix): GenePix results files (.gpr) are the output from a Molecular Devices "GenePix" microarray scanner, which takes measurements from a variety of microarrays spotted on glass. The file GenePix_GPR.txt (html version) summarizes the information found in the header of a GPR file and describes the column headings in the current software version. The file gpr_history.txt (html version) describes the changes that have been made to the GPR format since its initial public release (GenePix Results format version 1.4). The following sample files containing the headers, column headings, and several rows of data are also available for reference: GPR v. 1.4 (3_0_6_x_truncated.gpr), GPR v. 2.0 (4_0_1_x_truncated.gpr), GPR v. 3.0 (4_1_1_x_truncated.gpr), GPR v. 3.0 (5_0_1_26_truncated.gpr). Note that they are numbered according to the GenePix Pro software version, not the GPR format version. As of GenePix Pro 5.0, Molecular Devices adopted a flexible file format in which the positions and contents of the GPR file columns are not specified; rather, they are read and identified when the file is opened. Accordingly, Molecular Devices froze the GPR file format at v. 3.0.

MAGE-ML: MicroArray and Gene Expression (MAGE) XML format for microarray data exchange. For details on MAGE-ML, see the MAGE page. Note that most of the MAGE-ML files on this site were provided by ArrayExpress.

MAGE-TAB: MicroArray and Gene Expression (MAGE) tab-delimited text format for microarray data. For details on MAGE-TAB, see the MAGE page. Note that most of the MAGE-TAB files on this site were provided by ArrayExpress.

PCL: Pre-Clustered File. The file format was developed by the Stanford Microarray Database (SMD). The file pcl_format.txt describes the PCL file format and column headers. For more information, see SMD or PUMAdb.

PSI-MI: The Proteomics Standards Initiative Molecular Interaction (PSI-MI) XML format is a standardized data exchange format for protein-protein interactions. See the Proteomics Standards Initiative site for more information on the PSI-MI format.

SOFT: Usually both soft (.soft) and annotation (.annot) files are available. See GEO for more information.

TXT: In some cases, particularly in older data sets, data are not available in a standard format, but instead are in author-defined tab-delimited files. In addition, in some MAGE-ML sets, the actual data matrices are provided in separate *.txt files. BioDiscovery, Inc. has also written microarray analysis software called ImaGene. Additional information on the column headings from ImaGene .txt files can be found in this summary, the user's manual, or this sample .txt file.
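
As an illustration of working with one of the formats above, the sketch below reads PCL-style text in Python. It assumes the commonly used PCL layout: a header row whose first three columns are the gene identifier, NAME and GWEIGHT, followed by one column per experiment; an optional EWEIGHT row; then one row per gene, with empty cells for missing values. The file pcl_format.txt remains the authoritative description, and the function name read_pcl and the sample values here are made up.

```python
import csv
import io


def read_pcl(handle):
    """Read PCL-style text into (experiment_names, {gene_id: [values]}).

    Empty cells are treated as missing values and returned as None.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    experiments = header[3:]  # skip the identifier, NAME and GWEIGHT columns
    values_by_gene = {}
    for row in reader:
        if not row or row[0] == "EWEIGHT":
            continue  # skip blank lines and the experiment-weight row
        cells = row[3:3 + len(experiments)]
        # Pad short rows so every gene has one entry per experiment.
        cells += [""] * (len(experiments) - len(cells))
        values_by_gene[row[0]] = [float(c) if c.strip() else None for c in cells]
    return experiments, values_by_gene


# Made-up example values, for illustration only.
sample = (
    "YORF\tNAME\tGWEIGHT\ttime_0\ttime_10\n"
    "EWEIGHT\t\t\t1\t1\n"
    "YPL256C\tCLN2\t1\t0.5\t-1.25\n"
    "YAL001C\tTFC3\t1\t\t0.1\n"
)
experiments, values = read_pcl(io.StringIO(sample))
print(experiments)
print(values["YAL001C"])
```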

Alternative sources of functional genomics data and other useful links

There are several other sources for functional genomics data available for both yeast and other species:

Repositories of miscellaneous functional genomics data:

SGD: provides a tool for searching microarray expression data sets, called Expression Connection. In addition, the SGD Genome-wide Analysis page lists all yeast publications that describe some large-scale analysis, categorized by the type of experiment. This page also includes links to web supplements and data files. There is also a variety of data sets available at the SGD ftp site.
MIPS/CYGD: provides various types of functional genomics data, including interaction and phenotype data.

Major microarray repositories:

GEO: Gene Expression Omnibus at the NCBI, provides data in a tab-delimited format.
ArrayExpress: part of the EBI, provides data in MAGE-ML format.
SMD: Stanford Microarray Database, provides data published at Stanford University.
YMGV: Yeast Microarray Global Viewer, provided by the Jacq group in Paris, France.

Sources of interaction data:

BioGRID: provided by Mike Tyers' lab at the Samuel Lunenfeld Research Institute, Toronto. Contains the most extensive set of large-scale data sets as well as individual interactions manually curated from the literature.
DIP
MINT
IntAct
OPD: collection of mass spectrometry and other proteomics data.