Princeton Protein Orthology Database (P-POD): Archival information for Heinicke et al.

About P-POD

Please cite the following paper if referring to P-POD (PubMed ID: 17712414):

Heinicke S., Livstone M.S., Lu C., Oughtred R., Kang F., Angiuoli S.V., White O., Botstein D., Dolinski K. (2007) The Princeton Protein Orthology Database (P-POD): A Comparative Genomics Analysis Tool for Biologists. PLoS ONE 2:e766.

The first version of the Princeton Protein Orthology Database (P-POD), as described in Heinicke et al., contains families of related proteins from S. cerevisiae, H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, and P. falciparum, with an emphasis on providing information about disease-related genes. The current P-POD contains updated protein data sets for each of the above organisms (Heinicke et al. 2007: updated) and also a Reference Genome Annotation data set with different organisms. Disease-related information is collected from the Online Mendelian Inheritance in Man (OMIM) database, the Saccharomyces Genome Database (SGD), and manual literature curation.

All of the original data from this publication are archived in the database and are freely and publicly available through the web and by downloading the entire database system (data via FTP (see version 1); for software information). The analysis pipeline in the publication used four tools: OrthoMCL v1.2 to generate ortholog families, Jaccard clustering to generate "super families" (large families of related sequences), ClustalW v1.83 to generate sequence alignments, and PHYLIP v3.65 to determine the phylogenetic relationships among the family members. The Generic Model Organism Database (GMOD) schema is used for the backend database.
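
To give a feel for the "Jaccard clustering" step named above, here is a minimal sketch of the Jaccard coefficient and a simple single-linkage grouping built on it. It is not the pipeline's actual implementation: the neighbor sets, the protein names, the 0.6 cutoff, and the function names jaccard and link_by_jaccard are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A & B| / |A | B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_by_jaccard(neighbors: dict, cutoff: float) -> list:
    """Single-linkage grouping: join keys whose neighbor sets score >= cutoff."""
    parent = {key: key for key in neighbors}

    def find(x):
        # Follow parent pointers to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in combinations(neighbors, 2):
        if jaccard(neighbors[x], neighbors[y]) >= cutoff:
            parent[find(x)] = find(y)

    groups = {}
    for key in neighbors:
        groups.setdefault(find(key), set()).add(key)
    return sorted(groups.values(), key=lambda group: sorted(group))


# Hypothetical similarity neighborhoods (each protein and its significant hits).
hits = {
    "protA": {"protA", "protB", "protC"},
    "protB": {"protA", "protB", "protC", "protD"},
    "protC": {"protA", "protB", "protC"},
    "protX": {"protX", "protY"},
    "protY": {"protX", "protY"},
}
clusters = link_by_jaccard(hits, cutoff=0.6)
```

With these made-up inputs, protA, protB, and protC end up in one group and protX and protY in another, because sequences that share most of their neighbors score close to 1.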

We gratefully acknowledge Mike Cherry (SGD), Shuai Weng (SGD), Eurie Hong (SGD), Sam Angiuoli (TIGR), Don Gilbert (Indiana University), Chris Stoeckert (UPenn), Feng Chen (UPenn), Scott Cain (CSHL), Laurie Kramer (Princeton) and John Matese (Princeton) for valuable discussions in creating P-POD.

Valid search entries for each organism included in the original P-POD analysis are listed in the table below:

2) From the pull-down menu, select either "OrthoMCL" to view a family that contains only putative orthologs for the gene/protein of interest, or "Jaccard clustering" to view a larger super family of related sequences.

Data sources

Protein data sets

The sources of each protein data set and the numbers of sequences analyzed are listed in the table below. All files were downloaded November 14, 2005. Note that some files may have been replaced with a more recent version at the source database. Clicking on a file name will link you to the directory containing the file.

Organism | Number of proteins | Database | File name
S. cerevisiae | 6704 | SGD | orf_trans_all.fasta.gz
H. sapiens | 33869 | ENSEMBL | Homo_sapiens.NCBI35.nov.pep.fa.gz
M. musculus | 36471 | ENSEMBL | Mus_musculus.NCBIM34.nov.pep.fa
D. rerio | 32143 | ENSEMBL | Danio_rerio.ZFISH5.nov.pep.fa
D. melanogaster | 19178 | FlyBase | dmel-all-translation-r4.2.1.fa
C. elegans | 22858 | WormBase | wormpep150.fa
A. thaliana | 30690 | TAIR | TAIR6_pep_20051108.fa
P. falciparum | 5363 | PlasmoDB | Pfa3D7_WholeGenome_Annotated_PEP_2005.2.11.fa
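
A sequence count like those listed above can be checked against a downloaded FASTA file by counting header lines. The sketch below assumes standard FASTA formatting (one ">" header line per sequence) and handles both plain and gzip-compressed files; the function name count_fasta_records and the demo file are hypothetical, and the demo data are made up.

```python
import gzip
import tempfile
from pathlib import Path


def count_fasta_records(path) -> int:
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    path = Path(path)
    # Read *.gz files through gzip; read everything else as plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on a tiny made-up file with two records, not on real P-POD data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fa.gz"
    with gzip.open(demo, "wt") as out:
        out.write(">prot1 hypothetical\nMSTK\nLLVA\n>prot2 hypothetical\nMAAA\n")
    demo_count = count_fasta_records(demo)
```

Because some source files may have been replaced since the original download, a count from a current file will not necessarily match the archived number.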

OMIM diseases

The OMIM diseases and their associated ENSEMBL peptide IDs were downloaded on April 24, 2006 from two sources:

These files were parsed and combined into one file (diseasegenesBiomart_Mim.txt) and used to load the GMOD database.

Literature

Papers flagged as "Disease-related" or "Cross-species expression" were downloaded from SGD on January 13, 2006. The Disease-related and Cross-species expression paper lists are each available to view or download. Both files are in the format: ORF[tab]PMID
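
As an illustration of the ORF[tab]PMID format, the following sketch groups PubMed IDs by ORF. It assumes one record per line and no header line, neither of which is stated above; the function name parse_orf_pmid is hypothetical, and the ORF names and PMIDs in the sample are made up.

```python
from collections import defaultdict


def parse_orf_pmid(lines):
    """Map each ORF to the set of PMIDs listed for it (input: ORF<tab>PMID lines)."""
    papers = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        orf, pmid = line.split("\t")
        papers[orf.strip()].add(pmid.strip())
    return dict(papers)


# Made-up sample records in the ORF[tab]PMID layout.
sample = ["YAL001C\t1234567\n", "YAL001C\t2345678\n", "YBR002W\t1234567\n"]
papers_by_orf = parse_orf_pmid(sample)
```

The same function accepts an open file handle, for example parse_orf_pmid(open(path)) for a downloaded file.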

Database schema

This tool uses the GMOD database schema, implemented similarly to the Sybil package provided by TIGR. The table below lists the main tables used in our original implementation; contact us if you need more detailed information. We have since modified the way in which the data are stored in the GMOD schema.

Type of Data | GMOD Module | GMOD Table(s) | Notes
Pipeline run | Companalysis | Analysis, Analysis_Feature | Pipeline runs are grouped together using the Analysis table, with features (protein sequences and ortholog families) generated from a particular run grouped together by the Analysis_Feature linking table (similar to the Sybil implementation).
Fasta files | Sequence | Feature, Dbxref | Fasta files are loaded into the Feature table, and IDs parsed from the header are loaded into the Dbxref table.
Ortholog families | Sequence | Feature, Featureloc, Dbxref | Each ortholog family is inserted as a feature (type is OrthoMCL family). Proteins that comprise the family are grouped with the OrthoMCL family using the Featureloc table. The Dbxref_id column for these OrthoMCL families is used to refer to the file name of the png image of the phylogenetic tree.
ClustalW alignment | Sequence | Feature | The ClustalW alignment of the sequences within the ortholog family is stored in the residue column in the row for the OrthoMCL family feature.
Cross-species expression and disease-related literature | Pub, Sequence | Pub, Featureprop, Featureprop_pub | Paper info is stored in the Pub table, then the paper, topic, and curated information is linked to the appropriate feature through the Featureprop and Featureprop_pub tables.
Disease info from OMIM | Sequence | Dbxref, Feature_dbxref | OMIM disease record IDs are stored in the Dbxref table. Features that are associated with OMIM disease records are linked to the relevant OMIM IDs through the Feature_dbxref table.
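
The grouping of proteins with an OrthoMCL family through the Featureloc table can be pictured with a toy in-memory database. This is a heavily simplified sketch, not the actual P-POD or GMOD schema. It rests on two assumptions: the column set is reduced to a few fields, with a plain text type column standing in for the schema's controlled-vocabulary type reference, and the family is taken to be the srcfeature_id side of each featureloc row, a direction the description above does not specify. All names and sequences are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        uniquename TEXT,
        type       TEXT,   -- simplification: real schema references a cvterm
        residues   TEXT
    );
    CREATE TABLE featureloc (
        featureloc_id INTEGER PRIMARY KEY,
        feature_id    INTEGER REFERENCES feature(feature_id),  -- member protein
        srcfeature_id INTEGER REFERENCES feature(feature_id)   -- family (assumed)
    );
    """
)
conn.executemany(
    "INSERT INTO feature VALUES (?, ?, ?, ?)",
    [
        (1, "family_0001", "OrthoMCL family", "<alignment text>"),
        (2, "proteinA", "protein", "MSTK"),
        (3, "proteinB", "protein", "MSSK"),
        (4, "proteinC", "protein", "MAAA"),  # not assigned to any family
    ],
)
conn.executemany(
    "INSERT INTO featureloc (feature_id, srcfeature_id) VALUES (?, ?)",
    [(2, 1), (3, 1)],
)


def family_members(conn, family_name):
    """Return the names of proteins linked to the given family via featureloc."""
    rows = conn.execute(
        """
        SELECT p.uniquename
        FROM feature AS f
        JOIN featureloc AS fl ON fl.srcfeature_id = f.feature_id
        JOIN feature AS p ON p.feature_id = fl.feature_id
        WHERE f.uniquename = ? AND f.type = 'OrthoMCL family'
        ORDER BY p.uniquename
        """,
        (family_name,),
    ).fetchall()
    return [row[0] for row in rows]


members = family_members(conn, "family_0001")
```

Storing the family itself as a feature row means the same table can hold the family's alignment text alongside the member sequences, which is the arrangement the notes above describe.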

Download original data

Original data are available in files on our FTP site (see version 1); file descriptions are available in the README.