Please cite the following paper if referring to P-POD (PubMed ID: 17712414):
Heinicke S., Livstone M.S., Lu C., Oughtred R., Kang F., Angiuoli S.V., White O., Botstein D., Dolinski K. (2007) The Princeton Protein Orthology Database (P-POD): A Comparative Genomics Analysis Tool for Biologists. PLoS ONE 2:e766.
The first version of the Princeton Protein Orthology Database (P-POD), as described in Heinicke et al., contains families of related proteins from S. cerevisiae, H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, and P. falciparum, with an emphasis on providing information about disease-related genes. The current P-POD contains updated protein data sets for each of the above organisms (Heinicke et al. 2007: updated) and also a Reference Genome Annotation data set with different organisms. Disease-related information is collected from the Online Mendelian Inheritance in Man (OMIM) database, the Saccharomyces Genome Database (SGD), and manual literature curation.
All of the original data from this publication are archived in the database and are freely and publicly available through the web and by downloading the entire database system (the data are available via FTP, see version 1; contact us for software information). The analysis pipeline in the publication used OrthoMCL v1.2 to generate ortholog families, Jaccard clustering to generate "super families" (large families of related sequences), ClustalW v1.83 to generate sequence alignments, and PHYLIP v3.65 to determine the phylogenetic relationships among the family members. The Generic Model Organism Database (GMOD) schema is used as the backend database.
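To illustrate the general idea behind Jaccard clustering, here is a minimal sketch. The page does not describe how the clustering was configured, so everything below is an assumption: the function names, the use of each protein's set of similarity hits as input, the single-linkage grouping, and the cutoff value are all hypothetical and are not taken from the P-POD pipeline.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(neighbors, cutoff):
    """Group items whose neighbor sets have a Jaccard coefficient >= cutoff.

    `neighbors` maps each item to the set of items it is similar to.
    Linked items are merged with a small union-find (single linkage).
    This is an illustrative simplification, not the published pipeline.
    """
    parent = {item: item for item in neighbors}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = sorted(neighbors)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(neighbors[a], neighbors[b]) >= cutoff:
                parent[find(a)] = find(b)

    groups = {}
    for item in ids:
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


# Made-up similarity hits for five placeholder proteins.
hits = {
    "p1": {"p1", "p2", "p3"},
    "p2": {"p1", "p2", "p3"},
    "p3": {"p1", "p2", "p3", "p4"},
    "p4": {"p4", "p5"},
    "p5": {"p4", "p5"},
}
clusters = jaccard_clusters(hits, cutoff=0.6)
```

With these toy inputs, p1, p2, and p3 share most of their hits and fall into one group, while p4 and p5 form a second group. A lower cutoff would merge more loosely related sequences, which is the intuition behind larger "super families".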
We gratefully acknowledge Mike Cherry (SGD), Shuai Weng (SGD), Eurie Hong (SGD), Sam Angiuoli (TIGR), Don Gilbert (Indiana University), Chris Stoeckert (UPenn), Feng Chen (UPenn), Scott Cain (CSHL), Laurie Kramer (Princeton) and John Matese (Princeton) for valuable discussions in creating P-POD.
Valid search entries for each organism included in the original P-POD analysis are listed in the table below:
|Organism||Source Database||Valid gene/protein identifier(s)||Examples|
|H. sapiens||ENSEMBL||ENSEMBL peptide ID, peptide name||ENSP00000347168, AK5|
|D. melanogaster||FlyBase||FlyBase ID||Nf1-PC, CG14224-PA|
|M. musculus||ENSEMBL||ENSEMBL peptide ID||ENSMUSP00000060900|
|A. thaliana||TAIR||TAIR identifier or gene name||AT1G31640.1, ATFH8|
|C. elegans||WormBase||WormBase identifier or gene name||F15C11.2b, sri-20|
|D. rerio||ENSEMBL||ENSEMBL peptide ID, ZFIN ID||ENSDARP00000053774, ZDB-GENE-050102-4|
|S. cerevisiae||SGD||ORF name or gene name||YMR310C, ACT1|
2) Select either "OrthoMCL" from the pull-down menu to view a family that contains only putative orthologs for the gene/protein of interest, or else "Jaccard clustering" to view a larger super family of related sequences.
Protein data sets

The sources of each protein data set and the numbers of sequences analyzed are listed in the table below. All files were downloaded on November 14, 2005. Note that some files may have been replaced with a more recent version at the source database. Clicking on a file name will link you to the directory containing the file.
|Organism||Number of proteins||Database||File name|
The OMIM diseases and their associated ENSEMBL peptide IDs were downloaded on April 24, 2006 from two sources:
- NCBI: downloaded the mim2gene.txt file from the NCBI FTP site, and used the Batch Entrez tool to retrieve disease descriptions.
- ENSEMBL BioMart: retrieved ENSEMBL peptide IDs associated with MIM IDs in ENSEMBL, then used the NCBI Batch Entrez tool to retrieve disease descriptions.
Papers flagged as "Disease-related" or "Cross-species expression" were downloaded from SGD on January 13, 2006: view/download Disease-related or Cross-species expression papers. Both files are in the format: ORF[tab]PMID
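As an illustration of the ORF[tab]PMID format, the following minimal sketch groups PubMed IDs by ORF. The function name `read_orf_pmid` is hypothetical, and the sample rows use placeholder PMIDs rather than values from the SGD files; the sketch also assumes one ORF/PMID pair per line with no header row.

```python
import io
from collections import defaultdict


def read_orf_pmid(handle):
    """Group PubMed IDs by ORF from lines formatted as ORF<TAB>PMID."""
    papers = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        orf, pmid = line.split("\t")
        papers[orf.strip()].append(pmid.strip())
    return dict(papers)


# Placeholder rows for illustration only; the PMIDs are not real.
sample = io.StringIO("YMR310C\t1\nYMR310C\t2\nYFL039C\t3\n")
orf_to_pmids = read_orf_pmid(sample)
```

To read a downloaded file instead, pass an open file handle, for example `read_orf_pmid(open(path))`, where `path` points to a local copy of one of the two files.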
This tool uses the GMOD database schema, implemented in a way similar to the Sybil package provided by TIGR. The table below lists the main tables used in our original implementation; contact us if you need more detailed information. We have since modified the way in which the data are stored in the GMOD schema.
|Type of Data||GMOD Module||GMOD Table(s)||Notes|
|Pipeline run||Companalysis||Analysis, Analysis_Feature||Pipeline runs are grouped together using the Analysis table, with features (protein sequences and ortholog families) generated from a particular run grouped together by the Analysis_Feature linking table (similar to the Sybil implementation).|
|FASTA files||Sequence||Feature, Dbxref||FASTA files are loaded into the Feature table, and IDs parsed from the header are loaded into the Dbxref table.|
|Ortholog families||Sequence||Feature, Featureloc, Dbxref||Each ortholog family is inserted as a feature (type is OrthoMCL family). Proteins that comprise the family are grouped with the OrthoMCL family using the Featureloc table. The Dbxref_id column for these OrthoMCL families is used to refer to the file name of the PNG image of the phylogenetic tree.|
|ClustalW alignment||Sequence||Feature||The ClustalW alignment of the sequences within the ortholog family is stored in the residue column in the row for the OrthoMCL family feature.|
|Cross-species expression and disease-related literature||Pub, Sequence||Pub, Featureprop, Featureprop_pub||Paper information is stored in the Pub table; the paper, topic, and curated information are then linked to the appropriate feature through the Featureprop and Featureprop_pub tables.|
|Disease info from OMIM||Sequence||Dbxref, Feature_dbxref||OMIM disease record IDs are stored in the Dbxref table. Features that are associated with OMIM disease records are linked to the relevant OMIM IDs through the Feature_dbxref table.|
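The last row of the table describes a linking-table pattern: features connect to OMIM records through Feature_dbxref. The sketch below reproduces that pattern with an in-memory SQLite database. It is a heavily reduced stand-in, not the actual GMOD schema: only the columns needed for the join are included, the helper `omim_ids_for` is hypothetical, and the inserted rows are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced stand-ins for the Feature, Dbxref, and Feature_dbxref tables.
# The real schema has many more columns and constraints.
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL
);
CREATE TABLE dbxref (
    dbxref_id INTEGER PRIMARY KEY,
    accession TEXT NOT NULL
);
CREATE TABLE feature_dbxref (
    feature_id INTEGER REFERENCES feature(feature_id),
    dbxref_id INTEGER REFERENCES dbxref(dbxref_id)
);
""")

# Placeholder rows: one protein feature linked to one OMIM record.
conn.execute("INSERT INTO feature VALUES (1, 'PLACEHOLDER_PROTEIN')")
conn.execute("INSERT INTO dbxref VALUES (10, 'OMIM_PLACEHOLDER_1')")
conn.execute("INSERT INTO feature_dbxref VALUES (1, 10)")


def omim_ids_for(conn, uniquename):
    """Return the accessions linked to a feature via feature_dbxref."""
    rows = conn.execute(
        """
        SELECT d.accession
        FROM feature AS f
        JOIN feature_dbxref AS fd ON fd.feature_id = f.feature_id
        JOIN dbxref AS d ON d.dbxref_id = fd.dbxref_id
        WHERE f.uniquename = ?
        ORDER BY d.accession
        """,
        (uniquename,),
    ).fetchall()
    return [row[0] for row in rows]


omim_hits = omim_ids_for(conn, "PLACEHOLDER_PROTEIN")
```

Because the link lives in its own table, one feature can point to several OMIM records and one OMIM record can be shared by several features without duplicating rows in either the Feature or the Dbxref table.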