Contents
About P-POD
The Princeton Protein Orthology Database (P-POD), developed by the Genome Databases Group at Princeton, displays families of orthologs from S. cerevisiae, H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, and P. falciparum, with an emphasis on providing information about disease-related genes. Disease-related information is collected from the Online Mendelian Inheritance in Man (OMIM) database, the Saccharomyces Genome Database (SGD), and manual literature curation.
Querying the web interface with a protein from one of the eight model organisms retrieves a phylogenetic tree of orthologous proteins, a list of diseases associated with the human ortholog(s), a list of papers associated with the yeast ortholog(s) and labeled as "disease-related" at SGD, and a manually curated and annotated list of papers with cross-complementation experiments involving the yeast ortholog(s). You may also search or browse the results by OMIM disease ID numbers and Mouse Genome Informatics (MGI) accession IDs.
Results from two types of comparative genomics analysis are provided as query options:
OrthoMCL analysis (UPenn): generates families that contain only putative orthologs. OrthoMCL identifies ortholog candidates as reciprocal best BLAST hits between organisms, and paralog candidates as pairs of genes within an organism that have better reciprocal BLAST scores than ortholog candidates. Once a set of homolog candidates is identified, it is evaluated by a Markov clustering algorithm to finalize family members. See Li et al. (2003) for details.
Jaccard Coefficient clustering analysis (TIGR): generates large families of related sequences. See the Jaccard-clustering analysis section of the documentation page provided by TIGR.
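
The two ideas named above can be illustrated with a minimal Python sketch. It is not the OrthoMCL or TIGR code: the input layout (`best_hits`), the function names, and the toy protein names are all assumptions made for illustration. The Markov clustering step that OrthoMCL applies afterwards is not shown, and applying the Jaccard coefficient to sets of BLAST neighbors is only an assumed reading of how such a coefficient might be used.

```python
from typing import Dict, Set, Tuple

# Hypothetical input layout: best_hits[(query_species, target_species)][protein]
# is the top-scoring BLAST hit of that protein in the target species.
BestHits = Dict[Tuple[str, str], Dict[str, str]]


def reciprocal_best_hits(best_hits: BestHits, sp_a: str, sp_b: str) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit in sp_b is b and b's best hit in sp_a is a."""
    a_to_b = best_hits.get((sp_a, sp_b), {})
    b_to_a = best_hits.get((sp_b, sp_a), {})
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}


def jaccard_coefficient(hits_x: Set[str], hits_y: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both sets are empty)."""
    union = hits_x | hits_y
    if not union:
        return 0.0
    return len(hits_x & hits_y) / len(union)


# Toy data; the protein names are invented.
toy_hits: BestHits = {
    ("yeast", "human"): {"yA": "hA", "yB": "hC"},
    ("human", "yeast"): {"hA": "yA", "hC": "yD"},
}
ortholog_candidates = reciprocal_best_hits(toy_hits, "yeast", "human")
shared_neighbors = jaccard_coefficient({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

In the toy data only the pair (yA, hA) is reciprocal, because hC's best yeast hit is yD rather than yB. The two neighbor sets share two of four distinct members, so their coefficient is 0.5.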
All the data within the database are freely and publicly available through the web and by downloading the entire database system (contact us for download information). Currently, the analysis pipeline uses OrthoMCL v1.5 to generate ortholog families, Jaccard clustering to generate "super families" (large families of related sequences), ClustalW v1.83 to generate sequence alignments, and PHYLIP v3.65 to determine the phylogenetic relationship among the family members. The system is designed in a modular way so that different components can be plugged into (or removed from) the analysis pipeline. For example, two alternative methods, OrthoMCL and Jaccard clustering, are used to generate different types of sequence families, though the analysis pipeline shares downstream components. The Generic Model Organism Database (GMOD) schema is used as the backend database.
We gratefully acknowledge Mike Cherry (SGD), Shuai Weng (SGD), Eurie Hong (SGD), Sam Angiuoli (TIGR), Don Gilbert (Indiana University), Chris Stoeckert (UPenn), Feng Chen (UPenn), Scott Cain (CSHL), Laurie Kramer (Princeton) and John Matese (Princeton) for valuable discussions.
Help with using this tool
A. Search by gene/protein name
B. Search by disease
C. Browse OMIM disease families
D. Browse families by organism
A. Search by gene/protein name
Search option 1 allows you to query for ortholog or super family results using a gene/protein name or accession identifier. The query options and results are described below.
Query options:
1) Enter a gene/protein identifier in the text box and select an organism using the pull-down menu. Valid search entries for each organism included in the analysis are listed in the table below:
| Organism | Source Database | Valid gene/protein identifier(s) | Examples |
|---|---|---|---|
| P.falciparum | PlasmoDB | PlasmoDB ID | PF11_0329 |
| H.sapiens | ENSEMBL | ENSEMBL peptide ID, peptide name | ENSP00000347168, AK5 |
| D.melanogaster | FlyBase | FlyBase ID | Nf1-PC, CG14224-PA |
| M.musculus | ENSEMBL | ENSEMBL peptide ID | ENSMUSP00000060900 |
| A.thaliana | TAIR | TAIR identifier or gene name | AT1G31640.1, ATFH8 |
| C.elegans | WormBase | WormBase identifier or gene name | F15C11.2b, sri-20 |
| D.rerio | ENSEMBL | ENSEMBL peptide ID, ZFIN ID | ENSDARP00000053774, ZDB-GENE-050102-4 |
| S.cerevisiae | SGD | ORF name or gene name | YMR310C, ACT1 |
2) Select "OrthoMCL" from the pull-down menu to view a family that contains only putative orthologs for the gene/protein of interest, or "Jaccard clustering" to view a larger super family of related sequences. (Please see the section "About P-POD" for more details on each method used.)
Query results:
1) If the OrthoMCL option was selected, then a family of putative orthologs is shown in a phylogenetic tree display with direct links to the source database for each gene/protein in the family. A link is also provided at the top of the page to view the corresponding Jaccard Coefficient results that contain a larger family of sequences related to the query gene/protein.
2) Disease information obtained from the Online Mendelian Inheritance in Man (OMIM) database is provided if a human gene displayed in the results has a corresponding gene or phenotype OMIM record. A link to the OMIM entry is provided via the OMIM record number in the disease information table. Please note that the OMIM records were obtained from Ensembl BioMart and may not reflect all current changes to the OMIM database.
3) The results also include a list of papers associated with yeast protein(s) in the family that address the topics "Disease-related" or "Cross-species Expression" in the Saccharomyces Genome Database (SGD) Literature Guide. Papers that address cross-species expression were manually curated to find experimental evidence that confirms or refutes the prediction of orthology calculated using the OrthoMCL method. If a paper shows cross-species complementation in which a gene/protein from one species complements the corresponding mutation in another species, then this is considered experimental evidence of orthology. The curator notes indicate whether orthology was directly tested via cross-species complementation, or whether only heterologous expression was carried out.
4) A ClustalW alignment of the protein sequences in the family is also provided with the gene/protein identifiers linked to their corresponding source databases. The symbols and color-coding indicate either strong similarity (:), weak similarity (.), or identical (*) residues between sequences. The sequence alignment (.aln file) or the actual protein sequences in FASTA format may be downloaded from the links provided.
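
If you download the .aln file, the conservation symbols described above can be tallied with a few lines of Python. This is a minimal sketch, not part of P-POD: the function name and the toy alignment are invented, and it assumes that conservation lines begin with whitespace while header and sequence lines do not.

```python
from collections import Counter


def count_conservation_symbols(aln_text: str) -> Counter:
    """Count '*', ':' and '.' on the conservation lines of a ClustalW alignment."""
    counts: Counter = Counter()
    for line in aln_text.splitlines():
        # Assumption: conservation lines start with whitespace; the header
        # and the sequence lines start with a non-space character.
        if line[:1].isspace():
            counts.update(ch for ch in line if ch in "*:.")
    return counts


# Toy two-sequence alignment; identifiers and residues are invented.
example_aln = """CLUSTAL W (1.83) multiple sequence alignment

seqA            MKTAYIAK
seqB            MKSAFIVK
                **:*:*.*
"""
symbol_counts = count_conservation_symbols(example_aln)
```

For the toy alignment this counts five identical columns (*), two strongly similar columns (:), and one weakly similar column (.).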
B. Search by disease
Search option 2 allows you to query for ortholog or super family results using an OMIM ID. To find an OMIM ID that matches a disease of interest, you can 1) browse OMIM disease families from this analysis to find their OMIM IDs, or 2) search the OMIM database itself, which contains OMIM IDs that are in this database and many other OMIM IDs. Query results are provided for those families where a human gene/protein has a corresponding gene or phenotype OMIM record. A link to the OMIM entry is provided via the OMIM record number in the disease information table. Please note that the OMIM records were obtained from Ensembl BioMart and may not reflect all current changes to the OMIM database.
C. Browse OMIM disease families
At the bottom of the homepage, a link is provided for browsing OMIM disease families. You may browse a list of OMIM IDs corresponding to human disease genes that encode proteins found in sequence families from this analysis. Clicking on one of the links in the OMIM value index displays the OMIM gene or phenotype description, along with the relevant human Ensembl peptide ID and links to its corresponding OrthoMCL or Jaccard cluster results.
D. Browse families by organism
At the bottom of the homepage, a link is also provided for browsing the OrthoMCL family and Jaccard clustering results by organism. The table gives the distribution of families based on the organism(s) they include and provides links to the OrthoMCL families and Jaccard clusters containing the different subgroups of organisms. Clicking on the number of OrthoMCL families or Jaccard clusters for each subset of organisms displays links to the actual results. Direct links for the sequence families are listed if there are fewer than ten for a particular subset of organisms.
Data sources
Protein data sets
The sources of each protein data set and the numbers of sequences analyzed are listed in the table below. All files were downloaded November 14, 2005. Note that some files may have been replaced with a more recent version at the source database. Clicking on a file name will link you to the directory containing the file. You can also download the fasta sequence files from us; see the Download data and software section below for more information.
| Organism | Number of proteins | Database | File name |
|---|---|---|---|
| S. cerevisiae | 6704 | SGD | orf_trans_all.fasta.gz |
| H. sapiens | 33869 | ENSEMBL | Homo_sapiens.NCBI35.nov.pep.fa.gz |
| M. musculus | 36471 | ENSEMBL | Mus_musculus.NCBIM34.nov.pep.fa |
| D. rerio | 32143 | ENSEMBL | Danio_rerio.ZFISH5.nov.pep.fa |
| D. melanogaster | 19178 | FlyBase | dmel-all-translation-r4.2.1.fa |
| C. elegans | 22858 | WormBase | wormpep150.fa |
| A. thaliana | 30690 | TAIR | TAIR6_pep_20051108.fa |
| P. falciparum | 5363 | PlasmoDB | Pfa3D7_WholeGenome_Annotated_PEP_2005.2.11.fa |
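
To compare a downloaded file against the protein counts in the table, you can count the FASTA header lines. The following is a minimal sketch rather than the script used for this analysis. The function name and the demo file are invented, and it assumes that each sequence has exactly one header line beginning with ">".

```python
import gzip
import os
import tempfile


def count_fasta_records(path: str) -> int:
    """Count sequences in a FASTA file (plain or .gz) by counting '>' header lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on a small invented file with two sequences.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fa")
    with open(demo_path, "w") as out:
        out.write(">prot1\nMKT\n>prot2\nMSA\nLLV\n")
    demo_count = count_fasta_records(demo_path)
```

Because some source files may have been replaced since the download date, a count from a current file will not necessarily match the table.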
OMIM diseases
The OMIM diseases and their associated ENSEMBL peptide IDs were downloaded on April 24, 2006 from two sources:
- NCBI: downloaded mim2gene.txt file from the NCBI ftp site, and used the Batch Entrez tool to retrieve disease descriptions.
- ENSEMBL BioMart: retrieved Ensembl peptide IDs associated with MIM IDs in ENSEMBL, then used the NCBI Batch Entrez tool to retrieve disease descriptions.
Literature
Papers flagged as "Disease-related" or "Cross-species expression" were downloaded from SGD on January 13, 2006: view/download Disease-related or Cross-species expression papers. Both files are in the format: ORF[tab]PMID
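
A file in the ORF[tab]PMID format can be read into a lookup table with a short Python function. This is a minimal sketch: the function name is an assumption, and the sample rows are invented placeholders, not real citations.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_orf_pmid(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group PubMed IDs by ORF from lines in the format ORF<tab>PMID."""
    papers: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Raises ValueError if a line does not have exactly two tab-separated fields.
        orf, pmid = line.split("\t")
        papers[orf].append(pmid)
    return dict(papers)


# Placeholder rows; the ORF/PMID pairings are invented.
sample = ["YMR310C\t11111111\n", "YAL001C\t22222222\n", "YMR310C\t33333333\n"]
papers_by_orf = parse_orf_pmid(sample)
```

An ORF that appears on several lines collects all of its PMIDs in file order, so `papers_by_orf["YMR310C"]` holds two entries here.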
Database schema
This tool uses the GMOD database schema, implemented in a way similar to the Sybil package provided by TIGR. The table below lists the main tables used in our implementation; contact us if you need more detailed information.
| Type of Data | GMOD Module | GMOD Table(s) | Notes |
|---|---|---|---|
| Pipeline run | Companalysis | Analysis, Analysis_Feature | Pipeline runs are grouped together using the Analysis table, with features (protein sequences and ortholog families) generated from a particular run grouped together by the Analysis_Feature linking table (similar to the Sybil implementation). |
| Fasta files | Sequence | Feature, Dbxref | Fasta files are loaded into the Feature table, and IDs parsed from the header are loaded into the Dbxref table. |
| Ortholog families | Sequence | Feature, Featureloc, Dbxref | Each ortholog family is inserted as a feature (type is OrthoMCL family). Proteins that comprise the family are grouped with the OrthoMCL family using the Featureloc table. The Dbxref_id column for these OrthoMCL families is used to refer to the file name of the PNG image of the phylogenetic tree. |
| ClustalW alignment | Sequence | Feature | The ClustalW alignment of the sequences within the ortholog family is stored in the residue column in the row for the OrthoMCL family feature. |
| Cross-species expression and disease-related literature | Pub, Sequence | Pub, Featureprop, Featureprop_pub | Paper info is stored in the Pub table, then the paper, topic, and curated information is linked to the appropriate feature through the Featureprop and Featureprop_pub tables. |
| Disease info from OMIM | Sequence | Dbxref, Feature_dbxref | OMIM disease record IDs are stored in the Dbxref table. Features that are associated with OMIM disease records are linked to the relevant OMIM IDs through the Feature_dbxref table. |
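
The last row of the table can be illustrated with a small, self-contained sketch that uses Python's built-in sqlite3 module. It is not the P-POD database. The three table names follow the table above, but the reduced column sets, the function name, and the placeholder identifiers are assumptions made for illustration; the real GMOD tables have more columns.

```python
import sqlite3

# In-memory stand-in for the Feature, Dbxref, and Feature_dbxref tables.
# Column sets are simplified assumptions; all identifiers are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE feature_dbxref (
    feature_id INTEGER REFERENCES feature(feature_id),
    dbxref_id INTEGER REFERENCES dbxref(dbxref_id)
);
INSERT INTO feature VALUES (1, 'ENSP_EXAMPLE_1');
INSERT INTO feature VALUES (2, 'ENSP_EXAMPLE_2');
INSERT INTO dbxref VALUES (10, 'OMIM_EXAMPLE_A');
INSERT INTO feature_dbxref VALUES (1, 10);
""")


def omim_ids_for_feature(connection: sqlite3.Connection, uniquename: str) -> list:
    """Return the accessions linked to a feature through the feature_dbxref table."""
    rows = connection.execute(
        """
        SELECT d.accession
        FROM feature f
        JOIN feature_dbxref fd ON fd.feature_id = f.feature_id
        JOIN dbxref d ON d.dbxref_id = fd.dbxref_id
        WHERE f.uniquename = ?
        ORDER BY d.accession
        """,
        (uniquename,),
    ).fetchall()
    return [row[0] for row in rows]


linked = omim_ids_for_feature(conn, "ENSP_EXAMPLE_1")
unlinked = omim_ids_for_feature(conn, "ENSP_EXAMPLE_2")
```

A feature with no row in the linking table returns an empty list, which corresponds to a protein that has no associated OMIM record.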