Princeton Protein Orthology Database (P-POD): Help

About P-POD

The Princeton Protein Orthology Database (P-POD), developed by the Genome Databases Group at Princeton, displays families of orthologs from S. cerevisiae, H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, and P. falciparum, with an emphasis on providing information about disease-related genes. Disease-related information is collected from the Online Mendelian Inheritance in Man (OMIM) database, the Saccharomyces Genome Database (SGD), and manual literature curation.

Querying the web interface with a protein from one of the eight model organisms retrieves a phylogenetic tree of orthologous proteins, a list of diseases associated with the human ortholog(s), a list of papers associated with the yeast ortholog(s) and labeled as "disease-related" at SGD, and a manually curated and annotated list of papers with cross-complementation experiments involving the yeast ortholog(s). You may also search or browse the results by OMIM disease ID numbers and Mouse Genome Informatics (MGI) accession IDs.

Results from two types of comparative genomics analysis are provided as query options:

  • OrthoMCL analysis (UPenn): generates families that contain only putative orthologs. OrthoMCL identifies ortholog candidates as reciprocal best BLAST hits between organisms, and paralog candidates as pairs of genes within an organism that have better reciprocal BLAST scores than the ortholog candidates. The resulting set of homolog candidates is then evaluated by a Markov clustering algorithm to finalize family members. See Li et al. (2003) for details.

  • Jaccard Coefficient clustering analysis (TIGR): generates large families of related sequences. See the Jaccard-clustering analysis section of the documentation page provided by TIGR.

  • Each family generated using either the OrthoMCL or Jaccard Coefficient method is then analyzed by ClustalW and PHYLIP to generate the corresponding sequence alignments and dendrograms as indicated below:

    [Pipeline Flow Chart]

    All the data within the database are freely and publicly available through the web and by downloading the entire database system (see the Download data and software section for download information). Currently, the analysis pipeline uses OrthoMCL v1.5 to generate ortholog families, Jaccard clustering to generate "super families" (large families of related sequences), ClustalW v1.83 to generate sequence alignments, and PHYLIP v3.65 to determine the phylogenetic relationship among the family members. The system is designed in a modular way so that different components can be plugged into (or removed from) the analysis pipeline. For example, two alternative methods, OrthoMCL and Jaccard clustering, are used to generate different types of sequence families, though the analysis pipeline shares downstream components. The Generic Model Organism Database (GMOD) schema is used as the backend database.
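The reciprocal-best-hit idea behind the OrthoMCL ortholog candidates described above can be illustrated with a short Python sketch. This is not OrthoMCL's code and it omits the paralog step and the Markov clustering; the tuple layout `(query, query_organism, target, target_organism, score)`, the function names, and the sample hits are all assumptions made for illustration.

```python
def best_hits(hits):
    """Map (query, target_organism) to the top-scoring (target, score) hit."""
    best = {}
    for query, q_org, target, t_org, score in hits:
        if q_org == t_org:
            continue  # within-organism hits relate to paralogs; ignored in this sketch
        key = (query, t_org)
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return best


def reciprocal_best_hits(hits):
    """Return the set of protein pairs that are each other's best cross-organism hit."""
    organism_of = {}
    for query, q_org, target, t_org, _ in hits:
        organism_of[query] = q_org
        organism_of[target] = t_org

    best = best_hits(hits)
    pairs = set()
    for (query, _t_org), (target, _score) in best.items():
        # Look up the best hit of `target` back in the query's organism.
        back = best.get((target, organism_of[query]))
        if back is not None and back[0] == query:
            pairs.add(tuple(sorted((query, target))))
    return pairs


# Hypothetical BLAST-like hits (identifiers and scores are invented).
hits = [
    ("yeastA", "yeast", "humanA", "human", 250.0),
    ("yeastA", "yeast", "humanB", "human", 90.0),
    ("humanA", "human", "yeastA", "yeast", 240.0),
    ("humanB", "human", "yeastA", "yeast", 85.0),
    ("humanB", "human", "yeastB", "yeast", 30.0),
    ("yeastB", "yeast", "humanA", "human", 40.0),
]

print(reciprocal_best_hits(hits))
```

In this toy input only yeastA and humanA pick each other as their best hit, so they form the single ortholog candidate pair; humanB and yeastB each point at a protein that prefers a different partner.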

    We gratefully acknowledge Mike Cherry (SGD), Shuai Weng (SGD), Eurie Hong (SGD), Sam Angiuoli (TIGR), Don Gilbert (Indiana University), Chris Stoeckert (UPenn), Feng Chen (UPenn), Scott Cain (CSHL), Laurie Kramer (Princeton) and John Matese (Princeton) for valuable discussions.

    Help with using this tool

    A. Search by gene/protein name
    B. Search by disease
    C. Browse OMIM disease families
    D. Browse families by organism

    A. Search by gene/protein name

    Search option 1 allows you to query for ortholog or super family results using a gene/protein name or accession identifier. The query options and results are described below.

    Query options:
    1) Enter a gene/protein identifier in the text box and select an organism using the pull-down menu. Valid search entries for each organism included in the analysis are listed in the table below:

    2) Select "OrthoMCL" from the pull-down menu to view a family that contains only putative orthologs for the gene/protein of interest, or "Jaccard clustering" to view a larger super family of related sequences. (See the "About P-POD" section for more details on each method.)

    Query results:
    1) If the OrthoMCL option was selected, then a family of putative orthologs is shown in a phylogenetic tree display with direct links to the source database for each gene/protein in the family. A link is also provided at the top of the page to view the corresponding Jaccard Coefficient results that contain a larger family of sequences related to the query gene/protein.

    2) Disease information obtained from the Online Mendelian Inheritance in Man (OMIM) database is provided if a human gene displayed in the results has a corresponding gene or phenotype OMIM record. A link to the OMIM entry is provided via the OMIM record number in the disease information table. Please note that the OMIM records were obtained from Ensembl BioMart and may not reflect all current changes to the OMIM database.

    3) The results also include a list of papers associated with yeast protein(s) in the family that address the topics "Disease-related" or "Cross-species Expression" in the Saccharomyces Genome Database (SGD) Literature Guide. Papers that address cross-species expression were manually curated to find experimental evidence that confirms or refutes the prediction of orthology calculated using the OrthoMCL method. If a paper shows cross-species complementation in which a gene/protein from one species complements the corresponding mutation in another species, then this is considered experimental evidence of orthology. The curator notes indicate whether orthology was directly tested via cross-species complementation, or whether only heterologous expression was carried out.

    4) A ClustalW alignment of the protein sequences in the family is also provided, with the gene/protein identifiers linked to their corresponding source databases. The symbols and color-coding indicate identical residues (*), strong similarity (:), or weak similarity (.) between sequences. The sequence alignment (.aln file) or the actual protein sequences in FASTA format may be downloaded from the links provided.
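If you download the .aln file, the conservation symbols described above can be tallied with a few lines of Python. The sketch below assumes the usual ClustalW text layout, in which each block of sequence lines is followed by a symbol line that starts with whitespace; the function name and the tiny hand-written alignment are illustrative only and are not taken from P-POD data.

```python
def conservation_counts(aln_text):
    """Count identical (*), strong (:), and weak (.) columns in ClustalW .aln text."""
    labels = {"*": "identical", ":": "strong", ".": "weak"}
    counts = {"identical": 0, "strong": 0, "weak": 0}
    for line in aln_text.splitlines():
        # Symbol lines begin with a space and contain only spaces and the three symbols.
        # A symbol line with no conserved columns is blank and is skipped.
        if line[:1] == " " and line.strip() and set(line.strip()) <= set("*:. "):
            for ch in line:
                if ch in labels:
                    counts[labels[ch]] += 1
    return counts


# Hand-written example in ClustalW style (sequence names are placeholders).
aln = """CLUSTAL W (1.83) multiple sequence alignment

seqA      MKTAYIAK
seqB      MKSAYLGR
          **:**:.:
"""

print(conservation_counts(aln))
```

Sequence lines begin with an identifier rather than a space, so they are never mistaken for symbol lines under this assumption.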

    B. Search by disease

    Search option 2 allows you to query for ortholog or super family results using an OMIM ID. To find an OMIM ID that matches a disease of interest, you can 1) browse OMIM disease families from this analysis to find their OMIM IDs, or 2) search the OMIM database itself, which contains the OMIM IDs in this database as well as many others. Query results are provided for those families where a human gene/protein has a corresponding gene or phenotype OMIM record. A link to the OMIM entry is provided via the OMIM record number in the disease information table. Please note that the OMIM records were obtained from Ensembl BioMart and may not reflect all current changes to the OMIM database.

    C. Browse OMIM disease families

    At the bottom of the homepage, a link is provided for browsing OMIM disease families. You may browse a list of OMIM IDs corresponding to human disease genes that encode proteins found in sequence families from this analysis. Clicking on one of the links in the OMIM value index displays the OMIM gene or phenotype description, along with the relevant human Ensembl peptide ID and links to its corresponding OrthoMCL or Jaccard cluster results.

    D. Browse families by organism

    At the bottom of the homepage, a link is also provided for browsing the OrthoMCL family and Jaccard clustering results by organism. The table gives the distribution of families based on the organism(s) they include and provides links to the OrthoMCL families and Jaccard clusters containing the different subgroups of organisms. Clicking on the number of OrthoMCL families or Jaccard clusters for each subset of organisms displays links to the actual results. Direct links for the sequence families are listed if there are fewer than ten for a particular subset of organisms.

    Data sources

    Protein data sets

    The source of each protein data set and the number of sequences analyzed are listed in the table below. All files were downloaded November 14, 2005. Note that some files may have been replaced with a more recent version at the source database. Clicking on a file name takes you to the directory containing the file. You can also download the FASTA sequence files from us; see the Download data and software section below for more information.
    Organism | Number of proteins | Database | File name
    S. cerevisiae | 6704 | SGD | orf_trans_all.fasta.gz
    H. sapiens | 33869 | ENSEMBL | Homo_sapiens.NCBI35.nov.pep.fa.gz
    M. musculus | 36471 | ENSEMBL | Mus_musculus.NCBIM34.nov.pep.fa
    D. rerio | 32143 | ENSEMBL | Danio_rerio.ZFISH5.nov.pep.fa
    D. melanogaster | 19178 | FlyBase | dmel-all-translation-r4.2.1.fa
    C. elegans | 22858 | WormBase | wormpep150.fa
    A. thaliana | 30690 | TAIR | TAIR6_pep_20051108.fa
    P. falciparum | 5363 | PlasmoDB | Pfa3D7_WholeGenome_Annotated_PEP_2005.2.11.fa
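Because the source files may have changed since they were downloaded, it can be useful to check the number of sequences in a FASTA file you obtain yourself. A minimal Python sketch follows; it counts header lines (those starting with ">") and handles both plain and gzip-compressed files. The function name and the demo file contents are assumptions for illustration, not P-POD data.

```python
import gzip
import os
import tempfile


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on two small throwaway files (contents are invented).
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "demo.fa")
    with open(plain_path, "w") as fh:
        fh.write(">p1\nMKT\n>p2\nMSA\nGG\n")

    gz_path = os.path.join(tmp, "demo.fa.gz")
    with gzip.open(gz_path, "wt") as fh:
        fh.write(">p1\nMKT\n")

    n_plain = count_fasta_records(plain_path)
    n_gz = count_fasta_records(gz_path)

print(n_plain, n_gz)
```

A count that differs from the table above would suggest that the file at the source database is a newer release than the one used in this analysis.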

    OMIM diseases

    The OMIM diseases and their associated ENSEMBL peptide IDs were downloaded on April 24, 2006 from two sources:

    These files were parsed and combined into one file (diseasegenesBiomart_Mim.txt) and used to load the GMOD database.

    Literature

    Papers flagged as "Disease-related" or "Cross-species expression" were downloaded from SGD on January 13, 2006; the Disease-related and Cross-species expression paper lists can each be viewed or downloaded. Both files are in the format: ORF[tab]PMID
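The ORF[tab]PMID format is simple to read programmatically. The Python sketch below groups PMIDs by ORF; the function name is hypothetical and the sample lines use placeholder identifiers rather than real SGD records.

```python
from collections import defaultdict


def load_orf_pmids(lines):
    """Group PMIDs by ORF from lines in the form ORF<tab>PMID."""
    papers = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        orf, pmid = line.split("\t")
        papers[orf.strip()].add(pmid.strip())
    return dict(papers)


# Placeholder records; real files would hold yeast ORF names and PubMed IDs.
sample = ["ORF_A\t111\n", "ORF_A\t222\n", "\n", "ORF_B\t111\n"]
orf_to_pmids = load_orf_pmids(sample)

print(orf_to_pmids)
```

To read a downloaded file, pass an open file handle instead of the list, for example `load_orf_pmids(open(path))`, where `path` is wherever you saved the file.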

    Database schema

    This tool uses the GMOD database schema, implemented in a way similar to the Sybil package provided by TIGR. The table below lists the main tables used in our implementation; contact us if you need more detailed information.

    Type of Data | GMOD Module | GMOD Table(s) | Notes
    Pipeline run | Companalysis | Analysis, Analysis_Feature | Pipeline runs are grouped together using the Analysis table, with features (protein sequences and ortholog families) generated from a particular run grouped together by the Analysis_Feature linking table (similar to the Sybil implementation).
    Fasta files | Sequence | Feature, Dbxref | Fasta files are loaded into the feature table, and IDs parsed from the header are loaded into the Dbxref table.
    Ortholog families | Sequence | Feature, Featureloc, Dbxref | Each ortholog family is inserted as a feature (type is OrthoMCL family). Proteins that comprise the family are grouped with the OrthoMCL family using the featureloc table. The Dbxref_id column for these OrthoMCL families is used to refer to the file name of the png image of the phylogenetic tree.
    ClustalW alignment | Sequence | Feature | The ClustalW alignment of the sequences within the ortholog family is stored in the residue column in the row for the OrthoMCL family feature.
    Cross-species expression and disease-related literature | Pub, Sequence | Pub, Featureprop, Featureprop_pub | Paper info is stored in the pub table, then the paper, topic, and curated information is linked to the appropriate feature through the Featureprop and Featureprop_pub tables.
    Disease info from OMIM | Sequence | Dbxref, Feature_dbxref | OMIM disease record IDs are stored in the Dbxref table. Features that are associated with OMIM disease records are linked to the relevant OMIM IDs through the Feature_dbxref table.
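To make the family-to-protein grouping described in the table more concrete, here is a toy Python sketch using an in-memory SQLite database. It is not the P-POD schema: the two tables are heavily simplified stand-ins for the GMOD feature and featureloc tables, the feature type is stored as plain text for brevity, and the direction of the link (the protein in `feature_id`, the family in `srcfeature_id`) is an assumption. All names and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the GMOD tables (assumed layout, not the real schema).
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE featureloc (feature_id INTEGER, srcfeature_id INTEGER);

    INSERT INTO feature VALUES (1, 'family_1', 'OrthoMCL family');
    INSERT INTO feature VALUES (2, 'protein_a', 'protein');
    INSERT INTO feature VALUES (3, 'protein_b', 'protein');
    INSERT INTO feature VALUES (4, 'protein_c', 'protein');

    -- protein_a and protein_b belong to family_1; protein_c belongs to no family.
    INSERT INTO featureloc VALUES (2, 1);
    INSERT INTO featureloc VALUES (3, 1);
    """
)


def family_members(conn, family_name):
    """Return the names of proteins linked to the named family via featureloc."""
    rows = conn.execute(
        """
        SELECT member.name
        FROM feature AS family
        JOIN featureloc AS loc ON loc.srcfeature_id = family.feature_id
        JOIN feature AS member ON member.feature_id = loc.feature_id
        WHERE family.name = ? AND family.type = 'OrthoMCL family'
        ORDER BY member.name
        """,
        (family_name,),
    ).fetchall()
    return [name for (name,) in rows]


members = family_members(conn, "family_1")
print(members)
```

The same join pattern would apply to a real installation only after checking the actual column names and link direction in your copy of the database.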