The Princeton Protein Orthology Database (P-POD), developed by the Genome Databases Group/SGD Princeton Colony, displays protein families on two scales: smaller families of predicted orthologs and larger families of related protein sequences. P-POD was first described in Heinicke et al. (2007), and the archived data for this publication are available here. In the original version of P-POD, OrthoMCL was used to generate families of putative orthologs, and Jaccard clustering was used to generate larger families of homologs in order to provide a broader evolutionary context. For each cluster, multiple sequence alignments and phylogenetic trees were constructed using ClustalW and PHYLIP.
In the current version of P-POD (version 4, December 2009), families of predicted orthologs have been determined using either OrthoMCL or MultiParanoid, and larger families using Jaccard clustering. A Naïve Ensemble clustering method has also been added in order to generate consensus clusters of the OrthoMCL and MultiParanoid results for each protein family. Multiple sequence alignments and evolutionary trees for each family were also produced using two new programs: MAFFT and PhyML. Reconciliation and orthology analysis of these trees were carried out using Notung, developed by the Durand Lab at Carnegie Mellon University.
An overview of the current analysis pipeline follows; technical details can be found here.
The P-POD database can store different data sets simultaneously and you can search the particular data set in which you are most interested. Currently, the analysis pipeline described above has been run on two different sets of protein sequences:
- Reference Genome Annotation Project: includes the 12 genomes used in the GO Reference Genome Project: A. thaliana, C. elegans, D. rerio, D. discoideum, D. melanogaster, E. coli, G. gallus, H. sapiens, M. musculus, R. norvegicus, S. cerevisiae, and S. pombe.
- Heinicke et al. 2007: update: an updated version of the analysis described in Heinicke et al. (2007). It includes families from S. cerevisiae, H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, A. thaliana, and P. falciparum.
All amino acid sequences from the genomes listed above were downloaded, and similarity scores between all pairs of sequences were obtained by all-versus-all BLAST. The resulting similarity scores were used to cluster the protein sequences by three different methods: OrthoMCL or MultiParanoid to generate smaller families of predicted orthologs, and Jaccard clustering to generate larger families of related sequences to provide a larger evolutionary context. Program versions, settings, and other technical information about the components of the P-POD analysis pipeline can be found here.
The P-POD search page includes both data sets as separate tabs, as shown below:
To choose one, simply click on the tab for the data set of interest. The Reference Genome Annotation Project set is the default choice.
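To make the clustering step of the pipeline overview more concrete, here is a minimal Python sketch of Jaccard-style clustering on all-versus-all hit lists. It is an illustration only and not the P-POD implementation: it assumes that the Jaccard coefficient is computed over the sets of BLAST hits of two proteins and that proteins whose coefficient reaches a cutoff are linked into the same cluster. The function names, the `hits` input format, and the `cutoff` value of 0.5 are all hypothetical; the actual programs and settings are listed in the technical details.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(hits, cutoff=0.5):
    """Group proteins whose hit sets overlap strongly.

    hits: dict mapping a protein ID to the set of protein IDs it hits
          in an all-versus-all search (hypothetical input format).
    cutoff: illustrative threshold, not a P-POD setting.
    """
    # Each protein is treated as a neighbor of itself.
    neighbors = {p: set(h) | {p} for p, h in hits.items()}
    parent = {p: p for p in neighbors}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Link every pair whose Jaccard coefficient reaches the cutoff.
    for a, b in combinations(sorted(neighbors), 2):
        if jaccard(neighbors[a], neighbors[b]) >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for p in neighbors:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())


example_hits = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"E"},
    "E": {"D"},
}
example_clusters = jaccard_clusters(example_hits)
```

In this toy input, proteins A, B, and C share all of their hits and end up in one cluster, while D and E form a second one. Comparing every pair is quadratic in the number of proteins, so a genome-scale run would need a more economical strategy than this sketch.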
P-POD and Notung
We applied the Notung program to reconcile, root, rearrange, and infer orthologs and paralogs in protein family trees in P-POD; details on settings and parameters can be found here. Notung uses duplication/loss parsimony to fit a gene (protein) tree to a chosen species tree. (Download the species tree that we used here). When there is more than one optimal root, Notung chooses one arbitrarily.
Notung was also used to calculate ortholog/paralog relationships among the proteins in the families whenever they can be unambiguously determined. Detailed information about how Notung determines these relationships is available here. P-POD provides these results under the "Protein Family" tab of the display. In addition, users can dynamically view the relationships by using the Notung applet.
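The idea of fitting a gene tree to a species tree can be illustrated with a small Python sketch. This is not Notung's code and it is far simpler than Notung: it assumes a rooted binary gene tree whose leaves are labelled with species names, maps every gene-tree node to the smallest species-tree clade that contains all of its species, and marks a node as a duplication when it maps to the same clade as one of its children. Losses, rooting, and rearrangement are left out, and the tuple-based tree format and function names are hypothetical.

```python
def species_clades(tree, out):
    """Collect every clade of a nested-tuple species tree as a frozenset."""
    if isinstance(tree, str):
        clade = frozenset([tree])
    else:
        clade = frozenset().union(*(species_clades(child, out) for child in tree))
    out.append(clade)
    return clade


def lca(clade_list, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clade_list if species <= c), key=len)


def reconcile(gene_tree, clade_list):
    """Return (mapped clade, number of inferred duplications).

    gene_tree: a species name (leaf) or a 2-tuple of subtrees.
    """
    if isinstance(gene_tree, str):
        return lca(clade_list, frozenset([gene_tree])), 0
    left, right = gene_tree
    left_map, left_dups = reconcile(left, clade_list)
    right_map, right_dups = reconcile(right, clade_list)
    here = lca(clade_list, left_map | right_map)
    # A node mapped to the same clade as a child is a duplication.
    is_duplication = here == left_map or here == right_map
    return here, left_dups + right_dups + int(is_duplication)


# Toy species tree: ((human, mouse), yeast)
clade_list = []
species_clades((("human", "mouse"), "yeast"), clade_list)

# Toy gene tree with two human genes: ((human, mouse), (human, yeast))
root_clade, duplications = reconcile(
    (("human", "mouse"), ("human", "yeast")), clade_list
)
```

Here the root of the toy gene tree maps to the same clade as its right child, so the sketch reports one duplication at the root. Genes on opposite sides of such a node would be treated as paralogs, and genes separated by a non-duplication node as orthologs.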
Literature and disease information in P-POD
We manually curated papers flagged as describing "Cross-species expression" by SGD to indicate explicitly when functional conservation was experimentally determined. These cross-species expression experiments test whether expressing a putative homolog from one organism will restore wild-type function to the corresponding inactivated gene in another organism. Currently, over 600 curated complementation experiments are available through P-POD. They are found under the "Functional Conservation" tab on the protein family pages.
We also provide OMIM phenotype records (# symbol, e.g. #256840) that are associated with human proteins in P-POD families, along with papers in SGD that are associated with yeast proteins and flagged as containing information about diseases. These are found under the "Disease References" tab on the protein family pages.
You can search for a protein using various identifiers and descriptions (see Search Term Help for details); the search is case-insensitive. You can narrow your search by limiting it to a specific species, or search multiple species at once. Multiple identifiers may be entered with IDs separated by pipes (e.g. RAS1|RAS2) and wild card searches are also permitted (e.g. RAS*). A search for a protein ID will return a list of all proteins that match the query with a link to each family in which the protein appears. In the screenshot above, FPR1 is entered as the search term, and Saccharomyces cerevisiae is chosen as the organism; P-POD will return the 'FPR1' result as shown below:
The search results page provides a table of search results as shown above, including the organism(s) that contain the matching protein (Organism column), the identifier (Protein column) and synonyms (Synonym column). The last four columns provide links to the protein clusters in which the protein hits are found. For example, the protein SGDID:S000005079 (FPR1) is found in the OrthoMCL family OrthoMCL467, the MultiParanoid family Para761, the Jaccard cluster Jaccard98 and the Naïve Ensemble cluster Nens225; the family/cluster names are hyperlinked to the detailed protein family display, described below.
P-POD can also be searched using OMIM phenotype records (# symbol, e.g. #256840). The OMIM ID (without the # symbol) can be entered at the bottom of the search page. Any human proteins in P-POD families associated with the OMIM phenotype will be returned on the results page. The actual disease information is found under the "Disease References" tab on the protein family pages.
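The search-term syntax described above (pipe-separated identifiers, * wild cards, case-insensitive matching) can be mimicked with a short Python sketch. It is meant only to show how such queries behave and makes no claim about how the P-POD search is implemented; the function names and the sample identifier list are hypothetical.

```python
import re


def compile_query(query):
    """Turn a query such as 'RAS1|RAS2' or 'RAS*' into regex patterns."""
    patterns = []
    for term in query.split("|"):
        term = term.strip()
        if not term:
            continue
        # Escape each literal piece and let '*' match any run of characters.
        regex = ".*".join(re.escape(part) for part in term.split("*"))
        patterns.append(re.compile(regex, re.IGNORECASE))
    return patterns


def search(query, identifiers):
    """Return the identifiers that fully match any term of the query."""
    patterns = compile_query(query)
    return [i for i in identifiers if any(p.fullmatch(i) for p in patterns)]


sample_ids = ["RAS1", "RAS2", "ras2", "FPR1", "RASA1"]
exact_hits = search("RAS1|RAS2", sample_ids)
wildcard_hits = search("ras*", sample_ids)
```

With this sample list, the pipe-separated query matches RAS1 and both spellings of RAS2, while the wild card query also picks up RASA1.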
P-POD Family Display
The P-POD protein family display contains the following information:
- A representation of the phylogenetic tree for that family: The trees were generated using Notung as described above. The protein leaves are color coded by species.
- Protein table: The proteins in the family are also shown in a table format with links to the source databases and AmiGO.
- A link to launch the interactive Notung applet: Many features of the stand-alone Notung program are available via a seamless link from the P-POD display to the Notung applet. This Notung applet serves as an interactive tree analysis tool. See the Investigating P-POD families with the Notung applet page for details.
- A link to launch the interactive Jalview applet: Jalview is an interactive multiple alignment editor and analysis tool. Many features of the stand-alone Jalview program are available from a link from the P-POD display to the Jalview applet. See the Jalview help documentation for details.
In addition to the information shown on the main Protein Family display as described above, navigational links at the top of the page lead to additional information, including:
- Functional Conservation: In this table, we display the functional conservation experiments that we manually culled from the literature as described above.
- Download Files: This link navigates to a page that provides several files for download so that they can be used in other applications.
- Disease References: When available, a link to disease information is provided at the top of the P-POD family display. On the "Disease References" page, OMIM Phenotype records associated with human proteins in the P-POD family are displayed. In addition, papers from SGD that are associated with yeast proteins and flagged as containing disease information are also shown.