Conservative Inference of Orthologs and Paralogs for Short Branches
Standard Inference of Orthologs and Paralogs
Two genes are orthologous if they diverged from a common ancestral gene by speciation. If they diverged from a common ancestral gene by duplication, then they are paralogous. In a rooted, binary gene tree that has been reconciled with a species tree and is well supported by the sequence data, orthologs and paralogs can be determined unambiguously. For any pair of genes, it is sufficient to trace from each gene towards the root of the tree until their most recent common ancestor is reached. If that ancestor is a duplication event, then the genes are paralogous. If it is a speciation event, then they are orthologous.
For example, consider a hypothetical gene family from the species Mus musculus, Homo sapiens and Danio rerio, shown here:
A single duplication at the root of this tree is shown with a red square, marked by the letter "D". All other internal nodes represent speciation events. The genes Mus_musculus|gene1 and Danio_rerio|gene1 are orthologous, as are Mus_musculus|gene2 and Danio_rerio|gene2. In contrast, Mus_musculus|gene1 and Danio_rerio|gene2 are paralogous, since their common ancestor is the duplication node at the root of the tree.
The complete set of orthologs and paralogs for this example is given in the following table:
Homolog Table. Key: P = paralogous; O = orthologous; . = the genes on the X and Y axes are the same.
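The lookup described above can be sketched in a few lines of Python. This is only an illustration, not PPOD's implementation: the nested-tuple encoding, the function names, and the branching order inside each subfamily are assumptions made for the example (only the root duplication is stated in the text, and the within-subfamily order does not affect this table).

```python
from itertools import combinations

# Hypothetical encoding: an internal node is (event, left, right), where
# event is "D" (duplication) or "S" (speciation); a leaf is a gene name.
def subfamily(n):
    # Assumed branching order inside a subfamily: ((Mus, Homo), Danio).
    return ("S",
            ("S", f"Mus_musculus|gene{n}", f"Homo_sapiens|gene{n}"),
            f"Danio_rerio|gene{n}")

TREE = ("D", subfamily(1), subfamily(2))  # single duplication at the root

def leaves(node):
    """Return the gene names below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return leaves(node[1]) + leaves(node[2])

def relationship(node, a, b):
    """Return 'O' or 'P' for two distinct genes a and b found under node,
    based on the event at their most recent common ancestor."""
    _, left, right = node
    for child in (left, right):
        if not isinstance(child, str):
            below = leaves(child)
            if a in below and b in below:
                return relationship(child, a, b)  # ancestor lies deeper
    return "O" if node[0] == "S" else "P"

def homolog_table(tree):
    """Build the pairwise table, with '.' on the diagonal."""
    genes = leaves(tree)
    table = {(g, g): "." for g in genes}
    for a, b in combinations(genes, 2):
        table[(a, b)] = table[(b, a)] = relationship(tree, a, b)
    return genes, table

if __name__ == "__main__":
    genes, table = homolog_table(TREE)
    for g in genes:
        print(f"{g:22}", " ".join(table[(g, h)] for h in genes))
```

Running the script prints one row per gene, using the same P, O and . symbols as the key above.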
Inference of Orthologs and Paralogs in Trees with Poorly Supported Branches
However, if the gene tree contains branches that are not strongly supported by the sequence data, we can be less confident in ortholog and paralog predictions made simply by examining the common ancestor. To account for this uncertainty, we predict orthologs and paralogs more conservatively, as described in the following example.
The gene tree shown here has the same branching order as above, but one edge (shown in yellow) is not well supported by the data:
The yellow edge connects Mus_musculus|gene2 and Homo_sapiens|gene2 to the rest of the tree. If this edge were strongly supported, it would imply that Mus_musculus|gene2 is more closely related to Homo_sapiens|gene2 than to any other gene in the tree, and vice versa. Since this edge is not strongly supported by sequence data, this may not be the case. We should also consider alternate hypotheses that break this weak edge, but still contain all strong edges.
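One way to picture "breaking" a weak edge is to collapse it, which leaves a three-way split, and then list every way of resolving that split back into a binary tree. The sketch below does this for the gene2 subfamily. The function name and the nested-tuple encoding are hypothetical, and how PPOD itself enumerates alternatives is not specified here.

```python
def break_weak_edge(parent):
    """parent is ((a, b), c), where the edge above (a, b) is weak.

    Return the original arrangement plus the two rearrangements obtained
    by collapsing that edge and re-resolving the three-way split.
    Subtrees a, b and c are left untouched, so every other (strong) edge
    is preserved.
    """
    (a, b), c = parent
    return [((a, b), c), ((a, c), b), ((b, c), a)]

# The gene2 subfamily as drawn in the tree with the yellow edge.
gene2 = (("Mus_musculus|gene2", "Homo_sapiens|gene2"), "Danio_rerio|gene2")

for tree in break_weak_edge(gene2):
    print(tree)
```

A single weak internal edge in a rooted binary tree always yields exactly three arrangements, which is why three trees are considered in this example.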
One such alternate hypothesis is this tree, which groups Mus_musculus|gene2 together with Danio_rerio|gene2, rather than with Homo_sapiens|gene2:
This tree, which groups Homo_sapiens|gene2 and Danio_rerio|gene2 together, is another:
Notice that the orthology and paralogy relationships in the gene2 subfamily are different in each alternate tree. For example, Mus_musculus|gene2 and Danio_rerio|gene2 are paralogous in the tree immediately above, while they are orthologous in the original tree at the top of the page. In other words, the evidence does not strongly support either paralogy or orthology for this gene pair.
In contrast, some homology relationships are supported by all three trees: Mus_musculus|gene1 and Danio_rerio|gene1 are always orthologous, and Mus_musculus|gene1 and Danio_rerio|gene2 are always paralogous.
In PPOD, predicted homology relationships are reported only for those gene pairs whose relationship is the same under every alternate branching pattern around weakly supported (short) edges.
The conservative homology predictions for our example can be seen in the following table:
Homolog Table. Key: P = paralogous; O = orthologous; NA = orthology/paralogy relationship cannot be determined due to weak edges; . = the genes on the X and Y axes are the same.
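The conservative rule can be sketched as follows: label each alternate tree, then keep a call only when every tree agrees. This is a minimal illustration under stated assumptions, not PPOD's code. The species tree ((Mus_musculus, Homo_sapiens), Danio_rerio), the nested-tuple encoding, and all function names are assumptions, and duplications are inferred with a standard last-common-ancestor reconciliation, which may differ from the method PPOD uses.

```python
# Clades of the assumed species tree, smallest first.
SPECIES_CLADES = [
    frozenset({"Mus_musculus"}),
    frozenset({"Homo_sapiens"}),
    frozenset({"Danio_rerio"}),
    frozenset({"Mus_musculus", "Homo_sapiens"}),
    frozenset({"Mus_musculus", "Homo_sapiens", "Danio_rerio"}),
]

def leaves(node):
    """Gene names below a node; a leaf is a 'Species|gene' string."""
    return [node] if isinstance(node, str) else leaves(node[0]) + leaves(node[1])

def species_node(node):
    """Smallest species-tree clade containing every species under node."""
    present = {gene.split("|")[0] for gene in leaves(node)}
    return next(clade for clade in SPECIES_CLADES if present <= clade)

def is_duplication(node):
    """A node is a duplication if it maps to the same clade as a child."""
    here = species_node(node)
    return any(species_node(child) == here for child in node)

def relationships(node, out=None):
    """Map each gene pair under node to 'O' or 'P'."""
    out = {} if out is None else out
    if isinstance(node, str):
        return out
    left, right = node
    label = "P" if is_duplication(node) else "O"
    for a in leaves(left):          # pairs split by this node have it
        for b in leaves(right):     # as their most recent common ancestor
            out[frozenset((a, b))] = label
    relationships(left, out)
    relationships(right, out)
    return out

def conservative_table():
    """Keep a call only if all three alternate trees agree, else 'NA'."""
    m, h, d = "Mus_musculus|gene2", "Homo_sapiens|gene2", "Danio_rerio|gene2"
    gene1 = (("Mus_musculus|gene1", "Homo_sapiens|gene1"), "Danio_rerio|gene1")
    alternatives = [((m, h), d), ((m, d), h), ((h, d), m)]
    calls = [relationships((gene1, g2)) for g2 in alternatives]
    return {pair: calls[0][pair] if len({c[pair] for c in calls}) == 1 else "NA"
            for pair in calls[0]}

if __name__ == "__main__":
    for pair, call in sorted(conservative_table().items(),
                             key=lambda kv: sorted(kv[0])):
        print(" / ".join(sorted(pair)), "->", call)
```

Under these assumptions, only the pairs inside the gene2 subfamily change between the three trees, so those are the pairs that end up as NA.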