Source of input FASTA files
For the Reference Genome Annotation Project data set, our source for the protein FASTA files is the PANTHERdb (version 7.0). The files were downloaded from the PANTHER FTP site (March, 2009). The resulting FASTA files are available here: ftp://gen-ftp.princeton.edu/ppod/ (see P-POD version 4, released December 15th, 2009).
In the updated version of the Heinicke et al. (2007) data set, the FASTA files for all organisms except Plasmodium falciparum were also downloaded from the above PANTHERdb site. The P. falciparum FASTA files were obtained from EBI's Eukaryota genome collection on June 16, 2009 from their ftp link.
Version and settings information
All vs All BLAST:
BLAST version 2.2.19 [Nov-02-2008] was downloaded from NCBI's ftp site. The BLASTp program was run with the settings recommended for InParanoid (blastall -p blastp -F 'm S' -M BLOSUM62 -z 5000000 -e 1e-5 -mI -v max -b max -i fasta_gp/A.fasta -d B -o A-B.xml). Here I stands for the value passed to the -m (output format) option: I was 7 for InParanoid and 8 for OrthoMCL and Jaccard. Because all e-value cutoffs used downstream are 1e-5, blastall's -e option was set to that value.
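As an illustration of how the per-program -m value fits into the command quoted above, the following minimal sketch assembles the argument list in Python. The function name `blastall_args`, the `M_FORMAT` table, and the `max_hits` value of 1000 are hypothetical; the text only says "max" for -v and -b, so the number used here is a placeholder, not the setting that was actually run.

```python
def blastall_args(query_fasta, database, out_path, output_format, max_hits):
    """Assemble the blastall argument list quoted in the text.

    output_format is the value written as I in the text (7 or 8).
    max_hits stands in for the unspecified 'max' given to -v and -b.
    """
    return [
        "blastall", "-p", "blastp",
        "-F", "m S",            # filter setting from the quoted command
        "-M", "BLOSUM62",
        "-z", "5000000",
        "-e", "1e-5",           # e-value cutoff shared by all methods
        "-m", str(output_format),
        "-v", str(max_hits),
        "-b", str(max_hits),
        "-i", query_fasta,
        "-d", database,
        "-o", out_path,
    ]


# -m value per downstream program, as stated in the text.
M_FORMAT = {"InParanoid": 7, "OrthoMCL": 8, "Jaccard": 8}

# max_hits=1000 is a placeholder assumption, not a value from the text.
args = blastall_args("fasta_gp/A.fasta", "B", "A-B.xml",
                     M_FORMAT["InParanoid"], 1000)
```

The resulting list could be handed to `subprocess.run(args, check=True)` on a machine where the legacy blastall binary is installed.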
We used an implementation of the Jaccard clustering algorithm provided by Sam Angiuoli and Owen White and modified it to find homologs across species. In the Jaccard clustering analysis, two proteins are grouped into the same family if they share a significant number of homologs, calculated as follows. First, a list of homologs is generated for each protein, consisting of those sequences whose BLASTP e-values are less than 1e-5 over a total of at least 50% of the length of each sequence. Then the Jaccard index for each pair of proteins is calculated; this is the ratio of the size of the intersection of their homolog sets to the size of their union. Final clusters are generated by linking proteins whose mutual Jaccard index is at least 0.4.
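The calculation described above can be sketched in a few lines of Python. This is not the Angiuoli and White implementation; it is a minimal illustration that assumes the homolog lists have already been filtered by e-value and coverage, that each protein's homolog set includes the protein itself, and that linking is transitive (single linkage). The names `jaccard_index` and `jaccard_clusters` and the toy data are hypothetical.

```python
from itertools import combinations


def jaccard_index(a, b):
    """Size of the intersection of two sets divided by the size of their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(homologs, threshold=0.4):
    """Link proteins whose homolog sets have a Jaccard index >= threshold.

    homologs maps each protein to its (pre-filtered) set of homologs.
    Linked proteins are merged transitively with a small union-find.
    """
    parent = {p: p for p in homologs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for p, q in combinations(sorted(homologs), 2):
        if jaccard_index(homologs[p], homologs[q]) >= threshold:
            parent[find(p)] = find(q)

    clusters = {}
    for p in homologs:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())


# Toy data: C and D are homologs of each other, but share too few
# homologs overall (index 2/6) to be linked at the 0.4 threshold.
homologs = {
    "A": {"A", "B", "C"},
    "B": {"A", "B", "C"},
    "C": {"A", "B", "C", "D"},
    "D": {"C", "D", "E", "F"},
    "E": {"D", "E", "F"},
    "F": {"D", "E", "F"},
}
families = jaccard_clusters(homologs)
```

With this toy input, A, B, and C form one family and D, E, and F another.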
Some families contain large numbers of proteins, over 6600 in one case. This is sometimes the result of a "daisy-chaining" effect: if protein A is placed in the same family as protein B, B in the same family as C, and C in the same family as D, then proteins A through D will all be in the same family even if A and D are unrelated. The Jaccard clustering program, unlike OrthoMCL, does not remove proteins deemed to be too distantly related and is particularly subject to this phenomenon. If you try to view one of these families, it is possible that not all components will be displayed but, as with all families, we provide all available data files for download.
OrthoMCL:
OrthoMCL (version 2.0b6) compares the all-against-all BLASTP scores from a set of genomes, first identifying putative orthologs as reciprocal best hits between pairs of genomes, then identifying candidate recent paralogs as proteins within the same species that are more similar to each other than to any sequence in the other species. All orthologs and recent paralogs are then converted into a graph where the nodes represent the proteins and the edges represent their relationships. A normalization step is then used to correct for systematic biases when comparing pairs of genomes. Finally, the ortholog families are resolved by application of the Markov Clustering algorithm (MCL v. 1.006, 06-058). We used the following OrthoMCL parameters: a P-value exponent cutoff of -5 and a percent match cutoff (pmatch_cutoff) of 50%, i.e. the covered region corresponds to at least 50% of the shorter protein in the query-subject pair.
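The first step described above, finding reciprocal best hits between two genomes, can be sketched as follows. This is not OrthoMCL's code and it omits the paralog detection, normalization, and MCL steps. It assumes hits are given as a mapping from (query, subject) pairs to a score where higher means more similar; the function names and toy scores are hypothetical.

```python
def best_hits(scores):
    """Return a dict mapping each query to its highest-scoring subject.

    scores maps (query, subject) pairs to a similarity score
    (higher is better; an assumption of this sketch).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores for two genomes A and B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
          ("a2", "b1"): 120, ("a2", "b2"): 80}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 120,
          ("b2", "a2"): 80, ("b2", "a1"): 50}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here a2's best hit is b1, but b1's best hit is a1, so only a1 and b1 form a reciprocal pair.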
MultiParanoid finds predicted orthology relationships between proteins from multiple species. It makes use of a clustering algorithm to merge multiple pairwise ortholog groups generated by the InParanoid program. The MultiParanoid version used here (last modified 08-Oct-2007 by Alexeyenko et al. (2006)) included running InParanoid (version 3.0) with the default settings, except that the -unique flag was set to 1 to prevent the same protein from appearing in multiple clusters.
Naïve Ensemble clustering:
A Naïve Ensemble clustering method was used to generate consensus clusters of the OrthoMCL and MultiParanoid results. The method produces supersets of the OrthoMCL and MultiParanoid clusters that share predicted orthologs. For example, if the family OrthoMCL1 contains a, b, and c; Para1 contains a, b, and d; and OrthoMCL2 contains d, e, f, and g, then the Ensemble consensus cluster Nens1 will include a, b, c, d, e, f, and g.
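The merging rule in the example above can be sketched as follows. This is a minimal illustration rather than the code used to build the data set: it assumes each method's output is a mapping from cluster name to a set of protein identifiers, and it merges any clusters that share at least one member. The function name `ensemble_clusters` is hypothetical.

```python
def ensemble_clusters(*cluster_sets):
    """Merge clusters from several methods that share at least one member.

    Each argument maps a cluster name to a set of protein identifiers.
    Returns the merged supersets as sorted lists.
    """
    merged = []
    for clusters in cluster_sets:
        for members in clusters.values():
            current = set(members)
            remaining = []
            for group in merged:
                if group & current:
                    current |= group        # absorb every overlapping group
                else:
                    remaining.append(group)
            remaining.append(current)
            merged = remaining
    return sorted(sorted(group) for group in merged)


# The example from the text.
orthomcl = {"OrthoMCL1": {"a", "b", "c"}, "OrthoMCL2": {"d", "e", "f", "g"}}
multiparanoid = {"Para1": {"a", "b", "d"}}
consensus = ensemble_clusters(orthomcl, multiparanoid)
```

Para1 overlaps both OrthoMCL1 (through a and b) and OrthoMCL2 (through d), so all three collapse into a single consensus cluster containing a through g.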
Multiple sequence alignments were produced by running MAFFT (version 6.705b) using the default settings.
Phylogenetic trees were constructed using PhyML (version 3.0.1). The default settings were used, except the number of relative substitution rate categories (-c flag) was set to 1 instead of the default of 4. Also, the datatype (-d flag) was set to "aa" for amino acid sequences instead of the default "nt" for nucleotide sequences.
Notung:
Version 2.6 of Notung was used to root, clean up, and draw phylogenetic trees and to determine orthologous and paralogous relationships between proteins. The protein trees were reconciled with the standard species tree found here. Species trees were downloaded from the NCBI Taxonomy Browser and some polytomies were resolved manually based on Stechmann A and Cavalier-Smith T. (2003) and Richards TA and Cavalier-Smith T. (2005).
Notung's rearrangement algorithm was used to correct phylogenetic errors arising from weak signal in the sequence data. Short branches arise when there is not enough variation in the multiple alignment to infer the branching order with confidence. If a short branch is associated with a large number of duplications and losses, it suggests an error in phylogenetic inference.
Notung further improves trees by rearranging poorly-supported branches, which have branch lengths below a certain edge weight threshold. For our first run of Notung, in order to choose a threshold, we examined the lengths of ~125,000 branches from the ~27,000 trees with three or more branches. Approximately 80% of the branches fell within a normal distribution, but approximately 20% fell below the minimum length found in this distribution (bin 0, below); you can view a histogram of edge weights here. Based on this evaluation, the edgeweight threshold was set to 5e-4, and for the current run we use the same parameter.
Default values were used for cost of loss, cost of duplication, and cost of conditional duplication (CL=1.0; CD=1.5; CCD=0.0). For both the ROOT and REARRANGE steps, the --nolosses flag was used to save trees without lost nodes explicitly recorded. The edgeweight threshold was set to 5e-4. In the Notung applet, and when generating tables of orthologs and paralogs, the --stronghomologs flag is used; this prevents Notung from calling orthologous and paralogous relationships that are not supported by sufficiently strong branch lengths.
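To show how the cost parameters above interact, the sketch below scores a reconciliation as a weighted sum of event counts. This assumes the reconciliation cost is computed as CD times the number of duplications plus CL times the number of losses plus CCD times the number of conditional duplications; that formula is an assumption of this example and is not stated in the text. The function name `dl_score` and the event counts are hypothetical.

```python
def dl_score(duplications, losses, conditional_duplications=0,
             cost_dup=1.5, cost_loss=1.0, cost_cond_dup=0.0):
    """Weighted sum of reconciliation events (assumed scoring formula).

    Default weights are the values quoted in the text:
    CD=1.5, CL=1.0, CCD=0.0.
    """
    return (cost_dup * duplications
            + cost_loss * losses
            + cost_cond_dup * conditional_duplications)


# Hypothetical reconciliation with 2 duplications and 3 losses.
score = dl_score(duplications=2, losses=3)

# With CCD=0.0, conditional duplications do not change the score.
score_with_conditional = dl_score(duplications=2, losses=3,
                                  conditional_duplications=4)
```

Under these weights a duplication is penalized more heavily than a loss, and conditional duplications contribute nothing to the total.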