A Computational Screen for Type I Polyketide Synthases in Metagenomics Shotgun Data

Background
Polyketides are a diverse group of biotechnologically important secondary metabolites that are produced by multi-domain enzymes called polyketide synthases (PKS).

Methodology/Principal Findings
We have estimated frequencies of type I PKS (PKS I), a PKS subgroup, in natural environments by using hidden Markov models (HMMs) of eight domains to screen predicted proteins from six metagenomic shotgun data sets. As the complex PKS I have similarities to other multi-domain enzymes (such as those of fatty acid biosynthesis), we increased the reliability and resolution of the dataset by constructing maximum-likelihood trees. The combined information of these trees was then used to discriminate true PKS I domains from evolutionarily related but functionally different ones. We were able to identify numerous novel PKS I proteins, the highest density of which was found in Minnesota farm soil with 136 proteins out of 183,536 predicted genes. We also applied the protocol to the UniRef database to improve the annotation of proteins of previously unknown function and identified some new instances of horizontal gene transfer.

Conclusions/Significance
The screening approach proved powerful in identifying PKS I sequences in large sequence data sets and is applicable to many other protein families.


Tree construction
The sequence UniRef100_A0ACI1_from_2372_to_3635 was manually removed from the alignment of the KS domain because it strongly increased the size of the alignment.

Tree evaluation
See Table S1.

Examples for horizontal gene transfer
The following HMM search result sequences are examples of possible gene transfer, as they are fungal proteins (from Aspergillus niger, Chaetomium globosum and Cochliobolus heterostrophus) but group with sequences from Actinobacteria in the AT domain tree:
The following four Danio rerio DH sequences are nested in a small group of fungal sequences that is itself surrounded by sequences from Actinobacteria:

List of newly detected PKS I members in UniRef
See file in PKS_I_db_and_extracts.zip

PKSDB sequences that are placed in non-PKS I branches
The following three TE sequences from PKSDB are placed in one branch that is annotated as a non-PKS I branch:

Taxonomic distribution of PKS domain sequences
See Table S2.

Comparison of the tree topologies with reference trees
The numbers of taxa and the Robinson-Foulds distances of the pruned trees and of the tree pairs can be found in Table S3.
The MT domain was omitted due to its small number of taxa. Because the overlap of taxa between the PP domain trees is very low, the probability that the low distance value observed there arises by chance is quite high. For the other domains, the distance between the test tree and the reference tree is much lower than the 125750 random tree distances (502 trees compared all-against-all, minus the distance between the reference tree and the analysis tree) and is an outlier of this (non-normal) distribution (see box plot in Figure S2).
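The comparison described above can be illustrated with a minimal sketch. It assumes trees are written as nested tuples of taxon names and already pruned to the same taxa; the function names and the toy trees are hypothetical and are not taken from the study, which does not state which software computed the distances. The sketch counts the splits present in only one of two trees (the Robinson-Foulds distance), reports how many random-tree distances are at least as small as an observed one, and reproduces the count of 125750 pairwise distances.

```python
from math import comb


def leaf_set(tree):
    """Return the set of leaf names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial splits of the tree, viewed as unrooted.

    Each split is stored as the side that does not contain a fixed anchor
    leaf, so the same split always gets the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    walk(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must be pruned to the same taxa first")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def fraction_at_or_below(observed, random_distances):
    """Share of random-tree distances that are as small as the observed one."""
    hits = sum(1 for d in random_distances if d <= observed)
    return hits / len(random_distances)


# Toy trees on five taxa (hypothetical, not from the study).
reference = ((("A", "B"), "C"), ("D", "E"))
analysis = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(reference, analysis))

# All-against-all pairs among 502 trees, minus the reference/analysis pair.
print(comb(502, 2) - 1)
```

The last line evaluates to 125750, the number of random tree distances quoted above. Because the distribution is described as non-normal, an empirical fraction such as the one returned by fraction_at_or_below is a more natural summary here than a z-score.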

Comparison of the tree log likelihood values
The log likelihood values of the reference trees were compared to those of the trees with metagenomic sequences and to those of 100 trees with the same number of taxa but random topologies. In all cases, the log likelihood values of the reference trees and of the trees with metagenomic sequences were better than those of the random trees and closer to each other than to the random-tree values.

Comparison with a BLAST-based method
To show that the HMM/tree-based approach is more sensitive and selective than a BLAST-based one, we implemented a pipeline similar to SEARCHPKS. As queries we used the same six domain sequences as SEARCHPKS, taken from the erythromycin-producing PKS I in PKSDB (eryth_002_AT_002.seq, eryth_002_DH_001.seq, eryth_002_ER_001.seq, eryth_002_KR_002.seq, eryth_002_KS_002.seq, eryth_003_TE_001.seq), together with the same E-value cutoffs. With this setup we screened the UniRef proteins and found a total of 17126 domain hit sequences across the six domains. There were 2049 multi-hit and 8743 single-hit proteins. The annotation strings were analyzed and the proteins classified.
The single-hit proteins are dominated by non-PKS I members, and only a small fraction of these sequences come from PKS I proteins. The ratio of PKS I to non-PKS I proteins is much higher for multi-hit proteins. These results show that the BLAST-based PKS search is not very selective and yields too many false positives. On the other hand, they show that combining the information of different domain searches can improve the confidence in positive PKS proteins.
The results are given in Table S4, Table S5 and Table S6.
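The grouping of domain hits into single-hit and multi-hit proteins can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used here: a multi-hit protein is assumed to be one with more than one domain hit, the keyword match on annotation strings is a hypothetical stand-in for the classification step, and all protein identifiers and annotations are invented toy data.

```python
from collections import defaultdict


def group_hits(hits):
    """Map each protein id to the list of domains that hit it."""
    by_protein = defaultdict(list)
    for protein_id, domain in hits:
        by_protein[protein_id].append(domain)
    return dict(by_protein)


def split_single_multi(by_protein):
    """Separate proteins with exactly one domain hit from those with several."""
    single = sorted(p for p, doms in by_protein.items() if len(doms) == 1)
    multi = sorted(p for p, doms in by_protein.items() if len(doms) > 1)
    return single, multi


def pks_fraction(protein_ids, annotations, keywords=("polyketide synthase",)):
    """Fraction of proteins whose annotation contains a keyword (hypothetical rule)."""
    if not protein_ids:
        return 0.0
    matches = sum(
        any(k in annotations.get(p, "").lower() for k in keywords)
        for p in protein_ids
    )
    return matches / len(protein_ids)


# Toy data (invented for illustration only).
hits = [
    ("P1", "KS"), ("P1", "AT"), ("P1", "KR"),
    ("P2", "KR"),
    ("P3", "ER"),
    ("P4", "KS"), ("P4", "AT"),
]
annotations = {
    "P1": "Polyketide synthase type I",
    "P2": "Short-chain dehydrogenase",
    "P3": "Enoyl reductase",
    "P4": "Fatty acid synthase",
}

single, multi = split_single_multi(group_hits(hits))
print(single, pks_fraction(single, annotations))
print(multi, pks_fraction(multi, annotations))
```

In the toy data the multi-hit group contains a fatty acid synthase as well as a PKS I protein, which mirrors the point that multiple domain hits raise confidence without removing all false positives.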

Simulation of 454 pyrosequencing data
To test the ability of the presented pipeline to deal with the short sequences generated by non-Sanger sequencing platforms (e.g. 454 pyrosequencing), we randomly selected 51 AT domain protein sequences found in the UniRef database and extracted subsequences of 33, 83 and 133 amino acids (representing different generations of non-Sanger sequencing). These subsequences were combined with the full UniRef database set, and the resulting sequence collection was screened with the AT domain HMM. Forty-one of the 133-amino-acid subsequences were found, whereas the shorter ones were missed by the HMM (even with a less restrictive E-value cutoff). The 133-amino-acid subsequences were aligned with 881 representative full-length sequences using hmmalign. The same was done with the full original sequences from which these subsequences were taken. The alignments were used to create maximum-likelihood trees with PHYML. A manual comparison of the visualized trees showed that the placement of the short sequences is in general similar to that of the full-length proteins. This observation implies that sequences from newer non-Sanger sequencing projects, which produce longer reads, might be screened successfully with our method, whereas the data produced by early non-Sanger projects might not offer enough information per sequence for successful detection. Sequences that are detected, however, can be correctly classified with maximum-likelihood trees.
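The fragment extraction step can be sketched as below. The text does not specify how the position of each subsequence was chosen, so a uniformly random start position is assumed; the function names, the naming scheme for fragments and the demo sequence are hypothetical. At three nucleotides per codon, the three lengths correspond to reads of roughly 100, 250 and 400 nucleotides.

```python
import random

# Fragment lengths in amino acids, as given in the text.
FRAGMENT_LENGTHS = (33, 83, 133)


def random_fragment(sequence, length, rng):
    """Cut one fragment of the given length at a uniformly random position.

    Returns None if the sequence is shorter than the requested length.
    """
    if len(sequence) < length:
        return None
    start = rng.randrange(len(sequence) - length + 1)
    return sequence[start:start + length]


def simulate_fragments(sequences, lengths=FRAGMENT_LENGTHS, seed=0):
    """Return one fragment per sequence and length, keyed by a derived name."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    fragments = {}
    for name, seq in sequences.items():
        for length in lengths:
            frag = random_fragment(seq, length, rng)
            if frag is not None:
                fragments[f"{name}_len{length}"] = frag
    return fragments


def to_fasta(records):
    """Format a name -> sequence mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Demo protein of 200 residues (invented, not a real AT domain).
demo = {"demo_protein": "ACDEFGHIKLMNPQRSTVWY" * 10}
fragments = simulate_fragments(demo)
print(to_fasta(fragments))
```

The resulting FASTA text could then be appended to the database file before the HMM search, which is the combination step described above.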