PlasClass improves plasmid sequence classification.

Many bacteria contain plasmids, but distinguishing contigs that originate from a plasmid from those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained using a fraction of the known plasmids, and can be difficult to use in practice. We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers, each trained for a different sequence length on the most current set of known plasmid sequences. We tested PlasClass sequence classification on held-out data and simulations, as well as on publicly available bacterial isolates, plasmidome samples, and plasmids assembled from metagenomic samples. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, allowing it to achieve higher F1 scores in classifying sequences from a wide range of datasets. PlasClass also uses significantly less time and memory. PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available under the MIT license at https://github.com/Shamir-Lab/PlasClass.


Introduction
When using high-throughput sequencing to study the presence and dynamics of plasmids in their bacterial hosts, it is often necessary to classify sequences as being of plasmid or bacterial origin. This is especially true in the case of metagenomic sequencing, which can include many sequences of unknown origin and varying lengths. We focus on the challenge of classifying contigs in a metagenomic assembly in order to identify which are of plasmid origin. The current state-of-the-art classifier of plasmid sequences is PlasFlow [3], a neural network-based algorithm. While PlasFlow is successful in classifying small sets of long sequences, it produces less reliable results for short sequences and has difficulty with very large metagenomic sequence datasets due to memory constraints. Here we present PlasClass, a new plasmid sequence classifier that is implemented as an easy-to-use Python package. PlasClass consists of a set of logistic regression classifiers, each trained on sequences of a different length sampled from reference sequences of plasmid and bacterial origin. When PlasClass is applied to a set of sequences, the appropriate length-specific classifier is used for each sequence. We tested PlasClass on simulated data, on bacterial isolates, on a wastewater plasmidome, and on human gut microbiome samples. For shorter sequences, which are the majority of contigs in an assembly, PlasClass achieves better F1 scores than PlasFlow. It is also faster and uses significantly less RAM and disk space. PlasClass is provided at https://github.com/Shamir-Lab/PlasClass.

Training databases
We used reference sequence databases to obtain the training sequences for our classifiers. For the plasmid references we used plasmid sequences listed in PLSDB [2], an up-to-date curated plasmid database. After filtering out duplicate sequences, this database contained 13469 reference plasmids (median length: 53.8 kbp). For the bacterial references we downloaded all complete bacterial genome assemblies from NCBI (download date January 9, 2019). We removed plasmid sequences and filtered out duplicates, leaving 13491 reference chromosomes (median length: 3.7 Mbp). One quarter of the sequences were randomly removed from the databases before training in order to provide a held-out test set (Section 3.1). PlasClass was then retrained on the full databases, and this retrained version was used for testing on assembled data (Sections 3.2-3.5).

Training the classifiers
We sampled sequence fragments of different lengths from the reference sequences with replacement and constructed a k-mer frequency vector for each fragment. Canonical k-mers of lengths 3-7 were used, resulting in a feature vector of length 10952 for each fragment. Fragment lengths were 500k, 100k, 10k, and 1k. For the two shorter lengths (10k and 1k), 90,000 training fragments were used from each class. For 500k and 100k, since there were not enough long plasmids to do the same, we sampled enough fragments to cover all plasmids of sufficient length to a depth of 5. For each length, a logistic regression classifier was trained on the k-mer frequency vectors of the plasmid and bacterial fragments using the scikit-learn machine learning library in Python.
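The feature construction can be illustrated with a minimal sketch. It enumerates the canonical k-mers for k = 3-7 (a k-mer and its reverse complement are treated as one feature) and confirms that there are 10952 of them, matching the feature vector length stated above. The function names are hypothetical, and normalizing the counts separately for each k is an assumption of this sketch, not a detail taken from the PlasClass implementation.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def build_index(ks=range(3, 8)):
    """Map every canonical k-mer (k = 3..7 by default) to a feature position."""
    index = {}
    for k in ks:
        for letters in product("ACGT", repeat=k):
            kmer = "".join(letters)
            if kmer == canonical(kmer):
                index[kmer] = len(index)
    return index

def kmer_frequencies(seq, index, ks=range(3, 8)):
    """Canonical k-mer frequency vector for one fragment.

    Assumption of this sketch: counts are normalized separately for each k,
    and windows containing non-ACGT characters (e.g. N) are skipped.
    """
    seq = seq.upper()
    vec = [0.0] * len(index)
    for k in ks:
        positions = []
        for i in range(len(seq) - k + 1):
            pos = index.get(canonical(seq[i:i + k]))
            if pos is not None:
                positions.append(pos)
        for pos in positions:
            vec[pos] += 1.0 / len(positions)
    return vec

INDEX = build_index()
print(len(INDEX))  # number of canonical k-mers for k = 3..7
```

The count follows from the fact that odd-length k-mers are never their own reverse complement (4^k / 2 canonical k-mers), while for even k there are 4^(k/2) self-complementary k-mers, giving 32 + 136 + 512 + 2080 + 8192 = 10952.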

Length-specific classification
PlasClass uses four logistic regression models to classify sequences of different length scales. Given a dataset, each sequence is classified using the model whose training fragment length is closest to the length of the sequence. Further implementation details can be found in the Supplement (Section S1).
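As an illustration of the model selection step, the sketch below picks one of the four training lengths for a given sequence. The text does not specify how "closest" is measured; this sketch assumes plain absolute difference in base pairs, and a log-scale distance would place the boundaries between models differently. The names `MODEL_LENGTHS` and `closest_model_length` are hypothetical.

```python
# The four training fragment lengths, in bp.
MODEL_LENGTHS = (1_000, 10_000, 100_000, 500_000)

def closest_model_length(seq_len, model_lengths=MODEL_LENGTHS):
    """Return the training length nearest to seq_len.

    Assumption of this sketch: distance is the absolute difference in bp.
    Ties resolve to the shorter model because min() keeps the first minimum.
    """
    return min(model_lengths, key=lambda m: abs(m - seq_len))

for contig_len in (700, 6_000, 250_000, 2_000_000):
    print(contig_len, "->", closest_model_length(contig_len))
```

Under this assumption, every contig shorter than 5.5 kbp is handled by the 1k model, which is the regime where most assembly contigs fall.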

Classification with PlasClass
PlasClass is available at https://github.com/Shamir-Lab/PlasClass. The released version has been retrained using the full set of database references. PlasClass can be used as a command-line tool to classify sequences in an input FASTA file, or it can be imported as a module to classify sequences from within the user's own program. It can be run in parallel mode to achieve faster runtimes. PlasClass is fully documented at the URL provided above; see the Supplement for details (Section S7).

Results
We compared the performance of PlasFlow and PlasClass on both simulated and real data. See the Supplement for technical details (Section S2).

Classifying sequences from held-out references
We sampled overlapping L-long fragments covering the held out plasmids with an overlap of L/2 for L = 100k, 10k and 1k. A matching number of L-long fragments were sampled from the held out bacterial genomes for each length L. Table 1 summarizes the classification results. PlasClass improved precision at the cost of slightly lower recall and had better overall F1 on the shorter sequence lengths. These short sequences can make up the majority of contigs in metagenomic assemblies (see Tables S1, S3 and Figure S1), allowing PlasClass to outperform PlasFlow in many settings.

Table 1: Performance of PlasFlow and PlasClass on fixed length sequence fragments sampled from the held out references.
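The overlapping sampling of the held-out plasmids described in this section can be sketched as a sliding window of length L that advances by L/2. The function name is hypothetical, and the handling of edge cases is an assumption of this sketch: sequences shorter than L yield no fragments, and a tail shorter than one step is left uncovered.

```python
def tile_fragments(seq, frag_len):
    """Yield frag_len-long windows over seq, each overlapping the previous by frag_len // 2.

    Assumptions of this sketch: sequences shorter than frag_len produce no
    fragments, and any tail that does not fit a full window is dropped.
    """
    if frag_len < 2:
        raise ValueError("frag_len must be at least 2")
    step = frag_len // 2
    for start in range(0, len(seq) - frag_len + 1, step):
        yield seq[start:start + frag_len]

# Toy example with L = 4 on a 10-character sequence.
print(list(tile_fragments("ABCDEFGHIJ", 4)))
```

With this scheme every interior position of a plasmid is covered by two fragments, so a plasmid of length n contributes roughly 2n/L fragments of length L.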

Performance on bacterial isolates
We compared the performance of PlasFlow and PlasClass on the isolate assemblies from the benchmark in [1]. For both classifiers precision was very low due to the very small number of contigs of plasmid origin. PlasClass outperformed PlasFlow with an F1 of 1.49 compared to 0.80. Further details can be found in Supplement S3.
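To make the effect of class imbalance concrete, the sketch below computes precision, recall, and F1 from confusion counts. The counts in the example are purely hypothetical and are not taken from the benchmark; they only show how a small number of true plasmid contigs among many chromosomal contigs can drive precision, and therefore F1, down even when recall is high.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 10 true plasmid contigs, of which 8 are found,
# alongside 992 chromosomal contigs wrongly called plasmid.
precision, recall, f1 = precision_recall_f1(tp=8, fp=992, fn=2)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

Because F1 is the harmonic mean of precision and recall, it stays close to the smaller of the two, which is why both classifiers score low in this setting.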

Performance on simulated metagenome assemblies
We simulated two metagenomes, assembled them, and classified the contigs. Performance improved for both methods as short sequences were excluded, and PlasClass outperformed PlasFlow for all size thresholds, achieving F1 scores of 12.35 and 10.86 on the two simulations. See the Supplement for further details (Section S4).

Performance on plasmidome sample
We assembled a wastewater plasmidome [5] and classified the contigs (see Supplement S5 for details). Although the plasmid-enriched setting favors PlasFlow, which sacrifices precision for higher recall, PlasClass still had a higher combined F1. Table 2 compares the performance and resource usage of the methods. PlasClass was more than twice as fast and can be even faster when using multiprocessing. PlasClass also used much less RAM than PlasFlow. These differences become more pronounced as the dataset size grows (results not shown). In addition, PlasFlow writes the feature matrices to disk, while PlasClass does not.

Classifying metagenomic plasmid assemblies
We assembled six publicly available human gut microbiome samples (accessions listed in the Supplement, Section S6) and found plasmid sequences in the assemblies using Recycler [4].
Recycler assembles plasmid sequences based on coverage and circularity, features that are not used by the classifiers. Between 16 and 27 plasmids were assembled per sample (median length: 3.4 kb).
We classified each of the plasmids generated by Recycler to determine the extent of agreement between the sequence classifiers and this orthogonal approach. As seen in Figure 1, PlasClass agreed with Recycler on the same number or more plasmids than PlasFlow in all samples. This suggests that PlasClass can correctly identify more plasmids in real datasets that contain many previously unknown plasmid sequences.