Classification of viruses derived from metagenomic sequences

Summary of the Metagenomics Working Group Discussion

Published as Supplementary information S1 in:

Nature Reviews Microbiology (2017)
doi:10.1038/nrmicro.2016.177
Published online 03 January 2017

DISCUSSION SUMMARY

The metagenomics working group (MWG) comprised an international group of invited experts from the field of metagenomics and representative members of the ICTV Executive Committee (EC); meeting attendance is listed in Appendix I. The meeting hosted an extensive discussion of the challenges posed by the abundance of metagenomic (MG) sequence data currently being generated, the procedures currently used by the ICTV to classify viruses, and the strategies by which the two could be reconciled.

BACKGROUND CONCEPTS 

The meaning of the term “Metagenomic”. There was substantial agreement between the experts and the ICTV EC in most areas of the discussion. Firstly, the MWG as a whole were in agreement about the meaning of the term “Metagenomic” as applied to viruses, despite the wide range of ways in which this term has been used in the literature and the fact that there is a continuum in the amount of other information available for viruses characterised from environmental sampling. It was additionally recognised that MG sequences are not simply to be equated with those from next generation sequencing (NGS) methods, and that viral sequences lacking other biological or experimental information can equally be obtained by PCR and other virus characterisation methods. Nevertheless, it is the unprecedented effectiveness of NGS in characterising viral populations in the environment that has largely created the classification problems we currently face.

There was similar unanimity in the supposition that, with appropriate caveats, detection of a viral sequence in a sample equates to the presence of a virus. It is consequently entirely valid to use such sequences for virus classification in the absence of a virus isolate or other evidence for the physical presence of a virus (such as visualisation of virus particles, disease symptoms in an infected animal or plant, etc.). Furthermore, the MWG considered that, under the appropriate methodological conditions, viral sequences assembled from NGS data were sufficiently accurate and reliable to be used for classification processes.

Technical limitations. The group nevertheless acknowledged the potential problems of deriving viral sequences from mixed viral populations in a sample and the consequent danger, in certain situations, of assembling chimeric genomes. It was also recognised that current methodology often cannot assemble complete genome sequences from viruses with segmented genomes or with multipartite genomes packaged into separate virus particles. Particularly for the latter, the means to reliably link together sequences from different genome segments may remain unavailable. Another practical problem arises from the integration of virus-derived sequences into the genomes of the infected host, often in the germline; these sequences are incapable of generating infectious virus, although they are often transcribed. These are all caveats that must be addressed experimentally for MG sequence data to be used for classification purposes. They are not, however, fundamental barriers to classification, as the technology used to create MG sequences is continuously improving and many of the current technical problems, particularly with assembly, will be resolved.

The ICTV species definition. The current definition of species formulated by the ICTV is:

A species is the lowest taxonomic level in the hierarchy approved by the ICTV. A species is a monophyletic group of viruses whose properties can be distinguished from those of other species by multiple criteria.

The MWG acknowledged the concern expressed by many virologists that an assembled MG sequence derived from environmental sampling lacks the “multiple criteria” required for a species definition. Such criteria are traditionally information on its biology (host range, epidemiology, disease associations), morphology, and behaviour in cell culture, many of which are used as defining features of different species.

However, it was considered by the MWG that a genome sequence, even without associated other biological information, can possess sufficiently varied attributes to support the creation of a species. In addition to phylogenetic analyses using robust models, these may include:

  1. The presence of genes in the appropriate genome region that indicate the degree of phylogenetic relatedness of viruses to each other.
  2. The overall genome organisation of the virus (gene order), gene complements and replication strategy, all of which can be reliably inferred bioinformatically from the genome sequence.
  3. In some families, the presence or absence of distinctive motifs, characteristics of polyprotein cleavage sites, internal ribosome entry sites (IRES), etc.
  4. The genome sequences additionally provide a resource for functional studies, such as recombinant protein expression for structural investigations and serology assay development, as well as the development of molecular tests for epidemiological screening and clinical studies. Experimental data derived in this way provide further biological characterisation that can be used to support the proposed classification.

Collectively, these attributes provide the information on monophyly and the bioinformatic characterisation of the virus needed to satisfy the multiple criteria required for a species definition. The sequence furthermore provides considerable further information on the evolutionary history of the virus and its relationships with other viruses at different taxonomic layers.

Assignment of MG sequences to existing virus families. The MWG recognises the range of ways and criteria by which viruses in the existing classification are assigned to species and genus ranks within the currently designated virus families. These “classification frameworks” are specific to each virus family and are based on combinations of factors that may include both biological information (e.g. host range, geographical distribution or pathogenesis) and genetic attributes such as degree of sequence divergence, gene complements or other aspects of genome organisation.

While biological information can be part of the definition of existing taxa, there is invariably associated sequence data for members of each. Thus, while an MG sequence may lack the biological information previously used by the ICTV to support assignment to a particular taxon, assignment of new species and genera to an existing virus family on the basis of MG sequences is permissible if sequence relationships (phylogeny, sequence divergence, gene complements) are equivalent to those that exist among viruses classified by other means. Assignment of MG viruses as new species is possible within existing ICTV rules and has already been done in a number of families in recent years.

Classification of MG sequences into new families. The MWG recognises that the assignment of MG sequences as members of newly created families is subject to different constraints and difficulties from the previously discussed addition of MG sequences to an existing virus family with a classification framework. This is because the criteria used for an informative subdivision of viruses into different genera and species vary so greatly between existing virus families and cannot be reliably predicted a priori in the absence of considerable information on virus relationships within the family. Particularly for families of viruses infecting animals and plants, species are often defined by disease or other biological attributes. While these divisions are informative, there is consequently huge variability in the degree of sequence divergence between species and genera, which cannot be predicted from inspection of sequence data alone.

For the creation of new virus families and the assignment of MG sequences to them, the absence of biological information requires that taxon levels be governed by clustering and patterns of variability between MG sequences. For the assignment of genus and species levels to be informative and of practical use, a considerable amount of comparative sequence information is needed for the viruses to be assigned to the family.
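As an illustration only, the idea of letting clustering and sequence variability govern taxon levels can be sketched in a few lines of Python. The sketch assumes pre-aligned sequences, uses the simple proportion of differing sites (p-distance) as the divergence measure, and groups sequences by single-linkage clustering at a chosen threshold. The function names, the example sequences and the threshold values are all hypothetical; the MWG did not specify any particular distance measure, clustering method or cut-off, and real demarcation criteria are family-specific.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def cluster_by_threshold(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Single-linkage clustering: two sequences share a cluster if they are
    connected by a chain of pairs whose p-distance is <= threshold."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    # Sort for a deterministic result.
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical aligned MG sequences (toy data, not real viral genomes).
example = {
    "mg1": "ACGTACGTAC",
    "mg2": "ACGTACGTAT",
    "mg3": "TTGTCCGAAC",
    "mg4": "TTGTCCGAAT",
}
tight_groups = cluster_by_threshold(example, threshold=0.2)  # hypothetical cut-off
loose_groups = cluster_by_threshold(example, threshold=0.4)  # hypothetical cut-off
```

With only four sequences, the choice of threshold is essentially arbitrary, which is the point made in the text: many members are needed before the distribution of pairwise divergence shows where natural breaks between candidate species and genera lie.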

The MWG unanimously decided that new virus families populated entirely by viruses identified from MG sequence data could be created. There was, however, substantial agreement on the caveat that multiple examples of members of the new virus family would be required to establish a classification framework for lower taxa. Such variants would provide the required information on genetic divergence and similarities in genome organisation and gene complements for designation of appropriate genus and species categories.

CLASSIFICATION PROCEDURES

MG sequence nomenclature. Most of the MWG considered that the ICTV should not be prescriptive about the formation of names and that practices currently used within individual Study Groups (SGs) for taxon nomenclature could be applied equally to new taxa created for MG sequences. This recognises the fact, established after some discussion, that it is the component viruses within a taxon that may or may not be metagenomically derived, not the taxon itself. It is quite conceivable that a taxon may contain viruses derived by multiple methods. Furthermore, a species comprising MG sequences only may ultimately come to contain an additional member derived from virus isolation and with defined biological properties. Thus MG status belongs to, and is recoverable from, the sequence record, not the taxon to which it is assigned.
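The distinction between MG status as a property of the sequence record and MG status as a property of the taxon can be made concrete with a small data-model sketch. Everything here is hypothetical and purely illustrative: the class names, field names, accession strings and species name are invented for the example and do not correspond to any ICTV or sequence database schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Derivation(Enum):
    """How a sequence was obtained (hypothetical two-value vocabulary)."""
    METAGENOMIC = "metagenomic"
    ISOLATE = "isolate"


@dataclass(frozen=True)
class SequenceRecord:
    accession: str          # hypothetical identifier
    derivation: Derivation  # MG status is stored here, on the record


@dataclass
class Taxon:
    name: str
    rank: str
    members: list[SequenceRecord] = field(default_factory=list)

    def derivations(self) -> set[Derivation]:
        """Derivation methods represented among the current members."""
        return {m.derivation for m in self.members}

    def is_mg_only(self) -> bool:
        """True while every member is metagenomically derived.

        This is computed from the members, never stored on the taxon,
        so it changes automatically when an isolate is added.
        """
        return self.derivations() == {Derivation.METAGENOMIC}


# A species initially populated by MG sequences only...
species = Taxon(name="Hypothetical virus 1", rank="species")
species.members.append(SequenceRecord("MG000001", Derivation.METAGENOMIC))
mg_only_before = species.is_mg_only()

# ...later gains a member obtained by virus isolation.
species.members.append(SequenceRecord("ISO00001", Derivation.ISOLATE))
mg_only_after = species.is_mg_only()
```

Because the taxon carries no MG flag of its own, neither its name nor its rank needs to change when its membership becomes mixed, which is consistent with the view that MG-derived taxa require no separate nomenclature.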

MG taxon proposals (TPs). The MWG was in substantial agreement that the TP procedure could be improved through electronic submission and appropriate quality checks; this is indeed under development by the ICTV. TPs allow multiple new species to be proposed in the same submission form, which is of value for the classification of large MG datasets.

The MWG were largely in agreement that, with appropriate oversight and under appropriate circumstances, species assignments could be made at the discretion of the relevant SG and sub-Committee chair without the degree of scrutiny normally required by the EC. This would allow species designations to be incorporated into the ICTV classification on a continuous basis and much more rapidly than under the current annual procedure. This would benefit the virology community and the public databases that seek to provide classification information on new sequence submissions. There would be some procedural difficulties associated with this change, but the EC will discuss the possibility.

There was considerable discussion of the possibility of creating more higher-level taxonomic groupings, such as orders, that would recognise relatively distant relationships between existing virus families and also provide a placeholder for sequence databases and Study Groups to use for otherwise unclassified MG sequences. Typically, the latter may include viruses without close relatives for which a formal designation into family, genus and species might be inappropriate or premature in the absence of a classification framework with which to usefully specify those taxonomic layers. A much-discussed example was a potential new order containing viruses with circular ssDNA genomes and a Rep gene, as there are large numbers of such sequences that can currently only be described as unclassified viruses.

SUMMARY

  • The presence of viral sequences in a sample indicates the presence of a virus under the appropriate circumstances
  • Such viral sequences have sufficient defining characteristics to enable their classification as additional taxa in existing virus families
  • Such sequences can additionally be used to justify the creation of new virus families, although further information on diversity and clustering is required to justify the formation of genus and species ranks within the family
  • Virus taxa created on the basis of MG sequences require no different nomenclature from conventionally classified viruses
  • Procedures for the submission and approval of taxonomy proposals are to be modified to accommodate larger MG-derived datasets