Wolfram Computation Meets Knowledge

Newick Trees, Proximity Resources and Accessions Analyzing SARS-CoV-2 Genetic Sequences

Newick Trees, Proximity Resources and Accessions: Analyzing SARS-CoV-2 Genetic Sequences

While working with SARS-CoV-2 genetic data in the Wolfram Data Repository, we noticed that there was frequently only a relative handful of differences compared with the overall size of the sequences. This allowed us a number of interesting opportunities for further processing.

First of all, with so much in common, the sequences could be aligned more rapidly by checking for large identical regions and then reducing the alignment task to the sections with differences. Let’s look at an example. First, we’ll gather a BioSequence for the reference SARS-CoV-2 from a Wolfram Data Repository we maintain for this:

Engage with the code in this post by downloading the Wolfram Notebook
refSARSCoV2 = ResourceData
&#10005


Next, we will arbitrarily pick another SARS-CoV-2 sequence:

otherSARSCoV2Seq =  BioSequence
&#10005


The resource function AlignNearlyIdenticalSequences allows us to find an alignment between these sequences rather quickly:

First
&#10005


The alignment itself shows the similarities and differences between the two sequences. Given that these sequences are rather long with broad sections of similarity, it seems helpful to extract just the differences. To accomplish this, we created the resource function AlignmentToPositionDifferences.

AlignmentToPositionDifferences converts an alignment to a list of positions with replacement rules. These can be interpreted as where particular changes might be made in the first sequence to yield the second sequence:

diffs = ResourceFunction
&#10005


These position differences have many different possible uses. For example, with a number of similar sequences, you might want to align them to a reference. You can then gather the sequences that have similar differences or use them as features in other analyses and potentially combine them with dates or geographic data. A handful of these applications can be seen in the Analysis section at the SARS-CoV-2 genetics data repository.

Beyond its analytical uses, this is also a great method for compression. Given a reference sequence and a number of variant sequences, we can reduce the size of the variant sequences by nearly an order of magnitude over a raw string representation:

ByteCount
&#10005


ByteCount
&#10005


To restore a sequence from these positional differences, we’ve developed the ReconstituteSequenceFromReferenceDifferences resource function. Given the difference and the reference sequence, the function reconstitutes the other sequence:

reconstitutedSequence =  ResourceFunction
&#10005


When working with sequence differences for analysis, there are always two different kinds of possible analysis: where there might be a difference versus where there must be a difference. This occurs because the sequencing process does not always produce exact results about which base is at a particular position. Fortunately, there is a broadly adopted standard for which additional letters are used to represent potential choices among specific bases. These letters are called degenerate letters:

Entity
&#10005


If you’re working with cases where there might be a difference between two sequences, you can use alignment differences directly:

mightBeDifferent = ResourceFunction
&#10005


If you’re interested in where they must be different, however, the resource function RemoveDegenerateSequenceDifferences will reduce the differences to those without a degenerate match, eliminating positional differences when no disparity remains:

mustBeDifferences = ResourceFunction
&#10005


This is a practical concern, as many SARS-CoV-2 sequences have a number of degenerate differences with the reference sequence:

diffs = ResourceFunction
&#10005


When degenerate letter matches are accounted for, the differences are reduced considerably:

ResourceFunction
&#10005


Visualizing a DNA Sequence

One common way to visualize DNA and RNA sequences is by using frequency chaos game representation (FCGR). For this purpose, there is the FCGRImage resource function:

ResourceFunction
&#10005


Gauging Proximities between Genetic Sequences

Another useful thing to do with a set of genomes is to show them in a way that indicates the relative distances between pairs. We will show two ways to do this. Here we download a set of coronaviruses, which can take up to a minute or so, to see which ones seem closest to SARS-CoV-2:

manyviruses = {   "MT121216", "MT040333", "MN996532", "MG772933", "MG772934",  "KY352407",   "AY390556", "EU371564", "AY394985", "NC_004718", "AY304488",  "EU371561", "AY278488",   "AY729016", "NC_006579", "NC_019843.3", "NC_038294",   "KF569996", "GQ153548", "DQ648856", "GQ153539", "JX993987",  "JX993988",   "MN908947.3", "MN938384", "MN975262", "MN985325", "MN988713" ,  "MN994467", "MN996531"};vseqs = Map
&#10005


Now get the short names and the genome strings:

names = vseqs
&#10005


Use the resource function PhylogeneticTreePlot to create a dendrogram from these sequences:

dgram = ResourceFunction
&#10005


Another visualization is as a dimension-reduced point set. We use the resource function MultidimensionalScaling (hereafter MDS) for this purpose. First we need to create vectors representing the genome sequences, and then reduce them to three dimensions. There are many ways to do this. We will employ the process used in PhylogeneticTreePlot.

We start by creating the FCGR image data arrays. We use a Fourier discrete cosine transform to remove high-frequency details, then use a common method from principal component analysis (PCA) to get vectors of length 40:

fcgrs = Map
&#10005


The final step is to use MDS to reduce to 3D:

styledGenomes = Apply
&#10005


You’ll see that the SARS and the SARS-CoV-2 families each cluster closely and both separate well from the various other coronaviruses.

Working with Preexisting Proximity Resources

Throughout COVID-19, there has been a number of great analyses with results derived from data not available to the general public. These proximity results include the Newick trees published by Nextstrain. Newick is a format for expressing phylogenetic trees in terms of nested branches with distances. Here is a trivial example shown with an imported result and dendrogram using the ImportNewickString and NewickDendrogram resource functions:

demoString = "(a:2, (b:1, d:3):4, c:5):0";importedDemo = ResourceFunction
&#10005


ResourceFunction
&#10005


We observe that the nodes b and c are at the same depth, given that they are the same overall distance from the root (4 + 1 and 5, respectively).

This Nextstrain COVID-19 data is provided via a link at the bottom of the Nextstrain global page. Once that data is downloaded, it’s ready to be imported:

nextstrainGlobalNewickString =  Import
&#10005


nexstrainImported = ResourceFunction
&#10005


This tree is too large to display in a single dendrogram. Instead, we can show the tree up to a given level of nesting, truncated recursively from the root:

truncateNewickAtDepth

truncateNewickAtDepth

&#10005


In this tree, all of the non-strain nodes are of the form NODE_<number>. If we’d like to look further into the tree, we can find such a node and similarly display a truncated subset of it:

findNewickNode
&#10005


Of course, this is just a small taste of the analysis that is possible for the Newick format or the Nextstrain data and would be a good area for further extension.

Identifiers and Further Information

Accessions and other identifiers are useful for more than gathering their sequences and also offer links to supporting information, such as related PubMed entries:

ResourceFunction
&#10005


Sometimes there can be a fair number of these resources:

ResourceFunction
&#10005


Supporting Metadata

In addition to providing resources for working directly with sequences, sometimes it is also useful to operate on the metadata for functions. For example, BioSequenceTranslate will accept arguments determining what translation table is used. All of the typical ones are provided through the "GeneticTranslationTable" entities:

EntityList
&#10005


These translations are provided in both expansive and terse forms, which are easier to edit and more compact to use, respectively:

Entity
&#10005


Entity
&#10005


You may have occasion, however, to use a custom translation and to want to move between the more expansive and condensed forms as you switch between editing and use. The resource function NCBITranslationTableConvert switches between these formats:

ResourceFunction
&#10005


ResourceFunction
&#10005


Beyond Genetic Sequences: Protein Sequences

Gene sequences are not the only kind of biomolecular sequence with supporting functionality. The resource function PDBImport allows you to access protein properties and diagrams from the RCSB Protein Data Bank identifier:

ResourceFunction
&#10005


An Invitation

The Wolfram Function Repository isn’t only for Wolfram employees; we review and publish submissions by and for our users. One user-submitted item, DNAAlignmentPlot, produces a nifty plot of the results from sequence alignment:

ResourceFunction
&#10005


There is so much more yet to do. If you find yourself with sequence processing functionality that you’d like to share, even if only for your own later convenience, consider submitting it to the Function Repository. We would be delighted to help deliver it to whomever happens to need it.

Get full access to the latest Wolfram Language functionality with a Mathematica 12.3 or Wolfram|One trial.

Comments

Join the discussion

!Please enter your comment (at least 5 characters).

!Please enter your name.

!Please enter a valid email address.

3 comments

  1. By coincidence of timing, John and I used much of this functionality in a recent Wolfram Community post:

    https://community.wolfram.com/groups/-/m/t/2346172

    The goal therein was to analyze sequenced SARS-CoV-2 genomes from Victoria Australia that are claimed to be of the Kappa variant. We were able to show that they are not quite in the mainstream of Kappa, as they have a few mutations not generally found in Kappa, and lack a few that are in most Kappa samples we used.

    Reply
  2. Mathematica is such a powerful platform to explore computable biology topics. This is a fantastic blog post showcasing this. Thank you, John & Daniel!

    As an educator it is an uphill battle to convince students that Mathematica is relevant for biology when many students are indoctrinated into python and/or R first. The introduction of new core language functions like BioSequence is very welcome and I hope more is coming.

    One core algorithm / data structure that Mathematica is missing for modern DNA/RNA sequencing projects is a performant implementation of the FM-Index (https://en.wikipedia.org/wiki/FM-index). This data structure has relevance beyond bioinformatics and might be a worthwhile topic for developer effort.

    Thanks again for the great post!

    Reply
    • Very interesting,
      however, the names of functions and attributes could be a bit longer.

      For example “ReconstituteSequenceFromReferenceDifferences” is too short.

      After all, the code is nearly impossible to repeat – numerous problems about brackets usage.
      May be because the data format is under modifications.

      StringPartition::string: String expected at position 1 in StringPartition[BioSequence[DNA,ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCT\[Ellipsis] TGGGCTATATAAACGTTTTCGCTTTTCCGTTTACGATATATAGTCTACTCTTGTGCAGAATGAATTCTCGTAACTACATAGCACAAGTAGATGTAGTTAACTTTAATCTCACATAGCAATCTTTAATCAGTGTGTAACATTAGGGAGGACTTGAAAGAGCCACCACATTTTCACCGAGGCCACGCGGAGTACGATCGAGTGTACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA],1000].

      Reply