University of Maryland – AC + ManyNets


             VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease


Authors and Affiliations:

Manuel Freire, Universidad Autónoma de Madrid, [PRIMARY]
Awalin Sopan, University of Maryland,


We have used AC and ManyNets to analyze the dataset. AC is a plagiarism detection tool that is also useful for general similarity analysis, developed at the Universidad Autónoma de Madrid by Manuel Freire. It is available at, though we have modified it by adding a similarity test that is specific for this Mini-Challenge. ManyNets is a network analysis and visualization tool developed at the Human-Computer Interaction Lab (HCIL), University of Maryland, by both authors. More information on this tool is available at - source and binaries are available on demand. ManyNets uses SocialAction, also developed at the HCIL, to display node-link diagrams.

The new test computes distance between two same-length strings as the number of single-character differences, and can generate a file with those distances suitable for use within a ManyNets dataset. We rate the programming effort to add this test at around 3 hours. Before developing this test, we experimented with AC’s default similarity tests, based on normalized compression distance (NCD). However, since all differences seem to correspond to single-character substitutions (we found no fragment duplications or translocations), the extra power afforded by NCD is unnecessary. Additionally, questions 3.3 and 3.4 directly ask for lists of mutations.

Before loading the data into AC, we used a small C program to split the sequences into individual files within individual folders. This required around 30 minutes of time, and could have been accomplished manually using a text processor. A similar amount of time was spent building suitable datasets from the output of AC’s new challenge-specific test.


analysis video



MC3.1: What is the region or country of origin for the current outbreak?  Please provide your answer as the name of the native viral strain along with a brief explanation. (max 150 words)


The native strain is Nigeria_B.


We created separate files for each strain. We then loaded them into AC, and compared all strains using a test that assigns distance(a,b) = changed_characters(a,b), normalized between 0 and 1 to comply with AC’s test conventions.



Figure 1: Network view from AC, showing strains as nodes and distances as edges. The dialog window displays a side-by-side comparison between 531 and Nigeria_B. The longest duplicate runs are color-coded in matching colors; this would have revealed re-arrangements of entire subsequences, but we only see single-character substitutions. The histogram-with-slider located under the network view is used to hide edges with distance > 0.09.


Figure 2: Dendrogram view from AC. Branch x-coordinate corresponds to the distance at which branches merge using complete linkage, and the slider controls network visibility. Branches within the shaded area, controlled with the slider, are visible in the network pane. Nigeria_B is close to the mutant strains and far from other country-strains.


MC3.2:  Over time, the virus spreads and the diversity of the virus increases as it mutates.  Two patients infected with the Drafa virus are in the same hospital as Nicolai.  Nicolai has a strain identified by sequence 583.  One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51.  Assume only a single viral strain is in each patient.  Which patient likely contracted the illness from Nicolai and why?  Please provide your answer as the sequence number along with a brief explanation. (max 150 words)


The patient with strain 123 probably contracted the disease from Nicolai. Strains 583 and 123 are very close. Strain 51 is closer to other strains (most notably 531) than to 123 or 538.


Figure 3: From within AC, this network is centered on 583 (country strains are no longer included). Thick, red edges correspond to smaller distances, such as between 583 and 123, with 1 change. There are 3 changes between 583 and 51, represented with thin edge; 51 can also be seen to be similar to 531 than to any other node.


We built a ManyNets dataset from the distances generated by AC. We filtered the resulting (complete) network to remove all edges that do not include 583 as source.


Figure 4: All edges with source in 583 as seen in ManyNets, sorted by number of base-substitutions, with the edges for 583-to-123 and 583-to-51 highlighted.


MC3.3:  Signs and symptoms of the Drafa virus are varied and humans react differently to infection.  Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.  (max 150 words)

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic).  The mutations involve one or more base substitutions.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)


Taking, as a root, strain 531,

T -> C, 841 and A -> T, 954 (path from 531 to 583)
A -> C, 160 and A -> G, 222 and T -> C, 789 (path from 531 to 952)
A -> C, 268 and A -> T, 842 (path from 531 to 99)


Using ManyNets, and displaying the network with SocialAction, we can interactively filter edges to remove all those that represent more than X base-substitutions. All (14) “Severe” strains are still connected to the main component (a tree) by single base-substitution edges.



Figure 5: Coloring the single base-substitution tree by symptom severity. Red = Severe, Brown = Moderate, Blue = Mild. The answer for MC3.3 describes the mutations from 531 to the three branches (= minimal base-substitution paths) containing severe cases.



Figure 6: Mutations from 531. The central column contains vertical dashes to represent base-substitutions from 531. Hovering over any of these dashes describes the corresponding substitution. Selecting the “Severe” cases and looking at the dashes allows hypotheses to be formulated.


MC3.4:  Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question.  To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important.  Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions.  In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups.  For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).


Using 531 as root, the 3 most dangerous strains are

A -> C, 268 and T -> C, 841 and A -> T, 945 (strain 123)

A -> C, 268 and T -> C, 526 and T -> C, 841 and A -> T, 945 (strain 118)

A -> C, 196 and T -> C, 841 and G -> C, 847 and A -> T, 945 (strain 501)


We followed the same procedure as before, but instead of using “severity”, we built a composite column that adds up all the numerical values of each of the tabulated disease characteristics. This is not exactly “equal weight”, since most characteristics have 3 values (mapped to 1, 2 and 3), as compared to “complications”, which has only 2 (mapped to 1 and 2). Mapping major “complications” to a 3 instead of a 2 would cause a 4-way tie for “most dangerous strain”, adding 202 to the current list of 123, 118 and 501.



Figure 7: The resulting table. The last column is formed by adding up the numerical mappings of all strain-characteristic columns. After sorting by this “badness-column”, we have used the topmost 3 cases as our answer, looking up the mutations in the corresponding edge table.


Figure 8: The resulting one-base-substitution-per-edge tree, encoding overall danger as node color. Red is most dangerous (most danger points, as per the “badness” column In Figure 7), blue is least dangerous.



Figure 9: The substitutions required to mutate from 531 to the 3 most dangerous strains. These strains have been filtered from the list of all edges from 531 to all other strains – similar to that of Figure 4.


Figure 10: On the independence of single-base substitutions. The table contains all edges within the single-base-substitution network seen in previous figures. Strain 770 is not present, as it is at least 2 substitutions apart from any other strain. Notice that strain 770 is not interesting in the context of tasks 3.3 and 3.4: it is of moderate symptoms, medium mortality, minor symptoms, and low at-risk vulnerability. There are 55 unique edges in this network (discarding reverse edges), and 1406 * 3 possible mutations. If base-substitutions are random and independent, we expect few, if any, duplicate base substitutions. We have highlighted the single duplicate (A -> C, 268 is present in 2 branches) - not wholly unexpected, according to the birthday paradox.


The tree represented in Figure 5 and Figure 8 is a valid way to reason on the paths between strains, since the shortest path between any pair of strains passes through the branches of that tree, with a single exception. Figure 10 is critical in that regard: it proves that, except in the case of 99 – 123 (where the shortest path avoids the duplicate A -> C, 268 edge), this is indeed the case. Any other exceptions would have showed up as duplicates. This makes the analysis considerably simpler, providing a clear most-probable ancestry for each strain. We do not have ground truth, so the most probable (shortest, step-by-step) paths need not correspond to actual strain evolution; however, assuming independence of mutations and almost complete data (there are only two small gaps, to 770 and to 49 / 961 – a small pair seen isolated in Figure 8), our confidence in our answers is very high.


Figure 11: In this dataset, it is entirely equivalent to look at the frequency of mutations as it is to look at the tree-paths in the single-substitution network, as each mutation occurs in exactly one tree path (again, with the small exception of A -> C, 268). However, it is much easier to reason looking at the tree than looking at sequences of mutations (tree-paths). The leftmost figure answers question 3.3 using the “multi-column overview” feature of the latest version of ManyNets. The rightmost figure provides an alternate, equivalent view showing the node table of the network with edges from 531 to other strains. Annotations have been added to both figures externally to ManyNets, but roughly correspond to the tooltips that would have been presented when mousing over the relevant areas.


As a final observation, we have chosen strain 531 as root for our answers to tasks 3.3 and 3.4 due to its high betweenness centrality (making it ideal to describe mutations with a minimal number of base substitutions and most likely evolutionary path), and its status as most probable original outbreak strain (nearest to Nigeria_B) as described in our answer to task 3.1.


Web Accessibility