Wolfsschanze: Genomic Sequence Alignment with Dotplots

 


Genomic Sequence Alignment with Dotplots


 

Pairwise Sequence Alignment

There is no universally applicable, precise notion of similarity; instead, we choose the technique best suited to each case. An alignment is an arrangement of sequences that highlights where they are similar and where they differ. It follows that the optimal alignment is the arrangement that exhibits the most significant similarities and the fewest differences.

There are three extensively used methods in sequence alignment:

  • Segment Methods - every window of a predetermined size in one sequence is compared in turn against the other sequence. Dotplots use this approach.
  • Optimal Global Alignment - the best match is found across the entire length of the sequences, taking gaps into consideration. This is a highly specific technique and may lead to erroneous results for large sequences.
  • Optimal Local Alignment - this algorithm searches for the best local alignment, explicitly taking gaps into consideration.
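
As a rough illustration of the two optimal alignment methods, the sketch below scores a global and a local alignment with the standard dynamic programming recurrences (usually credited to Needleman-Wunsch and Smith-Waterman respectively). The class name, the method names and the scoring values (+1 for a match, -1 for a mismatch, -1 per gap position) are assumptions chosen for the example; it computes only the score, not the alignment itself.

```java
public class AlignmentScoreSketch {

    // Assumed scoring scheme, chosen only for illustration.
    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP = -1;

    static int substitution(char x, char y) {
        return x == y ? MATCH : MISMATCH;
    }

    /** Best score for aligning all of a against all of b (global). */
    static int globalScore(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] score = new int[n + 1][m + 1];
        // Aligning a prefix against nothing costs one gap per character.
        for (int i = 1; i <= n; i++) score[i][0] = i * GAP;
        for (int j = 1; j <= m; j++) score[0][j] = j * GAP;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = score[i - 1][j - 1] + substitution(a.charAt(i - 1), b.charAt(j - 1));
                int up = score[i - 1][j] + GAP;   // gap in b
                int left = score[i][j - 1] + GAP; // gap in a
                score[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return score[n][m];
    }

    /** Best score over all pairs of substrings of a and b (local). */
    static int localScore(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] score = new int[n + 1][m + 1]; // first row and column stay 0
        int best = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = score[i - 1][j - 1] + substitution(a.charAt(i - 1), b.charAt(j - 1));
                int up = score[i - 1][j] + GAP;
                int left = score[i][j - 1] + GAP;
                // A local alignment may restart anywhere, so a cell never drops below 0.
                int cell = Math.max(0, Math.max(diag, Math.max(up, left)));
                score[i][j] = cell;
                best = Math.max(best, cell);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up demo sequences.
        System.out.println("global: " + globalScore("ACGT", "AGT"));
        System.out.println("local:  " + localScore("TTACGTTT", "GGACGGG"));
    }
}
```

The only difference between the two methods is that the local version clamps each cell at zero and reports the best cell anywhere in the table, which lets a well-matched region score highly even when the rest of the sequences are unrelated.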

Dotplots

Dotplots provide an intuitive representation of the comparison between two sequences, making them one of the most commonly used graphical techniques. The two sequences are placed along the X and Y axes of a matrix, and each significant match is marked with a dot. Where the sequences are similar, the matches fall along the main diagonal; how far the dots stray from the main diagonal depends on the differences between the sequences. Aside from the two sequences themselves, two main parameters affect the appearance of the dotplot:
  • Window Size - this defines how many elements (bases of the genetic code in this case) we try to match in each comparison. The larger the window size, the more stringent the requirements for a match.
  • The Threshold - this dictates how many mismatches can be tolerated within a window before we classify the comparison as a mismatch. Thus, as we increase the window size we would also expect to increase the threshold, to prevent the match requirements from becoming too stringent.
The first of the following two screenshots shows how a threshold that is large relative to a small window tolerates more partial and accidental matches. A large threshold and a small window are recommended in cases where the similarity may be weak or the genome has undergone extensive mutation.

The second image shows a small threshold combined with a large window, where the basic laws of probability dictate that the chance of an accidental match is very low. This "unforgiving" combination of threshold and window size is used where the hypothesis of similarity is strong, or where similar genomes are mutating slowly.
Fig 1. Dot-plot with a high threshold and a small window.
Fig 2. Dot-plot with a comparatively low threshold in relation to a large window.
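
The interplay between the window size and the threshold can be sketched in a few lines of Java. This is a minimal illustration rather than the DotPlot.java source: the class name, the method names and the demo sequences are assumptions, and the threshold is interpreted as described above, that is, as the number of mismatches tolerated inside one window.

```java
public class DotPlotSketch {

    /**
     * dots[i][j] is true when the window starting at a[i] and the window
     * starting at b[j] differ in at most `threshold` positions.
     */
    static boolean[][] dotplot(String a, String b, int window, int threshold) {
        int rows = a.length() - window + 1;
        int cols = b.length() - window + 1;
        if (window < 1 || rows < 1 || cols < 1) {
            return new boolean[0][0]; // no complete window fits
        }
        boolean[][] dots = new boolean[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                int mismatches = 0;
                // Stop early once the tolerated number of mismatches is exceeded.
                for (int k = 0; k < window && mismatches <= threshold; k++) {
                    if (a.charAt(i + k) != b.charAt(j + k)) {
                        mismatches++;
                    }
                }
                dots[i][j] = mismatches <= threshold;
            }
        }
        return dots;
    }

    /** Draws the matrix as text: '*' for a dot, '.' for no dot. */
    static String render(boolean[][] dots) {
        StringBuilder out = new StringBuilder();
        for (boolean[] row : dots) {
            for (boolean dot : row) {
                out.append(dot ? '*' : '.');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String a = "ACGTACGTTA"; // made-up demo sequences
        String b = "ACGAACGTTA";
        System.out.println("window 4, threshold 0 (stringent):");
        System.out.print(render(dotplot(a, b, 4, 0)));
        System.out.println("window 4, threshold 2 (tolerant):");
        System.out.print(render(dotplot(a, b, 4, 2)));
    }
}
```

Running the two calls in main side by side shows the effect discussed above: with the same window, raising the threshold can only add dots, so the tolerant plot contains every dot of the stringent one plus the partial and accidental matches. The triple loop costs time proportional to the product of both sequence lengths and the window size, so a whole-genome comparison would need a more economical approach than this sketch.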


FASTA Formatted Sequences Files

Below are the FASTA format files that were used in this comparison. They come from the following organisms:
  • Agrobacterium tumefaciens - a soil bacterium which infects host plants with its DNA. It causes tumors (commonly known as 'galls' or 'crown galls') in dicots. This Gram-negative bacterium causes crown gall by inserting a small segment of DNA (known as the T-DNA, for 'transfer DNA') into the plant cell, where it is incorporated into the plant genome at a semi-random location. These properties enable researchers to use this bacterium to deliver foreign DNA into plants.
    FASTA File
  • Rickettsia prowazekii - the bacillus that causes epidemic typhus, a form of typhus carried by the human body louse Pediculus humanus. The louse becomes infected by feeding on a human who carries the bacillus. R. prowazekii grows in the louse's gut and is excreted in its feces. The disease is transmitted to an uninfected human who scratches the bite and rubs the feces into the wound. The incubation period is one to two weeks.
    FASTA File
  • Sinorhizobium meliloti - a nitrogen-fixing bacterium (a rhizobium). It forms a symbiotic relationship with legumes from the genera Medicago, Melilotus and Trigonella, including the model legume Medicago truncatula. The S. meliloti genome contains three replicons: a 3.65 megabase chromosome and two megaplasmids, pSymA (1.35 megabases) and pSymB (1.68 megabases), all of which have been fully sequenced.
    FASTA File
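
For readers unfamiliar with the format: a FASTA file is plain text in which each record starts with a header line beginning with '>' and is followed by one or more lines of sequence. The sketch below shows one way such a file could be read in Java before being passed to a comparison. The class and method names are assumptions made for this example, and it is not taken from the program described further down.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FastaSketch {

    /**
     * Maps each header (without the leading '>') to its sequence, with the
     * sequence lines joined together and converted to upper case.
     */
    static Map<String, String> parseFasta(List<String> lines) {
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.startsWith(">")) {
                current = new StringBuilder();
                records.put(line.substring(1).trim(), current);
            } else if (current != null) {
                current.append(line.toUpperCase());
            }
            // Sequence text before the first header is skipped.
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> entry : records.entrySet()) {
            result.put(entry.getKey(), entry.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up records standing in for the contents of a real file.
        List<String> lines = Arrays.asList(
                ">demo1 first made-up record",
                "ACGTAC",
                "gttaca",
                ">demo2 second made-up record",
                "TTGACA");
        for (Map.Entry<String, String> entry : parseFasta(lines).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

For a file on disk, the list of lines could be obtained with java.nio.file.Files.readAllLines. That loads the whole file into memory, which is acceptable for a sketch but worth reconsidering for very large genomes.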
What Sequences Were Compared?

For this example, I did a quick comparison of the Rickettsia prowazekii genome with the Rickettsia conorii genome to show the areas where they differ and where they correlate. Also shown is how altering the window size and threshold affects the resulting dotplot.
Comparison of the Rickettsia prowazekii genome to the Rickettsia conorii genome

Java Dot-Plotter

The following program was written in Java to perform pairwise comparisons on FASTA formatted sequences. From the UI one can choose the threshold and window sizes. The source files are included below in PDF format.

DotPlot.java      - Computes the co-ordinates of the dot-plot, including thresholding and windowing.

DotPlotUI.java      - Creates the user interface for the dot-plot.

GraphPane.pdf      - Displays the dot-plot in the graphical pane.
