Alignments
A sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that residues with identical or similar characters are aligned in successive columns.
All of CLC bio's workbenches include pairwise alignments and multiple alignments of DNA, RNA, and protein sequences.
The pairwise alignment algorithm is based on three user-defined parameters:
- Gap open penalty (cost of the first gap position)
- Gap extension penalty (cost of each gap position beyond the first)
- End gap cost
The following end gap cost options are available, providing flexibility in the treatment of gaps at the ends of a sequence:
- Free end gaps: Any number of gaps can be inserted at the ends of the sequence without any costs
- Cheap end gaps: All end gaps are treated as gap extensions and any gaps beyond 10 are free
- End gaps as any other: End gaps are treated as other gaps in the sequence
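The interaction between the three parameters and the end gap options can be illustrated with a small sketch. This is not the workbench implementation; the function name `gap_cost`, its arguments, and the mode strings are hypothetical, and it assumes that the gap open penalty covers the first gap position and that each further position costs one gap extension.

```python
def gap_cost(length, gap_open, gap_extend, is_end_gap=False,
             end_gap_mode="as_any_other"):
    """Cost of one run of `length` consecutive gap positions (illustrative only)."""
    if length <= 0:
        return 0
    if is_end_gap:
        if end_gap_mode == "free":
            # Free end gaps: no cost, whatever the length.
            return 0
        if end_gap_mode == "cheap":
            # Cheap end gaps: every position is charged as an extension,
            # and positions beyond the 10th are free.
            return gap_extend * min(length, 10)
        if end_gap_mode != "as_any_other":
            raise ValueError(f"unknown end gap mode: {end_gap_mode!r}")
    # Ordinary gap: the first position pays the open penalty,
    # each further position pays the extension penalty.
    return gap_open + gap_extend * (length - 1)
```

With a gap open penalty of 10 and a gap extension penalty of 1, an internal gap of length 3 costs 12 under these assumptions, while an end gap of length 15 costs 10 with cheap end gaps and nothing with free end gaps.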
Our workbenches use a progressive alignment algorithm (Feng & Doolittle, 1987) to create multiple alignments. The algorithm follows these steps:
- All possible pairwise alignments are made between the sequences. These are used to estimate the evolutionary distance between each pair.
- From the pairwise distances, a phylogenetic tree is created using the neighbor-joining algorithm.
- The sequences that are neighbors in the tree are aligned to each other, and these alignments are aligned with the remaining sequences until all sequences are aligned.
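The tree-building step can be sketched as follows. This is a minimal, textbook-style neighbor-joining sketch that only reports the order in which sequences (or clusters) are joined; it is not the workbench code. The names `distances_from_matrix` and `neighbor_joining_order` are hypothetical, and the pairwise distances are assumed to be given, since how they are derived from the pairwise alignments is not specified here.

```python
from itertools import combinations


def distances_from_matrix(names, matrix):
    """Turn a symmetric distance matrix into a dict keyed by unordered pairs."""
    return {
        frozenset((names[i], names[j])): matrix[i][j]
        for i, j in combinations(range(len(names)), 2)
    }


def neighbor_joining_order(names, dist):
    """Return the clusters joined by neighbor joining, in the order they are joined.

    Each join is a tuple of the two clusters merged; a cluster is either a
    sequence name or a nested tuple from an earlier join.
    """
    d = dict(dist)          # working copy, extended with new cluster distances
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node: sum of its distances to all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair minimizing Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        u = (i, j)
        # Distance from the new cluster to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d[frozenset((i, j))]
                ) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
        joins.append(u)
    joins.append(tuple(nodes))  # the last two clusters are joined directly
    return joins
```

In a progressive alignment, the join order returned here would determine which sequences or sub-alignments are aligned to each other next. When several pairs tie for the minimum, this sketch simply takes the first one it encounters.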
When performing pairwise comparisons, the Dayhoff matrix is used for protein alignments. For nucleotide alignments, a scoring matrix is used that penalizes transversions more heavily than transitions, assuming a two-to-one ratio in their mutation rates.
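A nucleotide scoring scheme of this kind can be sketched as below. The actual values used by the workbenches are not stated here, so the scores (`match=2`, `transition=-1`, `transversion=-2`) and the function name `nucleotide_score` are assumptions chosen only to show a transversion costing twice as much as a transition.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def nucleotide_score(a, b, match=2, transition=-1, transversion=-2):
    """Score a pair of nucleotides (illustrative values, not the workbench matrix)."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"unsupported nucleotide: {base!r}")
    if a == b:
        return match
    # A transition stays within one chemical class (purine <-> purine or
    # pyrimidine <-> pyrimidine); a transversion crosses between the classes.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion
```

Under these assumed values, A aligned with G (a transition) scores -1, while A aligned with C (a transversion) scores -2.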
When aligning alignments with each other, a sum-of-pairs scoring scheme is used.
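The idea of sum-of-pairs scoring can be shown with a short sketch: the score of a column is the sum of the scores of all pairs of characters in that column. The function names and the simple match/mismatch/gap values below are assumptions for illustration; in practice the pair scores would come from the substitution matrix and gap costs in use.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (illustrative values)."""
    if a == "-" and b == "-":
        return 0            # assumption: two gaps facing each other cost nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum the pair scores over every pair of rows in every column."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )
```

For example, the three-row alignment `AC-`, `ACG`, `A-G` scores 3 in its first column and -3 in each of the other two under these values, giving a total of -3.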
Alignment speed
All CLC workbenches offer two algorithms for calculating the alignments described above:
- Accurate alignment. This is the recommended choice unless you find the processing time too long.
- Fast alignment. This uses an optimized alignment algorithm that is very fast. The fast option is particularly useful for datasets with very long sequences.
A white paper describing our alignment algorithms is available.
Viewing alignments
The following options are available when viewing alignments:
- Showing of consensus sequence at the bottom of the alignment
- Display of the level of conservation at each position in the alignment:
  - Line plot
  - Bar plot
- Zoom options (including fit width, view all, and zoom to 100%)
- Showing of annotations/regions above each sequence
- Split screen, allowing:
  - viewing of two or more alignments at a time, or
  - viewing of one alignment from different perspectives (e.g. a nucleotide-level view and a whole-alignment overview at the same time)
- Different types of sequence layout:
  - Space every 10 residues? (yes/no)
  - Wrap sequences? (yes/no), i.e. showing sequences on more than one line
  - Number of residues in each line if wrapping is chosen
  - Double stranded? (shows both strands of DNA sequences)
  - Show residue positions along the sequence? (yes/no)
- Choice of residue colors (relevant when the zoom level is at single-residue level)
- Individual choice of which types of annotations are to be shown along each sequence
- Choice of annotation layout:
  - Show annotations? (yes/no)
  - Placement of annotations (on sequence / next to sequence)
  - Show labels? (yes/no)
  - Show as arrows? (yes/no)
  - Use gradients when coloring the annotations?
- Showing of GenBank annotation texts when the cursor is placed on an annotation
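To illustrate what the consensus sequence and the conservation plot represent, the sketch below derives both from the rows of an alignment. It is only an assumption about how such values could be computed: the function name `consensus_and_conservation` is hypothetical, the consensus is taken as the most common character in each column, and conservation is taken as the fraction of rows sharing that character. The workbenches may define either quantity differently.

```python
from collections import Counter


def consensus_and_conservation(rows):
    """Return (consensus string, per-column conservation) for aligned rows."""
    consensus = []
    conservation = []
    for column in zip(*rows):
        # Most common character in this column and how often it occurs.
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        conservation.append(count / len(column))
    return "".join(consensus), conservation
```

The conservation values lie between 0 and 1 and could be drawn as either a line plot or a bar plot along the alignment.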