18.12 DIP detection

CLC Genomics Workbench offers automated detection of small deletion/insertion polymorphisms (also known as DIPs) when reads are mapped to a reference.

If your mapping has high coverage, the consensus sequence will often contain many gaps. This is because a single insertion in one read introduces a gap in all other sequences at that position. Most of these gaps can simply be ignored, since they stem from sequencing errors in a single read or in very few reads. Automated DIP detection finds the gaps that are significant. If you want to use the consensus sequence for other purposes, you can ignore all the gaps (they disappear once the consensus sequence is taken out of the mapping view), and the significant ones can then be annotated as DIPs (see Reporting the DIPs).

In CLC Genomics Workbench, a DIP is a deletion or an insertion of consecutive nucleotides present in experimental sequencing data when compared to a reference sequence. Automated DIP detection is therefore possible only for results from read mapping.

The terms "deletion" and "insertion" are understood as events that have happened to the sequencing sample relative to the reference sequence: when the local alignment between a read and the reference exhibits gaps in the read, nucleotides have been deleted (in the read, relative to the reference), and when the local alignment exhibits gaps in the reference sequence, nucleotides have been inserted (in the read, relative to the reference). Figure 18.97 shows an insertion (of TC, to the left) and a deletion (of CC, to the right).

Image DIP_detection_dips
Figure 18.97: Two DIPs, an insertion and a deletion.
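
The definition of insertions and deletions above can be illustrated with a short Python sketch. This is not the implementation used by CLC Genomics Workbench. The function name find_dips, the Dip record, the 0-based ungapped reference coordinates and the example sequences are all assumptions made for illustration. The sketch walks a gapped pairwise alignment column by column: a run of gaps in the reference row is reported as an insertion, and a run of gaps in the read row is reported as a deletion.

```python
from typing import List, NamedTuple


class Dip(NamedTuple):
    kind: str     # "insertion" or "deletion" (in the read, relative to the reference)
    ref_pos: int  # 0-based position in the ungapped reference (illustrative convention)
    bases: str    # inserted read bases, or deleted reference bases


def find_dips(ref_aln: str, read_aln: str, gap: str = "-") -> List[Dip]:
    """Collect runs of consecutive gaps from one gapped read/reference alignment."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned rows must have the same length")
    dips: List[Dip] = []
    ref_pos = 0  # number of reference bases consumed so far
    i, n = 0, len(ref_aln)
    while i < n:
        if ref_aln[i] == gap and read_aln[i] != gap:
            # Gaps in the reference row: bases were inserted in the read.
            j = i
            while j < n and ref_aln[j] == gap and read_aln[j] != gap:
                j += 1
            dips.append(Dip("insertion", ref_pos, read_aln[i:j]))
            i = j
        elif read_aln[i] == gap and ref_aln[i] != gap:
            # Gaps in the read row: reference bases were deleted in the read.
            j = i
            while j < n and read_aln[j] == gap and ref_aln[j] != gap:
                j += 1
            dips.append(Dip("deletion", ref_pos, ref_aln[i:j]))
            ref_pos += j - i
            i = j
        else:
            # Match, mismatch, or a column with gaps in both rows.
            if ref_aln[i] != gap:
                ref_pos += 1
            i += 1
    return dips


# Made-up alignment with an insertion of TC and a deletion of CC.
example = find_dips("ACG--TTACCGA", "ACGTCTTA--GA")
```

For this made-up alignment, example contains an insertion of TC before reference position 3 and a deletion of CC starting at reference position 6.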

The automated DIP detection in CLC Genomics Workbench bases all reported DIPs on DIPs found in individual reads. The length of reported deletions and insertions is therefore bounded by the number of insertions and deletions allowed per read by the read mapping algorithm.

In most situations, a DIP in a single read is not sufficient experimental evidence. CLC Genomics Workbench allows you to specify how many reads must cover and agree on a DIP in order for it to be reported by the automated DIP detection. Two reads agree on a deletion if their local alignments to the reference sequence both contain the same number of consecutive gaps aligned to the same reference positions. Likewise, two reads agree on an insertion if their local alignments specify the same number of consecutive gaps at the same position in the reference sequence and the nucleotides inserted in the two reads are the same. Figure 18.98 shows some reads disagreeing on an insertion (TC or TA, on the left) and agreeing on a deletion (of CC, on the right).

Image DIP_detection_agreement
Figure 18.98: Disagreement and agreement of DIPs.
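
The agreement rule above can be sketched in Python as follows. This is a minimal illustration, not the algorithm used by CLC Genomics Workbench. The names agreement_key, report_dips and min_reads, the tuple layout (kind, ref_pos, bases) and the example reads are assumptions. The sketch only counts agreeing reads and does not model any other detection settings. Deletions are grouped by position and length, and insertions are grouped by position and inserted nucleotides.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# One DIP observed in one read: (kind, ref_pos, bases). Hypothetical layout.
ReadDip = Tuple[str, int, str]


def agreement_key(dip: ReadDip) -> tuple:
    """Return the value two reads must share in order to agree on a DIP."""
    kind, ref_pos, bases = dip
    if kind == "deletion":
        # Same number of consecutive gaps at the same reference positions.
        return (kind, ref_pos, len(bases))
    # Same position, same length, and the same inserted nucleotides.
    return (kind, ref_pos, bases)


def report_dips(per_read_dips: Iterable[List[ReadDip]], min_reads: int) -> List[tuple]:
    """Report every DIP supported by at least min_reads agreeing reads."""
    support: Counter = Counter()
    for dips in per_read_dips:
        # A set ensures each read is counted at most once per DIP.
        for key in {agreement_key(d) for d in dips}:
            support[key] += 1
    reported = [key for key, count in support.items() if count >= min_reads]
    # Sort by reference position, then by kind, for stable output.
    return sorted(reported, key=lambda k: (k[1], k[0]))


# Made-up reads: they disagree on the insertion (TC or TA) and agree on the deletion.
reads = [
    [("insertion", 3, "TC"), ("deletion", 6, "CC")],
    [("insertion", 3, "TA"), ("deletion", 6, "CC")],
    [("insertion", 3, "TC"), ("deletion", 6, "CC")],
    [("deletion", 6, "CC")],
]
strict = report_dips(reads, min_reads=3)
relaxed = report_dips(reads, min_reads=2)
```

With a threshold of three reads, only the deletion of length 2 at position 6 is reported. Lowering the threshold to two also reports the TC insertion, while the TA insertion, seen in a single read, is still discarded.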

Based on your specification of what constitutes a valid DIP, the DIP detection scans the entire mapping and reports all DIPs that meet the requirements:

        Toolbox | High-throughput Sequencing (Image ngsfolder) | DIP detection (Image dip_detection)

This opens a dialog where you can select read mapping results (Image contig) / (Image multicontig) to scan for DIPs (see Map reads to reference for information on how to map reads to a reference).


