Duplicate reads removal plugin
The duplicate reads removal plug-in filters out duplicate reads. It is particularly well suited to handling duplicate reads that come from PCR amplification errors. Such duplicates can have a negative effect because a sequence ends up represented in artificially high numbers.
The purpose of the tool is to reduce the data set so that it includes only one copy of each duplicated sequence. The challenge is to achieve this without removing identical or almost identical reads that arise legitimately from high coverage of certain regions, e.g. repeat regions or highly expressed exons in transcriptome sequencing. The algorithm takes sequencing errors into account when removing the duplicates.
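As an illustration only, the idea of collapsing near-identical reads can be sketched in Python. This is not the plug-in's actual algorithm, which is not described here. The function names `hamming` and `remove_duplicates`, the `max_mismatches` parameter, and the greedy comparison against already kept reads are all assumptions made for the example.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def remove_duplicates(reads: list[str], max_mismatches: int = 1) -> list[str]:
    """Greedy sketch (hypothetical, not the plug-in's method): keep a read
    only if no already kept read of the same length is within
    `max_mismatches` mismatches of it."""
    kept: list[str] = []
    for read in reads:
        is_duplicate = any(
            len(read) == len(rep) and hamming(read, rep) <= max_mismatches
            for rep in kept
        )
        if not is_duplicate:
            kept.append(read)
    return kept


reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",  # exact duplicate of the first read
    "ACGTACGAAC",  # one mismatch, treated here as a sequencing error
    "TTGCATGCAA",  # unrelated read, kept
]
unique_reads = remove_duplicates(reads)
print(unique_reads)
```

Note that this simple sketch makes no attempt to tell PCR duplicates apart from identical reads produced by genuinely high coverage, which is the hard part described above.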
The approach taken here is based on the raw sequencing data, without any knowledge of how the reads map to a reference sequence. This makes it well suited to both de novo assembly and resequencing.
The plug-in is available both for the CLC Genomics Workbench and the CLC Genomics Server.
In its current version, the duplicate reads removal has a limitation when the duplicate reads contain several alleles. The algorithm will identify that there are duplicate reads to be removed, but it cannot distinguish between sequencing errors and true variation in the reads. So if you have a heterozygous SNP in such an area, there is a risk that only one of the alleles is preserved. We are working on improving the algorithm to handle this.
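To see why this can happen, consider a minimal Python sketch. It assumes, purely for illustration, that duplicates are detected by allowing a fixed number of mismatches between reads; the threshold `max_mismatches` and the example reads are hypothetical and do not reflect the plug-in's internals.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


# Two reads covering the same position, one per allele of a heterozygous SNP.
allele_a = "GATTACAGGCTA"
allele_b = "GATTACTGGCTA"  # differs only at the SNP position

max_mismatches = 1  # hypothetical tolerance for sequencing errors

# A single-base difference falls within the error tolerance, so a
# mismatch-tolerant comparison cannot tell the SNP from a sequencing error.
treated_as_duplicates = hamming(allele_a, allele_b) <= max_mismatches
print(treated_as_duplicates)
```

Under this assumption the two reads are collapsed into one, and only one allele survives.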