Trimming and filtering sequences
- Trim using quality scores. If the sequence files contain quality scores from a base-caller algorithm, this information can be used for trimming sequence ends. The program uses the modified-Mott trimming algorithm for this purpose.
- Trim using ambiguous nucleotides. This option trims the sequence ends based on the presence of ambiguous nucleotides (typically N). Note that the automated sequencer generating the data must be set to output ambiguous nucleotides for this option to apply. The algorithm takes as input the maximum number of ambiguous nucleotides allowed in the sequence after trimming. If this maximum is set to, for example, 3, the algorithm finds the longest region containing 3 or fewer ambiguities and then trims away the ends not included in this region.
- Trim contamination from vectors in the UniVec database. This option matches the sequence reads against all vectors in the UniVec database and removes sequence ends with significant matches (the database is included when you install the CLC Genomics Workbench). A list of all the vectors in the UniVec database can be found at http://www.ncbi.nlm.nih.gov/VecScreen/replist.html.
- Trim contamination from saved sequences. This option lets you select your own vector sequences that you know might be the cause of contamination. If you select this option, you will be able to select one or more sequences when you click Next.
- Hit limit. Specifies how strictly vector contamination is trimmed. Since vector contamination usually occurs at the beginning or end of a sequence, different criteria are applied for terminal and internal matches. A match is considered terminal if it is located within the first 25 bases at either sequence end. Three match categories are defined according to the expected frequency of an alignment with the same score occurring between random sequences. The CLC Genomics Workbench uses the same settings as VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html):
  - Weak. Expect 1 random match in 40 queries of length 350 kb.
    - Terminal match with Score 16 to 18.
    - Internal match with Score 23 to 24.
  - Moderate. Expect 1 random match in 1,000 queries of length 350 kb.
    - Terminal match with Score 19 to 23.
    - Internal match with Score 25 to 29.
  - Strong. Expect 1 random match in 1,000,000 queries of length 350 kb.
    - Terminal match with Score 24 or above.
    - Internal match with Score 30 or above.
- Discard reads below length. If you simply wish to delete reads because they are short, check this option and set the number to an appropriate level. Reads that have fewer nucleotides than the number you set will be deleted. The maximum value is 100.
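
The ambiguous-nucleotide trimming described above (find the longest region with at most a given number of ambiguities, then trim the ends outside it) can be illustrated with a sliding window. The following is a minimal sketch and not the Workbench's actual implementation; the function name `trim_ambiguous`, its parameters, and the choice to treat every symbol other than A, C, G, T and U as ambiguous are assumptions made for illustration.

```python
def trim_ambiguous(seq, max_ambiguous=3, unambiguous="ACGTU"):
    """Return (start, end) of the longest region of seq that contains
    at most max_ambiguous ambiguous symbols (end is exclusive).

    Hypothetical helper: any symbol not listed in `unambiguous` is
    counted as ambiguous. Ties are resolved in favour of the leftmost
    region.
    """
    best_start, best_end = 0, 0
    start = 0   # left edge of the current window
    count = 0   # ambiguous symbols inside the current window
    for end, base in enumerate(seq):
        if base.upper() not in unambiguous:
            count += 1
        # Shrink the window from the left until it is valid again.
        while count > max_ambiguous:
            if seq[start].upper() not in unambiguous:
                count -= 1
            start += 1
        # Keep the window if it is strictly longer than the best so far.
        if end + 1 - start > best_end - best_start:
            best_start, best_end = start, end + 1
    return best_start, best_end


read = "NNACGTNACGTNNNN"
start, end = trim_ambiguous(read, max_ambiguous=1)
print(read[start:end])  # ACGTNACGT
```

With `max_ambiguous=1`, the two leading and four trailing N symbols are trimmed, while the single N inside the retained region is kept because it falls within the allowed maximum.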

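The hit limit categories listed above can be expressed as a small lookup. This is a sketch under stated assumptions, not the Workbench's code: the function names are hypothetical, the Strong category is treated as starting at scores 24 (terminal) and 30 (internal), following VecScreen, so that no score is left unclassified, and a match is taken to be terminal when it starts within the first 25 bases or ends within the last 25 bases of the read.

```python
def is_terminal(match_start, match_end, read_length, window=25):
    """True if the match touches the first or last `window` bases.

    Coordinates are 0-based with an exclusive end. This is one
    interpretation of "located within the first 25 bases at either
    sequence end" and is an assumption of this sketch.
    """
    return match_start < window or match_end > read_length - window


def classify_vector_match(score, terminal):
    """Return 'strong', 'moderate', 'weak', or None for a match score."""
    if terminal:
        bounds = ((24, "strong"), (19, "moderate"), (16, "weak"))
    else:
        bounds = ((30, "strong"), (25, "moderate"), (23, "weak"))
    # Bounds are ordered from strictest to weakest category.
    for minimum, label in bounds:
        if score >= minimum:
            return label
    return None  # Score too low to count as a vector match.


print(classify_vector_match(20, terminal=True))   # moderate
print(classify_vector_match(20, terminal=False))  # None
```

The example shows why the distinction matters: a score of 20 counts as a moderate match near a sequence end but is not reported at all for an internal match.
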
How to treat trimmed regions
Trimmed regions may be either deleted or annotated:

- Delete. Cuts the trimmed nucleotides off the read and creates a new sequence list with the shortened reads. This is recommended for large numbers of reads where manual inspection is irrelevant. Since a new sequence list is created, none of the original data will be modified.

- Annotate trimmed regions. The trimmed regions are marked with an annotation, which means that they are ignored during assembly. The information is therefore preserved for inspection in the contig. If you have regions with potentially bad coverage, you will be able to re-include the trimmed regions after assembly. In the contig, the trimmed regions will appear faded, illustrating that they were not used when constructing the contig. By dragging the edge of the faded region, you can include the trimmed region again. This is not recommended for large data sets from high-throughput sequencing machines.
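
The difference between the two modes can be illustrated with a small data model. This is only a sketch: the `Read` class, its fields, and both functions are hypothetical and do not reflect how the Workbench stores sequences or annotations. In particular, returning a copy from the annotate function is a choice made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    name: str
    sequence: str
    # (start, end) spans marked as trimmed; end is exclusive.
    trimmed_regions: list = field(default_factory=list)


def delete_trimmed(reads, keep_regions):
    """Delete mode: build a new list of shortened reads.

    The input reads are left untouched, mirroring the statement that
    none of the original data is modified.
    """
    return [Read(r.name, r.sequence[s:e]) for r, (s, e) in zip(reads, keep_regions)]


def annotate_trimmed(read, keep_start, keep_end):
    """Annotate mode: keep the full sequence and mark the trimmed ends."""
    regions = []
    if keep_start > 0:
        regions.append((0, keep_start))
    if keep_end < len(read.sequence):
        regions.append((keep_end, len(read.sequence)))
    return Read(read.name, read.sequence, regions)


original = Read("read1", "NNACGTNN")
print(delete_trimmed([original], [(2, 6)])[0].sequence)   # ACGT
print(annotate_trimmed(original, 2, 6).trimmed_regions)   # [(0, 2), (6, 8)]
```

In the annotated read the full sequence is still available, which is what makes it possible to re-include a trimmed region later.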