CLC Assembly Cell is a high-performance computing solution for read mapping and de novo assembly of next generation sequencing data.
The command-line interface of CLC Assembly Cell enables the functionalities to be easily included in scripts and other next generation sequencing workflows.
CLC Assembly Cell uses SIMD instructions to parallelize and accelerate the assembly algorithms, making the program the fastest next generation sequencing assembler at present.
CLC Assembly Cell is up to 10-20 times faster than Maq and SOAP at human whole-genome read mapping. For details, see our white papers on the CLC read mapper and the CLC de novo assembler.
Read mapping
- Read mapping of Illumina, Ion Torrent, SOLiD, and 454 sequencing data
- Native support for Color Space
- Support for both short read and long read assembly, including 454/Titanium data
- Support for both gapped and ungapped alignments when doing short read mapping
- Support for mapping of paired end reads
De novo assembly
- De novo assembly of Illumina, Ion Torrent and 454 sequencing data
- Support for both short read and long read assembly, including 454/Titanium
- Support for de novo assembly of paired end data
- Building scaffolds from paired-end data
- Fast analysis of raw data, including reporting
- Option of joining data from different sources into the same analysis (including data generated by different kinds of sequencing technologies)
- Extraction of data from part(s) of an assembly. Examples are extraction of contigs and reads from an area of interest, or extraction (exclusion) of data from a specific sequencing lane that is suspected not to be of acceptable quality.
- Removal of duplicate reads
- Quality trimming
- Find variations (simple SNP detection)
- Support for input file formats Fasta, Sff, GenBank, csfasta, and scarf
- A number of output options, including tables with assembly info
- A “graphical” (ASCII art) assembly viewer to get a quick overview
- Full integration with CLC Genomics Workbench. Output data from CLC Assembly Cell can be imported and further analyzed in CLC Genomics Workbench.
Multiple CLC Assembly Cells can be run in parallel on a multi-node cluster.
In practice, almost every cluster is set up differently, so we do not provide an off-the-shelf solution that is guaranteed to work on your computer cluster.
Instead, we provide the Perl script below as an example. It is free to download, free to use, and free to modify, and you are welcome to adjust it to fit your needs.
The script cluster_schedule distributes the jobs defined in the schedule file across a number of nodes. One example is the distribution of CLC Assembly Cell reference assembly jobs. This requires an installation of CLC Assembly Cell on each node, and the best performance is reached if the reference sequence is stored locally on each node.
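As a rough illustration of the distribution idea (not the actual Perl script), the following Python sketch hands queued jobs to a fixed set of nodes, one job per node at a time. The function `distribute`, the `run_job_on_node` callback, and the scheduling policy (each idle node takes the next pending job) are assumptions made for this example; the real cluster_schedule may assign jobs differently.

```python
import queue
import threading
from typing import Callable, Dict, List, Sequence

Job = List[str]  # a job is an ordered list of shell commands


def distribute(
    jobs: Sequence[Job],
    nodes: Sequence[str],
    run_job_on_node: Callable[[str, Job], bool],
) -> Dict[int, bool]:
    """Run every job on some node and return {job index: success}.

    Assumed policy: one worker thread per node; an idle node takes the
    next pending job, so each node runs at most one job at a time.
    """
    pending = queue.Queue()
    for index, job in enumerate(jobs):
        pending.put((index, job))

    results: Dict[int, bool] = {}
    lock = threading.Lock()

    def worker(node: str) -> None:
        while True:
            try:
                index, job = pending.get_nowait()
            except queue.Empty:
                return  # no jobs left for this node
            ok = run_job_on_node(node, job)
            with lock:
                results[index] = ok

    threads = [threading.Thread(target=worker, args=(node,)) for node in nodes]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Here `run_job_on_node` is left to the caller, so the same scheduling loop can be exercised with a stub before it is pointed at real nodes.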
Each job is a list of commands which cluster_schedule will run in order on one node. If one of the commands in a job fails (its error code is not zero), no more commands in the job are executed and the job is considered failed. If all commands in a job complete successfully (all error codes are zero), the job is a success.
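The success and failure rule for a job can be sketched in a few lines of Python. This is a minimal illustration, not code taken from cluster_schedule: the names `run_job` and `run_local` are hypothetical, and commands are run locally through the shell here rather than on a remote node.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple


def run_local(command: str) -> int:
    """Run one shell command locally and return its exit code."""
    return subprocess.run(command, shell=True).returncode


def run_job(
    commands: Sequence[str],
    run: Callable[[str], int] = run_local,
) -> Tuple[bool, List[str]]:
    """Run commands in order and stop at the first non-zero exit code.

    Returns (success, commands that were actually started).
    """
    executed: List[str] = []
    for command in commands:
        executed.append(command)
        if run(command) != 0:
            return False, executed  # job failed; skip the remaining commands
    return True, executed
```

Passing a different `run` callable makes it possible to check the stop-on-first-failure behaviour without executing anything.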
The nodes the jobs are run on can be defined on the command line or in the schedule file. Nodes defined on the command line replace all nodes defined in the schedule file.
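The precedence rule is a replacement, not a merge. A small Python sketch of that rule, using the hypothetical helper name `resolve_nodes` and made-up node names:

```python
from typing import List, Sequence


def resolve_nodes(
    command_line_nodes: Sequence[str],
    schedule_file_nodes: Sequence[str],
) -> List[str]:
    """Return the nodes to use.

    Any node given on the command line discards the whole list from the
    schedule file; the two lists are never combined.
    """
    if command_line_nodes:
        return list(command_line_nodes)
    return list(schedule_file_nodes)
```

For example, `resolve_nodes(["node07"], ["node01", "node02"])` yields only `node07`.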
Each job is run on one node and each command is executed on the node using ssh.
Therefore, to use cluster_schedule, make sure that all nodes are set up to use automatic ssh authentication.
Both the cluster_schedule script and an example schedule file for cluster_schedule are available for download.
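One way to confirm that a node accepts automatic ssh authentication before scheduling work on it is sketched below in Python. It relies on the OpenSSH client options `BatchMode` and `ConnectTimeout`; whether cluster_schedule itself uses these options is not stated, and the function names `ssh_argv` and `passwordless_login_works` are hypothetical.

```python
import subprocess
from typing import List


def ssh_argv(node: str, command: str) -> List[str]:
    """Build the argument list for running one command on a node via ssh.

    BatchMode=yes makes ssh fail instead of prompting for a password.
    """
    return ["ssh", "-o", "BatchMode=yes", node, command]


def passwordless_login_works(node: str, timeout_s: int = 10, run=subprocess.run) -> bool:
    """Return True if `ssh node true` succeeds without any prompt."""
    argv = [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout_s}",
        node,
        "true",
    ]
    try:
        completed = run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except OSError:
        return False  # for example, no ssh client installed
    return completed.returncode == 0
```

Running the check once per node up front gives a clear error early, rather than a job failing on its first command.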
The scripts are provided for free, without warranty and support.
Read the user manual for more information about CLC Assembly Cell.