ERNE - Extended Randomized Numerical alignEr

Manual

Version 1.4.4+

****** Installation ******

Briefly, the shell commands "./configure ; make ; make install" should configure, build, and install this package ("make install" must be execute by root if you do not changed the "--prefix" options when running "./configure").

The LIBBOOST library (version 1.40 or above) is required. ERNE will be compiled without debugging informations and with the highest level of optimization by default. If you want to debug the code (e.g., when required by the maintainer), add the "--enable-maintainer-mode" option to "./configure" to create "erne-*-dbg" programs.

If OPEN-MPI (or MPICH2) support is available, the MPI versions (erne-pmap, erne-pbs5 and erne-pcreate) will be built if you add the "--enable-openmpi" (or "--enable-mpich") option to "./configure".

****** Reference preparation (DNA-seq and RNA-seq, not BS-seq) ******

In order to align a set of reads against a reference genome the ERNE hash table must be built. In order to build the hash table for the multi-fasta file "ref.fasta", you have to type the following command:

erne-create --reference-prefix ref --fasta ref.fasta

This command will produce the file ref.eht (eht = Erne Hash Table) that will be used later in the alignment step. The extension is automatically added.

****** Alignment (DNA-seq and RNA-seq, not BS-seq) *****

The alignments are performed by the erne-map program, which requires a reference file prepared as indicated in the previous section. A typical use of erne-map, for a single read set, is as follow:

erne-map --reference ref.eht --query1 sequence.txt --output out.bam

which will align the file "sequence.txt" against the reference "ref.eht" producing the out.bam output file. By default, each read will be cleaned by low quality bases at the extremities before being aligned and the maximum number of errors allowed is chosen in relation to the length of the read after the trimming (default value: 1 error every 15bp). If you do not want to perform the trimming operation, just add the --no-auto-trim parameter.

Maximum read length allowed is 600bp.

For the alignment of a paired-end read set (PE), the command would be as follows:

erne-map --reference ref.eht --query1 sequence_1.txt --query2 sequence_2.txt --output out.bam

If query files are compressed (in .gz or .bz2 format), erne-map will decompress the files on the fly. If you want to produce an output file in sam format, just add the --sam parameter and replace the "--output out.bam" option with "--output out.sam").

erne-map automatically detects if the input reads are in Illumina (1.5 - 1.7) or in Standard (Sanger) fastq format (used since Illumina 1.8). If you want to override the default behaviour, use --force-illumina or --force-standard parameters, respectively.

If you use erne-map on a multiprocessor machine with N processors, you can define the number of thread to use with --threads (e.g., "--threads 8").

As a default behavior, erne-map will allow a maximum number of errors (i.e. misaligned bases in a read) equal to 1 every 15 bp. To modify erne-map's error tolerance use "--errors-rate B" parameter , where B is the number of bp for each error (default 15). If you want to fix the maximum number of errors allowed for all the reads irrespective of their size, you can use the "--errors " parameter, where E is the number of errors.

In PE alignments the BAM "proper pair" flag is set to true if the two reads are in the same contig/scaffold/chromosome and they are facing (---> <---). If you specify the two parameters "--insert-size-min " and "--insert-size-max ", then the proper pair flag is set to true if the above condition holds AND if the insert size is between MIN and MAX.

If you provide a small (few Kbp) .eht reference sequence with --contamination-reference parameter, the read will be first aligned against this contamination reference, before being aligned against the main reference. If a hit in the contamination sequence is found, the read is marked as "quality discarded" and the XC tag is set to remind that the read belongs to the contamination sequence and not to the main reference.

If you want to allow small indels add the --indels option. You can set the maximum number of bp allowed for any individual deletion with the --indels-max parameter (default: 5). The more bp are allowed, the more slower the process becomes.

erne-map can be instructed to split reads by a single (long) gap (useful to account for introns or structural variations in RNA-seq or DNA-seq experiments respectively) add the --gap option. Optionally, the default maximum value for a gap can be modified with the --ref-insertion-max-gap parameter.

With the --print-all option, all the alignments will be printed in the BAM/SAM file. If you use the --print-all and you want to set a maximum number of records to be printed, use the --print-at-most option. In BAM/SAM format, every alignment is memorized in one line. If --print-all is on, each of the n possible alignments will be printed in a different line with the SAM/BAM tags NH and HI set as NH:i:n and HI:i:n for each alignment, while the IH tag will display a progressive index from 1 to n. If --print-at-most is also on and set to print up to m different alignments and n possible alignments have been identified, min(n,m) alignments will be printed. The NH and HI tags will be set as NH:i:n and HI:i:min(n,m) while IH will assume values from 1 to min(n,m).

****** Reference preparation (BS-seq) ******

In order to align a set of BS-reads against a reference genome we have to build the ERNE index (Hash table). The index for the multi-fasta file "ref.fasta" can be build using the following command:

erne-create --methyl-hash --reference ref --fasta ref.fasta

This command will produce the file "ref.ehm" (ehm = Erne Hash table for Methylation) that will be used later to align reads. The extension is automatically added.

****** Alignment (BS-seq) *****

Methylation reconstruction requires 2 steps: alignment of BS-sequences on the reference (performed by erne-bs5 aligner) and methylation calling using the output generated at the previous step (performed by erne-meth caller)

Usage of erne-bs5 with BS-seq is the same as in DNA-seq and RNA-seq sequence alignment through the erne-map command, excvept that the algorithm requires a dedicated hash table produced with --methyl-hash option of the erne-create command.

****** Methylation reconstruction *****

*** (1) general options

erne-meth processes a bam file generated by erne-bs5 and produces as output the methylation profile together with some statistics.

erne-meth requires as input the multi-fasta reference file, the input bam produced with erne-bs5 and the output prefix. The basic command line is:

erne-meth --fasta ref.fasta --input alignments.bam --output-prefix out --annotations-erne

where erne-meth uses the alignments in alignments.bam to reconstruct the methylation profile of the reference ref.fasta. The output files produced are:

out_report.txt : a detailed human-readable report with the statistics about methylation and alignment.

out_report_tabbed.txt : the same information as above, but in a more succinct and tabbed format (it can be used as input for a script in downstream analysis)

if --annotations-erne is specified: out_erne_meth.txt. A table displaying methylation levels for each cytosine in the reference. Format is: chr position strand context #C+#T methyl_level mult_reads, where methyl_level = #C/(#C+#T) (or NA if #C+#T=0) and mult_reads is the number of multiple reads disambiguated that cover the position (only if --disambiguation-mode is specified, NA otherwise). Positions are 1-based. Uncovered Cs are printed. Default: disabled.

if --annotations-epp is specified: out_epp.bed. A table in the format as output files from the Epigenome Processing Pipeline (EPP) applied at the Broad Institute. Only covered Cs are printed. If --on-target-annotations is specified, only on-target covered positions are printed. Format is (tab separated): chr name, start context(included), end context(excluded), methylation value and coverage as a string(’#C/(#T+#C)’), methylation in range [0,1000], strand, context(CG/CHG/CHH). Coordinates are 0-based. Default: disabled.

if --annotations-bismark is specified: out_bismark.bed. A table in the bismarkCytosine format.

erne-meth can be executed in three different modalities depending on how multiple alignments are processed:

1) single mode (option --single-mode). This is the default modality: erne-meth uses only single-mapping reads to compute the methylation profile.

2) multiple mode (option --multiple-mode). This modality applies to paired-end reads. If a read in a pair is single-mapping and the mate is multiple-mapping, erne-meth uses the default alignment for the multiple mapping read.

3) disambiguation mode (option --disambiguation-mode). This modality applies only to single-end reads. In this modality, erne-meth uses the methylation profile to iteratively disambiguate some multiple mapping reads.

Note that:

- If paired-end reads are used, improperly paired mates that are mapped on different chromosomes or have a non correct orientation are not discarded. In all the modalities erne-meth uses these reads for methylation reconstruction since this situation could reflect structural variations. Specify --disable-unproper-pair to disable this behaviour.

- Modality 1) and 2) are incompatible and cannot be specified together.

- Since out_methyl_annotations.txt is a big file (several bytes per cytosine in the reference) we recommend to compress it. erne-meth can produce compressed output if the option --compress format is specified (for example, --compress gz)

- For each position of the reference corresponding to a cytosine (C), either in the Watson or Crick strand, the methylation level is computed as Cs / (Cs + Ts), that is the number of reads covering that strand with a C in that position divided by the total number of reads covering the same strand with C o T in that position. However, if coverage is low, the methylation level could be deemed questionable. For this reason the user can specify the option --coverage-threshold N to filter out low covered positions: if a cytosine is covered with (strictly) fewer than N reads, the methylation level is not computed for that particular cytosine (which is then considered as not covered). The default value for N is 1.

*** (2) Target enrichment data

The user can specify a bed file with the option --target containing the regions targeted by the protocol used. If this file is specified, erne-meth will compute additional statistics to help in assessing the precision of the protocol. The following files are produced:

out_on_target.txt : a table containing the percentage of on-target positions having coverage at least Ax, where 1<=A<=100

out_on_target.txt : a table containing the absolute number of out-of-target positions having coverage at least Ax, where 1<=A<=100

out_out_target_cov_distribution.txt : a table containing a set of triples. Each triple has the following meaning: there are N out-of-target positions having coverage C and such that the nearest target region is at distance D.

One common effect in target enrichment protocols are the coverage tails at the borders of the target regions. For this reason, it is useful to extend each target region by a number of bases on its extremities. Use the option --extend-target N to to extend each target region by N bp on its extremities.

Specify --on-target-annotations to print only on-target Cytosines in the files generated with options --annotations-epp and --annotations-bismark.

*** (3) Deduplication

If a large number of reads are aligned, it is likely that most of them come from duplicate fragments. Specify --deduplicate to automatically discard duplicate reads during the methylation call. erne-meth will print also statistics concerning number of discarded duplicate reads.

****** Read Filtering *****

erne-filter is designed to trim the reads removing low-quality bases at the reads ends and filter out contaminated reads (e.g., this is a mandatory step in de-novo assembly) in one step.

erne-filter --reference contamination.eht --query1 sequence_1.txt --query2 sequence_2.txt --output-prefix trimmed

erne-filter trims reads in the sequence_1.txt and sequence_2.txt files according to their quality; after that, the good quality region of each read is aligned against the contamination reference and if this alignment is succesful, the read is discarded. The results are saved into 3 files:

trimmed_1.fastq

trimmed_2.fastq

trimmed_unpaired.fastq

If you use erne-filter on a multiprocessor machine with N processors, you can define the number of thread to use with --threads . The --reference parameter is optional.

****** Using on cluster *****

If the MPI support is activated in the compiling step, then erne-p* programs (erne-pcreate, erne-pbs5 and erne-pmap) are built.

In order to successfully use erne-p* programs, the user must know the cluster composition (number of available nodes, number of cores per node).

We will provide some examples over an OpenMPI system composed by 10 nodes (n=10) with 8 cores each (p=8).

First of all, a file named 'hostfile' containing the names of the 10 nodes that are going to be used must be provided.

erne-pmap and erne-pbs5 uses a slightly different reference structure than the one used by erne-map and erne-pbs5. In particular, a set of n '.eht'/'.ehm' files with an additional header file must be generated using the following command:

erne-pcreate --reference PREFIX-NAME --fasta /path/of/input.fasta --nodes 10

where 10 is the number of nodes that will be used in the next (alignment) step.

After the computation ends, we are ready to align the reads:

mpirun -pernode -hostfile hostfile erne-pmap --reference PREFIX-NAME --threads 8 --query1 /path/of/reads.txt --output /path/of/output.bam [other options]

In some queue environments (like Torque or similar) you do not need to provide the hostfile. If you do not want to use multi-threading you can remove both -pernode and --threads options.

The same considerations holds for erne-pbs5. The parallel version accepts the same options availables in the non-parallel versions.