Man Pages in seqan-apps

21
Section 1: Executable programs or shell commands

alf.1

Alignment free sequence comparison

synopsis
    alf [options] -i in.fasta [-o out.txt]

description
    compute pairwise similarity of the sequences in in.fasta using alignment-free methods and write a tab-delimited matrix with pairwise scores to out.txt.

    -h, --help
        displays this help message.
    --version
        display version information
    -v, --verbose
        when given, details about the progress are printed to the screen.

input / output:
    -i, --input-file file
        name of the multi-fasta input file. valid filetypes are: fa and fasta.
    -o, --output-file file
        name of the file to which the tab-delimited matrix with pairwise scores will be written. default is to write to stdout. valid filetype is: alf.tsv.

general algorithm parameters:
    -m, --method method
        select method to use. one of n2, d2, d2star, and d2z. default: n2.
    -k, --k-mer-size k
        size of the k-mers. default: 4.
    -mo, --bg-model-order order
        order of background markov model. default: 1.

n2 algorithm parameters:
    -rc, --reverse-complement mode
        which strand to score. use both_strands to score both strands simultaneously. one of input, both_strands, mean, min, and max. default: input.
    -mm, --mismatches mismatches
        number of mismatches, one of 0 and 1. when 1 is used, n2 uses the k-mer-neighbour with one mismatch. default: 0.
    -mmw, --mismatch-weight weight
        real-valued weight of counts for words with mismatches. default: 0.1.
    -kwf, --k-mer-weights-file file.txt
        print k-mer weights for every sequence to this file if given. valid filetype is: txt.

contact and references
    for questions or comments, contact: jonathan goeke [email protected]
    please reference the following publication if you used alf or the n2 method for your analysis: jonathan goeke, marcel h. schulz, julia lasserre, and martin vingron. estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. bioinformatics (2012).
    project homepage: http://www.seqan.de/projects/alf

version
    alf version: 1.1
    last update january 5, 2012
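To make the idea of an alignment-free pairwise score concrete, here is a minimal Python sketch of the plain D2 statistic (the sum, over all words of length k, of the product of the word counts in the two sequences) together with a tab-delimited score matrix. It is an illustration only and not alf's implementation: the function names are made up for this example, and the n2, d2star and d2z methods involve additional normalisation that is not shown here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(seq_a, seq_b, k=4):
    """Plain D2: sum over all words of count_a(word) * count_b(word)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Words missing from counts_b contribute 0 (Counter returns 0 for them).
    return sum(n * counts_b[word] for word, n in counts_a.items())


def score_matrix(seqs, k=4):
    """Pairwise D2 scores for a list of sequences."""
    return [[d2_score(a, b, k) for b in seqs] for a in seqs]


def to_tsv(matrix):
    """Render the matrix as tab-delimited text, one row per line."""
    return "\n".join("\t".join(str(v) for v in row) for row in matrix)


if __name__ == "__main__":
    sequences = ["ACGTACGT", "ACGTTTTT", "CCCCCCCC"]  # toy input
    print(to_tsv(score_matrix(sequences, k=4)))
```

The word length k=4 mirrors the documented default of --k-mer-size; everything else in the sketch is an assumption made for illustration.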

gustaf.1

Gustaf - generic multi-split alignment finder: tool for split-read mapping allowing multiple splits.

synopsis
    gustaf [options] <genome fasta file> <read fasta file>

description
    gustaf uses seqan's stellar to find splits as local matches on different strands or chromosomes. criteria and penalties to chain these matches can be specified. output file contains the breakpoints along the best chain. the genome file is used as database input, the read file as query input. all stellar options are supported. see stellar documentation for stellar parameters and options.
    (c) 2011-2012 by kathrin trappe

    -h, --help
        displays this help message.
    --version
        display version information

gustaf options
  main options:
    -tp, --transpen int
        interchromosomal translocation penalty. default: 5.
    -ip, --invpen int
        inversion penalty. default: 5.
    -op, --orderpen int
        intrachromosomal order change penalty. default: 0.
    -oth, --overlapthresh double
        allowed overlap between matches. default: 0.5.
    -gth, --gapthresh int
        allowed gap length between matches. default: 10.
    -ith, --initgapthresh int
        allowed initial or ending gap length at begin and end of read. default: 15.
    -st, --support int
        number of supporting reads. default: 2.
  input options:
    -m, --matchfile file
        file of (stellar) matches. valid filetype is: gff.
  output options:
    -bpo, --breakpointout file
        name of breakpoint output file. valid filetypes are: gff and txt. default: breakpoints.gff.
    -j, --jobname str
        job/queue name. default: (empty).
    -do, --dots
        enable graph output in dot format

stellar options
  main options:
    -e, --epsilon num
        maximal error rate (max 0.25). in range [0.0000001..0.25]. default: 0.05.
    -l, --minlength num
        minimal length of epsilon-matches. in range [0..inf]. default: 100.
    -f, --forward
        search only in forward strand of database.
    -r, --reverse
        search only in reverse complement of database.
    -a, --alphabet str
        alphabet type of input sequences. one of dna, dna5, rna, rna5, protein, and char.
    -v, --verbose
        set verbosity mode.
  filtering options:
    -k, --kmer num
        length of the q-grams (max 32). in range [1..32].
    -rp, --repeatperiod num
        maximal period of low complexity repeats to be filtered. default: 1.
    -rl, --repeatlength num
        minimal length of low complexity repeats to be filtered. default: 1000.
    -c, --abundancecut num
        k-mer overabundance cut ratio. in range [0..1]. default: 1.
  verification options:
    -x, --xdrop num
        maximal x-drop for extension. default: 5.
    -vs, --verification str
        verification strategy. one of exact, bestlocal, and bandedglobal. default: exact.
    -dt, --disablethresh num
        maximal number of verified matches before disabling verification for one query sequence (default infinity). in range [0..inf].
    -n, --nummatches num
        maximal number of kept matches per query and database. if stellar finds more matches, only the longest ones are kept. default: 50.
    -s, --sortthresh num
        number of matches triggering removal of duplicates. choose a smaller value for saving space. default: 500.

version
    gustaf version: 1.0
    last update july 2012

insegt.1

Intersecting second generation sequencing data with annotation

synopsis
    insegt [options] <alignments-file> <annotations-file>

description
    insegt is a tool to analyze alignments of rna-seq reads (single-end or paired-end) by using gene-annotations. input to insegt is a sam file containing the alignments and a file containing the annotations of the reference genome, either in gff or gtf format.

    -h, --help
        displays this help message.
    --version
        display version information

options:
    -ro, --read-output file
        output filename for read-output, which contains the mapped annotations followed by their parent annotation. valid filetype is: gff.
    -ao, --anno-output file
        output filename for anno-output, which contains the annotations similar to the gff input and additionally the counts of the mapped reads and the normalized expression levels in rpkm. valid filetype is: gff.
    -to, --tuple-output file
        output filename for tuple-output, which contains exon tuples connected by reads or matepairs. valid filetype is: gff.
    -fo, --fusion-output str
        output filename for fusion-output, which contains exon tuples of gene fusions (advanced option, currently no output port for knime). one of gff.
    -n, --ntuple int
        ntuple. default: 2.
    -o, --offset-interval int
        offset to short alignment-intervals for search. default: 5.
    -t, --threshold-gaps int
        threshold for allowed gaps in alignment (not introns). default: 5.
    -c, --threshold-count int
        threshold for min. count of tuple for output. default: 1.
    -r, --threshold-rpkm double
        threshold for min. rpkm of tuple for output. default: 0.0.
    -m, --max-tuple
        create only maxtuples (which are spanned by the whole read).
    -e, --exact-ntuple
        create only tuples of exact length n. by default all tuples up to the given length are computed (if -m is set, -e will be ignored).
    -u, --unknown-orientation
        orientation of reads is unknown.

examples

masai_indexer.1

Masai indexer

synopsis
    masai_indexer [options] <genome file>

description
    masai is a fast and accurate read mapper based on approximate seeds and multiple backtracking.
    see http://www.seqan.de/projects/masai for more information.
    (c) copyright 2011-2012 by enrico siragusa.

    -h, --help
        displays this help message.
    --version
        display version information

genome index options:
    -x, --index str
        select the genome index type. one of esa, sa, qgram, and fm. default: sa.
    -xp, --index-prefix str
        specify a genome index prefix name. default: use the genome filename prefix.

output options:
    -t, --tmp-folder str
        specify a huge temporary folder. default: use the genome folder.

version
    masai_indexer version: 0.7.1 [14053]
    last update 2013-05-16

masai_mapper.1

Masai mapper

synopsis
    masai_mapper [options] <genome file> <reads file>

description
    masai is a fast and accurate read mapper based on approximate seeds and multiple backtracking.
    see http://www.seqan.de/projects/masai for more information.
    (c) copyright 2011-2012 by enrico siragusa.

    -h, --help
        displays this help message.
    --version
        display version information

mapping options:
    -mm, --mapping-mode str
        select mapping mode. one of all, all-best, and any-best. default: any-best.
    -mb, --mapping-block num
        maximum number of reads to be mapped at once. in range [10000..inf]. default: 2147483647.
    -e, --errors num
        maximum number of errors per read. in range [0..32]. default: 5.
    -sl, --seed-length num
        minimum seed length. in range [10..100]. default: 33.
    -ng, --no-gaps
        do not align reads with gaps.

genome index options:
    -x, --index str
        select the genome index type. one of esa, sa, qgram, and fm. default: sa.
    -xp, --index-prefix str
        specify a genome index prefix name. default: use the genome filename prefix.

output options:
    -o, --output-file file
        specify an output file. default: use the reads filename prefix. valid filetypes are: raw and sam.
    -nc, --no-cigar
        do not output cigar string. this only affects sam output.

debug options:
    -nv, --no-verify
        do not verify seed hits.
    -nd, --no-dump
        do not dump results.
    -nm, --no-multiple
        disable multiple backtracking.

version
    masai_mapper version: 0.7.1 [14053]
    last update 2013-05-16

masai_output_pe.1

Masai output - paired end mode

synopsis
    masai_output_pe [options] <genome file> <reads file l> <reads file r> <raw file l> <raw file r>

description
    masai is a fast and accurate read mapper based on approximate seeds and multiple backtracking.
    see http://www.seqan.de/projects/masai for more information.
    (c) copyright 2011-2012 by enrico siragusa.

    -h, --help
        displays this help message.
    --version
        display version information

pairing options:
    -ng, --no-gaps
        do not align reads with gaps.
    -ll, --library-length num
        library length. default: 220.
    -le, --library-error num
        library length tolerance. default: 50.

output options:
    -t, --tmp-folder str
        specify a huge temporary folder. default: use the genome folder.
    -o, --output-file file
        specify an output file. default: use the reads filename prefix. valid filetypes are: raw and sam.
    -nc, --no-cigar
        do not output cigar string. this only affects sam output.

debug options:
    -nd, --no-dump
        do not dump results.

version
    masai_output_pe version: 0.7.1 [14053]
    last update 2013-05-16

masai_output_se.1

Masai output - single end mode

synopsis
    masai_output_se [options] <genome file> <reads file> <raw file>

description
    masai is a fast and accurate read mapper based on approximate seeds and multiple backtracking.
    see http://www.seqan.de/projects/masai for more information.
    (c) copyright 2011-2012 by enrico siragusa.

    -h, --help
        displays this help message.
    --version
        display version information

mapping options:
    -ng, --no-gaps
        do not align reads with gaps.
    -m, --matches num
        maximum number of matches per read.

output options:
    -t, --tmp-folder str
        specify a huge temporary folder. default: use the genome folder.
    -o, --output-file file
        specify an output file. default: use the reads filename prefix. valid filetypes are: raw and sam.
    -nc, --no-cigar
        do not output cigar string. this only affects sam output.

debug options:
    -nd, --no-dump
        do not dump results.

version
    masai_output_se version: 0.7.1 [14053]
    last update 2013-05-16

micro_razers.1

Micro_razers

synopsis
    micro_razers [options] <genome file> <reads file>

description
    microrazers uses a prefix-based mapping strategy to map small rna reads possibly containing 3' adapter sequence.
    (c) copyright 2009 by anne-katrin emde.

    -h, --help
        displays this help message.
    --version
        display version information

main options:
    -o, --output file
        change output filename. default: <reads file>.result.
    -rr, --recognition-rate num
        set the percent recognition rate. in range [80..100]. default: 100.
    -sl, --seed-length num
        seed length. in range [10..inf]. default: 16.
    -se, --seed-error
        allow for one error in the seed
    -f, --forward
        map reads only to forward strands.
    -r, --reverse
        map reads only to reverse strands.
    -mn, --match-n
        'n' matches with all other characters
    -m, --max-hits num
        output only num of the best hits. in range [1..inf]. default: 100.
    -pa, --purge-ambiguous
        purge reads with more than max-hits best matches
    -lm, --low-memory
        decrease memory usage at the expense of runtime
    -v, --verbose
        verbose mode
    -vv, --vverbose
        very verbose mode

output format options:
    -of, --output-format num
        set output format. 0 = microrazers format, 1 = sam. in range [0..1].
    -a, --alignment
        dump the alignment for each match
    -gn, --genome-naming num
        select how genomes are named. 0 = use fasta id, 1 = enumerate beginning with 1. in range [0..1]. default: 0.
    -rn, --read-naming num
        select how reads are named. 0 = use fasta id, 1 = enumerate beginning with 1. in range [0..1]. default: 0.
    -so, --sort-order num
        select how matches are sorted. 0 = read number, 1 = genome position. in range [0..1]. default: 0.
    -pf, --position-format num
        select begin/end position numbering (see coordinate section below). 0 = gap space, 1 = position space. in range [0..1]. default: 0.

version
    micro_razers version: 1.0.1
    last update jul 2009

pair_align.1

Pairwise alignment

synopsis
    pair_align [options] -s in.fa

description
    the program aligns two sequences using dynamic programming alignment algorithms and lets you tweak various parameters.

    -h, --help
        displays this help message.
    --version
        display version information

main options:
    -s, --seq in.fa
        fasta file with two sequences. valid filetypes are: fasta and fa.
    -a, --alphabet alphabet
        sequence alphabet. one of protein, dna, rna, and text. default: protein.
    -m, --method method
        dp alignment method: needleman-wunsch, gotoh, smith-waterman, longest common subsequence. one of nw, gotoh, sw, and lcs. default: gotoh.
    -o, --outfile out
        output filename. valid filetypes are: fa, fasta, and msf. default: out.fasta.

scoring options:
    -g, --gop int
        gap open penalty. default: -11.
    -e, --gex int
        gap extension penalty. default: -1.
    -ma, --matrix matrix_file
        score matrix.
    -ms, --msc int
        match score. default: 5.
    -mm, --mmsc int
        mismatch penalty. default: -4.

banded alignment options:
    -lo, --low int
        lower diagonal.
    -hi, --high int
        upper diagonal.

dp matrix configuration options:
    -c, --config conf
        alignment configuration. one of ffff, ffft, fftf, fftt, ftff, ftft, fttf, fttt, tfff, tfft, tftf, tftt, ttff, ttft, tttf, and tttt.

alignment configuration
    the alignment configuration is a string of four characters, each being either t or f. all combinations are allowed. the meaning is as follows.

rabema_build_gold_standard.1

Rabema gold standard builder

synopsis
    rabema_build_gold_standard [options] --out-gsi out.gsi --reference ref.fa --in-sam perfect.sam
    rabema_build_gold_standard [options] --out-gsi out.gsi --reference ref.fa --in-bam perfect.bam

description
    this program builds a rabema gold standard. the input is a reference fasta file and a perfect sam/bam map (e.g. created using razers 3 in full-sensitivity mode).
    the input sam/bam file must be sorted by coordinate. the program will create a fasta index file ref.fa.fai for fast random access to the reference.

    -h, --help
        displays this help message.
    --version
        display version information
    -v, --verbose
        enable verbose output.
    -vv, --very-verbose
        enable even more verbose output.

input / output:
    -o, --out-gsi gsi
        path to write the resulting gsi file to. valid filetypes are: gsi and gsi.gz.
    -r, --reference fasta
        path to load reference fasta from. valid filetypes are: fa and fasta.
    -s, --in-sam sam
        path to load the "perfect" sam file from. valid filetype is: sam.
    -b, --in-bam bam
        path to load the "perfect" bam file from. valid filetype is: bam.

gold standard parameters:
    --oracle-mode
        enable oracle mode. this is used for simulated data when the input sam/bam file gives exactly one position that is considered as the true sample position.
    --match-n
        when set, n matches all characters without penalty.
    --distance-metric metric
        set distance metric. one of hamming and edit. default: edit.
    -e, --max-error rate
        maximal error rate to build gold standard for in percent. this parameter is an integer and relative to the read length. in case of oracle mode, the error rate for the read at the sampling position is used and rate is used as a cutoff threshold. default: 0.

return values
    a return value of 0 indicates success, any other value indicates an error.

examples
    rabema_build_gold_standard -e 4 -o out.gsi -s in.sam -r ref.fa
        build gold standard from a sam file in.sam with all mapping locations and a fasta reference ref.fa to gsi file out.gsi with a maximal error rate of 4.
    rabema_build_gold_standard --distance-metric hamming -e 4 -o out.gsi -b in.bam -r ref.fa
        same as above, but using hamming instead of edit distance and bam as the input.
    rabema_build_gold_standard --oracle-mode -o out.gsi -s in.sam -r ref.fa
        build gold standard from a sam file in.sam with the original sample position, e.g. as exported by read simulator mason.

memory requirements
    from version 1.1, great care has been taken to keep the memory requirements as low as possible.
    the memory required is two times the size of the largest chromosome plus some constant memory for each match.
    for example, the memory usage for 100bp human genome reads at 5% error rate was 1.7gb. of this, roughly 400mb came from the chromosome and 1.3gb from the matches.

references
    m. holtgrewe, a.-k. emde, d. weese and k. reinert. a novel and well-defined benchmarking method for second generation read mapping, bmc bioinformatics 2011, 12:210.
    http://www.seqan.de/rabema
        rabema homepage
    http://www.seqan.de/mason
        mason homepage

version
    rabema_build_gold_standard version: 1.2.0
    last update march 14, 2013
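The difference between the two values of --distance-metric, and the way a percent error rate translates into an error budget for one read, can be sketched in Python as follows. This is an illustration under stated assumptions, not rabema's code: the function names are hypothetical, and rounding the error budget down is an assumption that the text above does not confirm.

```python
def hamming_distance(a, b):
    """Number of mismatching positions; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: mismatches, insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # match or mismatch
        prev = cur
    return prev[-1]


def max_errors(read_length, rate_percent):
    """Error budget for one read (assumption: rounded down)."""
    return read_length * rate_percent // 100


if __name__ == "__main__":
    read, ref = "ACGTACGT", "CGTACGTA"  # read shifted by one base
    print(hamming_distance(read, ref), edit_distance(read, ref))
    print(max_errors(100, 4))
```

A single shifted base makes every position mismatch under hamming distance, while edit distance explains it with one deletion and one insertion, which is why the two metrics can yield very different gold standards.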

rabema_evaluate.1

Rabema evaluation

synopsis
    rabema_evaluate [options] --reference ref.fa --in-gsi in.gsi --in-sam mapping.sam
    rabema_evaluate [options] --reference ref.fa --in-gsi in.gsi --in-bam mapping.bam

description
    compare the sam/bam output mapping.sam/mapping.bam of any read mapper against the rabema gold standard previously built with rabema_build_gold_standard. the input is a reference fasta file, a gold standard interval (gsi) file and the sam/bam input to evaluate.
    the input sam/bam file must be sorted by queryname. the program will create a fasta index file ref.fa.fai for fast random access to the reference.

    -h, --help
        displays this help message.
    --version
        display version information
    -v, --verbose
        enable verbose output.
    -vv, --very-verbose
        enable even more verbose output.

input / output:
    -r, --reference fasta
        path to load reference fasta from. valid filetypes are: fa and fasta.
    -g, --in-gsi gsi
        path to load gold standard intervals from. if compressed using gzip, the file will be decompressed on the fly. valid filetypes are: gsi and gsi.gz.
    -s, --in-sam sam
        path to load the read mapper sam output from. valid filetype is: sam.
    -b, --in-bam bam
        path to load the read mapper bam output from. valid filetype is: bam.
    --out-tsv tsv
        path to write the statistics to as tsv. valid filetype is: tsv.

benchmark parameters:
    --oracle-mode
        enable oracle mode. this is used for simulated data when the input gsi file gives exactly one position that is considered as the true sample position.
    --only-unique-reads
        consider only reads that have a single alignment in the mapping result file. useful for precision computation.
    --match-n
        when set, n matches all characters without penalty.
    --distance-metric metric
        set distance metric. one of hamming and edit. default: edit.
    -e, --max-error rate
        maximal error rate to build gold standard for in percent. this parameter is an integer and relative to the read length. the error rate is ignored in oracle mode, here the distance of the read at the sample position is taken, individually for each read. default: 0.
    -c, --benchmark-category cat
        set benchmark category. one of all, all-best, and any-best. default: all.
    --trust-nm
        when set, we trust the alignment and distance from sam/bam file and no realignment is performed. off by default.
    --ignore-paired-flags
        when set, we ignore all sam/bam flags related to pairing. this is necessary when analyzing sam from soap's soap2sam.pl script.
    --dont-panic
        do not stop program execution if an additional hit was found that indicates that the gold standard is incorrect.

logging:
    --show-missed-intervals
        show details for each missed interval from the gsi.
    --show-invalid-hits
        show details for invalid hits (with too high error rate).
    --show-additional-hits
        show details for additional hits (low enough error rate but not in gold standard).
    --show-hits
        show details for hit intervals.
    --show-try-hit
        show details for each alignment in sam/bam input.

    the occurrence of "invalid" hits in the read mapper's output is not an error. if there are additional hits, however, this shows an error in the gold standard.

return values
    a return value of 0 indicates success, any other value indicates an error.

memory requirements
    from version 1.1, great care has been taken to keep the memory requirements as low as possible.
    the evaluation step needs to store the whole reference sequence in memory but little else. so, for the human genome, the memory requirements are below 4 gb, regardless of the size of the gsi or sam/bam file.

references
    m. holtgrewe, a.-k. emde, d. weese and k. reinert. a novel and well-defined benchmarking method for second generation read mapping, bmc bioinformatics 2011, 12:210.
    http://www.seqan.de/rabema
        rabema homepage
    http://www.seqan.de/mason
        mason homepage

version
    rabema_evaluate version: 1.2.0
    last update march 14, 2013

rabema_prepare_sam.1

Prepare sam for rabema

synopsis
    rabema_prepare_sam -i in.sam -o out.sam

description
    prepare sam file for usage with rabema.

    -h, --help
        displays this help message.
    --version
        display version information
    -i, --in-file in.sam
        path to the input file. valid filetype is: sam.
    -o, --out-file out.sam
        path to the output file. valid filetype is: sam.
    --dont-check-sorting
        do not check sortedness.

version
    rabema_prepare_sam version: 1.2.0
    last update march 14, 2013

razers.1

Fast read mapping with sensitivity control

synopsis
    razers [options] <genome file> <reads file>
    razers [options] <genome file> <mp-reads file1> <mp-reads file2>

description
    razers is a versatile full-sensitive read mapper based on a k-mer counting filter. it supports single and paired-end mapping, and optimally parametrizes the filter based on a user-defined minimal sensitivity. see http://www.seqan.de/projects/razers for more information.
    input to razers is a reference genome file and either one file with single-end reads or two files containing left or right mates of paired-end reads.
    (c) copyright 2009 by david weese.

    -h, --help
        displays this help message.
    --version
        display version information

main options:
    -f, --forward
        map reads only to forward strands.
    -r, --reverse
        map reads only to reverse strands.
    -i, --percent-identity num
        percent identity threshold. in range [50..100]. default: 92.
    -rr, --recognition-rate num
        percent recognition rate. in range [80..100]. default: 99.
    -pd, --param-dir dir
        read user-computed parameter files in the directory dir.
    -id, --indels
        allow indels. default: mismatches only.
    -ll, --library-length num
        paired-end library length. in range [1..inf]. default: 220.
    -le, --library-error num
        paired-end library length tolerance. in range [0..inf]. default: 50.
    -m, --max-hits num
        output only num of the best hits. in range [1..inf]. default: 100.
    --unique
        output only unique best matches (-m 1 -dr 0 -pa).
    -tr, --trim-reads num
        trim reads to given length. default: off. in range [14..inf].
    -o, --output file
        change output filename. default: <reads file>.razers. valid filetypes are: razers, eland, fa, fasta, and gff.
    -v, --verbose
        verbose mode.
    -vv, --vverbose
        very verbose mode.

output format options:
    -a, --alignment
        dump the alignment for each match (only razer or fasta format).
    -pa, --purge-ambiguous
        purge reads with more than max-hits best matches.
    -dr, --distance-range num
        only consider matches with at most num more errors compared to the best. default: output all.
    -gn, --genome-naming num
        select how genomes are named (see naming section below). in range [0..1]. default: 0.
    -rn, --read-naming num
        select how reads are named (see naming section below). in range [0..2]. default: 0.
    -so, --sort-order num
        select how matches are sorted (see sorting section below). in range [0..1]. default: 0.
    -pf, --position-format num
        select begin/end position numbering (see coordinate section below). in range [0..1]. default: 0.

filtration options:
    -s, --shape bitstring
        manually set k-mer shape. default: 11111111111.
    -t, --threshold num
        manually set minimum k-mer count threshold. in range [1..inf].
    -oc, --overabundance-cut num
        set k-mer overabundance cut ratio. in range [0..1].
    -rl, --repeat-length num
        skip simple-repeats of length num. in range [1..inf]. default: 1000.
    -tl, --taboo-length num
        set taboo length. in range [1..inf]. default: 1.
    -lm, --low-memory
        decrease memory usage at the expense of runtime.

verification options:
    -mn, --match-n
        n matches all other characters. default: n matches nothing.
    -ed, --error-distr file
        write error distribution to file.
    -mcl, --min-clipped-len num
        set minimal read length for read clipping. in range [0..inf]. default: 0.
    -qih, --quality-in-header
        quality string in fasta header.

formats, naming, sorting, and coordinate schemes
    razers supports various output formats. the output format is detected automatically from the file name suffix.
    .razers
        razer format
    .fa, .fasta
        enhanced fasta format
    .eland
        eland format
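As an illustration of what a k-mer shape bitstring such as the --shape default 11111111111 describes, the following Python sketch extracts (possibly gapped) k-mers from a sequence. It assumes the common convention that a 1 marks a position that is part of the k-mer and a 0 marks a position that is skipped; the function name is hypothetical and this is not razers' filter code.

```python
def gapped_kmers(seq, shape):
    """Extract one k-mer per window of len(shape) characters.

    Assumption: '1' in shape keeps the character at that offset,
    '0' ignores it (a "don't care" position).
    """
    if not shape or set(shape) - {"0", "1"}:
        raise ValueError("shape must be a non-empty string of 0s and 1s")
    keep = [i for i, bit in enumerate(shape) if bit == "1"]
    span = len(shape)
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - span + 1)]


if __name__ == "__main__":
    print(gapped_kmers("ACGTACGTACGTA", "11111111111"))  # ungapped 11-mers
    print(gapped_kmers("ACGTA", "101"))                  # middle base ignored
```

With an all-ones shape the function simply returns every contiguous k-mer; zeros in the shape let a k-mer still match when the skipped position differs.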

razers3.1

Faster, fully sensitive read mapping

synopsis
    razers3 [options] <genome file> <reads file>
    razers3 [options] <genome file> <pe-reads file1> <pe-reads file2>

description
    razers 3 is a versatile full-sensitive read mapper based on k-mer counting and seeding filters. it supports single and paired-end mapping, shared-memory parallelism, and optimally parametrizes the filter based on a user-defined minimal sensitivity. see http://www.seqan.de/projects/razers for more information.
    input to razers 3 is a reference genome file and either one file with single-end reads or two files containing left or right mates of paired-end reads.
    (c) copyright 2009-2013 by david weese.

    -h, --help
        displays this help message.
    --version
        display version information

main options:
    -i, --percent-identity num
        percent identity threshold. in range [50..100]. default: 95.
    -rr, --recognition-rate num
        percent recognition rate. in range [80..100]. default: 99.
    -ng, --no-gaps
        allow only mismatches, no indels. default: allow both.
    -f, --forward
        map reads only to forward strands.
    -r, --reverse
        map reads only to reverse strands.
    -m, --max-hits num
        output only num of the best hits. in range [1..inf]. default: 100.
    --unique
        output only unique best matches (-m 1 -dr 0 -pa).
    -tr, --trim-reads num
        trim reads to given length. default: off. in range [14..inf].
    -o, --output file
        mapping result filename. default: <reads file>.razers. valid filetypes are: .razers, .eland, .fa, .fasta, .gff, .sam, and .afg.
    -v, --verbose
        verbose mode.
    -vv, --vverbose
        very verbose mode.

paired-end options:
    -ll, --library-length num
        paired-end library length. in range [1..inf]. default: 220.
    -le, --library-error num
        paired-end library length tolerance. in range [0..inf]. default: 50.

output format options:
    -a, --alignment
        dump the alignment for each match (only razer or fasta format).
    -pa, --purge-ambiguous
        purge reads with more than max-hits best matches.
    -dr, --distance-range num
        only consider matches with at most num more errors compared to the best. default: output all.
    -gn, --genome-naming num
        select how genomes are named (see naming section below). in range [0..1]. default: 0.
    -rn, --read-naming num
        select how reads are named (see naming section below). in range [0..3]. default: 0.
    --full-readid
        use the whole read id (don't clip after whitespace).
    -so, --sort-order num
        select how matches are sorted (see sorting section below). in range [0..1]. default: 0.
    -pf, --position-format num
        select begin/end position numbering (see coordinate section below). in range [0..1]. default: 0.
    -ds, --dont-shrink-alignments
        disable alignment shrinking in sam. this is required for generating a gold mapping for rabema.

filtration options:
    -fl, --filter str
        select k-mer filter. one of pigeonhole and swift. default: pigeonhole.
    -mr, --mutation-rate num
        set the percent mutation rate (pigeonhole). in range [0..20]. default: 5.
    -ol, --overlap-length num
        manually set the overlap length of adjacent k-mers (pigeonhole). in range [0..inf].
    -pd, --param-dir dir
        read user-computed parameter files in the directory dir (swift).
    -t, --threshold num
        manually set minimum k-mer count threshold (swift). in range [1..inf].
    -tl, --taboo-length num
        set taboo length (swift). in range [1..inf]. default: 1.
    -s, --shape bitstring
        manually set k-mer shape.
    -oc, --overabundance-cut num
        set k-mer overabundance cut ratio. in range [0..1]. default: 1.
    -rl, --repeat-length num
        skip simple-repeats of length num. in range [1..inf]. default: 1000.
    -lf, --load-factor num
        set the load factor for the open addressing k-mer index. in range [1..inf]. default: 1.6.

verification options:
    -mn, --match-n
        n matches all other characters. default: n matches nothing.
    -ed, --error-distr file
        write error distribution to file.
    -mf, --mismatch-file file
        write mismatch patterns to file.

misc options:
    -cm, --compact-mult num
        multiply compaction threshold by this value after reaching and compacting. in range [0..inf]. default: 2.2.
    -ncf, --no-compact-frac num
        don't compact if in this last fraction of genome. in range [0..1]. default: 0.05.

parallelism options:
    -pws, --parallel-window-size num
        collect candidates in windows of this length. in range [1..inf]. default: 500000.
    -pvs, --parallel-verification-size num
        verify candidates in packages of this size. in range [1..inf]. default: 100.
    -pvmpc, --parallel-verification-max-package-count num
        largest number of packages to create for verification per thread-1. in range [1..inf]. default: 100.
    -amms, --available-matches-memory-size num
        bytes of main memory available for storing matches. in range [-1..inf]. default: 0.
    -mhst, --match-histo-start-threshold num
        when to start histogram. in range [1..inf]. default: 5.

formats, naming, sorting, and coordinate schemes
    razers 3 supports various output formats. the output format is detected automatically from the file name suffix.
    .razers
        razer format
    .fa, .fasta
        enhanced fasta format
    .eland
        eland format

sak.1

Slicing and dicing of fasta/fastq files.

synopsis
    sak [options] [-o out.{fa,fq}] in.{fa,fq}

description
    "it slices, it dices and it makes the laundry!"
    rewrite of the original sak tool by manuel holtgrewe.

    -h, --help
        displays this help message.
    --version
        display version information

output options:
    -o, --out-path fastx
        path to the resulting file. if omitted, result is printed to stdout. use files ending in .fq or .fastq to write out fastq. valid filetypes are: .fq, .fastq, .fa, .fasta, .faa, .ffn, .fna, and .frn.
    -rc, --revcomp
        reverse-complement output.
    -l, --max-length len
        maximal number of sequence characters to write out.

filter options:
    -s, --sequence num
        select the given sequence for extraction by 0-based index.
    -sn, --sequence-name name
        select sequence with name prefix being name.
    -ss, --sequences range
        select sequences from-to where from and to are 0-based indices.
    -i, --infix range
        select characters from-to where from and to are 0-based indices.
    -ll, --line-length len
        set line length in output file. see section line length for details. in range [-1..inf].

line length
    you can use the setting --line-length for setting the resulting line length. by default, sequences in fasta files are written with at most 70 characters per line and sequences in fastq files are written without any line breaks. the quality sequence in fastq file is written in the same way as the residue sequence.
    the default is selected with a --line-length value of -1 and line breaks can be disabled with a value of 0.

usage examples
    sak -s 10 in.fa
        cut out 11th sequence from in.fa and write to stdout as fasta.
    sak -ss 10-12 -ss 100-200 in.fq
        cut out 11th up to and including 12th and 101st up to and including 200th sequence from in.fq and write to stdout as fasta.

version
    sak version: 0.2
    last update november 2012
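The --line-length rules and the 0-based range selection described above can be sketched in a few lines of Python. This is only a reading of the text, not sak's source: the function names are hypothetical, and treating a from-to range as half-open (to excluded) is an assumption inferred from the usage example in which 10-12 selects the 11th and 12th sequences.

```python
def wrap_sequence(seq, line_length=-1, fasta=True):
    """Split seq into output lines following the --line-length rules.

    -1 selects the default: 70 characters per line for fasta,
    no line breaks for fastq. 0 disables line breaks entirely.
    """
    if line_length < -1:
        raise ValueError("line length must be in range [-1..inf]")
    if line_length == -1:
        line_length = 70 if fasta else 0
    if line_length == 0:
        return [seq]
    return [seq[i:i + line_length] for i in range(0, len(seq), line_length)]


def select_range(records, begin, end):
    """Select records begin..end by 0-based index (assumed half-open)."""
    return records[begin:end]


if __name__ == "__main__":
    lines = wrap_sequence("ACGT" * 40)          # 160 characters of fasta
    print([len(line) for line in lines])
    print(select_range(list(range(20)), 10, 12))
```

The same wrapping would be applied to the quality string of a fastq record, since the text states that qualities are written in the same way as the residues.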

seqan_mason.1

seqan_tcoffee.1

Multiple sequence alignment synopsis seqan_tcoffee -s fasta file [options] description seqan::t-coffee is a multiple sequence alignment tool. (c) copyright 2009 by tobias rausch -h, --help displays this help message. --version display version information main options: -s, --seq file name of multi-fasta input file. valid filetypes are: fa and fasta. -a, --alphabet str the used sequence alphabet. one of protein, dna, and rna. default: protein. -o, --outfile file name of the output file. valid filetypes are: fasta and msf. default: out.fasta. segment match generation options: -m, --method str defines the generation method for matches. to select multiple generation methods recall this option with different arguments. one of global, local, overlap, and lcs. default: global and local. -l, --libraries file name of match file. to select multiple files recall this option with different arguments. valid filetypes are: blast, mums, aln, and lib. scoring options: -g, --gop num gap open penalty. default: -13. -e, --gex num gap extension penalty. default: -1. -ma, --matrix str score matrix. default: blosum62. -ms, --msc num match score. default: 5. -mm, --mmsc num mismatch penalty. default: -4. guide tree options: -u, --usetree str name of the file containing the newick guide tree. -b, --build str method to build the tree. following methods are provided: neighbor-joining (nj), upgma single linkage (min), upgma complete linkage (max), upgma average linkage (avg), upgma weighted average linkage (wavg). neighbor-joining creates an unrooted tree, which we root at the last joined pair. one of nj, min, max, avg, and wavg. default: nj. alignment evaluation options: -i, --infile file name of the alignment file (fasta file). valid filetypes are: fa and fasta. version seqan_tcoffee version: 1.11 (30. july 2009) revision: 4637 last update july 2012
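To make the scoring options concrete, the sketch below scores one pair of aligned rows using the default values listed above (match 5, mismatch -4, gap open -13, gap extension -1). It is an illustration only, not seqan_tcoffee code. It assumes the common affine convention in which the first character of a gap costs the gap open penalty and each further character costs the gap extension penalty, and it uses plain match/mismatch scores rather than the blosum62 matrix; `alignment_score` is a hypothetical helper.

```python
def alignment_score(row_a, row_b, match=5, mismatch=-4, gop=-13, gex=-1):
    """Score two equal-length alignment rows; '-' marks a gap character."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    gap_row = None  # row that holds the gap run currently being extended
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # column without residues contributes nothing
        if a == "-" or b == "-":
            current = "a" if a == "-" else "b"
            # Assumed convention: gop for the first gap character, gex afterwards.
            score += gex if gap_row == current else gop
            gap_row = current
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


print(alignment_score("AC--GT", "ACTTGT"))
```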

snp_store.1

Snpstore synopsis snp_store [options] "genome file" "mapped read file(s)" description snp and indel calling in mapped read data. -h, --help displays this help message. --version display version information options: -o, --output file output file for snps (must be set, no default construction). -if, --input-format num set input format: 0 for gff format and 1 for sam format (both must be sorted according to genome positions). default: 0. -of, --output-format num set output format: 0 to output all candidate snps and 1 to output successful candidate snps only. default: 0. -dc, --dont-clip ignore clip tags in gff. default: off. -mu, --multi keep non-unique fragmentstore.alignedreadstore. default: off. -hq, --hide-qualities only show coverage (no qualities) in snp output file. default: off. -sqo, --solexa-qual-offset base qualities are encoded as ascii value - 64 (instead of ascii - 33). -id, --indel-file file output file for called indels in gff format. default: off. -m, --method num set method used for snp calling: 0 for threshold method and 1 for maq method. default: 1. -mp, --max-pile num maximal number of matches allowed to pile up at the same genome position. -mmp, --merged-max-pile do pile up correction on merged lanes. default: off. -mc, --min-coverage num minimal required number of reads covering a candidate position. -fc, --force-call num always call base if count is = fc, ignore other parameters. default: off. in range [1..inf]. -oa, --orientation-aware distinguish between forward and reverse reads. default: off. -mpr, --max-polymer-run num discard indels in homopolymer runs longer than mpr. -dp, --diff-pos num minimal number of different read positions supporting the mutation. -eb, --exclude-border num exclude read positions within eb base pairs of read borders for snv calling. -su, --suboptimal keep suboptimal reads. -re, --realign realign reads around indel candidates. 
-pws, --parse-window-size num genomic window size for parsing reads (concerns memory consumption, choose smaller windows for higher coverage). in range [1..100000]. snp calling options: threshold method related: -mm, --min-mutations num minimal number of observed mutations for mutation to be called. -pt, --perc-threshold num minimal percentage of mutational base for mutation to be called. -mq, --min-quality num minimal average quality of mutational base for mutation to be called. maq method related: -th, --theta num dependency coefficient. -hr, --hetero-rate num heterozygote rate. -mmq, --min-map-quality num minimum base call (mapping) quality for a match to be considered. -ch, --corrected-het use amplification bias corrected distribution for heterozygotes. default: off. -maf, --mean-allelefreq num mean ref allele frequency in heterozygotes. -ac, --amp-cycles num number of amplification cycles. -ae, --amp-efficiency num polymerase efficiency, probability of amplification. -in, --initial-n num initial allele population size. -mec, --min-explained-column num minimum fraction of alignment column reads explained by genotype call. indel calling options: -it, --indel-threshold num minimal number of indel-supporting reads required for indel calling. -ipt, --indel-perc-threshold num minimal ratio of indel-supporting/covering reads for indel to be called. -iqt, --indel-quality-thresh num minimal average quality of inserted base/deletion-neighboring bases for indel to be called. -bsi, --both-strands-indel both strands need to be observed for indel to be called. default: off. -ebi, --exclude-border-indel num same as option -eb but for indel candidates. other options: -lf, --log-file file write log file to file. -v, --verbose enable verbose output. -vv, --very-verbose enable very verbose output. -q, --quiet set verbosity to a minimum. version snp_store version: 1.0.1 last update march 14, 2013
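The quality encoding option (-sqo) and the three threshold method options (-mm, -pt, -mq) can be illustrated together. The sketch below is one plausible reading of those descriptions and not the snp_store implementation: the helper names are hypothetical, the percentage threshold is treated as a fraction between 0 and 1, and the parameter values in the example call are invented for illustration.

```python
def decode_qualities(qual_string, solexa_offset=False):
    """Decode base qualities: ascii - 33 by default, ascii - 64 with -sqo."""
    offset = 64 if solexa_offset else 33
    return [ord(c) - offset for c in qual_string]


def passes_threshold(observed, candidate, min_mutations, perc_threshold, min_quality):
    """Apply the three threshold method criteria to one genome position.

    observed: list of (base, quality) pairs for the reads covering the position.
    perc_threshold is assumed to be a fraction in [0, 1].
    """
    support = [q for base, q in observed if base == candidate]
    if not support or len(support) < min_mutations:
        return False
    fraction = len(support) / len(observed)
    mean_quality = sum(support) / len(support)
    return fraction >= perc_threshold and mean_quality >= min_quality


column = [("A", 30), ("A", 35), ("G", 30), ("G", 20), ("G", 40)]
print(passes_threshold(column, "G", min_mutations=2, perc_threshold=0.25, min_quality=10))
```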

splazers.1

Splazers synopsis splazers [options] genome file reads file splazers [options] genome file reads file 1 reads file 2 description splazers uses a prefix-suffix mapping strategy to split-map read sequences. if a sam file of mapped reads is given as input, all unmapped but anchored reads are split-mapped onto anchoring target regions (specify option -an); if a fasta/q file of reads is given, reads are split-mapped onto the whole reference sequence. (c) copyright 2010 by anne-katrin emde. -h, --help displays this help message. --version display version information main options: -o, --output file change output filename. default: reads file.result. -f, --forward only compute forward matches -r, --reverse only compute reverse complement matches -i, --percent-identity num percent identity threshold. in range [50..100]. default: 92. -rr, --recognition-rate num set the percent recognition rate in range [80..100]. default: 99. -pd, --param-dir dir read user-computed parameter files in the directory dir. -id, --indels allow indels. default: mismatches only. -ll, --library-length num paired-end library length. in range [1..inf]. default: 220. -le, --library-error num paired-end library length tolerance. in range [0..inf]. default: 50. -m, --max-hits num output only num of the best hits. in range [1..inf]. default: 100. --unique output only unique best matches (-m 1 -dr 0 -pa). -tr, --trim-reads num trim reads to given length. default: off. in range [14..inf]. -mcl, --min-clipped-len num min. read length for read clipping in range [1..inf]. default: 0. 
-qih, --quality-in-header quality string in fasta header -ou, --outputunmapped file output filename for unmapped reads -v, --verbose verbose mode -vv, --vverbose very verbose mode output format options: -a, --alignment dump the alignment for each match -pa, --purge-ambiguous purge reads with more than max-hits best matches -dr, --distance-range num only consider matches with at most num more errors compared to the best (default output all) -of, --output-format num set output format. 0 = razers, 1 = enhanced fasta, 2 = eland, 3 = gff, 4 = sam. in range [0..4]. -gn, --genome-naming num select how genomes are named. 0 = use fasta id, 1 = enumerate beginning with 1. in range [0..1]. default: 0. -rn, --read-naming num select how reads are named. 0 = use fasta id, 1 = enumerate beginning with 1. in range [0..1]. default: 0. -so, --sort-order num select how matches are sorted. 0 = read number, 1 = genome position. in range [0..1]. default: 0. -pf, --position-format num select begin/end position numbering (see coordinate section below). 0 = gap space, 1 = position space. in range [0..1]. default: 0. split mapping options: -sm, --split-mapping num min. match length for prefix/suffix mapping (to disable split mapping, set to 0) default: 18. -maxg, --max-gap num max. length of middle gap default: 10000. -ming, --min-gap num min. length of middle gap (for edit distance mapping about 10% of read length is recommended) default: 0. -ep, --errors-prefix num max. number of errors in prefix match default: 1. -es, --errors-suffix num max. number of errors in suffix match default: 1. -gl, --genome-len num genome length in mb, for computation of expected number of random matches in range [-inf..10000]. default: 3000. -an, --anchored anchored split mapping, only unmapped reads with mapped mates will be considered, requires the reads to be given in sam format -pc, --penalty-c num percent of read length, used as penalty for split-gap default: 2. 
filtration options: -oc, --overabundance-cut num set k-mer overabundance cut ratio. in range [0..1]. -rl, --repeat-length num skip simple-repeats of length num. in range [1..inf]. default: 1000. -tl, --taboo-length num set taboo length. in range [1..inf]. default: 1. -lm, --low-memory decrease memory usage at the expense of runtime verification options: -mn, --match-n n matches all other characters. default: n matches nothing. -ed, --error-distr file write error distribution to file. version splazers version: 1.1 last update apr 2011
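Two of the main options above lend themselves to a quick numeric illustration: how a percent identity threshold translates into an error budget for a read, and what a paired-end library length with a tolerance implies. The sketch below is an interpretation, not splazers code. It assumes that the number of allowed errors is the read length times (100 - percent identity) / 100, rounded down, and that the tolerance applies symmetrically around the library length; both helper names are hypothetical.

```python
import math


def max_errors(read_length, percent_identity=92):
    """Error budget for a read under a percent identity threshold (assumed rounding down)."""
    return math.floor(read_length * (100 - percent_identity) / 100)


def library_length_ok(observed_length, library_length=220, library_error=50):
    """Check an observed fragment length against library length +/- tolerance (assumed symmetric)."""
    return abs(observed_length - library_length) <= library_error


print(max_errors(100), library_length_ok(250))
```

Under these assumptions and the default values, a read of length 100 may carry up to 8 errors and fragment lengths from 170 to 270 are accepted.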

stellar.1

The swift exact local aligner synopsis stellar [options] fasta file 1 fasta file 2 description stellar implements the swift filter algorithm (rasmussen et al., 2006) and a verification step for the swift hits that applies local alignment, gapped x-drop extension, and extraction of the longest epsilon-match. input to stellar are two files, each containing one or more sequences in fasta format. each sequence from file 1 will be compared to each sequence in file 2. the sequences from file 1 are used as database, the sequences from file 2 as queries. (c) 2010-2012 by birte kehr -h, --help displays this help message. --version display version information main options: -e, --epsilon num maximal error rate (max 0.25). in range [0.0000001..0.25]. default: 0.05. -l, --minlength num minimal length of epsilon-matches. in range [0..inf]. default: 100. -f, --forward search only in forward strand of database. -r, --reverse search only in reverse complement of database. -a, --alphabet str alphabet type of input sequences (dna, rna, dna5, rna5, protein, char). one of dna, dna5, rna, rna5, protein, and char. -v, --verbose set verbosity mode. filtering options: -k, --kmer num length of the q-grams (max 32). in range [1..32]. -rp, --repeatperiod num maximal period of low complexity repeats to be filtered. default: 1. -rl, --repeatlength num minimal length of low complexity repeats to be filtered. default: 1000. -c, --abundancecut num k-mer overabundance cut ratio. in range [0..1]. default: 1. verification options: -x, --xdrop num maximal x-drop for extension. default: 5. -vs, --verification str verification strategy: exact or bestlocal or bandedglobal one of exact, bestlocal, and bandedglobal. default: exact. -dt, --disablethresh num maximal number of verified matches before disabling verification for one query sequence (default infinity). in range [0..inf]. -n, --nummatches num maximal number of kept matches per query and database. 
if stellar finds more matches, only the longest ones are kept. default: 50. -s, --sortthresh num number of matches triggering removal of duplicates. choose a smaller value for saving space. default: 500. output options: -o, --out file name of output file. valid filetypes are: gff and txt. default: stellar.gff. -od, --outdisabled file name of output file for disabled query sequences. valid filetypes are: fa and fasta. default: stellar.disabled.fasta. references kehr, b., weese, d., reinert, k.: stellar: fast and exact local alignments. bmc bioinformatics, 12(suppl 9):s15, 2011. version stellar version: 1.3 last update october 2012
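The two main options, --epsilon and --minlength, jointly define which local alignments count as epsilon-matches. The check below is a sketch of that acceptance criterion, assuming the error rate is the number of mismatch and gap columns divided by the number of alignment columns; it is not taken from the stellar source, and `is_epsilon_match` is a hypothetical helper.

```python
def is_epsilon_match(row_a, row_b, epsilon=0.05, min_length=100):
    """Check an alignment (two gapped rows, '-' for gaps) against epsilon and minlength."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    length = len(row_a)
    # Every column that is not an exact match counts as one error (assumption).
    errors = sum(1 for a, b in zip(row_a, row_b) if a != b)
    return length >= min_length and errors <= epsilon * length


reference = "A" * 100
query = "C" * 5 + "A" * 95  # 5 mismatches in 100 columns
print(is_epsilon_match(reference, query))
```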

tree_recon.1

Tree reconstruction synopsis tree_recon [options] -m in.dist description reconstruct phylogenetic tree from phylip matrix in.dist. -h, --help displays this help message. --version display version information input / output: -m, --matrix file name of the phylip distance matrix file. must contain at least three species. valid filetype is: dist. -o, --out-file file path to write output to. valid filetypes are: dot and newick. default: tree.dot. algorithm options: -b, --build method tree building method. nj: neighbour-joining, min: upgma single linkage, max: upgma complete linkage, avg: upgma average linkage, wavg: upgma weighted average linkage. neighbour-joining creates an unrooted tree. we root that tree at the last joined pair. one of nj, min, max, avg, and wavg. default: nj. contact and references for questions or comments, contact: tobias rausch [email protected] seqan homepage: http://www.seqan.de version tree_recon version: 1.02 last update july 17, 2012
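As an illustration of what the tree building methods do with a distance matrix, the sketch below runs a textbook UPGMA clustering with size-weighted average linkage and prints the topology in newick notation without branch lengths. It is not the tree_recon implementation; whether this linkage rule corresponds exactly to the avg method is an assumption, and `upgma_newick` is a hypothetical helper.

```python
def upgma_newick(names, dist):
    """Cluster by average linkage and return a newick string without branch lengths.

    names: list of species labels; dist: symmetric distance matrix (list of lists).
    """
    if len(names) < 3:
        raise ValueError("need at least three species")
    # cluster id -> (newick text, number of members)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # pairwise distances, keyed as (larger id, smaller id)
    d = {(i, j): dist[i][j] for i in range(len(names)) for j in range(i)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        (text_i, n_i), (text_j, n_j) = clusters.pop(i), clusters.pop(j)
        new_d = {}
        for k in clusters:
            d_ik = d[(max(i, k), min(i, k))]
            d_jk = d[(max(j, k), min(j, k))]
            # Size-weighted average of the distances to the two merged clusters.
            new_d[(next_id, k)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = (f"({text_j},{text_i})", n_i + n_j)
        next_id += 1
    (text, _), = clusters.values()
    return text + ";"


matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma_newick(["A", "B", "C", "D"], matrix))
```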
