SYNOPSIS

mpiexec -n NUMBER_OF_RANKS Ray -k KMERLENGTH -p l1_1.fastq l1_2.fastq -p l2_1.fastq l2_2.fastq -o test

mpiexec -n NUMBER_OF_RANKS Ray Ray.conf # with commands in a file

DESCRIPTION

The Ray genome assembler is built on top of RayPlatform, a generic plugin-based distributed and parallel compute engine that uses the Message Passing Interface (MPI) for communication.

Ray targets several applications:

    - de novo genome assembly (with Ray vanilla)
    - de novo meta-genome assembly (with Ray Meta)
    - de novo transcriptome assembly (works, but not tested a lot)
    - quantification of contig abundances
    - quantification of microbiome consortia members (with Ray Communities)
    - quantification of transcript expression
    - taxonomy profiling of samples (with Ray Communities)
    - gene ontology profiling of samples (with Ray Ontologies)

  • -help

  • Displays this help page.

  • -version

  • Displays Ray version and compilation options.

  • Using a configuration file

  • Ray can be launched with: mpiexec -n 16 Ray Ray.conf

  • The configuration file can include comments (starting with #).
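
  • As an illustration, the sketch below renders the text of a small configuration file. It assumes that Ray.conf holds the same options as the command line, one option per line. That layout, the helper name render_ray_conf and the file names are assumptions made for this example, not something this help page specifies.

```python
def render_ray_conf(kmer_length, paired_libraries, output_directory):
    """Return the text of a hypothetical Ray.conf (assumed: one option per line)."""
    lines = ["# Ray.conf: comments start with #"]
    lines.append(f"-k {kmer_length}")
    for left, right in paired_libraries:
        # Each paired-end library gets its own -p line.
        lines.append(f"-p {left} {right}")
    lines.append(f"-o {output_directory}")
    return "\n".join(lines) + "\n"


conf_text = render_ray_conf(
    21,
    [("l1_1.fastq", "l1_2.fastq"), ("l2_1.fastq", "l2_2.fastq")],
    "test",
)
print(conf_text, end="")
```

  • Saving the printed text as Ray.conf and running mpiexec -n 16 Ray Ray.conf should then be equivalent to the first command in the SYNOPSIS, if the one-option-per-line assumption holds.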

  • K-mer length

  • -k kmerLength

  • Selects the length of k-mers. The default value is 21. It must be odd because reverse-complement vertices are stored together. The maximum length is defined at compilation by MAXKMERLENGTH. Larger k-mers use more memory.
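
  • The constraints on -k can be checked before launching a job. The sketch below is not Ray's own validation code. It assumes that MAXKMERLENGTH is an inclusive upper bound and uses the value 32 reported for this build at the end of this page. It also illustrates one property of odd lengths: a k-mer of odd length can never equal its own reverse complement, whereas an even-length k-mer can. The function names are hypothetical.

```python
MAXKMERLENGTH = 32  # value reported by this build; other builds may differ

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every nucleotide, then reverse the sequence."""
    return kmer.translate(_COMPLEMENT)[::-1]


def check_kmer_length(k, maximum=MAXKMERLENGTH):
    """Return a list of problems with k (empty when k looks acceptable)."""
    problems = []
    if k < 1:
        problems.append("k must be positive")
    if k % 2 == 0:
        problems.append("k must be odd")
    if k > maximum:  # assumption: the bound is inclusive
        problems.append(f"k exceeds MAXKMERLENGTH ({maximum})")
    return problems


print(check_kmer_length(21))
print(check_kmer_length(22))
# An even-length k-mer can be its own reverse complement:
print(reverse_complement("ACGT") == "ACGT")
```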

  • Inputs

  • -p leftSequenceFile rightSequenceFile [averageOuterDistance standardDeviation]

  • Provides two files containing paired-end reads. averageOuterDistance and standardDeviation are automatically computed if not provided.

  • -i interleavedSequenceFile [averageOuterDistance standardDeviation]

  • Provides one file containing interleaved paired-end reads. averageOuterDistance and standardDeviation are automatically computed if not provided.

  • -s sequenceFile

  • Provides a file containing single-end reads.

  • Outputs

  • -o outputDirectory

  • Specifies the directory for output files. The default is RayOutput.

  • Assembly options (defaults work well)

  • -disable-recycling

  • Disables read recycling during the assembly. With recycling, reads are set free in 3 cases:
    1. the distance did not match for a pair
    2. the read has not met its mate
    3. the library population indicates a wrong placement
    See: Constrained traversal of repeats with paired sequences. Sebastien Boisvert, Elenie Godzaridis, Francois Laviolette & Jacques Corbeil. First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing, March 26-27 2011, Vancouver, BC, Canada.

  • -disable-scaffolder

  • Disables the scaffolder.

  • -minimum-contig-length minimumContigLength

  • Changes the minimum contig length. The default is 100 nucleotides.

  • -color-space

  • Runs in color-space. Needs csfasta files. Activated automatically if csfasta files are provided.

  • -use-maximum-seed-coverage maximumSeedCoverageDepth

  • Ignores any seed with a coverage depth above this threshold. The default is 4294967295.

  • -use-minimum-seed-coverage minimumSeedCoverageDepth

  • Sets the minimum seed coverage depth. Any path with a coverage depth lower than this will be discarded. The default is 0.

  • Distributed storage engine (all these values are for each MPI rank)

  • -bloom-filter-bits bits

  • Sets the number of bits for the Bloom filter. Default value: 268435456 bits. 0 bits disables the Bloom filter.

  • -hash-table-buckets buckets

  • Sets the initial number of buckets. Must be a power of 2. Default value: 268435456.

  • -hash-table-buckets-per-group buckets

  • Sets the number of buckets per group for sparse storage. Default value: 64, must be >= 1 and <= 64.

  • -hash-table-load-factor-threshold threshold

  • Sets the load factor threshold for real-time resizing. Default value: 0.75, must be >= 0.5 and < 1.

  • -hash-table-verbosity

  • Activates verbosity for the distributed storage engine.
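
  • The documented limits for the storage engine options can be expressed as a short pre-flight check. The sketch below mirrors the constraints listed above. It is not Ray's own validation, and the function names are hypothetical. The size estimate covers only the raw bit array of the Bloom filter on one rank, and the actual memory footprint may differ.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (a positive integer with a single bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def check_storage_options(bloom_filter_bits=268435456,
                          hash_table_buckets=268435456,
                          buckets_per_group=64,
                          load_factor_threshold=0.75):
    """Return a list of violated constraints (empty when all values are valid)."""
    problems = []
    if bloom_filter_bits < 0:
        problems.append("-bloom-filter-bits must not be negative")
    if not is_power_of_two(hash_table_buckets):
        problems.append("-hash-table-buckets must be a power of 2")
    if not 1 <= buckets_per_group <= 64:
        problems.append("-hash-table-buckets-per-group must be >= 1 and <= 64")
    if not 0.5 <= load_factor_threshold < 1:
        problems.append("-hash-table-load-factor-threshold must be >= 0.5 and < 1")
    return problems


def bloom_filter_mebibytes(bits):
    """Size of the raw bit array in MiB for one MPI rank."""
    return bits / 8 / 1024 / 1024


print(check_storage_options())            # the defaults satisfy every constraint
print(bloom_filter_mebibytes(268435456))  # default Bloom filter size per rank
```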

  • Biological abundances

  • -search searchDirectory

  • Provides a directory containing fasta files to be searched in the de Bruijn graph. Biological abundances will be written to RayOutput/BiologicalAbundances. See Documentation/BiologicalAbundances.txt

  • -one-color-per-file

  • Sets one color per file instead of one per sequence. By default, each sequence in each file has a different color. For files with large numbers of sequences, using one single color per file may be more efficient.

  • Taxonomic profiling with colored de Bruijn graphs

  • -with-taxonomy Genome-to-Taxon.tsv TreeOfLife-Edges.tsv Taxon-Names.tsv

  • Provides a taxonomy. Computes and writes detailed taxonomic profiles. See Documentation/Taxonomy.txt for details.

  • -gene-ontology OntologyTerms.txt Annotations.txt

  • Provides an ontology and annotations. OntologyTerms.txt is fetched from http://geneontology.org. Annotations.txt is a 2-column file (EMBL_CDS handle & gene ontology identifier). See Documentation/GeneOntology.txt

  • Other outputs

  • -enable-neighbourhoods

  • Computes contig neighborhoods in the de Bruijn graph. Output file: RayOutput/NeighbourhoodRelations.txt

  • -amos

  • Writes the AMOS file called RayOutput/AMOS.afg. An AMOS file contains read positions on contigs. It can be opened with software that has a graphical user interface.

  • -write-kmers

  • Writes the k-mer graph to RayOutput/kmers.txt. The resulting file is not used by Ray. The resulting file is very large.

  • -write-read-markers

  • Writes read markers to disk.

  • -write-seeds

  • Writes seed DNA sequences to RayOutput/Rank<rank>.RaySeeds.fasta

  • -write-extensions

  • Writes extension DNA sequences to RayOutput/Rank<rank>.RayExtensions.fasta

  • -write-contig-paths

  • Writes contig paths with coverage values to RayOutput/Rank<rank>.RayContigPaths.txt

  • -write-marker-summary

  • Writes marker statistics.

  • Memory usage

  • -show-memory-usage

  • Shows memory usage. Data is fetched from /proc on GNU/Linux. Needs __linux__.

  • -show-memory-allocations

  • Shows memory allocation events.

  • Algorithm verbosity

  • -show-extension-choice

  • Shows the choice made (with other choices) during the extension.

  • -show-ending-context

  • Shows the ending context of each extension. Shows the children of the vertex where extension was too difficult.

  • -show-distance-summary

  • Shows a summary of the outer distances used for an extension path.

  • -show-consensus

  • Shows the consensus when a choice is made.

  • Checkpointing

  • -write-checkpoints checkpointDirectory

  • Writes checkpoint files.

  • -read-checkpoints checkpointDirectory

  • Reads checkpoint files.

  • -read-write-checkpoints checkpointDirectory

  • Reads and writes checkpoint files.

  • Message routing for large numbers of cores

  • -route-messages

  • Enables the Ray message router. Disabled by default. Messages are routed so that any rank communicates directly with only a few others. Without -route-messages, any rank can communicate directly with any other rank. Files generated: Routing/Connections.txt, Routing/Routes.txt, Routing/RelayEvents.txt and Routing/Summary.txt

  • -connection-type type

  • Sets the connection type for routes. Accepted values are debruijn, hypercube, polytope, group, random, kautz and complete. Default is debruijn.

    - debruijn: a full de Bruijn graph for a given alphabet and diameter
    - hypercube: a hypercube, the alphabet is {0,1} and the number of vertices is a power of 2
    - polytope: a convex regular polytope, the alphabet is {0,1,...,B-1} and the number of vertices is a power of B
    - group: silly model where one representative per group can communicate with outsiders
    - random: Erdos-Renyi model
    - kautz: a full Kautz graph, which is a subgraph of a de Bruijn graph
    - complete: a full graph with all the possible connections

  • With the type debruijn, the number of ranks must be a power of an integer. Examples: 256 = 16*16, 512 = 8*8*8, 49 = 7*7, and so on. Otherwise, use another connection type.

  • With the type kautz, the number of ranks n must be n = (k+1)*k^(d-1) for some k and d.

  • -routing-graph-degree degree

  • Specifies the outgoing degree for the routing graph. See Documentation/Routing.txt
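
  • Whether a given number of ranks fits the debruijn or kautz connection types can be checked with a little arithmetic. The sketch below enumerates the decompositions described above. It assumes an exponent of at least 2 for debruijn (as in the examples 16*16 and 8*8*8) and k >= 2, d >= 2 for kautz. The function names are hypothetical, and Ray's own checks may accept or reject other cases.

```python
def debruijn_parameters(ranks):
    """Return (base, diameter) pairs with base**diameter == ranks, diameter >= 2."""
    pairs = []
    base = 2
    while base * base <= ranks:
        power, diameter = base * base, 2
        while power < ranks:
            power *= base
            diameter += 1
        if power == ranks:
            pairs.append((base, diameter))
        base += 1
    return pairs


def kautz_parameters(ranks):
    """Return (k, d) pairs with (k + 1) * k**(d - 1) == ranks, k >= 2, d >= 2."""
    pairs = []
    k = 2
    while (k + 1) * k <= ranks:
        size, d = (k + 1) * k, 2
        while size < ranks:
            size *= k
            d += 1
        if size == ranks:
            pairs.append((k, d))
        k += 1
    return pairs


for ranks in (49, 256, 512, 48):
    print(ranks, debruijn_parameters(ranks), kautz_parameters(ranks))
```

  • An empty list for a connection type suggests choosing a different type or a different number of ranks.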

  • Hardware testing

  • -test-network-only

  • Tests the network and returns.

  • -write-network-test-raw-data

  • Writes one additional file per rank detailing the network test.

  • -exchanges NumberOfExchanges

  • Sets the number of exchanges.

  • -disable-network-test

  • Skips the network test.

  • Debugging

  • -verify-message-integrity

  • Checks message data reliability for any non-empty message. Add '-D CONFIG_SSE_4_2' in the Makefile to use the hardware instruction (SSE 4.2).

  • -run-profiler

  • Runs the profiler as the code runs. By default, only shows granularity warnings. Running the profiler increases running times.

  • -with-profiler-details

  • Shows the number of messages sent and received in each method during each time slice (epoch). Needs -run-profiler.

  • -show-communication-events

  • Shows all messages sent and received.

  • -show-read-placement

  • Shows read placement in the graph during the extension.

  • -debug-bubbles

  • Debugs bubble code. Bubbles can be due to heterozygous sites, sequencing errors or other (unknown) events.

  • -debug-seeds

  • Debugs seed code. Seeds are paths in the graph that are likely unique.

  • -debug-fusions

  • Debugs fusion code.

  • -debug-scaffolder

  • Debugs the scaffolder.

FILES

  • Input files

  • Note: the file format is determined from the file extension.

    - .fasta
    - .fasta.gz (needs HAVE_LIBZ=y at compilation)
    - .fasta.bz2 (needs HAVE_LIBBZ2=y at compilation)
    - .fastq
    - .fastq.gz (needs HAVE_LIBZ=y at compilation)
    - .fastq.bz2 (needs HAVE_LIBBZ2=y at compilation)
    - .sff (paired reads must be extracted manually)
    - .csfasta (color-space reads)
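
  • Because the format is chosen from the extension, a file list can be screened before a run. The sketch below maps each extension listed above to its compile-time requirement, if any. The helper name input_format is hypothetical, and the case-sensitive match is an assumption.

```python
# Extension -> compile-time requirement (None when no extra library is needed).
INPUT_FORMATS = {
    ".fasta": None,
    ".fasta.gz": "HAVE_LIBZ=y",
    ".fasta.bz2": "HAVE_LIBBZ2=y",
    ".fastq": None,
    ".fastq.gz": "HAVE_LIBZ=y",
    ".fastq.bz2": "HAVE_LIBBZ2=y",
    ".sff": None,
    ".csfasta": None,
}


def input_format(file_name):
    """Return (extension, requirement), or None for an unrecognized extension."""
    for extension, requirement in INPUT_FORMATS.items():
        if file_name.endswith(extension):  # assumption: matching is case-sensitive
            return extension, requirement
    return None


for name in ("l1_1.fastq", "reads.fasta.gz", "reads.bam"):
    print(name, input_format(name))
```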

  • Output files

  • Scaffolds

  • RayOutput/Scaffolds.fasta

  • The scaffold sequences in FASTA format

  • RayOutput/ScaffoldComponents.txt

  • The components of each scaffold

  • RayOutput/ScaffoldLengths.txt

  • The length of each scaffold

  • RayOutput/ScaffoldLinks.txt

  • Scaffold links

  • Contigs

  • RayOutput/Contigs.fasta

  • Contiguous sequences in FASTA format

  • RayOutput/ContigLengths.txt

  • The lengths of contiguous sequences

  • Summary

  • RayOutput/OutputNumbers.txt

  • Overall numbers for the assembly

  • de Bruijn graph

  • RayOutput/CoverageDistribution.txt

  • The distribution of coverage values

  • RayOutput/CoverageDistributionAnalysis.txt

  • Analysis of the coverage distribution

  • RayOutput/degreeDistribution.txt

  • Distribution of ingoing and outgoing degrees

  • RayOutput/kmers.txt

  • k-mer graph, required option: -write-kmers

  • The resulting file is not utilised by Ray. The resulting file is very large.

  • Assembly steps

  • RayOutput/SeedLengthDistribution.txt

  • Distribution of seed length

  • RayOutput/Rank<rank>.OptimalReadMarkers.txt

  • Read markers.

  • RayOutput/Rank<rank>.RaySeeds.fasta

  • Seed DNA sequences, required option: -write-seeds

  • RayOutput/Rank<rank>.RayExtensions.fasta

  • Extension DNA sequences, required option: -write-extensions

  • RayOutput/Rank<rank>.RayContigPaths.txt

  • Contig paths with coverage values, required option: -write-contig-paths

  • Paired reads

  • RayOutput/LibraryStatistics.txt

  • Estimation of outer distances for paired reads

  • RayOutput/Library<LibraryNumber>.txt

  • Frequencies for observed outer distances (insert size + read lengths)
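
  • The outer distance and its summary statistics can be illustrated with a small calculation. The sketch below assumes that "insert size" means the unsequenced gap between the two reads, and that the average and standard deviation are taken over a table of observed distances and their frequencies. The exact layout of Library<LibraryNumber>.txt is not described here, so the table is an in-memory dictionary with made-up values, and the population standard deviation is an assumption.

```python
import math


def outer_distance(insert_size, left_read_length, right_read_length):
    """Outer distance = insert size + lengths of both reads."""
    return insert_size + left_read_length + right_read_length


def summarize_outer_distances(frequencies):
    """Return (mean, standard deviation) of a {distance: count} table."""
    total = sum(frequencies.values())
    mean = sum(d * c for d, c in frequencies.items()) / total
    variance = sum(c * (d - mean) ** 2 for d, c in frequencies.items()) / total
    return mean, math.sqrt(variance)


# Illustrative values only: four pairs observed at three outer distances.
observed = {190: 1, 200: 2, 210: 1}
mean, standard_deviation = summarize_outer_distances(observed)
print(outer_distance(100, 50, 50), mean, round(standard_deviation, 2))
```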

  • Partition

  • RayOutput/NumberOfSequences.txt

  • Number of reads in each file

  • RayOutput/SequencePartition.txt

  • Sequence partition

  • Ray software

  • RayOutput/RayVersion.txt

  • The version of Ray

  • RayOutput/RayCommand.txt

  • The exact command that was provided

  • AMOS

  • RayOutput/AMOS.afg

  • Assembly representation in AMOS format, required option: -amos

  • Communication

  • RayOutput/MessagePassingInterface.txt

  • Number of messages sent

  • RayOutput/NetworkTest.txt

  • Latencies in microseconds

  • RayOutput/Rank<rank>NetworkTestData.txt

  • Network test raw data

DOCUMENTATION

    - mpiexec -n 1 Ray -help|less (always up-to-date)
    - This help page (always up-to-date)
    - The directory Documentation/
    - Manual (Portable Document Format): InstructionManual.tex (in Documentation)
    - Mailing list archives: http://sourceforge.net/mailarchive/forum.php?forum_name=denovoassembler-users

AUTHOR

  • Written by Sebastien Boisvert.

REPORTING BUGS

COPYRIGHT

  • This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License.

  • This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

  • You have received a copy of the GNU General Public License along with this program (see LICENSE).

Ray 2.1.0

License for Ray: GNU General Public License version 3
RayPlatform version: 1.1.0
License for RayPlatform: GNU Lesser General Public License version 3

MAXKMERLENGTH: 32
KMER_U64_ARRAY_SIZE: 1
Maximum coverage depth stored by CoverageDepth: 4294967295
MAXIMUM_MESSAGE_SIZE_IN_BYTES: 4000 bytes
FORCE_PACKING = n
ASSERT = n
HAVE_LIBZ = y
HAVE_LIBBZ2 = y
CONFIG_PROFILER_COLLECT = n
CONFIG_CLOCK_GETTIME = n
__linux__ = y
_MSC_VER = n
__GNUC__ = y
RAY_32_BITS = n
RAY_64_BITS = y
MPI standard version: MPI 2.1
MPI library: Open-MPI 1.4.2
Compiler: GNU gcc/g++ 4.4.5