.. raw:: html
.. role:: c1
Documentation
=============
After a successful installation, run
.. code-block:: sh
$ shoji --help
to see the list of available commands and their options.
Shoji's main functions are organized into four categories:
.. list-table::

   * - Annotation
     - | ``annotation``
       | ``createSlidingWindows``
     - | Parse a GFF3 file and extract features to BED format.
       | Create sliding windows from the flattened annotation.
   * - Extraction
     - ``extract``
     - Extract crosslink sites from an alignment file.
   * - Counting
     - | ``count``
       | ``createMatrix``
     - | Count the number of crosslink sites in a window.
       | Create R-friendly output matrices.
   * - Helpers
     - ``toTabix``
     - Convert a BED file to a bgzipped, tabix-indexed BED file.

.. _AnnotationOverview:
Annotation
***********
.. _annotation:
:c1:`annotation`
----------------
Parse a GFF3 file and extract features to BED format.
**Arguments**
* ``--gff3/-a`` GFF3 file to parse (supports .gz files)
* ``--out/-o`` Output bed file name (supports .gz compression, tabix indexing)
* ``--id/-i`` ID tag in GFF3 attribute column
* ``--parent/-p`` Parent tag in GFF3 attribute column
* ``--gene_id/-g`` Gene id tag in GFF3 attribute column
* ``--gene_name/-n`` Gene name tag in GFF3 attribute column
* ``--gene_type/-t`` Gene type tag in GFF3 attribute column
* ``--feature/-f`` Gene feature to extract from GFF3 file (from GFF3 3rd column)
* ``--gene_like_features/-x`` Gene-like features to parse from GFF3 (based on GFF3 3rd column). Multiple values can be passed as ``-x tRNA -x rRNA``
* ``--tabix`` If the output suffix is .gz, use this flag to index the output BED file using tabix. It is recommended to use the .gz suffix and this flag for output files
* ``--split-intron`` If an intron overlaps an exon of another gene, split the intron into separate BED entries after removing the exon overlap
.. Note::
   All default values are based on the GENCODE GFF3 file format.
**Usage:**
.. code-block:: sh
$ shoji annotation --help
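
A typical call might look like the sketch below. The file names are placeholders chosen for illustration, not files shipped with Shoji, and all options that are not given are assumed to fall back to their defaults. The command is only printed, so it can be reviewed before it is run.

.. code-block:: sh

    # Placeholder file names (assumptions); replace them with your own paths.
    gff3="gencode.annotation.gff3.gz"
    annotation="annotation.bed.gz"

    # A .gz output suffix combined with --tabix, as recommended above.
    cmd="shoji annotation --gff3 $gff3 --out $annotation --tabix"
    echo "$cmd"       # inspect the command first
    # eval "$cmd"     # uncomment to run it once shoji is installed
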
.. _createSlidingWindows:
:c1:`createSlidingWindows`
--------------------------
Create sliding windows from the flattened annotation.
**Arguments**
* ``--annotation/-a`` Input annotation file (see ``shoji annotation -h``) to create sliding windows (supports .gz files, tabix indexed files)
* ``--size/-w`` Size of the sliding window (in bp)
* ``--step/-s`` Step/slide (in bp), from beginning of the previous window to the beginning of the current window
* ``--tabix`` If the output suffix is .gz, use this flag to index the output bed file using tabix. It is recommended to use .gz suffix and this flag for output files
* ``--cpus/-c`` Number of cores to use for parallel processing
**Usage:**
.. code-block:: sh
$ shoji createSlidingWindows --help
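
The relationship between ``--size`` and ``--step`` can be illustrated with plain arithmetic. The sketch below does not call Shoji. It prints the first four windows for a hypothetical feature starting at position 1000, with a window size of 100 bp and a step of 20 bp, so consecutive windows overlap by 80 bp. How Shoji treats windows at the end of a feature is not covered here.

.. code-block:: sh

    size=100     # window size in bp, as it would be passed to --size
    step=20      # step in bp, as it would be passed to --step
    start=1000   # hypothetical feature start (assumption)

    # Window k begins k steps after the feature start and spans "size" bp.
    windows=$(awk -v s="$start" -v w="$size" -v st="$step" \
        'BEGIN { for (k = 0; k < 4; k++) printf "%d\t%d\n", s + k * st, s + k * st + w }')
    echo "$windows"

The first two lines of the output are ``1000  1100`` and ``1020  1120`` (tab-separated): each window begins one step after the previous one.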
Extraction
***********
.. _extract:
:c1:`extract`
---------------
Extract crosslink sites from a BAM file.
**Arguments**
* ``--bam/-b`` Alignment BAM file. Must be coordinate-sorted and indexed
* ``--out/-o`` Output crosslink sites in BED6 format (supports .gz file, tabix indexing)
* ``--mate/-e`` For paired-end sequencing, select the read/mate to extract the crosslink sites. For single-end data, the choice is always 1
* ``--site/-s`` Crosslink site choices, s : start, m : middle, e : end, i : insertion, d : deletion
* ``--offset/-g`` Number of nucleotides to offset for crosslink sites
* ``--qual/-q`` Minimum alignment quality score
* ``--min_len/-m`` Minimum read length
* ``--max_len/-x`` Maximum read length
* ``--min_aln_len/-a`` Minimum aligned read length
* ``--aln_frac/-f`` Minimum fraction of aligned read length to total read length for crosslink site extraction. If set to 0, all reads are considered
* ``--mismatch_frac/-y`` Maximum fraction of mismatches allowed in the read, as a fraction of aligned length. If set to 1.0, all reads are considered
* ``--max_interval_len/-l`` Maximum read interval length
* ``--primary`` Flag to use only the primary alignment position
* ``--ignore_PCR_duplicates`` Flag to ignore PCR duplicate reads (works only if bam file has PCR duplicate flag set using tools such as samtools markdup)
* ``--tabix`` If the output suffix is .gz, use this flag to index the output bed file using tabix
* ``--tmp/-t`` Temp. directory to save intermediate outputs. If not provided, creates and uses a temporary directory in the ``--out`` parent directory
* ``--cpus/-c`` Number of cores to use for parallel processing
**Usage:**
.. code-block:: sh
$ shoji extract --help
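
As an illustration, the sketch below extracts the start position of mate 1 from primary alignments only. The file names and the number of cores are assumptions made for this example. The option values ``1`` for ``--mate`` and ``s`` for ``--site`` are taken from the argument list above.

.. code-block:: sh

    # Placeholder file names (assumptions); the BAM file must be
    # coordinate-sorted and indexed.
    bam="sample1.bam"
    sites="sample1.sites.bed.gz"

    # Start of mate 1 as crosslink site, primary alignments only,
    # bgzipped and tabix-indexed output.
    cmd="shoji extract --bam $bam --out $sites --mate 1 --site s --primary --tabix --cpus 4"
    echo "$cmd"       # inspect the command first
    # eval "$cmd"     # uncomment to run it once shoji is installed
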
Counting
***********
.. _count:
:c1:`count`
---------------
Count number of crosslink sites in a window.
**Arguments**
* ``--annotation/-a`` Flattened annotation file (see ``shoji annotation -h``) or sliding window file (see ``shoji createSlidingWindows -h``)
* ``--input/-i`` Extracted crosslink sites in BED format. See ``shoji extract -h`` for more details
* ``--out/-o`` Output file with crosslink site counts per window. Note: This function outputs results only in Apache Parquet format
* ``--name/-n`` Sample name to use as a column in the output file. If not provided, the sample name will be inferred from the input file
* ``--cpus/-c`` Number of cores to use for parallel processing
* ``--tmp/-t`` Temp. directory to save intermediate outputs. If not provided, creates and uses a temporary directory in ``--out`` parent directory
**Usage:**
.. code-block:: sh
$ shoji count --help
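
The sketch below counts the crosslink sites of one sample per sliding window. All file names, the ``.parquet`` suffix, and the sample name ``sample1`` are assumptions made for this example; only the flags come from the argument list above.

.. code-block:: sh

    # Placeholder file names (assumptions).
    windows="windows.bed.gz"            # e.g. from shoji createSlidingWindows
    sites="sample1.sites.bed.gz"        # e.g. from shoji extract
    counts="counts/sample1.parquet"     # output is always Apache Parquet

    cmd="shoji count --annotation $windows --input $sites --out $counts --name sample1 --cpus 4"
    echo "$cmd"       # inspect the command first
    # eval "$cmd"     # uncomment to run it once shoji is installed

Writing one count file per sample into a shared directory keeps the results ready for ``shoji createMatrix``, which reads a whole input directory.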
.. _createMatrix:
:c1:`createMatrix`
------------------
Create R-friendly output matrices.
**Arguments**
* ``--input_dir/-i`` Input directory containing the output of ``shoji count``, see ``shoji count -h`` for details
* ``--prefix/-p`` Prefix to filter count files in ``--input_dir``
* ``--suffix/-s`` Suffix to filter count files in ``--input_dir``. Either ``--prefix`` or ``--suffix`` MUST be provided
* ``--format/-f`` Output format: ``csv`` or ``parquet``. Default: ``csv``. The CSV format also supports gzipped output
* ``--annotation/-a`` Output filename for trimmed annotations
* ``--output/-o`` Output filename for aggregated crosslink count per window matrix
* ``--max/-m`` Optional output. Output filename for the maximum crosslink site count per window matrix
* ``--allow_duplicates`` Default behavior: if adjacent overlapping windows have the same crosslink counts across all samples, only the most 5' window is written to the output file. Use this flag to disable this behavior and write all windows
* ``--cpus/-c`` Number of cores to use for parallel processing
**Usage:**
.. code-block:: sh
$ shoji createMatrix --help
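
A possible call is sketched below. It assumes that the per-sample count files live in a directory named ``counts`` and end in ``.parquet``; the directory name, the suffix, and both output file names are placeholders for this example.

.. code-block:: sh

    # Placeholder names (assumptions).
    in_dir="counts"                       # directory with shoji count outputs
    annotation_out="matrix_annotation.csv"
    counts_out="matrix_counts.csv"

    # Select the count files by suffix; --prefix could be used instead.
    cmd="shoji createMatrix --input_dir $in_dir --suffix .parquet --format csv --annotation $annotation_out --output $counts_out"
    echo "$cmd"       # inspect the command first
    # eval "$cmd"     # uncomment to run it once shoji is installed
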
Helpers
***********
.. _toTabix:
:c1:`toTabix`
---------------
Convert a BED file to a bgzipped, tabix-indexed BED file.
**Arguments**
* ``--bed/-b`` Input BED file (bed6 format, supports .gz files)
* ``--output/-o`` Output filename for bgzipped tabix indexed bed file
* ``--cpus/-c`` Number of cores to use for parallel processing
* ``--tmp/-t`` Temp. directory to save intermediate outputs. If not provided, creates and uses a temporary directory in the ``--output`` parent directory
**Usage:**
.. code-block:: sh
$ shoji toTabix --help
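
For a plain BED6 file, a call might look like the sketch below. Both file names and the number of cores are assumptions made for this example.

.. code-block:: sh

    # Placeholder file names (assumptions).
    bed="sample1.sites.bed"
    indexed="sample1.sites.bed.gz"

    cmd="shoji toTabix --bed $bed --output $indexed --cpus 4"
    echo "$cmd"       # inspect the command first
    # eval "$cmd"     # uncomment to run it once shoji is installed
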