documentation

After the successful installation, use

$ shoji --help

to see the list of available commands and their options.

In Shoji, main functions are organized into four categories:

Annotation	`annotation` `createSlidingWindows`	Parse gff3 file and extract features to bed format Create sliding windows from flattened annotation.
Extraction	`extract`	Extract crosslink sites from alignment file.
Counting	`count` `createMatrix`	Count number of crosslink sites in a window. create R friendly output matrices
Helpers	`toTabix`	Convert BED file to bgzipped, tabix indexed BED file

Annotation

Parse gff3 file and extract features to bed format.

Arguments

--gff3/-a GFF3 file to parse (supports .gz files)
--out/-o Output bed file name (supports .gz compression, tabix indexing)
--id/-i ID tag in GFF3 attribute column
--parent/-p Parent tag in GFF3 attribute column
--gene_id/-g Gene id tag in GFF3 attribute column
--gene_name/-n Gene name tag in GFF3 attribute column
--gene_type/-t Gene type tag in GFF3 attribute column
--feature/-f Gene feature to extract from GFF3 file (from GFF3 3rd column)
--gene_like_features/-x Gene like features to parse from GFF3 (based on GFF3 3rd column). Multiple values can be passed as -x tRNA -x rRNA…
--tabix If the output suffix is .gz, use this flag to index the output bed file using tabix. It is recommended to use .gz suffix and this flag for output
--split-intron If an intron overlaps exon of another genes, split this introns into separate bed entries after removing exon overlap

Note

All default values are based on Gencode GFF3 file format.

Usage:

$ shoji annotation --help

Create sliding windows from flattened annotation

Arguments

--annotation/-a Input annotation file (see shoji annotate -h) to create sliding windows (supports .gz files, tabix indexed files)
--size/-w Size of the sliding window (in bp)
--step/-s Step/slide (in bp), from beginning of the previous window to the beginning of the current window
--tabix If the output suffix is .gz, use this flag to index the output bed file using tabix. It is recommended to use .gz suffix and this flag for output files
--cpus/-c Number of cores to use for parallel processing

Usage:

$ shoji createSlidingWindows --help

Extract crosslink sites from bam file.

Arguments

--bam/-b Alignment bam file. Must be co-ordinate sorted and indexed
--out/-o Output crosslink sites in BED6 format (supports .gz file, tabix indexing)
--mate/-e for paired end sequencing, select the read/mate to extract the crosslink sites. For single end data, the choice is always 1
--site/-s Crosslink site choices, s : start, m : middle, e : end, i : insertion, d : deletion
--offset/-g Number of nucleotides to offset for crosslink sites
--qual/-q Minimum alignment quality score
--min_len/-m Minimum read length
--max_len/-x Maximum read length
--min_aln_len/-a Minimum aligned read length
--aln_frac/-f Minimum fraction of aligned read length to total read length for crosslink site extraction. If set to 0, all reads are considered
--mismatch_frac/-y Maximum fraction of mismatches allowed in the read, as a fraction of aligned length. If set to 1.0, all reads are considered
--max_interval_len/-l Maximum read interval length
--primary Flag to use only the primary alignment position
--ignore_PCR_duplicates Flag to ignore PCR duplicate reads (works only if bam file has PCR duplicate flag set using tools such as samtools markdup)
--tabix If the output suffix is .gz, use this flag to index the output bed file using tabix
--tmp/-t Temp. directory to save intermediate outputs. If not provided, creates and uses a temporary directory in * --out parent directory
--cpus/-c Number of cores to use for parallel processing

Usage:

$ shoji extract --help

Count number of crosslink sites in a window.

Arguments

--annotation/-a flattened annotation file from shoji annotation -h or sliding window file from shoji createSlidingWindows -h
--input/-i Extracted crosslink sites in BED format. See shoji extract -h for more details
--out/-o Output file, crosslinksite counts per window. Note: This function outputs results only in Apache Parquet format
--name/-n Sample name to use as a column in the output file. If not provided, the sample name will be inferred from the input file
--cpus/-c Number of cores to use for parallel processing
--tmp/-t Temp. directory to save intermediate outputs. If not provided, creates and uses a temporary directory in --out parent directory

Usage:

$ shoji count --help

create R friendly output matrices.

Arguments

--input_dir/-i Input directory containing the output of shoji count, see shoji count -h for details
--prefix/-p Prefix to filter count files in in_dir
--suffix/-s Suffix to filter count files in in_dir. Either --prefix or --suffix MUST be provided
--format/-f `` Output formats: ``csv or parquet. Default: csv. Csv format also supports gzipped output
--annotation/-a Output filename for trimmed annotations
--output/-o Output filename for aggregated crosslink count per window matrix
--max/-m Optional output. Output filename for max. crosslink site per window matrix
--allow_duplicates Default behavior: If adjacent overlapping windows have same crosslink counts across all samples, write only the most 5’ window to output file. Use this flag disable this feature and to write all windows
--cpus/-c Number of cores to use for parallel processing

Usage:

$ shoji createMatrix --help

Convert BED file to bgzipped, tabix indexed BED file

Arguments:

--bed/-b Input BED file (bed6 format, supports .gz files)
--output/-o Output filename for bgzipped tabix indexed bed file
--cpus/-c Number of cores to use for parallel processing
--tmp/-t Temp. directory to save intermediate outputs. If not provided, creates and uses a temporary directory in --out parent directory

Usage:

$ shoji toTabix --help