documentation
After a successful installation, run
$ shoji --help
to see the list of available commands and their options.
In Shoji, main functions are organized into four categories:
Category | Command | Description
Annotation | annotation | Parse GFF3 file and extract features to BED format
Annotation | createSlidingWindows | Create sliding windows from flattened annotation
Extraction | extract | Extract crosslink sites from alignment file
Counting | count | Count number of crosslink sites in a window
Counting | createMatrix | Create R-friendly output matrices
Helpers | toTabix | Convert BED file to bgzipped, tabix-indexed BED file
Annotation
annotation
Parse gff3 file and extract features to bed format.
Arguments
--gff3/-a: GFF3 file to parse (supports .gz files)
--out/-o: Output BED file name (supports .gz compression, tabix indexing)
--id/-i: ID tag in GFF3 attribute column
--parent/-p: Parent tag in GFF3 attribute column
--gene_id/-g: Gene id tag in GFF3 attribute column
--gene_name/-n: Gene name tag in GFF3 attribute column
--gene_type/-t: Gene type tag in GFF3 attribute column
--feature/-f: Gene feature to extract from GFF3 file (from GFF3 3rd column)
--gene_like_features/-x: Gene-like features to parse from GFF3 (based on GFF3 3rd column). Multiple values can be passed as -x tRNA -x rRNA…
--tabix: If the output suffix is .gz, use this flag to index the output BED file using tabix. Using the .gz suffix together with this flag is recommended for output files
--split-intron: If an intron overlaps an exon of another gene, split this intron into separate BED entries after removing the exon overlap
Note
All default values are based on Gencode GFF3 file format.
Usage:
$ shoji annotation --help
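The idea behind --split-intron can be pictured as interval subtraction: the parts of an intron that overlap exons of other genes are removed, and the remaining pieces become separate entries. The following is a minimal sketch of that idea only, not Shoji's actual implementation. The function name `subtract_intervals` and the half-open (start, end) coordinate convention are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end), an assumed convention


def subtract_intervals(intron: Interval, exons: List[Interval]) -> List[Interval]:
    """Return the parts of `intron` not covered by any interval in `exons`.

    Hypothetical helper for illustration; not part of Shoji.
    """
    start, end = intron
    pieces: List[Interval] = []
    cursor = start  # left edge of the region not yet accounted for
    for ex_start, ex_end in sorted(exons):
        if ex_end <= cursor:
            continue  # exon lies entirely before the remaining region
        if ex_start >= end:
            break  # exon (and all later ones) start after the intron
        if ex_start > cursor:
            pieces.append((cursor, ex_start))  # uncovered gap before this exon
        cursor = max(cursor, ex_end)
        if cursor >= end:
            break
    if cursor < end:
        pieces.append((cursor, end))  # uncovered tail of the intron
    return pieces


# An intron overlapped by two exons of other genes is split into two pieces.
print(subtract_intervals((100, 500), [(200, 250), (400, 600)]))
```

Without overlapping exons the intron is returned unchanged as a single piece.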
createSlidingWindows
Create sliding windows from flattened annotation
Arguments
--annotation/-a: Input annotation file (see shoji annotation -h) to create sliding windows (supports .gz files, tabix indexed files)
--size/-w: Size of the sliding window (in bp)
--step/-s: Step/slide (in bp), from the beginning of the previous window to the beginning of the current window
--tabix: If the output suffix is .gz, use this flag to index the output BED file using tabix. Using the .gz suffix together with this flag is recommended for output files
--cpus/-c: Number of cores to use for parallel processing
Usage:
$ shoji createSlidingWindows --help
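To make the relationship between --size and --step concrete, here is a minimal sketch that tiles a single feature with overlapping windows. It is an illustration under stated assumptions, not Shoji's code: the function name `sliding_windows`, the half-open coordinates, and the choice to clip the last window at the feature end are all assumptions.

```python
from typing import Iterator, Tuple


def sliding_windows(start: int, end: int, size: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows covering the feature [start, end).

    `step` is the distance between the starts of consecutive windows.
    The final window is clipped at the feature end (an assumption).
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    pos = start
    while pos < end:
        yield pos, min(pos + size, end)
        if pos + size >= end:
            break  # this window already reaches the feature end
        pos += step


# A 100 bp feature, 50 bp windows, 20 bp step.
for window in sliding_windows(0, 100, size=50, step=20):
    print(window)
```

With a step smaller than the window size, neighboring windows overlap, so a single crosslink site can fall into more than one window.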
Extraction
extract
Extract crosslink sites from bam file.
Arguments
--bam/-b: Alignment BAM file. Must be coordinate sorted and indexed
--out/-o: Output crosslink sites in BED6 format (supports .gz file, tabix indexing)
--mate/-e: For paired-end sequencing, select the read/mate from which to extract the crosslink sites. For single-end data, the choice is always 1
--site/-s: Crosslink site choices, s: start, m: middle, e: end, i: insertion, d: deletion
--offset/-g: Number of nucleotides to offset for crosslink sites
--qual/-q: Minimum alignment quality score
--min_len/-m: Minimum read length
--max_len/-x: Maximum read length
--min_aln_len/-a: Minimum aligned read length
--aln_frac/-f: Minimum fraction of aligned read length to total read length for crosslink site extraction. If set to 0, all reads are considered
--mismatch_frac/-y: Maximum fraction of mismatches allowed in the read, as a fraction of aligned length. If set to 1.0, all reads are considered
--max_interval_len/-l: Maximum read interval length
--primary: Flag to use only the primary alignment position
--ignore_PCR_duplicates: Flag to ignore PCR duplicate reads (works only if the BAM file has the PCR duplicate flag set using tools such as samtools markdup)
--tabix: If the output suffix is .gz, use this flag to index the output BED file using tabix
--tmp/-t: Temporary directory to save intermediate outputs. If not provided, a temporary directory is created and used in the parent directory of --out
--cpus/-c: Number of cores to use for parallel processing
Usage:
$ shoji extract --help
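The length and fraction filters above can be read as a sequence of threshold checks on each read. The sketch below shows one plausible reading of those descriptions. It is not taken from Shoji's source: the function name `passes_filters`, its signature, and the threshold values in the example call are illustrative assumptions.

```python
def passes_filters(
    read_len: int,
    aligned_len: int,
    mismatches: int,
    *,
    min_len: int,
    max_len: int,
    min_aln_len: int,
    aln_frac: float,
    mismatch_frac: float,
) -> bool:
    """Return True if a read passes the length and fraction thresholds.

    Hypothetical helper mirroring the option descriptions; not Shoji's code.
    """
    if read_len <= 0 or aligned_len <= 0:
        return False  # nothing usable to evaluate
    if not (min_len <= read_len <= max_len):
        return False  # read length outside the allowed range
    if aligned_len < min_aln_len:
        return False  # aligned part too short
    if aligned_len / read_len < aln_frac:
        return False  # too little of the read is aligned
    if mismatches / aligned_len > mismatch_frac:
        return False  # too many mismatches relative to aligned length
    return True


# Example thresholds are made up for illustration only.
print(passes_filters(50, 45, 2, min_len=15, max_len=100,
                     min_aln_len=12, aln_frac=0.8, mismatch_frac=0.1))
```

Setting `aln_frac` to 0 and `mismatch_frac` to 1.0 makes both fraction checks pass for every read, which matches the "all reads are considered" wording above.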
Counting
count
Count number of crosslink sites in a window.
Arguments
--annotation/-a: Flattened annotation file from shoji annotation -h or sliding window file from shoji createSlidingWindows -h
--input/-i: Extracted crosslink sites in BED format. See shoji extract -h for more details
--out/-o: Output file, crosslink site counts per window. Note: This function outputs results only in Apache Parquet format
--name/-n: Sample name to use as a column in the output file. If not provided, the sample name will be inferred from the input file
--cpus/-c: Number of cores to use for parallel processing
--tmp/-t: Temporary directory to save intermediate outputs. If not provided, a temporary directory is created and used in the parent directory of --out
Usage:
$ shoji count --help
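Conceptually, counting assigns each crosslink site to every window that contains it. The following sketch illustrates this with a binary search over sorted site positions. It is not Shoji's implementation: the function name `count_sites`, the half-open window coordinates, and the simplification that each site is a single position on the same chromosome and strand as the windows are assumptions.

```python
from bisect import bisect_left
from typing import List, Tuple


def count_sites(windows: List[Tuple[int, int]], sites: List[int]) -> List[int]:
    """Count crosslink site positions falling in each half-open window.

    Hypothetical helper for illustration; one list entry per crosslink event.
    """
    sorted_sites = sorted(sites)
    counts = []
    for start, end in windows:
        # Number of positions p with start <= p < end.
        counts.append(bisect_left(sorted_sites, end) - bisect_left(sorted_sites, start))
    return counts


windows = [(0, 50), (20, 70), (40, 90)]
sites = [5, 25, 25, 45, 69, 70]
print(count_sites(windows, sites))
```

Because sliding windows overlap, the per-window counts do not sum to the total number of sites.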
createMatrix
Create R-friendly output matrices.
Arguments
--input_dir/-i: Input directory containing the output of shoji count, see shoji count -h for details
--prefix/-p: Prefix to filter count files in the input directory
--suffix/-s: Suffix to filter count files in the input directory. Either --prefix or --suffix MUST be provided
--format/-f: Output format: csv or parquet. Default: csv. The csv format also supports gzipped output
--annotation/-a: Output filename for trimmed annotations
--output/-o: Output filename for the aggregated crosslink count per window matrix
--max/-m: Optional output. Output filename for the max. crosslink site per window matrix
--allow_duplicates: Default behavior: if adjacent overlapping windows have the same crosslink counts across all samples, write only the most 5’ window to the output file. Use this flag to disable this feature and write all windows
--cpus/-c: Number of cores to use for parallel processing
Usage:
$ shoji createMatrix --help
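The default duplicate handling (the behavior that --allow_duplicates switches off) can be sketched as a single pass over windows sorted by position. This is one possible reading of the description, not Shoji's code. The function name `drop_duplicate_windows` is hypothetical, all windows are assumed to be on the plus strand (so the most 5’ window is the one with the smallest start), and each window is compared with its immediate predecessor.

```python
from typing import List, Tuple

Window = Tuple[int, int, Tuple[int, ...]]  # (start, end, counts per sample)


def drop_duplicate_windows(windows: List[Window]) -> List[Window]:
    """Keep only the first window of each run of overlapping, identical-count windows.

    Assumes `windows` is sorted by start and lies on the plus strand.
    """
    kept: List[Window] = []
    prev = None
    for win in windows:
        start, _end, counts = win
        is_duplicate = (
            prev is not None
            and start < prev[1]      # overlaps the preceding window
            and counts == prev[2]    # same counts in every sample
        )
        if not is_duplicate:
            kept.append(win)
        prev = win  # always compare against the immediate predecessor
    return kept


windows = [
    (0, 50, (3, 1)),
    (20, 70, (3, 1)),
    (40, 90, (3, 1)),
    (60, 110, (0, 2)),
    (200, 250, (0, 2)),
]
print(drop_duplicate_windows(windows))
```

In this example the last window has the same counts as the one before it but does not overlap it, so both are kept.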
Helpers
toTabix
Convert BED file to bgzipped, tabix indexed BED file
Arguments
--bed/-b: Input BED file (BED6 format, supports .gz files)
--output/-o: Output filename for the bgzipped, tabix-indexed BED file
--cpus/-c: Number of cores to use for parallel processing
--tmp/-t: Temporary directory to save intermediate outputs. If not provided, a temporary directory is created and used in the parent directory of the output file
Usage:
$ shoji toTabix --help