overview

Shoji

This package is designed primarily to do the following operations:

Annotation 

A suite of functions to process and flatten genome annotation file.

annotation

annotation function takes as input a GFF formatted genome annotation file and converts the annotations from GFF format to bed format. For an example, this function converts the following GFF annotation

chr1

HAVANA

exon

17233

17368

.

-

.

ID=exon:ENST00000488147.1:6;Parent=ENST00000488147.1;gene_id=ENSG00000227232.5;transcript_id=ENST00000488147.1;gene_type=unprocessed_pseudogene;gene_name=WASH7P;transcript_type=unprocessed_pseudogene;transcript_name=WASH7P-201;exon_number=6;exon_id=ENSE00003502542.1;level=2;transcript_support_level=NA;hgnc_id=HGNC:38034;ont=PGO:0000005;tag=basic,Ensembl_canonical;havana_gene=OTTHUMG00000000958.1;havana_transcript=OTTHUMT00000002839.1

and converts this entry into the following BED6 format

chromosome	start	end	name	score	strand
chr1	17232	17368	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006	0	-

Various attributes in the name column in this BED entry is seperated by @ and the order is given below

atrribute	attribute description
ENSG00000227232.5	gene id
WASH7P	gene name
unprocessed_pseudogene	gene type
exon	gene feature (exon, intron, CDS,…)
6/11	6th exon out of a total of 11 exons of this gene
ENSG00000227232.5:exon0006	unique id, merging gene id feature and feature number

In Gencode V42 genome build, the above entry is the 6th exon of the gene WASH7P, preceded by an intron from chr1:17368-17605(-), but chromosomal location chr1:17368-17436(-) within this intron contains a microRNA MIR6859-1. In Shoji and htseq-clip, the default behavior is to process the intron and microRNA with the intron as separate features:

chromosome	start	end	name	strand
chr1	17232	17368	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006	-
chr1	17368	17436	ENSG00000278267.1@MIR6859-1@miRNA@exon@1/1@ENSG00000278267.1:exon0001	-
chr1	17368	17605	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@intron@5/10@ENSG00000227232.5:intron0005	-

But with --split-intron flag, Shoji will split the intron into two non-overlapping chunks, and removes the chunk that overlaps with the microRNA. The output will be:

chromosome	start	end	name	strand
chr1	17232	17368	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006	-
chr1	17368	17436	ENSG00000278267.1@MIR6859-1@miRNA@exon@1/1@ENSG00000278267.1:exon0001	-
chr1	17436	17605	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@intron@5-1/10@ENSG00000227232.5:intron0005-1	-

Notice that the intron now spand chr1:17436-17605(-) instead of spanning over the microRNA, ie chr1:17368-17605(-).

Note

The output supports tabix indexing using the flag --tabix and is still compatible with htseq-clip commands

createSlidingWindows

createSlidingWindows function takes as input a flattened annotation BED file created by the annotation function and splits each individual BED entries into overlapping windows. --size parameter controls the size of each window and --step controls the overlap of each neighboring windows from the same feature

Continuing with the example entry above, the first 5 sliding windows generated from the BED6 flattened entry are given below:

chromosome	start	end	name	strand
chr1	17232	17282	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00019@19	-
chr1	17237	17287	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00018@18	-
chr1	17242	17292	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00017@17	-
chr1	17247	17297	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00016@16	-
chr1	17252	17302	ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00015@15	-

Each sliding window listed here is 50bp long, as default value for --size argument is 50 and the difference between start positions of each is 20bp, as the default value for --step argument is 20

Following the convention in flattened annotation the attributes in sliding windows name column are also seperated by @ and the first 5 attributes in the name column here are exactly the same as that of flattened annotation name column An example is given below

atrribute	attribute description	Found in flattend name attribute
ENSG00000227232.5	gene id	Yes
WASH7P	gene name	Yes
unprocessed_pseudogene	gene type	Yes
exon	gene feature (exon, intron, CDS,…)	Yes
6/11	6th exon out of a total 11 exons in this gene	Yes
ENSG00000227232.5:exon0006W00019	unique id, merging gene id feature, feature number and window number (W : window)	No
19	19th window of this feature	No

Note

There will be zero overlap between neighboring windows from two separate gene features

Extraction

Extract and process crosslink sites from alignment file.

extract

extract function takes as input an alignment file (.bam) and extracts and writes either start, insertion, deletion, middle or end site into a BED6 formatted file. The argument --site determines crosslink site choice.

Given below is an example paired end sequence and start, middle and end positions extracted from the second mate of this fragment

TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918

99

chr1

17252

255

33M

=

17285

41

TTTTAAAGGCTGAGTCCTCTGAGAATTTATTAC

JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

NH:i:1

HI:i:1

AS:i:60

nM:i:5

NM:i:4

MD:Z:0C0A0G0G29

jM:B:c,-1

jI:B:i,-1

RG:Z:foo

TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918

147

chr1

17285

255

38M

=

17252

-41

TAAAGGCTGAGTCCTCTGAGAATTTATTACTACGGATC

JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

NH:i:1

HI:i:1

AS:i:60

nM:i:5

NM:i:1

MD:Z:0G37

jM:B:c,-1

jI:B:i,-1

RG:Z:foo

start site

chromosome	start	end	name	score	strand
chr1	17322	17323	TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918\|38	1	-

middle site

chromosome	start	end	name	score	strand
chr1	17304	17305	TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918\|38	1	-

end site

chromosome	start	end	name	score	strand
chr1	17285	17286	TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918\|38	1	-

Note

In a paired end alignment file, argument --mate is used to choose the read/mate from which crosslink sites are extracted. The sequencing protocol used to generate the file determines whether the crosslink site is located on the first mate or the second mate. Please consult your sequencing protocol to decide which mate to use.

Counting 

Calculate the number of extracted crosslink sites per given gene annotation feature.

count

count function takes as input either a flattened annotation file generated by annotation function or a sliding windows file generated by createSlidingWindows function and a crosslink sites file generated by extract function and for each entry/window in the annotation/sliding windows file count the number of crosslink sites in the region.

In Shoji, the output is written in Apache parquet file format, with the following schema:

field (column) name	type	nullable	description
chrom	string	False	chromosome name
begin	uint32	False	window start position
end	uint32	False	window end position
uniq_id	string	False	unique id of this window
gene_id	string	False	gene id
gene_name	string	False	gene name
gene_type	string	False	gene type, eg: protein coding, lncRNA,…
feature	string	False	gene feature, intron or exon
nr_of_region	string	False	number of the current region
total_region	string	False	total number of regions
window_number	uint16	True	window number
strand	string	False	strand info
sample	string	False	sample name
pos_counts	map_(uint32, uint32)	False	Map of chromosome positions to crosslink counts

createMatrix

This function creates R friendly output matrices. It was seen that in some CLIP experiments the adjacent overlapping windows in the sliding windows file are duplicates of each other. In such cases, downstream analysis of these duplicate windows are not at all useful. As a workaround, while aggregating crosslink counts across all samples, this functon also filters out overlapping windows with duplicate crosslink counts across all samples, and keeps only the 5’ most window (in gene direction). This behavior can be overridden by using --allow_duplicates flag.

The following output files are generated:

count matrix: aggregated crosslink sites per window, per sample
annotation table : flattened annotation file with unique ids, gene names, gene features, gene ids, etc.
max count matrix (optional): maximum crosslink sites per window, per sample

Annotation table follows the following schema:

Column	Description
unique_id	Unique identifier of this window
chromosome	Chromosome name
begin	window start position
end	window end position
strand	Strand info
gene_id	Gene identifier
gene_name	Gene name
gene_type	Gene type, eg: protein coding, lncRNA,…
gene_region	Gene region type, intron or exon
Nr_of_region	number of the current region
Total_nr_of_region	total number of regions
window_number	(optional) Number of this window in the region

Further analysis

Further analysis and processing of crosslink windows is done using R/Bioconductor package DEWSeq. Please refer to the user manual of this package for requirements, installation and help.

overview

Annotation

annotation

createSlidingWindows

Extraction

extract

Counting

count

createMatrix

Further analysis

Annotation 

Counting 