overview
Shoji
This package is designed primarily to do the following operations:
Annotation
A suite of functions to process and flatten genome annotation file.
annotation
annotation function takes as input a GFF formatted genome annotation file and converts the annotations from GFF format to bed format. For an example, this function converts the following GFF annotation
chr1 |
HAVANA |
exon |
17233 |
17368 |
. |
- |
. |
ID=exon:ENST00000488147.1:6;Parent=ENST00000488147.1;gene_id=ENSG00000227232.5;transcript_id=ENST00000488147.1;gene_type=unprocessed_pseudogene;gene_name=WASH7P;transcript_type=unprocessed_pseudogene;transcript_name=WASH7P-201;exon_number=6;exon_id=ENSE00003502542.1;level=2;transcript_support_level=NA;hgnc_id=HGNC:38034;ont=PGO:0000005;tag=basic,Ensembl_canonical;havana_gene=OTTHUMG00000000958.1;havana_transcript=OTTHUMT00000002839.1 |
and converts this entry into the following BED6 format
chromosome |
start |
end |
name |
score |
strand |
|---|---|---|---|---|---|
chr1 |
17232 |
17368 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006 |
0 |
- |
Various attributes in the name column in this BED entry is seperated by @ and the
order is given below
atrribute |
attribute description |
|---|---|
ENSG00000227232.5 |
gene id |
WASH7P |
gene name |
unprocessed_pseudogene |
gene type |
exon |
gene feature (exon, intron, CDS,…) |
6/11 |
6th exon out of a total of 11 exons of this gene |
ENSG00000227232.5:exon0006 |
unique id, merging gene id feature and feature number |
In Gencode V42 genome build, the above entry is the 6th exon of the gene WASH7P, preceded by an intron from chr1:17368-17605(-),
but chromosomal location chr1:17368-17436(-) within this intron contains a microRNA MIR6859-1.
In Shoji and htseq-clip, the default behavior is to process the intron and microRNA with the intron as separate features:
chromosome |
start |
end |
name |
score |
strand |
|---|---|---|---|---|---|
chr1 |
17232 |
17368 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006 |
0 |
- |
chr1 |
17368 |
17436 |
ENSG00000278267.1@MIR6859-1@miRNA@exon@1/1@ENSG00000278267.1:exon0001 |
0 |
- |
chr1 |
17368 |
17605 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@intron@5/10@ENSG00000227232.5:intron0005 |
0 |
- |
But with --split-intron flag, Shoji will split the intron into two non-overlapping chunks, and removes the chunk that overlaps with the microRNA. The output will be:
chromosome |
start |
end |
name |
score |
strand |
|---|---|---|---|---|---|
chr1 |
17232 |
17368 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006 |
0 |
- |
chr1 |
17368 |
17436 |
ENSG00000278267.1@MIR6859-1@miRNA@exon@1/1@ENSG00000278267.1:exon0001 |
0 |
- |
chr1 |
17436 |
17605 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@intron@5-1/10@ENSG00000227232.5:intron0005-1 |
0 |
- |
Notice that the intron now spand chr1:17436-17605(-) instead of spanning over the microRNA, ie chr1:17368-17605(-).
Note
The output supports tabix indexing using the flag --tabix and is still compatible with htseq-clip commands
createSlidingWindows
createSlidingWindows function takes as input a flattened annotation BED file
created by the annotation function and splits each individual BED entries into overlapping windows.
--size parameter controls the size of each window and --step controls the overlap
of each neighboring windows from the same feature
Continuing with the example entry above, the first 5 sliding windows generated from the BED6 flattened entry are given below:
chromosome |
start |
end |
name |
score |
strand |
|---|---|---|---|---|---|
chr1 |
17232 |
17282 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00019@19 |
0 |
- |
chr1 |
17237 |
17287 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00018@18 |
0 |
- |
chr1 |
17242 |
17292 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00017@17 |
0 |
- |
chr1 |
17247 |
17297 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00016@16 |
0 |
- |
chr1 |
17252 |
17302 |
ENSG00000227232.5@WASH7P@unprocessed_pseudogene@exon@6/11@ENSG00000227232.5:exon0006W00015@15 |
0 |
- |
Each sliding window listed here is 50bp long, as default value for --size argument is 50 and the difference between
start positions of each is 20bp, as the default value for --step argument is 20
Following the convention in flattened annotation the attributes in sliding windows name column are also seperated by @
and the first 5 attributes in the name column here are exactly the same as that of flattened annotation name column
An example is given below
atrribute |
attribute description |
Found in flattend name attribute |
|---|---|---|
ENSG00000227232.5 |
gene id |
Yes |
WASH7P |
gene name |
Yes |
unprocessed_pseudogene |
gene type |
Yes |
exon |
gene feature (exon, intron, CDS,…) |
Yes |
6/11 |
6th exon out of a total 11 exons in this gene |
Yes |
ENSG00000227232.5:exon0006W00019 |
unique id, merging gene id feature, feature number and window number (W : window) |
No |
19 |
19th window of this feature |
No |
Note
There will be zero overlap between neighboring windows from two separate gene features
Extraction
Extract and process crosslink sites from alignment file.
extract
extract function takes as input an alignment file (.bam) and extracts and
writes either start, insertion, deletion, middle or end site into a BED6 formatted file.
The argument --site determines crosslink site choice.
Given below is an example paired end sequence and start, middle and end positions extracted from the second mate of this fragment
TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918 |
99 |
chr1 |
17252 |
255 |
33M |
= |
17285 |
41 |
TTTTAAAGGCTGAGTCCTCTGAGAATTTATTAC |
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ |
NH:i:1 |
HI:i:1 |
AS:i:60 |
nM:i:5 |
NM:i:4 |
MD:Z:0C0A0G0G29 |
jM:B:c,-1 |
jI:B:i,-1 |
RG:Z:foo |
TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918 |
147 |
chr1 |
17285 |
255 |
38M |
= |
17252 |
-41 |
TAAAGGCTGAGTCCTCTGAGAATTTATTACTACGGATC |
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ |
NH:i:1 |
HI:i:1 |
AS:i:60 |
nM:i:5 |
NM:i:1 |
MD:Z:0G37 |
jM:B:c,-1 |
jI:B:i,-1 |
RG:Z:foo |
start site
chromosome |
start |
end |
name |
score |
strand |
chr1 |
17322 |
17323 |
TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918|38 |
1 |
- |
middle site
chromosome |
start |
end |
name |
score |
strand |
chr1 |
17304 |
17305 |
TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918|38 |
1 |
- |
end site
chromosome |
start |
end |
name |
score |
strand |
chr1 |
17285 |
17286 |
TTATTACAGC:K00180:131:H7J3YBBXX:3:2123:15057:19918|38 |
1 |
- |
Note
In a paired end alignment file, argument --mate is used to choose the read/mate from which crosslink sites are extracted. The sequencing protocol used to generate the file determines whether the crosslink site is located on the first mate or the second mate. Please consult your sequencing protocol to decide which mate to use.
Counting
Calculate the number of extracted crosslink sites per given gene annotation feature.
count
count function takes as input either a flattened annotation file generated by annotation function or a sliding windows file generated by createSlidingWindows function and a crosslink sites file generated by extract function and for each entry/window in the annotation/sliding windows file count the number of crosslink sites in the region.
In Shoji, the output is written in Apache parquet file format, with the following schema:
field (column) name |
type |
nullable |
description |
chrom |
string |
False |
chromosome name |
begin |
uint32 |
False |
window start position |
end |
uint32 |
False |
window end position |
uniq_id |
string |
False |
unique id of this window |
gene_id |
string |
False |
gene id |
gene_name |
string |
False |
gene name |
gene_type |
string |
False |
gene type, eg: protein coding, lncRNA,… |
feature |
string |
False |
gene feature, intron or exon |
nr_of_region |
string |
False |
number of the current region |
total_region |
string |
False |
total number of regions |
window_number |
uint16 |
True |
window number |
strand |
string |
False |
strand info |
sample |
string |
False |
sample name |
pos_counts |
map_(uint32, uint32) |
False |
Map of chromosome positions to crosslink counts |
createMatrix
This function creates R friendly output matrices.
It was seen that in some CLIP experiments the adjacent overlapping windows in the sliding windows file are duplicates of each other.
In such cases, downstream analysis of these duplicate windows are not at all useful. As a workaround, while aggregating crosslink counts
across all samples, this functon also filters out overlapping windows with duplicate crosslink counts across all samples, and keeps only the 5’ most window (in gene direction).
This behavior can be overridden by using --allow_duplicates flag.
The following output files are generated:
count matrix: aggregated crosslink sites per window, per sample
annotation table : flattened annotation file with unique ids, gene names, gene features, gene ids, etc.
max count matrix (optional): maximum crosslink sites per window, per sample
Annotation table follows the following schema:
Column |
Description |
|---|---|
unique_id |
Unique identifier of this window |
chromosome |
Chromosome name |
begin |
window start position |
end |
window end position |
strand |
Strand info |
gene_id |
Gene identifier |
gene_name |
Gene name |
gene_type |
Gene type, eg: protein coding, lncRNA,… |
gene_region |
Gene region type, intron or exon |
Nr_of_region |
number of the current region |
Total_nr_of_region |
total number of regions |
window_number |
(optional) Number of this window in the region |
Further analysis
Further analysis and processing of crosslink windows is done using R/Bioconductor package DEWSeq. Please refer to the user manual of this package for requirements, installation and help.