This file is to show what I did to re-analyze Arabidopsis Pol II ChIP-Seq Analysis.
Available Data
There are three papers I found in Arabidopsis did RNA Polymerase II ChIP-Seq/Chip.
- Genome-wide analysis of chromatin packing in Arabidopsis thaliana at single-gene resolution (2016 Genome Research SRP064711)
- A One Precursor One siRNA Model for Pol IV-Dependent siRNA Biogenesis (2015 Cell GSE61439)
- Relationship between nucleosome positioning and DNA methylation (2010 Nature GSE21673)
For the following analysis I use the first dataset because 1) I did not find input information for the second dataset 2) the third dataset used ChIP-chip that is less powerful than sequencing.
Prerequisite Data
- Bowtie2 Index Genome. This was downloaded from iGenome Ensembl version TAIR10
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Arabidopsis_thaliana/Ensembl/TAIR10/Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz
- Install Cutadapt, FASTQC, TrimGalore, sra-toolkit, Bowtie2, samtools, bedtools, MACS2
Data Process
The basic data process workflow is indicated in Figure 1.
Download sra file
- ChIP-Seq Input: SRR2626465 (100bp single-end)
- ChIP-Seq Pol II: SRR2626466 (100bp singe-end)
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP064/SRP064711/SRR2626465/SRR2626465.sra;wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP064/SRP064711/SRR2626466/SRR2626466.sra
Convert sra files to fastq
fastq-dump --gzip SRR2626465.sra ;fastq-dump --gzip SRR2626466.sra
Remove Adapters and Quality Check
Using TrimGalore. Functional Cutadpat and FASTQC are required.
trim_galore --illumina --fastqc --suppress_warn --length 35 --gzip *.fastq.gz &
For two fastq.gz files ran in parallel, each took 63min wall time and 116min CPU time. I did not find multi-thread parameter like cutadapt. I probably will use Cutadapt next time.
real | 63m32.002s |
---|---|
user | 114m45.544s |
sys | 2m30.260s |
Align to Genome by Bowtie2
Follow paper’s protocol:
Reads were aligned against the A. thaliana reference genome (TAIR10) using Bowtie 2 v2.2.4 (Langmead and Salzberg 2012) with mapping parameters as “-R 5 -N 0 -L 20 -i S,1,0.50”. The mapped reads were analyzed by MACS2 v2.1.0.20140616 (Zhang et al. 2008). The “–broad” flag was on for both Pol II and H3K27me3 peak calling, with reads from the input or anti-H3 sample used as controls, and default settings were used for the rest parameters.
export BOWTIE2_INDEXES=/home/jhuang/GenomeFiles/iGenome_TAIR10_Ensembl/Sequence/Bowtie2Index
nohup bowtie2 -x genome -U SRR2626465_trimmed.fq.gz -R 5 -N 0 -L 20 -i S,1,0.50 -t -p 23 -S SRR2626465.sam;bowtie2 -x genome -U SRR2626466_trimmed.fq.gz -R 5 -N 0 -L 20 -i S,1,0.50 -t -p 23 -S SRR2626466.sam 2>bowtie2STERROR.txt &
Convert to bam and sort + index
# convert sam to bam files. about 10min
time samtools view -Sb -q 10 SRR2626465.sam > SRR26264665.bam &
time samtools view -Sb -q 10 SRR2626466.sam > SRR26264666.bam &
# sorting about 16min
time samtools sort SRR26264665.bam SRR26264665_sort&
time samtools sort SRR26264666.bam SRR26264666_sort&
# indexing about 2min28s
samtools index SRR26264665_sort.bam |bamToBed -i SRR26264665_sort.bam >SRR26264665.bed &
samtools index SRR26264666_sort.bam |bamToBed -i SRR26264666_sort.bam >SRR26264666.bed &
Convert bam to bed and Keep unique reads only
# mono reads about 5 min 40 s
sort -u -k1,3 SRR26264665.bed |sortBed -i > SRR26264665_mono.bed&
sort -u -k1,3 SRR26264666.bed |sortBed -i > SRR26264666_mono.bed&
Peak call by MACS2
# about 10m16s
macs2 callpeak -t SRR26264666_mono.bed -c SRR26264665_mono.bed -B -n at_pol2 --broad 2>macs2.info
Convert bedGraph to bigwig file
This one is a little bit tricky. You will need bedGraphToBigwig, bedclip and bedtools slopebed for this. Luckily, for the first two from UCSC Genome Browser Utilities, you can just download binary version and directly run.
The single bedGraphToBigwig command won’t be enough, because reads will exceed chromosome size range, need to use with the other two command. Tao Liu has a great script for this purpose, just follow his. I pasted the script here for my recording, but refer to his original credit. This script directly takes bdg files and saved as bw as bigwig.
You also need a genome size file, describing each chromosome size. Usually it’s in the iGenome/WholeFasta directory. But if not, you can use samtools faidx to generate a genome.fa.fai file.
#!/bin/bash
# from Tao Liu: http://bit.ly/2mYd5fU
# check commands: slopBed, bedGraphToBigWig and bedClip
which bedtools &>/dev/null || { echo "bedtools not found! Download bedTools: <http://code.google.com/p/bedtools/>"; exit 1; }
which bedGraphToBigWig &>/dev/null || { echo "bedGraphToBigWig not found! Download: <http://hgdownload.cse.ucsc.edu/admin/exe/>"; exit 1; }
which bedClip &>/dev/null || { echo "bedClip not found! Download: <http://hgdownload.cse.ucsc.edu/admin/exe/>"; exit 1; }
# end of checking
if [ $# -lt 2 ];then
echo "Need 2 parameters! <bedgraph> <chrom info>"
exit
fi
F=$1
G=$2
bedtools slop -i ${F} -g ${G} -b 0 | bedClip stdin ${G} ${F}.clip
bedGraphToBigWig ${F}.clip ${G} ${F/bdg/bw}
rm -f ${F}.clip
From bigwig files (binary), you can download and visualize in IGV or some other genome browsers. Here is just an example:
#convert to bigwig
./bdg2bw.sh at_pol2_control_lambda.bdg genome.fa.fai ;./bdg2bw.sh at_pol2_treat_pileup.bdg genome.fa.fai &
Figure1: Top panel is the Pol II ChIP, second panel is the Input.
Cleaning
I deleted fastq files, sam files and sra files. The rest of data saved in ~/Data/ChIP/at_polii.
rm *.sra *.sam *.fastq.gz
Next
The following anlaysis can be done by ngs.plot and bedtools. I will demonstrate next time.