Exercises#
Quality control#
Add a rule that generates FastQC reports for the two input fastq.gz files.
The tool
fastqcis available from thebiocondaconda channel.An example command-line to use
fastqcis$ fastqc chip-seq/H3K4-TC1-ST2-D0.12.fastq.gzThe output are two files in the same directory as the input file,
H3K4-TC1-ST2-D0.12_fastqc.htmlandH3K4-TC1-ST2-D0.12_fastqc.zip.You might want to copy at least the html file to the
outputdirectory.Do not forget to add at least one of the output files to the summary rule or explicitly specify it on the command-line.
Add a rule that generates alignment statistics for the two bam files with Picard CollectAlignmentSummaryMetrics.
The tool
picardis available from thebiocondaconda channel.An example command-line to run
picardis$ picard CollectAlignmentSummaryMetrics R=reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa I=bowtie2/H3K4-TC1-ST2-D0.12.bam O=output/alignment_metrics_H3K4-TC1-ST2-D0.12.txtYou can use the
paramsdirective again to specify the reference file.You might want to save the output that picard prints on the command-line (stderr) to a log file.
Peak annotation#
Add a rule that uses bedtools intersect to check with which genes the peaks overlap.
The tool
bedtoolsis available from thebiocondaconda channel.You will need a file in BED format that contains the regions of the genes.
$ cd reference $ wget "https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/UCSC_mm10_refseq.chromosome.12.bed" $ cd ..You can get this data also directly from the UCSC Table Browser, but you will need to fix the chromosome naming (remove
chr).An example command-line to use
bedtools intersectis$ bedtools intersect -wb -a output/TC1-ST2-D0.12_peaks.narrowPeak -b reference/UCSC_mm10_refseq.chromosome.12.bedBe aware that
bedtoolswrites the output to the command-line (stdout), so you have to manually redirect it to a file with>.(Optional) Since our peaks are expected to be at transcription start sites (TSS) and not necessarily within the gene body, you might want to extend the regions in the gene BED file with
bedtools slopto cover a few kb upstream of the gene start.$ bedtools slop -l 2000 -r 0 -s -i reference/UCSC_mm10_refseq.chromosome.12.bed -g reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.faiCreate a separate rule for this, but reuse the conda environment with
bedtools.(Optional) Extract the list of unique genes from the
bedtools intersectoutput (column 14) usingcut,sortanduniq.
Add a rule that draws a heatmap of the ChIP signal around the genes using deepTools.
The toolsuite
deepToolsis available from thebiocondaconda channel.You will need the
computeMatrixandplotHeatmaptools.An example command-line to generate a plot is
$ computeMatrix scale-regions -S output/TC1-ST2-D0.12_treat_pileup.bigwig -R reference/UCSC_mm10_refseq.chromosome.12.bed -b 3000 -m 5000 -a 1000 -o deeptools/TC1-ST2-D0.12_matrix.gz $ plotHeatmap -m deeptools/TC1-ST2-D0.12_matrix.gz -o output/TC1-ST2-D0.12_deeptools_plot.png(Optional) Generate a properly normalised bigwig track first, using
bamCompare, and use this as input forcomputeMatrix.$ bamCompare -b1 bowtie2/H3K4-TC1-ST2-D0.12.bam -b2 bowtie2/INPUT-TC1-ST2-D0.12.bam -o deeptools/TC1-ST2-D0.12_normalised.bigwig -of bigwig --effectiveGenomeSize 120129022You can put all commands in a single rule or create one rule per command. Since we do not reuse any of the intermediate outputs, a single rule works fine, but otherwise it would be better to split up the rule.
Shell script into Snakefile#
Have a look at the following shell script.
The script is adapted from the first few steps of the Galaxy atac-seq tutorial.
#!/bin/sh wget https://zenodo.org/record/3270536/files/SRR891268_R1.fastq.gz wget https://zenodo.org/record/3270536/files/SRR891268_R2.fastq.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz # Alignment bowtie2-build chr22.fa.gz hg38_chr22 bowtie2 -x hg38_chr22 -1 SRR891268_R1.fastq.gz -2 SRR891268_R2.fastq.gz | \ samtools sort -n - > SRR891268.bam # Peak Calling Genrich -t SRR891268.bam -o SRR891268.narrowPeak \ -e "chrM" -f SRR891268.log -m 30 -j\ -a 20 -r \ -k SRR891268.bgAtac-seq is similar to ChIP-seq an epigenomics NGS technique to identify open chromatin. Since the sequencing is in a paired-end mode (instead of single-end like we had in the previous data set from the tutorial) other parameters are needed for the alignment with
bowtie2.-1indicates the forward read,- while
-2indicates the reverse read.
Genrichhas an advantage overMACS2since it is offering to do the necessary filtering steps during peak calling.-eallows to exclude chromosomes that are specified. In this case the mitochondrial chromosome (chrM).-mfilters low quality reads.-jspecifies that it has to run in the atac-seq mode.-ais the minimum AUC for a peak.-rremoves PCR duplicates.-kcreates a bedgraph-ish file for pileups.
Write the script into a Snakefile.
Create a new directory
atac-seqin your home directory for this (see the code below)bash $ cd $ mkdir atac-seq $ cd atac-seqThe peak caller
Genrichis available from thebiocondaconda channel
How can you improve the workflow in the context of data management? Think about data structures and log files.