Exercises#

Quality control#

Add a rule that generates FastQC reports for the two input fastq.gz files.
- The tool fastqc is available from the bioconda conda channel.
- An example command-line to use fastqc is
```
$ fastqc chip-seq/H3K4-TC1-ST2-D0.12.fastq.gz
```
- The output are two files in the same directory as the input file, H3K4-TC1-ST2-D0.12_fastqc.html and H3K4-TC1-ST2-D0.12_fastqc.zip.
- You might want to copy at least the html file to the output directory.
- Do not forget to add at least one of the output files to the summary rule or explicitly specify it on the command-line.
Add a rule that generates alignment statistics for the two bam files with Picard CollectAlignmentSummaryMetrics.
- The tool picard is available from the bioconda conda channel.
- An example command-line to run picard is
```
$ picard CollectAlignmentSummaryMetrics R=reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa I=bowtie2/H3K4-TC1-ST2-D0.12.bam O=output/alignment_metrics_H3K4-TC1-ST2-D0.12.txt
```
- You can use the params directive again to specify the reference file.
- You might want to save the output that picard prints on the command-line (stderr) to a log file.

Peak annotation#

Add a rule that uses bedtools intersect to check with which genes the peaks overlap.
- The tool bedtools is available from the bioconda conda channel.
- You will need a file in BED format that contains the regions of the genes.
```
$ cd reference
$ wget "https://webdav-r3lab.uni.lu/public/biocore/snakemake_tutorial/UCSC_mm10_refseq.chromosome.12.bed"
$ cd ..
```
  You can get this data also directly from the UCSC Table Browser, but you will need to fix the chromosome naming (remove chr).
- An example command-line to use bedtools intersect is
```
$ bedtools intersect -wb -a output/TC1-ST2-D0.12_peaks.narrowPeak -b reference/UCSC_mm10_refseq.chromosome.12.bed
```
- Be aware that bedtools writes the output to the command-line (stdout), so you have to manually redirect it to a file with >.
- (Optional) Since our peaks are expected to be at transcription start sites (TSS) and not necessarily within the gene body, you might want to extend the regions in the gene BED file with bedtools slop to cover a few kb upstream of the gene start.
```
$ bedtools slop -l 2000 -r 0 -s -i reference/UCSC_mm10_refseq.chromosome.12.bed -g reference/Mus_musculus.GRCm38.dna_sm.chromosome.12.fa.fai
```
  Create a separate rule for this, but reuse the conda environment with bedtools.
- (Optional) Extract the list of unique genes from the bedtools intersect output (column 14) using cut, sort and uniq.
Add a rule that draws a heatmap of the ChIP signal around the genes using deepTools.
- The toolsuite deepTools is available from the bioconda conda channel.
- You will need the computeMatrix and plotHeatmap tools.
- An example command-line to generate a plot is
```
$ computeMatrix scale-regions -S output/TC1-ST2-D0.12_treat_pileup.bigwig -R reference/UCSC_mm10_refseq.chromosome.12.bed -b 3000 -m 5000 -a 1000 -o deeptools/TC1-ST2-D0.12_matrix.gz
$ plotHeatmap -m deeptools/TC1-ST2-D0.12_matrix.gz -o output/TC1-ST2-D0.12_deeptools_plot.png
```
- (Optional) Generate a properly normalised bigwig track first, using bamCompare, and use this as input for computeMatrix.
```
$ bamCompare -b1 bowtie2/H3K4-TC1-ST2-D0.12.bam -b2 bowtie2/INPUT-TC1-ST2-D0.12.bam -o deeptools/TC1-ST2-D0.12_normalised.bigwig -of bigwig --effectiveGenomeSize 120129022
```
- You can put all commands in a single rule or create one rule per command. Since we do not reuse any of the intermediate outputs, a single rule works fine, but otherwise it would be better to split up the rule.

Shell script into Snakefile#

Have a look at the following shell script.

The script is adapted from the first few steps of the Galaxy atac-seq tutorial.
```
#!/bin/sh
wget https://zenodo.org/record/3270536/files/SRR891268_R1.fastq.gz
wget https://zenodo.org/record/3270536/files/SRR891268_R2.fastq.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz

# Alignment

bowtie2-build chr22.fa.gz hg38_chr22
bowtie2 -x hg38_chr22 -1 SRR891268_R1.fastq.gz -2 SRR891268_R2.fastq.gz | \
    samtools sort -n - > SRR891268.bam

# Peak Calling

Genrich -t SRR891268.bam -o SRR891268.narrowPeak \
    -e "chrM" -f SRR891268.log -m 30 -j\
    -a 20 -r \
    -k SRR891268.bg
```
Atac-seq is similar to ChIP-seq an epigenomics NGS technique to identify open chromatin. Since the sequencing is in a paired-end mode (instead of single-end like we had in the previous data set from the tutorial) other parameters are needed for the alignment with bowtie2.
- -1 indicates the forward read,
- while -2 indicates the reverse read.
Genrich has an advantage over MACS2 since it is offering to do the necessary filtering steps during peak calling.
- -e allows to exclude chromosomes that are specified. In this case the mitochondrial chromosome (chrM).
- -m filters low quality reads.
- -j specifies that it has to run in the atac-seq mode.
- -a is the minimum AUC for a peak.
- -r removes PCR duplicates.
- -k creates a bedgraph-ish file for pileups.
Write the script into a Snakefile.
- Create a new directory atac-seq in your home directory for this (see the code below) bash $ cd $ mkdir atac-seq $ cd atac-seq
- The peak caller Genrich is available from the bioconda conda channel
How can you improve the workflow in the context of data management? Think about data structures and log files.