Background

ChIP-seq is a method used to analyze protein interactions with DNA. It combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins, and it can be used to map genome-wide binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique used to study these protein–DNA interactions.

Figure: ChIP-seq workflow

1. Raw data statistics

Once raw data are provided, summary statistics are reported, including the number of reads, genome coverage (x) and base distribution.
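As a rough illustration of these statistics, the following minimal Python sketch computes read count, coverage and base distribution from FASTQ text. The function names (`parse_fastq`, `raw_stats`) are hypothetical, and coverage is assumed here to mean total sequenced bases divided by genome size; a production pipeline may define or compute it differently.

```python
from collections import Counter


def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines.

    Assumes well-formed four-line records; blank lines are ignored.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, qual = stripped[i:i + 4]
        yield header[1:], seq, qual  # drop the leading '@'


def raw_stats(records, genome_size):
    """Summarize reads: count, total bases, coverage (x) and base fractions."""
    n_reads = 0
    total_bases = 0
    base_counts = Counter()
    for _read_id, seq, _qual in records:
        n_reads += 1
        total_bases += len(seq)
        base_counts.update(seq.upper())
    base_fraction = {}
    if total_bases:
        base_fraction = {b: c / total_bases for b, c in sorted(base_counts.items())}
    return {
        "reads": n_reads,
        "bases": total_bases,
        # Assumption: coverage = total sequenced bases / genome size
        "coverage_x": total_bases / genome_size,
        "base_fraction": base_fraction,
    }
```

For a real file, the lines could come from `open("sample.fastq")` (a placeholder path), with `genome_size` set to the length of the reference genome in bases.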

2. Quality check

Quality control checks on raw sequence data from high-throughput sequencing give a quick impression of whether the data have any problems you should be aware of before doing further analysis. FastQC is used to assess the quality of the data provided. Typically we test for adapter contamination, read quality and other sequencing biases.
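To make the idea of a read quality check concrete, here is a small Python sketch of one such metric: the mean Phred score at each read position. It is not FastQC's implementation. The function name is hypothetical, and the quality strings are assumed to use the Phred+33 encoding.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    Assumes Phred+33 encoded quality strings; reads may differ in length,
    so each position is averaged over the reads that reach it.
    """
    sums = []
    counts = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(sums):
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]
```

A steady drop in the mean towards the 3' end of the reads is the kind of pattern that would motivate the trimming step described in the next section.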

3. Data pre-processing

Data pre-processing is important for handling over-represented sequences and low-quality reads, as they may interfere with alignment and eventually with downstream analysis such as peak calling.

Based on the quality of data:

1. Remove adapters/over-represented sequences from the ChIP-seq data using cutadapt, supplying the adapter sequences used during sequencing.
2. Quality/end trimming improves the overall quality of the reads; Trimmomatic can be used for this step.
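The two steps above can be sketched in plain Python. This is a deliberately simplified illustration and not how cutadapt or Trimmomatic work internally: the adapter match here is exact (real tools tolerate mismatches and partial adapters), and the window size and quality threshold are assumed values. Both function names are hypothetical.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of a 3' adapter."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def sliding_window_trim(seq, qual, window=4, min_mean_q=20, offset=33):
    """Cut the read where the mean quality of a window first drops too low.

    Assumes Phred+33 qualities. Reads shorter than the window are
    returned unchanged.
    """
    scores = [ord(c) - offset for c in qual]
    for start in range(max(len(scores) - window + 1, 0)):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual
```

Reads that become very short after trimming are usually discarded before alignment; the minimum length to keep is a project-specific choice.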

4. Alignment

A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

In reference-based ChIP-seq, reads are aligned to a reference genome using MAQ, BWA or Bowtie2.
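As one possible way to script this step, the sketch below assembles a Bowtie2 command line for single-end reads using the standard `-x` (index prefix), `-U` (unpaired reads), `-S` (SAM output) and `-p` (threads) options. The function name and all paths are hypothetical placeholders, and the block only builds the command; it does not run the aligner.

```python
def bowtie2_command(index_prefix, fastq_path, sam_path, threads=4):
    """Build the argument list for a single-end Bowtie2 alignment."""
    return [
        "bowtie2",
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # prefix of a prebuilt Bowtie2 index
        "-U", fastq_path,     # unpaired (single-end) reads
        "-S", sam_path,       # output alignments in SAM format
    ]

# To actually run the alignment (requires Bowtie2 to be installed and an
# index built for the reference genome), the list can be passed to
# subprocess.run(cmd, check=True).
```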

5. Quality metrics of mapped reads

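Quality control at this stage is important for assessing ChIP-seq data. Here we examine the aligned reads for the non-redundant fraction (NRF), that is, the number of distinct mapped positions divided by the total number of mapped reads, and for low-complexity peaks.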
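The NRF can be computed directly from the mapped read positions. In the minimal sketch below, each alignment is assumed to be given as a `(chromosome, start, strand)` tuple that has already been filtered to uniquely mapped reads; the function name is hypothetical.

```python
def nonredundant_fraction(alignments):
    """Return distinct mapped positions divided by total mapped reads.

    `alignments` is an iterable of (chrom, start, strand) tuples.
    """
    total = 0
    distinct = set()
    for chrom, start, strand in alignments:
        total += 1
        distinct.add((chrom, start, strand))
    if total == 0:
        raise ValueError("no alignments supplied")
    return len(distinct) / total
```

Values close to 1 indicate a complex library, whereas low values suggest that many reads are duplicates of the same fragments.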

6. Peak calling

Peak calling is a computational method used to identify regions of a genome that are enriched in aligned reads as a consequence of performing a ChIP-seq or MeDIP-seq experiment. These regions are where a protein interacts with DNA. When the protein is a transcription factor, the enriched region is its transcription factor binding site (TFBS). Popular software programs include MACS, spp, Rbrads and BayesPeak.
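The underlying idea of enrichment testing can be shown with a toy example. The sketch below counts read starts in fixed-size windows and flags windows whose count is unlikely under a single genome-wide Poisson background. This is a simplification for illustration only and is not the algorithm used by MACS or the other tools named above, which model local background, fragment size and control samples. The window size, p-value cutoff and function names are assumptions.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), via the lower tail sum."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(read_starts, genome_size, window=200, p_cutoff=1e-5):
    """Return (start, end, read_count, p_value) for enriched windows.

    `read_starts` is a list of 0-based read start positions on a single
    sequence of length `genome_size`.
    """
    counts = Counter(pos // window for pos in read_starts)
    n_windows = math.ceil(genome_size / window)
    lam = len(read_starts) / n_windows  # expected reads per window
    peaks = []
    for w in sorted(counts):
        p_value = poisson_sf(counts[w], lam)
        if p_value < p_cutoff:
            end = min((w + 1) * window, genome_size)
            peaks.append((w * window, end, counts[w], p_value))
    return peaks
```

Because very small tail probabilities are obtained by subtracting from 1, the reported p-values lose precision below roughly 1e-15; the sketch is meant to show the logic rather than to give exact significance values.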

7. Motif analysis

A sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids which may not be adjacent. HOMER and ChIPMunk are commonly used tools for motif discovery.
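A very simple form of motif discovery is to look for short words (k-mers) that occur more often in peak sequences than in background sequences. The sketch below illustrates that idea only; it is not the method implemented by HOMER or ChIPMunk, it ignores reverse complements, and the function names, pseudocount and default k are assumptions.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(peak_seqs, background_seqs, k=6, pseudocount=1.0, top=5):
    """Rank k-mers by their peak-to-background frequency ratio."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for kmer, count in fg.items():
        # The pseudocount avoids division by zero for unseen k-mers
        fg_freq = (count + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]
```

In practice the peak sequences would be extracted from a fixed window around each peak summit, and the background from matched genomic regions without peaks.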

8. Peak annotations

Peak annotation includes functions to retrieve the sequences around each peak, obtain enriched Gene Ontology (GO) terms, and find the nearest gene, exon, miRNA or custom features such as the most conserved elements and other transcription factor binding sites supplied by users. PAVIS and ChIPpeakAnno can be used to annotate the identified peaks.
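The nearest-gene lookup can be illustrated with a short Python sketch. It assumes peaks are `(chromosome, start, end)` tuples and genes are `(name, chromosome, TSS position)` tuples, and it measures distance from the peak midpoint to the transcription start site. These conventions and the function name are assumptions for illustration; PAVIS and ChIPpeakAnno offer more options, such as strand-aware and feature-specific annotation.

```python
def nearest_gene(peak, genes):
    """Return (gene_name, distance) for the gene whose TSS is closest.

    Distance is measured from the peak midpoint to the TSS. Returns None
    if no gene lies on the same chromosome as the peak.
    """
    chrom, start, end = peak
    midpoint = (start + end) // 2
    best = None
    for name, gene_chrom, tss in genes:
        if gene_chrom != chrom:
            continue
        distance = abs(tss - midpoint)
        if best is None or distance < best[1]:
            best = (name, distance)
    return best
```

For large gene sets, a sorted list of TSS positions per chromosome with binary search would avoid scanning every gene for every peak.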