The most direct way to detect variation in a genome under a disease condition is to examine the sequence itself, i.e. the positions of the bases A, T, G and C. The goal of this approach is to identify genetic variants that alter protein sequences, and to do so at a much lower cost than whole-genome sequencing. Since such variants can be responsible for both Mendelian and common polygenic diseases, such as Alzheimer's disease, whole exome sequencing is used both in academic research and as a clinical diagnostic.
Figure: Exome sequencing workflow
Exome data are often generated with a paired-end approach, which means that sequences are read from both ends of the same molecule. Because the molecules are longer than twice the read length of a typical Illumina run, these sequences do not overlap; the unsequenced stretch between them is referred to as the "insert sequence". The paired-end layout also helps to detect PCR duplicates.
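The paired-end geometry can be illustrated with a toy calculation; the fragment and read lengths below are made-up example values, not fixed properties of any platform:

```python
# Toy illustration of paired-end geometry (hypothetical numbers).
fragment_length = 300   # total length of the sequenced DNA molecule (bp)
read_length = 100       # length of each of the two Illumina reads (bp)

# The unsequenced stretch between the two reads is what remains of
# the fragment after both ends have been read.
inner_distance = fragment_length - 2 * read_length
print(inner_distance)  # 100 bp left unsequenced between the mates
```

Because the fragment is longer than the two reads combined, the reads cannot overlap, which is exactly the situation described above.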
Sequence data from a paired-end run come in two files: one stores all the forward sequences and the other stores the reverse sequences, in exactly the same order as the first file.
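Because the two files are ordered identically, mates can be processed in lockstep by reading both files four lines (one FASTQ record) at a time. A minimal sketch with made-up read data:

```python
import io
from itertools import islice

def fastq_records(handle):
    """Yield (header, sequence, quality) tuples from a FASTQ stream."""
    while True:
        chunk = list(islice(handle, 4))
        if len(chunk) < 4:
            return
        header, seq, _plus, qual = (line.rstrip("\n") for line in chunk)
        yield header, seq, qual

# In-memory stand-ins for the forward and reverse FASTQ files
# (one hypothetical read pair).
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@read1/2\nTTGC\n+\nIIII\n")

for fwd, rev in zip(fastq_records(r1), fastq_records(r2)):
    # Mates occupy the same position in both files, so their headers
    # should agree up to the /1 and /2 suffix.
    assert fwd[0].split("/")[0] == rev[0].split("/")[0]
```

Real files would be opened with `open()` (or `gzip.open()` for compressed FASTQ), but the pairing logic is the same.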
To inspect the quality of the FastQ files, the FastQC package is commonly used; it displays helpful graphics such as read-length plots, read-quality plots, sequence-duplication levels and many more.
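The per-base quality plots that FastQC draws are built from the quality strings in the FASTQ file, which in modern Illumina data encode Phred scores as ASCII characters offset by 33. A small sketch of that decoding step:

```python
def phred33_to_scores(qual_string):
    """Convert an Illumina Phred+33 quality string to integer scores."""
    return [ord(ch) - 33 for ch in qual_string]

# 'I' encodes Q40, '5' encodes Q20 and '!' encodes Q0.
scores = phred33_to_scores("II5!")
print(scores)  # [40, 40, 20, 0]
```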
For analyzing sequence data, alignment to a reference genome is necessary. For humans there are several options for retrieving the reference sequence, the most important being UCSC and Ensembl.
Different read aligners are used for aligning the sequences to the human genome. BWA, Bowtie and similar tools are ultrafast, memory-efficient programs for aligning sequencing reads to long reference sequences; they require very little memory, work well with Illumina data and can run on several threads. Before alignment, the reference genome needs to be indexed and transformed using the aligner's indexing program, which may take a while. The result of the alignment is a SAM file, which is needed for the later analysis steps.
The SAM file is the starting point for obtaining a much more powerful file type: the Binary Alignment/Map format (BAM for short). It compresses the SAM file and can be indexed, so that portions of the file can be accessed without loading the whole file. A typical exome BAM file takes about 4-10 GB of disk space, while the corresponding SAM file needs 20-30 GB.
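Before conversion, a SAM record is plain tab-separated text with eleven mandatory fields. A minimal parsing sketch, using a made-up alignment line:

```python
# One hypothetical SAM alignment line (mandatory fields only).
sam_line = "read1\t99\tchr1\t10468\t60\t100M\t=\t10668\t300\tACGT\tIIII"

fields = sam_line.split("\t")
record = {
    "qname": fields[0],      # read name
    "flag": int(fields[1]),  # bitwise flag describing the alignment
    "rname": fields[2],      # reference sequence name
    "pos": int(fields[3]),   # 1-based leftmost mapping position
    "mapq": int(fields[4]),  # mapping quality
    "cigar": fields[5],      # CIGAR string
}

# Bit 0x1 of the flag marks a read that is part of a pair.
print(record["flag"] & 0x1)  # 1 -> this read is paired
```

The BAM format stores the same fields in a compressed binary layout, which is why it is so much smaller and why it can be indexed for random access.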
For converting the SAM file to BAM we use Picard, which offers many options for manipulating and viewing SAM and BAM files.
Because the library is amplified by PCR before sequencing, some reads will be exact copies of each other; these are called PCR duplicates. They share the same sequence and the same alignment position and can cause trouble during SNP calling, since an allele may be overrepresented purely because of amplification bias. Duplicate removal works better for paired-end data, because the probability that both ends of two independent molecules align to the same positions with identical sequences by chance is very low (but not impossible).
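The core idea behind duplicate marking can be sketched by grouping read pairs on their alignment coordinates; the read names and positions below are invented for illustration, and real tools such as Picard's MarkDuplicates additionally consider strand, quality and more:

```python
from collections import defaultdict

# Hypothetical paired-end alignments: (read_name, chrom, pos, mate_pos).
alignments = [
    ("r1", "chr1", 100, 400),
    ("r2", "chr1", 100, 400),   # same coordinates as r1 -> likely PCR duplicate
    ("r3", "chr1", 250, 550),
]

groups = defaultdict(list)
for name, chrom, pos, mate_pos in alignments:
    # Pairs whose two ends map to identical start coordinates are
    # grouped together; only one representative per group is kept.
    groups[(chrom, pos, mate_pos)].append(name)

kept = [reads[0] for reads in groups.values()]
print(sorted(kept))  # ['r1', 'r3']
```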
Indels within reads often lead to false positive SNP calls, especially near the ends of reads. To prevent this artifact, local realignment around indels is performed with tools from the Genome Analysis Toolkit (GATK). This is done in two steps: the first creates a table of possible indel sites, and the second realigns reads around those targets.
When using paired-end data, the mate information must be fixed afterwards, as alignments may change during the realignment process.
The base quality scores reported by the sequencer are not always accurate, and since good SNP calls rely on them, these scores need to be recalibrated. This is done in two steps: "Count Covariates" and "Table Recalibration".
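Recalibration matters because a Phred quality score Q is a direct claim about the error probability of a base call, via P = 10^(-Q/10). A quick sketch of that relationship:

```python
def phred_error_probability(q):
    """A Phred score Q encodes a base-call error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Q20 claims 1 error in 100 bases; Q30 claims 1 in 1000.
print(phred_error_probability(20))
print(phred_error_probability(30))
```

If the sequencer systematically reports, say, Q30 for bases that are wrong 1 time in 100, downstream SNP callers will be overconfident; recalibration corrects the scores so they match the observed error rates.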
SNP calling is done with the GATK, which calls SNPs and short indels at the same time and outputs a well-annotated VCF file.
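Each data line of the resulting VCF file is tab-separated; the values below are made up, but the column layout is standard. A minimal parsing sketch:

```python
# One hypothetical VCF data line (fixed columns CHROM..INFO).
vcf_line = "chr1\t10468\t.\tA\tG\t87.3\tPASS\tDP=34"

chrom, pos, vid, ref, alt, qual, filt, info = vcf_line.split("\t")

# QUAL is a Phred-scaled confidence that the variant is real,
# and FILTER records which filters the call passed.
print(chrom, pos, ref, ">", alt, "quality:", float(qual), "filter:", filt)
```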
During SNP calling, each SNP is assigned a variant quality score that reflects the probability of the call being wrong. The score calculated during the calling step is only a rough estimate; the variant recalibration step determines a more accurate quality score, which can be used to filter out possible artifacts more reliably. However, variant quality score recalibration requires a set of known sites to estimate the parameters of the Gaussian mixture model it uses.
The filtering scheme is the one recommended by the GATK team. A SNP that passes all filters is not necessarily a true call, and a filtered-out SNP is not necessarily a sequencing artifact, but the filters give clues about possible reasons why a call could be wrong.
The first filter is the SNP cluster filter; the second marks SNPs as HARD_TO_VALIDATE.
Visualization is a convenient way to present exome data. Moreover, visual inspection of SNPs often helps to distinguish sequencing artifacts from true variants, and comparing datasets from different samples at the same site becomes very easy.
For exome data, most reads should align to exonic sequences, thanks to the targeted enrichment of those regions.
Visualization of datasets helps to analyse: