Background

DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology used to determine the order of the four bases (adenine, guanine, cytosine and thymine) in a strand of DNA. The advent of rapid DNA sequencing methods has enabled variant discovery (SNPs, insertions, deletions, CNVs, etc.), which has greatly accelerated biological and medical research.

Figure: DNA seq analysis workflow

1. Raw data statistics

Once you provide the raw data, we will report basic statistics, including the number of reads, genome coverage (x) and base distribution.
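As a rough illustration of how such statistics can be derived, the sketch below counts reads and bases from FASTQ text and estimates coverage as total bases divided by genome size. The function name `fastq_stats`, the `genome_size` parameter and the assumption of unwrapped four-line FASTQ records are ours for illustration and are not part of the described pipeline.

```python
from collections import Counter


def fastq_stats(lines, genome_size):
    """Summarise FASTQ records supplied as an iterable of text lines.

    Assumes standard four-line records (header, sequence, '+', quality).
    """
    n_reads = 0
    total_bases = 0
    base_counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # second line of every record holds the sequence
            seq = line.strip().upper()
            n_reads += 1
            total_bases += len(seq)
            base_counts.update(seq)
    # Approximate depth of coverage: sequenced bases / genome length
    coverage = total_bases / genome_size
    base_fraction = (
        {b: c / total_bases for b, c in base_counts.items()} if total_bases else {}
    )
    return {
        "reads": n_reads,
        "bases": total_bases,
        "coverage_x": coverage,
        "base_fraction": base_fraction,
    }


# Toy input: two 4 bp reads against a hypothetical 4 bp genome
example = ["@r1", "ACGT", "+", "IIII", "@r2", "AACC", "+", "IIII"]
stats = fastq_stats(example, genome_size=4)
print(stats["reads"], stats["coverage_x"])
```

For real data the lines would typically be streamed from a (possibly gzip-compressed) FASTQ file rather than held in a list.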

2. Quality check

Quality control checks on raw sequence data from high-throughput sequencing give a quick impression of whether your data has any problems you should be aware of before doing further analysis. FastQC will be used to assess the quality of the data provided. We typically check for adapter contamination, read quality and other sequencing biases.
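To give a sense of what a read-quality check measures, here is a minimal sketch that computes the mean Phred score at each read position. It is not FastQC's implementation; the function name `mean_quality_per_position` is hypothetical, and Phred+33 quality encoding is assumed.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    Assumes Phred+33 encoding (score = ASCII code - 33).
    Reads may have different lengths.
    """
    sums, counts = [], []
    for qual in quality_strings:
        for pos, ch in enumerate(qual):
            if pos == len(sums):  # first time this position is seen
                sums.append(0)
                counts.append(0)
            sums[pos] += ord(ch) - offset
            counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]


# 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33
print(mean_quality_per_position(["II5", "I5"]))
```

A drop in mean quality towards the 3' end of reads is a common pattern and is one reason trimming is considered in the next step.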

3. Data pre-processing

Data pre-processing is important for handling over-represented sequences and low-quality reads, as they may interfere with alignment and, eventually, with variant calling.

Based on the quality of data:

1. Remove adapters/over-represented sequences from the DNA-seq data using cutadapt, supplying the adapter sequences used during sequencing.
2. Quality/end trimming improves the overall quality of each read; Trimmomatic can be used for this step.
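The idea behind quality trimming can be sketched as a sliding-window scan: move a fixed-size window along the read and cut at the first window whose mean quality falls below a threshold. The code below is a simplified illustration of that idea, not the algorithm as implemented in any particular tool; the function name, the default window size and threshold, and Phred+33 encoding are assumptions.

```python
def sliding_window_trim(seq, qual, window=4, min_mean_q=20, offset=33):
    """Trim a read at the first window whose mean Phred score is too low.

    Simplified sketch: the read is cut at the start of the first failing
    window. Reads shorter than the window are returned unchanged.
    """
    scores = [ord(c) - offset for c in qual]
    cut = len(seq)
    for start in range(max(len(seq) - window + 1, 0)):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_q:
            cut = start
            break
    return seq[:cut], qual[:cut]


# High-quality start ('I' = Phred 40) followed by a poor tail ('#' = Phred 2)
trimmed_seq, trimmed_qual = sliding_window_trim("ACGTACGT", "IIII####")
print(trimmed_seq, trimmed_qual)
```

Cutting at the start of the failing window is deliberately conservative and can discard one or two acceptable bases; production tools refine this behaviour.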

4. Alignment

A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

In reference-based DNA-seq, reads are aligned to a reference genome using BWA or Bowtie2.
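Aligners of this kind write their results in SAM format, from which a basic alignment summary can be computed. The sketch below counts primary records and how many of them are mapped, using the standard SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name `alignment_summary` is hypothetical, and in practice an established tool would normally be used for this.

```python
def alignment_summary(sam_lines):
    """Count primary SAM records and the fraction that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second column is the bitwise FLAG
        if flag & 0x900:  # skip secondary (0x100) and supplementary (0x800)
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return {
        "reads": total,
        "mapped": mapped,
        "mapped_fraction": mapped / total if total else 0.0,
    }


# Toy SAM: one mapped read, one unmapped read, one secondary alignment
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tchr2\t5\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
print(alignment_summary(sam))
```

For paired-end data each mate is a separate record, so the counts here refer to records rather than read pairs.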

5. Variant calling

Variant identification detects point mutations, small insertions and deletions (indels), structural variants and CNVs from DNA-seq data. SAMtools/GATK is generally used for SNP and indel detection. Structural variants and copy number variations (CNVs) can be identified using SoftSV/DELLY/CNVnator.

6. Filtering and annotations

Identified variants will be filtered for read depth (at least 10) and quality (Phred score of at least 30) using a custom in-house script. Filtered variants will be annotated using VEP/SnpEff/ANNOVAR.
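The in-house script itself is not shown here; the following is only a minimal sketch of what a depth and quality filter on VCF records could look like. It assumes that depth is taken from the `DP` key of the INFO column and quality from the QUAL column, and the function names are hypothetical.

```python
def passes_filter(vcf_line, min_depth=10, min_qual=30.0):
    """Return True if a VCF data line meets the depth and quality cut-offs.

    Assumes depth is stored as DP=<int> in the INFO column (column 8)
    and that QUAL (column 6) is Phred-scaled.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    info = dict(
        item.split("=", 1) for item in fields[7].split(";") if "=" in item
    )
    if qual == "." or "DP" not in info:  # missing values fail the filter
        return False
    return float(qual) >= min_qual and int(info["DP"]) >= min_depth


def filter_vcf(lines, min_depth=10, min_qual=30.0):
    """Keep header lines and the data lines that pass the filter."""
    return [
        line
        for line in lines
        if line.startswith("#") or passes_filter(line, min_depth, min_qual)
    ]


vcf = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=25;AF=0.5",  # passes both cut-offs
    "chr1\t200\t.\tC\tT\t12\tPASS\tDP=40",         # QUAL below 30
    "chr1\t300\t.\tG\tA\t60\tPASS\tDP=5",          # depth below 10
]
print(len(filter_vcf(vcf)))
```

Whether depth should come from the site-level INFO field or from per-sample FORMAT fields depends on the caller and the study design, so this choice would need to be confirmed for a real dataset.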

7. Deliverables

A brief report including raw data statistics, an alignment summary and key findings will be provided, along with tab-delimited annotation files, as the final results.

Upon request, we can also share intermediate files and other information, and perform additional analyses if required.