Mastering Bulk RNA-seq: Plan, Analyse and Interpret

FASTQ files

6.4 ‘Raw’ short reads

Your sequencing provider will supply a set of FASTQ files (usually with the extension .fq.gz or .fastq.gz). These are gzipped text files containing the raw sequencing reads.

The size of these files will vary depending on the library size / sequencing depth, the read length and whether you have single-end or paired-end reads.

A typical gzipped RNA-seq FASTQ file might contain 20 million 150 bp reads and take up ~1.5 gigabytes (twice that in total for paired-end reads, which come as two files).
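As a rough back-of-envelope check on that figure, the sketch below estimates file size from the read count and read length. The 60-byte header length and the 4.5x gzip ratio are assumptions picked for illustration only - real headers and compression ratios vary with the instrument and with quality score binning.

```python
def estimate_fastq_size(n_reads, read_length, header_bytes=60, gzip_ratio=4.5):
    """Return (uncompressed_bytes, compressed_bytes) for one FASTQ file.

    header_bytes and gzip_ratio are illustrative assumptions, not measured values.
    """
    # One record = header line + sequence line + '+' line + quality line,
    # each terminated by a single newline character.
    record_bytes = (header_bytes + 1) + (read_length + 1) + 2 + (read_length + 1)
    uncompressed = n_reads * record_bytes
    compressed = uncompressed / gzip_ratio
    return uncompressed, compressed


raw, gz = estimate_fastq_size(20_000_000, 150)
print(f"uncompressed: {raw / 1e9:.1f} GB, gzipped: {gz / 1e9:.1f} GB")
# uncompressed: 7.3 GB, gzipped: 1.6 GB
```

Under these assumptions the gzipped estimate lands close to the ~1.5 gigabytes quoted above; for paired end data, run the estimate once per file.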

What’s inside a FASTQ file?

The content of a typical FASTQ file from an Illumina sequencer looks something like:

@NB551311:15:HG2F7BGX5:1:11101:14248:1120 1:N:0:TTAGGC
CCCAGACAACCAGCTATTGCCAGGTTCGATAGGTCTGTCACTTCTACCCGGAGCTCTTCCCACTCTATTGCCAC
+
AAAAAEEEEEEEEEEEE6EEEEEEEEAEEEEEEE??+=B+:ADDCD>DE9DCBC:3A,+222:):89D*:*1))

Each read has four lines:

Line 1: @ followed by the unique read name, and optionally some extra information like:
- the instrument ID
- the run ID
- the flowcell ID
- the flowcell lane
- the flowcell tile number, then the x and y coordinates within that tile
- the pair number (1 or 2)
- the filter flag (Y if the read was filtered out, N if it passed)
- control bits
- the barcode
- (sometimes a UMI sequence is also incorporated in here)
Line 2: The DNA sequence of the read
Line 3: A single + on a line
Line 4: The quality scores for each base (roughly, capital letters are good, !"#$%& expletives are bad). Newer instruments ‘quantize’ these scores into ‘binned’ values - this has little impact on downstream analysis but improves file compression.
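To make the header layout concrete, here is a minimal sketch that splits the example read name shown earlier into the fields listed above. The function name parse_illumina_header is hypothetical, and the code assumes exactly this Illumina-style layout (seven colon-separated fields, a space, then four more) - headers that carry a UMI, or that come from MGI instruments or SRA/ENA, will need different handling.

```python
def parse_illumina_header(header_line):
    """Split an Illumina-style FASTQ header into named fields.

    Assumes the layout of the example in this section; other layouts raise ValueError.
    """
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header lines must start with '@'")
    # The read name and the optional extra information are separated by one space.
    name, _, comment = header_line[1:].strip().partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    fields = {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
    }
    if comment:
        pair, filtered, control_bits, barcode = comment.split(":")
        fields.update(
            pair=int(pair),
            filtered=(filtered == "Y"),  # Y means the read was filtered out
            control_bits=int(control_bits),
            barcode=barcode,
        )
    return fields


header = "@NB551311:15:HG2F7BGX5:1:11101:14248:1120 1:N:0:TTAGGC"
fields = parse_illumina_header(header)
print(fields["flowcell"], fields["lane"], fields["barcode"])
# HG2F7BGX5 1 TTAGGC
```

In practice you would rarely parse headers yourself, but seeing the fields split out can help when checking which lane or barcode a read came from.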

The FASTQ @ read name is only intended as a unique identifier and was not designed as a place to store metadata. As such, there isn’t a universal standard for what goes in the FASTQ header - Illumina instruments, MGI instruments and public data from SRA/ENA all have their own variations.

Historically, several different encodings were used for the quality line - today you’ll only ever see the ‘standard’ Sanger/Phred+33 format, unless you are revisiting data generated prior to ~2011.
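As an illustration of the Phred+33 encoding, the sketch below converts the quality line from the example read into numeric scores. Each character's ASCII code minus 33 gives the Phred score Q, and Q relates to the estimated error probability p by Q = -10 * log10(p). The function names are hypothetical helpers, not part of any particular library.

```python
def phred33_scores(quality_line):
    """Convert a Phred+33 encoded quality string into a list of integer scores."""
    return [ord(char) - 33 for char in quality_line]


def error_probability(score):
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


quality = "AAAAAEEEEEEEEEEEE6EEEEEEEEAEEEEEEE??+=B+:ADDCD>DE9DCBC:3A,+222:):89D*:*1))"
scores = phred33_scores(quality)

print(scores[:6])                  # [32, 32, 32, 32, 32, 36]
print(min(scores))                 # 8
print(phred33_scores('!"#$%&'))    # [0, 1, 2, 3, 4, 5]
print(error_probability(30))       # 0.001
```

This is why capital letters read as 'good' (scores in the 30s) while punctuation characters near the start of the ASCII table correspond to very low scores.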

If your sequencing is paired end, you will have pairs of files with _R1/_R2 or _1/_2 toward the end of the filename. For single end sequencing you’ll just have _R1 or _1.
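A minimal sketch of how files could be grouped into pairs based on that naming convention. The regular expression and the example filenames are assumptions for illustration - check them against the names your provider actually uses before relying on this.

```python
import re

# Matches _R1/_R2 or _1/_2, an optional numeric suffix such as _001,
# and a .fastq.gz or .fq.gz extension at the end of the filename.
READ_TAG = re.compile(r"_R?([12])(?:_\d+)?\.(?:fastq|fq)\.gz$")


def pair_fastq_files(filenames):
    """Group filenames as {sample_prefix: {read_number: filename}}."""
    pairs = {}
    for name in sorted(filenames):
        match = READ_TAG.search(name)
        if match is None:
            continue  # not recognised as a FASTQ read file
        read_number = int(match.group(1))
        prefix = name[:match.start()]
        pairs.setdefault(prefix, {})[read_number] = name
    return pairs


# Hypothetical filenames
files = ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz", "sampleB_1.fq.gz"]
for prefix, reads in pair_fastq_files(files).items():
    layout = "paired end" if 2 in reads else "single end"
    print(prefix, layout)
# sampleA paired end
# sampleB single end
```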

You should keep a read-only copy of the original files, exactly as you receive them from the sequencing provider. Don’t rename them, don’t uncompress them, don’t open them in Notepad / Microsoft Word / WinZip etc.
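One way to act on this advice is sketched below: record a checksum for each file as received, then remove write permission. This is a suggestion rather than a required step, the function names are hypothetical, and on Windows os.chmod can only toggle the read-only flag.

```python
import hashlib
import os
import stat


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_read_only(path):
    """Remove write permission for user, group and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

If you store the checksum alongside the data, you can later confirm that a file is still byte-for-byte identical to what you received. If your provider supplies their own checksums, compare against those using whichever algorithm they used.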

Tip: You should very rarely need to ‘unzip’ a FASTQ file - almost every tool and pipeline will work directly with the compressed version. If you do need to look at the content, consider using a command-line tool like zless - this can peek inside the compressed file without needing to make an uncompressed copy.
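The same idea works from Python: the standard library gzip module can stream a compressed file line by line without writing an uncompressed copy. The sketch below is a hypothetical helper that returns the first few reads, assuming each record occupies exactly four lines as described above.

```python
import gzip
from itertools import islice


def peek_fastq(path, n_reads=2):
    """Return the first n_reads records as (header, sequence, separator, quality) tuples."""
    records = []
    # "rt" opens the gzipped file in text mode, decompressing on the fly.
    with gzip.open(path, "rt") as handle:
        while len(records) < n_reads:
            lines = [line.rstrip("\n") for line in islice(handle, 4)]
            if len(lines) < 4:
                break  # end of file, or a truncated final record
            records.append(tuple(lines))
    return records
```

Because the file is opened for reading only, this leaves the original untouched.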
