Quality control for Illumina output

In addition to producing the sequence files, we perform a number of quality control tests. It is important to note that this quality control is primarily a check of the quantity and error rate of the sequencing data. It does not tell you whether you actually sequenced what you intended to sequence. The Per Base Sequence Content and Sequence Duplication Levels plots do give some indication of the nature of the fragments that you sequenced, but an alignment to a reference is needed for a more definitive answer.

The features we test are:

  • Per Base Sequence Quality
  • Per Sequence Quality Scores
  • Per Base Sequence Content
  • Per Base N Content
  • Sequence Duplication Levels



This information is saved in a PDF file with the following name format: runName.lane.sampleName.readNumber.qc.pdf. (Older quality control reports are named QCreport_runName_lane.pdf.)

  • Per Base Sequence Quality

The graph shows an overview of the range of quality scores across all bases at each position in the FASTQ file. The y-axis shows quality scores and the x-axis shows the read position. For each read position, a boxplot shows the distribution of quality scores across all reads. The yellow boxes represent quality scores within the inter-quartile range (25% - 75%). The lower and upper whiskers represent the 10% and 90% points. The central red line shows the median of the quality values and the blue line shows the mean of the quality values.
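The statistics behind each box can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the report: the function names are hypothetical, the quality strings are assumed to use the Phred+33 ASCII encoding, and the percentiles use simple linear interpolation, which may differ slightly from the method used by the reporting tool.

```python
def percentile(sorted_values, fraction):
    """Percentile of an already sorted list, using linear interpolation."""
    if not sorted_values:
        raise ValueError("percentile of an empty list is undefined")
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def per_base_quality_summary(quality_strings, offset=33):
    """Summarise quality scores at each read position.

    quality_strings: the quality lines of FASTQ records.
    offset: ASCII offset of the encoding (33 is an assumption here).
    """
    # Collect the scores seen at each read position (column).
    columns = []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(columns):
                columns.append([])
            columns[position].append(ord(char) - offset)

    summary = []
    for scores in columns:
        scores.sort()
        summary.append({
            "p10": percentile(scores, 0.10),     # lower whisker
            "q1": percentile(scores, 0.25),      # bottom of the box
            "median": percentile(scores, 0.50),  # red line
            "q3": percentile(scores, 0.75),      # top of the box
            "p90": percentile(scores, 0.90),     # upper whisker
            "mean": sum(scores) / len(scores),   # blue line
        })
    return summary
```

Each dictionary in the returned list corresponds to one read position on the x-axis of the plot.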

As a rule of thumb, a quality score of 30 indicates a 1 in 1000 probability of error and a quality score of 20 indicates a 1 in 100 probability of error (see the FASTQ section on this webpage for further details). The higher the score, the better the base call. You will see from the plots that the quality of the base calling deteriorates along the read (as is typical of Illumina sequencing). Normally, the first 36 bases should have a median and mean quality score above 20.
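The two values quoted above follow from the standard Phred relationship, Q = -10 * log10(P), where P is the probability that the base call is wrong. The sketch below shows the conversion in both directions. The function names are illustrative, and the decoding helper assumes Phred+33 encoded quality strings; files from older pipelines may use a different offset.

```python
import math


def phred_to_error_probability(quality):
    """Probability that a base call with the given Phred score is wrong."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred score corresponding to a given error probability."""
    return -10 * math.log10(probability)


def decode_quality_string(quality_string, offset=33):
    """Convert a FASTQ quality line into a list of Phred scores.

    The offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(char) - offset for char in quality_string]
```

For example, `phred_to_error_probability(30)` is approximately 0.001 (1 in 1000) and `phred_to_error_probability(20)` is approximately 0.01 (1 in 100).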

  • Per Sequence Quality Scores

The graph is generated by computing the average quality of each read (by averaging across read positions) and then plotting the distribution of these averages. It thus enables you to tell whether low quality bases are concentrated in a subset of the reads or distributed across all reads. The y-axis shows the number of reads and the x-axis shows the mean quality score. It is often the case that a subset of sequences will have low mean quality scores. However, these should represent only a small percentage of the total sequences. Normally, the highest peak in the distribution should lie at a score higher than 20 on the x-axis.
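The calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than the report's own code: the function names are hypothetical, quality strings are assumed to be Phred+33 encoded, and mean scores are binned by rounding down to an integer.

```python
from collections import Counter


def mean_quality_histogram(quality_strings, offset=33):
    """Count reads per mean quality score (rounded down to an integer)."""
    histogram = Counter()
    for quality in quality_strings:
        if not quality:
            continue  # skip empty records
        scores = [ord(char) - offset for char in quality]
        mean_score = sum(scores) / len(scores)
        histogram[int(mean_score)] += 1
    return histogram


def modal_mean_quality(histogram):
    """Mean quality score with the most reads (the highest peak)."""
    return max(histogram, key=histogram.get)
```

Checking that `modal_mean_quality(...)` is above 20 mirrors the rule given above for the highest peak.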

  • Per Base Sequence Content

The graph plots the percentage of each base type at each read position. The y-axis shows the percentage of a base type and the x-axis shows the read position. In a random sample, the four base types should be evenly distributed along the read, so a graph with four smooth lines at around 25% is expected. However, some samples are inherently biased, for example:
* miRNA libraries (the curves reflect the sequences of the dominant miRNAs)
* libraries prepared for bisulfite sequencing (the percentage of “C” is much lower than that of the other base types)

The graph is therefore useful for indicating the sequence pattern of the sample, such as whether bisulfite conversion worked well or whether the sample still contains a large amount of adaptor sequence. However, it gives only an indication of the sequence pattern. It is always wise to check the actual sequences for details.
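To inspect the actual sequences in the same terms as the plot, the per-position base percentages can be computed directly from the reads. The sketch below is illustrative only: the function name is hypothetical, and it assumes that percentages are taken over called bases (A, C, G, T), so Ns are left out of the denominator. The report itself may handle Ns differently.

```python
from collections import Counter


def per_base_content(reads):
    """Percentage of A, C, G and T at each read position.

    Ns are excluded from the denominator (an assumption of this sketch).
    """
    # One Counter of base counts per read position.
    counts = []
    for read in reads:
        for position, base in enumerate(read.upper()):
            if position == len(counts):
                counts.append(Counter())
            counts[position][base] += 1

    percentages = []
    for column in counts:
        called = sum(column[base] for base in "ACGT")
        percentages.append({
            base: (100.0 * column[base] / called if called else 0.0)
            for base in "ACGT"
        })
    return percentages
```

In a bisulfite library, for instance, the "C" entry would be expected to stay well below the other three at most positions.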

  • Per Base N Content

The graph shows the percentage of base calls at each position for which an N was called. If a sequencer is unable to make a base call with sufficient confidence, it will normally call an N rather than A, T, G or C. The y-axis shows the percentage of Ns among all reads and the x-axis shows the read position. It is common to see a very low percentage of Ns near the end of a sequence. Normally, the percentage of Ns at each read position should be lower than 20%.
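A minimal sketch of this calculation is shown below, together with a helper that lists positions at or above the 20% level mentioned above. The function names are hypothetical, and positions are reported as 0-based indices.

```python
def per_base_n_percentage(reads):
    """Percentage of reads with an N at each read position."""
    n_counts = []
    totals = []
    for read in reads:
        for position, base in enumerate(read):
            if position == len(totals):
                totals.append(0)
                n_counts.append(0)
            totals[position] += 1
            if base in "Nn":
                n_counts[position] += 1
    return [100.0 * n / total for n, total in zip(n_counts, totals)]


def positions_at_or_above(percentages, threshold=20.0):
    """0-based read positions whose N percentage reaches the threshold."""
    return [i for i, value in enumerate(percentages) if value >= threshold]
```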

  • Sequence Duplication Levels

The graph shows the number of sequences with different degrees of duplication (indicated on the x-axis) relative to the number of unique sequences (which is set to 100%). In a diverse library, most sequences will occur only once in the final set and the graph will show a peak in the unique category. However, some sequences may be present in more than one copy (for example, as the result of PCR amplification), in which case the graph may show high numbers of sequences in the other categories (2 copies, 3 copies, etc.). The last category collects all sequences with 10 copies or more, so it is normal to see a small rise there. The graph is useful for indicating whether the sample contains a large amount of PCR duplicates. In a miRNA library, since a few miRNAs dominate the sample, it is common to see a big rise in the last category.
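The binning described above can be sketched as follows. This is an illustration under assumptions, not the report's implementation: the function names are hypothetical, two reads count as duplicates only if they are identical over their full length, and every read is examined (a reporting tool may sample only part of the file).

```python
from collections import Counter


def duplication_levels(reads, max_level=10):
    """Number of distinct sequences seen 1, 2, ... max_level-or-more times."""
    copies_per_sequence = Counter(reads)
    levels = {level: 0 for level in range(1, max_level + 1)}
    for copies in copies_per_sequence.values():
        # Everything with max_level copies or more goes in the last category.
        levels[min(copies, max_level)] += 1
    return levels


def relative_to_unique(levels):
    """Express each level as a percentage of the unique (1 copy) category."""
    unique = levels[1]
    if unique == 0:
        raise ValueError("no sequence occurs exactly once")
    return {level: 100.0 * count / unique for level, count in levels.items()}
```

The unique category is always 100% in the output of `relative_to_unique`, which matches the scaling of the plot.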



Published June 20, 2012 11:25 AM - Last modified Oct. 1, 2014 12:04 PM