Illumina high-throughput sequencing
High-throughput sequencing of DNA is achieved by massive parallelisation of the sequencing process (e.g. over a billion fragments are sequenced simultaneously on the Illumina platform). Differences in error rate, throughput and cost are the main factors that distinguish the available sequencing platforms. Please find more information here.
The main criteria on which you can compare the platforms are:
- single molecule or not: most platforms are not capable of sequencing a single molecule but instead require a PCR amplification of each fragment to be sequenced. PCR amplification is not bias free and certain sequences will amplify better than others.
- read length: typically the quality of the base calls deteriorates as read length increases (or no base call is made at all). Once a base call cannot be made or the quality becomes too low, there is no point in extending read length.
- single read or paired end: fragments can be read from one end only (single read) or from both ends (paired end)
- accuracy: the starting quality differs between platforms. For example, the first base calls in an Illumina read can be as high as Q40 (equivalent to only a 1/10,000 chance of an error), deteriorating to approximately Q30 (a 1/1,000 chance of an error) at the 100th base. Most other platforms start at a lower quality level for the first base of each read, but experience a smaller relative decline in quality along the read.
- sequence content base distribution (GC content): unbalanced sequence base distribution can affect fragment sequencing
- sequencing of homopolymers: certain platforms experience issues with correctly base calling a sequence of identical bases (either "inserting" or "deleting" bases)
- yield/cost: platforms have quite substantial differences in the number of fragments that can be sequenced in parallel (one run), how long this takes, and how much it costs.
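The Q values quoted under "accuracy" follow the standard Phred scale, Q = -10 * log10(p), where p is the probability that a base call is wrong. The short Python sketch below shows the conversion in both directions and sums the per-base error probabilities of a read. The function names are illustrative only and are not taken from any particular sequencing toolkit.

```python
import math


def phred_to_error_probability(q):
    """Return the probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Return the Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Return the expected number of wrong base calls in one read.

    `qualities` is a sequence of per-base Phred scores; the expected
    error count is the sum of the per-base error probabilities.
    """
    return sum(phred_to_error_probability(q) for q in qualities)
```

With these definitions, Q40 corresponds to an error probability of about 0.0001 (1 in 10,000) and Q30 to about 0.001 (1 in 1,000), matching the figures given above.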
High-throughput sequencing has many potential application areas, including:
- De novo genome or transcriptome assembly: Projects of this type assume little or no prior knowledge of the genome or transcriptome in question. Overlaps between the (paired end) reads are used to build consensus sequences, called contigs, representing reconstructed pieces of the genome. These can be ordered and oriented with the use of mate pairs (long-insert libraries) to generate scaffolds. For transcripts, isoforms/splice variants can be detected with the right data. Requirements for this application are long reads, high coverage, and a combination of short-insert and long-insert libraries. The computation is memory- and CPU-intensive. The NSC can help in determining the best strategy for a successful de novo project; please inquire.
- Amplicon sequencing: For diversity analysis or resequencing restricted portions of the genome in many samples. Amplicon size should be matched with the read length such that the whole amplicon can be read. Commonly, barcodes are used to pool amplicons from different samples. The NSC has guides for designing amplicon projects; please inquire. Requirements are pure PCR products with as few short fragments (primers, primer-dimers) as possible.
- Whole genome or exome re-sequencing: The goal in such projects is to sequence a sample from an organism for which a reference genome is available (thus "re-sequencing") and to map the reads from the sample to this reference. Following mapping, it is possible to bioinformatically characterise the variation between the sample and the reference. Since the greatest number of functionally important variants are typically located within protein coding regions, one often restricts sequencing to the exome by applying sequence capture/enrichment technologies. The key requirements of the sequencing technology for this application are high accuracy and good coverage, in order to correctly characterise the variants in the sample. Read length is less critical as the reads are mapped to the reference, and it can be pointless to have reads longer than the target regions. Nevertheless, longer reads can be useful by enabling a more precise characterisation of insertions and deletions in the sample and by facilitating phasing of variants.
- Bisulfite sequencing: requires high coverage of the target region (e.g. reduced representation bisulfite sequencing, RRBS).
- Small RNA sequencing: micro RNAs are 21-23 nt and even the larger "small RNAs" are not much longer than 30 nt, thus long reads are not required for their study. The goal of many small RNA studies is either discovery of novel molecules or detection of differential expression between samples. It is thus important to sequence to a sufficient depth to be able to discover lowly expressed small RNAs or to detect them as differentially expressed. In addition, small RNAs can be difficult to isolate from total RNA, so a certain fraction of sequencing is always wasted on off-target sequences, adding to the yield requirements of the sequencing technology. Illumina technology together with multiplexing is usually the chosen strategy for small RNA sequencing.
- RNA sequencing: read length is not critical if a high quality annotated genome or transcriptome is already available, as reads can primarily be used as tags and "counted" per gene. However, for detection of splice junctions, longer read lengths are required. Good coverage is required for power in detecting differential expression between samples, particularly for lowly expressed genes.
- Chromatin Immuno-precipitation (ChIP) sequencing: reads are used as tags, so short read lengths are sufficient; however, coverage is important and must be tailored to the size of the pulled-down region (e.g. widespread or promoter-specific histone marks, transcription factors). Single end reads are normally sufficient, but paired-end or tag strategies should be considered if duplicate reads prove to be problematic (see for example http://www.pnas.org/content/109/4/1347.abstract?sid=a14a0477-693f-43f9-8375-b6a5f19118af). Optimising the size of input chromatin is important to obtain good resolution of the technique whilst avoiding bias. However, perhaps the most critical consideration is using an antibody tested and shown to be specific for the target protein. Note that ChIP destined for sequencing must not be performed with salmon sperm or other nucleic acid blocking agents.
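Several of the applications above depend on "coverage". As a rough planning aid, mean coverage can be estimated as the total number of sequenced bases divided by the size of the target (genome, exome, or amplicon set). The Python sketch below makes this arithmetic explicit. It assumes that every read maps to the target and that reads are spread evenly, which real libraries never fully achieve (duplicates, off-target reads and GC bias all reduce usable coverage), so the results are best read as optimistic estimates. The function names are illustrative only.

```python
import math


def mean_coverage(num_fragments, read_length, target_size, paired=False):
    """Estimate mean coverage as sequenced bases divided by target size.

    num_fragments: number of fragments (clusters) sequenced
    read_length:   length of each read in bases
    target_size:   size of the genome or target region in bases
    paired:        True if each fragment is read from both ends
    """
    reads_per_fragment = 2 if paired else 1
    return num_fragments * read_length * reads_per_fragment / target_size


def fragments_needed(coverage, read_length, target_size, paired=False):
    """Estimate how many fragments are needed to reach a mean coverage."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(coverage * target_size / bases_per_fragment)
```

For example, under these assumptions a 3 Mb target sequenced to 30x with paired-end 150-base reads needs 300,000 fragments, since 30 x 3,000,000 / (2 x 150) = 300,000.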