Sequencing
A Detailed Technical Review of DNA Sequencing Platforms
Illumina (SBS) · Element Biosciences / AVITI (Avidity) · Ultima Genomics (Flow SBS) · Sanger (Chain Termination) · Oxford Nanopore (Nanopore) · PacBio (SMRT) · Roche (Sequencing by Expansion)
Library Architecture, Cluster/Polony/Bead Generation, Sequencing Chemistry, Error Profiles, and Practical Considerations
Illumina Sequencing-by-Synthesis (SBS)
Illumina's platform is the most widely deployed short-read sequencer in the world. It uses a cyclic reversible termination chemistry in which all four fluorescently labeled, 3’-blocked nucleotides compete for incorporation simultaneously. After imaging, the fluorophore and blocking group are cleaved, enabling the next cycle. Below, every architectural and chemical detail is laid out.
Illumina's core SBS chemistry traces back to work by Shankar Balasubramanian and David Klenerman at the University of Cambridge in the late 1990s. The two chemists conceived the idea of sequencing DNA on a surface using fluorescent reversible terminators while brainstorming over pints at the Panton Arms pub. They founded Solexa in 1998, which was acquired by Illumina in 2007 for $600 million. As of 2025, Illumina instruments have generated more than 85% of all sequencing data ever produced worldwide. The cost of sequencing a human genome has fallen from ~$2.7 billion (Sanger-based Human Genome Project, 2003) to under $200 on modern Illumina instruments — a reduction of over seven orders of magnitude in two decades, outpacing Moore's Law.
Library Structure (Adapter Architecture)
A completed Illumina library has a strict 5’→3’ linear architecture. The canonical form is:
5’---P5---i5---Read 1 primer site---[INSERT]---Read 2 primer site---i7---P7---3’
More precisely, reading from the P5 (left) end:
- P5 flow cell binding sequence (29 nt):
5’-AATGATACGGCGACCACCGAGATCTACAC-3’
- i5 index (typically 8–10 nt, sample barcode)
- Read 1 sequencing primer binding site (~33 nt for TruSeq, ~34 nt for Nextera)
- [INSERT] — the target DNA fragment
- Read 2 sequencing primer binding site (~34 nt for TruSeq, ~34 nt for Nextera)
- i7 index (typically 8–10 nt, sample barcode)
- P7 flow cell binding sequence (24 nt):
5’-CAAGCAGAAGACGGCATACGAGAT-3’
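The component lengths listed above add up to a fixed adapter overhead on every library molecule. The following is a minimal sketch of that arithmetic, assuming 8 nt indexes and the approximate TruSeq primer-site lengths quoted above; the function names `adapter_overhead` and `library_length` are illustrative and not part of any Illumina tool.

```python
# Flow cell binding sequences exactly as given in the text.
P5 = "AATGATACGGCGACCACCGAGATCTACAC"
P7 = "CAAGCAGAAGACGGCATACGAGAT"


def adapter_overhead(i5_len=8, i7_len=8, read1_site=33, read2_site=34):
    """Total non-insert length (nt) of a dual-indexed library molecule.

    The primer-site defaults are the approximate TruSeq values from the
    list above; index lengths of 8 nt are an assumption.
    """
    return len(P5) + i5_len + read1_site + read2_site + i7_len + len(P7)


def library_length(insert_len, **kwargs):
    """Full molecule length for a given insert size."""
    return insert_len + adapter_overhead(**kwargs)


print(len(P5), len(P7))      # 29 24
print(adapter_overhead())    # 136
print(library_length(350))   # 486
```

This kind of check is handy when reading fragment-analyzer traces: the observed library peak should sit roughly one adapter overhead above the expected insert size.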
TruSeq vs. Nextera Adapter Systems
These are the two dominant Illumina adapter families. They differ in their sequencing primer binding regions but share the same P5/P7 flow cell binding sequences.
- TruSeq: Uses ligation-based library prep. Adapter oligos are "forked" (Y-shaped) with a 12 nt complementary overlap. Ligation requires A-tailing of fragment ends. The top adapter strand has a phosphorothioate-protected 3’ T overhang.
- Nextera: Uses tagmentation (Tn5 transposase). The transposome inserts adapter sequences at both ends of the fragment simultaneously, which are then extended and indexed by PCR. The Read 1 and Read 2 primer binding sequences differ from TruSeq.
Can R2 Go Next to P5/i5? (Orientation Constraints)
No — the orientation is fixed. P5 is always on the same end as i5 and the Read 1 primer site. P7 is always on the same end as i7 and the Read 2 primer site. This is because the physical flow cell lawn has two distinct oligo species (P5 and P7) grafted at fixed positions, and the sequencing workflow (Read 1 → Index 1 (i7) → Index 2 (i5) → Read 2) is hard-wired into the instrument's fluidics. Swapping R2 next to P5 would break cluster generation and all downstream read priming.
Can TruSeq and Nextera Be Combined in a Pool?
Yes, with caveats. Illumina's sequencing reagent cartridges actually contain a mixture of sequencing primers from multiple adapter families (TruSeq, Nextera, and even legacy kits). Therefore, libraries built with TruSeq adapters and libraries built with Nextera adapters can be pooled and sequenced together on the same flow cell lane. You can even mix-and-match within a single library molecule (e.g., TruSeq Read 1 site on one end and Nextera Read 2 site on the other), though this is not recommended for beginners because demultiplexing and trimming parameters differ between the two systems. The key constraint is that the P5 and P7 sequences must be full-length and intact on both ends.
Full-Length vs. Stubby-Y Adapters
There are two major physical adapter designs:
- Full-length adapters: Already contain P5/P7, index sequences, and sequencing primer sites. Used in PCR-free workflows where no further amplification is needed to complete the adapter.
- Stubby-Y (truncated) adapters: Contain only the sequencing primer binding site core but lack P5/P7 and indexes. An indexing PCR step is required after ligation to add the remaining sequences. This design offers higher ligation efficiency due to shorter oligo length, but mandates PCR amplification.
Unique Dual Indexing (UDI) vs. Combinatorial Dual Indexing (CDI)
In CDI, a small set of i5 and i7 indexes are used in all possible pairwise combinations. Any index hopping event can produce a valid (but incorrect) index pair, causing sample cross-contamination. In UDI, each sample receives a globally unique pair of i5+i7, so any hopped combination produces an invalid pair that can be computationally filtered. Illumina now strongly recommends UDI for all patterned flow cell instruments (NovaSeq, NextSeq 1000/2000, NovaSeq X) because exclusion amplification chemistry on patterned flow cells has a measurably higher rate of index hopping than bridge amplification on random flow cells.
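The difference between the two schemes can be sketched in a few lines. The example below assumes a sample sheet represented as a dictionary keyed by (i7, i5) and uses arbitrary made-up 8 nt index sequences; `classify_pair` is a hypothetical helper, not a function from any demultiplexing tool.

```python
from itertools import product


def classify_pair(i7, i5, sample_sheet):
    """Return the sample name, 'hopped', or 'unknown' for an observed pair.

    sample_sheet maps (i7, i5) tuples to sample names.
    """
    if (i7, i5) in sample_sheet:
        return sample_sheet[(i7, i5)]
    known_i7 = {pair[0] for pair in sample_sheet}
    known_i5 = {pair[1] for pair in sample_sheet}
    if i7 in known_i7 and i5 in known_i5:
        return "hopped"   # both indexes exist, but this pairing was never assigned
    return "unknown"


# Arbitrary index sequences for illustration only.
I7 = ["AACCGGTT", "TTGGCCAA"]
I5 = ["ACGTACGT", "TGCATGCA"]

# UDI: each index is used exactly once, so only matched pairs are valid.
udi = {(I7[0], I5[0]): "sample_A", (I7[1], I5[1]): "sample_B"}
# CDI: every i7 x i5 combination is assigned to a sample.
cdi = {pair: f"sample_{n}" for n, pair in enumerate(product(I7, I5))}

hopped = (I7[0], I5[1])   # i7 from sample_A paired with i5 from sample_B
print(classify_pair(*hopped, udi))   # hopped
print(classify_pair(*hopped, cdi))   # sample_1
```

Under UDI the swapped pair is recognizable and can be discarded; under CDI the same read is silently assigned to another sample.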
Cluster Generation
Bridge Amplification (Random Flow Cells)
Used on MiSeq, HiSeq 2500 (rapid run mode), and older instruments. Single-stranded library molecules are loaded onto the flow cell and hybridize to the P5 or P7 oligo lawn. The free end of each molecule folds over and hybridizes to the complementary oligo nearby, forming a "bridge." Polymerase extends, creating a double-stranded bridge. Denaturation yields two surface-tethered single strands. This cycle repeats approximately 35 times, generating a cluster of roughly 1,000 identical copies of the original template in a random physical location. Clusters are roughly 1 µm in diameter.
Exclusion Amplification (ExAmp, Patterned Flow Cells)
Used on NovaSeq 6000, NovaSeq X, NextSeq 1000/2000. Patterned flow cells have pre-etched nanowells at defined positions. Library, polymerase, and recombinase are mixed and loaded simultaneously. The first molecule to seed a nanowell is amplified so rapidly that it excludes other molecules from occupying that well (kinetic exclusion). This produces monoclonal clusters at uniform spacing, dramatically increasing cluster density and data output. However, ExAmp is more susceptible to index hopping because free library molecules are in contact with surface-bound molecules and recombinase for an extended period during amplification.
Post-Amplification Linearization
After cluster generation, reverse strands are cleaved and washed away, leaving only forward strands for Read 1 sequencing. After Read 1 and Index 1 are complete, the reverse strands are resynthesized from the forward strands by a round of bridge extension; the forward strands are then cleaved and washed away, leaving linearized reverse strands as the template for Read 2. This resynthesis step is one reason Read 2 quality is typically slightly lower than Read 1: the resynthesis introduces additional stochastic error.
Sequencing Chemistry (Cyclic Reversible Termination)
The Incorporation Cycle
Each cycle consists of four steps:
- Incorporation: All four dNTPs (dATP, dCTP, dGTP, dTTP), each labeled with a distinct fluorophore and carrying a 3’-O-azidomethyl reversible terminator, are flowed across the clusters along with DNA polymerase. Natural competition ensures correct base incorporation with minimal bias.
- Wash: Unincorporated nucleotides and polymerase are washed away.
- Imaging: Laser excitation (two wavelengths) induces fluorescence. The flow cell is imaged tile-by-tile. Each cluster emits a color corresponding to the incorporated base.
- Cleavage: Chemical treatment removes the fluorophore and the 3’ blocking group, regenerating a free 3’-OH ready for the next cycle.
4-Channel, 2-Channel, and 1-Channel Chemistry
Illumina has used three different optical encoding schemes:
- 4-channel (HiSeq 2500, HiSeq 4000): Four distinct dyes, four images per cycle. Each base has a unique emission spectrum.
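- 2-channel (NextSeq, NovaSeq 6000, NovaSeq X): Only two dye colors are used, and each base is encoded by its presence or absence in two images. In the red/green scheme, A = red+green, C = red only, T = green only, G = no label (dark). This reduces imaging time and optical complexity, but G calls rely on absence of signal, so clusters that lose signal are miscalled as G (the familiar poly-G artifact).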
- 1-channel (iSeq 100): Uses CMOS detection with a single-color scheme across two sequential images per cycle.
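The 2-channel encoding amounts to a four-entry lookup table. The sketch below assumes the red/green assignment Illumina documents for its 2-channel instruments (A in both images, C red only, T green only, G dark) and a fixed intensity threshold; the threshold and the function names are illustrative assumptions, since real base callers fit per-cluster intensity distributions.

```python
# (red_on, green_on) -> base for the red/green 2-channel scheme.
TWO_CHANNEL = {
    (True, True): "A",
    (True, False): "C",
    (False, True): "T",
    (False, False): "G",   # dark: no signal in either image
}


def call_base(red, green, threshold=0.5):
    """Call one base from two normalized intensities in the range 0..1."""
    return TWO_CHANNEL[(red >= threshold, green >= threshold)]


def call_read(intensities):
    """Call a read from a list of (red, green) intensity pairs."""
    return "".join(call_base(r, g) for r, g in intensities)


print(call_read([(0.9, 0.8), (0.9, 0.1), (0.1, 0.9), (0.05, 0.02)]))  # ACTG
print(call_read([(0.0, 0.0)] * 4))                                    # GGGG
```

The second call shows why dark clusters are a problem: a cluster with no signal at all is indistinguishable from a run of G bases.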
XLEAP-SBS Chemistry
Introduced on the NovaSeq X series and subsequently brought to the NextSeq 1000/2000, XLEAP-SBS uses new nucleotide analogues and polymerases that dramatically reduce signal decay (photobleaching) over the course of a run. Older chemistry showed ~50% signal intensity loss over 150 cycles; XLEAP maintains essentially flat signal. This enables longer reads (up to 2×300 on NextSeq 2000) with higher quality at the ends. Phasing remains the primary read-length limitation under XLEAP.
Error Profiles and Quality Decay
Phasing and Pre-Phasing
The dominant error mechanism in Illumina sequencing. Within each cluster, ~1,000 molecules should be in perfect synchrony. However:
- Phasing: A molecule fails to incorporate a nucleotide in a given cycle (incomplete 3’ deblocking or steric hindrance). It falls one cycle behind the majority.
- Pre-phasing: A molecule incorporates two nucleotides in one cycle (defective terminator cap). It jumps one cycle ahead.
Typical phasing/pre-phasing rates are 0.1–0.2% per cycle. This seems small, but it compounds: a per-cycle rate p leaves (1 − p)^n of molecules in phase after n cycles, so after 250 cycles roughly 20–40% of molecules are out of phase from one mechanism alone, and roughly 40–60% when phasing and pre-phasing both occur at that rate, if uncorrected. Illumina's Real-Time Analysis (RTA) software applies computational phasing correction using the known rates estimated from early cycles (or empirically per-cycle on newer instruments), rescuing much of the signal. But there is a hard limit beyond which correction fails, which is why quality drops toward read ends.
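The compounding can be checked with a one-line model. This is a simplified sketch that assumes a constant, independent dephasing probability per cycle and ignores molecules that drift back into phase; `in_phase_fraction` is an illustrative name, not part of RTA.

```python
def in_phase_fraction(rate_per_cycle, cycles):
    """Fraction of molecules that have never dephased after `cycles` cycles,
    assuming a constant, independent per-cycle dephasing probability."""
    return (1.0 - rate_per_cycle) ** cycles


# 0.1% and 0.2% are the single-mechanism rates from the text; 0.4% stands in
# for phasing and pre-phasing both occurring at 0.2%.
for rate in (0.001, 0.002, 0.004):
    out_of_phase = 1.0 - in_phase_fraction(rate, 250)
    print(f"rate {rate:.1%} per cycle: {out_of_phase:.0%} out of phase after 250 cycles")
```

Under these assumptions the out-of-phase fraction after 250 cycles is about 22% at 0.1%, 39% at 0.2%, and 63% at 0.4% per cycle.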
Signal Decay (Photobleaching)
Repeated laser excitation damages fluorophores on the growing strand or causes photodamage to the DNA itself. This manifests as a progressive drop in signal intensity across cycles, compounding the phasing problem. Pre-XLEAP chemistries saw roughly 50% signal loss by cycle 150. The combined effect of phasing + bleaching means the signal-to-noise ratio degrades exponentially, ultimately making base calls unreliable. New chemistry (XLEAP) addresses bleaching but phasing remains.
Practical Quality Characteristics
- Read 2 is typically lower quality than Read 1 because it requires reverse-strand resynthesis by bridge amplification before sequencing.
- The first few cycles (~1–5) often show slightly erratic base composition due to the sequencing primer binding and initial phasing correction calibration.
- Homopolymer accuracy is generally good (unlike flow-based chemistries) because only one base is incorporated per cycle.
- Substitution errors dominate; insertions/deletions are rare (~0.001% per base).
- Typical error rates: <0.1% at cycle 1, rising to ~1–1.5% by cycle 150. Overall, most instruments produce >85% bases ≥Q30.
- PhiX spike-in (5–20%) is used as an internal control and to increase library complexity for low-diversity samples.
The Chastity Filter
Before reads enter analysis, each cluster is assessed for signal purity. The "chastity" score of a base call is the ratio of the brightest signal to the sum of the two brightest signals. A score of 1.0 means a perfectly pure signal, as expected from a monoclonal cluster; a score of 0.5 means two channels are equally bright. A cluster passes filter if no more than one base call in the first 25 cycles has a chastity below 0.6; clusters that fail are flagged as "non-passing filter" (non-PF). Typical good runs achieve >80% PF.
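The filter is simple enough to sketch directly. The example below assumes each cycle is represented as a tuple of four channel intensities. The `max_failures=1` default follows Illumina's documented rule that at most one call in the first 25 cycles may fall below the threshold; treat it, and the function names, as assumptions of this sketch, not as instrument code.

```python
def chastity(intensities):
    """Brightest intensity divided by the sum of the two brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """Apply the chastity filter to the first `window` cycles of one cluster.

    `cycles` is a list of per-cycle intensity tuples (one value per channel).
    """
    failures = sum(chastity(c) < threshold for c in cycles[:window])
    return failures <= max_failures


pure = [(1000, 50, 30, 20)] * 25    # one dominant channel per cycle
mixed = [(600, 550, 20, 10)] * 25   # two templates competing in one cluster

print(round(chastity(pure[0]), 3), passes_filter(pure))    # 0.952 True
print(round(chastity(mixed[0]), 3), passes_filter(mixed))  # 0.522 False
```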
Sequencing Read Order
The instrument sequences in a fixed order:
- Read 1: Sequencing primer hybridizes to the Read 1 site. Extension proceeds into the insert in the 5’→3’ direction.
- Index 1 (i7): After Read 1 completes, the strand is washed and a new primer reads the i7 index.
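- Index 2 (i5): How i5 is read depends on the instrument workflow. On forward-strand instruments (MiSeq, HiSeq 2000/2500, NovaSeq 6000 with v1.0 reagents), i5 is read before Read 2 resynthesis by extending from the surface-grafted P5 oligo, and is reported in the forward orientation. On reverse-complement instruments (iSeq 100, MiniSeq, NextSeq, HiSeq 3000/4000, NovaSeq 6000 with v1.5 reagents, NovaSeq X), i5 is read after Read 2 resynthesis using a dedicated Index 2 primer, so it is read as the reverse complement.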
- Read 2: Reverse strand is resynthesized and linearized. Read 2 primer hybridizes and extends into the insert from the opposite end.
This fixed order means you must always collect at least some Read 1 data (it sets spatial coordinates and phasing parameters), even if your biological interest is entirely in Read 2.
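Because the i5 orientation depends on the instrument workflow, sample sheets often need the index reverse-complemented. The following is a minimal sketch of that conversion; the workflow labels `"forward"` and `"reverse_complement"` and the function name `i5_for_sample_sheet` are hypothetical, and the index sequence is arbitrary.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def i5_for_sample_sheet(i5_forward, workflow):
    """Return the i5 sequence in the orientation the instrument will read it.

    workflow: "forward" or "reverse_complement" (labels are assumptions
    of this sketch, not instrument settings).
    """
    if workflow == "forward":
        return i5_forward.upper()
    if workflow == "reverse_complement":
        return reverse_complement(i5_forward)
    raise ValueError(f"unknown workflow: {workflow!r}")


print(i5_for_sample_sheet("AACCGGTA", "forward"))             # AACCGGTA
print(i5_for_sample_sheet("AACCGGTA", "reverse_complement"))  # TACCGGTT
```

Some demultiplexing software performs this conversion automatically, so it is worth checking how a given pipeline expects i5 to be entered before applying it by hand.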
Element Biosciences AVITI (Avidity Sequencing)
Element Biosciences launched the AVITI in 2022, introducing a fundamentally different chemistry called "Avidity Sequencing" or "Avidite Base Chemistry" (ABC). The key innovation is the separation of nucleotide identification from nucleotide incorporation, using multivalent molecular complexes called "avidites." The instrument uses three different engineered polymerases, rolling circle amplification instead of bridge PCR, and a low-binding surface chemistry.
Element was founded in 2017 by Molly He, who previously led engineering teams at Illumina and was a co-inventor on multiple Illumina sequencing patents. The company raised over $400 million before launching its first instrument and explicitly designed the AVITI to be Illumina-library-compatible from day one — a strategic choice that dramatically lowered the switching cost for labs already invested in Illumina workflows. The "avidity" approach — using multivalent binding to amplify signal without modifying the DNA — was inspired by the immune system, where antibodies achieve high-avidity target recognition through multiple simultaneous weak interactions.
Library Structure and Compatibility
Circular Library Requirement
Unlike Illumina, AVITI requires circular library molecules as templates for rolling circle amplification. There are three routes to get there:
- Adept Workflow: Take any standard Illumina library (TruSeq or Nextera adapters) and circularize it off-instrument using Element's Adept kit. A splint oligo bridges the P5 and P7 adapter ends, and a ligase joins them into a circle. The circularized library is single-stranded. This allows labs to continue using their existing Illumina library prep kits.
- Elevate Workflow: Element's native library prep. Uses Element-specific adapters and indexes (96 UDI pairs, optimized for 4-channel color balance). Produces a linear library that is automatically circularized on-instrument during the sequencing run by the Cloudbreak chemistry.
- Cloudbreak Freestyle: The newest kit allows direct loading of linear Illumina libraries onto the AVITI, with automatic on-instrument circularization. This eliminates the separate benchtop Adept conversion step entirely.
Adapter Compatibility Details
The AVITI is compatible with standard Illumina TruSeq and Nextera adapter sequences. The sequencing primers used are essentially the same Illumina standard sequences, meaning the vast majority of existing Illumina libraries can run on the AVITI without modification beyond circularization. Key practical notes:
- Libraries must be amplified with a proofreading polymerase (e.g., KAPA HiFi, NEB Q5). Taq A-overhangs interfere with circularization.
- Very short inserts (shorter than the read length) cause problems because rolling circle amplification of tiny circles is inefficient.
- Small RNA-seq libraries require custom sequencing primers.
- Libraries treated with IDT/SwiftBio Normalase are NOT compatible.
- Higher library concentration is needed compared to Illumina (~5–16 nM vs. typical ~2 nM loading).
Polony Generation (Rolling Circle Amplification)
This is where AVITI diverges most dramatically from Illumina.
The RCA Process
The flow cell surface is coated with a low-binding chemistry studded with capture oligos complementary to the adapter sequences. When a circular library molecule hybridizes to a capture oligo, rolling circle amplification begins:
- 1. An RCA-specific polymerase (Polymerase #1 of three) initiates synthesis from the capture oligo, using the circular library as a template.
- 2. The polymerase traverses the full circle, then continues displacing its own previously synthesized strand as it enters the second lap.
- 3. This continues for many revolutions, producing a long single-stranded concatemer of tandem copies of the complement of the original library molecule.
- 4. The concatemer collapses into a tight ball on the surface — this is the "polony" (polymerase colony).
Each polony contains many copies of the same sequence, all in close physical proximity, analogous to an Illumina cluster but generated without PCR.
Advantages Over Bridge Amplification
- No PCR means no exponential amplification bias and no polymerase error propagation. RCA copies only the original template, over and over.
- Largely eliminates optical duplicates: each polony arises from a single template binding event.
- Removes the main index hopping mechanism, because there is no recombinase-mediated strand invasion (as in ExAmp) and no free library molecules interacting with growing clusters.
- Polony duplication rates on AVITI are extremely low (typically <1%).
Throughput
A high-output AVITI flow cell contains approximately 1 billion polonies, each generating one read pair. The AVITI runs two independent flow cells simultaneously, yielding ~2 billion read pairs per run. Read lengths of 2×75 through 2×300 are supported with different kit configurations.
Sequencing Chemistry (Sequencing by Binding)
This is the most novel aspect of the platform. Unlike Illumina, where base identification and base incorporation happen in the same chemical step (a labeled nucleotide is incorporated), AVITI splits these into two distinct phases per cycle.
Phase 1: Detection (Avidite Binding)
After washing away any reagents from the previous cycle, the flow cell is flooded with a mixture of:
- An engineered "avidite-binding polymerase" (ABP, Polymerase #2). This is a modified polymerase that can bind template DNA and recruit a complementary nucleotide, but CANNOT catalyze incorporation.
- Four fluorescently-labeled avidites (one per base: A, C, G, T).
Each avidite is a multivalent molecular complex with the following structure:
- Core: Fluorophore-labeled streptavidin tetramer. Dyes are conjugated via lysine-NHS chemistry.
- Arms: Biotinylated polymer linkers ending in nucleotide triphosphates. Each core has ~3 nucleotide-bearing arms plus one arm that links to additional cores, forming higher-order multimers.
The ABP sits at the primer-template junction on each copy in the polony and attempts to recruit a complementary nucleotide. Since each copy in the polony is at the same position (they are synchronized), they all recruit the same avidite type. Because a single avidite molecule has multiple nucleotide arms, it simultaneously engages multiple ABP sites across the polony. This multivalent interaction creates an extremely stable complex through avidity (many weak interactions summing to a strong one), even though each individual nucleotide:polymerase interaction is transient.
The result: bright, stable fluorescent signal at nanomolar avidite concentrations (100-fold lower than the micromolar concentrations needed for labeled nucleotides in Illumina SBS). The fluorophore is also physically distant from the DNA, reducing photodamage.
Phase 2: Incorporation (Strand Extension)
After imaging, the ABPs and avidites are stripped away. The flow cell is then flooded with:
- An incorporation-optimized polymerase (Polymerase #3).
- Unlabeled, 3’-blocked reversible terminator nucleotides.
This polymerase incorporates a single unlabeled, 3’-blocked nucleotide at each position, after which the 3’ block is removed. Because incorporation uses unlabeled nucleotides, no fluorescent scars are left on the DNA. Once deblocked, the growing strand is chemically identical to natural DNA.
Why This Matters
- Scarless DNA: No residual chemical modifications accumulate on the growing strand, which avoids the progressive signal degradation seen with Illumina's dye-labeled nucleotides.
- Optimized separately: The detection polymerase is engineered for specificity and avidite binding. The incorporation polymerase is engineered for speed and fidelity. Neither has to compromise.
- Phasing resistance: Dephased molecules in the polony lack adjacent in-phase neighbors, so wrong avidites cannot form multivalent complexes. They produce only weak, transient, undetectable background. This means phasing noise grows far more slowly than in Illumina.
- Homopolymer performance: Because detection and incorporation are separate and each cycle adds exactly one base, avidite sequencing maintains high accuracy through homopolymer stretches. Published data show essentially no increase in error rate post-homopolymer, whereas Illumina SBS shows a 5-fold error spike.
Error Profiles and Quality
- Most bases score Q40–Q50+ (1 error per 10,000–100,000 bases). With Cloudbreak UltraQ chemistry: ≥70% Q50, ≥90% Q40.
- Quality remains high through the end of the read, with minimal drop-off compared to Illumina's steep quality decay.
- Read 1 and Read 2 quality are much more similar than on Illumina, because both reads start from primers on the same polony concatemer without resynthesis.
- Substitution errors, insertion errors, and deletion errors are all very low.
- Essentially zero index hopping (RCA eliminates the mechanism).
- Optical duplicate rate is extremely low (<1%).
- 4-channel imaging system with two excitation lines (~532 and ~635 nm) and four emission channels (~553, 596, 668, 716 nm).
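The quality figures above use the standard Phred scale, Q = −10·log10(p), where p is the per-base error probability. A small sketch of the conversion (function names are illustrative):

```python
import math


def phred_to_error(q):
    """Per-base error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score for a per-base error probability."""
    return -10 * math.log10(p)


for q in (30, 40, 50):
    print(f"Q{q}: 1 error per {round(1 / phred_to_error(q)):,} bases")
```

Each 10-point step is a tenfold change in error rate, so Q40 is 1 error in 10,000 bases and Q50 is 1 in 100,000, matching the figures quoted in the list.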
Sequencing Order
With Cloudbreak chemistry, the AVITI sequences indexes (i7 and i5) before Read 1 and Read 2. This provides real-time QC and demultiplexing feedback before the long insert reads even begin — letting you catch loading or library problems early. Read order: Index 1 → Index 2 → Read 1 → Read 2.
Output and Run Times
| Parameter | AVITI (High Output) | AVITI (Low Output) |
|---|---|---|
| Reads per flow cell | ~1 billion | ~100 million |
| Reads per run (2 FC) | ~2 billion | ~200 million |
| Read lengths | 2×75 to 2×300 | 2×75 to 2×300 |
| 2×150 run time | <40 hours | <40 hours |
| Quality | ≥90% Q40 | ≥90% Q40 |
Ultima Genomics (Flow-Based SBS)
Ultima Genomics launched the UG 100 in February 2024, representing the most radical hardware departure from the standard flow cell paradigm. It uses a spinning silicon wafer, emulsion PCR for clonal amplification, and a non-terminating single-nucleotide flow chemistry. It is designed for ultra-high throughput at extreme cost efficiency (targeting the $100 genome).
Library Structure
Native Ultima Libraries
Ultima's library structure differs from Illumina's:
- Sequencing end: Contains a Primer for Sequencing (PS) site plus a Sample Barcode (PS-SBC).
- Bead capture end: Contains a Unique Bead Adapter (UBA) sequence necessary for hybridization to sequencing beads during emulsion PCR.
Ultima provides two library preparation workflows:
- Solaris Free: A PCR-free library prep. Compatible with many third-party kits. Adds Ultima-specific adapters to fragmented DNA.
- Solaris Flex: Allows adaptation of existing partial or complete libraries (including Illumina libraries) through a simple PCR step that appends Ultima-specific adapter overhangs.
Illumina Library Conversion
Illumina libraries can be converted to Ultima format via PCR. Primers anneal to the existing Read 1/Read 2 regions and add Ultima-specific PS-SBC and UBA overhangs. The P5/P7 sequences are effectively replaced. This conversion has been demonstrated for 10x Genomics single-cell libraries, Olink proteomics libraries, and standard WGS/RNA-seq preps.
Single-End Reads (Critical Difference)
Ultima sequencing is inherently single-end. Each bead (and therefore each read) sequences from one end of the library insert only. There is no equivalent of Illumina's paired-end resynthesis. Read lengths follow a distribution (not fixed), with a median of ≥300 bases and post-filtering median of ~250 bases. For applications requiring paired-end information, the single-end reads are computationally split into simulated paired-end format.
Clonal Amplification (Emulsion PCR on Beads)
Ultima uses off-instrument, automated emulsion PCR rather than on-surface amplification:
- 1. Library molecules are mixed with sequencing beads bearing capture oligos complementary to the UBA adapter. Each bead captures one (ideally) or a few library molecules.
- 2. The mixture is compartmentalized into a water-in-oil emulsion: aqueous droplets suspended in oil encapsulate individual beads with reagents.
- 3. PCR occurs within each droplet, clonally amplifying the captured library molecule(s) on the surface of the bead.
- 4. After amplification, the emulsion is broken and beads are recovered.
For ppmSeq (paired plus-minus sequencing), both strands of each original DNA duplex are captured on the same bead. Denaturation occurs within the emulsion droplet after bead ligation, so forward and reverse strand templates are co-amplified on a single bead. This enables downstream computational duplex error correction.
Bead Loading onto the Wafer
The amplified beads are loaded onto a 200 mm silicon wafer (the same diameter as standard semiconductor wafers). The wafer surface is patterned at micron scale with an array of electrostatic landing pads. Beads settle onto these pads, ideally one bead per pad. A high-output wafer holds approximately 10–12 billion beads/reads.
The Spinning Wafer Architecture
This is Ultima's most distinctive hardware feature. Instead of a sealed flow cell with microfluidics and a scanning camera:
- The wafer sits flat and spins like a CD.
- Reagents are dispensed onto the center of the spinning wafer and distributed by centrifugal force (spin coating), producing a uniform thin film. Each nucleotide is delivered through a separate nozzle, eliminating cross-contamination between flows.
- Two fixed-position cameras image the wafer continuously as it rotates beneath them, rather than the camera moving across a stationary surface.
- The instrument can run two wafers simultaneously, alternating between a chemistry station and an imaging station.
- Six wafers can be loaded at once; the instrument runs continuously with hot-swappable reagents and consumables.
Sequencing Chemistry (Single-Nucleotide Flow, Non-Terminating SBS)
Ultima's chemistry is conceptually related to earlier flow-based approaches, namely the discontinued 454 pyrosequencing platform and Ion Torrent semiconductor sequencing, but with critical innovations.
Flow Order
Nucleotides are introduced one species at a time in a repeating order (e.g., T, G, C, A, T, G, C, A...). Each introduction of one nucleotide species constitutes one "flow." Four flows (one of each base) constitute one "flow cycle." A run of ~300 base median length requires ~444 flows (~111 flow cycles).
Mostly Natural Nucleotides (mnSBS)
The key innovation: each flow delivers a mixture of mostly unmodified, natural nucleotides plus a minority (<20%) of fluorescently-labeled nucleotides. The polymerase remains processively bound to the template and incorporates bases without a terminator — meaning it can incorporate multiple nucleotides per flow if the template has a homopolymer. The labeled fraction provides optical signal; the unlabeled majority keeps polymerase kinetics and fidelity close to natural.
After each flow, the wafer is imaged at steady state. A fluorophore cleavage step removes the labels, leaving natural DNA. Because the dyes are cleaved, no scars accumulate.
Base Calling in Flow Space
Base calling on Ultima operates in "flow space" rather than "sequence space." For each flow, the system must determine how many nucleotides of that species were incorporated (0, 1, 2, 3, 4...). The identity of the base is never in question (you know which nucleotide was flowed), so substitution errors are inherently very rare. The challenge is accurately determining the number of incorporations, especially for homopolymers. Ultima uses a deep convolutional neural network (CNN) trained on large, diverse datasets to convert raw signal intensities into base calls.
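Flow space is easiest to see by encoding a known sequence. The sketch below converts a sequence into per-flow incorporation counts for the repeating T, G, C, A order used as the example earlier, and back again. It illustrates the representation only and is not Ultima's base caller; the function names are assumptions.

```python
def to_flowgram(seq, flow_order="TGCA"):
    """Encode a DNA sequence as the number of bases incorporated per flow."""
    seq = seq.upper()
    if not set(seq) <= set(flow_order):
        raise ValueError("sequence contains bases missing from the flow order")
    counts = []
    pos = 0
    flow = 0
    while pos < len(seq):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A homopolymer is consumed entirely within a single flow.
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        counts.append(run)
        flow += 1
    return counts


def from_flowgram(counts, flow_order="TGCA"):
    """Decode per-flow incorporation counts back into a sequence."""
    return "".join(
        flow_order[i % len(flow_order)] * n for i, n in enumerate(counts)
    )


flows = to_flowgram("TTGCAAAC")
print(flows)                  # [2, 1, 1, 3, 0, 0, 1]
print(from_flowgram(flows))   # TTGCAAAC
```

Many flows incorporate nothing (count 0), which is why the ~444 flows quoted earlier yield only ~300 bases, about 0.68 bases per flow on average. A miscounted flow (for example 3 read as 2) produces an indel, not a substitution.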
Homopolymer Challenge
Because the chemistry is non-terminating, a homopolymer run (e.g., AAAAAAA) results in all 7 A’s being incorporated in a single flow. The system must count the number of incorporations from the signal intensity. For short homopolymers (≤8–10 bp), accuracy is high. For longer homopolymers (>12 bp), accuracy degrades, and these regions are excluded from Ultima's high-confidence region (HCR). This is the classic limitation of flow-based chemistries, shared historically with 454 and Ion Torrent, though Ultima's ML-based calling substantially outperforms those predecessors.
Error Profile
- Substitution errors are extremely low because base identity is defined by which nucleotide was flowed.
- Insertion/deletion errors in homopolymer regions are the dominant error type.
- SNV F1 score: 99.8%. INDEL F1 score: 99.4%.
- SNVQ (Single Nucleotide Variant Quality) scores are reported instead of traditional per-base Q scores. SNVQ represents the error probability of a specific base substitution (e.g., A→G) rather than any substitution.
- Read lengths are variable, following a distribution. Median raw ≥300 bp, post-filter ~250 bp.
ppmSeq (Paired Plus-Minus Sequencing)
Ultima's unique accuracy feature. By capturing both strands of a DNA duplex on a single bead during emulsion PCR, then sequencing both, the system can computationally compare forward and reverse strand reads. Any base call disagreement between the two strands likely represents a sequencing artifact or DNA damage and can be filtered. This achieves a raw read accuracy of Q60 (one error per million bases) for SNVs, enabling applications like liquid biopsy and minimal residual disease detection.
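The strand-comparison idea can be sketched as follows. This is a deliberately simplified illustration, not Ultima's pipeline: it assumes the minus-strand read has already been reverse-complemented into plus-strand orientation and aligned position by position, and the name `duplex_consensus` is hypothetical.

```python
def duplex_consensus(plus_read, minus_read_rc, mask="N"):
    """Keep positions where both strands agree and mask the rest.

    `minus_read_rc` is the minus-strand read, reverse-complemented so that
    it is directly comparable with `plus_read`.
    """
    if len(plus_read) != len(minus_read_rc):
        raise ValueError("reads must be aligned to the same length")
    return "".join(
        a if a == b else mask for a, b in zip(plus_read, minus_read_rc)
    )


# Position 5 disagrees between strands, so it is masked and not trusted.
print(duplex_consensus("ACGTTGCA", "ACGTAGCA"))  # ACGTNGCA
```

A true variant is present on both strands of the original duplex and survives the comparison, while an error or damage event confined to one strand does not.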
Output and Run Times
| Parameter | UG 100 |
|---|---|
| Reads per wafer | 10–12 billion |
| Data per wafer | ≥2.5–3.0 terabases |
| Run time (≥300 bp) | ~20 hours per wafer |
| Read type | Single-end (variable length) |
| Read length median | ≥300 bp raw; ~250 bp post-filter |
| Genomes/year (30X) | ~20,000 |
| Cost per genome (30X) | ~$100 |
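The reads-per-wafer and data-per-wafer rows are related by simple arithmetic, sketched below. The 3.1 Gb human genome size is an assumption of this sketch, the function names are illustrative, and the calculation ignores duplicates, unmapped reads, and coverage non-uniformity.

```python
def terabases_per_wafer(reads, median_len_bp):
    """Approximate raw yield in terabases."""
    return reads * median_len_bp / 1e12


def genomes_per_wafer(reads, median_len_bp, coverage=30, genome_bp=3.1e9):
    """Approximate number of genomes at the given mean coverage.

    genome_bp=3.1e9 is an assumed human genome size.
    """
    return reads * median_len_bp / (coverage * genome_bp)


for reads in (10e9, 12e9):
    tb = terabases_per_wafer(reads, 250)
    genomes = genomes_per_wafer(reads, 250)
    print(f"{reads / 1e9:.0f} billion reads: {tb:.1f} Tb, ~{genomes:.0f} genomes at 30X")
```

With the ~250 bp post-filter median, 10–12 billion reads give 2.5–3.0 Tb, in line with the table, or roughly 27–32 genomes per wafer at 30X under these assumptions.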
Sanger Sequencing (Chain Termination)
Sanger sequencing, developed by Frederick Sanger in 1977, is one of the two original DNA sequencing methods (the other being Maxam–Gilbert chemical cleavage, published the same year) and remains the gold standard for validation experiments, small-scale sequencing, and clinical confirmatory testing. It uses dideoxynucleotide chain termination to produce a ladder of fragments whose lengths encode the template sequence. Despite being largely supplanted by massively parallel platforms for discovery work, it persists in essentially every molecular biology laboratory in the world.
Frederick Sanger is one of only five people to have won two Nobel Prizes, and one of only two (with K. Barry Sharpless) to have won the Chemistry Nobel twice. His first (1958) was for determining the amino acid sequence of insulin, proving that proteins have defined primary structures. His second (1980) was for the dideoxy chain-termination method described here, shared with Walter Gilbert and Paul Berg. Sanger's method went on to power the Human Genome Project, which took 13 years and roughly $2.7 billion to produce the first human reference genome. Sanger famously described himself as "just a chap who messed about in his lab," and upon retirement he declined a knighthood, reportedly saying he did not wish to be called "Sir."
Library Structure (Template Preparation)
Sanger sequencing does not use "libraries" in the NGS sense. Instead, it sequences a single template molecule (or PCR amplicon) per reaction. The input is one of:
- Plasmid DNA: A recombinant plasmid containing the target insert, typically 1–10 kb. Universal primers (M13 forward/reverse, T7, SP6) anneal to vector sequences flanking the insert.
- PCR amplicon: A purified PCR product with known primer binding sites. The same primers used for amplification (or nested internal primers) are used as sequencing primers.
- Genomic or BAC DNA: Requires custom sequencing primers designed to the known flanking sequence. Used historically in genome projects (primer walking).
There is no adapter ligation, no indexing, and no clonal amplification step. Each sequencing reaction interrogates one template with one primer, producing a single read.
The Chain Termination Reaction
Each Sanger reaction contains:
- Template DNA: The single-stranded or heat-denatured double-stranded target.
- Sequencing primer: An oligonucleotide complementary to a known region upstream of the target.
- DNA polymerase: Originally Klenow fragment, now almost universally a thermostable enzyme (e.g., Thermo Sequenase, AmpliTaq FS) optimized for uniform ddNTP incorporation.
- dNTPs: All four deoxynucleotide triphosphates at high concentration.
- ddNTPs: All four 2’,3’-dideoxynucleotide triphosphates, each labeled with a distinct fluorescent dye. ddNTPs lack the 3’-OH required for phosphodiester bond formation, so their incorporation terminates chain elongation.
The dNTP:ddNTP ratio (typically ~100:1 to 300:1) is calibrated so that, on average, every possible position in the template has a statistical population of fragments terminating there. During thermal cycling (cycle sequencing), the primer extends along the template, incorporating dNTPs normally until a ddNTP is stochastically incorporated instead, at which point that copy terminates. After 25–30 thermal cycles, the reaction contains a nested set of fragments ranging from the primer to every position in the readable region, each terminated by a fluorescently labeled ddNTP whose color encodes the terminal base.
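The fragment ladder described above can be approximated with a geometric model. The sketch below assumes, as a simplification not stated in the source, that the polymerase incorporates dNTPs and ddNTPs with equal efficiency, so each position terminates with probability 1 / (1 + ratio). Real enzymes and dye terminators deviate from this, and the function names are illustrative.

```python
def termination_probability(dntp_to_ddntp_ratio: float) -> float:
    """Per-position chance that a ddNTP is incorporated instead of a dNTP.

    Assumes equal incorporation efficiency for dNTPs and ddNTPs (illustrative).
    """
    return 1.0 / (1.0 + dntp_to_ddntp_ratio)


def fraction_terminating_at(position: int, ratio: float) -> float:
    """Fraction of extension products that end exactly at `position` (1-based)."""
    p = termination_probability(ratio)
    # The chain must survive (position - 1) incorporations, then terminate.
    return p * (1.0 - p) ** (position - 1)


def fraction_extending_past(position: int, ratio: float) -> float:
    """Fraction of extension products still growing after `position` bases."""
    p = termination_probability(ratio)
    return (1.0 - p) ** position


if __name__ == "__main__":
    for ratio in (100, 300):
        for pos in (50, 500, 900):
            frac = fraction_terminating_at(pos, ratio)
            print(f"ratio {ratio}:1, position {pos}: {frac:.2e}")
```

Under this model a lower ddNTP fraction (higher ratio) leaves more signal at distant positions at the cost of weaker peaks near the primer, which is one way to see why the ratio has to be tuned to the desired read length.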
Capillary Electrophoresis and Detection
Modern Sanger sequencing uses capillary electrophoresis (CE), not slab gels. The standard instruments are:
- Applied Biosystems 3730xl: 96 capillaries, successor to the ABI 3700 that generated much of the Human Genome Project data, and the workhorse of most core facilities.
- Applied Biosystems 3500/3500xL: 8 or 24 capillaries, designed for clinical and lower-throughput labs.
- Applied Biosystems SeqStudio: 4 or 8 capillaries, compact benchtop instrument for individual labs.
The terminated fragments are electrokinetically injected into fused-silica capillaries filled with POP-7 (Performance Optimized Polymer), a linear polyacrylamide sieving matrix. Fragments separate by size as they migrate through the polymer under an applied electric field (typically 15 kV). As each fragment passes the detection window, a laser (argon-ion on older instruments, solid-state on newer ones) excites the terminal ddNTP fluorophore. A CCD camera or photodiode array records the emission through a set of spectral filters, resolving the four dye colors. The resulting raw data is a four-color electropherogram (chromatogram) where each peak corresponds to one base position.
Base Calling and Quality Scores
The standard base-calling software is KB Basecaller (Applied Biosystems), which assigns Phred quality scores to each called base. Phred scores were originally developed specifically for Sanger sequencing trace files by Brice Ewing and Phil Green (1998). A Phred score of Q20 means 1% error probability; Q30 means 0.1%; Q40 means 0.01%. A typical good Sanger read produces 700–900 bases of Q20+ sequence, with the first ~30–50 bases being unreliable (primer peak artifacts and unresolved short fragments) and quality degrading beyond ~800–900 bases due to decreasing fragment resolution.
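The Phred relationship quoted above is Q = -10 · log10(P), where P is the estimated error probability. A minimal sketch of the conversion follows; the helper names are illustrative and are not part of KB Basecaller or any other tool named here.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def expected_errors(quals) -> float:
    """Expected number of miscalled bases in a read, given per-base Phred scores."""
    return sum(phred_to_error_prob(q) for q in quals)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
    # An 800-base read that is uniformly Q20 is expected to contain ~8 errors.
    print(expected_errors([20] * 800))
```

The same conversion applies to every platform in this review, which makes it a convenient check on quoted accuracy figures (for example, 90% accuracy corresponds to Q10).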
Error Profiles
- Accuracy is extremely high within the readable window: raw per-base error rates of ~0.1–1% (Q20–Q30), and after manual trace editing, accuracy approaches 99.99% (Q40).
- Errors are predominantly miscalls at positions with overlapping or compressed peaks. Compressions occur in GC-rich regions where secondary structure causes fragments to migrate anomalously. Adding betaine or using dITP (deoxyinosine) instead of dGTP can resolve compressions.
- Homopolymer runs >8–10 bases cause peak broadening and merging, making base counting difficult — analogous to the homopolymer problem in flow-based NGS chemistries.
- Insertions/deletions are not introduced by the chemistry itself but may be misinterpreted from noisy trace regions.
- Mixed templates (heterozygous positions in diploid DNA, or mixed bacterial populations) produce double peaks at the variant position, which software can flag but which complicate automated calling.
Read Length and Throughput
| Parameter | Typical Value |
|---|---|
| Read length | 700–1,000 bases (Q20+) |
| Maximum read length | ~1,200 bases under optimal conditions |
| Reads per run (3730xl) | 96 (one per capillary) |
| Run time | ~2–3 hours per plate |
| Daily throughput (3730xl) | ~1,500 reads (~1.2 Mb/day) |
| Cost per read | ~$3–$8 (reagents + instrument time) |
Practical Considerations
- Gold-standard validation: Sanger is universally accepted by regulatory bodies (FDA, EMA) for confirming NGS-detected variants. Most clinical NGS pipelines require Sanger confirmation of reportable variants.
- Primer walking: To sequence regions longer than one read length, successive primers are designed at ~500 bp intervals along the template, each initiating a new read. This is labor-intensive but remains necessary for finishing bacterial genomes or characterizing plasmid constructs.
- Cost scaling: Cost is strictly linear with the number of reads. There is no economy of scale: sequencing 1,000 amplicons costs 1,000x one amplicon. This makes Sanger prohibitively expensive for any application requiring more than a few hundred reads.
- Sensitivity limit: Sanger can detect a minor allele only if it is present at ≥15–20% frequency. Below this threshold, the minor allele peak is indistinguishable from baseline noise. NGS platforms can detect variants at <1% frequency.
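The primer-walking layout described above can be sketched as a small planning helper. The 800-base usable read length and 40-base unreadable prefix are assumptions drawn from the ranges quoted earlier in this section; the function names are illustrative.

```python
def primer_walk_positions(region_length: int, spacing: int = 500) -> list[int]:
    """0-based start coordinates of successive walking primers."""
    if region_length <= 0 or spacing <= 0:
        raise ValueError("region_length and spacing must be positive")
    return list(range(0, region_length, spacing))


def uncovered_gaps(region_length: int, spacing: int = 500,
                   read_length: int = 800, unreadable_prefix: int = 40):
    """Intervals (start, end) not covered by the usable window of any read.

    Each read is assumed usable from `unreadable_prefix` bases past its
    primer start up to `read_length` bases past it.
    """
    covered_until = 0
    gaps = []
    for start in primer_walk_positions(region_length, spacing):
        usable_start = min(start + unreadable_prefix, region_length)
        usable_end = min(start + read_length, region_length)
        if usable_start > covered_until:
            gaps.append((covered_until, usable_start))
        covered_until = max(covered_until, usable_end)
    if covered_until < region_length:
        gaps.append((covered_until, region_length))
    return gaps


if __name__ == "__main__":
    print(primer_walk_positions(2000))          # primers every 500 bp
    print(uncovered_gaps(2000))                 # only the first primer's blind spot
    print(uncovered_gaps(2000, spacing=1000))   # spacing too wide leaves gaps
```

With ~500 bp spacing and ~800-base reads, consecutive reads overlap by roughly 300 bases. The only gap is the unreadable stretch after the first primer, which is why that primer is normally placed in known flanking sequence.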
Oxford Nanopore Technologies (Nanopore Sequencing)
Oxford Nanopore Technologies (ONT) uses protein nanopores embedded in synthetic membranes to sequence individual DNA or RNA molecules in real time. An ionic current flows through the pore, and as a nucleic acid strand translocates through the narrowest constriction, the current is modulated by the identity of the bases occupying the sensing region. This is the only major sequencing platform that reads native DNA (or RNA) directly — no amplification, no synthesis, no labeling. The technology is deployed across instruments ranging from the USB-powered MinION to the production-scale PromethION.
In August 2016, NASA astronaut Kate Rubins used a MinION aboard the International Space Station to perform the first-ever DNA sequencing in microgravity — sequencing samples of bacteriophage lambda, E. coli, and mouse mitochondrial DNA. The harmonica-sized device required no more than a laptop and a USB port. Nine sequencing runs were conducted aboard the ISS over a six-month period, yielding 276,882 reads with performance comparable to ground-based controls (Castro-Wallace et al. 2017, Scientific Reports). Subsequent ISS experiments extended this to direct RNA sequencing in orbit. No other sequencing platform has been demonstrated outside of a terrestrial laboratory, and the MinION remains the only sequencer that can be carried in a coat pocket.
Library Structure
ONT libraries are remarkably simple compared to other platforms. The essential component is a motor protein (a helicase or translocase) ligated to the end of the DNA molecule, which controls the translocation speed through the pore. The main library preparation approaches are:
- Ligation-based (LSK kit, e.g., SQK-LSK114): DNA is end-repaired and dA-tailed, then a sequencing adapter carrying the motor protein is ligated to both ends. No PCR is required, preserving native base modifications. Input requirement: ~1 µg of high-molecular-weight DNA for optimal results, though lower inputs work with reduced yield.
- Rapid (RAD kit, e.g., SQK-RAD114): A transposase fragments the DNA and simultaneously attaches sequencing adapters in a 10-minute, single-tube reaction. Faster but produces shorter fragments and sacrifices some library complexity.
- PCR-based (PCB kit): Adds a PCR amplification step for low-input samples. Loses native base modification information.
- Direct RNA (RNA004): An oligo(dT)-containing adapter is ligated to the poly(A) tail, an optional reverse transcription step synthesizes a complementary cDNA strand that stabilizes the molecule, and the sequencing adapter is then attached. The RNA strand (not the cDNA) is sequenced directly, preserving RNA modifications (m6A, pseudouridine, m5C, etc.).
No Indexing Constraints
ONT supports barcoding (multiplexing) via native barcodes (24-plex) or PCR barcodes (96-plex). Barcodes are short adapter-adjacent sequences that are basecalled and demultiplexed computationally. Unlike Illumina, there are no index-hopping concerns because there is no amplification on the flow cell.
The Nanopore and Translocation Mechanism
The biological nanopore is a modified CsgG protein (from E. coli curli secretion system), designated R10.4.1 in the current chemistry. The pore is inserted into a synthetic lipid membrane stretched across a microwell on a CMOS sensor array. Key architectural features:
- Dual constriction: The R10.4.1 pore has two narrow constrictions in its barrel, spaced ~9 nucleotides apart. This dual-reader geometry provides two sequential measurements of each k-mer as it passes through, dramatically improving base-calling accuracy compared to earlier single-constriction pores (R9.4.1).
- Motor protein: The helicase/translocase bound at the pore entrance ratchets the DNA strand through the constriction one base at a time at a controlled rate of ~400–450 bases per second. Without the motor, translocation would be too fast (~1 µs per base) for accurate measurement.
- Ionic current measurement: A voltage (~180 mV) applied across the membrane drives K+ and Cl− ions through the open pore. As DNA occupies the constriction, the current drops by an amount characteristic of the short k-mer (~5 nt) currently in the sensing zone. The raw current trace, sampled at 5,000 Hz, is called a "squiggle"; each base contributes only a short segment of it.
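The translocation and sampling figures above determine how many current samples the basecaller sees per base. The sketch below is simple arithmetic on the numbers quoted in this list; the function names are illustrative.

```python
def samples_per_base(sample_rate_hz: float, bases_per_second: float) -> float:
    """Average number of raw current samples recorded while one base transits."""
    return sample_rate_hz / bases_per_second


def dwell_time_ms(bases_per_second: float) -> float:
    """Average time one base spends in the sensing region, in milliseconds."""
    return 1000.0 / bases_per_second


if __name__ == "__main__":
    for speed in (400, 450):
        print(f"{speed} bases/s: {samples_per_base(5000, speed):.1f} samples/base, "
              f"{dwell_time_ms(speed):.2f} ms/base")
    # Free translocation at ~1 microsecond per base is ~1,000,000 bases/s.
    print(f"no motor: {samples_per_base(5000, 1_000_000):.3f} samples/base")
```

At 400–450 bases per second each base yields roughly 11–12 samples, whereas un-motored translocation would pass about 200 bases between consecutive samples, which is why the motor protein is essential.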
Base Calling
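Raw ionic current signals are converted to base sequences by deep neural networks. The current production basecaller is Dorado, whose most accurate models use a transformer-based architecture. Dorado offers fast, high-accuracy (HAC), and super-accurate (SUP) model tiers, plus a duplex mode:
- High-accuracy (HAC): The default balance of speed and accuracy, ~Q20 median for simplex reads. The fast tier trades some accuracy for speed.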
- Super-accurate (SUP): Slower but achieves ~Q23–Q25 median for simplex reads.
- Duplex: When both strands of a DNA duplex pass through the same pore sequentially, the complementary reads are computationally merged for consensus accuracy of Q30+ (99.9%).
Native Base Modification Detection
Because the nanopore reads unmodified DNA directly, any chemical modification to a base (5-methylcytosine, 6-methyladenine, 5-hydroxymethylcytosine, etc.) produces a characteristic current perturbation. Dorado includes modification-aware models that call base modifications simultaneously with primary sequence, with no bisulfite conversion, antibody enrichment, or enzymatic treatment required. Among the platforms reviewed here, only PacBio offers comparable native detection (via polymerase kinetics), and this capability is the primary reason many epigenetics laboratories have adopted ONT.
Read Length
ONT has no inherent upper limit on read length — it is determined entirely by the input DNA fragment size. Ultra-long read protocols using gentle DNA extraction (e.g., agarose plug-based methods, Circulomics/Short Read Eliminator) routinely produce reads >100 kb, with individual reads exceeding 4 Mb reported. The current Guinness World Record for longest nanopore read is >4.2 Mb. Typical read length distributions depend on the library prep method:
- Ligation (standard): N50 of 10–30 kb, depending on input DNA quality.
- Ultra-long: N50 of 50–100+ kb.
- Rapid: N50 of 5–15 kb (transposase-limited fragmentation).
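The N50 values above are length-weighted: N50 is the read length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of the calculation:

```python
def n50(read_lengths) -> int:
    """Return the N50 of a collection of read lengths."""
    if not read_lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(read_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    lengths = [2, 2, 2, 3, 3, 4, 8, 8]
    print(n50(lengths))
```

Because N50 is weighted by bases rather than by read count, a library can have an N50 of 20 kb even when most individual reads are much shorter.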
Instruments and Throughput
| Instrument | Flow Cell Pores | Output per Flow Cell | Run Time |
|---|---|---|---|
| Flongle | 126 | ~2.8 Gb | Up to 24 hours |
| MinION / Mk1C | 2,048 | ~50 Gb | Up to 72 hours |
| P2 Solo | 2,675 per FC × 2 FC | ~290 Gb per FC | Up to 72 hours |
| PromethION (P24/P48) | 2,675 per FC × 24 or 48 FC | ~290 Gb per FC; up to 14 Tb per run | Up to 72 hours |
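The per-run figure for the PromethION follows from the per-flow-cell output. The sketch below reproduces that arithmetic and converts output to mean coverage; the 3.1 Gb human genome size is an assumption, and the table values are maximum outputs rather than typical yields.

```python
HUMAN_GENOME_GB = 3.1  # assumed haploid human genome size, in gigabases


def run_output_gb(output_per_flow_cell_gb: float, flow_cells: int) -> float:
    """Total output of a run in Gb, assuming every flow cell reaches the same yield."""
    return output_per_flow_cell_gb * flow_cells


def mean_coverage(total_output_gb: float,
                  genome_size_gb: float = HUMAN_GENOME_GB) -> float:
    """Mean depth of coverage obtained from a given amount of sequence."""
    return total_output_gb / genome_size_gb


if __name__ == "__main__":
    p48 = run_output_gb(290, 48)
    print(f"P48 maximum: {p48 / 1000:.2f} Tb")
    print(f"One PromethION flow cell: {mean_coverage(290):.0f}x human coverage")
```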
Error Profiles
- Simplex raw accuracy (R10.4.1 + SUP model): Median Q20–Q25 (~99.0–99.7%). The remaining errors are dominated by indels in homopolymer stretches and occasional substitutions.
- Duplex accuracy: Q30+ (~99.9%). Requires both strands of the original duplex to be sequenced consecutively through the same pore (~30–60% of reads achieve duplex pairing).
- Consensus accuracy: At moderate coverage (≥30x), variant-calling accuracy matches or exceeds short-read platforms for SNVs and substantially outperforms them for structural variants due to long reads spanning breakpoints.
- Homopolymers: The primary remaining challenge. Long homopolymers (>8–10 bp) are prone to insertion/deletion errors because the current signal difference between, e.g., 8 and 9 consecutive identical bases is very small. The R10.4.1 dual-constriction pore significantly improved homopolymer calling over R9.4.1 but it remains the dominant error mode.
- Systematic biases: Certain sequence contexts (e.g., long homopolymers, some tandem repeats) show elevated error rates that are partially correlated across reads, limiting the benefit of additional coverage for those specific motifs.
Real-Time and Adaptive Sampling
A unique ONT capability: the instrument can eject a DNA molecule from the pore mid-read if the initial sequence does not match a target of interest ("Read Until" / adaptive sampling). The pore is then free to capture the next molecule. This enables real-time target enrichment without prior capture or amplification — for example, sequencing only a set of clinically relevant genes from a whole-genome library, achieving up to 5–10x enrichment of on-target reads. Adaptive sampling can also be used in reverse, depleting unwanted sequences (e.g., host DNA in a metagenomic sample).
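The keep-or-eject decision can be illustrated with a toy classifier. This is a hypothetical sketch, not ONT's Read Until API: it assumes the first part of a read has already been basecalled and uses exact k-mer matching against the targets, whereas real implementations typically align the read prefix to a reference and also handle the reverse strand.

```python
def build_kmer_index(targets, k: int = 8) -> set:
    """Collect every k-mer present in the target sequences."""
    index = set()
    for seq in targets:
        for i in range(len(seq) - k + 1):
            index.add(seq[i:i + k])
    return index


def decide(read_prefix: str, kmer_index: set, k: int = 8,
           min_fraction: float = 0.5, mode: str = "enrich") -> str:
    """Return "continue" to keep sequencing the molecule or "eject" to reject it."""
    kmers = [read_prefix[i:i + k] for i in range(len(read_prefix) - k + 1)]
    if not kmers:
        return "continue"  # prefix too short to judge; keep reading
    hit_fraction = sum(km in kmer_index for km in kmers) / len(kmers)
    on_target = hit_fraction >= min_fraction
    if mode == "enrich":
        return "continue" if on_target else "eject"
    if mode == "deplete":
        return "eject" if on_target else "continue"
    raise ValueError("mode must be 'enrich' or 'deplete'")


if __name__ == "__main__":
    target = "ACGTACGTTAGCCGATAGGCTTAACG"
    index = build_kmer_index([target])
    print(decide(target[:16], index))            # on-target read is kept
    print(decide("T" * 16, index))               # off-target read is ejected
    print(decide(target[:16], index, mode="deplete"))
```

The `mode` switch mirrors the two uses described above: enrichment keeps matching molecules, while depletion ejects them.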
Pacific Biosciences (PacBio) SMRT Sequencing
PacBio's Single Molecule, Real-Time (SMRT) sequencing observes a single polymerase molecule incorporating fluorescently labeled nucleotides into a growing complementary strand in real time. The polymerase is immobilized at the bottom of a zero-mode waveguide (ZMW), a nanophotonic structure that confines the observation volume to zeptoliters, enabling single-molecule fluorescence detection against a background of freely diffusing labeled nucleotides. The current production instruments are the Revio (high-throughput HiFi) and the Vega (benchtop, long-read focused).
The zero-mode waveguide concept was invented by Jonas Korlach, Stephen Turner, and colleagues at Cornell University, drawing on nanophotonics principles first described by Harold Craighead's lab. The key insight was that a metal aperture smaller than the wavelength of light creates an evanescent field rather than a propagating wave, confining illumination to a volume so small (~20 zeptoliters) that single fluorescent molecules become detectable against a micromolar background. PacBio was founded in 2004 and delivered its first commercial instrument (the RS) in 2011. The 2019 introduction of HiFi sequencing, achieved by computationally combining multiple noisy passes around a circular template into one highly accurate consensus read, transformed PacBio from a niche long-read platform into a serious contender for population-scale genome sequencing (Wenger et al. 2019, Nature Biotechnology). The T2T Consortium's 2022 completion of the first truly complete, telomere-to-telomere human genome assembly (T2T-CHM13) relied heavily on PacBio HiFi reads for base-level accuracy and Oxford Nanopore ultra-long reads for spanning centromeric repeats (Nurk et al. 2022, Science).
Library Structure (SMRTbell)
PacBio libraries are circular molecules called SMRTbells. The architecture is:
- Insert: The target double-stranded DNA fragment.
- Hairpin adapters: Single-stranded hairpin loops ligated to both ends of the dsDNA insert, converting the linear molecule into a topologically closed, dumbbell-shaped circle.
This circular topology is critical because the sequencing polymerase can traverse the entire SMRTbell multiple times (rolling-circle fashion around the dumbbell), generating multiple passes over the same insert. Each complete traversal of both strands constitutes one "pass." The multiple passes enable intramolecular error correction to produce high-fidelity (HiFi) consensus reads.
SMRTbell construction:
- Target DNA is sheared (for WGS) or left intact (for ultra-long reads), then end-repaired and ligated to hairpin adapters using T4 DNA ligase.
- A sequencing primer anneals to the adapter sequence, and the sequencing polymerase binds to the primed SMRTbell. This polymerase-bound SMRTbell complex is the final "library" loaded onto the instrument.
- Size selection is performed before loading: for HiFi, inserts of 10–20 kb are optimal. For CLR (continuous long read) mode, inserts of >40 kb are used.
Zero-Mode Waveguides (ZMWs)
The SMRT Cell is a silicon chip containing millions of ZMWs: cylindrical holes approximately 70–100 nm in diameter and ~100 nm deep, fabricated in an aluminum film on a glass substrate. The diameter is smaller than the wavelength of excitation light (~532 nm), so light cannot propagate through the hole. Instead, an evanescent field decays exponentially from the bottom of the well, illuminating only the bottom ~30 nm, a detection volume of ~20 zeptoliters (20 × 10⁻²¹ L).
The sequencing polymerase is chemically tethered to the bottom of each ZMW. Fluorescently labeled nucleotides diffuse freely in solution above, but only become visible when they enter the ZMW observation volume and are bound by the polymerase (residence time ~10–100 ms). The bulk solution concentration (~µM) ensures labeled nucleotides are diffusing in and out of the ZMW rapidly, but only the one being incorporated is immobilized long enough to generate a pulse of fluorescence.
The Revio SMRT Cell 25M contains approximately 25 million ZMWs, of which 8–12 million typically yield productive sequencing reads (the remainder are empty or contain multiple polymerases).
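One common way to reason about the split between empty, productive, and multiply loaded ZMWs is to assume that polymerase complexes land in wells independently, giving Poisson-distributed occupancy. That model is an assumption for illustration and is not stated in the source; actual loading behaviour on current instruments may differ.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k complexes in a well when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zmw_occupancy(total_zmws: int, mean_per_zmw: float) -> dict:
    """Expected number of empty, singly loaded, and multiply loaded ZMWs."""
    empty = poisson_pmf(0, mean_per_zmw)
    single = poisson_pmf(1, mean_per_zmw)
    multiple = 1.0 - empty - single
    return {
        "empty": total_zmws * empty,
        "single": total_zmws * single,
        "multiple": total_zmws * multiple,
    }


if __name__ == "__main__":
    for lam in (0.5, 1.0, 1.5):
        occ = zmw_occupancy(25_000_000, lam)
        print(lam, {name: round(count / 1e6, 2) for name, count in occ.items()})
```

Under pure Poisson loading the singly occupied fraction peaks at about 37% (mean of one complex per well), or roughly 9.2 million of 25 million ZMWs. Yields toward the upper end of the quoted 8–12 million range would imply loading that is better than Poisson.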
Sequencing Chemistry
PacBio uses phospholinked nucleotides: the fluorescent dye is attached to the terminal phosphate of the nucleotide triphosphate, not to the base. During incorporation:
- 1. The labeled nucleotide diffuses into the ZMW and binds the polymerase active site (complementary to the template base).
- 2. While held in the active site (~10–100 ms), the fluorophore emits a pulse of color-coded light detected by the sensor below the ZMW.
- 3. The polymerase catalyzes phosphodiester bond formation, cleaving the diphosphate (carrying the fluorophore) from the nucleotide monophosphate, which is incorporated into the growing strand.
- 4. The released dye-labeled pyrophosphate diffuses out of the ZMW. The incorporated nucleotide retains no fluorescent modification — the growing strand is natural DNA.
This "label-then-cleave" design means no chemical scars accumulate on the synthesized strand, and the polymerase processes a natural DNA template, preserving its ability to detect kinetic signatures of base modifications.
SPRQ Chemistry
SPRQ is PacBio's latest HiFi chemistry for Revio. Key improvements over previous chemistries include a longer-lived polymerase (enabling more passes per SMRTbell), improved reagent stability, reduced input requirement (500 ng of native DNA), and ~33% higher HiFi yield per SMRT Cell. At the ~100–120 Gb of HiFi output per SMRT Cell listed in the output table, that is enough for one 30x human genome (about 93 Gb) per SMRT Cell at ≥Q30 accuracy.
HiFi (Circular Consensus Sequencing) vs. CLR
PacBio operates in two primary modes:
- HiFi (CCS): The polymerase makes multiple passes (≥3 full passes required, typically 8–15) around a short SMRTbell (10–20 kb insert). Subreads from each pass are computationally aligned and collapsed into a single consensus read with accuracy ≥Q30 (99.9%). The tradeoff: insert size is limited by polymerase processivity (the polymerase must complete multiple laps before dissociating or dying). Read lengths: 10–25 kb at Q30+.
- CLR (Continuous Long Read): Long SMRTbells (>40 kb inserts) are sequenced with a single polymerase pass. Raw accuracy is ~85–90% (roughly Q8–Q10), with errors dominated by insertions and deletions. CLR reads can exceed 100 kb. Useful for scaffolding, structural variant detection, and de novo assembly when combined with HiFi data.
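The idea behind circular consensus can be illustrated with a column-wise majority vote. This is a deliberately simplified sketch: it assumes the subreads have already been aligned to equal length with "-" marking gaps, whereas PacBio's production CCS software uses a probabilistic model rather than a plain vote.

```python
from collections import Counter


def majority_consensus(aligned_passes) -> str:
    """Collapse pre-aligned subreads into one consensus by per-column majority vote."""
    if not aligned_passes:
        raise ValueError("need at least one pass")
    length = len(aligned_passes[0])
    if any(len(p) != length for p in aligned_passes):
        raise ValueError("all aligned passes must have the same length")
    consensus = []
    for column in zip(*aligned_passes):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":  # a majority gap means the column is a spurious insertion
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    passes = [
        "ACGT-A",
        "ACGTTA",  # insertion error in one pass
        "AC-T-A",  # deletion error
        "ACGT-A",
        "ACCT-A",  # substitution error
    ]
    print(majority_consensus(passes))
```

Because errors in different passes are largely independent, each additional pass makes it less likely that the same wrong call wins a column, which is why HiFi requires a minimum number of full passes.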
Kinetic Base Modification Detection
The sequencing polymerase pauses or slows at modified bases (e.g., m6A, m4C) because the modified template base alters the enzyme kinetics. By analyzing the interpulse duration (IPD) — the time between successive fluorescent pulses — PacBio can detect base modifications directly from the sequencing data, without any chemical treatment. HiFi mode enables detection of m6A and CpG methylation (5mC) at single-molecule resolution with high confidence. This requires the kinetic information from multiple passes to distinguish modification-induced slowdowns from stochastic variation.
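A common way to express the kinetic signal is an IPD ratio: the mean interpulse duration observed at a position divided by the mean for an unmodified control at the same position. The sketch below is illustrative only; the control source (for example an amplified, modification-free copy), the threshold of 2.0, and the minimum of three passes are assumptions, not PacBio parameters.

```python
from statistics import mean


def ipd_ratio(observed_ipds, control_ipds) -> float:
    """Mean observed IPD divided by mean control IPD at one template position."""
    if not observed_ipds or not control_ipds:
        raise ValueError("need at least one IPD measurement per condition")
    return mean(observed_ipds) / mean(control_ipds)


def candidate_modified_positions(ipds_by_position: dict,
                                 threshold: float = 2.0,
                                 min_passes: int = 3) -> list:
    """Positions whose IPD ratio meets the threshold.

    `ipds_by_position` maps position -> (observed_ipds, control_ipds).
    """
    hits = []
    for pos, (observed, control) in sorted(ipds_by_position.items()):
        if len(observed) < min_passes:
            continue  # too few passes to separate signal from stochastic pauses
        if ipd_ratio(observed, control) >= threshold:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    data = {
        10: ([0.9, 1.1, 1.0], [0.25, 0.25, 0.25]),   # consistently slow
        11: ([0.3, 0.2, 0.25], [0.25, 0.25, 0.25]),  # normal kinetics
        12: ([2.0], [0.25]),                         # slow, but only one pass
    }
    print(candidate_modified_positions(data))
```

Position 12 is skipped despite its long IPD, which reflects the point above: a single slow incorporation cannot be distinguished from a random pause without repeated observations.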
Error Profiles
- HiFi accuracy: ≥Q30 median (99.9%), with many reads ≥Q40. The remaining errors are approximately evenly split between substitutions, insertions, and deletions with no strong sequence-context bias.
- CLR raw accuracy: ~85–90%. Errors are dominated by insertions (~70% of errors) and deletions (~20%), with substitutions rare (~10%). The insertion bias arises when the polymerase holds a labeled nucleotide long enough to emit a detectable pulse but releases it without incorporating it (a cognate sampling event).
- Homopolymers: HiFi resolves homopolymers well because multiple independent passes provide consensus. CLR struggles with homopolymers due to the insertion-dominant error mode.
- GC bias: Minimal. SMRT sequencing shows essentially flat coverage across GC content from ~20–80%, substantially better than Illumina's known GC bias.
Output and Run Times
| Parameter | Revio (SPRQ) | Vega |
|---|---|---|
| SMRT Cells per run | Up to 4 (simultaneously) | 1 |
| ZMWs per SMRT Cell | 25 million | 8 million |
| HiFi output per SMRT Cell | ~100–120 Gb | ~25 Gb |
| HiFi output per day (Revio) | ~480 Gb | — |
| HiFi read length | 10–25 kb (mean ~15 kb) | 10–25 kb |
| HiFi accuracy | ≥Q30 (≥99.9%) | ≥Q30 |
| Run time | ~24 hours per SMRT Cell | ~24 hours |
| 30x genomes/year (Revio) | ~2,500 | — |
Roche Sequencing by Expansion (SBX)
Roche's Sequencing by Expansion (SBX), publicly unveiled in February 2025, represents an entirely new category of sequencing technology. Rather than reading DNA directly through a nanopore or detecting fluorescent nucleotides during synthesis, SBX first converts the DNA sequence into an expanded surrogate polymer called an Xpandomer, then reads that Xpandomer through a nanopore. The expansion step amplifies the physical spacing between base-encoded reporter elements by approximately 50-fold, overcoming the fundamental spatial resolution limitations of direct nanopore sequencing. The technology is currently in late-stage development and not yet commercially available.
In October 2025, Roche, Broad Clinical Labs, and Boston Children's Hospital announced a Guinness World Record for the fastest DNA sequencing technique: a complete human genome was sequenced and analyzed (blood sample to annotated VCF) in under 4 hours using SBX, beating the previous record of 5 hours and 2 minutes. The team subsequently demonstrated a same-day workflow from neonatal ICU blood draw to actionable clinical report in under 8 hours — fast enough to keep pace with a high-volume NICU. The work was described in the New England Journal of Medicine. At the same ASHG conference, Roche demonstrated 15 billion reads generated in a single hour of sequencing, underscoring SBX's raw throughput ambitions.
Library Structure
SBX supports two distinct library preparation modes:
- SBX-Duplex (SBX-D): Uses a hairpin adapter that physically links the two complementary strands of a DNA duplex. Insert sizes of 200–350 bp are sequenced, with both strands read and computationally combined for higher consensus accuracy. This is analogous in principle to PacBio's HiFi circular consensus but achieved through a different mechanism (physical strand linkage rather than rolling-circle resequencing).
- SBX-Simplex: Single-stranded library preparation supporting read lengths from <200 bp up to 1,500 bp. Faster throughput but lower per-read accuracy than duplex.
The Expansion Chemistry
This is the defining innovation of SBX. The process converts a standard DNA molecule into an Xpandomer through the following steps:
- X-NTP incorporation: A heavily engineered Y-family translesion polymerase called XP Synthase (with >10% of its residues mutated from wild type) copies the DNA template using expandable nucleotide triphosphates (X-NTPs) instead of natural dNTPs. Each X-NTP monomer weighs approximately 20 kilodaltons — enormous compared to natural nucleotides (~500 Da).
- X-NTP architecture: Each X-NTP contains four functional elements: (1) a reporter code that uniquely identifies the base (A, C, G, or T); (2) a translocation control element that enables precise, stepwise movement through the nanopore; (3) processivity-enhancing moieties; and (4) an acid-cleavable bond in the phosphate backbone.
- Expansion: After synthesis, the Xpandomer is treated with acid, which cleaves the scissile bonds. The Xpandomer unfolds from a condensed structure into an elongated polymer approximately 50 times longer than the original DNA template. This expansion physically separates the reporter codes, making them individually resolvable by a nanopore sensor.
The expansion chemistry takes approximately 2 hours on a benchtop unit with simple fluidics before the expanded molecules are loaded onto the sequencing instrument.
Nanopore Readout (Genia CMOS Array)
The Xpandomer is threaded through biological nanopores embedded in a CMOS sensor array — technology derived from Roche's 2014 acquisition of Genia Technologies. Key specifications:
- Sensor array: Approximately 8 million microwells, each containing a nanopore embedded in a lipid membrane over a CMOS electrode. The array combines electrodes, detection circuits, and analog-to-digital conversion on a single chip. Over 90% of sensors generate useful data.
- Translocation mechanism: Voltage pulses of 1.5–2.0 milliseconds advance the Xpandomer through the pore one reporter code at a time. Unlike ONT's continuous translocation with complex squiggle interpretation, SBX produces four distinct, well-separated current states at constant time intervals — signals so clean they are decodable by visual inspection.
- Reusable sensor modules: The CMOS array is reusable. Lipid membranes can be reformed multiple times on the same chip, substantially reducing per-run consumable costs.
Base Calling and Accuracy
Errors in SBX arise approximately equally from two sources: Xpandomer synthesis errors (XP Synthase misincorporation or slippage, base error rate ~0.7%) and data collection errors (nanopore signal misclassification). Quality scores are binned into three levels (high, medium, low quality).
- Simplex accuracy: Q20+ (≥99%) per read.
- Duplex accuracy: High Q30s (~99.95%+) when combined with custom DeepVariant basecalling models.
- Variant calling (duplex, WGS): SNV F1 >99.80%, InDel F1 >99.70% for HG001 benchmarks.
- GC coverage: Essentially flat from 20% to 80% GC content.
- Homopolymers: >99% F1 for homopolymers under 15 bp in duplex mode. Accuracy degrades for longer homopolymers due to polymerase slippage during Xpandomer synthesis.
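The F1 figures above combine precision and recall against a benchmark truth set. A minimal sketch of the calculation follows; the counts in the example are hypothetical and are not Roche data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall for a set of variant calls."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 998 correct calls, 2 spurious calls, 2 missed variants.
    print(f"{f1_score(998, 2, 2):.4f}")
```

Because F1 penalizes both false positives and missed variants, an SNV F1 above 99.80% implies that both kinds of error are rare on the benchmark sample.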
Throughput and Speed
SBX is designed for extreme throughput and speed. A demonstration run produced seven human genomes at 30x coverage in 1 hour of sequencing time. Total sample-to-VCF turnaround has been demonstrated at 6 hours 25 minutes using simplex chemistry. The data throughput rate is approximately 500 megabases per second per sensor module. The system supports flexible "run until done" sequencing — runs terminate when a target data accumulation threshold is reached, rather than running for a fixed duration.
| Parameter | SBX (Demonstrated) |
|---|---|
| Read modes | Simplex (<200–1,500 bp) and Duplex (200–350 bp) |
| Simplex accuracy | Q20+ (≥99%) |
| Duplex accuracy | Q30+ (high Q30s) |
| Sensor array | ~8 million nanopore microwells (CMOS) |
| Throughput demonstration | 7 × 30x genomes in 1 hour sequencing |
| Sample-to-VCF (simplex) | ~6.5 hours |
| Data rate | ~500 Mb/s per sensor module |
Instrument Architecture
SBX uses a two-instrument workflow: a benchtop expansion unit (handles the 2-hour chemical conversion of DNA to Xpandomers) and a floor-standing sequencer equipped with a large GPU-based compute server for real-time basecalling. The expansion unit has simple fluidics and is designed to be user-friendly. The sequencer accepts the reusable CMOS sensor modules and handles reagent delivery and data acquisition.
Current Limitations
- No epigenetic modification detection: Because SBX reads an Xpandomer surrogate (not native DNA), all base modification information is lost. Roche has acknowledged the possibility of a future "5th signal state" to encode modifications, but this is not yet implemented.
- Not yet commercially available: As of early 2026, SBX is in development, with clinical validation studies underway at institutions including Broad Clinical Labs and Hartwig Medical Foundation.
- Upstream chemistry time: The 2-hour expansion chemistry step adds latency before sequencing can begin.
- Read length ceiling: Simplex reads up to 1,500 bp and duplex inserts up to 350 bp — shorter than ONT or PacBio CLR. Not suitable for applications requiring ultra-long single reads (>10 kb).
Platform Comparison Summary
| Feature | Illumina | Element AVITI | Ultima UG 100 | Sanger | Oxford Nanopore | PacBio (HiFi) | Roche SBX |
|---|---|---|---|---|---|---|---|
| Chemistry | Cyclic reversible termination (SBS) | Sequencing by binding (avidity) + separate incorporation | Non-terminating flow SBS (mnSBS) | Dideoxy chain termination + capillary electrophoresis | Ionic current through protein nanopore | Real-time fluorescent nucleotide incorporation in ZMWs | Xpandomer expansion + nanopore readout (CMOS) |
| Amplification | Bridge amp (random FC) or ExAmp (patterned FC) | Rolling circle amplification (polonies) | Emulsion PCR on beads | Cycle sequencing (linear amplification) | None (single molecule) | None (single molecule) | None (Xpandomer synthesis from single molecule) |
| Surface | Glass flow cell | Low-binding coated flow cell | 200mm silicon wafer | Fused-silica capillary (POP-7 polymer) | Lipid membrane over CMOS array | SMRT Cell (ZMW nanowell chip) | Lipid membrane over 8M-well CMOS array |
| Read type | Paired-end (fixed length) | Paired-end (fixed length) | Single-end (variable length) | Single-end (fixed primer) | Single-end (native strand) | Circular consensus (HiFi) or CLR | Simplex or duplex |
| Max read length | 2×300 (MiSeq/NextSeq) | 2×300 | Median ≥300 (single end) | ~1,000–1,200 bp | No upper limit (>4 Mb demonstrated) | 10–25 kb (HiFi); >100 kb (CLR) | ~1,500 bp (simplex); ~350 bp insert (duplex) |
| Typical quality | >85% ≥Q30 | >90% ≥Q40 | >85% ≥Q30 | Q20–Q40 (700–900 bp window) | Q20–Q25 simplex; Q30+ duplex | ≥Q30 (HiFi); ~Q8–Q10 (CLR) | Q20+ simplex; high Q30s duplex |
| Index hopping | Yes (esp. ExAmp) | None (RCA) | None (emPCR) | N/A (no multiplexing) | None (single molecule) | None (single molecule) | None (single molecule) |
| Dominant error | Substitutions (phasing) | Very low; all types rare | Indels in homopolymers | Miscalls at compressed peaks | Indels in homopolymers | Balanced sub/ins/del (HiFi); insertions (CLR) | ~Equal synthesis + readout errors; homopolymer slippage |
| Homopolymer perf. | Good (1 base/cycle) | Excellent (no error spike) | Good to ≤8–10 bp; degrades beyond | Degrades >8–10 bp (peak merging) | Improved (R10.4.1); still challenging >8–10 bp | Good (HiFi consensus); poor (CLR) | >99% F1 <15 bp (duplex); degrades beyond |
| Library compat. | Native (TruSeq, Nextera) | Illumina-compatible + native Elevate | Requires UG adapters (conversion available) | Any template + primer pair | Native ligation, rapid, or PCR kits | SMRTbell (hairpin adapter ligation) | SBX-specific library prep |
| Epigenetic detection | Requires bisulfite/EM-seq conversion | Requires bisulfite/EM-seq conversion | Requires bisulfite/EM-seq conversion | N/A | Native (direct modification calling) | Native (kinetic IPD analysis) | Not supported (reads Xpandomer, not native DNA) |
| Throughput (per run unless noted) | Up to 16 Tb (NovaSeq X Plus, 2 FC) | ~600 Gb (2 FC) | ≥2.5–3.0 Tb per wafer | ~1.2 Mb/day (3730xl) | Up to 14 Tb (PromethION P48) | ~480 Gb/day (Revio, 4 SMRT Cells) | ~7 × 30x genomes/hour (demonstrated) |
| Cost per Gb | $2–$6 | $5–$7 | ~$1 | ~$500–$2,000 (not meaningful at scale) | $3–$20 (instrument-dependent) | $8–$15 | TBD (targeting <$2 at scale) |
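The quality and cost rows above are easier to compare once converted to per-base error rates and per-genome costs. The following is a minimal sketch, not a pricing model: the function names are hypothetical, the 30x depth and ~3.1 Gb genome size are illustrative assumptions, and the table does not state what each cost-per-Gb figure includes, so the result covers sequencing output only. The Phred relation P = 10^(-Q/10) is the standard definition (Ewing & Green 1998).

```python
def phred_to_error_prob(q: float) -> float:
    """Per-base error probability for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def reagent_cost_per_genome(cost_per_gb: float,
                            coverage: float = 30.0,
                            genome_gb: float = 3.1) -> float:
    """Approximate sequencing cost for one genome from a cost-per-Gb figure.

    coverage and genome_gb are illustrative assumptions (30x depth,
    ~3.1 Gb human genome); they are not taken from the table.
    """
    return cost_per_gb * coverage * genome_gb


if __name__ == "__main__":
    # Translate the quality thresholds used in the "Typical quality" row.
    for q in (20, 30, 40):
        p = phred_to_error_prob(q)
        print(f"Q{q}: about 1 error per {round(1 / p):,} bases")

    # Cost-per-Gb ranges copied from the "Cost per Gb" row (USD).
    ranges = {"Illumina": (2, 6), "Element AVITI": (5, 7), "PacBio": (8, 15)}
    for platform, (low, high) in ranges.items():
        lo_cost = reagent_cost_per_genome(low)
        hi_cost = reagent_cost_per_genome(high)
        print(f"{platform}: roughly ${lo_cost:,.0f} to ${hi_cost:,.0f} per 30x genome")
```

Under these assumptions a 30x genome needs about 93 Gb of output, so each $1/Gb adds roughly $93 per genome. Library prep, labor, and instrument costs are outside this estimate.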
Practical Decision Guide
When to Choose Each Platform
- Illumina: Broadest application compatibility and the most mature ecosystem, with the largest selection of reagents and kits. Best for labs needing maximum flexibility across diverse applications (RNA-seq, ChIP-seq, ATAC-seq, amplicon, exome, WGS, etc.). Best paired-end performance for applications like structural variant detection and mate-pair analysis.
- Element AVITI: Best raw data quality (Q40–Q50+), no index hopping, excellent for single-cell (10x Genomics), WGS, and any application where data quality directly impacts sensitivity. Compatible with existing Illumina libraries. Ideal for mid-throughput labs wanting Illumina-equivalent applications at higher quality and competitive cost.
- Ultima UG 100: Lowest cost per genome for ultra-high-throughput WGS. Purpose-built for population-scale studies, large biobanks, liquid biopsy, and MRD detection (via ppmSeq). Less flexible for applications requiring paired-end reads or diverse library types. Best when you need tens of thousands of genomes.
- Sanger: Gold standard for validation, confirmatory testing, and small-scale sequencing (<50 amplicons). Universally accepted by regulatory bodies. Ideal for verifying NGS-called variants, sequencing individual plasmid constructs, genotyping known mutations, and any scenario requiring the highest per-base confidence on a small number of targets. Not practical for discovery or high-throughput work.
- Oxford Nanopore: Best for applications requiring ultra-long reads (structural variant detection, de novo assembly, gap filling, full-length transcript isoform sequencing), native epigenetic modification profiling (5mC, 6mA, RNA modifications), real-time adaptive sampling, rapid pathogen identification, and field-deployable sequencing. The MinION’s portability is unmatched. Choose ONT when read length, modification detection, or time-to-answer matters more than raw per-base accuracy.
- PacBio (HiFi): Best combination of long reads and high accuracy. HiFi reads at 10–25 kb and ≥Q30 excel at de novo genome assembly (including phased diploid assemblies), full-length isoform sequencing (Iso-Seq), structural variant detection, CpG methylation calling, and resolving complex repetitive regions. The Revio enables population-scale long-read WGS. Choose PacBio when you need long reads with short-read-equivalent accuracy.
- Roche SBX: Targeting ultra-fast, high-throughput clinical WGS with sample-to-answer in under 7 hours. If performance targets hold, SBX will be compelling for rapid diagnostic genomics, population-scale screening, and any setting where turnaround time and throughput per dollar are paramount. Currently in development — monitor for commercial availability and independent benchmarking.
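The guidance in this list can be roughed out as a small rule table. This is only a sketch of the reasoning, assuming a handful of yes/no project requirements. The `Needs` fields and the `suggest_platforms` function are hypothetical names introduced here, and Roche SBX is left out because the list describes it as still in development.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    """Hypothetical yes/no project requirements (illustrative only)."""
    small_scale_validation: bool = False    # e.g. confirming a few NGS-called variants
    ultra_long_or_realtime: bool = False    # ultra-long reads, portability, adaptive sampling
    long_and_accurate: bool = False         # long reads at short-read-equivalent accuracy
    population_scale_wgs: bool = False      # tens of thousands of genomes
    highest_raw_quality: bool = False       # sensitivity limited by per-base quality


def suggest_platforms(needs: Needs) -> list[str]:
    """Return candidate platforms in the order the rules fire.

    More than one platform can match; Illumina is the fallback because the
    guide describes it as the most flexible general-purpose option.
    """
    rules = [
        (needs.small_scale_validation, "Sanger"),
        (needs.ultra_long_or_realtime, "Oxford Nanopore"),
        (needs.long_and_accurate, "PacBio (HiFi)"),
        (needs.population_scale_wgs, "Ultima UG 100"),
        (needs.highest_raw_quality, "Element AVITI"),
    ]
    matches = [platform for wanted, platform in rules if wanted]
    return matches or ["Illumina"]


if __name__ == "__main__":
    print(suggest_platforms(Needs()))
    print(suggest_platforms(Needs(ultra_long_or_realtime=True, long_and_accurate=True)))
```

Returning every match rather than a single winner reflects that these recommendations overlap. A project needing both ultra-long reads and high accuracy, for example, may reasonably combine platforms.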
Key Technical Caveats
- Illumina: Watch for index hopping on patterned flow cells; use UDI. Quality decays toward read ends. Adapter dimer contamination is especially problematic on ExAmp instruments.
- Element AVITI: Requires library circularization (though Cloudbreak Freestyle now automates this). Higher loading concentrations needed. Not yet compatible with every niche library type.
- Ultima UG 100: Single-end only. Variable read length complicates some pipelines. Homopolymer indels require adapted variant callers (DeepVariant/GATK with flow-based models). Large minimum batch size (~10B reads/wafer) limits flexibility for small projects.
- Sanger: Cost scales linearly with the number of targets, with no batching economy. Cannot reliably detect low-frequency variants (below ~15–20% allele fraction). Not practical for anything beyond a few hundred reads. Requires per-target primer design.
- Oxford Nanopore: Homopolymer accuracy remains the primary limitation, especially for indel-sensitive variant calling. Simplex accuracy (Q20–Q25) is lower than other platforms; duplex or high coverage is needed for confident variant calls. Flow cell pore lifetime (≤72 hours) limits per-run output. Systematic context-dependent errors limit the consensus accuracy ceiling for some motifs.
- PacBio: HiFi reads are capped at ~25 kb by polymerase processivity (the polymerase must complete multiple passes of the same molecule). Higher cost per Gb than short-read platforms. CLR mode has high raw error rates (~15%) and requires specialized assembly algorithms. SMRT Cell loading optimization is critical: underloading wastes ZMWs, while overloading produces multi-molecule wells.
- Roche SBX: Not yet commercially available. No epigenetic modification detection. Read lengths shorter than ONT or PacBio. Duplex insert size limited to ~350 bp. Requires 2-hour upstream expansion chemistry. Independent performance validation is still pending.
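The PacBio loading caveat above (underloading wastes ZMWs, overloading produces multi-molecule wells) can be illustrated with a simple Poisson model. This is an idealized sketch, assuming molecules land in wells independently with a mean of `mean_per_well` per ZMW. It is not PacBio's loading model, and `zmw_occupancy` is a hypothetical helper name.

```python
import math


def zmw_occupancy(mean_per_well: float) -> dict[str, float]:
    """Fractions of empty, single-molecule, and multi-molecule wells.

    Assumes Poisson loading with the given mean number of molecules per
    well (an idealization; real loading can deviate from it).
    """
    empty = math.exp(-mean_per_well)        # P(0 molecules)
    single = mean_per_well * empty          # P(exactly 1 molecule)
    multi = 1.0 - empty - single            # P(2 or more molecules)
    return {"empty": empty, "single": single, "multi": multi}


if __name__ == "__main__":
    for mean in (0.25, 0.5, 1.0, 2.0):
        occ = zmw_occupancy(mean)
        print(f"mean={mean:<4} empty={occ['empty']:.2f} "
              f"single={occ['single']:.2f} multi={occ['multi']:.2f}")
```

Under this model the single-occupancy fraction peaks at a mean of one molecule per well, where it equals 1/e (about 37%). Lower means leave more wells empty and higher means push more wells into multi-molecule occupancy, which is the trade-off the caveat describes.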
Document generated March 2026. Technical details sourced from manufacturer documentation, peer-reviewed publications, and core facility protocols. Key references: Sanger et al. 1977 PNAS (chain termination); Ewing & Green 1998 Genome Res (Phred scores); Eid et al. 2009 Science (ZMW single-molecule sequencing); Wenger et al. 2019 Nat Biotechnol (HiFi CCS); Arslan et al. 2023 Nat Biotechnol (Element avidity sequencing); Almogy et al. 2022 bioRxiv (Ultima flow SBS); Jain et al. 2018 Nat Biotechnol (nanopore R9.4); Castro-Wallace et al. 2017 Sci Rep (ISS nanopore sequencing); Nurk et al. 2022 Science (T2T-CHM13 assembly); Kokoris et al. 2025 bioRxiv (Roche SBX); Broad Clinical Labs / Roche 2025 NEJM (SBX Guinness record).