Terminology
SeqCenter’s curated glossary of essential terms provides an introduction to much of the lexicon used in genomics and can serve as an aid in understanding our rapidly evolving industry. Our goal is to empower you with knowledge, giving you the ability to approach sequencing projects and literature with clarity and confidence.
The NCBI also hosts a plethora of resources, including an extensive list of terms specific to NCBI submissions and hosted reference data (including GenBank and RefSeq databases).
A
Accession Number
A unique identifier for a published version of a sequence in any of NCBI’s databases, such as GenBank (ex. DQ977253.1). Typically used to refer to a specific assembly or annotation but can refer to other types of sequences, including raw data. Individual genomes may have multiple assemblies, annotations and versions of each, so the accession number is used for exact denotation.
Adapter
Highly specific helper sequence and/or complex bound to the fragments of DNA to be sequenced. These vary from platform to platform but are essential to sequencing. They allow the sequencing library to be recognized and processed by the sequencer, often binding or guiding the fragment to the sequencing surface or opening. If used, these will include the sequencing index. Illumina’s P5 and P7 sequences are examples of sequencing adapters.
Amplicon
A targeted segment of gDNA that is selectively amplified using site-specific primers (ex. 16S/ITS sequencing).
Amplicon Sequencing
A sequencing strategy that will result in data for only a targeted portion of sequence, generated during library preparation with primer-specific PCR amplification. This approach can be processed on any platform, but library design and preparation will vary. 16S/ITS2 sequencing is a common example of this approach. Amplicon sequencing can be contrasted with Whole Genome Sequencing.
Amplicon Sequencing Variant (ASV)
The ASV designation is assigned to unique amplicon sequences generated in 16S/ITS sequencing once low-confidence reads have been filtered out, before taxonomic assignment. ASVs capture the diversity within the sequenced pool of amplicons at the level of individual sequence differences.
Annotation
Characterization of the structural and functional components of a genome assembly to provide functional biological context, such as the denotation of open reading frames (ORFs), introns, exons, repetitive sequences, etc. A programmatic annotation will predict ORFs, assign unique locus tags, and attempt to assign function or gene name by querying a database for homologous sequences. (This descriptive information is not typically included in FASTA format file types.)
The two most common annotation types are GenBank and RefSeq. These may differ from one another for a specific genome in terms of gene content, and each will use a different set of locus tags.
Assembly
A representation of an organism’s genome as a collection of chromosomes, plasmids, and unplaced sequences. These are often built through a combination of different sequencing approaches and assembly methods, and can have varying levels of completeness. Larger and more complex genomes may need long read sequencing, and even primer walking, to resolve the full structure of the genome. (Ex. A draft human genome was published in 2003, but the first “end-to-end” genome was published on March 31, 2022.)
NCBI hosts a large database of reference assemblies along with a wealth of information about the characteristics of the data it contains and how to navigate it.
B
Basecalling
The programmatic process of interpreting the raw sequencer signal (often visual or electrical) and converting it into an A, T, C, or G. Because this is a mathematical interpretation of potentially noisy data, a quality score can also be assigned to each base to indicate the level of confidence in the call.
C
Contig
An area of continuous sequence that is created through the merger of overlapping sequencing reads. Most commonly, a contig is created in the process of making an assembly.
Coverage
The number of sequencing reads present at a particular genomic locus, typically reported as a multiplier (ex. 30x coverage). The average coverage for a sample can be estimated by dividing the total expected number of basepairs in the raw sequencing data by the expected genome size. Recommended coverage will vary with intended downstream analysis, sequencing platform, and expected genome size and complexity.
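The average coverage estimate described above is a single division. The sketch below works it through for a hypothetical run; the function name, read counts, and genome size are illustrative assumptions, not values from any real project.

```python
def estimate_coverage(total_bases: int, genome_size: int) -> float:
    """Average coverage = total sequenced base pairs / expected genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be a positive number of base pairs")
    return total_bases / genome_size


# Hypothetical run: 1,000,000 read pairs of 2 x 150 bp against a 5 Mbp genome.
read_pairs = 1_000_000
read_length = 150
total_bases = read_pairs * 2 * read_length  # both reads of each pair count

coverage = estimate_coverage(total_bases, genome_size=5_000_000)
print(f"Estimated average coverage: {coverage:.0f}x")
```

This is only an average; coverage at any particular locus can be much higher or lower.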
D
Draft genome/assembly
An initial attempt to reconstitute the most likely representation of a given chromosome. Draft genomes typically lack empirical validation of the assembly and represent a hypothetical sequence. This category of genomes corresponds to the “contig,” “scaffold,” and “chromosome” assembly levels on NCBI. If possible, draft genomes should not be used as references for variant calling or RNA analysis as draft genomes usually contain gaps, leading to missed calls or counts.
F
FASTA format
A text-based file format for representing biological sequences. Each sequence has two characteristic parts: a single line description followed by sequencing data. The descriptor starts with the greater-than symbol (>) followed by the sequence name. The sequencing data represents nucleotides using the single-letter code [A,C,G,T,N] (files ending in .faa hold amino acid sequences instead). This file type is often used for draft genomes or published genomes. These files can be opened with standard text editors but can be very large. This format includes file types ending in .fasta, .faa, .fna, and .fa.
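To make the two-part FASTA layout concrete, here is a minimal sketch of a reader. It assumes the sequence name is the first whitespace-delimited word after the ">" and that sequence data may wrap across several lines; the function name and example records are made up for illustration.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence name to its full sequence."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Descriptor line: keep only the first word as the name.
            words = line[1:].split()
            name = words[0] if words else ""
            chunks[name] = []
        elif name is not None:
            # Sequence data may be wrapped over multiple lines.
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


example = [
    ">contig_1 hypothetical example record",
    "ACGTN",
    "GGCC",
    ">contig_2",
    "TTAA",
]
print(parse_fasta(example))
```

For real files, an established parser is usually a safer choice than a hand-rolled one.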
FASTQ format
A text-based file format for representing biological sequences and the corresponding quality score for each base. Each sequence contained in the file has four distinct lines:
- A single line description, always starting with the at symbol (@) followed by the sequence name.
- Sequencing data, representing base pairs using the single-letter code [A,C,G,T,N].
- Quality score identifier line, always a single (+) plus sign.
- Quality scores, represented in ASCII characters.
This file type is commonly used for raw sequencing data. These files can be opened with standard text editors but can be very large. This format includes file types ending in .fq and .fastq.
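The four-line record structure can be sketched as a small reader. This assumes the standard FASTQ convention of a header line beginning with "@", unwrapped sequence and quality lines, and input small enough to hold in memory; the function name and example records are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality string) for each four-line record."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality lines differ in length")
        yield header[1:], sequence, quality


example = [
    "@read_1\n", "ACGT\n", "+\n", "IIII\n",
    "@read_2\n", "GGNA\n", "+\n", "II!I\n",
]
records = list(parse_fastq(example))
for name, sequence, quality in records:
    print(name, sequence, quality)
```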
Flowcell
The physical space where the sequencing process takes place. These include a mechanism that will recognize and capture corresponding library adapters attached to individual DNA fragments as they flow across. The exact mechanisms, adapters, and flowcell architecture are specific to the different sequencing platforms. In general, the size of the flowcell dictates the total number of fragments that can be sequenced on one run and affects the total output in turn.
Fragment
A general term to describe individual molecules of nucleic acid, but typically refers to the molecule as a whole. (Ex. fragment length would refer to the entire length of a DNA molecule.)
Fragmentation
The process of breaking gDNA into short fragments for NGS, which can be done via any number of methods, including physical shearing, enzymatic tagmentation, or heat-based methods.
G
GenBank database
An open access database for nucleotide sequences and their protein translations. This database allows for multiple sequences for each unique organism, and each assembly and annotation will have a unique GenBank accession number. This is hosted on NCBI, alongside the RefSeq database, and some organisms and biosamples will have distinct listings in both.
GenBank format
A text-based file format that supports both nucleotide and annotation information for each locus, including additional metadata. This format includes file types ending in .gbk, .gbff, and .gb. (Files ending in .gff or .gff3 use GFF, a separate tab-delimited annotation format.)
H
Hamming distance
A general information theory term. In regards to sequencing, this is the number of positions at which two sequences of equal length differ. This is important when designing barcodes for multiplexing and when setting stringency during demultiplexing, to account for sequencing error and any amplification errors during library preparation.
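As a minimal sketch, the Hamming distance between two equal-length sequences can be computed by counting mismatched positions. The barcode set below is made up for illustration; the smallest pairwise distance in a set indicates how many errors a barcode can absorb before it starts to resemble another one.

```python
from itertools import combinations
from typing import Sequence


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: Sequence[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming_distance(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical 6 bp barcodes, not taken from any real kit.
barcodes = ["ACGTAC", "TGCATG", "ACGTTG"]
print(hamming_distance("ACGTAC", "ACGTTG"))  # differs at the last two positions
print(min_pairwise_distance(barcodes))
```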
I
Indexes (or indices)
Also known as “barcodes” or “tags”; these are short DNA sequences that are added during library preparation to allow multiplexing of many samples into one sequencing run. A unique index is used for each sample on a sequencing run to allow for downstream binning and individual analysis. Without unique barcodes, separating the data for individual samples can be impossible.
Insert
The DNA sequence that is intended to be sequenced, which is flanked by the sequencing adapters in the final library. (Insert length is not the same as the library fragment size and will be considerably shorter.)
L
Lane
Specific to Illumina, an individual channel on a flowcell. Our newest technology, the NovaSeq X Plus, features 8 lanes that can be loaded independently.
Library Preparation
The process of preparing a nucleic acid sample to a format that can be sequenced. The process can include, but is not limited to, fragmentation, attaching adapters and indexes, PCR amplification, and a final clean-up. For RNA, this will include DNase treatment and reverse transcription to generate cDNA as the first step. The exact techniques required will depend on the sequencing platform and final goals, and can be custom or from a kit.
Long read sequencing
Also called single molecule sequencing or third generation sequencing, long read sequencing refers to sequencing platforms that sequence individual nucleic acid molecules. Reads are typically over 1 kbp and are theoretically unlimited in length. Because these platforms often sequence native DNA molecules, they can often detect base modifications, though the sequencer’s readings must be interpreted to predict the modifications (similar to basecalling), and the algorithms are often trained on sets of methylation patterns of specific organisms. While longer fragment lengths are helpful for many downstream analyses, they often come at the cost of lower per-base accuracy and more stringent sample requirements.
M
Metagenomics
The study of microbial communities through sequencing community DNA from a bulk set of organisms in a single sample, such as eDNA from environmental soil samples.
Multiplexing
A sequencing technique allowing multiple individual samples to be mixed and sequenced together then later bioinformatically pulled apart, effectively subdividing the very large output of NGS flowcells among multiple samples and decreasing the per-sample cost of sequencing. To accomplish this, short identifiable primer sequence(s), called index(es), are attached to the DNA fragments during library preparation, which allow many samples to be pooled together into a single tube for NGS (See Hamming Distance).
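The "pulled apart" step (demultiplexing) can be sketched as matching each read's index sequence against a sample sheet while tolerating a small number of mismatches. This is a simplified illustration, not how any particular sequencer's software works; the sample names, index sequences, and `max_mismatches` default are assumptions.

```python
from typing import Dict, Optional


def count_mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(
    index_read: str,
    sample_indexes: Dict[str, str],
    max_mismatches: int = 1,
) -> Optional[str]:
    """Return the one sample whose index matches, or None if unassignable."""
    matches = [
        sample
        for sample, index in sample_indexes.items()
        if len(index) == len(index_read)
        and count_mismatches(index, index_read) <= max_mismatches
    ]
    # Reads matching no sample, or more than one, are left undetermined.
    return matches[0] if len(matches) == 1 else None


# Hypothetical sample sheet.
sample_indexes = {"sample_A": "ACGTAC", "sample_B": "TGCATG"}

for index_read in ["ACGTAC", "ACGTAA", "GGGGGG"]:
    print(index_read, "->", assign_sample(index_read, sample_indexes))
```

Allowing more mismatches recovers more reads but is only safe when the indexes in the pool are far enough apart from one another.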
N
Nanodrop
A spectrophotometric method for assessing DNA or RNA concentration and purity. It is typically less reliable than fluorimetric methods such as Qubit because it also detects contaminants, such as the other type of nucleic acid, carbohydrates, and proteins, that absorb at wavelengths shared with nucleic acids.
Native DNA
Unmodified genomic DNA directly sourced from living tissue or cells. Native DNA has not been amplified or enriched and therefore contains host driven modifications like methylation patterns used in epigenetic studies.
Next Generation Sequencing (NGS)
In the early 2000s, the “next generation” of sequencing introduced massively parallelized processes, which offer improved scalability and speed over previous sequencing technologies. The term NGS encompasses many platforms (like Illumina and Nanopore) and includes both short and long read platforms.
Nucleotide diversity
Within sequencing, this is an assessment (often qualitative) of the distribution of the 4 nucleotides at any given site within a final library. If low diversity is expected at any point along the collective fragments, the entire library is treated as low diversity when sequencing. For example, a shotgun WGS library is expected to have high nucleotide diversity at any given position on a read, while a TnSeq library is expected to have low nucleotide diversity due to the conserved motifs necessary for the method. Optical sequencing platforms struggle with low nucleotide diversity as the cameras work through the detection of intensity differences. Low nucleotide diversity must be artificially increased for optimal sequencer performance, often achieved through the addition of control DNA spike-ins, such as PhiX. Within molecular genetics, nucleotide diversity is a quantitative measurement of polymorphism at a given site and measure of genetic diversity calculated through the analysis of sequencing data.
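In the sequencing sense, nucleotide diversity can be inspected by tabulating the fraction of each base at every read position. The sketch below does this for a handful of made-up reads whose first two positions are conserved; the function name and reads are illustrative assumptions, and real tools report this per sequencing cycle across millions of reads.

```python
from collections import Counter
from typing import Dict, List, Sequence


def base_fractions_by_position(reads: Sequence[str]) -> List[Dict[str, float]]:
    """Fraction of A, C, G, and T observed at each position across reads."""
    if not reads:
        raise ValueError("At least one read is required")
    length = min(len(read) for read in reads)  # compare shared positions only
    fractions = []
    for position in range(length):
        counts = Counter(read[position] for read in reads)
        total = sum(counts.values())
        fractions.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return fractions


# Hypothetical library: positions 1-2 are a conserved motif, 3-4 are varied.
reads = ["ACGT", "ACTA", "ACCG", "ACAC"]
for position, fraction in enumerate(base_fractions_by_position(reads), start=1):
    print(position, fraction)
```

A position where one base has a fraction near 1.0 is a low diversity position; a balanced position shows roughly 0.25 for each base.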
O
Operational Taxonomic Units (OTUs)
Groups of similar organisms, formed as 16S/ITS amplicon sequences are binned based on ≥97% similarity. Once bins are formed, each bin is assigned a taxonomy based on the consensus similarity to available databases. Although the OTU analysis system has been used historically for 16S/ITS amplicons, SeqCenter follows more recent developments in the field and instead focuses on the ASV analysis system. Adoption of this more granular approach has been made possible by continual improvements in computing power and growing databases.
P
Paired-end (PE) read
The output format from an Illumina sequencing run structured to sequence from both ends of an insert, working from the outside in. This output comes as a pair of FASTQ files, one for read 1 (R1) and one for read 2 (R2), with R2 in the reverse-complemented orientation relative to R1. The sequences in the two files are in matching order, i.e. the first sequence in each file corresponds to the same cluster. In general, per-base quality scores decrease over the course of sequencing a given read, with the lowest quality at the end. To circumvent signal drop-off, Illumina implemented paired-end turnaround, in which on-instrument chemistry allows the cluster of interest to be sequenced in the opposite direction, restarting the sequencing procedure and drastically increasing the generated read quality. Recent advancements in sequencing technologies are pushing these boundaries.
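The orientation of R2 relative to R1 can be illustrated with a toy insert. In this sketch, R1 reads inward from one end and R2 reads inward from the other end on the opposite strand, so reverse-complementing R2 recovers the far end of the insert as written on the R1 strand. The insert sequence and read length are made up for the example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


# Hypothetical 10 bp insert and 6 bp reads.
insert = "ACGTTGCAAG"
read_length = 6

r1 = insert[:read_length]                      # reads inward from one end
r2 = reverse_complement(insert)[:read_length]  # reads inward from the other end

print("R1:", r1)
print("R2:", r2)
print("R2 reverse-complemented:", reverse_complement(r2))
print("End of insert:          ", insert[-read_length:])
```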
PhiX spike-in
An addition of a proprietary random library to increase the nucleotide diversity of a sequencing run. The exact percentage recommended or required will vary depending on library design. PhiX is Illumina’s proprietary sequencing library of phage PhiX that is used as a sequencing control.
Programmatic analysis
The use of a published software pipeline to analyze raw sequencing data, with minimal oversight, to answer a specific question or accomplish a task. These tools are meant to be used to make predictions, and further experimental testing should be done to confirm results.
Purchase order (PO)
An internal tracking number for funds authorized for a specific use from an institution. Informally, POs set aside funds for a specific use, tell a vendor that they are authorized to render specific goods or services at a set price, and allow a vendor to ask an institution for actual payment on an invoice. POs must be requested from the customer’s institution.
Q
Quality
An ambiguous term. Before sequencing, this can be an assessment of the raw sample in terms of purity, fragment length, or concentration. After sequencing, this can refer to the confidence of the basecall of a given nucleotide base. Sequencing data quality is reported as a quality score or a Phred score and is encoded within FASTQ files alongside the nucleotide sequences.
Quality score
An indicator of basecalling accuracy that predicts the probability of a basecalling error, traditionally using the Phred scale, where Q = −10 × log10(P) and P is the probability that the basecall is wrong. (In other words, Q30 is equivalent to a 1 in 1,000 chance of error, or 99.9% accuracy.) The Broad Institute offers additional details.
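The Phred relationship can be worked through in a few lines. The sketch below converts between quality scores and error probabilities and decodes a FASTQ quality string; the offset of 33 (the common "Phred+33" encoding) is an assumption, since some older files use a different offset.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10), the inverse of Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("Error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """Convert ASCII quality characters to scores (assumes Phred+33)."""
    return [ord(character) - offset for character in quality]


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")

print(decode_quality_string("!5?I"))
```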
Qubit
A proprietary fluorimeter (Thermo Fisher Scientific) that uses fluorescent dyes selective for specific nucleic acids and their forms, such as dsDNA, ssDNA, or RNA. Unlike traditional methods, unbound fluorophore will not elicit an excitation response. This provides an accurate concentration of the target molecule type that will not be affected by tertiary contaminants. Other similar technologies exist, such as PicoGreen.
R
Reference genome
A genome assembly that is validated by empirical research and provides a fair representation of chromosomes within an organism. Most reference genomes are assembled, annotated, and submitted through peer-review. A reference genome should not be confused with a draft genome. (Informally, this can also refer to the genome assembly that will be used to compare sample data against.)
RefSeq database
A non-redundant database of reference sequences that are updated and curated by NCBI staff. While the updated information can be useful, this can add hurdles to comparisons with previous studies. This database is hosted on NCBI alongside the GenBank database, and some organisms and biosamples will have distinct listings in both.
S
Sanger sequencing
A low-throughput process of selective incorporation of chain-terminating dideoxynucleotides (ddNTPs) exhibiting base-specific fluorescence by DNA polymerase during in vitro DNA replication. (NOTE: This service is not offered at SeqCenter.)
Short read sequencing
Sequencing platforms that typically produce reads less than 1 kbp. These platforms typically feature high throughput and high accuracy.
Shotgun
A sequencing strategy (and library design) built on random fragmentation of the material, named for the wide spray of a shotgun shell. With this approach, the original full sequence can ideally be reconstructed programmatically by piecing together the staggered, overlapping sequences that random fragmentation produces. This approach was developed because sequencing very long nucleic acid molecules accurately was not originally possible, and it is still expensive and difficult to do.
Single-end (SE) read
The data produced when sequencing a library-prepped fragment from only one side, which outputs only a single read. This will produce one sequencing file per sample.
Strandedness
The ability of an RNA library preparation method to preserve the strand orientation information of the original RNA transcripts during library preparation. There are different methods for achieving this. Please see our Illumina RNA sequencing page for more information on how the Illumina Stranded RNA library preparation works.
T
Tagmentation
An enzymatic fragmentation method in which transposase enzymes are used to break a DNA molecule into smaller sections. This can be free floating or bead based. The Illumina DNA library preparation is a bead-based tagmentation method that simultaneously attaches a portion of the sequencing adapters to the insert.
Third Generation Sequencing
See Long Read Sequencing.
Transcriptome
The set of all transcripts in a cell.
V
Variant calling (VC) analysis
At SeqCenter, this is a bioinformatic analysis to predict significant variants from raw sequencing data compared to an existing genome assembly. The software will take into account coverage, abundance, and expected sample homogeneity (default is uniform/clonal) when making a decision. Please see our Haploid Variant Calling analysis page for more information about our VC services.
W
Whole Exome Sequencing (WES), Human
Targeted sequencing of the majority of the protein-coding regions of the human genome. Because it sequences <2% of the human genome, this is seen as a more cost-effective alternative to human WGS: significantly less data is required to provide both adequate sequencing coverage for downstream analysis and significant amounts of functional information for both clinical and research purposes, including the identification of SNPs that can be used in genome-wide association studies (GWAS) among select cohorts.
Whole Genome Sequencing (WGS)
A sequencing approach that provides data for the genome(s) of the sample, most often using an unbiased shotgun fragmentation strategy. This can be done on any platform, but the library design and preparation will vary with the platform. WGS can be contrasted to amplicon sequencing.