
Genomic Informatics glossary & taxonomy
Evolving Terminology for Emerging Technologies
Comments? Questions? Revisions? 
Mary Chitty MSLS 
mchitty@healthtech.com
Last revised July 09, 2019




Related glossaries include: Drug Discovery & Development, Drug Targets, Molecular Diagnostics
Informatics: Drug Discovery Informatics, Bioinformatics, Cheminformatics, Ontologies & Taxonomies, Protein Informatics
Technologies: Microarrays, PCR, Sequencing
Biology: Genetic Variations, which covers both technologies for detecting and informatics for interpreting genetic variants.

ab initio gene prediction: Traditionally, gene prediction programs that rely only on the statistical qualities of exons have been referred to as performing ab initio predictions. Ab initio prediction of coding sequences is an undeniable success by the standards of the machine-learning algorithm field, and most of the widely used gene prediction programs belong to this class of algorithms. It is impressive that the statistical analysis of raw genomic sequence can detect around 77-98% of the genes present ... This is, however, little consolation to the bench biologist, who wants the complete sequences of all genes present, with some certainty about the accuracy of the predictions involved. As Ewan Birney (European Bioinformatics Institute, UK) put it, what looks impressive to the computer scientist is often simply wrong to the biologist. Meeting report "Gene prediction: the end of the beginning" Colin Semple, Genome Biology 2000 1(2): reports 4012.1-4012.3

All ab initio gene prediction programs have to balance sensitivity against accuracy.  Broader term: gene prediction.

AI for genomics : Personalizing treatments and cures 2018 April 16-18, Boston MA The role of computer science in modeling cells, analyzing and mapping data networks, and incorporating clinical and pathological data to determine how diseases arise from mutations is becoming more important in genomic medicine. We need to understand where the disease starts and how artificial intelligence delivers genes and pathways for drug targets and diagnostics. The Inaugural AI for Genomics track explores case studies that apply deep learning, machine learning, and artificial intelligence to genomic medicine. We will discuss data curation techniques, text mining approaches, and statistical analytics that utilize deep machine learning to support AI efforts. This will help to integrate omics approaches to discover disease or drug response pathways and identify personalized and focused treatments and cures. http://www.bio-itworldexpo.com/ai-genomics 

alignment: The process of lining up two or more sequences to achieve maximal levels of  identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.  NCBI BLAST Glossary
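
As an illustration, a minimal Python sketch of pairwise alignment, assuming Biopython is installed; the sequences and scoring values below are arbitrary examples, not recommended settings.

# A minimal sketch of pairwise sequence alignment with Biopython's PairwiseAligner.
# Sequences and scores are illustrative assumptions only.
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"          # use "local" for Smith-Waterman-style local alignment
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

alignments = aligner.align("ACGGTTGACT", "ACGTTGCT")
best = alignments[0]
print(best.score)
print(best)                      # prints the aligned sequences with gap characters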

assembled: The term used to describe the process of using a computer to join up bits of sequence into a larger whole. Peer Bork, Richard Copley "Filling in the gaps" Nature 409: 818-820, 15 Feb. 2001 Related terms: contig assembly, genome assembly

biocomputing:  Biocomputing could be defined as the construction and use of computers which function like living organisms or contain biological components, so-called biocomputers (Kaminuma, 1991). Biocomputing could, however, also be defined as the use of computers in biological research and it is this definition which I am going to use in this essay. With this interpretation of biocomputing the complicated ethical questions connected with concepts like artificial life and intelligence are not dealt with.  Peter Hjelmström, Ethical issues in biocomputing http://www.techfak.uni-bielefeld.de/bcd/ForAll/Ethics/welcome.html  

biological computing: Simson Garfinkel "Biological computing" Technology Review, May/ June 2000 http://www.technologyreview.com/articles/garfinkel0500.asp  Related terms: biocomputing, DNA computing    

BLAST (Basic Local Alignment Search Tool): Software program from NCBI for searching public databases for homologous sequences or proteins. Designed to explore all available sequence databases regardless of whether query is protein or DNA. http://www.ncbi.nlm.nih.gov/BLAST/
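
For illustration only, a minimal sketch of submitting a BLAST search from Python through Biopython's web interface; the query fragment is an invented example, and large or frequent searches would normally use a local BLAST+ installation instead.

# A minimal sketch of a web BLAST query via Biopython (network access assumed).
from Bio.Blast import NCBIWWW, NCBIXML

query_seq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT"   # illustrative fragment
result_handle = NCBIWWW.qblast("blastn", "nt", query_seq)  # program, database, query

record = NCBIXML.read(result_handle)
for alignment in record.alignments[:3]:
    hsp = alignment.hsps[0]
    print(alignment.title[:60], hsp.expect)  # hit title and E-value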

comparative genome annotation:  Recent advances in genome sequencing technology and algorithms have made it possible to determine the sequence of a whole genome quickly in a cost-effective manner. As a result, there are more than 200 completely sequenced genomes. However, annotation of a genome is still a challenging task. One of the most effective methods to annotate a newly sequenced genome is to compare it with well-annotated and closely related genomes using computational tools and databases. Comparing genomes requires use of a number of computational tools and produces a large amount of output, which should be analyzed by genome annotators. Because of this difficulty, genome projects are mostly carried out at large genome sequencing centers. To alleviate the requirement for expert knowledge in computational tools and databases, we have developed a web-based genome annotation system, called CGAS (a comparative genome annotation system; http://platcom.org/CGAS).   CGAS: a comparative genome annotation system. Choi K, Yang Y, Kim S. Methods Mol Biol. 2007;395:133-146   Broader term: genome annotation Related term: Functional genomics comparative genomics

complex genomes: Is there a specific definition of complex genomes?  Or is it a more general category (beyond viral, bacterial,  microbial?)  

computational gene recognition: Interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure and functional class of protein- coding genes. JW Fickett 1996

Gene recognition is much more difficult in higher eukaryotes than in prokaryotes, as coding regions (exons) are often interrupted by non- coding regions (introns) and genes are highly variable in size.  This is particularly so for human genes. As someone remarked sometime ago people have non- coding regions occasionally interrupted by genes.
Broader terms: gene recognition, molecular recognition.

computational genomics: (often referred to as Computational Genetics) refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data (i.e., experimental data obtained with technologies that require the genome sequence, such as genomic DNA microarrays). In combination with computational and statistical approaches to understanding gene function and statistical association analysis, the field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes (rather than individual genes) to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. Wikipedia accessed 2018 March 21 https://en.wikipedia.org/wiki/Computational_genomics  Related terms: Expression, Microarrays

concordance: Similarity of results between different microarray platforms. Related terms: discordance, mismatches

consensus sequence: A theoretical representative nucleotide or amino acid sequence in which each nucleotide or amino acid is the one that occurs most frequently at that site in the different forms which occur in nature. The phrase also refers to an actual sequence that approximates the theoretical consensus. A known CONSERVED SEQUENCE set is represented by a consensus sequence. Commonly observed supersecondary protein structures (AMINO ACID MOTIFS) are often formed by conserved sequences. MeSH, 1991

A sequence of DNA, RNA, protein or carbohydrate derived from a number of similar molecules, which comprises the essential features for a particular function. IUPAC Bioinorganic
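
A minimal sketch of the idea, assuming a set of already-aligned, equal-length sequences (the sequences below are invented): take the most frequent residue in each column.

# A minimal sketch of deriving a consensus from pre-aligned sequences of equal
# length; ties are broken arbitrarily. Input sequences are illustrative.
from collections import Counter

aligned = ["ATGCT", "ATGTT", "ACGCT", "ATGCA"]

consensus = "".join(
    Counter(column).most_common(1)[0][0]      # most frequent residue in the column
    for column in zip(*aligned)               # iterate column-wise over the alignment
)
print(consensus)   # ATGCT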

conserved sequence: A sequence of amino acids in a polypeptide or of nucleotides in DNA or RNA that is similar across multiple species. A known set of conserved sequences is represented by a CONSENSUS SEQUENCE. AMINO ACID MOTIFS are often composed of conserved sequences. MeSH, 1993

A "highly conserved sequence" is a DNA sequence that is very similar in several different kinds of organisms. Scientists regard these cross species similarities as evidence that a specific gene performs some basic function essential to many forms of life and that evolution has therefore conserved its structure by permitting few mutations to accumulate in it. NHGRI

contig: A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information. Short sequences (reads) from a fragmented genome are compared against one another, and overlapping reads are merged to produce one long sequence. This merging process is iterative: overlapping reads are added to the merged sequence whenever possible and so the merged sequence becomes even longer. When no further reads overlap the long merged sequence, then this sequence - called a contig - has reached its maximum length. Ensembl Glossary http://useast.ensembl.org/info/website/glossary.html  

Published genome sequence has many gaps and interruptions. Concept of  "contig" is crucial to our understanding of current limitations. David Galas "Making sense of the sequence" Science 291 (5507): 1257, Feb. 16, 2001  Wikipedia http://en.wikipedia.org/wiki/Contig  
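
A toy Python sketch of the iterative merging idea described above; the reads are invented, the minimum-overlap threshold is an arbitrary assumption, and real assemblers use far more sophisticated graph-based methods that tolerate sequencing errors.

# A toy sketch of greedy contig building: repeatedly merge the pair of reads with
# the longest suffix/prefix overlap until no overlap of at least MIN_OVERLAP remains.
MIN_OVERLAP = 3

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen < MIN_OVERLAP:
            break                      # no sufficient overlap left: contigs are maximal
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"]))  # one merged contig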

contig assembly: One of the most difficult and critical functions in DNA sequence analysis is putting together fragments from sets of overlapping segments. Some programs do this better than others, particularly when dealing with sequences containing gaps. Laura De Francesco "Some things considered" Scientist 12[20]:18, Oct. 12, 1998

DDBJ DNA DataBank of Japan: Shares information daily with EMBL and GenBank. http://www.ddbj.nig.ac.jp/

deep homology: The principle of homology is central to conceptualizing the comparative aspects of morphological evolution. The distinctions between homologous or non-homologous structures have become blurred, however, as modern evolutionary developmental biology (evo-devo) has shown that novel features often result from modification of pre-existing developmental modules, rather than arising completely de novo. With this realization in mind, the term ‘deep homology’ was coined, in recognition of the remarkably conserved gene expression during the development of certain animal structures that would not be considered homologous by previous strict definitions. At its core, it can help to formulate an understanding of deeper layers of ontogenetic conservation for anatomical features that lack any clear phylogenetic continuity. Deep homology in the age of next-generation sequencing, Patrick Tschopp, Clifford J. Tabin, published 19 December 2016. DOI: 10.1098/rstb.2015.0475 http://rstb.royalsocietypublishing.org/content/372/1713/20150475

discordance: Lack of agreement among results from different microarray experiments or platforms. Related terms: concordance, mismatches

distributed sequence annotation: The pace of human genomic sequencing has outstripped the ability of sequencing centers to annotate and understand the sequence prior to submitting it to the archival databases. Multiple third-party groups have stepped into the breach and are currently annotating the human sequence with a combination of computational and experimental methods. Their analytic tools, data models, and visualization methods are diverse, and it is self-evident that this diversity enhances, rather than diminishes, the value of their work. Lincoln Stein, et al. Distributed Sequence Annotation, 2000 http://biodas.org/documents/rationale.html

DNA computers: Seeks to use biological molecules such as DNA and RNA to solve basic mathematical problems. Fundamentally, many of these experiments recapitulate natural evolutionary processes that take place in biology, especially during the early evolution of life and the creation of genes. Laura Landweber, "DNA Computing" Princeton Univ. Freshman Seminar, 1999. http://www.princeton.edu/~lfl/FRS.html   

DNA computing: An interdisciplinary field that draws together molecular biology, chemistry, computer science and mathematics. There are currently several research disciplines driving towards the creation and use of DNA nanostructures for both biological and non-biological applications. These converging areas are: the miniaturization of biosensors and biochips into the nanometer scale regime; the fabrication of nanoscale objects that can be placed in intracellular locations for monitoring and modifying cell function; the replacement of silicon devices with nanoscale molecular-based computational systems; and the application of biopolymers in the formation of novel nanostructured materials with unique optical and selective transport properties. DNA Computing & Informatics at Surfaces, Univ. of Wisconsin-Madison, June 1-4 2003. http://books.google.com/books?id=B6eUAXmBj8IC&pg=PR5&lpg=PR5&dq=dna+computing+university+of+wisconsin+interdisciplinary&s 
Wikipedia http://en.wikipedia.org/wiki/DNA_computing   Related terms: molecular computing, quantum computing Or are these the same/overlapping? 

Ensembl: A joint project between EMBL- EBI and the Sanger Centre (UK) to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.  http://www.ensembl.org/index.html

exon parsing: Identifying precisely the 5' and 3' boundaries of genes (the transcription unit) in metazoan genomes, as well as the correct sequences of the resulting mRNA ("exon parsing") has been a major challenge of bioinformatics for years. Yet, the current program performances are still totally insufficient for a reliable automated annotation (Claverie 1997; Ashburner 2000). It is interesting to recapitulate quickly the research in this area to illustrate the essential limitation plaguing modern bioinformatics. Encoding a protein imposes a variety of constraints on nucleotide sequences, which do not apply to noncoding regions of the genome. These constraints induce statistical biases of various kinds, the most discriminant of which was soon recognized to be the distribution of six nucleotide-long "words" or hexamers (Claverie and Bougueleret 1986; Fickett and Tung 1992). JM Claverie "From Bioinformatics to Computational Biology" Genome Res 10(9): 1277-1279, Sept. 2000

exon prediction:  Since prokaryotes don't have introns, exon prediction implies working with eukaryotes. Is exon prediction equivalent to gene prediction in prokaryotes?  Related terms: ab initio gene prediction; GRAIL Sequencing

exon shuffling theory: Contends that introns act as spacers where breaks for genetic recombination occur. Under this scenario, exons - which usually contain instructions for building a protein subunit - remain intact when shuffled during recombination. In this way, proteins with new functional repertoires can evolve.  Peter Schmidt, "Shuffling, Recombination, and the Importance of ...Nonsense"  Swarthmore College  www.swarthmore.edu/Humanities/pschmid1/array/Gnarl3/exon.html  
Wikipedia http://en.wikipedia.org/wiki/Exon_shuffling  
Related terms: DNA shuffling, domain shuffling, gene shuffling, protein shuffling  

extreme phenotype selection studies: Systematic collection of phenotypes and their correlation with molecular data has been proposed as a useful method to advance in the study of disease. Although some databases for animal species are being developed, progress in humans is slow, probably due to the multifactorial origin of many human diseases and to the intricacy of accurately classifying phenotypes, among other factors. An alternative approach has been to identify and to study individuals or families with very characteristic, clinically relevant phenotypes. This strategy has shown increased efficiency to identify the molecular features underlying such phenotypes. While on most occasions the subjects selected for these studies presented harmful phenotypes, a few studies have been performed in individuals with very favourable phenotypes. The consistent results achieved suggest that it seems logical to further develop this strategy as a methodology to study human disease, including cancer. The identification and the study with high-throughput techniques of individuals showing a markedly decreased risk of developing cancer or of cancer patients presenting either an unusually favourable prognosis or striking responses following a specific treatment, might be promising ways to maximize the yield of this approach and to reveal the molecular causes that explain those phenotypes and thus highlight useful therapeutic targets.  Selection of extreme phenotypes; the role of clinical observation in translational research José Luis Pérez-Gracia  Clinical and Translational Oncology 2010 Mar;12(3):174-80.  Broader term: phenotype

false negative: The chance of declaring an expression change (e.g., in gene expression) to be insignificant when in fact a change has occurred. The opposite situation is the false positive. 

false positive: The chance of declaring an expression change to be significant when in fact no change has occurred. This tends to be a more pressing concern than false negatives in microarray experiments. 
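
Because thousands of genes are tested at once, false positives are usually controlled with a multiple-testing procedure. A minimal sketch, assuming NumPy and using the Benjamini-Hochberg false discovery rate step-up rule on invented p-values:

# A minimal sketch of controlling false positives across many gene-level tests
# with the Benjamini-Hochberg FDR procedure. P-values are illustrative only.
import numpy as np

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a boolean mask of p-values declared significant at the given FDR."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr * (np.arange(1, m + 1) / m)   # BH step-up thresholds
    passed = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.nonzero(passed)[0].max()       # largest k with p_(k) <= (k/m)*fdr
        significant[order[:cutoff + 1]] = True
    return significant

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))   # only the two smallest p-values pass at FDR 0.05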

FGED The Functional Genomics Data Society: Works with other organizations to accelerate and support the effective sharing and reproducibility of functional genomics data. We facilitate the creation and use of standards and software tools that allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by genome wide and other biological research data integration and meta-analysis. http://fged.org/ Founded as MGED.

FGED standards
· MIAME  ·  MINSEQE ·  MAGE-TAB ·   MAGE

filtering: A process whose aim is to reduce a microarray dataset to a more manageable size, by getting rid of genes that show no significant expression changes across the experiment or that are uninteresting for biological reasons. 
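
A minimal sketch of one common filtering criterion, assuming pandas and an invented expression table: keep only genes whose expression varies by at least an (arbitrary) two-fold across samples.

# A minimal sketch of expression-based filtering; data and threshold are illustrative.
import pandas as pd

expr = pd.DataFrame(
    {"sample1": [100, 12, 300, 5], "sample2": [105, 11, 30, 6], "sample3": [98, 60, 310, 5]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

fold_range = expr.max(axis=1) / expr.min(axis=1)   # max/min expression per gene
filtered = expr[fold_range >= 2]                   # keep genes with >= 2-fold variation
print(filtered.index.tolist())                     # ['geneB', 'geneC']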

finished sequence - human: Sequence in which bases are identified to an accuracy of no more than 1 error in 10,000 and are placed in the right order and orientation along a chromosome with almost no gaps. "A Genome Glossary" [History of the Human Genome Project pullout chart] Science 291, Feb. 16, 2001

"At some level it’s a little arbitrary when you declare a sequence essentially complete," says NHGRI Director Francis Collins… "The definition of finished is evolving. Our definition today is different from 10 years ago. Ten years ago we didn’t even think at the level of genomes," says Laurie Goodman, editor of Genome Research. "I think the community at large should define done. Not everyone is going to agree, but when you’re using the word you should define what it means." Francis Collins says "You’re done when you’ve exhausted the standard methods for closing the gaps. There should be some biological reason why those last bits of sequence eluded you – not because you just didn’t bother." "Are we there yet?" The Scientist: 12, July 19, 1999

fold change: A way of describing how much larger or smaller one number is compared with another. When the first number is larger than the second, it is simply the ratio of the first to the second. When the first number is smaller than the second, it is the ratio of the second to the first with a minus sign in front. When the numbers are equal, it is 1. For example, the fold change of 50 versus 10 is 50/10 = 5, while the fold change of 10 versus 50 is -5.
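
A minimal sketch of this signed fold-change convention in Python (assumes positive, non-zero expression values):

# A minimal sketch of signed fold change: ratio when the first value is larger,
# negative reciprocal when it is smaller. Assumes positive, non-zero inputs.
def fold_change(a, b):
    if a >= b:
        return a / b
    return -(b / a)

print(fold_change(50, 10))   # 5.0
print(fold_change(10, 50))   # -5.0
print(fold_change(10, 10))   # 1.0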

gap: A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment. NCBI BLAST Glossary

GenBank: Located at NCBI, shares information daily with DDBJ and EMBL. NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.   http://www.ncbi.nlm.nih.gov/Genbank/index.html
Now accommodates > 10^10 nucleotides and more than doubles in size every year. David Roos "Bioinformatics -- Trying to Swim in a Sea of Data" Science 291: 1260-1261 Feb. 16, 2001

GenBank and WGS Statistics https://www.ncbi.nlm.nih.gov/genbank/statistics/  

gene finding programs: http://cmgm.stanford.edu/classes/genefind/ Bioinformatics Resource, Center for Molecular and Genetic Medicine, Stanford Univ. School of Medicine. List of programs has been compiled and updated from James W. Fickett, "Finding genes by computer: the state of the art" Trends in Genetics, August 1996, 12 (8) 316- 320

gene identification: The effectiveness of finding genes by similarity to a given sequence segment is determined by a much simpler statistic, the total coverage of the genome by the collective set of sequence contigs. As the overall coverage of the genome is virtually complete (> 90%), there is a strong likelihood that every gene is represented, at least in part, in the data. Thus, finding any gene by sequence similarity searches using sufficient sequence to ensure significance is almost always possible using the data published this week. Caution must be exercised, however, as the identification of the gene may still be ambiguous. This is because a highly similar sequence from a receptor gene from Drosophila, for example, could be found in several different, homologous genes, which may have similar or entirely different functions or are nonfunctioning pseudogenes. In other words, common domains or motifs can be present in many different genes. The use of the approximate similarity search tool BLAST is probably still the best way to find similar sequences. David Galas "Making Sense of the Sequence" Science 291: 1257-1260 Feb. 16, 2001

There are two basic approaches to gene identification: homology-based and ab initio approaches. Marker SNPs can also be used to home in on otherwise hard-to-find genes.

gene parsing: Initial gene parsing methods were then simply based on word frequency computation, eventually combined with the detection of splicing consensus motifs. The next generation of software implemented the same basic principles into a simulated neural network architecture (Uberbacher and Mural 1991). Finally, the last generation of software, based on Hidden Markov Models, added an additional refinement by computing the likelihood of the predicted gene architectures (e.g., favoring human genes with an average of seven coding exons, each 150 nucleotides long) (Kulp et al. 1996; Burge and Karlin 1997). These ab initio methods are used in conjunction with a search for sequence similarity with previously characterized genes or expressed sequence tags (EST). JM Claverie "From Bioinformatics to Computational Biology" Genome Res 10(9): 1277-1279, Sept. 2000 http://genome.cshlp.org/content/10/9/1277.full

gene prediction: Wikipedia http://en.wikipedia.org/wiki/Gene_finding 

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification during the past decade, the accuracy of gene prediction tools is not sufficient to locate the genes reliably in higher eukaryotic genomes. Thus, while the precise sequence of the human genome is increasingly deciphered, gene number estimations are becoming increasingly variable. ... In 1996 we published a comprehensive evaluation of gene prediction programs' accuracy (Burset and Guigó, 1996). ... Recently we have published a revised version of this evaluation (Guigó et al., 2000). This revised evaluation suggests that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology. Genome Bioinformatics Research Lab, Center for Genomic Regulation (Centre de Regulació Genòmica - CRG), Barcelona, 2004 http://genome.imim.es/research/eval.html 

Many methods for predicting genes are based on compositional signals that are found in the DNA sequence. These methods detect characteristics that are expected to be associated with genes, such as splice sites and coding regions, and then piece this information together to determine the complete or partial sequence of a gene. Unfortunately, these ab initio methods tend to produce false positives, leading to overestimates of gene numbers, which means that we cannot confidently use them for annotation. They also do not work well with unfinished sequence that has gaps and errors, which may give rise to frameshifts, when the reading frame of the gene is disrupted by the addition or removal of bases. ... The most effective algorithms integrate gene-prediction methods with similarity comparisons.... The most powerful tool for finding genes may be other vertebrate genomes. Comparing conserved sequence regions between two closely related organisms will enable us to find genes and other important regions in both genomes with no previous knowledge of the gene content of either. Ewan Birney et al. "Mining the draft human genome" Nature 409: 827-828, 15 Feb. 2001 http://www.nature.com/nature/journal/v409/n6822/full/409827a0.html 

Sadly, it is often claimed that matching back cDNA to genomic sequences is the best gene identification protocol; hence, admitting that the best way to find genes is to look them up in a previously established catalog! Thus, the two main principles behind state-of-the-art gene prediction software are (1) common statistical regularities and (2) plain sequence similarity. From an epistemological point of view, those concepts are quite primitive. JM Claverie "From Bioinformatics to Computational Biology" Genome Res 10(9): 1277-1279, Sept. 2000 http://genome.cshlp.org/content/10/9/1277.full 

Algorithms have been developed and are combined to recognize gene structural components.  Narrower/synonymous? term: ab initio gene prediction Related term: comparative genomics

gene recognition: Principally used for finding open reading frames, tools of this type also recognize a number of features of genes, such as regulatory regions, splice junctions, transcription and translation stops and starts, GC islands, and polyadenylation sites. Laura De Francesco "Some things considered" Scientist 12[20]:18, Oct. 12, 1998 

genetic association studies: The analysis of a sequence such as a region of a chromosome, a haplotype, a gene, or an allele for its involvement in controlling the phenotype of a specific trait, metabolic pathway, or disease. MeSH 2010   See also Genome Wide Association Studies GWAS

genetic models: Theoretical representations that simulate the behavior or activity of genetic processes or phenomena. They include the use of mathematical equations, computers, and other electronic equipment. MeSH 1980

genome annotation: is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements. NCBI Prokaryotic Genome Annotation Pipeline  https://www.ncbi.nlm.nih.gov/genome/annotation_prok/   Narrower term: comparative genome annotation

genome assembly: simply the genome sequence produced after chromosomes have been fragmented, those fragments have been sequenced, and the resulting sequences have been put back together. Ensembl FAQ https://www.ensembl.org/Help/Faq?id=216

genome misassembly: We present the first collection of tools aimed at automated genome assembly validation. This work formalizes several mechanisms for detecting mis-assemblies, and describes their implementation in our automated validation pipeline. Genome assembly forensics: finding the elusive mis-assembly, Adam M Phillippy, Michael C Schatz, Mihai Pop, Genome Biology 2008, 9(3): R55 https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-3-r55

genome informatics: Genome informatics is the field in which computer and statistical techniques are applied to derive biological information from genome sequences. Genome informatics includes methods to analyse DNA sequence information and to predict protein sequence and structure. Nature latest research and news https://www.nature.com/subjects/genome-informatics

genomic computing: A genomic computing network is a variant of a neural network for which a genome encodes all aspects, both structural and functional, of the network. The genome is evolved by a genetic algorithm to fit particular tasks and environments. The genome has three portions: one for specifying links and their initial weights, a second for specifying how a node updates its internal state, and a third for specifying how a node updates the weights on its links. Preliminary experiments demonstrate that genomic computing networks can use node internal state to solve POMDPs more complex than those solved previously using neural networks. Association for Computing Machinery, ACM Digital Library, Guide to Computing Literature http://portal.acm.org/citation.cfm?id=1143997.1144037&coll=&dl=&type=series&idx=1143997&part=Proce   

genomic data:
The strength of genomic studies lies in the global comparisons between biological systems rather than detailed examination of single genes or proteins. Genomic information is often misused when applied exclusively to individual genes. If one is interested only in one particular gene, there are many more conclusive experiments that should be consulted before using the results from genomic datasets. Therefore, genomic data should not be used in lieu of traditional biochemistry, but as initial guidelines to identify areas for deeper investigation and to see how those results fit in with the rest of the genome. Moreover, most genomics datasets give relative rather than absolute information, which means that information about a single gene has little meaning in isolation. Dov Greenbaum, Mark Gerstein et al. "Interrelating Different Types of Genomic Data" Dept. of Biochemistry and Molecular Biology, Yale Univ., 2001 http://bioinfo.mbb.yale.edu/e-print/omes-genomeres/text.pdf Related terms: Expression genes & proteins; -Omes & -Omics interactome; Proteomics

genomic datasets: The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. Broad Institute, Integrative Genomics Viewer https://software.broadinstitute.org/software/igv/home

GRAIL: Gene Recognition and Assembly Internet Link:  Major GRAIL Site updates, Broad Institute https://software.broadinstitute.org/mpg/grail/faq.html

The GRAILexp  FAQ [no longer on web?] with references to Perceval, an exon prediction program; Galahad, a gene message alignment program and Gawain, a gene assembly program clearly has scientific and literary finesse.  Does this name relate in any way to Walter Gilbert's description of the Human Genome Project as the "Holy Grail" of molecular biology?  I should investigate further   

global normalization or mean scaling: The standard solution for errors that affect entire arrays is to scale the data so that the average measurement is the same for each array (and each color). The scaling is accomplished by computing the average expression level for each array, calculating a scale factor equal to the desired average divided by the actual average, and multiplying every measurement from the array by that scale factor. The desired average can be arbitrary, or computed from the average of a group of arrays. 
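
A minimal sketch of mean scaling as described above, assuming NumPy and two invented single-channel arrays; the desired average is taken to be the mean over all arrays.

# A minimal sketch of global (mean) scaling: each array is multiplied by a factor
# so that all arrays share the same average intensity. Data are illustrative.
import numpy as np

arrays = np.array([
    [120.0, 80.0, 400.0, 60.0],    # array 1, one measurement per gene
    [240.0, 150.0, 820.0, 110.0],  # array 2 (roughly twice as bright overall)
])

target = arrays.mean()                         # desired average intensity
scale_factors = target / arrays.mean(axis=1)   # one scale factor per array
normalized = arrays * scale_factors[:, None]

print(scale_factors)            # >1 for the dim array, <1 for the bright one
print(normalized.mean(axis=1))  # both arrays now share the same average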

GWAS Genome Wide Association Studies: An analysis comparing the allele frequencies of all available (or a whole GENOME representative set of) polymorphic markers in unrelated patients with a specific symptom or disease condition, and those of healthy controls to identify markers associated with a specific disease or condition. MeSH 2009

A genome-wide association study (GWAS) is an approach used in genetics research to associate specific genetic variations with particular diseases. The method involves examining genetic variations (genotypes) across the complete sequences of DNA, or genomes, of many different people to find genetic variants associated with a disease or trait (phenotypes). Researchers can use the information to better understand how genetic variation affects the normal function of genes, in addition to helping develop better prevention and treatment strategies.  US NIH, Genome Wide Association Studies GWAS Policy https://report.nih.gov/nihfactsheets/ViewFactSheet.aspx?csid=28

Pronounced gee-wahs  Related term: next generation sequencing
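
At its simplest, the marker-by-marker comparison can be illustrated as a chi-square test of allele counts in cases versus controls; the counts below are invented, SciPy is assumed, and a real GWAS repeats this across very many markers with corrections for multiple testing and population structure.

# A minimal sketch of a single-marker case-control allele-frequency comparison.
from scipy.stats import chi2_contingency

#                 allele A   allele a
allele_counts = [[ 620,       380 ],    # cases    (allele counts, not genotypes)
                 [ 540,       460 ]]    # controls

chi2, p_value, dof, expected = chi2_contingency(allele_counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}")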

high throughput nucleotide sequencing: [analysis] Techniques of nucleotide sequence analysis that increase the range, complexity, sensitivity, and accuracy of results by greatly increasing the scale of operations and thus the number of nucleotides, and the number of copies of each nucleotide sequenced. The sequencing may be done by analysis of the synthesis or ligation products, hybridization to preexisting sequences, etc. MeSH 2011

homologue, homologous: Used by geneticists in two different senses: (1) one member of a chromosome pair in diploid organisms, and (2) a gene from one species -- for example, the mouse -- that has a common origin and functions the same as a gene from another species -- for example, humans, Drosophila, or yeast. [NHLBI] Related terms: Phylogenomics lateral genomics, ortholog, orthologous, paralog, paralogous, synologous, xenolog, xenologous; Model organisms; Protein informatics homology modeling

homology: This is different from homologue as defined in Pharmaceutical biology. The relationship among sequences due to descent from a common ancestral sequence. An important organizing principle for genomic studies because structural and functional similarities tend to change together along the structure of homology relationships. When applied to nucleotide or protein sequences, means relationship due to descent from a common ancestral sequence. Two DNA molecules (or regions thereof) are homologous if they both "descended" through a series of replication from a single DNA strand … The terms "homology" and "similarity" are often, incorrectly, used interchangeably. Homology has been used by various people with different meanings, even though similarity was a common denominator among these meanings. The two most important of these meanings related homology to similar structures and/or to similar functions. By structures I mean both molecular sequences and morphology. Life would have been simple had phylogenetic homology necessarily implied structural homology or either of them necessarily implied functional homology. However, they map onto each other imperfectly and my definition of homology includes all forms of characters. We could reduce confusion by always indicating the kind of homology we are referring to when using the term. Walter Fitch "Homology: a personal view on some of the problems" Trends in Genetics 16(5): 227-231, May 2000

Note that homology can be genic, structural, functional or behavioral. Related terms: Drug targets target homology; Phylogenomics evolutionary homology, orthology, paralogy, similarity; Proteomics regulatory homology. Narrower terms: deep homology; Sequencing sequence homology, sequence homology - nucleic acid. Related terms: homolog (homologue), similarity, ortholog, paralog, xenology
Homology Site Guide, NCBI
https://www.ncbi.nlm.nih.gov/guide/homology/ 
Wikipedia http://en.wikipedia.org/wiki/Homology_%28biology%29

International Nucleotide Database: Composed of  DDBJ, EMBL and GenBank.

local alignment: The alignment of some portion of two nucleic acid or protein sequences. NCBI BLAST glossary

Best alignment method for sequences for which no evolutionary relatedness is known. See Smith-Waterman alignment. Compare global alignment.

log ratios: DNA microarray assays typically compare two biological samples and present the results of those comparisons gene-by-gene as the logarithm base two of the ratio of the measured expression levels for the two samples. The limits of log ratios, Vasily Sharov, Ka Yin Kwong, Bryan Frank, Emily Chen, Jeremy Hasseman, Renee Gaspard, Yan Yu, Ivana Yang, and John Quackenbush, BMC Biotechnology 4, 2004 doi: 10.1186/1472-6750-4-3. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=400743 

MAGE  Microarray and Gene Expression:  See under FGED Standards

MAML Microarray Markup Language: MAML (Microarray Markup Language) is no longer supported by MGED and has been replaced by MAGE-ML.
Broader term: standards; Related terms: data analysis - microarray, MGED, MIAME

MGED Microarray Gene Expression Database group: The MGED group was a grass-roots movement whose goal was to facilitate the adoption of standards for DNA-array experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalization methods. The group was founded at the Microarray Gene Expression Database meeting MGED1 (November, 1999, Cambridge, UK).

MIAME Minimum Information About a Microarray Experiment: See under FGED standards

MIAME/MAGE-OM:  See under FGED Standards 

microarray analysis techniques: Wikipedia http://en.wikipedia.org/wiki/Microarray_analysis_techniques 

microarrays - data analysis: Microarrays have revolutionized molecular biology. The numbers of applications for microarrays are growing as quickly as their probe density. Paradoxically, microarray data still contains a large number of variables and a small number of replicates, creating unique data analysis challenges. Still, the first and most important goal is to design microarray experiments that yield statistically defensible results. Related terms: image analysis - microarrays; standards; cluster analysis, pattern recognition Algorithms & data management glossary

microarrays image analysis: Although the visual image of a microarray panel is alluring, its information content, per se, is minimal without significant image processing. To mine its lode effectively, quantitative signal must be determined optimally, which means subtracting background, calculating confidence intervals - outside of which a difference in signal ratio is deemed to be significant - and calibrated. Editorial “Getting hip to the chip” Nature Genetics 18(3): 195- 197 March 1998

This process starts with the image of a microarray that is produced in the laboratory and produces intensity information indicating the amount of light emitted by each probe. In particular, after the array has been hybridized, it is scanned to obtain an image that shows the amount of light emitted across the surface of the microarray. The image is then analyzed to identify the "spots" (i.e., the parts of the image corresponding to the DNA probes on the microarray) and the amount of light that can be attributed to target molecules bound to each probe.  Related term: normalization

microarray informatics: The microarray field is experiencing an overwhelming push toward robust statistics and mathematical analytic methods that go far beyond the simple fold analysis and basic clustering that were once the mainstays of researchers in this area. This push toward better statistics is also driving the recognition of the need for more replication of experiments. These stronger analytical techniques also help researchers identify problem areas in the technology and laboratory processes, and these improvements, in turn, greatly improve the quality of results that can be provided.  Related terms microarray analysis, microarray data analysis

mismatches: Gene expression microarray data is notoriously subject to high signal variability. Moreover, unavoidable variation in the concentration of transcripts applied to microarrays may result in poor scaling of the summarized data which can hamper analytical interpretations. This is especially relevant in a systems biology context, where systematic biases in the signals of particular genes can have severe effects on subsequent analyses. Conventionally it would be necessary to replace the mismatched arrays, but individual time points cannot be rerun and inserted because of experimental variability. It would therefore be necessary to repeat the whole time series experiment, which is both impractical and expensive. Correction of scaling mismatches in oligonucleotide microarray data, Martino Barenco, Jaroslav Stark, Daniel Brewer, Daniela Tomescu, Robin Callard and Michael Hubank, BMC Bioinformatics 2006, 7:251 doi:10.1186/1471-2105-7-251 http://www.biomedcentral.com/1471-2105/7/251 

molecular sequence annotation: The addition of descriptive information about the function or structure of a molecular sequence to its MOLECULAR SEQUENCE DATA record. MeSH 2011  

noise characterization:  Noise is a big problem in analyzing gene expression microarray data. Of course noise is a problem with biological data in general. 

normality: Related term: Molecular Medicine normal

normalization, microarray: Underlying every microarray experiment is an experimental question that one would like to address. Finding a useful and satisfactory answer relies on careful experimental design and the use of a variety of data-mining tools to explore the relationships between genes or reveal patterns of expression. … this review focuses on the much more mundane but indispensable tasks of 'normalizing' data from individual hybridizations to make meaningful comparisons of expression levels, and of 'transforming' them to select genes for further analysis and data mining. Microarray data normalization and transformation, John Quackenbush, Nature Genetics 32: 496-501 (2002) doi:10.1038/ng1032, published online 01 December 2002

The conversion of intensity information (from image analysis) into estimates of gene expression levels. For researchers who are using statistical methods, this process also characterizes the uncertainty in the measurements. The goal of normalization is to convert the intensity measurements generated by image analysis into estimates of gene expression levels in the original biological source. Concretely, the challenge is to compensate for as many sources of error as possible.  Related terms: fold changes, image analysis, log ratios; See also normalization: Algorithms

oligonucleotide array sequence analysis: Hybridization of a nucleic acid sample to a very large set of oligonucleotide probes, which are attached to a solid support, to determine sequence or to detect variations in a gene sequence or expression or for gene mapping. MeSH, 1999

Useful to know this MeSH heading for microarrays, but use free- text as well to search PubMed.

ORF prediction: Related terms: exon prediction, gene prediction, gene recognition.
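
A toy sketch of the simplest form of ORF prediction, scanning the three forward reading frames of an invented DNA string for start and in-frame stop codons; real gene finders also scan the reverse strand, score codon usage, and (in eukaryotes) model splice sites.

# A toy ORF scanner over the three forward reading frames; sequence is illustrative.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":                      # potential start codon
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:        # first in-frame stop codon
                        if (j + 3 - i) // 3 >= min_codons: # length including stop codon
                            orfs.append((i, j + 3, seq[i:j + 3]))
                        break
            i += 3
    return orfs

print(find_orfs("CCATGGCTTGATAGATGAAACCCTAAGG"))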

Phred: Base calling program for DNA sequence traces; ... developed by Drs. Phil Green and Brent Ewing, and is distributed under license from the University of Washington. http://www.phrap.org/

Phred base calling: http://en.wikipedia.org/wiki/Phred_base_calling 
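
The Phred scale relates a base-calling error probability P to a quality score Q = -10 log10(P); in Sanger-style FASTQ files the score is stored as the character chr(Q + 33). A minimal Python sketch (the quality string is an invented example):

# A minimal sketch of the Phred scale and Sanger/FASTQ (Phred+33) decoding.
import math

def phred_quality(error_probability):
    return -10 * math.log10(error_probability)

print(phred_quality(0.001))        # 30.0  (1 error in 1,000 bases)
print(phred_quality(0.0001))       # 40.0  (1 error in 10,000 bases)

fastq_qualities = "II?5+"                       # illustrative FASTQ quality string
print([ord(c) - 33 for c in fastq_qualities])   # decoded scores: [40, 40, 30, 20, 10]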

reverse transfection: a technique for the transfer of genetic material into cells. As DNA is printed on a glass slide for the transfection process (the deliberate introduction of nucleic acids into cells) to occur before the addition of adherent cells, the order of addition of DNA and adherent cells is reverse that of conventional transfection.[1] Hence, the word “reverse” is used. Wikipedia accessed 2018 Aug 29 https://en.wikipedia.org/wiki/Reverse_transfection

Forward and reverse transfection protocols each have significant uses in research. The main difference between forward and reverse transfection is whether the cells are plated the day before transfection (forward transfection) or seeded at the same time as the transfection (reverse transfection). Forward transfection is commonly used when the cells need to be already attached and in a growth phase before the nucleic acid + transfection reagent complex is applied. In contrast, in reverse transfection the nucleic acid + transfection reagent complex is assembled in the tissue culture plate and the cells are then seeded into the wells. There are many benefits to the reverse transfection method, including: the method is ideal for high-throughput screening since reverse transfection is compatible with automated robots; not needing to pre-plate cells saves a day; the high efficiency of reverse transfection decreases the amount of nucleic acid used; and unlike forward transfection, the transfection reagent can remain in contact with the cells for 24-72 hours. Altogen Biosystems, Forward transfection or reverse transfection? https://altogen.com/forward-transfection-reverse-transfection/

RNA sequence analysis: A multistage process that includes cloning, physical mapping, subcloning, sequencing, and information analysis of an RNA SEQUENCE. MeSH 1993

scaffolds: A series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence. "A Genome Glossary" [History of the Human Genome Project pullout chart] Science 291, Feb. 16, 2001

Contig sequences separated by gaps NCBI Whole Genome Shotgun Submissions http://www.ncbi.nlm.nih.gov/genbank/wgs.html 

The definition of a scaffold appears to be quite different in the Science and Nature draft published sequences. David Galas "Making sense of the sequence" Science 291: 1257, Feb. 16, 2001. This is also different from the scaffold defined in Drug discovery and development.  

sequence alignment:  The arrangement of two or more amino acid or base sequences from an organism or organisms in such a way as to align areas of the sequences sharing common properties. The degree of relatedness or homology between the sequences is predicted computationally or statistically based on weights assigned to the elements aligned between the sequences. This in turn can serve as a potential indicator of the genetic relatedness between the organisms. MeSH, 1991 Broader term? alignments.

sequence homology:  The degree of similarity between sequences. Studies of amino acid and nucleotide sequences provide useful information about the genetic relatedness of certain species. MeSH, 1993 Broader term Functional genomics homology;   Related terms Functional genomics evolutionary homology; Proteomics regulatory homology;  

sequence homology - nucleic acid: The sequential correspondence of nucleotide triplets in a nucleic acid molecule which permits nucleic acid hybridization. Sequence homology is important in the study of mechanisms of oncogenesis and also as an indication of the evolutionary relatedness of different organisms. The concept includes viral homology. MeSH, 1991 Broader term sequence homology

Sequence Ontology Project:  The Sequence Ontology is a set of terms and relationships used to describe the features and attributes of biological sequence. http://www.sequenceontology.org/    

sequencing algorithms: See BLAST, FASTA, Needleman-Wunsch, Smith-Waterman

similarity search: BLAST, FASTA and Smith-Waterman are examples of similarity search algorithms.

Smith-Waterman alignment:  https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
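
A minimal Python sketch of the Smith-Waterman local alignment recurrence with a linear gap penalty; the scoring values and sequences are arbitrary, and the traceback that recovers the actual alignment is omitted for brevity.

# A minimal sketch of Smith-Waterman scoring (local alignment, linear gap penalty).
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best_score, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0,                    # local alignment: never go below zero
                          diagonal,             # match/mismatch
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best_score:
                best_score, best_pos = H[i][j], (i, j)
    return best_score, best_pos

print(smith_waterman("TGTTACGG", "GGTTGACTA"))   # best local score and its end position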

Genomic informatics resources
Ensembl glossary
https://www.ensembl.org/Multi/Help/Glossary?db=core
NHGRI (National Human Genome Research Institute), Talking Glossary of Genetic Terms, 100+ definitions. https://www.genome.gov/genetics-glossary Includes extended audio definitions.
Schlindwein Birgid, Hypermedia Glossary of Genetic Terms, 2006, 670 definitions.
http://www.weihenstephaon.de/~schlind/index.html

Informatics Conferences http://www.healthtech.com/conferences/upcoming.aspx?s=NFO
BioIT World Expo
http://www.bio-itworldexpo.com/
Molecular Medicine Tri Conference
http://www.triconference.com/
Informatics Short courses
http://www.healthtech.com/Conferences_Upcoming_ShortCourses.aspx?s=NFO
BioIT World magazine
http://www.bio-itworld.com/      
BioIT World archives
http://www.bio-itworld.com/BioIT/BioITArchive.aspx

How to look for other unfamiliar  terms

IUPAC definitions are reprinted with the permission of the International Union of Pure and Applied Chemistry.
