With changes in sequencing technology and methods, the rate
of acquisition of human and other genome data over the next few years will
be ~100 times higher than originally anticipated. Assembling and interpreting
these data will require new and emerging levels of coordination and collaboration
in the genome research community to develop the necessary computing algorithms,
data management, and visualization systems. Lawrence Berkeley
Lab, US "Advanced Computational Structural Genomics"
The dividing line between this glossary and
Information management & interpretation is fuzzy; in general, Algorithms
& data analysis focuses on structured data, while Information management
& interpretation centers on unstructured data. Data Science & Machine Learning
is another closely related glossary.
Finding guide to terms in these glossaries: Informatics term index.
Related glossaries include Drug Discovery & Development; Proteomics.
Informatics: Bioinformatics, Chemoinformatics, Clinical informatics, Drug discovery informatics, IT infrastructure, Ontologies.
Research Technologies: Microarrays & protein chips, Sequencing.
Biology: Protein Structures; Sequences, DNA & beyond.
affinity-based data mining:
Large and complex data sets are analyzed
across multiple dimensions, and the data mining system identifies data
points or sets that tend to be grouped together. These systems differentiate
themselves by providing hierarchies of associations and showing any underlying
logical conditions or rules that account for the specific groupings of
data. This approach is particularly useful in biological motif analysis.
"Data mining" Nature Biotechnology 18: 237-238 Supp. Oct. 2000 Broader term: data
mining
algorithm:
A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task.
MeSH, 1987
Algorithms fuel the
scientific advances in the life sciences. They are required for dealing with the
large amounts of data produced in sequencing projects, genomics or proteomics.
Moreover, they are crucial ingredients in making new experimental approaches
feasible... Algorithm development for Bioinformatics applications combines
Mathematics, Statistics, Computer Science as well as Software Engineering to
address the pressing issues of today's biotechnology and build a sound
foundation for tomorrow's advances. Algorithmics Group, Max Planck
Institute for Molecular Genetics, Germany http://algorithmics.molgen.mpg.de/
Rules or a process, particularly in computer
science. In medicine, a step-by-step process for reaching a diagnosis or
ruling out specific diseases. May be expressed as a flow chart in
either sense. Greater efficiencies in algorithms, as well as improvements in computer
hardware, have led to advances in computational biology. A computable set of steps to achieve a desired result.
From the Persian author Abu Ja'far
Mohammed ibn Mûsâ al-Khowârizmî, who wrote a book of
arithmetic rules dating from about 825
A.D. NIST Narrower terms: docking algorithms, sequencing algorithms, genetic
algorithm, heuristic algorithm. Related terms heuristic, parsing; Sequencing
dynamic programming methods.
ANOVA Analysis Of Variance:
Error model based on a standard
statistical approach: a generalization of the familiar t-test that allows
multiple effects to be compared simultaneously. An
ANOVA model is expressed as a large set of equations that can be solved, given a
dataset of measurements, using standard software.
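The comparison ANOVA makes can be seen in a short sketch: a pure-Python one-way ANOVA that returns the F statistic as the ratio of between-group to within-group mean squares. Function name and data layout are illustrative; real analyses would use a statistics package.

```python
# Minimal sketch of one-way ANOVA: compares variation between group means
# to variation within groups via an F statistic (illustrative only).

def one_way_anova_f(groups):
    """Return the F statistic for a list of samples (one list per group)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: each group mean vs. the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: each point vs. its own group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)        # mean squares
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```

A large F (here, a group with a clearly shifted mean) signals that between-group differences dominate within-group noise.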
bootstrapping: In statistics, bootstrapping is
any test or metric that relies on random
sampling with replacement.
Bootstrapping allows assigning measures of accuracy (defined in
terms of bias, variance, confidence
intervals, prediction error or
some other such measure) to sample estimates.[1][2] This
technique allows estimation of the sampling distribution of almost
any statistic using random sampling methods.[3][4]
Wikipedia accessed 2018 Oct 27
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
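The resampling idea above can be sketched in a few lines: draw many same-size resamples with replacement, recompute the statistic on each, and read a percentile confidence interval off the replicates. Names and defaults are illustrative.

```python
import random
import statistics

# Minimal bootstrap sketch: percentile confidence interval for a statistic.

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)                    # fixed seed for reproducibility
    replicates = sorted(
        stat([rng.choice(data) for _ in data])   # one resample, same size as data
        for _ in range(n_boot)
    )
    lo = replicates[int(n_boot * alpha / 2)]         # 2.5th percentile
    hi = replicates[int(n_boot * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lo, hi
```

Any statistic (median, correlation, a fitted parameter) can be plugged in as `stat`, which is exactly why the technique is so widely applicable.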
chaos
theory: a branch of mathematics focusing
on the behavior of dynamical
systems that are highly
sensitive to initial
conditions. 'Chaos' is an
interdisciplinary theory stating that within the apparent randomness
of chaotic
complex systems, there are
underlying patterns, constant feedback
loops, repetition, self-similarity, fractals, self-organization,
and reliance on programming at the initial point known as sensitive
dependence on initial conditions. The butterfly
effect describes how a small
change in one state of a deterministic nonlinear system can result
in large differences in a later state, e.g. a butterfly flapping its
wings in Brazil can cause a hurricane in Texas.[1]
Wikipedia accessed 2018 Oct 27
https://en.wikipedia.org/wiki/Chaos_theory
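Sensitive dependence on initial conditions is easy to demonstrate with the logistic map x → r·x·(1−x) in its chaotic regime (r = 4): two nearly identical starting points stay close at first, then diverge completely. A minimal sketch:

```python
# Logistic-map sketch of the 'butterfly effect': iterate x -> r*x*(1-x)
# with r = 4 (chaotic regime) and watch nearby trajectories diverge.

def logistic_trajectory(x0, steps, r=4.0):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))   # one iteration of the map
    return xs
```

Starting from 0.2 and 0.2000001, the first few iterates agree to several decimal places, yet within a few dozen steps the trajectories bear no resemblance to each other.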
cluster analysis:
The clustering, or grouping, of large
data sets (e.g., chemical and/or pharmacological data sets) on the basis
of similarity criteria for appropriately scaled variables that represent
the data of interest. Similarity criteria (distance based, associative,
correlative, probabilistic) among the several clusters facilitate the recognition
of patterns and reveal otherwise hidden structures (Rouvray, 1990; Willett,
1987, 1991). IUPAC Computational
A set of statistical methods used to group variables or observations into strongly
interrelated subgroups. In
epidemiology, it may be used to analyze a closely grouped series of events or cases of disease or other
health-related
phenomena with well-defined distribution patterns in relation to time or place or both.
MeSH, 1990 Has been used in medicine to create
taxonomies of diseases and
diagnosis and in archaeology to establish taxonomies of stone tools and funereal
objects. Cluster analysis can be
supervised, unsupervised, or
partially supervised. Related terms: clustering
analysis, dendrogram, heat map, pattern
recognition, profile chart.
Narrower terms: hierarchical clustering, k-means clustering
clustering analysis:
This is a general type of
analysis that involves grouping gene or array expression profiles based on
similarity. Clustering is a major subfield within the broad world of numerical
analysis, and many specific clustering methods are known.
coefficient of variation (CV):
The standard
deviation of a set of measurements divided by their mean.
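The definition above is a one-liner in code; a sketch with an illustrative function name:

```python
import statistics

# Coefficient of variation: standard deviation divided by the mean,
# a unitless measure of relative spread.

def coefficient_of_variation(xs):
    return statistics.stdev(xs) / statistics.mean(xs)
```

Because the CV is unitless, it is often used to compare the noise of measurements made on different scales.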
comparative data mining:
Focuses on overlaying large and complex
data sets that are similar to each other ...particularly useful in all
forms of clinical trial meta-analyses ... Here the emphasis is on
finding dissimilarities, not similarities. "Data mining" Nature Biotechnology
Vol. 18: 237-238 Supp. Oct. 2000 Broader term: data mining
curse of dimensionality:
(Bellman
1961) refers to the exponential growth of hypervolume as a function of
dimensionality. In the field of NNs [neural nets], the curse of dimensionality
expresses itself in two related problems. Janne Sinkkonen "What is
the curse of dimensionality?" Artificial Intelligence FAQ http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-13.html
refers to various phenomena that arise
when analyzing and organizing data in high-dimensional
spaces (often with hundreds or
thousands of dimensions) that do not occur in low-dimensional
settings such as the three-dimensional physical
space of everyday experience.
The expression was coined by Richard
E. Bellman when considering
problems in dynamic
optimization.[1][2]
There are multiple phenomena referred to by this name in domains
such as numerical
analysis, sampling, combinatorics, machine
learning, data
mining and databases.
The common theme of these problems is that when the dimensionality
increases, the volume of
the space increases so fast that the available data become sparse.
Wikipedia accessed 2018 Dec 9
https://en.wikipedia.org/wiki/Curse_of_dimensionality
Related
term: high-dimensionality
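The "volume grows so fast that data become sparse" point can be illustrated numerically: the fraction of a unit hypercube occupied by its inscribed hypersphere collapses toward zero as the dimension rises. A Monte Carlo sketch (parameters illustrative):

```python
import random

# Sketch of the curse of dimensionality: uniformly sampled points in the
# cube [-1, 1]^dim fall inside the unit sphere less and less often as the
# dimension grows, i.e. the data become sparse relative to any fixed region.

def fraction_inside_sphere(dim, n_points=20000, seed=0):
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if sum(x * x for x in point) <= 1.0:   # inside the unit sphere?
            inside += 1
    return inside / n_points
```

In 2 dimensions about 79% of points land inside the circle; by 10 dimensions, well under 1% do.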
data warehouse:
In computing,
a data warehouse (DW or DWH),
also known as an enterprise data
warehouse (EDW),
is a system used for reporting and data
analysis,
and is considered a core component of business
intelligence. DWs
are central repositories of integrated data from one or more
disparate sources. They store current and historical data in one
single place that
are used for creating analytical reports for workers throughout the
enterprise.
Wikipedia accessed 2018 Aug 25
https://en.wikipedia.org/wiki/Data_warehouse
decision tree:
a decision
support tool
that uses a tree-like graph or model of
decisions and their possible consequences, including chance event
outcomes, resource costs, and utility.
It is one way to display an algorithm that
only contains conditional control statements. Wikipedia accessed 2018 Jan 26
https://en.wikipedia.org/wiki/Decision_tree
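The point that a decision tree is "an algorithm that only contains conditional control statements" can be shown directly; the thresholds and labels below are invented for illustration (loosely echoing the classic iris example), not a fitted model:

```python
# A tiny decision tree written as plain conditional control statements.
# Thresholds and class labels are illustrative, not a trained classifier.

def classify_iris_like(petal_length, petal_width):
    if petal_length < 2.5:
        return "setosa-like"
    elif petal_width < 1.8:
        return "versicolor-like"
    else:
        return "virginica-like"
```

Each path from the root to a leaf corresponds to one chain of if/else tests, which is exactly how tree-learning software renders its output.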
dendrogram:
A tree diagram that depicts the results of hierarchical
clustering. Often the branches of the tree are drawn with lengths that are
proportional to the distance between the profiles or clusters. Dendrograms are
often combined with heat maps, which can give a clear visual representation of
how well the clustering has worked. Related terms: cluster analysis, heat maps, profile charts
dimensionality reduction:
In statistics, machine
learning, and information
theory, dimensionality
reduction or dimension reduction is the process of reducing the
number of random variables under consideration[1] by
obtaining a set of principal variables. It can be divided into feature
selection and feature
extraction.[2]
Wikipedia accessed 2018 Dec 9 https://en.wikipedia.org/wiki/Dimensionality_reduction
Narrower term: Principal
Components Analysis PCA
error model:
A mathematical formulation that identifies the sources of
error in an experiment. An error model provides a mathematical means of
compensating for the errors in the hope that this will lead to more accurate
estimates of the true expression levels and also provides a means of estimating
the uncertainty in the answers. An error model is generally an approximation of
the real situation and embodies numerous assumptions; therefore, its utility
depends on how good these assumptions are. The model can be expressed as a set
of equations, as an algorithm, or using any other mathematical formalisms. ...
The term error model has become very popular among software providers,
particularly in light of the success of Rosetta’s Resolver, which incorporates
an error model. As a result, some software developers may use the term
inappropriately. Not everything that is called an error model really is one.
evolutionary
algorithm:
An
umbrella term used to describe computer-based problem-solving systems which use
computational models of some of the known mechanisms of EVOLUTION
as key elements in their design and implementation. A variety of EVOLUTIONARY
Algorithms have been proposed. The major ones are: GENETIC
Algorithms (see Q1.1),
EVOLUTIONARY
PROGRAMMING (see Q1.2),
EVOLUTION
Strategies (see Q1.3),
CLASSIFIER
Systems (see Q1.4),
and GENETIC
PROGRAMMING (see Q1.5).
They all share a common conceptual base of simulating the evolution of INDIVIDUAL
structures via processes of SELECTION,
MUTATION,
and REPRODUCTION.
The processes depend on the perceived PERFORMANCE
of the individual structures as defined by an ENVIRONMENT.
More
precisely, EAs
maintain a POPULATION
of structures, that evolve according to rules of selection and other operators,
that are referred to as "search operators", (or GENETIC
Operators), such as RECOMBINATION
and mutation. Each individual in the population receives a measure of its FITNESS
in the environment. Reproduction focuses attention on high fitness individuals,
thus exploiting (cf. EXPLOITATION)
the available fitness information. Recombination and mutation perturb those
individuals, providing general heuristics for EXPLORATION.
Although simplistic from a biologist's viewpoint, these algorithms are
sufficiently complex to provide robust and powerful adaptive search mechanisms.
Heitkötter, Jörg
and Beasley, David, eds. (2001) "The Hitch-Hiker's Guide to Evolutionary
Computation: A list of Frequently Asked Questions (FAQ)",
USENET: comp.ai.genetic Available via anonymous
FTP
from ftp://rtfm.mit.edu/pub/usenet/news.answers/ai-faq/genetic/
evolutionary computation:
In computer
science, evolutionary computation is
a family of algorithms for global
optimization inspired
by biological
evolution,
and the subfield of artificial
intelligence and soft
computing studying
these algorithms. In technical terms, they are a family of population-based trial-and-error problem
solvers with a metaheuristic or stochastic
optimization character.
Wikipedia accessed 2018 Sept 7
https://en.wikipedia.org/wiki/Evolutionary_computation
expert systems:
A computer-based program that encodes rules obtained from process experts,
usually in the form of “if-then” statements. J. Glassey et al.
“Issues in the development of an industrial bioprocess advisory system”
Trends in Biotechnology 18(4): 136-41, April 2000. Related term: artificial intelligence.
fuzzy:
In contrast to binary (true/false) terms, allows for looser
boundaries for sets or concepts.
fuzzy logic:
A superset of conventional (Boolean) logic that
has been extended to handle the concept of partial truth: truth values
between “completely true” and “completely false”. Introduced by Dr.
Lotfi Zadeh (Univ. of California, Berkeley) in the 1960s as a means to model the uncertainty
of natural language. AI FAQ, Carnegie Mellon University Computer Science
Department http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/fuzzy/part1/faq-doc-2.html
Approximate, quantitative reasoning that is concerned with the linguistic ambiguity which exists in natural or
synthetic language. At its core are variables such as good, bad, and young as well as modifiers such as more, less,
and very. These ordinary terms represent fuzzy sets in a particular problem. Fuzzy logic plays a key role in many
medical expert systems. MeSH, 1993
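Partial truth is easy to make concrete. In the common min/max formulation, membership degrees lie in [0, 1] and AND/OR/NOT become min/max/complement; the "young" membership curve below is an invented ramp, purely for illustration:

```python
# Sketch of fuzzy truth values: degrees in [0, 1] rather than {0, 1},
# with AND/OR/NOT modeled as min/max/complement (one common convention).

def membership_young(age):
    """Degree (0..1) to which an age counts as 'young' (illustrative ramp)."""
    if age <= 25:
        return 1.0
    if age >= 50:
        return 0.0
    return (50 - age) / 25.0     # linear falloff between 25 and 50

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a
```

So an age of 37.5 is "young" to degree 0.5, neither completely true nor completely false, which is the linguistic ambiguity the definitions above describe.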
Generative Adversarial Networks (GANs)
are a class of artificial
intelligence algorithms
used in unsupervised
machine learning,
implemented by a system of two neural
networks contesting
with each other in a zero-sum
game framework.
They were introduced by Ian
Goodfellow et
al. in 2014.
…One
network generates candidates (generative) and the
other evaluates them (discriminative). Typically,
the generative network learns to map from a latent
space to a particular data distribution of interest, while the
discriminative network discriminates between instances from the true
data distribution and candidates produced by the generator. The
generative network's training objective is to increase the error
rate of the discriminative network (i.e., "fool" the discriminator
network by producing novel synthesised instances that appear to have
come from the true data distribution).
In practice, a known dataset serves as the initial training data
for the discriminator. Training the discriminator involves
presenting it with samples from the dataset, until it reaches some
level of accuracy. Typically the generator is seeded with a
randomized input that is sampled from a predefined latent space (e.g.
a multivariate normal
distribution).
Thereafter, samples synthesized by the generator are evaluated by
the discriminator. Backpropagation is
applied in both networks so
that the generator produces better images, while the discriminator
becomes more skilled at flagging synthetic images. The
generator is typically a deconvolutional neural network, and the
discriminator is a convolutional neural network.
Wikipedia accessed 2018 Dec 8
https://en.wikipedia.org/wiki/Generative_adversarial_network
Related
terms: dimensionality reduction, high dimensionality
Genetic Algorithm GA: In computer
science and operations
research, a genetic algorithm (GA) is a
metaheuristic
inspired by the process of natural
selection that belongs to the larger class of evolutionary
algorithms (EA). Genetic algorithms are commonly used to
generate high-quality solutions to
optimization and
search
problems by relying on bio-inspired operators such as
mutation,
crossover and
selection.[1]
John Holland introduced the genetic algorithm (GA) in the 1960s, based on
Darwin’s theory of evolution. Wikipedia accessed 2018
October 27
https://en.wikipedia.org/wiki/Genetic_algorithm
Method for library design by evaluating the
fit of a parent library to some desired property (e.g. the level of
activity in a biological assay,
or the computationally determined diversity of the compound set) as
measured by a fitness function. The design of more optimal daughter
libraries is then carried out by a heuristic process with
similarities to genetic selection in that it employs replication,
mutation, deletions etc. over a number of generations. IUPAC
Combinatorial Chemistry
An optimization algorithm based on the
mechanisms of Darwinian evolution which uses random mutation,
crossover and selection procedures to breed better models or
solutions from an originally random starting population or sample.
(Rogers and Hopfinger, 1994). IUPAC Computational Related
terms: evolutionary computation, drug design. Narrower term: genetic
programming
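The selection/crossover/mutation loop described above fits in a short program. Below is a toy GA for the classic OneMax problem (maximize the number of 1-bits in a string); population size, mutation rate, and tournament selection are illustrative choices, not any specific published algorithm:

```python
import random

# Toy genetic algorithm on OneMax: tournament selection, one-point
# crossover, and bit-flip mutation over a small bit-string population.

def genetic_algorithm_onemax(n_bits=20, pop_size=30, generations=60, seed=1):
    rng = random.Random(seed)
    def fitness(ind):
        return sum(ind)                       # count of 1-bits
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection: the fitter of two random individuals survives
        parents = [max(rng.sample(pop, 2), key=fitness) for _ in range(pop_size)]
        children = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = rng.randrange(1, n_bits)    # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Bit-flip mutation with a low per-bit probability
                children.append([bit ^ (rng.random() < 0.02) for bit in child])
        pop = children
    return max(pop, key=fitness)
```

Random initial strings average about half 1-bits; after a few dozen generations of selection pressure, the best individual is at or near all ones.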
genetic programming: In artificial
intelligence, genetic programming (GP) is a technique whereby
computer programs are encoded as a set of genes that are then
modified (evolved) using an evolutionary
algorithm (often a genetic
algorithm, "GA") – it is an application of (for example) genetic
algorithms where the space of solutions consists of computer
programs. The results are computer programs that are able to perform
well in a predefined task. Wikipedia accessed 2018 Oct 27
https://en.wikipedia.org/wiki/Genetic_programming
A subset of genetic algorithms. The members
of the populations are the parse trees of computer programs whose
fitness is evaluated by running them. The reproduction operators
(e.g. crossover) are refined to ensure that the child is
syntactically correct (some protection may be given against semantic
errors too). This is achieved by acting upon subtrees. Genetic
programming is most easily implemented where the computer language
is tree structured so there is no need to explicitly evaluate its
parse tree. This is one of the reasons why Lisp is often used for
genetic programming. This is the common usage of the term genetic
programming; however, it has also been used to refer to the
programming of cellular automata and neural networks using a genetic
algorithm. William Langdon "Genetic programming and data structures
glossary" 2012
https://books.google.com/books?id=SVHhBwAAQBAJ&dq=William+Langdon+%22Genetic+programming+and+data+structures+glossar&source=gbs_navlinks_s
Genetic Programming Organization
http://www.geneticprogramming.org
global schema: A schema, or a map of the data content of a data
warehouse that integrates the schemata from several source repositories.
It is "global", because it is presented to warehouse users as the schema
that they can query against to find and relate information from any of
the sources, or from the aggregate information in the warehouse. Lawrence
Berkeley Lab "Advanced Computational Structural Genomics" Glossary Broader term: schema
Hansch analysis:
The investigation of the quantitative relationship
between the biological activity of a series of compounds and their
physicochemical substituent or global parameters representing hydrophobic,
electronic, steric and other effects using multiple regression correlation
methodology. IUPAC Medicinal Chemistry Related term: QSAR
heat map:
A rectangular display that is a direct translation of
a Cluster format data table. Each cell of the data table is represented as a
small color-coded square in which the color indicates the expression value.
Generally green indicates low values, black medium values, and red high ones,
although this is user-settable. The net effect is a colored picture in which
regions of similar color indicate similar profiles or parts of profiles. Related terms: cluster analysis, dendrogram, profile chart;
Expression
heuristic:
Tools such as
genetic algorithms or neural
networks employ heuristic methods to derive solutions which may be
based on purely empirical information and which have no explicit rationalization.
IUPAC Combinatorial Chemistry
Trial and error methods.
Narrower terms: heuristic
algorithm, metaheuristics
heuristic algorithms:
one
that is designed to solve a problem in a faster and more efficient fashion than
traditional methods by sacrificing optimality, accuracy, precision, or
completeness for speed. Heuristic algorithms are often used to solve
NP-complete problems, a class of decision problems. In these problems, there is
no known efficient way to find a solution quickly and accurately, although
solutions can be verified when given. Heuristics can produce a solution
on their own or be used to provide a good baseline to be supplemented with
optimization algorithms.
https://optimization.mccormick.northwestern.edu/index.php/Heuristic_algorithms
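The trade-off described above can be shown with the classic greedy value-density rule for the (NP-complete) 0/1 knapsack problem; the rule is fast and often good, but sacrifices the guarantee of optimality. A minimal sketch with illustrative names:

```python
# Greedy heuristic for 0/1 knapsack: take items in descending
# value-per-weight order while they fit. Fast, but not always optimal.

def greedy_knapsack(items, capacity):
    """items: list of (value, weight); returns (total_value, chosen_items)."""
    total_value, weight_used, chosen = 0, 0, []
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        if weight_used + weight <= capacity:
            chosen.append((value, weight))
            weight_used += weight
            total_value += value
    return total_value, chosen
```

On items [(60, 10), (100, 20), (120, 30)] with capacity 50, the greedy rule returns value 160, while the optimum is 220 (taking the last two items): a concrete case of optimality sacrificed for speed.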
hierarchical clustering:
Unsupervised clustering approach used to
determine patterns in gene expression data. Output is a tree-like structure. Related terms: cluster analysis, self-organizing maps
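The bottom-up (agglomerative) variant can be sketched concisely: start with every point as its own cluster and repeatedly merge the two closest clusters. This toy uses 1-D points and single linkage (distance between closest members); function names are illustrative.

```python
# Sketch of agglomerative hierarchical clustering with single linkage
# on 1-D points: merge the closest pair of clusters until n_clusters remain.

def single_linkage_clusters(points, n_clusters):
    clusters = [[p] for p in points]          # start: every point is a cluster

    def linkage(c1, c2):                      # single linkage: closest pair
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Recording the sequence of merges (rather than stopping at n_clusters) is what yields the tree-like structure drawn as a dendrogram.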
high-dimensionality:
Many applications of machine learning methods
in domains such as information retrieval, natural language processing, molecular
biology, neuroscience, and economics have to be able to deal with various sorts
of discrete data that is typically of very high dimensionality. One standard approach to deal with high-dimensional data is to perform a
dimension reduction and map the data to some lower dimensional representation.
Reducing the data dimensionality is often a valuable analysis by itself, but it
might also serve as a pre-processing step to improve or accelerate subsequent
stages such as classification or regression. Two closely related methods that
are often used in this context and that can be found in virtually every textbook
on unsupervised learning are principal component analysis (PCA) and
factor analysis. Thomas Hofmann, Brown Univ. Statistical Learning in
High Dimensions, Breckenridge CO, Dec. 1999 http://www2.cs.cmu.edu/~mmp/workshopnips99/speakers.html
See also under learning algorithms;
Related terms: cluster analysis, curse of dimensionality, dimensionality
reduction, Generative Adversarial Networks GANs, ill-posed problem, neural nets, principal components analysis
ill-posed problems:
Problems that are not
well-posed in the sense of Hadamard are termed ill-posed. Inverse
problems are often
ill-posed. ... Continuum models must often be discretized in
order to obtain a numerical solution. While solutions may be continuous with
respect to the initial conditions, they may suffer from numerical
instability when
solved with finite precision, or with errors in the data. Even if a problem is
well-posed, it may still be ill-conditioned, meaning that a small error
in the initial data can result in much larger errors in the answers. An
ill-conditioned problem is indicated by a large condition
number…
If [a problem] is not well-posed, it needs to be reformulated for numerical
treatment. Typically this involves including additional assumptions, such as
smoothness of solution. This process is known as regularization.
Wikipedia accessed 2018
Sept 7
https://en.wikipedia.org/wiki/Well-posed_problem
Problems without a unique solution, or without any solution. Life
sciences data tend to be very noisy, leading to ill-posed problems;
interpretation of microarray gene expression
data is an ill-posed problem. Compare: well-posed problem
influence-based data mining:
Complex and granular (as opposed
to linear) data in large databases are scanned for influences between specific
data sets, and this is done along many dimensions and in multi-table formats.
These systems find applications wherever there are significant cause-and-effect relationships between data sets, as occurs, for example, in large
and multivariate gene expression studies, which are behind areas such as
pharmacogenomics. "Data mining" Nature
Biotechnology Vol. 18: 237-238 Supp. Oct. 2000 Broader
term: data mining
information theory:
Founded by Claude Shannon in the 1940s, information theory has had an
enormous impact on communications engineering and computer sciences.
https://www.scientificamerican.com/article/claude-e-shannon-founder/
k-means clustering:
The researcher picks a value for k, say k = 10,
and the algorithm divides the data into that many clusters in such a way that
the profiles within each cluster are more similar than those across clusters.
The actual algorithms for this are quite sophisticated. Although the core
algorithms require that a value of k be selected up front, methods exist that
adaptively select good values for k by running the core algorithm several times
with different values. A non-hierarchical method. Broader terms: cluster analysis, neural nets
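The core assign-then-update loop described above is compact enough to sketch; this toy works on 1-D data, but clustering expression profiles works the same way in more dimensions. Names and defaults are illustrative.

```python
import random

# Minimal k-means sketch on 1-D data: assign each point to its nearest
# centroid, recompute centroids as cluster means, repeat until stable.

def kmeans_1d(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct starting points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]   # update step
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:         # converged
            break
        centroids = new_centroids
    return sorted(centroids)
```

On two well-separated groups of points, the centroids settle on the two group means regardless of which points were chosen as starting centroids.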
Knowledge Discovery in Databases (KDD):
The notion of Knowledge Discovery in Databases (KDD) has been given various
names, including data mining, knowledge extraction, data pattern
processing, data archaeology, information harvesting, siftware, and even (when
done poorly) data dredging. Whatever the name, the essence of KDD is the
"nontrivial extraction of implicit, previously unknown, and potentially
useful information from data" (Frawley et al 1992). KDD encompasses a
number of different technical approaches, such as clustering, data
summarization, learning classification rules, finding dependency networks,
analyzing changes, and detecting anomalies (see Matheus et al 1993). Gregory
Piatetsky-Shapiro, KDD Nuggets FAQ, KDD Nuggets News, 1994 http://www.kdnuggets.com/news/94/n6.txt
Related term: data mining
latent variables:
In statistics, latent
variables (from Latin: present
participle of lateo (“lie
hidden”), as opposed to observable
variables), are variables that
are not directly observed but are rather inferred (through
a mathematical
model) from other variables
that are observed (directly measured). Mathematical models that aim
to explain observed variables in terms of latent variables are
called latent
variable models. Latent
variable models are used in many disciplines, including
psychology,
demography,
economics, engineering, medicine, physics, machine
learning/artificial
intelligence, bioinformatics, natural
language processing, econometrics, management and
the social
sciences. …One advantage of
using latent variables is that they can serve to reduce
the dimensionality of data. A
large number of observable variables can be aggregated in a model to
represent an underlying concept, making it easier to understand the
data. In this sense, they serve a function similar to that of
scientific theories. At the same time, latent variables link
observable ("subsymbolic")
data in the real world to symbolic data in the modeled world.
Wikipedia accessed 2018 Dec 9
https://en.wikipedia.org/wiki/Latent_variable
MathML:
Intended to
facilitate the use and reuse of mathematical and scientific content on the Web,
and for other applications such as computer algebra systems, print typesetting,
and voice synthesis. W3C http://www.w3.org/Math/whatIsMathML.html
metadata: Ontologies & taxonomies
metaheuristic:
In computer
science and mathematical
optimization,
a metaheuristic is
a higher-level procedure or heuristic designed
to find, generate, or select a heuristic (partial search
algorithm)
that may provide a sufficiently good solution to an optimization
problem,
especially with incomplete or imperfect information or limited computation
capacity. Metaheuristics
sample a set of solutions which is too large to be completely sampled.
Metaheuristics may make few assumptions about the optimization problem being
solved, and so they may be usable for a variety of problems.
Wikipedia accessed 2018 Jan 26
https://en.wikipedia.org/wiki/Metaheuristic
Monte Carlo technique: A
simulation procedure consisting of randomly sampling the
conformational space of a molecule. IUPAC Computational Broader
term: simulation
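The underlying idea, estimating a quantity by drawing random samples, is generic; the same sample-and-score approach used to explore a molecule's conformational space is illustrated here with the textbook estimate of π from random points in a square (names and sample size are illustrative):

```python
import random

# Generic Monte Carlo sketch: estimate a quantity from random samples.
# Here, the area ratio of a quarter circle to the unit square estimates pi.

def monte_carlo_pi(n_samples=100000, seed=0):
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0   # inside quarter circle
    )
    return 4.0 * hits / n_samples
```

Accuracy improves only as the square root of the sample count, which is why Monte Carlo conformational searches need many samples.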
multivariate statistics:
A set of statistical tools to analyze
data (e.g., chemical and biological) matrices using regression and/or pattern
recognition techniques. IUPAC Computational
neural networks: Data Science
normalization:
A knotty area in any measurement process, because
it is here that imperfections in equipment and procedures are addressed. The
specifics of normalization evolve as a field matures since the process usually
gets better, and one’s understanding of the imperfections also gets better. In
the microarray field, even larger changes are occurring as robust statistical
methods are being adopted. See also normalization Microarrays
Narrower terms: thresholding
OASIS Organization for the
Advancement of Structured Information Standards:
A
not-for-profit, global consortium that drives the development, convergence and
adoption of e-business standards. http://www.oasis-open.org/who/
OASIS Glossary of terms
http://www.oasis-open.org/glossary/index.php
Open standards
parsing:
Using algorithms to analyze data into components. Semantic
parsing involves trying to figure out what the components mean. Lexical
parsing refers to the process of deconstructing the data into components.
Narrower term:
gene parsing (Drug discovery informatics)
pattern
recognition PR: The
identification of patterns in large data sets using appropriate
mathematical methodologies. Examples are principal component
analysis (PCA), SIMCA, partial least squares (PLS) and artificial
neural networks (ANN) (Rouvray, 1990; Van de Waterbeemd, 1995ab)
IUPAC Computational
Narrower terms: artificial neural networks, molecular pattern
recognition, principal component analysis (PCA), SIMCA, partial
least squares (PLS)
predictive data mining:
Combines pattern matching, influence
relationships, time set correlations, and dissimilarity analysis to offer
simulations of future data sets... these systems are capable of incorporating
entire data sets into their workings, and not just samples, which makes their
accuracy significantly higher ... used often in clinical trial analysis
and in structure-function correlations. "Data mining" Nature Biotechnology
Vol. 18: 237-238 Supp. Oct. 2000 Broader term: data mining
Principal Components Analysis PCA:
Computational approach to reducing the complexity of, for example, a
set of descriptors, by identifying those features which provide the
major contributions to observed properties, and thus reducing the
dimensionality of the relevant property space. IUPAC Combinatorial
Chemistry
A data reduction method using mathematical
techniques to identify patterns in a data matrix. The main element
of this approach consists of the construction of a small set of new
orthogonal, i.e., non-correlated, variables derived from a linear
combination of the original variables. IUPAC Computational
Often confused
with common factor analysis. Neural Network FAQ Part 1
ftp://ftp.sas.com/pub/neural/FAQ.html
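The core of PCA, finding the direction of greatest variance as the dominant eigenvector of the covariance matrix, can be sketched on 2-D data using simple power iteration. This is a minimal illustration, not a full PCA implementation:

```python
import math

# Sketch of PCA's key step: the first principal component of 2-D data,
# found as the dominant eigenvector of the covariance matrix by power
# iteration. Illustrative only; real PCA uses a linear-algebra library.

def first_principal_component(data, iters=200):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # 2x2 covariance matrix entries
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    v = (1.0, 0.0)                      # arbitrary starting direction
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)  # renormalize each step
    return v
```

For points lying along the line y = x, the returned unit vector points along (1/√2, 1/√2), the direction carrying all the variance.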
probability:
Probability web http://www.mathcs.carleton.edu/probweb/probweb.html
Probability web resources include journals, societies, and quotes.
recursive partitioning: Process for
identifying complex structure-activity relationships in large sets
by dividing compounds into a hierarchy of smaller and more
homogeneous subgroups on the basis of the statistically most
significant descriptors. IUPAC Combinatorial Chemistry
Related terms: clustering, principal components analysis
regression analysis:
The use of
statistical methods for
modeling
a set of dependent variables, Y, in terms of combinations of
predictors, X. It includes methods such as multiple linear
regression (MLR) and partial least squares (PLS). IUPAC
Computational
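The simplest case, one predictor fitted by ordinary least squares, can be written out from the closed-form normal-equation solution (function name illustrative):

```python
# Sketch of the simplest regression analysis: ordinary least squares for
# one predictor, y ~ slope * x + intercept, via the closed-form solution.

def linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Multiple linear regression (MLR) and PLS generalize this same idea to many predictors at once.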
regression to the mean:
A common misconception about genetics has to
do with overgeneralization about the likelihood of increased quality by selective breeding.
Two very tall parents will tend to produce offspring who are taller than the
average population, but less tall than the average of the parents'
heights. Or, as the story goes, a famous beauty suggested to George Bernard
Shaw that they have a child: "With your brains and my looks ..." He is said
to have replied, "But what if the child had my looks and your brains?"
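The effect is easy to reproduce in a toy simulation (Python with NumPy; the 70% regression factor and the height figures are illustrative assumptions, not data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
mean, sd = 170.0, 7.0                          # population height, cm

father = rng.normal(mean, sd, n)
mother = rng.normal(mean, sd, n)
midparent = (father + mother) / 2
# Offspring inherit only 70% of the midparent deviation from the mean
offspring = mean + 0.7 * (midparent - mean) + rng.normal(0, 5.0, n)

tall = midparent > 182                         # select very tall parent pairs
```

For the selected families, offspring average well above the population mean yet below their parents' average, exactly as the glossary entry describes.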
self-organization:
also called (in the social
sciences) spontaneous
order, is a process
where some form of overall order arises
from local interactions between parts of an initially disordered system.
The process is spontaneous, not needing control by any external agent. It is
often triggered by random fluctuations,
amplified by positive
feedback. The
resulting organization is wholly decentralized, distributed over
all the components of the system. As such, the organization is typically robust and
able to survive or self-repair substantial perturbation. Chaos
theory discusses
self-organization in terms of islands of predictability in
a sea of chaotic unpredictability.
Self-organization occurs in many physical, chemical, biological, robotic,
and cognitive systems.
Examples of self-organization include crystallization,
thermal convection of
fluids, chemical
oscillation, animal swarming, neural
circuits, and artificial
neural networks.
Wikipedia accessed 2018 Sep 7
https://en.wikipedia.org/wiki/Selforganization
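A minimal illustration (Python with NumPy; an illustrative sketch, not from the Wikipedia source): a one-dimensional majority-rule cellular automaton, in which purely local interactions smooth away an isolated defect with no external control:

```python
import numpy as np

def majority_step(state):
    """One update of a 1-D majority-rule cellular automaton: each cell
    adopts the majority value of itself and its two neighbours
    (periodic boundary conditions)."""
    left = np.roll(state, 1)
    right = np.roll(state, -1)
    return ((left + state + right) >= 2).astype(int)

# A lone defect in an ordered background is erased by local interactions:
state = majority_step(np.array([0, 0, 0, 1, 0, 0, 0, 0]))
```

Each cell sees only its immediate neighbours, yet the array as a whole settles into ordered domains, a small-scale analogue of order arising from local interactions.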
SIMCA (SIMple Classification Analysis or Soft Independent Modeling
of Class Analogy): This
method is a pattern recognition and classification technique (Dunn
and Wold, 1995). IUPAC Computational
time delay data mining:
The data is collected over time and systems
are designed to look for patterns that are confirmed or rejected as the
data set increases and becomes more robust. This approach is geared
toward long term clinical trial analysis and multicomponent
mode of action
studies. "Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct.
2000 Broader term: data mining
trends-based data mining:
Software analyzes large and complex
data sets in terms of any changes that occur in specific data sets over
time. Data sets can be user-defined or the system can uncover them
itself... This is especially important in cause-and-effect biological experiments.
Screening is a good example. "Data mining" Nature Biotechnology Vol. 18:
237-238 Supp. Oct. 2000 Broader term: data mining
well-posed problem:
The mathematical term well-posed
problem stems from a definition given by Jacques
Hadamard. He
believed that mathematical models of physical phenomena should have the
properties that: a solution exists, the solution is unique, the solution's
behavior changes continuously with the initial conditions. ...
If the problem is well-posed, then it stands a good chance of solution on a
computer using a stable
algorithm.
Wikipedia
accessed 2018 Sept 7
https://en.wikipedia.org/wiki/Wellposed_problem
Compare: ill-posed problems
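As an illustration (Python with NumPy; the Hilbert-matrix example is an assumption of this sketch, not from the cited source), a severely ill-conditioned linear system shows how Hadamard's third condition, continuous dependence on the data, can fail in practice on a computer:

```python
import numpy as np

# The Hilbert matrix is a classic ill-conditioned system: tiny changes
# in the data b produce enormous changes in the computed solution.
n = 10
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = H @ x_true

x = np.linalg.solve(H, b)                 # recovered from exact data

# Perturb b by ~1e-10 along the worst-case (smallest singular) direction
U, s, Vt = np.linalg.svd(H)
x_noisy = np.linalg.solve(H, b + 1e-10 * U[:, -1])

cond = np.linalg.cond(H)                  # on the order of 1e13
```

The exact-data solve recovers x_true well, while a perturbation of order 1e-10 in b destroys the solution, so no stable algorithm can rescue the computation without regularization.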
Algorithms resources
Algorithms, terms and definitions, Hans-Georg Beyer, Eva Brucherseifer, Wilfried
Jakob, Hartmut Pohlheim, Bernhard Sendhoff, Thanh Binh To, 2002
http://ls11www.cs.unidortmund.de/people/beyer/EAglossary/
Gary Flake, Computational Beauty of Nature: Computer Explorations of
Fractals, Chaos, Complex Systems and Adaptation. Glossary MIT Press, 2000.
280+ definitions. http://mitpress.mit.edu/books/FLAOH/cbnhtml/glossaryintro.html
Glossary of Probability and Statistics, Wikipedia
https://en.wikipedia.org/wiki/Glossary_of_probability_and_statistics
IUPAC Glossary of Terms Used in Combinatorial Chemistry, D. Maclean, J.J.
Baldwin, V.T. Ivanov, Y. Kato, A. Shaw, P. Schneider, and E.M. Gordon, Pure
Appl. Chem., Vol. 71, No. 12, pp. 2349-2365, 1999, 100+ definitions http://www.iupac.org/reports/1999/7112maclean/
IUPAC
Glossary of Terms used in Computational Drug Design Part II 2015
https://www.degruyter.com/downloadpdf/j/pac.2016.88.issue3/pac20121204/pac20121204.pdf
NIST National Institute of Standards and Technology, Dictionary of
Algorithms, Data Structures and Problems, Paul Black, 2001, 1300+ terms
http://www.nist.gov/dads/
How to look for other unfamiliar terms
IUPAC definitions are reprinted with the permission of the International
Union of Pure and Applied Chemistry.
