Artificial Intelligence, Data Science & Machine Learning Glossary & Taxonomy
SCOPE NOTE: Data Science includes artificial intelligence, big data, data lakes, data quality, data stewardship, data storytelling, data swamp, data visualization, deep learning, deep machine learning, FAIR data, Hadoop, heavy quants / light quants, heuristic, machine learning, neural networks (Neural Networks, Artificial), open science, supervised machine learning, Support Vector Machine (SVM), unsupervised machine learning
"But where machine learning shines is in handling enormous numbers of
predictors — sometimes, remarkably, more predictors than observations —
and combining them in nonlinear and highly interactive ways.1 This
capacity allows us to use new kinds of data, whose sheer volume or
complexity would previously have made analyzing them unimaginable….
Machine learning has become ubiquitous and indispensable for solving
complex problems in most sciences …. In biomedicine, machine learning can
predict protein structure and function from genetic sequences and discern
optimal diets from patients’ clinical and microbiome profiles. The same
methods will open up vast new possibilities in medicine. … Clinical
medicine has always required doctors to handle enormous amounts of data,
from macro-level physiology and behavior to laboratory and imaging studies
and, increasingly, “omic” data. …. Machine learning will become an
indispensable tool for clinicians seeking to truly understand their
patients.”
Predicting the Future — Big Data, Machine Learning, and Clinical Medicine
Ziad
Obermeyer, MD & Ezekiel J. Emanuel, MD, PhD NEJM Catalyst Oct. 10, 2016
https://catalyst.nejm.org/big-data-machine-learning-clinical-medicine/
artificial intelligence (AI): Theory and development of COMPUTER SYSTEMS which perform tasks that normally require human intelligence. Such tasks may include speech recognition, LEARNING, VISUAL PERCEPTION, MATHEMATICAL COMPUTING, reasoning, PROBLEM SOLVING, DECISION-MAKING, and translation of language. Year introduced: MeSH 1986
Or, as some people have noted, AI is laboriously trying to get computers to do what people do intuitively, without great effort. Conversely, there are things computers can do (relatively) effortlessly, such as massive numbers of error-free calculations. The most promising applications seem to involve combining computer-aided consideration of many possibilities with human judgment.
Narrower terms: artificial general intelligence, artificial narrow intelligence, cellular automata, expert systems, fuzzy logic, genetic algorithms, neural nets. Related term: training sets.
Artificial Intelligence for Early Drug Discovery: How to Best Use AI & Machine Learning for Identifying and Optimizing Compounds and Drug Combinations, April 15-16, 2020, San Diego CA. This conference brings together experts from chemistry, target discovery, DMPK and toxicology to talk about the increasing use of computational tools, AI models, machine learning algorithms and data mining in drug design and lead optimization. Introductory-level talks bring attendees up to speed with how AI is being applied in drug discovery, followed by talks introducing advanced concepts using relevant case studies and research findings. https://www.drugdiscoverychemistry.com/Artificial-Intelligence/
Artificial Intelligence in Clinical Research, 2020 Feb 20-21, Orlando FL. Artificial intelligence (AI) and machine learning (ML) have propelled many industries toward a new, highly functional and powerful state. Now they are starting to make their way into the clinical research realm. Many pharmaceutical companies and larger CROs are starting projects involving some elements of AI, ML and robotic process automation in clinical trials.
AI Trends: Business and Technology of Enterprise Artificial Intelligence https://www.aitrends.com/
AIWorld Conference & Expo, September 29 - October 1, 2020, Boston MA https://aiworld.com/ There is no shortage of opinions on the potential for AI technologies in business. However, the current round of solutions is often viewed as expensive, proprietary, and complex to deploy and manage. When will AI solutions scale enterprise- and industry-wide? Is it possible to measure ROI for automation? How does AI rank against other corporate initiatives?
AIWorld Government, June 22-24, 2020, Washington DC https://www.aiworldgov.com/ With AI technology at the forefront of our everyday lives, data-driven government services are now possible from federal, state, and local agencies. This has led to the rapid rise in availability and use of intelligent automation solutions.
big data: data sets that are so voluminous and complex that traditional data processing application software are inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy. … The term has been in use since the 1990s, with some giving credit to John Mashey for coining or at least making it popular.[14][15] Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.[16] Big data philosophy encompasses unstructured, semi-structured and structured data, however the main focus is on unstructured data.[17] Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.[18] Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.[19] Wikipedia https://en.wikipedia.org/wiki/Big_data
data-driven decision making: "Not everyone was embracing data-driven decision making. In fact, we found a broad spectrum of attitudes and approaches in every industry. But across all the analyses we conducted, one relationship stood out: The more companies characterized themselves as data-driven, the better they performed on objective measures of financial and operational results. In particular, companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors. This performance difference remained robust after accounting for the contributions of labor, capital, purchased services, and traditional IT investment. It was statistically significant and economically important and was reflected in measurable increases in stock market valuations." Big Data: The Management Revolution, Andrew McAfee and Erik Brynjolfsson, Harvard Business Review, 2012 Oct https://hbr.org/2012/10/big-data-the-management-revolution
data lake: The idea of a data lake is to have a single store of all data in the enterprise, ranging from raw data (which implies exact copy of source system data) to transformed data which is used for various tasks including reporting, visualization, analytics and machine learning. The data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and even binary data (images, audio, video), thus creating a centralized data store accommodating all forms of data. Wikipedia accessed June 2017 https://en.wikipedia.org/wiki/Data_lake
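To make the contrast between structured and semi-structured inputs concrete, here is a minimal Python sketch (all file, column, and record values are invented for illustration) showing how tabular and JSON records might sit side by side in one ad hoc store; a real data lake would typically live on distributed or object storage rather than in memory.

```python
# Minimal sketch: structured (tabular) and semi-structured (JSON) records kept
# side by side in one ad hoc store. All names and values are invented examples;
# a real data lake would sit on distributed storage (e.g. HDFS or object storage).
import io
import pandas as pd

# Structured data: rows and columns, as exported from a relational database
patients = pd.read_csv(io.StringIO("patient_id,age,diagnosis\n1,54,A\n2,61,B\n"))

# Semi-structured data: one JSON object per line; fields can vary by record
lab_events = pd.read_json(
    io.StringIO('{"patient_id": 1, "test": "glucose", "value": 5.4}\n'
                '{"patient_id": 2, "test": "hba1c", "value": 41}\n'),
    lines=True,
)

# "Schema on read": raw copies are stored as-is, and each consumer reshapes
# the data for its own reporting, visualization or machine learning task.
lake = {"patients/raw": patients, "lab_events/raw": lab_events}

# Example downstream use: join the two sources only when a task needs it
merged = lake["patients/raw"].merge(lake["lab_events/raw"], on="patient_id", how="left")
print(merged)
```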
data quality: A vital consideration for data analysis and interpretation. While people are still reeling from the vast amount of data becoming available, they need to brace themselves to both discard low quality data and handle much more at the same time.
Dr. [John] Sulston lived by one of his favorite dictums: “There is no point
in wasting good thoughts on bad data.” New York Times
https://www.nytimes.com/2018/03/15/obituaries/john-e-sulston-75-dies-found-clues-to-genes-in-a-worm.html
data science: also known as data-driven science, is an interdisciplinary field of scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured,[1][2] similar to data mining. … It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization. … is now often applied to business analytics,[7] or even arbitrary use of data, or used as a sexed-up term for statistics.[8] While many university programs now offer a data science degree, there exists no consensus on a definition or curriculum contents. Wikipedia accessed 2018 Jan 23 https://en.wikipedia.org/wiki/Data_science
An interdisciplinary field involving processes, theories, concepts, tools, and technologies, that enable the review, analysis, and extraction of valuable knowledge and information from structured and unstructured (raw) data. MeSH 2019
Data science is an integral component of modern biomedical research. It is the interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/or complex sets of data. Data science has increased in importance for biomedical research over the past decade and NIH expects that trend to continue. In order to capitalize on the opportunities presented by advances in data science, and overcome key challenges, the NIH is developing a Strategic Plan for Data Science. This plan describes NIH's overarching goals, strategic objectives, and implementation tactics for promoting the modernization of the NIH-funded biomedical data science ecosystem. The complete draft plan is available at: https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf. Request for Information (RFI): Soliciting Input for the National Institutes of Health (NIH) Strategic Plan for Data Science, Notice Number: NOT-OD-18-134, March 2018 https://grants.nih.gov/grants/guide/notice-files/NOT-OD-18-134.html
DataScience@NIH https://datascience.nih.gov/community
data scientist: a high-ranking professional with the training and curiosity to make discoveries in the world of big data. The title has been around for only a few years. (It was coined in 2008 by one of us, D.J. Patil, and Jeff Hammerbacher, then the respective leads of data and analytics efforts at LinkedIn and Facebook.) … More than anything, what data scientists do is make discoveries while swimming in data. It's their preferred method of navigating the world around them. At ease in the digital realm, they are able to bring structure to large quantities of formless data and make analysis possible. They identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set. … As they make discoveries, they communicate what they've learned and suggest its implications for new business directions. Often they are creative in displaying information visually and making the patterns they find clear and compelling. … Data scientists' most basic, universal skill is the ability to write code. … More enduring will be the need for data scientists to communicate in language that all their stakeholders understand—and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or—ideally—both. … Data scientists want to be in the thick of a developing situation, with real-time awareness of the evolving set of choices it presents. Data Scientist: The Sexiest Job of the 21st Century, Thomas H. Davenport and D.J. Patil, Harvard Business Review, Oct 2012 http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/5
data stewardship: Beyond proper collection, annotation, and archival, data stewardship includes the notion of 'long-term care' of valuable digital assets, with the goal that they should be discovered and re-used for downstream investigations, either alone, or in combination with newly generated data. The outcomes from good data management and stewardship, therefore, are high quality digital publications that facilitate and simplify this ongoing process of discovery, evaluation, and reuse in downstream studies. The FAIR Guiding Principles for Scientific Data Management and Stewardship, Mark D. Wilkinson (Madrid, Spain), Michel Dumontier (Stanford CA), Barend Mons (Leiden, Netherlands) et al., Scientific Data, 2016 https://www.nature.com/articles/sdata201618
data storytelling: https://www.forbes.com/sites/brentdykes/2016/03/31/data-storytelling-the-essential-data-science-skill-everyone-needs/#6b7ff3b952ad Related terms: heavy quants, light quants

data swamp:
The data lake has been labeled as a raw data reservoir or a hub for ETL [Extract, Transform, Load] offload. The data lake has been defined as a central hub for self-service analytics. The concept of the data lake has been overloaded with meanings, which puts the usefulness of the term into question.[16] Data in a data lake should not be retained indefinitely, or the lake degrades into a data swamp; most companies that manage data lakes define effective data archival or data removal techniques and procedures to keep the pond within controllable limits. Wikipedia accessed 2018 Jan 24 https://en.wikipedia.org/wiki/Data_lake
data translators: Translators are neither data architects nor data engineers. They're not even necessarily dedicated analytics professionals, and they don't possess deep technical expertise in programming or modeling. Instead, translators play a critical role in bridging the technical expertise of data engineers and data scientists with the operational expertise of marketing, supply chain, manufacturing, risk, and other frontline managers. In their role, translators help ensure that the deep insights generated through sophisticated analytics translate into impact at scale in an organization. At the outset of an analytics initiative, translators draw on their domain knowledge to help business leaders identify and prioritize their business problems, based on which will create the highest value when solved. These may be opportunities within a single line of business (e.g., improving product quality in manufacturing) or cross-organizational initiatives (e.g., reducing product delivery time). Translators then tap into their working knowledge of AI and analytics to convey these business goals to the data professionals who will create the models and solutions. Finally, translators ensure that the solution produces insights that the business can interpret and execute on, and, ultimately, communicates the benefits of these insights to business users to drive adoption. Analytics Translator: The New Must-Have Role, Nicolaus Henke, Jordan Levine, and Paul McInerney, McKinsey, 2018 Feb https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/analytics-translator
data visualization: The classical definition of visualization is as follows: the formation of mental visual images, the act or process of interpreting in visual terms or of putting into visual form. A new definition is a tool or method for interpreting image data fed into a computer and for generating images from complex multi-dimensional data sets (1987). Definitions and Rationale for Visualisation, D. Scott Brown, SIGGRAPH, 1999 http://www.siggraph.org/education/materials/HyperVis/visgoals/visgoal2.htm includes information on data visualization. Related term: information visualization; Broader term: visualization

deep learning: "Deep learning" – another hot topic buzzword – is simply machine learning which is derived from "deep" neural nets. These are built by layering many networks on top of each other, passing information down through a tangled web of algorithms to enable a more complex simulation of human learning. Due to the increasing power and falling price of computer processors, machines with enough grunt to run these networks are becoming increasingly affordable. What is Machine Learning: A complete beginner's guide in 2017, Bernard Marr, Forbes, 2017 May
Supervised or unsupervised machine learning methods that use multiple layers of data representations generated by nonlinear transformations, instead of individual task-specific ALGORITHMS, to build and train neural network models. MeSH 2019
Distributed (Deep) Machine Learning Community https://github.com/dmlc
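As an illustration of the "layering" idea (not drawn from any of the sources above), here is a minimal Python/NumPy sketch of a two-hidden-layer feed-forward pass: each layer applies a linear transformation followed by a nonlinear activation, and stacking such layers is what makes the network "deep". All sizes and weights are placeholders invented for demonstration.

```python
# Minimal sketch of a "deep" (two hidden layer) feed-forward pass in NumPy.
# Weights are random placeholders; a real model would learn them by
# backpropagation on a training set.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Nonlinear activation: without it, stacked layers collapse to one linear map
    return np.maximum(0.0, z)

# Hypothetical sizes: 20 input features, hidden layers of 16 and 8 units, 1 output
W1, b1 = rng.normal(size=(20, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(x):
    h1 = relu(x @ W1 + b1)                            # first layer of representation
    h2 = relu(h1 @ W2 + b2)                           # second, more abstract layer
    return 1.0 / (1.0 + np.exp(-(h2 @ W3 + b3)))      # sigmoid output, e.g. a probability

x = rng.normal(size=(5, 20))                          # 5 example inputs
print(forward(x))                                     # 5 predicted probabilities
```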
FAIR data—Findable, Accessible, Interoperable, Reusable: Meeting the FAIR principles
Principle A: Accessible. The principle of Accessibility speaks to the ability to retrieve data or metadata based on its identifier, using an open, free, and universally implementable standardized protocol. The protocol must support authentication and authorization if necessary, and the metadata should be accessible "indefinitely," and independently of the data, such that identifiers can be interpreted/understood even if the data they identify no longer exists.
Principle I: Interoperable. The Interoperability Principle states that (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation; that vocabularies themselves should follow FAIR principles; and that the (meta)data should include qualified references to other (meta)data.
Principle R: Reusable. The FAIR Reusability principle requires that meta(data) have a plurality of accurate and relevant attributes; provide a clear and accessible data usage license; associate data and metadata with their provenance; and meet domain-relevant community standards for data content.
Publishing FAIR Data: An Exemplar Methodology Utilizing PHI-Base, Frontiers in Plant Science, 2016 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4922217/
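To make the Accessibility and Reusability principles concrete, the following is a small, purely illustrative Python sketch of a metadata record kept separately from the data it describes. Every field value (identifier, URLs, license, vocabulary) is a hypothetical example, not taken from the PHI-Base paper or the FAIR specification itself.

```python
# Illustrative sketch: a metadata record kept independently of the data it
# describes, so the identifier stays interpretable even if the data disappear.
# All identifiers, URLs and license names below are hypothetical examples.
import json

metadata = {
    "identifier": "https://doi.org/10.9999/example.dataset.1",   # persistent, globally unique
    "title": "Example microbiome profiles",
    "access_protocol": "https",                                  # open, free, standardized protocol
    "access_url": "https://repository.example.org/datasets/1",
    "license": "CC-BY-4.0",                                      # clear, accessible usage license (Reusable)
    "provenance": {
        "creator": "Example Lab",
        "derived_from": "https://doi.org/10.9999/example.rawdata.7",
    },
    "vocabulary": "schema.org/Dataset",                          # shared language for knowledge representation
}

# Metadata can be published and indexed on its own (Findable / Accessible)
print(json.dumps(metadata, indent=2))
```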
FAIR findability: the ease with which information contained on a website can be found, both from outside the website (using search engines and the like) and by users already on the website.[1] Although findability has relevance outside the World Wide Web, the term is usually used in that context. Most relevant websites do not come up in the top results because designers and engineers do not cater to the way ranking algorithms work currently.[2] Its importance can be determined from the first law of e-commerce, which states "If the user can't find the product, the user can't buy the product."[3] Wikipedia https://en.wikipedia.org/wiki/Findability accessed 2017 Oct 28
ACCESSIBILITY: https://en.wikipedia.org/wiki/Accessibility
REUSABILITY: In computer science and software engineering, reusability is the use of existing assets in some form within the software product development process. Assets are products and by-products of the software development life cycle and include code, software components, test suites, designs and documentation. Leverage is modifying existing assets as needed to meet specific system requirements. Wikipedia https://en.wikipedia.org/wiki/Reusability accessed 2017 Oct 28
Hadoop: https://www.sas.com/en_us/insights/big-data/hadoop.html http://hadoop.apache.org/

heavy quants, light quants: A "light quant" is someone who knows something about analytical and data management methods, and who also knows a lot about specific business problems. The value of the role comes, of course, from connecting the two. Of course it would be great if "heavy quants" also knew a lot about business problems and could apply their heavy quantitative skills to them, but acquiring deep quantitative skills tends to force out other types of training and experience. The "analytical translator" may also have some light quant skills, but this person is also extremely skilled at communicating the results of quantitative analyses, both light and heavy. … Organizations need people of all quantitative weights and skills. If you want to have analytics and big data used in decisions, actions, and products and services, you may well benefit from light quants and translators. Thomas Davenport, "In praise of 'light quants' and 'analytical translators'", 2015 https://www2.deloitte.com/us/en/pages/deloitte-analytics/articles/in-praise-of-light-quants-and-analytical-translators.html Related term: data storytelling

heuristic: Tools such as genetic algorithms or neural networks employ heuristic methods to derive solutions which may be based on purely empirical information and which have no explicit rationalization. IUPAC Combinatorial Chemistry. Trial and error methods. (A minimal hill-climbing sketch follows the information overload entry below.)

I2B2 Informatics for Integrating Biology & the Bedside: An NIH-funded National Center for Biomedical Computing based at Partners HealthCare System [Boston]. http://www.i2b2.org/

information overload: Biomedicine is in the middle of revolutionary advances. Genome projects, microassay methods like DNA chips, advanced radiation sources for crystallography and other instrumentation, as well as new imaging methods, have exceeded all expectations, and in the process have generated a dramatic information overload that requires new resources for handling, analyzing and interpreting data. Delays in the exploitation of the discoveries will be costly in terms of health benefits for individuals and will adversely affect the economic edge of the country. Opportunities in Molecular Biomedicine in the Era of Teraflop Computing, March 3 & 4, 1999, Rockville MD, NIH Resource for Macromolecular Modeling and Bioinformatics, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign
Many of today's problems stem from information overload and there is a desperate need for innovative software that can wade through the morass of information and present visually what we know. The development of such tools will depend critically on further interactions between the computer scientists and the biologists so that the tools address the right questions, but are designed in a flexible and computationally efficient manner. It is my hope that we will see these solutions published in the biological or computational literature. Richard J. Roberts, The early days of bioinformatics publishing, Bioinformatics 16(1): 2-4, 2000
"Information overload" is not an overstatement these days. One of the biggest challenges is to deal with the tidal wave of data, filter out extraneous noise and poor quality data, and assimilate and integrate information on a previously unimagined scale.
Where's my stuff? Ways to help with information overload, Mary Chitty, SLA presentation, June 10, 2002, Los Angeles CA
Information in OMIM [Online Mendelian Inheritance in Man] and the published working draft of the International Human Genome Sequencing Consortium (Nature 15 Feb. 2001) has been facilitated by ties to NCBI's RefSeq and LocusLink databases. Are there other good examples of integrated databases? Related terms: Bio-Ontology Standards Group, Data Model Standards Group; Functional genomics Gene Ontology
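Referenced from the heuristic entry above: a minimal, self-contained Python sketch (not from IUPAC) of a heuristic search, plain hill climbing with random restarts on a toy objective. It illustrates deriving a usable solution empirically, with no guarantee or explicit rationalization that the result is optimal; the objective function and all parameters are invented for demonstration.

```python
# Minimal sketch of a heuristic search: hill climbing with random restarts.
# The objective function and all parameters are toy examples.
import random

def objective(x):
    # Toy function to maximize; a real application might score a compound or a model
    return -(x - 3.7) ** 2 + 10

def hill_climb(start, step=0.1, iterations=1000):
    best_x, best_score = start, objective(start)
    for _ in range(iterations):
        candidate = best_x + random.uniform(-step, step)   # small empirical perturbation
        score = objective(candidate)
        if score > best_score:                             # keep any improvement found
            best_x, best_score = candidate, score
    return best_x, best_score

# Random restarts reduce (but do not eliminate) the risk of a poor local optimum
results = [hill_climb(random.uniform(-10, 10)) for _ in range(5)]
print(max(results, key=lambda r: r[1]))                    # best solution found, no optimality proof
```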
Integration of data: integration of the various types of large-scale data is currently receiving much attention. There appears, however, to be little agreement on what exactly is meant by "integration", not to mention how to achieve it. The word "integration" is being attached to almost any analysis that involves the combined use of two or more large datasets. Lars J. Jensen, Peer Bork, Quality analysis and integration of large-scale molecular data sets, Drug Discovery Today: TARGETS, 3(2): 51-56, April 2004 https://www.sciencedirect.com/science/article/abs/pii/S1741837204024089?via%3Dihub Integration allows researchers to increase the value they get from the data, because it increases the base of information they can access.
just in time information: http://www.wordstream.com/blog/ws/2013/10/02/just-in-time-information-hacks
Just-In-Time Information Retrieval, Bradley J. Rhodes, Ph.D. Dissertation, MIT Media Lab, May 2000. Just in time retrieval agents, Bradley J. Rhodes http://alumni.media.mit.edu/~rhodes/Papers/rhodes-phd-JITIR.pdf
machine learning: At its most simple, machine learning is about teaching computers to learn in the same way we do, by interpreting data from the world around us, classifying it and learning from its successes and failures. In fact, machine learning is a subset, or better, the leading edge of artificial intelligence. How did machine learning come about? Building algorithms capable of doing this, using the binary "yes" and "no" logic of computers, is the foundation of machine learning – a phrase which was probably first used during serious research by Arthur Samuel at IBM during the 1950s. Samuel's earliest experiments involved teaching machines to learn to play checkers. … For example, in medicine, machine learning is being applied to genomic data to help doctors understand, and predict, how cancer spreads, meaning more effective treatments can be developed. What is Machine Learning: A complete beginner's guide in 2017, Bernard Marr, Forbes, 2017 May https://www.forbes.com/sites/bernardmarr/2017/05/04/what-is-machine-learning-a-complete-beginners-guide-in-2017/#33c58c2f578f
A type of ARTIFICIAL INTELLIGENCE that enables COMPUTERS to independently initiate and execute LEARNING when exposed to new data. Year introduced: MeSH 2016
Machine Learning and Artificial Intelligence, 2020 March 2-4, San Francisco CA. Applying AI and Machine Learning Techniques to Solve Drug Discovery Challenges. Machine learning, specifically for drug discovery, development, diagnostics and healthcare, is highly data-intensive, with disparate types of data being generated in what have historically been trial-and-error processes. Deep learning, machine learning (ML) and artificial intelligence (AI), coupled with correct data, have the potential to make these processes less error-prone and increase the likelihood of success from drug discovery to the real-world setting.
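As a concrete, deliberately tiny illustration of supervised machine learning, the sketch below uses Python with scikit-learn to fit a classifier on labeled examples and score it on held-out data. The bundled breast cancer dataset is used purely for convenience and is not referenced by any of the sources above.

```python
# Minimal supervised machine learning sketch with scikit-learn.
# The bundled breast cancer dataset is used purely for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)        # features and known labels

# Hold out data the model never sees during training to estimate generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)          # a simple, interpretable classifier
model.fit(X_train, y_train)                        # "learning" = estimating parameters from examples

print("held-out accuracy:", model.score(X_test, y_test))
```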
metadata: Taxonomies & Ontologies

neural networks: Communication between statisticians and neural net researchers is often hindered by the different terminology used in the two fields; there is a comparison of neural net and statistical jargon. Often uses fuzzy logic. Narrower terms: artificial neural networks, probabilistic neural networks; Related terms: artificial intelligence

open science:
According to the FOSTER taxonomy,[3] open science can often include aspects of open access, open data and the open source movement, whereby modern science requires software in order to process data and information.[12][13] Open research computation also addresses the problem of reproducibility of scientific results. The term "open science" does not have any one fixed definition or operationalization. On the one hand, it has been referred to as a "puzzling phenomenon".[14] On the other hand, the term has been used to encapsulate a series of principles that aim to foster scientific growth and its complementary access to the public. Two influential sociologists, Benedikt Fecher and Sascha Friesike, have created multiple "schools of thought" that describe the different interpretations of the term.[15] Wikipedia https://en.wikipedia.org/wiki/Open_science
predictive analytics: encompasses a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. … The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict the unknown outcome. Wikipedia accessed April 2015
Predictive Model Markup Language, Data Mining Group http://www.dmg.org/

Python: a remarkably powerful dynamic programming language that is used in a wide variety of application domains. Python is often compared to Tcl, Perl, Ruby, Scheme or Java. About Python http://www.python.org/about/ Wikipedia http://en.wikipedia.org/wiki/Python_(programming_language)

R: a free (libre) programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.[6] The R language is widely used among statisticians and data miners for developing statistical software[7] and data analysis.[8] Wikipedia accessed 2018 Jan 24 https://en.wikipedia.org/wiki/R_(programming_language)
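To ground the idea of "capturing relationships between explanatory variables and the predicted variables" from the predictive analytics entry above, here is a small Python sketch that fits a model on past observations and uses it to predict an unseen outcome. The numbers are synthetic, invented solely for this example.

```python
# Minimal predictive analytics sketch: learn a relationship from historical
# records, then predict an unknown future outcome. Data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical facts: explanatory variables (e.g. ad spend, price index) and observed outcome (e.g. sales)
X_history = np.array([[10, 1.0], [12, 1.1], [15, 0.9], [20, 1.2], [22, 1.0]])
y_history = np.array([100, 112, 140, 180, 205])

model = LinearRegression().fit(X_history, y_history)   # capture the past relationship

# Exploit that relationship to predict an outcome that has not happened yet
X_future = np.array([[25, 1.1]])
print("predicted outcome:", model.predict(X_future)[0])
```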
robust: A statistical test that yields approximately correct results despite the falsity of certain of the assumptions on which it is based. Oxford English Dictionary. Hence, can refer to a process which is relatively insensitive to human foibles and to variables in the way a procedure (for example, an assay) is carried out. Idiot-proof.
Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) - Discussion Paper and Request for Feedback: The Food and Drug Administration announced Tuesday that it is developing a framework for regulating artificial intelligence products used in medicine that continually adapt based on new data. The agency's outgoing commissioner, Scott Gottlieb, released a white paper that sets forth the broad outlines of the FDA's proposed approach to establishing greater oversight over this rapidly evolving segment of AI products. It is the most forceful step the FDA has taken to assert the need to regulate a category of artificial intelligence systems whose performance constantly changes based on exposure to new patients and data in clinical settings. These machine-learning systems present a particularly thorny problem for the FDA, because the agency is essentially trying to hit a moving target in regulating them. FDA developing new rules for artificial intelligence in medicine, STAT, 2019 April 2 https://www.statnews.com/2019/04/02/fda-new-rules-for-artificial-intelligence-in-medicine/
stochastic: "Aiming,
proceeding by guesswork" (Webster's Collegiate Dictionary). Term which is
often applied to combinatorial processes involving true random sampling,
such as selection of beads from an encoded library, or certain methods for
library design. IUPAC COMBINATORIAL CHEMISTRY Truly
random, based on probability.
Support Vector Machine (SVM): SUPERVISED MACHINE LEARNING algorithm which learns to assign labels to objects from a set of training examples. Examples are learning to recognize fraudulent credit card activity by examining hundreds or thousands of fraudulent and non-fraudulent credit card activity reports, or learning to make disease diagnosis or prognosis based on automatic classification of microarray gene expression profiles drawn from hundreds or thousands of samples. Year introduced: MeSH 2012
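A brief illustration (not from MeSH) of this supervised setup in Python with scikit-learn: an SVM is fit on labeled training examples and then assigns labels to objects it has not seen. The bundled iris dataset stands in for gene expression profiles or transaction records.

```python
# Minimal Support Vector Machine sketch with scikit-learn.
# The iris dataset is a stand-in for any set of labeled training examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0)       # kernel and C control the flexibility of the decision boundary
clf.fit(X_train, y_train)            # learn from labeled examples (supervised)

print("predicted labels:", clf.predict(X_test[:5]))   # assign labels to unseen objects
print("held-out accuracy:", clf.score(X_test, y_test))
```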
training set: An initial dataset for which the correct answers are known; the data and correct answers are fed into a training program that adjusts the parameters of the general model. The training program adjusts the model parameters so that the model works well on the given dataset. There are usually enough parameters so that this can be accomplished, provided the dataset is reasonably consistent. The training set usually has to be very large to produce a good classifier. Narrower terms: supervised training sets, unsupervised training sets
Data Sciences Resources
How to look for other unfamiliar terms
IUPAC definitions are reprinted with the permission of the International Union of Pure and Applied Chemistry.