Loading entries…
Loading entries…
Article
AI² terminology guide · May 2026
The AI conversation in industry is drowning in terminology. Vendor pitches, keynotes, and social posts throw around “LLM,” “AI agent,” “agentic,” “RAG,” and “multi-agent” as if they meant the same thing. They do not—and the gap shows up where it hurts: mispriced procurement, governance that does not match the real system, and deployments on the shop floor that assumed a different architecture than the one sold.
Recent work on industrial AI deployments stresses the same point from several angles: when one team hears “agentic” as autonomous multi-agent orchestration and another hears a chatbot with two tool calls, you are not having a disagreement about words—you are misaligning on risk, cost, and accountability before the first sprint ends.
This web guide distils a vendor-neutral vocabulary co-developed by Vlad Larichev and Alexey Samoshilov for AI². The full argument—with literature touchpoints, extended examples, and a reference list—is in the companion PDF below.
Want the longer write-up? Download the PDF companion (same file as the button under “On this page”).
Before term-by-term definitions, it helps to see the ladder modern stacks climb. Each step adds capability—and adds architectural work, governance, and organisational load.
Layer A — Foundation models and LLMs: the base models that provide language (or multi-modal) understanding and generation. Alone they are rarely production-ready for industrial settings; the question is always what you wrap around them.
Layer B — Retrieval and grounding: how outputs connect to verifiable, domain-specific knowledge—RAG, citations, structured checks, provenance. This is where “truth in the plant” starts to become traceable.
Layer C — Agents and multi-agent frameworks: components that observe, reason, plan, and act through tools—APIs, databases, search, controlled execution—with clear permission boundaries.
Layer D — Agentic AI and agentic systems: orchestrated architectures where specialised agents coordinate toward complex goals—planning, memory, delegation, and governance hooks—not merely an LLM calling tools in a loop.
Layer E — Orchestration, governance, and economics: sequencing, auditability, cost of tokens and data pipelines, human checkpoints, and procurement reality. The literature keeps returning to these themes as prerequisites for trust, not polish.
Understanding which layer a product actually operates at is the single most important question in an architecture review or vendor session. Map claims to A–E before you map them to a roadmap.
Terminology is domain-contingent. A “multi-agent system” in a chemical plant—where autonomous decisions can have physical safety consequences—is a different engineering problem than a multi-agent setup that drafts code. This guide is written for industrial and manufacturing contexts first.
When everyone agrees what “RAG,” “tool use,” or “agentic” refers to, you can move faster on reviews, diligence, and governance—without talking past each other.
A foundation model is a large neural network trained on broad data at scale so it can serve as a reusable base for many downstream tasks. The term was popularised by the Stanford ecosystem studying risks and opportunities of such models—see the Stanford Center for Research on Foundation Models (CRFM) for the original framing.
Traditional ML models were often trained for a single task (classify, forecast, detect). Foundation models learn general patterns and are adapted with prompts, retrieval, fine-tuning, or tools rather than always retraining from scratch.
They may be text-only (many LLMs), vision, audio, or multi-modal. For industry, “we build on a foundation model” usually means you are composing on a shared base—not claiming bespoke pretraining for every feature.
An LLM is a text foundation model: it consumes a prompt as tokens and emits text, one token at a time. Architecturally, frontier LLMs almost always build on the Transformer idea introduced in Attention Is All You Need—parallel attention over token sequences.
Training optimises next-token prediction across massive corpora; emergent capabilities (summarisation, code, multi-step reasoning) arise from scale and data diversity—not from a magical separate module.
What an LLM can do alone: generate or transform text, draft procedures, explain code, and chain reasoning inside the context window. What it cannot do alone: access live enterprise systems, reliably know your private manuals without you supplying them, or safely act in OT without a controlled tool layer.
A raw LLM is a powerful text engine with a broad but frozen knowledge base. It has no durable memory, no direct access to external systems, and no ability to act. It can reason, but it cannot do—until you add Layers B upward.
A prompt is the input text (user task, examples, retrieved passages, and instructions). Prompt engineering is the practice of structuring prompts, roles, and examples so outputs are reliable, measurable, and testable—not a one-off creative writing exercise.
Common patterns include zero-shot instructions, few-shot exemplars, chain-of-thought style reasoning steps, and separating stable policy (system prompt) from per-task user content.
The system prompt is the developer-controlled instruction layer that sets role, scope, tone, refusals, and safety posture across a session. Think of it as the job description that keeps a general-purpose model inside your operational boundary.
In regulated environments, the system prompt should be versioned, reviewed, and treated as part of your compliance story alongside logging and access control.
The context window caps how many tokens the model can attend to in one request—prompt, retrieved text, tool outputs, and completion combined. There is no durable cross-session memory unless your application stores and re-injects state.
Tokens are the billing and latency unit: rough heuristics are ~0.75 words per token in English, but code and other languages differ. Long manuals and multi-agent loops burn tokens quickly—cost and latency belong in the architecture review, not only in finance after launch.
Hallucination means fluent but false or ungrounded outputs. The model is not lying—it has no concept of truth. It is producing statistically likely continuations, which sometimes look like torque specs, standards, or citations that never existed.
Industrial response: combine grounding (RAG, citations, structured checks), output validation, constrained formats, and human review for safety-critical or compliance-bound outputs. Teams are still learning how to engineer consistent hallucination-control procedures at scale—that discipline is part of what makes Layer B and E non-optional.
RAG retrieves relevant chunks from your knowledge base at query time, injects them into the prompt, and asks the model to answer with that evidence in scope. It addresses freshness and proprietary knowledge without always retraining weights.
Quality depends on chunking, embeddings, indexing, re-ranking, and evaluation—not on the logo on the slide. A demo on five PDFs is not proof against fifty thousand messy work instructions.
Documents are split into chunks, embedded, and stored in a vector index. At query time the user question is embedded, similar chunks are retrieved (often re-ranked), injected into the prompt with clear delimiters or citations, and the model answers conditioned on that evidence. If retrieval misses the right passage, the model may still sound authoritative—measure retrieval hit-rate and answer faithfulness, not only surface fluency.
Grounding is the broader practice of tying outputs to verifiable sources: retrieved passages, structured databases, knowledge graphs, or canonical primitives. Provenance is the audit trail—which document version, which retrieval, which tool call produced each claim. In plants and regulated industries, provenance is not a nice-to-have; it is what lets quality and legal teams sign off.
Fine-tuning continues training on a smaller domain dataset to shift behaviour or style—distinct from RAG, which supplies facts at inference time. Parameter-efficient methods (e.g., LoRA/QLoRA) reduce cost versus full fine-tunes.
Instruction tuning and preference alignment (RLHF-style methods; see InstructGPT for the classic formulation) improve instruction-following and safety tone—but they do not replace governed tool access for plant actions.
Use fine-tuning when prompts + retrieval cannot reach required formats, tone, or domain syntax; keep expectations grounded in data governance and retraining pipelines.
Embeddings map text or media into vectors where semantic similarity becomes geometric proximity. Vector databases (e.g., Pinecone, Weaviate, Qdrant, Milvus, or Postgres with pgvector) accelerate nearest-neighbour search at scale.
They power RAG, semantic search over maintenance notes, clustering of defect narratives, and hybrid retrieval with keyword filters.
An AI agent pairs an LLM with tools, policies, and orchestration so the system can take bounded actions—not only describe them. Typical tools: ERP/CMMS/PLM APIs, SQL, document search, ticket creation, calculators, and controlled code execution.
The critical vendor question is not “do you use GPT-4?” but which tools exist, with what permissions (read-only vs write), and how actions are audited and rate-limited.
Contrast: a standalone LLM might explain that a pump is due for service and list SAP PM fields to fill. An agent with approved write tools—inside your policy envelope—can draft or create the work order, attach procedures, and notify the crew, leaving an auditable trail.
Tool use (function calling) exposes structured actions to the model as JSON-schema-like contracts. The model proposes calls; your runtime executes them and returns observations—preserving a hard security boundary.
This is the bridge from reasoning to doing: the same pattern underpins maintenance copilots, procurement assistants, and document-to-workflow automations.
Ecosystem libraries (LangChain, LangGraph, CrewAI, AutoGen, Semantic Kernel, and cloud agent builders) accelerate scaffolding; your moat is contracts on tools, observability, and who can approve what in production.
Multi-agent systems (MAS) are a classical field—any society of autonomous agents interacting in an environment. Agentic stacks today usually mean LLM agents coordinating via language, tools, and orchestrators rather than purely hand-authored rules.
The literature draws a sharp line many vendors blur: a true multi-agent setup implies structured coordination and communication protocols—not only an LLM that calls three tools in a loop. Ask which pattern you are buying; reliability and auditability hinge on the answer.
Agentic AI refers to systems with meaningful autonomy: multi-step planning, delegation across specialised agents, tool loops, and recovery paths. It is not synonymous with “has a chat UI.”
Industrial illustration: an anomaly triggers a diagnostic agent (logs + manuals), a planning agent checks production impact, a procurement agent checks spares, a compliance agent checks permits, and a coordinator proposes a plan with human approval before execution.
The market uses “agentic” inconsistently—sometimes for any multi-step prompt with tools, sometimes for orchestrated planning and memory across agents. The architecture-centred view we adopt in the PDF requires structured orchestration, planning, tool use, and cross-agent coordination beyond simple multi-prompt tool calling.
A practical litmus test from the paper: “If removing the word ‘agentic’ from the product description doesn’t change what the product actually does, it’s marketing.” Real agentic architecture implies coordination protocols, shared memory, planning loops, and governance hooks. If those are missing, you still have something valuable—usually Layer C—but not Layer D.
Orchestration decides which agent runs when, how state is passed, where humans intervene, and what happens on failure. Graph frameworks (e.g., LangGraph-style designs), workflow engines, and explicit state machines increase traceability versus opaque prompt spaghetti.
For auditability in plants, the orchestration layer is often as important as model choice. Router–planner–coordinator patterns with provenance and memory show up in the literature as practical, inspectable shapes for scientific and engineering workflows.
Governance spans input, output, action, behavioural, and audit dimensions: what data may enter, how outputs are validated, which tools may run, whether the system stays in role, and whether every decision is traceable. The convergent message across recent industrial-AI papers is that governance and provenance are central to trust and regulatory alignment—not a late-stage polish.
HITL calibrates autonomy to consequence: human-in-the-loop for approvals, human-on-the-loop for supervised autonomy, human-out-of-the-loop only where hazards and verification are provably bounded.
LLMs are stochastic: the same prompt can yield different wording between runs. Many industrial processes still need auditable, reproducible decision paths. The workable pattern is to wrap stochastic reasoning in deterministic guardrails—schemas, validators, logged tool traces, graph-grounded retrieval—so compliance teams can answer what happened and why.
When evaluating production AI, ask: “If I run this exact query twice, will I get the same answer?” If not—and often it will not—follow up with: “What deterministic controls keep outputs inside acceptable bounds?”
Inference is forward-pass execution of a trained model—what happens on every user request. Latency and cost scale with model size, context length, and the number of serial LLM steps in an agent workflow.
The Model Context Protocol (Anthropic, 2024) standardises how clients connect to tool/data “servers,” reducing one-off integrations as your agent surface area grows across PLM, MES, CMMS, and ITSM.
Commercial APIs typically meter input and output tokens separately; agent loops multiply calls. Model token budgets belong next to SLAs and unit economics in business cases.
Frontier models maximise quality for hard reasoning; mid-tier models balance cost and capability; small or edge models support latency, offline, or data-sovereignty constraints. Heterogeneous routing (cheap model first, escalate on uncertainty) is increasingly common.
Closed API models offload ops but raise data-handling questions. Open-weight models you host yourself shift responsibility to your platform team but can satisfy air-gapped or residency requirements—trade-offs are organisational, not only technical.
Physical AI is AI bound to the physical world through sensing, control, and actuation—production lines, robots, inspection cells, energy systems, and mobility. It intersects Industrial AI where decisions must meet timing, determinism, interlocks, and safety integrity levels.
Language-only stacks do not replace PLC/SCADA discipline; they augment planning, vision, diagnostics, and HMI experiences when interfaces and guardrails are engineered deliberately.
Across recent industrial-AI deployment work, five readiness themes recur: (1) grounding and provenance—logging retrievals, tool calls, and reasoning episodes; (2) tool integration and memory—repeatable decision logs, not one-off demos; (3) governance and auditing—policies aligned to risk; (4) determinism and evaluation—auditable pipelines where stakes demand it; (5) economics—explicit models for tokens, data refresh, integration, and lifecycle maintenance.
As the paper puts it bluntly: “If a vendor or internal team can’t articulate their position on all five, the solution isn’t production-ready for industrial deployment.” Use that line in steering committees—it saves quarters.
Manual Q&A: mostly Layers A–B (retrieval and citations); agent loops are light; economics and audit logs (E) still matter once you leave the pilot.
Equipment monitoring across sources: Layers B–C with episodic memory; governance and HITL (D–E) dominate before any autonomous maintenance action.
Robot cell orchestration: Layers C–D for planning and re-planning; Layer E (governance, overrides, risk budgets) is the gating item before autonomy expands.
Not every use case needs all five layers at full intensity—mis-sizing layers is how teams both over-engineer simple problems and under-govern complex ones.
Map claims to A–E before you map them to budget. “We have an LLM” is Layer A—ask what wraps it. “We have agents” is a Layer C claim—ask which tools and writes are allowed. “We are agentic” is a Layer D claim—ask for orchestration, memory, and coordination evidence, not adjectives. “We use RAG” is Layer B—ask for retrieval metrics and provenance, not a slide icon. “Powered by GPT-4 / Claude / Gemini” is still mostly Layer A branding—ask how the model sits inside retrieval, tools, and governance.
Understanding these terms is operational work. Investment, vendor selection, and governance all presuppose that engineering, management, and research mean the same words when they say them.
The progression from LLM to agent to agentic system traces higher capability—and higher complexity, risk, and readiness requirements. The Layer A–E frame is a shared vocabulary for those trade-offs with teams, vendors, leadership, and regulators. Start with clarity; it is the foundation for everything that follows.
The AI² – Association Industrial AI is an independent practitioner network advancing responsible Industrial AI. Explore membership at Join AI². Suggest additions via Contact. For linked primary sources see References below; the PDF companion has the full bibliography and notes.
AI² – Association Industrial AI