How It Continues: LEGO® Serious Play®, Specification, and Agentic AI

By Johan Roos¹ and Bart Victor²

¹HULT International Business School
²Vanderbilt University

Cite as: Roos, J. and Victor, B. (2026), "How It Continues: LEGO® Serious Play®, Specification, and Agentic AI", International Journal of Management and Applied Research, Vol. 13, No. 2, pp. 64-75. https://doi.org/10.18646/2056.132.26-004

Abstract

Continuing our 2018 account of how LEGO® Serious Play® (LSP) began (Roos and Victor, 2018), we propose LSP as a candidate method for the upstream specification of agentic AI systems. The 2026 AI Index Report describes a field scaling faster than the systems around it can adapt. The first peer-reviewed multi-agent failure taxonomy (Cemri et al., 2025) attributes 42 percent of failures across 1,642 execution traces to system design issues, the largest of three categories; six 2025–2026 enterprise reports corroborate the diagnostic at the practitioner level. Thin upstream specification is a measurable contributor to these failures, and no current methodology produces the depth the work requires. Our 1990s work gave LSP its identity-first epistemology. Reread today, it is specification repair. Two of the method's elements lead the case: simple guiding principles as discretionary judgment, and the minimum-translation workflow newly enabled by multimodal AI. Two design bifurcations discipline the move: identity reserved for upstream LSP-tradition work, agent constitution for runtime grounding; workshop kept autotelic with a concluding deep debrief as the handoff. Where agent operation cannot carry practical wisdom on its own, that wisdom lives in heeding humans engineered into the loop by design, which we name Human Magic in the Loop (HMTL). A counterbalanced study of two specification regimes on a defined task is the natural empirical follow-up.

Section I. The 2026 gap

The 2026 AI Index Report from the Stanford Institute for Human-Centered Artificial Intelligence opens on what its co-chairs call “a field that is scaling faster than the systems around it can adapt” (Stanford HAI, 2026, p. 3). The report assembles the evidence behind that judgment with care. AI agents, in the form of the multi-step systems organisations are now deploying into real workflows, still fail roughly one in three attempts on structured benchmarks. The AI Incident Database recorded 362 notable AI incidents in 2025, up from 233 in 2024. Among organisations surveyed about their AI risks, the share citing inaccuracy as a relevant concern rose from 60 to 74 percent in a single year, an increase of fourteen percentage points (Stanford HAI, 2026).

The three figures together describe a gap. Capability is moving; the methods by which organisations specify, deploy, and govern that capability are not moving with it. The gap is not the existence of failures, which any new technology produces. It is the fundamental distance between what the systems can now do and what the surrounding human and organisational practice can yet specify, supervise, and sustain. That distance is our focus.

If agentic AI systems are failing in patterns that trace meaningfully to thin upstream specification, what method could produce the depth those systems require? The published literature on multi-agent system engineering is rich on coordination protocols, evaluation harnesses, and tooling architectures, and silent on the upstream act of specifying the work the agents are meant to do. The 2026 enterprise reports name the methodological vacuum but do not fill it. The peer-reviewed taxonomy of multi-agent failures attributes the single largest share of failures to a category labelled System Design Issues. Specification depth, not model capability, is the most measured contributor.

We propose that LEGO® Serious Play® (LSP), reread through the lens of the current problem, is a candidate method for closing the upstream gap. The proposition is a reframe of work we began in the 1990s, and it is therefore a retrospective interpretive move that we name openly.

Section II. Reading our history backward

Our proposition that LSP is a method for upstream specification is a reframe of work we began three decades ago. In the 1990s we described what we were doing as a search for an imaginative, identity-first practice of real-time strategy that could replace the thin, analytically dominant strategy work then standard (Roos and Victor, 2018). Reading that work backward through the lens of the current problem in agentic AI is a retrospective interpretive move, named openly here.

The interpretive move is defensible, in our reading, because the underlying epistemology was already there. That autopoietic, self-referential epistemology, which one of us (Roos) developed in the mid-1990s with Georg von Krogh, placed organisational knowledge in identity and history rather than in abstract representation (von Krogh, Roos and Slocum, 1994; von Krogh and Roos, 1995). Its starting point was a claim that arose in our executive education, research, and consulting work, that seemed plausible to us at the time, and that the present moment makes operative: what we see, and what distinctions we make, depend on who we are. Strategy, on that view, could not be specified by analytical method alone. It had to be specified by a process that surfaced the identity, the landscape, and the connections from which the organisation operated in practice. The LSP method became the embodied way that surfacing happened, with construction materials and stories standing in for the textual artefacts strategy work had previously relied on (Roos and Victor, 2018; Roos, 2025).

The current problem inverts the same gap. Agentic AI systems are now being specified by methods that start from the technology rather than from the work, and that strip the depth out of the specifications they produce. The Cemri et al. (2025) taxonomy, which we examine in Section III, gives the failure pattern empirical shape. The 2026 enterprise reports give it institutional confirmation. The gap is the same gap LSP was developed to repair in the 1990s, the gap between thin specification and the depth of identity, judgment, and shared landscape the work requires. The technology has changed; the structural failure mode has not. To name the reframe is to discipline it. We are claiming that a method developed to repair thin specification in human strategy work is a candidate method for repairing thin specification in agent system design, and that its candidacy follows from a continuity in the underlying problem rather than from a sleight of hand. We are not claiming the 1990s work anticipated agentic AI. The reframe is the engine of what follows.

Section III. Predictable specification failures

The visible failures of agentic systems in 2026 are not a random scatter of bugs. They form a pattern, and the pattern has now been measured.

Cemri et al. (2025) built the first Multi-Agent System Failure Taxonomy (MAST) from 1,642 annotated execution traces across seven popular multi-agent frameworks: MetaGPT, ChatDev, HyperAgent, AppWorld, AG2, Magentic-One, and OpenManus. Six expert annotators developed the taxonomy through grounded theory analysis of more than 150 traces, with three annotators reaching an averaged inter-annotator agreement of κ = 0.88, which is within the “almost perfect” band for agreement above and beyond what random labelling would have produced. Across the seven frameworks, failure rates range from 42 to 87 percent on the benchmarks the systems were built to handle. The authors clustered fourteen fine-grained failure modes into three categories:

System Design Issues (42 percent),
Inter-Agent Misalignment (37 percent), and
Task Verification (21 percent).
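
For readers unfamiliar with the agreement statistic reported for the taxonomy, the following is a minimal sketch of how a chance-corrected agreement score can be computed. It assumes the statistic is Cohen's kappa for a pair of annotators, κ = (p_o − p_e) / (1 − p_e); under that assumption, an averaged figure across three annotators would be a mean of pairwise values. The function name and the labels are invented for illustration and are not MAST data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # p_o: observed share of items on which the two annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: agreement expected if each annotator labelled at random
    # according to their own marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical category labels for eight traces (illustrative only).
annotator_1 = ["design", "design", "misalign", "verify",
               "design", "misalign", "verify", "design"]
annotator_2 = ["design", "design", "misalign", "verify",
               "misalign", "misalign", "verify", "design"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

The point of the correction term p_e is that raw percentage agreement overstates reliability when a few categories dominate; kappa reports only the agreement that exceeds what the annotators' label frequencies would produce by chance.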

The category this article addresses is the largest. System Design Issues captures failures that arise from “poor conversation management, unclear task specifications or violation of constraints, and inadequate definition or adherence to the roles and responsibilities of the agents” (Cemri et al., 2025: 17). Five fine-grained modes inside this category, disobeying task specification, disobeying role specification, step repetition, loss of conversation history, and unawareness of termination conditions, are symptoms of inadequate upstream specification of the work the agents are meant to do, not inadequate language models as such.

This pattern is corroborated, at the practitioner level, across recent enterprise reports. BCG (2025) describes four irreconcilable tensions in agentic enterprise design and writes that no existing framework addresses them. IBM (2025) names four emerging roles, the AI orchestrator, the autonomous system auditor, the multi-agent system designer, and the ethics steward, but offers no method for producing the specifications those designers work from. McKinsey (2026) reports that 86 percent of leaders feel unprepared to adopt AI in day-to-day operations. Bain (2026) and Deloitte (2026) describe operational-readiness gaps and jobs that have not yet been redesigned. Accenture (2025) names integration with existing systems as the top risk cited by leaders, an integration problem that is largely a specification problem in disguise. Six reports, six different vantage points, one shared diagnosis: organisations are deploying agentic systems faster than they have methods to specify them at the depth the work requires. Cemri et al. (2025) ground their taxonomy in classical organisation theory, citing Perrow's (1984) work on normal accidents and Roberts's (1989) work on high-reliability organisations: even sophisticated agents can fail catastrophically when organisational structure is flawed. The reading is the same on both sides of the academic-practitioner divide. The MAST authors arrive at it from execution traces; the enterprise reports arrive at it from leadership surveys and field engagements. When peer-reviewed empirical work and six independent practitioner reports describe the same gap from different angles, we assume the gap is real.

The bridge between these two bodies of evidence operates at the level of mechanism. The algorithmic-management literature, especially Möhlmann, Salge and Marabelli’s work on platform workers (2023; cf. Möhlmann et al., 2021), documents how AI mediation reshapes the conditions for organisational sensemaking, and therefore for the depth of the specifications sensemaking produces. Four mechanisms in this literature bear on the diagnostic.

Speed: machine cycles outpace collective sensemaking, leaving no room for the sustained work that depth requires.
Opacity: model decisions cannot be modelled in human heads, putting them beyond the deliberation that would interrogate them.
Distributed agency: the agent contributes without holding the joint situation in mind, acting before specification can anchor it.
Attention erosion: human vigilance deteriorates in the presence of competent automation, so the humans who would do the depth-producing work disengage.

The MAST execution-trace failures and the enterprise-report institutional gaps are both downstream of these conditions, observed at different levels of scale.

A qualifier is built into the data themselves. The System Design Issues category in MAST is the largest single failure category, but not all five of its sub-modes are pure specification-depth failures. State-management and termination-condition issues sit alongside specification failures in the same category. Specification depth is a meaningful contributor within System Design Issues, not the dominant cause of the whole failure distribution. Model behaviour, tool reliability, evaluation methodology, and inter-agent coordination contribute as well, and an upstream specification method cannot reach any of them. We therefore advance a hypothesis from convergent observational evidence, not a controlled experiment. The MAST data and the enterprise reports together establish that specification depth is a measurable, under-addressed contributor and that no deployed methodology produces it. Whether LSP, reframed and updated, closes the gap is the empirical question the follow-up study addresses, under conditions Section VI specifies.

Section IV. LSP as upstream method

If the diagnostic in Section III is right, the field needs methods for producing the specification depth current practice does not produce. LSP was not designed for AI agent system specification. It was designed to produce identity-anchored, landscape-aware, principle-guided strategy with leadership teams (Roos et al., 2004; Roos and Victor, 2018). What allows the reframe to hold is the structural similarity between what strategy work demands and what an agentic system requires at the upstream end: a representation of the work that carries the depth a thin prompt or document cannot.

Two of the LSP method's core elements lead the case. The first is the simple guiding principle, or SGP. In our 1990s formulation, SGPs replaced operational rules with principles that guide discretionary judgment in complex environments (Lissack and Roos, 1999; Oliver and Roos, 2003; 2005; Roos and Victor, 2018). The principle gives the actor a posture and a direction; the actor does the situated work of applying it. Read through the lens of agentic AI, SGPs are a textual form for what an agent does when it operates between instruction and improvisation. The same form that gave human strategists discretionary judgment under ambiguity gives an agent the same discretionary scope, with the human-developed depth as its constraint. The mapping is direct because the underlying problem is the same: how to act sensibly when the specification cannot anticipate every situation. We now read SGPs as manifestations of practical wisdom.

The second strong element is the workflow that produces specifications with minimum translation. In the 1990s, LSP outputs survived as artifacts in the room: built models, narrated stories, photographs, facilitator notes. These artifacts could not enter executive systems of record without being translated into text, and the translation stripped the depth that gave them their value. By 2026, that constraint is gone. Multimodal AI systems with long-context handling can ingest photographs of 3D models, audio recordings of the participants' narrated stories, video of the build sequences, and the participants' own simple guiding principles in their own voices, holding them as agent context without the translation step that previously hollowed them out. The workflow is operationally available now in a way it was not a few years ago. The LSP method does not need to be redesigned to take advantage of it; the AI substrate has caught up with the artifacts the method has always produced.

Two terminological bifurcations discipline the reframe. The first is on "identity." We continue to use the term upstream, in the LSP-tradition sense developed in the 1990s autopoietic epistemology: identity is the answer to "who are we, really?" surfaced through workshop work. We do not use "identity" for the runtime specification the agent operates from. For that, we adopt new language: "agent constitution" or "standpoint conditioning." The bifurcation prevents the philosophical commitments of LSP-tradition identity work from being smuggled into a runtime spec the technology cannot yet carry, and it preserves both terms for the work each is doing.

The second bifurcation is on the workshop itself. Based on his counterbalanced study of LSP facilitators’ use of LLMs, Roos (2026) draws a normative line between preparation and performance: AI visible in preparation, invisible in performance. The LSP workshop is performance for the participants. Real-time AI as witness during the workshop crosses that line. The bifurcation we adopt holds the workshop as autotelic, treating capture as a concluding deep debrief framed explicitly as preparation for the agent system. The deep debrief is the handoff. The workshop's identity-first character is preserved; the depth the agent system needs is captured at the moment in the sequence where capture does not corrupt the work.

Two of the original five elements of LSP need more engineering before they are presentable as agent mechanisms. “Heedful interrelating,” drawn from Weick and Roberts (1993) by way of Ryle (1949), describes humans co-present with shared stakes who attend to and care for one another in real time. Transferring that to agents is metaphor, not a mechanism we can yet describe.

“Challenging Imagination” as an agent reasoning mode faces a parallel problem: critic agents already exist and mostly produce noise. Challenging imagination ungrounded in shared purpose drifts into cynicism, which is what current critic agents demonstrate. Distinguishing substantive challenge from surface disagreement is what skilled facilitators do on the spur of the moment, and we do not yet know how to convert that move into a reliable agent role rather than a facilitation move that sits outside the system. Frameworks for argument evaluation may eventually serve as a candidate scaffold for the second case (Roos, 2026). We name both elements as needing engineering rather than presenting them as ready.

A reader not familiar with LSP may ask whether the method, after three decades and many practitioner adaptations, holds up as a research subject. Two 2025 peer-reviewed studies address this question directly. We cite them because they take LSP itself as their object of analysis rather than applying it within a particular sector. Henderson, Shipway and Jones (2025) situate LSP within the landscape of creative and participatory research methodologies, comparing it systematically with narrative inquiry, arts-based research, visual elicitation, design thinking, and related approaches. They document its grounding in constructivist and constructionist theories and identify the features that distinguish it methodologically from adjacent traditions. Chew et al. (2025) provide the bibliometric counterpart: 268 publications across Web of Science and Scopus, mapped through co-citation, keyword co-occurrence, and bibliographic coupling. Their analysis identifies constructivist theory, flow psychology, and participatory design as the field's strongest conceptual foundations, with emerging research streams in psychological safety, human-computer interaction, and inclusive co-design. Read together, the two studies establish that LSP has the documentable themes, trajectories, and methodological pluralism of a research community, not a single-organisation methodology. This is the only claim we draw from them. Neither study addresses agentic systems, and our reframe of LSP for that purpose rests on a separate warrant developed in the sections that follow.

The strongest objection to this argument is that LSP’s outputs, however rich in identity, landscape, and guiding principles, may not survive translation to agent grounding any better than thin-text specifications do. The agent has its base training; the workshop artifacts compete with that training at every decision point; the long context window does not guarantee that the right elements are weighted appropriately, drawn on at decision points, or sustained across interactions. The objection is valid. It is exactly the empirical question this conceptual paper cannot settle. The objection points to a follow-up empirical study, not to a weakness of the current argument.

Section V. HMTL: where practical wisdom lives in operation

A well-specified agentic system, even one grounded in the deepest workshop artifacts, still must act in the world. Decisions will be made under conditions the specification did not anticipate. Values will be balanced against each other when neither was given priority. Edge cases will resolve themselves into outcomes the system was never asked to defend. The question is sharp: where does practical wisdom live in a workshop-grounded agentic system in operation?

Our answer is that practical wisdom in operation lives in the humans whose judgment the system is engineered to require at the moments where representation and subordination matter. We call this design principle Human Magic in the Loop, or HMTL. The operative posture of those humans is heeding, not deferring.

The heeding-versus-deferring distinction is borrowed from Tsai and Ku (2025: 3080), who set out the philosophical move from Aristotle: “To live well is to function well. To function well, for human beings, is to reason well. To reason well requires phronesis, which is a capacity to make excellent practical judgments.” They name two competing principles in current thinking about AI and human judgment. The Principle of Epistemic Fulfilment, drawing on the Aristotelian function argument, holds that humans must exercise and develop their rational capacity, including practical wisdom, because rational activity is constitutive of human flourishing. The Principle of Epistemic Deference, drawing on a utilitarian frame, holds that when an AI system occupies an epistemically superior vantage point, humans should defer to its judgment because doing so reduces harm. Tsai and Ku resolve the tension between these with a third principle, which they call the Principle of Epistemic Heed: humans comprehend AI's output, understand its reasoning, and then decide. Heeding preserves first-person rational engagement; deferring replaces it.

The distinction matters in practice. A human-in-the-loop who rubber-stamps the agent's output is deferring, not heeding. The presence of humans in the workflow does no real work; the system has acquired a procedural check that adds neither comprehension nor judgment. HMTL requires heeding. The human is engineered into the workflow at the moments where comprehension and judgment are necessary, and the human is equipped, by the workshop-grounded specification itself, with the depth of context required to comprehend what the agent has produced.

HMTL works only if the humans in the loop can heed, rather than just defer. That capacity is not automatic, and the conditions under which it forms are themselves at risk in AI-mediated work (Victor, 2026). Victor and Alexander (2026) name four formative preconditions for ethical judgment under AI mediation (identity stability, ethical vigilance, composure, and deliberate entanglement) and argue that these are developmental thresholds that must be crossed before judgment can function as a lived professional capacity rather than performative competence. A human-in-the-loop who has not crossed these thresholds will tend to defer, adding the procedural check without the substantive judgment.

Where exactly do humans belong in the loop? Practical wisdom, on Graves’s (2026) account, has four components: moral sensitivity (perception of ethically salient features of a situation), moral judgment (adjudication among competing values in contextually complex cases), moral motivation (sustained commitment to ethical action beyond a static utility function), and integrated affective-cognitive regulation (calibration of response appropriate to context). Machine learning systems can approximate sensitivity and regulation through attention mechanisms and multi-agent architectures. Judgment and motivation are different. Both depend on what Graves calls a theory of the good and a sustained identity oriented toward it. Both are, in his analysis, beyond what current architectures can produce on their own.

The four components give HMTL its operational position. Humans are engineered into the loop at moral judgment and moral motivation, i.e., the components Graves identifies as least tractable for the machine. The agent is doing the work of moral sensitivity and regulation, surfacing the salient features and calibrating the response. The human is doing the work of adjudication and commitment, deciding what matters when the values conflict and sustaining the orientation toward the good across decisions the agent will make in the human's absence. The workshop produces the theory of the good and the sustained identity Graves names as preconditions for those operations.
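
The division of labour just described can be sketched as a simple routing rule. This is an illustration under stated assumptions, not an implementation the paper proposes: the names `Component`, `Step`, `HUMAN_COMPONENTS`, and `route` are hypothetical, as are the example steps. The only part taken from the text is the assignment of moral judgment and moral motivation to humans, and of sensitivity and regulation to the agent.

```python
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    """The four components of practical wisdom named in the text."""
    MORAL_SENSITIVITY = "moral sensitivity"
    MORAL_JUDGMENT = "moral judgment"
    MORAL_MOTIVATION = "moral motivation"
    REGULATION = "integrated affective-cognitive regulation"


# Per the text: humans are engineered into the loop at judgment and
# motivation; the agent handles sensitivity and regulation.
HUMAN_COMPONENTS = frozenset(
    {Component.MORAL_JUDGMENT, Component.MORAL_MOTIVATION}
)


@dataclass(frozen=True)
class Step:
    """One step in a workflow, tagged with the component it exercises.

    How a step gets tagged is the hard part and is not addressed here;
    the tags below are assigned by hand for illustration.
    """
    description: str
    component: Component


def route(step: Step) -> str:
    """Return 'human' or 'agent' according to the HMTL assignment."""
    return "human" if step.component in HUMAN_COMPONENTS else "agent"


# Hypothetical workflow steps.
steps = [
    Step("flag ethically salient features of a request",
         Component.MORAL_SENSITIVITY),
    Step("decide which of two conflicting values takes priority",
         Component.MORAL_JUDGMENT),
    Step("calibrate tone and urgency of the response",
         Component.REGULATION),
    Step("reaffirm the commitment guiding later unattended decisions",
         Component.MORAL_MOTIVATION),
]

for s in steps:
    print(f"{route(s):5s} <- {s.description}")
```

A rule of this kind says only where the human enters. Whether the human then heeds rather than defers depends on the context the specification supplies, which no routing table can guarantee.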

This division of labour echoes the conditions documented in two decades of empirical research on algorithmic management. Möhlmann, Salge and Marabelli (2023) describe algorithm sensemaking as the multi-step strategic process by which platform workers comprehend, test, and respond to algorithmic decisions. Their three sub-elements (focused enactment, selection modes, and retention sources) document what humans do at the human-machine interface when the work matters to them. Möhlmann, Zalmanson, Henfridsson and Gregory (2021) document the tensions algorithmic management produces in work execution, compensation, and belonging. The conclusion is that where the algorithm shapes work, humans neither defer fully nor revolt fully. They engage in sensemaking that resembles, in its disciplined comprehension and contextual judgment, exactly what Tsai and Ku call heeding. HMTL is the design principle that takes this empirical pattern seriously and engineers it into the system from the start, rather than leaving it to emerge under conditions the specification did not anticipate.

Kellogg, Orlikowski and Yates (2006) supply the coordination language for what the workshop outputs become in this design. Their trading zone frame describes coordination across boundaries through three practice families: displaying, representing, and assembling. A workshop-grounded specification, in the LSP method, displays the participants' shared landscape; represents their identity, simple guiding principles, and the connections among the agents in their world; and assembles those representations into a form the agent system can hold as context. The agent operates from the assembled context; the human enters the loop at the representational moments where the agent must hand back judgment.

The amplification frame in Roos (2026) ties this operative architecture to the broader argument. The book's Chapter 6 sets out the preparation-performance boundary that disciplines the integration of AI into human collaborative work: AI visible in preparation, invisible in performance; backstage AI, frontstage humans. HMTL is amplification rendered as system design rather than amplification asserted. The workshop is the preparation; the deep debrief is the explicit handoff from preparation to deployment; the heeding human, equipped with the workshop's depth, is the operating channel through which practical wisdom enters the system's behaviour at the moments where it must.

The structural account of why the agent needs the human at these moments comes from Asch (1952), via Ryle (1949). Asch identified three operations that must be reciprocal for a group to act as a group rather than as a collection of individuals: contributing (acting in a way that envisages the system), representing (holding a model of the joint situation in mind), and subordinating (defining one’s action in relation to joint requirements). Weick and Roberts (1993) showed that all three, reciprocated, produce heedful interrelating; their absence produces heedlessness. In the absence of those reciprocal operations, each agent continues to act within its own local frame while the pattern between them stops minding.

Agents have access to one of Asch’s three operations. They contribute. They cannot hold a model of the joint situation as the joint situation (represent), and they cannot define their action in relation to joint requirements except instrumentally (subordinate). The triangle is missing two legs. This is the structural account of what Möhlmann calls distributed agency. HMTL is the design principle that places humans at the represent and subordinate moments as the only available carriers of those operations the joint action requires.

HMTL is therefore a design principle, not an operational recipe. The principle says where the human belongs in the system and what the human is doing there. The recipe, which depends on the work domain, the stakes that matter, and the failure modes the system must guard against, is the empirical work the next paper will address.

Section VI. Closing the loop

Three propositions stand together: that the failures now visible in agentic systems trace meaningfully, though not exclusively, to thin upstream specification; that LSP, reframed as specification work and updated for multimodal AI, is a candidate method for producing the depth the upstream lane requires; that practical wisdom in agent operation lives in heeding humans engineered into the loop by design, which we name Human Magic in the Loop. Each is offered as a hypothesis the field can now test.

Only the second proposition has an empirical test now in motion. A counterbalanced comparison of two versions of the same agent system on a defined task, one grounded in traditional, thin specifications and one grounded in LSP workshop artifacts (built models, narrated stories, photographs, and the participants’ own simple guiding principles), could begin to tell us whether specification depth, as LSP produces it, changes the agent behaviour in describable ways. The latter version requires more than the former: a real applied context, a team that builds the system, and the workshop participants whose identity and values the system is grounded in. If the behaviour of the two versions differs, the candidate method has empirical traction. If it does not, the method is a thought experiment, and we will say so. The diagnostic and the HMTL design principle remain conceptual claims this paper makes; their tests are different and longer in horizon.
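
One way to picture the counterbalancing in such a study is as an order assignment. The sketch below assumes that what is counterbalanced is the order in which each evaluator encounters the two versions; the paper does not specify the unit, so this is one possible reading. The condition labels, evaluator identifiers, and the `counterbalance` function are hypothetical.

```python
import random

# Hypothetical labels for the two specification regimes.
CONDITIONS = ("thin_spec", "lsp_spec")


def counterbalance(unit_ids, seed=0):
    """Assign each unit an order of the two conditions.

    Half of the units receive (thin_spec, lsp_spec) and half receive
    (lsp_spec, thin_spec), so order effects are balanced across the
    comparison. Which units get which order is randomised.
    """
    ids = list(unit_ids)
    if len(ids) % 2:
        raise ValueError("an even number of units is needed for balance")
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return {
        uid: CONDITIONS if i < half else CONDITIONS[::-1]
        for i, uid in enumerate(ids)
    }


# Eight hypothetical evaluators.
schedule = counterbalance([f"evaluator_{k}" for k in range(1, 9)])
for uid in sorted(schedule):
    print(uid, "->", " then ".join(schedule[uid]))
```

Balancing the order matters because an evaluator who sees the thin-specification version first may judge the second version differently simply through familiarity with the task.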

We close the loop on a longer arc. Three decades ago, we set out to give organisations a method for surfacing the depth their strategy practice was systematically stripping out. The technology has changed since then, and so has the form the failure mode takes. The practical problem has not. Organisations are again being asked to specify work the dominant practice cannot specify at the depth the work requires. The same method, in updated form, may again be the lane that closes the gap. Whether it does is now an empirical question we can put to the test. If LSP, reframed, can do for agentic system design what it once did for strategy work, the next quarter-century of LSP practice is in the work the method does for organisations now confronting agentic AI. The first quarter-century gave us the method. The second begins by reading it backward.

References

Accenture (2025), The New Rules of Platform Strategy in the Age of Agentic AI: Five Priorities to Help Companies Align People, Platforms and Intelligence, by F. Brunier, C. Roark and S. Mukherjee, Available at: https://www.accenture.com/us-en/insights/strategy/new-rules-platform-strategy-agentic-ai [Accessed on 4 May 2026].
Asch, S. E. (1952), Social Psychology, Englewood Cliffs, NJ: Prentice-Hall.
Bain (2026), "The AI Enterprise: Code Red. Four Questions Every Executive Must Answer Now", Bain Brief, by J. Anderson, M. Draief, F. Mueller and J. Hadley, Boston, MA: Bain & Company. Available at: https://www.bain.com/insights/ai-enterprise-code-red/ [Accessed on 4 May 2026].
BCG (2025), The Emerging Agentic Enterprise: How Leaders Must Navigate a New Age of AI, by S. Ransbotham, D. Kiron, S. Khodabandeh, S. Iyer and A. Das, November. Boston, MA: Boston Consulting Group, in collaboration with MIT Sloan Management Review. Available at: https://sloanreview.mit.edu/projects/the-emerging-agentic-enterprise-how-leaders-must-navigate-a-new-age-of-ai/ [Accessed on 4 May 2026].
Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., Zaharia, M., Gonzalez, J. E. and Stoica, I. (2025), "Why Do Multi-Agent LLM Systems Fail?", in Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Track on Datasets and Benchmarks, spotlight paper. arXiv:2503.13657. https://arxiv.org/abs/2503.13657
Chew, R. S. Y., Yin, Z., Liu, Q., Hanefar, S. B. M., Ikram, M., Jiang, M. and Li, P. (2025), "LEGO® Serious Play® Research: A Bibliometric Mapping of Themes, Trajectories and Frontiers (2015–2025)", International Journal of Learning, Teaching and Educational Research, Vol. 24, No. 12, pp. 569–607. https://doi.org/10.26803/ijlter.24.12.25
Deloitte (2026), State of AI in the Enterprise: The Untapped Edge. New York: Deloitte. Available at: https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-ai-in-the-enterprise.html [Accessed on 4 May 2026].
Graves, M. (2026), "AI Practical Wisdom and Compassion", AI and Ethics, Vol. 6, Article 39. https://doi.org/10.1007/s43681-025-00877-4
Henderson, H., Shipway, R. and Jones, (2025), "Constructing Insights: Exploring the Position of Lego® Serious Play® Within the Landscape of Creative and Participatory Research Methodologies", Qualitative Inquiry, Online First, https://doi.org/10.1177/10778004251401846
IBM (2025), Agentic AI's Strategic Ascent: Shifting Operations from Incremental Gains to Net-New Impact. Armonk, NY: IBM Institute for Business Value. Available at: https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/agentic-ai-operating-model [Accessed on 4 May 2026].
Kellogg, K. C., Orlikowski, W. J. and Yates, J. (2006), "Life in the Trading Zone: Structuring Coordination Across Boundaries in Postbureaucratic Organizations", Organization Science, Vol. 17, No. 1, pp. 22–44. https://doi.org/10.1287/orsc.1050.0157
Lissack, M. and Roos, J. (1999), The Next Common Sense: Mastering Corporate Complexity through Coherence, London: Nicholas Brealey Publishing.
McKinsey (2026), The State of Organizations 2026: Three Tectonic Forces That Are Reshaping Organizations, February. New York: McKinsey & Company. Available at: https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/the-state-of-organizations
Möhlmann, M., Salge, C. A. de L. and Marabelli, M. (2023), "Algorithm Sensemaking: How Platform Workers Make Sense of Algorithmic Management", Journal of the Association for Information Systems, Vol. 24, No. 1, pp. 35–64. https://doi.org/10.17705/1jais.00774
Möhlmann, M., Zalmanson, L., Henfridsson, O. and Gregory, R. W. (2021), "Algorithmic Management of Work on Online Labor Platforms: When Matching Meets Control", MIS Quarterly, Vol. 45, No. 4, pp. 1999–2022. https://doi.org/10.25300/MISQ/2021/15333
Oliver, D. and Roos, J. (2003), "Dealing with the Unexpected: Critical Incidents in the LEGO Mindstorms team", Human Relations, Vol. 56, No. 9, pp. 1055–1080. https://doi.org/10.1177/0018726703569002
Oliver, D. and Roos, J. (2005), "Decision Making in High Velocity Environments: The Importance of Guiding Principles", Organization Studies, Vol. 26, No. 6, pp. 889–913. https://doi.org/10.1177/0170840605054609
Perrow, C. (1984), Normal Accidents: Living with High-Risk Technologies, New York: Basic Books. ISBN 0-465-05144-8.
Roberts, K. H. (1989), "New Challenges in Organizational Research: High Reliability Organizations", Industrial Crisis Quarterly, Vol. 3, No. 2, pp. 111–125. https://doi.org/10.1177/108602668900300202
Roos, J. (2025), Foundations of the LEGO® Serious Play® Method: Reflections for Facilitators. https://www.notion.so/Foundations-of-LEGO-Serious-Play-A-Reflection-by-Johan-Roos-240114f410b2809495cafffc13a8f589 [Accessed on 4 May 2026].
Roos, J. (2026), Human Magic: Leading with Wisdom in an Age of Algorithms, Abingdon, Oxon and New York: Routledge (Taylor & Francis Group). https://doi.org/10.4324/9781003732570
Roos, J. and Victor, B. (2018), "How It All Began: The Origins Of LEGO® Serious Play®", International Journal of Management and Applied Research, Vol. 5, No. 4, pp. 326–343. https://doi.org/10.18646/2056.54.18-025
Roos, J., Victor, B. and Statler, M. (2004), "Playing Seriously with Strategy", Long Range Planning, Vol. 37, No. 6, pp. 549–568. https://doi.org/10.1016/j.lrp.2004.09.005
Ryle, G. (1949), The Concept of Mind, London: Hutchinson.
Stanford HAI (2026), Artificial Intelligence Index Report 2026, Stanford, CA: Stanford University Institute for Human-Centered Artificial Intelligence. Available at: https://hai.stanford.edu/ai-index/2026-ai-index-report [Accessed on 4 May 2026].
Tsai, C.-H. and Ku, H.-l. (2025), "Why AI May Undermine Phronesis and What to Do about It", AI and Ethics, Vol. 5, No. 3, pp. 3079–3086. https://doi.org/10.1007/s43681-024-00617-0
Victor, B. (2026), "When Judgment Is Pre-empted: Ethical Agency and Professional Responsibility under AI Mediation", Unpublished manuscript.
Victor, B. and Alexander, B. N. B. (2026), "Ethics Education as AI Literacy: Threshold Pedagogy for Human-Centered Professional Formation", Unpublished manuscript.
Von Krogh, G. and Roos, J. (1995), Organizational Epistemology, Basingstoke: Macmillan.
Von Krogh, G., Roos, J. and Slocum, K. (1994), "An Essay on Corporate Epistemology", Strategic Management Journal, Vol. 15, No. S2, Special Issue: Strategy: Search for New Paradigms (Summer), pp. 53–71. https://doi.org/10.1002/smj.4250151005
Weick, K. E. and Roberts, K. H. (1993), "Collective Mind in Organizations: Heedful Interrelating on Flight Decks", Administrative Science Quarterly, Vol. 38, No. 3, pp. 357–381. https://doi.org/10.2307/2393372
