Most enterprise teams that have moved AI features into production share a surprisingly common gap: nobody knows exactly what prompt is running in their production system right now. It was written by a developer on a Tuesday, adjusted by someone else after a regression, possibly tweaked again in a staging environment that was not cleanly promoted, and now it lives in an environment variable or a hardcoded string somewhere in the codebase. If the model behaviour changes tomorrow, the team will spend time debugging infrastructure before they think to check whether the prompt itself has drifted.
This is the prompt lifecycle problem, and it is more consequential than most engineering leaders recognise when they are scaling their first AI features into production.
A prompt is not configuration text. It is the primary mechanism through which your application communicates intent to a language model. It defines the persona, the constraints, the output format, the failure modes, and the scope of what the system is allowed to do. Changing a prompt in production is equivalent to changing business logic, and it deserves the same discipline: review, testing, version history, and controlled rollout. Yet in the majority of enterprise codebases today, prompts are managed with less rigour than a comment in a shell script.
The practical consequences compound quickly at scale. When an AI feature produces unexpected output, the first question is whether the model changed (via a provider update) or the prompt changed. Without version history, you cannot answer that question cleanly. When a new team member needs to understand why a prompt is structured the way it is, there is no pull request, no commit message, and no review thread explaining the decision. When you need to roll back a prompt change that introduced a regression, you are doing it by memory or by re-testing your way back to something that felt right. None of this is acceptable for production software, and none of it is unique to AI; it is simply the standard failure mode when teams treat a critical artefact as informal configuration.
The solution is not a specialised prompt management platform, at least not initially. The solution is applying the same engineering hygiene to prompts that you already apply to code. Prompts should live in version-controlled files, reviewed through pull requests, tested against a suite of representative inputs before they are merged, and deployed through the same promotion pipeline as the rest of your application code. Prompt changes that affect output behaviour should require approval from whoever owns the relevant business outcome, exactly as a schema change to a critical data model would. For teams operating at higher maturity, evaluation harnesses that score model outputs against expected behaviours give you the equivalent of a test suite, so that regressions surface before they reach production users.
The deeper shift here is cultural and architectural. It requires acknowledging that prompt text is not a detail someone optimises once and forgets. It is a living artefact that encodes organisational knowledge about how an AI capability should behave, and it will change as the model changes, as business requirements evolve, and as you learn more about where the system fails in the real world. Teams that build prompt lifecycle discipline early, when they have a handful of prompts to manage, avoid the far more painful process of retrofitting that discipline after they have dozens of AI features in production and no coherent picture of what any of them are actually doing. If your enterprise is serious about AI in production, source control is where that seriousness starts showing up.