Introduction
If we broadly compare classical machine learning and generative AI workflows, we find that the general workflow steps remain similar between the two. Both require data collection, feature engineering, model optimization, deployment, evaluation, and so on, but the execution details and time allocations are fundamentally different. Most importantly, generative AI introduces unique sources of technical debt that can accumulate quickly if not properly managed, including:
- Tool sprawl - difficulty managing and selecting from proliferating agent tools
- Prompt stuffing - overly complex prompts that become unmaintainable
- Opaque pipelines - lack of proper tracing makes debugging difficult
- Inadequate feedback systems - failing to capture and utilize human feedback effectively
- Insufficient stakeholder engagement - not maintaining regular communication with end users
In this blog, we will address each form of technical debt in turn. Ultimately, teams transitioning from classical ML to generative AI need to be aware of these new debt sources and adjust their development practices accordingly - spending more time on evaluation, stakeholder management, subjective quality monitoring, and instrumentation rather than the data cleaning and feature engineering that dominated classical ML projects.
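The "opaque pipelines" debt above is often the cheapest to pay down: even minimal step-level tracing makes failures localizable. The sketch below is a toy illustration, not any particular tracing library; the `traced` decorator, the in-memory `TRACE_LOG`, and the two pipeline steps are all hypothetical stand-ins for a real tracing backend and real LLM calls.

```python
import functools
import time

TRACE_LOG = []  # stand-in for a real tracing backend


def traced(step_name):
    """Record each step's duration and outcome so slow or failing stages stand out."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "error"
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                TRACE_LOG.append({
                    "step": step_name,
                    "status": status,
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("rewrite_query")
def rewrite_query(query: str) -> str:
    # Toy normalization; a real pipeline might call an LLM here.
    return query.strip().lower()


@traced("retrieve")
def retrieve(query: str) -> list:
    # Toy keyword lookup over a fixed corpus.
    return [doc for doc in ["refund policy", "shipping times"] if query in doc]


context = retrieve(rewrite_query("  Refund "))
```

After a run, `TRACE_LOG` holds one record per step with its status and latency, which is exactly the information missing from an untraced chain when a user reports a bad answer.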
How Are Classical Machine Learning (ML) and Generative Artificial Intelligence (AI) Workflows Different?
To appreciate where the field is now, it’s useful to compare generative AI workflows with those we use for classical machine learning problems. The following is a high-level overview. As this comparison reveals, the broad workflow steps remain the same, but differences in the execution details lead to different steps being emphasized. As we’ll see, generative AI also introduces new forms of technical debt, which have implications for how we maintain our systems in production.
| Workflow Step | Classical ML | Generative AI |
|---|---|---|
| Data collection | Collected data represents real-world events, such as retail sales or equipment failures.<br><br>Structured formats, such as CSV and JSON, are often used. | Collected data represents contextual knowledge that helps a language model provide relevant responses.<br><br>Both structured data (often in real-time tables) and unstructured data (images, videos, text files) can be used. |
| Feature engineering / data transformation | Data transformation steps involve either creating new features to better reflect the problem space (e.g., creating weekday and weekend features from timestamp data) or applying statistical transformations so models fit the data better (e.g., standardizing continuous variables for k-means clustering or log-transforming skewed data so it follows a normal distribution). | For unstructured data, transformation involves chunking, creating embedding representations, and (possibly) adding metadata such as headings and tags to chunks.<br><br>For structured data, it might involve denormalizing tables so that large language models (LLMs) don’t have to consider table joins. Adding table and column metadata descriptions is also important. |
| Model pipeline design | Usually covered by a basic three-step pipeline. | Usually involves a query rewriting step, some form of information retrieval, possibly tool calling, and safety checks at the end.<br><br>Pipelines are much more complex, involve more infrastructure such as databases and API integrations, and are sometimes handled with graph-like structures. |
| Model optimization | Model optimization involves hyperparameter tuning using methods such as cross-validation, grid search, and random search. | While some hyperparameters, such as temperature, top-k, and top-p, may be changed, most effort is spent tuning prompts to guide model behavior.<br><br>Since an LLM chain may involve many steps, an AI engineer may also experiment with breaking a complex operation into smaller components. |
| Deployment | Models are much smaller than foundation models such as LLMs; entire ML applications can be hosted on a CPU without GPUs.<br><br>Model versioning, monitoring, and lineage are important considerations. Model predictions rarely require complex chains or graphs, so traces are usually not used. | Because foundation models are very large, they may be hosted on a central GPU and exposed as an API to several user-facing AI applications. Those applications act as “wrappers” around the foundation model API and are hosted on smaller CPUs.<br><br>Application version management, monitoring, and lineage are important considerations. Additionally, because LLM chains and graphs can be complex, proper tracing is needed to identify query bottlenecks and bugs. |
| Evaluation | For model performance, data scientists can use well-defined quantitative metrics such as F1 score for classification or root mean square error for regression. | The correctness of an LLM output often relies on subjective judgment, e.g., of the quality of a summary or translation. Therefore, response quality is usually judged against guidelines rather than quantitative metrics. |
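The pipeline shape described in the "Model pipeline design" row (query rewriting, retrieval, then a safety check) can be sketched end to end. Everything below is a toy stand-in under stated assumptions: keyword overlap substitutes for an embedding-based retriever, a blocklist substitutes for a real moderation model, and the "LLM call" simply echoes the retrieved context.

```python
def rewrite_query(raw_query: str) -> str:
    """Normalize the user query before retrieval (a real system might use an LLM here)."""
    return raw_query.strip().lower().rstrip("?")


def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Rank documents by naive keyword overlap; a real system would use embeddings."""
    terms = set(query.split())
    scored = sorted(
        documents,
        key=lambda d: len(terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]


def safety_check(response: str, blocked_terms: set) -> str:
    """Final guardrail before the response reaches the user."""
    if any(term in response.lower() for term in blocked_terms):
        return "I can't help with that request."
    return response


def answer(raw_query: str, documents: list) -> str:
    query = rewrite_query(raw_query)
    context = retrieve(query, documents)
    # Stand-in for the actual LLM call: echo the retrieved context back.
    draft = "Based on: " + "; ".join(context)
    return safety_check(draft, blocked_terms={"password"})


docs = ["Shipping takes 5 business days", "Returns are accepted within 30 days"]
print(answer("How long does shipping take?", docs))
```

Even in this toy form, the structural contrast with a classical three-step pipeline is visible: each stage has its own failure modes, which is why the tracing and application-version management mentioned in the table matter more here.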

