Hidden Technical Debt in GenAI Systems




Introduction

If we broadly compare classical machine learning and generative AI workflows, we find that the general workflow steps remain similar between the two. Both require data collection, feature engineering, model optimization, deployment, evaluation, and so on, but the execution details and time allocations are fundamentally different. Most importantly, generative AI introduces unique sources of technical debt that can accumulate quickly if not properly managed, including:

  • Tool sprawl - difficulty managing and selecting from proliferating agent tools
  • Prompt stuffing - overly complex prompts that become unmaintainable
  • Opaque pipelines - lack of proper tracing makes debugging difficult
  • Inadequate feedback systems - failing to capture and utilize human feedback effectively
  • Insufficient stakeholder engagement - not maintaining regular communication with end users

In this blog, we will address each form of technical debt in turn. Ultimately, teams transitioning from classical ML to generative AI need to be aware of these new debt sources and adjust their development practices accordingly - spending more time on evaluation, stakeholder management, subjective quality monitoring, and instrumentation rather than the data cleaning and feature engineering that dominated classical ML projects.

How are Classical Machine Learning (ML) and Generative Artificial Intelligence (AI) Workflows different?

To appreciate where the field is now, it’s useful to examine how our workflows for generative AI compare with those we use for classical machine learning problems. The following is a high-level overview. As this comparison reveals, the broad workflow steps remain the same, but differences in the execution details lead to different steps being emphasized. As we’ll see, generative AI also introduces new forms of technical debt, which have implications for how we maintain our systems in production.

The comparison below walks through each workflow step, contrasting classical ML with generative AI.

Data collection

Classical ML: Collected data represents real-world events, such as retail sales or equipment failures. Structured formats, such as CSV and JSON, are often used.

Generative AI: Collected data represents contextual knowledge that helps a language model provide relevant responses. Both structured data (often in real-time tables) and unstructured data (images, videos, text files) can be used.

Feature engineering / data transformation

Classical ML: Data transformation steps involve either creating new features to better reflect the problem space (e.g., creating weekday and weekend features from timestamp data) or applying statistical transformations so models fit the data better (e.g., standardizing continuous variables for k-means clustering, or log-transforming skewed data so it follows a normal distribution).

Generative AI: For unstructured data, transformation involves chunking, creating embedding representations, and (possibly) adding metadata such as headings and tags to chunks. For structured data, it might involve denormalizing tables so that large language models (LLMs) don’t have to consider table joins. Adding table and column metadata descriptions is also important.
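As a concrete illustration of the chunking step mentioned above, here is a minimal sketch assuming character-based chunks with overlap. The chunk size, overlap, and metadata fields are illustrative choices, not recommended values:

```python
# A minimal chunking sketch: split text into overlapping character
# windows and attach positional metadata to each chunk.

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks, attaching positional metadata."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({"text": piece, "start": start})
        if start + chunk_size >= len(text):
            break  # the rest of the text is already covered
    return chunks
```

In practice, each chunk dictionary would also carry source headings or tags so the retriever can filter and the model can cite where a passage came from.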

Model pipeline design

Classical ML: Usually covered by a basic pipeline with three steps:

  • Preprocessing (statistical column transformations such as standardization, normalization, or one-hot encoding)
  • Model prediction (passing preprocessed data to the model to produce outputs)
  • Postprocessing (enriching the model output with additional information, typically business logic filters)

Generative AI: Usually involves a query-rewriting step, some form of information retrieval, possibly tool calling, and safety checks at the end. Pipelines are much more complex, involve heavier infrastructure such as databases and API integrations, and are sometimes handled with graph-like structures.
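The generative AI pipeline shape described above can be sketched as a chain of stages. Every function here is a hypothetical stand-in for a real component (an LLM call, a vector database, a moderation service):

```python
# Sketch of a generative AI pipeline: query rewriting, retrieval,
# generation, and a final safety check, wired together in sequence.

def rewrite_query(query):
    # Stand-in: a real system would ask an LLM to clean up the query.
    return query.strip().lower()

def retrieve(query):
    # Stand-in: a real system would search a vector database.
    return ["context for: " + query]

def generate(query, context):
    # Stand-in: a real system would call the foundation model API.
    return f"answer to {query!r} using {len(context)} document(s)"

def safety_check(response):
    # Stand-in: a real system would apply moderation filters here.
    return response

def answer(user_query):
    q = rewrite_query(user_query)
    docs = retrieve(q)
    return safety_check(generate(q, docs))
```

In production these stages are rarely a straight line; frameworks often model them as a graph so that, for example, a failed safety check can route back to regeneration.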

Model optimization

Classical ML: Model optimization involves hyperparameter tuning using methods such as cross-validation, grid search, and random search.

Generative AI: While some hyperparameters, such as temperature, top-k, and top-p, may be changed, most effort is spent tuning prompts to guide model behavior. Since an LLM chain may involve many steps, an AI engineer may also experiment with breaking down a complex operation into smaller components.
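One way to picture prompt tuning is as a search over candidate prompts scored against a small evaluation set. The `score` function below is a hypothetical stand-in for a real judge (human review or an LLM grader), and the candidate prompts are purely illustrative:

```python
# Sketch: prompt tuning as a loop over candidate prompts, each scored
# on a small evaluation set; the best-scoring prompt wins.

def score(prompt, example):
    # Stand-in scoring rule: reward prompts that ask for reasoning.
    return 1.0 if "step by step" in prompt else 0.5

def best_prompt(prompts, eval_set):
    # Average the per-example scores for each candidate prompt.
    averages = {
        p: sum(score(p, ex) for ex in eval_set) / len(eval_set)
        for p in prompts
    }
    return max(averages, key=averages.get)

candidates = [
    "Summarize the text.",
    "Summarize the text step by step, then give a one-line summary.",
]
```

The loop structure is the point: unlike grid search over numeric hyperparameters, the "search space" here is discrete prompt variants, and the scoring step is usually the expensive part.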

Deployment

Classical ML: Models are much smaller than foundation models such as LLMs; entire ML applications can be hosted on a CPU without GPUs being needed. Model versioning, monitoring, and lineage are important considerations. Model predictions rarely require complex chains or graphs, so traces are usually not used.

Generative AI: Because foundation models are very large, they may be hosted on a central GPU and exposed as an API to several user-facing AI applications. Those applications act as “wrappers” around the foundation model API and are hosted on smaller CPU instances. Application version management, monitoring, and lineage are important considerations. Additionally, because LLM chains and graphs can be complex, proper tracing is needed to identify query bottlenecks and bugs.
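A minimal sketch of the kind of tracing this calls for: record how long each pipeline stage takes so bottlenecks stand out. The stage names and delays are illustrative, and a production system would use a dedicated tracing library (e.g., OpenTelemetry) rather than this hand-rolled span recorder:

```python
import time
from contextlib import contextmanager

# Record (stage_name, duration_seconds) for every traced stage.
trace = []

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.append((name, time.perf_counter() - start))

# Simulated pipeline stages with artificial delays.
with span("retrieve"):
    time.sleep(0.01)
with span("generate"):
    time.sleep(0.02)

# The slowest stage is the first candidate for optimization.
slowest = max(trace, key=lambda item: item[1])[0]
```

Even this toy version captures the debugging payoff: with per-stage durations attached to each request, a slow retrieval step is distinguishable from a slow generation step at a glance.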

Evaluation

Classical ML: For model performance, data scientists can use defined quantitative metrics such as F1 score for classification or root mean square error for regression.

Generative AI: The correctness of an LLM output relies on subjective judgments, e.g. of the quality of a summary or translation. Therefore, response quality is usually judged with guidelines rather than quantitative metrics.
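Guideline-based evaluation can be sketched as a rubric of pass/fail checks rather than a single numeric score. These particular guidelines are illustrative assumptions, not an established rubric; in practice such checks are often routed to human reviewers or an LLM judge:

```python
# Sketch of guideline-based evaluation: a rubric of named pass/fail
# checks applied to a model response.

GUIDELINES = {
    "non_empty": lambda resp: bool(resp.strip()),
    "cites_context": lambda resp: "[source]" in resp,
    "within_length": lambda resp: len(resp.split()) <= 100,
}

def evaluate(response):
    """Apply every guideline and report which ones the response satisfies."""
    return {name: check(response) for name, check in GUIDELINES.items()}
```

Reporting per-guideline results, rather than one aggregate number, preserves the diagnostic detail that a single F1-style metric would flatten away.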

