Introducing LangExtract: A Gemini-powered information extraction library

This article introduces LangExtract, Google's new open-source Python library that uses large language models (LLMs) such as Gemini to programmatically extract structured information from unstructured text. LangExtract offers a flexible, traceable solution to problems such as manually sifting through data, building bespoke processing code, and LLM-generated false information (also known as hallucination). Its key features include precise source grounding, which maps every extracted entity back to its origin in the text, and reliable structured output achieved through few-shot examples and controlled-generation techniques. LangExtract is optimized for long-document information extraction, using chunking and parallel processing to handle large documents efficiently. It also provides interactive visualization, supports a variety of LLM backends, and adapts flexibly to new domains without model fine-tuning. The article includes a Python quick-start guide and demonstrates LangExtract's use in specialized areas such as literary analysis, medical information extraction, and structured radiology reporting. LangExtract aims to help developers mine value from large volumes of text data more efficiently.




In today's data-rich world, valuable insights are often locked away in unstructured text, such as detailed clinical notes, lengthy legal documents, customer feedback threads, and evolving news reports. Manually sifting through this information or building bespoke code to process the data is time-consuming and error-prone, and using modern large language models (LLMs) naively may introduce errors. What if you could programmatically extract the exact information you need, while ensuring the outputs are structured and reliably tied back to their sources?

Today, we're excited to introduce LangExtract, a new open-source Python library designed to empower developers to do just that. LangExtract provides a lightweight interface to various LLMs such as our Gemini models for processing large volumes of unstructured text into structured information based on your custom instructions, ensuring both flexibility and traceability.

Whether you're working with medical reports, financial summaries, or any other text-heavy domain, LangExtract offers a flexible and powerful way to unlock the data within.


What makes LangExtract effective for information extraction

LangExtract offers a unique combination of capabilities that make it useful for information extraction:

  • Precise source grounding: Every extracted entity is mapped back to its exact character offsets in the source text. As demonstrated in the animations below, this feature provides traceability by visually highlighting each extraction in the original text, making it much easier to evaluate and verify the extracted information.

  • Optimized long-context information extraction: Information retrieval from large documents can be complex. For instance, while LLMs show strong performance on many benchmarks, needle-in-a-haystack tests across million-token contexts show that recall can decrease in multi-fact retrieval scenarios. LangExtract is built to handle this using a chunking strategy, parallel processing and multiple extraction passes over smaller, focused contexts.

  • Interactive visualization: Go from raw text to an interactive, self-contained HTML visualization in minutes. LangExtract makes it easy to review extracted entities in context, with support for exploring thousands of annotations.

  • Flexible support for LLM backends: Work with your preferred models, whether they are cloud-based LLMs (like Google's Gemini family) or open-source on-device models.

  • Flexible across domains: Define information extraction tasks for any domain with just a few well-chosen examples, without the need to fine-tune an LLM. LangExtract “learns” your desired output and can apply it to large, new text inputs. See how it works with this medication extraction example.

  • Utilizing LLM world knowledge: In addition to extracting grounded entities, LangExtract can leverage a model's world knowledge to supplement extracted information. This information can be explicit (i.e., derived from the source text) or inferred (i.e., derived from the model's inherent world knowledge). The accuracy and relevance of such supplementary knowledge, particularly when inferred, are heavily influenced by the chosen LLM's capabilities and the precision of the prompt examples guiding the extraction.
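The interplay between chunking and source grounding can be pictured with a toy sketch in plain Python. This is an illustration of the general idea, not LangExtract's actual implementation: a long document is split into overlapping chunks, each chunk is processed independently, and any chunk-local extraction offsets are mapped back to global character offsets so every entity still grounds to the original text.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40):
    """Split text into overlapping chunks, keeping each chunk's start offset."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((start, text[start:end]))
        if end == len(text):
            break
        start = end - overlap  # overlap so entities on a boundary are not lost
    return chunks


def to_global_offsets(chunk_start: int, local_start: int, local_end: int):
    """Map an extraction's chunk-local offsets back to the full document."""
    return chunk_start + local_start, chunk_start + local_end


doc = "Lady Juliet gazed longingly at the stars. " * 20
chunks = chunk_text(doc)

# Suppose a model found "Juliet" somewhere inside the third chunk:
chunk_start, chunk_body = chunks[2]
local = chunk_body.find("Juliet")
g_start, g_end = to_global_offsets(chunk_start, local, local + len("Juliet"))
assert doc[g_start:g_end] == "Juliet"  # the extraction grounds back to the source
```

Because each chunk is a small, focused context, the per-chunk model calls stay within the regime where LLM recall is strong, and they can be issued in parallel.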

Quick start: From Shakespeare to structured objects

Here's how to extract character details from a line of Shakespeare.

First, install the library:

pip install langextract

For more detailed setup instructions, including virtual environments and API key configuration, please see the project README.

Next, define your extraction task. Provide a clear prompt and a high-quality "few-shot" example to guide the model.

import textwrap
import langextract as lx

# 1. Define a concise prompt
prompt = textwrap.dedent("""\
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text=(
            "ROMEO. But soft! What light through yonder window breaks? It is"
            " the east, and Juliet is the sun."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"},
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"},
            ),
        ],
    )
]

# 3. Run the extraction on your input text
input_text = (
    "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
)
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro",
)

The result object contains the extracted entities, which can be saved to a JSONL file. From there, you can generate an interactive HTML file to view the annotations. This visualization is great for demos or evaluating the extraction quality, saving valuable time. It works seamlessly in environments like Google Colab or can be saved as a standalone HTML file, viewable from your browser.

# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl")

# Generate the interactive visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content)
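Because JSONL stores one JSON object per line, the saved results are also easy to post-process with nothing but the standard library. Here is a minimal sketch; the record shape below is illustrative only, not LangExtract's exact output schema:

```python
import io
import json

# One illustrative JSONL record; a real file would contain one such object per line.
jsonl_data = io.StringIO(
    '{"text": "Lady Juliet gazed longingly at the stars",'
    ' "extractions": [{"extraction_class": "character",'
    ' "extraction_text": "Juliet"}]}\n'
)

for line in jsonl_data:
    doc = json.loads(line)  # each line is an independent JSON document
    for ex in doc.get("extractions", []):
        print(ex["extraction_class"], "->", ex["extraction_text"])
```

This streaming style scales to large result files, since only one record needs to be in memory at a time.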

Flexibility for specialized domains

The same principles above apply to specialized domains like medicine, finance, engineering or law. The ideas behind LangExtract were first applied to medical information extraction and can be effective at processing clinical text. For example, it can identify medications, dosages, and other medication attributes, and then map the relationships between them. This capability was a core part of the research that led to this library, which you can read about in our paper on accelerating medical information extraction.

The animation below shows LangExtract processing clinical text to extract medication-related entities and group them under their source medication.
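The grouping step can be pictured with a small stdlib-only sketch. This is a toy heuristic for illustration, not LangExtract's method: attributes such as dosage or frequency are attached to the nearest preceding medication mention, using the character offsets that source grounding provides.

```python
# Each extraction: (class, text, start_offset). Offsets index into the note.
note = "Metformin 500 mg twice daily; Lisinopril 10 mg once daily."
extractions = [
    ("medication", "Metformin", 0),
    ("dosage", "500 mg", 10),
    ("frequency", "twice daily", 17),
    ("medication", "Lisinopril", 30),
    ("dosage", "10 mg", 41),
    ("frequency", "once daily", 47),
]

def group_by_medication(extractions):
    """Attach each attribute to the nearest preceding medication mention."""
    groups = {}
    current = None
    for cls, text, start in sorted(extractions, key=lambda e: e[2]):
        if cls == "medication":
            current = text
            groups[current] = {}
        elif current is not None:
            groups[current][cls] = text
    return groups

groups = group_by_medication(extractions)
# {'Metformin': {'dosage': '500 mg', 'frequency': 'twice daily'},
#  'Lisinopril': {'dosage': '10 mg', 'frequency': 'once daily'}}
```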

Demo on structured radiology reporting

To showcase LangExtract's power in a specialized field, we developed an interactive demonstration for structured radiology reporting called RadExtract on Hugging Face. This demo shows how LangExtract can process a free-text radiology report and automatically convert its key findings into a structured format, also highlighting important findings. This approach is important in radiology, where structuring reports enhances clarity, ensures completeness, and improves data interoperability for research and clinical care.

Try the demo on Hugging Face: https://google-radextract.hf.space


Disclaimer: The medication extraction example and structured reporting demo above are for illustrative purposes only, showing LangExtract's baseline capability. They do not represent a finished or approved product, are not intended to diagnose or suggest treatment of any disease or condition, and should not be used for medical advice.

Get started with LangExtract: Resources and next steps

We're excited to see the innovative ways developers will use LangExtract to unlock insights from text. Dive into the documentation, explore the examples on our GitHub repository, and start transforming your unstructured data today.

