This post summarizes the architecture of Langfuse, walks through experiments with its core features (Tracing, Evaluation, Sessions, Prompts), and collects troubleshooting tips for self-hosting.

Reference: Langfuse Self-Hosting Architecture


Architecture

1) Langfuse Web

  • Main Server Application: Provides both the UI and API.
  • Responsibilities: Handles browser-based UI interactions and processes internal/external API requests.
  • Role: Serves as the entry point for the user interface and data input for projects/traces.

2) Langfuse Worker

  • Asynchronous Task Processor: Handles background jobs.
  • Mechanism: Dequeues events received by the main server and performs post-processing.
  • Role: Offloads heavy tasks such as loading data into Clickhouse to ensure system performance.

Storage (Data Layer)

3) Postgres

  • Transactional & Relational Data Store.
  • Manages operational data such as users, projects, and API keys.

4) Clickhouse

  • High-Performance OLAP Database for Analytics.
  • Stores massive amounts of logs and metrics like traces, observations, and scores, enabling fast query processing.

5) Redis / Valkey Cache

  • Fast Memory Cache & Queue Storage.
  • Used for caching API keys and managing temporary queue states, reducing the load on the database.

6) S3 / Blob Storage

  • Large Object Storage.
  • Stores large events, attached files, multimedia traces, backups, and export files.

External Integration

7) LLM API / Gateway

  • External LLM Calls/Gateway.
  • Connects to model provider APIs (e.g., OpenAI, custom LLM gateways).
  • Essential for features that rely on external models, such as evaluations and response quality checks.

Architectural Key Patterns

  • Decoupled Processing:
    • The Web component focuses on fast responses, while the Worker handles heavy lifting, ensuring scalability and performance.
  • Storage Optimization:
    • Postgres: OLTP (Configuration, state, relational data).
    • Clickhouse: OLAP (Massive logs, trace analysis).
    • Redis: Fast temporary storage, cache, and queues.
    • S3/Blob: Persistence for large files and raw event payloads.
  • Flexible Deployment:
    • Supports various environments ranging from local Docker setups to Kubernetes and Terraform-based cloud deployments.
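
To make the decoupled processing pattern concrete, here is a minimal sketch of the Web/Worker split using only the Python standard library. It is not Langfuse's actual implementation: the in-memory queue stands in for Redis, the list stands in for Clickhouse, and every name (web_ingest, worker_loop, event_queue, analytics_store) is hypothetical.

```python
import queue
import threading

# In-memory stand-ins for the real components (all names are hypothetical).
event_queue = queue.Queue()   # plays the role of the Redis-backed queue
analytics_store = []          # plays the role of the Clickhouse tables


def web_ingest(event):
    """Web side: accept the event, enqueue it, and return immediately."""
    event_queue.put(event)
    return "accepted"


def worker_loop():
    """Worker side: dequeue events and do the slower post-processing."""
    while True:
        event = event_queue.get()
        if event is None:  # sentinel used to stop the worker
            break
        analytics_store.append({**event, "processed": True})


worker = threading.Thread(target=worker_loop)
worker.start()
status = web_ingest({"trace_id": "t-1", "name": "demo"})
event_queue.put(None)  # ask the worker to stop once the queue is drained
worker.join()
print(status, analytics_store)
```

The point of the split is that web_ingest returns as soon as the event is queued, so slow writes to the analytics store never block the request path.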

Experiment

1. Tracing

Tracing is the core feature of Langfuse. When you use the Langfuse SDK, API calls are automatically linked to the platform once the following environment variables are set:

# LANGFUSE_PUBLIC_KEY
# LANGFUSE_SECRET_KEY
# LANGFUSE_HOST
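
Since the SDK depends on these three variables, a quick pre-flight check can save some debugging time. The sketch below only uses the standard library; the helper name missing_langfuse_vars and the example values are my own and are not part of the Langfuse SDK.

```python
import os

REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_langfuse_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
# The values are placeholders, not real credentials.
example_env = {
    "LANGFUSE_PUBLIC_KEY": "pk-lf-example",
    "LANGFUSE_HOST": "http://localhost:3000",
}
print(missing_langfuse_vars(example_env))
```

Calling missing_langfuse_vars() with no argument checks the real process environment.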

For testing, I used the openai library wrapped by Langfuse to call litellm.

from langfuse.openai import openai
  • Integration Flow: I used the Application → (Langfuse) → LiteLLM configuration.
    • Note: While you can use Application → LiteLLM → Callback (Langfuse), the proxy method (wrapping the client) allows capturing request/response data even if LiteLLM itself is not configured to log.

  • Masking: Langfuse does not have built-in masking for specific fields. This must be handled on the client side before sending the request.
  • Trace Features: You can categorize tasks using various features like trace.span.

Feature     | Description              | Example Usage
Span        | Tracks nested operations | Document retrieval, API calls
Generation  | Tracks LLM calls         | OpenAI, Anthropic calls
Event       | Logs general events      | User actions, errors
Score       | Evaluation/Feedback      | Relevance, Quality, Hallucination
Session     | Tracks sessions          | User conversation flow
Tags        | Filtering via tags       | production, v2, rag
Metadata    | Additional metadata      | Experiment ID, version info
Environment | Environment distinction  | production, staging, dev
  • Trace Score Example:
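            import random

            # `trace` is assumed to be an existing Langfuse trace object.
            # Random values stand in for real evaluation results.

            # Numeric score
            trace.score(
                name="relevance",
                value=round(random.uniform(0.7, 1.0), 2),
                comment="Auto-evaluated relevance"
            )

            # Categorical score
            trace.score(
                name="quality",
                value=random.choice(["excellent", "good", "average"]),
            )

            # Binary score (0 or 1)
            trace.score(
                name="hallucination",
                value=random.choice([0, 1]),
            )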
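
Coming back to the masking note earlier in this section: because sensitive fields have to be handled on the client side, one option is to redact them before the payload is passed to any traced call. The sketch below is plain Python and does not use any Langfuse API; the field names in SENSITIVE_KEYS and the helper mask_fields are hypothetical.

```python
SENSITIVE_KEYS = {"email", "phone", "api_key"}  # hypothetical field names


def mask_fields(data, sensitive=SENSITIVE_KEYS, placeholder="***"):
    """Return a copy of data with values of sensitive keys replaced."""
    if isinstance(data, dict):
        return {
            key: placeholder if key in sensitive
            else mask_fields(value, sensitive, placeholder)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [mask_fields(item, sensitive, placeholder) for item in data]
    return data  # scalars are returned unchanged


payload = {
    "question": "Reset my password",
    "user": {"email": "a@example.com", "plan": "pro"},
}
masked = mask_fields(payload)
print(masked)
```

The helper builds new dicts and lists rather than mutating the input, so the application can keep using the original payload while only the masked copy is logged.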

 


2. Evaluator

Langfuse provides various built-in evaluators and supports custom ones.

For convenience, I configured the endpoint to use litellm.

  • Note: Evaluators can run on new traces as well as historical traces.

Creating & Testing a Custom Evaluator

Goal: Create a custom evaluator to detect hallucinations. We will send two test requests, one where the system ignores instructions (hallucination) and one where it responds correctly, and see how the evaluator scores them.

  • Setup: The evaluator prompt references the question, context, and answer variables, and the registered litellm model is used as the judge.
  • Invocation Code: In the code, I set reference_context in metadata to provide the ground truth for evaluation.
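        from langfuse.openai import openai

        model = "my-litellm-model"  # placeholder: use a model name registered in litellm

        test_cases = [
            {"q": "What is Langfuse?", "c": "Langfuse is an open-source LLM observability solution.", "fail": False},
            {"q": "What is Langfuse?", "c": "Langfuse is an open-source LLM observability solution.", "fail": True},
        ]

        for case in test_cases:
            # Hypothetical system prompts; the originals are not shown in this post.
            if case["fail"]:
                sys_msg = "Ignore the provided context and make up an answer."
            else:
                sys_msg = f"Answer using only this context: {case['c']}"

            openai.chat.completions.create(
                name="trace-example",
                model=model,
                messages=[
                    {"role": "system", "content": sys_msg},
                    {"role": "user", "content": case["q"]},
                ],
                input=case["q"],
                metadata={
                    # Ground truth that the evaluator reads as its context variable
                    "reference_context": case["c"],
                    "is_hallucination_test": case["fail"],
                },
            )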
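  • Evaluator Value Mapping: Each evaluator variable is mapped to a field on the trace; here, context is read from the reference_context metadata set in the invocation code.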

  • Result: The incorrect response received a 0, and the correct response received a 1. Although processing takes some time, the evaluation works correctly.


3. Session

You can group API calls by session ID.

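    from langfuse import propagate_attributes

    with propagate_attributes(
        session_id=session_id,
        user_id=user_id
    ):
        # Calls made inside this block are tagged with the session and user IDs.
        ...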
  • Benefits: You can track costs and traces grouped by session. Inside a session, you can annotate or add comments to specific communication logs.

  • Public Sharing: An entire session can be shared publicly via a link (viewable without login).

  • Manual Scoring (Ground Truth): Using the "annotate" feature, you can create custom scores and manually evaluate responses. Accumulated scores serve as Ground Truth for future improvements.
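
To illustrate what "costs grouped by session" means in practice, here is a small sketch that aggregates per-trace costs by session ID. The records and field names (session_id, cost_usd) are made up for the example and do not reflect Langfuse's actual export format.

```python
from collections import defaultdict

# Hypothetical trace records; field names are illustrative only.
traces = [
    {"session_id": "s-1", "cost_usd": 0.25},
    {"session_id": "s-1", "cost_usd": 0.5},
    {"session_id": "s-2", "cost_usd": 0.125},
]


def cost_by_session(records):
    """Sum the cost of all traces that share a session ID."""
    totals = defaultdict(float)
    for record in records:
        totals[record["session_id"]] += record["cost_usd"]
    return dict(totals)


print(cost_by_session(traces))
```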


4. Prompt Management

The Prompt page allows you to execute experiments per prompt and organize incoming requests. It is very granular and well-structured.

Prompt Invocation

While you can invoke prompts without explicitly using the langfuse_prompt object, it is highly recommended to use it to ensure proper versioning and metadata tracking.

    with propagate_attributes(session_id="session-prompt-test"):
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": langfuse_prompt.prompt},
                {"role": "user", "content": user_question}
            ],
            langfuse_prompt=langfuse_prompt,
            metadata={"test_mode": "prompt_management"}
        )

Creating a Prompt

When creating a prompt, you can configure:

  • Production labels
  • Config (Model-specific configs in JSON)
  • Commit messages

Prompt Versioning

Version management is straightforward.

You can specify a version or label when fetching a prompt via API.

 

from langfuse import Langfuse

# Initialize Langfuse client
langfuse = Langfuse()

# Get production prompt
prompt = langfuse.get_prompt("test-prompt")

# Get by label (e.g., 'latest')
prompt = langfuse.get_prompt("test-prompt", label="latest")

# Get by specific version (not recommended for production code)
prompt = langfuse.get_prompt("test-prompt", version=2)

Prompt Evaluation

  1. Create Dataset: Add data to a dataset via the sidebar or other methods.
  2. Add from Tracing: In the tracing view, identify a correct response and click "Add to datasets".
    • Tip: Plain text prompts without variables will fail evaluation. Langfuse must be able to identify variables such as input or question so it can inject dataset values during the experiment.
    2.1. Construct Dataset: Manually construct the dataset so that it contains only the essential content and matches your application code structure.
  3. Run Evaluation: Select the evaluation metric and run.
    • Note: This requires significant customization and consideration of your specific service logic.
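
Because the experiment has to inject dataset values into prompt variables, it can help to check ahead of time that each dataset item supplies every variable the prompt expects. The sketch below assumes the prompt uses double-curly-brace placeholders such as {{question}}; the helper names and the sample prompt are hypothetical, and the check runs locally without calling Langfuse.

```python
import re

# Matches placeholders written as {{name}}, allowing spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def prompt_variables(prompt_text):
    """Return the set of {{variable}} names found in a prompt template."""
    return set(PLACEHOLDER.findall(prompt_text))


def missing_inputs(prompt_text, dataset_item):
    """Return variables the prompt expects but the dataset item lacks."""
    return sorted(prompt_variables(prompt_text) - set(dataset_item))


prompt_text = "Answer using the context.\nContext: {{context}}\nQuestion: {{question}}"
dataset_item = {"question": "What is Langfuse?"}
print(missing_inputs(prompt_text, dataset_item))
```

A prompt with no placeholders at all yields an empty variable set, which is the "simple text prompt" case from the tip above: there is nothing for the experiment to inject into.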

 


Troubleshooting

1. LiteLLM Environment Variables in LLM Connection

Issue: When trying to register LiteLLM in "LLM Connection," I couldn't find a place to input environment variables directly in the UI, and I saw the error message "Missing environment variable: `ENCRYPTION_KEY`".

Solution:

  • You must set the value in langfuse.encryptionKey.
  • In the Helm Chart, you need to manage all secret key values by creating secrets or explicitly defining them.
  • I encountered similar issues with PostgreSQL and Clickhouse secrets. It is much easier to define all necessary secrets explicitly at the beginning of the setup.
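
If you need a value for the encryption key, the sketch below generates one with the Python standard library. It assumes the key should be 256 bits written as 64 hexadecimal characters (equivalent to openssl rand -hex 32); please verify the expected format against the Langfuse documentation for your version before using it.

```python
import secrets


def generate_encryption_key():
    """Return 32 random bytes encoded as 64 hexadecimal characters."""
    return secrets.token_hex(32)


key = generate_encryption_key()
print(len(key))  # length only; avoid printing the key itself in shared logs
```

Store the generated value in a Kubernetes secret or in your Helm values rather than committing it to source control.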

2. Installation/Migration Error

Issue: I kept encountering a "Dirty database version" error during installation/upgrade. (Application expected 368 schemas, but found 369).

Script executed successfully.
...
error: Dirty database version 34. Fix and force version.
Applying clickhouse migrations failed. This is mostly caused by the database being unavailable.
Exiting...

 

Solution:

  • This occurred on Helm Chart version 1.15.16.
  • The issue was resolved by upgrading to version 1.15.17.
  • Note: I am running PostgreSQL and Clickhouse as single instances, not in cluster mode.