Build a Complete Langfuse Observability and Evaluation Pipeline for Tracing, Prompt Management, Scoring, and Experiments

print(“\nPART 5 ── Datasets & experiments ————————————–“)
DATASET = “capital-cities-tutorial”
langfuse.create_dataset(name=DATASET, description=”Capital-city QA benchmark”)
_items = [
(“What is the capital of France?”, “Paris”),
(“What is the capital of Germany?”, “Berlin”),
(“What is the capital of Japan?”, “Tokyo”),
(“What is the capital of Italy?”, “Rome”),
]
for i, (q, a) in enumerate(_items):
langfuse.create_dataset_item(dataset_name=DATASET, id=f”cap-{i}”,
input={“question”: q}, expected_output=a)
def capital_task(*, item, **kwargs):
question = item.input[“question”] if isinstance(item.input, dict) else item.input
return llm_chat([{“role”: “user”, “content”: question}], name=”experiment-answer”)
def accuracy(*, input, output, expected_output, metadata=None, **kwargs):
hit = bool(expected_output) and expected_output.lower() in (output or “”).lower()
return Evaluation(name=”accuracy”, value=1.0 if hit else 0.0,
comment=”exact-match contains check”)
def conciseness(*, input, output, **kwargs):
return Evaluation(name=”char_length”, value=float(len(output or “”)))
def mean_accuracy(*, item_results, **kwargs):
vals = [e.value for r in item_results for e in r.evaluations if e.name == “accuracy”]
avg = sum(vals) / len(vals) if vals else 0.0
return Evaluation(name=”mean_accuracy”, value=avg, comment=f”{avg:.0%} correct”)
dataset = langfuse.get_dataset(DATASET)
result = dataset.run_experiment(
name=”capitals-baseline”,
description=”Baseline run from the Colab tutorial”,
task=capital_task,
evaluators=[accuracy, conciseness],
run_evaluators=[mean_accuracy],
max_concurrency=4,
)
print(result.format())

What's Hot

Samsung might bring Privacy Display to every Galaxy S27 model

The end for the Phone 1: Nothing’s final update hits the phone that started it all

Galaxy Z Fold 8 looks pricier in these rumors, which isn’t shocking in the least

I build helpful smart home automations with this Nest feature in the Google Home app

Auditing Model Bias with Balanced Datasets with Mimesis

Best Authentication Platforms for AI Agents and MCP Servers in 2026

A Probe Took Incredible Pictures of Mars on Its Way to a Far-Off Asteroid

Google Antigravity 2.0: The Complete Developer Guide

WorkOS Releases auth.md: An Open Agent Registration Protocol Built on OAuth Standards

Samsung might bring Privacy Display to every Galaxy S27 model

The end for the Phone 1: Nothing’s final update hits the phone that started it all

Galaxy Z Fold 8 looks pricier in these rumors, which isn’t shocking in the least

Samsung might bring Privacy Display to every Galaxy S27 model

The end for the Phone 1: Nothing’s final update hits the phone that started it all

Galaxy Z Fold 8 looks pricier in these rumors, which isn’t shocking in the least

Usefull link

categories

What's Hot

Build a Complete Langfuse Observability and Evaluation Pipeline for Tracing, Prompt Management, Scoring, and Experiments

Related Posts

Usefull link

categories