What Are Langfuse Datasets? A Thorough Guide to Efficient Model Evaluation

Introduction

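This article explains how to create a Langfuse dataset and how to use it to evaluate an LLM application.
A Langfuse dataset pairs inputs with expected outputs, which makes it straightforward to benchmark an LLM application in a repeatable way. The second half of the article walks through a concrete evaluation built on a dataset. We hope it helps you measure and improve the performance of your own application.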

Creating a Dataset

This section explains how to create a Langfuse dataset.
Follow the steps in [Langfuse setup](https://book.st-hakky.com/data-science/langfuse-intro) to get access to a Langfuse server. The following code then creates a new dataset.

from langfuse import Langfuse
from langfuse.openai import openai
from dotenv import load_dotenv
load_dotenv() 

langfuse = Langfuse()

# Define the dataset
langfuse.create_dataset(
    name="test-data-set",
    # optional description
    description="My first dataset",
    # Set the metadata
    metadata={
        "author": "Alice",
        "date": "2022-01-01",
        "type": "benchmark"
    }
)

Defining metadata attaches additional information to the dataset. The metadata is stored as JSON alongside the dataset and can help with analysis and management.
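Because the metadata ends up as JSON, it can help to check that it serializes cleanly before sending it. The following is a minimal sketch under that assumption. The helper name `build_dataset_metadata` is hypothetical and not part of the Langfuse SDK; it only builds a dictionary that could be passed as `metadata`.

```python
import json
from datetime import date


def build_dataset_metadata(author, dataset_type, created=None):
    """Build a metadata dict and confirm that it is JSON-serializable.

    Hypothetical helper for illustration; not part of the Langfuse SDK.
    """
    created = created or date.today()
    metadata = {
        "author": author,
        # Store the date as an ISO 8601 string, since date objects are not valid JSON
        "date": created.isoformat(),
        "type": dataset_type,
    }
    # json.dumps raises TypeError if a value cannot be serialized
    json.dumps(metadata)
    return metadata


metadata = build_dataset_metadata("Alice", "benchmark", date(2022, 1, 1))
```

The resulting dictionary has the same shape as the one used in the snippet above.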

The following code adds an item to the dataset. Here the input "hello world" and the expected output "hello world" are added.

from langfuse import Langfuse
from langfuse.openai import openai
from dotenv import load_dotenv
load_dotenv() 

langfuse = Langfuse()

langfuse.create_dataset_item(
    dataset_name="test-data-set",
    # any python object or value, optional
    input={
        "text": "hello world"
    },
    # any python object or value, optional
    expected_output={
        "text": "hello world"
    },
    # metadata, optional
    metadata={
        "model": "llama3",
    }
)

After running the code, you can confirm that the dataset and its item have been created.

Evaluating with a Dataset

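This section explains how to evaluate an LLM application with a dataset. As an example, we build an application that takes a country name and outputs only that country's capital, then evaluate how accurately it answers.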

1. Creating the application

First, define the application. It receives the country name in input and an instruction such as "tell me the capital" in system_prompt, and returns the capital.

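from datetime import datetime

# Assumes `openai` (from langfuse.openai) and `langfuse` from the setup snippet are in scope.
def run_my_custom_llm_app(input, system_prompt):
  # Build the chat messages: the system prompt plus the country name as the user turn
  messages = [
      {"role":"system", "content": system_prompt},
      {"role":"user", "content": input["country"]}
  ]

  generation_start_time = datetime.now()

  # Call the model and keep only the text of the first choice
  openai_completion = openai.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=messages
  ).choices[0].message.content

  # Record the call in Langfuse as a generation so it can be linked to a dataset item
  langfuse_generation = langfuse.generation(
    name="guess-countries",
    input=messages,
    output=openai_completion,
    model="gpt-3.5-turbo",
    start_time=generation_start_time,
    end_time=datetime.now()
  )

  return openai_completion, langfuse_generation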
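When iterating on prompts, it can be useful to exercise the message-building logic without calling the model. The sketch below is one way to do that, assuming the completion step is passed in as a plain function. `build_messages`, `run_app_offline`, and `fake_complete` are hypothetical names introduced for illustration; they are not part of Langfuse or the OpenAI SDK.

```python
def build_messages(input, system_prompt):
    """Build the same two-message chat structure the application sends to the model."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": input["country"]},
    ]


def run_app_offline(input, system_prompt, complete):
    """Run the application with an injected completion function.

    `complete` takes the message list and returns the model's text.
    """
    return complete(build_messages(input, system_prompt))


# A stand-in for the model, used only to test the plumbing
FAKE_CAPITALS = {"Italy": "Rome", "Japan": "Tokyo"}


def fake_complete(messages):
    # The last message is the user turn, which holds the country name
    return FAKE_CAPITALS[messages[-1]["content"]]


answer = run_app_offline(
    {"country": "Italy"},
    "Respond with only the name of the capital",
    fake_complete,
)
```

Swapping `fake_complete` for a function that calls the real model would leave the rest of the flow unchanged.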

2. Creating the dataset

Next, create the dataset used for the evaluation. Each item holds an input (a country name) and the expected output (its capital).

# Create a new dataset
langfuse.create_dataset(
    name="capital_cities",
)

# Define the items
local_items = [
    {"input": {"country": "Italy"}, "expected_output": "Rome"},
    {"input": {"country": "Spain"}, "expected_output": "Madrid"},
    {"input": {"country": "Brazil"}, "expected_output": "Brasília"},
    {"input": {"country": "Japan"}, "expected_output": "Tokyo"},
    {"input": {"country": "India"}, "expected_output": "New Delhi"},
    {"input": {"country": "Canada"}, "expected_output": "Ottawa"},
    {"input": {"country": "South Korea"}, "expected_output": "Seoul"},
    {"input": {"country": "Argentina"}, "expected_output": "Buenos Aires"},
    {"input": {"country": "South Africa"}, "expected_output": "Pretoria"},
    {"input": {"country": "Egypt"}, "expected_output": "Cairo"},
]

# Add the items to the dataset
for item in local_items:
  langfuse.create_dataset_item(
      dataset_name="capital_cities",
      # any python object or value
      input=item["input"],
      # any python object or value, optional
      expected_output=item["expected_output"]
  )
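Before uploading items, a quick local check can catch malformed or repeated entries. The following is a minimal sketch that assumes the item shape used above (an `input` dictionary with a `country` key and a string `expected_output`). `validate_items` is a hypothetical helper, not a Langfuse function.

```python
def validate_items(items):
    """Return a list of human-readable problems; the list is empty if all items look well formed."""
    problems = []
    seen_countries = set()
    for index, item in enumerate(items):
        country = item.get("input", {}).get("country")
        expected = item.get("expected_output")
        if not isinstance(country, str) or not country.strip():
            problems.append(f"item {index}: missing input country")
        elif country in seen_countries:
            problems.append(f"item {index}: duplicate country {country!r}")
        else:
            seen_countries.add(country)
        if not isinstance(expected, str) or not expected.strip():
            problems.append(f"item {index}: missing expected_output")
    return problems


good_items = [
    {"input": {"country": "Italy"}, "expected_output": "Rome"},
    {"input": {"country": "Spain"}, "expected_output": "Madrid"},
]
bad_items = [
    {"input": {"country": "Italy"}, "expected_output": "Rome"},
    {"input": {"country": "Italy"}, "expected_output": ""},
]
```

One possible usage is to call `validate_items(local_items)` and upload only when the returned list is empty.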

3. Defining the evaluation method

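Next, define how the application's output is evaluated against the dataset.
As an example, we use the Levenshtein distance, which measures the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other. Because it is a distance, a lower score is better, and 0 means the output matches the expected output exactly. Changing the method defined in evaluation lets you plug in other evaluation approaches.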

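from langchain.evaluation import StringDistance, load_evaluator

# Evaluation based on the Levenshtein distance (requires the rapidfuzz package).
# The distance is requested explicitly rather than relying on the evaluator's default.
def evaluation(output, expected_output):
  evaluator = load_evaluator("string_distance", distance=StringDistance.LEVENSHTEIN)
  score = evaluator.evaluate_strings(
    prediction=output,
    reference=expected_output,
  )
  return score['score']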

def run_experiment(experiment_name, system_prompt):

  # Fetch the dataset
  dataset = langfuse.get_dataset("capital_cities")

  for item in dataset.items:

    # Generate a prediction with the application
    completion, langfuse_generation = run_my_custom_llm_app(item.input, system_prompt)

    # Link the generation to the dataset item under this experiment (run) name
    item.link(langfuse_generation, experiment_name) # pass the observation/generation object or the id

    # Attach the distance score to the generation
    langfuse_generation.score(
      name="string_distance",
      value=evaluation(completion, item.expected_output)
    )
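To make the metric concrete, here is a minimal pure-Python sketch of the Levenshtein distance, independent of LangChain. The function names are hypothetical, and dividing by the length of the longer string is an assumed normalization; the exact score returned by a library evaluator may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    # prev[j] is the distance between the already processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the distance into [0, 1] by the longer string's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest
```

With this definition, an exact answer such as "Rome" against "Rome" scores 0, while a verbose answer such as "The capital of Italy is Rome." scores close to 1, which is why prompts that ask for only the city name tend to score better.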

4. Running the evaluation

Ask for the capital in several different ways and evaluate how accurately each prompt answers.

run_experiment(
    "famous_city",
    "The user will input countries, respond with the most famous city in this country"
)
run_experiment(
    "directly_ask",
    "What is the capital of the following country?"
)
run_experiment(
    "asking_specifically",
    "The user will input countries, respond with only the name of the capital"
)
run_experiment(
    "asking_specifically_2nd_try",
    "The user will input countries, respond with only the name of the capital. State only the name of the city."
)

5. Reviewing the results

The results of each run can be reviewed in the Langfuse UI. The asking_specifically and asking_specifically_2nd_try runs produce the most accurate answers.

You can also inspect the output for each country.
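If the per-item scores are exported for offline analysis, each run can be summarized with a few lines of Python. The sketch below assumes each score is a normalized string distance where 0 means an exact match. `summarize_runs` is a hypothetical helper, and the numbers in `example_scores` are made-up placeholders, not measured results.

```python
from statistics import mean


def summarize_runs(scores_by_run):
    """Return {run_name: (mean_distance, exact_match_rate)} for each run."""
    summary = {}
    for run_name, scores in scores_by_run.items():
        # An item counts as correct only when the distance is exactly 0
        exact_match_rate = sum(1 for score in scores if score == 0.0) / len(scores)
        summary[run_name] = (mean(scores), exact_match_rate)
    return summary


# Placeholder scores for illustration only
example_scores = {
    "run_a": [0.0, 0.0, 0.5, 0.5],
    "run_b": [0.0, 0.0, 0.0, 1.0],
}
summary = summarize_runs(example_scores)
```

Reporting both values is a deliberate choice: the mean distance rewards near misses, while the exact-match rate corresponds more directly to the share of correct answers.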

Conclusion

This article explained how to evaluate an application with Langfuse datasets. With a dataset, you can run benchmarks based on inputs and expected outputs with little effort.
