Evaluation for LLM

1. Why is evaluation necessary? 🤔

LLMs can produce different results even for the same input.
Therefore, LLM applications require evaluation of output quality in addition to traditional functional QA.

Examples of LLM evaluation criteria include:

  • Accuracy: Does the answer match the facts?
  • Relevance: Is the answer directly related to the question?
  • Safety: Is the output free of biased or harmful expressions?

You can also define evaluation criteria tailored to your use case:

  • Translation Quality: Is the text translated appropriately for the context?
  • Plan Creation: Is the generated plan realistic?

Through these evaluation results, you can select the optimal combination of models, prompts, and parameter settings, and continuously improve service quality.
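
Written down as structured data, such criteria might look something like the sketch below. This is purely illustrative; the field names and scales are assumptions, not the platform's actual schema.

```python
# Illustrative only: evaluation criteria expressed as structured data.
# Field names and scales are hypothetical, not the platform's schema.
criteria = [
    {"name": "accuracy",  "question": "Does the answer match the facts?",                     "scale": "1-5"},
    {"name": "relevance", "question": "Is the answer directly related to the question?",      "scale": "1-5"},
    {"name": "safety",    "question": "Is the output free of biased or harmful expressions?", "scale": "pass/fail"},
    # Use-case-specific criteria can be added in the same way:
    {"name": "translation_quality", "question": "Is the text translated appropriately for the context?", "scale": "1-5"},
]

for c in criteria:
    print(f"{c['name']} ({c['scale']}): {c['question']}")
```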

2. How do we evaluate? 🤩

Traditionally, there's Human Evaluation, where people review results directly and assign scores or rankings.

However, this method is time-consuming and costly, and it's difficult to run regularly alongside competing work priorities.

To overcome this, our platform adopts the LLM-as-a-Judge approach to perform Automated Evaluation.

This ensures scalability and efficiency in evaluation, enabling regular monitoring of LLM quality.
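
To give a feel for the LLM-as-a-Judge pattern itself, here is a minimal sketch in Python. It is not the platform's internal implementation: the judge model, prompt wording, and use of the OpenAI SDK are all assumptions made for illustration.

```python
# Minimal LLM-as-a-Judge sketch (illustrative; not the platform's implementation).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an evaluator. Rate how relevant the answer is to the question
on a scale from 1 (irrelevant) to 5 (fully relevant). Reply with the number only.

Question: {question}
Answer: {answer}"""

def judge_relevance(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any judge-capable model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

print(judge_relevance("What is the capital of France?", "Paris is the capital of France."))
```

The platform automates this kind of judging for you, so you don't have to write and maintain scripts like this yourself.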

3. Why should you use Eval? 🙋‍♀️

(1) Easy to get started.

  • You can evaluate Input/Output immediately without complex setup.
    (Supports direct data input or observability tool integration)
  • Quickly receive evaluation results and compare them at a glance on the dashboard.
  • Team collaboration is possible through the console.
    (Integrated with existing GIP Workspace units)

(2) Easy Judge setup and management.

  • Have an existing LLM Judge? Import and reuse it as is, and run it efficiently with evaluation cycle/sampling settings!
  • No LLM Judge? Quickly define one using AI.
  • Judge projects and evaluation history aren't scattered: view, apply, and manage everything in one console.

4. How to use it? 💁‍♀️

👋 Log in to GIP Console 👋

Step 1 Prepare data for evaluation (Dataset, Langfuse)
Step 2 Prepare evaluation criteria (Evaluator)
Step 3 Evaluate (Evaluation)
🎉 Check results

Step 1 Prepare data for evaluation. Both direct upload and Langfuse integration are supported.

Option 1️⃣ Direct upload to GIP - Great for testing and improving evaluation criteria with a fixed dataset.

  • Create a dataset.

    • Path: Platform > Storage > Datasets > Create

  • Add data to the dataset.

    • Path: Platform > Storage > Datasets > Dataset > Edit
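
If you prefer to prepare the data as a file before adding it in the console, a dataset is essentially rows of inputs and (optionally) expected outputs. A rough sketch with hypothetical field names; match them to the columns your dataset actually uses.

```python
# Hypothetical example rows for an evaluation dataset.
# Field names ("input", "expected_output") are illustrative only.
import json

rows = [
    {"input": "Summarize the refund policy in one sentence.",
     "expected_output": "Refunds are available within 30 days of purchase."},
    {"input": "Translate 'good morning' into French.",
     "expected_output": "Bonjour."},
]

# Write as JSON Lines (one record per line) for easy review before upload.
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```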

Option 2️⃣ Langfuse integration - Great for periodic evaluation with data streamed to Langfuse.

  • Connect Langfuse

    • Path: Settings > Integrated Services > Credentials

      ✅ You can find the API keys in Langfuse > Project > Settings > API Keys.
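
Before registering the credentials, you can optionally sanity-check the keys you copied using the Langfuse Python SDK. The host URL below is an assumption; use your own Langfuse host.

```python
# Optional: verify Langfuse credentials with the Langfuse Python SDK (pip install langfuse).
import os
from langfuse import Langfuse

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # from Langfuse > Project > Settings > API Keys
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # assumption: replace with your host

langfuse = Langfuse()          # picks up the environment variables above
print(langfuse.auth_check())   # True if the credentials are valid
```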

Step 2 Prepare evaluation criteria.

  • Path: Platform > Evaluation > Evaluators > Create

    ✅ Three types of evaluation criteria (Metric Type) are supported: Numeric, Boolean, and Category. Category values can be defined directly.
    ✅ If it's difficult to describe evaluation criteria, you can get AI assistance (Judge Prompt > Generate).
    ✅ Before saving the evaluation criteria, test it (Test Evaluation) and gradually improve the Evaluator by modifying the Prompt and Judge Model.
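
To get a feel for what an Evaluator's Judge Prompt can look like before you save it, here is a rough sketch for a Numeric metric. The wording and the {{input}} / {{output}} placeholders are assumptions; use whatever variables your Judge Prompt actually exposes.

```python
# Hypothetical Judge Prompt for a Numeric metric (1-5 accuracy score).
# Placeholder names ({{input}}, {{output}}) are illustrative only.
judge_prompt = """
You are a strict evaluator. Given a user question and a model answer,
rate the ACCURACY of the answer from 1 (completely wrong) to 5 (fully
correct and complete). Respond with the number only.

Question: {{input}}
Answer: {{output}}
"""

# The same idea applies to the other Metric Types:
#   Boolean  -> "Answer true or false: is the output free of harmful content?"
#   Category -> "Classify the output as one of: helpful, partially_helpful, unhelpful."
print(judge_prompt)
```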


Step 3 Evaluate.

  • Path: Platform > Evaluation > Evaluations > Create



    ✅ Target Data Type
    You can specify the dataset created in Step 1 (Select a Dataset) or import data using a Langfuse API Key (Import from Langfuse).


    ✅ Evaluator Selection
    You can specify the evaluation criteria created in Step 2.
    You can extract only part of the data using a JQ Expression or Regex (see the sketch after this step).

  • Path: Single Run

    ✅ Single Run
    Run the evaluation!
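
About the JQ Expression / Regex note above: when a record contains more than the part you want to judge (for example a JSON payload or a tagged string), an expression selects just the relevant piece before it is sent to the Evaluator. A small sketch with hypothetical field names:

```python
# Illustration of what JQ / Regex extraction does to a record before judging.
# Field names and expressions are hypothetical; adapt them to your own data.
import re

record = {
    "input": "What is the capital of France?",
    "output": {"answer": "Paris is the capital of France.", "tokens": 9},
}

# A JQ Expression like ".output.answer" selects just the answer text:
answer = record["output"]["answer"]          # Python equivalent of jq '.output.answer'

# A Regex can strip surrounding markup, e.g. pull the text out of a tag:
raw = "<final>Paris is the capital of France.</final>"
match = re.search(r"<final>(.*?)</final>", raw, re.DOTALL)
extracted = match.group(1) if match else raw

print(answer)
print(extracted)
```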

🎉 Check Results

  • Path: Platform > Evaluation > Evaluations > Runs > Overview / Scores

    ✅ Check evaluation results in Overview and Scores, and share the link with your team.
    Anyone with Workspace permissions can view the reports and collaborate.

5. FAQ ❓

  • Where can I find the API guide?
    • You can check it here.