Evaluation

Evaluation Tutorial

A complete workflow for evaluating translation quality using LLM judges in the Aitheon Evaluation Platform.

⚠️ **The evaluation API and services are currently in a pre-release stage and are only available in the dev environment. To access the API, the client must send the beta version header "GIP-Beta: eval=v1".**

Prerequisites

# Set your API key
export API_KEY="your-api-key-here"
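
Every request in this tutorial sends the same authorization and beta headers. The wrapper below is an optional convenience (a sketch, assuming bash and curl); the steps that follow spell the headers out in full, so it can be skipped.

# Optional helper: BASE_URL and eval_api are conveniences introduced here, not part of the platform tooling
export BASE_URL="https://api.platform.a15t.com/v1/evals"

eval_api() {
  # Usage: eval_api METHOD PATH [JSON_BODY]
  local method="$1" path="$2" body="${3:-}"
  if [ -n "$body" ]; then
    curl -sS -X "$method" "$BASE_URL$path" \
      -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
      -H "content-type: application/json" -d "$body"
  else
    curl -sS -X "$method" "$BASE_URL$path" \
      -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1"
  fi
}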

Models

  • Judge Model: Any LLM available in the Workspace can be used, for example openai/gpt-5-2025-08-07.
  • Score Parser Model: If the judge model does not support structured output, a separate parser model is used to produce structured scores. GIP internally uses openai/gpt-5-mini-2025-08-07; this model is required for evaluation, so it must be bound. If it is not already bound, please request the binding.

Step 1: Create Translation Evaluator

Create an LLM judge evaluator that scores translation quality on a scale of 1 to 3.

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluators" \
  -H "content-type: application/json" \
  -d '{
    "name": "Translation Quality Judge",
    "metric": {
      "name": "Quality Score",
      "type": "numeric",
      "config": {
        "min": {"value": 1},
        "max": {"value": 3}
      }
    },
    "evaluation_config": {
      "type": "llm_judge",
      "config": {
        "judge_model_public_id": "openai/gpt-5-2025-08-07",
        "judge_model_parameters": {
          "temperature": 0.0,
          "top_p": 1.0,
          "max_tokens": 500
        },
        "judge_prompt": [
          {
            "role": "system",
            "content": "Evaluate translation quality:\n1 = Poor/Incorrect\n2 = Good but has issues\n3 = Excellent\n\nOutput JSON: {\"score\": X, \"rationale\": \"explanation\"}"
          },
          {
            "role": "user",
            "content": "Original: {{input_text}}\nTranslation: {{translated_text}}"
          }
        ]
      }
    }
  }' | jq '.id'

Save the returned evaluator ID:

export EVALUATOR_ID="your-evaluator-id-here"
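
If you prefer not to copy the ID by hand, command substitution captures it directly. A sketch, assuming the Step 1 request body has been saved to a file named evaluator.json (the filename is illustrative):

# Create the evaluator and capture its ID in one step
# (evaluator.json holds the JSON body from Step 1; the filename is illustrative)
EVALUATOR_ID=$(curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluators" \
  -H "content-type: application/json" \
  -d @evaluator.json | jq -r '.id')
echo "EVALUATOR_ID=$EVALUATOR_ID"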

Step 2: Verify Evaluator

List all evaluators

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluators" | jq '.items[] | {id, name}'

Get specific evaluator details

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID" | jq

Step 3: Test Evaluator

Test the evaluator with sample translation data:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID/run" \
  -H "content-type: application/json" \
  -d '{
    "variables": {
      "input_text": "Hello, how are you?",
      "translated_text": "안녕하세요, 어떻게 지내세요?"
    }
  }' | jq

Expected response:

{
  "score": 3.0,
  "rationale": "Excellent translation - accurate and natural Korean"
}
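
To spot-check the judge across different quality levels, you can loop a few pairs through the same run endpoint. A sketch; the pairs below are illustrative, and the last one is intentionally wrong to confirm that low scores come back.

# Spot-check the evaluator on a few tab-separated pairs (last pair is deliberately bad)
printf '%s\t%s\n' \
  "Hello, how are you?" "안녕하세요, 어떻게 지내세요?" \
  "Good morning" "좋은 아침" \
  "See you later" "바나나" |
while IFS=$'\t' read -r src tgt; do
  jq -n --arg i "$src" --arg t "$tgt" '{variables: {input_text: $i, translated_text: $t}}' |
  curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
    -X POST "https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID/run" \
    -H "content-type: application/json" -d @- | jq -c '{score, rationale}'
done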

Step 4: Create Dataset (Optional)

If evaluating against a dataset instead of traces:

# Create dataset
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/datasets" \
  -H "content-type: application/json" \
  -d '{
    "name": "Translation Test Dataset",
    "description": "Sample translations for evaluation"
  }' | jq '.id'

export DATASET_ID="your-dataset-id-here"

# Add sample items to the dataset (the dataset_id field linking items to the dataset is assumed here)
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/dataset-items/bulk" \
  -H "content-type: application/json" \
  -d '{
    "dataset_id": "'$DATASET_ID'",
    "items": [
      {
        "input": "Good morning",
        "expected_output": "좋은 아침"
      },
      {
        "input": "Thank you very much",
        "expected_output": "정말 감사합니다"
      },
      {
        "input": "See you later",
        "expected_output": "나중에 봐요"
      }
    ]
  }' | jq
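
For larger datasets you can build the bulk payload from a file instead of hand-writing JSON. A sketch assuming a tab-separated file named translations.tsv with one input/expected_output pair per line (the filename is illustrative, and the dataset_id field mirrors the request above):

# translations.tsv: one "<input>\t<expected_output>" pair per line (illustrative file)
jq -Rn --arg dataset_id "$DATASET_ID" '
  {dataset_id: $dataset_id,
   items: [inputs | split("\t") | {input: .[0], expected_output: .[1]}]}' \
  < translations.tsv |
curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/dataset-items/bulk" \
  -H "content-type: application/json" -d @- | jq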

Step 5: Create Evaluation

Create an evaluation that uses your evaluator to assess translation quality:

Option A: Using Dataset

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations" \
  -H "content-type: application/json" \
  -d '{
    "name": "Translation Quality Assessment",
    "description": "Evaluate Korean-English translation quality",
    "target_data_filter": {
      "type": "dataset",
      "config": {
        "dataset_id": "'$DATASET_ID'"
      }
    },
    "evaluator_bindings": [
      {
        "evaluator_id": "'$EVALUATOR_ID'",
        "evaluator_version": 1,
        "variable_mappings": [
          {
            "name": "input_text",
            "property": "input"
          },
          {
            "name": "translated_text", 
            "property": "expected_output"
          }
        ]
      }
    ]
  }' | jq '.id'

Option B: Using Langfuse Traces

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations" \
  -H "content-type: application/json" \
  -d '{
    "name": "Translation Quality from Traces",
    "description": "Evaluate translations from production traces",
    "target_data_filter": {
      "type": "langfuse_trace",
      "config": {
        "integrated_service_credential_id": "your-langfuse-credential-id",
        "sampling_rate": 1,
        "max_trace_count": 50,
        "observation_path": ["translate"],
        "trace_filter_request": {
          "name": "translation_request",
          "relative_time_range": "P1D"
        }
      }
    },
    "evaluator_bindings": [
      {
        "evaluator_id": "'$EVALUATOR_ID'",
        "evaluator_version": 1,
        "variable_mappings": [
          {
            "name": "input_text",
            "property": "input"
          },
          {
            "name": "translated_text",
            "property": "output"
          }
        ]
      }
    ]
  }' | jq '.id'

Save the evaluation ID:

export EVALUATION_ID="your-evaluation-id-here"

Step 6: Verify Evaluation

List evaluations

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluations" | jq '.items[] | {id, name, created_at}'

Get evaluation details

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID" | jq

Step 7: Test with Sample Run

Run a quick sample evaluation to verify everything works:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/sample-run" \
  -H "content-type: application/json" \
  -d '{
    "display_name": "Quick Test",
    "sample_limit": 3
  }' | jq

Expected response:

{
  "evaluation_sample_run_id": "sample-run-id",
  "results": [
    {
      "score": 3.0,
      "rationale": "Excellent translation",
      "dataset_item_id": "item-1"
    }
  ]
}
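
As a quick sanity check, you can compute the average score across the sample run's results (assuming the response shape shown above; note this issues another sample run):

# Average score across the sample run's results
curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/sample-run" \
  -H "content-type: application/json" \
  -d '{"display_name": "Quick Test", "sample_limit": 3}' |
  jq '[.results[].score] | add / length'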

Step 8: Run Full Evaluation

Start a complete evaluation run:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/runs" \
  -H "content-type: application/json" \
  -d '{
    "display_name": "Translation Quality Run #1"
  }' | jq '.id'

Save the run ID:

export RUN_ID="your-run-id-here"

Step 9: Monitor Run Status

Check the evaluation run progress:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluations/runs/$RUN_ID" | jq '.status'

Status values:

  • PENDING - Waiting to start
  • RUNNING - Currently processing
  • COMPLETED - Finished successfully
  • FAILED - Encountered an error
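
Rather than re-checking by hand, you can poll until the run reaches a terminal status. A minimal sketch using the status values above:

# Poll every 10 seconds until the run reaches a terminal status
while true; do
  STATUS=$(curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
    "https://api.platform.a15t.com/v1/evals/evaluations/runs/$RUN_ID" | jq -r '.status')
  echo "status: $STATUS"
  case "$STATUS" in
    COMPLETED|FAILED) break ;;
  esac
  sleep 10
done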

List all runs

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluations/runs" | jq '.items[] | {id, display_name, status, created_at}'

Step 10: Export Scores

Once the run is COMPLETED, export the results:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/scores/export?evaluation_run_id=$RUN_ID" \
  -o "translation_scores.csv"

View the results:

head translation_scores.csv
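
For a rough summary, the average exported score can be computed in the shell. A sketch that assumes the CSV has an unquoted header column named score and no quoted commas before that column; adjust it to your actual export format:

# Average of the "score" column (column name and simple CSV layout assumed; check the header first)
awk -F, 'NR==1 {for (i=1; i<=NF; i++) if ($i=="score") col=i; next}
         col {sum+=$col; n++}
         END {if (n) printf "%d scores, mean = %.2f\n", n, sum/n}' translation_scores.csv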

Summary

You've successfully:

  1. ✅ Created a translation quality evaluator (1-3 score)
  2. ✅ Verified the evaluator works correctly
  3. ✅ Tested it with sample data
  4. ✅ Created an evaluation linking your evaluator to data
  5. ✅ Verified the evaluation configuration
  6. ✅ Tested with a sample run
  7. ✅ Ran the full evaluation
  8. ✅ Monitored the run status
  9. ✅ Exported the final scores