Evaluation Tutorial
A complete workflow for evaluating translation quality using LLM judges in the Aitheon Evaluation Platform.
⚠️ **The evaluation API and services are currently in a pre-release stage and are only available in the dev environment. To access the API, the client must provide the beta version header "GIP-Beta: eval=v1".**
Prerequisites
# Set your API key
export API_KEY="your-api-key-here"
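Before continuing, you can confirm that the key and the beta header are accepted; a minimal sanity check using the evaluator list endpoint from Step 2 (a JSON response with an items array means access is working):
curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluators" | jq '.items | length'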
Models
- Judge Model: You can use any LLM model available in the Workspace, for example openai/gpt-5-2025-08-07.
- Score Parser Model: If the judge model does not support structured output, a separate parser model is used to produce structured scores. GIP internally uses openai/gpt-5-mini-2025-08-07 for this; the model is required for evaluation and must be bound. If it is not already bound, please request the binding.
Step 1: Create Translation Evaluator
Create an LLM judge evaluator that scores translation quality from 1-3.
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
-X POST "https://api.platform.a15t.com/v1/evals/evaluators" \
-H "content-type: application/json" \
-d '{
"name": "Translation Quality Judge",
"metric": {
"name": "Quality Score",
"type": "numeric",
"config": {
"min": {"value": 1},
"max": {"value": 3}
}
},
"evaluation_config": {
"type": "llm_judge",
"config": {
"judge_model_public_id": "openai/gpt-5-2025-08-07",
"judge_model_parameters": {
"temperature": 0.0,
"top_p": 1.0,
"max_tokens": 500
},
"judge_prompt": [
{
"role": "system",
"content": "Evaluate translation quality:\n1 = Poor/Incorrect\n2 = Good but has issues\n3 = Excellent\n\nOutput JSON: {\"score\": X, \"rationale\": \"explanation\"}"
},
{
"role": "user",
"content": "Original: {{input_text}}\nTranslation: {{translated_text}}"
}
]
}
}
}' | jq '.id'
Save the returned evaluator ID:
export EVALUATOR_ID="your-evaluator-id-here"
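To avoid copying the ID by hand, you can also keep the JSON payload above in a file and capture the ID with command substitution; a small sketch (the evaluator.json file name is just an example):
# Assumes evaluator.json contains the request body shown in Step 1
export EVALUATOR_ID=$(curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluators" \
  -H "content-type: application/json" \
  -d @evaluator.json | jq -r '.id')
echo "$EVALUATOR_ID"
The same pattern works for the evaluation and run IDs captured in later steps.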
Step 2: Verify Evaluator
List all evaluators
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
"https://api.platform.a15t.com/v1/evals/evaluators" | jq '.items[] | {id, name}'
Get specific evaluator details
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
"https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID" | jq
Step 3: Test Evaluator
Test the evaluator with sample translation data:
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
-X POST "https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID/run" \
-H "content-type: application/json" \
-d '{
"variables": {
"input_text": "Hello, how are you?",
"translated_text": "안녕하세요, 어떻게 지내세요?"
}
}' | jq
Expected response:
{
"score": 3.0,
"rationale": "Excellent translation - accurate and natural Korean"
}
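It is also worth confirming that the judge penalizes a bad translation. For example, an unrelated Korean sentence should come back with a low score (the exact score and rationale depend on the judge model):
# Deliberately wrong translation ("I like apples") - should score 1
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID/run" \
  -H "content-type: application/json" \
  -d '{
    "variables": {
      "input_text": "Hello, how are you?",
      "translated_text": "저는 사과를 좋아합니다."
    }
  }' | jq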
Step 4: Create Dataset (Optional)
If evaluating against a dataset instead of traces:
# Create dataset
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
-X POST "https://api.platform.a15t.com/v1/evals/datasets" \
-H "content-type: application/json" \
-d '{
"name": "Translation Test Dataset",
"description": "Sample translations for evaluation"
}' | jq '.id'
export DATASET_ID="your-dataset-id-here"
# Add sample data
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
-X POST "https://api.platform.a15t.com/v1/evals/dataset-items/bulk" \
-H "content-type: application/json" \
-d '{
"items": [
{
"input": "Good morning",
"expected_output": "좋은 아침"
},
{
"input": "Thank you very much",
"expected_output": "정말 감사합니다"
},
{
"input": "See you later",
"expected_output": "나중에 봐요"
}
]
}' | jq
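For larger datasets it may be easier to keep the source/reference pairs in a local file and build the bulk payload with jq. A minimal sketch, assuming a tab-separated file translations.tsv (a name chosen here for illustration) with one "source<TAB>reference" pair per line:
# Build the same {"items": [{"input": ..., "expected_output": ...}]} payload from a TSV file
jq -R -s '{items: [split("\n")[] | select(length > 0) | split("\t") | {input: .[0], expected_output: .[1]}]}' \
  translations.tsv > bulk_items.json

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/dataset-items/bulk" \
  -H "content-type: application/json" \
  -d @bulk_items.json | jq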
Step 5: Create Evaluation
Create an evaluation that links your evaluator to the data it should score. The target_data_filter selects the data source (a dataset or Langfuse traces), and each entry in evaluator_bindings maps the evaluator's prompt variables (input_text, translated_text) to properties of the selected items via variable_mappings:
Option A: Using Dataset
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
-X POST "https://api.platform.a15t.com/v1/evals/evaluations" \
-H "content-type: application/json" \
-d '{
"name": "Translation Quality Assessment",
"description": "Evaluate Korean-English translation quality",
"target_data_filter": {
"type": "dataset",
"config": {
"dataset_id": "'$DATASET_ID'"
}
},
"evaluator_bindings": [
{
"evaluator_id": "'$EVALUATOR_ID'",
"evaluator_version": 1,
"variable_mappings": [
{
"name": "input_text",
"property": "input"
},
{
"name": "translated_text",
"property": "expected_output"
}
]
}
]
}' | jq '.id'
Option B: Using Langfuse Traces
This variant evaluates production traces pulled from your Langfuse integration; in the example below, the relative_time_range value "P1D" is an ISO 8601 duration covering the last day, and observation_path narrows evaluation to the translate observation within each trace.
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
-X POST "https://api.platform.a15t.com/v1/evals/evaluations" \
-H "content-type: application/json" \
-d '{
"name": "Translation Quality from Traces",
"description": "Evaluate translations from production traces",
"target_data_filter": {
"type": "langfuse_trace",
"config": {
"integrated_service_credential_id": "your-langfuse-credential-id",
"sampling_rate": 1,
"max_trace_count": 50,
"observation_path": ["translate"],
"trace_filter_request": {
"name": "translation_request",
"relative_time_range": "P1D"
}
}
},
"evaluator_bindings": [
{
"evaluator_id": "'$EVALUATOR_ID'",
"evaluator_version": 1,
"variable_mappings": [
{
"name": "input_text",
"property": "input"
},
{
"name": "translated_text",
"property": "output"
}
]
}
]
}' | jq '.id'
Save the evaluation ID:
export EVALUATION_ID="your-evaluation-id-here"
Step 6: Verify Evaluation
List evaluations
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
"https://api.platform.a15t.com/v1/evals/evaluations" | jq '.items[] | {id, name, created_at}'
Get evaluation details
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
"https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID" | jq
Step 7: Test with Sample Run
Run a quick sample evaluation to verify everything works:
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
-X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/sample-run" \
-H "content-type: application/json" \
-d '{
"display_name": "Quick Test",
"sample_limit": 3
}' | jq
Expected response:
{
"evaluation_sample_run_id": "sample-run-id",
"results": [
{
"score": 3.0,
"rationale": "Excellent translation",
"dataset_item_id": "item-1"
}
]
}
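Because the response contains one score per sampled item, you can also pull a quick average straight out of the sample run; a sketch based on the response shape shown above:
curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/sample-run" \
  -H "content-type: application/json" \
  -d '{"display_name": "Quick Test", "sample_limit": 3}' \
  | jq '[.results[].score] | add / length'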
Step 8: Run Full Evaluation
Start a complete evaluation run:
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
-X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/runs" \
-H "content-type: application/json" \
-d '{
"display_name": "Translation Quality Run #1"
}' | jq '.id'
Save the run ID:
export RUN_ID="your-run-id-here"
Step 9: Monitor Run Status
Check the evaluation run progress:
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
"https://api.platform.a15t.com/v1/evals/evaluations/runs/$RUN_ID" | jq '.status'
Status values:
- PENDING - Waiting to start
- RUNNING - Currently processing
- COMPLETED - Finished successfully
- FAILED - Encountered an error
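Runs are processed asynchronously, so you may want to poll until the run reaches a terminal state. A minimal polling sketch (the 10-second interval is arbitrary):
# Poll the run status every 10 seconds until it is COMPLETED or FAILED
while true; do
  STATUS=$(curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
    "https://api.platform.a15t.com/v1/evals/evaluations/runs/$RUN_ID" | jq -r '.status')
  echo "Run status: $STATUS"
  if [ "$STATUS" = "COMPLETED" ] || [ "$STATUS" = "FAILED" ]; then
    break
  fi
  sleep 10
done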
List all runs
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
"https://api.platform.a15t.com/v1/evals/evaluations/runs" | jq '.items[] | {id, display_name, status, created_at}'
Step 10: Export Scores
Once the run is COMPLETED, export the results:
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
"https://api.platform.a15t.com/v1/evals/scores/export?evaluation_run_id=$RUN_ID" \
-o "translation_scores.csv"
View the results:
head translation_scores.csv
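The exact columns of the export are not covered in this tutorial, but if it includes a header row with a score column you can compute a quick summary locally; a hedged sketch (the column name "score" is an assumption, and the naive comma split will miscount if earlier columns contain quoted commas):
# Find the "score" column from the header, then print count and mean
awk -F',' 'NR==1 { for (i=1; i<=NF; i++) if ($i=="score") col=i; next }
           col { sum += $col; n++ }
           END { if (n) printf "items=%d mean_score=%.2f\n", n, sum/n }' translation_scores.csv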
Summary
You've successfully:
- ✅ Created a translation quality evaluator (1-3 score)
- ✅ Verified the evaluator works correctly
- ✅ Tested it with sample data
- ✅ Created an evaluation linking your evaluator to data
- ✅ Verified the evaluation configuration
- ✅ Tested with a sample run
- ✅ Ran the full evaluation
- ✅ Monitored the run status
- ✅ Exported the final scores