Evaluation

Evaluation Tutorial

A complete workflow for evaluating translation quality using LLM judges in the Aitheon Evaluation Platform.

⚠️ **The evaluation API and services are currently in a pre-release stage and are only available in the dev environment. To access the API, the client must send the beta version header "GIP-Beta: eval=v1".**

Prerequisites

# Set your API key
export API_KEY="your-api-key-here"
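
Every request in this tutorial sends the same authorization and beta headers. The wrapper below is an optional convenience (a sketch, assuming bash and curl); the steps that follow spell the headers out in full, so it can be skipped.

# Optional helper: BASE_URL and eval_api are conveniences introduced here, not part of the platform tooling
export BASE_URL="https://api.platform.a15t.com/v1/evals"

eval_api() {
  # Usage: eval_api METHOD PATH [JSON_BODY]
  local method="$1" path="$2" body="${3:-}"
  if [ -n "$body" ]; then
    curl -sS -X "$method" "$BASE_URL$path" \
      -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
      -H "content-type: application/json" -d "$body"
  else
    curl -sS -X "$method" "$BASE_URL$path" \
      -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1"
  fi
}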

Models

  • Judge Model: Any LLM available in the Workspace can be used, for example openai/gpt-5-2025-08-07.
  • Score Parser Model: If the judge model does not support structured output, a separate parser model is used to produce structured scores. GIP internally uses openai/gpt-5-mini-2025-08-07; this model is required for evaluation, so it must be bound. If it is not already bound, please request the binding.

Step 1: Create Translation Evaluator

Create an LLM judge evaluator that scores translation quality on a scale of 1 to 3.

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluators" \
  -H "content-type: application/json" \
  -d '{
    "name": "Translation Quality Judge",
    "metric": {
      "name": "Quality Score",
      "type": "numeric",
      "config": {
        "min": {"value": 1},
        "max": {"value": 3}
      }
    },
    "evaluation_config": {
      "type": "llm_judge",
      "config": {
        "judge_model_public_id": "openai/gpt-5-2025-08-07",
        "judge_model_parameters": {
          "temperature": 0.0,
          "top_p": 1.0,
          "max_tokens": 500
        },
        "judge_prompt": [
          {
            "role": "system",
            "content": "Evaluate translation quality:\n1 = Poor/Incorrect\n2 = Good but has issues\n3 = Excellent\n\nOutput JSON: {\"score\": X, \"rationale\": \"explanation\"}"
          },
          {
            "role": "user",
            "content": "Original: {{input_text}}\nTranslation: {{translated_text}}"
          }
        ]
      }
    }
  }' | jq '.id'

Save the returned evaluator ID:

export EVALUATOR_ID="your-evaluator-id-here"
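
If you prefer not to copy the ID by hand, command substitution captures it directly. A sketch, assuming the Step 1 request body has been saved to a file named evaluator.json (the filename is illustrative):

# Create the evaluator and capture its ID in one step
# (evaluator.json holds the JSON body from Step 1; the filename is illustrative)
EVALUATOR_ID=$(curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluators" \
  -H "content-type: application/json" \
  -d @evaluator.json | jq -r '.id')
echo "EVALUATOR_ID=$EVALUATOR_ID"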

Step 2: Verify Evaluator

List all evaluators

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluators" | jq '.items[] | {id, name}'

Get specific evaluator details

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID" | jq

Step 3: Test Evaluator

Test the evaluator with sample translation data:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID/run" \
  -H "content-type: application/json" \
  -d '{
    "variables": {
      "input_text": "Hello, how are you?",
      "translated_text": "안녕하세요, 어떻게 지내세요?"
    }
  }' | jq

Expected response:

{
  "score": 3.0,
  "rationale": "Excellent translation - accurate and natural Korean"
}
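
To spot-check the judge across different quality levels, you can loop a few pairs through the same run endpoint. A sketch; the pairs below are illustrative, and the last one is intentionally wrong to confirm that low scores come back.

# Spot-check the evaluator on a few tab-separated pairs (last pair is deliberately bad)
printf '%s\t%s\n' \
  "Hello, how are you?" "안녕하세요, 어떻게 지내세요?" \
  "Good morning" "좋은 아침" \
  "See you later" "바나나" |
while IFS=$'\t' read -r src tgt; do
  jq -n --arg i "$src" --arg t "$tgt" '{variables: {input_text: $i, translated_text: $t}}' |
  curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
    -X POST "https://api.platform.a15t.com/v1/evals/evaluators/$EVALUATOR_ID/run" \
    -H "content-type: application/json" -d @- | jq -c '{score, rationale}'
done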

Step 4: Create Dataset (Optional)

If evaluating against a dataset instead of traces:

# Create dataset
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/datasets" \
  -H "content-type: application/json" \
  -d '{
    "name": "Translation Test Dataset",
    "description": "Sample translations for evaluation"
  }' | jq '.id'

export DATASET_ID="your-dataset-id-here"

# Add sample items to the dataset (the dataset_id field linking items to the dataset is assumed here)
curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/dataset-items/bulk" \
  -H "content-type: application/json" \
  -d '{
    "dataset_id": "'$DATASET_ID'",
    "items": [
      {
        "input": "Good morning",
        "expected_output": "좋은 아침"
      },
      {
        "input": "Thank you very much",
        "expected_output": "정말 감사합니다"
      },
      {
        "input": "See you later",
        "expected_output": "나중에 봐요"
      }
    ]
  }' | jq
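
For larger datasets you can build the bulk payload from a file instead of hand-writing JSON. A sketch assuming a tab-separated file named translations.tsv with one input/expected_output pair per line (the filename is illustrative, and the dataset_id field mirrors the request above):

# translations.tsv: one "<input>\t<expected_output>" pair per line (illustrative file)
jq -Rn --arg dataset_id "$DATASET_ID" '
  {dataset_id: $dataset_id,
   items: [inputs | split("\t") | {input: .[0], expected_output: .[1]}]}' \
  < translations.tsv |
curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/dataset-items/bulk" \
  -H "content-type: application/json" -d @- | jq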

Step 5: Create Evaluation

Create an evaluation that uses your evaluator to assess translation quality:

Option A: Using Dataset

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations" \
  -H "content-type: application/json" \
  -d '{
    "name": "Translation Quality Assessment",
    "description": "Evaluate Korean-English translation quality",
    "target_data_filter": {
      "type": "dataset",
      "config": {
        "dataset_id": "'$DATASET_ID'"
      }
    },
    "evaluator_bindings": [
      {
        "evaluator_id": "'$EVALUATOR_ID'",
        "evaluator_version": 1,
        "variable_mappings": [
          {
            "name": "input_text",
            "property": "input"
          },
          {
            "name": "translated_text", 
            "property": "expected_output"
          }
        ]
      }
    ]
  }' | jq '.id'

Option B: Using Langfuse Traces

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations" \
  -H "content-type: application/json" \
  -d '{
    "name": "Translation Quality from Traces",
    "description": "Evaluate translations from production traces",
    "target_data_filter": {
      "type": "langfuse_trace",
      "config": {
        "integrated_service_credential_id": "your-langfuse-credential-id",
        "sampling_rate": 1,
        "max_trace_count": 50,
        "observation_path": ["translate"],
        "trace_filter_request": {
          "name": "translation_request",
          "relative_time_range": "P1D"
        }
      }
    },
    "evaluator_bindings": [
      {
        "evaluator_id": "'$EVALUATOR_ID'",
        "evaluator_version": 1,
        "variable_mappings": [
          {
            "name": "input_text",
            "property": "input"
          },
          {
            "name": "translated_text",
            "property": "output"
          }
        ]
      }
    ]
  }' | jq '.id'

Save the evaluation ID:

export EVALUATION_ID="your-evaluation-id-here"

Step 6: Verify Evaluation

List evaluations

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluations" | jq '.items[] | {id, name, created_at}'

Get evaluation details

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID" | jq

Step 7: Test with Sample Run

Run a quick sample evaluation to verify everything works:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/sample-run" \
  -H "content-type: application/json" \
  -d '{
    "display_name": "Quick Test",
    "sample_limit": 3
  }' | jq

Expected response:

{
  "evaluation_sample_run_id": "sample-run-id",
  "results": [
    {
      "score": 3.0,
      "rationale": "Excellent translation",
      "dataset_item_id": "item-1"
    }
  ]
}
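
As a quick sanity check, you can compute the average score across the sample run's results (assuming the response shape shown above; note this issues another sample run):

# Average score across the sample run's results
curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/sample-run" \
  -H "content-type: application/json" \
  -d '{"display_name": "Quick Test", "sample_limit": 3}' |
  jq '[.results[].score] | add / length'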

Step 8: Run Full Evaluation

Start a complete evaluation run:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  -X POST "https://api.platform.a15t.com/v1/evals/evaluations/$EVALUATION_ID/runs" \
  -H "content-type: application/json" \
  -d '{
    "display_name": "Translation Quality Run #1"
  }' | jq '.id'

Save the run ID:

export RUN_ID="your-run-id-here"

Step 9: Monitor Run Status

Check the evaluation run progress:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluations/runs/$RUN_ID" | jq '.status'

Status values:

  • PENDING - Waiting to start
  • RUNNING - Currently processing
  • COMPLETED - Finished successfully
  • FAILED - Encountered an error
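
Rather than re-checking by hand, you can poll until the run reaches a terminal status. A minimal sketch using the status values above:

# Poll every 10 seconds until the run reaches a terminal status
while true; do
  STATUS=$(curl -s -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
    "https://api.platform.a15t.com/v1/evals/evaluations/runs/$RUN_ID" | jq -r '.status')
  echo "status: $STATUS"
  case "$STATUS" in
    COMPLETED|FAILED) break ;;
  esac
  sleep 10
done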

List all runs

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/evaluations/runs" | jq '.items[] | {id, display_name, status, created_at}'

Step 10: Export Scores

Once the run is COMPLETED, export the results:

curl -H "Authorization: Bearer $API_KEY" -H "GIP-Beta: eval=v1" \
  "https://api.platform.a15t.com/v1/evals/scores/export?evaluation_run_id=$RUN_ID" \
  -o "translation_scores.csv"

View the results:

head translation_scores.csv
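
For a rough summary, the average exported score can be computed in the shell. A sketch that assumes the CSV has an unquoted header column named score and no quoted commas before that column; adjust it to your actual export format:

# Average of the "score" column (column name and simple CSV layout assumed; check the header first)
awk -F, 'NR==1 {for (i=1; i<=NF; i++) if ($i=="score") col=i; next}
         col {sum+=$col; n++}
         END {if (n) printf "%d scores, mean = %.2f\n", n, sum/n}' translation_scores.csv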

Summary

You've successfully:

  1. ✅ Created a translation quality evaluator (1-3 score)
  2. ✅ Verified the evaluator works correctly
  3. ✅ Tested it with sample data
  4. ✅ Created an evaluation linking your evaluator to data
  5. ✅ Verified the evaluation configuration
  6. ✅ Tested with a sample run
  7. ✅ Ran the full evaluation
  8. ✅ Monitored the run status
  9. ✅ Exported the final scores