In the post Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI, we introduced the Amazon Nova LLM-as-a-judge capability, which is a specialized evaluation model available through Amazon SageMaker AI that you can use to systematically measure the relative performance of generative AI systems.
SageMaker AI now offers a rubric-based large language model (LLM) judge powered by Amazon Nova. Instead of using the same general rules for every task, it automatically creates specific evaluation criteria for each individual prompt. This helps generative AI developers and machine learning (ML) engineers automatically generate precise, scenario-specific evaluation criteria for their LLMs and generative AI products, without manually crafting rule sets for every use case.
In this post, we explore the Amazon Nova rubric-based judge feature: what a rubric-based judge is, how the judge is trained, what metrics to consider, and how to calibrate the judge. We share notebook code that applies the Amazon Nova rubric-based LLM-as-a-judge methodology to evaluate and compare the outputs of two different LLMs using SageMaker training jobs.
Overview of Amazon Nova rubric-based judge
A rubric-based LLM judge uses a highly capable LLM (specifically Amazon Nova) to act as a judge for responses generated by other AI models or by humans. This capability provides pairwise comparisons between model iterations, so you can make data-driven decisions about model improvements with confidence. A rubric is a scoring guide used to evaluate performance. Traditionally, to evaluate an AI model, humans manually write a checklist of rules (a static rubric) that apply to every task. For example: “Is the answer polite?” or “Is the answer short?” The challenge with this approach is that general rules don’t scale across use cases: a creative writing prompt needs different grading criteria than a coding prompt or a legal summary, so one general checklist isn’t enough. The Amazon Nova powered judge creates a custom checklist for each interaction. Users send a specific prompt to the system (for example, “Summarize this medical document for a patient”). The judge analyzes that specific prompt and decides in the moment what an ideal answer should look like. It automatically writes criteria such as the following:
- Does it use simple, non-medical jargon?
- Does it capture the diagnosis accurately?
- Is the tone empathetic?
The judge then looks at the output generated by your LLM and grades it against the specific criteria it just created. You get a quality score based on rules that are relevant to that specific task, not generic rules.
Example input
For the example prompt or input “do dinosaurs really exist?”, we have two responses, Response A and Response B, shown below.
Response A
The simple answer is: Yes, dinosaurs absolutely existed, but they do not exist today (except for their bird descendants).
Here’s a breakdown of what that means:
1. Dinosaurs Were Very Real
For over 150 million years, dinosaurs were the dominant land animals on Earth. We know they existed because we have an enormous amount of proof:
Fossils: We have found millions of dinosaur fossils—bones, teeth, skulls, and claws—all over the world. These aren’t myths; they are the actual preserved remains of these animals.
Footprints: We have discovered countless fossilized trackways, which are footprints left in mud that later turned to stone. These tell us how dinosaurs walked, ran, and lived in groups.
Eggs and Nests: Paleontologists have found fossilized dinosaur eggs and nests, showing us how they reproduced.
Even Soft Tissues: In some extremely rare cases, scientists have found evidence of feathers, skin impressions, and even chemical traces of proteins from dinosaurs.
2. But They Are Extinct (Most of Them)
About 66 million years ago, a massive asteroid struck the Earth (in what is now the Gulf of Mexico). This caused a catastrophic event that led to the extinction of all non-avian dinosaurs—that is, all dinosaurs except for the ones that could fly.
3. Their Descendants Are All Around Us: Birds!
This is the most exciting part of the answer! Birds are actually a type of dinosaur.
Think of it this way: The dinosaur group is huge. T. rex, Triceratops, and Brachiosaurus are all famous members. Another branch of the dinosaur family tree were small, feathered dinosaurs. One group of these feathered dinosaurs evolved the ability to fly. They are the direct ancestors of all modern birds. So, when you see a pigeon on the street or a chicken on your plate, you are looking at a living dinosaur.
In summary:
- Did dinosaurs exist? Yes, overwhelmingly, we have proof they were real animals.
- Do they exist today? Not in the form of T. rex or Stegosaurus. Those specific types are extinct.
- But wait! Their direct descendants, the birds, are everywhere. So, in a very real scientific sense, dinosaurs are still with us today.
Response B
Yes, dinosaurs did really exist — but not today. Dinosaurs were a diverse group of reptiles that lived on Earth millions of years ago, primarily during the Mesozoic Era, which spanned from about 252 million years ago to 66 million years ago.
Fossil evidence — including bones, footprints, eggs, and even skin impressions — confirms that dinosaurs once roamed the planet. Scientists have identified hundreds of dinosaur species, ranging from the massive Tyrannosaurus rex and long-necked Brachiosaurus to smaller, agile predators like Velociraptor.
Most non-avian dinosaurs went extinct around 66 million years ago, likely due to a massive asteroid impact combined with volcanic activity and climate change. However, one group of dinosaurs survived: the ancestors of modern birds. So in a scientific sense, birds are living dinosaurs, making them the closest living relatives to extinct dinosaurs like T. rex and Triceratops.
So while you won’t see a Stegosaurus walking around today, the legacy of dinosaurs lives on — especially every time you see a sparrow or eagle fly by.
Example output
Out of the two responses, the rubric-based judge prefers Response A and also provides a justification of why it prefers Response A over Response B, as shown in the following screenshot.
The evaluation is tailored to the exact intent of the user’s prompt (coding vs. writing vs. summarizing). Generative AI developers, data scientists, and ML engineers don’t have to spend hundreds of hours manually writing evaluation rules for every possible scenario. You can evaluate thousands of different types of prompts instantly, achieving high quality across diverse use cases.
Enterprise implementation examples
The Amazon Nova rubric-based LLM judge addresses critical evaluation challenges across different scenarios:
- Model development and checkpoint selection – Development teams integrate the Amazon Nova rubric-based judge evaluation into training pipelines to automatically evaluate checkpoints. Per-criterion scores reveal which capabilities strengthened or regressed across iterations, enabling data-driven decisions about hyperparameter adjustments and data curation.
- Training data quality control – Teams use the Amazon Nova rubric-based judge evaluation to filter supervised fine-tuning datasets by generating point-wise scores on relevance criteria, identifying low-quality examples. For preference datasets, calculated margins between response pairs enable curriculum learning strategies that filter overwhelmingly one-sided examples providing limited learning signals.
- Automated deep dive and root cause analysis – Organizations deploying generative AI at scale can use the Amazon Nova rubric-based judge evaluation for systematic analysis across thousands of model outputs without manual review. When models exhibit quality issues, developers can examine which specific criteria drive preference judgments, identifying systematic weaknesses that inform targeted improvements instead of broad retraining efforts.
How dynamic rubric generation works
The Amazon Nova rubric-based LLM judge takes as input a triplet: (prompt, response A, response B). The judge compares the quality of the two responses for the given prompt and outputs a preference label. In addition to the overall label, the judge generates a justification for its decision, guided by a rubric.
A rubric is a set of weighted criteria used to evaluate the two responses. The rubric-based LLM judge is trained to generate criteria with weights that sum to 1. Each criterion in the rubric has a short_name, description, and weight. The judge’s decision includes a score for each response on each criterion in the rubric along with justifications for the scores.
The Amazon Nova rubric-based LLM judge employs an evaluation methodology where each judgment is supported by dynamically generated, prompt-specific criteria. When the judge receives an evaluation request containing a prompt and candidate responses, it analyzes the prompt to understand its context and generates criteria based on that context. This dynamic generation process makes sure evaluations are grounded in criteria directly applicable to the task at hand, providing transparent and interpretable assessments.
For each evaluation, the judge produces structured YAML output containing the generated criteria with their definitions, per-criterion scores on a 1–5 scale, and detailed justifications explaining each score. The final output includes one of four preference labels: [[A>B]], [[B>A]], [[A=B]], or [[A=B (bothbad)]]. Each criterion score is accompanied by a justification that grounds the assessment in observable characteristics of the responses, enabling deep-dive analysis and debugging of model behavior.
Comparing rubric-based Amazon Nova LLM-as-a-judge to previous versions
The rubric-based judge differs from previous versions in how it presents evaluation results and what information it provides.
The previous version of the Amazon Nova LLM-as-a-judge model returned simple preference labels ([[A>B]] or [[B>A]]). The rubric-based version generates a structured YAML output that consists of the following:
- A prompt-specific rubric for assessing the responses organized as a set of criteria with associated per-criterion importance weights (weights sum up to 1)
- Brief natural language descriptions of each criterion
- A Likert score (on a 1–5 scale) or binary (true/false) decision for each criterion for every candidate response in the input
- Justification for each criterion score for every candidate response
- Overall preference judgement: one of A>B, B>A, A=B, or A=B (both bad)
The new detailed output format facilitates a broad range of nuanced use cases. For example, specific criteria within rubrics allow for pointed comparisons of responses. A succinct response might be more suitable for certain use cases, whereas a comprehensive response might be needed in others. Justifications and explicit criterion scores help users discard criteria that are unsuitable for their needs and recompute the preference judgements without rerunning the query through the LLM judge.
Metrics explanation
In our judge evaluation process, we use several important metrics to serve as comparison points for ranking judge quality. Forward agreement measures agreement with human preference when the chosen and rejected responses are presented to the judge in a fixed order, so the correct label is always either A>B or B>A across the entire dataset. Because positional consistency is an important property of a trustworthy LLM judge, we also evaluate our checkpoints on reconciled agreement: we obtain two judgements, with the responses presented to the judge in both possible orders. We only credit the judge with a correct answer if it agrees with itself in both directions and the judgement matches human preference. This number, by definition, can never exceed forward agreement. However, because real-world datasets aren’t sorted, it provides a more accurate proxy for the real-world performance of an LLM judge model.
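The following is a minimal sketch of how these two agreement metrics could be computed, assuming the judge has already been run over each sample twice (once per response order); the record structure and label strings are illustrative rather than the actual evaluation pipeline output format.

# Illustrative computation of forward vs. reconciled agreement.
# Each record holds the judge label with responses in the original order (forward_label),
# the label with the order swapped (reverse_label), and the human preference (human_label).

def swap(label: str) -> str:
    # Map a label from the swapped presentation back to the original order
    return {"A>B": "B>A", "B>A": "A>B"}.get(label, label)

def agreement(records):
    forward_correct = 0
    reconciled_correct = 0
    for r in records:
        if r["forward_label"] == r["human_label"]:
            forward_correct += 1
        # Credit reconciled agreement only if both passes agree after un-swapping
        # and the agreed label matches the human preference
        if swap(r["reverse_label"]) == r["forward_label"] == r["human_label"]:
            reconciled_correct += 1
    return forward_correct / len(records), reconciled_correct / len(records)

records = [
    {"forward_label": "A>B", "reverse_label": "B>A", "human_label": "A>B"},  # consistent and correct
    {"forward_label": "A>B", "reverse_label": "A>B", "human_label": "A>B"},  # position-inconsistent
]
print(agreement(records))  # (1.0, 0.5)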
Weighted scores (weighted_score_A and weighted_score_B) are new metrics added to the rubric judge evaluation output, which provide a view into the confidence of the judgment. A large difference between the weighted scores indicates a strong preference for one response over the other. These scores are calculated per sample based on the assigned scores for each criterion in the rubric. Each criterion score is normalized to a 0–1 range (scale scores 1–5 map to 0.0–1.0, and binary True/False map to 1.0/0.0), then multiplied by the criterion’s weight and summed to produce the weighted score for each response.
The score_margin shows the difference between the weighted scores, with negative values indicating a preference towards response B and positive values indicating a preference towards response A. In the final evaluation output, these metrics are reported as averages across all samples. Per-sample criteria breakdowns, individual scores, and justifications can be found in the detailed Parquet output file.
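As a concrete illustration of this computation, the following sketch applies the normalization and weighting described above to a hypothetical rubric; the criterion names, weights, and scores are made up for the example.

def normalize(score, score_type):
    # Map a 1-5 Likert score to 0.0-1.0, or a binary True/False to 1.0/0.0
    if score_type == "binary":
        return 1.0 if score else 0.0
    return (score - 1) / 4.0

# Hypothetical per-sample rubric emitted by the judge (weights sum to 1)
rubric = [
    {"name": "completeness", "weight": 0.5, "type": "scale", "score_A": 2, "score_B": 5},
    {"name": "clarity",      "weight": 0.3, "type": "scale", "score_A": 3, "score_B": 4},
    {"name": "accuracy",     "weight": 0.2, "type": "scale", "score_A": 4, "score_B": 5},
]

weighted_score_A = sum(c["weight"] * normalize(c["score_A"], c["type"]) for c in rubric)
weighted_score_B = sum(c["weight"] * normalize(c["score_B"], c["type"]) for c in rubric)
score_margin = weighted_score_A - weighted_score_B  # negative values favor response B

print(weighted_score_A, weighted_score_B, score_margin)  # approximately 0.425, 0.925, -0.5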
Per comparison sample, we can get the specific criteria that the new rubric judge model used to compare the two responses, which looks like the following example output:
================================================================================
Row 1:
Preference: ['B>A']
A wins: 0.0
B wins: 2.0
Weighted A: 0.225
Weighted B: 1.000
Margin: -0.775
Overall Justification:
Response B provides a comprehensive and detailed explanation of photosynthesis, covering the process, location, chemical equation, and importance. Response A only provides a brief, surface-level description without explaining the mechanism or significance.
Criteria:
completeness:
Score A: 2, Score B: 5
Weight: 0.5, Type: scale
Description: How thoroughly the response explains the photosynthesis process.
Justification A: Response A mentions the basic inputs and outputs but lacks detail on the mechanism, location in the cell, or the chemical equation.
Justification B: Response B provides a complete explanation including the process, chloroplasts, chemical equation, and the importance to life on Earth.
clarity:
Score A: 3, Score B: 5
Weight: 0.3, Type: scale
Description: How clearly the response communicates the concept.
Justification A: Response A is clear but overly simplistic, lacking the detail needed for full understanding.
Justification B: Response B is well-structured and clearly explains each component of photosynthesis in an accessible way.
accuracy:
Score A: 4, Score B: 5
Weight: 0.2, Type: scale
Description: How accurate the scientific information is.
Justification A: Response A is accurate in what it states but incomplete.
Justification B: Response B is fully accurate and includes the correct chemical equation and scientific terminology.
================================================================================
These weighted metrics are informational and provide quantitative insight into the scoring breakdown, but the actual preference decision (A>B, B>A, or A=B) that determines the final win counts is based on the judge model’s overall preference output.
Training approach for the judge
The Amazon Nova rubric-based judge is trained with a multi-aspect reward package. In our training methodology, we optimize for several desirable characteristics for an LLM judge using an effective reward formulation. We mainly target the following criteria:
- Preference accuracy – The judge is rewarded when it produces decisions that align with gold human preferences, that is, when it chooses the response that human annotators preferred.
- Positional consistency – The judge’s decisions are trained to be resilient to positional bias, so the judgement doesn’t flip when the candidate responses are presented in the opposite order.
- Justification quality – The judge’s justifications for making the decision must align with the generated rubrics, scores, and final judgement.
- Score calibration – The weighted scores for the responses must be calibrated with the decision accuracy (high confidence judgements must be correct more often than low confidence judgements).
We start with human-annotated preference data and employ a custom data filtering and synthetic data generation setup to obtain rubric-aligned preference justifications. We sample from the generated synthetic rubrics and use a custom pipeline to train the Amazon Nova rubric-based LLM judge to proficiently generate appropriate criteria with precise granularity for consistent and robust decision-making.
Benchmark performance
Testing on standard evaluation datasets shows improvements, particularly on tasks requiring nuanced judgment, as shown in the following table.
| Benchmark | Previous Amazon Nova Judge | New Amazon Nova Rubric-Based Judge |
| --- | --- | --- |
| PPE | 0.61 | 0.64 |
| RMBench | 0.66 | 0.88 |
| RewardBench | 0.88 | 0.9 |
| JudgeBench | 0.51 | 0.76 |
| CodeUltraFeedback | 0.69 | 0.72 |
| MMEval | 0.8 | 0.84 |
The larger improvements on JudgeBench and RMBench reflect better handling of complex evaluation scenarios.
Calibration
During our training process as well as during postprocessing, we evaluate the Amazon Nova rubric-based judge’s ability to make well-calibrated decisions. To achieve balanced calibration, we look at confidence buckets on a human-annotated preference dataset, using the difference of weighted scores for a response pair as the confidence signal. We aim for calibration of confidence to accuracy: ideally, the LLM judge should be more accurate when making high-confidence decisions and can be less accurate when making low-confidence decisions. We find that this calibration methodology results in consistent decision-making on both in-distribution and out-of-distribution datasets. We also look at the distributions of scores generated for different criteria, and look for an approximately normal distribution over Likert scale scores (1–5) across the evaluation dataset. This two-pronged calibration check helps us identify better LLM judge checkpoints among several similarly well-performing checkpoints.
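To make the bucket-based check concrete, the following is a minimal sketch under assumed data: judgments are grouped into confidence buckets by the absolute weighted-score margin, and accuracy against human labels is reported per bucket. The margins, correctness flags, and bucket edges are illustrative.

import numpy as np

# Hypothetical per-sample values: absolute score margins from the judge and whether
# the judge preference matched the human label (1 = correct, 0 = incorrect)
margins = np.array([0.70, 0.05, 0.40, 0.10, 0.55, 0.02, 0.35, 0.80])
correct = np.array([1, 0, 1, 1, 1, 0, 1, 1])

# A well-calibrated judge should be more accurate in the higher-confidence buckets
edges = [0.0, 0.2, 0.5, 1.0]
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (margins >= lo) & (margins < hi)
    if mask.any():
        print(f"confidence [{lo:.1f}, {hi:.1f}): n={mask.sum()}, accuracy={correct[mask].mean():.2f}")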
Reliability of dynamically generated rubrics
The reliability of dynamically generated rubrics stems from three decisions:
- The judge is trained on diverse, high-quality rubric-annotated preference data representing real-world use cases, teaching it patterns that distinguish effective evaluation criteria from superficial ones.
- Our filtering mechanism during training prioritizes rubrics exhibiting desirable properties—comprehensiveness, mutual exclusivity, appropriate specificity, and task relevance—making sure the model learns from the best examples.
- Our reward formulation directly incentivizes rubric quality: criteria that lead to accurate, position-invariant preferences with well-calibrated confidence receive positive rewards, whereas those producing inconsistent judgments are penalized.
How to use rubrics to improve practical applications
Many modern applications operate in reference-free environments, where no gold-standard human answers exist. In these cases, the usefulness of the rubric is paramount. In this section, we spotlight instances where rubrics generated by our judge could be useful inputs for informed decision-making. We demonstrate how outputs of our rubric-based judge—specifically the weighted criteria, granular scores, and explicit justifications—serve as critical control mechanisms.
Evaluating RAG systems
In Retrieval Augmented Generation (RAG), the primary failure mode is hallucination. Traditional preference judges typically conflate “is the response good?” with “is this fluent?”, “is this well-formatted?”, “does the internal logic hold up?”, and so on. A fluent but factually incorrect response is often perceived as more credible than a disjointed one containing accurate information. A factuality-focused evaluation can help you choose a summarization model whose outputs stay faithful to the retrieval results instead of hallucinating. Using a rubric-based judge for such judgements helps you understand whether a preference judgement is based on criteria like fluency and formatting, or on relevant criteria such as faithfulness and context relevance. Users can disregard the scores of irrelevant criteria and re-evaluate judgements based on the subset of criteria they care about for their application.
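The following sketch shows one way this reweighting could be done offline from the judge’s per-criterion output; the criteria dictionary, names, weights, and scores are hypothetical, and the normalization follows the weighted-score scheme described earlier.

# Hypothetical per-sample criteria parsed from the detailed judge output
criteria = {
    "faithfulness":      {"weight": 0.4, "score_A": 5, "score_B": 3},
    "context_relevance": {"weight": 0.3, "score_A": 4, "score_B": 4},
    "fluency":           {"weight": 0.2, "score_A": 2, "score_B": 5},
    "formatting":        {"weight": 0.1, "score_A": 2, "score_B": 5},
}

def reweighted_preference(criteria, keep):
    # Recompute the A/B preference using only the criteria in `keep`,
    # renormalizing their weights so they sum to 1
    total = sum(criteria[c]["weight"] for c in keep)
    score_a = sum(criteria[c]["weight"] / total * (criteria[c]["score_A"] - 1) / 4 for c in keep)
    score_b = sum(criteria[c]["weight"] / total * (criteria[c]["score_B"] - 1) / 4 for c in keep)
    if abs(score_a - score_b) < 1e-9:
        return "A=B"
    return "A>B" if score_a > score_b else "B>A"

# With these illustrative numbers, the full rubric narrowly favors the fluent response B,
# but a grounding-focused subset of criteria prefers the faithful response A
print(reweighted_preference(criteria, list(criteria)))                          # B>A
print(reweighted_preference(criteria, ["faithfulness", "context_relevance"]))   # A>B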
The creative critic
In this example, we look in the other direction, where creativity and originality are desirable over faithfulness to real-world facts or previous context. Consider a use case where you are using an LLM to generate short stories or scripts that are original, but the user provides a few examples of past scripts to demonstrate the requirements. Selecting good outputs from these generations requires the generated stories to be sufficiently different from the examples, creative, original, and free of direct borrowing from existing training data. When using our rubric-based judge, the end user could index on criteria such as originality, coherence, and engagement to obtain preference judgements suited to this use case. You could further look at the explicit justifications for the criteria scores to confirm the specific type of originality and creativity that is desirable.
Solution overview
This solution demonstrates how to evaluate generative AI models on SageMaker AI using a rubric-based judge capability. You can also evaluate human generated responses, but in this solution, we show how you can evaluate responses generated by other LLMs such as Qwen models using Amazon Nova as a rubric-based judge.
First, we prepare a dataset by sampling questions from the Stanford Question Answering Dataset (SQuAD) and generating candidate responses from both Qwen2.5 1.5B Instruct and Qwen2.5 7B Instruct. Both models are accessed through SageMaker hosted Hugging Face endpoints. The responses from both models are saved in a JSONL file (llm_judge.jsonl) containing the prompt, response_A (from Qwen2.5 1.5B Instruct), and response_B (from Qwen2.5 7B Instruct).
Next, the JSONL file is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. A PyTorch Estimator then launches an evaluation job using the Amazon Nova rubric-based LLM-as-a-judge recipe. The judge model dynamically generates evaluation rubrics and criteria tailored to each task, then compares the two candidate responses against these criteria. The job runs on GPU instances such as ml.g5.12xlarge and produces evaluation metrics, including per-criterion scores, justifications, comparative assessments, preference counts, and confidence measures. Results are saved to Amazon S3 for analysis.
Finally, a visualization function renders charts and tables, summarizing the generated rubrics, score distributions across evaluation dimensions, comparative performance between the two Qwen2.5 models, and detailed examples with justifications. Through this end-to-end approach, you can assess which model performs better, identify specific strengths and weaknesses, track improvements, and make data-driven decisions about deploying generative models—all without manual annotation.
Prerequisites
You must complete the following prerequisites before you can run the notebook:
- Make the following quota increase requests for SageMaker AI. For this use case, you must request (on the Service Quotas console) a minimum of two ml.g5.12xlarge instances for endpoint usage and at least one ml.g5.12xlarge instance for training job usage.
- Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess, AmazonS3FullAccess, and AmazonBedrockFullAccess to give required access to SageMaker AI and Amazon Bedrock to run the examples.
- (Optional) You can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding IAM role. (You can use JupyterLab in your local setup, too.)
- Before proceeding, make sure to grant the execution role s3:PutObject, s3:GetObject, and s3:ListBucket permissions for your S3 bucket as an inline policy, for example:
{
    "Effect": "Allow",
    "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
    ],
    "Resource": [
        "arn:aws:s3:::my-bucket-east",
        "arn:aws:s3:::my-bucket-east/*"
    ]
}
- Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets.
git clone https://github.com/aws-samples/amazon-nova-samples.git
cd amazon-nova-samples/customization/Nova_2.0/04_eval/Amazon-Nova-Rubric-Based-LLM-As-A-Judge
- Run the notebook Amazon-Nova-Rubric-LLM-as-a-Judge-Sagemaker-AI.ipynb to start using the Amazon Nova LLM-as-a-judge implementation on SageMaker AI.
Configure models
To conduct a rubric-based Amazon Nova LLM-as-a-judge evaluation, you must generate outputs from both candidate models you want to compare. In this project, we deploy Qwen2.5 1.5B Instruct and Qwen2.5 7B Instruct on SageMaker to generate responses that will be compared by the Amazon Nova judge model.
Both models are open-weight multilingual language models deployed on dedicated SageMaker endpoints. This is achieved by using the HuggingFaceModel deployment interface. To deploy the Qwen2.5 1.5B Instruct and Qwen2.5 7B Instruct models, we provide a convenient script that accepts the model name as an argument:
python3 deploy_model_arg.py Qwen/Qwen2.5-1.5B-Instruct
python3 deploy_model_arg.py Qwen/Qwen2.5-7B-Instruct
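For reference, the following is a minimal sketch of what such a deployment could look like using the SageMaker Hugging Face LLM (TGI) container; the container settings, environment variables, and endpoint name are assumptions for illustration rather than the exact contents of the repository script.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# TGI container image for hosting open-weight instruction-tuned models
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "Qwen/Qwen2.5-1.5B-Instruct",  # swap in Qwen/Qwen2.5-7B-Instruct for the second endpoint
        "SM_NUM_GPUS": "4",                           # assumed setting for the 4 GPUs on ml.g5.12xlarge
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="qwen25-15b-instruct-endpoint",     # assumed to match the endpoint names used later
)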
We have also included the ability to test both of these deployed models. When you have deployed the models, you can move on to creating the evaluation data for the rubric-based Amazon Nova LLM-as-a-judge.
Prepare dataset
To create a realistic evaluation dataset for comparing the Qwen models, we used SQuAD, a widely adopted benchmark in natural language understanding distributed under the CC BY-SA 4.0 license. SQuAD consists of thousands of crowd-sourced question-answer pairs covering a diverse range of Wikipedia articles. By sampling from this dataset, we made sure that our evaluation prompts reflected high-quality, factual question-answering tasks representative of real-world applications.
We began by loading a small subset of examples to keep the workflow fast and reproducible. Specifically, we used the Hugging Face datasets library to download and load the first 20 examples from the SQuAD training split:
from datasets import load_dataset
squad = load_dataset("squad", split="train[:20]")
This command retrieves a slice of the full dataset, containing 20 entries with structured fields including context, question, and answers. To verify the contents and inspect an example, we printed out a sample question and its ground truth answer:
print(squad[3]["question"])
print(squad[3]["answers"]["text"][0])
For the evaluation set, we selected the first six questions from this subset:
questions = [squad[i]["question"] for i in range(6)]
Generate evaluation dataset
After preparing a set of evaluation questions from SQuAD, we generated outputs from both Qwen2.5 models and assembled them into a structured dataset to be used by the Amazon Nova rubric-based LLM-as-a-judge workflow. This dataset serves as the core input for SageMaker AI evaluation recipes. To do this, we iterated over each question prompt and invoked the generation function for both SageMaker endpoints:
- generate_response("qwen25-15b-instruct-endpoint", q) for completions from the Qwen2.5 1.5B Instruct model
- generate_response("qwen25-7b-instruct-endpoint", q) for completions from the Qwen2.5 7B Instruct model
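The generate_response helper is defined in the notebook; the following is a minimal sketch of what an equivalent helper could look like for a TGI-backed SageMaker endpoint, with the payload format and generation parameters shown here being assumptions.

import json
import boto3

smr = boto3.client("sagemaker-runtime")

def generate_response(endpoint_name: str, prompt: str) -> str:
    # Invoke a SageMaker real-time endpoint and return the generated text
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 512, "temperature": 0.7},
    }
    response = smr.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    # TGI-style containers return a list of dicts with a "generated_text" field
    return result[0]["generated_text"]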
For each prompt, the workflow attempted to generate a response from each model. The following code calls two different versions of the Qwen2.5 model. This allows the LLM judge to later determine if the larger model provides significantly better accuracy or if the smaller model is sufficient for the task.
import json

# Define the output file path for the LLM judge dataset
output_path = "llm_judge.jsonl"

with open(output_path, "w") as f:
    for q in questions:
        try:
            # Generate response from Model A (1.5B parameter model)
            response_a = generate_response("qwen25-15b-instruct-endpoint", q)
        except Exception as e:
            # Fallback error message if the API call fails
            response_a = f"[Qwen2.5 generation failed: {e}]"
        try:
            # Generate response from Model B (7B parameter model)
            response_b = generate_response("qwen25-7b-instruct-endpoint", q)
        except Exception as e:
            # Fallback error message if the API call fails
            response_b = f"[qwen25-7b generation failed: {e}]"
        # Construct a dictionary containing the prompt and both model responses
        row = {
            "prompt": q,
            "response_A": response_a,
            "response_B": response_b
        }
        # Write the record to the JSONL file as a single line
        f.write(json.dumps(row) + "\n")

print(f"JSONL file created at: {output_path}")
This workflow produced a JSON Lines file named llm_judge.jsonl. Each line contains a single evaluation record structured as follows:
{
    "prompt": "What is the capital of France?",
    "response_A": "The capital of France is Paris.",
    "response_B": "Paris is the capital city of France."
}
Then, we uploaded the llm_judge.jsonl to an S3 bucket:
upload_to_s3(
    "llm_judge.jsonl",
    "s3://<your-bucket>/datasets/byo-datasets-dev/custom-llm-judge/llm_judge.jsonl"
)
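The upload_to_s3 helper comes from the notebook; a minimal equivalent using boto3 might look like the following, where the simple URI parsing is an illustration.

import boto3
from urllib.parse import urlparse

def upload_to_s3(local_path: str, s3_uri: str) -> None:
    # Split "s3://bucket/key" into bucket and key, then upload the local file
    parsed = urlparse(s3_uri)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    boto3.client("s3").upload_file(local_path, bucket, key)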
Launch Amazon Nova rubric-based LLM-as-a-judge evaluation job
After preparing the dataset and creating the evaluation recipe, the final step is to launch the SageMaker training job that performs the Amazon Nova rubric-based LLM-as-a-judge evaluation. In this workflow, the training job acts as a fully managed, self-contained process that loads the judge model, processes the comparison dataset, applies dynamically generated rubrics, and generates comprehensive evaluation metrics in your designated Amazon S3 location. We use the PyTorch estimator class from the SageMaker Python SDK to encapsulate the configuration for the evaluation run. The estimator defines the compute resources, container image, evaluation recipe, and output paths for storing results:
estimator = PyTorch(
output_path=output_s3_uri,
base_job_name=job_name,
role=role,
instance_type=instance_type,
training_recipe=recipe_path,
sagemaker_session=sagemaker_session,
image_uri=image_uri,
disable_profiler=True,
debugger_hook_config=False,
)
After the estimator is configured, you initiate the evaluation job using the fit() method. This call submits the job to the SageMaker control plane, provisions the compute cluster (ml.g5.12xlarge instances), and begins processing your evaluation dataset:
estimator.fit(inputs={"train": evalInput})
The job will execute the rubric-based comparison, with the Amazon Nova judge model dynamically generating evaluation criteria and scoring both Qwen2.5 model outputs. Results, including per-criterion scores, justifications, and comparative assessments, are automatically saved to your specified S3 output path for downstream analysis and visualization.
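After the job completes, the per-sample details can be pulled down for inspection. The following is a minimal sketch that assumes the detailed Parquet output is readable with pandas; the S3 key is a placeholder, and the column names follow the metrics described earlier rather than a confirmed schema.

import pandas as pd

# Placeholder URI: point this at the detailed Parquet file under the job's S3 output prefix
details_uri = "s3://<your-bucket>/<job-output-prefix>/details.parquet"

# Reading s3:// paths directly requires s3fs; alternatively, download the file with boto3 first
details = pd.read_parquet(details_uri)
print(details[["preference", "weighted_score_A", "weighted_score_B", "score_margin"]].head())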
Results from Amazon Nova rubric-based LLM-as-a-judge evaluation job
The following is an example result for a row of the evaluation. In this example, Assistant B is the clear winner because it prioritizes grounded, nuanced information over Assistant A’s suspiciously specific but unverified claim of 145 newspapers. The judge penalizes Assistant A for its lack of context, resulting in significantly lower scores for accuracy and completeness. By applying a custom weight that allocates 50% of the total score to accuracy, the evaluation calculates a weighted margin that quantifies precisely why Assistant B’s detailed, verifiable response is superior.
================================================================================
Row 0:
Preference: ['B>A']
A wins: 0.0
B wins: 1.0
Weighted A: 0.175
Weighted B: 0.875
Margin: -0.700
Overall Justification:
Assistant B’s response is more accurate and complete as it provides specific examples of student publications and acknowledges the variability in the number of publications. Assistant A’s response, while providing a specific number, lacks context and explanation, making it less useful for understanding the situation.
Criteria:
accuracy:
Score A: 2, Score B: 4
Weight: 0.5, Type: scale
Description: How accurate the information provided is regarding the number of student newspapers at Notre Dame.
Justification A: Assistant A provides a specific number (145) but does not offer any context or explanation for this number, making it difficult to assess its accuracy.
Justification B: Assistant B provides a more nuanced answer, stating that there are at least three significant student publications but acknowledges that the number can vary. This response is more accurate given the dynamic nature of student publications.
completeness:
Score A: 1, Score B: 5
Weight: 0.3, Type: scale
Description: How complete the response is in providing information about student newspapers at Notre Dame.
Justification A: Assistant A’s response is incomplete as it does not provide any context or examples of student newspapers at Notre Dame.
Justification B: Assistant B’s response is more complete as it provides examples of well-known student publications and acknowledges the variability in the number of publications.
clarity:
Score A: 2, Score B: 5
Weight: 0.2, Type: scale
Description: How clear and understandable the response is.
Justification A: Assistant A’s response is clear in providing a number but lacks clarity in explaining what this number represents.
Justification B: Assistant B’s response is clear and understandable, providing examples and context to help the reader understand the number of student publications.
As in the post Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI, to help practitioners quickly interpret the outcome of an Amazon Nova rubric-based LLM-as-a-judge evaluation, we created a convenience function that produces a single, comprehensive visualization summarizing key metrics, as shown in the following screenshot.
This function, plot_nova_judge_results, uses Matplotlib and Seaborn to render an image with six panels, each highlighting a different perspective of the evaluation outcome.
This function takes the evaluation metrics dictionary produced when the evaluation job is complete and generates the following visual components:
- Score distribution bar chart – Shows how many times Model A was preferred (three wins), how many times Model B was preferred (seven wins), how many ties occurred, and how often the judge failed to produce a decision (one inference error out of 11 evaluations). This provides an immediate sense of how decisive the evaluation was, clearly showing Model B’s dominance with a 70% preference rate.
- Win rate with 95% confidence interval – Plots Model B’s overall win rate of 70% against Model A, including an error bar reflecting the confidence interval bounds of [0.400, 0.909]. A vertical reference line at 50% marks the point of no preference. Because the confidence interval doesn’t cross this line, we can conclude the result is statistically significant, indicating meaningful superiority for the 7B model.
- Preference pie chart – Visually displays the proportion of preferences among the 10 valid judgments: 70% for Model B and 30% for Model A. This can help users quickly understand the clear preference distribution favoring the larger model.
- A vs. B score comparison bar chart – Compares the raw counts of preferences for each model side by side (three for Model A vs seven for Model B). A clear label annotates the margin of difference, emphasizing Model B’s four-win advantage. The chart also displays the weighted rubric-based scores: Model A averaged 0.495 whereas Model B averaged 0.630 across all evaluation criteria (accuracy, completeness, clarity), with an average margin of -0.135 favoring Model B.
- Win rate gauge – Depicts the 70% win rate as a semicircular gauge with a needle pointing to Model B’s performance relative to the theoretical 0–100% range. This intuitive visualization helps nontechnical stakeholders immediately grasp that Model B outperformed Model A by a substantial margin based on dynamically generated rubric criteria tailored to each question-answer pair.
- Summary statistics table – Compiles numerical metrics into a compact, clean table: 11 total evaluations, one error (9.1% error rate), 70% win rate, weighted rubric scores (0.630 for B vs 0.495 for A with -0.135 margin), and confidence intervals [0.400, 0.909]. This makes it straightforward to reference the exact numeric values behind the plots and understand both the statistical rigor and rubric-based assessment of the evaluation.
Because the function outputs a standard Matplotlib figure, you can quickly save the image, display it in Jupyter notebooks, or embed it in other documentation. The visualization clearly demonstrates that Model B shows statistically significant superiority overall with higher rubric-based scores across accuracy, completeness, and clarity dimensions.
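A hypothetical usage pattern, assuming plot_nova_judge_results returns the Matplotlib figure it renders and that metrics is the evaluation metrics dictionary parsed from the job output:

fig = plot_nova_judge_results(metrics)
fig.savefig("nova_judge_summary.png", dpi=150, bbox_inches="tight")  # standard Matplotlib figure, so savefig works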
Clean up
To stop and delete the SageMaker Studio spaces, follow the cleanup steps in the SageMaker Studio documentation. You must delete the S3 bucket and the hosted model endpoints to stop incurring costs. You can delete the real-time endpoints you created using the SageMaker console. For instructions, see Delete Endpoints and Resources.
Conclusion
Evaluating generative AI outputs at scale requires more than simple preference labels; it requires transparency into why one response outperforms another. The Amazon Nova rubric-based LLM judge addresses this need by dynamically generating task-specific evaluation criteria, providing per-criterion scores with explicit justifications, and delivering well-calibrated confidence signals. Compared to previous judge implementations, the rubric-based approach offers three key advantages: interpretability through structured YAML output with criterion-level breakdowns, flexibility that lets users reweight or filter criteria for their specific use cases, and improved accuracy with significant gains across standard benchmarks—including a 49% relative improvement on the complex evaluation scenarios in JudgeBench. Whether you are selecting model checkpoints during development, filtering training data for quality, or debugging production model behavior at scale, the Amazon Nova rubric-based LLM-as-a-judge evaluation transforms opaque preference decisions into actionable insights. By exposing the reasoning behind each judgment, teams can identify systematic weaknesses, validate that evaluations align with their quality priorities, and build greater trust in automated evaluation pipelines.
To get started with the Amazon Nova rubric-based LLM judge on SageMaker AI, refer to Rubric Based Judge.
About the authors
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions leveraging state-of-the-art foundation models. He has extensive experience working with advanced language models including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.
Joseph Moulton is a Software Engineer on the Amazon AGI Customization team supporting the implementation of evaluation and inference workflows for AWS Nova Forge. Current work focuses on developing and implementing new strategies for customers to evaluate their custom trained Nova models. He has been with the company as a software engineer for 4 years, joining the Alexa AI Machine Learning platform team in 2022 before transitioning to the Nova Forge team in 2025. In his free time he enjoys golfing and building computers.
Morteza Ziyadi is a senior science lead and manager at Amazon AGI, where he leads several projects on post-training recipes and (multimodal) large language models in the Amazon AGI Foundation modeling team. Before joining Amazon AGI, he spent four years at Microsoft Cloud and AI, where he led projects focused on developing natural language-to-code generation models for various products. He has also served as an adjunct faculty member at Northeastern University. He earned his PhD from the University of Southern California (USC) in 2017 and has since been actively involved as a workshop organizer and reviewer for numerous NLP, computer vision, and machine learning conferences.
Rajkumar Pujari is an Applied Scientist II on the Nova Models post-training team at Amazon AGI. He obtained his Ph.D. in Computer Science from Purdue University, specializing in Machine Learning for Computational Social Science. Currently, his work focuses on post-training and reinforcement learning for Large Language Models. He develops large-scale, dynamic evaluation pipelines for frontier models and builds LLM-as-a-Judge frameworks.
Swastik Roy is a Senior Applied Scientist on Amazon’s AGI Foundation team, specializing in generalizability research and post-training of the Amazon Nova family of models. His expertise spans fine-tuning, reinforcement learning, and evaluation methodologies, where he drives efforts to advance the robustness of foundational AI systems.
Joel Catapano is a Senior Applied Scientist on the Amazon AGI foundation modeling team. He primarily works on developing novel approaches for improving the LLM-as-a-Judge capability of the Nova family of models.
Mona Mona is a Sr World Wide Gen AI Specialist Solutions Architect focusing on Gen AI Solutions in Amazon SageMaker AI team. She was a Lead Generative AI specialist in Google before joining Amazon. She is a published author of two books – Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 20+ blogs on AI/ML and cloud technology and a co-author on a research paper on CORD19 Neural Search which won an award for Best Research Paper at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.
Pradeep Natarajan is a Senior Principal Scientist in Amazon AGI Foundation modeling team working on post-training recipes and Multimodal large language models. He has 20+ years of experience in developing and launching multiple large-scale machine learning systems. He has a PhD in Computer Science from University of Southern California.

