Nova Forge SDK series part 2: Practical guide to fine-tune Nova models using data mixing capabilities

This hands-on guide walks through every step of fine-tuning an Amazon Nova model with the Amazon Nova Forge SDK, from data preparation to training with data mixing to evaluation, giving you a repeatable playbook you can adapt to your own use case. This is the second part in our Nova Forge SDK series, building on the SDK introduction and first part, which covered kicking off customization experiments.

The focus of this post is data mixing: the technique that lets you fine-tune on domain-specific data without sacrificing a model’s general capabilities. In the previous post, we made the case for why this matters, blending customer data with Amazon-curated datasets preserved near-baseline Massive Multitask Language Understanding (MMLU) scores while delivering a 12-point F1 improvement on a Voice of Customer classification task spanning 1,420 leaf categories. By contrast, fine-tuning an open-source model on customer data alone caused a near-total loss of general capabilities. Now we show you how to do it yourself.

Solution overview

The workflow consists of five stages:

Environment setup – Install the Nova Forge SDK and configure AWS resources
Data preparation – Load, sanitize, transform, validate, and split your training data
Training configuration – Configure the Amazon SageMaker HyperPod runtime, MLflow tracking, and data mixing ratios
Model training – Launch and monitor a supervised fine-tuning job with Low-Rank Adaptation (LoRA)
Model evaluation – Run public benchmarks and domain-specific evaluations against the fine-tuned checkpoint

Prerequisites

Before you begin, make sure you have the following:

An AWS account with access to Amazon Nova Forge
A SageMaker HyperPod cluster provisioned with GPU instances. This walkthrough uses `ml.p5.48xlarge` instances. Setting up a HyperPod cluster involves configuring an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, provisioning compute nodes, and creating execution roles. For detailed instructions, see Getting started with SageMaker HyperPod.
An Amazon SageMaker MLflow application for experiment tracking
An IAM role with permissions for SageMaker, Amazon Simple Storage Service (Amazon S3), and Amazon CloudWatch
A SageMaker Studio notebook or similar Jupyter environment

Cost consideration: This walkthrough uses 4 `ml.p5.48xlarge` instances for training and for evaluation. These are high-end GPU instances. We recommend starting with a short test run (max_steps=5) to validate your configuration before committing to a full training run. For current rates, see the Amazon SageMaker pricing page.

Step 1: Install the Nova Forge SDK and dependencies

The SDK requires the SageMaker HyperPod CLI tooling. Download and install it from the Nova Forge S3 distribution bucket (provided during your Nova Forge onboarding) or use the following easy-to-use installer script that installs the dependencies from the private S3 bucket and sets up a virtual environment.

# Download the HyperPod CLI Installer from Github (Only applicable for Forge)

curl –O https://github.com/aws-samples/amazon-nova-samples/blob/main/customization/nova-forge-hyperpod-cli-installation/install_hp_cli.sh

# Run the Installer

bash install_hp_cli.sh

Next, within the same virtual environment, also install the Nova Forge SDK (nova-forge-sdk) which provides the high-level APIs for data preparation, training, and evaluation.

pip install –upgrade botocore awscli
pip install amzn-nova-forge
pip install datasets huggingface_hub pandas pyarrow

After all dependencies are installed, activate the virtual environment and set it as a kernel for use within a Jupyter notebook environment.

source ~/hyperpod-cli-venv/bin/activate
pip install ipykernel
python -m ipykernel install –user –name=hyperpod-cli-venv —
display-name=”Forge (hyperpod-cli-venv)”
jupyter kernelspec list

Verify the installation:

from amzn_nova_forge import *
print(“SDK imported successfully”)

Step 2: Configure AWS resources

Create an S3 bucket for your training data and model outputs. Then, grant your HyperPod execution role access to it.

import boto3
import time
import json

TIMESTAMP = int(time.time())
S3_BUCKET = f”nova-forge-customisation-{TIMESTAMP}”
S3_DATA_PATH = f”s3://{S3_BUCKET}/demo/input”
S3_OUTPUT_PATH = f”s3://{S3_BUCKET}/demo/output”

sts = boto3.client(“sts”)
s3 = boto3.client(“s3”)
ACCOUNT_ID = sts.get_caller_identity()[“Account”]
REGION = boto3.session.Session().region_name

# Create the S3 bucket
if REGION == “us-east-1”:
s3.create_bucket(Bucket=S3_BUCKET)
else:
s3.create_bucket(
Bucket=S3_BUCKET,
CreateBucketConfiguration={“LocationConstraint”: REGION}
)

# Grant HyperPod execution role access
HYPERPOD_ROLE_ARN = f”arn:aws:iam::{ACCOUNT_ID}:role/

“bucket_policy = {
“Version”: “2012-10-17”,
“Statement”: [{
“Sid”: “AllowHyperPodAccess”,
“Effect”: “Allow”,
“Principal”: {“AWS”: HYPERPOD_ROLE_ARN},
“Action”: [“s3:GetObject”, “s3:PutObject”, “s3:DeleteObject”, “s3:ListBucket”],
“Resource”: [
f”arn:aws:s3:::{S3_BUCKET}”,
f”arn:aws:s3:::{S3_BUCKET}/*”
]
}]
}

s3.put_bucket_policy(Bucket=S3_BUCKET, Policy=json.dumps(bucket_policy))

Step 3: Prepare your training dataset

The Nova Forge SDK supports JSONL, JSON, and CSV input formats. In this walkthrough, we use the publicly available MedReason dataset from Hugging Face. The dataset contains medical reasoning with approximately 32,700 question-answer pairs to demonstrate fine-tuning for a domain-specific use case.

Download and sanitize the data

The Nova Forge SDK enforces token-level validation on training data. Certain tokens conflict with the model’s internal chat template, specifically the special delimiters Nova uses to separate system, user, and assistant turns during training. If your data contains literal strings like `System:` or `Assistant:`, the model may misinterpret them as turn boundaries, corrupting the training signal. The sanitization step below inserts a space before the colon (e.g., System: → System 🙂 to break the pattern match while preserving readability, and strips special tokens like [EOS] and that have reserved meaning in the model’s vocabulary.

from huggingface_hub import hf_hub_download

import pandas as pd

import json

import re

# Download the dataset

jsonl_path = hf_hub_download(

repo_id=”UCSC-VLAA/MedReason”,

filename=”ours_quality_33000.jsonl”,

repo_type=”dataset”,

local_dir=”.”

)

df = pd.read_json(jsonl_path, lines=True)

# Tokens that conflict with the model’s chat template

INVALID_TOKENS = [

“System:”, “SYSTEM:”, “User:”, “USER:”, “Bot:”, “BOT:”,

“Assistant:”, “ASSISTANT:”, “Thought:”, “[EOS]”,

“”, “”, “”,

]

def sanitize_text(text):

for token in INVALID_TOKENS:

if “:” in token:

word = token[:-1]

text = re.sub(rf’\b{word}:’, f'{word} :’, text, flags=re.IGNORECASE)

else:

text = text.replace(token, “”)

return text.strip()

# Write sanitized JSONL

with open(“training_data.jsonl”, “w”) as f:

for _, row in df.iterrows():

f.write(json.dumps({

“question”: sanitize_text(row[“question”]),

“answer”: sanitize_text(row[“answer”]),

}) + “\n”)

print(f”Dataset saved: training_data.jsonl ({len(df)} examples)”)

To validate if your data has any of reserved keyword run this script.

Load, transform, and validate with the SDK

The SDK provides a JSONLDatasetLoader that handles the conversion from your raw data format into the structure expected by Nova models.When you call transform(), the SDK wraps each question-answer pair into the Nova chat template format, which is the structured turn-based format that Nova models expect during training. Your raw data goes from simple Q&A pairs to fully formatted multi-turn conversations with the appropriate role tags and delimiters.

Before transform (your raw JSONL):

{
“question”: “What are the causes of chest pain in a 45-year-old patient?”,
“answer”: “Chest pain in a 45-year-old can result from cardiac causes such as…”
}

After transform (Nova chat template format):

{
“messages”: [
{“role”: “user”, “content”: “What are the causes of chest pain in a 45-year-old patient?”},
{“role”: “assistant”, “content”: “Chest pain in a 45-year-old can result from cardiac causes such as…”}
]
}

The validate() method then checks the transformed data for issues, verifying that the chat template structure is correct, that no invalid tokens remain, and that the data conforms to the requirements for your chosen model and training method.

# Initialize the loader, mapping your column names
loader = JSONLDatasetLoader(
question=”question”,
answer=”answer”,
)

loader.load(“training_data.jsonl”)

# Preview raw data
loader.show(n=3)

# Transform into Nova’s expected chat template format
loader.transform(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)

# Preview transformed data to verify the structure
loader.show(n=3)

# Validate — prints “Validation completed” if successful
loader.validate(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)

train_path = loader.save(f”{S3_DATA_PATH}/train.jsonl”)
print(f”Training data: {train_path}”)

Step 4: Configure and launch training with data mixing

When you enable data mixing, Nova Forge automatically blends your domain-specific training data with Amazon-curated datasets during fine-tuning. This prevents the model from forgetting its general capabilities while it learns your domain.

A note on training methods: LoRA vs. full-rank SFT

Nova Forge supports multiple fine-tuning approaches. In this walkthrough, we use supervised fine-tuning (SFT) with LoRA (TrainingMethod.SFT_LORA), which is a parameter-efficient method that updates only a small set of low-rank adapter weights rather than all model parameters. LoRA offers faster training, lower compute costs, and is the recommended starting point for most use cases.

Nova Forge also supports full-rank SFT, which updates all model parameters and can incorporate more domain knowledge. However, it requires more compute and is more susceptible to catastrophic forgetting (making data mixing even more important). The previous post in this series demonstrates results using full-rank SFT. Choose full-rank when LoRA doesn’t achieve sufficient domain performance, or when you need deeper model adaptation.

Configure the runtime and MLflow

from amzn_nova_customization_sdk.model.model_enums import Platform

cluster_name = “nova-forge-hyperpod”
instance_type = “ml.p5.48xlarge”
instance_count = 4
namespace = “kubeflow”

runtime = SMHPRuntimeManager(
instance_type=instance_type,
instance_count=instance_count,
cluster_name=cluster_name,
namespace=namespace,
)

MLFLOW_APP_ID = “” # e.g., “app-XXXXXXXXXXXX”
mlflow_app_arn = f”arn:aws:sagemaker:{REGION}:{ACCOUNT_ID}:mlflow-app/{MLFLOW_APP_ID}”

mlflow_monitor = MLflowMonitor(
tracking_uri=mlflow_app_arn,
experiment_name=”nova-sft-datamix”,
)

Create the customizer with data mixing enabled

Pass data_mixing_enabled=True when constructing the NovaModelCustomizer:

customizer = NovaModelCustomizer(
model=Model.NOVA_LITE_2,
method=TrainingMethod.SFT_LORA,
infra=runtime,
data_s3_path=f”{S3_DATA_PATH}/train.jsonl”,
output_s3_path=f”{S3_OUTPUT_PATH}/”,
mlflow_monitor=mlflow_monitor,
data_mixing_enabled=True,
)

Understand and tune the data mixing configuration

Data mixing controls how training batches are composed. The customer_data_percent parameter determines what fraction of each batch comes from your domain data. The remaining fraction is filled by Nova-curated datasets, with each nova_*_percent parameter controlling the relative weight of that capability category within the Nova portion.

For example, with the configuration below:

50% of each training batch consists of your domain data
50% consists of Nova-curated data, distributed across capability categories according to their relative weights

The Nova-side percentages must sum to 100. Each value represents that category’s share of the Nova-curated portion of the batch.

# View the default mixing ratios
customizer.get_data_mixing_config()

You can override these ratios based on your priorities:

customizer.set_data_mixing_config({
“customer_data_percent”: 50,
“nova_agents_percent”: 1,
“nova_baseline_percent”: 10,
“nova_chat_percent”: 0.5,
“nova_factuality_percent”: 0.1,
“nova_identity_percent”: 1,
“nova_long-context_percent”: 1,
“nova_math_percent”: 2,
“nova_rai_percent”: 1,
“nova_instruction-following_percent”: 13,
“nova_stem_percent”: 10.5,
“nova_planning_percent”: 10,
“nova_reasoning-chat_percent”: 0.5,
“nova_reasoning-code_percent”: 0.5,
“nova_reasoning-factuality_percent”: 0.5,
“nova_reasoning-instruction-following_percent”: 45,
“nova_reasoning-math_percent”: 0.5,
“nova_reasoning-planning_percent”: 0.5,
“nova_reasoning-rag_percent”: 0.4,
“nova_reasoning-rai_percent”: 0.5,
“nova_reasoning-stem_percent”: 0.4,
“nova_rag_percent”: 1,
“nova_translation_percent”: 0.1,
})

How to think about tuning the mix

Parameter
What it controls
Guidance

customer_data_percent
Share of your domain data in each training batch.
Higher values drive stronger domain specialization but increase forgetting risk. 50% is a balanced starting point.

nova_instruction-following_percent
Weight of instruction-following examples in the Nova portion.
Keep this high if your model needs to follow structured prompts or output formats in production.

nova_reasoning-*_percent
Weights for various reasoning capabilities (math, code, planning, etc.).
Increase these if your downstream tasks require multi-step reasoning.

nova_rai_percent
Responsible AI alignment data.
Always keep this non-zero to preserve safety behaviors.

nova_baseline_percent
Core factual knowledge.
Helps retain broad world knowledge.

Tip: Start with the defaults, run a training job, evaluate on both your domain task and MMLU, then iterate. The Building specialized AI without sacrificing intelligence post shows that even a 75/25 customer-to-Nova split preserves near-baseline MMLU (0.74 vs. 0.75 baseline) while delivering a 12-point F1 improvement on a complex classification task.

Launch the training job

The overrides parameter lets you control key training hyperparameters:

Parameter
Description
Guidance

lr
Learning rate
1e-5 is a reasonable default for LoRA fine-tuning.

warmup_steps
Steps to linearly ramp up learning rate from 0
Typically 5–10% of total steps. Set proportionally to max_steps.

global_batch_size
Number of examples per gradient update across all GPUs
Larger batches give more stable gradients but use more memory.

max_length
Maximum sequence length in tokens
Set based on your data. 65536 supports long-context use cases; reduce for shorter data to save memory and speed up training.

max_steps
Total training steps
Start small (5–10) to validate your setup, then increase. For ~23k training examples with batch size 32, one full epoch ≈ 720 steps.

training_config = {
“lr”: 1e-5,
“warmup_steps”: 2,
“global_batch_size”: 32,
“max_length”: 65536,
“max_steps”: 5, # Start small to validate; increase for production runs
}

training_result = customizer.train(
job_name=”nova-forge-sft-datamix”,
overrides=training_config,
)

training_result.dump(“training_result.json”)
print(“Training result saved”)

Monitor training progress

You can monitor the job through the SDK or CloudWatch:

# Check job status
print(training_result.get_job_status())

# Stream recent logs
customizer.get_logs(limit=50, start_from_head=False)

# Or use the CloudWatch monitor
monitor = CloudWatchLogMonitor.from_job_result(training_result)
monitor.show_logs(limit=10)

# Poll until completion
import time
while training_result.get_job_status()[1] == “Running”:
time.sleep(60)

Training metrics (loss curves, learning rate schedule) are also available in your MLflow experiment for visualization and comparison across runs.

Step 5: Evaluate the fine-tuned model

Evaluation is critical when you use data mixing because you need to measure two things simultaneously: whether your model improved on your domain task, and whether it retained its general capabilities. If you measure only one axis, you can’t tell if the mix is working.After training completes, retrieve the model checkpoint location from the output manifest:

from amzn_nova_forge.util.checkpoint_util import extract_checkpoint_path_from_job_output

checkpoint_path = extract_checkpoint_path_from_job_output(
output_s3_path=training_result.model_artifacts.output_s3_path,
job_result=training_result,
)

Configure the evaluation infrastructure

Evaluation requires only a single GPU instance (compared to 4 for training):

eval_infra = SMHPRuntimeManager(
instance_type=instance_type,
instance_count=1,
cluster_name=cluster_name,
namespace=namespace,
)

eval_mlflow = MLflowMonitor(
tracking_uri=mlflow_app_arn,
experiment_name=”nova-forge-eval”,
)

evaluator = NovaModelCustomizer(
model=Model.NOVA_LITE_2,
method=TrainingMethod.EVALUATION,
infra=eval_infra,
output_s3_path=f”s3://{S3_BUCKET}/demo/eval-outputs/”,
mlflow_monitor=eval_mlflow,
)

Run evaluations

Nova Forge supports three complementary evaluation approaches:

1. Public benchmarks (used to measure general capability retention)

These tell you whether data mixing is doing its job. If MMLU drops significantly from the baseline, your mix needs more Nova data. If IFEval drops, increase the instruction-following weight.

# MMLU — broad knowledge and reasoning across 57 subjects
mmlu_result = evaluator.evaluate(
job_name=”eval-mmlu”,
eval_task=EvaluationTask.MMLU,
model_path=checkpoint_path,
)

# IFEval — ability to follow structured instructions
ifeval_result = evaluator.evaluate(
job_name=”eval-ifeval”,
eval_task=EvaluationTask.IFEVAL,
model_path=checkpoint_path,
)

2. Bring-your-own-data (measure domain-specific performance)

Use your held-out test set to measure whether fine-tuning improved performance on your actual task:

byod_result = evaluator.evaluate(
job_name=”eval-byod”,
eval_task=EvaluationTask.GEN_QA,
data_s3_path=f”s3://{S3_DATA_PATH}/eval/gen_qa.jsonl”,
model_path=checkpoint_path,
overrides={“max_new_tokens”: 2048},
)

3. Large language model (LLM) as judge (for domains where automated metrics fall short, you can use another LLM to assess response quality)

Check results and retrieve outputs

# Check job status
print(mmlu_result.get_job_status())
print(ifeval_result.get_job_status())
print(byod_result.get_job_status())

# Retrieve the S3 path containing detailed evaluation results
print(mmlu_result.eval_output_path)

The evaluation output path contains the detailed results as JSON. Download and inspect them to get the actual scores.

Additionally, metrics can be published to MLflow tracking servers by supplying the tracking server URI at job creation. With this approach, you can record and store your metrics for comparing experiments.

Interpreting your results

Use the following decision framework to guide your next iteration:

Observation
What it means
What to adjust

MMLU close to baseline (e.g., within 0.01–0.02)
Data mixing is successfully preventing catastrophic forgetting
Your mix is working — focus on domain performance

MMLU significantly degraded
The model is forgetting general capabilities
Decrease customer_data_percent or increase Nova data weights

Domain task performance below expectations
The model isn’t learning enough from your data
Increase customer_data_percent, add more training data, or increase max_steps

IFEval degraded
The model is losing instruction-following ability
Increase nova_instruction-following_percent

Both MMLU and domain task improved
Ideal outcome
Document your configuration and promote to production

As a reference point, this post reports these results for Amazon Nova 2 Lite on a VOC classification task:

The key takeaway is that fine-tuning with only customer data boosts Domain F1 but significantly reduces general intelligence (MMLU drops from 0.75 to 0.47), while the blended approach (75% customer + 25% Nova data) recovers nearly all the MMLU accuracy while still improving domain performance.

Best practices

Start with the default mixing ratios. The defaults are tuned for a balanced trade-off. Only customize after you have baseline evaluation results to compare against.
Always evaluate on both axes. Run at least one public benchmark (MMLU) alongside your domain-specific evaluation. Without both, you can’t tell if the mix is working.
Use MLflow to compare experiments. When iterating on mixing ratios and hyperparameters, MLflow makes it straightforward to compare runs side-by-side and identify the best configuration.
Iterate on the mix, not just hyperparameters. If your model is forgetting general capabilities, adjusting the data mix is often more effective than tuning learning rate or batch size.
Start with LoRA, move to full-rank if needed. LoRA is faster and cheaper. Only move to full-rank SFT if LoRA doesn’t achieve sufficient domain adaptation for your use case.

Cleaning up

To avoid ongoing charges, clean up the resources created during this walkthrough:

Delete the S3 bucket and its contents.
Stop or delete the SageMaker HyperPod cluster if it was created for this exercise.
Delete the MLflow application if no longer needed.

Conclusion

In this post, we walked through the end-to-end workflow for fine-tuning Amazon Nova models using the Nova Forge SDK with data mixing enabled. The SDK handles data preparation, training orchestration on SageMaker HyperPod, and multi-dimensional evaluation, so you can focus on your data and your domain.Data mixing is what makes fine-tuning practical for production. Rather than choosing between domain expertise and general intelligence, you get both. The key is to treat it as an iterative process: train, evaluate on both axes, adjust the mix, and repeat until you find the right balance for your use case.

To get started, see the Nova Forge Developer Guide for detailed documentation, and explore the Nova Forge SDK for the full API reference.

About the authors

Gideon Teo is a FSI Solution Architect at AWS in Melbourne, specialising in Amazon SageMaker AI and Amazon Bedrock. Passionate about both traditional AI/ML and Generative AI, he helps financial institutions solve complex business challenges with cutting-edge technologies. Outside work, he enjoys time with friends and family, and exploring diverse technology domains.

Andrew Smith is a Sr. Cloud Support Engineer at AWS, based in Sydney, Australia. He specialises in helping customers with AI/ML workloads on AWS with expertise in Amazon SageMaker AI, Amazon Bedrock and LLM inference.

Timothy Downs is a Startup Solutions Architect at AWS in Melbourne who enjoys working at the bleeding edge of tech, usually before it is fully baked.

Krishna Neupane is an Applied Scientist at Amazon’s AGI Customization team, specializing in Nova model customization and data mixing.

What's Hot

Rumors say Honor’s alleged ‘Wide’ foldable competitor has a top-notch battery

Pixel 11 leak reveals double the storage and a price hike you won’t like

Solos debuts an even lighter version of its camera-less smart glasses

Upgrade Your Vacation: The Ultimate Summer 2026 Tech Guide

Amazfit Cheetah 2 Ultra gets its July update with better maps and training data

I was tired of the Android Beta for Pixel glitching out, so I left it — without losing all my data

‘More data than ever’: Android 17 has just what iPhone users need to switch over

Apple Watch owners push back as watchOS 27 drops older models

The Apple Watch Series 11 is back to its best price

Rumors say Honor’s alleged ‘Wide’ foldable competitor has a top-notch battery

Pixel 11 leak reveals double the storage and a price hike you won’t like

Solos debuts an even lighter version of its camera-less smart glasses

Rumors say Honor’s alleged ‘Wide’ foldable competitor has a top-notch battery

Pixel 11 leak reveals double the storage and a price hike you won’t like

Solos debuts an even lighter version of its camera-less smart glasses

Usefull link

categories

What's Hot

Nova Forge SDK series part 2: Practical guide to fine-tune Nova models using data mixing capabilities

Solution overview

Prerequisites

Step 1: Install the Nova Forge SDK and dependencies

Step 2: Configure AWS resources

Step 3: Prepare your training dataset

Download and sanitize the data

Load, transform, and validate with the SDK

Step 4: Configure and launch training with data mixing

A note on training methods: LoRA vs. full-rank SFT

Configure the runtime and MLflow

Create the customizer with data mixing enabled

Understand and tune the data mixing configuration

Launch the training job

Monitor training progress

Step 5: Evaluate the fine-tuned model

Configure the evaluation infrastructure

Run evaluations

Check results and retrieve outputs

Interpreting your results

Best practices

Cleaning up

Conclusion

About the authors

Related Posts

Usefull link

categories