This hands-on guide walks through every step of fine-tuning an Amazon Nova model with the Amazon Nova Forge SDK, from data preparation to training with data mixing to evaluation, giving you a repeatable playbook you can adapt to your own use case. This is the second part in our Nova Forge SDK series, building on the SDK introduction and first part, which covered kicking off customization experiments.
The focus of this post is data mixing: the technique that lets you fine-tune on domain-specific data without sacrificing a model’s general capabilities. In the previous post, we made the case for why this matters, blending customer data with Amazon-curated datasets preserved near-baseline Massive Multitask Language Understanding (MMLU) scores while delivering a 12-point F1 improvement on a Voice of Customer classification task spanning 1,420 leaf categories. By contrast, fine-tuning an open-source model on customer data alone caused a near-total loss of general capabilities. Now we show you how to do it yourself.
Solution overview
The workflow consists of five stages:
- Environment setup – Install the Nova Forge SDK and configure AWS resources
- Data preparation – Load, sanitize, transform, validate, and split your training data
- Training configuration – Configure the Amazon SageMaker HyperPod runtime, MLflow tracking, and data mixing ratios
- Model training – Launch and monitor a supervised fine-tuning job with Low-Rank Adaptation (LoRA)
- Model evaluation – Run public benchmarks and domain-specific evaluations against the fine-tuned checkpoint
Prerequisites
Before you begin, make sure you have the following:
- An AWS account with access to Amazon Nova Forge
- A SageMaker HyperPod cluster provisioned with GPU instances. This walkthrough uses `ml.p5.48xlarge` instances. Setting up a HyperPod cluster involves configuring an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, provisioning compute nodes, and creating execution roles. For detailed instructions, see Getting started with SageMaker HyperPod.
- An Amazon SageMaker MLflow application for experiment tracking
- An IAM role with permissions for SageMaker, Amazon Simple Storage Service (Amazon S3), and Amazon CloudWatch
- A SageMaker Studio notebook or similar Jupyter environment
Cost consideration: This walkthrough uses 4 `ml.p5.48xlarge` instances for training and for evaluation. These are high-end GPU instances. We recommend starting with a short test run (max_steps=5) to validate your configuration before committing to a full training run. For current rates, see the Amazon SageMaker pricing page.
Step 1: Install the Nova Forge SDK and dependencies
The SDK requires the SageMaker HyperPod CLI tooling. Download and install it from the Nova Forge S3 distribution bucket (provided during your Nova Forge onboarding) or use the following easy-to-use installer script that installs the dependencies from the private S3 bucket and sets up a virtual environment.
# Download the HyperPod CLI Installer from Github (Only applicable for Forge)
curl –O https://github.com/aws-samples/amazon-nova-samples/blob/main/customization/nova-forge-hyperpod-cli-installation/install_hp_cli.sh
# Run the Installer
bash install_hp_cli.sh
Next, within the same virtual environment, also install the Nova Forge SDK (nova-forge-sdk) which provides the high-level APIs for data preparation, training, and evaluation.
pip install –upgrade botocore awscli
pip install amzn-nova-forge
pip install datasets huggingface_hub pandas pyarrow
After all dependencies are installed, activate the virtual environment and set it as a kernel for use within a Jupyter notebook environment.
source ~/hyperpod-cli-venv/bin/activate
pip install ipykernel
python -m ipykernel install –user –name=hyperpod-cli-venv —
display-name=”Forge (hyperpod-cli-venv)”
jupyter kernelspec list
Verify the installation:
from amzn_nova_forge import *
print(“SDK imported successfully”)
Step 2: Configure AWS resources
Create an S3 bucket for your training data and model outputs. Then, grant your HyperPod execution role access to it.
import boto3
import time
import json
TIMESTAMP = int(time.time())
S3_BUCKET = f”nova-forge-customisation-{TIMESTAMP}”
S3_DATA_PATH = f”s3://{S3_BUCKET}/demo/input”
S3_OUTPUT_PATH = f”s3://{S3_BUCKET}/demo/output”
sts = boto3.client(“sts”)
s3 = boto3.client(“s3”)
ACCOUNT_ID = sts.get_caller_identity()[“Account”]
REGION = boto3.session.Session().region_name
# Create the S3 bucket
if REGION == “us-east-1”:
s3.create_bucket(Bucket=S3_BUCKET)
else:
s3.create_bucket(
Bucket=S3_BUCKET,
CreateBucketConfiguration={“LocationConstraint”: REGION}
)
# Grant HyperPod execution role access
HYPERPOD_ROLE_ARN = f”arn:aws:iam::{ACCOUNT_ID}:role/
“bucket_policy = {
“Version”: “2012-10-17”,
“Statement”: [{
“Sid”: “AllowHyperPodAccess”,
“Effect”: “Allow”,
“Principal”: {“AWS”: HYPERPOD_ROLE_ARN},
“Action”: [“s3:GetObject”, “s3:PutObject”, “s3:DeleteObject”, “s3:ListBucket”],
“Resource”: [
f”arn:aws:s3:::{S3_BUCKET}”,
f”arn:aws:s3:::{S3_BUCKET}/*”
]
}]
}
s3.put_bucket_policy(Bucket=S3_BUCKET, Policy=json.dumps(bucket_policy))
Step 3: Prepare your training dataset
The Nova Forge SDK supports JSONL, JSON, and CSV input formats. In this walkthrough, we use the publicly available MedReason dataset from Hugging Face. The dataset contains medical reasoning with approximately 32,700 question-answer pairs to demonstrate fine-tuning for a domain-specific use case.
Download and sanitize the data
The Nova Forge SDK enforces token-level validation on training data. Certain tokens conflict with the model’s internal chat template, specifically the special delimiters Nova uses to separate system, user, and assistant turns during training. If your data contains literal strings like `System:` or `Assistant:`, the model may misinterpret them as turn boundaries, corrupting the training signal. The sanitization step below inserts a space before the colon (e.g., System: → System 🙂 to break the pattern match while preserving readability, and strips special tokens like [EOS] and that have reserved meaning in the model’s vocabulary.
from huggingface_hub import hf_hub_download
import pandas as pd
import json
import re
# Download the dataset
jsonl_path = hf_hub_download(
repo_id=”UCSC-VLAA/MedReason”,
filename=”ours_quality_33000.jsonl”,
repo_type=”dataset”,
local_dir=”.”
)
df = pd.read_json(jsonl_path, lines=True)
# Tokens that conflict with the model’s chat template
INVALID_TOKENS = [
“System:”, “SYSTEM:”, “User:”, “USER:”, “Bot:”, “BOT:”,
“Assistant:”, “ASSISTANT:”, “Thought:”, “[EOS]”,
“”, “”, “”,
]
def sanitize_text(text):
for token in INVALID_TOKENS:
if “:” in token:
word = token[:-1]
text = re.sub(rf’\b{word}:’, f'{word} :’, text, flags=re.IGNORECASE)
else:
text = text.replace(token, “”)
return text.strip()
# Write sanitized JSONL
with open(“training_data.jsonl”, “w”) as f:
for _, row in df.iterrows():
f.write(json.dumps({
“question”: sanitize_text(row[“question”]),
“answer”: sanitize_text(row[“answer”]),
}) + “\n”)
print(f”Dataset saved: training_data.jsonl ({len(df)} examples)”)
To validate if your data has any of reserved keyword run this script.
Load, transform, and validate with the SDK
The SDK provides a JSONLDatasetLoader that handles the conversion from your raw data format into the structure expected by Nova models.When you call transform(), the SDK wraps each question-answer pair into the Nova chat template format, which is the structured turn-based format that Nova models expect during training. Your raw data goes from simple Q&A pairs to fully formatted multi-turn conversations with the appropriate role tags and delimiters.
Before transform (your raw JSONL):
{
“question”: “What are the causes of chest pain in a 45-year-old patient?”,
“answer”: “Chest pain in a 45-year-old can result from cardiac causes such as…”
}
After transform (Nova chat template format):
{
“messages”: [
{“role”: “user”, “content”: “What are the causes of chest pain in a 45-year-old patient?”},
{“role”: “assistant”, “content”: “Chest pain in a 45-year-old can result from cardiac causes such as…”}
]
}
The validate() method then checks the transformed data for issues, verifying that the chat template structure is correct, that no invalid tokens remain, and that the data conforms to the requirements for your chosen model and training method.
# Initialize the loader, mapping your column names
loader = JSONLDatasetLoader(
question=”question”,
answer=”answer”,
)
loader.load(“training_data.jsonl”)
# Preview raw data
loader.show(n=3)
# Transform into Nova’s expected chat template format
loader.transform(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)
# Preview transformed data to verify the structure
loader.show(n=3)
# Validate — prints “Validation completed” if successful
loader.validate(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)
train_path = loader.save(f”{S3_DATA_PATH}/train.jsonl”)
print(f”Training data: {train_path}”)
Step 4: Configure and launch training with data mixing
When you enable data mixing, Nova Forge automatically blends your domain-specific training data with Amazon-curated datasets during fine-tuning. This prevents the model from forgetting its general capabilities while it learns your domain.
A note on training methods: LoRA vs. full-rank SFT
Nova Forge supports multiple fine-tuning approaches. In this walkthrough, we use supervised fine-tuning (SFT) with LoRA (TrainingMethod.SFT_LORA), which is a parameter-efficient method that updates only a small set of low-rank adapter weights rather than all model parameters. LoRA offers faster training, lower compute costs, and is the recommended starting point for most use cases.
Nova Forge also supports full-rank SFT, which updates all model parameters and can incorporate more domain knowledge. However, it requires more compute and is more susceptible to catastrophic forgetting (making data mixing even more important). The previous post in this series demonstrates results using full-rank SFT. Choose full-rank when LoRA doesn’t achieve sufficient domain performance, or when you need deeper model adaptation.
Configure the runtime and MLflow
from amzn_nova_customization_sdk.model.model_enums import Platform
cluster_name = “nova-forge-hyperpod”
instance_type = “ml.p5.48xlarge”
instance_count = 4
namespace = “kubeflow”
runtime = SMHPRuntimeManager(
instance_type=instance_type,
instance_count=instance_count,
cluster_name=cluster_name,
namespace=namespace,
)
MLFLOW_APP_ID = “” # e.g., “app-XXXXXXXXXXXX”
mlflow_app_arn = f”arn:aws:sagemaker:{REGION}:{ACCOUNT_ID}:mlflow-app/{MLFLOW_APP_ID}”
mlflow_monitor = MLflowMonitor(
tracking_uri=mlflow_app_arn,
experiment_name=”nova-sft-datamix”,
)
Create the customizer with data mixing enabled
Pass data_mixing_enabled=True when constructing the NovaModelCustomizer:
customizer = NovaModelCustomizer(
model=Model.NOVA_LITE_2,
method=TrainingMethod.SFT_LORA,
infra=runtime,
data_s3_path=f”{S3_DATA_PATH}/train.jsonl”,
output_s3_path=f”{S3_OUTPUT_PATH}/”,
mlflow_monitor=mlflow_monitor,
data_mixing_enabled=True,
)
Understand and tune the data mixing configuration
Data mixing controls how training batches are composed. The customer_data_percent parameter determines what fraction of each batch comes from your domain data. The remaining fraction is filled by Nova-curated datasets, with each nova_*_percent parameter controlling the relative weight of that capability category within the Nova portion.
For example, with the configuration below:
- 50% of each training batch consists of your domain data
- 50% consists of Nova-curated data, distributed across capability categories according to their relative weights
The Nova-side percentages must sum to 100. Each value represents that category’s share of the Nova-curated portion of the batch.
# View the default mixing ratios
customizer.get_data_mixing_config()
You can override these ratios based on your priorities:
customizer.set_data_mixing_config({
“customer_data_percent”: 50,
“nova_agents_percent”: 1,
“nova_baseline_percent”: 10,
“nova_chat_percent”: 0.5,
“nova_factuality_percent”: 0.1,
“nova_identity_percent”: 1,
“nova_long-context_percent”: 1,
“nova_math_percent”: 2,
“nova_rai_percent”: 1,
“nova_instruction-following_percent”: 13,
“nova_stem_percent”: 10.5,
“nova_planning_percent”: 10,
“nova_reasoning-chat_percent”: 0.5,
“nova_reasoning-code_percent”: 0.5,
“nova_reasoning-factuality_percent”: 0.5,
“nova_reasoning-instruction-following_percent”: 45,
“nova_reasoning-math_percent”: 0.5,
“nova_reasoning-planning_percent”: 0.5,
“nova_reasoning-rag_percent”: 0.4,
“nova_reasoning-rai_percent”: 0.5,
“nova_reasoning-stem_percent”: 0.4,
“nova_rag_percent”: 1,
“nova_translation_percent”: 0.1,
})
How to think about tuning the mix
Parameter
What it controls
Guidance
customer_data_percent
Share of your domain data in each training batch.
Higher values drive stronger domain specialization but increase forgetting risk. 50% is a balanced starting point.
nova_instruction-following_percent
Weight of instruction-following examples in the Nova portion.
Keep this high if your model needs to follow structured prompts or output formats in production.
nova_reasoning-*_percent
Weights for various reasoning capabilities (math, code, planning, etc.).
Increase these if your downstream tasks require multi-step reasoning.
nova_rai_percent
Responsible AI alignment data.
Always keep this non-zero to preserve safety behaviors.
nova_baseline_percent
Core factual knowledge.
Helps retain broad world knowledge.
Tip: Start with the defaults, run a training job, evaluate on both your domain task and MMLU, then iterate. The Building specialized AI without sacrificing intelligence post shows that even a 75/25 customer-to-Nova split preserves near-baseline MMLU (0.74 vs. 0.75 baseline) while delivering a 12-point F1 improvement on a complex classification task.
Launch the training job
The overrides parameter lets you control key training hyperparameters:
Parameter
Description
Guidance
lr
Learning rate
1e-5 is a reasonable default for LoRA fine-tuning.
warmup_steps
Steps to linearly ramp up learning rate from 0
Typically 5–10% of total steps. Set proportionally to max_steps.
global_batch_size
Number of examples per gradient update across all GPUs
Larger batches give more stable gradients but use more memory.
max_length
Maximum sequence length in tokens
Set based on your data. 65536 supports long-context use cases; reduce for shorter data to save memory and speed up training.
max_steps
Total training steps
Start small (5–10) to validate your setup, then increase. For ~23k training examples with batch size 32, one full epoch ≈ 720 steps.
training_config = {
“lr”: 1e-5,
“warmup_steps”: 2,
“global_batch_size”: 32,
“max_length”: 65536,
“max_steps”: 5, # Start small to validate; increase for production runs
}
training_result = customizer.train(
job_name=”nova-forge-sft-datamix”,
overrides=training_config,
)
training_result.dump(“training_result.json”)
print(“Training result saved”)
Monitor training progress
You can monitor the job through the SDK or CloudWatch:
# Check job status
print(training_result.get_job_status())
# Stream recent logs
customizer.get_logs(limit=50, start_from_head=False)
# Or use the CloudWatch monitor
monitor = CloudWatchLogMonitor.from_job_result(training_result)
monitor.show_logs(limit=10)
# Poll until completion
import time
while training_result.get_job_status()[1] == “Running”:
time.sleep(60)
Training metrics (loss curves, learning rate schedule) are also available in your MLflow experiment for visualization and comparison across runs.
Step 5: Evaluate the fine-tuned model
Evaluation is critical when you use data mixing because you need to measure two things simultaneously: whether your model improved on your domain task, and whether it retained its general capabilities. If you measure only one axis, you can’t tell if the mix is working.After training completes, retrieve the model checkpoint location from the output manifest:
from amzn_nova_forge.util.checkpoint_util import extract_checkpoint_path_from_job_output
checkpoint_path = extract_checkpoint_path_from_job_output(
output_s3_path=training_result.model_artifacts.output_s3_path,
job_result=training_result,
)
Configure the evaluation infrastructure
Evaluation requires only a single GPU instance (compared to 4 for training):
eval_infra = SMHPRuntimeManager(
instance_type=instance_type,
instance_count=1,
cluster_name=cluster_name,
namespace=namespace,
)
eval_mlflow = MLflowMonitor(
tracking_uri=mlflow_app_arn,
experiment_name=”nova-forge-eval”,
)
evaluator = NovaModelCustomizer(
model=Model.NOVA_LITE_2,
method=TrainingMethod.EVALUATION,
infra=eval_infra,
output_s3_path=f”s3://{S3_BUCKET}/demo/eval-outputs/”,
mlflow_monitor=eval_mlflow,
)
Run evaluations
Nova Forge supports three complementary evaluation approaches:
1. Public benchmarks (used to measure general capability retention)
These tell you whether data mixing is doing its job. If MMLU drops significantly from the baseline, your mix needs more Nova data. If IFEval drops, increase the instruction-following weight.
# MMLU — broad knowledge and reasoning across 57 subjects
mmlu_result = evaluator.evaluate(
job_name=”eval-mmlu”,
eval_task=EvaluationTask.MMLU,
model_path=checkpoint_path,
)
# IFEval — ability to follow structured instructions
ifeval_result = evaluator.evaluate(
job_name=”eval-ifeval”,
eval_task=EvaluationTask.IFEVAL,
model_path=checkpoint_path,
)
2. Bring-your-own-data (measure domain-specific performance)
Use your held-out test set to measure whether fine-tuning improved performance on your actual task:
byod_result = evaluator.evaluate(
job_name=”eval-byod”,
eval_task=EvaluationTask.GEN_QA,
data_s3_path=f”s3://{S3_DATA_PATH}/eval/gen_qa.jsonl”,
model_path=checkpoint_path,
overrides={“max_new_tokens”: 2048},
)
3. Large language model (LLM) as judge (for domains where automated metrics fall short, you can use another LLM to assess response quality)
Check results and retrieve outputs
# Check job status
print(mmlu_result.get_job_status())
print(ifeval_result.get_job_status())
print(byod_result.get_job_status())
# Retrieve the S3 path containing detailed evaluation results
print(mmlu_result.eval_output_path)
The evaluation output path contains the detailed results as JSON. Download and inspect them to get the actual scores.
Additionally, metrics can be published to MLflow tracking servers by supplying the tracking server URI at job creation. With this approach, you can record and store your metrics for comparing experiments.
Interpreting your results
Use the following decision framework to guide your next iteration:
Observation
What it means
What to adjust
MMLU close to baseline (e.g., within 0.01–0.02)
Data mixing is successfully preventing catastrophic forgetting
Your mix is working — focus on domain performance
MMLU significantly degraded
The model is forgetting general capabilities
Decrease customer_data_percent or increase Nova data weights
Domain task performance below expectations
The model isn’t learning enough from your data
Increase customer_data_percent, add more training data, or increase max_steps
IFEval degraded
The model is losing instruction-following ability
Increase nova_instruction-following_percent
Both MMLU and domain task improved
Ideal outcome
Document your configuration and promote to production
As a reference point, this post reports these results for Amazon Nova 2 Lite on a VOC classification task:
The key takeaway is that fine-tuning with only customer data boosts Domain F1 but significantly reduces general intelligence (MMLU drops from 0.75 to 0.47), while the blended approach (75% customer + 25% Nova data) recovers nearly all the MMLU accuracy while still improving domain performance.
Best practices
- Start with the default mixing ratios. The defaults are tuned for a balanced trade-off. Only customize after you have baseline evaluation results to compare against.
- Always evaluate on both axes. Run at least one public benchmark (MMLU) alongside your domain-specific evaluation. Without both, you can’t tell if the mix is working.
- Use MLflow to compare experiments. When iterating on mixing ratios and hyperparameters, MLflow makes it straightforward to compare runs side-by-side and identify the best configuration.
- Iterate on the mix, not just hyperparameters. If your model is forgetting general capabilities, adjusting the data mix is often more effective than tuning learning rate or batch size.
- Start with LoRA, move to full-rank if needed. LoRA is faster and cheaper. Only move to full-rank SFT if LoRA doesn’t achieve sufficient domain adaptation for your use case.
Cleaning up
To avoid ongoing charges, clean up the resources created during this walkthrough:
- Delete the S3 bucket and its contents.
- Stop or delete the SageMaker HyperPod cluster if it was created for this exercise.
- Delete the MLflow application if no longer needed.
Conclusion
In this post, we walked through the end-to-end workflow for fine-tuning Amazon Nova models using the Nova Forge SDK with data mixing enabled. The SDK handles data preparation, training orchestration on SageMaker HyperPod, and multi-dimensional evaluation, so you can focus on your data and your domain.Data mixing is what makes fine-tuning practical for production. Rather than choosing between domain expertise and general intelligence, you get both. The key is to treat it as an iterative process: train, evaluate on both axes, adjust the mix, and repeat until you find the right balance for your use case.
To get started, see the Nova Forge Developer Guide for detailed documentation, and explore the Nova Forge SDK for the full API reference.
About the authors
Gideon Teo is a FSI Solution Architect at AWS in Melbourne, specialising in Amazon SageMaker AI and Amazon Bedrock. Passionate about both traditional AI/ML and Generative AI, he helps financial institutions solve complex business challenges with cutting-edge technologies. Outside work, he enjoys time with friends and family, and exploring diverse technology domains.
Andrew Smith is a Sr. Cloud Support Engineer at AWS, based in Sydney, Australia. He specialises in helping customers with AI/ML workloads on AWS with expertise in Amazon SageMaker AI, Amazon Bedrock and LLM inference.
Timothy Downs is a Startup Solutions Architect at AWS in Melbourne who enjoys working at the bleeding edge of tech, usually before it is fully baked.
Krishna Neupane is an Applied Scientist at Amazon’s AGI Customization team, specializing in Nova model customization and data mixing.

