Configure TrainJob Lifecycle
This guide describes how to configure lifecycle policies for TrainJobs, including active deadlines to automatically terminate long-running jobs and suspend/resume to pause and restart training.
Prerequisites
Before exploring this guide, make sure to follow the Getting Started guide to understand the basics of Kubeflow Trainer.
Active Deadline Overview
The activeDeadlineSeconds field in the TrainJob spec specifies the maximum duration (in seconds)
that a TrainJob is allowed to run before the system automatically terminates it. This behavior
matches the
Kubernetes Job activeDeadlineSeconds
semantics.
When the deadline is reached, all running Pods are terminated, the underlying JobSet is deleted,
and the TrainJob status is set to Failed with reason DeadlineExceeded.
The deadline timer starts from the TrainJob creation time. If the TrainJob is suspended and then
resumed, the timer resets from the resume time. The field is immutable after creation and the
minimum allowed value is 1.
Create TrainJob with Active Deadline
You can set activeDeadlineSeconds on a TrainJob to enforce a time limit. The following
example creates a TrainJob that is terminated if it runs longer than 1 hour:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
name: my-trainjob
namespace: my-namespace
spec:
activeDeadlineSeconds: 3600 # terminate the TrainJob after 1 hour
runtimeRef:
name: torch-distributed
kind: ClusterTrainingRuntime
apiGroup: trainer.kubeflow.org
trainer:
image: docker.io/my-training-image:latest
command:
- torchrun
- train.py
Verify the TrainJob Deadline Status
After the deadline is exceeded, the TrainJob transitions to a Failed state. Run the following
command to check the TrainJob conditions:
kubectl get trainjob my-trainjob -o jsonpath='{.status.conditions[?(@.status=="True")]}'
You should see a condition as follows:
{
"type": "Failed",
"status": "True",
"reason": "DeadlineExceeded",
"message": "TrainJob exceeded its active deadline"
}
Suspend and Resume TrainJob
The suspend field allows you to pause a running TrainJob without deleting it. When a TrainJob
is suspended, its Pods are terminated but the TrainJob resource and its configuration are preserved.
This is useful for temporarily freeing cluster resources or debugging training issues.
The following example creates a TrainJob in suspended state:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
name: my-trainjob
namespace: my-namespace
spec:
suspend: true
runtimeRef:
name: torch-distributed
kind: ClusterTrainingRuntime
apiGroup: trainer.kubeflow.org
To resume the TrainJob, update the suspend field to false:
kubectl patch trainjob my-trainjob --type=merge -p '{"spec":{"suspend":false}}'
Note
When a TrainJob with ActiveDeadlineSeconds is resumed from suspension, the deadline timer resets from the resume time and not the original creation time. This means the TrainJob gets the full deadline duration after each resume.Next Steps
- Learn more about runtime patches for customizing TrainJob behavior.
- Check out the TrainJob API reference
for the full list of
TrainJobSpecfields.
Feedback
Was this page helpful?
Thank you for your feedback!
We're sorry this page wasn't helpful. If you have a moment, please share your feedback so we can improve.