Introduction
What is Volcano
Volcano is a cloud-native system for high-performance workloads. It has been accepted by the Cloud Native Computing Foundation (CNCF) as its first and only official container batch scheduling project. Volcano supports popular computing frameworks such as Spark, TensorFlow, PyTorch, Flink, Argo, MindSpore, PaddlePaddle, and Ray.
Volcano also provides a range of scheduling capabilities, including heterogeneous device scheduling, network topology-aware scheduling, multi-cluster scheduling, and colocation of online and offline workloads.
Why Volcano
Job scheduling and management have become increasingly complex and critical for high-performance batch computing. Common requirements include:
- Support for diverse scheduling algorithms
- More efficient scheduling
- Non-intrusive support for mainstream computing frameworks
- Support for multi-architecture computing
Volcano is designed to cater to these requirements. In addition, Volcano inherits the design of Kubernetes APIs, allowing you to easily run applications that require high-performance computing on Kubernetes.
Features
Unified Scheduling
- Support native Kubernetes workload scheduling
- Provide complete support for frameworks such as PyTorch, TensorFlow, Spark, Flink, and Ray through VolcanoJob
- Schedule online microservices and offline batch jobs together to improve cluster resource utilization
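As a rough illustration of what a VolcanoJob looks like, the sketch below builds a minimal manifest as a Python dictionary and prints it as JSON. The helper `build_volcano_job`, the job name, the image, and the resource requests are hypothetical values chosen for the example. The field names (`schedulerName`, `queue`, `minAvailable`, `tasks`) follow the `batch.volcano.sh/v1alpha1` Job API as commonly documented, but verify them against the Volcano version you run.

```python
import json


def build_volcano_job(name, image, replicas, queue="default"):
    """Return a minimal VolcanoJob manifest as a dict (hypothetical helper)."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Hand the job's pods to the Volcano scheduler instead of the default one.
            "schedulerName": "volcano",
            "queue": queue,
            # Minimum number of pods that must be schedulable together.
            "minAvailable": replicas,
            "tasks": [
                {
                    "name": "worker",
                    "replicas": replicas,
                    # Ordinary Kubernetes pod template.
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": "worker",
                                    "image": image,
                                    "resources": {
                                        "requests": {"cpu": "1", "memory": "1Gi"}
                                    },
                                }
                            ],
                        }
                    },
                }
            ],
        },
    }


# Example values are assumptions made for illustration only.
manifest = build_volcano_job("demo-job", "busybox:1.36", replicas=3)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be saved to a file and applied with `kubectl apply -f` on a cluster where Volcano is installed.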
Rich Scheduling Policies
- Gang Scheduling: Ensure all tasks of a job start simultaneously
- Binpack Scheduling: Optimize resource utilization through compact task allocation
- Heterogeneous Device Scheduling: Efficient GPU sharing (CUDA/MIG modes) and NPU scheduling
- Proportion/Capacity Scheduling: Resource sharing/preemption/reclaim based on queue quotas
- NodeGroup Scheduling: Support node group affinity scheduling
- DRF Scheduling: Support fair scheduling of multi-dimensional resources
- SLA Scheduling: Provide scheduling guarantees based on service quality
- Task-topology Scheduling: Optimize performance for communication-intensive applications
- NUMA Aware Scheduling: Optimize resource allocation for multi-core processors
Volcano supports custom plugins and actions to implement more scheduling algorithms.
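To make one of the policies above more concrete, here is a minimal sketch of the idea behind DRF (Dominant Resource Fairness). A job's dominant share is the largest fraction it holds of any single resource, and the job with the smallest dominant share is served next. This is a simplified illustration and not Volcano's plugin code; the function names, resource names, and numbers are assumptions made for the example.

```python
def dominant_share(allocated, capacity):
    """Largest fraction of any single resource that a job currently holds."""
    return max(allocated.get(r, 0) / capacity[r] for r in capacity)


def next_job(jobs, capacity):
    """Pick the job with the smallest dominant share (ties broken by name)."""
    return min(jobs, key=lambda name: (dominant_share(jobs[name], capacity), name))


# Hypothetical cluster capacity and current per-job allocations.
capacity = {"cpu": 16, "memory_gib": 64, "gpu": 4}
jobs = {
    "job-a": {"cpu": 4, "memory_gib": 8, "gpu": 0},  # dominant resource: cpu, 4/16
    "job-b": {"cpu": 1, "memory_gib": 4, "gpu": 2},  # dominant resource: gpu, 2/4
}

for name in jobs:
    print(name, dominant_share(jobs[name], capacity))
print("next:", next_job(jobs, capacity))
```

In this example job-a holds a quarter of the CPUs while job-b holds half of the GPUs, so job-a has the smaller dominant share and would be offered resources first.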
Queue Resource Management
- Support multi-dimensional resource quota control (CPU, Memory, GPU, etc.)
- Provide multi-level queue structure and resource inheritance
- Support resource borrowing, reclaiming and preemption between queues
- Implement multi-tenant resource isolation and priority control
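The queue concepts above map onto a Queue resource. The sketch below builds one as a Python dictionary under stated assumptions: the helper `build_queue`, the queue name, and the quota values are hypothetical, and the `weight`, `capability`, and `reclaimable` fields follow the `scheduling.volcano.sh/v1beta1` Queue API as commonly documented. Check the field set against your installed Volcano version.

```python
import json


def build_queue(name, weight, capability, reclaimable=True):
    """Return a minimal Volcano Queue manifest as a dict (hypothetical helper)."""
    return {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": {
            # Relative share of cluster resources compared with other queues.
            "weight": weight,
            # Upper bound on the resources the queue may consume.
            "capability": capability,
            # Whether other queues may reclaim resources this queue has borrowed.
            "reclaimable": reclaimable,
        },
    }


# Example quota values are assumptions made for illustration only.
queue = build_queue(
    "team-a",
    weight=2,
    capability={"cpu": "8", "memory": "16Gi", "nvidia.com/gpu": "2"},
)
print(json.dumps(queue, indent=2))
```

A job is placed in a queue by naming that queue in its spec, so per-team quotas can be expressed as one queue per tenant.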
Multi-architecture Computing
Volcano can schedule computing resources from multiple architectures:
- x86
- Arm
- Kunpeng
- Ascend
- GPU
Network Topology-aware Scheduling
Volcano supports network topology-aware scheduling, which optimizes data transmission for distributed training tasks, reducing communication overhead and improving training speed.
Online and Offline Workloads Colocation
Volcano improves resource utilization while preserving QoS through:
- Unified scheduling
- Dynamic resource overcommitment
- CPU burst
- Resource isolation
Multi-cluster Scheduling
Volcano supports cross-cluster job scheduling to manage larger-scale resource pools.
For details, see volcano-global.
Descheduling
Volcano supports dynamic descheduling to optimize cluster load distribution.
For details, see descheduler.
Monitoring and Observability
- Complete logging system
- Rich monitoring metrics
- Dashboard with a graphical interface
Ecosystem
Volcano integrates with these high-performance computing frameworks:
- Spark
- TensorFlow
- PyTorch
- Flink
- Argo
- Ray
- MindSpore
- PaddlePaddle
- OpenMPI
- Horovod
- MXNet
- Kubeflow
- KubeGene
- Cromwell
Future Outlook
Volcano will continue to expand its capabilities through community collaboration and technical innovation, with the goal of becoming a leader in high-performance computing and cloud-native batch scheduling.