Version: v1.8.2

Introduction

What is Volcano

Volcano is a cloud native system for high-performance workloads. It has been accepted by the Cloud Native Computing Foundation (CNCF) as its first and only official container batch scheduling project. Volcano supports popular computing frameworks such as PyTorch, TensorFlow, Spark, Flink, and Ray.

Volcano also provides a range of scheduling capabilities, including heterogeneous device scheduling, network topology-aware scheduling, multi-cluster scheduling, and colocation of online and offline workloads.

Why Volcano

Job scheduling and management have become increasingly complex and critical for high-performance batch computing. Common requirements are as follows:

  • Support for diverse scheduling algorithms
  • More efficient scheduling
  • Non-intrusive support for mainstream computing frameworks
  • Support for multi-architecture computing

Volcano is designed to meet these requirements. In addition, it inherits the design of the Kubernetes APIs, so you can easily run applications that require high-performance computing on Kubernetes.

Features

Unified Scheduling

  • Support native Kubernetes workload scheduling
  • Provide complete support for frameworks such as PyTorch, TensorFlow, Spark, Flink, and Ray through VolcanoJob
  • Unified scheduling for both online microservices and offline batch jobs to improve cluster resource utilization

Rich Scheduling Policies

  • Gang Scheduling: Ensure all tasks of a job start simultaneously
  • Binpack Scheduling: Optimize resource utilization through compact task allocation
  • Heterogeneous Device Scheduling: Efficient GPU sharing (CUDA/MIG modes) and NPU scheduling
  • Proportion/Capacity Scheduling: Resource sharing/preemption/reclaim based on queue quotas
  • NodeGroup Scheduling: Support node group affinity scheduling
  • DRF Scheduling: Support fair scheduling of multi-dimensional resources
  • SLA Scheduling: Scheduling guarantee based on service quality
  • Task-topology Scheduling: Optimize performance for communication-intensive applications
  • NUMA Aware Scheduling: Optimize resource allocation for multi-core processors
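
To make the gang scheduling policy listed above more concrete, here is a minimal Python sketch of all-or-nothing placement. It is an illustration only, not Volcano's implementation: the `Node` class, the `gang_schedule` function, the `min_available` parameter, and the first-fit placement strategy are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    free_cpu: int  # free CPU in millicores (hypothetical unit for this sketch)


def gang_schedule(
    task_cpus: List[int],
    nodes: List[Node],
    min_available: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Place tasks only if at least `min_available` of them fit.

    Returns (task_index, node_name) pairs, or an empty list when the gang
    cannot be satisfied. By default every task must fit.
    """
    if min_available is None:
        min_available = len(task_cpus)

    # Work on a copy so a failed attempt leaves the nodes untouched.
    free = {node.name: node.free_cpu for node in nodes}
    placements: List[Tuple[int, str]] = []

    for index, cpu in enumerate(task_cpus):
        # First-fit: take the first node with enough free CPU.
        target = next((name for name, f in free.items() if f >= cpu), None)
        if target is None:
            continue
        free[target] -= cpu
        placements.append((index, target))

    # All-or-nothing: a partial gang is discarded instead of started.
    if len(placements) < min_available:
        return []
    return placements
```

With two nodes offering 2000 and 1000 millicores, three 1000-millicore tasks are all placed, while adding a fourth identical task causes the whole job to be rejected rather than partially started.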

Volcano supports custom plugins and actions to implement more scheduling algorithms.
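
As a rough picture of how a pluggable scheduling policy can work, the Python sketch below registers a binpack-style scoring function under a name and uses it to pick a node. Everything here is hypothetical and for illustration: the `SCORERS` registry, the `register` decorator, and the `pick_node` helper are not Volcano's plugin API, and the scoring formula is a simplified assumption.

```python
from typing import Callable, Dict, Optional, Tuple

Resources = Dict[str, float]
ScoreFn = Callable[[Resources, Resources, Resources], float]

# Hypothetical registry mapping a policy name to a node-scoring function.
SCORERS: Dict[str, ScoreFn] = {}


def register(name: str) -> Callable[[ScoreFn], ScoreFn]:
    """Decorator that adds a scoring function to the registry."""
    def wrap(fn: ScoreFn) -> ScoreFn:
        SCORERS[name] = fn
        return fn
    return wrap


@register("binpack")
def binpack_score(request: Resources, used: Resources, capacity: Resources) -> float:
    """Score a node higher the fuller it would be after placing the task.

    Returns -1.0 when the task does not fit on the node.
    """
    total = 0.0
    for resource, amount in request.items():
        cap = capacity.get(resource, 0.0)
        after = used.get(resource, 0.0) + amount
        if cap <= 0 or after > cap:
            return -1.0
        total += after / cap
    return total / len(request) if request else 0.0


def pick_node(
    policy: str,
    request: Resources,
    nodes: Dict[str, Tuple[Resources, Resources]],
) -> Optional[str]:
    """Return the best-scoring node that fits the request, or None.

    `nodes` maps a node name to a (used, capacity) pair.
    """
    score = SCORERS[policy]
    best_name, best_score = None, -1.0
    for name, (used, capacity) in nodes.items():
        value = score(request, used, capacity)
        if value > best_score:
            best_name, best_score = name, value
    return best_name
```

A new policy would be added by decorating another function with `register`, which keeps the node-selection loop unchanged. Compact allocation shows up in the result: a small task goes to the already busy node, leaving the idle node free for larger tasks.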

Queue Resource Management

  • Support multi-dimensional resource quota control (CPU, Memory, GPU, etc.)
  • Provide multi-level queue structure and resource inheritance
  • Support resource borrowing, reclaiming and preemption between queues
  • Implement multi-tenant resource isolation and priority control
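
One way to picture quota sharing and borrowing between queues is a weight-proportional split in which capacity that a queue does not request is redistributed to the others. The Python sketch below assumes a single resource dimension and hypothetical names (`deserved_shares`, a `(weight, request)` pair per queue); it illustrates the idea and is not taken from Volcano's code.

```python
from typing import Dict, Tuple


def deserved_shares(total: float, queues: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Split `total` across queues by weight, capped by each queue's request.

    `queues` maps a queue name to (weight, request). Capacity left unused by
    a satisfied queue is handed to the remaining queues in later rounds,
    which models borrowing.
    """
    eps = 1e-9
    deserved = {name: 0.0 for name in queues}
    active = {name for name, (weight, request) in queues.items() if weight > 0 and request > 0}
    remaining = float(total)

    while active and remaining > eps:
        weight_sum = sum(queues[name][0] for name in active)
        satisfied = set()
        distributed = 0.0
        for name in active:
            weight, request = queues[name]
            share = remaining * weight / weight_sum
            grant = min(share, request - deserved[name])
            deserved[name] += grant
            distributed += grant
            if deserved[name] >= request - eps:
                satisfied.add(name)
        remaining -= distributed
        if not satisfied:
            break  # every active queue absorbed its full share
        active -= satisfied

    return deserved
```

For example, with 100 units and two equally weighted queues requesting 10 and 200 units, the first queue receives its 10 units and the second borrows the unused remainder for a total of 90. A reclaim step would run in the opposite direction once the first queue's demand grows again.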

Multi-architecture Computing

Volcano can schedule computing resources from multiple architectures:

  • x86
  • Arm
  • Kunpeng
  • Ascend
  • GPU

Network Topology-aware Scheduling

Volcano supports network topology-aware scheduling, which places the tasks of a distributed training job close together in the network. This reduces communication overhead and improves training speed.
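
The idea can be sketched in Python as choosing the narrowest network domain that can host every task of a job. The tier layout, the slot counts, and the `pick_domain` function are assumptions made for this illustration; they do not describe Volcano's actual data model or algorithm.

```python
from typing import Dict, List, Optional, Tuple

# One tier maps a domain name (for example a switch) to the nodes below it.
Tier = Dict[str, List[str]]


def pick_domain(
    topology: List[Tier],
    free_slots: Dict[str, int],
    tasks_needed: int,
) -> Optional[Tuple[int, str]]:
    """Return (tier_index, domain) for the narrowest domain that fits the job.

    `topology` is ordered from the closest tier (index 0, such as a leaf
    switch) to the widest. Returns None when no domain has enough free slots.
    """
    for tier_index, tier in enumerate(topology):
        candidates = []
        for domain, nodes in tier.items():
            capacity = sum(free_slots.get(node, 0) for node in nodes)
            if capacity >= tasks_needed:
                candidates.append((capacity, domain))
        if candidates:
            # Tightest fit keeps larger domains available for larger jobs.
            _, domain = min(candidates)
            return tier_index, domain
    return None
```

A job is only spread across a wider tier when no single lower-tier domain has room, so traffic between its tasks crosses as few network hops as the free capacity allows.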

Online and Offline Workloads Colocation

Volcano improves resource utilization while preserving QoS through:

  • Unified scheduling
  • Dynamic resource overcommitment
  • CPU burst
  • Resource isolation

Multi-cluster Scheduling

Volcano supports cross-cluster job scheduling, so that jobs can draw on a larger resource pool spanning multiple clusters.

For details: volcano-global

Descheduling

Volcano supports dynamic descheduling to rebalance load across the cluster.

For details: descheduler

Monitoring and Observability

  • Complete logging system
  • Rich monitoring metrics
  • Dashboard that provides a graphical interface

Dashboard: dashboard
Metrics: metrics

Ecosystem

Volcano integrates with mainstream high-performance computing frameworks, including PyTorch, TensorFlow, Spark, Flink, and Ray.

Future Outlook

Volcano will continue to extend its capabilities through community collaboration and technical innovation, with the aim of becoming a leading platform for high-performance computing and cloud native batch scheduling.