Crane, an elastic GPU cluster manager with batteries included.
Crane is under active development and subject to major changes in the near future.
Crane is a cluster manager specialized for elastically scheduling GPU resources. Major strengths include:
- Gang-Scheduling: GPU resources are scheduled in units of container groups (called cargos), allowing multiple distributed DL training jobs to run on a single cluster.
- Elastic: Cargos can be dynamically resized. Furthermore, the resources reserved for an entire job (called a mini-cluster) can be resized, allowing AutoML jobs to run, pause, and kill separate trials.
- Multi-tenant: Crane supports multiple users and diverse job types inside a single cluster.
- Transparent: Crane exposes GPU usage to its apps. Each app can query its own GPU usage statistics.
- Batteries Included: Crane supports all of these features out of the box.
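
The gang-scheduling and elastic-resizing ideas in the list above can be illustrated with a toy model. This is only a minimal sketch and not Crane's actual API: `ToyCluster`, `schedule`, and `resize` are hypothetical names invented for this example, and the all-or-nothing policy shown is an assumption about what gang scheduling of a cargo means in practice.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyCluster:
    """Hypothetical toy model of a GPU cluster; not Crane's real interface."""

    total_gpus: int
    # Maps a cargo name to the number of GPUs it currently holds.
    cargos: Dict[str, int] = field(default_factory=dict)

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - sum(self.cargos.values())

    def schedule(self, name: str, gpus: int) -> bool:
        """Gang-schedule a cargo: grant every requested GPU or none at all."""
        if gpus <= 0 or name in self.cargos or gpus > self.free_gpus:
            return False
        self.cargos[name] = gpus
        return True

    def resize(self, name: str, gpus: int) -> bool:
        """Resize an existing cargo; growth succeeds only if it fits entirely."""
        current = self.cargos.get(name)
        if current is None or gpus <= 0 or gpus - current > self.free_gpus:
            return False
        self.cargos[name] = gpus
        return True


cluster = ToyCluster(total_gpus=8)
print(cluster.schedule("job-a", 4))  # True
print(cluster.schedule("job-b", 6))  # False: only 4 GPUs free, no partial grant
print(cluster.resize("job-a", 2))    # True: shrinking frees 2 GPUs
print(cluster.schedule("job-b", 6))  # True: 6 GPUs are now free
```

The point of the all-or-nothing check is that a distributed training job holding only part of its GPUs would sit idle while blocking others, so a request is either fully satisfied or rejected.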