Deploying Crane on a GPU Cluster

This page walks you through deploying Crane. If you haven't already, follow the install guide to install Crane on your nodes.

Note

If you installed Crane in a Python virtual environment, make sure to activate it before following along. If you set it up with pyenv as described in the install guide, the activation command is:

$ pyenv activate crane

Crane supports three forms of deployment:

Docker Swarm
Kubernetes
Standalone

Deploy on Docker Swarm

To deploy Crane on Docker Swarm, all you need is a Docker Engine that supports Docker Swarm running on each node. You do not have to initialize Docker Swarm yourself; Crane does that for you.

Every node on the Crane Cluster must have an instance of crane-admind, the Crane admin daemon, running. crane-admind is a Docker container that manages the state of the current node with respect to the cluster, e.g. whether it has joined the cluster or its role (leader or follower) in the cluster.

The Crane Bootstrap CLI tool was made for this one-time task of running crane-admind.

If you installed boot.crane as a system package, start it with:

>>> (crane) $ sudo systemctl start crane

Otherwise, if you installed from source:

>>> (crane) $ boot.crane

Warning

The boot.crane command above assumes that you are NOT on macOS. The Crane Admin Daemon communicates with the Crane Admin CLI tool via a Unix domain socket. However, macOS does not allow this. A workaround is to forward the socket requests through a socat container. The command is:

>>> (crane) $ boot.crane --socket-forward-port <port>

This will spawn an extra container for forwarding.

Now that crane-admind is up and running, you can use the Crane Admin CLI to command it to set up the node.

On the node you want to give the cluster manager role, run:

>>> (crane) $ craneadm init --host-ip <MANAGER_IP>

This command does the following:

"I am the cluster manager node."
1. Launches Docker Swarm
2. Starts the state DB container (MySQL)
3. Starts the monitoring containers (Prometheus, Grafana, InfluxDB)
4. Starts the logging containers (Elasticsearch, Kibana)
5. Starts the Cluster Manager container (crane.core.master)
"I am yet another node on the cluster."
1. Starts the metric exporter containers (Node Exporter, cAdvisor, DCGM)
2. Starts the log aggregator container (Fluent-bit)
3. Starts the Node Master container (crane.core.ds.container)
"I should expose my interface to clients."
1. Starts the gateway container (crane.core.gateway)

Now, the Crane cluster is initialized! Let's try joining other nodes to the cluster.

craneadm init should have echoed a command that you can use to join other nodes to the cluster. It looks like:

>>> (crane) $ craneadm join SWMTKN-1-1q2w3e4r...

Then, after making sure that the new node is also running crane-admind (bootstrapped by boot.crane), run that command, possibly including some other options:

>>> (crane) $ craneadm join SWMTKN-1-1q2w3e4r... --host-ip <NODE_ADDR> --gpus <GPU_INDICES> ...

GPU_INDICES should be provided as a comma-separated list of GPU indices, e.g. 0,2,4. See the admin CLI reference for the full list of options.
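A malformed GPU list is easy to catch before it reaches craneadm. The sketch below is a hypothetical helper, not part of Crane: is_valid_gpu_indices only checks the comma-separated format described above, and it does not verify that the indices actually exist on the node.

```shell
# Hypothetical helper (not part of Crane): succeeds only if $1 is a
# comma-separated list of non-negative integers, e.g. "0,2,4".
is_valid_gpu_indices() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(,[0-9]+)*$'
}

# Try a few candidate values before passing one to `craneadm join --gpus`.
for candidate in "0,2,4" "0,,2" "gpu0"; do
  if is_valid_gpu_indices "$candidate"; then
    echo "valid:   $candidate"
  else
    echo "invalid: $candidate"
  fi
done
```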

Repeat boot.crane and craneadm join on each additional node to scale up the cluster. Once every node has joined, your Crane cluster is ready.
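When several nodes need to join, it can help to write the per-node options down once and generate the commands from that list. The following is a minimal sketch under stated assumptions: print_join_commands is a hypothetical helper, the addresses and GPU lists are made up, and only the --host-ip and --gpus options mentioned above are used.

```shell
# Hypothetical helper (not part of Crane): reads "<node address> <GPU indices>"
# pairs from stdin and prints one craneadm join command per node.
print_join_commands() {
  token="$1"
  while read -r addr gpus; do
    # Skip blank lines in the inventory.
    [ -n "$addr" ] || continue
    echo "craneadm join $token --host-ip $addr --gpus $gpus"
  done
}

# Made-up inventory; substitute the token echoed by craneadm init.
print_join_commands "<JOIN_TOKEN>" <<'EOF'
10.0.0.12 0,1
10.0.0.13 0,2,4
EOF
```

Each printed line is meant to be run on the corresponding node, after boot.crane has started crane-admind there.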

Deploy on Kubernetes

You will need a working Kubernetes cluster on which to deploy Crane.

Crane makes minimal assumptions about the Kubernetes cluster on which it will run:

Pods, Services, and DaemonSets should work.
Service DNS resolution should work, for example via CoreDNS.
The NVIDIA device plugin should be installed.

Any cluster that matches the requirements above should be fine.
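The NVIDIA device plugin requirement can be spot-checked from the command line: when the plugin is working, each GPU node advertises an nvidia.com/gpu resource. The sketch below sums those counts across the cluster. The helper name count_cluster_gpus is hypothetical, and the sketch assumes kubectl is already configured to reach the cluster.

```shell
# Hypothetical helper: sums the allocatable nvidia.com/gpu count over all nodes.
# Nodes without the resource contribute an empty line, which awk counts as 0.
count_cluster_gpus() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' |
    awk '{ total += $1 } END { print total + 0 }'
}

# Only query the cluster when kubectl is available on this machine.
if command -v kubectl >/dev/null 2>&1; then
  echo "allocatable GPUs: $(count_cluster_gpus)"
fi
```

A result of 0 suggests that the device plugin is missing or not yet ready on the GPU nodes.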

Using K3s to deploy a Kubernetes cluster

In case you don't have a working Kubernetes cluster, we'll show how we, the Crane team, do it internally. If you do have a Kubernetes cluster, skip to the next section.

K3s is a great solution for deploying Kubernetes. It provides a lightweight Kubernetes distribution in a single binary.

Info

The steps below have been verified on servers that run Ubuntu 18.04 (Bionic).

First, deploy etcd. Kubernetes uses it as its state storage. While the steps below deploy a single-node etcd cluster, you can scale it as you wish.

Note

Crane uses etcd as its own state storage, and this should be kept separate from that of Kubernetes. While in theory it should be possible to share the same etcd cluster between the two, we chose not to because 1) the Kubernetes-side etcd is heavily protected with certificates, and 2) we do not want to risk breaking both at runtime.

# Install the etcd server and client (etcdctl).
$ sudo apt install etcd-server etcd-client

# Configure etcd (no authentication at all; $IP stands for the advertise IP address of node-01).
# /etc/default/etcd should contain the following:
$ sudo cat /etc/default/etcd
ETCD_NAME="node-01"
ETCD_LISTEN_PEER_URLS="http://$IP:2380,http://$IP:7001"
ETCD_LISTEN_CLIENT_URLS="http://$IP:2379,http://$IP:4001"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://$IP:2380,http://$IP:7001"
ETCD_INITIAL_CLUSTER="node-01=http://$IP:2380,node-01=http://$IP:7001"
ETCD_ADVERTISE_CLIENT_URLS="http://$IP:2379,http://$IP:4001"

# Start etcd.
$ sudo systemctl daemon-reload
$ sudo systemctl restart etcd
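Because $IP in the settings above is a placeholder, the file on disk needs the literal address. This matters if the packaged service reads /etc/default/etcd as a plain systemd environment file, which is an assumption here; in that case a literal $IP would not be expanded. A minimal sketch of generating the file contents, using a hypothetical helper render_etcd_defaults and a made-up address:

```shell
# Hypothetical helper: prints the etcd settings shown above with the
# node name ($1) and advertise IP ($2) substituted.
render_etcd_defaults() {
  name="$1"
  ip="$2"
  cat <<EOF
ETCD_NAME="$name"
ETCD_LISTEN_PEER_URLS="http://$ip:2380,http://$ip:7001"
ETCD_LISTEN_CLIENT_URLS="http://$ip:2379,http://$ip:4001"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://$ip:2380,http://$ip:7001"
ETCD_INITIAL_CLUSTER="$name=http://$ip:2380,$name=http://$ip:7001"
ETCD_ADVERTISE_CLIENT_URLS="http://$ip:2379,http://$ip:4001"
EOF
}

# 10.0.0.11 is a made-up address; use the node's real advertise IP.
render_etcd_defaults node-01 10.0.0.11
```

The output can be piped to sudo tee /etc/default/etcd. After restarting etcd, running ETCDCTL_API=3 etcdctl --endpoints="http://10.0.0.11:2379" endpoint health (again with the real address) is one way to confirm that the server is answering.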

Next, deploy Kubernetes with K3s. On the node you wish to set as master:

# Install the k3s server.
# We need --docker for the NVIDIA device plugin to work.
$ curl -sfL https://get.k3s.io | sh -s - \
    --datastore-endpoint "http://$IP:2379" \
    --bind-address "$IP" \
    --node-ip "$IP" \
    --docker

# Configure kubectl (it must be installed separately: https://kubernetes.io/docs/tasks/tools/install-kubectl/).
# The copied file is owned by root, so hand it over to the current user.
$ mkdir -p ~/.kube
$ sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
$ sudo chown "$(id -un):$(id -gn)" ~/.kube/config

# Print the node token and copy it; the workers need it to join.
$ sudo cat /var/lib/rancher/k3s/server/node-token

Then, on each node you wish to set as worker:

# Install the k3s agent. Paste the node token you copied when installing the k3s server.
# We need --docker for the NVIDIA device plugin to work.
# K3S_URL must point at the k3s server node ($SERVER_IP is a placeholder for its IP address);
# --node-ip takes this worker's own IP address.
$ curl -sfL https://get.k3s.io | \
    K3S_URL="https://$SERVER_IP:6443" \
    K3S_TOKEN="Paste node token here" \
    sh -s - \
    --node-ip "$IP" \
    --docker

# Configure kubectl (it must be installed separately: https://kubernetes.io/docs/tasks/tools/install-kubectl/).
$ mkdir -p ~/.kube
$ scp node-01:~/.kube/config ~/.kube/config
$ sudo chown "$(id -un):$(id -gn)" ~/.kube/config
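Before moving on, it is worth confirming that every worker has registered with the server. The sketch below counts nodes that kubectl reports as Ready. The helper name count_ready_nodes is hypothetical, and nodes with a compound status such as Ready,SchedulingDisabled are deliberately not counted.

```shell
# Hypothetical helper: counts nodes whose STATUS column is exactly "Ready".
count_ready_nodes() {
  kubectl get nodes --no-headers |
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Only query the cluster when kubectl is available on this machine.
if command -v kubectl >/dev/null 2>&1; then
  echo "ready nodes: $(count_ready_nodes)"
fi
```

The count should match the number of nodes on which you installed the k3s server or agent.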

Crane Helm Chart

Crane provides a Helm chart for easy deployment.

First, install the Helm CLI.

$ curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash

Make sure that helm can reach the Kubernetes API server by appropriately setting ~/.kube/config (or the file $KUBECONFIG points to).

Next, add our repository and install Crane on your Kubernetes cluster. Note that at the time of writing, the Crane Helm charts are only accessible to Crane developers who have access to Crane's GitHub repository. Hence, you must give Helm your GitHub username and an associated PAT (Personal Access Token) with repo access.

$ helm repo add --username $GITHUB_USERNAME --password $GITHUB_TOKEN friendliai-crane https://friendliai.github.io/crane
$ helm install crane friendliai-crane/crane

You can also customize your installation by first downloading the Crane Helm chart and modifying configuration keys in values.yaml.

$ helm pull friendliai-crane/crane
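As an alternative to editing values.yaml inside the downloaded chart, overrides can be kept in a separate file and passed at install time. The sketch below assumes the repository has been added as shown above. The helper name install_crane_with_values and the file name my-values.yaml are hypothetical, while helm show values and the --values option are standard Helm 3 features.

```shell
# Hypothetical helper: installs the Crane chart with overrides from a local
# values file ($1), refusing to run if the file does not exist.
install_crane_with_values() {
  values_file="$1"
  if [ ! -f "$values_file" ]; then
    echo "values file not found: $values_file" >&2
    return 1
  fi
  helm install crane friendliai-crane/crane --values "$values_file"
}

# Typical flow (commented out so that nothing is installed by accident):
#   helm show values friendliai-crane/crane > my-values.yaml
#   "$EDITOR" my-values.yaml
#   install_crane_with_values my-values.yaml
```

Keeping overrides in their own file makes it easier to see which settings differ from the chart defaults.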

Deploy a Standalone Cluster

Crane also supports standalone clusters for personal use and for the development of Crane itself. A standalone cluster is a single process, and differs from ordinary clusters in the following ways:

Docker is not used. The gateway, the cluster manager, and a non-persistent in-memory state DB (sqlite:memory) run in a single process. Also, jobs run as bare Python processes.
No logging and monitoring support.
The Crane Admin CLI will not communicate with crane-admind. Users can submit jobs with the Crane User CLI, which communicates with the cluster manager through a designated Unix domain socket.

To use standalone clusters, install Crane from source with the command make install-dev.

To start a standalone cluster, run:

>>> (crane) $ craneadm standalone --gpus <GPU_INDICES>

Last update: March 2, 2022