Add a node to a Saagie cluster

Saagie uses the Kubernetes pod scheduler to dispatch jobs to available nodes. In most situations, you do not need to notify Saagie that you created a new node in your cluster.

In specific situations, however, you will need to configure the new node for it to work properly. This article will discuss the following situations:

  • Node with GPU

  • Node dedicated to the saagie-common namespace

1. Node with GPU

Saagie supports scheduling jobs on a GPU. This feature has been tested with Saagie’s Python image on the following hardware and driver configuration:

  • Tesla P4 GPU

  • Nvidia drivers v418.67

At the moment, only Nvidia GPUs are supported. The feature might also work with other Nvidia-based hardware configurations, but they have not been tested.

If you are adding a node with GPU computation capabilities, you must configure:

  • Kubernetes Node resource

  • Saagie settings

  • Kubernetes admission controller configuration (if not already done)

1.1. Node resource configuration

The following taint must be present for the node resource: nvidia.com/gpu=present:NoSchedule.

Run the following command to add the taint:

kubectl taint nodes [NODE_NAME] nvidia.com/gpu=present:NoSchedule

See the Kubernetes documentation on taints and tolerations for more information.
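To confirm that the taint was applied, you can list the taints on the node. The following is a minimal sketch, assuming kubectl is configured for your cluster. The function name has_gpu_taint is illustrative and is not part of Saagie or Kubernetes.

```shell
# Illustrative helper (hypothetical name, not part of Saagie):
# succeeds if the node carries the nvidia.com/gpu=present:NoSchedule taint.
has_gpu_taint() {
  node_name="$1"
  # Print each taint as key=value:effect, one per line,
  # then look for an exact match on the expected taint.
  kubectl get node "$node_name" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}' \
    | grep -qxF 'nvidia.com/gpu=present:NoSchedule'
}
```

For example, has_gpu_taint [NODE_NAME] && echo "taint present" prints a message only when the taint is found.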

1.2. Saagie settings

You must activate GPU support in the Saagie settings component for each platform where jobs using GPU can be scheduled.

Run the following queries to update your settings, replacing the text as follows:

  • <realm> is the prefix that was determined during Saagie installation.

  • <saagie_host> is your Saagie URL.

  • <username> and <password> must be the credentials of an admin user.

  • <platform_id> is the ID of the platform being configured.

# Authentication Query
# Stores the response in TOKEN (this assumes the endpoint returns the bare token in the response body)
TOKEN=$(curl -s -X POST -H "Content-Type:application/json" -H "Saagie-Realm:<realm>" https://<saagie_host>/authentication/api/open/authenticate --data '{"login":"<username>", "password":"<password>"}')

# Query enabling GPU setting
curl -X POST -H "Content-Type:application/json" -H "Saagie-Realm:<realm>" -H "Authorization: Bearer $TOKEN" https://<saagie_host>/settings/api/v1/settings/platform/<platform_id>/gpu --data '{"hasGpuNode":true}'

Finally, retrieve your configuration status, making sure to replace <realm>, <saagie_host>, and <platform_id>:

# Query reading GPU setting
curl -X GET -H "Content-Type:application/json" -H "Saagie-Realm:<realm>" -H "Authorization: Bearer $TOKEN" https://<saagie_host>/settings/api/v1/settings/platform/<platform_id>/gpu

1.3. Admission controller

The ExtendedResourceToleration admission controller must be activated on the Kubernetes cluster in order to schedule jobs on the GPU node.

See the Kubernetes reference documentation on admission controllers for more information.
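On a self-managed cluster, one way to check whether the admission controller is active is to inspect the flags of the API server. The sketch below assumes that kube-apiserver runs as a pod labelled component=kube-apiserver in the kube-system namespace, which is the kubeadm default. On managed Kubernetes offerings the API server is usually not visible this way, and you should refer to your provider's documentation instead. The function name is illustrative.

```shell
# Illustrative check (hypothetical name). Assumes kube-apiserver runs as a pod
# labelled component=kube-apiserver in kube-system (kubeadm default).
extended_toleration_enabled() {
  # Dump the API server command line and look for the plugin name
  # inside the value of the --enable-admission-plugins flag.
  kubectl -n kube-system get pods -l component=kube-apiserver \
    -o jsonpath='{.items[*].spec.containers[*].command}' \
    | grep -Eq -- '--enable-admission-plugins=[^" ]*ExtendedResourceToleration'
}
```

For example, extended_toleration_enabled || echo "ExtendedResourceToleration not found" reports when the plugin is missing from the flag.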

2. Node(s) dedicated to the saagie-common namespace

During the installation process, you can choose to dedicate nodes to the pods that run in the saagie-common namespace.

If you did not choose to dedicate nodes to the saagie-common namespace, or you do not want to dedicate the node you are adding to the saagie-common namespace, then no action is required.

When a node is dedicated to the saagie-common namespace, the Saagie platform will run solely on that node. A node with the saagie-common label will not run anything other than the Saagie platform.

If you need to dedicate multiple nodes to the saagie-common namespace, make sure that all dedicated nodes have the same label and value pairs.
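To see which nodes share a given label and value, you can query nodes with a label selector. This is a minimal sketch: the function name is illustrative, and the label key and value are placeholders, since the actual label used for saagie-common depends on the choices made during your installation.

```shell
# Illustrative helper (hypothetical name): lists the nodes that carry
# the given label. $1 is the label key and $2 the label value; both are
# placeholders for the values chosen during your Saagie installation.
nodes_with_label() {
  kubectl get nodes -l "$1=$2" -o name
}
```

For example, nodes_with_label [LABEL_KEY] [LABEL_VALUE] should list every node you intend to dedicate. A dedicated node missing from the output does not carry the expected label and value.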

If you installed Saagie without dedicating nodes to the saagie-common namespace and would like to do so now, contact Saagie.