Launch GPU On Demand With OVH AI Training
Before you begin, check the following prerequisites:

- You must have an OVHcloud account.
- You must have a Public Cloud project in your OVHcloud account.
- You must have access to the OVHcloud Control Panel.

1. Log in to your OVHcloud Control Panel.
2. Create a user on your OVHcloud account with the following roles:

   - AI Training Operator
   - AI Training Reader
   - ObjectStore Operator

   For more information, see the OVHcloud instructions on how to manage AI users and roles.
3. Generate an application token for OVH AI tools. For more information, see the OVHcloud instructions on users and tokens.
4. Configure your OVHcloud Object Storage to store your job code or data. These storage spaces are accessible through an API interface and can be of different storage classes.
   Choose your object storage class according to your needs from the following:

   - The S3 object storage, with the Standard object storage - S3 API or High Performance object storage - S3 API classes. The S3 storage classes are compatible with the S3 protocol and are regularly updated.
   - The SWIFT object storage, with the Standard object storage - SWIFT API class. The SWIFT storage classes are from older generations and no longer benefit from further developments.
   Create your Object Storage bucket. For more information, see the OVHcloud instructions on how to create a bucket.

   - Alternatively, you can create an S3 bucket with the `ovhai` CLI. For more detailed information, see the OVHcloud documentation on S3 buckets.
   - Alternatively, you can create a Swift bucket through the REST API, with a `POST` request to `/cloud/project/{serviceName}/region/{regionName}/storage`, as sketched after this list. For more detailed information, see the OVHcloud documentation on Swift Object Storage.
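   For illustration, here is a minimal Python sketch of that Swift bucket creation call, using the official `ovh` client library to sign the request. The `containerName` body field and the credential variable names are assumptions, not confirmed by this guide; check the OVHcloud API console for the exact request schema.

   ```python
   import os

   import ovh  # Official OVHcloud API client: pip install ovh

   # Hypothetical sketch: the client signs requests with your application credentials.
   client = ovh.Client(
       endpoint="ovh-eu",
       application_key=os.environ["OVH_APP_KEY"],
       application_secret=os.environ["OVH_APP_SECRET"],
       consumer_key=os.environ["OVH_CONSUMER_KEY"],
   )

   service_name = "YOUR_PUBLIC_CLOUD_PROJECT_ID"
   region_name = "GRA"

   # Assumption: the request body takes the bucket name in a containerName field.
   result = client.post(
       f"/cloud/project/{service_name}/region/{region_name}/storage",
       containerName="YOUR_BUCKET_NAME",
   )
   print(result)
   ```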
   You can now put your code or data in the newly created Object Storage.

   Here is an example of code in Python to upload your data to Object Storage with the S3 API.

   ```python
   import logging
   import os

   import boto3
   from botocore.exceptions import ClientError


   def upload_file(s3_client, file_name, bucket, object_name=None):
       """Upload a file to an S3 bucket.

       :param s3_client: The boto3 client.
       :param file_name: The file to upload.
       :param bucket: The bucket to upload to.
       :param object_name: The S3 object name. If not specified, the file_name value is used.
       :return: Returns True if the file was uploaded, else returns False.
       """
       # If the S3 object_name value is not specified, use the file_name value.
       if object_name is None:
           object_name = os.path.basename(file_name)

       # Upload the file.
       try:
           s3_client.upload_file(file_name, bucket, object_name)
       except ClientError as e:
           logging.error(e)
           return False
       return True


   s3_bucket_name = "YOUR_BUCKET_NAME"
   s3_client = boto3.client(
       "s3",
       endpoint_url=os.environ["AWS_ENDPOINT_URL"],
       region_name=os.environ["AWS_REGION_NAME"],
       aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
       aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
   )

   upload_file(s3_client, "./resources/__main__.py", s3_bucket_name, object_name="__main__.py")
   ```
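   To check that the upload worked, you can list the bucket contents with the same client. A short sketch:

   ```python
   # Optional check: list the objects in the bucket to confirm the upload.
   response = s3_client.list_objects_v2(Bucket=s3_bucket_name)
   for obj in response.get("Contents", []):
       print(obj["Key"], obj["Size"])
   ```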
   Here is an example of code in Python to upload your data to Object Storage with the Swift API.

   ```python
   import logging
   import os

   import requests
   import swiftclient
   from swiftclient.exceptions import ClientException


   def upload_file_swift(url, token, file_path, bucket_name, name):
       """Upload a file to a Swift bucket.

       :param url: The Swift bucket endpoint, in String data type.
       :param token: The OpenStack token, in String data type. You can create it in OVH by
           navigating to Users & Roles -> Select a user -> Select the ellipsis menu ->
           Generate an OpenStack token.
       :param file_path: The path of the file to upload, in String data type.
       :param bucket_name: The bucket to upload to, in String data type.
       :param name: The object name, in String data type.
       :return: Returns True if the file was uploaded.
       """
       try:
           with open(file_path, "rb") as f:
               file_data = f.read()
           swiftclient.client.put_object(url=url, token=token, container=bucket_name,
                                         name=name, contents=file_data)
       except ClientException as e:
           logging.error(e)
           raise
       return True


   ovh_token_data = {
       "auth": {
           "identity": {
               "methods": ["password"],
               "password": {
                   "user": {
                       "name": os.environ["OVH_USER_LOGIN"],
                       "domain": {"id": "default"},
                       "password": os.environ["OVH_USER_PWD"],
                   }
               },
           },
           "scope": {
               "project": {
                   "name": os.environ["OVH_TENANT_NAME"],
                   "domain": {"id": "default"},
               }
           },
       }
   }

   s3_bucket_name = "YOUR_BUCKET_NAME"
   # The Swift bucket endpoint, for example:
   # https://storage.<region>.cloud.ovh.net/v1/AUTH_{TENANT_ID}/
   bucket_endpoint = os.environ["BUCKET_ENDPOINT_URL"]

   # Request an OpenStack token from the OVHcloud authentication endpoint.
   res_get_token = requests.post(
       url="https://auth.cloud.ovh.net/v3/auth/tokens",
       json=ovh_token_data,
       headers={"Content-Type": "application/json"},
   )
   openstack_token = res_get_token.headers["x-subject-token"]

   upload_file_swift(bucket_endpoint, openstack_token, "./resources/__main__.py",
                     s3_bucket_name, "__main__.py")
   ```
5. Create your job in Saagie to use GPU or CPU on demand. Example of a job code in Python.

   ```python
   import os

   import requests


   class BearerAuth(requests.auth.AuthBase):
       def __init__(self, token):
           self.token = token

       def __call__(self, r):
           r.headers["authorization"] = "Bearer " + self.token
           return r


   s3_bucket_name = "YOUR_BUCKET_NAME"
   command_line = "python ~/sample_project/__main__.py"  # You can customize the command line to suit your needs.
   ovh_token_gra = os.environ["OVH_TOKEN"]  # The token that you created at step 3.

   ovh_new_job = {
       "image": "YOUR_IMAGE",
       "region": "GRA",  # The region where the job will run.
       "volumes": [
           {
               "dataStore": {
                   "alias": "GRA",  # If you use Object Storage with the S3 API, use the custom alias that you created.
                   "container": s3_bucket_name,
                   "prefix": "",
               },
               "mountPath": "/workspace/sample_project",
               "permission": "RW",
               "cache": False,
           }
       ],
       "name": "YOUR_JOB_NAME",
       "unsecureHttp": False,
       "resources": {
           "gpu": 1,  # The number of GPUs that you need.
           "flavor": "ai1-1-gpu",
       },
       "command": ["bash", "-c", command_line],
       "envVars": [
           {
               "name": "YOUR_ENV_VAR_NAME",
               "value": os.environ["YOUR_ENV_VAR_NAME"],
           }
           # You can set other environment variables here.
       ],
       "sshPublicKeys": [],
   }

   # Send the request to create the job.
   response_create_job = requests.post(
       "https://gra.training.ai.cloud.ovh.net/v1/job",
       auth=BearerAuth(ovh_token_gra),
       json=ovh_new_job,
   )
   ```
   In your code, you must send a `POST` request to `https://<region>.training.ai.cloud.ovh.net/v1/job` with the user or token created at step 3, where `<region>` must be replaced with `gra` or `bhs`. In the request body, specify the number of GPUs or CPUs, and link the job to your Object Storage. This `POST` request returns the job ID.
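   For example, here is a minimal sketch of reading the job ID back from the creation response, assuming the API returns the created job as a JSON object with an `id` field (verify this against the actual response body):

   ```python
   # Minimal sketch, assuming the response body is the job JSON with an "id" field.
   response_create_job.raise_for_status()
   id_job = response_create_job.json()["id"]
   print(f"Created job {id_job}")
   ```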
You have launched your request for GPU or CPU on demand.
Monitor Your Job
Add the following code to your job's code file to get information about the job and its logs:
- Add a `GET` request to `https://<region>.training.ai.cloud.ovh.net/v1/job/{id_job}` to your code file to get information on your job, where `<region>` and `{id_job}` must be replaced with your values. Example of code in Python.

  ```python
  import requests

  # id_job is the job ID returned by the job creation request.
  # BearerAuth and ovh_token_gra are defined as in the job creation example.
  response_job = requests.get(
      f"https://gra.training.ai.cloud.ovh.net/v1/job/{id_job}",
      auth=BearerAuth(ovh_token_gra),
  )
  ```
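  You can then read the job state from the response. A minimal sketch, assuming the job JSON exposes its state under `status.state` (check the actual response body for the exact field names):

  ```python
  # Assumption: the job JSON carries its state under status.state.
  job_info = response_job.json()
  state = job_info.get("status", {}).get("state")
  print(f"Job {id_job} is in state {state}")
  ```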
- Add a `GET` request to `https://<region>.training.ai.cloud.ovh.net/v1/job/{id_job}/log` to your code file to get your job logs, where `<region>` and `{id_job}` must be replaced with your values. Example of code in Python.

  ```python
  import requests

  # id_job is the job ID returned by the job creation request.
  # BearerAuth and ovh_token_gra are defined as in the job creation example.
  response_job_logs = requests.get(
      f"https://gra.training.ai.cloud.ovh.net/v1/job/{id_job}/log",
      auth=BearerAuth(ovh_token_gra),
  ).text

  for line in response_job_logs.splitlines():
      print(line, flush=True)
  ```
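If you want to wait for the job to finish before reading the final logs, a simple polling sketch could look like the following. The terminal state names (`DONE`, `FAILED`, `ERROR`, `INTERRUPTED`) and the `status.state` field are assumptions; verify them against the AI Training API responses.

```python
import time

import requests

# Polling sketch: re-fetch the job until it reaches a terminal state,
# then print the full logs once. State names are assumptions.
TERMINAL_STATES = {"DONE", "FAILED", "ERROR", "INTERRUPTED"}

while True:
    job_info = requests.get(
        f"https://gra.training.ai.cloud.ovh.net/v1/job/{id_job}",
        auth=BearerAuth(ovh_token_gra),
    ).json()
    if job_info.get("status", {}).get("state") in TERMINAL_STATES:
        break
    time.sleep(30)  # Avoid hammering the API.

logs = requests.get(
    f"https://gra.training.ai.cloud.ovh.net/v1/job/{id_job}/log",
    auth=BearerAuth(ovh_token_gra),
).text
print(logs)
```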