Launch GPU On Demand With OVH AI Training

The OVHcloud AI Training solution lets you train your models easily, with just a few clicks or commands. It runs your training jobs on the cloud compute resources of your choice, CPU or GPU. Use this tutorial to combine AI Training with your Saagie platform.

Before you begin:
  1. Log in to your OVHcloud Control Panel.

  2. Create a user on your OVHcloud account with the following roles:

    • AI Training Operator

    • AI Training Reader

    • ObjectStore Operator

    To manage users, navigate to Public Cloud > Project Management > Users & Roles. For more information, see the OVHcloud instructions on how to manage AI users and roles.

  3. Generate an application token for OVH AI tools. Navigate to your AI Dashboard > Users & Tokens > Manage access via application tokens > Create a token. For more information, see the OVHcloud instructions on users and tokens.
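    The application token generated here is passed as a Bearer token on every call to the AI Training API. As a minimal sketch (the token value is a placeholder), the authorization header it produces looks like this:

```python
# Placeholder: paste the application token generated in the AI Dashboard.
ovh_app_token = "YOUR_APPLICATION_TOKEN"

def bearer_headers(token):
    """Build the Authorization header expected by the AI Training API."""
    return {"Authorization": f"Bearer {token}"}

# Every AI Training call then carries this header, for example:
# requests.get("https://gra.training.ai.cloud.ovh.net/v1/job",
#              headers=bearer_headers(ovh_app_token))
print(bearer_headers(ovh_app_token)["Authorization"])
```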

  4. Configure your OVHcloud Object Storage to store your job code or data. These storage spaces are accessible through an API interface and can be of different storage classes.

    1. Choose your object storage class according to your needs from the following:

      • The S3 object storage with the Standard object storage – S3 API or High Performance object storage – S3 API classes.

        The S3 storage classes are compatible with the S3 protocol and regularly updated.
      • The SWIFT object storage with the Standard object storage – SWIFT API class.

        The SWIFT storage classes are from older generations and no longer benefit from further developments.
    2. Create your Object Storage bucket. Navigate to Public Cloud > Object Storage > My containers > Create an object container. For more information, see the OVHcloud instructions on how to create a bucket.

      • Alternatively, you can create an S3 bucket with the ovhai CLI. For more detailed information, see the OVHcloud documentation on S3 buckets.

      • Alternatively, you can create a Swift bucket with a POST request to the /cloud/project/{serviceName}/region/{regionName}/storage REST API endpoint. For more detailed information, see the OVHcloud documentation on Swift Object Storage.
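    As a sketch, this POST request can be issued with the official ovh Python client (pip install ovh). The call is shown commented out because it requires API credentials, and the containerName body parameter is an assumption to verify in the OVHcloud API console:

```python
# import ovh  # Official OVHcloud API client: pip install ovh

def swift_storage_path(service_name, region_name):
    """Build the API path for creating a Swift container in a given region."""
    return f"/cloud/project/{service_name}/region/{region_name}/storage"

# Hypothetical credentials, created at https://api.ovh.com/createToken/:
# client = ovh.Client(endpoint="ovh-eu",
#                     application_key="YOUR_APP_KEY",
#                     application_secret="YOUR_APP_SECRET",
#                     consumer_key="YOUR_CONSUMER_KEY")
# client.post(swift_storage_path("YOUR_PROJECT_ID", "GRA"),
#             containerName="YOUR_BUCKET_NAME")  # containerName: assumed parameter name
```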

  5. Put your code or data in the newly created Object Storage bucket.

    • Example with the S3 API

    • Example with the Swift API

    Here is an example of code in Python to upload your data to Object Storage with the S3 API.

    import logging
    import os

    import boto3
    from botocore.exceptions import ClientError
    
    def upload_file(s3_client, file_name, bucket, object_name=None):
        """Upload a file to an S3 bucket
    
        :param s3_client: The boto3 client.
        :param file_name: The file to upload.
        :param bucket: The bucket to upload to.
        :param object_name: The S3 object name. If not specified, the file_name value is used.
        :return: Returns True if the file was uploaded, else returns False.
        """
    
        # If the S3 object_name value is not specified, use the file_name value.
        if object_name is None:
            object_name = os.path.basename(file_name)
    
        # Uploading the file
        try:
            s3_client.upload_file(file_name, bucket, object_name)
        except ClientError as e:
            logging.error(e)
            return False
        return True
    
    s3_bucket_name = "YOUR_BUCKET_NAME"
    s3_client = boto3.client("s3",
                             endpoint_url=os.environ["AWS_ENDPOINT_URL"],
                             region_name=os.environ["AWS_REGION_NAME"],
                             aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
                             aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"])
    
    upload_file(s3_client, "./resources/__main__.py", s3_bucket_name, object_name="__main__.py")

    Here is an example of code in Python to upload your data to Object Storage with the Swift API.

    import logging
    import os

    import requests
    import swiftclient
    from swiftclient.exceptions import ClientException
    
    
    def upload_file_swift(url, token, file_path, bucket_name, name):
        """Upload a file to a Swift container.

        :param url: The Swift bucket endpoint, as a string.
        :param token: The OpenStack token, as a string. You can create one in the OVHcloud Control Panel by navigating to Users & Roles > Select a user > Select the ellipsis menu > Generate an OpenStack token.
        :param file_path: The path of the file to upload, as a string.
        :param bucket_name: The bucket to upload to, as a string.
        :param name: The object name, as a string.
        :return: Returns True if the file was uploaded, else raises a ClientException.
        """
        try:
            with open(file_path, 'rb') as f:
                file_data = f.read()
            swiftclient.client.put_object(url=url, token=token, container=bucket_name, name=name, contents=file_data)
        except ClientException as e:
            logging.error(e)
            raise
        return True
    
    ovh_token_data = {
        "auth": {
            "identity": {
                "methods": [
                    "password"
                ],
                "password": {
                    "user": {
                        "name": os.environ["OVH_USER_LOGIN"],
                        "domain": {
                            "id": "default"
                        },
                        "password": os.environ["OVH_USER_PWD"]
                    }
                }
            },
            "scope": {
                "project": {
                    "name": os.environ["OVH_TENANT_NAME"],
                    "domain": {
                        "id": "default"
                    }
                }
            }
        }
    }
    
    bucket_name = "YOUR_BUCKET_NAME"
    bucket_endpoint = os.environ["BUCKET_ENDPOINT_URL"]  # https://storage.<region>.cloud.ovh.net/v1/AUTH_{TENANT_ID}/
    res_get_token = requests.post(url="https://auth.cloud.ovh.net/v3/auth/tokens",
                                  json=ovh_token_data,
                                  headers={"Content-Type": "application/json"})
    openstack_token = res_get_token.headers["x-subject-token"]
    upload_file_swift(bucket_endpoint, openstack_token, "./resources/__main__.py", bucket_name, "__main__.py")
  6. Create your job in Saagie to use GPU or CPU on demand.

    Example of job code in Python.
    import os

    import requests
    
    class BearerAuth(requests.auth.AuthBase):
        def __init__(self, token):
            self.token = token
    
        def __call__(self, r):
            r.headers["authorization"] = "Bearer " + self.token
            return r
    
    s3_bucket_name = "YOUR_BUCKET_NAME"
    command_line = "python ~/sample_project/__main__.py" # You can customize the command line to suit your needs.
    ovh_token_gra = os.environ["OVH_TOKEN"] # The token that you have created at step 3.
    ovh_new_job = {
            "image": "YOUR_IMAGE",
            "region": "GRA", # The region where the job will be run.
            "volumes": [
                {
                    "dataStore": {
                        "alias": "GRA", # If you use Object Storage with the S3 API, use the custom alias you have created.
                        "container": s3_bucket_name,
                        "prefix": ""
                    },
                    "mountPath": "/workspace/sample_project",
                    "permission": "RW",
                    "cache": False
                }
            ],
            "name": "YOUR_JOB_NAME",
            "unsecureHttp": False,
            "resources": {
                "gpu": 1, # The number of GPU you need.
                "flavor": "ai1-1-gpu"
            },
            "command": [
                "bash",
                "-c",
                command_line
            ],
            "envVars": [
                {
                    "name": "YOUR_ENV_VAR_NAME",
                    "value": os.environ['YOUR_ENV_VAR_NAME']
                }
                # You can set other environment variables here.
            ],
            "sshPublicKeys": []
        }
    
    # Sends the request to create the job.
    response_create_job = requests.post("https://gra.training.ai.cloud.ovh.net/v1/job",
                                         auth=BearerAuth(ovh_token_gra),
                                         json=ovh_new_job)

    In your code, you must send a POST request to https://<region>.training.ai.cloud.ovh.net/v1/job, authenticated with the user or token created at step 3, where <region> is replaced with gra or bhs. This POST request returns the job ID.

    In the request body, specify the number of GPUs or CPUs, and link the job to your Object Storage.
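    The request body shown in the example above can also be generated by a small helper that makes the GPU count and the bucket name explicit parameters. This is only a sketch: the field names follow the example above, while the flavor and mount path are assumptions to adapt to your setup.

```python
def build_job_spec(name, image, bucket, gpus=1, region="GRA",
                   command="python ~/sample_project/__main__.py"):
    """Build a minimal AI Training job body linking the job to an Object Storage bucket."""
    return {
        "image": image,
        "region": region,
        "name": name,
        "resources": {"gpu": gpus, "flavor": "ai1-1-gpu"},  # flavor: assumed value
        "volumes": [{
            "dataStore": {"alias": region, "container": bucket, "prefix": ""},
            "mountPath": "/workspace/sample_project",  # assumed mount path
            "permission": "RW",
            "cache": False,
        }],
        "command": ["bash", "-c", command],
    }

# The resulting dict is what you would pass as the json= argument of the POST request.
spec = build_job_spec("demo-job", "YOUR_IMAGE", "YOUR_BUCKET_NAME", gpus=2)
```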

  7. Run your job.

You have launched your request for GPU or CPU on demand.

Monitor Your Job

You can monitor your job. Here are two examples of how you can get global information about your job and retrieve its logs.

  1. Add the following code to your job’s code file to get information about the job and its logs:

    • Get job information

    • Retrieve job logs

    Add the https://<region>.training.ai.cloud.ovh.net/v1/job/{id_job} GET request to your code file to get information on your job, where <region> and {id_job} must be replaced with your values.

    Example of code in Python.
    import requests
    
    response_job = requests.get(f"https://gra.training.ai.cloud.ovh.net/v1/job/{id_job}",
                                auth=BearerAuth(ovh_token_gra))

    Add the https://<region>.training.ai.cloud.ovh.net/v1/job/{id_job}/log GET request to your code file to retrieve your job logs, where <region> and {id_job} must be replaced with your values.

    Example of code in Python.
    import requests
    
    # id_job is the job ID returned by the job creation request
    response_job_logs = requests.get(f"https://gra.training.ai.cloud.ovh.net/v1/job/{id_job}/log",
                                     auth=BearerAuth(ovh_token_gra)).text
    
    for line in response_job_logs.splitlines():
        print(line, flush=True)
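    If you want your Saagie job to block until the AI Training job finishes, you can poll the job endpoint until it reports a terminal state. This is a sketch: the status.state field and the set of terminal state names are assumptions to check against the responses of your own jobs.

```python
import time

# Assumed terminal states for an AI Training job; verify them against the
# "status" object returned by GET /v1/job/{id} for your own jobs.
TERMINAL_STATES = {"DONE", "FAILED", "ERROR", "INTERRUPTED"}

def wait_for_job(fetch_state, poll_seconds=30, sleep=time.sleep):
    """Poll fetch_state() until it returns a terminal state, then return it."""
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)

# With the requests-based helpers from the examples above, fetch_state would be:
# fetch_state = lambda: requests.get(
#     f"https://gra.training.ai.cloud.ovh.net/v1/job/{id_job}",
#     auth=BearerAuth(ovh_token_gra)).json()["status"]["state"]
# final_state = wait_for_job(fetch_state)
```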