Send Data to BigQuery

In this article, we will focus on how to send data to BigQuery from a Python job in Saagie, using the BigQuery Python client library.

  1. Log in to BigQuery. To do so, log in to GCP (Google Cloud Platform) with your Google account.

  2. Once connected to BigQuery, create a new project or use an existing one.
    We can now start using BigQuery and its many features.

  3. Create a service account. Navigate to IAM & Admin > Service Accounts > Create Service Account.


    A page opens.

  4. Fill in the required information and follow the steps that appear in the Google Cloud console.

  5. Click DONE to validate your service account creation.
    Your service account appears in the service account list of your project.

  6. From the Actions column of the service account you created, click the kebab menu > Manage keys to create your service account key.
    A page opens.

  7. Click Add Key > Create new key.
    A pop-up window opens.

  8. Select JSON for the Key type.

  9. Click CREATE.
    This generates and downloads a JSON file containing your service account key. Keep this key safe, as you will need it in your Saagie job to authenticate with BigQuery.
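
    Optionally, before packaging your job, you can check locally that the downloaded key is valid. The following minimal sketch assumes the key file is named my-new-project-397614.json, the example name used later in this article; replace it with the name of your own file:

    from google.oauth2 import service_account

    # Load the service account key downloaded in step 9
    credentials = service_account.Credentials.from_service_account_file("my-new-project-397614.json")

    # The service account e-mail and project ID are read from the key file
    print(credentials.service_account_email)
    print(credentials.project_id)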

  10. Create the files needed to run your job:

    • The JSON file for the service account key

    • The requirements.txt file listing the required Python packages

    • The code file in Python

    The JSON file is the one generated in step 9. It contains your service account key and is required to authenticate with BigQuery.

    The requirements.txt file lists all the Python libraries to be installed, with their versions. It is needed to install the packages used in our job, such as the BigQuery client API and pandas. It is as follows:

    requirements.txt
    pandas==2.0.3
    protobuf==4.24.3
    google-api-core==2.11.1
    google-auth==2.21.0
    google-cloud-bigquery==3.11.3
    google-cloud-core==2.3.3
    google-crc32c==1.5.0
    google-resumable-media==2.5.0
    googleapis-common-protos==1.59.1
    datasets==2.14.5
    The __main__.py file contains the Python code of the job. It is as follows:

    __main__.py
    from google.cloud import bigquery
    from google.oauth2 import service_account
    import argparse
    import logging
    import pandas as pd
    from datasets import load_dataset
    import os
    
    parser = argparse.ArgumentParser(description='Send IMDB dataset from HuggingFace to BigQuery.') (1)
    parser.add_argument("--json_path", type=str, help="Path to the json authentication file", required=True) (1)
    parser.add_argument("--project_id", type=str, help='Big Query project ID', required=True) (1)
    
    args = parser.parse_args() (1)
    
    logging.getLogger().setLevel(logging.INFO)  # Log INFO-level messages
    
    def set_credentials(json_path, project_id): (2)
    
        credentials = service_account.Credentials.from_service_account_file(json_path)
        client = bigquery.Client(credentials=credentials, project=project_id)
        return client
    
    def load_dataset_from_hugging_face(): (3)
    
        datasets = load_dataset("imdb")
        train_imdb = datasets["train"].to_pandas()
        return train_imdb
    
    def load_local(client, project_id): (4)
    
        df = load_dataset_from_hugging_face()
        df = df.sample(frac=1, random_state=66).reset_index(drop=True)
    
        database_path = "IMDB_dataset.IMDB"
        job = client.load_table_from_dataframe(df, f"{project_id}.{database_path}")
        job.result()  # Wait for the load job to complete
        if not os.path.exists("/workdir/"):
            os.makedirs("/workdir/")
        # Write the table location to a properties file so that Saagie
        # creates a pipeline environment variable from it
        with open("/workdir/output-vars.properties", "w") as f:
            f.write(f"DATABASE={database_path}")
        return job
    
    client = set_credentials(args.json_path, args.project_id) (5)
    result = load_local(client, args.project_id) (6)
    logging.info("DATASET LOADED INTO BIGQUERY") (7)

    Where:

    1 This is the parser used to retrieve the arguments that are given as parameters to the job.
    2 The set_credentials function logs in to BigQuery via the Python client API, using the JSON key and the project identifier, project_id. This authentication ensures a secure connection to BigQuery. Once logged in, you can send queries to BigQuery and start working with your data.
    3 The load_dataset_from_hugging_face function loads the IMDB data via the Hugging Face datasets library and converts it to a pandas DataFrame.
    4 The load_local function sends the data to your BigQuery project. Note that the destination dataset (here, IMDB_dataset) must already exist in your project. The function also writes the data location to a pipeline environment variable, which is automatically created in Saagie from our code. This information will be required in our next job.
    5 This line creates the authenticated client by calling our set_credentials function.
    6 This line uses the authenticated client and our load_local function to send the data to your BigQuery project.
    7 The code of your job ends with a log message confirming that everything went well. If you want to double-check the result directly in BigQuery, see the optional sketch below.
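
    The following snippet is not part of the job itself, but it can be handy to confirm that the table was created. It is a minimal sketch that reuses the same authentication as set_credentials and assumes the table path IMDB_dataset.IMDB and the example file and project names used in this article:

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Authenticate exactly as in set_credentials above
    credentials = service_account.Credentials.from_service_account_file("my-new-project-397614.json")
    client = bigquery.Client(credentials=credentials, project="my-new-project-397614")

    # Count the rows of the table loaded by the job
    query = "SELECT COUNT(*) AS nb_rows FROM `IMDB_dataset.IMDB`"
    for row in client.query(query).result():
        print(f"Rows loaded: {row.nb_rows}")
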
  11. Zip these three files into a .zip archive.
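
    For example, from the directory containing these files, you can build the archive with a command like the following, where job.zip is just an example name for the archive:

      zip job.zip __main__.py requirements.txt my-new-project-397614.json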

  12. Now, in Saagie, create a new project or use an existing one.

    If you use an existing project, make sure that the Python job technology from the official Saagie repository is enabled in your project. You can verify this in the Job Technologies tab of your project settings.
  13. Create a project environment variable PROJECT_ID that contains the ID of your BigQuery project. It will be used when creating your job.

    Each job of our pipeline must authenticate with the JSON key and the project_id of the BigQuery project. This is why we store the project_id in an environment variable, so that it is accessible to all jobs.
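
    As an illustration only (this is not part of the job created in this article), a downstream Saagie job could read the PROJECT_ID project environment variable and the DATABASE pipeline environment variable written by load_local, assuming Saagie exposes them as standard environment variables:

    import os

    # Project environment variable created in step 13
    project_id = os.environ["PROJECT_ID"]

    # Pipeline environment variable written by load_local in the previous job
    database_path = os.environ["DATABASE"]

    # Fully qualified table to read in the next job, e.g. for pre-processing
    table_id = f"{project_id}.{database_path}"
    print(f"The next job will read from {table_id}")
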
  14. Create your Python job.

    1. Keep the default runtime context.

    2. Add the zipped file created in step 11, which includes the .json, __main__.py, and requirements.txt files.

    3. Enter the command to run your Python code as follows:

      python __main__.py --json_path ./my-new-project-397614.json --project_id $PROJECT_ID

      Where:

      • my-new-project-397614.json must be replaced with the name of your JSON file.

      • __main__.py can be replaced by the name of your Python code file, if different.

      • The $PROJECT_ID argument retrieves the value of the PROJECT_ID project environment variable created earlier.

  15. Once created, click Run to launch your job.

In the next article, we will take a closer look at the IMDB data pre-processing stage. Stay tuned!