Send Data to BigQuery
1. Log in to BigQuery. To do so, sign in to GCP (Google Cloud Platform) with your Google account.
2. Once connected to BigQuery, create a new project or use an existing one. We can now start using BigQuery and its many features.
3. Create a service account. Navigate to the Service Accounts page of the Google Cloud console (IAM & Admin > Service Accounts) and click CREATE SERVICE ACCOUNT. A page opens.
4. Fill in the required information and follow the steps that appear in the Google Cloud console.
5. Click DONE to validate your service account creation. Your service account appears in the service account list of your project.
6. From the Actions column of the service account you created, click the kebab menu to create your service account key. A page opens.
7. Click ADD KEY, then Create new key. A pop-up window opens.
8. Select JSON for the Key type.
9. Click CREATE. This generates and downloads a JSON file containing your service account key. Store this key safely, as you will need it in your Saagie job to authenticate with BigQuery.
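Before building the job, you can optionally check the downloaded key on your own machine with the google-auth library. This is only a sanity check, not part of the job code; my-new-project-397614.json stands for the name of your own key file.

# Optional sanity check of the downloaded service account key.
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file("my-new-project-397614.json")
print(credentials.service_account_email)  # the service account this key belongs to
print(credentials.project_id)             # the GCP project it authenticates against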
10. Create the files needed to run your job:
- The .json file generated in step 9, which contains your service account key. The JSON key is required for authentication.
- The requirements.txt file, which lists all the Python libraries to install, with their versions. This file is needed to import the various Python packages used in our job, such as the BigQuery client API, Pandas, etc. It is as follows:

requirements.txt
pandas==2.0.3
protobuf==4.24.3
google-api-core==2.11.1
google-auth==2.21.0
google-cloud-bigquery==3.11.3
google-cloud-core==2.3.3
google-crc32c==1.5.0
google-resumable-media==2.5.0
googleapis-common-protos==1.59.1
datasets==2.14.5

- The __main__.py file, which contains the Python code of the job:
__main__.py
from google.cloud import bigquery
from google.oauth2 import service_account
import argparse
import logging
import pandas as pd
from datasets import load_dataset
import os

# Arguments given as parameters to the job.
parser = argparse.ArgumentParser(description='Send IMDB dataset from HuggingFace to BigQuery.')  # (1)
parser.add_argument("--json_path", type=str, help="Path to the json authentication file", required=True)  # (1)
parser.add_argument("--project_id", type=str, help='Big Query project ID', required=True)  # (1)
args = parser.parse_args()  # (1)

logging.getLogger().setLevel(logging.INFO)


def set_credentials(json_path, project_id):  # (2)
    # Authenticate to BigQuery with the service account key and return a client.
    credentials = service_account.Credentials.from_service_account_file(json_path)
    client = bigquery.Client(credentials=credentials, project=project_id)
    return client


def load_dataset_from_hugging_face():  # (3)
    # Download the IMDB dataset from HuggingFace and convert its train split to a Pandas dataframe.
    datasets = load_dataset("imdb")
    train_imdb = datasets["train"].to_pandas()
    return train_imdb


def load_local(client, project_id):  # (4)
    # Shuffle the data and load it into the IMDB_dataset.IMDB table of your project.
    df = load_dataset_from_hugging_face()
    df = df.sample(frac=1, random_state=66).reset_index(drop=True)
    database_path = "IMDB_dataset.IMDB"
    client.create_dataset(f"{project_id}.IMDB_dataset", exists_ok=True)  # create the dataset if it does not exist yet
    job = client.load_table_from_dataframe(df, f"{project_id}.{database_path}")
    job.result()  # wait for the load job to finish
    # Expose the table location as a pipeline environment variable for the next job.
    if not os.path.exists("/workdir/"):
        os.makedirs("/workdir/")
    with open("/workdir/output-vars.properties", "w") as f:
        f.write(f"DATABASE={database_path}")
    return job


client = set_credentials(args.json_path, args.project_id)  # (5)
result = load_local(client, args.project_id)  # (6)
logging.info("DATASET LOADED INTO BIGQUERY")  # (7)
Where:
(1) The parser retrieves the arguments that are given as parameters to the job.
(2) The set_credentials function logs in to BigQuery via the Python client API using the JSON key and the project identifier, project_id. This authentication ensures a secure connection to BigQuery. Once logged in, you can send queries to BigQuery and start working with your data.
(3) The load_dataset_from_hugging_face function loads the IMDB data via the HuggingFace datasets library and converts it to a Pandas dataframe.
(4) The load_local function sends the data to your BigQuery project. It also writes the data location to a pipeline environment variable, which is automatically created in Saagie from our code. This information will be required in our next job (see the sketch just after this list).
(5) This line creates the client with the information returned by our set_credentials function.
(6) This line uses the created client to authenticate to BigQuery and calls our load_local function to send the data to your BigQuery project.
(7) The code of your job ends with a log message to check that everything went well.
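For reference, here is how the next job of the pipeline could read this location back. This is a minimal sketch, assuming Saagie exposes the pipeline variable written to output-vars.properties as an environment variable named DATABASE; it is not part of this job's code.

import os

# Read the table location written by the previous job (assumption: the pipeline
# variable DATABASE is exposed as an environment variable in downstream jobs).
database_path = os.environ["DATABASE"]  # e.g. "IMDB_dataset.IMDB"
print(f"Reading data from {database_path}")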
11. Zip these three files into a .zip archive.
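On Linux or macOS, this can be done from a terminal, for example with the command below; the JSON file name is a placeholder and must match yours, and the three files end up at the root of the archive.

zip job.zip my-new-project-397614.json __main__.py requirements.txt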
12. Now, in Saagie, create a new project or use an existing one. If you use an existing project, make sure that the Python job technology from the official Saagie repository is enabled in your project. You can verify this in the Job Technologies tab of your project settings.
13. Create a project environment variable PROJECT_ID that contains the ID of your BigQuery project. It will be used when creating your job. Each job of our pipeline must contain the authentication code using the JSON key and the project_id of the BigQuery project. This is why we store the project_id in an environment variable, so that it is accessible to other jobs.
14. In your Saagie project, create a new job and select the Python technology.
15. Keep the default runtime context.
16. Add the zipped file created in step 11, which includes the .json, __main__.py, and requirements.txt files.
17. Enter the command to run your Python code as follows:

python __main__.py --json_path ./my-new-project-397614.json --project_id $project_id
Where:
- my-new-project-397614.json must be replaced with the name of your JSON file.
- __main__.py can be replaced with the name of your Python code file, if different.
- The $project_id argument retrieves the value of the PROJECT_ID project environment variable created earlier.
18. Once created, click Run to launch your job.
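Once the job has completed successfully, you can optionally check from any machine that holds the JSON key that the table was created and filled. The following is a sketch, not part of the job; the key file name and project ID are placeholders, and the table path matches the IMDB_dataset.IMDB used in the code above.

# Optional check that the IMDB table exists and contains rows.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file("my-new-project-397614.json")  # your key file
client = bigquery.Client(credentials=credentials, project="my-new-project-397614")                 # your project ID

table = client.get_table("my-new-project-397614.IMDB_dataset.IMDB")  # raises NotFound if the load failed
print(f"{table.num_rows} rows in {table.full_table_id}")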
In the next article, we will take a closer look at the IMDB data pre-processing stage. Stay tuned!