Push Models to Hugging Face With Saagie

Hugging Face is an open-source platform where the machine learning community collaborates on models, datasets, and applications. It is often called the GitHub of machine learning because it lets developers share and test their work openly.

Saagie suggests cataloging each machine learning model experiment with Hugging Face. This tutorial shows you how to share your models on the Hugging Face Model Hub from Saagie. You can then use them with our Saagie Hugging Face Model Server add-on.

  • This page assumes that you have a basic knowledge of deep learning.

  • The code examples provided here are based on PyTorch. They can be run from a Saagie job, a Jupyter notebook in Saagie, Google Colab, or a local Python environment.

  1. Install the Transformers and Datasets libraries.

    !pip install transformers[torch] -U
    !pip install datasets
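    You can optionally check that the installation succeeded before going further:

    import transformers
    import datasets

    ## Print the installed versions to confirm the setup.
    print("transformers:", transformers.__version__)
    print("datasets:", datasets.__version__)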
  2. Load a pre-trained model from Hugging Face.

    1. Find the model you need on the Model Hub. In our example, we use the bert-tiny model.

    2. Load the model with the following code.

      from transformers import AutoModelForSequenceClassification
      
      ## Example NLP model for sentiment analysis
      model_name = "prajjwal1/bert-tiny:main" (1)
      if ':' in model_name:
          model_ver = model_name.split(':')[1]
          model_name = model_name.split(':')[0]
      else:
          model_ver = "main"
      model = AutoModelForSequenceClassification.from_pretrained(model_name, revision=model_ver)

      Where

      1 "prajjwal1/bert-tiny:main" can be replaced by another model name.
  3. Fine-tune your model.

    1. Find the dataset you need on the Hugging Face Hub. In our example, we use the sst2 dataset for textual sentiment classification.

    2. Load and pre-process your datasets with the following code.

      from transformers import AutoTokenizer, DataCollatorWithPadding, Trainer, TrainingArguments
      from datasets import load_dataset
      
      ## Loading Datasets
      dataset = load_dataset("sst2") (1)
      train_dataset = dataset['train']
      valid_dataset = dataset['validation']
      train_subset = 100
      eval_subset = 20
      seed = 42
      repo_name = "MyRepo" (2)
      
      
      ## Pre-processing Datasets
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      def tokenize_function(examples):
          return tokenizer(examples["sentence"], padding="max_length", truncation=True)
      
      tokenized_train = train_dataset.map(tokenize_function, batched=True)
      tokenized_valid = valid_dataset.map(tokenize_function, batched=True)
      small_train_dataset = tokenized_train.shuffle(seed=seed).select(range(train_subset))
      small_eval_dataset = tokenized_valid.shuffle(seed=seed).select(range(eval_subset))
      data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

      Where

      1 "sst2" can be replaced by another dataset name.
      2 "MyRepo" must be replaced by the name of your repository.
    3. Add the following code to configure the hyperparameters and train your model.

      ## Defining hyperparameters for fine-tuning
      training_args = TrainingArguments(
          output_dir=repo_name,
          num_train_epochs=2,
          per_device_train_batch_size=16,
          per_device_eval_batch_size=16,
          logging_dir='./logs',
          logging_steps=10,
      )
      
      
      ## Fine-tune the model with Trainer class (1)
      trainer = Trainer(
          model=model,
          args=training_args,
          train_dataset=small_train_dataset,
          eval_dataset=small_eval_dataset,
          tokenizer=tokenizer,
          data_collator=data_collator,
      )
      
      trainer.train() (2)

      Where

      1 This block creates your Trainer object with your model, training arguments, training and evaluation datasets, tokenizer, and data collator.
      2 This line launches the training and returns its result. Relevant training results are displayed in the log of this step. They should look like the following:
      TrainOutput(global_step=14, training_loss=0.664779714175633,
          metrics={'train_runtime': 2.4304, 'train_samples_per_second': 82.29,
          'train_steps_per_second': 5.76, 'total_flos': 17489048160.0,
          'train_loss': 0.664779714175633, 'epoch': 2.0})
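      Optionally, you can evaluate the fine-tuned model on the evaluation dataset. The compute_metrics function below is a minimal sketch and is not part of the code above; if you want accuracy reported, pass it to the Trainer through its compute_metrics argument before calling trainer.train().

      import numpy as np

      ## Optional: accuracy metric. Pass it to the Trainer with
      ## Trainer(..., compute_metrics=compute_metrics) if you want it reported.
      def compute_metrics(eval_pred):
          logits, labels = eval_pred
          predictions = np.argmax(logits, axis=-1)
          return {"accuracy": float((predictions == labels).mean())}

      ## Evaluate the fine-tuned model on the evaluation dataset.
      print(trainer.evaluate())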
  4. Push your model to Hugging Face. You can do this in one of three ways:

    • Via a Jupyter Notebook or via Google Colab

    • Via a Python script

    • Via the Hugging Face GUI

    1. From a Jupyter Notebook or Google Colab, log in directly to the Hub via the huggingface_hub library using your access token.

      from huggingface_hub import notebook_login
      notebook_login()
    2. Push your model.

      trainer.push_to_hub("MyModelName") (1)

      Where

      1 "MyModelName" must be replaced by the name of your model.
    3. OPTIONAL: Download the model to test its availability.

      from transformers import AutoModelForSequenceClassification
      model_name = "MY_ORGANIZATION/"+ MyRepo (1)
      model = AutoModelForSequenceClassification.from_pretrained(model_name)

      Where

      1 "MY_ORGANIZATION/" and MyRepo must be replaced by your own values.

    When running a Python script, for example via a job on the Saagie platform, use the following code to log in and push your model to Hugging Face:

    from huggingface_hub import login

    login(token="MY_HF_TOKEN") (1)
    trainer.push_to_hub() (2)

    Where

    1 "MY_HF_TOKEN" must be replaced by your Hugging Face access token, since notebook_login() only works in a notebook environment.
    2 As above, the model is pushed to a repository named after the output_dir of your TrainingArguments, here "MyRepo".

    When using the Hugging Face GUI, trained model files must be packaged and manually uploaded to Hugging Face, as described in the Hugging Face documentation.
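    To get those files on disk, you can save the fine-tuned model and tokenizer to a local folder and then upload its contents from your repository page on huggingface.co. This is a minimal sketch; the folder name is only an example.

    ## Save the fine-tuned model and tokenizer locally, then upload the
    ## resulting files (config, weights, tokenizer files) via the GUI.
    output_folder = "./my-finetuned-model"  # example path
    trainer.save_model(output_folder)
    tokenizer.save_pretrained(output_folder)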