PyTorch on Google Cloud: How To train PyTorch models on AI Platform

Machine Learning Specialist, Cloud Customer Engineer

In the snippet above, notice that the encoder (also referred to as the base model) weights are not frozen. This is why a very small learning rate (2e-5) is chosen, to avoid destroying the pre-trained representations. The learning rate and other hyperparameters are captured in the TrainingArguments object. During training we capture only accuracy; you can modify the compute_metrics function to capture and report other metrics.
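As a concrete sketch of that metrics hook (illustrative; the exact code is in the notebook), compute_metrics receives the model's raw predictions and gold labels from the Trainer and returns a dict of named metric values:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Compute accuracy from model logits and gold labels.

    Mirrors the signature transformers.Trainer expects: it receives an
    EvalPrediction (a (predictions, label_ids) pair) and returns a
    dict mapping metric names to values.
    """
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class per example
    return {"accuracy": float((preds == labels).mean())}

# How it would plug into the Trainer (sketch; names are illustrative):
#   args = TrainingArguments(output_dir="./out", learning_rate=2e-5, ...)
#   trainer = Trainer(model=model, args=args,
#                     compute_metrics=compute_metrics, ...)

# Toy batch: two of three predictions match the labels.
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([1, 0, 0])
print(compute_metrics((logits, labels)))
```

Returning a dict keyed by metric name is what lets the Trainer log each metric under that name; adding precision, recall, or F1 is just a matter of computing them from the same preds and labels and adding more keys.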

We will explore integration with Cloud AI Platform Hyperparameter Tuning Service in the next post of this series.

Training the model on Cloud AI Platform

While you can experiment locally on your AI Platform Notebooks instance, larger datasets or models often require vertically scaled compute resources or horizontally distributed training. The most effective way to meet this need is the AI Platform Training service: it creates the compute resources designated for the task, runs the training job, and deletes those resources once the job finishes.

Before running the training application with AI Platform Training, the training application code, along with its required dependencies, must be packaged and uploaded to a Google Cloud Storage bucket that your Google Cloud project can access. There are two ways to package the application and run it on AI Platform Training:

  1. Package application and Python dependencies manually using Python setup tools
  2. Use custom containers to package dependencies using Docker containers
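With the first approach, a packaged application is typically submitted with the gcloud CLI. A minimal sketch follows; the bucket, region, module path, and runtime/Python versions are placeholders (check the currently supported runtime list for your project):

```shell
# Illustrative job submission to AI Platform Training.
# All names, versions, and paths below are placeholders.
JOB_NAME="pytorch_sentiment_$(date +%Y%m%d_%H%M%S)"

gcloud ai-platform jobs submit training "$JOB_NAME" \
    --region=us-central1 \
    --staging-bucket=gs://your-bucket \
    --package-path=./trainer \
    --module-name=trainer.task \
    --python-version=3.7 \
    --runtime-version=2.1 \
    -- \
    --model-dir="gs://your-bucket/models/$JOB_NAME"

# Arguments after the bare `--` are passed through to the training
# application itself rather than to the service.

# Stream the job's logs while it runs:
gcloud ai-platform jobs stream-logs "$JOB_NAME"
```

With the second approach, you would instead build a Docker image containing the code and dependencies and pass it via the `--master-image-uri` flag.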

You can structure your training code in any way you prefer. Please refer to the GitHub repository or Jupyter Notebook for our recommended approach to structuring training code.

Using Python packaging to build manually

For this sentiment classification task, we have to package the training code together with its standard Python dependencies (transformers, datasets and tqdm), which are declared in the setup.py file. The find_packages() function inside setup.py includes the training code in the package, and the dependencies are listed under install_requires so that AI Platform Training installs them before running the job.
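A minimal setup.py along these lines might look like the following (the package name and version are illustrative; in practice you would pin dependency versions):

```python
from setuptools import find_packages, setup

setup(
    name="trainer",            # illustrative package name
    version="0.1",
    # find_packages() discovers the local package(s) containing the
    # training code (e.g. a trainer/ directory with an __init__.py).
    packages=find_packages(),
    # Dependencies AI Platform Training installs before running the job.
    install_requires=[
        "transformers",
        "datasets",
        "tqdm",
    ],
)
```

Running `python setup.py sdist` produces a source distribution that the gcloud CLI can stage to Cloud Storage on your behalf when you submit the training job.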