At one point or another, many of us have used a local computing environment for machine learning (ML). That may have been a notebook computer or a desktop with a GPU. For some problems, a local environment is more than enough. Plus, there’s a lot of flexibility. Install Python, install JupyterLab, and go!
What often happens next is that model training just takes too long. Add a new layer, change some parameters, and wait nine hours to see if the accuracy improved? No thanks. Moving to a cloud computing environment gives you access to a wide variety of powerful machine types. That same code might run orders of magnitude faster in the cloud.
Customers can use Deep Learning VM images (DLVMs), which ensure that ML frameworks, drivers, accelerators, and hardware all work smoothly together with no extra configuration. Notebook instances based on DLVMs are also available and provide easy access to JupyterLab.
Benefits of using the Vertex AI custom training service
Using VMs in the cloud can make a huge difference in productivity for ML teams. There are some great reasons to go one step further and use our new Vertex AI custom training service. Instead of training your model directly within your notebook instance, you can submit a training job from your notebook.
The training job automatically provisions computing resources and de-provisions them when the job is complete, so you don't have to worry about leaving a high-performance virtual machine running.
The training service can help to modularize your architecture. As we’ll discuss further in this post, you can put your training code into a container to operate as a portable unit. The training code can have parameters passed into it, such as input data location and hyperparameters, to adapt to different scenarios without redeployment. The training code can also export the trained model file, so that other AI services can work with the model in a decoupled manner.
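To make the idea of parameterized training code concrete, here is a minimal sketch of what a container entry point could look like. It is not taken from the example notebook: the flag names (`--data-uri`, `--epochs`, `--learning-rate`, `--model-dir`), the bucket path, and the `train_and_export` helper are all hypothetical, and the "training" step is a stand-in that only writes the run settings to a local file. A real job would typically write the model to Cloud Storage instead.

```python
import argparse
import json
import os
import tempfile


def parse_args(argv=None):
    """Parse the settings passed to the training code at launch (hypothetical flags)."""
    parser = argparse.ArgumentParser(description="Illustrative training entry point")
    parser.add_argument("--data-uri", required=True, help="Location of the input data")
    parser.add_argument("--epochs", type=int, default=10, help="Number of training epochs")
    parser.add_argument("--learning-rate", type=float, default=0.001, help="Optimizer step size")
    parser.add_argument("--model-dir", default="model_output", help="Where to write the model")
    return parser.parse_args(argv)


def train_and_export(args):
    """Stand-in for real training: saves the run settings as the exported artifact."""
    os.makedirs(args.model_dir, exist_ok=True)
    artifact_path = os.path.join(args.model_dir, "model.json")
    with open(artifact_path, "w") as f:
        json.dump(
            {
                "data_uri": args.data_uri,
                "epochs": args.epochs,
                "learning_rate": args.learning_rate,
            },
            f,
        )
    return artifact_path


# Simulate the arguments a training job might pass to the container.
demo_args = parse_args(
    [
        "--data-uri", "gs://example-bucket/data.csv",  # hypothetical location
        "--epochs", "3",
        "--model-dir", tempfile.mkdtemp(),  # local directory, for illustration only
    ]
)
artifact = train_and_export(demo_args)
print("Exported artifact to", artifact)
```

Because every setting arrives as an argument, the same container image can be reused with a different dataset or different hyperparameters just by changing the arguments on the job, with no rebuild required.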
The training service also supports reproducibility. Each training job is tracked with inputs, outputs, and the container image used. Log messages are available in Cloud Logging, and jobs can be monitored while running.
The training service also supports distributed training, which means that you can train models across multiple nodes in parallel. That translates into faster training times than would be possible within a single VM instance.
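As a rough illustration of how a multi-node job might be described, the sketch below builds a list of worker pool definitions in plain Python without calling any Google Cloud API. The field names (`machine_spec`, `replica_count`, `container_spec`) are meant to mirror the worker pool structure used by Vertex AI custom jobs, but treat the exact shape as an assumption and check it against the current documentation. The `build_worker_pool_specs` helper, the image URI, and the machine type default are hypothetical.

```python
def build_worker_pool_specs(image_uri, args, machine_type="n1-standard-4", worker_count=0):
    """Return worker pool dicts: one primary replica plus an optional pool of workers."""

    def pool(replica_count):
        return {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": replica_count,
            "container_spec": {"image_uri": image_uri, "args": list(args)},
        }

    specs = [pool(1)]  # the first pool holds the single primary replica
    if worker_count > 0:
        specs.append(pool(worker_count))  # additional replicas that train in parallel
    return specs


specs = build_worker_pool_specs(
    image_uri="gcr.io/example-project/trainer:latest",  # hypothetical image
    args=["--epochs", "3"],
    worker_count=3,
)
total_replicas = sum(p["replica_count"] for p in specs)
print("Pools:", len(specs), "Total replicas:", total_replicas)
```

With `worker_count=0` the same helper describes an ordinary single-node job, so moving to distributed training becomes a change in configuration. The training code itself still needs a distribution strategy from its ML framework to make use of the extra nodes.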
In this blog post, we are going to explain how to use the custom training service, using code snippets from a Vertex AI example. The notebook we’re going to use covers the end-to-end process of custom training and online prediction. The notebook is part of the ai-platform-samples repo, which has many useful examples of how to use Vertex AI.