Optimize training performance with Reduction Server on Vertex AI

Using Reduction Server with Vertex Training

Reduction Server can be used with any distributed training framework that uses the NVIDIA NCCL library for the all-reduce collective operation. You do not need to change or recompile your training application.

To use Reduction Server with Vertex Training custom jobs, you need to:

  • Choose a topology of Vertex Training worker pools.
  • Install the Reduction Server NVIDIA NCCL transport plugin in your training container image.
  • Configure and submit a Vertex Training custom job that includes a Reduction Server worker pool.

Each of these steps is discussed in detail in the following sections.

Choosing a topology of Vertex Training worker pools

To run a multi-worker training job with Vertex AI, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. The job running on a given node is called a replica, and a group of replicas with the same configuration is called a worker pool. Vertex AI provides up to four worker pools to cover the different types of machine tasks. When using Reduction Server, the first three worker pools are used.

Worker pool 0 configures the primary replica, also called the chief, scheduler, or “master”. This worker generally takes on extra work such as saving checkpoints and writing summary files. There is only ever one chief worker in a cluster, so the replica count for worker pool 0 is always 1.

Worker pool 1 is where you configure the rest of the workers for your cluster. 

Worker pool 2 manages Reduction Server reducers. When choosing the number and type of reducers, you should consider the network bandwidth supported by a reducer replica’s machine type. In GCP, a VM’s machine type defines its maximum possible egress bandwidth. For example, the egress bandwidth of the n1-highcpu-16 machine type is limited to 32 Gbps.

Because reducers perform a very limited function, aggregating blocks of gradients, they can run on relatively low-powered and cost-effective machines. Even with a large number of gradients, this computation does not require accelerated hardware or high CPU or memory resources. However, to avoid network bottlenecks, the total aggregate bandwidth of all replicas in the reducer worker pool must be greater than or equal to the total aggregate bandwidth of all replicas in worker pools 0 and 1, which host the GPU workers.

For example, if you run your distributed training job on eight GPU-equipped replicas that have an aggregate maximum bandwidth of 800 Gbps (GPU-equipped VMs can support up to 100 Gbps egress bandwidth) and you use a reducer machine type with 32 Gbps egress bandwidth, you will need at least 25 reducers.
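
Put differently, the minimum number of reducers is the aggregate egress bandwidth of the GPU worker pools divided by the per-reducer egress bandwidth, rounded up. A minimal sketch of that calculation in Python, using the illustrative numbers above:

    import math

    # Aggregate egress of worker pools 0 and 1 (the GPU workers).
    gpu_replicas = 8
    gpu_egress_gbps = 100      # per GPU-equipped replica
    # Egress of a single reducer replica, e.g. n1-highcpu-16.
    reducer_egress_gbps = 32

    min_reducers = math.ceil(gpu_replicas * gpu_egress_gbps / reducer_egress_gbps)
    print(min_reducers)        # 25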

To maintain the size of the reducer worker pool at a reasonable level, it’s recommended that you use machine types with 32 Gbps egress bandwidth, which is the highest bandwidth available to CPU-only machines in Vertex Training. Based on testing performed on a number of mainstream deep learning models in the computer vision and NLP domains, we recommend using reducers with 16-32 vCPUs and 16-32 GB of RAM. A good starting configuration that should be optimal for a large spectrum of distributed training scenarios is the n1-highcpu-16 machine type.
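
To make this concrete, here is a minimal sketch of such a topology expressed as the worker_pool_specs list that a Vertex Training custom job accepts, written in Python. The machine types, replica counts, and image URIs are illustrative assumptions rather than values from the accompanying sample code; in particular, look up the reducer container image URI in the Vertex AI documentation.

    TRAINER_IMAGE = "gcr.io/my-project/my-trainer:latest"  # hypothetical training image
    REDUCER_IMAGE = (
        "us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest"
    )  # assumed reducer image URI; confirm it in the Vertex AI docs

    # One GPU worker replica template: an a2-highgpu-4g with 4 x A100 (illustrative).
    gpu_worker = {
        "machine_spec": {
            "machine_type": "a2-highgpu-4g",
            "accelerator_type": "NVIDIA_TESLA_A100",
            "accelerator_count": 4,
        },
        "container_spec": {"image_uri": TRAINER_IMAGE},
    }

    worker_pool_specs = [
        # Worker pool 0: the chief. Always exactly one replica.
        {**gpu_worker, "replica_count": 1},
        # Worker pool 1: the remaining GPU workers (8 GPU replicas in total).
        {**gpu_worker, "replica_count": 7},
        # Worker pool 2: Reduction Server reducers on CPU-only n1-highcpu-16 machines.
        {
            "machine_spec": {"machine_type": "n1-highcpu-16"},
            "replica_count": 25,
            "container_spec": {"image_uri": REDUCER_IMAGE},
        },
    ]

You would then pass this list as the worker_pool_specs of the custom job when you configure and submit it, which is the third step listed above.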

Installing the Reduction Server NVIDIA NCCL transport plugin

Reduction Server is implemented as an NVIDIA NCCL transport plugin. This plugin must be installed on the container image that is used to run your training application. 

The code accompanying this article includes a sample Dockerfile that uses the Vertex pre-built TensorFlow 2.5 GPU Docker image as a base image, which comes with the plugin pre-installed. 

You can also install the plugin directly by adding an installation step to your Dockerfile.
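
A minimal sketch of such a step is shown below. It assumes the plugin is distributed through Google’s apt repository under the repository and package names shown; treat these names as assumptions and verify them against the current Vertex AI documentation.

    # Sketch: install the Reduction Server NCCL transport plugin from Google's apt
    # repository (repository and package names are assumptions; check the Vertex AI docs).
    RUN echo "deb https://packages.cloud.google.com/apt google-fast-socket main" | \
            tee /etc/apt/sources.list.d/google-fast-socket.list \
        && curl -s -L https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
        && apt update \
        && apt install -y google-reduction-server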
