TF-Replicator: Distributed Machine Learning for Researchers

Building a Platform for AI Research at DeepMind

By collaborating closely with researchers throughout the design and implementation of TF-Replicator, we were able to build a library that allows users to easily scale computation across many hardware accelerators, while leaving them with the control and flexibility required to do cutting-edge AI research. For example, we added MPI-style communication primitives such as all-reduce following discussions with researchers. TF-Replicator and other shared infrastructure allow us to build increasingly complex experiments on robust foundations and quickly spread best practices throughout DeepMind.

At the time of writing, TF-Replicator is the most widely used interface for TPU programming at DeepMind. While the library itself is not constrained to training neural networks, it is most commonly used for training on large batches of data. The BigGAN model, for example, was trained on batches of size 2048 across up to 512 cores of a TPUv3 pod. In reinforcement learning agents with a distributed actor-learner setup, such as our importance weighted actor-learner architectures, scalability is achieved by having many actors generate new experiences by interacting with the environment. The learner then processes this data to improve the agent’s policy, which is represented as a neural network. To cope with an increasing number of actors, TF-Replicator can be used to distribute the learner across many hardware accelerators. These and other examples are described in more detail in our arXiv paper.
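To make the large-batch case concrete, here is a minimal pure-Python sketch of data-parallel training with an all-reduce. It is not TF-Replicator's API: the function names (per_replica_batch_size, local_gradient, all_reduce_sum, data_parallel_gradient) and the toy one-parameter regression loss are assumptions chosen for illustration, and the "replicas" are simulated in a single process rather than placed on real accelerators.

```python
from typing import List


def per_replica_batch_size(global_batch: int, num_replicas: int) -> int:
    """Examples each replica sees when a global batch is split evenly."""
    if global_batch % num_replicas != 0:
        raise ValueError("global batch must divide evenly across replicas")
    return global_batch // num_replicas


def local_gradient(w: float, xs: List[float], ys: List[float]) -> float:
    """Sum of d/dw [0.5 * (w*x - y)**2] over one replica's shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))


def all_reduce_sum(values: List[float]) -> List[float]:
    """Simulated all-reduce: every replica receives the sum over all replicas."""
    total = sum(values)
    return [total for _ in values]


def data_parallel_gradient(
    w: float, xs: List[float], ys: List[float], num_replicas: int
) -> List[float]:
    """Return the mean gradient as seen by each (simulated) replica."""
    shard = per_replica_batch_size(len(xs), num_replicas)
    # Each replica computes a gradient on its own slice of the batch.
    local = [
        local_gradient(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_replicas)
    ]
    # After the all-reduce, every replica holds the same global sum.
    reduced = all_reduce_sum(local)
    return [g / len(xs) for g in reduced]


# Toy data: y = 2x, evaluated at w = 1.
xs = [float(i) for i in range(8)]
ys = [2.0 * x for x in xs]
replica_grads = data_parallel_gradient(1.0, xs, ys, num_replicas=4)
full_batch_grad = local_gradient(1.0, xs, ys) / len(xs)
```

In this sketch every replica ends up with the same averaged gradient as a single device processing the whole batch would compute, which is what lets the global batch size grow with the number of accelerators. With the figures quoted above, an even split of a batch of 2048 across 512 cores gives per_replica_batch_size(2048, 512) = 4 examples per core.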

TF-Replicator is just one of many examples of impactful technology built by DeepMind’s Research Platform Team. Many of DeepMind’s breakthroughs in AI, from AlphaGo to AlphaStar, were enabled by the team. If you share our mission and are excited about accelerating state-of-the-art AI research, look out for open Software Engineering positions in Research Platform at https://deepmind.com/careers (machine learning experience is optional for these roles).

This work was completed by the Research Platform Team at DeepMind. We’d like to thank Frederic Besse, Fabio Viola, John Aslanides, Andy Brock, Aidan Clark, Sergio Gómez Colmenarejo, Karen Simonyan, Sander Dieleman, Lasse Espeholt, Akihiro Matsukawa, Tim Harley, Jean-Baptiste Lespiau, Koray Kavukcuoglu, Dan Belov and many others at DeepMind for their valuable feedback throughout the development of TF-Replicator. We’d also like to thank Priya Gupta, Jonathan Hseu, Josh Levenberg, Martin Wicke and others at Google for making these ideas available to all TensorFlow users as part of tf.distribute.Strategy.

Source: https://deepmind.com/blog/article/tf-replicator-distributed-machine-learning
