
PaLM, or Pathways Language Model, is a 540-billion-parameter model recently introduced by Google AI. It is intriguing not only because of its size or how well it performs, but also because of how it was trained: it was trained with Google's Pathways system, and it avoids pipeline parallelism altogether.
What Is Pipeline Parallelism?
In pipeline parallelism, several interdependent steps are carried out concurrently, with the output of one step serving as the input to the next. Building on the idea of simple task parallelism, the processing task is divided into a series of stages that execute in sequence, each stage typically assigned to its own device. This strategy has traditionally been used to train large language models.
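To make this concrete, here is a minimal, purely illustrative Python sketch of a pipeline schedule (the stage and micro-batch counts are made-up values, not any real system's configuration): a batch is split into micro-batches that flow through a chain of stages, and once the pipeline fills, every stage works concurrently on a different micro-batch.

```python
# Toy pipeline-parallel schedule: S stages, M micro-batches.
# At time t, stage s works on micro-batch (t - s); slots outside the
# valid range are the pipeline "bubble", where the device sits idle.

NUM_STAGES = 4        # hypothetical number of pipeline stages (devices)
NUM_MICROBATCHES = 8  # hypothetical micro-batches per training step

for t in range(NUM_STAGES + NUM_MICROBATCHES - 1):
    row = []
    for s in range(NUM_STAGES):
        mb = t - s  # micro-batch arriving at stage s at time t
        row.append(f"mb{mb}" if 0 <= mb < NUM_MICROBATCHES else "idle")
    print(f"t={t:2d}  " + "  ".join(f"{cell:>5}" for cell in row))
```

The "idle" cells at the start and end of the printed schedule are exactly the bubble overhead discussed below.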
Training Google’s PaLM
Google introduced Pathways last year: a single model trained to carry out millions of tasks. With Pathways, Google demonstrated how one model can handle many different tasks at the same time, build on existing skills, and combine them to become more efficient and faster at picking up new tasks.
Training at this scale is a resource- and time-intensive activity. Pathways could also enable multimodal models, allowing auditory processing, language understanding, and visual perception to happen simultaneously. PaLM, a 540-billion-parameter model, shows that a single model can be trained across multiple TPU v4 Pods.
PaLM is also a dense decoder-only Transformer that achieves state-of-the-art performance over a wide range of tasks. The two TPU v4 Pods on which PaLM is trained are connected over a DCN, or data center network, and the training combines data and model parallelism. Pipelining is commonly used with a DCN because it provides additional parallelization while requiring lower bandwidth.
Pipelining also scales beyond the highest effective scale permitted by data and model parallelism alone. However, it has two major limitations: higher memory bandwidth is required because weights are repeatedly reloaded from memory, and step-time overhead is incurred while devices sit idle in the pipeline "bubble" as it fills and drains.
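As a back-of-the-envelope illustration (assuming a simple GPipe-style schedule with equal per-stage time, which is an assumption rather than a description of any particular system), the fraction of device time lost to the bubble with S stages and M micro-batches is (S - 1) / (M + S - 1):

```python
# Idle ("bubble") fraction for a simple pipeline schedule, assuming every
# stage takes the same amount of time: bubble = (S - 1) / (M + S - 1).

def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(bubble_fraction(4, 8))   # ~0.27: over a quarter of the step is idle
print(bubble_fraction(4, 64))  # ~0.045: more micro-batches shrink the bubble
```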
By going pipeline-free, PaLM avoids these drawbacks. Instead, it uses Pathways' client-server architecture to implement two-way data parallelism at the Pod level: a single Python client dispatches half of the training batch to each Pod.
Each Pod then uses within-pod model and data parallelism to perform the forward and backward computations and evaluate gradients in parallel. Once a Pod has computed the gradients on its portion of the batch, it transfers them to the remote Pod.
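The within-pod part can be pictured with a small JAX sketch (a hypothetical toy model and function names, not PaLM's actual training code): each device computes gradients on its shard of the Pod's data, and an all-reduce leaves every device in the Pod holding the same averaged gradient.

```python
# Sketch of within-pod data parallelism in JAX (illustrative only; real PaLM
# also shards the model itself across the pod's TPU chips).
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy linear model standing in for the 540B-parameter Transformer.
    return jnp.mean((x @ params - y) ** 2)

@partial(jax.pmap, axis_name="pod_devices")
def local_grads(params, x_shard, y_shard):
    grads = jax.grad(loss_fn)(params, x_shard, y_shard)
    # All-reduce within the pod: every device ends up with the same
    # gradient, averaged over all of the pod's data shards.
    return jax.lax.pmean(grads, axis_name="pod_devices")
```

Here `local_grads` would be called with arguments whose leading axis equals the device count: the parameters replicated on every device, the data sharded across them.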
Then, for the step that follows, each Pod accumulates the remote and local gradients and applies the parameter updates in parallel, leaving both Pods with bitwise-identical parameters.
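A minimal sketch of this cross-pod step (with hypothetical names; Pathways' actual client-server machinery is more involved): each Pod computes gradients on its half of the batch, the Pods exchange gradients over the DCN, and both apply the same update.

```python
# Two-way data parallelism across two pods, sketched in plain JAX.
# `pod_grads` stands in for a whole pod's within-pod computation above.
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    return jnp.mean((x @ params - y) ** 2)  # same toy model as above

def pod_grads(params, x, y):
    return jax.grad(loss_fn)(params, x, y)

def train_step(params, batch_a, batch_b, lr=1e-3):
    # Each pod computes gradients on its half of the training batch ...
    ga = pod_grads(params, *batch_a)
    gb = pod_grads(params, *batch_b)
    # ... then transfers them to the remote pod over the DCN. Each pod
    # accumulates local + remote gradients (averaged over both halves),
    # so the parallel parameter updates are bitwise-identical.
    g = jax.tree_util.tree_map(lambda a, b: (a + b) / 2.0, ga, gb)
    return jax.tree_util.tree_map(lambda p, grad: p - lr * grad, params, g)
```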
PaLM Compared with Other Models
Beyond its superior performance and efficiency, part of PaLM's appeal is that it is trained differently from other models. Models like Megatron-Turing NLG, Gopher, LaMDA, and GLaM achieved state-of-the-art few-shot results on a range of tasks by training on larger datasets from multiple sources, making use of sparsely activated modules, and scaling up model sizes.
However, little progress has been made in understanding the abilities that emerge from few-shot learning as model scale expands; PaLM moves the field closer to understanding these abilities.