Meta’s new supercomputer could power its virtual universe

Meta unveils the blueprint for a 16,000-GPU supercomputer that can handle complex AI models.
NVIDIA is supplying Meta with the next-gen tech for its AI supercomputers. Image: NVIDIA

Today Meta, Facebook’s parent company, announced that it is partnering with NVIDIA to build a supercomputer to power its artificial intelligence research. The company calls the new machine RSC, short for AI Research SuperCluster.

The supercomputer will be completed later this year, and the company said in a press release that it expects the machine to “be the fastest [AI supercomputer] in the world once fully built out in mid-2022.”

Meta said the supercomputer will help its researchers feed more data into AI models that work across multiple languages and analyze text, images, and video together, potentially for translation or for identifying harmful content. Meta has already begun training natural language processing and computer vision models on the new machine.

Company researchers wrote in a blog post that they imagine this tech could one day provide real-time translations for an international group of people collaborating on an AR game or a research project, for instance. The cluster will also be used to develop new tools for augmented and virtual reality.

[Related: Facebook has an explanation for its massive Monday outage]

Unsurprisingly, the powerful new cluster is intended to help the company enable the metaverse. “The experiences we’re building for the metaverse require enormous compute power (quintillions of operations / second),” Meta CEO Mark Zuckerberg said in a statement, “and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more.” (A quintillion is 10^18; a machine performing that many operations per second is working at what the industry calls exascale.)

Early studies on RSC showed that it “runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster” than previous systems Meta used. “That means a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before,” the researchers wrote. 

[Related: This supercomputer will perform 1,000,000,000,000,000,000 operations per second]

The company has long sought to build infrastructure that can take in data sets as large as an exabyte, which the company says equals about “36,000 years of high-quality video.” Today, the RSC supercomputer houses 760 NVIDIA DGX A100 systems as its compute nodes, totaling 6,080 graphics processing units (GPUs), and Meta’s goal is to scale that number up to 16,000 later this year.
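Those figures hold up as rough arithmetic. Here is a minimal back-of-the-envelope sketch in Python, assuming a high-quality video bitrate of about 7 Mbit/s (our assumption; Meta does not say what rate it used) and the 8 GPUs that each DGX A100 node contains:

```python
# Back-of-the-envelope check of Meta's figures.
# Assumption: "high-quality video" streams at roughly 7 Mbit/s,
# a typical HD bitrate; Meta does not publish the rate it used.

EXABYTE_BYTES = 10**18
VIDEO_BITRATE_BPS = 7_000_000          # assumed bitrate, bits per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

video_seconds = EXABYTE_BYTES * 8 / VIDEO_BITRATE_BPS
print(f"1 EB of video: ~{video_seconds / SECONDS_PER_YEAR:,.0f} years")
# -> ~36,000 years, matching Meta's estimate

# Each NVIDIA DGX A100 node contains 8 A100 GPUs.
print(f"GPUs today: {760 * 8:,}")      # -> 6,080
```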

Additionally, the GPUs are “linked on an NVIDIA Quantum 200Gb/s InfiniBand network to deliver 1,895 petaflops of TF32 performance,” NVIDIA elaborated in an accompanying press release.
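That throughput is consistent with the A100’s spec sheet. A quick sketch, assuming NVIDIA’s published peak of 312 TFLOPS of TF32 per A100 with structured sparsity (sustained performance in practice would be lower):

```python
# Sanity-check NVIDIA's 1,895-petaflop TF32 figure for the 6,080-GPU cluster.
# Assumption: 312 TFLOPS of TF32 per A100, NVIDIA's peak with structured
# sparsity; real sustained throughput would be lower.

A100_TF32_TFLOPS = 312                 # peak TF32 with sparsity, per GPU
gpus = 6_080

total_petaflops = gpus * A100_TF32_TFLOPS / 1_000
print(f"~{total_petaflops:,.0f} petaflops")   # -> ~1,897, close to the 1,895 quoted
```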

Meta said the new system was designed to protect user privacy, and uses “encrypted user-generated data that is not decrypted until right before training.” The system is also “isolated from the larger internet, with no direct inbound or outbound connections.”
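Meta has not published the pipeline itself, but the stated idea, in which data stays encrypted at rest and is decrypted only in memory right before it reaches the model, can be sketched in a few lines. The following is a hypothetical illustration using the open-source cryptography library’s Fernet recipe, not Meta’s implementation:

```python
# Hypothetical sketch of "decrypt only right before training"; not Meta's code.
# Requires: pip install cryptography
from cryptography.fernet import Fernet

key = Fernet.generate_key()            # in practice, held in a secrets manager
fernet = Fernet(key)

# Data is stored and shipped to the cluster in encrypted form only.
encrypted_example = fernet.encrypt(b"user-generated training example")

def training_batches(encrypted_records):
    """Decrypt each record in memory, immediately before it reaches the model."""
    for record in encrypted_records:
        yield fernet.decrypt(record)   # plaintext exists only for this step

for batch in training_batches([encrypted_example]):
    pass  # feed `batch` to the model; plaintext is never written to disk
```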