Training compute will no longer be the limiting factor for Tesla's Full Self-Driving improvements, says Elon Musk — previously it was the bottleneck slowing development. Tesla activated a 10,000-unit Nvidia H100 GPU cluster this week.
The development of Tesla's Full Self-Driving (FSD) has passed a major milestone: training compute will no longer be an obstacle to its improvement. CEO Elon Musk said on X on Wednesday that FSD V12 is still in development, but training compute would no longer slow it down.
“Training compute should soon not be so much of a limiting factor,” he wrote. He was responding to a user who calculated that the training speed of V12 should at least quadruple once the Nvidia H100 cluster is fully operational, or increase even further if significant Dojo capacity comes online.
During the Q2 2023 Earnings Call, Musk said that “the fundamental rate limiter on the progress of full self-driving is training. That's -- if we had more training compute, we would get it done faster. So that's it.”
Tesla activated a 10,000-unit Nvidia H100 GPU cluster this week. According to Musk, the H100 proved three times faster than the A100 in the company's tests. However, he did mention the difficulty of getting the H100 cluster online.
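As a rough back-of-the-envelope sketch of what the new cluster adds, one can express it in A100-equivalent units, assuming (per Musk's stated test result) that each H100 delivers about 3× the training throughput of an A100. The 3× ratio is workload-dependent, so this is an illustration, not a precise capacity figure:

```python
# Rough estimate of Tesla's training-compute gain from the new H100 cluster,
# measured in A100-equivalents. The 3x per-GPU speedup is Musk's stated test
# result; real-world throughput varies by workload and precision.

a100_units = 16_000        # existing A100 cluster size (per Musk)
h100_units = 10_000        # newly activated H100 cluster size
h100_vs_a100 = 3.0         # per-GPU training speedup reported by Musk

a100_equivalents_added = h100_units * h100_vs_a100
total_equivalents = a100_units + a100_equivalents_added
growth_factor = total_equivalents / a100_units

print(f"Added compute: {a100_equivalents_added:,.0f} A100-equivalents")
print(f"Total: ~{growth_factor:.2f}x previous training compute")
```

Under these assumptions the H100 cluster alone roughly triples Tesla's training compute; estimates that also count Dojo capacity point to an even larger jump.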
“Very difficult bringing the 10k H100 cluster online btw. Similar experience to bringing our now 16k A100 cluster online.
“Uptime & performance are low at first, then improve with lots of work by Tesla & Nvidia,” he explained.
Musk confirmed earlier this week that Tesla will use both Nvidia hardware and Dojo to process the vast amounts of data collected by Tesla's growing vehicle fleet and to train its FSD software. He also mentioned this during the Q2 2023 Earnings Call.
“So, speaking of which, our Dojo training computer is designed to significantly reduce the cost of neural net training. It is designed to -- it’s somewhat optimized for the kind of training that we need, which is a video training. So, we just see that the need for neural net training -- again, talking -- speaking of quasi-infinite things, is just enormous. So, I think having -- we expect to use both, NVIDIA and Dojo, to be clear. But there’s -- we just see demand for really vast training resources.
“And we think we may reach in-house neural net training capability of a 100 exaflops by the end of next year. So, to date, over 300 million miles have been driven using FSD beta. That 300 million mile number is going to seem small very quickly. It’ll soon be billions of miles, then tens of billions of miles. And FSD will go from being as good as a human to then being vastly better than a human. We see a clear path to full self-driving being 10 times safer than the average human driver, so.
“So that’s what Dojo is designed to do is optimize for video training. It’s not optimized for LLMs. It’s optimized for video training. With video training, you have a much higher ratio of compute-to-memory bandwidth, so -- whereas LLMs tends to be memory bandwidth choked. So that’s it. I mean -- but like I said, we’re also -- we have some -- we’re using a lot of NVIDIA hardware. We’ll continue to -- we’ll actually take NVIDIA hardware as fast as NVIDIA will deliver it to us.”
© 2023, Eva Fox | Tesmanian. All rights reserved.
About the Author
Eva Fox joined Tesmanian in 2019 to cover breaking news as an automotive journalist. The main topics she covers are clean energy and electric vehicles. As a journalist, Eva specializes in Tesla and topics related to the company's work and development.