November 16, 2023
Qualcomm Cloud AI100 Ultra: 4x performance for larger models


We haven’t heard from Qualcomm for some time about its power-efficient data center inference accelerator, which, according to MLCommons benchmarks, still holds the record for power efficiency. Now Qualcomm has announced a new “Ultra” version that offers four times the performance at just 150 watts.

Many have wondered when Qualcomm would update its data center inference chip. It turns out that, given more memory, the chip can crunch larger language models much faster. The company has quietly announced a new “Ultra” version that can support a 100-billion-parameter large language model on a single 150-watt card, and a 175-billion-parameter model using two cards.
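
To put the 100-billion and 175-billion parameter figures in perspective, here is a rough back-of-the-envelope sketch of how much memory the model weights alone would occupy at a few common numeric precisions. The precisions and bytes-per-parameter values are illustrative assumptions, not figures from Qualcomm, and the estimate ignores the KV cache, activations, and runtime overhead, so real deployments need more.

```python
# Rough weight-memory estimate for an LLM.
# Assumption: only the weights are counted; KV cache and activations are ignored.
# The precisions listed are illustrative, not a statement of what the card uses.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return memory for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for params in (100e9, 175e9):
        for precision in ("fp16", "int8", "int4"):
            gb = weight_memory_gb(params, precision)
            print(f"{params / 1e9:.0f}B parameters at {precision}: ~{gb:.0f} GB of weights")
```

Whatever precision is used, the weights of models this size run to tens or hundreds of gigabytes, which is why on-card memory capacity matters so much for LLM inference.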

HPE will support the new card in HPE ProLiant DL380a servers. “In collaboration with Qualcomm, we look forward to offering our customers a compute solution that is optimized for inference and delivers the performance and power efficiency needed to deploy and accelerate AI inference at scale,” said Justin Hotard, executive vice president and general manager of HPC, AI and Labs at HPE.

The “at scale” comment should give Qualcomm fans reason for hope. Most LLM inference queries today rely on expensive HBM-equipped GPUs to hold the model and generate the next token or word. A simple PCIe card with LPDDR memory should be a welcome alternative.

This week at Supercomputing ’23, startup NeuReality demonstrated Qualcomm cards in a platform that it says can reduce inference infrastructure costs by up to 90%. NeuReality does not make a deep learning accelerator (DLA). Instead, its platform replaces the expensive x86 CPUs that typically preprocess data and feed it to the fast DLAs that run inference. More details about NeuReality will be revealed soon, but suffice it to say that CEO Moshe Tanach was very excited to offer the higher-memory card to the company's customers.

Conclusion

We haven’t heard the last from Qualcomm Technologies about power-efficient inference processing. Adding more power-efficient memory to the AI solution quadruples performance, and large language models need a great deal of memory just to hold the model.

Qualcomm is also following up with the amazing Snapdragon X Elite for Windows and the new Snapdragon 8 Gen 3 for mobile handsets. Now the Qualcomm Cloud AI100 Ultra can handle 100B-parameter models in the data center, helping to make LLM inference more affordable at extremely low power. I can’t wait to see some MLPerf benchmarks in three months!

