Cerebras Releases World’s Fastest AI Inference

TechyMunch Team

In Short

  • Cerebras’ Wafer-Scale Engine has proven faster than Groq at AI inference.
  • Cerebras Inference processes up to 1,800 tokens per second on the Llama 3.1 8B model and 450 tokens per second on the 70B model.
  • By contrast, Groq runs the 8B and 70B models at up to 750 and 250 tokens per second, respectively.

Cerebras has opened access to its Wafer-Scale Engine (WSE) through Cerebras Inference, which runs the Llama 3.1 8B model at 1,800 tokens per second and the larger Llama 3.1 70B variant at 450 tokens per second. Before Cerebras, Groq held the title of fastest AI inference provider.
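For readers who want to try it, here is a minimal sketch of what a request might look like, assuming Cerebras exposes an OpenAI-compatible chat-completions endpoint. The base URL, the model ID, and the CEREBRAS_API_KEY environment variable are illustrative assumptions, not details confirmed by this article:

```python
import os
from openai import OpenAI  # pip install openai

# Assumption: Cerebras Inference speaks the OpenAI chat-completions protocol.
# The base URL and model ID below are illustrative placeholders.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.chat.completions.create(
    model="llama3.1-8b",  # hypothetical ID for the Llama 3.1 8B model
    messages=[{"role": "user", "content": "Explain wafer-scale chips in one sentence."}],
)
print(response.choices[0].message.content)
```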

Cerebras has developed its own wafer-scale processor that integrates close to 900,000 AI-optimized cores and packs 44GB of on-chip SRAM. As a result, the model weights live directly on the chip itself, unlocking far more memory bandwidth than off-chip memory can provide. Cerebras also runs Meta’s full 16-bit precision weights, so there is no compromise on accuracy.
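The bandwidth argument is easy to put into numbers. In autoregressive decoding, each generated token requires streaming all model weights through the compute units once, so a back-of-envelope estimate follows directly from the figures above (a sketch; the 2-bytes-per-parameter value comes from the 16-bit weights):

```python
# Back-of-envelope: weight bandwidth implied by the quoted throughput.
# Assumes each generated token reads every weight once (standard for
# autoregressive decoding) and 16-bit (2-byte) weights, per the article.
params = 8e9              # Llama 3.1 8B parameters
bytes_per_param = 2       # 16-bit precision
tokens_per_second = 1800  # Cerebras' quoted 8B throughput

weight_bytes = params * bytes_per_param       # ~16 GB of weights
bandwidth = weight_bytes * tokens_per_second  # bytes per second
print(f"{bandwidth / 1e12:.1f} TB/s effective weight bandwidth")  # ~28.8 TB/s
```

That works out to roughly 28.8 TB/s, around an order of magnitude beyond what a single HBM-equipped GPU delivers, which is exactly why holding the weights in on-chip SRAM matters.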

When I put Cerebras’ claims to the test, answers came back almost instantly. The smaller Llama 3.1 8B model ran at 1,830 tokens per second, and the 70B model managed 446 tokens per second. By contrast, Groq ran the 8B and 70B models at 750 and 250 tokens per second, respectively.
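Throughput figures like these are straightforward to sanity-check yourself. A rough sketch, under the same assumptions as the earlier snippet (OpenAI-compatible streaming endpoint, hypothetical model ID): count the streamed chunks and divide by wall-clock time. Chunks only approximate tokens, so treat the result as an estimate:

```python
import os
import time
from openai import OpenAI  # pip install openai

# Same assumptions as before: OpenAI-compatible endpoint, illustrative model ID.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="llama3.1-70b",  # hypothetical ID for the Llama 3.1 70B model
    messages=[{"role": "user", "content": "Write a 300-word summary of wafer-scale computing."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # each streamed chunk carries roughly one token

elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.0f} tokens/second")
```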


Cerebras’ WSE was independently assessed by Artificial Analysis, which concluded that it does indeed offer unmatched speed at AI inference. To see Cerebras Inference for yourself, follow this link.
