
Groq

Building an AI accelerator application-specific integrated circuit and related hardware

AI Chip

INDUSTRY: Technology
STATUS: Trading
OPEN TO: Public

Investment Highlights


Building an AI accelerator application-specific integrated circuit (ASIC) that it calls the Language Processing Unit (LPU)
Also building related hardware to accelerate the inference performance of AI workloads.

Company Overview


Groq was founded in 2016 by a group of former Google engineers, led by Jonathan Ross, one of the designers of the Tensor Processing Unit (TPU), an AI accelerator ASIC, and Douglas Wightman, an entrepreneur and former engineer at Google X (known as X Development).

Groq received seed funding from Social Capital's Chamath Palihapitiya, a US$10M investment in 2017, and secured additional funding soon after.

In April 2021, Groq raised US$300M in a Series C round led by Tiger Global Management and D1 Capital Partners.

Current investors include: The Spruce House Partnership, Addition, GCM Grosvenor, Xⁿ, Firebolt Ventures, General Global Capital, and Tru Arrow Partners, as well as follow-on investments from TDK Ventures, XTX Ventures, Boardman Bay Capital Management, and Infinitum Partners.

After its Series C funding round, Groq was valued at over US$1 billion, making the startup a unicorn.


Groq initially named its ASIC the Tensor Streaming Processor (TSP), but later rebranded it as the Language Processing Unit (LPU).

The LPU features a functionally sliced microarchitecture, in which memory units are interleaved with vector and matrix computation units. This design makes it possible to exploit the dataflow locality of AI compute graphs, improving execution performance and efficiency. The LPU was designed around two key observations:

AI workloads exhibit substantial data parallelism, which can be mapped onto purpose-built hardware, leading to performance gains.


A deterministic processor design, coupled with a producer-consumer programming model, allows hardware components to be precisely controlled and reasoned about, enabling optimized performance and energy efficiency (see the sketch after this list).
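
To make the producer-consumer idea concrete, here is a minimal Python sketch (not Groq's programming model or API; all names are hypothetical) in which functional slices are chained through explicit streams, so each unit consumes exactly what its upstream neighbor produces:

def memory_slice(weights):
    # Producer: streams operand rows to the downstream compute slices.
    for row in weights:
        yield row

def matmul_slice(operand_stream, activations):
    # Consumer and producer: consumes operand rows, produces dot products.
    for row in operand_stream:
        yield sum(w * a for w, a in zip(row, activations))

def vector_slice(result_stream):
    # Consumer: applies an elementwise ReLU to the streamed results.
    for x in result_stream:
        yield max(0.0, x)

weights = [[1.0, -2.0], [3.0, 4.0]]
activations = [0.5, 1.0]
pipeline = vector_slice(matmul_slice(memory_slice(weights), activations))
print(list(pipeline))  # [0.0, 5.5]

Because each stage produces and consumes data at rates that are known in advance, the hand-off between slices can be reasoned about statically rather than resolved at runtime.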


In addition to its functionally sliced microarchitecture, the LPU is characterized by a single-core, deterministic architecture. It achieves deterministic execution by avoiding traditional reactive hardware components (branch predictors, arbiters, reordering buffers, caches) and by having the compiler explicitly control all execution, which guarantees that an LPU program executes deterministically.
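
As a rough illustration of what compiler-controlled execution means (a simplified model, not Groq's compiler or instruction set; the unit names are only loosely inspired by the functional-slice vocabulary), a statically scheduled program can be viewed as a fixed list of (cycle, unit, operation) entries that the hardware replays with no runtime arbitration:

# Simplified model of static, compiler-controlled scheduling (illustrative only).
# Every operation is assigned a fixed issue cycle ahead of time, so timing is
# identical on every run: no branch prediction, reordering, or cache misses.
schedule = [
    (0, "MEM", "load weights     -> stream W"),
    (1, "MEM", "load activations -> stream A"),
    (2, "MXM", "matmul W, A      -> stream P"),
    (6, "VXM", "relu P           -> stream R"),
    (8, "MEM", "store R"),
]

def run(schedule):
    # Replay the schedule cycle by cycle; the compiler's cycle assignments
    # are the ground truth for when each unit acts.
    last_cycle = max(cycle for cycle, _, _ in schedule)
    for now in range(last_cycle + 1):
        for cycle, unit, op in schedule:
            if cycle == now:
                print(f"cycle {now:2d}  {unit}  {op}")

run(schedule)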

The first generation of the LPU (LPU v1) yields a computational density of more than 1 TeraOp/s per square mm of silicon for its 25×29 mm 14nm chip operating at a nominal clock frequency of 900 MHz. The second generation of the LPU (LPU v2) will be manufactured on Samsung's 4nm process node.
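
Taken together, the quoted figures give a quick back-of-the-envelope lower bound on aggregate throughput (simple arithmetic from the numbers above, not an official specification):

die_area_mm2 = 25 * 29          # 725 mm^2 for the 25 x 29 mm LPU v1 die
density_tops_per_mm2 = 1.0      # "more than 1 TeraOp/s per square mm" (lower bound)
aggregate_tops = die_area_mm2 * density_tops_per_mm2
print(f"> {aggregate_tops:.0f} TeraOp/s across the die")  # > 725 TeraOp/s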
