AI Inference Specialization Secures $1B in Contracts

Introduction

The GPU Paradigm Shift

Etched has reached a valuation of $5 billion, with contracts already signed for over $1 billion in inference services. This is not just a financial success; it indicates the transition from general-purpose architectures to specialized systems like the Sohu chip. This evolution is evident in the market for language models, where inference—the process that generates a response after an input—has become the main operational bottleneck and accounts for most of the expenses for AI companies. The Sohu chip is not designed for every type of calculation, but only for transformer-based models. This strategic choice eliminates the overhead of flexibility that characterizes traditional GPUs.

The chip is manufactured on TSMC’s 4nm process, with TSMC serving as a key partner for the production of high-performance silicon. The specificity of the architecture reduces energy consumption and increases processing speed. In practice, an operation that requires three cycles on general-purpose GPUs can be completed in one cycle with Sohu. This is more than a marginal improvement; it represents a fundamental change in the cost-to-performance ratio.

The Physics of Specialized Computing

Sohu’s architecture is based on a simple but radical principle: not to optimize for versatility, but for efficiency in a single domain. Transformers—the model that powers almost all modern AI applications, from chatbots to machine translation systems—require repetitive and structured mathematical operations. The Sohu chip is designed to perform these operations directly, without having to go through general-purpose units that introduce delays.

This approach has tangible physical consequences: the 4nm transistor density allows for a more compact packaging and reduced thermal dissipation. For every watt consumed, Sohu produces up to 30% more output than current NVIDIA GPUs. In contexts such as data centers that handle millions of requests daily, this difference translates into massive energy savings and a reduction in the need for liquid cooling.
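A performance-per-watt gain and an energy-per-output reduction are not the same percentage, so it helps to convert one into the other. The following is a minimal sketch of that conversion; the function name is illustrative, and the only input taken from the text is the 30% figure.

```python
def energy_reduction_from_perf_per_watt(gain: float) -> float:
    """Fractional drop in energy per unit of output for a given
    fractional gain in output per watt (0.30 means 30% more output)."""
    if gain <= -1.0:
        raise ValueError("gain must be greater than -1.0")
    # Energy per unit of output is the inverse of output per watt.
    return 1.0 - 1.0 / (1.0 + gain)


reduction = energy_reduction_from_perf_per_watt(0.30)
print(f"Energy per unit of output falls by {reduction:.1%}")  # 23.1%
```

Read this way, 30% more output per watt corresponds to roughly 23% less energy for the same amount of work.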

Scalability is no longer tied to the number of chips added, but to the system’s ability to manage specific workloads. Inference clusters built with Sohu are designed as closed units: each node operates autonomously and can be integrated without having to reconfigure the entire infrastructure. This modularity reduces implementation times from weeks to hours.

The Gap Between Narrative and Reality

The dominant narrative speaks of a global war for control of artificial intelligence, with an emphasis on ever-larger models and geopolitical competition. According to Gary Marcus, the cognitive scientist and prominent AI critic, “It is hard to see how all the massive data center investments will pay off, with price wars dropping token prices to near zero; the meagre profits are unlikely ever to justify the massive outlays.” This observation points to a growing gap between public enthusiasm and economic sustainability.


The technical reality, on the other hand, shows a different dynamic: it is not the power of the model that is the main constraint, but the efficiency with which it is executed. As models become larger and more complex, inference—which requires continuous computational resources—becomes the breaking point. Etched is not competing for model capacity; it is competing for the quality of execution.

The Limits of Generalization

A valuation of $5 billion and contracts worth $1 billion suggest that the market is becoming less willing to pay a premium for flexibility. Computational power is shifting towards those who can offer dedicated solutions, with higher operational density and lower energy consumption. This transition has structural consequences: companies investing in general-purpose infrastructure risk becoming obsolete even if they maintain superior models.

The key data point is the 30% reduction in energy consumption per unit of output. Applied to a 10 megawatt data center running the same workload, this represents a decrease of approximately 3 MW of active power required. Alternatively, if the power budget is held constant and throughput scales inversely with energy per unit of output, roughly 43% more requests can be served (1 / 0.7 ≈ 1.43) without increasing electrical capacity.
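The arithmetic behind these figures can be checked with a short sketch. It assumes that the whole facility load is inference and that throughput scales inversely with energy per unit of output; both are simplifying assumptions, and the function names are illustrative.

```python
def power_saved_mw(facility_mw: float, energy_reduction: float) -> float:
    """Power freed when the same workload needs less energy per unit of output."""
    return facility_mw * energy_reduction


def extra_capacity_at_fixed_power(energy_reduction: float) -> float:
    """Fractional increase in output when the power budget stays constant."""
    if not 0.0 <= energy_reduction < 1.0:
        raise ValueError("energy_reduction must be in [0, 1)")
    return 1.0 / (1.0 - energy_reduction) - 1.0


saved = power_saved_mw(10.0, 0.30)
headroom = extra_capacity_at_fixed_power(0.30)
print(f"Power saved at constant workload: {saved:.1f} MW")  # 3.0 MW
print(f"Extra output at constant power: {headroom:.1%}")    # 42.9%
```

Under these assumptions, a 30% reduction frees about 3 MW at constant workload, or leaves room for about 43% more output at the same power draw.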

The narrative is about competition over models; the data shows a restructuring of computational power around specialization. Those who control efficiency do not necessarily hold the largest model, but they do hold the ability to make it work sustainably.

Monitor the physical cost per token

If you are considering an investment in AI infrastructure, the key metric to monitor is the actual energy consumption per token generated. A value greater than 0.5 joules/token suggests excessive reliance on general-purpose architectures. The current benchmark for specialized systems like Sohu is around 0.35 joules/token.
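Energy per token can be estimated from two measurements: average power draw and sustained token throughput. The sketch below shows one way to compute it and compare it with the 0.5 joules/token threshold mentioned above. The node figures (7,000 W and 20,000 tokens per second) are hypothetical inputs chosen for illustration, not measurements of any real system.

```python
def joules_per_token(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token; one watt is one joule per second."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return avg_power_watts / tokens_per_second


def exceeds_threshold(jpt: float, threshold: float = 0.5) -> bool:
    """True when energy per token is above the chosen threshold."""
    return jpt > threshold


# Hypothetical node: 7,000 W average draw, 20,000 tokens generated per second.
jpt = joules_per_token(7000.0, 20000.0)
print(f"{jpt:.2f} J/token, above threshold: {exceeds_threshold(jpt)}")
```

For a fair comparison, the power figure should cover everything needed to serve the tokens (accelerators, host, networking, cooling), not just the chip itself.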



