How to Calculate if Your Network is Bottlenecking Distributed Training
A practical guide to understanding why your multi-node GPU training might be slower than expected.
Technical deep-dives on GPU infrastructure, distributed training, and cloud-native systems.
GPU-to-GPU Communication Across Nodes: What Actually Works
A practical guide to NVLink, NVSwitch, InfiniBand, RoCE, and GPUDirect for multi-node GPU clusters. Cut through the jargon and understand what hardwar...
Understanding Power Management in GPU via PCIe
Learn how modern GPUs implement intelligent power management through PCIe generation scaling and why your high-end GPUs might be operating at Gen 1 sp...
Centralised Control Planes for SaaS - Part 1: SaaS Business Models
For a couple of years, my journey has revolved around constructing control planes for data infrastructure startups. As an engineer I have been fortuna...
Centralised Control Planes for SaaS - Part 2: Stateless Async Event Handling
In our previous post, "Centralised Control Planes for SaaS - Part 1," we delved into the world of SaaS business models and the challenges that arise w...
Centralised Control Planes for SaaS - Part 3: Logical and Physical Models
In our previous post, "Centralised Control Planes for SaaS - Part 2," we discussed approaches to building a stateless control plane and how to ...
State Management for Infra Products - Part 1: State Machines vs Observed State
The goal of this series is to provide an in-depth understanding of building state-driven infrastructure products, as opposed to API-driven ones. Much ...
Understanding RX vs TX - Making Sense of Network Traffic Direction with Real Examples
Learn the fundamentals of RX (receive) and TX (transmit) metrics in network monitoring through real-world Mellanox ONYX switch examples, common traffi...
Demystifying Helm and the Operator Pattern
Discover insights and explanations about Helm and the Operator pattern on the DataInfra.io blog. Gain a clearer understanding of these tools and patterns ...