ML Systems & Infrastructure

Dr. Sudipta Pathak

Building scalable ML platforms, distributed training systems, and AI infrastructure. Expert in MLOps, model optimization, and production-grade machine learning.

About Me

I am a Machine Learning Systems Engineer with deep expertise in building production-grade AI infrastructure. My work spans from distributed training systems to edge deployment, with a focus on scalability, efficiency, and reliability.

Distributed Training

Large-scale model training across GPU clusters with optimized data parallelism and model parallelism strategies.
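The core of data parallelism can be stated in one line: every replica computes gradients on its own data shard, and those gradients are averaged across replicas before the optimizer step. A framework-free sketch of that averaging step (real systems do this with an NCCL allreduce rather than a Python loop):

```python
def allreduce_mean(per_replica_grads):
    """Average gradients elementwise across replicas.

    per_replica_grads: list of gradient vectors, one per replica.
    Returns the averaged gradient that every replica then applies,
    keeping all model copies in sync.
    """
    n = len(per_replica_grads)
    dim = len(per_replica_grads[0])
    return [sum(g[i] for g in per_replica_grads) / n for i in range(dim)]

# Two replicas, each with gradients from its own data shard:
grads = [[1.0, 2.0], [3.0, 4.0]]
avg = allreduce_mean(grads)  # [2.0, 3.0]
```

Model parallelism is the complementary strategy: instead of replicating the model, its layers or tensors are sharded across devices, and activations (not gradients) cross device boundaries.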

ML Infrastructure

Building robust ML platforms with Kubernetes, Kubeflow, and custom orchestration for production workloads.

Data Pipelines

High-throughput data processing pipelines with Apache Spark, Ray, and modern data lake architectures.

Cloud-Native ML

Multi-cloud deployments on AWS, GCP, and Azure with focus on cost optimization and scalability.

Model Optimization

Quantization, pruning, and distillation techniques for deploying efficient models at the edge and in the cloud.
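As a concrete illustration of the first of these techniques: symmetric post-training quantization maps float weights to int8 through a single scale factor. A minimal sketch (illustrative only; production stacks typically use calibrated per-channel or per-group scales):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: scale by max |w|, round into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03]
q, s = quantize_int8(w)    # q = [50, -127, 3], s = 0.01
approx = dequantize(q, s)  # within half a quantization step of w
```

The reconstruction error is bounded by half the scale, which is why outlier weights (which inflate the scale) are the usual failure mode and why per-channel scales help.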

MLOps & Security

End-to-end ML lifecycle management with CI/CD, monitoring, and enterprise-grade security practices.

10+ Years Experience
50+ Production Systems
1B+ Model Parameters
99.9% Uptime Achieved

Featured Work

Selected projects showcasing expertise in ML systems, distributed computing, and production AI infrastructure.

Distributed LLM Training Framework

A fault-tolerant distributed training system for large language models with automatic checkpointing and elastic scaling.

PyTorch · Ray · Kubernetes · DeepSpeed
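One building block behind automatic checkpointing is making the checkpoint write atomic, so a crash mid-save never leaves a torn file that blocks resumption. A minimal sketch of that pattern (illustrative only, not this framework's actual code; `save_checkpoint` and `load_checkpoint` are hypothetical names):

```python
import json
import os

def save_checkpoint(path, step, state):
    """Write a checkpoint atomically: write to a temp file, then rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: readers see old or new, never partial

def load_checkpoint(path):
    """Resume from the last complete checkpoint, or start fresh at step 0."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]
```

Elastic scaling builds on the same primitive: when workers join or leave, all ranks restart from the newest complete checkpoint rather than losing the run.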

ML Inference Optimization Platform

High-performance inference serving with dynamic batching, model quantization, and multi-model GPU sharing.

TensorRT · Triton · CUDA · gRPC
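Dynamic batching trades a bounded amount of queueing latency for throughput: flush a batch when it is full, or when the oldest queued request has waited too long. A simplified offline sketch of that policy (real servers implement it as a background loop over a live request queue; the names here are illustrative):

```python
def dynamic_batches(requests, max_batch=4, max_wait=3):
    """Group requests into batches under a size cap and a latency cap.

    requests: list of (arrival_tick, payload) in arrival order.
    Flushes when the batch reaches max_batch, or when the oldest
    request in the batch has waited at least max_wait ticks.
    """
    batches, current, oldest = [], [], None
    for tick, payload in requests:
        if current and tick - oldest >= max_wait:
            batches.append(current)           # latency bound hit: flush
            current, oldest = [], None
        if oldest is None:
            oldest = tick
        current.append(payload)
        if len(current) == max_batch:         # size bound hit: flush
            batches.append(current)
            current, oldest = [], None
    if current:
        batches.append(current)
    return batches
```

With `max_wait=3`, a request arriving at tick 0 is never delayed past tick 3 waiting for batch-mates, which is the latency guarantee that makes batching safe for online serving.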

Real-time Feature Store

Low-latency feature computation and serving for online ML models with point-in-time correctness guarantees.

Redis · Apache Flink · Kafka · Rust
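Point-in-time correctness means a training example labeled at time t may only see feature values that had been written at or before t, never future writes that would leak the label. A minimal sketch of such a lookup in plain Python (`point_in_time_value` is a hypothetical helper, not this system's API):

```python
from bisect import bisect_right

def point_in_time_value(history, ts):
    """Return the latest feature value written at or before ts.

    history: list of (write_ts, value) pairs sorted by write_ts.
    Returns None if the feature had not been written yet at ts.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i else None

history = [(10, 1.0), (20, 2.0), (30, 3.0)]
point_in_time_value(history, 25)  # -> 2.0 (the value as of t=25)
point_in_time_value(history, 5)   # -> None (feature did not exist yet)
```

Scaled up, this is an "as-of" join between the label timestamps and every feature's write log, which is what distinguishes a feature store from a plain key-value cache.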

MLOps Pipeline Orchestrator

End-to-end ML workflow automation with experiment tracking, model versioning, and automated deployment.

Kubeflow · MLflow · GitOps · Terraform

Writings

Technical blog posts, paper reviews, and tutorials on ML systems, infrastructure, and distributed computing.

blog · Jan 2026 · 15 min read

Scaling Distributed Training to 10,000 GPUs

A deep dive into the challenges and solutions for training large models at extreme scale, including communication optimizations and fault tolerance strategies.

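One communication optimization that matters at this scale is ring allreduce, which keeps per-rank bandwidth roughly constant as the cluster grows: each rank sends about 2(n-1)/n of its gradient data regardless of n. A small pure-Python simulation of its two phases, reduce-scatter then allgather, assuming n chunks per rank and one number per chunk for clarity:

```python
def ring_allreduce(grads):
    """Simulated ring allreduce over n ranks (reduce-scatter + allgather).

    grads[r][c] is chunk c of rank r's gradient; n chunks per rank.
    Returns per-rank data where every rank holds the elementwise sum.
    """
    n = len(grads)
    data = [list(r) for r in grads]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            data[(r + 1) % n][c] += data[r][c]
    # Allgather: circulate each completed chunk once around the ring.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            data[(r + 1) % n][c] = data[r][c]
    return data

out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# every rank now holds [12, 15, 18]
```

At 10,000-GPU scale, libraries layer hierarchical and tree variants on top of this, but the bandwidth argument is the same.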
papers · Jan 2026 · 10 min read

Paper Review: MegaScale - Production-Grade LLM Training

Analysis of ByteDance's MegaScale framework for training LLMs on over 10,000 GPUs, covering their design principles and lessons learned.

tutorials · Dec 2025 · 25 min read

Building Production-Ready ML Pipelines with Kubeflow

Step-by-step tutorial on setting up end-to-end ML workflows with Kubeflow Pipelines, including best practices for CI/CD integration.

blog · Dec 2025 · 12 min read

Optimizing Transformer Inference at Scale

Techniques for reducing latency and increasing throughput in production transformer serving, including KV-cache optimization and speculative decoding.

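The KV-cache optimization rests on a simple observation: in autoregressive decoding, past tokens' keys and values never change, so each step can append one new entry and attend over the cache instead of recomputing attention inputs for the whole prefix. A toy single-head, single-query sketch of that idea (plain Python, no framework):

```python
import math

def attend(q, keys, values):
    """Single-query softmax attention over cached keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class KVCache:
    """Append-only key/value cache: each decode step adds one entry and
    attends over all cached entries, turning O(n^2) recompute into O(n)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

The memory cost of storing those entries is exactly what PagedAttention-style allocators and KV-cache quantization then attack.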
papers · Nov 2025 · 8 min read

Paper Review: Efficient Large-Scale Language Model Training

Review of recent advances in efficient training methods including mixture of experts, activation checkpointing, and 8-bit optimizers.

tutorials · Nov 2025 · 30 min read

Introduction to Triton Kernels for Deep Learning

Hands-on tutorial for writing custom GPU kernels using OpenAI's Triton, from basics to advanced optimizations.


ML News & Tools

Stay updated with breaking news, research papers, and essential tools in the ML systems and infrastructure space.

Latest News

Breaking · Jan 2026

OpenAI Releases GPT-5 Architecture Details

New technical report reveals innovations in mixture of experts scaling and training efficiency improvements.

Source: OpenAI Research
Tools · Jan 2026

PyTorch 3.0 Introduces Native Distributed Checkpointing

Major release brings built-in support for async checkpointing and automatic fault recovery for large-scale training.

Source: PyTorch Blog
Hardware · Dec 2025

NVIDIA Announces Blackwell Ultra GPU Series

New architecture delivers 4x performance for transformer inference with dedicated FP4 compute units.

Source: NVIDIA News

Essential Tools

vLLM · 28k stars · Inference
High-throughput LLM inference with PagedAttention

Unsloth · 18k stars · Training
2-5x faster LLM fine-tuning with 80% less memory

llama.cpp · 65k stars · Optimization
Port of LLaMA models in C/C++ for edge deployment

Ollama · 82k stars · Local AI
Run LLMs locally with a simple CLI and API

Weekly ML Systems Newsletter

Get curated news, paper summaries, and tool recommendations delivered weekly.

Get in Touch

Whether you are looking for a senior ML Systems engineer, need consulting on AI infrastructure, or want to collaborate on research, I would love to hear from you.

Send a Message