Hello, I'm Arpit Singh Gautam

I am a Data Scientist in the CSG CTO Lab at Dell Technologies, working on efficient LLM inference and the reliability of large models. My research focus is efficiency × reliability for foundation models: making large models cheaper to run and more trustworthy.

My main current line of work is the reliability of quantized models: what post-training quantization does to calibration, factual recall, and security, and how to preserve each cheaply. My work has appeared at the FEVER Workshop @ EACL 2026 and the Advancing AI through Theory of Mind Workshop @ AAAI 2026, with preprints on RL-based quantization (RAMP) and disaggregated LLM serving (StreamServe).

Research interests:
LLM systems & efficient inference: quantization, KV-cache optimization, serving
Trustworthy / honest LLMs: hallucination, calibration, safety
Reinforcement learning & reasoning for foundation models
Interpretability / mechanistic ML

Email · Google Scholar · GitHub · LinkedIn · X · Hashnode · TopMate · Medium

Recent Updates

Papers · Projects · Talks · Blog posts - all in one place.

Patent · Filed

U.S. patent application filed - Federated and Self-Learning Techniques for Root Cause Detection in Edge-Cloud Environments

May 2026

Co-inventor on U.S. Patent Application No. 19/670,270 (pending). Six further Edge-AI inventions have been approved for USPTO filing by Dell's internal patent committee.

Competition · 29th of 1000+

29th of 1000+ in the SAIR Foundation Mathematics Distillation Challenge

2026

Distilled equational-implication reasoning over magmas into a 7.44 KB decision-procedure cheatsheet: 60.2% accuracy at $0.00037 per problem, smaller and cheaper than the first-place entry. Read the writeup →

Paper · arXiv

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving

Apr 2026

A prefill-decode disaggregated serving system that co-optimizes multi-signal routing and dynamic speculative execution, achieving 11-18× lower latency and up to 4.4× higher average throughput than TP-vLLM baselines on 4× A800 GPUs.

Project · Open Source

Rebuilding Triton and Helion from Scratch in 4,000 Lines of Python

Jul 2026

newt and deuteron: a two-layer GPU kernel stack rebuilt from scratch that JIT-compiles Python to real tensor-core machine code via NVRTC. It reaches memory-bandwidth parity with Triton and ~92% of Triton's cold fp16 matmul performance. Read the writeup →

Project Launch

LLM Quantization Gallery - Interactive explorer for quantization techniques

Apr 7, 2026

An interactive gallery that puts INT4 and INT8 quantization, GPTQ, AWQ, QLoRA, and GGUF side by side, with perplexity scores, memory footprints, and throughput benchmarks for each. Read the blog post →

Competition

AutoML 2026 NAS Unseen-Data Challenge

2026

Budget-aware neural architecture search under a strict 10-minute compute limit: dynamic capacity scaling avoided all out-of-memory (OOM) crashes, and the approach outperformed every baseline in Phase 2.

Paper · arXiv

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On-Device LLM Inference

Mar 18, 2026

An off-policy Soft Actor-Critic (SAC) framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. On Llama 2 7B it reaches 5.54 perplexity at 3.68 GB, outperforming uniform 4-bit AWQ, and it transfers zero-shot to Llama 2 13B and Mistral 7B.

Paper Accepted · EACL 2026

The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods

Apr 2026

Accepted at the FEVER Workshop @ EACL 2026 and now on the ACL Anthology. Joint work with Kailash Talreja and Saurabh Jha that proposes a diffusion-based generative-stability method for automated fact verification.

Competition · Ongoing

Competing in ARC-AGI-3 - Abstract Reasoning Challenge

2026 · Ongoing

Participating in François Chollet's ARC-AGI-3 challenge, one of AI's hardest benchmarks for fluid intelligence. I have submitted a random-agent baseline and am exploring learning-based approaches. Read the blog post →

Paper Accepted · AAAI 2026

Faithful Theory of Mind Distillation: Why Preference Based Refinement Improves Imitation

2025

Accepted at the Advancing AI through Theory of Mind Workshop @ AAAI 2026. Studies sequential supervised fine-tuning (SFT) followed by preference-based refinement for theory-of-mind reasoning in LLMs.

Paper · arXiv

CogniSQL-R1-Zero: Lightweight Reinforced Reasoning for Efficient SQL Generation

2025

Trained with GRPO reinforcement learning and DeepSpeed across 4× A100 GPUs. Achieves SOTA execution accuracy on the BIRD benchmark, outperforming models with 236B+ parameters.

Career

Started full-time as Data Scientist at Dell Technologies (CSG CTO Lab)

Jul 2025

Joined Dell's CSG CTO Lab in Bengaluru full-time, focusing on LLM inference optimization, distributed AI systems, and on-device deployment research.

Talk · International Conference

2 Invited Sessions at IBM Z Day International Conference

2024

Presented AI/ML on mainframe systems to 20,000+ live attendees, in one of the most-attended technical sessions at the conference.