Building systems that scale and agents that reason

I design highly scalable applications, develop agentic microservices, and work on distributed systems — across both computation and storage. Projects, live demos, and daily technical writing live here.

Find me

What I work on

Scalable Systems

Designing applications that grow gracefully — from architecture decisions to operational patterns that hold under load.

Agentic Applications

Building autonomous software that reasons, plans, and acts — from single agents to coordinated multi-agent workflows.

Agentic Microservices

Decomposing intelligence into independently deployable services with clear boundaries, contracts, and observability.

Distributed Computation

Parallel and fault-tolerant processing across nodes — task orchestration, stream processing, and workload scheduling.

Distributed Storage

Data systems built for consistency, partition tolerance, and horizontal scale — from replication to sharding strategies.

Projects

Agentic AI · LangGraph · Browser Automation · LLM

UI-Navigator Agent

Autonomous browser agent that accepts a target platform, operational goal, and task intent, then plans and executes multi-step UI workflows end to end. Designed to translate natural-language objectives into reliable, platform-specific interaction sequences without manual navigation.

NLP · SpaCy · Gemini · FastAPI · Azure

ResumeSnap

Career intelligence platform with a companion browser extension. SpaCy NLP resolves contextual semantics from resume content, the Gemini REST API generates stack-aligned project outlines, and Azure Whisper handles voice-to-text ingestion for hands-free input.

Next.js · Stripe · Real-time · E-commerce

Gandom Bakery Platform

Production e-commerce system for a local bakery with real-time admin notifications, bidirectional inventory sync between back office and storefront, Stripe payment processing, and automated inventory ingestion workflows.

Recent posts

DistributedSystemDesign
#AIInfrastructure · 10 min

Tyche: Optimizing Serverless Machine Learning via Proactive Pre-Loading

Can we completely eliminate machine learning "cold starts" in serverless clusters?

When ML models are packaged into serverless functions, the standard container pre-warming used by cloud providers isn't enough. Traditional apps are lightweight, but ML workflows carry massive dependencies (like PyTorch) and heavy model files (like BERT). A staggering 70% of a serverless ML cold start is spent just loading these libraries from disk into memory.

In my latest technical report, I break down "Accelerating ML Inference via Opportunistic Pre-Loading on Serverless Clusters" (published in IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, February 2026). The paper introduces Tyche, an architecture that addresses this by opportunistically pre-loading ML artifacts into already-warmed containers and GPUs before a request lands.

Here is how the underlying math handles erratic traffic spikes without wasting heavy CPU retraining cycles:

⏱️ The 7.4-Second Math Adaptation

Instead of relying on rigid, historical 24-hour traffic averages that fail during sudden surges, Tyche monitors a tight sliding window of recent requests (e.g., W = 5) to estimate the request arrival rate λ (lambda). It then plugs this live rate into a Poisson arrival model with two optimal probability thresholds:

Load Threshold P_load = 6%: The moment the probability of an incoming request hits 6%, Tyche acts. At a standard traffic pace of 0.5 requests/min, the pre-load timer fires at about 7.4 seconds of idle time, so the model is loaded and waiting before the user arrives.

Offload Threshold P_offload = 94%: If a traffic lull happens and the probability that the prediction was wrong hits 94% (around 5.6 minutes of idle time at the same rate), Tyche flushes the model to keep cluster memory lean.

⚡ The Real Engineering Win

When a sudden burst of traffic hits, the sliding window recalculates right away. If λ jumps from 0.5 to 0.55 requests/min:

1. Zero Retraining Overhead: No heavy GPU/CPU cycles are wasted adjusting complex ML weights.
2. Instant Math Recalculation: The target pre-load window automatically tightens from 7.4 seconds down to ~6.7 seconds.

The entire system winds up aggressively during surges and relaxes during lulls, yielding up to a 93% reduction in loading latency.

#Serverless #MachineLearning #SystemArchitecture #CloudComputing #AWSLambda #DistributedSystems #IEEE #TechCommunity
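The timing figures quoted in the Tyche post can be reproduced with a few lines of arithmetic. The sketch below is my own reading, not code from the paper: it assumes requests follow a Poisson process, so the probability of at least one arrival within t minutes is 1 - exp(-λt), and it solves that expression for t at each threshold. The function names (`windowed_rate`, `idle_minutes_until`) and the inter-arrival rate estimator are hypothetical choices for illustration.

```python
import math


def windowed_rate(arrival_times_min, window=5):
    """Estimate the arrival rate (requests/min) from the last `window` arrivals.

    Assumption: the rate is (number of gaps) / (time spanned by the window).
    The post does not spell out the estimator, so this is one plausible choice.
    """
    recent = arrival_times_min[-window:]
    if len(recent) < 2:
        raise ValueError("need at least two arrivals to estimate a rate")
    span = recent[-1] - recent[0]
    if span <= 0:
        raise ValueError("arrival times must be strictly increasing")
    return (len(recent) - 1) / span


def idle_minutes_until(rate_per_min, probability):
    """Idle time t (minutes) at which P(at least one arrival in t) = probability.

    Under a Poisson process, P = 1 - exp(-rate * t), so t = -ln(1 - P) / rate.
    """
    if rate_per_min <= 0:
        raise ValueError("rate must be positive")
    if not 0 < probability < 1:
        raise ValueError("probability must be strictly between 0 and 1")
    return -math.log(1.0 - probability) / rate_per_min


# Thresholds quoted in the post.
P_LOAD = 0.06
P_OFFLOAD = 0.94

rate = windowed_rate([0.0, 2.0, 4.0, 6.0, 8.0])  # five arrivals, 2 min apart
preload_s = idle_minutes_until(rate, P_LOAD) * 60
offload_min = idle_minutes_until(rate, P_OFFLOAD)
print(f"rate={rate:.2f}/min preload after {preload_s:.2f}s, offload after {offload_min:.2f}min")
```

At λ = 0.5 requests/min this gives a pre-load point of roughly 7.4 seconds and an offload point of roughly 5.6 minutes, and raising λ to 0.55 moves the pre-load point to about 6.75 seconds. That agreement with the quoted numbers is what suggests the 1 - exp(-λt) reading, but the paper's exact formulation may differ.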

LLM_Optimization
#AIInfrastructure · 5 min

How do we make Mixture-of-Experts (MoE) AI models actually fit into memory? 🧠⚡

I recently dove into a paper on RFID-MoE (Compression via Adaptive Routing and Information Density), and it tackles one of the biggest bottlenecks in modern AI infrastructure: the massive memory footprint of sparsely activated models. Here is my quick breakdown of the problem and the engineering solutions the authors proposed:

🚨 The Bottleneck

MoE models save computing power by routing data to smaller, independent "expert" sub-networks instead of one giant network, but storing all those experts still requires an immense amount of GPU memory. Standard compression techniques (like SVD) try to shrink these experts, but they suffer from two major flaws:

- They treat all experts equally: They ignore the fact that some experts are used thousands of times while others are rarely touched.
- They throw away the scraps: They treat the leftover data from compression (the "residual") as trash and discard it.

💡 The RFID-MoE Solution

The authors introduced two mechanisms to address these flaws:

1️⃣ Adaptive Rank Allocation: Instead of a uniform memory budget, the system looks at both Routing Frequency (how often an expert is used) and Information Density (its effective rank). By fusing these two metrics, it trims memory where capacity goes unused while protecting the highly specialized knowledge hidden in rare experts.

2️⃣ Parameter-Efficient Residual Reconstruction: Instead of throwing away the compression leftovers, the authors recycle them. They capture the residual in a tiny, low-dimensional vector and use a sparse projection matrix to map it back into the model. The result: they recover much of the lost information with almost no extra memory footprint.

The Takeaway: Great AI engineering isn't just about building bigger models; it's about finding elegant, hardware-efficient ways to serve them.
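To make the adaptive rank allocation idea concrete, here is a minimal sketch of how routing frequency and information density could be fused into a per-expert rank budget. It is an illustration under my own assumptions, not the paper's formula: effective rank is taken as the exponential of the entropy of the normalized singular values, and the two signals are blended with a hypothetical weight `alpha`. The names `effective_rank` and `allocate_ranks` are also hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def allocate_ranks(routing_freq, spectra, total_budget, alpha=0.5, min_rank=1):
    """Split a total rank budget across experts.

    routing_freq: how often each expert is selected (any non-negative scale).
    spectra: one list of singular values per expert weight matrix.
    alpha: assumed blend weight between frequency share and density share.
    """
    n = len(spectra)
    if len(routing_freq) != n:
        raise ValueError("need one routing frequency per expert")
    spare = total_budget - min_rank * n
    if spare < 0:
        raise ValueError("budget too small for the minimum rank per expert")

    freq_total = sum(routing_freq)
    freq_share = [f / freq_total for f in routing_freq]

    eranks = [effective_rank(s) for s in spectra]
    erank_total = sum(eranks)
    density_share = [e / erank_total for e in eranks]

    # Fused score per expert; the scores sum to 1 by construction.
    scores = [alpha * f + (1 - alpha) * d for f, d in zip(freq_share, density_share)]

    # Floor the proportional share so the total never exceeds the budget,
    # and cap each expert at the number of singular values it actually has.
    ranks = [min_rank + int(spare * s) for s in scores]
    return [min(r, len(s)) for r, s in zip(ranks, spectra)]


# Three experts: one hot, one rare but information-dense, one rare and nearly rank-1.
freqs = [0.8, 0.1, 0.1]
spectra = [[1.0] * 8, [1.0] * 8, [8.0] + [0.01] * 7]
print(allocate_ranks(freqs, spectra, total_budget=12))
```

In this toy setup the rarely used expert with a flat spectrum keeps more rank than the rarely used expert whose spectrum is dominated by a single direction, which is the behavior the post describes: frequency alone would have treated the two identically.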

#LLMOptimization #SystemArchitecture #AIInfrastructure

Connect

Interested in distributed systems, agentic architecture, or collaborating on a project? Find me on GitHub, LinkedIn, or NotebookLM — or book a quick intro call.