Building systems that scale and agents that reason
I design highly scalable applications, develop agentic microservices, and work on distributed systems — across both computation and storage. Projects, live demos, and daily technical writing live here.
Autonomous browser agent that accepts a target platform, operational goal, and task intent, then plans and executes multi-step UI workflows end to end. Designed to translate natural-language objectives into reliable, platform-specific interaction sequences without manual navigation.
Career intelligence platform with a companion browser extension. SpaCy NLP resolves contextual semantics from resume content, the Gemini REST API generates stack-aligned project outlines, and Azure Whisper handles voice-to-text ingestion for hands-free input.
Production e-commerce system for a local bakery with real-time admin notifications, bidirectional inventory sync between back office and storefront, Stripe payment processing, and automated inventory ingestion workflows.
Can we completely eliminate Machine Learning "Cold Starts" in Serverless Clusters?
When packaging ML models into serverless functions, the standard "container pre-warming" used by cloud providers isn't enough.
Why? Because traditional apps are lightweight, but ML workflows carry massive dependencies (like PyTorch) and heavy model files (like BERT). A staggering 70% of a serverless ML cold start is spent just loading these libraries from disk into memory.
In my latest technical report, I break down "Accelerating ML Inference via Opportunistic Pre-Loading on Serverless Clusters" (published in IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, February 2026). The paper introduces Tyche, an architecture that solves this by opportunistically pre-loading ML artifacts into already-warmed containers and GPUs before a request even lands.
Here is how the underlying math dynamically handles erratic traffic spikes without wasting heavy CPU retraining cycles:
⏱️ The 7.4-Second Math Adaptation
Instead of relying on rigid, historical 24-hour traffic averages that fail during sudden surges, Tyche monitors a tight sliding window of recent requests (e.g., W=5) to calculate the request arrival rate (lambda).
It then plugs this live rate into a Poisson distribution formula using two optimal probability thresholds:
Load Threshold P_load = 6%: The moment the probability of an incoming request hits 6%, Tyche acts. For a standard traffic pace of 0.5 requests/min, the math triggers a proactive pre-load timer at about 7.4 seconds of idle time. The model is loaded and waiting before the user arrives.
Offload Threshold P_offload = 94%: If traffic goes quiet and the probability that the pre-load prediction was wrong reaches 94% (around 5.6 minutes of idle time at the same 0.5 requests/min), Tyche immediately flushes the model to keep cluster memory lean.
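To make the two thresholds concrete, here is a minimal sketch of the arithmetic. It assumes the thresholds are applied to the Poisson probability of at least one arrival within t minutes of idle time, P = 1 - exp(-lambda * t). That assumption reproduces the 7.4-second and 5.6-minute figures above, but the paper's exact formulation may differ, and the function name is my own.

```python
import math


def idle_time_for_probability(rate_per_min: float, probability: float) -> float:
    """Idle time (minutes) at which P(at least one arrival) reaches `probability`.

    Assumes a Poisson arrival process: P(N(t) >= 1) = 1 - exp(-rate * t),
    which rearranges to t = -ln(1 - probability) / rate.
    """
    if rate_per_min <= 0 or not 0 < probability < 1:
        raise ValueError("rate must be positive and probability strictly between 0 and 1")
    return -math.log(1.0 - probability) / rate_per_min


rate = 0.5  # requests per minute, the example pace used in the post

# Load threshold of 6%: when to start pre-loading (converted to seconds).
preload_after_s = idle_time_for_probability(rate, 0.06) * 60

# Offload threshold of 94%: when to flush the model (kept in minutes).
offload_after_min = idle_time_for_probability(rate, 0.94)

print(f"pre-load after {preload_after_s:.1f} s idle")
print(f"offload after {offload_after_min:.1f} min idle")
```

Because both thresholds use the same closed-form expression, changing the arrival rate moves both timers at once with no model fitting involved.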
⚡ The Real Engineering Win
When a sudden burst of traffic hits, the sliding window instantly recalculates. If λ jumps from 0.5 to 0.55 requests/min:
1. Zero Retraining Overhead: No heavy GPU/CPU cycles are wasted adjusting complex ML weights.
2. Instant Math Recalculation: The target pre-load window automatically tightens from 7.4 seconds down to ~6.7 seconds.
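The recalculation can be sketched with a small sliding-window estimator. This is an illustration under my own assumptions, not Tyche's implementation: the class name, the choice to average the last W inter-arrival gaps, and the helper `preload_delay_s` are all hypothetical.

```python
import math
from collections import deque
from typing import Optional


class SlidingRateEstimator:
    """Estimates the arrival rate from the last `window` inter-arrival gaps."""

    def __init__(self, window: int = 5):
        self.gaps = deque(maxlen=window)  # most recent gaps, in seconds
        self.last_arrival: Optional[float] = None

    def observe(self, timestamp_s: float) -> None:
        """Record one request arrival; old gaps fall out of the window."""
        if self.last_arrival is not None:
            self.gaps.append(timestamp_s - self.last_arrival)
        self.last_arrival = timestamp_s

    def rate_per_min(self) -> Optional[float]:
        """Requests per minute, or None until at least one gap is known."""
        if not self.gaps:
            return None
        mean_gap = sum(self.gaps) / len(self.gaps)
        return 60.0 / mean_gap if mean_gap > 0 else None


def preload_delay_s(rate_per_min: float, p_load: float = 0.06) -> float:
    """Idle seconds before pre-loading, assuming P = 1 - exp(-rate * t)."""
    return -math.log(1.0 - p_load) / rate_per_min * 60.0


est = SlidingRateEstimator(window=5)
t = 0.0
est.observe(t)
for _ in range(5):          # steady traffic: one request every 120 s (0.5/min)
    t += 120.0
    est.observe(t)
steady = preload_delay_s(est.rate_per_min())

for _ in range(5):          # surge: gaps shrink to 60 / 0.55 seconds (0.55/min)
    t += 60.0 / 0.55
    est.observe(t)
surge = preload_delay_s(est.rate_per_min())

print(f"steady: {steady:.2f} s, surge: {surge:.2f} s")
```

The only state is a five-element deque, so each new request costs a constant-time update rather than a retraining step.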
The entire system winds up aggressively during surges and relaxes during lulls—yielding up to a 93% reduction in loading latency.
#Serverless #MachineLearning #SystemArchitecture #CloudComputing #AWSLambda #DistributedSystems #IEEE #TechCommunity
I recently dove into an incredible paper on RFID-MoE (Compression via Adaptive Routing and Information Density), and it tackles one of the biggest bottlenecks in modern AI infrastructure: the massive memory footprint of sparsely activated models.
Here is my quick breakdown of the problem and the clever engineering solutions the authors proposed:
🚨 The Bottleneck
While MoE models save computing power by routing data to smaller, independent "expert" sub-networks instead of one giant network, storing all those experts still requires an immense amount of GPU memory. Standard compression techniques (like SVD) try to shrink these experts, but they suffer from two major flaws:
They treat all experts equally: They ignore the fact that some experts are used thousands of times while others are rarely touched.
They throw away the scraps: They treat the leftover data from compression (the "residual") as trash and discard it.
💡 The RFID-MoE Solution
The authors introduced two brilliant mechanisms to optimize this workflow:
1️⃣ Adaptive Rank Allocation: Instead of a uniform memory budget, the system looks at both Routing Frequency (how often an expert is used) and Information Density (its effective rank). By fusing these two metrics, it cuts the memory spent on low-value capacity while fiercely protecting the highly specialized knowledge hidden in rare experts.
2️⃣ Parameter-Efficient Residual Reconstruction: Instead of throwing away the compression leftovers, they recycled them! They captured the residual into a tiny, low-dimensional vector and used a clever sparse projection matrix to map it back into the model. The result? They recovered a massive amount of lost information with almost zero extra memory footprint.
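To show what fusing the two signals could look like, here is a rough sketch of the first mechanism. The fusion rule (a weighted sum of each expert's share of routing traffic and its share of effective rank), the `alpha` weight, and both function names are my own illustrative assumptions, not the formula from the paper. Effective rank is computed as the exponential of the entropy of the normalized singular values, which is one common definition.

```python
import math


def effective_rank(singular_values):
    """exp(Shannon entropy) of the normalized singular values."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def allocate_ranks(routing_counts, spectra, rank_budget, alpha=0.5, min_rank=1):
    """Split a rank budget across experts using a fused score (assumed rule).

    score = alpha * routing share + (1 - alpha) * effective-rank share.
    Rounding and clamping mean the ranks may not sum exactly to the budget.
    """
    total_count = sum(routing_counts)
    eff = [effective_rank(s) for s in spectra]
    total_eff = sum(eff)
    ranks = []
    for count, e, spectrum in zip(routing_counts, eff, spectra):
        score = alpha * count / total_count + (1 - alpha) * e / total_eff
        rank = round(score * rank_budget)
        # Never drop an expert entirely, never exceed its full rank.
        ranks.append(max(min_rank, min(len(spectrum), rank)))
    return ranks


# Toy data: three experts with made-up routing counts and singular values.
routing_counts = [9000, 900, 100]
spectra = [
    [10.0, 1.0, 0.5, 0.1],  # hot expert, energy concentrated in one direction
    [5.0, 4.0, 3.0, 2.0],   # moderately used, fairly dense
    [3.0, 3.0, 3.0, 3.0],   # rarely used, but information is spread evenly
]

fused = allocate_ranks(routing_counts, spectra, rank_budget=8)
freq_only = allocate_ranks(routing_counts, spectra, rank_budget=8, alpha=1.0)
print("fused:", fused, "frequency only:", freq_only)
```

In this toy setup the rarely routed but dense third expert keeps more rank under the fused score than it would if routing frequency were the only signal.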
The Takeaway: Great AI engineering isn't just about building bigger models; it's about finding elegant, hardware-efficient ways to serve them.
Discover a dual-layer audit strategy in Python using .find() and .index() to build resilient cloud automation in Azure. Learn how to elegantly separate harmless file discoveries from critical data contract violations to create self-auditing data pipelines.
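As a taste of the idea, here is a minimal sketch of the two layers using only Python's built-in string methods: `.find()` returns -1 when a substring is missing, while `.index()` raises `ValueError`. The function name, the `_archive` marker, and the `dataset_rest` naming contract are hypothetical examples, and no Azure SDK calls are shown.

```python
def audit_blob_name(blob_name: str) -> dict:
    """Two-layer audit of a file name such as 'sales_2024-01.csv'."""
    # Layer 1 (soft): .find() returns -1 when the marker is absent,
    # so an optional marker is simply recorded, never treated as an error.
    is_archive = blob_name.find("_archive") != -1

    # Layer 2 (hard): .index() raises ValueError when the separator is
    # missing, which is surfaced as a data contract violation.
    try:
        separator_pos = blob_name.index("_")
    except ValueError as exc:
        raise ValueError(
            f"data contract violation: {blob_name!r} has no '_' separator"
        ) from exc

    return {"dataset": blob_name[:separator_pos], "is_archive": is_archive}


print(audit_blob_name("sales_2024-01.csv"))
print(audit_blob_name("sales_archive_2023.csv"))

try:
    audit_blob_name("readme.txt")
except ValueError as err:
    print(err)
```

The soft layer keeps the pipeline moving on informational findings, while the hard layer stops it loudly when a file breaks the agreed naming contract.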
Connect
Interested in distributed systems, agentic architecture, or collaborating on a project? Find me on GitHub, LinkedIn, or NotebookLM — or book a quick intro call.