Experiment Tracking with MLflow and Langfuse
Set up experiment tracking for ML models with MLflow and LLM observability with Langfuse. Includes hyperparameter sweeps, model registry, and cost tracking.
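One piece of the cost-tracking side can be sketched in plain Python: compute a per-call dollar cost from token counts, the kind of number you would attach to a Langfuse trace as metadata. The model name and prices below are illustrative placeholders, not real rates.

```python
# Hypothetical USD rates per 1,000 tokens -- placeholders, not real pricing.
PRICE_PER_1K = {
    "example-model": {"input": 0.003, "output": 0.015},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the cost of one LLM call from its token counts."""
    rates = PRICE_PER_1K[model]
    return (input_tokens / 1000) * rates["input"] + \
           (output_tokens / 1000) * rates["output"]

# Accumulate across a session, as an observability layer would.
calls = [(1200, 300), (800, 150)]
total = sum(call_cost("example-model", i, o) for i, o in calls)
print(f"session cost: ${total:.4f}")
```

In practice you would log this number alongside each trace rather than print it, so cost per user, per feature, or per prompt version can be aggregated later.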
Tested, explained, with code that runs
When new models drop or interesting papers come out, I spin up the GPUs, implement the ideas, and report back what actually works. These are practical guides with runnable code, written from the Coast of Somewhere Beautiful. I learn by building, and I'm here to help you do the same.
NEW Article: What It Takes to Be a Senior Machine Learning Engineer
[NEW] Build a complete ML pipeline with GitHub Actions: data validation, model training, automated testing, and staged deployment to production.
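The data-validation gate in such a pipeline can be as simple as a script CI runs before training, failing the job when rows violate the schema. The column names and rules below are made-up examples, not from the article.

```python
def validate_rows(rows, required=("feature_a", "feature_b", "label")):
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    for i, row in enumerate(rows):
        # Required columns must be present and non-null.
        missing = [c for c in required if row.get(c) is None]
        if missing:
            problems.append(f"row {i}: missing {missing}")
        # Labels must come from the expected set.
        label = row.get("label")
        if label is not None and label not in (0, 1):
            problems.append(f"row {i}: label {label!r} not in {{0, 1}}")
    return problems

good = [{"feature_a": 1.0, "feature_b": 2.0, "label": 1}]
bad = [{"feature_a": 1.0, "feature_b": None, "label": 3}]
assert validate_rows(good) == []
assert len(validate_rows(bad)) == 2
```

In a workflow, the CI step would call this and exit nonzero when the problem list is non-empty, blocking training on bad data.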
[NEW] Deploy ML models to production with optimized inference: torch.compile vs ONNX benchmarks, FastAPI serving patterns, and AWS deployment options.
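Benchmarks like torch.compile vs ONNX come down to a careful latency harness. A minimal, dependency-free sketch of that harness, with stand-in workloads instead of real models, looks like this:

```python
import time
import statistics

def bench(fn, warmup=3, iters=20):
    """Return median latency in milliseconds after warmup runs."""
    for _ in range(warmup):
        fn()  # warm caches / JIT before measuring
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Stand-ins for two inference backends (e.g. eager vs compiled).
fast = lambda: sum(range(1_000))
slow = lambda: sum(range(100_000))
assert bench(fast) < bench(slow)
```

Median (not mean) latency is the usual choice here because a single GC pause or scheduler hiccup can badly skew an average over few iterations.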
[NEW] Monitor production ML models with data drift detection, performance tracking, and automated alerting. Includes working Python implementations.
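The core of a drift check is comparing the production feature distribution against the training one. A minimal, dependency-free version of the two-sample Kolmogorov–Smirnov statistic (the maximum gap between empirical CDFs) is a sketch of that idea; a production system would also compute a p-value and wire the result to alerting.

```python
import bisect

def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    xs = sorted(set(a) | set(b))
    def ecdf(sample, x):
        # Fraction of the sample <= x.
        return bisect.bisect_right(sample, x) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in xs)

reference = [i / 100 for i in range(100)]        # training distribution
drifted = [0.5 + i / 100 for i in range(100)]    # shifted in production
assert ks_statistic(reference, reference) == 0.0
assert ks_statistic(reference, drifted) > 0.4
```

A monitoring job would run this per feature on a schedule and alert when the statistic crosses a threshold tuned on historical data.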
[NEW] A roadmap to the skills, knowledge, and practices that separate senior MLEs from the rest, with links to hands-on tutorials for each area.
[FEATURED] Technical comparison of DeepSeek V3.2, Llama 4, Gemini 3, and Qwen3 architectures—plus DeepSeek's mHC innovation expected in V4.
[FEATURED] From DeepSeek's January bombshell to vibe coding going mainstream, here's what actually changed for AI practitioners in 2025.
Nvidia's Llama Nemotron RAG models are purpose-built for multimodal search and visual document retrieval tasks, combining vision and language capabilities for improved accuracy. This release offers practical value for practitioners implementing production RAG systems, particularly those handling mixed-media documents. The article likely covers model architecture, performance benchmarks, and implementation guidance relevant to building retrieval systems at scale.
InfiAgent addresses a critical production challenge for LLM agents: managing unbounded context growth and error accumulation during long-horizon tasks. The framework externalizes persistent state into a file-centric abstraction, offering practical solutions for deploying agents at scale without sacrificing reasoning stability—directly applicable to building robust agentic systems in production environments.
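The file-centric idea can be sketched in a few lines: instead of keeping a long-horizon agent's working memory in the prompt, persist it to disk and reload only what the current step needs. This is an illustration of the concept, not InfiAgent's actual abstraction; the class and schema below are invented.

```python
import json
import os
import tempfile

class FileState:
    """Toy file-backed state store standing in for prompt-resident memory."""
    def __init__(self, path):
        self.path = path

    def write(self, key, value):
        state = self.read_all()
        state[key] = value
        with open(self.path, "w") as f:
            json.dump(state, f)

    def read_all(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "agent_state.json")
state = FileState(path)
state.write("plan", ["fetch data", "summarize"])
state.write("step", 1)
# A fresh object (think: a new context window) sees the persisted state.
assert FileState(path).read_all()["step"] == 1
```

Because state lives outside the context window, the prompt stays bounded no matter how many steps the task takes, which is the stability property the paper targets.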
This systematic study evaluates embedding similarity metrics for predicting cross-lingual transfer success across African languages, providing practical guidance for selecting source languages in low-resource NLP scenarios. The findings on cosine gap and retrieval-based metrics (P@1, CSLS) offer actionable insights for practitioners building multilingual systems and optimizing transfer learning strategies. Relevant for those working with embeddings and retrieval systems in production ML contexts.
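P@1 over aligned embedding pairs, one of the retrieval metrics the study evaluates, is easy to illustrate: for each target-language embedding, check whether its nearest source-language neighbour is the correct pair. The 3-d vectors below are toy examples, not real multilingual embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def precision_at_1(source, target):
    """source[i] and target[i] are assumed to be aligned pairs."""
    hits = 0
    for i, t in enumerate(target):
        # Nearest source embedding by cosine similarity.
        best = max(range(len(source)), key=lambda j: cosine(t, source[j]))
        hits += best == i
    return hits / len(target)

source = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
target = [(0.9, 0.1, 0.0), (0.1, 0.9, 0.0), (0.0, 0.2, 0.8)]
assert precision_at_1(source, target) == 1.0
```

A high P@1 between two languages' embedding spaces is the kind of signal the paper uses to predict that transfer from one to the other will work well.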
This paper addresses practical enterprise search challenges by demonstrating how to fine-tune small language models for relevance labeling at scale, achieving quality comparable to LLMs with better efficiency. Directly applicable to production ML systems requiring domain-specific relevance ranking without the cost of large model inference. Combines fine-tuning techniques with retrieval system optimization, making it valuable for practitioners building scalable search and RAG pipelines.
MemRL presents a method for LLMs to self-improve through episodic memory and reinforcement learning, addressing limitations of fine-tuning and passive retrieval. The approach combines memory-based retrieval with active learning signals, relevant for building adaptive AI agents and RAG systems that evolve without catastrophic forgetting. Practical value for practitioners implementing production agents that need continuous improvement without expensive retraining.
This paper addresses practical table question answering using smaller, open-weight LLMs that can run locally, eliminating costly API dependencies. Directly relevant for practitioners deploying LLM-based systems in production environments with resource constraints, demonstrating how to achieve competitive performance with accessible models rather than proprietary large-scale alternatives.
Build a semantic search system to find historically similar College Football Playoff games using Amazon S3 Vectors and Bedrock embeddings.
Create a shared Christmas tree where visitors add AI-generated ornaments using Amazon Nova Canvas, with defense-in-depth content moderation using Bedrock Guardrails and Claude.
Create an AI bartender that suggests cocktails based on weather, searches by ingredient, and generates party menus with shopping lists.
Create an AI agent that combines tide, weather, and marine data to generate fishing reports. Learn tool-calling patterns with the Strands SDK, NOAA APIs, and Claude on AWS Bedrock.
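The tool-calling pattern underneath an agent like this is framework-agnostic: the model emits a tool name plus arguments, and a runtime dispatches to a registered function. The sketch below mocks the model output and the NOAA call; the tool name, station ID, and payload are invented, and this is not the Strands SDK's API.

```python
TOOLS = {}

def tool(fn):
    """Register a function so the agent runtime can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_tide(station: str) -> dict:
    # Stand-in for a real NOAA API call.
    return {"station": station, "high_tide": "06:12"}

def dispatch(call: dict):
    """Route a structured tool call to the registered implementation."""
    return TOOLS[call["name"]](**call["arguments"])

# What an LLM's structured tool-call output might look like:
model_output = {"name": "get_tide", "arguments": {"station": "8454000"}}
result = dispatch(model_output)
assert result["high_tide"] == "06:12"
```

An SDK like Strands adds schema generation, retries, and the model loop around this core, but the register-then-dispatch shape is the same.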