AI/ML · infra & backend · SRE mindset
RL researcher and product engineer. I find problems, build, ship, and keep models honest in production. 2× first-authored IEEE papers in RL; shipped production RL for robotics safety certification; now SRE on a 21M+ DAU platform, where the autonomous incident-response agent I built cut MTTR by 53%.
Engineer across AI/ML and infrastructure, with an SRE mindset. Product engineer end-to-end — ideate, build, ship, operate.
I care about the boring parts of production ML: drift, train/serve skew, cache hit rates, MTTR — the things that decide whether a model is actually useful once it leaves the notebook.
MS Computer Engineering (accelerated), University of Texas at Arlington.
Empirical RL for autonomous robot navigation. Simulations, reward structures, ablations, results.
A DQN-based navigation agent trained against a physics-informed simulator. Reward engineering and ablation studies across sim conditions; failure-mode analysis to characterize where the policy breaks.
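For orientation, a minimal sketch of the update at the core of this kind of DQN agent: TD targets from a frozen target network, Huber loss on the online network. PyTorch is used purely for illustration; the dimensions, architecture, and hyperparameters are placeholders, not the project's.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99   # placeholder sizes

def make_qnet() -> nn.Module:
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_qnet(), make_qnet()
target_net.load_state_dict(q_net.state_dict())   # frozen copy, synced periodically
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, done):
    """One gradient step on a batch: s (B, obs_dim), a (B,) long, r/done (B,) float."""
    with torch.no_grad():
        # TD target: y = r + gamma * max_a' Q_target(s', a'), zeroed at terminals
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) actually taken
    loss = nn.functional.smooth_l1_loss(q_sa, y)           # Huber loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```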
NVIDIA Sionna as the RF-propagation physics layer for an RL agent optimizing navigation paths. Higher-fidelity simulation narrows the sim-to-real gap that usually makes policy transfer brittle in robotics.
ML infrastructure projects on github.com/msmichellesamson. Each explores one silent-failure mode in production ML: drift, skew, cache misses, cost.
Can a heuristic router approximate RouteLLM? Send ~50% of queries to a cheaper model with negligible quality loss, without the deployment burden of a learned router.
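A minimal sketch of what such a heuristic router could look like, assuming cheap lexical features (prompt length, hard-task keywords, code-like content). Every feature, weight, and threshold here is invented for illustration.

```python
import re

# Keyword hints that a prompt likely needs the stronger model (invented list).
HARD_HINTS = ("prove", "derive", "step by step", "debug", "optimize", "refactor")

def complexity_score(query: str) -> float:
    q = query.lower()
    score = min(len(q.split()) / 100, 1.0)                # longer prompts skew harder
    score += 0.3 * sum(hint in q for hint in HARD_HINTS)  # task-difficulty keywords
    score += 0.2 * bool(re.search(r"\bdef\b|[{};]", q))   # code-ish content
    return score

def route(query: str, threshold: float = 0.5) -> str:
    # Below threshold -> cheap model; at or above -> strong model.
    return "cheap-model" if complexity_score(query) < threshold else "strong-model"

print(route("what's the capital of France?"))        # cheap-model
print(route("debug this function: def f(x): ..."))   # strong-model
```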
Detecting when an embedding model's output distribution has shifted underneath you — before downstream retrieval quietly degrades and nobody attributes it to drift for weeks.
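One label-free way to pose this, sketched below: compare a live batch of embeddings against a frozen reference batch with an MMD statistic. The kernel bandwidth and alert threshold are placeholders; a real deployment would calibrate the threshold by bootstrapping the reference set.

```python
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float) -> float:
    # Biased MMD^2 estimate with an RBF kernel.
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
ref = rng.normal(size=(256, 64))           # frozen reference embeddings
live = rng.normal(size=(256, 64)) + 0.2    # live batch with a small mean shift

mmd2 = rbf_mmd2(ref, live, sigma=np.sqrt(64))   # sqrt(dim) bandwidth heuristic
if mmd2 > 0.01:   # placeholder threshold; calibrate via bootstrap on ref
    print("embedding distribution shift detected")
```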
Predicting LLM quality degradation from inference-side signals alone, with no labels and no human feedback, before users notice: an observability problem for LLMs that classical APM never solved.
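A hedged sketch of the monitoring shape this implies: track a label-free proxy such as mean token log-probability per response, and alert when a rolling window sinks below a healthy-era baseline. The signal choice, window size, and drop threshold are illustrative, not the project's.

```python
from collections import deque
from statistics import fmean

class QualityProxyMonitor:
    """Alert when a rolling mean of a label-free proxy drops below baseline."""

    def __init__(self, baseline_logprob: float, window: int = 200, drop: float = 0.5):
        self.baseline = baseline_logprob     # healthy-era mean token logprob
        self.drop = drop                     # how far below baseline is alarming
        self.window = deque(maxlen=window)

    def observe(self, mean_token_logprob: float) -> bool:
        # Record one response; return True once degradation is suspected.
        self.window.append(mean_token_logprob)
        if len(self.window) < self.window.maxlen:
            return False                     # not enough data yet
        return fmean(self.window) < self.baseline - self.drop
```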
Predictive precomputation for RAG embedding caches. Cache hit rate is the lever for both latency and carbon cost — every hit is one fewer GPU forward pass.
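The simplest version of predictive precomputation, sketched under the assumption that query traffic has a heavy head: pre-embed the most frequent historical queries before peak hours. embed() and the cache dict are stand-ins for whatever backs the real system.

```python
from collections import Counter

def warm_cache(history: list[str], cache: dict, embed, top_k: int = 1000) -> int:
    """Pre-embed the top_k most frequent past queries; return number warmed."""
    warmed = 0
    for query, _count in Counter(history).most_common(top_k):
        if query not in cache:
            cache[query] = embed(query)   # one GPU forward pass now, not at peak
            warmed += 1
    return warmed
```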
Learned eviction vs. LRU/LFU for embedding caches, where “value” isn't just access frequency but semantic overlap with future queries. Benchmarks against Belady's MIN.
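One hypothetical form such a value function could take: score each cached entry by recency plus cosine similarity to a centroid of recent query embeddings, and evict the minimum. The weighting is a placeholder; the actual project benchmarks learned variants against LRU/LFU and Belady's MIN.

```python
import numpy as np

def eviction_scores(entries: dict[str, np.ndarray], last_access: dict[str, float],
                    recent_centroid: np.ndarray, now: float,
                    w_recency: float = 0.5) -> dict[str, float]:
    """Higher score = more worth keeping. Evict min(scores, key=scores.get)."""
    scores = {}
    for key, emb in entries.items():
        recency = 1.0 / (1.0 + now - last_access[key])      # LRU-like component
        sim = float(emb @ recent_centroid /
                    (np.linalg.norm(emb) * np.linalg.norm(recent_centroid) + 1e-9))
        scores[key] = w_recency * recency + (1 - w_recency) * sim
    return scores
```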
A sandbox for studying train/serve skew — the silent failure where feature pipelines diverge between training and serving, and model quality degrades with no exception thrown.
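The failure mode in miniature: the "same" feature implemented twice, once in training and once in serving, quietly disagreeing with no exception thrown. Both tokenizers below are invented examples of the bug class the sandbox reproduces.

```python
import re

def n_tokens_train(text: str) -> int:
    return len(re.findall(r"\w+", text.lower()))   # regex word tokenizer (training)

def n_tokens_serve(text: str) -> int:
    return len(text.lower().split())               # whitespace tokenizer (serving)

sample = "state-of-the-art RAG pipeline"
print(n_tokens_train(sample), n_tokens_serve(sample))   # 6 vs 3: same "feature", skewed
```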
SRE · ML engineering · RL research.
21M+ DAU · 27K+ RPS · globally distributed
Designed and delivered an autonomous Incident Response Agent end-to-end, built on a subagentic blackboard architecture (specialized agents per tool/skill). Shipped ahead of Google's ADK announcement, leading a team of two engineers; cut MTTR by 53% (73 → 34 min) and eliminated ~500 hrs/yr of toil. Separately, an observability audit surfaced $80K/yr in cloud savings across GCP, Kubernetes, and Terraform IaC.
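For the curious, a toy sketch of the blackboard pattern named above: specialized agents read shared incident state and post findings, and a thin controller fans out across them. Agent names and data shapes are invented; this is the pattern, not the production system.

```python
from dataclasses import dataclass, field

@dataclass
class Blackboard:
    incident: dict                                 # shared state all agents can read
    findings: list = field(default_factory=list)   # accumulated agent observations

class LogsAgent:
    def run(self, bb: Blackboard) -> None:
        if "error_spike" in bb.incident.get("symptoms", []):
            bb.findings.append(("logs", "error spike correlated with deploy"))

class MetricsAgent:
    def run(self, bb: Blackboard) -> None:
        if bb.incident.get("p99_ms", 0) > 500:
            bb.findings.append(("metrics", "p99 latency breach"))

def respond(incident: dict) -> list:
    bb = Blackboard(incident)
    for agent in (LogsAgent(), MetricsAgent()):   # one specialized agent per skill
        agent.run(bb)
    return bb.findings

print(respond({"symptoms": ["error_spike"], "p99_ms": 812}))
```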
Built the RL model behind a first-of-its-kind AI safety certification product for robotic systems. Designed dataset labeling and model-tuning pipelines (scikit-learn + TensorFlow); deployed production inference via FastAPI.
Shipped a Kotlin/Compose Android app to Google Play end-to-end: MVVM, Coroutines, REST APIs (Node.js/Express), Firebase (Firestore/Auth/Crashlytics), GitHub Actions CI/CD. Product engineering with a real release cadence.
Implemented Deep Q-Networks for autonomous robot navigation using CUDA, NVIDIA Sionna, and differentiable ray tracing. Produced two first-authored IEEE publications. Led a cross-functional ROS project and mentored undergraduates.
I'm actively exploring roles at AI/ML-and-systems companies — research residencies, engineering teams building models, ML infrastructure. If any of that is your world, say hi.