Conversational E-commerce Assistant
Hybrid retrieval · Reranking · Multi-turn cart agent
Customers search a 50K-product catalog in natural language, manage carts, request refunds. The system decides when to escalate to a human.
Juan David Suárez Sandoval — AI Engineer. Nine production-grade projects covering the core of AI engineering in 2026: hybrid RAG, fine-tuning with LoRA, agents with tool use, Document AI, Computer Use, Code AI, AI Safety, GraphRAG, and Voice AI. Each one ships with a real dataset, a ≥100-case eval set, baselines, Docker, observability, and a public demo — built against a 12-block Definition of Done.
Four convictions that drive every architectural decision in this portfolio. Each one maps to a project that proves it.
The most expensive AI failures are silent. A confident wrong answer ships to a downstream system and pays the wrong vendor. Confidence-gated routing, safety pre-checks, and human-in-the-loop escalation are non-negotiable. (P02 Support Triage · P05 Computer Use)
Zero-shot Claude on every ticket costs $0.012 per call at 0.78 F1. A LoRA-tuned DistilBERT does the same job in 85 ms for $0.0004 at 0.98 F1. Pick the smallest model that meets the bar; reserve frontier reasoning for the hard 6%. (P02 Support Triage)
vector.search() is table stakes. Production needs hybrid BM25 + vector + reranking with RRF fusion, knowledge graphs for multi-hop reasoning, and continuous Ragas evaluation against a held-out set. (P01 Conversational E-Commerce · P08 GraphRAG SEC)
"It works on my machine" is the death of agentic systems. Every project ships with LangSmith trace hooks, Pydantic boundaries, deterministic regression suites, and metrics tables that the README treats as first-class. No metric, no merge. (All nine)
A radar of the production concerns AI teams hire for. Each axis is grounded in at least one of the nine projects.
Most "AI portfolios" are a chatbot in a Streamlit. The nine projects in this repository each answer one production question that hiring managers will ask in the loop: "how does it fail, and how does it know it failed?"
The radar is self-assessed against a rubric: each axis scores higher when the relevant project has a working orchestrator, a real eval suite, and a written rationale for the architectural choice over the obvious alternative. There are no "10/10" axes — every project still has a TODO list.
The capabilities below map across the nine projects in the next section.
Confidence-gated routing, safety pre-checks before irreversible actions, and explicit policies for "block / suggest / auto-resolve" tiers.
LoRA adapters on DistilBERT for 27-class intent classification: 0.98 macro-F1, 85 ms on CPU, $0.0004 per ticket — published with a full HuggingFace model card.
GraphRAG, contextual retrieval (Anthropic), hybrid BM25 + vector + Cohere rerank with RRF, Ragas evaluation harness.
LangGraph plan-execute-reflect loops, intent routing into specialist nodes, bounded retries, deterministic tool dispatch.
OWASP LLM Top-10 attack catalog, prompt-injection detection, PII redaction, guardrails as middleware, attack-success-rate reports.
Docker Compose everywhere, Langfuse traces on every LLM call, Pydantic boundaries, cost tracking per pipeline run.
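The block / suggest / auto-resolve tiers above reduce to a small, testable gate. A minimal sketch, assuming hypothetical intent names, thresholds, and an irreversible-action list; the projects tune these per intent against the held-out eval set rather than hard-coding them.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str   # "auto_resolve" | "suggest" | "escalate"
    reason: str

# Hypothetical placeholder values, not the projects' tuned thresholds.
AUTO_RESOLVE_MIN = 0.95
SUGGEST_MIN = 0.75
IRREVERSIBLE = {"refund", "account_deletion"}

def route(intent: str, confidence: float) -> Decision:
    # Irreversible actions never auto-resolve, regardless of confidence.
    if intent in IRREVERSIBLE:
        return Decision("escalate", "irreversible action requires a human")
    if confidence >= AUTO_RESOLVE_MIN:
        return Decision("auto_resolve", f"confidence {confidence:.2f} clears the gate")
    if confidence >= SUGGEST_MIN:
        return Decision("suggest", "draft shown to a human agent for approval")
    return Decision("escalate", "low confidence")

print(route("password_reset", 0.97).action)  # → auto_resolve
print(route("refund", 0.99).action)          # → escalate
```

The ordering matters: the irreversibility check runs before any confidence comparison, which is what "safety pre-checks before irreversible actions" means in practice.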
Each one targets a real industrial use case, ships with a public dataset, ≥100-case eval set with manual ground truth, baselines and ablations, observability traces, and a public live demo. All built against the 12-block Definition of Done.
Hybrid retrieval · Reranking · Multi-turn cart agent
Customers search a 50K-product catalog in natural language, manage carts, request refunds. The system decides when to escalate to a human.
DistilBERT fine-tuned with LoRA · Similar-ticket retrieval · Confidence-gated auto-resolve
Tickets arrive via email/Slack/chat. The system classifies intent and priority, retrieves similar resolved tickets, drafts a solution, and decides: auto-resolve, suggest, or escalate.
Planner-executor-reflector agent loop · Web search · Personalized outreach
Receives a list of target companies, researches each on public web and news, builds a structured profile, generates personalized cold outreach. Measures lift over template-only and single-pass baselines.
Layout analysis · OCR · Claude Vision fallback · Per-field confidence
Extracts structured data from complex PDFs: contracts, financial reports, medical forms, scanned forms with tables and multi-column layouts. Auto-approves high-confidence; routes low-confidence to human review.
Anthropic Computer Use API · Virtualized Ubuntu VM · Action-verification loop
Operates a virtualized desktop by reading screenshots and emitting clicks/keystrokes. Automates back-office workflows in legacy systems that don't expose APIs.
tree-sitter AST · Multi-aspect parallel analyzers · GitHub Action
Reviews pull requests inline: detects bugs, flags security patterns, identifies missing tests, suggests performance improvements. Filters by severity to avoid drowning the developer.
OWASP LLM Top 10 coverage · Adversarial attack suite · Guardrails layer
Evaluates other LLM-based systems for vulnerabilities: prompt injection, jailbreaks, PII leakage, hallucinations. Implements guardrails and produces security audit reports.
Knowledge graph from 10-K filings · Cypher traversal · Hybrid graph + vector retrieval
Answers complex multi-hop questions over the S&P 500 ecosystem: who supplies whom, who sits on competing boards, which companies share regulatory exposure. Microsoft's GraphRAG technique applied to public financials.
Whisper STT · Claude reasoning · ElevenLabs TTS · Sub-second turn latency
Telephony customer service: caller talks, system transcribes, reasons, retrieves from KB, synthesizes natural voice response. Target: under 800 ms end-to-end per turn to feel conversational.
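The per-field confidence routing described for the Document AI project follows the same gate-then-escalate shape as the ticket triage. A minimal sketch; the field names, values, and 0.90 threshold are hypothetical placeholders.

```python
FIELD_GATE = 0.90  # hypothetical per-field threshold, tuned on the eval set in practice

def split_for_review(fields: dict[str, tuple[str, float]]) -> tuple[dict, dict]:
    """Partition extracted fields into auto-approved and human-review queues."""
    approved: dict[str, str] = {}
    review: dict[str, str] = {}
    for name, (value, confidence) in fields.items():
        (approved if confidence >= FIELD_GATE else review)[name] = value
    return approved, review

extraction = {                                 # hypothetical extractor output
    "invoice_number": ("INV-2041", 0.99),
    "total_amount":   ("1,240.50", 0.97),
    "due_date":       ("2026-03-01", 0.62),    # e.g. OCR struggled on this cell
}
approved, review = split_for_review(extraction)
# The document auto-approves only when the review queue is empty;
# otherwise a human sees exactly the low-confidence fields, not the whole form.
```

Routing at field granularity is the point: a human corrects one ambiguous date instead of re-keying an entire contract.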
Categorized by layer. Dots indicate self-assessed proficiency — three dots means I've shipped it under load, two means I've built a non-trivial project with it, one means I've used it enough to have an opinion.
Short version below. The rest lives in the code.
Juan David Suárez Sandoval — Computer Science / Systems Engineering, Universidad Nacional de Colombia. Based in Bogotá, building for remote teams in the EU and the US.
I build AI systems with the discipline of a backend engineer: typed boundaries, observability, evaluation loops, Docker from day one. Most "agents" you'll find online are demos. The work in this portfolio is engineered for what happens after the demo — the Monday morning when a customer hands you a real document and a real deadline.
Reliability under failure is the single most-asked-about property in AI hiring loops in 2025–2026. Every project here is built to answer one version of that question.
— my own thesis, written on day one

I'm currently open to Full-Stack AI Engineer and AI Platform Engineer roles — remote-first, willing to relocate for the right team. Comfortable with a paid take-home, a live system-design session, or a code walkthrough of any project here.
Outside the IDE: I read papers from Anthropic, Microsoft Research, and DeepMind weekly. I keep notes on what's actually shippable vs. what's still research-grade.
If you're hiring for AI engineering in the EU or the US and you've read this far, I'd love to talk. I'm happy to walk you through any project on this page over a call.