Open to AI Engineer roles · EU / USA · remote-first

I build AI systems that survive Monday morning in production.

Juan David Suárez Sandoval — AI Engineer. Nine production-grade projects covering 99% of AI Engineering in 2026: hybrid RAG, fine-tuning with LoRA, agents with tool use, Document AI, Computer Use, Code AI, AI Safety, GraphRAG, and Voice AI. Each one ships with real datasets, ≥100-case eval sets, baselines, Docker, observability and public demo — built against a 12-block Definition of Done.

9
Projects targeting AI Engineering 2026
12
Definition-of-Done blocks per project
≥100
Eval cases with manual ground truth, per project
100%
Dockerized, traced, with public demo
01 · The thesis

What separates production AI from a demo

Four convictions that drive every architectural decision in this portfolio. Each one maps to a project that proves it.

"
01 / 04

Reliability under failure beats flashy demos.

The most expensive AI failures are silent. A confident wrong answer ships to a downstream system and pays the wrong vendor. Confidence-gated routing, safety pre-checks, and human-in-the-loop escalation are non-negotiable. (P02 Support Triage · P05 Computer Use)

"
02 / 04

The right small model beats the big one every time.

Zero-shot Claude on every ticket is $0.012 each and 0.78 F1. A LoRA-tuned DistilBERT does the same in 85 ms for $0.0004 with 0.98 F1. Pick the smallest model that meets the bar; reserve frontier reasoning for the hard 6%. (P02 Support Triage)

"
03 / 04

Retrieval quality beats vector similarity.

vector.search() is table stakes. Production needs hybrid BM25 + vector + reranking with RRF fusion, knowledge graphs for multi-hop reasoning, and continuous RAGAS evaluation against a held-out set. (P01 Conversational E-Commerce · P08 GraphRAG SEC)

"
04 / 04

Eval loops beat vibes.

"It works on my machine" is the death of agentic systems. Every project ships with LangSmith trace hooks, Pydantic boundaries, deterministic regression suites, and metrics tables that the README treats as first-class. No metric, no merge. (All nine)

02 · Capabilities

Six axes of production AI

A radar of the production concerns AI teams hire for. Each axis is grounded in at least one of the nine projects.

Capability map · self-assessed
Reliability Fine-tuning Retrieval Agent loops Safety / RT Observability

Engineered for the boring parts of AI.

Most "AI portfolios" are a chatbot in a Streamlit. The nine projects in this repository each answer one production question that hiring managers will ask in the loop: "how does it fail, and how does it know it failed?"

The radar is self-assessed against a rubric: each axis scores higher when the relevant project has a working orchestrator, a real eval suite, and a written rationale for the architectural choice over the obvious alternative. There are no "10/10" axes — every project still has a TODO list.

The capabilities below map across the nine projects in the next section.

Reliability under failure

Confidence-gated routing, safety pre-checks before irreversible actions, and explicit policies for "block / suggest / auto-resolve" tiers.

Fine-tuning when it pays

LoRA adapters on DistilBERT for 27-class intent classification: 0.98 macro-F1, 85 ms on CPU, $0.0004 per ticket — published with a full HuggingFace model card.

Advanced retrieval

GraphRAG, contextual retrieval (Anthropic), hybrid BM25 + vector + Cohere rerank with RRF, Ragas evaluation harness.

Agent loops & planning

LangGraph plan-execute-reflect loops, intent routing into specialist nodes, bounded retries, deterministic tool dispatch.

LLM safety & red-teaming

OWASP LLM Top-10 attack catalog, prompt-injection detection, PII redaction, guardrails as middleware, attack-success-rate reports.

Production observability

Docker Compose everywhere, Langfuse traces on every LLM call, Pydantic boundaries, cost tracking per pipeline run.

03 · The work

Nine projects covering AI Engineering 2026

Each one targets a real industrial use case, ships with a public dataset, ≥100-case eval set with manual ground truth, baselines and ablations, observability traces, and a public live demo. All built against the 12-block Definition of Done.

01 ◯ Planned

Conversational E-commerce Assistant

Hybrid retrieval · Reranking · Multi-turn cart agent

Customers search a 50K-product catalog in natural language, manage carts, request refunds. The system decides when to escalate to a human.

Use case: Rappi, Mercado Libre, Walmart, Instacart, Amazon
QdrantpgvectorChromaBM25 + RRFCohere RerankClaude Sonnet 4.5LangGraphRAGASStreamlit
Instacart Market Basket (Kaggle, 3.4M orders)
02 ◯ Planned

Customer Support Triage Agent

DistilBERT fine-tuned with LoRA · Similar-ticket retrieval · Confidence-gated auto-resolve

Tickets arrive via email/Slack/chat. The system classifies intent and priority, retrieves similar resolved tickets, drafts a solution, and decides: auto-resolve, suggest, or escalate.

Use case: Intercom, Zendesk, Freshdesk, HubSpot
DistilBERT + PEFTHuggingFace HubQdrantClaude Sonnet 4.5LangGraphNext.js + shadcnLangSmith
Bitext (27K intents) + Twitter Customer Support (3M)
03 ◯ Planned

B2B Sales Intelligence Agent

Planner-executor-reflector agent loop · Web search · Personalized outreach

Receives a list of target companies, researches each on public web and news, builds a structured profile, generates personalized cold outreach. Measures lift over template-only and single-pass baselines.

Use case: Apollo.io, Clay.com, Outreach.io
Claude Sonnet 4.5TavilyHackerNews APIselectolaxPydantic v2LangGraphNext.js
YC Companies (~5K with metadata)
04 ◯ Planned

Document Intelligence Pipeline

Layout analysis · OCR · Claude Vision fallback · Per-field confidence

Extracts structured data from complex PDFs: contracts, financial reports, medical forms, scanned forms with tables and multi-column layouts. Auto-approves high-confidence; routes low-confidence to human review.

Use case: Hyperscience, Rossum, Klarity
unstructured 0.16Tesseract / PaddleOCRClaude VisionCamelotTable TransformerPydantic v2Next.js + PDF viewer
FUNSD (199 forms) + DocVQA (12.7K docs) + PubLayNet (360K pages)
05 ◯ Planned

Computer Use Agent

Anthropic Computer Use API · Virtualized Ubuntu VM · Action-verification loop

Operates a virtualized desktop by reading screenshots and emitting clicks/keystrokes. Automates back-office workflows in legacy systems that don't expose APIs.

Use case: RPA for banking/insurance, legacy-system extraction
Claude Sonnet 4.5 (computer_use tool)Ubuntu 22.04 + XvfbxdotoolVNCLangGraphNext.js + VNC viewer
Custom eval (20 tasks) + OSWorld + WebArena benchmarks
06 ◯ Planned

Code Review Agent

tree-sitter AST · Multi-aspect parallel analyzers · GitHub Action

Reviews pull requests inline: detects bugs, flags security patterns, identifies missing tests, suggests performance improvements. Filters by severity to avoid drowning the developer.

Use case: Cursor, Codium, Sourcegraph Cody, Codacy
Claude Sonnet 4.5tree-sitterruff + mypysemgrepPyGithubLangGraphNext.js + diff viewer
SWE-bench Lite (300 issues) + CodeReviewer (642K diff/review pairs)
07 ◯ Planned

AI Safety & Red Teaming Framework

OWASP LLM Top 10 coverage · Adversarial attack suite · Guardrails layer

Evaluates other LLM-based systems for vulnerabilities: prompt injection, jailbreaks, PII leakage, hallucinations. Implements guardrails and produces security audit reports.

Use case: Robust Intelligence, Lakera, Protect AI, HiddenLayer
guardrails-ai / NeMoPresidio (PII)garak (NVIDIA)giskardprotectai/debertaNext.js dashboard
HarmBench (400) + JailbreakBench (100) + ToxicChat (10K)
08 ◯ Planned

GraphRAG over SEC EDGAR

Knowledge graph from 10-K filings · Cypher traversal · Hybrid graph + vector retrieval

Answers complex multi-hop questions over S&P 500 ecosystem: who supplies whom, who sits on competing boards, which companies share regulatory exposure. Microsoft GraphRAG technique applied to public financials.

Use case: Visible Alpha, Tegus, AlphaSense, M&A advisory
Neo4j 5 + APOC + GDSneo4j-graphragClaude Sonnet 4.5Voyage AIPydantic v2Next.js + react-force-graph-2d
SEC EDGAR 10-K filings (S&P 500, last 5 years, ~10K docs)
09 ◯ Planned

Voice AI Conversational Agent

Whisper STT · Claude reasoning · ElevenLabs TTS · Sub-second turn latency

Telephony customer service: caller talks, system transcribes, reasons, retrieves from KB, synthesizes natural voice response. Target: under 800ms end-to-end per turn to feel conversational.

Use case: Bland AI, Vapi, Retell, Hume — booking/support/commerce by voice
Whisper-large-v3Claude Sonnet 4.5ElevenLabs / XTTS-v2LiveKit / Twiliosilero-vadLangGraphNext.js + WebRTC
Mozilla Common Voice + LibriSpeech + Spoken-SQuAD + MultiWOZ 2.4
04 · The toolbox

What I actually reach for

Categorized by layer. Dots indicate self-assessed proficiency — three dots means I've shipped it under load, two means I've built a non-trivial project with it, one means I've used it enough to have an opinion.

λ
AI · Agents · Orchestration
core
LangGraph
LangChain core
LlamaIndex
Anthropic Computer Use
PEFT / LoRA fine-tuning
tree-sitter (diff + AST)
RAGAS evaluation
LangSmith
Σ
LLMs · Embeddings · Reranking
models
Claude (Anthropic SDK)
OpenAI API
Voyage AI embeddings
OpenAI text-embedding-3
Cohere Rerank v3
Tool use / structured output
Backend · Data
async-first
Python 3.12
FastAPI · async + SSE
Pydantic v2
PostgreSQL 16
pgvector
Neo4j 5 + APOC
Qdrant
Chroma
Redis
⟨/⟩
Frontend · Infra
ship it
Next.js 14 · App Router
React 18
TypeScript
Tailwind · shadcn/ui
react-flow · react-force-graph
recharts · D3
Docker · docker-compose
Railway · Vercel · Fly.io
05 · About

The person behind the repos

Short version below. The rest lives in the code.

Juan David Suárez Sandoval — Computer Science / Systems Engineering, Universidad Nacional de Colombia. Based in Bogotá, building for remote teams in EU and USA.

I build AI systems with the discipline of a backend engineer: typed boundaries, observability, evaluation loops, Docker from day one. Most "agents" you'll find online are demos. The work in this portfolio is engineered for what happens after the demo — the Monday morning when a customer hands you a real document and a real deadline.

Reliability under failure is the single most-asked-about property in AI hiring loops in 2025–2026. Every project here is built to answer one version of that question.

— my own thesis, written on day one

I'm currently open to Full-Stack AI Engineer and AI Platform Engineer roles — remote-first, willing to relocate for the right team. Comfortable with a paid take-home, a live system-design session, or a code walkthrough of any project here.

Outside the IDE: I read papers from Anthropic, Microsoft Research, and DeepMind weekly. I keep notes on what's actually shippable vs. what's still research-grade.

3
Languages spoken
15+
Agent roles designed
4
Vector stores used
README revisions
How I learn · recent papers I keep returning to
  • LoRA: Low-Rank Adaptation of Large Language ModelsMicrosoft
  • From Local to Global: A GraphRAG Approach to QAMicrosoft Research
  • Contextual RetrievalAnthropic
  • OWASP LLM Top 10 (2025)OWASP
  • Robust Speech Recognition via Large-Scale Weak SupervisionOpenAI · Whisper
  • SWE-bench: Can LLMs Resolve Real-World GitHub Issues?Princeton

Let's build something that holds up in production.

If you're hiring for AI engineering in EU or USA and you've read this far, I'd love to talk. I'll do a paid take-home, a live system-design session, or walk you through any project on this page over a call.

Available · EU / USA · remote Bogotá, Colombia · UTC−5 Reply within 24h