Juan David Suárez Sánchez — AI Engineer Portfolio

Juan David Suárez Sánchez — AI Engineer.

The difference between an AI system that works and one that fails is rarely in the model. It is in the retrieval, the evaluation, the guardrails, and the observability.

Projects targeting AI Engineering 2026

Definition-of-Done blocks per project

≥100 eval cases with manual ground truth, per project

100% Dockerized, traced, with public demo

What I learned building AI systems

Six self-assessed axes of AI in production, with the conviction each one supports.

Capability map · self-assessed

The boring parts are where production is decided.

Most "AI portfolios" are a chatbot in Streamlit. The nine projects here answer a question a hiring manager will ask in the interview: "How does it fail, and how does it notice that it failed?"

The radar is self-assessed against a rubric: each axis goes up when the corresponding project has a working orchestrator, a real eval, and a written justification of why that architecture was chosen over the obvious alternative. No axis is at 10/10; every project still has a TODO list.

Reliability

A production AI system is judged not by its best answer but by its worst failure. Irreversible decisions (paying, sending, escalating) need explicit confidence checks before they execute, not after.
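A minimal sketch of a confidence gate that runs before an irreversible action, not the code of any project on this page. The threshold value, the action names, and the `ProposedAction` and `gate` names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold; a real system would tune it on an eval set.
CONFIDENCE_THRESHOLD = 0.9

# Action names mirror the examples in the text (pay, send, escalate).
IRREVERSIBLE_ACTIONS = {"pay", "send", "escalate"}


@dataclass
class ProposedAction:
    name: str
    confidence: float  # model-reported score in [0, 1]


def gate(action: ProposedAction, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return "execute" or "needs_review"; the check runs before execution."""
    if action.name in IRREVERSIBLE_ACTIONS and action.confidence < threshold:
        return "needs_review"
    return "execute"
```

Reversible actions pass through regardless of score, so the review queue only receives the cases where a mistake cannot be undone.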

Model selection

Using the most expensive model for every task is an engineering mistake, not a budget one. A small fine-tuned classifier handles 80% of tasks at a cost and latency several orders of magnitude lower. The frontier model is reserved for the 20% that really needs it.
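One way to picture this split is a router that tries the small model first and falls back only when its confidence is low. This is a sketch under assumptions: the threshold, the `route` function, and both toy models are hypothetical stand-ins, not real model calls.

```python
from typing import Callable, Tuple

# Hypothetical cutoff below which the small model's answer is not trusted.
ROUTING_THRESHOLD = 0.8


def route(
    text: str,
    small_model: Callable[[str], Tuple[str, float]],
    frontier_model: Callable[[str], str],
    threshold: float = ROUTING_THRESHOLD,
) -> Tuple[str, str]:
    """Try the small classifier first; fall back to the frontier model.

    Returns (label, which_model_answered).
    """
    label, confidence = small_model(text)
    if confidence >= threshold:
        return label, "small"
    return frontier_model(text), "frontier"


# Stand-in models for illustration only.
def toy_small_model(text: str) -> Tuple[str, float]:
    if "refund" in text.lower():
        return "refund_request", 0.97
    return "other", 0.40


def toy_frontier_model(text: str) -> str:
    return "complex_case"
```

Logging which branch answered each request makes it possible to measure how much traffic the small model actually handles.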

Retrieval

Pure vector search does not scale to production. The systems that work combine BM25, dense embeddings, and reranking, evaluated continuously against a query set with manual ground truth.
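A minimal sketch of the fusion step, using reciprocal rank fusion (RRF) to merge a lexical ranking with a dense ranking. The document ids and both input rankings are made up for illustration, and the reranking stage that would follow is not shown.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is the value commonly used in the literature.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (descending); break ties by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["d1", "d2", "d3"]   # hypothetical lexical results
dense_ranking = ["d3", "d1", "d4"]  # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF only uses rank positions, so it needs no score normalization between BM25 and cosine similarity, which is one reason it is a common default for hybrid search.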

Evaluation

Without an eval suite, you cannot tell whether an AI system is improving or getting worse. Metrics are not a deliverable separate from the code; they are part of the code.
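To illustrate metrics living next to the code, here is a small sketch of a retrieval eval with a regression gate that could run in CI. The eval cases, the toy retriever, and the baseline score are all hypothetical; recall@k is used here only as one example metric.

```python
from typing import Callable, Dict, List


def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return sum(1 for doc_id in relevant if doc_id in top_k) / len(relevant)


def run_eval(retrieve: Callable[[str], List[str]], cases: List[Dict], k: int = 3) -> float:
    """Mean recall@k over eval cases with manually labeled relevant ids."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [recall_at_k(retrieve(c["query"]), c["relevant"], k) for c in cases]
    return sum(scores) / len(scores)


# Hypothetical eval cases and a stand-in retriever, for illustration only.
CASES = [
    {"query": "return policy", "relevant": ["doc_returns"]},
    {"query": "shipping time", "relevant": ["doc_shipping", "doc_eta"]},
]


def toy_retrieve(query: str) -> List[str]:
    index = {
        "return policy": ["doc_returns", "doc_faq"],
        "shipping time": ["doc_shipping", "doc_faq"],
    }
    return index.get(query, [])


BASELINE = 0.70  # hypothetical score of the previous release
score = run_eval(toy_retrieve, CASES)
assert score >= BASELINE, f"retrieval regressed: {score:.2f} < {BASELINE:.2f}"
```

Because the final assertion fails the run when the score drops below the baseline, a regression blocks the change instead of surfacing later in a report.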

Agent loops

An agent that decides alone without checks is an agent that fails alone. Plan-execute-reflect loops work when every step is observable, every tool call has bounded retries, and every irreversible decision passes through an explicit validation node.
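The two mechanical pieces of that sentence, bounded retries and a validation node, can be sketched in plain Python. This is an illustration rather than a LangGraph implementation; `call_with_retries`, `validate_step`, the step dictionary keys, and the flaky tool are hypothetical.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(tool: Callable[[], T], max_attempts: int = 3) -> T:
    """Call a tool, retrying on failure up to max_attempts times in total."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
            # Printing stands in for structured tracing of each attempt.
            print(f"attempt {attempt}/{max_attempts} failed: {exc}")
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error


def validate_step(step: dict) -> bool:
    """Validation node: irreversible steps need explicit approval to proceed."""
    return not step.get("irreversible", False) or step.get("approved", False)


# Stand-in flaky tool that fails twice, then succeeds.
calls: List[int] = []


def flaky_search() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("search timed out")
    return "results"


result = call_with_retries(flaky_search, max_attempts=3)
```

The retry bound turns an open-ended loop into a failure that surfaces after a known number of attempts, which is what makes the step observable in practice.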

Deployment and production

Code that works locally but cannot be deployed is incomplete code. The gap between prototype and production is closed with Docker from day one, schema validation on every endpoint, externalized configuration, and observability in place from the system's first request.
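A standard-library sketch of two of those practices: configuration read from the environment and a request body validated at the boundary. The stack listed on this page names Pydantic v2 and FastAPI for this job; plain Python is used here only to keep the example self-contained. The variable names `APP_MODEL_NAME` and `APP_TIMEOUT_S`, the defaults, and the payload fields are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Configuration read from the environment instead of hard-coded values."""
    model_name: str
    request_timeout_s: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # APP_MODEL_NAME and APP_TIMEOUT_S are hypothetical variable names.
    return Settings(
        model_name=env.get("APP_MODEL_NAME", "small-classifier"),
        request_timeout_s=float(env.get("APP_TIMEOUT_S", "10")),
    )


def validate_query_payload(payload: dict) -> dict:
    """Reject malformed request bodies at the endpoint boundary."""
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")
    top_k = payload.get("top_k", 5)
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 50:
        raise ValueError("'top_k' must be an integer between 1 and 50")
    return {"query": query.strip(), "top_k": top_k}
```

Passing the environment in as a mapping keeps `load_settings` testable without touching the real process environment.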

Nine projects covering AI Engineering 2026

Each project is built on a public dataset, evaluated against manual ground truth with industry-standard metrics (RAGAS, F1, WER), compared against at least two baselines, and deployed with full observability.
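Of the metrics named above, word error rate (WER) is simple enough to show in full. The sketch below computes it as word-level edit distance divided by the reference length; it splits on whitespace only and applies no text normalization, which a real speech eval would usually add.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            )
        previous = current
    return previous[-1] / len(ref)
```

For example, dropping one word from a six-word reference gives a WER of 1/6. The value can exceed 1.0 when the hypothesis contains many inserted words.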

01 ◯ Planned

Conversational E-commerce Assistant

Hybrid retrieval · Reranking · Multi-turn cart agent

Customers search a 50K-product catalog in natural language, manage carts, request refunds. The system decides when to escalate to a human.

Use case: Rappi, Mercado Libre, Walmart, Instacart, Amazon

Qdrant · pgvector · Chroma · BM25 + RRF · Cohere Rerank · Claude Sonnet 4.5 · LangGraph · RAGAS · Streamlit

Instacart Market Basket (Kaggle, 3.4M orders)

→ Live demo → Repo on GitHub → Full spec

02 ◯ Planned

Customer Support Triage Agent

DistilBERT fine-tuned with LoRA · Similar-ticket retrieval · Confidence-gated auto-resolve

Tickets arrive via email/Slack/chat. The system classifies intent and priority, retrieves similar resolved tickets, drafts a solution, and decides: auto-resolve, suggest, or escalate.

Use case: Intercom, Zendesk, Freshdesk, HubSpot

DistilBERT + PEFT · HuggingFace Hub · Qdrant · Claude Sonnet 4.5 · LangGraph · Next.js + shadcn · LangSmith

Bitext (27K intents) + Twitter Customer Support (3M)

→ Live demo → Repo on GitHub → Full spec

03 ◯ Planned

B2B Sales Intelligence Agent

Planner-executor-reflector agent loop · Web search · Personalized outreach

Receives a list of target companies, researches each on public web and news, builds a structured profile, generates personalized cold outreach. Measures lift over template-only and single-pass baselines.

Use case: Apollo.io, Clay.com, Outreach.io

Claude Sonnet 4.5 · Tavily · HackerNews API · selectolax · Pydantic v2 · LangGraph · Next.js

YC Companies (~5K with metadata)

→ Live demo → Repo on GitHub → Full spec

04 ◯ Planned

Document Intelligence Pipeline

Layout analysis · OCR · Claude Vision fallback · Per-field confidence

Extracts structured data from complex PDFs: contracts, financial reports, medical forms, scanned forms with tables and multi-column layouts. Auto-approves high-confidence; routes low-confidence to human review.

Use case: Hyperscience, Rossum, Klarity

unstructured 0.16 · Tesseract / PaddleOCR · Claude Vision · Camelot · Table Transformer · Pydantic v2 · Next.js + PDF viewer

FUNSD (199 forms) + DocVQA (12.7K docs) + PubLayNet (360K pages)

→ Live demo → Repo on GitHub → Full spec

05 ◯ Planned

Computer Use Agent

Anthropic Computer Use API · Virtualized Ubuntu VM · Action-verification loop

Operates a virtualized desktop by reading screenshots and emitting clicks/keystrokes. Automates back-office workflows in legacy systems that don't expose APIs.

Use case: RPA for banking/insurance, legacy-system extraction

Claude Sonnet 4.5 (computer_use tool) · Ubuntu 22.04 + Xvfb · xdotool · VNC · LangGraph · Next.js + VNC viewer

Custom eval (20 tasks) + OSWorld + WebArena benchmarks

→ Live demo → Repo on GitHub → Full spec

06 ◯ Planned

Code Review Agent

tree-sitter AST · Multi-aspect parallel analyzers · GitHub Action

Reviews pull requests inline: detects bugs, flags security patterns, identifies missing tests, suggests performance improvements. Filters by severity to avoid drowning the developer.
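A possible shape for the severity filter, sketched under assumptions: the four severity levels, the `Finding` fields, the default minimum severity, and the comment cap are hypothetical choices, not the project's specification.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    ERROR = 3
    CRITICAL = 4


@dataclass
class Finding:
    path: str
    line: int
    severity: Severity
    message: str


def filter_findings(
    findings: List[Finding],
    min_severity: Severity = Severity.WARNING,
    max_comments: int = 10,
) -> List[Finding]:
    """Keep findings at or above min_severity, most severe first, capped."""
    kept = [f for f in findings if f.severity >= min_severity]
    kept.sort(key=lambda f: (-f.severity, f.path, f.line))
    return kept[:max_comments]
```

Capping the number of inline comments as well as filtering by level keeps a large pull request from producing more feedback than a reviewer will read.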

Use case: Cursor, Codium, Sourcegraph Cody, Codacy

Claude Sonnet 4.5 · tree-sitter · ruff + mypy · semgrep · PyGithub · LangGraph · Next.js + diff viewer

SWE-bench Lite (300 issues) + CodeReviewer (642K diff/review pairs)

→ Live demo → Repo on GitHub → Full spec

07 ◯ Planned

AI Safety & Red Teaming Framework

OWASP LLM Top 10 coverage · Adversarial attack suite · Guardrails layer

Evaluates other LLM-based systems for vulnerabilities: prompt injection, jailbreaks, PII leakage, hallucinations. Implements guardrails and produces security audit reports.

Use case: Robust Intelligence, Lakera, Protect AI, HiddenLayer

guardrails-ai / NeMo · Presidio (PII) · garak (NVIDIA) · giskard · protectai/deberta · Next.js dashboard

HarmBench (400) + JailbreakBench (100) + ToxicChat (10K)

→ Live demo → Repo on GitHub → Full spec

08 ◯ Planned

GraphRAG over SEC EDGAR

Knowledge graph from 10-K filings · Cypher traversal · Hybrid graph + vector retrieval

Answers complex multi-hop questions over the S&P 500 ecosystem: who supplies whom, who sits on competing boards, which companies share regulatory exposure. Applies Microsoft's GraphRAG technique to public financial filings.

Use case: Visible Alpha, Tegus, AlphaSense, M&A advisory

Neo4j 5 + APOC + GDS · neo4j-graphrag · Claude Sonnet 4.5 · Voyage AI · Pydantic v2 · Next.js + react-force-graph-2d

SEC EDGAR 10-K filings (S&P 500, last 5 years, ~10K docs)

→ Live demo → Repo on GitHub → Full spec

09 ◯ Planned

Voice AI Conversational Agent

Whisper STT · Claude reasoning · ElevenLabs TTS · Sub-second turn latency

Telephony customer service: caller talks, system transcribes, reasons, retrieves from KB, synthesizes natural voice response. Target: under 800ms end-to-end per turn to feel conversational.
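The 800 ms target is easiest to reason about as a budget split across pipeline stages. In the sketch below every per-stage number is a made-up placeholder, not a measurement, and the stage names are assumptions; only the 800 ms total comes from the text.

```python
from typing import Dict

TURN_BUDGET_MS = 800  # end-to-end target stated in the text

# Hypothetical per-stage latencies in milliseconds, for illustration only.
stage_latency_ms: Dict[str, float] = {
    "stt": 180.0,
    "retrieval": 60.0,
    "llm_first_token": 350.0,
    "tts_first_audio": 150.0,
}


def check_budget(stages: Dict[str, float], budget_ms: float = TURN_BUDGET_MS) -> Dict[str, object]:
    """Sum sequential stage latencies and report headroom and the slowest stage."""
    total = sum(stages.values())
    slowest = max(stages, key=stages.get)
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        "slowest_stage": slowest,
    }


report = check_budget(stage_latency_ms)
```

The sum assumes the stages run strictly one after another. Streaming between stages would overlap them, so this gives a conservative upper bound.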

Use case: Bland AI, Vapi, Retell, Hume — booking/support/commerce by voice

Whisper-large-v3 · Claude Sonnet 4.5 · ElevenLabs / XTTS-v2 · LiveKit / Twilio · silero-vad · LangGraph · Next.js + WebRTC

Mozilla Common Voice + LibriSpeech + Spoken-SQuAD + MultiWOZ 2.4

→ Live demo → Repo on GitHub → Full spec

The tools I build with

Eight layers, from models to methodologies. The names are what I actually use across the nine projects; the details of how they combine are in each repo.

Models and providers

Claude (Anthropic SDK) · OpenAI API · Voyage AI · text-embedding-3 · Cohere Rerank v3 · Whisper · ElevenLabs

Agent orchestration

LangGraph · LangChain · LlamaIndex · MCP · Anthropic Computer Use · Claude Code · tool use · function calling · structured outputs

Retrieval and databases

Qdrant · pgvector · Chroma · Pinecone · Neo4j 5 + APOC · GraphRAG (Microsoft) · neo4j-graphrag · PostgreSQL 16 · Redis · BM25 · RRF fusion

Training, evaluation, and observability

HuggingFace Transformers · PEFT · LoRA · Unsloth · Weights & Biases · RAGAS · Promptfoo · DeepEval · Inspect AI · LangSmith · Langfuse

Specialties

Document AI: unstructured · Tesseract · PaddleOCR · Camelot · Table Transformer · Claude Vision
Voice: LiveKit · silero-vad · XTTS-v2
Code AI: tree-sitter · ruff · mypy · semgrep · PyGithub
Safety: guardrails-ai · NeMo · Presidio · garak · giskard

Backend, infrastructure, and cloud

Python 3.12 · FastAPI (async + SSE) · Pydantic v2 · Docker · docker-compose · GitHub Actions · Next.js 14 · React 18 · TypeScript · Tailwind · shadcn/ui · react-flow · react-force-graph · recharts · D3 · GCP · AWS · Azure · Vercel · Railway · Fly.io

Mathematical foundations

Formal logic · Calculus · Linear algebra · Probability · Statistics

Methodologies and practices

Agile · Scrum · Kanban · Git workflows · code review · CI/CD

The person behind the repos

Short version below. The rest lives in the code.

Juan David Suárez Sánchez — Computer Science / Systems Engineering, Universidad Nacional de Colombia. Based in Bogotá, building for remote teams in EU and USA.

I build AI systems with the discipline of a backend engineer: typed boundaries, observability, evaluation loops, Docker from day one. Most "agents" you'll find online are demos. The work in this portfolio is engineered for what happens after the demo — the Monday morning when a customer hands you a real document and a real deadline.

Reliability under failure is the single most-asked-about property in AI hiring loops in 2025–2026. Every project here is built to answer one version of that question.

— my own thesis, written on day one

I'm currently open to Full-Stack AI Engineer and AI Platform Engineer roles — remote-first, willing to relocate for the right team. Comfortable with a paid take-home, a live system-design session, or a code walkthrough of any project here.

Outside the IDE: I read papers from Anthropic, Microsoft Research, and DeepMind weekly. I keep notes on what's actually shippable vs. what's still research-grade.

Languages spoken

15+ agent roles designed

Vector stores used

∞ README revisions

How I learn · recent papers I keep returning to

LoRA: Low-Rank Adaptation of Large Language Models · Microsoft
From Local to Global: A Graph RAG Approach to Query-Focused Summarization · Microsoft Research
Contextual Retrieval · Anthropic
OWASP LLM Top 10 (2025) · OWASP
Robust Speech Recognition via Large-Scale Weak Supervision · OpenAI · Whisper
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · Princeton

Let's build something that holds up in production.

If you're hiring for AI engineering in EU or USA and you've read this far, I'd love to talk. I'll do a paid take-home, a live system-design session, or walk you through any project on this page over a call.

● Available · EU / USA · remote · Bogotá, Colombia · UTC−5 · Reply within 24h