Role Overview:
We are looking for an AI Developer who can build intelligent document processing pipelines — primarily focused on extracting structured and unstructured data from PDFs, with principles that extend to image-based inputs. You will design and ship OCR-powered solutions that turn raw documents (contracts, forms, reports) into clean, queryable data, and integrate LLM reasoning layers on top using OpenAI and Anthropic APIs.
This is a hands-on engineering role. You will own the full pipeline: ingestion, OCR, parsing, prompt engineering, and API delivery via FastAPI.
What You'll Do:
Document Intelligence & OCR:
- Design and build end-to-end PDF and document extraction pipelines (flat text and structured output).
- Select and implement the right OCR strategy per document type — native PDF text layer, layout-aware parsing, or image-based OCR.
- Parse complex layouts: multi-column text, tables, headers/footers, embedded figures, form fields.
- Output clean structured JSON or relational data from raw document inputs.
Backend API Development:
- Build and maintain FastAPI services that expose document processing capabilities.
- Design async endpoints for large document batches; handle timeouts, retries, and partial failures gracefully.
- Write clean, testable Python code and follow REST best practices.
- Integrate with storage layers (S3 / GCS), queues (Google Pub/Sub), and downstream systems as needed.
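To give a flavor of the batch handling described above (timeouts, retries, partial failures), here is a minimal sketch using only asyncio. The names `process_with_retry`, `process_batch`, and `worker` are hypothetical, and the result shape is an assumption for illustration, not a description of our actual services.

```python
import asyncio


async def process_with_retry(doc_id, worker, retries=2, timeout=5.0, backoff=0.0):
    """Run worker(doc_id) with a per-attempt timeout and a bounded number of retries.

    Never raises for ordinary failures: returns a status dict instead, so one
    bad document does not abort the rest of the batch.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            result = await asyncio.wait_for(worker(doc_id), timeout=timeout)
            return {"id": doc_id, "status": "ok", "result": result}
        except Exception as exc:  # includes the timeout raised by wait_for
            last_error = exc
            if attempt < retries:
                # Exponential backoff between attempts (no delay when backoff=0).
                await asyncio.sleep(backoff * (2 ** attempt))
    return {"id": doc_id, "status": "failed", "error": repr(last_error)}


async def process_batch(doc_ids, worker, **kwargs):
    """Process all documents concurrently; results keep the input order."""
    tasks = (process_with_retry(doc_id, worker, **kwargs) for doc_id in doc_ids)
    return await asyncio.gather(*tasks)
```

In a FastAPI service, something like `process_batch` would typically be awaited from an async endpoint or handed to a background worker, with the per-document status list returned to the caller so partial failures stay visible.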
Must-Have Requirements:
Core (these are non-negotiable):
- Hands-on experience building OCR or document extraction pipelines in production.
- Strong Python skills: clean, maintainable code with proper error handling.
- Practical experience with FastAPI (routing, dependency injection, async, middleware).
- Prompt engineering experience with OpenAI or Anthropic APIs: not just calling the API, but designing reliable extraction chains.
- Familiarity with PDF internals: text layers, bounding boxes, embedded fonts, page structure.
OCR & Document Processing Skills:
We work primarily with PDFs, but the underlying principles apply equally to scanned images. You should know when and how to apply each approach:
Approach | When to Use / What to Know
Native PDF text extraction | pdfplumber, PyMuPDF, pdfminer — fast, accurate when the text layer exists; must detect and fall back when it doesn't
Layout-aware parsing | Preserve reading order across columns, tables, and mixed content blocks
Image-based OCR | Tesseract, EasyOCR, or cloud OCR (AWS Textract, Google Document AI, Azure Form Recognizer) for scanned inputs
Table extraction | Structured output from tabular data — row/column alignment, merged cells, nested tables
Output formats | Flat text, structured JSON, markdown — output type driven by downstream use case
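As an illustration of the "detect and fall back" decision in the table above, the sketch below picks a strategy per page from whatever text the native extractor returned. The function names and both thresholds (`min_chars`, `min_alnum_ratio`) are assumptions chosen for the example; real documents usually need tuning, and the text itself would come from a library such as pdfplumber or PyMuPDF.

```python
def has_usable_text_layer(page_text: str, min_chars: int = 20,
                          min_alnum_ratio: float = 0.5) -> bool:
    """Heuristic: is the natively extracted text good enough to trust?

    A page is treated as having no usable text layer when it yields very
    little text, or when most of the characters are not letters or digits
    (a common symptom of broken font encodings).
    """
    stripped = "".join(page_text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return False
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) >= min_alnum_ratio


def choose_strategy(page_texts):
    """Return 'native' or 'ocr' for each page's extracted text."""
    return ["native" if has_usable_text_layer(text) else "ocr"
            for text in page_texts]
```

Deciding per page rather than per document matters for mixed files, for example a digitally generated contract with a scanned signature page appended.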
Nice to Have:
- Experience with vision-language models (GPT-4V, Claude 3 vision) for image-heavy documents.
- Comfort with AI-driven development, with the developer fully in the loop.
- Cloud OCR: AWS Textract, Google Document AI, or Azure Form Recognizer.
- LangChain, LlamaIndex, or similar orchestration frameworks.
- Vector search / RAG pipelines for document Q&A.
- Docker, basic CI/CD, and cloud deployment (AWS / GCP / Azure).
- Experience with agentic workflows (tool use, multi-step LLM chains).
LLM Integration, Prompt Engineering, Agents & Orchestration:
- Write, test, and iterate prompts for OpenAI (GPT-4o, GPT-4 Turbo) and Anthropic (Claude) models.
- Apply prompt engineering techniques: chain-of-thought, few-shot, structured output forcing, tool use/function calling.
- Build extraction agents that combine OCR output with LLM reasoning for ambiguous or complex documents.
- Evaluate and benchmark prompt strategies; document what works and why.
You'll Thrive Here If:
- You care about output quality — you're not happy until the extraction is clean and reliable.
- You test your prompts like you test your code — systematically, with real data.
- You know when to use an LLM and when a regex is the better tool.
- You can communicate tradeoffs clearly to non-technical stakeholders.
- You're comfortable in a fast-moving, remote-first environment.