Skip to content
Compliance

A Defensible Half-Million-Email eDiscovery Pipeline with AI

How we built a court-defensible, AI-assisted eDiscovery triage pipeline that reviewed half a million emails for roughly $300 instead of six figures, on Vertex AI with Gemini 2.5 and designed with Claude as an engineering partner. The architecture, the cost math, and the defensibility layer that makes the methodology hold up under challenge.

Sage Solutions 12 min read

A client came to us with a classic eDiscovery problem: a Google Vault export sitting in a folder on a desk. Hundreds of thousands of emails across multiple custodians, multiple gigabytes per MBOX chunk, and a discovery deadline.

The traditional path for this kind of work is well known and expensive. Contract attorneys at $50–150 an hour, six to eight weeks of first-pass review, and a real chance the team misses the document that actually matters. The modern path, Technology-Assisted Review (TAR), has been court-accepted since Da Silva Moore v. Publicis Groupe in 2012, but the established tooling (Relativity, Brainspace, Reveal) prices smaller matters out of the market. A 500,000-email matter through a major eDiscovery vendor can land north of six figures before an attorney has read a single document.

The question we set out to answer: could we build a production-grade, defensible eDiscovery triage pipeline with current-generation AI, at a cost that makes the work accessible to small and mid-sized matters?

The answer is yes. Here is how we did it, and how we made it hold up under challenge.

AI as an architecture partner, not a replacement

The pipeline itself runs on Google Cloud’s Vertex AI with Gemini 2.5 as the analysis engine. The design work, though, the architecture, the prompt engineering, the defensibility framework, the cost modeling, was a back-and-forth between our team and Claude, Anthropic’s frontier model.

Most of the public conversation about AI is about replacing analysts or automating tasks. The more useful move, in our experience, is using a frontier model as a sparring partner for engineering decisions: pressure-testing the architecture, surfacing edge cases you didn’t think of, writing production-ready code with the boilerplate already handled, and translating between domains (legal defensibility, GCP pricing, MIME-parsing quirks) that no single engineer carries all of.

Elapsed time from “we have an MBOX folder” to “we have a runnable pipeline”: one afternoon. That kind of compression is exactly the custom AI workflow work we now build as a practice. (For the model-selection logic underneath it all, local vs. cloud and cheap vs. frontier, see our companion piece on matching the workload to the model.)

The architecture: six phases, every one with a checkpoint

Phase 1 — Chain of custody. Raw Vault MBOX files land in an immutable Google Cloud Storage bucket with object versioning and an optional retention lock; the Vault XML manifest goes alongside. Nothing downstream can corrupt the source. Every later step works from a derived copy.

Phase 2 — Conversion. A Python job (a temporary Compute Engine VM for jobs over 50 GB, local for smaller ones) parses the MBOX files and emits JSONL, where each line is a complete Gemini batch request: system prompt, matter-specific responsiveness criteria, the email content, and a strict response schema. Output is sharded into 50,000-email files so any failed shard re-runs on its own.

Phase 3 — Pass 1 triage (Gemini 2.5 Flash). Vertex AI batch prediction processes every email. Each one comes back as a structured JSON object: a responsiveness score (0–5), a privilege classification with confidence, a confidentiality tier, extracted entities (people, organizations, dollar amounts, key dates), a hot-document flag, a one-sentence summary, and a critical needs_human_review boolean. Results land directly in BigQuery for SQL querying.

Phase 4 — Borderline filtering. A BigQuery query selects the 10–25% of documents that need a closer look: anything Pass 1 was unsure about, anything flagged privileged, hot-document candidates, and high-stakes responsive documents.

Phase 5 — Pass 2 deep review (Gemini 2.5 Pro). The flagged subset runs through a more capable model with a more rigorous prompt. This is where the nuanced privilege calls live (attorney-client carve-outs, common-interest doctrine, work-product borderlines) that the faster model isn’t quite right for.

Phase 6 — Attorney review queue. Pass 1 and Pass 2 results are joined in BigQuery, prioritized by score, and exported to the attorney’s review platform of choice.

The design choice that matters most: the AI never makes a final legal determination. It triages. Every privilege-flagged document, every hot-doc candidate, and every needs_human_review item gets a human attorney’s eyes before any production decision. The AI puts the documents in the right order and surfaces the rationale; the attorney decides.

The cost story

This is where the math gets interesting. At current Vertex AI batch pricing (Gemini 2.5 Flash runs about $0.15 per million input tokens in batch mode):

  • Pass 1 — 500,000 emails through Gemini 2.5 Flash: roughly $225.
  • Pass 2 — the flagged ~15% subset through Gemini 2.5 Pro: about $75 more.
  • Total model spend for half a million emails: ~$300.

Storage and BigQuery costs at this volume round to noise. The conversion compute, a single VM for a few hours, is well under $20. The per-email token counts behind these figures are illustrative and depend on prompt and message length; the per-token rates are current and will move.

The comparable cost using traditional contract-attorney first-pass review is in the tens of thousands of dollars minimum, before the eDiscovery vendor’s per-gigabyte hosting fees. AI-assisted triage does not replace attorney review. It can’t, and it shouldn’t. What it does is invert the economics: instead of paying humans to read every document and flag the interesting ones, you pay a few hundred dollars to put the documents in priority order, then spend attorney time only where it actually matters.

When the data can’t leave the building

For this matter we ran the triage as cloud batch, on speed and scale. Some matters can’t. When the data is privileged, regulated, or just too sensitive to send to anyone’s API, the same triage runs on-premises instead: a smaller model on a Mac, via Apple’s MLX framework, classifying privilege, scoring responsiveness, and pulling entities at fixed cost, with nothing leaving the machine. The economics flip from per-token to fixed cost, and the privacy question goes away. We cover that tradeoff in our companion piece on local LLMs vs. frontier AI.

Illustration of a macOS terminal running a local large language model via Apple's MLX framework, triaging eDiscovery documents on-device: a privilege classification, a responsiveness score, a hot-document flag, and extracted entities, all at fixed cost with nothing leaving the Mac.

Defensibility is the product

The hardest part of this build was not the AI integration. It was the defensibility layer.

TAR is court-accepted, but only when the process is documented and reproducible. If opposing counsel challenges your methodology, and in any matter that matters they will, you need to be able to produce:

  • The exact prompt sent to the model, with a version hash and timestamp
  • The specific model snapshot used (not “Gemini 2.5 Flash” but gemini-2.5-flash-001)
  • A complete audit log of every API call: input email Message-ID, output response, processing timestamp
  • The original source data, unmodified, in a chain-of-custody-preserving location
  • A statistically valid validation sample (typically 400–600 documents) reviewed by attorneys to measure precision and recall against the AI’s calls
  • Documentation that counsel approved the prompt and responsiveness criteria before the run

Every one of these is baked into the pipeline structurally, not bolted on:

  • The conversion script refuses to run if the matter-description placeholder hasn’t been replaced. There is no path to running with default settings.
  • Each Vertex request carries a labels block with the prompt-version string and a hash of the system prompt.
  • Model versions are pinned in code, not floating.
  • Source MBOX files live in a versioned, optionally retention-locked bucket, separate from any working copy.
  • Every response keeps a custom_id that joins back to the original email by Message-ID.

The defensibility checklist in our runbook has ten items. None are optional. That is the line between AI as a neat demo and AI you can defend in court, and it is the same discipline we bring to every compliance and cybersecurity engagement: the audit trail comes first.

What the frontier model actually did

Walking back through the design conversation, the model’s contributions fell into a few buckets. None of them were “make the decisions”:

  • Surfacing edge cases. Our first-draft MBOX parser handled text/plain bodies and ignored everything else. The model flagged that a large fraction of modern email is text/html only, that non-UTF-8 encodings are common, and that MIME-encoded headers (=?utf-8?B?...?=) decode silently to garbage if you don’t handle them. Each is the kind of bug that, in production, makes a small percentage of documents quietly lose data, which is exactly the failure mode that makes a methodology indefensible.
  • Translating between domains. The defensibility requirements come from case law (Da Silva Moore, Rio Tinto, Hyles, In re Biomet). The technical requirements come from Vertex’s batch API docs. The cost requirements come from current pricing tables. Mapping all three into one coherent architecture is exactly where a model that has read everything beats a human Googling for hours.
  • Generating production-ready code. The conversion script that emerged is ~200 lines of hardened Python: argument parsing, sharded output, prompt-version guards, charset-aware decoding, an HTML-to-text fallback, per-message exception handling so one bad email can’t kill a run, and chain-of-custody preservation by Message-ID. From scratch that is a half-day. With the model as a pair programmer it was under an hour, most of it spent on review and edge-case discussion rather than typing.
  • Building the handoff. When we moved execution into an agentic environment (Anthropic’s Cowork), the model produced the full runbook: phases, checkpoints, the operator’s decision points, the cost table, the defensibility checklist, the conversion script inline. Self-contained and ready for the next agent to execute.

What it did not do: make any final call. The matter description, the model choices, the retention period, the validation-sample size, counsel sign-off: all human judgment, with explicit pause points in the runbook.

What this means beyond eDiscovery

A few takeaways that generalize past litigation:

  • AI architecture work is shifting from “automate this task” to “design this system.” The leverage isn’t replacing the person who reads emails. It’s compressing six weeks of architecture, prompt engineering, and code into an afternoon, for a system that didn’t exist before.
  • Defensibility (and observability) is the product. Anyone can wire up an LLM API call. What makes it usable in regulated, legal, or compliance-sensitive contexts is the audit trail, version control, chain of custody, and validation around the AI, not the AI itself. We build the boring infrastructure first.
  • Cost engineering beats model selection. The choice between Flash and Pro mattered less than the choice to run a two-pass architecture. The choice of provider mattered less than the choice to use batch prediction instead of interactive APIs. Order-of-magnitude savings live in architecture, not vendor.
  • Human judgment belongs at the boundaries, not the middle. The AI does the bulk-volume triage; humans set the criteria up front and make the final calls downstream. The middle, document-by-document scoring, is where machine scale wins and where attorney time is most expensively wasted.

Looking ahead

This pipeline is one of several AI-native workflows we’re building into our managed services and custom development practice. Other recent builds: forensic email investigations across hundreds of mailboxes, threat-intelligence automation for client security operations, and structured contract review with risk flagging. The common thread is AI as infrastructure, with the defensibility, observability, and chain-of-custody work done properly. That is the layer our clients increasingly need and rarely have the in-house team to build.

If you are sitting on a Vault export, a litigation hold, a regulatory inquiry, or any data-heavy review problem, especially in a law firm or finance context, and want to talk through what a modern pipeline looks like for your situation, get in touch. The first 30 minutes are free, and we’ll tell you honestly whether AI belongs in the workflow at all.

Related services

Want to talk about this?

We are happy to have a 30-minute call about anything in this article — your environment, your risks, your options.

Call Free assessment