Local-First Ingestion Engine

The world’s fastest ingestion engine: turns unstructured files into RAG-ready JSON at a rate of 1 TB in 2 hours.

NexusParse recursively scans PST, MSG, PDF, DOCX, XLSX, and ZIP inputs, extracts content and metadata, reconstructs email threads, and writes a unified item schema to disk.

Built for legal, compliance, eDiscovery, data migration, and high-throughput RAG ingestion.

Capabilities

Multi-Format Parsing

Scans `.pst`, `.msg`, `.pdf`, `.docx`, `.xlsx`, and nested `.zip` files.

Recursive Attachments

Parseable attachments are processed recursively, including archives and email files.

Unified Item JSON

Every item uses consistent fields for content, metadata, source, and relationships.

Email Threading

Parent and thread IDs preserve message relationships across PST extractions.

OCR for Scanned PDFs

Tesseract-backed OCR runs when Tesseract is installed and a PDF contains little or no extractable text.

Smart Chunking

Email-aware and document-aware chunking supports retrieval and AI workflows.

Optional Elasticsearch

Stream items to Elasticsearch in bulk while ingesting, or run with local output only.

Python SDK

Run ingestion and iterate output items from Python for local automation.

Path Mirroring

Output folders mirror source structure so lineage and provenance stay auditable.
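As an illustration of path mirroring, the sketch below maps a source file to an output location with the same relative layout. The function name and the choice to append ".json" to the original file name are assumptions made for this example; the actual output naming used by NexusParse is not specified here.

```python
from pathlib import Path


def mirrored_output_path(source_root: Path, output_root: Path, source_file: Path) -> Path:
    """Map a source file to an output path with the same relative layout.

    Assumption: the output file keeps the original name and appends ".json".
    """
    # relative_to raises ValueError if source_file is not under source_root.
    relative = source_file.relative_to(source_root)
    return output_root / relative.parent / (relative.name + ".json")


if __name__ == "__main__":
    out = mirrored_output_path(
        Path("input"), Path("output"), Path("input/custodians/alice/report.pdf")
    )
    print(out.as_posix())
```

Because the relative path is preserved, an auditor could trace any output file back to its source by swapping the root folder and dropping the assumed suffix.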

How It Works

  1. Point NexusParse at your source folder

    Use include/exclude filtering with recursive scanning for archives and attachments.

  2. Run one batch command

    cargo run -- --source ./input --output ./output --format json --verbose

  3. Consume structured outputs

    Each item is written as unified JSON, and you can optionally export centralized RAG JSONL chunks.
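To make the JSONL chunk idea concrete, here is a minimal sketch that splits an item's content into overlapping fixed-size windows and serializes one JSON object per line. This is a naive character-window chunker, not the email-aware or document-aware chunking NexusParse performs, and the record fields `item_id`, `chunk_index`, and `text` are hypothetical.

```python
import json
from typing import Iterator


def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> Iterator[str]:
    """Yield fixed-size character windows that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(text), step):
        yield text[start:start + size]
        # Stop once a window has reached the end of the text.
        if start + size >= len(text):
            break


def item_to_jsonl(item: dict, item_id: str, size: int = 1000, overlap: int = 100) -> str:
    """Serialize one item's chunks as JSONL (one JSON object per line)."""
    lines = []
    for index, chunk in enumerate(chunk_text(item.get("content", ""), size, overlap)):
        record = {
            "item_id": item_id,        # hypothetical field
            "chunk_index": index,      # hypothetical field
            "text": chunk,             # hypothetical field
            "thread_id": item.get("thread_id"),  # carried over from the item schema
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

Carrying `thread_id` into each chunk record is one way to let a retriever pull in sibling messages from the same conversation.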

Unified Schema

Extracted items follow a consistent structure:

{
  "content": "...",
  "metadata": { "...": "..." },
  "source": { "...": "..." },
  "attachments": [],
  "parent_id": "...",
  "thread_id": "..."
}
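The `parent_id` and `thread_id` fields can be used to regroup items after ingestion. The sketch below assumes items have already been loaded as Python dictionaries in the shape above; the helper names are illustrative and not part of the SDK.

```python
from collections import defaultdict
from typing import Iterable


def group_by_thread(items: Iterable[dict]) -> dict[str, list[dict]]:
    """Group items by thread_id; items without a thread_id are skipped."""
    threads: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        thread_id = item.get("thread_id")
        if thread_id:
            threads[thread_id].append(item)
    return dict(threads)


def children_by_parent(items: Iterable[dict]) -> dict[str, list[dict]]:
    """Index items by parent_id so replies and attachments can be looked up."""
    children: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        parent_id = item.get("parent_id")
        if parent_id:
            children[parent_id].append(item)
    return dict(children)
```

With both indexes in hand, a conversation view is a lookup by `thread_id`, and walking from a message to its replies or attachments is a lookup by `parent_id`.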

Get Started

CLI

nexusparse --source ./input --output ./output --format json --verbose

Python SDK

from nexusparse_sdk import ingest, iter_items

ingest("./input", "./output")
for item in iter_items("./output"):
    print(item["metadata"]["item_type"])

FAQ

Does data leave our environment?

No. NexusParse is local-first and writes output to your specified destination.

Can we automate ingestion jobs?

Yes. Run it in CI, cron, or internal pipelines via CLI or Python SDK wrappers.
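A scheduled job could wrap the CLI along these lines. The sketch uses only the flags shown in this document and assumes a `nexusparse` binary is on `PATH`; treating a nonzero exit code as failure is also an assumption.

```python
import subprocess
from pathlib import Path


def build_command(source: Path, output: Path, verbose: bool = False) -> list[str]:
    """Build the CLI argument list using only flags shown in this document."""
    command = [
        "nexusparse",
        "--source", str(source),
        "--output", str(output),
        "--format", "json",
    ]
    if verbose:
        command.append("--verbose")
    return command


def run_ingestion(source: Path, output: Path, verbose: bool = False) -> int:
    """Run one ingestion job and return the process exit code."""
    # check=False so the caller decides how to handle a failed run.
    completed = subprocess.run(build_command(source, output, verbose), check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems when source or output paths contain spaces.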

What about OCR dependencies?

OCR is optional. It runs only when Tesseract is installed on your host and a PDF contains little or no extractable text.
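The decision to fall back to OCR might look like the following sketch. The threshold of 50 non-whitespace characters per page and the function names are assumptions for illustration; NexusParse's actual heuristic is not documented here.

```python
import shutil
from typing import Optional


def tesseract_available() -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


def should_ocr(
    extracted_text: str,
    page_count: int,
    min_chars_per_page: int = 50,  # hypothetical threshold
    tesseract_installed: Optional[bool] = None,
) -> bool:
    """Decide whether a PDF with sparse extracted text should be sent to OCR."""
    if tesseract_installed is None:
        tesseract_installed = tesseract_available()
    if not tesseract_installed or page_count <= 0:
        return False
    # Count non-whitespace characters so blank pages do not look like text.
    visible_chars = len("".join(extracted_text.split()))
    return visible_chars / page_count < min_chars_per_page
```

The `tesseract_installed` override exists so the check can be exercised on hosts without Tesseract.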

Is PostgreSQL required?

No. NexusParse needs no database: it writes local output and can optionally stream to Elasticsearch.
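For readers who want to index existing output themselves, the sketch below builds a request body in the newline-delimited JSON format that Elasticsearch's `_bulk` endpoint expects. The index name is a placeholder, and this is not a description of how NexusParse streams internally.

```python
import json
from typing import Iterable


def bulk_body(items: Iterable[dict], index: str) -> str:
    """Build an NDJSON body for Elasticsearch's _bulk endpoint."""
    lines = []
    for item in items:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(item, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""


if __name__ == "__main__":
    body = bulk_body([{"content": "hello", "thread_id": "t1"}], "nexusparse-items")
    print(body, end="")
```

The resulting string would be sent as a POST to `/_bulk` with the `Content-Type: application/x-ndjson` header.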