Multi-Format Parsing
Scans `.pst`, `.msg`, `.pdf`, `.docx`, `.xlsx`, and nested `.zip` files.
Local-First Ingestion Engine
NexusParse recursively scans PST, MSG, PDF, DOCX, XLSX, and ZIP inputs, extracts content and metadata, reconstructs email threads, and writes a unified item schema to disk.
Parseable attachments are processed recursively, including archives and email files.
Every item uses consistent fields for content, metadata, source, and relationships.
Parent and thread IDs preserve message relationships across PST extractions.
Tesseract-backed OCR runs when Tesseract is installed and a PDF contains little or no extractable text.
Email-aware and document-aware chunking supports retrieval and AI workflows.
Stream items to Elasticsearch in bulk while ingesting, or run with local output only.
Run ingestion and iterate output items from Python for local automation.
Output folders mirror source structure so lineage and provenance stay auditable.
Use include/exclude filtering with recursive scanning for archives and attachments.
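The parent and thread IDs mentioned above can be used to regroup messages after ingestion. The following is a minimal sketch, assuming items are plain dicts with `parent_id` and `thread_id` keys and that `parent_id` holds the identifier of the parent item. The helper names `group_by_thread` and `children_of` and the sample data are hypothetical and not part of NexusParse.

```python
from collections import defaultdict


def group_by_thread(items):
    """Group item dicts by thread_id, keeping input order within each thread."""
    threads = defaultdict(list)
    for item in items:
        # Items without a thread_id (for example standalone documents) are skipped.
        thread_id = item.get("thread_id")
        if thread_id:
            threads[thread_id].append(item)
    return dict(threads)


def children_of(items, parent_id):
    """Return the items whose parent_id matches the given identifier."""
    return [item for item in items if item.get("parent_id") == parent_id]


# Hypothetical sample items; real items would be read from the output folder.
sample = [
    {"content": "Kickoff", "parent_id": None, "thread_id": "t1"},
    {"content": "Re: Kickoff", "parent_id": "m1", "thread_id": "t1"},
    {"content": "Invoice text", "parent_id": "m1", "thread_id": None},
]

threads = group_by_thread(sample)
print({thread_id: len(messages) for thread_id, messages in threads.items()})
```

Grouping by `thread_id` gives whole conversations, while filtering on `parent_id` gives the direct children of one item, such as replies and attachments.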
cargo run -- --source ./input --output ./output --format json --verbose
Each item is written as unified JSON, and you can optionally export centralized RAG JSONL chunks.
Extracted items follow a consistent structure:
{
  "content": "...",
  "metadata": { "...": "..." },
  "source": { "...": "..." },
  "attachments": [],
  "parent_id": "...",
  "thread_id": "..."
}

nexusparse --source ./input --output ./output --format json --verbose

from nexusparse_sdk import ingest, iter_items

ingest("./input", "./output")
for item in iter_items("./output"):
    print(item["metadata"]["item_type"])

No. NexusParse is local-first and writes output to your specified destination.
Yes. Run it in CI, cron, or internal pipelines via CLI or Python SDK wrappers.
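For scheduled or CI runs, one option is a thin Python wrapper around the CLI. This is a sketch only: it uses the `--source`, `--output`, `--format`, and `--verbose` flags shown on this page, assumes `nexusparse` is on the `PATH`, and assumes the CLI exits with a non-zero status on failure. The function names `build_command` and `run_ingestion` are hypothetical.

```python
import shlex
import subprocess


def build_command(source, output, fmt="json", verbose=True):
    """Build the argument list for the nexusparse CLI."""
    cmd = ["nexusparse", "--source", source, "--output", output, "--format", fmt]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_ingestion(source, output):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit.

    Assumption: the CLI signals failure through its exit status.
    """
    subprocess.run(build_command(source, output), check=True)


# Print the command that a CI job or cron entry would execute.
print(shlex.join(build_command("./input", "./output")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.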
OCR is optional and used only when Tesseract is installed on your host.
No. NexusParse writes local output and can optionally stream to Elasticsearch.
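How NexusParse talks to Elasticsearch is not described here. For readers who want to index the local output themselves, the sketch below builds a request body in the newline-delimited format that the Elasticsearch bulk API expects. The function `to_bulk_ndjson`, the index name `nexusparse-items`, and the sample items are hypothetical.

```python
import json


def to_bulk_ndjson(items, index_name):
    """Serialize item dicts into an Elasticsearch bulk request body."""
    lines = []
    for item in items:
        # Action line: index the document that follows into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the item itself.
        lines.append(json.dumps(item))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical sample items shaped like the unified schema.
sample_items = [
    {"content": "Hello", "metadata": {"item_type": "email"}},
    {"content": "Report", "metadata": {"item_type": "pdf"}},
]

body = to_bulk_ndjson(sample_items, "nexusparse-items")
print(body, end="")
```

The resulting string can be sent to the `_bulk` endpoint with the `application/x-ndjson` content type.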