Critical Unstructured.io Vulnerability CVE-2025-64712 Threatens AI Pipelines at Amazon, Google, and Fortune 1000 Enterprises

A critical vulnerability (CVE-2025-64712) discovered in Unstructured.io, a widely deployed ETL library for AI data processing, exposes Amazon, Google, Bank of America, and 87% of Fortune 1000 companies to remote code execution attacks.

The Vulnerability: CVSS 9.8 Path Traversal Leading to RCE

Security researchers have identified a severe path traversal vulnerability in Unstructured.io’s partition_msg function, which handles Microsoft Outlook .msg email attachments. The flaw lets an attacker write files to any location the processing account has permission to write, by crafting malicious attachment filenames that contain directory traversal sequences such as ../../etc/passwd.

The vulnerable code in AttachmentPartitioner.iter_elements stores attachments in /tmp/ by directly concatenating the temporary directory path with the unvalidated filename from the .msg file. This allows attackers to escape the intended directory and create or overwrite sensitive files, including:

SSH authorized_keys for persistent access
Cron jobs for automated command execution
Startup scripts for boot persistence
Web shells for remote control
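
The path-joining pattern described above can be illustrated with a minimal sketch. This is not the library's actual source: the function names naive_attachment_path and safe_attachment_path and the /tmp/attachments directory are hypothetical, and POSIX paths are assumed. The first function trusts the filename as-is; the second shows one common mitigation, keeping only the final path component and confirming that the resolved path stays inside the target directory.

```python
import os


def naive_attachment_path(tmp_dir: str, filename: str) -> str:
    # Illustrative only: mirrors the pattern described in the article,
    # where the attachment filename is joined to the directory unchecked.
    return os.path.normpath(os.path.join(tmp_dir, filename))


def safe_attachment_path(tmp_dir: str, filename: str) -> str:
    # Keep only the final path component, treating both "/" and "\"
    # as separators so Windows-style names are covered as well.
    base = filename.replace("\\", "/").split("/")[-1]
    if base in ("", ".", ".."):
        raise ValueError(f"unusable attachment name: {filename!r}")
    root = os.path.realpath(tmp_dir)
    candidate = os.path.realpath(os.path.join(root, base))
    # Defense in depth: the resolved path must remain inside tmp_dir.
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"attachment escapes {root}: {filename!r}")
    return candidate


malicious_name = "../../etc/passwd"
print("naive:", naive_attachment_path("/tmp/attachments", malicious_name))
print("safe: ", safe_attachment_path("/tmp/attachments", malicious_name))
```

With the naive join, the traversal sequences walk out of the temporary directory entirely. The safer variant discards everything except the last component, so the file lands inside the intended directory under the name passwd.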

Massive Supply Chain Blast Radius

Unstructured.io transforms unstructured data—PDFs, emails, images—into AI-ready formats for vector databases and machine learning pipelines. The library is deeply embedded in enterprise AI infrastructure:

4+ million monthly downloads via PyPI
100,000+ GitHub repositories reference it as a dependency
LangChain and LlamaIndex wrap Unstructured.io, amplifying exposure
OpenWebUI includes it for document processing
Azure, AWS, and GCP documentation references the library for production deployments

This nested dependency structure makes tracking exposure extremely difficult. Organizations may be vulnerable without directly importing Unstructured.io.
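
One way to check a single Python environment for this kind of indirect exposure is sketched below. It assumes the library is installed under the PyPI distribution name unstructured, and it uses only the standard library's importlib.metadata. The helper names (normalize, requirement_name, find_dependents, installed_version) are illustrative, not part of any existing tool. The sketch reports the installed version, if any, and the installed packages that declare the library as a requirement, including optional extras.

```python
import re
from importlib import metadata


def normalize(name: str) -> str:
    # PEP 503 style normalization: "Foo_Bar.baz" -> "foo-bar-baz".
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    # Extract the project name from a requirement string such as
    # "unstructured[all-docs]>=0.10; extra == 'docs'".
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalize(match.group(1)) if match else ""


def installed_version(target: str = "unstructured"):
    # Returns the installed version string, or None if not installed.
    try:
        return metadata.version(target)
    except metadata.PackageNotFoundError:
        return None


def find_dependents(target: str = "unstructured") -> dict:
    # Map each installed distribution that declares `target` as a
    # requirement (including optional extras) to the matching entries.
    wanted = normalize(target)
    dependents = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or "unknown"
        for req in dist.requires or []:
            if requirement_name(req) == wanted:
                dependents.setdefault(name, []).append(req)
    return dependents


if __name__ == "__main__":
    print("installed version:", installed_version())
    for package, reqs in sorted(find_dependents().items()):
        print(f"{package} declares: {reqs}")
```

This inspects only the environment it runs in and only one level of declared requirements, so it would need to be run in every virtual environment and container image in scope.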

Why This Matters for Enterprise Security

The vulnerability highlights critical weaknesses in AI supply chain security:

1. Invisible Dependencies: Many organizations don’t know Unstructured.io runs in their environment because it’s imported transitively through popular AI frameworks.

2. Production AI Pipeline Risk: The library processes documents in real-time AI applications—a compromised server could exfiltrate training data, poison models, or pivot to internal networks.

3. Cloud Infrastructure Exposure: With integrations to S3, Google Drive, OneDrive, and Salesforce, exploited systems may have credentials for multiple cloud services.

Technical Attack Vector

The CVSS exploitability metrics describe an attack that requires no authentication or user interaction and can be launched over the network with low complexity:

Attack Vector: Network
Attack Complexity: Low
Privileges Required: None
User Interaction: None
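
As a worked example, the four exploitability metrics above are consistent with the reported 9.8 score under the CVSS v3.1 base formula. The sketch below assumes values the article does not state: unchanged scope and High impact on confidentiality, integrity, and availability. The numeric weights come from the CVSS v3.1 specification; the function names are illustrative.

```python
import math

# CVSS v3.1 numeric weights for the metric values listed above.
AV_NETWORK = 0.85
AC_LOW = 0.77
PR_NONE = 0.85
UI_NONE = 0.85
# Assumption (not stated in the article): C, I and A are all High.
IMPACT_HIGH = 0.56


def roundup(value: float) -> float:
    # CVSS v3.1 "Roundup": smallest number with one decimal place that
    # is >= value, computed with integer arithmetic to avoid float drift.
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av, ac, pr, ui, c, i, a) -> float:
    # Base score formula for the Scope: Unchanged case only.
    iss = 1 - (1 - c) * (1 - i) * (1 - a)
    impact = 6.42 * iss
    exploitability = 8.22 * av * ac * pr * ui
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score_scope_unchanged(
    AV_NETWORK, AC_LOW, PR_NONE, UI_NONE,
    IMPACT_HIGH, IMPACT_HIGH, IMPACT_HIGH,
)
print(score)  # 9.8
```

The impact term comes to about 5.87 and the exploitability term to about 3.89; their sum, roughly 9.76, rounds up to 9.8.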

Any application that processes untrusted .msg files through Unstructured.io becomes a target.

Immediate Actions Required

CISA and security vendors urge organizations to:

Audit Python dependencies across all projects for references to the unstructured package (the library's name on PyPI)
Check transitive dependencies in LangChain, LlamaIndex, and similar frameworks
Update immediately to the patched version from GitHub
Review .msg file processing workflows for untrusted input sources
Scan for indicators of past exploitation—unexpected file writes, new SSH keys, modified startup scripts
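
A first-pass check for the last item could look like the sketch below. It lists files under a set of sensitive locations that were modified after a chosen cutoff. The watch list and the seven-day window are assumptions picked to match the file types named earlier (SSH keys, cron jobs, startup scripts) and should be adapted to the host; recently_modified is a hypothetical helper name.

```python
import os
import time


def recently_modified(roots, since_epoch):
    # Return sorted (mtime, path) pairs for files changed at or after
    # since_epoch under the given files or directories.
    hits = []
    for root in roots:
        root = os.path.expanduser(root)
        if os.path.isfile(root):
            paths = [root]
        else:
            # os.walk yields nothing for missing paths and silently
            # skips directories it cannot read.
            paths = (
                os.path.join(dirpath, name)
                for dirpath, _dirs, files in os.walk(root)
                for name in files
            )
        for path in paths:
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable
            if mtime >= since_epoch:
                hits.append((mtime, path))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical watch list and window; adjust for the system at hand.
    watch = ["~/.ssh/authorized_keys", "/etc/cron.d", "/etc/init.d"]
    cutoff = time.time() - 7 * 24 * 3600
    for mtime, path in recently_modified(watch, cutoff):
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime)), path)
```

Modification times can be reset by an attacker with write access, so an empty result does not prove a host is clean; comparing files against known-good baselines or file-integrity monitoring logs is more reliable.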

For organizations running AI data pipelines in production, this vulnerability represents a critical risk requiring immediate remediation.

Source: Cyber Press