Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Stars: 14.4k
Forks: 1.2k
Watchers: 14.4k
Open Issues: 240
Commit counts (labels not recoverable): 267, 151, 122, 120, 116, 96, 90, 77, 73, 70
615782a fix(chunking): preserve semantic headers in carried table chunks (#4313)
264d569 feat: render Formula elements as $$ blocks with optional normalization (#4308)
051b358 fix(deps): upgrade vulnerable transitive dependencies [security] (#4318)
affb9d6 refactor: deduplicate PDF rendering by delegating to unstructured-inference (#4315)
8929336 perf: speed up standardize_quotes with str.translate() (#4314)
a3172f8 mem: exclude unused spaCy pipeline components to reduce model memory (#4296)
b6cf510 feat(chunking): repeat table headers on continuation chunks (#4298)
47f4728 mem: reduce PaddleOCR rec_batch_num from 6 to 1 (#4295)
7c5855b Replace lazyproperty with functools.cached_property (#4282)
94b3ffd fix(chunking): preserve nested table structure in reconstruction (#4301)
b0e86a4 fix: Self-contained script for version extraction in release CI (#4304)
6447dab fix(deps): Update security updates [SECURITY] (#4303)
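
One of the commits listed above speeds up standardize_quotes with str.translate(). The sketch below illustrates that general technique only; it is not the library's implementation. The names `QUOTE_MAP`, `normalize_quotes`, and `normalize_quotes_slow` are hypothetical, and the set of characters handled is an assumption chosen for illustration.

```python
# Hypothetical illustration of the str.translate() technique.
# This is NOT the actual standardize_quotes code from the repository.

# Assumed mapping from a few typographic quotes to ASCII quotes.
QUOTE_MAP = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201a": "'",  # single low-9 quotation mark
}

# Build the translation table once, at import time.
_QUOTE_TABLE = str.maketrans(QUOTE_MAP)


def normalize_quotes(text: str) -> str:
    """Replace typographic quotes in a single pass over the string."""
    return text.translate(_QUOTE_TABLE)


def normalize_quotes_slow(text: str) -> str:
    """Baseline for comparison: one str.replace() pass per mapped character."""
    for src, dst in QUOTE_MAP.items():
        text = text.replace(src, dst)
    return text


if __name__ == "__main__":
    sample = "\u201cIt\u2019s structured,\u201d she said."
    print(normalize_quotes(sample))
```

Both functions return the same result for this mapping. The translate-based version scans the input once regardless of how many characters are mapped, whereas the loop rescans it once per entry, which is the usual reason to prefer a precomputed table.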