Found 7,158 repositories (showing 30)
PaddlePaddle
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
opendatalab
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
opendataloader-project
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
py-pdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
bytedance
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
QuivrHQ
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
euske
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
run-llama
A fast, helpful, and open-source document parser
CosmosShadow
Using GPT to parse PDF
CatchTheTornado
Document (PDF, Word, PPTX ...) extraction and parsing API using state-of-the-art OCRs and Ollama-supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown.
vsch
CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.
chatdoc-com
OCRFlux is a lightweight yet powerful multimodal toolkit that significantly advances PDF-to-Markdown conversion, excelling in complex layout handling, complicated table parsing and cross-page content merging.
yob
The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
yobix-ai
Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
LibPDF-js
A modern PDF library for TypeScript. Parse, modify, and generate PDFs with a clean, intuitive API.
dromara
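yft-design is a powerful, visually stunning online design tool built with Vue3, fabric.js, and Element Plus. An open-source counterpart to 稿定设计 (Gaoding Design) based on fabric.js, with poster design and image editing features. Suited to many scenarios, such as poster generation, e-commerce product images, long-form article images, and video or WeChat official account cover editing.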
NanoNets
Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
wisupai
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.
galkahana
Node.js module for high performance creation, modification and parsing of PDF files and streams
twwch
AI-Powered Smart Resume Builder — 50+ professional templates, PDF/image parsing, AI optimization, JD match analysis, multi-format export. Open source & free, one-click Docker deployment.
galkahana
High performance library for creating, modifying and parsing PDF files in C++
adithya-s-k
Easily deployable 🚀 API to convert PDF to markdown quickly with high accuracy.
Lulzx
Zero-copy PDF text extraction library written in Zig. High-performance, memory-mapped parsing with SIMD acceleration.
Skythinker616
[New: PDF and Office file parsing and upload] GPT assistant for Android, activated via volume keys for voice interaction, supporting features such as internet access, taking photos, templates, and parsing PDF and Office documents.
flyhunterl
High-performance Markdown note tool! Free AI, smart notes, TODO reminders, local knowledge base, AI novel engine. PDF parsing, auto voice notes, audio-to-text. Millisecond startup.
drmingler
Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc.) into Markdown. With support for both CPU and GPU processing, it is ideal for large-scale workflows and offers text/table extraction, OCR, and batch processing with sync/async endpoints.
adrienjoly
🚜 Parse text and tables from PDF files.
allenai
Science Parse parses scientific papers (in PDF form) and returns them in structured form.
ispras
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
SylphxAI
📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage