Search Results

Found 7,158 repositories (showing 30)

PaddleOCR

PaddlePaddle

💚95

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

75.4k

10.2k

Apache-2.0

Python

Updated 1 hour ago

ai4science, chineseocr, document-parsing, +10 more

MinerU

opendatalab

💚95

Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

59.5k

5.0k

AGPL-3.0

Python

Updated 2 minutes ago

ai4science, document-analysis, extract-data, +10 more

opendataloader-pdf

opendataloader-project

💚94

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

16.1k

1.4k

Apache-2.0

Java

Updated 1 minute ago

a11y, accessibility, ai, +17 more

pypdf

py-pdf

💚90

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

9.9k

1.6k

NOASSERTION

Python

Updated 2 hours ago

help-wanted, pdf, pdf-documents, +5 more

Dolphin

bytedance

💛81

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

8.9k

754

NOASSERTION

Python

Updated 1 hour ago

document-analysis, layout-analysis, ocr, +6 more

MegaParse

QuivrHQ

💛77

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

7.3k

417

Apache-2.0

Python

Updated 19 hours ago

docx, llm, parser, +2 more

pdfminer

euske

💛86

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

5.3k

1.1k

MIT

Python

Updated 2 days ago

liteparse

run-llama

💛77

A fast, helpful, and open-source document parser

4.2k

278

Apache-2.0

TypeScript

Updated 1 hour ago

document-ocr, document-processing, ocr, +4 more

gptpdf

CosmosShadow

💛71

Using GPT to parse PDF

3.6k

264

MIT

Python

Updated 2 days ago

text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown

3.1k

268

MIT

Python

Updated 2 days ago

anonymization, api, extract, +6 more

flexmark-java

vsch

💛71

CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.

2.6k

301

BSD-2-Clause

Java

Updated 3 hours ago

commonmark, html-to-markdown, java, +8 more

OCRFlux

chatdoc-com

🧡69

OCRFlux is a lightweight yet powerful multimodal toolkit that significantly advances PDF-to-Markdown conversion, excelling in complex layout handling, complicated table parsing and cross-page content merging.

2.5k

153

Apache-2.0

Python

Updated 7 hours ago

pdf-reader

yob

💛70

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

1.9k

286

MIT

Ruby

Updated 11 hours ago

extractous

yobix-ai

💛73

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

1.7k

Apache-2.0

Rust

Updated 8 minutes ago

data-pipelines, docx, etl, +14 more

core

LibPDF-js

🧡67

A modern PDF library for TypeScript. Parse, modify, and generate PDFs with a clean, intuitive API.

1.7k

MIT

TypeScript

Updated 12 hours ago

digital-signature, document, esign, +10 more

yft-design

dromara

💛75

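yft-design is a powerful, visually stunning online design tool built with Vue3, fabric.js, and Element Plus. An open-source take on Gaoding Design (稿定设计) based on fabric.js. It offers poster design and image editing, and suits many scenarios, such as poster generation, e-commerce product images, long-form article images, and cover editing for videos and WeChat official accounts.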

1.5k

313

MIT

TypeScript

Updated 7 hours ago

canvas-editor, clipper, element-plus, +12 more

docstrange

NanoNets

💛73

Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

1.4k

125

MIT

Python

Updated 1 hour ago

ai, document-parser, document-parsing, +10 more

e2m

wisupai

💛72

E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.

1.3k

Apache-2.0

Jupyter Notebook

Updated 15 hours ago

doc2x, e2m, llm, +3 more

HummusJS

galkahana

🧡53

Node.js module for high performance creation, modification and parsing of PDF files and streams

1.2k

169

NOASSERTION

Updated 1 week ago

nodejs, pdf-generation, pdf-manipulation, +2 more

JadeAI

twwch

💛72

AI-Powered Smart Resume Builder — 50+ professional templates, PDF/image parsing, AI optimization, JD match analysis, multi-format export. Open source & free, one-click Docker deployment.

1.1k

123

Apache-2.0

TypeScript

Updated 3 hours ago

ai, ai-writing, awesome-ai-tools, +4 more

PDF-Writer

galkahana

🧡58

High performance library for creating, modifying and parsing PDF files in C++

1.0k

230

Apache-2.0

Updated 3 weeks ago

marker-api

adithya-s-k

💛72

Easily deployable 🚀 API to convert PDF to markdown quickly with high accuracy.

959

117

GPL-3.0

Python

Updated 21 hours ago

api, fastapi, marker, +5 more

zpdf

Lulzx

💛71

Zero-copy PDF text extraction library written in Zig. High-performance, memory-mapped parsing with SIMD acceleration.

896

CC0-1.0

Zig

Updated 2 days ago

high-performance, parser, pdf, +5 more

gpt-assistant-android

Skythinker616

💛72

[New: PDF and Office file parsing and upload] An all-scenario GPT assistant for Android, activated via the volume keys for voice interaction, with support for web access, taking photos, templates, and parsing PDF and Office documents.

872

123

GPL-3.0

Java

Updated 4 days ago

android, assistant, chatgpt, +4 more

flymd

flyhunterl

🧡66

High-performance Markdown note tool! Free AI, smart notes, TODO reminders, local knowledge base, AI novel engine. PDF parsing, auto voice notes, audio-to-text. Millisecond startup.

766

NOASSERTION

JavaScript

Updated 2 hours ago

docling-api

drmingler

💛72

Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc.) into Markdown. With support for both CPU and GPU processing, it is ideal for large-scale workflows and offers text/table extraction, OCR, and batch processing with sync/async endpoints.

764

MIT

Python

Updated 1 day ago

api, fastapi, markdown-parser, +6 more

npm-pdfreader

adrienjoly

🧡67

🚜 Parse text and tables from PDF files.

701

MIT

HTML

Updated 2 days ago

data-extraction, javascript, parse-tables, +5 more

science-parse

allenai

💛72

Science Parse parses scientific papers (in PDF form) and returns them in structured form.

698

Apache-2.0

Java

Updated 2 days ago

dedoc

ispras

💛71

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

657

Apache-2.0

Python

Updated 1 hour ago

doc, document-analysis, document-content-extraction, +15 more

pdf-reader-mcp

SylphxAI

💛71

📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage

638

MIT

TypeScript

Updated 22 hours ago

ai-agent, ai-tools, document-processing, +13 more

