Search Results

Found 3,105 repositories (showing 30)

opendataloader-pdf

opendataloader-project

💛89

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

13.0k

1.1k

Apache-2.0

Java

Updated 2 minutes ago

a11y, accessibility, ai, +17

MegaParse

QuivrHQ

💛77

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

7.3k

418

Apache-2.0

Python

Updated 1 hour ago

docx, llm, parser, +2

pdfminer

euske

💛86

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

5.3k

1.1k

MIT

Python

Updated 5 days ago

flexmark-java

vsch

💛71

CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.

2.6k

301

BSD-2-Clause

Java

Updated 13 hours ago

commonmark, html-to-markdown, java, +8

pdf-reader

yob

💛70

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

1.9k

286

MIT

Ruby

Updated 21 hours ago

dedoc

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

656

Apache-2.0

Python

Updated 1 day ago

doc, document-analysis, document-content-extraction, +15

scipdf_parser

titipata

🧡61

Python PDF parser for scientific publications: content and figures

452

MIT

Python

Updated 1 week ago

grobid, parser, pdf, +3

py-pdf-parser

jstockwin

🧡56

A Python tool to help extract information from structured PDFs.

429

MIT

Python

Updated 2 weeks ago

parsing, pdf, pdf-parsing, +1

caradoc

caradoc-org

🧡51

A PDF parser and validator

314

GPL-2.0

OCaml

Updated 1 month ago

nlp-resume-parser

hxu296

❤️46

NLP-powered, GPT-3 enabled Resume Parser from PDF to JSON.

276

Python

Updated 1 month ago

gpt-3, nlp, nlp-parsing, +4

Pdf2Dom

radkovo

🧡56

Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree may then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom may also be used as an independent Java library with a standard DOM interface for DOM-based applications, or as an alternative parser for the CSSBox rendering engine to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.

194

LGPL-3.0

Java

Updated 1 week ago

sapp

dealfonso

❤️41

Simple and Agnostic PDF Document Parser in PHP - sign PDF docs using PHP

148

LGPL-3.0

PHP

Updated 1 month ago

acrobat, agnostic-pdf-parser, digital-signature, +13

goldmark-pdf

stephenafamo

❤️40

A PDF renderer for the goldmark markdown parser.

144

MIT

Updated 1 month ago

commonmark, go, golang, +4

markify

KylinMountain

🧡50

Convert files into Markdown to help RAG or LLM pipelines understand them. Based on markitdown and MinerU, which can provide a high-quality PDF parser.

133

NOASSERTION

Python

Updated 1 month ago

markdown, pdf, rag

resumeToPDF

jsonresume

🧡60

Convert your resume.json into a PDF; it runs through our HTML parser.

118

JavaScript

Updated 5 hours ago

gerber-parser

hsiang-lee

🧡60

gerber-parser is a library for parsing and rendering Gerber files in the RS-274X format. It natively uses the Qt graphics system for rendering and can export to various formats, including PNG, SVG, and PDF. The library is also designed to be extensible, allowing you to easily integrate alternative rendering engines to suit different technology sta

105

MIT

C++

Updated 2 weeks ago

gerber, pcb, python, +2

pdf-extract

bitextor

🧡55

PDF parser and converter to HTML

GPL-3.0

Java

Updated 3 weeks ago

bella-domify

LianjiaTech

🧡65

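Document Parser supporting PDF, TXT, DOC, DOCX, Markdown, and other file formats. It efficiently extracts and parses content and generates a standard document tree structure. Built-in PDF Parser, Text Parser, and Word Parser support RAG, knowledge bases, full-text search, and other intelligent applications.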

GPL-2.0

Python

Updated 2 days ago

document-parser, parser, pdf-parser

license-parser

ksoftllc

🧡51

American Driver's License PDF-417 Barcode Parser

MIT

Swift

Updated 2 months ago

ofdparser

wangyi160

🧡60

OFD (open fixed layout document) is a Chinese Document Format, just like PDF. It is standardized as GB/T 33190-2016. ofdparser is a parser to parse the format according to the standard.

Java

Updated 2 days ago

nextjs-pdf-parser

tuffstuff9

🧡55

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

TypeScript

Updated 3 weeks ago

content-extraction, filepond, nextjs, +11

temporals

dcparker

❤️35

"We could develop some interpreter that would be able to parse and process a range of expressions that we might want to deal with. This would be quite flexible, but also pretty hard" (Martin Fowler, http://martinfowler.com/apsupp/recurring.pdf). Temporals is a Ruby parser for just that.

Ruby

Updated 5 years ago

tcpdi

pauln

❤️26

PDF importer for TCPDF, based on FPDI. Requires tcpdi_parser and fdpf_tpl.

100

Apache-2.0

PHP

Updated 11 months ago

pdf4py

dipietrantonio

🧡50

A PDF parser written in Python 3 with no external dependencies.

MIT

Python

Updated 1 month ago

information-extraction, parser, pdf, +2

TTPMapper

infosecn1nja

🧡65

TTPMapper is an AI-driven threat intelligence parser that converts unstructured reports, whether from web URLs or PDF files, into structured intelligence. Using the DeepSeek LLM, it extracts MITRE ATT&CK techniques, IOCs, and threat actors, and generates contextual summaries.

GPL-3.0

Python

Updated 6 days ago

blueteam, cybersecurity, deepseek-chat, +2

pdf-parser

ethanhwang1024

💛70

A PDF parser that can extract paragraphs, tables, and pictures

Apache-2.0

Java

Updated 3 days ago

universal-manga-downloader

0xH4KU

🧡60

Universal Manga Downloader (UMD) is a Tkinter desktop app that searches Bato and MangaDex, queues chapters, downloads page images, and converts them into PDF or CBZ archives. Everything runs locally and is extensible through parser/converter plugins discovered at runtime.

NOASSERTION

Python

Updated 4 weeks ago

PDFParser

SimpleApp

🧡55

Swift PDFParser for PDF parsing and text mining. Includes a TrueType font parser

Swift

Updated 4 weeks ago

pdf-parser, swift, truetype

PdfDocumentParser

SergiyStoyan

❤️45

PdfDocumentParser is a .NET toolset for building PDF parsers.

AGPL-3.0

Updated 2 months ago

framework, parser, pdf

pyxpdf

ashutoshvarma

❤️25

Fast and memory-efficient Python PDF Parser based on xpdf sources

NOASSERTION

Cython

Updated 4 months ago

cython, pdf, pdf-converter, +8
