Found 3,105 repositories (showing 30)
opendataloader-project
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
QuivrHQ
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
euske
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
vsch
CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.
yob
The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
ispras
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
titipata
Python PDF parser for scientific publications: content and figures
jstockwin
A Python tool to help extract information from structured PDFs.
caradoc-org
A PDF parser and validator
hxu296
NLP-powered, GPT-3 enabled Resume Parser from PDF to JSON.
radkovo
Pdf2Dom is a PDF parser that converts documents to an HTML DOM representation. The resulting DOM tree may then be serialized to an HTML file or processed further. A command-line utility for converting PDF documents to HTML is included in the distribution package. Pdf2Dom may also be used as an independent Java library with a standard DOM interface for your DOM-based applications, or as an alternative parser for the CSSBox rendering engine to add PDF processing capability to CSSBox. Pdf2Dom is based on the Apache PDFBox™ library.
dealfonso
Simple and Agnostic PDF Document Parser in PHP - sign PDF docs using PHP
stephenafamo
A PDF renderer for the goldmark markdown parser.
KylinMountain
Converts files into Markdown to help RAG pipelines or LLMs understand them. Based on markitdown and MinerU, which can provide high-quality PDF parsing.
jsonresume
Convert your resume.json into a PDF; it runs through our HTML parser.
hsiang-lee
gerber-parser is a library for parsing and rendering Gerber files in the RS-274X format. It natively uses the Qt graphics system for rendering and can export to various formats, including PNG, SVG, and PDF. The library is also designed to be extensible, allowing you to easily integrate alternative rendering engines to suit different technology sta
bitextor
PDF parser and converter to HTML
LianjiaTech
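Document Parser supporting PDF, TXT, DOC, DOCX, Markdown, and other file formats. It efficiently extracts and parses content and generates a standard document tree structure. Built-in PDF Parser, Text Parser, and Word Parser support intelligent applications such as RAG, knowledge bases, and full-text search.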
ksoftllc
American Driver's License PDF-417 Barcode Parser
wangyi160
OFD (open fixed layout document) is a Chinese Document Format, just like PDF. It is standardized as GB/T 33190-2016. ofdparser is a parser to parse the format according to the standard.
tuffstuff9
Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.
dcparker
"We could develop some interpreter that would be able to parse and process a range of expressions that we might want to deal with. This would be quite flexible, but also pretty hard" (Martin Fowler, http://martinfowler.com/apsupp/recurring.pdf). Temporals is a Ruby parser for just that.
pauln
PDF importer for TCPDF, based on FPDI. Requires tcpdi_parser and fdpf_tpl.
dipietrantonio
A PDF parser written in Python 3 with no external dependencies.
infosecn1nja
TTPMapper is an AI-driven threat intelligence parser that converts unstructured reports, whether from web URLs or PDF files, into structured intelligence. Using the DeepSeek LLM, it extracts MITRE ATT&CK techniques, IOCs, and threat actors, and generates contextual summaries.
ethanhwang1024
A PDF parser that can extract paragraphs, tables, and pictures.
Universal Manga Downloader (UMD) is a Tkinter desktop app that searches Bato and MangaDex, queues chapters, downloads page images, and converts them into PDF or CBZ archives. Everything runs locally and is extensible through parser/converter plugins discovered at runtime.
SimpleApp
Swift PDFParser for PDF parsing and text mining. Includes a TrueType font parser
SergiyStoyan
PdfDocumentParser is a .NET toolset for building PDF parsers.
ashutoshvarma
Fast and memory-efficient Python PDF Parser based on xpdf sources