Found 22,911 repositories (showing 30)
apify
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
jsvine
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
opendatalab
A Comprehensive Toolkit for High-Quality PDF Content Extraction
apify
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Parsel, BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
kreuzberg-dev
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.
torakiki
PDFsam, a desktop application to split, merge, mix, rotate PDF files and extract pages
jlegewie
Zotero plugin to manage your attachments: automatically rename, move, and attach PDFs (or other files) to Zotero items, sync PDFs from your Zotero library to your (mobile) PDF reader (e.g. an iPad, Android tablet, etc.), and extract PDF annotations.
apache
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
camelot-dev
A Python library to extract tabular data from PDFs
microsoft
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
smalot
PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
UglyToad
Read and extract text and other content from PDFs in C# (port of PDFBox)
chezou
Simple wrapper of tabula-java: extract tables from PDFs into a pandas DataFrame
WZBSocialScienceCenter
A set of tools for extracting tables from PDF files, to help with data mining on (OCR-processed) scanned documents.
invoice-x
Extract structured data from PDF invoices
tabulapdf
Extract tables from PDF files
microlinkhq
The headless Chrome/Chromium driver on top of Puppeteer. Take screenshots, generate PDFs, extract text and HTML with a production-ready API.
camelot-dev
A web interface to extract tabular data from PDFs
dbashford
Node.js module for extracting text from HTML, PDF, DOC, DOCX, XLS, XLSX, CSV, PPTX, PNG, JPG, GIF, RTF and more!
echohive42
AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer. This script performs an intelligent page-by-page analysis of PDF books, methodically extracting knowledge points and generating progressive summaries at specified intervals.
JonathanLink
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful, for instance, to extract the content of a table in a PDF file. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).
NanoNets
Extract and convert data from any document, image, PDF, Word document, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
SSShooter
AI-powered summaries by extracting content from EPUB and PDF. Break down EPUB and PDF books with AI summaries.
metachris
Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
spatie
Extract text from a PDF
SkywalkerDarren
ChatWeb can crawl web pages, read PDF, DOCX, TXT, and extract the main content, then answer your questions based on the content, or summarize the key points.
seekbytes
GUI analyzer for deep-diving into PDF files. Detect malicious payloads, understand object relationships, and extract key information for threat analysis.
allenai
Given a scholarly PDF, extract figures, tables, captions, and section titles.
datalab-to
Extract structured text from PDFs quickly
ispras
Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser