Search Results

Found 22,911 repositories (showing 30)

crawlee

apify

💚98

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

22.7k

1.3k

Apache-2.0

TypeScript

Updated 1 hour ago

apify, automation, crawler (+14)

pdfplumber

jsvine

💛84

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

10.1k

874

MIT

Python

Updated 3 hours ago

pdf, pdf-parsing, table-extraction

PDF-Extract-Kit

opendatalab

💛87

A Comprehensive Toolkit for High-Quality PDF Content Extraction

9.6k

719

AGPL-3.0

Python

Updated 9 hours ago

crawlee-python

apify

💛81

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Parsel, BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

8.7k

703

Apache-2.0

Python

Updated 10 hours ago

apify, automation, beautifulsoup (+14)

kreuzberg

kreuzberg-dev

💛76

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 91+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

7.5k

364

MIT

Rust

Updated 4 hours ago

bun, csharp, document-intelligence (+17)

pdfsam

torakiki

💛78

PDFsam, a desktop application to split, merge, mix, rotate PDF files and extract pages

4.3k

393

AGPL-3.0

Java

Updated 4 hours ago

combine, extract, java (+16)

zotfile

jlegewie

🧡67

Zotero plugin to manage your attachments: automatically rename, move, and attach PDFs (or other files) to Zotero items, sync PDFs from your Zotero library to your (mobile) PDF reader (e.g. an iPad, Android tablet, etc.), and extract PDF annotations.

4.3k

291

Java

Updated 20 hours ago

tika

apache

💛78

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

3.7k

921

Apache-2.0

Java

Updated 5 hours ago

content, extraction, java (+2)

camelot

camelot-dev

💛74

A Python library to extract tabular data from PDFs

3.7k

535

MIT

Python

Updated 19 hours ago

table-transformer

microsoft

💛76

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

2.9k

311

MIT

Python

Updated 5 days ago

table-detection, table-extraction, table-functional-analysis (+1)

pdfparser

smalot

💛73

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.

2.7k

576

LGPL-3.0

PHP

Updated 4 days ago

PdfPig

UglyToad

💛76

Read and extract text and other content from PDFs in C# (port of PDFBox)

2.4k

313

Apache-2.0

Updated 23 hours ago

alto-xml, csharp, document-analysis (+11)

tabula-py

chezou

💛75

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

2.3k

303

MIT

Python

Updated 17 hours ago

pandas, pdf, python (+2)

pdftabextract

WZBSocialScienceCenter

💛76

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

2.3k

370

Apache-2.0

Python

Updated 3 days ago

data-mining, image-processing, ocr (+3)

invoice2data

invoice-x

💛73

Extract structured data from PDF invoices

2.1k

541

MIT

Python

Updated 12 hours ago

data-mining, python

tabula-java

tabulapdf

💛72

Extract tables from PDF files

2.0k

450

MIT

Java

Updated 20 hours ago

extracting-tables, extraction-engine, pdfs

browserless

microlinkhq

💛73

The headless Chrome/Chromium driver on top of Puppeteer. Take screenshots, generate PDFs, extract text and HTML with a production-ready API.

1.8k

MIT

JavaScript

Updated 1 day ago

automation, browser-automation, chromium (+5)

excalibur

camelot-dev

🧡69

A web interface to extract tabular data from PDFs

1.8k

238

MIT

Python

Updated 2 days ago

extract, for-humans, pdf (+1)

textract

dbashford

🧡64

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

1.7k

198

MIT

HTML

Updated 1 week ago

extract-text, extraction, nodejs

AI-reads-books-page-by-page

echohive42

💛73

AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer. The script performs an intelligent page-by-page analysis of PDF books, methodically extracting knowledge points and generating progressive summaries at specified intervals.

1.6k

170

MIT

Python

Updated 16 hours ago

PDFLayoutTextStripper

JonathanLink

💛74

Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).

1.6k

214

Apache-2.0

Java

Updated 2 days ago

data-extraction, extract, java (+4)

docstrange

NanoNets

💛73

Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

1.4k

124

MIT

Python

Updated 14 hours ago

ai, document-parser, document-parsing (+10)

ebook-to-mindmap

SSShooter

💛73

AI-powered summaries by extracting content from EPUB and PDF. Breaks down EPUB and PDF books and summarizes them with AI.

1.1k

140

MIT

TypeScript

Updated 1 day ago

ai, book, summary

pdfx

metachris

💛72

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.

1.1k

117

Apache-2.0

Python

Updated 3 days ago

pdf-to-text

spatie

🧡67

Extract text from a pdf

1.0k

133

MIT

PHP

Updated 2 days ago

pdf, pdf-converter, php (+1)

chatWeb

SkywalkerDarren

🧡62

ChatWeb can crawl web pages, read PDF, DOCX, TXT, and extract the main content, then answer your questions based on the content, or summarize the key points.

911

136

MIT

Python

Updated 2 weeks ago

ai, chatgpt, crawler (+12)

IPA

seekbytes

🧡66

GUI analyzer for deep-diving into PDF files. Detect malicious payloads, understand object relationships, and extract key information for threat analysis.

872

GPL-2.0

Rust

Updated 1 day ago

egui, malware-analysis, pdf (+1)

pdffigures2

allenai

💛72

Given a scholarly PDF, extract figures, tables, captions, and section titles.

736

132

Apache-2.0

Scala

Updated 2 days ago

pdftext

datalab-to

💛71

Extract structured text from pdfs quickly

683

Apache-2.0

Python

Updated 1 day ago

dedoc

ispras

💛71

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

655

Apache-2.0

Python

Updated 12 hours ago

doc, document-analysis, document-content-extraction (+15)
