Search Results

Found 1,613 repositories (showing 30)

ColiVara

tjmlabs

💛73

ColiVara is a suite of services that allows you to store, search, and retrieve documents based on their visual embedding. ColiVara has state-of-the-art retrieval performance on both text and visual documents, using vision models instead of chunking and text processing. No OCR, no text extraction, no broken tables, and no missing images.

1.5k

121

NOASSERTION

Python

Updated 5 hours ago

docling-api

drmingler

💛72

Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc.) into Markdown. With support for both CPU and GPU processing, it is ideal for large-scale workflows and offers text/table extraction, OCR, and batch processing with sync/async endpoints.

763

MIT

Python

Updated 4 hours ago

api, fastapi, markdown-parser, +6

obsidian-text-extractor

scambier

💛71

A (companion) plugin to facilitate the extraction of text from images (OCR) and PDFs.

562

GPL-3.0

TypeScript

Updated 12 hours ago

obsidian, obsidian-plugin, ocr, +1

open-semantic-etl

opensemanticsearch

🧡61

Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database.

277

GPL-3.0

Python

Updated 22 hours ago

annotation, documents, elasticsearch, +17

scribe.js

scribeocr

🧡65

JavaScript OCR and text extraction for images and PDFs.

267

AGPL-3.0

JavaScript

Updated 8 hours ago

javascript, mcp, ocr, +2

ocr-text-extraction

jasonlfunk

🧡62

A simple program to extract the text from an image before performing OCR

220

188

MIT

Python

Updated 1 week ago

ExtractPDF4J

🧡55

Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.

133

NOASSERTION

Java

Updated 3 hours ago

cli, document-processing, java, +9

docwire

❤️40

DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality

101

NOASSERTION

C++

Updated 1 week ago

api, artificial-intelligence, c, +17

open-semantic-search-apps

opensemanticsearch

❤️45

Python/Django-based web apps and web user interfaces for search, structure (metadata management such as thesauri, ontologies, annotations, and named entities), and data import (ETL such as text extraction, OCR, and crawling filesystems or websites)

100

GPL-3.0

CSS

Updated 2 weeks ago

django, django-application, named-entities, +16

pdf-text-data-extractor

nainiayoub

🧡66

PDF text data extraction web app with OCR for scanned documents

Python

Updated 22 hours ago

ocr, ocr-python, ocr-text-reader, +6

doc_processing_toolkit

18F

❤️35

Python library to extract text from PDF, and default to OCR when text extraction fails.

NOASSERTION

Python

Updated 1 year ago

laravel-ocr

mayaramyadav

🧡55

Laravel OCR & Document Data Extractor – A powerful OCR and document parsing engine for Laravel. It provides intelligent text extraction, structured data parsing, and AI-powered cleanup for documents like invoices, receipts, and PDFs.

PHP

Updated 1 week ago

laravel, laravel-package, ocr, +2

4-simple-steps-in-Builiding-OCR

Nagakiran1

❤️45

Optical character recognition (OCR) is the process of classifying optical patterns contained in a digital image. Character recognition is achieved through segmentation, feature extraction, and classification. A Keras deep learning network is used here to recognise the text characters, and OpenCV is used for text segmentation and noise normalization.

Jupyter Notebook

Updated 1 week ago

PDF2TXT

jamalmazrui

❤️45

Batch convert PDF files to text under Windows, using several text extraction methods or OCR

LGPL-3.0

Visual Basic

Updated 2 months ago

GenFlowchart

ResponsibleAILab

🧡60

GenFlowChart is a framework that implements flowchart parsing using generative AI. Leveraging SAM for segmentation and OCR for text extraction, it reconstructs workflows through prompt-engineered integration.

Python

Updated 6 hours ago

hypr-lens

thesleepingsage

🧡60

A standalone, portable toolkit that provides a polished region selector UI with window detection, screenshot capture, OCR text extraction, Google Lens integration, and screen recording.

GPL-3.0

QML

Updated 1 week ago

Traffic-Violation-Detection

FarzadNekouee

❤️40

An urban traffic violation detection system using classical image processing techniques. Features include real-time traffic light recognition, adaptive night-time stop line detection, robust license plate extraction, PyTesseract OCR for text recognition, dynamic penalized plate display, and MySQL logging.

MIT

Jupyter Notebook

Updated 1 year ago

mimeograph

steelThread

❤️20

CoffeeScript lib for PDF OCR and text extraction

CoffeeScript

Updated 7 years ago

ubuntu-text-capture

edwineas

❤️40

Ubuntu Text Capture is a Python tool that captures a selected area of the screen, extracts text using Tesseract OCR, and copies it to the clipboard. It includes a customizable GNOME keyboard shortcut (Shift + Ctrl + T) for quick activation, making text extraction from images fast and easy.

MIT

Shell

Updated 3 months ago

Automate-identification-and-recognition-of-handwritten-text-from-an-image

VMD7

🧡50

This project offers an efficient method for identifying and recognizing handwritten text from images. Using a Convolutional Recurrent Neural Network (CRNN) for Optical Character Recognition (OCR), it effectively extracts text from images, aiding in the digitization of handwritten documents and automated text extraction.

MIT

Jupyter Notebook

Updated 1 month ago

crnn, crnn-kreas, crnn-ocr, +5

paddleOCR_rec_dec

vinodbaste

🧡60

Optical Character Recognition (OCR) is a powerful technology that enables machines to recognize and extract text from images or scanned documents. OCR finds applications in various fields, including document digitization, text extraction from images, and text-based data analysis.

Apache-2.0

Python

Updated 2 weeks ago

detection, image-processing, ocr, +2

NLP-Course4Humanities_2024

ieg-dhr

❤️45

This repository is part of an NLP course for humanities and cultural studies. This course uses historical newspapers as a source and applies NLP methods to them. NLP tasks: Tokenization, Lemmatization, TF-IDF, Part-of-speech tagging, semantic search with transformers, article extraction and OCR post-correction with LLMs, NER and text classification

Jupyter Notebook

Updated 2 months ago

article-extraction, historical-newspapers, llms, +9

Invoice-Data-Extraction-System

mjawadshahid

❤️45

Automate the extraction of key data fields from invoice images using YOLOv8 and OCR. Train custom models to detect fields like invoice ID, total amount, and address, then extract text and export to Excel. Ideal for streamlining data entry and reducing manual effort.

Jupyter Notebook

Updated 1 month ago

ProTegO

Ruby-He

❤️45

[MM'23] ProTegO: Protect Text Content against OCR Extraction Attack

MIT

Python

Updated 2 months ago

Text-Extraction-From-Business-Card-Using-OCR

tulasinnd

❤️35

This code is an OCR application that extracts text from images uploaded by users, using the EasyOCR library. The extracted text is then processed to extract information such as email, phone number, pin code, address, and website URL, and displayed on a Streamlit web app interface.

Python

Updated 5 months ago

easyocr-library, python, regular-expressions, +1

Text-Extraction-with-YOLOv11-and-OCR

hemantramphul

🧡60

Integrating object detection with YOLO11 and Optical Character Recognition (OCR) using Tesseract.

MIT

Jupyter Notebook

Updated 4 weeks ago

notebook, ocr-recognition, python, +2

TesseractOCR

andrewdefries

❤️40

Full text extraction using the open source Tesseract OCR software https://code.google.com/p/tesseract-ocr/ and ImageMagick

MIT

Shell

Updated 7 months ago

PaddleOCR-Tutorial

Jacky0111

❤️45

Explore the world of Optical Character Recognition (OCR) with this beginner-friendly PaddleOCR tutorial. From installation to hands-on projects, this repository guides you through the essentials, making OCR accessible for beginners and intermediate users. Dive in and unlock the potential of text extraction from images using PaddleOCR

Python

Updated 1 month ago

ocr-mrz-tesseract

francisco-gargiulo

🧡50

This Node.js OCR system utilizes Tesseract to extract Machine-Readable Zone (MRZ) data from passports and IDs. It accurately recognizes text characters, enabling efficient and reliable data extraction for passport scanning and verification purposes.

MIT

JavaScript

Updated 1 month ago

captiocr

carlosacchi

🧡60

CaptiOCR - A real-time screen text extraction tool using Tesseract OCR. Capture, recognize, and log on-screen text dynamically. Future updates will include on-demand language installation, resizable selection areas, and live text overlays.

MIT

Python

Updated 2 weeks ago

captions, live, live-caption, +9
