Found 1,613 repositories (showing 30)
tjmlabs
ColiVara is a suite of services that allows you to store, search, and retrieve documents based on their visual embeddings. ColiVara has state-of-the-art retrieval performance on both text and visual documents, using vision models instead of chunking and text processing. No OCR, no text extraction, no broken tables, no missing images.
drmingler
Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc.) into Markdown. With support for both CPU and GPU processing, it is ideal for large-scale workflows and offers text/table extraction, OCR, and batch processing with sync/async endpoints.
scambier
A (companion) plugin to facilitate the extraction of text from images (OCR) and PDFs.
opensemanticsearch
Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database
scribeocr
JavaScript OCR and text extraction for images and PDFs.
jasonlfunk
A simple program to extract the text from an image before performing OCR
ExtractPDF4J
Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.
docwire
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
opensemanticsearch
Python/Django-based web apps and web user interfaces for search, structure (metadata management such as thesauri, ontologies, annotations, and named entities), and data import (ETL such as text extraction, OCR, and crawling filesystems or websites)
nainiayoub
PDF text data extraction web app with OCR for scanned documents
Python library to extract text from PDFs, falling back to OCR when text extraction fails.
mayaramyadav
Laravel OCR & Document Data Extractor – A powerful OCR and document parsing engine for Laravel. It provides intelligent text extraction, structured data parsing, and AI-powered cleanup for documents like invoices, receipts, and PDFs.
Nagakiran1
Optical character recognition (OCR) is the process of classifying optical patterns contained in a digital image. Character recognition is achieved through segmentation, feature extraction, and classification. Here, a Keras deep learning network is used to recognise the text characters, and OpenCV is used for text segmentation and noise normalization.
jamalmazrui
Batch convert PDF files to text under Windows, using several text extraction methods or OCR
ResponsibleAILab
GenFlowChart is a framework that implements flowchart parsing using generative AI. Leveraging SAM for segmentation and OCR for text extraction, it reconstructs workflows through prompt-engineered integration.
thesleepingsage
A standalone, portable toolkit that provides a polished region selector UI with window detection, screenshot capture, OCR text extraction, Google Lens integration, and screen recording.
FarzadNekouee
An urban traffic violation detection system using classical image processing techniques. Features include real-time traffic light recognition, adaptive night-time stop line detection, robust license plate extraction, PyTesseract OCR for text recognition, dynamic penalized plate display, and MySQL logging.
steelThread
CoffeeScript lib for PDF OCR and text extraction
edwineas
Ubuntu Text Capture is a Python tool that captures a selected area of the screen, extracts text using Tesseract OCR, and copies it to the clipboard. It includes a customizable GNOME keyboard shortcut (Shift + Ctrl + T) for quick activation, making text extraction from images fast and easy.
This project offers an efficient method for identifying and recognizing handwritten text from images. Using a Convolutional Recurrent Neural Network (CRNN) for Optical Character Recognition (OCR), it effectively extracts text from images, aiding in the digitization of handwritten documents and automated text extraction.
vinodbaste
Optical Character Recognition (OCR) is a powerful technology that enables machines to recognize and extract text from images or scanned documents. OCR finds applications in various fields, including document digitization, text extraction from images, and text-based data analysis.
ieg-dhr
This repository is part of an NLP course for humanities and cultural studies. This course uses historical newspapers as a source and applies NLP methods to them. NLP tasks: Tokenization, Lemmatization, TF-IDF, Part-of-speech tagging, semantic search with transformers, article extraction and OCR post-correction with LLMs, NER and text classification
mjawadshahid
Automate the extraction of key data fields from invoice images using YOLOv8 and OCR. Train custom models to detect fields like invoice ID, total amount, and address, then extract text and export to Excel. Ideal for streamlining data entry and reducing manual effort.
Ruby-He
[MM'23] ProTegO: Protect Text Content against OCR Extraction Attack
This code is an OCR application that extracts text from images uploaded by users, using the EasyOCR library. The extracted text is then processed to extract information such as email, phone number, pin code, address, and website URL, and displayed on a Streamlit web app interface.
hemantramphul
Integrating object detection with YOLO11 and Optical Character Recognition (OCR) using Tesseract.
andrewdefries
Full text extraction using the open source Tesseract OCR software (https://code.google.com/p/tesseract-ocr/) and ImageMagick
Jacky0111
Explore the world of Optical Character Recognition (OCR) with this beginner-friendly PaddleOCR tutorial. From installation to hands-on projects, this repository guides you through the essentials, making OCR accessible for beginners and intermediate users. Dive in and unlock the potential of text extraction from images using PaddleOCR.
francisco-gargiulo
This Node.js OCR system utilizes Tesseract to extract Machine-Readable Zone (MRZ) data from passports and IDs. It accurately recognizes text characters, enabling efficient and reliable data extraction for passport scanning and verification purposes.
carlosacchi
CaptiOCR - A real-time screen text extraction tool using Tesseract OCR. Capture, recognize, and log on-screen text dynamically. Future updates will include on-demand language installation, resizable selection areas, and live text overlays.