Search Results

Found 326 repositories (showing 30)

pdf-to-markdown

iamarunbrahma

💛70

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

123

MIT

Python

Updated 8 hours ago

document-conversion, document-processing, information-retrieval, +8

structured-rag-pdf

thu-vu92

🧡56

In this project, I explored how to extract structured information from PDF documents, using Langchain and OpenAI models

103

Jupyter Notebook

Updated 2 weeks ago

docchat-docling

HaileyTQuach

🧡55

DocChat is an AI-powered Multi-Agent RAG system using Docling for structured document parsing and BM25 + vector search retrievers to retrieve fact-checked answers from PDFs, DOCX, and text files, preventing hallucinations. 🚀

NOASSERTION

Python

Updated 4 weeks ago

ai-agents, bm25, chromadb, +7

NeuSym-RAG

OpenDFM

🧡50

[ACL 2025] NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

Python

Updated 1 week ago

academic-research, database, neural-symbolic-processing, +6

Multimodal-RAG

PritiG1

🧡60

Multimodal RAG with Docling that lets you query PDFs containing text, tables, images, and formulas using a Retrieval-Augmented Generation pipeline. It leverages Docling for structured PDF parsing and Qdrant for fast vector search over embedded document chunks.

Apache-2.0

Python

Updated 3 days ago

Survey-RAG

SLEEPYBQ

❤️40

Survey-RAG is a tool for processing academic survey PDF documents and extracting information using large language models. This tool utilizes vector databases and Retrieval-Augmented Generation (RAG) to efficiently extract structured information from multiple PDF files.

MIT

Python

Updated 8 months ago

VisQueryPDF

oztrkoguz

❤️35

It automatically describes images in PDF files and generates questions from these descriptions. With its advanced RAG structure, it directs these questions directly to PDF text content, providing comprehensive information extraction and analysis.

Apache-2.0

Python

Updated 1 year ago

agent, clip, langchain, +6

knowledge-base-builder

kostadindev

🧡60

Python package that constructs a structured markdown knowledge base from external sources such as PDFs, websites, and GitHub repos with LLM summarization. Ideal for RAG, search-friendly LLM contexts (/llms.txt), and chatbots.

MIT

Python

Updated 2 weeks ago

knowledge-base, llms, rag, +1

Paper-Snap

Dr-Venom29

🧡55

A cloud-native RAG system for research paper analysis featuring structured PDF ingestion via LangExtract, high-speed Groq (Llama 3.3) inference, and Supabase vector storage.

Python

Updated 1 week ago

groq, langchain, rag, +2

docling-rs

carles-abarca

🧡55

Native Rust port of IBM's Docling document processing library. Convert PDF, DOCX, XLSX, PPTX, HTML, Markdown, and CSV to structured data for RAG applications.

Rust

Updated 6 days ago

docling, document-processing, docx-converter, +4

ragconnect-resume-parser

mhsefidgar

🧡50

A pluggable, object-oriented Python framework for extracting structured information (name, email, and skills) from PDF and Word resumes, suitable for Agentic AI and RAG systems.

Apache-2.0

Python

Updated 1 month ago

TrueWealth-AI

Md-Emon-Hasan

🧡50

An intelligent financial advisor chatbot powered by LLaMA-3, built with LangGraph for structured reasoning, RAG from The Intelligent Investor PDF, and real-time data from Yahoo Finance and DuckDuckGo. The system dynamically orchestrates tools, handles fallbacks, and maintains conversation memory.

MIT

Jupyter Notebook

Updated 1 month ago

ai-agents, ai-assistant, ai-finance, +17

Role-Validator-XML-to-PDF-Job-Role-Comparison-Tool

ThankaBharathi

❤️40

An AI-powered Streamlit app that validates job roles by comparing structured XML definitions with roles extracted from unstructured PDFs. Combines Google Gemini, RAG (via Pinecone), and fuzzy string matching for intelligent role comparison and PDF report generation. Automates validation, detects mismatches, and saves hours of manual effort.

MIT

Python

Updated 7 months ago

AI_Invoice_RAG

BUCHAAE

❤️30

Local AI invoice analysis demo using Mixtral via Ollama, LangChain, ChromaDB, and HuggingFace embeddings. Extracts structured data from PDFs, creates a vector store, and supports natural language Q&A through a Gradio interface. No cloud dependencies—fully local RAG pipeline.

Python

Updated 9 months ago

rag-ready-extractor

CarlosManuelDiaz

🧡50

Stop indexing noise. Turn messy websites and PDFs into clean, structured data for RAG pipelines with semantic importance scoring and token optimization.

MIT

Updated 1 month ago

ai-agent, ai-agents, api, +9

Complex-Doc-RAG

skyblue-ustc

❤️35

A structure-aware RAG system for complex documents (PDFs with tables), featuring Docling parsing and DeepSeek inference.

Python

Updated 3 months ago

InsightExtract

behnamfaghih

❤️30

A Python application that extracts structured profiles from PDF documents and responds to queries in natural language using a Retrieval Augmented Generation (RAG) pipeline.

Python

Updated 8 months ago

RAG-LLM-Finance-Chatbot

bicerfatih

🧡55

The LLM-Finance-Chatbot is designed to answer finance-related queries using company reports in PDF format. RAG is used to optimize the output. This project showcases coding abilities and proficiency with Large Language Models (LLMs), specifically focusing on the use of GPT models for extracting and serving information from structured documents.

Python

Updated 4 weeks ago

ParentMarkDownChatbot

fiammante

❤️40

A local LLM+RAG chatbot with structured PDF ingestion. It uses Word to convert PDF to DOCX, then pandoc to convert DOCX to Markdown, enabling the use of the LangChain ParentDocumentRetriever with MarkdownTextSplitter.

MIT

Jupyter Notebook

Updated 1 year ago

rag

RoodyCode

🧡60

A modular, self-hosted RAG pipeline for building a private, searchable personal knowledge base from PDFs and structured documents.

Python

Updated 3 days ago

ai, document-ingestion, embeddings, +10

pdf2md

PetrAPConsulting

❤️40

Convert complex structured PDF documents containing tables and formulas to clear Markdown without OCR, using Google's vision models. The Markdown files are suitable for RAG pipelines.

MIT

Python

Updated 3 months ago

conversion, rag, semantic-search

Prajnya-Academic-PDF-Processor

mku1988-oss

🧡55

Production-ready academic PDF processor that extracts metadata and structured JSON from research papers. Designed as a plug-and-play module for RAG pipelines: the high-quality document understanding core.

MIT

Python

Updated 2 weeks ago

StackRAG-Backend

BryanTheLai

🧡55

StackRAG is a multi-tenant Retrieval-Augmented Generation (RAG) platform for financial document intelligence. It extracts structured data from financial PDFs using LLMs, offers secure multi-tenancy, real-time APIs, and is built on Python, FastAPI, Docker, and PostgreSQL.

MIT

Jupyter Notebook

Updated 2 weeks ago

business-analytics, business-intelligence, conversational-ai, +16

LangGraph-RAG-Agent

yasaminfn

❤️40

MultiTool-LangGraph-RAG-Agent is an AI-powered multi-tool assistant built with LangGraph and LangChain, featuring FastAPI backend with JWT authentication, Streamlit UI, PDF Q&A with RAG, OCR for low-text pages, semantic search via pgvector, persistent memory in PostgreSQL, Tavily web search, real-time crypto prices, structured logging, and session

MIT

Python

Updated 3 months ago

Winter-Garden-Legal-RAG

cesaremcasa

❤️45

An architecture-first legal RAG system demonstrating production-grade subsystem design and clean engineering boundaries. Implements FastAPI endpoints, modular retrieval layer (BM25, FAISS, Hybrid), PDF/HTML parsing infrastructure, and grounding validation stubs ready for expansion. Features structured JSON logging with request tracing, centralized

Python

Updated 2 months ago

SmartOps

CraftyEngineer

❤️45

SmartOps is an AI-powered RAG pipeline that scrapes messy documents from the web and parses them into structured JSON using OCR and NLP. Built with LangChain, ChromaDB, and Streamlit, it supports PDF/HTML parsing and natural language querying over the data.

Python

Updated 2 months ago

TableRAG

HemaKumar0077

❤️30

TableRAG is an advanced question-answering framework that combines structured tabular data (CSV files) and unstructured text documents (PDF, DOCX, TXT, MD) using Retrieval-Augmented Generation (RAG). Ask natural language questions and get intelligent answers that leverage both your data tables and text content.

Python

Updated 3 months ago

faiss, faiss-vector-database, groq, +7

simple_resume_parser

MattTPin

❤️20

SimpleResumeParser is a lightweight resume parsing framework that extracts names, emails, and skills from PDF or DOCX resumes using a combination of LLMs (RAG), HuggingFace NER models, rules-based logic, and regex patterns. It is designed with a pluggable architecture for production-ready, structured data extraction.

Python

Updated 3 months ago

sievio

JochiRaider

❤️40

Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM pretraining, fine-tuning, and RAG. It offers structure-aware chunking, reliable Unicode decoding, pluggable QC and safety checks, plus optional dataset cards and deduplication.

MIT

Python

Updated 3 months ago

code-mining, data-deduplication, data-pipelines, +11

AzureDocumentIntelligenceChunker

davidmoserai

❤️35

A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.

Python

Updated 8 months ago

agent, agents, azure, +15
