Found 326 repositories (showing 30)
iamarunbrahma
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
thu-vu92
In this project, I explored how to extract structured information from PDF documents using LangChain and OpenAI models.
HaileyTQuach
DocChat is an AI-powered Multi-Agent RAG system using Docling for structured document parsing and BM25 + vector search retrievers to retrieve fact-checked answers from PDFs, DOCX, and text files, preventing hallucinations. 🚀
OpenDFM
[ACL 2025] NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering
PritiG1
Multimodal RAG with Docling that lets you query PDFs containing text, tables, images, and formulas using a Retrieval-Augmented Generation pipeline. It leverages Docling for structured PDF parsing and Qdrant for fast vector search over embedded document chunks.
SLEEPYBQ
Survey-RAG is a tool for processing academic survey PDF documents and extracting information using large language models. This tool utilizes vector databases and Retrieval-Augmented Generation (RAG) to efficiently extract structured information from multiple PDF files.
oztrkoguz
Automatically describes images in PDF files and generates questions from those descriptions. Its RAG pipeline then runs these questions against the PDF text content for comprehensive information extraction and analysis.
kostadindev
Python package that constructs a structured markdown knowledge base from external sources such as PDFs, websites, and GitHub repos with LLM summarization. Ideal for RAG, search-friendly LLM contexts (/llms.txt), and chatbots.
Dr-Venom29
A cloud-native RAG system for research paper analysis featuring structured PDF ingestion via LangExtract, high-speed Groq (Llama 3.3) inference, and Supabase vector storage.
carles-abarca
Native Rust port of IBM's Docling document processing library. Convert PDF, DOCX, XLSX, PPTX, HTML, Markdown, and CSV to structured data for RAG applications.
mhsefidgar
A pluggable, object-oriented Python framework for extracting structured information (name, email, and skills) from PDF and Word resumes, suitable for Agentic AI and RAG systems.
Md-Emon-Hasan
An intelligent financial advisor chatbot powered by LLaMA-3, built with LangGraph for structured reasoning, RAG from The Intelligent Investor PDF, and real-time data from Yahoo Finance and DuckDuckGo. The system dynamically orchestrates tools, handles fallbacks, and maintains conversation memory.
ThankaBharathi
An AI-powered Streamlit app that validates job roles by comparing structured XML definitions with roles extracted from unstructured PDFs. Combines Google Gemini, RAG (via Pinecone), and fuzzy string matching for intelligent role comparison and PDF report generation. Automates validation, detects mismatches, and saves hours of manual effort.
BUCHAAE
Local AI invoice analysis demo using Mixtral via Ollama, LangChain, ChromaDB, and HuggingFace embeddings. Extracts structured data from PDFs, creates a vector store, and supports natural language Q&A through a Gradio interface. No cloud dependencies—fully local RAG pipeline.
CarlosManuelDiaz
Stop indexing noise. Turn messy websites and PDFs into clean, structured data for RAG pipelines with semantic importance scoring and token optimization.
skyblue-ustc
A structure-aware RAG system for complex documents (PDFs with tables), featuring Docling parsing and DeepSeek inference.
behnamfaghih
A Python application that extracts structured profiles from PDF documents and responds to queries in natural language using a Retrieval Augmented Generation (RAG) pipeline.
bicerfatih
The LLM-Finance-Chatbot is designed to answer finance-related queries using company reports in PDF format. RAG is used to optimize the output. This project showcases coding abilities and proficiency with Large Language Models (LLMs), specifically focusing on the use of GPT models for extracting and serving information from structured documents.
fiammante
A local LLM+RAG chatbot with structured PDF ingestion: Word converts PDF to DOCX, then pandoc converts DOCX to Markdown, enabling the use of LangChain's ParentDocumentRetriever with MarkdownTextSplitter.
RoodyCode
A modular, self-hosted RAG pipeline for building a private, searchable personal knowledge base from PDFs and structured documents.
PetrAPConsulting
Convert complex structured PDF documents with tables and formulas to clean Markdown without OCR, using Google's vision models. The resulting Markdown files are suitable for RAG pipelines.
mku1988-oss
Production-ready academic PDF processor that extracts metadata and structured JSON from research papers. Designed as a plug-and-play module for RAG pipelines: the high-quality document understanding core.
BryanTheLai
StackRAG is a multi-tenant Retrieval-Augmented Generation (RAG) platform for financial document intelligence. It extracts structured data from financial PDFs using LLMs, offers secure multi-tenancy, real-time APIs, and is built on Python, FastAPI, Docker, and PostgreSQL.
yasaminfn
MultiTool-LangGraph-RAG-Agent is an AI-powered multi-tool assistant built with LangGraph and LangChain, featuring FastAPI backend with JWT authentication, Streamlit UI, PDF Q&A with RAG, OCR for low-text pages, semantic search via pgvector, persistent memory in PostgreSQL, Tavily web search, real-time crypto prices, structured logging, and session
cesaremcasa
An architecture-first legal RAG system demonstrating production-grade subsystem design and clean engineering boundaries. Implements FastAPI endpoints, modular retrieval layer (BM25, FAISS, Hybrid), PDF/HTML parsing infrastructure, and grounding validation stubs ready for expansion. Features structured JSON logging with request tracing, centralized
CraftyEngineer
SmartOps is an AI-powered RAG pipeline that scrapes messy documents from the web and parses them into structured JSON using OCR and NLP. Built with LangChain, ChromaDB, and Streamlit, it supports PDF/HTML parsing and natural language querying over the data.
HemaKumar0077
TableRAG is an advanced question-answering framework that combines structured tabular data (CSV files) and unstructured text documents (PDF, DOCX, TXT, MD) using Retrieval-Augmented Generation (RAG). Ask natural language questions and get intelligent answers that leverage both your data tables and text content.
MattTPin
SimpleResumeParser is a lightweight resume parsing framework that extracts names, emails, and skills from PDF or DOCX resumes using a combination of LLMs (RAG), HuggingFace NER models, rules-based logic, and regex patterns. It’s designed with a pluggable architecture for production-ready, structured data extraction.
JochiRaider
Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM pretraining, fine-tuning, and RAG. It offers structure-aware chunking, reliable Unicode decoding, pluggable QC and safety checks, plus optional dataset cards and deduplication.
davidmoserai
A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.