Found 36 repositories (showing 30)
shauryr
This repository provides details and links to the ACL anthology corpus/collection including .bib, .pdf and grobid extractions of the pdfs
ritikamotwani
Detecting Deception using Verbal Cues | Dataset Used: Real-life trial data collected during a series of experiments at Michigan (http://web.eecs.umich.edu/~zmohamed/PDFs/Trial.ICMI.pdf) and Deceptive Opinion Spam Corpus v1.4 (https://myleott.com/op-spam.html)
0ca
A set of pdf documents used during the fuzzing process
jeremiahbohr
Transform academic PDFs into a Knowledge Graph with typed claims, temporal analysis, bibliometric tools, and grounded LLM synthesis that cites only your corpus.
prateekralhan
A streamlit based webapp to detect scanned/digital PDFs from a large corpus as well as allow the user to OCR the scanned docs
ammons-datalabs
A compact evidence-ingestion pipeline for research PDFs using Python, PostgreSQL, and Elasticsearch. Extracts and cleans text, enriches metadata from Paperpile, and indexes full-text for search. Includes tests, a synthetic corpus, and a modular design for extending with API, queues, or workers.
elsehow
make a text corpus (for machine learning) from a batch of PDFs
rossmounce
Where I store my annotations on the initial corpus of Open Access BMC PDFs
sofianedjerbi
Transform any knowledge corpus into an AI-ready vector intelligence layer. From PDFs to production RAG in minutes.
00shu
This project implements a RAG system to answer questions based on the training corpus provided. In this case the training corpus is a set of rules and regulations PDFs provided by the government.
mkrnr
Contains code that we use to evaluate the quality of our PDF corpus. For example, we look at the number of scanned PDFs as well as PDFs that contain OCR text.
NXTLupo
The NLP Dashboard Application is a FastAPI-based tool designed to process and analyze a corpus of documents (PDFs, Word Docs, text, CSV, JSON files) via a simple web interface. The app performs semantic analysis, sentiment analysis, topic extraction, and summarization.
saurabhjondhale
OCR for Sanskrit language extraction from PDFs using Corpus.
dansachs
Turning static PDFs into a dynamic parallel corpus for Ambonese Malay.
HiraStanley
Testing out parsing tools on a set corpus of PDFs and benchmarking results
e-centricity
modelling-lite. pluggable script for visualizing simple models on a corpus of pdfs or txt
tanguyguyot
Simple RAG implementation along with philosophy PDFs corpus (Marcus Aurelius) for a chatbot, in French language.
abhirajjsingh
prototype for finding similar document pairs in a corpus (PDFs) & returning similar ones on new uploads
kcmclau21122
PY that creates Q&A pairs from a corpus of PDFs using Question and Answer Generation with Language Models
limitcracker
Greek educational corpus for RAG/LLM with raw PDFs (FEK/circulars), normalized texts, markdown exports, chunks, FAQ and QA/eval datasets.
botwin-tokyo
Your private, local-first research librarian. Upload PDFs (OCR support), search + chat with citations, and keep your corpus on your own hardware (Jetson-friendly).
benjlis
Slides and materials for the talk "Creating Email Archives from PDFs – The COVID-19 Corpus" delivered at the EABCC Email Archiving Symposium in June '23
JKDrewes
A RAG pipeline using open-source Ollama that allows users to create and query a corpus of .pdf, .txt, and other files. Fully private.
KeerthidharLoki
Multimodal RAG system for long-document intelligence over the MMDocRAG benchmark corpus — 222 PDFs, 4,055 QA pairs, hybrid BM25+ANN retrieval, Gemini 2.5 Flash generation
deutranium
Analyzing a corpus of 11,000+ tweets with different user and tweet properties | Parsing PDFs with funky table formats to generate the required data || Link to analysis below
StaHk-collab
A tool that takes two corpora (PDFs) from different domains as input and analyzes domain- or context-specific ambiguities in natural language text using neural word embeddings.
bantoinese83
An AI-enabled document intelligence backend using FastAPI, OCR, embeddings, and vector search. This system can process PDFs and images, extract text via OCR, generate embeddings, and perform semantic search over the document corpus.
16AI20
This project implements a locally-deployable Retrieval-Augmented Generation (RAG) system designed for processing multimodal content including HTML documents, PDFs, audio files, images, and structured data to provide accurate, contextual responses based on any document corpus
moses-shenassa
CorpusFlower is a concordance and retrieval engine that lets meaning bloom from your text corpus. A local RAG workflow for scholars, researchers, clergy, and analysts — built to illuminate hidden patterns, contexts, and themes inside large collections of PDFs.
florian-coder
Genre-conditioned children’s story generator. Extracts stories from PDFs, discovers themes with TF-IDF + NMF, labels a corpus with genre tags, trains an LSTM language model, and generates new stories using Top-P sampling with keyword boosting.