Search Results

Found 36 repositories (showing 30)

ACL-anthology-corpus

shauryr

🧡65

This repository provides details and links to the ACL anthology corpus/collection including .bib, .pdf and grobid extractions of the pdfs

190

Jupyter Notebook

Updated 1 day ago

Deception-Detection

Detecting Deception using Verbal Cues | Dataset used: real-life trial data collected during a series of experiments at Michigan (http://web.eecs.umich.edu/~zmohamed/PDFs/Trial.ICMI.pdf) and Deceptive Opinion Spam Corpus v1.4 (https://myleott.com/op-spam.html)

Python

Updated 2 months ago

corpus_pdfs

0ca

❤️45

A set of pdf documents used during the fuzzing process

Updated 1 month ago

literature-mapper

jeremiahbohr

🧡50

Transform academic PDFs into a Knowledge Graph with typed claims, temporal analysis, bibliometric tools, and grounded LLM synthesis that cites only your corpus.

MIT

Python

Updated 1 month ago

academic-research, bibliometrics, knowledge-graph, +4

Scanned-PDFs-checker

prateekralhan

❤️40

A Streamlit-based web app that detects scanned vs. digital PDFs in a large corpus and lets the user OCR the scanned documents

Apache-2.0

Python

Updated 1 year ago

ghostscript, ocrmypdf, opensourceforgood, +4

adl-pdf-ingest-phd

ammons-datalabs

❤️40

A compact evidence-ingestion pipeline for research PDFs using Python, PostgreSQL, and Elasticsearch. Extracts and cleans text, enriches metadata from Paperpile, and indexes full-text for search. Includes tests, a synthetic corpus, and a modular design for extending with API, queues, or workers.

MIT

Python

Updated 3 months ago

corpus-from-pdfs

elsehow

❤️35

make a text corpus (for machine learning) from a batch of PDFs

Updated 3 years ago

BMCphyloannotation

rossmounce

❤️35

Where I store my annotations on the initial corpus of Open Access BMC PDFs

Updated 1 year ago

docuvec

sofianedjerbi

❤️40

Transform any knowledge corpus into an AI-ready vector intelligence layer. From PDFs to production RAG in minutes.

MIT

Python

Updated 7 months ago

MiniLaw

00shu

❤️35

This project implements a RAG system that answers questions based on the provided training corpus. In this case, the training corpus consists of rules and regulations PDFs provided by the government.

Python

Updated 1 year ago

pdf-evaluation

mkrnr

❤️40

Contains code that we use to evaluate the quality of our PDF corpus. For example, we look at the number of scanned PDFs as well as PDFs that contain OCR text.

GPL-3.0

Jupyter Notebook

Updated 7 years ago

nlp_dashboard_app

NXTLupo

❤️30

The NLP Dashboard Application is a FastAPI-based tool designed to process and analyze a corpus of documents (PDFs, Word Docs, text, CSV, JSON files) via a simple web interface. The app performs semantic analysis, sentiment analysis, topic extraction, and summarization.

Python

Updated 8 months ago

OCR_Sanskrit_corpus

saurabhjondhale

❤️35

OCR for Sanskrit language extraction from PDFs using Corpus.

Python

Updated 8 months ago

project-manise

dansachs

❤️45

Turning static PDFs into a dynamic parallel corpus for Ambonese Malay.

Python

Updated 2 months ago

PDF-scanner-benchmark

HiraStanley

🧡55

Testing out parsing tools on a set corpus of PDFs and benchmarking results

Jupyter Notebook

Updated 1 week ago

word-histograms

e-centricity

❤️35

modelling-lite. Pluggable script for visualizing simple models on a corpus of PDFs or txt files

Updated 7 years ago

ragosophy

tanguyguyot

🧡50

Simple RAG implementation along with philosophy PDFs corpus (Marcus Aurelius) for a chatbot, in French language.

MIT

Python

Updated 1 month ago

doc_similarity_tool

abhirajjsingh

❤️30

prototype for finding similar document pairs in a corpus (PDFs) & returning similar ones on new uploads

Python

Updated 8 months ago

GenerateQA-lmqg

kcmclau21122

❤️45

PY that creates Q&A pairs from a corpus of PDFs using Question and Answer Generation with Language Models

Python

Updated 1 month ago

EduGuruGR

limitcracker

🧡55

Greek educational corpus for RAG/LLM with raw PDFs (FEK/circulars), normalized texts, markdown exports, chunks, FAQ and QA/eval datasets.

Python

Updated 1 week ago

librarian-of-alexandria

botwin-tokyo

❤️45

Your private, local-first research librarian. Upload PDFs (OCR support), search + chat with citations, and keep your corpus on your own hardware (Jetson-friendly).

NOASSERTION

TypeScript

Updated 2 months ago

edge-ai, embeddings, jetson, +7

eabcc-presentation

benjlis

❤️40

Slides and materials for the talk "Creating Email Archives from PDFs – The COVID-19 Corpus" delivered at the EABCC Email Archiving Symposium in June '23

CC0-1.0

Updated 2 years ago

rag_pipeline

JKDrewes

❤️40

A RAG pipeline using Ollama open source that allows users to create and query from a corpus of .pdfs, .txts, and other files. Fully private.

CC0-1.0

Python

Updated 5 months ago

omni-query

KeerthidharLoki

🧡55

Multimodal RAG system for long-document intelligence over the MMDocRAG benchmark corpus — 222 PDFs, 4,055 QA pairs, hybrid BM25+ANN retrieval, Gemini 2.5 Flash generation

Python

Updated 1 week ago

PreCog-submission

deutranium

❤️35

Analyzing a corpus of 11,000+ tweets with different user and tweet properties | Parsing PDFs with funky table formats to generate the required data || Link to analysis below

Python

Updated 4 years ago

ml-tool-software-requirements

StaHk-collab

❤️40

A tool that takes two corpora (PDFs) from different domains as input and analyzes domain- or context-specific ambiguities in natural language text using neural word embeddings.

MIT

Python

Updated 9 months ago

fasttext, flask-application, machine-learning, +3

Document-Intelligence-Platform

bantoinese83

❤️35

An AI-enabled document intelligence backend using FastAPI, OCR, embeddings, and vector search. This system can process PDFs and images, extract text via OCR, generate embeddings, and perform semantic search over the document corpus.

Python

Updated 3 months ago

multimodal-rag-pipeline

16AI20

❤️40

This project implements a locally-deployable Retrieval-Augmented Generation (RAG) system designed for processing multimodal content including HTML documents, PDFs, audio files, images, and structured data to provide accurate, contextual responses based on any document corpus

MIT

Python

Updated 6 months ago

corpusflower-rag-engine

moses-shenassa

❤️40

CorpusFlower is a concordance and retrieval engine that lets meaning bloom from your text corpus. A local RAG workflow for scholars, researchers, clergy, and analysts — built to illuminate hidden patterns, contexts, and themes inside large collections of PDFs.

MIT

Python

Updated 4 months ago

concordance, dark-academia, embeddings, +14

Storymaker-AI

florian-coder

❤️45

Genre-conditioned children’s story generator. Extracts stories from PDFs, discovers themes with TF-IDF + NMF, labels a corpus with genre tags, trains an LSTM language model, and generates new stories using Top-P sampling with keyword boosting.

Python

Updated 1 month ago

GitHub Explorer

Search Results

ACL-anthology-corpus

Deception-Detection

corpus_pdfs

literature-mapper

Scanned-PDFs-checker

adl-pdf-ingest-phd

corpus-from-pdfs

BMCphyloannotation

docuvec

MiniLaw

pdf-evaluation

nlp_dashboard_app

OCR_Sanskrit_corpus

project-manise

PDF-scanner-benchmark

word-histograms

ragosophy

doc_similarity_tool

GenerateQA-lmqg

EduGuruGR

librarian-of-alexandria

eabcc-presentation

rag_pipeline

omni-query

PreCog-submission

ml-tool-software-requirements

Document-Intelligence-Platform

multimodal-rag-pipeline

corpusflower-rag-engine

Storymaker-AI
