Found 175 repositories (showing 30)
brandonstarxel
This package, developed as part of our research detailed in the Chroma Technical Report, provides tools for text chunking and evaluation. It allows users to compare different chunking methods and includes implementations of several novel chunking strategies.
sighsmile
conlleval in Python (script for chunking/NER evaluation)
mburaksayici
RAG boilerplate with semantic/propositional chunking, hybrid search (BM25 + dense), LLM reranking, query enhancement agents, CrewAI orchestration, Qdrant vector search, Redis/Mongo sessioning, Celery ingestion pipeline, Gradio UI, and an evaluation suite (Hit-Rate, MRR, hybrid configs).
bobmatnyc
AI-powered code review CLI with multiple providers (Gemini, Claude, OpenAI). Features 95%+ token reduction via semantic chunking, 7 review types (security/performance/evaluation), multi-language support, interactive fixes, and developer skill assessment.
This repository contains code implementing a RAG approach over company policy data, along with an evaluation of the RAG solution and smart chunking techniques.
This repository contains the results of automatic glossary term extraction and clustering from the CrowdRE requirements dataset, focusing on two qualitative attributes: feature and benefit. Each entry in the original CrowdRE dataset has six attributes (role, feature, benefit, domain, tags, and date-time of creation); since we are interested in extracting domain-specific terms, we use only the feature and benefit attributes. The reduced dataset used in our experiments, containing only these two attributes, is provided in the file "CrowdRE Requirements Dataset.csv". The original CrowdRE dataset was developed by P. K. Murukannaiah et al. and can be accessed as "The smarthome crowd requirements dataset", https://crowdre.github.io/murukannaiah-smarthome-requirements-dataset/, April 2017. We computed and report a ground truth set for a random subset of 100 requirement specifications, manually identifying 120 ground truth glossary terms grouped into 30 overlapping clusters. Because no benchmark or gold standard exists for ground truth extraction and clustering on the CrowdRE dataset, the terms were formulated from the best judgment of the project members in an unbiased manner. The file "Ground Truth Clusters.docx" lists the ground truth glossary terms together with the manually formulated, semantically similar clusters (clusters are separated by the ###### symbol). The 120 manually identified terms also appear in the third column of the file "Extracted Glossary Terms (With and Without WordNet Removal) and Ground Truth Glossary Terms.csv".
Using a mature text chunking approach, we extracted 143 and 292 glossary terms from the CrowdRE dataset with and without removing words listed in the WordNet lexical database (https://wordnet.princeton.edu/), respectively; the results appear in the first and second columns of "Extracted Glossary Terms (With and Without WordNet Removal) and Ground Truth Glossary Terms.csv". The extracted glossary terms were embedded using two resources: a domain-specific corpus most closely related to the CrowdRE dataset (the Wikipedia Home Automation category to a maximum depth of two, "https://en.wikipedia.org/wiki/Category:Home_automation"), and pre-trained FastText word vectors trained with subword information on the UMBC webbase corpus, the statmt.org news dataset, and Wikipedia 2017 (T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, "Advances in Pre-Training Distributed Word Representations"; https://fasttext.cc/docs/en/english-vectors.html). The purpose of this training is to derive clusters by forming a similarity matrix over the extracted glossary terms: each entry holds the semantic similarity score (cosine similarity) between the FastText word vectors of a pair of terms. Two clustering algorithms, K-Means and EM, were then applied. The automatically formulated clusters for the random subset of 100 requirement specifications with ground truth are shown in "Automated Ideal (Ground Truth) Clusters.docx" and "Automated Extraction and Clustering.docx", respectively. Note: there exist at most n/2 clusters for n glossary terms. To evaluate the efficacy of the clustering algorithms, we use commonly used performance metrics (precision, recall, and F-score).
Evaluation graphs plotting the area under the curve (AUC) and the normalized AUC scores for all clustering algorithms, trained on the two datasets, are shown in two separate files, "Cluster Plots.docx" and "Extraction +Clustering Plots.docx", respectively.
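The pipeline described above (embed each glossary term, build a cosine-similarity matrix, then cluster it with K-Means and EM) can be sketched as follows. This is a minimal illustration, not the repository's code: the term list is hypothetical, random 300-dimensional vectors stand in for the FastText embeddings, and the n/2 cluster bound follows the note above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical glossary terms; the real pipeline uses terms extracted
# from the CrowdRE feature/benefit attributes.
terms = ["thermostat", "temperature sensor", "door lock",
         "smart lock", "light dimmer", "motion detector"]

# Stand-in for FastText embeddings (the fasttext.cc English vectors are
# 300-dimensional); random vectors keep this sketch self-contained.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(terms), 300))

# Cosine-similarity matrix over the term vectors.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim_matrix = unit @ unit.T  # sim_matrix[i, j] in [-1, 1]

# Per the note above: at most n/2 clusters for n glossary terms.
n_clusters = len(terms) // 2

# K-Means on the similarity rows; EM clustering via a Gaussian mixture.
kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10,
                       random_state=0).fit_predict(sim_matrix)
em_labels = GaussianMixture(n_components=n_clusters,
                            random_state=0).fit_predict(sim_matrix)

print(dict(zip(terms, kmeans_labels)))
```

With real FastText vectors, semantically close terms (e.g. "door lock" and "smart lock") would get high cosine similarity and tend to land in the same cluster; the resulting labels can then be scored against the manual ground truth clusters with precision, recall, and F-score.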
AgBigdataLab
Chunk-Factory is a fast, efficient text chunking library with real-time evaluation.
stranger00135
Automatically discover the best RAGFlow chunking parameters for each document type. Two-phase optimization with LLM-based evaluation.
Leo310
Assess the effectiveness of chunking strategies in RAG systems via a custom evaluation framework.
ahmeabd
AutoChunker Paper Implementation: Structured Text Chunking and its Evaluation
sjordan1975
Systematic evaluation of RAG configurations for financial document search. Tests 66 combinations of chunking, embedding, and retrieval strategies against a 160-page annual report. Winner: sentence-based chunking with no overlap (MRR 0.83)
Lightweight RAG system using MiniLM embeddings and GPT-2/Ollama for Wikipedia question answering. Uses the rag-datasets/rag-mini-wikipedia dataset for training and evaluation, with document chunking, ChromaDB-based semantic search, and context-aware answer generation evaluated via EM and F1 metrics.
satish860
Evaluation Pipeline for Semantic Chunking
DocSlicer
RAG chunking benchmark suite - evaluation code for docslicer.ai
johntmunger
Retrieval-grounded LLM architecture using semantic chunking, vector-backed search, citation mapping, and evaluation-driven refinement.
dubistdu
RAG retrieval evaluation: boundary-aware chunking, synthetic question generation, vector retrieval metrics (Recall@K, MRR), and sanity checks.
denys-yu
Research codebase for studying chunking strategies in Retrieval-Augmented Generation (RAG), with reproducible experiments, indexing methods, and QA-based evaluation.
Jackie7ii
Standalone implementation of ACT (Action Chunking with Transformers) on LIBERO simulation benchmarks. Supports training, in-training rollout eval, and full evaluation with video saving.
usb1998
Retrieval-Augmented Generation pipeline for legal document question answering using MPNet embeddings, cross-encoder reranking, optimized chunking, extractive answering, and detailed ROUGE/BLEU evaluation.
dannyblaker
Complete guide to document chunking from basics to production. Includes 7 chunking strategies (character, word, sentence, token-based, recursive, semantic), RAG implementation, evaluation metrics, and comprehensive documentation. Perfect for NLP, LLM applications, and RAG systems. Learn with working code examples and best practices.
moelsaka01
End-to-end Retrieval Augmented Generation (RAG) platform with document ingestion, chunking, embedding-based retrieval, FastAPI backend, themed UI, index metadata endpoint, evaluation metrics, and Docker deployment.
vanshksingh
Advanced Retrieval-Augmented Generation (RAG) techniques with modular implementations of hierarchical indexing, adaptive retrieval, semantic chunking, and explainable retrieval. Includes evaluation scripts and sample datasets for benchmarking.
vidhij23
This repository provides a comprehensive suite for an agentic framework with Retrieval-Augmented Generation (RAG), document processing, and evaluation, with a focus on maternal health. It includes modular RAG pipelines, document chunking, vector store management, evaluation scripts, and a rich set of Jupyter notebooks for experimentation and analysis.
gitanjaligilhotra1-lab
Beginner-to-advanced Generative AI knowledge base covering AI/ML fundamentals, LLMs, prompting, embeddings & vector databases, RAG (chunking, retrieval), agents (A2A), MCP, LangChain/LlamaIndex, fine-tuning (LoRA), evaluation, optimization, and enterprise GenAI systems.
Ravisir21
Prototype and evaluation of a RAG Q&A system using Ambedkar corpus. Built with LangChain, ChromaDB, HuggingFace embeddings, and Ollama LLM. Includes retrieval, semantic, and answer quality metrics with chunking analysis for performance optimization.
AI-Solutions-KK
Model-agnostic NLP pipeline that converts PDFs into clean, chunked, training-ready datasets for BERT, LoRA, QLoRA, and semantic pair training. Includes OCR fallback, noise cleaning, chunking, multi-format dataset export, and automatic evaluation reports.
FelipeRochaMartins
Soulsborne RAG is an end‑to‑end Retrieval‑Augmented Generation system for Soulsborne games, showcasing modern RAG practices (scraping, LLM‑based chunking/refinement, vector search, contextualization, query expansion, reranking, and evaluation) with local/remote models.
achrafjarrou
Production-ready RAG system for LVMH 2023 financial analysis. FastAPI + ChromaDB + Groq LLM. Full pipeline: intelligent chunking, vector search, re-ranking, caching, metrics. 85% accuracy, 234 ms latency. Automated tests, Docker, golden-dataset evaluation. Python 3.11 | LangChain | MLOps
sankarbaseone
A modular Retrieval-Augmented Generation (RAG) engine for building enterprise AI assistants. Supports document ingestion, chunking, embeddings, vector search, and LLM-based answer generation. Includes evaluation tools and an extensible architecture for chatbots, knowledge bases, and AI copilots.