Found 76 repositories (showing 30)
njucckevin
Code for Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
enrico310786
Experiments with the Microsoft Phi-3 vision-language model: image captioning, OCR, data extraction
CongcongWen1208
RS-MoE: A Vision–Language Model With Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering (IEEE TGRS 2025)
Open-Model-Initiative
graphcap is an application that leverages directed acyclic graphs (DAGs) along with vision and language models to generate structured image captions and scene graphs from multimodal data.
This project demonstrates an image caption generator built using Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks with PyTorch. The model extracts features from images using CNNs and generates descriptive captions using LSTMs, showcasing the integration of computer vision and natural language processing.
Vamsi404
This repository contains a deep learning-based model for generating captions for images using a combination of computer vision (ResNet50) and natural language processing (LSTM). The model is trained on the Flickr8k dataset, with GloVe embeddings for word representations. It includes data preprocessing, model training, and inference scripts.
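At inference time, the CNN-encoder / LSTM-decoder pattern described above reduces to a greedy decoding loop: feed the image features and the previous token into the decoder, pick the most likely next word, and stop at an end token. A minimal sketch of that loop, using a stand-in `fake_step` with random logits in place of a trained LSTM (the tiny `VOCAB` and all function names here are hypothetical, not from any of the listed repositories):

```python
import numpy as np

VOCAB = ["<start>", "<end>", "a", "dog", "runs"]

def fake_step(features, token_id, rng):
    """Stand-in for one LSTM decoder step: returns logits over the vocabulary."""
    return rng.normal(size=len(VOCAB))

def greedy_decode(features, max_len=10, seed=0):
    """Greedy caption decoding: repeatedly take the argmax token until <end>."""
    rng = np.random.default_rng(seed)
    token = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        logits = fake_step(features, token, rng)
        token = int(np.argmax(logits))   # greedy: pick the most likely word
        if VOCAB[token] == "<end>":
            break
        caption.append(VOCAB[token])
    return " ".join(caption)

print(greedy_decode(None))
```

Real systems replace `fake_step` with the trained decoder (and often beam search instead of argmax), but the control flow is the same.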
In the current scenario of 2020, where quarantine is the buzzword and working from home has become the norm, internet usage is increasing. At the touch of a screen we can order groceries, medicines, and more. However, not everyone is fortunate enough to use it seamlessly; for example, people with impaired vision might find it cumbersome and frustrating to distinguish between blueberries and grapes. This project aims to create a neural network model that can help such users. The complexity and novelty in creating such a model is that it should not simply detect an object but also give useful and accurate information about it. Hence, this project proposes an ‘Image caption generator (using deep learning)’ that processes an image and describes it in a short sentence in a natural language such as English. The model is an amalgamation of two types of neural networks: a CNN (Convolutional Neural Network) for image processing and an LSTM (Long Short-Term Memory), a type of recurrent neural network, for text processing. A subset of 14,000 images, along with their sample captions, has been selected from the Flickr_30K dataset. The generated captions are evaluated using human judgement as well as the BLEU-1 score. Furthermore, the model has been trained and tested with several variations, such as the incorporation of pre-trained GloVe embeddings, different dropout and regularizer rates, and two feature-extraction models for images: Xception and VGG16. The most relevant and fitting captions were obtained using features from the Xception model with an encoder-decoder architecture. The highest BLEU-1 scores (above 0.5 on a scale of 0 to 1) were obtained with the VGG16 model using GloVe embeddings.
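BLEU-1, the metric named above, is just clipped unigram precision multiplied by a brevity penalty. A minimal self-contained sketch for a single candidate/reference pair (real evaluations typically use a library implementation and multiple references; the function name here is our own):

```python
from collections import Counter
import math

def bleu_1(candidate, reference):
    """BLEU-1: clipped unigram precision times the brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    ref_counts = Counter(ref)
    # clip each candidate word's count by its count in the reference
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = overlap / len(cand) if cand else 0.0
    # brevity penalty punishes captions shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

print(bleu_1("a dog runs on the grass", "a dog is running on the grass"))
```

A perfect match scores 1.0; the example above scores roughly 0.71, which is the same "above 0.5 on a scale of 0 to 1" range the project reports.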
Zanshinmu
Automatically caption images using various LLaVA multimodal models. This tool processes images with state-of-the-art vision language models to generate accurate, high-quality captions.
The MultiModal-Vision-Language-Model-Training repository provides scripts for fine-tuning vision-language models (PaliGemma, BLIP-2, BLIP, SmolVLM, Qwen-VL, Florence-2) on the SkinCAP and ROCOv2 datasets for medical image captioning. Optimized with LoRA and 4-bit quantization, it includes efficient training and evaluation (loss, accuracy, ROUGE, BLEU).
cristianLucianPopescu
Image captioning combines computer vision techniques, such as image recognition and feature extraction, with natural language processing techniques, such as language modeling and sequence generation. The model in the notebook learns to associate the visual features of the image with the textual descriptions provided.
LaloVene
App that helps people with limited vision see the world by using an Image Captioning DL model, Ionic Angular, translation API and a Text2Speech library. Works in 11 languages.
ai-agents-cybersecurity
This project captions every image (PNG, JPG, JPEG) inside the `input/` directory using a local MLX vision-language model. It ships with a CLI entry point in `src/main.py` that targets [lmstudio-community/Qwen3-VL-30B-A3B-Instruct-MLX-8bit](https://huggingface.co/lmstudio-community/Qwen3-VL-30B-A3B-Instruct-MLX-8bit) (or BF16).
An image captioning system that combines CLIP with GPT-2.
No description available
masoumehkhaleghian
Hands-on image captioning experiments with the Qwen 2.5-VL vision-language model, showing how to load the model, send images in a chat-style prompt, and generate natural-language descriptions for pictures
HimadeepRagiri
🖼️ Image Captioning using CNN-LSTM with ResNet18 and GloVe on the Flickr8k dataset. This deep learning model extracts visual features and generates natural language captions, showcasing end-to-end integration of computer vision and NLP.
rushiparhad
This repository implements an end-to-end image captioning system using a Transformer-based Vision–Language architecture. The project focuses on building a custom self-tuned ViT-GPT2 model that generates meaningful natural-language descriptions for images with limited training data.
Aravind-3110
Developed an image captioning system using Xception CNN for feature extraction and LSTM/transformer models with attention for generating accurate captions. Explored CLIP and BLIP for enhanced vision-language understanding. Tech stack: Python, TensorFlow/Keras, PyTorch, OpenCV
Ashok7890-reddy
This is a Caption generating AI project that transforms basic image captioning into a comprehensive, production-ready application. It demonstrates end-to-end computer vision and NLP capabilities using the BLIP (Bootstrapped Language-Image Pre-training) model with advanced features and professional implementation.
Henildiyora
A modular AI toolkit integrating state-of-the-art vision and language models (BLIP, DETR, ViLT) for real-time image captioning, object detection, and visual Q&A. Built with Streamlit and Hugging Face.
SUP3RMASS1VE
Florence-2 is a large vision-language model capable of various image and text generation tasks, such as object detection, captioning, and grounding. This demo allows users to interact with these capabilities by uploading images and selecting from various tasks.
This project implements an Image Captioning Generation System using the Flickr8K dataset. The model leverages a CNN (ResNet-50) for image encoding and an LSTM network (with attention) for decoding into natural-language descriptions. The project focuses on learning to build multi-modal deep-learning systems that combine vision and language.
srasal445
The AI Image Caption Generator is an end-to-end application that automatically generates natural-language captions for images using deep learning and computer vision. It combines a visual feature extractor (CNN) with a language model (RNN/Transformer) to produce context-aware, human-like descriptions useful for accessibility and content indexing.
Built an AI pipeline that matches images with relevant captions using OpenAI's CLIP model. Images and candidate texts are embedded into a shared vision-language space, compared with cosine similarity, and ranked to return the top matches. Features include caption generation and retrieval, with applications in tagging, recommendation, and search.
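The embed-compare-rank step this pipeline describes is straightforward once the embeddings exist. A minimal sketch with NumPy, using random vectors as stand-ins for real CLIP image/text embeddings (the function name, embedding dimension, and data here are all placeholders, not the repository's code):

```python
import numpy as np

def rank_captions(image_emb, text_embs, top_k=3):
    """Rank candidate captions by cosine similarity to one image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                    # cosine similarity per caption
    order = np.argsort(-sims)[:top_k]   # best matches first
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)        # stand-in for a CLIP image embedding
text_embs = rng.normal(size=(5, 512))   # stand-ins for 5 caption embeddings
print(rank_captions(image_emb, text_embs))
```

With real CLIP embeddings the same ranking drives tagging, recommendation, and search: only the source of `image_emb` and `text_embs` changes.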
This AI pipeline turns images into spoken stories. It uses image captioning to generate text, fine-tunes a language model (with LoRA) to create a story, and converts it to speech. The result is an audio file narrating the generated story, combining computer vision, NLP, and speech synthesis.
PRITHIVSAKTHIUR
Models that understand physical common sense and generate appropriate embodied decisions; optimized for document-level optical character recognition and long-context vision-language understanding. Built with a hand-curated dataset for text-to-image models, providing significantly more detailed descriptions or captions of given images.
NDDimension
A Deep-Learning based web app that generates image captions using a pre-trained CNN-LSTM model. Upload your own image or use sample ones to see AI describe them in natural language. Built with TensorFlow, trained on Flickr8k, and combines computer vision with NLP.
raviy0807
The goal of this course is to give learners a basic understanding of modern neural networks and their applications in computer vision and natural language understanding. The course starts with a recap of linear models and a discussion of the stochastic optimization methods that are crucial for training deep neural networks. Learners will study all the popular building blocks of neural networks, including fully connected, convolutional, and recurrent layers, and will use these building blocks to define complex modern architectures in the TensorFlow and Keras frameworks. In the course project, learners will implement a deep neural network for the task of image captioning, which solves the problem of giving a text description for an input image. The prerequisites for this course are: 1) basic knowledge of Python; 2) basic linear algebra and probability.
Our project builds an AI pipeline for image classification and captioning to help e-commerce, media, and assistive tech scale visual content understanding. By comparing supervised models with vision-language models like BLIP2, we improve metadata, accessibility, and user engagement without heavy labeling.
Our approach combines the state-of-the-art performance of transformers in both computer vision and natural language processing domains to create descriptive and contextually relevant captions for a wide range of images.