Found 76 repositories (showing 30)
njucckevin
Code for Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
enrico310786
Experiments with the Microsoft Phi-3 vision-language model: image captioning, OCR, data extraction
CongcongWen1208
RS-MoE: A Vision–Language Model With Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering (IEEE TGRS 2025)
Open-Model-Initiative
graphcap is an application that leverages directed acyclic graphs (DAGs) along with vision and language models to generate structured image captions and scene graphs from multimodal data.
This project demonstrates an image caption generator built using Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks with PyTorch. The model extracts features from images using CNNs and generates descriptive captions using LSTMs, showcasing the integration of computer vision and natural language processing.
Vamsi404
This repository contains a deep learning-based model for generating captions for images using a combination of computer vision (ResNet50) and natural language processing (LSTM). The model is trained on the Flickr8k dataset, with GloVe embeddings for word representations. It includes data preprocessing, model training, and inference scripts.
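At inference time, the CNN-encoder / LSTM-decoder pattern described above reduces to a greedy decoding loop: feed the image features and the previous token into the decoder, pick the most likely next word, and stop at an end token. A minimal sketch of that loop, using a stand-in `fake_step` with random logits in place of a trained LSTM (the tiny `VOCAB` and all function names here are hypothetical, not from any of the listed repositories):

```python
import numpy as np

VOCAB = ["<start>", "<end>", "a", "dog", "runs"]

def fake_step(features, token_id, rng):
    """Stand-in for one LSTM decoder step: returns logits over the vocabulary."""
    return rng.normal(size=len(VOCAB))

def greedy_decode(features, max_len=10, seed=0):
    """Greedy caption decoding: repeatedly take the argmax token until <end>."""
    rng = np.random.default_rng(seed)
    token = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        logits = fake_step(features, token, rng)
        token = int(np.argmax(logits))   # greedy: pick the most likely word
        if VOCAB[token] == "<end>":
            break
        caption.append(VOCAB[token])
    return " ".join(caption)

print(greedy_decode(None))
```

Real systems replace `fake_step` with the trained decoder (and often beam search instead of argmax), but the control flow is the same.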
In the current scenario of 2020, where quarantine is the buzzword and working from home has become the norm, internet usage is increasing. At the touch of a screen we can order groceries, medicines, and more. However, not everyone is fortunate enough to use it seamlessly; for example, people with impaired vision might find it cumbersome and frustrating to distinguish between blueberries and grapes. This project aims to create a neural network model that can help such users. The complexity and novelty in creating such a model is that it should not simply detect an object but also give useful and accurate information about it. Hence, this project proposes an ‘Image caption generator (using deep learning)’ that processes an image and describes it in a short sentence in a natural language such as English. The model is an amalgamation of two types of neural networks: a CNN (Convolutional Neural Network) for image processing and an LSTM (Long Short-Term Memory), a type of recurrent neural network, for text processing. A subset of 14,000 images, along with their sample captions, has been selected from the Flickr_30K dataset. The generated captions are evaluated using human judgement as well as the BLEU-1 score. Furthermore, the model has been trained and tested with several variations, such as the incorporation of pre-trained GloVe embeddings, different dropout and regularizer rates, and two feature-extraction models for images: Xception and VGG16. The most relevant and fitting captions were obtained using features from the Xception model with an encoder-decoder architecture. The highest BLEU-1 scores (above 0.5 on a scale of 0 to 1) were obtained with the VGG16 model using GloVe embeddings.
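BLEU-1, the metric named above, is just clipped unigram precision multiplied by a brevity penalty. A minimal self-contained sketch for a single candidate/reference pair (real evaluations typically use a library implementation and multiple references; the function name here is our own):

```python
from collections import Counter
import math

def bleu_1(candidate, reference):
    """BLEU-1: clipped unigram precision times the brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    ref_counts = Counter(ref)
    # clip each candidate word's count by its count in the reference
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = overlap / len(cand) if cand else 0.0
    # brevity penalty punishes captions shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

print(bleu_1("a dog runs on the grass", "a dog is running on the grass"))
```

A perfect match scores 1.0; the example above scores roughly 0.71, which is the same "above 0.5 on a scale of 0 to 1" range the project reports.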
Zanshinmu
Automatically caption images using various LLaVA multimodal models. This tool processes images with state-of-the-art vision language models to generate accurate, high-quality captions.
The MultiModal-Vision-Language-Model-Training repository provides scripts for fine-tuning vision-language models (PaliGemma, BLIP-2, BLIP, SmolVLM, Qwen-VL, Florence-2) on the SkinCAP and ROCOv2 datasets for medical image captioning. Optimized with LoRA and 4-bit quantization, it includes efficient training and evaluation (loss, accuracy, ROUGE, BLEU).
cristianLucianPopescu
Image captioning combines computer vision techniques, such as image recognition and feature extraction, with natural language processing techniques, such as language modeling and sequence generation. The model in the notebook learns to associate the visual features of the image with the textual descriptions provided.
LaloVene
App that helps people with limited vision see the world by using an Image Captioning DL model, Ionic Angular, translation API and a Text2Speech library. Works in 11 languages.
ai-agents-cybersecurity
This project captions every image (PNG, JPG, JPEG) inside the `input/` directory using a local MLX vision-language model. It ships with a CLI entry point in `src/main.py` that targets [lmstudio-community/Qwen3-VL-30B-A3B-Instruct-MLX-8bit](https://huggingface.co/lmstudio-community/Qwen3-VL-30B-A3B-Instruct-MLX-8bit) (or BF16).
An image captioning system that combines CLIP with GPT-2.
No description available
masoumehkhaleghian
Hands-on image captioning experiments with the Qwen 2.5-VL vision-language model, showing how to load the model, send images in a chat-style prompt, and generate natural-language descriptions for pictures
HimadeepRagiri
🖼️ Image Captioning using CNN-LSTM with ResNet18 and GloVe on the Flickr8k dataset. This deep learning model extracts visual features and generates natural language captions, showcasing end-to-end integration of computer vision and NLP.
rushiparhad
This repository implements an end-to-end image captioning system using a Transformer-based Vision–Language architecture. The project focuses on building a custom self-tuned ViT-GPT2 model that generates meaningful natural-language descriptions for images with limited training data.
Aravind-3110
Developed an image captioning system using Xception CNN for feature extraction and LSTM/transformer models with attention for generating accurate captions. Explored CLIP and BLIP for enhanced vision-language understanding. Tech stack: Python, TensorFlow/Keras, PyTorch, OpenCV
Ashok7890-reddy
This is a Caption generating AI project that transforms basic image captioning into a comprehensive, production-ready application. It demonstrates end-to-end computer vision and NLP capabilities using the BLIP (Bootstrapped Language-Image Pre-training) model with advanced features and professional implementation.
Henildiyora
A modular AI toolkit integrating state-of-the-art vision and language models (BLIP, DETR, ViLT) for real-time image captioning, object detection, and visual Q&A. Built with Streamlit and Hugging Face.
SUP3RMASS1VE
Florence-2 is a large vision-language model capable of various image and text generation tasks, such as object detection, captioning, and grounding. This demo allows users to interact with these capabilities by uploading images and selecting from various tasks.
This project implements an Image Captioning Generation System using the Flickr8K dataset. The model leverages a CNN (ResNet-50) for image encoding and an LSTM network (with attention) for decoding into natural-language descriptions. The project focuses on learning to build multi-modal deep-learning systems that combine vision and language.
srasal445
The AI Image Caption Generator is an end-to-end application that automatically generates natural-language captions for images using deep learning and computer vision. It combines a visual feature extractor (CNN) with a language model (RNN/Transformer) to produce context-aware, human-like descriptions useful for accessibility and content indexing.
Built an AI pipeline that matches images with relevant captions using OpenAI's CLIP model. Images and candidate texts are embedded into a shared vision-language space, compared with cosine similarity, and ranked to return the top matches. Features include caption generation and retrieval, with applications in tagging, recommendation, and search.
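The embed-compare-rank step this pipeline describes is straightforward once the embeddings exist. A minimal sketch with NumPy, using random vectors as stand-ins for real CLIP image/text embeddings (the function name, embedding dimension, and data here are all placeholders, not the repository's code):

```python
import numpy as np

def rank_captions(image_emb, text_embs, top_k=3):
    """Rank candidate captions by cosine similarity to one image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                    # cosine similarity per caption
    order = np.argsort(-sims)[:top_k]   # best matches first
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)        # stand-in for a CLIP image embedding
text_embs = rng.normal(size=(5, 512))   # stand-ins for 5 caption embeddings
print(rank_captions(image_emb, text_embs))
```

With real CLIP embeddings the same ranking drives tagging, recommendation, and search: only the source of `image_emb` and `text_embs` changes.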
This AI pipeline turns images into spoken stories. It uses image captioning to generate text, fine-tunes a language model (with LoRA) to create a story, and converts it to speech. The result is an audio file narrating the generated story, combining computer vision, NLP, and speech synthesis.
PRITHIVSAKTHIUR
Models that understand physical common sense and generate appropriate embodied decisions; optimized for document-level optical character recognition and long-context vision-language understanding. Built with a hand-curated dataset for text-to-image models, providing significantly more detailed descriptions or captions of given images.
NDDimension
A Deep-Learning based web app that generates image captions using a pre-trained CNN-LSTM model. Upload your own image or use sample ones to see AI describe them in natural language. Built with TensorFlow, trained on Flickr8k, and combines computer vision with NLP.
raviy0807
The goal of this course is to give learners a basic understanding of modern neural networks and their applications in computer vision and natural language understanding. The course starts with a recap of linear models and a discussion of the stochastic optimization methods that are crucial for training deep neural networks. Learners will study all the popular building blocks of neural networks, including fully connected, convolutional, and recurrent layers, and will use these building blocks to define complex modern architectures in the TensorFlow and Keras frameworks. In the course project, learners will implement a deep neural network for the task of image captioning, which solves the problem of giving a text description for an input image. The prerequisites for this course are: 1) basic knowledge of Python; 2) basic linear algebra and probability.
Our project builds an AI pipeline for image classification and captioning to help e-commerce, media, and assistive tech scale visual content understanding. By comparing supervised models with vision-language models like BLIP2, we improve metadata, accessibility, and user engagement without heavy labeling.
Our approach combines the state-of-the-art performance of transformers in both computer vision and natural language processing domains to create descriptive and contextually relevant captions for a wide range of images.