Found 13 repositories (showing 13)
anyantudre
Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
Ravi-Teja-konda
A VLM-driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for querying and analyzing video footage.
jacobmarks
Run SOTA Vision-Language Model Florence-2 on your data!
CharlesCNorton
ViLMA (Vision-Language Model Active Monitoring) - A real-time desktop monitoring tool leveraging Florence-2
PRITHIVSAKTHIUR
This application utilizes the powerful Florence-2 vision-language model from Microsoft to generate comprehensive captions for images. The model is capable of understanding visual content and expressing it in natural language.
The MultiModal-Vision-Language-Model-Training repository provides scripts for fine-tuning vision-language models (PaliGemma, BLIP-2, BLIP, SmolVLM, Qwen-VL, Florence-2) on the SkinCAP and ROCOv2 datasets for medical image captioning. Optimized with LoRA and 4-bit quantization, it includes efficient training and evaluation (loss, accuracy, ROUGE, BLEU).
SUP3RMASS1VE
Florence-2 is a large vision-language model capable of various image and text generation tasks, such as object detection, captioning, and grounding. This demo allows users to interact with these capabilities by uploading images and selecting from various tasks.
No description available
5hak1r
A multimodal image captioning and audio narration system using the Florence-2 Vision-Language Model.
sandrarairan
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. Model summary: this Hub repository contains a Hugging Face Transformers implementation of Microsoft's Florence-2 model. Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks.
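The prompt-based interface described in this summary can be sketched in a few lines. This is a minimal illustration, not code from the repository: the task tokens follow Microsoft's published Florence-2 model card, and `build_prompt` is a hypothetical helper showing how a task token (optionally followed by extra text, e.g. a grounding phrase) forms the model's text prompt. The model itself is not loaded here.

```python
# Sketch of Florence-2's prompt-based task selection.
# Task tokens are taken from Microsoft's Florence-2 model card;
# the mapping names below are illustrative.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "ocr": "<OCR>",
}

def build_prompt(task: str, text_input: str = "") -> str:
    """Compose the text prompt Florence-2 expects: a task token,
    optionally followed by additional input text (hypothetical helper)."""
    return TASK_PROMPTS[task] + text_input

print(build_prompt("caption"))  # → <CAPTION>
```

In the actual Transformers workflow, such a prompt string is passed together with an image to the model's processor, and the task token determines which capability (captioning, detection, OCR, …) the model exercises.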
asifa-nazir
A comparative analysis of Vision-Language Models (Florence-2 vs. BLIP-2) performing dense region captioning on isolated objects segmented by SAM3.
Abdeen-A-AI
This project implements an advanced generative AI pipeline for extracting and rating features from images. It combines the power of Florence-2, a state-of-the-art vision-language model, with a fine-tuned version of Mistral-v3, a cutting-edge large language model.
r-vage
Smart Model Loader for ComfyUI — for vision-language models, text LLMs, and WD14 taggers across 8 backends (Transformers, GGUF, vLLM, SGLang, Ollama, llama.cpp, YOLO, WD14). Supports QwenVL, Mistral3, Florence-2, LLaVA, YOLO with multi-task chaining, few-shot training, and auto-download. V3 API + Nodes 2.0 compatible. NVIDIA/AMD/ROCm.