Found 1,466 repositories (showing 30)
huggingface
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
souzatharsis
An Open Source Python alternative to NotebookLM's podcast feature: Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI
Eventual-Inc
High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
EvolvingLMMs-Lab
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
QwenLM
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, and of performing real-time speech generation.
gpt-omni
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
InternLM
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
datachain-ai
Analytics, Versioning and ETL for multimodal data: video, audio, PDFs, images
hkchengrex
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Tencent-Hunyuan
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation.
YingqingHe
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
coder-duibai
A comprehensive list of Awesome Contrastive Learning Papers & Codes. Research areas include, but are not limited to: CV, NLP, Audio, Video, Multimodal, Graph, Language, etc.
ViaAnthroposBenevolentia
Vanilla JS web interface for the Gemini 2.0 flash-exp Multimodal API with text, audio, camera, and screen inputs, audio responses, and function calling
ShmuelRonen
A ComfyUI custom node that integrates Google's Gemini Flash 2.0 Experimental model, enabling multimodal analysis of text, images, video frames, and audio directly within ComfyUI workflows.
david-yoon
TensorFlow implementation of "Multimodal Speech Emotion Recognition using Audio and Text," IEEE SLT-18
EvoLinkAI
Complete guide to Seedance 2.0 — multimodal AI video generation with image, video, audio & text input. Prompts, use cases & practical examples.
wassengerhq
Ready-to-use AI Multimodal ChatGPT-based WhatsApp chatbot assistant for your business. Now supports GPT-4o with text + audio + image input, audio responses, and improved RAG + MCP 🤩
itanishqshelar
SmartRAG is a privacy-first multimodal RAG system that lets you chat intelligently with your documents, images, and audio. Upload PDFs, Word files, or recordings and get accurate, context-aware answers, all processed locally on your device with no external APIs.
FreedomIntelligence
Towards Fine-grained Audio Captioning with Multimodal Contextual Cues
CPJKU
A Multimodal Audio Sheet Music Dataset
WenhaoYou1
Music Performance Audio-Visual Question Answering Requires Specialized Multimodal Designs
HarryHsing
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥The Exploration of R1 for General Audio-Visual Reasoning with Qwen2.5-Omni]
matthijsvk
Multimodal speech recognition using lipreading (with CNNs) and audio (using LSTMs). Sensor fusion is done with an attention network.
LUMIA-Group
Official PyTorch implementation of paper Leveraging Unimodal Self Supervised Learning for Multimodal Audio-Visual Speech Recognition (ACL 2022)
aistudynow
ComfyUI nodes for HunyuanVideo-Foley with low-VRAM support: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation.
Deveshsaipandian
An offline multimodal AI system that analyzes documents, images, and audio to support financial risk assessment, insurance claim verification, and relief fund allocation during disasters. Designed for low-connectivity environments with privacy-preserving, citation-backed insights.
klingfoley
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
aws-solutions-library-samples
This Guidance shows how Amazon Bedrock Data Automation streamlines the generation of valuable insights from unstructured multimodal content such as documents, images, audio, and videos through a unified multimodal inference API.
rikeilong
[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios