Found 1,466 repositories (showing 30)
huggingface
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
souzatharsis
An Open Source Python alternative to NotebookLM's podcast feature: Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI
Eventual-Inc
High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
EvolvingLMMs-Lab
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
QwenLM
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, and of performing real-time speech generation.
gpt-omni
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
InternLM
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
datachain-ai
Analytics, Versioning and ETL for multimodal data: video, audio, PDFs, images
hkchengrex
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Tencent-Hunyuan
HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation.
YingqingHe
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
coder-duibai
A comprehensive list of Awesome Contrastive Learning Papers & Codes. Research areas include, but are not limited to: CV, NLP, Audio, Video, Multimodal, Graph, Language, etc.
ViaAnthroposBenevolentia
Vanilla JS web interface for the Gemini 2.0 flash-exp Multimodal API with text, audio, camera, and screen inputs, audio responses, and function calling
ShmuelRonen
A ComfyUI custom node that integrates Google's Gemini Flash 2.0 Experimental model, enabling multimodal analysis of text, images, video frames, and audio directly within ComfyUI workflows.
david-yoon
TensorFlow implementation of "Multimodal Speech Emotion Recognition using Audio and Text," IEEE SLT-18
EvoLinkAI
Complete guide to Seedance 2.0 — multimodal AI video generation with image, video, audio & text input. Prompts, use cases & practical examples.
wassengerhq
Ready-to-use AI Multimodal ChatGPT-based WhatsApp chatbot assistant for your business. Now supports GPT-4o with text + audio + image input, audio responses, and improved RAG + MCP 🤩
itanishqshelar
SmartRAG is a privacy-first multimodal RAG system that lets you chat intelligently with your documents, images, and audio. Upload PDFs, Word files, or recordings and get accurate, context-aware answers, all processed locally on your device with no external APIs.
FreedomIntelligence
Towards Fine-grained Audio Captioning with Multimodal Contextual Cues
CPJKU
A Multimodal Audio Sheet Music Dataset
WenhaoYou1
Music Performance Audio-Visual Question Answering Requires Specialized Multimodal Designs
HarryHsing
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning [🔥The Exploration of R1 for General Audio-Visual Reasoning with Qwen2.5-Omni]
matthijsvk
Multimodal speech recognition using lipreading (with CNNs) and audio (using LSTMs). Sensor fusion is done with an attention network.
LUMIA-Group
Official PyTorch implementation of paper Leveraging Unimodal Self Supervised Learning for Multimodal Audio-Visual Speech Recognition (ACL 2022)
aistudynow
ComfyUI nodes for HunyuanVideo-Foley with low-VRAM support: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation.
Deveshsaipandian
An offline multimodal AI system that analyzes documents, images, and audio to support financial risk assessment, insurance claim verification, and relief fund allocation during disasters. Designed for low-connectivity environments with privacy-preserving, citation-backed insights.
klingfoley
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
aws-solutions-library-samples
This Guidance shows how Amazon Bedrock Data Automation streamlines the generation of valuable insights from unstructured multimodal content such as documents, images, audio, and videos through a unified multimodal inference API.
rikeilong
[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios