Found 7,156 repositories (showing 30)
bytedance
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
deepset-ai
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.
jina-ai
☁️ Build multimodal AI applications with cloud-native stack
NVIDIA-NeMo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
duixcom
🚀 Truly open-source AI avatar (digital human) toolkit for offline video generation and digital human cloning.
pipecat-ai
Open Source framework for voice and multimodal conversational AI
lancedb
Developer-friendly OSS embedded retrieval library for multimodal AI. Search More; Manage Less.
gorse-io
AI-powered open-source recommender system engine that supports classical/LLM rankers and multimodal content via embeddings
activeloopai
Deeplake is an AI data runtime for agents. It provides serverless Postgres with a multimodal datalake, enabling scalable retrieval and training.
open-mmlab
OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models for text-to-image generation, image/video restoration/enhancement, etc.
lance-format
Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
facebookresearch
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
Eventual-Inc
High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale
NVlabs
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
OpenGVLab
InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. It now supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
SkyworkAI
Skywork-R1V is an advanced multimodal AI model series developed by Skywork AI, specializing in vision-language reasoning.
SamurAIGPT
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi.ai.
jonyzhang2023
A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches.
qingchencloud
🦞 OpenClaw visual management panel with a built-in AI assistant (tool calling + image recognition + multimodal + i18n (11)), one-click install
microsoft
[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents
pixeltable
Data Infrastructure providing a declarative, incremental approach for multimodal AI workloads.
OpenAdaptAI
Open Source Generative Process Automation (i.e. Generative RPA). AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models
fikrikarim
On-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine. Powered by Gemma 4 E2B and Kokoro.
unum-cloud
Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
dtsola
XiaoyaoSearch (小遥搜索): Understands your words, reads your images, finds any local file with AI. Making search as easy as chatting.
AIDC-AI
An Open-Source Multimodal AIGC Solution based on ComfyUI + MCP + LLM https://pixelle.ai
lancedb
Resources, examples & tutorials for multimodal AI, RAG and agents using vector search and LLMs
livekit
Build realtime multimodal AI agents with Node.js
waybarrios
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.
fnzhan
[TPAMI 2023] Multimodal Image Synthesis and Editing: The Generative AI Era