Found 1,636 repositories (showing 30)
microsoft
A simple screen parsing tool towards pure vision based GUI agent
GetStream
Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.
magnitudedev
Open-source, vision-first browser agent
crmne
One beautiful Ruby API for OpenAI, Anthropic, Gemini, Bedrock, Azure, OpenRouter, DeepSeek, Ollama, VertexAI, Perplexity, Mistral, xAI, GPUStack & OpenAI compatible APIs. Agents, Chat, Vision, Audio, PDF, Images, Embeddings, Tools, Streaming & Rails integration.
Intent-Lab
Real-time AI assistant for Meta Ray-Ban smart glasses -- voice + vision + agentic actions via Gemini Live and OpenClaw
showlab
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
reworkd
Vision utilities for web interaction agents 👀
szczyglis-dev
Desktop AI Assistant powered by GPT-5, GPT-4, o1, o3, Gemini, Claude, Ollama, DeepSeek, Perplexity, Grok, Bielik, chat, vision, voice, RAG, image and video generation, agents, tools, MCP, plugins, speech synthesis and recognition, web search, memory, presets, assistants, and more. Linux, Windows, Mac
agent-network-protocol
AgentNetworkProtocol(ANP) is an open source protocol for agent communication. Our vision is to define how agents connect with each other, building an open, secure, and efficient collaboration network for billions of intelligent agents.
MoonshotAI
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
Fugtemypt123
VIGA: Vision-as-Inverse-Graphics Agent
taco-group
[NeurIPS 2025] 4KAgent: Agentic Any Image to 4K Super-Resolution. An intelligent computer vision agent that can magically restore any image to perfect-4K!
google-research
Train robotic agents to learn pick and place with deep learning for vision-based manipulation in PyBullet. Transporter Nets, CoRL 2020.
askui
Enable AI to control your desktop, mobile and HMI devices
PV-Bhat
Vibe Check is a tool that provides mentor-like feedback to AI Agents, preventing tunnel-vision, over-engineering and reasoning lock-in for complex and long-horizon agent workflows. KISS your over-eager AI Agents goodbye! Effective for: Coding, Ambiguous Tasks, High-Risk tasks
qwwzdyj
An AI agent that automates the creation of CVPR/NeurIPS standard academic diagrams. Implements a strict "Logic (Architect) -> Vision (Renderer)" workflow to transform paper abstracts into high-fidelity scientific illustrations.
RL4VLM
Official Repo for Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
xlang-ai
[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
epiral
AI Agent as a Pinix Clip — agentic loop with memory, tools, and vision
dusty-nv
Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
graniet
A powerful Rust library and CLI tool to unify and orchestrate multiple LLM, Agent and voice backends (OpenAI, Claude, Gemini, Ollama, ElevenLabs...) with a single, extensible API. Build, chain, evaluate, and serve complex multi-step AI workflows — including speech-to-text, text-to-speech, completions, vision, and reasoning.
This AI Smart Speaker uses speech recognition, TTS (text-to-speech), and STT (speech-to-text) to enable voice and vision-driven conversations, with additional web search capabilities via OpenAI and Langchain agents.
agent-network-protocol
Our vision is to provide communication capabilities for intelligent agents, allowing them to connect with each other to form a collaborative network of intelligent agents.
ritzz-ai
Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
TYH-labs
Zero-friction LLM fine-tuning skill for Claude Code, Gemini CLI & any ACP agent. Unsloth on NVIDIA · TRL+MPS/MLX on Apple Silicon. Automates env setup, LoRA training (SFT, DPO, GRPO, vision), post-hoc GRPO log diagnostics, evaluation, and export end-to-end. Part of the Gaslamp AI platform.
vivekpathania
AI Experiments: A public repository of AI/ML projects exploring generative models, NLP, computer vision, and autonomous agents. Includes code, documentation, and demos for educational purposes.
Jiayi-Pan
👀🧠 GPT-4 Vision x 💪⌨️ Vimium = Autonomous Web Agent
qnguyen3
Lightweight Vision native Multimodal Document Agent
agents-x-project
[MTI-LLM@NeurIPS 2025] Official implementation of "PyVision: Agentic Vision with Dynamic Tooling."
google-gemini
Gemini Live provides multimodal realtime agent capabilities. Build voice agents that can process vision and text in realtime.