Search Results

Found 1,636 repositories (showing 30)

OmniParser

microsoft

💚100

A simple screen parsing tool towards pure vision based GUI agent

24.6k

2.2k

CC-BY-4.0

Jupyter Notebook

Updated 7 hours ago

Vision-Agents

GetStream

💛79

Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.

7.6k

622

Apache-2.0

Python

Updated 1 hour ago

agentic-ai, agents, ai, +8

browser-agent

magnitudedev

💛76

Open-source, vision-first browser agent

4.0k

224

Apache-2.0

TypeScript

Updated 1 day ago

ai, automation, browser, +7

ruby_llm

One beautiful Ruby API for OpenAI, Anthropic, Gemini, Bedrock, Azure, OpenRouter, DeepSeek, Ollama, VertexAI, Perplexity, Mistral, xAI, GPUStack & OpenAI compatible APIs. Agents, Chat, Vision, Audio, PDF, Images, Embeddings, Tools, Streaming & Rails integration.

3.8k

413

MIT

Ruby

Updated 2 hours ago

agents, ai, anthropic, +17

VisionClaw

Intent-Lab

💛76

Real-time AI assistant for Meta Ray-Ban smart glasses -- voice + vision + agentic actions via Gemini Live and OpenClaw

2.0k

356

NOASSERTION

Updated 1 hour ago

ShowUI

showlab

💛73

[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.

1.8k

133

Apache-2.0

Python

Updated 4 hours ago

agent, computer-use, gui-agent, +2

tarsier

reworkd

🧡63

Vision utilities for web interaction agents 👀

1.8k

123

MIT

Jupyter Notebook

Updated 1 week ago

gpt4v, llms, ocr, +5

py-gpt

szczyglis-dev

💛70

Desktop AI Assistant powered by GPT-5, GPT-4, o1, o3, Gemini, Claude, Ollama, DeepSeek, Perplexity, Grok, Bielik, chat, vision, voice, RAG, image and video generation, agents, tools, MCP, plugins, speech synthesis and recognition, web search, memory, presets, assistants, and more. Linux, Windows, Mac

1.7k

318

NOASSERTION

Python

Updated 7 hours ago

ai, ai-assistant, artificial-intelligence, +17

AgentNetworkProtocol

agent-network-protocol

💛72

AgentNetworkProtocol(ANP) is an open source protocol for agent communication. Our vision is to define how agents connect with each other, building an open, secure, and efficient collaboration network for billions of intelligent agents.

1.3k

Apache-2.0

HTML

Updated 20 hours ago

agent, communication, protocol

Kimi-VL

MoonshotAI

🧡67

Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities

1.2k

MIT

Updated 21 hours ago

VIGA

Fugtemypt123

💛72

VIGA: Vision-as-Inverse-Graphics Agent

915

MIT

Python

Updated 6 hours ago

4KAgent

taco-group

🧡66

[NeurIPS 2025] 4KAgent: Agentic Any Image to 4K Super-Resolution. An intelligent computer vision agent that can magically restore any image to perfect-4K!

781

Apache-2.0

Python

Updated 2 hours ago

agent, agentic-ai, computer-vision, +12

ravens

google-research

🧡67

Train robotic agents to learn pick and place with deep learning for vision-based manipulation in PyBullet. Transporter Nets, CoRL 2020.

624

105

Apache-2.0

Python

Updated 1 day ago

artificial-intelligence, computer-vision, deep-learning, +11

python-sdk

askui

🧡66

Enable AI to control your desktop, mobile and HMI devices

528

MIT

Python

Updated 14 hours ago

agents, computer-vision, llms, +3

vibe-check-mcp-server

PV-Bhat

💛71

Vibe Check is a tool that provides mentor-like feedback to AI Agents, preventing tunnel-vision, over-engineering and reasoning lock-in for complex and long-horizon agent workflows. KISS your over-eager AI Agents goodbye! Effective for: Coding, Ambiguous Tasks, High-Risk tasks

480

MIT

TypeScript

Updated 4 days ago

agentic-ai, agentic-workflow, ai-agents, +9

AIA-Academic-Illustrator-

qwwzdyj

🧡66

An AI agent that automates the creation of CVPR/NeurIPS standard academic diagrams. Implements a strict "Logic (Architect) -> Vision (Renderer)" workflow to transform paper abstracts into high-fidelity scientific illustrations.

450

JavaScript

Updated 20 hours ago

RL4VLM

🧡66

Official Repo for Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

409

MIT

Jupyter Notebook

Updated 6 days ago

aguvis

xlang-ai

🧡56

[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

386

Python

Updated 15 hours ago

agent-clip

epiral

🧡61

AI Agent as a Pinix Clip — agentic loop with memory, tools, and vision

384

TypeScript

Updated 3 hours ago

NanoLLM

dusty-nv

🧡61

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.

366

MIT

Python

Updated 11 hours ago

edge-ai, llm-inference, multimodal, +4

llm

graniet

🧡66

A powerful Rust library and CLI tool to unify and orchestrate multiple LLM, Agent and voice backends (OpenAI, Claude, Gemini, Ollama, ElevenLabs...) with a single, extensible API. Build, chain, evaluate, and serve complex multi-step AI workflows — including speech-to-text, text-to-speech, completions, vision, and reasoning.

333

MIT

Rust

Updated 1 day ago

ChatGPT-OpenAI-Smart-Speaker

Olney1

🧡61

This AI Smart Speaker uses speech recognition, TTS (text-to-speech), and STT (speech-to-text) to enable voice and vision-driven conversations, with additional web search capabilities via OpenAI and Langchain agents.

311

MIT

Python

Updated 3 weeks ago

agents, ai, artificial-intelligence, +14

anp

agent-network-protocol

💛71

Our vision is to provide communication capabilities for intelligent agents, allowing them to connect with each other to form a collaborative network of intelligent agents.

291

Apache-2.0

Python

Updated 2 days ago

agent, ai, did

GUI-R1

ritzz-ai

🧡65

Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

240

Apache-2.0

Python

Updated 22 hours ago

deep-reinforcement-learning, grpo, gui-agent, +6

unsloth-buddy

TYH-labs

🧡60

Zero-friction LLM fine-tuning skill for Claude Code, Gemini CLI & any ACP agent. Unsloth on NVIDIA · TRL+MPS/MLX on Apple Silicon. Automates env setup, LoRA training (SFT, DPO, GRPO, vision), post-hoc GRPO log diagnostics, evaluation, and export end-to-end. Part of the Gaslamp AI platform.

200

Python

Updated 1 day ago

apple-silicon, claude-code, dpo, +10

ai-experiments

vivekpathania

🧡61

AI Experiments: A public repository of AI/ML projects exploring generative models, NLP, computer vision, and autonomous agents. Includes code, documentation, and demos for educational purposes.

167

Apache-2.0

Python

Updated 2 weeks ago

GPT-V-on-Web

Jiayi-Pan

💛70

👀🧠 GPT-4 Vision x 💪⌨️ Vimium = Autonomous Web Agent

166

AGPL-3.0

Python

Updated 1 day ago

docpixie

qnguyen3

🧡55

Lightweight Vision native Multimodal Document Agent

161

MIT

Python

Updated 3 weeks ago

PyVision

agents-x-project

🧡60

[MTI-LLM@NeurIPS 2025] Official implementation of "PyVision: Agentic Vision with Dynamic Tooling."

156

Python

Updated 2 days ago

agent, computer-vision, mllm

gemini-live-api-examples

google-gemini

💛71

Gemini Live provides multimodal realtime agent capabilities. Build voice agents that can process vision and text in realtime.

147

Apache-2.0

JavaScript

Updated 2 hours ago
