Last updated: 16/04/2025

X.AI Official Docs

Grok-2-Vision

Model ID: grok-2-vision-1212
Status: active


Grok-2-Vision is a multimodal AI model from xAI that processes and generates text, images, video, audio, transcriptions, and text-to-speech. Its 128,000-token context window lets it work with long, mixed-media inputs in a single request.

Key features: a 128K-token context window; text, image, video, audio, transcription, and text-to-speech inputs and outputs; and fine-tuning for custom applications.

Model Timeline

Launch date: 12/12/2023

Capabilities

Text

Input pricing: $0.00 / 1K tokens
Context: 128,000 tokens

Output pricing: $0.00 / 1K tokens
Max tokens: 4,096
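
The text limits and prices above can be turned into a quick pre-flight check. The sketch below is illustrative only and uses no xAI library: the function names are hypothetical, "Max tokens" is assumed to be the cap on generated tokens, and the prompt and the output are assumed to share the 128,000-token context window.

```python
# Limits and prices copied from the listing above.
CONTEXT_WINDOW_TOKENS = 128_000
MAX_OUTPUT_TOKENS = 4_096       # assumption: "Max tokens" caps the generated output
INPUT_PRICE_PER_1K = 0.00       # USD per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.00      # USD per 1,000 output tokens


def check_request(prompt_tokens: int, requested_output_tokens: int) -> list[str]:
    """Return a list of problems; an empty list means the request fits."""
    problems = []
    if requested_output_tokens > MAX_OUTPUT_TOKENS:
        problems.append(
            f"requested output {requested_output_tokens} exceeds {MAX_OUTPUT_TOKENS}"
        )
    # Assumption: prompt and output tokens count against the same window.
    total = prompt_tokens + requested_output_tokens
    if total > CONTEXT_WINDOW_TOKENS:
        problems.append(f"total {total} exceeds context window {CONTEXT_WINDOW_TOKENS}")
    return problems


def estimate_cost_usd(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate the request cost from the per-1K-token prices."""
    input_cost = (prompt_tokens / 1000) * INPUT_PRICE_PER_1K
    output_cost = (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost
```

With both prices listed as $0.00, the estimate is zero for any request; the function is kept so that the constants can be updated if the listed prices change.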

Vision Capabilities

Max resolution: 4096x4096
Max images per prompt: 10
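
The vision limits can be checked before a prompt is sent. This is a minimal sketch under stated assumptions: the function name is hypothetical, image sizes are passed in as (width, height) pixel pairs, and the 4096x4096 limit is assumed to apply to each dimension separately rather than to the total pixel count.

```python
# Limits copied from the listing above.
MAX_IMAGE_WIDTH = 4096
MAX_IMAGE_HEIGHT = 4096
MAX_IMAGES_PER_PROMPT = 10


def validate_images(sizes: list[tuple[int, int]]) -> list[str]:
    """Return one message per violated limit; an empty list means all images fit."""
    errors = []
    if len(sizes) > MAX_IMAGES_PER_PROMPT:
        errors.append(
            f"{len(sizes)} images supplied; limit is {MAX_IMAGES_PER_PROMPT}"
        )
    for index, (width, height) in enumerate(sizes):
        # Assumption: each dimension is limited independently.
        if width > MAX_IMAGE_WIDTH or height > MAX_IMAGE_HEIGHT:
            errors.append(
                f"image {index} is {width}x{height}; "
                f"limit is {MAX_IMAGE_WIDTH}x{MAX_IMAGE_HEIGHT}"
            )
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every oversized image in one pass.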

Embeddings

Embeddings pricing: $0.0001 / 1K tokens
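
As a worked example of the embeddings price, the sketch below converts a token count into a cost in USD. The function name is hypothetical, and billing is assumed to be prorated per token rather than rounded up to whole 1,000-token blocks. Decimal is used so that very small prices are not distorted by binary floating point.

```python
from decimal import Decimal

# Price copied from the listing above: USD per 1,000 tokens.
EMBEDDING_PRICE_PER_1K = Decimal("0.0001")


def embedding_cost_usd(token_count: int) -> Decimal:
    """Estimate embeddings cost, assuming per-token proration."""
    return Decimal(token_count) / Decimal(1000) * EMBEDDING_PRICE_PER_1K
```

Under these assumptions, embedding 1,000,000 tokens costs 1,000 × $0.0001 = $0.10.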
