Pricing

Model API pricing depends on the model type and invocation specifications. Large language models are usually billed by token. Multimodal models are usually billed by input specification, output specification, or generation unit, although some image models are also billed by token. The model detail page is the source of truth for the exact rules.

LLM pricing

Large language models are usually billed by token. The basic price items are input price and output price. Some models also list separate cache read and cache write prices, so that tokens served from the cache are billed differently from tokens written to it.

A single model may also tier its input or output prices by context length, or its cache write prices by cache duration. For example, short-context and long-context usage may have different prices, and 5-minute cache writes may be priced differently from 1-hour cache writes.

Price item	Billing unit	Pricing factors
Input price	Token	Input token count, context length
Output price	Token	Output token count, context length
Cache read price	Token	Cache-hit token count, context length
Cache write price	Token	Written token count, cache duration
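
As a rough illustration of how the price items in the table combine, the sketch below sums the four items for one request. Everything in it is an assumption made for the example: the `LLMPrices` class, the `pick_tier` and `llm_cost` helpers, the per-million-token price basis, the tier boundary, and all numbers are hypothetical. It also assumes that cache-hit tokens are counted separately from regular input tokens. The model detail page defines the real rules.

```python
from dataclasses import dataclass, field


@dataclass
class LLMPrices:
    """Hypothetical price sheet; every value is per 1,000,000 tokens."""
    input: float
    output: float
    cache_read: float = 0.0
    # Cache write price keyed by a cache duration label such as "5m" or "1h".
    cache_write: dict = field(default_factory=dict)


def pick_tier(tiers, context_tokens):
    """Return the price sheet of the first tier that covers the context length.

    `tiers` is a list of (max_context_tokens, LLMPrices) sorted by limit;
    a limit of None means "no upper bound".
    """
    for limit, prices in tiers:
        if limit is None or context_tokens <= limit:
            return prices
    raise ValueError("no tier covers this context length")


def llm_cost(prices, input_tokens, output_tokens,
             cache_read_tokens=0, cache_write_tokens=0, cache_duration="5m"):
    """Sum the four price items, each billed per token."""
    # Only look up the write price when something was actually written.
    write_price = prices.cache_write[cache_duration] if cache_write_tokens else 0.0
    total = (input_tokens * prices.input
             + output_tokens * prices.output
             + cache_read_tokens * prices.cache_read
             + cache_write_tokens * write_price)
    return total / 1_000_000


# Made-up tiers: a short-context sheet up to 200,000 tokens, then a long-context sheet.
TIERS = [
    (200_000, LLMPrices(2.0, 8.0, 0.5, {"5m": 2.5, "1h": 4.0})),
    (None, LLMPrices(4.0, 12.0, 1.0, {"5m": 5.0, "1h": 8.0})),
]

if __name__ == "__main__":
    sheet = pick_tier(TIERS, context_tokens=150_000)
    print(llm_cost(sheet, 100_000, 10_000,
                   cache_read_tokens=40_000,
                   cache_write_tokens=10_000,
                   cache_duration="1h"))
```

Keeping the tier lookup separate from the cost sum mirrors the table: context length and cache duration only select which price applies, while the token counts drive the total.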

Multimodal pricing

Multimodal pricing is not tied to a single parameter. Image models are usually billed per image, with prices varying by resolution, generation mode, output count, and similar specifications. Video models are usually billed per second, though some are billed per video, with prices varying by duration, resolution, generation mode, and whether audio is generated.

Some Google and GPT image models use token-based pricing instead. For example, Nano Banana and GPT Image 2 typically split text input, image input, and image output into separate price items.

Model type	Billing unit	Pricing factors
Image models	Per image	Resolution, mode, output count
Video models	Per second or per video	Resolution, mode, duration, whether audio is included
Audio models	Per second, per character, or per call	Input length, output duration, voice or mode
3D models	Per call	Output format, quality tier, generation mode
Special image models	Token	Text input, image input, text output, image output
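
The billing units in the table can be sketched as three small calculations: per image, per second of video, and per token for token-billed image models. The rate tables, function names, specification keys, and prices below are all hypothetical and chosen only to show the shape of the arithmetic. Actual rates and pricing factors come from each model's detail page.

```python
# Hypothetical rate tables; none of these values are real prices.
IMAGE_PRICE = {
    ("1024x1024", "standard"): 0.04,
    ("2048x2048", "standard"): 0.08,
}
VIDEO_PRICE_PER_SECOND = {
    # (resolution, audio generated?) -> price per second of output
    ("720p", False): 0.10,
    ("720p", True): 0.15,
    ("1080p", False): 0.20,
    ("1080p", True): 0.30,
}
# Assumed per-1,000,000-token prices for a token-billed image model.
TOKEN_IMAGE_PRICE = {
    "text_input": 5.0,
    "image_input": 10.0,
    "image_output": 40.0,
}


def image_cost(resolution, mode, output_count):
    """Per-image billing: unit price depends on the spec, times images produced."""
    return IMAGE_PRICE[(resolution, mode)] * output_count


def video_cost(resolution, with_audio, duration_seconds):
    """Per-second billing: rate depends on resolution and audio, times duration."""
    return VIDEO_PRICE_PER_SECOND[(resolution, with_audio)] * duration_seconds


def token_image_cost(token_counts):
    """Token billing: each price item has its own rate; sum over the items used."""
    total = sum(count * TOKEN_IMAGE_PRICE[item]
                for item, count in token_counts.items())
    return total / 1_000_000


if __name__ == "__main__":
    print(image_cost("1024x1024", "standard", output_count=4))
    print(video_cost("1080p", with_audio=True, duration_seconds=8))
    print(token_image_cost({"text_input": 1_000,
                            "image_input": 2_000,
                            "image_output": 5_000}))
```

An unknown specification raises a `KeyError` instead of falling back to a default price, which seems the safer choice for a cost estimate.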