Production AI APIs on Apple Silicon

API Platform

Flat-rate APIs in the cloud, or unlimited self-hosted on Courier OS.

Flex APIs

Models Load on Demand

Limited memory? No problem. With the flex API you can load 10 models on one Mac Mini! Models load on-demand and intelligently negotiate memory with each other so you can use many different models in one constrained memory pool.

Static APIs

Always-On Production Endpoints

Static APIs stay loaded 24/7 for minimum latency. Pair static hot paths with flex overflow models for the best cost and performance mix.

More Than a Simple Model Server

Production Ready Out of the Box

Our API platform goes far beyond model serving with dynamic memory management, industry-leading tool calling for agents, automatic hallucination detection, analytics, and the highest throughput on Apple Silicon

Flex + Static Deployment

Mix always-on and on-demand APIs on the same Mac.

Intelligent Memory Management

Never worry about OOMs again. LRU offload for flex models handles production spikes automatically.

Optimized Apple Silicon Speed

MLX-native stack with TurboQuant, KV caching, and robust batching gives you the best speeds and highest throughput on a single Mac.

Maximum Throughput

Robust batching for concurrency plus model pooling with a round-robin-styled API for redundancy. Courier maximizes throughput and speed on Apple hardware.

Automatic Hallucination Detection

Our system detects when models hang or hallucinate and automatically restarts them. You only lose one request instead of an entire queue, ensuring maximum uptime and reliability even with SLMs.

Industry-Leading Agent Capabilities

We built industry leading tool calling technology, making open source models on our platform the most performant for agentic workflows.

Real-Time Analytics

Track requests, tokens, latency, and model usage from day one.

Built-in Whisper API

Never pay for audio transcription again. Transcription and translation endpoints for automation pipelines.

Courier Core Library

Curated Cloud Models We Ship Today

Benchmarked, maintained, and recommended for production — from lite agents to frontier models, embeddings, speech, and image generation.

Qwen3.5 2B

Qwen3.5

The smallest Qwen3.5 omni model — vision, audio, and text in just 2GB. Ideal for routing, classification, and fast sub-agent tasks on a 16GB Mac Mini.

Qwen3.5 4B

Qwen3.5

A step up from the 2B variant with noticeably better reasoning while staying under 10GB. Great for lightweight multi-modal agents where you need a bit more quality without midweight latency.

Gemma 4 E2B

Gemma 4

The smallest Gemma 4 omni model. Tiny but still supports all modalities — perfect for routing, fast sub-agent tasks, and prototyping multi-modal workflows on memory-constrained devices.

Gemma 4 E4B

Gemma 4

This model surprises me. It's incredibly capable, small, and is an omni model that can perform tool calls as well. There could be many use cases for a model this size and this performant.

Qwen3.5 9B

Qwen3.5

An impressive omni model for its size — vision, audio, and text in just 9GB. A fast lightweight default when you want multi-modal capability on a Mac Mini without the latency of a midweight model.

Gemma 4 12B

Gemma 4

A powerful lite agent that punches like a balanced agent but at lite-agent cost. Full omni support for chat, RAG, multimodal, and tool-using workflows — 12GB at 8bit with room to spare on a 24GB Mac Mini.

Gemma 4 26B A4B

Gemma 4

The fast Gemma 4 variant. Pick this for high-throughput generalist workloads — chat, RAG, multimodal, broad-knowledge tasks — where latency matters more than maxed-out quality.

Gemma 4 31B

Gemma 4

The quality Gemma 4 variant. Pick this for complex chat, multimodal reasoning, and broad-knowledge tasks where accuracy beats throughput.

Qwen3.6 27B

Qwen3.6

The quality Qwen3.6 variant. Pick this for the hardest coding agents and multi-step reasoning where accuracy on code beats raw speed.

Browse the full almanac →

Real-Time Analytics

Loading production data...

Pricing

Simple, Transparent Pricing

From generous free cloud access to unlimited self-hosting. Scale your AI infrastructure as you grow.

Courier Cloud

Flat-rate usage without token math

Start for free, scale up on the most affordable production tiers at $25/mo and $100/mo for production workloads.

Start building completely free
OpenAI-compatible endpoints
Production analytics included
Upgrade path to dedicated cloud hardware

Self-hosted unlimited usage

Buy the Mac once. Run unlimited inference on your hardware with the same API platform, analytics, and model library — your data never leaves your infrastructure.

Your Journey with Courier

Free Tier

Start on Courier Cloud at no cost to see how it works for you.

Pro Tier

Maxing out the free limits? Upgrade for higher usage and faster speeds.

Max Tier

For heavy cloud users who need the highest limits and fastest speeds.

Self-Managed

Ready for unlimited? Run Courier on your own Mac.

Fully Managed

Zero hassle. We manage your Mac hardware for you.

Courier Cloud

Free

$0/mo

500 credits/mo

The perfect way to start. Experiment with our Mac-optimized infrastructure at no cost.

Generous rate limits
Access to open-source models
Models load on demand

Popular Choice

Courier Cloud

Pro

$25/mo

25k credits/mo

Higher limits and faster speeds.

Our fastest hardware
Higher rate limits
Lower latency and faster ttft speeds

Courier Cloud

Max

$100/mo

100k credits/mo

Maximum cloud capacity for power users and scaling applications.

Everything in Pro
Highest cloud rate limits
Premium model access
Priority cloud support

Self-Hosted License

Self-Managed

Standalone

$300/mo

Install on your own M-series Mac. Fully featured and transferable. This is what Courier Cloud runs on.

Unlimited tokens and requests
Speed-optimized & production-ready
Transferable between Macs
Full privacy & data control

Zero Maintenance

3rd-Party Managed

Managed

$500/mo

Send us your Mac, and we manage everything: backup power, internet, setup, and maintenance.

All Standalone benefits
No hardware upkeep hassle
Backup power & redundant internet
24/7 technical maintenance

Coming Soon

Find the Right Mac for Your Use Case

Start from a preset or build your own flex stack, then see which Apple Silicon Mac fits your models.

Model Count

Select multiple models for different tasks (e.g., coding, vision, and general chat). As your user base grows, you will see increased latency and degradation in user-experience if multiple models are not utilized.

1-3 Models:Focused Setup

4-10 Models:Versatile Setup

10+ Models:Full Ecosystem

Throughput - Quantization & VRAM

Performance is determined by model quantization and available VRAM (Video Memory). Reasoning diminishes as quantization drops, possibly leading to hallucinations and other unintended side-effects.

4-bit: Maximum speed, lower VRAM
8-bit: Balanced speed and logic
16-bit: Maximum reasoning capability

Model Size - Parameters

Parameters are the internal variables the AI learns during training. A 30GB model has more "knowledge" than an 8GB model.

Lite (1GB - 14GB):Fast, Efficient

Balanced (15GB - 50GB):Versatile, Strong

Frontier (70GB+):Advanced Reasoning

Context Window - Memory

The context window is the amount of text (tokens) the AI can "remember" during a conversation or process in a single request.

32k tokens:~50 pages of text

128k tokens:Full book length

1M+ tokens:Entire codebases

Dynamic Memory Management

Courier offers 2 model serving options to maximize memory efficiency, Flex and Static

Flex Models

Flex models load into memory upon request and unload after 5 minutes of inactivity.

• Enables running multiple large models on limited hardware

• Dynamic memory allocation

• Only the largest flex model counts towards VRAM requirements

Static Models

Static models stay loaded in memory at all times, providing instant response.

• Instant availability, no load time

• Continuous memory occupancy

• Each static model adds directly to total VRAM requirements

Feeling overwhelmed or unsure what to choose?

Let us help you figure it out.

Start With a Use Case

Pre-configured flex stacks — memory is calculated from the largest flex model loaded at once.

Scout

Chat agent, lightweight sub-agent, and embeddings for Pathfinder

All models use flex APIs

Coding Agent

Planner + implementer stack for agentic coding workflows

All models use flex APIs

Production Server

General-purpose production API with Gemma 4

All models use flex APIs

What do you need AI for?

Select the primary functions for your self-hosted AI setup

Agent

Tool-calling agents, chat, and multi-modal workflows

Image Generation

Generate images from text prompts

Embeddings/RAG

Semantic search, retrieval, and memory

Select Your Models

Choose the AI models to include in your platform (Filtered by your use cases)

Placeholder

No models selected. Add models to your platform to continue.

Hardware Recommendation

Infrastructure Requirements

Based on your model selection

Total VRAM Required7 GB

Recommended HardwareMac Mini (16GB)

Need multi-device clustering or a custom setup? Book a free consultation

API Platform

Models Load on Demand

Always-On Production Endpoints

Production Ready Out of the Box

Flex + Static Deployment

Intelligent Memory Management

Optimized Apple Silicon Speed

Maximum Throughput

Automatic Hallucination Detection

Industry-Leading Agent Capabilities

Real-Time Analytics

Built-in Whisper API

Curated Cloud Models We Ship Today

Qwen3.5 2B

Qwen3.5 4B

Gemma 4 E2B

Gemma 4 E4B

Qwen3.5 9B

Gemma 4 12B

Gemma 4 26B A4B

Gemma 4 31B

Qwen3.6 27B

Simple, Transparent Pricing

Flat-rate usage without token math

Self-hosted unlimited usage

Your Journey with Courier

Free Tier

Pro Tier

Max Tier

Self-Managed

Fully Managed

Courier Cloud

Free

Pro

Max

Self-Hosted License

Standalone

Managed

Platform Configuration Guidelines

Dynamic Memory Management

Start With a Use Case

What do you need AI for?

Select Your Models

Hardware Recommendation