HomeAPI PlatformScout
Production AI APIs on Apple Silicon

API Platform

Flat-rate APIs in the cloud, or unlimited self-hosted on Courier OS.

Flex APIs

Models Load on Demand

Limited memory? No problem. With the flex API you can load 10 models on one Mac Mini! Models load on-demand and intelligently negotiate memory with each other so you can use many different models in one constrained memory pool.

Static APIs

Always-On Production Endpoints

Static APIs stay loaded 24/7 for minimum latency. Pair static hot paths with flex overflow models for the best cost and performance mix.

More Than a Simple Model Server

Production Ready Out of the Box

Our API platform goes far beyond model serving with dynamic memory management, industry-leading tool calling for agents, automatic hallucination detection, analytics, and the highest throughput on Apple Silicon

Flex + Static Deployment

Mix always-on and on-demand APIs on the same Mac.

Intelligent Memory Management

Never worry about OOMs again. LRU offload for flex models handles production spikes automatically.

Optimized Apple Silicon Speed

MLX-native stack with TurboQuant, KV caching, and robust batching gives you the best speeds and highest throughput on a single Mac.

Maximum Throughput

Robust batching for concurrency plus model pooling with a round-robin-styled API for redundancy. Courier maximizes throughput and speed on Apple hardware.

Automatic Hallucination Detection

Our system detects when models hang or hallucinate and automatically restarts them. You only lose one request instead of an entire queue, ensuring maximum uptime and reliability even with SLMs.

Industry-Leading Agent Capabilities

We built industry leading tool calling technology, making open source models on our platform the most performant for agentic workflows.

Real-Time Analytics

Track requests, tokens, latency, and model usage from day one.

Built-in Whisper API

Never pay for audio transcription again. Transcription and translation endpoints for automation pipelines.

Courier Core Library

Curated Cloud Models We Ship Today

Benchmarked, maintained, and recommended for production — from lite agents to frontier models, embeddings, speech, and image generation.

Qwen3.5 2B

Qwen3.5

The smallest Qwen3.5 omni model — vision, audio, and text in just 2GB. Ideal for routing, classification, and fast sub-agent tasks on a 16GB Mac Mini.

Qwen3.5 4B

Qwen3.5

A step up from the 2B variant with noticeably better reasoning while staying under 10GB. Great for lightweight multi-modal agents where you need a bit more quality without midweight latency.

Gemma 4 E2B

Gemma 4

The smallest Gemma 4 omni model. Tiny but still supports all modalities — perfect for routing, fast sub-agent tasks, and prototyping multi-modal workflows on memory-constrained devices.

Gemma 4 E4B

Gemma 4

This model surprises me. It's incredibly capable, small, and is an omni model that can perform tool calls as well. There could be many use cases for a model this size and this performant.

Qwen3.5 9B

Qwen3.5

An impressive omni model for its size — vision, audio, and text in just 9GB. A fast lightweight default when you want multi-modal capability on a Mac Mini without the latency of a midweight model.

Gemma 4 12B

Gemma 4

A powerful lite agent that punches like a balanced agent but at lite-agent cost. Full omni support for chat, RAG, multimodal, and tool-using workflows — 12GB at 8bit with room to spare on a 24GB Mac Mini.

Gemma 4 26B A4B

Gemma 4

The fast Gemma 4 variant. Pick this for high-throughput generalist workloads — chat, RAG, multimodal, broad-knowledge tasks — where latency matters more than maxed-out quality.

Gemma 4 31B

Gemma 4

The quality Gemma 4 variant. Pick this for complex chat, multimodal reasoning, and broad-knowledge tasks where accuracy beats throughput.

Qwen3.6 27B

Qwen3.6

The quality Qwen3.6 variant. Pick this for the hardest coding agents and multi-step reasoning where accuracy on code beats raw speed.

Real-Time Analytics
Loading production data...
Pricing

Simple, Transparent Pricing

From generous free cloud access to unlimited self-hosting. Scale your AI infrastructure as you grow.

Courier Cloud

Flat-rate usage without token math

Start for free, scale up on the most affordable production tiers at $25/mo and $100/mo for production workloads.

  • Start building completely free
  • OpenAI-compatible endpoints
  • Production analytics included
  • Upgrade path to dedicated cloud hardware

Self-hosted unlimited usage

Buy the Mac once. Run unlimited inference on your hardware with the same API platform, analytics, and model library — your data never leaves your infrastructure.

Your Journey with Courier

1
Free Tier

Start on Courier Cloud at no cost to see how it works for you.

2
Pro Tier

Maxing out the free limits? Upgrade for higher usage and faster speeds.

3
Max Tier

For heavy cloud users who need the highest limits and fastest speeds.

4
Self-Managed

Ready for unlimited? Run Courier on your own Mac.

5
Fully Managed

Zero hassle. We manage your Mac hardware for you.

Courier Cloud

Courier Cloud

Free

$0/mo
500 credits/mo

The perfect way to start. Experiment with our Mac-optimized infrastructure at no cost.

  • Generous rate limits
  • Access to open-source models
  • Models load on demand
Popular Choice
Courier Cloud

Pro

$25/mo
25k credits/mo

Higher limits and faster speeds.

  • Our fastest hardware
  • Higher rate limits
  • Lower latency and faster ttft speeds
Courier Cloud

Max

$100/mo
100k credits/mo

Maximum cloud capacity for power users and scaling applications.

  • Everything in Pro
  • Highest cloud rate limits
  • Premium model access
  • Priority cloud support

Self-Hosted License

Self-Managed

Standalone

$300/mo

Install on your own M-series Mac. Fully featured and transferable. This is what Courier Cloud runs on.

  • Unlimited tokens and requests
  • Speed-optimized & production-ready
  • Transferable between Macs
  • Full privacy & data control
Zero Maintenance
3rd-Party Managed

Managed

$500/mo

Send us your Mac, and we manage everything: backup power, internet, setup, and maintenance.

  • All Standalone benefits
  • No hardware upkeep hassle
  • Backup power & redundant internet
  • 24/7 technical maintenance
Coming Soon
Find the Right Mac for Your Use Case
Start from a preset or build your own flex stack, then see which Apple Silicon Mac fits your models.
Model Count

Select multiple models for different tasks (e.g., coding, vision, and general chat). As your user base grows, you will see increased latency and degradation in user-experience if multiple models are not utilized.

1-3 Models:Focused Setup
4-10 Models:Versatile Setup
10+ Models:Full Ecosystem
Throughput - Quantization & VRAM

Performance is determined by model quantization and available VRAM (Video Memory). Reasoning diminishes as quantization drops, possibly leading to hallucinations and other unintended side-effects.

  • 4-bit: Maximum speed, lower VRAM
  • 8-bit: Balanced speed and logic
  • 16-bit: Maximum reasoning capability
Model Size - Parameters

Parameters are the internal variables the AI learns during training. A 30GB model has more "knowledge" than an 8GB model.

Lite (1GB - 14GB):Fast, Efficient
Balanced (15GB - 50GB):Versatile, Strong
Frontier (70GB+):Advanced Reasoning
Context Window - Memory

The context window is the amount of text (tokens) the AI can "remember" during a conversation or process in a single request.

32k tokens:~50 pages of text
128k tokens:Full book length
1M+ tokens:Entire codebases

Dynamic Memory Management

Courier offers 2 model serving options to maximize memory efficiency, Flex and Static

Flex Models

Flex models load into memory upon request and unload after 5 minutes of inactivity.

• Enables running multiple large models on limited hardware

• Dynamic memory allocation

• Only the largest flex model counts towards VRAM requirements

Static Models

Static models stay loaded in memory at all times, providing instant response.

• Instant availability, no load time

• Continuous memory occupancy

• Each static model adds directly to total VRAM requirements

Feeling overwhelmed or unsure what to choose?

Let us help you figure it out.

Start With a Use Case

Pre-configured flex stacks — memory is calculated from the largest flex model loaded at once.

Scout
Chat agent, lightweight sub-agent, and embeddings for Pathfinder
All models use flex APIs
Coding Agent
Planner + implementer stack for agentic coding workflows
All models use flex APIs
Production Server
General-purpose production API with Gemma 4
All models use flex APIs

What do you need AI for?

Select the primary functions for your self-hosted AI setup

Agent
Tool-calling agents, chat, and multi-modal workflows
Image Generation
Generate images from text prompts
Embeddings/RAG
Semantic search, retrieval, and memory

Select Your Models

Choose the AI models to include in your platform (Filtered by your use cases)

No models selected. Add models to your platform to continue.

Hardware Recommendation

Infrastructure Requirements
Based on your model selection
Total VRAM Required7 GB
Recommended HardwareMac Mini (16GB)

Need multi-device clustering or a custom setup? Book a free consultation