Models Load on Demand
Limited memory? No problem. With the flex API you can load 10 models on one Mac Mini! Models load on-demand and intelligently negotiate memory with each other so you can use many different models in one constrained memory pool.
Always-On Production Endpoints
Static APIs stay loaded 24/7 for minimum latency. Pair static hot paths with flex overflow models for the best cost and performance mix.
Production Ready Out of the Box
Our API platform goes far beyond model serving with dynamic memory management, industry-leading tool calling for agents, automatic hallucination detection, analytics, and the highest throughput on Apple Silicon
Flex + Static Deployment
Mix always-on and on-demand APIs on the same Mac.
Intelligent Memory Management
Never worry about OOMs again. LRU offload for flex models handles production spikes automatically.
Optimized Apple Silicon Speed
MLX-native stack with TurboQuant, KV caching, and robust batching gives you the best speeds and highest throughput on a single Mac.
Maximum Throughput
Robust batching for concurrency plus model pooling with a round-robin-styled API for redundancy. Courier maximizes throughput and speed on Apple hardware.
Automatic Hallucination Detection
Our system detects when models hang or hallucinate and automatically restarts them. You only lose one request instead of an entire queue, ensuring maximum uptime and reliability even with SLMs.
Industry-Leading Agent Capabilities
We built industry leading tool calling technology, making open source models on our platform the most performant for agentic workflows.
Real-Time Analytics
Track requests, tokens, latency, and model usage from day one.
Built-in Whisper API
Never pay for audio transcription again. Transcription and translation endpoints for automation pipelines.
Curated Cloud Models We Ship Today
Benchmarked, maintained, and recommended for production — from lite agents to frontier models, embeddings, speech, and image generation.
Qwen3.5 2B
Qwen3.5The smallest Qwen3.5 omni model — vision, audio, and text in just 2GB. Ideal for routing, classification, and fast sub-agent tasks on a 16GB Mac Mini.
Qwen3.5 4B
Qwen3.5A step up from the 2B variant with noticeably better reasoning while staying under 10GB. Great for lightweight multi-modal agents where you need a bit more quality without midweight latency.
Gemma 4 E2B
Gemma 4The smallest Gemma 4 omni model. Tiny but still supports all modalities — perfect for routing, fast sub-agent tasks, and prototyping multi-modal workflows on memory-constrained devices.
Gemma 4 E4B
Gemma 4This model surprises me. It's incredibly capable, small, and is an omni model that can perform tool calls as well. There could be many use cases for a model this size and this performant.
Qwen3.5 9B
Qwen3.5An impressive omni model for its size — vision, audio, and text in just 9GB. A fast lightweight default when you want multi-modal capability on a Mac Mini without the latency of a midweight model.
Gemma 4 12B
Gemma 4A powerful lite agent that punches like a balanced agent but at lite-agent cost. Full omni support for chat, RAG, multimodal, and tool-using workflows — 12GB at 8bit with room to spare on a 24GB Mac Mini.
Gemma 4 26B A4B
Gemma 4The fast Gemma 4 variant. Pick this for high-throughput generalist workloads — chat, RAG, multimodal, broad-knowledge tasks — where latency matters more than maxed-out quality.
Gemma 4 31B
Gemma 4The quality Gemma 4 variant. Pick this for complex chat, multimodal reasoning, and broad-knowledge tasks where accuracy beats throughput.
Qwen3.6 27B
Qwen3.6The quality Qwen3.6 variant. Pick this for the hardest coding agents and multi-step reasoning where accuracy on code beats raw speed.
Simple, Transparent Pricing
From generous free cloud access to unlimited self-hosting. Scale your AI infrastructure as you grow.
Flat-rate usage without token math
Start for free, scale up on the most affordable production tiers at $25/mo and $100/mo for production workloads.
- Start building completely free
- OpenAI-compatible endpoints
- Production analytics included
- Upgrade path to dedicated cloud hardware
Your Journey with Courier
Free Tier
Start on Courier Cloud at no cost to see how it works for you.
Pro Tier
Maxing out the free limits? Upgrade for higher usage and faster speeds.
Max Tier
For heavy cloud users who need the highest limits and fastest speeds.
Self-Managed
Ready for unlimited? Run Courier on your own Mac.
Fully Managed
Zero hassle. We manage your Mac hardware for you.
Courier Cloud
Free
The perfect way to start. Experiment with our Mac-optimized infrastructure at no cost.
- Generous rate limits
- Access to open-source models
- Models load on demand
Pro
Higher limits and faster speeds.
- Our fastest hardware
- Higher rate limits
- Lower latency and faster ttft speeds
Max
Maximum cloud capacity for power users and scaling applications.
- Everything in Pro
- Highest cloud rate limits
- Premium model access
- Priority cloud support
Self-Hosted License
Standalone
Install on your own M-series Mac. Fully featured and transferable. This is what Courier Cloud runs on.
- Unlimited tokens and requests
- Speed-optimized & production-ready
- Transferable between Macs
- Full privacy & data control
Managed
Send us your Mac, and we manage everything: backup power, internet, setup, and maintenance.
- All Standalone benefits
- No hardware upkeep hassle
- Backup power & redundant internet
- 24/7 technical maintenance
Select multiple models for different tasks (e.g., coding, vision, and general chat). As your user base grows, you will see increased latency and degradation in user-experience if multiple models are not utilized.
Performance is determined by model quantization and available VRAM (Video Memory). Reasoning diminishes as quantization drops, possibly leading to hallucinations and other unintended side-effects.
- 4-bit: Maximum speed, lower VRAM
- 8-bit: Balanced speed and logic
- 16-bit: Maximum reasoning capability
Parameters are the internal variables the AI learns during training. A 30GB model has more "knowledge" than an 8GB model.
The context window is the amount of text (tokens) the AI can "remember" during a conversation or process in a single request.
Dynamic Memory Management
Courier offers 2 model serving options to maximize memory efficiency, Flex and Static
Flex models load into memory upon request and unload after 5 minutes of inactivity.
• Enables running multiple large models on limited hardware
• Dynamic memory allocation
• Only the largest flex model counts towards VRAM requirements
Static models stay loaded in memory at all times, providing instant response.
• Instant availability, no load time
• Continuous memory occupancy
• Each static model adds directly to total VRAM requirements
Start With a Use Case
Pre-configured flex stacks — memory is calculated from the largest flex model loaded at once.
What do you need AI for?
Select the primary functions for your self-hosted AI setup
Select Your Models
Choose the AI models to include in your platform (Filtered by your use cases)
No models selected. Add models to your platform to continue.
Hardware Recommendation
Need multi-device clustering or a custom setup? Book a free consultation