Start on our free tier, scale to meet demand

Affordable, Predictable, Private

When integrating AI, nobody knows what it will cost. Mainstream providers charge per token, and most people don't even know what a token is, much less how many they'll use. Our AI cloud is one of the few options on the market offering flat-rate AI APIs. If you ever need more usage, you can self-host and go Unlimited at any time.

Watch Courier

See Courier in Action

Come check out how Courier works. It's easy to get started and we promise you'll absolutely love it!

Stop Renting

AI Token Spend Increased 49.7% in 2025

As AI adoption grows, so do token costs. Companies are raising their budgets every year to keep pace with AI usage costs, while privacy concerns grow alongside the spend.

No Data Control

Your sensitive data flows through third-party servers. You have no control over security or compliance.

Unpredictable Costs

Per-token pricing eats your margins as you grow.

Vendor Lock-In

Closed models and proprietary APIs trap your business.

How it Works

We Eliminate Token Spend

With hardware innovations from Apple and advances in open-source AI, we eliminate token-based billing: self-host your intelligence layer affordably and reliably, or use our flat-rate cloud.

Courier Cloud

Flat-rate AI APIs starting completely free on our cloud powered by Mac Studios

Other Providers

  • Per-token pricing that grows with your usage
  • Expensive GPU rentals with limited compute
  • Blackbox data and AI practices

$1,000s+/mo

A better option

Courier Cloud

  • Starts FREE
  • Flat-rate plans for Production scaling
  • Transparent data policies & open source AI

FREE - $400/mo


Courier Self-Hosted

Instead of renting AI, you can own AI by self-hosting. Macs are 1/10th the cost of NVIDIA GPUs and up to 50x more power efficient. Simply buy a Mac and run Courier self-hosted to have unlimited AI that you own.

Renting

When you pay for cloud solutions you are paying for 3 things:

  • GPUs - Companies like OpenAI, Anthropic, and Google build server farms filled with GPUs for hosting AI models
  • AI Models - These companies build and train cutting-edge AI models
  • APIs - They provide software that hosts these models on GPUs and allows you to communicate with them

Ownership

To own your AI stack, you need those same 3 things. To do this, we use:

  • Apple's Unified Memory. It's less than 1/10th the cost of traditional NVIDIA GPUs.
  • Open-source LLMs from labs like Mistral and Alibaba's Qwen team. These models are outperforming mainstream models and are completely free to use in commercial applications
  • Courier - We've built software that turns any M-series Mac into an AI API platform with plug-and-play ease.

We chose Apple's Unified Memory architecture because of how affordable and performant it is. Our software is optimized for that memory architecture and handles all the complexity of running open-source AI models. You can use models running on our cloud completely free with rate limits, or purchase your own Mac and subscribe for a flat rate.

Experiment for FREE on Courier Cloud

Try any open-source model with our free Flex APIs, no credit card required.
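
Here's what a first request can look like using the standard OpenAI Python client, since Courier's endpoints are OpenAI-compatible. The base URL, API key, and model name below are placeholders; use the values from your Courier dashboard.

from openai import OpenAI

# Minimal sketch: Courier exposes OpenAI-compatible endpoints, so the standard
# client works. Endpoint, key, and model name are placeholders.
client = OpenAI(
    base_url="https://your-courier-endpoint.example/v1",  # placeholder endpoint
    api_key="YOUR_COURIER_API_KEY",                        # free-tier key from your dashboard
)

response = client.chat.completions.create(
    model="qwen3-30b",  # placeholder: any open-weight model from the library
    messages=[{"role": "user", "content": "Explain unified memory in one sentence."}],
)
print(response.choices[0].message.content)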

Production Ready

Enterprise Features That Make Self-Hosting Simple, Affordable, and Reliable

Courier provides premium capabilities that go far beyond simple model serving, turning affordable Apple hardware into production-ready AI API platforms with plug-and-play ease.

Optimized For Speed

Courier is built on MLX and optimized for speed. With optimizations like TurboQuant, KV caching, and native MLX, we offer the fastest inference speeds on Apple Silicon, currently averaging 20-40 tokens/sec on the M3 Ultra.

Learn more →

Industry-Leading Tool Calling for Agents

Industry-leading tool calling for self-hosted AI stacks, with production-ready reliability for text and fused modality (vision) models. One of the most robust OpenAI-compatible tool-calling implementations available on an API platform you can own.
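
As a rough illustration, tool calls follow the familiar OpenAI request shape. The endpoint, model name, and get_weather tool below are placeholders supplied by your own application, not part of Courier itself.

from openai import OpenAI

client = OpenAI(
    base_url="https://your-courier-endpoint.example/v1",  # placeholder endpoint
    api_key="YOUR_COURIER_API_KEY",
)

# Hypothetical tool definition provided by your own agent.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-30b",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)
# The model returns a structured tool call for your agent to execute.
print(resp.choices[0].message.tool_calls)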

Learn more →

Automatic Hallucination Detection

Our system detects when models hang or hallucinate and automatically restarts them. You only lose one request instead of an entire queue, ensuring maximum uptime and reliability even with SLMs.

Learn more →

Intelligent Memory Management

Never OOM. When new models need more memory, our LRU policy automatically offloads the least recently used flex models. This ensures optimal memory utilization without manual intervention, crucial for production workloads.
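
The sketch below illustrates the general LRU idea with made-up model names and sizes; it is an illustration of the technique, not Courier's internal code.

from collections import OrderedDict

class LRUModelPool:
    """Illustrative LRU offload policy: when loading a model would exceed the
    memory budget, the least recently used flex model is offloaded first."""

    def __init__(self, budget_gb: float):
        self.budget_gb = budget_gb
        self.loaded = OrderedDict()  # model name -> size in GB, oldest first

    def load(self, name: str, size_gb: float) -> None:
        # Offload least recently used models until the new one fits.
        while self.loaded and sum(self.loaded.values()) + size_gb > self.budget_gb:
            evicted, _ = self.loaded.popitem(last=False)
            print(f"offloading {evicted}")
        self.loaded[name] = size_gb

    def touch(self, name: str) -> None:
        # Mark a model as most recently used on every request it serves.
        self.loaded.move_to_end(name)

pool = LRUModelPool(budget_gb=60)
pool.load("chat-14b", 8)        # hypothetical flex models
pool.load("reasoning-70b", 38)
pool.load("coder-32b", 18)      # exceeds the budget: "chat-14b" is offloaded first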

Learn more →

Flexible API Deployment

Deploy models as 24/7 static APIs or on-demand flex APIs. Flex models auto-offload after 5 minutes of inactivity, optimizing memory usage while maintaining low latency for active workloads.

Learn more →

Maximum Throughput

Robust batching for concurrency plus model pooling with a round-robin API for redundancy. Courier maximizes throughput and speed on Apple hardware.
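
A minimal sketch of the round-robin idea with hypothetical replica names; Courier's scheduler also batches concurrent requests, which is omitted here.

import itertools

class RoundRobinPool:
    """Illustrative round-robin pool: requests rotate across model replicas so
    no single instance becomes a bottleneck or single point of failure."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def next_replica(self):
        return next(self._cycle)

pool = RoundRobinPool(["mac-studio-1", "mac-studio-2"])  # hypothetical replica names
for _ in range(4):
    print(pool.next_replica())  # alternates: mac-studio-1, mac-studio-2, ...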

Learn more →

Whisper API Support

Production-ready OpenAI-compatible transcriptions and translations with multiple output formats, timestamp controls, and predictable error behavior for automation workflows.
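
A minimal example against the OpenAI-compatible transcription endpoint; the base URL, API key, and model name are placeholders for your own deployment.

from openai import OpenAI

client = OpenAI(
    base_url="https://your-courier-endpoint.example/v1",  # placeholder endpoint
    api_key="YOUR_COURIER_API_KEY",
)

with open("meeting.mp3", "rb") as audio:  # any local audio file
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",     # placeholder model name
        file=audio,
        response_format="text",       # srt, vtt, and json are also standard formats
    )
print(transcript)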

Learn more →

Comprehensive Analytics

Our platform includes robust analytics so you can track usage, tokens, response times, and more from your analytics dashboard. Calculate savings, analyze usage patterns, and monitor uptime.

Learn more →

Unlimited Usage

Pay for the hardware once and process unlimited AI requests with no further token costs.

Advanced Fine-Tuning

Native support for Low-Rank Adaptation (LoRA) adapters on top of base models for specialized tasks and fine-tuning.
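
For a sense of what this looks like under the hood, here is a rough mlx_lm sketch of loading a base model with a LoRA adapter on Apple Silicon. The model repo and adapter path are placeholders, exact signatures can vary by mlx_lm version, and Courier handles this for you at the API layer.

# Rough sketch with mlx_lm (signatures may differ slightly across versions).
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Mistral-7B-Instruct-v0.3-4bit",  # placeholder base model
    adapter_path="./my-lora-adapter",                # placeholder LoRA adapter directory
)
print(generate(model, tokenizer, prompt="Classify this support ticket: ...", max_tokens=64))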

n8n Integration

Automate Anything with Drag-and-Drop AI

Connect Courier directly to your n8n workflows with our custom community nodes. No complex API integrations. Just drag, drop, and automate.

  • Native n8n credential and node support
  • Auto-synced model library from your workbench
  • Multi-modal support: text, vision, and audio
  • Works with self-hosted Courier instances
Installing Courier community package...
✔ Courier Credentials added
✔ Courier LLM Node ready
✔ Courier Chat Node ready
Ready to automate!

Experience these features for free on Courier Cloud

All production features are available on our free tier. No self-hosting required to get started.

Real-Time Analytics

Pricing

Simple, Transparent Pricing

From generous free cloud access to unlimited self-hosting. Scale your AI infrastructure as you grow.

Your Journey with Courier

1

Free Tier

Start on Courier Cloud at no cost to see how it works for you.

2

Pro Tier

Maxing out the free limits? Upgrade for higher usage and faster speeds.

3

Max Tier

For heavy cloud users who need the highest limits and fastest speeds.

4

Self-Managed

Ready for unlimited? Run Courier on your own Mac.

5

Fully Managed

Zero hassle. We manage your Mac hardware for you.

Courier Cloud

Courier Cloud

Free

$0/mo
2k - 8k requests/mo

The perfect way to start. Experiment with our Mac-optimized infrastructure at no cost.

  • Generous rate limits
  • Access to open-source models
  • Models load on demand
Popular Choice
Courier Cloud

Pro

$100/mo
45k - 120k requests/mo

Higher limits and faster speeds.

  • Our fastest hardware
  • Higher rate limits
  • Lower latency and faster time-to-first-token (TTFT)
Coming Soon
Courier Cloud

Max

$400/mo
200k - 450k requests/mo

Maximum cloud capacity for power users and scaling applications.

  • Highest cloud rate limits
  • Premium model access
  • Priority cloud support
Coming Soon

Self-Hosted License

Self-Managed

Standalone

$300/mo

Install on your own M-series Mac. Fully featured and transferable. This is what Courier Cloud runs on.

  • Unlimited tokens & usage
  • Speed-optimized & production-ready
  • Transferable between Macs
  • Full privacy & data control
Zero Maintenance
3rd-Party Managed

Managed

$500/mo

Send us your Mac, and we manage everything: backup power, internet, setup, and maintenance.

  • All Standalone benefits
  • No hardware upkeep hassle
  • Backup power & redundant internet
  • 24/7 technical maintenance
Coming Soon

What are you waiting for?

Create a free account today and start building with AI.

Model Library

Access Cutting-Edge Open-Weight Models

All these models are available free on Courier Cloud. Choose from the best open-weight models including GLM 4.5 Air, Gemma 4, Qwen, and more.

Model Library

Complete Open-Weight Model Library

Access cutting-edge models from Hugging Face with no vendor lock-in. Includes text generation, code generation, and multi-modal models.

GLM 4.5 Air · Gemma 4 · Qwen · + Many More

See Which Mac Fits Your AI Models

Select your models, configure quantization and context, and see which Mac hardware you need.

Platform Configuration Guidelines

Understanding your infrastructure needs

Model Count

Select multiple models for different tasks (e.g., coding, vision, and general chat). As your user base grows, relying on a single model for every task leads to increased latency and a degraded user experience.

1-3 Models: Focused Setup
4-10 Models: Versatile Setup
10+ Models: Full Ecosystem

Throughput - Quantization & VRAM

Performance is determined by model quantization and available VRAM (video memory). Reasoning quality diminishes as quantization precision drops, possibly leading to hallucinations and other unintended side effects.

  • 4-bit: Maximum speed, lower VRAM
  • 8-bit: Balanced speed and logic
  • 16-bit: Maximum reasoning capability
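
As a rough rule of thumb (ignoring KV-cache and runtime overhead), weight memory ≈ parameters × bits per weight ÷ 8: a 30B model needs roughly 15 GB at 4-bit, 30 GB at 8-bit, and 60 GB at 16-bit.
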
Model Size - Parameters

Parameters are the internal variables the AI learns during training. A 30 billion (30B) parameter model has more "knowledge" than an 8B model.

Small (1B - 14B): Fast, Efficient
Medium (15B - 50B): Versatile, Strong
Large (70B+): Advanced Reasoning

Context Window - Memory

The context window is the amount of text (tokens) the AI can "remember" during a conversation or process in a single request.

32k tokens: ~50 pages of text
128k tokens: Full book length
1M+ tokens: Entire codebases

Courier Model Information

Courier offers two different types of models depending on your use case

Flex Models

Flex models load into memory upon request and unload after 5 minutes of inactivity.

• Enables running multiple large models on limited hardware

• Dynamic memory allocation

• Only the largest flex model counts towards VRAM requirements

Static Models

Static models stay loaded in memory at all times, providing instant response.

• Instant availability, no load time

• Continuous memory occupancy

• Each static model adds directly to total VRAM requirements
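
Putting the two rules together, here is a tiny worked example with made-up model sizes: static models add up, while only the largest flex model counts.

static_models = {"coder-32b": 18, "vision-12b": 9}    # hypothetical static models (GB)
flex_models = {"chat-14b": 8, "reasoning-70b": 38}    # hypothetical flex models (GB)

# Static models stay resident, so they sum; flex models share memory, so only
# the largest one has to fit at any given time.
total_vram_gb = sum(static_models.values()) + max(flex_models.values())
print(total_vram_gb)  # 18 + 9 + 38 = 65 GB, plus headroom for context/KV cache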

Feeling overwhelmed or unsure what to choose?

Let us help you figure it out.

What do you need AI for?

Select the primary functions for your self-hosted AI setup

Text-to-Text
Select this if you use AI for chatbots or basic text-in text-out functionality.
  • Generating responses to questions
  • Generating summaries
  • Translating text
Text-to-Image
Select this if you use AI for generating images from text.
  • Generating images from text prompts
  • Generating images from code snippets
Image Processing
Select this if you have images that need to be analyzed and processed into data.
  • Analyzing Images
  • Text-based tasks as well
Audio Transcription
Select this if you need to transcribe audio into readable text.
  • Transcribing podcasts
  • Transcribing audiobooks
Omni Model
Select this for models that natively handle text, images, audio, and tools in one model.
  • Multi-modal input and output
  • Native tool calling
  • Text, image, and audio

Select Your Models

Choose the AI models to include in your platform (Filtered by your use cases)


Hardware Recommendation

Infrastructure Requirements
Based on your model selection
Total VRAM Required: 0 GB
Recommended Hardware

Need multi-device clustering or a custom setup? Book a free consultation