Mojo-tokenizer: Fastest AI Token Output → Readable Text on Apple Silicon?

mojo-tokenizer decodes tokens at 144M tokens/sec on Apple Silicon — 3.1x faster than tiktoken and 1.2x faster than rs-bpe.

We benchmarked 3 BPE implementations across 5 diverse text files (23 MB total) using the GPT-4 tokenizer vocabulary. Every claim is backed by reproducible benchmarks with statistical analysis.

The Numbers

Implementation   Language      Decoding   Encoding (full pipeline)
mojo-tokenizer   Mojo          144 M/s    8.0 M/s
rs-bpe           Rust          121 M/s    — (raw BPE only)
tiktoken         Rust/Python   47 M/s     5.1 M/s

20 iterations, 3 warmup. sherlock.txt (607 KB, 143K tokens). Apple Silicon.

Note: tiktoken is implemented in Rust with Python bindings — we're comparing against Rust-accelerated code, not pure Python.

Why decoding matters: Every LLM inference call decodes output tokens. At 144M tok/s, mojo-tokenizer can decode GPT-4's entire 128K context window in under 1ms.

Decoding Performance Across Files

File                  Type           mojo-tokenizer   rs-bpe    tiktoken   vs tiktoken
sherlock.txt          Literary       144 M/s          121 M/s   47 M/s     3.1x
war_and_peace.txt     Literary       141 M/s          114 M/s   43 M/s     3.3x
les_miserables.txt    Literary       140 M/s          122 M/s   43 M/s     3.3x
arxiv_abstracts.txt   Scientific     117 M/s          105 M/s   42 M/s     2.8x
wikitext2_train.txt   Encyclopedia   128 M/s          116 M/s   45 M/s     2.8x

Speedup is consistent: 2.8x to 3.3x across all content types.

Encoding Performance (Fair Comparison)

Implementation       Full Pipeline   Raw BPE
mojo-tokenizer       8.0 M/s         9.2 M/s
rs-bpe (Rust)        —               10.0 M/s
tiktoken (Rust/Py)   5.1 M/s         —

Note: rs-bpe's raw BPE is faster (10.0 M/s vs 9.2 M/s), but production tokenizers need pretokenization to match tiktoken's output. On the full pipeline, mojo-tokenizer beats tiktoken by 1.6x.

What We Test

Metric       Description
Vocabulary   cl100k_base (100,256 tokens) — GPT-4, ChatGPT
Test data    5 files: 607 KB to 10.7 MB (23 MB total, 5.3M tokens)
Algorithm    O(n) backtracking BPE
Validation   100% exact token match with tiktoken on all files
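
The validation step in the table above amounts to encoding every benchmark file with both tokenizers and asserting identical token IDs. A minimal sketch in Python, where encode_under_test stands in for the implementation being checked (a hypothetical callable; only the tiktoken side is a real, published API):

# Hypothetical validation harness; pass in any encode(text) -> list[int] callable
# for the implementation under test. Only tiktoken is a real dependency here.
import tiktoken

FILES = ["sherlock.txt", "war_and_peace.txt", "les_miserables.txt",
         "arxiv_abstracts.txt", "wikitext2_train.txt"]

def validate(encode_under_test):
    reference = tiktoken.get_encoding("cl100k_base")
    for path in FILES:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        assert encode_under_test(text) == reference.encode(text), f"token mismatch in {path}"
    print("100% exact token match on all files")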

Supported Formats

mojo-tokenizer currently focuses on OpenAI-style BPE tokenization:

Encoding          Vocab Size   Models                               Status
o200k_base        199,998      gpt-5.2, gpt-oss-120B, gpt-oss-20B   ✓ Verified
HuggingFace BPE   varies       Qwen, Llama, Mistral                 Experimental

"Hello, world!" → [13225, 11, 2375, 0]  (o200k_base)

mojo-tokenizer produces exact token matches with tiktoken on o200k_base. HuggingFace BPE format loading is implemented but not yet validated against HuggingFace tokenizers.
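
The example above is easy to spot-check against tiktoken (a quick sanity check, not the project's validation suite):

# Spot-check the example above against tiktoken's o200k_base encoder.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Hello, world!")
print(ids)              # per the table above: [13225, 11, 2375, 0]
print(enc.decode(ids))  # "Hello, world!"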

How It Works

FlatTokenStorage: The Decoding Secret

The key to 144M tok/s decoding is FlatTokenStorage — a flat byte array containing all token bytes contiguously:

Traditional storage:
  token_0 → [72, 101, 108, 108, 111]      # "Hello" - separate allocation
  token_1 → [32, 119, 111, 114, 108, 100] # " world" - separate allocation
  ...100K more allocations...

FlatTokenStorage:
  data:    [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, ...]
  offsets: [0, 5, 11, ...]  # Where each token starts

Decoding becomes a series of memcpy() calls from one contiguous buffer, instead of chasing a separately allocated buffer for every token:

# For each token ID, copy its bytes directly
memcpy(dest_ptr, flat_data + offsets[token_id], lengths[token_id])

Source: src/flat_token_storage.mojo
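
For intuition, here is a minimal Python sketch of the same layout (illustrative only, not the Mojo implementation): all token bytes live in one buffer, an offsets array marks token boundaries, and decoding reduces to slicing.

# Minimal Python sketch of the flat-storage idea (class name reused for clarity,
# but this is not the Mojo code).
class FlatTokenStorage:
    def __init__(self, token_bytes):
        # Pack every token's bytes into one contiguous buffer.
        self.data = b"".join(token_bytes)
        # offsets[i] is where token i starts; offsets[i + 1] is where it ends.
        self.offsets = [0]
        for tb in token_bytes:
            self.offsets.append(self.offsets[-1] + len(tb))

    def decode(self, token_ids):
        # Decoding is pure slicing of the flat buffer -- no per-token allocations.
        view = memoryview(self.data)
        return b"".join(view[self.offsets[t]:self.offsets[t + 1]] for t in token_ids)

storage = FlatTokenStorage([b"Hello", b" world", b"!"])
print(storage.decode([0, 1, 2]))  # b'Hello world!'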

O(n) Backtracking BPE (Encoding)

Ported from rs-bpe: a single forward pass with a precomputed merge table instead of a priority queue:

Traditional BPE: O(n log n) - priority queue for merges
Backtracking BPE: O(n) - single pass with precomputed tables
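
To see what the single pass replaces, here is a deliberately naive rank-table merge loop in Python (a contrast example, not rs-bpe or mojo-tokenizer code, using a toy numbering where the merge with rank r produces token id 256 + r). Each merge rescans the whole sequence, which is exactly the work the backtracking variant avoids.

# Naive BPE for contrast: repeatedly find and apply the lowest-ranked merge.
# merges maps (token_a, token_b) -> rank.
def naive_bpe_encode(ids, merges):
    ids = list(ids)
    while len(ids) > 1:
        # Scan all adjacent pairs for the best (lowest-rank) applicable merge.
        best_pair, best_rank = None, None
        for pair in zip(ids, ids[1:]):
            rank = merges.get(pair)
            if rank is not None and (best_rank is None or rank < best_rank):
                best_pair, best_rank = pair, rank
        if best_pair is None:
            break  # no applicable merges remain
        # Apply the merge everywhere, rebuilding the whole sequence.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best_pair:
                out.append(256 + best_rank)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids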

PairCache1000: O(1) Merge Lookup

For token pairs where both token IDs are below 1000 (covering ~80% of merges):

# HashMap: merge_rank = self.merges.get(pair_key)  # O(1) amortized
# Array:   merge_rank = self.pair_cache[t1][t2]    # O(1) guaranteed

4MB 2D array eliminates hash overhead. +21% encoding speedup.
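
A minimal Python sketch of the idea, with names and sizes taken from the description above (not the Mojo code):

# Dense 2D cache for merge ranks of small-ID pairs, with a hash-map fallback.
CACHE_SIZE = 1000
NO_MERGE = -1

def build_pair_cache(merges):
    # merges maps (t1, t2) -> rank; precompute a dense table for t1, t2 < CACHE_SIZE.
    cache = [[NO_MERGE] * CACHE_SIZE for _ in range(CACHE_SIZE)]
    for (t1, t2), rank in merges.items():
        if t1 < CACHE_SIZE and t2 < CACHE_SIZE:
            cache[t1][t2] = rank
    return cache

def merge_rank(t1, t2, cache, merges):
    if t1 < CACHE_SIZE and t2 < CACHE_SIZE:
        rank = cache[t1][t2]            # O(1) array indexing, no hashing
        return None if rank == NO_MERGE else rank
    return merges.get((t1, t2))         # fallback for large-ID pairs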

Zero-Copy Borrowed References

BacktrackEncoderRef borrows vocabulary instead of copying:

struct BacktrackEncoderRef[origin: Origin]:
    var vocab: UnsafePointer[Vocabulary, origin]  # Borrowed

Eliminates ~105MB copy per encode call. Compiler-enforced lifetime safety.

Statistical Rigor

All measurements include:

  • 20 iterations per file (3 warmup)
  • Standard deviation (CV < 5% indicates stable results)
  • Percentiles (p50, p95, p99)
  • 5 diverse text types (literary, scientific, encyclopedic)

sherlock.txt decoding:
  Mean: 143.7 ± 4.3 M/s
  p50=144.4, p95=148.0, p99=148.3 M/s
  CV=3.0%
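
These statistics are straightforward to reproduce from per-iteration throughput samples; a sketch using Python's statistics module (the actual harness is bench_comprehensive.mojo in the repo):

# Summarize per-iteration throughput samples (in M tokens/s) the way the tables report them.
import statistics

def summarize(samples):
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    ordered = sorted(samples)
    def pct(p):
        # Nearest-rank percentile; adequate for 20 samples.
        return ordered[round(p / 100 * (len(ordered) - 1))]
    return {
        "mean": mean,
        "stdev": stdev,
        "cv_pct": stdev / mean * 100,   # coefficient of variation
        "p50": pct(50), "p95": pct(95), "p99": pct(99),
    }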

When Tokenization Matters

Cloud APIs (OpenAI, Claude, etc.) handle tokenization internally — you send text, they tokenize behind the scenes. But fast tokenization matters when:

  1. Running models locally — MLX, PyTorch, or llama.cpp require you to tokenize before inference
  2. Context window management — Counting tokens to fit within limits (128K for GPT-4, 200K for Claude)
  3. Cost estimation — Counting tokens before API calls to estimate costs
  4. Token-level operations — Prompt caching, continuation from specific positions

Example: Local Inference Pipeline

Text → [Tokenizer] → Token IDs → [LLM] → Output IDs → [Tokenizer] → Text
       ↑ 8M tok/s encode                              ↑ 144M tok/s decode

At 144M tok/s, tokenizer overhead becomes negligible compared to model inference time.
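
As a concrete instance of points 2 and 3 above, token counting before an API call is a few lines with tiktoken; any tokenizer with a compatible encode() would slot in the same way, and the price below is a parameter, not a quote:

# Context-window and cost checks before sending a request.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / ChatGPT vocabulary

def fits_in_context(prompt, max_context=128_000, reserve_for_output=4_000):
    # Leave room for the model's reply inside the context window.
    return len(enc.encode(prompt)) + reserve_for_output <= max_context

def estimated_input_cost(prompt, usd_per_million_tokens):
    return len(enc.encode(prompt)) / 1_000_000 * usd_per_million_tokens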

Limitations

  1. Apple Silicon only — Results may differ on x86
  2. English-heavy data — CJK, Arabic, emoji not tested
  3. Vocabulary load time — mojo-tokenizer: ~800ms vs tiktoken: ~65ms
  4. Memory usage — mojo-tokenizer: ~10MB vs tiktoken: ~3MB

Try It

git clone https://github.com/atsentia/mojo-tokenizer
cd mojo-tokenizer

# Run comprehensive benchmark
mojo run bench_comprehensive.mojo

Full methodology: docs/COMPREHENSIVE_BENCHMARK_RESULTS.md

These results are from Apple Silicon. We'd welcome community benchmarks on different platforms.


Appendix: Why Lead With Decoding?

Decoding is simpler than encoding — just look up bytes by token ID. But it's the operation that happens on every LLM output token, making it latency-critical.

At 144M tok/s:

  • GPT-4's 128K context decodes in 0.9 ms
  • A typical 500-token response decodes in 3.5 µs

This makes tokenizer decoding effectively free compared to model inference time.
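
The arithmetic, for the skeptical (throughput figure taken from the benchmark tables above):

throughput = 144e6              # decoded tokens per second
print(128_000 / throughput)     # ≈ 0.00089 s, i.e. ~0.9 ms for a 128K context
print(500 / throughput)         # ≈ 3.5e-06 s, i.e. ~3.5 µs for a 500-token response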

Appendix: Raw BPE vs Full Pipeline

You might notice that rs-bpe's raw BPE (10.0 M/s) beats mojo-tokenizer's raw BPE (9.2 M/s). We don't lead with this because:

  1. Raw BPE produces different tokens — without pretokenization, you get valid BPE but not tiktoken-compatible output
  2. Production needs pretokenization — the regex patterns that split text before BPE
  3. Full pipeline is what you'd deploy — mojo-tokenizer's 8.0 M/s vs tiktoken's 5.1 M/s is the real comparison

We're honest about where Rust wins. Our advantage is in decoding and the full encoding pipeline.
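
For reference, pretokenization (point 2 above) is a regex split applied before BPE runs on each piece. A deliberately simplified illustration in Python; the real cl100k_base pattern additionally handles contractions, Unicode categories, and digit grouping:

# Simplified pretokenization sketch -- NOT the real cl100k_base regex.
import re

SIMPLE_PRETOKENIZE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

def pretokenize(text):
    # Split into word-like, punctuation-like, and whitespace pieces;
    # BPE then runs independently inside each piece.
    return SIMPLE_PRETOKENIZE.findall(text)

print(pretokenize("Hello, world! It's 2026."))
# ['Hello', ',', ' world', '!', ' It', "'", 's', ' 2026', '.']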

Appendix: HuggingFace Trending Models (Jan 9, 2026)

Most trending models on HuggingFace use BPE tokenization:

Model                                          Tokenizer        Vocab Size
LiquidAI/LFM2.5-1.2B-Instruct                  BPE              64,400
MiniMaxAI/MiniMax-M2.1                         GPT2/BPE         200,000
naver-hyperclovax/HyperCLOVAX-SEED-Think-32B   GPT2/BPE         128,000
LGAI-EXAONE/K-EXAONE-236B-A23B                 BPE              153,600
IQuestLab/IQuest-Coder-V1-40B                  BPE              75,858
tencent/HY-MT1.5-1.8B                          BPE              120,000
Qwen/Qwen2-1.5B                                BPE              151,643
mistralai/Mistral-7B-v0.1                      BPE              32,000
nvidia/nemotron-speech (ASR)                   SentencePiece    —
Lightricks/LTX-Video                           Visual encoder   —

Status: Most of these models use a BPE tokenizer. HuggingFace BPE loading in mojo-tokenizer is not yet validated — the loader parses tokenizer.json but does not yet produce matching tokens. Use tiktoken/o200k_base for production (verified to match exactly).

Multimodal models (Qwen-VL, LTX-Video) use BPE only for text. Images/videos use vision encoders.


mojo-tokenizer is early-stage software. While benchmarks are rigorous, the API may change. We welcome contributions and bug reports at github.com/atsentia/mojo-tokenizer.


Originally posted at atsentia.com blog


Bio: AI Engineer and founder of Atsentia. Previously: Microsoft, Google, YouTube. PhD in CS.