Claude 4.7's new tokenizer costs 47% more tokens on real content
Original: Measuring Claude 4.7's tokenizer costs
Why This Matters
Higher tokenization costs impact AI development budgets and performance optimization strategies
Independent testing shows Claude 4.7's tokenizer uses 1.47x as many tokens on technical documentation, exceeding Anthropic's claimed 1.0-1.35x range. Real Claude Code content averaged a 1.325x increase, and code content was hit harder (1.29-1.39x) than prose (1.20x).
Testing of Anthropic's Claude Opus 4.7 tokenizer reveals higher token costs than officially stated. Using Anthropic's token counter API, a researcher measured seven real-world samples, including CLAUDE.md files, user prompts, and code diffs. Results showed a 1.325x weighted average increase across real content, with technical docs hitting 1.47x, exceeding Anthropic's upper 1.35x estimate. Code content was impacted more heavily (1.29-1.39x) than English prose (1.20x), while CJK languages saw minimal change (1.01x). The tokenizer appears to use shorter sub-word merges for English and code patterns. The practical effect is faster quota consumption, higher cached-prefix costs, and earlier rate-limit hits despite unchanged pricing. Characters-per-token for English dropped from 4.33 to 3.60, suggesting the tokenizer trades efficiency for other improvements.
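The characters-per-token figures imply the prose multiplier directly: for the same text, the token count scales inversely with chars-per-token. A minimal sketch of that arithmetic, using the 4.33 and 3.60 values reported above (the function name is illustrative, not from the article):

```python
# Sketch of the multiplier arithmetic implied by the article's figures.
# Assumes tokens ~= characters / chars_per_token for a given tokenizer.

def token_multiplier(old_cpt: float, new_cpt: float) -> float:
    """Token-count ratio when chars-per-token drops from old_cpt to new_cpt."""
    # Same text, so the character count cancels:
    # new_tokens / old_tokens = old_cpt / new_cpt
    return old_cpt / new_cpt

# English prose: 4.33 -> 3.60 chars per token
print(f"{token_multiplier(4.33, 3.60):.2f}x")  # 1.20x, matching the reported prose figure
```

The same relation explains why the 1.47x technical-doc result implies a proportionally steeper drop in chars-per-token for that content type.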