Width / normalization
Rust
open
Meilisearch charabia mis-detects the script of halfwidth katakana (Japanese search)
Fix script detection of halfwidth katakana and fullwidth forms
charabia · meilisearch/charabia
Symptom
Meilisearch's charabia tokenizer assigns the wrong script to halfwidth katakana (U+FF65–U+FF9F) and to some fullwidth forms. The affected text is then handled by the wrong tokenizer, so Japanese searches fail to match.
Minimal repro
Index text containing halfwidth katakana in Meilisearch, then search for the same terms. The expected results are missing because script detection classifies halfwidth katakana as non-Japanese.
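A character-level view of the repro input may help. The sketch below is not charabia's code: `in_katakana_block` and `missed_by_naive_check` are hypothetical helpers that only show why a detector limited to the main Katakana block (U+30A0–U+30FF) would see nothing Japanese in a halfwidth string such as ｶﾀｶﾅ.

```rust
/// Hypothetical naive check (NOT charabia's implementation): accepts only
/// the main Katakana block, U+30A0..=U+30FF.
fn in_katakana_block(c: char) -> bool {
    ('\u{30A0}'..='\u{30FF}').contains(&c)
}

/// Returns the characters of `text` that the naive check above would not
/// treat as katakana.
fn missed_by_naive_check(text: &str) -> Vec<char> {
    text.chars().filter(|c| !in_katakana_block(*c)).collect()
}
```

Under this assumption, `missed_by_naive_check("カタカナ")` is empty, while the halfwidth spelling `"\u{FF76}\u{FF80}\u{FF76}\u{FF85}"` (ｶﾀｶﾅ) comes back in full, because every one of its characters lies outside the block.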
Fix
Extend script detection to recognize U+FF65–U+FF9F (halfwidth katakana) as Japanese script.
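A minimal sketch of the range check described above, under stated assumptions: the `Script` enum, `is_halfwidth_katakana`, and `classify` are hypothetical names for illustration and do not reflect charabia's actual types or detection API. Kanji and the fullwidth forms mentioned in the title are left out deliberately.

```rust
/// Simplified, hypothetical script tag (charabia's real enum differs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Japanese,
    Other,
}

/// True for the halfwidth katakana range named in the fix: U+FF65..=U+FF9F.
fn is_halfwidth_katakana(c: char) -> bool {
    matches!(c, '\u{FF65}'..='\u{FF9F}')
}

/// Illustrative classifier: kana blocks plus the halfwidth katakana range.
fn classify(c: char) -> Script {
    match c {
        // Hiragana (U+3040..=U+309F) and Katakana (U+30A0..=U+30FF) blocks.
        '\u{3040}'..='\u{30FF}' => Script::Japanese,
        // The extension: halfwidth katakana now maps to Japanese as well.
        c if is_halfwidth_katakana(c) => Script::Japanese,
        _ => Script::Other,
    }
}
```

With this check, `classify('\u{FF76}')` (ｶ) yields `Script::Japanese`, while the neighbouring code points U+FF64 and U+FFA0, which fall just outside the range, remain `Script::Other`.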