Width / normalization · Rust · open

Meilisearch charabia misdetects the script of halfwidth katakana (Japanese search)

Fix script detection of halfwidth katakana and fullwidth forms

charabia · meilisearch/charabia

Symptom

Meilisearch's charabia tokenizer misclassifies halfwidth katakana (U+FF65–U+FF9F) and some fullwidth forms as the wrong script. The text is then handed to the wrong tokenizer, and Japanese search fails.
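
To illustrate why a block-based check can miss these characters, here is a minimal sketch. It is not charabia's actual detection code; `is_katakana_block` and `all_katakana_block` are hypothetical helpers that test only the main Katakana block (U+30A0–U+30FF). Halfwidth katakana live in a separate block, so the same word written in halfwidth form fails the check.

```rust
/// Hypothetical, simplified check: true only for the main Katakana block.
/// Assumption for illustration; charabia's real detection logic differs.
fn is_katakana_block(c: char) -> bool {
    ('\u{30A0}'..='\u{30FF}').contains(&c)
}

/// True if `text` is non-empty and every character is in the main Katakana block.
fn all_katakana_block(text: &str) -> bool {
    !text.is_empty() && text.chars().all(is_katakana_block)
}
```

With these helpers, `all_katakana_block("カタカナ")` holds while `all_katakana_block("ｶﾀｶﾅ")` does not, because 'カ' is U+30AB and its halfwidth counterpart 'ｶ' is U+FF76.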

Minimal repro

Index halfwidth katakana text in Meilisearch, then search for the same terms. The results are missing because script detection classifies the halfwidth katakana as non-Japanese.

Fix

Extend script detection to recognize U+FF65–U+FF9F (halfwidth katakana) as Japanese script.
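
The range extension described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the fix PR: the `Script` enum, `is_halfwidth_katakana`, and `classify` are hypothetical names, and the classifier only distinguishes Japanese kana from everything else.

```rust
/// Hypothetical two-way script label used only for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Script {
    Japanese,
    Other,
}

/// True for halfwidth katakana, U+FF65 through U+FF9F inclusive.
fn is_halfwidth_katakana(c: char) -> bool {
    ('\u{FF65}'..='\u{FF9F}').contains(&c)
}

/// Simplified per-character classifier (an assumption, not charabia's API).
fn classify(c: char) -> Script {
    match c {
        // Hiragana (U+3040–U+309F) and the main Katakana block (U+30A0–U+30FF).
        '\u{3040}'..='\u{309F}' | '\u{30A0}'..='\u{30FF}' => Script::Japanese,
        // The extension: halfwidth katakana now also count as Japanese.
        c if is_halfwidth_katakana(c) => Script::Japanese,
        _ => Script::Other,
    }
}
```

Keeping the halfwidth range in its own predicate makes the boundary easy to test: U+FF65 and U+FF9F classify as Japanese, while U+FFA0 (the next code point) does not.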

Fix PR → #charabia-halfwidth-katakana-script
