Python — CJK, IME & Unicode bugs and fixes (6 cases)

Kana / romaji 4

pykakasi: missing っでぃ (ddi) sokuon in Hepburn/Kunrei romaji
pykakasi Python open

The geminated d + small i sequence っでぃ (ddi) has no Hepburn or Kunrei entry, so loanword spellings that use it romanize incorrectly.
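The usual gemination rule (a small っ doubles the first consonant of the following syllable) can be sketched as below. This is a minimal illustration, not pykakasi's implementation: the `ROMAJI` table and the `romanize_sketch` function are hypothetical, and the table holds only a few entries needed for the example.

```python
# Tiny hypothetical kana table; a real converter would cover the full syllabary.
ROMAJI = {"でぃ": "di", "てぃ": "ti", "べ": "be", "ど": "do"}


def romanize_sketch(text):
    """Romanize text, doubling the next consonant after a small tsu."""
    out = []
    i = 0
    geminate = False
    while i < len(text):
        if text[i] == "っ":
            # Remember the sokuon and apply it to the next syllable.
            geminate = True
            i += 1
            continue
        # Longest match first: two-character digraph, then single kana.
        for size in (2, 1):
            chunk = text[i:i + size]
            if chunk in ROMAJI:
                roma = ROMAJI[chunk]
                if geminate:
                    roma = roma[0] + roma
                    geminate = False
                out.append(roma)
                i += size
                break
        else:
            # Unknown character: pass it through unchanged.
            out.append(text[i])
            geminate = False
            i += 1
    return "".join(out)
```

With the digraph でぃ present in the table, `romanize_sketch("っでぃ")` yields `ddi`. The sketch ignores Hepburn's special case for っち, which is written `tchi` rather than `cchi`.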

jaconv kana2alphabet does not romanize small katakana ヵ/ヶ
jaconv Python open

kana2alphabet does not handle the small katakana ヵ/ヶ (small ka/ke), so counters like 一ヶ月 are mis-romanized.
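One possible workaround is to rewrite the small katakana before romanizing. The sketch below assumes that ヵ/ヶ directly after a numeral is the counter read "ka"; the `expand_small_ka_ke` helper is hypothetical and is not part of jaconv.

```python
import re

# Small ヵ/ヶ directly after an ASCII, full-width, or kanji numeral.
_COUNTER_KA = re.compile(r"(?<=[0-9０-９一二三四五六七八九十百千])[ヵヶ]")


def expand_small_ka_ke(text):
    """Rewrite counter-style ヵ/ヶ as hiragana か; leave other uses alone."""
    return _COUNTER_KA.sub("か", text)
```

For example, `expand_small_ka_ke("一ヶ月")` gives `一か月`. Place names such as 霞ヶ関, where ヶ is read "ga", are left untouched because no numeral precedes the ヶ.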

pykakasi fails to romanize half-width katakana with voiced marks
pykakasi Python closed

pykakasi does not romanize half-width katakana correctly, particularly when a half-width voiced or semi-voiced mark (U+FF9E / U+FF9F) follows the base kana.

Python unidecode mangles half-width katakana with dakuten/handakuten
unidecode Python closed

Python unidecode transliterates half-width katakana carrying dakuten/handakuten incorrectly, producing artifacts, while hiragana and full-width katakana romanize correctly.
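For both half-width cases above, a common workaround is to fold the input to full-width kana with Unicode NFKC normalization before transliterating. The sketch below uses only the standard library; the `fold_halfwidth_kana` name is hypothetical, and whether either library applies a similar step internally is not stated here.

```python
import unicodedata


def fold_halfwidth_kana(text):
    """Fold half-width kana and their voicing marks into full-width kana.

    NFKC maps each half-width kana to its full-width form and composes a
    following voiced (U+FF9E) or semi-voiced (U+FF9F) mark with its base.
    """
    return unicodedata.normalize("NFKC", text)


# Half-width KA (U+FF76) + voiced mark (U+FF9E) becomes full-width GA (U+30AC).
folded = fold_halfwidth_kana("\uff76\uff9e")
```

Note that NFKC is broader than kana folding: it also converts full-width ASCII letters and digits and other compatibility characters, so it may be too aggressive if those must be preserved.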

split() word count treats a spaceless CJK answer as one word (omi)
omi Python open

Onboarding decides whether a spoken answer has enough content by checking len(transcript.split()) >= 2. For CJK text that has no spaces, str.split() returns a single-element list, so a full answer like 東京に住んでいます is counted as one word, never reaches the LLM check, and the question stays marked unanswered for Japanese, Chinese, and Korean speakers.
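A script-aware check could fall back to counting CJK characters when the whitespace word count is too low. This is a minimal sketch, not omi's code: `has_enough_content`, the character ranges, and the `min_cjk_chars=4` threshold are all assumptions chosen for illustration.

```python
import re

# Rough coverage: hiragana/katakana, CJK Unified Ideographs (plus Extension A),
# and Hangul syllables. Other scripts without spaces are not handled.
_CJK_CHAR = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def has_enough_content(transcript, min_words=2, min_cjk_chars=4):
    """Return True if the transcript looks long enough to send to the LLM."""
    # Original behaviour: count whitespace-separated words.
    if len(transcript.split()) >= min_words:
        return True
    # Fallback for spaceless scripts: count CJK characters instead.
    return len(_CJK_CHAR.findall(transcript)) >= min_cjk_chars
```

With these thresholds, `has_enough_content("東京に住んでいます")` returns `True`, while a bare `"yes"` still returns `False`. A proper word segmenter would be more accurate than a character count, at the cost of an extra dependency.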

kanji2number cannot parse 萬, the daiji form of 万 (Kanjize)
Kanjize Python open

kanji2number cannot parse 萬, the daiji (大字) traditional form of 万 (10,000), so legal and financial documents that use 大字 numerals fail to convert.
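Until the parser accepts 大字 directly, one workaround is to map daiji characters to their common forms before conversion. The sketch below is a pre-processing step only and does not call Kanjize; the `normalize_daiji` helper is hypothetical and the table is a partial list.

```python
# Partial daiji-to-common mapping; extend as needed.
DAIJI_TO_COMMON = str.maketrans({
    "壱": "一",
    "弐": "二",
    "参": "三",
    "拾": "十",
    "萬": "万",
})


def normalize_daiji(text):
    """Replace daiji numerals with the common kanji numerals."""
    return text.translate(DAIJI_TO_COMMON)
```

For example, `normalize_daiji("壱萬")` gives `一万`, which can then be passed to a kanji numeral parser. Apply this only to strings known to be numerals, since characters such as 参 also occur in ordinary words (参加).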

Other stacks