Segmentation / word count Python open

split() word count treats a spaceless CJK answer as one word (omi)

Onboarding answer gate counts a spaceless CJK answer as one word

Symptom

Onboarding decides whether a spoken answer has enough content by checking len(transcript.split()) >= 2. For CJK text that has no spaces, str.split() returns a single-element list, so a full answer like 東京に住んでいます ("I live in Tokyo") is counted as one word and fails the check. It never reaches the LLM check, and the question stays marked unanswered for Japanese, Chinese, and Korean speakers.

Minimal repro

len('東京に住んでいます'.split())  # 1, so word_count >= 2 is False and the answer is rejected

Fix

Use the existing CJK-aware _word_count helper (already used by should_discard_conversation) instead of plain split(); it falls back to split() for non-CJK text, so English input is unchanged.
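
The body of _word_count is not shown in this report, so the following is only a minimal sketch of how a CJK-aware count with a split() fallback might look. The names cjk_aware_word_count and has_enough_content, the chosen Unicode ranges, and the rule of counting each CJK character as one word are all assumptions made for illustration, not the project's actual implementation.

```python
import re

# Assumption: Hiragana/Katakana, CJK Unified Ideographs (plus Extension A),
# and Hangul syllables are treated as "CJK" for counting purposes.
_CJK_CHAR = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def cjk_aware_word_count(text: str) -> int:
    """Hypothetical helper: count each CJK character as one word.

    Text without any CJK characters falls back to plain str.split(),
    so whitespace-delimited input such as English is counted as before.
    """
    cjk_chars = _CJK_CHAR.findall(text)
    if not cjk_chars:
        return len(text.split())
    # Replace CJK characters with spaces, then count what is left
    # (for example Latin words mixed into the answer) by whitespace.
    remainder = _CJK_CHAR.sub(" ", text)
    return len(cjk_chars) + len(remainder.split())


def has_enough_content(transcript: str, min_words: int = 2) -> bool:
    """Hypothetical stand-in for the onboarding gate (threshold from the report)."""
    return cjk_aware_word_count(transcript) >= min_words


print(has_enough_content("東京に住んでいます"))  # spaceless CJK answer passes the gate
print(has_enough_content("Tokyo"))               # single English word is still rejected
```

With a helper along these lines, the repro string passes the gate because it is counted per character, while a one-word English answer is still rejected exactly as it is under plain split().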

Fix PR → #omi-onboarding-cjk-wordcount