greymoth proof, not pitch Writing · Nº 01

field note / surrogate & grapheme / 2026-07-03

A width check said the string was safe to cut. It split a kanji in half.

A text-boundary failure: measure a string by one unit, then cut it by another. A one-line fast path built for ASCII sliced a character mid-code-point — and it bit hardest on 𠮷 (U+20BB7), a rare form of a real Japanese surname, precisely because that character is a surrogate pair and full-width at once. The bug is universal; the rare kanji is just where it surfaces first.

A name went into a terminal table and came out broken. The surname was 𠮷田. That first character is not the ordinary 吉 you get from the 吉 key, it is 𠮷 (U+20BB7), a rarer form that real people in Japan actually have on their family register. The table truncated the cell to fit a column, and what printed was 𠮷 followed by a replacement character. The kanji had been cut in half.

The interesting part is where the bug lived. Not in the truncation loop. In a one-line shortcut that decided, before truncating, that this particular string was safe to cut by raw index. It was wrong, and it was wrong for a reason that only shows up on the exact character I just described.

Three numbers that are usually the same, and one string where they aren't

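A JavaScript string has more than one length, depending on what you ask: the number of UTF-16 code units (what .length returns), the number of code points, and the number of display columns it occupies in a terminal.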

For plain ASCII these all collapse to the same number. "abc" is 3 code units, 3 code points, 3 columns. That coincidence is what a lot of text code quietly leans on. It holds right up until a character makes two of those numbers agree for different reasons.

𠮷 is exactly that character. Two code units because it is a surrogate pair. Two columns because it is wide. Same number, 2, arrived at two completely different ways. Hold onto that, it is the whole bug.
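
The three counts can be read off directly. This is a minimal sketch, not the library's code: roughWidth is a hypothetical, deliberately simplified stand-in for a real width library such as string-width, and it only knows a handful of wide ranges so the example runs with no dependencies.

```javascript
// Hypothetical, simplified width function: NOT string-width, just enough
// ranges to illustrate the point. Wide code points count as 2 columns.
function roughWidth(str) {
  let width = 0;
  for (const ch of str) {                 // for...of iterates by code point
    const cp = ch.codePointAt(0);
    const wide =
      (cp >= 0x3040 && cp <= 0x30ff) ||   // hiragana, katakana
      (cp >= 0x4e00 && cp <= 0x9fff) ||   // CJK Unified Ideographs
      (cp >= 0xac00 && cp <= 0xd7a3) ||   // Hangul syllables
      (cp >= 0x1f300 && cp <= 0x1faff) || // common emoji blocks
      (cp >= 0x20000 && cp <= 0x3fffd);   // CJK Extension B and beyond
    width += wide ? 2 : 1;
  }
  return width;
}

// Report all three "lengths" of one string side by side.
function measure(str) {
  return {
    codeUnits: str.length,               // UTF-16 code units
    codePoints: Array.from(str).length,  // Unicode code points
    columns: roughWidth(str),            // approximate terminal columns
  };
}

for (const s of ['abc', '漢字', '\u{20BB7}', '\u{1F600}']) {
  console.log(JSON.stringify(s), measure(s));
}
```

For 𠮷 the code-unit count and the column count both come out as 2 while the code-point count is 1, which is the coincidence the rest of this note is about.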

The real code

This is the truncation helper in cli-table3, the library a lot of CLIs use to draw tables. strlen here is display width. It strips ANSI color codes and runs the string through string-width, which counts a wide CJK character as 2. So strlen answers "how many columns," not "how many characters."

function truncateWidth(str, desiredLength) {
  if (str.length === strlen(str)) {
    return str.substr(0, desiredLength);
  }

  while (strlen(str) > desiredLength) {
    str = str.slice(0, -1);
  }

  return str;
}

Read the first branch as an optimization. "If the code-unit length equals the display width, then every character is one unit and one column, so there are no wide characters and nothing tricky, I can just cut by index with substr." For "abc" that is true, 3 === 3, cut away.

Now feed it "𠮷𠮷". Code-unit length is 4. Display width is 4. 4 === 4, so the branch fires and it cuts by code unit:

"𠮷𠮷".substr(0, 3)   // "𠮷" + "\uD842"

substr(0, 3) takes three code units: the full first 𠮷, then the high surrogate of the second one. The low surrogate is left behind. You get one clean kanji followed by a lone high surrogate \uD842, which is not a character at all. Terminals render it as the replacement box. That is the half a kanji in the table cell.
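
You can see the split without a terminal by printing the code units of the result. A small sketch using only built-in string methods:

```javascript
const doubled = '\u{20BB7}\u{20BB7}';   // "𠮷𠮷": 4 code units, 2 code points
const cut = doubled.substr(0, 3);       // substr counts code units

// Collect each UTF-16 code unit of the result in hex.
const units = [];
for (let i = 0; i < cut.length; i++) {
  units.push(cut.charCodeAt(i).toString(16));
}
console.log(units);                     // [ 'd842', 'dfb7', 'd842' ]

// The last unit is a high surrogate with no low surrogate after it.
const last = cut.charCodeAt(cut.length - 1);
console.log(last >= 0xd800 && last <= 0xdbff); // true
```

The first two units, d842 dfb7, are the intact 𠮷. The trailing d842 is the orphaned half.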

The shortcut was built for the case where length equals width because everything is one-to-one. A surrogate-pair wide character satisfies length === width too, 2 === 2, but for the opposite reason, both numbers are 2 because the character is doubled on both axes. It walks straight into the fast path and gets sliced by index, which is the one thing that path assumed it would never have to do.

Why it survived

The obvious question is how a CJK bug survives in a table library that people clearly use with CJK. The answer is that ordinary Japanese and Chinese text never reaches this branch.

Take 漢. It is U+6F22, inside the BMP, so "漢".length is 1. Its width is 2. 1 === 2 is false, so 漢 skips the fast path entirely and goes to the while loop below. Every common kanji, every kana, every Hangul syllable behaves this way: one code unit, two columns, length never equals width. They are all safe.

The fast path only misfires when a single character is a surrogate pair and wide. That intersection is small. It is CJK Extension B and beyond, the rare kanji that show up in personal names and place names, plus emoji, which are also non-BMP and mostly width 2. So the library worked for years of 東京 and 漢字 and quietly mangled 𠮷田 and anything with an emoji in a narrow column. The common case took a different branch, so the shortcut looked safe.

The slow path had a milder version of the same disease, by the way. str.slice(0, -1) removes one code unit, not one character. Hand the loop a string ending in a surrogate pair and it lops off a low surrogate on the first pass and leaves the high one dangling. Same family, quieter symptom.
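
The slow-path version is just as easy to reproduce. In this sketch dropLastCodePoint is a hypothetical helper name of mine, shown only to contrast the two units:

```javascript
const name = 'a\u{20BB7}';                        // "a𠮷": 3 code units

// slice(0, -1) removes one code unit, so only the low surrogate goes.
const byUnit = name.slice(0, -1);
console.log(byUnit.length);                       // 2
console.log(byUnit.charCodeAt(1).toString(16));   // d842, a lone high surrogate

// Hypothetical helper: drop the last whole code point instead.
function dropLastCodePoint(str) {
  return Array.from(str).slice(0, -1).join('');
}
console.log(dropLastCodePoint(name));             // a
```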

The fix

Two changes. Guard the fast path so it refuses any string that contains a high surrogate, and make the slow path trim whole code points instead of code units.

function truncateWidth(str, desiredLength) {
  // `str.length === strlen(str)` is also true for surrogate-pair characters
  // (e.g. CJK Extension B or emoji), which count as 2 code units and 2 columns.
  // `substr`/`slice` cut by code unit, so exclude them here and trim by code
  // point below to avoid splitting a surrogate pair into a lone surrogate.
  if (str.length === strlen(str) && !/[\uD800-\uDBFF]/.test(str)) {
    return str.substr(0, desiredLength);
  }

  let chars = Array.from(str);
  while (strlen(chars.join('')) > desiredLength) {
    chars.pop();
  }

  return chars.join('');
}

Array.from(str) iterates by code point, so Array.from("𠮷𠮷") is a two-element array, each element a whole kanji. pop() removes one whole character. The loop can no longer stop in the middle of a surrogate pair because there is no middle to stop in. The fast path stays for the genuinely simple case, ASCII and other strings with no surrogates, where substr is both correct and cheaper.

Worth naming the tools. Array.from and the spread operator both split by code point, which fixes surrogate pairs. They do not split by grapheme, so a flag emoji or a family emoji built from several code points joined with zero-width joiners will still come apart. If you need whole user-perceived characters, that is Intl.Segmenter with granularity: 'grapheme'. Code point was the right level here because the unit of width is the code point, but know which one you are reaching for.
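
To make the difference between the two levels concrete, here is a small sketch that counts the same strings three ways. It assumes a runtime that ships Intl.Segmenter (recent Node versions and current browsers do); the graphemes helper is my own name, not part of any library.

```javascript
// Split into user-perceived characters (grapheme clusters).
function graphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

const kanji = '\u{20BB7}';                                 // 𠮷
const flag = '\u{1F1EF}\u{1F1F5}';                         // two regional indicators
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';  // three emoji joined by ZWJ

for (const s of [kanji, flag, family]) {
  console.log({
    codeUnits: s.length,
    codePoints: Array.from(s).length,
    graphemes: graphemes(s).length,
  });
}
```

The kanji is one code point and one grapheme, so Array.from is enough for it. The flag and the family are each one grapheme built from several code points, which is where a code-point split still comes apart.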

The failing fixture

This is the test that goes red before the fix and green after. It is the whole point, because the fix is a few lines and the value is keeping it fixed, not finding it once.

it('does not split a surrogate-pair wide char (CJK Ext B)', function () {
  let kanji = String.fromCodePoint(0x20bb7);          // 𠮷
  expect(truncate('a' + kanji + 'bc', 4)).toEqual('a' + kanji + '…');
  expect(truncate('a' + kanji + 'bc', 3)).toEqual('a…');
  expect(truncate(kanji + kanji, 3)).toEqual(kanji + '…');
});

it('does not split a surrogate-pair wide char (emoji)', function () {
  let emoji = String.fromCodePoint(0x1f600);
  expect(truncate('a' + emoji + 'bc', 3)).toEqual('a…');
  expect(truncate('x' + emoji + emoji + 'y', 4)).toEqual('x' + emoji + '…');
});

Note the inputs are built with String.fromCodePoint, not pasted glyphs. That keeps the test readable in any editor and makes the code point explicit, so nobody later "cleans up" 𠮷 into 吉 and deletes the coverage without noticing. The assertion that matters most is truncate(kanji + kanji, 3): a width budget that lands between the two columns of the second character. The old code returned a lone surrogate there. That is the exact spot the bug lives.

The check, for the next one

The general shape is bigger than one library. Any code that truncates, pads, aligns, or measures text is juggling three different numbers for one string, and it is only correct if it uses the same one throughout:

string    code units (.length)    code points    display columns
abc       3                       3              3
漢字      2                       2              4
𠮷        2                       1              2
😀        2                       1              2

The failure mode is always the same: measure by one number, cut by another. cli-table3 measured width, then cut by code unit, and the two disagreed on the one character where they happened to be equal for different reasons. So the check is a habit, not a rule. When you slice a string with substr, slice, or a bare index, ask what unit that index is in. It is code units. Then ask whether the length you compared it against was in the same unit. If you measured display width or code points and then cut by index, you have this bug, and it is invisible until a non-BMP character walks through.

And test it deliberately. One CJK Extension B character, String.fromCodePoint(0x20bb7), and one emoji, at a width that lands mid-character. ASCII will never show you this. You have to hand the function the input it is quietly afraid of.
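
One way to make that deliberate is to sweep every budget instead of picking one. The sketch below is an assumption about how you might wire it up, not code from any of the libraries discussed: hasLoneSurrogate and badBudgets are hypothetical helper names, and the two cut functions are toy stand-ins for whatever truncation you are actually testing.

```javascript
// True if the string contains a surrogate code unit without its partner.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = str.charCodeAt(i + 1);    // NaN past the end
      if (!(next >= 0xdc00 && next <= 0xdfff)) return true;
      i++;                                    // skip the valid low surrogate
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true;                            // low surrogate with no high before it
    }
  }
  return false;
}

// Run `cut` at every budget from 0 to str.length and collect the budgets
// whose output is malformed.
function badBudgets(cut, str) {
  const bad = [];
  for (let n = 0; n <= str.length; n++) {
    if (hasLoneSurrogate(cut(str, n))) bad.push(n);
  }
  return bad;
}

const sample = 'a\u{20BB7}b\u{1F600}';        // a𠮷b😀
const byCodeUnit = (s, n) => s.slice(0, n);   // budget counted in code units
const byCodePoint = (s, n) => Array.from(s).slice(0, n).join(''); // in code points

console.log(badBudgets(byCodeUnit, sample));  // [ 2, 5 ]
console.log(badBudgets(byCodePoint, sample)); // []
```

The sweep finds the mid-character budgets for you, so the test does not depend on someone working out by hand which width lands inside a pair.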

This one is a single entry in a corpus of 97 real CJK, IME, and Unicode failures I have been collecting, most of them one-line fixes hiding in libraries that work perfectly in English. The same split-a-code-point shape shows up in four more places — and below, unlike the short version of this post, I've pulled each one out of the corpus in full so you can see the shape repeat, and check every diagnosis against a diff.

The same shape, four more places

cli-table3 measured one unit and cut by another. Once you hold that shape in your head you start seeing it everywhere: a font decode table, a rich-text cursor, a second truncation helper, a punctuation pass. All four below sit in the same corpus category — surrogate & grapheme: code that walks text by UTF-16 code unit or bare code point instead of by grapheme cluster. One of them isn't even CJK — slate splits Hindi conjuncts — which is the point: this is a universal text-boundary error, not a Japanese-only one. Two are still open and two merged nowhere; I'm keeping the closed ones in the record because the diagnosis stands whether or not the maintainer took the diff.

opentype.js: cmap clamp (open · PR #858)
Shape: The decode side of the same coin. A cmap format 12/13 subtable maps character codes to glyphs, and opentype.js never clamped those codes to the Unicode ceiling of U+10FFFF.
Breaks: A malformed font with out-of-range codes sends glyph lookup off the end for supplementary characters, so emoji and CJK Extension B+ resolve to the wrong glyph.
Fix: Clamp every format 12/13 startCharCode / endCharCode to 0x10FFFF during parsing.

slate: Indic conjuncts (open · PR #6074)
Shape: The next level up the ladder from code points. A code point is the right unit for width; it is the wrong unit for a cursor. Slate's grapheme segmentation skips Unicode UAX #29 rule GB9c, so an Indic conjunct cluster (consonant + virama + consonant) splits across a boundary.
Breaks: Type क्ष in Hindi, press Backspace, and only one code point is deleted instead of the whole cluster. Same class of "walked text by the wrong unit," one rung higher: code point where it needed grapheme.
Fix: Apply GB9c: treat Indic_Conjunct_Break=Linker sequences as a single grapheme cluster.

clerk: truncate, again (closed · PR #9029)
Shape: Almost a carbon copy of cli-table3, in a web UI. Clerk's truncateWithEndVisible falls back to substring / slice on raw code units in its short-width path.
Breaks: An email or name containing an emoji, rendered through the short-width fallback, ends mid-surrogate-pair and shows a ? or a replacement box, the browser's version of the terminal's half a kanji.
Fix: Split by code point with Array.from / spread before truncating, or step up to Intl.Segmenter. Same fix as cli-table3. This one didn't land.

markdown-it: smart quotes (closed · PR #1186)
Shape: The same blind spot in a text pass rather than a length. markdown-it's smart-quotes replacement decides whether a character is whitespace or punctuation with a check that doesn't account for non-BMP characters (U+10000 and up).
Breaks: A straight quote next to a supplementary symbol or emoji pairs wrong or is left un-converted: the "is this a boundary" question answered by the wrong unit again.
Fix: Make the whitespace / punctuation class checks Unicode-aware, or iterate by code point. Also closed, kept for the record.
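
The boundary-question variant is the least obvious of the four, so here is a sketch of its general shape. This is an illustration of the failure class only, not markdown-it's actual code: both function names are hypothetical, and the class test uses the standard Unicode property escapes \p{P} and \p{S} with the u flag.

```javascript
// Is the character just before index `i` a symbol or punctuation mark?
// Reading one code unit sees only half of a non-BMP character.
function prevIsSymbolByCodeUnit(str, i) {
  return /[\p{P}\p{S}]/u.test(str.charAt(i - 1));
}

// Reading a whole code point: if the unit before `i` is a low surrogate
// preceded by a high surrogate, step back two units instead of one.
function prevIsSymbolByCodePoint(str, i) {
  const low = str.charCodeAt(i - 1);
  const high = str.charCodeAt(i - 2);
  const isPair =
    low >= 0xdc00 && low <= 0xdfff && high >= 0xd800 && high <= 0xdbff;
  const prev = str.slice(isPair ? i - 2 : i - 1, i);
  return /[\p{P}\p{S}]/u.test(prev);
}

const text = '\u{1F600}"quoted"';    // an emoji directly before a straight quote
const quoteAt = text.indexOf('"');   // 2, because the emoji is two code units

console.log(prevIsSymbolByCodeUnit(text, quoteAt));  // false
console.log(prevIsSymbolByCodePoint(text, quoteAt)); // true
```

The code-unit version is handed a lone low surrogate, which belongs to no punctuation or symbol class, so it answers "not a boundary" for a character that is one.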

Read them in a row and the family resolves: measure by one unit, act by another. cli-table3 and clerk measured width and cut by code unit. opentype.js decoded by a code that outran the code-point ceiling. slate walked a cursor by code point where the unit was grapheme. markdown-it answered a boundary question by a class check blind to non-BMP. Same wrong assumption — that the convenient number and the correct number are the same one — wearing five different costumes. Grouping by the broken assumption instead of the symptom is what tells you the order to fix things, and where the next one shows up.

Read next / verify

Don't take my word for the diagnosis — the cli-table3 diff is public, read it and decide if it holds.

— greymoth (@greymoth__)
