Code that walks text by UTF-16 code unit or bare code point instead of by grapheme cluster. Surrogate pairs and non-BMP characters get split, ZWJ emoji and variation selectors are mis-detected, and combining marks or conjunct clusters drift away from their base.
Surrogate & grapheme
JS
open
cli-table3 · cli-table/cli-table3
Symptom
cli-table3 truncates text by UTF-16 code-unit count rather than by code point, so a cut can land inside a surrogate pair. Emoji and supplementary CJK characters are split in half, leaving garbled characters in terminal table cells.
Minimal repro
Create a cli-table3 table with a column containing emoji (e.g. 🎉) or supplementary CJK characters, and set a column width that forces truncation mid-emoji; the output shows garbled characters.
Fix
Use a Unicode-aware splitter (spread operator or Array.from) to iterate code points rather than code units when truncating.
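A minimal sketch of the code-point-safe truncation described in this fix. The helper name and signature are hypothetical and are not part of cli-table3's API; terminal column width (an emoji usually occupies two columns) is a separate concern that this sketch ignores.

```typescript
// Hypothetical helper for illustration only; not cli-table3's own code.
function truncateByCodePoint(text: string, maxCodePoints: number, ellipsis = "…"): string {
  // Array.from iterates by code point, so a surrogate pair stays in one entry.
  const codePoints = Array.from(text);
  if (codePoints.length <= maxCodePoints) return text;
  // Reserve room for the ellipsis, measured in code points as well.
  const keep = Math.max(0, maxCodePoints - Array.from(ellipsis).length);
  return codePoints.slice(0, keep).join("") + ellipsis;
}
```

By contrast, "ab🎉cd".slice(0, 3) cuts by code unit and ends in a lone high surrogate.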
Surrogate & grapheme
TS
open
clerk · clerk/javascript
Symptom
Clerk UI's truncateWithEndVisible function uses substring/slice on raw code units in its short-width fallback, splitting surrogate pairs in emoji or non-BMP characters.
Minimal repro
Render a Clerk UI component showing an email or name that contains emoji, narrow enough to hit the short-width fallback path; the truncated string ends mid-surrogate-pair and shows '?' or garbled characters.
Fix
Use Array.from() or spread to split by code points, or use Intl.Segmenter, before truncating.
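A sketch of the Intl.Segmenter option, assuming a runtime that ships Intl.Segmenter. The function name, parameters, and the "keep the last N graphemes" policy are assumptions for illustration; Clerk's real truncateWithEndVisible is not reproduced here.

```typescript
// Hypothetical sketch; not Clerk's actual implementation or signature.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

function truncateKeepingEnd(text: string, maxGraphemes: number, endLength = 2): string {
  // Split into user-perceived characters before doing any index arithmetic.
  const graphemes = Array.from(graphemeSegmenter.segment(text), (s) => s.segment);
  if (graphemes.length <= maxGraphemes) return text;
  const tail = graphemes.slice(graphemes.length - endLength);
  // One slot is reserved for the ellipsis between head and tail.
  const headLength = Math.max(0, maxGraphemes - endLength - 1);
  return graphemes.slice(0, headLength).join("") + "…" + tail.join("");
}
```

Because every cut lands on a grapheme boundary, neither the head nor the tail can end in half of a surrogate pair.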
Surrogate & grapheme
JS
open
opentype.js · opentypejs/opentype.js
Symptom
opentype.js does not clamp cmap format 12/13 character codes to U+10FFFF; malformed fonts with out-of-range codes cause incorrect glyph lookups for supplementary characters.
Minimal repro
Load a font with a cmap subtable containing entries beyond U+10FFFF; glyph lookup for supplementary characters (emoji, CJK Extension B+) returns wrong glyph.
Fix
Clamp all format 12/13 startCharCode/endCharCode values to 0x10FFFF during parsing.
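A sketch of the clamping step under stated assumptions: the CmapGroup shape below is hypothetical (opentype.js's internal representation may differ), and dropping groups that start beyond U+10FFFF is a design choice of this sketch rather than something the fix prescribes.

```typescript
// Hypothetical shape for one cmap format 12/13 group.
interface CmapGroup {
  startCharCode: number;
  endCharCode: number;
  glyphId: number;
}

const MAX_CODE_POINT = 0x10ffff;

function clampGroups(groups: CmapGroup[]): CmapGroup[] {
  return groups
    // A group that starts past the Unicode range, or is inverted, maps nothing valid.
    .filter((g) => g.startCharCode <= MAX_CODE_POINT && g.startCharCode <= g.endCharCode)
    // A group that merely overshoots is cut back to the last valid code point.
    .map((g) => ({ ...g, endCharCode: Math.min(g.endCharCode, MAX_CODE_POINT) }));
}
```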
Surrogate & grapheme
JS
closed
markdown-it · markdown-it/markdown-it
Symptom
markdown-it's smart-quotes replacement does not recognize non-BMP punctuation and symbols (U+10000 and above); when a quote sits next to a supplementary character, the quotes are paired wrongly or not converted at all.
Minimal repro
Run markdown-it's smartquotes rule on text where a quote is adjacent to an emoji or another supplementary-plane character (e.g. '𝕳ello'); the smart-quote pairing is incorrect.
Fix
Use a regex that is Unicode-aware for the 'whitespace' and 'punctuation' character class checks, or use Array.from for code-point iteration.
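A sketch of reading the neighbours of a quote by whole code point and classifying them with a Unicode-aware regex. The helper names are hypothetical and this is not markdown-it's code; it only illustrates the two ingredients the fix names.

```typescript
// \p{P} and \p{S} need the "u" flag; built via the constructor to keep the pattern explicit.
const PUNCT_OR_SYMBOL = new RegExp("^[\\p{P}\\p{S}]$", "u");

// Returns the whole code point that ends just before `index`, or "" at the start.
function codePointBefore(text: string, index: number): string {
  if (index <= 0) return "";
  const last = text.charCodeAt(index - 1);
  if (last >= 0xdc00 && last <= 0xdfff && index >= 2) {
    const high = text.charCodeAt(index - 2);
    // A low surrogate preceded by a high surrogate forms one supplementary character.
    if (high >= 0xd800 && high <= 0xdbff) return text.slice(index - 2, index);
  }
  return text[index - 1];
}

// Returns the whole code point that starts at `index`, or "" at the end.
function codePointAtIndex(text: string, index: number): string {
  const cp = text.codePointAt(index);
  return cp === undefined ? "" : String.fromCodePoint(cp);
}

function isPunctOrSymbol(ch: string): boolean {
  return ch !== "" && PUNCT_OR_SYMBOL.test(ch);
}
```

With these helpers, the character before the closing quote in "😀" is seen as one symbol rather than as a lone low surrogate.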
Surrogate & grapheme
JS
open
slate · ianstormtaylor/slate
Symptom
Slate's rich-text editor does not implement rule GB9c of Unicode UAX #29, so Indic conjunct clusters (consonant + virama + consonant sequences) are split across grapheme boundaries. This causes incorrect cursor positioning and deletion in Hindi, Bengali, and other Indic scripts covered by the rule.
Minimal repro
Type a conjunct consonant in Hindi (e.g. 'क्ष') in Slate and press Backspace; only one code point is deleted instead of the full cluster.
Fix
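Apply UAX #29 rule GB9c: do not break between a consonant and the following consonant when they are joined by an Indic_Conjunct_Break=Linker character (a virama), so the whole conjunct forms a single grapheme cluster.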
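A deliberately narrow sketch of the GB9c idea, assuming the input is a list of clusters produced by a segmenter that predates the rule. It handles only Devanagari, only the basic consonant range KA..HA, and only a virama at the very end of a cluster; the real rule is driven by the Indic_Conjunct_Break property and also allows Extend characters around the linker. All names are hypothetical and none of this is Slate's code.

```typescript
const DEVANAGARI_VIRAMA = "\u094D"; // the Linker for Devanagari

// Assumption: only the basic consonants KA..HA (U+0915..U+0939) are considered.
function isDevanagariConsonant(cp: number | undefined): boolean {
  return cp !== undefined && cp >= 0x0915 && cp <= 0x0939;
}

// Re-joins clusters that an older segmenter split after a virama.
function mergeConjuncts(clusters: string[]): string[] {
  const merged: string[] = [];
  for (const cluster of clusters) {
    const previous = merged[merged.length - 1];
    const joins =
      previous !== undefined &&
      isDevanagariConsonant(previous.codePointAt(0)) &&
      previous.endsWith(DEVANAGARI_VIRAMA) &&
      isDevanagariConsonant(cluster.codePointAt(0));
    if (joins) {
      merged[merged.length - 1] = previous + cluster;
    } else {
      merged.push(cluster);
    }
  }
  return merged;
}
```

Backspace and cursor movement can then operate on the merged clusters, so 'क्ष' is removed as one unit.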
Surrogate & grapheme
merged
wenmode · lepture/wenmode
Symptom
wenmode's is_punctuation function treats Unicode combining marks (Mn category) and format characters (Cf/ZWJ) as punctuation per the CommonMark spec, incorrectly suppressing valid emphasis around CJK text with diacritics or ZWJ sequences.
Minimal repro
Write Markdown with emphasis (* or _) adjacent to a combining mark or a ZWJ character; the emphasis fails to render because the mark is classified as punctuation.
Fix
Exclude the Unicode General Categories Mn, Mc, and Cf from the punctuation classification; add an ASCII fast path to avoid a performance regression.
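A sketch of the classification described in this fix, written in TypeScript purely for illustration; it is not wenmode's is_punctuation, and the function name is hypothetical. It assumes punctuation means general categories P and S, with an explicit guard for Mn, Mc, and Cf.

```typescript
const UNICODE_PUNCT = new RegExp("^[\\p{P}\\p{S}]$", "u");
const MARK_OR_FORMAT = new RegExp("^[\\p{Mn}\\p{Mc}\\p{Cf}]$", "u");

// Expects a string holding a single code point.
function isPunctuation(ch: string): boolean {
  const cp = ch.codePointAt(0);
  if (cp === undefined) return false;
  if (cp < 0x80) {
    // ASCII fast path: ! to /, : to @, [ to `, { to ~. No regex is run.
    return (
      (cp >= 0x21 && cp <= 0x2f) ||
      (cp >= 0x3a && cp <= 0x40) ||
      (cp >= 0x5b && cp <= 0x60) ||
      (cp >= 0x7b && cp <= 0x7e)
    );
  }
  // Explicit guard: combining marks and format characters (including ZWJ)
  // never count as punctuation, whatever the table below says.
  if (MARK_OR_FORMAT.test(ch)) return false;
  return UNICODE_PUNCT.test(ch);
}
```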
Surrogate & grapheme
JS
cited
closed
grapheme-splitter · orling/grapheme-splitter
Symptom
grapheme-splitter breaks ZWJ-joined emoji into parts instead of one grapheme cluster: the rainbow flag splits into its component glyphs, and skin-tone sequences come apart.
Minimal repro
new GraphemeSplitter().splitGraphemes('\u{1F3F3}\uFE0F\u200D\u{1F308}') (the rainbow flag: white flag, VS-16, ZWJ, rainbow) returns two elements instead of one.
Fix
Implement the emoji ZWJ sequence handling (sequences defined in UTS #51, boundary rule GB11 in UAX #29) so a ZWJ-joined emoji stays a single cluster.
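A simplified sketch of the two rules that matter for this report: never break before an extending character (marks, emoji modifiers, ZWJ), and never break after a ZWJ that links two pictographs. It is not grapheme-splitter's code, the names are hypothetical, and it ignores regional indicators, Hangul, and the other UAX #29 rules.

```typescript
const ZWJ = "\u200D";
// Marks (which include the variation selectors), skin-tone modifiers, and ZWJ attach to what precedes them.
const EXTENDS_PREVIOUS = new RegExp("^[\\p{M}\\p{Emoji_Modifier}\\u200D]$", "u");
const PICTOGRAPHIC = new RegExp("^\\p{Extended_Pictographic}$", "u");

function splitEmojiClusters(text: string): string[] {
  const clusters: string[] = [];
  let previous = "";
  for (const cp of Array.from(text)) {
    const current = clusters[clusters.length - 1];
    // Simplified GB11: pictograph ... ZWJ followed by another pictograph stays joined.
    const afterJoiner =
      current !== undefined &&
      previous === ZWJ &&
      PICTOGRAPHIC.test(cp) &&
      PICTOGRAPHIC.test(Array.from(current)[0]);
    if (current !== undefined && (EXTENDS_PREVIOUS.test(cp) || afterJoiner)) {
      clusters[clusters.length - 1] = current + cp;
    } else {
      clusters.push(cp);
    }
    previous = cp;
  }
  return clusters;
}
```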
Surrogate & grapheme
JS
cited
closed
lodash · lodash/lodash
Symptom
_.toArray splits an emoji built from a tag sequence (a subdivision flag) into its component code points instead of returning it as one element.
Minimal repro
_.toArray for the England flag emoji (a base flag plus tag characters) returns seven pieces instead of one.
Fix
Use a Unicode-aware iterator such as Intl.Segmenter that handles tag sequences when converting a string to an array.
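A sketch of a grapheme-aware string-to-array conversion using Intl.Segmenter, assuming a runtime that provides it. The function name is hypothetical and this is not lodash's implementation.

```typescript
// Hypothetical replacement for a code-point-based string-to-array conversion.
function toGraphemeArray(text: string): string[] {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// England flag: black flag + tag characters "gbeng" + cancel tag.
const englandFlag = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0065}\u{E006E}\u{E0067}\u{E007F}";

const byCodePoint = Array.from(englandFlag); // splits into individual code points
const byGrapheme = toGraphemeArray(englandFlag); // keeps the tag sequence together
```

Tag characters extend the preceding base, so the segmenter keeps the whole flag in one element, while code-point iteration yields the seven pieces from the report.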
Surrogate & grapheme
Windows Terminal
cited
closed
microsoft/terminal · microsoft/terminal
Symptom
Windows Terminal ignores the text-presentation variation selector U+FE0E, rendering the color-emoji form even when text style is explicitly requested.
Minimal repro
Print a text-style sequence such as U+23CF followed by U+FE0E; it renders as a color emoji instead of the text glyph.
Fix
Honor VS-15 (U+FE0E) for text presentation and VS-16 (U+FE0F) for emoji presentation, per the Unicode emoji variation sequences.
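A sketch of the selector check, written in TypeScript for illustration only; Windows Terminal's renderer is not reproduced, and the type and function names are hypothetical. It assumes the input is a single base character optionally followed by one variation selector.

```typescript
type Presentation = "text" | "emoji" | "default";

const VS15 = 0xfe0e; // text presentation selector
const VS16 = 0xfe0f; // emoji presentation selector

function requestedPresentation(sequence: string): Presentation {
  const codePoints = Array.from(sequence, (ch) => ch.codePointAt(0));
  const selector = codePoints[1]; // the code point right after the base, if any
  if (selector === VS15) return "text";
  if (selector === VS16) return "emoji";
  return "default"; // no selector: fall back to the character's default style
}
```

A renderer would consult this result before choosing between the monochrome text font and the color emoji font.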
Surrogate & grapheme
JS
cited
open
emoji-regex · mathiasbynens/emoji-regex
Symptom
emoji-regex matches a base character even when it is followed by U+FE0E (the text variation selector), so text-presentation characters are wrongly classified as emoji.
Minimal repro
emojiRegex().test('\u2757\uFE0E') returns true even though U+FE0E requests text presentation.
Fix
Exclude a match followed by VS-15 (U+FE0E); treat only a trailing VS-16 (U+FE0F) or no selector as emoji.
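A sketch of the exclusion using a negative lookahead. The one-property pattern below is a stand-in for emoji-regex's much larger generated pattern, and the function name is hypothetical.

```typescript
// Stand-in pattern: any pictographic character NOT followed by VS-15 (U+FE0E).
const EMOJI_NOT_TEXT_STYLE = new RegExp("\\p{Extended_Pictographic}(?!\\uFE0E)", "u");

function looksLikeEmoji(text: string): boolean {
  // No "g" flag, so test() keeps no state between calls.
  return EMOJI_NOT_TEXT_STYLE.test(text);
}
```

The lookahead rejects only the text-style sequence; a bare base character or one followed by VS-16 still matches.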
Surrogate & grapheme
TS
open
kaplay · kaplayjs/kaplay
Symptom
compileStyledText builds charStyleMap keyed by UTF-16 code-unit offset, but formatText later applies the styles by grapheme index (via runes()). The two indexings agree for ASCII but drift apart after any character longer than one code unit: an emoji, a ZWJ sequence, or an astral-plane CJK ideograph (CJK Extension B, e.g. names written with 𠮷). Every style after such a character lands on the wrong grapheme or is dropped.
Minimal repro
In styled text, "😀[c]x[/c]" keys the colour style at code unit 2, but runes("😀x") puts x at grapheme index 1, so the style is lost.
Fix
Make compileStyledText walk grapheme clusters with the same runes() helper formatText already uses, keying charStyleMap by grapheme index. Normalize the input to NFC up front and consume a whole grapheme per escape so the slice lengths stay consistent.
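A much-reduced sketch of keying styles by grapheme index. The tag grammar, the return shape, and the graphemes() helper (a stand-in for kaplay's runes(), here built on Intl.Segmenter) are all assumptions for illustration; kaplay's real compileStyledText handles escapes and more.

```typescript
// Stand-in for runes(): one entry per grapheme cluster.
function graphemes(text: string): string[] {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

interface CompiledText {
  text: string;
  charStyleMap: Record<number, string[]>; // keyed by grapheme index
}

function compileStyledTextSketch(source: string): CompiledText {
  const charStyleMap: Record<number, string[]> = {};
  const openStyles: string[] = [];
  let text = "";
  let graphemeIndex = 0;
  let cursor = 0;

  // Append plain text, advancing the index once per grapheme, not per code unit.
  const emit = (chunk: string): void => {
    for (const g of graphemes(chunk)) {
      if (openStyles.length > 0) charStyleMap[graphemeIndex] = [...openStyles];
      text += g;
      graphemeIndex++;
    }
  };

  for (const match of source.matchAll(/\[(\/?)(\w+)\]/g)) {
    const start = match.index ?? 0;
    emit(source.slice(cursor, start));
    if (match[1] === "/") {
      const i = openStyles.lastIndexOf(match[2]);
      if (i >= 0) openStyles.splice(i, 1);
    } else {
      openStyles.push(match[2]);
    }
    cursor = start + match[0].length;
  }
  emit(source.slice(cursor));
  return { text, charStyleMap };
}
```

For the repro string "😀[c]x[/c]", the style is keyed at index 1, matching the position that a grapheme-based lookup assigns to x.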