ASCII / Unicode Explorer Guide

What this tool is for

Toolzy's ASCII / Unicode Explorer is a character and short-string inspector for developers. It is not a transcoding suite and it is not a lossy Unicode-to-ASCII converter. The goal is to help you answer practical debugging questions quickly:

What is this hidden or suspicious character?
Is it ASCII, or is it outside the ASCII range?
What bytes does it become in UTF-8?
Why does string.length not match what I see on screen?
Why do two strings look the same but fail equality checks?

Everything runs locally in the browser, which matters when you're inspecting production logs, copied tokens, customer data, scraped payloads, or code snippets.

The core model: code point vs code unit vs grapheme cluster

These concepts get blurred together, and many text bugs come from that confusion.

Unicode code point

A Unicode code point is the abstract number assigned to a character, written in U+ notation such as U+0041 for A or U+1F600 for 😀. Code points range from U+0000 to U+10FFFF.

UTF-16 code unit

JavaScript strings are sequences of UTF-16 code units. Every Basic Multilingual Plane value takes exactly one code unit. Every supplementary-plane value takes two code units, which together form a surrogate pair.

That means:

A -> one code point, one UTF-16 code unit
😀 -> one code point, two UTF-16 code units
a lone surrogate -> one UTF-16 code unit, but not a valid Unicode scalar value
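The three cases above can be checked directly in JavaScript. This is a minimal sketch, not Toolzy's implementation; the helper name countUnits is hypothetical.

```javascript
// Count UTF-16 code units and Unicode code points for a string.
// countUnits is a hypothetical helper used only for illustration.
function countUnits(text) {
  return {
    codeUnits: text.length,        // length counts UTF-16 code units
    codePoints: [...text].length,  // string iteration walks code points
  };
}

console.log(countUnits("A"));          // codeUnits 1, codePoints 1
console.log(countUnits("\u{1F600}"));  // codeUnits 2, codePoints 1
// A lone high surrogate is still yielded as one element by iteration,
// even though it is not a valid Unicode scalar value.
console.log(countUnits("\uD83D"));     // codeUnits 1, codePoints 1
```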

Grapheme cluster

A grapheme cluster is closer to what people think of as one visible character. It may contain multiple code points:

e + U+0301 COMBINING ACUTE ACCENT
emoji sequences joined with U+200D ZERO WIDTH JOINER
flags built from regional indicator symbols

This is why visible character count, code point count, and string.length can all differ.
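One way to see all three counts side by side is the built-in Intl.Segmenter API. The sketch below assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js); the helper names and the "en" locale are arbitrary choices for illustration, and grapheme results can vary slightly with the runtime's Unicode data version.

```javascript
// Count grapheme clusters with Intl.Segmenter.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// measure is a hypothetical helper that reports all three counts.
function measure(text) {
  return {
    codeUnits: text.length,
    codePoints: [...text].length,
    graphemes: countGraphemes(text),
  };
}

const accented = "e\u0301";          // e + U+0301 COMBINING ACUTE ACCENT
const flag = "\u{1F1E9}\u{1F1EA}";   // two regional indicator symbols

console.log(measure(accented));      // codeUnits 2, codePoints 2, graphemes 1
console.log(measure(flag));          // codeUnits 4, codePoints 2, graphemes 1
```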

ASCII is only 128 values

ASCII is not a synonym for text, bytes, or generic English-looking characters. Canonical ASCII is only 0-127 / 0x00-0x7F. That includes:

control characters like NUL, TAB, LF, CR, and DEL
printable punctuation like !, @, and ~
digits 0-9
uppercase and lowercase Latin letters A-Z and a-z

Values from 0x80 through 0xFF are not ASCII. Older material may call those "extended ASCII", but that label hides the reality that they belong to incompatible 8-bit code pages or to Unicode values in modern software. Toolzy keeps the page explicit about that boundary.
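The 0x7F boundary is simple to test in code. A minimal sketch follows; isAscii is a hypothetical helper name, not part of Toolzy.

```javascript
// Return true only if every UTF-16 code unit is in the ASCII range 0x00-0x7F.
// Any surrogate or other non-ASCII unit is above 0x7F, so it fails the check.
function isAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

console.log(isAscii("Hello\t~\x7F")); // true: TAB, ~ and DEL are all ASCII
console.log(isAscii("caf\u00E9"));    // false: U+00E9 is above 0x7F
console.log(isAscii("a\u00A0b"));     // false: NO-BREAK SPACE is not ASCII
```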

Hidden-character debugging examples

Non-breaking spaces

U+00A0 NO-BREAK SPACE often looks like a normal space, but it prevents a line break at its position and does not compare equal to U+0020 SPACE. That makes it a common source of layout bugs and failed equality checks.

Zero-width characters

Characters like U+200B ZERO WIDTH SPACE, U+200C ZWNJ, U+200D ZWJ, and U+2060 WORD JOINER may render invisibly while still affecting tokenization, matching, cursor movement, or emoji presentation.

Bidi controls and isolates

Directional marks and isolate controls can make text appear reordered. They are essential in some multilingual contexts, but they also create confusing debugging situations when pasted unexpectedly.
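A small scanner can surface the characters discussed in this section. The sketch below is illustrative only: the watch list is a hand-picked assumption rather than Toolzy's dataset, and findHidden is a hypothetical helper name.

```javascript
// Hand-picked watch list (an assumption for this example).
// Labels are the official Unicode character names.
const SUSPECTS = new Map([
  [0x00a0, "NO-BREAK SPACE"],
  [0x200b, "ZERO WIDTH SPACE"],
  [0x200c, "ZERO WIDTH NON-JOINER"],
  [0x200d, "ZERO WIDTH JOINER"],
  [0x2060, "WORD JOINER"],
  [0x200e, "LEFT-TO-RIGHT MARK"],
  [0x200f, "RIGHT-TO-LEFT MARK"],
  [0x2066, "LEFT-TO-RIGHT ISOLATE"],
  [0x2067, "RIGHT-TO-LEFT ISOLATE"],
  [0x2068, "FIRST STRONG ISOLATE"],
  [0x2069, "POP DIRECTIONAL ISOLATE"],
]);

// Report each watched character with its UTF-16 offset and U+ notation.
function findHidden(text) {
  const hits = [];
  let index = 0; // UTF-16 offset of the current code point
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (SUSPECTS.has(cp)) {
      const hex = cp.toString(16).toUpperCase().padStart(4, "0");
      hits.push({ index, codePoint: "U+" + hex, name: SUSPECTS.get(cp) });
    }
    index += ch.length;
  }
  return hits;
}

// Two strings that may render identically but are not equal.
console.log("admin" === "ad\u200Bmin"); // false
console.log(findHidden("ad\u200Bmin"));
```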

Replacement character

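U+FFFD REPLACEMENT CHARACTER usually signals a decoding problem upstream. A decoder emits it in place of bytes it could not interpret, so something went wrong before the string reached your browser. The character itself is only a marker, and removing it does not recover the original data.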
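The following sketch reproduces a typical cause: bytes written in one encoding and decoded as UTF-8. It uses the standard TextDecoder API; the sample bytes are an assumption chosen for illustration.

```javascript
// "café" encoded as ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own.
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

// The default (non-fatal) decoder substitutes U+FFFD for the bad byte.
const lenient = new TextDecoder("utf-8").decode(latin1Bytes);
console.log(lenient === "caf\uFFFD"); // true

// With fatal: true the decoder throws instead, which surfaces the problem early.
function decodeStrict(bytes) {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

try {
  decodeStrict(latin1Bytes);
} catch (err) {
  console.log("decoding failed:", err.name);
}
```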

Surrogate pairs and lone surrogates

Supplementary-plane code points like many emoji are represented in UTF-16 as a high surrogate followed by a low surrogate. Together they form a valid surrogate pair.

A lone surrogate is different. JavaScript can still store it in a string, but it is not a valid Unicode scalar value. That matters because UTF-8 and many Unicode-oriented operations assume scalar values. Toolzy accepts lone surrogates in text mode for debugging and labels the text as ill-formed UTF-16 instead of silently dropping or repairing the data.
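Detecting lone surrogates takes only a pass over the code units. The sketch below is one possible approach, not necessarily how Toolzy does it; findLoneSurrogates is a hypothetical helper name. It also shows that the standard TextEncoder silently substitutes U+FFFD when asked to encode a lone surrogate as UTF-8.

```javascript
// Return the UTF-16 offsets of surrogates that are not part of a valid pair.
function findLoneSurrogates(text) {
  const offsets = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh) {
      const next = text.charCodeAt(i + 1); // NaN when past the end
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair: skip the low half
      } else {
        offsets.push(i); // high surrogate without a following low surrogate
      }
    } else if (isLow) {
      offsets.push(i); // low surrogate without a preceding high surrogate
    }
  }
  return offsets;
}

console.log(findLoneSurrogates("\u{1F600}")); // []
console.log(findLoneSurrogates("a\uD83Db"));  // [ 1 ]

// UTF-8 cannot represent a lone surrogate, so TextEncoder emits the
// bytes of U+FFFD (EF BF BD) in its place.
console.log(Array.from(new TextEncoder().encode("\uD83D"))); // [ 239, 191, 189 ]
```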

Why normalization matters

Unicode supports canonically equivalent sequences. A precomposed character like é can exist as one code point, or it can exist as e plus U+0301 COMBINING ACUTE ACCENT. Both may render identically while comparing differently at the raw string level.

This tool does not rewrite your input into NFC or NFD. It only warns when the original text changes under NFC normalization so you can spot why equality, deduplication, or hashing may behave unexpectedly.
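The check described above can be approximated with the standard String.prototype.normalize method. This is a minimal sketch; changesUnderNfc is a hypothetical helper name.

```javascript
// Return true when the text is not already in NFC form.
function changesUnderNfc(text) {
  return text !== text.normalize("NFC");
}

const precomposed = "\u00E9";  // é as a single code point
const decomposed = "e\u0301";  // e + U+0301 COMBINING ACUTE ACCENT

console.log(precomposed === decomposed);                  // false
console.log(precomposed === decomposed.normalize("NFC")); // true
console.log(changesUnderNfc(decomposed));                 // true
console.log(changesUnderNfc(precomposed));                // false
```

Comparing, deduplicating, or hashing the normalized copies rather than the raw strings is a common way to avoid the mismatch, while leaving the original input untouched.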

Why UTF-8 and HTML numeric references are useful

The most practical debugging outputs are usually:

U+ notation for documentation and specs
decimal and hex for APIs, database values, and interviews
UTF-8 bytes for payload and encoding inspection
UTF-16 code units for JavaScript behavior
JavaScript escapes for source files and test cases
HTML numeric references for markup debugging

Toolzy uses numeric HTML references instead of promising named entities for every character, because named entities cover only a subset of characters while numeric references always map directly to a code point.
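Each of the listed notations can be derived from a single code point. The sketch below shows one way to do so; formatCodePoint and its field names are hypothetical, and the input is assumed to be a Unicode scalar value (not a surrogate).

```javascript
// Produce common debugging notations for one code point.
// Assumes cp is a scalar value; a surrogate would be encoded as U+FFFD in UTF-8.
function formatCodePoint(cp) {
  const hex = cp.toString(16).toUpperCase();
  const ch = String.fromCodePoint(cp);
  const hex2 = (n) => n.toString(16).toUpperCase().padStart(2, "0");
  const hex4 = (n) => n.toString(16).toUpperCase().padStart(4, "0");
  const units = Array.from({ length: ch.length }, (_, i) => ch.charCodeAt(i));
  return {
    notation: "U+" + hex.padStart(4, "0"),
    decimal: cp,
    utf8: Array.from(new TextEncoder().encode(ch), hex2).join(" "),
    utf16: units.map(hex4).join(" "),
    jsEscape: cp <= 0xffff ? "\\u" + hex.padStart(4, "0") : "\\u{" + hex + "}",
    htmlReference: "&#x" + hex + ";",
  };
}

console.log(formatCodePoint(0x1f600));
// notation U+1F600, decimal 128512, utf8 F0 9F 98 80,
// utf16 D83D DE00, jsEscape \u{1F600}, htmlReference &#x1F600;
```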

Current v1 data scope

This implementation bundles a pinned local Unicode name subset focused on ASCII and common debugging characters. That keeps the tool fast, local, and reviewable in git. Status detection still works across the full Unicode range, including surrogate, private-use, noncharacter, and unassigned values.
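Several of these statuses follow from fixed ranges in the Unicode standard and need no name data. The sketch below is an illustration under that assumption, not Toolzy's actual detection code; classifyCodePoint and its return labels are hypothetical. Telling assigned from unassigned values requires Unicode character data, so the sketch does not attempt it.

```javascript
// Classify a code point using only ranges fixed by the Unicode standard.
function classifyCodePoint(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) return "out of range";
  if (cp >= 0xd800 && cp <= 0xdfff) return "surrogate";
  // Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
  if ((cp >= 0xfdd0 && cp <= 0xfdef) || (cp & 0xfffe) === 0xfffe) {
    return "noncharacter";
  }
  // Private use: the BMP area plus planes 15 and 16.
  if (
    (cp >= 0xe000 && cp <= 0xf8ff) ||
    (cp >= 0xf0000 && cp <= 0xffffd) ||
    (cp >= 0x100000 && cp <= 0x10fffd)
  ) {
    return "private-use";
  }
  return "other"; // assigned or unassigned; needs Unicode data to decide
}

console.log(classifyCodePoint(0xd83d));   // surrogate
console.log(classifyCodePoint(0xe000));   // private-use
console.log(classifyCodePoint(0x1fffe));  // noncharacter
console.log(classifyCodePoint(0x41));     // other
```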

The tradeoff is that some assigned code points will appear as assigned values without their official names yet. That is a deliberate v1 compromise: the tool ships a smaller bundled dataset rather than fetching names at runtime.

Troubleshooting

Why does 41 resolve differently from 65 in lookup mode? Auto mode treats bare values as hexadecimal to support shorthand like 41 for U+0041, so 41 resolves to U+0041 (A) while 65 resolves to U+0065 (e). Use decimal mode when you mean decimal 65, which is also U+0041.

Why does the table show two rows for é if I only see one character? Because that é is in decomposed form: the visible grapheme is built from two Unicode code points, the base letter e and the combining mark U+0301. A precomposed é (U+00E9) is a single code point.

Why can pasted text contain invalid Unicode scalar values at all? Because JavaScript strings are UTF-16 code-unit sequences and can still hold lone surrogates. Toolzy reports them as ill-formed UTF-16 instead of pretending they are normal Unicode characters.

Why doesn't every assigned character have a name in this build? The local pinned dataset is intentionally small for v1. Official names are bundled for ASCII and common debugging characters first, while broader coverage can be added later without changing the browser-only architecture.
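The auto versus decimal lookup behavior from the first question can be sketched as follows. This is a guess at the general shape, not Toolzy's actual parser: parseLookup, its mode names, and the accepted U+ and 0x prefixes are all assumptions made for illustration.

```javascript
// Parse a lookup value into a code point, or return null if it is invalid.
// mode "auto" and "hex" read bare values as hexadecimal; "decimal" reads base 10.
function parseLookup(input, mode = "auto") {
  let text = input.trim();
  const radix = mode === "decimal" ? 10 : 16;
  if (radix === 16) {
    text = text.replace(/^(U\+|0x)/i, ""); // assumed optional prefixes
  }
  const pattern = radix === 10 ? /^[0-9]+$/ : /^[0-9a-f]+$/i;
  if (!pattern.test(text)) return null;
  const cp = parseInt(text, radix);
  return cp <= 0x10ffff ? cp : null;
}

console.log(parseLookup("41"));            // 65, that is U+0041 (A)
console.log(parseLookup("65"));            // 101, that is U+0065 (e)
console.log(parseLookup("65", "decimal")); // 65, that is U+0041 (A)
```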