The Complete Guide to Binary Code and Text Encoding

Binary isn't an abstraction — it's the literal physical state of your hardware. Every piece of text, every pixel, every instruction your CPU executes comes down to voltage levels representing ones and zeros. This guide covers how text becomes binary, why the encoding matters more than you think, and where binary shows up in real dev work.


What binary representation actually means

Every character in a string maps to a number called a code point. The letter A is 65. The digit 7 is 55. The space character is 32. These numbers are stored in memory as binary — base-2 — because that's all transistors can do: on or off, 1 or 0.

ASCII defined the original mapping: 128 characters using 7 bits (0–127). Extended ASCII variants claimed the 8th bit to squeeze in another 128 characters, but every vendor picked different ones, which is why files from the '90s are so often full of garbled characters.

This tool outputs 8-bit bytes, zero-padded. The letter A (65 decimal) is 1000001 in raw binary but 01000001 as an 8-bit byte. That leading zero matters — it's the fixed-width format that actual memory uses.

'Hi' → H =  72 = 01001000
       i = 105 = 01101001
Output: 01001000 01101001
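The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not this tool's actual source: the function names text_to_binary and binary_to_text are made up for the example, and it assumes the text is encoded as UTF-8 before each byte is formatted.

```python
def text_to_binary(text):
    """Encode text as UTF-8 and render each byte as a zero-padded 8-bit group."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def binary_to_text(bits):
    """Reverse the conversion: parse space-separated 8-bit groups back into text."""
    return bytes(int(group, 2) for group in bits.split()).decode("utf-8")


print(text_to_binary("Hi"))                 # 01001000 01101001
print(binary_to_text("01001000 01101001"))  # Hi
```

The 08b format spec is what supplies the leading zero: without it, A would come out as the 7-digit 1000001.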

ASCII, Unicode, and UTF-8

ASCII covers 128 characters — English letters, digits, punctuation, and control codes. That's it. No accented characters, no CJK, no emoji.

Unicode is the universal character set: over 150,000 characters across every writing system. But Unicode is a table, not an encoding. It assigns code points (like U+0041 for A). How those code points are stored as bytes is a separate question.

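UTF-8 is the answer the web settled on. It's a variable-length encoding that uses one to four bytes per code point:

U+0000 to U+007F      1 byte   (ASCII)
U+0080 to U+07FF      2 bytes  (accented Latin, Greek, Cyrillic, Hebrew, Arabic)
U+0800 to U+FFFF      3 bytes  (most CJK, the rest of the Basic Multilingual Plane)
U+10000 to U+10FFFF   4 bytes  (emoji, supplementary planes)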

This tool converts text to binary using the UTF-8 byte values of each character. For plain ASCII input, you get one 8-bit group per character. For multi-byte characters, you get multiple groups — which is correct, not a bug.
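To see the variable-length behavior directly, here is a small sketch that prints the UTF-8 bytes for four sample characters. The helper name utf8_groups and the choice of sample characters are assumptions made for the example.

```python
def utf8_groups(char):
    """Return the UTF-8 bytes of one character as a list of 8-bit strings."""
    return [f"{byte:08b}" for byte in char.encode("utf-8")]


# One sample each for the 1-, 2-, 3- and 4-byte cases.
for char in ["A", "é", "€", "🚀"]:
    groups = utf8_groups(char)
    print(f"U+{ord(char):04X}  {len(groups)} byte(s)  {' '.join(groups)}")
```

The first byte of each sequence announces its length: a leading 0 means a single byte, 110 starts a 2-byte sequence, 1110 a 3-byte sequence, and 11110 a 4-byte sequence. Every continuation byte starts with 10.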


Binary in computing

Computers use binary because transistors are switches. A transistor is either conducting or not — two states, one bit. Group 8 bits and you get a byte (256 possible values). Group 4 or 8 bytes and you get a word (32-bit or 64-bit, depending on architecture).

Endianness determines byte order in multi-byte values. Big-endian stores the most significant byte first (like how we write numbers). Little-endian stores the least significant byte first (how x86 CPUs work). The number 0x0102 is stored as 01 02 in big-endian and 02 01 in little-endian. This matters when reading binary protocols or file formats across platforms.
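The 0x0102 example can be reproduced with Python's built-in integer and struct helpers. This is a sketch for illustration; the variable names are arbitrary.

```python
import struct
import sys

value = 0x0102

big = value.to_bytes(2, byteorder="big")
little = value.to_bytes(2, byteorder="little")
print(big.hex())     # 0102
print(little.hex())  # 0201

# Reading big-endian bytes as little-endian silently yields a different number.
misread = int.from_bytes(big, byteorder="little")
print(hex(misread))  # 0x201

# struct spells the same choice as ">" (big-endian) and "<" (little-endian).
packed_big = struct.pack(">H", value)
packed_little = struct.pack("<H", value)

# Byte order of the machine running this script.
print(sys.byteorder)
```

Network protocols conventionally use big-endian ("network byte order"), so code that parses packets on an x86 machine has to state the byte order explicitly rather than rely on the native one.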

Bitwise operations manipulate individual bits and are everywhere in systems programming:

AND:   1100 & 1010 = 1000   (mask bits)
OR:    1100 | 1010 = 1110   (set bits)
XOR:   1100 ^ 1010 = 0110   (toggle bits)
Shift: 1100 << 1   = 11000  (multiply by 2)

Unix file permissions (chmod 755) are a bitmask: 111 101 101 — rwx for owner, r-x for group and others.
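The same operations, plus the chmod 755 bitmask, can be checked in Python. This is an illustrative sketch; the permission constants come from the standard stat module, and the variable names are made up.

```python
import stat

a, b = 0b1100, 0b1010
print(f"{a & b:04b}")  # 1000  (AND keeps bits set in both)
print(f"{a | b:04b}")  # 1110  (OR keeps bits set in either)
print(f"{a ^ b:04b}")  # 0110  (XOR keeps bits that differ)
print(f"{a << 1:b}")   # 11000 (left shift by one doubles the value)

# chmod 755: three octal digits, three bits each.
mode = 0o755
print(f"{mode:09b}")   # 111101101

# Test individual permission bits by masking.
owner_can_write = bool(mode & stat.S_IWUSR)
others_can_write = bool(mode & stat.S_IWOTH)
print(owner_can_write, others_can_write)  # True False

# Render the mask the way ls would for a regular file.
listing = stat.filemode(stat.S_IFREG | mode)
print(listing)  # -rwxr-xr-x
```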


Practical uses

Debugging encoding bugs. When a user's name renders as Ã© instead of é, you're looking at UTF-8 bytes interpreted as Latin-1. Seeing the actual binary clarifies what went wrong.
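This failure is easy to reproduce and, when nothing else has been lost along the way, to reverse. A minimal sketch, assuming the wrong decoder really was Latin-1 (in practice it is often Windows-1252, which agrees with Latin-1 for these two bytes):

```python
original = "é"

# Correct bytes on the wire: two UTF-8 bytes for one character.
utf8_bytes = original.encode("utf-8")
print(" ".join(f"{byte:08b}" for byte in utf8_bytes))  # 11000011 10101001

# The bug: each byte is decoded as its own Latin-1 character.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # Ã©

# The repair: undo the wrong decode, then apply the right one.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # é
```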

Network protocols. TCP flags, DNS headers, TLS records — specs describe them in bits and bytes. Converting values to binary while reading an RFC makes the structure click.
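As one concrete case, the flags byte of a TCP header packs eight one-bit fields into a single byte. The sketch below decodes such a byte; the bit positions follow the TCP header layout (FIN is the lowest bit), while the function name decode_tcp_flags and the sample value are made up for the example.

```python
# Bit value of each flag within the TCP flags byte.
TCP_FLAGS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
    0x40: "ECE",
    0x80: "CWR",
}


def decode_tcp_flags(flag_byte):
    """Return the names of the flags set in flag_byte, lowest bit first."""
    return [name for bit, name in TCP_FLAGS.items() if flag_byte & bit]


sample = 0x12  # the flags byte of a typical SYN-ACK segment
print(f"{sample:08b}", decode_tcp_flags(sample))  # 00010010 ['SYN', 'ACK']
```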

CTF challenges. Binary encoding is a staple layer in capture-the-flag puzzles. Decode the binary to text, check if the result is Base64 or hex, peel layers until you find the flag.
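A layered decode of that kind might look like the sketch below. The flag value, the helper names, and the two-layer structure are all invented for the example; real challenges stack layers in any order.

```python
import base64
import binascii


def binary_to_bytes(bits):
    """Turn space-separated 8-bit groups into raw bytes."""
    return bytes(int(group, 2) for group in bits.split())


def try_base64(data):
    """Return the decoded bytes if data is strict Base64, otherwise None."""
    try:
        return base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError):
        return None


# Build a two-layer demo puzzle: made-up flag -> Base64 -> binary groups.
secret = b"flag{demo}"
puzzle = " ".join(f"{byte:08b}" for byte in base64.b64encode(secret))

# Peel the layers in reverse order.
layer1 = binary_to_bytes(puzzle)  # the Base64 text, as bytes
layer2 = try_base64(layer1)       # the original flag bytes
print(layer1.decode("ascii"))
print(layer2.decode("ascii"))     # flag{demo}
```

A successful Base64 decode is only a hint, not proof: short plain words made of letters and digits can also happen to be valid Base64, so check whether the decoded result looks meaningful before peeling further.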

Teaching and learning. Showing someone that A (01000001) and a (01100001) differ by exactly one bit makes character encoding tangible in a way that no textbook explanation can.
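The one-bit difference is easy to demonstrate with XOR. A small sketch, valid for ASCII letters only; the helper name flip_ascii_case is made up for the example.

```python
upper, lower = ord("A"), ord("a")
print(f"{upper:08b}")          # 01000001
print(f"{lower:08b}")          # 01100001
print(f"{upper ^ lower:08b}")  # 00100000, the single bit that differs

CASE_BIT = 0x20  # decimal 32


def flip_ascii_case(letter):
    """Toggle the case of one ASCII letter by flipping the case bit."""
    return chr(ord(letter) ^ CASE_BIT)


print(flip_ascii_case("q"), flip_ascii_case("Q"))  # Q q
```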


Other numeral systems

Binary is base-2. It's precise but verbose — eight digits for a single character. Other bases trade verbosity for readability:

Base  Name         Digits     Example (65)  Common use
2     Binary       0–1        01000001      Hardware, bit manipulation
8     Octal        0–7        101           Unix permissions, legacy code
10    Decimal      0–9        65            Human-readable numbers
16    Hexadecimal  0–9, A–F   41            Memory addresses, color codes

Hex is the developer's shorthand for binary. Each hex digit maps to exactly 4 bits, so one byte is always two hex characters. 0xFF is 11111111. 0x0A is 00001010 (newline). Hex editors, debuggers, and protocol dumps all default to hex because it's compact without losing the byte-level structure.
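Python's format specs make it easy to move the same value between these bases. A short sketch; the helper name hex_to_bits is an assumption for the example.

```python
value = 65

binary = format(value, "08b")  # '01000001'
octal = format(value, "o")     # '101'
hexa = format(value, "X")      # '41'
print(binary, octal, value, hexa)  # 01000001 101 65 41


def hex_to_bits(hex_string):
    """Expand each hex digit into its 4-bit group."""
    return " ".join(f"{int(digit, 16):04b}" for digit in hex_string)


print(hex_to_bits("FF"))  # 1111 1111
print(hex_to_bits("0A"))  # 0000 1010

# Parsing goes the other way: int() takes the base as its second argument.
parsed = [int("01000001", 2), int("101", 8), int("41", 16)]
print(parsed)  # [65, 65, 65]
```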


Troubleshooting

Emoji produce way more binary than expected — A single emoji like 🚀 is U+1F680, which encodes to 4 bytes in UTF-8: 11110000 10011111 10011010 10000000. That's not a bug — it's how variable-length encoding works. Simple-looking characters can be multi-byte.

Different tools give different binary for the same character — This almost always comes down to encoding. One tool might use UTF-8 byte values, another might use UTF-16 code units, and another might use raw Unicode code points. Check which encoding each tool assumes before comparing output.
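To make the difference concrete, here is the same rocket emoji under the three interpretations. This is a sketch; big-endian UTF-16 is chosen here only so the output has no byte-order mark, and the variable names are arbitrary.

```python
char = "🚀"

utf8 = char.encode("utf-8")       # four 8-bit bytes
utf16 = char.encode("utf-16-be")  # two 16-bit code units (a surrogate pair)
code_point = ord(char)            # one number, no encoding applied

print(utf8.hex())       # f09f9a80
print(utf16.hex())      # d83dde80
print(hex(code_point))  # 0x1f680

# The binary a tool shows depends on which of these it formats.
print(" ".join(f"{byte:08b}" for byte in utf8))
print(f"{code_point:b}")
```

All three describe the same character, and none of them is wrong. They only answer different questions.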

Why do binary bytes have leading zeros? — Convention. Binary is displayed in fixed-width 8-bit groups so every byte is the same length. The value 1 (decimal) is 00000001, not 1. Without zero-padding, you can't tell where one byte ends and the next begins in a sequence.

Binary input decodes to unreadable characters — The byte values probably correspond to control characters (0–31) or fall outside printable ASCII (128+). Not all valid binary maps to visible text. Try interpreting the values as decimal or hex first to check what range they're in.

Decoded text has question marks or replacement characters — The binary likely represents multi-byte UTF-8 sequences that got split or truncated. A 3-byte character missing its last byte is invalid UTF-8, and decoders replace it with (U+FFFD). Make sure you're decoding complete byte sequences.
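The truncation case can be reproduced in Python, along with one way to handle input that arrives in pieces. A minimal sketch using the euro sign as the 3-byte character; the variable names and the two-chunk split are assumptions for the example.

```python
import codecs

euro = "€".encode("utf-8")  # three bytes: e2 82 ac
truncated = euro[:-1]       # last byte missing

# Lenient decoding substitutes U+FFFD for the broken sequence.
lenient = truncated.decode("utf-8", errors="replace")
print("\ufffd" in lenient)  # True

# Strict decoding (the default) refuses instead.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)  # True

# If the bytes arrive in chunks, an incremental decoder holds back
# the incomplete sequence until the remaining bytes show up.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(euro[:2])   # empty string, still waiting
second = decoder.decode(euro[2:])  # the complete character
print(repr(first), repr(second))
```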