
Encoding Explained: UTF-8, ASCII, Base64, and URL Encoding

Understand character encodings, binary-to-text encoding, and URL encoding to prevent data corruption and bugs.

Text and Data Encoding

Encoding confusion causes garbled text, broken URLs, corrupted data, and security vulnerabilities. Understanding the purpose of each encoding type prevents these issues.

Character Encoding: ASCII and UTF-8

ASCII maps 128 characters (English letters, digits, punctuation, and control codes) to the numbers 0-127, so each character fits in 7 bits. UTF-8 extends this to every Unicode character (more than 148,000 assigned so far) using 1 to 4 bytes per character. Every ASCII text is also valid UTF-8, because UTF-8 encodes the values 0-127 as the same single bytes. The reverse is not true: UTF-8 text containing non-ASCII characters is not valid ASCII. Default to UTF-8 for new projects.
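A minimal Python sketch of the one-way compatibility described above. The sample strings are arbitrary illustrations, not taken from any particular system.

```python
# ASCII bytes decode cleanly as UTF-8, because UTF-8 keeps 0-127 unchanged.
text_ascii = "hello"
ascii_bytes = text_ascii.encode("ascii")
roundtrip = ascii_bytes.decode("utf-8")

# UTF-8 bytes containing non-ASCII characters do NOT decode as ASCII.
utf8_bytes = "héllo €".encode("utf-8")
try:
    utf8_bytes.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False

# UTF-8 width per character: 1 byte for ASCII, up to 4 for emoji.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in "Aé€😀"}
print(roundtrip, decoded_as_ascii, byte_lengths)
```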

UTF-8 vs UTF-16 vs UTF-32

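UTF-8 is variable-width (1-4 bytes per code point): compact for ASCII-heavy text such as English, less so for CJK text (3 bytes for most characters). UTF-16 uses 2 or 4 bytes per code point: compact for CJK text, but it doubles the size of ASCII. UTF-32 uses exactly 4 bytes per code point: simplest to index, but wasteful of space. Typical uses: UTF-8 on the web and in files; UTF-16 in Windows internals and in Java and JavaScript strings; UTF-32 mainly for in-memory processing where fixed-width code points are convenient, and rarely for storage or transmission.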
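The size trade-offs can be checked directly. The sketch below uses a hypothetical helper, `encoded_sizes`, and picks the little-endian codec variants so that Python does not prepend a byte order mark to the UTF-16 and UTF-32 output.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of `text` in each Unicode encoding form."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")  # "-le" avoids the BOM
    return {name: len(text.encode(name)) for name in codecs}

english = encoded_sizes("hello")   # ASCII only: UTF-8 is smallest
cjk = encoded_sizes("日本語")       # CJK: UTF-16 is smallest
emoji = encoded_sizes("😀")         # outside the BMP: 4 bytes everywhere

for label, sizes in (("english", english), ("cjk", cjk), ("emoji", emoji)):
    print(label, sizes)
```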

Base64 Encoding

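Base64 converts binary data to ASCII text using 64 characters (A-Z, a-z, 0-9, +, /), plus = as padding. It's used to embed binary data in text-only contexts such as email attachments (MIME) and data URIs in HTML/CSS. Every 3 input bytes become 4 output characters, so Base64 increases data size by approximately 33%. It is an encoding, not encryption: anyone can decode it. The Base64url variant replaces + with - and / with _ for URL safety and often omits the padding; JWTs use this variant.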
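A short sketch using Python's standard `base64` module. The input bytes are chosen only because they happen to produce + and / in the output, which makes the difference between the two alphabets visible.

```python
import base64

raw = bytes([0xFB, 0xFF, 0xFE])  # arbitrary bytes that map to '+' and '/'

standard = base64.b64encode(raw)          # standard alphabet
url_safe = base64.urlsafe_b64encode(raw)  # '+' -> '-', '/' -> '_'

# 3 input bytes become 4 output characters: roughly a 33% size increase.
encoded_len = len(base64.b64encode(bytes(300)))

# Inputs that are not a multiple of 3 bytes are padded with '='.
padded = base64.b64encode(b"a")

print(standard, url_safe, encoded_len, padded)
```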

URL Encoding (Percent Encoding)

Characters that are not allowed in a URL, or that would otherwise be read as URL syntax, are encoded as %XX, where XX is the hexadecimal value of a byte: space becomes %20, & becomes %26. Non-ASCII characters are first encoded as UTF-8, and then each byte is percent-encoded (é becomes %C3%A9). Over-encoding unreserved characters (letters, digits, -, ., _, ~) is generally harmless but makes URLs ugly. Under-encoding causes parsing errors and potential security issues.
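As an illustration, Python's `urllib.parse` covers the common cases. The sample values are made up; note that `urlencode` follows the form-encoding convention of writing spaces as + unless told otherwise.

```python
from urllib.parse import quote, unquote, urlencode

# Percent-encode a single value: space -> %20, & -> %26.
value = quote("a b&c")

# Non-ASCII text is encoded as UTF-8 first, then each byte becomes %XX.
accented = quote("é")

# By default "/" is left alone; pass safe="" to encode it inside a segment.
segment = quote("path/with slash", safe="")

# Query strings: the default form encoding writes spaces as "+".
query_plus = urlencode({"q": "a b&c"})
query_pct = urlencode({"q": "a b&c"}, quote_via=quote)

print(value, accented, segment, query_plus, query_pct)
```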

Common Encoding Bugs

Mojibake (garbled text) means text was decoded with the wrong encoding, for example UTF-8 bytes interpreted as Latin-1, or vice versa. Double encoding (%2520 instead of %20) means the data was URL-encoded twice: the % of %20 was itself encoded as %25. Base64 padding errors (invalid length, missing = signs) usually indicate that the encoded data was truncated in transit or that the padding was stripped, as is common with Base64url.
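Each of these bugs can be reproduced in a few lines, which helps when recognizing them in real data. The helper `b64decode_padded` is a hypothetical name for this sketch, and restoring padding only helps when the padding was stripped, not when the data itself was cut short.

```python
import base64
import binascii
from urllib.parse import quote, unquote

# Mojibake: UTF-8 bytes decoded as Latin-1, then repaired by reversing it.
mojibake = "é".encode("utf-8").decode("latin-1")
repaired = mojibake.encode("latin-1").decode("utf-8")

# Double encoding: the "%" from the first pass becomes "%25" in the second.
double = quote(quote("a b"))
once_decoded = unquote(double)

# Base64 with stripped padding fails to decode until "=" is restored.
def b64decode_padded(data: str) -> bytes:
    """Decode Base64 text after restoring any missing '=' padding."""
    return base64.b64decode(data + "=" * (-len(data) % 4))

try:
    base64.b64decode("YQ")
    strict_failed = False
except binascii.Error:
    strict_failed = True

recovered = b64decode_padded("YQ")
print(mojibake, repaired, double, once_decoded, strict_failed, recovered)
```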
