Unicode Converter
Inspect Unicode code points, UTF-8 bytes, and character information for any text.
About This Tool
The Unicode Converter inspects any text and returns the Unicode code point (U+XXXX), UTF-8 byte sequence, and character category for every character. You can also look up any character by entering its code point (e.g., U+1F600 or just 1F600) to see the character and its encoding details.
All analysis runs in your browser. The tool supports the full Unicode range, including ASCII, Latin, CJK, Arabic, Hangul, Japanese kana, and emoji characters.
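For readers who want to reproduce the table outside the browser, here is a minimal Python sketch of the same kind of per-character inspection. It is not the tool's actual implementation: the function name `inspect_text` is hypothetical, and the category column uses the two-letter Unicode General Category codes from Python's `unicodedata` module, which may differ from the labels the tool displays.

```python
import unicodedata

def inspect_text(text):
    """Return one row per code point: (character, 'U+XXXX', UTF-8 hex bytes, category)."""
    rows = []
    for ch in text:  # Python strings iterate by code point, not by byte
        code_point = f"U+{ord(ch):04X}"
        utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        category = unicodedata.category(ch)  # e.g. 'Lu' = uppercase letter
        rows.append((ch, code_point, utf8_hex, category))
    return rows

# 'A', 'e with acute accent' (U+00E9), and the grinning face emoji (U+1F600)
for row in inspect_text("A\u00e9\U0001F600"):
    print(row)
```

The three sample characters need 1, 2, and 4 UTF-8 bytes respectively, which makes the variable width of the encoding easy to see in the output.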
How to Use
- Type or paste text into the Text → Unicode Info field and click Analyze.
- The table shows the character, code point (U+XXXX), UTF-8 bytes, and category for each character.
- To look up a specific character, enter its code point (e.g., 1F600 or U+1F600) in the Code Point Lookup field and click Look up.
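The lookup step can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: `lookup_code_point` is a hypothetical name, and the choice to reject surrogates and out-of-range values with `ValueError` is an assumption about sensible behavior, not something the tool documents.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode

def lookup_code_point(user_input):
    """Parse '1F600' or 'U+1F600' and return (character, 'U+XXXX', UTF-8 hex bytes)."""
    match = re.fullmatch(r"(?:U\+)?([0-9A-F]{1,6})", user_input.strip(),
                         flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a hexadecimal code point: {user_input!r}")
    value = int(match.group(1), 16)
    if value > MAX_CODE_POINT:
        raise ValueError("code point is above U+10FFFF")
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and cannot be encoded in UTF-8
        raise ValueError("surrogate code points have no UTF-8 encoding")
    ch = chr(value)
    utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    return ch, f"U+{value:04X}", utf8_hex

print(lookup_code_point("U+1F600"))
print(lookup_code_point("41"))
```

Accepting the input with or without the `U+` prefix, and in either letter case, mirrors the two input forms mentioned in the steps above.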
UTF-8 Encoding Reference
UTF-8 is a variable-width encoding. ASCII characters (U+0000–U+007F) use 1 byte. Characters in U+0080–U+07FF (accented Latin letters such as é, Greek, Cyrillic, Hebrew, basic Arabic) use 2 bytes. Characters in U+0800–U+FFFF (most CJK, Hangul, Japanese kana) use 3 bytes. Supplementary characters in U+10000–U+10FFFF (including most emoji) use 4 bytes. This is why non-ASCII text takes more storage space per character than ASCII text.
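To make the byte ranges concrete, the following Python sketch encodes a code point by hand using the standard UTF-8 bit layout (lead byte prefixes `0`, `110`, `1110`, `11110`, with `10` on every continuation byte). The function name `utf8_encode` is hypothetical; in practice Python's built-in `str.encode("utf-8")` does the same job, and the sketch is checked against it.

```python
def utf8_encode(code_point):
    """Encode a single Unicode code point to UTF-8 bytes by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# One sample per length: 'A', Arabic alef, a CJK ideograph, an emoji
for cp in (0x41, 0x0627, 0x4E2D, 0x1F600):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")  # agrees with the built-in encoder
    print(f"U+{cp:04X} -> {encoded.hex(' ').upper()} ({len(encoded)} bytes)")
```

Each range boundary in the reference above corresponds to the largest value that fits in the payload bits of that layout: 7 bits for 1 byte, 11 bits for 2 bytes, 16 bits for 3 bytes, and 21 bits for 4 bytes.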
Use Cases
Software engineers debug character encoding issues in strings that contain unexpected symbols or corrupted text. Web developers verify that special characters in user input are correctly encoded before storage. Internationalization engineers inspect CJK and RTL characters to confirm their code points. Security researchers analyze unusual Unicode characters used in homograph attacks or obfuscated strings.
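As an illustration of the security use case, this Python sketch lists every non-ASCII character in a string together with its official Unicode name, which is often enough to spot a lookalike letter. The function name `flag_non_ascii` and the sample string are hypothetical, and real homograph detection usually needs more than this (for example, the confusables data published by the Unicode Consortium).

```python
import unicodedata

def flag_non_ascii(text):
    """Return (index, 'U+XXXX', Unicode name) for each non-ASCII character."""
    findings = []
    for index, ch in enumerate(text):
        if ord(ch) > 0x7F:
            name = unicodedata.name(ch, "<unnamed>")  # fallback for unnamed code points
            findings.append((index, f"U+{ord(ch):04X}", name))
    return findings

# The second letter is Cyrillic U+0430, which renders like Latin 'a'
print(flag_non_ascii("p\u0430ypal"))
# → [(1, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```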
FAQ
- What is a Unicode code point? — A unique number assigned to every character in the Unicode standard, written as U+XXXX in hexadecimal (e.g., U+0041 = 'A').
- What is UTF-8? — A variable-width character encoding that represents Unicode code points as 1–4 bytes. It is the dominant encoding on the web.
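- Why do emoji take 4 bytes in UTF-8? — Most emoji are in the Supplementary Multilingual Plane (U+10000–U+1FFFF), and every code point at or above U+10000 requires 4 bytes in UTF-8. A few older emoji in the Basic Multilingual Plane, such as U+2764 (❤), take only 3 bytes.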