Unicode basics: one code point per character, round-trip interconvertibility
with all existing character sets. But what's a character?
- Some font styles have taken on independent meaning, e.g. ℜ
is originally a Fraktur R, but is now math-ese for the real part
of a complex number (the reals themselves are usually ℝ)
- Some otherwise equivalent characters appear in preexisting
character sets, e.g. mainland Chinese sets with both traditional
and simplified characters (坛 vs. 壇, 罈)
- Some characters look the same, but aren't, e.g. Latin "o" vs.
Cyrillic "о": security issue (spoofed names and identifiers)
- Composed vs. decomposed forms (accents like é, Hangul): normalize
for search, comparison
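
A minimal sketch of the last two bullets, using Python's standard unicodedata module. The helper names (same_text, script_prefixes) and the sample strings are made up for illustration, and the script check is a crude heuristic based on the first word of each character's Unicode name, not a real confusable detector.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = composed == decomposed                   # code points differ
normalized_equal = same_text(composed, decomposed)   # equal once normalized

# Hangul: NFD splits one precomposed syllable into its conjoining jamo.
hangul_nfd = unicodedata.normalize("NFD", "\ud55c")  # the syllable U+D55C

# Font-variant letters: NFKC folds U+211C BLACK-LETTER CAPITAL R to plain "R",
# which discards the distinction the character was encoded to preserve.
folded_r = unicodedata.normalize("NFKC", "\u211c")


def script_prefixes(s: str) -> set:
    """Hypothetical helper: first word of each letter's Unicode name.

    Rough stand-in for a script lookup; mixed results hint at lookalikes.
    """
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in s if ch.isalpha()}


# Normalization does NOT merge lookalikes: Latin "o" and Cyrillic "о" stay distinct.
lookalike_equal = same_text("o", "\u043e")

# A name with one Cyrillic letter mixed into Latin text (U+0430 for "a").
mixed = script_prefixes("p\u0430ypal")
```

Here raw_equal is False and normalized_equal is True, hangul_nfd has three code points, folded_r is "R", lookalike_equal is False, and mixed contains both "LATIN" and "CYRILLIC". Normalization handles the composed/decomposed problem but not the lookalike problem, which needs a separate script or confusables check.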