Category Archives: encoding

Use Unicode UTF-8 for worldwide language support

Windows NT dates back to 1993, when Unicode was still new to the industry. Microsoft chose UCS-2 (pre-W2K) and later UTF-16 LE (since W2K) for Unicode support, and introduced a separate set of Windows APIs (the -W APIs) for 2-byte wide characters, while keeping the -A APIs for Windows… Read More »

Looking at Windows APIs through PEP 529 in Python 3.6

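A few words on PEP 529 in Python 3.6. Python on Windows only switched to the UTF-16 APIs for bytes paths in 3.6; until then it had been using the ANSI APIs for them. To explain this we have to start with the history of Windows. The Windows 9x family (95 – Me) had no native Unicode support (in 2001 Microsoft released the Microsoft Layer for Unicode to bring Unicode to 9x, but Windows XP also came out in 2001). On 9x, multibyte code pages were used to support non-ASCII (non-English) characters. Meanwhile, the Windows NT family, released almost in parallel (NT… Read More »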

UTF-8 revisited

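From Unicode 2.0 up to the current 6.0, code points have fitted in 21 bits. A UTF is a concrete way of storing this number of at most 21 bits. The common UTFs are UTF-8, UTF-16 and UTF-32. This time I want to talk about UTF-8 again. UTF-8 is popular because it is compatible with ASCII: the letter "a" is 0x61 in both ASCII and UTF-8, but in UTF-16 (big-endian) it is 0x00 0x61, and the 0x00 is treated as a null terminator when the bytes are read as ASCII. English letters and digits also take only 1 byte each, half of what UTF-16 needs. Chinese characters do take more space in UTF-8 than in UTF-16, but unless the text is entirely Chinese, UTF-8 usually comes out smaller for mixed Chinese and English content such as HTML. The secret to how UTF-8 stays compatible with ASCII while still storing code points above 127 is its variable length; for details see 淺談… Read More »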
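The byte values and size comparison in this excerpt can be checked with a short Python sketch. It is only an illustration: the helper name `encoded_sizes` and the sample strings are made up here and do not come from the original post, and big-endian codecs are used so that no byte order mark is counted.

```python
# Hypothetical helper (not from the original post): encoded length in bytes
# of a string under each UTF. The "-be" codecs are used so no BOM is added.
def encoded_sizes(s):
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# "a" is the single byte 0x61 in UTF-8, same as ASCII,
# but 0x00 0x61 in big-endian UTF-16.
a_utf8 = "a".encode("utf-8")
a_utf16 = "a".encode("utf-16-be")

# Illustrative sample strings: pure ASCII, pure Chinese, and HTML-like mixed text.
ascii_sizes = encoded_sizes("hello123")
chinese_sizes = encoded_sizes("中文")
mixed_sizes = encoded_sizes("<p>中文</p>")

print(a_utf8.hex(), a_utf16.hex())
print(ascii_sizes)
print(chinese_sizes)
print(mixed_sizes)
```

For these samples the ASCII string is half the size in UTF-8, the two Chinese characters take 6 bytes in UTF-8 against 4 in UTF-16, and the mixed string is still smaller in UTF-8 because the markup is all ASCII.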