ASCII, Unicode, and UTF-8

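ASCII: a 7-bit code that defines 128 symbols (0-127). Each symbol is stored in one 8-bit byte with the highest bit set to 0. A full byte can distinguish 256 values, but the upper 128 belong to the various extended-ASCII code pages, not to ASCII itself.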

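Unicode: a single set covering all symbols, in which every symbol is assigned a unique number (its code point). Unicode itself does not specify how a code point is stored as bytes. Chinese characters and many other symbols need several bytes, while English letters need only one, so storing every symbol at the same fixed width would waste memory.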
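To make the idea of a code point concrete, here is a small sketch that lists the Unicode number of each symbol in a string. The helper name codePoints is made up for illustration; codePointAt, padStart, and for...of are standard JavaScript.

```javascript
// Hypothetical helper: list the code point of each symbol as "U+XXXX".
function codePoints(str) {
  const result = []
  // for...of walks the string by code point, not by 16-bit UTF-16 unit.
  for (const ch of str) {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
    result.push('U+' + hex)
  }
  return result
}

console.log(codePoints('A严😀').join(', ')) // U+0041, U+4E25, U+1F600
```

The three symbols have very different numbers, which is why a fixed one-byte-per-symbol scheme cannot represent all of them.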

UTF-8: one way of implementing Unicode. It uses 1 to 4 bytes per symbol, following these rules:

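1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, therefore, the UTF-8 encoding is identical to the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are always set to 10. All remaining bits not mentioned here are filled with the symbol's Unicode code point, padded with leading zeros where necessary.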

Unicode code point range | UTF-8 encoding
(hexadecimal)            | (binary)
-------------------------+---------------------------------------------
0000 0000-0000 007F      | 0xxxxxxx
0000 0080-0000 07FF      | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF      | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF      | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
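The table can be followed mechanically: pick the row, write the code point in binary, and pour the bits into the x positions from left to right. The sketch below does exactly that with strings of 0s and 1s so each step stays visible. The function name utf8Bits is hypothetical and it assumes a valid code point as input.

```javascript
// Hypothetical helper: show the UTF-8 bytes of one code point in binary,
// built by following the table row by row.
function utf8Bits(codePoint) {
  // Pick the row: prefix of the first byte and total number of x bits.
  let lead, total
  if (codePoint <= 0x7f) { lead = '0'; total = 7 }
  else if (codePoint <= 0x7ff) { lead = '110'; total = 11 }
  else if (codePoint <= 0xffff) { lead = '1110'; total = 16 }
  else { lead = '11110'; total = 21 }

  // Code point in binary, left-padded with zeros so it fills every x.
  const bits = codePoint.toString(2).padStart(total, '0')

  // The first byte takes the bits left over after the 6-bit groups.
  const firstLen = 8 - lead.length
  const bytes = [lead + bits.slice(0, firstLen)]
  // Every following byte is "10" plus the next 6 bits.
  for (let i = firstLen; i < total; i += 6) {
    bytes.push('10' + bits.slice(i, i + 6))
  }
  return bytes.join(' ')
}

console.log(utf8Bits(0x4e25)) // 11100100 10111000 10100101
```

For U+4E25 the code point falls in the third row, so its 16 bits 0100 111000 100101 are split 4 + 6 + 6, which gives the bytes E4 B8 A5.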

Implementing UTF-8 encoding

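function encodeUtf8(str) {
  const bytes = []
  // for...of iterates by code point, so a character stored as two 16-bit
  // UTF-16 units (a surrogate pair, 32 bits in total) is visited only once.
  for (const ch of str) {
    let code = ch.codePointAt(0)
    // A lone surrogate is not a valid character and must not be encoded;
    // substitute U+FFFD (the replacement character) instead.
    if (code >= 0xd800 && code <= 0xdfff) code = 0xfffd
    if (code >= 0x10000) {
      // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      bytes.push((code >> 18) | 0xf0)
      bytes.push(((code >> 12) & 0x3f) | 0x80)
      bytes.push(((code >> 6) & 0x3f) | 0x80)
      bytes.push((code & 0x3f) | 0x80)
    } else if (code >= 0x800) {
      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      bytes.push((code >> 12) | 0xe0)
      bytes.push(((code >> 6) & 0x3f) | 0x80)
      bytes.push((code & 0x3f) | 0x80)
    } else if (code >= 0x80) {
      // 2 bytes: 110xxxxx 10xxxxxx
      bytes.push((code >> 6) | 0xc0)
      bytes.push((code & 0x3f) | 0x80)
    } else {
      // 1 byte: 0xxxxxxx, identical to ASCII
      bytes.push(code)
    }
  }
  return bytes
}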
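Decoding reverses the same bit operations: read the prefix of the first byte to learn how many bytes follow, then shift in 6 bits from each continuation byte. The following is a minimal sketch only. The name decodeUtf8 is hypothetical, and the function assumes well-formed input; it makes no attempt to reject truncated or invalid sequences.

```javascript
// Hypothetical counterpart to the encoder: array of bytes in, string out.
// Assumes well-formed UTF-8; invalid sequences are not detected.
function decodeUtf8(bytes) {
  let result = ''
  let i = 0
  while (i < bytes.length) {
    const first = bytes[i]
    let code, extra
    if (first < 0x80) { code = first; extra = 0 }             // 0xxxxxxx
    else if (first < 0xe0) { code = first & 0x1f; extra = 1 } // 110xxxxx
    else if (first < 0xf0) { code = first & 0x0f; extra = 2 } // 1110xxxx
    else { code = first & 0x07; extra = 3 }                   // 11110xxx
    // Each continuation byte (10xxxxxx) contributes 6 more bits.
    for (let k = 1; k <= extra; k++) {
      code = (code << 6) | (bytes[i + k] & 0x3f)
    }
    result += String.fromCodePoint(code)
    i += extra + 1
  }
  return result
}

console.log(decodeUtf8([0xe4, 0xb8, 0xa5])) // 严
```

In real projects the built-in TextEncoder and TextDecoder classes handle both directions, including validation; hand-written versions like this one are mainly useful for understanding the format.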

References

http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

https://juejin.im/post/5e328bdff265da3e4569ad9b
