Network Working Group                                        P. Hoffman
Request for Comments: 2781                     Internet Mail Consortium
Category: Informational                                      F. Yergeau
                                                      Alis Technologies
                                                          February 2000
        

UTF-16, an encoding of ISO 10646

Status of this Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2000). All Rights Reserved.

1. Introduction

This document describes the UTF-16 encoding of Unicode/ISO-10646, addresses the issues of serializing UTF-16 as an octet stream for transmission over the Internet, discusses MIME charset naming as described in [CHARSET-REG], and contains the registration for three MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16.

1.1 Background and motivation

The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly define a coded character set (CCS), hereafter referred to as Unicode, which encompasses most of the world's writing systems [WORKSHOP]. UTF-16, the object of this specification, is one of the standard ways of encoding Unicode character data; it has the characteristics of encoding all currently defined characters (in plane 0, the BMP) in exactly two octets and of being able to encode all other characters likely to be defined (the next 16 planes) in exactly four octets.

The Unicode Standard further defines additional character properties and other application details of great interest to implementors. Up to the present time, changes in Unicode and amendments to ISO/IEC 10646 have tracked each other, so that the character repertoires and code point assignments have remained in sync. The relevant standardization committees have committed to maintain this very useful synchronism, as well as not to assign characters outside of the 17 planes accessible to UTF-16.

The IETF policy on character sets and languages [CHARPOLICY] says that IETF protocols MUST be able to use the UTF-8 character encoding scheme [UTF-8]. Some products and network standards already specify UTF-16, making it an important encoding for the Internet. This document is not an update to the [CHARPOLICY] document, only a description of the UTF-16 encoding.

1.2 Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [MUSTSHOULD].

Throughout this document, character values are shown in hexadecimal notation. For example, "0x013C" is the character assigned the integer value 316 (decimal) in the CCS.

2. UTF-16 definition

UTF-16 is described in the Unicode Standard, version 3.0 [UNICODE]. The definitive reference is Annex Q of ISO/IEC 10646-1 [ISO-10646]. The rest of this section summarizes the definition in simple terms.

In ISO 10646, each character is assigned a number, which Unicode calls the Unicode scalar value. This number is the same as the UCS-4 value of the character, and this document will refer to it as the "character value" for brevity. In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers, depending on the character value. Serialization of these integers for transmission as a byte stream is discussed in Section 3.

The rules for how characters are encoded in UTF-16 are:

- Characters with values less than 0x10000 are represented as a single 16-bit integer with a value equal to that of the character number.

- Characters with values between 0x10000 and 0x10FFFF are represented by a 16-bit integer with a value between 0xD800 and 0xDBFF (within the so-called high-half zone or high surrogate area) followed by a 16-bit integer with a value between 0xDC00 and 0xDFFF (within the so-called low-half zone or low surrogate area).

- Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.

Note: Values between 0xD800 and 0xDFFF are specifically reserved for use with UTF-16, and don't have any characters assigned to them.
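
As an illustration only, the three rules above can be sketched in C as a helper that reports how many 16-bit integers a character value needs. The function name utf16_units_needed and the convention of returning 0 for values that cannot be encoded are assumptions of this sketch, not part of this specification. Treating the reserved range 0xD800 to 0xDFFF as not encodable is likewise an interpretation of the note above.

```c
#include <stdint.h>

/* Hypothetical helper: number of 16-bit integers needed to encode the
   character value u in UTF-16, or 0 if u cannot be encoded. */
int utf16_units_needed(uint32_t u)
{
    if (u >= 0xD800 && u <= 0xDFFF)
        return 0;   /* reserved for UTF-16; no character assigned (assumption) */
    if (u < 0x10000)
        return 1;   /* a single 16-bit integer equal to the value */
    if (u <= 0x10FFFF)
        return 2;   /* high surrogate followed by low surrogate */
    return 0;       /* greater than 0x10FFFF: cannot be encoded */
}
```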

2.1 Encoding UTF-16

Encoding of a single character from an ISO 10646 character value to UTF-16 proceeds as follows. Let U be the character number, no greater than 0x10FFFF.

1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.

2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.

3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.

4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.

Graphically, steps 2 through 4 look like:

   U' = yyyyyyyyyyxxxxxxxxxx
   W1 = 110110yyyyyyyyyy
   W2 = 110111xxxxxxxxxx
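
The encoding steps of this section can be sketched in C as follows. This is a minimal illustration, not a reference implementation: the function name utf16_encode, the output array w, and the convention of returning 0 for a value greater than 0x10FFFF are all assumptions. Like the steps above, the sketch does not check whether U falls in the reserved range 0xD800 to 0xDFFF.

```c
#include <stdint.h>

/* Sketch of steps 1 through 4. Writes one or two 16-bit integers to w[]
   and returns how many were written, or 0 if u is greater than 0x10FFFF
   (the 0 return is a convention of this sketch). */
int utf16_encode(uint32_t u, uint16_t w[2])
{
    uint32_t u_prime;

    if (u > 0x10FFFF)
        return 0;                       /* cannot be encoded in UTF-16 */
    if (u < 0x10000) {                  /* step 1 */
        w[0] = (uint16_t)u;
        return 1;
    }
    u_prime = u - 0x10000;              /* step 2: fits in 20 bits */
    /* steps 3 and 4: 10 high-order bits go to W1, 10 low-order to W2 */
    w[0] = (uint16_t)(0xD800 | (u_prime >> 10));
    w[1] = (uint16_t)(0xDC00 | (u_prime & 0x3FF));
    return 2;
}
```

For the character value 0x12345, U' is 0x02345, so W1 is 0xD808 and W2 is 0xDF45.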

2.2 Decoding UTF-16

Decoding of a single character from UTF-16 to an ISO 10646 character value proceeds as follows. Let W1 be the next 16-bit integer in the sequence of integers representing the text. Let W2 be the (eventual) next integer following W1.

1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value of W1. Terminate.

2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence is in error and no valid character can be obtained using W1. Terminate.

3) If there is no W2 (that is, the sequence ends with W1), or if W2 is not between 0xDC00 and 0xDFFF, the sequence is in error. Terminate.

4) Construct a 20-bit unsigned integer U', taking the 10 low-order bits of W1 as its 10 high-order bits and the 10 low-order bits of W2 as its 10 low-order bits.

5) Add 0x10000 to U' to obtain the character value U. Terminate.

Note that steps 2 and 3 indicate errors. Error recovery is not specified by this document. When terminating with an error in steps 2 and 3, it may be wise to set U to the value of W1 to help the caller diagnose the error and not lose information. Also note that a string decoding algorithm, as opposed to the single-character decoding described above, need not terminate upon detection of an error, if proper error reporting and/or recovery is provided.
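
The decoding steps can be sketched in C along the same lines. The function name utf16_decode, its parameters, and the convention of returning 0 on error are assumptions made for illustration. The sketch follows the suggestion above of setting U to the value of W1 when an error is detected, and it assumes the caller passes at least one 16-bit integer.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of steps 1 through 5. Decodes one character from the n 16-bit
   integers at w (n >= 1 is assumed), stores its value in *u, and returns
   the number of integers consumed (1 or 2). On error it returns 0 and
   leaves the value of W1 in *u to help the caller diagnose the error. */
size_t utf16_decode(const uint16_t *w, size_t n, uint32_t *u)
{
    uint16_t w1 = w[0];
    uint16_t w2;

    *u = w1;                            /* result of step 1, or error value */
    if (w1 < 0xD800 || w1 > 0xDFFF)     /* step 1 */
        return 1;
    if (w1 > 0xDBFF)                    /* step 2: W1 is not a high surrogate */
        return 0;
    if (n < 2)                          /* step 3: there is no W2 */
        return 0;
    w2 = w[1];
    if (w2 < 0xDC00 || w2 > 0xDFFF)     /* step 3: W2 is not a low surrogate */
        return 0;
    /* step 4 builds the 20-bit U'; step 5 adds 0x10000 */
    *u = (((uint32_t)(w1 & 0x3FF) << 10) | (uint32_t)(w2 & 0x3FF)) + 0x10000;
    return 2;
}
```

A string decoder built on such a function could, on a 0 return, report the error and continue with the next 16-bit integer rather than stop, in line with the remark above about string decoding.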

3. Labelling UTF-16 text

Appendix A of this specification contains registrations for three MIME charsets: "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the combination of a CCS (a coded character set) and a CES (a character encoding scheme). Here the CCS is Unicode/ISO 10646 and the CES is the same in all three cases, except for the serialization order of the octets in each character, and the external determination of which serialization is used.

This section describes which of the three labels to apply to a stream of text. Section 4 describes how to interpret the labels on a stream of text.

3.1 Definition of big-endian and little-endian

Historically, computer hardware has processed two-octet entities such as 16-bit integers in one of two ways. So-called "big-endian" hardware handles two-octet entities with the higher-order octet first, that is, at the lower address in memory; when written out to disk or to a network interface (serializing), the high-order octet thus appears first in the data stream. On the other hand, "little-endian" hardware handles two-octet entities with the lower-order octet first. Hardware of both kinds is common today.

For example, the unsigned 16-bit integer that represents the decimal number 258 is 0x0102. The big-endian serialization of that number is the octet 0x01 followed by the octet 0x02. The little-endian serialization of that number is the octet 0x02 followed by the octet 0x01. The following C code fragment demonstrates a way to write 16-bit quantities to a file in big-endian order, irrespective of the hardware's native byte order.

  #include <stdio.h>

  void write_be(unsigned short u, FILE *f)  /* assume short is 16 bits */
  {
    putc(u >> 8,   f);                      /* output high-order byte */
    putc(u & 0xFF, f);                      /* then low-order */
  }
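
The same approach works when serializing into memory rather than to a file. The sketch below stores a 16-bit integer in either order; the names store_be16 and store_le16 are illustrative and do not come from this specification.

```c
#include <stdint.h>

/* Store a 16-bit integer into out[0..1], high-order octet first. */
void store_be16(uint16_t u, unsigned char out[2])
{
    out[0] = (unsigned char)(u >> 8);
    out[1] = (unsigned char)(u & 0xFF);
}

/* Store a 16-bit integer into out[0..1], low-order octet first. */
void store_le16(uint16_t u, unsigned char out[2])
{
    out[0] = (unsigned char)(u & 0xFF);
    out[1] = (unsigned char)(u >> 8);
}
```

Because both functions use shifts and masks rather than copying the integer's bytes from memory, the result does not depend on the byte order of the hardware running the code.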
        

The term "network byte order" has been used in many RFCs to indicate big-endian serialization, although that term has yet to be formally defined in a standards-track document. Although ISO 10646 prefers big-endian serialization (section 6.3 of [ISO-10646]), little-endian order is also sometimes used on the Internet.

3.2 Byte order mark (BOM)

The Unicode Standard and ISO 10646 define the character "ZERO WIDTH NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE ORDER MARK" (abbreviated "BOM"). The latter name hints at a second possible usage of the character, in addition to its normal use as a genuine "ZERO WIDTH NON-BREAKING SPACE" within text. This usage, suggested by Unicode section 2.4 and ISO 10646 Annex F (informative), is to prepend a 0xFEFF character to a stream of Unicode characters as a "signature"; a receiver of such a serialized stream may then use the initial character both as a hint that the stream consists of Unicode characters and as a way to recognize the serialization order. In serialized UTF-16 prepended with such a signature, the order is big-endian if the first two octets are 0xFE followed by 0xFF; if they are 0xFF followed by 0xFE, the order is little-endian. Note that 0xFFFE is not a Unicode character, precisely to preserve the usefulness of 0xFEFF as a byte-order mark.

It is important to understand that the character 0xFEFF appearing at any position other than the beginning of a stream MUST be interpreted with the semantics for the zero-width non-breaking space, and MUST NOT be interpreted as a byte-order mark. The converse of that statement is not always true: the character 0xFEFF in the first position of a stream MAY be interpreted as a zero-width non-breaking space, and is not always a byte-order mark. For example, if a process splits a UTF-16 string into many parts, a part might begin with 0xFEFF because there was a zero-width non-breaking space at the beginning of that substring.

The Unicode standard further suggests that an initial 0xFEFF character may be stripped before processing the text, the rationale being that such a character in initial position may be an artifact of the encoding (an encoding signature), not a genuine intended "ZERO WIDTH NON-BREAKING SPACE". Note that such stripping might affect an external process at a different layer (such as a digital signature or a count of the characters) that is relying on the presence of all characters in the stream.

In particular, in UTF-16 plain text it is likely, but not certain, that an initial 0xFEFF is a signature. When concatenating two strings, it is important to strip out those signatures, because otherwise the resulting string may contain an unintended "ZERO WIDTH NON-BREAKING SPACE" at the connection point. Also, some specifications mandate an initial 0xFEFF character in objects labelled as UTF-16 and specify that this signature is not part of the object.
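
A minimal sketch of such a concatenation, operating on already de-serialized 16-bit integers, is shown below. It assumes that an initial 0xFEFF in the second string is a signature and drops it; as noted above, that is likely but not certain, so a real application may need additional context to decide. The function name utf16_concat and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Concatenate a (na integers) and b (nb integers) into out, which must
   have room for na + nb integers. An initial 0xFEFF in b is assumed to
   be a signature and is dropped so that it does not end up at the
   connection point. Returns the number of integers written. */
size_t utf16_concat(const uint16_t *a, size_t na,
                    const uint16_t *b, size_t nb, uint16_t *out)
{
    if (nb > 0 && b[0] == 0xFEFF) {     /* assumed signature, not text */
        b++;
        nb--;
    }
    if (na > 0)
        memcpy(out, a, na * sizeof *a);
    if (nb > 0)
        memcpy(out + na, b, nb * sizeof *b);
    return na + nb;
}
```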

3.3 Choosing a label for UTF-16 text

Any labelling application that uses UTF-16 character encoding, and explicitly labels the text, and knows the serialization order of the characters in text, SHOULD label the text as either "UTF-16BE" or "UTF-16LE", whichever is appropriate based on the endianness of the text. This allows applications processing the text, but unable to look inside the text, to know the serialization definitively.

Text in the "UTF-16BE" charset MUST be serialized with the octets which make up a single 16-bit UTF-16 value in big-endian order. Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.

Text in the "UTF-16LE" charset MUST be serialized with the octets which make up a single 16-bit UTF-16 value in little-endian order. Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.

Any labelling application that uses UTF-16 character encoding, and puts an explicit charset label on the text, and does not know the serialization order of the characters in text, MUST label the text as "UTF-16", and SHOULD make sure the text starts with 0xFEFF.

An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE" would occur with document formats that mandate a BOM in UTF-16 text, thereby requiring the use of the "UTF-16" tag only.
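
The labelling rules of this section can be condensed into a small decision function. The sketch below is only one reading of those rules; the function name utf16_choose_label and its three flag parameters are hypothetical. It does not itself prepend or remove a BOM, which remains the caller's responsibility.

```c
/* Sketch of the label choice in this section.
   order_known:         nonzero if the serialization order is known
   big_endian:          nonzero for big-endian (used only if order_known)
   format_mandates_bom: nonzero if the document format requires a BOM
   Returns the charset label as a string literal. */
const char *utf16_choose_label(int order_known, int big_endian,
                               int format_mandates_bom)
{
    if (!order_known || format_mandates_bom)
        return "UTF-16";        /* text SHOULD then start with 0xFEFF */
    return big_endian ? "UTF-16BE" : "UTF-16LE";   /* no BOM allowed */
}
```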

4. Interpreting text labels

When a program sees text labelled as "UTF-16BE", "UTF-16LE", or "UTF-16", it can make some assumptions, based on the labelling rules given in the previous section. These assumptions allow the program to then process the text.

4.1 Interpreting text labelled as UTF-16BE

Text labelled "UTF-16BE" can always be interpreted as being big-endian. The detection of an initial BOM does not affect de-serialization of text labelled as UTF-16BE. Finding 0xFF followed by 0xFE is an error since there is no Unicode character 0xFFFE.

4.2 Interpreting text labelled as UTF-16LE

Text labelled "UTF-16LE" can always be interpreted as being little-endian. The detection of an initial BOM does not affect de-serialization of text labelled as UTF-16LE. Finding 0xFE followed by 0xFF is an error since there is no Unicode character 0xFFFE, which would be the interpretation of those octets under little-endian order.
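
The error condition described in Sections 4.1 and 4.2 amounts to reading two octets in the labelled order and obtaining 0xFFFE. A minimal sketch of that check follows; the function name utf16_is_fffe and the big_endian flag are assumptions made for illustration.

```c
/* Returns 1 if the two octets at p, read in the labelled order
   (big_endian nonzero for "UTF-16BE", zero for "UTF-16LE"), form the
   value 0xFFFE, which is not a Unicode character; returns 0 otherwise. */
int utf16_is_fffe(const unsigned char p[2], int big_endian)
{
    unsigned int v = big_endian
        ? (((unsigned int)p[0] << 8) | p[1])
        : (((unsigned int)p[1] << 8) | p[0]);

    return v == 0xFFFE;
}
```

With this check, the octets 0xFF 0xFE are flagged in text labelled "UTF-16BE", and the octets 0xFE 0xFF are flagged in text labelled "UTF-16LE".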

4.3 Interpreting text labelled as UTF-16

Text labelled with the "UTF-16" charset might be serialized in either big-endian or little-endian order. If the first two octets of the text are 0xFE followed by 0xFF, then the text can be interpreted as being big-endian. If the first two octets of the text are 0xFF followed by 0xFE, then the text can be interpreted as being little-endian. If the first two octets of the text are neither 0xFE followed by 0xFF nor 0xFF followed by 0xFE, then the text SHOULD be interpreted as being big-endian.

All applications that process text with the "UTF-16" charset label MUST be able to read at least the first two octets of the text and be able to process those octets in order to determine the serialization order of the text. Applications that process text with the "UTF-16" charset label MUST NOT assume the serialization without first checking the first two octets to see if they are a big-endian BOM, a little-endian BOM, or not a BOM. All applications that process text with the "UTF-16" charset label MUST be able to interpret both big-endian and little-endian text.
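
The check required of applications that process text labelled "UTF-16" can be sketched as follows. The enum, the function name utf16_detect_order, and the bom_len output parameter are hypothetical; whether the caller then strips the BOM is outside the scope of this sketch.

```c
#include <stddef.h>

enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Sketch for text labelled "UTF-16": inspect the first two octets to
   determine the serialization order. *bom_len receives 2 if a BOM was
   found and 0 otherwise. */
enum utf16_order utf16_detect_order(const unsigned char *text, size_t len,
                                    size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 2) {
        if (text[0] == 0xFE && text[1] == 0xFF) {
            *bom_len = 2;
            return UTF16_BIG_ENDIAN;        /* big-endian BOM */
        }
        if (text[0] == 0xFF && text[1] == 0xFE) {
            *bom_len = 2;
            return UTF16_LITTLE_ENDIAN;     /* little-endian BOM */
        }
    }
    return UTF16_BIG_ENDIAN;    /* no BOM: SHOULD be read as big-endian */
}
```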

5. Examples

For the sake of example, let's suppose that there is a hieroglyphic character representing the Egyptian god Ra with character value 0x12345 (this character does not exist at present in Unicode).

The examples here all evaluate to the phrase:

*=Ra

where the "*" represents the Ra hieroglyph (0x12345).

Text labelled with UTF-16BE, without a BOM:
   D8 08 DF 45 00 3D 00 52 00 61

Text labelled with UTF-16LE, without a BOM:
   08 D8 45 DF 3D 00 52 00 61 00

Big-endian text labelled with UTF-16, with a BOM:
   FE FF D8 08 DF 45 00 3D 00 52 00 61

Little-endian text labelled with UTF-16, with a BOM:
   FF FE 08 D8 45 DF 3D 00 52 00 61 00
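
The octet sequences above can be reproduced with a short serializer. The sketch below is illustrative only: the names put16 and utf16_serialize and the flag parameters are hypothetical, and the code assumes that every character value is valid and no greater than 0x10FFFF and that the output buffer is large enough.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one 16-bit integer as two octets in the requested order. */
static size_t put16(uint16_t v, int big_endian, unsigned char *out)
{
    out[0] = (unsigned char)(big_endian ? v >> 8 : v & 0xFF);
    out[1] = (unsigned char)(big_endian ? v & 0xFF : v >> 8);
    return 2;
}

/* Serialize n character values as UTF-16 octets into out and return the
   number of octets written. with_bom prepends 0xFEFF as a signature. */
size_t utf16_serialize(const uint32_t *chars, size_t n, int big_endian,
                       int with_bom, unsigned char *out)
{
    size_t len = 0, i;

    if (with_bom)
        len += put16(0xFEFF, big_endian, out + len);
    for (i = 0; i < n; i++) {
        uint32_t u = chars[i];
        if (u < 0x10000) {
            len += put16((uint16_t)u, big_endian, out + len);
        } else {
            uint32_t up = u - 0x10000;  /* U' of Section 2.1 */
            len += put16((uint16_t)(0xD800 | (up >> 10)), big_endian, out + len);
            len += put16((uint16_t)(0xDC00 | (up & 0x3FF)), big_endian, out + len);
        }
    }
    return len;
}
```

Calling utf16_serialize with the character values 0x12345, 0x3D, 0x52, 0x61 and the four combinations of big_endian and with_bom used above yields the four octet sequences listed in this section.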

6. Versions of the standards

ISO/IEC 10646 is updated from time to time by published amendments; similarly, different versions of the Unicode standard exist: 1.0, 1.1, 2.0, 2.1, and 3.0 as of this writing. Each new version replaces the previous one, but implementations, and more significantly data, are not updated instantly.

In general, the changes amount to adding new characters, which does not pose particular problems with old data. Amendment 5 to ISO/IEC 10646, however, has moved and expanded the Korean Hangul block, thereby making any previous data containing Hangul characters invalid under the new version. Unicode 2.0 has the same difference from Unicode 1.1. The official justification for allowing such an incompatible change was that no significant implementations and data containing Hangul existed, a statement that is likely to be true but remains unprovable. The incident has been dubbed the "Korean mess", and the relevant committees have pledged to never, ever again make such an incompatible change.

New versions, and in particular any incompatible changes, have consequences regarding MIME character encoding labels, to be discussed in Appendix A.

7. IANA Considerations

IANA is to register the character sets found in Appendixes A.1, A.2, and A.3 according to RFC 2278, using registration templates found in those appendixes.

8. Security Considerations

UTF-16 is based on the ISO 10646 character set, which is frequently being added to, as described in Section 6 and Appendix A of this document. Processors must be able to handle characters that are not defined at the time that the processor was created in such a way as to not allow an attacker to harm a recipient by including unknown characters.

Processors that handle any type of text, including text encoded as UTF-16, must be vigilant in checking for control characters that might reprogram a display terminal or keyboard. Similarly, processors that interpret text entities (such as looking for embedded programming code) must be careful not to execute the code without first alerting the recipient.

Text in UTF-16 may contain special characters, such as the OBJECT REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, depending on the interpretation of the processing program and the availability of an external data stream that would be executed. This external processing may have side-effects that allow the sender of a message to attack the receiving system.

Implementors of UTF-16 need to consider the security aspects of how they handle illegal UTF-16 sequences (that is, sequences involving surrogate pairs that have illegal values or unpaired surrogates). It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-16 parser by sending it an octet sequence that is not permitted by the UTF-16 syntax, causing it to behave in some anomalous fashion.
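
One defensive measure is to check a sequence of 16-bit integers for unpaired surrogates before any further processing. The following is a minimal sketch of such a check; the function name utf16_first_unpaired and its return convention are assumptions, and what to do with an ill-formed sequence (reject, replace, report) is left to the application.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first unpaired surrogate in w[0..n-1], or n
   if every surrogate is part of a valid high/low pair. */
size_t utf16_first_unpaired(const uint16_t *w, size_t n)
{
    size_t i = 0;

    while (i < n) {
        if (w[i] >= 0xD800 && w[i] <= 0xDBFF) {
            /* high surrogate: must be followed by a low surrogate */
            if (i + 1 >= n || w[i + 1] < 0xDC00 || w[i + 1] > 0xDFFF)
                return i;
            i += 2;
        } else if (w[i] >= 0xDC00 && w[i] <= 0xDFFF) {
            return i;           /* low surrogate with no preceding high */
        } else {
            i++;
        }
    }
    return n;
}
```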

9. References

[CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998.

[CHARSET-REG] Freed, N. and J. Postel, "IANA Charset Registration Procedures", BCP 19, RFC 2278, January 1998.

[HTTP-1.1] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. 22 amendments and two technical corrigenda have been published up to now. UTF-16 is described in Annex Q, published as Amendment 1. Many other amendments are currently at various stages of standardization. A second edition is in preparation, probably to be published in 2000; in this new edition, UTF-16 will probably be described in Annex C.

[MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 3.0", ISBN 0-201-61633-5. Described at <http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC 2279, January 1998.

[WORKSHOP] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., Atkinson, R., Crispin, M. and P. Svanberg, "Report of the IAB Character Set Workshop", RFC 2130, April 1997.

10. Acknowledgments

Deborah Goldsmith wrote a great deal of the initial wording for this specification. Martin Duerst proposed numerous significant changes. Other significant contributors include:

   Mati Allouche
   Walt Daniels
   Mark Davis
   Ned Freed
   Asmus Freytag
   Lloyd Honomichl
   Dan Kegel
   Murata Makoto
   Larry Masinter
   Markus Scherer
   Keld Simonsen
   Ken Whistler

Some of the text in this specification was copied from [UTF-8], and that document was worked on by many people. Please see the acknowledgments section in that document for more people who may have contributed indirectly to this document.

A. Charset registrations

This memo is meant to serve as the basis for registration of three MIME charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE", and "UTF-16". These strings label objects containing text consisting of characters from the repertoire of ISO/IEC 10646 including all amendments at least up to amendment 5 (Korean block), encoded to a sequence of octets using the encoding and serialization schemes outlined above.

Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in media types under the "text" top-level type, because they do not encode line endings in the way required for MIME "text" media types. An exception to this is HTTP, which uses a MIME-like mechanism, but is exempt from the restrictions on the text top-level type (see section 19.4.2 of HTTP 1.1 [HTTP-1.1]).

It is noteworthy that the labels described here do not contain a version identification, referring generically to ISO/IEC 10646. This is intentional, the rationale being as follows:

A MIME charset is designed to give just the information needed to interpret a sequence of bytes received on the wire into a sequence of characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As long as a character set standard does not change incompatibly, version numbers serve no purpose, because one gains nothing by learning from the tag that newly assigned characters may be received that one doesn't know about. The tag itself doesn't teach anything about the new characters, which are going to be received anyway.

Hence, as long as the standards evolve compatibly, the apparent advantage of having labels that identify the versions is only that, apparent. But there is a disadvantage to such version-dependent labels: when an older application receives data accompanied by a newer, unknown label, it may fail to recognize the label and be completely unable to deal with the data, whereas a generic, known label would have triggered mostly correct processing of the data, which may well not contain any new characters.

The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in principle contradicting the appropriateness of a version independent MIME charset as described above. But the compatibility problem can only appear with data containing Korean Hangul characters encoded according to Unicode 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there is arguably no such data to worry about, this being the very reason the incompatible change was deemed acceptable.

In practice, then, a version-independent label is warranted, provided the label is understood to refer to all versions after Amendment 5, and provided no incompatible change actually occurs. Should incompatible changes occur in a later version of ISO/IEC 10646, the MIME charsets defined here will stay aligned with the previous version until and unless the IETF specifically decides otherwise.

A.1 Registration for UTF-16BE

   To: ietf-charsets@iana.org
   Subject: Registration of new charset

Charset name(s): UTF-16BE

Published specification(s): This specification

Suitable for use in MIME content types under the "text" top-level type: No

   Person & email address to contact for further information:
   Paul Hoffman <phoffman@imc.org>
   Francois Yergeau <fyergeau@alis.com>
        
A.2 Registration for UTF-16LE

   To: ietf-charsets@iana.org
   Subject: Registration of new charset

Charset name(s): UTF-16LE

Published specification(s): This specification

Suitable for use in MIME content types under the "text" top-level type: No

   Person & email address to contact for further information:
   Paul Hoffman <phoffman@imc.org>
   Francois Yergeau <fyergeau@alis.com>
        
A.3 Registration for UTF-16

   To: ietf-charsets@iana.org
   Subject: Registration of new charset

Charset name(s): UTF-16

Published specification(s): This specification

Suitable for use in MIME content types under the "text" top-level type: No

   Person & email address to contact for further information:
   Paul Hoffman <phoffman@imc.org>
   Francois Yergeau <fyergeau@alis.com>
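
As an informal illustration of how the three registered labels differ on the wire, the following sketch uses Python's built-in `utf-16-be` and `utf-16-le` codecs. The helper `decode_utf16` is hypothetical; it assumes that octets labeled "UTF-16" are read by honoring a leading byte order mark and falling back to big-endian when none is present, which is one reading of the rules given in the body of this document and not a normative implementation.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode octets labeled "UTF-16" (hypothetical helper).

    A leading BOM selects the byte order and is not part of the text;
    without a BOM, big-endian is assumed.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")


# One BMP character (U+0041) and one character outside the BMP (U+10000),
# which is encoded as the surrogate pair D800 DC00.
text = "A\U00010000"

be = text.encode("utf-16-be")  # octets: 00 41 D8 00 DC 00
le = text.encode("utf-16-le")  # octets: 41 00 00 D8 00 DC

# Under the "UTF-16" label, the BOM tells the receiver which order was used.
from_le_with_bom = decode_utf16(codecs.BOM_UTF16_LE + le)
from_be_with_bom = decode_utf16(codecs.BOM_UTF16_BE + be)
from_be_without_bom = decode_utf16(be)
```

Data labeled UTF-16BE or UTF-16LE would instead be passed straight to the matching codec, since the label itself already states the byte order.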
        

Authors' Addresses

   Paul Hoffman
   Internet Mail Consortium
   127 Segre Place
   Santa Cruz, CA 95060 USA

   EMail: phoffman@imc.org
        

   Francois Yergeau
   Alis Technologies
   100, boul. Alexis-Nihon, Suite 600
   Montreal QC H4M 2P2 Canada

   EMail: fyergeau@alis.com
        

Full Copyright Statement

Copyright (C) The Internet Society (2000). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.
