Network Working Group                                       F. Yergeau
Request for Comments: 2279                           Alis Technologies
Obsoletes: 2044                                           January 1998
Category: Standards Track
        

UTF-8, a transformation format of ISO 10646

Status of this Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (1998). All Rights Reserved.

Abstract

ISO/IEC 10646-1 defines a multi-octet character set called the Universal Character Set (UCS) which encompasses most of the world's writing systems. Multi-octet characters, however, are not compatible with many current applications and protocols, and this has led to the development of a few so-called UCS transformation formats (UTF), each with different characteristics. UTF-8, the object of this memo, has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This memo updates and replaces RFC 2044, in particular addressing the question of versions of the relevant standards.

1. Introduction

ISO/IEC 10646-1 [ISO-10646] defines a multi-octet character set called the Universal Character Set (UCS), which encompasses most of the world's writing systems. Two multi-octet encodings are defined, a four-octet per character encoding called UCS-4 and a two-octet per character encoding called UCS-2, able to address only the first 64K characters of the UCS (the Basic Multilingual Plane, BMP), outside of which there are currently no assignments.

It is noteworthy that the same set of characters is defined by the Unicode standard [UNICODE], which further defines additional character properties and other application details of great interest to implementors, but does not have the UCS-4 encoding. Up to the present time, changes in Unicode and amendments to ISO/IEC 10646 have tracked each other, so that the character repertoires and code point assignments have remained in sync. The relevant standardization committees have committed to maintain this very useful synchronism.

The UCS-2 and UCS-4 encodings, however, are hard to use in many current applications and protocols that assume 8 or even 7 bit characters. Even newer systems able to deal with 16 bit characters cannot process UCS-4 data. This situation has led to the development of so-called UCS transformation formats (UTF), each with different characteristics.

UTF-1 has only historical interest, having been removed from ISO/IEC 10646. UTF-7 has the quality of encoding the full BMP repertoire using only octets with the high-order bit clear (7 bit US-ASCII values, [US-ASCII]), and is thus deemed a mail-safe encoding ([RFC2152]). UTF-8, the object of this memo, uses all bits of an octet, but has the quality of preserving the full US-ASCII range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for a US-ASCII character, and nothing else.

UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire into pairs of UCS-2 values from a reserved range. UTF-16 impacts UTF-8 in that UCS-2 values from the reserved range must be treated specially in the UTF-8 transformation.

UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of octets, where the number of octets, and the value of each, depend on the integer value assigned to the character in ISO/IEC 10646. This transformation format has the following characteristics (all values are in hexadecimal):

- Character values from 0000 0000 to 0000 007F (US-ASCII repertoire) correspond to octets 00 to 7F (7 bit US-ASCII values). A direct consequence is that a plain ASCII string is also a valid UTF-8 string.

- US-ASCII values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems or other software (e.g. the printf() function in C libraries) that parse based on US-ASCII values but are transparent to other values.

- Round-trip conversion is easy between UTF-8 and either of UCS-4, UCS-2.

- The first octet of a multi-octet sequence indicates the number of octets in the sequence.

- The octet values FE and FF never appear.

- Character boundaries are easily found from anywhere in an octet stream.

- The lexicographic sorting order of UCS-4 strings is preserved. Of course this is of limited interest since the sort order is not culturally valid in either case.

- The Boyer-Moore fast search algorithm can be used with UTF-8 data.

- UTF-8 strings can be fairly reliably recognized as such by a simple algorithm, i.e. the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length.
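
Two of the properties listed above (the first octet announces the sequence length, and character boundaries can be found from any position) can be illustrated with a minimal Python sketch. It assumes the octet layout defined in section 2; the function names sequence_length and char_start are hypothetical and chosen only for this illustration.

```python
def sequence_length(first_octet: int) -> int:
    """Return the sequence length announced by a lead octet, or 0 if it is not one."""
    if first_octet < 0x80:
        return 1                      # 0xxxxxxx: single-octet US-ASCII value
    if first_octet < 0xC0:
        return 0                      # 10xxxxxx: continuation octet, not a lead
    if first_octet >= 0xFE:
        return 0                      # FE and FF never appear in UTF-8
    length = 0
    mask = 0x80
    while first_octet & mask:         # count the leading 1 bits
        length += 1
        mask >>= 1
    return length


def char_start(data: bytes, pos: int) -> int:
    """Back up from pos to the first octet of the character that contains it."""
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


sample = bytes.fromhex("41 E2 89 A2 CE 91 2E")
print(sequence_length(sample[1]))     # length announced by the octet E2
print(char_start(sample, 3))          # start of the character containing offset 3
```

Because continuation octets always have the form 10xxxxxx, backing up never needs to inspect more than five octets for a well-formed stream.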

UTF-8 was originally a project of the X/Open Joint Internationalization Group (XOJIG), with the objective of specifying a File System Safe UCS Transformation Format [FSS_UTF] that is compatible with UNIX systems, supporting multilingual text in a single encoding. The original authors were Gary Miller, Greger Leijonhufvud and John Entenmann. Later, Ken Thompson and Rob Pike did significant work for the formal UTF-8.

A description can also be found in Unicode Technical Report #4 and in the Unicode Standard, version 2.0 [UNICODE]. The definitive reference, including provisions for UTF-16 data within UTF-8, is Annex R of ISO/IEC 10646-1 [ISO-10646].

2. UTF-8 definition

In UTF-8, characters are encoded using sequences of 1 to 6 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character value. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the value of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.

The table below summarizes the format of these different octet types. The letter x indicates bits available for encoding bits of the UCS-4 character value.

   UCS-4 range (hex.)    UTF-8 octet sequence (binary)
   0000 0000-0000 007F   0xxxxxxx
   0000 0080-0000 07FF   110xxxxx 10xxxxxx
   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx

Encoding from UCS-4 to UTF-8 proceeds as follows:

1) Determine the number of octets required from the character value and the first column of the table above. It is important to note that the rows of the table are mutually exclusive, i.e. there is only one valid way to encode a given UCS-4 character.

2) Prepare the high-order bits of the octets as per the second column of the table.

3) Fill in the bits marked x from the bits of the character value, starting from the lower-order bits of the character value and putting them first in the last octet of the sequence, then the next to last, etc. until all x bits are filled in.
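
The three encoding steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative implementation: the name encode_ucs4 and the two lookup tables are hypothetical, and the sketch does not give UTF-16 surrogate values any special treatment.

```python
# Upper bound of each row of the table; the row index plus one is the octet count.
_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# Fixed high-order bits of the first octet for sequences of 2 to 6 octets.
_LEAD = {2: 0xC0, 3: 0xE0, 4: 0xF0, 5: 0xF8, 6: 0xFC}


def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 value (0 to 7FFF FFFF) as a UTF-8 octet sequence."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the UCS-4 range")
    # Step 1: pick the row of the table, which fixes the number of octets.
    n = next(i + 1 for i, limit in enumerate(_LIMITS) if value <= limit)
    if n == 1:
        return bytes([value])
    # Steps 2 and 3: fill the last octet first, six low-order bits at a time.
    octets = []
    for _ in range(n - 1):
        octets.append(0x80 | (value & 0x3F))   # 10xxxxxx continuation octet
        value >>= 6
    octets.append(_LEAD[n] | value)            # first octet gets the remaining bits
    return bytes(reversed(octets))


print(encode_ucs4(0x2262).hex())               # NOT IDENTICAL TO
print(encode_ucs4(0x7FFFFFFF).hex())           # largest UCS-4 value
```

Since each value falls in exactly one row of the table, the sketch can only ever produce the shortest form for a given character.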

The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be obtained from the above, in principle, by simply extending each UCS-2 character with two zero-valued octets. However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance), being actually UCS-4 characters transformed through UTF-16, need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.
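
Undoing the UTF-16 transformation before encoding can be sketched as below. The arithmetic follows the UTF-16 definition (a value from D800 to DBFF followed by a value from DC00 to DFFF); that split of the reserved range is taken from UTF-16 itself rather than from this memo, and the function names are hypothetical.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Undo the UTF-16 transformation for one pair, yielding a UCS-4 value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def ucs2_to_ucs4(units):
    """Expand a list of UCS-2 values to UCS-4 values, merging surrogate pairs."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDFFF:
            if i + 1 >= len(units):
                raise ValueError("unpaired surrogate at end of input")
            out.append(combine_surrogates(u, units[i + 1]))
            i += 2
        else:
            out.append(u)             # zero-extension: the value is unchanged
            i += 1
    return out


print([hex(v) for v in ucs2_to_ucs4([0x0041, 0xD800, 0xDC00])])
```

The resulting UCS-4 values are then encoded exactly as described in the steps above.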

Decoding from UTF-8 to UCS-4 proceeds as follows:

1) Initialize the 4 octets of the UCS-4 character with all bits set to 0.

2) Determine which bits encode the character value from the number of octets in the sequence and the second column of the table above (the bits marked x).

3) Distribute the bits from the sequence to the UCS-4 character, first the lower-order bits from the last octet of the sequence and proceeding to the left until no x bits are left.

If the UTF-8 sequence is no more than three octets long, decoding can proceed directly to UCS-2.
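
The decoding steps above can be sketched as follows for a single, complete sequence. The name decode_one is hypothetical. The sketch deliberately performs no validity checks, so it should be read as an illustration of the bit manipulation only and not be used on untrusted input.

```python
def decode_one(seq: bytes) -> int:
    """Decode one complete UTF-8 sequence into its UCS-4 value (no validation)."""
    first = seq[0]
    if first < 0x80:
        return first                  # 0xxxxxxx: the octet is the value
    n = len(seq)
    # Step 2: in an n-octet sequence the first octet carries 7 - n value bits.
    value = first & (0x7F >> n)
    # Step 3: each following octet contributes its six low-order bits.
    # Accumulating left to right gives the same result as filling the
    # UCS-4 value from its low-order end.
    for octet in seq[1:]:
        value = (value << 6) | (octet & 0x3F)
    return value


print(hex(decode_one(bytes.fromhex("E2 89 A2"))))
```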

NOTE -- actual implementations of the decoding algorithm above should protect against decoding invalid sequences. For instance, a naive implementation may (wrongly) decode the invalid UTF-8 sequence C0 80 into the character U+0000, which may have security consequences and/or cause other problems. See the Security Considerations section below.

A more detailed algorithm and formulae can be found in [FSS_UTF], [UNICODE] or Annex R to [ISO-10646].

3. Versions of the standards

ISO/IEC 10646 is updated from time to time by published amendments; similarly, different versions of the Unicode standard exist: 1.0, 1.1 and 2.0 as of this writing. Each new version obsoletes and replaces the previous one, but implementations, and more significantly data, are not updated instantly.

In general, the changes amount to adding new characters, which does not pose particular problems with old data. Amendment 5 to ISO/IEC 10646, however, has moved and expanded the Korean Hangul block, thereby making any previous data containing Hangul characters invalid under the new version. Unicode 2.0 has the same difference from Unicode 1.1. The official justification for allowing such an incompatible change was that no implementations and no data containing Hangul existed, a statement that is likely to be true but remains unprovable. The incident has been dubbed the "Korean mess", and the relevant committees have pledged to never, ever again make such an incompatible change.

New versions, and in particular any incompatible changes, have consequences regarding MIME character encoding labels, to be discussed in section 5.

4. Examples

The UCS-2 sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391, 002E) may be encoded in UTF-8 as follows:

41 E2 89 A2 CE 91 2E

The UCS-2 sequence representing the Hangul characters for the Korean word "hangugo" (D55C, AD6D, C5B4) may be encoded as follows:

ED 95 9C EA B5 AD EC 96 B4

The UCS-2 sequence representing the Han characters for the Japanese word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:

E6 97 A5 E6 9C AC E8 AA 9E
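
The three examples above can be checked mechanically. The sketch below uses Python's built-in "utf-8" codec, which agrees with this memo for the BMP characters used here; the dictionary name examples is introduced only for this illustration.

```python
# UCS-2 text of each example, mapped to the octets given in this section.
examples = {
    "A\u2262\u0391.": "41 E2 89 A2 CE 91 2E",
    "\uD55C\uAD6D\uC5B4": "ED 95 9C EA B5 AD EC 96 B4",
    "\u65E5\u672C\u8A9E": "E6 97 A5 E6 9C AC E8 AA 9E",
}

for text, expected in examples.items():
    encoded = text.encode("utf-8")
    # Fails loudly if the codec and the memo ever disagree.
    assert encoded == bytes.fromhex(expected)
    print(" ".join("%02X" % octet for octet in encoded))
```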

5. MIME registration

This memo is meant to serve as the basis for registration of a MIME character set parameter (charset) [CHARSET-REG]. The proposed charset parameter value is "UTF-8". This string labels media types containing text consisting of characters from the repertoire of ISO/IEC 10646 including all amendments at least up to amendment 5 (Korean block), encoded to a sequence of octets using the encoding scheme outlined above. UTF-8 is suitable for use in MIME content types under the "text" top-level type.
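
As an illustration of the label in use, the following sketch builds a "text/plain" message body labelled with this charset. It relies on Python's standard email package, which is an assumption of this example and not something this memo specifies; the message text is arbitrary.

```python
from email.message import EmailMessage

msg = EmailMessage()
# Label the text body with the charset parameter value proposed above.
msg.set_content("A\u2262\u0391.", charset="utf-8")

print(msg["Content-Type"])            # the header carrying the charset parameter
print(msg.get_content_charset())      # the charset as seen by a receiver
```

Charset labels are compared case-insensitively in MIME, so a library may emit the value in lower case.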

It is noteworthy that the label "UTF-8" does not contain a version identification, referring generically to ISO/IEC 10646. This is intentional, the rationale being as follows:

A MIME charset label is designed to give just the information needed to interpret a sequence of bytes received on the wire into a sequence of characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As long as a character set standard does not change incompatibly, version numbers serve no purpose, because one gains nothing by learning from the tag that newly assigned characters may be received that one doesn't know about. The tag itself doesn't teach anything about the new characters, which are going to be received anyway.

Hence, as long as the standards evolve compatibly, the apparent advantage of having labels that identify the versions is only that, apparent. But there is a disadvantage to such version-dependent labels: when an older application receives data accompanied by a newer, unknown label, it may fail to recognize the label and be completely unable to deal with the data, whereas a generic, known label would have triggered mostly correct processing of the data, which may well not contain any new characters.

Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in principle contradicting the appropriateness of a version independent MIME charset label as described above. But the compatibility problem can only appear with data containing Korean Hangul characters encoded according to Unicode 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there is arguably no such data to worry about, this being the very reason the incompatible change was deemed acceptable.

In practice, then, a version-independent label is warranted, provided the label is understood to refer to all versions after Amendment 5, and provided no incompatible change actually occurs. Should incompatible changes occur in a later version of ISO/IEC 10646, the MIME charset label defined here will stay aligned with the previous version until and unless the IETF specifically decides otherwise.

It is also proposed to register the charset parameter value "UNICODE-1-1-UTF-8", for the exclusive purpose of labelling text data containing Hangul syllables encoded to UTF-8 without taking into account Amendment 5 of ISO/IEC 10646 (i.e. using the pre-amendment 5 code point assignments). Any other UTF-8 data SHOULD NOT use this label, in particular data not containing any Hangul syllables, and it is felt important to strongly recommend against creating any new Hangul-containing data without taking Amendment 5 of ISO/IEC 10646 into account.

6. Security Considerations

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.
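
One way to guard against the attacks described above is to reject any sequence that is longer than the table in section 2 requires for its value. The sketch below is one possible approach under that assumption, not a complete validator: the name decode_strict is hypothetical, and values in the UTF-16 reserved range are not given any special treatment.

```python
# Smallest value that legitimately needs n octets (first column of the table).
_MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def decode_strict(data: bytes) -> list:
    """Decode UTF-8 to a list of UCS-4 values, rejecting illegal sequences."""
    values = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            n = 1
        elif 0xC0 <= first <= 0xFD:
            n = 2
            while first & (0x80 >> n):    # count the leading 1 bits
                n += 1
        else:
            # A continuation octet (80 to BF), or FE/FF, cannot start a sequence.
            raise ValueError("octet %02X cannot start a sequence (offset %d)" % (first, i))
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence at offset %d" % i)
        value = first if n == 1 else first & (0x7F >> n)
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        if value < _MIN_VALUE[n]:
            raise ValueError("overlong sequence at offset %d" % i)
        values.append(value)
        i += n
    return values


for hexstr in ("2F 2E 2E 2F", "C0 80", "2F C0 AE 2E 2F"):
    try:
        print(hexstr, "->", decode_strict(bytes.fromhex(hexstr)))
    except ValueError as err:
        print(hexstr, "-> rejected:", err)
```

With such a check in place, a filter that inspects decoded characters and a filter that inspects raw octets can no longer disagree about what the input contains.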

Acknowledgments

The following have participated in the drafting and discussion of this memo:

   James E. Agenbroad   Andries Brouwer
   Martin J. Dürst      Ned Freed
   David Goldsmith      Edwin F. Hart
   Kent Karlsson        Markus Kuhn
   Michael Kung         Alain LaBonte
   John Gardiner Myers  Murray Sargent
   Keld Simonsen        Arnold Winkler

Bibliography

[CHARSET-REG] Freed, N., and J. Postel, "IANA Charset Registration Procedures", BCP 19, RFC 2278, January 1998.

[FSS_UTF] X/Open CAE Specification C501 ISBN 1-85912-082-2 28cm. 22p. pbk. 172g. 4/95, X/Open Company Ltd., "File System Safe UCS Transformation Format (FSS_UTF)", X/Open Preliminary Specification, Document Number P316. Also published in Unicode Technical Report #4.

[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. Five amendments and a technical corrigendum have been published up to now. UTF-8 is described in Annex R, published as Amendment 2. UTF-16 is described in Annex Q, published as Amendment 1. 17 other amendments are currently at various stages of standardization.

[MIME] Freed, N., and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045. N. Freed, N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", RFC 2046. K. Moore, "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text", RFC 2047. N. Freed, J. Klensin, J. Postel, "Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures", RFC 2048. N. Freed, N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples", RFC 2049. All November 1996.

[RFC2152] Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe Transformation Format of Unicode", RFC 2152, Taligent inc., May 1997. (Obsoletes RFC 1642)

[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 2.0", Addison-Wesley, 1996.

[US-ASCII] Coded Character Set--7-bit American Standard Code for Information Interchange, ANSI X3.4-1986.

Author's Address

   Francois Yergeau
   Alis Technologies
   100, boul. Alexis-Nihon
   Suite 600
   Montreal  QC  H4M 2P2
   Canada

   Phone: +1 (514) 747-2547
   Fax:   +1 (514) 747-2561
   EMail: fyergeau@alis.com
        

Full Copyright Statement

Copyright (C) The Internet Society (1998). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
