Network Working Group                                       K. Whistler
Request for Comments: 2482                                       Sybase
Category: Informational                                        G. Adams
                                                               Spyglass
                                                           January 1999
        

Language Tagging in Unicode Plain Text

Status of this Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (1999). All Rights Reserved.

IESG Note:

This document has been accepted by ISO/IEC JTC1/SC2/WG2 in meeting #34 to be submitted as a recommendation from WG2 for inclusion in Plane 14 in part 2 of ISO/IEC 10646.

1. Abstract

This document proposes a mechanism for language tagging in [UNICODE] plain text. A set of special-use tag characters on Plane 14 of [ISO10646] (accessible through UTF-8, UTF-16, and UCS-4 encoding forms) is proposed for encoding, to enable the spelling out of ASCII-based string tags using characters which can be strictly separated from ordinary text content characters in ISO10646 (or UNICODE).

One tag identification character and one cancel tag character are also proposed. In particular, a language tag identification character is proposed to identify a language tag string specifically; the language tag itself makes use of [RFC1766] language tag strings spelled out using the Plane 14 tag characters. Provision of a specific, low-overhead mechanism for embedding language tags in plain text is aimed at meeting the need of Internet Protocols such as ACAP, which require a standard mechanism for marking language in UTF-8 strings.

The tagging mechanism as well as the characters proposed in this document have been approved by the Unicode Consortium for inclusion in The Unicode Standard. However, implementation of this decision awaits formal acceptance by ISO JTC1/SC2/WG2, the working group responsible for ISO10646. Potential implementers should be aware that until this formal acceptance occurs, any usage of the characters proposed herein is strictly experimental and not sanctioned for standardized character data interchange.

2. Definitions and Notation

No attempt is made to define all terms used in this document. In particular, the terminology pertaining to the subject of coded character systems is not explicitly specified. See [UNICODE], [ISO10646], and [RFC2130] for additional definitions in this area.

2.1 Requirements Notation

This document occasionally uses terms that appear in capital letters. When the terms "MUST", "SHOULD", "MUST NOT", "SHOULD NOT", and "MAY" appear capitalized, they are being used to indicate particular requirements of this specification. A discussion of the meanings of these terms appears in [RFC2119].

2.2 Definitions

The terms defined below are used in special senses and thus warrant some clarification.

2.2.1 Tagging

The association of attributes of text with a point or range of the primary text. (The value of a particular tag is not generally considered to be a part of the "content" of the text. Typical examples of tagging are marking the language or font of a portion of text.)

2.2.2 Annotation

The association of secondary textual content with a point or range of the primary text. (The value of a particular annotation *is* considered to be a part of the "content" of the text. Typical examples include glossing, citations, exemplification, Japanese yomi, etc.)

2.2.3 Out-of-band

An out-of-band channel conveys a tag in such a way that the textual content, as encoded, is completely untouched and unmodified. This is typically done by metadata or hyperstructure of some sort.

2.2.4 In-band

An in-band channel conveys a tag along with the textual content, using the same basic encoding mechanism as the text itself. This is done by various means, but an obvious example is SGML markup, where the tags are encoded in the same character set as the text and are interspersed with and carried along with the text data.

3.0 Background

There has been much discussion over the last 8 years of language tagging and of other kinds of tagging of Unicode plain text. It is fair to say that there is more-or-less universal agreement that language tagging of Unicode plain text is required for certain textual processes. For example, language "hinting" of multilingual text is necessary for multilingual spell-checking based on multiple dictionaries to work well. Language tagging provides a minimum level of required information for text-to-speech processes to work correctly. Language tagging is regularly done on web pages, to enable selection of alternate content, for example.

However, there has been a great deal of controversy regarding the appropriate placement of language tags. Some have held that the only appropriate placement of language tags (or other kinds of tags) is out-of-band, making use of attributed text structures or metadata. Others have argued that there are requirements for lower-complexity in-band mechanisms for language tags (or other tags) in plain text.

The controversy has been muddied by the existence and widespread use of a number of in-band text markup mechanisms (HTML, text/enriched, etc.) which enable language tagging, but which imply the use of general parsing mechanisms which are deemed too "heavyweight" for protocol developers and a number of other applications. The difficulty of using general in-band text markup for simple protocols derives from the fact that some characters are used both for textual content and for the text markup; this makes it more difficult to write simple, fast algorithms to find only the textual content and ignore the tags, or vice versa. (Think of this as the algorithmic equivalent of the difficulty the human reader has attempting to read just the content of raw HTML source text without a browser interpreting all the markup tags.)

The Plane 14 proposal addresses the recurrent and persistent call for a mechanism for tagging text in Unicode that is lighter-weight than typical text markup mechanisms. It proposes a special set of characters used *only* for tagging. These tag characters can be embedded into plain text and can be identified and/or ignored with trivial algorithms, since there is no overloading of usage for these tag characters--they can only express tag values and never textual content itself.

The Plane 14 proposal is not intended for general annotation of text, such as textual citations, phonetic readings (e.g. Japanese Yomi), etc. In its present form, its use is intended to be restricted solely to specifying in-line language tags. Future extensions may widen this scope of intended usage.

4.0 Proposal

This proposal suggests the use of 97 dedicated tag characters encoded at the start of Plane 14 of ISO/IEC 10646 consisting of a clone of the 94 printable 7-bit ASCII graphic characters and ASCII SPACE, as well as a tag identification character and a tag cancel character.

These tag characters are to be used to spell out any ASCII-based tagging scheme which needs to be embedded in Unicode plain text. In particular, they can be used to spell out language tags in order to meet the expressed requirements of the ACAP protocol and the likely requirements of other new protocols following the guidelines of the IAB character workshop (RFC 2130).

The suggested range in Plane 14 for the block reserved for tag characters is as follows, expressed in each of the three most generally used encoding schemes for ISO/IEC 10646:

UCS-4

   U-000E0000 .. U-000E007F

UTF-16

   U+DB40 U+DC00 .. U+DB40 U+DC7F

UTF-8

   0xF3 0xA0 0x80 0x80 .. 0xF3 0xA0 0x81 0xBF

Of this range, U-000E0020 .. U-000E007E is the suggested range for the ASCII clone tag characters themselves.
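
The three forms of the block boundaries can be cross-checked mechanically. The following is a minimal sketch, not part of the proposal itself, that applies the standard UTF-16 surrogate and four-byte UTF-8 encoding rules to a Plane 14 code point. The function names to_utf16_pair and to_utf8_4 are illustrative assumptions.

```c
#include <stdint.h>

/* Split a supplementary-plane code point (0x10000..0x10FFFF) into a
   UTF-16 surrogate pair.  Hypothetical helper, for illustration only. */
void to_utf16_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000u;                 /* 20 significant bits */
    *high = (uint16_t)(0xD800u + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00u + (v & 0x3FFu)); /* bottom 10 bits */
}

/* Encode a supplementary-plane code point as four UTF-8 bytes.
   Hypothetical helper, for illustration only. */
void to_utf8_4(uint32_t cp, uint8_t out[4])
{
    out[0] = (uint8_t)(0xF0u | (cp >> 18));
    out[1] = (uint8_t)(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = (uint8_t)(0x80u | (cp & 0x3Fu));
}
```

Applied to U-000E0000 and U-000E007F, these helpers yield the UTF-16 and UTF-8 boundary values listed above.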

4.1 Names for the Tag Characters

The names for the ASCII clone tag characters should be exactly the ISO 10646 names for 7-bit ASCII, prefixed with the word "TAG".

In addition, there is one tag identification character and a CANCEL TAG character. The use and syntax of these characters is described in detail below.

The entire encoding for the proposed Plane 14 tag characters and names of those characters can be derived from the following list. (The encoded values here and throughout this proposal are listed in UCS-4 form, which is easiest to interpret. It is assumed that most Unicode applications will, however, be making use either of UTF-16 or UTF-8 encoding forms for actual implementation.)

   U-000E0000  <reserved>
   U-000E0001  LANGUAGE TAG
   U-000E0002  <reserved>
   ...
   U-000E001F  <reserved>
   U-000E0020  TAG SPACE
   U-000E0021  TAG EXCLAMATION MARK
   ...
   U-000E0041  TAG LATIN CAPITAL LETTER A
   ...
   U-000E007A  TAG LATIN SMALL LETTER Z
   ...
   U-000E007E  TAG TILDE
   U-000E007F  CANCEL TAG

4.2 Range Checking for Tag Characters

The range checks required for code testing for tag characters would be as follows. The same range check is expressed here in C for each of the three significant encoding forms for 10646.

Range check expressed in UCS-4 (both bounds must hold, so the two comparisons are joined with &&):

   if ( ( *s >= 0xE0000 ) && ( *s <= 0xE007F ) )

Range check expressed in UTF-16 (Unicode):

   if ( ( *s == 0xDB40 ) && ( *(s+1) >= 0xDC00 ) && ( *(s+1) <= 0xDC7F ) )

Range check expressed in UTF-8 (the third byte must be 0x80 or 0x81; the mask is parenthesized because == binds more tightly than & in C):

   if ( ( *s == 0xF3 ) && ( *(s+1) == 0xA0 ) &&
        ( ( *(s+2) & 0xFE ) == 0x80 ) )

Because of the choice of the range for the tag characters, it would also be possible to express the range check for UCS-4 or UTF-16 in terms of bitmask operations, as well.
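
As a sketch of what such checks might look like when packaged as functions, the following C fragment gives one predicate per encoding form, including bitmask variants for UCS-4 and UTF-16. The function names are illustrative assumptions, and the UTF-8 variant assumes the input is already well-formed UTF-8 with at least four bytes available at s.

```c
#include <stdint.h>

/* UCS-4, comparison form: both bounds must hold. */
int is_tag_ucs4(uint32_t c)
{
    return c >= 0xE0000u && c <= 0xE007Fu;
}

/* UCS-4, bitmask form: clearing the low seven bits of any character
   in the block leaves exactly 0xE0000. */
int is_tag_ucs4_mask(uint32_t c)
{
    return (c & 0xFFFFFF80u) == 0xE0000u;
}

/* UTF-16, bitmask form: s points at two code units. */
int is_tag_utf16(const uint16_t *s)
{
    return s[0] == 0xDB40 && (s[1] & 0xFF80) == 0xDC00;
}

/* UTF-8: s points at four bytes.  The third byte is 0x80 or 0x81 for
   the 128 code points of the block; the fourth is any continuation. */
int is_tag_utf8(const uint8_t *s)
{
    return s[0] == 0xF3 && s[1] == 0xA0 &&
           (s[2] & 0xFE) == 0x80 && (s[3] & 0xC0) == 0x80;
}
```

The bitmask forms work because the block starts on a 128-aligned boundary and is exactly 128 code points long.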

4.3 Syntax for Embedding Tags

The use of the Plane 14 tag characters is very simple. In order to embed any ASCII-derived tag in Unicode plain text, the tag is simply spelled out with the tag characters instead, prefixed with the relevant tag identification character. The resultant string is embedded directly in the text.

The tag identification character is used as a mechanism for identifying tags of different types. This enables multiple types of tags to coexist amicably embedded in plain text and solves the problem of delimitation if a tag is concatenated directly onto another tag. Although only one type of tag is currently specified, namely the language tag, the encoding of other tag identification characters in the future would allow for distinct tag types to be used.

No termination character is required for a tag. A tag terminates either when the first non Plane 14 Tag Character (i.e. any other normal Unicode value) is encountered, or when the next tag identification character is encountered.

All tag arguments must be encoded only with the tag characters U-000E0020 .. U-000E007E. No other characters are valid for expressing the tag argument.
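
The termination rule lends itself to a very small scanner. The following is a minimal sketch, assuming the text is held as an array of UCS-4 values; the function read_language_tag and its parameters are hypothetical and only illustrate how a tag ends at the first character outside U-000E0020 .. U-000E007E (which includes the next tag identification character).

```c
#include <stddef.h>
#include <stdint.h>

#define LANGUAGE_TAG 0xE0001u
#define TAG_FIRST    0xE0020u
#define TAG_LAST     0xE007Eu

/* If text[pos] is LANGUAGE TAG, copy the ASCII form of the tag argument
   into out (NUL-terminated, silently truncated to cap-1 characters) and
   return the index of the first code point after the tag.  Otherwise
   leave out empty and return pos unchanged.  Requires cap >= 1. */
size_t read_language_tag(const uint32_t *text, size_t len, size_t pos,
                         char *out, size_t cap)
{
    size_t n = 0;
    out[0] = '\0';
    if (pos >= len || text[pos] != LANGUAGE_TAG)
        return pos;
    pos++;                                   /* skip the identifier */
    /* The argument runs until the first non-argument character. */
    while (pos < len && text[pos] >= TAG_FIRST && text[pos] <= TAG_LAST) {
        if (n + 1 < cap)
            out[n++] = (char)(text[pos] - 0xE0000u);
        pos++;
    }
    out[n] = '\0';
    return pos;
}
```

No lookahead or escape handling is needed, which is the point of keeping tag characters disjoint from content characters.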

A detailed BNF syntax for tags is listed below.

4.4 Tag Scope and Nesting

The value of an established tag continues from the point the tag is embedded in text until either:

A. The text itself goes out of scope, as defined by the application. (E.g. for line-oriented protocols, when reaching the end-of-line or end-of-string; for text streams, when reaching the end-of-stream; etc.)

or

B. The tag is explicitly cancelled by the CANCEL TAG character.

Tags of the same type cannot be nested in any way. The appearance of a new embedded language tag, for example, after text which was already language tagged, simply changes the tagged value for subsequent text to that specified in the new tag.

Tags of different type can have interdigitating scope, but not hierarchical scope. In effect, tags of different type completely ignore each other, so that the use of language tags can be completely asynchronous with the use of character set source tags (or any other tag type) in the same text in the future.

4.5 Cancelling Tag Values

U-000E007F CANCEL TAG is provided to allow the specific cancelling of a tag value. The use of CANCEL TAG has the following syntax. To cancel a tag value of a particular type, prefix the CANCEL TAG character with the tag identification character of the appropriate type. For example, the complete string to cancel a language tag is:

U-000E0001 U-000E007F

The value of the relevant tag type returns to the default state for that tag type, namely: no tag value specified, the same as untagged text.

The use of CANCEL TAG without a prefixed tag identification character cancels *any* Plane 14 tag values which may be defined. Since only language tags are currently provided with an explicit tag identification character, only language tags are currently affected.
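
The scope and cancellation rules can be modelled as a small state machine fed one code point at a time. The sketch below is one possible reading of those rules, not a normative implementation: the lang_state type and lang_feed function are hypothetical, the 32-byte buffer is an arbitrary assumption, and silently consuming the reserved code points of the block is a choice made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define LANGUAGE_TAG 0xE0001u
#define CANCEL_TAG   0xE007Fu
#define TAG_FIRST    0xE0020u

typedef struct {
    char   lang[32]; /* current language in ASCII; "" means untagged */
    size_t n;        /* characters collected for the tag being read */
    int    in_tag;   /* nonzero while a tag argument is being read */
} lang_state;

/* Feed one UCS-4 code point.  Returns 1 if it is ordinary text content
   (to which st->lang applies), 0 if it belongs to the tag block. */
int lang_feed(lang_state *st, uint32_t c)
{
    if (c < 0xE0000u || c > 0xE007Fu) {
        st->in_tag = 0;              /* ordinary text ends an open tag */
        return 1;
    }
    if (c == LANGUAGE_TAG) {         /* a new tag replaces the old value */
        st->in_tag = 1;
        st->n = 0;
        st->lang[0] = '\0';
    } else if (c == CANCEL_TAG) {    /* prefixed or bare: back to untagged */
        st->in_tag = 0;
        st->n = 0;
        st->lang[0] = '\0';
    } else if (c >= TAG_FIRST && st->in_tag &&
               st->n + 1 < sizeof st->lang) {
        st->lang[st->n++] = (char)(c - 0xE0000u);
        st->lang[st->n] = '\0';
    }
    return 0;                        /* tag data is never content */
}
```

Because only language tags currently exist, a bare CANCEL TAG and the prefixed form have the same effect in this sketch; a second tag type would need its own state alongside lang.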

The main function of CANCEL TAG is to make possible such operations as blind concatenation of strings in a tagged context without the propagation of inappropriate tag values across the string boundaries. For example, a string tagged with a Japanese language tag can have its tag value "sealed off" with a terminating CANCEL TAG before another string of unknown language value is concatenated to it. This would prevent the string of unknown language from being erroneously marked as being Japanese simply because of a concatenation to a Japanese string.
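
A sketch of such a "sealed" concatenation, assuming both strings are UCS-4 arrays, might look as follows. The function name concat_sealed and its buffer-size convention are assumptions made for illustration; the two code points it inserts are the language-tag cancel sequence U-000E0001 U-000E007F.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANGUAGE_TAG 0xE0001u
#define CANCEL_TAG   0xE007Fu

/* Write a, then LANGUAGE TAG + CANCEL TAG, then b into dst.
   Returns the number of code points written, or 0 if dst (capacity
   cap, in code points) is too small.  dst must not overlap a or b. */
size_t concat_sealed(uint32_t *dst, size_t cap,
                     const uint32_t *a, size_t alen,
                     const uint32_t *b, size_t blen)
{
    size_t total = alen + 2 + blen;
    if (total > cap)
        return 0;
    memcpy(dst, a, alen * sizeof *a);
    dst[alen]     = LANGUAGE_TAG;   /* seal off any language tag in a */
    dst[alen + 1] = CANCEL_TAG;
    memcpy(dst + alen + 2, b, blen * sizeof *b);
    return total;
}
```

The caller does not need to inspect either string, which is what makes the concatenation "blind".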

4.6 Tag Syntax Description

An extended BNF (Backus-Naur Form) description of the tags specified in this proposal is found below. Note the following BNF extensions used in this formalism:

1. Semantic constraints are specified by rules in the form of an assertion specified between double braces; the variable $$ denotes the string consisting of all terminal symbols matched by this non-terminal.

      Example:   {{ Assert ( $$[0] == '?' ); }}
        

Meaning: The first character of the string matched by this non-terminal must be '?'

2. A number of predicate functions which are not otherwise defined are employed in semantic constraint rules; their names are sufficient for determining their predication.

Example: IsRFC1766LanguageIdentifier ( tag-argument )

Meaning: tag-argument is a valid RFC1766 language identifier

3. A lexical expander function, TAG, is employed to denote the tag form of an ASCII character; the argument to this function is either a character or a character set specified by a range or enumeration expression.

Example: TAG('-')

Meaning: TAG HYPHEN-MINUS

Example: TAG([A-Z])

Meaning: TAG LATIN CAPITAL LETTER A ... TAG LATIN CAPITAL LETTER Z

4. A macro is employed to denote terminal symbols that are character literals which can't be directly represented in ASCII. The argument to the macro is the UNICODE (ISO/IEC 10646) character name.

      Example:   '${TAG CANCEL}'
        

Meaning: character literal whose code value is U-000E007F

5. Occurrence indicators used are '+' (one or more) and '*' (zero or more); optional occurrence is indicated by enclosure in '[' and ']'.

4.6.1 Formal Tag Syntax

tag                     :   language-tag
                        |   cancel-all-tag
                        ;

language-tag            :   language-tag-introducer language-tag-argument
                        ;

language-tag-argument   :   tag-argument
              {{ Assert ( IsRFC1766LanguageIdentifier ( $$ ) ); }}
                        |   tag-cancel
                        ;

cancel-all-tag          :   tag-cancel
                        ;

tag-argument            :   tag-character+
                        ;

tag-character           :   { c : c in
              TAG( { a : a in printable ASCII characters or SPACE } ) }
                        ;

language-tag-introducer :   '${TAG LANGUAGE}'
                        ;

tag-cancel              :   '${TAG CANCEL}'
                        ;
        
5.0 Tag Types

5.1 Language Tags

Language tags are of general interest and should have a high degree of interoperability for protocol usage. To this end, a specific LANGUAGE TAG tag identification character is provided. A Plane 14 tag string prefixed by U-000E0001 LANGUAGE TAG is specified to constitute a language tag. Furthermore, the tag values for the language tag are to be spelled out as specified in RFC 1766, making use only of registered tag values or of user-defined language tags starting with the characters "x-".

For example, to embed a language tag for Japanese, the Plane 14 characters would be used as follows. The Japanese tag from RFC 1766 is "ja" (composed of ISO 639 language id) or, alternatively, "ja-JP" (composed of ISO 639 language id plus ISO 3166 country id). Since RFC 1766 specifies that language tags are not case significant, it is recommended that for language tags, the entire tag be lowercased before conversion to Plane 14 tag characters. (This would not be required for Unicode conformance, but should be followed as general practice by protocols making use of RFC 1766 language tags, to simplify and speed up the processing for operations which need to identify or ignore language tags embedded in text.) Lowercasing, rather than uppercasing, is recommended because it follows the majority practice of expressing language tag values in lowercase letters.

Thus the entire language tag (in its longer form) would be converted to Plane 14 tag characters as follows:

U-000E0001 U-000E006A U-000E0061 U-000E002D U-000E006A U-000E0070

The language tag (in its shorter, "ja" form) could be expressed as follows:

U-000E0001 U-000E006A U-000E0061

The value of this string is then expressed in whichever encoding form (UCS-4, UTF-16, UTF-8) is required and embedded in text at the relevant point.
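
The conversion from an RFC 1766 string to its Plane 14 form is a fixed offset per character. The following is a minimal sketch that produces UCS-4 values, lowercasing ASCII letters first as recommended above. The function name make_language_tag is an assumption, and it does not check that the input is a registered RFC 1766 tag.

```c
#include <stddef.h>
#include <stdint.h>

/* Spell out an ASCII language tag with Plane 14 tag characters,
   prefixed by U-000E0001 LANGUAGE TAG.  Returns the number of code
   points written to out, or 0 if out (capacity cap) is too small or
   the input contains a character outside 0x20..0x7E. */
size_t make_language_tag(const char *ascii, uint32_t *out, size_t cap)
{
    size_t n = 0;
    if (cap == 0)
        return 0;
    out[n++] = 0xE0001u;                      /* LANGUAGE TAG */
    for (; *ascii != '\0'; ascii++) {
        unsigned char ch = (unsigned char)*ascii;
        if (ch < 0x20 || ch > 0x7E || n == cap)
            return 0;
        if (ch >= 'A' && ch <= 'Z')           /* lowercase ASCII letters */
            ch = (unsigned char)(ch + ('a' - 'A'));
        out[n++] = 0xE0000u + ch;             /* ASCII clone in Plane 14 */
    }
    return n;
}
```

Called with "ja-JP", this yields the six values shown above for the longer form of the Japanese tag; the result can then be serialized in whichever encoding form is required.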

5.2 Additional Tags

Additional tag identification characters might be defined in the future. An example would be a CHARACTER SET SOURCE TAG, or a GENERIC TAG for private definition of tags.

In each case, when a specific tag identification character is encoded, a corresponding reference standard for the values of the tags associated with the identifier should be designated, so that interoperating parties which make use of the tags will know how to interpret the values the tags may take.

6.0 Display Issues

All characters in the tag character block are considered to have no visible rendering in normal text. A process which interprets tags may choose to modify the rendering of text based on the tag values (as for example, changing font to preferred style for rendering Chinese versus Japanese). The tag characters themselves have no display; they may be considered similar to a U+200B ZERO WIDTH SPACE in that regard. The tag characters also do not affect breaking, joining, or any other format or layout properties, except insofar as the process interpreting the tag chooses to impose such behavior based on the tag value.

For debugging or other operations which must render the tags themselves visible, it is advisable that the tag characters be rendered using the corresponding ASCII character glyphs (perhaps modified systematically to differentiate them from normal ASCII characters). But, as noted below, the tag character values are chosen so that even without display support, the tag characters will be interpretable in most debuggers.
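
Because each ASCII clone sits at a fixed offset from its ASCII original, the glyph to show for a tag character can be recovered from its low seven bits. A minimal sketch of that mapping follows; the function name tag_to_ascii is an assumption made for illustration.

```c
#include <stdint.h>

/* Return the ASCII character whose glyph could be used to display a
   tag character, or 0 if c is not in U-000E0020 .. U-000E007E. */
char tag_to_ascii(uint32_t c)
{
    if (c < 0xE0020u || c > 0xE007Eu)
        return 0;
    return (char)(c & 0x7Fu);   /* low seven bits are the ASCII value */
}
```

The same property is what makes a hex dump readable: the last two hex digits of a UCS-4 tag character are the ASCII code of the letter it spells.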

7.0 Unicode Conformance Issues

The basic rules for Unicode conformance for the tag characters are exactly the same as for any other Unicode characters. A conformant process is not required to interpret the tag characters. If it does not interpret tag characters, it should leave their values undisturbed and do whatever it does with any other uninterpreted characters. If it does interpret them, it should interpret them according to the standard, i.e. as spelled-out tags.

So for a non-TagAware Unicode application, any language tag characters (or any other kind of tag expressed with Plane 14 tag characters) encountered would be handled exactly as for uninterpreted Tibetan from the BMP, uninterpreted Linear B from Plane 1, or uninterpreted Egyptian hieroglyphics from private use space in Plane 15.

A TagAware but TagPhobic Unicode application can recognize the tag character range in Plane 14 and choose to deliberately strip them out completely to produce plain text with no tags.
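
A sketch of such a stripping pass over UTF-8 data is shown below. It assumes the input is well-formed UTF-8, so that the lead byte 0xF3 can only appear at the start of a four-byte sequence; the function name strip_tags_utf8 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Remove every Plane 14 tag character (U-000E0000 .. U-000E007F) from
   a UTF-8 buffer in place.  Returns the new length in bytes.
   Assumes well-formed UTF-8 input. */
size_t strip_tags_utf8(uint8_t *s, size_t len)
{
    size_t r = 0, w = 0;
    while (r < len) {
        if (len - r >= 4 && s[r] == 0xF3 && s[r + 1] == 0xA0 &&
            (s[r + 2] & 0xFE) == 0x80) {
            r += 4;              /* skip one four-byte tag character */
        } else {
            s[w++] = s[r++];     /* keep everything else unchanged */
        }
    }
    return w;
}
```

All other bytes, including other supplementary-plane characters, pass through untouched, so the result is the same text with no tags.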

The presence of a correctly formed tag cannot be taken as a guarantee that the data so tagged is correctly tagged. For example, nothing prevents an application from erroneously labelling French data as Spanish, or from labelling JIS-derived data as Japanese, even if it contains Greek or Cyrillic characters.

7.1 Note on Encoding Language Tags

The fact that this proposal for encoding tag characters in Unicode includes a mechanism for specifying language tag values does not mean that Unicode is departing from one of its basic encoding principles:

Unicode encodes scripts, not languages.

This is still true of the Unicode encoding (and ISO/IEC 10646), even in the presence of a mechanism for specifying language tags in plain text. There is nothing obligatory about the use of Plane 14 tags, whether for language tags or any other kind of tags.

Language tagging in no way impacts current encoded characters or the encoding of future scripts.

It is fully anticipated that implementations of Unicode which already make use of out-of-band mechanisms for language tagging or "heavy-weight" in-band mechanisms such as HTML will continue to do exactly what they are doing and will ignore Plane 14 tag characters completely.

8.0 Security Considerations

There are no known security issues raised by this document.

References

[ISO10646] ISO/IEC 10646-1:1993 International Organization for Standardization. "Information Technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane", Geneva, 1993.

[RFC1766] Alvestrand, H., "Tags for the Identification of Languages", RFC 1766, March 1995.

[RFC2070] Yergeau, F., Nicol, G., Adams, G. and M. Duerst, "Internationalization of the Hypertext Markup Language", RFC 2070, January 1997.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., Atkinson, R., Crispin, M. and P. Svanberg, "The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996", RFC 2130, April 1997.

[UNICODE] The Unicode Standard, Version 2.0, The Unicode Consortium, Addison-Wesley, July 1996.

Acknowledgements

The following people also contributed to this document, directly or indirectly: Chris Newman, Mark Crispin, Rick McGowan, Joe Becker, John Jenkins, and Asmus Freytag. This document also was reviewed by the Unicode Technical Committee, and the authors wish to thank all of the UTC representatives for their input. The authors are, of course, responsible for any errors or omissions which may remain in the text.

Authors' Addresses

Ken Whistler
Sybase, Inc.
6475 Christie Ave.
Emeryville, CA 94608-1050

   Phone: +1 510 922 3611
   EMail: kenw@sybase.com

Glenn Adams
Spyglass, Inc.
One Cambridge Center
Cambridge, MA 02142

   Phone: +1 617 679 4652
   EMail: glenn@spyglass.com
        

Full Copyright Statement

Copyright (C) The Internet Society (1999). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
