Network Working Group                                          M. Duerst
Request for Comments: 3987                                           W3C
Category: Standards Track                                    M. Suignard
                                                   Microsoft Corporation
                                                            January 2005
        

Internationalized Resource Identifiers (IRIs)

Status of This Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement to the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs, where appropriate, to identify resources.

The approach of defining a new protocol element was chosen instead of extending or changing the definition of URIs. This was done in order to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines are provided for the use and deployment of IRIs in various protocols, formats, and software components that currently deal with URIs.

Table of Contents

   1.  Introduction
       1.1.  Overview and Motivation
       1.2.  Applicability
       1.3.  Definitions
       1.4.  Notation
   2.  IRI Syntax
       2.1.  Summary of IRI Syntax
       2.2.  ABNF for IRI References and IRIs
   3.  Relationship between IRIs and URIs
       3.1.  Mapping of IRIs to URIs
       3.2.  Converting URIs to IRIs
             3.2.1.  Examples
   4.  Bidirectional IRIs for Right-to-Left Languages
       4.1.  Logical Storage and Visual Presentation
       4.2.  Bidi IRI Structure
       4.3.  Input of Bidi IRIs
       4.4.  Examples
   5.  Normalization and Comparison
       5.1.  Equivalence
       5.2.  Preparation for Comparison
       5.3.  Comparison Ladder
             5.3.1.  Simple String Comparison
             5.3.2.  Syntax-Based Normalization
             5.3.3.  Scheme-Based Normalization
             5.3.4.  Protocol-Based Normalization
   6.  Use of IRIs
       6.1.  Limitations on UCS Characters Allowed in IRIs
       6.2.  Software Interfaces and Protocols
       6.3.  Format of URIs and IRIs in Documents and Protocols
       6.4.  Use of UTF-8 for Encoding Original Characters
       6.5.  Relative IRI References
   7.  URI/IRI Processing Guidelines (informative)
       7.1.  URI/IRI Software Interfaces
       7.2.  URI/IRI Entry
       7.3.  URI/IRI Transfer between Applications
       7.4.  URI/IRI Generation
       7.5.  URI/IRI Selection
       7.6.  Display of URIs/IRIs
       7.7.  Interpretation of URIs and IRIs
       7.8.  Upgrading Strategy
   8.  Security Considerations
   9.  Acknowledgements
   10. References
       10.1. Normative References
       10.2. Informative References
   A.  Design Alternatives
       A.1.  New Scheme(s)
       A.2.  Character Encodings Other Than UTF-8
       A.3.  New Encoding Convention
       A.4.  Indicating Character Encodings in the URI/IRI
   Authors' Addresses
   Full Copyright Statement
        
1. Introduction
1.1. Overview and Motivation

A Uniform Resource Identifier (URI) is defined in [RFC3986] as a sequence of characters chosen from a limited subset of the repertoire of US-ASCII [ASCII] characters.

The characters in URIs are frequently used for representing words of natural languages. This usage has many advantages: Such URIs are easier to memorize, easier to interpret, easier to transcribe, easier to create, and easier to guess. For most languages other than English, however, the natural script uses characters other than A - Z. For many people, handling Latin characters is as difficult as handling the characters of other scripts is for those who use only the Latin alphabet. Many languages with non-Latin scripts are transcribed with Latin letters. These transcriptions are now often used in URIs, but they introduce additional ambiguities.

The infrastructure for the appropriate handling of characters from local scripts is now widely deployed in local versions of operating system and application software. Software that can handle a wide variety of scripts and languages at the same time is increasingly common. Also, increasing numbers of protocols and formats can carry a wide range of characters.

This document defines a new protocol element called Internationalized Resource Identifier (IRI) by extending the syntax of URIs to a much wider repertoire of characters. It also defines "internationalized" versions corresponding to other constructs from [RFC3986], such as URI references. The syntax of IRIs is defined in section 2, and the relationship between IRIs and URIs in section 3.

Using characters outside of A - Z in IRIs brings some difficulties. Section 4 discusses the special case of bidirectional IRIs, section 5 various forms of equivalence between IRIs, and section 6 the use of IRIs in different situations. Section 7 gives additional informative guidelines, and section 8 security considerations.

1.2. Applicability

IRIs are designed to be compatible with recommendations for new URI schemes [RFC2718]. The compatibility is provided by specifying a well-defined and deterministic mapping from the IRI character sequence to the functionally equivalent URI character sequence. Practical use of IRIs (or IRI references) in place of URIs (or URI references) depends on the following conditions being met:

a. A protocol or format element should be explicitly designated to be able to carry IRIs. The intent is not to introduce IRIs into contexts that are not defined to accept them. For example, XML schema [XMLSchema] has an explicit type "anyURI" that includes IRIs and IRI references. Therefore, IRIs and IRI references can be in attributes and elements of type "anyURI". On the other hand, in the HTTP protocol [RFC2616], the Request URI is defined as a URI, which means that direct use of IRIs is not allowed in HTTP requests.

b. The protocol or format carrying the IRIs should have a mechanism to represent the wide range of characters used in IRIs, either natively or by some protocol- or format-specific escaping mechanism (for example, numeric character references in [XML1]).

c. The URI corresponding to the IRI in question has to encode original characters into octets using UTF-8. For new URI schemes, this is recommended in [RFC2718]. It can apply to a whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384], or the URN syntax [RFC2141]). It can apply to a specific part of a URI, such as the fragment identifier (e.g., [XPointer]). It can apply to a specific URI or part(s) thereof. For details, please see section 6.4.
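
As a rough illustration of condition c, the sketch below encodes a non-ASCII character into UTF-8 octets and then percent-encodes each octet. It relies on Python's standard urllib.parse.quote; the helper name utf8_percent_encode and the sample string are assumptions made for this example, and the normative mapping procedure is the one given in section 3.1.

```python
from urllib.parse import quote


def utf8_percent_encode(text: str) -> str:
    """Percent-encode the UTF-8 octets of every character outside the
    ASCII letters, digits, and "-", ".", "_", "~" (hypothetical helper)."""
    return quote(text, safe="", encoding="utf-8")


sample = "r\u00e9sum\u00e9"          # illustrative string containing U+00E9
print(sample.encode("utf-8"))       # the UTF-8 octets before escaping
print(utf8_percent_encode(sample))  # r%C3%A9sum%C3%A9
```

Each U+00E9 becomes the two octets <c3> <a9>, which appear in the URI form as "%C3%A9".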

1.3. Definitions

The following definitions are used in this document; they follow the terms in [RFC2130], [RFC2277], and [ISO10646].

character: A member of a set of elements used for the organization, control, or representation of data. For example, "LATIN CAPITAL LETTER A" names a character.

octet: An ordered sequence of eight bits considered as a unit.

character repertoire: A set of characters (in the mathematical sense).

sequence of characters: A sequence of characters (one after another).

sequence of octets: A sequence of octets (one after another).

character encoding: A method of representing a sequence of characters as a sequence of octets (maybe with variants). Also, a method of (unambiguously) converting a sequence of octets into a sequence of characters.

charset: The name of a parameter or attribute used to identify a character encoding.

UCS: Universal Character Set. The coded character set defined by ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].

IRI reference: Denotes the common usage of an Internationalized Resource Identifier. An IRI reference may be absolute or relative. However, the "IRI" that results from such a reference only includes absolute IRIs; any relative IRI references are resolved to their absolute form. Note that in [RFC2396] URIs did not include fragment identifiers, but in [RFC3986] fragment identifiers are part of URIs.

running text: Human text (paragraphs, sentences, phrases) with syntax according to orthographic conventions of a natural language, as opposed to syntax defined for ease of processing by machines (e.g., markup, programming languages).

protocol element: Any portion of a message that affects processing of that message by the protocol in question.

presentation element: A presentation form corresponding to a protocol element; for example, using a wider range of characters.

create (a URI or IRI): With respect to URIs and IRIs, the term is used for the initial creation. This may be the initial creation of a resource with a certain identifier, or the initial exposition of a resource under a particular identifier.

generate (a URI or IRI): With respect to URIs and IRIs, the term is used when the IRI is generated by derivation from other information.

1.4. Notation

RFCs and Internet Drafts currently do not allow any characters outside the US-ASCII repertoire. Therefore, this document uses various special notations to denote such characters in examples.

In text, characters outside US-ASCII are sometimes referenced by using a prefix of 'U+', followed by four to six hexadecimal digits.

To represent characters outside US-ASCII in examples, this document uses two notations: 'XML Notation' and 'Bidi Notation'.

XML Notation uses a leading '&#x', a trailing ';', and the hexadecimal number of the character in the UCS in between. For example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this notation, an actual '&' is denoted by '&amp;'.
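
The following sketch shows one way to expand XML Notation into actual characters. It is a minimal illustration rather than a full XML parser: the function name expand_xml_notation and the example IRI are assumptions, and it presumes that a literal '&' is written as '&amp;', as in XML.

```python
import re

# Matches a hexadecimal numeric character reference such as "&#x44F;".
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);")


def expand_xml_notation(text: str) -> str:
    """Replace each '&#xHHHH;' with the UCS character it denotes,
    then turn '&amp;' back into a literal ampersand."""
    expanded = _XML_NOTATION.sub(lambda m: chr(int(m.group(1), 16)), text)
    return expanded.replace("&amp;", "&")


# Hypothetical example IRI written in XML Notation.
print(expand_xml_notation("http://example.org/ros&#xE9;"))
```

Numeric references are expanded before '&amp;' so that an escaped ampersand followed by '#x...;' is left as literal text.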

Bidi Notation is used for bidirectional examples: Lowercase letters stand for Latin letters or other letters that are written left to right, whereas uppercase letters represent Arabic or Hebrew letters that are written right to left.

To denote actual octets in examples (as opposed to percent-encoded octets), the two hex digits denoting the octet are enclosed in "<" and ">". For example, the octet often denoted as 0xc9 is denoted here as <c9>.

In this document, the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in [RFC2119].

2. IRI Syntax

This section defines the syntax of Internationalized Resource Identifiers (IRIs).

As with URIs, an IRI is defined as a sequence of characters, not as a sequence of octets. This definition accommodates the fact that IRIs may be written on paper or read over the radio as well as stored or transmitted digitally. The same IRI may be represented as different sequences of octets in different protocols or documents if these protocols or documents use different character encodings (and/or transfer encodings). Using the same character encoding as the containing protocol or document ensures that the characters in the IRI can be handled (e.g., searched, converted, displayed) in the same way as the rest of the protocol or document.
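
To make the distinction between characters and octets concrete, the sketch below prints the octets of a single character under three character encodings, using the "<xx>" octet notation from section 1.4. The helper name octets and the choice of U+00C9 and of the three encodings are assumptions made for illustration.

```python
def octets(text: str, encoding: str) -> str:
    """Render the encoded form of text in the <xx> octet notation."""
    return "".join(f"<{byte:02x}>" for byte in text.encode(encoding))


character = "\u00c9"  # LATIN CAPITAL LETTER E WITH ACUTE, chosen as an example
for encoding in ("utf-8", "iso-8859-1", "utf-16-be"):
    # Same character, different octet sequence in each encoding.
    print(encoding, octets(character, encoding))
```

The character is the same in all three cases; only its representation as octets differs, which is why an IRI is defined at the character level.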

2.1. Summary of IRI Syntax

IRIs are defined similarly to URIs in [RFC3986], but the class of unreserved characters is extended by adding the characters of the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject to the limitations given in the syntax rules below and in section 6.1.

Otherwise, the syntax and use of components and reserved characters is the same as that in [RFC3986]. All the operations defined in [RFC3986], such as the resolution of relative references, can be applied to IRIs by IRI-processing software in exactly the same way as they are for URIs by URI-processing software.
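
As an informal illustration of applying relative-reference resolution to IRIs, the sketch below uses Python's urllib.parse.urljoin on strings that contain non-ASCII characters. This is only an approximation: urljoin is a general URL utility, not an IRI-specific implementation, and the wrapper name resolve_iri_reference, the base IRI, and the references are assumptions made for this example.

```python
from urllib.parse import urljoin


def resolve_iri_reference(base: str, reference: str) -> str:
    """Resolve a relative IRI reference against a base IRI, operating on
    characters in the same way a URI resolver would (hypothetical wrapper)."""
    return urljoin(base, reference)


base = "http://example.org/dir/page"
print(resolve_iri_reference(base, "r\u00e9sum\u00e9.html"))
print(resolve_iri_reference(base, "../\u65e5\u672c/"))
```

The non-ASCII characters pass through resolution unchanged, since the algorithm only inspects delimiters such as "/" and the dot segments.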

Characters outside the US-ASCII repertoire are not reserved and therefore MUST NOT be used for syntactical purposes, such as to delimit components in newly defined schemes. For example, U+00A2, CENT SIGN, is not allowed as a delimiter in IRIs, because it is in the 'iunreserved' category. This is similar to the fact that it is not possible to use '-' as a delimiter in URIs, because it is in the 'unreserved' category.

2.2. ABNF for IRI References and IRIs

Although it might be possible to define IRI references and IRIs merely by their transformation to URI references and URIs, they can also be accepted and processed directly. Therefore, an ABNF definition for IRI references (which are the most general concept and the start of the grammar) and IRIs is given here. The syntax of this ABNF is described in [RFC2234]. Character numbers are taken from the UCS, without implying any actual binary encoding. Terminals in the ABNF are characters, not bytes.

The following grammar closely follows the URI grammar in [RFC3986], except that the range of unreserved characters is expanded to include UCS characters, with the restriction that private UCS characters can occur only in query parts. The grammar is split into two parts: Rules that differ from [RFC3986] because of the above-mentioned expansion, and rules that are the same as those in [RFC3986]. For rules that are different than those in [RFC3986], the names of the non-terminals have been changed as follows. If the non-terminal contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i' has been prefixed.

The following rules are different from those in [RFC3986]:

   IRI            = scheme ":" ihier-part [ "?" iquery ]
                         [ "#" ifragment ]

   ihier-part     = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-rootless
                  / ipath-empty

   IRI-reference  = IRI / irelative-ref

   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

   irelative-part = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-noscheme
                  / ipath-empty

   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
   iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
   ihost          = IP-literal / IPv4address / ireg-name

   ireg-name      = *( iunreserved / pct-encoded / sub-delims )

   ipath          = ipath-abempty   ; begins with "/" or is empty
                  / ipath-absolute  ; begins with "/" but not "//"
                  / ipath-noscheme  ; begins with a non-colon segment
                  / ipath-rootless  ; begins with a segment
                  / ipath-empty     ; zero characters

   ipath-abempty  = *( "/" isegment )
   ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
   ipath-noscheme = isegment-nz-nc *( "/" isegment )
   ipath-rootless = isegment-nz *( "/" isegment )
   ipath-empty    = 0<ipchar>

   isegment       = *ipchar
   isegment-nz    = 1*ipchar
   isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
                        / "@" )
                  ; non-zero-length segment without any colon ":"

   ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"

   iquery         = *( ipchar / iprivate / "/" / "?" )

   ifragment      = *( ipchar / "/" / "?" )

   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

Some productions are ambiguous. The "first-match-wins" (a.k.a. "greedy") algorithm applies. For details, see [RFC3986].
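
The 'ucschar', 'iprivate', and 'iunreserved' rules in the grammar above can be checked mechanically. The sketch below is one possible way to test a single character against those ranges; the constant and function names are assumptions made for this example, and the ranges are transcribed from the ABNF rather than taken from any library.

```python
import string

# Inclusive code point ranges transcribed from the 'ucschar' rule.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

# Inclusive code point ranges transcribed from the 'iprivate' rule.
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

# ALPHA / DIGIT / "-" / "." / "_" / "~", restricted to US-ASCII.
ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _in_ranges(ch: str, ranges) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    return _in_ranges(ch, IPRIVATE_RANGES)


def is_iunreserved(ch: str) -> bool:
    return ch in ASCII_UNRESERVED or is_ucschar(ch)


print(is_iunreserved("\u00a2"))  # U+00A2 CENT SIGN, from section 2.1
print(is_iprivate("\ue000"))     # first private-use code point
```

Because U+00A2 falls in 'iunreserved', it cannot serve as a delimiter, which matches the CENT SIGN example in section 2.1.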

The following rules are the same as those in [RFC3986]:

   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   port           = *DIGIT

   IP-literal     = "[" ( IPv6address / IPvFuture ) "]"

   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

   IPv6address    =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

   h16            = 1*4HEXDIG
   ls32           = ( h16 ":" h16 ) / IPv4address

   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

   dec-octet      = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255

   pct-encoded    = "%" HEXDIG HEXDIG

   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

This syntax does not support IPv6 scoped addressing zone identifiers.
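
Several of the shared rules above translate directly into regular expressions. The sketch below covers 'scheme', 'dec-octet', and 'IPv4address' only; the pattern and function names are assumptions made for this example, and it does not attempt the 'IPv6address' rule.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# dec-octet: 250-255, 200-249, 100-199, 10-99, then 0-9 (no leading zeros).
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_scheme(text: str) -> bool:
    return SCHEME.fullmatch(text) is not None


def is_ipv4address(text: str) -> bool:
    return IPV4ADDRESS.fullmatch(text) is not None


print(is_scheme("http"), is_scheme("1http"))
print(is_ipv4address("192.0.2.255"), is_ipv4address("256.0.0.1"))
```

Using fullmatch rather than match keeps trailing characters from being silently accepted, which matters when a host could also match 'ireg-name'.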

3. Relationship between IRIs and URIs

IRIs are meant to replace URIs in identifying resources for protocols, formats, and software components that use a UCS-based character repertoire. These protocols and components may never need to use URIs directly, especially when the resource identifier is used simply for identification purposes. However, when the resource identifier is used for resource retrieval, it is in many cases necessary to determine the associated URI, because currently most retrieval mechanisms are only defined for URIs. In this case, IRIs can serve as presentation elements for URI protocol elements. An example would be an address bar in a Web user agent. (Additional rationale is given in section 3.1.)

3.1. Mapping of IRIs to URIs

This section defines how to map an IRI to a URI. Everything in this section also applies to IRI references and URI references, as well as to components thereof (for example, fragment identifiers).

This mapping has two purposes:

Syntactical. Many URI schemes and components define additional syntactical restrictions not captured in section 2.2. Scheme-specific restrictions are applied to IRIs by converting IRIs to URIs and checking the URIs against the scheme-specific restrictions.

Interpretational. URIs identify resources in various ways. IRIs also identify resources. When the IRI is used solely for identification purposes, it is not necessary to map the IRI to a URI (see section 5). However, when an IRI is used for resource retrieval, the resource that the IRI locates is the same as the one located by the URI obtained after converting the IRI according to the procedure defined here. This means that there is no need to define resolution separately on the IRI level.

Applications MUST map IRIs to URIs by using the following two steps.

Step 1. Generate a UCS character sequence from the original IRI format. This step has the following three variants, depending on the form of the input:

a. If the IRI is written on paper, read aloud, or otherwise represented as a sequence of characters independent of any character encoding, represent the IRI as a sequence of characters from the UCS normalized according to Normalization Form C (NFC, [UTR15]).

b. If the IRI is in some digital representation (e.g., an octet stream) in some known non-Unicode character encoding, convert the IRI to a sequence of characters from the UCS normalized according to NFC.

c. If the IRI is in a Unicode-based character encoding (for example, UTF-8 or UTF-16), do not normalize (see section 5.3.2.2 for details). Apply step 2 directly to the encoded Unicode character sequence.

Step 2. For each character in 'ucschar' or 'iprivate', apply steps 2.1 through 2.3 below.

2.1. Convert the character to a sequence of one or more octets using UTF-8 [RFC3629].

2.2. Convert each octet to %HH, where HH is the hexadecimal notation of the octet value. Note that this is identical to the percent-encoding mechanism in section 2.1 of [RFC3986]. To reduce variability, the hexadecimal notation SHOULD use uppercase letters.

2.3. Replace the original character with the resulting character sequence (i.e., a sequence of %HH triplets).

The above mapping from IRIs to URIs produces URIs fully conforming to [RFC3986]. The mapping is also an identity transformation for URIs and is idempotent; applying the mapping a second time will not change anything. Every URI is by definition an IRI.
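
Step 2 can be sketched in Python as follows. This is a minimal illustration rather than a complete implementation: the function name iri_to_uri is an assumption, the input is taken to be a Python string that has already passed through step 1, and every non-ASCII character is treated as belonging to 'ucschar' or 'iprivate' without checking the ranges given in section 2.2.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character of an IRI (steps 2.1 to 2.3)."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # US-ASCII characters, including existing %HH triplets, are kept as-is.
            out.append(ch)
        else:
            # Step 2.1: UTF-8 octets; step 2.2: %HH with uppercase hex digits.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    # Step 2.3: the triplets take the place of the original character.
    return "".join(out)
```

Because US-ASCII characters pass through unchanged, applying the function to its own output returns the same string, which matches the idempotence stated above.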

Systems accepting IRIs MAY convert the ireg-name component of an IRI as follows (before step 2 above) for schemes known to use domain names in ireg-name, if the scheme definition does not allow percent-encoding for ireg-name:

Replace the ireg-name part of the IRI by the part converted using the ToASCII operation specified in section 4.1 of [RFC3490] on each dot-separated label, and by using U+002E (FULL STOP) as a label separator, with the flag UseSTD3ASCIIRules set to TRUE, and with the flag AllowUnassigned set to FALSE for creating IRIs and set to TRUE otherwise.
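
As an illustration, Python's built-in "idna" codec applies the ToASCII operation of [RFC3490] to each dot-separated label. The sketch below assumes the ireg-name has already been extracted from the IRI; the function name ireg_name_to_ascii is hypothetical, and whether the codec's handling of the UseSTD3ASCIIRules and AllowUnassigned flags matches the settings required here should be checked before relying on it.

```python
def ireg_name_to_ascii(ireg_name: str) -> str:
    """Apply ToASCII label by label, using U+002E as the label separator.

    Raises UnicodeError if ToASCII fails for any label.
    """
    return ireg_name.encode("idna").decode("ascii")
```

For a host such as "r&#xE9;sum&#xE9;.example.org" (in XML Notation), the result is "xn--rsum-bpad.example.org". A name that is already in US-ASCII is returned unchanged.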

The ToASCII operation may fail, but this would mean that the IRI cannot be resolved. This conversion SHOULD be used when the goal is to maximize interoperability with legacy URI resolvers. For example, the IRI

   "http://r&#xE9;sum&#xE9;.example.org"

may be converted to

   "http://xn--rsum-bpad.example.org"

instead of

"http://r%C3%A9sum%C3%A9.example.org".

An IRI with a scheme that is known to use domain names in ireg-name, but where the scheme definition does not allow percent-encoding for ireg-name, meets scheme-specific restrictions if either the straightforward conversion or the conversion using the ToASCII operation on ireg-name results in a URI that meets the scheme-specific restrictions.

Such an IRI resolves to the URI obtained by converting the IRI with the ToASCII operation applied to ireg-name. Implementations do not have to perform this conversion as long as they produce the same result.

Note: The difference between variants b and c in step 1 (using normalization with NFC, versus not using any normalization) accounts for the fact that in many non-Unicode character encodings, some text cannot be represented directly. For example, the word "Vietnam" is natively written "Vi&#x1EC7;t Nam" (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW) in NFC, but a direct transcoding from the windows-1258 character encoding leads to "Vi&#xEA;&#x323;t Nam" (containing a LATIN SMALL LETTER E WITH CIRCUMFLEX followed by a COMBINING DOT BELOW). Direct transcoding of other 8-bit encodings of Vietnamese may lead to other representations.
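
The Vietnamese example can be reproduced with Python's standard unicodedata module. This is only an illustration of NFC; the variable names are assumptions, and the transcoded string is written out directly instead of being decoded from windows-1258.

```python
import unicodedata

# Result of a direct transcoding: E WITH CIRCUMFLEX followed by COMBINING DOT BELOW.
transcoded = "Vi\u00EA\u0323t Nam"

# NFC composes the two code points into U+1EC7.
normalized = unicodedata.normalize("NFC", transcoded)
```

After normalization, the string contains the single character U+1EC7 in place of the two-character sequence, so it is one code point shorter than the transcoded form.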

Note: The uniform treatment of the whole IRI in step 2 is important to make processing independent of URI scheme. See [Gettys] for an in-depth discussion.

Note: In practice, whether the general mapping (steps 1 and 2) or the ToASCII operation of [RFC3490] is used for ireg-name will not be noticed if mapping from IRI to URI and resolution is tightly integrated (e.g., carried out in the same user agent). But conversion using [RFC3490] may be able to better deal with backwards compatibility issues in case mapping and resolution are separated, as in the case of using an HTTP proxy.

Note: Internationalized Domain Names may be contained in parts of an IRI other than the ireg-name part. It is the responsibility of scheme-specific implementations (if the Internationalized Domain Name is part of the scheme syntax) or of server-side implementations (if the Internationalized Domain Name is part of 'iquery') to apply the necessary conversions at the appropriate point. Example: Trying to validate the Web page at http://r&#xE9;sum&#xE9;.example.org would lead to an IRI of http://validator.w3.org/check?uri=http%3A%2F%2Fr&#xE9;sum&#xE9;.example.org, which would convert to a URI of http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.example.org. The server-side implementation would be responsible for making the necessary conversions to be able to retrieve the Web page.

Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^", and "`", in step 2 above. If these characters are found but are not converted, then the conversion SHOULD fail. Please note that the number sign ("#"), the percent sign ("%"), and the square bracket characters ("[", "]") are not part of the above list and MUST NOT be converted. Protocols and formats that have used earlier definitions of IRIs including these characters MAY require percent-encoding of these characters as a preprocessing step to extract the actual IRI from a given field. This preprocessing MAY also be used by applications allowing the user to enter an IRI.

Note: In this process (in step 2.3), characters allowed in URI references and existing percent-encoded sequences are not encoded further. (This mapping is similar to, but different from, the encoding applied when arbitrary content is included in some part of a URI.) For example, an IRI of "http://www.example.org/red%09ros&#xE9;#red" (in XML notation) is converted to "http://www.example.org/red%09ros%C3%A9#red", not to something like "http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".

Note: Some older software transcoding to UTF-8 may produce illegal output for some input, in particular for characters outside the BMP (Basic Multilingual Plane). As an example, for the IRI with non-BMP characters (in XML Notation) "http://example.com/&#x10300;&#x10301;&#x10302;", which contains the first three letters of the Old Italic alphabet, the correct conversion to a URI is "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82".

3.2. Converting URIs to IRIs

In some situations, converting a URI into an equivalent IRI may be desirable. This section gives a procedure for this conversion. The conversion described in this section will always result in an IRI that maps back to the URI used as an input for the conversion (except for potential case differences in percent-encoding and for potential percent-encoded unreserved characters). However, the IRI resulting from this conversion may not be exactly the same as the original IRI (if there ever was one).

URI-to-IRI conversion removes percent-encodings, but not all percent-encodings can be eliminated. There are several reasons for this:

1. Some percent-encodings are necessary to distinguish percent-encoded and unencoded uses of reserved characters.

2. Some percent-encodings cannot be interpreted as sequences of UTF-8 octets.

(Note: The octet patterns of UTF-8 are highly regular. Therefore, there is a very high probability, but no guarantee, that percent-encodings that can be interpreted as sequences of UTF-8 octets actually originated from UTF-8. For a detailed discussion, see [Duerst97].)

3. The conversion may result in a character that is not appropriate in an IRI. See sections 2.2, 4.1, and 6.1 for further details.

Conversion from a URI to an IRI is done by using the following steps (or any other algorithm that produces the same result):

1. Represent the URI as a sequence of octets in US-ASCII.

2. Convert all percent-encodings ("%" followed by two hexadecimal digits) to the corresponding octets, except those corresponding to "%", characters in "reserved", and characters in US-ASCII not allowed in URIs.

3. Re-percent-encode any octet produced in step 2 that is not part of a strictly legal UTF-8 octet sequence.

4. Re-percent-encode all octets produced in step 3 that in UTF-8 represent characters that are not appropriate according to sections 2.2, 4.1, and 6.1.

5. Interpret the resulting octet sequence as a sequence of characters encoded in UTF-8.

This procedure will convert as many percent-encoded characters as possible to characters in an IRI. Because there are some choices when step 4 is applied (see section 6.1), results may vary.

Conversions from URIs to IRIs MUST NOT use any character encoding other than UTF-8 in steps 3 and 4, even if it might be possible to guess from the context that another character encoding than UTF-8 was used in the URI. For example, the URI "http://www.example.org/r%E9sum%E9.html" might with some guessing be interpreted to contain two e-acute characters encoded as iso-8859-1. It must not be converted to an IRI containing these e-acute characters. Otherwise, in the future the IRI will be mapped to "http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different URI from "http://www.example.org/r%E9sum%E9.html".
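
The five conversion steps can be sketched in Python as shown below. This is an illustrative sketch under several assumptions, not a reference implementation: the names uri_to_iri, UNRESERVED, and BIDI_FORMATTING are invented for the example, and step 4 covers only part of what sections 2.2, 4.1, and 6.1 consider inappropriate (the bidirectional formatting characters and code points below U+00A0). As a simplification, every percent-encoding left in the output is written with uppercase hexadecimal digits, so the case of encodings that were never decoded is not preserved.

```python
import re

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# LRM, RLM, LRE, RLE, PDF, LRO, RLO (forbidden in IRIs by section 4.1).
BIDI_FORMATTING = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _pct(octets: bytes) -> str:
    return "".join(f"%{octet:02X}" for octet in octets)


def _may_appear_unencoded(ch: str) -> bool:
    cp = ord(ch)
    if cp < 0x80:
        # "%", reserved characters, and US-ASCII characters not allowed
        # in URIs stay percent-encoded (step 2).
        return ch in UNRESERVED
    # Partial stand-in for step 4.
    return cp >= 0xA0 and cp not in BIDI_FORMATTING


def _convert_run(match: "re.Match[str]") -> str:
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out = []
    # Strict UTF-8 decoding; each octet that is not part of a legal sequence
    # is surfaced as a lone surrogate U+DC80..U+DCFF by "surrogateescape".
    for ch in octets.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(_pct(bytes([cp - 0xDC00])))  # step 3
        elif _may_appear_unencoded(ch):
            out.append(ch)                          # step 5
        else:
            out.append(_pct(ch.encode("utf-8")))    # steps 2 and 4
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Convert a URI to an IRI, decoding only percent-encoded UTF-8."""
    return PCT_RUN.sub(_convert_run, uri)
```

With this sketch, "http://www.example.org/r%E9sum%E9.html" is returned unchanged, because the octet <e9> on its own is not a legal UTF-8 sequence and no other character encoding is ever tried.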

3.2.1. Examples

This section shows various examples of converting URIs to IRIs. Each example shows the result after each of the steps 1 through 5 is applied. XML Notation is used for the final result. Octets are denoted by "<" followed by two hexadecimal digits followed by ">".

The following example contains the sequence "%C3%BC", which is a strictly legal UTF-8 sequence, and which is converted into the actual character U+00FC, LATIN SMALL LETTER U WITH DIAERESIS (also known as u-umlaut).

1. http://www.example.org/D%C3%BCrst

2. http://www.example.org/D<c3><bc>rst

3. http://www.example.org/D<c3><bc>rst

4. http://www.example.org/D<c3><bc>rst

5. http://www.example.org/D&#xFC;rst

The following example contains the sequence "%FC", which might represent U+00FC, LATIN SMALL LETTER U WITH DIAERESIS, in the iso-8859-1 character encoding. (It might represent other characters in other character encodings. For example, the octet <fc> in iso-8859-5 represents U+045C, CYRILLIC SMALL LETTER KJE.) Because <fc> is not part of a strictly legal UTF-8 sequence, it is re-percent-encoded in step 3.

1. http://www.example.org/D%FCrst

2. http://www.example.org/D<fc>rst

3. http://www.example.org/D%FCrst

4. http://www.example.org/D%FCrst

5. http://www.example.org/D%FCrst

The following example contains "%e2%80%ae", which is the percent-encoded UTF-8 character encoding of U+202E, RIGHT-TO-LEFT OVERRIDE. Section 4.1 forbids the direct use of this character in an IRI. Therefore, the corresponding octets are re-percent-encoded in step 4. This example shows that the case (upper- or lowercase) of letters used in percent-encodings may not be preserved. The example also contains a punycode-encoded domain name label (xn--99zt52a), which is not converted.

1. http://xn--99zt52a.example.org/%e2%80%ae

2. http://xn--99zt52a.example.org/<e2><80><ae>

3. http://xn--99zt52a.example.org/<e2><80><ae>

4. http://xn--99zt52a.example.org/%E2%80%AE

5. http://xn--99zt52a.example.org/%E2%80%AE

Implementations with scheme-specific knowledge MAY convert punycode-encoded domain name labels to the corresponding characters by using the ToUnicode procedure. Thus, for the example above, the label "xn--99zt52a" may be converted to U+7D0D U+8C46 (Japanese Natto), leading to the overall IRI of "http://&#x7D0D;&#x8C46;.example.org/%E2%80%AE".
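
For illustration, Python's built-in "idna" codec provides a ToUnicode-style decoding of individual labels. The function name label_to_unicode is an assumption made for this sketch, and the codec follows [RFC3490] as implemented by the Python standard library, which may differ in detail from other implementations.

```python
def label_to_unicode(label: str) -> str:
    """Decode one punycode-encoded domain name label to its characters.

    Labels without the "xn--" prefix are returned unchanged.
    """
    return label.encode("ascii").decode("idna")
```

Applied to "xn--99zt52a", the function returns the two characters U+7D0D U+8C46.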

4. Bidirectional IRIs for Right-to-Left Languages

Some UCS characters, such as those used in the Arabic and Hebrew scripts, have an inherent right-to-left (rtl) writing direction. IRIs containing these characters (called bidirectional IRIs or Bidi IRIs) require additional attention because of the non-trivial relation between logical representation (used for digital representation and for reading/spelling) and visual representation (used for display/printing).

Because of the complex interaction between the logical representation, the visual representation, and the syntax of a Bidi IRI, a balance is needed between various requirements. The main requirements are

1. user-predictable conversion between visual and logical representation;

2. the ability to include a wide range of characters in various parts of the IRI; and

3. minor or no changes or restrictions for implementations.

4.1. Logical Storage and Visual Presentation

When stored or transmitted in digital representation, bidirectional IRIs MUST be in full logical order and MUST conform to the IRI syntax rules (which includes the rules relevant to their scheme). This ensures that bidirectional IRIs can be processed in the same way as other IRIs.

Bidirectional IRIs MUST be rendered by using the Unicode Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be rendered in the same way as they would be if they were in a left-to-right embedding; i.e., as if they were preceded by U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and followed by U+202C, POP DIRECTIONAL FORMATTING (PDF). Setting the embedding direction can also be done in a higher-level protocol (e.g., the dir='ltr' attribute in HTML).

There is no requirement to use the above embedding if the display is still the same without the embedding. For example, a bidirectional IRI in a text with left-to-right base directionality (such as used for English or Cyrillic) that is preceded and followed by whitespace and strong left-to-right characters does not need an embedding. Also, a bidirectional relative IRI reference that only contains strong right-to-left characters and weak characters and that starts and ends with a strong right-to-left character and appears in a text with right-to-left base directionality (such as used for Arabic or Hebrew) and is preceded and followed by whitespace and strong characters does not need an embedding.

In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be sufficient to force the correct display behavior. However, the details of the Unicode Bidirectional algorithm are not always easy to understand. Implementers are strongly advised to err on the side of caution and to use embedding in all cases where they are not completely sure that the display behavior is unaffected without the embedding.

The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits higher-level protocols to influence bidirectional rendering. Such changes by higher-level protocols MUST NOT be used if they change the rendering of IRIs.

The bidirectional formatting characters that may be used before or after the IRI to ensure correct display are not themselves part of the IRI. IRIs MUST NOT contain bidirectional formatting characters (LRM, RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of the IRI but do not appear themselves. It would therefore not be possible to input an IRI with such characters correctly.
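
The distinction between the stored IRI and the formatting characters placed around it for display can be sketched as follows. The helper names contains_bidi_formatting and embed_for_display are assumptions made for this example; the code only wraps the stored IRI in an LRE/PDF pair and does not implement the Unicode Bidirectional Algorithm.

```python
# Bidirectional formatting characters that an IRI MUST NOT contain.
BIDI_FORMATTING = {
    "\u200E": "LRM", "\u200F": "RLM",
    "\u202A": "LRE", "\u202B": "RLE",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u202C": "PDF",
}


def contains_bidi_formatting(iri: str) -> bool:
    """Return True if the IRI itself contains a bidi formatting character."""
    return any(ch in BIDI_FORMATTING for ch in iri)


def embed_for_display(iri: str) -> str:
    """Wrap an IRI in a left-to-right embedding for display only."""
    if contains_bidi_formatting(iri):
        raise ValueError("IRIs must not contain bidi formatting characters")
    return "\u202A" + iri + "\u202C"
```

The wrapped string is meant for rendering only; the value that is stored or transmitted remains the unwrapped IRI in logical order.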

4.2. Bidi IRI Structure

The Unicode Bidirectional Algorithm is designed mainly for running text. To make sure that it does not affect the rendering of bidirectional IRIs too much, some restrictions on bidirectional IRIs are necessary. These restrictions are given in terms of delimiters (structural characters, mostly punctuation such as "@", ".", ":", and "/") and components (usually consisting mostly of letters and digits).

The following syntax rules from section 2.2 correspond to components for the purpose of Bidi behavior: iuserinfo, ireg-name, isegment, isegment-nz, isegment-nz-nc, iquery, and ifragment.

Specifications that define the syntax of any of the above components MAY divide them further and define smaller parts to be components according to this document. As an example, the restrictions of [RFC3490] on bidirectional domain names correspond to treating each label of a domain name as a component for schemes with ireg-name as a domain name. Even where the components are not defined formally, it may be helpful to think about some syntax in terms of components and to apply the relevant restrictions. For example, for the usual name/value syntax in query parts, it is convenient to treat each name and each value as a component. As another example, the extensions in a resource name can be treated as separate components.

For each component, the following restrictions apply:

1. A component SHOULD NOT use both right-to-left and left-to-right characters.

2. A component using right-to-left characters SHOULD start and end with right-to-left characters.

The above restrictions are given as shoulds, rather than as musts. For IRIs that are never presented visually, they are not relevant. However, for IRIs in general, they are very important to ensure consistent conversion between visual presentation and logical representation, in both directions.
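
The two restrictions can be checked per component with the bidirectional classes exposed by Python's unicodedata module. This is a rough sketch under stated assumptions: the function name check_component is invented for the example, right-to-left characters are taken to be those with bidi class R or AL, left-to-right characters those with class L, and the component is assumed to be non-empty and already split out of the IRI.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}


def check_component(component: str) -> list[str]:
    """Return the restrictions of this section that the component violates."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c == "L" for c in classes)
    problems = []
    if has_rtl and has_ltr:
        problems.append("mixes right-to-left and left-to-right characters")
    if has_rtl and not (classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES):
        problems.append("does not start and end with right-to-left characters")
    return problems
```

Since the restrictions are SHOULD-level, a caller would typically report the returned problems as warnings and leave the IRI untouched.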

Note: In some components, the above restrictions may actually be strictly enforced. For example, [RFC3490] requires that these restrictions apply to the labels of a host name for those schemes where ireg-name is a host name. In some other components (for example, path components) following these restrictions may not be too difficult. For other components, such as parts of the query part, it may be very difficult to enforce the restrictions because the values of query parameters may be arbitrary character sequences.

If the above restrictions cannot be satisfied otherwise, the affected component can always be mapped to URI notation as described in section 3.1. Please note that the whole component has to be mapped (see also Example 9 below).

4.3. Input of Bidi IRIs

Bidi input methods MUST generate Bidi IRIs in logical order while rendering them according to section 4.1. During input, rendering SHOULD be updated after every new character is input to avoid end-user confusion.

4.4. Examples

This section gives examples of bidirectional IRIs, in Bidi Notation. It shows legal IRIs with the relationship between logical and visual representation and explains how certain phenomena in this relationship may look strange to somebody not familiar with bidirectional behavior, but familiar to users of Arabic and Hebrew. It also shows what happens if the restrictions given in section 4.2 are not followed. The examples below can be seen at [BidiEx], in Arabic, Hebrew, and Bidi Notation variants.

To read the bidi text in the examples, read the visual representation from left to right until you encounter a block of rtl text. Read the rtl block (including slashes and other special characters) from right to left, then continue at the next unread ltr character.

Example 1: A single component with rtl characters is inverted:

   Logical representation: "http://ab.CDEFGH.ij/kl/mn/op.html"
   Visual representation: "http://ab.HGFEDC.ij/kl/mn/op.html"

Components can be read one by one, and each component can be read in its natural direction.

Example 2: More than one consecutive component with rtl characters is inverted as a whole:

   Logical representation: "http://ab.CDE.FGH/ij/kl/mn/op.html"
   Visual representation: "http://ab.HGF.EDC/ij/kl/mn/op.html"

A sequence of rtl components is read rtl, in the same way as a sequence of rtl words is read rtl in a bidi text.

Example 3: All components of an IRI (except for the scheme) are rtl. All rtl components are inverted overall:

   Logical representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"
   Visual representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"

The whole IRI (except the scheme) is read rtl. Delimiters between rtl components stay between the respective components; delimiters between ltr and rtl components don't move.

Example 4: Each of several sequences of rtl components is inverted on its own:

   Logical representation: "http://AB.CD.ef/gh/IJ/KL.html"
   Visual representation: "http://DC.BA.ef/gh/LK/JI.html"

Each sequence of rtl components is read rtl, in the same way as each sequence of rtl words in an ltr text is read rtl.

Example 5: Example 2, applied to components of different kinds:

   Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
   Visual representation: "http://ab.cd.HG/FE/ij/kl.html"

The inversion of the domain name label and the path component may be unexpected, but it is consistent with other bidi behavior. For reassurance that the domain component really is "ab.cd.EF", it may be helpful to read aloud the visual representation following the bidi algorithm. After "http://ab.cd." one reads the RTL block "E-F-slash-G-H", which corresponds to the logical representation.

Example 6: Same as Example 5, with more rtl components:

   Logical representation: "http://ab.CD.EF/GH/IJ/kl.html"
   Visual representation: "http://ab.JI/HG/FE.DC/kl.html"

The inversion of the domain name labels and the path components may be easier to identify because the delimiters also move.

Example 7: A single rtl component includes digits:

   Logical representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"
   Visual representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"

Numbers are written ltr in all cases but are treated as an additional embedding inside a run of rtl characters. This is completely consistent with usual bidirectional text.

Example 8 (not allowed): Numbers are at the start or end of an rtl component:

   Logical representation: "http://ab.cd.ef/GH1/2IJ/KL.html"
   Visual representation: "http://ab.cd.ef/LK/JI1/2HG.html"

The sequence "1/2" is interpreted by the bidi algorithm as a fraction, fragmenting the components and leading to confusion. There are other characters that are interpreted in a special way close to numbers; in particular, "+", "-", "#", "$", "%", ",", ".", and ":".

Example 9 (not allowed): The numbers in the previous example are percent-encoded:
Logical representation: "http://ab.cd.ef/GH%31/%32IJ/KL.html"
Visual representation (Hebrew): "http://ab.cd.ef/%31HG/LK/JI%32.html"
Visual representation (Arabic): "http://ab.cd.ef/31%HG/%LK/JI32.html"
Depending on whether the uppercase letters represent Arabic or Hebrew, the visual representation is different.

示例9(不允许):上一示例中的数字采用百分比编码:逻辑表示:http://ab.cd.ef/GH%31/%32IJ/KL.html,视觉表现(希伯来语):http://ab.cd.ef/%31HG/LK/JI%32.html“视觉表现(阿拉伯语):”http://ab.cd.ef/31%HG/%LK/JI32.html"根据大写字母是表示阿拉伯语还是希伯来语,视觉表示方式有所不同。

Example 10 (allowed but not recommended):
Logical representation: "http://ab.CDEFGH.123/kl/mn/op.html"
Visual representation: "http://ab.123.HGFEDC/kl/mn/op.html"
Components consisting of only numbers are allowed (it would be rather difficult to prohibit them), but these may interact with adjacent RTL components in ways that are not easy to predict.

示例10(允许但不推荐):逻辑表示:http://ab.CDEFGH.123/kl/mn/op.html“视觉表示:”http://ab.123.HGFEDC/kl/mn/op.html“只允许包含数字的组件(禁止它们相当困难),但它们可能以不易预测的方式与相邻RTL组件相互作用。

5. Normalization and Comparison
5. 规范化与比较

Note: The structure and much of the material for this section is taken from section 6 of [RFC3986]; the differences are due to the specifics of IRIs.

注:本节的结构和大部分材料取自[RFC3986]第6节;这些差异是由于虹膜的特殊性造成的。

One of the most common operations on IRIs is simple comparison: Determining whether two IRIs are equivalent without using the IRIs or the mapped URIs to access their respective resource(s). A comparison is performed whenever a response cache is accessed, a browser checks its history to color a link, or an XML parser processes tags within a namespace. Extensive normalization prior to comparison of IRIs may be used by spiders and indexing engines to prune a search space or reduce duplication of request actions and response storage.

对IRIs最常见的操作之一是简单的比较:确定两个IRIs是否等效,而不使用IRIs或映射的URI访问它们各自的资源。每当访问响应缓存、浏览器检查其历史记录以给链接着色或XML解析器处理命名空间内的标记时,都会执行比较。spider和索引引擎可以使用IRIs比较之前的广泛规范化来修剪搜索空间或减少请求操作和响应存储的重复。

IRI comparison is performed for some particular purpose. Protocols or implementations that compare IRIs for different purposes will often be subject to differing design trade-offs with regard to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare IRIs, the trade-offs between them, and the types of applications that might use them.

进行IRI比较是为了某些特定目的。为不同目的比较IRI的协议或实现通常会在减少别名标识符方面花费多少精力,从而在设计上进行不同的权衡。本节介绍可用于比较IRIs的各种方法、它们之间的权衡以及可能使用它们的应用程序类型。

5.1. Equivalence
5.1. 等值

Because IRIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. However, this definition of equivalence is not of much practical use, as there is no way for an implementation to compare two resources unless it has full knowledge or control of them. For this reason, determination of equivalence or difference of IRIs is based on string comparison, perhaps augmented by reference to additional rules provided by URI scheme definitions. We use the terms "different" and "equivalent" to describe the possible outcomes of such comparisons, but there are many application-dependent versions of equivalence.

因为IRI是用来识别资源的,所以当它们识别相同的资源时,应该认为它们是等价的。然而,这种等价性的定义没有太多实际用途,因为除非实现完全了解或控制两种资源,否则实现无法比较这两种资源。由于这个原因,IRIs的等价性或差异性的确定是基于字符串比较的,可能通过引用URI方案定义提供的附加规则来增强。我们使用术语“不同”和“等效”来描述此类比较的可能结果,但存在许多依赖于应用程序的等效版本。

Even though it is possible to determine that two IRIs are equivalent, IRI comparison is not sufficient to determine whether two IRIs identify different resources. For example, an owner of two different domain names could decide to serve the same resource from both, resulting in two different IRIs. Therefore, comparison methods are designed to minimize false negatives while strictly avoiding false positives.

即使可以确定两个IRI是等效的,但IRI比较不足以确定两个IRI是否识别不同的资源。例如,两个不同域名的所有者可以决定从这两个域名服务相同的资源,从而产生两个不同的IRI。因此,比较方法的目的是尽量减少误报,同时严格避免误报。

In testing for equivalence, applications should not directly compare relative references; the references should be converted to their respective target IRIs before comparison. When IRIs are compared to select (or avoid) a network action, such as retrieval of a representation, fragment components (if any) should be excluded from the comparison.

在测试等价性时,应用程序不应直接比较相对引用;比较前,应将参考转换为各自的目标虹膜。当将IRIs与选择(或避免)网络操作(如检索表示)进行比较时,应将片段组件(如果有)排除在比较之外。
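The two rules in the paragraph above (resolve relative references first, and ignore fragments when the comparison selects a network action) can be sketched as follows. This is only an illustration using Python's standard `urllib.parse`; the helper name `retrieval_key` and the example base are assumptions, not part of this specification.

```python
from urllib.parse import urldefrag, urljoin


def retrieval_key(reference: str, base: str) -> str:
    """Resolve a reference against a base, then drop any fragment."""
    target = urljoin(base, reference)  # relative reference -> target IRI
    return urldefrag(target).url       # the fragment plays no part in retrieval


base = "http://example.org/a/b"
same = retrieval_key("c#intro", base) == retrieval_key(
    "http://example.org/a/c#notes", base
)
```

Here `same` is true because both references resolve to the same target once the fragments are set aside; comparing the raw strings "c#intro" and "http://example.org/a/c#notes" directly would report a difference.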

Applications using IRIs as identity tokens with no relationship to a protocol MUST use the Simple String Comparison (see section 5.3.1). All other applications MUST select one of the comparison practices from the Comparison Ladder (see section 5.3) or, after IRI-to-URI conversion, select one of the comparison practices from the URI comparison ladder in [RFC3986], section 6.2.

将IRIs用作与协议无关的身份令牌的应用程序必须使用简单字符串比较(见第5.3.1节)。所有其他应用程序必须从比较阶梯中选择一种比较实践(参见第5.3节,或者在IRI转换为URI后,从[RFC3986]第6.2节的URI比较阶梯中选择一种比较实践)

5.2. Preparation for Comparison
5.2. 比较准备

Any kind of IRI comparison REQUIRES that all escapings or encodings in the protocol or format that carries an IRI are resolved. This is usually done when the protocol or format is parsed. Examples of such escapings or encodings are entities and numeric character references in [HTML4] and [XML1]. As an example, "http://example.org/ros&eacute;" (in HTML), "http://example.org/ros&#233;" (in HTML or XML), and "http://example.org/ros&#xE9;" (in HTML or XML) are all resolved into what is denoted in this document (see section 1.4) as "http://example.org/ros&#xE9;" (the "&#xE9;" here standing for the actual e-acute character, to compensate for the fact that this document cannot contain non-ASCII characters).

任何类型的IRI比较都需要解析协议或格式中携带IRI的所有转义或编码。这通常是在解析协议或格式时完成的。这类转义或编码的例子是[HTML4]和[XML1]中的实体和数字字符引用。例如,“http://example.org/ros&eacute;”(HTML格式)、“http://example.org/ros&#233;”(HTML或XML格式)和“http://example.org/ros&#xE9;”(HTML或XML格式)都被解析为本文档中表示的内容(见第1.4节),即“http://example.org/ros&#xE9;”(此处的“&#xE9;”代表实际的e-acute字符,以补偿此文档不能包含非ASCII字符的事实)。
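As an illustration of resolving format-level escapes before comparison, the three HTML spellings in the example above can be passed through Python's standard `html.unescape`. This is a sketch for HTML-style references only; other carrier formats have their own unescaping rules, and the variable names are hypothetical.

```python
import html

# Three ways of writing the same IRI inside an HTML document.
escaped_forms = [
    "http://example.org/ros&eacute;",
    "http://example.org/ros&#233;",
    "http://example.org/ros&#xE9;",
]

# After the carrier format's escapes are resolved, one IRI remains.
resolved = {html.unescape(form) for form in escaped_forms}
```

All three forms collapse to a single string ending in the actual e-acute character (U+00E9), which is the form that should enter any of the comparisons described below.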

Similar considerations apply to encodings such as Transfer Codings in HTTP (see [RFC2616]) and Content Transfer Encodings in MIME ([RFC2045]), although in these cases, the encoding is based not on characters but on octets, and additional care is required to make sure that characters, and not just arbitrary octets, are compared (see section 5.3.1).

类似的考虑也适用于编码,如HTTP中的传输编码(参见[RFC2616])和MIME中的内容传输编码([RFC2045]),尽管在这些情况下,编码不是基于字符而是基于八位字节,需要额外注意确保对字符而不仅仅是任意八位字节进行比较(参见第5.3.1节)。

5.3. Comparison Ladder
5.3. 比较阶梯

In practice, a variety of methods are used to test IRI equivalence. These methods fall into a range distinguished by the amount of processing required and the degree to which the probability of false negatives is reduced. As noted above, false negatives cannot be eliminated. In practice, their probability can be reduced, but this reduction requires more processing and is not cost-effective for all applications.

在实践中,使用了多种方法来测试IRI等效性。这些方法属于一个范围,其区别在于所需的处理量和假阴性概率降低的程度。如上所述,不能消除假阴性。在实践中,它们的概率可以降低,但这种降低需要更多的处理,并且并非对所有应用程序都具有成本效益。

If this range of comparison practices is considered as a ladder, the following discussion will climb the ladder, starting with practices that are cheap but have a relatively higher chance of producing false negatives, and proceeding to those that have higher computational cost and lower risk of false negatives.

如果将这一系列比较实践视为一个阶梯,那么下面的讨论将沿着阶梯上升,首先是成本较低但产生假阴性概率相对较高的实践,然后是计算成本较高且假阴性风险较低的实践。

5.3.1. Simple String Comparison
5.3.1. 简单字符串比较

If two IRIs, when considered as character strings, are identical, then it is safe to conclude that they are equivalent. This type of equivalence test has very low computational cost and is in wide use in a variety of applications, particularly in the domain of parsing. It is also used when a definitive answer to the question of IRI equivalence is needed that is independent of the scheme used and that can be calculated quickly and without accessing a network. An example of such a case is XML Namespaces ([XMLNamespace]).

如果两个虹膜(当被视为字符串时)是相同的,那么可以安全地得出它们是等价的结论。这种类型的等价性测试具有非常低的计算成本,并且在各种应用中被广泛使用,特别是在解析领域。当需要独立于所用方案的IRI等价性问题的确定答案时,也可使用该方法,该方法可快速计算,且无需访问网络。这种情况的一个例子是XML名称空间([XMLNamespace])。

Testing strings for equivalence requires some basic precautions. This procedure is often referred to as "bit-for-bit" or "byte-for-byte" comparison, which is potentially misleading. Testing strings for equality is normally based on pair comparison of the characters that make up the strings, starting from the first and proceeding until both strings are exhausted and all characters are found to be equal, until a pair of characters compares unequal, or until one of the strings is exhausted before the other.

测试字符串的等价性需要一些基本的预防措施。此过程通常被称为“位对位”或“字节对字节”比较,这可能会产生误导。测试字符串是否相等通常基于对组成字符串的字符进行成对比较,从第一个字符开始,直到两个字符串都用完并且所有字符都相等,直到一对字符比较不相等,或者直到其中一个字符串在另一个字符串之前用完。

This character comparison requires that each pair of characters be put in comparable encoding form. For example, should one IRI be stored in a byte array in UTF-8 encoding form and the second in a UTF-16 encoding form, bit-for-bit comparisons applied naively will produce errors. It is better to speak of equality on a character-for-character rather than on a byte-for-byte or bit-for-bit basis. In practical terms, character-by-character comparisons should be done codepoint by codepoint after conversion to a common character encoding form. When comparing character by character, the comparison function MUST NOT map IRIs to URIs, because such a mapping would create additional spurious equivalences. It follows that an IRI SHOULD NOT be modified when being transported if there is any chance that this IRI might be used as an identifier.

这种字符比较要求每对字符采用可比较的编码形式。例如,如果一个IRI以UTF-8编码形式存储在字节数组中,第二个以UTF-16编码形式存储在字节数组中,那么简单地应用逐位比较将产生错误。更好的说法是字符对字符的平等,而不是字节对字节或比特对比特的平等。实际上,在转换为通用字符编码形式后,逐字符比较应逐码点进行。在逐字符比较时,比较函数不能将IRIs映射到URI,因为这样的映射会创建额外的伪等价。因此,如果有可能将IRI用作标识符,则在传输时不应修改该IRI。
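A minimal sketch of the point made above: two stored copies of one IRI differ byte for byte when held in different encoding forms, but agree once both are decoded to code points. The function name and the choice of UTF-8 and UTF-16 (little-endian) are illustrative assumptions.

```python
def same_iri_simple(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Simple string comparison: decode both sides, then compare code points."""
    return a.decode(a_encoding) == b.decode(b_encoding)


iri = "http://example.org/ros\u00e9"
utf8_bytes = iri.encode("utf-8")
utf16_bytes = iri.encode("utf-16-le")

bytes_equal = utf8_bytes == utf16_bytes  # naive byte-for-byte test
chars_equal = same_iri_simple(utf8_bytes, "utf-8", utf16_bytes, "utf-16-le")

# No IRI-to-URI mapping is applied, so the percent-encoded spelling
# stays a different string at this rung of the ladder.
mapped_equal = iri == "http://example.org/ros%C3%A9"
```

The byte comparison reports a difference while the character comparison does not. The last line shows that simple string comparison deliberately keeps "ros&#xE9;" and "ros%C3%A9" apart; treating them as equivalent belongs to the normalization steps in section 5.3.2.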

False negatives are caused by the production and use of IRI aliases. Unnecessary aliases can be reduced, regardless of the comparison method, by consistently providing IRI references in an already normalized form (i.e., a form identical to what would be produced after normalization is applied, as described below). Protocols and data formats often limit some IRI comparisons to simple string comparison, based on the theory that people and implementations will, in their own best interest, be consistent in providing IRI references, or at least be consistent enough to negate any efficiency that might be obtained from further normalization.

误报是由IRI别名的产生和使用引起的。通过以已经标准化的形式(即,与应用标准化后产生的形式相同的形式,如下所述)一致地提供IRI引用,可以减少不必要的别名,而不考虑比较方法。协议和数据格式通常将一些IRI比较限制为简单的字符串比较,这是基于这样一种理论,即人员和实现在提供IRI引用时将保持一致,或者至少保持足够的一致性,以否定进一步规范化可能获得的任何效率。

5.3.2. Syntax-Based Normalization
5.3.2. 基于语法的规范化

Implementations may use logic based on the definitions provided by this specification to reduce the probability of false negatives. This processing is moderately higher in cost than character-for-character string comparison. For example, an application using this approach could reasonably consider the following two IRIs equivalent:

实现可以使用基于本规范提供的定义的逻辑来降低误报概率。此处理的成本略高于字符串对字符串的比较。例如,使用这种方法的应用程序可以合理地考虑以下两个虹膜等效:

      example://a/b/c/%7Bfoo%7D/ros&#xE9;
      eXAMPLE://a/./b/../b/%63/%7bfoo%7d/ros%C3%A9
        

Web user agents, such as browsers, typically apply this type of IRI normalization when determining whether a cached response is available. Syntax-based normalization includes such techniques as case normalization, character normalization, percent-encoding normalization, and removal of dot-segments.

Web用户代理(如浏览器)通常在确定缓存响应是否可用时应用这种类型的IRI规范化。基于语法的规范化包括大小写规范化、字符规范化、百分比编码规范化和删除点段等技术。

5.3.2.1. Case Normalization
5.3.2.1. 案例规范化

For all IRIs, the hexadecimal digits within a percent-encoding triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore should be normalized to use uppercase letters for the digits A - F.

对于所有IRIs,百分比编码三元组中的十六进制数字(例如,“%3a”与“%3a”)不区分大小写,因此应规范化,以使用大写字母表示数字a-F。

When an IRI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and US-ASCII only host are case insensitive and therefore should be normalized to lowercase. For example, the URI "HTTP://www.EXAMPLE.com/" is equivalent to "http://www.example.com/". Case equivalence for non-ASCII characters in IRI components that are IDNs is discussed in section 5.3.3. The other generic syntax components are assumed to be case sensitive unless specifically defined otherwise by the scheme.

当IRI使用通用语法的组件时,组件语法等价规则始终适用;也就是说,scheme和仅US-ASCII主机不区分大小写,因此应该规范化为小写。例如,URI“HTTP://www.example.com/”相当于http://www.example.com/". 第5.3.3节讨论了作为IDN的IRI组件中非ASCII字符的大小写等效性。除非方案中另有明确定义,否则其他通用语法组件假定为区分大小写。
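The case rules above can be sketched as follows. This is an illustration under stated assumptions, not a complete normalizer: the helper name `normalize_case` is hypothetical, the splitting regex is a simplified form of the one in Appendix B of [RFC3986], and a host containing non-ASCII characters is left untouched because its handling belongs to section 5.3.3.

```python
import re

# scheme, authority, and everything after the authority
_PARTS = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$", re.S)
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_case(iri: str) -> str:
    scheme, authority, rest = _PARTS.match(iri).groups()
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        # userinfo is case sensitive; only host (and port) are lowercased
        userinfo, at, hostport = authority.rpartition("@")
        if hostport.isascii():
            hostport = hostport.lower()
        out += "//" + userinfo + at + hostport
    out += rest
    # hexadecimal digits in percent-encoding triplets become uppercase
    return _PCT.sub(lambda m: m.group(0).upper(), out)


normalized = normalize_case("HTTP://www.EXAMPLE.com/a%3ab")
```

The path, query, and fragment are passed through unchanged apart from the hexadecimal digits, since those components are assumed to be case sensitive. Delimiters such as a trailing "?" are also preserved.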

Creating schemes that allow case-insensitive syntax components containing non-ASCII characters should be avoided. Case normalization of non-ASCII characters can be culturally dependent and is always a complex operation. The only exception concerns non-ASCII host names for which the character normalization includes a mapping step derived from case folding.

应避免创建允许包含非ASCII字符的不区分大小写语法组件的方案。非ASCII字符的大小写规范化可能与文化有关,并且始终是一项复杂的操作。唯一的例外涉及非ASCII主机名,字符规范化包括从大小写折叠派生的映射步骤。

5.3.2.2. Character Normalization
5.3.2.2. 字符规范化

The Unicode Standard [UNIV4] defines various equivalences between sequences of characters for various purposes. Unicode Standard Annex #15 [UTR15] defines various Normalization Forms for these equivalences, in particular Normalization Form C (NFC, Canonical Decomposition, followed by Canonical Composition) and Normalization Form KC (NFKC, Compatibility Decomposition, followed by Canonical Composition).

Unicode标准[UNIV4]为各种目的定义了字符序列之间的各种等价性。Unicode标准附录#15[UTR15]定义了这些等价物的各种规范化形式,特别是规范化形式C(NFC,标准分解,然后是标准组合)和规范化形式KC(NFKC,兼容性分解,然后是标准组合)。

Equivalence of IRIs MUST rely on the assumption that IRIs are appropriately pre-character-normalized rather than apply character normalization when comparing two IRIs. The exceptions are conversion from a non-digital form, and conversion from a non-UCS-based character encoding to a UCS-based character encoding. In these cases, NFC or a normalizing transcoder using NFC MUST be used for interoperability. To avoid false negatives and problems with transcoding, IRIs SHOULD be created by using NFC. Using NFKC may avoid even more problems; for example, by choosing half-width Latin letters instead of full-width ones, and full-width instead of half-width Katakana.

在比较两个虹膜时,虹膜的等效性必须依赖于虹膜被适当地预字符归一化的假设,而不是应用字符归一化。例外情况是从非数字形式转换,以及从非基于UCS的字符编码转换为基于UCS的字符编码。在这些情况下,必须使用NFC或使用NFC的规范化转码器来实现互操作性。为了避免误报和转码问题,应该使用NFC创建IRIs。使用NFKC可以避免更多问题;例如,选择半宽拉丁字母而不是全宽拉丁字母,选择全宽拉丁字母而不是半宽片假名。

As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML Notation) is in NFC. On the other hand, "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.

例如,”http://www.example.org/r&#xE9“sum&#xE9;.html”(以XML表示法)是NFC格式的。另一方面,”http://www.example.org/re&#x301“sume&#x301;.html”不在NFC中。

The former uses precombined e-acute characters, and the latter uses "e" characters followed by combining acute accents. Both usages are defined as canonically equivalent in [UNIV4].

前者使用预先组合的e-acute字符,而后者使用“e”字符,然后再组合急性重音。这两种用法在[UNIV4]中定义为规范等效。
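The difference between the two spellings can be checked with Python's standard `unicodedata` module. The sketch below is meant for the IRI creation step, where the text above recommends NFC; it is not a suggestion to normalize during comparison. The helper name `is_nfc` is hypothetical.

```python
import unicodedata

precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"    # U+00E9
decomposed = "http://www.example.org/re\u0301sume\u0301.html"  # "e" + U+0301


def is_nfc(text: str) -> bool:
    """True if the text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# NFKC additionally folds compatibility variants such as
# full-width Latin letters and half-width Katakana.
fullwidth_a = unicodedata.normalize("NFKC", "\uff41")
halfwidth_ka = unicodedata.normalize("NFKC", "\uff76")
```

The two strings compare unequal as they stand, and only the first is in NFC; applying NFC to the second yields the first.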

Note: Because it is unknown how a particular sequence of characters is being treated with respect to character normalization, it would be inappropriate to allow third parties to normalize an IRI arbitrarily. This does not contradict the recommendation that when a resource is created, its IRI should be as character normalized as possible (i.e., NFC or even NFKC). This is similar to the uppercase/lowercase problems. Some parts of a URI are case insensitive (domain name). For others, it is unclear whether they are case sensitive, case insensitive, or something in between (e.g., case sensitive, but with a multiple choice selection if the wrong case is used, instead of a direct negative result). The best recipe is that the creator use a reasonable capitalization and, when transferring the URI, capitalization never be changed.

注意:由于不知道在字符规范化方面如何处理特定的字符序列,因此允许第三方任意规范IRI是不合适的。这与创建资源时,其IRI应尽可能地进行字符规范化(即NFC或甚至NFKC)的建议并不矛盾。这类似于大小写问题。URI的某些部分不区分大小写(域名)。对于其他人来说,不清楚它们是区分大小写、不区分大小写还是介于两者之间(例如区分大小写,但如果使用了错误的大小写,则使用多项选择,而不是直接的否定结果)。最好的方法是创建者使用合理的大小写,并且在传输URI时,大小写永远不会改变。

Various IRI schemes may allow the usage of Internationalized Domain Names (IDN) [RFC3490] either in the ireg-name part or elsewhere. Character Normalization also applies to IDNs, as discussed in section 5.3.3.

各种IRI方案可能允许在ireg名称部分或其他地方使用国际化域名(IDN)[RFC3490]。字符规范化也适用于IDN,如第5.3.3节所述。

5.3.2.3. Percent-Encoding Normalization
5.3.2.3. 百分比编码规范化

The percent-encoding mechanism (section 2.1 of [RFC3986]) is a frequent source of variance among otherwise identical IRIs. In addition to the case normalization issue noted above, some IRI producers percent-encode octets that do not require percent-encoding, resulting in IRIs that are equivalent to their non-encoded counterparts. These IRIs should be normalized by decoding any percent-encoded octet sequence that corresponds to an unreserved character, as described in section 2.3 of [RFC3986].

百分比编码机制(RFC3986第2.1节)是其他相同虹膜之间的一个常见差异源。除了上面提到的案例规范化问题外,一些IRI生产者对不需要百分比编码的八位字节进行百分比编码,从而产生与其非编码对应的IRI。如[RFC3986]第2.3节所述,这些虹膜应通过解码对应于未保留字符的任何百分比编码八位字节序列来规范化。
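A minimal sketch of the decoding step described above, assuming the unreserved set of section 2.3 of [RFC3986] (ALPHA, DIGIT, "-", ".", "_", "~"). The function name is hypothetical. Triplets that do not correspond to an unreserved character are kept, with their hexadecimal digits uppercased as in section 5.3.2.1.

```python
import re
import string

_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(iri: str) -> str:
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in _UNRESERVED:
            return char                  # needless encoding is removed
        return match.group(0).upper()    # everything else stays encoded
    return _PCT.sub(replace, iri)


result = decode_unreserved("example://a/b/%63/%7bfoo%7d/%7euser")
```

In this example "%63" and "%7e" are decoded to "c" and "~", while "%7b" and "%7d" (the braces, which are not unreserved) remain percent-encoded.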

For actual resolution, differences in percent-encoding (except for the percent-encoding of reserved characters) MUST always result in the same resource. For example, "http://example.org/~user", "http://example.org/%7euser", and "http://example.org/%7Euser" must resolve to the same resource.

对于实际分辨率,百分比编码的差异(保留字符的百分比编码除外)必须始终导致相同的资源。例如,”http://example.org/~user“,”http://example.org/%7euser“、和”http://example.org/%7Euser,必须解析为同一资源。

If this kind of equivalence is to be tested, the percent-encoding of both IRIs to be compared has to be aligned; for example, by converting both IRIs to URIs (see section 3.1), eliminating escape differences in the resulting URIs, and making sure that the case of the hexadecimal characters in the percent-encoding is always the same (preferably uppercase). If the IRI is to be passed to another application or used further in some other way, its original form MUST be preserved. The conversion described here should be performed only for local comparison.

如果要测试这种等价性,则要比较的两个IRI的百分比编码必须对齐;例如,通过将两个IRI转换为URI(参见第3.1节),消除结果URI中的转义差异,并确保百分比编码中十六进制字符的大小写始终相同(最好是大写)。如果要将IRI传递给另一个应用程序或以其他方式进一步使用,必须保留其原始形式。此处描述的转换应仅用于本地比较。

5.3.2.4. Path Segment Normalization
5.3.2.4. 路径段规范化

The complete path segments "." and ".." are intended only for use within relative references (section 4.1 of [RFC3986]) and are removed as part of the reference resolution process (section 5.2 of [RFC3986]). However, some implementations may incorrectly assume that reference resolution is not necessary when the reference is already an IRI, and thus fail to remove dot-segments when they occur in non-relative paths. IRI normalizers should remove dot-segments by applying the remove_dot_segments algorithm to the path, as described in section 5.2.4 of [RFC3986].

完整的路径段“.”和“.”仅用于相关参考(RFC3986第4.1节)中,并作为参考解析过程的一部分删除(RFC3986第5.2节)。然而,一些实现可能错误地假设,当参考已经是IRI时,参考分辨率是不必要的,因此,当点段出现在非相对路径中时,无法移除点段。IRI规范化器应通过对路径应用remove_dot_segments算法来删除点段,如[RFC3986]第5.2.4节所述。
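For reference, the remove_dot_segments procedure of section 5.2.4 of [RFC3986] can be written out as follows. This is a sketch that follows the steps of that section on an already isolated path component; separating the path from the rest of the IRI is assumed to have been done by the caller.

```python
def remove_dot_segments(path: str) -> str:
    output = []  # completed segments, each with its leading "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]            # "/./" -> "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]            # "/../" -> "/", drop last segment
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # move the first segment (with its leading "/", if any)
            # from the input to the output
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)


cleaned = remove_dot_segments("/a/./b/../b/%63/%7bfoo%7d")
```

The two test inputs given in section 5.2.4 of [RFC3986], "/a/b/c/./../../g" and "mid/content=5/../6", come out as "/a/g" and "mid/6" respectively.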

5.3.3. Scheme-Based Normalization
5.3.3. 基于方案的规范化

The syntax and semantics of IRIs vary from scheme to scheme, as described by the defining specification for each scheme. Implementations may use scheme-specific rules, at further processing cost, to reduce the probability of false negatives. For example, because the "http" scheme makes use of an authority component, has a default port of "80", and defines an empty path to be equivalent to "/", the following four IRIs are equivalent:

IRIs的语法和语义因方案而异,如每个方案的定义规范所述。实现可以使用特定于方案的规则,以进一步的处理成本降低误报的概率。例如,由于“http”方案使用了授权组件,具有默认端口“80”,并定义了一个空路径等效于“/”,因此以下四个IRI是等效的:

      http://example.com
      http://example.com/
      http://example.com:/
      http://example.com:80/
        

In general, an IRI that uses the generic syntax for authority with an empty path should be normalized to a path of "/". Likewise, an explicit ":port", for which the port is empty or the default for the scheme, is equivalent to one where the port and its ":" delimiter are elided and thus should be removed by scheme-based normalization. For example, the second IRI above is the normal form for the "http" scheme.

一般来说,对具有空路径的权限使用通用语法的IRI应规范化为路径“/”。同样,显式“:port”(端口为空或方案的默认值)等同于省略端口及其“:”分隔符,因此应通过基于方案的规范化删除。例如,上面的第二个IRI是“http”方案的标准形式。
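The "http" rules above (empty path becomes "/", empty or default port is dropped together with its ":" delimiter) can be sketched as follows. The function name is hypothetical, the input is assumed to have a lowercase scheme already, and no other scheme is handled.

```python
import re

_HTTP = re.compile(r"^http://([^/?#]*)(.*)$", re.S)
_HOST_PORT = re.compile(r"(.*?)(?::(\d*))?")


def normalize_http(iri: str) -> str:
    match = _HTTP.match(iri)
    if match is None:
        return iri                      # not an "http" IRI; leave it alone
    authority, rest = match.groups()
    host, port = _HOST_PORT.fullmatch(authority).groups()
    if port in (None, "", "80"):
        authority = host                # drop empty or default port
    if not rest.startswith("/"):
        rest = "/" + rest               # empty path is equivalent to "/"
    return "http://" + authority + rest


forms = [
    "http://example.com",
    "http://example.com/",
    "http://example.com:/",
    "http://example.com:80/",
]
normal_forms = {normalize_http(form) for form in forms}
```

All four forms reduce to "http://example.com/". An IRI such as "http://example.com/?" keeps its "?" delimiter and therefore remains distinct, and a non-default port such as ":8080" is preserved.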

Another case where normalization varies by scheme is in the handling of an empty authority component or empty host subcomponent. For many scheme specifications, an empty authority or host is considered an error; for others, it is considered equivalent to "localhost" or the end-user's host. When a scheme defines a default for authority and an IRI reference to that default is desired, the reference should be normalized to an empty authority for the sake of uniformity, brevity, and internationalization. If, however, either the userinfo or port subcomponents are non-empty, then the host should be given explicitly even if it matches the default.

规范化因方案而异的另一种情况是处理空权限组件或空主机子组件。对于许多方案规范,空权限或主机被视为错误;对于其他方案,它被认为等同于“localhost”或最终用户的主机。当方案为权限定义了默认值,并且需要对该默认值进行IRI引用时,为了统一性、简洁性和国际化,应将该引用规范化为空权限。但是,如果userinfo或port子组件为非空,则即使主机与默认值匹配,也应显式指定主机。

Normalization should not remove delimiters when their associated component is empty unless it is licensed to do so by the scheme specification. For example, the IRI "http://example.com/?" cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two IRIs that differ only by the suffix "#" are considered different regardless of the scheme.

当相关组件为空时,规范化不应删除分隔符,除非方案规范授权它这样做。例如,IRI“http://example.com/?“不能假定与上述任何示例等效。同样,userinfo子组件中是否存在分隔符通常对其解释非常重要。片段组件不受任何基于方案的规范化的约束;因此,两个仅因后缀“#”不同的虹膜被认为是不同的,而与方案无关。

Some IRI schemes may allow the usage of Internationalized Domain Names (IDN) [RFC3490] either in their ireg-name part or elsewhere. When in use in IRIs, those names SHOULD be validated by using the ToASCII operation defined in [RFC3490], with the flags "UseSTD3ASCIIRules" and "AllowUnassigned". An IRI containing an invalid IDN cannot successfully be resolved. Validated IDN components of IRIs SHOULD be character normalized by using the Nameprep process [RFC3491]; however, for legibility purposes, they SHOULD NOT be converted into ASCII Compatible Encoding (ACE).

一些IRI方案可能允许在其ireg名称部分或其他地方使用国际化域名(IDN)[RFC3490]。在IRIs中使用时,应使用[RFC3490]中定义的ToASCII操作验证这些名称,标记为“UseSTD3ASCIIRules”和“Allowunasigned”。无法成功解析包含无效IDN的IRI。应使用Nameprep过程[RFC3491]对IRIs的已验证IDN组件进行字符规范化;但是,为了便于阅读,不应将其转换为ASCII兼容编码(ACE)。

Scheme-based normalization may also consider IDN components and their conversions to punycode as equivalent. As an example, "http://r&#xE9;sum&#xE9;.example.org" may be considered equivalent to "http://xn--rsum-bpad.example.org".

基于方案的归一化还可以考虑IDN组件及其转换为PyyCal码的等价性。例如,“http://r&#xE9;sum&#xE9;example.org”可能被认为等同于http://xn--rsum-bpad.example.org".
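The equivalence in the example above can be illustrated with the "idna" codec in Python's standard library, which implements the ToASCII operation of [RFC3490] including Nameprep. The sketch makes several assumptions: the helper names are hypothetical, the codec does not expose the "UseSTD3ASCIIRules" and "AllowUnassigned" flags mentioned above, and the ACE form is computed only for local comparison, not for display.

```python
def host_to_ace(host: str) -> str:
    """Convert a host name to its ASCII Compatible Encoding."""
    return host.encode("idna").decode("ascii")


def hosts_equivalent(a: str, b: str) -> bool:
    # ASCII labels are case insensitive, so compare in lowercase.
    return host_to_ace(a).lower() == host_to_ace(b).lower()


ace = host_to_ace("r\u00e9sum\u00e9.example.org")
equivalent = hosts_equivalent("r\u00e9sum\u00e9.example.org",
                              "xn--rsum-bpad.example.org")
```

A host that fails validation raises `UnicodeError` from the codec, which corresponds to the statement above that an IRI containing an invalid IDN cannot be resolved.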

Other scheme-specific normalizations are possible.

其他特定于方案的规范化也是可能的。

5.3.4. Protocol-Based Normalization
5.3.4. 基于协议的规范化

Substantial effort to reduce the incidence of false negatives is often cost-effective for web spiders. Consequently, they implement even more aggressive techniques in IRI comparison. For example, if they observe that an IRI such as

对于网络蜘蛛来说,为减少误报率所做的大量努力通常是具有成本效益的。因此,他们在IRI比较中实施了更具攻击性的技术。例如,如果他们观察到IRI,如

      http://example.com/data
        

redirects to an IRI differing only in the trailing slash

重定向到仅在尾部斜杠上不同的IRI

      http://example.com/data/
        

they will likely regard the two as equivalent in the future. This kind of technique is only appropriate when equivalence is clearly indicated by both the result of accessing the resources and the common conventions of their scheme's dereference algorithm (in this case, use of redirection by HTTP origin servers to avoid problems with relative references).

在未来,他们可能会将两者视为等价物。这种技术只有在访问资源的结果和其方案的解引用算法的常见约定都清楚地表明了等价性时才适用(在本例中,HTTP源服务器使用重定向来避免相对引用的问题)。

6. Use of IRIs
6. 虹膜的使用
6.1. Limitations on UCS Characters Allowed in IRIs
6.1. IRIs中允许的UCS字符限制

This section discusses limitations on characters and character sequences usable for IRIs beyond those given in section 2.2 and section 4.1. The considerations in this section are relevant when IRIs are created and when URIs are converted to IRIs.

本节讨论了除第2.2节和第4.1节中给出的字符和字符序列外,可用于IRIs的字符和字符序列的限制。本节中的注意事项与创建IRI以及将URI转换为IRI相关。

a. The repertoire of characters allowed in each IRI component is limited by the definition of that component. For example, the definition of the scheme component does not allow characters beyond US-ASCII.

a. 每个IRI组件中允许的字符集受该组件定义的限制。例如,scheme组件的定义不允许使用US-ASCII以外的字符。

(Note: In accordance with URI practice, generic IRI software cannot and should not check for such limitations.)

(注意:根据URI惯例,通用IRI软件不能也不应该检查此类限制。)

b. The UCS contains many areas of characters for which there are strong visual look-alikes. Because of the likelihood of transcription errors, these also should be avoided. This includes the full-width equivalents of Latin characters, half-width Katakana characters for Japanese, and many others. It also includes many look-alikes of "space", "delims", and "unwise", characters excluded in [RFC3491].

b. UCS包含许多具有强烈视觉相似性的字符区域。由于转录错误的可能性,这些也应该避免。这包括拉丁字符的全宽等价物、日语的半宽片假名字符以及许多其他字符。它还包括许多类似于[RFC3491]中排除的“space”、“delims”和“unwise”的字符。

Additional information is available from [UNIXML]. [UNIXML] is written in the context of running text rather than in that of identifiers. Nevertheless, it discusses many of the categories of characters not appropriate for IRIs.

其他信息可从[UNIXML]获得。[UNIXML]是在运行文本的上下文中编写的,而不是在标识符的上下文中编写的。然而,它讨论了许多不适合IRIs的字符类别。

6.2. Software Interfaces and Protocols
6.2. 软件接口和协议

Although an IRI is defined as a sequence of characters, software interfaces for URIs typically function on sequences of octets or other kinds of code units. Thus, software interfaces and protocols MUST define which character encoding is used.

尽管IRI定义为字符序列,但URI的软件接口通常在八位字节序列或其他类型的代码单元上运行。因此,软件接口和协议必须定义使用哪种字符编码。

Intermediate software interfaces between IRI-capable components and URI-only components MUST map the IRIs per section 3.1, when transferring from IRI-capable to URI-only components. This mapping SHOULD be applied as late as possible. It SHOULD NOT be applied between components that are known to be able to handle IRIs.

当从支持IRI的组件传输到仅限URI的组件时,支持IRI的组件和仅限URI的组件之间的中间软件接口必须根据第3.1节映射IRI。应尽可能晚地应用此映射。它不应该应用于已知能够处理IRIs的组件之间。

6.3. Format of URIs and IRIs in Documents and Protocols
6.3. 文件和协议中URI和IRI的格式

Document formats that transport URIs may have to be upgraded to allow the transport of IRIs. In cases where the document as a whole has a native character encoding, IRIs MUST also be encoded in this character encoding and converted accordingly by a parser or interpreter. IRI characters not expressible in the native character encoding SHOULD be escaped by using the escaping conventions of the document format if such conventions are available. Alternatively, they MAY be percent-encoded according to section 3.1. For example, in HTML or XML, numeric character references SHOULD be used. If a document as a whole has a native character encoding and that character encoding is not UTF-8, then IRIs MUST NOT be placed into the document in the UTF-8 character encoding.

传输URI的文档格式可能必须升级以允许传输IRI。在文档作为一个整体具有本机字符编码的情况下,IRIs也必须以这种字符编码进行编码,并由解析器或解释器进行相应的转换。本机字符编码中无法表达的IRI字符应使用文档格式的转义约定进行转义(如果此类约定可用)。或者,可根据第3.1节对其进行百分比编码。例如,在HTML或XML中,应该使用数字字符引用。如果文档作为一个整体具有本机字符编码,且该字符编码不是UTF-8,则不得将IRIs放入UTF-8字符编码的文档中。
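For HTML or XML documents, the escaping rule above can be sketched with the `xmlcharrefreplace` error handler of Python's standard codecs: characters that the document's native encoding can express are written directly in that encoding, and the rest become numeric character references. The function name is hypothetical, and markup-significant characters such as "&" still need the usual escaping, which this sketch does not perform.

```python
def embed_iri(iri: str, document_encoding: str) -> bytes:
    """Encode an IRI for a document with the given native encoding."""
    return iri.encode(document_encoding, errors="xmlcharrefreplace")


iri = "http://example.org/ros\u00e9"
in_latin1 = embed_iri(iri, "iso-8859-1")  # e-acute is expressible directly
in_ascii = embed_iri(iri, "ascii")        # e-acute needs a character reference
```

In the ISO-8859-1 document the e-acute is the single octet 0xE9, not its two-octet UTF-8 form, which matches the requirement that IRIs not be placed into a non-UTF-8 document in UTF-8.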

Note: Some formats already accommodate IRIs, although they use different terminology. HTML 4.0 [HTML4] defines the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications based upon them allow IRIs. Also, it is expected that all relevant new W3C formats and protocols will be required to handle IRIs [CharMod].

注意:一些格式已经适应了IRIs,尽管它们使用不同的术语。HTML4.0[HTML4]将从IRIs到URI的转换定义为错误避免行为。XML1.0[XML1]、XLink[XLink]、XMLSchema[XMLSchema]以及基于它们的规范允许IRIs。此外,预计所有相关的新W3C格式和协议都需要处理IRIs[CharMod]。

6.4. Use of UTF-8 for Encoding Original Characters
6.4. 使用UTF-8编码原始字符

This section discusses details and gives examples for point c) in section 1.2. To be able to use IRIs, the URI corresponding to the IRI in question has to encode original characters into octets by using UTF-8. This can be specified for all URIs of a URI scheme or can apply to individual URIs for schemes that do not specify how to encode original characters. It can apply to the whole URI, or only to some part. For background information on encoding characters into URIs, see also section 2.5 of [RFC3986].

本节讨论了细节,并给出了第1.2节第c)点的示例。为了能够使用IRI,与所讨论的IRI对应的URI必须使用UTF-8将原始字符编码为八位字节。这可以为URI方案的所有URI指定,也可以应用于未指定如何编码原始字符的方案的单个URI。它可以应用于整个URI,也可以仅应用于某个部分。有关将字符编码为URI的背景信息,请参见[RFC3986]第2.5节。

For new URI schemes, using UTF-8 is recommended in [RFC2718]. Examples where UTF-8 is already used are the URN syntax [RFC2141], IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, because the HTTP URL scheme does not specify how to encode original characters, only some HTTP URLs can have corresponding but different IRIs.

对于新的URI方案,[RFC2718]中建议使用UTF-8。已经使用UTF-8的示例包括URN语法[RFC2141]、IMAP URL[RFC2192]和POP URL[RFC2384]。另一方面,由于HTTP URL方案没有指定如何编码原始字符,因此只有一些HTTP URL可以具有相应但不同的IRI。

For example, for a document with a URI of "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to construct a corresponding IRI (in XML notation, see section 1.4): "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-encoded representation of that character). On the other hand, for a document with a URI of "http://www.example.org/r%E9sum%E9.html", the percent-encoding octets cannot be converted to actual characters in an IRI, as the percent-encoding is not based on UTF-8.

"http://www.example.org/r%E9sum%E9.html“,无法将百分比编码八位字节转换为IRI中的实际字符,因为百分比编码不基于UTF-8。”。

This means that for most URI schemes, there is no need to upgrade their scheme definition in order for them to work with IRIs. The main case where upgrading makes sense is when a scheme definition, or a particular component of a scheme, is strictly limited to the use of US-ASCII characters with no provision to include non-ASCII characters/octets via percent-encoding, or if a scheme definition currently uses highly scheme-specific provisions for the encoding of non-ASCII characters. An example of this is the mailto: scheme [RFC2368].

这意味着,对于大多数URI方案,无需升级其方案定义即可使用IRIs。升级有意义的主要情况是,方案定义或方案的特定组件严格限于使用US-ASCII字符,没有规定通过百分比编码包括非ASCII字符/八位字节,或者,如果方案定义当前对非ASCII字符的编码使用高度特定于方案的规定。这方面的一个例子是mailto:scheme[RFC2368]。
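The two "r&#xE9;sum&#xE9;" URIs discussed earlier in this section can be told apart mechanically: a run of percent-encoded octets is turned into characters only if it is valid UTF-8. The following is a simplified sketch with a hypothetical function name; it keeps US-ASCII octets percent-encoded and ignores the further exclusions that section 3.2 places on URI-to-IRI conversion.

```python
import re

_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri_sketch(uri: str) -> str:
    def convert(match: re.Match) -> str:
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = octets.decode("utf-8")      # strict: rejects invalid UTF-8
        except UnicodeDecodeError:
            return match.group(0)              # not UTF-8 based; leave as-is
        # keep US-ASCII characters percent-encoded, expose the rest
        return "".join(
            ch if ord(ch) > 0x7F else "%{:02X}".format(ord(ch)) for ch in text
        )
    return _PCT_RUN.sub(convert, uri)


utf8_based = uri_to_iri_sketch("http://www.example.org/r%C3%A9sum%C3%A9.html")
latin1_based = uri_to_iri_sketch("http://www.example.org/r%E9sum%E9.html")
```

The first URI yields an IRI containing two actual e-acute characters. The second is returned unchanged, because the single octet 0xE9 is not a valid UTF-8 sequence.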

This specification does not upgrade any scheme specifications in any way; this has to be done separately. Also, note that there is no such thing as an "IRI scheme"; all IRIs use URI schemes, and all URI schemes can be used with IRIs, even though in some cases only by using URIs directly as IRIs, without any conversion.

URI schemes can impose restrictions on the syntax of scheme-specific URIs; i.e., URIs that are admissible under the generic URI syntax [RFC3986] may not be admissible due to narrower syntactic constraints imposed by a URI scheme specification. URI scheme definitions cannot broaden the syntactic restrictions of the generic URI syntax; otherwise, it would be possible to generate URIs that satisfied the scheme-specific syntactic constraints without satisfying the syntactic constraints of the generic URI syntax. However, additional syntactic constraints imposed by URI scheme specifications are applicable to IRI, as the corresponding URI resulting from the mapping defined in section 3.1 MUST be a valid URI under the syntactic restrictions of generic URI syntax and any narrower restrictions imposed by the corresponding URI scheme specification.

The requirement for the use of UTF-8 applies to all parts of a URI (with the potential exception of the ireg-name part; see section 3.1). However, it is possible that the capability of IRIs to represent a wide range of characters directly is used just in some parts of the IRI (or IRI reference). The other parts of the IRI may only contain US-ASCII characters, or they may not be based on UTF-8. They may be based on another character encoding, or they may directly encode raw binary data (see also [RFC2397]).

For example, it is possible to have a URI reference of "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the document name is encoded in iso-8859-1 based on server settings, but where the fragment identifier is encoded in UTF-8 according to [XPointer]. The IRI corresponding to the above URI would be (in XML notation) "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

Similar considerations apply to query parts. The functionality of IRIs (namely, to be able to include non-ASCII characters) can only be used if the query part is encoded in UTF-8.
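
The conversion behavior in the examples of this section can be illustrated with a minimal Python sketch. It is not part of this specification: the name uri_to_iri is hypothetical, and the sketch covers only the UTF-8 test, omitting the further exclusions of section 3.2. Runs of percent-encoded octets are converted only if they decode strictly as UTF-8, and octets in the US-ASCII range stay percent-encoded.

```python
import re
from urllib.parse import unquote_to_bytes

# One or more consecutive percent-encoded octets, such as "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri(uri: str) -> str:
    """Convert percent-encoded runs to characters only if they are UTF-8."""

    def convert(match):
        run = match.group(0)
        try:
            text = unquote_to_bytes(run).decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            return run  # e.g. iso-8859-1 octets: leave the run untouched
        # Re-encode US-ASCII characters so that "%2F" and similar keep
        # their percent-encoded form; only non-ASCII characters are exposed.
        return "".join(
            ch if ord(ch) >= 0x80 else "%%%02X" % ord(ch) for ch in text
        )

    return _PCT_RUN.sub(convert, uri)
```

Under these assumptions, "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9" keeps "%E9" in the path and yields the e-acute character only in the fragment. A run that mixes UTF-8 and non-UTF-8 octets is left entirely percent-encoded, which is a simplification of this sketch.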

6.5. Relative IRI References

Processing of relative IRI references against a base is handled straightforwardly; the algorithms of [RFC3986] can be applied directly, treating the characters additionally allowed in IRI references in the same way that unreserved characters are in URI references.
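
As an illustration only, the following Python sketch resolves relative IRI references with the standard library function urllib.parse.urljoin, which operates on character strings and passes non-ASCII characters through unchanged. The base and the relative references are made-up examples, and urljoin is assumed here to be a close enough approximation of the [RFC3986] resolution algorithm for these simple cases.

```python
from urllib.parse import urljoin

# Hypothetical base IRI and relative references containing U+00E9.
base = "http://www.example.org/docs/index.html"

# "../" climbs out of "docs/"; the non-ASCII characters are copied as-is.
resolved = urljoin(base, "../r\u00e9sum\u00e9.html")

# A fragment-only reference keeps the base path and replaces the fragment.
with_fragment = urljoin(base, "#r\u00e9sum\u00e9")

print(resolved)
print(with_fragment)
```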

7. URI/IRI Processing Guidelines (Informative)

This informative section provides guidelines for supporting IRIs in the same software components and operations that currently process URIs: Software interfaces that handle URIs, software that allows users to enter URIs, software that creates or generates URIs, software that displays URIs, formats and protocols that transport URIs, and software that interprets URIs. These may all require modification before functioning properly with IRIs. The considerations in this section also apply to URI references and IRI references.

7.1. URI/IRI Software Interfaces

Software interfaces that handle URIs, such as URI-handling APIs and protocols transferring URIs, need interfaces and protocol elements that are designed to carry IRIs.

In case the current handling in an API or protocol is based on US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as it is compatible with US-ASCII, is in accordance with the recommendations of [RFC2277], and makes converting to URIs easy. In any case, the API or protocol definition must clearly define the character encoding to be used.

The transfer from URI-only to IRI-capable components requires no mapping, although the conversion described in section 3.2 above may be performed. It is preferable not to perform this inverse conversion when there is a chance that this cannot be done correctly.

7.2. URI/IRI Entry

Some components allow users to enter URIs into the system by typing or dictation, for example. This software must be updated to allow for IRI entry.

A person viewing a visual representation of an IRI (as a sequence of glyphs, in some order, in some visual display) or hearing an IRI will use an entry method for characters in the user's language to input the IRI. Depending on the script and the input method used, this may be a more or less complicated process.

The process of IRI entry must ensure, as much as possible, that the restrictions defined in section 2.2 are met. This may be done by choosing appropriate input methods or variants/settings thereof, by appropriately converting the characters being input, by eliminating characters that cannot be converted, and/or by issuing a warning or error message to the user.

As an example of variant settings, input method editors for East Asian Languages usually allow the input of Latin letters and related characters in full-width or half-width versions. For IRI input, the input method editor should be set so that it produces half-width Latin letters and punctuation and full-width Katakana.

An input field primarily or solely used for the input of URIs/IRIs may allow the user to view an IRI as it is mapped to a URI. Places where the input of IRIs is frequent may provide the possibility for viewing an IRI as mapped to a URI. This will help users when some of the software they use does not yet accept IRIs.

An IRI input component interfacing to components that handle URIs, but not IRIs, must map the IRI to a URI before passing it to these components.
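
A simplified sketch of such a mapping step is shown below in Python. The function name iri_to_uri is hypothetical, and the sketch implements only the central step of section 3.1: each non-ASCII character is encoded in UTF-8 and each resulting octet is percent-encoded. Normalization of the input and any special treatment of the ireg-name part are deliberately left out.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character."""
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            pieces.append(ch)  # US-ASCII characters are copied unchanged
        else:
            # One "%XX" triplet per UTF-8 octet of the character.
            pieces.extend("%%%02X" % octet for octet in ch.encode("utf-8"))
    return "".join(pieces)


print(iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
```

An input component could call such a function just before handing the identifier to a URI-only component, and could use the same result to show the user the URI form.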

For the input of IRIs with right-to-left characters, please see section 4.3.

7.3. URI/IRI Transfer between Applications

Many applications, particularly mail user agents, try to detect URIs appearing in plain text. For this, they use some heuristics based on URI syntax. They then allow the user to click on such URIs and retrieve the corresponding resource in an appropriate (usually scheme-dependent) application.

Such applications have to be upgraded to use the IRI syntax as a base for heuristics. In particular, a non-ASCII character should not be taken as the indication of the end of an IRI. Such applications also have to make sure that they correctly convert the detected IRI from the character encoding of the document or application where the IRI appears to the character encoding used by the system-wide IRI invocation mechanism, or to a URI (according to section 3.1) if the system-wide invocation mechanism only accepts URIs.
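
The difference between a URI-based and an IRI-based heuristic can be sketched in Python as follows. Both regular expressions are illustrative assumptions, not patterns taken from any particular mail user agent: the first accepts only printable US-ASCII after the scheme, while the second stops only at white space and a few delimiters.

```python
import re

# URI-style heuristic: stops at the first non-ASCII character.
ASCII_ONLY = re.compile(r'https?://[!-~]+')

# IRI-style heuristic: non-ASCII characters do not end the match.
IRI_AWARE = re.compile(r'https?://[^\s<>"]+')

text = "See http://www.example.org/r\u00e9sum\u00e9.html for details"

print(ASCII_ONLY.findall(text))  # the match is cut off before U+00E9
print(IRI_AWARE.findall(text))   # the match covers the whole IRI
```

A production heuristic would also have to deal with trailing punctuation and with the character encoding of the surrounding document, which this sketch does not attempt.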

The clipboard is another frequently used way to transfer URIs and IRIs from one application to another. On most platforms, the clipboard is able to store and transfer text in many languages and scripts. Correctly used, the clipboard transfers characters, not bytes, which will do the right thing with IRIs.

7.4. URI/IRI Generation

Systems that offer resources through the Internet, where those resources have logical names, sometimes automatically generate URIs for the resources they offer. For example, some HTTP servers can generate a directory listing for a file directory and then respond to the generated URIs with the files.

Many legacy character encodings are in use in various file systems. Many currently deployed systems do not transform the local character representation of the underlying system before generating URIs.

For maximum interoperability, systems that generate resource identifiers should make the appropriate transformations. For example, if a file system contains a file named "r&#xE9;sum&#xE9;.html", a server should expose this as "r%C3%A9sum%C3%A9.html" in a URI, which allows use of "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is kept in a character encoding other than UTF-8.
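
The transformation can be sketched in Python as below. The function name and the choice of iso-8859-1 as the local encoding are assumptions made for the example; a real server would take the encoding from its file system configuration. The file name is first decoded from the local encoding into characters and then percent-encoded via UTF-8 with urllib.parse.quote.

```python
from urllib.parse import quote


def file_name_to_uri_segment(raw_name: bytes, local_encoding: str) -> str:
    """Expose a locally encoded file name as a UTF-8 based URI path segment."""
    name = raw_name.decode(local_encoding)  # local octets -> characters
    # quote() encodes str input as UTF-8 by default; safe="" also escapes "/"
    # because the result is meant to be a single path segment.
    return quote(name, safe="")


# File name stored in iso-8859-1: 0xE9 is the e-acute character.
print(file_name_to_uri_segment(b"r\xe9sum\xe9.html", "iso-8859-1"))
```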

This recommendation particularly applies to HTTP servers. For FTP servers, similar considerations apply; see [RFC2640].

7.5. URI/IRI Selection

In some cases, resource owners and publishers have control over the IRIs used to identify their resources. This control is mostly exercised by controlling the resource names, such as file names, directly.

In these cases, it is recommended to avoid choosing IRIs that are easily confused. For example, for US-ASCII, the lower-case ell ("l") is easily confused with the digit one ("1"), and the upper-case oh ("O") is easily confused with the digit zero ("0"). Publishers should avoid confusing users with "br0ken" or "1ame" identifiers.

Outside the US-ASCII repertoire, there are many more opportunities for confusion; a complete set of guidelines is too lengthy to include here. As long as names are limited to characters from a single script, native writers of a given script or language will know best when ambiguities can appear, and how they can be avoided. What may look ambiguous to a stranger may be completely obvious to the average native user. On the other hand, in some cases, the UCS contains variants for compatibility reasons; for example, for typographic purposes. These should be avoided wherever possible. Although there may be exceptions, newly created resource names should generally be in NFKC [UTR15] (which means that they are also in NFC).

As an example, the UCS contains the "fi" ligature at U+FB01 for compatibility reasons. Wherever possible, IRIs should use the two letters "f" and "i" rather than the "fi" ligature. An example where the latter may be used is in the query part of an IRI for an explicit search for a word written containing the "fi" ligature.
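
A publisher's tool could check candidate resource names along the following lines. This is a sketch in Python using the standard unicodedata module; the helper name is_nfkc is hypothetical. A name is accepted only if NFKC normalization leaves it unchanged.

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """Return True if the name is already in Normalization Form KC."""
    return unicodedata.normalize("NFKC", name) == name


ligature_name = "\ufb01le.html"  # starts with the U+FB01 "fi" ligature
print(is_nfkc(ligature_name))                         # not in NFKC
print(unicodedata.normalize("NFKC", ligature_name))   # two letters "f", "i"
```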

In certain cases, there is a chance that characters from different scripts look the same. The best known example is the similarity of the Latin "A", the Greek "Alpha", and the Cyrillic "A". To avoid such cases, only IRIs should be created where all the characters in a single component are used together in a given language. This usually means that all of these characters will be from the same script, but there are languages that mix characters from different scripts (such as Japanese). This is similar to the heuristics used to distinguish between letters and numbers in the examples above. Also, for Latin, Greek, and Cyrillic, using lowercase letters results in fewer ambiguities than using uppercase letters would.
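
One crude way to flag components that mix scripts is sketched below in Python. It is an assumption of this example, not a method defined here, that the first word of a character's Unicode name (such as "LATIN" or "CYRILLIC") is an acceptable stand-in for its script. The helper name scripts_in is hypothetical.

```python
import unicodedata


def scripts_in(component: str) -> set:
    """Collect the leading word of the Unicode name of each letter."""
    found = set()
    for ch in component:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return found


# Latin "A", Greek capital Alpha, and Cyrillic capital A look alike.
print(scripts_in("A\u0391\u0410"))
```

A component for which more than one label is reported deserves a closer look. As noted above, some languages legitimately mix scripts, so such a check can only produce a warning, never a verdict.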

7.6. Display of URIs/IRIs

In situations where the rendering software is not expected to display non-ASCII parts of the IRI correctly using the available layout and font resources, these parts should be percent-encoded before being displayed.

For display of Bidi IRIs, please see section 4.1.

7.7. Interpretation of URIs and IRIs

Software that interprets IRIs as the names of local resources should accept IRIs in multiple forms and convert and match them with the appropriate local resource names.

First, multiple representations include both IRIs in the native character encoding of the protocol and also their URI counterparts.

Second, it may include URIs constructed based on character encodings other than UTF-8. These URIs may be produced by user agents that do not conform to this specification and that use legacy character encodings to convert non-ASCII characters to URIs. Whether this is necessary, and what character encodings to cover, depends on a number of factors, such as the legacy character encodings used locally and the distribution of various versions of user agents. For example, software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8.
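
A server-side fallback of this kind might look like the following Python sketch. The function name, the return convention, and the default list of legacy encodings are assumptions chosen for illustration. UTF-8 is always tried first, with strict decoding, and legacy encodings are tried only when that fails.

```python
from urllib.parse import unquote_to_bytes


def decode_resource_name(encoded, legacy_encodings=("shift_jis", "euc_jp")):
    """Return (name, encoding) for the first encoding that decodes cleanly."""
    raw = unquote_to_bytes(encoded)  # undo percent-encoding, keep octets
    for encoding in ("utf-8", *legacy_encodings):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None  # no configured encoding fits


print(decode_resource_name("r%C3%A9sum%C3%A9.html"))
print(decode_resource_name("%93%FA%96%7B"))  # Shift_JIS octets
```

The order of the legacy encodings matters, because an octet sequence can be valid in more than one of them. The sketch simply takes the first match.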

Third, it may include additional mappings to be more user-friendly and robust against transmission errors. These would be similar to how some servers currently treat URIs as case insensitive or perform additional matching to account for spelling errors. For characters beyond the US-ASCII repertoire, this may, for example, include ignoring the accents on received IRIs or resource names. Please note that such mappings, including case mappings, are language dependent.

It can be difficult to identify a resource unambiguously if too many mappings are taken into consideration. However, percent-encoded and not percent-encoded parts of IRIs can always be clearly distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes the potential for collisions lower than it may seem at first.

7.8. Upgrading Strategy

Where this recommendation places further constraints on software for which many instances are already deployed, it is important to introduce upgrades carefully and to be aware of the various interdependencies.

If IRIs cannot be interpreted correctly, they should not be created, generated, or transported. This suggests that upgrading URI interpreting software to accept IRIs should have highest priority.

On the other hand, a single IRI is interpreted only by a single or very few interpreters that are known in advance, although it may be entered and transported very widely.

Therefore, IRIs benefit most from a broad upgrade of software to be able to enter and transport IRIs. However, before an individual IRI is published, care should be taken to upgrade the corresponding interpreting software in order to cover the forms expected to be received by various versions of entry and transport software.

The upgrade of generating software to generate IRIs instead of using a local character encoding should happen only after the service is upgraded to accept IRIs. Similarly, IRIs should only be generated when the service accepts IRIs and the intervening infrastructure and protocol are known to transport them safely.

Software converting from URIs to IRIs for display should be upgraded only after upgraded entry software has been widely deployed to the population that will see the displayed result.

Where there is a free choice of character encodings, it is often possible to reduce the effort and dependencies for upgrading to IRIs by using UTF-8 rather than another encoding. For example, when a new file-based Web server is set up, using UTF-8 as the character encoding for file names will make the transition to IRIs easier. Likewise, when a new Web form is set up using UTF-8 as the character encoding of the form page, the returned query URIs will use UTF-8 as the character encoding (unless the user, for whatever reason, changes the character encoding) and will therefore be compatible with IRIs.

These recommendations, when taken together, will allow for the extension from URIs to IRIs in order to handle characters other than US-ASCII while minimizing interoperability problems. For considerations regarding the upgrade of URI scheme definitions, see section 6.4.

8. Security Considerations

The security considerations discussed in [RFC3986] also apply to IRIs. In addition, the following issues require particular care for IRIs.

Incorrect encoding or decoding can lead to security problems. In particular, some UTF-8 decoders do not check against overlong byte sequences. As an example, a "/" is encoded with the byte 0x2F both in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly interpret the sequence 0xC0 0xAF as a "/". A sequence such as "%C0%AF.." may pass some security tests and then be interpreted as "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion and checking are not done in the right order, and/or if reserved characters and unreserved characters are not clearly distinguished.
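
The ordering issue can be made concrete with a small Python sketch. The function name is_safe_path and its acceptance rule are assumptions for illustration only. The point shown is the sequence of steps: percent-encoding is undone first, the octets are then decoded with a strict UTF-8 decoder that rejects overlong forms, and the check for ".." segments runs last, on the decoded text.

```python
from urllib.parse import unquote_to_bytes


def is_safe_path(encoded_path: str) -> bool:
    """Reject paths with invalid UTF-8 or with ".." segments after decoding."""
    raw = unquote_to_bytes(encoded_path)  # 1. undo percent-encoding
    try:
        text = raw.decode("utf-8", errors="strict")  # 2. strict UTF-8 decode
    except UnicodeDecodeError:
        return False  # covers overlong sequences such as 0xC0 0xAF
    return ".." not in text.split("/")  # 3. check the decoded form


print(is_safe_path("%C0%AF.."))
print(is_safe_path("docs/r%C3%A9sum%C3%A9.html"))
```

Treating a decoded "%2F" like a literal "/" is a conservative choice for this check. It is not a statement about how a path should otherwise be interpreted.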

There are various ways in which "spoofing" can occur with IRIs. "Spoofing" means that somebody may add a resource name that looks the same or similar to the user, but that points to a different resource. The added resource may pretend to be the real resource by looking very similar but may contain all kinds of changes that may be difficult to spot and that can cause all kinds of problems. Most spoofing possibilities for IRIs are extensions of those for URIs.

Spoofing can occur for various reasons. First, a user's normalization expectations or actual normalization when entering an IRI or transcoding an IRI from a legacy character encoding do not match the normalization used on the server side. Conceptually, this is no different from the problems surrounding the use of case-insensitive web servers. For example, a popular web page with a mixed-case name ("http://big.example.com/PopularPage.html") might be "spoofed" by someone who is able to create "http://big.example.com/popularpage.html". However, the use of unnormalized character sequences, and of additional mappings for user convenience, may increase the chance for spoofing. Protocols and servers that allow the creation of resources with names that are not normalized are particularly vulnerable to such attacks. This is an inherent security problem of the relevant protocol, server, or resource and is not specific to IRIs, but it is mentioned here for completeness.
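
A server that lets users create resources could reduce this risk by comparing names under a common key before accepting a new one. The Python sketch below is one possible approach under stated assumptions: the helper name collision_key is hypothetical, and the combination of NFC normalization with case folding is chosen for the example rather than prescribed here.

```python
import unicodedata


def collision_key(name: str) -> str:
    """Key under which differently normalized or cased names compare equal."""
    return unicodedata.normalize("NFC", name).casefold()


precomposed = "r\u00e9sum\u00e9"    # e-acute as a single character
decomposed = "re\u0301sume\u0301"   # "e" followed by combining acute accent

print(collision_key(precomposed) == collision_key(decomposed))
print(collision_key("PopularPage.html") == collision_key("popularpage.html"))
```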

Spoofing can occur in various IRI components, such as the domain name part or a path part. For considerations specific to the domain name part, see [RFC3491]. For the path part, administrators of sites that allow independent users to create resources in the same sub area may have to be careful to check for spoofing.

Spoofing can occur because in the UCS many characters look very similar. Details are discussed in Section 7.5. Again, this is very similar to spoofing possibilities on US-ASCII, e.g., using "br0ken" or "1ame" URIs.

Spoofing can occur when URIs with percent-encodings based on various character encodings are accepted to deal with older user agents. In some cases, particularly for Latin-based resource names, this is usually easy to detect because UTF-8-encoded names, when interpreted and viewed as legacy character encodings, produce mostly garbage.

When concurrently used character encodings have a similar structure but there are no characters that have exactly the same encoding, detection is more difficult.

Spoofing can occur with bidirectional IRIs, if the restrictions in section 4.2 are not followed. The same visual representation may be interpreted as different logical representations, and vice versa. It is also very important that a correct Unicode bidirectional implementation be used.

9. Acknowledgements

We would like to thank Larry Masinter for his work as coauthor of many earlier versions of this document (draft-masinter-url-i18n-xx).

The discussion on the issue addressed here started a long time ago. There was a thread in the HTML working group in August 1995 (under the topic of "Globalizing URIs") and in the www-international mailing list in July 1996 (under the topic of "Internationalization and URLs"), and there were ad-hoc meetings at the Unicode conferences in September 1995 and September 1997.

Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris Haynes, Walter Underwood, and many others for help with understanding the issues and possible solutions, and with getting the details right.

This document is a product of the Internationalization Working Group (I18N WG) of the World Wide Web Consortium (W3C). Thanks to the members of the W3C I18N Working Group and Interest Group for their contributions and their work on [CharMod]. Thanks also go to the members of many other W3C Working Groups for adopting IRIs, and to the members of the Montreal IAB Workshop on Internationalization and Localization for their review.

10. References
10.1. Normative References

[ASCII] American National Standards Institute, "Coded Character Set -- 7-bit American Standard Code for Information Interchange", ANSI X3.4, 1986.

[ISO10646] International Organization for Standardization, "ISO/IEC 10646:2003: Information Technology - Universal Multiple-Octet Coded Character Set (UCS)", ISO Standard 10646, December 2003.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997.

[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, March 2003.

[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, March 2003.

[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, November 2003.

[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, January 2005.

[UNI9] Davis, M., "The Bidirectional Algorithm", Unicode Standard Annex #9, March 2004, <http://www.unicode.org/reports/tr9/tr9-13.html>.

[UNIV4] The Unicode Consortium, "The Unicode Standard, Version 4.0.1, defined by: The Unicode Standard, Version 4.0 (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1/)", March 2004.

[UTR15] Davis, M. and M. Duerst, "Unicode Normalization Forms", Unicode Standard Annex #15, April 2003, <http://www.unicode.org/unicode/reports/tr15/tr15-23.html>.

10.2. Informative References

[BidiEx] "Examples of bidirectional IRIs", <http://www.w3.org/International/iri-edit/BidiExamples>.

[CharMod] Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T. Texin, "Character Model for the World Wide Web: Resource Identifiers", World Wide Web Consortium Candidate Recommendation, November 2004, <http://www.w3.org/TR/charmod-resid>.

[Duerst97] Duerst, M., "The Properties and Promises of UTF-8", Proc. 11th International Unicode Conference, San Jose, September 1997, <http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf>.

[Gettys] Gettys, J., "URI Model Consequences", <http://www.w3.org/DesignIssues/ModelConsequences>.

[HTML4] Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01 Specification", World Wide Web Consortium Recommendation, December 1999, <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

[RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", RFC 2045, November 1996.

[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., Atkinson, R., Crispin, M., and P. Svanberg, "The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996", RFC 2130, April 1997.

[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.

[RFC2192] Newman, C., "IMAP URL Scheme", RFC 2192, September 1997.

[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998.

[RFC2368] Hoffman, P., Masinter, L., and J. Zawinski, "The mailto URL scheme", RFC 2368, July 1998.

[RFC2384] Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifiers (URI): Generic Syntax", RFC 2396, August 1998.

[RFC2397] Masinter, L., "The "data" URL scheme", RFC 2397, August 1998.

[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[RFC2640] Curtin, B., "Internationalization of the File Transfer Protocol", RFC 2640, July 1999.

[RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke, "Guidelines for new URL Schemes", RFC 2718, November 1999.

[UNIXML] Duerst, M. and A. Freytag, "Unicode in XML and other Markup Languages", Unicode Technical Report #20, World Wide Web Consortium Note, June 2003, <http://www.w3.org/TR/unicode-xml/>.

[XLink] DeRose, S., Maler, E., and D. Orchard, "XML Linking Language (XLink) Version 1.0", World Wide Web Consortium Recommendation, June 2001, <http://www.w3.org/TR/xlink/#link-locators>.

[XML1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and F. Yergeau, "Extensible Markup Language (XML) 1.0 (Third Edition)", World Wide Web Consortium Recommendation, February 2004, <http://www.w3.org/TR/REC-xml#sec-external-ent>.

[XMLNamespace] Bray, T., Hollander, D., and A. Layman, "Namespaces in XML", World Wide Web Consortium Recommendation, January 1999, <http://www.w3.org/TR/REC-xml-names>.

[XMLSchema] Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes", World Wide Web Consortium Recommendation, May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.

[XPointer] Grosso, P., Maler, E., Marsh, J., and N. Walsh, "XPointer Framework", World Wide Web Consortium Recommendation, March 2003, <http://www.w3.org/TR/xptr-framework/#escaping>.

Appendix A. Design Alternatives

This section briefly summarizes major design alternatives and the reasons why they were not chosen.

Appendix A.1. New Scheme(s)

Introducing new schemes (for example, httpi:, ftpi:,...) or a new metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:, i:ftp:,...) was proposed to make IRI-to-URI conversion scheme dependent or to distinguish between percent-encodings resulting from IRI-to-URI conversion and percent-encodings from legacy character encodings.

New schemes are not needed to distinguish URIs from true IRIs (i.e., IRIs that contain non-ASCII characters). The benefit of being able to detect the origin of percent-encodings is marginal, as UTF-8 can be detected with very high reliability. Deploying new schemes is extremely hard, so not requiring new schemes for IRIs makes deployment of IRIs vastly easier. Making conversion scheme dependent is highly inadvisable and would be encouraged by separate schemes for IRIs. Using a uniform convention for conversion from IRIs to URIs makes IRI implementation orthogonal to the introduction of actual new schemes.
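
The claim that UTF-8 can be detected with very high reliability can be illustrated with a minimal Python sketch. It assumes that detection means percent-decoding a URI to octets and attempting a strict UTF-8 decode. The function name looks_like_utf8 and the example URIs are illustrative and not part of this specification. A URI containing only US-ASCII also passes, since US-ASCII is a subset of UTF-8.

```python
from urllib.parse import unquote_to_bytes


def looks_like_utf8(uri: str) -> bool:
    """Return True if the percent-decoded octets of `uri` are valid UTF-8."""
    octets = unquote_to_bytes(uri)  # "%C3%BC" -> b"\xc3\xbc"
    try:
        octets.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


# Octets produced by IRI-to-URI conversion (UTF-8) pass the check.
print(looks_like_utf8("http://www.example.org/D%C3%BCrst"))
# A lone 0xFC octet, as ISO-8859-1 would produce for u-umlaut, does not.
print(looks_like_utf8("http://www.example.org/D%FCrst"))
```

Legacy encodings rarely produce octet sequences that happen to be well-formed UTF-8, which is why a check of this kind seldom misclassifies them.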

Appendix A.2. Character Encodings Other Than UTF-8

At an early stage, UTF-7 was considered as an alternative to UTF-8 when IRIs are converted to URIs. UTF-7 would not have needed percent-encoding and in most cases would have been shorter than percent-encoded UTF-8.

Using UTF-8 avoids a double layering and overloading of the use of the "+" character. UTF-8 is fully compatible with US-ASCII and has therefore been recommended by the IETF, and is being used widely.

UTF-7 has never been used much and is now clearly being discouraged. Requiring implementations to convert from UTF-8 to UTF-7 and back would be an additional implementation burden.
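
The trade-off described in this section can be seen in a small Python sketch. It assumes Python's built-in "utf-7" codec as a stand-in for the UTF-7 alternative that was considered. The sample string is illustrative only.

```python
from urllib.parse import quote

name = "D\u00fcrst"  # "Duerst" written with u-umlaut; an illustrative string

utf7_form = name.encode("utf-7").decode("ascii")
utf8_form = quote(name)  # quote() percent-encodes the UTF-8 octets by default

print(utf7_form, len(utf7_form))
print(utf8_form, len(utf8_form))

# UTF-7 introduces its base64 runs with "+", a character that already has
# other uses in URIs (for example, a space in form-encoded query strings).
print("+" in utf7_form)

# Both forms decode back to the same characters.
assert utf7_form.encode("ascii").decode("utf-7") == name
```

The UTF-7 form is shorter here, as the text notes, but it relies on "+" as a shift character, which is the overloading that UTF-8 avoids.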

Appendix A.3. New Encoding Convention

Instead of using the existing percent-encoding convention of URIs, which is based on octets, the idea was to create a new encoding convention; for example, to use "%u" to introduce UCS code points.

Using the existing octet-based percent-encoding mechanism does not need an upgrade of the URI syntax and does not need corresponding server upgrades.
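
The difference between the two conventions can be sketched in Python as follows. The "%u" form shown is hypothetical: this document says only that "%u" would introduce UCS code points, so the four-hex-digit layout used below is an assumption made for illustration.

```python
from urllib.parse import quote, unquote

char = "\u00e9"  # e-acute, U+00E9

# Octet-based convention: each UTF-8 octet is written as %HH.
octet_form = quote(char)
# Hypothetical "%u" convention: the code point itself is written out.
code_point_form = "%u{:04X}".format(ord(char))

print(octet_form)
print(code_point_form)

# An existing octet-based decoder handles the first form unchanged ...
print(unquote(octet_form) == char)
# ... but has no rule for "%u", so it leaves the second form as it found it.
print(unquote(code_point_form) == code_point_form)
```

The second result shows why a new convention would have required upgrading deployed URI processors, whereas the octet-based form works with them as they are.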

Appendix A.4. Indicating Character Encodings in the URI/IRI

Some proposals suggested indicating the character encodings used in a URI or IRI with some new syntactic convention in the URI itself, similar to the "charset" parameter for e-mails and Web pages. As an example, the label in square brackets in "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the following "&#xE9;" had to be interpreted as iso-8859-1.

If UTF-8 is used exclusively, an upgrade to the URI syntax is not needed. Exclusive use of UTF-8 also avoids potentially multiple labels that would have to be copied correctly in all cases, even on the side of a bus or on a napkin, which would lead to usability problems (and be prohibitively annoying). Exclusively using UTF-8 also reduces transcoding errors and confusion.
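
The ambiguity that such labels were meant to resolve can be shown with a short Python sketch. The percent-encoded strings and the choice of iso-8859-5 as a second encoding are illustrative assumptions, not taken from the proposals themselves.

```python
from urllib.parse import unquote

escaped = "ros%E9"  # a single percent-encoded octet, 0xE9, with no label

# The same octet yields different characters under different encodings,
# which is why a per-URI label would otherwise be needed.
as_latin1 = unquote(escaped, encoding="iso-8859-1")    # ends in e-acute (U+00E9)
as_cyrillic = unquote(escaped, encoding="iso-8859-5")  # ends in shcha (U+0449)
print(as_latin1, as_cyrillic)

# When UTF-8 is the only convention, no label is needed.
print(unquote("ros%C3%A9"))
```

With a single agreed encoding there is nothing to label, so nothing extra has to be transcribed along with the identifier.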

Authors' Addresses

   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
                  possible, for example as "D&#252;rst" in XML and HTML.)
   World Wide Web Consortium
   5322 Endo
   Fujisawa, Kanagawa 252-8520
   Japan

   Phone: +81 466 49 1170
   Fax:   +81 466 49 1171
   EMail: duerst@w3.org
   URI:   http://www.w3.org/People/D%C3%BCrst/
          (Note: This is the percent-encoded form of an IRI.)

   Michel Suignard
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA 98052
   U.S.A.

   Phone: +1 425 882-8080
   EMail: michelsu@microsoft.com
   URI:   http://www.suignard.com
        

Full Copyright Statement

Copyright (C) The Internet Society (2005).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the IETF's procedures with respect to rights in IETF Documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.
