Internet Engineering Task Force (IETF)                       T. Sirainen
Request for Comments: 6203                                    March 2011
Category: Standards Track
ISSN: 2070-1721
        
Internet Engineering Task Force (IETF)                       T. Sirainen
Request for Comments: 6203                                    March 2011
Category: Standards Track
ISSN: 2070-1721
        

IMAP4 Extension for Fuzzy Search

模糊搜索的IMAP4扩展

Abstract

摘要

This document describes an IMAP protocol extension enabling a server to perform searches with inexact matching and assigning relevancy scores for matched messages.

本文档描述了IMAP协议扩展,使服务器能够执行不精确匹配的搜索,并为匹配的消息分配相关性分数。

Status of This Memo

关于下段备忘

This is an Internet Standards Track document.

这是一份互联网标准跟踪文件。

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741.

本文件是互联网工程任务组(IETF)的产品。它代表了IETF社区的共识。它已经接受了公众审查,并已被互联网工程指导小组(IESG)批准出版。有关互联网标准的更多信息,请参见RFC 5741第2节。

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc6203.

有关本文件当前状态、任何勘误表以及如何提供反馈的信息,请访问http://www.rfc-editor.org/info/rfc6203.

Copyright Notice

版权公告

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

版权所有(c)2011 IETF信托基金和确定为文件作者的人员。版权所有。

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

本文件受BCP 78和IETF信托有关IETF文件的法律规定的约束(http://trustee.ietf.org/license-info)自本文件出版之日起生效。请仔细阅读这些文件,因为它们描述了您对本文件的权利和限制。从本文件中提取的代码组件必须包括信托法律条款第4.e节中所述的简化BSD许可证文本,并提供简化BSD许可证中所述的无担保。

1. Introduction
1. 介绍

When humans perform searches in IMAP clients, they typically want to see the most relevant search results first. IMAP servers are able to do this in the most efficient way when they're free to internally decide how searches should match messages. This document describes a new SEARCH=FUZZY extension that provides such functionality.

当人类在IMAP客户端中执行搜索时,他们通常希望首先看到最相关的搜索结果。IMAP服务器可以在内部自由决定搜索应如何匹配消息时,以最有效的方式执行此操作。本文档描述了一个新的SEARCH=FUZZY扩展,它提供了这种功能。

2. Conventions Used in This Document
2. 本文件中使用的公约

In examples, "C:" indicates lines sent by a client that is connected to a server. "S:" indicates lines sent by the server to the client.

在示例中,“C:”表示连接到服务器的客户端发送的线路。“S:”表示服务器发送到客户端的行。

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [KEYWORDS].

本文件中的关键词“必须”、“不得”、“必需”、“应”、“不应”、“应”、“不应”、“建议”、“可”和“可选”应按照RFC 2119[关键词]中所述进行解释。

3. The FUZZY Search Key
3. 模糊搜索键

The FUZZY search key takes another search key as its argument. The server is allowed to perform all matching in an implementation-defined manner for this search key, including ignoring the active comparator as defined by [RFC5255]. Typically, this would be used to search for strings. For example:

模糊搜索键将另一个搜索键作为其参数。允许服务器以实现定义的方式对此搜索键执行所有匹配,包括忽略[RFC5255]定义的活动比较器。通常,这将用于搜索字符串。例如:

      C: A1 SEARCH FUZZY (SUBJECT "IMAP break")
      S: * SEARCH 1 5 10
      S: A1 OK Search completed.
        
      C: A1 SEARCH FUZZY (SUBJECT "IMAP break")
      S: * SEARCH 1 5 10
      S: A1 OK Search completed.
        

Besides matching messages with a subject of "IMAP break", the above search may also match messages with subjects "broken IMAP", "IMAP is broken", or anything else the server decides that might be a good match.

除了匹配主题为“IMAP断开”的邮件外,上述搜索还可能匹配主题为“断开IMAP”、“IMAP断开”的邮件,或者服务器认为可能是良好匹配的任何其他邮件。

This example does a fuzzy SUBJECT search, but a non-fuzzy FROM search:

此示例执行模糊主题搜索,但执行非模糊源搜索:

      C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com
      S: * SEARCH 1 4
      S: A2 OK Search completed.
        
      C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com
      S: * SEARCH 1 4
      S: A2 OK Search completed.
        

How the server handles multiple separate FUZZY search keys is implementation-defined.

服务器如何处理多个单独的模糊搜索键由实现定义。

Fuzzy search algorithms might change, or the results of the algorithms might be different from search to search, so that fuzzy searches with the same parameters might give different results for 1) the same user at different times, 2) different users (searches

模糊搜索算法可能会改变,或者不同搜索的算法结果可能不同,因此具有相同参数的模糊搜索可能会为1)同一用户在不同时间给出不同的结果,2)不同用户(搜索

executed simultaneously), or 3) different users (searches executed at different times). For example, a fuzzy search might adapt to a user's search habits in an attempt to give more relevant results (in a "learning" manner). Such differences can also occur because of operational decisions, such as load balancing. Clients asking for "fuzzy" really are requesting search results in a not-necessarily-deterministic way and need to give the user appropriate warning about that.

同时执行),或3)不同用户(在不同时间执行搜索)。例如,模糊搜索可能会适应用户的搜索习惯,试图给出更相关的结果(以“学习”的方式)。这种差异也可能由于操作决策(如负载平衡)而发生。请求“模糊”的客户实际上是以一种不一定确定的方式请求搜索结果,并且需要就此向用户发出适当的警告。

4. Relevancy Scores for Search Results
4. 搜索结果的相关性得分

Servers SHOULD assign a search relevancy score for each matched message when the FUZZY search key is given. Relevancy scores are given in the range 1-100, where 100 is the highest relevancy. The relevancy scores SHOULD use the full 1-100 range, so that clients can show them to users in a meaningful way, e.g., as a percentage value.

当给出模糊搜索键时,服务器应该为每个匹配的消息分配一个搜索相关性分数。相关性得分在1-100之间,其中100是最高相关性。相关性得分应使用完整的1-100范围,以便客户端能够以有意义的方式向用户显示它们,例如,作为百分比值。

As the name already indicates, relevancy scores specify how relevant to the search the matched message is. It's not necessarily the same as how precisely the message matched. For example, a message whose subject fuzzily matches the search string might get a higher relevancy score than a message whose body had the exact string in the middle of a sentence. When multiple search keys are matched fuzzily, how the relevancy score is calculated is server-dependent.

正如名称已经指出的,相关性分数指定匹配消息与搜索的相关性。这不一定与消息匹配的精确程度相同。例如,一个主题模糊地匹配搜索字符串的消息可能比一个消息的关联度更高,该消息在句子的中间有确切的字符串。当模糊匹配多个搜索键时,相关度得分的计算方式取决于服务器。

If the server also advertises the ESEARCH capability as defined by [ESEARCH], the relevancy scores can be retrieved using the new RELEVANCY return option for SEARCH:

如果服务器还宣传[ESEARCH]定义的ESEARCH功能,则可使用新的搜索相关度返回选项检索相关度得分:

      C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo"
      S: * ESEARCH (TAG "B1") ALL 1,5,10 RELEVANCY (4 99 42)
      S: B1 OK Search completed.
        
      C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo"
      S: * ESEARCH (TAG "B1") ALL 1,5,10 RELEVANCY (4 99 42)
      S: B1 OK Search completed.
        

In the example above, the server would treat "hello", "help", and other similar strings as fuzzily matching the misspelled "Helo".

在上面的示例中,服务器将“hello”、“help”和其他类似字符串视为与拼写错误的“Helo”模糊匹配。

The RELEVANCY return option MUST NOT be used unless a FUZZY search key is also given. Note that SEARCH results aren't sorted by relevancy; SORT is needed for that.

除非还提供了模糊搜索键,否则不得使用相关性返回选项。请注意,搜索结果不按相关性排序;这需要分类。

5. Fuzzy Matching with Non-String Search Keys
5. 非字符串搜索键的模糊匹配

Fuzzy matching is not limited to just string matching. All search keys SHOULD be matched fuzzily, although exactly what that means for different search keys is left for server implementations to decide -- including deciding that fuzzy matching is meaningless for a particular key, and falling back to exact matching. Some suggestions are given below.

模糊匹配不仅仅限于字符串匹配。所有的搜索键都应该进行模糊匹配,尽管这对于不同的搜索键意味着什么,这完全由服务器实现来决定——包括确定模糊匹配对于特定的键来说没有意义,以及返回到精确匹配。下面给出一些建议。

Dates: A typical example could be when a user wants to find a message "from Dave about a week ago". A client could perform this search using SEARCH FUZZY (FROM "Dave" SINCE 21-Jan-2009 BEFORE 24-Jan-2009). The server could return messages outside the specified date range, but the further away the message is, the lower the relevancy score.

日期:一个典型的例子是,当用户想要查找“一周前Dave发来的”消息时。客户可以使用搜索模糊(自2009年1月21日起至2009年1月24日之前的“Dave”)执行此搜索。服务器可以返回指定日期范围之外的消息,但消息越远,相关性得分越低。

Sizes: These should be handled similarly to dates. If a user wants to search for "about 1 MB attachments", the client could do this by sending SEARCH FUZZY (LARGER 900000 SMALLER 1100000). Again, the further away the message size is from the specified range, the lower the relevancy score.

尺寸:这些尺寸的处理方式应与日期类似。如果用户想要搜索“大约1MB的附件”,客户端可以通过发送SearchFuzzy(较大900000,较小1100000)来完成。同样,消息大小离指定范围越远,相关性得分越低。

Flags: If other search criteria match, the server could return messages that don't have the specified flags set, but with lower relevancy scores. SEARCH SUBJECT "xyz" FUZZY ANSWERED, for example, might be useful if the user thinks the message he is looking for has the ANSWERED flag set, but he isn't sure.

标志:如果其他搜索条件匹配,服务器可能会返回未设置指定标志但关联性分数较低的消息。例如,如果用户认为他正在查找的邮件设置了应答标志,但他不确定,那么搜索主题“xyz”模糊应答可能会很有用。

Unique Identifiers (UIDs), sequences, modification sequences: These are examples of keys for which exact matching probably makes sense. Alternatively, a server might choose, for instance, to expand a UID range by 5% on each side.

唯一标识符(UID)、序列、修改序列:这些是键的示例,精确匹配可能有意义。或者,服务器可以选择,例如,将UID范围每边扩展5%。

6. Extensions to SORT and SEARCH
6. 排序和搜索的扩展

If the server also advertises the SORT capability as defined by [SORT], the results can be sorted by the new RELEVANCY sort criteria:

如果服务器还公布了[SORT]定义的排序功能,则可以按照新的相关性排序标准对结果进行排序:

      C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo"
      S: * SORT 5 10 1
      S: C1 OK Sort completed.
        
      C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo"
      S: * SORT 5 10 1
      S: C1 OK Sort completed.
        

The message with the highest score is returned first. As with the RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used unless a FUZZY search key is also given.

首先返回得分最高的消息。与相关性返回选项一样,除非还提供了模糊搜索键,否则不得使用相关性排序条件。

If the server also advertises the ESORT capability as defined by [CONTEXT], the relevancy scores can be retrieved using the new RELEVANCY return option for SORT:

如果服务器还公布了[CONTEXT]定义的ESORT功能,则可使用排序的新关联性返回选项检索关联性得分:

      C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT
         "Helo"
      S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4)
      S: C2 OK Sort completed.
        
      C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT
         "Helo"
      S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4)
      S: C2 OK Sort completed.
        

Furthermore, if the server advertises the CONTEXT=SORT (or CONTEXT=SEARCH) capability, then the client can limit the number of returned messages to a SORT (or a SEARCH) by using the PARTIAL return option. For example, this returns the 10 most relevant messages:

此外,如果服务器宣传CONTEXT=SORT(或CONTEXT=SEARCH)功能,那么客户端可以通过使用PARTIAL return选项将返回的消息数量限制为某个排序(或搜索)。例如,这将返回10条最相关的消息:

      C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT
         "World"
      S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82)
      S: C3 OK Sort completed.
        
      C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT
         "World"
      S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82)
      S: C3 OK Sort completed.
        
7. Formal Syntax
7. 形式语法

The following syntax specification uses the augmented Backus-Naur Form (BNF) as described in [ABNF]. It includes definitions from [RFC3501], [IMAP-ABNF], and [SORT].

以下语法规范使用[ABNF]中所述的增广巴科斯诺尔形式(BNF)。它包括[RFC3501]、[IMAP-ABNF]和[SORT]中的定义。

      capability         =/ "SEARCH=FUZZY"
        
      capability         =/ "SEARCH=FUZZY"
        
      score              = 1*3DIGIT
         ;; (1 <= n <= 100)
        
      score              = 1*3DIGIT
         ;; (1 <= n <= 100)
        
      score-list         = "(" [score *(SP score)] ")"
        
      score-list         = "(" [score *(SP score)] ")"
        

search-key =/ "FUZZY" SP search-key

搜索键=/“模糊”SP搜索键

      search-return-data =/ "RELEVANCY" SP score-list
         ;; Conforms to <search-return-data>, from [IMAP-ABNF]
        
      search-return-data =/ "RELEVANCY" SP score-list
         ;; Conforms to <search-return-data>, from [IMAP-ABNF]
        
      search-return-opt  =/ "RELEVANCY"
         ;; Conforms to <search-return-opt>, from [IMAP-ABNF]
        
      search-return-opt  =/ "RELEVANCY"
         ;; Conforms to <search-return-opt>, from [IMAP-ABNF]
        

sort-key =/ "RELEVANCY"

排序键=/“相关性”

8. Security Considerations
8. 安全考虑

Implementation of this extension might enable denial-of-service attacks against server resources. Servers MAY limit the resources that a single search (or a single user) may use. Additionally, implementors should be aware of the following: Fuzzy search engines are often complex with non-obvious disk space, memory, and/or CPU usage patterns. Server implementors should at least test the fuzzy-search behavior with large messages that contain very long words and/or unique random strings. Also, very long search keys might cause excessive memory or CPU usage.

实现此扩展可能会对服务器资源发起拒绝服务攻击。服务器可能会限制单个搜索(或单个用户)可能使用的资源。此外,实现者应该注意以下几点:模糊搜索引擎通常比较复杂,没有明显的磁盘空间、内存和/或CPU使用模式。服务器实现者至少应该测试包含很长单词和/或唯一随机字符串的大型消息的模糊搜索行为。此外,过长的搜索键可能会导致内存或CPU使用过多。

Invalid input may also be problematic. For example, if the search engine takes a UTF-8 stream as input, it might fail more or less badly when illegal UTF-8 sequences are fed to it from a message whose

无效输入也可能有问题。例如,如果搜索引擎以一个UTF-8流作为输入,那么当非法的UTF-8序列从一条消息(其

character set was claimed to be UTF-8. This could be avoided by validating all the input and, for example, replacing illegal UTF-8 sequences with the Unicode replacement character (U+FFFD).

字符集被称为UTF-8。这可以通过验证所有输入来避免,例如,用Unicode替换字符(U+FFFD)替换非法的UTF-8序列。

Search relevancy rankings might be susceptible to "poisoning" by smart attackers using certain keywords or hidden markup (e.g., HTML) in their messages to boost the rankings. This can't be fully prevented by servers, so clients should prepare for it by at least allowing users to see all the search results, rather than hiding results below a certain score.

搜索相关性排名可能容易受到智能攻击者的“毒害”,智能攻击者在其消息中使用某些关键字或隐藏标记(如HTML)来提高排名。服务器无法完全阻止这一点,因此客户端应至少允许用户查看所有搜索结果,而不是将结果隐藏在某个分数以下。

9. IANA Considerations
9. IANA考虑

IMAP4 capabilities are registered by publishing a standards track or IESG-approved experimental RFC. The "Internet Message Access Protocol (IMAP) 4 Capabilities Registry" is available from http://www.iana.org/.

IMAP4功能通过发布标准跟踪或IESG批准的实验RFC进行注册。“Internet消息访问协议(IMAP)4功能注册表”可从http://www.iana.org/.

This document defines the SEARCH=FUZZY IMAP capability. IANA has added it to the registry.

本文档定义了搜索=模糊IMAP功能。IANA已将其添加到注册表中。

10. Acknowledgements
10. 致谢

Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo, and Dave Cridland have helped with this document.

Alexey Melnikov、Zoltan Ordogh、Barry Leiba、Cyrus Daboo和Dave Cridland为本文件提供了帮助。

11. Normative References
11. 规范性引用文件

[ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, January 2008.

[ABNF]Crocker,D.,Ed.和P.Overell,“语法规范的扩充BNF:ABNF”,STD 68,RFC 5234,2008年1月。

[CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4", RFC 5267, July 2008.

[背景]Cridland,D.和C.King,“IMAP4的背景”,RFC 52672008年7月。

[ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH Command for Controlling What Kind of Information Is Returned", RFC 4731, November 2006.

[ESEARCH]Melnikov,A.和D.Cridland,“用于控制返回何种信息的搜索命令的IMAP4扩展”,RFC 47312006年11月。

[IMAP-ABNF] Melnikov, A. and C. Daboo, "Collected Extensions to IMAP4 ABNF", RFC 4466, April 2006.

[IMAP-ABNF]Melnikov,A.和C.Daboo,“IMAP4 ABNF的收集扩展”,RFC 4466,2006年4月。

[KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[关键词]Bradner,S.,“RFC中用于表示需求水平的关键词”,BCP 14,RFC 2119,1997年3月。

[RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION 4rev1", RFC 3501, March 2003.

[RFC3501]Crispin,M.,“互联网消息访问协议-版本4rev1”,RFC 35012003年3月。

[RFC5255] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet Message Access Protocol Internationalization", RFC 5255, June 2008.

[RFC5255]Newman,C.,Gulbrandsen,A.,和A.Melnikov,“互联网消息访问协议国际化”,RFC 52552008年6月。

[SORT] Crispin, M. and K. Murchison, "Internet Message Access Protocol - SORT and THREAD Extensions", RFC 5256, June 2008.

[SORT]Crispin,M.和K.Murchison,“互联网消息访问协议-排序和线程扩展”,RFC 5256,2008年6月。

Author's Address

作者地址

Timo Sirainen

蒂莫·西莱宁

   EMail: tss@iki.fi
        
   EMail: tss@iki.fi