Network Working Group                                            D. Korn
Request for Comments: 3284                                     AT&T Labs
Category: Standards Track                                   J. MacDonald
                                                             UC Berkeley
                                                                J. Mogul
                                                 Hewlett-Packard Company
                                                                   K. Vo
                                                               AT&T Labs
                                                               June 2002
        
Network Working Group                                            D. Korn
Request for Comments: 3284                                     AT&T Labs
Category: Standards Track                                   J. MacDonald
                                                             UC Berkeley
                                                                J. Mogul
                                                 Hewlett-Packard Company
                                                                   K. Vo
                                                               AT&T Labs
                                                               June 2002
        

The VCDIFF Generic Differencing and Compression Data Format

VCDIFF通用差分和压缩数据格式

Status of this Memo

本备忘录的状况

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

本文件规定了互联网社区的互联网标准跟踪协议,并要求进行讨论和提出改进建议。有关本协议的标准化状态和状态,请参考当前版本的“互联网官方协议标准”(STD 1)。本备忘录的分发不受限制。

Copyright Notice

版权公告

Copyright (C) The Internet Society (2002). All Rights Reserved.

版权所有(C)互联网协会(2002年)。版权所有。

Abstract

摘要

This memo describes VCDIFF, a general, efficient and portable data format suitable for encoding compressed and/or differencing data so that they can be easily transported among computers.

本备忘录描述了VCDIFF,这是一种通用、高效和可移植的数据格式,适用于编码压缩和/或差异数据,以便在计算机之间轻松传输。

Table of Contents

目录

    1.  Executive Summary ...........................................  2
    2.  Conventions .................................................  4
    3.  Delta Instructions ..........................................  5
    4.  Delta File Organization .....................................  6
    5.  Delta Instruction Encoding .................................. 12
    6.  Decoding a Target Window .................................... 20
    7.  Application-Defined Code Tables ............................. 21
    8.  Performance ................................................. 22
    9.  Further Issues .............................................. 24
   10.  Summary ..................................................... 25
   11.  Acknowledgements ............................................ 25
   12.  Security Considerations ..................................... 25
   13.  Source Code Availability .................................... 25
   14.  Intellectual Property Rights ................................ 26
   15.  IANA Considerations ......................................... 26
   16.  References .................................................. 26
   17.  Authors' Addresses .......................................... 28
   18.  Full Copyright Statement .................................... 29
        
    1.  Executive Summary ...........................................  2
    2.  Conventions .................................................  4
    3.  Delta Instructions ..........................................  5
    4.  Delta File Organization .....................................  6
    5.  Delta Instruction Encoding .................................. 12
    6.  Decoding a Target Window .................................... 20
    7.  Application-Defined Code Tables ............................. 21
    8.  Performance ................................................. 22
    9.  Further Issues .............................................. 24
   10.  Summary ..................................................... 25
   11.  Acknowledgements ............................................ 25
   12.  Security Considerations ..................................... 25
   13.  Source Code Availability .................................... 25
   14.  Intellectual Property Rights ................................ 26
   15.  IANA Considerations ......................................... 26
   16.  References .................................................. 26
   17.  Authors' Addresses .......................................... 28
   18.  Full Copyright Statement .................................... 29
        
1. Executive Summary
1. 执行摘要

Compression and differencing techniques can greatly improve storage and transmission of files and file versions. Since files are often transported across machines with distinct architectures and performance characteristics, such data should be encoded in a form that is portable and can be decoded with little or no knowledge of the encoders. This document describes Vcdiff, a compact portable encoding format designed for these purposes.

压缩和差异技术可以极大地改进文件和文件版本的存储和传输。由于文件通常在具有不同体系结构和性能特征的机器之间传输,因此此类数据应以可移植的形式进行编码,并且可以在几乎不了解或不了解编码器的情况下进行解码。本文档描述了Vcdiff,一种为这些目的而设计的紧凑的便携式编码格式。

Data differencing is the process of computing a compact and invertible encoding of a "target file" given a "source file". Data compression is similar, but without the use of source data. The UNIX utilities diff, compress, and gzip are well-known examples of data differencing and compression tools. For data differencing, the computed encoding is called a "delta file", and for data compression, it is called a "compressed file". Delta and compressed files are good for storage and transmission as they are often smaller than the originals.

数据差分是计算给定“源文件”的“目标文件”的紧凑可逆编码的过程。数据压缩与此类似,但不使用源数据。UNIX实用程序diff、compress和gzip是数据差异和压缩工具的著名示例。对于数据差分,计算出的编码称为“增量文件”,对于数据压缩,则称为“压缩文件”。增量文件和压缩文件有利于存储和传输,因为它们通常比原始文件小。

Data differencing and data compression are traditionally treated as distinct types of data processing. However, as shown in the Vdelta technique by Korn and Vo [1], compression can be thought of as a special case of differencing in which the source data is empty. The basic idea is to unify the string parsing scheme used in the Lempel-Ziv'77 (LZ'77) style compressors [2] and the block-move technique of Tichy [3]. Loosely speaking, this works as follows:

数据差分和数据压缩传统上被视为不同类型的数据处理。然而,正如Korn和Vo[1]在Vdelta技术中所示,压缩可以被认为是一种特殊的差分情况,其中源数据为空。基本思想是统一Lempel Ziv'77(LZ'77)风格压缩器[2]中使用的字符串解析方案和Tichy[3]的块移动技术。粗略地说,其工作原理如下:

a. Concatenate source and target data. b. Parse the data from left to right as in LZ'77 but make sure that a parsed segment starts the target data. c. Start to output when reaching target data.

a. 连接源数据和目标数据。B从左到右解析数据,如LZ'77中所示,但确保解析的段启动目标数据。C到达目标数据时开始输出。

Parsing is based on string matching algorithms, such as suffix trees [4] or hashing with different time and space performance characteristics. Vdelta uses a fast string matching algorithm that requires less memory than other techniques [5,6]. However, even with this algorithm, the memory requirement can still be prohibitive for large files. A common way to deal with memory limitation is to partition an input file into chunks called "windows" and process them separately. Here, except for unpublished work by Vo, little has been done on designing effective windowing schemes. Current techniques, including Vdelta, simply use source and target windows with corresponding addresses across source and target files.

解析基于字符串匹配算法,例如后缀树[4]或具有不同时间和空间性能特征的哈希。Vdelta使用一种比其他技术需要更少内存的快速字符串匹配算法[5,6]。但是,即使使用此算法,对于大型文件,内存需求仍然是不可接受的。处理内存限制的一种常见方法是将输入文件划分为称为“窗口”的块,并分别处理它们。在这里,除了Vo未发表的工作外,在设计有效的窗口方案方面做得很少。当前的技术,包括Vdelta,只需使用源和目标窗口,并在源和目标文件之间具有相应的地址。

String matching and windowing algorithms have great influence on the compression rate of delta and compressed files. However, it is desirable to have a portable encoding format that is independent of such algorithms. This enables the construction of client-server applications in which a server may serve clients with unknown computing characteristics. Unfortunately, all current differencing and compressing tools, including Vdelta, fall short in this respect. Their storage formats are closely intertwined with the implemented string matching and/or windowing algorithms.

字符串匹配和窗口算法对增量和压缩文件的压缩率有很大影响。然而,希望具有独立于此类算法的便携式编码格式。这允许构建客户机-服务器应用程序,其中服务器可以为具有未知计算特性的客户机提供服务。不幸的是,所有当前的差分和压缩工具,包括Vdelta,在这方面都存在不足。它们的存储格式与实现的字符串匹配和/或窗口算法紧密交织在一起。

The encoding format Vcdiff proposed here addresses the above issues. Vcdiff achieves the characteristics below:

这里提出的编码格式Vcdiff解决了上述问题。Vcdiff具有以下特性:

Output compactness: The basic encoding format compactly represents compressed or delta files. Applications can further extend the basic encoding format with "secondary encoders" to achieve more compression.

输出紧凑性:基本编码格式紧凑地表示压缩文件或增量文件。应用程序可以使用“辅助编码器”进一步扩展基本编码格式,以实现更多压缩。

Data portability: The basic encoding format is free from machine byte order and word size issues. This allows data to be encoded on one machine and decoded on a different machine with different architecture.

数据可移植性:基本编码格式不存在机器字节顺序和字大小问题。这允许数据在一台机器上编码,并在具有不同体系结构的不同机器上解码。

Algorithm genericity: The decoding algorithm is independent from string matching and windowing algorithms. This allows competition among implementations of the encoder while keeping the same decoder.

算法通用性:解码算法独立于字符串匹配和窗口算法。这允许编码器实现之间的竞争,同时保持相同的解码器。

Decoding efficiency: Except for secondary encoder issues, the decoding algorithm runs in time proportionate to the size of the target file and uses space proportionate to the maximal window size. Vcdiff differs from more conventional compressors in that it uses only byte-aligned data, thus avoiding bit-level operations, which improves decoding speed at the slight cost of compression efficiency.

解码效率:除了二级编码器问题外,解码算法的运行时间与目标文件的大小成比例,并使用与最大窗口大小成比例的空间。Vcdiff与更传统的压缩器的不同之处在于,它只使用字节对齐的数据,因此避免了位级操作,从而在略微降低压缩效率的情况下提高了解码速度。

The combined differencing and compression method is called "delta compression" [14]. As this way of data processing treats compression as a special case of differencing, we shall use the term "delta file" to indicate the compressed output for both cases.

组合差分和压缩方法称为“增量压缩”[14]。由于这种数据处理方式将压缩视为差分的一种特殊情况,我们将使用术语“增量文件”来表示这两种情况下的压缩输出。

2. Conventions
2. 习俗

The basic data unit is a byte. For portability, Vcdiff shall limit a byte to its lower eight bits even on machines with larger bytes. The bits in a byte are ordered from right to left so that the least significant bit (LSB) has value 1, and the most significant bit (MSB), has value 128.

基本数据单位是一个字节。为了便于移植,Vcdiff应将一个字节限制在其较低的8位,即使在具有较大字节的机器上也是如此。字节中的位按从右到左的顺序排列,以便最低有效位(LSB)的值为1,最高有效位(MSB)的值为128。

For purposes of exposition in this document, we adopt the convention that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers never appear in the encoded format itself.

为了在本文档中进行阐述,我们采用了LSB编号为0、MSB编号为7的约定。比特数永远不会以编码格式本身出现。

Vcdiff encodes unsigned integer values using a portable, variable-sized format (originally introduced in the Sfio library [7]). This encoding treats an integer as a number in base 128. Then, each digit in this representation is encoded in the lower seven bits of a byte. Except for the least significant byte, other bytes have their most significant bit turned on to indicate that there are still more digits in the encoding. The two key properties of this integer encoding that are beneficial to a data compression format are:

Vcdiff使用可移植的可变大小格式(最初引入Sfio库[7])对无符号整数值进行编码。此编码将整数视为以128为基数的数字。然后,此表示中的每个数字都以字节的低七位进行编码。除最低有效字节外,其他字节的最高有效位已打开,以指示编码中还有更多的数字。此整数编码的两个关键属性有利于数据压缩格式:

a. The encoding is portable among systems using 8-bit bytes, and b. Small values are encoded compactly.

a. 该编码在使用8位字节的系统之间是可移植的,并且b。小值被压缩编码。

For example, consider the value 123456789, which can be represented with four 7-bit digits whose values are 58, 111, 26, 21 in order from most to least significant. Below is the 8-bit byte encoding of these digits. Note that the MSBs of 58, 111 and 26 are on.

例如,考虑值123456789,它可以用四个7位数字表示,其值为58, 111, 26、21,从最大到最小。下面是这些数字的8位字节编码。请注意,58、111和26的MSB处于打开状态。

              +-------------------------------------------+
              | 10111010 | 11101111 | 10011010 | 00010101 |
              +-------------------------------------------+
                MSB+58     MSB+111    MSB+26     0+21
        
              +-------------------------------------------+
              | 10111010 | 11101111 | 10011010 | 00010101 |
              +-------------------------------------------+
                MSB+58     MSB+111    MSB+26     0+21
        

Henceforth, the terms "byte" and "integer" will refer to a byte and an unsigned integer as described.

此后,术语“字节”和“整数”将指如所述的字节和无符号整数。

Algorithms in the C language are occasionally exhibited to clarify the descriptions. Such C code is meant for clarification only, and is not part of the actual specification of the Vcdiff format.

偶尔会展示C语言中的算法以澄清描述。此类C代码仅用于澄清,不属于Vcdiff格式的实际规范的一部分。

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14, RFC 2119 [12].

本文件中的关键词“必须”、“不得”、“要求”、“应”、“不应”、“应”、“不应”、“建议”、“可”和“可选”应按照BCP 14、RFC 2119[12]中所述进行解释。

3. Delta Instructions
3. 增量指令

A large target file is partitioned into non-overlapping sections called "target windows". These target windows are processed separately and sequentially based on their order in the target file.

一个大的目标文件被划分成称为“目标窗口”的非重叠部分。这些目标窗口将根据其在目标文件中的顺序分别和顺序进行处理。

A target window T, of length t, may be compared against some source data segment S, of length s. By construction, this source data segment S comes either from the source file, if one is used, or from a part of the target file earlier than T. In this way, during decoding, S is completely known when T is being decoded.

可以将长度为T的目标窗口T与长度为S的一些源数据段S进行比较。通过构造,该源数据段S要么来自源文件(如果使用了源文件),要么来自目标文件早于T的一部分。这样,在解码期间,当T被解码时,S是完全已知的。

The choices of T, t, S and s are made by some window selection algorithm, which can greatly affect the size of the encoding. However, as seen later, these choices are encoded so that no knowledge of the window selection algorithm is needed during decoding.

T、T、S和S的选择是通过一些窗口选择算法进行的,这会极大地影响编码的大小。然而,如后所述,这些选择被编码,以便在解码期间不需要关于窗口选择算法的知识。

Assume that S[j] represents the jth byte in S, and T[k] represents the kth byte in T. Then, for the delta instructions, we treat the data windows S and T as substrings of a superstring U, formed by concatenating them like this:

假设S[j]表示S中的第j个字节,T[k]表示T中的第k个字节。然后,对于增量指令,我们将数据窗口S和T视为超弦U的子串,通过如下方式连接它们:

S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]

S[0]S[1]…S[S-1]T[0]T[1]…T[T-1]

The "address" of a byte in S or T is referred to by its location in U. For example, the address of T[k] is s+k.

S或T中字节的“地址”是指其在U中的位置。例如,T[k]的地址是S+k。

The instructions to encode and direct the reconstruction of a target window are called delta instructions. There are three types:

编码和指导目标窗口重建的指令称为增量指令。有三种类型:

ADD: This instruction has two arguments, a size x and a sequence of x bytes to be copied. COPY: This instruction has two arguments, a size x and an address p in the string U. The arguments specify the substring of U that must be copied. We shall assert that such a substring must be entirely contained in either S or T.

ADD:此指令有两个参数,一个大小为x,一个要复制的x字节序列。复制:此指令有两个参数,字符串U中的大小x和地址p。这些参数指定必须复制的U的子字符串。我们将断言这样一个子串必须完全包含在S或T中。

RUN: This instruction has two arguments, a size x and a byte b, that will be repeated x times.

运行:此指令有两个参数,一个是大小x,一个是字节b,将重复x次。

Below are example source and target windows and the delta instructions that encode the target window in terms of the source window.

下面是源窗口和目标窗口的示例,以及根据源窗口对目标窗口进行编码的增量指令。

a b c d e f g h i j k l m n o p a b c d w x y z e f g h e f g h e f g h e f g h z z z z

a b c d e f g h i j k l m n o p a b c d w x y z e f g h f g h z z

COPY 4, 0 ADD 4, w x y z COPY 4, 4 COPY 12, 24 RUN 4, z

复制4,0添加4,w x y z复制4,4复制12,24运行4,z

Thus, the first letter 'a' in the target window is at location 16 in the superstring. Note that the fourth instruction, "COPY 12, 24", copies data from T itself since address 24 is position 8 in T. This instruction also shows that it is fine to overlap the data to be copied with the data being copied from, as long as the latter starts earlier. This enables efficient encoding of periodic sequences, i.e., sequences with regularly repeated subsequences. The RUN instruction is a compact way to encode a sequence repeating the same byte even though such a sequence can be thought of as a periodic sequence with period 1.

因此,目标窗口中的第一个字母“a”位于超弦中的位置16。请注意,第四条指令“COPY 12,24”从T本身复制数据,因为地址24是T中的位置8。该指令还表明,只要要复制的数据更早开始,就可以将要复制的数据与要复制的数据重叠。这使得周期序列(即具有规则重复子序列的序列)的有效编码成为可能。RUN指令是对重复相同字节的序列进行编码的一种紧凑方式,即使这种序列可以被认为是周期为1的周期序列。

To reconstruct the target window, one simply processes one delta instruction at a time and copies the data, either from the source window or the target window being reconstructed, based on the type of the instruction and the associated address, if any.

要重建目标窗口,只需一次处理一条delta指令,并根据指令类型和相关地址(如果有)从正在重建的源窗口或目标窗口复制数据。

4. Delta File Organization
4. 增量文件组织

A Vcdiff delta file starts with a Header section followed by a sequence of Window sections. The Header section includes magic bytes to identify the file type, and information concerning data processing beyond the basic encoding format. The Window sections encode the target windows.

Vcdiff delta文件以一个头段开始,后跟一系列窗口段。标题部分包括用于标识文件类型的魔法字节,以及有关基本编码格式之外的数据处理的信息。窗口部分对目标窗口进行编码。

Below is the overall organization of a delta file. The indented items refine the ones immediately above them. An item in square brackets may or may not be present in the file depending on the information encoded in the Indicator byte above it.

下面是增量文件的总体组织。缩进的项目细化了紧靠其上方的项目。方括号中的项目可能存在于文件中,也可能不存在于文件中,这取决于其上方的指示符字节中编码的信息。

      Header
          Header1                                  - byte
          Header2                                  - byte
          Header3                                  - byte
          Header4                                  - byte
          Hdr_Indicator                            - byte
          [Secondary compressor ID]                - byte
          [Length of code table data]              - integer
          [Code table data]
              Size of near cache                   - byte
              Size of same cache                   - byte
              Compressed code table data
      Window1
          Win_Indicator                            - byte
          [Source segment size]                    - integer
          [Source segment position]                - integer
          The delta encoding of the target window
              Length of the delta encoding         - integer
              The delta encoding
                  Size of the target window        - integer
                  Delta_Indicator                  - byte
                  Length of data for ADDs and RUNs - integer
                  Length of instructions and sizes - integer
                  Length of addresses for COPYs    - integer
                  Data section for ADDs and RUNs   - array of bytes
                  Instructions and sizes section   - array of bytes
                  Addresses section for COPYs      - array of bytes
      Window2
      ...
        
      Header
          Header1                                  - byte
          Header2                                  - byte
          Header3                                  - byte
          Header4                                  - byte
          Hdr_Indicator                            - byte
          [Secondary compressor ID]                - byte
          [Length of code table data]              - integer
          [Code table data]
              Size of near cache                   - byte
              Size of same cache                   - byte
              Compressed code table data
      Window1
          Win_Indicator                            - byte
          [Source segment size]                    - integer
          [Source segment position]                - integer
          The delta encoding of the target window
              Length of the delta encoding         - integer
              The delta encoding
                  Size of the target window        - integer
                  Delta_Indicator                  - byte
                  Length of data for ADDs and RUNs - integer
                  Length of instructions and sizes - integer
                  Length of addresses for COPYs    - integer
                  Data section for ADDs and RUNs   - array of bytes
                  Instructions and sizes section   - array of bytes
                  Addresses section for COPYs      - array of bytes
      Window2
      ...
        
4.1 The Header Section
4.1 标题部分

Each delta file starts with a header section organized as below. Note the convention that square-brackets enclose optional items.

每个增量文件都以如下组织的标题部分开始。请注意方括号内包含可选项目的惯例。

         Header1                                  - byte = 0xD6
         Header2                                  - byte = 0xC3
         Header3                                  - byte = 0xC4
         Header4                                  - byte
         Hdr_Indicator                            - byte
         [Secondary compressor ID]                - byte
         [Length of code table data]              - integer
         [Code table data]
        
         Header1                                  - byte = 0xD6
         Header2                                  - byte = 0xC3
         Header3                                  - byte = 0xC4
         Header4                                  - byte
         Hdr_Indicator                            - byte
         [Secondary compressor ID]                - byte
         [Length of code table data]              - integer
         [Code table data]
        

The first three Header bytes are the ASCII characters 'V', 'C' and 'D' with their most significant bits turned on (in hexadecimal, the values are 0xD6, 0xC3, and 0xC4). The fourth Header byte is currently set to zero. In the future, it might be used to indicate the version of Vcdiff.

前三个头字节是ASCII字符“V”、“C”和“D”,其最高有效位已打开(十六进制中的值为0xD6、0xC3和0xC4)。第四个标头字节当前设置为零。将来,它可能用于指示Vcdiff的版本。

The Hdr_Indicator byte shows if there is any initialization data required to aid in the reconstruction of data in the Window sections. This byte MAY have non-zero values for either, both, or neither of the two bits VCD_DECOMPRESS and VCD_CODETABLE below:

Hdr_指示器字节显示是否需要任何初始化数据来帮助重建窗口部分中的数据。对于以下两位VCD_解压缩和VCD_编码表中的任一位、两位或任何一位,该字节可能具有非零值:

       7 6 5 4 3 2 1 0
      +-+-+-+-+-+-+-+-+
      | | | | | | | | |
      +-+-+-+-+-+-+-+-+
                   ^ ^
                   | |
                   | +-- VCD_DECOMPRESS
                   +---- VCD_CODETABLE
        
       7 6 5 4 3 2 1 0
      +-+-+-+-+-+-+-+-+
      | | | | | | | | |
      +-+-+-+-+-+-+-+-+
                   ^ ^
                   | |
                   | +-- VCD_DECOMPRESS
                   +---- VCD_CODETABLE
        

If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary compressor may have been used to further compress certain parts of the delta encoding data as described in Sections 4.3 and 6. In that case, the ID of the secondary compressor is given next. If this bit is zero, the compressor ID byte is not included.

如果位0(VCD_解压)为非零,则表示可能已使用辅助压缩器进一步压缩第4.3节和第6节中所述的增量编码数据的某些部分。在这种情况下,下一步给出辅助压缩机的ID。如果该位为零,则不包括压缩机ID字节。

If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an application-defined code table is to be used for decoding the delta instructions. This table itself is compressed. The length of the data comprising this compressed code table and the data follow next. Section 7 discusses application-defined code tables. If this bit is zero, the code table data length and the code table data are not included.

如果位1(VCD_代码表)为非零,则表示应用程序定义的代码表将用于解码增量指令。这张桌子本身是压缩的。包含此压缩代码表的数据的长度,接下来是数据。第7节讨论应用程序定义的代码表。如果该位为零,则不包括代码表数据长度和代码表数据。

If both bits are set, then the compressor ID byte is included before the code table data length and the code table data.

如果设置了两个位,则压缩程序ID字节包含在代码表数据长度和代码表数据之前。

4.2 The Format of a Window Section
4.2 窗口节的格式

Each Window section is organized as follows:

每个窗口部分的组织如下:

Win_Indicator - byte [Source segment length] - integer [Source segment position] - integer The delta encoding of the target window

Win_指示符-字节[源段长度]-整数[源段位置]-整数-目标窗口的增量编码

Below are the details of the various items:

以下是各项目的详细信息:

Win_Indicator: This byte is a set of bits, as shown:

Win_指示器:该字节是一组位,如图所示:

          7 6 5 4 3 2 1 0
         +-+-+-+-+-+-+-+-+
         | | | | | | | | |
         +-+-+-+-+-+-+-+-+
                      ^ ^
                      | |
                      | +-- VCD_SOURCE
                      +---- VCD_TARGET
        
          7 6 5 4 3 2 1 0
         +-+-+-+-+-+-+-+-+
         | | | | | | | | |
         +-+-+-+-+-+-+-+-+
                      ^ ^
                      | |
                      | +-- VCD_SOURCE
                      +---- VCD_TARGET
        

If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment of data from the "source" file was used as the corresponding source window of data to encode the target window. The decoder will use this same source data segment to decode the target window.

如果位0(VCD_源)为非零,则表示“源”文件中的一段数据用作相应的数据源窗口,以对目标窗口进行编码。解码器将使用相同的源数据段来解码目标窗口。

If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment of data from the "target" file was used as the corresponding source window of data to encode the target window. As above, this same source data segment is used to decode the target window.

如果位1(VCD_目标)为非零,则表示“目标”文件中的一段数据用作相应的数据源窗口,以对目标窗口进行编码。如上所述,该源数据段用于解码目标窗口。

The Win_Indicator byte MUST NOT have more than one of the bits set (non-zero). It MAY have none of these bits set.

Win_指示器字节设置的位不得超过一位(非零)。它可能没有设置这些位。

If one of these bits is set, the byte is followed by two integers to indicate respectively, the length and position of the source data segment in the relevant file. If the indicator byte is zero, the target window was compressed by itself without comparing against another data segment, and these two integers are not included.

如果设置了这些位中的一位,则字节后面会跟两个整数,分别指示源数据段在相关文件中的长度和位置。如果指示符字节为零,则目标窗口在不与另一个数据段进行比较的情况下自行压缩,并且不包括这两个整数。

The delta encoding of the target window:

目标窗口的增量编码:

This contains the delta encoding of the target window, either in terms of the source data segment (i.e., VCD_SOURCE or VCD_TARGET was set) or by itself if no source window is specified. This data format is discussed next.

它包含目标窗口的增量编码,或者根据源数据段(即,设置了VCD_源或VCD_目标),或者如果未指定源窗口,则包含目标窗口的增量编码。下面将讨论此数据格式。

4.3 The Delta Encoding of a Target Window
4.3 目标窗口的增量编码

The delta encoding of a target window is organized as follows:

目标窗口的增量编码组织如下:

Length of the delta encoding - integer The delta encoding Length of the target window - integer Delta_Indicator - byte Length of data for ADDs and RUNs - integer Length of instructions section - integer Length of addresses for COPYs - integer Data section for ADDs and RUNs - array of bytes Instructions and sizes section - array of bytes Addresses section for COPYs - array of bytes

增量编码长度-整数目标窗口的增量编码长度-整数增量指示符-用于添加和运行的数据字节长度-指令整数长度部分-用于复制的地址整数长度-用于添加和运行的整数数据部分-字节数组指令和大小部分-字节数组地址副本部分-字节数组

Length of the delta encoding: This integer gives the total number of remaining bytes that comprise the data of the delta encoding for this target window.

增量编码的长度:此整数表示组成此目标窗口增量编码数据的剩余字节总数。

The delta encoding: This contains the data representing the delta encoding which is described next.

增量编码:它包含表示增量编码的数据,下面将对其进行描述。

Length of the target window: This integer indicates the actual size of the target window after decompression. A decoder can use this value to allocate memory to store the uncompressed data.

目标窗口的长度:此整数表示解压缩后目标窗口的实际大小。解码器可以使用此值分配内存来存储未压缩的数据。

Delta_Indicator: This byte is a set of bits, as shown:

Delta_指示器:该字节是一组位,如图所示:

          7 6 5 4 3 2 1 0
         +-+-+-+-+-+-+-+-+
         | | | | | | | | |
         +-+-+-+-+-+-+-+-+
                    ^ ^ ^
                    | | |
                    | | +-- VCD_DATACOMP
                    | +---- VCD_INSTCOMP
                    +------ VCD_ADDRCOMP
        
          7 6 5 4 3 2 1 0
         +-+-+-+-+-+-+-+-+
         | | | | | | | | |
         +-+-+-+-+-+-+-+-+
                    ^ ^ ^
                    | | |
                    | | +-- VCD_DATACOMP
                    | +---- VCD_INSTCOMP
                    +------ VCD_ADDRCOMP
        

VCD_DATACOMP: bit value 1. VCD_INSTCOMP: bit value 2. VCD_ADDRCOMP: bit value 4.

VCD_数据组件:位值1。VCD_INSTCOMP:位值2。VCD_ADDRCOMP:位值4。

As discussed, the delta encoding consists of COPY, ADD and RUN instructions. The ADD and RUN instructions have accompanying unmatched data (that is, data that does not specifically match any data in the source window or in some earlier part of the target window) and the COPY instructions have addresses of where the matches occur. OPTIONALLY, these types of data MAY be further compressed using a secondary compressor. Thus, Vcdiff separates the encoding of the delta instructions into three parts:

如前所述,增量编码由复制、添加和运行指令组成。“添加”和“运行”指令附带有不匹配的数据(即,与源窗口或目标窗口的某些早期部分中的任何数据都不匹配的数据),而“复制”指令包含匹配发生的地址。可选地,可以使用辅助压缩器进一步压缩这些类型的数据。因此,Vcdiff将增量指令的编码分为三个部分:

a. The unmatched data in the ADD and RUN instructions, b. The delta instructions and accompanying sizes, and c. The addresses of the COPY instructions.

a. 添加和运行指令b中不匹配的数据。增量说明和随附尺寸,以及c。副本说明的地址。

If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these sections may have been compressed using the specified secondary compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP), and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that the corresponding parts are compressed. Then, these parts MUST be decompressed before decoding the delta instructions.

如果位VCD_解压(第4.1节)打开,则这些部分中的每一部分都可能已使用指定的辅助压缩机进行压缩。位位置0(VCD_DATACOMP)、1(VCD_INSTCOMP)和2(VCD_ADDRCOMP)分别表示,如果非零,则相应的部分被压缩。然后,在解码增量指令之前,必须对这些部分进行解压缩。

Length of data for ADDs and RUNs: This is the length (in bytes) of the section of data storing the unmatched data accompanying the ADD and RUN instructions.

添加和运行的数据长度:这是存储添加和运行指令附带的不匹配数据的数据部分的长度(以字节为单位)。

Length of instructions section: This is the length (in bytes) of the delta instructions and accompanying sizes.

指令长度部分:这是增量指令和伴随大小的长度(字节)。

Length of addresses for COPYs: This is the length (in bytes) of the section storing the addresses of the COPY instructions.

复制地址长度:这是存储复制指令地址的部分的长度(以字节为单位)。

Data section for ADDs and RUNs: This sequence of bytes encodes the unmatched data for the ADD and RUN instructions.

添加和运行的数据部分:此字节序列对添加和运行指令的不匹配数据进行编码。

Instructions and sizes section: This sequence of bytes encodes the instructions and their sizes.

指令和大小部分:这个字节序列对指令及其大小进行编码。

Addresses section for COPYs: This sequence of bytes encodes the addresses of the COPY instructions.

复制的地址部分:这个字节序列编码复制指令的地址。

5. Delta Instruction Encoding
5. 增量指令编码

The delta instructions described in Section 3 represent the results of string matching. For many data differencing applications in which the changes between source and target data are small, any straightforward representation of these instructions would be adequate. However, for applications including differencing of binary files or data compression, it is important to encode these instructions well to achieve good compression rates. The keys to this achievement is to efficiently encode the addresses of COPY instructions and the sizes of all delta instructions.

第3节中描述的增量指令表示字符串匹配的结果。对于许多源数据和目标数据之间的变化很小的数据差异应用程序,这些指令的任何直接表示都是足够的。然而,对于包括二进制文件差分或数据压缩在内的应用程序,对这些指令进行良好编码以获得良好的压缩率非常重要。实现这一目标的关键是有效地编码拷贝指令的地址和所有增量指令的大小。

5.1 Address Encoding Modes of COPY Instructions
5.1 复制指令的地址编码方式

Addresses of COPY instructions are locations of matches and often occur close by or even exactly equal to one another. This is because data in local regions are often replicated with minor changes. In turn, this means that coding a newly matched address against some recently matched addresses can be beneficial. To take advantage of this phenomenon and encode addresses of COPY instructions more efficiently, the Vcdiff data format supports the use of two different types of address caches. Both the encoder and decoder maintain these caches, so that decoder's caches remain synchronized with the encoder's caches.

复制指令的地址是匹配的位置,通常彼此相邻甚至完全相等。这是因为本地区域中的数据通常只需稍作更改即可复制。反过来,这意味着根据一些最近匹配的地址编码新匹配的地址可能是有益的。为了利用这一现象并更有效地编码复制指令的地址,Vcdiff数据格式支持使用两种不同类型的地址缓存。编码器和解码器都维护这些缓存,因此解码器的缓存与编码器的缓存保持同步。

a. A "near" cache is an array with "s_near" slots, each containing an address used for encoding addresses nearby to previously encoded addresses (in the positive direction only). The near cache also maintains a "next_slot" index to the near cache. New entries to the near cache are always inserted in the next_slot index, which maintains a circular buffer of the s_near most recent addresses.

a. “近”缓存是一个带有“s_近”插槽的数组,每个插槽包含一个用于编码先前编码地址附近的地址的地址(仅在正方向)。近缓存还维护近缓存的“下一个\u插槽”索引。near缓存的新条目总是插入到下一个\u插槽索引中,该索引保持最近地址附近的s\u的循环缓冲区。

b. A "same" cache is an array with "s_same", with a multiple of 256 slots, each containing an address. The same cache maintains a hash table of recent addresses used for repeated encoding of the exact same address.

b. “相同”缓存是一个具有“s_same”的数组,具有256个插槽的倍数,每个插槽包含一个地址。同一缓存维护一个最近地址的哈希表,用于重复编码完全相同的地址。

By default, the parameters s_near and s_same are respectively set to 4 and 3. An encoder MAY modify these values, but then it MUST encode the new values in the encoding itself, as discussed in Section 7, so that the decoder can properly set up its own caches.

默认情况下,参数s_near和s_same分别设置为4和3。编码器可以修改这些值,但随后必须在编码本身中对新值进行编码,如第7节所述,以便解码器能够正确设置自己的缓存。

At the start of processing a target window, an implementation (encoder or decoder) initializes all of the slots in both caches to zero. The next_slot pointer of the near cache is set to point to slot zero.

在开始处理目标窗口时,实现(编码器或解码器)将两个缓存中的所有插槽初始化为零。近缓存的下一个\u插槽指针设置为指向插槽零。

Each time a COPY instruction is processed by the encoder or decoder, the implementation's caches are updated as follows, where "addr" is the address in the COPY instruction.

每次编码器或解码器处理复制指令时,实现的缓存更新如下,其中“addr”是复制指令中的地址。

a. The slot in the near cache referenced by the next_slot index is set to addr. The next_slot index is then incremented modulo s_near.

a. 下一个\u插槽索引引用的近缓存中的插槽设置为addr。然后,下一个时隙索引以s_的模递增。

b. The slot in the same cache whose index is addr%(s_same*256) is set to addr. [We use the C notations of % for modulo and * for multiplication.]

b. 索引为addr%(s_same*256)的同一缓存中的插槽设置为addr。[我们使用%表示模,使用*表示乘法。]

5.2 Example code for maintaining caches
5.2 维护缓存的示例代码

To make clear the above description, below are examples of cache data structures and algorithms to initialize and update them:

为了明确上述描述,以下是缓存数据结构和算法的示例,用于初始化和更新它们:

   typedef struct _cache_s
   {
       int*  near;      /* array of size s_near        */
       int   s_near;
       int   next_slot; /* the circular index for near */
       int*  same;      /* array of size s_same*256    */
       int   s_same;
   } Cache_t;
        
   typedef struct _cache_s
   {
       int*  near;      /* array of size s_near        */
       int   s_near;
       int   next_slot; /* the circular index for near */
       int*  same;      /* array of size s_same*256    */
       int   s_same;
   } Cache_t;
        
   cache_init(Cache_t* ka)
   {
       int   i;
        
   cache_init(Cache_t* ka)
   {
       int   i;
        
       ka->next_slot = 0;
       for(i = 0; i < ka->s_near; ++i)
            ka->near[i] = 0;
        
       ka->next_slot = 0;
       for(i = 0; i < ka->s_near; ++i)
            ka->near[i] = 0;
        
       for(i = 0; i < ka->s_same*256; ++i)
            ka->same[i] = 0;
   }
        
       for(i = 0; i < ka->s_same*256; ++i)
            ka->same[i] = 0;
   }
        
   cache_update(Cache_t* ka, int addr)
   {
       if(ka->s_near > 0)
       {   ka->near[ka->next_slot] = addr;
           ka->next_slot = (ka->next_slot + 1) % ka->s_near;
       }
        
   cache_update(Cache_t* ka, int addr)
   {
       if(ka->s_near > 0)
       {   ka->near[ka->next_slot] = addr;
           ka->next_slot = (ka->next_slot + 1) % ka->s_near;
       }
        
       if(ka->s_same > 0)
           ka->same[addr % (ka->s_same*256)] = addr;
   }
        
       if(ka->s_same > 0)
           ka->same[addr % (ka->s_same*256)] = addr;
   }
        
5.3 Encoding of COPY instruction addresses
5.3 拷贝指令地址的编码

The address of a COPY instruction is encoded using different modes, depending on the type of cached address used, if any.

复制指令的地址使用不同的模式进行编码,具体取决于所使用的缓存地址的类型(如果有的话)。

Let "addr" be the address of a COPY instruction to be decoded and "here" be the current location in the target data (i.e., the start of the data about to be encoded or decoded). Let near[j] be the jth element in the near cache, and same[k] be the kth element in the same cache. Below are the possible address modes:

假设“addr”是要解码的复制指令的地址,“here”是目标数据中的当前位置(即,即将编码或解码的数据的开始)。假设near[j]是near缓存中的第j个元素,而same[k]是同一缓存中的第k个元素。以下是可能的地址模式:

VCD_SELF: This mode has value 0. The address was encoded by itself as an integer.

VCD_SELF:此模式的值为0。地址本身被编码为整数。

VCD_HERE: This mode has value 1. The address was encoded as the integer value "here - addr".

VCD_在此:此模式的值为1。地址被编码为整数值“here-addr”。

Near modes: The "near modes" are in the range [2,s_near+1]. Let m be the mode of the address encoding. The address was encoded as the integer value "addr - near[m-2]".

近模式:“近模式”在[2,s_近+1]范围内。设m为地址编码的模式。地址被编码为整数值“addr-near[m-2]”。

Same modes: The "same modes" are in the range [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding. The address was encoded as a single byte b such that "addr == same[(m - (s_near+2))*256 + b]".

相同模式:“相同模式”的范围为[s_接近+2,s_接近+s_相同+1]。设m为编码模式。地址被编码为单个字节b,这样“addr==same[(m-(s_near+2))*256+b]”。

5.4 Example code for encoding and decoding of COPY instruction addresses
5.4 复制指令地址的编码和解码示例代码

We show example algorithms below to demonstrate the use of address modes more clearly. The encoder has the freedom to choose address modes, the sample addr_encode() algorithm merely shows one way of picking the address mode. The decoding algorithm addr_decode() will uniquely decode addresses, regardless of the encoder's algorithm choice.

我们将在下面展示示例算法,以更清楚地演示地址模式的使用。编码器可以自由选择地址模式,示例addr_encode()算法仅显示了选择地址模式的一种方法。无论编码器的算法选择如何,解码算法addr_decode()都将对地址进行唯一解码。

Note that the address caches are updated immediately after an address is encoded or decoded. In this way, the decoder is always synchronized with the encoder.

请注意,地址缓存在地址编码或解码后立即更新。这样,解码器始终与编码器同步。

   int addr_encode(Cache_t* ka, int addr, int here, int* mode)
   {
       int  i, d, bestd, bestm;
        
   int addr_encode(Cache_t* ka, int addr, int here, int* mode)
   {
       int  i, d, bestd, bestm;
        
       /* Attempt to find the address mode that yields the
        * smallest integer value for "d", the encoded address
        * value, thereby minimizing the encoded size of the
        * address. */
        
       /* Attempt to find the address mode that yields the
        * smallest integer value for "d", the encoded address
        * value, thereby minimizing the encoded size of the
        * address. */
        
       bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */
        
       bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */
        
       if((d = here-addr) < bestd)
           { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
        
       if((d = here-addr) < bestd)
           { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
        
       for(i = 0; i < ka->s_near; ++i)
           if((d = addr - ka->near[i]) >= 0 && d < bestd)
               { bestd = d; bestm = i+2; }
        
       for(i = 0; i < ka->s_near; ++i)
           if((d = addr - ka->near[i]) >= 0 && d < bestd)
               { bestd = d; bestm = i+2; }
        
       if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
           { bestd = d%256; bestm = ka->s_near + 2 + d/256; }
        
       if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
           { bestd = d%256; bestm = ka->s_near + 2 + d/256; }
        

cache_update(ka,addr);

缓存更新(ka,addr);

       *mode = bestm; /* this returns the address encoding mode */
       return  bestd; /* this returns the encoded address       */
   }
        
       *mode = bestm; /* this returns the address encoding mode */
       return  bestd; /* this returns the encoded address       */
   }
        

Note that the addr_encode() algorithm chooses the best address mode using a local optimization, but that may not lead to the best encoding efficiency because different modes lead to different instruction encodings, as described below.

请注意,addr_encode()算法使用局部优化选择最佳地址模式,但这可能不会导致最佳编码效率,因为不同的模式会导致不同的指令编码,如下所述。

The functions addrint() and addrbyte() used in addr_decode(), obtain from the "Addresses section for COPYs" (Section 4.3), an integer or a byte, respectively. These utilities will not be described here. We simply recall that an integer is represented as a compact variable-sized string of bytes, as described in Section 2 (i.e., base 128).

addr_decode()中使用的函数addrint()和addrbyte()分别从“副本地址部分”(第4.3节)获取整数或字节。这里不介绍这些实用程序。我们只需回忆一下,整数表示为紧凑的可变大小字节字符串,如第2节所述(即基数128)。

   int addr_decode(Cache_t* ka, int here, int mode)
   {   int  addr, m;
        
   int addr_decode(Cache_t* ka, int here, int mode)
   {   int  addr, m;
        
       if(mode == VCD_SELF)
            addr = addrint();
       else if(mode == VCD_HERE)
            addr = here - addrint();
       else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
            addr = ka->near[m] + addrint();
       else /* same cache */
       {    m = mode - (2 + ka->s_near);
            addr = ka->same[m*256 + addrbyte()];
       }
        
       if(mode == VCD_SELF)
            addr = addrint();
       else if(mode == VCD_HERE)
            addr = here - addrint();
       else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
            addr = ka->near[m] + addrint();
       else /* same cache */
       {    m = mode - (2 + ka->s_near);
            addr = ka->same[m*256 + addrbyte()];
       }
        

cache_update(ka, addr);

缓存更新(ka,addr);

       return addr;
   }
        
       return addr;
   }
        
5.4 Instruction Codes
5.4 指令代码

Matches are often short in lengths and separated by small amounts of unmatched data. That is, the lengths of COPY and ADD instructions are often small. This is particularly true of binary data such as executable files or structured data, such as HTML or XML. In such cases, compression can be improved by combining the encoding of the sizes and the instruction types, as well as combining the encoding of adjacent delta instructions with sufficiently small data sizes. Effective choices of when to perform such combinations depend on many factors including the data being processed and the string matching algorithm in use. For example, if many COPY instructions have the same data sizes, it may be worthwhile to encode these instructions more compactly than others.

匹配的长度通常较短,并且由少量不匹配的数据分隔。也就是说,复制和添加指令的长度通常很小。对于二进制数据(如可执行文件)或结构化数据(如HTML或XML),尤其如此。在这种情况下,可以通过组合大小和指令类型的编码,以及组合相邻增量指令与足够小的数据大小的编码来改进压缩。何时执行这种组合的有效选择取决于许多因素,包括正在处理的数据和正在使用的字符串匹配算法。例如,如果许多复制指令具有相同的数据大小,则值得对这些指令进行比其他指令更紧凑的编码。

The Vcdiff data format is designed so that a decoder does not need to be aware of the choices made in encoding algorithms. This is achieved with the notion of an "instruction code table", containing 256 entries. Each entry defines, either a single delta instruction or a pair of instructions that have been combined. Note that the code table itself only exists in main memory, not in the delta file (unless using an application-defined code table, described in Section 7). The encoded data simply includes the index of each instruction and, since there are only 256 indices, each index can be represented as a single byte.

Vcdiff数据格式的设计使得解码器不需要知道编码算法中的选择。这是通过“指令代码表”的概念实现的,该表包含256个条目。每个条目定义单个增量指令或组合的一对指令。请注意,代码表本身仅存在于主存中,而不存在于增量文件中(除非使用应用程序定义的代码表,如第7节所述)。编码数据只包括每个指令的索引,由于只有256个索引,每个索引可以表示为一个字节。

Each instruction code entry contains six fields, each of which is a single byte with an unsigned value:

每个指令代码项包含六个字段,每个字段都是一个带无符号值的单字节:

          +-----------------------------------------------+
          | inst1 | size1 | mode1 | inst2 | size2 | mode2 |
          +-----------------------------------------------+
        
          +-----------------------------------------------+
          | inst1 | size1 | mode1 | inst2 | size2 | mode2 |
          +-----------------------------------------------+
        

Each triple (inst,size,mode) defines a delta instruction. The meanings of these fields are as follows:

每个三元组(inst、size、mode)定义一条delta指令。这些字段的含义如下:

inst: An "inst" field can have one of the four values: NOOP (0), ADD (1), RUN (2) or COPY (3) to indicate the instruction types. NOOP means that no instruction is specified. In this case, both the corresponding size and mode fields will be zero.

inst:“inst”字段可以有四个值之一:NOOP(0)、ADD(1)、RUN(2)或COPY(3)以指示指令类型。NOOP表示未指定任何指令。在这种情况下,相应的大小和模式字段都将为零。

size: A "size" field is zero or positive. A value zero means that the size associated with the instruction is encoded separately as an integer in the "Instructions and sizes section" (Section 6). A positive value for "size" defines the actual data size. Note that since the size is restricted to a byte, the maximum value for any instruction with size implicitly defined in the code table is 255.

大小:“大小”字段为零或正。值零表示与指令关联的大小在“指令和大小部分”(第6节)中单独编码为整数。“大小”的正值定义实际数据大小。请注意,由于大小限制为一个字节,因此代码表中隐式定义了大小的任何指令的最大值为255。

mode: A "mode" field is significant only when the associated delta instruction is a COPY. It defines the mode used to encode the associated addresses. For other instructions, this is always zero.

模式:“模式”字段仅在关联的增量指令为副本时有效。它定义了用于编码相关地址的模式。对于其他指令,该值始终为零。

5.6 The Code Table
5.6 代码表

Following the discussions on address modes and instruction code tables, we define a "Code Table" to have the data below:

在讨论了地址模式和指令代码表之后,我们定义了一个“代码表”,其中包含以下数据:

s_near: the size of the near cache, s_same: the size of the same cache, i_code: the 256-entry instruction code table.

s_near:近缓存的大小,s_same:同一缓存的大小,i_代码:256条指令代码表。

Vcdiff itself defines a "default code table" in which s_near is 4 and s_same is 3. Thus, there are 9 address modes for a COPY instruction. The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5 are for addresses coded against the near cache. And modes 6, 7 and 8, are for addresses coded against the same cache.

Vcdiff本身定义了一个“默认代码表”,其中s_near为4,s_same为3。因此,复制指令有9种地址模式。前两个是VCD_SELF(0)和这里的VCD_(1)。模式2、3、4和5是针对近缓存编码的地址。模式6、7和8用于针对同一缓存编码的地址。

        TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX
       ---------------------------------------------------------------
    1.  RUN         0        0     NOOP       0        0        0
    2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18]
    3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34]
    4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50]
    5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66]
    6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82]
    7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98]
    8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114]
    9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130]
   10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146]
   11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162]
   12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174]
   13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186]
   14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198]
   15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210]
   16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222]
   17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234]
   18.  ADD       [1,4]      0     COPY       4        6    [235,238]
   19.  ADD       [1,4]      0     COPY       4        7    [239,242]
   20.  ADD       [1,4]      0     COPY       4        8    [243,246]
   21.  COPY        4      [0,8]   ADD        1        0    [247,255]
       ---------------------------------------------------------------
        
        TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX
       ---------------------------------------------------------------
    1.  RUN         0        0     NOOP       0        0        0
    2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18]
    3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34]
    4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50]
    5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66]
    6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82]
    7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98]
    8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114]
    9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130]
   10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146]
   11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162]
   12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174]
   13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186]
   14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198]
   15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210]
   16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222]
   17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234]
   18.  ADD       [1,4]      0     COPY       4        6    [235,238]
   19.  ADD       [1,4]      0     COPY       4        7    [239,242]
   20.  ADD       [1,4]      0     COPY       4        8    [243,246]
   21.  COPY        4      [0,8]   ADD        1        0    [247,255]
       ---------------------------------------------------------------
        

The default instruction code table is depicted above, in a compact representation that we use only for descriptive purposes. See section 7 for the specification of how an instruction code table is represented in the Vcdiff encoding format. In the depiction, a zero value for size indicates that the size is separately coded. The mode of non-COPY instructions is represented as 0, even though they are not used.

默认指令代码表如上图所示,以紧凑的表示形式表示,我们仅用于描述性目的。有关如何以Vcdiff编码格式表示指令代码表的规范,请参见第7节。在描述中,大小的零值表示大小被单独编码。非复制指令的模式表示为0,即使未使用它们。

In the depiction, each numbered line represents one or more entries in the actual instruction code table (recall that an entry in the instruction code table may represent up to two combined delta instructions.) The last column ("INDEX") shows which index value, or range of index values, of the entries are covered by that line. (The notation [i,j] means values from i through j, inclusively.) The first 6 columns of a line in the depiction, describe the pairs of instructions used for the corresponding index value(s).

在描述中,每一编号行表示实际指令代码表中的一个或多个条目(请记住,指令代码表中的一个条目最多可表示两个组合的增量指令)。最后一列(“索引”)显示该行所涵盖的条目的索引值或索引值范围。(符号[i,j]表示从i到j的值,包括在内)。描述中一行的前6列描述了用于相应索引值的指令对。

If a line in the depiction includes a column entry using the [i,j] notation, this means that the line is instantiated for each value in the range from i to j, inclusively. The notation "0, [i,j]" means that the line is instantiated for the value 0 and for each value in the range from i to j, inclusively.

如果描述中的一行包含使用[i,j]表示法的列条目,这意味着该行被实例化为i到j范围内的每个值。表示法“0,[i,j]”表示为值0以及i到j范围内的每个值(包括)实例化该行。

If a line in the depiction includes more than one entry using the [i,j] notation, implying a "nested loop" to convert the line to a range of table entries, the first such [i,j] range specifies the outer loop, and the second specifies the inner loop.

如果描述中的一行包含多个使用[i,j]符号的条目,这意味着“嵌套循环”将该行转换为一系列表条目,则第一个[i,j]范围指定外部循环,第二个指定内部循环。

The below examples should make clear the above description:

以下示例应清楚说明上述说明:

Line 1 shows the single RUN instruction with index 0. As the size field is 0, this RUN instruction always has its actual size encoded separately.

第1行显示索引为0的单次运行指令。由于大小字段为0,因此此运行指令始终将其实际大小单独编码。

Line 2 shows the 18 single ADD instructions. The ADD instruction with size field 0 (i.e., the actual size is coded separately) has index 1. ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and their sizes are as given (so they will not be separately encoded.)

第2行显示了18条单添加指令。大小字段为0的ADD指令(即,实际大小单独编码)具有索引1。添加大小为1到17的指令使用代码索引2到18,并且它们的大小与给定的相同(因此它们不会单独编码)

Following the single ADD instructions are the single COPY instructions ordered by their address encoding modes. For example, line 11 shows the COPY instructions with mode 8, i.e., the last of the same cache. In this case, the COPY instruction with size field 0 has index 147. Again, the actual size of this instruction will be coded separately.

在单次添加指令之后是按地址编码模式排序的单次复制指令。例如,第11行显示具有模式8的复制指令,即相同缓存的最后一个。在这种情况下,大小字段为0的复制指令具有索引147。同样,此指令的实际大小将单独编码。

Lines 12 to 21 show the pairs of instructions that are combined together. For example, line 12 depicts the 12 entries in which an ADD instruction is combined with an immediately following COPY instruction. The entries with indices 163, 164, 165 represent the pairs in which the ADD instructions all have size 1, while the COPY instructions have mode 0 (VCD_SELF) and sizes 4, 5 and 6 respectively.

第12到21行显示了组合在一起的指令对。例如,第12行描述了12个条目,其中ADD指令与紧跟其后的COPY指令组合在一起。索引为163、164、165的条目表示ADD指令都具有大小1的对,而复制指令分别具有模式0(VCD_SELF)和大小4、5和6。

The last line, line 21, shows the eight instruction pairs, where the first instruction is a COPY and the second is an ADD. In this case, all COPY instructions have size 4 with mode ranging from 0 to 8 and all the ADD instructions have size 1. Thus, the entry with the largest index 255 combines a COPY instruction of size 4 and mode 8 with an ADD instruction of size 1.

最后一行,第21行,显示了八条指令对,其中第一条指令是副本,第二条是ADD。在这种情况下,所有复制指令的大小为4,模式范围为0到8,所有添加指令的大小为1。因此,具有最大索引255的条目将大小为4和模式8的复制指令与大小为1的添加指令组合在一起。

The choice of the minimum size 4 for COPY instructions in the default code table was made from experiments that showed that excluding small matches (less then 4 bytes long) improved the compression rates.

默认代码表中复制指令的最小大小4是根据实验选择的,实验表明排除小匹配(小于4字节)可以提高压缩率。

6. Decoding a Target Window
6. 解码目标窗口

Section 4.3 discusses that the delta instructions and associated data are encoded in three arrays of bytes:

第4.3节讨论了增量指令和相关数据以三个字节数组进行编码:

Data section for ADDs and RUNs, Instructions and sizes section, and Addresses section for COPYs.

添加和运行的数据部分、说明和大小部分以及副本的地址部分。

Further, these data sections may have been further compressed by some secondary compressor. Assuming that any such compressed data has been decompressed so that we now have three arrays:

此外,这些数据段可能已被一些辅助压缩机进一步压缩。假设任何此类压缩数据都已解压缩,因此我们现在有三个阵列:

inst: bytes coding the instructions and sizes. data: unmatched data associated with ADDs and RUNs. addr: bytes coding the addresses of COPYs.

指令:编码指令和大小的字节。数据:与添加和运行关联的不匹配数据。地址:编码副本地址的字节。

These arrays are organized as follows:

这些阵列的组织方式如下:

inst: a sequence of (index, [size1], [size2]) tuples, where "index" is an index into the instruction code table, and size1 and size2 are integers that MAY or MAY NOT be included in the tuple as follows. The entry with the given "index" in the instruction code table potentially defines two delta instructions. If the first delta instruction is not a VCD_NOOP and its size is zero, then size1 MUST be present. Otherwise, size1 MUST be omitted and the size of the instruction (if it is not VCD_NOOP) is as defined in the table. The presence or absence of size2 is defined similarly with respect to the second delta instruction.

说明:一个(index、[size1]、[size2])元组序列,其中“index”是指令代码表中的索引,而size1和size2是整数,可以包含在元组中,也可以不包含在元组中,如下所示。指令代码表中具有给定“索引”的条目可能定义两条增量指令。如果第一条delta指令不是VCD_NOOP且其大小为零,则必须存在size1。否则,必须省略size1,指令的大小(如果不是VCD_NOOP)如表中所定义。size2的存在或不存在与第二条delta指令的定义类似。

data: a sequence of data values, encoded as bytes.

数据:数据值序列,编码为字节。

addr: a sequence of address values. Addresses are normally encoded as integers as described in Section 2 (i.e., base 128). However, since the same cache emits addresses in the range [0,255], same cache addresses are always encoded as a single byte.

地址:地址值的序列。地址通常编码为第2节中所述的整数(即基数128)。但是,由于相同的缓存发出[0255]范围内的地址,因此相同的缓存地址始终编码为单个字节。

To summarize, each tuple in the "inst" array includes an index to some entry in the instruction code table that determines:

总之,“inst”数组中的每个元组都包含指令代码表中某个项的索引,该项决定:

a. Whether one or two instructions were encoded and their types.

a. 是否对一条或两条指令进行编码及其类型。

b. If the instructions have their sizes encoded separately, these sizes will follow, in order, in the tuple.

b. 如果指令的大小单独编码,则这些大小将按照元组的顺序排列。

c. If the instructions have accompanying data, i.e., ADDs or RUNs, their data will be in the array "data".

c. 如果指令具有附带数据,即添加或运行,则其数据将位于数组“data”中。

d. Similarly, if the instructions are COPYs, the coded addresses are found in the array "addr".

d. 类似地,如果指令是copy,则编码地址位于数组“addr”中。

The decoding procedure simply processes the arrays by reading one code index at a time, looking up the corresponding instruction code entry, then consuming the respective sizes, data and addresses following the directions in this entry. In other words, the decoder maintains an implicit next-element pointer for each array; "consuming" an instruction tuple, data, or address value implies incrementing the associated pointer.

解码过程通过一次读取一个代码索引,查找相应的指令代码条目,然后按照该条目中的指示使用相应的大小、数据和地址来简单地处理阵列。换句话说,解码器为每个数组维护一个隐式的下一个元素指针;“使用”指令元组、数据或地址值意味着增加相关指针。

For example, if during the processing of the target window, the next unconsumed tuple in the inst array has an index value of 19, then the first instruction is a COPY, whose size is found as the immediately following integer in the inst array. Since the mode of this COPY instruction is VCD_SELF, the corresponding address is found by consuming the next integer in the addr array. The data array is left intact. As the second instruction for code index 19 is a NOOP, this tuple is finished.

例如,如果在处理目标窗口期间,inst数组中下一个未使用的元组的索引值为19,则第一条指令是一个副本,其大小在inst数组中为紧跟其后的整数。由于此复制指令的模式是VCD_SELF,因此通过使用addr数组中的下一个整数可以找到相应的地址。数据数组保持不变。由于代码索引19的第二条指令是NOOP,因此该元组结束。

7. APPLICATION-DEFINED CODE TABLES
7. 应用程序定义的代码表

Although the default code table used in Vcdiff is good for general purpose encoders, there are times when other code tables may perform better. For example, to code a file with many identical segments of data, it may be advantageous to have a COPY instruction with the specific size of these data segments, so that the instruction can be encoded in a single byte. Such a special code table MUST then be encoded in the delta file so that the decoder can reconstruct it before decoding the data.

虽然Vcdiff中使用的默认代码表适用于通用编码器,但有时其他代码表的性能可能会更好。例如,要用许多相同的数据段对文件进行编码,最好使用具有这些数据段特定大小的复制指令,以便可以在单个字节中对指令进行编码。然后,必须在增量文件中对这种特殊的代码表进行编码,以便解码器可以在解码数据之前重建它。

Vcdiff allows an application-defined code table to be specified in a delta file with the following data:

Vcdiff允许在具有以下数据的增量文件中指定应用程序定义的代码表:

Size of near cache - byte Size of same cache - byte Compressed code table data

近缓存大小-相同缓存的字节大小-字节压缩代码表数据

The "compressed code table data" encodes the delta between the default code table (source) and the new code table (target) in the same manner as described in Section 4.3 for encoding a target window in terms of a source window. This delta is computed using the following steps:

“压缩代码表数据”以第4.3节中描述的相同方式对默认代码表(源)和新代码表(目标)之间的增量进行编码,以根据源窗口对目标窗口进行编码。使用以下步骤计算此增量:

a. Convert the new instruction code table into a string, "code", of 1536 bytes using the below steps in order:

a. 按顺序使用以下步骤将新指令代码表转换为1536字节的字符串“代码”:

i. Add in order the 256 bytes representing the types of the first instructions in the instruction pairs. ii. Add in order the 256 bytes representing the types of the second instructions in the instruction pairs. iii. Add in order the 256 bytes representing the sizes of the first instructions in the instruction pairs. iv. Add in order the 256 bytes representing the sizes of the second instructions in the instruction pairs. v. Add in order the 256 bytes representing the modes of the first instructions in the instruction pairs. vi. Add in order the 256 bytes representing the modes of the second instructions in the instruction pairs.

我按顺序添加表示指令对中第一条指令类型的256字节。二,。按顺序添加表示指令对中第二条指令类型的256字节。iii.按顺序添加表示指令对中第一条指令大小的256字节。iv.按顺序添加表示指令对中第二条指令大小的256字节。v按顺序添加表示指令对中第一条指令模式的256字节。vi.按顺序添加表示指令对中第二条指令模式的256字节。

b. Similarly, convert the default code table into a string "dflt".

b. 类似地,将默认代码表转换为字符串“dflt”。

c. Treat the string "code" as a target window and "dflt" as the corresponding source data and apply an encoding algorithm to compute the delta encoding of "code" in terms of "dflt". This computation MUST use the default code table for encoding the delta instructions.

c. 将字符串“code”视为目标窗口,“dflt”视为相应的源数据,并应用编码算法计算“code”在“dflt”方面的增量编码。此计算必须使用默认代码表来编码增量指令。

The decoder can then reverse the above steps to decode the compressed table data using the method of Section 6, employing the default code table, to generate the new code table. Note that the decoder does not need to know about the details of the encoding algorithm used in step (c). It is able to decode the new code table because the Vcdiff format is independent from the choice of encoding algorithm, and because the encoder in step (c) uses the known, default code table.

解码器随后可以反转上述步骤,使用第6部分的方法解码压缩表数据,使用默认代码表,以生成新代码表。注意,解码器不需要知道步骤(c)中使用的编码算法的细节。它能够解码新的代码表,因为Vcdiff格式独立于编码算法的选择,并且因为步骤(c)中的编码器使用已知的默认代码表。

8. Performance
8. 表演

The encoding format is compact. For compression only, using the LZ-77 string parsing strategy and without any secondary compressors, the typical compression rate is better than Unix compress and close to gzip. For differencing, the data format is better than all known methods in terms of its stated goal, which is primarily decoding speed and encoding efficiency.

编码格式紧凑。仅对于压缩,使用LZ-77字符串解析策略并且没有任何辅助压缩器,典型的压缩率比Unix compress好,接近gzip。对于差分,数据格式在其规定的目标方面优于所有已知方法,主要是解码速度和编码效率。

We compare the performance of compress, gzip and Vcdiff using the archives of three versions of the Gnu C compiler, gcc-2.95.1.tar, gcc-2.95.2.tar and gcc-2.95.3.tar. Gzip was used at its default compression level. The Vcdiff data were obtained using the Vcodex/Vcdiff software (Section 13).

我们使用gnuc编译器的三个版本gcc-2.95.1.tar、gcc-2.95.2.tar和gcc-2.95.3.tar的归档文件来比较compress、gzip和Vcdiff的性能。Gzip是在默认压缩级别使用的。使用Vcodex/Vcdiff软件获得Vcdiff数据(第13节)。

Below are the different Vcdiff runs:

以下是不同的Vcdiff运行:

Vcdiff: vcdiff is used as a compressor only.

Vcdiff:Vcdiff仅用作压缩机。

Vcdiff-d: vcdiff is used as a differencer only. That is, it only compares target data against source data. Since the files involved are large, they are broken into windows. In this case, each target window, starting at some file offset in the target file, is compared against a source window with the same file offset (in the source file). The source window is also slightly larger than the target window to increase matching opportunities.

Vcdiff-d:Vcdiff仅用作差分器。也就是说,它仅将目标数据与源数据进行比较。由于所涉及的文件很大,因此会将它们拆分到窗口中。在这种情况下,从目标文件中某个文件偏移量开始的每个目标窗口都会与具有相同文件偏移量的源窗口(在源文件中)进行比较。源窗口也略大于目标窗口,以增加匹配机会。

Vcdiff-dc: This is similar to Vcdiff-d, but vcdiff can also compare target data against target data as applicable. Thus, vcdiff both computes differences and compresses data. The windowing algorithm is the same as above. However, the above hint is recinded in this case.

Vcdiff dc:这与Vcdiff-d类似,但Vcdiff也可以在适用时将目标数据与目标数据进行比较。因此,vcdiff既计算差异又压缩数据。窗口算法与上述相同。但是,在这种情况下会收到上述提示。

Vcdiff-dcw: This is similar to Vcdiff-dc but the windowing algorithm uses a content-based heuristic to select a source window that is more likely to match with a given target window. Thus, the source data segment selected for a target window often will not be aligned with the file offsets of this target window.

Vcdiff-dcw:这与Vcdiff-dc类似,但窗口算法使用基于内容的启发式方法来选择更可能与给定目标窗口匹配的源窗口。因此,为目标窗口选择的源数据段通常不会与该目标窗口的文件偏移量对齐。

                       gcc-2.95.1     gcc-2.95.2     gcc-2.95.3
      ---------------------------------------------------------
      1. raw size      55,746,560     55,797,760     55,787,520
      2. compress         -           19,939,390     19,939,453
      3. gzip             -           12,973,443     12,998,097
      4. Vcdiff           -           15,358,786     15,371,737
      5. Vcdiff-d         -              100,971     26,383,849
      6. Vcdiff-dc        -               97,246     14,461,203
      7. Vcdiff-dcw       -              256,445      1,248,543
        
                       gcc-2.95.1     gcc-2.95.2     gcc-2.95.3
      ---------------------------------------------------------
      1. raw size      55,746,560     55,797,760     55,787,520
      2. compress         -           19,939,390     19,939,453
      3. gzip             -           12,973,443     12,998,097
      4. Vcdiff           -           15,358,786     15,371,737
      5. Vcdiff-d         -              100,971     26,383,849
      6. Vcdiff-dc        -               97,246     14,461,203
      7. Vcdiff-dcw       -              256,445      1,248,543
        

The above table shows the raw sizes of the tar files and the sizes of the compressed results. The differencing results in the gcc-2.95.2 column were obtained by compressing gcc-2.95.2, given gcc-2.95.1. The same results for the column gcc-2.95.3 were obtained by compressing gcc-2.95.3, given gcc-2.95.2.

上表显示了tar文件的原始大小和压缩结果的大小。在给定gcc-2.95.1的情况下,通过压缩gcc-2.95.2得到gcc-2.95.2列中的差异结果。在给定gcc-2.95.2的情况下,通过压缩gcc-2.95.3得到了gcc-2.95.3柱的相同结果。

Rows 2, 3 and 4 show that, for compression only, the compression rate from Vcdiff is worse than gzip and better than compress.

第2行、第3行和第4行显示,仅对于压缩,来自Vcdiff的压缩率比gzip差,比compress好。

The last three rows in the column gcc-2.95.2 show that when two file versions are very similar, differencing can give dramatically good compression rates. Vcdiff-d and Vcdiff-dc use the same simple window selection method of aligning by file offsets, but Vcdiff-dc also does compression so its output is slightly smaller. Vcdiff-dcw uses a content-based algorithm to search for source data that likely will match a given target window. Although it does a good job, the algorithm does not always find the best matches, which in this case, are given by the simple algorithm of Vcdiff-d. As a result, the output size for Vcdiff-dcw is slightly larger.

gcc-2.95.2列的最后三行显示,当两个文件版本非常相似时,差异可以提供非常好的压缩率。Vcdiff-d和Vcdiff-dc使用相同的简单窗口选择方法,即通过文件偏移对齐,但Vcdiff-dc也进行压缩,因此其输出稍小。Vcdiff dcw使用基于内容的算法搜索可能与给定目标窗口匹配的源数据。虽然它做得很好,但该算法并不总是找到最佳匹配,在这种情况下,Vcdiff-d的简单算法给出了最佳匹配。因此,Vcdiff dcw的输出大小稍大。

The situation is reversed in the gcc-2.95.3 column. Here, the files and their contents were sufficiently rearranged or changed between the making of the gcc-2.95.3.tar archive and the gcc-2.95.2 archive so that the simple method of aligning windows by file offsets no longer works. As a result, Vcdiff-d and Vcdiff-dc do not perform well. By allowing compression, along with differencing, Vcdiff-dc manages to beat Vcdiff-c, which does compression only. The content-based window matching algorithm in Vcdiff-dcw is effective in matching the right source and target windows so that Vcdiff-dcw is the overall winner.

gcc-2.95.3一栏中的情况正好相反。在这里,文件及其内容在gcc-2.95.3.tar归档文件和gcc-2.95.2归档文件之间进行了充分的重新排列或更改,因此通过文件偏移量对齐窗口的简单方法不再有效。因此,Vcdiff-d和Vcdiff-dc性能不佳。通过允许压缩和差分,Vcdiff-dc能够击败Vcdiff-c,后者只进行压缩。Vcdiff-dcw中基于内容的窗口匹配算法能够有效地匹配正确的源窗口和目标窗口,因此Vcdiff-dcw是最终的赢家。

9. Further Issues
9. 进一步问题

This document does not address a few issues:

本文件不涉及以下几个问题:

Secondary compressors: As discussed in Section 4.3, certain sections in the delta encoding of a window may be further compressed by a secondary compressor. In our experience, the basic Vcdiff format is adequate for most purposes so that secondary compressors are seldom needed. In particular, for normal use of data differencing, where the files to be compared have long stretches of matches, much of the gain in compression rate is already achieved by normal string matching. Thus, the use of secondary compressors is seldom needed in this case. However, for applications beyond differencing of such nearly identical files, secondary compressors may be needed to achieve maximal compressed results.

二级压缩器:如第4.3节所述,窗口增量编码中的某些部分可由二级压缩器进一步压缩。根据我们的经验,基本Vcdiff格式适合大多数用途,因此很少需要二次压缩机。特别是,对于数据差分的正常使用,要比较的文件有很长的匹配范围,压缩率的大部分增益已经通过正常字符串匹配实现。因此,在这种情况下很少需要使用二次压缩机。然而,对于这些几乎相同的文件的差异之外的应用程序,可能需要辅助压缩器来实现最大的压缩结果。

Therefore, we recommend leaving the Vcdiff data format defined as in this document so that the use of secondary compressors can be implemented when they become needed in the future. The formats of the compressed data via such compressors or any compressors that may be defined in the future are left open to their implementations. These could include Huffman encoding, arithmetic encoding, and splay tree encoding [8,9].

因此,我们建议保留本文件中定义的Vcdiff数据格式,以便在将来需要时使用二级压缩器。通过此类压缩器或将来可能定义的任何压缩器的压缩数据的格式留待其实现。这些可能包括哈夫曼编码、算术编码和八字树编码[8,9]。

Large file system vs. small file system: As discussed in Section 4, a target window in a large file may be compared against some source window in another file or in the same file (from some earlier part). In that case, the file offset of the source window is specified as a variable-sized integer in the delta encoding. There is a possibility that the encoding was computed on a system supporting much larger files than in a system where the data may be decoded (e.g., 64-bit file systems vs. 32- bit file systems). In that case, some target data may not be recoverable. This problem could afflict any compression format, and ought to be resolved with a generic negotiation mechanism in the appropriate protocol(s).

大文件系统与小文件系统:如第4节所述,可以将大文件中的目标窗口与另一个文件或同一文件中的某个源窗口(来自前面的部分)进行比较。在这种情况下,源窗口的文件偏移量在增量编码中指定为可变大小的整数。编码可能是在支持比数据可解码的系统(例如,64位文件系统与32位文件系统)大得多的文件的系统上计算的。在这种情况下,某些目标数据可能无法恢复。这个问题可能影响任何压缩格式,应该通过适当协议中的通用协商机制来解决。

10. Summary
10. 总结

We have described Vcdiff, a general and portable encoding format for compression and differencing. The format is good in that it allows implementing a decoder without knowledge of the encoders. Further, ignoring the use of secondary compressors not defined within the format, the decoding algorithms run in linear time and requires working space proportional to window size.

我们已经描述了Vcdiff,一种用于压缩和差分的通用和便携式编码格式。这种格式很好,因为它允许在不知道编码器的情况下实现解码器。此外,忽略格式中未定义的辅助压缩器的使用,解码算法以线性时间运行,并且需要与窗口大小成比例的工作空间。

11. Acknowledgements
11. 致谢

Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff who provided much encouragement to publicize Vcdiff. In particular, Jeff helped in clarifying the description of the data format presented here.

感谢巴拉昌德·克里希纳穆尔西、杰夫·莫格尔和阿瑟·范霍夫,他们为宣传Vcdiff提供了很多鼓励。特别是,Jeff帮助澄清了这里介绍的数据格式的描述。

12. Security Considerations
12. 安全考虑

Vcdiff only provides a format to encode compressed and differenced data. It does not address any issues concerning how such data are, in fact, stored in a given file system or the run-time memory of a computer system. Therefore, we do not anticipate any security issues with respect to Vcdiff.

Vcdiff仅提供一种格式来对压缩和差异数据进行编码。它没有解决任何关于这些数据实际上如何存储在给定的文件系统或计算机系统的运行时内存中的问题。因此,我们预计Vcdiff不会出现任何安全问题。

13. Source Code Availability
13. 源代码可用性

Vcdiff is implemented as a data transforming method in Phong Vo's Vcodex library. AT&T Corp. has made the source code for Vcodex available for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11]. The source code and according license is accessible at the below URL:

Vcdiff是在Phong Vo的Vcodex库中实现的一种数据转换方法。AT&T公司已经为任何人提供了Vcodex的源代码,供他们通过HTTP/1.1增量编码传输数据[10,11]。可通过以下URL访问源代码和相应的许可证:

      http://www.research.att.com/sw/tools
        
      http://www.research.att.com/sw/tools
        
14. Intellectual Property Rights
14. 知识产权

The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the online list of claimed rights, at <http://www.ietf.org/ipr.html>.

IETF已收到关于本文件所含部分或全部规范的知识产权声明。有关更多信息,请参阅在线权利主张列表,网址为<http://www.ietf.org/ipr.html>.

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP 11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

IETF对可能声称与本文件所述技术的实施或使用有关的任何知识产权或其他权利的有效性或范围,或此类权利下的任何许可可能或可能不可用的程度,不采取任何立场;它也不表示它已作出任何努力来确定任何此类权利。有关IETF在标准跟踪和标准相关文件中权利的程序信息,请参见BCP 11。可从IETF秘书处获得可供发布的权利声明副本和任何许可证保证,或本规范实施者或用户试图获得使用此类专有权利的一般许可证或许可的结果。

15. IANA Considerations
15. IANA考虑

The Internet Assigned Numbers Authority (IANA) administers the number space for Secondary Compressor ID values. Values and their meaning must be documented in an RFC or other peer-reviewed, permanent, and readily available reference, in sufficient detail so that interoperability between independent implementations is possible. Subject to these constraints, name assignments are First Come, First Served - see RFC 2434 [13]. Legal ID values are in the range 1..255.

互联网分配号码管理局(IANA)管理二次压缩机ID值的号码空间。值及其含义必须记录在RFC或其他同行评审的、永久的、随时可用的参考文件中,并提供足够的详细信息,以便独立实现之间的互操作性成为可能。根据这些限制条件,名称分配是先到先得-见RFC 2434[13]。合法ID值的范围为1..255。

This document does not define any values in this number space.

本文档不在此数字空间中定义任何值。

16. References
16. 工具书类

[1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, Practical Reusable Unix Software, Editor B. Krishnamurthy, John Wiley & Sons, Inc., 1995.

[1] D.G.Korn和K.P.Vo,Vdelta:差分和压缩,实用可重用Unix软件,编辑B.Krishnamurthy,John Wiley&Sons,Inc.,1995年。

[2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977.

[2] J.Ziv和A.Lempel,《顺序数据压缩的通用算法》,IEEE Trans。《信息论》,23(3):337-3431977。

[3] W. Tichy, The String-to-String Correction Problem with Block Moves, ACM Transactions on Computer Systems, 2(4):309-321, November 1984.

[3] W.Tichy,区块移动的串对串校正问题,计算机系统上的ACM事务,2(4):309-321,1984年11月。

[4] E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, Journal of the ACM, 23:262-272, 1976.

[4] E.M.McCreight,一种空间经济型后缀树构造算法,ACM杂志,23:262-2721976。

[5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, IEEE Software Configuration and Maintenance Workshop, 1996.

[5] J.J.Hunt,K.P.Vo,W.Tichy,Delta算法的实证研究,IEEE软件配置和维护研讨会,1996年。

[6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis, ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998.

[6] J.J.Hunt,K.P.Vo,W.Tichy,Delta算法:实证分析,ACM Trans。《软件工程与方法论》,7:192-214,1998。

[7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Proc. of the Summer '91 Usenix Conference, 1991.

[7] D.G.Korn,K.P.Vo,Sfio:缓冲I/O库,程序。1991年夏天的Usenix会议。

[8] D. W. Jones, Application of Splay Trees to Data Compression, CACM, 31(8):996:1007.

[8] D.W.Jones,Splay树在数据压缩中的应用,CACM,31(8):996:1007。

[9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851- 434-1, M&T Books, New York, NY, 1995.

[9] M.Nelson,J.Gailly,《数据压缩书》,ISBN 1-55851-434-1,M&T图书,纽约,1995年。

[10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, Potential benefits of delta encoding and data compression for HTTP, SIGCOMM '97, Cannes, France, 1997.

[10] J.C.Mogul、F.Douglis、A.Feldmann和B.Krishnamurthy,《HTTP增量编码和数据压缩的潜在好处》,SIGCOMM'97,法国戛纳,1997年。

[11] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A., Goland, Y. and A. Van Hoff, "Delta Encoding in HTTP", RFC 3229, January 2002.

[11] Mogul,J.,Krishnamurthy,B.,Douglis,F.,Feldmann,A.,Goland,Y.和A.Van Hoff,“HTTP中的增量编码”,RFC 3229,2002年1月。

[12] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[12] Bradner,S.,“RFC中用于表示需求水平的关键词”,BCP 14,RFC 2119,1997年3月。

[13] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 2434, October 1998.

[13] Narten,T.和H.Alvestrand,“在RFCs中编写IANA注意事项部分的指南”,BCP 26,RFC 2434,1998年10月。

[14] D.G. Korn and K.P. Vo, Engineering a Differencing and Compression Data Format, Submitted to Usenix'2002, 2001.

[14] D.G.Korn和K.P.Vo,《设计差分和压缩数据格式》,提交给Usenix'20022001。

17. Authors' Addresses
17. 作者地址

Kiem-Phong Vo (main contact) AT&T Labs, Room D223 180 Park Avenue Florham Park, NJ 07932

Kiem Phong Vo(主要联系人)美国电话电报公司实验室,地址:新泽西州弗洛勒姆公园公园大道180号D223室,邮编:07932

Phone: 1 973 360 8630 EMail: kpv@research.att.com

电话:1 973 360 8630电子邮件:kpv@research.att.com

David G. Korn AT&T Labs, Room D237 180 Park Avenue Florham Park, NJ 07932

David G.Korn AT&T实验室,地址:新泽西州弗洛勒姆公园公园大道180号D237室,邮编:07932

Phone: 1 973 360 8602 EMail: dgk@research.att.com

电话:1 973 360 8602电子邮件:dgk@research.att.com

Jeffrey C. Mogul Western Research Laboratory Hewlett-Packard Company 1501 Page Mill Road, MS 1251 Palo Alto, California, 94304, U.S.A.

Jeffrey C.Mogul西部研究实验室惠普公司美国加利福尼亚州帕洛阿尔托市佩奇米尔路1501号,邮编:94304。

Phone: 1 650 857 2206 (email preferred) EMail: JeffMogul@acm.org

电话:16508572206(首选电子邮件)电子邮件:JeffMogul@acm.org

Joshua P. MacDonald Computer Science Division University of California, Berkeley 345 Soda Hall Berkeley, CA 94720

约书亚·P·麦克唐纳德计算机科学系加利福尼亚大学,伯克利345苏打厅伯克利,CA 94720

   EMail: jmacd@cs.berkeley.edu
        
   EMail: jmacd@cs.berkeley.edu
        
18. Full Copyright Statement
18. 完整版权声明

Copyright (C) The Internet Society (2002). All Rights Reserved.

版权所有(C)互联网协会(2002年)。版权所有。

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

本文件及其译本可复制并提供给他人,对其进行评论或解释或协助其实施的衍生作品可全部或部分编制、复制、出版和分发,不受任何限制,前提是上述版权声明和本段包含在所有此类副本和衍生作品中。但是,不得以任何方式修改本文件本身,例如删除版权通知或对互联网协会或其他互联网组织的引用,除非出于制定互联网标准的需要,在这种情况下,必须遵循互联网标准过程中定义的版权程序,或根据需要将其翻译成英语以外的其他语言。

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

上述授予的有限许可是永久性的,互联网协会或其继承人或受让人不会撤销。

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

本文件和其中包含的信息是按“原样”提供的,互联网协会和互联网工程任务组否认所有明示或暗示的保证,包括但不限于任何保证,即使用本文中的信息不会侵犯任何权利,或对适销性或特定用途适用性的任何默示保证。

Acknowledgement

确认

Funding for the RFC Editor function is currently provided by the Internet Society.

RFC编辑功能的资金目前由互联网协会提供。