Designing the Delta Format

Rich text editors lack a specification to express their own contents. Until recently, most rich text editors did not even know what was in their own edit areas. These editors just pass HTML to the user, along with the burden of parsing and interpreting it. At any given time, this interpretation will differ from those of major browser vendors, leading to different editing experiences for users.

Quill is the first rich text editor to actually understand its own contents. Key to this are Deltas, the specification describing rich text. Deltas are designed to be easy to understand and use. We will walk through some of the thinking behind Deltas, to shed light on why things are the way they are.

If you are looking for a reference on what Deltas are, the Delta documentation is a more concise resource.

Plain Text

Let's start with the basics: plain text. There is already a ubiquitous format for storing plain text: the string. If we want to build upon this and describe formatted text, such as when a range is bold, we need to add additional information.

Arrays are the only other ordered data type available, so we use an array of objects. This also allows us to leverage JSON for compatibility with a breadth of tools.

const content = [
  { text: 'Hello' },
  { text: 'World', bold: true }
];

We can add italics, underline, and other formats to the main object if we want to, but it is cleaner to separate text from all of this, so we organize formatting under one field, which we will name attributes.

const content = [
  { text: 'Hello' },
  { text: 'World', attributes: { bold: true } }
];

Compact

Even our simple Delta format so far is unpredictable: the "Hello World" example above can also be represented differently, so we cannot predict which representation will be generated:

const content = [
  { text: 'Hel' },
  { text: 'lo' },
  { text: 'World', attributes: { bold: true } }
];

To solve this, we add the constraint that Deltas must be compact. With this constraint, the above representation is not a valid Delta, since it can be represented more compactly by the previous example, where "Hel" and "lo" were not separate. Similarly, we cannot have { bold: false, italic: true, underline: null }, because { italic: true } is more compact.
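As a minimal sketch of what enforcing compactness could look like, the following merges neighboring segments with equal attributes and drops attribute values that carry no information. The helper names cleanAttributes and compact are hypothetical, and treating null, undefined, and false as "no format" is an assumption drawn from the example above, not a rule stated by any library.

```javascript
// Hypothetical helper: drop attribute values that carry no information.
// Assumption: null, undefined, and false all mean "format not set".
// Keys are sorted so that key order does not affect later comparisons.
function cleanAttributes(attributes = {}) {
  const cleaned = {};
  for (const key of Object.keys(attributes).sort()) {
    const value = attributes[key];
    if (value !== null && value !== undefined && value !== false) {
      cleaned[key] = value;
    }
  }
  return cleaned;
}

// Hypothetical helper: merge neighboring segments whose attributes match.
// Comparing via JSON.stringify assumes attribute values are primitives.
function compact(content) {
  const result = [];
  for (const segment of content) {
    if (segment.text === '') continue; // empty text adds nothing
    const attributes = cleanAttributes(segment.attributes);
    const last = result[result.length - 1];
    const sameFormat =
      last && JSON.stringify(last.attributes || {}) === JSON.stringify(attributes);
    if (sameFormat) {
      last.text += segment.text;
    } else if (Object.keys(attributes).length > 0) {
      result.push({ text: segment.text, attributes });
    } else {
      result.push({ text: segment.text });
    }
  }
  return result;
}

const loose = [
  { text: 'Hel' },
  { text: 'lo' },
  { text: 'World', attributes: { bold: true, underline: null } }
];
const compacted = compact(loose);
console.log(JSON.stringify(compacted));
```

Here "Hel" and "lo" collapse into a single segment and the underline: null entry disappears, leaving the two-segment form shown earlier.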

Canonical

We have not assigned any meaning to bold, just that it describes some formatting for text. We could very well have used different names, such as weighted or strong, or used a different range of possible values, such as a numerical or descriptive range of weights. An example can be found in CSS, where most of these ambiguities are at play. If we see bold text on a page, we cannot predict whether its rule set is font-weight: bold or font-weight: 700. This makes the task of parsing CSS to discern its meaning much more complex.

We do not define the set of possible attributes, nor their meanings, but we do add an additional constraint that Deltas must be canonical. If two Deltas are equal, the content they represent must be equal, and there cannot be two unequal Deltas that represent the same content. Programmatically, this allows you to simply deep compare two Deltas to determine if the content they represent is equal.

So if we had the following, the only conclusion we can draw is that a is different from b, not what a or b means.

const content = [{
  text: "Mystery",
  attributes: {
    a: true,
    b: true
  }
}];

It is up to the implementer to pick appropriate names:

const content = [{
  text: "Mystery",
  attributes: {
    italic: true,
    bold: true
  }
}];
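The deep comparison that the canonical constraint enables can be sketched as follows. The deepEqual helper is hypothetical and written only for JSON-like values (objects, arrays, strings, numbers, booleans, null). It ignores key order inside objects, since JSON objects are unordered.

```javascript
// Hypothetical helper: structural equality for JSON-like values.
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (key) => Object.prototype.hasOwnProperty.call(b, key) && deepEqual(a[key], b[key])
  );
}

const first = [{ text: 'Mystery', attributes: { italic: true, bold: true } }];
const second = [{ text: 'Mystery', attributes: { bold: true, italic: true } }];
const third = [{ text: 'Mystery', attributes: { bold: true } }];

console.log(deepEqual(first, second)); // true
console.log(deepEqual(first, third)); // false
```

Because the format is canonical, a false result here means the content really differs, not merely that the same content was spelled two ways.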

This canonicalization applies to both keys and values, in both text and attributes. For example, Quill by default:

  • Uses six-character hex values to represent colors, not RGB

  • Has only one way to represent a newline, which is \n, not \r or \r\n

  • text: "Hello  World" unambiguously means there are precisely two spaces between "Hello" and "World"

Some of these choices may be customized by the user, but the canonical constraint in Deltas dictates that the choice must be unique.
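To make the idea of a unique value representation concrete, here is a small sketch of two normalizers. Both canonicalNewlines and canonicalColor are hypothetical names, not Quill functions, and the choice of lowercase hex digits is an assumption made only for this example.

```javascript
// Hypothetical helper: collapse \r\n and lone \r into \n.
function canonicalNewlines(text) {
  return text.replace(/\r\n?/g, '\n');
}

// Hypothetical helper: turn "rgb(r, g, b)" into a six-character hex value.
// Strings in any other form are returned unchanged.
function canonicalColor(color) {
  const match = /^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$/.exec(color);
  if (!match) return color;
  const hex = match
    .slice(1)
    .map((channel) => Math.min(255, Number(channel)).toString(16).padStart(2, '0'))
    .join('');
  return `#${hex}`;
}

console.log(JSON.stringify(canonicalNewlines('a\r\nb\rc\n'))); // "a\nb\nc\n"
console.log(canonicalColor('rgb(255, 0, 128)')); // #ff0080
```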

This unambiguous predictability makes Deltas easier to work with, both because you have fewer cases to handle and because there are no surprises in what a corresponding Delta will look like. Long term, this makes applications using Deltas easier to understand and maintain.

Line Formatting

Line formats affect the contents of an entire line, so they present an interesting challenge for our compact and canonical constraints. A seemingly reasonable way to represent centered text would be the following:

const content = [
  { text: "Hello", attributes: { align: "center" } },
  { text: "\nWorld" }
];

But what if the user deletes the newline character? If we just naively get rid of the newline character, the Delta would now look like this:

const content = [
  { text: "Hello", attributes: { align: "center" } },
  { text: "World" }
];

Is this line still centered? If the answer is no, then our representation is not compact, since we do not need the attribute object and can combine the two strings:

const content = [
  { text: "HelloWorld" }
];

But if the answer is yes, then we violate the canonical constraint, since any permutation of characters having an align attribute would represent the same content.

So we cannot just naively get rid of the newline character. We also have to either get rid of line attributes, or expand them to fill all characters on the line.

What if we removed the newline from the following?

const content = [
  { text: "Hello", attributes: { align: "center" } },
  { text: "\n" },
  { text: "World", attributes: { align: "right" } }
];

It is not clear whether our resulting line is aligned center or right. We could delete both or have some ordering rule to favor one over the other, but our Delta is becoming more complex and harder to work with on this path.

This problem begs for atomicity, and we find this in the newline character itself. But we have an off-by-one problem: if we have n lines, we only have n-1 newline characters.

To solve this, Quill "adds" a newline to all documents and always ends Deltas with "\n".

// Hello World on two lines
const content = [
  { text: "Hello" },
  { text: "\n", attributes: { align: "center" } },
  { text: "World" },
  { text: "\n", attributes: { align: "right" } } // Deltas must end with newline
];
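One way to see why attaching line formats to the newline works is to read the lines back out. The sketch below groups content into lines and takes each line's formats from the "\n" that ends it. The toLines helper and its return shape are hypothetical, and it assumes that a segment carrying line attributes contains only newline characters, as in the example above.

```javascript
// Hypothetical helper: group content into lines, reading each line's
// formats from the attributes of the "\n" that ends it.
// Assumption: segments with line attributes contain only "\n" characters.
function toLines(content) {
  const lines = [];
  let current = [];
  for (const segment of content) {
    const pieces = segment.text.split('\n');
    pieces.forEach((piece, index) => {
      if (index > 0) {
        // A newline was crossed: close the current line.
        lines.push({ segments: current, attributes: segment.attributes || {} });
        current = [];
      }
      if (piece !== '') {
        const entry = { text: piece };
        if (segment.attributes) entry.attributes = segment.attributes;
        current.push(entry);
      }
    });
  }
  return lines;
}

const twoLines = [
  { text: 'Hello' },
  { text: '\n', attributes: { align: 'center' } },
  { text: 'World' },
  { text: '\n', attributes: { align: 'right' } }
];

for (const line of toLines(twoLines)) {
  const text = line.segments.map((segment) => segment.text).join('');
  console.log(`${line.attributes.align}: ${text}`);
}
```

Because every line ends in exactly one newline, deleting a newline removes exactly one line's formatting, and the question of which alignment survives has a single answer.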

Embedded Content

We want to add embedded content like images or video. Strings were natural to use for text, but we have a lot more options for embeds. Since there are different types of embeds, our choice just needs to include this type information, and then the actual content. There are many reasonable options here, but we will use an object whose only key is the embed type and whose value is the content representation, which may have any type or value.

const img = {
  image: {
    url: 'https://quilljs.com/logo.png'
  }
};
const f = {
  formula: 'e=mc^2'
};

Similar to text, images might have some defining characteristics and some transient ones. We used attributes for text content and can use the same attributes field for images. Because of this, we can keep the general structure we have been using, but should rename our text key to something more general. For reasons we will explore later, we will choose the name insert. Putting this all together we have:

const content = [{
  insert: 'Hello'
}, {
  insert: 'World',
  attributes: { bold: true }
}, {
  insert: {
    image: 'https://exclamation.com/mark.png'
  },
  attributes: { width: '100' }
}];

Describing Changes

As the name Delta implies, our format can describe changes to documents, as well as documents themselves. In fact, we can think of a document as the changes we would make to the empty document to get to the one we are describing. As you might have already guessed, using Deltas to also describe changes is why we renamed text to insert earlier. We call each element in our Delta array an Operation.

Delete

To describe deleting text, we need to know where and how many characters to delete. To delete embeds, there need not be any special treatment, other than to understand the length of an embed. If it were anything other than one, we would need to specify what happens when only part of an embed is deleted. There is currently no such specification, so regardless of how many pixels make up an image, how many minutes long a video is, or how many slides are in a deck, embeds are all of length one.
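The length rule above can be sketched in a few lines. The helper names opLength and contentLength are hypothetical, and measuring text with the string's length property (UTF-16 code units) is an assumption of this sketch.

```javascript
// Hypothetical helper: text counts one per character, any embed counts as one.
// Assumption: characters are measured in UTF-16 code units via String length.
function opLength(op) {
  return typeof op.insert === 'string' ? op.insert.length : 1;
}

// Hypothetical helper: total length of a document made of insert Operations.
function contentLength(content) {
  return content.reduce((total, op) => total + opLength(op), 0);
}

const documentOps = [
  { insert: 'Hello' },
  { insert: 'World', attributes: { bold: true } },
  { insert: { image: 'https://exclamation.com/mark.png' }, attributes: { width: '100' } }
];

console.log(contentLength(documentOps)); // 11
```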

One reasonable way to describe a deletion is to explicitly store its index and length.

const delta = [{
  delete: {
    index: 4,
    length: 1
  }
}, {
  delete: {
    index: 12,
    length: 3
  }
}];

We would have to order the deletions by index and ensure that no ranges overlap, otherwise our canonical constraint would be violated. There are a couple of other shortcomings to this index and length approach, but they are easier to appreciate after describing format changes.
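The ordering and overlap rules for this index-and-length form could be checked with something like the following. The isValidDeleteList helper is hypothetical. Rejecting ranges that merely touch is an extra assumption of this sketch, on the reasoning that two touching deletes could be merged into one and would therefore not be compact.

```javascript
// Hypothetical check for index-and-length deletes: entries must be sorted
// by index and their ranges must neither overlap nor touch.
function isValidDeleteList(delta) {
  let previousEnd = -1; // allows the first delete to start at index 0
  for (const op of delta) {
    const { index, length } = op.delete;
    if (!Number.isInteger(index) || !Number.isInteger(length) || length <= 0) {
      return false;
    }
    if (index <= previousEnd) return false; // out of order, overlapping, or touching
    previousEnd = index + length;
  }
  return true;
}

const ordered = [
  { delete: { index: 4, length: 1 } },
  { delete: { index: 12, length: 3 } }
];
const reversed = [ordered[1], ordered[0]];
const overlapping = [
  { delete: { index: 4, length: 3 } },
  { delete: { index: 5, length: 1 } }
];

console.log(isValidDeleteList(ordered)); // true
console.log(isValidDeleteList(reversed)); // false
console.log(isValidDeleteList(overlapping)); // false
```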

Insert

Now that Deltas may be describing changes to a non-empty document, { insert: "Hello" } is insufficient, because we do not know where "Hello" should be inserted. We can solve this by also adding an index, similar to delete.

Format

Similar to deletes, we need to specify the range of text to format, along with the format change itself. Formatting exists in the attributes object, so a simple solution is to provide an additional attributes object to merge with the existing one. This merge is shallow to keep things simple. We have not found a use case compelling enough to require a deep merge and warrant the added complexity.

const delta = [{
  format: {
    index: 4,
    length: 1
  },
  attributes: {
    bold: true
  }
}];

The special case is when we want to remove formatting. We will use null for this purpose, so { bold: null } would mean remove the bold format. We could have specified any falsy value, but there may be legitimate use cases for an attribute value to be 0 or the empty string.
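The shallow merge with null as the removal marker can be sketched as follows. The mergeAttributes name is hypothetical. Note that 0 and the empty string survive the merge, which is the reason given above for not treating every falsy value as a removal.

```javascript
// Hypothetical helper: shallow-merge a format change into existing attributes.
// A null value removes the key; every other value, including 0 and '', is kept.
function mergeAttributes(existing = {}, change = {}) {
  const merged = { ...existing };
  for (const [key, value] of Object.entries(change)) {
    if (value === null) {
      delete merged[key];
    } else {
      merged[key] = value;
    }
  }
  return merged;
}

const mergedAttributes = mergeAttributes(
  { bold: true, indent: 0 },
  { bold: null, italic: true }
);
console.log(JSON.stringify(mergedAttributes)); // {"indent":0,"italic":true}
```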

Note: We now have to be careful with indexes at the application layer. As mentioned earlier, Deltas do not ascribe any inherent meaning to any of the attributes' key-value pairs, nor to any embed types or values. Deltas do not know that an image does not have a duration, that text does not have alternative text, or that videos cannot be bolded. The following is a legal Delta that might have been the result of applying other legal Deltas, by an application not being careful with format ranges.

const delta = [{
  insert: {
    image: "https://imgur.com/"
  },
  attributes: {
    duration: 600
  }
}, {
  insert: "Hello",
  attributes: {
    alt: "Funny cat photo"
  }
}, {
  insert: {
    video: "https://youtube.com/"
  },
  attributes: {
    bold: true
  }
}];

Pitfalls

First, we should be clear that an index must refer to its position in the document before any Operations are applied. Otherwise, a later Operation may delete a previous insert, unformat a previous format, etc., which would violate compactness.

Operations must also be strictly ordered to satisfy our canonical constraint. Ordering by index, then length, and then type is one valid way to accomplish this.

As stated earlier, delete ranges cannot overlap. The case against overlapping format ranges is less brief, but it turns out we do not want overlapping formats either.

The number of reasons a Delta might be invalid is piling up. A better format would simply not allow such cases to be expressed at all.

Retain

If we step back from our compactness formalities for a moment, we can describe a much simpler format to express inserting, deleting, and formatting:

  • A Delta would have Operations that are at least as long as the document being modified.

  • Each Operation would describe what happens to the character at that index.

  • Optional insert Operations may make the Delta longer than the document it describes.

This necessitates the creation of a new Operation that simply means "keep this character as is". We call this a retain.

// Starting with "HelloWorld",
// bold "Hello", and insert a space right after it
const change = [
  { format: true, attributes: { bold: true } }, // H
  { format: true, attributes: { bold: true } }, // e
  { format: true, attributes: { bold: true } }, // l
  { format: true, attributes: { bold: true } }, // l
  { format: true, attributes: { bold: true } }, // o
  { insert: ' ' },
  { retain: true }, // W
  { retain: true }, // o
  { retain: true }, // r
  { retain: true }, // l
  { retain: true } // d
];

Since every character is described, explicit indexes and lengths are no longer necessary. This makes overlapping ranges and out-of-order indexes impossible to express.

Therefore, we can make the easy optimization of merging adjacent equal Operations, re-introducing length. If the last Operation is a retain, we can simply drop it, for it only instructs us to "do nothing to the rest of the document".

const change = [
  { format: 5, attributes: { bold: true } },
  { insert: ' ' }
];
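The merge step just described can be sketched as a small function that turns the per-character form into run lengths and drops a trailing retain. The helper names kindOf and mergeRuns are hypothetical, and comparing attributes with JSON.stringify assumes they are written with the same key order.

```javascript
// Hypothetical helper: an Operation's kind is its first key other than attributes.
function kindOf(op) {
  return Object.keys(op).find((key) => key !== 'attributes');
}

// Hypothetical helper: merge adjacent equal per-character Operations into
// runs with a length, then drop a trailing retain.
function mergeRuns(perCharacter) {
  const merged = [];
  for (const op of perCharacter) {
    const kind = kindOf(op);
    const last = merged[merged.length - 1];
    const sameRun =
      last !== undefined &&
      kindOf(last) === kind &&
      JSON.stringify(last.attributes || {}) === JSON.stringify(op.attributes || {});
    if (kind === 'insert') {
      const bothText = sameRun && typeof last.insert === 'string' && typeof op.insert === 'string';
      if (bothText) last.insert += op.insert; // join adjacent text
      else merged.push({ ...op }); // embeds are never merged
    } else if (sameRun) {
      last[kind] += 1; // extend the current run by one character
    } else {
      merged.push({ ...op, [kind]: 1 }); // start a new run of length one
    }
  }
  const tail = merged[merged.length - 1];
  if (tail !== undefined && kindOf(tail) === 'retain') merged.pop();
  return merged;
}

const perCharacter = [
  ...Array.from({ length: 5 }, () => ({ format: true, attributes: { bold: true } })),
  { insert: ' ' },
  ...Array.from({ length: 5 }, () => ({ retain: true }))
];
const mergedChange = mergeRuns(perCharacter);
console.log(JSON.stringify(mergedChange));
```

For the eleven per-character Operations of the "HelloWorld" example, this leaves one format run of length 5 followed by the inserted space.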

Furthermore, you might notice that a retain is in some ways just a special case of format. For instance, there is no practical difference between { format: 1, attributes: {} } and { retain: 1 }. Compacting would drop the empty attributes object, leaving us with just { format: 1 } and creating a canonicalization conflict. Thus, in our example we will simply combine format and retain, and keep the name retain.

const change = [
  { retain: 5, attributes: { bold: true } },
  { insert: ' ' }
];

We now have a Delta that is very close to the current standard format.
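To see the retain-based format at work, here is a minimal sketch that applies a change to a document made of insert Operations. All helper names (toUnits, fromUnits, mergeAttributes, applyChange) are hypothetical, and the sketch assumes that delete takes a count in the same way retain does. It is an illustration of the idea, not the algorithm used by any library.

```javascript
// Expand insert Operations into one unit per character or embed.
function toUnits(ops) {
  const units = [];
  for (const op of ops) {
    const values = typeof op.insert === 'string' ? op.insert.split('') : [op.insert];
    for (const value of values) units.push({ value, attributes: op.attributes || {} });
  }
  return units;
}

// Shallow merge; a null value removes the key.
function mergeAttributes(existing, change = {}) {
  const merged = { ...existing };
  for (const [key, value] of Object.entries(change)) {
    if (value === null) delete merged[key];
    else merged[key] = value;
  }
  return merged;
}

// Recombine units into compact insert Operations.
function fromUnits(units) {
  const ops = [];
  for (const { value, attributes } of units) {
    const last = ops[ops.length - 1];
    const sameFormat =
      last !== undefined &&
      JSON.stringify(last.attributes || {}) === JSON.stringify(attributes);
    if (sameFormat && typeof value === 'string' && typeof last.insert === 'string') {
      last.insert += value;
    } else if (Object.keys(attributes).length > 0) {
      ops.push({ insert: value, attributes });
    } else {
      ops.push({ insert: value });
    }
  }
  return ops;
}

// Walk the change once, consuming the document from left to right.
function applyChange(documentOps, change) {
  const source = toUnits(documentOps);
  const result = [];
  let position = 0;
  for (const op of change) {
    if (op.insert !== undefined) {
      result.push(...toUnits([op])); // new content, consumes nothing
    } else if (op.delete !== undefined) {
      position += op.delete; // skip over deleted units
    } else if (op.retain !== undefined) {
      for (const unit of source.slice(position, position + op.retain)) {
        result.push({
          value: unit.value,
          attributes: mergeAttributes(unit.attributes, op.attributes)
        });
      }
      position += op.retain;
    }
  }
  result.push(...source.slice(position)); // implicit trailing retain
  return fromUnits(result);
}

const before = [{ insert: 'HelloWorld\n' }];
const boldAndSpace = [{ retain: 5, attributes: { bold: true } }, { insert: ' ' }];
const after = applyChange(before, boldAndSpace);
console.log(JSON.stringify(after));
```

Because position only ever moves forward, overlapping ranges and out-of-order indexes cannot arise, which is the property the retain design was chosen for.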

Ops

Right now we have an easy-to-use JSON array that describes rich text. This is great at the storage and transport layers, but applications could benefit from more functionality. We can add this by implementing Deltas as a class that can be easily initialized from or exported to JSON, and then providing it with relevant methods.

At the time of Delta's inception, it was not possible to subclass an Array. For this reason Deltas are expressed as Objects, with a single property ops that stores an array of Operations like the ones we have been discussing.

const delta = {
  ops: [{
    insert: 'Hello'
  }, {
    insert: 'World',
    attributes: { bold: true }
  }, {
    insert: {
      image: 'https://exclamation.com/mark.png'
    },
    attributes: { width: '100' }
  }]
};

Finally, we arrive at the Delta format as it exists today.
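As a rough illustration of wrapping the ops array in a class, consider the sketch below. SketchDelta, its constructor behavior, and its fromJSON and length methods are hypothetical and written only for this example. They should not be read as the API of the published Delta library.

```javascript
// A minimal sketch of a class around { ops: [...] }; not a real library API.
class SketchDelta {
  constructor(source = []) {
    // Accept either a bare array of Operations or an object with an ops array.
    const ops = Array.isArray(source) ? source : source.ops;
    this.ops = ops.map((op) => ({ ...op }));
  }

  // Initialize from a JSON string.
  static fromJSON(json) {
    return new SketchDelta(JSON.parse(json));
  }

  // Called by JSON.stringify, so the exported form is { ops: [...] }.
  toJSON() {
    return { ops: this.ops };
  }

  // Text counts one per character, embeds count as one,
  // and retain or delete count their stated length.
  length() {
    return this.ops.reduce((total, op) => {
      if (typeof op.insert === 'string') return total + op.insert.length;
      if (op.insert !== undefined) return total + 1;
      return total + (op.retain || op.delete || 0);
    }, 0);
  }
}

const sketch = new SketchDelta({
  ops: [
    { insert: 'Hello' },
    { insert: 'World', attributes: { bold: true } },
    { insert: { image: 'https://exclamation.com/mark.png' }, attributes: { width: '100' } }
  ]
});

const roundTripped = SketchDelta.fromJSON(JSON.stringify(sketch));
console.log(roundTripped.length()); // 11
```

Keeping the data in a plain ops property means the JSON form stays the storage and transport format, while methods such as length live on the class.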