diff options
Diffstat (limited to 'docs/usermanual-clusters.xml')
-rw-r--r-- | docs/usermanual-clusters.xml | 931 |
1 files changed, 661 insertions, 270 deletions
diff --git a/docs/usermanual-clusters.xml b/docs/usermanual-clusters.xml index 608371b..228cc56 100644 --- a/docs/usermanual-clusters.xml +++ b/docs/usermanual-clusters.xml @@ -1,304 +1,695 @@ +<?xml version="1.0"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" + "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ + <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> + <!ENTITY version SYSTEM "version.xml"> +]> <chapter id="clusters"> -<sect1 id="clusters"> <title>Clusters</title> - <para> - In shaping text, a <emphasis>cluster</emphasis> is a sequence of - code points that needs to be treated as a single, indivisible unit. - </para> - <para> - When you add text to a HB buffer, each character is associated with - a <emphasis>cluster value</emphasis>. This is an arbitrary number as - far as HB is concerned. - </para> - <para> - Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the - actual number does not matter. Moreover, it is not required for the - cluster values to be monotonically increasing, but pretty much all - of HB's tests are performed on monotonically increasing cluster - numbers. Nevertheless, there is no such assumption in the code - itself. With that in mind, let's examine what happens with cluster - values during shaping under each cluster-level. - </para> - <para> - HarfBuzz provides three <emphasis>levels</emphasis> of clustering - support. Level 0 is the default behavior and reproduces the behavior - of the old HarfBuzz library. Level 1 tweaks this behavior slightly - to produce better results, so level 1 clustering is recommended for - code that is not required to implement backward compatibility with - the old HarfBuzz. - </para> - <para> - Level 2 differs significantly in how it treats cluster values. - Levels 0 and 1 both process ligatures and glyph decomposition by - merging clusters; level 2 does not. - </para> - <para> - The conceptual model for what the cluster values mean, in levels 0 - and 1, is this: - </para> - <itemizedlist spacing="compact"> - <listitem> - <para> - the sequence of cluster values will always remain monotone - </para> - </listitem> - <listitem> - <para> - each value represents a single cluster - </para> - </listitem> - <listitem> - <para> - each cluster contains one or more glyphs and one or more - characters - </para> - </listitem> - </itemizedlist> - <para> - Assuming that initial cluster numbers were monotonically increasing - and distinct, then all adjacent glyphs having the same cluster - number belong to the same cluster, and all characters belong to the - cluster that has the highest number not larger than their initial - cluster number. This will become clearer with an example. - </para> -</sect1> -<sect1 id="a-clustering-example-for-levels-0-and-1"> - <title>A clustering example for levels 0 and 1</title> - <para> - Let's say we start with the following character sequence and cluster - values: - </para> - <programlisting> - A,B,C,D,E - 0,1,2,3,4 -</programlisting> - <para> - We then map the characters to glyphs. For simplicity, let's assume - that each character maps to the corresponding, identical-looking - glyph: - </para> - <programlisting> - A,B,C,D,E - 0,1,2,3,4 -</programlisting> - <para> - Now if, for example, <literal>B</literal> and <literal>C</literal> - ligate, then the clusters to which they belong "merge". - This merged cluster takes for its cluster number the minimum of all - the cluster numbers of the clusters that went in. In this case, we - get: - </para> - <programlisting> - A,BC,D,E - 0,1 ,3,4 -</programlisting> - <para> - Now let's assume that the <literal>BC</literal> glyph decomposes - into three components, and <literal>D</literal> also decomposes into - two. The components each inherit the cluster value of their parent: - </para> - <programlisting> - A,BC0,BC1,BC2,D0,D1,E - 0,1 ,1 ,1 ,3 ,3 ,4 -</programlisting> - <para> - Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then - their clusters (numbers 1 and 3) merge into - <literal>min(1,3) = 1</literal>: - </para> - <programlisting> - A,BC0,BC1,BC2D0,D1,E - 0,1 ,1 ,1 ,1 ,4 -</programlisting> - <para> - At this point, cluster 1 means: the character sequence - <literal>BCD</literal> is represented by glyphs - <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any - further. - </para> -</sect1> -<sect1 id="reordering-in-levels-0-and-1"> - <title>Reordering in levels 0 and 1</title> - <para> - Another common operation in the more complex shapers is when things - reorder. In those cases, to maintain monotone clusters, HB merges - the clusters of everything in the reordering sequence. For example, - let's again start with the character sequence: - </para> - <programlisting> - A,B,C,D,E - 0,1,2,3,4 -</programlisting> - <para> - If <literal>D</literal> is reordered before <literal>B</literal>, - then the <literal>B</literal>, <literal>C</literal>, and - <literal>D</literal> clusters merge, and we get: - </para> - <programlisting> - A,D,B,C,E - 0,1,1,1,4 -</programlisting> - <para> - This is clearly not ideal, but it is the only sensible way to - maintain monotone indices and retain the true relationship between - glyphs and characters. - </para> -</sect1> -<sect1 id="the-distinction-between-levels-0-and-1"> - <title>The distinction between levels 0 and 1</title> - <para> - So, the above is pretty much what cluster levels 0 and 1 do. The - only difference between the two is this: in level 0, at the very - beginning of the shaping process, we also merge clusters between - base characters and all Unicode marks (combining or not) following - them. E.g.: - </para> - <programlisting> - A,acute,B - 0,1 ,2 -</programlisting> - <para> - will become: - </para> - <programlisting> - A,acute,B - 0,0 ,2 -</programlisting> - <para> - This is the default behavior. We do it because Windows did it and - old HarfBuzz did it, so this remained the default. But this behavior - makes it impossible to color diacritic marks differently from their - base characters. That's why in level 1 we do not perform this - initial merging step. - </para> - <para> - For clients, level 0 is more convenient if they rely on HarfBuzz - clusters for cursor positioning. But that's wrong anyway: cursor - positions should be determined based on Unicode grapheme boundaries, - NOT shaping clusters. As such, level 1 clusters are preferred. - </para> - <para> - One last note about levels 0 and 1. We currently don't allow a - <literal>MultipleSubst</literal> lookup to replace a glyph with zero - glyphs (i.e., to delete a glyph). But in some other situations, - glyphs can be deleted. In those cases, if the glyph being deleted is - the last glyph of its cluster, we make sure to merge the cluster - with a neighboring cluster. - </para> - <para> - This is, primarily, to make sure that the starting cluster of the - text always has the cluster index pointing to the start of the text - for the run; more than one client currently relies on this - guarantee. - </para> - <para> - Incidentally, Apple's CoreText does something else to maintain the - same promise: it inserts a glyph with id 65535 at the beginning of - the glyph string if the glyph corresponding to the first character - in the run was deleted. HarfBuzz might do something similar in the - future. - </para> -</sect1> -<sect1 id="level-2"> - <title>Level 2</title> - <para> - Level 2 is a different beast from levels 0 and 1. It is simple to - describe, but hard to make sense of. It simply doesn't do any - cluster merging whatsoever. When things ligate or otherwise multiple - glyphs turn into one, the cluster value of the first glyph is - retained. - </para> - <para> - Here are a few examples of why processing cluster values produced at - this level might be tricky: - </para> - <sect2 id="ligatures-with-combining-marks"> - <title>Ligatures with combining marks</title> - <para> - Imagine capital letters are bases and lower case letters are - combining marks. With an input sequence like this: + <section id="clusters-and-shaping"> + <title>Clusters and shaping</title> + <para> + In text shaping, a <emphasis>cluster</emphasis> is a sequence of + characters that needs to be treated as a single, indivisible + unit. A single letter or symbol can be a cluster of its + own. Other clusters correspond to longer subsequences of the + input code points — such as a ligature or conjunct form + — and require the shaper to ensure that the cluster is not + broken during the shaping process. + </para> + <para> + A cluster is distinct from a <emphasis>grapheme</emphasis>, + which is the smallest unit of meaning in a writing system or + script. + </para> + <para> + The definitions of the two terms are similar. However, clusters + are only relevant for script shaping and glyph layout. In + contrast, graphemes are a property of the underlying script, and + are of interest when client programs implement orthographic + or linguistic functionality. + </para> + <para> + For example, two individual letters are often two separate + graphemes. When two letters form a ligature, however, they + combine into a single glyph. They are then part of the same + cluster and are treated as a unit by the shaping engine — + even though the two original, underlying letters remain separate + graphemes. + </para> + <para> + HarfBuzz is concerned with clusters, <emphasis>not</emphasis> + with graphemes — although client programs using HarfBuzz + may still care about graphemes for other reasons from time to time. + </para> + <para> + During the shaping process, there are several shaping operations + that may merge adjacent characters (for example, when two code + points form a ligature or a conjunct form and are replaced by a + single glyph) or split one character into several (for example, + when decomposing a code point through the + <literal>ccmp</literal> feature). Operations like these alter + clusters; HarfBuzz tracks the changes to ensure that no clusters + get lost or broken during shaping. + </para> + <para> + HarfBuzz records cluster information independently from how + shaping operations affect the individual glyphs returned in an + output buffer. Consequently, a client program using HarfBuzz can + utilize the cluster information to implement features such as: + </para> + <itemizedlist> + <listitem> + <para> + Correctly positioning the cursor within a shaped text run, + even when characters have formed ligatures, composed or + decomposed, reordered, or undergone other shaping operations. + </para> + </listitem> + <listitem> + <para> + Correctly highlighting a text selection that includes some, + but not all, of the characters in a word. + </para> + </listitem> + <listitem> + <para> + Applying text attributes (such as color or underlining) to + part, but not all, of a word. + </para> + </listitem> + <listitem> + <para> + Generating output document formats (such as PDF) with + embedded text that can be fully extracted. + </para> + </listitem> + <listitem> + <para> + Determining the mapping between input characters and output + glyphs, such as which glyphs are ligatures. + </para> + </listitem> + <listitem> + <para> + Performing line-breaking, justification, and other + line-level or paragraph-level operations that must be done + after shaping is complete, but which require examining + character-level properties. + </para> + </listitem> + </itemizedlist> + </section> + <section id="working-with-harfbuzz-clusters"> + <title>Working with HarfBuzz clusters</title> + <para> + When you add text to a HarfBuzz buffer, each code point must be + assigned a <emphasis>cluster value</emphasis>. + </para> + <para> + This cluster value is an arbitrary number; HarfBuzz uses it only + to distinguish between clusters. Many client programs will use + the index of each code point in the input text stream as the + cluster value. This is for the sake of convenience; the actual + value does not matter. + </para> + <para> + Some of the shaping operations performed by HarfBuzz — + such as reordering, composition, decomposition, and substitution + — may alter the cluster values of some characters. The + final cluster values in the buffer at the end of the shaping + process will indicate to client programs which subsequences of + glyphs represent a cluster and, therefore, must not be + separated. + </para> + <para> + In addition, client programs can query the final cluster values + to discern other potentially important information about the + glyphs in the output buffer (such as whether or not a ligature + was formed). + </para> + <para> + For example, if the initial sequence of cluster values was: + </para> + <programlisting> + 0,1,2,3,4 + </programlisting> + <para> + and the final sequence of cluster values is: </para> <programlisting> - A,a,B,b,C,c - 0,1,2,3,4,5 -</programlisting> + 0,0,3,3 + </programlisting> <para> - if <literal>A,B,C</literal> ligate, then here are the cluster - values one would get under the various levels: + then there are two clusters in the output buffer: the first + cluster includes the first two glyphs, and the second cluster + includes the third and fourth glyphs. It is also evident that a + ligature or conjunct has been formed, because there are fewer + glyphs in the output buffer (four) than there were code points + in the input buffer (five). </para> <para> - level 0: + Although client programs using HarfBuzz are free to assign + initial cluster values in any manner they choose to, HarfBuzz + does offer some useful guarantees if the cluster values are + assigned in a monotonic (either non-decreasing or non-increasing) + order. + </para> + <para> + For left-to-right scripts (LTR) and top-to-bottom scripts (TTB), + HarfBuzz will preserve the monotonic property: client programs + are guaranteed that monotonically increasing initial clulster + values will be returned as monotonically increasing final + cluster values. + </para> + <para> + For right-to-left scripts (RTL) and bottom-to-top scripts (BTT), + the directionality of the buffer itself is reversed for final + output as a matter of design. Therefore, HarfBuzz inverts the + monotonic property: client programs are guaranteed that + monotonically increasing initial clulster values will be + returned as monotonically <emphasis>decreasing</emphasis> final + cluster values. + </para> + <para> + Client programs can adjust how HarfBuzz handles clusters during + shaping by setting the + <literal>cluster_level</literal> of the + buffer. HarfBuzz offers three <emphasis>levels</emphasis> of + clustering support for this property: + </para> + <itemizedlist> + <listitem> + <para><emphasis>Level 0</emphasis> is the default and + reproduces the behavior of the old HarfBuzz library. + </para> + <para> + The distinguishing feature of level 0 behavior is that, at + the beginning of processing the buffer, all code points that + are categorized as <emphasis>marks</emphasis>, + <emphasis>modifier symbols</emphasis>, or + <emphasis>Emoji extended pictographic</emphasis> modifiers, + as well as the <emphasis>Zero Width Joiner</emphasis> and + <emphasis>Zero Width Non-Joiner</emphasis> code points, are + assigned the cluster value of the closest preceding code + point from <emphasis>different</emphasis> category. + </para> + <para> + In essence, whenever a base character is followed by a mark + character or a sequence of mark characters, those marks are + reassigned to the same initial cluster value as the base + character. This reassignment is referred to as + "merging" the affected clusters. This behavior is based on + the Grapheme Cluster Boundary specification in <ulink + url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode + Technical Report 29</ulink>. + </para> + <para> + Client programs can specify level 0 behavior for a buffer by + setting its <literal>cluster_level</literal> to + <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>. + </para> + </listitem> + <listitem> + <para> + <emphasis>Level 1</emphasis> tweaks the old behavior + slightly to produce better results. Therefore, level 1 + clustering is recommended for code that is not required to + implement backward compatibility with the old HarfBuzz. + </para> + <para> + Level 1 differs from level 0 by not merging the + clusters of marks and other modifier code points with the + preceding "base" code point's cluster. By preserving the + separate cluster values of these marks and modifier code + points, script shapers can perform additional operations + that might lead to improved results (for example, reordering + a sequence of marks). + </para> + <para> + Client programs can specify level 1 behavior for a buffer by + setting its <literal>cluster_level</literal> to + <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>. + </para> + </listitem> + <listitem> + <para> + <emphasis>Level 2</emphasis> differs significantly in how it + treats cluster values. In level 2, HarfBuzz never merges + clusters. + </para> + <para> + This difference can be seen most clearly when HarfBuzz processes + ligature substitutions and glyph decompositions. In level 0 + and level 1, ligatures and glyph decomposition both involve + merging clusters; in level 2, neither of these operations + triggers a merge. + </para> + <para> + Client programs can specify level 2 behavior for a buffer by + setting its <literal>cluster_level</literal> to + <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>. + </para> + </listitem> + </itemizedlist> + <para> + As mentioned earlier, client programs using HarfBuzz often + assign initial cluster values in a buffer by reusing the indices + of the code points in the input text. This gives a sequence of + cluster values that is monotonically increasing (for example, + 0,1,2,3,4). + </para> + <para> + It is not <emphasis>required</emphasis> that the cluster values + in a buffer be monotonically increasing. However, if the initial + cluster values in a buffer are monotonic and the buffer is + configured to use cluster level 0 or 1, then HarfBuzz + guarantees that the final cluster values in the shaped buffer + will also be monotonic. No such guarantee is made for cluster + level 2. + </para> + <para> + In levels 0 and 1, HarfBuzz implements the following conceptual + model for cluster values: + </para> + <itemizedlist spacing="compact"> + <listitem> + <para> + If the sequence of input cluster values is monotonic, the + sequence of cluster values will remain monotonic. + </para> + </listitem> + <listitem> + <para> + Each cluster value represents a single cluster. + </para> + </listitem> + <listitem> + <para> + Each cluster contains one or more glyphs and one or more + characters. + </para> + </listitem> + </itemizedlist> + <para> + In practice, this model offers several benefits. Assuming that + the initial cluster values were monotonically increasing + and distinct before shaping began, then, in the final output: + </para> + <itemizedlist spacing="compact"> + <listitem> + <para> + All adjacent glyphs having the same final cluster + value belong to the same cluster. + </para> + </listitem> + <listitem> + <para> + Each character belongs to the cluster that has the highest + cluster value <emphasis>not larger than</emphasis> its + initial cluster value. + </para> + </listitem> + </itemizedlist> + </section> + + <section id="a-clustering-example-for-levels-0-and-1"> + <title>A clustering example for levels 0 and 1</title> + <para> + The basic shaping operations affect clusters in a predictable + manner when using level 0 or level 1: + </para> + <itemizedlist> + <listitem> + <para> + When two or more clusters <emphasis>merge</emphasis>, the + resulting merged cluster takes as its cluster value the + <emphasis>minimum</emphasis> of the incoming cluster values. + </para> + </listitem> + <listitem> + <para> + When a cluster <emphasis>decomposes</emphasis>, all of the + resulting child clusters inherit as their cluster value the + cluster value of the parent cluster. + </para> + </listitem> + <listitem> + <para> + When a character is <emphasis>reordered</emphasis>, the + reordered character and all clusters that the character + moves past as part of the reordering are merged into one cluster. + </para> + </listitem> + </itemizedlist> + <para> + The functionality, guarantees, and benefits of level 0 and level + 1 behavior can be seen with some examples. First, let us examine + what happens with cluster values when shaping involves cluster + merging with ligatures and decomposition. + </para> + + <para> + Let's say we start with the following character sequence (top row) and + initial cluster values (bottom row): </para> <programlisting> - ABC,a,b,c - 0 ,0,0,0 -</programlisting> + A,B,C,D,E + 0,1,2,3,4 + </programlisting> <para> - level 1: + During shaping, HarfBuzz maps these characters to glyphs from + the font. For simplicity, let us assume that each character maps + to the corresponding, identical-looking glyph: </para> <programlisting> - ABC,a,b,c - 0 ,0,0,5 -</programlisting> + A,B,C,D,E + 0,1,2,3,4 + </programlisting> <para> - level 2: + Now if, for example, <literal>B</literal> and <literal>C</literal> + form a ligature, then the clusters to which they belong + "merge". This merged cluster takes for its cluster + value the minimum of all the cluster values of the clusters that + went in to the ligature. In this case, we get: </para> <programlisting> - ABC,a,b,c - 0 ,1,3,5 -</programlisting> + A,BC,D,E + 0,1 ,3,4 + </programlisting> + <para> + because 1 is the minimum of the set {1,2}, which were the + cluster values of <literal>B</literal> and + <literal>C</literal>. + </para> + <para> + Next, let us say that the <literal>BC</literal> ligature glyph + decomposes into three components, and <literal>D</literal> also + decomposes into two components. Whenever a cluster decomposes, + its components each inherit the cluster value of their parent: + </para> + <programlisting> + A,BC0,BC1,BC2,D0,D1,E + 0,1 ,1 ,1 ,3 ,3 ,4 + </programlisting> + <para> + Next, if <literal>BC2</literal> and <literal>D0</literal> form a + ligature, then their clusters (cluster values 1 and 3) merge into + <literal>min(1,3) = 1</literal>: + </para> + <programlisting> + A,BC0,BC1,BC2D0,D1,E + 0,1 ,1 ,1 ,1 ,4 + </programlisting> + <para> + Note that the entirety of cluster 3 merges into cluster 1, not + just the <literal>D0</literal> glyph. This reflects the fact + that the cluster <emphasis>must</emphasis> be treated as an + indivisible unit. + </para> <para> - Making sense of the last example is the hardest for a client, - because there is nothing in the cluster values to suggest that - <literal>B</literal> and <literal>C</literal> ligated with - <literal>A</literal>. + At this point, cluster 1 means: the character sequence + <literal>BCD</literal> is represented by glyphs + <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any + further. </para> - </sect2> - <sect2 id="reordering"> - <title>Reordering</title> + </section> + <section id="reordering-in-levels-0-and-1"> + <title>Reordering in levels 0 and 1</title> <para> - Another tricky case is when things reorder. Under level 2: + Another common operation in the more complex shapers is glyph + reordering. In order to maintain a monotonic cluster sequence + when glyph reordering takes place, HarfBuzz merges the clusters + of everything in the reordering sequence. + </para> + <para> + For example, let us again start with the character sequence (top + row) and initial cluster values (bottom row): </para> <programlisting> - A,B,C,D,E - 0,1,2,3,4 -</programlisting> + A,B,C,D,E + 0,1,2,3,4 + </programlisting> <para> - Now imagine <literal>D</literal> moves before - <literal>B</literal>: + If <literal>D</literal> is reordered to the position immediately + before <literal>B</literal>, then HarfBuzz merges the + <literal>B</literal>, <literal>C</literal>, and + <literal>D</literal> clusters — all the clusters between + the final position of the reordered glyph and its original + position. This means that we get: </para> <programlisting> - A,D,B,C,E - 0,3,1,2,4 -</programlisting> + A,D,B,C,E + 0,1,1,1,4 + </programlisting> + <para> + as the final cluster sequence. + </para> + <para> + Merging this many clusters is not ideal, but it is the only + sensible way for HarfBuzz to maintain the guarantee that the + sequence of cluster values remains monotonic and to retain the + true relationship between glyphs and characters. + </para> + </section> + <section id="the-distinction-between-levels-0-and-1"> + <title>The distinction between levels 0 and 1</title> + <para> + The preceding examples demonstrate the main effects of using + cluster levels 0 and 1. The only difference between the two + levels is this: in level 0, at the very beginning of the shaping + process, HarfBuzz merges the cluster of each base character + with the clusters of all Unicode marks (combining or not) and + modifiers that follow it. + </para> <para> - Now, if <literal>D</literal> ligates with <literal>B</literal>, we - get: + For example, let us start with the following character sequence + (top row) and accompanying initial cluster values (bottom row): </para> <programlisting> - A,DB,C,E - 0,3 ,2,4 -</programlisting> + A,acute,B + 0,1 ,2 + </programlisting> <para> - In a different scenario, <literal>A</literal> and - <literal>B</literal> could have ligated - <emphasis>before</emphasis> <literal>D</literal> reordered; that - would have resulted in: + The <literal>acute</literal> is a Unicode mark. If HarfBuzz is + using cluster level 0 on this sequence, then the + <literal>A</literal> and <literal>acute</literal> clusters will + merge, and the result will become: </para> <programlisting> - AB,D,C,E - 0 ,3,2,4 -</programlisting> + A,acute,B + 0,0 ,2 + </programlisting> + <para> + This merger is performed before any other script-shaping + steps. + </para> + <para> + This initial cluster merging is the default behavior of the + Windows shaping engine, and the old HarfBuzz codebase copied + that behavior to maintain compatibility. Consequently, it has + remained the default behavior in the new HarfBuzz codebase. + </para> + <para> + But this initial cluster-merging behavior makes it impossible + client programs to implement some features (such as to + color diacritic marks differently from their base + characters). That is why, in level 1, HarfBuzz does not perform + the initial merging step. + </para> + <para> + For client programs that rely on HarfBuzz cluster values to + perform cursor positioning, level 0 is more convenient. But + relying on cluster boundaries for cursor positioning is wrong: cursor + positions should be determined based on Unicode grapheme + boundaries, not on shaping-cluster boundaries. As such, using + level 1 clustering behavior is recommended. + </para> <para> - There's no way to differentiate between these two scenarios based - on the cluster numbers alone. + One final facet of levels 0 and 1 is worth noting. HarfBuzz + currently does not allow any + <emphasis>multiple-substitution</emphasis> GSUB lookups to + replace a glyph with zero glyphs (in other words, to delete a + glyph). </para> <para> - Another problem happens with ligatures under level 2 if the - direction of the text is forced to opposite of its natural - direction (e.g. left-to-right Arabic). But that's too much of a - corner case to worry about. + But, in some other situations, glyphs can be deleted. In + those cases, if the glyph being deleted is the last glyph of its + cluster, HarfBuzz makes sure to merge the deleted glyph's + cluster with a neighboring cluster. </para> - </sect2> -</sect1> + <para> + This is done primarily to make sure that the starting cluster of the + text always has the cluster index pointing to the start of the text + for the run; more than one client program currently relies on this + guarantee. + </para> + <para> + Incidentally, Apple's CoreText does something different to + maintain the same promise: it inserts a glyph with id 65535 at + the beginning of the glyph string if the glyph corresponding to + the first character in the run was deleted. HarfBuzz might do + something similar in the future. + </para> + </section> + <section id="level-2"> + <title>Level 2</title> + <para> + HarfBuzz's level 2 cluster behavior uses a significantly + different model than that of level 0 and level 1. + </para> + <para> + The level 2 behavior is easy to describe, but it may be + difficult to understand in practical terms. In brief, level 2 + performs no merging of clusters whatsoever. + </para> + <para> + This means that there is no initial base-and-mark merging step + (as is done in level 0), and it means that reordering moves and + ligature substitutions do not trigger a cluster merge. + </para> + <para> + Only one shaping operation directly affects clusters when using + level 2: + </para> + <itemizedlist> + <listitem> + <para> + When a cluster <emphasis>decomposes</emphasis>, all of the + resulting child clusters inherit as their cluster value the + cluster value of the parent cluster. + </para> + </listitem> + </itemizedlist> + <para> + When glyphs do form a ligature (or when some other feature + substitutes multiple glyphs with one glyph) the cluster value + of the first glyph is retained as the cluster value for the + resulting ligature. + </para> + <para> + This occurrence sounds similar to a cluster merge, but it is + different. In particular, no subsequent characters — + including marks and modifiers — are affected. They retain + their previous cluster values. + </para> + <para> + Level 2 cluster behavior is ultimately less complex than level 0 + or level 1, but there are several cases for which processing + cluster values produced at level 2 may be tricky. + </para> + <section id="ligatures-with-combining-marks-in-level-2"> + <title>Ligatures with combining marks in level 2</title> + <para> + The first example of how HarfBuzz's level 2 cluster behavior + can be tricky is when the text to be shaped includes combining + marks attached to ligatures. + </para> + <para> + Let us start with an input sequence with the following + characters (top row) and initial cluster values (bottom row): + </para> + <programlisting> + A,acute,B,breve,C,circumflex + 0,1 ,2,3 ,4,5 + </programlisting> + <para> + If the sequence <literal>A,B,C</literal> forms a ligature, + then these are the cluster values HarfBuzz will return under + the various cluster levels: + </para> + <para> + Level 0: + </para> + <programlisting> + ABC,acute,breve,circumflex + 0 ,0 ,0 ,0 + </programlisting> + <para> + Level 1: + </para> + <programlisting> + ABC,acute,breve,circumflex + 0 ,0 ,0 ,5 + </programlisting> + <para> + Level 2: + </para> + <programlisting> + ABC,acute,breve,circumflex + 0 ,1 ,3 ,5 + </programlisting> + <para> + Making sense of the level 2 result is the hardest for a client + program, because there is nothing in the cluster values that + indicates that <literal>B</literal> and <literal>C</literal> + formed a ligature with <literal>A</literal>. + </para> + <para> + In contrast, the "merged" cluster values of the mark glyphs + that are seen in the level 0 and level 1 output are evidence + that a ligature substitution took place. + </para> + </section> + <section id="reordering-in-level-2"> + <title>Reordering in level 2</title> + <para> + Another example of how HarfBuzz's level 2 cluster behavior + can be tricky is when glyphs reorder. Consider an input sequence + with the following characters (top row) and initial cluster + values (bottom row): + </para> + <programlisting> + A,B,C,D,E + 0,1,2,3,4 + </programlisting> + <para> + Now imagine <literal>D</literal> moves before + <literal>B</literal> in a reordering operation. The cluster + values will then be: + </para> + <programlisting> + A,D,B,C,E + 0,3,1,2,4 + </programlisting> + <para> + Next, if <literal>D</literal> forms a ligature with + <literal>B</literal>, the output is: + </para> + <programlisting> + A,DB,C,E + 0,3 ,2,4 + </programlisting> + <para> + However, in a different scenario, in which the shaping rules + of the script instead caused <literal>A</literal> and + <literal>B</literal> to form a ligature + <emphasis>before</emphasis> the <literal>D</literal> reordered, the + result would be: + </para> + <programlisting> + AB,D,C,E + 0 ,3,2,4 + </programlisting> + <para> + There is no way for a client program to differentiate between + these two scenarios based on the cluster values + alone. Consequently, client programs that use level 2 might + need to undertake additional work in order to manage cursor + positioning, text attributes, or other desired features. + </para> + </section> + <section id="other-considerations-in-level-2"> + <title>Other considerations in level 2</title> + <para> + There may be other problems encountered with ligatures under + level 2, such as if the direction of the text is forced to + opposite of its natural direction (for example, Arabic text + that is forced into left-to-right directionality). But, + generally speaking, these other scenarios are minor corner + cases that are too obscure for most client programs to need to + worry about. + </para> + </section> + </section> </chapter> |