author: Anas Nashif <anas.nashif@intel.com> 2012-11-06 07:50:24 -0800
committer: Anas Nashif <anas.nashif@intel.com> 2012-11-06 07:50:24 -0800
commit: 060629c6ef0b7e5c267d84c91600113264d33120
tree: 18fcb144ac71b9c4d08ee5d1dc58e2b16c109a5a /docs/specs
Imported Upstream version 1.2.0 (upstream/1.2.0)
Diffstat (limited to 'docs/specs')
-rw-r--r--  docs/specs/acpi_pci_hotplug.txt     45
-rw-r--r--  docs/specs/ivshmem_device_spec.txt  96
-rw-r--r--  docs/specs/ppc-spapr-hcalls.txt     78
-rw-r--r--  docs/specs/qcow2.txt               352
-rw-r--r--  docs/specs/qed_spec.txt            138
5 files changed, 709 insertions, 0 deletions
diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
new file mode 100644
index 000000000..a839434f3
--- /dev/null
+++ b/docs/specs/acpi_pci_hotplug.txt
@@ -0,0 +1,45 @@
+QEMU<->ACPI BIOS PCI hotplug interface
+--------------------------------------
+
+QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document
+describes the interface between QEMU and the ACPI BIOS.
+
+ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
+-----------------------------------------
+
+Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
+event to ACPI BIOS, via SCI interrupt.
+
+PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
+---------------------------------------------------------------
+Slot injection notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.1 handler to notify OS of injection
+events. Read-only.
+
+PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
+-----------------------------------------------------
+Slot removal notification pending. One bit per slot.
+
+Read by ACPI BIOS GPE.1 handler to notify OS of removal
+events. Read-only.
+
+PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
+----------------------------------------
+
+Write: Used by ACPI BIOS _EJ0 method to request device removal.
+One bit per slot.
+
+Read: Hotplug features register. Used by platform to identify features
+available. Current base feature set (no bits set):
+ - Read-only "up" register @0xae00, 4-byte access, bit per slot
+ - Read-only "down" register @0xae04, 4-byte access, bit per slot
+ - Read/write "eject" register @0xae08, 4-byte access,
+ write: bit per slot eject, read: hotplug feature set
+ - Read-only hotplug capable register @0xae0c, 4-byte access, bit per slot
+
+PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
+-----------------------------------------------
+
+Used by ACPI BIOS _RMV method to indicate removability status to OS. One
+bit per slot. Read-only.
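Illustration for acpi_pci_hotplug.txt (not part of the imported file): a minimal Python sketch of how the one-bit-per-slot registers described above could be interpreted. The port constants are taken from the text; the function names are hypothetical, the 32-slot limit is inferred from the 4-byte register width, and no real port I/O is performed.

```python
# Port addresses as listed in the text above.
PCI_UP_BASE = 0xAE00     # read-only "up" (injection pending) bitmap
PCI_DOWN_BASE = 0xAE04   # read-only "down" (removal pending) bitmap
PCI_EJ_BASE = 0xAE08     # write: eject bitmap, read: hotplug feature set
PCI_RMV_BASE = 0xAE0C    # read-only removability bitmap

SLOTS = 32  # assumption: one bit per slot in a 4-byte register


def slots_from_bitmap(value):
    """Return the slot numbers whose bit is set in a 32-bit register value."""
    return [slot for slot in range(SLOTS) if value & (1 << slot)]


def eject_value(slot):
    """Return the value an _EJ0-style handler would write to eject one slot."""
    if not 0 <= slot < SLOTS:
        raise ValueError("slot must be in 0..31")
    return 1 << slot
```

For example, a value of 0b1010 read from the "up" register would indicate pending injection events for slots 1 and 3.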
diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
new file mode 100644
index 000000000..667a8628f
--- /dev/null
+++ b/docs/specs/ivshmem_device_spec.txt
@@ -0,0 +1,96 @@
+
+Device Specification for Inter-VM shared memory device
+------------------------------------------------------
+
+The Inter-VM shared memory device is designed to share a region of memory to
+userspace in multiple virtual guests. The memory region does not belong to any
+guest, but is a POSIX memory object on the host. Optionally, the device may
+support sending interrupts to other guests sharing the same memory region.
+
+
+The Inter-VM PCI device
+-----------------------
+
+*BARs*
+
+The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support
+registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is
+used to map the shared memory object from the host. The size of BAR2 is
+specified when the guest is started and must be a power of 2 in size.
+
+*Registers*
+
+The device currently supports 4 registers of 32-bits each. Registers
+are used for synchronization between guests sharing the same memory object when
+interrupts are supported (this requires using the shared memory server).
+
+The server assigns each VM an ID number and sends this ID number to the QEMU
+process when the guest starts.
+
+enum ivshmem_registers {
+ IntrMask = 0,
+ IntrStatus = 4,
+ IVPosition = 8,
+ Doorbell = 12
+};
+
+The first two registers are the interrupt mask and status registers. Mask and
+status are only used with pin-based interrupts. They are unused with MSI
+interrupts.
+
+Status Register: The status register is set to 1 when an interrupt occurs.
+
+Mask Register: The mask register is bitwise ANDed with the interrupt status
+and the result will raise an interrupt if it is non-zero. However, since 1 is
+the only value the status will be set to, it is only the first bit of the mask
+that has any effect. Therefore interrupts can be masked by setting the first
+bit to 0 and unmasked by setting the first bit to 1.
+
+IVPosition Register: The IVPosition register is read-only and reports the
+guest's ID number. The guest IDs are non-negative integers. When using the
+server, since the server is a separate process, the VM ID will only be set when
+the device is ready (shared memory is received from the server and accessible via
+the device). If the device is not ready, the IVPosition will return -1.
+Applications should ensure that they have a valid VM ID before accessing the
+shared memory.
+
+Doorbell Register: To interrupt another guest, a guest must write to the
+Doorbell register. The doorbell register is 32-bits, logically divided into
+two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
+16-bits are the interrupt vector to trigger. The semantics of the value
+written to the doorbell depends on whether the device is using MSI or a regular
+pin-based interrupt. In short, MSI uses vectors while regular interrupts set the
+status register.
+
+Regular Interrupts
+
+If regular interrupts are used (due to either a guest not supporting MSI or the
+user specifying not to use MSI on startup) then the value written to the lower
+16-bits of the Doorbell register is arbitrary and will trigger an
+interrupt in the destination guest.
+
+Message Signalled Interrupts
+
+An ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
+written to the Doorbell register must be between 0 and the maximum number of
+vectors the guest supports. The lower 16 bits written to the doorbell are the
+MSI vector that will be raised in the destination guest. The number of MSI
+vectors is configurable but it is set when the VM is started.
+
+The important thing to remember with MSI is that it is only a signal, no status
+is set (since MSI interrupts are not shared). All information other than the
+interrupt itself should be communicated via the shared memory region. Devices
+supporting multiple MSI vectors can use different vectors to indicate different
+events have occurred. The semantics of interrupt vectors are left to the
+user's discretion.
+
+
+Usage in the Guest
+------------------
+
+The shared memory device is intended to be used with the provided UIO driver.
+Very little configuration is needed. The guest should map BAR0 to access the
+registers (an array of 32-bit ints allows simple writing) and map BAR2 to
+access the shared memory region itself. The size of the shared memory region
+is specified when the guest (or shared memory server) is started. A guest may
+map the whole shared memory region or only part of it.
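Illustration for ivshmem_device_spec.txt (not part of the imported file): a minimal Python sketch of the register arithmetic described above. The register offsets and the 16/16-bit Doorbell split come from the text; the function names are hypothetical, and treating "-1" as the raw 32-bit value 0xFFFFFFFF is an assumption about how the read appears to the guest.

```python
# Register offsets within BAR0, as given by enum ivshmem_registers.
INTR_MASK = 0
INTR_STATUS = 4
IV_POSITION = 8
DOORBELL = 12


def doorbell_value(peer_id, vector):
    """Pack a destination guest ID (high 16 bits) and vector (low 16 bits)."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer_id must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    return (peer_id << 16) | vector


def split_doorbell(value):
    """Unpack a 32-bit Doorbell value into (peer_id, vector)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def iv_position(raw):
    """Interpret a raw IVPosition read; None means the device is not ready.

    Assumption: the documented -1 shows up as 0xFFFFFFFF in a 32-bit read.
    """
    return None if raw == 0xFFFFFFFF else raw


def pin_interrupt_raised(status, mask):
    """Pin-based interrupts fire when (status AND mask) is non-zero."""
    return (status & mask) != 0
```

With these helpers, signalling vector 1 on guest 2 would mean writing doorbell_value(2, 1) to the 32-bit word at offset DOORBELL in the mapped BAR0.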
diff --git a/docs/specs/ppc-spapr-hcalls.txt b/docs/specs/ppc-spapr-hcalls.txt
new file mode 100644
index 000000000..52ba8d42a
--- /dev/null
+++ b/docs/specs/ppc-spapr-hcalls.txt
@@ -0,0 +1,78 @@
+When used with the "pseries" machine type, QEMU-system-ppc64 implements
+a set of hypervisor calls using a subset of the server "PAPR" specification
+(IBM internal at this point), which is also what IBM's proprietary hypervisor
+adheres to.
+
+The subset is selected based on the requirements of Linux as a guest.
+
+In addition to those calls, we have added our own private hypervisor
+calls which are mostly used as a private interface between the firmware
+running in the guest and QEMU.
+
+All those hypercalls start at hcall number 0xf000 which corresponds
+to an implementation-specific range in PAPR.
+
+- H_RTAS (0xf000)
+
+RTAS is a set of runtime services generally provided by the firmware
+inside the guest to the operating system. It predates the existence
+of hypervisors (it was originally an extension to Open Firmware) and
+is still used by PAPR to provide various services that aren't performance
+sensitive.
+
+We currently implement the RTAS services in QEMU itself. The actual RTAS
+"firmware" blob in the guest is a small stub of a few instructions which
+calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU.
+
+Arguments:
+
+ r3 : H_RTAS (0xf000)
+ r4 : Guest physical address of RTAS parameter block
+
+Returns:
+
+ H_SUCCESS : Successfully called the RTAS function (RTAS result
+ will have been stored in the parameter block)
+ H_PARAMETER : Unknown token
+
+- H_LOGICAL_MEMOP (0xf001)
+
+When the guest runs in "real mode" (in powerpc lingo this means
+with MMU disabled, i.e. guest effective == guest physical), it only
+has access to a subset of memory and no IOs.
+
+PAPR provides a set of hypervisor calls to perform cacheable or
+non-cacheable accesses to any guest physical addresses that the
+guest can use in order to access IO devices while in real mode.
+
+This is typically used by the firmware running in the guest.
+
+However, doing a hypercall for each access is extremely inefficient
+(even more so when running KVM) when accessing the frame buffer. In
+that case, things like scrolling become unusably slow.
+
+This hypercall allows the guest to request a "memory op" to be applied
+to memory. The supported memory ops at this point are to copy a range
+of memory (supports overlap of source and destination) and XOR which
+is used by our SLOF firmware to invert the screen.
+
+Arguments:
+
+ r3: H_LOGICAL_MEMOP (0xf001)
+ r4: Guest physical address of destination
+ r5: Guest physical address of source
+ r6: Individual element size
+ 0 = 1 byte
+ 1 = 2 bytes
+ 2 = 4 bytes
+ 3 = 8 bytes
+ r7: Number of elements
+ r8: Operation
+ 0 = copy
+ 1 = xor
+
+Returns:
+
+ H_SUCCESS : Success
+ H_PARAMETER : Invalid argument
+
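Illustration for ppc-spapr-hcalls.txt (not part of the imported file): a minimal Python model of the H_LOGICAL_MEMOP argument semantics, operating on a bytearray that stands in for guest physical memory. Several points are assumptions rather than statements from the text: the return codes are modelled as symbolic strings because their numeric values are not given; the bounds check is one possible reading of "invalid argument"; and since the text does not name a second XOR operand, the sketch treats "xor" as XOR with all-ones (a bitwise invert of the source), which matches the stated use of inverting the screen.

```python
H_SUCCESS = "H_SUCCESS"      # symbolic stand-ins, numeric values not in the text
H_PARAMETER = "H_PARAMETER"

OP_COPY = 0
OP_XOR = 1


def logical_memop(mem, dst, src, esize, count, op):
    """Apply a copy or xor memory op to `mem` (a bytearray)."""
    if esize not in (0, 1, 2, 3) or op not in (OP_COPY, OP_XOR):
        return H_PARAMETER
    if count < 0 or dst < 0 or src < 0:
        return H_PARAMETER
    length = (1 << esize) * count  # element size in bytes times element count
    if dst + length > len(mem) or src + length > len(mem):
        return H_PARAMETER
    # Snapshot the source first so overlapping ranges behave like memmove.
    source = bytes(mem[src:src + length])
    if op == OP_XOR:
        # Assumption: XOR with all-ones, i.e. invert every source byte.
        source = bytes(b ^ 0xFF for b in source)
    mem[dst:dst + length] = source
    return H_SUCCESS
```

Copying four 1-byte elements from offset 0 to offset 2 of an 8-byte buffer exercises the overlapping case mentioned in the text.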
diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
new file mode 100644
index 000000000..36a559d88
--- /dev/null
+++ b/docs/specs/qcow2.txt
@@ -0,0 +1,352 @@
+== General ==
+
+A qcow2 image file is organized in units of constant size, which are called
+(host) clusters. A cluster is the unit in which all allocations are done,
+both for actual guest data and for image metadata.
+
+Likewise, the virtual disk as seen by the guest is divided into (guest)
+clusters of the same size.
+
+All numbers in qcow2 are stored in Big Endian byte order.
+
+
+== Header ==
+
+The first cluster of a qcow2 image contains the file header:
+
+ Byte 0 - 3: magic
+ QCOW magic string ("QFI\xfb")
+
+ 4 - 7: version
+ Version number (valid values are 2 and 3)
+
+ 8 - 15: backing_file_offset
+ Offset into the image file at which the backing file name
+ is stored (NB: The string is not null terminated). 0 if the
+ image doesn't have a backing file.
+
+ 16 - 19: backing_file_size
+ Length of the backing file name in bytes. Must not be
+ longer than 1023 bytes. Undefined if the image doesn't have
+ a backing file.
+
+ 20 - 23: cluster_bits
+ Number of bits that are used for addressing an offset
+ within a cluster (1 << cluster_bits is the cluster size).
+ Must not be less than 9 (i.e. 512 byte clusters).
+
+ Note: qemu as of today has an implementation limit of 2 MB
+ as the maximum cluster size and won't be able to open images
+ with larger cluster sizes.
+
+ 24 - 31: size
+ Virtual disk size in bytes
+
+ 32 - 35: crypt_method
+ 0 for no encryption
+ 1 for AES encryption
+
+ 36 - 39: l1_size
+ Number of entries in the active L1 table
+
+ 40 - 47: l1_table_offset
+ Offset into the image file at which the active L1 table
+ starts. Must be aligned to a cluster boundary.
+
+ 48 - 55: refcount_table_offset
+ Offset into the image file at which the refcount table
+ starts. Must be aligned to a cluster boundary.
+
+ 56 - 59: refcount_table_clusters
+ Number of clusters that the refcount table occupies
+
+ 60 - 63: nb_snapshots
+ Number of snapshots contained in the image
+
+ 64 - 71: snapshots_offset
+ Offset into the image file at which the snapshot table
+ starts. Must be aligned to a cluster boundary.
+
+If the version is 3 or higher, the header has the following additional fields.
+For version 2, the values are assumed to be zero, unless specified otherwise
+in the description of a field.
+
+ 72 - 79: incompatible_features
+ Bitmask of incompatible features. An implementation must
+ fail to open an image if an unknown bit is set.
+
+ Bit 0: Dirty bit. If this bit is set then refcounts
+ may be inconsistent, make sure to scan L1/L2
+ tables to repair refcounts before accessing the
+ image.
+
+ Bits 1-63: Reserved (set to 0)
+
+ 80 - 87: compatible_features
+ Bitmask of compatible features. An implementation can
+ safely ignore any unknown bits that are set.
+
+ Bit 0: Lazy refcounts bit. If this bit is set then
+ lazy refcount updates can be used. This means
+ marking the image file dirty and postponing
+ refcount metadata updates.
+
+ Bits 1-63: Reserved (set to 0)
+
+ 88 - 95: autoclear_features
+ Bitmask of auto-clear features. An implementation may only
+ write to an image with unknown auto-clear features if it
+ clears the respective bits from this field first.
+
+ Bits 0-63: Reserved (set to 0)
+
+ 96 - 99: refcount_order
+ Describes the width of a reference count block entry (width
+ in bits = 1 << refcount_order). For version 2 images, the
+ order is always assumed to be 4 (i.e. the width is 16 bits).
+
+ 100 - 103: header_length
+ Length of the header structure in bytes. For version 2
+ images, the length is always assumed to be 72 bytes.
+
+Directly after the image header, optional sections called header extensions can
+be stored. Each extension has a structure like the following:
+
+ Byte 0 - 3: Header extension type:
+ 0x00000000 - End of the header extension area
+ 0xE2792ACA - Backing file format name
+ 0x6803f857 - Feature name table
+ other - Unknown header extension, can be safely
+ ignored
+
+ 4 - 7: Length of the header extension data
+
+ 8 - n: Header extension data
+
+ n - m: Padding to round up the header extension size to the next
+ multiple of 8.
+
+Unless stated otherwise, each header extension type shall appear at most once
+in the same image.
+
+The remaining space between the end of the header extension area and the end of
+the first cluster can be used for the backing file name. It is not allowed to
+store other data here, so that an implementation can safely modify the header
+and add extensions without harming data of compatible features that it
+doesn't support. Compatible features that need space for additional data can
+use a header extension.
+
+
+== Feature name table ==
+
+The feature name table is an optional header extension that contains the name
+for features used by the image. It can be used by applications that don't know
+the respective feature (e.g. because the feature was introduced only later) to
+display a useful error message.
+
+The number of entries in the feature name table is determined by the length of
+the header extension data. Each entry looks like this:
+
+ Byte 0: Type of feature (select feature bitmap)
+ 0: Incompatible feature
+ 1: Compatible feature
+ 2: Autoclear feature
+
+ 1: Bit number within the selected feature bitmap (valid
+ values: 0-63)
+
+ 2 - 47: Feature name (padded with zeros, but not necessarily null
+ terminated if it has full length)
+
+
+== Host cluster management ==
+
+qcow2 manages the allocation of host clusters by maintaining a reference count
+for each host cluster. A refcount of 0 means that the cluster is free, 1 means
+that it is used, and >= 2 means that it is used and any write access must
+perform a COW (copy on write) operation.
+
+The refcounts are managed in a two-level table. The first level is called
+refcount table and has a variable size (which is stored in the header). The
+refcount table can cover multiple clusters, however it needs to be contiguous
+in the image file.
+
+It contains pointers to the second level structures which are called refcount
+blocks and are exactly one cluster in size.
+
+Given an offset into the image file, the refcount of its cluster can be obtained
+as follows:
+
+ refcount_block_entries = (cluster_size / sizeof(uint16_t))
+
+ refcount_block_index = (offset / cluster_size) % refcount_block_entries
+ refcount_table_index = (offset / cluster_size) / refcount_block_entries
+
+ refcount_block = load_cluster(refcount_table[refcount_table_index]);
+ return refcount_block[refcount_block_index];
+
+Refcount table entry:
+
+ Bit 0 - 8: Reserved (set to 0)
+
+ 9 - 63: Bits 9-63 of the offset into the image file at which the
+ refcount block starts. Must be aligned to a cluster
+ boundary.
+
+ If this is 0, the corresponding refcount block has not yet
+ been allocated. All refcounts managed by this refcount block
+ are 0.
+
+Refcount block entry (x = refcount_bits - 1, refcount_bits = 1 << refcount_order):
+
+ Bit 0 - x: Reference count of the cluster. If refcount_bits implies a
+ sub-byte width, note that bit 0 means the least significant
+ bit in this context.
+
+
+== Cluster mapping ==
+
+Just as for refcounts, qcow2 uses a two-level structure for the mapping of
+guest clusters to host clusters. They are called L1 and L2 table.
+
+The L1 table has a variable size (stored in the header) and may use multiple
+clusters, however it must be contiguous in the image file. L2 tables are
+exactly one cluster in size.
+
+Given an offset into the virtual disk, the offset into the image file can be
+obtained as follows:
+
+ l2_entries = (cluster_size / sizeof(uint64_t))
+
+ l2_index = (offset / cluster_size) % l2_entries
+ l1_index = (offset / cluster_size) / l2_entries
+
+ l2_table = load_cluster(l1_table[l1_index]);
+ cluster_offset = l2_table[l2_index];
+
+ return cluster_offset + (offset % cluster_size)
+
+L1 table entry:
+
+ Bit 0 - 8: Reserved (set to 0)
+
+ 9 - 55: Bits 9-55 of the offset into the image file at which the L2
+ table starts. Must be aligned to a cluster boundary. If the
+ offset is 0, the L2 table and all clusters described by this
+ L2 table are unallocated.
+
+ 56 - 62: Reserved (set to 0)
+
+ 63: 0 for an L2 table that is unused or requires COW, 1 if its
+ refcount is exactly one. This information is only accurate
+ in the active L1 table.
+
+L2 table entry:
+
+ Bit 0 - 61: Cluster descriptor
+
+ 62: 0 for standard clusters
+ 1 for compressed clusters
+
+ 63: 0 for a cluster that is unused or requires COW, 1 if its
+ refcount is exactly one. This information is only accurate
+ in L2 tables that are reachable from the active L1
+ table.
+
+Standard Cluster Descriptor:
+
+ Bit 0: If set to 1, the cluster reads as all zeros. The host
+ cluster offset can be used to describe a preallocation,
+ but it won't be used for reading data from this cluster,
+ nor is data read from the backing file if the cluster is
+ unallocated.
+
+ With version 2, this is always 0.
+
+ 1 - 8: Reserved (set to 0)
+
+ 9 - 55: Bits 9-55 of host cluster offset. Must be aligned to a
+ cluster boundary. If the offset is 0, the cluster is
+ unallocated.
+
+ 56 - 61: Reserved (set to 0)
+
+
+Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
+
+ Bit 0 - x: Host cluster offset. This is usually _not_ aligned to a
+ cluster boundary!
+
+ x+1 - 61: Compressed size of the images in sectors of 512 bytes
+
+If a cluster is unallocated, read requests shall read the data from the backing
+file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
+no backing file or the backing file is smaller than the image, they shall read
+zeros for all parts that are not covered by the backing file.
+
+
+== Snapshots ==
+
+qcow2 supports internal snapshots. Their basic principle of operation is to
+switch the active L1 table, so that a different set of host clusters is
+exposed to the guest.
+
+When creating a snapshot, the L1 table should be copied and the refcount of all
+L2 tables and clusters reachable from this L1 table must be increased, so that
+a write causes a COW and isn't visible in other snapshots.
+
+When loading a snapshot, bit 63 of all entries in the new active L1 table and
+all L2 tables referenced by it must be reconstructed from the refcount table
+as it doesn't need to be accurate in inactive L1 tables.
+
+A directory of all snapshots is stored in the snapshot table, a contiguous area
+in the image file, whose starting offset and length are given by the header
+fields snapshots_offset and nb_snapshots. The entries of the snapshot table
+have variable length, depending on the length of ID, name and extra data.
+
+Snapshot table entry:
+
+ Byte 0 - 7: Offset into the image file at which the L1 table for the
+ snapshot starts. Must be aligned to a cluster boundary.
+
+ 8 - 11: Number of entries in the L1 table of the snapshots
+
+ 12 - 13: Length of the unique ID string describing the snapshot
+
+ 14 - 15: Length of the name of the snapshot
+
+ 16 - 19: Time at which the snapshot was taken in seconds since the
+ Epoch
+
+ 20 - 23: Subsecond part of the time at which the snapshot was taken
+ in nanoseconds
+
+ 24 - 31: Time that the guest was running until the snapshot was
+ taken in nanoseconds
+
+ 32 - 35: Size of the VM state in bytes. 0 if no VM state is saved.
+ If there is VM state, it starts at the first cluster
+ described by the first L1 table entry that doesn't describe a
+ regular guest cluster (i.e. VM state is stored like guest
+ disk content, except that it is stored at offsets that are
+ larger than the virtual disk presented to the guest)
+
+ 36 - 39: Size of extra data in the table entry (used for future
+ extensions of the format)
+
+ variable: Extra data for future extensions. Unknown fields must be
+ ignored. Currently defined are (offset relative to snapshot
+ table entry):
+
+ Byte 40 - 47: Size of the VM state in bytes. 0 if no VM
+ state is saved. If this field is present,
+ the 32-bit value in bytes 32-35 is ignored.
+
+ Byte 48 - 55: Virtual disk size of the snapshot in bytes
+
+ Version 3 images must include extra data at least up to
+ byte 55.
+
+ variable: Unique ID string for the snapshot (not null terminated)
+
+ variable: Name of the snapshot (not null terminated)
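Illustration for qcow2.txt (not part of the imported file): a minimal Python sketch that unpacks header bytes 0-71 and reproduces the index arithmetic from the "Host cluster management" and "Cluster mapping" sections. The field order and widths follow the header table above; the function names are hypothetical, only the standard-library struct module is used, and the sketch assumes 16-bit refcount entries as in the spec's own pseudocode.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"

# Big-endian layout of header bytes 0-71 (72 bytes in total).
HEADER_V2 = struct.Struct(">4sIQIIQIIQQIIQ")
HEADER_FIELDS = (
    "magic", "version", "backing_file_offset", "backing_file_size",
    "cluster_bits", "size", "crypt_method", "l1_size", "l1_table_offset",
    "refcount_table_offset", "refcount_table_clusters", "nb_snapshots",
    "snapshots_offset",
)


def parse_header(data):
    """Unpack the fixed header fields and apply the checks stated in the text."""
    header = dict(zip(HEADER_FIELDS, HEADER_V2.unpack_from(data, 0)))
    if header["magic"] != QCOW2_MAGIC:
        raise ValueError("not a qcow2 image")
    if header["version"] not in (2, 3):
        raise ValueError("unsupported version")
    if header["cluster_bits"] < 9:
        raise ValueError("cluster_bits must not be less than 9")
    return header


def guest_offset_to_indices(offset, cluster_bits):
    """Return (l1_index, l2_index, offset_in_cluster) for a guest offset."""
    cluster_size = 1 << cluster_bits
    l2_entries = cluster_size // 8  # sizeof(uint64_t)
    cluster_index = offset // cluster_size
    return (cluster_index // l2_entries,
            cluster_index % l2_entries,
            offset % cluster_size)


def refcount_indices(offset, cluster_bits):
    """Return (refcount_table_index, refcount_block_index) for a host offset."""
    cluster_size = 1 << cluster_bits
    block_entries = cluster_size // 2  # sizeof(uint16_t)
    cluster_index = offset // cluster_size
    return cluster_index // block_entries, cluster_index % block_entries


def decode_standard_l2_entry(entry):
    """Split a 64-bit L2 entry using the standard cluster descriptor layout."""
    return {
        "copied": bool(entry >> 63),            # bit 63: refcount is exactly one
        "compressed": bool((entry >> 62) & 1),  # bit 62
        "reads_as_zero": bool(entry & 1),       # bit 0 (version 3 only)
        "host_offset": entry & (((1 << 56) - 1) & ~0x1FF),  # bits 9-55
    }
```

With 64 KB clusters (cluster_bits = 16), each L2 table holds 8192 entries, so guest offset 0x12345678 falls in L1 entry 0, L2 entry 0x1234, at byte 0x5678 within the cluster.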
diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
new file mode 100644
index 000000000..7982e058b
--- /dev/null
+++ b/docs/specs/qed_spec.txt
@@ -0,0 +1,138 @@
+=Specification=
+
+The file format looks like this:
+
+ +----------+----------+----------+-----+
+ | cluster0 | cluster1 | cluster2 | ... |
+ +----------+----------+----------+-----+
+
+The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2 table''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters.
+
+Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.
+
+All fields are little-endian.
+
+==Header==
+ Header {
+ uint32_t magic; /* QED\0 */
+
+ uint32_t cluster_size; /* in bytes */
+ uint32_t table_size; /* for L1 and L2 tables, in clusters */
+ uint32_t header_size; /* in clusters */
+
+ uint64_t features; /* format feature bits */
+ uint64_t compat_features; /* compat feature bits */
+ uint64_t autoclear_features; /* self-resetting feature bits */
+
+ uint64_t l1_table_offset; /* in bytes */
+ uint64_t image_size; /* total logical image size, in bytes */
+
+ /* if (features & QED_F_BACKING_FILE) */
+ uint32_t backing_filename_offset; /* in bytes from start of header */
+ uint32_t backing_filename_size; /* in bytes */
+ }
+
+Field descriptions:
+* ''cluster_size'' must be a power of 2 in range [2^12, 2^26].
+* ''table_size'' must be a power of 2 in range [1, 16].
+* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.
+* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:
+** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.
+** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.
+** An image with unknown ''autoclear_features'' bits enabled can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled again by a new program later.
+* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.
+* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.
+* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints. The string must be stored within the first ''header_size'' clusters. The backing filename may be an absolute path or relative to the image file.
+
+Feature bits:
+* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
+* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
+* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants.
+
+There are currently no defined ''compat_features'' or ''autoclear_features'' bits.
+
+Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
+
+==Tables==
+
+Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
+
+ #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
+
+ Table {
+ uint64_t offsets[TABLE_NOFFSETS];
+ }
+
+The tables are organized as follows:
+
+ +----------+
+ | L1 table |
+ +----------+
+ ,------' | '------.
+ +----------+ | +----------+
+ | L2 table | ... | L2 table |
+ +----------+ +----------+
+ ,------' | '------.
+ +----------+ | +----------+
+ | Data | ... | Data |
+ +----------+ +----------+
+
+A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
+
+The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
+ header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
+
+L1, L2, and data cluster offsets must be aligned to header.cluster_size. The following offsets have special meanings:
+
+===L2 table offsets===
+* 0 - unallocated. The L2 table is not yet allocated.
+
+===Data cluster offsets===
+* 0 - unallocated. The data cluster is not yet allocated.
+* 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.
+
+Future format extensions may wish to store per-offset information. The least significant 12 bits of an offset are reserved for this purpose and must be set to zero. Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed.
+
+===Unallocated L2 tables and data clusters===
+Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
+
+Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing file (or zeroes if no backing file) and the data being written.
+
+===Zero data clusters===
+Zero data clusters are a space-efficient way of storing zeroed regions of the image.
+
+Reads to a zero data cluster produce zeroes. Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file.
+
+Writes to a zero data cluster cause a new data cluster to be allocated. The new data cluster is populated with zeroes and the data being written.
+
+===Logical offset translation===
+Logical offsets are translated into cluster offsets as follows:
+
+ table_bits table_bits cluster_bits
+ <--------> <--------> <--------------->
+ +----------+----------+-----------------+
+ | L1 index | L2 index | byte offset |
+ +----------+----------+-----------------+
+
+ Structure of a logical offset
+
+ offset_mask = ~(cluster_size - 1) # mask for the image file byte offset
+
+ def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
+ l2_offset = l1_table[l1_index]
+ l2_table = load_table(l2_offset)
+ cluster_offset = l2_table[l2_index] & offset_mask
+ return cluster_offset + byte_offset
+
+==Consistency checking==
+
+This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.
+
+The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.
+
+Consistency check includes the following invariants:
+# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.
+# Offsets must be within the image file size and must be ''cluster_size'' aligned.
+# Table offsets must be at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.
+
+The consistency check process starts from ''l1_table_offset'' and scans all L2 tables. After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.
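Illustration for qed_spec.txt (not part of the imported file): a minimal Python sketch of how a logical offset could be split into the (L1 index, L2 index, byte offset) triple shown in the "Logical offset translation" diagram, together with the image-size bound from the "Tables" section. The constraints on cluster_size and table_size come from the field descriptions; the function names are hypothetical.

```python
def _is_power_of_two(value):
    return value > 0 and (value & (value - 1)) == 0


def table_noffsets(cluster_size, table_size):
    """Number of 64-bit offsets per table (TABLE_NOFFSETS in the spec)."""
    if not (_is_power_of_two(cluster_size) and 2**12 <= cluster_size <= 2**26):
        raise ValueError("cluster_size must be a power of 2 in [2^12, 2^26]")
    if not (_is_power_of_two(table_size) and 1 <= table_size <= 16):
        raise ValueError("table_size must be a power of 2 in [1, 16]")
    return table_size * cluster_size // 8  # sizeof(uint64_t)


def split_logical_offset(pos, cluster_size, table_size):
    """Return (l1_index, l2_index, byte_offset) for a logical offset."""
    noffsets = table_noffsets(cluster_size, table_size)
    cluster_bits = cluster_size.bit_length() - 1
    table_bits = noffsets.bit_length() - 1
    byte_offset = pos & (cluster_size - 1)
    l2_index = (pos >> cluster_bits) & (noffsets - 1)
    l1_index = pos >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset


def max_image_size(cluster_size, table_size):
    """Upper bound on header.image_size for the given geometry."""
    noffsets = table_noffsets(cluster_size, table_size)
    return noffsets * noffsets * cluster_size
```

For the geometry used as an example in the text (cluster_size = 64 KB, table_size = 4), each table holds 32768 offsets and the bound works out to 2^46 bytes.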