summaryrefslogtreecommitdiff
path: root/docs
diff options
context:
space:
mode:
Diffstat (limited to 'docs')
-rw-r--r--docs/bitmaps.md163
-rw-r--r--docs/build-system.txt507
-rw-r--r--docs/libcacard.txt483
-rw-r--r--docs/migration.txt191
-rw-r--r--docs/multiseat.txt2
-rw-r--r--docs/qapi-code-gen.txt519
-rw-r--r--docs/qcow2-cache.txt164
-rw-r--r--docs/qmp-events.txt (renamed from docs/qmp/qmp-events.txt)12
-rw-r--r--docs/qmp-intro.txt (renamed from docs/qmp/README)0
-rw-r--r--docs/qmp-spec.txt (renamed from docs/qmp/qmp-spec.txt)5
-rw-r--r--docs/rcu.txt2
-rw-r--r--docs/replay.txt168
-rw-r--r--docs/specs/fw_cfg.txt94
-rw-r--r--docs/specs/ivshmem_device_spec.txt127
-rw-r--r--docs/specs/ppc-spapr-hcalls.txt4
-rw-r--r--docs/specs/ppc-spapr-hotplug.txt48
-rw-r--r--docs/specs/qcow2.txt2
-rw-r--r--docs/specs/rocker.txt2
-rw-r--r--docs/specs/vhost-user.txt208
-rw-r--r--docs/tracing.txt10
-rw-r--r--docs/virtio-migration.txt106
-rw-r--r--docs/win32-qemu-event.promela98
-rw-r--r--docs/writing-qmp-commands.txt30
23 files changed, 2261 insertions, 684 deletions
diff --git a/docs/bitmaps.md b/docs/bitmaps.md
index fa87f077f..a2e8d5116 100644
--- a/docs/bitmaps.md
+++ b/docs/bitmaps.md
@@ -19,12 +19,20 @@ which is included at the end of this document.
* A dirty bitmap's name is unique to the node, but bitmaps attached to different
nodes can share the same name.
+* Dirty bitmaps created for internal use by QEMU may be anonymous and have no
+ name, but any user-created bitmaps may not be. There can be any number of
+ anonymous bitmaps per node.
+
+* The name of a user-created bitmap must not be empty ("").
+
## Bitmap Modes
* A Bitmap can be "frozen," which means that it is currently in-use by a backup
operation and cannot be deleted, renamed, written to, reset,
etc.
+* The normal operating mode for a bitmap is "active."
+
## Basic QMP Usage
### Supported Commands ###
@@ -97,11 +105,7 @@ which is included at the end of this document.
}
```
-## Transactions (Not yet implemented)
-
-* Transactional commands are forthcoming in a future version,
- and are not yet available for use. This section serves as
- documentation of intent for their design and usage.
+## Transactions
### Justification
@@ -323,6 +327,155 @@ full backup as a backing image.
"event": "BLOCK_JOB_COMPLETED" }
```
+### Partial Transactional Failures
+
+* Sometimes, a transaction will succeed in launching and return success,
+ but then later the backup jobs themselves may fail. It is possible that
+ a management application may have to deal with a partial backup failure
+ after a successful transaction.
+
+* If multiple backup jobs are specified in a single transaction, when one of
+ them fails, it will not interact with the other backup jobs in any way.
+
+* The job(s) that succeeded will clear the dirty bitmap associated with the
+ operation, but the job(s) that failed will not. It is not "safe" to delete
+ any incremental backups that were created successfully in this scenario,
+ even though others failed.
+
+#### Example
+
+* QMP example highlighting two backup jobs:
+
+ ```json
+ { "execute": "transaction",
+ "arguments": {
+ "actions": [
+ { "type": "drive-backup",
+ "data": { "device": "drive0", "bitmap": "bitmap0",
+ "format": "qcow2", "mode": "existing",
+ "sync": "incremental", "target": "d0-incr-1.qcow2" } },
+ { "type": "drive-backup",
+ "data": { "device": "drive1", "bitmap": "bitmap1",
+ "format": "qcow2", "mode": "existing",
+ "sync": "incremental", "target": "d1-incr-1.qcow2" } },
+ ]
+ }
+ }
+ ```
+
+* QMP example response, highlighting one success and one failure:
+ * Acknowledgement that the Transaction was accepted and jobs were launched:
+ ```json
+ { "return": {} }
+ ```
+
+ * Later, QEMU sends notice that the first job was completed:
+ ```json
+ { "timestamp": { "seconds": 1447192343, "microseconds": 615698 },
+ "data": { "device": "drive0", "type": "backup",
+ "speed": 0, "len": 67108864, "offset": 67108864 },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+ ```
+
+ * Later yet, QEMU sends notice that the second job has failed:
+ ```json
+ { "timestamp": { "seconds": 1447192399, "microseconds": 683015 },
+ "data": { "device": "drive1", "action": "report",
+ "operation": "read" },
+ "event": "BLOCK_JOB_ERROR" }
+ ```
+
+ ```json
+ { "timestamp": { "seconds": 1447192399, "microseconds": 685853 },
+ "data": { "speed": 0, "offset": 0, "len": 67108864,
+ "error": "Input/output error",
+ "device": "drive1", "type": "backup" },
+ "event": "BLOCK_JOB_COMPLETED" }
+
+* In the above example, "d0-incr-1.qcow2" is valid and must be kept,
+ but "d1-incr-1.qcow2" is invalid and should be deleted. If a VM-wide
+ incremental backup of all drives at a point-in-time is to be made,
+ new backups for both drives will need to be made, taking into account
+ that a new incremental backup for drive0 needs to be based on top of
+ "d0-incr-1.qcow2."
+
+### Grouped Completion Mode
+
+* While jobs launched by transactions normally complete or fail on their own,
+ it is possible to instruct them to complete or fail together as a group.
+
+* QMP transactions take an optional properties structure that can affect
+ the semantics of the transaction.
+
+* The "completion-mode" transaction property can be either "individual"
+ which is the default, legacy behavior described above, or "grouped,"
+ a new behavior detailed below.
+
+* Delayed Completion: In grouped completion mode, no jobs will report
+ success until all jobs are ready to report success.
+
+* Grouped failure: If any job fails in grouped completion mode, all remaining
+ jobs will be cancelled. Any incremental backups will restore their dirty
+ bitmap objects as if no backup command was ever issued.
+
+ * Regardless of if QEMU reports a particular incremental backup job as
+ CANCELLED or as an ERROR, the in-memory bitmap will be restored.
+
+#### Example
+
+* Here's the same example scenario from above with the new property:
+
+ ```json
+ { "execute": "transaction",
+ "arguments": {
+ "actions": [
+ { "type": "drive-backup",
+ "data": { "device": "drive0", "bitmap": "bitmap0",
+ "format": "qcow2", "mode": "existing",
+ "sync": "incremental", "target": "d0-incr-1.qcow2" } },
+ { "type": "drive-backup",
+ "data": { "device": "drive1", "bitmap": "bitmap1",
+ "format": "qcow2", "mode": "existing",
+ "sync": "incremental", "target": "d1-incr-1.qcow2" } },
+ ],
+ "properties": {
+ "completion-mode": "grouped"
+ }
+ }
+ }
+ ```
+
+* QMP example response, highlighting a failure for drive2:
+ * Acknowledgement that the Transaction was accepted and jobs were launched:
+ ```json
+ { "return": {} }
+ ```
+
+ * Later, QEMU sends notice that the second job has errored out,
+ but that the first job was also cancelled:
+ ```json
+ { "timestamp": { "seconds": 1447193702, "microseconds": 632377 },
+ "data": { "device": "drive1", "action": "report",
+ "operation": "read" },
+ "event": "BLOCK_JOB_ERROR" }
+ ```
+
+ ```json
+ { "timestamp": { "seconds": 1447193702, "microseconds": 640074 },
+ "data": { "speed": 0, "offset": 0, "len": 67108864,
+ "error": "Input/output error",
+ "device": "drive1", "type": "backup" },
+ "event": "BLOCK_JOB_COMPLETED" }
+ ```
+
+ ```json
+ { "timestamp": { "seconds": 1447193702, "microseconds": 640163 },
+ "data": { "device": "drive0", "type": "backup", "speed": 0,
+ "len": 67108864, "offset": 16777216 },
+ "event": "BLOCK_JOB_CANCELLED" }
+ ```
+
<!--
The FreeBSD Documentation License
diff --git a/docs/build-system.txt b/docs/build-system.txt
new file mode 100644
index 000000000..5ddddeaaf
--- /dev/null
+++ b/docs/build-system.txt
@@ -0,0 +1,507 @@
+ The QEMU build system architecture
+ ==================================
+
+This document aims to help developers understand the architecture of the
+QEMU build system. As with projects using GNU autotools, the QEMU build
+system has two stages, first the developer runs the "configure" script
+to determine the local build environment characteristics, then they run
+"make" to build the project. There is about where the similarities with
+GNU autotools end, so try to forget what you know about them.
+
+
+Stage 1: configure
+==================
+
+The QEMU configure script is written directly in shell, and should be
+compatible with any POSIX shell, hence it uses #!/bin/sh. An important
+implication of this is that it is important to avoid using bash-isms on
+development platforms where bash is the primary host.
+
+In contrast to autoconf scripts, QEMU's configure is expected to be
+silent while it is checking for features. It will only display output
+when an error occurs, or to show the final feature enablement summary
+on completion.
+
+Adding new checks to the configure script usually comprises the
+following tasks:
+
+ - Initialize one or more variables with the default feature state.
+
+ Ideally features should auto-detect whether they are present,
+ so try to avoid hardcoding the initial state to either enabled
+ or disabled, as that forces the user to pass a --enable-XXX
+ / --disable-XXX flag on every invocation of configure.
+
+ - Add support to the command line arg parser to handle any new
+ --enable-XXX / --disable-XXX flags required by the feature XXX.
+
+ - Add information to the help output message to report on the new
+ feature flag.
+
+ - Add code to perform the actual feature check. As noted above, try to
+ be fully dynamic in checking enablement/disablement.
+
+ - Add code to print out the feature status in the configure summary
+ upon completion.
+
+ - Add any new makefile variables to $config_host_mak on completion.
+
+
+Taking (a simplified version of) the probe for gnutls from configure,
+we have the following pieces:
+
+ # Initial variable state
+ gnutls=""
+
+ ..snip..
+
+ # Configure flag processing
+ --disable-gnutls) gnutls="no"
+ ;;
+ --enable-gnutls) gnutls="yes"
+ ;;
+
+ ..snip..
+
+ # Help output feature message
+ gnutls GNUTLS cryptography support
+
+ ..snip..
+
+ # Test for gnutls
+ if test "$gnutls" != "no"; then
+ if ! $pkg_config --exists "gnutls"; then
+ gnutls_cflags=`$pkg_config --cflags gnutls`
+ gnutls_libs=`$pkg_config --libs gnutls`
+ libs_softmmu="$gnutls_libs $libs_softmmu"
+ libs_tools="$gnutls_libs $libs_tools"
+ QEMU_CFLAGS="$QEMU_CFLAGS $gnutls_cflags"
+ gnutls="yes"
+ elif test "$gnutls" = "yes"; then
+ feature_not_found "gnutls" "Install gnutls devel"
+ else
+ gnutls="no"
+ fi
+ fi
+
+ ..snip..
+
+ # Completion feature summary
+ echo "GNUTLS support $gnutls"
+
+ ..snip..
+
+ # Define make variables
+ if test "$gnutls" = "yes" ; then
+ echo "CONFIG_GNUTLS=y" >> $config_host_mak
+ fi
+
+
+Helper functions
+----------------
+
+The configure script provides a variety of helper functions to assist
+developers in checking for system features:
+
+ - do_cc $ARGS...
+
+ Attempt to run the system C compiler passing it $ARGS...
+
+ - do_cxx $ARGS...
+
+ Attempt to run the system C++ compiler passing it $ARGS...
+
+ - compile_object $CFLAGS
+
+ Attempt to compile a test program with the system C compiler using
+ $CFLAGS. The test program must have been previously written to a file
+ called $TMPC.
+
+ - compile_prog $CFLAGS $LDFLAGS
+
+ Attempt to compile a test program with the system C compiler using
+ $CFLAGS and link it with the system linker using $LDFLAGS. The test
+ program must have been previously written to a file called $TMPC.
+
+ - has $COMMAND
+
+ Determine if $COMMAND exists in the current environment, either as a
+ shell builtin, or executable binary, returning 0 on success.
+
+ - path_of $COMMAND
+
+ Return the fully qualified path of $COMMAND, printing it to stdout,
+ and returning 0 on success.
+
+ - check_define $NAME
+
+ Determine if the macro $NAME is defined by the system C compiler
+
+ - check_include $NAME
+
+ Determine if the include $NAME file is available to the system C
+ compiler
+
+ - write_c_skeleton
+
+ Write a minimal C program main() function to the temporary file
+ indicated by $TMPC
+
+ - feature_not_found $NAME $REMEDY
+
+ Print a message to stderr that the feature $NAME was not available
+ on the system, suggesting the user try $REMEDY to address the
+ problem.
+
+ - error_exit $MESSAGE $MORE...
+
+ Print $MESSAGE to stderr, followed by $MORE... and then exit from the
+ configure script with non-zero status
+
+ - query_pkg_config $ARGS...
+
+ Run pkg-config passing it $ARGS. If QEMU is doing a static build,
+ then --static will be automatically added to $ARGS
+
+
+Stage 2: makefiles
+==================
+
+The use of GNU make is required with the QEMU build system.
+
+Although the source code is spread across multiple subdirectories, the
+build system should be considered largely non-recursive in nature, in
+contrast to common practices seen with automake. There is some recursive
+invocation of make, but this is related to the things being built,
+rather than the source directory structure.
+
+QEMU currently supports both VPATH and non-VPATH builds, so there are
+three general ways to invoke configure & perform a build.
+
+ - VPATH, build artifacts outside of QEMU source tree entirely
+
+ cd ../
+ mkdir build
+ cd build
+ ../qemu/configure
+ make
+
+ - VPATH, build artifacts in a subdir of QEMU source tree
+
+ mkdir build
+ cd build
+ ../configure
+ make
+
+ - non-VPATH, build artifacts everywhere
+
+ ./configure
+ make
+
+The QEMU maintainers generally recommend that a VPATH build is used by
+developers. Patches to QEMU are expected to ensure VPATH build still
+works.
+
+
+Module structure
+----------------
+
+There are a number of key outputs of the QEMU build system:
+
+ - Tools - qemu-img, qemu-nbd, qga (guest agent), etc
+ - System emulators - qemu-system-$ARCH
+ - Userspace emulators - qemu-$ARCH
+ - Unit tests
+
+The source code is highly modularized, split across many files to
+facilitate building of all of these components with as little duplicated
+compilation as possible. There can be considered to be two distinct
+groups of files, those which are independent of the QEMU emulation
+target and those which are dependent on the QEMU emulation target.
+
+In the target-independent set lives various general purpose helper code,
+such as error handling infrastructure, standard data structures,
+platform portability wrapper functions, etc. This code can be compiled
+once only and the .o files linked into all output binaries.
+
+In the target-dependent set lives CPU emulation, device emulation and
+much glue code. This sometimes also has to be compiled multiple times,
+once for each target being built.
+
+The utility code that is used by all binaries is built into a
+static archive called libqemuutil.a, which is then linked to all the
+binaries. In order to provide hooks that are only needed by some of the
+binaries, code in libqemuutil.a may depend on other functions that are
+not fully implemented by all QEMU binaries. To deal with this there is a
+second library called libqemustub.a which provides dummy stubs for all
+these functions. These will get lazy linked into the binary if the real
+implementation is not present. In this way, the libqemustub.a static
+library can be thought of as a portable implementation of the weak
+symbols concept. All binaries should link to both libqemuutil.a and
+libqemustub.a. e.g.
+
+ qemu-img$(EXESUF): qemu-img.o ..snip.. libqemuutil.a libqemustub.a
+
+
+Windows platform portability
+----------------------------
+
+On Windows, all binaries have the suffix '.exe', so all Makefile rules
+which create binaries must include the $(EXESUF) variable on the binary
+name. e.g.
+
+ qemu-img$(EXESUF): qemu-img.o ..snip..
+
+This expands to '.exe' on Windows, or '' on other platforms.
+
+A further complication for the system emulator binaries is that
+two separate binaries need to be generated.
+
+The main binary (e.g. qemu-system-x86_64.exe) is linked against the
+Windows console runtime subsystem. These are expected to be run from a
+command prompt window, and so will print stderr to the console that
+launched them.
+
+The second binary generated has a 'w' on the end of its name (e.g.
+qemu-system-x86_64w.exe) and is linked against the Windows graphical
+runtime subsystem. These are expected to be run directly from the
+desktop and will open up a dedicated console window for stderr output.
+
+The Makefile.target will generate the binary for the graphical subsystem
+first, and then use objcopy to relink it against the console subsystem
+to generate the second binary.
+
+
+Object variable naming
+----------------------
+
+The QEMU convention is to define variables to list different groups of
+object files. These are named with the convention $PREFIX-obj-y. For
+example the libqemuutil.a file will be linked with all objects listed
+in a variable 'util-obj-y'. So, for example, util/Makefile.obj will
+contain a set of definitions looking like
+
+ util-obj-y += bitmap.o bitops.o hbitmap.o
+ util-obj-y += fifo8.o
+ util-obj-y += acl.o
+ util-obj-y += error.o qemu-error.o
+
+When there is an object file which needs to be conditionally built based
+on some characteristic of the host system, the configure script will
+define a variable for the conditional. For example, on Windows it will
+define $(CONFIG_POSIX) with a value of 'n' and $(CONFIG_WIN32) with a
+value of 'y'. It is now possible to use the config variables when
+listing object files. For example,
+
+ util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o
+ util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o
+
+On Windows this expands to
+
+ util-obj-y += oslib-win32.o qemu-thread-win32.o
+ util-obj-n += oslib-posix.o qemu-thread-posix.o
+
+Since libqemutil.a links in $(util-obj-y), the POSIX specific files
+listed against $(util-obj-n) are ignored on the Windows platform builds.
+
+
+CFLAGS / LDFLAGS / LIBS handling
+--------------------------------
+
+There are many different binaries being built with differing purposes,
+and some of them might even be 3rd party libraries pulled in via git
+submodules. As such the use of the global CFLAGS variable is generally
+avoided in QEMU, since it would apply to too many build targets.
+
+Flags that are needed by any QEMU code (i.e. everything *except* GIT
+submodule projects) are put in $(QEMU_CFLAGS) variable. For linker
+flags the $(LIBS) variable is sometimes used, but a couple of more
+targeted variables are preferred. $(libs_softmmu) is used for
+libraries that must be linked to system emulator targets, $(LIBS_TOOLS)
+is used for tools like qemu-img, qemu-nbd, etc and $(LIBS_QGA) is used
+for the QEMU guest agent. There is currently no specific variable for
+the userspace emulator targets as the global $(LIBS), or more targeted
+variables shown below, are sufficient.
+
+In addition to these variables, it is possible to provide cflags and
+libs against individual source code files, by defining variables of the
+form $FILENAME-cflags and $FILENAME-libs. For example, the curl block
+driver needs to link to the libcurl library, so block/Makefile defines
+some variables:
+
+ curl.o-cflags := $(CURL_CFLAGS)
+ curl.o-libs := $(CURL_LIBS)
+
+The scope is a little different between the two variables. The libs get
+used when linking any target binary that includes the curl.o object
+file, while the cflags get used when compiling the curl.c file only.
+
+
+Statically defined files
+------------------------
+
+The following key files are statically defined in the source tree, with
+the rules needed to build QEMU. Their behaviour is influenced by a
+number of dynamically created files listed later.
+
+- Makefile
+
+The main entry point used when invoking make to build all the components
+of QEMU. The default 'all' target will naturally result in the build of
+every component. The various tools and helper binaries are built
+directly via a non-recursive set of rules.
+
+Each system/userspace emulation target needs to have a slightly
+different set of make rules / variables. Thus, make will be recursively
+invoked for each of the emulation targets.
+
+The recursive invocation will end up processing the toplevel
+Makefile.target file (more on that later).
+
+
+- */Makefile.objs
+
+Since the source code is spread across multiple directories, the rules
+for each file are similarly modularized. Thus each subdirectory
+containing .c files will usually also contain a Makefile.objs file.
+These files are not directly invoked by a recursive make, but instead
+they are imported by the top level Makefile and/or Makefile.target
+
+Each Makefile.objs usually just declares a set of variables listing the
+.o files that need building from the source files in the directory. They
+will also define any custom linker or compiler flags. For example in
+block/Makefile.objs
+
+ block-obj-$(CONFIG_LIBISCSI) += iscsi.o
+ block-obj-$(CONFIG_CURL) += curl.o
+
+ ..snip...
+
+ iscsi.o-cflags := $(LIBISCSI_CFLAGS)
+ iscsi.o-libs := $(LIBISCSI_LIBS)
+ curl.o-cflags := $(CURL_CFLAGS)
+ curl.o-libs := $(CURL_LIBS)
+
+If there are any rules defined in the Makefile.objs file, they should
+all use $(obj) as a prefix to the target, e.g.
+
+ $(obj)/generated-tcg-tracers.h: $(obj)/generated-tcg-tracers.h-timestamp
+
+
+- Makefile.target
+
+This file provides the entry point used to build each individual system
+or userspace emulator target. Each enabled target has its own
+subdirectory. For example if configure is run with the argument
+'--target-list=x86_64-softmmu', then a sub-directory 'x86_64-softmu'
+will be created, containing a 'Makefile' which symlinks back to
+Makefile.target
+
+So when the recursive '$(MAKE) -C x86_64-softmmu' is invoked, it ends up
+using Makefile.target for the build rules.
+
+
+- rules.mak
+
+This file provides the generic helper rules for invoking build tools, in
+particular the compiler and linker. This also contains the magic (hairy)
+'unnest-vars' function which is used to merge the variable definitions
+from all Makefile.objs in the source tree down into the main Makefile
+context.
+
+
+- default-configs/*.mak
+
+The files under default-configs/ control what emulated hardware is built
+into each QEMU system and userspace emulator targets. They merely
+contain a long list of config variable definitions. For example,
+default-configs/x86_64-softmmu.mak has:
+
+ include pci.mak
+ include sound.mak
+ include usb.mak
+ CONFIG_QXL=$(CONFIG_SPICE)
+ CONFIG_VGA_ISA=y
+ CONFIG_VGA_CIRRUS=y
+ CONFIG_VMWARE_VGA=y
+ CONFIG_VIRTIO_VGA=y
+ ...snip...
+
+These files rarely need changing unless new devices / hardware need to
+be enabled for a particular system/userspace emulation target
+
+
+- tests/Makefile
+
+Rules for building the unit tests. This file is included directly by the
+top level Makefile, so anything defined in this file will influence the
+entire build system. Care needs to be taken when writing rules for tests
+to ensure they only apply to the unit test execution / build.
+
+
+- po/Makefile
+
+Rules for building and installing the binary message catalogs from the
+text .po file sources. This almost never needs changing for any reason.
+
+
+Dynamically created files
+-------------------------
+
+The following files are generated dynamically by configure in order to
+control the behaviour of the statically defined makefiles. This avoids
+the need for QEMU makefiles to go through any pre-processing as seen
+with autotools, where Makefile.am generates Makefile.in which generates
+Makefile.
+
+
+- config-host.mak
+
+When configure has determined the characteristics of the build host it
+will write a long list of variables to config-host.mak file. This
+provides the various install directories, compiler / linker flags and a
+variety of CONFIG_* variables related to optionally enabled features.
+This is imported by the top level Makefile in order to tailor the build
+output.
+
+The variables defined here are those which are applicable to all QEMU
+build outputs. Variables which are potentially different for each
+emulator target are defined by the next file...
+
+It is also used as a dependency checking mechanism. If make sees that
+the modification timestamp on configure is newer than that on
+config-host.mak, then configure will be re-run.
+
+
+- config-host.h
+
+The config-host.h file is used by source code to determine what features
+are enabled. It is generated from the contents of config-host.mak using
+the scripts/create_config program. This extracts all the CONFIG_* variables,
+most of the HOST_* variables and a few other misc variables from
+config-host.mak, formatting them as C preprocessor macros.
+
+
+- $TARGET-NAME/config-target.mak
+
+TARGET-NAME is the name of a system or userspace emulator, for example,
+x86_64-softmmu denotes the system emulator for the x86_64 architecture.
+This file contains the variables which need to vary on a per-target
+basis. For example, it will indicate whether KVM or Xen are enabled for
+the target and any other potential custom libraries needed for linking
+the target.
+
+
+- $TARGET-NAME/config-devices.mak
+
+TARGET-NAME is again the name of a system or userspace emulator. The
+config-devices.mak file is automatically generated by make using the
+scripts/make_device_config.sh program, feeding it the
+default-configs/$TARGET-NAME file as input.
+
+
+- $TARGET-NAME/Makefile
+
+This is the entrypoint used when make recurses to build a single system
+or userspace emulator target. It is merely a symlink back to the
+Makefile.target in the top level.
diff --git a/docs/libcacard.txt b/docs/libcacard.txt
deleted file mode 100644
index 8db421d3a..000000000
--- a/docs/libcacard.txt
+++ /dev/null
@@ -1,483 +0,0 @@
-This file documents the CAC (Common Access Card) library in the libcacard
-subdirectory.
-
-Virtual Smart Card Emulator
-
-This emulator is designed to provide emulation of actual smart cards to a
-virtual card reader running in a guest virtual machine. The emulated smart
-cards can be representations of real smart cards, where the necessary functions
-such as signing, card removal/insertion, etc. are mapped to real, physical
-cards which are shared with the client machine the emulator is running on, or
-the cards could be pure software constructs.
-
-The emulator is structured to allow multiple replaceable or additional pieces,
-so it can be easily modified for future requirements. The primary envisioned
-modifications are:
-
-1) The socket connection to the virtual card reader (presumably a CCID reader,
-but other ISO-7816 compatible readers could be used). The code that handles
-this is in vscclient.c.
-
-2) The virtual card low level emulation. This is currently supplied by using
-NSS. This emulation could be replaced by implementations based on other
-security libraries, including but not limitted to openssl+pkcs#11 library,
-raw pkcs#11, Microsoft CAPI, direct opensc calls, etc. The code that handles
-this is in vcard_emul_nss.c.
-
-3) Emulation for new types of cards. The current implementation emulates the
-original DoD CAC standard with separate pki containers. This emulator lives in
-cac.c. More than one card type emulator could be included. Other cards could
-be emulated as well, including PIV, newer versions of CAC, PKCS #15, etc.
-
---------------------
-Replacing the Socket Based Virtual Reader Interface.
-
-The current implementation contains a replaceable module vscclient.c. The
-current vscclient.c implements a sockets interface to the virtual ccid reader
-on the guest. CCID commands that are pertinent to emulation are passed
-across the socket, and their responses are passed back along that same socket.
-The protocol that vscclient uses is defined in vscard_common.h and connects
-to a qemu ccid usb device. Since this socket runs as a client, vscclient.c
-implements a program with a main entry. It also handles argument parsing for
-the emulator.
-
-An application that wants to use the virtual reader can replace vscclient.c
-with its own implementation that connects to its own CCID reader. The calls
-that the CCID reader can call are:
-
- VReaderList * vreader_get_reader_list();
-
- This function returns a list of virtual readers. These readers may map to
- physical devices, or simulated devices depending on vcard the back end. Each
- reader in the list should represent a reader to the virtual machine. Virtual
- USB address mapping is left to the CCID reader front end. This call can be
- made any time to get an updated list. The returned list is a copy of the
- internal list that can be referenced by the caller without locking. This copy
- must be freed by the caller with vreader_list_delete when it is no longer
- needed.
-
- VReaderListEntry *vreader_list_get_first(VReaderList *);
-
- This function gets the first entry on the reader list. Along with
- vreader_list_get_next(), vreader_list_get_first() can be used to walk the
- reader list returned from vreader_get_reader_list(). VReaderListEntries are
- part of the list themselves and do not need to be freed separately from the
- list. If there are no entries on the list, it will return NULL.
-
- VReaderListEntry *vreader_list_get_next(VReaderListEntry *);
-
- This function gets the next entry in the list. If there are no more entries
- it will return NULL.
-
- VReader * vreader_list_get_reader(VReaderListEntry *)
-
- This function returns the reader stored in the reader List entry. Caller gets
- a new reference to a reader. The caller must free its reference when it is
- finished with vreader_free().
-
- void vreader_free(VReader *reader);
-
- This function frees a reference to a reader. Readers are reference counted
- and are automatically deleted when the last reference is freed.
-
- void vreader_list_delete(VReaderList *list);
-
- This function frees the list, all the elements on the list, and all the
- reader references held by the list.
-
- VReaderStatus vreader_power_on(VReader *reader, char *atr, int *len);
-
- This function simulates a card power on. A virtual card does not care about
- the actual voltage and other physical parameters, but it does care that the
- card is actually on or off. Cycling the card causes the card to reset. If
- the caller provides enough space, vreader_power_on will return the ATR of
- the virtual card. The amount of space provided in atr should be indicated
- in *len. The function modifies *len to be the actual length of of the
- returned ATR.
-
- VReaderStatus vreader_power_off(VReader *reader);
-
- This function simulates a power off of a virtual card.
-
- VReaderStatus vreader_xfer_bytes(VReader *reader, unsigne char *send_buf,
- int send_buf_len,
- unsigned char *receive_buf,
- int receive_buf_len);
-
- This function sends a raw apdu to a card and returns the card's response.
- The CCID front end should return the response back. Most of the emulation
- is driven from these APDUs.
-
- VReaderStatus vreader_card_is_present(VReader *reader);
-
- This function returns whether or not the reader has a card inserted. The
- vreader_power_on, vreader_power_off, and vreader_xfer_bytes will return
- VREADER_NO_CARD.
-
- const char *vreader_get_name(VReader *reader);
-
- This function returns the name of the reader. The name comes from the card
- emulator level and is usually related to the name of the physical reader.
-
- VReaderID vreader_get_id(VReader *reader);
-
- This function returns the id of a reader. All readers start out with an id
- of -1. The application can set the id with vreader_set_id.
-
- VReaderStatus vreader_get_id(VReader *reader, VReaderID id);
-
- This function sets the reader id. The application is responsible for making
- sure that the id is unique for all readers it is actively using.
-
- VReader *vreader_find_reader_by_id(VReaderID id);
-
- This function returns the reader which matches the id. If two readers match,
- only one is returned. The function returns NULL if the id is -1.
-
- Event *vevent_wait_next_vevent();
-
- This function blocks waiting for reader and card insertion events. There
- will be one event for each card insertion, each card removal, each reader
- insertion and each reader removal. At start up, events are created for all
- the initial readers found, as well as all the cards that are inserted.
-
- Event *vevent_get_next_vevent();
-
- This function returns a pending event if it exists, otherwise it returns
- NULL. It does not block.
-
-----------------
-Card Type Emulator: Adding a New Virtual Card Type
-
-The ISO 7816 card spec describes 2 types of cards:
- 1) File system cards, where the smartcard is managed by reading and writing
-data to files in a file system. There is currently only boiler plate
-implemented for file system cards.
- 2) VM cards, where the card has loadable applets which perform the card
-functions. The current implementation supports VM cards.
-
-In the case of VM cards, the difference between various types of cards is
-really what applets have been installed in that card. This structure is
-mirrored in card type emulators. The 7816 emulator already handles the basic
-ISO 7186 commands. Card type emulators simply need to add the virtual applets
-which emulate the real card applets. Card type emulators have exactly one
-public entry point:
-
- VCARDStatus xxx_card_init(VCard *card, const char *flags,
- const unsigned char *cert[],
- int cert_len[],
- VCardKey *key[],
- int cert_count);
-
- The parameters for this are:
- card - the virtual card structure which will represent this card.
- flags - option flags that may be specific to this card type.
- cert - array of binary certificates.
- cert_len - array of lengths of each of the certificates specified in cert.
- key - array of opaque key structures representing the private keys on
- the card.
- cert_count - number of entries in cert, cert_len, and key arrays.
-
- Any cert, cert_len, or key with the same index are matching sets. That is
- cert[0] is cert_len[0] long and has the corresponding private key of key[0].
-
-The card type emulator is expected to own the VCardKeys, but it should copy
-any raw cert data it wants to save. It can create new applets and add them to
-the card using the following functions:
-
- VCardApplet *vcard_new_applet(VCardProcessAPDU apdu_func,
- VCardResetApplet reset_func,
- const unsigned char *aid,
- int aid_len);
-
- This function creates a new applet. Applet structures store the following
- information:
- 1) the AID of the applet (set by aid and aid_len).
- 2) a function to handle APDUs for this applet. (set by apdu_func, more on
- this below).
- 3) a function to reset the applet state when the applet is selected.
- (set by reset_func, more on this below).
- 3) applet private data, a data pointer used by the card type emulator to
- store any data or state it needs to complete requests. (set by a
- separate call).
- 4) applet private data free, a function used to free the applet private
- data when the applet itself is destroyed.
- The created applet can be added to the card with vcard_add_applet below.
-
- void vcard_set_applet_private(VCardApplet *applet,
- VCardAppletPrivate *private,
- VCardAppletPrivateFree private_free);
- This function sets the private data and the corresponding free function.
- VCardAppletPrivate is an opaque data structure to the rest of the emulator.
- The card type emulator can define it any way it wants by defining
- struct VCardAppletPrivateStruct {};. If there is already a private data
- structure on the applet, the old one is freed before the new one is set up.
- passing two NULL clear any existing private data.
-
- VCardStatus vcard_add_applet(VCard *card, VCardApplet *applet);
-
- Add an applet onto the list of applets attached to the card. Once an applet
- has been added, it can be selected by its AID, and then commands will be
- routed to it VCardProcessAPDU function. This function adopts the applet that
- is passed into it. Note: 2 applets with the same AID should not be added to
- the same card. It is permissible to add more than one applet. Multiple applets
- may have the same VCardPRocessAPDU entry point.
-
-The certs and keys should be attached to private data associated with one or
-more appropriate applets for that card. Control will come to the card type
-emulators once one of its applets are selected through the VCardProcessAPDU
-function it specified when it created the applet.
-
-The signature of VCardResetApplet is:
- VCardStatus (*VCardResetApplet) (VCard *card, int channel);
- This function will reset the any internal applet state that needs to be
- cleared after a select applet call. It should return VCARD_DONE;
-
-The signature of VCardProcessAPDU is:
- VCardStatus (*VCardProcessAPDU)(VCard *card, VCardAPDU *apdu,
- VCardResponse **response);
- This function examines the APDU and determines whether it should process
- the apdu directly, reject the apdu as invalid, or pass the apdu on to
- the basic 7816 emulator for processing.
- If the 7816 emulator should process the apdu, then the VCardProcessAPDU
- should return VCARD_NEXT.
- If there is an error, then VCardProcessAPDU should return an error
- response using vcard_make_response and the appropriate 7816 error code
- (see card_7816t.h) or vcard_make_response with a card type specific error
- code. It should then return VCARD_DONE.
- If the apdu can be processed correctly, VCardProcessAPDU should do so,
- set the response value appropriately for that APDU, and return VCARD_DONE.
- VCardProcessAPDU should always set the response if it returns VCARD_DONE.
- It should always either return VCARD_DONE or VCARD_NEXT.
-
-Parsing the APDU --
-
-Prior to processing calling the card type emulator's VCardProcessAPDU function, the emulator has already decoded the APDU header and set several fields:
-
- apdu->a_data - The raw apdu data bytes.
- apdu->a_len - The len of the raw apdu data.
- apdu->a_body - The start of any post header parameter data.
- apdu->a_Lc - The parameter length value.
- apdu->a_Le - The expected length of any returned data.
- apdu->a_cla - The raw apdu class.
- apdu->a_channel - The channel (decoded from the class).
- apdu->a_secure_messaging_type - The decoded secure messaging type
- (from class).
- apdu->a_type - The decode class type.
- apdu->a_gen_type - the generic class type (7816, PROPRIETARY, RFU, PTS).
- apdu->a_ins - The instruction byte.
- apdu->a_p1 - Parameter 1.
- apdu->a_p2 - Parameter 2.
-
-Creating a Response --
-
-The expected result of any APDU call is a response. The card type emulator must
-set *response with an appropriate VCardResponse value if it returns VCARD_DONE.
-Responses could be as simple as returning a 2 byte status word response, to as
-complex as returning a block of data along with a 2 byte response. Which is
-returned will depend on the semantics of the APDU. The following functions will
-create card responses.
-
- VCardResponse *vcard_make_response(VCard7816Status status);
-
- This is the most basic function to get a response. This function will
- return a response the consists solely one 2 byte status code. If that status
- code is defined in card_7816t.h, then this function is guaranteed to
- return a response with that status. If a cart type specific status code
- is passed and vcard_make_response fails to allocate the appropriate memory
- for that response, then vcard_make_response will return a VCardResponse
- of VCARD7816_STATUS_EXC_ERROR_MEMORY. In any case, this function is
- guaranteed to return a valid VCardResponse.
-
- VCardResponse *vcard_response_new(unsigned char *buf, int len,
- VCard7816Status status);
-
- This function is similar to vcard_make_response except it includes some
- returned data with the response. It could also fail to allocate enough
- memory, in which case it will return NULL.
-
- VCardResponse *vcard_response_new_status_bytes(unsigned char sw1,
- unsigned char sw2);
-
- Sometimes in 7816 the response bytes are treated as two separate bytes with
- split meanings. This function allows you to create a response based on
- two separate bytes. This function could fail, in which case it will return
- NULL.
-
- VCardResponse *vcard_response_new_bytes(unsigned char *buf, int len,
- unsigned char sw1,
- unsigned char sw2);
-
- This function is the same as vcard_response_new except you may specify
- the status as two separate bytes like vcard_response_new_status_bytes.
-
-
-Implementing functionality ---
-
-The following helper functions access information about the current card
-and applet.
-
- VCARDAppletPrivate *vcard_get_current_applet_private(VCard *card,
- int channel);
-
- This function returns any private data set by the card type emulator on
- the currently selected applet. The card type emulator keeps track of the
- current applet state in this data structure. Any certs and keys associated
- with a particular applet is also stored here.
-
- int vcard_emul_get_login_count(VCard *card);
-
- This function returns the the number of remaining login attempts for this
- card. If the card emulator does not know, or the card does not have a
- way of giving this information, this function returns -1.
-
-
- VCard7816Status vcard_emul_login(VCard *card, unsigned char *pin,
- int pin_len);
-
- This function logs into the card and returns the standard 7816 status
- word depending on the success or failure of the call.
-
- void vcard_emul_delete_key(VCardKey *key);
-
- This function frees the VCardKey passed in to xxxx_card_init. The card
- type emulator is responsible for freeing this key when it no longer needs
- it.
-
- VCard7816Status vcard_emul_rsa_op(VCard *card, VCardKey *key,
- unsigned char *buffer,
- int buffer_size);
-
- This function does a raw rsa op on the buffer with the given key.
-
-The sample card type emulator is found in cac.c. It implements the cac specific
-applets. Only those applets needed by the coolkey pkcs#11 driver on the guest
-have been implemented. To support the full range CAC middleware, a complete CAC
-card according to the CAC specs should be implemented here.
-
-------------------------------
-Virtual Card Emulator
-
-This code accesses both real smart cards and simulated smart cards through
-services provided on the client. The current implementation uses NSS, which
-already knows how to talk to various PKCS #11 modules on the client, and is
-portable to most operating systems. A particular emulator can have only one
-virtual card implementation at a time.
-
-The virtual card emulator consists of a series of virtual card services. In
-addition to the services describe above (services starting with
-vcard_emul_xxxx), the virtual card emulator also provides the following
-functions:
-
- VCardEmulError vcard_emul_init(cont VCardEmulOptions *options);
-
- The options structure is built by another function in the virtual card
- interface where a string of virtual card emulator specific strings are
- mapped to the options. The actual structure is defined by the virtual card
- emulator and is used to determine the configuration of soft cards, or to
- determine which physical cards to present to the guest.
-
- The vcard_emul_init function will build up sets of readers, create any
- threads that are needed to watch for changes in the reader state. If readers
- have cards present in them, they are also initialized.
-
- Readers are created with the function.
-
- VReader *vreader_new(VReaderEmul *reader_emul,
- VReaderEmulFree reader_emul_free);
-
- The freeFunc is used to free the VReaderEmul * when the reader is
- destroyed. The VReaderEmul structure is an opaque structure to the
- rest of the code, but defined by the virtual card emulator, which can
- use it to store any reader specific state.
-
- Once the reader has been created, it can be added to the front end with the
- call:
-
- VReaderStatus vreader_add_reader(VReader *reader);
-
- This function will automatically generate the appropriate new reader
- events and add the reader to the list.
-
- To create a new card, the virtual card emulator will call a similar
- function.
-
- VCard *vcard_new(VCardEmul *card_emul,
- VCardEmulFree card_emul_free);
-
- Like vreader_new, this function takes a virtual card emulator specific
- structure which it uses to keep track of the card state.
-
- Once the card is created, it is attached to a card type emulator with the
- following function:
-
- VCardStatus vcard_init(VCard *vcard, VCardEmulType type,
- const char *flags,
- unsigned char *const *certs,
- int *cert_len,
- VCardKey *key[],
- int cert_count);
-
- The vcard is the value returned from vcard_new. The type is the
- card type emulator that this card should presented to the guest as.
- The flags are card type emulator specific options. The certs,
- cert_len, and keys are all arrays of length cert_count. These are the
- the same of the parameters xxxx_card_init() accepts.
-
- Finally the card is associated with its reader by the call:
-
- VReaderStatus vreader_insert_card(VReader *vreader, VCard *vcard);
-
- This function, like vreader_add_reader, will take care of any event
- notification for the card insert.
-
-
- VCardEmulError vcard_emul_force_card_remove(VReader *vreader);
-
- Force a card that is present to appear to be removed to the guest, even if
- that card is a physical card and is present.
-
-
- VCardEmulError vcard_emul_force_card_insert(VReader *reader);
-
- Force a card that has been removed by vcard_emul_force_card_remove to be
- reinserted from the point of view of the guest. This will only work if the
- card is physically present (which is always true fro a soft card).
-
- void vcard_emul_get_atr(Vcard *card, unsigned char *atr, int *atr_len);
-
- Return the virtual ATR for the card. By convention this should be the value
- VCARD_ATR_PREFIX(size) followed by several ascii bytes related to this
- particular emulator. For instance the NSS emulator returns
- {VCARD_ATR_PREFIX(3), 'N', 'S', 'S' }. Do ot return more data then *atr_len;
-
- void vcard_emul_reset(VCard *card, VCardPower power)
-
- Set the state of 'card' to the current power level and reset its internal
- state (logout, etc).
-
--------------------------------------------------------
-List of files and their function:
-README - This file
-card_7816.c - emulate basic 7816 functionality. Parse APDUs.
-card_7816.h - apdu and response services definitions.
-card_7816t.h - 7816 specific structures, types and definitions.
-event.c - event handling code.
-event.h - event handling services definitions.
-eventt.h - event handling structures and types
-vcard.c - handle common virtual card services like creation, destruction, and
- applet management.
-vcard.h - common virtual card services function definitions.
-vcardt.h - comon virtual card types
-vreader.c - common virtual reader services.
-vreader.h - common virtual reader services definitions.
-vreadert.h - comon virtual reader types.
-vcard_emul_type.c - manage the card type emulators.
-vcard_emul_type.h - definitions for card type emulators.
-cac.c - card type emulator for CAC cards
-vcard_emul.h - virtual card emulator service definitions.
-vcard_emul_nss.c - virtual card emulator implementation for nss.
-vscclient.c - socket connection to guest qemu usb driver.
-vscard_common.h - common header with the guest qemu usb driver.
-mutex.h - header file for machine independent mutexes.
-link_test.c - static test to make sure all the symbols are properly defined.
diff --git a/docs/migration.txt b/docs/migration.txt
index f6df4beb2..fda8d61d6 100644
--- a/docs/migration.txt
+++ b/docs/migration.txt
@@ -291,3 +291,194 @@ save/send this state when we are in the middle of a pio operation
(that is what ide_drive_pio_state_needed() checks). If DRQ_STAT is
not enabled, the values on that fields are garbage and don't need to
be sent.
+
+= Return path =
+
+In most migration scenarios there is only a single data path that runs
+from the source VM to the destination, typically along a single fd (although
+possibly with another fd or similar for some fast way of throwing pages across).
+
+However, some uses need two way communication; in particular the Postcopy
+destination needs to be able to request pages on demand from the source.
+
+For these scenarios there is a 'return path' from the destination to the source;
+qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
+path.
+
+ Source side
+ Forward path - written by migration thread
+ Return path - opened by main thread, read by return-path thread
+
+ Destination side
+ Forward path - read by main thread
+ Return path - opened by main thread, written by main thread AND postcopy
+ thread (protected by rp_mutex)
+
+= Postcopy =
+'Postcopy' migration is a way to deal with migrations that refuse to converge
+(or take too long to converge) its plus side is that there is an upper bound on
+the amount of migration traffic and time it takes, the down side is that during
+the postcopy phase, a failure of *either* side or the network connection causes
+the guest to be lost.
+
+In postcopy the destination CPUs are started before all the memory has been
+transferred, and accesses to pages that are yet to be transferred cause
+a fault that's translated by QEMU into a request to the source QEMU.
+
+Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
+doesn't finish in a given time the switch is made to postcopy.
+
+=== Enabling postcopy ===
+
+To enable postcopy, issue this command on the monitor prior to the
+start of migration:
+
+migrate_set_capability x-postcopy-ram on
+
+The normal commands are then used to start a migration, which is still
+started in precopy mode. Issuing:
+
+migrate_start_postcopy
+
+will now cause the transition from precopy to postcopy.
+It can be issued immediately after migration is started or any
+time later on. Issuing it after the end of a migration is harmless.
+
+Note: During the postcopy phase, the bandwidth limits set using
+migrate_set_speed is ignored (to avoid delaying requested pages that
+the destination is waiting for).
+
+=== Postcopy device transfer ===
+
+Loading of device data may cause the device emulation to access guest RAM
+that may trigger faults that have to be resolved by the source, as such
+the migration stream has to be able to respond with page data *during* the
+device load, and hence the device data has to be read from the stream completely
+before the device load begins to free the stream up. This is achieved by
+'packaging' the device data into a blob that's read in one go.
+
+Source behaviour
+
+Until postcopy is entered the migration stream is identical to normal
+precopy, except for the addition of a 'postcopy advise' command at
+the beginning, to tell the destination that postcopy might happen.
+When postcopy starts the source sends the page discard data and then
+forms the 'package' containing:
+
+ Command: 'postcopy listen'
+ The device state
+ A series of sections, identical to the precopy streams device state stream
+ containing everything except postcopiable devices (i.e. RAM)
+ Command: 'postcopy run'
+
+The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
+contents are formatted in the same way as the main migration stream.
+
+During postcopy the source scans the list of dirty pages and sends them
+to the destination without being requested (in much the same way as precopy),
+however when a page request is received from the destination, the dirty page
+scanning restarts from the requested location. This causes requested pages
+to be sent quickly, and also causes pages directly after the requested page
+to be sent quickly in the hope that those pages are likely to be used
+by the destination soon.
+
+Destination behaviour
+
+Initially the destination looks the same as precopy, with a single thread
+reading the migration stream; the 'postcopy advise' and 'discard' commands
+are processed to change the way RAM is managed, but don't affect the stream
+processing.
+
+------------------------------------------------------------------------------
+ 1 2 3 4 5 6 7
+main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
+thread | |
+ | (page request)
+ | \___
+ v \
+listen thread: --- page -- page -- page -- page -- page --
+
+ a b c
+------------------------------------------------------------------------------
+
+On receipt of CMD_PACKAGED (1)
+ All the data associated with the package - the ( ... ) section in the
+diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
+recurses into qemu_loadvm_state_main to process the contents of the package (2)
+which contains commands (3,6) and devices (4...)
+
+On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)
+a new thread (a) is started that takes over servicing the migration stream,
+while the main thread carries on loading the package. It loads normal
+background page data (b) but if during a device load a fault happens (5) the
+returned page (c) is loaded by the listen thread allowing the main threads
+device load to carry on.
+
+The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
+CPUs start running.
+At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
+and is no longer used by migration, while the listen thread carries
+on servicing page data until the end of migration.
+
+=== Postcopy states ===
+
+Postcopy moves through a series of states (see postcopy_state) from
+ADVISE->DISCARD->LISTEN->RUNNING->END
+
+ Advise: Set at the start of migration if postcopy is enabled, even
+ if it hasn't had the start command; here the destination
+ checks that its OS has the support needed for postcopy, and performs
+ setup to ensure the RAM mappings are suitable for later postcopy.
+ The destination will fail early in migration at this point if the
+ required OS support is not present.
+ (Triggered by reception of POSTCOPY_ADVISE command)
+
+ Discard: Entered on receipt of the first 'discard' command; prior to
+ the first Discard being performed, hugepages are switched off
+ (using madvise) to ensure that no new huge pages are created
+ during the postcopy phase, and to cause any huge pages that
+ have discards on them to be broken.
+
+ Listen: The first command in the package, POSTCOPY_LISTEN, switches
+ the destination state to Listen, and starts a new thread
+ (the 'listen thread') which takes over the job of receiving
+ pages off the migration stream, while the main thread carries
+ on processing the blob. With this thread able to process page
+ reception, the destination now 'sensitises' the RAM to detect
+ any access to missing pages (on Linux using the 'userfault'
+ system).
+
+ Running: POSTCOPY_RUN causes the destination to synchronise all
+ state and start the CPUs and IO devices running. The main
+ thread now finishes processing the migration package and
+ now carries on as it would for normal precopy migration
+ (although it can't do the cleanup it would do as it
+ finishes a normal migration).
+
+ End: The listen thread can now quit, and perform the cleanup of migration
+ state, the migration is now complete.
+
+=== Source side page maps ===
+
+The source side keeps two bitmaps during postcopy; 'the migration bitmap'
+and 'unsent map'. The 'migration bitmap' is basically the same as in
+the precopy case, and holds a bit to indicate that page is 'dirty' -
+i.e. needs sending. During the precopy phase this is updated as the CPU
+dirties pages, however during postcopy the CPUs are stopped and nothing
+should dirty anything any more.
+
+The 'unsent map' is used for the transition to postcopy. It is a bitmap that
+has a bit cleared whenever a page is sent to the destination, however during
+the transition to postcopy mode it is combined with the migration bitmap
+to form a set of pages that:
+ a) Have been sent but then redirtied (which must be discarded)
+ b) Have not yet been sent - which also must be discarded to cause any
+ transparent huge pages built during precopy to be broken.
+
+Note that the contents of the unsentmap are sacrificed during the calculation
+of the discard set and thus aren't valid once in postcopy. The dirtymap
+is still valid and is used to ensure that no page is sent more than once. Any
+request for a page that has already been sent is ignored. Duplicate requests
+such as this can happen as a page is sent at about the same time the
+destination accesses it.
+
diff --git a/docs/multiseat.txt b/docs/multiseat.txt
index ebf244693..807518c8a 100644
--- a/docs/multiseat.txt
+++ b/docs/multiseat.txt
@@ -135,7 +135,7 @@ configuration:
TAG+="seat", ENV{ID_AUTOSEAT}="1"
Patch with this rule has been submitted to upstream udev/systemd, was
-accepted and and should be included in the next systemd release (222).
+accepted and should be included in the next systemd release (222).
So, if your guest has this or a newer version, multiseat will work just
fine without any manual guest configuration.
diff --git a/docs/qapi-code-gen.txt b/docs/qapi-code-gen.txt
index 61b5be47f..ceb9a782d 100644
--- a/docs/qapi-code-gen.txt
+++ b/docs/qapi-code-gen.txt
@@ -106,15 +106,15 @@ Types, commands, and events share a common namespace. Therefore,
generally speaking, type definitions should always use CamelCase for
user-defined type names, while built-in types are lowercase. Type
definitions should not end in 'Kind', as this namespace is used for
-creating implicit C enums for visiting union types. Command names,
+creating implicit C enums for visiting union types, or in 'List', as
+this namespace is used for creating array types. Command names,
and field names within a type, should be all lower case with words
separated by a hyphen. However, some existing older commands and
complex types use underscore; when extending such expressions,
consistency is preferred over blindly avoiding underscore. Event
-names should be ALL_CAPS with words separated by underscore. The
-special string '**' appears for some commands that manually perform
-their own type checking rather than relying on the type-safe code
-produced by the qapi code generators.
+names should be ALL_CAPS with words separated by underscore. Field
+names cannot start with 'has-' or 'has_', as this is reserved for
+tracking optional fields.
Any name (command, event, type, field, or enum value) beginning with
"x-" is marked experimental, and may be withdrawn or changed
@@ -125,9 +125,10 @@ vendor), even if the rest of the name uses dash (example:
__com.redhat_drive-mirror). Other than downstream extensions (with
leading underscore and the use of dots), all names should begin with a
letter, and contain only ASCII letters, digits, dash, and underscore.
-It is okay to reuse names that match C keywords; the generator will
-rename a field named "default" in the QAPI to "q_default" in the
-generated C code.
+Names beginning with 'q_' are reserved for the generator: QMP names
+that resemble C keywords or other problematic strings will be munged
+in C to use this prefix. For example, a field named "default" in
+qapi becomes "q_default" in the generated C code.
In the rest of this document, usage lines are given for each
expression type, with literal strings written in lower case and
@@ -140,17 +141,25 @@ must have a value that forms a struct name.
=== Built-in Types ===
-The following types are built-in to the parser:
- 'str' - arbitrary UTF-8 string
- 'int' - 64-bit signed integer (although the C code may place further
- restrictions on acceptable range)
- 'number' - floating point number
- 'bool' - JSON value of true or false
- 'int8', 'int16', 'int32', 'int64' - like 'int', but enforce maximum
- bit size
- 'uint8', 'uint16', 'uint32', 'uint64' - unsigned counterparts
- 'size' - like 'uint64', but allows scaled suffix from command line
- visitor
+The following types are predefined, and map to C as follows:
+
+ Schema C JSON
+ str char * any JSON string, UTF-8
+ number double any JSON number
+ int int64_t a JSON number without fractional part
+ that fits into the C integer type
+ int8 int8_t likewise
+ int16 int16_t likewise
+ int32 int32_t likewise
+ int64 int64_t likewise
+ uint8 uint8_t likewise
+ uint16 uint16_t likewise
+ uint32 uint32_t likewise
+ uint64 uint64_t likewise
+ size uint64_t like uint64_t, except StringInputVisitor
+ accepts size suffixes
+ bool bool JSON true or false
+ any QObject * any JSON value
=== Includes ===
@@ -163,7 +172,7 @@ The QAPI schema definitions can be modularized using the 'include' directive:
The directive is evaluated recursively, and include paths are relative to the
file using the directive. Multiple includes of the same file are
-safe. No other keys should appear in the expression, and the include
+idempotent. No other keys should appear in the expression, and the include
value should be a string.
As a matter of style, it is a good idea to have all files be
@@ -236,6 +245,7 @@ both fields like this:
=== Enumeration types ===
Usage: { 'enum': STRING, 'data': ARRAY-OF-STRING }
+ { 'enum': STRING, '*prefix': STRING, 'data': ARRAY-OF-STRING }
An enumeration type is a dictionary containing a single 'data' key
whose value is a list of strings. An example enumeration is:
@@ -247,6 +257,13 @@ useful. The list of strings should be lower case; if an enum name
represents multiple words, use '-' between words. The string 'max' is
not allowed as an enum value, and values should not be repeated.
+The enum constants will be named by using a heuristic to turn the
+type name into a set of underscore separated words. For the example
+above, 'MyEnum' will turn into 'MY_ENUM' giving a constant name
+of 'MY_ENUM_VALUE1' for the first value. If the default heuristic
+does not result in a desirable name, the optional 'prefix' field
+can be used when defining the enum.
+
The enumeration values are passed as strings over the Client JSON
Protocol, but are encoded as C enum integral values in generated code.
While the C code starts numbering at 0, it is better to use explicit
@@ -300,7 +317,6 @@ an implicit C enum 'NameKind' is created, corresponding to the union
the union can be named 'max', as this would collide with the implicit
enum. The value for each branch can be of any type.
-
A flat union definition specifies a struct as its base, and
avoids nesting on the wire. All branches of the union must be
complex types, and the top-level fields of the union dictionary on
@@ -314,7 +330,7 @@ adding a common field 'readonly', renaming the discriminator to
something more applicable, and reducing the number of {} required on
the wire:
- { 'enum': 'BlockdevDriver', 'data': [ 'raw', 'qcow2' ] }
+ { 'enum': 'BlockdevDriver', 'data': [ 'file', 'qcow2' ] }
{ 'struct': 'BlockdevCommonOptions',
'data': { 'driver': 'BlockdevDriver', 'readonly': 'bool' } }
{ 'union': 'BlockdevOptions',
@@ -350,7 +366,7 @@ is identical on the wire to:
{ 'struct': 'Base', 'data': { 'type': 'Enum' } }
{ 'struct': 'Branch1', 'data': { 'data': 'str' } }
{ 'struct': 'Branch2', 'data': { 'data': 'int' } }
- { 'union': 'Flat': 'base': 'Base', 'discriminator': 'type',
+ { 'union': 'Flat', 'base': 'Base', 'discriminator': 'type',
'data': { 'one': 'Branch1', 'two': 'Branch2' } }
@@ -394,7 +410,7 @@ following example objects:
=== Commands ===
Usage: { 'command': STRING, '*data': COMPLEX-TYPE-NAME-OR-DICT,
- '*returns': TYPE-NAME-OR-DICT,
+ '*returns': TYPE-NAME,
'*gen': false, '*success-response': false }
Commands are defined by using a dictionary containing several members,
@@ -405,10 +421,9 @@ Client JSON Protocol command exchange.
The 'data' argument maps to the "arguments" dictionary passed in as
part of a Client JSON Protocol command. The 'data' member is optional
and defaults to {} (an empty dictionary). If present, it must be the
-string name of a complex type, a one-element array containing the name
-of a complex type, or a dictionary that declares an anonymous type
-with the same semantics as a 'struct' expression, with one exception
-noted below when 'gen' is used.
+string name of a complex type, or a dictionary that declares an
+anonymous type with the same semantics as a 'struct' expression, with
+one exception noted below when 'gen' is used.
The 'returns' member describes what will appear in the "return" field
of a Client JSON Protocol reply on successful completion of a command.
@@ -416,14 +431,13 @@ The member is optional from the command declaration; if absent, the
"return" field will be an empty dictionary. If 'returns' is present,
it must be the string name of a complex or built-in type, a
one-element array containing the name of a complex or built-in type,
-or a dictionary that declares an anonymous type with the same
-semantics as a 'struct' expression, with one exception noted below
-when 'gen' is used. Although it is permitted to have the 'returns'
-member name a built-in type or an array of built-in types, any command
-that does this cannot be extended to return additional information in
-the future; thus, new commands should strongly consider returning a
-dictionary-based type or an array of dictionaries, even if the
-dictionary only contains one field at the present.
+with one exception noted below when 'gen' is used. Although it is
+permitted to have the 'returns' member name a built-in type or an
+array of built-in types, any command that does this cannot be extended
+to return additional information in the future; thus, new commands
+should strongly consider returning a dictionary-based type or an array
+of dictionaries, even if the dictionary only contains one field at the
+present.
All commands in Client JSON Protocol use a dictionary to report
failure, with no way to specify that in QAPI. Where the error return
@@ -448,17 +462,14 @@ which would validate this Client JSON Protocol transaction:
<= { "return": [ { "value": "one" }, { } ] }
In rare cases, QAPI cannot express a type-safe representation of a
-corresponding Client JSON Protocol command. In these cases, if the
-command expression includes the key 'gen' with boolean value false,
-then the 'data' or 'returns' member that intends to bypass generated
-type-safety and do its own manual validation should use an inline
-dictionary definition, with a value of '**' rather than a valid type
-name for the keys that the generated code will not validate. Please
-try to avoid adding new commands that rely on this, and instead use
-type-safe unions. For an example of bypass usage:
+corresponding Client JSON Protocol command. You then have to suppress
+generation of a marshalling function by including a key 'gen' with
+boolean value false, and instead write your own function. Please try
+to avoid adding new commands that rely on this, and instead use
+type-safe unions. For an example of this usage:
{ 'command': 'netdev_add',
- 'data': {'type': 'str', 'id': 'str', '*props': '**'},
+ 'data': {'type': 'str', 'id': 'str'},
'gen': false }
Normally, the QAPI schema is used to describe synchronous exchanges,
@@ -495,13 +506,229 @@ Resulting in this JSON object:
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
+== Client JSON Protocol introspection ==
+
+Clients of a Client JSON Protocol commonly need to figure out what
+exactly the server (QEMU) supports.
+
+For this purpose, QMP provides introspection via command
+query-qmp-schema. QGA currently doesn't support introspection.
+
+While Client JSON Protocol wire compatibility should be maintained
+between qemu versions, we cannot make the same guarantees for
+introspection stability. For example, one version of qemu may provide
+a non-variant optional member of a struct, and a later version rework
+the member to instead be non-optional and associated with a variant.
+Likewise, one version of qemu may list a member with open-ended type
+'str', and a later version could convert it to a finite set of strings
+via an enum type; or a member may be converted from a specific type to
+an alternate that represents a choice between the original type and
+something else.
+
+query-qmp-schema returns a JSON array of SchemaInfo objects. These
+objects together describe the wire ABI, as defined in the QAPI schema.
+There is no specified order to the SchemaInfo objects returned; a
+client must search for a particular name throughout the entire array
+to learn more about that name, but is at least guaranteed that there
+will be no collisions between type, command, and event names.
+
+However, the SchemaInfo can't reflect all the rules and restrictions
+that apply to QMP. It's interface introspection (figuring out what's
+there), not interface specification. The specification is in the QAPI
+schema. To understand how QMP is to be used, you need to study the
+QAPI schema.
+
+Like any other command, query-qmp-schema is itself defined in the QAPI
+schema, along with the SchemaInfo type. This text attempts to give an
+overview how things work. For details you need to consult the QAPI
+schema.
+
+SchemaInfo objects have common members "name" and "meta-type", and
+additional variant members depending on the value of meta-type.
+
+Each SchemaInfo object describes a wire ABI entity of a certain
+meta-type: a command, event or one of several kinds of type.
+
+SchemaInfo for commands and events have the same name as in the QAPI
+schema.
+
+Command and event names are part of the wire ABI, but type names are
+not. Therefore, the SchemaInfo for types have auto-generated
+meaningless names. For readability, the examples in this section use
+meaningful type names instead.
+
+To examine a type, start with a command or event using it, then follow
+references by name.
+
+QAPI schema definitions not reachable that way are omitted.
+
+The SchemaInfo for a command has meta-type "command", and variant
+members "arg-type" and "ret-type". On the wire, the "arguments"
+member of a client's "execute" command must conform to the object type
+named by "arg-type". The "return" member that the server passes in a
+success response conforms to the type named by "ret-type".
+
+If the command takes no arguments, "arg-type" names an object type
+without members. Likewise, if the command returns nothing, "ret-type"
+names an object type without members.
+
+Example: the SchemaInfo for command query-qmp-schema
+
+ { "name": "query-qmp-schema", "meta-type": "command",
+ "arg-type": ":empty", "ret-type": "SchemaInfoList" }
+
+ Type ":empty" is an object type without members, and type
+ "SchemaInfoList" is the array of SchemaInfo type.
+
+The SchemaInfo for an event has meta-type "event", and variant member
+"arg-type". On the wire, a "data" member that the server passes in an
+event conforms to the object type named by "arg-type".
+
+If the event carries no additional information, "arg-type" names an
+object type without members. The event may not have a data member on
+the wire then.
+
+Each command or event defined with dictionary-valued 'data' in the
+QAPI schema implicitly defines an object type.
+
+Example: the SchemaInfo for EVENT_C from section Events
+
+ { "name": "EVENT_C", "meta-type": "event",
+ "arg-type": ":obj-EVENT_C-arg" }
+
+ Type ":obj-EVENT_C-arg" is an implicitly defined object type with
+ the two members from the event's definition.
+
+The SchemaInfo for struct and union types has meta-type "object".
+
+The SchemaInfo for a struct type has variant member "members".
+
+The SchemaInfo for a union type additionally has variant members "tag"
+and "variants".
+
+"members" is a JSON array describing the object's common members, if
+any. Each element is a JSON object with members "name" (the member's
+name), "type" (the name of its type), and optionally "default". The
+member is optional if "default" is present. Currently, "default" can
+only have value null. Other values are reserved for future
+extensions. The "members" array is in no particular order; clients
+must search the entire object when learning whether a particular
+member is supported.
+
+Example: the SchemaInfo for MyType from section Struct types
+
+ { "name": "MyType", "meta-type": "object",
+ "members": [
+ { "name": "member1", "type": "str" },
+ { "name": "member2", "type": "int" },
+ { "name": "member3", "type": "str", "default": null } ] }
+
+"tag" is the name of the common member serving as type tag.
+"variants" is a JSON array describing the object's variant members.
+Each element is a JSON object with members "case" (the value of type
+tag this element applies to) and "type" (the name of an object type
+that provides the variant members for this type tag value). The
+"variants" array is in no particular order, and is not guaranteed to
+list cases in the same order as the corresponding "tag" enum type.
+
+Example: the SchemaInfo for flat union BlockdevOptions from section
+Union types
+
+ { "name": "BlockdevOptions", "meta-type": "object",
+ "members": [
+ { "name": "driver", "type": "BlockdevDriver" },
+ { "name": "readonly", "type": "bool"} ],
+ "tag": "driver",
+ "variants": [
+ { "case": "file", "type": "FileOptions" },
+ { "case": "qcow2", "type": "Qcow2Options" } ] }
+
+Note that base types are "flattened": its members are included in the
+"members" array.
+
+A simple union implicitly defines an enumeration type for its implicit
+discriminator (called "type" on the wire, see section Union types).
+
+A simple union implicitly defines an object type for each of its
+variants.
+
+Example: the SchemaInfo for simple union BlockdevOptions from section
+Union types
+
+ { "name": "BlockdevOptions", "meta-type": "object",
+ "members": [
+ { "name": "kind", "type": "BlockdevOptionsKind" } ],
+ "tag": "type",
+ "variants": [
+ { "case": "file", "type": ":obj-FileOptions-wrapper" },
+ { "case": "qcow2", "type": ":obj-Qcow2Options-wrapper" } ] }
+
+ Enumeration type "BlockdevOptionsKind" and the object types
+ ":obj-FileOptions-wrapper", ":obj-Qcow2Options-wrapper" are
+ implicitly defined.
+
+The SchemaInfo for an alternate type has meta-type "alternate", and
+variant member "members". "members" is a JSON array. Each element is
+a JSON object with member "type", which names a type. Values of the
+alternate type conform to exactly one of its member types. There is
+no guarantee on the order in which "members" will be listed.
+
+Example: the SchemaInfo for BlockRef from section Alternate types
+
+ { "name": "BlockRef", "meta-type": "alternate",
+ "members": [
+ { "type": "BlockdevOptions" },
+ { "type": "str" } ] }
+
+The SchemaInfo for an array type has meta-type "array", and variant
+member "element-type", which names the array's element type. Array
+types are implicitly defined. For convenience, the array's name may
+resemble the element type; however, clients should examine member
+"element-type" instead of making assumptions based on parsing member
+"name".
+
+Example: the SchemaInfo for ['str']
+
+ { "name": "[str]", "meta-type": "array",
+ "element-type": "str" }
+
+The SchemaInfo for an enumeration type has meta-type "enum" and
+variant member "values". The values are listed in no particular
+order; clients must search the entire enum when learning whether a
+particular value is supported.
+
+Example: the SchemaInfo for MyEnum from section Enumeration types
+
+ { "name": "MyEnum", "meta-type": "enum",
+ "values": [ "value1", "value2", "value3" ] }
+
+The SchemaInfo for a built-in type has the same name as the type in
+the QAPI schema (see section Built-in Types), with one exception
+detailed below. It has variant member "json-type" that shows how
+values of this type are encoded on the wire.
+
+Example: the SchemaInfo for str
+
+ { "name": "str", "meta-type": "builtin", "json-type": "string" }
+
+The QAPI schema supports a number of integer types that only differ in
+how they map to C. They are identical as far as SchemaInfo is
+concerned. Therefore, they get all mapped to a single type "int" in
+SchemaInfo.
+
+As explained above, type names are not part of the wire ABI. Not even
+the names of built-in types. Clients should examine member
+"json-type" instead of hard-coding names of built-in types.
+
+
== Code generation ==
-Schemas are fed into 3 scripts to generate all the code/files that, paired
-with the core QAPI libraries, comprise everything required to take JSON
-commands read in by a Client JSON Protocol server, unmarshal the arguments into
-the underlying C types, call into the corresponding C function, and map the
-response back to a Client JSON Protocol response to be returned to the user.
+Schemas are fed into four scripts to generate all the code/files that,
+paired with the core QAPI libraries, comprise everything required to
+take JSON commands read in by a Client JSON Protocol server, unmarshal
+the arguments into the underlying C types, call into the corresponding
+C function, and map the response back to a Client JSON Protocol
+response to be returned to the user.
As an example, we'll use the following schema, which describes a single
complex user-defined type (which will produce a C struct, along with a list
@@ -540,36 +767,35 @@ Example:
$ cat qapi-generated/example-qapi-types.c
[Uninteresting stuff omitted...]
- void qapi_free_UserDefOneList(UserDefOneList *obj)
+ void qapi_free_UserDefOne(UserDefOne *obj)
{
- QapiDeallocVisitor *md;
+ QapiDeallocVisitor *qdv;
Visitor *v;
if (!obj) {
return;
}
- md = qapi_dealloc_visitor_new();
- v = qapi_dealloc_get_visitor(md);
- visit_type_UserDefOneList(v, &obj, NULL, NULL);
- qapi_dealloc_visitor_cleanup(md);
+ qdv = qapi_dealloc_visitor_new();
+ v = qapi_dealloc_get_visitor(qdv);
+ visit_type_UserDefOne(v, &obj, NULL, NULL);
+ qapi_dealloc_visitor_cleanup(qdv);
}
- void qapi_free_UserDefOne(UserDefOne *obj)
+ void qapi_free_UserDefOneList(UserDefOneList *obj)
{
- QapiDeallocVisitor *md;
+ QapiDeallocVisitor *qdv;
Visitor *v;
if (!obj) {
return;
}
- md = qapi_dealloc_visitor_new();
- v = qapi_dealloc_get_visitor(md);
- visit_type_UserDefOne(v, &obj, NULL, NULL);
- qapi_dealloc_visitor_cleanup(md);
+ qdv = qapi_dealloc_visitor_new();
+ v = qapi_dealloc_get_visitor(qdv);
+ visit_type_UserDefOneList(v, &obj, NULL, NULL);
+ qapi_dealloc_visitor_cleanup(qdv);
}
-
$ cat qapi-generated/example-qapi-types.h
[Uninteresting stuff omitted...]
@@ -580,25 +806,24 @@ Example:
typedef struct UserDefOne UserDefOne;
- typedef struct UserDefOneList
- {
+ typedef struct UserDefOneList UserDefOneList;
+
+ struct UserDefOne {
+ int64_t integer;
+ char *string;
+ };
+
+ void qapi_free_UserDefOne(UserDefOne *obj);
+
+ struct UserDefOneList {
union {
UserDefOne *value;
uint64_t padding;
};
- struct UserDefOneList *next;
- } UserDefOneList;
-
-[Functions on built-in types omitted...]
-
- struct UserDefOne
- {
- int64_t integer;
- char *string;
+ UserDefOneList *next;
};
void qapi_free_UserDefOneList(UserDefOneList *obj);
- void qapi_free_UserDefOne(UserDefOne *obj);
#endif
@@ -627,14 +852,15 @@ Example:
$ cat qapi-generated/example-qapi-visit.c
[Uninteresting stuff omitted...]
- static void visit_type_UserDefOne_fields(Visitor *m, UserDefOne **obj, Error **errp)
+ static void visit_type_UserDefOne_fields(Visitor *v, UserDefOne **obj, Error **errp)
{
Error *err = NULL;
- visit_type_int(m, &(*obj)->integer, "integer", &err);
+
+ visit_type_int(v, &(*obj)->integer, "integer", &err);
if (err) {
goto out;
}
- visit_type_str(m, &(*obj)->string, "string", &err);
+ visit_type_str(v, &(*obj)->string, "string", &err);
if (err) {
goto out;
}
@@ -643,40 +869,40 @@ Example:
error_propagate(errp, err);
}
- void visit_type_UserDefOne(Visitor *m, UserDefOne **obj, const char *name, Error **errp)
+ void visit_type_UserDefOne(Visitor *v, UserDefOne **obj, const char *name, Error **errp)
{
Error *err = NULL;
- visit_start_struct(m, (void **)obj, "UserDefOne", name, sizeof(UserDefOne), &err);
+ visit_start_struct(v, (void **)obj, "UserDefOne", name, sizeof(UserDefOne), &err);
if (!err) {
if (*obj) {
- visit_type_UserDefOne_fields(m, obj, errp);
+ visit_type_UserDefOne_fields(v, obj, errp);
}
- visit_end_struct(m, &err);
+ visit_end_struct(v, &err);
}
error_propagate(errp, err);
}
- void visit_type_UserDefOneList(Visitor *m, UserDefOneList **obj, const char *name, Error **errp)
+ void visit_type_UserDefOneList(Visitor *v, UserDefOneList **obj, const char *name, Error **errp)
{
Error *err = NULL;
GenericList *i, **prev;
- visit_start_list(m, name, &err);
+ visit_start_list(v, name, &err);
if (err) {
goto out;
}
for (prev = (GenericList **)obj;
- !err && (i = visit_next_list(m, prev, &err)) != NULL;
+ !err && (i = visit_next_list(v, prev, &err)) != NULL;
prev = &i) {
UserDefOneList *native_i = (UserDefOneList *)i;
- visit_type_UserDefOne(m, &native_i->value, NULL, &err);
+ visit_type_UserDefOne(v, &native_i->value, NULL, &err);
}
error_propagate(errp, err);
err = NULL;
- visit_end_list(m, &err);
+ visit_end_list(v, &err);
out:
error_propagate(errp, err);
}
@@ -688,8 +914,8 @@ Example:
[Visitors for built-in types omitted...]
- void visit_type_UserDefOne(Visitor *m, UserDefOne **obj, const char *name, Error **errp);
- void visit_type_UserDefOneList(Visitor *m, UserDefOneList **obj, const char *name, Error **errp);
+ void visit_type_UserDefOne(Visitor *v, UserDefOne **obj, const char *name, Error **errp);
+ void visit_type_UserDefOneList(Visitor *v, UserDefOneList **obj, const char *name, Error **errp);
#endif
@@ -717,64 +943,63 @@ Example:
$ cat qapi-generated/example-qmp-marshal.c
[Uninteresting stuff omitted...]
- static void qmp_marshal_output_my_command(UserDefOne *ret_in, QObject **ret_out, Error **errp)
+ static void qmp_marshal_output_UserDefOne(UserDefOne *ret_in, QObject **ret_out, Error **errp)
{
- Error *local_err = NULL;
- QmpOutputVisitor *mo = qmp_output_visitor_new();
- QapiDeallocVisitor *md;
+ Error *err = NULL;
+ QmpOutputVisitor *qov = qmp_output_visitor_new();
+ QapiDeallocVisitor *qdv;
Visitor *v;
- v = qmp_output_get_visitor(mo);
- visit_type_UserDefOne(v, &ret_in, "unused", &local_err);
- if (local_err) {
+ v = qmp_output_get_visitor(qov);
+ visit_type_UserDefOne(v, &ret_in, "unused", &err);
+ if (err) {
goto out;
}
- *ret_out = qmp_output_get_qobject(mo);
+ *ret_out = qmp_output_get_qobject(qov);
out:
- error_propagate(errp, local_err);
- qmp_output_visitor_cleanup(mo);
- md = qapi_dealloc_visitor_new();
- v = qapi_dealloc_get_visitor(md);
+ error_propagate(errp, err);
+ qmp_output_visitor_cleanup(qov);
+ qdv = qapi_dealloc_visitor_new();
+ v = qapi_dealloc_get_visitor(qdv);
visit_type_UserDefOne(v, &ret_in, "unused", NULL);
- qapi_dealloc_visitor_cleanup(md);
+ qapi_dealloc_visitor_cleanup(qdv);
}
- static void qmp_marshal_input_my_command(QDict *args, QObject **ret, Error **errp)
+ static void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp)
{
- Error *local_err = NULL;
- UserDefOne *retval = NULL;
- QmpInputVisitor *mi = qmp_input_visitor_new_strict(QOBJECT(args));
- QapiDeallocVisitor *md;
+ Error *err = NULL;
+ UserDefOne *retval;
+ QmpInputVisitor *qiv = qmp_input_visitor_new_strict(QOBJECT(args));
+ QapiDeallocVisitor *qdv;
Visitor *v;
UserDefOne *arg1 = NULL;
- v = qmp_input_get_visitor(mi);
- visit_type_UserDefOne(v, &arg1, "arg1", &local_err);
- if (local_err) {
+ v = qmp_input_get_visitor(qiv);
+ visit_type_UserDefOne(v, &arg1, "arg1", &err);
+ if (err) {
goto out;
}
- retval = qmp_my_command(arg1, &local_err);
- if (local_err) {
+ retval = qmp_my_command(arg1, &err);
+ if (err) {
goto out;
}
- qmp_marshal_output_my_command(retval, ret, &local_err);
+ qmp_marshal_output_UserDefOne(retval, ret, &err);
out:
- error_propagate(errp, local_err);
- qmp_input_visitor_cleanup(mi);
- md = qapi_dealloc_visitor_new();
- v = qapi_dealloc_get_visitor(md);
+ error_propagate(errp, err);
+ qmp_input_visitor_cleanup(qiv);
+ qdv = qapi_dealloc_visitor_new();
+ v = qapi_dealloc_get_visitor(qdv);
visit_type_UserDefOne(v, &arg1, "arg1", NULL);
- qapi_dealloc_visitor_cleanup(md);
- return;
+ qapi_dealloc_visitor_cleanup(qdv);
}
static void qmp_init_marshal(void)
{
- qmp_register_command("my-command", qmp_marshal_input_my_command, QCO_NO_OPTIONS);
+ qmp_register_command("my-command", qmp_marshal_my_command, QCO_NO_OPTIONS);
}
qapi_init(qmp_init_marshal);
@@ -811,7 +1036,7 @@ Example:
void qapi_event_send_my_event(Error **errp)
{
QDict *qmp;
- Error *local_err = NULL;
+ Error *err = NULL;
QMPEventFuncEmit emit;
emit = qmp_event_get_func_emit();
if (!emit) {
@@ -820,15 +1045,15 @@ Example:
qmp = qmp_event_build_dict("MY_EVENT");
- emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp, &local_err);
+ emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp, &err);
- error_propagate(errp, local_err);
+ error_propagate(errp, err);
QDECREF(qmp);
}
- const char *EXAMPLE_QAPIEvent_lookup[] = {
- "MY_EVENT",
- NULL,
+ const char *const example_QAPIEvent_lookup[] = {
+ [EXAMPLE_QAPI_EVENT_MY_EVENT] = "MY_EVENT",
+ [EXAMPLE_QAPI_EVENT_MAX] = NULL,
};
$ cat qapi-generated/example-qapi-event.h
[Uninteresting stuff omitted...]
@@ -843,11 +1068,45 @@ Example:
void qapi_event_send_my_event(Error **errp);
- extern const char *EXAMPLE_QAPIEvent_lookup[];
- typedef enum EXAMPLE_QAPIEvent
- {
+ typedef enum example_QAPIEvent {
EXAMPLE_QAPI_EVENT_MY_EVENT = 0,
EXAMPLE_QAPI_EVENT_MAX = 1,
- } EXAMPLE_QAPIEvent;
+ } example_QAPIEvent;
+
+ extern const char *const example_QAPIEvent_lookup[];
+
+ #endif
+
+=== scripts/qapi-introspect.py ===
+
+Used to generate the introspection C code for a schema. The following
+files are created:
+
+$(prefix)qmp-introspect.c - Defines a string holding a JSON
+ description of the schema.
+$(prefix)qmp-introspect.h - Declares the above string.
+
+Example:
+
+ $ python scripts/qapi-introspect.py --output-dir="qapi-generated"
+ --prefix="example-" example-schema.json
+ $ cat qapi-generated/example-qmp-introspect.c
+[Uninteresting stuff omitted...]
+
+ const char example_qmp_schema_json[] = "["
+ "{\"arg-type\": \"0\", \"meta-type\": \"event\", \"name\": \"MY_EVENT\"}, "
+ "{\"arg-type\": \"1\", \"meta-type\": \"command\", \"name\": \"my-command\", \"ret-type\": \"2\"}, "
+ "{\"members\": [], \"meta-type\": \"object\", \"name\": \"0\"}, "
+ "{\"members\": [{\"name\": \"arg1\", \"type\": \"2\"}], \"meta-type\": \"object\", \"name\": \"1\"}, "
+ "{\"members\": [{\"name\": \"integer\", \"type\": \"int\"}, {\"name\": \"string\", \"type\": \"str\"}], \"meta-type\": \"object\", \"name\": \"2\"}, "
+ "{\"json-type\": \"int\", \"meta-type\": \"builtin\", \"name\": \"int\"}, "
+ "{\"json-type\": \"string\", \"meta-type\": \"builtin\", \"name\": \"str\"}]";
+ $ cat qapi-generated/example-qmp-introspect.h
+[Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QMP_INTROSPECT_H
+ #define EXAMPLE_QMP_INTROSPECT_H
+
+ extern const char example_qmp_schema_json[];
#endif
diff --git a/docs/qcow2-cache.txt b/docs/qcow2-cache.txt
new file mode 100644
index 000000000..5bb06072d
--- /dev/null
+++ b/docs/qcow2-cache.txt
@@ -0,0 +1,164 @@
+qcow2 L2/refcount cache configuration
+=====================================
+Copyright (C) 2015 Igalia, S.L.
+Author: Alberto Garcia <berto@igalia.com>
+
+This work is licensed under the terms of the GNU GPL, version 2 or
+later. See the COPYING file in the top-level directory.
+
+Introduction
+------------
+The QEMU qcow2 driver has two caches that can improve the I/O
+performance significantly. However, setting the right cache sizes is
+not a straightforward operation.
+
+This document attempts to give an overview of the L2 and refcount
+caches, and how to configure them.
+
+Please refer to the docs/specs/qcow2.txt file for an in-depth
+technical description of the qcow2 file format.
+
+
+Clusters
+--------
+A qcow2 file is organized in units of constant size called clusters.
+
+The cluster size is configurable, but it must be a power of two and
+its value 512 bytes or higher. QEMU currently defaults to 64 KB
+clusters, and it does not support sizes larger than 2MB.
+
+The 'qemu-img create' command supports specifying the size using the
+cluster_size option:
+
+ qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G
+
+
+The L2 tables
+-------------
+The qcow2 format uses a two-level structure to map the virtual disk as
+seen by the guest to the disk image in the host. These structures are
+called the L1 and L2 tables.
+
+There is one single L1 table per disk image. The table is small and is
+always kept in memory.
+
+There can be many L2 tables, depending on how much space has been
+allocated in the image. Each table is one cluster in size. In order to
+read or write data from the virtual disk, QEMU needs to read its
+corresponding L2 table to find out where that data is located. Since
+reading the table for each I/O operation can be expensive, QEMU keeps
+an L2 cache in memory to speed up disk access.
+
+The size of the L2 cache can be configured, and setting the right
+value can improve the I/O performance significantly.
+
+
+The refcount blocks
+-------------------
+The qcow2 format also mantains a reference count for each cluster.
+Reference counts are used for cluster allocation and internal
+snapshots. The data is stored in a two-level structure similar to the
+L1/L2 tables described above.
+
+The second level structures are called refcount blocks, are also one
+cluster in size and the number is also variable and dependent on the
+amount of allocated space.
+
+Each block contains a number of refcount entries. Their size (in bits)
+is a power of two and must not be higher than 64. It defaults to 16
+bits, but a different value can be set using the refcount_bits option:
+
+ qemu-img create -f qcow2 -o refcount_bits=8 hd.qcow2 4G
+
+QEMU keeps a refcount cache to speed up I/O much like the
+aforementioned L2 cache, and its size can also be configured.
+
+
+Choosing the right cache sizes
+------------------------------
+In order to choose the cache sizes we need to know how they relate to
+the amount of allocated space.
+
+The amount of virtual disk that can be mapped by the L2 and refcount
+caches (in bytes) is:
+
+ disk_size = l2_cache_size * cluster_size / 8
+ disk_size = refcount_cache_size * cluster_size * 8 / refcount_bits
+
+With the default values for cluster_size (64KB) and refcount_bits
+(16), that is
+
+ disk_size = l2_cache_size * 8192
+ disk_size = refcount_cache_size * 32768
+
+So in order to cover n GB of disk space with the default values we
+need:
+
+ l2_cache_size = disk_size_GB * 131072
+ refcount_cache_size = disk_size_GB * 32768
+
+QEMU has a default L2 cache of 1MB (1048576 bytes) and a refcount
+cache of 256KB (262144 bytes), so using the formulas we've just seen
+we have
+
+ 1048576 / 131072 = 8 GB of virtual disk covered by that cache
+ 262144 / 32768 = 8 GB
+
+
+How to configure the cache sizes
+--------------------------------
+Cache sizes can be configured using the -drive option in the
+command-line, or the 'blockdev-add' QMP command.
+
+There are three options available, and all of them take bytes:
+
+"l2-cache-size": maximum size of the L2 table cache
+"refcount-cache-size": maximum size of the refcount block cache
+"cache-size": maximum size of both caches combined
+
+There are two things that need to be taken into account:
+
+ - Both caches must have a size that is a multiple of the cluster
+ size.
+
+ - If you only set one of the options above, QEMU will automatically
+ adjust the others so that the L2 cache is 4 times bigger than the
+ refcount cache.
+
+This means that these options are equivalent:
+
+ -drive file=hd.qcow2,l2-cache-size=2097152
+ -drive file=hd.qcow2,refcount-cache-size=524288
+ -drive file=hd.qcow2,cache-size=2621440
+
+The reason for this 1/4 ratio is to ensure that both caches cover the
+same amount of disk space. Note however that this is only valid with
+the default value of refcount_bits (16). If you are using a different
+value you might want to calculate both cache sizes yourself since QEMU
+will always use the same 1/4 ratio.
+
+It's also worth mentioning that there's no strict need for both caches
+to cover the same amount of disk space. The refcount cache is used
+much less often than the L2 cache, so it's perfectly reasonable to
+keep it small.
+
+
+Reducing the memory usage
+-------------------------
+It is possible to clean unused cache entries in order to reduce the
+memory usage during periods of low I/O activity.
+
+The parameter "cache-clean-interval" defines an interval (in seconds).
+All cache entries that haven't been accessed during that interval are
+removed from memory.
+
+This example removes all unused cache entries every 15 minutes:
+
+ -drive file=hd.qcow2,cache-clean-interval=900
+
+If unset, the default value for this parameter is 0 and it disables
+this feature.
+
+Note that this functionality currently relies on the MADV_DONTNEED
+argument for madvise() to actually free the memory, so it is not
+useful in systems that don't follow that behavior.
diff --git a/docs/qmp/qmp-events.txt b/docs/qmp-events.txt
index d92cc4833..d2f1ce497 100644
--- a/docs/qmp/qmp-events.txt
+++ b/docs/qmp-events.txt
@@ -28,6 +28,8 @@ Example:
"data": { "actual": 944766976 },
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
+Note: this event is rate-limited.
+
BLOCK_IMAGE_CORRUPTED
---------------------
@@ -296,6 +298,8 @@ Example:
"data": { "reference": "usr1", "sector-num": 345435, "sectors-count": 5 },
"timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
+Note: this event is rate-limited.
+
QUORUM_REPORT_BAD
-----------------
@@ -318,6 +322,8 @@ Example:
"data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 },
"timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
+Note: this event is rate-limited.
+
RESET
-----
@@ -358,6 +364,8 @@ Example:
"data": { "offset": 78 },
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
+Note: this event is rate-limited.
+
SHUTDOWN
--------
@@ -632,6 +640,8 @@ Example:
"data": { "id": "channel0", "open": true },
"timestamp": { "seconds": 1401385907, "microseconds": 422329 } }
+Note: this event is rate-limited separately for each "id".
+
WAKEUP
------
@@ -662,3 +672,5 @@ Example:
Note: If action is "reset", "shutdown", or "pause" the WATCHDOG event is
followed respectively by the RESET, SHUTDOWN, or STOP events.
+
+Note: this event is rate-limited.
diff --git a/docs/qmp/README b/docs/qmp-intro.txt
index f6a3a031e..f6a3a031e 100644
--- a/docs/qmp/README
+++ b/docs/qmp-intro.txt
diff --git a/docs/qmp/qmp-spec.txt b/docs/qmp-spec.txt
index 4c28cd943..4fb10a5d6 100644
--- a/docs/qmp/qmp-spec.txt
+++ b/docs/qmp-spec.txt
@@ -175,6 +175,11 @@ The format of asynchronous events is:
For a listing of supported asynchronous events, please, refer to the
qmp-events.txt file.
+Some events are rate-limited to at most one per second. If additional
+"similar" events arrive within one second, all but the last one are
+dropped, and the last one is delayed. "Similar" normally means same
+event type. See qmp-events.txt for details.
+
2.5 QGA Synchronization
-----------------------
diff --git a/docs/rcu.txt b/docs/rcu.txt
index 21ecb8106..2f70954e8 100644
--- a/docs/rcu.txt
+++ b/docs/rcu.txt
@@ -128,7 +128,7 @@ The core RCU API is small:
the callback function is g_free, in particular, g_free_rcu can be
used. In the above case, one could have written simply:
- g_free_rcu(foo_reclaim, rcu);
+ g_free_rcu(&foo, rcu);
typeof(*p) atomic_rcu_read(p);
diff --git a/docs/replay.txt b/docs/replay.txt
new file mode 100644
index 000000000..149727e2a
--- /dev/null
+++ b/docs/replay.txt
@@ -0,0 +1,168 @@
+Copyright (c) 2010-2015 Institute for System Programming
+ of the Russian Academy of Sciences.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Record/replay
+-------------
+
+Record/replay functions are used for the reverse execution and deterministic
+replay of qemu execution. This implementation of deterministic replay can
+be used for deterministic debugging of guest code through a gdb remote
+interface.
+
+Execution recording writes a non-deterministic events log, which can be later
+used for replaying the execution anywhere and for unlimited number of times.
+It also supports checkpointing for faster rewinding during reverse debugging.
+Execution replaying reads the log and replays all non-deterministic events
+including external input, hardware clocks, and interrupts.
+
+Deterministic replay has the following features:
+ * Deterministically replays whole system execution and all contents of
+ the memory, state of the hardware devices, clocks, and screen of the VM.
+ * Writes execution log into the file for later replaying for multiple times
+ on different machines.
+ * Supports i386, x86_64, and ARM hardware platforms.
+ * Performs deterministic replay of all operations with keyboard and mouse
+ input devices.
+
+Usage of the record/replay:
+ * First, record the execution, by adding the following arguments to the command line:
+ '-icount shift=7,rr=record,rrfile=replay.bin -net none'.
+ Block devices' images are not actually changed in the recording mode,
+ because all of the changes are written to the temporary overlay file.
+ * Then you can replay it by using another command
+ line option: '-icount shift=7,rr=replay,rrfile=replay.bin -net none'
+ * '-net none' option should also be specified if network replay patches
+ are not applied.
+
+Papers with description of deterministic replay implementation:
+http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html
+http://dl.acm.org/citation.cfm?id=2786805.2803179
+
+Modifications of qemu include:
+ * wrappers for clock and time functions to save their return values in the log
+ * saving different asynchronous events (e.g. system shutdown) into the log
+ * synchronization of the bottom halves execution
+ * synchronization of the threads from thread pool
+ * recording/replaying user input (mouse and keyboard)
+ * adding internal checkpoints for cpu and io synchronization
+
+Non-deterministic events
+------------------------
+
+Our record/replay system is based on saving and replaying non-deterministic
+events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
+from HDD or memory of the VM). Saving only non-deterministic events makes
+log file smaller, simulation faster, and allows using reverse debugging even
+for realtime applications.
+
+The following non-deterministic data from peripheral devices is saved into
+the log: mouse and keyboard input, network packets, audio controller input,
+USB packets, serial port input, and hardware clocks (they are non-deterministic
+too, because their values are taken from the host machine). Inputs from
+simulated hardware, memory of VM, software interrupts, and execution of
+instructions are not saved into the log, because they are deterministic and
+can be replayed by simulating the behavior of virtual machine starting from
+initial state.
+
+We had to solve three tasks to implement deterministic replay: recording
+non-deterministic events, replaying non-deterministic events, and checking
+that there is no divergence between record and replay modes.
+
+We changed several parts of QEMU to make event log recording and replaying.
+Devices' models that have non-deterministic input from external devices were
+changed to write every external event into the execution log immediately.
+E.g. network packets are written into the log when they arrive into the virtual
+network adapter.
+
+All non-deterministic events are coming from these devices. But to
+replay them we need to know at which moments they occur. We specify
+these moments by counting the number of instructions executed between
+every pair of consecutive events.
+
+Instruction counting
+--------------------
+
+QEMU should work in icount mode to use record/replay feature. icount was
+designed to allow deterministic execution in absence of external inputs
+of the virtual machine. We also use icount to control the occurrence of the
+non-deterministic events. The number of instructions elapsed from the last event
+is written to the log while recording the execution. In replay mode we
+can predict when to inject that event using the instruction counter.
+
+Timers
+------
+
+Timers are used to execute callbacks from different subsystems of QEMU
+at the specified moments of time. There are several kinds of timers:
+ * Real time clock. Based on host time and used only for callbacks that
+ do not change the virtual machine state. For this reason real time
+ clock and timers does not affect deterministic replay at all.
+ * Virtual clock. These timers run only during the emulation. In icount
+ mode virtual clock value is calculated using executed instructions counter.
+ That is why it is completely deterministic and does not have to be recorded.
+ * Host clock. This clock is used by device models that simulate real time
+ sources (e.g. real time clock chip). Host clock is the one of the sources
+ of non-determinism. Host clock read operations should be logged to
+ make the execution deterministic.
+ * Real time clock for icount. This clock is similar to real time clock but
+ it is used only for increasing virtual clock while virtual machine is
+ sleeping. Due to its nature it is also non-deterministic as the host clock
+ and has to be logged too.
+
+Checkpoints
+-----------
+
+Replaying of the execution of virtual machine is bound by sources of
+non-determinism. These are inputs from clock and peripheral devices,
+and QEMU thread scheduling. Thread scheduling affect on processing events
+from timers, asynchronous input-output, and bottom halves.
+
+Invocations of timers are coupled with clock reads and changing the state
+of the virtual machine. Reads produce non-deterministic data taken from
+host clock. And VM state changes should preserve their order. Their relative
+order in replay mode must replicate the order of callbacks in record mode.
+To preserve this order we use checkpoints. When a specific clock is processed
+in record mode we save to the log special "checkpoint" event.
+Checkpoints here do not refer to virtual machine snapshots. They are just
+record/replay events used for synchronization.
+
+QEMU in replay mode will try to invoke timers processing in random moment
+of time. That's why we do not process a group of timers until the checkpoint
+event will be read from the log. Such an event allows synchronizing CPU
+execution and timer events.
+
+Another checkpoints application in record/replay is instruction counting
+while the virtual machine is idle. This function (qemu_clock_warp) is called
+from the wait loop. It changes virtual machine state and must be deterministic
+then. That is why we added checkpoint to this function to prevent its
+operation in replay mode when it does not correspond to record mode.
+
+Bottom halves
+-------------
+
+Disk I/O events are completely deterministic in our model, because
+in both record and replay modes we start virtual machine from the same
+disk state. But callbacks that virtual disk controller uses for reading and
+writing the disk may occur at different moments of time in record and replay
+modes.
+
+Reading and writing requests are created by CPU thread of QEMU. Later these
+requests proceed to block layer which creates "bottom halves". Bottom
+halves consist of callback and its parameters. They are processed when
+main loop locks the global mutex. These locks are not synchronized with
+replaying process because main loop also processes the events that do not
+affect the virtual machine state (like user interaction with monitor).
+
+That is why we had to implement saving and replaying bottom halves callbacks
+synchronously to the CPU execution. When the callback is about to execute
+it is added to the queue in the replay module. This queue is written to the
+log when its callbacks are executed. In replay mode callbacks are not processed
+until the corresponding event is read from the events log file.
+
+Sometimes the block layer uses asynchronous callbacks for its internal purposes
+(like reading or writing VM snapshots or disk image cluster tables). In this
+case bottom halves are not marked as "replayable" and do not saved
+into the log.
diff --git a/docs/specs/fw_cfg.txt b/docs/specs/fw_cfg.txt
index 74351dd18..b8c794f54 100644
--- a/docs/specs/fw_cfg.txt
+++ b/docs/specs/fw_cfg.txt
@@ -76,6 +76,13 @@ increasing address order, similar to memcpy().
Selector Register IOport: 0x510
Data Register IOport: 0x511
+DMA Address IOport: 0x514
+
+=== ARM Register Locations ===
+
+Selector Register address: Base + 8 (2 bytes)
+Data Register address: Base + 0 (8 bytes)
+DMA Address address: Base + 16 (8 bytes)
== Firmware Configuration Items ==
@@ -86,11 +93,15 @@ by selecting the "signature" item using key 0x0000 (FW_CFG_SIGNATURE),
and reading four bytes from the data register. If the fw_cfg device is
present, the four bytes read will contain the characters "QEMU".
-=== Revision (Key 0x0001, FW_CFG_ID) ===
+If the DMA interface is available, then reading the DMA Address
+Register returns 0x51454d5520434647 ("QEMU CFG" in big-endian format).
+
+=== Revision / feature bitmap (Key 0x0001, FW_CFG_ID) ===
-A 32-bit little-endian unsigned int, this item is used as an interface
-revision number, and is currently set to 1 by QEMU when fw_cfg is
-initialized.
+A 32-bit little-endian unsigned int, this item is used to check for enabled
+features.
+ - Bit 0: traditional interface. Always set.
+ - Bit 1: DMA interface.
=== File Directory (Key 0x0019, FW_CFG_FILE_DIR) ===
@@ -132,6 +143,55 @@ Selector Reg. Range Usage
In practice, the number of allowed firmware configuration items is given
by the value of FW_CFG_MAX_ENTRY (see fw_cfg.h).
+= Guest-side DMA Interface =
+
+If bit 1 of the feature bitmap is set, the DMA interface is present. This does
+not replace the existing fw_cfg interface, it is an add-on. This interface
+can be used through the 64-bit wide address register.
+
+The address register is in big-endian format. The value for the register is 0
+at startup and after an operation. A write to the least significant half (at
+offset 4) triggers an operation. This means that operations with 32-bit
+addresses can be triggered with just one write, whereas operations with
+64-bit addresses can be triggered with one 64-bit write or two 32-bit writes,
+starting with the most significant half (at offset 0).
+
+In this register, the physical address of a FWCfgDmaAccess structure in RAM
+should be written. This is the format of the FWCfgDmaAccess structure:
+
+typedef struct FWCfgDmaAccess {
+ uint32_t control;
+ uint32_t length;
+ uint64_t address;
+} FWCfgDmaAccess;
+
+The fields of the structure are in big endian mode, and the field at the lowest
+address is the "control" field.
+
+The "control" field has the following bits:
+ - Bit 0: Error
+ - Bit 1: Read
+ - Bit 2: Skip
+ - Bit 3: Select. The upper 16 bits are the selected index.
+
+When an operation is triggered, if the "control" field has bit 3 set, the
+upper 16 bits are interpreted as an index of a firmware configuration item.
+This has the same effect as writing the selector register.
+
+If the "control" field has bit 1 set, a read operation will be performed.
+"length" bytes for the current selector and offset will be copied into the
+physical RAM address specified by the "address" field.
+
+If the "control" field has bit 2 set (and not bit 1), a skip operation will be
+performed. The offset for the current selector will be advanced "length" bytes.
+
+To check the result, read the "control" field:
+ error bit set -> something went wrong.
+ all bits cleared -> transfer finished successfully.
+ otherwise -> transfer still in progress (doesn't happen
+ today due to implementation not being async,
+ but may in the future).
+
= Host-side API =
The following functions are available to the QEMU programmer for adding
@@ -159,6 +219,17 @@ will convert a 16-, 32-, or 64-bit integer to little-endian, then add
a dynamically allocated copy of the appropriately sized item to fw_cfg
under the given selector key value.
+== fw_cfg_modify_iXX() ==
+
+Modify the value of an XX-bit item (where XX may be 16, 32, or 64).
+Similarly to the corresponding fw_cfg_add_iXX() function set, convert
+a 16-, 32-, or 64-bit integer to little endian, create a dynamically
+allocated copy of the required size, and replace the existing item at
+the given selector key value with the newly allocated one. The previous
+item, assumed to have been allocated during an earlier call to
+fw_cfg_add_iXX() or fw_cfg_modify_iXX() (of the same width XX), is freed
+before the function returns.
+
== fw_cfg_add_file() ==
Given a filename (i.e., fw_cfg item name), starting pointer, and size,
@@ -216,6 +287,21 @@ the following syntax:
where <item_name> is the fw_cfg item name, and <path> is the location
on the host file system of a file containing the data to be inserted.
+Small enough items may be provided directly as strings on the command
+line, using the syntax:
+
+ -fw_cfg [name=]<item_name>,string=<string>
+
+The terminating NUL character of the content <string> will NOT be
+included as part of the fw_cfg item data, which is consistent with
+the absence of a NUL terminator for items inserted via the file option.
+
+Both <item_name> and, if applicable, the content <string> are passed
+through by QEMU without any interpretation, expansion, or further
+processing. Any such processing (potentially performed e.g., by the shell)
+is outside of QEMU's responsibility; as such, using plain ASCII characters
+is recommended.
+
NOTE: Users *SHOULD* choose item names beginning with the prefix "opt/"
when using the "-fw_cfg" command line option, to avoid conflicting with
item names used internally by QEMU. For instance:
diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
index 667a8628f..d318d65c3 100644
--- a/docs/specs/ivshmem_device_spec.txt
+++ b/docs/specs/ivshmem_device_spec.txt
@@ -2,30 +2,106 @@
Device Specification for Inter-VM shared memory device
------------------------------------------------------
-The Inter-VM shared memory device is designed to share a region of memory to
-userspace in multiple virtual guests. The memory region does not belong to any
-guest, but is a POSIX memory object on the host. Optionally, the device may
-support sending interrupts to other guests sharing the same memory region.
+The Inter-VM shared memory device is designed to share a memory region (created
+on the host via the POSIX shared memory API) between multiple QEMU processes
+running different guests. In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing said memory
+to the guest as a PCI BAR.
+The memory region does not belong to any guest, but is a POSIX memory object on
+the host. The host can access this shared memory if needed.
+
+The device also provides an optional communication mechanism between guests
+sharing the same memory object. More details about that in the section 'Guest to
+guest communication' section.
The Inter-VM PCI device
-----------------------
-*BARs*
+From the VM point of view, the ivshmem PCI device supports three BARs.
+
+- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
+ not used.
+- BAR1 is used for MSI-X when it is enabled in the device.
+- BAR2 is used to access the shared memory object.
+
+It is your choice how to use the device but you must choose between two
+behaviors :
+
+- basically, if you only need the shared memory part, you will map BAR2.
+ This way, you have access to the shared memory in guest and can use it as you
+ see fit (memnic, for example, uses it in userland
+ http://dpdk.org/browse/memnic).
+
+- BAR0 and BAR1 are used to implement an optional communication mechanism
+ through interrupts in the guests. If you need an event mechanism between the
+ guests accessing the shared memory, you will most likely want to write a
+ kernel driver that will handle interrupts. See details in the section 'Guest
+ to guest communication' section.
+
+The behavior is chosen when starting your QEMU processes:
+- no communication mechanism needed, the first QEMU to start creates the shared
+ memory on the host, subsequent QEMU processes will use it.
+
+- communication mechanism needed, an ivshmem server must be started before any
+ QEMU processes, then each QEMU process connects to the server unix socket.
+
+For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
+
+
+Guest to guest communication
+----------------------------
+
+This section details the communication mechanism between the guests accessing
+the ivhsmem shared memory.
-The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support
-registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is
-used to map the shared memory object from the host. The size of BAR2 is
-specified when the guest is started and must be a power of 2 in size.
+*ivshmem server*
-*Registers*
+This server code is available in qemu.git/contrib/ivshmem-server.
-The device currently supports 4 registers of 32-bits each. Registers
-are used for synchronization between guests sharing the same memory object when
-interrupts are supported (this requires using the shared memory server).
+The server must be started on the host before any guest.
+It creates a shared memory object then waits for clients to connect on a unix
+socket. All the messages are little-endian int64_t integer.
-The server assigns each VM an ID number and sends this ID number to the QEMU
-process when the guest starts.
+For each client (QEMU process) that connects to the server:
+- the server sends a protocol version, if client does not support it, the client
+ closes the communication,
+- the server assigns an ID for this client and sends this ID to him as the first
+ message,
+- the server sends a fd to the shared memory object to this client,
+- the server creates a new set of host eventfds associated to the new client and
+ sends this set to all already connected clients,
+- finally, the server sends all the eventfds sets for all clients to the new
+ client.
+
+The server signals all clients when one of them disconnects.
+
+The client IDs are limited to 16 bits because of the current implementation (see
+Doorbell register in 'PCI device registers' subsection). Hence only 65536
+clients are supported.
+
+All the file descriptors (fd to the shared memory, eventfds for each client)
+are passed to clients using SCM_RIGHTS over the server unix socket.
+
+Apart from the current ivshmem implementation in QEMU, an ivshmem client has
+been provided in qemu.git/contrib/ivshmem-client for debug.
+
+*QEMU as an ivshmem client*
+
+At initialisation, when creating the ivshmem device, QEMU first receives a
+protocol version and closes communication with server if it does not match.
+Then, QEMU gets its ID from the server then makes it available through BAR0
+IVPosition register for the VM to use (see 'PCI device registers' subsection).
+QEMU then uses the fd to the shared memory to map it to BAR2.
+eventfds for all other clients received from the server are stored to implement
+BAR0 Doorbell register (see 'PCI device registers' subsection).
+Finally, eventfds assigned to this QEMU process are used to send interrupts in
+this VM.
+
+*PCI device registers*
+
+From the VM point of view, the ivshmem PCI device supports 4 registers of
+32-bits each.
enum ivshmem_registers {
IntrMask = 0,
@@ -49,8 +125,8 @@ bit to 0 and unmasked by setting the first bit to 1.
IVPosition Register: The IVPosition register is read-only and reports the
guest's ID number. The guest IDs are non-negative integers. When using the
server, since the server is a separate process, the VM ID will only be set when
-the device is ready (shared memory is received from the server and accessible via
-the device). If the device is not ready, the IVPosition will return -1.
+the device is ready (shared memory is received from the server and accessible
+via the device). If the device is not ready, the IVPosition will return -1.
Applications should ensure that they have a valid VM ID before accessing the
shared memory.
@@ -59,8 +135,8 @@ Doorbell register. The doorbell register is 32-bits, logically divided into
two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
16-bits are the interrupt vector to trigger. The semantics of the value
written to the doorbell depends on whether the device is using MSI or a regular
-pin-based interrupt. In short, MSI uses vectors while regular interrupts set the
-status register.
+pin-based interrupt. In short, MSI uses vectors while regular interrupts set
+the status register.
Regular Interrupts
@@ -71,7 +147,7 @@ interrupt in the destination guest.
Message Signalled Interrupts
-A ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
+An ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
written to the Doorbell register must be between 0 and the maximum number of
vectors the guest supports. The lower 16 bits written to the doorbell is the
MSI vector that will be raised in the destination guest. The number of MSI
@@ -83,14 +159,3 @@ interrupt itself should be communicated via the shared memory region. Devices
supporting multiple MSI vectors can use different vectors to indicate different
events have occurred. The semantics of interrupt vectors are left to the
user's discretion.
-
-
-Usage in the Guest
-------------------
-
-The shared memory device is intended to be used with the provided UIO driver.
-Very little configuration is needed. The guest should map BAR0 to access the
-registers (an array of 32-bit ints allows simple writing) and map BAR2 to
-access the shared memory region itself. The size of the shared memory region
-is specified when the guest (or shared memory server) is started. A guest may
-map the whole shared memory region or only part of it.
diff --git a/docs/specs/ppc-spapr-hcalls.txt b/docs/specs/ppc-spapr-hcalls.txt
index 667b3fa00..5bd8eab78 100644
--- a/docs/specs/ppc-spapr-hcalls.txt
+++ b/docs/specs/ppc-spapr-hcalls.txt
@@ -41,8 +41,8 @@ When the guest runs in "real mode" (in powerpc lingua this means
with MMU disabled, ie guest effective == guest physical), it only
has access to a subset of memory and no IOs.
-PAPR provides a set of hypervisor calls to perform cachable or
-non-cachable accesses to any guest physical addresses that the
+PAPR provides a set of hypervisor calls to perform cacheable or
+non-cacheable accesses to any guest physical addresses that the
guest can use in order to access IO devices while in real mode.
This is typically used by the firmware running in the guest.
diff --git a/docs/specs/ppc-spapr-hotplug.txt b/docs/specs/ppc-spapr-hotplug.txt
index 46e07196b..631b0cada 100644
--- a/docs/specs/ppc-spapr-hotplug.txt
+++ b/docs/specs/ppc-spapr-hotplug.txt
@@ -302,4 +302,52 @@ consisting of <phys>, <size> and <maxcpus>.
pseries guests use this property to note the maximum allowed CPUs for the
guest.
+== ibm,dynamic-reconfiguration-memory ==
+
+ibm,dynamic-reconfiguration-memory is a device tree node that represents
+dynamically reconfigurable logical memory blocks (LMB). This node
+is generated only when the guest advertises the support for it via
+ibm,client-architecture-support call. Memory that is not dynamically
+reconfigurable is represented by /memory nodes. The properties of this
+node that are of interest to the sPAPR memory hotplug implementation
+in QEMU are described here.
+
+ibm,lmb-size
+
+This 64bit integer defines the size of each dynamically reconfigurable LMB.
+
+ibm,associativity-lookup-arrays
+
+This property defines a lookup array in which the NUMA associativity
+information for each LMB can be found. It is a property encoded array
+that begins with an integer M, the number of associativity lists followed
+by an integer N, the number of entries per associativity list and terminated
+by M associativity lists each of length N integers.
+
+This property provides the same information as given by ibm,associativity
+property in a /memory node. Each assigned LMB has an index value between
+0 and M-1 which is used as an index into this table to select which
+associativity list to use for the LMB. This index value for each LMB
+is defined in ibm,dynamic-memory property.
+
+ibm,dynamic-memory
+
+This property describes the dynamically reconfigurable memory. It is a
+property encoded array that has an integer N, the number of LMBs followed
+by N LMB list entires.
+
+Each LMB list entry consists of the following elements:
+
+- Logical address of the start of the LMB encoded as a 64bit integer. This
+ corresponds to reg property in /memory node.
+- DRC index of the LMB that corresponds to ibm,my-drc-index property
+ in a /memory node.
+- Four bytes reserved for expansion.
+- Associativity list index for the LMB that is used as an index into
+ ibm,associativity-lookup-arrays property described earlier. This
+ is used to retrieve the right associativity list to be used for this
+ LMB.
+- A 32bit flags word. The bit at bit position 0x00000008 defines whether
+ the LMB is assigned to the the partition as of boot time.
+
[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867
diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index 121dfc8cc..f236d8c6d 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -257,7 +257,7 @@ L2 table entry:
63: 0 for a cluster that is unused or requires COW, 1 if its
refcount is exactly one. This information is only accurate
- in L2 tables that are reachable from the the active L1
+ in L2 tables that are reachable from the active L1
table.
Standard Cluster Descriptor:
diff --git a/docs/specs/rocker.txt b/docs/specs/rocker.txt
index 1c743515c..d2a82624f 100644
--- a/docs/specs/rocker.txt
+++ b/docs/specs/rocker.txt
@@ -297,7 +297,7 @@ but not fired. If only partial credits are returned, the interrupt remains
masked but the device generates an interrupt, signaling the driver that more
outstanding work is available.
-(* this masking is unrelated to to the MSI-X interrupt mask register)
+(* this masking is unrelated to the MSI-X interrupt mask register)
Endianness
----------
diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt
index 650bb1818..0312d40af 100644
--- a/docs/specs/vhost-user.txt
+++ b/docs/specs/vhost-user.txt
@@ -87,6 +87,14 @@ Depending on the request type, payload can be:
User address: a 64-bit user address
mmap offset: 64-bit offset where region starts in the mapped memory
+* Log description
+ ---------------------------
+ | log size | log offset |
+ ---------------------------
+ log size: size of area used for logging
+ log offset: offset from start of supplied file descriptor
+ where logging starts (i.e. where guest address 0 would be logged)
+
In QEMU the vhost-user message is implemented with the following struct:
typedef struct VhostUserMsg {
@@ -98,6 +106,7 @@ typedef struct VhostUserMsg {
struct vhost_vring_state state;
struct vhost_vring_addr addr;
VhostUserMemory memory;
+ VhostUserLog log;
};
} QEMU_PACKED VhostUserMsg;
@@ -113,12 +122,15 @@ message replies. Most of the requests don't require replies. Here is a list of
the ones that do:
* VHOST_GET_FEATURES
+ * VHOST_GET_PROTOCOL_FEATURES
* VHOST_GET_VRING_BASE
+ * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
There are several messages that the master sends with file descriptors passed
in the ancillary data:
* VHOST_SET_MEM_TABLE
+ * VHOST_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
* VHOST_SET_LOG_FD
* VHOST_SET_VRING_KICK
* VHOST_SET_VRING_CALL
@@ -127,6 +139,122 @@ in the ancillary data:
If Master is unable to send the full message or receives a wrong reply it will
close the connection. An optional reconnection mechanism can be implemented.
+Any protocol extensions are gated by protocol feature bits,
+which allows full backwards compatibility on both master
+and slave.
+As older slaves don't support negotiating protocol features,
+a feature bit was dedicated for this purpose:
+#define VHOST_USER_F_PROTOCOL_FEATURES 30
+
+Starting and stopping rings
+----------------------
+Client must only process each ring when it is started.
+
+Client must only pass data between the ring and the
+backend, when the ring is enabled.
+
+If ring is started but disabled, client must process the
+ring without talking to the backend.
+
+For example, for a networking device, in the disabled state
+client must not supply any new RX packets, but must process
+and discard any TX packets.
+
+If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized
+in an enabled state.
+
+If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized
+in a disabled state. Client must not pass data to/from the backend until ring is enabled by
+VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been disabled by
+VHOST_USER_SET_VRING_ENABLE with parameter 0.
+
+Each ring is initialized in a stopped state, client must not process it until
+ring is started, or after it has been stopped.
+
+Client must start ring upon receiving a kick (that is, detecting that file
+descriptor is readable) on the descriptor specified by
+VHOST_USER_SET_VRING_KICK, and stop ring upon receiving
+VHOST_USER_GET_VRING_BASE.
+
+While processing the rings (whether they are enabled or not), client must
+support changing some configuration aspects on the fly.
+
+Multiple queue support
+----------------------
+
+Multiple queue is treated as a protocol extension, hence the slave has to
+implement protocol features first. The multiple queues feature is supported
+only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set.
+
+The max number of queues the slave supports can be queried with message
+VHOST_USER_GET_PROTOCOL_FEATURES. Master should stop when the number of
+requested queues is bigger than that.
+
+As all queues share one connection, the master uses a unique index for each
+queue in the sent message to identify a specified queue. One queue pair
+is enabled initially. More queues are enabled dynamically, by sending
+message VHOST_USER_SET_VRING_ENABLE.
+
+Migration
+---------
+
+During live migration, the master may need to track the modifications
+the slave makes to the memory mapped regions. The client should mark
+the dirty pages in a log. Once it complies to this logging, it may
+declare the VHOST_F_LOG_ALL vhost feature.
+
+To start/stop logging of data/used ring writes, server may send messages
+VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with
+VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively.
+
+All the modifications to memory pointed by vring "descriptor" should
+be marked. Modifications to "used" vring should be marked if
+VHOST_VRING_F_LOG is part of ring's flags.
+
+Dirty pages are of size:
+#define VHOST_LOG_PAGE 0x1000
+
+The log memory fd is provided in the ancillary data of
+VHOST_USER_SET_LOG_BASE message when the slave has
+VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature.
+
+The size of the log is supplied as part of VhostUserMsg
+which should be large enough to cover all known guest
+addresses. Log starts at the supplied offset in the
+supplied file descriptor.
+The log covers from address 0 to the maximum of guest
+regions. In pseudo-code, to mark page at "addr" as dirty:
+
+page = addr / VHOST_LOG_PAGE
+log[page / 8] |= 1 << page % 8
+
+Where addr is the guest physical address.
+
+Use atomic operations, as the log may be concurrently manipulated.
+
+Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG
+is set for this ring), log_guest_addr should be used to calculate the log
+offset: the write to first byte of the used ring is logged at this offset from
+log start. Also note that this value might be outside the legal guest physical
+address range (i.e. does not have to be covered by the VhostUserMemory table),
+but the bit offset of the last byte of the ring must fall within
+the size supplied by VhostUserLog.
+
+VHOST_USER_SET_LOG_FD is an optional message with an eventfd in
+ancillary data, it may be used to inform the master that the log has
+been modified.
+
+Once the source has finished migration, rings will be stopped by
+the source. No further update must be done before rings are
+restarted.
+
+Protocol features
+-----------------
+
+#define VHOST_USER_PROTOCOL_F_MQ 0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
+#define VHOST_USER_PROTOCOL_F_RARP 2
+
Message types
-------------
@@ -138,6 +266,8 @@ Message types
Slave payload: u64
Get from the underlying vhost implementation the features bitmask.
+ Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
+ VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.
* VHOST_USER_SET_FEATURES
@@ -146,6 +276,33 @@ Message types
Master payload: u64
Enable features in the underlying vhost implementation using a bitmask.
+ Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
+ VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.
+
+ * VHOST_USER_GET_PROTOCOL_FEATURES
+
+ Id: 15
+ Equivalent ioctl: VHOST_GET_FEATURES
+ Master payload: N/A
+ Slave payload: u64
+
+ Get the protocol feature bitmask from the underlying vhost implementation.
+ Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
+ VHOST_USER_GET_FEATURES.
+ Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
+ this message even before VHOST_USER_SET_FEATURES was called.
+
+ * VHOST_USER_SET_PROTOCOL_FEATURES
+
+ Id: 16
+ Ioctl: VHOST_SET_FEATURES
+ Master payload: u64
+
+ Enable protocol features in the underlying vhost implementation.
+ Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
+ VHOST_USER_GET_FEATURES.
+ Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
+ this message even before VHOST_USER_SET_FEATURES was called.
* VHOST_USER_SET_OWNER
@@ -160,11 +317,13 @@ Message types
* VHOST_USER_RESET_OWNER
Id: 4
- Equivalent ioctl: VHOST_RESET_OWNER
Master payload: N/A
- Issued when a new connection is about to be closed. The Master will no
- longer own this connection (and will usually close it).
+ This is no longer used. Used to be sent to request disabling
+ all rings, but some clients interpreted it to also discard
+ connection state (this interpretation would lead to bugs).
+ It is recommended that clients either ignore this message,
+ or use it to disable all rings.
* VHOST_USER_SET_MEM_TABLE
@@ -182,8 +341,14 @@ Message types
Id: 6
Equivalent ioctl: VHOST_SET_LOG_BASE
Master payload: u64
+ Slave payload: N/A
+
+ Sets logging shared memory space.
+ When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
+ feature, the log memory fd is provided in the ancillary data of
+ VHOST_USER_SET_LOG_BASE message, the size and offset of shared
+ memory area provided in the message.
- Sets the logging base address.
* VHOST_USER_SET_LOG_FD
@@ -264,3 +429,38 @@ Message types
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data.
+
+ * VHOST_USER_GET_QUEUE_NUM
+
+ Id: 17
+ Equivalent ioctl: N/A
+ Master payload: N/A
+ Slave payload: u64
+
+ Query how many queues the backend supports. This request should be
+ sent only when VHOST_USER_PROTOCOL_F_MQ is set in quried protocol
+ features by VHOST_USER_GET_PROTOCOL_FEATURES.
+
+ * VHOST_USER_SET_VRING_ENABLE
+
+ Id: 18
+ Equivalent ioctl: N/A
+ Master payload: vring state description
+
+ Signal slave to enable or disable corresponding vring.
+ This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
+ has been negotiated.
+
+ * VHOST_USER_SEND_RARP
+
+ Id: 19
+ Equivalent ioctl: N/A
+ Master payload: u64
+
+ Ask vhost user backend to broadcast a fake RARP to notify the migration
+ is terminated for guest that does not support GUEST_ANNOUNCE.
+ Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
+ VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP
+ is present in VHOST_USER_GET_PROTOCOL_FEATURES.
+ The first 6 bytes of the payload contain the mac address of the guest to
+ allow the vhost user backend to construct and broadcast the fake RARP.
diff --git a/docs/tracing.txt b/docs/tracing.txt
index 7117c5e7d..3853a6ad8 100644
--- a/docs/tracing.txt
+++ b/docs/tracing.txt
@@ -258,11 +258,11 @@ is generated to make use in scripts more convenient. This step can also be
performed manually after a build in order to change the binary name in the .stp
probes:
- scripts/tracetool --dtrace --stap \
- --binary path/to/qemu-binary \
- --target-type system \
- --target-name x86_64 \
- <trace-events >qemu.stp
+ scripts/tracetool.py --backends=dtrace --format=stap \
+ --binary path/to/qemu-binary \
+ --target-type system \
+ --target-name x86_64 \
+ <trace-events >qemu.stp
== Trace event properties ==
diff --git a/docs/virtio-migration.txt b/docs/virtio-migration.txt
new file mode 100644
index 000000000..cf66458b9
--- /dev/null
+++ b/docs/virtio-migration.txt
@@ -0,0 +1,106 @@
+Virtio devices and migration
+============================
+
+Copyright 2015 IBM Corp.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+Saving and restoring the state of virtio devices is a bit of a twisty maze,
+for several reasons:
+- state is distributed between several parts:
+ - virtio core, for common fields like features, number of queues, ...
+ - virtio transport (pci, ccw, ...), for the different proxy devices and
+ transport specific state (msix vectors, indicators, ...)
+ - virtio device (net, blk, ...), for the different device types and their
+ state (mac address, request queue, ...)
+- most fields are saved via the stream interface; subsequently, subsections
+ have been added to make cross-version migration possible
+
+This file attempts to document the current procedure and point out some
+caveats.
+
+
+Save state procedure
+====================
+
+virtio core virtio transport virtio device
+----------- ---------------- -------------
+
+ save() function registered
+ via register_savevm()
+virtio_save() <----------
+ ------> save_config()
+ - save proxy device
+ - save transport-specific
+ device fields
+- save common device
+ fields
+- save common virtqueue
+ fields
+ ------> save_queue()
+ - save transport-specific
+ virtqueue fields
+ ------> save_device()
+ - save device-specific
+ fields
+- save subsections
+ - device endianness,
+ if changed from
+ default endianness
+ - 64 bit features, if
+ any high feature bit
+ is set
+ - virtio-1 virtqueue
+ fields, if VERSION_1
+ is set
+
+
+Load state procedure
+====================
+
+virtio core virtio transport virtio device
+----------- ---------------- -------------
+
+ load() function registered
+ via register_savevm()
+virtio_load() <----------
+ ------> load_config()
+ - load proxy device
+ - load transport-specific
+ device fields
+- load common device
+ fields
+- load common virtqueue
+ fields
+ ------> load_queue()
+ - load transport-specific
+ virtqueue fields
+- notify guest
+ ------> load_device()
+ - load device-specific
+ fields
+- load subsections
+ - device endianness
+ - 64 bit features
+ - virtio-1 virtqueue
+ fields
+- sanitize endianness
+- sanitize features
+- virtqueue index sanity
+ check
+ - feature-dependent setup
+
+
+Implications of this setup
+==========================
+
+Devices need to be careful in their state processing during load: The
+load_device() procedure is invoked by the core before subsections have
+been loaded. Any code that depends on information transmitted in subsections
+therefore has to be invoked in the device's load() function _after_
+virtio_load() returned (like e.g. code depending on features).
+
+Any extension of the state being migrated should be done in subsections
+added to the core for compatibility reasons. If transport or device specific
+state is added, core needs to invoke a callback from the new subsection.
diff --git a/docs/win32-qemu-event.promela b/docs/win32-qemu-event.promela
new file mode 100644
index 000000000..c446a7155
--- /dev/null
+++ b/docs/win32-qemu-event.promela
@@ -0,0 +1,98 @@
+/*
+ * This model describes the implementation of QemuEvent in
+ * util/qemu-thread-win32.c.
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This file is in the public domain. If you really want a license,
+ * the WTFPL will do.
+ *
+ * To verify it:
+ * spin -a docs/event.promela
+ * gcc -O2 pan.c -DSAFETY
+ * ./a.out
+ */
+
+bool event;
+int value;
+
+/* Primitives for a Win32 event */
+#define RAW_RESET event = false
+#define RAW_SET event = true
+#define RAW_WAIT do :: event -> break; od
+
+#if 0
+/* Basic sanity checking: test the Win32 event primitives */
+#define RESET RAW_RESET
+#define SET RAW_SET
+#define WAIT RAW_WAIT
+#else
+/* Full model: layer a userspace-only fast path on top of the RAW_*
+ * primitives. SET/RESET/WAIT have exactly the same semantics as
+ * RAW_SET/RAW_RESET/RAW_WAIT, but try to avoid invoking them.
+ */
+#define EV_SET 0
+#define EV_FREE 1
+#define EV_BUSY -1
+
+int state = EV_FREE;
+
+int xchg_result;
+#define SET if :: state != EV_SET -> \
+ atomic { /* xchg_result=xchg(state, EV_SET) */ \
+ xchg_result = state; \
+ state = EV_SET; \
+ } \
+ if :: xchg_result == EV_BUSY -> RAW_SET; \
+ :: else -> skip; \
+ fi; \
+ :: else -> skip; \
+ fi
+
+#define RESET if :: state == EV_SET -> atomic { state = state | EV_FREE; } \
+ :: else -> skip; \
+ fi
+
+int tmp1, tmp2;
+#define WAIT tmp1 = state; \
+ if :: tmp1 != EV_SET -> \
+ if :: tmp1 == EV_FREE -> \
+ RAW_RESET; \
+ atomic { /* tmp2=cas(state, EV_FREE, EV_BUSY) */ \
+ tmp2 = state; \
+ if :: tmp2 == EV_FREE -> state = EV_BUSY; \
+ :: else -> skip; \
+ fi; \
+ } \
+ if :: tmp2 == EV_SET -> tmp1 = EV_SET; \
+ :: else -> tmp1 = EV_BUSY; \
+ fi; \
+ :: else -> skip; \
+ fi; \
+ assert(tmp1 != EV_FREE); \
+ if :: tmp1 == EV_BUSY -> RAW_WAIT; \
+ :: else -> skip; \
+ fi; \
+ :: else -> skip; \
+ fi
+#endif
+
+active proctype waiter()
+{
+ if
+ :: !value ->
+ RESET;
+ if
+ :: !value -> WAIT;
+ :: else -> skip;
+ fi;
+ :: else -> skip;
+ fi;
+ assert(value);
+}
+
+active proctype notifier()
+{
+ value = true;
+ SET;
+}
diff --git a/docs/writing-qmp-commands.txt b/docs/writing-qmp-commands.txt
index ab1fdd36b..59aa77ae2 100644
--- a/docs/writing-qmp-commands.txt
+++ b/docs/writing-qmp-commands.txt
@@ -122,12 +122,12 @@ There are a few things to be noticed:
Now a little hack is needed. As we're still using the old QMP server we need
to add the new command to its internal dispatch table. This step won't be
required in the near future. Open the qmp-commands.hx file and add the
-following in the botton:
+following at the bottom:
{
.name = "hello-world",
.args_type = "",
- .mhandler.cmd_new = qmp_marshal_input_hello_world,
+ .mhandler.cmd_new = qmp_marshal_hello_world,
},
You're done. Now build qemu, run it as suggested in the "Testing" section,
@@ -179,7 +179,7 @@ The last step is to update the qmp-commands.hx file:
{
.name = "hello-world",
.args_type = "message:s?",
- .mhandler.cmd_new = qmp_marshal_input_hello_world,
+ .mhandler.cmd_new = qmp_marshal_hello_world,
},
Notice that the "args_type" member got our "message" argument. The character
@@ -210,7 +210,7 @@ if you don't see these strings, then something went wrong.
=== Errors ===
QMP commands should use the error interface exported by the error.h header
-file. Basically, errors are set by calling the error_set() function.
+file. Basically, most errors are set by calling the error_setg() function.
Let's say we don't accept the string "message" to contain the word "love". If
it does contain it, we want the "hello-world" command to return an error:
@@ -219,8 +219,7 @@ void qmp_hello_world(bool has_message, const char *message, Error **errp)
{
if (has_message) {
if (strstr(message, "love")) {
- error_set(errp, ERROR_CLASS_GENERIC_ERROR,
- "the word 'love' is not allowed");
+ error_setg(errp, "the word 'love' is not allowed");
return;
}
printf("%s\n", message);
@@ -229,10 +228,8 @@ void qmp_hello_world(bool has_message, const char *message, Error **errp)
}
}
-The first argument to the error_set() function is the Error pointer to pointer,
-which is passed to all QMP functions. The second argument is a ErrorClass
-value, which should be ERROR_CLASS_GENERIC_ERROR most of the time (more
-details about error classes are given below). The third argument is a human
+The first argument to the error_setg() function is the Error pointer
+to pointer, which is passed to all QMP functions. The next argument is a human
description of the error, this is a free-form printf-like string.
Let's test the example above. Build qemu, run it as defined in the "Testing"
@@ -249,8 +246,9 @@ The QMP server's response should be:
}
}
-As a general rule, all QMP errors should use ERROR_CLASS_GENERIC_ERROR. There
-are two exceptions to this rule:
+As a general rule, all QMP errors should use ERROR_CLASS_GENERIC_ERROR
+(done by default when using error_setg()). There are two exceptions to
+this rule:
1. A non-generic ErrorClass value exists* for the failure you want to report
(eg. DeviceNotFound)
@@ -259,8 +257,8 @@ are two exceptions to this rule:
want to report, hence you have to add a new ErrorClass value so that they
can check for it
-If the failure you want to report doesn't fall in one of the two cases above,
-just report ERROR_CLASS_GENERIC_ERROR.
+If the failure you want to report falls into one of the two cases above,
+use error_set() with a second argument of an ErrorClass value.
* All existing ErrorClass values are defined in the qapi-schema.json file
@@ -461,7 +459,7 @@ The last step is to add the correspoding entry in the qmp-commands.hx file:
{
.name = "query-alarm-clock",
.args_type = "",
- .mhandler.cmd_new = qmp_marshal_input_query_alarm_clock,
+ .mhandler.cmd_new = qmp_marshal_query_alarm_clock,
},
Time to test the new command. Build qemu, run it as described in the "Testing"
@@ -607,7 +605,7 @@ To test this you have to add the corresponding qmp-commands.hx entry:
{
.name = "query-alarm-methods",
.args_type = "",
- .mhandler.cmd_new = qmp_marshal_input_query_alarm_methods,
+ .mhandler.cmd_new = qmp_marshal_query_alarm_methods,
},
Now Build qemu, run it as explained in the "Testing" section and try our new