author Yonghee Han <onstudy@samsung.com> 2016-07-27 16:40:17 +0900
committer Yonghee Han <onstudy@samsung.com> 2016-07-27 00:53:56 -0700
commit 3158f4a51894e46ecb593bffbfd12824e1d6534a
tree 2bef7f0238e687c5de65f48b5995ee124a95d157 /docs/specs
parent a3b133b0ea0696e42fd876b9a803e28bc6ef5299
Imported Upstream version 2.4.1 (upstream/2.4.1)
Change-Id: I0b584f569cb0e0f4eac13cdb79e110c2dbc34bfc
Diffstat (limited to 'docs/specs')
-rw-r--r-- docs/specs/acpi_mem_hotplug.txt 58
-rw-r--r-- docs/specs/fw_cfg.txt 21
-rw-r--r-- docs/specs/pci-ids.txt 2
-rw-r--r-- docs/specs/ppc-spapr-hotplug.txt 305
-rw-r--r-- docs/specs/rocker.txt 1014
5 files changed, 1396 insertions, 4 deletions
diff --git a/docs/specs/acpi_mem_hotplug.txt b/docs/specs/acpi_mem_hotplug.txt
index 12909940c..3df3620ce 100644
--- a/docs/specs/acpi_mem_hotplug.txt
+++ b/docs/specs/acpi_mem_hotplug.txt
@@ -2,7 +2,7 @@ QEMU<->ACPI BIOS memory hotplug interface
--------------------------------------
ACPI BIOS GPE.3 handler is dedicated for notifying OS about memory hot-add
-events.
+and hot-remove events.
Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access):
---------------------------------------------------------------
@@ -19,7 +19,9 @@ Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access):
1: Device insert event, used to distinguish device for which
no device check event to OSPM was issued.
It's valid only when bit 1 is set.
- 2-7: reserved and should be ignored by OSPM
+ 2: Device remove event, used to distinguish device for which
+ no device eject request to OSPM was issued.
+ 3-7: reserved and should be ignored by OSPM
[0x15-0x17] reserved
write access:
@@ -31,14 +33,62 @@ Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access):
[0xc-0x13] reserved, writes into it are ignored
[0x14] Memory device control fields
bits:
- 0: reserved, OSPM must clear it before writing to register
+ 0: reserved, OSPM must clear it before writing to register.
+ Due to a bug in versions prior to 2.4, that field isn't cleared
+ when other fields are written. Keep it reserved and don't
+ try to reuse it.
1: if set to 1 clears device insert event, set by OSPM
after it has emitted device check event for the
selected memory device
- 2-7: reserved, OSPM must clear them before writing to register
+ 2: if set to 1 clears device remove event, set by OSPM
+ after it has emitted device eject request for the
+ selected memory device
+ 3: if set to 1 initiates device eject, set by OSPM when it
+ triggers memory device removal and calls _EJ0 method
+ 4-7: reserved, OSPM must clear them before writing to register
Selecting memory device slot beyond present range has no effect on platform:
- write accesses to memory hot-plug registers not documented above are
ignored
- read accesses to memory hot-plug registers not documented above return
all bits set to 1.
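The [0x14] bit layout above can be sketched in C as follows. This is a minimal illustration only: the macro and function names are hypothetical and do not come from QEMU or any firmware source.

```c
#include <stdint.h>

/* Hypothetical names for the [0x14] bits documented above. */
#define MHP_CTRL_RESERVED0    (1u << 0) /* write: must be 0 */
#define MHP_CTRL_INSERT_EVENT (1u << 1) /* read: insert event; write 1: clear it */
#define MHP_CTRL_REMOVE_EVENT (1u << 2) /* read: remove event; write 1: clear it */
#define MHP_CTRL_EJECT        (1u << 3) /* write 1: initiate device eject */
#define MHP_CTRL_VALID_WRITE \
    (MHP_CTRL_INSERT_EVENT | MHP_CTRL_REMOVE_EVENT | MHP_CTRL_EJECT)

/* Build a value to write to [0x14], keeping reserved bits 0 and 4-7 clear. */
static inline uint8_t mhp_ctrl_write_value(uint8_t requested)
{
    return (uint8_t)(requested & MHP_CTRL_VALID_WRITE);
}

/* True if the flags byte read from [0x14] reports a pending remove event. */
static inline int mhp_remove_event_pending(uint8_t flags)
{
    return (flags & MHP_CTRL_REMOVE_EVENT) != 0;
}
```

Masking every write through one helper is one way to honour the "OSPM must clear them before writing" rule for the reserved bits.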
+
+Memory hot remove process diagram:
+----------------------------------
+ +-------------+     +-----------------------+      +------------------+     
+ |  1. QEMU    |     | 2. QEMU               |      |3. QEMU           |     
+ |  device_del +---->+ device unplug request +----->+Send SCI to guest,|     
+ |             |     |         cb            |      |return control to |     
+ +-------------+     +-----------------------+      |management        |     
+                                                    +------------------+     
+                                                                             
+ +---------------------------------------------------------------------+     
+                                                                             
+ +---------------------+              +-------------------------+            
+ | OSPM:               | remove event | OSPM:                   |            
+ | send Eject Request, |              | Scan memory devices     |            
+ | clear remove event  +<-------------+ for event flags         |            
+ |                     |              |                         |            
+ +---------------------+              +-------------------------+            
+           |                                                                 
+           |                                                                 
+ +---------v--------+            +-----------------------+                   
+ | Guest OS:        |  success   | OSPM:                 |                   
+ | process Ejection +----------->+ Execute _EJ0 method,  |                   
+ | request          |            | set eject bit in flags|                   
+ +------------------+            +-----------------------+                   
+           |failure                         |                                
+           v                                v                                
+ +------------------------+      +-----------------------+                   
+ | OSPM:                  |      | QEMU:                 |                   
+ | set OST event & status |      | call device unplug cb |                   
+ | fields                 |      |                       |                   
+ +------------------------+      +-----------------------+                   
+          |                                  |                               
+          v                                  v                               
+ +------------------+              +-------------------+                     
+ |QEMU:             |              |QEMU:              |                     
+ |Send OST QMP event|              |Send device deleted|                     
+ |                  |              |QMP event          |                     
+ +------------------+              |                   |                     
+                                   +-------------------+
diff --git a/docs/specs/fw_cfg.txt b/docs/specs/fw_cfg.txt
index 6accd924b..74351dd18 100644
--- a/docs/specs/fw_cfg.txt
+++ b/docs/specs/fw_cfg.txt
@@ -203,3 +203,24 @@ completes fully overwriting the item's data.
NOTE: This function is deprecated, and will be completely removed
starting with QEMU v2.4.
+
+== Externally Provided Items ==
+
+As of v2.4, "file" fw_cfg items (i.e., items with selector keys above
+FW_CFG_FILE_FIRST, and with a corresponding entry in the fw_cfg file
+directory structure) may be inserted via the QEMU command line, using
+the following syntax:
+
+ -fw_cfg [name=]<item_name>,file=<path>
+
+where <item_name> is the fw_cfg item name, and <path> is the location
+on the host file system of a file containing the data to be inserted.
+
+NOTE: Users *SHOULD* choose item names beginning with the prefix "opt/"
+when using the "-fw_cfg" command line option, to avoid conflicting with
+item names used internally by QEMU. For instance:
+
+ -fw_cfg name=opt/my_item_name,file=./my_blob.bin
+
+Similarly, QEMU developers *SHOULD NOT* use item names prefixed with
+"opt/" when inserting items programmatically, e.g. via fw_cfg_add_file().
diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt
index c6732fe00..0adcb89aa 100644
--- a/docs/specs/pci-ids.txt
+++ b/docs/specs/pci-ids.txt
@@ -45,7 +45,9 @@ PCI devices (other than virtio):
1b36:0003 PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
1b36:0004 PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
1b36:0005 PCI test device (docs/specs/pci-testdev.txt)
+1b36:0006 PCI Rocker Ethernet switch device
1b36:0007 PCI SD Card Host Controller Interface (SDHCI)
+1b36:000a PCI-PCI bridge (multiseat)
All these devices are documented in docs/specs.
diff --git a/docs/specs/ppc-spapr-hotplug.txt b/docs/specs/ppc-spapr-hotplug.txt
new file mode 100644
index 000000000..46e07196b
--- /dev/null
+++ b/docs/specs/ppc-spapr-hotplug.txt
@@ -0,0 +1,305 @@
+= sPAPR Dynamic Reconfiguration =
+
+sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration
+to handle hotplugging of dynamic "physical" resources like PCI cards, or
+"logical"/paravirtual resources like memory, CPUs, and "physical"
+host-bridges, which are generally managed by the host/hypervisor and provided
+to guests as virtualized resources. The specifics of dynamic-reconfiguration
+are documented extensively in PAPR+ v2.7, Section 13.1. This document
+provides a summary of that information as it applies to the implementation
+within QEMU.
+
+== Dynamic-reconfiguration Connectors ==
+
+To manage hotplug/unplug of these resources, a firmware abstraction known as
+a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
+resource to the guest, and provide an interface for the guest to manage
+configuration/removal of the resource associated with it.
+
+== Device-tree description of DRCs ==
+
+A set of 4 Open Firmware device tree array properties are used to describe
+the name/index/power-domain/type of each DRC allocated to a guest at
+boot-time. There may be multiple sets of these arrays, rooted at different
+paths in the device tree depending on the type of resource the DRCs manage.
+
+In some cases, the DRCs themselves may be provided by a dynamic resource,
+such as the DRCs managing PCI slots on a hotplugged PHB. In this case the
+arrays would be fetched as part of the device tree retrieval interfaces
+for hotplugged resources described under "Guest->Host interface".
+
+The array properties are described below. Each entry/element in an array
+describes the DRC identified by the element in the corresponding position
+of ibm,drc-indexes:
+
+ibm,drc-names:
+ first 4-bytes: BE-encoded integer denoting the number of entries
+ each entry: a NULL-terminated <name> string encoded as a byte array
+
+ <name> values for logical/virtual resources are defined in PAPR+ v2.7,
+ Section 13.5.2.4, and basically consist of the type of the resource
+ followed by a space and a numerical value that's unique across resources
+ of that type.
+
+ <name> values for "physical" resources such as PCI or VIO devices are
+ defined as being "location codes", which are the "location labels" of
+ each encapsulating device, starting from the chassis down to the
+ individual slot for the device, concatenated by a hyphen. This provides
+ a mapping of resources to a physical location in a chassis for debugging
+ purposes. For QEMU, this mapping is less important, so we assign a
+ location code that conforms to naming specifications, but is simply a
+ location label for the slot by itself to simplify the implementation.
+ The naming convention for location labels is documented in detail in
+ PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>"
+ for PCI/VIO device slots, where <n> is unique across all PCI/VIO
+ device slots.
+
+ibm,drc-indexes:
+ first 4-bytes: BE-encoded integer denoting the number of entries
+ each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs
+ in the machine
+
+ <index> is arbitrary, but in the case of QEMU we try to maintain the
+ convention used to assign them to pSeries guests on pHyp:
+
+ bit[31:28]: integer encoding of <type>, where <type> is:
+ 1 for CPU resource
+ 2 for PHB resource
+ 3 for VIO resource
+ 4 for PCI resource
+ 8 for Memory resource
+ bit[27:0]: integer encoding of <id>, where <id> is unique across
+ all resources of specified type
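The index convention above can be sketched in C as follows. The enum and function names are hypothetical; only the bit positions and <type> values are taken from the text.

```c
#include <stdint.h>

/* <type> values from the list above. */
enum drc_type {
    DRC_TYPE_CPU = 1,
    DRC_TYPE_PHB = 2,
    DRC_TYPE_VIO = 3,
    DRC_TYPE_PCI = 4,
    DRC_TYPE_MEM = 8,
};

/* Compose an index: bits [31:28] hold <type>, bits [27:0] hold <id>. */
static inline uint32_t drc_index_make(enum drc_type type, uint32_t id)
{
    return ((uint32_t)type << 28) | (id & 0x0fffffffu);
}

static inline uint32_t drc_index_type(uint32_t index) { return index >> 28; }
static inline uint32_t drc_index_id(uint32_t index)   { return index & 0x0fffffffu; }

/* Read one big-endian 32-bit cell, as used by the ibm,drc-* array properties. */
static inline uint32_t be32_read(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```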
+
+ibm,drc-power-domains:
+ first 4-bytes: BE-encoded integer denoting the number of entries
+ each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the
+ power domain the resource will be assigned to. In the case of QEMU
+ we associate all resources with a "live insertion" domain, where the
+ power is assumed to be managed automatically. The integer value for
+ this domain is a special value of -1.
+
+
+ibm,drc-types:
+ first 4-bytes: BE-encoded integer denoting the number of entries
+ each entry: a NULL-terminated <type> string encoded as a byte array
+
+ <type> is assigned as follows:
+ "CPU" for a CPU
+ "PHB" for a physical host-bridge
+ "SLOT" for a VIO slot
+ "28" for a PCI slot
+ "MEM" for memory resource
+
+== Guest->Host interface to manage dynamic resources ==
+
+Each DRC is given a globally unique DRC Index, and resources associated with
+a particular DRC are configured/managed by the guest via a number of RTAS
+calls which reference individual DRCs based on the DRC index. This can be
+considered the guest->host interface.
+
+rtas-set-power-level:
+ arg[0]: integer identifying power domain
+ arg[1]: new power level for the domain, 0-100
+ output[0]: status, 0 on success
+ output[1]: power level after command
+
+ Set the power level for a specified power domain
+
+rtas-get-power-level:
+ arg[0]: integer identifying power domain
+ output[0]: status, 0 on success
+ output[1]: current power level
+
+ Get the power level for a specified power domain
+
+rtas-set-indicator:
+ arg[0]: integer identifying sensor/indicator type
+ arg[1]: index of sensor, for DR-related sensors this is generally the
+ DRC index
+ arg[2]: desired sensor value
+ output[0]: status, 0 on success
+
+ Set the state of an indicator or sensor. For the purpose of this document we
+ focus on the indicator/sensor types associated with a DRC. The types are:
+
+ 9001: isolation-state, controls/indicates whether a device has been made
+ accessible to a guest
+
+ supported sensor values:
+ 0: isolate, device is made inaccessible to guest OS
+ 1: unisolate, device is made available to guest OS
+
+ 9002: dr-indicator, controls "visual" indicator associated with device
+
+ supported sensor values:
+ 0: inactive, resource may be safely removed
+ 1: active, resource is in use and cannot be safely removed
+ 2: identify, used to visually identify slot for interactive hotplug
+ 3: action, in most cases, used in the same manner as identify
+
+ 9003: allocation-state, generally only used for "logical" DR resources to
+ request the allocation/deallocation of a resource prior to acquiring
+ it via isolation-state->unisolate, or after releasing it via
+ isolation-state->isolate, respectively. for "physical" DR (like PCI
+ hotplug/unplug) the pre-allocation of the resource is implied and
+ this sensor is unused.
+
+ supported sensor values:
+ 0: unusable, tell firmware/system the resource can be
+ unallocated/reclaimed and added back to the system resource pool
+ 1: usable, request the resource be allocated/reserved for use by
+ guest OS
+ 2: exchange, used to allocate a spare resource to use for fail-over
+ in certain situations. unused in QEMU
+ 3: recover, used to reclaim a previously allocated resource that's
+ not currently allocated to the guest OS. unused in QEMU
+
+rtas-get-sensor-state:
+ arg[0]: integer identifying sensor/indicator type
+ arg[1]: index of sensor, for DR-related sensors this is generally the
+ DRC index
+ output[0]: status, 0 on success
+
+ Used to read an indicator or sensor value.
+
+ For DR-related operations, the only noteworthy sensor is dr-entity-sense,
+ which has a type value of 9003, as allocation-state does in the case of
+ rtas-set-indicator. The semantics/encodings of the sensor values are distinct
+ however:
+
+ supported sensor values for dr-entity-sense (9003) sensor:
+ 0: empty,
+ for physical resources: DRC/slot is empty
+ for logical resources: unused
+ 1: present,
+ for physical resources: DRC/slot is populated with a device/resource
+ for logical resources: resource has been allocated to the DRC
+ 2: unusable,
+ for physical resources: unused
+ for logical resources: DRC has no resource allocated to it
+ 3: exchange,
+ for physical resources: unused
+ for logical resources: resource available for exchange (see
+ allocation-state sensor semantics above)
+ 4: recovery,
+ for physical resources: unused
+ for logical resources: resource available for recovery (see
+ allocation-state sensor semantics above)
+
+rtas-ibm-configure-connector:
+ arg[0]: guest physical address of 4096-byte work area buffer
+ arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero
+ if a prior RTAS response indicated a need for additional memory
+ output[0]: status:
+ 0: completed transmittal of device-tree node
+ 1: instruct guest to prepare for next DT sibling node
+ 2: instruct guest to prepare for next DT child node
+ 3: instruct guest to prepare for next DT property
+ 4: instruct guest to ascend to parent DT node
+ 5: instruct guest to provide additional work-area buffer
+ via arg[1]
+ 990x: instruct guest that operation took too long and to try
+ again later
+
+ Used to fetch an OF device-tree description of the resource associated with
+ a particular DRC. The DRC index is encoded in the first 4-bytes of the first
+ work area buffer.
+
+ Work area layout, using 4-byte offsets:
+ wa[0]: DRC index of the DRC to fetch device-tree nodes from
+ wa[1]: 0 (hard-coded)
+ wa[2]: for next-sibling/next-child response:
+ wa offset of null-terminated string denoting the new node's name
+ for next-property response:
+ wa offset of null-terminated string denoting new property's name
+ wa[3]: for next-property response (unused otherwise):
+ byte-length of new property's value
+ wa[4]: for next-property response (unused otherwise):
+ new property's value, encoded as an OFDT-compatible byte array
+
+== hotplug/unplug events ==
+
+For most DR operations, the hypervisor will issue host->guest add/remove events
+using the EPOW/check-exception notification framework, where the host issues a
+check-exception interrupt, then provides an RTAS event log via an
+rtas-check-exception call issued by the guest in response. This framework is
+documented by PAPR+ v2.7, and is already in use by QEMU for generating powerdown
+requests via EPOW events.
+
+For DR, this framework has been extended to include hotplug events, which were
+previously unneeded due to direct manipulation of DR-related guest userspace
+tools by host-level management such as an HMC. This level of management is not
+applicable to PowerKVM, hence the reason for extending the notification
+framework to support hotplug events.
+
+Note that these events are not yet formally part of the PAPR+ specification,
+but support for this format has already been implemented in DR-related
+guest tools such as powerpc-utils/librtas, as well as kernel patches that have
+been submitted to handle in-kernel processing of memory/cpu-related hotplug
+events[1], and is planned for formal inclusion in the PAPR+ specification. The
+hotplug-specific payload is implemented in QEMU as follows (with all values
+encoded in big-endian format):
+
+struct rtas_event_log_v6_hp {
+#define SECTION_ID_HOTPLUG 0x4850 /* HP */
+ struct section_header {
+ uint16_t section_id; /* set to SECTION_ID_HOTPLUG */
+ uint16_t section_length; /* sizeof(rtas_event_log_v6_hp),
+ * plus the length of the DRC name
+ * if a DRC name identifier is
+ * specified for hotplug_identifier
+ */
+ uint8_t section_version; /* version 1 */
+ uint8_t section_subtype; /* unused */
+ uint16_t creator_component_id; /* unused */
+ } hdr;
+#define RTAS_LOG_V6_HP_TYPE_CPU 1
+#define RTAS_LOG_V6_HP_TYPE_MEMORY 2
+#define RTAS_LOG_V6_HP_TYPE_SLOT 3
+#define RTAS_LOG_V6_HP_TYPE_PHB 4
+#define RTAS_LOG_V6_HP_TYPE_PCI 5
+ uint8_t hotplug_type; /* type of resource/device */
+#define RTAS_LOG_V6_HP_ACTION_ADD 1
+#define RTAS_LOG_V6_HP_ACTION_REMOVE 2
+ uint8_t hotplug_action; /* action (add/remove) */
+#define RTAS_LOG_V6_HP_ID_DRC_NAME 1
+#define RTAS_LOG_V6_HP_ID_DRC_INDEX 2
+#define RTAS_LOG_V6_HP_ID_DRC_COUNT 3
+ uint8_t hotplug_identifier; /* type of the resource identifier,
+ * which serves as the discriminator
+ * for the 'drc' union field below
+ */
+ uint8_t reserved;
+ union {
+ uint32_t index; /* DRC index of resource to take action
+ * on
+ */
+ uint32_t count; /* number of DR resources to take
+ * action on (guest chooses which)
+ */
+ char name[1]; /* string representing the name of the
+ * DRC to take action on
+ */
+ } drc;
+} QEMU_PACKED;
+
+== ibm,lrdr-capacity ==
+
+ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
+the dynamic reconfiguration capabilities of the guest. It is a triple
+consisting of <phys>, <size> and <maxcpus>.
+
+ <phys>, encoded in BE format represents the maximum address in bytes and
+ hence the maximum memory that can be allocated to the guest.
+
+ <size>, encoded in BE format represents the size increments in which
+ memory can be hot-plugged to the guest.
+
+ <maxcpus>, a BE-encoded integer, represents the maximum number of
+ processors that the guest can have.
+
+pseries guests use this property to note the maximum allowed CPUs for the
+guest.
+
+[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867
diff --git a/docs/specs/rocker.txt b/docs/specs/rocker.txt
new file mode 100644
index 000000000..1c743515c
--- /dev/null
+++ b/docs/specs/rocker.txt
@@ -0,0 +1,1014 @@
+Rocker Network Switch Register Programming Guide
+Copyright (c) Scott Feldman <sfeldma@gmail.com>
+Copyright (c) Neil Horman <nhorman@tuxdriver.com>
+Version 0.11, 12/29/2014
+
+LICENSE
+=======
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+SECTION 1: Introduction
+=======================
+
+Overview
+--------
+
+This document describes the hardware/software interface for the Rocker switch
+device. The intended audience is authors of OS drivers and device emulation
+software.
+
+Notations and Conventions
+-------------------------
+
+o In register descriptions, [n:m] indicates a range from bit n to bit m,
+inclusive.
+o Use of leading 0x indicates a hexadecimal number.
+o Use of leading 0b indicates a binary number.
+o The use of RSVD or Reserved indicates that a bit or field is reserved for
+future use.
+o Field width is in bytes, unless otherwise noted.
+o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear
+on read
+o TLV values in network-byte-order are designated with (N).
+
+
+SECTION 2: PCI Configuration Registers
+======================================
+
+PCI Configuration Space
+-----------------------
+
+Each switch instance registers as a PCI device with PCI configuration space:
+
+ offset width description value
+ ---------------------------------------------
+ 0x0 2 Vendor ID 0x1b36
+ 0x2 2 Device ID 0x0006
+ 0x4 4 Command/Status
+ 0x8 1 Revision ID 0x01
+ 0x9 3 Class code 0x2800
+ 0xC 1 Cache line size
+ 0xD 1 Latency timer
+ 0xE 1 Header type
+ 0xF 1 Built-in self test
+ 0x10 4 Base address low
+ 0x14 4 Base address high
+ 0x18-28 Reserved
+ 0x2C 2 Subsystem vendor ID *
+ 0x2E 2 Subsystem ID *
+ 0x30-38 Reserved
+ 0x3C 1 Interrupt line
+ 0x3D 1 Interrupt pin 0x00
+ 0x3E 1 Min grant 0x00
+ 0x3F 1 Max latency 0x00
+ 0x40 1 TRDY timeout
+ 0x41 1 Retry count
+ 0x42 2 Reserved
+
+
+* Assigned by sub-system implementation
+
+SECTION 3: Memory-Mapped Register Space
+=======================================
+
+There are two memory-mapped BARs. BAR0 maps device register space and is
+0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
+size, allowing for 256 MSI-X vectors.
+
+All registers are 4 or 8 bytes long. It is assumed host software will access 4
+byte registers with one 4-byte access, and 8 byte registers with either two
+4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses,
+access must be lower and then upper 4-bytes, in that order.
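The split access rule above can be sketched as follows. This is an illustration under stated assumptions: `bar0` stands in for the mapped BAR0 region, the helper names are hypothetical, and a little-endian host is assumed so that no byte swapping is shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write an 8-byte register as two 4-byte accesses: lower half first,
 * then upper half. `bar0` is indexed in 32-bit words and `offset` is
 * the byte offset of the register.
 */
static inline void reg_write64_split(volatile uint32_t *bar0, size_t offset,
                                     uint64_t val)
{
    bar0[offset / 4]     = (uint32_t)(val & 0xffffffffu); /* lower 4 bytes */
    bar0[offset / 4 + 1] = (uint32_t)(val >> 32);         /* upper 4 bytes */
}

/* Read an 8-byte register in the same order: lower half, then upper half. */
static inline uint64_t reg_read64_split(const volatile uint32_t *bar0,
                                        size_t offset)
{
    uint64_t lo = bar0[offset / 4];
    uint64_t hi = bar0[offset / 4 + 1];
    return lo | (hi << 32);
}
```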
+
+BAR0 device register space is organized as follows:
+
+ offset description
+ ------------------------------------------------------
+ 0x0000-0x000f Bogus registers to catch misbehaving
+ drivers. Writes do nothing. Reads
+ back as 0xDEADBABE.
+ 0x0010-0x00ff Test registers
+ 0x0300-0x03ff General purpose registers
+ 0x1000-0x1fff Descriptor control
+
+Holes in register space are reserved. Writes to reserved registers do nothing.
+Reads to reserved registers read back as 0.
+
+No fancy stuff like write-combining is enabled on any of the registers.
+
+BAR1 MSI-X register space is organized as follows:
+
+ offset description
+ ------------------------------------------------------
+ 0x0000-0x0fff MSI-X vector table (256 vectors total)
+ 0x1000-0x1fff MSI-X PBA table
+
+
+SECTION 4: Interrupts, DMA, and Endianness
+==========================================
+
+PCI Interrupts
+--------------
+
+The device supports only MSI-X interrupts. BAR1 memory-mapped region contains
+the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors.
+
+The vector assignment is:
+
+ vector description
+ -----------------------------------------------------
+ 0 Command descriptor ring completion
+ 1 Event descriptor ring completion
+ 2 Test operation completion
+ 3 RSVD
+ 4-255 Tx and Rx descriptor ring completion
+ Tx vector is even
+ Rx vector is odd
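One mapping consistent with the table above is sketched below. The names are hypothetical, and the per-port formula is an assumption (0-based ports, Tx at 4 + 2p and Rx at 5 + 2p); the table itself only states that Tx vectors are even and Rx vectors are odd.

```c
#include <assert.h>

enum {
    MSIX_VEC_CMD       = 0,
    MSIX_VEC_EVENT     = 1,
    MSIX_VEC_TEST      = 2,
    MSIX_VEC_RSVD      = 3,
    MSIX_VEC_PORT_BASE = 4,
    MAX_PORTS          = 62, /* ports 0..61 */
};

/* Assumed mapping: port p uses vector 4 + 2p for Tx (even). */
static inline unsigned msix_vec_tx(unsigned port)
{
    assert(port < MAX_PORTS);
    return MSIX_VEC_PORT_BASE + 2 * port;
}

/* Assumed mapping: port p uses vector 5 + 2p for Rx (odd). */
static inline unsigned msix_vec_rx(unsigned port)
{
    assert(port < MAX_PORTS);
    return MSIX_VEC_PORT_BASE + 2 * port + 1;
}
```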
+
+A MSI-X vector table entry is 16 bytes:
+
+ field offset width description
+ -------------------------------------------------------------
+ lower_addr 0x0 4 [31:2] message address[31:2]
+ [1:0] Rsvd (4 byte alignment
+ required)
+ upper_addr 0x4 4 [31:15] Rsvd
+ [14:0] message address[46:32]
+ data 0x8 4 message data[31:0]
+ control 0xc 4 [31:1] Rsvd
+ [0] mask (0 = enable,
+ 1 = masked)
+
+Software should install the Interrupt Service Routine (ISR) before any ports
+are enabled or any commands are issued on the command ring.
+
+DMA Operations
+--------------
+
+DMA operations are used for packet DMA to/from the CPU, command and event
+processing. Command processing includes statistical counters and table dumps,
+table insertion/deletion, and more. Event processing provides an async
+notification method for device-originating events. Each DMA operation has a
+set of control registers to manage a descriptor ring. The descriptor rings are
+allocated from contiguous host DMA-able memory and registers specify the ring's
+base address, size and current head and tail indices. Software always writes
+the head, and hardware always writes the tail.
+
+The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion
+of a descriptor. Software will clear this bit when posting a descriptor to the
+ring, and hardware will set this bit when the descriptor is complete.
+
+Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries.
+Descriptor rings' base address must be 8-byte aligned. Descriptors must be
+packed within ring. Each descriptor in each ring must also be aligned on an 8
+byte boundary. Each descriptor ring will have these registers:
+
+ DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W)
+ DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W)
+ DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W)
+ DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R)
+ DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W)
+ DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W)
+ DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W)
+
+Where x is descriptor ring index:
+
+ index ring
+ --------------------
+ 0 CMD
+ 1 EVENT
+ 2 TX (port 0)
+ 3 RX (port 0)
+ 4 TX (port 1)
+ 5 RX (port 1)
+ .
+ .
+ .
+ 124 TX (port 61)
+ 125 RX (port 61)
+ 126 Resv
+ 127 Resv
+
+Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero. HEAD cannot be
+written past TAIL. To do so would wrap the ring. An empty ring is when HEAD
+== TAIL. A full ring is when HEAD is one position behind TAIL. Both HEAD and
+TAIL increment and modulo wrap at the ring size.
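The register offsets and the empty/full rules above can be sketched in C as follows. The function names are hypothetical; the offsets, the 32-byte stride and the ring index table are taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_DESC_RING_STRIDE 32u

/* BAR0 offsets of the registers for descriptor ring x (0..127). */
static inline uint32_t dma_desc_base_addr(uint32_t x) { return 0x1000u + x * DMA_DESC_RING_STRIDE; }
static inline uint32_t dma_desc_size(uint32_t x)      { return 0x1008u + x * DMA_DESC_RING_STRIDE; }
static inline uint32_t dma_desc_head(uint32_t x)      { return 0x100cu + x * DMA_DESC_RING_STRIDE; }
static inline uint32_t dma_desc_tail(uint32_t x)      { return 0x1010u + x * DMA_DESC_RING_STRIDE; }
static inline uint32_t dma_desc_ctrl(uint32_t x)      { return 0x1014u + x * DMA_DESC_RING_STRIDE; }
static inline uint32_t dma_desc_credits(uint32_t x)   { return 0x1018u + x * DMA_DESC_RING_STRIDE; }

/* Ring index for a 0-based port: TX rings are 2, 4, ...; RX rings are 3, 5, ... */
static inline uint32_t ring_index_tx(uint32_t port) { return 2u + 2u * port; }
static inline uint32_t ring_index_rx(uint32_t port) { return 3u + 2u * port; }

/* Ring sizes are powers of 2, so the modulo wrap reduces to a mask. */
static inline uint32_t ring_next(uint32_t pos, uint32_t size)
{
    assert(size >= 2 && (size & (size - 1)) == 0);
    return (pos + 1) & (size - 1);
}

/* Empty: HEAD == TAIL. */
static inline int ring_empty(uint32_t head, uint32_t tail)
{
    return head == tail;
}

/* Full: HEAD is one position behind TAIL, so advancing HEAD would reach TAIL. */
static inline int ring_full(uint32_t head, uint32_t tail, uint32_t size)
{
    return ring_next(head, size) == tail;
}
```

With this definition of full, a ring of N entries holds at most N - 1 posted descriptors, which keeps the empty and full states distinguishable.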
+
+CTRL register bits:
+
+ bit name description
+ ------------------------------------------------------------------------
+ [0] CTRL_RESET Reset the descriptor ring
+ [31:1] Reserved
+
+All descriptor types share some common fields:
+
+ field width description
+ -------------------------------------------------------------------
+ DMA_DESC_BUF_ADDR 8 Phys addr of desc payload, 8-byte
+ aligned
+ DMA_DESC_COOKIE 8 Desc cookie for completion matching,
+ upper-most bit is reserved
+ DMA_DESC_BUF_SIZE 2 Desc payload size in bytes
+ DMA_DESC_TLV_SIZE 2 Desc payload total size in bytes
+ used for TLVs. Must be <=
+ DMA_DESC_BUF_SIZE.
+ DMA_DESC_COMP_ERR 2 Completion status of associated
+ desc payload. High order bit is
+ clear on new descs, toggled by
+ hw for completed items.
+
+To support forward- and backward-compatibility, descriptor and completion
+payloads are specified in TLV format. Fields are packed with Type=field name,
+Length=field length, and Value=field value. Software will ignore unknown fields
+filled in by the switch. Likewise, the switch will ignore unknown fields
+filled in by software.
+
+Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned. The
+value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is:
+
+ field width description
+ -----------------------------
+ type 4 TLV type
+ len 2 TLV value length
+ pad 2 Reserved
+
+The alignment requirements for descriptors and TLVs are to avoid unaligned
+access exceptions in software. Note that the payload for each TLV is also
+8 byte aligned.
+
+Figure 1 shows an example descriptor buffer with two TLVs.
+
+ <------- 8 bytes ------->
+
+ 8-byte +––––+ +–––––––––––+–––––+–––––+ +–+
+ align | type | len | pad | TLV#1 hdr |
+ +–––––––––––+–––––+–––––+ (len=22) |
+ | | |
+ | value | TLV#1 value |
+ | | (padded to 8-byte |
+ | +–––––+ alignment) |
+ | |/////| |
+ 8-byte +––––+ +–––––––––––+–––––––––––+ |
+ align | type | len | pad | TLV#2 hdr DESC_BUF_SIZE
+ +–––––+–––––+–––––+–––––+ (len=2) |
+ |value|/////////////////| TLV#2 value |
+ +–––––+/////////////////| |
+ |///////////////////////| |
+ |///////////////////////| |
+ |///////////////////////| |
+ |////////unused/////////| |
+ |////////space//////////| |
+ |///////////////////////| |
+ |///////////////////////| |
+ |///////////////////////| |
+ +–––––––––––––––––––––––+ +–+
+
+ fig. 1
+
+TLVs can be nested within the NEST TLV type.
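The TLV layout rules can be sketched in C as follows. The helper names are hypothetical; the 8-byte header (4-byte type, 2-byte len, 2-byte pad), the 8-byte alignment of values and the little-endian header fields are taken from this document. For the two TLVs in figure 1 (len=22 and len=2), the sizes come out as 32 and 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_HDR_SIZE 8u

/* Round n up to the next multiple of 8. */
static inline size_t tlv_align8(size_t n) { return (n + 7u) & ~(size_t)7u; }

/* Bytes a TLV occupies in the buffer: header plus value padded to 8 bytes. */
static inline size_t tlv_total_size(uint16_t len)
{
    return TLV_HDR_SIZE + tlv_align8(len);
}

/*
 * Append one TLV at offset `off` in `buf` (capacity `cap`). The type and
 * len header fields are stored little-endian; the value bytes are copied
 * as given. Returns the offset of the next TLV, or 0 if it does not fit.
 */
static size_t tlv_put(uint8_t *buf, size_t cap, size_t off,
                      uint32_t type, const void *val, uint16_t len)
{
    size_t total = tlv_total_size(len);

    if (off % 8 != 0 || off > cap || total > cap - off) {
        return 0;
    }
    memset(buf + off, 0, total); /* zero the pad field and value padding */
    buf[off + 0] = (uint8_t)(type);
    buf[off + 1] = (uint8_t)(type >> 8);
    buf[off + 2] = (uint8_t)(type >> 16);
    buf[off + 3] = (uint8_t)(type >> 24);
    buf[off + 4] = (uint8_t)(len);
    buf[off + 5] = (uint8_t)(len >> 8);
    if (len != 0) {
        memcpy(buf + off + TLV_HDR_SIZE, val, len);
    }
    return off + total;
}
```

The final return value after the last tlv_put() call is the number of bytes used by TLVs, which is the quantity DMA_DESC_TLV_SIZE describes.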
+
+Interrupt credits
+^^^^^^^^^^^^^^^^^
+
+MSI-X vectors used for descriptor ring completions use a credit mechanism for
+efficient device, PCIe bus, OS and driver operations. Each descriptor ring has
+a credit count which represents the number of outstanding descriptors to be
+processed by the driver. As the device marks descriptors complete, the credit
+count is incremented. As the driver processes those outstanding descriptors,
+it returns credits back to the device. This way, the device knows the driver's
+progress and can make decisions about when to fire the next interrupt or not.
+When the credit count is zero, and the first descriptors are posted for the
+driver, a single interrupt is fired. Once the interrupt is fired, the
+interrupt is disabled (auto-masked*). In response to the interrupt, the driver
+will process descriptors and PIO write a returned credit value for that
+descriptor ring. If the driver returns all credits (the driver caught up with
+the device and there is no outstanding work), then the interrupt is unmasked,
+but not fired. If only partial credits are returned, the interrupt remains
+masked but the device generates an interrupt, signaling the driver that more
+outstanding work is available.
+
+(* this masking is unrelated to the MSI-X interrupt mask register)
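The credit mechanism can be modelled as a small state machine. The sketch below is a hypothetical device-side model written only from the description above; the struct and function names are invented, and clamping an over-return is an assumption because the text does not specify that case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one descriptor ring's interrupt credit state. */
struct credit_state {
    uint32_t credits; /* completed descriptors not yet returned by the driver */
    bool     masked;  /* auto-mask state, separate from the MSI-X mask bit */
    uint32_t fired;   /* number of interrupts generated so far */
};

/* Device marks n descriptors complete. */
static void credits_on_complete(struct credit_state *s, uint32_t n)
{
    if (n == 0) {
        return;
    }
    if (s->credits == 0 && !s->masked) {
        s->fired++;       /* first outstanding work: fire once ... */
        s->masked = true; /* ... then auto-mask */
    }
    s->credits += n;
}

/* Driver writes back `returned` credits after processing descriptors. */
static void credits_on_return(struct credit_state *s, uint32_t returned)
{
    if (returned > s->credits) {
        returned = s->credits; /* assumption: clamp an over-return */
    }
    s->credits -= returned;
    if (s->credits == 0) {
        s->masked = false; /* caught up: unmask, do not fire */
    } else {
        s->fired++;        /* partial return: stay masked, fire again */
    }
}
```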
+
+Endianness
+----------
+
+Device registers are hard-coded to little-endian (LE). The driver should
+convert to/from host endianness to LE for device register accesses.
+
+Descriptors are LE. Descriptor buffer TLVs will have LE type and length
+fields, but the value field can either be LE or network-byte-order, depending
+on context. TLV values containing network packet data will be in
+network-byte-order. A TLV value containing a field or mask used to compare
+against network packet data is network-byte-order. For example, flow match fields (and masks)
+are network-byte-order since they're matched directly, byte-by-byte, against
+network packet data. All non-network-packet TLV multi-byte values will be LE.
+
+TLV values in network-byte-order are designated with (N).
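+
+As a rough illustration of the mixed byte orders, the sketch below packs one
+LE value and one (N) value. It assumes a TLV header made of a 32-bit type
+followed by a 16-bit length with no padding; the real header layout is
+defined by the TLV description elsewhere in this spec and takes precedence.
+The TLV type numbers and helper names are hypothetical.

```python
import struct

# Hypothetical TLV type numbers, for illustration only.
HYP_TLV_GOTO_TBL = 1   # non-network value: little-endian
HYP_TLV_VLAN_ID = 2    # (N) value: network byte order

def pack_tlv(tlv_type, value):
    """Pack an assumed header (LE u32 type, LE u16 length) followed by the value."""
    return struct.pack("<IH", tlv_type, len(value)) + value

def le16(x):
    """16-bit value in little-endian, for non-network-packet fields."""
    return struct.pack("<H", x)

def net16(x):
    """16-bit value in network byte order, for fields marked (N)."""
    return struct.pack("!H", x)

goto_tbl = pack_tlv(HYP_TLV_GOTO_TBL, le16(10))
vlan_id = pack_tlv(HYP_TLV_VLAN_ID, net16(0x0123))
```

+
+Only the value bytes differ in ordering: the type and length fields stay LE in
+both TLVs.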
+
+
+SECTION 5: Test Registers
+=========================
+
+Rocker has several test registers to support troubleshooting register access,
+interrupt generation, and DMA operations:
+
+    TEST_REG,       offset 0x0010, 32-bit (R/W)
+    TEST_REG64,     offset 0x0018, 64-bit (R/W)
+    TEST_IRQ,       offset 0x0020, 32-bit (R/W)
+    TEST_DMA_ADDR,  offset 0x0028, 64-bit (R/W)
+    TEST_DMA_SIZE,  offset 0x0030, 32-bit (R/W)
+    TEST_DMA_CTRL,  offset 0x0034, 32-bit (R/W)
+
+Reads from TEST_REG and TEST_REG64 return twice the last value written to the
+register. The 32-bit and 64-bit versions are for testing 32-bit and 64-bit
+host accesses.
+
+A vector can be written to TEST_IRQ and the device will generate an interrupt
+for that vector.
+
+To test basic DMA operations, allocate a DMA-able host buffer and put the
+buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE. Then, write to
+TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are:
+
+    operation               value   description
+    -----------------------------------------------------------
+    TEST_DMA_CTRL_CLEAR     1       clear buffer
+    TEST_DMA_CTRL_FILL      2       fill buffer bytes with 0x96
+    TEST_DMA_CTRL_INVERT    4       invert bytes in buffer
+
+Various buffer addresses and sizes should be tested to verify that no
+address-boundary issues exist. In particular, buffers that start on an
+odd-8-byte boundary and/or span multiple PAGE sizes should be tested.
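+
+A driver-side test can compute the expected buffer contents for each
+TEST_DMA_CTRL operation and compare them against the host buffer after the
+device has run. The sketch below models only that expectation. It assumes
+"clear" means zero-fill and "invert" means a bitwise NOT of each byte; the
+function name is hypothetical.

```python
TEST_DMA_CTRL_CLEAR = 1
TEST_DMA_CTRL_FILL = 2
TEST_DMA_CTRL_INVERT = 4

def expected_after_dma_ctrl(buf, op):
    """Return the buffer contents expected after one TEST_DMA_CTRL operation."""
    if op == TEST_DMA_CTRL_CLEAR:
        return bytes(len(buf))                # assumed: clear means all zeros
    if op == TEST_DMA_CTRL_FILL:
        return b"\x96" * len(buf)             # fill pattern from the table above
    if op == TEST_DMA_CTRL_INVERT:
        return bytes(b ^ 0xFF for b in buf)   # assumed: bitwise NOT per byte
    raise ValueError("unknown TEST_DMA_CTRL operation: %r" % (op,))

# Example sequence: fill, then invert.
filled = expected_after_dma_ctrl(bytes(8), TEST_DMA_CTRL_FILL)
inverted = expected_after_dma_ctrl(filled, TEST_DMA_CTRL_INVERT)
```

+
+Running the same comparison over buffers of several sizes and start offsets
+covers the boundary cases mentioned above.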
+
+
+SECTION 6: Ports
+================
+
+Physical and Logical Ports
+--------------------------
+
+The switch supports up to 62 physical (front-panel) ports. Register
+PORT_PHYS_COUNT returns the actual number of physical ports available:
+
+ PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R)
+
+In addition to front-panel ports, the switch supports logical ports for
+tunnels.
+
+Front-panel ports and logical tunnel ports are mapped into a single 32-bit port
+space. A special CPU port is assigned port 0. The front-panel ports are
+mapped to ports 1-62. A special loopback port is assigned port 63. Logical
+tunnel ports are assigned ports 0x00010000-0x0001ffff. To summarize the port
+assignments:
+
+    port                     mapping
+    -------------------------------------------------------
+    0                        CPU port (for packets to/from host CPU)
+    1-62                     front-panel physical ports
+    63                       loopback port
+    64-0x0000ffff            RSVD
+    0x00010000-0x0001ffff    logical tunnel ports
+    0x00020000-0xffffffff    RSVD
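+
+The mapping table translates directly into a range check. The helper below is
+a minimal sketch with a hypothetical name and hypothetical return strings; it
+only restates the ranges listed above.

```python
def classify_port(port):
    """Map a 32-bit port number to its class per the port mapping table."""
    if not 0 <= port <= 0xFFFFFFFF:
        raise ValueError("port number must fit in 32 bits")
    if port == 0:
        return "cpu"
    if 1 <= port <= 62:
        return "physical"
    if port == 63:
        return "loopback"
    if 0x00010000 <= port <= 0x0001FFFF:
        return "tunnel"
    return "reserved"   # 64-0x0000ffff and 0x00020000-0xffffffff
```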
+
+Physical Port Mode
+------------------
+
+Switch front-panel ports operate in a mode. Currently, the only mode is
+OF-DPA. OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA)
+Abstract Switch Specification, Version 1.0, from Broadcom Corporation. To
+set/get the mode for front-panel ports, see port settings, below.
+
+Port Settings
+-------------
+
+Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS:
+
+ PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R)
+
+ Value is port bitmap. Bits 0 and 63 always read 0. Bits 1-62
+ read 1 for link UP and 0 for link DOWN for respective front-panel ports.
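+
+Decoding the bitmap is a matter of testing bits 1-62. The following sketch
+(the function name is hypothetical) takes the 64-bit value read from
+PORT_PHYS_LINK_STATUS and lists the front-panel ports with link UP:

```python
def ports_with_link_up(link_status):
    """Return front-panel port numbers (1-62) whose link-status bit is set."""
    # Bits 0 and 63 always read 0 and do not map to front-panel ports.
    return [port for port in range(1, 63) if (link_status >> port) & 1]
```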
+
+Other properties for front-panel ports are available via DMA CMD descriptors:
+
+    Get PORT_SETTINGS descriptor:
+
+        field           width   description
+        ----------------------------------------------
+        PORT_SETTINGS   2       CMD_GET
+          PPORT         4       Physical port #
+
+    Get PORT_SETTINGS completion:
+
+        field           width   description
+        ----------------------------------------------
+        PPORT           4       Physical port #
+        SPEED           4       Current port interface speed, in Mbps
+        DUPLEX          1       1 = Full, 0 = Half
+        AUTONEG         1       1 = enabled, 0 = disabled
+        MACADDR         6       Port MAC address
+        MODE            1       0 = OF-DPA
+        LEARNING        1       MAC address learning on port
+                                  1 = enabled
+                                  0 = disabled
+        PHYS_NAME       <var>   Physical port name (string)
+
+    Set PORT_SETTINGS descriptor:
+
+        field           width   description
+        ----------------------------------------------
+        PORT_SETTINGS   2       CMD_SET
+          PPORT         4       Physical port #
+          SPEED         4       Port interface speed, in Mbps
+          DUPLEX        1       1 = Full, 0 = Half
+          AUTONEG       1       1 = enabled, 0 = disabled
+          MACADDR       6       Port MAC address
+          MODE          1       0 = OF-DPA
+
+Port Enable
+-----------
+
+Front-panel ports are initially disabled, which means port ingress and egress
+packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE:
+
+ PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W)
+
+ Value is bitmap of first 64 ports. Bits 0 and 63 are ignored
+ and always read as 0. Write 1 to enable port; write 0 to disable it.
+ Default is 0.
+
+
+SECTION 7: Switch Control
+=========================
+
+This section covers switch-wide register settings.
+
+Control
+-------
+
+This register is used for low level control of the switch.
+
+ CONTROL: offset 0x0300, 32-bit, (W)
+
+ bit name description
+ ------------------------------------------------------------------------
+ [0] CONTROL_RESET If set, device will perform reset
+ [1:31] Reserved
+
+Switch ID
+---------
+
+The switch has a SWITCH_ID to be used by software to uniquely identify the
+switch:
+
+ SWITCH_ID: offset 0x0320, 64-bit, (R)
+
+ Value is opaque to switch software and no special encoding is implied.
+
+
+SECTION 8: Events
+=================
+
+Non-I/O asynchronous events from the device are reported to the host using the
+event ring. The TLV structure for events is:
+
+    field       width   description
+    ---------------------------------------------------
+    TYPE        4       Event type, one of:
+                          1: LINK_CHANGED
+                          2: MAC_VLAN_SEEN
+    INFO        <nest>  Event info (details below)
+
+Link Changed Event
+------------------
+
+When link status changes on a physical port, this event is generated.
+
+    field       width   description
+    ---------------------------------------------------
+    INFO        <nest>
+      PPORT     4       Physical port
+      LINKUP    1       Link status:
+                          0: down
+                          1: up
+
+MAC VLAN Seen Event
+-------------------
+
+When a packet ingresses on a port and the source MAC/VLAN isn't known to the
+device, the device will generate this event. In response to the event, the
+driver should install the MAC/VLAN for that port into the device's bridge
+table. Once installed, the MAC/VLAN is known on the port and this event will
+no longer be generated.
+
+    field       width   description
+    ---------------------------------------------------
+    INFO        <nest>
+      PPORT     4       Physical port
+      MAC       6       MAC address
+      VLAN      2       VLAN ID
+
+
+SECTION 9: CPU Packet Processing
+================================
+
+Ingress packets directed to the host CPU for further processing are delivered
+in the DMA RX ring. Likewise, host CPU originating packets destined to egress
+on switch ports are scheduled by software using the DMA TX ring.
+
+Tx Packet Processing
+--------------------
+
+Software schedules packets for egress on switch ports using the DMA TX ring. A
+TX descriptor buffer describes the packet location and size in host DMA-able
+memory, the destination port, and any hardware-offload functions (such as L3
+payload checksum offload). Software then bumps the descriptor head to signal
+hardware of new Tx work. In response, hardware will DMA read Tx descriptors up
+to head, DMA read descriptor buffer and packet data, perform offloading
+functions, and finally frame packet on wire (network). Once packet processing
+is complete, hardware will writeback status to descriptor(s) to signal to
+software that Tx is complete and software resources (e.g. skb) backing packet
+can be released.
+
+Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A
+TLV is used for each packet fragment.
+
+ pkt frag 1
+ +–––––––+ +–+
+ +–––+ | |
+ desc buf | | | |
+ +––––––––+ | | | |
+ Tx ring +–––+ +–––––+ | | |
+ +–––––––––+ | | TLVs | +–––––––+ |
+ | +–––+ +––––––––+ pkt frag 2 |
+ | desc 0 | | +–––––+ +–––––––+ |
+ +–––––––––+ | TLVs | +–––+ | |
+ head+–+ | +––––––––+ | | |
+ | desc 1 | | +–––––+ +–––––––+ |pkt
+ +–––––––––+ | TLVs | | |
+ | | +––––––––+ | pkt frag 3 |
+ | | | +–––––––+ |
+ +–––––––––+ +–––+ | |
+ | | | | |
+ | | | | |
+ +–––––––––+ | | |
+ | | | | |
+ | | | | |
+ +–––––––––+ | | |
+ | | +–––––––+ +–+
+ | |
+ +–––––––––+
+
+    fig. 2
+
+The TLVs for Tx descriptor buffer are:
+
+    field               width   description
+    ---------------------------------------------------------------------
+    PPORT               4       Destination physical port #
+    TX_OFFLOAD          1       Hardware offload modes:
+                                  0: no offload
+                                  1: insert IP csum (ipv4 only)
+                                  2: insert TCP/UDP csum
+                                  3: L3 csum calc and insert
+                                     into csum offset (TX_L3_CSUM_OFF)
+                                     16-bit 1's complement csum value.
+                                     IPv4 pseudo-header and IP
+                                     already calculated by OS
+                                     and inserted.
+                                  4: TSO (TCP Segmentation Offload)
+    TX_L3_CSUM_OFF      2       For L3 csum offload mode, the offset,
+                                from the beginning of the packet,
+                                of the csum field in the L3 header
+    TX_TSO_MSS          2       For TSO offload mode, the
+                                Maximum Segment Size in bytes
+    TX_TSO_HDR_LEN      2       For TSO offload mode, the
+                                length of ethernet, IP, and
+                                TCP/UDP headers, including IP
+                                and TCP options.
+    TX_FRAGS            <array> Packet fragments
+      TX_FRAG           <nest>  Packet fragment
+        TX_FRAG_ADDR    8       DMA address of packet fragment
+        TX_FRAG_LEN     2       Packet fragment length
+
+Possible status return codes in descriptor on completion are:
+
+    DESC_COMP_ERR       reason
+    --------------------------------------------------------------------
+    0                   OK
+    -ROCKER_ENXIO       address or data read err on desc buf or packet
+                        fragment
+    -ROCKER_EINVAL      bad pport or TSO or csum offloading error
+    -ROCKER_ENOMEM      no memory for internal staging tx fragment
+
+Rx Packet Processing
+--------------------
+
+Packets ingressing on switch ports that are not forwarded by the switch but
+rather directed to the host CPU for further processing are delivered in the DMA
+RX ring. Rx descriptor buffers are allocated by software and placed on the
+ring. Hardware will fill Rx descriptor buffers with packet data, write the
+completion, and signal to software that a new packet is ready. Since Rx packet
+size is not known a priori, the Rx descriptor buffer must be allocated for
+worst-case packet size. A single Rx descriptor will contain the entire Rx
+packet data in one RX_FRAG. Other Rx TLVs describe any hardware offloads
+performed on the packet, such as checksum validation.
+
+The TLVs for Rx descriptor buffer are:
+
+    field               width   description
+    ---------------------------------------------------
+    PPORT               4       Source physical port #
+    RX_FLAGS            2       Packet parsing flags:
+                                  (1 << 0): IPv4 packet
+                                  (1 << 1): IPv6 packet
+                                  (1 << 2): csum calculated
+                                  (1 << 3): IPv4 csum good
+                                  (1 << 4): IP fragment
+                                  (1 << 5): TCP packet
+                                  (1 << 6): UDP packet
+                                  (1 << 7): TCP/UDP csum good
+                                  (1 << 8): Offload forward
+    RX_CSUM             2       IP calculated checksum:
+                                  IPv4: IP payload csum
+                                  IPv6: header and payload csum
+                                (Only valid if RX_FLAGS:csum calc is set)
+    RX_FRAG_ADDR        8       DMA address of packet fragment
+    RX_FRAG_MAX_LEN     2       Packet maximum fragment length
+    RX_FRAG_LEN         2       Actual packet fragment length after receive
+
+Offload forward RX_FLAG indicates the device has already forwarded the packet
+so the host CPU should not also forward the packet.
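+
+A driver would typically test individual RX_FLAGS bits after a completion.
+The sketch below names the bits from the table above using an IntFlag; the
+enum member names and the two helper functions are hypothetical and only
+illustrate how the offload-forward and csum-calculated bits might be used.

```python
from enum import IntFlag

class RxFlags(IntFlag):
    """RX_FLAGS bits as listed in the Rx TLV table (member names are hypothetical)."""
    IPV4 = 1 << 0
    IPV6 = 1 << 1
    CSUM_CALC = 1 << 2
    IPV4_CSUM_GOOD = 1 << 3
    IP_FRAG = 1 << 4
    TCP = 1 << 5
    UDP = 1 << 6
    TCP_UDP_CSUM_GOOD = 1 << 7
    FWD_OFFLOAD = 1 << 8

def host_should_forward(rx_flags):
    """False when the device has already forwarded the packet."""
    return not (rx_flags & RxFlags.FWD_OFFLOAD)

def rx_csum_is_valid(rx_flags):
    """RX_CSUM is only meaningful when the csum-calculated bit is set."""
    return bool(rx_flags & RxFlags.CSUM_CALC)
```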
+
+Possible status return codes in descriptor on completion are:
+
+    DESC_COMP_ERR       reason
+    --------------------------------------------------------------------
+    0                   OK
+    -ROCKER_ENXIO       address or data read err on desc buf
+    -ROCKER_ENOMEM      no memory for internal staging desc buf
+    -ROCKER_EMSGSIZE    Rx descriptor buffer wasn't big enough to contain
+                        packet data TLV and other TLVs.
+
+
+SECTION 10: OF-DPA Mode
+=======================
+
+OF-DPA mode allows the switch to offload flow packet processing functions to
+hardware. An OpenFlow controller would communicate with an OpenFlow agent
+installed on the switch. The OpenFlow agent would (directly or indirectly)
+communicate with the Rocker switch driver, which in turn would program switch
+hardware with flow functionality, as defined in OF-DPA. The block diagram is:
+
+            +-----------------------+
+            |          OF           |
+            |   Remote Controller   |
+            +-----------+-----------+
+                        |
+                        |
+               +--------+---------+
+               |        OF        |
+               |   Local Agent    |
+               +------------------+
+               |                  |
+               |  Rocker Driver   |
+               +------------------+
+                   <this spec>
+               +------------------+
+               |                  |
+               |  Rocker Switch   |
+               +------------------+
+
+To participate in flow functions, ports must be configured for OF-DPA mode
+during switch initialization.
+
+OF-DPA Flow Table Interface
+---------------------------
+
+There are commands to add, modify, delete, and get stats of flow table entries.
+The commands are issued using the DMA CMD descriptor ring. The following
+commands are defined:
+
+ CMD_ADD: add an entry to flow table
+ CMD_MOD: modify an entry in flow table
+ CMD_DEL: delete an entry from flow table
+ CMD_GET_STATS: get stats for flow entry
+
+TLVs for add and modify commands are:
+
+ field width description
+ ----------------------------------------------------
+ OF_DPA_CMD 2 CMD_[ADD|MOD]
+ OF_DPA_TBL 2 Flow table ID
+ 0: ingress port
+ 10: vlan
+ 20: termination mac
+ 30: unicast routing
+ 40: multicast routing
+ 50: bridging
+ 60: ACL policy
+ OF_DPA_PRIORITY 4 Flow priority
+ OF_DPA_HARDTIME 4 Hard timeout for flow
+ OF_DPA_IDLETIME 4 Idle timeout for flow
+ OF_DPA_COOKIE 8 Cookie
+
+Additional TLVs based on flow table ID:
+
+Table ID 0: ingress port
+
+ field width description
+ ----------------------------------------------------
+ OF_DPA_IN_PPORT 4 ingress physical port number
+ OF_DPA_GOTO_TBL 2 goto table ID; zero to drop
+
+Table ID 10: vlan
+
+ field width description
+ ----------------------------------------------------
+ OF_DPA_IN_PPORT 4 ingress physical port number
+ OF_DPA_VLAN_ID 2 (N) vlan ID
+ OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask
+ OF_DPA_GOTO_TBL 2 goto table ID; zero to drop
+ OF_DPA_NEW_VLAN_ID 2 (N) new vlan ID
+
+Table ID 20: termination mac
+
+ field width description
+ ----------------------------------------------------
+ OF_DPA_IN_PPORT 4 ingress physical port number
+ OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask
+ OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd
+ OF_DPA_DST_MAC 6 (N) destination MAC
+ OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask
+ OF_DPA_VLAN_ID 2 (N) vlan ID
+ OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask
+ OF_DPA_GOTO_TBL 2 only acceptable values are
+ unicast or multicast routing
+ table IDs
+ OF_DPA_OUT_PPORT 2 if specified, must be
+ controller, set zero otherwise
+
+Table ID 30: unicast routing
+
+ field width description
+ ----------------------------------------------------
+ OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd
+ OF_DPA_DST_IP 4 (N) destination IPv4 address.
+ Must be unicast address
+ OF_DPA_DST_IP_MASK 4 (N) IP mask. Must be prefix mask
+ OF_DPA_DST_IPV6 16 (N) destination IPv6 address.
+ Must be unicast address
+ OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask. Must be prefix mask
+ OF_DPA_GOTO_TBL 2 goto table ID; zero to drop
+ OF_DPA_GROUP_ID 4 data for GROUP action must
+ be an L3 Unicast group entry
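+
+Because OF_DPA_DST_IP_MASK must be a prefix mask and is carried in
+network-byte-order, a driver may want to build and sanity-check it before
+issuing the command. The helpers below are a minimal sketch with hypothetical
+names; they operate on the 4-byte value as it would appear in the TLV.

```python
def ipv4_prefix_mask(prefix_len):
    """Build a 4-byte prefix mask in network byte order, e.g. 24 -> ff ff ff 00."""
    if not 0 <= prefix_len <= 32:
        raise ValueError("prefix length must be 0-32")
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return mask.to_bytes(4, "big")

def is_ipv4_prefix_mask(mask_bytes):
    """True if the 4-byte network-order mask is contiguous ones followed by zeros."""
    if len(mask_bytes) != 4:
        raise ValueError("IPv4 mask must be 4 bytes")
    inverted = ~int.from_bytes(mask_bytes, "big") & 0xFFFFFFFF
    # For a prefix mask the inverted value is 2**k - 1, so adding one clears every bit.
    return (inverted & (inverted + 1)) == 0
```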
+
+Table ID 40: multicast routing
+
+ field width description
+ ----------------------------------------------------
+ OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd
+ OF_DPA_VLAN_ID 2 (N) vlan ID
+ OF_DPA_SRC_IP 4 (N) source IPv4. Optional,
+ can contain IPv4 address,
+ must be completely masked
+ if not used
+ OF_DPA_SRC_IP_MASK 4 (N) IP Mask
+ OF_DPA_DST_IP 4 (N) destination IPv4 address.
+ Must be multicast address
+ OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional.
+ Can contain IPv6 address,
+ must be completely masked
+ if not used
+ OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask.
+ OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must
+ be multicast address
+ OF_DPA_GOTO_TBL 2 goto table ID; zero to drop
+ OF_DPA_GROUP_ID 4 data for GROUP action must
+ be an L3 multicast group entry
+
+Table ID 50: bridging
+
+ field width description
+ ----------------------------------------------------
+ OF_DPA_VLAN_ID 2 (N) vlan ID
+ OF_DPA_TUNNEL_ID 4 tunnel ID
+ OF_DPA_DST_MAC 6 (N) destination MAC
+ OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask
+ OF_DPA_GOTO_TBL 2 goto table ID; zero to drop
+ OF_DPA_GROUP_ID 4 data for GROUP action must
+ be a L2 Interface, L2
+ Multicast, L2 Flood,
+ or L2 Overlay group entry
+ as appropriate
+ OF_DPA_TUNNEL_LPORT 4 unicast Tenant Bridging
+ flows specify a tunnel
+ logical port ID
+ OF_DPA_OUT_PPORT 2 data for OUTPUT action,
+ restricted to CONTROLLER,
+ set to 0 otherwise
+
+Table ID 60: acl policy
+
+ field width description
+ ----------------------------------------------------
+ OF_DPA_IN_PPORT 4 ingress physical port number
+ OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask
+ OF_DPA_ETHERTYPE 2 (N) ethertype
+ OF_DPA_VLAN_ID 2 (N) vlan ID
+ OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask
+ OF_DPA_VLAN_PCP 2 (N) vlan Priority Code Point
+ OF_DPA_VLAN_PCP_MASK 2 (N) vlan Priority Code Point mask
+ OF_DPA_SRC_MAC 6 (N) source MAC
+ OF_DPA_SRC_MAC_MASK 6 (N) source MAC mask
+ OF_DPA_DST_MAC 6 (N) destination MAC
+ OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask
+ OF_DPA_TUNNEL_ID 4 tunnel ID
+ OF_DPA_SRC_IP 4 (N) source IPv4. Optional,
+ can contain IPv4 address,
+ must be completely masked
+ if not used
+ OF_DPA_SRC_IP_MASK 4 (N) IP Mask
+ OF_DPA_DST_IP 4 (N) destination IPv4 address.
+ Must be multicast address
+ OF_DPA_DST_IP_MASK 4 (N) IP Mask
+ OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional.
+ Can contain IPv6 address,
+ must be completely masked
+ if not used
+ OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask
+ OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must
+ be multicast address.
+ OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask
+ OF_DPA_SRC_ARP_IP 4 (N) source IPv4 address in the ARP
+ payload. Only used if ethertype
+ == 0x0806.
+ OF_DPA_SRC_ARP_IP_MASK 4 (N) IP Mask
+ OF_DPA_IP_PROTO 1 IP protocol
+ OF_DPA_IP_PROTO_MASK 1 IP protocol mask
+ OF_DPA_IP_DSCP 1 DSCP
+ OF_DPA_IP_DSCP_MASK 1 DSCP mask
+ OF_DPA_IP_ECN 1 ECN
+ OF_DPA_IP_ECN_MASK 1 ECN mask
+ OF_DPA_L4_SRC_PORT 2 (N) L4 source port, only for
+ TCP, UDP, or SCTP
+ OF_DPA_L4_SRC_PORT_MASK 2 (N) L4 source port mask
+ OF_DPA_L4_DST_PORT 2 (N) L4 destination port, only for
+ TCP, UDP, or SCTP
+ OF_DPA_L4_DST_PORT_MASK 2 (N) L4 destination port mask
+ OF_DPA_ICMP_TYPE 1 ICMP type, only if IP
+ protocol is 1
+ OF_DPA_ICMP_TYPE_MASK 1 ICMP type mask
+ OF_DPA_ICMP_CODE 1 ICMP code
+ OF_DPA_ICMP_CODE_MASK 1 ICMP code mask
+ OF_DPA_IPV6_LABEL 4 (N) IPv6 flow label
+ OF_DPA_IPV6_LABEL_MASK 4 (N) IPv6 flow label mask
+ OF_DPA_GROUP_ID 4 data for GROUP action
+ OF_DPA_QUEUE_ID_ACTION 1 write the queue ID
+ OF_DPA_NEW_QUEUE_ID 1 queue ID
+ OF_DPA_VLAN_PCP_ACTION 1 write the VLAN priority
+ OF_DPA_NEW_VLAN_PCP 1 VLAN priority
+ OF_DPA_IP_DSCP_ACTION 1 write the DSCP
+ OF_DPA_NEW_IP_DSCP 1 new DSCP
+ OF_DPA_TUNNEL_LPORT 4 restricted to valid tunnel
+ logical port, set to 0
+ otherwise.
+ OF_DPA_OUT_PPORT 2 data for OUTPUT action,
+ restricted to CONTROLLER,
+ set to 0 otherwise
+ OF_DPA_CLEAR_ACTIONS 4 if 1 packets matching flow are
+ dropped (all other instructions
+ ignored)
+
+TLVs for flow delete and get stats command are:
+
+ field width description
+ ---------------------------------------------------
+ OF_DPA_CMD 2 CMD_[DEL|GET_STATS]
+ OF_DPA_COOKIE 8 Cookie
+
+On completion of get stats command, the descriptor buffer is written back with
+the following TLVs:
+
+ field width description
+ ---------------------------------------------------
+ OF_DPA_STAT_DURATION 4 Flow duration
+ OF_DPA_STAT_RX_PKTS 8 Received packets
+ OF_DPA_STAT_TX_PKTS 8 Transmit packets
+
+Possible status return codes in descriptor on completion are:
+
+ DESC_COMP_ERR command reason
+ --------------------------------------------------------------------
+ 0 all OK
+ -ROCKER_EFAULT all head or tail index outside
+ of ring
+ -ROCKER_ENXIO all address or data read err on
+ desc buf
+ -ROCKER_EMSGSIZE GET_STATS cmd descriptor buffer wasn't
+ big enough to contain write-back
+ TLVs
+ -ROCKER_EINVAL all invalid parameters passed in
+ -ROCKER_EEXIST ADD entry already exists
+ -ROCKER_ENOSPC ADD no space left in flow table
+ -ROCKER_ENOENT MOD|DEL|GET_STATS cookie invalid
+
+Group Table Interface
+---------------------
+
+There are commands to add, modify, delete, and get stats of group table
+entries. The commands are issued using the DMA CMD descriptor ring. The
+following commands are defined:
+
+ CMD_ADD: add an entry to group table
+ CMD_MOD: modify an entry in group table
+ CMD_DEL: delete an entry from group table
+ CMD_GET_STATS: get stats for group entry
+
+TLVs for add and modify commands are:
+
+ field width description
+ -----------------------------------------------------------
+ FLOW_GROUP_CMD 2 CMD_[ADD|MOD]
+ FLOW_GROUP_ID 2 Flow group ID
+ FLOW_GROUP_TYPE 1 Group type:
+ 0: L2 interface
+ 1: L2 rewrite
+ 2: L3 unicast
+ 3: L2 multicast
+ 4: L2 flood
+ 5: L3 interface
+ 6: L3 multicast
+ 7: L3 ECMP
+ 8: L2 overlay
+ FLOW_VLAN_ID 2 Vlan ID (types 0, 3, 4, 6)
+ FLOW_L2_PORT 2 Port (types 0)
+ FLOW_INDEX 4 Index (all types but 0)
+ FLOW_OVERLAY_TYPE 1 Overlay sub-type (type 8):
+ 0: Flood unicast tunnel
+ 1: Flood multicast tunnel
+ 2: Multicast unicast tunnel
+ 3: Multicast multicast tunnel
+ FLOW_GROUP_ACTION nest
+ FLOW_GROUP_ID 2 next group ID in chain (all
+ types except 0)
+ FLOW_OUT_PORT 4 egress port (types 0, 8)
+ FLOW_POP_VLAN_TAG 1 strip outer VLAN tag (type 1
+ only)
+ FLOW_VLAN_ID 2 (types 1, 5)
+ FLOW_SRC_MAC 6 (types 1, 2, 5)
+ FLOW_DST_MAC 6 (types 1, 2)
+
+TLVs for group delete and get stats command are:
+
+ field width description
+ -----------------------------------------------------------
+ FLOW_GROUP_CMD 2 CMD_[DEL|GET_STATS]
+ FLOW_GROUP_ID 2 Flow group ID
+
+On completion of get stats command, the descriptor buffer is written back with
+the following TLVs:
+
+ field width description
+ ---------------------------------------------------
+ FLOW_GROUP_ID 2 Flow group ID
+ FLOW_STAT_DURATION 4 Flow duration
+ FLOW_STAT_REF_COUNT 4 Flow reference count
+ FLOW_STAT_BUCKET_COUNT 4 Flow bucket count
+
+Possible status return codes in descriptor on completion are:
+
+ DESC_COMP_ERR command reason
+ --------------------------------------------------------------------
+ 0 all OK
+ -ROCKER_EFAULT all head or tail index outside
+ of ring
+ -ROCKER_ENXIO all address or data read err on
+ desc buf
+ -ROCKER_ENOSPC GET_STATS cmd descriptor buffer wasn't
+ big enough to contain write-back
+ TLVs
+ -ROCKER_EINVAL ADD|MOD invalid parameters passed in
+ -ROCKER_EEXIST ADD entry already exists
+ -ROCKER_ENOSPC ADD no space left in group table
+ -ROCKER_ENOENT MOD|DEL|GET_STATS group ID invalid
+ -ROCKER_EBUSY DEL group reference count non-zero
+ -ROCKER_ENODEV ADD next group ID doesn't exist
+
+
+
+References
+==========
+
+[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification,
+Version 1.0, from Broadcom Corporation, February 21, 2014.