author     Eric Anholt <eric@anholt.net>    2023-10-19 10:21:04 +0200
committer  Marge Bot <emma+marge@anholt.net>    2023-10-23 17:59:55 +0000
commit     7a3fb60ac85300f0030c5edd2587bf4913c17f69 (patch)
tree       121def6ac5b36b87257d3c6e41fe1c8e85c7fc82 /docs
parent     553070f993f576b8dd0688c4548bca9035679a5b (diff)
docs/ci: Add links in the CI docs on how to track job flakes
and how to figure out how many boards are available for sharding management.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25806>
Diffstat (limited to 'docs')
-rw-r--r--  docs/ci/docker.rst   2
-rw-r--r--  docs/ci/index.rst   34
2 files changed, 28 insertions, 8 deletions
diff --git a/docs/ci/docker.rst b/docs/ci/docker.rst
index 4a3c842416d..4e181335fa2 100644
--- a/docs/ci/docker.rst
+++ b/docs/ci/docker.rst
@@ -34,7 +34,7 @@ at the job's log for which specific tests failed).
DUT requirements
----------------
-In addition to the general :ref:`CI-farm-expectations`, using
+In addition to the general :ref:`CI-job-user-expectations`, using
Docker requires:
* DUTs must have a stable kernel and GPU reset (if applicable).
diff --git a/docs/ci/index.rst b/docs/ci/index.rst
index 2b8797200f7..bd7e3d49103 100644
--- a/docs/ci/index.rst
+++ b/docs/ci/index.rst
@@ -148,10 +148,10 @@ If you're having issues with the Intel CI, your best bet is to ask about
it on ``#dri-devel`` on OFTC and tag `Nico Cortes
<https://gitlab.freedesktop.org/ngcortes>`__ (``ngcortes`` on IRC).
-.. _CI-farm-expectations:
+.. _CI-job-user-expectations:
-CI farm expectations
---------------------
+CI job user expectations:
+-------------------------
To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
@@ -160,11 +160,23 @@ driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.
+To ensure this, driver maintainers with CI enabled should watch the Flakes panel
+of the `CI flakes dashboard
+<https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1>`__,
+particularly the "Flake jobs" pane, to inspect jobs for their driver where the
+automatic retry of a failing job succeeded on the second run.
+Additionally, most CI farms report test-level flakes to an IRC channel; flakes
+reported as NEW are not expected and can cause spurious job failures.
+Please track the NEW reports in your jobs and add them as appropriate to the
+``-flakes.txt`` file for your driver.
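(For reference: the flakes file named above is the per-driver list of known-flaky
test-name patterns that deqp-runner reads, roughly one test name or regex per
line with ``#`` comments, typically kept under the driver's ``src/*/ci/``
directory. As a minimal sketch, with a hypothetical comment and test names, a
newly reported flake might be recorded like this::

   # flaking on this farm's boards since 2023-10; retried jobs pass
   dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.36
   KHR-GL46.shader_image_load_store.advanced-sso-subroutine
)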
+
Additionally, the test farm needs to be able to provide a short enough
-turnaround time that we can get our MRs through marge-bot without the
-pipeline backing up. As a result, we require that the test farm be
-able to handle a whole pipeline's worth of jobs in less than 15 minutes
-(to compare, the build stage is about 10 minutes).
+turnaround time that we can get our MRs through marge-bot without the pipeline
+backing up. As a result, we require that the test farm be able to handle a
+whole pipeline's worth of jobs in less than 15 minutes (to compare, the build
+stage is about 10 minutes). Given boot times and intermittent network delays,
+this generally means that the test runtime as reported by deqp-runner should be
+kept to 10 minutes.
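As a rough worked example of that budget (the numbers here are illustrative, not
from the Mesa docs): a test list needing about 40 minutes of single-board
deqp-runner time has to be sharded across at least four boards to stay near the
10-minute target, which GitLab CI expresses with the ``parallel`` keyword
(assuming the job's runner script honors GitLab's ``CI_NODE_INDEX`` /
``CI_NODE_TOTAL`` sharding, as Mesa's deqp-runner wrapper scripts generally do):

.. code-block:: yaml

   # Illustrative sketch: hypothetical job name and shard count.
   # ~40 min of total deqp-runner runtime / ~10 min per shard => 4 shards.
   example-driver-deqp-gles31:
     parallel: 4
     timeout: 15m   # hard cap so one stuck board can't back up the pipeline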
If a test farm is short on the HW needed to provide these guarantees, consider
dropping tests to reduce runtime. dEQP job logs print the slowest tests at the end of
@@ -179,6 +191,14 @@ artifacts. Or, you can add the following to your job to only run some fraction
to just run 1/10th of the test list.
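The snippet referred to above is elided from this diff's context; as a sketch of
the idea, assuming the ``DEQP_FRACTION`` variable that Mesa's deqp-runner jobs
read, running a tenth of the test list looks roughly like:

.. code-block:: yaml

   # Sketch only: a hypothetical job running 1/10th of its test list.
   example-driver-deqp-gles31:
     variables:
       DEQP_FRACTION: 10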
+For Collabora's LAVA farm, the `device types
+<https://lava.collabora.dev/scheduler/device_types>`__ page can tell you how
+many boards of a specific tag are currently available by adding the "Idle" and
+"Busy" columns. For bare-metal, a gitlab admin can look at the `runners
+<https://gitlab.freedesktop.org/admin/runners>`__ page. A pipeline should
+probably not create more jobs for a board type than there are boards, unless you
+clearly have some short-runtime jobs.
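As a quick arithmetic sketch of that guideline (hypothetical numbers): if the
device-types page shows 3 Idle plus 5 Busy boards of a given tag, there are 8
boards total, so the jobs a single pipeline spawns for that board type should
add up to about 8 or fewer:

.. code-block:: yaml

   # Illustrative only: 6 + 2 = 8 concurrent jobs, matching the 8 boards above.
   example-driver-deqp-full:
     parallel: 6
   example-driver-piglit-quick:
     parallel: 2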
+
If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the