author     Eric Anholt <eric@anholt.net>    2023-10-19 10:21:04 +0200
committer  Marge Bot <emma+marge@anholt.net>    2023-10-23 17:59:55 +0000
commit     7a3fb60ac85300f0030c5edd2587bf4913c17f69 (patch)
tree       121def6ac5b36b87257d3c6e41fe1c8e85c7fc82 /docs
parent     553070f993f576b8dd0688c4548bca9035679a5b (diff)
docs/ci: Add links in the CI docs on how to track job flakes
and how to figure out how many boards are available for sharding management.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25806>
Diffstat (limited to 'docs')
-rw-r--r--  docs/ci/docker.rst   2
-rw-r--r--  docs/ci/index.rst   34
2 files changed, 28 insertions, 8 deletions
diff --git a/docs/ci/docker.rst b/docs/ci/docker.rst
index 4a3c842416d..4e181335fa2 100644
--- a/docs/ci/docker.rst
+++ b/docs/ci/docker.rst
@@ -34,7 +34,7 @@ at the job's log for which specific tests failed).
DUT requirements
----------------
-In addition to the general :ref:`CI-farm-expectations`, using
+In addition to the general :ref:`CI-job-user-expectations`, using
Docker requires:
* DUTs must have a stable kernel and GPU reset (if applicable).
diff --git a/docs/ci/index.rst b/docs/ci/index.rst
index 2b8797200f7..bd7e3d49103 100644
--- a/docs/ci/index.rst
+++ b/docs/ci/index.rst
@@ -148,10 +148,10 @@ If you're having issues with the Intel CI, your best bet is to ask about
it on ``#dri-devel`` on OFTC and tag `Nico Cortes
<https://gitlab.freedesktop.org/ngcortes>`__ (``ngcortes`` on IRC).
-.. _CI-farm-expectations:
+.. _CI-job-user-expectations:
-CI farm expectations
---------------------
+CI job user expectations:
+-------------------------
To make sure that testing of one vendor's drivers doesn't block
unrelated work by other vendors, we require that a given driver's test
@@ -160,11 +160,23 @@ driver had CI and failed once a week, we would be seeing someone's
code getting blocked on a spurious failure daily, which is an
unacceptable cost to the project.
+To ensure this, driver maintainers with CI enabled should watch the Flakes panel
+of the `CI flakes dashboard
+<https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1>`__,
+particularly the "Flake jobs" pane, to inspect jobs for their driver where the
+automatic retry of a failing job succeeded on the second run.
+Additionally, most CI farms report test-level flakes to an IRC channel; flakes
+reported as NEW are not expected and can cause spurious job failures.
+Please track the NEW reports in your jobs and add them as appropriate to the
+``-flakes.txt`` file for your driver.
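(For reference: the flakes file named above is the per-driver list of known-flaky
test-name patterns that deqp-runner reads, roughly one test name or regex per
line with ``#`` comments, typically kept under the driver's ``src/*/ci/``
directory. As a minimal sketch, with a hypothetical comment and test names, a
newly reported flake might be recorded like this::

   # flaking on this farm's boards since 2023-10; retried jobs pass
   dEQP-GLES31.functional.ssbo.layout.random.all_shared_buffer.36
   KHR-GL46.shader_image_load_store.advanced-sso-subroutine
)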
+
Additionally, the test farm needs to be able to provide a short enough
-turnaround time that we can get our MRs through marge-bot without the
-pipeline backing up. As a result, we require that the test farm be
-able to handle a whole pipeline's worth of jobs in less than 15 minutes
-(to compare, the build stage is about 10 minutes).
+turnaround time that we can get our MRs through marge-bot without the pipeline
+backing up. As a result, we require that the test farm be able to handle a
+whole pipeline's worth of jobs in less than 15 minutes (to compare, the build
+stage is about 10 minutes). Given boot times and intermittent network delays,
+this generally means that the test runtime as reported by deqp-runner should be
+kept to 10 minutes.
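As a rough worked example of that budget (the numbers here are illustrative, not
from the Mesa docs): a test list needing about 40 minutes of single-board
deqp-runner time has to be sharded across at least four boards to stay near the
10-minute target, which GitLab CI expresses with the ``parallel`` keyword
(assuming the job's runner script honors GitLab's ``CI_NODE_INDEX`` /
``CI_NODE_TOTAL`` sharding, as Mesa's deqp-runner wrapper scripts generally do):

.. code-block:: yaml

   # Illustrative sketch: hypothetical job name and shard count.
   # ~40 min of total deqp-runner runtime / ~10 min per shard => 4 shards.
   example-driver-deqp-gles31:
     parallel: 4
     timeout: 15m   # hard cap so one stuck board can't back up the pipeline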
If a test farm is short on the HW needed to provide these guarantees, consider
dropping tests to reduce runtime. dEQP job logs print the slowest tests at the end of
@@ -179,6 +191,14 @@ artifacts. Or, you can add the following to your job to only run some fraction
to just run 1/10th of the test list.
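The snippet referred to above is elided from this diff's context; as a sketch of
the idea, assuming the ``DEQP_FRACTION`` variable that Mesa's deqp-runner jobs
read, running a tenth of the test list looks roughly like:

.. code-block:: yaml

   # Sketch only: a hypothetical job running 1/10th of its test list.
   example-driver-deqp-gles31:
     variables:
       DEQP_FRACTION: 10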
+For Collabora's LAVA farm, the `device types
+<https://lava.collabora.dev/scheduler/device_types>`__ page can tell you how
+many boards of a specific tag are currently available by adding the "Idle" and
+"Busy" columns. For bare-metal, a gitlab admin can look at the `runners
+<https://gitlab.freedesktop.org/admin/runners>`__ page. A pipeline should
+probably not create more jobs for a board type than there are boards, unless you
+clearly have some short-runtime jobs.
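As a quick arithmetic sketch of that guideline (hypothetical numbers): if the
device-types page shows 3 Idle plus 5 Busy boards of a given tag, there are 8
boards total, so the jobs a single pipeline spawns for that board type should
add up to about 8 or fewer:

.. code-block:: yaml

   # Illustrative only: 6 + 2 = 8 concurrent jobs, matching the 8 boards above.
   example-driver-deqp-full:
     parallel: 6
   example-driver-piglit-quick:
     parallel: 2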
+
If a HW CI farm goes offline (network dies and all CI pipelines end up
stalled) or its runners are consistently spuriously failing (disk
full?), and the maintainer is not immediately available to fix the