Adding boards to LAVA

The main purpose of LAVA is to test on real hardware. This document sets out the steps needed to add a new type of device (DUT) to LAVA. Fundamentally, for each device type LAVA needs a few different things to be present:

A way to (hard) reset the device to a known state (aka power control)
A way to communicate with the device from early on in the boot session (boot control)
A way to boot systems from network/storage/etc as applicable (boot loader)
A way to verify a device is healthy (aka health check)

We’ll go through each of these one by one.

Power control

After each job, LAVA turns off a DUT and turns it back on for the next test job. Every time a DUT is turned on it should start in a known good state. For example it should predictably boot to its bootloader with all devices in a clean state. Obviously we cannot support boards where power-on requires manual interaction :)

Power control can typically be achieved in several different ways depending on the board. A common method is to simply control the power supply. However, this is not applicable to some boards (e.g. laptops with a built-in power supply), and other boards need a specific power-up sequence (e.g. power should be applied and stable, and only then can the board come out of reset). For those boards there is prior art in using a dedicated debug/development board (like the Google Servo) or e.g. a Raspberry Pi to control GPIO pins (to ensure a stable reset sequence).

For each new device type the correct method has to be determined and documented.
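
As an illustration of the simplest case (switching the power supply), a hard reset can be sketched as "off, wait, on". This is only a minimal sketch: the command lines passed in are hypothetical placeholders for whatever tool drives the PDU, relay or debug board in a given setup, and the 5 second off time is an assumed default rather than a recommendation from this document.

```python
import subprocess
import time
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run_command(command: Sequence[str]) -> None:
    """Run a power control command and raise if it exits with an error."""
    subprocess.run(list(command), check=True)


def hard_reset(
    power_off: Sequence[str],
    power_on: Sequence[str],
    off_seconds: float = 5.0,  # assumed default; boards differ
    run: Runner = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Power the board off, keep it off for a while, then power it on.

    The off period gives the board time to fully lose power so that it
    starts from a known state instead of resuming a half-reset one.
    """
    run(power_off)
    sleep(off_seconds)
    run(power_on)
```

A call might look like `hard_reset(["example-power-tool", "off"], ["example-power-tool", "on"])`, where `example-power-tool` is a made-up name. Boards that need a specific power-up sequence would need extra steps between the two commands.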

Low-level boot control

As LAVA is used to test essentially from the bootloader onwards, it needs a method to control boards at a very low level, starting from the bootloader.

This is typically done via a serial line or some other dedicated connection (e.g. IPMI SOL or debug boards). The power control method should also ensure that this line is usable and stable directly after power-up, which can be an issue for boards that directly integrate a USB-to-serial adaptor.

For each new device type the correct method has to be determined and documented.
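
One way to check that the control channel is usable right after power-up is to watch it for the bootloader prompt. The sketch below assumes the channel is exposed as a binary stream (for example a serial port object opened with a read timeout) and that the prompt is known. The `"=> "` prompt in the usage note is only an example value, and `wait_for_prompt` is an illustrative helper, not part of LAVA.

```python
from typing import BinaryIO


def wait_for_prompt(stream: BinaryIO, prompt: bytes, max_bytes: int = 65536) -> bool:
    """Return True once `prompt` appears on the stream.

    Returns False if the stream ends (or a read times out and yields no
    data) or if `max_bytes` have been read without seeing the prompt.
    """
    if not prompt:
        raise ValueError("prompt must not be empty")

    window = b""  # the most recent len(prompt) bytes read
    total = 0
    while total < max_bytes:
        chunk = stream.read(1)
        if not chunk:
            return False  # end of stream or read timeout
        total += len(chunk)
        window = (window + chunk)[-len(prompt):]
        if window == prompt:
            return True
    return False
```

With an in-memory stream, `wait_for_prompt(io.BytesIO(b"booting...\n=> "), b"=> ")` returns `True`. If early output is lost because the serial adaptor only appears after power-up, a check like this fails, which is the problem described above.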

Bootloader

Once the device is powered on, LAVA will try to start controlling the bootloader via the control channel described in the previous section. LAVA will then orchestrate the loading of the kernel and other artifacts (e.g. ramdisk, dtb etc.) as required for a given board. In almost all cases these artifacts are loaded from the network.

Fundamentally that means the bootloader configured for the device has to support some form of interactive shell as well as network connectivity (typically DHCP and TFTP) to load artifacts.

On top of that, the bootloader always has to come up in a known good state, so it should not be impacted by e.g. changes on storage media. If possible, the recommendation is to either load the bootloader from the network (if supported by the ROM/UEFI system etc.) or have it loaded from a non-mass-storage location (e.g. SPI flash, eMMC hardware boot partition). Loading a bootloader from an SD card can be supported, but typically means the board will be less reliable and prevents usage of mass storage for tests.

For each new board it should be documented which bootloader is used, how to build/get it and how it’s loaded in the system, as well as any bootloader-specific extra information (e.g. load addresses for U-Boot).
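
To show why the load addresses need to be documented, here is a rough sketch of the kind of command sequence that could be sent to the bootloader shell to fetch artifacts over TFTP and boot them. It assumes a U-Boot bootloader on a 32-bit ARM board booting a zImage with `bootz`. The function name, file names and addresses are illustrative assumptions; other bootloaders and architectures need different commands.

```python
from typing import List, Optional


def uboot_tftp_boot_commands(
    server_ip: str,
    kernel: str,
    kernel_addr: str,
    dtb: str,
    dtb_addr: str,
    ramdisk: Optional[str] = None,
    ramdisk_addr: Optional[str] = None,
) -> List[str]:
    """Build U-Boot shell commands to load artifacts via TFTP and boot them."""
    if ramdisk is not None and ramdisk_addr is None:
        raise ValueError("ramdisk_addr is required when a ramdisk is given")

    commands = [
        "setenv autoload no",  # let 'dhcp' only obtain an address
        "dhcp",
        f"setenv serverip {server_ip}",
        f"tftpboot {kernel_addr} {kernel}",
        f"tftpboot {dtb_addr} {dtb}",
    ]
    if ramdisk is not None:
        # Assumes the ramdisk is in a format U-Boot accepts at a bare
        # address (for example wrapped with a U-Boot image header).
        commands.append(f"tftpboot {ramdisk_addr} {ramdisk}")
        initrd = ramdisk_addr
    else:
        initrd = "-"  # U-Boot's placeholder for "no ramdisk"
    commands.append(f"bootz {kernel_addr} {initrd} {dtb_addr}")
    return commands
```

Every address in this sequence is board-specific, so a device document that omits them leaves the lab team guessing.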

Health checks

To check whether a device is healthy, LAVA uses “known good” jobs. These are run on a regular basis to make sure everything is in a good state. Typically this consists of a simple Linux kernel boot to ramdisk; however, in the future we intend to extend these to support more methods (e.g. NFS boot).

The key requirement here is that for each device type there is a known good kernel available which can be used to judge the device health. Ideally this is simply a recent upstream kernel build with the standard defconfig for the relevant platform. This kernel should also be verified to work with NFS boot; in other words, the kernel configuration should provide a functional network port.

For each board it should be documented which kernel can be used to validate device health.
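
To make the idea of a "known good" job concrete, the sketch below builds a minimal ramdisk-boot job description as a Python dictionary. The field names follow the general shape of LAVA job definitions but should be checked against the LAVA documentation for the version in use. The function name, timeouts, prompt and URLs are assumptions, and a real health check would likely need more fields.

```python
from typing import Any, Dict


def health_check_job(
    device_type: str, kernel_url: str, ramdisk_url: str, dtb_url: str
) -> Dict[str, Any]:
    """Describe a simple 'boot the known good kernel to a ramdisk' job."""
    return {
        "device_type": device_type,
        "job_name": f"health check: {device_type} ramdisk boot",
        # Assumed timeouts; tune them per board.
        "timeouts": {"job": {"minutes": 10}, "action": {"minutes": 5}},
        "priority": "medium",
        "visibility": "public",
        "actions": [
            {
                "deploy": {
                    "to": "tftp",
                    "kernel": {"url": kernel_url},
                    "ramdisk": {"url": ramdisk_url, "compression": "gz"},
                    "dtb": {"url": dtb_url},
                }
            },
            {
                "boot": {
                    "method": "u-boot",  # assumes a U-Boot based board
                    "commands": "ramdisk",
                    "prompts": ["/ #"],  # assumed ramdisk shell prompt
                }
            },
        ],
    }
```

The three URLs are exactly the information the device document needs to provide: where the known good kernel, ramdisk and dtb for the board can be fetched from.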

Device preparation process

To get a device type added to the lab, the following steps should be taken:

Ensure the device can meet all of the above requirements
Open a task in the Phabricator LAVA lab project under the new devices column
- Note in the task why it’s important to add a given device type (what does it add to the lab)
- Note what kind of timeline you’re expecting (e.g. is there a deadline)
- Note the intended number of devices you’d like to add as well as the targeted support level
- Please file the task early as this allows the lab team to plan ahead as infrastructure preparation might be required
Document the device and the above requirements following the TODO template document; the resulting document should be merged into this repository
Once the device document is done, ask the lab to review the document
Ensure the lab team has access to enough devices to be added

Once these steps are done the lab team can add the new devices to the lab.

Device addition process

Roughly the addition process will follow these steps:

New hardware is added to the lab (typically only one example initially)
Device and device type will be configured in LAVA
Device will be put in a looping health-check mode
- This will constantly run health-checks as a stress test.
If the device has survived at least 24 hours of the looping check without failures it will be made available for job submission.
- If there were failures they’ll be researched further to determine the problem together with the person responsible for the preparation.
Additional devices of the type will be added at this point as there is a good baseline stability (each of which should go through a 24 hour burn-in looping health check again).
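
The burn-in criterion in the steps above can be expressed as a small check over the recorded health-check results. This is a sketch under stated assumptions: each result is a `(finish time, passed)` pair, and a failure restarts the 24 hour clock. The restart rule and the function name are illustrative choices, not something prescribed by the lab.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def burn_in_passed(
    results: Iterable[Tuple[datetime, bool]],
    required: timedelta = timedelta(hours=24),
) -> bool:
    """Return True if the latest failure-free run of checks spans `required`."""
    streak_start: Optional[datetime] = None
    latest: Optional[datetime] = None
    for finished, passed in sorted(results, key=lambda r: r[0]):
        latest = finished
        if not passed:
            streak_start = None  # a failure restarts the clock
        elif streak_start is None:
            streak_start = finished
    if streak_start is None or latest is None:
        return False
    return latest - streak_start >= required
```

For example, 25 hourly passing results cover a 24 hour span and satisfy the check, while the same series with one failure in the middle does not until another 24 failure-free hours have accumulated.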

Once the initial addition process is finalised the device will be monitored for longer term reliability before moving from experimental level to supported or priority levels.