-
-
Notifications
You must be signed in to change notification settings - Fork 414
usbhid-ups: improve handling of transient LIBUSB_ERROR_IO failures #3259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
usbhid-ups: improve handling of transient LIBUSB_ERROR_IO failures #3259
Conversation
Some devices (e.g., CyberPower CP1500PFCLCD) have firmware bugs that cause random I/O errors on specific HID reports during normal polling. Rather than triggering expensive reconnection attempts that can fail in daemon mode, skip the failing report and continue with remaining polls. Add safety check to detect true device disconnection: if all polls fail during an update cycle (items_succeeded == 0), trigger reconnect as before. This improves stability especially in daemon mode while still detecting real disconnections via other error codes or complete poll failure. Fixes networkupstools#3116 Signed-off-by: Jordan Rodgers <com6056@gmail.com>
cdee828 to
badbed7
Compare
|
Great analysis, thanks. That must have been a fun debugging session... As for This is probably a separate issue from this avoidance of reconnections in the first place, but did you have a chance to experiment whether reconnection in this driver actually works/fails? I wonder if the problem is coincidental, e.g. nothing wrong with the driver code, but it is the UPS firmware that gets stuck, reboots and is not responding just at the moment we try to connect back to it... I suppose pulling the USB cable and plugging it back while the driver is running can help in the investigation (unless this is one of those UPSes that power on the USB chip when it is connected, and power it off/cycle when not in use). |
Regarding this:
Out of town right now but I can definitely dig in more when I'm back to see if I can somehow get the debug logging while in daemon mode to answer these questions! |
Signed-off-by: Jordan Rodgers <com6056@gmail.com>
|
Disregard the netbsd runner fault, ran out of disk. |
|
Great thanks for your contribution! |
Some devices (e.g., CyberPower CP1500PFCLCD) have firmware bugs that cause random I/O errors on specific HID reports during normal polling. Rather than triggering expensive reconnection attempts that can fail in daemon mode, skip the failing report and continue with remaining polls.
Add safety check to detect true device disconnection: if all polls fail during an update cycle (items_succeeded == 0), trigger reconnect as before.
This improves stability especially in daemon mode while still detecting real disconnections via other error codes or complete poll failure.
Fixes #3116
More in-depth details from my investigation in case it is helpful:
Problem Description
CyberPower UPS devices (tested with CP1500PFCLCD, ProductID 0x0601) experience "Data stale" errors when running in daemon mode, but work perfectly when debug mode is enabled (
-Dflag orNUT_DEBUG_LEVEL=1).Root Cause
The device has firmware bugs that cause transient
LIBUSB_ERROR_IOfailures on certain HID reports during normal polling operations. Reports like:0x1a(input.sensitivity)0x19(ups.realpower)pollfreq)These reports exist and work most of the time, but randomly fail with I/O errors.
Current Behavior (Broken)
When
LIBUSB_ERROR_IOoccurs during polling:reconnect.tryingstatereconnect_ups()to close and reopen USB devicesetsid(), closed file descriptors, etc.)dstate_datastale()→ "Data stale" error to upsdWhy Debug Mode "Worked"
The
-Dflag setsforeground = 1, preventing the driver from daemonizing:setsid()callSo debug mode didn't fix the firmware bug—it just made the reconnection workaround actually work. But reconnecting on every transient error is expensive and unnecessary.
Solution
Skip transient errors instead of reconnecting:
LIBUSB_ERROR_IOoccurs, log it andcontinueto next poll item instead of triggering reconnectitems_succeededcounterWhy This Works
LIBUSB_ERROR_NO_DEVICE,LIBUSB_ERROR_ACCESS, etc.)MAXAGEtimeoutTesting
With this fix and
pollonly=true:LIBUSB_ERROR_IOlogged at debug level 3, but skippedRelated
This is likely the same underlying issue reported in other contexts (see instantlinux/docker-tools#67) where CyberPower devices experience intermittent failures. The difference is our specific model (0x0601) fails predictably during full update cycles, making the root cause easier to identify.
General points
Described the changes in the PR submission or a separate issue, e.g.
known published or discovered protocols, applicable hardware (expected
compatible and actually tested/developed against), limitations, etc.
There may be multiple commits in the PR, aligned and commented with
a functional change. Notably, coding style changes better belong in a
separate PR, but certainly in a dedicated commit to simplify reviews
of "real" changes in the other commits. Similarly for typo fixes in
comments or text documents.
Please star NUT on GitHub, this helps with sponsorships! ;)
Frequent "underwater rocks" for driver addition/update PRs
Revised existing driver families and added a sub-driver if applicable
(
nutdrv_qx,usbhid-ups...) or added a brand new driver in the othercase.
Did not extend obsoleted drivers with new hardware support features
(notably
blazerand other single-device family drivers for Qx protocols,except the new
nutdrv_qxwhich should cover them all).For updated existing device drivers, bumped the
DRIVER_VERSIONmacroor its equivalent.
For USB devices (HID or not), revised that the driver uses unique
VID/PID combinations, or raised discussions when this is not the case
(several vendors do use same interface chips for unrelated protocols).
For new USB devices, built and committed the changes for the
scripts/upower/95-upower-hid.hwdbfileProposed NUT data mapping is aligned with existing
docs/nut-names.txtfile. If the device exposes useful data points not listed in the file, the
experimental.*namespace can be used as documented there, and discussionshould be raised on the NUT Developers mailing list to standardize the new
concept.
Updated
data/driver.list.inif applicable (new tested device info)Frequent "underwater rocks" for general C code PRs
structure layout and alignment in memory, endianness (layout of bytes and
bits in memory for multi-byte numeric types), or use of generic
intwherelanguage or libraries dictate the use of
size_t(orssize_tsometimes).Progress and errors are handled with
upsdebugx(),upslogx(),fatalx()and related methods, not with directprintf()orexit().Similarly, NUT helpers are used for error-checked memory allocation and
string operations (except where customized error handling is needed,
such as unlocking device ports, etc.)
Coding style (including whitespace for indentations) follows precedent
in the code of the file, and examples/guide in
docs/developers.txtfile.For newly added files, the
Makefile.amrecipes were updated and themake distchecktarget passes.General documentation updates
Updated
docs/acknowledgements.txt(for vendor-backed device support)Added or updated manual page information in
docs/man/*.txtfilesand corresponding recipe lists in
docs/man/Makefile.amfor new pagesPassed
make spellcheck, updated spell-checking dictionary in thedocs/nut.dictfile if needed (did not remove any words -- themakerule printout in case of changes suggests how to maintain it).
Additional work may be needed after posting this PR
Propose a PR for NUT DDL with detailed device data dumps from tests
against real hardware (the more models, the better).
Address NUT CI farm build failures for the PR: testing on numerous
platforms and toolkits can expose issues not seen on just one system.
the changed codebase.