thr3ads.net - Nut upsdev - [Nut-upsdev] some megatec-usb issues [Feb 2007]

If this information is useful, please help other people find it:
Share via:

Alexander I. Gordeev

2007-Feb-07 15:57 UTC

[Nut-upsdev] some megatec-usb issues

Hi All,

I've finally found solutions for my previous problems.
But since this includes changes in the shared files, I'd like
to discuss them here.

1. Driver restart problem.
When I start the driver for the first time, everything is ok. But when I want to
restart it, problems begin. The driver fails to read Report descriptor for the
second time (libusb_open is used to open the device). This is caused by my
particular UPS model's allergy. I can read the Report descriptor only once
and
only before reading any "special" string descriptors. After that I can
do
everything else without troubles. But if I attempt to read it after reading
string descriptors than everything stops working (string descriptors too). So I
have to reset UPS...

This works perfectly under Windows, as the Report descriptor is retrieved only
once - by the OS. Program that drives UPS doesn't do it.

So the solution is simple: not to retrieve any HID descriptors as I don't
need
them and they cause problems. But I think the best time to determine whether to
retrieve HID stuff or not is while device matching is performed. There could be
other devices and subdrivers that need these things. My solution is to reserve
some value that a matcher can return indicating that we don't need HID. For
example, 2. I looked through other drivers that use libusb_open, and their
matchers all return 1. (If I'm not right, please, tell me). So this decision
wouldn't affect other drivers. Also I want to note that any matcher should
be
able to prevent retrieving HID descriptors.

libusb_open has already everything that is needed. If the "mode"
variable is set
to MODE_REOPEN then HID descriptors aren't retrieved. So the simpliest thing
is
to change it:

--- libusb.c    (revision 799)
+++ libusb.c    (working copy)
@@ -180,7 +180,7 @@
                                 } else if (ret==-2) {
                                         upsdebugx(2, "matcher: unspecified
error");
                                         goto next_device;
-                               }
+                               } else if (ret==2) mode = MODE_REOPEN;
                         }
                         upsdebugx(2, "Device matches");

This change is enough, it doesn't affect other drivers and IMHO it makes
libusb_open more generic. Of course, some comments should also be there.

2. "UPS No Ack" problem.

I also have a solution for the problem of the "UPS No Ack" answers.
The problem: when I do "informative" requests (Q1, I, F - descriptors
0x3, 0xc
and 0xd respectively) I sometimes get non-usual responses - "UPS No
Ack". So the
data gets stale rather often. By the way, this is the only thing I get when
doing "non-informative" requests.

Example:

tornado:/home/alex/development/nut/tools# ./get_descriptor 002 003 1 0 0 0x81 3
3
Bus 002 device 003 configuration 1 interface 0 altsetting 0 endpoint 129
descriptor 0x03 index 3:

  60 03 28 00 32 00 31 00 37 00 2e 00 30 00 20 00 31 00 36 00 35 00 2e 00
  30 00 20 00 32 00 31 00 37 00 2e 00 30 00 20 00 30 00 30 00 30 00 20 00
  35 00 30 00 2e 00 30 00 20 00 31 00 33 00 2e 00 34 00 20 00 30 00 30 00
  2e 00 30 00 20 00 30 00 30 00 30 00 30 00 31 00 30 00 30 00 30 00 0d 00

  `.(.2.1.7...0. .1.6.5...0. .2.1.7...0. .0.0.0. .5.0...0. .1.3...4. .0.0.
  ..0. .0.0.0.0.1.0.0.0...
tornado:/home/alex/development/nut/tools# ./get_descriptor 002 003 1 0 0 0x81 3
3
Bus 002 device 003 configuration 1 interface 0 altsetting 0 endpoint 129
descriptor 0x03 index 3:

  16 03 55 00 50 00 53 00 20 00 4e 00 6f 00 20 00 41 00 63 00 6b 00

  ..U.P.S. .N.o. .A.c.k.

As you see, the program doesn't report errors. So this is not a USB transfer
error. It should be an internal UPS issue. I think, my UPS is serial internaly
but uses some custom hack that enables USB communication. But sometimes this
mechanism fails because of the bad design or some other things. There is another
fact in support of this theory: if I power off the UPS its USB stuff continues
to work. It behaves quite the same but the only data I can get from these
"special" string descriptors is "UPS No Ack"! So, I think,
the custom
serial-over-USB hardware is USB-powered but it fails to read data from the
serial part since the latter is powered off.

My tests show that the driver gets "UPS No Ack" in about 17,5% of all
requests.
This is quite a huge value because every time the data gets stale I have some
log records. The result is great log pollution. There would be several thousands
of records per day. This is inadmissible for just a buggy hardware.
Also the driver sometimes refuses to start if there are too many failed attempts
during driver startup.

My solution is to try to get descriptor again without delay in the same get_data
call. You may think this way is not good because it will increase number of
errors because of the additional load. No, this is not true. This is incredible,
but my tests show that there will be _less_ errors with this solution. And
furthermore UPS feels best when it is asked permanently without any delay.

Now, about the tests. I've made 3 versions of the driver: (1) without
retries,
(2) with unlimited number of retries (until success) and (3) the one, which
didn't stop asking UPS without delay even in the case of success (this was
not a
driver actually since it never returns data). Here are there results
respectively:

(1)

All requests: 17779
Fails: 3112
Percent failed: 17,5038
Distribution 1:
19.236178       18.223747       17.211317       14.680241       16.705102      
17.211317       17.717532       19.067439
21.092300       14.342764       18.055009       16.198886       16.198886      
17.717532       15.355194       18.561224
17.042578       20.248608       15.523933       18.561224       16.536363      
16.367625       21.261038       18.898701
15.692671       18.392486       16.030148       17.886270       18.223747      
16.873840
Distribution 2:
17.042578       15.861410       15.017718       20.417346       18.561224      
20.586085       17.548794       17.042578
19.742393       16.536363       18.223747       20.248608       18.392486      
17.211317       16.705102       16.367625
15.692671       16.873840       18.392486       14.174026       20.248608      
16.536363       16.536363       15.523933
15.861410       17.211317       18.392486       18.561224       18.561224      
17.042578
Continuous fail subsequences lengths:
1:      1769 (73.463455%)
2:      582 (24.169435%)
3:      50 (2.076412%)
4:      6 (0.249169%)
5:      1 (0.041528%)

(2)

All requests: 20056
Fails: 2629
Percent failed: 13,1083
Distribution 1:
11.368169       10.769844       14.509374       10.919426       15.107698      
13.462306       14.509374       12.116075
14.210211       14.658955       14.210211       12.116075       13.312724      
10.620263       13.163143       13.312724
13.611887       12.564819       13.013562       13.163143       13.013562      
15.406861       13.163143       13.462306
12.714400       14.359793       12.863981       11.816913       12.564819      
13.163143
Distribution 2:
11.966494       10.769844       13.462306       11.816913       12.863981      
13.312724       11.368169       15.706023
11.667331       12.116075       14.958117       11.966494       11.816913      
15.706023       12.714400       12.714400
14.658955       12.714400       13.611887       14.958117       13.911049      
10.769844       15.556442       12.116075
13.462306       10.470682       13.611887       13.761468       14.060630      
14.658955
Continuous fail subsequences lengths:
1:      2623 (99.885758%)
2:      3 (0.114242%)

(3)

All requests: 7349
Fails: 385
Percent failed: 5,23881
Distribution 1:
3.265750        6.123282        4.082188        3.265750        4.898626       
7.347938        4.082188        5.306844
9.797251        3.673969        7.756157        12.246564       8.572595       
6.939720        5.715063        5.306844
6.531501        3.673969        2.449313        4.490407        4.898626       
6.123282        2.449313        2.857532
6.939720        4.898626        4.490407        4.898626        2.041094       
2.041094
Distribution 2:
5.715063        5.306844        6.939720        3.673969        1.632875       
7.756157        7.756157        4.898626
6.531501        8.164376        5.715063        4.082188        4.490407       
4.082188        4.490407        4.490407
7.756157        2.857532        6.123282        4.898626        3.265750       
4.490407        3.673969        6.123282
3.265750        4.082188        7.347938        6.123282        4.082188       
7.347938
Continuous fail subsequences lengths:
1:      385 (100.000000%)

(By fail sequences I mean sequences of numbers of failed attempts in the list of
all attempts.)

Distribution 1-2 are two simple tests I've thought out myself since I
don't
remember how to check sequences for being well-distributed. I think, they show
at least that these fail sequences are distributed. :)
In these tests every element of fail sequence falls into one of 30 clusters if
it meets some criterion (cnum = number of clusters, e = current element, anum
overall number of attempts): anum * i/cnum <= e < anum * (i+1)/cnum for
the i-th
cluster in the first test and e%cnum == i for the i-th cluster in the second
test. Then the percent of the paticular cluster size from the maximum size is
printed.
Maybe distribution tests don't have much sense in the test with unlimited
number
of retries (2).
These tests show at least that the flow of fails is rather constant.

The other thing which is of particular interest (at least in the tests 2 and 3)
is the maximum length of continuous fail subsequences and the number of fail
subsequences of a certain length.

These tests show that the probability of fail is much less in the case of no
delay between attempts.

But this will increase the overall number of attempts during the same time
period if we have retries so it should be better to compare probabilities of
fail in a single get_data call for both cases with retries and with no retries.
The latter is 17,5038% - the same as in the first test.
The former is "Fails"/("All requests" - "Fails")
in the test with unlimited
retries because if there are retries then the number calls will be the same as
the number of successful attempts. So the former = 15,0858%.

These values show that the possibility of a failed attempt per single get_data
call is _less_ if my solution is used. This is also a great addition to the
benefit of having much less useless records in logs!

Tests show that having 2-3 extra attempts would be quite enough. I'd like
not to
have unlimited number of retries because I want to detect if UPS powers off.


What do you think about all this? Please, tell me if you want me to define
something more exactly.

I can say that with this two solutions applied the driver works pretty stable.

-- 
   Alexander

Peter Selinger

2007-Feb-07 21:43 UTC

head link

[Nut-upsdev] some megatec-usb issues

Alexander I. Gordeev wrote:> 
> Hi All,
> 
> I've finally found solutions for my previous problems.
> But since this includes changes in the shared files, I'd like
> to discuss them here.
> 
> 1. Driver restart problem.
> When I start the driver for the first time, everything is ok. But
> when I want to restart it, problems begin. The driver fails to read
> Report descriptor for the second time (libusb_open is used to open
> the device). This is caused by my particular UPS model's allergy. I
> can read the Report descriptor only once and only before reading any
> "special" string descriptors. After that I can do everything else
> without troubles. But if I attempt to read it after reading string
> descriptors than everything stops working (string descriptors
> too). So I have to reset UPS...
> 
> This works perfectly under Windows, as the Report descriptor is
> retrieved only once - by the OS. Program that drives UPS doesn't do
> it.  So the solution is simple: not to retrieve any HID descriptors
> as I don't need them and they cause problems. But I think the best
> time to determine whether to retrieve HID stuff or not is while
> device matching is performed. There could be other devices and
> subdrivers that need these things. My solution is to reserve some
> value that a matcher can return indicating that we don't need
> HID. For example, 2. I looked through other drivers that use
> libusb_open, and their matchers all return 1. (If I'm not right,
> please, tell me). So this decision wouldn't affect other
> drivers. Also I want to note that any matcher should be able to
> prevent retrieving HID descriptors.
I don't like the idea of returning a different value from the matcher,
because it has nothing to do with matching. A better idea would be to
pass an additional boolean parameter to libusb_open, to determine
whether to read the report descriptor or not.
> 2. "UPS No Ack" problem.
> 
> I also have a solution for the problem of the "UPS No Ack"
answers.
This is a driver-internal problem, as it does not require any other
NUT files to be modified. You should implement the solution that works
best for your device, and does not harm to any other devices. I think
it would be reasonable to retry a failed request, say, up to N times,
where N is some preset value (perhaps even user configurable). (You
should not retry it an infinite number of times, because that could
cause the driver to hang.

-- Peter

Arjen de Korte

2007-Feb-10 12:37 UTC

head link

[Nut-upsdev] some megatec-usb issues

>> 2. "UPS No Ack" problem.
>>
>> I also have a solution for the problem of the "UPS No Ack"
answers.
>
> This is a driver-internal problem, as it does not require any other
> NUT files to be modified. You should implement the solution that works
> best for your device, and does not harm to any other devices. I think
> it would be reasonable to retry a failed request, say, up to N times,
> where N is some preset value (perhaps even user configurable). (You
> should not retry it an infinite number of times, because that could
> cause the driver to hang.
I think that we can solve this in a similar (broader) way as in the
server. That is, dstate_dataok() records the last time it was called and
we compare this time to the current time in the driver/main.c. If this is
more than maxage seconds ago (similar to the MAXAGE value defined in
upsd.conf, but now in ups.conf), dstate_datastale() will be called.

In many cases it is much easier to determine that a status poll was
successful, rather than determining the reason why it wasn't (timeout,
communication failure, UPS disconnected). If drivers no longer *must* call
dstate_datastale(), but by letting dstate_dataok() timeout automatically
if it hasn't been called recently, this will provide a more uniform
mechanism among different drivers. You'd only have to call dstate_dataok()
if you decoded a reply from the UPS successfully and do nothing if you
were not.

Of course, if a driver is able to unambiguously determine that a UPS is no
longer available, it would still be possible to call dstate_datastale(),
so the impact to existing drivers should be minimal. The only thing we
need to make sure that no driver is using an internal mechanism that only
calls dstate_dataok() once when coming from a dstate_datastale() situation
(which is not how dstate_dataok/datastale() was intended to be used
anyway).

Best regards, Arjen
-- 
Eindhoven - The Netherlands
Key fingerprint - 66 4E 03 2C 9D B5 CB 9B  7A FE 7E C1 EE 88 BC 57

Possibly Parallel Threads

Search for more apparently analagous threads

Nut upsdev - Feb 2007 - some megatec-usb issues

[Nut-upsdev] some megatec-usb issues

[Nut-upsdev] some megatec-usb issues

[Nut-upsdev] some megatec-usb issues

Possibly Parallel Threads