X2Go Bug report logs - #1313
there is still a problem in getting a correct value for loadavgXX with loadchecker

version graph

Package: x2gobroker-agent; Maintainer for x2gobroker-agent is X2Go Developers <x2go-dev@lists.x2go.org>; Source for x2gobroker-agent is src:x2gobroker.

Reported by: Walid MOGHRABI <w.moghrabi@servicemagic.eu>

Date: Mon, 13 Aug 2018 10:45:02 UTC

Severity: normal

Found in version 0.0.4.0-0~1038~ubuntu16.04.1

Done: Mike Gabriel <mike.gabriel@das-netzwerkteam.de>

Reply or subscribe to this bug.

Toggle useless messages

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to x2go-dev@lists.x2go.org, X2Go Developers <x2go-dev@lists.x2go.org>:
Bug#1313; Package x2gobroker-agent. (Mon, 13 Aug 2018 10:45:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Walid MOGHRABI <w.moghrabi@servicemagic.eu>:
New Bug report received and forwarded. Copy sent to X2Go Developers <x2go-dev@lists.x2go.org>.

Your message specified a Severity: in the pseudo-header, but the severity value bug was not recognised. The default severity normal is being used instead. The recognised values are: critical, grave, important, normal, minor, wishlist.

(Mon, 13 Aug 2018 10:45:02 GMT) Full text and rfc822 format available.


Message #5 received at submit@bugs.x2go.org (full text, mbox):

From: Walid MOGHRABI <w.moghrabi@servicemagic.eu>
To: submit@bugs.x2go.org
Subject: there is still a problem in getting a correct value for loadavgXX with loadchecker
Date: Mon, 13 Aug 2018 12:34:00 +0200 (CEST)
package: x2gobroker-agent 
version: 0.0.4.0-0~1038~ubuntu16.04.1
priority: bug

I don't have a "0" value anymore since latest fixes so the loadchecker process don't crash anymore but still, there is something strange.
Here is a fragment of my loadchecker logs from this morning.
Just to give you the context, I have 22 servers which are all automaticaly started at 6 AM (wake on lan) and they are absolutely the same (blade servers with same CPU, memory amount, bios version, ...).
I checked our monitoring to see if users were correctly distributed over the farm and at 7:30AM, I had about 7 or 8 users connected but 4 of them were on tce-server-21 where I should have had 1 user on 8 servers.
Here is the loadchecker log fragment :

root@tce-manager-01 [~] # grep -B 1 'loadavgXX:1;' /var/log/x2gobroker/loadchecker.log
...
2018-07-24 07:15:01,200 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload'
2018-07-24 07:15:01,622 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23810; numCPU:16; typeCPU:2400; 
--
2018-07-24 07:17:50,354 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload'
2018-07-24 07:17:50,779 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23686; myMemAvail:23812; numCPU:16; typeCPU:2400; 
--
2018-07-24 07:20:32,550 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload'
2018-07-24 07:20:32,964 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809; numCPU:16; typeCPU:2400; 
--
2018-07-24 07:23:21,610 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload'
2018-07-24 07:23:22,034 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811; numCPU:16; typeCPU:2400; 
--
2018-07-24 07:26:03,872 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload'
2018-07-24 07:26:04,286 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809; numCPU:16; typeCPU:2400; 
--
2018-07-24 07:28:52,917 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload'
2018-07-24 07:28:53,338 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809; numCPU:16; typeCPU:2400; 
--
2018-07-24 07:31:35,252 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload'
2018-07-24 07:31:35,670 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811; numCPU:16; typeCPU:2400; 
--
2018-07-24 07:34:24,424 - loadchecker - INFO - Executing agent command on remote host tce-server-21 (10.50.0.221): sh -c '/usr/lib/x2go/x2gobroker-agent foo checkload'
2018-07-24 07:34:24,842 - loadchecker - INFO - Broker agent answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809; numCPU:16; typeCPU:2400; 

As you can see, there is only 1 server with a loadavgXX = 1 (which means that in fact, we got a zero value from the broker agent).
This is not normal, at 7:34, there were 4 users already connected to this server and most of my other servers were empty.
Restarting x2gobroker-loadchecker service fixed the issue.
I think there is a problem in retrieving this informations ... even memAvail seem strange on this server to me ... with 4 connected users, it should have been lower than that.
I also think the number of connected users should be taken into account when calculating the load factor (maybe this is already the case, not sure about that).
---
DISCLAIMER: This e-mail is private and confidential and may contain proprietary or legally privileged information. It is for the intended recipient only. If you have received this email in error, please notify the author by replying to it and then destroy it. If you are not the intended recipient you must not use, disclose, distribute, copy, print or rely on this e-mail or any attachment. Thank you


Information forwarded to x2go-dev@lists.x2go.org, X2Go Developers <x2go-dev@lists.x2go.org>:
Bug#1313; Package x2gobroker-agent. (Thu, 13 Sep 2018 13:20:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mike Gabriel <mike.gabriel@das-netzwerkteam.de>:
Extra info received and forwarded to list. Copy sent to X2Go Developers <x2go-dev@lists.x2go.org>. (Thu, 13 Sep 2018 13:20:03 GMT) Full text and rfc822 format available.

Message #10 received at 1313@bugs.x2go.org (full text, mbox):

From: Mike Gabriel <mike.gabriel@das-netzwerkteam.de>
To: Walid MOGHRABI <w.moghrabi@servicemagic.eu>, 1313@bugs.x2go.org
Subject: Re: [X2Go-Dev] Bug#1313: there is still a problem in getting a correct value for loadavgXX with loadchecker
Date: Thu, 13 Sep 2018 13:19:19 +0000
[Message part 1 (text/plain, inline)]
Hi Walid,

On  Mo 13 Aug 2018 12:34:00 CEST, Walid MOGHRABI wrote:

> package: x2gobroker-agent
> version: 0.0.4.0-0~1038~ubuntu16.04.1
> priority: bug
>
> I don't have a "0" value anymore since latest fixes so the  
> loadchecker process don't crash anymore but still, there is  
> something strange.
> Here is a fragment of my loadchecker logs from this morning.
> Just to give you the context, I have 22 servers which are all  
> automaticaly started at 6 AM (wake on lan) and they are absolutely  
> the same (blade servers with same CPU, memory amount, bios version,  
> ...).
> I checked our monitoring to see if users were correctly distributed  
> over the farm and at 7:30AM, I had about 7 or 8 users connected but  
> 4 of them were on tce-server-21 where I should have had 1 user on 8  
> servers.

Have you seen this issues more often? Does it hop from one server to  
another or occur on more than one server at a time?

> Here is the loadchecker log fragment :
>
> root@tce-manager-01 [~] # grep -B 1 'loadavgXX:1;'  
> /var/log/x2gobroker/loadchecker.log
> ...
> 2018-07-24 07:15:01,200 - loadchecker - INFO - Executing agent  
> command on remote host tce-server-21 (10.50.0.221): sh -c  
> '/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:15:01,622 - loadchecker - INFO - Broker agent  
> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23810;  
> numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:17:50,354 - loadchecker - INFO - Executing agent  
> command on remote host tce-server-21 (10.50.0.221): sh -c  
> '/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:17:50,779 - loadchecker - INFO - Broker agent  
> answered: OK; loadavgXX:1; memAvail:23686; myMemAvail:23812;  
> numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:20:32,550 - loadchecker - INFO - Executing agent  
> command on remote host tce-server-21 (10.50.0.221): sh -c  
> '/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:20:32,964 - loadchecker - INFO - Broker agent  
> answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;  
> numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:23:21,610 - loadchecker - INFO - Executing agent  
> command on remote host tce-server-21 (10.50.0.221): sh -c  
> '/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:23:22,034 - loadchecker - INFO - Broker agent  
> answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;  
> numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:26:03,872 - loadchecker - INFO - Executing agent  
> command on remote host tce-server-21 (10.50.0.221): sh -c  
> '/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:26:04,286 - loadchecker - INFO - Broker agent  
> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;  
> numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:28:52,917 - loadchecker - INFO - Executing agent  
> command on remote host tce-server-21 (10.50.0.221): sh -c  
> '/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:28:53,338 - loadchecker - INFO - Broker agent  
> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;  
> numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:31:35,252 - loadchecker - INFO - Executing agent  
> command on remote host tce-server-21 (10.50.0.221): sh -c  
> '/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:31:35,670 - loadchecker - INFO - Broker agent  
> answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;  
> numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:34:24,424 - loadchecker - INFO - Executing agent  
> command on remote host tce-server-21 (10.50.0.221): sh -c  
> '/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:34:24,842 - loadchecker - INFO - Broker agent  
> answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;  
> numCPU:16; typeCPU:2400;

The log message "Broker agent answered:" comes directly from X2Go  
Broker Agent. It is basically its raw output.

This means, that the flaw must be in x2gobroker-agent.pl on the remote  
X2Go Server. Or that the loadchecker stops querying the broker agent  
and re-uses old data.

Looking at x2gobroker-agent.pl: If we focus on the loadavgXX for now,  
we come to the conclusion, that the load was really "0" or it was  
negative (both gives us a loadavgXX value of "1". The value should  
normally be greater (system load of 1.0 brings a loadavgXX of 100).

Looking at x2gobroker.agent.py: As the values always change slightly,  
we can't say that Python provides us the same return result string all  
the time. The query to the broker agent must have happened.

We need to do more debugging if this issue reoccurs:

  * run '/usr/lib/x2go/x2gobroker-agent foo checkload' on the  
affected X2Go Server
    and see if the reported values match with what the load checker sees.

  * check if it is reoccuring on the same X2Go Server

  * if /usr/lib/x2go/x2gobroker-agent returns a load of zero,
    look at /proc/loadavg

  * and /proc/sys/vm/min_free_kbytes,
    /proc/meminfo
    /proc/cpuinfo

... and report all back here...

> As you can see, there is only 1 server with a loadavgXX = 1 (which  
> means that in fact, we got a zero value from the broker agent).
> This is not normal, at 7:34, there were 4 users already connected to  
> this server and most of my other servers were empty.

> Restarting x2gobroker-loadchecker service fixed the issue.

Considering the above analysis that the issue must come from  
x2gobroker-agent.pl, a restart of the loadchecker can in theory not  
solve such an issue.

Can you see the x2gobroker-agent.pl process appear and disappear in  
the process list on the remote X2Go Server? Or does it stay open, even  
zombied?

> I think there is a problem in retrieving this informations ... even  
> memAvail seem strange on this server to me ... with 4 connected  
> users, it should have been lower than that.

Hmmm... Ok... Maybe the wrong server got tested? Two identical IPs on  
the subnet?

> I also think the number of connected users should be taken into  
> account when calculating the load factor (maybe this is already the  
> case, not sure about that).

Yes, we take the number of sessions into account. But that is not  
provided by the broker agent, but is available in the X2Go Server  
database and queried from there.

Mike
-- 

DAS-NETZWERKTEAM
mike gabriel, herweg 7, 24357 fleckeby
mobile: +49 (1520) 1976 148
landline: +49 (4354) 8390 139

GnuPG Fingerprint: 9BFB AEE8 6C0A A5FF BF22  0782 9AF4 6B30 2577 1B31
mail: mike.gabriel@das-netzwerkteam.de, http://das-netzwerkteam.de

[Message part 2 (application/pgp-signature, inline)]

Information forwarded to x2go-dev@lists.x2go.org, X2Go Developers <x2go-dev@lists.x2go.org>:
Bug#1313; Package x2gobroker-agent. (Fri, 14 Dec 2018 15:05:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mike Gabriel <mike.gabriel@das-netzwerkteam.de>:
Extra info received and forwarded to list. Copy sent to X2Go Developers <x2go-dev@lists.x2go.org>. (Fri, 14 Dec 2018 15:05:03 GMT) Full text and rfc822 format available.

Message #15 received at 1313@bugs.x2go.org (full text, mbox):

From: Mike Gabriel <mike.gabriel@das-netzwerkteam.de>
To: 1313@bugs.x2go.org
Cc: Walid MOGHRABI <w.moghrabi@servicemagic.eu>
Subject: Re: [X2Go-Dev] Bug#1313: Bug#1313: there is still a problem in getting a correct value for loadavgXX with loadchecker
Date: Fri, 14 Dec 2018 15:02:52 +0000
[Message part 1 (text/plain, inline)]
Control: close -1

On  Do 13 Sep 2018 15:19:19 CEST, Mike Gabriel wrote:

> Hi Walid,
>
> On  Mo 13 Aug 2018 12:34:00 CEST, Walid MOGHRABI wrote:
>
>> package: x2gobroker-agent
>> version: 0.0.4.0-0~1038~ubuntu16.04.1
>> priority: bug
>>
>> I don't have a "0" value anymore since latest fixes so the  
>> loadchecker process don't crash anymore but still, there is  
>> something strange.
>> Here is a fragment of my loadchecker logs from this morning.
>> Just to give you the context, I have 22 servers which are all  
>> automaticaly started at 6 AM (wake on lan) and they are absolutely  
>> the same (blade servers with same CPU, memory amount, bios version,  
>> ...).
>> I checked our monitoring to see if users were correctly distributed  
>> over the farm and at 7:30AM, I had about 7 or 8 users connected but  
>> 4 of them were on tce-server-21 where I should have had 1 user on 8  
>> servers.
>
> Have you seen this issues more often? Does it hop from one server to  
> another or occur on more than one server at a time?
>
>> Here is the loadchecker log fragment :
>>
>> root@tce-manager-01 [~] # grep -B 1 'loadavgXX:1;'  
>> /var/log/x2gobroker/loadchecker.log
>> ...
>> 2018-07-24 07:15:01,200 - loadchecker - INFO - Executing agent  
>> command on remote host tce-server-21 (10.50.0.221): sh -c  
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:15:01,622 - loadchecker - INFO - Broker agent  
>> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23810;  
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:17:50,354 - loadchecker - INFO - Executing agent  
>> command on remote host tce-server-21 (10.50.0.221): sh -c  
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:17:50,779 - loadchecker - INFO - Broker agent  
>> answered: OK; loadavgXX:1; memAvail:23686; myMemAvail:23812;  
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:20:32,550 - loadchecker - INFO - Executing agent  
>> command on remote host tce-server-21 (10.50.0.221): sh -c  
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:20:32,964 - loadchecker - INFO - Broker agent  
>> answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;  
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:23:21,610 - loadchecker - INFO - Executing agent  
>> command on remote host tce-server-21 (10.50.0.221): sh -c  
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:23:22,034 - loadchecker - INFO - Broker agent  
>> answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;  
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:26:03,872 - loadchecker - INFO - Executing agent  
>> command on remote host tce-server-21 (10.50.0.221): sh -c  
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:26:04,286 - loadchecker - INFO - Broker agent  
>> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;  
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:28:52,917 - loadchecker - INFO - Executing agent  
>> command on remote host tce-server-21 (10.50.0.221): sh -c  
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:28:53,338 - loadchecker - INFO - Broker agent  
>> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;  
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:31:35,252 - loadchecker - INFO - Executing agent  
>> command on remote host tce-server-21 (10.50.0.221): sh -c  
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:31:35,670 - loadchecker - INFO - Broker agent  
>> answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;  
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:34:24,424 - loadchecker - INFO - Executing agent  
>> command on remote host tce-server-21 (10.50.0.221): sh -c  
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:34:24,842 - loadchecker - INFO - Broker agent  
>> answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;  
>> numCPU:16; typeCPU:2400;
>
> The log message "Broker agent answered:" comes directly from X2Go  
> Broker Agent. It is basically its raw output.
>
> This means, that the flaw must be in x2gobroker-agent.pl on the  
> remote X2Go Server. Or that the loadchecker stops querying the  
> broker agent and re-uses old data.
>
> Looking at x2gobroker-agent.pl: If we focus on the loadavgXX for  
> now, we come to the conclusion, that the load was really "0" or it  
> was negative (both gives us a loadavgXX value of "1". The value  
> should normally be greater (system load of 1.0 brings a loadavgXX of  
> 100).
>
> Looking at x2gobroker.agent.py: As the values always change  
> slightly, we can't say that Python provides us the same return  
> result string all the time. The query to the broker agent must have  
> happened.
>
> We need to do more debugging if this issue reoccurs:
>
>   * run '/usr/lib/x2go/x2gobroker-agent foo checkload' on the  
> affected X2Go Server
>     and see if the reported values match with what the load checker sees.
>
>   * check if it is reoccuring on the same X2Go Server
>
>   * if /usr/lib/x2go/x2gobroker-agent returns a load of zero,
>     look at /proc/loadavg
>
>   * and /proc/sys/vm/min_free_kbytes,
>     /proc/meminfo
>     /proc/cpuinfo
>
> ... and report all back here...
>
>> As you can see, there is only 1 server with a loadavgXX = 1 (which  
>> means that in fact, we got a zero value from the broker agent).
>> This is not normal, at 7:34, there were 4 users already connected  
>> to this server and most of my other servers were empty.
>
>> Restarting x2gobroker-loadchecker service fixed the issue.
>
> Considering the above analysis that the issue must come from  
> x2gobroker-agent.pl, a restart of the loadchecker can in theory not  
> solve such an issue.
>
> Can you see the x2gobroker-agent.pl process appear and disappear in  
> the process list on the remote X2Go Server? Or does it stay open,  
> even zombied?
>
>> I think there is a problem in retrieving this informations ... even  
>> memAvail seem strange on this server to me ... with 4 connected  
>> users, it should have been lower than that.
>
> Hmmm... Ok... Maybe the wrong server got tested? Two identical IPs  
> on the subnet?
>
>> I also think the number of connected users should be taken into  
>> account when calculating the load factor (maybe this is already the  
>> case, not sure about that).
>
> Yes, we take the number of sessions into account. But that is not  
> provided by the broker agent, but is available in the X2Go Server  
> database and queried from there.
>
> Mike

Request from Walid on IRC. Not an issue anymore.

Thus, closing...

Mike
-- 

DAS-NETZWERKTEAM
mike gabriel, herweg 7, 24357 fleckeby
mobile: +49 (1520) 1976 148
landline: +49 (4354) 8390 139

GnuPG Fingerprint: 9BFB AEE8 6C0A A5FF BF22  0782 9AF4 6B30 2577 1B31
mail: mike.gabriel@das-netzwerkteam.de, http://das-netzwerkteam.de

[Message part 2 (application/pgp-signature, inline)]

Marked Bug as done Request was from Mike Gabriel <mike.gabriel@das-netzwerkteam.de> to 1313-submit@bugs.x2go.org. (Fri, 14 Dec 2018 15:05:03 GMT) Full text and rfc822 format available.

Notification sent to Walid MOGHRABI <w.moghrabi@servicemagic.eu>:
Bug acknowledged by developer. (Fri, 14 Dec 2018 15:05:04 GMT) Full text and rfc822 format available.

Send a report that this bug log contains spam.


X2Go Developers <owner@bugs.x2go.org>. Last modified: Tue Dec 18 10:49:26 2018; Machine Name: ymir.das-netzwerkteam.de

X2Go Bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.