From mike.gabriel@das-netzwerkteam.de  Thu Sep 13 15:19:31 2018
Received: (at 1313) by bugs.x2go.org; 13 Sep 2018 13:19:34 +0000
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on
	ymir.das-netzwerkteam.de
X-Spam-Level: 
X-Spam-Status: No, score=-1.1 required=3.0 tests=BAYES_00,RDNS_NONE,
	URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.1
Received: from localhost (localhost [127.0.0.1])
	by ymir.das-netzwerkteam.de (Postfix) with ESMTP id 61B915DAEE
	for <1313@bugs.x2go.org>; Thu, 13 Sep 2018 15:19:31 +0200 (CEST)
X-Virus-Scanned: Debian amavisd-new at ymir.das-netzwerkteam.de
Received: from ymir.das-netzwerkteam.de ([127.0.0.1])
	by localhost (ymir.das-netzwerkteam.de [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id 0qh0wqBXy-6B for <1313@bugs.x2go.org>;
	Thu, 13 Sep 2018 15:19:25 +0200 (CEST)
Received: from fregna.das-netzwerkteam.de (unknown [IPv6:2a01:4f8:202:1381::1])
	by ymir.das-netzwerkteam.de (Postfix) with ESMTPS id 5AB025DA81
	for <1313@bugs.x2go.org>; Thu, 13 Sep 2018 15:19:25 +0200 (CEST)
Received: from grimnir.das-netzwerkteam.de (grimnir.das-netzwerkteam.de [IPv6:2a01:4f8:202:1381::105])
	by fregna.das-netzwerkteam.de (Postfix) with ESMTPS id 2A76D604F6;
	Thu, 13 Sep 2018 13:19:25 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1])
	by grimnir.das-netzwerkteam.de (Postfix) with ESMTP id 26029C268C;
	Thu, 13 Sep 2018 15:19:25 +0200 (CEST)
X-Virus-Scanned: Debian amavisd-new at grimnir.das-netzwerkteam.de
Received: from grimnir.das-netzwerkteam.de ([127.0.0.1])
	by localhost (grimnir.das-netzwerkteam.de [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id 0wxeRlyXxcMo; Thu, 13 Sep 2018 15:19:19 +0200 (CEST)
Received: from das-netzwerkteam.de (localhost [127.0.0.1])
	by grimnir.das-netzwerkteam.de (Postfix) with ESMTPS id 6C76BC25C2;
	Thu, 13 Sep 2018 15:19:19 +0200 (CEST)
Received: from bifrost.das-netzwerkteam.de (bifrost.das-netzwerkteam.de
 [178.62.101.154]) by mail.das-netzwerkteam.de (Horde Framework) with HTTPS;
 Thu, 13 Sep 2018 13:19:19 +0000
Date: Thu, 13 Sep 2018 13:19:19 +0000
Message-ID: <20180913131919.Horde.E_bg6JMBLAcV9jdu5upQ4C5@mail.das-netzwerkteam.de>
From: Mike Gabriel <mike.gabriel@das-netzwerkteam.de>
To: Walid MOGHRABI <w.moghrabi@servicemagic.eu>, 1313@bugs.x2go.org
Subject: Re: [X2Go-Dev] Bug#1313: there is still a problem in getting a
 correct value for loadavgXX with loadchecker
References: <883547417.4000531.1534156346103.JavaMail.root@servicemagic.eu>
 <1342096826.4000684.1534156440651.JavaMail.root@servicemagic.eu>
In-Reply-To: <1342096826.4000684.1534156440651.JavaMail.root@servicemagic.eu>
User-Agent: Horde Application Framework 5
Accept-Language: de,en
Organization: DAS-NETZWERKTEAM
X-Originating-IP: 178.62.101.154
X-Remote-Browser: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Firefox/52.0
Content-Type: multipart/signed; boundary="=_K-6vfqXkWCLYqM978CV7_VO";
 protocol="application/pgp-signature"; micalg=pgp-sha256
MIME-Version: 1.0

This message is in MIME format and has been PGP signed.

--=_K-6vfqXkWCLYqM978CV7_VO
Content-Type: text/plain; charset=utf-8; format=flowed; DelSp=Yes
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi Walid,

On  Mo 13 Aug 2018 12:34:00 CEST, Walid MOGHRABI wrote:

> package: x2gobroker-agent
> version: 0.0.4.0-0~1038~ubuntu16.04.1
> priority: bug
>
> I don't have a "0" value anymore since latest fixes so the=20=20
>=20loadchecker process don't crash anymore but still, there is=20=20
>=20something strange.
> Here is a fragment of my loadchecker logs from this morning.
> Just to give you the context, I have 22 servers which are all=20=20
>=20automaticaly started at 6 AM (wake on lan) and they are absolutely=20=
=20
>=20the same (blade servers with same CPU, memory amount, bios version,=20=
=20
>=20...).
> I checked our monitoring to see if users were correctly distributed=20=20
>=20over the farm and at 7:30AM, I had about 7 or 8 users connected but=20=
=20
>=204 of them were on tce-server-21 where I should have had 1 user on 8=20=
=20
>=20servers.

Have you seen this issues more often? Does it hop from one server to=20=20
another=20or occur on more than one server at a time?

> Here is the loadchecker log fragment :
>
> root@tce-manager-01 [~] # grep -B 1 'loadavgXX:1;'=20=20
>=20/var/log/x2gobroker/loadchecker.log
> ...
> 2018-07-24 07:15:01,200 - loadchecker - INFO - Executing agent=20=20
>=20command on remote host tce-server-21 (10.50.0.221): sh -c=20=20
>=20'/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:15:01,622 - loadchecker - INFO - Broker agent=20=20
>=20answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23810;=20=20
>=20numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:17:50,354 - loadchecker - INFO - Executing agent=20=20
>=20command on remote host tce-server-21 (10.50.0.221): sh -c=20=20
>=20'/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:17:50,779 - loadchecker - INFO - Broker agent=20=20
>=20answered: OK; loadavgXX:1; memAvail:23686; myMemAvail:23812;=20=20
>=20numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:20:32,550 - loadchecker - INFO - Executing agent=20=20
>=20command on remote host tce-server-21 (10.50.0.221): sh -c=20=20
>=20'/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:20:32,964 - loadchecker - INFO - Broker agent=20=20
>=20answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;=20=20
>=20numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:23:21,610 - loadchecker - INFO - Executing agent=20=20
>=20command on remote host tce-server-21 (10.50.0.221): sh -c=20=20
>=20'/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:23:22,034 - loadchecker - INFO - Broker agent=20=20
>=20answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;=20=20
>=20numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:26:03,872 - loadchecker - INFO - Executing agent=20=20
>=20command on remote host tce-server-21 (10.50.0.221): sh -c=20=20
>=20'/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:26:04,286 - loadchecker - INFO - Broker agent=20=20
>=20answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;=20=20
>=20numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:28:52,917 - loadchecker - INFO - Executing agent=20=20
>=20command on remote host tce-server-21 (10.50.0.221): sh -c=20=20
>=20'/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:28:53,338 - loadchecker - INFO - Broker agent=20=20
>=20answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;=20=20
>=20numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:31:35,252 - loadchecker - INFO - Executing agent=20=20
>=20command on remote host tce-server-21 (10.50.0.221): sh -c=20=20
>=20'/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:31:35,670 - loadchecker - INFO - Broker agent=20=20
>=20answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;=20=20
>=20numCPU:16; typeCPU:2400;
> --
> 2018-07-24 07:34:24,424 - loadchecker - INFO - Executing agent=20=20
>=20command on remote host tce-server-21 (10.50.0.221): sh -c=20=20
>=20'/usr/lib/x2go/x2gobroker-agent foo checkload'
> 2018-07-24 07:34:24,842 - loadchecker - INFO - Broker agent=20=20
>=20answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;=20=20
>=20numCPU:16; typeCPU:2400;

The log message "Broker agent answered:" comes directly from X2Go=20=20
Broker=20Agent. It is basically its raw output.

This means, that the flaw must be in x2gobroker-agent.pl on the remote=20=
=20
X2Go=20Server. Or that the loadchecker stops querying the broker agent=20=
=20
and=20re-uses old data.

Looking at x2gobroker-agent.pl: If we focus on the loadavgXX for now,=20=20
we=20come to the conclusion, that the load was really "0" or it was=20=20
negative=20(both gives us a loadavgXX value of "1". The value should=20=20
normally=20be greater (system load of 1.0 brings a loadavgXX of 100).

Looking at x2gobroker.agent.py: As the values always change slightly,=20=20
we=20can't say that Python provides us the same return result string all=20=
=20
the=20time. The query to the broker agent must have happened.

We need to do more debugging if this issue reoccurs:

   * run '/usr/lib/x2go/x2gobroker-agent foo checkload' on the=20=20
affected=20X2Go Server
     and see if the reported values match with what the load checker sees.

   * check if it is reoccuring on the same X2Go Server

   * if /usr/lib/x2go/x2gobroker-agent returns a load of zero,
     look at /proc/loadavg

   * and /proc/sys/vm/min_free_kbytes,
     /proc/meminfo
     /proc/cpuinfo

... and report all back here...

> As you can see, there is only 1 server with a loadavgXX =3D 1 (which=20=
=20
>=20means that in fact, we got a zero value from the broker agent).
> This is not normal, at 7:34, there were 4 users already connected to=20=
=20
>=20this server and most of my other servers were empty.

> Restarting x2gobroker-loadchecker service fixed the issue.

Considering the above analysis that the issue must come from=20=20
x2gobroker-agent.pl,=20a restart of the loadchecker can in theory not=20=20
solve=20such an issue.

Can you see the x2gobroker-agent.pl process appear and disappear in=20=20
the=20process list on the remote X2Go Server? Or does it stay open, even=20=
=20
zombied?

>=20I think there is a problem in retrieving this informations ... even=20=
=20
>=20memAvail seem strange on this server to me ... with 4 connected=20=20
>=20users, it should have been lower than that.

Hmmm... Ok... Maybe the wrong server got tested? Two identical IPs on=20=20
the=20subnet?

> I also think the number of connected users should be taken into=20=20
>=20account when calculating the load factor (maybe this is already the=20=
=20
>=20case, not sure about that).

Yes, we take the number of sessions into account. But that is not=20=20
provided=20by the broker agent, but is available in the X2Go Server=20=20
database=20and queried from there.

Mike
--=20

DAS-NETZWERKTEAM
mike=20gabriel, herweg 7, 24357 fleckeby
mobile: +49 (1520) 1976 148
landline: +49 (4354) 8390 139

GnuPG Fingerprint: 9BFB AEE8 6C0A A5FF BF22  0782 9AF4 6B30 2577 1B31
mail: mike.gabriel@das-netzwerkteam.de, http://das-netzwerkteam.de


--=_K-6vfqXkWCLYqM978CV7_VO
Content-Type: application/pgp-signature
Content-Description: Digitale PGP-Signatur
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIzBAABCAAdFiEEm/uu6GwKpf+/IgeCmvRrMCV3GzEFAluaY9YACgkQmvRrMCV3
GzFtZhAAipZhbsX41foy/JXdcET/KaujkOf+JiXDQ1OUE1YCs7l4+EeMrabu6ehw
hbnznzNAveU0vhwBdQIid7hXuj5G4Ft3wl7KjOG2/1KAvW/oICXcgVeMB35K/m9s
BGxJORX1yUES3pUSO1H14mmTJyghzV9EHHpeKaMQCM94FPaT2f/gfGIDnTnPdPuf
nTU9lNVXfJyOB+Wju/HnDN6Y6yLHO1RzH98ZhOX3Mb0oB6F3kW7AncRfhbmEVUuQ
l0J+/0/BwL2xh9JTP0F4iuE3WJnGNLI0XSwqv6Yjla+tmalU6Wd7JdYBg2spUZE6
X3udnIYOi4WuF+QV0QTh5ab2I3Tlz8WsfMR4Od47+tM0ZIuII6MLnMhY25uvh/Ey
SO6/zrVZfKt7LeXoN4HSC61nBH/SpRzIeeqJEsjEaqfTzLEXL6jVlaEL8Uv/JkFE
rMJhll3hGHhGptcqi/RYqBn8Up8083FNJD+YxaMje/20Z5NsZ3SGlb+tJU2VxtYu
/CZQVixvxceIEyG6nuTM4X+lCpPehxpDUfTEfxWBJ7Vk7MWtntmJZ+7N56tDHZir
s9Wsd5Jrv3JzftKMRte60xpLPNX4RpLNpwlKxoYHUClllHOrr375IjG6SaPcg6io
5P1KLxwMcxN9NG84a4GsWfKXjYW23aideuuF+W1FXiRdpaq2OYg=
=zjur
-----END PGP SIGNATURE-----

--=_K-6vfqXkWCLYqM978CV7_VO--

