From unknown Thu Mar 28 12:37:50 2024
Subject: Bug#1313: [X2Go-Dev] Bug#1313: Bug#1313: there is still a problem in getting a correct value for loadavgXX with loadchecker
Reply-To: Mike Gabriel , 1313@bugs.x2go.org
Resent-From: Mike Gabriel
Resent-To: x2go-dev@lists.x2go.org
Resent-CC: X2Go Developers
Resent-Date: Fri, 14 Dec 2018 15:05:02 +0000
X-X2Go-PR-Message: followup 1313
X-X2Go-PR-Package: x2gobroker-agent
Date: Fri, 14 Dec 2018 15:02:52 +0000
Message-ID: <20181214150252.Horde.byNG7EXbaadzROu-QLl9djD@mail.das-netzwerkteam.de>
From: Mike Gabriel
To: 1313@bugs.x2go.org
Cc: Walid MOGHRABI
References: <883547417.4000531.1534156346103.JavaMail.root@servicemagic.eu>
 <1342096826.4000684.1534156440651.JavaMail.root@servicemagic.eu>
 <20180913131919.Horde.E_bg6JMBLAcV9jdu5upQ4C5@mail.das-netzwerkteam.de>
In-Reply-To: <20180913131919.Horde.E_bg6JMBLAcV9jdu5upQ4C5@mail.das-netzwerkteam.de>
Organization: DAS-NETZWERKTEAM
Content-Type: multipart/signed; boundary="=_6-78xO-KqXvV_VjDhzMr_NN";
 protocol="application/pgp-signature"; micalg=pgp-sha256
MIME-Version: 1.0

This message is in MIME format and has been PGP signed.

--=_6-78xO-KqXvV_VjDhzMr_NN
Content-Type: text/plain; charset=utf-8; format=flowed; DelSp=Yes
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Control: close -1

On Thu 13 Sep 2018 15:19:19 CEST, Mike Gabriel wrote:

> Hi Walid,
>
> On Mon 13 Aug 2018 12:34:00 CEST, Walid MOGHRABI wrote:
>
>> package: x2gobroker-agent
>> version: 0.0.4.0-0~1038~ubuntu16.04.1
>> priority: bug
>>
>> I don't have a "0" value anymore since the latest fixes, so the
>> loadchecker process doesn't crash anymore, but there is still
>> something strange.
>> Here is a fragment of my loadchecker logs from this morning.
>> Just to give you the context, I have 22 servers which are all
>> automatically started at 6 AM (Wake-on-LAN) and they are absolutely
>> the same (blade servers with same CPU, memory amount, BIOS version,
>> ...).
>> I checked our monitoring to see if users were correctly distributed
>> over the farm and at 7:30 AM, I had about 7 or 8 users connected but
>> 4 of them were on tce-server-21 where I should have had 1 user on 8
>> servers.
>
> Have you seen this issue more often? Does it hop from one server to
> another or occur on more than one server at a time?
>
>> Here is the loadchecker log fragment:
>>
>> root@tce-manager-01 [~] # grep -B 1 'loadavgXX:1;'
>> /var/log/x2gobroker/loadchecker.log
>> ...
>> 2018-07-24 07:15:01,200 - loadchecker - INFO - Executing agent
>> command on remote host tce-server-21 (10.50.0.221): sh -c
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:15:01,622 - loadchecker - INFO - Broker agent
>> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23810;
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:17:50,354 - loadchecker - INFO - Executing agent
>> command on remote host tce-server-21 (10.50.0.221): sh -c
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:17:50,779 - loadchecker - INFO - Broker agent
>> answered: OK; loadavgXX:1; memAvail:23686; myMemAvail:23812;
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:20:32,550 - loadchecker - INFO - Executing agent
>> command on remote host tce-server-21 (10.50.0.221): sh -c
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:20:32,964 - loadchecker - INFO - Broker agent
>> answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:23:21,610 - loadchecker - INFO - Executing agent
>> command on remote host tce-server-21 (10.50.0.221): sh -c
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:23:22,034 - loadchecker - INFO - Broker agent
>> answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:26:03,872 - loadchecker - INFO - Executing agent
>> command on remote host tce-server-21 (10.50.0.221): sh -c
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:26:04,286 - loadchecker - INFO - Broker agent
>> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:28:52,917 - loadchecker - INFO - Executing agent
>> command on remote host tce-server-21 (10.50.0.221): sh -c
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:28:53,338 - loadchecker - INFO - Broker agent
>> answered: OK; loadavgXX:1; memAvail:23684; myMemAvail:23809;
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:31:35,252 - loadchecker - INFO - Executing agent
>> command on remote host tce-server-21 (10.50.0.221): sh -c
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:31:35,670 - loadchecker - INFO - Broker agent
>> answered: OK; loadavgXX:1; memAvail:23685; myMemAvail:23811;
>> numCPU:16; typeCPU:2400;
>> --
>> 2018-07-24 07:34:24,424 - loadchecker - INFO - Executing agent
>> command on remote host tce-server-21 (10.50.0.221): sh -c
>> '/usr/lib/x2go/x2gobroker-agent foo checkload'
>> 2018-07-24 07:34:24,842 - loadchecker - INFO - Broker agent
>> answered: OK; loadavgXX:1; memAvail:23683; myMemAvail:23809;
>> numCPU:16; typeCPU:2400;
>
> The log message "Broker agent answered:" comes directly from X2Go
> Broker Agent. It is basically its raw output.
>
> This means that the flaw must be in x2gobroker-agent.pl on the
> remote X2Go Server, or that the loadchecker stops querying the
> broker agent and re-uses old data.
>
> Looking at x2gobroker-agent.pl: If we focus on the loadavgXX for
> now, we come to the conclusion that the load was really "0" or it
> was negative (both give us a loadavgXX value of "1"). The value
> should normally be greater (a system load of 1.0 yields a loadavgXX
> of 100).
>
> Looking at x2gobroker.agent.py: As the values always change
> slightly, we can't say that Python provides us the same return
> result string all the time. The query to the broker agent must have
> happened.
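
To make the reasoning above easier to follow, here is a minimal Python sketch. It is not the code from x2gobroker-agent.pl or from the loadchecker; the function names parse_checkload_answer and load_to_loadavgxx are hypothetical, and the use of rounding is an assumption. It only mirrors what is stated above: the agent answers with a "key:value;" string, a load of 1.0 maps to a loadavgXX of 100, and a load of zero or below ends up as 1.

```python
def parse_checkload_answer(answer):
    """Split a raw 'OK; key:value; ...' agent answer into a dict of ints.

    The field layout is taken from the log lines quoted in this thread;
    this helper itself is hypothetical and not part of X2Go Session Broker.
    """
    fields = [field.strip() for field in answer.split(";") if field.strip()]
    if not fields or fields[0] != "OK":
        raise ValueError("unexpected agent answer: %r" % (answer,))
    values = {}
    for field in fields[1:]:
        key, _, raw = field.partition(":")
        values[key.strip()] = int(raw)
    return values


def load_to_loadavgxx(load):
    """Scale a system load to the loadavgXX range as described above.

    Assumption: the value is rounded to an integer. Zero or negative
    loads are reported as 1, so the result is never 0.
    """
    scaled = round(load * 100)
    return scaled if scaled > 0 else 1


answer = ("OK; loadavgXX:1; memAvail:23684; myMemAvail:23810; "
          "numCPU:16; typeCPU:2400;")
print(parse_checkload_answer(answer)["loadavgXX"])
print(load_to_loadavgxx(1.0), load_to_loadavgxx(0.0))
```

With this mapping, a logged loadavgXX of 1 cannot be told apart from a genuinely idle machine, which is why the raw load on the server itself needs to be checked.
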
>
> We need to do more debugging if this issue reoccurs:
>
>   * run '/usr/lib/x2go/x2gobroker-agent foo checkload' on the
>     affected X2Go Server
>     and see if the reported values match what the load checker sees.
>
>   * check if it is reoccurring on the same X2Go Server
>
>   * if /usr/lib/x2go/x2gobroker-agent returns a load of zero,
>     look at /proc/loadavg
>
>   * and /proc/sys/vm/min_free_kbytes,
>     /proc/meminfo
>     /proc/cpuinfo
>
> ... and report all back here...
>
>> As you can see, there is only 1 server with a loadavgXX = 1 (which
>> means that in fact, we got a zero value from the broker agent).
>> This is not normal: at 7:34, there were 4 users already connected
>> to this server and most of my other servers were empty.
>
>> Restarting the x2gobroker-loadchecker service fixed the issue.
>
> Considering the above analysis that the issue must come from
> x2gobroker-agent.pl, a restart of the loadchecker cannot, in theory,
> solve such an issue.
>
> Can you see the x2gobroker-agent.pl process appear and disappear in
> the process list on the remote X2Go Server? Or does it stay open,
> even zombied?
>
>> I think there is a problem in retrieving this information ... even
>> memAvail seems strange on this server to me ... with 4 connected
>> users, it should have been lower than that.
>
> Hmmm... Ok... Maybe the wrong server got tested? Two identical IPs
> on the subnet?
>
>> I also think the number of connected users should be taken into
>> account when calculating the load factor (maybe this is already the
>> case, not sure about that).
>
> Yes, we take the number of sessions into account. That number is not
> provided by the broker agent, though; it is available in the X2Go
> Server database and queried from there.
>
> Mike

Request from Walid on IRC. Not an issue anymore. Thus, closing...

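
Should the problem come back, the first checks quoted above (comparing what the agent reports with /proc/loadavg on the affected server) could be scripted roughly as below. This is only a sketch under assumptions: parse_proc_loadavg and report_is_plausible are hypothetical helpers, the tolerance is an arbitrary illustration value, and since this thread does not say which of the three load averages the agent uses, the check accepts a match with any of them.

```python
def parse_proc_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return one, five, fifteen


def report_is_plausible(reported_loadavgxx, proc_loadavg_text, tolerance=50):
    """Check an agent-reported loadavgXX against the kernel's load averages.

    Each load average is scaled by 100 with a floor of 1, following the
    mapping described in this thread. The tolerance is a made-up value.
    """
    loads = parse_proc_loadavg(proc_loadavg_text)
    scaled = [max(1, round(load * 100)) for load in loads]
    return any(abs(reported_loadavgxx - value) <= tolerance for value in scaled)


# Example with made-up /proc/loadavg content for a busy server:
print(report_is_plausible(1, "3.80 3.50 3.10 5/400 1234"))
```

On the affected server, the second argument would be the content of /proc/loadavg read at the time the agent was queried.
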
Mike
-- 

DAS-NETZWERKTEAM
mike gabriel, herweg 7, 24357 fleckeby
mobile: +49 (1520) 1976 148
landline: +49 (4354) 8390 139

GnuPG Fingerprint: 9BFB AEE8 6C0A A5FF BF22 0782 9AF4 6B30 2577 1B31
mail: mike.gabriel@das-netzwerkteam.de, http://das-netzwerkteam.de

--=_6-78xO-KqXvV_VjDhzMr_NN
Content-Type: application/pgp-signature
Content-Description: Digital PGP signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIzBAABCAAdFiEEm/uu6GwKpf+/IgeCmvRrMCV3GzEFAlwTxhwACgkQmvRrMCV3
GzGzIw//e1ghRfsUTamOI1luaPrAn3B1rr3XjZhosT76UOay7w6cRmnu5MmAUWFh
rk8JiB/ZSx8ctPSnRJH5jlA0MxPOdpZx9aQDPcL7tg3b0Ty+kNvBy9IpCq+pv2Ek
IsuY2Ta7w+t7k9lvOiFszE0l5OR6PXoqS7l4m5NQYxrge49ABOJYO+Mwi0AUfwuX
xDAO7Is/7vzY9FdsHkGpRCAuOkRN9NbrhLuBnzukmHDTSf+xJ0Q8OoGgxLISIFOw
qZT+IhGtNoJ/BuC2w9m/nEhKVoUvMHy/NHNup40Wf8pDJn/NIKyYXrT2zwd9v0MS
gilLQ4jgJcqJgBYlavP/dJ1yT2O3xLQnHSlaLO3J/TJpLmfHXecGyezn/kfDtwOY
zaX8eGEP7WpxRgQYHUKdCDjLG7IUQZh6ZPPw98/KXZX+KJoFO8MpUMBShC3eX9/M
VuY+Xqi/S5Lx9YN7HdQY8cxDPNKF8PBHyYHkXc93WhtpTvZMJDgXbMsO86rqiIZA
xjiAT/ubUnoM4Yu9JkTSDje36Y0gX5I7zXS/nVCLZjwcZeNBcJ0/1trNHe0Qn45E
veP12UrolDHbtLbxitpz12ExjUvH0SdT/j31E0BRRfHGV+JXsXZuYQvZXOQXQmKJ
UIipivwaTswmN6RnDWmqaku730bMIJHAXm4OIpfLvaelPgT/iOg=
=VZG/
-----END PGP SIGNATURE-----

--=_6-78xO-KqXvV_VjDhzMr_NN--