Thread: Mysterious performance degradation in exceptional cases

Mysterious performance degradation in exceptional cases

From: Matthias Apitz
We have an application server written in C which uses ESQL/C on top
of PostgreSQL 13.1 on Linux. The application in question always serves
the same search in a library database, given to the server
as commands over the network: a login into the application followed by
a search:

SLNPServerInit
User:zfl
SLNPEndCommand

SLNPSearch
HitListName:Zfernleihe
Search:1000=472214284
SLNPEndCommand

To fulfill the search, the application server has to make some 100
ESQL/C calls, and all this should not take longer than 1-2 seconds;
normally it does not. But in some situations, in about 10% of the cases,
it takes longer than 180 seconds. The other 90% are below 2 seconds,
i.e. the behaviour is binary: either around 2 seconds or more than
180 seconds, with no values in between.

We can easily simulate the above with a small shell script that just sends
the above two commands with 'netcat' and throws away the result (the real
search is done by inter-library loan software which waits at most
180 seconds for the SLNPSearch result -- that is how we learned about the
problem at all, because all of this runs automatically with no user
dialogs). The idea of the simulated search was to find out from the
ESQL/C log files which operation takes so long, and why.

Well, for some days now, primarily to catch the situation in the act, we
have been sending this simulated search every 10 seconds, and since then
the problem has gone away entirely.

The Linux server where all this runs is generously equipped with memory
and CPUs and is 99% idle.

The fact that the problem went away once our test search ran every
10 seconds makes me think along the lines of "since we keep the
PostgreSQL server busy that way, it has no chance to go into some kind
of deeper sleep" (for example being swapped out, or whatever).

Any ideas about this?

    matthias
-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Re: Mysterious performance degradation in exceptional cases

From: Adrian Klaver
On 9/14/22 01:31, Matthias Apitz wrote:
> 
> We have a C-written application server which uses ESQL/C on top
> of PostgreSQL 13.1 on Linux. The application in question always serves
> the same search in a librarian database, given to the server
> as commands over the network, login into the application and doing
> a search:
> 
> SLNPServerInit
> User:zfl
> SLNPEndCommand
> 
> SLNPSearch
> HitListName:Zfernleihe
> Search:1000=472214284
> SLNPEndCommand
> 
> To fulfill the search, the application server has to do some 100
> ESQL/C calls and all this should not take longer than 1-2 seconds, and
> normally it does not take longer. But, in some situations it takes
> longer than 180 seconds, in 10% of the cases. The other 90% are below 2 seconds,
> i.e. this is digital: Or 2 seconds, or more than 180 seconds, no values between.
> 
> We can easily simulate the above with a small shell script just sending over
> the above two commands with 'netcat' and throwing away its result (the real search is
> done by an inter library loan software which has an timeout of 180 seconds
> to wait for the SLNPSearch search result -- that's why we got to know
> about the problem at all, because all this is running automagically with
> no user dialogs). The idea of the simulated search was to get to know
> with the ESQL/C log files which operation takes so long and why.

Does the test search run the inter library loan software?

> 
> Well, since some day, primary to catch the situation, we send over every
> 10 seconds this simulated searches and since then the problem went away at all.

To be clear, the problem went away for the real search?

Where is the inter library software, in your application or are you 
reaching out to another application?

Is the search running across a remote network?

> 
> The Linux server where all this is running is highly equipped with memory and CPUs
> and 99% idle.
> 
> The picture, that the problem went away with our test search every 10 seconds,
> let me think in something like "since we keep the PostgreSQL server busy
> that way it has not chance to go into some kind of deeper sleep" (for
> example being swapped out or whatever).
> 
> Any ideas about this?
> 
>     matthias


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Mysterious performance degradation in exceptional cases

From: Tom Lane
Matthias Apitz <guru@unixarea.de> writes:
> To fulfill the search, the application server has to do some 100
> ESQL/C calls and all this should not take longer than 1-2 seconds, and
> normally it does not take longer. But, in some situations it takes
> longer than 180 seconds, in 10% of the cases. The other 90% are below 2 seconds,
> i.e. this is digital: Or 2 seconds, or more than 180 seconds, no values between.

I'm wondering about a plan change.  Can you install auto_explain on
the server and capture info about the fast and slow cases?

            regards, tom lane



Re: Mysterious performance degradation in exceptional cases

From: Matthias Apitz
On Wednesday, September 14, 2022 at 07:19:31 a.m. -0700, Adrian Klaver wrote:

> On 9/14/22 01:31, Matthias Apitz wrote:
> > 
> > We have a C-written application server which uses ESQL/C on top
> > of PostgreSQL 13.1 on Linux. The application in question always serves
> > the same search in a librarian database, given to the server
> > as commands over the network, login into the application and doing
> > a search:
> > 
> > SLNPServerInit
> > User:zfl
> > SLNPEndCommand
> > 
> > SLNPSearch
> > HitListName:Zfernleihe
> > Search:1000=472214284
> > SLNPEndCommand
> > 
> > To fulfill the search, the application server has to do some 100
> > ESQL/C calls and all this should not take longer than 1-2 seconds, and
> > normally it does not take longer. But, in some situations it takes
> > longer than 180 seconds, in 10% of the cases. The other 90% are below 2 seconds,
> > i.e. this is digital: Or 2 seconds, or more than 180 seconds, no values between.
> > 
> > We can easily simulate the above with a small shell script just sending over
> > the above two commands with 'netcat' and throwing away its result (the real search is
> > done by an inter library loan software which has an timeout of 180 seconds
> > to wait for the SLNPSearch search result -- that's why we got to know
> > about the problem at all, because all this is running automagically with
> > no user dialogs). The idea of the simulated search was to get to know
> > with the ESQL/C log files which operation takes so long and why.
> 
> Does the test search run the inter library loan software?

The real picture is:

  ILL-software --(network, search command)---> app-server --(ESQL/C)--> PostgreSQL-server
  test search  --(localhost, search command)-> app-server --(ESQL/C)--> PostgreSQL-server

> > Well, since some day, primary to catch the situation, we send over every
> > 10 seconds this simulated searches and since then the problem went away at all.
> 
> To be clear the problem went away for the real search?

Yes, since the 'test search' runs every 10 seconds, the 'ILL-software'
pictured above, doing the same search, no longer faces the problem.

> 
> Where is the inter library software, in your application or are you reaching
> out to another application?

The 'app-server' above fulfills the search requested by the
'ILL-software' (or the 'test search'), i.e. it looks up one single
library record (one row in the PostgreSQL database) and delivers
it to the 'ILL-software'. The requests from the 'ILL-software' are not
heavy duty, more or less 50 requests per day.

> Is the search running across a remote network?

The real search comes over the network through a stunnel. But we
watched the incoming search and the 'app-server' response locally with
tcpdump. In the timeout case, the 'app-server' does not answer within
180 seconds, i.e. it does not send anything into the stunnel, and the
remote 'ILL-software' terminates the connection with a FIN packet.

I will now:

- shut down the test search that runs every 10 secs, to see if the
  problem re-appears;
- set 'log_autovacuum_min_duration = 0' in postgresql.conf, to see if
  the times of the autovacuum runs match the times of the problem
  (one way to apply the setting is sketched below).
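
A minimal sketch of applying that setting from psql instead of editing
postgresql.conf by hand (assuming a superuser connection; this parameter
only needs a configuration reload, not a restart):

  -- Log every autovacuum/autoanalyze run, regardless of how long it took.
  ALTER SYSTEM SET log_autovacuum_min_duration = 0;
  -- Make the running server pick up the change.
  SELECT pg_reload_conf();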

Thanks for your feedback in any case.

    matthias

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Re: Mysterious performance degradation in exceptional cases

From: Adrian Klaver
On 9/14/22 22:33, Matthias Apitz wrote:
> On Wednesday, September 14, 2022 at 07:19:31 a.m. -0700, Adrian Klaver wrote:
> 
>> On 9/14/22 01:31, Matthias Apitz wrote:

>> Where is the inter library software, in your application or are you reaching
>> out to another application?
> 
> The above 'app-server' fulfills the search requested by the
> 'ILL-software' (or the 'test search'), i.e. looks up for one single
> librarian record (one row in the PostgreSQL database) and delivers
> it to the 'ILL-software'. The request from the 'ILL-software' is not
> a heavy duty, more or less 50 requests per day.
> 
>> Is the search running across a remote network?
> 
> The real search comes over the network through a stunnel. But we
> watched with tcpdump the incoming search and the response by the
> 'app-server' locally. In the case of the timeout, the 'app-server' does not
> answer within 180 seconds, i.e. does not send anything into the stunnel,
> and the remote 'ILL-software' terminates the connection with an F-packet.

The 'app-server' does not answer, but does the database also fail to answer?

Have you looked to see if the database is providing a response in a 
timely manner and if it is getting 'lost' in the 'app-server'?

Also have you considered Tom Lane's suggestion of using auto_explain?

> 
> I will now:
> 
> - shutdown the test search every 10 secs to see if the problem re-appears
> - set 'log_autovacuum_min_duration = 0' in postgresql.conf to see if
>    the times of the problem matches;
> 
> Thanks for your feedback in any case.
> 
>     matthias
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Mysterious performance degradation in exceptional cases

From: Matthias Apitz
On Thursday, September 15, 2022 at 08:40:24 a.m. -0700, Adrian Klaver wrote:

> On 9/14/22 22:33, Matthias Apitz wrote:
> > On Wednesday, September 14, 2022 at 07:19:31 a.m. -0700, Adrian Klaver wrote:
> > 
> > > On 9/14/22 01:31, Matthias Apitz wrote:
> 
> > > Where is the inter library software, in your application or are you reaching
> > > out to another application?
> > 
> > The above 'app-server' fulfills the search requested by the
> > 'ILL-software' (or the 'test search'), i.e. looks up for one single
> > librarian record (one row in the PostgreSQL database) and delivers
> > it to the 'ILL-software'. The request from the 'ILL-software' is not
> > a heavy duty, more or less 50 requests per day.
> > 
> > > Is the search running across a remote network?
> > 
> > The real search comes over the network through a stunnel. But we
> > watched with tcpdump the incoming search and the response by the
> > 'app-server' locally. In the case of the timeout, the 'app-server' does not
> > answer within 180 seconds, i.e. does not send anything into the stunnel,
> > and the remote 'ILL-software' terminates the connection with an F-packet.
> 
> The 'app-server' does not answer, but does the database not answer also?
> 
> Have you looked to see if the database is providing a response in a timely
> manner and if it is getting 'lost' in the 'app-server'?

I do not "think" that the time is spent elsewhere in the 'app-server'.
to make a fact of the "thinking", I enabled the tracing of our dblayer
showing how many milliseconds have been spent between entering the dblayer and
returning the result to the application layers. This is in effect since
36 hours now. Since September 13 the problem has not showed up anymore.
We are waiting for it...

> Also have you considered Tom Lane's suggestion of using auto_explain?

I'm afraid that this would affect all other applications using the same
server. We still have other options. When the above tracing shows
a result, i.e. which of the high-level operations of the dblayer uses
more time (more milliseconds) than normal, the next step is the detailed
logging of the ESQL/C operations (to which we added time stamps that this
logging normally does not have in its source code).

Thanks

    matthias

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub




Re: Mysterious performance degradation in exceptional cases

From: Adrian Klaver
On 9/18/22 02:30, Matthias Apitz wrote:
> On Thursday, September 15, 2022 at 08:40:24 a.m. -0700, Adrian Klaver wrote:
> 
>> On 9/14/22 22:33, Matthias Apitz wrote:
>>> On Wednesday, September 14, 2022 at 07:19:31 a.m. -0700, Adrian Klaver wrote:
>>>
>>>> On 9/14/22 01:31, Matthias Apitz wrote:
>>

>> The 'app-server' does not answer, but does the database not answer also?
>>
>> Have you looked to see if the database is providing a response in a timely
>> manner and if it is getting 'lost' in the 'app-server'?
> 
> I do not "think" that the time is spent elsewhere in the 'app-server'.
> to make a fact of the "thinking", I enabled the tracing of our dblayer
> showing how many milliseconds have been spent between entering the dblayer and
> returning the result to the application layers. This is in effect since
> 36 hours now. Since September 13 the problem has not showed up anymore.
> We are waiting for it...

Is this with or without your every 10 sec 'ping' search program?

> 
>> Also have you considered Tom Lane's suggestion of using auto_explain?
> 
> I'm afraid that this would affect all other applications using the same
> server. We still have other options to use. When the above tracing shows
> a result, i.e. which of the high level operations of the dblayer uses
> more time (more milliseconds) than normal, the next level is the detailed logging
> of the ESQL/C operations (where we added time stamps which normaly this
> logging does not have in its source code).

You can load auto_explain per session as shown here:

https://www.postgresql.org/docs/current/auto-explain.html

F.4.2. Example
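
In rough outline, a per-session setup along the lines of that example
(the threshold here is only a placeholder to adjust; LOAD of the library
typically requires superuser rights):

  -- Load auto_explain into the current session only.
  LOAD 'auto_explain';
  -- Log the plan of every statement in this session that runs longer than 2 seconds.
  SET auto_explain.log_min_duration = '2s';
  -- Optionally record actual timings and row counts per plan node (adds overhead).
  SET auto_explain.log_analyze = on;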

> 
> Thanks
> 
>     matthias
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Mysterious performance degradation in exceptional cases

From: Matthias Apitz
On Sunday, September 18, 2022 at 07:47:32 a.m. -0700, Adrian Klaver wrote:

> On 9/18/22 02:30, Matthias Apitz wrote:
> > On Thursday, September 15, 2022 at 08:40:24 a.m. -0700, Adrian Klaver wrote:
> > 
> > > On 9/14/22 22:33, Matthias Apitz wrote:
> > > > On Wednesday, September 14, 2022 at 07:19:31 a.m. -0700, Adrian Klaver wrote:
> > > > 
> > > > > On 9/14/22 01:31, Matthias Apitz wrote:
> > > 
> 
> > > The 'app-server' does not answer, but does the database not answer also?
> > > 
> > > Have you looked to see if the database is providing a response in a timely
> > > manner and if it is getting 'lost' in the 'app-server'?
> > 
> > I do not "think" that the time is spent elsewhere in the 'app-server'.
> > to make a fact of the "thinking", I enabled the tracing of our dblayer
> > showing how many milliseconds have been spent between entering the dblayer and
> > returning the result to the application layers. This is in effect since
> > 36 hours now. Since September 13 the problem has not showed up anymore.
> > We are waiting for it...
> 
> Is this with or without your every 10 sec 'ping' search program?

The 'ping' search was stopped on September 16, 7:45 CEST. The 'ping' search
never showed the problem.

> > > Also have you considered Tom Lane's suggestion of using auto_explain?
> > 
> > I'm afraid that this would affect all other applications using the same
> > server. We still have other options to use. When the above tracing shows
> > a result, i.e. which of the high level operations of the dblayer uses
> > more time (more milliseconds) than normal, the next level is the detailed logging
> > of the ESQL/C operations (where we added time stamps which normaly this
> > logging does not have in its source code).
> 
> You can load auto_explain per session as shown here:

Every connection from the ILL forks a new 'app-server' session which
creates a new ESQL/C session, and all this occurs at random times,
whenever the ILL wants to check whether a book title is available in
that library. In short: no way.

Thanks

    matthias

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub




Re: Mysterious performance degradation in exceptional cases

From: Adrian Klaver
On 9/18/22 09:36, Matthias Apitz wrote:

>> You can load auto_explain per session as shown here:
> 
> Every connect from the ILL forks a new 'app-server' session which
> creates a new ESQL/C session, and all this occur randomly when the ILL
> wants to search for a book title if this is available in that library.
> In short: no way.

How is it random?

Are you saying that the ESQL/C session does not 'know' the search is 
coming from the ILL connection?

> 
> Thanks
> 
>     matthias
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Mysterious performance degradation in exceptional cases

From: Matthias Apitz
On Wednesday, September 14, 2022 at 10:31:26 +0200, Matthias Apitz wrote:

> 
> We have a C-written application server which uses ESQL/C on top
> of PostgreSQL 13.1 on Linux. The application in question always serves
> the same search in a librarian database, given to the server
> as commands over the network, login into the application and doing
> a search:
> 
> SLNPServerInit
> User:zfl
> SLNPEndCommand
> 
> SLNPSearch
> HitListName:Zfernleihe
> Search:1000=472214284
> SLNPEndCommand
> 
> To fulfill the search, the application server has to do some 100
> ESQL/C calls and all this should not take longer than 1-2 seconds, and
> normally it does not take longer. But, in some situations it takes
> longer than 180 seconds, in 10% of the cases. The other 90% are below 2 seconds,
> i.e. this is digital: Or 2 seconds, or more than 180 seconds, no values between.
> 
> ...

After weeks of silence, the problem showed up again. I was wrong in saying
"some 100 ESQL/C calls", because I had forgotten that the main
initialization is done only once, in the master daemon of the application
server; every such connection spawns a new short-lived child, and this
child executes only 16 calls to the DBLAYER:

DBCEXIT|N|sisisinst|RDIR|4|21.10.2022 06:08:25:594
DBCEXIT|N|adm_partab|RDIR|2|21.10.2022 06:08:25:597
DBCEXIT|N|adm_partab|RDIR|1|21.10.2022 06:08:25:598
DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:599
DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:600
DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:601
DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:602
DBCEXIT|N|d12termfkt|RWHR|1|21.10.2022 06:08:25:630
DBCEXIT|N|d12termfkt|RWHR|2|21.10.2022 06:08:25:632
DBCEXIT|N|d601biblpar|RALL|3|21.10.2022 06:08:25:636
DBCEXIT|N|d06vw|RDIR|1|21.10.2022 06:08:25:637
DBCEXIT|N|titel_worte|RDIR|13|21.10.2022 06:08:25:660
DBCEXIT|N|titel_worte|FETCH|0|21.10.2022 06:08:25:660
DBCEXIT|N|d14zweigsicht|RWHR|1|21.10.2022 06:08:25:662
DBCEXIT|W|titel_daten|COUNT|183875|21.10.2022 06:11:29:538
                            ^^^^^^

'RDIR' means a SELECT with a key, ... 'COUNT' executes a 

SELECT count(*) from titel_daten

The 5th column gives the duration spent in/below the DBLAYER in
milliseconds, in the above case 183875, i.e. about 183 seconds. The table
in question, titel_daten, has 5830370 rows. I ran the same query just now
with psql and it takes only about a second:

sisis=# select count(*) from titel_daten;
  count  
---------
 5827383
(1 row)

What could be a reason for 183 seconds, sometimes but not always?
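
A sketch of one way to dig in the next time it happens, run by hand in
psql (note that EXPLAIN ANALYZE really executes the query; BUFFERS adds
counters for the pages that had to be read):

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM titel_daten;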

Thanks

    matthias
 

-- 
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub



Re: Mysterious performance degradation in exceptional cases

From: Laurenz Albe
On Fri, 2022-10-21 at 09:48 +0200, Matthias Apitz wrote:
> > all this should not take longer than 1-2 seconds, and
> > normally it does not take longer. But, in some situations it takes
> > longer than 180 seconds, in 10% of the cases.
> 
> DBCEXIT|N|sisisinst|RDIR|4|21.10.2022 06:08:25:594
> DBCEXIT|N|adm_partab|RDIR|2|21.10.2022 06:08:25:597
> DBCEXIT|N|adm_partab|RDIR|1|21.10.2022 06:08:25:598
> DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:599
> DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:600
> DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:601
> DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:602
> DBCEXIT|N|d12termfkt|RWHR|1|21.10.2022 06:08:25:630
> DBCEXIT|N|d12termfkt|RWHR|2|21.10.2022 06:08:25:632
> DBCEXIT|N|d601biblpar|RALL|3|21.10.2022 06:08:25:636
> DBCEXIT|N|d06vw|RDIR|1|21.10.2022 06:08:25:637
> DBCEXIT|N|titel_worte|RDIR|13|21.10.2022 06:08:25:660
> DBCEXIT|N|titel_worte|FETCH|0|21.10.2022 06:08:25:660
> DBCEXIT|N|d14zweigsicht|RWHR|1|21.10.2022 06:08:25:662
> DBCEXIT|W|titel_daten|COUNT|183875|21.10.2022 06:11:29:538
>                             ^^^^^^
> 
> 'RDIR' means a SELECT with a key, ... 'COUNT' executes a 
> 
> SELECT count(*) from titel_daten
> 
> The 5th column gives the duration in/below the DBLAYER in milliseconds, in
> the above case 183875, i.e. 183 seconds. The table in question titel_daten
> has 5830370 rows. I did it right now with psql and it takes a second
> only:
> 
> sisis=# select count(*) from titel_daten;
>   count  
> ---------
>  5827383
> (1 row)
> 
> What could be a reason for 183 seconds, sometimes not always?

The most obvious idea is a lock conflict with a statement that takes an
ACCESS EXCLUSIVE lock, such as TRUNCATE or ALTER TABLE.
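
A sketch of how to check for that while such a query hangs (assuming a
superuser psql session can be opened at that moment):

  -- Show sessions that are blocked by someone else's lock, and who blocks them.
  SELECT pid, state, wait_event_type, wait_event,
         pg_blocking_pids(pid) AS blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0;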

But there are other possibilities, like network problems or a problem
inside your application.

To narrow that down, set "log_min_duration_statement" and possibly activate
"auto_explain", so that you can verify if the long delay is due to long
processing time inside the database.
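
For example (a sketch; the two-second threshold is an arbitrary value to
adjust, and ALTER SYSTEM needs superuser rights):

  -- Log every statement that runs longer than 2 seconds, cluster-wide.
  ALTER SYSTEM SET log_min_duration_statement = '2s';
  -- A configuration reload is enough for this parameter.
  SELECT pg_reload_conf();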

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com