Thread: Mysterious performance degradation in exceptional cases
We have a C-written application server which uses ESQL/C on top of
PostgreSQL 13.1 on Linux. The application in question always serves the
same search in a librarian database; the search is given to the server as
commands over the network: a login into the application followed by the
search itself:

   SLNPServerInit
   User:zfl
   SLNPEndCommand

   SLNPSearch
   HitListName:Zfernleihe
   Search:1000=472214284
   SLNPEndCommand

To fulfill the search, the application server has to do some 100 ESQL/C
calls, and all this should not take longer than 1-2 seconds; normally it
does not. But in some situations -- about 10% of the cases -- it takes
longer than 180 seconds. The other 90% are below 2 seconds, i.e. the
behaviour is binary: either under 2 seconds or more than 180 seconds,
with no values in between.

We can easily simulate the above with a small shell script that just sends
over the above two commands with 'netcat' and throws away the result (the
real search is done by an inter-library loan software which has a timeout
of 180 seconds to wait for the SLNPSearch result -- that is how we learned
about the problem at all, because all this runs automatically with no user
dialogs). The idea of the simulated search was to find out from the ESQL/C
log files which operation takes so long and why.

For some days now, primarily to catch the situation in the act, we have
been sending this simulated search every 10 seconds, and since then the
problem has gone away entirely.

The Linux server where all this is running is highly equipped with memory
and CPUs and is 99% idle.

The fact that the problem went away with our test search every 10 seconds
makes me think of something like "since we keep the PostgreSQL server busy
that way, it has no chance to go into some kind of deeper sleep" (for
example being swapped out, or whatever).

Any ideas about this?

	matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
On 9/14/22 01:31, Matthias Apitz wrote:
>
> We have a C-written application server which uses ESQL/C on top of
> PostgreSQL 13.1 on Linux. The application in question always serves the
> same search in a librarian database; the search is given to the server
> as commands over the network: a login into the application followed by
> the search itself:

[...]

> We can easily simulate the above with a small shell script that just
> sends over the above two commands with 'netcat' and throws away the
> result (the real search is done by an inter-library loan software which
> has a timeout of 180 seconds to wait for the SLNPSearch result -- that
> is how we learned about the problem at all, because all this runs
> automatically with no user dialogs). The idea of the simulated search
> was to find out from the ESQL/C log files which operation takes so long
> and why.

Does the test search run the inter-library loan software?

> For some days now, primarily to catch the situation in the act, we have
> been sending this simulated search every 10 seconds, and since then the
> problem has gone away entirely.

To be clear, the problem went away for the real search?

Where is the inter-library software -- in your application, or are you
reaching out to another application?

Is the search running across a remote network?

> The Linux server where all this is running is highly equipped with
> memory and CPUs and is 99% idle.
>
> The fact that the problem went away with our test search every 10
> seconds makes me think of something like "since we keep the PostgreSQL
> server busy that way, it has no chance to go into some kind of deeper
> sleep" (for example being swapped out, or whatever).
>
> Any ideas about this?
>
> 	matthias

--
Adrian Klaver
adrian.klaver@aklaver.com
Matthias Apitz <guru@unixarea.de> writes:
> To fulfill the search, the application server has to do some 100 ESQL/C
> calls, and all this should not take longer than 1-2 seconds; normally it
> does not. But in some situations -- about 10% of the cases -- it takes
> longer than 180 seconds. The other 90% are below 2 seconds, i.e. the
> behaviour is binary: either under 2 seconds or more than 180 seconds,
> with no values in between.

I'm wondering about a plan change. Can you install auto_explain on the
server and capture info about the fast and slow cases?

			regards, tom lane
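For reference, this suggestion can be tried without touching the
application; a minimal sketch run as superuser, where the 1-second
threshold is an illustrative choice and not a value from the thread:

   LOAD 'auto_explain';   -- make the auto_explain.* settings known in this session
   -- preload auto_explain into every new session and log the plan of any
   -- statement that runs longer than 1 second
   ALTER SYSTEM SET session_preload_libraries = 'auto_explain';
   ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
   SELECT pg_reload_conf();
   -- plans of slow statements then appear in the PostgreSQL server log
   -- for sessions started after the reload
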
On Wednesday, September 14, 2022 at 07:19:31 a.m. -0700, Adrian Klaver wrote:

> On 9/14/22 01:31, Matthias Apitz wrote:

[...]

> Does the test search run the inter-library loan software?

The real picture is:

   ILL-software --(network, search command)---> app-server --(ESQL/C)--> PostgreSQL-server
   test search  --(localhost, search command)-> app-server --(ESQL/C)--> PostgreSQL-server

> > For some days now, primarily to catch the situation in the act, we have
> > been sending this simulated search every 10 seconds, and since then the
> > problem has gone away entirely.
>
> To be clear, the problem went away for the real search?

Yes. Since the 'test search' runs every 10 seconds, the above pictured
'ILL-software', doing the same search, does not face the problem anymore.

> Where is the inter-library software -- in your application, or are you
> reaching out to another application?

The above 'app-server' fulfills the search requested by the 'ILL-software'
(or the 'test search'), i.e. it looks up one single librarian record (one
row in the PostgreSQL database) and delivers it to the 'ILL-software'. The
requests from the 'ILL-software' are not heavy duty, more or less 50
requests per day.

> Is the search running across a remote network?

The real search comes over the network through a stunnel. But we watched
the incoming search and the response of the 'app-server' locally with
tcpdump. In the case of the timeout, the 'app-server' does not answer
within 180 seconds, i.e. it does not send anything into the stunnel, and
the remote 'ILL-software' terminates the connection with an F-packet.

I will now:

- shut down the test search running every 10 seconds, to see if the
  problem re-appears
- set 'log_autovacuum_min_duration = 0' in postgresql.conf, to see if
  the times of the problem match

Thanks for your feedback in any case.

	matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
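The autovacuum logging step can also be done from psql instead of editing
postgresql.conf by hand; a small sketch (superuser, no restart needed):

   ALTER SYSTEM SET log_autovacuum_min_duration = 0;  -- log every autovacuum/autoanalyze run
   SELECT pg_reload_conf();                           -- picked up on SIGHUP, no restart
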
On 9/14/22 22:33, Matthias Apitz wrote:
> On Wednesday, September 14, 2022 at 07:19:31 a.m. -0700, Adrian Klaver wrote:
>
>> Where is the inter-library software -- in your application, or are you
>> reaching out to another application?
>
> The above 'app-server' fulfills the search requested by the 'ILL-software'
> (or the 'test search'), i.e. it looks up one single librarian record (one
> row in the PostgreSQL database) and delivers it to the 'ILL-software'. The
> requests from the 'ILL-software' are not heavy duty, more or less 50
> requests per day.
>
>> Is the search running across a remote network?
>
> The real search comes over the network through a stunnel. But we watched
> the incoming search and the response of the 'app-server' locally with
> tcpdump. In the case of the timeout, the 'app-server' does not answer
> within 180 seconds, i.e. it does not send anything into the stunnel, and
> the remote 'ILL-software' terminates the connection with an F-packet.

The 'app-server' does not answer, but does the database not answer either?

Have you looked to see whether the database is providing a response in a
timely manner and whether that response is getting 'lost' in the
'app-server'?

Also, have you considered Tom Lane's suggestion of using auto_explain?

> I will now:
>
> - shut down the test search running every 10 seconds, to see if the
>   problem re-appears
> - set 'log_autovacuum_min_duration = 0' in postgresql.conf, to see if
>   the times of the problem match
>
> Thanks for your feedback in any case.
>
> 	matthias

--
Adrian Klaver
adrian.klaver@aklaver.com
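Whether the database has already answered, or is still working (and on
what), can be checked at the moment a search hangs; a sketch against
pg_stat_activity, where the datname filter is an assumption based on the
'sisis' database shown later in the thread:

   SELECT pid,
          state,                           -- 'active' = still executing, 'idle' = already answered
          wait_event_type, wait_event,     -- e.g. Lock/relation would point at a lock conflict
          now() - query_start AS running_for,
          left(query, 60)     AS current_query
   FROM   pg_stat_activity
   WHERE  datname = 'sisis'                -- assumed database name
     AND  state <> 'idle';
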
On Thursday, September 15, 2022 at 08:40:24 a.m. -0700, Adrian Klaver wrote:

> On 9/14/22 22:33, Matthias Apitz wrote:

[...]

> The 'app-server' does not answer, but does the database not answer either?
>
> Have you looked to see whether the database is providing a response in a
> timely manner and whether that response is getting 'lost' in the
> 'app-server'?

I do not "think" that the time is spent elsewhere in the 'app-server'. To
turn that "thinking" into a fact, I enabled the tracing of our dblayer,
which shows how many milliseconds are spent between entering the dblayer
and returning the result to the application layers. This has been in
effect for 36 hours now. Since September 13 the problem has not shown up
anymore. We are waiting for it...

> Also, have you considered Tom Lane's suggestion of using auto_explain?

I'm afraid that this would affect all other applications using the same
server. We still have other options to use. When the above tracing shows
a result, i.e. which of the high-level operations of the dblayer uses more
time (more milliseconds) than normal, the next level is the detailed
logging of the ESQL/C operations (to which we added time stamps, which
this logging normally does not have in its source code).

Thanks

	matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
On 9/18/22 02:30, Matthias Apitz wrote:
> On Thursday, September 15, 2022 at 08:40:24 a.m. -0700, Adrian Klaver wrote:
>
>> The 'app-server' does not answer, but does the database not answer either?
>>
>> Have you looked to see whether the database is providing a response in a
>> timely manner and whether that response is getting 'lost' in the
>> 'app-server'?
>
> I do not "think" that the time is spent elsewhere in the 'app-server'. To
> turn that "thinking" into a fact, I enabled the tracing of our dblayer,
> which shows how many milliseconds are spent between entering the dblayer
> and returning the result to the application layers. This has been in
> effect for 36 hours now. Since September 13 the problem has not shown up
> anymore. We are waiting for it...

Is this with or without your every-10-seconds 'ping' search program?

>> Also, have you considered Tom Lane's suggestion of using auto_explain?
>
> I'm afraid that this would affect all other applications using the same
> server. We still have other options to use. When the above tracing shows
> a result, i.e. which of the high-level operations of the dblayer uses more
> time (more milliseconds) than normal, the next level is the detailed
> logging of the ESQL/C operations (to which we added time stamps, which
> this logging normally does not have in its source code).

You can load auto_explain per session as shown here:

https://www.postgresql.org/docs/current/auto-explain.html
F.4.2. Example

--
Adrian Klaver
adrian.klaver@aklaver.com
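The per-session recipe from section F.4.2 of the documentation comes down
to three statements issued on the connection in question (superuser is
required for LOAD); the final query below is the documentation's own
example:

   LOAD 'auto_explain';
   SET auto_explain.log_min_duration = 0;   -- log the plan of every statement in this session
   SET auto_explain.log_analyze = true;     -- include actual row counts and timings

   -- example query from the documentation; its plan is written to the server log
   SELECT count(*)
   FROM pg_class, pg_index
   WHERE oid = indrelid AND indisunique;
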
On Sunday, September 18, 2022 at 07:47:32 a.m. -0700, Adrian Klaver wrote:

> On 9/18/22 02:30, Matthias Apitz wrote:

[...]

> Is this with or without your every-10-seconds 'ping' search program?

The 'ping' search was stopped on September 16, 7:45 CEST. The 'ping'
search never showed the problem.

>>> Also, have you considered Tom Lane's suggestion of using auto_explain?
>>
>> I'm afraid that this would affect all other applications using the same
>> server. ...
>
> You can load auto_explain per session as shown here:

Every connect from the ILL forks a new 'app-server' session which creates
a new ESQL/C session, and all of this occurs at random times, whenever the
ILL wants to check whether a book title is available in that library. In
short: no way.

Thanks

	matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
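If all of those short-lived sessions connect as one dedicated database
role, auto_explain could instead be attached to that role, so nothing has
to be changed per session; a sketch only, with a purely illustrative role
name and threshold:

   LOAD 'auto_explain';   -- make the auto_explain.* parameters known in this session
   -- run once by a superuser; every future login of this role preloads
   -- auto_explain and logs plans of statements taking longer than 1 second
   ALTER ROLE sisis_app SET session_preload_libraries = 'auto_explain';
   ALTER ROLE sisis_app SET auto_explain.log_min_duration = '1s';
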
On 9/18/22 09:36, Matthias Apitz wrote:
>> You can load auto_explain per session as shown here:
>
> Every connect from the ILL forks a new 'app-server' session which creates
> a new ESQL/C session, and all of this occurs at random times, whenever the
> ILL wants to check whether a book title is available in that library. In
> short: no way.

How is it random? Are you saying that the ESQL/C session does not 'know'
that the search is coming from the ILL connection?

--
Adrian Klaver
adrian.klaver@aklaver.com
On Wednesday, September 14, 2022 at 10:31:26 +0200, Matthias Apitz wrote:

> We have a C-written application server which uses ESQL/C on top of
> PostgreSQL 13.1 on Linux. ...
>
> To fulfill the search, the application server has to do some 100 ESQL/C
> calls, and all this should not take longer than 1-2 seconds; normally it
> does not. But in some situations -- about 10% of the cases -- it takes
> longer than 180 seconds. The other 90% are below 2 seconds, i.e. the
> behaviour is binary: either under 2 seconds or more than 180 seconds,
> with no values in between.
> ...

After weeks of silence, the problem has shown up again. I was wrong in
saying "some 100 ESQL/C calls", because I had forgotten that the main
initialization is done only once, in the master daemon of the application
server; every such connect spawns a new short-lived child, and this child
executes only 16 calls to the DBLAYER:

DBCEXIT|N|sisisinst|RDIR|4|21.10.2022 06:08:25:594
DBCEXIT|N|adm_partab|RDIR|2|21.10.2022 06:08:25:597
DBCEXIT|N|adm_partab|RDIR|1|21.10.2022 06:08:25:598
DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:599
DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:600
DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:601
DBCEXIT|N|adm_partab|RDIR|0|21.10.2022 06:08:25:602
DBCEXIT|N|d12termfkt|RWHR|1|21.10.2022 06:08:25:630
DBCEXIT|N|d12termfkt|RWHR|2|21.10.2022 06:08:25:632
DBCEXIT|N|d601biblpar|RALL|3|21.10.2022 06:08:25:636
DBCEXIT|N|d06vw|RDIR|1|21.10.2022 06:08:25:637
DBCEXIT|N|titel_worte|RDIR|13|21.10.2022 06:08:25:660
DBCEXIT|N|titel_worte|FETCH|0|21.10.2022 06:08:25:660
DBCEXIT|N|d14zweigsicht|RWHR|1|21.10.2022 06:08:25:662
DBCEXIT|W|titel_daten|COUNT|183875|21.10.2022 06:11:29:538
                            ^^^^^^

'RDIR' means a SELECT with a key, ...; 'COUNT' executes a

   SELECT count(*) FROM titel_daten

The 5th column gives the duration in/below the DBLAYER in milliseconds --
in the above case 183875, i.e. 183 seconds. The table in question,
titel_daten, has 5830370 rows. I ran the count just now with psql and it
takes only about a second:

sisis=# select count(*) from titel_daten;
  count
---------
 5827383
(1 row)

What could be the reason for 183 seconds, sometimes but not always?

Thanks

	matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
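When a slow run can be caught, comparing it with a fast run of the same
statement shows whether the time is really spent executing inside
PostgreSQL; a sketch using the table named in the log above:

   EXPLAIN (ANALYZE, BUFFERS)
   SELECT count(*) FROM titel_daten;
   -- a fast run should report an execution time of roughly the one second
   -- seen in psql; a 180 s run would point at time spent (or waited on)
   -- inside the database rather than in the 'app-server'
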
On Fri, 2022-10-21 at 09:48 +0200, Matthias Apitz wrote:
> > all this should not take longer than 1-2 seconds; normally it does
> > not. But in some situations -- about 10% of the cases -- it takes
> > longer than 180 seconds.
>
> DBCEXIT|N|sisisinst|RDIR|4|21.10.2022 06:08:25:594
> ...
> DBCEXIT|N|d14zweigsicht|RWHR|1|21.10.2022 06:08:25:662
> DBCEXIT|W|titel_daten|COUNT|183875|21.10.2022 06:11:29:538
>                             ^^^^^^
>
> 'RDIR' means a SELECT with a key, ...; 'COUNT' executes a
>
>    SELECT count(*) FROM titel_daten
>
> The 5th column gives the duration in/below the DBLAYER in milliseconds --
> in the above case 183875, i.e. 183 seconds. The table in question,
> titel_daten, has 5830370 rows. I ran the count just now with psql and it
> takes only about a second:
>
> sisis=# select count(*) from titel_daten;
>   count
> ---------
>  5827383
> (1 row)
>
> What could be the reason for 183 seconds, sometimes but not always?

The most obvious idea is a conflict with a statement that takes an ACCESS
EXCLUSIVE lock, like TRUNCATE or ALTER TABLE. But there are other
possibilities, like network problems or a problem inside your application.

To narrow that down, set "log_min_duration_statement" and possibly
activate "auto_explain", so that you can verify whether the long delay is
due to long processing time inside the database.

Yours,
Laurenz Albe

--
Cybertec | https://www.cybertec-postgresql.com
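A sketch of the two logging settings mentioned here, plus a query to see
directly who blocks whom while a count(*) is hanging (the 1-second
threshold is illustrative, not a value from the thread):

   -- log every statement slower than 1 s and every lock wait longer than
   -- deadlock_timeout (default 1 s); superuser, picked up on reload
   ALTER SYSTEM SET log_min_duration_statement = '1s';
   ALTER SYSTEM SET log_lock_waits = on;
   SELECT pg_reload_conf();

   -- while a search hangs: which sessions are blocked, and by whom?
   SELECT pid,
          pg_blocking_pids(pid) AS blocked_by,
          wait_event_type, wait_event,
          left(query, 60)       AS query
   FROM   pg_stat_activity
   WHERE  cardinality(pg_blocking_pids(pid)) > 0;
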