Thread: Issues Scaling Postgres Concurrency
Hello everyone.
I’m running into severe performance problems with Postgres as I increase the number of concurrent requests against my backend. I’ve identified that the bottleneck is Postgres, and to simplify the test case, I created an endpoint that only does a count query on a table with ~500k rows. At 5 concurrent users, the response time was 33ms, at 10 users it was 60ms, and at 20 users it was 120ms.
As the number of concurrent users increases, the response time for the count query also increases significantly, indicating that Postgres may not be scaling well to handle the increasing load.
As the number of concurrent users increases, the response time for the count query also increases significantly, indicating that Postgres may not be scaling well to handle the increasing load.
This manifests in essentially a server meltdown on production. As the concurrent requests stack up, our server is stuck waiting for more and more queries. Eventually requests begin timing out as they start taking over 30 seconds to respond.
Am I doing something obviously wrong? Does this sound like normal behavior? I'm not very experienced at DB ops so I'm not 100% sure what to expect here, but I have worked as a Software Engineer for over 10 years and I've not encountered problems like this before.
I would appreciate any insights or advice on how to optimize Postgres for high concurrency scenarios. Thank you in advance for your help!
On 3/13/23 9:24 AM, Harrison Borges wrote: > Hello everyone. > > I’m running into severe performance problems with Postgres as I increase > the number of concurrent requests against my backend. I’ve identified > that the bottleneck is Postgres, and to simplify the test case, I > created an endpoint that only does a count query on a table with ~500k > rows. At 5 concurrent users, the response time was 33ms, at 10 users it > was 60ms, and at 20 users it was 120ms. > > As the number of concurrent users increases, the response time for the > count query also increases significantly, indicating that Postgres may > not be scaling well to handle the increasing load. > > This manifests in essentially a server meltdown on production. As the > concurrent requests stack up, our server is stuck waiting for more and > more queries. Eventually requests begin timing out as they start taking > over 30 seconds to respond. > > Am I doing something obviously wrong? Does this sound like normal > behavior? I'm not very experienced at DB ops so I'm not 100% sure what > to expect here, but I have worked as a Software Engineer for over 10 > years and I've not encountered problems like this before. 1) https://wiki.postgresql.org/wiki/Slow_Counting 2) Are you using connection pooling? > > I would appreciate any insights or advice on how to optimize Postgres > for high concurrency scenarios. Thank you in advance for your help! -- Adrian Klaver adrian.klaver@aklaver.com
On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote:
Hello everyone.I’m running into severe performance problems with Postgres as I increase the number of concurrent requests against my backend. I’ve identified that the bottleneck is Postgres, and to simplify the test case, I created an endpoint that only does a count query on a table with ~500k rows. At 5 concurrent users, the response time was 33ms, at 10 users it was 60ms, and at 20 users it was 120ms.
I'm no expert on high concurrency, but for something this simple I'd expect that you're just CPU bottlenecked. Count in PostgreSQL actually has to read all the rows in the table. And yeah you can't do too many of them at the same time.
On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote: > I’m running into severe performance problems with Postgres as I increase the number > of concurrent requests against my backend. I’ve identified that the bottleneck is > Postgres, and to simplify the test case, I created an endpoint that only does a > count query on a table with ~500k rows. At 5 concurrent users, the response time > was 33ms, at 10 users it was 60ms, and at 20 users it was 120ms. > > As the number of concurrent users increases, the response time for the count query > also increases significantly, indicating that Postgres may not be scaling well to > handle the increasing load. > > This manifests in essentially a server meltdown on production. As the concurrent > requests stack up, our server is stuck waiting for more and more queries. > Eventually requests begin timing out as they start taking over 30 seconds to respond. > > Am I doing something obviously wrong? Does this sound like normal behavior? That sounds like quite normal and expected behavior. A query that counts the number of rows in a table of half a million rows is quite expensive and keeps a CPU core busy for a while (provided everything is cached). At some degree of parallelism, your CPU is overloaded, which leads to non-linear slowdown. The thing you are doing wrong is that you are putting too much load on this system. Yours, Laurenz Albe
On 2023-03-14 22:47:43 +0100, Laurenz Albe wrote: > On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote: > > I’m running into severe performance problems with Postgres as I > > increase the number of concurrent requests against my backend. I’ve > > identified that the bottleneck is Postgres, and to simplify the test > > case, I created an endpoint that only does a count query on a table > > with ~500k rows. At 5 concurrent users, the response time was 33ms, > > at 10 users it was 60ms, and at 20 users it was 120ms. [...] > > This manifests in essentially a server meltdown on production. As > > the concurrent requests stack up, our server is stuck waiting for > > more and more queries. Eventually requests begin timing out as they > > start taking over 30 seconds to respond. > > > > Am I doing something obviously wrong? Does this sound like normal > > behavior? > > That sounds like quite normal and expected behavior. > > A query that counts the number of rows in a table of half a million > rows is quite expensive and keeps a CPU core busy for a while > (provided everything is cached). At some degree of parallelism, your > CPU is overloaded, which leads to non-linear slowdown. The slowdown looks pretty linear to me (6ms per user). The interesting thing is that on my laptop even two concurrent accesses cause a 100% slowdown. I think this is because the task is actually memory-bound: The cores may do the counting in parallel, but they have to read the data from the same RAM (since it's too large to fit in the CPU cache) and they have to take turns accessing it. > The thing you are doing wrong is that you are putting too much load on > this system. Or possibly counting stuff far more often than necessary. If an exact count is necessary more frequently than it changes it is probably a good idea to store that somewhere and update it in a trigger. (If the count doesn't have to be totally up-to-date, caching it in the application may be even better.) hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes: > On 2023-03-14 22:47:43 +0100, Laurenz Albe wrote: >> A query that counts the number of rows in a table of half a million >> rows is quite expensive and keeps a CPU core busy for a while >> (provided everything is cached). At some degree of parallelism, your >> CPU is overloaded, which leads to non-linear slowdown. > The interesting thing is that on my laptop even two concurrent accesses > cause a 100% slowdown. I think this is because the task is actually > memory-bound: The cores may do the counting in parallel, but they have > to read the data from the same RAM (since it's too large to fit in the > CPU cache) and they have to take turns accessing it. I wondered if synchronize_seqscans might exacerbate this behavior by encouraging more cores to be touching the same buffers at the same time. I couldn't measure much difference between having it on vs. off in a pgbench test with all the clients counting the same table ... but maybe on other hardware the effect would show up. regards, tom lane