Thread: Issues Scaling Postgres Concurrency

Issues Scaling Postgres Concurrency

From
Harrison Borges
Date:
Hello everyone.

I’m running into severe performance problems with Postgres as I increase the number of concurrent requests against my backend. I’ve identified that the bottleneck is Postgres, and to simplify the test case, I created an endpoint that only does a count query on a table with ~500k rows. At 5 concurrent users, the response time was 33ms, at 10 users it was 60ms, and at 20 users it was 120ms.

As the number of concurrent users increases, the response time for the count query also increases significantly, indicating that Postgres may not be scaling well to handle the increasing load. 

This manifests in essentially a server meltdown on production. As the concurrent requests stack up, our server is stuck waiting for more and more queries. Eventually requests begin timing out as they start taking over 30 seconds to respond.

Am I doing something obviously wrong? Does this sound like normal behavior? I'm not very experienced at DB ops so I'm not 100% sure what to expect here, but I have worked as a Software Engineer for over 10 years and I've not encountered problems like this before.

I would appreciate any insights or advice on how to optimize Postgres for high concurrency scenarios. Thank you in advance for your help!

Re: Issues Scaling Postgres Concurrency

From
Adrian Klaver
Date:
On 3/13/23 9:24 AM, Harrison Borges wrote:
> Hello everyone.
> 
> I’m running into severe performance problems with Postgres as I increase 
> the number of concurrent requests against my backend. I’ve identified 
> that the bottleneck is Postgres, and to simplify the test case, I 
> created an endpoint that only does a count query on a table with ~500k 
> rows. At 5 concurrent users, the response time was 33ms, at 10 users it 
> was 60ms, and at 20 users it was 120ms.
> 
> As the number of concurrent users increases, the response time for the 
> count query also increases significantly, indicating that Postgres may 
> not be scaling well to handle the increasing load.
> 
> This manifests in essentially a server meltdown on production. As the 
> concurrent requests stack up, our server is stuck waiting for more and 
> more queries. Eventually requests begin timing out as they start taking 
> over 30 seconds to respond.
> 
> Am I doing something obviously wrong? Does this sound like normal 
> behavior? I'm not very experienced at DB ops so I'm not 100% sure what 
> to expect here, but I have worked as a Software Engineer for over 10 
> years and I've not encountered problems like this before.

1) https://wiki.postgresql.org/wiki/Slow_Counting

2) Are you using connection pooling?

> 
> I would appreciate any insights or advice on how to optimize Postgres 
> for high concurrency scenarios. Thank you in advance for your help!


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: Issues Scaling Postgres Concurrency

From
Alan Hodgson
Date:
On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote:
Hello everyone.

I’m running into severe performance problems with Postgres as I increase the number of concurrent requests against my backend. I’ve identified that the bottleneck is Postgres, and to simplify the test case, I created an endpoint that only does a count query on a table with ~500k rows. At 5 concurrent users, the response time was 33ms, at 10 users it was 60ms, and at 20 users it was 120ms.

I'm no expert on high concurrency, but for something this simple I'd expect that you're just CPU bottlenecked. Count in PostgreSQL actually has to read all the rows in the table. And yeah you can't do too many of them at the same time.

Re: Issues Scaling Postgres Concurrency

From
Laurenz Albe
Date:
On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote:
> I’m running into severe performance problems with Postgres as I increase the number
> of concurrent requests against my backend. I’ve identified that the bottleneck is
> Postgres, and to simplify the test case, I created an endpoint that only does a
> count query on a table with ~500k rows. At 5 concurrent users, the response time
> was 33ms, at 10 users it was 60ms, and at 20 users it was 120ms.
>
> As the number of concurrent users increases, the response time for the count query
> also increases significantly, indicating that Postgres may not be scaling well to
> handle the increasing load. 
>
> This manifests in essentially a server meltdown on production. As the concurrent
> requests stack up, our server is stuck waiting for more and more queries.
> Eventually requests begin timing out as they start taking over 30 seconds to respond.
>
> Am I doing something obviously wrong? Does this sound like normal behavior?

That sounds like quite normal and expected behavior.

A query that counts the number of rows in a table of half a million rows is
quite expensive and keeps a CPU core busy for a while (provided everything is
cached). At some degree of parallelism, your CPU is overloaded, which leads
to non-linear slowdown.

The thing you are doing wrong is that you are putting too much load on this
system.

Yours,
Laurenz Albe



Re: Issues Scaling Postgres Concurrency

From
"Peter J. Holzer"
Date:
On 2023-03-14 22:47:43 +0100, Laurenz Albe wrote:
> On Mon, 2023-03-13 at 12:24 -0400, Harrison Borges wrote:
> > I’m running into severe performance problems with Postgres as I
> > increase the number of concurrent requests against my backend. I’ve
> > identified that the bottleneck is Postgres, and to simplify the test
> > case, I created an endpoint that only does a count query on a table
> > with ~500k rows. At 5 concurrent users, the response time was 33ms,
> > at 10 users it was 60ms, and at 20 users it was 120ms.
[...]
> > This manifests in essentially a server meltdown on production. As
> > the concurrent requests stack up, our server is stuck waiting for
> > more and more queries. Eventually requests begin timing out as they
> > start taking over 30 seconds to respond.
> >
> > Am I doing something obviously wrong? Does this sound like normal
> > behavior?
>
> That sounds like quite normal and expected behavior.
>
> A query that counts the number of rows in a table of half a million
> rows is quite expensive and keeps a CPU core busy for a while
> (provided everything is cached). At some degree of parallelism, your
> CPU is overloaded, which leads to non-linear slowdown.

The slowdown looks pretty linear to me (6ms per user).

The interesting thing is that on my laptop even two concurrent accesses
cause a 100% slowdown. I think this is because the task is actually
memory-bound: The cores may do the counting in parallel, but they have
to read the data from the same RAM (since it's too large to fit in the
CPU cache) and they have to take turns accessing it.

> The thing you are doing wrong is that you are putting too much load on
> this system.

Or possibly counting stuff far more often than necessary. If an exact
count is necessary more frequently than it changes it is probably a good
idea to store that somewhere and update it in a trigger.

(If the count doesn't have to be totally up-to-date, caching it in the
application may be even better.)

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Issues Scaling Postgres Concurrency

From
Tom Lane
Date:
"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> On 2023-03-14 22:47:43 +0100, Laurenz Albe wrote:
>> A query that counts the number of rows in a table of half a million
>> rows is quite expensive and keeps a CPU core busy for a while
>> (provided everything is cached). At some degree of parallelism, your
>> CPU is overloaded, which leads to non-linear slowdown.

> The interesting thing is that on my laptop even two concurrent accesses
> cause a 100% slowdown. I think this is because the task is actually
> memory-bound: The cores may do the counting in parallel, but they have
> to read the data from the same RAM (since it's too large to fit in the
> CPU cache) and they have to take turns accessing it.

I wondered if synchronize_seqscans might exacerbate this behavior by
encouraging more cores to be touching the same buffers at the same time.
I couldn't measure much difference between having it on vs. off in a
pgbench test with all the clients counting the same table ... but maybe
on other hardware the effect would show up.

            regards, tom lane