Thread: work_mem in high transaction rate database

From:
Flavio Henrique Araque Gurgel
Date:

Hello all

On a dedicated server with 16 cores and 16GB of RAM running PostgreSQL 8.2.5 we have a database with basically two kinds of transactions:
- short transactions with a couple of updates and inserts, which run all day;
- batch data loads with hundreds of inserts, which run several times a day;
- one delete of thousands of rows after each batch;
- selects are made when users need reports; low concurrency here.

Today the max_connections is ~2500, where the application is a cluster of JBoss servers with a pool a bit smaller than this total.
work_mem = 1GB
maintenance_work_mem = 1GB
shared_buffers = 4GB

autovacuum takes a long time running on the largest tables (3 large tables out of 50), causing some connections to wait for it to finish before starting transactions again.

I see a few processes (connections) using 10 ~ 20% of total system memory and the others using no more than 1%.

What I want to ask is: is it better to keep the work_mem as high as it is today, or is it safe to try reducing this number, for example to 1 or 2MB, so I can keep the distribution of memory more balanced among all connections?

Thanks!

Flavio Henrique A. Gurgel

From:
Scott Marlowe
Date:

On Tue, Mar 3, 2009 at 5:28 PM, Flavio Henrique Araque Gurgel
<> wrote:
> Hello all
>
> In a dedicated server with 16 cores and 16GB of RAM running PostgreSQL 8.2.5
> we have a database with basically two kinds of transactions:
> - short transactions with a couple of updates and inserts that runs all the
> day;
> - batch data loads with hundreds of inserts that runs several times a day;
> - one delete for thousands of lines after each batch;
> - selects are made when users need reports, low concurrency here.
>
> Today the max_connections is ~2500 where the application is a cluster of
> JBoss servers with a pool a bit smaller than this total.
> work_mem = 1GB
> maintenance_work_mem = 1GB
> shared_buffers = 4GB

Oh my lord, that is a foot gun waiting to go off.  Assuming 2k
connections, and somehow a fair number of them going active with big
sorts, you'd be able to exhaust all physical memory with about 8 to
16 connections.  Lower work_mem now, to something like 1 to 4 MB.  Do
not pass go.  If some oddball query really needs a lot of work_mem,
and benchmarks show a larger work_mem helps, consider raising
the work_mem setting for that one query to something under 1GB (way
under 1GB) that makes it noticeably faster.  Don't allocate more than
a test shows you helps.

> autovacuum takes a lot of time running in the largest tables (3 large tables
> in 50) causing some connections to have to wait for it to finish to start
> transactioning again.

Vacuum does not block transactions, unless you're dropping tables or something.

> I see a few processes (connections) using 10 ~ 20% of total system memory
> and the others using no more then 1%.

This is commonly misread.  It has to do with the vagaries of shared
memory allocation and accounting.  The numbers quite likely don't mean
what you think they mean.  Post the first 20 or so lines from top to
show us.

> What I want to ask is: is it better to keep the work_mem as high as it is
> today or is it a safe bet trying to reduce this number, for example, to 1 or
> 2MB so I can keep the distribution of memory more balanced among all
> connections?

Your work_mem is dangerously high.  Your current reading of top may
not actually support lowering it directly.  Since you've got 4G
shared_buffers allocated, any process that's touched all or most of
shared_buffer memory will show as using 4G of ram.  That's why you
should post output of top, or google on linux virtual memory and top
and what the numbers mean.

Let's say that 1% of your queries can benefit from > 100MB work_mem,
and 5% with 60MB, and 10% with 40MB, and 20% with 20MB, and 30% with
16MB, and 50% with 8MB, and 4MB is enough for everything else to do
well.

If, somehow, 100 queries fired off that could use > 100MB, they
might, with your current settings, use all your memory and start using
swap until swap ran out and they started getting out-of-memory errors
and failing.  This would affect all the other queries on the machine
as well.

OTOH, if you had work_mem limited to 16MB and 100 of those same
queries fired off, they'd individually run a little slower, but they
wouldn't be able to run the machine out of memory.

If your work_mem multiplied by max_connections is greater than some
fraction of memory, you're doing it wrong and setting your machine up
for mysterious, heavy-load failures, the worst kind.
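As a rough sanity check, that rule can be sketched in a few lines (the function name and the choice of numbers are just an illustration using the figures from this thread, not anything from PostgreSQL itself):

```python
# Rough worst-case work_mem exposure: assumes every connection runs one
# sort/hash at full work_mem at the same time. Purely illustrative.

def worst_case_work_mem_mb(work_mem_mb, max_connections):
    """Upper bound on sort/hash memory if all connections go active at once."""
    return work_mem_mb * max_connections

ram_mb = 16 * 1024  # the 16GB server from the original post

# Original settings: work_mem = 1GB, max_connections ~ 2500
print(worst_case_work_mem_mb(1024, 2500))  # 2560000 MB, absurdly beyond RAM

# Suggested settings: work_mem = 4MB
print(worst_case_work_mem_mb(4, 2500))     # 10000 MB, still over RAM in theory,
                                           # but only if all 2500 sort at once
```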

From:
Akos Gabriel
Date:

On Tue, 3 Mar 2009 18:37:42 -0700, Scott Marlowe <> wrote:


> Oh my lord, that is a foot gun waiting to go off.  Assuming 2k
> connections, and somehow a fair number of them went active with big

I absolutely agree with Scott. Also, set effective_cache_size
accordingly; this will help the planner. You can read a lot about
setting it in the mailing list archives.
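For reference, a sketch of what "accordingly" might look like on this 16GB box (the 10GB figure is an assumed starting point, not a measured value; effective_cache_size is only a planner hint and allocates nothing):

```
# postgresql.conf (sketch)
# roughly RAM minus shared_buffers minus OS/process overhead
effective_cache_size = 10GB
```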

--
Regards,
Gábriel Ákos
-=E-Mail :|Web:  http://www.i-logic.hu=-
-=Tel/fax:+3612367353            |Mobil:+36209278894         =-

From:
Dimitri Fontaine
Date:

Hi,

On Wednesday 04 March 2009 02:37:42 Scott Marlowe wrote:
> If some oddball query really needs a lot of work_mem,
> and benchmarks show a larger work_mem helps, consider raising
> the work_mem setting for that one query to something under 1GB (way
> under 1GB) that makes it noticeably faster.  Don't allocate more than
> a test shows you helps.

In my experience, probably the easiest way to integrate this into an existing
application is this:

 BEGIN;
  SET LOCAL work_mem TO '650MB';
  SELECT -- the query requiring such a large setting
 COMMIT;

Right after the commit, the globally configured work_mem (or the previous
session value, in fact) will be back in effect; you won't have to reset it yourself.

Regards,
--
dim

From:
Flavio Henrique Araque Gurgel
Date:

----- "Scott Marlowe" <> escreveu:
> Oh my lord, that is a foot gun waiting to go off.  Assuming 2k
> connections, and somehow a fair number of them going active with big
> sorts, you'd be able to exhaust all physical memory with about 8 to
> 16 connections.  Lower work_mem now, to something like 1 to 4 MB.  Do
> not pass go.  If some oddball query really needs a lot of work_mem,
> and benchmarks show a larger work_mem helps, consider raising
> the work_mem setting for that one query to something under 1GB (way
> under 1GB) that makes it noticeably faster.  Don't allocate more than
> a test shows you helps.

Thanks a lot, Scott. That's what I thought in the beginning, but I was doubtful since the documentation is a bit vague on this point and several bloggers talk about increasing this value up to 250MB. I really think that separating regular non-pooled distributed applications from pooled application servers makes a lot of difference on this point.

> Vacuum does not block transactions, unless you're dropping tables or something.

I'll try to separate things and check whether the DELETE queries are related here.

(...)
> what you think they mean.  Post the first 20 or so lines from top to
> show us.

Unfortunately I can't do that. The data there is very sensitive (it's a public company here in Brazil) and the server is operated only by selected personnel. I just ask for information and give written recommendations. Anyway, I'm going to pay more attention to this topic.

This is a very interesting PostgreSQL deployment (3 large databases, heavy load, things growing all the time) and I'll let you all know what happens when tuning it. I'll report back after lowering work_mem and changing related settings.

Thanks
Flavio
From:
Scott Carey
Date:

You may have decreased performance in your batch jobs with the lower work_mem setting.
Additionally, the fact that you haven't had swap storm issues so far means that although there is a certain risk of an issue, it's probably a lot lower than what has been discussed here so far.
Without a change in client behavior (new queries or a large change in data), a change in load alone is very unlikely to cause a problem.  So take your time to do it right.  I disagree with the knee-jerk "change it now!" response.  The very fact you have gotten this far means it is not as risky as the bare settings indicate.
Definitely plan on improving and testing out work_mem variants soon, but a hasty change to a small value might cause your batch jobs to take much longer; what is your risk if they take too long or don't complete in time?  That risk is yours to assess; if it's not much of a risk, then by all means lower work_mem soon.  But if it is business critical for those batch jobs to complete within some time frame, be careful.

If you know what you are doing, and are careful, your work_mem is probably, but not necessarily, too high.
It all depends on how much you know about your clients.
For example, 2010 connections with 500MB work_mem is not always a problem.  If you know 2000 of those are from an application whose database user can only see small tables, and you know what queries those are, it may be perfectly safe.  For example, I've dealt with an application that had a couple thousand connections, 95% of which were idle at any time (a connection pool much like those likely used by your JBoss apps).  The queries these ran were a limited set of about 20 statements that all accessed small-ish tables (<30MB) by unique key, with few joins.  There were tons of connections, but they:
1: Hardly did anything; most were idle.  On 75% of the connections, a query set ran exactly once every 15 minutes, selecting * from small (sub-1MB) tables.
2: The more active connections (20%) did small select queries on single rows accessed by primary key.

So, the calculation max_connections * work_mem is utterly inappropriate for that sort of workload.  Yes, in theory, those connections could use work_mem * some factor of memory, if they changed their queries and accessed other tables.  In practice, nowhere close.

The remaining few connections (~5) were batch jobs that needed ~800MB of work_mem or else performance would stink.  And they didn't need 800MB of work_mem for real (the hashes used ~250MB); they needed a SETTING of 800MB because the planner is incapable of estimating row counts properly with partitioned table access.
Both applications were unable to configure their own work_mem for quite some time (getting client applications to change is not always a quick process).
But the risk of having a large setting, even with 2000 connections, was low.  The risk of setting it too low was very high (batch jobs taking 3 hours instead of 10 minutes).  Only ~5 connections even accessed schemas/tables with lots of data.  The remaining couple thousand were constrained in many ways other than work_mem.

In the end I did have swap storms.  However, they were not caused by work_mem.  They were caused by the query planner, which can use GBs of memory per connection when planning a query on partitioned tables.

So, my point is: if you don't know a lot about the database or its clients, be very wary of large work_mem settings.  If you do, and have a lot of control over or knowledge about your clients, the work_mem * max_connections calculation is inappropriate.

The detailed formula is along the lines of:
SUM_i [work_mem_i * active_connections_i]    (for each 'type' of connection i).
If you don't know enough about your connections, then the conservative estimate is work_mem * max_connections.

A single query can use multiples of work_mem, depending on how many concurrent hashes/sorts are in the query, so the above is not quite right either.
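The per-type estimate, including the multiple-sorts caveat, can be sketched like this (the connection classes, counts, and operation factors are made-up illustrations, loosely based on the example above):

```python
# Per-type peak work_mem estimate:
#   SUM_i (work_mem_i * active_connections_i * ops_per_query_i)
# ops_per_query_i covers the caveat that one query can run several
# concurrent sorts/hashes, each allowed up to work_mem.

def peak_work_mem_mb(connection_classes):
    """Sum worst-case sort/hash memory over all connection 'types'."""
    return sum(wm * active * ops for wm, active, ops in connection_classes)

# (work_mem_MB, expected_active_connections, concurrent_sorts_or_hashes)
classes = [
    (4,   100, 1),   # pooled web connections doing tiny keyed lookups
    (800,   5, 2),   # batch jobs with big hashes, maybe two per plan
]
print(peak_work_mem_mb(classes))  # 8400 MB
```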

Is there a way to give a particular database user a user-local work_mem setting without having the client change their code?  You could then have each application use its own user, with its own default setting.  The batch jobs with few connections could get much larger work_mem than the JBoss ones.  This would be especially powerful for applications that can't change, or that use higher-level database access tools that make it impossible or very difficult to send non-standard commands like "SET".
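For what it's worth, PostgreSQL can attach a default work_mem to a role with ALTER ROLE ... SET, which takes effect at login with no client-side change (the role names below are made up for illustration):

```sql
-- Each application connects as its own role and inherits that role's default:
ALTER ROLE batch_user SET work_mem = '800MB';
ALTER ROLE jboss_user SET work_mem = '4MB';
```

A role-level default can still be overridden per session or per transaction with SET / SET LOCAL.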

On 3/4/09 4:46 AM, "Flavio Henrique Araque Gurgel" <> wrote:

(...)

From:
Scott Marlowe
Date:

On Wed, Mar 4, 2009 at 11:18 AM, Scott Carey <> wrote:
> You may have decreased performance in your batch jobs with the lower
> work_mem setting.

That would be why I recommended benchmarking queries that need more
memory and setting work_mem for those queries alone.

> Additionally, the fact that you haven’t had swap storm issues so far means
> that although there is a certain risk of an issue, it's probably a lot lower
> than what has been talked about here so far.

No, it means you and the OP are guessing at what's a good number
without any actual proof of it.  Guessing is not a particularly good
method for setting work_mem, especially on a server with 2000+
connections.

> Without a change in client behavior (new queries or large change in data) a
> change in load alone is very unlikely to cause a problem.

That is demonstrably incorrect.  If the average number of live queries
out of the 2000 connections is currently 10, and an increase in load
makes it 500, there is a very REAL chance of running the server out of
memory.

> So take your time
> to do it right.

I believe I made mention of benchmarking queries above and in my first
post.  But doing it right does NOT mean setting work_mem to 2G then
whittling it down as your server crashes under load.

>  I disagree with the knee-jerk “change it now!” response.
>  The very fact you have gotten this far means it is not as risky as the bare
> settings indicate.

Sorry, but I disagree back at you, and it's not a knee jerk reaction,
it's a reaction honed from years of watching supposedly stable
postgresql servers crash and burn under slashdot effect type loads.

> Definitely plan on improving and testing out work_mem variants soon, but a
> hasty change to a small value might cause your batch jobs to take much
> longer; what is your risk if they take too long or don't complete in time?
> That risk is yours to assess; if it's not much of a risk, then by all means
> lower work_mem soon.  But if it is business critical for those batch jobs to
> complete within some time frame, be careful.

Sorry, but that's backwards.  Unless the batch jobs are the only
important thing on this server, running it with work_mem=2G is asking
for trouble under any real kind of load.  It's sacrificing stability
for some unknown and quite probably minimal performance improvement.

It seems a lot of your post is based on either hoping for the best, or
assuming access patterns won't change much over time.  Do you test at
higher and higher parallel loads until failure occurs and then figure
out how to limit that type of failure?  I do, because I can't afford
to have my db servers crash and burn midday under peak load.  And you
never know when some app is gonna start suddenly spewing things like
unconstrained joins due to some bug, and if you've got work_mem set to
1G your server IS gonna have problems.