Thread: using a lot of maintenance_work_mem
Hi,

I am (ab)using a PostgreSQL database (with PostGIS extension) in a large data processing job - each day, I load several GB of data, run a lot of analyses on it, and then throw everything away again. Loading, running, and dumping the results takes about 18 hours every day. The job involves a lot of index building and sorting, and is run on a 64-bit machine with 96 GB of RAM.

Naturally I would like the system to use as much RAM as possible before resorting to disk-based operations, but no amount of maintenance_work_mem setting seems to make it do my bidding. I'm using PostgreSQL 8.3 but would be willing and able to upgrade to any later version.

Some googling has unearthed the issue - which is likely known to all of you, just repeating it to prove I've done my homework - that tuplesort.c always tries to double its memory allocation, and will refuse to do so if that results in an allocation greater than MaxAllocSize:

    if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
        return false;

And MaxAllocSize is hardcoded to 1 GB in memutils.h. (All this based on Postgres 9.1alpha source - I didn't want to bring something up that has been fixed already.)

Now I assume that there are reasons that you're doing this. memutils.h has the (for me) cryptic comment about MaxAllocSize: "XXX This is deliberately chosen to correspond to the limiting size of varlena objects under TOAST. See VARATT_MASK_SIZE in postgres.h.", but VARATT_MASK_SIZE has zero other occurrences in the source code.

If I were to either (a) increase MaxAllocSize to, say, 48 GB instead of 1 GB, or (b) hack tuplesort.c to ignore MaxAllocSize, just for my local setup - would that likely be viable in my situation, or would I break countless things? I can afford some experimentation; as I said, I'm throwing away the database every day anyway. I just thought I'd solicit your advice before I do anything super stupid. - If I can use my setup to somehow contribute to further PostgreSQL development by trying out some things, I'll be more than happy to do so. I do C/C++ but apart from building packages for several platforms, I haven't worked with the PostgreSQL source code.

Of course the cop-out solution would be to just create a huge RAM disk and instruct PostgreSQL to use that for disk-based sorting. I'll do that if all of you say "OMG don't touch MaxAllocSize" ;)

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik@remote.org  ##  N49°00'09" E008°23'33"
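For context, the check quoted above sits in tuplesort.c's array-growth path. Below is a simplified, self-contained sketch of that logic together with the MaxAllocSize cap from memutils.h; the struct definitions are reduced stand-ins for the real SortTuple and Tuplesortstate, and the function name grow_memtuples_sketch is made up for illustration (the real function of that era is grow_memtuples, which also tracks memory accounting omitted here):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef size_t Size;

/* memutils.h: every palloc request is capped just under 1 GB */
#define MaxAllocSize            ((Size) 0x3fffffff)   /* 1 gigabyte - 1 */
#define AllocSizeIsValid(size)  ((Size) (size) <= MaxAllocSize)

/* reduced stand-ins for the real SortTuple / Tuplesortstate */
typedef struct { void *tuple; long datum1; bool isnull1; int tupindex; } SortTuple;
typedef struct { int memtupsize; SortTuple *memtuples; } SortStateSketch;

/*
 * Simplified sketch of tuplesort.c's grow_memtuples(): the in-memory
 * SortTuple array grows by doubling, but growth is refused once the next
 * doubling would push the array itself past MaxAllocSize.  The cap
 * applies to this pointer/metadata array only; the tuples it points to
 * are allocated separately.
 */
static bool
grow_memtuples_sketch(SortStateSketch *state)
{
    SortTuple *newtuples;

    if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
        return false;                      /* the limit discussed in this thread */

    newtuples = realloc(state->memtuples,
                        (size_t) state->memtupsize * 2 * sizeof(SortTuple));
    if (newtuples == NULL)
        return false;

    state->memtuples = newtuples;
    state->memtupsize *= 2;
    return true;
}

int main(void)
{
    SortStateSketch st = { 1024, calloc(1024, sizeof(SortTuple)) };

    while (grow_memtuples_sketch(&st))
        ;                                  /* keep doubling until refused */

    printf("array stops growing at %d slots (~%zu MB of SortTuples)\n",
           st.memtupsize, (size_t) st.memtupsize * sizeof(SortTuple) >> 20);
    free(st.memtuples);
    return 0;
}
```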
Frederik Ramm <frederik@remote.org> writes:
> Now I assume that there are reasons that you're doing this. memutils.h
> has the (for me) cryptic comment about MaxAllocSize: "XXX This is
> deliberately chosen to correspond to the limiting size of varlena
> objects under TOAST. See VARATT_MASK_SIZE in postgres.h.", but
> VARATT_MASK_SIZE has zero other occurrences in the source code.

Hm, I guess that comment needs updating then.

> If I were to either (a) increase MaxAllocSize to, say, 48 GB instead of
> 1 GB, or (b) hack tuplesort.c to ignore MaxAllocSize, just for my local
> setup - would that likely be viable in my situation, or would I break
> countless things?

You would break countless things. It might be okay anyway in a trusted environment, i.e., one without users trying to crash the system, but there are a lot of security-critical implications of that test.

If we were actually trying to support such large allocations, what I'd be inclined to do is introduce a separate call along the lines of MemoryContextAllocLarge() that lacks the safety check. But before expending time on that, I'd want to see some evidence that it's actually helpful for production situations. I'm a bit dubious that you're going to gain much here.

			regards, tom lane
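MemoryContextAllocLarge() is only a suggestion here; no function by that name exists in the sources under discussion. A minimal sketch of what such an entry point might look like, assuming it simply mirrors MemoryContextAlloc() minus the AllocSizeIsValid() cap and relies on PostgreSQL's internal memory-context headers (the name and every detail below are hypothetical):

```c
/*
 * Hypothetical sketch only: MemoryContextAllocLarge() is not an existing
 * API in 8.3/9.0/9.1.  The idea is MemoryContextAlloc() with the 1 GB
 * AllocSizeIsValid() check left out, so that callers that knowingly want
 * multi-gigabyte arrays (tuplesort, hash tables) can opt in explicitly,
 * while ordinary palloc callers keep the safety net.
 */
void *
MemoryContextAllocLarge(MemoryContext context, Size size)
{
    AssertArg(MemoryContextIsValid(context));

    /* deliberately no AllocSizeIsValid(size) check here */

    return (*context->methods->alloc) (context, size);
}
```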
Frederik Ramm <frederik@remote.org> wrote:
> I am (ab)using a PostgreSQL database (with PostGIS extension) in
> a large data processing job - each day, I load several GB of data,
> run a lot of analyses on it, and then throw everything away again.
> Loading, running, and dumping the results takes about 18 hours
> every day.
>
> The job involves a lot of index building and sorting, and is run
> on a 64-bit machine with 96 GB of RAM.
>
> Naturally I would like the system to use as much RAM as possible
> before resorting to disk-based operations, but no amount of
> maintenance_work_mem setting seems to make it do my bidding.

If you can tolerate some risk that for a given day you might fail to generate the analysis, or you might need to push the schedule back to get it, you could increase performance by compromising recoverability. You seem to be willing to consider such risk based on your mention of a RAM disk.

- If a single session can be maintained for loading and using the data, you might be able to use temporary tables and a large temp_buffers size. Of course, when the connection closes, the tables are gone.
- You could turn off fsync and full_page_writes, but on a crash your database might be corrupted beyond usability.
- You could turn off synchronous_commit.
- Make sure you have archiving turned off.
- If you are not already doing so, load the data into each table within the same database transaction which does CREATE TABLE or TRUNCATE TABLE.

Other than the possibility that the temp table might keep things in RAM, these suggestions don't directly address your question, but I thought they might be helpful.

-Kevin
On Mon, 2011-02-14 at 10:11 -0500, Tom Lane wrote:
> But before expending time on that, I'd want to see some evidence that
> it's actually helpful for production situations. I'm a bit dubious
> that you're going to gain much here.

If you want to build an index on a 500GB table and you have 1TB RAM, then being able to use >>1GB maintenance_work_mem can only be good, no?
Tom & Kevin,

thank you for your replies. Kevin, I had already employed all the tricks you mention, except using temporary tables, which would be hard for me due to the structure of my application (but I could try using something like pgbouncer or so) - thanks a lot for sharing the ideas.

Tom Lane wrote:
>> If I were to either (a) increase MaxAllocSize to, say, 48 GB instead of
>> 1 GB, or (b) hack tuplesort.c to ignore MaxAllocSize, just for my local
>> setup - would that likely be viable in my situation, or would I break
>> countless things?
>
> You would break countless things.

Indeed I did. I tried to raise the MaxAllocSize from 1 GB to a large number, but immediately got strange memory allocation errors during the regression test (something that looked like a wrapped integer in a memory allocation request). I reduced the number in steps, and found I could compile and run PostgreSQL 8.3 with a MaxAllocSize of 4 GB, and PostgreSQL 9.0 with 2 GB, without breakage.

In a completely un-scientific test run, comprising 42 individual SQL statements aimed at importing and indexing a large volume of data, I got the following results:

pg8.3 with normal MaxAllocSize .................. 15284s
pg8.3 with MaxAllocSize increased to 4 GB ....... 14609s (-4.5%)
pg9.0 with normal MaxAllocSize .................. 12969s (-15.2%)
pg9.0 with MaxAllocSize increased to 2 GB ....... 13211s (-13.5%)

> I'd want to see some evidence that it's actually
> helpful for production situations. I'm a bit dubious that you're going
> to gain much here.

So, on the whole it seems you were right; the performance, at least with the small memory increase I managed to build in without breaking things, doesn't improve a lot, or not at all for PostgreSQL 9.0.

The single query that gained most from the increase in memory was an ALTER TABLE statement to add a BIGINT primary key to a table with about 50 million records - this was 75% faster on both 8.3 and 9.0, but since it took only 120 seconds to begin with, it didn't change the overall result a lot.

The single query where pg9.0 beat pg8.3 by a country mile was a CREATE INDEX statement on a BIGINT column of a table with about 500 million records - this cost 2679 seconds on normal 8.3, 2443 seconds on large-memory 8.3, and around 1650 seconds on 9.0, large memory or not.

The query that, on both 8.3 and 9.0, took about 10% longer with more memory was a CREATE INDEX statement on a TEXT column.

All this is, as I said, completely un-scientific - I did take care to flush caches and not run anything in parallel, but that was about all, so it might come out differently when run more often.

My result of all of this? Switch to 9.0 of course ;)

Bye
Frederik

-- 
Frederik Ramm  ##  eMail frederik@remote.org  ##  N49°00'09" E008°23'33"
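One plausible reason the regression tests blow up as soon as MaxAllocSize goes much past 1 GB (rather than the change being harmless) is the varlena length header that the cryptic memutils.h comment alludes to: a 4-byte varlena header reserves two bits for flags, leaving roughly 30 bits for the datum length. The snippet below is a hypothetical illustration of that wrap using made-up macro names, not the actual postgres.h definitions, and it is only a guess at the failure mode seen above:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Simplified illustration, NOT the real postgres.h varlena macros: a
 * 4-byte varlena header keeps 2 bits for flags and about 30 bits for the
 * total datum length, so ~1 GB is the largest size a palloc'd varlena
 * datum can record about itself.  That limit does not move just because
 * MaxAllocSize is redefined, so oversized allocations can end up with a
 * wrapped length - consistent with "wrapped integer" style errors.
 */
#define VARLENA_SIZE_BITS  30
#define VARLENA_SIZE_MASK  ((1u << VARLENA_SIZE_BITS) - 1)   /* 0x3FFFFFFF */

static uint32_t
set_varsize_sketch(uint64_t len)
{
    return (uint32_t) (len & VARLENA_SIZE_MASK);  /* silently wraps past ~1 GB */
}

int main(void)
{
    uint64_t huge = (uint64_t) 9 * 512 * 1024 * 1024;     /* a 4.5 GB datum */

    printf("requested %llu bytes, header records %u bytes\n",
           (unsigned long long) huge, set_varsize_sketch(huge));
    /* prints: requested 4831838208 bytes, header records 536870912 bytes */
    return 0;
}
```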
Frederik Ramm <frederik@remote.org> writes:
> The single query where pg9.0 beat pg8.3 by a country mile was a CREATE
> INDEX statement on a BIGINT column of a table with about 500 million
> records - this cost 2679 seconds on normal 8.3, 2443 seconds on
> large-memory 8.3, and around 1650 seconds on 9.0, large memory or not.

FWIW, that's probably due to bigint having become pass-by-value on 64-bit platforms.

			regards, tom lane
On Wed, 2011-02-16 at 23:24 +0200, Peter Eisentraut wrote:
> > But before expending time on that, I'd want to see some evidence that
> > it's actually helpful for production situations. I'm a bit dubious
> > that you're going to gain much here.
>
> If you want to build an index on a 500GB table and you have 1TB RAM,
> then being able to use >>1GB maintenance_work_mem can only be good,
> no?

That would also probably speed up Slony (or similar) replication engines in the initial replication phase. I know that I had to wait a lot while creating big indexes on a machine which had enough RAM.

-- 
Devrim GÜNDÜZ
EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz
Devrim GÜNDÜZ wrote:
> On Wed, 2011-02-16 at 23:24 +0200, Peter Eisentraut wrote:
> > > But before expending time on that, I'd want to see some evidence
> > > that it's actually helpful for production situations. I'm a bit dubious
> > > that you're going to gain much here.
> >
> > If you want to build an index on a 500GB table and you have 1TB RAM,
> > then being able to use >>1GB maintenance_work_mem can only be good,
> > no?
>
> That would also probably speed up Slony (or similar) replication engines
> in the initial replication phase. I know that I had to wait a lot while
> creating big indexes on a machine which had enough RAM.

Well, I figure it will be hard to allow larger maximums, but can we make the GUC variable maximums be more realistic? Right now it is MAX_KILOBYTES (INT_MAX).

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
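For context on the MAX_KILOBYTES figure Bruce mentions: memory GUCs such as work_mem and maintenance_work_mem are integers measured in kilobytes, and their ceiling comes from a macro in guc.h. The sketch below is reconstructed from memory of that era's sources (treat the exact conditional as an assumption, not a quotation), with fallback defines added only so it compiles standalone:

```c
#include <limits.h>
#include <stdio.h>

#ifndef SIZEOF_SIZE_T
#define SIZEOF_SIZE_T 8        /* stand-in for pg_config.h on a 64-bit build */
#endif
#ifndef SIZEOF_LONG
#define SIZEOF_LONG 8
#endif

/*
 * Approximate sketch of guc.h: memory GUCs are ints in kilobytes, capped
 * at MAX_KILOBYTES.  On 64-bit builds that is INT_MAX kB (about 2 TB),
 * so the knob promises far more than any single palloc request can use.
 */
#if SIZEOF_SIZE_T > 4 && SIZEOF_LONG > 4
#define MAX_KILOBYTES   INT_MAX
#else
#define MAX_KILOBYTES   (INT_MAX / 1024)
#endif

int main(void)
{
    printf("GUC ceiling:    %lld kB (~%lld GB)\n",
           (long long) MAX_KILOBYTES, (long long) MAX_KILOBYTES / (1024 * 1024));
    printf("per-palloc cap: %lld kB\n", (long long) (0x3fffffffLL / 1024));
    return 0;
}
```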
--On 20 February 2011 09:32:02 -0500 Bruce Momjian <bruce@momjian.us> wrote:

> Well, I figure it will be hard to allow larger maximums, but can we make
> the GUC variable maximums be more realistic? Right now it is
> MAX_KILOBYTES (INT_MAX).

This is something I proposed some time ago, too. At least, it would stop us from promising something which maintenance_work_mem is not able to deliver.

-- 
Thanks

	Bernd
--On 20 February 2011 15:48:06 +0100 Bernd Helmle <mailings@oopsware.de> wrote:

>> Well, I figure it will be hard to allow larger maximums, but can we make
>> the GUC variable maximums be more realistic? Right now it is
>> MAX_KILOBYTES (INT_MAX).
>
> This is something I proposed some time ago, too. At least, it would stop
> us from promising something which maintenance_work_mem is not able to
> deliver.

Hmm, on further reflection, a better option might be to just document this behavior in more detail. I could imagine that giving maintenance_work_mem a hard upper limit would break countless SQL scripts where it was set just high enough in the hope of a speed increase...

-- 
Thanks

	Bernd
Bruce Momjian <bruce@momjian.us> writes:
> Well, I figure it will be hard to allow larger maximums, but can we make
> the GUC variable maximums be more realistic? Right now it is
> MAX_KILOBYTES (INT_MAX).

You seem to be confusing one limitation in one code path with the overall meaning of maintenance_work_mem.

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Well, I figure it will be hard to allow larger maximums, but can we make
> > the GUC variable maximums be more realistic? Right now it is
> > MAX_KILOBYTES (INT_MAX).
>
> You seem to be confusing one limitation in one code path with the
> overall meaning of maintenance_work_mem.

Oh, OK, so sorts are limited, but not hash sizes? Are there any other uses? Should this be documented somehow? What is the actual sort limit?

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> You seem to be confusing one limitation in one code path with the
>> overall meaning of maintenance_work_mem.

> Oh, OK, so sorts are limited, but not hash sizes? Are there any other
> uses? Should this be documented somehow? What is the actual sort
> limit?

The particular complaint that's being made here is about tuplesort.c's array of SortTuples, which isn't all (or even the largest part) of its memory consumption. The tuples themselves eat significantly more in nearly all cases. I don't think there's any very easy way to document what the largest useful maintenance_work_mem for sorting is based on that --- you'd have to pull a number for tuple size out of the air. But it's certainly possible to use up lots of gigabytes when sorting wide tuples. I think the original complaint in this thread was about building an index, which probably had relatively small tuples, so the SortTuple constraint was more pressing.

In any case, this is the sort of thing that'd be far better to fix than document.

			regards, tom lane
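To put a rough number on the limit Tom describes: the cap applies to the array of SortTuple entries, not to the tuples themselves. The arithmetic below is a back-of-the-envelope sketch; the 24-byte figure for sizeof(SortTuple) is an assumption about a typical 64-bit layout of that struct, not a value taken from the source:

```c
#include <stdio.h>

/*
 * Back-of-the-envelope sketch.  Assume sizeof(SortTuple) is 24 bytes on a
 * 64-bit build (pointer + Datum + bool + int, padded) -- an assumption.
 * The doubling check in tuplesort.c refuses to grow the array past
 * MaxAllocSize / sizeof(SortTuple) slots.
 */
#define MAX_ALLOC_SIZE   ((size_t) 0x3fffffff)     /* just under 1 GB */
#define SORTTUPLE_SIZE   ((size_t) 24)             /* assumed */

int main(void)
{
    size_t max_slots = MAX_ALLOC_SIZE / SORTTUPLE_SIZE;

    printf("max SortTuple slots: ~%zu million\n", max_slots / 1000000);
    /* Roughly 44 million entries: an index build over hundreds of
     * millions of narrow rows hits this array cap and spills to disk even
     * with maintenance_work_mem set far higher, while a sort of wide
     * tuples can already burn many GB in the tuples themselves without
     * this particular check ever firing. */
    return 0;
}
```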
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> You seem to be confusing one limitation in one code path with the
> >> overall meaning of maintenance_work_mem.
>
> > Oh, OK, so sorts are limited, but not hash sizes? Are there any other
> > uses? Should this be documented somehow? What is the actual sort
> > limit?
>
> The particular complaint that's being made here is about tuplesort.c's
> array of SortTuples, which isn't all (or even the largest part) of its
> memory consumption. The tuples themselves eat significantly more in
> nearly all cases. I don't think there's any very easy way to document
> what the largest useful maintenance_work_mem for sorting is based on
> that --- you'd have to pull a number for tuple size out of the air.
> But it's certainly possible to use up lots of gigabytes when sorting
> wide tuples. I think the original complaint in this thread was about
> building an index, which probably had relatively small tuples, so the
> SortTuple constraint was more pressing.
>
> In any case, this is the sort of thing that'd be far better to fix than
> document.

Added to TODO:

Allow sorts to use more available memory

* http://archives.postgresql.org/pgsql-hackers/2007-11/msg01026.php
* http://archives.postgresql.org/pgsql-hackers/2010-09/msg01123.php
* http://archives.postgresql.org/pgsql-hackers/2011-02/msg01957.php

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
Tom, all,

Having run into issues caused by small work_mem, again, I felt the need to respond to this.

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> You would break countless things. It might be okay anyway in a trusted
> environment, i.e., one without users trying to crash the system, but there
> are a lot of security-critical implications of that test.

I really don't see work_mem or maintenance_work_mem as security-related parameters. Amusingly, the Postgres95 1.01 release apparently attempted to make the cap 16GB (but failed and made it 256M instead). After a bit of poking around, I found this commit:

commit 85c17dbff8ade0c5237e3ac1ece7cacacfdde399
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Tue Feb 6 01:53:53 2001 +0000

    Out-of-bounds memory allocation request sizes should be treated as just
    elog(ERROR) not an Assert trap, since we've downgraded out-of-memory to
    elog(ERROR) not a fatal error.  Also, change the hard boundary from 256Mb
    to 1Gb, just so that anyone who's actually got that much memory to spare
    can play with TOAST objects approaching a gigabyte.

If we want to implement a system to limit what users can request with regard to work_mem then we can do that, but a smart user could probably circumvent such a system by building huge queries. A system which monitored actual usage and ERROR'd out would probably be better to address that concern.

> If we were actually trying to support such large allocations,
> what I'd be inclined to do is introduce a separate call along the lines
> of MemoryContextAllocLarge() that lacks the safety check.

This sounds like the right approach to me. Basically, I'd like to have MemoryContextAllocLarge(), on 64-bit platforms, and have it be used for things like sorts and hash tables. We'd need to distinguish that usage from things which allocate varlenas and the like.

> But before
> expending time on that, I'd want to see some evidence that it's actually
> helpful for production situations. I'm a bit dubious that you're going
> to gain much here.

I waited ~26hrs for a rather simple query:

explain select <bunch-of-columns>, <bunch-of-aggregates> from really_big_table where customer_code ~ '^CUST123' group by <bunch-of-columns>;

                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 GroupAggregate  (cost=37658456.68..42800117.89 rows=10546998 width=146)
   ->  Sort  (cost=37658456.68..37922131.61 rows=105469973 width=146)
         Sort Key: <bunch-of-columns>
         ->  Seq Scan on really_big_table  (cost=0.00..15672543.00 rows=105469973 width=146)
               Filter: ((customer_code)::text ~ '^CUST123'::text)
(5 rows)

This query ran for ~26 hours, where ~20 hours was spent sorting the ~30G which resulted from the Seq-Scan+filter (the raw table is ~101G). The resulting table (after the GroupAgg) was only 30MB in size (~80k rows instead of the estimated 10M above).
Another query against the same 101G table, which used a HashAgg, completed just a bit faster than the 26 hours:

                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 HashAggregate  (cost=19627666.99..19631059.80 rows=90475 width=116) (actual time=1435604.737..1435618.293 rows=4869 loops=1)
   ->  Seq Scan on really_big_table  (cost=0.00..15672543.00 rows=105469973 width=116) (actual time=221029.805..804802.329 rows=104616597 loops=1)
         Filter: ((agency_hierarchy_code)::text ~ '^CUST123'::text)
 Total runtime: 1435625.388 ms
(4 rows)

Now, this query had fewer columns in the group by (required to convince PG to use a HashAgg), but, seriously, it only took 23 minutes to scan through the entire table. It could have taken 3 hours and I would have been happy.

Admittedly, part of the problem here is the whole cross-column correlation stats problem, but I wouldn't care if the stats were right and I ended up with a 1.5G hash table and a 10M-record result; I'm pretty sure generating that would be a lot faster using a HashAgg than a sort+GroupAgg. Also, I feel like we're pretty far from having the cross-column statistics fixed, and I'm not 100% convinced that it'd actually come up with a decent result for this query anyway (there are 18 columns in the group by clause for the first query...).

Anyhow, I just wanted to show that there are definitely cases where the current limit is making things difficult for real-world PG users on production systems.

Thanks,

	Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> If we were actually trying to support such large allocations,
>> what I'd be inclined to do is introduce a separate call along the lines
>> of MemoryContextAllocLarge() that lacks the safety check.

> This sounds like the right approach to me. Basically, I'd like to have
> MemoryContextAllocLarge(), on 64-bit platforms, and have it be used for
> things like sorts and hash tables. We'd need to distinguish that usage
> from things which allocate varlenas and the like.

Yes, but ...

>> But before
>> expending time on that, I'd want to see some evidence that it's actually
>> helpful for production situations. I'm a bit dubious that you're going
>> to gain much here.

> I waited ~26hrs for a rather simple query:

The fact that X is slow does not prove anything about whether Y will make it faster. In particular I see nothing here showing that this query is bumping up against the 1GB-for-sort-pointers limit, or that if it is, any significant gain would result from increasing that. I think the only real way to prove that is to hack the source code to remove the limit and see what happens. (You could try using malloc directly, not palloc at all, to have a non-production-quality but very localized patch to test.)

BTW, it sounded like your argument had to do with whether it would use HashAgg or not -- that is *not* dependent on the per-palloc limit, and never has been.

			regards, tom lane
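A minimal sketch of the kind of localized, non-production test hack Tom suggests - growing tuplesort.c's memtuples array with raw realloc() so the palloc-level MaxAllocSize check never fires. This is an illustration under assumptions (it presupposes tuplesort.c's internal Tuplesortstate/SortTuple types and is not an actual patch from this thread):

```c
/*
 * Throwaway experiment only: grow the SortTuple array with raw realloc()
 * so the MaxAllocSize check inside repalloc() is never consulted.  For
 * this to be valid the initial array would also have to come from
 * malloc() rather than palloc(), and the USEMEM/FREEMEM accounting is
 * deliberately skipped - only good for measuring whether a >1 GB
 * SortTuple array actually helps.
 */
static bool
grow_memtuples_unchecked(Tuplesortstate *state)
{
    size_t      newsize = (size_t) state->memtupsize * 2;
    SortTuple  *newtuples;

    newtuples = (SortTuple *) realloc(state->memtuples,
                                      newsize * sizeof(SortTuple));
    if (newtuples == NULL)
        return false;           /* out of memory: stop growing and spill */

    state->memtuples = newtuples;
    state->memtupsize = (int) newsize;   /* still an int; overflows past 2^31 slots */
    return true;
}
```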
On Sat, Apr 9, 2011 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, it sounded like your argument had to do with whether it would use
> HashAgg or not -- that is *not* dependent on the per-palloc limit, and
> never has been.

His point was he wanted to be allowed to set work_mem > 1GB. This is going to become a bigger and bigger problem with 72-128GB and larger machines already becoming quite standard.

-- 
greg
On Sun, 2011-04-10 at 03:05 +0100, Greg Stark wrote:
> On Sat, Apr 9, 2011 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > BTW, it sounded like your argument had to do with whether it would use
> > HashAgg or not -- that is *not* dependent on the per-palloc limit, and
> > never has been.
>
> His point was he wanted to be allowed to set work_mem > 1GB. This is
> going to become a bigger and bigger problem with 72-128GB and larger
> machines already becoming quite standard.

Yes it is, it even came up at East. 1GB just doesn't cut it anymore...

JD

-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
* Greg Stark (gsstark@mit.edu) wrote:
> On Sat, Apr 9, 2011 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > BTW, it sounded like your argument had to do with whether it would use
> > HashAgg or not -- that is *not* dependent on the per-palloc limit, and
> > never has been.
>
> His point was he wanted to be allowed to set work_mem > 1GB. This is
> going to become a bigger and bigger problem with 72-128GB and larger
> machines already becoming quite standard.

Actually, Tom has a point in that work_mem can be set above 1GB (which is where I had it set previously..). I didn't think it'd actually do anything given the MaxAlloc limit, but surprisingly, it does (at least, under 8.4). I'm currently trying to see if we've got anything that's going to *break* with work_mem set up that high; right now I have a hashagg plan running across this data set which has 2.4G allocated to it so far.

I'll update this thread with whatever I find out. I'm trying to remember the other issues that I ran into with this limit (beyond the whole sort limit, which I do think would be helped by allowing a larger value, but it's not as big a deal).

Thanks,

	Stephen
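The observation that a hash aggregate can grow past 1 GB while a sort cannot is consistent with Tom's note that HashAgg does not depend on the per-palloc limit: hash-table memory is made up of many individually small allocations rather than one contiguous array. The toy model below illustrates that distinction under assumed entry sizes and group counts; it is not PostgreSQL code:

```c
#include <stdio.h>

#define MaxAllocSize            ((size_t) 0x3fffffff)   /* palloc's ~1 GB cap */
#define AllocSizeIsValid(size)  ((size_t) (size) <= MaxAllocSize)

/*
 * Toy model: a hash aggregate allocates many small chunks (hash entries,
 * per-group transition values), each of which passes the per-request
 * size check, so its total memory use can climb far past 1 GB and is
 * bounded only by work_mem.  A tuplesort instead wants one contiguous
 * SortTuple array, and that single oversized request is what the cap
 * refuses.  Entry size and group count are assumptions for illustration.
 */
int main(void)
{
    size_t entry_size = 128;                    /* assumed per-group cost */
    size_t n_groups   = 40 * 1000 * 1000;       /* 40 million groups */
    size_t total      = entry_size * n_groups;  /* ~4.8 GB in aggregate */

    printf("one %zu-byte entry passes the check: %d\n",
           entry_size, AllocSizeIsValid(entry_size));   /* prints 1 */
    printf("one %zu-byte array passes the check: %d\n",
           total, AllocSizeIsValid(total));             /* prints 0 */
    return 0;
}
```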
On Apr 9, 2011, at 9:23 PM, Stephen Frost wrote:
> Actually, Tom has a point in that work_mem can be set above 1GB (which
> is where I had it set previously..). I didn't think it'd actually do
> anything given the MaxAlloc limit, but surprisingly, it does (at least,
> under 8.4). I'm currently trying to see if we've got anything that's
> going to *break* with work_mem set up that high; right now I have a
> hashagg plan running across this data set which has 2.4G allocated to
> it so far.
>
> I'll update this thread with whatever I find out. I'm trying to
> remember the other issues that I ran into with this limit (beyond the
> whole sort limit, which I do think would be helped by allowing a larger
> value, but it's not as big a deal).

FWIW, I regularly set maintenance_work_mem to 8G for index builds. Presumably that's equivalent to running a sort in a regular query with work_mem set that high...

-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                                http://jim.nasby.net