Thread: random observations while testing with a 1.8B row table
Hi all!

During my testing of large work_mem and maintenance_work_mem settings
with respect to CREATE INDEX and sorting, I encountered a number of
things while doing various operations on such a large table (about 106GB
on disk with no dead tuples). I will summarize some of them just in case
somebody is interested:

-> the table used has 5 integer columns, non-indexed during the loads
-> hardware is a dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM;
   data is on a multipathed (busy) SAN with separate (RAID 10) arrays
   for WAL and data.

1. data loading - I'm using COPY with batches of 300M rows:

*) with one COPY running it takes about 20 minutes/batch to load the
data (~250k rows/sec) and virtually no context switches.

*) with two COPYs running concurrently it takes a bit less than 30
minutes/batch at a steady 40k context switches/sec (~340k rows/sec
overall).

*) with three COPYs it takes about 40 minutes/batch at 140k context
switches/sec (~380k rows/sec overall).

the profiles for those runs all look very similar to:

samples  %        symbol name
5065118  20.9607  XLogInsert
3496868  14.4709  DoCopy
2807313  11.6174  CopyReadLine
1373621   5.6844  PageAddItem
1227069   5.0779  heap_formtuple
1193319   4.9383  LWLockAcquire
 894243   3.7006  hash_search
 717427   2.9689  LWLockRelease
 699359   2.8941  pg_atoi
 691385   2.8611  FunctionCall3
 640383   2.6501  heap_insert
 579331   2.3974  int4in
 411286   1.7020  AllocSetReset
 376452   1.5579  hash_any
 349220   1.4452  RelationGetBufferForTuple
 261568   1.0824  AllocSetAlloc
 257511   1.0656  ReadBuffer

while the amount of IO going on is quite a lot, it looks like we are
still mostly CPU-bound for COPY.

2. updating all of the rows in the table:

I updated all of the rows in the table with a simple UPDATE testtable
SET a=a+1; this took about 2.5 hours (~200k rows/sec) with a profile
looking like:

samples   %        symbol name
27860285  26.5844  XLogInsert
 4828077   4.6070  PageAddItem
 4490535   4.2849  heap_update
 4267647   4.0722  slot_deform_tuple
 3996750   3.8137  LWLockAcquire
 3716184   3.5460  slot_getattr
 3454679   3.2965  hash_search
 2998742   2.8614  hash_any
 2909261   2.7760  heap_fill_tuple
 2825256   2.6959  LWLockRelease
 2283086   2.1785  LockBuffer
 2135048   2.0373  ExecTargetList
 1636017   1.5611  ExecEvalVar
 1632377   1.5576  UnpinBuffer
 1566087   1.4944  RelationGetBufferForTuple
 1561378   1.4899  ExecMakeFunctionResultNoSets
 1511366   1.4421  ReadBuffer
 1381614   1.3183  heap_compute_data_size

3. vacuuming this table - it turned out that VACUUM FULL is completely
unusable on a table of this size (which I actually expected beforehand),
not only due to the locking involved but rather due to a gigantic memory
requirement and unbelievable slowness. It seems that the heap-scan part
of the VACUUM FULL completed after about 2 hours, ending up with a
postmaster having a resident size of about 8.5GB(!!!) with
maintenance_work_mem set to 1GB.
the profile for this stage looks like:

samples  %        symbol name
941058   26.0131  scan_heap
444435   12.2852  HeapTupleSatisfiesVacuum
242117    6.6927  TransactionIdIsInProgress
220044    6.0825  _mdfd_getseg
212571    5.8760  hash_search
186963    5.1681  TransactionIdPrecedes
176016    4.8655  SetBufferCommitInfoNeedsSave
137668    3.8055  TransactionIdDidCommit
137068    3.7889  PageRepairFragmentation
111474    3.0814  TransactionLogFetch
103814    2.8697  LWLockAcquire
102925    2.8451  LWLockRelease
102456    2.8321  hash_any
 67199    1.8575  BufferAlloc

after that, the postmaster started slowly consuming more and more
memory, doing virtually no IO and eating CPU like mad, with a profile
similar to:

samples     %        symbol name
2708391248  94.1869  repair_frag
 155395833   5.4040  enough_space
   5707137   0.1985  XLogInsert
   1410703   0.0491  PageAddItem
    691616   0.0241  BgBufferSync

I actually ended up canceling the VACUUM FULL after about 50 hours of
runtime, at a resident size of ~11.5GB.

Stefan
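[For reference, a minimal sketch of the workload described above - the
table name and column "a" appear in the post, while the remaining column
names, the file path and the batch layout are illustrative assumptions:]

    -- schema: 5 integer columns, no indexes during the load
    CREATE TABLE testtable (a int, b int, c int, d int, e int);

    -- 1. load in batches of 300M rows; per the post, two or three of
    --    these COPYs can be run concurrently from separate sessions
    COPY testtable FROM '/data/batch_01.dat';    -- hypothetical path

    -- 2. rewrite every row: a new version of each of the 1.8B tuples
    --    is created and WAL-logged, hence XLogInsert dominating both
    --    profiles above
    UPDATE testtable SET a = a + 1;

    -- 3. attempt to reclaim the dead space left behind by the UPDATE
    VACUUM FULL testtable;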
Stefan,

On 3/10/06 9:40 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

> I will summarize some of them just in case somebody is interested:

I am!

> -> the table used has 5 integer columns, non-indexed during the loads
> -> hardware is a dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM;
> data is on a multipathed (busy) SAN with separate (RAID 10) arrays
> for WAL and data.

How many connections out of the machine? How many disks behind each LUN?

So - about 20 bytes per row (5*4) unless those are int8, but on disk it's
106GB/1.8B = about 60 bytes per row. I wonder what all that overhead is?

> 1. data loading - I'm using COPY with batches of 300M rows:
>
> *) with one COPY running it takes about 20 minutes/batch to load the
> data (~250k rows/sec) and virtually no context switches.
>
> *) with two COPYs running concurrently it takes a bit less than 30
> minutes/batch at a steady 40k context switches/sec (~340k rows/sec
> overall).
>
> *) with three COPYs it takes about 40 minutes/batch at 140k context
> switches/sec (~380k rows/sec overall).

So, from 15 MB/s up to about 20 MB/s.

> while the amount of IO going on is quite a lot, it looks like we are
> still mostly CPU-bound for COPY.

It's what we see almost always. In this case, if your I/O configuration
is capable of performing at about 3x the 20MB/s max parsing rate, or
60MB/s, you will be CPU limited. The 3x is approximate and based on
observations; the reasoning underneath it is that Postgres writes the
data several times, once to the WAL and then again to the heap files.

> 2. updating all of the rows in the table:
>
> I updated all of the rows in the table with a simple UPDATE testtable
> SET a=a+1; this took about 2.5 hours (~200k rows/sec)

Ugh. This is where Bizgres MPP shines - I'll try to recreate your test
and post results. This scales linearly in Bizgres MPP with the number of
disks and CPUs available, but I would hope for much more than that.

> 3. vacuuming this table - it turned out that VACUUM FULL is completely
> unusable on a table of this size, not only due to the locking involved
> but rather due to a gigantic memory requirement and unbelievable
> slowness.

Simple vacuum should be enough IMO.

- Luke
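[Much of the per-row gap is plausibly fixed tuple overhead - in the 8.x
on-disk format roughly 28-32 bytes of aligned tuple header plus a 4-byte
line pointer per row, plus per-page overhead, on top of the 20 bytes of
user data. A hedged way to cross-check the on-disk bytes/row figure from
the catalog, assuming the table name from the post and a recent ANALYZE:]

    -- relpages * 8KB block size, divided by the estimated row count;
    -- meaningful here only on the freshly loaded table (no dead tuples)
    SELECT relname,
           relpages::float8 * 8192 / reltuples AS bytes_per_row
      FROM pg_class
     WHERE relname = 'testtable';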
Luke Lonergan wrote:
> Stefan,
>
> On 3/10/06 9:40 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
>
>> I will summarize some of them just in case somebody is interested:
>
> I am!

heh - not surprised :-)

>> -> the table used has 5 integer columns, non-indexed during the loads
>> -> hardware is a dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM;
>> data is on a multipathed (busy) SAN with separate (RAID 10) arrays
>> for WAL and data.
>
> How many connections out of the machine? How many disks behind each LUN?

2 HBAs in the server, 2x2 possible paths to each LUN.
6 disks for the WAL and 12 disks for the data.

> So - about 20 bytes per row (5*4) unless those are int8, but on disk it's
> 106GB/1.8B = about 60 bytes per row. I wonder what all that overhead is?
>
>> 1. data loading - I'm using COPY with batches of 300M rows:
>>
>> *) with one COPY running it takes about 20 minutes/batch to load the
>> data (~250k rows/sec) and virtually no context switches.
>>
>> *) with two COPYs running concurrently it takes a bit less than 30
>> minutes/batch at a steady 40k context switches/sec (~340k rows/sec
>> overall).
>>
>> *) with three COPYs it takes about 40 minutes/batch at 140k context
>> switches/sec (~380k rows/sec overall).
>
> So, from 15 MB/s up to about 20 MB/s.
>
>> while the amount of IO going on is quite a lot, it looks like we are
>> still mostly CPU-bound for COPY.
>
> It's what we see almost always. In this case, if your I/O configuration
> is capable of performing at about 3x the 20MB/s max parsing rate, or
> 60MB/s, you will be CPU limited.

the IO-System I use should be capable of doing that if pushed hard
enough :-)

> The 3x is approximate and based on observations; the reasoning
> underneath it is that Postgres writes the data several times, once to
> the WAL and then again to the heap files.
>
>> 2. updating all of the rows in the table:
>>
>> I updated all of the rows in the table with a simple UPDATE testtable
>> SET a=a+1; this took about 2.5 hours (~200k rows/sec)
>
> Ugh. This is where Bizgres MPP shines - I'll try to recreate your test
> and post results. This scales linearly in Bizgres MPP with the number of
> disks and CPUs available, but I would hope for much more than that.

interesting to know, but still I'm testing/playing with postgresql here,
not bizgres MPP ...

>> 3. vacuuming this table - it turned out that VACUUM FULL is completely
>> unusable on a table of this size, not only due to the locking involved
>> but rather due to a gigantic memory requirement and unbelievable
>> slowness.
>
> Simple vacuum should be enough IMO.

sure, that was mostly meant as an experiment; if I had to do this on a
production database I would most likely use CLUSTER to get the desired
effect (which in my case was purely getting back the disk space wasted
by dead tuples)

Stefan
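[The CLUSTER route mentioned above might look like the following sketch;
the index name is hypothetical, and the index-first syntax shown is the
form current in 8.x releases:]

    -- CLUSTER rewrites the entire heap in index order and drops the
    -- dead tuples, but needs roughly 2x the table's disk space at peak
    CREATE INDEX testtable_a_idx ON testtable (a);
    CLUSTER testtable_a_idx ON testtable;
    ANALYZE testtable;   -- physical order changed, refresh statistics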
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> 3. vacuuming this table - it turned out that VACUUM FULL is completely
>>> unusable on a table of this size, not only due to the locking involved
>>> but rather due to a gigantic memory requirement and unbelievable
>>> slowness.

> sure, that was mostly meant as an experiment; if I had to do this on a
> production database I would most likely use CLUSTER to get the desired
> effect (which in my case was purely getting back the disk space wasted
> by dead tuples)

Yeah, the VACUUM FULL algorithm is really designed for situations where
just a fraction of the rows have to be moved to re-compact the table.
It might be interesting to teach it to abandon that plan and go to a
CLUSTER-like table rewrite once the percentage of dead space is seen to
reach some suitable level. CLUSTER has its own disadvantages, though
(2X peak disk space usage, doesn't work on core catalogs, etc).

regards, tom lane
Stefan,

On 3/10/06 11:48 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

> 2 HBAs in the server, 2x2 possible paths to each LUN.
> 6 disks for the WAL and 12 disks for the data.

So - you have 18 disks worth of potential bandwidth, not factoring loss
due to RAID. That's roughly 18 * 60 = 1,080 MB/s. If we organized that
into four banks, one for each CPU, made each one RAID5 and left two
disks for spares, you'd have 12 disks working for you at 720MB/s, which
is possibly double the number of active FC channels you have - unless
they are all active, in which case you have a nicely matched 800MB/s
of FC.

>> So, from 15 MB/s up to about 20 MB/s.

Gee - seems a long distance from 700 MB/s potential :-)

> the IO-System I use should be capable of doing that if pushed hard
> enough :-)

I would expect some 10x this if configured well.

> interesting to know, but still I'm testing/playing with postgresql here,
> not bizgres MPP ...

Sure. Still, what I'd expect is something like 10x this update rate
using the parallelism buried in your hardware.

If you configure the same machine with 4 Bizgres MPP segments running on
4 LUNs I think you'd be shocked at the speedups.

- Luke
On Mar 10, 2006, at 11:54 AM, Tom Lane wrote:

> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> 3. vacuuming this table - it turned out that VACUUM FULL is
>>>> completely unusable on a table of this size, not only due to the
>>>> locking involved but rather due to a gigantic memory requirement
>>>> and unbelievable slowness.
>
>> sure, that was mostly meant as an experiment; if I had to do this on
>> a production database I would most likely use CLUSTER to get the
>> desired effect (which in my case was purely getting back the disk
>> space wasted by dead tuples)
>
> Yeah, the VACUUM FULL algorithm is really designed for situations
> where just a fraction of the rows have to be moved to re-compact the
> table. It might be interesting to teach it to abandon that plan and go
> to a CLUSTER-like table rewrite once the percentage of dead space is
> seen to reach some suitable level. CLUSTER has its own disadvantages,
> though (2X peak disk space usage, doesn't work on core catalogs, etc).

I get bitten by this quite often (customer machines, one giant table,
purge out a lot of old data).

CLUSTER is great for that, given the headroom, though I've often
resorted to a dump and restore because I've not had the headroom for
CLUSTER, and it's a lot less downtime than a full vacuum.

While the right fix there is to redo the application engine side to use
table partitioning, I keep wondering whether it would be possible to
move rows near the end of the table to the beginning in one non-locking
phase (vacuum to populate the FSM with free space near the beginning of
the table, touch rows starting at the end of the table, repeat) and then
finish off with a vacuum full to tidy up the remainder and truncate the
files (or a simpler "lock the table and truncate anything unused at the
end").

Cheers, Steve
Luke Lonergan wrote:
> Stefan,
>
> On 3/10/06 11:48 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
>
>> 2 HBAs in the server, 2x2 possible paths to each LUN.
>> 6 disks for the WAL and 12 disks for the data.
>
> So - you have 18 disks worth of potential bandwidth, not factoring loss
> due to RAID. That's roughly 18 * 60 = 1,080 MB/s. If we organized that
> into four banks, one for each CPU, made each one RAID5 and left two
> disks for spares, you'd have 12 disks working for you at 720MB/s, which
> is possibly double the number of active FC channels you have - unless
> they are all active, in which case you have a nicely matched 800MB/s
> of FC.

wrong (or rather extremely optimistic): the array itself only has two
(redundant) FC loops (@2Gbit) to the attached expansion chassis. The
array has 2 active/active controllers (with a failover penalty) with two
host interfaces each; furthermore, it has write-cache mirroring (to the
standby controller) enabled, which means the traffic has to go over the
internal FC loop too.
Beside that, the host (as I said) itself only has two HBAs @2Gbit, which
are configured for failover - that limits the host's maximum available
bandwidth to less than 200MB/s per LUN.

>>> So, from 15 MB/s up to about 20 MB/s.
>
> Gee - seems a long distance from 700 MB/s potential :-)

well, the array is capable of about 110MB/s write per controller head
(a bit more than half of what would otherwise be possible, due to write
mirroring being enabled, which uses delta-synchronisation).
WAL and data are on different controllers though by default.

>> the IO-System I use should be capable of doing that if pushed hard
>> enough :-)
>
> I would expect some 10x this if configured well.

see above ...

>> interesting to know, but still I'm testing/playing with postgresql here,
>> not bizgres MPP ...
>
> Sure. Still, what I'd expect is something like 10x this update rate
> using the parallelism buried in your hardware.
>
> If you configure the same machine with 4 Bizgres MPP segments running on
> 4 LUNs I think you'd be shocked at the speedups.

that might be true; though it might sound a bit harsh, I really prefer
to spend the small amount of spare time I have with testing (and helping
to improve if possible) postgresql than playing with a piece of
commercial software I'm not going to use anyway ...

Stefan
Stefan,

On 3/10/06 12:23 PM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

> wrong (or rather extremely optimistic): the array itself only has two
> (redundant) FC loops (@2Gbit) to the attached expansion chassis. The
> array has 2 active/active controllers (with a failover penalty) with
> two host interfaces each; furthermore, it has write-cache mirroring (to
> the standby controller) enabled, which means the traffic has to go over
> the internal FC loop too.
> Beside that, the host (as I said) itself only has two HBAs @2Gbit,
> which are configured for failover - that limits the host's maximum
> available bandwidth to less than 200MB/s per LUN.

Wow - the ickiness of SAN from a performance / value standpoint never
ceases to astound me.

>> Gee - seems a long distance from 700 MB/s potential :-)
>
> well, the array is capable of about 110MB/s write per controller head
> (a bit more than half of what would otherwise be possible, due to write
> mirroring being enabled, which uses delta-synchronisation).
> WAL and data are on different controllers though by default.

So - you're getting 20MB/s on loading from a potential of 200MB/s?

>> I would expect some 10x this if configured well.
>
> see above ...

OTOH - "configured well" could include taking the disks out of the smart
(?) chassis, plugging them into a dumb chassis and deploying 2 dual
channel U320 SCSI adapters - total cost of about $3,000.

> that might be true; though it might sound a bit harsh, I really prefer
> to spend the small amount of spare time I have with testing (and
> helping to improve if possible) postgresql than playing with a piece of
> commercial software I'm not going to use anyway ...

No problem - that's our job anyway: to make the case for Postgres' use
in typical large scale use-cases like the one you describe.

- Luke
Luke Lonergan wrote:
> Stefan,
>
> On 3/10/06 12:23 PM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
>
>> wrong (or rather extremely optimistic): the array itself only has two
>> (redundant) FC loops (@2Gbit) to the attached expansion chassis. The
>> array has 2 active/active controllers (with a failover penalty) with
>> two host interfaces each; furthermore, it has write-cache mirroring
>> (to the standby controller) enabled, which means the traffic has to
>> go over the internal FC loop too.
>> Beside that, the host (as I said) itself only has two HBAs @2Gbit,
>> which are configured for failover - that limits the host's maximum
>> available bandwidth to less than 200MB/s per LUN.
>
> Wow - the ickiness of SAN from a performance / value standpoint never
> ceases to astound me.

Well, while you make it sound a bit like that, performance is not
everything. One has to factor in manageability, scalability (in terms of
future upgrades using the same platform and such) and high-availability
features too. With that in mind, a SAN (or a NAS - depends on the actual
use cases) suddenly looks much more interesting than plain old DASD.

>>> Gee - seems a long distance from 700 MB/s potential :-)
>>
>> well, the array is capable of about 110MB/s write per controller head
>> (a bit more than half of what would otherwise be possible, due to
>> write mirroring being enabled, which uses delta-synchronisation).
>> WAL and data are on different controllers though by default.
>
> So - you're getting 20MB/s on loading from a potential of 200MB/s?

no - I can write 110MB/s on the WAL LUN and 110MB/s on the other LUN
concurrently.

>>> I would expect some 10x this if configured well.
>>
>> see above ...
>
> OTOH - "configured well" could include taking the disks out of the
> smart (?) chassis, plugging them into a dumb chassis and deploying 2
> dual channel U320 SCSI adapters - total cost of about $3,000.

as I said above, even if that would work (it does not, because the disks
have FC connectors), I would lose a LOT of features, like being able to
use the SAN for more than a single host (big one!), doing firmware
upgrades without downtime, using SAN replication, having cable lengths
exceeding 12m (makes it possible to place parts of the infrastructure at
remote sites), out-of-band management, scriptability(!), ...

Beside that, sequential IO, which you are advocating everywhere, is NOT
the holy grail or the sole solution to a fast database. While the SAN
above really is not a screamer for that kind of application, it is
actually a very good performer (compared with some of the DASD-based
boxes) under heavy random IO and concurrent load. This has a direct,
measurable influence on the overall speed of our production
applications, which are mostly OLTP ;-)

>> that might be true; though it might sound a bit harsh, I really prefer
>> to spend the small amount of spare time I have with testing (and
>> helping to improve if possible) postgresql than playing with a piece
>> of commercial software I'm not going to use anyway ...
>
> No problem - that's our job anyway: to make the case for Postgres' use
> in typical large scale use-cases like the one you describe.

yep

Stefan
Stefan,

On 3/11/06 12:21 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

>> So - you're getting 20MB/s on loading from a potential of 200MB/s?
>
> no - I can write 110MB/s on the WAL LUN and 110MB/s on the other LUN
> concurrently.

The numbers you published earlier show you are getting a maximum of
20MB/s on data loading. It's CPU limited by Postgres COPY.

> Beside that, sequential IO, which you are advocating everywhere, is NOT
> the holy grail or the sole solution to a fast database. While the SAN
> above really is not a screamer for that kind of application, it is
> actually a very good performer (compared with some of the DASD-based
> boxes) under heavy random IO and concurrent load. This has a direct,
> measurable influence on the overall speed of our production
> applications, which are mostly OLTP ;-)

The same DASD can be configured with RAID10 and will far surpass the
external FC SAN configuration you describe at the same price. The DASD
story is not all about sequential I/O; the main limitation is the number
of devices you can make available using DASD, and that's a Postgres
limitation that we solve.

For OLTP apps, you may be fast enough with the limited bandwidth of your
FC SAN, but it would still be faster with hundreds of channels and
disks.

- Luke
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>
>>>> 3. vacuuming this table - it turned out that VACUUM FULL is
>>>> completely unusable on a table of this size, not only due to the
>>>> locking involved but rather due to a gigantic memory requirement
>>>> and unbelievable slowness.
>
>> sure, that was mostly meant as an experiment; if I had to do this on a
>> production database I would most likely use CLUSTER to get the desired
>> effect (which in my case was purely getting back the disk space wasted
>> by dead tuples)
>
> Yeah, the VACUUM FULL algorithm is really designed for situations where
> just a fraction of the rows have to be moved to re-compact the table.
> It might be interesting to teach it to abandon that plan and go to a
> CLUSTER-like table rewrite once the percentage of dead space is seen to
> reach some suitable level. CLUSTER has its own disadvantages, though
> (2X peak disk space usage, doesn't work on core catalogs, etc).

hmm, very interesting idea - I like it. But from what I have seen,
people quite often use VACUUM FULL to get their disk usage down
_because_ they are running low on space (and because it's not that well
known that CLUSTER can be much faster) - maybe we should at least add a
note/hint about this to the maintenance/vacuum docs?

Stefan
On Friday, 2006-03-10 at 12:23, Steve Atkins wrote:

> I get bitten by this quite often (customer machines, one giant table,
> purge out a lot of old data).
>
> CLUSTER is great for that, given the headroom, though I've often
> resorted to a dump and restore because I've not had the headroom for
> CLUSTER, and it's a lot less downtime than a full vacuum.
>
> While the right fix there is to redo the application engine side to
> use table partitioning, I keep wondering whether it would be possible
> to move rows near the end of the table to the beginning in one
> non-locking phase (vacuum to populate the FSM with free space near the
> beginning of the table, touch rows starting at the end of the table,
> repeat) and then finish off with a vacuum full to tidy up the
> remainder and truncate the files (or a simpler "lock the table and
> truncate anything unused at the end").

At some point I had to compress a very busily updated table. I used the
following approach:

1) VACUUM buzytable; (lazy, not full)

2) SELECT primary_key_value, ctid FROM buzytable;

3) extract the N last records from there, and for each keep repeating

   3A) UPDATE buzytable
          SET primary_key_value = primary_key_value
        WHERE primary_key_value = extracted_value;

   3B) SELECT ctid FROM buzytable
        WHERE primary_key_value = extracted_value;

   until the tuple has moved to another page, hopefully nearer to the
   beginning of the table

and repeat from 1) until the page of the last row (extracted from the
ctid) is smaller than some threshold.

This was the only way I was able to get the table back to a small enough
size without service interruption.

--------------
Hannu
Hannu Krosing <hannu@skype.net> writes:
> At some point I had to compress a very busily updated table. I used the
> following approach:
> [ move a few rows at a time ]

We could possibly do something similar with VACUUM FULL as well: once
maintenance_work_mem is filled, start discarding per-page data to stay
under the memory limit. This would mean that some area near the middle
of the table remains uncompacted, but it would allow putting an upper
bound on the time and space used by any one pass of VACUUM FULL ...

regards, tom lane
On Sat, Mar 11, 2006 at 10:21:43PM +0200, Hannu Krosing wrote:
>> table partitioning, I keep wondering whether it would be possible to
>> move rows near the end of the table to the beginning in one
>> non-locking phase (vacuum to populate the FSM with free space near
>> the beginning of the table, touch rows starting at the end of the
>> table, repeat) and then finish off with a vacuum full to tidy up the
>> remainder and truncate the files (or a simpler "lock the table and
>> truncate anything unused at the end").
>
> At some point I had to compress a very busily updated table. I used the
> following approach:
>
> 1) VACUUM buzytable; (lazy, not full)
>
> 2) SELECT primary_key_value, ctid FROM buzytable;
>
> 3) extract the N last records from there, and for each keep repeating
>
>    3A) UPDATE buzytable
>           SET primary_key_value = primary_key_value
>         WHERE primary_key_value = extracted_value;
>
>    3B) SELECT ctid FROM buzytable
>         WHERE primary_key_value = extracted_value;
>
>    until the tuple has moved to another page, hopefully nearer to the
>    beginning of the table
>
> and repeat from 1) until the page of the last row (extracted from the
> ctid) is smaller than some threshold.

BTW, this is what the following TODO would hopefully fix:

    Allow FSM to return free space toward the beginning of the heap
    file, in hopes that empty pages at the end can be truncated by
    VACUUM

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf         cell: 512-569-9461