Thread: 8.3 / 8.2.6 restore comparison
Hello,

I have been testing a migration for a week now, trying to get it into a
reasonable state. This is what we have:

Restore file: 220G

8.2.6 and 8.3.0 are configured identically:

shared_buffers = 8000MB
work_mem = 32MB
maintenance_work_mem = 512MB
fsync = off
full_page_writes = off
checkpoint_segments = 300
synchronous_commit = off (8.3)
wal_writer_delay = off (8.3)
autovacuum = off

8.2.6 after 2 hours has restored 41GB.
8.3.0 after 2.5 hours has restored 38GB.

Originally I was thinking that 8.2.6 was stomping 8.3. However, I am now
thinking that the reduction in tuple header size in 8.3 means that although
I have restored only 38GB, it is actually *more* data than 8.2.6's 41GB.
Does that seem accurate to everyone else? If so, what can we do to speed
this up? We are certainly *not* saturating the disks (16 spindles, SCSI).

I am thinking the way we are going to need to do this is to have an
extended outage and write a custom script to do a concurrent dump and
load. (No, in this case Slony is not an option.)

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit
Joshua D. Drake wrote:
> Hello,
>
> I have been testing a migration for a week now trying to get it into a
> reasonable state. This is what we have:
>
> Restore file: 220G
>
> 8.2.6 and 8.3.0 are configured identically:
>
> shared_buffers = 8000MB
> work_mem = 32MB
> maintenance_work_mem = 512MB
> fsync = off
> full_page_writes = off
> checkpoint_segments = 300
> synchronous_commit = off (8.3)
> wal_writer_delay = off (8.3)
> autovacuum = off
>
> 8.2.6 after 2 hours has restored 41GB.
> 8.3.0 after 2.5 hours has restored 38GB.
>
> Originally I was thinking that 8.2.6 was stomping 8.3. However, I am now
> thinking that the reduction in tuple header size in 8.3 means that although
> I have restored only 38GB, it is actually *more* data than 8.2.6's 41GB.
> Does that seem accurate to everyone else? If so, what can we do to speed
> this up? We are certainly *not* saturating the disks (16 spindles, SCSI).
>
> I am thinking the way we are going to need to do this is to have an
> extended outage and write a custom script to do a concurrent dump and
> load. (No, in this case Slony is not an option.)

I just tested a ~110GB load. On our modest backup server, 8.2 yesterday
did the data load (i.e. the COPY steps) in 1h57m. Today, 8.3 on identical
data and settings took 1h42m. Relation size is down by about 10% too,
which is very nice, and probably accounts for the load time improvement.

cheers

andrew
On Thu, 07 Feb 2008 09:47:08 -0500
Andrew Dunstan <andrew@dunslane.net> wrote:

> > Restore file: 220G
> >
> > 8.2.6 and 8.3.0 are configured identically:
> >
> > shared_buffers = 8000MB
> > work_mem = 32MB
> > maintenance_work_mem = 512MB
> > fsync = off
> > full_page_writes = off
> > checkpoint_segments = 300
> > synchronous_commit = off (8.3)
> > wal_writer_delay = off (8.3)
> > autovacuum = off
> >
> > 8.2.6 after 2 hours has restored 41GB.
> > 8.3.0 after 2.5 hours has restored 38GB.
>
> I just tested a ~110GB load. On our modest backup server, 8.2
> yesterday did the data load (i.e. the COPY steps) in 1h57m. Today,
> 8.3 on identical data and settings took 1h42m. Relation size is down
> by about 10% too, which is very nice, and probably accounts for the
> load time improvement.

Ergghh o.k. I am definitely missing something in the environment. By
your numbers I should be well over 100GB restored at 2.5 hours. I am
not. I am only 38GB in.

What type of IO do you have on that machine? What type of CPU and RAM?

Sincerely,

Joshua D. Drake
* Joshua D. Drake (jd@commandprompt.com) wrote: > Ergghh o.k. I am definitely missing something in the environment. By > your numbers I should be well over 100GB restored at 2.5 hours. I am > not. I am only 38GB in. I'm guessing you've checked this, so don't shoot me if you have, but.... How was the "restore file" built? Does it create the indexes, primary keys, FKs, whatever, before loading the data? That'd slow things down tremendously.. Or if it's creating them while loading the data, there would be large pauses while it's building the indexes... Enjoy, Stephen
Joshua D. Drake wrote:
> On Thu, 07 Feb 2008 09:47:08 -0500
> Andrew Dunstan <andrew@dunslane.net> wrote:
>
>>> Restore file: 220G
>>>
>>> 8.2.6 and 8.3.0 are configured identically:
>>>
>>> shared_buffers = 8000MB
>>> work_mem = 32MB
>>> maintenance_work_mem = 512MB
>>> fsync = off
>>> full_page_writes = off
>>> checkpoint_segments = 300
>>> synchronous_commit = off (8.3)
>>> wal_writer_delay = off (8.3)
>>> autovacuum = off
>>>
>>> 8.2.6 after 2 hours has restored 41GB.
>>> 8.3.0 after 2.5 hours has restored 38GB.
>>
>> I just tested a ~110GB load. On our modest backup server, 8.2
>> yesterday did the data load (i.e. the COPY steps) in 1h57m. Today,
>> 8.3 on identical data and settings took 1h42m. Relation size is down
>> by about 10% too, which is very nice, and probably accounts for the
>> load time improvement.
>
> Ergghh o.k. I am definitely missing something in the environment. By
> your numbers I should be well over 100GB restored at 2.5 hours. I am
> not. I am only 38GB in.
>
> What type of IO do you have on that machine? What type of CPU and RAM?

2Ghz Xeon dual core
16 Gb RAM
HW RAID0 data store - not sure how many spindles

I didn't touch any of the 8.3-only config stuff. I have work_mem set a
lot higher than you do, though - not sure if that makes any difference
to a bulk load.

This is not a very heavy duty box.

Note: a full restore takes much longer than this - it is almost all
taken up building indexes. I will be testing that over the weekend,
probably.

cheers

andrew
On Thu, 07 Feb 2008 11:20:49 -0500
Andrew Dunstan <andrew@dunslane.net> wrote:

> Joshua D. Drake wrote:
> > On Thu, 07 Feb 2008 09:47:08 -0500
> > Andrew Dunstan <andrew@dunslane.net> wrote:
> >
> >>> Restore file: 220G
> >>>
> >>> 8.2.6 and 8.3.0 are configured identically:
> >>>
> >>> shared_buffers = 8000MB
> >>> work_mem = 32MB
> >>> maintenance_work_mem = 512MB
> >>> fsync = off
> >>> full_page_writes = off
> >>> checkpoint_segments = 300
> >>> synchronous_commit = off (8.3)
> >>> wal_writer_delay = off (8.3)
> >>> autovacuum = off
> >>>
> >>> 8.2.6 after 2 hours has restored 41GB.
> >>> 8.3.0 after 2.5 hours has restored 38GB.
> >>
> >> I just tested a ~110GB load. On our modest backup server, 8.2
> >> yesterday did the data load (i.e. the COPY steps) in 1h57m. Today,
> >> 8.3 on identical data and settings took 1h42m. Relation size is
> >> down by about 10% too, which is very nice, and probably accounts
> >> for the load time improvement.
> >
> > Ergghh o.k. I am definitely missing something in the environment. By
> > your numbers I should be well over 100GB restored at 2.5 hours. I am
> > not. I am only 38GB in.
> >
> > What type of IO do you have on that machine? What type of CPU and
> > RAM?
>
> 2Ghz Xeon dual core
> 16 Gb RAM
> HW RAID0 data store - not sure how many spindles
>
> I didn't touch any of the 8.3-only config stuff. I have work_mem set
> a lot higher than you do, though - not sure if that makes any
> difference to a bulk load.
>
> This is not a very heavy duty box.

Depends on how you look at it. Restores are traditionally CPU bound if
you have any kind of IO, and if you have RAID 0 you have some IO
available to you.

It bothers me that the environment I am testing, which has 16 spindles
(8 per volume), is garnering such miserable performance in comparison.
In my environment we aren't able to push the I/O at all (wait times
< 3%), and yet these are Opterons, which have great memory bandwidth.

8.2 did use more I/O than 8.3, but I think that is just architectural
with the change to 8.3 in general.

Joshua D. Drake
On Thu, 7 Feb 2008 11:11:32 -0500
Stephen Frost <sfrost@snowman.net> wrote:

> * Joshua D. Drake (jd@commandprompt.com) wrote:
> > Ergghh o.k. I am definitely missing something in the environment. By
> > your numbers I should be well over 100GB restored at 2.5 hours. I am
> > not. I am only 38GB in.
>
> I'm guessing you've checked this, so don't shoot me if you have,
> but....

Uhh yeah :)

> How was the "restore file" built? Does it create the
> indexes, primary keys, FKs, whatever, before loading the data?
> That'd slow things down tremendously.. Or if it's creating them
> while loading the data, there would be large pauses while it's
> building the indexes...

Absolutely correct, it would. This dump was created using pg_dumpall.

Sincerely,

Joshua D. Drake
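For what it's worth, pg_dumpall writes one plain-SQL script, so the order
of data loads and index builds cannot be rearranged at restore time. A
minimal sketch of the more flexible route, assuming a single database
named mydb (the database and file names here are hypothetical), is to dump
the globals separately and use pg_dump's custom format, which pg_restore
can later reorder or restore selectively:

    pg_dumpall --globals-only > globals.sql   # roles and tablespaces only
    pg_dump -Fc mydb -f mydb.dump             # custom-format dump of the database
    psql -f globals.sql postgres              # restore globals on the new cluster
    pg_restore -C -d postgres mydb.dump       # create mydb and restore into it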
On Wed, 6 Feb 2008, Joshua D. Drake wrote:

> 8.2.6 after 2 hours has restored 41GB.

I've been doing a long bulk import job recently (COPY) on a box with
more spindles than yours (but with a dumb controller) and I too am stuck
at that speed; I calculate a consistent 19.6GB/hour. The actual disk I/O
is very low, as that works out to only 5.7MB/s of progress. Mine was
bottlenecked by the capacity of a single CPU (4X Opteron system). I
think this is one of those barriers that is hard to crack without a
caching controller, for reasons I haven't figured out completely yet.

> I am thinking the way we are going to need to do this is to have an
> extended outage and write a custom script to do a concurrent dump and
> load.

If you look at the -performance list this week I've been yelping about
this issue and trying to figure out how to set up a useful multi-CPU
loader for cases like these.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, 7 Feb 2008 11:54:07 -0500 (EST)
Greg Smith <gsmith@gregsmith.com> wrote:

> On Wed, 6 Feb 2008, Joshua D. Drake wrote:
>
> > 8.2.6 after 2 hours has restored 41GB.
>
> I've been doing a long bulk import job recently (COPY) on a box with
> more spindles than yours (but with a dumb controller) and I too am
> stuck at that speed; I calculate a consistent 19.6GB/hour.

Bingo! O.k. that is ~ what I am pushing. So I am not alone in this
world.

> The actual disk I/O is very low, as that works out to only 5.7MB/s of
> progress. Mine was bottlenecked by the capacity of a single CPU (4X
> Opteron system). I think this is one of those barriers that is hard to
> crack without a caching controller, for reasons I haven't figured out
> completely yet.

O.k. we have 8X but it is also Opteron.

> > I am thinking the way we are going to need to do this is to have an
> > extended outage and write a custom script to do a concurrent dump
> > and load.
>
> If you look at the -performance list this week I've been yelping
> about this issue and trying to figure out how to set up a useful
> multi-CPU loader for cases like these.

I would think that for the custom format this would be reasonably
simple. It is certainly something we need to figure out. The idea that
PostgreSQL can only restore 20G an hour on a box that can actually
write 120+ megs a second (random) is pathetic.

I know Luke has mentioned some issues in the past as well around CPU
boundedness with an upper limit of 300M/s (IIRC), but even that doesn't
equate to what is going on here, as we are not getting anywhere near
that.

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > I know Luke has mentioned some issues in the past as well around CPU > boundness with an upper limit of 300M/s (IIRC) but even that doesn't > equate to what is going on here as we are not getting anywhere near > that. Some vmstat and oprofile investigation seems called for. Right now we're just guessing about what the bottleneck is. regards, tom lane
On Thu, 07 Feb 2008 13:47:22 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > I know Luke has mentioned some issues in the past as well around CPU
> > boundness with an upper limit of 300M/s (IIRC) but even that doesn't
> > equate to what is going on here as we are not getting anywhere near
> > that.
>
> Some vmstat and oprofile investigation seems called for. Right now
> we're just guessing about what the bottleneck is.

vmstat -adSK 5

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b  swpd     free    inact   active si so    bi    bo   in   cs us sy id wa
 0  2   800 15853104  7150484  8581708  0  0  1571   387    0    0 10  7 71 12
 2  0   800 15668016  7335564  8581456  0  0 12982 24382 2699 1421 11  2 82  5
 1  0   800 15488688  7513572  8583016  0  0 13006 24351 2601 1322 10  2 82  6
 1  0   800 15296024  7704488  8584160  0  0 13494 23324 2430 1135 11  2 84  3
 1  0   800 15017240  7984204  8582308  0  0 17219 29910 2632 1427 10  3 86  1
 1  2   800 14754224  8239160  8583344  0  0 13710 41932 2666 1371  8  2 80  9
 1  3   800 14590216  8390680  8594228  0  0  8112 33920 2361 1351  5  2 72 20
 1  0   800 14435000  8562800  8585540  0  0  9138 23773 3114 1461  6  2 75 17
 2  0   800 14147160  8848424  8587072  0  0 14967 35206 3156 2675 10  4 76 10
 1  0   800 13850560  9141480  8588632  0  0 15454 42181 3047 2633 10  4 72 14
 1  1   800 13765824  9212296  8603744  0  0  4805 33795 2903 6312  4  9 68 20
 1  0   800 13552088  9436156  8594284  0  0 11916 20255 2505 1695  7  3 79 11
 1  0   800 13375152  9612124  8595336  0  0 11829 26774 2644 1359 11  2 83  4
 1  0   800 13301232  9684104  8596356  0  0  4747 21765 2173  829 12  1 82  5
 1  0   800 13294680  9693060  8597360  0  0   386  1786 1732  587 12  0 87  0
 1  0   800 13284448  9702212  8598452  0  0   410  1871 1939  878 13  0 86  1
 1  0   800 13273056  9711180  8599452  0  0   386  1870 1996  907 12  0 87  0
 1  0   800 13235488  9748400  8600492  0  0  1824  1862 2128 1107 12  1 86  1
 1  0   800 13019720  9962004  8602264  0  0 17773 19770 2632 1596 11  2 81  6
 1  0   800 12832288 10143020  8603912  0  0 18359 19553 2627 1586 11  2 78  8
 1  0   800 12665952 10313812  8605088  0  0 17102 17550 2587 1582 11  2 83  5
 1  0   800 12491488 10486568  8606292  0  0 16795 17964 2514 1488 11  2 85  2
 1  0   800 12317592 10659160  8607484  0  0 17103 18222 2543 1531 11  2 82  5
 2  0   800 12137840 10836760  8608636  0  0 17562 17909 2526 1496 11  2 82  6

This is 8.2.6, 64bit.

Sincerely,

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Some vmstat and oprofile investigation seems called for. Right now >> we're just guessing about what the bottleneck is. > vmstat -adSK 5 > ... Looks to me like you're disk-bound (and your kernel is pretty bad at distinguishing "idle" from "disk wait" states). Plain COPY into an index-less table ought to be writing fairly linearly, so I'm surprised you aren't able to get more out of the disk. Have you got WAL on a separate spindle from the database? regards, tom lane
On Thu, 07 Feb 2008 19:20:26 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Looks to me like you're disk-bound (and your kernel is pretty bad at
> distinguishing "idle" from "disk wait" states). Plain COPY into an
> index-less table ought to be writing fairly linearly, so I'm surprised
> you aren't able to get more out of the disk. Have you got WAL on
> a separate spindle from the database?

I didn't, but do now and am testing. The way this is currently designed
is:

/data1    (8 disks RAID 10)
/data2    (8 disks RAID 10)
/pg_xlogs (2 disks RAID 1)

/data1 is what is primarily written against for the first couple of
hours and then it will switch to data2 because of table spaces. However
either way, we should expect (I would think) to see at least 100 megs a
second on an 8 disk RAID 10. It is SCSI.

vmstat -adSK 5

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b  swpd    free    inact  active si so    bi    bo   in   cs us sy id wa
 1  1   160 6534048 23711184 1434532  0  0  1571   387    0    0 10  7 71 12
 1  1   160 6206560 23898644 1573680  0  0 18168 23851 2684 1647 11  3 84  2
 1  2   160 5887072 24080840 1711060  0  0 17655 58395 3322 1762 11  3 74 11
 2  2   160 5546160 24275268 1856108  0  0 18681 61834 3302 1838 11  3 74 11
 1  2   160 5363432 24383836 1933028  0  0 13043 52353 3247 1578 12  2 74 11
 1  1   160 5295336 24423764 1963808  0  0  4997 39778 2865 1310 12  1 75 12
 1  0   160 5284184 24426444 1971608  0  0   385  6967 2087  955 13  1 84  3
 1  0   160 5274608 24429180 1978236  0  0   384  1009 2030  958 13  0 87  0
 1  0   160 5265408 24431976 1984688  0  0   410   946 1780  656 12  0 87  0
 1  1   160 5224912 24453296 2003700  0  0  1998 23168 2419 1163 12  1 79  8
 1  2   160 5005048 24605300 2069236  0  0 18270 41342 2942 1721 12  2 76 10
 1  2   160 4816016 24748120 2117208  0  0 19014 37165 3014 1702 12  2 72 15
 1  1   160 4618064 24894476 2167412  0  0 19091 15226 2517 1626 12  2 83  3
 1  1   160 4423376 25039072 2218324  0  0 18603 10398 2585 1644 12  2 84  3
 1  1   160 4223568 25187592 2268816  0  0 19475 10326 2511 1666 12  2 84  2
 1  1   160 4023056 25337080 2319844  0  0 19450 10601 2462 1543 12  2 84  2
 1  1   160 3821328 25487664 2370328  0  0 19834 10657 2454 1540 12  2 84  2
 1  1   160 3617616 25640832 2420820  0  0 20398 11202 2536 1678 12  2 84  2
 1  2   160 3410040 25795488 2473800  0  0 20090 29063 2756 1680 12  2 76 10
 1  1   160 3189064 25957688 2531996  0  0 20321 13745 2523 1595 12  2 83  4

Sincerely,

Joshua D. Drake
On Thu, 7 Feb 2008 16:37:39 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

> I didn't, but do now and am testing. The way this is currently
> designed is:
>
> /data1    (8 disks RAID 10)
> /data2    (8 disks RAID 10)
> /pg_xlogs (2 disks RAID 1)
>
> /data1 is what is primarily written against for the first couple of
> hours and then it will switch to data2 because of table spaces.
> However either way, we should expect (I would think) to see at least
> 100 megs a second on an 8 disk RAID 10. It is SCSI.

Following up with this... 22G in one hour, with xlogs on a different
partition. Just looking at it, we are averaging 3-5% IOWait; further, we
are only writing ~2 megs a second. This is frustrating.

Sincerely,

Joshua D. Drake
On Thu, 07 Feb 2008 13:47:22 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > > I know Luke has mentioned some issues in the past as well around CPU > > boundness with an upper limit of 300M/s (IIRC) but even that doesn't > > equate to what is going on here as we are not getting anywhere near > > that. > > Some vmstat and oprofile investigation seems called for. Right now > we're just guessing about what the bottleneck is. I also recommend looking into partition alignment on the disks. At least to quantify how much the storage subsystem suffers from mis-aligned partitions. It could be that it doesn't. Regards, Mark
"Joshua D. Drake" <jd@commandprompt.com> writes: > Following up with this... 22G in one hour, with xlogs on a different > partition. Just looking at we are averaging 3-5% IOWait, further we are > only writing ~ 2Megs a second. It might be interesting to capture some strace output and get a sense of what's actually being read and written. It seems to me that the bad I/O rate has to indicate that we're seeking all over the place. We should try to determine exactly what's being touched in what order. regards, tom lane
On Thu, 07 Feb 2008 21:04:44 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > Following up with this... 22G in one hour, with xlogs on a different
> > partition. Just looking at it, we are averaging 3-5% IOWait; further,
> > we are only writing ~2 megs a second.
>
> It might be interesting to capture some strace output and get a sense
> of what's actually being read and written. It seems to me that the
> bad I/O rate has to indicate that we're seeking all over the place.
> We should try to determine exactly what's being touched in what order.

Do you want counts or actual output?

Joshua D. Drake
On Thu, 7 Feb 2008 18:32:41 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

> Do you want counts or actual output?

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 58.22    1.597638          33     47795           write
 31.86    0.874104          23     38024           recvfrom
  4.43    0.121559          92      1319           munmap
  4.17    0.114445           3     37906           lseek
  0.70    0.019341          13      1440           brk
  0.56    0.015402          12      1316           mmap
  0.04    0.000991          26        38           open
  0.01    0.000298           8        38           close
  0.01    0.000156           4        38           time
  0.00    0.000008           8         1           semop
------ ----------- ----------- --------- --------- ----------------
100.00    2.743942                127915           total
On Thu, 7 Feb 2008 18:40:37 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

> On Thu, 7 Feb 2008 18:32:41 -0800
> "Joshua D. Drake" <jd@commandprompt.com> wrote:
>
> > Do you want counts or actual output?
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  58.22    1.597638          33     47795           write
>  31.86    0.874104          23     38024           recvfrom
>   4.43    0.121559          92      1319           munmap
>   4.17    0.114445           3     37906           lseek
>   0.70    0.019341          13      1440           brk
>   0.56    0.015402          12      1316           mmap
>   0.04    0.000991          26        38           open
>   0.01    0.000298           8        38           close
>   0.01    0.000156           4        38           time
>   0.00    0.000008           8         1           semop
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    2.743942                127915           total

Were there any thoughts on this? I am also finding that backing up this
database is rudely slow, with the same type of odd metrics (almost zero
(or zero) iowait). We can saturate a CPU, but the CPU is certainly not
pushing the data to disk as fast as it could.

Joshua D. Drake
On Tue, 19 Feb 2008 13:21:46 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

> Were there any thoughts on this? I am also finding that backing up
> this database is rudely slow with the same type of odd metrics
> (almost zero (or zero) iowait). We can saturate a CPU but the CPU is
> certainly not pushing the data to disk as fast as it could.

Further on this: we have tested on RHEL4 and RHEL5. There are two
machines, each with 32GB of RAM. I have four of these in the RHEL5
machine:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8216
stepping        : 2
cpu MHz         : 2411.132
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 4823.59
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

Here are the mini specs:

# 5U Q524 - Quad Opteron 24 SCSI
# Tyan S4885G3NR 800 Series CPU Opteron 800
# AMD Opteron 880 - 2.4GHz 2Core x 4
# 32GB - DDR ECC REG 400MHz (16x2GB) x 1 (16 x 2GB 3200 ECC REG Smart Modular (32GB))
# HD 1: 73GB SCSI 15K RPM x 24
# PCI 1: MegaRaid SCSI 320-2E - 2ch, U320, PCI Express, 128MB
#   MegaRaid LSIiTBBU01 Battery - Order #: LSI00009
# PCI 2: MegaRaid SCSI 320-2E - 2ch, U320, PCI Express, 128MB
#   MegaRaid LSIiTBBU01 Battery - Order #: LSI00009
# DVD-ROM/Sony 3.5 Inch Floppy Drive

The configuration is:

/        RAID 1
/xlogs   RAID 1
/data1   10 drives RAID 10
/data2   10 drives RAID 10

The thing that is frustrating here is that it appears PostgreSQL just
can't utilize the hardware. I *know* it can, because we have larger
machines in production that use PostgreSQL happily. However, when I have
220G backups taking 8 hours and restores taking 11 hours, I begin to
wonder where the bottleneck is.

Assuming 25 megs a second per drive (random write) on data1 and data2,
we should be pushing 250M a second. Just to be insanely conservative,
let's cut that in half to 125M per second. That means every 10 seconds
we should do ~1G, every minute ~6G, which means 360G an hour.

Granted, we have index creation and everything else going on, but 11
hours and no IO chewing?

As a note, these results are reproducible on both machines, regardless
of RHEL5 or RHEL4.

I know there are not many stats here, but I have provided them in
previous posts on this thread. Perhaps someone sees a red flag in the
hardware?

Sincerely,

Joshua D. Drake
On Tue, 19 Feb 2008 13:36:48 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

Hello,

Some more testing on this. This time (using 8.3) I modified my restore
process to use multiple processes by manipulating TOC files. I used
three processes for the data copies, two processes for the pk creation,
two processes for normal indexes, and two processes for constraint
creation.

The machine averaged 40-60MB/s write versus the pathetic ~2/3 MB/s on a
single thread. It had an average I/O wait of < 10%. Lastly, it restored
57G of data in 1.25 hours. Under my single-thread testing, 57G would
have taken ~3 hours. I am pretty sure I can make it faster too, as I
wasn't balancing with tablespaces, nor did I move the xlogs off.

IMO this pretty much proves that we need to seriously look at
multi-connection restores. I can't imagine a situation where we look at
people that have dual and quad cores on their desktops and say... sorry,
we can't use that to help you get your data in quicker.

Sincerely,

Joshua D. Drake
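A rough sketch of that TOC manipulation, with hypothetical dump and list
file names (splitting the data entries across lists is done by hand or by
a small script, and the schema itself is assumed to have been restored
first):

    pg_restore -l mydb.dump > full.toc         # list every entry in the archive
    grep 'TABLE DATA' full.toc > data.toc      # keep only the data-loading entries
    # split data.toc into data1.toc, data2.toc and data3.toc, balancing the
    # largest tables across the three lists, then run the loads concurrently:
    pg_restore -L data1.toc -d mydb mydb.dump &
    pg_restore -L data2.toc -d mydb mydb.dump &
    pg_restore -L data3.toc -d mydb mydb.dump &
    wait
    # repeat with two lists each for the primary key, index and constraint entries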
Bah.

It's the stuff in the format translation path and conversion to/from
datums that is the bottleneck. We sped up COPY TO recently by a factor
of 10 using similar approaches to what we did for COPY FROM in the past.
There's a format conversion that is the culprit.

We routinely get about 12 MB/s of heap insertion rate per CPU core and
it's CPU bound.

You can peek in on what's happening using gstack on Linux, or the gdb
"attach and print stacktrace" approach for a crude profile.

- Luke

On 2/19/08 1:36 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote:

> On Tue, 19 Feb 2008 13:21:46 -0800
> "Joshua D. Drake" <jd@commandprompt.com> wrote:
>
>> Were there any thoughts on this? I am also finding that backing up
>> this database is rudely slow with the same type of odd metrics
>> (almost zero (or zero) iowait). We can saturate a CPU but the CPU is
>> certainly not pushing the data to disk as fast as it could.
>
> Further on this: we have tested on RHEL4 and RHEL5. There are two
> machines, each with 32GB of RAM. I have four of these in the RHEL5
> machine:
>
> [...]
>
> The thing that is frustrating here is that it appears PostgreSQL just
> can't utilize the hardware. I *know* it can, because we have larger
> machines in production that use PostgreSQL happily. However, when I have
> 220G backups taking 8 hours and restores taking 11 hours, I begin to
> wonder where the bottleneck is.
>
> [...]
>
> I know there are not many stats here, but I have provided them in
> previous posts on this thread. Perhaps someone sees a red flag in the
> hardware?
On Tue, 19 Feb 2008 18:54:38 -0800
Luke Lonergan <llonergan@greenplum.com> wrote:

> Bah.
>
> It's the stuff in the format translation path and conversion to/from
> datums that is the bottleneck.

So how do we fix that? This performance is pretty anemic.

> We sped up COPY TO recently by a factor of 10 using similar
> approaches to what we did for COPY FROM in the past. There's a
> format conversion that is the culprit.

Any chance that code (in whatever shape) could be released for review
of applicability to core?

> We routinely get about 12 MB/s of heap insertion rate per CPU core
> and it's CPU bound.

That's embarrassing on today's hardware, don't you think? Or am I
missing something entirely?

> You can peek in on what's happening using gstack on Linux, or the gdb
> "attach and print stacktrace" approach for a crude profile.

Yeah, that... is beyond my abilities. Well, reading it is anyway. I can
provide any information people want, though. Tom? Greg? Andrew?
Somebody? What information do you want from me to help you track this
down?

Sincerely,

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Yeah, that... is beyond my abilities. Well reading it is anyway. I can > provide any information people want though. Tom? Greg? Andrew? Somebody? > What information do you want from me to help you track this down? The vmstat output you showed before said user CPU time was only around 12%, which seems to destroy Luke's thesis that data conversion time is the problem. IIRC we were speculating that data was being written in a pattern that required a lot of seeking thus ruining throughput, but we didn't have any hard evidence of that. Did you do the strace'ing I suggested? regards, tom lane
On Fri, 22 Feb 2008 12:39:57 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > Yeah, that... is beyond my abilities. Well, reading it is anyway. I
> > can provide any information people want, though. Tom? Greg? Andrew?
> > Somebody? What information do you want from me to help you track
> > this down?
>
> The vmstat output you showed before said user CPU time was only around
> 12%, which seems to destroy Luke's thesis that data conversion time
> is the problem. IIRC we were speculating that data was being written
> in a pattern that required a lot of seeking thus ruining throughput,
> but we didn't have any hard evidence of that. Did you do the
> strace'ing I suggested?

Yes, I asked if you wanted counts or the whole output. You didn't
answer :). I provided the counts.

http://archives.postgresql.org/pgsql-hackers/2008-02/msg00359.php

I won't be able to run another test until this weekend :( but anything
I have is yours.

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> IIRC we were speculating that data was being written >> in a pattern that required a lot of seeking thus ruining throughput, >> but we didn't have any hard evidence of that. Did you do the >> strace'ing I suggested? > Yes, I asked if you wanted counts or the whole output. You didn't > answer :). I provided the counts. Uh, sorry. Counts are useless here, we need to see the sequence of write locations to find out if there's a lot of nonconsecutive writes happening. BTW, the strace had better run across the whole PG process tree --- it's quite possible that there's some interaction between the bgwriter and the backend doing COPY, for instance. regards, tom lane
On Fri, 22 Feb 2008 12:54:00 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> IIRC we were speculating that data was being written
> >> in a pattern that required a lot of seeking thus ruining
> >> throughput, but we didn't have any hard evidence of that. Did you
> >> do the strace'ing I suggested?
>
> > Yes, I asked if you wanted counts or the whole output. You didn't
> > answer :). I provided the counts.
>
> Uh, sorry. Counts are useless here, we need to see the sequence of
> write locations to find out if there's a lot of nonconsecutive
> writes happening. BTW, the strace had better run across the whole
> PG process tree --- it's quite possible that there's some interaction
> between the bgwriter and the backend doing COPY, for instance.

So you want an strace on postmaster not the connection?

Sincerely,

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... BTW, the strace had better run across the whole >> PG process tree --- it's quite possible that there's some interaction >> between the bgwriter and the backend doing COPY, for instance. > So you want an strace on postmaster not the connection? You need to launch the postmaster about like this: strace -f postmaster ...postmaster args... 2>strace.out regards, tom lane regards, tom lane
On Fri, 22 Feb 2008, Tom Lane wrote: > Counts are useless here, we need to see the sequence of write locations > to find out if there's a lot of nonconsecutive writes happening. How were you planning to analyze the strace output to quantify that? I would think you'd need to parse the file to grab those locations then generate statistics about seeking from them. Those strace files will be huge and I'd expect Joshua would need to do something like that on his side before the results will be short enough to share. > it's quite possible that there's some interaction between the bgwriter > and the backend doing COPY It might be worthwhile to turn off the background writer for these runs, both to see if it changes anything and to simplify the output. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes: > On Fri, 22 Feb 2008, Tom Lane wrote: >> Counts are useless here, we need to see the sequence of write locations >> to find out if there's a lot of nonconsecutive writes happening. > How were you planning to analyze the strace output to quantify that? I didn't really have any preconceived ideas about that. I just want to see some raw data to see if something shows up. regards, tom lane
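One crude way to quantify the seeking, assuming the trace for the COPY
backend was saved to strace.out with lines that start with the syscall
name (no pid or timestamp prefixes), and assuming one 8kB page per write,
is to count lseek() calls that do not continue where the previous call on
the same file descriptor left off. A sketch only; the field splitting
depends on the strace output format:

    grep '^lseek(' strace.out | \
    awk -F'[(,]' '{ fd = $2 + 0; off = $3 + 0;
                    # a jump is an lseek that is not the previous offset plus one 8kB page
                    if (fd in next_off && off != next_off[fd]) jumps++;
                    next_off[fd] = off + 8192 }
         END { printf "%d non-sequential seeks out of %d lseek calls\n", jumps, NR }'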
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> Yeah, that... is beyond my abilities. Well reading it is anyway. I can >> provide any information people want though. Tom? Greg? Andrew? Somebody? >> What information do you want from me to help you track this down? > > The vmstat output you showed before said user CPU time was only around > 12%, which seems to destroy Luke's thesis that data conversion time > is the problem. IIRC we were speculating that data was being written > in a pattern that required a lot of seeking thus ruining throughput, > but we didn't have any hard evidence of that. Did you do the strace'ing > I suggested? Joshua said this is an 8 core box - so 12% would about equal to one core being maxed out completely CPU wise which seems to correlate fairly well with my experience (COPY being CPU bottlenecked on fast hardware without any real culprit to find in profiles). Stefan
Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> Yeah, that... is beyond my abilities. Well, reading it is anyway. I can
>>> provide any information people want, though. Tom? Greg? Andrew? Somebody?
>>> What information do you want from me to help you track this down?
>>
>> The vmstat output you showed before said user CPU time was only around
>> 12%, which seems to destroy Luke's thesis that data conversion time
>> is the problem. IIRC we were speculating that data was being written
>> in a pattern that required a lot of seeking thus ruining throughput,
>> but we didn't have any hard evidence of that. Did you do the strace'ing
>> I suggested?
>
> Joshua said this is an 8 core box - so 12% would be about equal to one core
> being maxed out completely CPU wise, which seems to correlate fairly well
> with my experience (COPY being CPU bottlenecked on fast hardware without
> any real culprit to find in profiles).

Right. I mentioned previously that I was able to gain what could be
considered barely acceptable performance by breaking up a custom dump
into multiple pieces and restoring across multiple connections.

I have Alvaro reviewing the strace. I can't post it without some
cleansing.

Sincerely,

Joshua D. Drake
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > Tom Lane wrote: >> The vmstat output you showed before said user CPU time was only around >> 12%, which seems to destroy Luke's thesis that data conversion time >> is the problem. > Joshua said this is an 8 core box - so 12% would about equal to one core > being maxed out completely CPU wise Oh, good point. That should be fairly easy to confirm with "top" I would think. regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> The vmstat output you showed before said user CPU time was only around
>>> 12%, which seems to destroy Luke's thesis that data conversion time
>>> is the problem.
>
>> Joshua said this is an 8 core box - so 12% would be about equal to one
>> core being maxed out completely CPU wise
>
> Oh, good point. That should be fairly easy to confirm with "top" I
> would think.

mpstat -P ALL 5

01:43:33 PM  CPU  %user  %nice   %sys %iowait   %irq  %soft %steal   %idle   intr/s
01:43:38 PM  all  12.70   0.00   0.15    0.05   0.00   0.00   0.00   87.10  1322.60
01:43:38 PM    0  99.60   0.00   0.40    0.00   0.00   0.00   0.00    0.00  1000.40
01:43:38 PM    1   0.20   0.00   0.00    0.40   0.00   0.00   0.00   99.40   322.00
01:43:38 PM    2   0.00   0.00   0.20    0.00   0.00   0.00   0.00   99.80     0.00
01:43:38 PM    3   2.00   0.00   0.20    0.00   0.00   0.00   0.00   97.60     0.00
01:43:38 PM    4   0.00   0.00   0.20    0.00   0.00   0.00   0.00  100.00     0.00
01:43:38 PM    5   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.00     0.00
01:43:38 PM    6   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.00     0.00
01:43:38 PM    7   0.00   0.00   0.20    0.00   0.00   0.00   0.00  100.00     0.00

01:43:38 PM  CPU  %user  %nice   %sys %iowait   %irq  %soft %steal   %idle   intr/s
01:43:43 PM  all  13.10   0.00   0.68    0.00   0.00   0.00   0.00   86.22  1328.20
01:43:43 PM    0  95.60   0.00   4.40    0.00   0.00   0.00   0.00    0.00  1000.40
01:43:43 PM    1   0.00   0.00   0.00    0.00   0.00   0.20   0.00  100.00   327.60
01:43:43 PM    2   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.00     0.00
01:43:43 PM    3   9.00   0.00   1.20    0.00   0.00   0.00   0.00   89.80     0.00
01:43:43 PM    4   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.00     0.00
01:43:43 PM    5   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.00     0.00
01:43:43 PM    6   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.00     0.00
01:43:43 PM    7   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.00     0.00

01:43:43 PM  CPU  %user  %nice   %sys %iowait   %irq  %soft %steal   %idle   intr/s
01:43:48 PM  all  13.48   0.00   1.68    0.00   0.00   0.00   0.00   84.85  1345.29
01:43:48 PM    0  87.58   0.00  11.02    0.00   0.00   0.00   0.00    1.40  1002.40
01:43:48 PM    1   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.00   343.09
01:43:48 PM    2   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.20     0.00
01:43:48 PM    3  20.44   0.00   2.20    0.00   0.00   0.00   0.00   77.56     0.00
01:43:48 PM    4   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.20     0.00
01:43:48 PM    5   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.20     0.00
01:43:48 PM    6   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.20     0.00
01:43:48 PM    7   0.00   0.00   0.00    0.00   0.00   0.00   0.00  100.20     0.00

iostat -k 5:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.48    0.00    2.23    0.07    0.00   84.23

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              36.87         0.00      1443.69          0       7204
sdb               0.60         0.00        21.64          0        108
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
hda               0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.28    0.00    2.15    1.33    0.00   83.25

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             168.66         0.00      9947.31          0      49836
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
hda               0.00         0.00         0.00          0          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.57    0.00    2.32    0.05    0.00   84.06

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              27.05         0.00      1425.25          0       7112
sdb               0.00         0.00         0.00          0          0
sdc               0.00         0.00         0.00          0          0
sdd               0.00         0.00         0.00          0          0
hda               0.00         0.00         0.00          0          0

vmstat -adps -S M 5

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd   free  inact active si so  bi    bo   in  cs us sy id wa st
 1  0     0  12793  14385   4301  0  0  12   101    4  12  0  0 99  0  0
 1  0     0  12694  14421   4364  0  0   0 19943 1740 435 13  2 85  0  0
 1  0     0  12606  14444   4429  0  0   0 28070 1839 449 12  2 84  2  0
 1  0     0  12508  14469   4503  0  0   0  9188 1501 424 13  1 85  0  0
 1  0     0  12402  14498   4579  0  0   0 20746 1734 446 13  2 85  0  0

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Tom Lane wrote: >> Oh, good point. That should be fairly easy to confirm with "top" I >> would think. > mpstat -P ALL 5 > 01:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal > %idle intr/s > 01:43:43 PM all 13.10 0.00 0.68 0.00 0.00 0.00 0.00 > 86.22 1328.20 > 01:43:43 PM 0 95.60 0.00 4.40 0.00 0.00 0.00 0.00 > 0.00 1000.40 OK, so it is CPU-bound after all. Do you have oprofile or anything equivalent? regards, tom lane
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> Tom Lane wrote: >>> Oh, good point. That should be fairly easy to confirm with "top" I >>> would think. > >> mpstat -P ALL 5 > >> 01:43:38 PM CPU %user %nice %sys %iowait %irq %soft %steal >> %idle intr/s >> 01:43:43 PM all 13.10 0.00 0.68 0.00 0.00 0.00 0.00 >> 86.22 1328.20 >> 01:43:43 PM 0 95.60 0.00 4.40 0.00 0.00 0.00 0.00 >> 0.00 1000.40 > > OK, so it is CPU-bound after all. Do you have oprofile or anything > equivalent? I am at your disposal. If you need oprofile tell me the command and I will get you the output. Joshua D. Drake > > regards, tom lane >
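For reference, an oprofile session that produces a report like the one
below can be set up roughly as follows (a sketch only; it must run as
root, and the exact package setup may differ):

    opcontrol --no-vmlinux            # skip kernel-symbol profiling
    opcontrol --reset
    opcontrol --start
    # ... let the COPY run for a few minutes ...
    opcontrol --dump
    opcontrol --stop
    opreport -l /usr/bin/postgres     # per-symbol report for the postgres binary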
Joshua D. Drake wrote: >> OK, so it is CPU-bound after all. Do you have oprofile or anything >> equivalent? > > I am at your disposal. If you need oprofile tell me the command and I > will get you the output. (thanks to stefan for helping me get this output) opreport -l /usr/bin/postgres CPU: AMD64 processors, speed 2411.14 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 47654 15.5165 DoCopy 36231 11.7971 CopyReadLine 25046 8.1551 heap_formtuple 24137 7.8592 XLogInsert 13829 4.5028 InputFunctionCall 9618 3.1317 ParseDateTime 9521 3.1001 DecodeDateTime 9446 3.0757 pg_next_dst_boundary 7651 2.4912 DecodeDate 6948 2.2623 AllocSetAlloc 6290 2.0481 ExecConstraints 5462 1.7785 DecodeNumber 3287 1.0703 pg_atoi 3035 0.9882 hash_search_with_hash_value 2866 0.9332 heap_insert 2786 0.9071 MemoryContextCreate 2771 0.9023 hash_any 2727 0.8879 slot_deform_tuple 2639 0.8593 DecodeTime 2517 0.8196 SearchCatCache 2512 0.8179 LWLockAcquire 2362 0.7691 varchar_input 2273 0.7401 timestamptz_in 2267 0.7382 AllocSetDelete 2203 0.7173 .plt 2031 0.6613 int4in 1936 0.6304 CopyGetData 1865 0.6073 tm2timestamp 1846 0.6011 MemoryContextAllocZero 1753 0.5708 ReadBuffer_common 1732 0.5640 ExecMakeFunctionResultNoSets 1658 0.5399 RelationGetBufferForTuple 1655 0.5389 varcharin 1640 0.5340 MemoryContextAlloc 1496 0.4871 AllocSetFree 1476 0.4806 slot_getattr 1466 0.4773 DetermineTimeZoneOffset 1460 0.4754 timestamp_in 1418 0.4617 AllocSetReset 1403 0.4568 LWLockRelease 1398 0.4552 PageAddItem 1384 0.4506 date2j 1370 0.4461 DecodeTimezone 1112 0.3621 boolin 1074 0.3497 slot_attisnull 1049 0.3416 SPI_connect 990 0.3224 AllocSetContextCreate 986 0.3210 PinBuffer 897 0.2921 pq_getmessage 830 0.2703 ExecQual 830 0.2703 SearchSysCache 830 0.2703 pg_strncasecmp 828 0.2696 SPI_finish 777 0.2530 FunctionCall2 758 0.2468 AdjustTimestampForTypmod 746 0.2429 DirectFunctionCall1 725 0.2361 CopyLoadRawBuf 723 0.2354 FunctionCall1 719 0.2341 MarkBufferDirty 694 0.2260 heap_attisnull 686 0.2234 MemoryContextDelete 677 0.2204 pq_getbytes 670 0.2182 float8in 663 0.2159 resetStringInfo 657 0.2139 heap_getsysattr 619 0.2016 CatalogCacheComputeHashValue 608 0.1980 ReleaseBuffer 608 0.1980 ResourceOwnerEnlargeBuffers 605 0.1970 ExecEvalRelabelType 590 0.1921 ReleaseCatCache 551 0.1794 RelationPutHeapTuple 522 0.1700 string_hash 514 0.1674 pfree 506 0.1648 pq_copymsgbytes 505 0.1644 ResourceOwnerForgetCatCacheRef 486 0.1582 pq_getbyte 483 0.1573 Int64GetDatum 476 0.1550 IsToastNamespace 471 0.1534 CacheInvalidateHeapTuple 469 0.1527 hash_search 468 0.1524 hash_uint32 453 0.1475 MemoryContextDeleteChildren 443 0.1442 PageGetHeapFreeSpace 432 0.1407 _SPI_begin_call 428 0.1394 textout 422 0.1374 ResourceOwnerForgetBuffer 421 0.1371 ExecEvalScalarVar 420 0.1368 LockBuffer 393 0.1280 tag_hash 388 0.1263 hashoid 374 0.1218 PageGetFreeSpace 373 0.1215 Float8GetDatum 362 0.1179 UnpinBuffer 355 0.1156 DLMoveToFront 354 0.1153 appendBinaryStringInfo 353 0.1149 string_compare 349 0.1136 isTempToastNamespace 337 0.1097 get_hash_value 335 0.1091 check_stack_depth 331 0.1078 IsSystemRelation 327 0.1065 ExecStoreTuple 322 0.1048 enlargeStringInfo 319 0.1039 BufTableLookup 313 0.1019 ItemPointerEquals 311 0.1013 oideq 308 0.1003 BufferGetBlockNumber 297 0.0967 UnlockReleaseBuffer 291 0.0948 MemoryContextReset 284 0.0925 dt2local 280 0.0912 GetCurrentTransactionIdIfAny 280 0.0912 ReleaseSysCache 271 0.0882 BufTableHashCode 264 0.0860 
ResourceOwnerEnlargeCatCacheRefs 259 0.0843 GetCurrentTransactionId 259 0.0843 pg_detoast_datum_packed 252 0.0821 pgstat_count_heap_insert 239 0.0778 ReadBuffer 222 0.0723 IsSystemNamespace 220 0.0716 LockAcquire 219 0.0713 ResourceOwnerRememberCatCacheRef 201 0.0654 GetDatabaseEncoding 182 0.0593 ExecARInsertTriggers 171 0.0557 ResourceOwnerRememberBuffer 157 0.0511 StartBufferIO 147 0.0479 XLogWrite 127 0.0414 AdvanceXLInsertBuffer 101 0.0329 SyncOneBuffer 83 0.0270 mdextend 79 0.0257 LockRelease 65 0.0212 PinBuffer_Locked 64 0.0208 FileWrite 64 0.0208 ForwardFsyncRequest 63 0.0205 FileSeek 62 0.0202 GetCurrentSubTransactionId 60 0.0195 secure_read 55 0.0179 StrategyGetBuffer 50 0.0163 FlushBuffer 49 0.0160 _mdfd_getseg 47 0.0153 UnlockRelationForExtension 46 0.0150 mdwrite 45 0.0147 CleanUpLock 44 0.0143 TerminateBufferIO 42 0.0137 BgBufferSync 40 0.0130 mdopen 37 0.0120 pq_recvbuf 36 0.0117 mdnblocks 36 0.0117 prepare_for_client_read 35 0.0114 PageInit 33 0.0107 RecordAndGetPageWithFreeSpace 32 0.0104 AllocSetInit 30 0.0098 LockRelationForExtension 30 0.0098 lookup_fsm_page_entry 28 0.0091 register_dirty_segment 27 0.0088 XLogBackgroundFlush 25 0.0081 FileAccess 25 0.0081 OutputFunctionCall 25 0.0081 find_free_space 22 0.0072 GrantLockLocal 21 0.0068 pg_usleep 20 0.0065 RememberFsyncRequest 18 0.0059 UnGrantLock 17 0.0055 create_fsm_rel 16 0.0052 ProcQueueInit 16 0.0052 SHMQueueInit 16 0.0052 XLogFlush 16 0.0052 pgstat_start 15 0.0049 SHMQueueDelete 12 0.0039 SHMQueueInsertBefore 12 0.0039 _mdnblocks 11 0.0036 BufTableInsert 10 0.0033 WalWriterMain 10 0.0033 smgrextend 9 0.0029 AbsorbFsyncRequests 9 0.0029 pgstat_send_bgwriter 8 0.0026 BufTableDelete 8 0.0026 Delete 7 0.0023 LockTagHashCode 6 0.0020 BackgroundWriterMain 6 0.0020 RemoveLocalLock 6 0.0020 smgropen 5 0.0016 LockCheckConflicts 5 0.0016 smgrnblocks 5 0.0016 smgrwrite 4 0.0013 CheckArchiveTimeout 4 0.0013 PostmasterIsAlive 4 0.0013 client_read_ended 3 9.8e-04 GrantLock 3 9.8e-04 PGSemaphoreUnlock 3 9.8e-04 issue_xlog_fsync 2 6.5e-04 BgWriterNap 2 6.5e-04 Insert 2 6.5e-04 StrategySyncStart 2 6.5e-04 force_statwrite 2 6.5e-04 pg_fdatasync 1 3.3e-04 CatalogCacheCreateEntry 1 3.3e-04 CommitTransaction 1 3.3e-04 GetCurrentTransactionNestLevel 1 3.3e-04 GetSnapshotData 1 3.3e-04 PGSemaphoreLock 1 3.3e-04 ScanKeyEntryInitializeWithInfo 1 3.3e-04 SysLoggerMain 1 3.3e-04 _bt_compare 1 3.3e-04 hash_seq_search 1 3.3e-04 nocachegetattr 1 3.3e-04 pgstat_setheader 1 3.3e-04 pgstat_write_statsfile > > Joshua D. Drake > >> >> regards, tom lane >> > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend >
"Joshua D. Drake" <jd@commandprompt.com> writes: > opreport -l /usr/bin/postgres > CPU: AMD64 processors, speed 2411.14 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a > unit mask of 0x00 (No unit mask) count 100000 > samples % symbol name > 47654 15.5165 DoCopy > 36231 11.7971 CopyReadLine > 25046 8.1551 heap_formtuple > 24137 7.8592 XLogInsert > 13829 4.5028 InputFunctionCall > 9618 3.1317 ParseDateTime > 9521 3.1001 DecodeDateTime > 9446 3.0757 pg_next_dst_boundary > 7651 2.4912 DecodeDate > 6948 2.2623 AllocSetAlloc > 6290 2.0481 ExecConstraints > 5462 1.7785 DecodeNumber > 3287 1.0703 pg_atoi Hmm, so what's the declaration of the table you're copying into? Obviously there's some datetime column(s) in it ... regards, tom lane
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> opreport -l /usr/bin/postgres
>> [...]
>
> Hmm, so what's the declaration of the table you're copying into?
> Obviously there's some datetime column(s) in it ...

Let me restart the load. I don't have any statement logging turned on.
However, there is a likelihood of multiple timestamps.

Joshua D. Drake
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> opreport -l /usr/bin/postgres
>> [...]
>
> Hmm, so what's the declaration of the table you're copying into?
> Obviously there's some datetime column(s) in it ...

Ahh nevermind.. stopping the load gave me the table:

        Column         |            Type             |      Modifiers
-----------------------+-----------------------------+--------------------
 unit_id               | integer                     | not null
 lead_id               | integer                     | not null
 timehit               | timestamp with time zone    | not null
 followuptime          | timestamp with time zone    | not null
 message               | integer                     | not null default 0
 email                 | character varying(50)       | not null
 stop_status           | integer                     | not null
 stop_time             | timestamp with time zone    |
 note                  | character varying(20)       |
 name                  | character varying(60)       |
 name2                 | character varying(60)       |
 add_method            | integer                     |
 verified              | integer                     |
 verification_notes    | character varying(100)      |
 add_notes             | character varying(100)      |
 datum1                | character varying(100)      |
 datum2                | character varying(100)      |
 datum3                | character varying(100)      |
 datum4                | character varying(100)      |
 datum5                | character varying(100)      |
 datum6                | character varying(100)      |
 datum7                | character varying(100)      |
 datum8                | character varying(100)      |
 datum9                | character varying(100)      |
 datum10               | character varying(100)      |
 datum11               | character varying(100)      |
 datum12               | character varying(100)      |
 datum13               | character varying(100)      |
 datum14               | character varying(100)      |
 datum15               | character varying(100)      |
 datum16               | character varying(100)      |
 datum17               | character varying(100)      |
 datum18               | character varying(100)      |
 datum19               | character varying(100)      |
 datum20               | character varying(100)      |
 datum21               | character varying(100)      |
 datum22               | character varying(100)      |
 datum23               | character varying(100)      |
 datum24               | character varying(100)      |
 datum25               | character varying(100)      |
 stop_method           | integer                     |
 add_url               | character varying(100)      |
 verification_time     | timestamp without time zone |
 geog_data_last_update | timestamp without time zone |
 geog_country          | character varying(100)      |
 geog_region           | character varying(100)      |
 geog_city             | character varying(100)      |
 geog_postal           | character varying(100)      |
 geog_lat              | double precision            |
 geog_lon              | double precision            |
 geog_dma_code         | integer                     |
 geog_area_code        | integer                     |
 geog_isp              | character varying(100)      |
 geog_org              | character varying(100)      |
 file_upload_id        | integer                     |

I have 100 of these tables (identical).

Joshua D. Drake
Joshua D. Drake wrote: > Joshua D. Drake wrote: > >>> OK, so it is CPU-bound after all. Do you have oprofile or anything >>> equivalent? >> >> I am at your disposal. If you need oprofile tell me the command and I >> will get you the output. > > (thanks to stefan for helping me get this output) > > opreport -l /usr/bin/postgres > CPU: AMD64 processors, speed 2411.14 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a > unit mask of 0x00 (No unit mask) count 100000 > samples % symbol name > 47654 15.5165 DoCopy > 36231 11.7971 CopyReadLine > 25046 8.1551 heap_formtuple > 24137 7.8592 XLogInsert > 13829 4.5028 InputFunctionCall > 9618 3.1317 ParseDateTime > 9521 3.1001 DecodeDateTime >>>> 9446 3.0757 pg_next_dst_boundary This function should've become faster with the tzcode update last week, though since it's only taking 3% of the time in your case, you won't notice. This profile looks very similar to what I've seen in my tests. DoCopy tends to be at the top. Digging deeper into that, I believe that much of the time attributed to DoCopy is actually spent in CopyReadAttributesText, but since it's inlined oprofile can't show it separately. Incidentally, I've been working on a patch to speed up CopyReadLine. I was going to run some more tests first, but since we're talking about it, I guess I should just post the patch. I'll post to pgsql-patches shortly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, 24 Feb 2008 00:43:18 +0000 "Heikki Linnakangas" <heikki@enterprisedb.com> wrote: > Incidentally, I've been working on a patch to speed up CopyReadLine. > I was going to run some more tests first, but since we're talking > about it, I guess I should just post the patch. I'll post to > pgsql-patches shortly. > On your post to patches you mentioned only about a 5% improvement. Don't get me wrong, 5% is 5% and I respect it greatly but as far as I can tell we are about 300% behind the curve. My tests were maxing out at ~22G an hour. On hardware that can do 360G an hour, and that is assuming > 50% overhead between OS, libs, etc., I have no choice but to conclude we have a much, much deeper and fundamental issue going on with COPY. I am inspired by Itagaki Takahiro and his batch update of indexes which should help greatly overall but doesn't help my specific issue. Forgive me for not being a C programmer; Alvaro is not online or I would have vetted these questions with him first. I know that COPY is in theory a bulk loader but, when performing the readline, how many lines are we reading? Do we read up to 8192? Or do we shove in say 8megs of data before we invoke DoCopy? I am just curious if there is some simple low-hanging fruit that we are possibly missing. Sincerely, Joshua D. Drake -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit
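For a rough sense of the gap being described, treating "G" loosely as GiB, the observed and claimed rates work out to something like the following back-of-the-envelope sketch (nothing more precise is intended).

    gb_per_hour = 22.0                      # observed single-session COPY rate
    hw_gb_per_hour = 360.0                  # what the 16-spindle array should manage
    print("observed: %.1f MB/s" % (gb_per_hour * 1024 / 3600))     # ~6.3 MB/s
    print("hardware: %.1f MB/s" % (hw_gb_per_hour * 1024 / 3600))  # ~102 MB/s
    print("gap:      %.0fx" % (hw_gb_per_hour / gb_per_hour))      # ~16x

At roughly 6 MB/s on a 16-spindle array, the load is clearly CPU-bound rather than I/O-bound, which matches the oprofile output earlier in the thread.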
Joshua D. Drake wrote: > On Sun, 24 Feb 2008 00:43:18 +0000 > "Heikki Linnakangas" <heikki@enterprisedb.com> wrote: > >> Incidentally, I've been working on a patch to speed up CopyReadLine. >> I was going to run some more tests first, but since we're talking >> about it, I guess I should just post the patch. I'll post to >> pgsql-patches shortly. > > On your post to patches you mentioned only about a 5% improvement. > Don't get me wrong, 5% is 5% and I respect it greatly but as far as I > can tell we are about 300% behind the curve. Yeah. Looking at the profile, the time is spent really all over the place. There's no one clear bottleneck to focus on. I think we could do a few more ~5% improvements, but at some point I think we have to bite the bullet and find a way to use multiple CPUs for a single load. I don't have any good ideas or plans for that, but hopefully someone does. > My tests were maxing out at ~22G an hour. On hardware that can do > in 360G an hour and that is assuming > 50% overhead between OS, libs, > etc... I have no choice but to conclude we have a much, much deeper and > fundamental issue going on with COPY. I am inspired by Itagaki Takahiro > and his batch update of indexes which should help greatly overall but > doesn't help my specific issue. Yep, the index build idea is an I/O improvement, not a CPU one. > Forgive me for not being a C programmer and Alvaro is not online so I > would vet these questions with him first. > > I know that copy is in theory a bulk loader but, when performing the > readline how many lines are we reading? Do we read up to 8192? Or do we > shove in say 8megs of data before we invoke DoCopy? We read 64 KB at a time, and then CopyReadLineText returns one line at a time from that buffer. Looking at your profile more, and after the memchr patch, the "raw input side" of copy, consisting of reading the data from disk in 64KB blocks, splitting that into lines, and splitting lines into columns, still takes ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest culprit there. You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by creating the table (or truncating it) in the same transaction as the COPY. After that, heap_formtuple is high on the list. I wonder if we could do something about that. > I am just curious if there is some simple low hanging fruit that is > possibly missing. I don't see any piece of code that's causing problems. We can shave off a few percent here and there I think, but don't expect a 300% improvement anytime soon. A few ideas I've thought about are:

- use a specialized version of strtol, for base 10. That won't help on your table, but I've seen strtol consume a significant amount of time on tables with numeric/integer columns.
- Instead of pallocing and memcpying the text fields, leave a little bit of room between fields in the attribute_buf, and write the varlen header there directly. This might help you since your table has a lot of text fields.
- Instead of the normal PG function calling conventions, provide specialized fastpath input functions for the most common data types. InputFunctionCall consumed 4.5% of the CPU time in your profile.
- Use a simpler memory context implementation, that's like a stack with no pfree support, for the per-tuple context.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > I don't see any piece of code that's causing problems. Meant to say: I don't see any *single* piece of code that's causing the problems... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Joshua D. Drake wrote: >> On Sun, 24 Feb 2008 00:43:18 +0000 >> "Heikki Linnakangas" <heikki@enterprisedb.com> wrote: >> >>> Incidentally, I've been working on a patch to speed up CopyReadLine. >>> I was going to run some more tests first, but since we're talking >>> about it, I guess I should just post the patch. I'll post to >>> pgsql-patches shortly. >> >> On your post to patches you mentioned only about a 5% improvement. >> Don't get me wrong, 5% is 5% and I respect it greatly but as far as I >> can tell we are about 300% behind the curve. > > Yeah. Looking at the profile, the time is spent really all over the > place. There's no one clear bottleneck to focus on. I think we could do > a few more ~5% improvements, but > > At some point, I think we have to bite the bullet and find a way to use > multiple CPUs for a single load. I don't have any good ideas or plans > for that, but hopefully someone does. > There was talk elsewhere about making pg_dump/restore smarter. It could create tables, then COPY, and create the indexes last. Add to that pg_restore using multiple connections and you'd have it. One connection could do a COPY, then a second connection could be created to start the CREATE INDEX's for that table, while the first connection went on to COPY the next table. Or something like that... -Andy
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > At some point, I think we have to bite the bullet and find a way to use > multiple CPUs for a single load. I don't have any good ideas or plans > for that, but hopefully someone does. As already mentioned upthread, we could do that today, with zero backend changes, by making pg_restore drive multiple sessions. Now there are scenarios where this wouldn't help too much --- eg, a database with only one enormous table. We couldn't parallelize the loading of that table, although we could parallelize creation of its indexes. But for an example such as JD is complaining about, we ought to have no trouble thrashing his disks into the ground ;-) What you would need for this is an -Fc or -Ft dump, because a plain script output from pg_dump doesn't carry any dependency information, much less any index of where in the file different bits of data are. Just armwaving, I envision a multiprocess pg_restore like this: * one controller process that tracks the not-yet-loaded TOC items in the dump* N child processes that just transmit a selectedTOC item to a connected backend, and then ask the controller what to do next The controller would need to use the dependency information to avoid, eg, handing out a CREATE INDEX command before the parent table was created and loaded. One issue is that this couldn't use "-1" single-transaction restoring, since obviously each child would need its own transaction, and furthermore would have to commit before going back to the controller for more work (since dependent TOC items might well get loaded by a different child later). That defeats a couple of optimizations that Simon put in recently. The one for no XLOG during COPY is not too hard to see how to re-enable, but I'm not sure what else there was. Most likely, the bottleneck with this sort of thing would be multiple parallel reads from the pg_dump archive file. Possibly the controller process could be taught to schedule COPY and CREATE INDEX operations so that not too many processes are trying to read lots of archive data at the same time. regards, tom lane
Heikki Linnakangas wrote: > Joshua D. Drake wrote: >> On Sun, 24 Feb 2008 00:43:18 +0000 >> "Heikki Linnakangas" <heikki@enterprisedb.com> wrote: >> I know that copy is in theory a bulk loader but, when performing the >> readline how many lines are we reading? Do we read up to 8192? Or do we >> shove in say 8megs of data before we invoke DoCopy? > > We read 64 KB at a time, and then CopyReadLineText returns one line at a > time from that buffer. O.k. I am sure I am oversimplifying things but why are we returning one line at a time? That reads expensive to me. Just following the general, don't do inserts one at a time, do them in batch idea for example. I would also question the 64KB at a time. Why not a 1024KB (arbitrary) at a time? Is it a resource issue? In the old days when we actually had people trying to run postgresql on 128 and 256 megs of ram, o.k. but now? > Looking at your profile more, and after the memchr patch, the "raw input > side" of copy, consisting of reading the data from disk in 64KB blocks, > splitting that into lines, and splitting lines into columns, still takes > ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest > culprit there. In reading: http://doxygen.postgresql.org/backend_2commands_2copy_8c-source.html It looks like CopyReadAttributesText is used as part of the column breakup. It also appears that this happens "before" insert right? So if that is the case we are going to pay an additional penalty on the data checking. > > You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by > creating the table (or truncating it) in the same transaction with the > COPY. In the same transaction? Oh that's interesting. So that might be a TODO right there. Change pg_dump so it does: create,copy,create,copy,index Instead of: create,create,copy,copy,index > > After that, heap_formtuple is high on the list. I wonder if we could do > something about that. Just from the name I am going to guess this is where we turn it into something that is actually injected into PostgreSQL. > I don't see any piece of code that's causing problems. We can shave off > a few percents here and there I think, but don't expect a 300% > improvement anytime soon. A few ideas I've thought about are: Well don't get me wrong, I am not expecting miracles here. I am just confounded at the complete lack of performance in this arena. I don't think a lot of people recognize what a significant issue this is since we don't have in place backups. > > - use a specialized version of strtol, for base 10. That won't help on > your table, but I've seen strtol consume a significant amount of time on > tables with numeric/integer columns. > > - Instead of pallocing and memcpying the text fields, leave a little bit > of room between fields in the attribute_buf, and write the varlen header > there directly. This might help you since your table has a lot of text > fields. > > - Instead of the normal PG function calling conventions, provide > specialized fastpath input functions for the most common data types. > InputFunctionCall consumed 4.5% of the CPU time in your profile. > > - Use a simpler memory context implementation, that's like a stack with > no pfree support, for the per-tuple context. > By my calculations you are presenting a possibility of at least ~ 30% improvement. That is significant in my book. Hopefully as we explore these options we will find others. Sincerely, Joshua D. Drake
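A minimal sketch of the "same transaction" trick Heikki mentions, using psycopg2 and made-up table/file names: when the table is created or truncated in the transaction that does the COPY, and WAL archiving is not enabled, 8.3 can skip WAL-logging the loaded data, which is where the ~8% XLogInsert time goes. pg_dump ordering aside, it is easy to apply by hand to a single big table.

    import psycopg2

    conn = psycopg2.connect("dbname=target_db")   # hypothetical database
    try:
        with conn:                                # one transaction: TRUNCATE + COPY
            with conn.cursor() as cur:
                cur.execute("TRUNCATE TABLE lead_test")
                with open("lead_test.copy") as f:
                    cur.copy_expert("COPY lead_test FROM STDIN", f)
    finally:
        conn.close()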
Tom Lane wrote: > Greg Smith <gsmith@gregsmith.com> writes: >> On Fri, 22 Feb 2008, Tom Lane wrote: >>> Counts are useless here, we need to see the sequence of write locations >>> to find out if there's a lot of nonconsecutive writes happening. > >> How were you planning to analyze the strace output to quantify that? > > I didn't really have any preconceived ideas about that. I just want to > see some raw data to see if something shows up. Isn't blktrace the tool to get that kind of information? Anyway, as the following threads point out, the problem seems to be somewhere else. -- Best regards, Hannes Dorbath
Tom Lane wrote: > "Heikki Linnakangas" <heikki@enterprisedb.com> writes: >> At some point, I think we have to bite the bullet and find a way to use >> multiple CPUs for a single load. I don't have any good ideas or plans >> for that, but hopefully someone does. > > As already mentioned upthread, we could do that today, with zero backend > changes, by making pg_restore drive multiple sessions. Now there are > scenarios where this wouldn't help too much --- eg, a database with only > one enormous table. We couldn't parallelize the loading of that table, > although we could parallelize creation of its indexes. But for an > example such as JD is complaining about, we ought to have no trouble > thrashing his disks into the ground ;-) Bring it on ! but I would note that with the current pg_restore I was not able to trash my disks. I only used four backends (I could have used 8) but even with that, I was only doing ~ 45M a second. So if I double, I still have bandwidth. It would still be an huge improvement though. > > What you would need for this is an -Fc or -Ft dump, because a plain > script output from pg_dump doesn't carry any dependency information, I would focus on -Fc. With the limitations of -Ft this would be a good way to start phasing -Ft out. > much less any index of where in the file different bits of data are. > Just armwaving, I envision a multiprocess pg_restore like this: > > * one controller process that tracks the not-yet-loaded > TOC items in the dump > * N child processes that just transmit a selected TOC item > to a connected backend, and then ask the controller > what to do next > Most likely, the bottleneck with this sort of thing would be multiple > parallel reads from the pg_dump archive file. Possibly the controller > process could be taught to schedule COPY and CREATE INDEX operations > so that not too many processes are trying to read lots of archive > data at the same time. A less hacker and more DBA bottleneck will be to limit the number of backends being created for restore. We don't really want to have more than one backend per CPU, otherwise we just start switching. Sincerely, Joshua D. Drake
Joshua D. Drake wrote: > Heikki Linnakangas wrote: >> >> We read 64 KB at a time, and then CopyReadLineText returns one line >> at a time from that buffer. > > O.k. I am sure I am oversimplifying things but why are we returning > one line at a time? That reads expensive to me. Just following the > general, don't do inserts one at a time, do them in batch idea for > example. Quite simply because one line corresponds to one record. And yes, I believe you are oversimplifying, or under several misapprehensions about what can be done at this level. > > I would also question the 64KB at a time. Why not a 1024KB (arbitrary) > at a time? Is it a resource issue? In the old days when we actually > had people trying to run postgresql on 128 and 256 megs of ram, o.k. > but now? It would be simple enough to change. Try it and see if it actually makes a difference. All you have to change is the define of RAW_BUF_SIZE. > > In reading: > > http://doxygen.postgresql.org/backend_2commands_2copy_8c-source.html > > It looks like CopyReadAttributesText is used as part of the column > breakup. It also appears that this happens "before" insert right? So > if that is the case we are going to pay an additional penalty on the > data checking. > What? I don't understand what you are talking about. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Joshua D. Drake wrote: >> I would also question the 64KB at a time. Why not a 1024KB (arbitrary) >> at a time? Is it a resource issue? In the old days when we actually >> had people trying to run postgresql on 128 and 256 megs of ram, o.k. >> but now? > It would be simple enough to change. Try it and see if it actually makes > a difference. All you have to change is the define of RAW_BUF_SIZE. Seems unlikely that making it bigger than (a fraction of) L2 cache would be a smart move. regards, tom lane
Andrew Dunstan wrote: > > > Joshua D. Drake wrote: >> Heikki Linnakangas wrote: >>> >>> We read 64 KB at a time, and then CopyReadLineText returns one line >>> at a time from that buffer. >> >> O.k. I am sure I am oversimplifying things but why are we returning >> one line at a time? That reads expensive to me. Just following the >> general, don't do inserts one at a time, do them in batch idea for >> example. > > Quite simply because one line corresponds to one record. And yes, I > believe you are oversimplifying, or under several misapprehensions about > what can be done at this level. Well I find without questioning, I won't get the answers so :). > It would be simple enough to change. Try it and see if it actually makes > a difference. All you have to change is the define of RAW_BUF_SIZE. > Fair enough. May I assume this is the only place I need to change it? http://doxygen.postgresql.org/backend_2commands_2copy_8c-source.html#l00158 >> It looks like CopyReadAttributesText is used as part of the column >> breakup. It also appears that this happens "before" insert right? So >> if that is the case we are going to pay an additional penalty on the >> data checking. >> > > What? I don't understand what you are talking about. Data checking on insert to the DB itself. I have no doubt that I may be wrong on this. Joshua D. Drake
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Joshua D. Drake wrote: >>> I would also question the 64KB at a time. Why not a 1024KB (arbitrary) >>> at a time? Is it a resource issue? In the old days when we actually >>> had people trying to run postgresql on 128 and 256 megs of ram, o.k. >>> but now? > >> It would be simple enough to change. Try it and see if it actually makes >> a difference. All you have to change is the define of RAW_BUF_SIZE. > > Seems unlikely that making it bigger than (a fraction of) L2 cache > would be a smart move. O.k. these CPUs have 1meg of L2 so I will try with 512k. Joshua D. Drake > > regards, tom lane >
On Sun, 2008-02-24 at 09:47 -0800, Joshua D. Drake wrote: > A less hacker and more DBA bottleneck will be to limit the number of > backends being created for restore. We don't really want to have more > than one backend per CPU, otherwise we just start switching. Are you sure that it would always be a loss? If there is any significant I/O latency for a single backend, it seems like a context switch could be a win for processor utilization. It might not be a win overall, but at least potentially a win. Regards,Jeff Davis
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mon, 25 Feb 2008 11:36:56 -0800 Jeff Davis <pgsql@j-davis.com> wrote: > > If there is any significant I/O latency for a single backend, it seems > like a context switch could be a win for processor utilization. It > might not be a win overall, but at least potentially a win. Do we want a 20% potential win or an 80% potential win? I would personally rather keep it simple, hard core, and data shoving as possible without any issue with scheduling etc.. Sincerely, Joshua D. Drake - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHwx+EATb/zqfZUUQRAvOgAJ4vWCO74XzXy9Pbzqz3otWoqKI3HgCfRwUI ZLd0SOgf5jnInZvOxCS+iNU= =Syk1 -----END PGP SIGNATURE-----
On Mon, 2008-02-25 at 12:05 -0800, Joshua D. Drake wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On Mon, 25 Feb 2008 11:36:56 -0800 > Jeff Davis <pgsql@j-davis.com> wrote: > > > > > If there is any significant I/O latency for a single backend, it seems > > like a context switch could be a win for processor utilization. It > > might not be a win overall, but at least potentially a win. > > Do we want a 20% potential win or an 80% potential win? > > I would personally rather keep it simple, hard core, and data shoving > as possible without any issue with scheduling etc.. > Just a thought. After it's actually implemented it won't be hard to see if it's a win. Regards,Jeff Davis
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mon, 25 Feb 2008 12:17:10 -0800 Jeff Davis <pgsql@j-davis.com> wrote: > > I would personally rather keep it simple, hard core, and data > > shoving as possible without any issue with scheduling etc.. > > > > Just a thought. After it's actually implemented it won't be hard to > see if it's a win. Yep :) but as a note: I am currently testing on the data set that is giving us all these issues. Previously we were pushing ~ 22G an hour over a single thread. I am currently pushing ~ 28G every 16 minutes over 6 threads. With 30-40% IO wait. Sincerely, Joshua D. Drake - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHwyToATb/zqfZUUQRAsDzAJ0cZEujQIW1SQ9Wd1nd1jWRVWy09ACgpryh SJENqCnmwKoSMF5fSHBRtsg= =hVeo -----END PGP SIGNATURE-----
On Mon, 2008-02-25 at 12:28 -0800, Joshua D. Drake wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On Mon, 25 Feb 2008 12:17:10 -0800 > Jeff Davis <pgsql@j-davis.com> wrote: > > > > > I would personally rather keep it simple, hard core, and data > > > shoving as possible without any issue with scheduling etc.. > > > > > > > Just a thought. After it's actually implemented it won't be hard to > > see if it's a win. > > Yep :) but as a note: > > I am currently testing on the data set that is giving us all these > issues. Previously we were pushing ~ 22G an hour over a single thread. > I am currently pushing ~ 28G every 16 minutes over 6 threads. > > With 30-40% IO wait. That begs the question: what about 12 threads? Regards,Jeff Davis
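One hedged way to answer the "how many threads?" question empirically is simply to sweep the worker count and time each run, along the lines of the sketch below; parallel_restore here is assumed to be whatever script is driving the concurrent pg_restore sessions (a hypothetical stand-in, not an existing tool).

    import time

    def sweep(parallel_restore, counts=(6, 12, 24)):
        # time the same restore at several worker counts and report hours
        results = {}
        for n in counts:
            start = time.time()
            parallel_restore(workers=n)
            results[n] = (time.time() - start) / 3600.0
            print("%2d workers: %.2f hours" % (n, results[n]))
        return results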
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mon, 25 Feb 2008 14:05:58 -0800 Jeff Davis <pgsql@j-davis.com> wrote: > > Yep :) but as a note: > > > > I am currently testing on the data set that is giving us all these > > issues. Previously we were pushing ~ 22G an hour over a single > > thread. I am currently pushing ~ 28G every 16 minutes over 6 > > threads. > > > > With 30-40% IO wait. > > That begs the question: what about 12 threads? That is exactly what Alvaro said :P, let me finish this one first. Joshua D. Drake - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHwz0EATb/zqfZUUQRAo1HAJkB58g/gkCWPTqSqjNzrcxGZ4eiNACgi7Va gZGMboxrPwV4euv67anSyfo= =9S0a -----END PGP SIGNATURE-----
On Mon, 25 Feb 2008 14:11:16 -0800 "Joshua D. Drake" <jd@commandprompt.com> wrote: O.k. with 24 connections 3.5 hours. Testing with 12 now. 6 never finished due to a bug. Observations: As simple as this solution is, it is not elegant, nor is it smart. Using this method, if you have a 100GB table (which is very common) you are still bound in a bad way by a single connection and you are holding up everyone else. Sincerely, Joshua D. Drake -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit
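For the single-100GB-table case, one client-side workaround (outside anything pg_restore can do with the archive today) is to split the table's raw COPY text into chunks and load each chunk over its own connection. A sketch with made-up names follows; it reads the whole file into memory for simplicity, where a real loader would stream the chunks, and it only parallelizes the COPY itself, not the index builds afterwards.

    import io
    import psycopg2
    from concurrent.futures import ThreadPoolExecutor

    DATAFILE, TABLE, NCONN = "big_table.copy", "big_table", 6   # hypothetical

    def load_lines(lines):
        conn = psycopg2.connect("dbname=target_db")
        try:
            with conn, conn.cursor() as cur:
                cur.copy_expert("COPY %s FROM STDIN" % TABLE, io.StringIO("".join(lines)))
        finally:
            conn.close()

    with open(DATAFILE) as f:
        rows = f.readlines()                      # COPY text format: one row per line
    chunks = [rows[i::NCONN] for i in range(NCONN)]

    with ThreadPoolExecutor(max_workers=NCONN) as pool:
        futures = [pool.submit(load_lines, c) for c in chunks]
        for fut in futures:
            fut.result()                          # surface any COPY errors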
On Mon, 2008-02-25 at 21:18 -0800, Joshua D. Drake wrote: > As simple as this solution is, it is not eloquent nor is it smart. > Using this method, if you have a 100GB table (which is very common) > you are still bound in a bad way by a single connection and you are > holding up everyone else. In your case I can see your point. For me it would still be very helpful. If that 100GB table has several indexes, particularly on localized text, that can take a lot of processor time to rebuild (even for a substantially smaller dataset, like in the "7 hour restore" thread). It seems like a no-brainer to be able to utilize all available cores. I think one big improvement is to break it into steps as Simon suggests here: http://archives.postgresql.org/pgsql-hackers/2008-02/msg00205.php and my idea to further break it down: http://archives.postgresql.org/pgsql-hackers/2008-02/msg00699.php I think we should consider all of these pg_restore improvements, because they're merely simplifying the DBA's job. Currently, to get these benefits, I have to organize and parallelize the restore manually. Actually, the tests you're running are helping me as much as any pg_restore changes might anyway. I don't mind a small amount of extra work to dump/restore, but other users might get a bad impression of PostgreSQL if they don't know how to make it perform to their expectations. Regards,Jeff Davis
On Mon, 25 Feb 2008 22:29:32 -0800 Jeff Davis <pgsql@j-davis.com> wrote: > For me it would still be very helpful. If that 100GB table has several > indexes, particularly on localized text, that can take a lot of > processor time to rebuild (even for a substantially smaller dataset, > like in the "7 hour restore" thread). It seems like a no-brainer to be > able to utilize all available cores. Oh, I agree that we should be using all cores. I would argue that we should have been doing that for years now but more importantly to me is that pg_restore even single threaded is slow. > > I think we should consider all of these pg_restore improvements, > because they're merely simplifying the DBA's job. Currently, to get > these benefits, I have to organize and parallelize the restore > manually. Definitely. > > Actually, the tests you're running are helping me as much as any > pg_restore changes might anyway. I don't mind a small amount of extra > work to dump/restore, but other users might get a bad impression of > PostgreSQL if they don't know how to make it perform to their > expectations. Certainly but having to hand roll this is bad. It presents us in a decidedly hackish light. Sincerely, Joshua D. Drake > > Regards, > Jeff Davis > -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit
On Mon, 25 Feb 2008 22:25:09 -0800 "Joshua D. Drake" <jd@commandprompt.com> wrote: 24 connections: 3.5 hours 12 connections: 4.5 hours -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit
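Taking those two runs at face value against the ~220G restore file from the start of the thread (dump size, not on-disk size), the implied rates look roughly like this; the single-connection row is extrapolated from the ~22G/hour figure mentioned earlier, not a measured run.

    dump_gb = 220.0
    runs = {1: dump_gb / 22.0, 12: 4.5, 24: 3.5}   # connections -> hours (1-conn extrapolated)
    for conns in sorted(runs):
        hours = runs[conns]
        print("%2d connections: %4.1f hours, ~%3.0f GB/hour" % (conns, hours, dump_gb / hours))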
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Tue, 26 Feb 2008 07:43:23 -0800 "Joshua D. Drake" <jd@commandprompt.com> wrote: > On Mon, 25 Feb 2008 22:25:09 -0800 > "Joshua D. Drake" <jd@commandprompt.com> wrote: > > 24 connections: 3.5 hours > 12 connections: 4.5 hours > 24 connections RAW_BUF_SIZE 524288 3.5 hours However one observation that I am going to (try) to test is that we are spending a lot of time waiting for the last thread to finish. I wouldn't be surprised if I could cut the time in half if I could script this out to do: table.data->table.pk->table.index table.data->table.pk->table.index table.data->table.pk->table.index table.data->table.pk->table.index constraints versus table.data table.data table.data table.pk table.pk table.pk table.index ... Joshua D. Drake - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHxJ3+ATb/zqfZUUQRAlURAKCtOSbHoWTNSkzV0U48Ib0P/8SSNgCdFeWK v6Q1l5BBGxqqtPu1UMhA81E= =VA2+ -----END PGP SIGNATURE-----
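A sketch of that per-table pipeline, again driving stock pg_restore with -L list files: each worker takes one table and runs its TABLE DATA entry followed by the matching PK/index entries, so a slow table only ever ties up one worker instead of stalling a whole phase. Tying index/constraint entries back to their table here relies on a naming convention (objects named after the table), which is purely an assumption for illustration, and the TOC parsing is crude; FK constraints would still be applied in a single pass at the end. The archive and database names are made up.

    import subprocess
    import tempfile
    from concurrent.futures import ThreadPoolExecutor

    DUMP, DB, WORKERS = "backup.dump", "target_db", 8     # hypothetical names

    toc = subprocess.run(["pg_restore", "-l", DUMP], capture_output=True,
                         text=True, check=True).stdout.splitlines()

    def obj_name(line):
        return line.split()[-2]        # crude: next-to-last token of a "-l" line

    tables = {}                        # table -> [data line, then pk/index lines]
    for line in toc:
        if " TABLE DATA " in line:
            tables[obj_name(line)] = [line]
    for line in toc:
        if (" INDEX " in line or " CONSTRAINT " in line) and " FK CONSTRAINT " not in line:
            for tbl, chain in tables.items():
                if obj_name(line).startswith(tbl):     # assumed naming convention
                    chain.append(line)
                    break

    def restore_chain(chain):
        # one -L file per table keeps data -> pk -> index ordering within the worker
        with tempfile.NamedTemporaryFile("w", suffix=".list", delete=False) as f:
            f.write("\n".join(chain) + "\n")
        subprocess.run(["pg_restore", "-L", f.name, "-d", DB, DUMP], check=True)

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = [pool.submit(restore_chain, c) for c in tables.values()]
        for fut in futures:
            fut.result()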
"Joshua D. Drake" <jd@commandprompt.com> writes: > However one observation that I am going to (try) to test is that we are > spending a lot of time waiting for the last thread to finish. IOW you haven't balanced the work given to each thread very well? Or is there something else happening? How exactly are you allocating tasks to threads in this prototype, anyway? regards, tom lane
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Tue, 26 Feb 2008 18:39:53 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > > However one observation that I am going to (try) to test is that we > > are spending a lot of time waiting for the last thread to finish. > > IOW you haven't balanced the work given to each thread very well? > Or is there something else happening? > > How exactly are you allocating tasks to threads in this prototype, > anyway? > Right, there is no balance here. Let me explain what I did. I performed a pg_restore -l to get the TOC file. I then broke that file up into five other files.

prefix = schema (minus indexes, constraints)
data = data
pk = primary keys
index = indexes
triggers_constraints = well, triggers and constraints (foreign keys in this instance)

The first step of the script loads prefix. It then splits the data file into -n- number of files and launches -n- number of pg_restore processes with -L. It runs through all data, then starts on pk in the exact same manner, and then indexes, etc... Everything moves along each step very quickly until the last thread. So I could burn through all of data in 60 minutes except for the last three tables in the 24th file which will take 30 minutes on their own (arbitrary numbers). While I am waiting on the last 3 tables, nothing is happening. We are in a holding pattern for the 24 connection pk load to start. What should happen is as each TABLE DATA line finishes, the appropriate CONSTRAINT (pk) and INDEX are also built. The two single threaded processes, although you could probably make them multi threaded, are prefix and triggers_constraints. Sincerely, Joshua D. Drake - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHxKXGATb/zqfZUUQRAjePAJ9xx6ea+Vo4J5T3CxLYRfKj2Cm1gQCeOjbU DHcAzEsVpedyUnJjUuL7DI8= =ynM0 -----END PGP SIGNATURE-----
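For reference, the splitting step described here can be approximated in a few lines: read the pg_restore -l listing and bucket entries into the five list files that get fed back with -L. The archive name is made up and the pattern matching is deliberately crude.

    import subprocess

    toc = subprocess.run(["pg_restore", "-l", "backup.dump"], capture_output=True,
                         text=True, check=True).stdout.splitlines()

    buckets = {"prefix": [], "data": [], "pk": [], "index": [], "triggers_constraints": []}
    for line in toc:
        if not line or line.startswith(";"):
            continue                                  # skip the listing's comment header
        if " TABLE DATA " in line:
            buckets["data"].append(line)
        elif " FK CONSTRAINT " in line or " TRIGGER " in line:
            buckets["triggers_constraints"].append(line)
        elif " CONSTRAINT " in line:                  # primary / unique keys, mostly
            buckets["pk"].append(line)
        elif " INDEX " in line:
            buckets["index"].append(line)
        else:
            buckets["prefix"].append(line)            # schema: tables, types, functions, ...

    for name, lines in buckets.items():
        with open(name + ".list", "w") as f:
            f.write("\n".join(lines) + "\n")

Each of these files can then be split further and handed to parallel pg_restore -L runs, which is essentially what the script described above does.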
"Joshua D. Drake" <jd@commandprompt.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> How exactly are you allocating tasks to threads in this prototype, >> anyway? > Right there is no balance here. Let me explain what I did. I performed > a pg_restore -l to get the TOC file. I then broke that file up into > five other files. > prefix = schema (minus indexes, constraints) > data = data > pk = primary keys > index = indexes > triggers_constraints = well triggers and cosntraints (foreign keys in > this instance) > The first step of the script loads prefix. It then splits the data > file into -n- number of files and launches -n- number of > pg_restore processes with -L. > It runs through all data, then starts on pk in the exact same manner > and then indxex etc... So you have four serialization points not just one; at each one the slowest subtask forces everyone else to wait, even if there's work that could potentially be done on other tables. This is fine for a quick-and-dirty proof of concept but it's certainly not how we'd want to implement the real thing. But I doubt you can get much further without putting some actual dependency awareness into it. regards, tom lane
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Tue, 26 Feb 2008 19:03:57 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > So you have four serialization points not just one; at each one the > slowest subtask forces everyone else to wait, even if there's work > that could potentially be done on other tables. This is fine for a > quick-and-dirty proof of concept but it's certainly not how we'd want > to implement the real thing. But I doubt you can get much further > without putting some actual dependency awareness into it. Exactly. I have gone as far as I actually can without quite a bit of brain trauma and hand holding. I could probably prototype it in perl fairly easy but C is a complete no go for me. I hope the numbers were at least helpful. Sincerely, Joshua D. Drake > > regards, tom lane > > ---------------------------(end of > broadcast)--------------------------- TIP 7: You can help support the > PostgreSQL project by donating at > > http://www.postgresql.org/about/donate > - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFHxK2jATb/zqfZUUQRArgSAJ9z6f4dJLZ1FSbU6ISpIKXaIePzJQCePRY9 8PUIRudPtAlEBH1ivDMJOC4= =afNY -----END PGP SIGNATURE-----
Heikki, are there any TODO items here? --------------------------------------------------------------------------- Heikki Linnakangas wrote: > Joshua D. Drake wrote: > > On Sun, 24 Feb 2008 00:43:18 +0000 > > "Heikki Linnakangas" <heikki@enterprisedb.com> wrote: > > > >> Incidentally, I've been working on a patch to speed up CopyReadLine. > >> I was going to run some more tests first, but since we're talking > >> about it, I guess I should just post the patch. I'll post to > >> pgsql-patches shortly. > > > > On your post to patches you mentioned only about a 5% improvement. > > Don't get me wrong, 5% is 5% and I respect it greatly but as far as I > > can tell we are about 300% behind the curve. > > Yeah. Looking at the profile, the time is spent really all over the > place. There's no one clear bottleneck to focus on. I think we could do > a few more ~5% improvements, but > > At some point, I think we have to bite the bullet and find a way to use > multiple CPUs for a single load. I don't have any good ideas or plans > for that, but hopefully someone does. > > > My tests were maxing out at ~22G an hour. On hardware that can do > > in 360G an hour and that is assuming > 50% overhead between OS, libs, > > etc... I have no choice but to conclude we have a much, much deeper and > > fundamental issue going on with COPY. I am inspired by Itagaki Takahiro > > and his batch update of indexes which should help greatly overall but > > doesn't help my specific issue. > > Yep, the index build idea is an I/O improvement, not a CPU one. > > > Forgive me for not being a C programmer and Alvaro is not online so I > > would vet these questions with him first. > > > > I know that copy is in theory a bulk loader but, when performing the > > readline how many lines are we reading? Do we read up to 8192? Or do we > > shove in say 8megs of data before we invoke DoCopy? > > We read 64 KB at a time, and then CopyReadLineText returns one line at a > time from that buffer. > > Looking at your profile more, and after the memchr patch, the "raw input > side" of copy, consisting of reading the data from disk in 64KB blocks, > splitting that into lines, and splitting lines into columns, still takes > ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest > culprit there. > > You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by > creating the table (or truncating it) in the same transaction with the COPY. > > After that, heap_formtuple is high on the list. I wonder if we could do > something about that. > > > I am just curious if there is some simple low hanging fruit that is > > possibly missing. > > I don't see any piece of code that's causing problems. We can shave off > a few percents here and there I think, but don't expect a 300% > improvement anytime soon. A few ideas I've thought about are: > > - use a specialized version of strtol, for base 10. That won't help on > your table, but I've seen strtol consume a significant amount of time on > tables with numeric/integer columns. > > - Instead of pallocing and memcpying the text fields, leave a little bit > of room between fields in the attribute_buf, and write the varlen header > there directly. This might help you since your table has a lot of text > fields. > > - Instead of the normal PG function calling conventions, provide > specialized fastpath input functions for the most common data types. > InputFunctionCall consumed 4.5% of the CPU time in your profile. 
> > - Use a simpler memory context implementation, that's like a stack with > no pfree support, for the per-tuple context. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Heikki, are there any TODO items here? Hmm. You could put an entry there for "profile and optimize COPY", with the below list of ideas as a starting point. It's more about profiling and performance testing than coding. > --------------------------------------------------------------------------- > > Heikki Linnakangas wrote: >> Joshua D. Drake wrote: >>> On Sun, 24 Feb 2008 00:43:18 +0000 >>> "Heikki Linnakangas" <heikki@enterprisedb.com> wrote: >>> >>>> Incidentally, I've been working on a patch to speed up CopyReadLine. >>>> I was going to run some more tests first, but since we're talking >>>> about it, I guess I should just post the patch. I'll post to >>>> pgsql-patches shortly. >>> On your post to patches you mentioned only about a 5% improvement. >>> Don't get me wrong, 5% is 5% and I respect it greatly but as far as I >>> can tell we are about 300% behind the curve. >> Yeah. Looking at the profile, the time is spent really all over the >> place. There's no one clear bottleneck to focus on. I think we could do >> a few more ~5% improvements, but >> >> At some point, I think we have to bite the bullet and find a way to use >> multiple CPUs for a single load. I don't have any good ideas or plans >> for that, but hopefully someone does. >> >>> My tests were maxing out at ~22G an hour. On hardware that can do >>> in 360G an hour and that is assuming > 50% overhead between OS, libs, >>> etc... I have no choice but to conclude we have a much, much deeper and >>> fundamental issue going on with COPY. I am inspired by Itagaki Takahiro >>> and his batch update of indexes which should help greatly overall but >>> doesn't help my specific issue. >> Yep, the index build idea is an I/O improvement, not a CPU one. >> >>> Forgive me for not being a C programmer and Alvaro is not online so I >>> would vet these questions with him first. >>> >>> I know that copy is in theory a bulk loader but, when performing the >>> readline how many lines are we reading? Do we read up to 8192? Or do we >>> shove in say 8megs of data before we invoke DoCopy? >> We read 64 KB at a time, and then CopyReadLineText returns one line at a >> time from that buffer. >> >> Looking at your profile more, and after the memchr patch, the "raw input >> side" of copy, consisting of reading the data from disk in 64KB blocks, >> splitting that into lines, and splitting lines into columns, still takes >> ~20% of the CPU time. I suspect CopyReadAttributesText is the biggest >> culprit there. >> >> You could avoid the ~8% spent in XLogInsert in PostgreSQL 8.3, by >> creating the table (or truncating it) in the same transaction with the COPY. >> >> After that, heap_formtuple is high on the list. I wonder if we could do >> something about that. >> >>> I am just curious if there is some simple low hanging fruit that is >>> possibly missing. >> I don't see any piece of code that's causing problems. We can shave off >> a few percents here and there I think, but don't expect a 300% >> improvement anytime soon. A few ideas I've thought about are: >> >> - use a specialized version of strtol, for base 10. That won't help on >> your table, but I've seen strtol consume a significant amount of time on >> tables with numeric/integer columns. >> >> - Instead of pallocing and memcpying the text fields, leave a little bit >> of room between fields in the attribute_buf, and write the varlen header >> there directly. This might help you since your table has a lot of text >> fields. 
>> >> - Instead of the normal PG function calling conventions, provide >> specialized fastpath input functions for the most common data types. >> InputFunctionCall consumed 4.5% of the CPU time in your profile. >> >> - Use a simpler memory context implementation, that's like a stack with >> no pfree support, for the per-tuple context. >> >> -- >> Heikki Linnakangas >> EnterpriseDB http://www.enterprisedb.com >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 1: if posting/reading through Usenet, please send an appropriate >> subscribe-nomail command to majordomo@postgresql.org so that your >> message can get through to the mailing list cleanly > -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Mon, 3 Mar 2008 20:40:02 -0500 (EST) Bruce Momjian <bruce@momjian.us> wrote: As a closing note on this from my side. When I ran the restore with 90 connections (as a stress test) it only took 20 minutes longer than the 24 connections. Joshua D. Drake - -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL SPI Liaison | SPI Director | PostgreSQL political pundit -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFH0E3IATb/zqfZUUQRAjn3AJ9d0IpBn9pd16AYFWT3qs2frS51lQCfcOIN uQvwc263TjLMZOYzx7Ktbpo= =6xOV -----END PGP SIGNATURE-----
Added to TODO: > o Improve COPY performance > > http://archives.postgresql.org/pgsql-hackers/2008-02/msg00954.php --------------------------------------------------------------------------- Heikki Linnakangas wrote: > Bruce Momjian wrote: > > Heikki, are there any TODO items here? > > Hmm. You could put an entry there for "profile and optimize COPY", with > the below list of ideas as a starting point. It's more about profiling > and performance testing than coding. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +