Thread: gprof SELECT COUNT(*) results
I did some gprof on a simple "SELECT COUNT(*) FROM test" query on cvs tip.

Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
gcc: 2.96
gprof: 2.13.90.0.2
./configure --without-readline

There are 260k or so records in table test(i int), about 1500 pages. I
set shared_buffers to 3000, which is enough to hold all data pages. Other
GUCs are at their defaults. After some warmup (to make sure these pages
are in the file system buffers), I ran "SELECT COUNT(*)" 10 times in each
round, and I tested 3 rounds. The results are:

- Round 1 -
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
16.67      0.27     0.27   2648542     0.00     0.00  LWLockAcquire
13.58      0.49     0.22   2648543     0.00     0.00  LWLockRelease
 8.02      0.62     0.13   5266128     0.00     0.00  LockBuffer
 8.02      0.75     0.13   2621456     0.00     0.00  heapgettup

- Round 2 -
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
19.14      0.31     0.31   2648542     0.00     0.00  LWLockAcquire
13.58      0.53     0.22   2648543     0.00     0.00  LWLockRelease
11.11      0.71     0.18   2621456     0.00     0.00  heapgettup
 6.79      0.82     0.11   5266128     0.00     0.00  LockBuffer

- Round 3 -
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
17.12      0.25     0.25   2648542     0.00     0.00  LWLockAcquire
 8.22      0.37     0.12   2648543     0.00     0.00  LWLockRelease
 7.53      0.48     0.11   2621456     0.00     0.00  heapgettup
 6.85      0.58     0.10   2621440     0.00     0.00  ExecEvalConst

There is some variance in the results, so my questions are:
(1) Are these results faithful?
(2) If so, do they indicate that LWLock needs some improvement?

Regards,
Qingqing
On Thu, 2005-11-24 at 13:25 -0500, Qingqing Zhou wrote:
> I did some gprof on a simple "SELECT COUNT(*) FROM test" query on cvs tip.
>
> Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
> gcc: 2.96
> gprof: 2.13.90.0.2
> ./configure --without-readline
>
> There are 260k or so records in table test(i int), about 1500 pages. I
> set shared_buffers to 3000, which is enough to hold all data pages. Other
> GUCs are at their defaults. After some warmup (to make sure these pages
> are in the file system buffers), I ran "SELECT COUNT(*)" 10 times in each
> round, and I tested 3 rounds. The results are:
>
> - Round 1 -
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
> 16.67      0.27     0.27   2648542     0.00     0.00  LWLockAcquire
> 13.58      0.49     0.22   2648543     0.00     0.00  LWLockRelease
>  8.02      0.62     0.13   5266128     0.00     0.00  LockBuffer
>  8.02      0.75     0.13   2621456     0.00     0.00  heapgettup
>
> - Round 2 -
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
> 19.14      0.31     0.31   2648542     0.00     0.00  LWLockAcquire
> 13.58      0.53     0.22   2648543     0.00     0.00  LWLockRelease
> 11.11      0.71     0.18   2621456     0.00     0.00  heapgettup
>  6.79      0.82     0.11   5266128     0.00     0.00  LockBuffer
>
> - Round 3 -
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
> 17.12      0.25     0.25   2648542     0.00     0.00  LWLockAcquire
>  8.22      0.37     0.12   2648543     0.00     0.00  LWLockRelease
>  7.53      0.48     0.11   2621456     0.00     0.00  heapgettup
>  6.85      0.58     0.10   2621440     0.00     0.00  ExecEvalConst
>
> There is some variance in the results, so my questions are:
> (1) Are these results faithful?
> (2) If so, do they indicate that LWLock needs some improvement?

Maybe, maybe not. The whole system is designed around high levels of
concurrent access. If you know for certain you don't ever need that, then
other systems are probably the right choice. Concurrency has a cost and a
benefit. If you measure the cost but not the benefit, it will seem
expensive.

Your results show you have 2.6m rows, not 260k rows. Yes?

Best Regards,
Simon Riggs
On Thu, 24 Nov 2005, Simon Riggs wrote:
>
> Maybe, maybe not. The whole system is designed around high levels of
> concurrent access. If you know for certain you don't ever need that, then
> other systems are probably the right choice. Concurrency has a cost and a
> benefit. If you measure the cost but not the benefit, it will seem
> expensive.
>

Yeah, understood. What I can't understand is why it costs so much in this
case -- without concurrency, the LWLock code path just involves a
spinlock lock/unlock and several simple instructions.

What's more, we can see that for each row, an LWLock pair is invoked. So,
as a more aggressive thought, can we change it to page level? I know it
is terribly difficult since our query processor infrastructure is based
on a single-tuple interface ...

> Your results show you have 2.6m rows, not 260k rows. Yes?

It is 260k, since each round consists of 10 "SELECT COUNT(*)" runs.

Regards,
Qingqing
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> Yeah, understood. What I can't understand is why it costs so much in this
> case -- without concurrency, the LWLock code path just involves a
> spinlock lock/unlock and several simple instructions.

You executed LWLock 2.6 million times in just under 300ms. If my math is
right, that's about 115 nanoseconds per lock, or about 300 cycles on a
2.6GHz processor.

That sounds like a lot, but it's about the right order of magnitude. Was
this on a multiprocessor machine? In which case a big part of that time
is probably spent synchronizing between the processors.

-- greg
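A back-of-envelope check of that arithmetic, as a trivial C sketch; the
inputs are just the profile figures quoted earlier (the Round 2
LWLockAcquire line), not a new measurement, and the 2.6GHz clock is
Greg's assumption (the test box is actually 2.4GHz):

    #include <stdio.h>

    int
    main(void)
    {
        double  seconds = 0.31;      /* LWLockAcquire self seconds, Round 2 */
        double  calls = 2648542.0;   /* LWLockAcquire call count */
        double  ghz = 2.6;           /* assumed clock rate */
        double  ns = seconds / calls * 1e9;

        /* prints roughly: 117 ns per call, ~304 cycles */
        printf("%.0f ns per call, ~%.0f cycles\n", ns, ns * ghz);
        return 0;
    }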
On Thu, 24 Nov 2005, Greg Stark wrote:
>
> You executed LWLock 2.6 million times in just under 300ms. If my math is
> right, that's about 115 nanoseconds per lock, or about 300 cycles on a
> 2.6GHz processor.
>
> That sounds like a lot, but it's about the right order of magnitude. Was
> this on a multiprocessor machine? In which case a big part of that time
> is probably spent synchronizing between the processors.
>

Your math is right iff my math is right :-) It is a 2.4GHz desktop
computer. I may need to write some separate tests to see if this is what
we should pay for the bus lock instruction.

Regards,
Qingqing
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> Yeah, understood. What I can't understand is why it costs so much in this
> case -- without concurrency, the LWLock code path just involves a
> spinlock lock/unlock and several simple instructions.

I don't see those costing nearly as much as your results show ...
perhaps there's something platform-specific at work? What I see, down to
the 1% level, is

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
37.98     13.91    13.91                              _mcount
 6.53     16.30     2.39   5242900     0.00     0.00  heapgettup
 3.33     17.52     1.22  10542596     0.00     0.00  LockBuffer
 3.30     18.73     1.21   5242880     0.00     0.00  advance_transition_function
 2.68     19.71     0.98   5242880     0.00     0.00  IncrBufferRefCount
 2.46     20.61     0.90   5385174     0.00     0.00  LWLockRelease
 2.38     21.48     0.87   5271273     0.00     0.00  ReleaseAndReadBuffer
 2.35     22.34     0.86   5385174     0.00     0.00  LWLockAcquire
 2.18     23.14     0.80   5242938     0.00     0.00  ReleaseBuffer
 2.10     23.91     0.77   5242900     0.00     0.00  ExecStoreTuple
 1.97     24.63     0.72                              noshlibs
 1.91     25.33     0.70   5242900     0.00     0.00  SeqNext
 1.72     25.96     0.63   5271294     0.00     0.00  ResourceOwnerRememberBuffer
 1.72     26.59     0.63   5242900     0.00     0.00  heap_getnext
 1.72     27.22     0.63   5242880     0.00     0.00  advance_aggregates
 1.69     27.84     0.62   5242940     0.00     0.00  ExecProcNode
 1.64     28.44     0.60                              $$dyncall
 1.61     29.03     0.59   5242900     0.00     0.00  MemoryContextReset
 1.53     29.59     0.56   5242880     0.00     0.00  HeapTupleSatisfiesSnapshot
 1.45     30.12     0.53   5242880     0.00     0.00  int8inc
 1.37     30.62     0.50   5243140     0.00     0.00  ExecClearTuple
 1.17     31.05     0.43   5242880     0.00     0.00  ExecEvalExprSwitchContext
 1.15     31.47     0.42   5271294     0.00     0.00  ResourceOwnerForgetBuffer
 1.12     31.88     0.41                              SeqNext
 1.09     32.28     0.40   5271294     0.00     0.00  ResourceOwnerEnlargeBuffers
 1.09     32.68     0.40   5242900     0.00     0.00  ExecScan
 1.04     33.06     0.38   5242900     0.00     0.00  ExecSeqScan

(This is for 20, not 10, iterations of your example, but otherwise it's
the same test case.) I've since gotten rid of the IncrBufferRefCount,
ReleaseBuffer, and ResourceOwnerXXX entries by eliminating some
inefficiency in ExecStoreTuple, so that puts the buffer lock stuff
further up, but it's still not all that critical by my numbers.

> What's more, we can see that for each row, an LWLock pair is invoked. So,
> as a more aggressive thought, can we change it to page level?

Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
stuff so that we do the visibility tests for all the tuples on a given
page at once, taking the buffer content lock just once, and saving aside
the valid tuple IDs to return later. This should definitely be faster
when all the tuples actually get fetched. It might be a bit slower for a
LIMIT query, but I'm not sure if we care that much.

The only other objection I can think of is that if there are any broken
tuples on a page, this approach would likely make it impossible to fetch
any of the non-broken ones :-(

			regards, tom lane
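To make the proposal concrete, a rough C sketch of such a page-at-a-time
pass follows. It is an illustration against assumed CVS-tip APIs only:
the scan-state fields rs_vistuples and rs_ntuples are invented names, and
the HeapTupleSatisfiesSnapshot() call is written from memory, so every
signature here is an assumption, not the eventual patch:

    /*
     * Sketch: test visibility of every tuple on one page under a single
     * buffer content lock, saving the visible line offsets for later.
     * rs_vistuples/rs_ntuples are hypothetical scan-state fields.
     */
    static void
    scan_page_at_a_time(HeapScanDesc scan, BlockNumber blkno)
    {
        Buffer      buffer = ReadBuffer(scan->rs_rd, blkno);
        Page        dp;
        OffsetNumber lineoff,
                    maxoff;
        int         ntup = 0;

        /* one LockBuffer pair per page instead of one per tuple */
        LockBuffer(buffer, BUFFER_LOCK_SHARE);

        dp = BufferGetPage(buffer);
        maxoff = PageGetMaxOffsetNumber(dp);

        for (lineoff = FirstOffsetNumber; lineoff <= maxoff; lineoff++)
        {
            ItemId      lpp = PageGetItemId(dp, lineoff);
            HeapTupleHeader tuple;

            if (!ItemIdIsUsed(lpp))
                continue;
            tuple = (HeapTupleHeader) PageGetItem(dp, lpp);

            /*
             * tqual signature assumed; since we still hold the real page
             * here, commit hint bits can be set as usual.
             */
            if (HeapTupleSatisfiesSnapshot(tuple, scan->rs_snapshot))
                scan->rs_vistuples[ntup++] = lineoff;
        }

        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

        /* keep the pin; heap_getnext() then returns saved offsets cheaply */
        scan->rs_cbuf = buffer;
        scan->rs_ntuples = ntup;
    }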
On Thu, 24 Nov 2005, Qingqing Zhou wrote:
>
> I may need to write some separate tests to see if this is what we should
> pay for the bus lock instruction.
>

Here I come up with a test program to see what a spinlock costs:

$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.134 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.107 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.104 ms

So it seems the lock instruction really costs ...

Regards,
Qingqing

---

/*
 * spintest.c
 *		Test spinlock acquire/release without concurrency.
 *
 * To compile (the -pg is to match the gprof-enabled make I used):
 *		backend/storage/lmgr# gcc -O2 -pg -Wall -I ../../../include/ spintest.c
 */
#include "postgres.h"

#include "storage/lwlock.h"
#include "storage/spin.h"
#include <sys/time.h>

#define TIMES 2648542

int			NumLocks = 0;

/* stub: with no concurrency, we should never block on the spinlock */
void
s_lock(volatile slock_t *lock, const char *file, int line)
{
	fprintf(stderr, "should never be here\n");
	abort();
}

int
main(void)
{
	int			i;
	slock_t		lock = 0;
	struct timeval start_t,
				stop_t;
	long		usecs;

	gettimeofday(&start_t, NULL);
	for (i = 0; i < TIMES; i++)
	{
		SpinLockAcquire_NoHoldoff(&lock);
		/* pretend to do something */
		NumLocks++;
		SpinLockRelease_NoHoldoff(&lock);
	}
	gettimeofday(&stop_t, NULL);

	if (stop_t.tv_usec < start_t.tv_usec)
	{
		stop_t.tv_sec--;
		stop_t.tv_usec += 1000000;
	}
	usecs = (long) (stop_t.tv_sec - start_t.tv_sec) * 1000000 +
		(long) (stop_t.tv_usec - start_t.tv_usec);

	fprintf(stdout, "Spinlock pair(%u) duration: %ld.%03ld ms\n",
			TIMES, usecs / 1000, usecs % 1000);

	return 0;
}
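For reference, on i386 the SpinLockAcquire path boils down to essentially
one locked xchg (an xchg with a memory operand implies LOCK on x86),
which is exactly the bus lock instruction being timed above. A simplified
standalone rendition, in the spirit of src/include/storage/s_lock.h but
not the exact source, with invented names:

    typedef unsigned char my_slock_t;

    /* returns nonzero if the lock was already held */
    static __inline__ int
    my_tas(volatile my_slock_t *lock)
    {
        my_slock_t  res = 1;

        /* the memory-operand xchg is implicitly LOCKed on x86 */
        __asm__ __volatile__(
            "	xchgb	%0,%1	\n"
            : "+q" (res), "+m" (*lock)
            :
            : "memory");
        return (int) res;
    }

    /* acquire: spin until my_tas() returns 0; release: a plain store */
    #define my_unlock(lock)		(*(lock) = 0)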
On Thu, 24 Nov 2005, Tom Lane wrote:
>
> I don't see those costing nearly as much as your results show ...
> perhaps there's something platform-specific at work? What I see, down to
> the 1% level, is
>

I can see your computer is really slow, so my theory is that since it is
easier to rein in a slow-running horse than a fast one, my spinlock on a
2.4GHz modern machine should take relatively longer to become effective.
Just kidding.

I am not sure what's happened, but in a previous email there is a program
I wrote to test spinlock performance. On my machine, the profiling
results match the standalone spinlock test.

> The only other objection I can think of is that if there are any broken
> tuples on a page, this approach would likely make it impossible to fetch
> any of the non-broken ones :-(

What do you mean by "broken tuple"? A data-corrupted tuple? So you mean
if the scan operator finds a broken tuple on a page, it will abort the
operation without returning any other good tuples? I think this is
acceptable.

Regards,
Qingqing
On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
> > What's more, we can see that for each row, an LWLock pair is invoked. So,
> > as a more aggressive thought, can we change it to page level?
>
> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
> stuff so that we do the visibility tests for all the tuples on a given
> page at once, taking the buffer content lock just once, and saving aside
> the valid tuple IDs to return later. This should definitely be faster
> when all the tuples actually get fetched.

And me also. Looks good from here. It would be a very localised
optimization, so fairly easy to implement. Even on your lower gprof
numbers this would be 15% faster stand-alone. Take into account the
significant reduction in LWLock requests and it might go much faster
still on a busy system. The gain for other users would be very good also.

> It might be a bit slower for
> a LIMIT query, but I'm not sure if we care that much.

Agreed. It would be fairly easy to make this happen only for full
SeqScans, and the great majority of those don't have a LIMIT of less than
one block.

> The only other
> objection I can think of is that if there are any broken tuples on a
> page, this approach would likely make it impossible to fetch any of the
> non-broken ones :-(

I was thinking of the brute force approach: take a complete copy of the
block when we read the first tuple off it (a sketch follows below). That
way any weirdness on-block is avoided until we logically attempt to read
that tuple. It also allows us to palloc the right amount of space the
first time.

This could be an interesting win: no other DBMS can take this approach.
I think it has further advantages also:

1. Copying the tuples locally will also mean that the data will probably
be in the L2 cache rather than main memory, so this could make things
faster still.

2. For large scans, the buffer for that block doesn't need to be pinned
after the first touch, so the buffer could be cleared and replaced with
another before we have logically finished reading the block. That will
improve fluidity in shared_buffers and improve efficiency.

Are you, or will you be, implementing this? If not, I'll implement this
- it seems an important BI optimization.

Best Regards,
Simon Riggs
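A minimal sketch of that brute-force variant, assuming the CVS-tip buffer
manager calls (ReadBuffer, LockBuffer, ReleaseBuffer); the function name
and the idea that the caller scans the returned copy are invented for
illustration:

    /*
     * Hypothetical: copy a whole block into backend-local memory under
     * one content lock, then let the scan read the copy at leisure.
     */
    static Page
    copy_block_local(Relation rel, BlockNumber blkno)
    {
        Buffer  buf = ReadBuffer(rel, blkno);
        Page    local_page = (Page) palloc(BLCKSZ);

        LockBuffer(buf, BUFFER_LOCK_SHARE);
        memcpy(local_page, BufferGetPage(buf), BLCKSZ);
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);

        /* point 2 above: the pin can be dropped before the copy is consumed */
        ReleaseBuffer(buf);

        return local_page;      /* caller scans this cache-warm local copy */
    }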
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
>> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
>> stuff so that we do the visibility tests for all the tuples on a given
>> page at once, taking the buffer content lock just once, and saving aside
>> the valid tuple IDs to return later. This should definitely be faster
>> when all the tuples actually get fetched.

> I was thinking of the brute force approach: take a complete copy of the
> block when we read the first tuple off it. That way any weirdness
> on-block is avoided until we logically attempt to read that tuple. It
> also allows us to palloc the right amount of space the first time.

That occurred to me too, but I rejected it on two grounds:
* All that data copying will be expensive.
* We couldn't update tuple commit hint bits on the real page.

Also, I'm not at all sure that it's a good idea to release the buffer
pin before we're genuinely done with the page. That sort of change would
impact such things as the VACUUM interlock algorithm. Maybe it'd be OK,
but I'm disinclined to mess with it for a dubious speed gain.

Even with my version of the proposal, it'd be risky to use the
check-all-in-advance approach for non-MVCC snapshots such as
SnapshotNow. I'm not sure that there are any places that would get
broken by missing tuples that weren't yet committed when the page was
first touched ... but I'm not sure there aren't, either. We could avoid
this risk by having two paths through heapgettup, though.

> Are you, or will you be, implementing this?

I plan to take a look at it soon.

			regards, tom lane
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> I can see your computer is really slow, so my theory is that since it is
> easier to rein in a slow-running horse than a fast one, my spinlock on a
> 2.4GHz modern machine should take relatively longer to become effective.
> Just kidding.

Is that "modern machine" a Xeon by any chance? We know that Xeons have
fairly awful concurrent performance, and the long latency for bus lock
instructions may well be the reason why. FWIW, the numbers I showed last
night were for an HPPA machine, which I used just because I chanced to
have CVS tip already built for profiling on it. I've since reproduced
the test on a spiffy new dual Xeon that Red Hat just bought me :-) ...
and I get similar numbers to yours. It'd be interesting to see the
results from an SMP Opteron, if anyone's got one handy.

>> The only other objection I can think of is that if there are any broken
>> tuples on a page, this approach would likely make it impossible to fetch
>> any of the non-broken ones :-(

> What do you mean by "broken tuple"? A data-corrupted tuple?

The specific case that's worrying me is a tuple with a corrupted xmin,
such that tqual.c fails with a "can't access transaction status" message.

> So you mean if the scan operator finds a broken tuple on a page, it will
> abort the operation without returning any other good tuples? I think
> this is acceptable.

I think it would be totally unacceptable if it meant that there was no
way at all to look at the other data on the same page as a corrupt
tuple. Seqscans with "LIMIT n" have been the traditional tool for
getting as much info as you could out of a corrupted page. However, it
now occurs to me that you could still cherry-pick non-corrupt tuples
with TID-scan queries, so this objection shouldn't be considered fatal.

			regards, tom lane
On Fri, 2005-11-25 at 09:54 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
> >> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
> >> stuff so that we do the visibility tests for all the tuples on a given
> >> page at once, taking the buffer content lock just once, and saving aside
> >> the valid tuple IDs to return later. This should definitely be faster
> >> when all the tuples actually get fetched.
>
> > I was thinking of the brute force approach: take a complete copy of the
> > block when we read the first tuple off it. That way any weirdness
> > on-block is avoided until we logically attempt to read that tuple. It
> > also allows us to palloc the right amount of space the first time.
>
> That occurred to me too, but I rejected it on two grounds:
> * All that data copying will be expensive.
> * We couldn't update tuple commit hint bits on the real page.

OK, the second one is the clincher.

> Also, I'm not at all sure that it's a good idea to release the buffer
> pin before we're genuinely done with the page. That sort of change would
> impact such things as the VACUUM interlock algorithm. Maybe it'd be OK,
> but I'm disinclined to mess with it for a dubious speed gain.

I think it's safer not to try it all in one go. It's worth coming back
to later, though: VACUUM will never remove rows the scanner can see
anyway.

> Even with my version of the proposal, it'd be risky to use the
> check-all-in-advance approach for non-MVCC snapshots such as
> SnapshotNow. I'm not sure that there are any places that would get
> broken by missing tuples that weren't yet committed when the page was
> first touched ... but I'm not sure there aren't, either. We could avoid
> this risk by having two paths through heapgettup, though.

That seems prudent. The main performance optimization is required for
MVCC access anyway; we seldom scan catalog tables, and hopefully they
never get too big.

> > Are you, or will you be, implementing this?
>
> I plan to take a look at it soon.

Much appreciated, and well done to Qingqing for spotting this.

Best Regards,
Simon Riggs
On Friday 25 November 2005 at 16:20, Tom Lane wrote:
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > I can see your computer is really slow, so my theory is that since it is
> > easier to rein in a slow-running horse than a fast one, my spinlock on a
> > 2.4GHz modern machine should take relatively longer to become effective.
> > Just kidding.
>
> Is that "modern machine" a Xeon by any chance? We know that Xeons have
> fairly awful concurrent performance, and the long latency for bus lock
> instructions may well be the reason why. FWIW, the numbers I showed last
> night were for an HPPA machine, which I used just because I chanced to
> have CVS tip already built for profiling on it. I've since reproduced
> the test on a spiffy new dual Xeon that Red Hat just bought me :-) ...
> and I get similar numbers to yours. It'd be interesting to see the
> results from an SMP Opteron, if anyone's got one handy.

Is this what you're looking for?

[thauvin@samsufi ~]$ egrep "(model name|MHz)" /proc/cpuinfo
model name      : AMD Opteron(tm) Processor 250
cpu MHz         : 2391.040
model name      : AMD Opteron(tm) Processor 250
cpu MHz         : 2391.040

$ cat /etc/mandrake-release
Mandrakelinux release 10.1 (Community) for x86_64

I can try to backport PostgreSQL 8.1.0 rpms from the development tree to
Mandriva 10.1, install them, and run some tests if you're really
interested.
On Fri, 25 Nov 2005, Tom Lane wrote:
>
> Is that "modern machine" a Xeon by any chance?
>

$#cat /proc/cpuinfo | grep "model name"
model name      : Intel(R) Pentium(R) 4 CPU 2.40GHz

I can find a 4-way Xeon (but it is shared by many users):

/h/164/zhouqq#cat /proc/cpuinfo |grep "model name"
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
model name      : Intel(R) Xeon(TM) CPU 2.40GHz

$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 161.785 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 160.661 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 155.505 ms

> it now occurs to me that you could still cherry-pick non-corrupt tuples
> with TID-scan queries, so this objection shouldn't be considered fatal.
>

It is great that this turns out not to be that difficult! What's more, a
parallel scan will be easier to implement based on page-level scan
operators.

Regards,
Qingqing
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > I plan to take a look at it soon. > "Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles the performance due to my test. - before - 1.56 s - now - 0.72 s Regards, Qingqing
On Sat, 2005-11-26 at 03:13 -0500, Qingqing Zhou wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> wrote
> >
> > I plan to take a look at it soon.
> >
>
> "Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles in
> performance in my test.
>
> - before -
> 1.56 s
>
> - now -
> 0.72 s

...and for emphasis: this optimization of SeqScans is not possible with
any other database system, so it's a big win for PostgreSQL.

Best Regards,
Simon Riggs
> ...and for emphasis: this optimization of SeqScans is not possible with
> any other database system, so it's a big win for PostgreSQL.

With any other db system? That's a big call. Why? Not even other MVCC
systems?

Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> ...and for emphasis: this optimization of SeqScans is not possible with
>> any other database system, so it's a big win for PostgreSQL.

> With any other db system? That's a big call. Why?

One could equally well spin it negatively, as "this optimization is not
needed with any other database" ...

			regards, tom lane
On Sat, 2005-11-26 at 11:53 -0500, Tom Lane wrote:
> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> >> ...and for emphasis: this optimization of SeqScans is not possible with
> >> any other database system, so it's a big win for PostgreSQL.
>
> > With any other db system? That's a big call. Why?
>
> One could equally well spin it negatively, as "this optimization is not
> needed with any other database" ...

Why is it spin to call it a big win? You amaze me. After all the time
we've spent investigating tuning the buffer manager, reducing the locking
required by a SeqScan to about 1% of what it was before is just way cool.
When did you last spend a few hours tuning an internals feature (not the
optimizer) and have something go 50% faster? Probably at least a month :-)

Thinking more about other systems, ISTM that Oracle can do this, as can
any MVCC-based system. OTOH DB2 and SQLServer take block-level read
locks, so they can do this too, but at a major loss of concurrency and
threat of deadlock. Having said that, *any* system that chose not to do
this would be severely sub-optimal with buffer manager contention, so
AFAICS they all need this optimization.

So, I think I'll take back the claim of uniqueness, but I'm happy with
the accuracy of the statement that this is a big win for PostgreSQL. And
a bigger win than most on the poor benighted Xeon, where reducing memory
contention/traffic seems to be crucial.

Best Regards,
Simon Riggs
<font face="Verdana, Helvetica, Arial"><span style="font-size:14.0px">Nice job Qingqing and Tom!<br /><br /> The improvedexecutor / agg performance will likely help BI / data warehouse customers a lot. I’ll get some DBT-3 results tosubstantiate as soon as we can.<br /><br /> - Luke<br /><br /><br /> On 11/26/05 12:13 AM, "Qingqing Zhou" <zhouqq@cs.toronto.edu>wrote:<br /><br /></span></font><blockquote><font face="Verdana, Helvetica, Arial"><span style="font-size:14.0px"><br/><br /> "Tom Lane" <tgl@sss.pgh.pa.us> wrote<br /> ><br /> > I plan to take a lookat it soon.<br /> ><br /><br /> "Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles the<br /> performancedue to my test.<br /><br /> - before -<br /> 1.56 s<br /><br /> - now -<br /> 0.72 s<br /><br /> Regards,<br />Qingqing<br /><br /><br /><br /> ---------------------------(end of broadcast)---------------------------<br /> TIP 1:if posting/reading through Usenet, please send an appropriate<br /> subscribe-nomail command to majordomo@postgresql.orgso that your<br /> message can get through to the mailing list cleanly<br /><br /><br /></span></font></blockquote><fontface="Verdana, Helvetica, Arial"><span style="font-size:14.0px"><br /></span></font>
Simon Riggs <simon@2ndquadrant.com> writes:
>>> ...and for emphasis: this optimization of SeqScans is not possible with
>>> any other database system, so it's a big win for PostgreSQL.

> Why is it spin to call it a big win?

I didn't say it wasn't a big win; it was the first part of the sentence
that bothered me. Without a lot more knowledge of the internals of the
commercial DBMSes than I think is public, you can't say whether this is
possible/useful/relevant for them.

			regards, tom lane
> Thinking more about other systems, ISTM that Oracle can do this, as can
> any MVCC-based system. OTOH DB2 and SQLServer take block-level read
> locks, so they can do this too, but at a major loss of concurrency and
> threat of deadlock. Having said that, *any* system that chose not to do
> this would be severely sub-optimal with buffer manager contention, so
> AFAICS they all need this optimization.

MySQL/InnoDB would presumably do it also, I wonder...

Chris
On Fri, Nov 25, 2005 at 10:20:11AM -0500, Tom Lane wrote:
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > I can see your computer is really slow, so my theory is that since it is
> > easier to rein in a slow-running horse than a fast one, my spinlock on a
> > 2.4GHz modern machine should take relatively longer to become effective.
> > Just kidding.
>
> Is that "modern machine" a Xeon by any chance? We know that Xeons have
> fairly awful concurrent performance, and the long latency for bus lock
> instructions may well be the reason why. FWIW, the numbers I showed last
> night were for an HPPA machine, which I used just because I chanced to
> have CVS tip already built for profiling on it. I've since reproduced
> the test on a spiffy new dual Xeon that Red Hat just bought me :-) ...
> and I get similar numbers to yours. It'd be interesting to see the
> results from an SMP Opteron, if anyone's got one handy.

Is there still interest in this? I've got a dual Opteron running FBSD.
(What would be the profiler to use on FBSD?)
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > (What would be the profiler to use on FBSD?) gprof should work fine. regards, tom lane
> > OTOH DB2 and SQLServer take block-level read locks, so they can do
> > this too, but at a major loss of concurrency and threat of deadlock.

Note that in the usual committed read isolation, they do not need to
read-lock a row! E.g. Informix only verifies that it could lock the row
(that there is no write lock). Only cursor stability leaves one read lock
until the next fetch, serializable actually leaves all read locks, and
SELECT FOR UPDATE takes an intent update lock.

Also, they usually feed a buffer of rows to the client, so when the
client does a fetch it gets a row from the client-side buffer. Only when
the buffer is empty do they get more from the server.

I think the statement holds that the optimization is pg-specific, and
cannot be directly compared to other db's.

Andreas
>> > OTOH DB2 and SQLServer take block-level read locks, so they can do
>> > this too, but at a major loss of concurrency and threat of deadlock.
>
> Note that in the usual committed read isolation, they do not need to
> read-lock a row! E.g. Informix only verifies that it could lock the row
> (that there is no write lock). Only cursor stability leaves one read lock
> until the next fetch, serializable actually leaves all read locks, and
> SELECT FOR UPDATE takes an intent update lock.

Not sure which product you're thinking about there. No such isolation
level in DB2 or SQLServer, AFAIK. Were you talking about just Informix?

DB2:
Uncommitted Read (UR) mode, "dirty read", isn't the default or the
recommended lock level for most apps. I was considering Cursor Stability
mode (or higher), which is the default unless you specifically set the
system default otherwise. You can always skip the deadlock threat by
using Uncommitted Read, at the risk of getting wrong results. There
isn't anything there I would ever want to emulate.

SQLServer:
READ COMMITTED does take share locks. There's a NO LOCK hint, true, but
it's not a default. READ_COMMITTED_SNAPSHOT, new in 2005, does row
versioning like Oracle/PostgreSQL, and doesn't take locks.

Out of interest: DB2 has learned from PostgreSQL that it's OK to read a
row and check whether it can see it before worrying about locks. The
recently introduced DB2_EVALUNCOMMITTED and DB2_SKIPINSERTED flags
provide PostgreSQL-like behaviour, new in the very latest release.

Best Regards,
Simon Riggs
> DB2:
> Uncommitted Read (UR) mode, "dirty read", isn't the default or the
> recommended lock level for most apps. I was considering Cursor
> Stability mode (or higher), which is the default

Sorry, they call it "read committed" but actually do cursor stability,
which does keep one lock on the last fetched row. Keeping the lock would
actually not be necessary to conform with ANSI "read committed". See
table 4 on page 8 of http://www.cs.ndsu.nodak.edu/~yawang/Snapshot.ppt

> SQLServer:
> READ COMMITTED does take share locks.

But it does not hold them. According to the docs it holds them "while
reading", which is not a very detailed description. How long is that
really, e.g. with an ODBC forward cursor fetch?

> There's a NO LOCK hint, true, but it's not a default.

That is for dirty/uncommitted reads.

Andreas