Thread: gprof SELECT COUNT(*) results
I did some gprof on a simple "SELECT COUNT(*) FROM test" query on cvs tip.

Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
gcc: 2.96
gprof: 2.13.90.0.2
./configure --without-readline

There are 260k or so records in table test(i int), about 1500 pages. I
set shared_buffers to 3000, which is enough to hold all data pages. Other
GUCs are at their defaults. After some warmup (to make sure these pages
are in the file system buffers), I ran "SELECT COUNT(*)" 10 times in each
round, and I tested 3 rounds. The results are:

- Round 1 -
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
16.67      0.27     0.27   2648542     0.00     0.00  LWLockAcquire
13.58      0.49     0.22   2648543     0.00     0.00  LWLockRelease
 8.02      0.62     0.13   5266128     0.00     0.00  LockBuffer
 8.02      0.75     0.13   2621456     0.00     0.00  heapgettup

- Round 2 -
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
19.14      0.31     0.31   2648542     0.00     0.00  LWLockAcquire
13.58      0.53     0.22   2648543     0.00     0.00  LWLockRelease
11.11      0.71     0.18   2621456     0.00     0.00  heapgettup
 6.79      0.82     0.11   5266128     0.00     0.00  LockBuffer

- Round 3 -
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
17.12      0.25     0.25   2648542     0.00     0.00  LWLockAcquire
 8.22      0.37     0.12   2648543     0.00     0.00  LWLockRelease
 7.53      0.48     0.11   2621456     0.00     0.00  heapgettup
 6.85      0.58     0.10   2621440     0.00     0.00  ExecEvalConst

There is some variance in the results, so my questions are:
(1) Are these results faithful?
(2) If so, do they indicate that LWLock needs some improvement?

Regards,
Qingqing
On Thu, 2005-11-24 at 13:25 -0500, Qingqing Zhou wrote:
> I did some gprof on a simple "SELECT COUNT(*) FROM test" query on cvs tip.
>
> Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
> gcc: 2.96
> gprof: 2.13.90.0.2
> ./configure --without-readline
>
> There are 260k or so records in table test(i int), about 1500 pages. I
> set shared_buffers to 3000, which is enough to hold all data pages. Other
> GUCs are at their defaults. After some warmup (to make sure these pages
> are in the file system buffers), I ran "SELECT COUNT(*)" 10 times in each
> round, and I tested 3 rounds. The results are:
>
> - Round 1 -
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
> 16.67      0.27     0.27   2648542     0.00     0.00  LWLockAcquire
> 13.58      0.49     0.22   2648543     0.00     0.00  LWLockRelease
>  8.02      0.62     0.13   5266128     0.00     0.00  LockBuffer
>  8.02      0.75     0.13   2621456     0.00     0.00  heapgettup
>
> - Round 2 -
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
> 19.14      0.31     0.31   2648542     0.00     0.00  LWLockAcquire
> 13.58      0.53     0.22   2648543     0.00     0.00  LWLockRelease
> 11.11      0.71     0.18   2621456     0.00     0.00  heapgettup
>  6.79      0.82     0.11   5266128     0.00     0.00  LockBuffer
>
> - Round 3 -
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
> 17.12      0.25     0.25   2648542     0.00     0.00  LWLockAcquire
>  8.22      0.37     0.12   2648543     0.00     0.00  LWLockRelease
>  7.53      0.48     0.11   2621456     0.00     0.00  heapgettup
>  6.85      0.58     0.10   2621440     0.00     0.00  ExecEvalConst
>
> There is some variance in the results, so my questions are:
> (1) Are these results faithful?
> (2) If so, do they indicate that LWLock needs some improvement?

Maybe, maybe not. The whole system is designed around high levels of
concurrent access. If you know for certain you don't ever need that, then
other systems are probably the right choice. Concurrency has a cost and a
benefit. If you measure the cost but not the benefit, it will seem
expensive.

Your results show you have 2.6m rows, not 260k rows. Yes?

Best Regards,
Simon Riggs
On Thu, 24 Nov 2005, Simon Riggs wrote:
>
> Maybe, maybe not. The whole system is designed around high levels of
> concurrent access. If you know for certain you don't ever need that, then
> other systems are probably the right choice. Concurrency has a cost and a
> benefit. If you measure the cost but not the benefit, it will seem
> expensive.
>

Yeah, understood. What I can't understand is why it costs so much in this
case -- without concurrency, the LWLock code path just involves a
spinlock lock/unlock and several simple instructions.

What's more, we can see that for each row, an LWLock pair is invoked. So,
as a more aggressive thought, can we change it to page level? I know it
is terribly difficult since our query processor infrastructure is based
on a single-tuple interface ...

> Your results show you have 2.6m rows, not 260k rows. Yes?

It is 260k, since each round consists of 10 "SELECT COUNT(*)" runs.

Regards,
Qingqing
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> Yeah, understood. What I can't understand is why it costs so much in this
> case -- without concurrency, the LWLock code path just involves a
> spinlock lock/unlock and several simple instructions.

You executed LWLock 2.6 million times in just under 300ms. If my math is
right, that's about 115 nanoseconds per lock, or about 300 cycles on a
2.6GHz processor.

That sounds like a lot, but it's about the right order of magnitude. Was
this on a multiprocessor machine? In which case a big part of that time
is probably spent synchronizing between the processors.

-- greg
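A back-of-envelope check of that arithmetic, as a trivial C sketch; the
inputs are just the profile figures quoted earlier (the Round 2
LWLockAcquire line), not a new measurement, and the 2.6GHz clock is
Greg's assumption (the test box is actually 2.4GHz):

    #include <stdio.h>

    int
    main(void)
    {
        double  seconds = 0.31;      /* LWLockAcquire self seconds, Round 2 */
        double  calls = 2648542.0;   /* LWLockAcquire call count */
        double  ghz = 2.6;           /* assumed clock rate */
        double  ns = seconds / calls * 1e9;

        /* prints roughly: 117 ns per call, ~304 cycles */
        printf("%.0f ns per call, ~%.0f cycles\n", ns, ns * ghz);
        return 0;
    }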
On Thu, 24 Nov 2005, Greg Stark wrote:
>
> You executed LWLock 2.6 million times in just under 300ms. If my math is
> right, that's about 115 nanoseconds per lock, or about 300 cycles on a
> 2.6GHz processor.
>
> That sounds like a lot, but it's about the right order of magnitude. Was
> this on a multiprocessor machine? In which case a big part of that time
> is probably spent synchronizing between the processors.
>

Your math is right iff my math is right :-) It is a 2.4GHz desktop
computer. I may need to write some separate tests to see if this is what
we should pay for the bus lock instruction.

Regards,
Qingqing
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> Yeah, understood. What I can't understand is why it costs so much in this
> case -- without concurrency, the LWLock code path just involves a
> spinlock lock/unlock and several simple instructions.

I don't see those costing nearly as much as your results show ...
perhaps there's something platform-specific at work? What I see, down to
the 1% level, is

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
37.98     13.91    13.91                              _mcount
 6.53     16.30     2.39   5242900     0.00     0.00  heapgettup
 3.33     17.52     1.22  10542596     0.00     0.00  LockBuffer
 3.30     18.73     1.21   5242880     0.00     0.00  advance_transition_function
 2.68     19.71     0.98   5242880     0.00     0.00  IncrBufferRefCount
 2.46     20.61     0.90   5385174     0.00     0.00  LWLockRelease
 2.38     21.48     0.87   5271273     0.00     0.00  ReleaseAndReadBuffer
 2.35     22.34     0.86   5385174     0.00     0.00  LWLockAcquire
 2.18     23.14     0.80   5242938     0.00     0.00  ReleaseBuffer
 2.10     23.91     0.77   5242900     0.00     0.00  ExecStoreTuple
 1.97     24.63     0.72                              noshlibs
 1.91     25.33     0.70   5242900     0.00     0.00  SeqNext
 1.72     25.96     0.63   5271294     0.00     0.00  ResourceOwnerRememberBuffer
 1.72     26.59     0.63   5242900     0.00     0.00  heap_getnext
 1.72     27.22     0.63   5242880     0.00     0.00  advance_aggregates
 1.69     27.84     0.62   5242940     0.00     0.00  ExecProcNode
 1.64     28.44     0.60                              $$dyncall
 1.61     29.03     0.59   5242900     0.00     0.00  MemoryContextReset
 1.53     29.59     0.56   5242880     0.00     0.00  HeapTupleSatisfiesSnapshot
 1.45     30.12     0.53   5242880     0.00     0.00  int8inc
 1.37     30.62     0.50   5243140     0.00     0.00  ExecClearTuple
 1.17     31.05     0.43   5242880     0.00     0.00  ExecEvalExprSwitchContext
 1.15     31.47     0.42   5271294     0.00     0.00  ResourceOwnerForgetBuffer
 1.12     31.88     0.41                              SeqNext
 1.09     32.28     0.40   5271294     0.00     0.00  ResourceOwnerEnlargeBuffers
 1.09     32.68     0.40   5242900     0.00     0.00  ExecScan
 1.04     33.06     0.38   5242900     0.00     0.00  ExecSeqScan

(This is for 20, not 10, iterations of your example, but otherwise it's
the same test case.) I've since gotten rid of the IncrBufferRefCount,
ReleaseBuffer, and ResourceOwnerXXX entries by eliminating some
inefficiency in ExecStoreTuple, so that puts the buffer lock stuff
further up, but it's still not all that critical by my numbers.

> What's more, we can see that for each row, an LWLock pair is invoked. So,
> as a more aggressive thought, can we change it to page level?

Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
stuff so that we do the visibility tests for all the tuples on a given
page at once, taking the buffer content lock just once, and saving aside
the valid tuple IDs to return later. This should definitely be faster
when all the tuples actually get fetched. It might be a bit slower for a
LIMIT query, but I'm not sure if we care that much.

The only other objection I can think of is that if there are any broken
tuples on a page, this approach would likely make it impossible to fetch
any of the non-broken ones :-(

			regards, tom lane
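To make the proposal concrete, a rough C sketch of such a page-at-a-time
pass follows. It is an illustration against assumed CVS-tip APIs only:
the scan-state fields rs_vistuples and rs_ntuples are invented names, and
the HeapTupleSatisfiesSnapshot() call is written from memory, so every
signature here is an assumption, not the eventual patch:

    /*
     * Sketch: test visibility of every tuple on one page under a single
     * buffer content lock, saving the visible line offsets for later.
     * rs_vistuples/rs_ntuples are hypothetical scan-state fields.
     */
    static void
    scan_page_at_a_time(HeapScanDesc scan, BlockNumber blkno)
    {
        Buffer      buffer = ReadBuffer(scan->rs_rd, blkno);
        Page        dp;
        OffsetNumber lineoff,
                    maxoff;
        int         ntup = 0;

        /* one LockBuffer pair per page instead of one per tuple */
        LockBuffer(buffer, BUFFER_LOCK_SHARE);

        dp = BufferGetPage(buffer);
        maxoff = PageGetMaxOffsetNumber(dp);

        for (lineoff = FirstOffsetNumber; lineoff <= maxoff; lineoff++)
        {
            ItemId      lpp = PageGetItemId(dp, lineoff);
            HeapTupleHeader tuple;

            if (!ItemIdIsUsed(lpp))
                continue;
            tuple = (HeapTupleHeader) PageGetItem(dp, lpp);

            /*
             * tqual signature assumed; since we still hold the real page
             * here, commit hint bits can be set as usual.
             */
            if (HeapTupleSatisfiesSnapshot(tuple, scan->rs_snapshot))
                scan->rs_vistuples[ntup++] = lineoff;
        }

        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

        /* keep the pin; heap_getnext() then returns saved offsets cheaply */
        scan->rs_cbuf = buffer;
        scan->rs_ntuples = ntup;
    }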
On Thu, 24 Nov 2005, Qingqing Zhou wrote:
>
> I may need to write some separate tests to see if this is what we should
> pay for the bus lock instruction.
>

Here I come up with a test program to see what a spinlock costs:

$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.134 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.107 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 143.104 ms

So it seems the lock instruction really costs ...

Regards,
Qingqing

---

/*
 * spintest.c
 *		Test spinlock acquire/release without concurrency.
 *
 * To compile (the -pg is to match the gprof-enabled make I used):
 *		backend/storage/lmgr# gcc -O2 -pg -Wall -I ../../../include/ spintest.c
 */
#include "postgres.h"

#include "storage/lwlock.h"
#include "storage/spin.h"
#include <sys/time.h>

#define TIMES 2648542

int			NumLocks = 0;

/* stub: with no concurrency, we should never block on the spinlock */
void
s_lock(volatile slock_t *lock, const char *file, int line)
{
	fprintf(stderr, "should never be here\n");
	abort();
}

int
main(void)
{
	int			i;
	slock_t		lock = 0;
	struct timeval start_t,
				stop_t;
	long		usecs;

	gettimeofday(&start_t, NULL);
	for (i = 0; i < TIMES; i++)
	{
		SpinLockAcquire_NoHoldoff(&lock);
		/* pretend to do something */
		NumLocks++;
		SpinLockRelease_NoHoldoff(&lock);
	}
	gettimeofday(&stop_t, NULL);

	if (stop_t.tv_usec < start_t.tv_usec)
	{
		stop_t.tv_sec--;
		stop_t.tv_usec += 1000000;
	}
	usecs = (long) (stop_t.tv_sec - start_t.tv_sec) * 1000000 +
		(long) (stop_t.tv_usec - start_t.tv_usec);

	fprintf(stdout, "Spinlock pair(%u) duration: %ld.%03ld ms\n",
			TIMES, usecs / 1000, usecs % 1000);

	return 0;
}
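For reference, on i386 the SpinLockAcquire path boils down to essentially
one locked xchg (an xchg with a memory operand implies LOCK on x86),
which is exactly the bus lock instruction being timed above. A simplified
standalone rendition, in the spirit of src/include/storage/s_lock.h but
not the exact source, with invented names:

    typedef unsigned char my_slock_t;

    /* returns nonzero if the lock was already held */
    static __inline__ int
    my_tas(volatile my_slock_t *lock)
    {
        my_slock_t  res = 1;

        /* the memory-operand xchg is implicitly LOCKed on x86 */
        __asm__ __volatile__(
            "	xchgb	%0,%1	\n"
            : "+q" (res), "+m" (*lock)
            :
            : "memory");
        return (int) res;
    }

    /* acquire: spin until my_tas() returns 0; release: a plain store */
    #define my_unlock(lock)		(*(lock) = 0)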
On Thu, 24 Nov 2005, Tom Lane wrote:
>
> I don't see those costing nearly as much as your results show ...
> perhaps there's something platform-specific at work? What I see, down to
> the 1% level, is
>

I can see your computer is really slow, so my theory is that since it is
easier to rein in a slow-running horse than a fast one, my spinlock on a
2.4GHz modern machine should take relatively longer to become effective.
Just kidding.

I am not sure what's happened, but in a previous email there is a program
I wrote to test spinlock performance. On my machine, the profiling
results match the standalone spinlock test.

> The only other objection I can think of is that if there are any broken
> tuples on a page, this approach would likely make it impossible to fetch
> any of the non-broken ones :-(

What do you mean by "broken tuple"? A data-corrupted tuple? So you mean
if the scan operator finds a broken tuple on a page, it will abort the
operation without returning any other good tuples? I think this is
acceptable.

Regards,
Qingqing
On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
> > What's more, we can see that for each row, an LWLock pair is invoked. So,
> > as a more aggressive thought, can we change it to page level?
>
> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
> stuff so that we do the visibility tests for all the tuples on a given
> page at once, taking the buffer content lock just once, and saving aside
> the valid tuple IDs to return later. This should definitely be faster
> when all the tuples actually get fetched.

And me also. Looks good from here. It would be a very localised
optimization, so fairly easy to implement. Even on your lower gprof
numbers this would be 15% faster stand-alone. Take into account the
significant reduction in LWLock requests and it might go much faster
still on a busy system. The gain for other users would be very good also.

> It might be a bit slower for
> a LIMIT query, but I'm not sure if we care that much.

Agreed. It would be fairly easy to make this happen only for full
SeqScans, and the great majority of those don't have a LIMIT of less than
one block.

> The only other
> objection I can think of is that if there are any broken tuples on a
> page, this approach would likely make it impossible to fetch any of the
> non-broken ones :-(

I was thinking of the brute force approach: take a complete copy of the
block when we read the first tuple off it (a sketch follows below). That
way any weirdness on-block is avoided until we logically attempt to read
that tuple. It also allows us to palloc the right amount of space the
first time.

This could be an interesting win: no other DBMS can take this approach.
I think it has further advantages also:

1. Copying the tuples locally will also mean that the data will probably
be in the L2 cache rather than main memory, so this could make things
faster still.

2. For large scans, the buffer for that block doesn't need to be pinned
after the first touch, so the buffer could be cleared and replaced with
another before we have logically finished reading the block. That will
improve fluidity in shared_buffers and improve efficiency.

Are you, or will you be, implementing this? If not, I'll implement this
- it seems an important BI optimization.

Best Regards,
Simon Riggs
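A minimal sketch of that brute-force variant, assuming the CVS-tip buffer
manager calls (ReadBuffer, LockBuffer, ReleaseBuffer); the function name
and the idea that the caller scans the returned copy are invented for
illustration:

    /*
     * Hypothetical: copy a whole block into backend-local memory under
     * one content lock, then let the scan read the copy at leisure.
     */
    static Page
    copy_block_local(Relation rel, BlockNumber blkno)
    {
        Buffer  buf = ReadBuffer(rel, blkno);
        Page    local_page = (Page) palloc(BLCKSZ);

        LockBuffer(buf, BUFFER_LOCK_SHARE);
        memcpy(local_page, BufferGetPage(buf), BLCKSZ);
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);

        /* point 2 above: the pin can be dropped before the copy is consumed */
        ReleaseBuffer(buf);

        return local_page;      /* caller scans this cache-warm local copy */
    }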
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
>> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
>> stuff so that we do the visibility tests for all the tuples on a given
>> page at once, taking the buffer content lock just once, and saving aside
>> the valid tuple IDs to return later. This should definitely be faster
>> when all the tuples actually get fetched.

> I was thinking of the brute force approach: take a complete copy of the
> block when we read the first tuple off it. That way any weirdness
> on-block is avoided until we logically attempt to read that tuple. It
> also allows us to palloc the right amount of space the first time.

That occurred to me too, but I rejected it on two grounds:
* All that data copying will be expensive.
* We couldn't update tuple commit hint bits on the real page.

Also, I'm not at all sure that it's a good idea to release the buffer
pin before we're genuinely done with the page. That sort of change would
impact such things as the VACUUM interlock algorithm. Maybe it'd be OK,
but I'm disinclined to mess with it for a dubious speed gain.

Even with my version of the proposal, it'd be risky to use the
check-all-in-advance approach for non-MVCC snapshots such as
SnapshotNow. I'm not sure that there are any places that would get
broken by missing tuples that weren't yet committed when the page was
first touched ... but I'm not sure there aren't, either. We could avoid
this risk by having two paths through heapgettup, though.

> Are you, or will you be, implementing this?

I plan to take a look at it soon.

			regards, tom lane
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> I can see your computer is really slow, so my theory is that since it is
> easier to rein in a slow-running horse than a fast one, my spinlock on a
> 2.4GHz modern machine should take relatively longer to become effective.
> Just kidding.

Is that "modern machine" a Xeon by any chance? We know that Xeons have
fairly awful concurrent performance, and the long latency for bus lock
instructions may well be the reason why. FWIW, the numbers I showed last
night were for an HPPA machine, which I used just because I chanced to
have CVS tip already built for profiling on it. I've since reproduced
the test on a spiffy new dual Xeon that Red Hat just bought me :-) ...
and I get similar numbers to yours. It'd be interesting to see the
results from an SMP Opteron, if anyone's got one handy.

>> The only other objection I can think of is that if there are any broken
>> tuples on a page, this approach would likely make it impossible to fetch
>> any of the non-broken ones :-(

> What do you mean by "broken tuple"? A data-corrupted tuple?

The specific case that's worrying me is a tuple with a corrupted xmin,
such that tqual.c fails with a "can't access transaction status" message.

> So you mean if the scan operator finds a broken tuple on a page, it will
> abort the operation without returning any other good tuples? I think
> this is acceptable.

I think it would be totally unacceptable if it meant that there was no
way at all to look at the other data on the same page as a corrupt
tuple. Seqscans with "LIMIT n" have been the traditional tool for
getting as much info as you could out of a corrupted page. However, it
now occurs to me that you could still cherry-pick non-corrupt tuples
with TID-scan queries, so this objection shouldn't be considered fatal.

			regards, tom lane
On Fri, 2005-11-25 at 09:54 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Thu, 2005-11-24 at 23:48 -0500, Tom Lane wrote:
> >> Yeah, I was wondering the same. It'd be possible to rewrite the seqscan
> >> stuff so that we do the visibility tests for all the tuples on a given
> >> page at once, taking the buffer content lock just once, and saving aside
> >> the valid tuple IDs to return later. This should definitely be faster
> >> when all the tuples actually get fetched.
>
> > I was thinking of the brute force approach: take a complete copy of the
> > block when we read the first tuple off it. That way any weirdness
> > on-block is avoided until we logically attempt to read that tuple. It
> > also allows us to palloc the right amount of space the first time.
>
> That occurred to me too, but I rejected it on two grounds:
> * All that data copying will be expensive.
> * We couldn't update tuple commit hint bits on the real page.

OK, the second one is the clincher.

> Also, I'm not at all sure that it's a good idea to release the buffer
> pin before we're genuinely done with the page. That sort of change would
> impact such things as the VACUUM interlock algorithm. Maybe it'd be OK,
> but I'm disinclined to mess with it for a dubious speed gain.

I think it's safer not to try it all in one go. It's worth coming back
to later, though: VACUUM will never remove rows the scanner can see
anyway.

> Even with my version of the proposal, it'd be risky to use the
> check-all-in-advance approach for non-MVCC snapshots such as
> SnapshotNow. I'm not sure that there are any places that would get
> broken by missing tuples that weren't yet committed when the page was
> first touched ... but I'm not sure there aren't, either. We could avoid
> this risk by having two paths through heapgettup, though.

That seems prudent. The main performance optimization is required for
MVCC access anyway; we seldom scan catalog tables, and hopefully they
never get too big.

> > Are you, or will you be, implementing this?
>
> I plan to take a look at it soon.

Much appreciated, and well done to Qingqing for spotting this.

Best Regards,
Simon Riggs
On Friday 25 November 2005 at 16:20, Tom Lane wrote:
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > I can see your computer is really slow, so my theory is that since it is
> > easier to rein in a slow-running horse than a fast one, my spinlock on a
> > 2.4GHz modern machine should take relatively longer to become effective.
> > Just kidding.
>
> Is that "modern machine" a Xeon by any chance? We know that Xeons have
> fairly awful concurrent performance, and the long latency for bus lock
> instructions may well be the reason why. FWIW, the numbers I showed last
> night were for an HPPA machine, which I used just because I chanced to
> have CVS tip already built for profiling on it. I've since reproduced
> the test on a spiffy new dual Xeon that Red Hat just bought me :-) ...
> and I get similar numbers to yours. It'd be interesting to see the
> results from an SMP Opteron, if anyone's got one handy.

Is this what you're looking for?

[thauvin@samsufi ~]$ egrep "(model name|MHz)" /proc/cpuinfo
model name      : AMD Opteron(tm) Processor 250
cpu MHz         : 2391.040
model name      : AMD Opteron(tm) Processor 250
cpu MHz         : 2391.040

$ cat /etc/mandrake-release
Mandrakelinux release 10.1 (Community) for x86_64

I can try to backport PostgreSQL 8.1.0 rpms from the development tree to
Mandriva 10.1, install them, and run some tests if you're really
interested.
On Fri, 25 Nov 2005, Tom Lane wrote:
>
> Is that "modern machine" a Xeon by any chance?
>

$#cat /proc/cpuinfo | grep "model name"
model name      : Intel(R) Pentium(R) 4 CPU 2.40GHz

I can find a 4-way Xeon (but it is shared by many users):

/h/164/zhouqq#cat /proc/cpuinfo |grep "model name"
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
model name      : Intel(R) Xeon(TM) CPU 2.40GHz

$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 161.785 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 160.661 ms
$/pgsql/src/backend/storage/lmgr#./a.out
Spinlock pair(2648542) duration: 155.505 ms

> it now occurs to me that you could still cherry-pick non-corrupt tuples
> with TID-scan queries, so this objection shouldn't be considered fatal.
>

It is great that this turns out not to be that difficult! What's more, a
parallel scan will be easier to implement based on page-level scan
operators.

Regards,
Qingqing
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > I plan to take a look at it soon. > "Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles the performance due to my test. - before - 1.56 s - now - 0.72 s Regards, Qingqing
On Sat, 2005-11-26 at 03:13 -0500, Qingqing Zhou wrote:
> "Tom Lane" <tgl@sss.pgh.pa.us> wrote
> >
> > I plan to take a look at it soon.
> >
>
> "Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles in
> performance in my test.
>
> - before -
> 1.56 s
>
> - now -
> 0.72 s

...and for emphasis: this optimization of SeqScans is not possible with
any other database system, so it's a big win for PostgreSQL.

Best Regards,
Simon Riggs
> ...and for emphasis: this optimization of SeqScans is not possible with
> any other database system, so it's a big win for PostgreSQL.

With any other db system? That's a big call. Why? Not even other MVCC
systems?

Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
>> ...and for emphasis: this optimization of SeqScans is not possible with
>> any other database system, so it's a big win for PostgreSQL.

> With any other db system? That's a big call. Why?

One could equally well spin it negatively, as "this optimization is not
needed with any other database" ...

			regards, tom lane
On Sat, 2005-11-26 at 11:53 -0500, Tom Lane wrote:
> Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> >> ...and for emphasis: this optimization of SeqScans is not possible with
> >> any other database system, so it's a big win for PostgreSQL.
>
> > With any other db system? That's a big call. Why?
>
> One could equally well spin it negatively, as "this optimization is not
> needed with any other database" ...

Why is it spin to call it a big win? You amaze me. After all the time
we've spent investigating tuning the buffer manager, reducing the locking
required by a SeqScan to about 1% of what it was before is just way cool.
When did you last spend a few hours tuning an internals feature (not the
optimizer) and have something go 50% faster? Probably at least a month :-)

Thinking more about other systems, ISTM that Oracle can do this, as can
any MVCC-based system. OTOH DB2 and SQLServer take block-level read
locks, so they can do this too, but at a major loss of concurrency and
threat of deadlock. Having said that, *any* system that chose not to do
this would be severely sub-optimal with buffer manager contention, so
AFAICS they all need this optimization.

So, I think I'll take back the claim of uniqueness, but I'm happy with
the accuracy of the statement that this is a big win for PostgreSQL. And
a bigger win than most on the poor benighted Xeon, where reducing memory
contention/traffic seems to be crucial.

Best Regards,
Simon Riggs
<font face="Verdana, Helvetica, Arial"><span style="font-size:14.0px">Nice job Qingqing and Tom!<br /><br /> The improvedexecutor / agg performance will likely help BI / data warehouse customers a lot. I’ll get some DBT-3 results tosubstantiate as soon as we can.<br /><br /> - Luke<br /><br /><br /> On 11/26/05 12:13 AM, "Qingqing Zhou" <zhouqq@cs.toronto.edu>wrote:<br /><br /></span></font><blockquote><font face="Verdana, Helvetica, Arial"><span style="font-size:14.0px"><br/><br /> "Tom Lane" <tgl@sss.pgh.pa.us> wrote<br /> ><br /> > I plan to take a lookat it soon.<br /> ><br /><br /> "Stand corrected"[Merlin]! In-memory "SELECT COUNT(*)" doubles the<br /> performancedue to my test.<br /><br /> - before -<br /> 1.56 s<br /><br /> - now -<br /> 0.72 s<br /><br /> Regards,<br />Qingqing<br /><br /><br /><br /> ---------------------------(end of broadcast)---------------------------<br /> TIP 1:if posting/reading through Usenet, please send an appropriate<br /> subscribe-nomail command to majordomo@postgresql.orgso that your<br /> message can get through to the mailing list cleanly<br /><br /><br /></span></font></blockquote><fontface="Verdana, Helvetica, Arial"><span style="font-size:14.0px"><br /></span></font>
Simon Riggs <simon@2ndquadrant.com> writes:
>>> ...and for emphasis: this optimization of SeqScans is not possible with
>>> any other database system, so it's a big win for PostgreSQL.

> Why is it spin to call it a big win?

I didn't say it wasn't a big win; it was the first part of the sentence
that bothered me. Without a lot more knowledge of the internals of the
commercial DBMSes than I think is public, you can't say whether this is
possible/useful/relevant for them.

			regards, tom lane
> Thinking more about other systems, ISTM that Oracle can do this, as can
> any MVCC-based system. OTOH DB2 and SQLServer take block-level read
> locks, so they can do this too, but at a major loss of concurrency and
> threat of deadlock. Having said that, *any* system that chose not to do
> this would be severely sub-optimal with buffer manager contention, so
> AFAICS they all need this optimization.

MySQL/InnoDB would presumably do it also, I wonder...

Chris
On Fri, Nov 25, 2005 at 10:20:11AM -0500, Tom Lane wrote:
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > I can see your computer is really slow, so my theory is that since it is
> > easier to rein in a slow-running horse than a fast one, my spinlock on a
> > 2.4GHz modern machine should take relatively longer to become effective.
> > Just kidding.
>
> Is that "modern machine" a Xeon by any chance? We know that Xeons have
> fairly awful concurrent performance, and the long latency for bus lock
> instructions may well be the reason why. FWIW, the numbers I showed last
> night were for an HPPA machine, which I used just because I chanced to
> have CVS tip already built for profiling on it. I've since reproduced
> the test on a spiffy new dual Xeon that Red Hat just bought me :-) ...
> and I get similar numbers to yours. It'd be interesting to see the
> results from an SMP Opteron, if anyone's got one handy.

Is there still interest in this? I've got a dual Opteron running FBSD.
(What would be the profiler to use on FBSD?)
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > (What would be the profiler to use on FBSD?) gprof should work fine. regards, tom lane
> > OTOH DB2 and SQLServer take block-level read locks, so they can do
> > this too, but at a major loss of concurrency and threat of deadlock.

Note that in the usual committed read isolation, they do not need to
read-lock a row! E.g. Informix only verifies that it could lock the row
(that there is no write lock). Only cursor stability leaves one read lock
until the next fetch, serializable actually leaves all read locks, and
SELECT FOR UPDATE takes an intent update lock.

Also, they usually feed a buffer of rows to the client, so when the
client does a fetch it gets a row from the client-side buffer. Only when
the buffer is empty do they get more from the server.

I think the statement holds that the optimization is pg-specific, and
cannot be directly compared to other db's.

Andreas
>> > OTOH DB2 and SQLServer take block-level read locks, so they can do
>> > this too, but at a major loss of concurrency and threat of deadlock.
>
> Note that in the usual committed read isolation, they do not need to
> read-lock a row! E.g. Informix only verifies that it could lock the row
> (that there is no write lock). Only cursor stability leaves one read lock
> until the next fetch, serializable actually leaves all read locks, and
> SELECT FOR UPDATE takes an intent update lock.

Not sure which product you're thinking about there. No such isolation
level in DB2 or SQLServer, AFAIK. Were you talking about just Informix?

DB2:
Uncommitted Read (UR) mode, "dirty read", isn't the default or the
recommended lock level for most apps. I was considering Cursor Stability
mode (or higher), which is the default unless you specifically set the
system default otherwise. You can always skip the deadlock threat by
using Uncommitted Read, at the risk of getting wrong results. There
isn't anything there I would ever want to emulate.

SQLServer:
READ COMMITTED does take share locks. There's a NO LOCK hint, true, but
it's not a default. READ_COMMITTED_SNAPSHOT, new in 2005, does row
versioning like Oracle/PostgreSQL, and doesn't take locks.

Out of interest: DB2 has learned from PostgreSQL that it's OK to read a
row and check whether it can see it before worrying about locks. The
recently introduced DB2_EVALUNCOMMITTED and DB2_SKIPINSERTED flags
provide PostgreSQL-like behaviour, new in the very latest release.

Best Regards,
Simon Riggs
> DB2:
> Uncommitted Read (UR) mode, "dirty read", isn't the default or the
> recommended lock level for most apps. I was considering Cursor
> Stability mode (or higher), which is the default

Sorry, they call it "read committed" but actually do cursor stability,
which does keep one lock on the last fetched row. Keeping the lock would
actually not be necessary to conform with ANSI "read committed". See
table 4 on page 8 of http://www.cs.ndsu.nodak.edu/~yawang/Snapshot.ppt

> SQLServer:
> READ COMMITTED does take share locks.

But it does not hold them. According to the docs it holds them "while
reading", which is not a very detailed description. How long is that
really, e.g. with an ODBC forward cursor fetch?

> There's a NO LOCK hint, true, but it's not a default.

That is for dirty/uncommitted reads.

Andreas