Performance optimization of btree binary search

From: Peter Geoghegan
Having nothing better to do over the holiday weekend, I decided to
pursue a number of ideas for improving performance that I thought
about a long time ago. These include:

* Pre-fetching list node pointers. This looks to be moderately
promising, but I'm certainly not going to be the one to land it, given
present obligations. Stephen Frost may wish to pick it up, given his
previous interest in the matter. This is slightly controversial,
because it uses a GCC intrinsic (__builtin_prefetch), but also because
the Linux kernel removed this optimization from their generic list
data structure [1]. However, that list was what we'd call an embedded
list, so we probably shouldn't be totally deterred. The amount
of effort that I put into this was, frankly, pretty low. A motivated
person, willing to do the appropriate analysis, could probably bring it
further. For one thing, every single foreach() has a call to this
intrinsic, even where the list doesn't store pointers (which is not
undefined behavior, merely useless). At the very least that's going to
bloat things up, frequently for no conceivable gain, and yet with the
patch applied we're still able to see quite tangible benefits, even if
it isn't exactly a stellar improvement. I have an idea that prefetching
the last element at the start of the loop could be much better than
what I did, because we know that those lists are mostly pretty small in
practice, and that's likely to help pipelining - prefetching too late
or even too early makes the optimization useless, because you may
still get a cache miss. (A sketch of the basic idea appears just after
this list.)

* Optimizing index scans - I noticed that binary searching accounted
for many cache misses during a pgbench select.sql benchmark,
instrumented with "perf record -e cache-misses". This warranted
further investigation.
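
To make the former concrete, here is a minimal sketch of the kind of
loop prefetch being discussed - illustrative only, using a simplified
stand-in for PostgreSQL's ListCell rather than the benchmarked code:

#include <stddef.h>

typedef struct my_cell          /* simplified stand-in for ListCell */
{
    void           *ptr_value;
    struct my_cell *next;
} my_cell;

extern void process_element(void *elem);    /* hypothetical callback */

static void
walk_list(my_cell *head)
{
    my_cell    *cell;

    for (cell = head; cell != NULL; cell = cell->next)
    {
        /*
         * Ask the CPU to start pulling the next cell into cache while
         * the current one is processed.  GCC/Clang-only intrinsic; a
         * portable build would hide it behind a macro that degrades to
         * a no-op on other compilers.
         */
        if (cell->next != NULL)
            __builtin_prefetch(cell->next);

        process_element(cell->ptr_value);
    }
}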

I won't say anything further about the former optimization, except to
note that it's included for comparative purposes in the set of
benchmarks I've run (I haven't included a patch). The benchmark
results are here:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/results

I took two approaches to the latter. This was the more interesting
piece of work. Test sets include:

* Master baseline (green)

* List optimization (as mentioned above, not really relevant to the
main topic of this mail) (red)

* "fib btree", earlier patch, please disregard (blue)

* "Fixed fib patch", Fibonacci search, no specialization (purple)

* The interesting one, Fibonacci search + specialization - "fib + no
jump" (turquoise, see below for details)

Initially, I had a little bit of success with Fibonacci search [2] in
place of binary search, in the hope that it would better take
advantage of CPU cache characteristics - Fibonacci search is said to
have advantages where non-uniform memory access is an issue - it
minimizes the final interval. I wasn't all that optimistic that it
would work that well given the smallish size of BLCKSZ relative to
modern CPU L1 cache sizes [3], but it did make an appreciable dent on
its own. I suppose old habits die hard, because next I hacked up
_bt_compare and had it do an int4btcmp directly, in the event of
encountering a scankey that had as its comparator the relevant pg_proc
oid. This is very much in the style (and the spirit) of the grotty
early draft patches for the inlining-comparators-for-sorting patch.
Patch is attached. This is a draft, a POC, posted only to facilitate
discussion and to allow my results to be independently
duplicated/verified. Note that there is a bug (attributable to the new
search code) that causes the regression tests to fail in exactly one
place (i.e. one line of difference). I didn't think it was worth
deferring discussion to deal with that, though, since I don't think it
undermines anything.
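
For anyone without the textbook [2] to hand, here is a self-contained
sketch of Fibonacci search over a sorted array of ints. This is purely
illustrative - it is not code from the attached patch - but it shows
the salient property: probe offsets shrink by Fibonacci steps, so the
later probes land progressively closer together.

/* Return the index of key in arr[0..n-1], or -1 if it isn't present. */
static int
fib_search(const int *arr, int n, int key)
{
    int     fib2 = 0;           /* F(k-2) */
    int     fib1 = 1;           /* F(k-1) */
    int     fib = 1;            /* F(k): smallest Fibonacci number >= n */
    int     offset = -1;        /* elements at or before offset are < key */

    while (fib < n)
    {
        fib2 = fib1;
        fib1 = fib;
        fib = fib1 + fib2;
    }

    while (fib > 1)
    {
        /* probe the valid index closest to offset + F(k-2) */
        int     i = (offset + fib2 < n - 1) ? offset + fib2 : n - 1;

        if (arr[i] < key)
        {
            /* search above the probe; shift the window down one step */
            fib = fib1;
            fib1 = fib2;
            fib2 = fib - fib1;
            offset = i;
        }
        else if (arr[i] > key)
        {
            /* search below the probe; shift the window down two steps */
            fib = fib2;
            fib1 = fib1 - fib2;
            fib2 = fib - fib1;
        }
        else
            return i;
    }

    /* one element may remain unchecked */
    if (fib1 == 1 && offset + 1 < n && arr[offset + 1] == key)
        return offset + 1;

    return -1;
}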

I'm not sure how valuable the comparator trick is if we stick with
binary search - I didn't test that. I'm sure it's a question that must
be considered, though.

I have a fairly huge amount of data here, having run plenty of
benchmarks over several nights. The short version is that the 'select'
benchmark has just over 18% greater throughput on this machine at some
client counts (in particular, when there are 4 clients - there are 4
cores, but 8 logical cores) with the attached patch. There is a 3.5%
regression with one client, which is certainly not accounted for by
noise. Note, however, that latency appears consistently better with
the patch applied. This is a machine running on dedicated hardware: a
4-core i7-4770. The working set easily fits in its 32GiB of DDR3 RAM
at all pgbench scales tested (1, 10 and 100). The kernel used is
"3.8.0-31-generic #46~precise1-Ubuntu SMP". Postgres settings are
typical for this kind of thing (6GiB shared_buffers), but you can
refer to my pgbench-tools results for full details (drill down to an
individual pgbench run for that - they're all the same). I'm kind of
curious as to what this benchmark would look like on a server with
many more cores.

I guess I could write a proper patch to have code setting up a scankey
also set a flag that indicated that it was acceptable to assume that
the special built-in comparator would do fine. I don't think that
anything as involved as the sortsupport infrastructure is required
here, because that presumed (perhaps questionably) that it was widely
useful to provide more direct comparators, and that an extensible
framework was warranted, whereas what I've done is likely to have very
limited type-applicability. In general, people don't often put indexes
on floating point columns (where we need to handle NaN comparisons in
a special way in order to provide behavior consistent with various
invariants that btree operator classes need to respect). But I'm
reasonably confident that the majority of primary indexes in the wild
are int4 or composites of int4, maybe even the large majority.

I'd be happy with a scheme that had only one built-in comparator, and
allowed a few types to be cataloged such that it was indicated that
just using the "built-in" comparator was acceptable, knowledge that
could be passed to _bt_compare via the scankey. I'm thinking of just
int4, and maybe date and a few other such int4 "covariant" types.
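
Roughly, the fast path I'm imagining looks like the fragment below.
The SK_BUILTIN_CMP flag is a name invented here purely for
illustration; sk_flags, sk_func, sk_collation and sk_argument are the
real scankey fields, and the fallback is the usual fmgr call:

static inline int32
builtin_int4_cmp(Datum a, Datum b)
{
    int32       va = DatumGetInt32(a);
    int32       vb = DatumGetInt32(b);

    return (va > vb) - (va < vb);
}

/* ...then, in _bt_compare()-style code, per attribute... */
if (scankey->sk_flags & SK_BUILTIN_CMP)     /* hypothetical hint flag */
    result = builtin_int4_cmp(datum, scankey->sk_argument);
else
    result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
                                             scankey->sk_collation,
                                             datum,
                                             scankey->sk_argument));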

I don't think that this has to be morally equivalent to violating the
btree/AM abstraction by giving it direct knowledge of types - rather,
I'm only proposing that we give it the facility to know that for the
purposes of scankey comparison, a given type (well, some scankey that
has been appropriately hinted by the index Relation or something like
that) has Datums that compare like int4 Datums. Theoretically, it has
nothing in particular to do with the cataloged int4 type as such - it
has more to do with the fact that a comparison of 32-bit integers is a
single instruction on popular ISAs. More practically speaking, this
optimization seems likely to appreciably help many common cases.
Obviously, I'm not proposing that it be committed more or less as-is,
since it's entirely possible that the applicability of what I've done
isn't universal; I don't presume that it's never counter-productive.
How this could be most usefully applied is a discussion well worth
having, though. To be frank: hopefully we won't have a protracted
discussion this time. This stuff is not a high priority for me these
days.

Having mentioned CPU instructions, I should also say: as with the
sorting stuff, this optimization has little to nothing to do with
shaving some instructions from not having to go through the fmgr
trampoline. The true cost for that trampoline is almost certainly paid
in cache misses and, perhaps to a lesser extent, missed opportunities
for successful branch prediction.

Thoughts?

[1] http://lwn.net/Articles/444336/

[2] http://www.mpri.lsu.edu/textbook/Chapter5-a.htm#fibonacci

[3] http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/
--
Peter Geoghegan


Re: Performance optimization of btree binary search

From: Tom Lane
Peter Geoghegan <pg@heroku.com> writes:
> I guess I could write a proper patch to have code setting up a scankey
> also set a flag that indicated that it was acceptable to assume that
> the special built-in comparator would do fine. ...
> I'd be happy with a scheme that had only one built-in comparator, and
> allowed a few types to be cataloged such that it was indicated that
> just using the "built-in" comparator was acceptable, knowledge that
> could be passed to _bt_compare via the scankey. I'm thinking of just
> int4, and maybe date and a few other such int4 "covariant" types.

If what you're proposing is that we have a fast path that compares Datums
as Datums, I should think that that would work fine for int2 as well,
*and* for int8 on machines where int8 is pass-by-value.  (Does anyone
still care much about PG's performance on 32-bit hardware?)  We might
have to fool a bit with the fooGetDatum macros in some cases, eg
I think Int16GetDatum isn't careful about sign extension.  Admittedly,
that might introduce an offsetting cost on some hardware, but I think
on most machines sign-extension isn't noticeably more expensive than
zero-extension.
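
To illustrate - this is only a sketch, using hypothetical
sign-extending variants of the GetDatum macros that we do not have
today - the payoff would be a single signed full-word comparison
covering int2, int4, and pass-by-value int8:

#include <stdint.h>

typedef uintptr_t Datum;        /* as on 64-bit builds */

/* hypothetical sign-extending counterparts of the existing macros */
#define Int16GetDatumSx(x)  ((Datum) (intptr_t) (int16_t) (x))
#define Int32GetDatumSx(x)  ((Datum) (intptr_t) (int32_t) (x))
#define Int64GetDatumSx(x)  ((Datum) (intptr_t) (int64_t) (x))

/* one comparator for all pass-by-value integer widths */
static int
signed_datum_cmp(Datum a, Datum b)
{
    intptr_t    va = (intptr_t) a;
    intptr_t    vb = (intptr_t) b;

    return (va > vb) - (va < vb);
}

For instance, signed_datum_cmp(Int32GetDatumSx(-1), Int32GetDatumSx(2))
yields -1, whereas comparing the zero-extended bit patterns that the
current macros produce would rank 4294967295 above 2.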
        regards, tom lane



Re: Performance optimization of btree binary search

From: Robert Haas
On Wed, Dec 4, 2013 at 4:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> I guess I could write a proper patch to have code setting up a scankey
>> also set a flag that indicated that it was acceptable to assume that
>> the special built-in comparator would do fine. ...
>> I'd be happy with a scheme that had only one built-in comparator, and
>> allowed a few types to be cataloged such that it was indicated that
>> just using the "built-in" comparator was acceptable, knowledge that
>> could be passed to _bt_compare via the scankey. I'm thinking of just
>> int4, and maybe date and a few other such int4 "covariant" types.
>
> If what you're proposing is that we have a fast path that compares Datums
> as Datums, I should think that that would work fine for int2 as well,
> *and* for int8 on machines where int8 is pass-by-value.  (Does anyone
> still care much about PG's performance on 32-bit hardware?)  We might
> have to fool a bit with the fooGetDatum macros in some cases, eg
> I think Int16GetDatum isn't careful about sign extension.  Admittedly,
> that might introduce an offsetting cost on some hardware, but I think
> on most machines sign-extension isn't noticeably more expensive than
> zero-extension.

Yeah, I think if we can make something like this work, it would be
neat-o.  Getting this working for int4 would be a good win, as Peter
says, but getting it working for both int4 and int8 with the same code
would be a significantly better one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance optimization of btree binary search

From: Peter Geoghegan
On Wed, Dec 4, 2013 at 1:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, I think if we can make something like this work, it would be
> neat-o.  Getting this working for int4 would be a good win, as Peter
> says, but getting it working for both int4 and int8 with the same code
> would be a significantly better one.

No arguments here. I think I didn't initially suggest it myself out of
passing concern about the guarantees around how unused Datum bits are
initialized in all relevant contexts, but having looked at it for a
second I see that we are of course disciplined there.


-- 
Peter Geoghegan



Re: Performance optimization of btree binary search

From: Tom Lane
Peter Geoghegan <pg@heroku.com> writes:
> On Wed, Dec 4, 2013 at 1:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Yeah, I think if we can make something like this work, it would be
>> neat-o.  Getting this working for int4 would be a good win, as Peter
>> says, but getting it working for both int4 and int8 with the same code
>> would be a significantly better one.

> No arguments here. I think I didn't initially suggest it myself out of
> passing concern about the guarantees around how unused Datum bits are
> initialized in all relevant contexts, but having looked at it for a
> second I see that we are of course disciplined there.

Hm ... actually, the comment at lines 335ff of postgres.h points out that
a Datum returned from a version 0 user-defined function might contain
garbage in the high order bits.  We could fix that, probably, with some
cleanup code added to fmgr_oldstyle.  It'd waste a few cycles ... but if
there's anybody out there who still cares about the performance of such
functions, it's high time they fixed them to be v1, anyway.
        regards, tom lane



Re: Performance optimization of btree binary search

From: Robert Haas
On Wed, Dec 4, 2013 at 6:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Dec 4, 2013 at 1:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Yeah, I think if we can make something like this work, it would be
>> neat-o.  Getting this working for int4 would be a good win, as Peter
>> says, but getting it working for both int4 and int8 with the same code
>> would be a significantly better one.
>
> No arguments here. I think I didn't initially suggest it myself out of
> passing concern about the guarantees around how unused Datum bits are
> initialized in all relevant contexts, but having looked at it for a
> second I see that we are of course disciplined there.

Hmm.  And yet, there's this:
 * When a type narrower than Datum is stored in a Datum, we place it in the
 * low-order bits and are careful that the DatumGetXXX macro for it discards
 * the unused high-order bits (as opposed to, say, assuming they are zero).
 * This is needed to support old-style user-defined functions, since depending
 * on architecture and compiler, the return value of a function returning char
 * or short may contain garbage when called as if it returned Datum.

And record_image_eq does a rather elaborate dance around here, calling
the appropriate GET_x_BYTES macro depending on the type-width.  If we
can really count on the high-order bits to be zero, that's all
completely unnecessary tomfoolery.
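
For reference, the dance has approximately this shape (paraphrased
from memory rather than copied, so treat the details as approximate):

#include "postgres.h"

/* paraphrased shape of the by-value branch in record_image_eq() */
static bool
byval_image_eq(int typlen, Datum datum1, Datum datum2)
{
    bool        result = false;

    switch (typlen)
    {
        case 1:
            result = (GET_1_BYTE(datum1) == GET_1_BYTE(datum2));
            break;
        case 2:
            result = (GET_2_BYTES(datum1) == GET_2_BYTES(datum2));
            break;
        case 4:
            result = (GET_4_BYTES(datum1) == GET_4_BYTES(datum2));
            break;
#if SIZEOF_DATUM == 8
        case 8:
            result = (GET_8_BYTES(datum1) == GET_8_BYTES(datum2));
            break;
#endif
        default:
            elog(ERROR, "unexpected typlen: %d", typlen);
    }

    /*
     * If the high-order bits were reliably zero, every branch above
     * could collapse to "return datum1 == datum2".
     */
    return result;
}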

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance optimization of btree binary search

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> Hmm.  And yet, there's this:

>  * When a type narrower than Datum is stored in a Datum, we place it in the
>  * low-order bits and are careful that the DatumGetXXX macro for it discards
>  * the unused high-order bits (as opposed to, say, assuming they are zero).
>  * This is needed to support old-style user-defined functions, since depending
>  * on architecture and compiler, the return value of a function returning char
>  * or short may contain garbage when called as if it returned Datum.

> And record_image_eq does a rather elaborate dance around here, calling
> the appropriate GET_x_BYTES macro depending on the type-width.  If we
> can really count on the high-order bits to be zero, that's all
> completely unnecessary tomfoolery.

Yeah, that's another thing we could simplify if we fixed this problem
at the source.  I think these decisions date from a time when we still
cared about the speed of fmgr_oldstyle.
        regards, tom lane



Re: Performance optimization of btree binary search

From: Tom Lane
I wrote:
> Yeah, that's another thing we could simplify if we fixed this problem
> at the source.  I think these decisions date from a time when we still
> cared about the speed of fmgr_oldstyle.

BTW, the text you're quoting is from 2007, but it's just documenting
behavior that's mostly a lot older.  It's worth reading commit 23a41573
in toto in this connection.  I'm not sure if we'd want to revert that
DatumGetBool change or not, if we were to clean up fmgr_oldstyle.
We'd be able to do whatever was cheapest, but I'm not sure what that is.
        regards, tom lane



Re: Performance optimization of btree binary search

From: Robert Haas
On Wed, Dec 4, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Hmm.  And yet, there's this:
>
>>  * When a type narrower than Datum is stored in a Datum, we place it in the
>>  * low-order bits and are careful that the DatumGetXXX macro for it discards
>>  * the unused high-order bits (as opposed to, say, assuming they are zero).
>>  * This is needed to support old-style user-defined functions, since depending
>>  * on architecture and compiler, the return value of a function returning char
>>  * or short may contain garbage when called as if it returned Datum.
>
>> And record_image_eq does a rather elaborate dance around here, calling
>> the appropriate GET_x_BYTES macro depending on the type-width.  If we
>> can really count on the high-order bits to be zero, that's all
>> completely unnecessary tomfoolery.
>
> Yeah, that's another thing we could simplify if we fixed this problem
> at the source.  I think these decisions date from a time when we still
> cared about the speed of fmgr_oldstyle.

Sure, let's whack that thing with a crowbar.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Performance optimization of btree binary search

From: Peter Eisentraut
On Wed, 2013-12-04 at 19:45 -0500, Robert Haas wrote:
> On Wed, Dec 4, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Yeah, that's another thing we could simplify if we fixed this problem
> > at the source.  I think these decisions date from a time when we still
> > cared about the speed of fmgr_oldstyle.
> 
> Sure, let's whack that thing with a crowbar.

Or just remove it.  Who still needs it?





Re: Performance optimization of btree binary search

From: Tom Lane
Peter Eisentraut <peter_e@gmx.net> writes:
> On Wed, 2013-12-04 at 19:45 -0500, Robert Haas wrote:
>> On Wed, Dec 4, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Yeah, that's another thing we could simplify if we fixed this problem
>>> at the source.  I think these decisions date from a time when we still
>>> cared about the speed of fmgr_oldstyle.

>> Sure, let's whack that thing with a crowbar.

> Or just remove it.  Who still needs it?

Lazy people?  I'm not in a hurry to drop it; it's not costing us much to
just sit there, other than in this connection which we see how to fix.
        regards, tom lane



Re: Performance optimization of btree binary search

From: Peter Geoghegan
On Wed, Dec 4, 2013 at 12:58 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'm kind of
> curious as to what this benchmark would look like on a server with
> many more cores.

I'm also curious about the impact on insertion into primary key
indexes. Presently, we hold an exclusive buffer lock for the duration
of a couple of operations when checkUnique != UNIQUE_CHECK_NO.
_bt_binsrch() is one such operation. The other one there,
_bt_check_unique(), is likely to be a lot cheaper than _bt_binsrch()
on average, I think, so I'm cautiously optimistic that it'll be
noticeable. I better go and check it out.


-- 
Peter Geoghegan



Re: Performance optimization of btree binary search

From: Peter Geoghegan
On Wed, Dec 4, 2013 at 5:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'm also curious about the impact on insertion into primary key
> indexes. Presently, we hold an exclusive buffer lock for the duration
> of a couple of operations when checkUnique != UNIQUE_CHECK_NO.
> _bt_binsrch() is one such operation. The other one there,
> _bt_check_unique(), is likely to be a lot cheaper than _bt_binsrch()
> on average, I think, so I'm cautiously optimistic that it'll be
> noticeable. I better go and check it out.

Depending on how well this goes, I might also teach _bt_doinsert() to
hint to _bt_binsrch() (or as I'm calling it, _bt_page_search()) that
it should look to the end of the page when searching, using a similar
mechanism to the mechanism for hinting that the main Datum-compare
optimization is applicable (this strategy would be abandoned if it
didn't work immediately - as soon as the last item on the page turned
out to be greater than or equal to the scankey value). This is
something that I think would help with SERIAL columns, where it's
possible in principle to pass that kind of insight around -- if you
can live with making SERIAL more than mere syntactic sugar.
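
In generic terms the hint boils down to the following - an
illustrative sketch over a plain sorted array, not actual nbtree code:

#include <stdbool.h>

/*
 * Find the insertion point for key in arr[0..n-1].  With the rightmost
 * hint set, probe the last element first: serial inserts nearly always
 * belong at the end, and the hint is abandoned as soon as it misses.
 */
static int
insertion_point(const int *arr, int n, int key, bool hint_rightmost)
{
    int         lo = 0;
    int         hi = n;

    if (hint_rightmost && n > 0 && arr[n - 1] < key)
        return n;               /* hint paid off; no binary search needed */

    while (lo < hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (arr[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                  /* first index such that arr[lo] >= key */
}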

-- 
Peter Geoghegan



Re: Performance optimization of btree binary search

From: Peter Eisentraut
On Wed, 2013-12-04 at 20:27 -0500, Tom Lane wrote:
> Lazy people?  I'm not in a hurry to drop it; it's not costing us much to
> just sit there, other than in this connection which we see how to fix.

Actually, I think it probably costs a fair portion of extension authors
when their initial code crashes because they forgot to declare all their
functions V1.  I think it might actually be more of a bother to lazy
people than a benefit.





Re: Performance optimization of btree binary search

From: Tom Lane
Peter Eisentraut <peter_e@gmx.net> writes:
> On Wed, 2013-12-04 at 20:27 -0500, Tom Lane wrote:
>> Lazy people?  I'm not in a hurry to drop it; it's not costing us much to
>> just sit there, other than in this connection which we see how to fix.

> Actually, I think it probably costs a fair portion of extension authors
> when their initial code crashes because they forgot to declare all their
> functions V1.  I think it might actually be more of a bother to lazy
> people than a benefit.

Hm.  We have heard one or two complaints like that, but not a huge number.

I'm worried about breaking code that's been working since god-knows-when;
but I will concede there's little evidence that there's very much of that
out there either.
        regards, tom lane



Re: Performance optimization of btree binary search

From: Peter Geoghegan
On Wed, Dec 4, 2013 at 5:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'm also curious about the impact on insertion into primary key
> indexes. Presently, we hold an exclusive buffer lock for the duration
> of a couple of operations when checkUnique != UNIQUE_CHECK_NO.
> _bt_binsrch() is one such operation. The other one there,
> _bt_check_unique(), is likely to be a lot cheaper than _bt_binsrch()
> on average, I think, so I'm cautiously optimistic that it'll be
> noticeable. I better go and check it out.

I did a relatively short variant 'insert' pgbench-tools benchmark,
with a serial primary index created on the pgbench_history table.
Results:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/insert/

Note that in the interest of increasing the signal-to-noise ratio, I
used unlogged tables. These are considerably shorter, two-minute runs
of inserts/writes, but even so it's interesting to compare it to my
original 'select' benchmark. Checkpoint settings were very high.

This looks to be about a wash. I'm not that surprised, because this is
all about memory bandwidth, not lock contention. Even with the lock
contention on the rightmost btree page, we top out at about the same
TPS level as the 'select' benchmark (at least compared to Postgres
master). However, we top out at that level much sooner here (growth is
much steeper), because the contention is mostly on the same one or two
btree leaf pages, which are well cached by CPU L1 cache. Writes are
actually quite a bit faster than uniformly-distributed reads, even
with the exclusive buffer locking (and lock contention) necessitated
by writing. You might even say that my patch makes the first benchmark
look more like this one (a less awkward comparison could probably be
made between what I've done for uniform distribution read workloads,
and a read workload with far more uneven data access, which will
naturally have fewer cache misses and less TLB contention). I strongly
suspect I wouldn't have seen a significant number of cache misses when
binary searching on btree pages if I was to instrument this workload.

It might still be worth pursuing the SERIAL hinting mechanism, but
it's not going to be nearly as big a win as what I've done for reads,
I think. It might also be worth looking at a workload with uniformly
distributed btree index tuple insertions, where I'd expect similar
advantages to those originally shown for reads.


-- 
Peter Geoghegan



Re: Performance optimization of btree binary search

From: Andres Freund
On 2013-12-04 18:48:44 -0500, Robert Haas wrote:
>  * When a type narrower than Datum is stored in a Datum, we place it in the
>  * low-order bits and are careful that the DatumGetXXX macro for it discards
>  * the unused high-order bits (as opposed to, say, assuming they are zero).
>  * This is needed to support old-style user-defined functions, since depending
>  * on architecture and compiler, the return value of a function returning char
>  * or short may contain garbage when called as if it returned Datum.
> 
> And record_image_eq does a rather elaborate dance around here, calling
> the appropriate GET_x_BYTES macro depending on the type-width.  If we
> can really count on the high-order bits to be zero, that's all
> completely unnecessary tomfoolery.

I don't think we can get rid of that dance in record_image_eq - it very
well could be used on records originally generated when those bits
haven't been guaranteed to be zeroed.
Other use cases where the appropriate DatumGetFoo() macros are used don't
have a problem with that since it's cleared again there, but in
record_image_eq we can't rely on that.

Or am I missing something?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance optimization of btree binary search

From: Heikki Linnakangas
On 12/05/2013 07:30 AM, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> On Wed, 2013-12-04 at 20:27 -0500, Tom Lane wrote:
>>> Lazy people?  I'm not in a hurry to drop it; it's not costing us much to
>>> just sit there, other than in this connection which we see how to fix.
>
>> Actually, I think it probably costs a fair portion of extension authors
>> when their initial code crashes because they forgot to declare all their
>> functions V1.  I think it might actually be more of a bother to lazy
>> people than a benefit.
>
> Hm.  We have heard one or two complaints like that, but not a huge number.

It happens to me about 75% of the time when I write a new C function. 
These days, I usually realize it pretty quickly.

I wonder how easy it would be to make the compiler produce a warning 
about it. Or issue a warning in PostgreSQL when you do CREATE FUNCTION 
and the C function appears to be a V0 function.

> I'm worried about breaking code that's been working since god-knows-when;
> but I will concede there's little evidence that there's very much of that
> out there either.

I wouldn't mind just dropping the V0 support. It's not difficult to 
modify your function for V1 convention, so if there's still anyone out 
there using V0, it wouldn't be unreasonable to require them to update.

- Heikki



Re: Performance optimization of btree binary search

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-12-04 18:48:44 -0500, Robert Haas wrote:
>> And record_image_eq does a rather elaborate dance around here, calling
>> the appropriate GET_x_BYTES macro depending on the type-width.  If we
>> can really count on the high-order bits to be zero, that's all
>> completely unnecessary tomfoolery.

> I don't think we can get rid of that dance in record_image_eq - it very
> well could used on records originally generated when those bits haven't
> been guaranteed to be zeroed.

No, you're failing to think about the context here.  A Datum is not an
on-disk concept, only an in-memory one.  In the case of record_image_eq,
simplifying the comparison of by-value Datums is unconditionally safe
as long as heap_deform_tuple is consistent about using the proper
GetDatum macros, which it is.  (So actually, that "elaborate dance"
is a 100% waste of effort today, regardless of any changes we're
discussing here.)

The risk we take by simplifying comparisons in a more general context
is that some function somewhere might've been sloppy about doing a
native-type-to-Datum conversion on its result.  In the case of V0
functions that risk is unavoidable except by adding some explicit cleanup
code, but I'm a bit worried that somebody, particularly third-party code,
might've sloppily written "return foo" in a V1 function when "return
Int32GetDatum(foo)" would be correct.  In that case, the resultant Datum
might have not-per-spec high-order bits, and if it reaches the fast
comparator without ever having been squeezed into a physical tuple,
we've got a problem.

I would certainly regard this as a bug in that function, but nonetheless
it's a hazard that we need to set against any performance improvement
that we can buy this way.
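
To spell out the kind of sloppiness I mean (a made-up example, not
taken from any real module):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_negate);

Datum
my_negate(PG_FUNCTION_ARGS)
{
    int32       arg = PG_GETARG_INT32(0);

    /*
     * WRONG: on 64-bit builds the implicit int32-to-Datum conversion
     * sign-extends, so a negative result carries all-ones high-order
     * bits instead of the zeroes Int32GetDatum() would produce.  The
     * correct spelling is PG_RETURN_INT32(-arg).
     */
    return -arg;
}
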
        regards, tom lane



Re: Performance optimization of btree binary search

From: Andres Freund
On 2013-12-05 08:58:55 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I don't think we can get rid of that dance in record_image_eq - it very
> > well could be used on records originally generated when those bits haven't
> > been guaranteed to be zeroed.
> 
> No, you're failing to think about the context here.

Ah yes. I was completely forgetting that
heap_form_tuple()/heap_fill_tuple() will properly take care to only use
meaningful parts of the (to-be-)stored data, not random padding.

Thanks.

> The risk we take by simplifying comparisons in a more general context
> is that some function somewhere might've been sloppy about doing a
> native-type-to-Datum conversion on its result.  In the case of V0
> functions that risk is unavoidable except by adding some explicit cleanup
> code, but I'm a bit worried that somebody, particularly third-party code,
> might've sloppily written "return foo" in a V1 function when "return
> Int32GetDatum(foo)" would be correct.  In that case, the resultant Datum
> might have not-per-spec high-order bits, and if it reaches the fast
> comparator without ever having been squeezed into a physical tuple,
> we've got a problem.

Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have
allowed asserts checking against such cases by setting
fcinfo->has_returned = true or such...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance optimization of btree binary search

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-12-05 08:58:55 -0500, Tom Lane wrote:
>> I'm a bit worried that somebody, particularly third-party code,
>> might've sloppily written "return foo" in a V1 function when "return
>> Int32GetDatum(foo)" would be correct.  In that case, the resultant Datum
>> might have not-per-spec high-order bits, and if it reaches the fast
>> comparator without ever having been squeezed into a physical tuple,
>> we've got a problem.

> Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have
> allowed asserts checking against such cases by setting
> fcinfo->has_returned = true or such...

[ shrug... ]  PG_RETURN_DATUM has no practical way to verify that the
given Datum was constructed safely, so I think we'd just be adding
overhead with not much real safety gain.

In practice, if we were to change Datum to be a signed type (intptr_t
not uintptr_t), the most common cases would probably do the right thing
anyway, ie an int or short return value would get promoted to Datum
with sign-extension.
        regards, tom lane



Re: Performance optimization of btree binary search

From: Andres Freund
On 2013-12-05 10:02:56 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-12-05 08:58:55 -0500, Tom Lane wrote:
> >> I'm a bit worried that somebody, particularly third-party code,
> >> might've sloppily written "return foo" in a V1 function when "return
> >> Int32GetDatum(foo)" would be correct.  In that case, the resultant Datum
> >> might have not-per-spec high-order bits, and if it reaches the fast
> >> comparator without ever having been squeezed into a physical tuple,
> >> we've got a problem.
> 
> > Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have
> > allowed asserts checking against such cases by setting
> > fcinfo->has_returned = true or such...
> 
> [ shrug... ]  PG_RETURN_DATUM has no practical way to verify that the
> given Datum was constructed safely, so I think we'd just be adding
> overhead with not much real safety gain.

I was thinking of doing it for assert only anyway.

> In practice, if we were to change Datum to be a signed type (intptr_t
> not uintptr_t), the most common cases would probably do the right thing
> anyway, ie an int or short return value would get promoted to Datum
> with sign-extension.

I was actually thinking about making Datum (and some other types we
have) structs or unions. Currently it's far, far too easy to mix them. We
throw away pretty much all of the little type safety C has by typedef'ing
them to integral types with lots of autocasting behaviour.
Making Datum a union with all the base types inside would also stop us
from violating the standard's aliasing rules...
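
Very handwavy, something like this (eliding SIZEOF_DATUM games, floats,
and everything else inconvenient):

#include <stdbool.h>
#include <stdint.h>

/* sketch only - nowhere near a committable proposal */
typedef union Datum
{
    void       *ptr;            /* pass-by-reference types */
    bool        boolval;
    int16_t     i16;
    int32_t     i32;
    int64_t     i64;            /* pass-by-value only on 64-bit builds */
} Datum;

/* construction then has to go through an explicit constructor... */
static inline Datum
Int32GetDatumU(int32_t x)       /* hypothetical union-based constructor */
{
    Datum       d = {0};        /* deterministically zero the whole word */

    d.i32 = x;
    return d;
}

...so a bare "return foo;" where a Datum is expected simply stops
compiling.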

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance optimization of btree binary search

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> I was actually thinking about making Datum (and some other types we
> have) structs or unions. Currently it's far, far too easy to mix them. We throw
> away pretty much all of the little type safety C has by typedef'ing them
> to integral types with lots of autocasting behaviour.

That's intentional; on many ABIs, making Datum a struct would be
catastrophic performance-wise because it would not be eligible for simple
register pass or return conventions.  In any case, the number of bugs
I can remember that such a thing would've prevented is negligible.
        regards, tom lane



Re: Performance optimization of btree binary search

From: Andres Freund
On 2013-12-05 10:34:16 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I was actually thinking about making Datum (and some other types we
> > have) structs or unions. Currently it's far, far too easy to mix them. We throw
> > away pretty much all of the little type safety C has by typedef'ing them
> > to integral types with lots of autocasting behaviour.
> 
> That's intentional; on many ABIs, making Datum a struct would be
> catastrophic performance-wise because it would not be eligible for simple
> register pass or return conventions.

Unions should behave more sanely in that regard though? And it'd be
fairly easy to make it an optional thing.

> In any case, the number of bugs I can remember that such a thing
> would've prevented is negligible.

Cases talked about upthread, where a plain datatype is returned as a
Datum instead of using FooGetDatum() and the reverse, would be
impossible. I don't think those are that infrequent?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance optimization of btree binary search

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-12-05 10:34:16 -0500, Tom Lane wrote:
>> In any case, the number of bugs I can remember that such a thing
>> would've prevented is negligible.

> Cases talked about upthread, where a plain datatype is returned as a
> Datum instead of using FooGetDatum() and the reverse, would be
> impossible. I don't think those are that infrequent?

[ shrug... ]  The performance changes we're talking about here would have
the effect of making the compiler's implicit casts be the right thing
anyway.  In any case, I don't think you'd have accomplished much by
forcing people to use FooGetDatum, unless you can force them to use the
*right* FooGetDatum.  The errors I can remember seeing in this area were
more in the line of choosing the wrong macro.
        regards, tom lane



Re: Performance optimization of btree binary search

From: Peter Eisentraut
On Thu, 2013-12-05 at 15:29 +0200, Heikki Linnakangas wrote:
> It happens to me about 75% of the time when I write a new C function. 
> These days, I usually realize it pretty quickly.
> 
> I wonder how easy it would be to make the compiler produce a warning
> about it. Or issue a warning in PostgreSQL when you do CREATE FUNCTION
> and the C function appears to be a V0 function.

Here is a related idea I've had:  Integrate the function declaration
into the PG_FUNCTION_INFO_V1 macro, like this:

#define PG_FUNCTION_INFO_V1(funcname) \
Datum funcname(PG_FUNCTION_ARGS); \
...
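
Spelled out against the existing macro (reproduced from memory, with
the Windows export decoration elided, so treat the details as
approximate), that would be roughly:

/* as now, but with the Datum declaration prepended */
#define PG_FUNCTION_INFO_V1(funcname) \
extern Datum funcname(PG_FUNCTION_ARGS); \
extern const Pg_finfo_record * CppConcat(pg_finfo_,funcname)(void); \
const Pg_finfo_record * \
CppConcat(pg_finfo_,funcname) (void) \
{ \
    static const Pg_finfo_record my_finfo = { 1 }; \
    return &my_finfo; \
} \
extern int no_such_variable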

Because if you don't declare the function, you will get a compiler
warning under -Wmissing-prototypes.  Field experience shows that many
extension authors neglect to explicitly declare their entry-point
functions, presumably because they don't understand the cause of the
warning, or they don't look at the compiler output, so many extensions
unfortunately don't compile without warnings at the moment.

By putting the function declaration into the PG_FUNCTION_INFO_V1 macro,
everyone wins:

- fewer compiler warnings due to laziness
- no code bloat because of redundant function declarations
- accidental V0 functions are warned about
- prototypes of V1 function implementations are implicitly checked for
correctness





Re: Performance optimization of btree binary search

From: Peter Geoghegan
On Thu, Dec 5, 2013 at 2:19 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I did a relatively short variant 'insert' pgbench-tools benchmark,
> with a serial primary index created on the pgbench_history table.
> Results:
>
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/insert/

I saw something today that made me lose confidence in the results I've
presented. I was unsuccessful in re-creating similar 'select' numbers
at scale 15 on the same server [1] (originally I wanted to see how
eliding the fmgr trampoline worked out with binary searching only).
Then, when re-testing master as of commit
8e18d04d4daf34b8a557e2dc553a7754b255cd9a (my feature branches branched
off from that master commit), I noticed that even after the last
pgbench had run, a single postgres process was stuck at 100% CPU usage
for upwards of a minute (the run referred to was for 32 clients, and I
only saw that one process in 'top' output).

To me this points to either 1) some kind of contamination or 2) a bug
in Postgres. My first guess is that it's the issue described here [2]
and elsewhere (maybe whether or not that behavior constitutes a bug is
controversial, but I think it does).

Consider the contrast between these two runs (against master, 2
clients) of the same, new benchmark:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/new-select/50/index.html

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/new-select/44/index.html

I had considered that something like Intel Speedstep technology had a
role here, but I'm pretty sure it steps up very aggressively when
things are CPU bound - I tested that against a Core 2 Duo desktop a
couple of years back, where it was easy to immediately provoke it by
moving around desktop windows or something. I know that Greg Smith
controls for that by disabling it, but I have not done so here - I
assumed that he only did so because he was being particularly
cautious. There are similar runs on my original 'results' benchmark
(these particular should-be-comparable-but-are-not runs are with 1
client against my patch this time):

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/results/297/index.html

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/results/303/index.html

References:

[1] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/new-select/

[2] http://www.postgresql.org/message-id/CAFj8pRDDa40eiP4UTrCm=+Bdt0xbWF7qC8T_3y0dFqYuZk2YAg@mail.gmail.com

-- 
Peter Geoghegan



Re: Performance optimization of btree binary search

From
Peter Geoghegan
Date:
On Fri, Dec 6, 2013 at 4:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I had considered that something like Intel Speedstep technology had a
> role here, but I'm pretty sure it steps up very aggressively when
> things are CPU bound - I tested that against a Core 2 Duo desktop a
> couple of years back, where it was easy to immediately provoke it by
> moving around desktop windows or something.

I decided to change the default CPU governor from "ondemand" to
"performance" for each of the 8 logical cores on this system. I then
re-ran the benchmark. I saw markedly better, much more *consistent*
performance for master [1].

I Googled for clues, and found this:


https://communities.intel.com/community/datastack/blog/2013/08/05/how-to-maximise-cpu-performance-for-the-oracle-database-on-linux

(It happens to mention Oracle, but I think it applies equally well
to any database). I strongly suspect this is down to kernel version. I
should highlight this:

"""
Another further CPU setting is the Energy/Performance Bias and Red Hat
and Oracle users should note that the default setting has changed in
the Linux kernel used between the releases of Red Hat/Oracle Linux 5
and Red Hat/Oracle Linux 6. (Some system BIOS options may include a
setting to prevent the OS changing this value). In release 5 Linux did
not set a value for this setting and therefore the value remained at 0
for a bias towards performance. In Red Hat 6 this behaviour has
changed and the default sets a median range to move this bias more
towards conserving energy (remember the same Linux kernel is present
in both ultrabooks as well as  servers and on my ultrabook I use
powertop and the other Linux tools and configurations discussed here
to maximise battery life) and reports the following in the dmesg
output on boot.

...

You can also use the tool to set a lower value to change the bias
entirely towards performance (the default release 5 behaviour).
"""

If there is a regression in Postgres performance on more recent Linux
kernels [2], perhaps this is it. I certainly don't recall hearing
advice on this from the usual places. I'm surprised that turbo boost
mode wasn't enabled very quickly on a workload like this. It makes a
*huge* difference - at 4 clients (one per physical core), setting the
CPU governor to "performance" increases TPS by a massive 40% compared
to some earlier, comparable runs of master. These days Red Hat are even
pointing out that CPU governor policy can be set via cron jobs [3].

I cannot account for why the original benchmark I performed was
consistent with the patch having helped to such a large degree, given
the large number of runs involved, and their relatively long duration
for a CPU/memory bound workload. As I said, this machine is on
dedicated hardware, and virtualization was not used. However, at this
point I have no choice but to withdraw it from consideration, and not
pursue this any further. Sorry for the noise.

[1] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/turbo/

[2] http://www.postgresql.org/message-id/529F7D58.1060301@agliodbs.com

[3] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/cpufreq_governors.html
-- 
Peter Geoghegan