Thread: B-Tree support function number 3 (strxfrm() optimization)

B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
I have thought a little further about my proposal to have inner B-Tree
pages store strxfrm() blobs [1]; my earlier conclusion was that these
could not be trusted when equality was indicated [2], due to the
"Hungarian problem" [3]. As I noted earlier, that doesn't really
matter, because we may have missed what's really so useful about
strxfrm() blobs: we *can* trust a non-zero result as a proxy for what
we would have gotten from a proper bttextcmp(), even if the strcmp()
only looks at the first byte of the blob of each of the two text
strings being compared. Here is what the strxfrm() blob looks like for
some common English words when put through glibc's strxfrm() with the
en_US.UTF-8 collation (I wrote a wrapper function to do this and
output bytea a couple of years ago):

[local]/postgres=# select strxfrm_test('Yellow');
                strxfrm_test
--------------------------------------------
 \x241017171a220109090909090901100909090909
(1 row)

[local]/postgres=# select strxfrm_test('Red');
       strxfrm_test
--------------------------
 \x1d100f0109090901100909
(1 row)

[local]/postgres=# select strxfrm_test('Orange');
                strxfrm_test
--------------------------------------------
 \x1a1d0c1912100109090909090901100909090909
(1 row)

[local]/postgres=# select strxfrm_test('Green');
             strxfrm_test
--------------------------------------
 \x121d101019010909090909011009090909
(1 row)

[local]/postgres=# select strxfrm_test('White');
             strxfrm_test
--------------------------------------
 \x2213141f10010909090909011009090909
(1 row)

[local]/postgres=# select strxfrm_test('Black');
             strxfrm_test
--------------------------------------
 \x0d170c0e16010909090909011009090909
(1 row)

Obviously, all of these blobs, while much larger than the original
strings, still differ in their first byte. It's almost as if they're
intended to be truncated.

The API I envisage is a new support function 3 that operator class
authors may optionally provide. The support function is passed a text
varlena argument (or whatever type the opclass happens to relate to)
and returns a Datum to the core system. When a support function 3 is
provided, that Datum is what actually gets passed to the sort support
routine (B-Tree support function number 2); you must provide number 2
if you provide number 3, but not vice versa. For opclasses that
provide support function 3, the core system is entitled to take the
sort support return value as a proxy for what a proper support
function 1 call would indicate, if and only if the sort support
routine returns non-zero when comparing two support-function-3 blobs
(typically strxfrm() blobs truncated at 8 bytes, for convenient
storage as pass-by-value Datums). Otherwise, a proper call to support
function 1, with fully-formed text arguments, is required.

I see at least two compelling things we can do with these blobs:

1. Store them as pseudo-columns in the IndexTuples of inner B-Tree
pages (not leaf pages). Naturally, the keys in inner pages are quite
heterogeneous, so only having 8 bytes is very probably an excellent
trade-off there. Typically only 1-3% of B-Tree pages are inner pages,
so the bloat risk seems acceptable.

2. When building a SortTuple array within tuplesort, we can store far
more of these truncated blobs in memory than we can proper strings. If
SortTuple.datum1 (the first column to sort on among the tuples being
sorted, which is currently stored in memory as an optimization) were
just an 8-byte truncated strxfrm() blob, and not a pointer to a text
string in memory, that would be pretty great for performance, for
several reasons. So, just as with B-Tree inner pages, for SortTuples
there can be a pseudo leading key; we need only compare additional
sort keys/heap_getattr() as a tie-breaker, when those 8 bytes aren't
enough to reach a firm conclusion.

It doesn't just stop with strxfrm() blobs, though. Why couldn't we
create blobs that serve as reliable proxies for numerics, and that are
just integers? Sure, you need to reserve a bit to indicate an
inability to represent the original value, and possibly work out other
details like that, but the only negative thing you can say about
applying these techniques to any operator class is that they might not
help in certain worst cases (mostly contrived ones). Still, the
overhead of doing that bit of extra work is surely quite low anyway -
at worst, an extra few wasted instructions per comparison - making
these techniques likely quite profitable for the vast majority of
real-world applications.

Does anyone else want to pick this idea up and run with it? I don't
think I'll have time for it.

[1] http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com

[2] http://www.postgresql.org/message-id/CAM3SWZS7wewrBmRGCi9_yCX49Ug6UgqN2xNGJG3Zq5v8LbDU4g@mail.gmail.com

[3] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Mar 26, 2014 at 8:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The API I envisage is a new support function 3 that operator class
> authors may optionally provide.

I've built a prototype patch, attached, that extends SortSupport and
tuplesort to support "poor man's normalized keys". All the regression
tests pass, so while it's just a proof of concept, it is reasonably
well put together for one. The primary shortcoming of the prototype
(the main reason why I'm calling it a prototype rather than just a
patch) is that it isn't sufficiently generalized (i.e. it only works
for the cases currently covered by SortSupport - not B-Tree index
builds, or B-Tree scanKeys). There is no B-Tree support function
number 3 in the patch. I didn't spend too long on this.

I'm pretty happy with the results for in-memory sorting of text (my
development system uses 'en_US.UTF8', so please assume that any costs
involved are for runs that use that collation). With the dellstore2
sample database [1] restored to my local development instance, the
following example demonstrates just how much the technique can help
performance.

With master:

pg@hamster:~/sort-tests$ cat sort.sql
select * from (select * from customers order by firstname offset 100000) d;
pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 819
latency average: 122.100 ms
tps = 8.186197 (including connections establishing)
tps = 8.186522 (excluding connections establishing)

With patch applied (requires initdb for new text SortSupport pg_proc entry):

pg@hamster:~/sort-tests$ cat sort.sql
select * from (select * from customers order by firstname offset 100000) d;
pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 2525
latency average: 39.604 ms
tps = 25.241723 (including connections establishing)
tps = 25.242447 (excluding connections establishing)

It looks like this technique is very valuable indeed, at least in the
average or best case. We're not just benefiting from following the
standard's advice to use strxfrm() for sorting, amortizing the cost of
the transformation that strcoll() must perform anyway. It stands to
reason that there is also a lot of benefit from sorting tightly-packed
keys. Quicksort is cache-oblivious, and having it sort tightly-packed
binary data, as opposed to going through all of that dereferencing and
deserialization indirection, is probably also very helpful. A tool
like Cachegrind might offer some additional insights, but I haven't
gone to the trouble of trying that out.

(By the way, my earlier recollection about how memory-frugal
MinimalTuple/memtuple building is within tuplesort was incorrect, so
there are no savings in memory to be had here).

As I mentioned, something like a SortSupport for numeric, with poor
man's normalized keys might also be compelling. I suggest we focus on
how this technique can be further generalized, though. This prototype
patch is derivative of Robert's abandoned SortSupport for text patch.
If he wanted to take this off my hands, I'd have no objections - I
don't think I'm going to have time to take this as far as I'd like.

[1] http://pgfoundry.org/forum/forum.php?forum_id=603
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Thom Brown
Date:
On 31 March 2014 06:51, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Mar 26, 2014 at 8:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> The API I envisage is a new support function 3 that operator class
>> authors may optionally provide.
>
> I've built a prototype patch, attached, that extends SortSupport and
> tuplesort to support "poor man's normalized keys". All the regression
> tests pass, so while it's just a proof of concept, it is reasonably
> well put together for one. The primary shortcoming of the prototype
> (the main reason why I'm calling it a prototype rather than just a
> patch) is that it isn't sufficiently generalized (i.e. it only works
> for the cases currently covered by SortSupport - not B-Tree index
> builds, or B-Tree scanKeys). There is no B-Tree support function
> number 3 in the patch. I didn't spend too long on this.
>
> I'm pretty happy with the results for in-memory sorting of text (my
> development system uses 'en_US.UTF8', so please assume that any costs
> involved are for runs that use that collation). With the dellstore2
> sample database [1] restored to my local development instance, the
> following example demonstrates just how much the technique can help
> performance.
>
> With master:
>
> pg@hamster:~/sort-tests$ cat sort.sql
> select * from (select * from customers order by firstname offset 100000) d;
> pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 819
> latency average: 122.100 ms
> tps = 8.186197 (including connections establishing)
> tps = 8.186522 (excluding connections establishing)
>
> With patch applied (requires initdb for new text SortSupport pg_proc entry):
>
> pg@hamster:~/sort-tests$ cat sort.sql
> select * from (select * from customers order by firstname offset 100000) d;
> pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 2525
> latency average: 39.604 ms
> tps = 25.241723 (including connections establishing)
> tps = 25.242447 (excluding connections establishing)

As another data point, I ran the same benchmark, but I don't appear to
yield the same positive result.  An initdb was done for each rebuild,
my system uses en_GB.UTF-8 (if that's relevant) and I used your same
sort.sql...

With master:

thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 421479
latency average: 0.237 ms
tps = 4214.769601 (including connections establishing)
tps = 4214.906079 (excluding connections establishing)

With patch applied:

thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 412405
latency average: 0.242 ms
tps = 4124.047237 (including connections establishing)
tps = 4124.177437 (excluding connections establishing)


And with 4 runs (TPS):

Master: 4214.906079 / 4564.532623 / 4152.784608 / 4152.578297 (avg: 4271)
Patched: 4124.177437 / 3777.561869 / 3777.561869 / 2484.220475 (avg: 3481)

I'm not sure what's causing the huge variation.  I ran 5 minute benchmarks too:

Master:

thom@swift ~/Development $ pgbench -f sort.sql -n -T 300 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 1092221
latency average: 0.275 ms
tps = 3640.733002 (including connections establishing)
tps = 3640.784628 (excluding connections establishing)


Patched:

thom@swift ~/Development $ pgbench -f sort.sql -n -T 300 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 1068239
latency average: 0.281 ms
tps = 3560.794946 (including connections establishing)
tps = 3560.835076 (excluding connections establishing)



And per-second results for the first 30 seconds:


Master:

progress: 1.0 s, 2128.8 tps, lat 0.464 ms stddev 0.084
progress: 2.0 s, 2138.9 tps, lat 0.464 ms stddev 0.015
progress: 3.0 s, 2655.6 tps, lat 0.374 ms stddev 0.151
progress: 4.0 s, 2214.0 tps, lat 0.448 ms stddev 0.080
progress: 5.0 s, 2171.1 tps, lat 0.457 ms stddev 0.071
progress: 6.0 s, 2131.6 tps, lat 0.466 ms stddev 0.035
progress: 7.0 s, 3811.2 tps, lat 0.260 ms stddev 0.177
progress: 8.0 s, 2139.6 tps, lat 0.464 ms stddev 0.017
progress: 9.0 s, 7989.7 tps, lat 0.124 ms stddev 0.091
progress: 10.0 s, 8509.7 tps, lat 0.117 ms stddev 0.062
progress: 11.0 s, 3131.3 tps, lat 0.317 ms stddev 0.177
progress: 12.0 s, 9362.1 tps, lat 0.106 ms stddev 0.006
progress: 13.0 s, 5831.0 tps, lat 0.170 ms stddev 0.137
progress: 14.0 s, 4949.3 tps, lat 0.201 ms stddev 0.156
progress: 15.0 s, 2136.9 tps, lat 0.464 ms stddev 0.028
progress: 16.0 s, 3918.3 tps, lat 0.253 ms stddev 0.177
progress: 17.0 s, 4102.7 tps, lat 0.242 ms stddev 0.122
progress: 18.0 s, 2997.6 tps, lat 0.331 ms stddev 0.151
progress: 19.0 s, 2139.1 tps, lat 0.464 ms stddev 0.034
progress: 20.0 s, 3189.5 tps, lat 0.311 ms stddev 0.173
progress: 21.0 s, 2120.7 tps, lat 0.468 ms stddev 0.030
progress: 22.0 s, 3197.7 tps, lat 0.311 ms stddev 0.182
progress: 23.0 s, 2115.3 tps, lat 0.469 ms stddev 0.034
progress: 24.0 s, 2129.0 tps, lat 0.466 ms stddev 0.031
progress: 25.0 s, 2190.7 tps, lat 0.453 ms stddev 0.106
progress: 26.0 s, 2118.6 tps, lat 0.468 ms stddev 0.031
progress: 27.0 s, 2136.8 tps, lat 0.464 ms stddev 0.018
progress: 28.0 s, 5160.7 tps, lat 0.193 ms stddev 0.156
progress: 29.0 s, 2312.5 tps, lat 0.429 ms stddev 0.107
progress: 30.0 s, 2145.9 tps, lat 0.463 ms stddev 0.038
progress: 31.0 s, 2107.6 tps, lat 0.471 ms stddev 0.071



Patched:

progress: 1.0 s, 2136.2 tps, lat 0.463 ms stddev 0.084
progress: 2.0 s, 2153.3 tps, lat 0.461 ms stddev 0.035
progress: 3.0 s, 2336.0 tps, lat 0.425 ms stddev 0.112
progress: 4.0 s, 2144.8 tps, lat 0.463 ms stddev 0.037
progress: 5.0 s, 2171.7 tps, lat 0.457 ms stddev 0.041
progress: 6.0 s, 2161.9 tps, lat 0.459 ms stddev 0.036
progress: 7.0 s, 2143.1 tps, lat 0.463 ms stddev 0.019
progress: 8.0 s, 2148.4 tps, lat 0.462 ms stddev 0.032
progress: 9.0 s, 2142.1 tps, lat 0.463 ms stddev 0.028
progress: 10.0 s, 2133.6 tps, lat 0.465 ms stddev 0.032
progress: 11.0 s, 2138.3 tps, lat 0.464 ms stddev 0.020
progress: 12.0 s, 2578.7 tps, lat 0.385 ms stddev 0.149
progress: 13.0 s, 2455.6 tps, lat 0.404 ms stddev 0.119
progress: 14.0 s, 2909.5 tps, lat 0.341 ms stddev 0.170
progress: 15.0 s, 2133.7 tps, lat 0.465 ms stddev 0.025
progress: 16.0 s, 2876.9 tps, lat 0.345 ms stddev 0.160
progress: 17.0 s, 2167.1 tps, lat 0.458 ms stddev 0.038
progress: 18.0 s, 3623.4 tps, lat 0.274 ms stddev 0.181
progress: 19.0 s, 2476.3 tps, lat 0.401 ms stddev 0.137
progress: 20.0 s, 2166.9 tps, lat 0.458 ms stddev 0.030
progress: 21.0 s, 3150.0 tps, lat 0.315 ms stddev 0.176
progress: 22.0 s, 2220.3 tps, lat 0.447 ms stddev 0.084
progress: 23.0 s, 2131.3 tps, lat 0.466 ms stddev 0.030
progress: 24.0 s, 2154.2 tps, lat 0.461 ms stddev 0.047
progress: 25.0 s, 2122.5 tps, lat 0.468 ms stddev 0.040
progress: 26.0 s, 2202.2 tps, lat 0.451 ms stddev 0.079
progress: 27.0 s, 2150.6 tps, lat 0.461 ms stddev 0.039
progress: 28.0 s, 2156.1 tps, lat 0.460 ms stddev 0.027
progress: 29.0 s, 2152.1 tps, lat 0.461 ms stddev 0.028
progress: 30.0 s, 2235.4 tps, lat 0.444 ms stddev 0.071


-- 
Thom



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Thom Brown
Date:
On 31 March 2014 11:32, Thom Brown <thom@linux.com> wrote:
> On 31 March 2014 06:51, Peter Geoghegan <pg@heroku.com> wrote:
>> On Wed, Mar 26, 2014 at 8:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> The API I envisage is a new support function 3 that operator class
>>> authors may optionally provide.
>>
>> I've built a prototype patch, attached, that extends SortSupport and
>> tuplesort to support "poor man's normalized keys". All the regression
>> tests pass, so while it's just a proof of concept, it is reasonably
>> well put together for one. The primary shortcoming of the prototype
>> (the main reason why I'm calling it a prototype rather than just a
>> patch) is that it isn't sufficiently generalized (i.e. it only works
>> for the cases currently covered by SortSupport - not B-Tree index
>> builds, or B-Tree scanKeys). There is no B-Tree support function
>> number 3 in the patch. I didn't spend too long on this.
>>
>> I'm pretty happy with the results for in-memory sorting of text (my
>> development system uses 'en_US.UTF8', so please assume that any costs
>> involved are for runs that use that collation). With the dellstore2
>> sample database [1] restored to my local development instance, the
>> following example demonstrates just how much the technique can help
>> performance.
>>
>> With master:
>>
>> pg@hamster:~/sort-tests$ cat sort.sql
>> select * from (select * from customers order by firstname offset 100000) d;
>> pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
>> transaction type: Custom query
>> scaling factor: 1
>> query mode: simple
>> number of clients: 1
>> number of threads: 1
>> duration: 100 s
>> number of transactions actually processed: 819
>> latency average: 122.100 ms
>> tps = 8.186197 (including connections establishing)
>> tps = 8.186522 (excluding connections establishing)
>>
>> With patch applied (requires initdb for new text SortSupport pg_proc entry):
>>
>> pg@hamster:~/sort-tests$ cat sort.sql
>> select * from (select * from customers order by firstname offset 100000) d;
>> pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
>> transaction type: Custom query
>> scaling factor: 1
>> query mode: simple
>> number of clients: 1
>> number of threads: 1
>> duration: 100 s
>> number of transactions actually processed: 2525
>> latency average: 39.604 ms
>> tps = 25.241723 (including connections establishing)
>> tps = 25.242447 (excluding connections establishing)
>
> As another data point, I ran the same benchmark, but I don't appear to
> yield the same positive result.  An initdb was done for each rebuild,
> my system uses en_GB.UTF-8 (if that's relevant) and I used your same
> sort.sql...
>
> With master:
>
> thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 421479
> latency average: 0.237 ms
> tps = 4214.769601 (including connections establishing)
> tps = 4214.906079 (excluding connections establishing)
>
> With patch applied:
>
> thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 412405
> latency average: 0.242 ms
> tps = 4124.047237 (including connections establishing)
> tps = 4124.177437 (excluding connections establishing)
>
>
> And with 4 runs (TPS):
>
> Master: 4214.906079 / 4564.532623 / 4152.784608 / 4152.578297 (avg: 4271)
> Patched: 4124.177437 / 3777.561869 / 3777.561869 / 2484.220475 (avg: 3481)
>
> I'm not sure what's causing the huge variation.  I ran 5 minute benchmarks too:
>
> Master:
>
> thom@swift ~/Development $ pgbench -f sort.sql -n -T 300 ds2
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 300 s
> number of transactions actually processed: 1092221
> latency average: 0.275 ms
> tps = 3640.733002 (including connections establishing)
> tps = 3640.784628 (excluding connections establishing)
>
>
> Patched:
>
> thom@swift ~/Development $ pgbench -f sort.sql -n -T 300 ds2
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 300 s
> number of transactions actually processed: 1068239
> latency average: 0.281 ms
> tps = 3560.794946 (including connections establishing)
> tps = 3560.835076 (excluding connections establishing)
>
>
>
> And per-second results for the first 30 seconds:
>
>
> Master:
>
> progress: 1.0 s, 2128.8 tps, lat 0.464 ms stddev 0.084
> progress: 2.0 s, 2138.9 tps, lat 0.464 ms stddev 0.015
> progress: 3.0 s, 2655.6 tps, lat 0.374 ms stddev 0.151
> progress: 4.0 s, 2214.0 tps, lat 0.448 ms stddev 0.080
> progress: 5.0 s, 2171.1 tps, lat 0.457 ms stddev 0.071
> progress: 6.0 s, 2131.6 tps, lat 0.466 ms stddev 0.035
> progress: 7.0 s, 3811.2 tps, lat 0.260 ms stddev 0.177
> progress: 8.0 s, 2139.6 tps, lat 0.464 ms stddev 0.017
> progress: 9.0 s, 7989.7 tps, lat 0.124 ms stddev 0.091
> progress: 10.0 s, 8509.7 tps, lat 0.117 ms stddev 0.062
> progress: 11.0 s, 3131.3 tps, lat 0.317 ms stddev 0.177
> progress: 12.0 s, 9362.1 tps, lat 0.106 ms stddev 0.006
> progress: 13.0 s, 5831.0 tps, lat 0.170 ms stddev 0.137
> progress: 14.0 s, 4949.3 tps, lat 0.201 ms stddev 0.156
> progress: 15.0 s, 2136.9 tps, lat 0.464 ms stddev 0.028
> progress: 16.0 s, 3918.3 tps, lat 0.253 ms stddev 0.177
> progress: 17.0 s, 4102.7 tps, lat 0.242 ms stddev 0.122
> progress: 18.0 s, 2997.6 tps, lat 0.331 ms stddev 0.151
> progress: 19.0 s, 2139.1 tps, lat 0.464 ms stddev 0.034
> progress: 20.0 s, 3189.5 tps, lat 0.311 ms stddev 0.173
> progress: 21.0 s, 2120.7 tps, lat 0.468 ms stddev 0.030
> progress: 22.0 s, 3197.7 tps, lat 0.311 ms stddev 0.182
> progress: 23.0 s, 2115.3 tps, lat 0.469 ms stddev 0.034
> progress: 24.0 s, 2129.0 tps, lat 0.466 ms stddev 0.031
> progress: 25.0 s, 2190.7 tps, lat 0.453 ms stddev 0.106
> progress: 26.0 s, 2118.6 tps, lat 0.468 ms stddev 0.031
> progress: 27.0 s, 2136.8 tps, lat 0.464 ms stddev 0.018
> progress: 28.0 s, 5160.7 tps, lat 0.193 ms stddev 0.156
> progress: 29.0 s, 2312.5 tps, lat 0.429 ms stddev 0.107
> progress: 30.0 s, 2145.9 tps, lat 0.463 ms stddev 0.038
> progress: 31.0 s, 2107.6 tps, lat 0.471 ms stddev 0.071
>
>
>
> Patched:
>
> progress: 1.0 s, 2136.2 tps, lat 0.463 ms stddev 0.084
> progress: 2.0 s, 2153.3 tps, lat 0.461 ms stddev 0.035
> progress: 3.0 s, 2336.0 tps, lat 0.425 ms stddev 0.112
> progress: 4.0 s, 2144.8 tps, lat 0.463 ms stddev 0.037
> progress: 5.0 s, 2171.7 tps, lat 0.457 ms stddev 0.041
> progress: 6.0 s, 2161.9 tps, lat 0.459 ms stddev 0.036
> progress: 7.0 s, 2143.1 tps, lat 0.463 ms stddev 0.019
> progress: 8.0 s, 2148.4 tps, lat 0.462 ms stddev 0.032
> progress: 9.0 s, 2142.1 tps, lat 0.463 ms stddev 0.028
> progress: 10.0 s, 2133.6 tps, lat 0.465 ms stddev 0.032
> progress: 11.0 s, 2138.3 tps, lat 0.464 ms stddev 0.020
> progress: 12.0 s, 2578.7 tps, lat 0.385 ms stddev 0.149
> progress: 13.0 s, 2455.6 tps, lat 0.404 ms stddev 0.119
> progress: 14.0 s, 2909.5 tps, lat 0.341 ms stddev 0.170
> progress: 15.0 s, 2133.7 tps, lat 0.465 ms stddev 0.025
> progress: 16.0 s, 2876.9 tps, lat 0.345 ms stddev 0.160
> progress: 17.0 s, 2167.1 tps, lat 0.458 ms stddev 0.038
> progress: 18.0 s, 3623.4 tps, lat 0.274 ms stddev 0.181
> progress: 19.0 s, 2476.3 tps, lat 0.401 ms stddev 0.137
> progress: 20.0 s, 2166.9 tps, lat 0.458 ms stddev 0.030
> progress: 21.0 s, 3150.0 tps, lat 0.315 ms stddev 0.176
> progress: 22.0 s, 2220.3 tps, lat 0.447 ms stddev 0.084
> progress: 23.0 s, 2131.3 tps, lat 0.466 ms stddev 0.030
> progress: 24.0 s, 2154.2 tps, lat 0.461 ms stddev 0.047
> progress: 25.0 s, 2122.5 tps, lat 0.468 ms stddev 0.040
> progress: 26.0 s, 2202.2 tps, lat 0.451 ms stddev 0.079
> progress: 27.0 s, 2150.6 tps, lat 0.461 ms stddev 0.039
> progress: 28.0 s, 2156.1 tps, lat 0.460 ms stddev 0.027
> progress: 29.0 s, 2152.1 tps, lat 0.461 ms stddev 0.028
> progress: 30.0 s, 2235.4 tps, lat 0.444 ms stddev 0.071

While this seems to indicate some kind of regression with the patch,
I've just realised that the data files weren't loaded, as the CSV
files were in the wrong path, which explains the very high tps.
*facepalm*  So here are the *right* benchmarks:

Master:

thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 2093
latency average: 47.778 ms
tps = 20.920833 (including connections establishing)
tps = 20.921719 (excluding connections establishing)

(a repeat shows 20.737619)


With patch:

thom@swift ~/Development $ pgbench -f sort.sql -n -T 100 ds2
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 6148
latency average: 16.265 ms
tps = 61.477152 (including connections establishing)
tps = 61.479736 (excluding connections establishing)

(a repeat shows 62.252024)

So clearly a 3-fold improvement in this case.
-- 
Thom



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Heikki Linnakangas
Date:
On 03/31/2014 08:51 AM, Peter Geoghegan wrote:
> + #ifdef HAVE_LOCALE_T
> +     if (tss->locale)
> +         strxfrm_l(pres, tss->buf1, Min(sizeof(Datum), len), tss->locale);
> +     else
> + #endif
> +         strxfrm(pres, tss->buf1, Min(sizeof(Datum), len));
> +
> +     pres[Min(sizeof(Datum) - 1, len)] = '\0';

I'm afraid that trick isn't 100% reliable. The man page for strxfrm() says:

> RETURN VALUE
>        The strxfrm() function returns the number of bytes  required  to  store
>        the  transformed  string  in  dest  excluding the terminating null byte
>        ('\0').  If the value returned is n or more, the contents of  dest  are
>        indeterminate.

Note the last sentence. To avoid the undefined behavior, you have to 
pass a buffer that's large enough to hold the whole result, and then 
truncate the result.

- Heikki



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Mar 31, 2014 at 8:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Note the last sentence. To avoid the undefined behavior, you have to pass a
> buffer that's large enough to hold the whole result, and then truncate the
> result.

I was working off the glibc documentation, which says:

"The return value is the length of the entire transformed string. This
value is not affected by the value of size, but if it is greater or
equal than size, it means that the transformed string did not entirely
fit in the array to. In this case, only as much of the string as
actually fits was stored. To get the whole transformed string, call
strxfrm again with a bigger output array."

It looks like this may be a glibc-ism. I'm not sure whether or not
it's worth attempting to benefit from glibc's apparent additional
guarantees here. Obviously that is something that would have to be
judged by weighing any actual additional benefit. FWIW, I didn't
actually imagine that this was a trick I was exploiting. I wouldn't be
surprised if this wasn't very important in practice.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Mar 31, 2014 at 8:08 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Note the last sentence. To avoid the undefined behavior, you have to pass a
>> buffer that's large enough to hold the whole result, and then truncate the
>> result.

> I was working off the glibc documentation, which says:

> "The return value is the length of the entire transformed string. This
> value is not affected by the value of size, but if it is greater or
> equal than size, it means that the transformed string did not entirely
> fit in the array to. In this case, only as much of the string as
> actually fits was stored. To get the whole transformed string, call
> strxfrm again with a bigger output array."

> It looks like this may be a glibc-ism.

Possibly worth noting is that on my RHEL6 and Fedora machines,
"man strxfrm" says the same thing as the POSIX spec.  Likewise on OS X.
I don't think we can rely on what you suggest.
        regards, tom lane



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Mar 31, 2014 at 1:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Possibly worth noting is that on my RHEL6 and Fedora machines,
> "man strxfrm" says the same thing as the POSIX spec.  Likewise on OS X.
> I don't think we can rely on what you suggest.

Okay. Attached revision only trusts strxfrm() blobs (as far as that
goes) when the buffer passed to strxfrm() was sufficiently large that
the blob could fully fit. We're now managing two buffers as part of
the text sort support state (for leading poor man's keys, rather than
just for non-leading keys as before): one for the original string (to
append a NUL byte, since, like strcoll(), strxfrm() expects one), and
the other to temporarily store the blob before memcpy()ing it over to
the pass-by-value poor man's normalized key Datum. These are the same
two persistent sortsupport-state buffers used when sorting a
non-leading text key, where each string needs a NUL byte appended.

This appears to perform at least as well as the prior revision, and
possibly even appreciably better. I guess that glibc was internally
managing a buffer to do much the same thing, so perhaps we gain
something more from mostly avoiding that with a longer-lived buffer.

Perhaps you could investigate the performance of this new revision too, Thom?

I'd really like to see this basic approach expanded to store pseudo
leading keys in internal B-Tree pages too, per my earlier proposals.
At the very least, B-Tree index builds should benefit from this. I
think I'm really only scratching the surface here, both in terms of
the number of places in the code where a poor man's normalized key
could usefully be exploited, as well as in terms of the number of
B-Tree operator classes that could be made to provide one.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Mar 31, 2014 at 7:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Okay. Attached revision only trusts strxfrm() blobs (as far as that
> goes) when the buffer passed to strxfrm() was sufficiently large that
> the blob could fully fit.

Attached revision has been further polished. I've added two additional
optimizations:

* Squeeze the last byte out of each Datum, so that on a 64-bit system,
the full 8 bytes are available to store strxfrm() blobs.

* Figure out, when the strcoll()-based bttextfastcmp_locale()
comparator is called, whether it was called because a poor man's
comparison required it (and *not* because it's the non-leading key in
the traditional sense, which implies there are no poor man's
normalized keys in respect of this attribute at all). This allows us
to try to get away with a straight memcmp() if and when the lengths of
the original text strings match, on the assumption that when the
initial poor man's comparison didn't work out and the string lengths
match, there is a very good chance that both strings are equal, and on
average it's a win to avoid doing a strcoll() (along with the
attendant copying around of buffers for NUL-termination) entirely.
Given that memcmp() is so much cheaper than strcoll(), this seems like
a good trade-off.
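
A minimal sketch of both optimizations, with invented names (the patch's actual identifiers may differ): the leading strxfrm() bytes are packed into an 8-byte integer in big-endian order, so plain unsigned comparison of the packed keys agrees with memcmp() on the blob prefixes; an inconclusive (zero) result falls through to an authoritative comparator, which tries the opportunistic memcmp() first when the lengths match:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t PackedKey;

/* Pack up to 8 leading blob bytes, MSB first, zero-padding short blobs,
 * so that unsigned integer order matches memcmp() order on the prefix */
static PackedKey
pack_prefix(const char *blob, size_t bloblen)
{
    unsigned char buf[8] = {0};

    memcpy(buf, blob, bloblen < 8 ? bloblen : 8);
    return ((uint64_t) buf[0] << 56) | ((uint64_t) buf[1] << 48) |
           ((uint64_t) buf[2] << 40) | ((uint64_t) buf[3] << 32) |
           ((uint64_t) buf[4] << 24) | ((uint64_t) buf[5] << 16) |
           ((uint64_t) buf[6] << 8)  |  (uint64_t) buf[7];
}

/* Poor man's comparison: a nonzero result can be trusted; zero only
 * means "inconclusive", never "equal" */
static int
poorman_cmp(PackedKey a, PackedKey b)
{
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}

/* Fallback, reached only when poorman_cmp() was inconclusive: if the
 * original strings have equal length, an exact memcmp() match proves
 * equality and lets us skip strcoll() entirely */
static int
fallback_cmp(const char *s1, size_t len1, const char *s2, size_t len2)
{
    char b1[256], b2[256];      /* sketch only: assumes short strings */

    if (len1 == len2 && memcmp(s1, s2, len1) == 0)
        return 0;               /* provably equal; no strcoll() needed */
    memcpy(b1, s1, len1); b1[len1] = '\0';
    memcpy(b2, s2, len2); b2[len2] = '\0';
    return strcoll(b1, b2);
}
```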

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Thom Brown
Date:
On 3 April 2014 17:52, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Mar 31, 2014 at 7:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Okay. Attached revision only trusts strxfrm() blobs (as far as that
>> goes) when the buffer passed to strxfrm() was sufficiently large that
>> the blob could fully fit.
>
> Attached revision has been further polished. I've added two additional
> optimizations:
>
> * Squeeze the last byte out of each Datum, so that on a 64-bit system,
> the full 8 bytes are available to store strxfrm() blobs.
>
> * Figure out when the strcoll() bttextfastcmp_locale() comparator is
> called, if it was called because a poor man's comparison required it
> (and *not* because it's the non-leading key in the traditional sense,
> which implies there are no poorman's normalized keys in respect of
> this attribute at all). This allows us to try and get away with a
> straight memcmp if and when the lengths of the original text strings
> match, on the assumption that when the initial poorman's comparison
> didn't work out, and when the string lengths match, there is a very
> good chance that both are equal, and on average it's a win to avoid
> doing a strcoll() (along with the attendant copying around of buffers
> for NULL-termination) entirely. Given that memcmp() is so much cheaper
> than strcoll() anyway, this seems like a good trade-off.

I'm getting an error when building this:

In file included from printtup.c:23:0:
../../../../src/include/utils/memdebug.h:21:31: fatal error:
valgrind/memcheck.h: No such file or directory
compilation terminated.
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -I../../../src/include
-D_GNU_SOURCE   -c -o analyze.o analyze.c -MMD -MP -MF
.deps/analyze.Po
make[4]: *** [printtup.o] Error 1
make[4]: Leaving directory
`/home/thom/Development/postgresql/src/backend/access/common'
make[3]: *** [common-recursive] Error 2
make[3]: Leaving directory
`/home/thom/Development/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: *** Waiting for unfinished jobs....

-- 
Thom



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Apr 3, 2014 at 1:23 PM, Thom Brown <thom@linux.com> wrote:
> I'm getting an error when building this:

Sorry. I ran this through Valgrind, and forgot to reset where the
relevant macro is define'd before submission. Attached revision should
build without issue.


--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Thom Brown
Date:
On 3 April 2014 19:05, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Apr 3, 2014 at 1:23 PM, Thom Brown <thom@linux.com> wrote:
>> I'm getting an error when building this:
>
> Sorry. I ran this through Valgrind, and forgot to reset where the
> relevant macro is define'd before submission. Attached revision should
> build without issue.

Looking good:

-T 100 -n -f sort.sql

Master: 21.670467 / 21.718653 (avg: 21.69456)
Patch: 66.888756 / 66.888756 (avg: 66.888756)

A 3.08x speedup


-c 80 -j 80 -T 100 -n -f sort.sql

Master: 38.450082 / 37.440701 (avg: 37.9453915)
Patch: 153.321946 / 145.004726 (avg: 149.163336)

A 3.93x speedup

-- 
Thom



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Apr 3, 2014 at 3:19 PM, Thom Brown <thom@linux.com> wrote:
> Master: 38.450082 / 37.440701 (avg: 37.9453915)
> Patch: 153.321946 / 145.004726 (avg: 149.163336)

I think that those are objectively very large reductions in a cost
that figures prominently in most workloads. Based on those facts, and
on the fairly low complexity of the patch, it may be worth considering
committing this before 9.4 goes into feature freeze, purely as
something that just adds a SortSupport function for the default text
opclass, with more or less the same cases accelerated as by the
existing opclasses that supply SortSupport. There have been only very
modest expansions to the SortSupport and tuplesort code. I haven't
generalized this to work with other areas where a normalized key could
be put to good use, but I see no reason to block on that.

Obviously I've missed the final commitfest deadline by a wide margin.
I don't suggest this lightly, and that is in no small part because the
patch is simple and easy to reason about. The patch could
almost (but not quite) be written as part of a third party text
operator class's sort support routine. I think that if an individual
committer were willing to commit this at their own discretion before
feature freeze, outside of the formal commitfest process, that would
not be an unreasonable thing in these particular, somewhat unusual
circumstances.

I defer entirely to the judgement of others here - this is not an
issue that I feel justified in expressing a strong opinion on, and in
fact I don't have such an opinion anyway. However, by actually looking
at the risks and the benefits here, I think everyone will at least
understand why I'd feel justified in broaching the topic. This is a
very simple idea, and a rather old one at that.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> I think that those are objectively very large reductions in a cost
> that figures prominently in most workloads. Based solely on those
> facts, but also on the fairly low complexity of the patch, it may be
> worth considering committing this before 9.4 goes into feature freeze,

Personally, I have paid no attention to this thread and have no intention
of doing so before feature freeze.  There are three dozen patches at
https://commitfest.postgresql.org/action/commitfest_view?id=21
that have moral priority for consideration for 9.4.  Not all of them are
going to get in, certainly, and I'm already feeling a lot of guilt about
the small amount of time I've been able to devote to reviewing/committing
patches this cycle.  Spending time now on patches that didn't even exist
at the submission deadline feels quite unfair to me.

Perhaps I shouldn't lay my own guilt trip on other committers --- but
I think it would be a bad precedent to not deal with the existing patch
queue first.
        regards, tom lane



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Apr 3, 2014 at 11:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Personally, I have paid no attention to this thread and have no intention
> of doing so before feature freeze.

Anything that you missed was likely musings on how to further
generalize SortSupport. The actual additions to SortSupport and
tuplesort proposed are rather small. A simple abstraction to allow for
semi-reliable normalized keys, and a text sort support function to use
it.

> Perhaps I shouldn't lay my own guilt trip on other committers --- but
> I think it would be a bad precedent to not deal with the existing patch
> queue first.

I think that that's a matter of personal priorities for committers. I
am not in a position to tell anyone what their priorities ought to be,
and giving extra attention to this patch may be unfair. It doesn't
have to be a zero-sum game, though. Attention from a committer to,
say, this patch does not necessarily have to come at the expense of
another, if for example this patch piques somebody's interest and
causes them to put extra work into it on top of what they'd already
planned to look at. Again, under these somewhat unusual circumstances,
that seems like something that some committer might be inclined to do,
without it being altogether unreasonable. Then again, perhaps not.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Noah Misch
Date:
On Thu, Apr 03, 2014 at 11:44:46PM -0400, Tom Lane wrote:
> Peter Geoghegan <pg@heroku.com> writes:
> > I think that those are objectively very large reductions in a cost
> > that figures prominently in most workloads. Based solely on those
> > facts, but also on the fairly low complexity of the patch, it may be
> > worth considering committing this before 9.4 goes into feature freeze,
> 
> Personally, I have paid no attention to this thread and have no intention
> of doing so before feature freeze.  There are three dozen patches at
> https://commitfest.postgresql.org/action/commitfest_view?id=21
> that have moral priority for consideration for 9.4.  Not all of them are
> going to get in, certainly, and I'm already feeling a lot of guilt about
> the small amount of time I've been able to devote to reviewing/committing
> patches this cycle.  Spending time now on patches that didn't even exist
> at the submission deadline feels quite unfair to me.
> 
> Perhaps I shouldn't lay my own guilt trip on other committers --- but
> I think it would be a bad precedent to not deal with the existing patch
> queue first.

+1

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Fri, Apr 4, 2014 at 12:13 PM, Noah Misch <noah@leadboat.com> wrote:
> On Thu, Apr 03, 2014 at 11:44:46PM -0400, Tom Lane wrote:
>> Peter Geoghegan <pg@heroku.com> writes:
>> > I think that those are objectively very large reductions in a cost
>> > that figures prominently in most workloads. Based solely on those
>> > facts, but also on the fairly low complexity of the patch, it may be
>> > worth considering committing this before 9.4 goes into feature freeze,
>>
>> Personally, I have paid no attention to this thread and have no intention
>> of doing so before feature freeze.  There are three dozen patches at
>> https://commitfest.postgresql.org/action/commitfest_view?id=21
>> that have moral priority for consideration for 9.4.  Not all of them are
>> going to get in, certainly, and I'm already feeling a lot of guilt about
>> the small amount of time I've been able to devote to reviewing/committing
>> patches this cycle.  Spending time now on patches that didn't even exist
>> at the submission deadline feels quite unfair to me.
>>
>> Perhaps I shouldn't lay my own guilt trip on other committers --- but
>> I think it would be a bad precedent to not deal with the existing patch
>> queue first.
>
> +1

+1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Greg Stark
Date:

On Fri, Apr 4, 2014 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Perhaps I shouldn't lay my own guilt trip on other committers --- but
>> I think it would be a bad precedent to not deal with the existing patch
>> queue first.
>
> +1

+1

I don't think we have to promise a strict priority queue and preempt all other development. But I agree loosely that processing patches that have been around should be a higher priority.

I've been meaning to do more review for a while and just took a skim through the queue. There are only a couple I feel I can contribute to, so I'm going to work on those, and then if it's still before the feature freeze I would like to go ahead with Peter's patch. I think it's generally a good patch.

Two questions I have:

1) Would it make more sense to use a floating-point value instead of an integer? I saw a need for a function like this when I was looking into doing GPU sorts, and GPUs expect floating-point values.

2) I would want to see a second data type, probably numeric, before committing, to be sure we had a reasonably generic API. But it's pretty simple to do so.


--
greg

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Claudio Freire
Date:
On Fri, Apr 4, 2014 at 5:29 PM, Greg Stark <stark@mit.edu> wrote:
> Two questions I have:
>
> 1) Would it make more sense to use a floating point instead of an integer? I
> saw a need for a function like this when I was looking into doing GPU sorts.
> But GPUs expect floating point values.


In the context of this patch, I don't think you want to add
uncertainty to the != 0 or ==0 case (which is what FP would do).



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Apr 4, 2014 at 4:29 PM, Greg Stark <stark@mit.edu> wrote:
> 1) Would it make more sense to use a floating point instead of an integer? I
> saw a need for a function like this when I was looking into doing GPU sorts.
> But GPUs expect floating point values.

There is no reason to assume that the normalized keys within the
memtuples array need to be compared as integers, floats, or anything
else. That's a matter for the opclass author. Right now, it happens to
be the case that some of our earlier aspirations for SortSupport, such
as alternative user-defined sorting algorithms are not supported. This
is presumably only because no one came up with a compelling
alternative (i.e. no one followed through with trying to figure out if
radix sort added much value). It wouldn't be very hard to make
SortSupport and tuplesort care about that, but that's a separate
issue.

> 2) I would want to see a second data type, probably numeric, before
> committing to be sure we had a reasonably generic api. But it's pretty
> simply to do so.

I could pretty easily write a numeric proof of concept. I don't think
it would prove anything about the interface, though - I've only had
tuplesort and SortSupport assume that normalized keys are generated as
tuples are initially copied over (where applicable), as well as what
to do when a non-reliable comparison returns 0 (where applicable).
That is surely the kernel of the general idea, and nothing more, so I
don't see that numeric support proves the suitability of the interface
(or at least the essential idea of the interface, as opposed to
exactly how the mechanism works in the proposed patch, where the
SortSupport struct is slightly extended to express this idea).

A large majority of new code is the new SortSupport routine for text.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Fri, Apr 4, 2014 at 4:29 PM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Apr 4, 2014 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> >> Perhaps I shouldn't lay my own guilt trip on other committers --- but
>> >> I think it would be a bad precedent to not deal with the existing patch
>> >> queue first.
>> >
>> > +1
>>
>> +1
>
> I don't think we have to promise a strict priority queue and preempt all
> other development. But I agree loosely that processing patches that have
> been around should be a higher priority.
>
> I've been meaning to do more review for a while and just took a skim through
> the queue. There are only a couple I feel I can contribute with so I'm going
> to work on those and then if it's still before the feature freeze I would
> like to go ahead with Peter's patch. I think it's generally a good patch.

To be honest, I think that's just flat-out inappropriate.  There were
over 100 patches in this CommitFest and there's not a single committed
patch that has your name on it even as a reviewer, let alone a
committer.  When a committer says, hey, I'm going to commit XYZ, that
basically forces anybody who might have an objection to it to drop
what they're doing and object fast, before it's too late.  In other
words, the people who just said that they are too busy reviewing
patches that were timely submitted and don't want to divert effort
from that to handle patches that weren't are going to have to do that
anyway, or lose their right to object.  I think that's unfair.  You're
essentially leveraging a commit bit that you haven't used in more than
three years to try to push a patch that was submitted months too late
to the head of the review queue - and, just to put icing on the cake,
it just so happens that you and the patch author work for the same
employer.  I have no objection to people committing patches written by
others who work at the same company, but only if those patches have
gone through a full, fair, and public review, with ample opportunity
for other people to complain if they don't like it.  That is obviously
not the case here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Apr 4, 2014 at 4:29 PM, Greg Stark <stark@mit.edu> wrote:
> > I've been meaning to do more review for a while and just took a skim through
> > the queue. There are only a couple I feel I can contribute with so I'm going
> > to work on those and then if it's still before the feature freeze I would
> > like to go ahead with Peter's patch. I think it's generally a good patch.
>
> To be honest, I think that's just flat-out inappropriate.  There were
> over 100 patches in this CommitFest and there's not a single committed
> patch that has your name on it even as a reviewer, let alone a
> committer.

I haven't got any either (except for my little one), which frustrates
me greatly.  Not because I'm looking for credit for the time I've
spent in discussions and doing reviews (and I could have sworn there
was some patch that I did commit), but because I've not been able to
find the larger chunks of time required to get the more complex
patches in.

> When a committer says, hey, I'm going to commit XYZ, that
> basically forces anybody who might have an objection to it to drop
> what they're doing and object fast, before it's too late.  In other
> words, the people who just said that they are too busy reviewing
> patches that were timely submitted and don't want to divert effort
> from that to handle patches that weren't are going to have to do that
> anyway, or lose their right to object.

I don't agree that this is the case.  We do revert patches from time to
time, when necessary, and issues with this particular patch seem likely
to be found during testing, well in advance of any release, and it's
self-contained enough to be reverted pretty easily.  Perhaps it's not
fair to expect everyone to have realized that, but at least any
committer looking at it would, I believe, weigh those against it being a
late patch.

> I think that's unfair.  You're
> essentially leveraging a commit bit that you haven't used in more than
> three years to try to push a patch that was submitted months too late
> to the head of the review queue - and, just to put icing on the cake,
> it just so happens that you and the patch author work for the same
> employer.  I have no objection to people committing patches written by
> others who work at the same company, but only if those patches have
> gone through a full, fair, and public review, with ample opportunity
> for other people to complain if they don't like it.  That is obviously
> not the case here.

As for this- it's disingenuous at best and outright accusatory at worst.
This discussion is happening entirely because of PGConf.NYC, imv,
where Peter brought this patch up with at least Greg, Magnus and me.
Also during PGConf.NYC, Greg was specifically asking me about patches
which were in the commitfest that could be reviewed, ideally without
having to go back through threads hundreds of messages long and dealing
with complex parts of the code.  Sadly, as is often the case, the "easy"
ones get handled pretty quickly and the difficult ones get left behind.

One bad part of the commitfest, imv, is that when a "clearly good,
small" patch does manage to show up, we don't formally take any of that
into consideration when we are prioritizing patches.  They end up being
informally prioritized and get in quickly at the beginning, but that
doesn't address the reality that those smaller patches would likely
get in without any kind of commitfest process, and not letting them in
because of the commitfest process doesn't generally make the larger
patches any more likely to get in- iow, it's not a zero-sum game.
We have quite a few "part-time" committers (or at least, committers who
have disproportionate amounts of time, or perhaps the difference is in
the size of the blocks of time which can be dedicated to PG) and saying
"no, you can't help unless you tackle the hard problems" doesn't
particularly move us forward.

All that said, I don't have any particularly good idea of how to "fix"
any of this- it's not fair to tell the committers who have more time (or
larger blocks of time, etc) "you must work the hard problems only"
either.  I don't feel Greg's interest in this patch has anything to do
with his current employment and everything to do with the side-talks in
NYC, and to that point, I'm very tempted to go look at this patch myself
because it sounds like an exciting improvement with minimal effort.
Would I feel bad for doing so, with the CustomScan API and Updatable
Security Barrier Views patches still pending?  Sure, but it's the
difference between finding an hour and finding 8.  The hour will come
pretty easily (probably spent half that on this email..), while an
8-hour block, which would likely turn into more, is nigh-on impossible
til at least this weekend.  And, no, 8x one-hour blocks would not be
worthwhile; I've tried that before.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Apr 7, 2014 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:
> All that said, I don't have any particularly good idea of how to "fix"
> any of this- it's not fair to tell the committers who have more time (or
> larger blocks of time, etc) "you must work the hard problems only"
> either.  I don't feel Greg's interest in this patch has anything to do
> with his current employment and everything to do with the side-talks in
> NYC, and to that point, I'm very tempted to go look at this patch myself
> because it sounds like an exciting improvement with minimal effort.
> Would I feel bad for doing so, with the CustomScan API and Updatable
> Security Barrier Views patches still pending?  Sure, but it's the
> difference between finding an hour and finding 8.  The hour will come
> pretty easily (probably spent half that on this email..), while an
> 8-hour block, which would likely turn into more, is neigh-on impossible
> til at least this weekend.  And, no, 8x one-hour blocks would not be
> worthwhile; I've tried that before.

If it's only going to take you an hour to address this patch (or 8 to
address those other ones) then you spend a heck of a lot less time on
review for a patch of a given complexity level than I do.  I agree
that it's desirable to slip things in, from time to time, when they're
uncontroversial and obviously meritorious, but I'm not completely
convinced that this is such a case.  As an utterly trivial point, I
find the naming to be less than ideal: "poorman" is not a term I want
to enshrine in our code.  That's not very descriptive of what the
patch is actually doing even if you know what the idiom means, and
people whose first language isn't English - many of whom do
significant work on our code - may not.

Now the point is not that that's a serious flaw in and of itself.  The
point is that these kinds of issues deserve to be discussed and agreed
on, and the process should be structured in a way that permits that.
And that discussion will require the time not only of the people who
find this patch more interesting than any other, but also of the
people who just said that they're busy with other things right now,
unless those people want to forfeit their right to an opinion.
Experience has shown that it's a whole lot easier for anyone here to
get a patch changed before it's committed than after it's committed,
so I don't buy your argument that the timing there doesn't matter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Apr 7, 2014 at 10:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> As an utterly trivial point, I
> find the naming to be less than ideal: "poorman" is not a term I want
> to enshrine in our code.  That's not very descriptive of what the
> patch is actually doing even if you know what the idiom means, and
> people whose first language - many of whom do significant work on our
> code - may not.

I didn't come up with the idea, or the name.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andres Freund
Date:
On 2014-04-07 10:29:53 -0700, Peter Geoghegan wrote:
> On Mon, Apr 7, 2014 at 10:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > As an utterly trivial point, I
> > find the naming to be less than ideal: "poorman" is not a term I want
> > to enshrine in our code.  That's not very descriptive of what the
> > patch is actually doing even if you know what the idiom means, and
> > people whose first language - many of whom do significant work on our
> > code - may not.
> 
> I didn't come up with the idea, or the name.

Doesn't mean it needs to be enshrined everywhere. I don't think Robert's
against putting it in some comments.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Apr 7, 2014 at 10:42 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I didn't come up with the idea, or the name.
>
> Doesn't mean it needs to be enshrined everywhere. I don't think Robert's
> against putting it in some comments.

That seems reasonable. If someone wants to call what I have here
semi-reliable normalized keys, for example, I have no objection
whatsoever. In fact, I think that's probably a good idea.


-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Apr 7, 2014 at 1:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> As an utterly trivial point, I
> find the naming to be less than ideal: "poorman" is not a term I want
> to enshrine in our code.  That's not very descriptive of what the
> patch is actually doing even if you know what the idiom means, and
> people whose first language - many of whom do significant work on our
> code - may not.

To throw out one more point that I think is problematic, Peter's
original email on this thread gives a bunch of examples of strxfrm()
normalization that all differ in the first few bytes - but so do
the underlying strings.  I *think* (but don't have time to check right
now) that on my MacOS X box, strxfrm() spits out 3 bytes of header
junk and then 8 bytes per character in the input string - so comparing
the first 8 bytes of the strxfrm()'d representation would amount to
comparing only part of the first character.  If for any reason the first byte is
the same (or similar enough) on many of the input strings, then this
will probably work out to be slower rather than faster.  Even if other
platforms are more space-efficient (and I think at least some of them
are), I think it's unlikely that this optimization will ever pay off
for strings that don't differ in the first 8 bytes.  And there are
many cases where that could be true a large percentage of the time
throughout the input, e.g. YYYY-MM-DD HH:MM:SS timestamps stored as
text.  It seems likely that the patch pessimizes those cases, though of
course there's no way to know without testing.
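
The timestamps-as-text case is easy to demonstrate. This is an illustration of the concern, not code from the patch, and the helper names are invented; how many bytes strxfrm() needs per character is platform- and locale-dependent, so how much of a string an 8-byte prefix can resolve varies too:

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* With a size of 0, strxfrm() just reports the space required for the
 * transformed string in the current LC_COLLATE locale; the expansion
 * factor differs across platforms, which is exactly why an 8-byte
 * prefix may cover less than one character's collation weights */
static size_t
xfrm_len(const char *s)
{
    return strxfrm(NULL, s, 0);
}

/* Timestamps stored as text are a worst case for any prefix scheme:
 * the strings agree for many leading bytes, so a truncated key is
 * almost always inconclusive and the full comparison still runs */
static int
prefix_is_inconclusive(const char *a, const char *b, size_t prefix_len)
{
    return memcmp(a, b, prefix_len) == 0;
}
```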

Now it *may well be* that after doing some research and performance
testing we will conclude that either no commonly-used platforms show
any regressions, or that the regressions that do occur are discountable
in view of the benefits to more common cases.  I just
don't think mid-April is the right time to start those discussions
with the goal of a 9.4 commit; and I also don't think committing
without having those discussions is very prudent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andres Freund
Date:
On 2014-04-07 13:01:52 -0400, Stephen Frost wrote:
> I haven't got any either (except for my little one), which frustrates
> me greatly.  Not because I'm looking for credit on the time that I've
> spent in discussions, doing reviews, and I could have sworn there was
> some patch that I did commit, but because I've not been able to find
> the larger chunks of time required to get the more complex patches in.

I am a bit confused. To my eyes there's been a huge number of actually
trivial patches in this commitfest? Even now, there's some:

* Bugfix for timeout in LDAP connection parameter resolution
* Problem with displaying "wide" tables in psql
* Enable CREATE FOREIGN TABLE (... LIKE ... )
* Add min, max, and stdev execute statement time in pg_stat_statement
* variant of regclass etc.
* vacuumdb: Add option --analyze-in-stages

Are all small patches that don't need major changes before getting committed.

That's after three months. And after a high number of smaller patches
committed by Tom on Friday.

> > When a committer says, hey, I'm going to commit XYZ, that
> > basically forces anybody who might have an objection to it to drop
> > what they're doing and object fast, before it's too late.  In other
> > words, the people who just said that they are too busy reviewing
> > patches that were timely submitted and don't want to divert effort
> > from that to handle patches that weren't are going to have to do that
> > anyway, or lose their right to object.
> 
> I don't agree that this is the case.  We do revert patches from time to
> time, when necessary, and issues with this particular patch seem likely
> to be found during testing, well in advance of any release, and it's
> self contained enough to be reverted pretty easily.

Given the track record the project seems to have with testing, I don't
have much faith in that claim. But even if it does, it'll only get you
testing on 2-3 platforms, without noticing portability issues.

I think it'd be a different discussion if this were CF-1 or so. But
we're nearly *2* months after the *end* of the last CF.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> If it's only going to take you an hour to address this patch (or 8 to
> address those other ones) then you spend a heck of a lot less time on
> review for a patch of a given complexity level than I do.

Eh, I don't really time it, and I'm probably completely off in reality;
it's more of a relative "feeling" thing wrt how long it'd take.  I was
also thinking about it from an 'initial review and provide feedback'
standpoint; actually getting to a point of committing something
certainly takes much longer, but I can also kick off tests and do
individual testing in those smaller time blocks.  Reading the code and
understanding it, writing up the feedback email, etc, is what requires
the larger single block.  Perhaps I can work on improving that for
myself and maybe find a way to do it in smaller chunks, but that hasn't
happened in the however-many-years, and there's the whole 'old dog, new
tricks' issue.

> I agree
> that it's desirable to slip things in, from time to time, when they're
> uncontroversial and obviously meritorious, but I'm not completely
> convinced that this is such a case.  As an utterly trivial point, I
> find the naming to be less than ideal: "poorman" is not a term I want
> to enshrine in our code.  That's not very descriptive of what the
> patch is actually doing even if you know what the idiom means, and
people whose first language is not English - many of whom do
significant work on our code - may not.

Fair enough.

> Now the point is not that that's a serious flaw in and of itself.  The
> point is that these kinds of issues deserve to be discussed and agreed
> on, and the process should be structured in a way that permits that.

The issue on it being called "poorman"?  That doesn't exactly strike me
as needing a particularly long discussion, nor that it would be
difficult to change later.  I agree that there may be other issues, and
it'd be great to get buy-in from everyone before anything goes in, but
there's really zero hope of that being a reality.

> And that discussion will require the time not only of the people who
> find this patch more interesting than any other, but also of the
> people who just said that they're busy with other things right now,
> unless those people want to forfeit their right to an opinion.

I certainly hope that no committer feels that they forfeit their right
to an opinion about a piece of code because they didn't object to it
before it was committed.  My experience is that committed code gets
reviewed and concerns are raised, at which point it's usually on the
original committer to go back and fix it; which I'm certainly glad for.

> Experience has shown that it's a whole lot easier for anyone here to
> get a patch changed before it's committed than after it's committed,
> so I don't buy your argument that the timing there doesn't matter.

Once code has been released and there are external dependencies on it,
that's obviously an issue.  Ahead of that, I feel like the issue is
really more one of interest- everyone is very interested when code is
about to go in, but once it's in, for whatever reason, the interest
becomes much less to review and comment on it.  I feel like that's a
very long-standing issue as it relates to our 'beta' period because
doing code review just simply isn't fun and the motivation is reduced
once it's been committed.

That makes me wonder about having "beta" review-fests (in fact, I feel
like that may have been proposed before...), where commits are assigned
out to be reviewed by someone other than the committer/author/original
reviewer, as a way to motivate individuals to go review what has gone
in before we get to release.  That might help us ensure that more of the
committed code *gets* another review before release and reduce the
issues we have post-release.  Just a thought.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Andres Freund (andres@2ndquadrant.com) wrote:
> On 2014-04-07 13:01:52 -0400, Stephen Frost wrote:
> > I haven't got any either (except for my little one), which frustrates
> > me greatly.  Not because I'm looking for credit on the time that I've
> > spent in discussions, doing reviews, and I could have sworn there was
> > some patch that I did commit, but because I've not been able to find
> > the larger chunks of time required to get the more complex patches in.
>
> I am a bit confused. To my eyes there's been a huge number of actually
> trivial patches in this commitfest? Even now, there's some:
>
> * Bugfix for timeout in LDAP connection parameter resolution

I can take a look at that (if no one else wants to speak up about it).

> * Problem with displaying "wide" tables in psql

That's not without controversy, as I understand it, but I admit that I
haven't been following it terribly closely.

> * Enable CREATE FOREIGN TABLE (... LIKE ... )

This has definitely got issues which are not trivial, see Tom's recent
email on the subject..

> * Add min, max, and stdev execute statement time in pg_stat_statement

This was also quite controversial.  If we've finally settled on this as
being acceptable then perhaps it can get in pretty easily.

> * variant of regclass etc.

This was recently being discussed also.

> * vacuumdb: Add option --analyze-in-stages

Haven't looked at this at all.

> Are all small patches that don't need major changes before getting committed.

That strikes me as optimistic.  I do plan to go do another pass through
the commitfest patches before looking at other things (as Greg also said
he would do); thanks for bringing up the ones you feel are more
manageable- it'll help me focus on them.

> Given the track record the project seems to have with testing, I
> don't have much faith in that claim. But even if it does, it'll only
> get you testing on 2-3 platforms, without noticing portability issues.

This would be another case where it'd be nice if we could give people
access to the buildfarm w/o having to actually commit something.

> I think it'd be a different discussion if this were CF-1 or so. But
> we're nearly *2* months after the *end* of the last CF.

There wouldn't be any discussion if it was CF-1 as I doubt anyone would
object to it going in (or at least not as strongly..), even if it was
submitted after CF-1 was supposed to be over with remaining patches.
It's the threat of getting punted to the next release that really makes
the difference here, imv.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Apr 7, 2014 at 1:58 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> If it's only going to take you an hour to address this patch (or 8 to
>> address those other ones) then you spend a heck of a lot less time on
>> review for a patch of a given complexity level than I do.
>
> Eh, I don't really time it and I'm probably completely off in reality,
> it's more of a relative "feeling" thing wrt how long it'd take.  I was
> also thinking about it from an 'initial review and provide feedback'
> standpoint, actually getting to a point of committing something
> certainly takes much longer, but I can also kick off tests and do
> individual testing in those smaller time blocks.  Reading the code and
> understanding it, writing up the feedback email, etc, is what requires
> the larger single block.  Perhaps I can work on improving that for
> myself and maybe find a way to do it in smaller chunks, but that hasn't
> happened in the however-many-years, and there's the whole 'old dog, new
> tricks' issue.

It takes me a big block of time, too, at least for the initial review.

> The issue on it being called "poorman"?  That doesn't exactly strike me
> as needing a particularly long discussion, nor that it would be
> difficult to change later.  I agree that there may be other issues, and
> it'd be great to get buy-in from everyone before anything goes in, but
> there's really zero hope of that being a reality.

Really?  I think there have been just about zero patches that have
gone in in the last (ugh) three months that have had significant
unaddressed objections from anyone at the time they were committed.
There has certainly been an enormous amount of work by a whole lot of
people to address objections that have been raised, and generally that
has gone well.  I will not pretend that every patch that has gone in
is completely well-liked by everyone and I am sure that is not the
case.  Nevertheless I think we've done pretty well.  Now the people
whose stuff has not got in - and may not get in - may well feel that
their stuff got short shrift, and I'm not going to deny that there's a
problem there.  But breaking from our usual procedure for this patch
is just adding to that unfairness, not making anything better.

>> And that discussion will require the time not only of the people who
>> find this patch more interesting than any other, but also of the
>> people who just said that they're busy with other things right now,
>> unless those people want to forfeit their right to an opinion.
>
> I certainly hope that no committer feels that they forfeit their right
> to an opinion about a piece of code because they didn't object to it
> before it was committed.  My experience is that committed code gets
> reviewed and concerns are raised, at which point it's usually on the
> original committer to go back and fix it; which I'm certainly glad for.

Sure, people can object to anything whenever they like.  But it
becomes an uphill battle once it goes in, unless the breakage is
rather flagrant.

Regardless, if this patch had had multiple, detailed reviews finding
lots of issues that were then addressed, I'd probably be keeping my
trap shut and hoping for the best.  But it hasn't.  Any patch of this
size is going to have nitpicky stuff that needs to be addressed, if
nothing else, and points requiring some modicum of public discussion,
and there hasn't been any of that.  That suggests to me that it just
hasn't been thoroughly reviewed yet, at least not in public, and that
should happen long before anyone talks about picking it up for commit.

If this patch had been timely submitted and I'd picked it up right
away, I would have expected at least three weeks to elapse between the
time my first review was posted and the time all the details were
nailed down, even assuming no more serious problems were found, and
starting that ball rolling now means looking forward to a commit
around the end of April assuming things go great. That seems way too
late to me; IMHO, we should already be mostly in mop-up mode now
(hence my recent DSM patch cleanup patch, which no one has commented
on...).

And I think it's likely that, in fact, we'll find cases where this
regresses performance rather badly, for reasons sketched in an email I
posted a bit ago.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> To throw out one more point that I think is problematic, Peter's
> original email on this thread gives a bunch of examples of strxfrm()
> normalization that all differ in the first few bytes - but so do
> the underlying strings.  I *think* (but don't have time to check right
> now) that on my MacOS X box, strxfrm() spits out 3 bytes of header
> junk and then 8 bytes per character in the input string - so comparing
> the first 8 bytes of the strxfrm()'d representation would amount to
> comparing part of the first byte.  If for any reason the first byte is
> the same (or similar enough) on many of the input strings, then this
> will probably work out to be slower rather than faster.  Even if other
> platforms are more space-efficient (and I think at least some of them
> are), I think it's unlikely that this optimization will ever pay off
> for strings that don't differ in the first 8 bytes.  And there are
> many cases where that could be true a large percentage of the time
> throughout the input, e.g. YYYY-MM-DD HH:MM:SS timestamps stored as
> text.  It seems like that the patch pessimizes those cases, though of
> course there's no way to know without testing.

Portability and performance concerns were exactly what worried me as
well.  It was my hope/understanding that this was a clear win which was
vetted by other large projects across multiple platforms.  If that's
actually in doubt and it isn't a clear win then I agree that we can't be
trying to squeeze it in at this late date.

> Now it *may well be* that after doing some research and performance
> testing we will conclude that either no commonly-used platforms show
> any regressions or that the regressions that do occur are discountable
> in view of the benefits to more common cases.  I just
> don't think mid-April is the right time to start those discussions
> with the goal of a 9.4 commit; and I also don't think committing
> without having those discussions is very prudent.

I agree with this in concept- but I'd be willing to spend a bit of time
researching it, given that it's from a well known and respected author
who I trust has done much of this research already.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Apr 7, 2014 at 10:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> To throw out one more point that I think is problematic, Peter's
> original email on this thread gives a bunch of examples of strxfrm()
> normalization that all differ in the first few bytes - but so do
> the underlying strings.  I *think* (but don't have time to check right
> now) that on my MacOS X box, strxfrm() spits out 3 bytes of header
> junk and then 8 bytes per character in the input string - so comparing
> the first 8 bytes of the strxfrm()'d representation would amount to
> comparing part of the first byte.  If for any reason the first byte is
> the same (or similar enough) on many of the input strings, then this
> will probably work out to be slower rather than faster.  Even if other
> platforms are more space-efficient (and I think at least some of them
> are), I think it's unlikely that this optimization will ever pay off
> for strings that don't differ in the first 8 bytes.

Why would any platform have header bytes in the resulting binary
strings? That doesn't make any sense. Are you sure you aren't thinking
of the homogeneous trailing bytes that you can also see in my example?

The only case that this patch could possibly regress is where there
are strings that differ beyond about the first 8 bytes, but are not
identical (we chance a memcmp() == 0 before doing a full strcoll()
when tie-breaking on the semi-reliable initial comparison). We still
always avoid fmgr-overhead (and shim overhead, which I've measured),
as in your original patch - you put that at adding 7% at the time,
which is likely to make up for otherwise-regressed cases. There is
nothing at all contrived about my test-case.
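To make the scheme being described concrete, here is a minimal standalone sketch (this is not Peter's patch; the struct and function names are hypothetical): trust any non-zero result on a truncated strxfrm() prefix, and on a tie try memcmp() == 0 as a cheap equality proxy before paying for a full strcoll():

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch of an "abbreviated key" comparison: the leading
 * bytes of the strxfrm() blob act as a proxy, and only ties fall back
 * to the full strings.
 */
typedef struct
{
    char        abbrev[8];      /* leading strxfrm() bytes, zero-padded */
    const char *full;           /* original string, for tie-breaking */
} SortKey;

static void
make_key(SortKey *key, const char *s)
{
    size_t      len = strxfrm(NULL, s, 0);      /* size-query form */
    char       *blob = malloc(len + 1);

    strxfrm(blob, s, len + 1);
    memset(key->abbrev, 0, sizeof(key->abbrev));
    memcpy(key->abbrev, blob, len < sizeof(key->abbrev) ? len : sizeof(key->abbrev));
    key->full = s;
    free(blob);
}

static int
compare_keys(const SortKey *a, const SortKey *b)
{
    int         cmp = memcmp(a->abbrev, b->abbrev, sizeof(a->abbrev));

    if (cmp != 0)
        return cmp;             /* non-zero abbreviated result is trustworthy */
    if (strcmp(a->full, b->full) == 0)
        return 0;               /* cheap equality proxy skips strcoll() */
    return strcoll(a->full, b->full);   /* authoritative tie-break */
}
```

The key point the thread hinges on is in compare_keys(): a non-zero memcmp() on the transformed prefix can be returned directly, while a tie proves nothing and must be resolved against the original strings.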

You have to have an awfully large number of significantly similar but
not identical strings in order to possibly lose out. Even if you have
such a case, and the fmgr-trampoline-elision doesn't make up for it
(doesn't make up for having to do a separate heapattr lookup on the
minimal tuple, an optimization not too relevant for pass by reference
types), which is quite a stretch, it seems likely that you have other
cases that do benefit, which in aggregate makes up for it. The
benefits I've shown, on the first test case I picked are absolutely
enormous.

Now, let's assume that I'm wrong about all this, and that in fact
there is a plausible case where all of those tricks don't work out,
and someone has a complaint about a regression. What are we going to
do about that? Not accelerate text sorting by at least a factor of 3
for the benefit of some very narrow use-case? That's the only measure
I can see that you could take to not regress that case.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andres Freund
Date:
On 2014-04-07 14:12:09 -0400, Stephen Frost wrote:
> I can take a look at that (if no one else wants to speak up about it).
> 
> > * Problem with displaying "wide" tables in psql
> 
> That's not without controvery, as I understand it, but I admit that I
> haven't been following it terribly closely.

There didn't seem to be any conflicts here? I am talking about
http://archives.postgresql.org/message-id/CAJTaR32A1_d0DqP25T4%3DLwE3RpmhNf3oY%3Dr0-ksejepfPv6O%3Dw%40mail.gmail.com

> > * Enable CREATE FOREIGN TABLE (... LIKE ... )
> 
> This has definitely got issues which are not trival, see Tom's recent
> email on the subject..

Yea. Among other things, he confirmed my comments. The issue there was
basically that I didn't like something, others disagreed. Still needed a
committers input.

> > * Add min, max, and stdev execute statement time in pg_stat_statement
> 
> This was also quite controversial.  If we've finally settled on this as
> being acceptable then perhaps it can get in pretty easily.

The minimal variant (just stddev) didn't seem to be all that
controversial.

> > I think it'd be a different discussion if this were CF-1 or so. But
> > we're nearly *2* months after the *end* of the last CF.
> 
> There wouldn't be any discussion if it was CF-1 as I doubt anyone would
> object to it going in (or at least not as strongly..), even if it was
> submitted after CF-1 was supposed to be over with remaining patches.
> It's the threat of getting punted to the next release that really makes
> the difference here, imv.

I can understand that for feature patches which your company/client
needs, but for a localized performance improvement? Unconvinced.

And sorry, if the "threat of getting punted to the next release" plays
a significant role in here, the patch *needs* to be punted. The only
reason I can see at this point is "ah, trivial enough, let's just do
this now".

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Apr 7, 2014 at 2:12 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> I think it'd be a different discussion if this were CF-1 or so. But
>> we're nearly *2* months after the *end* of the last CF.
>
> There wouldn't be any discussion if it was CF-1 as I doubt anyone would
> object to it going in (or at least not as strongly..), even if it was
> submitted after CF-1 was supposed to be over with remaining patches.
> It's the threat of getting punted to the next release that really makes
> the difference here, imv.

The point is that if this had been submitted for CF-1, CF-2, CF-3, or
CF-4, and I had concerns about it (which I do), then I would have
budgeted time to record those concerns so that they could be discussed
and, if necessary, addressed.  Since it wasn't, I assumed I didn't
need to worry about studying the patch, figuring out which of my
concerns were actually legitimate, searching for other areas of
potential concern, and putting together a nice write-up until June -
and maybe not even then, because at that point other people might also
be studying the patch and might cover those areas in sufficient detail
as to obviate my own concerns.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Alvaro Herrera
Date:
Stephen Frost wrote:
> * Andres Freund (andres@2ndquadrant.com) wrote:

> > I think it'd be a different discussion if this were CF-1 or so. But
> > we're nearly *2* months after the *end* of the last CF.
> 
> There wouldn't be any discussion if it was CF-1 as I doubt anyone would
> object to it going in (or at least not as strongly..), even if it was
> submitted after CF-1 was supposed to be over with remaining patches.
> It's the threat of getting punted to the next release that really makes
> the difference here, imv.

That's why we have this rule that CF4 should only receive patches that
were already reviewed in previous commitfests.  I, too, find the
fast-tracking of this patch completely outside of the CF process to be
distasteful.  We summarily reject much smaller patches at the end of
each cycle, even when the gain is as obvious as is claimed to
be for this patch.

TBH I don't see why we're even discussing this.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 7, 2014 at 1:58 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > The issue on it being called "poorman"?  That doesn't exactly strike me
> > as needing a particularly long discussion, nor that it would be
> > difficult to change later.  I agree that there may be other issues, and
> > it'd be great to get buy-in from everyone before anything goes in, but
> > there's really zero hope of that being a reality.
>
> Really?  I think there have been just about zero patches that have
> gone in in the last (ugh) three months that have had significant
> unaddressed objections from anyone at the time they were committed.

That implies that everyone has been reviewing everything, which is
certainly not entirely the case..  If that's happening then I'm not sure
I understand the point of the commitfest, and feel I must be falling
down on the job here.  Certainly there are patches which have been
picked up by committers and committed that I've not reviewed.  I also
didn't object, but if I find issue with them in the future, I'd
certainly bring them up, even post-commit.

> There has certainly been an enormous amount of work by a whole lot of
> people to address objections that have been raised, and generally that
> has gone well.  I will not pretend that every patch that has gone in
> is completely well-liked by everyone and I am sure that is not the
> case.  Nevertheless I think we've done pretty well.  Now the people
> whose stuff has not got in - and may not get in - may well feel that
> their stuff got short shrift, and I'm not going to deny that there's a
> problem there.  But breaking from our usual procedure for this patch
> is just adding to that unfairness, not making anything better.

This goes back to the point which I was trying to make earlier-
prioritization.  Small+clearly-good patches should be prioritized
higher, but identifying those is no small matter.  Of course, we would
need to address the starvation risk when it comes to larger patches,
which I've got no answer for.

> Regardless, if this patch had had multiple, detailed reviews finding
> lots of issues that were then addressed, I'd probably be keeping my
> trap shut and hoping for the best.  But it hasn't.  Any patch of this
> size is going to have nitpicky stuff that needs to be addressed, if
> nothing else, and points requiring some modicum of public discussion,
> and there hasn't been any of that.  That suggests to me that it just
> hasn't been thoroughly reviewed yet, at least not in public, and that
> should happen long before anyone talks about picking it up for commit.

Perhaps my memory is foggy, but I recall at least some amount of
discussion and review of this, along with fixes going in, over the past
couple of weeks.  More may be needed, of course, but if so, I'd expect
any committer looking at it would realize that and bounce it at this
point, given the feature freeze deadline which is like next week..

>  If this patch had been timely submitted and I'd picked it up right
> away, I would have expected at least three weeks to elapse between the
> time my first review was posted and the time all the details were
> nailed down, even assuming no more serious problems were found, and
> starting that ball rolling now means looking forward to a commit
> around the end of April assuming things go great. That seems way too
> late to me; IMHO, we should already be mostly in mop-up mode now
> (hence my recent DSM patch cleanup patch, which no one has commented
> on...).

I agree with this.

> And I think it's likely that, in fact, we'll find cases where this
> regresses performance rather badly, for reasons sketched in an email I
> posted a bit ago.

I've not researched it enough myself to say.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> That's why we have this rule that CF4 should only receive patches that
> were already reviewed in previous commitfests.

I, at least, always understood that rule to be 'large' patches, which
this didn't strike me as.

> I, too, find the
> fast-tracking of this patch completely outside of the CF process to be
> distasteful.  We summarily reject much smaller patches at the end of
> each cycle process, even when the gain is as obvious as is claimed to
> be for this patch.

In the past, we've also committed large patches which were submitted for
the first time to CF-4.

> TBH I don't see why we're even discussing this.

Think I'm about done, personally.  I can't comment more without actually
looking at it and doing some research on it myself and I don't know that
I'll be able to do that any time soon, as I told Peter when he asked me
about it in NYC.  That said, for my part, I don't like telling Greg that
he either has to review something else which was submitted but that he's
got no interest in (or which would take much longer), or not do
anything.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andres Freund
Date:
On 2014-04-07 14:35:23 -0400, Stephen Frost wrote:
> That said, for my part, I don't like telling Greg that
> he either has to review something else which was submitted but that he's
> got no interest in (or which would take much longer), or not do
> anything.

Reviewing and committing are two very different shoes imo. This
discussion wasn't about it getting reviewed before the next CF, but
about committing it into 9.4.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Apr 7, 2014 at 2:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Apr 7, 2014 at 10:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> To throw out one more point that I think is problematic, Peter's
>> original email on this thread gives a bunch of examples of strxfrm()
>> normalization that all differ in the first few bytes - but so do
>> the underlying strings.  I *think* (but don't have time to check right
>> now) that on my MacOS X box, strxfrm() spits out 3 bytes of header
>> junk and then 8 bytes per character in the input string - so comparing
>> the first 8 bytes of the strxfrm()'d representation would amount to
>> comparing part of the first byte.  If for any reason the first byte is
>> the same (or similar enough) on many of the input strings, then this
>> will probably work out to be slower rather than faster.  Even if other
>> platforms are more space-efficient (and I think at least some of them
>> are), I think it's unlikely that this optimization will ever pay off
>> for strings that don't differ in the first 8 bytes.
>
> Why would any platform have header bytes in the resulting binary
> strings? That doesn't make any sense. Are you sure you aren't thinking
> of the homogeneous trailing bytes that you can also see in my example?

No, I'm not sure of that at all.  I haven't looked at this topic in a
while, but I'm happy to budget some to time to do so - for the June
CommitFest.

> The only case that this patch could possibly regress is where there
> are strings that differ beyond about the first 8 bytes, but are not
> identical (we chance a memcmp() == 0 before doing a full strcoll()
> when tie-breaking on the semi-reliable initial comparison). We still
> always avoid fmgr-overhead (and shim overhead, which I've measured),
> as in your original patch - you put that at adding 7% at the time,
> which is likely to make up for otherwise-regressed cases. There is
> nothing at all contrived about my test-case.

It's not a question of whether your test case is contrived.  Your test
case can be (and likely is) extremely realistic and still not account
for other cases when the patch regresses performance.  If I understand
correctly, and I may not because I wasn't planning to spend time on
this patch until the next CommitFest, the patch basically uses the
bytes available in datum1 to cache what you refer to as a normalized
or poor man's sort key which, it is hoped, will break most ties.
However, on any workload where it fails to break ties, you're going to
incur the additional overheads of (1) comparing the poor-man's sort
key, (2) memcmping the strings (based on what you just said), and then
(3) digging the correct datum back out of the tuple.  I note that
somebody thought #3 was important enough to be worth creating datum1
for in the first place, so I don't think it can or should be assumed
that undoing that optimization in certain cases will turn out to be
cheap enough not to matter.
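The worst case sketched in that paragraph is easy to construct in a toy form (this is illustrative code, not from the thread): when every value shares a long common prefix, as with timestamps stored as text, an 8-byte abbreviated comparison always ties, so each comparison pays for the prefix check *and* the fallback.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Toy illustration of the fallback cost: keys sharing a long common
 * prefix never get resolved by the 8-byte abbreviated comparison, so
 * every single comparison takes the tie-break path as well.
 */
static long full_compares;      /* how often the tie-break path ran */

static int
cmp_abbrev_then_full(const void *pa, const void *pb)
{
    const char *a = *(const char *const *) pa;
    const char *b = *(const char *const *) pb;
    int         cmp = memcmp(a, b, 8);  /* abbreviated comparison */

    if (cmp != 0)
        return cmp;
    full_compares++;            /* the case the patch must not make slow */
    return strcmp(a, b);        /* fallback to the full key */
}
```

Sorting e.g. one hundred strings of the form "2014-04-07 12:00:NN" with qsort() and this comparator leaves full_compares equal to the total number of comparisons, since the first 8 bytes never differ.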

In short, I don't see any evidence that you've made an attempt to
construct a worst-case scenario for this patch, and that's a basic
part of performance testing.  I had to endure having Andres beat the
snot out of me over cases where the MVCC snapshot patch regressed
performance, and as painful as that was, it led to a better patch.  If
I'd committed the first thing that did well on my own tests, which
*did* include attempts (much less successful than Andres's) to find
regressions, we'd be significantly worse off today than we are.

> You have to have an awfully large number of significantly similar but
> not identical strings in order to possibly lose out. Even if you have
> such a case, and the fmgr-trampoline-elision doesn't make up for it
> (doesn't make up for having to do a separate heapattr lookup on the
> minimal tuple, and optimization not too relevant for pass by reference
> types), which is quite a stretch, it seems likely that you have other
> cases that do benefit, which in aggregate makes up for it. The
> benefits I've shown, on the first test case I picked are absolutely
> enormous.

Testing the cases where your patch wins and hand-waving that the
losses won't be that bad in other cases - without actually testing -
is not the right methodology.

And I think it would be quite a large error to assume that tables
never contain large numbers of similar but not identical strings.  I
bet there are many people who have just that.  That's also quite an
inaccurate depiction of the cases where this will regress things; a
bunch of strings that share the first few characters might not be
similar in any normal human sense of that term, while still slipping
through the proposed sieve.  In your OP, a six-byte string blew up
to 20 bytes after being strxfrm()'d, which means that you're only
comparing the first 2-3 bytes.  On a platform where Datums are only 4
bytes, you're only comparing the first 1-2 bytes.  Arguing that nobody
anywhere has a table where not even the first character or two are
identical across most or all of the strings in the column is
ridiculous.
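The arithmetic behind that claim can be checked on any platform with strxfrm()'s size-query form. The sketch below is hedged: the expansion factor is entirely platform- and locale-dependent, and the 20-bytes-for-6-characters figure is one glibc example from the start of this thread; the helper names are made up here.

```c
#include <locale.h>
#include <string.h>

/*
 * Measure strxfrm() expansion for the current LC_COLLATE setting.
 * With an expansion of k bytes per input character, an 8-byte Datum
 * prefix only distinguishes roughly the first 8/k characters (2-3 of
 * them at glibc's typical k of 3-4), and a 4-byte Datum only 1-2.
 */
static size_t
xfrm_len(const char *s)
{
    return strxfrm(NULL, s, 0); /* size query: no output buffer needed */
}

static size_t
chars_in_prefix(const char *s, size_t prefix_bytes)
{
    size_t      n = strlen(s);
    size_t      blown = xfrm_len(s);
    size_t      per_char;
    size_t      fit;

    if (n == 0)
        return 0;
    per_char = (blown + n - 1) / n;     /* average bytes per character, rounded up */
    fit = prefix_bytes / per_char;
    return fit < n ? fit : n;
}
```

With the thread's opening numbers (6 characters blowing up to 20 bytes, i.e. 3-4 bytes each), this estimate reproduces the 2-3 distinguishing characters for an 8-byte prefix that the paragraph above argues from.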

> Now, let's assume that I'm wrong about all this, and that in fact
> there is a plausible case where all of those tricks don't work out,
> and someone has a complaint about a regression. What are we going to
> do about that? Not accelerate text sorting by at least a factor of 3
> for the benefit of some very narrow use-case? That's the only measure
> I can see that you could take to not regress that case.

The appropriate time to have a discussion about whether the patch wins
by a large enough margin in enough use cases to justify possible
regressions in cases where it loses is after we have made a real
effort to characterize the winning and losing cases, mitigate the
losing cases as far as possible, and benchmarked the results.  I think
it's far from a given even that your patch will win in more cases than
it loses, or that the regressions when it loses will be small.  That,
of course, has yet to be proven, but so does the contrary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Andres Freund (andres@2ndquadrant.com) wrote:
> On 2014-04-07 14:35:23 -0400, Stephen Frost wrote:
> > That said, for my part, I don't like telling Greg that
> > he either has to review something else which was submitted but that he's
> > got no interest in (or which would take much longer), or not do
> > anything.
>
> Reviewing and committing are two very different shoes imo. This
> discussion wasn't about it getting reviewed before the next CF, but
> about committing it into 9.4.

I'm not entirely sure why.  No one is giving grief to the people
bringing up new things which are clearly for 9.5 at this point, yet we
have a ton of patches that still need to be reviewed.  I don't think
we want to tell individuals who are volunteering how to spend their
time, but what we're saying is that they can do anything except
actually commit stuff, because we must have ordering to what gets
committed; an ordering which hasn't really got any prioritization to
it based on the patch itself but depends entirely on when it was
submitted.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Apr 7, 2014 at 2:35 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-04-07 14:35:23 -0400, Stephen Frost wrote:
>> That said, for my part, I don't like telling Greg that
>> he either has to review something else which was submitted but that he's
>> got no interest in (or which would take much longer), or not do
>> anything.
>
> Reviewing and committing are two very different shoes imo. This
> discussion wasn't about it getting reviewed before the next CF, but
> about committing it into 9.4.

Yes.  I did not object to this patch being posted in the midst of
trying to nail down this release, and I certainly do not object to
Greg, or Stephen, or anyone else reviewing it.  My note was
specifically prompted not by someone say they intended to *review* the
patch, but that they intended to *commit* it when it hasn't even
really been reviewed yet.  There are patches that are trivial enough
that it's fine for someone to commit them without a public review
first, but this isn't remotely close to being in that category.  If
nothing else, the fact that it extends the definition of the btree
opclass is sufficient reason to merit a public review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Greg Stark
Date:
On Mon, Apr 7, 2014 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> You're
> essentially leveraging a commit bit that you haven't used in more than
> three years to try to push a patch that was submitted months too late


I'm not leveraging anything and I'm not going to push something unless
people are on board. That's *why* I sent that message. And I started
the email by saying I was going to go work on patches from the
commitfest first.

-- 
greg



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> There are patches that are trivial enough
> that it's fine for someone to commit them without a public review
> first, but this isn't remotely close to being in that category.  If
> nothing else, the fact that it extends the definition of the btree
> opclass is sufficient reason to merit a public review.

hrmpf.  You have a good point about that- I admit that I didn't consider
that as much as I should have.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Apr 7, 2014 at 11:59 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> There are patches that are trivial enough
>> that it's fine for someone to commit them without a public review
>> first, but this isn't remotely close to being in that category.  If
>> nothing else, the fact that it extends the definition of the btree
>> opclass is sufficient reason to merit a public review.
>
> hrmpf.  You have a good point about that- I admit that I didn't consider
> that as much as I should have.

Actually, contrary to the original subject of this thread, that isn't
the case. I have not added a support function 3, which I ultimately
concluded was a bad idea. This is all sort support.


-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Apr 7, 2014 at 11:56 AM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Apr 7, 2014 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> You're
>> essentially leveraging a commit bit that you haven't used in more than
>> three years to try to push a patch that was submitted months too late
>
>
> I'm not leveraging anything and I'm not going to push something unless
> people are on board. That's *why* I sent that message. And I started
> the email by saying I was going to go work on patches from the
> commitfest first.

Exactly.

I was of the opinion, as someone familiar with the subject matter, that
this rose to the level of deserving special consideration. I'm glad
that there does seem to be a general recognition that such a category
exists. Given the reservations of Robert and others, this isn't going
to happen for 9.4. It was never going to happen under a cloud of
controversy. I only broached the idea.

Special consideration is not something I ask for lightly. I must admit
that it's hard to see things as I do if you aren't as familiar with
the problem. I happen to think that this is the wrong decision, but
I'll leave it at that.

I'm sure that whatever we come up with for 9.5 will be a lot better
than what I have here, because it will probably be generalized to
other important cases.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Apr 7, 2014 at 2:56 PM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Apr 7, 2014 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> You're
>> essentially leveraging a commit bit that you haven't used in more than
>> three years to try to push a patch that was submitted months too late
>
> I'm not leveraging anything and I'm not going to push something unless
> people are on board. That's *why* I sent that message. And I started
> the email by saying I was going to go work on patches from the
> commitfest first.

I said that a lot more harshly than I should have, and I impugned you
unfairly.  Sorry.

I'm going to try again:

I don't doubt that your desire to move this patch forward is motivated
by the best of intentions.  However,
whether you intend it or not, trying to move this patch toward a 9.4
commit, or even trying to get people to express an opinion on whether
this is suitable for a 9.4 commit, is inevitably going to cause senior
reviewers who think they might have concerns about it to need to spend
time on it.  Inevitably, that time will come at the expense of patches
that were timely submitted, and that is unfair to the people who
submitted those patches.

Of course, if you want to review this patch now, I'm 100% OK with
that.  If you want to review other pending patches, for 9.4 or 9.5, I
think that's great, too.  But if there's talk of committing this
patch, I think that seems both quite a bit too late (relative to the
timing of CF4) and quite a bit too early (relative to the amount of
review and testing done thus far).

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Peter Geoghegan (pg@heroku.com) wrote:
> Actually, contrary to the original subject of this thread, that isn't
> the case. I have not added a support function 3, which I ultimately
> concluded was a bad idea. This is all sort support.

Well, as apparently no one is objecting to Greg reviewing it, I'd
suggest he do that and actually articulate his feelings on the patch
post-review and exactly what it is changing and if he feels it needs
public comment, rather than all this speculation by folks who aren't
looking at the patch.

In other words, in hindsight, Greg was rather premature with his
suggestion that he might commit it; rather than suggesting that, he
should have just said he was going to review it and then come back with
a detailed email arguing the case for it to go in.

I don't particularly fault Greg for that, but perhaps some of this could
be avoided in the future.
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> I am a bit confused. To my eyes there's been a huge number of actually
> trivial patches in this commitfest? Even now, there's some:

> * Bugfix for timeout in LDAP connection parameter resolution
> * Problem with displaying "wide" tables in psql
> * Enable CREATE FOREIGN TABLE (... LIKE ... )
> * Add min, max, and stdev execute statement time in pg_stat_statement
> * variant of regclass etc.
> * vacuumdb: Add option --analyze-in-stages

> Are all small patches that don't need major changes before getting committed.

FWIW, I think the reason most of those aren't in is that there's not
consensus that it's a feature we want.  Triviality of the patch itself
doesn't make it easier to get past that.  (Indeed, when a feature patch
is actually trivial, that usually suggests to me that it's not been
thought through fully ...)

The LDAP one requires both LDAP and Windows expertise, which means the
pool of qualified committers for it is pretty durn small.  I think
Magnus promised to deal with it, though.

> I think it'd be a different discussion if this where CF-1 or so. But
> we're nearly *2* months after the *end* of the last CF.

Yeah.  At this point the default decision has to be to reject (or
more accurately, punt to 9.5).  I think we can still get a few of
these in and meet the mid-April target date, but many of them will
not make it.
        regards, tom lane



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Apr 7, 2014 at 11:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It's not a question of whether your test case is contrived.  Your test
> case can be (and likely is) extremely realistic and still not account
> for other cases when the patch regresses performance.  If I understand
> correctly, and I may not because I wasn't planning to spend time on
> this patch until the next CommitFest, the patch basically uses the
> bytes available in datum1 to cache what you refer to as a normalized
> or poor man's sort key which, it is hoped, will break most ties.
> However, on any workload where it fails to break ties, you're going to
> incur the additional overheads of (1) comparing the poor-man's sort
> key, (2) memcmping the strings (based on what you just said), and then
> (3) digging the correct datum back out of the tuple.  I note that
> somebody thought #3 was important enough to be worth creating datum1
> for in the first place, so I don't think it can or should be assumed
> that undoing that optimization in certain cases will turn out to be
> cheap enough not to matter.

The much earlier datum1 optimization is mostly compelling for
pass-by-value types, for reasons that prominently involve
cache/locality considerations. That's probably why this patch is so
compelling - it makes those advantages apply to pass-by-reference
types too.
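
The three-step flow quoted above can be sketched roughly as follows. The struct and function names are invented for illustration, ABBREV_LEN stands in for the Datum width, and strcmp() stands in for the locale-aware strcoll() tie-break so the sketch is locale-independent:

```c
#include <string.h>

/* Width of the cached "poor man's" sort key; a stand-in for the
 * platform Datum size. */
#define ABBREV_LEN 8

typedef struct
{
    char        abbrev[ABBREV_LEN]; /* prefix of the strxfrm() blob,
                                     * zero-padded */
    const char *full;               /* original string, for tie-breaks */
} SortKey;

/* (1) Compare the poor man's sort keys; a non-zero result is a trusted
 * proxy for the full comparison.  On a tie, (2)+(3) dig the full value
 * back out and do the authoritative comparison. */
int
abbrev_compare(const SortKey *a, const SortKey *b)
{
    int         cmp = memcmp(a->abbrev, b->abbrev, ABBREV_LEN);

    if (cmp != 0)
        return cmp;                 /* cheap, cache-friendly resolution */

    return strcmp(a->full, b->full);    /* tie: pay the full cost */
}
```

The regression risk Robert identifies lives entirely in the tie path: when most abbreviated keys collide, every comparison pays for the memcmp() *and* the full comparison *and* the pointer chase back into the tuple.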

> Testing the cases where your patch wins and hand-waving that the
> losses won't be that bad in other cases - without actually testing -
> is not the right methodology.

Okay. Here is a worst-case, with the pgbench script the same as my
original test-case, but with almost maximally unsympathetic data
to sort:

[local]/postgres=# update customers set firstname =
'padding-padding-padding-padding' || firstname;
UPDATE 20000

Master:

pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 323
latency average: 309.598 ms
tps = 3.227745 (including connections establishing)
tps = 3.227784 (excluding connections establishing)

Patch:

pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 100 s
number of transactions actually processed: 307
latency average: 325.733 ms
tps = 3.066256 (including connections establishing)
tps = 3.066313 (excluding connections establishing)

That seems like a pretty modest regression for a case where 100% of
what I've done goes to waste. If something that I'd done worked out
10% of the time, we'd still be well ahead.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Apr 7, 2014 at 4:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The much earlier datum1 optimization is mostly compelling for
> pass-by-value types, for reasons that prominently involve
> cache/locality considerations.

I agree.

> That's probably why this patch is so
> compelling - it makes those advantages apply to pass-by-reference
> types too.

Well, whether the patch is compelling is actually precisely what we
need to figure out.  I feel like you're asserting your hoped-for
conclusion prematurely.

>> Testing the cases where your patch wins and hand-waving that the
>> losses won't be that bad in other cases - without actually testing -
>> is not the right methodology.
>
> Okay. Here is a worst-case, with the pgbench script the same as my
> original test-case, but with almost maximally unsympathetic data
> to sort:
>
> [local]/postgres=# update customers set firstname =
> 'padding-padding-padding-padding' || firstname;
> UPDATE 20000
>
> Master:
>
> pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 323
> latency average: 309.598 ms
> tps = 3.227745 (including connections establishing)
> tps = 3.227784 (excluding connections establishing)
>
> Patch:
>
> pg@hamster:~/sort-tests$ pgbench -f sort.sql -n -T 100
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 307
> latency average: 325.733 ms
> tps = 3.066256 (including connections establishing)
> tps = 3.066313 (excluding connections establishing)
>
> That seems like a pretty modest regression for a case where 100% of
> what I've done goes to waste. If something that I'd done worked out
> 10% of the time, we'd still be well ahead.

Now that is definitely interesting, and it does seem to demonstrate
that the worst case for this patch might not be as bad as I had feared
- it's about a 5% regression: not great, but perhaps tolerable.  It's
not actually a worst case unless firstname is a fair ways into a tuple
with lots of variable-length columns before it, because part of what
the datum1 thing saves you is the cost of repeatedly walking through
the tuple's column list.

But I still think that a lot more could be done - and I'd challenge
you (or others) to do it - to look for cases where this might be a
pessimization.  I get that the patch has an upside, but nearly every
patch that anybody proposes does, and the author usually points out
those cases quite prominently, as you have, and rightly so.  But what
makes for a much more compelling submission is when the author *also*
tries really hard to break the patch, and hopefully fails.  I agree
that the test result shown above is good news for this patch's future
prospects, but I *don't* agree that it nails the door shut.  What
about other locales?  Other operating systems?  Other versions of
libc?  Longer strings?  Wider tuples?  Systems where datums are only
32-bits?  Sure, you can leave those things to the reviewer and/or
committer to worry about, but that's not the way to expedite the
patch's trip into the tree.

I have to admit, I didn't really view the original postings on this
topic to be anything more than, hey, we've got something promising
here, it's worth more study.  That's part of why I was so taken aback
by Greg's email.  There's certainly good potential here, but I think
there's quite a bit of work left to do before you can declare
victory...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Apr 7, 2014 at 3:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Now that is definitely interesting, and it does seem to demonstrate
> that the worst case for this patch might not be as bad as I had feared
> - it's about a 5% regression: not great, but perhaps tolerable.  It's
> not actually a worst case unless firstname is a fair ways into a tuple
> with lots of variable-length columns before it, because part of what
> the datum1 thing saves you is the cost of repeatedly walking through
> the tuple's column list.

I only use strxfrm() when the leading key is text. Otherwise, it's
pretty much just what your original sort support for text patch from
2012 does, and we don't bother with anything other than fmgr-elision.
The only thing that stops the above test case being perfectly pessimal
is that we don't always chance a memcmp(), due to some comparisons
realizing that len1 != len2. Also, exactly one comparison actually
benefits from that same "chance a memcmp()" trick in practice (there
is one duplicate). But it really is vanishingly close to perfectly
pessimal on my dev system, or that is at least my sincere belief.
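
That opportunistic shortcut can be sketched as follows (a hypothetical helper, not the actual varstr comparator). A memcmp() match on equal-length strings proves collation equality, while a mismatch proves nothing about collation order (the "Hungarian problem"), so only the equality case short-circuits:

```c
#include <stddef.h>
#include <string.h>

/* Bytewise-identical strings are equal under every collation, so an
 * equal-length memcmp() returning 0 lets us skip strcoll() entirely.
 * A non-zero memcmp() tells us nothing about collation order, so the
 * full locale-aware comparison must still run.  Inputs are assumed
 * NUL-terminated. */
int
tiebreak_compare(const char *s1, size_t len1,
                 const char *s2, size_t len2)
{
    if (len1 == len2 && memcmp(s1, s2, len1) == 0)
        return 0;               /* bytewise equal => collation equal */

    return strcoll(s1, s2);     /* authoritative, locale-aware order */
}
```

In the padded test case above, the len1 != len2 check rarely fires and almost every comparison falls through to strcoll(), which is what makes the data so unsympathetic.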

I think that we're going to find ourselves with more and more cases
where risking lots of compute bandwidth to save a little
memory bandwidth is, counter-intuitively, a useful optimization. My
opportunistic memcmp() is probably an example of this. I'm aware of
others. This is perhaps a contributing factor to how inexpensive it is
to waste all the effort that goes into strxfrm() and so on. Having
said that, this is an idea from the 1960s, which might explain
something about the name of the technique.

> But I still think that a lot more could be done - and I'd challenge
> you (or others) to do it - to look for cases where this might be a
> pessimization.  I get that the patch has an upside, but nearly every
> patch that anybody proposes does, and the author usually points out
> those cases quite prominently, as you have, and rightly so.  But what
> makes for a much more compelling submission is when the author *also*
> tries really hard to break the patch, and hopefully fails.  I agree
> that the test result shown above is good news for this patch's future
> prospects, but I *don't* agree that it nails the door shut.  What
> about other locales?  Other operating systems?  Other versions of
> libc?  Longer strings?  Wider tuples?  Systems where datums are only
> 32-bits?  Sure, you can leave those things to the reviewer and/or
> committer to worry about, but that's not the way to expedite the
> patch's trip into the tree.

I don't think that 32-bit systems are all that high a priority for
performance work. Cellphones will come out with 64-bit processors this
year. Even still, it's hard to reason about how much difference that
will make, except to say that the worst case cannot be that bad. As
for other locales, it only gets better for this patch, because I
believe en_US.UTF-8 is one of the cheapest to compare locales (which
is to say, requires fewest passes). If you come up with some test case
with complicated collation rules (I'm thinking of hu_HU, Hungarian),
it surely makes the patch look much better. Having said that, I still
don't do anything special with the C locale (just provide a
non-fmgr-accessed comparator), which should probably be fixed, on the
grounds that sorting using the C locale is probably now more expensive
than with the en_US.UTF-8 collation.

> I have to admit, I didn't really view the original postings on this
> topic to be anything more than, hey, we've got something promising
> here, it's worth more study.  That's part of why I was so taken aback
> by Greg's email.  There's certainly good potential here, but I think
> there's quite a bit of work left to do before you can declare
> victory...

I think that Greg's choice of words was a little imprudent, but must
be viewed in the context of an offline discussion during the hall
track of pgConf NYC. Clearly Greg wasn't about to go off and
unilaterally commit this. FWIW, I think I put him off the idea a few
hours after he made those remarks, without intending for what I'd said
to have that specific effect (or the opposite effect).

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Heikki Linnakangas
Date:
On 04/07/2014 11:35 PM, Peter Geoghegan wrote:
> Okay. Here is a worst-case, with the pgbench script the same as my
> original test-case, but with almost maximally unsympathetic data
> to sort:
>
> [local]/postgres=# update customers set firstname =
> 'padding-padding-padding-padding' || firstname;

Hmm. I would expect the worst case to be where the strxfrm is not 
helping because all the entries have the same prefix, but the actual key 
is as short and cheap-to-compare as possible. So the padding should be 
as short as possible. Also, we have a fast path for pre-sorted input, 
which reduces the number of comparisons performed; that will make the 
strxfrm overhead more significant.

I'm getting about 2x slowdown on this test case:

create table sorttest (t text);
insert into sorttest select 'foobarfo' || (g) || repeat('a', 75) from 
generate_series(10000, 30000) g;

explain analyze select * from sorttest order by t;

Now, you can argue that that's acceptable because it's such a special 
case, but if we're looking for the worst-case..

(BTW, IMHO it's way too late to do this for 9.4)

- Heikki



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Heikki Linnakangas
Date:
On 04/07/2014 09:19 PM, Peter Geoghegan wrote:
> The only case that this patch could possibly regress is where there
> are strings that differ beyond about the first 8 bytes, but are not
> identical (we chance a memcmp() == 0 before doing a full strcoll()
> when tie-breaking on the semi-reliable initial comparison). We still
> always avoid fmgr-overhead (and shim overhead, which I've measured),
> as in your original patch - you put that at adding 7% at the time,
> which is likely to make up for otherwise-regressed cases. There is
> nothing at all contrived about my test-case.

Did I understand correctly that this patch actually does two things:

1. Avoid fmgr and shim overhead
2. Use strxfrm to produce a pseudo-leading key that's cheaper to compare.

In that case, these changes need to be analyzed separately. You don't 
get to "make up" for the losses by the second part by the gains from the 
first part. We could commit just the first part (for 9.5!), and that has 
to be the baseline for the second part.

This is very promising stuff, but it's not a slam-dunk win. I'd suggest 
adding these to the next commitfest, as two separate patches.

- Heikki



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Greg Stark
Date:
On Mon, Apr 7, 2014 at 7:32 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I think that Greg's choice of words was a little imprudent, but must
> be viewed in the context of an offline discussion during the hall
> track of pgConf NYC. Clearly Greg wasn't about to go off and
> unilaterally commit this. FWIW, I think I put him off the idea a few
> hours after he made those remarks, without intending for what I'd said
> to have that specific effect (or the opposite effect).

It was somewhere between your two interpretations. I intend to review
everything I can from the commitfest and then this patch and if this
patch is ready for commit before feature freeze I was saying I would
go ahead and commit it. That would only happen if there was a pretty
solid consensus that my review was good and the patch was good, of
course.

The point of the commit fest is to ensure that all patches get
attention. That's why I would only look at this after I've reviewed
anything else from the commitfest that I feel up to reviewing. But if
I have indeed done so there's no point in not taking other patches as
well up to feature freeze. I don't have any intention of lowering our
review standards of course.

So let's table this discussion until the hypothetical case of me doing
lots of reviews *and* reviewing this patch *and* that review being
positive and decisive enough to commit after one review cycle.

-- 
greg



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Apr 8, 2014 at 3:12 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> 1. Avoid fmgr and shim overhead
> 2. Use strxfrm to produce a pseudo-leading key that's cheaper to compare.
>
> In that case, these changes need to be analyzed separately. You don't get to
> "make up" for the losses by the second part by the gains from the first
> part. We could commit just the first part (for 9.5!), and that has to be the
> baseline for the second part.

Yes, that's right. Robert already submitted a patch that only did 1)
almost 2 years ago. That should have been committed at the time, but
wasn't. At the time, the improvement was put at about 7% by Robert. It
would be odd to submit the same patch that Robert withdrew already.

Why shouldn't 2) be credited with the same benefits as 1) ? It's not
as if the fact that the strxfrm() trick uses SortSupport is a
contrivance. I cannot reasonably submit the two separately, unless the
second is a cumulative patch. By far the largest improvements come
from 2), while 1) doesn't regress anything.


-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Heikki Linnakangas
Date:
On 04/08/2014 08:02 PM, Peter Geoghegan wrote:
> On Tue, Apr 8, 2014 at 3:12 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> 1. Avoid fmgr and shim overhead
>> 2. Use strxfrm to produce a pseudo-leading key that's cheaper to compare.
>>
>> In that case, these changes need to be analyzed separately. You don't get to
>> "make up" for the losses by the second part by the gains from the first
>> part. We could commit just the first part (for 9.5!), and that has to be the
>> baseline for the second part.
>
> Yes, that's right. Robert already submitted a patch that only did 1)
> almost 2 years ago. That should have been committed at the time, but
> wasn't. At the time, the improvement was put at about 7% by Robert. It
> would be odd to submit the same patch that Robert withdrew already.
>
> Why shouldn't 2) be credited with the same benefits as 1) ? It's not
> as if the fact that the strxfrm() trick uses SortSupport is a
> contrivance. I cannot reasonably submit the two separately, unless the
> second is a cumulative patch.

Sure, submit the second as a cumulative patch.

> By far the largest improvements come
> from 2), while 1) doesn't regress anything.

Right. But 1) is the baseline we need to evaluate 2) against.

- Heikki



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Apr 8, 2014 at 10:10 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Right. But 1) is the baseline we need to evaluate 2) against.

I don't agree with that. Surely we're concerned with not regressing
cases that people actually care about, which in practical terms means
the changes of a single release. While I guess I'm fine with
structuring the patch like that, I don't think it's fair that the
strxfrm() stuff doesn't get credit for not regressing those cases so
badly just because they're only ameliorated by the fmgr-eliding stuff.
While I'm concerned about worst case performance myself, I don't want
to worry about Machiavelli rather than Murphy. What collation did you
use for your test-case?

The fmgr-eliding stuff is only really valuable in that it ameliorates
the not-so-bad regressions, and is integral to the strxfrm() stuff.
Let's not lose sight of the fact that (if we take TPC style benchmarks
as representative) the majority of text sorts can be made at least 3
times faster.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Apr 8, 2014 at 3:10 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Apr 8, 2014 at 10:10 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Right. But 1) is the baseline we need to evaluate 2) against.
>
> I don't agree with that. Surely we're concerned with not regressing
> cases that people actually care about, which in practical terms means
> the changes of a single release.

No, we're concerned about ending up with the best possible
performance.  That could mean applying the fmgr-elision but not the
other part.  Whether the other part is beneficial is based on how it
compares to the performance post-fmgr-elision.

As an oversimplified example, suppose someone were to propose two
patches, one that makes PostgreSQL ten times as fast and the other of
which slows it down by a factor of five.  If someone merges those two
things into a single combined patch, we would surely be foolish to
apply the whole thing.  The right thing would be to separate them out
and apply only the first one.  Every change has to stand on its own
two feet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Apr 8, 2014 at 12:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> No, we're concerned about ending up with the best possible
> performance.  That could mean applying the fmgr-elision but not the
> other part.  Whether the other part is beneficial is based on how it
> compares to the performance post-fmgr-elision.

I agree with everything you say here, but I'm puzzled only because
it's overwhelmingly obvious that the strxfrm() stuff is where the
value is. You can dispute whether or not I should have made various
tweaks, and you probably should, but the basic value of that idea is
very much in evidence already. You yourself put the improvements of
fmgr-elision alone at ~7% back in 2012 [1]. At the time, Noah said
that he didn't think it was worth bothering with that patch for what
he considered to be a small gain, a view which I did not share at the
time.

What I have here looks like it speeds things up a little over 200% (so
a little over 300% of the original throughput) with a single client
for many representative cases. That's a massive difference, to the
point that I don't see a lot of sense in considering fmgr-elision
alone separately.

[1] http://www.postgresql.org/message-id/CA+Tgmoa8by24gd+YbuPX=5gSGmN0w5sGiPzWwq7_8iS26vL5CQ@mail.gmail.com
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Alvaro Herrera
Date:
Peter Geoghegan wrote:

> What I have here looks like it speeds things up a little over 200% (so
> a little over 300% of the original throughput) with a single client
> for many representative cases. That's a massive difference, to the
> point that I don't see a lot of sense in considering fmgr-elision
> alone separately.

I think the point here is that what matters is that the gain from the
strxfrm part of the patch is large, regardless of what the baseline is
(right?).  If there's a small loss in an uncommon worst case, that's
probably acceptable, as long as the worst case is uncommon and the loss
is small.  But if the loss is large, or the case is not uncommon, then a
fix for the regression is going to be a necessity.

You seem to be assuming that a fix for whatever regression is found is
going to be impossible to find.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Apr 8, 2014 at 2:48 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think the point here is what matters is that that gain from the
> strxfrm part of the patch is large, regardless of what the baseline is
> (right?).  If there's a small loss in an uncommon worst case, that's
> probably acceptable, as long as the worst case is uncommon and the loss
> is small.  But if the loss is large, or the case is not uncommon, then a
> fix for the regression is going to be a necessity.

That all seems reasonable. I just don't understand why you'd want to
break out the fmgr-elision part, given that that was already discussed
at length two years ago.

> You seem to be assuming that a fix for whatever regression is found is
> going to be impossible to find.

I think that a fix that is actually worth it on balance will be
elusive. Heikki's worst case is extremely narrow. I think he'd
acknowledge that himself. I've already fixed some plausible
regressions. For example, the opportunistic early "len1 == len2 &&
memcmp() == 0?" test covers the common case where two leading keys are
equal. I think we're very much into chasing diminishing returns past
this point. I think that my figure of a 5% regression is much more
realistic, even though it is itself quite unlikely.

I think that the greater point is that we don't want to take worrying
about worst case performance to extremes. Calling Heikki's 50%
regression the worst case is almost unfair, since it involves very
carefully crafted input. You could probably also carefully craft input
that made our quicksort implementation itself go quadratic, a behavior
not necessarily exhibited by an inferior implementation for the same
input. Yes, let's consider a pathological worst case, but let's put it
in the context of being one end of a spectrum of behaviors, on the
extreme fringes. In reality, only a tiny number of individual sort
operations will experience any kind of regression at all. In simple
terms, I'd be very surprised if anyone complained about a regression
at all. If anyone does, it's almost certainly not going to be a 50%
regression. There is a reason why many other systems have
representative workloads that they target (i.e. a variety of TPC
benchmarks). I think that a tyranny of the majority is a bad thing
myself, but I'm concerned that we sometimes take that too far.

I wonder, why did Heikki not add more padding to the end of the
strings in his example, in order to give strxfrm() more wasted work?
Didn't he want to make his worst case even worse? Or was it to control
for TOASTing noise?

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Apr 3, 2014 at 12:19 PM, Thom Brown <thom@linux.com> wrote:
> Looking good:
>
> -T 100 -n -f sort.sql
>
> Master: 21.670467 / 21.718653 (avg: 21.69456)
> Patch: 66.888756 / 66.888756 (avg: 66.888756)

These were almost exactly the same figures as I saw on my machine.
However, when compiling with certain additional flags -- with
CFLAGS="-O3 -march=native" -- I was able to squeeze more out of this.
My machine has a recent Intel CPU, "Intel(R) Core(TM) i7-3520M". With
these build settings the benchmark then averages about 75.5 tps across
multiple runs, which I'd call a fair additional improvement. I tried
this because I was specifically interested in the results of a memcmp
implementation that uses SIMD. I believe that these flags make
gcc/glibc use a memcmp implementation that takes advantage of SSE
where supported (and various subsequent extensions). Although I didn't
go to the trouble of verifying all this by going through the
disassembly, or instrumenting the code in any way, that is my best
guess as to what actually helped. I don't know how any of that might
be applied to improve matters in the real world, which is why I
haven't dived into this further, but it's worth being aware of.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Apr 8, 2014 at 2:57 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Hmm. I would expect the worst case to be where the strxfrm is not helping
> because all the entries have the same prefix, but the actual key is as short
> and cheap-to-compare as possible. So the padding should be as short as
> possible. Also, we have a fast path for pre-sorted input, which reduces the
> number of comparisons performed; that will make the strxfrm overhead more
> significant.
>
> I'm getting about 2x slowdown on this test case:
>
> create table sorttest (t text);
> insert into sorttest select 'foobarfo' || (g) || repeat('a', 75) from
> generate_series(10000, 30000) g;

One thing that isn't all that obvious about this worst case is that
it's in general very qsort() friendly, and therefore the startup costs
(copying) totally dominate. Actually, you're not even sorting -
you're verifying that the tuples are already exactly in order (a
questionable optimization we apply at every level). Consider:

postgres=# select correlation from pg_stats where tablename =
'sorttest' and attname = 't';
 correlation
-------------
           1
(1 row)

So you only have to do n - 1 comparisons, never n log n. This is a
very skewed test. There is no reason to think that strcoll() is better
than strxfrm() + strcmp() if the first difference is earlier. That's
the main reason why what I called a worse case is much more
representative than this, even if we assume that it's very common for
people to sort strings that don't differ in the first 8 bytes and yet
are not identical.
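That n - 1 versus n log n distinction is easy to model directly. The sketch below is a simplified stand-in for our qsort()'s pre-sorted check, not the actual implementation:

```python
import functools

def sort_with_presorted_check(items):
    """Model of the qsort() pre-sorted check: one linear scan costing
    n - 1 comparisons, falling back to a real sort only when some
    adjacent pair is out of order."""
    comparisons = 0

    def cmp(a, b):
        nonlocal comparisons
        comparisons += 1
        return (a > b) - (a < b)

    for i in range(len(items) - 1):
        if cmp(items[i], items[i + 1]) > 0:
            return sorted(items, key=functools.cmp_to_key(cmp)), comparisons
    # Already in order: exactly n - 1 comparisons performed, never n log n.
    return list(items), comparisons

# Heikki's generate_series() data is already in lexicographic order
# (the numbers are all five digits wide), so the scan never falls through.
data = ['foobarfo' + str(g) + 'a' * 75 for g in range(10000, 10100)]
_, n_cmp = sort_with_presorted_check(data)
```

With 100 presorted rows the scan performs exactly 99 comparisons, so the up-front strxfrm() work is never amortized over n log n comparisons the way it normally would be.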

I took a look at fixing your worst case all the same, despite becoming
discouraged during discussion at pgCon. I thought about it for a
while, and realized that the problems are probably solvable. Even
still, I would like to emphasize that this is a family of techniques
for B-Tree operator classes. Sorting heap tuples is not the only use
case, and it's probably not even the most interesting one, yet all the
criticisms are largely confined to sorting alone. This is because
wasted cycles on a poor man's comparison are not too problematic if
the only additional work is a few CPU instructions for comparison (and
not the entire transformation, which happens when that cost can be
amortized over a huge length of time). I'm thinking of inner pages in
B-Trees here. I've looked at heap tuple sorting first because the
sortsupport infrastructure happens to be available, making it the best
proving ground. That's the only reason.

Anyway, major steps against worst case performance in the attached
revision include:

* A configure AC_TRY_RUN tests the suitability of the system strxfrm()
implementation for the purposes of this optimization. There can be no
"header bytes" (people at pgCon reported redundant bytes on the Mac
OSX implementation at pgCon, to my great surprise), and with ASCII
code points you get the full benefit of being able to store the
leading 8 code points in the poorman's key (we also insist that
sizeof(Datum) == 8 to apply the poor man's optimization). It might be
that this restricts the optimization to 64-bit Linux entirely, where
glibc is almost universally used. If that's true, I consider it
acceptable, given that the majority of Postgres systems use this
family of platforms in practice. Even still, the fact that not every
implementation meets my standard came as a big surprise to me,
and so I hope that the problem is limited to Mac OSX. I'm slightly
concerned that all BSD systems are affected by this issue, but even if
that was true not covering those cases would be acceptable in my view.
*Maybe* we could come up with a scheme for stripping those header
bytes if someone feels very strongly about it, and we judge it to be
worth it to get our hands dirty in that way (we could even dynamically
verify the header bytes always matched, and bail if our assumption was
somehow undermined at little cost on those platforms).
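A userland sketch of what such a probe might check follows; the real test would be a C program run by configure, and Python's locale.strxfrm (which wraps the wide-character variant) is only a stand-in for the C function here:

```python
import locale

def strxfrm_has_header_bytes():
    """Crude analogue of the AC_TRY_RUN probe: transform two strings that
    differ in their first character, and see whether the transformed blobs
    nonetheless agree on their first unit.  If they do, the implementation
    prepends "header bytes" and a truncated blob wastes entropy (the Mac
    OS X behavior reported at pgCon)."""
    try:
        locale.setlocale(locale.LC_COLLATE, 'en_US.UTF-8')
    except locale.Error:
        # Fall back to the C locale, where strxfrm() is a plain copy and
        # therefore trivially header-free.
        locale.setlocale(locale.LC_COLLATE, 'C')
    return locale.strxfrm('a')[0] == locale.strxfrm('b')[0]
```

On glibc the first transformed unit already differs, so the probe passes; on an implementation that emits a fixed prefix, it would fail and the optimization would be disabled.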

* The dominant additional cost in Heikki's worst case is wasted
strxfrm() transformations (or maybe just the memcpy() overhead
required to NULL terminate text for strxfrm() processing). When
copying over heap tuples into the memtuples array, we instrument the
costs and weigh them against the likely eventual benefits using
heuristic rules. We may choose to abort the transformation process
early if it doesn't look promising. This process includes maintaining
the mean string length so far (which is taken as a proxy for
transformation costs), as well as maintaining the approximate
cardinality of the set of already generated poor man's keys using the
HyperLogLog algorithm [1].
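To make the shape of those ameliorating measures concrete, here is a simplified model of the abort logic; the register count, the hash function, and the 5%/1000-tuple thresholds are all illustrative assumptions, not the values in the patch:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog with 2^b registers, used to cheaply estimate the
    number of distinct poor man's keys (8-byte prefixes) seen so far."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, key: bytes):
        h = int.from_bytes(hashlib.sha1(key).digest()[:8], 'big')
        idx = h >> (64 - self.b)                      # top b bits pick a register
        rest = h & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1  # leading-zero run + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m and 0 in self.registers:
            # Small-range correction from the HyperLogLog paper.
            raw = self.m * math.log(self.m / self.registers.count(0))
        return raw

def should_abort_transformation(hll, tuples_copied,
                                min_sample=1000, threshold=0.05):
    """Abort strxfrm() processing early when the observed key cardinality is
    a tiny fraction of the tuples copied so far (thresholds invented here
    purely for illustration)."""
    if tuples_copied < min_sample:
        return False
    return hll.estimate() / tuples_copied < threshold
```

The point is that the estimator's memory footprint is fixed (here 1024 single-byte registers) no matter how many tuples are copied, so consulting it costs almost nothing relative to the copying itself.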

This almost completely fixes the worst case performance Heikki
complained of. With smaller strings you might get my implementation to
show worse regressions than what you'll see for your worst case by
placing careful traps for the heuristics that now appear within
bttext_abort(), but I suspect not by terribly much. These are still
fairly implausible cases.

The cost of adding all of these ameliorating measures appears to be
very low. We're so bottlenecked on memory bandwidth that the
fixed-size overhead of maintaining poor man's key cardinality, and the extra
cycles from hashing n keys for the benefit of HyperLogLog just don't
seem to matter next to the big savings for n log n comparisons. The
best case performance is seemingly just as good as before (about a
200% improvement in transaction throughput for one client, but closer
to a 250% improvement with many clients and/or expensive collations),
while the worst case is much much better. As the HyperLogLog paper [2]
says:

"In practice, the HYPERLOGLOG program is quite efficient: like the
standard LOGLOG of [10] or MINCOUNT of [16], its running time is
dominated by the computation of the hash function, so that it is only
three to four times slower than a plain scan of the data (e.g., by the
Unix command “wc -l”, which merely counts end-of-lines)."

(Since we're doing all that copying, and everything else anyway, and
only hashing the first 8 bytes, the ratio for us is more compelling
still, before we even begin the sort).

I'm not entirely sure that passing down the outer plan's estimated row
count in the case of ExecSort(), to give the bttext_abort() heuristics
a "sense of proportion" about costs, is the right thing. It's a
difficult problem to judge, so I've
left that in despite some misgivings. In any case, I think that this
revision isn't too far from the best of both worlds. I do have some
concerns that I don't have things well balanced within bttext_abort(),
but it's a start. It's hard to exactly characterize the break even
point here, particularly given that we only know about cardinality and
string length, and not distribution. Let's not lose sight of the fact
that in the real world the large majority of sorts will be on the
right side of that break even point, though. This is all about
ameliorating infrequent worst cases.

Because of the new strxfrm()-quality AC_TRY_RUN configure test, you'll
need to "autoreconf" and so on to get the patch working. I didn't
include changes to the "configure" file generated by these changes in
the patch itself.

I use "en_US.UTF8" for the purposes of all benchmark figures here. I
attach some interesting numbers for a variety of cases
(poorman-performance.txt). Worst case performance is considered.

I've been using a table of 317,102 cities of the world to look at
average and best case performance. Overview:

postgres=# select count(*), country from cities group by country order
by 1 desc limit 15;
 count |         country
-------+--------------------------
 66408 | India
 35845 | France
 29926 | United States of America
 13154 | Turkey
 11210 | Mexico
 11028 | Germany
  9352 | Tanzania
  8118 | Spain
  8109 | Italy
  7680 | Burkina Faso
  6066 | Czech Republic
  6043 | Iran
  5777 | Slovenia
  5584 | Brazil
  5115 | Philippines
(15 rows)

These city names are all Latin characters with accents and other
diacritics. For example:

postgres=# select * from cities where country = 'Sweden' limit 5;
 country | province |     city      |  one   |  two   | three  |  four
---------+----------+---------------+--------+--------+--------+--------
 Sweden  | Blekinge | Backaryd      | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Bräkne-Hoby   | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Brömsebro     | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Drottningskär | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Eringsboda    | [null] | [null] | [null] | [null]
(5 rows)

You can download a plain format dump of this database here:
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump

I also attach some DEBUG1 output that shows the accuracy of
HyperLogLog estimation (of poor man's normalized key cardinality) for
the cases tested (poorman-debug-hll-cardinality.txt). It's very
accurate given the low, fixed memory overhead chosen.

[1] https://en.wikipedia.org/wiki/HyperLogLog

[2] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Jun 5, 2014 at 5:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Even still, the fact that every
> implementation doesn't meet my standard came as a big surprise to me,
> and so I hope that the problem is limited to Mac OSX. I'm slightly
> concerned that all BSD systems are affected by this issue

I tried out my test program with FreeBSD 9.2-RC2. I linked my program
to the system libc. I explicitly set the collation to "en_US.UTF-8". I
can't see any header bytes, and that implementation meets the standard
my configure test looks for generally, so I guess this was only ever
an issue peculiar to Mac OS X (or the collations it ships with?).

I probably should have mentioned before that Windows is still broken
(I don't plan on optimizing Windows at all due to complexity around
the UTF-8 hacks on that platform, but right now the sortsupport
routine just returns NULL, which is not acceptable).

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Jun 5, 2014 at 5:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> One thing that isn't all that obvious about this worst case is that
> it's in general very qsort() friendly, and therefore the startup costs
> (copying) totally dominates. Actually, you're not even sorting -
> you're verifying that the tuples are already exactly in order (a
> questionable optimization we apply at every level).

Kevin mentioned something about the Wisconsin courts having columns
that all began with "The State of Wisconsin Vs." in the dev meeting in
Ottawa. I thought that this was an interesting case, because it is
representative of reality, which is crucially important to consider
here. I decided to simulate it. In my original test database:

postgres=# create table wisconsin(casen text);
CREATE TABLE
postgres=# insert into wisconsin select 'The State of Wisconsin Vs. '
|| city from cities;
INSERT 0 317102

sort-wisconsin.sql: select * from (select * from wisconsin order by
casen offset 1000000) d;

Master:

pgbench -M prepared -f sort-wisconsin.sql -T 300 -n
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 55
latency average: 5454.545 ms
tps = 0.181191 (including connections establishing)
tps = 0.181191 (excluding connections establishing)

Patch (most recent revision, with ameliorations, HyperLogLog, etc):

pgbench -M prepared -f sort-wisconsin.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 55
latency average: 5454.545 ms
tps = 0.182593 (including connections establishing)
tps = 0.182594 (excluding connections establishing)

Earlier patch (no ameliorations for Heikki's case):

pgbench -M prepared -f sort-wisconsin.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 54
latency average: 5555.556 ms
tps = 0.176914 (including connections establishing)
tps = 0.176915 (excluding connections establishing)

With my most recent revision, the ameliorating measures are effective
enough that with the sortsupport shim and fmgr trampoline avoided, we
still come out ahead even for this case. Great. But you may be
surprised that the regression is so small in the case of the patch
without any ameliorating measures (the original patch). That's because
the data isn't *perfectly* logically/physically correlated here, as in
Heikki's worst case. So, the 317,102 wasted strxfrm() calls are
relatively inexpensive. Consider how cost_sort() models the cost of a
sort when an in memory quicksort is anticipated:

/* We'll use plain quicksort on all the input tuples */
startup_cost += comparison_cost * tuples * LOG2(tuples);

In the case of this quicksort, the planner guesses there'll be "317102
* LOG2(317102)" comparisons -- about 5,794,908 comparisons, which
implies over 10 times as many strcoll() calls as wasted strxfrm()
calls. The cost of those strxfrm() calls begins to look insignificant
before n gets too big (at n = 100, it's 100 wasted strxfrm() calls to
about 664 strcoll() calls). Unless, of course, you have a "bubble sort
best case" where everything is already completely in order, in which
case there'll be a 1:1 ratio between wasted strxfrm() calls and
strcoll() calls. This optimization was something that we added to our
qsort(). It doesn't appear in the original NetBSD implementation, and
it doesn't appear in the Bentley/McIlroy paper, and it doesn't appear
anywhere else that I'm aware of. I'm not the only person to regard it
with suspicion - Tom has in the past expressed doubts about that too
[1]. Also, note that no sorting algorithm can do better than O(n log
n) in the average case - that's the information-theoretical lower
bound on the average-case speed of any comparison-based sorting
algorithm.
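The planner's arithmetic above is easy to verify directly; this just restates the quoted cost model, with nothing patch-specific assumed:

```python
import math

def expected_comparisons(n):
    """cost_sort()'s in-memory quicksort model: n * LOG2(n) comparisons."""
    return n * math.log2(n)

# One wasted strxfrm() per input tuple versus n * log2(n) strcoll() calls,
# so the waste shrinks relative to the sort as n grows (roughly 1:18 here).
ratio = 317102 / expected_comparisons(317102)
```

The ratio only collapses to 1:1 in the "bubble sort best case", where the pre-sorted check means each tuple costs one wasted strxfrm() and roughly one strcoll().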

To be clear: I'm certainly not saying that we shouldn't fix Heikki's
worst case, and indeed I believe I have, but we should also put this
worst case in perspective.

By the way, I just realized that I failed to fully remove client
overhead (I should have put an extra 0 on the end of my offset for the
city.sql query), which added noise to the "city"/"country"/"province"
tests. Revised figures are as follows (these are better than before):

Master:
======

pgbench -M prepared -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 278
latency average: 1079.137 ms
tps = 0.924358 (including connections establishing)
tps = 0.924362 (excluding connections establishing)

Patch:
=====

pgbench -M prepared -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 1046
latency average: 286.807 ms
tps = 3.486089 (including connections establishing)
tps = 3.486104 (excluding connections establishing)

Master:
======

pgbench -M prepared -j 4 -c 4 -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 387
latency average: 3100.775 ms
tps = 1.278062 (including connections establishing)
tps = 1.278076 (excluding connections establishing)

Patch:
=====

pgbench -M prepared -j 4 -c 4 -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 1670
latency average: 718.563 ms
tps = 5.557700 (including connections establishing)
tps = 5.557754 (excluding connections establishing)


BTW, if you want to measure any of this independently, I suggest
making sure that power management settings don't ruin things. I advise
setting CPU governor to "performance" for each core on Linux, for
example.

[1] http://www.postgresql.org/message-id/18033.1361789032@sss.pgh.pa.us
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Jun 5, 2014 at 8:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * A configure AC_TRY_RUN tests the suitability of the system strxfrm()
> implementation for the purposes of this optimization. There can be no
> "header bytes" (people at pgCon reported redundant bytes on the Mac
> OSX implementation at pgCon, to my great surprise), ...

I'm attaching a small test program demonstrating the Mac behavior.
Here's some sample output:

[rhaas ~]$ ./strxfrm en_US "" a aa ab abc abcd abcde peter Geoghegan _ %
"" -> ""
"a" -> "001S0000001S"
"aa" -> "001S001S0000001S001S"
"ab" -> "001S001T0000001S001T"
"abc" -> "001S001T001U0000001S001T001U"
"abcd" -> "001S001T001U001V0000001S001T001U001V"
"abcde" -> "001S001T001U001V001W0000001S001T001U001V001W"
"peter" -> "001b001W001f001W001d0000001b001W001f001W001d"
"Geoghegan" -> "0019001W001a001Y001Z001W001Y001S001`00000019001W001a001Y001Z001W001Y001S001`"
"_" -> "001Q0000001Q"
"%" -> "000W0000000W"

It appears that any string starting with the letter "a" will create
output that begins with 001S00 and the seventh character always
appears to be 0 or 1:

[rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a "
"ab" -> "001S001T0000001S001T"
"ac" -> "001S001U0000001S001U"
"ad" -> "001S001V0000001S001V"
"ae" -> "001S001W0000001S001W"
"af" -> "001S001X0000001S001X"
"a%" -> "001S000W0000001S000W"
"a0" -> "001S000b0000001S000b"
"a " -> "001S000R0000001S000R"

Also, the total number of bytes produced is apparently 8N+4, where N
is the length of the input string.  On a Linux system I tested, the
output included non-printable characters, so I adapted the test
program to print the results in hex.  Attaching that version, too.
Here, the length was 3N+2, except for 0-length strings which produce a
0-length result:

[rhaas@hydra ~]$ ./strxfrm-binary en_US "" a aa ab abc abcd abcde
peter Geoghegan _ %
"" ->  (0 bytes)
"a" -> 0c01020102 (5 bytes)
"aa" -> 0c0c010202010202 (8 bytes)
"ab" -> 0c0d010202010202 (8 bytes)
"abc" -> 0c0d0e0102020201020202 (11 bytes)
"abcd" -> 0c0d0e0f01020202020102020202 (14 bytes)
"abcde" -> 0c0d0e0f10010202020202010202020202 (17 bytes)
"peter" -> 1b101f101d010202020202010202020202 (17 bytes)
"Geoghegan" -> 12101a121310120c190102020202020202020201040202020202020202
(29 bytes)
"_" -> 0101010112 (5 bytes)
"%" -> 0101010139 (5 bytes)
[rhaas@hydra ~]$ ./strxfrm-binary en_US ab ac ad ae af a% a0 "a "
"ab" -> 0c0d010202010202 (8 bytes)
"ac" -> 0c0e010202010202 (8 bytes)
"ad" -> 0c0f010202010202 (8 bytes)
"ae" -> 0c10010202010202 (8 bytes)
"af" -> 0c11010202010202 (8 bytes)
"a%" -> 0c01020102010239 (8 bytes)
"a0" -> 0c02010202010202 (8 bytes)
"a " -> 0c01020102010211 (8 bytes)

Even though each input byte is generating 3 output bytes, it's not
generating a group of output bytes for each input byte as appears to
be happening on MacOS X.  If it were doing that, then truncating the
blob to 8 bytes would only compare the first 2-3 bytes of the input
string.  In fact we do better.  If we change even the 8th letter in
the string to some other letter, the 8th output byte changes:

[rhaas@hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaab
"aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
"aaaaaaab" -> 0c0c0c0c0c0c0c0d010202020202020202010202020202020202 (26 bytes)

If we change the capitalization of the eighth byte, then the change
happens further out:

[rhaas@hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA
"aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
"aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes)

Still, it's fair to say that on this Linux system, the first 8 bytes
capture a significant portion of the entropy of the first 8 bytes of
the string, whereas on MacOS X you only get entropy from the first 2
bytes of the string.  It would be interesting to see results from
other platforms people might care about also.
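For anyone who wants to inspect their own platform's blobs without compiling a test program, the raw bytes can also be pulled out via ctypes; this is a rough sketch that assumes a libc that ctypes can locate and inputs without embedded NULs:

```python
import ctypes
import ctypes.util
import locale

_libc = ctypes.CDLL(ctypes.util.find_library('c'))
_libc.strxfrm.restype = ctypes.c_size_t
_libc.strxfrm.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]

def strxfrm_bytes(s: str) -> bytes:
    """Return the raw strxfrm() blob for s under the current LC_COLLATE,
    much like the strxfrm-binary test program above."""
    src = s.encode()
    needed = _libc.strxfrm(None, src, 0)      # first call: ask for the size
    buf = ctypes.create_string_buffer(needed + 1)
    _libc.strxfrm(ctypes.cast(buf, ctypes.c_char_p), src, needed + 1)
    return buf.raw[:needed]

try:
    locale.setlocale(locale.LC_COLLATE, 'en_US.UTF-8')
except locale.Error:
    locale.setlocale(locale.LC_COLLATE, 'C')  # C locale: blob mirrors input
for word in ('ab', 'ac', 'aaaaaaaa', 'aaaaaaab'):
    blob = strxfrm_bytes(word)
    print('"%s" -> %s (%d bytes)' % (word, blob.hex(), len(blob)))
```

Since locale.setlocale() configures the same process-wide libc state that strxfrm() reads, the printed blobs match what a standalone C program would show for the same locale.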

> The cost of adding all of these ameliorating measures appears to be
> very low. We're so bottlenecked on memory bandwidth that the
> fixed-size overhead of maintaining poor man cardinality, and the extra
> cycles from hashing n keys for the benefit of HyperLogLog just don't
> seem to matter next to the big savings for n log n comparisons. The
> best case performance is seemingly just as good as before (about a
> 200% improvement in transaction throughput for one client, but closer
> to a 250% improvement with many clients and/or expensive collations),
> while the worst case is much much better.

I haven't looked at the patch yet, but this sounds promising.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Claudio Freire
Date:
On Thu, Jun 12, 2014 at 1:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> It appears that any string starting with the letter "a" will create
> output that begins with 001S00 and the seventh character always
> appears to be 0 or 1:
>
> [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a "
> "ab" -> "001S001T0000001S001T"
> "ac" -> "001S001U0000001S001U"
> "ad" -> "001S001V0000001S001V"
> "ae" -> "001S001W0000001S001W"
> "af" -> "001S001X0000001S001X"
> "a%" -> "001S000W0000001S000W"
> "a0" -> "001S000b0000001S000b"
> "a " -> "001S000R0000001S000R"

...

> [rhaas@hydra ~]$ ./strxfrm-binary en_US aaaaaaaa aaaaaaaA
> "aaaaaaaa" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
> "aaaaaaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes)
>
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string.  It would be interesting to see results from
> other platforms people might care about also.

If you knew mac's output character set with some certainty, you could
compress it rather efficiently by applying a tabulated decode back
into non-printable bytes. Say, like base64 decoding (the set appears
to be a subset of base64, but it's hard to be sure).


Ie,
x = strxfrm(s)
xz = hex(tab[x[0]] + 64*tab[x[1]] + 64^2*tab[x[2]] ... etc)

This can be made rather efficient. But the first step is defining with
some certainty the output character set.
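In outline, that decode-and-pack step might look like the sketch below. The alphabet is an assumed base64-style stand-in, since, as noted, the Mac implementation's actual output character set would first have to be established empirically; big-endian packing preserves order only among equal-length blobs:

```python
import string

# Assumed output alphabet: a base64-like set.  The real set would have to
# be determined empirically before any of this could be trusted.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + '+/'
TAB = {ch: i for i, ch in enumerate(sorted(ALPHABET))}  # order-preserving ranks

def pack_blob(blob: str, radix: int = 64) -> bytes:
    """Re-encode a printable strxfrm() blob as big-endian base-<radix>
    digits: 6 bits per character instead of 8.  Packing is order-preserving
    only for equal-length blobs; real use would need fixed-width keys."""
    n = 0
    for ch in blob:
        n = n * radix + TAB[ch]
    nbytes = -(-(len(blob) * (radix - 1).bit_length()) // 8)  # ceil division
    return n.to_bytes(max(nbytes, 1), 'big')
```

For the 20-character Mac blobs shown above, this recovers 2 bits per byte, packing 20 printable characters into 15 raw bytes.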



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
Thanks for looking into this.

On Thu, Jun 12, 2014 at 9:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string.  It would be interesting to see results from
> other platforms people might care about also.

Right. It was a little bit incautious of me to say that we had the
full benefit of 8 bytes of storage with "en_US.UTF-8", since that is
only true of lower case characters (I think that FreeBSD can play
tricks with this. Sometimes, it will give you the benefit of 8 bytes
of entropy for an 8 byte string, with only non-differentiating
trailing bytes, so that the first 8 bytes of "Aaaaaaaa" are distinct
from the first eight bytes of "aaaaaaaa", while any trailing bytes are
non-distinct for both). In any case it's pretty clear that a goal of
the glibc implementation is to concentrate as much entropy as possible
into the first part of the string, and that's the important point.
This makes perfect sense, and is why I was so incredulous about the
Mac behavior. After all, the Open Group's strcoll() documentation
says:

"The strxfrm() and strcmp() functions should be used for sorting large lists."

Sorting text is hardly an infrequent requirement -- it's almost the
entire reason for having strxfrm() in the standard. You're always
going to want to have each strcmp() find differences as early as
possible.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> ... In any case it's pretty clear that a goal of
> the glibc implementation is to concentrate as much entropy as possible
> into the first part of the string, and that's the important point.
> This makes perfect sense, and is why I was so incredulous about the
> Mac behavior.

I think this may be another facet of something we were already aware of,
which is that the UTF8 locales on OS X pretty much suck.  It's fairly
clear that Apple has put no effort into achieving more than minimal
standards compliance for those.  Sorting doesn't work as expected in
those locales, for example.

Still, that's reality, and any proposal to rely on strxfrm is going to
have to deal with it :-(
        regards, tom lane



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Jun 12, 2014 at 2:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Right. It was a little bit incautious of me to say that we had the
> full benefit of 8 bytes of storage with "en_US.UTF-8", since that is
> only true of lower case characters (I think that FreeBSD can play
> tricks with this. Sometimes, it will give you the benefit of 8 bytes
> of entropy for an 8 byte string, with only non-differentiating
> trailing bytes, so that the first 8 bytes of "Aaaaaaaa" are distinct
> from the first eight bytes of "aaaaaaaa", while any trailing bytes are
> non-distinct for both). In any case it's pretty clear that a goal of
> the glibc implementation is to concentrate as much entropy as possible
> into the first part of the string, and that's the important point.

I thought about using an order-preserving compression technique like
Hu-Tucker [1] in order to get additional benefit from those 8 bytes.
Even my apparently sympathetic cities example isn't all that
sympathetic, since the number of distinct normalized keys is only
about 75% of the total number of cities (while a strcmp()-only
reliable tie-breaker isn't expected to be useful for the remaining
25%). Here is the improvement I see when I setup things so that there
is a 1:1 correspondence between city rows and distinct normalized
keys:

postgres=# with cc as
(select count(*), array_agg(ctid) ct,
substring(strxfrm_test(city)::text from 0 for 19 ) blob from cities
group by 3 having count(*) > 1 order by 1),
ff as
( select unnest(ct[2:400]) u, blob from cc
)
delete from cities using ff where cities.ctid = ff.u;

postgres=# vacuum full cities ;
VACUUM

Afterwards:

postgres=# select count(*) from (select distinct
substring(strxfrm_test(city)::text from 0 for 19) from cities) i;
 count
--------
 243782
(1 row)

postgres=# select count(*) from cities;
 count
--------
 243782
(1 row)

$ cat sort-city.sql
select * from (select * from cities order by city offset 1000000) d;

Patch results
==========

pgbench -M prepared -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 1734
latency average: 173.010 ms
tps = 5.778545 (including connections establishing)
tps = 5.778572 (excluding connections establishing)

pgbench -M prepared -j 4 -c 4 -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 2916
latency average: 411.523 ms
tps = 9.706683 (including connections establishing)
tps = 9.706776 (excluding connections establishing)

Master results
==========

pgbench -M prepared -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 390
latency average: 769.231 ms
tps = 1.297545 (including connections establishing)
tps = 1.297551 (excluding connections establishing)

pgbench -M prepared -j 4 -c 4 -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 535
latency average: 2242.991 ms
tps = 1.777005 (including connections establishing)
tps = 1.777024 (excluding connections establishing)

So that seems like a considerable further improvement that would be
nice to see more frequently. I don't know that it's worth it to go to
the trouble of compressing given the existing ameliorating measures
(HyperLogLog, etc), at least if using compression is motivated by
worst case performance. There is some research on making this work for
this kind of thing specifically [2].

My concern is that it won't be worth it to do the extra work,
particularly given that I already have 8 bytes to work with. Supposing
I only had 4 bytes to work with (as researchers writing [2] may have
only had in 1994), that would leave me with a relatively small number
of distinct normalized keys in many representative cases. For example,
I'd have a mere 40,665 distinct normalized keys in the case of my
"cities" database, rather than 243,782 (out of a set of 317,102 rows)
for 8 bytes of storage. But if I double that to 16 bytes (which might
be taken as a proxy for what a good compression scheme could get me),
I only get a modest improvement - 273,795 distinct keys. To be fair,
that's in no small part because there are only 275,330 distinct city
names overall (and so most dups get away with a cheap memcmp() on
their tie-breaker), but this is a reasonably organic, representative
dataset.
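For illustration, here is a minimal sketch (not the patch's actual code, and all names are made up) of how the first 8 bytes of a strxfrm() blob can be packed into a single integer, so that one machine-word comparison stands in for a memcmp() of the prefix. It relies on the fact that strxfrm() blobs contain no NUL bytes, so zero-padding a short blob sorts it correctly before any longer blob sharing its prefix:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Pack the first 8 bytes of a strxfrm() blob into a uint64_t such that
 * plain integer comparison agrees with memcmp() on those bytes (bytes
 * are placed most-significant first).  Shorter blobs are zero-padded;
 * because strxfrm() output never contains a zero byte, the padded key
 * still sorts before any longer blob that shares the short prefix.
 */
static uint64_t
pack_prefix(const unsigned char *blob, size_t len)
{
    uint64_t    key = 0;
    size_t      n = len < 8 ? len : 8;

    for (size_t i = 0; i < n; i++)
        key |= (uint64_t) blob[i] << (8 * (7 - i));
    return key;
}

/* Returns <0, 0, >0 like memcmp(), with one integer comparison. */
static int
cmp_prefix(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}
```

A zero result here is inconclusive (the originals may still differ past the prefix); a non-zero result can be trusted outright.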

Now, it's really hard to judge something like that, and I don't
imagine that this analysis is all that scientific. I am inclined to
think that we're better off just aborting if the optimization doesn't
work out while copying, and forgetting about order preserving
compression. Let us not lose sight of the fact that strxfrm() calls
are not that expensive relative to the cost of the sort in almost all
cases.

[1] http://www.cs.ust.hk/mjg_lib/bibs/DPSu/DPSu.Files/jstor_514.pdf

[2] http://www.hpl.hp.com/techreports/Compaq-DEC/CRL-94-3.pdf
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Claudio Freire
Date:
On Mon, Jul 14, 2014 at 2:53 PM, Peter Geoghegan <pg@heroku.com> wrote:
> My concern is that it won't be worth it to do the extra work,
> particularly given that I already have 8 bytes to work with. Supposing
> I only had 4 bytes to work with (as researchers writing [2] may have
> only had in 1994), that would leave me with a relatively small number
> of distinct normalized keys in many representative cases. For example,
> I'd have a mere 40,665 distinct normalized keys in the case of my
> "cities" database, rather than 243,782 (out of a set of 317,102 rows)
> for 8 bytes of storage. But if I double that to 16 bytes (which might
> be taken as a proxy for what a good compression scheme could get me),
> I only get a modest improvement - 273,795 distinct keys. To be fair,
> that's in no small part because there are only 275,330 distinct city
> names overall (and so most dups get away with a cheap memcmp() on
> their tie-breaker), but this is a reasonably organic, representative
> dataset.


Are those numbers measured on MAC's strxfrm?

That was the one with suboptimal entropy on the first 8 bytes.



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Jul 14, 2014 at 11:03 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Are those numbers measured on MAC's strxfrm?
>
> That was the one with suboptimal entropy on the first 8 bytes.

No, they're from a Linux system which uses glibc 2.19. The
optimization will simply be not used on implementations that don't
meet a certain standard (see my AC_TRY_RUN test program). I'm
reasonably confident that that test program will pass on most systems.
Just not Mac OS X. The optimization is never used on Windows and 32-bit
systems for other reasons.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Jun 12, 2014 at 2:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Thanks for looking into this.

Is anyone going to look at this?

I attach a new revision. The only real change to the code is that I
fixed an open item concerning what to do on WIN32 with UTF-8, where
the UTF-16 hacks that we do there cannot be worked around to make the
optimization still work, and yet we're still obligated to set up a
sort support state since there is a cataloged sort support function
(unfortunately I wasn't able to test it on Windows since I no longer
own a Windows machine, but it should be fine). I set up a comparison
shim within varlena.c for that case only. This is a kludge, but I
decided that it was better than preparing the sort support
infrastructure to not get a sane sort support state from all opclasses
just because of some platform-specific issue. I think that's justified
by the fact that it's very unlikely to happen again. text is a
datatype that is unusually tied to the operating system. This does
necessitate having the sort support struct record which sort operator
was originally used to lookup the sort support function, but it seems
reasonable to store that in all instances.

What may be of more interest to reviewers is the revised AC_TRY_RUN
test program that "configure" consults. While the code is unchanged,
there is now a detailed rationale for why it's reasonable to expect
that a significant amount of entropy will be concentrated in the first
8 bytes of a strxfrm() blob, with reference to how those algorithms
are more or less required to behave by the Unicode consortium. The
basic reason is that the algorithm for building binary sort keys
(described by a Unicode standard [1] defining the behavior of Unicode
collation algorithms, and implemented by strxfrm()) specifically works
by storing primary level, secondary level and tertiary level weights
in turn in the returned blob. There may be additional levels, too. As
the standard says:

"""
The primary level (L1) is for the basic sorting of the text, and the
non-primary levels (L2..Ln) are for adjusting string weights for other
linguistic elements in the writing system that are important to users
in ordering, but less important than the order of the basic sorting.
"""

Robert pointed out a case where varying character case of an English
word did not alter the primary level bytes (and thus the poor man's
normalized key was the same). He acknowledged that most of the entropy
of the first 8 bytes of the string went into the first 8 bytes of the
blob/key. This can actually be very useful to the optimization in some
cases. In particular, with most Latin alphabets you'll see the same
pattern when diacritics are used. This means that even though the
original string has (say) an accented character that would take 2
bytes to store in UTF-8, the weight in the primary level is the same
as an equivalent unaccented character (and so only takes one byte to
store at that level, with differences only in subsequent levels).
Whole strxfrm() blobs are the same length regardless of how many
accents appear in an otherwise equivalent original Latin string, and you
get exactly the same high concentrations of entropy in the first 8
bytes in pretty much all Latin alphabets (the *absence* of diacritics
is stored in later weight levels too, even with the "en_US.UTF-8"
collation). As the standard says:

"""
By default, the algorithm makes use of three fully-customizable
levels. For the Latin script, these levels correspond roughly to:

alphabetic ordering
diacritic ordering
case ordering.

A final level may be used for tie-breaking between strings not
otherwise distinguished.
"""

In practice a huge majority of comparisons are expected to be resolved
at the primary weight level (in the case of a straight-up sort of
complete, all-unique strxfrm() blobs), which at least in the case of
glibc appears to require (at most) as many bytes to store as the
original UTF-8 string did. Performance of a sort using a
strcoll()-based collation could be *faster* than the "C" locale with
this patch if there were plenty of Latin characters with diacritics.
The diacritics would effectively be removed from the poor man's
normalized key representations, and would only be considered when a
tie-breaker is required.

[1] http://www.unicode.org/reports/tr10/
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Greg Stark
Date:
On Sun, Jul 27, 2014 at 8:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
> What may be of more interest to reviewers is the revised AC_TRY_RUN
> test program that "configure" consults.

I haven't looked yet. Can you describe what exactly the AC_TRY_RUN is
testing for?

If it's just whether the system supports strxfrm or UTF-8 at all that
might be ok. If it's detailed behaviour of the locales then that's a
problem since that could vary the platform the code is eventually run
on on compared to the one it's built on.


-- 
greg



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sun, Jul 27, 2014 at 8:23 AM, Greg Stark <stark@mit.edu> wrote:
> I haven't looked yet. Can you describe what exactly the AC_TRY_RUN is
> testing for?

It's more or less testing for a primary weight level (i.e. the first
part of the blob) that is no larger than the original characters of
the string, and has no "header bytes" or other redundancies.  It also
matches secondary and subsequent weight levels to ensure that they
match, since the two strings tested have identical case, use of
diacritics, etc (they're both lowercase ASCII-safe strings). I don't
set a locale, but that shouldn't matter. I have good reason to believe
that many strxfrm() implementations behave this way, based on the
Unicode standard, and some investigation. Still, that is something
that should be verified more formally if we're going to trust
strxfrm() generally rather than just discriminating against Mac OS X
specifically. I think that the Mac OS X implementation is an anomaly
(I haven't really looked into why), and the FreeBSD one just isn't
very good. But even the FreeBSD one appears to append primary weights
(only) to the blob it returns, and so is essentially the same for my
purposes [1].

[1] http://lists.freebsd.org/pipermail/freebsd-current/2003-April/001273.html
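A simplified stand-in for such a configure-time check is sketched below. It tests only the ordering property that the C standard actually guarantees (the sign of strcmp() on two blobs must match the sign of strcoll() on the originals); the real AC_TRY_RUN program inspects the blob layout as well, and the function name here is made up:

```c
#include <locale.h>
#include <string.h>

/*
 * Check that comparing strxfrm() blobs with strcmp() orders a pair of
 * strings the same way strcoll() orders the originals.  The C standard
 * requires this to hold for every pair of strings in the current
 * LC_COLLATE locale; an implementation where it failed would be
 * outright broken, not merely unhelpful to the optimization.
 * Returns 1 if the orderings agree, 0 otherwise.
 */
static int
strxfrm_order_agrees(const char *a, const char *b)
{
    char        xa[256];
    char        xb[256];

    if (strxfrm(xa, a, sizeof xa) >= sizeof xa ||
        strxfrm(xb, b, sizeof xb) >= sizeof xb)
        return 0;               /* blob too large; cannot tell */

    int         blob_cmp = strcmp(xa, xb);
    int         coll_cmp = strcoll(a, b);

    return (blob_cmp > 0) == (coll_cmp > 0) &&
           (blob_cmp < 0) == (coll_cmp < 0);
}
```

What the standard does not guarantee, and what the real test program has to probe empirically, is how much entropy lands in the first 8 bytes of the blob.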
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sun, Jul 27, 2014 at 12:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It's more or less testing for a primary weight level (i.e. the first
> part of the blob) that is no larger than the original characters of
> the string, and has no "header bytes" or other redundancies.  It also
> matches secondary and subsequent weight levels to ensure that they
> match, since the two strings tested have identical case, use of
> diacritics, etc (they're both lowercase ASCII-safe strings). I don't
> set a locale, but that shouldn't matter.

Actually, come to think of it that might not quite be true. Consider
this output from Robert's strxfrm test program:

pg@hamster:~/code$ ./strxfrm hu_HU.utf8 potyty
"potyty" -> 2826303001090909090109090909 (14 bytes)
pg@hamster:~/code$ ./strxfrm hu_HU.utf8 potyta
"potyta" -> 2826302e0c010909090909010909090909 (17 bytes)

This is a very esoteric Hungarian collation rule [1], which at one
point we found we had to plaster over within varstr_cmp() to prevent
indexes giving wrong answers [2]. It turns out that with this
collation, strcoll("potyty", "potty") == 0. The point specifically is
that in principle, collations can alter the number of weights that
appear in the primary level of the blob. This might imply that the
number of primary level bytes for the ASCII-safe string "abcdefgh"
might not equal those of "ijklmnop" for some collation, because of the
application of some similar esoteric rule. In principle, collations
are at liberty to make that happen, even though this hardly ever
occurs in practice (we first heard about it in 2005, although the
Unicode algorithm standard warns of this), and even though any of the
cases where it does occur it probably happens to not affect my little
AC_TRY_RUN program. Even still, I'm not comfortable with the
deficiency of the program. I don't want my optimization to
accidentally not apply just because some hypothetical collation where
this is true was used when Postgres was built. It probably couldn't
happen, but I must admit guaranteeing that it can't is a mess.
I suppose I could fix this by no longer assuming that the number of
bytes that appear in the primary level are fixed at n for n original
ASCII code point strings. I think that in theory even that could
break, though, because we have no principled way of parsing out
different weight levels (the Unicode standard has some ideas about how
to delimit them given strxfrm()'s "no NULL bytes in the blob"
restriction, but that's clearly implementation defined).

Given that Mac OS X is the only platform that appears to have this
header byte problem at all, I think we'd be better off specifically
disabling it on Mac OS X. I was very surprised to learn of the problem
on Mac OS X. Clearly it's going against the grain by having the
problem.

[1] http://www.postgresql.org/message-id/43A16BB7.7030606@mage.hu
[2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Jul 28, 2014 at 4:41 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Actually, come to think of it that might not quite be true.

Another issue is that we might just happen to use the "C" locale when
the AC_TRY_RUN program is invoked, which probably doesn't exhibit the
broken behavior of Mac OS X, since at least with glibc on Linux that
leaves you with a blob exactly matching the original string. Then
again, who knows? The Mac OS X behavior seems totally arbitrary to me.
If I had to guess I'd say it has something to do with their providing
an open standard shim to a UTF-16 based proprietary API.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Wim Lewis
Date:
On 28 Jul 2014, at 4:57 PM, Peter Geoghegan wrote:
> [....] Then
> again, who knows? The Mac OS X behavior seems totally arbitrary to me.
> If I had to guess I'd say it has something to do with their providing
> an open standard shim to a UTF-16 based proprietary API.

A quick glance at OSX's strxfrm() suggests they're using an implementation of strxfrm() from FreeBSD. You can find the source here:

   http://www.opensource.apple.com/source/Libc/Libc-997.90.3/string/FreeBSD/strxfrm.c

(and a really quick glance at the contents of libc on OSX 10.9 reinforces this--- I don't see any calls into their CoreFoundation unicode string APIs.)






Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Jul 28, 2014 at 5:14 PM, Wim Lewis <wiml@omnigroup.com> wrote:
> A quick glance at OSX's strxfrm() suggests they're using an implementation of strxfrm() from FreeBSD. You can find the source here:
>
>     http://www.opensource.apple.com/source/Libc/Libc-997.90.3/string/FreeBSD/strxfrm.c
>
> (and a really quick glance at the contents of libc on OSX 10.9 reinforces this--- I don't see any calls into their CoreFoundation unicode string APIs.)

Something isn't quite accounted for, then. The FreeBSD behavior is to
append the primary weights only. That makes their returned blobs
smaller than those you'll see on Linux, but also appears to imply that
their implementation is substandard (The PostgreSQL port uses ICU on
FreeBSD for a reason, I suppose). But FreeBSD did not add extra,
redundant "header bytes" right in the primary level when I tested it,
but I'm told Mac OS X does. I guess it could be that the collations
shipped differ, but I can't think why that would be. It does seem
peculiar that the Mac OS X blobs are always printable, whereas that
isn't the case with Glibc (the only restriction like that is that
there are no NULL bytes), and the Unicode algorithm standard
specifically says that that's okay.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach a new revision.

I think that I may have missed a trick here.

It turns out that it isn't expensive to also hash original text
values, to track their cardinality using HyperLogLog - it's hardly
measurable when done at an opportune point just after strxfrm()
expansion. I was already tracking the cardinality of poor man's
normalized keys using HyperLogLog. If I track the cardinality of both
sets (original values and normalized keys), I can compare the two when
evaluating if normalization should be aborted. This is by far the most
important consideration.

This causes the optimization to be applied when sorting millions of
tuples with only a tiny number of distinct values (like, 4 or 5),
without making bad cases that we fail to abort in a timely manner any
more likely. This is still significantly profitable - over 90% faster
in one test, because the optimistic memcmp() still allows us to avoid
any strcoll() calls. It looks about the same as using the "C"
collation. Not quite the huge boost we can see, but still well
worthwhile.

In general it seems well principled to have the "should I abort
normalization?" algorithm mostly weigh how effective a proxy for full
key cardinality the normalized key cardinality is. If it is a good proxy
then nothing else matters. If it's not a very good proxy, that can
only be because there are many differences beyond the first 8 bytes.
Only then will we weigh the total number of distinct normalized keys,
and as the effectiveness of normalized key cardinality as a proxy for
overall cardinality falls, our requirements for the overall number of
distinct normalized keys shoots up rapidly.
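The decision rule described here can be sketched roughly as follows. The two cardinality estimates would come from the HyperLogLog sketches of normalized keys and of original values; the thresholds below are illustrative assumptions, not the patch's actual values:

```c
#include <stdbool.h>

/*
 * Sketch of the "should I abort normalization?" decision.  The primary
 * input is how well normalized-key cardinality tracks full-key
 * cardinality; only when the prefix is a poor proxy do we also demand
 * a healthy number of distinct normalized keys relative to input size.
 * Thresholds (0.5 and 0.05) are illustrative assumptions.
 */
static bool
should_abort_normalization(double distinct_norm_keys,
                           double distinct_full_keys,
                           double total_rows)
{
    if (total_rows <= 0 || distinct_full_keys <= 0)
        return false;

    /*
     * If normalized-key cardinality tracks full-key cardinality
     * closely, the prefix is a good proxy and we keep going, no matter
     * how few distinct values there are overall (e.g. millions of rows
     * with only 4 or 5 distinct values).
     */
    double      proxy_quality = distinct_norm_keys / distinct_full_keys;

    if (proxy_quality > 0.5)
        return false;

    /*
     * Poor proxy: most differences lie beyond the first 8 bytes.  Only
     * persist if distinct normalized keys remain plentiful relative to
     * the number of rows being sorted.
     */
    return distinct_norm_keys / total_rows < 0.05;
}
```

Under this shape of rule, the low-cardinality case the mail describes (a handful of distinct values repeated across millions of tuples) keeps the optimization, because the normalized keys capture nearly all of the full-key cardinality.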

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Wim Lewis
Date:
On 28 Jul 2014, at 5:23 PM, Peter Geoghegan wrote:
> On Mon, Jul 28, 2014 at 5:14 PM, Wim Lewis <wiml@omnigroup.com> wrote:
>> A quick glance at OSX's strxfrm() suggests they're using an implementation of strxfrm() from FreeBSD. You can find the source here:
>>
>>    http://www.opensource.apple.com/source/Libc/Libc-997.90.3/string/FreeBSD/strxfrm.c
>>
>> (and a really quick glance at the contents of libc on OSX 10.9 reinforces this--- I don't see any calls into their CoreFoundation unicode string APIs.)
>
> Something isn't quite accounted for, then. The FreeBSD behavior is to
> append the primary weights only. That makes their returned blobs
> smaller than those you'll see on Linux, but also appears to imply that
> their implementation is substandard (The PostgreSQL port uses ICU on
> FreeBSD for a reason, I suppose). But FreeBSD did not add extra,
> redundant "header bytes" right in the primary level when I tested it,
> but I'm told Mac OS X does.

I don't think OSX actually does. From a look at the source and a simple test program, OSX's strxfrm represents its output as a series of 24-bit weights, each one encoded into four bytes. Multiple "levels" of weights are separated by a weight of 0 (which is encoded as the four bytes "0000").

Robert Haas' message in this thread on 7 April is the first mention of a header, but his examples from 12 June don't really demonstrate a header--- they're completely consistent with the description above and the published source code.

OSX's strxfrm output is very space-inefficient, especially for Latin text, where none of the weights seem to be greater than 2^12, meaning that half of the bytes in any output are always going to be 0x30. (Testing with, e.g., Hangul characters I find some weights that use more of the space.) But as far as I can tell it's not completely crazy. :)

What this means is that on OSX, comparing the first 8 bytes of strxfrm output will end up comparing the primary weights of the first two characters in the original string. Which is the same conclusion Robert Haas came to earlier, modulo the "header bytes" interpretation.
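One plausible reading of the encoding described here (a 24-bit weight split into four 6-bit groups, each offset by '0' so the output stays printable and NUL-free) is sketched below. The exact scheme is an assumption for illustration, not taken from Apple's source, but it does match the observation that weights below 2^12 leave half the bytes as 0x30:

```c
#include <stdint.h>

/*
 * Hypothetical four-byte encoding of a 24-bit collation weight: four
 * 6-bit groups, each offset by 0x30 ('0').  Output is always printable
 * and never contains a NUL, and weights below 2^12 leave the first two
 * bytes as 0x30.  Assumed for illustration; not Apple's actual code.
 */
static void
encode_weight24(uint32_t w, unsigned char out[4])
{
    out[0] = 0x30 + ((w >> 18) & 0x3f);
    out[1] = 0x30 + ((w >> 12) & 0x3f);
    out[2] = 0x30 + ((w >> 6) & 0x3f);
    out[3] = 0x30 + (w & 0x3f);
}

static uint32_t
decode_weight24(const unsigned char in[4])
{
    return ((uint32_t) (in[0] - 0x30) << 18) |
           ((uint32_t) (in[1] - 0x30) << 12) |
           ((uint32_t) (in[2] - 0x30) << 6) |
           (uint32_t) (in[3] - 0x30);
}
```

Under such a scheme, 8 bytes of blob carry only two characters' worth of primary weights, which is exactly why the prefix has so little entropy on that platform.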





Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Sun, Jul 27, 2014 at 3:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Jun 12, 2014 at 2:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Thanks for looking into this.
>
> Is anyone going to look at this?

I'm concerned by the licensing information in hyperloglog.c, which reads:

+  * Portions Copyright (c) 2014, PostgreSQL Global Development Group
+  *
+  * Based on Hideaki Ohno's C++ implementation.

There is no commentary whatever on the license applicable to that
code.  It appears to be the MIT license:

https://github.com/hideo55/cpp-HyperLogLog

I don't think we should commit anything that's not clearly under the
PostgreSQL license.

Heikki previously made quite firmly the point that you shouldn't blend
the addition of sortsupport logic in general with the poor man's key
optimization in particular; they should be separate patches.  He's
right.

Some other concerns:

1. As I think I mentioned before, I think the term "poor man's key" is
just terrible.  It conveys, at least to me, no useful information
about what is really being done.  If you called it, say,
"strxfrm-prefix comparison", it would be only a few more letters but a
whole lot more informative to future readers of the code.

2. The need to modify analyze.c and nodeMergeAppend.c to disable this
optimization seems odd, and the comments are uninformative, saying
only that it "isn't feasible", but not why.  If it were only analyze.c
I might expect that it required some kind of transaction state that
isn't present there, but that doesn't explain the merge-append case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Jul 30, 2014 at 10:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think we should commit anything that's not clearly under the
> PostgreSQL license.

I thought that we were comfortable with using MIT licensed code. But,
come to think of it I don't know what our exact policy is. Wikipedia
says "The Simplified BSD license used by FreeBSD is essentially
identical to the MIT License, as it contains
neither an advertising clause, nor a prohibition on promotional use of
the copyright holder's name". There are plenty of suitably licensed
hyperLogLog implementations in any case, and many are similar. Some
direction here would be useful, since I'm not an expert on licensing.

> Heikki previously made quite firmly the point that you shouldn't blend
> the addition of sortsupport logic in general with the poor man's key
> optimization in particular; they should be separate patches.  He's
> right.

Ordinarily, I'd agree with this. But the fact is that there was an
extensive discussion of that patch. We both know quite a lot about it,
because you wrote it and I reviewed it [1]. What more is there to
learn about it? It's not very controversial. It provides a text sort
support routine that consistently wins by removing avoidable overhead.
For the strcoll() case, you put the improvement at 7%. It clearly
cannot regress anything. It's easy to reason about. It's a fine patch.
We had a silly disagreement about it, but I would almost have been
willing to commit it there and then (although Windows was still broken
with that patch, something I've only just fixed, AFAIK).

This patch may cause regressions. They may not be that bad, but they
need to be characterized and avoided. It can also make many
representative cases more than 3 times faster (an improvement in
excess of 200%). Even low cardinality cases (e.g. sorting a list of
cities in the world with more than a thousand inhabitants by country)
can be up to 150% faster. That's a very strong improvement, and the
lack of this kind of facility surely accounts for why other systems
are said to sort text so much more efficiently than Postgres
(something we discussed privately years ago). But, regressions are
bad. Plus, I freely admit that this whole idea is ugly as sin. That's
the only thing to discuss, AFAICT. We know that the best case
performance is almost entirely attributable to the normalized key
stuff.

> Some other concerns:
>
> 1. As I think I mentioned before, I think the term "poor man's key" is
> just terrible.  It conveys, at least to me, no useful information
> about what is really being done.  If you called it, say,
> "strxfrm-prefix comparison", it would be only a few more letters but a
> whole lot more informative to future readers of the code.

I don't have a problem with changing the name. But the name that you
propose is all about text. This patch is intended to add an extensible
infrastructure (a new part of sort support), plus one client of that
more complete extensible infrastructure (sort support for text). I
think there's a good chance that it could work well for another type
too, like numeric, if we could come up with a good system for encoding
numeric normalized keys, and if we can similarly get over the fact
that there is going to be a minority of cases that won't be helped at
all. That's a whole other discussion, though, and text is clearly the
really compelling case.

Do you have another suggestion?

> 2. The need to modify analyze.c and nodeMergeAppend.c to disable this
> optimization seems odd, and the comments are uninformative, saying
> only that it "isn't feasible", but not why.  If it were only analyze.c
> I might expect that it required some kind of transaction state that
> isn't present there, but that doesn't explain the merge-append case.

The reason is that there is no convenient way to inject a conversion
routine. A binary heap is calling heap_compare_slots(), which is where
comparisons occur, so it isn't really sorting at all. It would be
ugly, and probably even counter-productive to do normalization just in
time, as tuples are pulled up. In general I assume that it's only
useful to do a full normalization pass before sorting (preferably at a
convenient juncture like copytup_heap()). Note that we still elide the
fmgr trampoline, so only the additional normalization optimization
fails to apply.
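The division of labor described above, a one-time conversion pass followed by a comparator that consults the original value only on a normalized-key tie, can be sketched as follows. All names are illustrative (not the patch's), and strcmp() stands in for the authoritative strcoll()-based comparison:

```c
#include <stdint.h>
#include <string.h>

/*
 * Sketch of sort support with a separate conversion ("normalization")
 * pass.  The converter runs once per tuple at a convenient juncture
 * such as copytup_heap(); the comparator then tries the cheap
 * normalized comparison first and falls back to the authoritative
 * comparison only when the normalized keys tie.
 */
typedef struct SortItem
{
    uint64_t    norm_key;       /* filled in by the conversion pass */
    const char *original;       /* authoritative value */
} SortItem;

/* Stands in for the full strcoll()-based comparison. */
static int
full_compare(const char *a, const char *b)
{
    return strcmp(a, b);
}

static int
item_compare(const SortItem *a, const SortItem *b)
{
    if (a->norm_key < b->norm_key)
        return -1;
    if (a->norm_key > b->norm_key)
        return 1;
    /* Normalized keys equal: inconclusive, consult the originals. */
    return full_compare(a->original, b->original);
}
```

When there is no convenient conversion pass, as in the merge-append case, the items never get a norm_key, and only the cheap integer path is lost; the authoritative comparator still works unchanged.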

I should add that there is one other case where we arbitrarily don't
use the additional optimization that doesn't actually make any sense.
Within nodeAgg.c, "We use a plain Datum sorter when there's a single
input column; otherwise sort the full tuple". And thus, the
optimization may not be used there more or less by accident.
tuplesort_putdatum() needs to be taught about the new optimization
too, I think.

I feel it should be possible for both the Datum and Heap sort cases to
use the new optimization (since they can always use sort support)
wherever they can afford to make that normalization pass (they usually
can, since copytup* is usually passed through). Eventually, I think
we'll want to extend sort support to work with btree index builds.

Please don't review the abort logic until I finish producing a new
revision. It won't be long. I've come up with something appreciably
better [2].

[1] http://www.postgresql.org/message-id/CA+Tgmoa8by24gd+YbuPX=5gSGmN0w5sGiPzWwq7_8iS26vL5CQ@mail.gmail.com

[2] http://www.postgresql.org/message-id/CAM3SWZRYkTbOtVsun2R1j95XR5GnrvM+Zbvz+RxHLq0CLz41hA@mail.gmail.com
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Jul 30, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I don't have a problem with changing the name. But the name that you
> propose is all about text. This patch is intended to add an extensible
> infrastructure (a new part of sort support), plus one client of that
> more complete extensible infrastructure (sort support for text). I
> think there's a good chance that it could work well for another type
> too, like numeric, if we could come up with a good system for encoding
> numeric normalized keys, and if we can similarly get over the fact
> that there is going to be a minority of cases that won't be helped at
> all. That's a whole other discussion, though, and text is clearly the
> really compelling case.
>
> Do you have another suggestion?

How about "proxy sort keys"? It's suggestive of a format that can be
relied on to faithfully represent the original key. But, like any
proxy, it's a substitute, and by definition a substitute is never as
authoritative as the original value. As such, the original value may
need to be consulted when the proxy sort key doesn't have enough
information to give a conclusive answer, which hopefully doesn't
happen often. Like any good proxy, proxy sort keys know enough to know
when they cannot faithfully represent what the original value would
say. In practice proxy sort keys are almost always themselves
pass-by-value, while serving to proxy a pass by reference type, but
this isn't a formal requirement. Encoding strategies don't necessarily
have anything to do with strxfrm(). That's just how text's default
opclass happens to do it.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Jul 30, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I feel it should be possible for both the Datum and Heap sort cases to
> use the new optimization (since they can always use sort support)
> wherever they can afford to make that normalization pass (they usually
> can, since copytup* is usually passed through). Eventually, I think
> we'll want to extend sort support to work with btree index builds.

Actually, scratch that; I think it should be possible to have things
like aggregates with a single ORDER BY expression use the
normalization optimization where available (a slightly less specific
statement than the above). I do not in fact think we should make this
work for tuplesort_begin_datum(). The single column case ORDER BY
aggregate happens to use a datum representation within tuplesort,
whereas the multi input column case uses heap tuples (and
tuplesort_begin_heap(), etc).

What I mean is that I think the best way to accomplish what we really
want here is probably to make the single column case artificially do
the same thing as the multi-column case where we know there is a sort
support routine with normalization support. That has to be a superior
approach to jury-rigging tuplesort to differentiate between normalized
keys and an original representation where tuplesort_getdatum() is
consulted. It's a bit ugly that datum sorts will then support sort
support, but only when this new infrastructure is unused. I see no
better way.

Now, to figure out a reasonable way to have nodeAgg.c ask "does this
sort operator have a real sort support function that happens to use
the new normalization capability?" before calling either
tuplesort_begin_heap() or tuplesort_begin_datum()...

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Wed, Jul 30, 2014 at 7:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Jul 30, 2014 at 3:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I don't have a problem with changing the name. But the name that you
>> propose is all about text. This patch is intended to add an extensible
>> infrastructure (a new part of sort support), plus one client of that
>> more complete extensible infrastructure (sort support for text). I
>> think there's a good chance that it could work well for another type
>> too, like numeric, if we could come up with a good system for encoding
>> numeric normalized keys, and if we can similarly get over the fact
>> that there is going to be a minority of cases that won't be helped at
>> all. That's a whole other discussion, though, and text is clearly the
>> really compelling case.
>>
>> Do you have another suggestion?
>
> How about "proxy sort keys"? It's suggestive of a format that can be
> relied on to faithfully represent the original key. But, like any
> proxy, it's a substitute, and by definition a substitute is never as
> authoritative as the original value. As such, the original value may
> need to be consulted when the proxy sort key doesn't have enough
> information to give a conclusive answer, which hopefully doesn't
> happen often. Like any good proxy, proxy sort keys know enough to know
> when they cannot faithfully represent what the original value would
> say. In practice proxy sort keys are almost always themselves
> pass-by-value, while serving to proxy a pass-by-reference type, but
> this isn't a formal requirement. Encoding strategies don't necessarily
> have anything to do with strxfrm(). That's just how text's default
> opclass happens to do it.

I certainly like that better than poor-man; but proxy, to me, fails to
convey inexactness.  Perhaps we can work that idea in somehow.  Or
maybe "pre"-something, to indicate that we do this before comparing
the regular key, in the hopes of not needing to.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Jul 31, 2014 at 11:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I certainly like that better than poor-man; but proxy, to me, fails to
> convey inexactness.  Perhaps we can work that idea in somehow.  Or
> maybe "pre"-something, to indicate that we do this before comparing
> the regular key, in the hopes of not needing to.

Well, these normalized keys are sometimes sufficient to give correct
answers, or are sufficient to determine that that isn't possible with
just their representation. That is pretty exact.

How about "delegate key"? That's a similar term to proxy key, but is
more strongly suggestive of the idea that these keys are in some sense
inferior to original keys.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Marti Raudsepp
Date:
On Thu, Jul 31, 2014 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I certainly like that better than poor-man; but proxy, to me, fails to
> convey inexactness.

Maybe "abbreviated", "abridged", "minified"?

Regards,
Marti



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Jul 31, 2014 at 3:14 PM, Marti Raudsepp <marti@juffo.org> wrote:
> On Thu, Jul 31, 2014 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I certainly like that better than poor-man; but proxy, to me, fails to
>> convey inexactness.
>
> Maybe "abbreviated", "abridged", "minified"?

Yeah, something like that would work for me.  I like abbreviated; that
seems very descriptive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Greg Stark
Date:
On Thu, Jul 31, 2014 at 8:14 PM, Marti Raudsepp <marti@juffo.org> wrote:
> On Thu, Jul 31, 2014 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I certainly like that better than poor-man; but proxy, to me, fails to
>> convey inexactness.
>
> Maybe "abbreviated", "abridged", "minified"?

Surrogate?

Let the bike shedding begin!

-- 
greg



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Jul 31, 2014 at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Maybe "abbreviated", "abridged", "minified"?
>
> Yeah, something like that would work for me.  I like abbreviated; that
> seems very descriptive.

Abbreviated it is.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Jul 31, 2014 at 1:12 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Abbreviated it is.

Attached revision uses new terminology. I have abandoned configure
tests entirely, preferring to explicitly discriminate against Mac OS X
(Darwin) as a platform with a known odd locale implementation, just
like pg_get_encoding_from_locale() does. Since Robert did not give a
reason for discussing the basic fmgr-elision patch despite having
already discussed it a couple of years ago, I have not split the patch
into two (if I did, the first patch would be virtually identical to
Robert's 2012 patch). I hope that's okay. I am willing to revisit this
question if Robert feels that something is uncertain about the costs
and benefits of a basic fmgr-eliding sort support for text.

There are still some open items:

* I have left the licensing question unresolved. There is plenty we
can do about this, and if necessary we can ask the original author to
change his license from the MIT license to the PostgreSQL license.
AFAICT, the BSD license is *more* restrictive than the MIT license,
while the PostgreSQL license is almost identical to the MIT license in
wording. I don't see the concern. If the PostgreSQL license had the
“non-endorsement clause” of the New BSD license, or alternatively if
the MIT license had a similar clause that the PostgreSQL license
lacked, that would be a substantive and appreciable difference. That
isn't the case.

* I have not made aggregates use the optimization where they currently
accidentally don't due to using datum tuplesort. I can artificially
force them to use heap tuplesort where that's likely to help [1].
Let's defer this question until we have an abort algorithm that seems
reasonable. There is a FIXME item.

Improvements in full:

* New terminology ("Abbreviated keys").

* Better explanations for why we don't use the additional optimization
of abbreviated keys where we're using sort support, in analyze.c and
nodeMergeAppend.c.

* Better handling of NULLs.

* Never use the optimization with bounded heap sorts.

* Better abort strategy, which centers on the idea of measuring full
key cardinality and abbreviated key cardinality, and weighing how
good a proxy the latter is for the former. This weighs heavily
when deciding whether we should abort normalization as tuples are copied.
Exact details are explained within bttext_abort(). As already
mentioned, this can allow us to use the optimization when we're
sorting a million tuples with only five distinct values. This will
still win decisively, but it's obviously impossible to make work while
only considering abbreviated key cardinality. Determining cardinality
of both abbreviated keys and full keys appears to have very low
overhead, and is exactly what we ought to care about, so that's what
we do. While there is still some magic to the algorithm's inputs, my
sense is that I'm much closer to characterizing the break-even point
than before.
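The exact weighting lives in bttext_abort() in the patch; purely as a toy illustration of the idea (the function name and the 0.5 threshold here are invented, not the patch's actual logic or numbers), the decision boils down to comparing the two cardinality estimates:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy sketch of the abort decision described above: abort abbreviation
 * only when the abbreviated keys' estimated cardinality is a poor proxy
 * for the full keys' estimated cardinality.  The name and the 0.5
 * threshold are hypothetical; the real heuristic is in bttext_abort().
 */
bool
should_abort_abbreviation(double full_card, double abbrev_card)
{
	/* Degenerate input: nothing sampled yet, so keep going */
	if (full_card <= 0)
		return false;

	/*
	 * If abbreviated keys distinguish at least half as many values as
	 * the full keys do, they still resolve most comparisons cheaply.
	 */
	return (abbrev_card / full_card) < 0.5;
}
```

Under this sketch, a million tuples with only five distinct values yields five distinct full keys and (at most) five distinct abbreviated keys, a ratio near 1.0, so the optimization is kept, which is the case described above.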


I also attach a second patch, which adds additional debug
instrumentation, and is intended to be applied on top of the real
patch to help with review. Run Postgres with DEBUG1 output when it's
applied. With the patch applied, the setting backslash_quote also
controls whether or not the optimization is used. So with the debug
instrumentation patch applied:

"backslash_quote = on" - use optimization, but never abort

"backslash_quote = off" - Never use optimization - set up shim (just
like the win32 UTF-8 case). Equivalent to master's behavior.

"backslash_quote = safe_encoding" - Use optimization, but actually
abort if it doesn't work out, the behavior that is always seen without
instrumentation. This is useful for testing the overhead of
considering the optimization in cases where it didn't work out (i.e.
it's useful to compare this with "backslash_quote = off").

I've found it useful to experiment with real-world data with the
optimization dynamically enabled and disabled.

Thoughts?

[1] http://www.postgresql.org/message-id/CAM3SWZSf0Ftxy8QHGAKJh=S80vD2SBx83zkEzuJyZ6R=pTy5xA@mail.gmail.com
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Noah Misch
Date:
On Fri, Aug 01, 2014 at 04:00:11PM -0700, Peter Geoghegan wrote:
> Since Robert did not give a
> reason for discussing the basic fmgr-elision patch despite having
> already discussed it a couple of years ago, I have not split the patch
> into two (if I did, the first patch would be virtually identical to
> Robert's 2012 patch). I hope that's okay. I am willing to revisit this
> question if Robert feels that something is uncertain about the costs
> and benefits of a basic fmgr-eliding sort support for text.

Robert, Heikki and maybe Alvaro requested and/or explained this split back in
April.  The fact that the would-be first patch was discussed and rejected in
the past does not counter that request.  (I voted against Robert's 2012 patch.
My opposition was mild even then, and the patch now being a stepping stone to
a more-compelling benefit certainly tips the scale.)  Given that the effect of
that decision is a moderate procedural change only, even if the request were
wrong, why continue debating it?

> There are still some open items:
> 
> * I have left the licensing question unresolved. There is plenty we
> can do about this, and if necessary we can ask the original author to
> change his license from the MIT license to the PostgreSQL license.
> AFAICT, the BSD license is *more* restrictive than the MIT license,
> while the PostgreSQL license is almost identical to the MIT license in
> wording. I don't see the concern. If the PostgreSQL license had the
> “non-endorsement clause” of the New BSD license, or alternatively if
> the MIT license had a similar clause that the PostgreSQL license
> lacked, that would be a substantive and appreciable difference. That
> isn't the case.

It's fine to incorporate MIT-licensed code; we have many cases of copying
more-liberally-licensed code into our tree.  However, it's important to have
long-term clarity on the upstream license terms.  This is insufficient:

> + /*-------------------------------------------------------------------------
> +  *
> +  * hyperloglog.c
> +  *      HyperLogLog cardinality estimator
> +  *
> +  * Portions Copyright (c) 2014, PostgreSQL Global Development Group
> +  *
> +  * Based on Hideaki Ohno's C++ implementation.  This is probably not ideally
> +  * suited to estimating the cardinality of very large sets;  in particular, we

Usually, the foreign file had a copyright/license notice, and we prepend the
PostgreSQL notice.  See src/port/fls.c for a trivial example.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Aug 1, 2014 at 8:57 PM, Noah Misch <noah@leadboat.com> wrote:
> Robert, Heikki and maybe Alvaro requested and/or explained this split back in
> April.  The fact that the would-be first patch was discussed and rejected in
> the past does not counter that request.  (I voted against Robert's 2012 patch.
> My opposition was mild even then, and the patch now being a stepping stone to
> a more-compelling benefit certainly tips the scale.)  Given that the effect of
> that decision is a moderate procedural change only, even if the request were
> wrong, why continue debating it?

Heikki asked for it, and Robert repeated the request in the past few
days. I am not debating it. I merely don't understand the nature of
the request. I think it's virtually a foregone conclusion that there
should be fmgr elision for text, with very few decisions to be made on
the details of how that would work. The rest of what I have here is
trickier, and is something that must be discussed. Even the
improvement of ~7% that fmgr-elision offers is valuable for such a
common operation.

I don't want to dredge up the details of the 2012 thread, but since
you mention it the fact that the patch was not committed centered on a
silly disagreement on a very fine point, and nothing more. It was
*not* rejected, and my sense at the time was that it was very close to
being committed. I was fairly confident that everyone understood
things around the 2012 patch that way, and I sought clarity on that
point. It's a totally non-surprising and easy to reason about patch.
Clearly at least one person had some reservations about the basic idea
at the time, and it was worth asking what the concern was before
splitting the patch. It is easy to split the patch, but easier still
to answer my question.

> It's fine to incorporate MIT-licensed code; we have many cases of copying
> more-liberally-licensed code into our tree.  However, it's important to have
> long-term clarity on the upstream license terms.  This is insufficient:
>
>> + /*-------------------------------------------------------------------------
>> +  *
>> +  * hyperloglog.c
>> +  *    HyperLogLog cardinality estimator
>> +  *
>> +  * Portions Copyright (c) 2014, PostgreSQL Global Development Group
>> +  *
>> +  * Based on Hideaki Ohno's C++ implementation.  This is probably not ideally
>> +  * suited to estimating the cardinality of very large sets;  in particular, we
>
> Usually, the foreign file had a copyright/license notice, and we prepend the
> PostgreSQL notice.  See src/port/fls.c for a trivial example.

I will be more careful about this in the future.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Heikki Linnakangas
Date:
On 08/02/2014 08:16 AM, Peter Geoghegan wrote:
> On Fri, Aug 1, 2014 at 8:57 PM, Noah Misch <noah@leadboat.com> wrote:
>> Robert, Heikki and maybe Alvaro requested and/or explained this split back in
>> April.  The fact that the would-be first patch was discussed and rejected in
>> the past does not counter that request.  (I voted against Robert's 2012 patch.
>> My opposition was mild even then, and the patch now being a stepping stone to
>> a more-compelling benefit certainly tips the scale.)  Given that the effect of
>> that decision is a moderate procedural change only, even if the request were
>> wrong, why continue debating it?
>
> Heikki asked for it, and Robert repeated the request in the past few
> days. I am not debating it.

Great, I'll wait for the patch.

> I merely don't understand the nature of the request.

Do you mean:

a) you don't understand what the patch should look like? or
b) you don't understand why it's been requested?

If a), I admit I don't remember the details of this patch or patches 
very well either, but looking back at the archives here: 
http://www.postgresql.org/message-id/CAM3SWZQVnuomFBWNHOyRQ8t+nVJp+3=e58jvvx_A9Y04QmHzrA@mail.gmail.com, 
I think you had a pretty solid idea of what the split should look like. 
So, please do that, i.e. post the patch that Robert did 2 years ago that 
gave the 7% speedup, rebased over master. I don't recall the details of 
that patch, so please explain briefly what it does, as if it was 
submitted for the first time.

If b), see Noah's reply above. Hate to be blunt, but the nature of the 
request is that you're not going to get anywhere with further debate, 
without splitting the patch.

> I don't want to dredge up the details of the 2012 thread, but since
> you mention it the fact that the patch was not committed centered on a
> silly disagreement on a very fine point, and nothing more. It was
> *not* rejected, and my sense at the time was that it was very close to
> being committed. I was fairly confident that everyone understood
> things around the 2012 patch that way, and I sought clarity on that
> point. It's a totally non-surprising and easy to reason about patch.
> Clearly at least one person had some reservations about the basic idea
> at the time, and it was worth asking what the concern was before
> splitting the patch. It is easy to split the patch, but easier still
> to answer my question.

I didn't understand what that question is. Please just post the split 
patches.
- Heikki




Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sat, Aug 2, 2014 at 1:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Great, I'll wait for the patch.

I'll post something over the weekend.

> Do you mean:
>
> a) you don't understand what the patch should look like? or
> b) you don't understand why it's been requested?
>
> If a), I admit I don't remember the details of this patch or patches very
> well either, but looking back at the archives here:
> http://www.postgresql.org/message-id/CAM3SWZQVnuomFBWNHOyRQ8t+nVJp+3=e58jvvx_A9Y04QmHzrA@mail.gmail.com,
> I think you had a pretty solid idea of what the split should look like. So,
> please do that, i.e. post the patch that Robert did 2 years ago that gave
> the 7% speedup, rebased over master. I don't recall the details of that
> patch, so please explain briefly what it does, as if it was submitted for
> the first time.

Robert's 2012 patch just elides the fmgr overhead, like any sort
support routine: each comparison goes through a bare function pointer,
rather than through a function pointer to a shim function that makes
an fmgr call. It also ends up having the sort support routine do a few
things once that might otherwise have to take place once per
comparison. It's more or less a basic sort support routine, and
nothing more.
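The shape of the win is easy to see even outside Postgres. As a stripped-down, hypothetical analogy (none of these names are actual Postgres code), sort support amounts to doing setup once and then comparing through a bare function pointer:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a sort support state struct. */
typedef struct SortSupport
{
	int			(*comparator) (const char *, const char *);
} SortSupport;

/* Stands in for the real per-comparison work (e.g. strcoll()). */
int
text_cmp(const char *a, const char *b)
{
	return strcmp(a, b);
}

/*
 * One-time setup, analogous to installing a sort support comparator.
 * In the fmgr path, argument packaging and dispatch would instead be
 * repeated for every single comparison.
 */
void
prepare_sortsupport(SortSupport *ssup)
{
	ssup->comparator = text_cmp;
}

/* Per-comparison cost is now just one indirect call. */
int
apply_cmp(SortSupport *ssup, const char *a, const char *b)
{
	return ssup->comparator(a, b);
}
```

A sort then calls apply_cmp() millions of times, so shaving the per-call shim overhead is where the ~7% comes from.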

My question is: What is the reason for splitting the patch? Is it that
you or Robert don't agree with my assessment of the 2012 patch, and
you think there might be some subtlety to it that I'm not aware of? Do
you just want to do this incrementally because it's easier to digest
that way? This is a pertinent question. You said something about
understanding where the benefits come from here, which is fair, but my
point was that I actually do understand that the benefits clearly
mostly come from the new abbreviated key idea. As the author of the
2012 patch, I imagine that Robert must be pretty confident that's the
case too, but I don't want to presume that.

As I said, I don't have a problem with breaking out the patch. I am
not trying to artificially link the two. It just isn't obvious to me
that you're aware that most of the benefit, and indeed all of the
potential for regressions clearly comes from this abbreviated key
idea. So sure, we can break out the patch and commit the first part
fairly easily, almost entirely on the basis of 2012 discussion. What
you may have missed here is that the 2012 patch wasn't committed for
reasons entirely unrelated to the merit of the idea. If we were to
commit essentially the same 2012 patch, that is a very straightforward
matter, and is only a small fraction of our work here. The 2012 patch
really should have been committed in 2012.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sat, Aug 2, 2014 at 2:45 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I'll post something over the weekend.

Attached is a cumulative pair of patches generated using
git-format-patch. I refer anyone who wants to know how the two parts
fit together to the commit messages of each patch. In passing, I have
added a reference to the MIT License as outlined by Noah.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
Consider the "cities" table I've played around with throughout the
development of this patch:

postgres=# select tablename, attname, n_distinct, correlation from
pg_stats where attname in ('country', 'province', 'city');
 tablename | attname  | n_distinct | correlation
-----------+----------+------------+-------------
 cities    | city     |  -0.666628 |  -0.0169657
 cities    | province |       1958 |   0.0776529
 cities    | country  |        206 |           1
(3 rows)

Consider this query:

select * from (select * from cities order by country, province, city
offset 1000000) i;

With master, this consistently takes about 6.4 seconds on my laptop.
With the patch I posted most recently, it takes about 3.5 seconds. But
if I change this code:

+ if (ssup->position == sortKeyTiebreak && len1 == len2)
+ {

To this:

+ if (len1 == len2)
+ {

Then my patch takes about 1.3 seconds to execute the query, since now
we're always optimistic about the prospect of getting away with a
cheap memcmp() when the lengths match, avoiding copying and strcoll()
overhead entirely. "province" has a relatively low number of distinct
values - about 1,958, in a table of 317,102 cities - so clearly that
optimism is justified in this instance.

This does seem to suggest that there is something to be said for being
optimistic about finding equality not just when performing an
abbreviated-key tie-breaker on the leading attribute. Maybe the executor could pass a
per-attribute n_distinct hint to tuplesort, which would pass that on
to our sortsupport routine. Of course, if there was a low cardinality
attribute with text strings all of equal length, and/or if there
wasn't a correlation between "country" and "province" a lot of the
time those opportunistic memcmp()s could go to waste. But that might
be just fine, especially if we only did this for moderately short
strings (say less than 64 bytes). I don't have a good sense of the
costs yet, but the hashing that HyperLogLog requires in my patch
appears to be almost free, since it occurs at a time when we're
already bottlenecked on memory bandwidth, and is performed on memory
that needed to be manipulated at that juncture anyway. It wouldn't
surprise me if this general optimism about a simple memcmp() working
out had a very acceptable worst case. It might even be practically
free to be totally wrong about equality being likely, and if we have
nothing to lose and everything to gain, clearly we should proceed. I'm
aware of cases where it makes probabilistic sense to waste compute
bandwidth to gain memory bandwidth. It's something I've seen crop up a
number of times in various papers.
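The key property making this optimism safe is one-sided: for equal-length strings, memcmp() returning 0 proves equality outright, while a nonzero result proves nothing about collation order. A minimal sketch of the idea (invented name; not the patch's code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Opportunistic comparison sketch, as discussed above.  When the two
 * NUL-terminated strings have equal length, a memcmp() of 0 proves
 * equality and lets us skip strcoll() (and any copying) entirely.  A
 * nonzero memcmp() tells us nothing about collation order, since byte
 * order need not match collation order, so we fall back to strcoll().
 */
int
opportunistic_strcmp(const char *a, size_t len1,
					 const char *b, size_t len2)
{
	if (len1 == len2 && memcmp(a, b, len1) == 0)
		return 0;				/* cheap win: definitely equal */

	/* lengths differ, or bytes differ: authoritative comparison */
	return strcoll(a, b);
}
```

With a low-cardinality attribute such as "province" above, most calls return at the memcmp(), and a wasted memcmp() on unequal strings costs little, which is what makes this attractive even when the guess is often wrong.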

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Sat, Aug 2, 2014 at 6:58 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Aug 2, 2014 at 2:45 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> I'll post something over the weekend.
>
> Attached is a cumulative pair of patches generated using
> git-format-patch. I refer anyone who wants to know how the two parts
> fit together to the commit messages of each patch. In passing, I have
> added a reference to the MIT License as outlined by Noah.

OK, I have taken a look at patch 1.  You write:

+ * As a general principle, operator classes with a cataloged sort support
+ * function are expected to provide sane sort support state, including a
+ * function pointer comparator.  Rather than revising that principle, just
+ * setup a shim for the WIN32 UTF-8 and non-"C" collation special case here.

...but I'm wondering what underlies that decision.   I would
understand the decision to go that way if it simplified things
elsewhere, but in fact it seems that's what underlies the addition of
ssup_operator to SortSupportData, which in turn requires a number of
changes elsewhere.  The upshot of those changes is that it makes it
possible to write bttext_inject_shim, but AFAICS that's just
recapitulating what get_sort_function_for_ordering_op and
PrepareSortSupportFromOrderingOp are already doing.  Any material
change to either of those functions will have to be reflected in
bttext_inject_shim; and if some opclass other than text wants to
provide a sortsupport shim that supplies a comparator only sometimes,
it will need its own copy of the logic.

So I think it's better to just change the sortsupport contract so that
filling in the comparator is optional.  Patch for that attached.
Objections?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Aug 5, 2014 at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> ...but I'm wondering what underlies that decision.   I would
> understand the decision to go that way if it simplified things
> elsewhere, but in fact it seems that's what underlies the addition of
> ssup_operator to SortSupportData, which in turn requires a number of
> changes elsewhere.

The changes aren't too invasive. There is exactly one place where it
isn't a trivial matter of storing the operator that was already
available:

+ /* Original operator must be provided */
+ clause->ssup.ssup_operator = get_opfamily_member(opfamily,
+        op_lefttype,
+        op_righttype,
+        opstrategy);

> So I think it's better to just change the sortsupport contract so that
> filling in the comparator is optional.  Patch for that attached.
> Objections?

I'd have preferred to maintain the obligation for some sane
sortsupport state to be provided. It's not as if I feel too strongly
about it, though.

You attached "git diff --stat" output, and not an actual patch. Please re-send.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Aug 5, 2014 at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> and if some opclass other than text wants to
> provide a sortsupport shim that supplies a comparator only sometimes,
> it will need its own copy of the logic.

That's true, but my proposal to do things that way reflects the fact
that text is a type oddly tied to the platform. I don't think it will
come up again (note that in the 4 byte Datum case, we still use sort
support to some degree on other platforms with patch 2 applied). It
seemed logical to impose the obligation to deal with that on
varlena.c.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Aug 5, 2014 at 3:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Aug 5, 2014 at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> and if some opclass other than text wants to
>> provide a sortsupport shim that supplies a comparator only sometimes,
>> it will need its own copy of the logic.
>
> That's true, but my proposal to do things that way reflects the fact
> that text is a type oddly tied to the platform. I don't think it will
> come up again (note that in the 4 byte Datum case, we still use sort
> support to some degree on other platforms with patch 2 applied). It
> seemed logical to impose the obligation to deal with that on
> varlena.c.

Per your other email, here's the patch again; hopefully I've got the
right stuff in the file this time.

On this point, I'm all for confining knowledge of things to a
particular module to that module.  However, in this particular case, I
don't believe that there's anything specific to varlena.c in
bttext_inject_shim(); it's a cut down version of the combination of
functions that appear in other modules, and there's nothing to
encourage someone who modifies one of those functions to also update
varlena.c.  Indeed, people developing on non-Windows platforms won't
even be compiling that function, so it would be easy for most of us to
miss the need for an update.  So I argue that my approach is keeping
this knowledge more localized.

I'm also not sure it won't come up again.  There are certainly other
text-like datatypes out there that might want to optimize sorts; e.g.
citext.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Aug 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Per your other email, here's the patch again; hopefully I've got the
> right stuff in the file this time.

Your patch looks fine to me. I recommend committing patch 1 with these
additions. I can't think of any reason to rehash the 2012 discussion.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Aug 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm also not sure it won't come up again.  There are certainly other
> text-like datatypes out there that might want to optimize sorts; e.g.
> citext.

Fair enough. Actually, come to think of it I find BpChar/character(n)
a far more likely candidate. I've personally never used the SQL
standard character(n) type, which is why I didn't think of it until
now. TPC-H does make use of character(n) though, and that might be a
good reason in and of itself to care about its sorting performance.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
I've looked at another (admittedly sympathetic) dataset that is
publicly available: the flickr "tags" dataset [1]. I end up with a
single table of tags; it's a large enough table, at 388 MB, but the
tuples are not very wide. There are 7,656,031 tags/tuples.

Predictably enough, this query is very fast when an internal sort is
used on a patched Postgres:

select * from (select * from tags order by tag offset 100000000) ii;

Git master takes about 25 seconds to execute the query. Patched takes
about 6.8 seconds. That seems very good, but this is not really new
information.

However, with work_mem set low enough to get an external sort, the
difference is more interesting. If I set work_mem to 10 MB, then the
query takes about 10.7 seconds to execute with a suitably patched
Postgres. Whereas on master, it consistently takes a full 69 seconds.
That's the largest improvement I've seen so far, for any case.

I must admit that this did surprise me, but then I don't grok tape
sort. What's particularly interesting here is that when work_mem is
cranked up to 512MB, which is a high setting, but still not high
enough to do an internal sort, the difference closes in a bit. Instead
of 41 runs, there are only 2. Patched now takes 16.3 seconds.
Meanwhile, master is somewhat improved, and consistently takes 65
seconds to complete the sort.

This probably has something to do with CPU cache effects. I believe
that all world class external sorting algorithms are cache sensitive.
I'm not sure what the outcome would have been had there not been a
huge amount of memory available for the OS cache to use, which there
was. I think there is probably something to learn about how to improve
tape sort here.

Does anyone recall hearing complaints around higher work_mem settings
regressing performance?

[1]
http://www.isi.edu/integration/people/lerman/load.html?src=http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
, bottom of page
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Noah Misch
Date:
On Tue, Aug 05, 2014 at 07:32:35PM -0700, Peter Geoghegan wrote:
> select * from (select * from tags order by tag offset 100000000) ii;
> 
> Git master takes about 25 seconds to execute the query. Patched takes
> about 6.8 seconds. That seems very good, but this is not really new
> information.
> 
> However, with work_mem set low enough to get an external sort, the
> difference is more interesting. If I set work_mem to 10 MB, then the
> query takes about 10.7 seconds to execute with a suitably patched
> Postgres. Whereas on master, it consistently takes a full 69 seconds.
> That's the largest improvement I've seen so far, for any case.

Comparator cost affects external sorts more than it affects internal sorts.
When I profiled internal and external int4 sorting, btint4cmp() was 0.37% of
the internal sort profile and 10.26% of the external sort profile.

> I must admit that this did surprise me, but then I don't grok tape
> sort. What's particularly interesting here is that when work_mem is
> cranked up to 512MB, which is a high setting, but still not high
> enough to do an internal sort, the difference closes in a bit. Instead
> of 41 runs, there are only 2. Patched now takes 16.3 seconds.
> Meanwhile, master is somewhat improved, and consistently takes 65
> seconds to complete the sort.

> Does anyone recall hearing complaints around higher work_mem settings
> regressing performance?

Jeff Janes has mentioned it:
http://www.postgresql.org/message-id/CAMkU=1zVD82voXw1vBG1kWcz5c2G=SupGohPKM0ThwmpRK1Ddw@mail.gmail.com

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Aug 5, 2014 at 8:55 PM, Noah Misch <noah@leadboat.com> wrote:
> Comparator cost affects external sorts more than it affects internal sorts.
> When I profiled internal and external int4 sorting, btint4cmp() was 0.37% of
> the internal sort profile and 10.26% of the external sort profile.

Did you attempt to characterize where wall time was being spent? With
a difference like that, it seems likely that it's largely down to the
fact that quicksort is cache oblivious, rather than, say, that there
were more comparisons required.

>> I must admit that this did surprise me, but then I don't grok tape
>> sort. What's particularly interesting here is that when work_mem is
>> cranked up to 512MB, which is a high setting, but still not high
>> enough to do an internal sort, the difference closes in a bit. Instead
>> of 41 runs, there are only 2. Patched now takes 16.3 seconds.
>> Meanwhile, master is somewhat improved, and consistently takes 65
>> seconds to complete the sort.
>
>> Does anyone recall hearing complaints around higher work_mem settings
>> regressing performance?
>
> Jeff Janes has mentioned it:
> http://www.postgresql.org/message-id/CAMkU=1zVD82voXw1vBG1kWcz5c2G=SupGohPKM0ThwmpRK1Ddw@mail.gmail.com

I knew that I'd heard that at least once. Apparently some other
database systems have external sorts that tend to be faster than
equivalent internal sorts. I'd guess that that is an artifact of their
having a substandard internal sort, though.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Noah Misch
Date:
On Tue, Aug 05, 2014 at 09:32:59PM -0700, Peter Geoghegan wrote:
> On Tue, Aug 5, 2014 at 8:55 PM, Noah Misch <noah@leadboat.com> wrote:
> > Comparator cost affects external sorts more than it affects internal sorts.
> > When I profiled internal and external int4 sorting, btint4cmp() was 0.37% of
> > the internal sort profile and 10.26% of the external sort profile.
> 
> Did you attempt to characterize where wall time was being spent?

No.  I have the perf data files, if you'd like to see any particular report.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Aug 5, 2014 at 3:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Aug 5, 2014 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Per your other email, here's the patch again; hopefully I've got the
>> right stuff in the file this time.
>
> Your patch looks fine to me. I recommend committing patch 1 with these
> additions. I can't think of any reason to rehash the 2012 discussion.

I've committed the patch I posted yesterday.  I did not see a good
reason to meld that together in a single commit with the first of the
patches you posted; I'll leave you to revise that patch to conform
with this approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Aug 6, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've committed the patch I posted yesterday.  I did not see a good
> reason to meld that together in a single commit with the first of the
> patches you posted; I'll leave you to revise that patch to conform
> with this approach.

Okay. Attached is the same patch set, rebased on top of your commit
with appropriate amendments.

BTW, I haven't added any of the same "portability" measures that the
existing strxfrm() selfuncs.c caller has. Since Windows can only use
the optimization with the C locale, I don't believe we're affected,
although the link
"http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=99694"
is now dead (even if you have a Microsoft account), so I won't swear
that Visual Studio 2005 is unaffected in the "C" locale case. However,
only VS 2008 and later versions support 64-bit builds, and only 64-bit
builds support abbreviated keys within sort support for text, so it
doesn't matter. As for the Solaris bugs that convert_string_datum()
also refers to, well, I just don't have much sympathy for that case.
If a version of Solaris that was old in 2003 still has a buggy
strxfrm() implementation and wants to use Postgres 9.5, too bad. My
usage of strxfrm() is the pattern that the standard expects. The idea
that we should always do a dry-run, where strxfrm() is called with a
NULL pointer to check how much memory is required, just because
somebody's strxfrm() is so badly written as to feel entitled to write
past the end of my buffer is just absurd. That should probably be
removed from the existing convert_string_datum() strxfrm() call site
too. I suspect that no Oracle-supported version of Solaris will be
affected, but if I'm wrong, that's what we have a buildfarm for.

The nodeAgg.c FIXME item remains. I've added a new TODO comment which
suggests that we investigate more opportunistic attempts at getting
away with a cheap memcmp() that indicates equality. That continues to
only actually occur when the sort support comparator is called as a
tie-breaker.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Aug 5, 2014 at 8:55 PM, Noah Misch <noah@leadboat.com> wrote:
>> However, with work_mem set low enough to get an external sort, the
>> difference is more interesting. If I set work_mem to 10 MB, then the
>> query takes about 10.7 seconds to execute with a suitably patched
>> Postgres. Whereas on master, it consistently takes a full 69 seconds.
>> That's the largest improvement I've seen so far, for any case.
>
> Comparator cost affects external sorts more than it affects internal sorts.
> When I profiled internal and external int4 sorting, btint4cmp() was 0.37% of
> the internal sort profile and 10.26% of the external sort profile.

I took another look at this.

If I run "dd if=/dev/zero of=/home/pg/test", I can see with iotop that
that has about 45 M/s "total disk write" fairly sustainably, with
occasional mild blips during write-back. This is a Crucial mobile SSD,
with an ext4/lvm file system, and happens to be what is close at hand.

If I run the same external sorting query with a patched Postgres, I
see 24 M/s total disk write throughout. With master, it's about 6 M/s,
and falls to 0 during the final 36-way merge. I'm not sure if the same
thing occurs with patched during the final merge, because the
available resolution isn't good enough to be able to tell. Anyway,
it's pretty clear that when patched, the external sort on text is, if
not totally I/O bound, much closer to being I/O bound. A good external
sort algorithm should be at least close to totally I/O bound. This
makes I/O parallelism a viable strategy for speeding up sorts, where
it might not otherwise be. I've heard of people using a dedicated temp
tablespace disk with Postgres to speed up sorting, but that always
seemed to be about reducing the impact on the heap filesystem, or vice
versa. I've never heard of anyone using multiple disks to speed up
sorting with Postgres (though I don't presume it hasn't been done
at least somewhat effectively). However, with external sort benchmarks
(like http://sortbenchmark.org), using I/O parallelism strategically
seems to be table stakes for external sort entrants.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Aug 5, 2014 at 9:32 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I knew that I'd heard that at least once. Apparently some other
> database systems have external sorts that tend to be faster than
> equivalent internal sorts. I'd guess that that is an artifact of their
> having a substandard internal sort, though.

This *almost* applies to patched Postgres if you pick a benchmark that
is very sympathetic to my patch. To my surprise, work_mem = '10MB'
(which results in an external tape sort) is sometimes snapping at the
heels of a work_mem = '5GB' setting (which results in an in-memory
quicksort).

I have a 338 MB table that consists of a single text column of 8-byte
strings, with high cardinality. I ran VACUUM FREEZE, and took
all the usual precautions of that kind. On the test table n_distinct =
-1, and there is no discernible physical/logical correlation.

The external sort case stabilized as follows:

LOG:  duration: 9731.776 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;
LOG:  duration: 9742.948 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;
LOG:  duration: 9733.918 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;

The in-memory case stabilized as follows:

LOG:  duration: 0.059 ms  statement: set work_mem = '5GB';
LOG:  duration: 9665.731 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;
LOG:  duration: 9602.841 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;
LOG:  duration: 9609.107 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;

FWIW, master performed as follows with work_mem = '10MB':

LOG:  duration: 60456.943 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;
LOG:  duration: 60368.987 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;
LOG:  duration: 61223.942 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;

And master did quite a lot better with work_mem = '5GB', in a way that
fits with my prejudices about how quicksort is supposed to perform
relative to tape sort:

LOG:  duration: 0.060 ms  statement: set work_mem = '5GB';
LOG:  duration: 41697.659 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;
LOG:  duration: 41755.496 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;
LOG:  duration: 41883.888 ms  statement: select * from (select * from
besttest order by tt offset 10000000) i;

work_mem = '10MB' continues to be the external sort sweet spot for
this hardware, for whatever reason - I can add a few seconds to
execution time by increasing or decreasing the setting a bit. I'm
using an Intel Core i7-3520M CPU @ 2.90GHz, with a /proc/cpuinfo
reported L3 cache size of 4096 KB. I have been very careful to take
into account power saving features throughout all
experimentation/benchmarking of this patch and previous abbreviated
key patches - failing to do so is a good way to end up with complete
garbage when investigating this kind of thing.

Anyway, I'm not sure what this tells us about quicksort and tape sort,
but I think there might be an interesting and more general insight to
be gained here. I'd have thought that tape sort wastes memory
bandwidth by copying to operating system buffers to the extent that
things are slowed down considerably (this is after all a test
performed with lots of memory available, even when work_mem = '10
MB'). And even if that wasn't a significant factor, I'd expect
quicksort to win decisively anyway. Why does this happen?

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Aug 6, 2014 at 10:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
> This *almost* applies to patched Postgres if you pick a benchmark that
> is very sympathetic to my patch. To my surprise, work_mem = '10MB'
> (which results in an external tape sort) is sometimes snapping at the
> heels of a work_mem = '5GB' setting (which results in an in-memory
> quicksort).

Note that this was with a default temp_tablespaces setting that wrote
temp files on my home partition/SSD. With a /dev/shm/ temp tablespace,
tape sort edges ahead, gaining a couple of hundred milliseconds on
quicksort for this test case. It's actually faster.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Wed, Aug 6, 2014 at 7:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Aug 6, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I've committed the patch I posted yesterday.  I did not see a good
>> reason to meld that together in a single commit with the first of the
>> patches you posted; I'll leave you to revise that patch to conform
>> with this approach.
>
> Okay. Attached is the same patch set, rebased on top of your commit
> with appropriate amendments.

Two things:

+        * result.  Also, since strxfrm()/strcoll() require
NULL-terminated inputs,

In my original patch, I wrote NUL, as in the NUL character.  You've
changed it to NULL, but the original was correct.  NULL is a pointer
value that is not relevant here; the character with value 0 is NUL.


-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Aug 7, 2014 at 3:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 6, 2014 at 7:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Wed, Aug 6, 2014 at 1:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I've committed the patch I posted yesterday.  I did not see a good
>>> reason to meld that together in a single commit with the first of the
>>> patches you posted; I'll leave you to revise that patch to conform
>>> with this approach.
>>
>> Okay. Attached is the same patch set, rebased on top of your commit
>> with appropriate amendments.
>
> Two things:
>
> +        * result.  Also, since strxfrm()/strcoll() require
> NULL-terminated inputs,
>
> In my original patch, I wrote NUL, as in the NUL character.  You've
> changed it to NULL, but the original was correct.  NULL is a pointer
> value that is not relevant here; the character with value 0 is NUL.

Gah.  Hit send too soon.  Also, as much as I'd prefer to avoid
relitigating the absolutely stupid debate about how to expand the
buffers, this is no good:

+               tss->buflen1 = Max(len1 + 1, tss->buflen1 * 2);

If the first expansion is for a string >512MB and the second string is
longer than the first, this will exceed MaxAllocSize and error out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Aug 7, 2014 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> In my original patch, I wrote NUL, as in the NUL character.  You've
> changed it to NULL, but the original was correct.  NULL is a pointer
> value that is not relevant here; the character with value 0 is NUL.

"NULL-terminated string" seems like acceptable usage (e.g. [1]), but
I'll try to use the term NUL in reference to '\0' in the future to
avoid confusion.

[1] https://en.wikipedia.org/wiki/Null-terminated_string
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Aug 7, 2014 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Gah.  Hit send too soon.  Also, as much as I'd prefer to avoid
> relitigating the absolutely stupid debate about how to expand the
> buffers, this is no good:
>
> +               tss->buflen1 = Max(len1 + 1, tss->buflen1 * 2);
>
> If the first expansion is for a string >512MB and the second string is
> longer than the first, this will exceed MaxAllocSize and error out.

Fair point. I think this problem is already present in a few existing
places, but it shouldn't be. I suggest this remediation:

> +               tss->buflen1 = Max(len1 + 1, Min(tss->buflen1 * 2, (int) MaxAllocSize));

I too would very much prefer to not repeat that debate.  :-)
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Robert pointed out a case where varying character case of an English
> word did not alter the primary level bytes (and thus the poor man's
> normalized key was the same). He acknowledged that most of the entropy
> of the first 8 bytes of the string went into the first 8 bytes of the
> blob/key. This can actually be very useful to the optimization in some
> cases. In particular, with most Latin alphabets you'll see the same
> pattern when diacritics are used. This means that even though the
> original string has (say) an accented character that would take 2
> bytes to store in UTF-8, the weight in the primary level is the same
> as an equivalent unaccented character (and so only takes one byte to
> store at that level, with differences only in subsequent levels).
> Whole strxfrm() blobs are the same length regardless of how many
> accents appear in otherwise equivalent original Latin string, and you
> get exactly the same high concentrations of entropy in the first 8
> bytes in pretty much all Latin alphabets (the *absence* of diacritics
> is stored in later weight levels too, even with the "en_US.UTF-8"
> collation).

There are many other interesting cases where en_US.UTF-8, and
presumably all other collations concentrate much more entropy into
leading bytes than might be expected. Consider:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "abc"
"abc" -> 0c0d0e0109090901090909 (11 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# abc"
"# abc" -> 0c0d0e01090909010909090101760135 (16 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "***** abc"
"***** abc" -> 0c0d0e010909090109090901017301730173017301730135 (24 bytes)

and, to show you what this looks like when the primary
weights/original codepoints appear backwards:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "cba"
"cba" -> 0e0d0c0109090901090909 (11 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "# cba"
"# cba" -> 0e0d0c01090909010909090101760135 (16 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "***** cba"
"***** cba" -> 0e0d0c010909090109090901017301730173017301730135 (24 bytes)

Evidently, the implementation always places primary weights first,
corresponding to "abc" (and later "cba") - the bytes "\0c\0d\0e" (and
later "\0e\0d\0c") - no matter how many "extraneous" characters are
placed in front. They're handled later. Spaces don't appear in the
primary weight level at all:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "a b c"
"a b c" -> 0c0d0e01090909010909090102350235 (16 bytes)

Lots of punctuation-type characters will not affect the primary weight level:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "%@!()\/#-+,:^~? a b c"
"%@!()\/#-+,:^~? a b c" ->
0c0d0e0109090901090909010177015d013e01500152017401420176013a0178013b013d011201170140013502350235
(48 bytes)

Some non-alphabetic ASCII characters will affect the primary level,
though. For example:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "1.) a b c"
"1.) a b c" -> 030c0d0e010909090901090909090102440152013502350235 (25 bytes)

There is one extra byte here, in front of the "abc" bytes "\0c\0d\0e",
a primary weight "\03" corresponding to '1'.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Aug 7, 2014 at 1:09 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Aug 6, 2014 at 10:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> This *almost* applies to patched Postgres if you pick a benchmark that
>> is very sympathetic to my patch. To my surprise, work_mem = '10MB'
>> (which results in an external tape sort) is sometimes snapping at the
>> heels of a work_mem = '5GB' setting (which results in an in-memory
>> quicksort).
>
> Note that this was with a default temp_tablespaces setting that wrote
> temp files on my home partition/SSD. With a /dev/shm/ temp tablespace,
> tape sort edges ahead, and has a couple of hundred milliseconds on
> quicksort for this test case. It's actually faster.

I decided to do a benchmark of a very large, sympathetic sort. Again,
this was with 8 byte random text strings, but this time with a total
of 500 million tuples. This single-column table is about 21 GB. This
ran on a dedicated server with 32 GB of ram. I took all the usual
precautions (the cache was warmed, tuples were frozen, etc).

Master took 02:02:20.5561 to sort the data with a 10 MB work_mem
setting, without a ramdisk in temp_tablespaces. With a
temp_tablespaces /dev/shm ramdisk, there was only a very small
improvement that left total execution time at 02:00:58.51878 (while a
later repeat attempt took 02:00:41.385937) -- an improvement hardly
worth bothering with.

Patched Postgres took 00:16:13.228053, again with a work_mem of 10 MB,
but no ramdisk. When I changed temp_tablespaces to use the same
ramdisk, this went down to 00:11:58.77301, a significant improvement.
This is clearly because the data directory was on a filesystem on
spinning disks, and more I/O bandwidth (real or simulated) helps
external sorts. Since this machine only has 32GB of ram, and a
significant proportion of that must be used for shared_buffers (8GB)
and the OS cache, I think there is a fair chance that a more capable
I/O subsystem could be used to get appreciably better results using
otherwise identical hardware.

Comparing like with like, the ramdisk patched run was over 10 times
faster than master with the same ramdisk. While that disparity is
obviously in itself very significant, I think the disparity in how
much faster things were with a ramdisk for patched, but not for master
is also significant.

I'm done with sympathetic cases now. I welcome unsympathetic ones, or
more broadly representative large tests. It's hard to come up with a
benchmark that isn't either sympathetic or very sympathetic, or a
pathologically bad case. There is actually a simple enough C program for
generating test input, "gensort", which is used by sortbenchmark.org:

http://www.ordinal.com/gensort.html

(I suggest building without support for "the SUMP Pump library", by
modifying the Makefile before building)

What's interesting about gensort is that there is a "skew" option.
Without it, I can generate totally random ASCII keys. But with it,
there is a tendency for there to be a certain amount of redundancy
between keys in their first few bytes. This is intended to limit the
effectiveness of abbreviation-type optimizations for sortbenchmark.org
"Daytona Sort" entrants:

http://sortbenchmark.org/FAQ-2014.html ("Indy vs. Daytona")

Daytona sort is basically a benchmark that focuses on somewhat
commercially representative data (often ASCII data), with sneaky
data-aware tricks forbidden, as opposed to totally artificial
uniformly distributed random binary data, which is acceptable for
their "Indy sort" benchmark that gets to use every trick in the book.
They note here that Daytona entrants should "not be overly dependent
on the uniform and random distribution of key values in the sort
input". They are allowed to be somewhat dependent on it, though - for
one thing, keys are always exactly 10 bytes. They must merely "be able
to sort the alternate, skewed-keys input data set in an elapsed time
of no more than twice the elapsed time of the benchmark entry".

It might be useful for Robert or other reviewers to hold the
abbreviated keys patch to a similar standard, possibly by using
gensort, or their own modified version. I've shown a sympathetic case
that is over 10 times faster, with some other less sympathetic cases
that were still pretty good, since tie-breakers were generally able to
get away with a cheap memcmp(). There have also been some tests showing
pathologically bad cases for the optimization. The middle ground has now
become interesting, and gensort might offer a half-reasonable way to
generate tests that are balanced. I've looked at the gensort code, and
it seems easy enough to understand and modify for our purposes. You
might want to look at multiple cases with this constant modified, for
example:

#define SKEW_BYTES     6

This constant controls the number of leading bytes that come from a
table of skew bytes.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Aug 7, 2014 at 4:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 7, 2014 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Gah.  Hit send too soon.  Also, as much as I'd prefer to avoid
>> relitigating the absolutely stupid debate about how to expand the
>> buffers, this is no good:
>>
>> +               tss->buflen1 = Max(len1 + 1, tss->buflen1 * 2);
>>
>> If the first expansion is for a string >512MB and the second string is
>> longer than the first, this will exceed MaxAllocSize and error out.
>
> Fair point. I think this problem is already present in a few existing
> places, but it shouldn't be. I suggest this remediation:
>
>> +               tss->buflen1 = Max(len1 + 1, Min(tss->buflen1 * 2, (int) MaxAllocSize));
>
> I too would very much prefer to not repeat that debate.  :-)

Committed that way.  As the patch is by and large the same as what I
submitted for this originally, I credited myself as first author and
you as second author.  I hope that seems fair.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Aug 14, 2014 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Committed that way.  As the patch is by and large the same as what I
> submitted for this originally, I credited myself as first author and
> you as second author.  I hope that seems fair.


I think that's more than fair. Thanks!

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Aug 14, 2014 at 1:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 14, 2014 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Committed that way.  As the patch is by and large the same as what I
>> submitted for this originally, I credited myself as first author and
>> you as second author.  I hope that seems fair.
>
> I think that's more than fair. Thanks!

Great.  BTW, I notice to my chagrin that 'reindex table
some_table_with_an_indexed_text_column' doesn't benefit from this,
apparently because tuplesort_begin_index_btree is used, and it knows
nothing about sortsupport.  I have a feeling there's a good reason for
that, but I don't remember what it is; do you?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Aug 14, 2014 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Great.  BTW, I notice to my chagrin that 'reindex table
> some_table_with_an_indexed_text_column' doesn't benefit from this,
> apparently because tuplesort_begin_index_btree is used, and it knows
> nothing about sortsupport.  I have a feeling there's a good reason for
> that, but I don't remember what it is; do you?

No, I don't, but I'm pretty sure that's because there is no good
reason. I guess the original sort support functions were most
compelling for the onlyKey case. We can't do that with
B-Tree (at least not without another qsort() specialization, like
qsort_tuple_btree()), because there is additional tie-breaker logic to
sort on item pointer within comparetup_index_btree(). I remember
arguing that that wasn't necessary, because of course I wanted to make
sortsupport as applicable as possible, but I realize in hindsight that
I was probably wrong about that.

Clearly there are still benefits to be had for cluster and B-Tree
tuplesorts. It looks like more or less a simple matter of programming
to me. _bt_mkscankey_nodata() tuplesort call sites like
tuplesort_begin_index_btree() can be taught to produce an equivalent
sortsupport state. I expect that we'll get around to fixing the
problem at some point before too long.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Aug 14, 2014 at 11:55 AM, Peter Geoghegan <pg@heroku.com> wrote:
> Clearly there are still benefits to be had for cluster and B-Tree
> tuplesorts.

In a world where this general support exists, abbreviated keys could
be made to work with both of those, but not datum tuplesorts, because
that case needs to support tuplesort_getdatum(). Various datum
tuplesort clients expect to be able to retrieve the original
representation stored in SortTuple.datum1, and there isn't much we can
do about that.

This is a bit messy, because now you have heap and datum cases able to
use the onlyKey qsort specialization (iff the opclass doesn't provide
abbreviated key support in the heap case), while all cases except the
datum case support abbreviated keys. It's not that bad though; at
least the onlyKey qsort specialization doesn't have to care about
abbreviated keys, which makes sense because it's generally only
compelling for pass-by-value types.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Aug 14, 2014 at 1:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 14, 2014 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Committed that way.  As the patch is by and large the same as what I
>> submitted for this originally, I credited myself as first author and
>> you as second author.  I hope that seems fair.
>
> I think that's more than fair. Thanks!

Patch 0002 no longer applies; please rebase.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Aug 22, 2014 at 7:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Patch 0002 no longer applies; please rebase.

I attach rebased patch.

Note that there is currently a bug in the master branch:

+   if (len2 >= tss->buflen2)
+   {
+       pfree(tss->buf2);
+       tss->buflen1 = Max(len2 + 1, Min(tss->buflen2 * 2, MaxAllocSize));
+       tss->buf2 = MemoryContextAlloc(ssup->ssup_cxt, tss->buflen2);
+   }

Thanks
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Fri, Aug 22, 2014 at 2:46 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Aug 22, 2014 at 7:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Patch 0002 no longer applies; please rebase.
>
> I attach rebased patch.
>
> Note that there is currently a bug in the master branch:
>
> +   if (len2 >= tss->buflen2)
> +   {
> +       pfree(tss->buf2);
> +       tss->buflen1 = Max(len2 + 1, Min(tss->buflen2 * 2, MaxAllocSize));
> +       tss->buf2 = MemoryContextAlloc(ssup->ssup_cxt, tss->buflen2);
> +   }

You didn't say what the bug is; after inspection, I believe it's that
line 4 begins with tss->buflen1 rather than tss->buflen2.

I have committed a fix for that problem.  Let me know if I missed
something else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Aug 26, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I have committed a fix for that problem.  Let me know if I missed
> something else.

Yes, that's all I meant.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Aug 26, 2014 at 4:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Aug 26, 2014 at 12:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I have committed a fix for that problem.  Let me know if I missed
>> something else.
>
> Yes, that's all I meant.

OK.  The patch needs another rebase as a result of that fix.  In
general, you would probably do well to separate unrelated changes out
of your patches and submit a series of patches, one with the unrelated
fixes and a second with the new changes that applies on top of the
first, instead of munging two sets of changes together.

I found the capitalization of sortKeyOther, sortKeyAbbreviated, and
sortKeyTiebreak somewhat surprising.  The great majority of precedents
in src/include use all-caps separated by underscores for this, i.e.
SORT_KEY_OTHER.  I think it'd be better to use that style here, too.
I also don't particularly like the content of the naming:
SORT_KEY_OTHER does not obviously mean "don't use the abbreviated-key
optimization".  We could use ABBREVIATE_KEYS_YES and
ABBREVIATE_KEYS_NO, but what to do about the "tiebreak" value?  Hmm.

Maybe we should get rid of the tiebreak case altogether: the second
SortSupport object is just containing all the same values as the first
one, with only the comparator being different.  Can't we just have
both the abbreviated-comparator and authoritative-comparator as
members of the SortSupport, and call whichever one is appropriate,
instead of copying the whole SortSupport object?  That would have the
nice property of avoiding the need for special handling in
reversedirection_heap().

In bttextcmp_abbreviated, I wondered why you used memcmp() rather than
just testing e.g. a < b ? -1 : a > b ? 1 : 0.  Then I realized the
latter is probably wrong on little-endian machines.   Not sure if a
comment is warranted.

This comment in sortsupport.h seems to be no longer true, as of commit
1d41739e5a04b0e93304d24d864b6bfa3efc45f2:
* (However, all opclasses that provide BTSORTSUPPORT are required to provide
* the comparator function.)

Independent of this patch, it seems to me that that comment should get
deleted, since we now allow that.

Most places that use a SortSupportData initialize ssup.position
explicitly, but tuplesort_begin_datum() doesn't.  That's an
inconsistency that should be fixed, but I'm not sure which direction
is best.  Since enum SortKeyPosition is (quite rightly) defined such
that sortKeyOther (i.e. no optimization) is 0, all of the
initializations of pre-zeroed structures are redundant.  It might be
just as well to leave those out entirely, and only initialize it in
those places that want to opt *in* to the optimization.  But if not,
tuplesort_begin_datum() shouldn't be the only one missing the party.

I think that the naming of the new SortSupport members is not the
greatest.  There's nothing (outside of the detailed comments, of
course) to indicate that "position" or "converter" or
"abort_conversion" or "tiebreak" are things that pertain specifically
to abbreviated keys. And I think that's a must, because the
SortSupport comments clearly foresee that it might oversee multiple
types of optimizations.  There are several ways to accomplish this.
One is to give them all a common prefix (like "ak_").  Another is to
just rename them a bit.  For example, I think "converter" could be
called something like "abbreviate_key" and "abort_conversion" could be
called something like "abort_key_abbreviation".  I think I like that
better than a common prefix.  I'm not exactly sure what to recommend
for "position" (but I note that it is not, in any relevant sense, the
position of anything) and "tiebreak", and the right answers might
depend on how we resolve some of the other comments noted above.

There are similar problems with the naming of the fields in
Tuplesortstate.  You can't just have a flag in there called "aborted"
that relates only to aborting one very specific thing and not the
whole tuplesort.  I grant you there's a comment explaining that the
field relates specifically to the abbreviated key optimization, but it
should also be named in a way that makes that self-evident.  There
are other instances of this problem elsewhere; e.g. bttext_abort is
not an abort function for bttext, but something much more specific.

n_distinct is a cardinality estimate, but AFAIK not using hyperloglog.
What is different about that problem vs. this one that makes each
method correct for its respective case?  Or should analyze be using
hyperloglog too?

Is it the right decision to suppress the abbreviated-key optimization
unconditionally on 32-bit systems and on Darwin?  There's certainly
more danger, on those platforms, that the optimization could fail to
pay off.  But it could also win big, if in fact the first character or
two of the string is enough to distinguish most rows, or if Darwin
improves their implementation in the future.  If the other defenses
against pathological cases in the patch are adequate, I would think
it'd be OK to remove the hard-coded checks here and let those cases
use the optimization or not according to its merits in particular
cases.  We'd want to look at what the impact of that is, of course,
but if it's bad, maybe those other defenses aren't adequate anyway.

Does the lengthy comment in btsortsupport_worker() survive pgindent?

+ * Based on Hideaki Ohno's C++ implementation.  The copyright term's of Ohno's

terms, not term's.

+       /* memset() so untouched bytes are NULL */
+       /* By convention, we use buffer 1 to store and NULL terminate text */
+       /* Just like strcoll(), strxfrm() expects a NULL-terminated string */

Please use NUL or \0.

+        * There is no special handling of the C locale here.  strxfrm() is used
+        * indifferently.

Comments should explain the C code, not just recapitulate it.

+        * First, Hash key proper, or a significant fraction of it.
Mix in length

There's no reason why "Hash" should be capitalized here.

+        * Every Datum byte is compared.  This is safe because the
strxfrm() blob
+        * is itself NULL-terminated, leaving no danger of
misinterpreting any NULL
+        * bytes not intended to be interpreted as logically representing
+        * termination.

Reading from an uninitialized byte could provide a valgrind warning
even if it's harmless, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Sep 2, 2014 at 12:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> < Various points on style>

Okay, fair enough.

> n_distinct is a cardinality estimate, but AFAIK not using hyperloglog.
> What is different about that problem vs. this one that makes each
> method correct for its respective case?  Or should analyze be using
> hyperloglog too?

HyperLogLog isn't sample-based - it's useful for streaming a set and
accurately tracking its cardinality with fixed overhead. I think that
the first place that it found use was for things like network switches
(aside: it's a pretty recent, trendy algorithm, but FWIW I think that
actually makes sense. It's really quite impressive.). Whereas,
../commands/analyze.c uses an "Estimate [of] the number of distinct
values using the estimator proposed by Haas and Stokes in IBM Research
Report RJ 10025", which is based on sampling (I think that there are
reasons to be skeptical of sample-based cardinality estimators in
general). Right now, given the way ANALYZE does random sampling, we
probably save little to no actual I/O by doing random sampling as
opposed to reading everything. I certainly hope that that doesn't stay
true forever. Simon has expressed interest in working on block-based
sampling. n_distinct is only one output that we need to produce as part
of ANALYZE, of course.

So, the way we estimate n_distinct is currently not very good, but
maybe that's inevitable with random sampling. I was surprised by the
fact that you didn't seem to consider it that much of a problem in
your pgCon talk on the planner, since I saw it a couple of times back
when I was a consultant. I haven't seen it that many times, though.
The main practical benefit is that HLL isn't going to give you an
answer that's wildly wrong, and the main disadvantage is that it
expects to observe every element in the set, which in this instance is
no disadvantage at all. There are guarantees around the accuracy of
estimates, typically very good guarantees.

> Is it the right decision to suppress the abbreviated-key optimization
> unconditionally on 32-bit systems and on Darwin?  There's certainly
> more danger, on those platforms, that the optimization could fail to
> pay off.  But it could also win big, if in fact the first character or
> two of the string is enough to distinguish most rows, or if Darwin
> improves their implementation in the future.  If the other defenses
> against pathological cases in the patch are adequate, I would think
> it'd be OK to remove the hard-coded checks here and let those cases
> use the optimization or not according to its merits in particular
> cases.  We'd want to look at what the impact of that is, of course,
> but if it's bad, maybe those other defenses aren't adequate anyway.

I'm not sure. Perhaps the Darwin thing is a bad idea because no one is
using Macs to run real database servers. Apple haven't had a server
product in years, and typically people only use Postgres on their Macs
for development. We might as well have coverage of the new code for
the benefit of Postgres hackers that favor Apple machines. Or, to look
at it another way, the optimization is so beneficial that it's
probably worth the risk, even for more marginal cases.

8 primary weights (the leading 8 bytes, frequently isomorphic to the
first 8 Latin characters, regardless of whether or not they have
accents/diacritics, or punctuation/whitespace) is twice as many as 4.
But every time you add a byte of space to the abbreviated
representation that can resolve a comparison, the number of
unresolvable-without-tiebreak comparisons (in general) is, I imagine,
reduced considerably. Basically, 8 bytes is way better than twice as
good as 4 bytes in terms of its effect on the proportion of
comparisons that are resolved only with abbreviated keys. Even still,
I suspect it's still worth it to apply the optimization with only 4.

You've seen plenty of suggestions on assessing the applicability of
the optimization from me. Perhaps you have a few of your own.

> Does the lengthy comment in btsortsupport_worker() survive pgindent?

I'll look into it.

> Reading from an uninitialized byte could provide a valgrind warning
> even if it's harmless, I think.

That wouldn't be harmless - it would probably result in incorrect
answers in practice, and would certainly be unspecified. However, I'm
not reading uninitialized bytes. I call memset() so that the trailing
bytes are zeroed in the event of the final strxfrm() blob being less
than 8 bytes (which can happen even on glibc with en_US.UTF-8). It
cannot be harmful to memcmp() every Datum byte if the remaining bytes
are always initialized to NUL.


-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Sep 2, 2014 at 12:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Maybe we should get rid of the tiebreak case altogether: the second
> SortSupport object is just containing all the same values as the first
> one, with only the comparator being different.  Can't we just have
> both the abbreviated-comparator and authoritative-comparator as
> members of the SortSupport, and call whichever one is appropriate,
> instead of copying the whole SortSupport object?  That would have the
> nice property of avoiding the need for special handling in
> reversedirection_heap().

I thought about that. I think that there are other disadvantages to
explicitly having a second comparator, associated with the same sort
support state as the authoritative comparator: ApplySortComparator()
expects to compare using ssup->comparator(). You'd have to duplicate
that for your alternative/abbreviated comparator. It might be to our
advantage to use the same ApplySortComparator() inline comparator
multiple times in routines like comparetup_heap(), if not for clarity
then for performance (granted, that isn't something I have any
evidence for, but I wouldn't be surprised if it was noticeable). It
might also be to our advantage to have a separate work space.

By having a second comparator, you're making the leading
key/abbreviated comparison special, rather than the (hopefully) less
common tie-breaker case. I find it more logical to structure the code
such that the leading/abbreviated comparison is the "regular state",
with a tie-breaker on the "irregular"/authoritative state accessed
through indirection from the leading key state, reflecting our
preference for having most comparisons resolved using cheap
abbreviated comparisons. It's not as if I feel that strongly about it,
though.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Sep 2, 2014 at 12:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That would have the
> nice property of avoiding the need for special handling in
> reversedirection_heap().


Oh, BTW, we probably don't need that anyway, since I'm already
disabling abbreviated key optimization in the event of a bounded sort
on the grounds that in general it doesn't pay for itself. So maybe
reversedirection_heap() (and perhaps other tuple-type variants of
same) should merely note: "If ever abbreviated keys with top-N heap
sort start to make sense, the logic to invert ordering would have to
be duplicated here". Technically, right now what I've added to
reversedirection_heap() is dead code. What do you think of that?

It's still not clear that the explicit "tie-breaker" comparator
introspection is paying for itself: as I've already pointed out, we
may well be better off always trying a memcmp() tie-breaker first, and
not bothering with the distinction between whether or not a given
comparison has been called as a tie-breaker, or whether there never
was an abbreviated comparison to have to tie-break in the first place.
Even if we're better off always being optimistic about a "try memcmp()
== 0" paying off for text, there are other datatypes, and it might be
a more useful distinction for those other datatypes. Which is not to
suggest that it's clear that it isn't a useful distinction to make in
the case of text.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Sep 2, 2014 at 12:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Most places that use a SortSupportData initialize ssup.position
> explicitly, but tuplesort_begin_datum() doesn't.  That's an
> inconsistency that should be fixed, but I'm not sure which direction
> is best.

I'm not sure what you mean. tuplesort_begin_datum() only
uses/initializes the Tuplesortstate.onlyKey field, and in fact that
did have its ssup.position initialized to zero in the last revision.
This indicates that we should not apply the optimization, because:

/*
 * Conversion to abbreviated representation infeasible in the Datum case.
 * It must be possible to subsequently fetch original datum values within
 * tuplesort_getdatum(), which would require special-case preservation of
 * original values that we prefer to avoid.
 */
 
state->onlyKey->position = sortKeyOther;

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
Attached revision:

* Still doesn't address the open question of whether or not we should
optimistically always try "memcmp() == 0" on tiebreak. I still lean
towards "yes".

* Leaves open the question of what to do when we can't use the
abbreviated keys optimization just because a datum tuplesort is
preferred when sorting single-attribute tuples (recall that datum case
tuplesorts cannot use abbreviated keys). We want to avail of tuplesort
datum sorting where we can, except when abbreviated keys are
available, which presumably tips the balance in favor of heap tuple
sorting even when sorting on only one attribute, simply because we
can then use abbreviated keys. I'm thinking in particular of
nodeAgg.c, which is an important case.

There are still FIXME/TODO comments for each of these two points.
Further, this revised/rebased patch set:

* Incorporates your feedback on stylistic issues, with changes
confined to their own commit (on top of earlier commits that are
almost, but not quite, the same as the prior revision that your
remarks apply to).

* No longer does anything special within reversedirection_heap(),
since that is unnecessary, as it's only used by bounded sorts, which
aren't a useful target for abbreviated keys. This is noted. There is
no convenient point to add a defensive assertion against this, so I
haven't.

* Updates comments in master in a broken-out way, reflecting opclass
contract with sortsupport as established by
1d41739e5a04b0e93304d24d864b6bfa3efc45f2, that is convenient to apply
to and commit in the master branch immediately.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Sep 2, 2014 at 4:41 PM, Peter Geoghegan <pg@heroku.com> wrote:
> HyperLogLog isn't sample-based - it's useful for streaming a set and
> accurately tracking its cardinality with fixed overhead.

OK.

>> Is it the right decision to suppress the abbreviated-key optimization
>> unconditionally on 32-bit systems and on Darwin?  There's certainly
>> more danger, on those platforms, that the optimization could fail to
>> pay off.  But it could also win big, if in fact the first character or
>> two of the string is enough to distinguish most rows, or if Darwin
>> improves their implementation in the future.  If the other defenses
>> against pathological cases in the patch are adequate, I would think
>> it'd be OK to remove the hard-coded checks here and let those cases
>> use the optimization or not according to its merits in particular
>> cases.  We'd want to look at what the impact of that is, of course,
>> but if it's bad, maybe those other defenses aren't adequate anyway.
>
> I'm not sure. Perhaps the Darwin thing is a bad idea because no one is
> using Macs to run real database servers. Apple haven't had a server
> product in years, and typically people only use Postgres on their Macs
> for development. We might as well have coverage of the new code for
> the benefit of Postgres hackers that favor Apple machines. Or, to look
> at it another way, the optimization is so beneficial that it's
> probably worth the risk, even for more marginal cases.
>
> 8 primary weights (the leading 8 bytes, frequently isomorphic to the
> first 8 Latin characters, regardless of whether or not they have
> accents/diacritics, or punctuation/whitespace) is twice as many as 4.
> But every time you add a byte of space to the abbreviated
> representation that can resolve a comparison, the number of
> unresolvable-without-tiebreak comparisons (in general) is, I imagine,
> reduced considerably. Basically, 8 bytes is way better than twice as
> good as 4 bytes in terms of its effect on the proportion of
> comparisons that are resolved only with abbreviated keys. Even still,
> I suspect it's still worth it to apply the optimization with only 4.
>
> You've seen plenty of suggestions on assessing the applicability of
> the optimization from me. Perhaps you have a few of your own.

My suggestion is to remove the special cases for Darwin and 32-bit
systems and see how it goes.

> That wouldn't be harmless - it would probably result in incorrect
> answers in practice, and would certainly be unspecified. However, I'm
> not reading uninitialized bytes. I call memset() so that the trailing
> bytes are zeroed in the event of the final strxfrm() blob being less
> than 8 bytes (which can happen even on glibc with en_US.UTF-8). It
> cannot be harmful to memcmp() every Datum byte if the remaining bytes
> are always initialized to NUL.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Sep 2, 2014 at 7:51 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 2, 2014 at 12:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Maybe we should get rid of the tiebreak case altogether: the second
>> SortSupport object is just containing all the same values as the first
>> one, with only the comparator being different.  Can't we just have
>> both the abbreviated-comparator and authoritative-comparator as
>> members of the SortSupport, and call whichever one is appropriate,
>> instead of copying the whole SortSupport object?  That would have the
>> nice property of avoiding the need for special handling in
>> reversedirection_heap().
>
> I thought about that. I think that there are other disadvantages to
> explicitly having a second comparator, associated with the same sort
> support state as the authoritative comparator: ApplySortComparator()
> expects to compare using ssup->comparator(). You'd have to duplicate
> that for your alternative/abbreviated comparator. It might be to our
> advantage to use the same ApplySortComparator() inline comparator
> multiple times in routines like comparetup_heap(), if not for clarity
> then for performance (granted, that isn't something I have any
> evidence for, but I wouldn't be surprised if it was noticeable). It
> might also be to our advantage to have a separate work space.

Well, the additional code needed in ApplySortComparator would be about
two lines long.  Maybe that's going to turn out to be too expensive to
do in all cases, so that we'll end up with ApplySortComparator and
ApplyAbbreviatedSortComparator, but even if we do that seems less
heavyweight than spawning a whole separate object for the tiebreak
case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Sep 3, 2014 at 2:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> My suggestion is to remove the special cases for Darwin and 32-bit
> systems and see how it goes.

I guess it should still be a configure option, then. Or maybe there
should just be a USE_ABBREV_KEYS macro within pg_config_manual.h.

Are you suggesting that the patch be committed with the optimization
enabled on all platforms by default, with the option to revisit
disabling it if and when there is user push-back? I don't think that's
unreasonable, given the precautions now taken, but I'm just not sure
that's what you mean.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * Still doesn't address the open question of whether or not we should
> optimistically always try "memcmp() == 0" on tiebreak. I still lean
> towards "yes".

Let m be the cost of a memcmp() that fails near the end of the
strings; and let s be the cost of a strcoll that does likewise.
Clearly s > m.  But approximately what is s/m on platforms where you
can test?  Say, with 100 byte string, in a few different locales.

If for example s/m > 100 then it's a no-brainer, because in the worst
case we're adding 1% overhead, and in the best case we're saving 99%.
OTOH, if s/m < 2 then I almost certainly wouldn't do it, because in
the worst case we're adding >50% overhead, and in the best case we're
saving <50%.  That seems like it's doubling down on the abbreviated
key stuff to work mostly all the time, and I'm not prepared to make
that bet.  There is of course a lot of daylight between a 2-to-1 ratio
and a 100-to-1 ratio and I expect the real value is somewhere in the
middle (probably closer to 2); I haven't at this time made up my mind
what value would make this worthwhile, but I'd like to know what the
real numbers are.

> * Leaves open the question of what to do when we can't use the
> abbreviated keys optimization just because a datum tuplesort is
> preferred when sorting single-attribute tuples (recall that datum case
> tuplesorts cannot use abbreviated keys). We want to avail of tuplesort
> datum sorting where we can, except when abbreviated keys are
> available, which presumably tips the balance in favor of heap tuple
> sorting even when sorting on only one attribute, simply because we
> can then use abbreviated keys. I'm thinking in particular of
> nodeAgg.c, which is an important case.

I favor leaving this issue to a future patch.  The last thing this
patch needs is more changes that someone might potentially dislike.
Let's focus on getting the core thing working, and then you can
enhance it once we all agree that it is.

On the substance of this issue, I suspect that for pass-by-value data
types it can hardly be wrong to use the datum tuplesort approach; but
it's possible we will want to disable it for pass-by-reference data
types when the abbreviated-key infrastructure is available.  That will
lose if it turns out that the abbreviated keys aren't capturing enough
of the entropy, but maybe we'll decide that's OK.  Or maybe not.  But
I don't think it's imperative that this patch make a change in that
area, and indeed, in the interest of keeping separate changes
isolated, I think it's better if it doesn't.

> There are still FIXME/TODO comments for each of these two points.
> Further, this revised/rebased patch set:
>
> * Incorporates your feedback on stylistic issues, with changes
> confined to their own commit (on top of earlier commits that are
> almost, but not quite, the same as the prior revision that your
> remarks apply to).
>
> * No longer does anything special within reversedirection_heap(),
> since that is unnecessary, as it's only used by bounded sorts, which
> aren't a useful target for abbreviated keys. This is noted. There is
> no convenient point to add a defensive assertion against this, so I
> haven't.
>
> * Updates comments in master in a broken-out way, reflecting opclass
> contract with sortsupport as established by
> 1d41739e5a04b0e93304d24d864b6bfa3efc45f2, that is convenient to apply
> to and commit in the master branch immediately.

Thanks, committed that one.  The remaining patches can be squashed
into a single one, as none of them can be applied without the others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Wed, Sep 3, 2014 at 5:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Sep 3, 2014 at 2:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> My suggestion is to remove the special cases for Darwin and 32-bit
>> systems and see how it goes.
>
> I guess it should still be a configure option, then. Or maybe there
> should just be a USE_ABBREV_KEYS macro within pg_config_manual.h.
>
> Are you suggesting that the patch be committed with the optimization
> enabled on all platforms by default, with the option to revisit
> disabling it if and when there is user push-back? I don't think that's
> unreasonable, given the precautions now taken, but I'm just not sure
> that's what you mean.

That's what I mean.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Sep 4, 2014 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> * Still doesn't address the open question of whether or not we should
>> optimistically always try "memcmp() == 0" on tiebreak. I still lean
>> towards "yes".
>
> Let m be the cost of a memcmp() that fails near the end of the
> strings; and let s be the cost of a strcoll that does likewise.
> Clearly s > m.  But approximately what is s/m on platforms where you
> can test?  Say, with 100 byte string, in a few different locales.

Just to be clear: I imagine you're more or less sold on the idea of
testing equality in the event of a tie-break, where the leading 8
primary weight bytes are already known to be equal (and the full text
string lengths also match); the theory of operation behind testing how
good a proxy for full key cardinality abbreviated key cardinality is
is very much predicated on that. We can still win big with very low
cardinality sets this way, which are an important case. What I
consider an open question is whether or not we should do that on the
first call when there is no abbreviated comparison, such as on the
second or subsequent attribute in a multi-column sort, in the hope
that equality will just happen to be indicated.

> If for example s/m > 100 then it's a no-brainer, because in the worst
> case we're adding 1% overhead, and in the best case we're saving 99%.
> OTOH, if s/m < 2 then I almost certainly wouldn't do it, because in
> the worst case we're adding >50% overhead, and in the best case we're
> saving <50%.  That seems like it's doubling down on the abbreviated
> key stuff to work mostly all the time, and I'm not prepared to make
> that bet.  There is of course a lot of daylight between a 2-to-1 ratio
> and a 100-to-1 ratio and I expect the real value is somewhere in the
> middle (probably closer to 2); I haven't at this time made up my mind
> what value would make this worthwhile, but I'd like to know what the
> real numbers are.

Well, we can only lose when the strings happen to be the same size. So
that's something. But I'm willing to consider the possibility that the
memcmp() is virtually free. I would only proceed with this extra
optimization if that is actually the case. Modern CPUs are odd things.
Branch prediction/instruction pipelining, and the fact that we're
frequently stalled on cache misses might combine to make it
effectively the case that the opportunistic memcmp() is free. I could
be wrong about that, and I'm certainly wrong if you test large enough
strings with differences only towards the very end, but it seems
reasonable to speculate that it would work well with appropriate
precautions (in particular, don't do it when the strings are huge).
Let me try and come up with some numbers for a really unsympathetic
case, since you've already seen sympathetic numbers. I think the
sympathetic country/province/city sort test case [1] is actually
fairly representative; sort keys *are* frequently correlated like
that, implying that there are lots of savings to be had by being
"memcmp() == 0 optimistic" when resolving comparisons using the second
or subsequent attribute.

>> * Leaves open the question of what to do when we can't use the
>> abbreviated keys optimization just because a datum tuplesort is
>> preferred when sorting single-attribute tuples (recall that datum case
>> tuplesorts cannot use abbreviated keys). We want to avail of tuplesort
>> datum sorting where we can, except when abbreviated keys are
>> available, which presumably tips the balance in favor of heap tuple
>> sorting even when sorting on only one attribute, simply because we
>> can then use abbreviated keys. I'm thinking in particular of
>> nodeAgg.c, which is an important case.
>
> I favor leaving this issue to a future patch.  The last thing this
> patch needs is more changes that someone might potentially dislike.
> Let's focus on getting the core thing working, and then you can
> enhance it once we all agree that it is.

Makes sense. I think we should make a separate pass to enable sort
support for B-Tree sorting - that's probably the most compelling case,
after all. That's certainly the thing that I've heard complaints
about. There could be as many as 2-3 follow-up commits.

> On the substance of this issue, I suspect that for pass-by-value data
> types it can hardly be wrong to use the datum tuplesort approach; but
> it's possible we will want to disable it for pass-by-reference data
> types when the abbreviated-key infrastructure is available.  That will
> lose if it turns out that the abbreviated keys aren't capturing enough
> of the entropy, but maybe we'll decide that's OK.  Or maybe not.  But
> I don't think it's imperative that this patch make a change in that
> area, and indeed, in the interest of keeping separate changes
> isolated, I think it's better if it doesn't.

Right. I had presumed that we'd want to figure that out each time. I
wasn't sure how best to go about doing that, which is why it's an open
item.

[1] http://www.postgresql.org/message-id/CAM3SWZQTYv3KP+CakZJZV3RwB1OJjaHwPCZ9cOYJXPkhbtcBVg@mail.gmail.com
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Sep 4, 2014 at 2:12 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 4, 2014 at 9:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Sep 2, 2014 at 10:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> * Still doesn't address the open question of whether or not we should
>>> optimistically always try "memcmp() == 0" on tiebreak. I still lean
>>> towards "yes".
>>
>> Let m be the cost of a memcmp() that fails near the end of the
>> strings; and let s be the cost of a strcoll that does likewise.
>> Clearly s > m.  But approximately what is s/m on platforms where you
>> can test?  Say, with 100 byte string, in a few different locales.
>
> Just to be clear: I imagine you're more or less sold on the idea of
> testing equality in the event of a tie-break, where the leading 8
> primary weight bytes are already known to be equal (and the full text
> string lengths also match); the theory of operation behind testing how
> well abbreviated key cardinality serves as a proxy for full key
> cardinality is very much predicated on that. We can still win big with very low
> cardinality sets this way, which are an important case. What I
> consider an open question is whether or not we should do that on the
> first call when there is no abbreviated comparison, such as on the
> second or subsequent attribute in a multi-column sort, in the hope
> that equality will just happen to be indicated.

Eh, maybe?  I'm not sure why the case where we're using abbreviated
keys should be different than the case we're not.  In either case this
is a straightforward trade-off: if we do a memcmp() before strcoll(),
we win if it returns 0 and lose if it returns non-zero and strcoll also
returns non-zero.  (If memcmp() returns non-zero but strcoll() returns
0, it's a tie.)  I'm not immediately sure why it should affect the
calculus one way or the other whether abbreviated keys are in use; the
question of how much faster memcmp() is than strcoll() seems like the
relevant consideration either way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Eh, maybe?  I'm not sure why the case where we're using abbreviated
> keys should be different than the case we're not.  In either case this
> is a straightforward trade-off: if we do a memcmp() before strcoll(),
> we win if it returns 0 and lose if it returns non-zero and strcoll also
> returns non-zero.  (If memcmp() returns non-zero but strcoll() returns
> 0, it's a tie.)  I'm not immediately sure why it should affect the
> calculus one way or the other whether abbreviated keys are in use; the
> question of how much faster memcmp() is than strcoll() seems like the
> relevant consideration either way.


Not quite. Consider my earlier example of sorting ~300,000 cities by
country only. That's a pretty low cardinality attribute. We win big,
and we are almost certain that the abbreviated key cardinality is a
perfect proxy for the full key cardinality so we stick with
abbreviated keys while copying over tuples. Sure, most comparisons
will actually be resolved with a "memcmp() == 0" rather than an
abbreviated comparison, but under my ad-hoc cost model there is no
distinction, since they're both very much cheaper than a strcoll()
(particularly when we factor in the NUL termination copying that a
"memcmp() == 0" also avoids). To a lesser extent we're also justified
in that optimism because we've already established that roughly the
first 8 bytes of the string are bitwise equal.

So the difference is that in the abbreviated key case, we are at least
somewhat justified in our optimism. Whereas, where we're just eliding
fmgr overhead, say on the 2nd or subsequent attribute of a multi-key
sort, it's totally opportunistic to chance a "memcmp() == 0". The
latter optimization can only be justified by the fact that the
memcmp() is somewhere between dirt cheap and free. That seems like
something that should significantly impact the calculus.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Sep 4, 2014 at 11:12 AM, Peter Geoghegan <pg@heroku.com> wrote:
> What I
> consider an open question is whether or not we should do that on the
> first call when there is no abbreviated comparison, such as on the
> second or subsequent attribute in a multi-column sort, in the hope
> that equality will just happen to be indicated.

> Let me try and come up with some numbers for a really unsympathetic
> case, since you've already seen sympathetic numbers.

So I came up with what I imagined to be an unsympathetic case:

postgres=# create table opt_eq_test as select 1 as dummy, country ||
', ' || city as country  from cities order by city;
SELECT 317102
[local]/postgres=# select * from opt_eq_test limit 5;
 dummy |         country
-------+--------------------------
     1 | India, 108 Kalthur
     1 | Mexico, 10 de Abril
     1 | India, 113 Thalluru
     1 | Argentina, 11 de Octubre
     1 | India, 11 Dlp
(5 rows)

I added the dummy column to prevent abbreviated keys from being used -
this is all about the question of trying to get away with a "memcmp()
== 0" in all circumstances, without abbreviated keys/statistics on
attribute cardinality. This is a question that has nothing to do with
abbreviated keys in particular.

With the most recent revision of the patch, performance of a
representative query against this data stabilizes as follows on my
laptop:

LOG:  duration: 2252.500 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2261.505 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2315.903 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2260.132 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2247.340 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2246.723 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2276.297 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2241.911 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2259.540 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2248.740 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2245.913 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2230.583 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;

If I apply the additional optimization that we're on the fence about:

--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1967,7 +1967,7 @@ bttextfastcmp_locale(Datum x, Datum y, SortSupport ssup)
        len1 = VARSIZE_ANY_EXHDR(arg1);
        len2 = VARSIZE_ANY_EXHDR(arg2);

-       if (ssup->abbrev_state == ABBREVIATED_KEYS_TIE && len1 == len2)
+       if (len1 == len2)
        {
                /*
                 * Being called as authoritative tie-breaker for an abbreviated key
Then the equivalent numbers look like this:

LOG:  duration: 2178.220 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2175.005 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2219.648 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2174.865 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2246.387 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2234.023 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2186.957 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2177.778 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2186.709 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2171.557 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2211.822 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2224.198 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;
LOG:  duration: 2192.506 ms  statement: select * from (select * from
opt_eq_test order by dummy, country offset 1000000) d;

That looks like a small improvement. It turns out that there are some
cities in certain countries that are not the only city with that name
in the country. For example, there are about 5 Dublins in the United
States, although each is fairly far apart. There are 285,260 distinct
country + city combinations out of a total of 317,102 cities in this
sample dataset. I thought about doing something equivalent with
(dummy, province || city), but that's not interesting because there
are lots of distinct "provinces" (e.g American states, Canadian
provinces, British counties) in the world, so a wasted memcmp() is
minimally wasteful. I also thought about looking at (dummy, country ||
province || city), but the expense of sorting larger strings tends to
dominate there.

I'm more confident that being optimistic about getting away with a
"memcmp() == 0" in all circumstances is the right decision now. I'm
still not quite sure if 1) We should worry about very long strings,
where our optimism is unjustified, or 2) It's worth surfacing
ABBREVIATED_KEYS_TIE within the abbreviated key infrastructure at all.

Perhaps there is something to be said for "getting out of the way of
branch prediction". In any case, even if I actually saw a small
regression here, I'd probably still say it was worth it to get the big
improvements to sympathetic though reasonably representative cases
that we've seen with this technique [1]. As things stand, it looks
well worth it to me.

[1] http://www.postgresql.org/message-id/CAM3SWZQTYv3KP+CakZJZV3RwB1OJjaHwPCZ9cOYJXPkhbtcBVg@mail.gmail.com
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Sep 4, 2014 at 5:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
> So I came up with what I imagined to be an unsympathetic case:

BTW, this "cities" data is still available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Sep 3, 2014 at 2:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I guess it should still be a configure option, then. Or maybe there
> should just be a USE_ABBREV_KEYS macro within pg_config_manual.h.

Attached additional patches are intended to be applied on top of most
of the patches posted on September 2nd [1]. Note that you should not
apply patch 0001-* from that set to master, since it has already been
committed to master [2]. However, while rebasing I revised
patch/commit 0005-* to enable abbreviation on all platforms, including
32-bit platforms (the prior 0005-* patch just re-enabled the
optimization on Darwin/Apple), so you should discard the earlier
0005-* patch. In a later commit I also properly formalize the idea
that we always do opportunistic "memcmp() == 0" checks, no matter what
context a sortsupport-accelerated text comparison occurs in. That
seems like a good idea, but it's broken out in a separate commit in
case you are not in agreement.

While I gave serious consideration to your idea of having a dedicated
abbreviation comparator, and not duplicating sortsupport state when
abbreviated keys are used (going so far as to almost fully implement
the idea), I ultimately decided that my vote says we don't do that. It
seemed to me that there were negligible benefits for increased
complexity. In particular, I didn't want to burden tuplesort with
having to worry about whether or not abbreviation was aborted during
tuple copying, or was not used by the opclass in the first place -
implementing your scheme makes that distinction relevant. It's very
convenient to have comparetup_heap() "compare the leading sort key"
(that specifically looks at SortTuple.datum1 pairs) indifferently,
using the same comparator for "abbreviated" and "not abbreviated"
cases indifferently. comparetup_heap() does not seem like a great
place to burden with caring about each combination any more than
strictly necessary.

I like that I don't have to care about every combination, and can
treat abbreviation abortion as the special case with the extra step,
in line with how I think of the optimization conceptually. Does that
make sense? Otherwise, there'd have to be a ApplySortComparator()
*and* "ApplySortComparatorAbbreviated()" call with SortTuple.datum1
pairs passed, as appropriate for each opclass (and abortion state), as
well as a heap_getattr() tie-breaker call for the latter case alone
(when we got an inconclusive answer, OR when abbreviation was
aborted). Finally, just as things are now, there'd have to be a loop
where the second or subsequent attributes are dealt with by
ApplySortComparator()'ing. So AFAICT under your scheme there are 4
ApplySortComparator* call sites required, rather than 3 as under mine.

Along similar lines, I thought about starting from nkey = 0 within
comparetup_heap() when abortion occurs (so that there'd only be 2
ApplySortComparator() call sites - no increase from master) , but that
turns out to be messy, plus I like those special tie-breaker
assertions.

I will be away for much of next week, and will have limited access to
e-mail. I will be around tomorrow, though. I hope that what I've
posted is suitable to commit without further input from me.

[1] http://www.postgresql.org/message-id/CAM3SWZTEtQcKc24LhWKDLasJf-b-cCNn4q0OYjhGBX+NcpNRpg@mail.gmail.com
[2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d8d4965dc29263462932be03d4206aa694e2cd7e
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Sep 5, 2014 at 7:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached additional patches are intended to be applied on top of most
> of the patches posted on September 2nd [1].


I attach another amendment/delta patch, intended to be applied on top
of what was posted yesterday. I neglected to remove some abort logic
that was previously only justified by the lack of opportunistic
"memcmp() == 0" comparisons in all instances, rather than just with
abbreviated keys.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sat, Sep 6, 2014 at 3:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach another amendment/delta patch

Attached is another amendment to the patch set. With the recent
addition of abbreviation support on 32-bit platforms, we should just
hash the Datum representation as a uint32 on SIZEOF_DATUM != 8
platforms.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Sep 4, 2014 at 5:46 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 4, 2014 at 2:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Eh, maybe?  I'm not sure why the case where we're using abbreviated
>> keys should be different than the case we're not.  In either case this
>> is a straightforward trade-off: if we do a memcmp() before strcoll(),
>> we win if it returns 0 and lose if it returns non-zero and strcoll also
>> returns non-zero.  (If memcmp() returns non-zero but strcoll() returns
>> 0, it's a tie.)  I'm not immediately sure why it should affect the
>> calculus one way or the other whether abbreviated keys are in use; the
>> question of how much faster memcmp() is than strcoll() seems like the
>> relevant consideration either way.
>
> Not quite. Consider my earlier example of sorting ~300,000 cities by
> country only. That's a pretty low cardinality attribute. We win big,
> and we are almost certain that the abbreviated key cardinality is a
> perfect proxy for the full key cardinality so we stick with
> abbreviated keys while copying over tuples. Sure, most comparisons
> will actually be resolved with a "memcmp() == 0" rather than an
> abbreviated comparison, but under my ad-hoc cost model there is no
> distinction, since they're both very much cheaper than a strcoll()
> (particularly when we factor in the NUL termination copying that a
> "memcmp() == 0" also avoids). To a lesser extent we're also justified
> in that optimism because we've already established that roughly the
> first 8 bytes of the string are bitwise equal.
>
> So the difference is that in the abbreviated key case, we are at least
> somewhat justified in our optimism. Whereas, where we're just eliding
> fmgr overhead, say on the 2nd or subsequent attribute of a multi-key
> sort, it's totally opportunistic to chance a "memcmp() == 0". The
> latter optimization can only be justified by the fact that the
> memcmp() is somewhere between dirt cheap and free. That seems like
> something that should significantly impact the calculus.

Boiled down, what you're saying is that you might have a set that
contains lots of duplicates in general, but not very many where the
abbreviated-keys also match.  Sure, that's true.  But you might also
not have that case, so I don't see where that gets us; the same
worst-case test case Heikki developed the last time we relitigated
this point is still relevant here.  In order to know how much we're
giving up in that case, we need the exact number I asked you to
provide in my previous email: the ratio of the cost of strcoll() to
the cost of memcmp().

I see that you haven't chosen to provide that information in any of
your four responses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Fri, Sep 5, 2014 at 10:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> While I gave serious consideration to your idea of having a dedicated
> abbreviation comparator, and not duplicating sortsupport state when
> abbreviated keys are used (going so far as to almost fully implement
> the idea), I ultimately decided that my vote says we don't do that. It
> seemed to me that there were negligible benefits for increased
> complexity. In particular, I didn't want to burden tuplesort with
> having to worry about whether or not abbreviation was aborted during
> tuple copying, or was not used by the opclass in the first place -
> implementing your scheme makes that distinction relevant. It's very
> convenient to have comparetup_heap() "compare the leading sort key"
> (that specifically looks at SortTuple.datum1 pairs) indifferently,
> using the same comparator for "abbreviated" and "not abbreviated"
> cases indifferently. comparetup_heap() does not seem like a great
> place to burden with caring about each combination any more than
> strictly necessary.
>
> I like that I don't have to care about every combination, and can
> treat abbreviation abortion as the special case with the extra step,
> in line with how I think of the optimization conceptually. Does that
> make sense?

No.  comparetup_heap() is hip-deep in this optimization as it stands,
and what I proposed - if done correctly - isn't going to make that
significantly worse.  In fact, it really ought to make things better:
you should be able to set things up so that ssup->comparator is always
the test that should be applied first, regardless of whether we're
aborted or not-aborted or not doing this in the first place; and then
ssup->tiebreak_comparator, if not NULL, can be applied after that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Sep 9, 2014 at 2:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Boiled down, what you're saying is that you might have a set that
> contains lots of duplicates in general, but not very many where the
> abbreviated-keys also match.  Sure, that's true.

Abbreviated keys are not used in the case where we do a (fully)
opportunistic memcmp(), without really having any idea of whether or
not it'll work out. Abbreviated keys aren't really relevant to that
case, except perhaps in that we know we'll have statistics available
for leading attributes, which will make the case less frequent in
practice.

> But you might also
> not have that case, so I don't see where that gets us; the same
> worst-case test case Heikki developed the last time we relitigated
> this point is still relevant here.

Well, I think that there should definitely be a distinction made
between abbreviated and non-abbreviated cases; you could frequently
have almost 100% certainty that each of those optimistic memcmp()s
will work out with abbreviated keys. Low cardinality sets are very
common. I'm not sure what your position on that is. My proposal to
treat both of those cases (abbreviated with a cost model/cardinality
statistics; non-abbreviated without) the same is based on different
arguments for each case.

> In order to know how much we're
> giving up in that case, we need the exact number I asked you to
> provide in my previous email: the ratio of the cost of strcoll() to
> the cost of memcmp().
>
> I see that you haven't chosen to provide that information in any of
> your four responses.

Well, it's kind of difficult to give that number in a vacuum. I showed
a case that had a large majority of opportunistic memcmp()s go to
waste, while a small number were useful, which still put us ahead. I
can look at Heikki's case with this again if you think that'll help.
Heikki said that his case was all about wasted strxfrm()s, which are
surely much more expensive than wasted memcmp()s, particularly when
you consider temporal locality (we needed that memory to be stored in
a cacheline for the immediately subsequent operation anyway, should
the memcmp() thing not work out - the simple ratio that you're
interested in may be elusive).

In case I haven't been clear enough on this point, I re-emphasize that
I do accept that for something like the non-abbreviated case, the
opportunistic memcmp() thing must virtually be free if we're to
proceed, since it is purely opportunistic. If it can be demonstrated
that that isn't the case (and if that cannot be fixed by limiting it
to < CACHE_LINE_SIZE), clearly we should not proceed with
opportunistic (non-abbreviated) memcmp()s. In fact, I think I'm
holding it to a higher standard than you are - I believe that it had
better be virtually free.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Wed, Sep 10, 2014 at 1:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> In order to know how much we're
>> giving up in that case, we need the exact number I asked you to
>> provide in my previous email: the ratio of the cost of strcoll() to
>> the cost of memcmp().
>>
>> I see that you haven't chosen to provide that information in any of
>> your four responses.
>
> Well, it's kind of difficult to give that number in a vacuum.

No, not really.  All you have to do is write a little test program to
gather the information.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> No, not really.  All you have to do is write a little test program to
> gather the information.

I don't think a little test program is useful - IMV it's too much of a
simplification to suppose that a strcoll() has a fixed cost, and a
memcmp() has a fixed cost, and that we can determine algebraically
that we should (say) proceed or not proceed with the additional
opportunistic "memcmp() == 0" optimization based solely on that. I'm
not sure if that's what you meant, but it might have been. Temporal
locality is surely a huge factor here, for example. Are we talking
about a memcmp() that always immediately precedes a similar strcoll()
call on the same memory? Are we factoring in the cost of
NUL-termination in order to make each strcoll() possible? And that's
just for starters.

However, I think it's perfectly fair to consider a case where the
opportunistic "memcmp() == 0" thing never works out (as opposed to
mostly not helping, which is what I considered earlier [1]), as long
as we're sorting real tuples. You mentioned Heikki's test case; it
seems fair to consider that, but for the non-abbreviated case where
the additional, *totally* opportunistic "memcmp == 0" optimization
only applies (so no abbreviated keys), while still having the
additional optimization be 100% useless. Clearly that test case is
also just about perfectly pessimal for this case too. (Recall that
Heikki's test case shows performance on pre-sorted input, so there are
far fewer comparisons than would typically be required anyway - n
comparisons, or a "bubble sort best case". If I wanted to cheat, I
could reduce work_mem so that an external tape sort is used, since as
it happens tapesort doesn't opportunistically check for pre-sorted
input, but I won't do that. Heikki's case both emphasizes the
amortized cost of a strxfrm() where we abbreviate, and in this
instance de-emphasizes the importance of memory latency by having
access be sequential/predictable.)

The only variation I'm adding here to Heikki's original test case is
to have a leading int4 attribute that always has a value of 1 -- that
conveniently removes abbreviation (including strxfrm() overhead) as a
factor that can influence the outcome, since right now that isn't
under consideration. So:

create table sorttest (dummy int4, t text);
insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75)
from generate_series(10000, 30000) g;

Benchmark:

pg@hamster:~/tests$ cat heikki-sort.sql
select * from (select * from sorttest order by dummy, t offset 1000000) f;

pgbench -f heikki-sort.sql -T 100 -n

With optimization enabled
====================
tps = 77.861353 (including connections establishing)
tps = 77.862260 (excluding connections establishing)

tps = 78.211053 (including connections establishing)
tps = 78.212016 (excluding connections establishing)

tps = 77.996117 (including connections establishing)
tps = 77.997069 (excluding connections establishing)

With optimization disabled (len1 == len2 thing is commented out)
=================================================

tps = 78.719389 (including connections establishing)
tps = 78.720366 (excluding connections establishing)

tps = 78.764819 (including connections establishing)
tps = 78.765712 (excluding connections establishing)

tps = 78.472902 (including connections establishing)
tps = 78.473844 (excluding connections establishing)

So, yes, it looks like I might have just about regressed this case -
it's hard to be completely sure. However, this is still a very
unrealistic case, since invariably "len1 == len2" without the
optimization ever working out, whereas the case that benefits [2] is
quite representative. As I'm sure you were expecting, I still favor
pursuing this additional optimization.

If you think I've been unfair or not thorough, I'm happy to look at
other cases. Also, I'm not sure that you accept that I'm justified in
considering this a separate question to the more important question of
what to do in the tie-breaker abbreviation case (where we may be
almost certain that equality will be indicated by a memcmp()). If you
don't accept that I'm right about that more important case, I guess
that means that you don't have confidence in my ad-hoc cost model (the
HyperLogLog/cardinality stuff).

[1] http://www.postgresql.org/message-id/CAM3SWZR9dtGO+zX4VEn7GTW2=+umSNq=c57SJGxG8OqHjarL7g@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAM3SWZQTYv3KP+CakZJZV3RwB1OJjaHwPCZ9cOYJXPkhbtcBVg@mail.gmail.com
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Sep 11, 2014 at 4:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Sep 10, 2014 at 11:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> No, not really.  All you have to do is right a little test program to
>> gather the information.
>
> I don't think a little test program is useful - IMV it's too much of a
> simplification to suppose that a strcoll() has a fixed cost, and a
> memcmp() has a fixed cost, and that we can determine algebraically
> that we should (say) proceed or not proceed with the additional
> opportunistic "memcmp() == 0" optimization based solely on that. I'm
> not sure if that's what you meant, but it might have been.

I think I said pretty clearly that it was.

> However, I think it's perfectly fair to consider a case where the
> opportunistic "memcmp() == 0" thing never works out (as opposed to
> mostly not helping, which is what I considered earlier [1]), as long
> as we're sorting real tuples. You mentioned Heikki's test case; it
> seems fair to consider that, but for the non-abbreviated case where
> the additional, *totally* opportunistic "memcmp == 0" optimization
> only applies (so no abbreviated keys), while still having the
> additional optimization be 100% useless. Clearly that test case is
> also just about perfectly pessimal for this case too. (Recall that
> Heikki's test case shows performance on pre-sorted input, so there are
> far fewer comparisons than would typically be required anyway - n
> comparisons, or a "bubble sort best case". If I wanted to cheat, I
> could reduce work_mem so that an external tape sort is used, since as
> it happens tapesort doesn't opportunistically check for pre-sorted
> input, but I won't do that. Heikki's case both emphasizes the
> amortized cost of a strxfrm() where we abbreviate, and in this
> instance de-emphasizes the importance of memory latency by having
> access be sequential/predictable.)
>
> The only variation I'm adding here to Heikki's original test case is
> to have a leading int4 attribute that always has a value of 1 -- that
> conveniently removes abbreviation (including strxfrm() overhead) as a
> factor that can influence the outcome, since right now that isn't
> under consideration. So:
>
> create table sorttest (dummy int4, t text);
> insert into sorttest select 1, 'foobarfo' || (g) || repeat('a', 75)
> from generate_series(10000, 30000) g;
>
> Benchmark:
>
> pg@hamster:~/tests$ cat heikki-sort.sql
> select * from (select * from sorttest order by dummy, t offset 1000000) f;
>
> pgbench -f heikki-sort.sql -T 100 -n
>
> With optimization enabled
> ====================
> tps = 77.861353 (including connections establishing)
> tps = 77.862260 (excluding connections establishing)
>
> tps = 78.211053 (including connections establishing)
> tps = 78.212016 (excluding connections establishing)
>
> tps = 77.996117 (including connections establishing)
> tps = 77.997069 (excluding connections establishing)
>
> With optimization disabled (len1 == len2 thing is commented out)
> =================================================
>
> tps = 78.719389 (including connections establishing)
> tps = 78.720366 (excluding connections establishing)
>
> tps = 78.764819 (including connections establishing)
> tps = 78.765712 (excluding connections establishing)
>
> tps = 78.472902 (including connections establishing)
> tps = 78.473844 (excluding connections establishing)
>
> So, yes, it looks like I might have just about regressed this case -
> it's hard to be completely sure. However, this is still a very
> unrealistic case, since invariably "len1 == len2" without the
> optimization ever working out, whereas the case that benefits [2] is
> quite representative. As I'm sure you were expecting, I still favor
> pursuing this additional optimization.

Well, I have to agree that doesn't look too bad, but your reluctance
to actually do the microbenchmark worries me.  Granted,
macrobenchmarks are what actually matters, but they can be noisy and
there can be other confounding factors.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think I said pretty clearly that it was.

I agree that you did, but it wasn't clear exactly what factors you
were asking me to simulate. It still isn't. Do you want me to compare
the same string a million times in a loop, both with a strcoll() and
with a memcmp()? Should I copy it into a buffer to add a NUL byte? Or
should it be a new string each time, with a cache miss expected some
proportion of the time? These considerations might significantly
influence the outcome here, and one variation might be significantly
less fair than another. Tell me what to do in a little more detail,
and I'll do it (plus let you know what I think of it). I honestly
don't know what you expect.

>> So, yes, it looks like I might have just about regressed this case -
>> it's hard to be completely sure. However, this is still a very
>> unrealistic case, since invariably "len1 == len2" without the
>> optimization ever working out, whereas the case that benefits [2] is
>> quite representative. As I'm sure you were expecting, I still favor
>> pursuing this additional optimization.
>
> Well, I have to agree that doesn't look too bad, but your reluctance
> to actually do the microbenchmark worries me.  Granted,
> macrobenchmarks are what actually matters, but they can be noisy and
> there can be other confounding factors.

Well, I've been quite open about the fact that I think we can and
should hide things in memory latency. I don't think my benchmark was
in any way noisy, since you saw 3 100 second runs per test set/case,
with a very stable outcome throughout - plus the test case is
extremely unsympathetic/unrealistic to begin with. Hiding behind
memory latency is an increasingly influential trick that I've seen
crop up a few times in various papers. I think it's perfectly
legitimate to rely on that. But, honestly, I have little idea how much
I actually am relying on it. I think it's only fair to look at
representative cases (i.e. actually SQL queries). Anything else can
only be used as a guide. But, in this instance, a guide to what,
exactly? This is not a rhetorical question, and I'm not trying to be
difficult. If I thought there was something bad hiding here, I'd tell
you about it.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Sep 9, 2014 at 2:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I like that I don't have to care about every combination, and can
>> treat abbreviation abortion as the special case with the extra step,
>> in line with how I think of the optimization conceptually. Does that
>> make sense?
>
> No.  comparetup_heap() is hip-deep in this optimization as it stands,
> and what I proposed - if done correctly - isn't going to make that
> significantly worse.  In fact, it really ought to make things better:
> you should be able to set things up so that ssup->comparator is always
> the test that should be applied first, regardless of whether we're
> aborted or not-aborted or not doing this in the first place; and then
> ssup->tiebreak_comparator, if not NULL, can be applied after that.

I'm not following here. Isn't that at least equally true of what I've
proposed? Sure, I'm checking "if (!state->abbrevAborted)" first, but
that's irrelevant to the non-abbreviated case. It has nothing to
abort. Also, AFAICT we cannot abort and still call ssup->comparator()
indifferently, since sorttuple.datum1 fields are perhaps abbreviated
keys in half of all cases (i.e. pre-abort tuples), and uninitialized
garbage the other half of the time (i.e. post-abort tuples).

Where is the heap_getattr() stuff supposed to happen for the first
attribute to get an authoritative comparison in the event of aborting
(if we're calling ssup->comparator() on datum1 indifferently)? We
decide that we're going to use abbreviated keys within datum1 fields
up-front. When we abort, we cannot use datum1 fields at all (which
doesn't appear to matter for text -- the datum1 optimization has
historically only benefited pass-by-value types).

I'd mentioned that I'd hacked together a patch that doesn't
necessitate a separate state (if only to save a very small amount of
memory), but it is definitely messier within comparetup_heap(). I'm
still tweaking it. FYI, it does this within comparetup_heap():

+       if (!sortKey->abbrev_comparator)
+       {
+               /*
+                * There are no abbreviated keys to begin with (i.e. no opclass
+                * support exists).  Compare the leading sort key, assuming an
+                * authoritative representation.
+                */
+               compare = ApplySortComparator(a->datum1, a->isnull1,
+                                             b->datum1, b->isnull1,
+                                             sortKey);
+               if (compare != 0)
+                       return compare;
+
+               sortKey++;
+               nkey = 1;
+       }
+       else if (!state->abbrevAborted && sortKey->abbrev_comparator)
+       {
+               /*
+                * Leading attribute has abbreviated key representation, and
+                * abbreviation was not aborted when copying.  Compare the
+                * leading sort key using abbreviated representation.
+                */
+               compare = ApplySortAbbrevComparator(a->datum1, a->isnull1,
+                                                   b->datum1, b->isnull1,
+                                                   sortKey);
+               if (compare != 0)
+                       return compare;
+
+               /*
+                * Since abbreviated comparison returned 0, call tie-breaker
+                * comparator using original, authoritative representation,
+                * which may break tie when differences were not captured
+                * within abbreviated representation.
+                */
+               nkey = 0;
+       }
+       else
+       {
+               /*
+                * Started with abbreviated keys, but aborted during
+                * conversion/tuple copying -- check each attribute from
+                * scratch.  It's not safe to make any assumption about the
+                * state of individual datum1 fields.
+                */
+               nkey = 0;
+       }

Is doing all this worth the small saving in memory? Personally, I
don't think that it is, but I defer to you.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Heikki Linnakangas
Date:
On 09/12/2014 12:46 AM, Peter Geoghegan wrote:
> On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think I said pretty clearly that it was.
>
> I agree that you did, but it wasn't clear exactly what factors you
> were asking me to simulate.

All factors.

> Do you want me to compare the same string a million times in a loop,
> both with a strcoll() and with a memcmp()?

Yes.

> Should I copy it into a buffer to add a NUL byte?

Yes.

> Or should it be a new string each time, with a cache miss expected
> some proportion of the time?

Yes.

I'm being facetious - it's easy to ask for tests when you're not the one 
running them. But seriously, please do run all the tests that you 
think make sense.

I'm particularly interested in the worst case. What is the worst case 
for the proposed memcmp() check? Test that. If the worst case regresses 
significantly, then we need to have a discussion of how likely that 
worst case is to happen in real life, what the performance is like in 
more realistic almost-worst-case scenarios, does it need to be tunable, 
is the trade-off worth it, etc. But if the worst case regresses less 
than, say, 1%, and there are some other cases where you get a 300% speed 
up, then I think it's safe to say that the optimization is worth it, 
without any more testing or discussion.
- Heikki




Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 5:28 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 09/12/2014 12:46 AM, Peter Geoghegan wrote:
>>
>> On Thu, Sep 11, 2014 at 1:50 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>>>
>>> I think I said pretty clearly that it was.
>>
>>
>> I agree that you did, but it wasn't clear exactly what factors you
>> were asking me to simulate.
>
>
> All factors.
>
>> Do you want me to compare the same string a million times in a loop,
>> both with a strcoll() and with a memcmp()?
>
>
> Yes.
>
>> Should I copy it into a buffer to add a NUL byte?
>
>
> Yes.
>
>> Or should it be a new string each time, with a cache miss expected
>> some proportion of the time?
>
>
> Yes.
>
> I'm being facetious - it's easy to ask for tests when you're not the one
> running them. But seriously, please do run all the tests that you think
> make sense.
>
> I'm particularly interested in the worst case. What is the worst case for
> the proposed memcmp() check? Test that. If the worst case regresses
> significantly, then we need to have a discussion of how likely that worst
> case is to happen in real life, what the performance is like in more
> realistic almost-worst-case scenarios, does it need to be tunable, is the
> trade-off worth it, etc. But if the worst case regresses less than, say, 1%,
> and there are some other cases where you get a 300% speed up, then I think
> it's safe to say that the optimization is worth it, without any more testing
> or discussion.

+1 to all that, including the facetious parts.

Based on discussion thus far it seems that there's a possibility that
the trade-off may be different for short strings vs. long strings.  If
the string is small enough to fit in the L1 CPU cache, then it may be
that memcmp() followed by strcoll() is not much more expensive than
strcoll().  That should be easy to figure out: write a standalone C
program that creates a bunch of arbitrary, fairly-short strings, say
32 bytes, in a big array.  Make the strings different near the end,
but not at the beginning.  Then have the program either do strcoll()
on every pair (O(n^2)) or, with a #define, memcmp() followed by
strcoll() on every pair.  It should be easy to see whether the
memcmp() adds 1% or 25% or 100%.

Then, repeat the same thing with strings that are big enough to blow
out the L1 cache, say 1MB in length.  Some intermediate sizes (64kB?)
might be worth testing, too.  Again, it should be easy to see what the
overhead is.  Once we know that, we can make intelligent decisions
about whether this is a good idea or not, and when.  If you attach the
test programs, other people (e.g. me) can also try them on other
systems (e.g. MacOS X) to see whether the characteristics there are
different than what you saw on your system.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Based on discussion thus far it seems that there's a possibility that
> the trade-off may be different for short strings vs. long strings.  If
> the string is small enough to fit in the L1 CPU cache, then it may be
> that memcmp() followed by strcoll() is not much more expensive than
> strcoll().  That should be easy to figure out: write a standalone C
> program that creates a bunch of arbitrary, fairly-short strings, say
> 32 bytes, in a big array.

While I think that's fair, the reason I didn't bother playing tricks
with only doing a (purely) opportunistic memcmp() when the string size
is under (say) CACHE_LINE_SIZE bytes is that in order for it to matter
you'd have to have a use case where the first CACHE_LINE_SIZE of bytes
matched, and the string just happened to be identical in length, but
also ultimately differed at least a good fraction of the time. That
seems like the kind of thing that it's okay to care less about. That
might have been regressed worse than what you've seen already. It's
narrow in a whole new dimension, though. The intersection of that
issue, and the issues exercised by Heikki's existing test case must be
exceedingly rare.

I'm still confused about whether or not we're talking at cross
purposes here, Robert. Are you happy to consider this as a separate
and additional question to the question of what to do in an
abbreviated comparison tie-break? The correlated multiple sort key
attributes case strikes me as very common - it's a nice to have, and
will sometimes offer a nice additional boost. On the other hand, doing
this for abbreviated comparison tie-breakers is more or less
fundamental to the patch. In my cost model, a memcmp() abbreviated key
tie-breaker that works out is equivalent to an abbreviated comparison.
This is a bit of a fudge, but close enough.

BTW, I do appreciate your work on this. I realize that if you didn't
give this patch a fair go, there is a chance that no one else would.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Fri, Sep 12, 2014 at 2:58 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Based on discussion thus far it seems that there's a possibility that
>> the trade-off may be different for short strings vs. long strings.  If
>> the string is small enough to fit in the L1 CPU cache, then it may be
>> that memcmp() followed by strcoll() is not much more expensive than
>> strcoll().  That should be easy to figure out: write a standalone C
>> program that creates a bunch of arbitrary, fairly-short strings, say
>> 32 bytes, in a big array.
>
> While I think that's fair, the reason I didn't bother playing tricks
> with only doing a (purely) opportunistic memcmp() when the string size
> is under (say) CACHE_LINE_SIZE bytes is that in order for it to matter
> you'd have to have a use case where the first CACHE_LINE_SIZE of bytes
> matched, and the string just happened to be identical in length, but
> also ultimately differed at least a good fraction of the time. That
> seems like the kind of thing that it's okay to care less about. That
> might have been regressed worse than what you've seen already. It's
> narrow in a whole new dimension, though. The intersection of that
> issue, and the issues exercised by Heikki's existing test case must be
> exceedingly rare.
>
> I'm still confused about whether or not we're talking at cross
> purposes here, Robert. Are you happy to consider this as a separate
> and additional question to the question of what to do in an
> abbreviated comparison tie-break?

I think I've said a few times now that I really want to get this
additional data before forming an opinion.  As a certain Mr. Doyle
writes, "It is a capital mistake to theorize before one has data.
Insensibly one begins to twist facts to suit theories, instead of
theories to suit facts."  I can't say it any better than that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Sep 12, 2014 at 12:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think I've said a few times now that I really want to get this
> additional data before forming an opinion.  As a certain Mr. Doyle
> writes, "It is a capital mistake to theorize before one has data.
> Insensibly one begins to twist facts to suit theories, instead of
> theories to suit facts."  I can't say it any better than that.

Well, in the abbreviated key case we might know that with probability
0.99999 that the "memcmp() == 0" thing will work out. In the
non-abbreviated tie-breaker case, we'll definitely know nothing. That
seems like a pretty fundamental distinction, so I don't think it's
premature to ask you to consider those two questions individually.
Still, maybe it's easier to justify both cases in the same way, if we
can.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Sep 12, 2014 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Based on discussion thus far it seems that there's a possibility that
> the trade-off may be different for short strings vs. long strings.  If
> the string is small enough to fit in the L1 CPU cache, then it may be
> that memcmp() followed by strcoll() is not much more expensive than
> strcoll().  That should be easy to figure out: write a standalone C
> program that creates a bunch of arbitrary, fairly-short strings, say
> 32 bytes, in a big array.  Make the strings different near the end,
> but not at the beginning.  Then have the program either do strcoll()
> on every pair (O(n^2)) or, with a #define, memcmp() followed by
> strcoll() on every pair.  It should be easy to see whether the
> memcmp() adds 1% or 25% or 100%.

I get where you're coming from now (I think). It seems like you're
interested in getting a sense of what it would cost to do an
opportunistic memcmp() in a universe where memory latency isn't by far
the dominant cost (which it clearly is, as evidenced by my most recent
benchmark where a variant of Heikki's highly unsympathetic SQL test
case showed a ~1% regression). You've described a case with totally
predictable access patterns, perfectly suited to prefetching, and with
no other memory access bottlenecks. As I've said, this seems
reasonable (at least with those caveats in mind).

The answer to your high level question appears to be: the
implementation (the totally opportunistic "memcmp() == 0" thing)
benefits from the fact that we're always bottlenecked on memory, and
to a fairly appreciable degree. I highly doubt that this is something
that can fail to work out with real SQL queries, but anything that
furthers our understanding of the optimization is a good thing. Of
course, within tuplesort we're not really getting the totally
opportunistic memcmp()s "for free" - rather, we're using a resource
that we would not otherwise be able to use at all.

This graph illustrates the historic trends of CPU and memory performance:

http://www.cs.virginia.edu/stream/stream_logo.gif

I find this imbalance quite remarkable - no wonder researchers are
finding ways to make the best of the situation. To make matters worse,
the per-core trends for memory bandwidth are now actually *negative
growth*. We may actually be going backwards, if we assume that that's
the bottleneck, and that we cannot parallelize things.

Anyway, attached rough test program implements what you outline. This
is for 30,000 32 byte strings (where just the final two bytes differ).
On my laptop, output looks like this (edited to only show median
duration in each case):

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 13.445506 seconds

duration of comparisons with useless memcmp()s: 17.720501 seconds
"""

It looks like about a 30% increase in run time when we always have
useless memcmps().

You can change the constants around easily - let's consider 64 KiB
strings now (by changing STRING_SIZE to 65536). In order to make the
program not take too long, I also reduce the number of strings
(N_STRINGS) from 30,000 to 1,000. If I do so, this is what I see as
output:

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 11.205683 seconds

duration of comparisons with useless memcmp()s: 14.542997 seconds
"""

It looks like the overhead here is surprisingly consistent with the
first run - again, about a 30% increase in run time.

As for 1MiB strings (this time, with an N_STRINGS of 300):

"""
Strings generated - beginning tests
(baseline) duration of comparisons without useless memcmp()s: 23.624183 seconds

duration of comparisons with useless memcmp()s: 35.332728 seconds
"""

So, at this scale, the overhead gets quite a bit worse, but the case
also becomes quite a bit less representative (if that's even
possible). I suspect that the call stack's stupidly large size may be
a problem here, but I didn't try and fix that.

Does this answer your question? Are you intent on extrapolating across
different platforms based on this test program, rather than looking at
real SQL queries? While a certain amount of that makes sense, I think
we should focus on queries that have some chance of being seen in real
production PostgreSQL instances. Failing that, actual SQL queries.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Heikki Linnakangas
Date:
On 09/13/2014 11:28 PM, Peter Geoghegan wrote:
> Anyway, attached rough test program implements what you outline. This
> is for 30,000 32 byte strings (where just the final two bytes differ).
> On my laptop, output looks like this (edited to only show median
> duration in each case):

Got to be careful to not let the compiler optimize away microbenchmarks
like this. At least with my version of gcc, the strcoll calls get
optimized away, as do the memcmp calls, if you don't use the result for
anything. Clang was even more aggressive; it ran both comparisons in 0.0
seconds. Apparently it optimizes away the loops altogether.

Also, there should be a setlocale(LC_ALL, "") call somewhere. Otherwise
it runs in C locale, and we don't use strcoll() at all for C locale.

After fixing those, it runs much slower, so I had to reduce the number
of strings. Here's a fixed program. I'm now getting numbers like this:

(baseline) duration of comparisons without useless memcmp()s: 6.007368
seconds

duration of comparisons with useless memcmp()s: 6.079826 seconds

Both values vary in range 5.9 - 6.1 s, so it's fair to say that the
useless memcmp() is free with these parameters.

Is this the worst case scenario?

- Heikki


Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Got to be careful to not let the compiler optimize away microbenchmarks like
> this. At least with my version of gcc, the strcoll calls get optimized away,
> as do the memcmp calls, if you don't use the result for anything. Clang was
> even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it
> optimizes away the loops altogether.

I suppose the fact that I saw results that fit my pre-conceived notion
of what was happening made me lose my initial concern about that.

> Also, there should be a setlocale(LC_ALL, "") call somewhere. Otherwise it
> runs in C locale, and we don't use strcoll() at all for C locale.

Oops. This might be a useful mistake, though -- if the strcoll() using
the C locale is enough to make the memcmp() not free, then that
suggests that strcoll() is the "hiding place" for the useless
memcmp(), where instructions relating to the memcmp() can execute in
parallel to instructions relating to strcoll() that add latency from
memory accesses (for non-C locales). With the C locale, strcoll() is
equivalent to strcmp()/memcmp().

Commenting out the setlocale(LC_ALL, "") in your revised versions
shows something like my original numbers (so I guess my compiler
wasn't smart enough to optimize away the strcoll() + memcmp() cases).
Whereas, there is no noticeable regression/difference between each
case when I run the revised program unmodified. That seems to prove
that strcoll() is a good enough "hiding place".

> Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless
> memcmp() is free with these parameters.
>
> Is this the worst case scenario?

Other than pushing the differences much much later in the strings
(which you surely thought of already), yes. I think it's worse than
the worst, because we've boiled this down to just the comparison part,
leaving only the strcoll() as a "hiding place", which is evidently
good enough. I thought that it was important that there be an
unpredictable access pattern (characteristic of quicksort), so that
memory latency is added here and there. I'm happy to learn that I was
wrong about that, and that a strcoll() alone hides the would-be
memcmp() latency.

Large strings matter much less anyway, I think. If you have a pair of
strings both longer than CACHE_LINE_SIZE bytes, and the first
CACHE_LINE_SIZE bytes are identical, and the lengths are known to
match, it seems like a very sensible bet to anticipate that they're
fully equal. So in a world where that affects the outcome of this test
program, I think it still changes nothing (if, indeed, it matters at
all, which it appears not to anyway, at least with 256 byte strings).

We should probably do a fully opportunistic "memcmp() == 0" within
varstr_cmp() itself, so that Windows has the benefit of this too, as
well as callers like compareJsonbScalarValue().
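
Put together with the second memcmp() that hu_HU-style collations
require, the comparison strategy under discussion looks roughly like
the following sketch (illustrative only -- text_cmp_sketch is not the
actual varstr_cmp(), which also deals with Windows/UTF-16, palloc'd
buffer reuse, and error handling that are omitted here):

```c
#include <stdlib.h>
#include <string.h>

static int
text_cmp_sketch(const char *s1, size_t len1, const char *s2, size_t len2)
{
    int     cmp;
    char   *b1,
           *b2;

    /*
     * Opportunistic fast path: equal-length strings that are bytewise
     * identical must compare equal under any collation, so strcoll()
     * can be skipped entirely.
     */
    if (len1 == len2 && memcmp(s1, s2, len1) == 0)
        return 0;

    /* strcoll() requires NUL-terminated input */
    b1 = malloc(len1 + 1);
    b2 = malloc(len2 + 1);
    memcpy(b1, s1, len1);
    b1[len1] = '\0';
    memcpy(b2, s2, len2);
    b2[len2] = '\0';
    cmp = strcoll(b1, b2);
    free(b1);
    free(b2);

    /*
     * Some collations (the hu_HU case) report equality for bytewise
     * different strings; break such ties with a binary comparison so
     * that the ordering stays deterministic.
     */
    if (cmp == 0)
    {
        cmp = memcmp(s1, s2, len1 < len2 ? len1 : len2);
        if (cmp == 0)
            cmp = (len1 < len2) ? -1 : (len1 > len2) ? 1 : 0;
    }
    return cmp;
}
```

With setlocale(LC_ALL, "") called beforehand, identical inputs take the
memcmp()-only fast path without ever reaching strcoll(), while
strcoll()-equal-but-bytewise-different inputs still sort consistently.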

Actually, looking at it closely, I think that there might still be a
microscopic regression, as there might have also been with my variant
of your SQL test case [1] - certainly in the noise, but perhaps
measurable with enough runs. If there is, that seems like an
acceptable price to pay.

When I test this stuff, I'm now very careful about power management
settings on my laptop...there are many ways to be left with egg on
your face with this kind of benchmark.

[1] http://www.postgresql.org/message-id/CAM3SWZQY95Sow00b+zJycrGMR-uF1mz8rYv4_Ou2ENcvsTnxYA@mail.gmail.com
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Heikki Linnakangas
Date:
On 09/14/2014 11:34 PM, Peter Geoghegan wrote:
> On Sun, Sep 14, 2014 at 7:37 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless
>> memcmp() is free with these parameters.
>>
>> Is this the worst case scenario?
>
> Other than pushing the differences much much later in the strings
> (which you surely thought of already), yes.

Please test the absolute worst case scenario you can think of. As I said 
earlier, if you can demonstrate that the slowdown of that is acceptable, 
we don't even need to discuss how likely or realistic the case is.

- Heikki




Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Sun, Sep 14, 2014 at 10:37 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 09/13/2014 11:28 PM, Peter Geoghegan wrote:
>> Anyway, attached rough test program implements what you outline. This
>> is for 30,000 32 byte strings (where just the final two bytes differ).
>> On my laptop, output looks like this (edited to only show median
>> duration in each case):
>
> Got to be careful to not let the compiler optimize away microbenchmarks like
> this. At least with my version of gcc, the strcoll calls get optimized away,
> as do the memcmp calls, if you don't use the result for anything. Clang was
> even more aggressive; it ran both comparisons in 0.0 seconds. Apparently it
> optimizes away the loops altogether.
>
> Also, there should be a setlocale(LC_ALL, "") call somewhere. Otherwise it
> runs in C locale, and we don't use strcoll() at all for C locale.
>
> After fixing those, it runs much slower, so I had to reduce the number of
> strings. Here's a fixed program. I'm now getting numbers like this:
>
> (baseline) duration of comparisons without useless memcmp()s: 6.007368
> seconds
>
> duration of comparisons with useless memcmp()s: 6.079826 seconds
>
> Both values vary in range 5.9 - 6.1 s, so it's fair to say that the useless
> memcmp() is free with these parameters.
>
> Is this the worst case scenario?

I can't see a worse one, and I replicated your results here on my
MacBook Pro.  I also tried with 1MB strings and, surprisingly, it was
basically free there, too.

It strikes me that perhaps we should make this change (rearranging
things so that the memcmp tiebreak is run before strcoll) first,
before dealing with the rest of the abbreviated keys infrastructure.
It appears to be a separate improvement which is worthwhile
independently of what we do about that patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It strikes me that perhaps we should make this change (rearranging
> things so that the memcmp tiebreak is run before strcoll) first,
> before dealing with the rest of the abbreviated keys infrastructure.
> It appears to be a separate improvement which is worthwhile
> independently of what we do about that patch.

I guess we could do that, but AFAICT the only open item blocking the
commit of a basic version of abbreviated keys (the informally agreed
to basic version lacking support for single-attribute aggregates) is
what to do about the current need to create a separate sortsupport
state. I've talked about my thoughts on that question in detail now
[1].

BTW, you probably realize this, but we still need a second memcmp()
after strcoll() too. hu_HU will care about that [2].

[1] http://www.postgresql.org/message-id/CAM3SWZQCDCnfWd3qzoO4QmY4G8oKHUqyrd26bBLa7FL2x-nTjg@mail.gmail.com
[2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Sep 15, 2014 at 1:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Sep 15, 2014 at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It strikes me that perhaps we should make this change (rearranging
>> things so that the memcmp tiebreak is run before strcoll) first,
>> before dealing with the rest of the abbreviated keys infrastructure.
>> It appears to be a separate improvement which is worthwhile
>> independently of what we do about that patch.
>
> I guess we could do that, but AFAICT the only open item blocking the
> commit of a basic version of abbreviated keys (the informally agreed
> to basic version lacking support for single-attribute aggregates) is
> what to do about the current need to create a separate sortsupport
> state. I've talked about my thoughts on that question in detail now
> [1].

I think there's probably more than that to work out, but in any case
there's no harm in getting a simple optimization done first before
moving on to a complicated one.

> BTW, you probably realize this, but we still need a second memcmp()
> after strcoll() too. hu_HU will care about that [2].
>
> [1] http://www.postgresql.org/message-id/CAM3SWZQCDCnfWd3qzoO4QmY4G8oKHUqyrd26bBLa7FL2x-nTjg@mail.gmail.com
> [2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

I rather assume we could reuse the results of the first memcmp()
instead of doing it again.

x = memcmp();
if (x == 0)
   return x;
y = strcoll();
if (y == 0)
   return x;
return y;

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think there's probably more than that to work out, but in any case
> there's no harm in getting a simple optimization done first before
> moving on to a complicated one.

I guess we never talked about the abort logic in all that much detail.
I suppose there's that, too.

> I rather assume we could reuse the results of the first memcmp()
> instead of doing it again.
>
> x = memcmp();
> if (x == 0)
>    return x;
> y = strcoll();
> if (y == 0)
>    return x;
> return y;

Of course, but you know what I mean. (I'm sure the compiler will
realize this if the programmer doesn't.)

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Sep 15, 2014 at 1:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Sep 15, 2014 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think there's probably more than that to work out, but in any case
>> there's no harm in getting a simple optimization done first before
>> moving on to a complicated one.
>
> I guess we never talked about the abort logic in all that much detail.
> I suppose there's that, too.

Well, the real point is that from where I'm sitting, this...

>> x = memcmp();
>> if (x == 0)
>>    return x;
>> y = strcoll();
>> if (y == 0)
>>    return x;
>> return y;

...looks like about a 10-line patch.  We have the data to show that
the loss is trivial even in the worst case, and we have or should be
able to get data showing that the best-case win is significant even
without the abbreviated key stuff.  If you'd care to draft a patch for
just that, I assume we could get it committed in a day or two, whereas
I'm quite sure that considerably more work than that remains for the
main patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Sep 15, 2014 at 11:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> ...looks like about a 10-line patch.  We have the data to show that
> the loss is trivial even in the worst case, and we have or should be
> able to get data showing that the best-case win is significant even
> without the abbreviated key stuff.  If you'd care to draft a patch for
> just that, I assume we could get it committed in a day or two, whereas
> I'm quite sure that considerably more work than that remains for the
> main patch.

OK, I'll draft a patch for that today, including similar alterations
to varstr_cmp() for the benefit of Windows and so on.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Sep 15, 2014 at 11:25 AM, Peter Geoghegan <pg@heroku.com> wrote:
> OK, I'll draft a patch for that today, including similar alterations
> to varstr_cmp() for the benefit of Windows and so on.

I attach a much simpler patch, that only adds an opportunistic
"memcmp() == 0" before a possible strcoll().  Both
bttextfastcmp_locale() and varstr_cmp() have the optimization added,
since there is no point in leaving anyone out for this part.

When this is committed, and I hear back from you on the question of
what to do about having an independent sortsupport state for
abbreviated tie-breakers (and possibly other issues of concern), I'll
produce a rebased patch with a single commit.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Sep 15, 2014 at 4:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach a much simpler patch, that only adds an opportunistic
> "memcmp() == 0" before a possible strcoll().  Both
> bttextfastcmp_locale() and varstr_cmp() have the optimization added,
> since there is no point in leaving anyone out for this part.

FWIW, it occurs to me that this could be a big win for cases like
ExecMergeJoin(). Obviously, abbreviated keys will usually make your
merge join on text attributes a lot faster in the common case where a
sort is involved (if we consider that the sort is integral to the cost
of the join). However, when making sure that inner and outer tuples
match, the MJCompare() call will *very* frequently get away with a
cheap memcmp().  That could make a very significant additional
difference, I think.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Sep 15, 2014 at 7:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Sep 15, 2014 at 11:25 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> OK, I'll draft a patch for that today, including similar alterations
>> to varstr_cmp() for the benefit of Windows and so on.
>
> I attach a much simpler patch, that only adds an opportunistic
> "memcmp() == 0" before a possible strcoll().  Both
> bttextfastcmp_locale() and varstr_cmp() have the optimization added,
> since there is no point in leaving anyone out for this part.

Even though our testing seems to indicate that the memcmp() is
basically free, I think it would be good to make the effort to avoid
doing memcmp() and then strcoll() and then strncmp().  Seems like it
shouldn't be too hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Sep 16, 2014 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Even though our testing seems to indicate that the memcmp() is
> basically free, I think it would be good to make the effort to avoid
> doing memcmp() and then strcoll() and then strncmp().  Seems like it
> shouldn't be too hard.

Really? The tie-breaker for the benefit of locales like hu_HU uses
strcmp(), not memcmp(). It operates on the now-terminated copies of
strings. There is no reason to think that the strings must be the same
size for that strcmp(). I'd rather only do the new opportunistic
"memcmp() == 0" thing when len1 == len2. And I wouldn't like to have
to also figure out that it's safe to use the earlier result, because
as it happens len1 == len2, or any other such trickery.

The bug fix that added the strcmp() tie-breaker was committed in 2005.
PostgreSQL had locale support for something like 8 years prior, and it
took that long for us to notice the problem. I would suggest that
makes the case for doing anything else pretty marginal. In the bug
report at the time, len1 != len2 anyway.
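(In outline, the comparison described here ends up structured as below. This is a simplified sketch only: the real bttextfastcmp_locale()/varstr_cmp() operate on NUL-terminated copies of the detoasted strings and have separate Windows/UTF-8 paths.)

```c
#include <string.h>

/*
 * Sketch of locale-aware text comparison with the opportunistic
 * "memcmp() == 0" fast path.  buf1/buf2 are assumed to be
 * NUL-terminated copies of the strings being compared.
 */
static int
varstr_cmp_sketch(const char *buf1, int len1, const char *buf2, int len2)
{
    int     result;

    /*
     * Fast path: equal-length strings that are bytewise identical must
     * compare equal under any sane collation, so skip strcoll().
     */
    if (len1 == len2 && memcmp(buf1, buf2, len1) == 0)
        return 0;

    result = strcoll(buf1, buf2);

    /*
     * Tiebreak with strcmp(): some locales (e.g. hu_HU) report equality
     * for strings that are not bytewise identical, which would break
     * the B-Tree requirement that equal keys be interchangeable.
     */
    if (result == 0)
        result = strcmp(buf1, buf2);

    return result;
}
```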

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Sep 16, 2014 at 4:55 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 16, 2014 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Even though our testing seems to indicate that the memcmp() is
>> basically free, I think it would be good to make the effort to avoid
>> doing memcmp() and then strcoll() and then strncmp().  Seems like it
>> shouldn't be too hard.
>
> Really? The tie-breaker for the benefit of locales like hu_HU uses
> strcmp(), not memcmp(). It operates on the now-terminated copies of
> strings. There is no reason to think that the strings must be the same
> size for that strcmp(). I'd rather only do the new opportunistic
> "memcmp() == 0" thing when len1 == len2. And I wouldn't like to have
> to also figure out that it's safe to use the earlier result, because
> as it happens len1 == len2, or any other such trickery.

OK, good point.  So committed as-is, then, except that I rewrote the
comments, which I felt were excessively long for the amount of code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Sep 19, 2014 at 9:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK, good point.  So committed as-is, then, except that I rewrote the
> comments, which I felt were excessively long for the amount of code.

Thanks!

I look forward to hearing your thoughts on the open issues with the
patch as a whole.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Sep 11, 2014 at 8:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Sep 9, 2014 at 2:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I like that I don't have to care about every combination, and can
>>> treat abbreviation abortion as the special case with the extra step,
>>> in line with how I think of the optimization conceptually. Does that
>>> make sense?
>>
>> No.  comparetup_heap() is hip-deep in this optimization as it stands,
>> and what I proposed - if done correctly - isn't going to make that
>> significantly worse.  In fact, it really ought to make things better:
>> you should be able to set things up so that ssup->comparator is always
>> the test that should be applied first, regardless of whether we're
>> aborted or not-aborted or not doing this in the first place; and then
>> ssup->tiebreak_comparator, if not NULL, can be applied after that.
>
> I'm not following here. Isn't that at least equally true of what I've
> proposed? Sure, I'm checking "if (!state->abbrevAborted)" first, but
> that's irrelevant to the non-abbreviated case. It has nothing to
> abort. Also, AFAICT we cannot abort and still call ssup->comparator()
> indifferently, since sorttuple.datum1 fields are perhaps abbreviated
> keys in half of all cases (i.e. pre-abort tuples), and uninitialized
> garbage the other half of the time (i.e. post-abort tuples).

You can if you engineer ssup->comparator() to contain the right
pointer at the right time.  Also, shouldn't you go back and fix up
those abbreviated keys to point to datum1 again if you abort?

> Where is the heap_getattr() stuff supposed to happen for the first
> attribute to get an authoritative comparison in the event of aborting
> (if we're calling ssup->comparator() on datum1 indifferently)? We
> decide that we're going to use abbreviated keys within datum1 fields
> up-front. When we abort, we cannot use datum1 fields at all (which
> doesn't appear to matter for text -- the datum1 optimization has
> historically only benefited pass-by-value types).

You always pass datum1 to a function.  The code doesn't need to care
about whether that function is expecting abbreviated or
non-abbreviated datums unless that function returns equality.  Then it
needs to think about calling a backup comparator if there is one.

> Is doing all this worth the small saving in memory? Personally, I
> don't think that it is, but I defer to you.

I don't care about the memory; I care about the code complexity and
the ease of understanding that code.  I am confident that it can be
done better than the patch does it today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Sep 19, 2014 at 2:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Also, shouldn't you go back and fix up
> those abbreviated keys to point to datum1 again if you abort?

Probably not - it appears to make very little difference to
unoptimized pass-by-reference types whether or not datum1 can be used
(see my simulation of Kevin's worst case, for example [1]). Streaming
through a not inconsiderable proportion of memtuples again is probably
a lot worse. The datum1 optimization (which is not all that old) made
a lot of sense when initially introduced, because it avoided chasing
through a pointer for pass-by-value types. I think that's its sole
justification, though.

BTW, I think that if we ever get around to doing this for numeric, it
won't ever abort. The abbreviation strategy can be adaptive, to
maximize the number of comparisons successfully resolved with
abbreviated keys. This would probably use a streaming algorithm like
HyperLogLog too.

[1] http://www.postgresql.org/message-id/CAM3SWZQHjxiyrsqBs5w3u-vTJ_jT2hp8o02big5wYb4m9Lp1jg@mail.gmail.com
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Fri, Sep 19, 2014 at 2:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Probably not - it appears to make very little difference to
> unoptimized pass-by-reference types whether or not datum1 can be used
> (see my simulation of Kevin's worst case, for example [1]). Streaming
> through a not inconsiderable proportion of memtuples again is probably
> a lot worse. The datum1 optimization (which is not all that old) made
> a lot of sense when initially introduced, because it avoided chasing
> through a pointer for pass-by-value types. I think that's its sole
> justification, though.


Just to be clear -- I am blocked on this. Do you really prefer to
restart copying heap tuples from scratch in the event of aborting,
just to make sure that the datum1 representation is consistently
either a pointer to text, or an abbreviated key? I don't think that
the costs involved make that worth it, as I've said, but I'm not sure
how to resolve that controversy.

I suggest that we focus on making sure the abort logic itself is
sound. There probably hasn't been enough discussion of that. Once that
is resolved, we can revisit the question of whether or not copying
should restart to keep the datum1 representation consistent. I suspect
that leaving that until later will be easier all around.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Wed, Sep 24, 2014 at 7:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 19, 2014 at 2:54 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Probably not - it appears to make very little difference to
>> unoptimized pass-by-reference types whether or not datum1 can be used
>> (see my simulation of Kevin's worst case, for example [1]). Streaming
>> through a not inconsiderable proportion of memtuples again is probably
>> a lot worse. The datum1 optimization (which is not all that old) made
>> a lot of sense when initially introduced, because it avoided chasing
>> through a pointer for pass-by-value types. I think that's its sole
>> justification, though.
>
> Just to be clear -- I am blocked on this. Do you really prefer to
> restart copying heap tuples from scratch in the event of aborting,
> just to make sure that the datum1 representation is consistently
> either a pointer to text, or an abbreviated key? I don't think that
> the costs involved make that worth it, as I've said, but I'm not sure
> how to resolve that controversy.
>
> I suggest that we focus on making sure the abort logic itself is
> sound. There probably hasn't been enough discussion of that. Once that
> is resolved, we can revisit the question of whether or not copying
> should restart to keep the datum1 representation consistent. I suspect
> that leaving that until later will be easier all around.

The top issue on my agenda is figuring out a way to get rid of the
extra SortSupport object.  I'm not going to commit any version of this
patch that uses a second SortSupport for the tiebreak.  I doubt anyone
else will like that either, but you can try.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The top issue on my agenda is figuring out a way to get rid of the
> extra SortSupport object.

Really? I'm surprised. Clearly the need to restart heap tuple copying
from scratch, in order to make the datum1 representation consistent,
rather than abandoning datum1 for storing abbreviated keys or pointers
entirely is a very important aspect of whether or not we should change
that. In turn, that's something that's going to (probably
significantly) affect the worst case.

Do you have an opinion on that? If you want me to start from scratch,
and then have a consistent datum1 representation, and then be able to
change the structure of comparetup_heap() as you outline (so as to get
rid of the extra SortSupport object), I can do that. My concern is the
regression. The datum1 pointer optimization appears to matter very
little for pass by value types (traditionally, before abbreviated
keys), and so I have a hard time imagining this working out.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Sep 25, 2014 at 2:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The top issue on my agenda is figuring out a way to get rid of the
>> extra SortSupport object.
>
> Really? I'm surprised. Clearly the need to restart heap tuple copying
> from scratch, in order to make the datum1 representation consistent,
> rather than abandoning datum1 for storing abbreviated keys or pointers
> entirely is a very important aspect of whether or not we should change
> that. In turn, that's something that's going to (probably
> significantly) affect the worst case.
>
> Do you have an opinion on that?

I haven't looked at that part of the patch in detail yet, so... not
really.  But I don't see why you'd ever need to restart heap tuple
copying.  At most you'd need to re-extract datum1 from the tuples you
have already copied.  To find out how much that optimization buys, you
should use tuples with many variable-length columns (say, 50)
preceding the text column you're sorting on. I won't be surprised if
that turns out to be expensive enough to be worth worrying about, but
I have not benchmarked it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Sep 25, 2014 at 11:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I haven't looked at that part of the patch in detail yet, so... not
> really.  But I don't see why you'd ever need to restart heap tuple
> copying.  At most you'd need to re-extract datum1 from the tuples you
> have already copied.

Well, okay, technically it's not restarting heap tuple copying, but
it's about the same thing. The point is that you have to stream a
significant chunk of the memtuples array through memory again. Not to
mention, having infrastructure to do that, and pick up where we left
off, which is significantly more code, all to make comparetup_heap() a
bit clearer (i.e. making it so that it won't have to think about an
extra sortsupport state).

> To find out how much that optimization buys, you
> should use tuples with many variable-length columns (say, 50)
> preceding the text column you're sorting on. I won't be surprised if
> that turns out to be expensive enough to be worth worrying about, but
> I have not benchmarked it.

Sorry, but I don't follow. I don't think the pertinent question is if
it's a noticeable cost. I think the pertinent question is if it's
worth it. Doing something about it necessitates a lot of extra memory
access. Not doing something about it hardly affects the amount of
memory access required, perhaps not at all.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Thu, Sep 25, 2014 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> To find out how much that optimization buys, you
>> should use tuples with many variable-length columns (say, 50)
>> preceding the text column you're sorting on. I won't be surprised if
>> that turns out to be expensive enough to be worth worrying about, but
>> I have not benchmarked it.
>
> Sorry, but I don't follow. I don't think the pertinent question is if
> it's a noticeable cost. I think the pertinent question is if it's
> worth it. Doing something about it necessitates a lot of extra memory
> access. Not doing something about it hardly affects the amount of
> memory access required, perhaps not at all.

I think you're mincing words.  If you go back and fix datum1, you'll
spend a bunch of effort doing that.  If you don't, you'll pay a cost
on every comparison to re-find the relevant column inside each tuple.
You can compare those costs in a variety of cases, including the one I
mentioned, where the latter cost will be relatively high.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Thu, Sep 25, 2014 at 1:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> (concerns about a second sortsupport state)

I think I may have underestimated the cost of not having
sorttuple.datum1 with a pointer-to-text representation available in
cases such as the one you describe.

Attached revision introduces an alternative approach, which does not
have a separate sortsupport state struct. In the event of aborting
abbreviation, we go back and fix up datum1 to consistently have a
pointer-to-text representation, making a comparator swap safe - at
that point, it's as if abbreviation was never even considered (apart
from the cost, of course). We still need a special tie-breaker
comparator, though.

I hope this addresses your concern.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Sep 29, 2014 at 10:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> <single sortsupport state patch>.

You probably noticed that I posted an independently useful patch to
make all tuplesort cases use sortsupport [1] - currently, both the
B-Tree and CLUSTER cases do not use the sortsupport infrastructure
more or less for no good reason. That patch was primarily written to
make abbreviated keys work for all cases, though. I think that making
heap tuple sorts based on a text attribute much faster is a very nice
thing, but in practice most OLTP or web applications are not all that
sensitive to the amount of time taken to sort heap tuples. However,
the length of time it takes to build indexes is something that most
busy production applications are very sensitive to, since of course
apart from the large consumption of system resources often required,
however long the sort takes is virtually the same amount of time as
however long we hold a very heavy, disruptive relation-level
ShareLock. Obviously the same applies to CLUSTER, but more so, since
it must acquire an AccessExclusiveLock on the relation to be
reorganized. I think almost everyone will agree that making B-Tree
builds much faster is the really compelling case here, because that's
where slow sorts cause by far the most pain for users in the real
world.

Attached patch, when applied, accelerates all tuplesort cases using
abbreviated keys, building on previous work here, as well as the patch
posted to that other thread. Exact instructions are in the commit
message of 0004-* (i.e. where to find the pieces I haven't posted
here). I also attach a minor bitrot fix commit/patch.

Performance is improved for B-Tree index builds by a great deal, too.
The improvements are only slightly less than those seen for comparable
heap tuple sorts (that is, my earlier test cases that had client
overhead removed). With larger sorts, that difference tends to get
lost in the noise easily.

I'm very encouraged by this. I think that being able to build B-Tree
indexes on text attributes very significantly faster than previous
versions of PostgreSQL is likely to be a significant feature for
PostgreSQL 9.5. After all, external sorts are where improvements are
most noticeable [2] - they're so much faster with this patch that
they're actually sometimes faster than internal sorts *with*
abbreviated keys. That was something I found quite surprising.

[1] http://www.postgresql.org/message-id/CAM3SWZTfKZHTUiWDdHg+6tcYuMsdHoC=bMuAiVgMP9AThj42Gw@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAM3SWZQVjCgmE6uBe-YDipu0n5BO7RMz31zRHMSkdDuynejmJA@mail.gmail.com
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Sat, Oct 11, 2014 at 6:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached patch, when applied, accelerates all tuplesort cases using
> abbreviated keys, building on previous work here, as well as the patch
> posted to that other thread.

I attach an updated patch set, rebased on top of the master branch's
tip. All relevant tuplesort cases (B-Tree, MinimalTuple, CLUSTER) are
now directly covered by the patch set, since there is now general
sortsupport support for those cases in the master branch -- no need to
apply some other patch from some other thread.

For the convenience of reviewers, this new revision includes a new
approach to making my improvements cumulative: A second commit adds
tuple count estimation. This hint, passed along to the text opclass's
convert routine, is taken from the optimizer's own estimate, or the
relcache's reltuples, depending on the tuplesort case being
accelerated. As in previous revisions, the idea is to give the opclass
a sense of proportion about how far along it is, to be weighed in
deciding whether or not to abort abbreviation. One potentially
controversial aspect of that is how the text opclass abbreviation cost
model/abort early stuff weighs simply having many tuples - past a
certain point, it *always* proceeds with abbreviation, no matter what
the cardinality of abbreviated keys so far is. For that reason in
particular, it seemed to make sense to split these parts out into a
second commit.

I hope that we can finish up all 9.5 work on accelerated sorting soon.
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Sun, Nov 9, 2014 at 10:02 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Oct 11, 2014 at 6:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Attached patch, when applied, accelerates all tuplesort cases using
>> abbreviated keys, building on previous work here, as well as the patch
>> posted to that other thread.
>
> I attach an updated patch set, rebased on top of the master branch's
> tip. All relevant tuplesort cases (B-Tree, MinimalTuple, CLUSTER) are
> now directly covered by the patch set, since there is now general
> sortsupport support for those cases in the master branch -- no need to
> apply some other patch from some other thread.
>
> For the convenience of reviewers, this new revision includes a new
> approach to making my improvements cumulative: A second commit adds
> tuple count estimation. This hint, passed along to the text opclass's
> convert routine, is taken from the optimizer's own estimate, or the
> relcache's reltuples, depending on the tuplesort case being
> accelerated. As in previous revisions, the idea is to give the opclass
> a sense of proportion about how far along it is, to be weighed in
> deciding whether or not to abort abbreviation. One potentially
> controversial aspect of that is how the text opclass abbreviation cost
> model/abort early stuff weighs simply having many tuples - past a
> certain point, it *always* proceeds with abbreviation, no matter what
> the cardinality of abbreviated keys so far is. For that reason in
> particular, it seemed to make sense to split these parts out into a
> second commit.
>
> I hope that we can finish up all 9.5 work on accelerated sorting soon.

There's a lot of stuff in this patch I'm still trying to digest, but
here are a few thoughts on patch 0001:

- This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE.

- I really don't think we need a #define in pg_config_manual.h for
this.  Please omit that.

- I'm much happier with the way the changes to sortsupport.h look in
this version.  However, I think that auth_comparator is a confusing
name, because "auth" is often used as an abbreviation for
"authentication".   We can spell it out (authoritative_comparator) or
come up with a different name (backup_comparator?
abbrev_full_comparator?).  Whatever we do, ApplySortComparatorAuth()
should be renamed to match.

- Also, I don't think making abbrev_state an enumerated value with two
values is really doing anything for us; we could just use a Boolean.
I'm wondering if we should actually go a bit further and remove this
from the SortSupport object and instead add an additional Boolean flag
to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all
the way down to the opclass's sortsupport function.  It seems like
that might be more clear.  Once the opclass function has done its
thing, the other three new members are enough to know whether we're
using the optimization or not (and can be fiddled if we want to make a
later decision to call the whole thing off).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE.

Actually, the word "only" is removed (because PG_CACHE_LINE_SIZE has a
new client now). So it isn't quite the same paragraph as before.

> - I really don't think we need a #define in pg_config_manual.h for
> this.  Please omit that.

You'd prefer to not offer a way to disable abbreviation? Okay. I guess
that makes sense - it should work well as a general optimization.

> - I'm much happier with the way the changes to sortsupport.h look in
> this version.  However, I think that auth_comparator is a confusing
> name, because "auth" is often used as an abbreviation for
> "authentication".   We can spell it out (authoritative_comparator) or
> come up with a different name (backup_comparator?
> abbrev_full_comparator?).  Whatever we do, ApplySortComparatorAuth()
> should be renamed to match.

Okay.

> - Also, I don't think making abbrev_state an enumerated value with two
> values is really doing anything for us; we could just use a Boolean.
> I'm wondering if we should actually go a bit further and remove this
> from the SortSupport object and instead add an additional Boolean flag
> to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all
> the way down to the opclass's sortsupport function.  It seems like
> that might be more clear.  Once the opclass function has done its
> thing, the other three new members are enough to know whether we're
> using the optimization or not (and can be fiddled if we want to make a
> later decision to call the whole thing off).

I'm not sure about that. I'd prefer to have tuplesort (and one or two
other sites) set the "abbreviation is possible in principle" flag.
Otherwise, sortsupport needs to assume that the leading attribute is
going to be the abbreviation-applicable one, which might not always be
true. Still, it's not as if I feel strongly about it.


-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Nov 25, 2014 at 10:38 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> - Also, I don't think making abbrev_state an enumerated value with two
>> values is really doing anything for us; we could just use a Boolean.
>> I'm wondering if we should actually go a bit further and remove this
>> from the SortSupport object and instead add an additional Boolean flag
>> to PrepareSortSupportFrom(OrderingOp|IndexRel) that gets passed all
>> the way down to the opclass's sortsupport function.  It seems like
>> that might be more clear.  Once the opclass function has done its
>> thing, the other three new members are enough to know whether we're
>> using the optimization or not (and can be fiddled if we want to make a
>> later decision to call the whole thing off).
>
> I'm not sure about that. I'd prefer to have tuplesort (and one or two
> other sites) set the "abbreviation is possible in principle" flag.

As for the related question of whether or not there should just be a
bool in place of an abbreviation state enum: I thought that we might
add some more flags to that enum (you'll recall that there actually
was another flag in earlier revisions, relating to optimistic
tie-breaks with memcmp() that the master branch now always does
anyway). But come to think of it, I think it's very unlikely that that
flag will ever be extended to represent some now unforeseen state
regarding abbreviation. It's either going to be "abbreviation
applicable for this sort and this attribute" or "not applicable". So,
yes, let's make it a boolean instead.
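For illustration, the arrangement being converged on can be sketched as below. The struct and function names here (DemoSortSupport, demo_sortsupport, and so on) are simplified stand-ins for the real SortSupport machinery, not the committed definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified mirror of the fields under
 * discussion.  Field names are illustrative only. */
typedef struct DemoSortSupport
{
    bool        abbreviate;     /* caller: abbreviation applicable here? */
    void       *(*abbrev_converter) (const char *value);
    int         (*abbrev_full_comparator) (const char *a, const char *b);
} DemoSortSupport;

static int
demo_full_cmp(const char *a, const char *b)
{
    /* stand-in for an authoritative comparator like bttextcmp() */
    while (*a && *a == *b)
    {
        a++;
        b++;
    }
    return (unsigned char) *a - (unsigned char) *b;
}

static void *
demo_converter(const char *value)
{
    (void) value;
    return NULL;                /* placeholder for blob generation */
}

/* The opclass support function fills in the abbreviation hooks only
 * when the caller said abbreviation is applicable; tuplesort later
 * tests abbrev_converter != NULL to see whether it is actually used. */
void
demo_sortsupport(DemoSortSupport *ssup)
{
    if (ssup->abbreviate)
    {
        ssup->abbrev_converter = demo_converter;
        ssup->abbrev_full_comparator = demo_full_cmp;
    }
}
```

The point of the bool over an enum: the caller-set flag is a pure yes/no question, while "actually in use" is carried separately by whether the function pointers got set.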

As I think I mentioned at one point, I imagine that if and when we do
abbreviation with the numeric opclass, it won't ever abort - its
encoding strategy will adapt. That ought to be possible for certain
other datatypes, of which numeric is the best example. Although, I
think it's only going to be a handful of the most important datatypes
that get abbreviation support.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> There's a lot of stuff in this patch I'm still trying to digest

I spotted a bug in the most recent revision. Mea culpa.

I think that the new field Tuplesortstate.abbrevNext should be an
int64, not an int. The fact that Tuplesortstate.memtupcount is an int
is not reason enough to make abbrevNext an int -- after all, with the
patch applied tuplesort uses a doubling growth strategy in respect of
abbrevNext, whereas grow_memtuples() is very careful about integer
overflow when growing memtupcount. I suggest we follow the good
example of tuplesort_skiptuples() in making our "ntuples" variable
(Tuplesortstate.abbrevNext) an int64 instead. The alternative is to
add grow_memtuples()-style checks.
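To make the overflow concern concrete, here is a minimal sketch (illustrative names, not the patch's actual code) of why a doubling check-point counter wants to be int64:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of why abbrevNext wants to be int64: a doubling growth
 * strategy quickly exceeds INT_MAX, even though memtupcount (an int,
 * carefully guarded in grow_memtuples()) never does.  With int64,
 * no overflow check is needed for any realistic tuple count. */
int64_t
abbrev_next_after(int64_t abbrev_next, int doublings)
{
    while (doublings-- > 0)
        abbrev_next *= 2;       /* safe in 64 bits; would wrap as int */
    return abbrev_next;
}
```

Starting from 10 and doubling 30 times already lands well past INT_MAX, which is the scenario a plain int silently mishandles.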

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Nov 25, 2014 at 1:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE.
>
> Actually, the word "only" is removed (because PG_CACHE_LINE_SIZE has a
> new client now). So it isn't quite the same paragraph as before.

Oy, I missed that.

>> - I really don't think we need a #define in pg_config_manual.h for
>> this.  Please omit that.
>
> You'd prefer to not offer a way to disable abbreviation? Okay. I guess
> that makes sense - it should work well as a general optimization.

I'd prefer not to have a #define in pg_config_manual.h.  Only stuff
that we expect a reasonably decent number of users to need to change
should be in that file, and this is too marginal for that.  If anybody
other than the developers of the feature is disabling this, the whole
thing is going to be ripped back out.

> I'm not sure about that. I'd prefer to have tuplesort (and one or two
> other sites) set the "abbreviation is possible in principle" flag.
> Otherwise, sortsupport needs to assume that the leading attribute is
> going to be the abbreviation-applicable one, which might not always be
> true. Still, it's not as if I feel strongly about it.

When wouldn't that be true?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Dec 2, 2014 at 1:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'd prefer not to have a #define in pg_config_manual.h.  Only stuff
> that we expect a reasonably decent number of users to need to change
> should be in that file, and this is too marginal for that.  If anybody
> other than the developers of the feature is disabling this, the whole
> thing is going to be ripped back out.

I agree. The patch either works well in general and it goes in, or it
doesn't and should be rejected. That doesn't mean that the standard
applied is that regressions are absolutely unacceptable, but the
standard shouldn't be too far off that.  I feel pretty confident that
we'll be able to meet that standard, though, because database luminary
Jim Gray recommended this technique in about 1995. Even if the details
of what I have here could stand to be tweaked, there is no getting
around the fact that a good sort routine needs to strongly consider
locality. That was apparent even in 1995, but now it's a very major
trend.

Incidentally, I think that an under-appreciated possible source of
regressions here is abbreviated attributes that have a strong
physical/logical correlation. I could see a small regression for one
such case even though my cost model indicated that it should be very
profitable. On the other hand, on other occasions my cost model (i.e.
considering how good a proxy for full key cardinality abbreviated key
cardinality is) was quite pessimistic. Although, at least it was only
a small regression, even though the correlation was something like
0.95. And at least the sort will be very fast in any case.

You'll recall that Heikki's test case involved correlation like that,
even though it was mostly intended to make a point about the entropy
in abbreviated keys. Correlation was actually the most important
factor there. I think it might be generally true that it's the most
important factor, in practice more important even than capturing
sufficient entropy in the abbreviated key representation.

>> I'm not sure about that. I'd prefer to have tuplesort (and one or two
>> other sites) set the "abbreviation is possible in principle" flag.
>> Otherwise, sortsupport needs to assume that the leading attribute is
>> going to be the abbreviation-applicable one, which might not always be
>> true. Still, it's not as if I feel strongly about it.
>
> When wouldn't that be true?

It just feels a bit wrong to me. There might be a future in which we
want to use the datum1 field for a non-leading attribute. For example,
when it is known ahead of time that there are low cardinality integers
in the leading key/attribute. Granted, that's pretty speculative, but
then it's not as if I'm insisting that it must be done that way. I
defer to you.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Dec 2, 2014 at 4:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> I'm not sure about that. I'd prefer to have tuplesort (and one or two
>>> other sites) set the "abbreviation is possible in principle" flag.
>>> Otherwise, sortsupport needs to assume that the leading attribute is
>>> going to be the abbreviation-applicable one, which might not always be
>>> true. Still, it's not as if I feel strongly about it.
>>
>> When wouldn't that be true?
>
> It just feels a bit wrong to me. There might be a future in which we
> want to use the datum1 field for a non-leading attribute. For example,
> when it is known ahead of time that there are low cardinality integers
> in the leading key/attribute. Granted, that's pretty speculative, but
> then it's not as if I'm insisting that it must be done that way. I
> defer to you.

Well, maybe you should make the updates we've agreed on and I can take
another look at it.  But I didn't think that I was proposing to change
anything about the level at which the decision about whether to
abbreviate or not was made; rather, I thought I was suggesting that we
pass that flag down to the code that initializes the sortsupport
object as an argument rather than through the sortsupport structure
itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, maybe you should make the updates we've agreed on and I can take
> another look at it.

Agreed.

> But I didn't think that I was proposing to change
> anything about the level at which the decision about whether to
> abbreviate or not was made; rather, I thought I was suggesting that we
> pass that flag down to the code that initializes the sortsupport
> object as an argument rather than through the sortsupport structure
> itself.

The flag I'm talking about concerns the *applicability* of
abbreviation, and not whether or not it will actually be used (maybe
the opclass lacks support, or decides not to for some platform
specific reason). Tuplesort has a contract with abbreviation +
sortsupport that considers whether or not the function pointer used to
abbreviate is set, which relates to whether or not abbreviation will
*actually* be used. Note that for non-abbreviation-applicable
attributes, btsortsupport_worker() never sets the function pointer
(nor, incidentally, does it set the other abbreviation related
function pointers in the struct).

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Dec 2, 2014 at 5:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, maybe you should make the updates we've agreed on and I can take
>> another look at it.
>
> Agreed.
>
>> But I didn't think that I was proposing to change
>> anything about the level at which the decision about whether to
>> abbreviate or not was made; rather, I thought I was suggesting that we
>> pass that flag down to the code that initializes the sortsupport
>> object as an argument rather than through the sortsupport structure
>> itself.
>
> The flag I'm talking about concerns the *applicability* of
> abbreviation, and not whether or not it will actually be used (maybe
> the opclass lacks support, or decides not to for some platform
> specific reason). Tuplesort has a contract with abbreviation +
> sortsupport that considers whether or not the function pointer used to
> abbreviate is set, which relates to whether or not abbreviation will
> *actually* be used. Note that for non-abbreviation-applicable
> attributes, btsortsupport_worker() never sets the function pointer
> (nor, incidentally, does it set the other abbreviation related
> function pointers in the struct).

Right, and what I'm saying is that maybe the "applicability" flag
shouldn't be stored in the SortSupport object, but passed down as an
argument.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Right, and what I'm saying is that maybe the "applicability" flag
> shouldn't be stored in the SortSupport object, but passed down as an
> argument.

But then how does that information get to any given sortsupport
routine? That's the place that really needs to know if abbreviation is
useful. In general, they're only passed a SortSupport object. Are you
suggesting revising the signature required of SortSupport routines to
add that extra flag as an additional argument?

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Right, and what I'm saying is that maybe the "applicability" flag
>> shouldn't be stored in the SortSupport object, but passed down as an
>> argument.

> But then how does that information get to any given sortsupport
> routine? That's the place that really needs to know if abbreviation is
> useful. In general, they're only passed a SortSupport object. Are you
> suggesting revising the signature required of SortSupport routines to
> add that extra flag as an additional argument?

I think that is what he's suggesting, and I too am wondering why it's
a good idea.
        regards, tom lane



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Dec 2, 2014 at 2:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, maybe you should make the updates we've agreed on and I can take
>> another look at it.
>
> Agreed.

Attached, revised patchset makes these updates. I continue to use the
sortsupport struct to convey that a given attribute of a given sort is
abbreviation-applicable (although the field is now a bool, not an
enum).

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Dec 2, 2014 at 5:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached, revised patchset makes these updates.

Whoops. Missed some obsolete comments. Here is a third commit that
makes a further small modification to one comment.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Dec 2, 2014 at 5:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Right, and what I'm saying is that maybe the "applicability" flag
>>> shouldn't be stored in the SortSupport object, but passed down as an
>>> argument.
>
>> But then how does that information get to any given sortsupport
>> routine? That's the place that really needs to know if abbreviation is
>> useful. In general, they're only passed a SortSupport object. Are you
>> suggesting revising the signature required of SortSupport routines to
>> add that extra flag as an additional argument?
>
> I think that is what he's suggesting, and I too am wondering why it's
> a good idea.

I find it somewhat confusing that we've got one flag which is only
used from the time the SortSupport object is created until the time
that it's fully initialized, and then a different way of indicating
whether we paid attention to that flag.  I'm not totally sure what the
right solution to that problem is, but the current situation feels
like something of a wart.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Dec 2, 2014 at 1:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Incidentally, I think that an under-appreciated possible source of
> regressions here is abbreviated attributes that have a strong
> physical/logical correlation. I could see a small regression for one
> such case even though my cost model indicated that it should be very
> profitable.

This was the column in question:

postgres=# select * from pg_stats where tablename = 'ohio_voters' and
attname = 'mailing_address1' ;
-[ RECORD 1 ]-
schemaname             | public
tablename              | ohio_voters
attname                | mailing_address1
inherited              | f
null_frac              | 0
avg_width              | 5
n_distinct             | 789
most_common_vals       | {"    "}
most_common_freqs      | {0.969267}
histogram_bounds       | **** SNIP ****
correlation            | 0.944785

This n_distinct is wrong, though. In fact, the number of distinct
values is 25,946, while the number of distinct abbreviated keys is
13,691. So correlation was not the dominant factor here (although it
was probably still a factor) - rather, the dominant factor was that
the vast majority of comparisons would get away with an opportunistic
"memcmp() == 0" anyway (although not with Postgres 9.4), and so my
baseline is very fast for this case.
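As a rough illustration of that opportunistic fast path (a simplified stand-in, not the actual varstr comparator):

```c
#include <assert.h>
#include <string.h>

/* Sketch of the opportunistic "memcmp() == 0" tie-break that master
 * now always attempts: when two datums have equal length and identical
 * bytes, they must compare equal under any sane collation, so the
 * expensive strcoll() call can be skipped entirely.  The slow path
 * below is a placeholder; the real code calls strcoll()/strcoll_l(). */
int
demo_text_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    int         r;

    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;               /* fast path: bytewise equal */

    /* placeholder for the locale-aware comparison */
    r = memcmp(a, b, alen < blen ? alen : blen);
    if (r == 0)
        r = (alen < blen) ? -1 : 1;
    return r;
}
```

With 96.9% of the column being the same four-space value, nearly every comparison takes the memcmp() == 0 exit, which is why the 9.5 baseline is already so fast on this data.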

This would not have come up had the value "   " been represented as
NULL (as it clearly should have been), since that would not undergo
strxfrm() transformation/abbreviation in the first place. Even still,
highly skewed attributes exist in the wild, and deserve our
consideration - we do not model the distribution of values within the
set.

I believe that these cases are rare enough, and (thanks to the already
committed parts of this work) fast enough to probably not be worried
about; maybe a more complex cost model could do better, but I'm
inclined to think that it's not worth it. We'd end up slightly
improving this case at bigger cost to other, much more common cases.
Besides, equality-resolved comparisons are not necessarily much
cheaper for datatypes other than Postgres 9.5 text (in a world where
there is a variety of datatypes accelerated by abbreviation), which
discourages a general solution.

A custom format dump of this data (publicly available Ohio State voter
records) is available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/ohio_voters.custom.dump

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
There is an interesting thread about strcoll() overhead over on -general:

http://www.postgresql.org/message-id/CAB25XEXNONdRmC1_cy3jvmB0TMyDm38eF9q2D7xLa0rbnCJ5pQ@mail.gmail.com

My guess was that this person experienced a rather unexpected downside
of spilling to disk when sorting on a text attribute: System
throughput becomes very CPU bound, because tapesort tends to result in
more comparisons [1].  With abbreviated keys, tapesort can actually
compete with quicksort in certain cases [2]. Tapesorts of text
attributes are especially bad on released versions of Postgres, and
will perform very little actual I/O.

In all seriousness, I wonder if we should add a release note item
stating that when using Postgres 9.5, due to the abbreviated key
optimization, external sorts can be much more I/O bound than in
previous releases...

[1] http://www.postgresql.org/message-id/20140806035512.GA91137@tornado.leadboat.com
[2] http://www.postgresql.org/message-id/CAM3SWZQiGvGhMB4TMbEWoNjO17=ySB5b5Y5MGqJsaNq4uWTryA@mail.gmail.com
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Michael Paquier
Date:
On Wed, Dec 3, 2014 at 10:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Dec 2, 2014 at 5:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Attached, revised patchset makes these updates.
>
> Whoops. Missed some obsolete comments. Here is a third commit that
> makes a further small modification to one comment.
Moving this patch to CF 2014-12 as more review seems to be needed.
-- 
Michael



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Dec 2, 2014 at 8:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Dec 2, 2014 at 2:16 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Well, maybe you should make the updates we've agreed on and I can take
>>> another look at it.
>>
>> Agreed.
>
> Attached, revised patchset makes these updates. I continue to use the
> sortsupport struct to convey that a given attribute of a given sort is
> abbreviation-applicable (although the field is now a bool, not an
> enum).

All right, it seems Tom is with you on that point, so after some
study, I've committed this with very minor modifications.  Sorry for
the long delay.  I have not committed the 0002 patch, though, because
I haven't studied that enough yet to know whether I think it's a good
idea.  Perhaps that could get its own CommitFest entry and thread,
though, to separate it from this exceedingly long discussion and make
it clear exactly what we're hoping to gain by that patch specifically.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Jan 19, 2015 at 3:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> All right, it seems Tom is with you on that point, so after some
> study, I've committed this with very minor modifications.  Sorry for
> the long delay.  I have not committed the 0002 patch, though, because
> I haven't studied that enough yet to know whether I think it's a good
> idea.  Perhaps that could get its own CommitFest entry and thread,
> though, to separate it from this exceedingly long discussion and make
> it clear exactly what we're hoping to gain by that patch specifically.

By the way, for those following along at home, here's an example of
how this patch can help:

rhaas=# create table stuff as select random()::text as a, 'filler
filler filler'::text as b, g as c from generate_series(1, 1000000) g;
SELECT 1000000
rhaas=# create index on stuff (a);
CREATE INDEX

On the PPC64 machine I normally use for performance testing, it takes
about 6.3 seconds to build the index with the commit just before this
one.  With this commit, it drops to 1.9 seconds.  That's more than a
3x speedup!

Now, if I change the query that creates the table to this.

rhaas=# create table stuff as select 'aaaaaaaa' || random()::text as
a, 'filler filler filler'::text as b, g as c from generate_series(1,
1000000) g;

...then it takes 10.8 seconds with or without this patch.  In general,
any case where the first few characters of every string are exactly
identical (or only quite rarely different) will not benefit, but many
practical cases will benefit significantly.  Also, Peter's gone to a
fair amount of work to make sure that even when the patch does not
help, it doesn't hurt, either.

So that's pretty cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On the PPC64 machine I normally use for performance testing, it takes
> about 6.3 seconds to build the index with the commit just before this
> one.  With this commit, it drops to 1.9 seconds.  That's more than a
> 3x speedup!
>
> Now, if I change the query that creates the table to this.
>
> rhaas=# create table stuff as select 'aaaaaaaa' || random()::text as
> a, 'filler filler filler'::text as b, g as c from generate_series(1,
> 1000000) g;
>
> ...then it takes 10.8 seconds with or without this patch.  In general,
> any case where the first few characters of every string are exactly
> identical (or only quite rarely different) will not benefit, but many
> practical cases will benefit significantly.  Also, Peter's gone to a
> fair amount of work to make sure that even when the patch does not
> help, it doesn't hurt, either.
>
> So that's pretty cool.

Wow, nice!

Good work Peter!
Thanks,
    Stephen

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Jan 19, 2015 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> All right, it seems Tom is with you on that point, so after some
> study, I've committed this with very minor modifications.  Sorry for
> the long delay.

Thank you very much for your help with this! I appreciate it.

> I have not committed the 0002 patch, though, because
> I haven't studied that enough yet to know whether I think it's a good
> idea.  Perhaps that could get its own CommitFest entry and thread,
> though, to separate it from this exceedingly long discussion and make
> it clear exactly what we're hoping to gain by that patch specifically.

I'll think about that some more. It might be that we're chasing
diminishing returns there.

It appears that the buildfarm animal brolga isn't happy about this
patch. I'm not sure why, since I thought we already figured out bugs
or other inconsistencies in various strxfrm() implementations.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Jan 19, 2015 at 5:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It appears that the buildfarm animal brolga isn't happy about this
> patch. I'm not sure why, since I thought we already figured out bugs
> or other inconsistencies in various strxfrm() implementations.

Well, the first thing that comes to mind is that strxfrm() is
returning strings that, when sorted, do not give the same order we
would have obtained via strcoll().  It's true that there are existing
callers of strxfrm(), but it looks like that is mostly used for
statistics-gathering, so it's possible that differences vs. strcoll()
would not have shown up before now.  Is there any legitimate way that
strxfrm() and strcoll() can return inconsistent answers - e.g. they
are somehow allowed to derive their notion of the relevant locale
differently - or is this just a case of Cygwin being busted?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Alvaro Herrera
Date:
Peter Geoghegan wrote:

> It appears that the buildfarm animal brolga isn't happy about this
> patch. I'm not sure why, since I thought we already figured out bugs
> or other inconsistencies in various strxfrm() implementations.

You did notice that bowerbird isn't building, right?
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-01-19%2023%3A54%3A46

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> You did notice that bowerbird isn't building, right?
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-01-19%2023%3A54%3A46

Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever reason.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Jan 19, 2015 at 5:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> You did notice that bowerbird isn't building, right?
>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-01-19%2023%3A54%3A46
>
> Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever reason.

I think that the attached patch should at least fix that much. Maybe
the problem on the other animal is also explained by the lack of this,
since there could also be a MinGW-ish strxfrm_l(), I suppose.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Michael Paquier
Date:
On Tue, Jan 20, 2015 at 11:29 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Jan 19, 2015 at 5:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Mon, Jan 19, 2015 at 5:33 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> You did notice that bowerbird isn't building, right?
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2015-01-19%2023%3A54%3A46
>>
>> Yeah. Looks like strxfrm_l() isn't available on the animal, for whatever reason.
>
> I think that the attached patch should at least fix that much. Maybe
> the problem on the other animal is also explained by the lack of this,
> since there could also be a MinGW-ish strxfrm_l(), I suppose.
On MinGW-32, not that I know of:
$ find . -name *.h | xargs grep strxfrm_l
./lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define if strxfrm_l is available in <string.h>. */
./mingw32/lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define if strxfrm_l is available in <string.h>. */
strxfrm is defined in string.h though.

With your patch applied, the failure with MSVC disappeared, but there
is still a warning showing up:
(ClCompile target) ->
  src\backend\lib\hyperloglog.c(73): warning C4334: '<<' : result of
32-bit shift implicitly converted to 64 bits (was 64-bit shift
intended?)
-- 
Michael



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On MinGW-32, not that I know of:
> $ find . -name *.h | xargs grep strxfrm_l
> ./lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define if strxfrm_l is available in <string.h>. */
> ./mingw32/lib/gcc/mingw32/4.8.1/include/c++/mingw32/bits/c++config.h:/* Define if strxfrm_l is available in <string.h>. */
> strxfrm is defined in string.h though.

I'm not quite following. Doesn't that imply that strxfrm_l() at least
*could* be available? I guess it doesn't matter, though, because the
animal with the successful build that fails the locale regression test
(brolga) does not have locale_t support. Therefore, there is no new
strxfrm_l() caller.

My next guess is that the real problem is an assumption I've made.
That is, my assumption that strxfrm() always behaves equivalently to
strcpy() when the C locale happens to be in use may not be portable
(due to external factors). I guess we're inconsistent about making
sure that LC_COLLATE is set correctly in WIN32 and/or EXEC_BACKEND
builds, or something along those lines. The implementation in the past
got away with strcoll()/strxfrm() not having the C locale set, since
strcoll() was never called when the C locale was in use -- we just
called strcmp() instead.
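For what it's worth, the assumption in question can be stated as a tiny
standalone check. This is a hedged sketch of just the C library behavior
being relied on; nothing PostgreSQL-specific is involved:

```c
#include <locale.h>
#include <string.h>

/*
 * Sketch of the portability assumption under discussion: with the C
 * locale in effect for LC_COLLATE, strxfrm() is expected to behave
 * like strcpy(), i.e. the transformed blob is byte-identical to the
 * input string.  If some Windows builds don't actually have LC_COLLATE
 * set to "C" when we think they do, this equivalence breaks down.
 */
static int
c_locale_strxfrm_matches_strcpy(const char *s)
{
    char        blob[64];
    char        copy[64];
    size_t      len;

    if (setlocale(LC_COLLATE, "C") == NULL)
        return 0;

    len = strxfrm(blob, s, sizeof(blob));
    strcpy(copy, s);

    /* Same length and same bytes as a plain copy */
    return len == strlen(s) && strcmp(blob, copy) == 0;
}
```

On glibc this holds; whether it holds on every Windows runtime is
exactly the open question.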

Assuming that's correct, it might be easier just to entirely disable
the optimization on Windows, even with the C locale. It may not be
worth it to even bother just for C locale support of abbreviated keys.
I'm curious about what will happen there when the "_strxfrm_l()" fix
patch is applied.

> With your patch applied, the failure with MSVC disappeared, but there
> is still a warning showing up:
> (ClCompile target) ->
>   src\backend\lib\hyperloglog.c(73): warning C4334: '<<' : result of
> 32-bit shift implicitly converted to 64 bits (was 64-bit shift
> intended?

That seems harmless. I suggest an explicit cast to "Size" here.
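To illustrate (a minimal sketch, not the hyperloglog.c code itself):
the literal 1 has type int, so the shift is done in 32-bit arithmetic
before the result is widened, which is what C4334 warns about. Casting
the operand first makes the shift happen at the destination width:

```c
#include <stddef.h>

/* size_t stands in for PostgreSQL's Size typedef in this sketch */

static size_t
shift_narrow(int bits)
{
    return 1 << bits;           /* 32-bit shift, then widened: C4334 */
}

static size_t
shift_wide(int bits)
{
    return (size_t) 1 << bits;  /* shift performed at full width */
}
```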

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andrew Gierth
Date:
>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
 Robert> All right, it seems Tom is with you on that point, so after
 Robert> some study, I've committed this with very minor modifications.
 

This caught my eye (thanks to conflict with GS patch):
 * In the future, we should consider forcing the
 * tuplesort_begin_heap() case when the abbreviated key
 * optimization can thereby be used, even when numInputs is 1.
 

The comment in tuplesort_begin_datum that abbreviation can't be used
seems wrong to me; why is the copy of the original value pointed to by
stup->tuple (in the case of by-reference types, and abbreviation is
obviously not needed for by-value types) not sufficient?

Or what am I missing?

-- 
Andrew (irc:RhodiumToad)



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Mon, Jan 19, 2015 at 9:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I think that the attached patch should at least fix that much. Maybe
> the problem on the other animal is also explained by the lack of this,
> since there could also be a MinGW-ish strxfrm_l(), I suppose.

Committed that, rather blindly, since it looks like a reasonable fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 3:46 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> The comment in tuplesort_begin_datum that abbreviation can't be used
> seems wrong to me; why is the copy of the original value pointed to by
> stup->tuple (in the case of by-reference types, and abbreviation is
> obviously not needed for by-value types) not sufficient?

We haven't formalized the idea that pass-by-value types are not
targets for abbreviation (it's just that the practical application of
abbreviated keys is likely to be limited to pass-by-reference types,
generating a compact pass-by-value abbreviated representation). That
could be a useful restriction to formalize, and certainly seems likely
to be a harmless one, but for now that's the way it is.

It might be sufficient for some tuplesort_begin_datum() callers. Datum
tuple sorts require the original values. Aside from the formalization
of abbreviation only applying to pass-by-reference types, you'd have to
teach tuplesort_getdatum() to reconstruct the non-abbreviated
representation transparently from each SortTuple's "tuple proper".
However, the actual tuplesort_getdatum() calls could be the dominant
cost, not the sort (I'm not sure of that offhand - that's total
speculation).

Basically, the intersection of the datum sort case with abbreviated
keys seems complicated. I tended to think that the solution was to
force a heaptuple sort instead (where abbreviation naturally can be
used), since clearly that could work in some important cases like
nodeAgg.c, iff the gumption to do it that way was readily available.
Rightly or wrongly, I preferred that idea to the idea of teaching the
Datum case to handle abbreviation across the board. Maybe that's the
wrong way of fixing that, but for now I don't think it's acceptable
that abbreviation isn't always used in certain cases where it could
make sense (e.g. not for simple GroupAggregates with a single
attribute -- only multiple attribute GroupAggregates). After all, most
sort cases (e.g. B-Tree builds) didn't use SortSupport for several
years, simply because no one got around to it until I finally did a
few months back.

Note that most tuplesort callers that don't use abbreviation avoid it
for sensible reasons. For example, abbreviation simply doesn't make
sense for Top-N heap sorts, or MJExamineQuals(). The lack of
abbreviation in the single-attribute GroupAggregate/nodeAgg.c case
seems bad, but I don't have a good sense of how bad the
orderedsetaggs.c non-use is; it might matter less than the other case.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 2:00 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Maybe that's the
> wrong way of fixing that, but for now I don't think it's acceptable
> that abbreviation isn't always used in certain cases where it could
> make sense (e.g. not for simple GroupAggregates with a single
> attribute -- only multiple attribute GroupAggregates). After all, most
> sort cases (e.g. B-Tree builds) didn't use SortSupport for several
> years, simply because no one got around to it until I finally did a
> few months back.


Excuse me. I meant that this *is* an acceptable restriction for the time being.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andrew Gierth
Date:
>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
 Robert> All right, it seems Tom is with you on that point, so after
 Robert> some study, I've committed this with very minor modifications.
 

While hacking up a patch to demonstrate the simplicity of extending this
to the Datum sorter, I seem to have run into a fairly major issue with
this: there seems to be no attempt whatsoever to handle spilling to disk
correctly. The data spilled to disk only has the un-abbreviated values,
but nothing tries to re-abbreviate it (or disable abbreviations) when it
is read back in, and chaos ensues:

set work_mem = 64;
select v, v > lag(v) over (order by v)
  from (select 'B'||i as v from generate_series(1,10000) i
        union all
        select 'a'||i from generate_series(1,10000) i offset 0) s
 order by v limit 20;
 
   v    | ?column? 
--------+----------
 a10000 | 
 B10000 | f
 a1000  | t
 a1001  | t
 a1002  | t
 a1003  | t
 B1000  | f
 B1001  | t
 B1002  | t
 B1003  | t
 B1004  | t
 B1005  | t
 a1004  | t
 a1005  | t
 a1006  | t
 a1007  | t
 a1008  | t
 B1     | f
 B10    | t
 B100   | t
 
(20 rows)
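In outline, the failure mode is easy to see from the comparator's
perspective (illustrative types only, not the actual tuplesort.c
structures): the abbreviated comparator trusts the abbreviation and
only falls back to the authoritative comparator on a tie, so tuples
read back from disk with a lost (zeroed) abbreviation get compared
against in-memory tuples that still carry real abbreviations:

```c
#include <string.h>

/*
 * Illustrative stand-in for a SortTuple: an abbreviated key plus the
 * authoritative value.  Tuples re-read from disk come back with the
 * abbreviation lost (0 here), while in-memory tuples keep theirs.
 */
typedef struct
{
    unsigned int abbrev;        /* first bytes of key, or 0 if lost */
    const char  *full;          /* the authoritative key */
} FakeSortTuple;

static int
abbrev_compare(const FakeSortTuple *a, const FakeSortTuple *b)
{
    if (a->abbrev != b->abbrev)
        return (a->abbrev < b->abbrev) ? -1 : 1;
    return strcmp(a->full, b->full);    /* tie-break on real value */
}
```

Any tuple with a zeroed abbreviation sorts before any tuple with a real
one, regardless of the full keys, so the merge interleaves the two
populations more or less arbitrarily.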

-- 
Andrew (irc:RhodiumToad)



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Jan 20, 2015 at 10:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 19, 2015 at 9:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I think that the attached patch should at least fix that much. Maybe
>> the problem on the other animal is also explained by the lack of this,
>> since there could also be a MinGW-ish strxfrm_l(), I suppose.
>
> Committed that, rather blindly, since it looks like a reasonable fix.

Peter, this made bowerbird (Windows 8/Visual Studio) build, but it's
failing make check.  Ditto hamerkop (Windows 2k8/VC++) and currawong
(Windows XP Pro/MSVC++).  jacana (Windows 8/gcc) and brolga (Windows
XP Pro/cygwin) are unhappy too, although the failures are showing up
in different build stages rather than in 'make check'.  narwhal
(Windows 2k3/mingw) and frogmouth (Windows XP Pro/gcc) are happy,
though, so it's not affecting ALL of the Windows critters.  Still, I'm
leaning toward the view that we should disable this optimization
across-the-board on Windows until somebody has time to do the legwork
to figure out what it takes to make it work, and what makes it work on
some of these critters and fail on others.  We can't leave the
buildfarm red for long periods of time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Jan 20, 2015 at 6:27 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
>  Robert> All right, it seems Tom is with you on that point, so after
>  Robert> some study, I've committed this with very minor modifications.
>
> While hacking up a patch to demonstrate the simplicity of extending this
> to the Datum sorter, I seem to have run into a fairly major issue with
> this: there seems to be no attempt whatsoever to handle spilling to disk
> correctly. The data spilled to disk only has the un-abbreviated values,
> but nothing tries to re-abbreviate it (or disable abbreviations) when it
> is read back in, and chaos ensues:

Dear me.  Peter, can you fix this RSN?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 3:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Peter, this made bowerbird (Windows 8/Visual Studio) build, but it's
> failing make check.  Ditto hamerkop (Windows 2k8/VC++) and currawong
> (Windows XP Pro/MSVC++).  jacana (Windows 8/gcc) and brolga (Windows
> XP Pro/cygwin) are unhappy too, although the failures are showing up
> in different build stages rather than in 'make check'.  narwhal
> (Windows 2k3/mingw) and frogmouth (Windows XP Pro/gcc) are happy,
> though, so it's not affecting ALL of the Windows critters.  Still, I'm
> leaning toward the view that we should disable this optimization
> across-the-board on Windows until somebody has time to do the legwork
> to figure out what it takes to make it work, and what makes it work on
> some of these critters and fail on others.  We can't leave the
> buildfarm red for long periods of time.

Fair enough.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 3:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Dear me.  Peter, can you fix this RSN?

Investigating.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 3:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jan 20, 2015 at 3:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Dear me.  Peter, can you fix this RSN?
>
> Investigating.

It's certainly possible to fix Andrew's test case with the attached.
I'm not sure that that's the appropriate fix, though: there is
probably a case to be made for not bothering with abbreviation once
we've read tuples in for the final merge run. More likely, the
strongest case is for storing the abbreviated keys on disk too, and
reading those back.

--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 3:57 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It's certainly possible to fix Andrew's test case with the attached.
> I'm not sure that that's the appropriate fix, though: there is
> probably a case to be made for not bothering with abbreviation once
> we've read tuples in for the final merge run. More likely, the
> strongest case is for storing the abbreviated keys on disk too, and
> reading those back.

Maybe not, though: An extra 8 bytes per tuple on disk is not free.
OTOH, if we're I/O bound on the final merge, as we ought to be, then
recomputing the abbreviated keys could make sense, since there may
well be an idle CPU core anyway.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Jan 20, 2015 at 7:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jan 20, 2015 at 3:57 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> It's certainly possible to fix Andrew's test case with the attached.
>> I'm not sure that that's the appropriate fix, though: there is
>> probably a case to be made for not bothering with abbreviation once
>> we've read tuples in for the final merge run. More likely, the
>> strongest case is for storing the abbreviated keys on disk too, and
>> reading those back.
>
> Maybe not, though: An extra 8 bytes per tuple on disk is not free.
> OTOH, if we're I/O bound on the final merge, as we ought to be, then
> recomputing the abbreviated keys could make sense, since there may
> well be an idle CPU core anyway.

I was assuming we were going to fix this by undoing the abbreviation
(as in the abort case) when we spill to disk, and not bothering with
it thereafter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I was assuming we were going to fix this by undoing the abbreviation
> (as in the abort case) when we spill to disk, and not bothering with
> it thereafter.

The spill-to-disk case is at least as compelling as the internal sort
case. The overhead of comparisons is much higher for tapesort.

Attached patch serializes keys. On reflection, I'm inclined to go with
this approach. Even if the CPU overhead of reconstructing strxfrm()
blobs is acceptable for text, it might be much more expensive for
other types. I'm loath to throw away those abbreviated keys
unnecessarily.

We don't have to worry about having aborted abbreviation, since once
we spill to disk we've effectively committed to abbreviation. This
patch formalizes the idea that there is strictly a pass-by-value
representation required for such cases (but not that the original
Datums must be of a pass-by-reference, which is another thing
entirely). I've tested it some, obviously with Andrew's testcase and
the regression tests, but also with my B-Tree verification tool.
Please review it.

Sorry about this.
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I was assuming we were going to fix this by undoing the abbreviation
>> (as in the abort case) when we spill to disk, and not bothering with
>> it thereafter.
>
> The spill-to-disk case is at least as compelling as the internal sort
> case. The overhead of comparisons is much higher for tapesort.

First, we need to unbreak this.  Then, we can look at optimizing it.
The latter task will require performance testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 5:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I was assuming we were going to fix this by undoing the abbreviation
>>> (as in the abort case) when we spill to disk, and not bothering with
>>> it thereafter.
>>
>> The spill-to-disk case is at least as compelling as the internal sort
>> case. The overhead of comparisons is much higher for tapesort.
>
> First, we need to unbreak this.  Then, we can look at optimizing it.
> The latter task will require performance testing.

I don't see that any alternative isn't a performance trade-off. My
patch accomplishes unbreaking this. I agree that it still needs
review from that perspective, but it doesn't seem any worse than any
other alternative. Would you prefer it if the spill-to-disk case
aborted in the style of low entropy keys? That doesn't seem
significantly safer than this, and it is certainly not acceptable from
a performance perspective.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 5:46 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Would you prefer it if the spill-to-disk case
> aborted in the style of low entropy keys? That doesn't seem
> significantly safer than this, and it is certainly not acceptable from a
> performance perspective.

BTW, I can write that patch if that's your preference. Should I?

I just don't favor that even as a short term correctness fix, because
it seems unacceptable to throw away all the strxfrm() work where
that's a very predictable and even likely outcome. I suggest reviewing
and committing my fix as a short term fix, that may well turn out to
be generally acceptable upon further consideration. Yes, we'll need to
make a point of reviewing an already committed patch, but there is a
precedent for that.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Jan 20, 2015 at 8:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jan 20, 2015 at 5:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I was assuming we were going to fix this by undoing the abbreviation
>> (as in the abort case) when we spill to disk, and not bothering with
>> it thereafter.
>
> The spill-to-disk case is at least as compelling as the internal sort
> case. The overhead of comparisons is much higher for tapesort.
>
> Attached patch serializes keys. On reflection, I'm inclined to go with
> this approach. Even if the CPU overhead of reconstructing strxfrm()
> blobs is acceptable for text, it might be much more expensive for
> other types. I'm loath to throw away those abbreviated keys
> unnecessarily.
>
> We don't have to worry about having aborted abbreviation, since once
> we spill to disk we've effectively committed to abbreviation. This
> patch formalizes the idea that there is strictly a pass-by-value
> representation required for such cases (but not that the original
> Datums must be of a pass-by-reference, which is another thing
> entirely). I've tested it some, obviously with Andrew's testcase and
> the regression tests, but also with my B-Tree verification tool.
> Please review it.
>
> Sorry about this.

I don't want to change the on-disk format for tapes without a lot more
discussion.  Can you come up with a fix that avoids that for now?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 6:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't want to change the on-disk format for tapes without a lot more
> discussion.  Can you come up with a fix that avoids that for now?

A more conservative approach would be to perform conversion on-the-fly
once more. That wouldn't be patently unacceptable from a performance
perspective, and would also not change the on-disk format for tapes.

Thoughts?

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Tue, Jan 20, 2015 at 9:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jan 20, 2015 at 6:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't want to change the on-disk format for tapes without a lot more
>> discussion.  Can you come up with a fix that avoids that for now?
>
> A more conservative approach would be to perform conversion on-the-fly
> once more. That wouldn't be patently unacceptable from a performance
> perspective, and would also not change the on-disk format for tapes.
>
> Thoughts?

That might be OK.  Probably needs a bit of performance testing to see
how it looks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That might be OK.  Probably needs a bit of performance testing to see
> how it looks.

Well, we're still only doing it when we do our final merge. So that's
"only" doubling the number of conversions required, which if we're
blocked on I/O might not matter at all.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Tue, Jan 20, 2015 at 6:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Tue, Jan 20, 2015 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> That might be OK.  Probably needs a bit of performance testing to see
>> how it looks.
>
> Well, we're still only doing it when we do our final merge. So that's
> "only" doubling the number of conversions required, which if we're
> blocked on I/O might not matter at all.

You'll probably prefer the attached. This patch works by disabling
abbreviation, but only after writing out runs, with the final merge
left to go. That way, it doesn't matter when abbreviated keys are not
read back from disk (or regenerated).

I believe this bug was missed because it only occurs when there are
multiple runs, and not in the common case where there is one big
initial run that is found already sorted when we reach mergeruns().
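In outline, the approach is just to rewire the sort key before the
final merge so the authoritative comparator is used from then on. The
structure below is an illustrative sketch with field names loosely
modeled on SortSupport, not the committed tuplesort.c code:

```c
#include <string.h>

typedef int (*Comparator) (const void *a, const void *b);

/* Illustrative stand-in for a sort key's SortSupport state */
typedef struct
{
    void       *abbrev_converter;       /* non-NULL while abbreviating */
    Comparator  comparator;             /* comparator currently in use */
    Comparator  abbrev_full_comparator; /* authoritative comparator */
} FakeSortKey;

/*
 * Called once runs have been written out, with only the final merge
 * left: abbreviated keys were not written to tape, so stop trusting
 * them and fall back to the full comparator for everything.
 */
static void
disable_abbreviation_before_merge(FakeSortKey *key)
{
    if (key->abbrev_converter != NULL)
    {
        key->abbrev_converter = NULL;
        key->comparator = key->abbrev_full_comparator;
        key->abbrev_full_comparator = NULL;
    }
}

/* Self-check helper for the sketch */
static int
disable_works(void)
{
    FakeSortKey key = {(void *) 1, NULL, (Comparator) strcmp};

    disable_abbreviation_before_merge(&key);
    return key.abbrev_converter == NULL &&
        key.comparator == (Comparator) strcmp &&
        key.abbrev_full_comparator == NULL;
}
```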
--
Peter Geoghegan

Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andrew Gierth
Date:
>>>>> "Peter" == Peter Geoghegan <pg@heroku.com> writes:
 Peter> You'll probably prefer the attached. This patch works by
 Peter> disabling abbreviation, but only after writing out runs, with
 Peter> the final merge left to go. That way, it doesn't matter when
 Peter> abbreviated keys are not read back from disk (or regenerated).
 

This seems tolerable to me for a quick fix. The merits of storing the
abbreviation vs. re-abbreviating on input can be studied later.
 Peter> I believe this bug was missed because it only occurs when there
 Peter> are multiple runs, and not in the common case where there is one
 Peter> big initial run that is found already sorted when we reach
 Peter> mergeruns().
 

Ah, yes, there is an optimization for the one-run case which bypasses
all further comparisons, hiding the problem.

-- 
Andrew (irc:RhodiumToad)



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andrew Gierth
Date:
>>>>> "Peter" == Peter Geoghegan <pg@heroku.com> writes:

 Peter> Basically, the intersection of the datum sort case with
 Peter> abbreviated keys seems complicated.

Not to me. To me it seems completely trivial.

Now, I follow this general principle that someone who is not doing the
work should never say "X is easy" to someone who _is_ doing it, unless
they're prepared to at least outline the solution on request or
otherwise contribute.  So see the attached patch (which I will concede
could probably do with more comments, it's a quick hack intended for
illustration) and tell me what you think is missing that would make it a
complicated problem.

 Peter> I tended to think that the solution was to force a heaptuple
 Peter> sort instead (where abbreviation naturally can be used),

This seems completely wrong - why should the caller have to worry about
this implementation detail? The caller shouldn't have to know about what
types or what circumstances might or might not benefit from
abbreviation.

--
Andrew (irc:RhodiumToad)



Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Jan 21, 2015 at 4:44 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Now, I follow this general principle that someone who is not doing the
> work should never say "X is easy" to someone who _is_ doing it, unless
> they're prepared to at least outline the solution on request or
> otherwise contribute.  So see the attached patch (which I will concede
> could probably do with more comments, it's a quick hack intended for
> illustration) and tell me what you think is missing that would make it a
> complicated problem.

Okay, then. I concede the point: We should support the datum case as
you outline, since it is simpler than any alternative. It probably
won't even be necessary to formalize the idea that finished
abbreviated keys must be pass-by-value (at least not for the benefit
of this functionality); if someone writes an opclass that generates
pass-by-reference abbreviated keys (I think that might actually make
sense, although I'm being imaginative), it simply won't work for the
datum sort case, which is probably fine.

Are you going to submit this to the final commitfest? I'll review it if you do.
-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Peter Geoghegan
Date:
On Wed, Jan 21, 2015 at 2:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Okay, then. I concede the point: We should support the datum case as
> you outline, since it is simpler than any alternative. It probably
> won't even be necessary to formalize the idea that finished
> abbreviated keys must be pass-by-value (at least not for the benefit
> of this functionality); if someone writes an opclass that generates
> pass-by-reference abbreviated keys (I think that might actually make
> sense, although I'm being imaginative), it simply won't work for the
> datum sort case, which is probably fine.

I mean that a restriction formally preventing use of abbreviation with
pass-by-value types isn't necessary. That was something that I thought
we'd have to document as a restriction (for the benefit of your datum
sort patch), without considering that it could simply be skipped by
only considering state->datumTypeByVal (which is what you've proposed
here).

This requirement is much less likely than wanting to create
pass-by-value abbreviated keys for a pass-by-value datatype (which, as
I go into above, seems at least possible). This seems like a very
insignificant restriction, not worth formalizing or even mentioning in
code comments.

-- 
Peter Geoghegan



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Andrew Gierth
Date:
>>>>> "Peter" == Peter Geoghegan <pg@heroku.com> writes:
 Peter> Okay, then. I concede the point: We should support the datum
 Peter> case as you outline, since it is simpler than any alternative.
 Peter> It probably won't even be necessary to formalize the idea that
 Peter> finished abbreviated keys must be pass-by-value (at least not
 Peter> for the benefit of this functionality); if someone writes an
 Peter> opclass that generates pass-by-reference abbreviated keys (I
 Peter> think that might actually make sense, although I'm being
 Peter> imaginative), it simply won't work for the datum sort case,
 Peter> which is probably fine.
 

I don't see why a by-reference abbreviated key would be any more of an
issue for the datum sorter than for anything else. In either case you'd
just get excess memory usage (any memory allocated by the abbreviation
function for the result won't be charged against work_mem and won't be
freed until the sort ends).

What matters for the datum sorter (and not for the others) is that we
not try and abbreviate a by-value type (since we only have an allocated
copy of the value if it was by-reference); this is handled in the code
by just not asking for abbreviation in such cases.
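That check is small enough to state directly (a sketch with made-up
names, not the patch itself): abbreviation is requested only when the
type is passed by reference, since only then is there an allocated copy
of the original value to fall back on:

```c
#include <stdbool.h>

/*
 * Sketch of the datum-sort rule described above, with illustrative
 * names: only ask the opclass for abbreviation when the datum type is
 * pass-by-reference, because only then does the sort hold a copy of
 * the original value that can be handed back to the caller.
 */
static bool
datum_sort_wants_abbreviation(bool datumTypeByVal, bool has_abbrev_support)
{
    return has_abbrev_support && !datumTypeByVal;
}
```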
 Peter> Are you going to submit this to the final commitfest? I'll
 Peter> review it if you do.

I'll post a cleaned-up version after the existing issues are fixed.

-- 
Andrew (irc:RhodiumToad)



Re: B-Tree support function number 3 (strxfrm() optimization)

From
David Rowley
Date:
On 20 January 2015 at 17:10, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

> With your patch applied, the failure with MSVC disappeared, but there
> is still a warning showing up:
> (ClCompile target) ->
>   src\backend\lib\hyperloglog.c(73): warning C4334: '<<' : result of
> 32-bit shift implicitly converted to 64 bits (was 64-bit shift
> intended?

That seems harmless. I suggest an explicit cast to "Size" here.

This caught my eye too.

I agree about casting to Size.

Patch attached.

Regards

David Rowley 
Attachment

Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Fri, Jan 23, 2015 at 2:18 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> On 20 January 2015 at 17:10, Peter Geoghegan <pg@heroku.com> wrote:
>>
>> On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>
>> > With your patch applied, the failure with MSVC disappeared, but there
>> > is still a warning showing up:
>> > (ClCompile target) ->
>> >   src\backend\lib\hyperloglog.c(73): warning C4334: '<<' : result of
>> > 32-bit shift implicitly converted to 64 bits (was 64-bit shift
>> > intended?
>>
>> That seems harmless. I suggest an explicit cast to "Size" here.
>
> This caught my eye too.
>
> I agree about casting to Size.
>
> Patch attached.

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: B-Tree support function number 3 (strxfrm() optimization)

From
Robert Haas
Date:
On Wed, Jan 21, 2015 at 2:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
> You'll probably prefer the attached. This patch works by disabling
> abbreviation, but only after writing out runs, with the final merge
> left to go. That way, it doesn't matter when abbreviated keys are not
> read back from disk (or regenerated).

Yes, this seems like the way to go for now.  Thanks, committed.  And
thanks to Andrew for spotting this so quickly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company