From: Peter Geoghegan <pg@heroku.com>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Msg-id: CAM3SWZQHjxiyrsqBs5w3u-vTJ_jT2hp8o02big5wYb4m9Lp1jg@mail.gmail.com
In response to: Re: B-Tree support function number 3 (strxfrm() optimization) (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
On Thu, Jun 5, 2014 at 5:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> One thing that isn't all that obvious about this worst case is that
> it's in general very qsort() friendly, and therefore the startup costs
> (copying) totally dominates. Actually, you're not even sorting -
> you're verifying that the tuples are already exactly in order (a
> questionable optimization we apply at every level).

At the dev meeting in Ottawa, Kevin mentioned that the Wisconsin
courts have a column whose values all begin with "The State of
Wisconsin Vs.". I thought that this was an interesting case, because
it is representative of real data, which is crucially important to
consider here. I decided to simulate it. In my original test database:

postgres=# create table wisconsin(casen text);
CREATE TABLE
postgres=# insert into wisconsin select 'The State of Wisconsin Vs. '
|| city from cities;
INSERT 0 317102

sort-wisconsin.sql:
select * from (select * from wisconsin order by casen offset 1000000) d;

Master:

pgbench -M prepared -f sort-wisconsin.sql -T 300 -n
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 55
latency average: 5454.545 ms
tps = 0.181191 (including connections establishing)
tps = 0.181191 (excluding connections establishing)

Patch (most recent revision, with ameliorations, HyperLogLog, etc):

pgbench -M prepared -f sort-wisconsin.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 55
latency average: 5454.545 ms
tps = 0.182593 (including connections establishing)
tps = 0.182594 (excluding connections establishing)

Earlier patch (no ameliorations for Heikki's case):

pgbench -M prepared -f sort-wisconsin.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 54
latency average: 5555.556 ms
tps = 0.176914 (including connections establishing)
tps = 0.176915 (excluding connections establishing)

With my most recent revision, the ameliorating measures are effective
enough that, with the sortsupport shim and fmgr trampoline avoided, we
still come out ahead even in this case. Great. But you may be
surprised that the regression is so small for the patch without any
ameliorating measures (the original patch). That's because the data
isn't *perfectly* logically/physically correlated here, as in Heikki's
worst case, so the 317,102 wasted strxfrm() calls are relatively
inexpensive. Consider how cost_sort() models the cost of a sort when
an in-memory quicksort is anticipated:

/* We'll use plain quicksort on all the input tuples */
startup_cost += comparison_cost * tuples * LOG2(tuples);
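
Evaluating that formula for our 317,102-tuple sort is a trivial
standalone check of the arithmetic that follows (compile with -lm):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      tuples = 317102.0;

    /* The planner's comparison estimate: tuples * LOG2(tuples) */
    double      comparisons = tuples * log2(tuples);

    printf("estimated comparisons: %.0f\n", comparisons);  /* ~5,794,908 */

    /* strcoll() calls per wasted strxfrm() call */
    printf("ratio: %.1f\n", comparisons / tuples);          /* ~18.3 */

    return 0;
}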

In the case of this quicksort, the planner guesses there'll be "317102
* LOG2(317102)" comparisons -- about 5,794,908 of them, which implies
roughly 18 times as many strcoll() calls as wasted strxfrm() calls.
The cost of those strxfrm() calls begins to look insignificant before
n gets very big (at n = 100, it's 100 wasted strxfrm() calls to about
664 strcoll() calls). Unless, of course, you have a "bubble sort best
case" where everything is already completely in order, in which case
there'll be a 1:1 ratio between wasted strxfrm() calls and strcoll()
calls.

That best case exists because of an optimization that we added to our
qsort(). It appears in neither the original NetBSD implementation nor
the Bentley/McIlroy paper, nor anywhere else that I'm aware of. I'm
not the only person to regard it with suspicion - Tom has expressed
doubts about it in the past too [1]. Also, note that no
comparison-based sorting algorithm can do better than O(n log n) in
the average case - that's the information-theoretic lower bound.

To be clear: I'm certainly not saying that we shouldn't fix Heikki's
worst case (indeed, I believe I have fixed it), but we should also put
that worst case in perspective.
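
For anyone who hasn't been following the thread closely, the basic
shape of the optimization is sketched below. This is a simplified
illustration and not the actual patch -- the names (AbbrevTuple,
ABBREV_LEN, abbreviate()) are invented, and none of the ameliorating
measures (HyperLogLog-based abort, etc.) are shown:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ABBREV_LEN 8            /* think sizeof(Datum) on 64-bit */

typedef struct AbbrevTuple
{
    char        abbrev[ABBREV_LEN];     /* prefix of strxfrm() blob */
    const char *original;               /* full string, for tie-breaking */
} AbbrevTuple;

/* One up-front strxfrm() call per tuple, paid at copy time */
static void
abbreviate(AbbrevTuple *t, const char *s)
{
    size_t      needed = strxfrm(NULL, s, 0);
    char       *buf = malloc(needed + 1);   /* error handling elided */
    size_t      len = needed < ABBREV_LEN ? needed : ABBREV_LEN;

    strxfrm(buf, s, needed + 1);
    memset(t->abbrev, 0, ABBREV_LEN);       /* zero-pad short blobs */
    memcpy(t->abbrev, buf, len);
    t->original = s;
    free(buf);
}

/* qsort() comparator: cheap memcmp() first, strcoll() only on a tie */
static int
abbrev_cmp(const void *a, const void *b)
{
    const AbbrevTuple *ta = (const AbbrevTuple *) a;
    const AbbrevTuple *tb = (const AbbrevTuple *) b;
    int         r = memcmp(ta->abbrev, tb->abbrev, ABBREV_LEN);

    if (r != 0)
        return r;
    /* Prefixes tied -- fall back to a full comparison */
    return strcoll(ta->original, tb->original);
}

int
main(void)
{
    const char *cases[] = {
        "The State of Wisconsin Vs. Madison",
        "The State of Wisconsin Vs. Appleton",
        "The State of Wisconsin Vs. Green Bay",
    };
    AbbrevTuple tuples[3];
    int         i;

    setlocale(LC_COLLATE, "");
    for (i = 0; i < 3; i++)
        abbreviate(&tuples[i], cases[i]);
    qsort(tuples, 3, sizeof(AbbrevTuple), abbrev_cmp);
    for (i = 0; i < 3; i++)
        puts(tuples[i].original);
    return 0;
}

Note that with the long shared prefix, every abbreviated key in this
toy example compares equal and every comparison falls back to
strcoll() -- which is exactly why the Wisconsin case "wastes" one
strxfrm() call per tuple.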

By the way, I just realized that I failed to fully remove client
overhead (I should have put an extra 0 on the end of my offset for the
city.sql query), which added noise to the "city"/"country"/"province"
tests. Revised figures are as follows (these are better than before):

Master:
======

pgbench -M prepared -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 278
latency average: 1079.137 ms
tps = 0.924358 (including connections establishing)
tps = 0.924362 (excluding connections establishing)

Patch:
=====

pgbench -M prepared -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 300 s
number of transactions actually processed: 1046
latency average: 286.807 ms
tps = 3.486089 (including connections establishing)
tps = 3.486104 (excluding connections establishing)

Master:
======

pgbench -M prepared -j 4 -c 4 -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 387
latency average: 3100.775 ms
tps = 1.278062 (including connections establishing)
tps = 1.278076 (excluding connections establishing)

Patch:
=====

pgbench -M prepared -j 4 -c 4 -f sort-city.sql -T 300 -n

transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 4
number of threads: 4
duration: 300 s
number of transactions actually processed: 1670
latency average: 718.563 ms
tps = 5.557700 (including connections establishing)
tps = 5.557754 (excluding connections establishing)


BTW, if you want to measure any of this independently, I suggest
making sure that power management settings don't ruin things. On
Linux, for example, I advise setting the CPU governor to "performance"
for each core.

[1] http://www.postgresql.org/message-id/18033.1361789032@sss.pgh.pa.us
-- 
Peter Geoghegan


