Thread: Strange Create Index behaviour

Strange Create Index behaviour

From
Gary Doades
Date:
Platform: FreeBSD 6.0, Postgresql 8.1.2 compiled from the ports collection.

Not sure if this belongs in performance or bugs..

A pg_restore of my 2.5GB database was taking up to 2 hours to complete
instead of the expected 10-15 minutes. Checking the server it was mostly
CPU bound. Testing further I discovered that is was spending huge
amounts of CPU time creating some indexes.

It took a while to find out, but basically it boils down to this:

If the column that is having the index created has a certain
distribution of values then create index takes a very long time. If the
data values (integer in this case) a fairly evenly distributed then
create index is very quick, if the data values are all the same it is
very quick. I discovered that in the slow cases the column had
approximately half the values as zero and the rest fairly spread out.
One column started off with around 400,000 zeros and the rest of the
following rows spread between values of 1 to 500,000.

I have put together a test case that demonstrates the problem (see
below). I create a simple table, as close in structure to one of my
problem tables and populate an integer column with 100,000 zeros follow
by 100,000 random integers between 0 and 100,000. Then create an index
on this column. I then drop the table and repeat. The create index
should take around 1-2 seconds. A fair proportion of the time it takes
50 seconds!!!

If I fill the same row with all random data the create index always
takes a second or two. If I fill the column with all zeros everything is
still OK.

When my tables that I am trying to restore are over 2 million rows the
creating one index can take an hour!! (almost all CPU time).

All other areas of performance, once the dump is restored and analysed
seem to be OK, even large hash/merge joins and sorts

This is entirely repeatable in FreeBSD in that around half the time
create index will be incredibly slow.

All postgresql.conf settings are at the defaults for the test initially
(fresh install)

The final interesting thing is that as I increase shared buffers to 2000
or 3000 the problem gets *worse*

The following text is output from the test script..

select version();
                                             version

------------------------------------------------------------------------------------------------
  PostgreSQL 8.1.2 on i386-portbld-freebsd6.0, compiled by GCC cc (GCC)
3.4.4 [FreeBSD] 20050518
(1 row)

\timing
Timing is on.

-----  Many slow cases, note the 50+ seconds cases

create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 81.859 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 1482.141 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1543.508 ms
create index idx on atest(r);
CREATE INDEX
Time: 56685.230 ms

drop table atest;
DROP TABLE
Time: 4.616 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 6.889 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 2009.787 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1828.663 ms
create index idx on atest(r);
CREATE INDEX
Time: 3991.257 ms

drop table atest;
DROP TABLE
Time: 3.796 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 19.965 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 1625.059 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2622.827 ms
create index idx on atest(r);
CREATE INDEX
Time: 1082.799 ms

drop table atest;
DROP TABLE
Time: 4.627 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 2.953 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 2068.744 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2671.420 ms
create index idx on atest(r);
CREATE INDEX
Time: 8047.660 ms

drop table atest;
DROP TABLE
Time: 3.675 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 2.582 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 1723.987 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2263.131 ms
create index idx on atest(r);
CREATE INDEX
Time: 50050.308 ms

drop table atest;
DROP TABLE
Time: 52.744 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 25.370 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 2052.733 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2631.317 ms
create index idx on atest(r);
CREATE INDEX
Time: 61440.897 ms

drop table atest;
DROP TABLE
Time: 26.137 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 24.794 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),0,now(),now();
INSERT 0 100000
Time: 2851.977 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1553.046 ms
create index idx on atest(r);
CREATE INDEX
Time: 1774.920 ms


----  Fast (Normal?) cases

drop table atest;
DROP TABLE
Time: 4.422 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 2.543 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1516.246 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1407.400 ms
create index idx on atest(r);
CREATE INDEX
Time: 903.503 ms

drop table atest;
DROP TABLE
Time: 3.820 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 22.861 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1455.556 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 2037.996 ms
create index idx on atest(r);
CREATE INDEX
Time: 718.286 ms

drop table atest;
DROP TABLE
Time: 4.503 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 3.448 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1523.540 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1261.473 ms
create index idx on atest(r);
CREATE INDEX
Time: 727.707 ms

drop table atest;
DROP TABLE
Time: 3.564 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 2.897 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1447.504 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1403.525 ms
create index idx on atest(r);
CREATE INDEX
Time: 754.577 ms

drop table atest;
DROP TABLE
Time: 4.633 ms
create table atest(i int4, r int4,d1 timestamp, d2 timestamp);
CREATE TABLE
Time: 3.196 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1618.544 ms
insert into atest (i,r,d1,d2) select
generate_series(1,100000),random()*100000,now(),now();
INSERT 0 100000
Time: 1530.450 ms
create index idx on atest(r);
CREATE INDEX
Time: 802.980 ms
drop table atest;
DROP TABLE
Time: 4.707 ms
mserver#

Regards,
Gary.

Re: Strange Create Index behaviour

From
Tom Lane
Date:
Gary Doades <gpd@gpdnet.co.uk> writes:
> Platform: FreeBSD 6.0, Postgresql 8.1.2 compiled from the ports collection.

> If the column that is having the index created has a certain
> distribution of values then create index takes a very long time. If the
> data values (integer in this case) a fairly evenly distributed then
> create index is very quick, if the data values are all the same it is
> very quick. I discovered that in the slow cases the column had
> approximately half the values as zero and the rest fairly spread out.

Interesting.  I tried your test script and got fairly close times
for all the cases on two different machines:
    old HPUX machine: shortest 5800 msec, longest 7960 msec
    new Fedora 4 machine: shortest 461 msec, longest 608 msec
(the HPUX machine was doing other stuff at the same time, so some
of its variation is probably only noise).

So what this looks like to me is a corner case that FreeBSD's qsort
fails to handle well.

You might try forcing Postgres to use our private copy of qsort, as we
do on Solaris for similar reasons.  (The easy way to do this by hand
is to configure as normal, then alter the LIBOBJS setting in
src/Makefile.global to add "qsort.o", then proceed with normal build.)
However, I think that our private copy is descended from *BSD sources,
so it might have the same failure mode.  It'd be worth finding out.

> The final interesting thing is that as I increase shared buffers to 2000
> or 3000 the problem gets *worse*

shared_buffers is unlikely to impact index build time noticeably in
recent PG releases.  maintenance_work_mem would affect it a lot, though.
What setting were you using for that?

Can anyone else try these test cases on other platforms?

            regards, tom lane

Re: Strange Create Index behaviour

From
Gary Doades
Date:
Tom Lane wrote:
> Interesting.  I tried your test script and got fairly close times
> for all the cases on two different machines:
>     old HPUX machine: shortest 5800 msec, longest 7960 msec
>     new Fedora 4 machine: shortest 461 msec, longest 608 msec
> (the HPUX machine was doing other stuff at the same time, so some
> of its variation is probably only noise).
>
> So what this looks like to me is a corner case that FreeBSD's qsort
> fails to handle well.
>
> You might try forcing Postgres to use our private copy of qsort, as we
> do on Solaris for similar reasons.  (The easy way to do this by hand
> is to configure as normal, then alter the LIBOBJS setting in
> src/Makefile.global to add "qsort.o", then proceed with normal build.)
> However, I think that our private copy is descended from *BSD sources,
> so it might have the same failure mode.  It'd be worth finding out.
>
>> The final interesting thing is that as I increase shared buffers to 2000
>> or 3000 the problem gets *worse*
>
> shared_buffers is unlikely to impact index build time noticeably in
> recent PG releases.  maintenance_work_mem would affect it a lot, though.
> What setting were you using for that?
>
> Can anyone else try these test cases on other platforms?
>

Thanks for that.

I've since tried it on Windows (pg 8.1.2) and the times were all
similar, around 1200ms so it might just be BSD.

I'll have to wait until tomorrow to get back to my BSD box. FreeBSD
ports makes it easy to install, so I'll have to figure out how to get in
and change things manually. I guess the appropriate files are still left
around after the ports make command finishes, so I just edit the file
and make again?

If it can't be fixed though I guess we may have a problem using BSD. I'm
surprised this hasn't been brought up before, the case doesn't seem
*that* rare. Maybe not that many using FreeBSD?

I'd certainly be interested if anyone else can repro it on FreeBSD though.

Regards,
Gary.


Re: Strange Create Index behaviour

From
Gary Doades
Date:
Tom Lane wrote:
> shared_buffers is unlikely to impact index build time noticeably in
> recent PG releases.  maintenance_work_mem would affect it a lot, though.
> What setting were you using for that?
>

Also, i tried upping maintenance_work_mem to 65536 and it didn't make
much difference (maybe 10% faster for the "normal" cases). Upping the
shared_buffers *definitely* makes the bad cases worse though, but I
agree I don't see why...

Regards,
Gary.

Re: Strange Create Index behaviour

From
Simon Riggs
Date:
On Wed, 2006-02-15 at 20:00 +0000, Gary Doades wrote:

> I have put together a test case

Please enable trace_sort=on and then repeat tests and post the
accompanying log file.

I think this is simply the sort taking longer depending upon the data
distribution, but I'd like to know for sure.

Thanks,

Best Regards, Simon Riggs


Re: Strange Create Index behaviour

From
Tom Lane
Date:
I wrote:
> Interesting.  I tried your test script and got fairly close times
> for all the cases on two different machines:
>     old HPUX machine: shortest 5800 msec, longest 7960 msec
>     new Fedora 4 machine: shortest 461 msec, longest 608 msec

> So what this looks like to me is a corner case that FreeBSD's qsort
> fails to handle well.

I tried forcing PG to use src/port/qsort.c on the Fedora machine,
and lo and behold:
    new Fedora 4 machine: shortest 434 msec, longest 8530 msec

So it sure looks like this script does expose a problem on BSD-derived
qsorts.  Curiously, the case that's much the worst for me is the third
in the script, while the shortest time is the first case, which was slow
for Gary.  So I'd venture that the *BSD code has been tweaked somewhere
along the way, in a manner that moves the problem around without really
fixing it.  (Anyone want to compare the actual FreeBSD source to what
we have?)

This is pretty relevant stuff, because there was a thread recently
advocating that we stop using the platform qsort on all platforms:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php

It's really interesting to see a case where port/qsort is radically
worse than other qsorts ... unless we figure that out and fix it,
I think the idea of using port/qsort everywhere has just taken a
major hit.

            regards, tom lane

Re: Strange Create Index behaviour

From
Gary Doades
Date:
Tom Lane wrote:
  > I tried forcing PG to use src/port/qsort.c on the Fedora machine,
> and lo and behold:
>     new Fedora 4 machine: shortest 434 msec, longest 8530 msec
>
> So it sure looks like this script does expose a problem on BSD-derived
> qsorts.  Curiously, the case that's much the worst for me is the third
> in the script, while the shortest time is the first case, which was slow
> for Gary.  So I'd venture that the *BSD code has been tweaked somewhere
> along the way, in a manner that moves the problem around without really
> fixing it.  (Anyone want to compare the actual FreeBSD source to what
> we have?)
>

If I run the script again, it is not always the first case that is slow,
it varies from run to run, which is why I repeated it quite a few times
for the test.

Interestingly, if I don't delete the table after a run, but just drop
and re-create the index repeatedly it stays a pretty consistent time,
either repeatedly good or repeatedly bad!

Regards,
Gary.

Re: Strange Create Index behaviour

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Please enable trace_sort=on and then repeat tests and post the
> accompanying log file.

I did this on my Fedora machine with port/qsort.c, and got the results
attached.  Curiously, this run has the spikes in completely different
places than the prior one did.  So the random component of the test data
is affecting the results quite a lot.  There seems absolutely no doubt
that we are looking at data-dependent qsort misbehavior, though.  The
CPU time eaten by performsort accounts for all but about 100 msec of the
elapsed time reported on the psql side.

            regards, tom lane


LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.15u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.00s/12.43u sec elapsed 12.44 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.01s/12.51u sec elapsed 12.52 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.00s/0.78u sec elapsed 0.78 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.02s/0.85u sec elapsed 0.87 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.01s/0.14u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.01s/0.96u sec elapsed 0.97 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.02s/1.03u sec elapsed 1.06 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.00s/0.31u sec elapsed 0.32 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.02s/0.38u sec elapsed 0.40 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.00s/7.91u sec elapsed 7.92 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.02s/7.99u sec elapsed 8.01 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.01s/0.13u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.01s/0.61u sec elapsed 0.63 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.04s/0.67u sec elapsed 0.71 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.01s/0.13u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.01s/11.52u sec elapsed 11.54 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.03s/11.59u sec elapsed 11.62 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.00s/0.45u sec elapsed 0.46 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.02s/0.55u sec elapsed 0.57 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.00s/0.14u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.00s/0.45u sec elapsed 0.46 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.04s/0.54u sec elapsed 0.57 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.02s/0.12u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.03s/0.55u sec elapsed 0.58 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.02s/0.13u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.03s/0.54u sec elapsed 0.58 sec
LOG:  begin index sort: unique = f, workMem = 16384, randomAccess = f
LOG:  performsort starting: CPU 0.02s/0.13u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.02s/0.44u sec elapsed 0.46 sec
LOG:  internal sort ended, 9861 KB used: CPU 0.04s/0.54u sec elapsed 0.59 sec

Re: Strange Create Index behaviour

From
Tom Lane
Date:
Gary Doades <gpd@gpdnet.co.uk> writes:
> Interestingly, if I don't delete the table after a run, but just drop
> and re-create the index repeatedly it stays a pretty consistent time,
> either repeatedly good or repeatedly bad!

This is consistent with the theory of a data-dependent performance
problem in qsort.  If you don't generate a fresh set of random test
data, then you get repeatable runtimes.  With a new set of test data,
you might or might not hit the not-so-sweet-spot that we seem to have
detected.

            regards, tom lane

Re: Strange Create Index behaviour

From
Gary Doades
Date:
Tom Lane wrote:
>
> So it sure looks like this script does expose a problem on BSD-derived
> qsorts.  Curiously, the case that's much the worst for me is the third
> in the script, while the shortest time is the first case, which was slow
> for Gary.  So I'd venture that the *BSD code has been tweaked somewhere
> along the way, in a manner that moves the problem around without really
> fixing it.  (Anyone want to compare the actual FreeBSD source to what
> we have?)
>
> It's really interesting to see a case where port/qsort is radically
> worse than other qsorts ... unless we figure that out and fix it,
> I think the idea of using port/qsort everywhere has just taken a
> major hit.
>

More specifically to BSD, is there any way I can use a non-BSD qsort for
building Postresql server?

Regards,
Gary.

qsort again (was Re: Strange Create Index behaviour)

From
Tom Lane
Date:
Gary Doades <gpd@gpdnet.co.uk> writes:
> If I run the script again, it is not always the first case that is slow,
> it varies from run to run, which is why I repeated it quite a few times
> for the test.

For some reason I hadn't immediately twigged to the fact that your test
script is just N repetitions of the exact same structure with random data.
So it's not so surprising that you get random variations in behavior
with different test data sets.

I did some experimentation comparing the qsort from Fedora Core 4
(glibc-2.3.5-10.3) with our src/port/qsort.c.  For those who weren't
following the pgsql-performance thread, the test case is just this
repeated a lot of times:

create table atest(i int4, r int4);
insert into atest (i,r) select generate_series(1,100000), 0;
insert into atest (i,r) select generate_series(1,100000), random()*100000;
\timing
create index idx on atest(r);
\timing
drop table atest;

I did this 100 times and sorted the reported runtimes.  (Investigation
with trace_sort = on confirms that the runtime is almost entirely spent
in qsort() called from our performsort --- the Postgres overhead is
about 100msec on this machine.)  Results are below.

It seems clear that our qsort.c is doing a pretty awful job of picking
qsort pivots, while glibc is mostly managing not to make that mistake.
I haven't looked at the glibc code yet to see what they are doing
differently.

I'd say this puts a considerable damper on my enthusiasm for using our
qsort all the time, as was recently debated in this thread:
http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
We need to fix our qsort.c before pushing ahead with that idea.

            regards, tom lane


100 runtimes for glibc qsort, sorted ascending:

Time: 459.860 ms
Time: 460.209 ms
Time: 460.704 ms
Time: 461.317 ms
Time: 461.538 ms
Time: 461.652 ms
Time: 461.988 ms
Time: 462.573 ms
Time: 462.638 ms
Time: 462.716 ms
Time: 462.917 ms
Time: 463.219 ms
Time: 463.455 ms
Time: 463.650 ms
Time: 463.723 ms
Time: 463.737 ms
Time: 463.750 ms
Time: 463.852 ms
Time: 463.964 ms
Time: 463.988 ms
Time: 464.003 ms
Time: 464.135 ms
Time: 464.372 ms
Time: 464.458 ms
Time: 464.496 ms
Time: 464.551 ms
Time: 464.599 ms
Time: 464.655 ms
Time: 464.656 ms
Time: 464.722 ms
Time: 464.814 ms
Time: 464.827 ms
Time: 464.878 ms
Time: 464.899 ms
Time: 464.905 ms
Time: 464.987 ms
Time: 465.055 ms
Time: 465.138 ms
Time: 465.159 ms
Time: 465.194 ms
Time: 465.310 ms
Time: 465.316 ms
Time: 465.375 ms
Time: 465.450 ms
Time: 465.535 ms
Time: 465.595 ms
Time: 465.680 ms
Time: 465.769 ms
Time: 465.865 ms
Time: 465.892 ms
Time: 465.903 ms
Time: 466.003 ms
Time: 466.154 ms
Time: 466.164 ms
Time: 466.203 ms
Time: 466.305 ms
Time: 466.344 ms
Time: 466.364 ms
Time: 466.388 ms
Time: 466.502 ms
Time: 466.593 ms
Time: 466.725 ms
Time: 466.794 ms
Time: 466.798 ms
Time: 466.904 ms
Time: 466.971 ms
Time: 466.997 ms
Time: 467.122 ms
Time: 467.146 ms
Time: 467.221 ms
Time: 467.224 ms
Time: 467.244 ms
Time: 467.277 ms
Time: 467.587 ms
Time: 468.142 ms
Time: 468.207 ms
Time: 468.237 ms
Time: 468.471 ms
Time: 468.663 ms
Time: 468.700 ms
Time: 469.235 ms
Time: 469.840 ms
Time: 470.472 ms
Time: 471.140 ms
Time: 472.811 ms
Time: 472.959 ms
Time: 474.858 ms
Time: 477.210 ms
Time: 479.571 ms
Time: 479.671 ms
Time: 482.797 ms
Time: 488.852 ms
Time: 514.639 ms
Time: 529.287 ms
Time: 612.185 ms
Time: 660.748 ms
Time: 742.227 ms
Time: 866.814 ms
Time: 1234.848 ms
Time: 1267.398 ms


100 runtimes for port/qsort.c, sorted ascending:

Time: 418.905 ms
Time: 420.611 ms
Time: 420.764 ms
Time: 420.904 ms
Time: 421.706 ms
Time: 422.466 ms
Time: 422.627 ms
Time: 423.189 ms
Time: 423.302 ms
Time: 425.096 ms
Time: 425.731 ms
Time: 425.851 ms
Time: 427.253 ms
Time: 430.113 ms
Time: 432.756 ms
Time: 432.963 ms
Time: 440.502 ms
Time: 440.640 ms
Time: 450.452 ms
Time: 458.143 ms
Time: 459.212 ms
Time: 467.706 ms
Time: 468.006 ms
Time: 468.574 ms
Time: 470.003 ms
Time: 472.313 ms
Time: 483.622 ms
Time: 492.395 ms
Time: 509.564 ms
Time: 531.037 ms
Time: 533.366 ms
Time: 535.610 ms
Time: 575.523 ms
Time: 582.688 ms
Time: 593.545 ms
Time: 647.364 ms
Time: 660.612 ms
Time: 677.312 ms
Time: 680.288 ms
Time: 697.626 ms
Time: 833.066 ms
Time: 834.511 ms
Time: 851.819 ms
Time: 920.443 ms
Time: 926.731 ms
Time: 954.289 ms
Time: 1045.214 ms
Time: 1059.200 ms
Time: 1062.328 ms
Time: 1136.018 ms
Time: 1260.091 ms
Time: 1276.883 ms
Time: 1319.351 ms
Time: 1438.854 ms
Time: 1475.457 ms
Time: 1538.211 ms
Time: 1549.004 ms
Time: 1744.642 ms
Time: 1771.258 ms
Time: 1959.530 ms
Time: 2300.140 ms
Time: 2589.641 ms
Time: 2612.780 ms
Time: 3100.024 ms
Time: 3284.125 ms
Time: 3379.792 ms
Time: 3750.278 ms
Time: 4302.278 ms
Time: 4780.624 ms
Time: 5000.056 ms
Time: 5092.604 ms
Time: 5168.722 ms
Time: 5292.941 ms
Time: 5895.964 ms
Time: 7003.164 ms
Time: 7099.449 ms
Time: 7115.083 ms
Time: 7384.940 ms
Time: 8214.010 ms
Time: 8700.771 ms
Time: 9331.225 ms
Time: 10503.360 ms
Time: 12496.026 ms
Time: 12982.474 ms
Time: 15192.390 ms
Time: 15392.161 ms
Time: 15958.295 ms
Time: 18375.693 ms
Time: 18617.706 ms
Time: 18927.515 ms
Time: 19898.018 ms
Time: 20865.979 ms
Time: 21000.907 ms
Time: 21297.585 ms
Time: 21714.518 ms
Time: 25423.235 ms
Time: 27543.052 ms
Time: 28314.182 ms
Time: 29400.278 ms
Time: 34142.534 ms

Re: Strange Create Index behaviour

From
Simon Riggs
Date:
On Wed, 2006-02-15 at 16:51 -0500, Tom Lane wrote:
> Gary Doades <gpd@gpdnet.co.uk> writes:
> > Interestingly, if I don't delete the table after a run, but just drop
> > and re-create the index repeatedly it stays a pretty consistent time,
> > either repeatedly good or repeatedly bad!
>
> This is consistent with the theory of a data-dependent performance
> problem in qsort.  If you don't generate a fresh set of random test
> data, then you get repeatable runtimes.  With a new set of test data,
> you might or might not hit the not-so-sweet-spot that we seem to have
> detected.

Agreed. Good analysis...

Best Regards, Simon Riggs


Re: qsort again (was Re: Strange Create Index behaviour)

From
Gary Doades
Date:
Tom Lane wrote:
> For some reason I hadn't immediately twigged to the fact that your test
> script is just N repetitions of the exact same structure with random data.
> So it's not so surprising that you get random variations in behavior
> with different test data sets.
>
  > It seems clear that our qsort.c is doing a pretty awful job of picking
> qsort pivots, while glibc is mostly managing not to make that mistake.
> I haven't looked at the glibc code yet to see what they are doing
> differently.
>
> I'd say this puts a considerable damper on my enthusiasm for using our
> qsort all the time, as was recently debated in this thread:
> http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
> We need to fix our qsort.c before pushing ahead with that idea.

[snip]

> Time: 28314.182 ms
> Time: 29400.278 ms
> Time: 34142.534 ms

Ouch! That confirms my problem. I generated the random test case because
it was easier than including the dump of my tables, but you can
appreciate that tables 20 times the size are basically crippled when it
comes to creating an index on them.

Examining the dump and the associated times during restore it looks like
I have 7 tables with this approximate distribution, thus the
ridiculously long restore time. Better not re-index soon!

Is this likely to hit me in a random fashion during normal operation,
joins, sorts, order by for example?

So the options are:
1) Fix the included qsort.c code and use that
2) Get FreeBSD to fix their qsort code
3) Both

I guess that 1 is the real solution in case anyone else's qsort is
broken in the same way. Then at least you *could* use it all the time :)

Regards,
Gary.




Re: qsort again (was Re: Strange Create Index behaviour)

From
Tom Lane
Date:
Gary Doades <gpd@gpdnet.co.uk> writes:
> Is this likely to hit me in a random fashion during normal operation,
> joins, sorts, order by for example?

Yup, anytime you're passing data with that kind of distribution
through a sort.

> So the options are:
> 1) Fix the included qsort.c code and use that
> 2) Get FreeBSD to fix their qsort code
> 3) Both

> I guess that 1 is the real solution in case anyone else's qsort is
> broken in the same way. Then at least you *could* use it all the time :)

It's reasonable to assume that most of the *BSDen have basically the
same qsort code.  Ours claims to have come from NetBSD sources, but
I don't doubt that they all trace back to a common ancestor.

            regards, tom lane

Re: [HACKERS] qsort again (was Re: Strange Create Index behaviour)

From
Tom Lane
Date:
Gary Doades <gpd@gpdnet.co.uk> writes:
> Ouch! That confirms my problem. I generated the random test case because
> it was easier than including the dump of my tables, but you can
> appreciate that tables 20 times the size are basically crippled when it
> comes to creating an index on them.

Actually... we only use qsort when we have a sorting problem that fits
within the allowed sort memory.  The external-sort logic doesn't go
through that code at all.  So all the analysis we just did on your test
case doesn't necessarily apply to sort problems that are too large for
the sort_mem setting.

The test case would be sorting 200000 index entries, which'd probably
occupy at least 24 bytes apiece of sort memory, so probably about 5 meg.
A problem 20 times that size would definitely not fit in the default
16MB maintenance_work_mem.  Were you using a large value of
maintenance_work_mem for your restore?

            regards, tom lane

Re: qsort again (was Re: Strange Create Index

From
Ron
Date:
This behavior is consistent with the pivot choosing algorithm
assuming certain distribution(s) for the data.  For instance,
median-of-three partitioning is known to be pessimal when the data is
geometrically or hyper-geometrically distributed.  Also, care must be
taken that sometimes is not when there are many equal values in the
data.  Even pseudo random number generator based pivot choosing
algorithms are not immune if the PRNG is flawed in some way.

How are we choosing our pivots?


At 06:28 PM 2/15/2006, Tom Lane wrote:

>I did some experimentation comparing the qsort from Fedora Core 4
>(glibc-2.3.5-10.3) with our src/port/qsort.c.  For those who weren't
>following the pgsql-performance thread, the test case is just this
>repeated a lot of times:
>
>create table atest(i int4, r int4);
>insert into atest (i,r) select generate_series(1,100000), 0;
>insert into atest (i,r) select generate_series(1,100000), random()*100000;
>\timing
>create index idx on atest(r);
>\timing
>drop table atest;
>
>I did this 100 times and sorted the reported runtimes.  (Investigation
>with trace_sort = on confirms that the runtime is almost entirely spent
>in qsort() called from our performsort --- the Postgres overhead is
>about 100msec on this machine.)  Results are below.
>
>It seems clear that our qsort.c is doing a pretty awful job of picking
>qsort pivots, while glibc is mostly managing not to make that mistake.
>I haven't looked at the glibc code yet to see what they are doing
>differently.
>
>I'd say this puts a considerable damper on my enthusiasm for using our
>qsort all the time, as was recently debated in this thread:
>http://archives.postgresql.org/pgsql-hackers/2005-12/msg00610.php
>We need to fix our qsort.c before pushing ahead with that idea.
>
>                         regards, tom lane
>
>
>100 runtimes for glibc qsort, sorted ascending:
>
>Time: 459.860 ms
><snip>
>Time: 488.852 ms
>Time: 514.639 ms
>Time: 529.287 ms
>Time: 612.185 ms
>Time: 660.748 ms
>Time: 742.227 ms
>Time: 866.814 ms
>Time: 1234.848 ms
>Time: 1267.398 ms
>
>
>100 runtimes for port/qsort.c, sorted ascending:
>
>Time: 418.905 ms
><snip>
>Time: 20865.979 ms
>Time: 21000.907 ms
>Time: 21297.585 ms
>Time: 21714.518 ms
>Time: 25423.235 ms
>Time: 27543.052 ms
>Time: 28314.182 ms
>Time: 29400.278 ms
>Time: 34142.534 ms



Re: [HACKERS] qsort again (was Re: Strange Create Index behaviour)

From
Tom Lane
Date:
I wrote:
> Gary Doades <gpd@gpdnet.co.uk> writes:
>> Ouch! That confirms my problem. I generated the random test case because
>> it was easier than including the dump of my tables, but you can
>> appreciate that tables 20 times the size are basically crippled when it
>> comes to creating an index on them.

> Actually... we only use qsort when we have a sorting problem that fits
> within the allowed sort memory.  The external-sort logic doesn't go
> through that code at all.  So all the analysis we just did on your test
> case doesn't necessarily apply to sort problems that are too large for
> the sort_mem setting.

I increased the size of the test case by 10x (basically s/100000/1000000/)
which is enough to push it into the external-sort regime.  I get
amazingly stable runtimes now --- I didn't have the patience to run 100
trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
So this code path is definitely not very sensitive to this data
distribution.

While these numbers aren't glittering in comparison to the best-case
qsort times (~450 msec to sort 10% as much data), they are sure a lot
better than the worst-case times.  So maybe a workaround for you is
to decrease maintenance_work_mem, counterintuitive though that be.
(Now, if you *weren't* using maintenance_work_mem of 100MB or more
for your problem restore, then I'm not sure I know what's going on...)

We still ought to try to fix qsort of course.

            regards, tom lane

Re: qsort again (was Re: Strange Create Index behaviour)

From
Tom Lane
Date:
Ron <rjpeace@earthlink.net> writes:
> How are we choosing our pivots?

See qsort.c: it looks like median of nine equally spaced inputs (ie,
the 1/8th points of the initial input array, plus the end points),
implemented as two rounds of median-of-three choices.  With half of the
data inputs zero, it's not too improbable for two out of the three
samples to be zeroes in which case I think the med3 result will be zero
--- so choosing a pivot of zero is much more probable than one would
like, and doing so in many levels of recursion causes the problem.

I think.  I'm not too sure if the code isn't just being sloppy about the
case where many data values are equal to the pivot --- there's a special
case there to switch to insertion sort, and maybe that's getting invoked
too soon.  It'd be useful to get a line-level profile of the behavior of
this code in the slow cases...

            regards, tom lane

Re: qsort again (was Re: Strange Create Index behaviour)

From
Christopher Kings-Lynne
Date:
> Ouch! That confirms my problem. I generated the random test case because
> it was easier than including the dump of my tables, but you can
> appreciate that tables 20 times the size are basically crippled when it
> comes to creating an index on them.


I have to say that I restored a few gigabyte dump on freebsd the other
day, and most of the restore time was in index creation - I didn't think
too much of it though at the time.  FreeBSD 4.x.

Chris


Re: Strange Create Index behaviour

From
Simon Riggs
Date:
On Wed, 2006-02-15 at 20:00 +0000, Gary Doades wrote:

> I have put together a test case that demonstrates the problem (see
> below). I create a simple table, as close in structure to one of my
> problem tables and populate an integer column with 100,000 zeros follow
> by 100,000 random integers between 0 and 100,000. Then create an index
> on this column. I then drop the table and repeat. The create index
> should take around 1-2 seconds. A fair proportion of the time it takes
> 50 seconds!!!
>
> If I fill the same row with all random data the create index always
> takes a second or two. If I fill the column with all zeros everything is
> still OK.

Aside from the importance of investigating sort behaviour, have you
tried to build a partial index WHERE col > 0 ? That way you wouldn't
even be indexing the zeros.

Best Regards, Simon Riggs




Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Simon Riggs
Date:
On Wed, 2006-02-15 at 19:59 -0500, Tom Lane wrote:

>  I get
> amazingly stable runtimes now --- I didn't have the patience to run 100
> trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
> So this code path is definitely not very sensitive to this data
> distribution.

"The worst-case behavior of replacement-selection is very close to its
average behavior, while the worst-case behavior of QuickSort is terrible
(N2) – a strong argument in favor of replacement-selection. Despite this
risk, QuickSort is widely used because, in practice, it has superior
performance." p.8, "AlphaSort: A Cache-Sensitive Parallel External
Sort", Nyberg et al, VLDB Journal 4(4): 603-627 (1995)

I think your other comment about flipping to insertion sort too early
(and not returning...) is a plausible cause for the poor pg qsort
behaviour, but the overall spread of values seems as expected.

Some test results I've seen seem consistent with the view that
increasing memory also increases run-time for larger settings of
work_mem/maintenance_work_mem. Certainly, as I observed a while back,
having a large memory settings doesn't help you at all when you are
doing final run merging on the external sort. Whatever we do, we should
look at the value high memory settings bring to each phase of a sort
separately from the other phases.

There is work underway on improving external sorts, so I hear (not me).
Plus my WIP on randomAccess requirements.

Best Regards, Simon Riggs




Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Neil Conway
Date:
On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
> It seems clear that our qsort.c is doing a pretty awful job of picking
> qsort pivots, while glibc is mostly managing not to make that mistake.
> I haven't looked at the glibc code yet to see what they are doing
> differently.

glibc qsort is actually merge sort, so I'm not surprised it avoids this
problem.

-Neil



Re: qsort again (was Re: Strange Create Index

From
Ron
Date:
At 08:21 PM 2/15/2006, Tom Lane wrote:
>Ron <rjpeace@earthlink.net> writes:
> > How are we choosing our pivots?
>
>See qsort.c: it looks like median of nine equally spaced inputs (ie,
>the 1/8th points of the initial input array, plus the end points),
>implemented as two rounds of median-of-three choices.

OK, this is a bad way to do median-of-n partitioning for a few
reasons.  See Sedgewick's PhD thesis for details.

Basically, if one is using median-of-n partitioning to choose a
pivot, one should do it in =one= pass, and n for that pass should be
<= the numbers of registers in the CPU.  Since the x86 ISA has 8
GPR's, n should be <= 8.  7 for instance.

Special purposing the median-of-n code so that the minimal number of
comparisons and moves is used to sort the sample and then
"partitioning in place" is the best way to do it.  In addition, care
must be taken to deal with the possibility that many of the keys may be equal.

The (pseudo) code looks something like this:

qs(a[],L,R){
if((R-L) > SAMPLE_SIZE){ // Not worth using qs for too few elements
    SortSample(SAMPLE_SIZE,a[],L,R);
    // Sorts SAMPLE_SIZE= n elements and does median-of-n
partitioning for small n
    // using the minimal number of comparisons and moves.
    // In the process it ends up partitioning the first n/2 and last
n/2 elements
    // SAMPLE_SIZE is a constant chosen to work best for a given CPU.
    //  #GPRs - 1 is a good initial guess.
    // For the x86 ISA, #GPRs - 1 = 7. For native x86-64, it's 15.
    // For most RISC CPUs it's 31 or 63.  For Itanium, it's 127 (!)
    pivot= a[(L+R)>>1]; i= L+(SAMPLE_SIZE>>1); j= R-(SAMPLE_SIZE>>1);
    for(;;){
       while(a[++i] < pivot);
       while(a[--j] > pivot);
       if(i >= j) break;
       if(a[i] > a[j]) swap(a[i],a[j]);
       }
    if((i-R) >= (j-L)){qs(a,L,i-1);}
    else{qs(a,i,R);}
else{OofN^2_Sort(a,L,R);}
// SelectSort may be better than InsertSort if KeySize in bits <<
RecordSize in bits
} // End of qs

Given that the most common CPU ISA in existence has 8 GPRs,
SAMPLE_SIZE= 7 is probably optimal:
t= (L+R);
the set would be {L; t/8; t/4; t/2; 3*t/4; 7*t/8; R;}
==> {L; t>>3; t>>2; t>>1; (3*t)>>2; (7*t)>>3; R} as the locations.
Even better (and more easily scaled as the number of GPR's in the CPU
changes) is to use
the set {L; L+1; L+2; t>>1; R-2; R-1; R}
This means that instead of 7 random memory accesses, we have 3; two
of which result in a
burst access for three elements each.
That's much faster; _and_ using a sample of 9, 15, 31, 63, etc (to
max of ~GPRs -1) elements is more easily done.

It also means that the work we do sorting the sample can be taken
advantage of when starting
inner loop of quicksort: items L..L+2, t, and R-2..R are already
partitioned by SortSample().

Insuring that the minimum number of comparisons and moves is done in
SortSample can be down by using a code generator to create a
comparison tree that identifies which permutation(s) of n we are
dealing with and then moving them into place with the minimal number of moves.

SIDE NOTE: IIRC glibc's qsort is actually merge sort.  Merge sort
performance is insensitive to all inputs, and there are way to
optimize it as well.

I'll leave the actual coding to someone who knows the pg source
better than I do.
Ron



Re: [HACKERS] qsort again (was Re: Strange Create Index

From
"Gary Doades"
Date:
Tom Lane wrote:
> I increased the size of the test case by 10x (basically s/100000/1000000/)
> which is enough to push it into the external-sort regime.  I get
> amazingly stable runtimes now --- I didn't have the patience to run 100
> trials, but in 30 trials I have slowest 11538 msec and fastest 11144 msec.
> So this code path is definitely not very sensitive to this data
> distribution.
>
> While these numbers aren't glittering in comparison to the best-case
> qsort times (~450 msec to sort 10% as much data), they are sure a lot
> better than the worst-case times.  So maybe a workaround for you is
> to decrease maintenance_work_mem, counterintuitive though that be.
> (Now, if you *weren't* using maintenance_work_mem of 100MB or more
> for your problem restore, then I'm not sure I know what's going on...)
>

Good call. I basically reversed your test by keeping the number of rows
the same (200000), but reducing maintenance_work_mem. Reducing to 8192
made no real difference. Reducing to 4096 flattened out all the times
nicely. Slower overall, but at least predictable. Hopefully only a
temporary solution until qsort is fixed.

My restore now takes 22 minutes :)

I think the reason I wasn't seeing performance issues with normal sort
operations is because they use work_mem not maintenance_work_mem which was
only set to 2048 anyway. Does that sound right?

Regards,
Gary.



Re: qsort again (was Re: Strange Create Index

From
"Steinar H. Gunderson"
Date:
On Wed, Feb 15, 2006 at 11:30:54PM -0500, Ron wrote:
> Even better (and more easily scaled as the number of GPR's in the CPU
> changes) is to use
> the set {L; L+1; L+2; t>>1; R-2; R-1; R}
> This means that instead of 7 random memory accesses, we have 3; two
> of which result in a
> burst access for three elements each.

Isn't that improvement going to disappear competely if you choose a bad
pivot?

> SIDE NOTE: IIRC glibc's qsort is actually merge sort.  Merge sort
> performance is insensitive to all inputs, and there are way to
> optimize it as well.

glibc-2.3.5/stdlib/qsort.c:

  /* Order size using quicksort.  This implementation incorporates
     four optimizations discussed in Sedgewick:

I can't see any references to merge sort in there at all.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: [HACKERS] qsort again

From
Florian Weimer
Date:
* Neil Conway:

> On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
>> It seems clear that our qsort.c is doing a pretty awful job of picking
>> qsort pivots, while glibc is mostly managing not to make that mistake.
>> I haven't looked at the glibc code yet to see what they are doing
>> differently.
>
> glibc qsort is actually merge sort, so I'm not surprised it avoids this
> problem.

qsort also performs twice as many key comparisons as the theoretical
minimum.  If key comparison is not very cheap, other schemes (like
heapsort, for example) are more attractive.

Re: [HACKERS] qsort again

From
Martijn van Oosterhout
Date:
On Thu, Feb 16, 2006 at 01:10:48PM +0100, Florian Weimer wrote:
> * Neil Conway:
>
> > On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
> >> It seems clear that our qsort.c is doing a pretty awful job of picking
> >> qsort pivots, while glibc is mostly managing not to make that mistake.
> >> I haven't looked at the glibc code yet to see what they are doing
> >> differently.
> >
> > glibc qsort is actually merge sort, so I'm not surprised it avoids this
> > problem.
>
> qsort also performs twice as many key comparisons as the theoretical
> minimum.  If key comparison is not very cheap, other schemes (like
> heapsort, for example) are more attractive.

Last time around there were a number of different algorithms tested.
Did anyone run those tests while getting it to count the number of
actual comparisons (which could easily swamp the time taken to do the
actual sort in some cases)?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: [HACKERS] qsort again

From
Sven Geisler
Date:
Martijn van Oosterhout schrieb:
>
> Last time around there were a number of different algorithms tested.
> Did anyone run those tests while getting it to count the number of
> actual comparisons (which could easily swamp the time taken to do the
> actual sort in some cases)?
>

The last time I did such tests is almost 10 years ago. I had used
MetroWerks CodeWarrior C/C++, which had Quicksort as algorithm in the Lib C.
Anyhow, I tested a few algorithms including merge sort and heapsort. I
end up with heapsort because it was the fastest algorithm for our issue.
We joined two arrays where each array was sorted and run qsort to sort
the new array.

Sven.

Re: qsort again (was Re: Strange Create Index

From
Ron
Date:
At 06:35 AM 2/16/2006, Steinar H. Gunderson wrote:
>On Wed, Feb 15, 2006 at 11:30:54PM -0500, Ron wrote:
> > Even better (and more easily scaled as the number of GPR's in the CPU
> > changes) is to use
> > the set {L; L+1; L+2; t>>1; R-2; R-1; R}
> > This means that instead of 7 random memory accesses, we have 3; two
> > of which result in a burst access for three elements each.
>
>Isn't that improvement going to disappear competely if you choose a bad
>pivot?
Only if you _consistently_ (read: "the vast majority of the time":
quicksort is actually darn robust) choose a _pessimal_, not just
"bad", pivot quicksort will degenerate to the O(N^2) behavior
everyone worries about.  See Corman & Rivest for a proof on this.

Even then, doing things as above has benefits:
1= The worst case is less bad since the guaranteed O(lgs!) pivot
choosing algorithm puts s elements into final position.
Worst case becomes better than O(N^2/(s-1)).

2=  The overhead of pivot choosing can overshadow the benefits using
more traditional methods for even moderate values of s.  See
discussions on the quicksort variant known as "samplesort" and
Sedgewick's PhD thesis for details.  Using a pivot choosing algorithm
that actually does some of the partitioning (and does it more
efficiently than the "usual" partitioning algorithm does) plus using
partition-in-place (rather then Lomuto's method) reduces overhead
very effectively (at the "cost" of more complicated / delicate to get
right partitioning code).  The above reduces the number of moves used
in a quicksort pass considerably regardless of the number of compares used.

3= Especially in modern systems where the gap between internal CPU
bandwidth and memory bandwidth is so great, the overhead of memory
accesses for comparisons and moves is the majority of the overhead
for both the pivot choosing and the partitioning algorithms within
quicksort.  Particularly random memory accesses.  The reason (#GPRs -
1) is a magic constant is that it's the most you can compare and move
using only register-to-register operations.

In addition, replacing as many of the memory accesses you must do
with sequential rather than random memory accesses is a big deal:
sequential memory access is measured in 10's of CPU cycles while
random memory access is measured in hundreds of CPU cycles.  It's no
accident that the advances in Grey's sorting contest have involved
algorithms that are both register and cache friendly, minimizing
overall memory access and using sequential memory access as much as
possible when said access can not be avoided.  As caches grow larger
and memory accesses more expensive, it's often worth it to use a
BucketSort+QuickSort hybrid rather than just QuickSort.

...and of course if you know enough about the data to be sorted so as
to constrain it appropriately, one should use a non comparison based
O(N) sorting algorithm rather than any of the general comparison
based O(NlgN) methods.


> > SIDE NOTE: IIRC glibc's qsort is actually merge sort.  Merge sort
> > performance is insensitive to all inputs, and there are way to
> > optimize it as well.
>
>glibc-2.3.5/stdlib/qsort.c:
>
>   /* Order size using quicksort.  This implementation incorporates
>      four optimizations discussed in Sedgewick:
>
>I can't see any references to merge sort in there at all.
Well, then I'm not the only person on the lists whose memory is faulty ;-)

The up side of MergeSort is that its performance is always O(NlgN).
The down sides are that it is far more memory hungry than QuickSort and slower.


Ron



Re: [HACKERS] qsort again

From
Ron
Date:
At 07:10 AM 2/16/2006, Florian Weimer wrote:
>* Neil Conway:
>
> > On Wed, 2006-02-15 at 18:28 -0500, Tom Lane wrote:
> >> It seems clear that our qsort.c is doing a pretty awful job of picking
> >> qsort pivots, while glibc is mostly managing not to make that mistake.
> >> I haven't looked at the glibc code yet to see what they are doing
> >> differently.
> >
> > glibc qsort is actually merge sort, so I'm not surprised it avoids this
> > problem.
>
>qsort also performs twice as many key comparisons as the theoretical
>minimum.

The theoretical minimum number of comparisons for a general purpose
comparison based sort is O(lgN!).
QuickSort uses 2NlnN ~= 1.38NlgN or ~1.38x the optimum without tuning
(see Knuth, Sedgewick, Corman, ... etc)
OTOH, QuickSort uses ~2x as many =moves= as the theoretical minimum
unless tuned, and moves are more expensive than compares in modern systems.

See my other posts for QuickSort tuning methods that attempt to
directly address both issues.


Ron



Re: qsort again (was Re: Strange Create Index

From
Markus Schaber
Date:
Hi, Ron,

Ron wrote:

> ...and of course if you know enough about the data to be sorted so as to
> constrain it appropriately, one should use a non comparison based O(N)
> sorting algorithm rather than any of the general comparison based
> O(NlgN) methods.

Sounds interesting, could you give us some pointers (names, URLs,
papers) to such algorithms?

Thanks a lot,
Markus



--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: [HACKERS] qsort again (was Re: Strange Create Index behaviour)

From
Tom Lane
Date:
"Gary Doades" <gpd@gpdnet.co.uk> writes:
> I think the reason I wasn't seeing performance issues with normal sort
> operations is because they use work_mem not maintenance_work_mem which was
> only set to 2048 anyway. Does that sound right?

Very probable.  Do you want to test the theory by jacking that up?  ;-)

            regards, tom lane

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Martijn van Oosterhout
Date:
On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
> 3= Especially in modern systems where the gap between internal CPU
> bandwidth and memory bandwidth is so great, the overhead of memory
> accesses for comparisons and moves is the majority of the overhead
> for both the pivot choosing and the partitioning algorithms within
> quicksort.  Particularly random memory accesses.  The reason (#GPRs -
> 1) is a magic constant is that it's the most you can compare and move
> using only register-to-register operations.

But how much of this applies to us? We're not sorting arrays of
integers, we're sorting pointers to tuples. So while moves cost very
little, a comparison costs hundreds, maybe thousands of cycles. A tuple
can easily be two or three cachelines and you're probably going to
access all of it, not to mention the Fmgr structures and the Datums
themselves.

None of this is cache friendly. The actual tuples themselves could be
spread all over memory (I don't think any particular effort is expended
trying to minimize fragmentation).

Do these algorithms discuss the case where a comparison is more than
1000 times the cost of a move?

Where this does become interesting is where we can convert a datum to
an integer such that if f(A) > f(B) then A > B. Then we can sort on
f(X) first with just integer comparisons and then do a full tuple
comparison only if f(A) = f(B). This would be much more cache-coherent
and make these algorithms much more applicable in my mind.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
"Gary Doades"
Date:
> "Gary Doades" <gpd@gpdnet.co.uk> writes:
>> I think the reason I wasn't seeing performance issues with normal sort
>> operations is because they use work_mem not maintenance_work_mem which
>> was
>> only set to 2048 anyway. Does that sound right?
>
> Very probable.  Do you want to test the theory by jacking that up?  ;-)

Hmm, played around a bit. I have managed to get it to do a sort on one of
the "bad" columns using a select of two whole tables that results in a
sequntial scan, sort and merge join. I also tried a simple select column
order by column for a bad column.

I tried varying maintenance_work_mem and work_mem up and down between 2048
and 65536 but I always get similar results. The sort phase always takes 4
to 5 seconds which seems about right for 900,000 rows.

This was on a colunm that took 12 minutes to create an index on.

I've no idea why it should behave this way, but probably explains why I
(and others) may not have noticed it before.

Regards,
Gary.



Re: [HACKERS] qsort again (was Re: Strange Create

From
Ron
Date:
At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:
>On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
> > 3= Especially in modern systems where the gap between internal CPU
> > bandwidth and memory bandwidth is so great, the overhead of memory
> > accesses for comparisons and moves is the majority of the overhead
> > for both the pivot choosing and the partitioning algorithms within
> > quicksort.  Particularly random memory accesses.  The reason (#GPRs -
> > 1) is a magic constant is that it's the most you can compare and move
> > using only register-to-register operations.
>
>But how much of this applies to us? We're not sorting arrays of
>integers, we're sorting pointers to tuples. So while moves cost very
>little, a comparison costs hundreds, maybe thousands of cycles. A tuple
>can easily be two or three cachelines and you're probably going to
>access all of it, not to mention the Fmgr structures and the Datums
>themselves.
Pointers are simply fixed size 32b or 64b quantities.  They are
essentially integers.  Comparing and moving pointers or fixed size
keys to those pointers is exactly the same problem as comparing and
moving integers.

Comparing =or= moving the actual data structures is a much more
expensive and variable cost proposition.  I'm sure that pg's sort
functionality is written intelligently enough that the only real data
moves are done in a final pass after the exact desired order has been
found using pointer compares and (re)assignments during the sorting
process.  That's a standard technique for sorting data whose "key" or
pointer is much smaller than a datum.

Your cost comment basically agrees with mine regarding the cost of
random memory accesses.  The good news is that the number of datums
to be examined during the pivot choosing process is small enough that
the datums can fit into CPU cache while the pointers to them can be
assigned to registers: making pivot choosing +very+ fast when done correctly.

As you've noted, actual partitioning is going to be more expensive
since it involves accessing enough actual datums that they can't all
fit into CPU cache.  The good news is that QuickSort has a very
sequential access pattern within its inner loop.  So while we must go
to memory for compares, we are at least keeping the cost for it down
it a minimum.  In addition, said access is nice enough to be very
prefetch and CPU cache hierarchy friendly.


>None of this is cache friendly. The actual tuples themselves could be
>spread all over memory (I don't think any particular effort is expended
>trying to minimize fragmentation).
It probably would be worth it to spend some effort on memory layout
just as we do for HD layout.


>Do these algorithms discuss the case where a comparison is more than
>1000 times the cost of a move?
A move is always more expensive than a compare when the datum is
larger than its pointer or key.  A move is always more expensive than
a compare when it involves memory to memory movement rather than CPU
location to CPU location movement.  A move is especially more
expensive than a compare when it involves both factors.  Most moves
do involve both.

What I suspect you meant is that a key comparison that involves
accessing the data in memory is more expensive than reassigning the
pointers associated with those keys.   That is certainly true.

Yes.  The problem has been extensively studied. ;-)


>Where this does become interesting is where we can convert a datum to
>an integer such that if f(A) > f(B) then A > B. Then we can sort on
>f(X) first with just integer comparisons and then do a full tuple
>comparison only if f(A) = f(B). This would be much more cache-coherent
>and make these algorithms much more applicable in my mind.
In fact we can do better.
Using hash codes or what-not to map datums to keys and then sorting
just the keys and the pointers to those datums followed by an
optional final pass where we do the actual data movement is also a
standard technique for handling large data structures.


Regardless of what tweaks beyond the basic algorithms we use, the
algorithms themselves have been well studied and their performance
well established.  QuickSort is the best performing of the O(nlgn)
comparison based sorts and it uses less resources than HeapSort or MergeSort.

Ron



Re: [HACKERS] qsort again (was Re: Strange Create

From
Tom Lane
Date:
Ron <rjpeace@earthlink.net> writes:
> Your cost comment basically agrees with mine regarding the cost of
> random memory accesses.  The good news is that the number of datums
> to be examined during the pivot choosing process is small enough that
> the datums can fit into CPU cache while the pointers to them can be
> assigned to registers: making pivot choosing +very+ fast when done correctly.

This is more or less irrelevant given that comparing the pointers is not
the operation we need to do.

            regards, tom lane

Re: qsort again (was Re: Strange Create Index

From
"Craig A. James"
Date:
Markus Schaber wrote:
> Ron wrote:
>>...and of course if you know enough about the data to be sorted so as to
>>constrain it appropriately, one should use a non comparison based O(N)
>>sorting algorithm rather than any of the general comparison based
>>O(NlgN) methods.
>
> Sounds interesting, could you give us some pointers (names, URLs,
> papers) to such algorithms?

Most of these techniques boil down to good ol' "bucket sort".  A simple example: suppose you have a column of integer
percentages,range zero to 100.  You know there are only 101 distinct values.  So create 101 "buckets" (e.g. linked
lists),make a single pass through your data and drop each row's ID into the right bucket, then make a second pass
throughthe buckets, and write the row ID's out in bucket order.  This is an O(N) sort technique. 

Any time you have a restricted data range, you can do this.  Say you have 100 million rows of scientific results known
tobe good to only three digits -- it can have at most 1,000 distinct values (regardless of the magnitude of the
values),so you can do this with 1,000 buckets and just two passes through the data. 

You can also use this trick when the optimizer is asked for "fastest first result."  Say you have a cursor on a column
ofnumbers with good distribution.  If you do a bucket sort on the first two or three digits only, you know the first
"page"of results will be in the first bucket.  So you only need to apply qsort to that first bucket (which is very
fast,since it's small), and you can deliver the first page of data to the application.  This can be particularly
effectivein interactive situations, where the user typically looks at a few pages of data and then abandons the search.
 

I doubt this is very relevant to Postgres.  A relational database has to be general purpose, and it's hard to give it
"hints"that would tell it when to use this particular optimization. 

Craig

Re: [HACKERS] qsort again (was Re: Strange Create

From
Ron
Date:
At 10:52 AM 2/16/2006, Ron wrote:
>At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:
>
>>Where this does become interesting is where we can convert a datum to
>>an integer such that if f(A) > f(B) then A > B. Then we can sort on
>>f(X) first with just integer comparisons and then do a full tuple
>>comparison only if f(A) = f(B). This would be much more cache-coherent
>>and make these algorithms much more applicable in my mind.
>In fact we can do better.
>Using hash codes or what-not to map datums to keys and then sorting
>just the keys and the pointers to those datums followed by an
>optional final pass where we do the actual data movement is also a
>standard technique for handling large data structures.
I thought some follow up might be in order here.

Let's pretend that we have the typical DB table where rows are ~2-4KB
apiece.  1TB of storage will let us have 256M-512M rows in such a table.

A 32b hash code can be assigned to each row value such that only
exactly equal rows will have the same hash code.
A 32b pointer can locate any of the 256M-512M rows.

Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b=
64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an optional
pass to rearrange the actual rows if we so wish.

We get the same result while only examining and manipulating 1/50 to
1/25 as much data during the sort.

If we want to spend more CPU time in order to save more space, we can
compress the key+pointer representation.  That usually reduces the
amount of data to be manipulated to ~1/4 the original key+pointer
representation, reducing things to ~512M-1GB worth of compressed
pointers+keys.  Or ~1/200 - ~1/100 the original amount of data we
were discussing.

Either representation is small enough to fit within RAM rather than
requiring HD IO, so we solve the HD IO bottleneck in the best
possible way: we avoid ever doing it.

Ron



Re: [HACKERS] qsort again (was Re: Strange Create

From
Martijn van Oosterhout
Date:
On Thu, Feb 16, 2006 at 11:32:55AM -0500, Ron wrote:
> At 10:52 AM 2/16/2006, Ron wrote:
> >In fact we can do better.
> >Using hash codes or what-not to map datums to keys and then sorting
> >just the keys and the pointers to those datums followed by an
> >optional final pass where we do the actual data movement is also a
> >standard technique for handling large data structures.

Or in fact required if the Datums are not all the same size, which is
the case in PostgreSQL.

> I thought some follow up might be in order here.
>
> Let's pretend that we have the typical DB table where rows are ~2-4KB
> apiece.  1TB of storage will let us have 256M-512M rows in such a table.
>
> A 32b hash code can be assigned to each row value such that only
> exactly equal rows will have the same hash code.
> A 32b pointer can locate any of the 256M-512M rows.

That hash code is impossible the way you state it, since the set of
strings is not mappable to a 32bit integer. You probably meant that a
hash code can be assigned such that equal rows have equal hashes (drop
the only).

> Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b+32b=
> 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an optional
> pass to rearrange the actual rows if we so wish.
>
> We get the same result while only examining and manipulating 1/50 to
> 1/25 as much data during the sort.

But this is what we do now. The tuples are loaded, we sort an array of
pointers, then we write the output. Except we don't have the hash, so
we require access to the 1TB of data to do the actual comparisons. Even
if we did have the hash, we'd *still* need access to the data to handle
tie-breaks.

That's why your comment about moves always being more expensive than
compares makes no sense. A move can be acheived simply by swapping two
pointers in the array. A compare actually needs to call all sorts of
functions. If and only if we have functions for every data type to
produce an ordered hash, we can optimise sorts based on single
integers.

For reference, look at comparetup_heap(). It's just 20 lines, but each
function call there expands to maybe a dozen lines of code. And it has
a loop. I don't think we're anywhere near the stage where locality of
reference makes much difference.

We very rarely needs the tuples actualised in memory in the required
order, just the pointers are enough.

Have a ncie day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> You can also use this trick when the optimizer is asked for "fastest first result."  Say you have a cursor on a
columnof numbers with good distribution.  If you do a bucket sort on the first two or three digits only, you know the
first"page" of results will be in the first bucket.  So you only need to apply qsort to that first bucket (which is
veryfast, since it's small), and you can deliver the first page of data to the application.  This can be particularly
effectivein interactive situations, where the user typically looks at a few pages of data and then abandons the search.
 

> I doubt this is very relevant to Postgres.  A relational database has to be general purpose, and it's hard to give it
"hints"that would tell it when to use this particular optimization. 

Actually, LIMIT does nicely for that hint; the PG planner has definitely
got a concept of preferring fast-start plans for limited queries.  The
real problem in applying bucket-sort ideas is the lack of any
datatype-independent way of setting up the buckets.

Once or twice we've kicked around the idea of having some
datatype-specific sorting code paths alongside the general-purpose one,
but I can't honestly see this as being workable from a code maintenance
standpoint.

            regards, tom lane

Re: [HACKERS] qsort again (was Re: Strange Create

From
Scott Lamb
Date:
On Feb 16, 2006, at 8:32 AM, Ron wrote:
> Let's pretend that we have the typical DB table where rows are
> ~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
> such a table.
>
> A 32b hash code can be assigned to each row value such that only
> exactly equal rows will have the same hash code.
> A 32b pointer can locate any of the 256M-512M rows.
>
> Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b
> +32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
> optional pass to rearrange the actual rows if we so wish.

I don't understand this.

This is a true statement: (H(x) != H(y)) => (x != y)
This is not: (H(x) < H(y)) => (x < y)

Hash keys can tell you there's an inequality, but they can't tell you
how the values compare. If you want 32-bit keys that compare in the
same order as the original values, here's how you have to get them:

(1) sort the values into an array
(2) use each value's array index as its key

It reduces to the problem you're trying to use it to solve.


--
Scott Lamb <http://www.slamb.org/>



Re: [HACKERS] qsort again (was Re: Strange Create

From
Ron
Date:
At 12:19 PM 2/16/2006, Scott Lamb wrote:
>On Feb 16, 2006, at 8:32 AM, Ron wrote:
>>Let's pretend that we have the typical DB table where rows are
>>~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
>>such a table.
>>
>>A 32b hash code can be assigned to each row value such that only
>>exactly equal rows will have the same hash code.
>>A 32b pointer can locate any of the 256M-512M rows.
>>
>>Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b
>>+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
>>optional pass to rearrange the actual rows if we so wish.
>
>I don't understand this.
>
>This is a true statement: (H(x) != H(y)) => (x != y)
>This is not: (H(x) < H(y)) => (x < y)
>
>Hash keys can tell you there's an inequality, but they can't tell you
>how the values compare. If you want 32-bit keys that compare in the
>same order as the original values, here's how you have to get them:
For most hash codes, you are correct.  There is a class of hash or
hash-like codes that maintains the mapping to support that second statement.

More later when I can get more time.
Ron



Re: qsort again (was Re: Strange Create Index

From
Neil Conway
Date:
On Thu, 2006-02-16 at 12:35 +0100, Steinar H. Gunderson wrote:
> glibc-2.3.5/stdlib/qsort.c:
>
>   /* Order size using quicksort.  This implementation incorporates
>      four optimizations discussed in Sedgewick:
>
> I can't see any references to merge sort in there at all.

stdlib/qsort.c defines _quicksort(), not qsort(), which is defined by
msort.c. On looking closer, it seems glibc actually tries to determine
the physical memory in the machine -- if it is sorting a single array
that exceeds 1/4 of the machine's physical memory, it uses quick sort,
otherwise it uses merge sort.

-Neil



Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Mark Lewis
Date:
On Thu, 2006-02-16 at 12:15 -0500, Tom Lane wrote:
> Once or twice we've kicked around the idea of having some
> datatype-specific sorting code paths alongside the general-purpose one,
> but I can't honestly see this as being workable from a code maintenance
> standpoint.
>
>             regards, tom lane


It seems that instead of maintaining a different sorting code path for
each data type, you could get away with one generic path and one
(hopefully faster) path if you allowed data types to optionally support
a 'sortKey' interface by providing a function f which maps inputs to 32-
bit int outputs, such that the following two properties hold:

f(a)>=f(b) iff a>=b
if a==b then f(a)==f(b)

So if a data type supports the sortKey interface you could perform the
sort on f(value) and only refer back to the actual element comparison
functions when two sortKeys have the same value.

Data types which could probably provide a useful function for f would be
int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

Depending on the overhead, you might not even need to maintain 2
independent search code paths, since you could always use f(x)=0 as the
default sortKey function which would degenerate to the exact same sort
behavior in use today.

-- Mark Lewis

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Markus Schaber
Date:
Hi, Mark,

Mark Lewis schrieb:

> It seems that instead of maintaining a different sorting code path for
> each data type, you could get away with one generic path and one
> (hopefully faster) path if you allowed data types to optionally support
> a 'sortKey' interface by providing a function f which maps inputs to 32-
> bit int outputs, such that the following two properties hold:
>
> f(a)>=f(b) iff a>=b
> if a==b then f(a)==f(b)

Hmm, to remove redundancy, I'd change the <= to a < and define:

if a==b then f(a)==f(b)
if a<b  then f(a)<=f(b)

> Data types which could probably provide a useful function for f would be
> int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

With int2 or some restricted ranges of oid and int4, we could even
implement a bucket sort.

Markus

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Martijn van Oosterhout
Date:
On Thu, Feb 16, 2006 at 02:17:36PM -0800, Mark Lewis wrote:
> It seems that instead of maintaining a different sorting code path for
> each data type, you could get away with one generic path and one
> (hopefully faster) path if you allowed data types to optionally support
> a 'sortKey' interface by providing a function f which maps inputs to 32-
> bit int outputs, such that the following two properties hold:
>
> f(a)>=f(b) iff a>=b
> if a==b then f(a)==f(b)

Note this is a property of the collation, not the type. For example
strings can be sorted in many ways and the sortKey must reflect that.
So in postgres terms it's a property of the btree operator class.

It's something I'd like to do if I get A Round Tuit. :)

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Greg Stark
Date:
Markus Schaber <schabi@logix-tt.com> writes:

> Hmm, to remove redundancy, I'd change the <= to a < and define:
>
> if a==b then f(a)==f(b)
> if a<b  then f(a)<=f(b)
>
> > Data types which could probably provide a useful function for f would be
> > int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).

How exactly do you imagine doing this for text?

I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.

--
greg

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
PFC
Date:

> It seems that instead of maintaining a different sorting code path for
> each data type, you could get away with one generic path and one
> (hopefully faster) path if you allowed data types to optionally support
> a 'sortKey' interface by providing a function f which maps inputs to 32-
> bit int outputs, such that the following two properties hold:

    Looks like the decorate-sort-undecorate pattern, which works quite well.
Good idea.
    I would have said a 64 bit int, but it's the same idea. However it won't
work for floats, which is a pity, because floats fit in 64 bits. Unless
more types creep in the code path (which would not necessarily make it
that slower).
    As for text, the worst case is when all strings start with the same 8
letters, but a good case pops up when a few-letter code is used as a key
in a table. Think about a zipcode, for instance. If a merge join needs to
sort on zipcodes, it might as well sort on 64-bits integers...

    By the way, I'd like to declare my zipcode columns as SQL_ASCII while the
rest of my database is in UNICODE, so they are faster to index and sort.
Come on, MySQL does it...

    Keep up !

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Mark Lewis
Date:
On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote:
> > > Data types which could probably provide a useful function for f would be
> > > int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
>
> How exactly do you imagine doing this for text?
>
> I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.


In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
sortKey as elsewhere suggested).  The sorting key doesn't need to be a
one-to-one mapping.

-- Mark Lewis

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Markus Schaber
Date:
Hi, PFC,

PFC schrieb:

>     By the way, I'd like to declare my zipcode columns as SQL_ASCII
> while the  rest of my database is in UNICODE, so they are faster to
> index and sort.  Come on, MySQL does it...

Another use case for parametric column definitions - charset definitions
- and the first one that cannot be emulated via constraints.

Other use cases I remember were range definitions for numbers or PostGIS
dimension, subtype and SRID, but those cann all be emulated via checks /
constraints.

Markus

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
"Steinar H. Gunderson"
Date:
On Fri, Feb 17, 2006 at 12:05:23AM +0100, PFC wrote:
>     I would have said a 64 bit int, but it's the same idea. However it
>     won't  work for floats, which is a pity, because floats fit in 64 bits.

Actually, you can compare IEEE floats directly as ints, as long as they're
positive. (If they can be both positive and negative, you need to take
special care of the sign bit first, but it's still doable.)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
David Lang
Date:
On Thu, 16 Feb 2006, Mark Lewis wrote:

> On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote:
>>>> Data types which could probably provide a useful function for f would be
>>>> int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
>>
>> How exactly do you imagine doing this for text?
>>
>> I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though.
>
>
> In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
> sortKey as elsewhere suggested).  The sorting key doesn't need to be a
> one-to-one mapping.

that would violate your second contraint ( f(a)==f(b) iff (a==b) )

if you could drop that constraint (the cost of which would be extra 'real'
compares within a bucket) then a helper function per datatype could work
as you are talking.

David Lang

Re: [HACKERS] qsort again (was Re: Strange Create

From
Ron
Date:
At 01:47 PM 2/16/2006, Ron wrote:
>At 12:19 PM 2/16/2006, Scott Lamb wrote:
>>On Feb 16, 2006, at 8:32 AM, Ron wrote:
>>>Let's pretend that we have the typical DB table where rows are
>>>~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
>>>such a table.
>>>
>>>A 32b hash code can be assigned to each row value such that only
>>>exactly equal rows will have the same hash code.
>>>A 32b pointer can locate any of the 256M-512M rows.
>>>
>>>Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b
>>>+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
>>>optional pass to rearrange the actual rows if we so wish.
>>
>>I don't understand this.
>>
>>This is a true statement: (H(x) != H(y)) => (x != y)
>>This is not: (H(x) < H(y)) => (x < y)
>>
>>Hash keys can tell you there's an inequality, but they can't tell you
>>how the values compare. If you want 32-bit keys that compare in the
>>same order as the original values, here's how you have to get them:
>For most hash codes, you are correct.  There is a class of hash or
>hash-like codes that maintains the mapping to support that second statement.
>
>More later when I can get more time.
>Ron

OK, so here's _a_ way (there are others) to obtain a mapping such that
  if a < b then f(a) < f (b) and
  if a == b then f(a) == f(b)

Pretend each row is a integer of row size (so a 2KB row becomes a
16Kb integer; a 4KB row becomes a 32Kb integer; etc)
Since even a 1TB table made of such rows can only have 256M - 512M
possible values even if each row is unique, a 28b or 29b key is large
enough to represent each row's value and relative rank compared to
all of the others even if all row values are unique.

By scanning the table once, we can map say 0000001h (Hex used to ease
typing) to the row with the minimum value and 1111111h to the row
with the maximum value as well as mapping everything in between to
their appropriate keys.  That same scan can be used to assign a
pointer to each record's location.

We can now sort the key+pointer pairs instead of the actual data and
use an optional final pass to rearrange the actual rows if we wish.

That initial scan to set up the keys is expensive, but if we wish
that cost can be amortized over the life of the table so we don't
have to pay it all at once.  In addition, once we have created those
keys, then can be saved for later searches and sorts.

Further space savings can be obtained whenever there are duplicate
keys and/or when compression methods are used on the Key+pointer pairs.

Ron







Re: [HACKERS] qsort again (was Re: Strange Create

From
Ragnar
Date:
On fös, 2006-02-17 at 01:20 -0500, Ron wrote:
> At 01:47 PM 2/16/2006, Ron wrote:
> >At 12:19 PM 2/16/2006, Scott Lamb wrote:
> >>On Feb 16, 2006, at 8:32 AM, Ron wrote:
> >>>Let's pretend that we have the typical DB table where rows are
> >>>~2-4KB apiece.  1TB of storage will let us have 256M-512M rows in
> >>>such a table.
> >>>
> >>>A 32b hash code can be assigned to each row value such that only
> >>>exactly equal rows will have the same hash code.
> >>>A 32b pointer can locate any of the 256M-512M rows.
> >>>
> >>>Now instead of sorting 1TB of data we can sort 2^28 to 2^29 32b
> >>>+32b= 64b*(2^28 to 2^29)=  2-4GB of pointers+keys followed by an
> >>>optional pass to rearrange the actual rows if we so wish.
> >>
> >>I don't understand this.
> >>
> >>This is a true statement: (H(x) != H(y)) => (x != y)
> >>This is not: (H(x) < H(y)) => (x < y)
> >>
> >>Hash keys can tell you there's an inequality, but they can't tell you
> >>how the values compare. If you want 32-bit keys that compare in the
> >>same order as the original values, here's how you have to get them:
> >For most hash codes, you are correct.  There is a class of hash or
> >hash-like codes that maintains the mapping to support that second statement.
> >
> >More later when I can get more time.
> >Ron
>
> OK, so here's _a_ way (there are others) to obtain a mapping such that
>   if a < b then f(a) < f (b) and
>   if a == b then f(a) == f(b)

> By scanning the table once, we can map say 0000001h (Hex used to ease
> typing) to the row with the minimum value and 1111111h to the row
> with the maximum value as well as mapping everything in between to
> their appropriate keys.  That same scan can be used to assign a
> pointer to each record's location.

This step is just as expensive as the original sort you
want to replace/improve. If you want to keep this mapping
saved as a sort of an index, or as part ot each row data, this will make
the cost of inserts and updates enormous.

>
> We can now sort the key+pointer pairs instead of the actual data and
> use an optional final pass to rearrange the actual rows if we wish.

How are you suggesting this mapping be accessed? If the
mapping is kept separate from the tuple data, as in an index, then how
will you look up the key?

> That initial scan to set up the keys is expensive, but if we wish
> that cost can be amortized over the life of the table so we don't
> have to pay it all at once.  In addition, once we have created those
> keys, then can be saved for later searches and sorts.

What is the use case where this would work better than a
regular btree index ?

gnari



Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Markus Schaber
Date:
Hi, David,

David Lang schrieb:

>> In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
>> sortKey as elsewhere suggested).  The sorting key doesn't need to be a
>> one-to-one mapping.

> that would violate your second contraint ( f(a)==f(b) iff (a==b) )

no, it doesn't.

When both strings are equal, then the first characters are equal, too.

If they are not equal, the constraint condition does not match.

The first characters of the strings may be equal as f(a) may be equal to
f(b) as to the other constraint.

Markus

Re: [HACKERS] qsort again (was Re: Strange Create

From
Markus Schaber
Date:
Hi, Ron,

Ron schrieb:

> OK, so here's _a_ way (there are others) to obtain a mapping such that
>  if a < b then f(a) < f (b) and
>  if a == b then f(a) == f(b)
>
> Pretend each row is a integer of row size (so a 2KB row becomes a 16Kb
> integer; a 4KB row becomes a 32Kb integer; etc)
> Since even a 1TB table made of such rows can only have 256M - 512M
> possible values even if each row is unique, a 28b or 29b key is large
> enough to represent each row's value and relative rank compared to all
> of the others even if all row values are unique.
>
> By scanning the table once, we can map say 0000001h (Hex used to ease
> typing) to the row with the minimum value and 1111111h to the row with
> the maximum value as well as mapping everything in between to their
> appropriate keys.  That same scan can be used to assign a pointer to
> each record's location.

But with a single linear scan, this cannot be accomplished, as the table
contents are neither sorted nor distributed linearly between the minimum
and the maximum.

For this mapping, you need a full table sort.

> That initial scan to set up the keys is expensive, but if we wish that
> cost can be amortized over the life of the table so we don't have to pay
> it all at once.  In addition, once we have created those keys, then can
> be saved for later searches and sorts.

But for every update or insert, you have to resort the keys, which is
_very_ expensive as it basically needs to update a huge part of the table.

Markus

Re: [HACKERS] qsort again (was Re: Strange Create

From
PFC
Date:
    Has anybody got some profiler data on the amount of time spent in
comparisons during a sort ? Say, the proposals here would give the most
gains on simple types like INTEGER ; so it would be interesting to know
how much time is now spent in comparisons for sorting a column of ints. If
it's like, 10% of the total time, well...

    More hand-waving :

    What are the usage case for sorts ?

    - potentially huge data sets : create index, big joins, reporting queries
etc.
    - small data sets : typically, a query with an ORDER BY which will return
a small amount of rows (website stuff), or joins not small enough to use a
HashAggregate, but not big enough to create an index just for them.

    The cost of a comparison vs. moving stuff around and fetching stuff is
probably very different in these two cases. If it all neatly fits in
sort_mem, you can do fancy stuff (like sorting on SortKeys) which will
need to access the data in almost random order when time comes to hand the
sorted data back. So, I guess the SortKey idea would rather apply to the
latter case only, which is CPU limited.

    Anyway, I was wondering about queries with multipart keys, like ORDER BY
zipcode, client_name, date and the like. Using just an int64 as the key
isn't going to help a lot here. Why not use a binary string of limited
length ? I'd tend to think it would not be that slower than comparing
ints, and it would be faster than calling each comparison function for
each column. Each key part would get assigned to a byte range in the
string.
    It would waste some memory, but for instance, using 2x the memory for
half the time would be a good tradeoff if the amount of memory involved is
in the order of megabytes.
    Also, it would solve the problem of locales. Comparisons involving
locales are slow, but strings can be converted to a canonical form
(removing accents and stuff), and then binary sorted.

    Also I'll insert a plug for the idea that the Sort needs to know if there
will be a LIMIT afterwards ; this way it could reduce its working set by
simply discarding the rows which would have been discarded anyway by the
LIMIT. Say you want the 100 first rows out of a million ordered rows. If
the sort knows this, it can be performed in the amount of memory for a 100
rows sort.


Re: [HACKERS] qsort again (was Re: Strange Create

From
Ron
Date:
At 04:24 AM 2/17/2006, Ragnar wrote:
>On fös, 2006-02-17 at 01:20 -0500, Ron wrote:
> >
> > OK, so here's _a_ way (there are others) to obtain a mapping such that
> >   if a < b then f(a) < f (b) and
> >   if a == b then f(a) == f(b)
>
> > By scanning the table once, we can map say 0000001h (Hex used to ease
> > typing) to the row with the minimum value and 1111111h to the row
> > with the maximum value as well as mapping everything in between to
> > their appropriate keys.  That same scan can be used to assign a
> > pointer to each record's location.
>
>This step is just as expensive as the original
>sort you want to replace/improve.

Why do you think that?  External sorts involve
the equivalent of multiple scans of the table to
be sorted, sometimes more than lgN (where N is
the number of items in the table to be
sorted).  Since this is physical IO we are
talking about, each scan is very expensive, and
therefore 1 scan is going to take considerably
less time than >= lgN scans will be.


>If you want to keep this mapping saved as a sort
>of an index, or as part ot each row data, this
>will make the cost of inserts and updates enormous.

Not sure you've got this right either.  Looks to
me like we are adding a <= 32b quantity to each
row.  Once we know the mapping, incrementally
updating it upon insert or update would seem to
be simple matter of a fast search for the correct
ranking [Interpolation search, which we have all
the needed data for, is O(lglgN).  Hash based
search is O(1)]; plus an increment/decrement of
the key values greater/less than the key value of
the row being inserted / updated.  Given than we
are updating all the keys in a specific range
within a tree structure, that update can be done
in O(lgm) (where m is the number of records affected).

> >  We can now sort the key+pointer pairs instead of the actual data and
> > use an optional final pass to rearrange the actual rows if we wish.
>
>How are you suggesting this mapping be accessed?
>If the mapping is kept separate from the tuple
>data, as in an index, then how will you look up the key?
???  We've effectively created a data set where
each record is a pointer to a DB row plus its
key.  We can now sort the data set by key and
then do an optional final pass to rearrange the
actual DB rows if we so wish.  Since that final
pass is very expensive, it is good that not all
use scenarios will need that final pass.

The amount of storage required to sort this
representation of the table rather than the
actual table is so much less that it turns an
external sorting problem into a internal sorting
problem with an optional final pass that is =1=
scan (albeit one scan with a lot of seeks and
data movement).  This is a big win.  It is a
variation of a well known technique.  See Sedgewick, Knuth, etc.


> > That initial scan to set up the keys is expensive, but if we wish
> > that cost can be amortized over the life of the table so we don't
> > have to pay it all at once.  In addition, once we have created those
> > keys, then can be saved for later searches and sorts.
>
>What is the use case where this would work better than a
>regular btree index ?
Again, ???  btree indexes address different
issues.  They do not in any way help create a
compact data representation of the original data
that saves enough space so as to turn an external
ranking or sorting problem into an internal one.


Ron



Re: [HACKERS] qsort again (was Re: Strange Create

From
Ron
Date:
At 05:19 AM 2/17/2006, Markus Schaber wrote:
>Hi, Ron,
>
>Ron schrieb:
>
> > OK, so here's _a_ way (there are others) to obtain a mapping such that
> >  if a < b then f(a) < f (b) and
> >  if a == b then f(a) == f(b)
> >
> > Pretend each row is a integer of row size (so a 2KB row becomes a 16Kb
> > integer; a 4KB row becomes a 32Kb integer; etc)
> > Since even a 1TB table made of such rows can only have 256M - 512M
> > possible values even if each row is unique, a 28b or 29b key is large
> > enough to represent each row's value and relative rank compared to all
> > of the others even if all row values are unique.
> >
> > By scanning the table once, we can map say 0000001h (Hex used to ease
> > typing) to the row with the minimum value and 1111111h to the row with
> > the maximum value as well as mapping everything in between to their
> > appropriate keys.  That same scan can be used to assign a pointer to
> > each record's location.
>
>But with a single linear scan, this cannot be accomplished, as the table
>contents are neither sorted nor distributed linearly between the minimum
>and the maximum.
So what?  We are talking about key assignment here, not anything that
requires physically manipulating the actual DB rows.
One physical IO pass should be all that's needed.


>For this mapping, you need a full table sort.
One physical IO pass should be all that's needed.  However, let's
pretend you are correct and that we do need to sort the table to get
the key mapping.  Even so, we would only need to do it =once= and
then we would be able to use and incrementally update the results
forever afterward.  Even under this assumption, one external sort to
save all subsequent such sorts seems well worth it.

IOW, even if I'm wrong about the initial cost to do this; it is still
worth doing ;-)


> > That initial scan to set up the keys is expensive, but if we wish that
> > cost can be amortized over the life of the table so we don't have to pay
> > it all at once.  In addition, once we have created those keys, then can
> > be saved for later searches and sorts.
>
>But for every update or insert, you have to resort the keys, which is
>_very_ expensive as it basically needs to update a huge part of the table.

??? You do not need to resort already ordered data to insert a new
element into it such that the data stays ordered!  Once we have done
the key ordering operation once, we should not ever need to do it
again on the original data.  Else most sorting algorithms wouldn't work ;-)


Ron



Re: [HACKERS] qsort again (was Re: Strange Create

From
Martijn van Oosterhout
Date:
On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote:
> >For this mapping, you need a full table sort.
> One physical IO pass should be all that's needed.  However, let's
> pretend you are correct and that we do need to sort the table to get
> the key mapping.  Even so, we would only need to do it =once= and
> then we would be able to use and incrementally update the results
> forever afterward.  Even under this assumption, one external sort to
> save all subsequent such sorts seems well worth it.
>
> IOW, even if I'm wrong about the initial cost to do this; it is still
> worth doing ;-)

I think you're talking about something different here. You're thinking
of having the whole table sorted and when you add a new value you add
it in such a way to keep it sorted. The problem is, what do you sort it
by? If you've sorted the table by col1, then when the user does ORDER
BY col2 it's useless.

Indeed, this is what btrees do, you store the order of the table
seperate from the data. And you can store multiple orders. But even
then, when someone does ORDER BY lower(col1), it's still useless.

And you're right, we still need to do the single mass sort in the
beginning, which is precisely what we're trying to optimise here.

> ??? You do not need to resort already ordered data to insert a new
> element into it such that the data stays ordered!  Once we have done
> the key ordering operation once, we should not ever need to do it
> again on the original data.  Else most sorting algorithms wouldn't work ;-)

We already do this with btree indexes. I'm not sure what you are
proposing that improves on that.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Scott Lamb
Date:
On Feb 16, 2006, at 2:17 PM, Mark Lewis wrote:

Data types which could probably provide a useful function for f would be

int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).


...and with some work, floats (I think just the exponent would work, if nothing else). bytea. Probably just about anything.

Interesting. If you abandon the idea that collisions should be impossible (they're not indexes) or extremely rare (they're not hashes), it's pretty easy to come up with a decent hint to avoid a lot of dereferences.

--


Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Martijn van Oosterhout
Date:
On Fri, Feb 17, 2006 at 08:18:41AM -0800, Scott Lamb wrote:
> On Feb 16, 2006, at 2:17 PM, Mark Lewis wrote:
> >Data types which could probably provide a useful function for f
> >would be
> >int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII).
>
> ...and with some work, floats (I think just the exponent would work,
> if nothing else). bytea. Probably just about anything.
>
> Interesting. If you abandon the idea that collisions should be
> impossible (they're not indexes) or extremely rare (they're not
> hashes), it's pretty easy to come up with a decent hint to avoid a
> lot of dereferences.

Yep, pretty much for any datatype you create a mapping function to map
it to a signed int32. All you have to guarentee is that f(a) > f(b)
implies that a > b. Only if f(a) == f(b) do you need to compare a and
b.

You then change the sorting code to have an array of (Datum,int32)
(ouch, double the storage) where the int32 is the f(Datum). And in the
comparison routines you first check the int32. If they give an order
you're done. On match you do the full comparison.

For integer types (int2,int4,int8,oid) the conversion is straight
forward. For float you'd use the exponent and the first few bits of the
mantissa. For strings you'd have to bail, or use a strxfrm equivalent.
NULL would be INT_MAX pr INT_MIN depending on where you want it. Thing
is, even if you don't have such a function and always return zero, the
results will still be right.

Not a new idea, but it would be very nice to implement. If would
produce nice speedups for type where comparisons are expensive. And
more importantly, the bulk of the comparisons can be moved inline and
make the whole cache-friendlyness discussed here much more meaningful.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Re: [HACKERS] qsort again (was Re: Strange Create

From
Ron
Date:
At 10:53 AM 2/17/2006, Martijn van Oosterhout wrote:
>On Fri, Feb 17, 2006 at 08:23:40AM -0500, Ron wrote:
> > >For this mapping, you need a full table sort.
> > One physical IO pass should be all that's needed.  However, let's
> > pretend you are correct and that we do need to sort the table to get
> > the key mapping.  Even so, we would only need to do it =once= and
> > then we would be able to use and incrementally update the results
> > forever afterward.  Even under this assumption, one external sort to
> > save all subsequent such sorts seems well worth it.
> >
> > IOW, even if I'm wrong about the initial cost to do this; it is still
> > worth doing ;-)
>
>I think you're talking about something different here. You're thinking
>of having the whole table sorted and when you add a new value you add
>it in such a way to keep it sorted. The problem is, what do you sort it
>by? If you've sorted the table by col1, then when the user does ORDER
>BY col2 it's useless.
No, I'm thinking about how to represent DB row data in such a way that
a= we use a compact enough representation that we can sort internally
rather than externally.
b= we do the sort once and avoid most of the overhead upon subsequent
similar requests.

I used the example of sorting on the entire row to show that the
approach works even when the original record being sorted by is very large.
All my previous comments on this topic hold for the case where we are
sorting on only part of a row as well.

If all you are doing is sorting on a column or a few columns, what
I'm discussing is even easier since treating the columns actually
being used a sort criteria as integers rather than the whole row as
an atomic unit eats less resources during the key creation and
mapping process.  If the row is 2-4KB in size, but we only care about
some combination of columns that only takes on <= 2^8 or <= 2^16
different values, then what I've described will be even better than
the original example I gave.

Basically, the range of a key is going to be restricted by how
a= big the field is that represents the key (columns and such are
usually kept narrow for performance reasons) or
b= big each row is (the more space each row takes, the fewer rows fit
into any given amount of storage)
c= many rows there are in the table
Between the conditions, the range of a key tends to be severely
restricted and therefore use much less space than sorting the actual
DB records would.  ...and that gives us something we can take advantage of.


>Indeed, this is what btrees do, you store the order of the table
>seperate from the data. And you can store multiple orders. But even
>then, when someone does ORDER BY lower(col1), it's still useless.
>
>And you're right, we still need to do the single mass sort in the
>beginning, which is precisely what we're trying to optimise here.
Sigh.  My points were:
1= we have information available to us that allows us to map the rows
in such a way as to turn most external sorts into internal sorts,
thereby avoiding the entire external sorting problem in those
cases.  This is a huge performance improvement.

2= that an external sort is =NOT= required for initial key
assignment, but even if it was it would be worth it.

3= that this is a variation of a well known technique so I'm not
suggesting heresy here.


Ron



Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Mark Lewis
Date:
On Thu, 2006-02-16 at 21:33 -0800, David Lang wrote:
> > In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit
> > sortKey as elsewhere suggested).  The sorting key doesn't need to be a
> > one-to-one mapping.
>
> that would violate your second contraint ( f(a)==f(b) iff (a==b) )
>
> if you could drop that constraint (the cost of which would be extra 'real'
> compares within a bucket) then a helper function per datatype could work
> as you are talking.

I think we're actually on the same page here; you're right that the
constraint above ( f(a)==f(b) iff a==b ) can't be extended to data types
with more than 32 bits of value space.  But the constraint I listed was
actually:

if a==b then f(a)==f(b)

Which doesn't imply 'if and only if'.  It's a similar constraint to
hashcodes; the same value will always have the same hash, but you're not
guaranteed that the hashcodes for two distinct values will be unique.

-- Mark

Re: [HACKERS] qsort again (was Re: Strange Create

From
Ragnar
Date:
On fös, 2006-02-17 at 08:01 -0500, Ron wrote:
> At 04:24 AM 2/17/2006, Ragnar wrote:
> >On fös, 2006-02-17 at 01:20 -0500, Ron wrote:
> > >
> > > OK, so here's _a_ way (there are others) to obtain a mapping such that
> > >   if a < b then f(a) < f (b) and
> > >   if a == b then f(a) == f(b)
> >
> > > By scanning the table once, we can map say 0000001h (Hex used to ease
> > > typing) to the row with the minimum value and 1111111h to the row
> > > with the maximum value as well as mapping everything in between to
> > > their appropriate keys.  That same scan can be used to assign a
> > > pointer to each record's location.
> >
> >This step is just as expensive as the original
> >sort you want to replace/improve.
>
> Why do you think that?  External sorts involve
> the equivalent of multiple scans of the table to
> be sorted, sometimes more than lgN (where N is
> the number of items in the table to be
> sorted).  Since this is physical IO we are
> talking about, each scan is very expensive, and
> therefore 1 scan is going to take considerably
> less time than >= lgN scans will be.

Call me dim, but please explain exactly how you are going
to build this mapping in one scan. Are you assuming
the map will fit in memory?

>
>
> >If you want to keep this mapping saved as a sort
> >of an index, or as part ot each row data, this
> >will make the cost of inserts and updates enormous.
>
> Not sure you've got this right either.  Looks to
> me like we are adding a <= 32b quantity to each
> row.  Once we know the mapping, incrementally
> updating it upon insert or update would seem to
> be simple matter of a fast search for the correct
> ranking [Interpolation search, which we have all
> the needed data for, is O(lglgN).  Hash based
> search is O(1)]; plus an increment/decrement of
> the key values greater/less than the key value of
> the row being inserted / updated.  Given than we
> are updating all the keys in a specific range
> within a tree structure, that update can be done
> in O(lgm) (where m is the number of records affected).

Say again ?
Let us say you have 1 billion rows, where the
column in question contains strings like
baaaaaaaaaaaaaaa....aaa
baaaaaaaaaaaaaaa....aab
baaaaaaaaaaaaaaa....aac
...
not necessarily in this order on disc of course

The minimum value would be keyed as 00000001h,
the next one as 00000002h and so on.

Now insert new value 'aaaaa'

Not only will you have to update 1 billion records,
but also all the values in your map.

please explain

gnari



Re: [HACKERS] qsort again (was Re: Strange Create Index

From
Tom Lane
Date:
Mark Lewis <mark.lewis@mir3.com> writes:
> I think we're actually on the same page here; you're right that the
> constraint above ( f(a)==f(b) iff a==b ) can't be extended to data types
> with more than 32 bits of value space.  But the constraint I listed was
> actually:

> if a==b then f(a)==f(b)

I believe Martijn had it right: the important constraint is

    f(a) > f(b) implies a > b

which implies by commutativity

    f(a) < f(b) implies a < b

and these two together imply

    a == b implies f(a) == f(b)

Now you can't do any sorting if you only have the equality rule, you
need the inequality rule.

            regards, tom lane

Re: [HACKERS] qsort again (was Re: Strange Create Index

From
"Jonah H. Harris"
Date:
Last night I implemented a non-recursive introsort in C... let me test it a bit more and then I'll post it here for everyone else to try out.

On 2/16/06, Markus Schaber <schabi@logix-tt.com> wrote:
Hi, Ron,

Ron wrote:

> ...and of course if you know enough about the data to be sorted so as to
> constrain it appropriately, one should use a non comparison based O(N)
> sorting algorithm rather than any of the general comparison based
> O(NlgN) methods.

Sounds interesting, could you give us some pointers (names, URLs,
papers) to such algorithms?

Thanks a lot,
Markus



--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org



--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324

Re: [HACKERS] qsort again (was Re: Strange Create

From
"Gregory Maxwell"
Date:
On 2/17/06, Ragnar <gnari@hive.is> wrote:
> Say again ?
> Let us say you have 1 billion rows, where the
> column in question contains strings like
> baaaaaaaaaaaaaaa....aaa
> baaaaaaaaaaaaaaa....aab
> baaaaaaaaaaaaaaa....aac
> ...
> not necessarily in this order on disc of course
>
> The minimum value would be keyed as 00000001h,
> the next one as 00000002h and so on.
>
> Now insert new value 'aaaaa'
>
> Not only will you have to update 1 billion records,
> but also all the values in your map.
>
> please explain

No comment on the usefulness of the idea overall.. but the solution
would be to insert with the colliding value of the existing one lesser
than it..

It will falsly claim equal, which you then must fix with a second
local sort which should be fast because you only need to sort the
duplicates/false dupes.  If you insert too much then this obviously
becomes completely useless.

Re: qsort again (was Re: Strange Create Index behaviour)

From
Tom Lane
Date:
Last month I wrote:
> It seems clear that our qsort.c is doing a pretty awful job of picking
> qsort pivots, while glibc is mostly managing not to make that mistake.

I re-ran Gary's test script using the just-committed improvements to
qsort.c, and got pretty nice numbers (attached --- compare to
http://archives.postgresql.org/pgsql-performance/2006-02/msg00227.php).
So it was wrong to blame his problems on the pivot selection --- the
culprit was that ill-considered switch to insertion sort.

            regards, tom lane

100 runtimes for latest port/qsort.c, sorted ascending:

Time: 335.481 ms
Time: 335.606 ms
Time: 335.932 ms
Time: 336.039 ms
Time: 336.182 ms
Time: 336.231 ms
Time: 336.711 ms
Time: 336.721 ms
Time: 336.971 ms
Time: 336.982 ms
Time: 337.036 ms
Time: 337.190 ms
Time: 337.223 ms
Time: 337.312 ms
Time: 337.350 ms
Time: 337.423 ms
Time: 337.523 ms
Time: 337.528 ms
Time: 337.565 ms
Time: 337.566 ms
Time: 337.732 ms
Time: 337.741 ms
Time: 337.744 ms
Time: 337.786 ms
Time: 337.790 ms
Time: 337.898 ms
Time: 337.905 ms
Time: 337.952 ms
Time: 337.976 ms
Time: 338.017 ms
Time: 338.123 ms
Time: 338.206 ms
Time: 338.306 ms
Time: 338.514 ms
Time: 338.594 ms
Time: 338.597 ms
Time: 338.683 ms
Time: 338.705 ms
Time: 338.729 ms
Time: 338.748 ms
Time: 338.816 ms
Time: 338.958 ms
Time: 338.963 ms
Time: 338.997 ms
Time: 339.074 ms
Time: 339.106 ms
Time: 339.134 ms
Time: 339.159 ms
Time: 339.226 ms
Time: 339.260 ms
Time: 339.289 ms
Time: 339.341 ms
Time: 339.500 ms
Time: 339.585 ms
Time: 339.595 ms
Time: 339.774 ms
Time: 339.897 ms
Time: 339.927 ms
Time: 340.064 ms
Time: 340.133 ms
Time: 340.172 ms
Time: 340.219 ms
Time: 340.261 ms
Time: 340.323 ms
Time: 340.708 ms
Time: 340.761 ms
Time: 340.785 ms
Time: 340.900 ms
Time: 340.986 ms
Time: 341.339 ms
Time: 341.564 ms
Time: 341.707 ms
Time: 342.155 ms
Time: 342.213 ms
Time: 342.452 ms
Time: 342.515 ms
Time: 342.540 ms
Time: 342.928 ms
Time: 343.548 ms
Time: 343.663 ms
Time: 344.192 ms
Time: 344.952 ms
Time: 345.152 ms
Time: 345.174 ms
Time: 345.444 ms
Time: 346.848 ms
Time: 348.144 ms
Time: 348.842 ms
Time: 354.550 ms
Time: 356.877 ms
Time: 357.475 ms
Time: 358.487 ms
Time: 364.178 ms
Time: 370.730 ms
Time: 493.098 ms
Time: 648.009 ms
Time: 849.345 ms
Time: 860.616 ms
Time: 936.800 ms
Time: 1727.085 ms

Poor performance o

From
"Craig A. James"
Date:
I'm reposting this -- I sent this out a month ago but never got a response, and hope someone can shed some light on
this.

Thanks,
Craig

--------------------------

This is a straightforward query that should be fairly quick, but takes about 30 minutes.  It's a query across three
tables,call them A, B, and C.  The tables are joined on indexed columns. 

Here's a quick summary:

Table A -----> Table B -----> Table C
 A_ID           B_ID           C_ID
                A_ID           NAME
                C_ID

Tables A and B have 6 million rows each.  Table C is small: 67 names, no repeats.  All columns involved in the join are
indexed. The database has been full-vacuumed and analyzed. 

Summary:

1. Query B only:    2.7 seconds, 302175 rows returned
2. Join B and C:    4.3 seconds, exact same answer
3. Join A and B:    7.2 minutes, exact same answer
4. Join A, B, C:    32.7 minutes, exact same answer

Looking at these:

Query #1 is doing the real work: finding the rows of interest.

Queries #1 and #2 ought to be virtually identical, since Table C has
just one row with C_ID = 9, but the time almost doubles.

Query #3 should take a bit longer than Query #1 because it has to join
300K rows, but the indexes should make this take just a few seconds,
certainly well under a minute.

Query #4 should be identical to Query #3, again because there's only
one row in Table C.  32 minutes is pretty horrible for such a
straightforward query.

It looks to me like the problem is the use of nested loops when a hash join should be used, but I'm no expert at query
planning.

This is psql 8.0.3.  Table definitions are at the end.  Hardware is a Dell, 2-CPU Xeon, 4 GB memory, database is on a
singleSATA 7200RPM disk. 

These table and column names are altered to protect the guilty, otherwise these are straight from Postgres.


QUERY #1:
---------

explain analyze select B.A_ID from B where B.B_ID = 9;

Index Scan using i_B_B_ID on B  (cost=0.00..154401.36 rows=131236 width=4) (actual time=0.158..1387.251 rows=302175
loops=1)
Index Cond: (B_ID = 9)
Total runtime: 2344.053 ms


QUERY #2:
---------

explain analyze select B.A_ID from B join C on (B.C_ID = C.C_ID) where C.name = 'Joe';

Nested Loop  (cost=0.00..258501.92 rows=177741 width=4) (actual time=0.349..3392.532 rows=302175 loops=1)
->  Seq Scan on C  (cost=0.00..12.90 rows=1 width=4) (actual time=0.232..0.336 rows=1 loops=1)
     Filter: ((name)::text = 'Joe'::text)
->  Index Scan using i_B_C_ID on B  (cost=0.00..254387.31 rows=328137 width=8) (actual time=0.102..1290.002 rows=302175
loops=1)
     Index Cond: (B.C_ID = "outer".C_ID)
Total runtime: 4373.916 ms


QUERY #3:
---------

explain analyze
select A.A_ID from A
 join B on (A.A_ID = B.A_ID)    where B.B_ID = 9;

Nested Loop  (cost=0.00..711336.41 rows=131236 width=4) (actual time=37.118..429419.347 rows=302175 loops=1)
->  Index Scan using i_B_B_ID on B  (cost=0.00..154401.36 rows=131236 width=4) (actual time=27.344..8858.489
rows=302175loops=1) 
     Index Cond: (B_ID = 9)
->  Index Scan using pk_A_test on A  (cost=0.00..4.23 rows=1 width=4) (actual time=1.372..1.376 rows=1 loops=302175)
     Index Cond: (A.A_ID = "outer".A_ID)
Total runtime: 430467.686 ms


QUERY #4:
---------
explain analyze
select A.A_ID from A
 join B on (A.A_ID = B.A_ID)
 join C on (B.B_ID = C.B_ID)
 where C.name = 'Joe';

Nested Loop  (cost=0.00..1012793.38 rows=177741 width=4) (actual time=70.184..1960112.247 rows=302175 loops=1)
->  Nested Loop  (cost=0.00..258501.92 rows=177741 width=4) (actual time=52.114..17753.638 rows=302175 loops=1)
     ->  Seq Scan on C  (cost=0.00..12.90 rows=1 width=4) (actual time=0.109..0.176 rows=1 loops=1)
           Filter: ((name)::text = 'Joe'::text)
     ->  Index Scan using i_B_B_ID on B  (cost=0.00..254387.31 rows=328137 width=8) (actual time=51.985..15566.896
rows=302175loops=1) 
           Index Cond: (B.B_ID = "outer".B_ID)
->  Index Scan using pk_A_test on A  (cost=0.00..4.23 rows=1 width=4) (actual time=6.407..6.412 rows=1 loops=302175)
     Index Cond: (A.A_ID = "outer".A_ID)
Total runtime: 1961200.079 ms


TABLE DEFINITIONS:
------------------

xxx => \d a
            Table "xxx.a"
      Column       |          Type          | Modifiers
 ------------------+------------------------+-----------
  a_id             | integer                | not null
  ... more columns

 Indexes:
  "pk_a_id" PRIMARY KEY, btree (a_id)
    ... more indexes on other columns

xxx => \d b
                Table "xxx.b"
          Column          |          Type          | Modifiers
 -------------------------+------------------------+-----------
   b_id                   | integer                | not null
   a_id                   | integer                | not null
   c_id                   | integer                | not null
   ... more columns

 Indexes:
  "b_pkey" PRIMARY KEY, btree (b_id)
  "i_b_a_id" btree (a_id)
  "i_b_c_id" btree (c_id)


xxx=> \d c
        Table "xxx.c"
    Column     |          Type          | Modifiers
 --------------+------------------------+-----------
   c_id        | integer                | not null
   name        | character varying(200) |
   ... more columns

 Indexes:
  "c_pkey" PRIMARY KEY, btree (c_id)






Re: Poor performance o

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> It looks to me like the problem is the use of nested loops when a hash join should be used, but I'm no expert at
queryplanning. 

Given the sizes of the tables involved, you'd likely have to boost up
work_mem before the planner would consider a hash join.  What nondefault
configuration settings do you have, anyway?

            regards, tom lane

Re: Poor performance o

From
"Craig A. James"
Date:
Tom Lane wrote:
> "Craig A. James" <cjames@modgraph-usa.com> writes:
>> It looks to me like the problem is the use of nested loops when a hash
>> join should be used, but I'm no expert at query planning.
>
> Given the sizes of the tables involved, you'd likely have to boost up
> work_mem before the planner would consider a hash join.  What nondefault
> configuration settings do you have, anyway?

shared_buffers = 20000
work_mem = 32768
effective_cache_size = 300000

This is on a 4GB machine.  Is there a guideline for work_mem that's related to table size?  Something like, "allow 2 MB
permillion rows"? 

I'm also curious why the big difference between my "Query #1" and "Query #2".  Even though it does a nested loop, #2's
outerloop only returns one result from a very tiny table, so shouldn't it be virtually indistinguishable from #1? 

Thanks,
Craig

Re: Poor performance o

From
Tom Lane
Date:
"Craig A. James" <cjames@modgraph-usa.com> writes:
> Tom Lane wrote:
>> Given the sizes of the tables involved, you'd likely have to boost up
>> work_mem before the planner would consider a hash join.  What nondefault
>> configuration settings do you have, anyway?

> shared_buffers = 20000
> work_mem = 32768
> effective_cache_size = 300000

So for a 6M-row table, 32M work_mem would allow ... um ... 5 bytes per
row.  It's not happening :-(

Try boosting work_mem by a factor of 100 and seeing whether a hash-based
join actually wins or not.  If so, we can discuss where the sane setting
really falls, if not there's no point.

            regards, tom lane

Re: Poor performance o

From
"Jim C. Nasby"
Date:
On Tue, Mar 21, 2006 at 05:04:16PM -0800, Craig A. James wrote:
> Tom Lane wrote:
> >"Craig A. James" <cjames@modgraph-usa.com> writes:
> >>It looks to me like the problem is the use of nested loops when a hash
> >>join should be used, but I'm no expert at query planning.
> >
> >Given the sizes of the tables involved, you'd likely have to boost up
> >work_mem before the planner would consider a hash join.  What nondefault
> >configuration settings do you have, anyway?
>
> shared_buffers = 20000
> work_mem = 32768
> effective_cache_size = 300000
>
> This is on a 4GB machine.  Is there a guideline for work_mem that's related
> to table size?  Something like, "allow 2 MB per million rows"?

No. The general guide is "set it as large as possible without making the
machine start swapping." In some cases, you'll want to bump it up much
higher for certain queries, especially if you know those queries will
only run one at a time.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461