Thread: Bad pg_stats.correlation with repeated keys.

Bad pg_stats.correlation with repeated keys.

From: Ralph Loader
Date:
Hi,

I'm experiencing a performance problem with SQL queries on a table with
large numbers of repeated keys.

I've tracked it down to the pg_stats.correlation being overly
optimistic.  The structure is:

* Tables of ~100 million 100-byte rows, with a per-second timestamp.
* A single btree index on the timestamp.
* We're inserting ~10,000 rows/second, almost (but not entirely) in
timestamp order.
* The tables are insert-only: deletes are done exclusively by dropping
entire tables.

I found:

* Disk-I/O-bound queries on the tables are much faster with a
bitmap-scan than with an index-scan [an order of magnitude difference].

* No matter how high I set random_page_cost (and with
cursor_tuple_fraction=1), I could not get the query planner to use
bitmap-scans.  It would insist on index-scans until I set
random_page_cost to around 10000, and then it flipped to a full-table
seq-scan.

* Running strace on the postgres process during an index-scan shows
that the data reads are highly non-sequential, with an lseek() before
nearly every read.

* The pg_stats.correlation value on the index is 0.999993 (varies
slightly, nearly always 5 nines).  The strace output indicates that this
is completely wrong.

* Tweaking the postgres code to limit the correlation used to the range
-0.99 to 0.99 immediately got the query planner using bitmap-scans
(sketched below).
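
Roughly, the tweak looks like this (a simplified sketch of the idea
rather than the exact patch, and the function names are mine; the
interpolation follows the shape of cost_index() in costsize.c, which
blends best- and worst-case I/O cost by the squared correlation, so
capping |correlation| keeps some random-access cost in the estimate):

    /* Cap the correlation the planner is allowed to believe. */
    double
    clamp_correlation(double correlation)
    {
        const double limit = 0.99;      /* the bound I experimented with */

        if (correlation > limit)
            return limit;
        if (correlation < -limit)
            return -limit;
        return correlation;
    }

    /* Index I/O cost interpolation, in the shape of cost_index(). */
    double
    interpolated_io_cost(double min_io_cost, double max_io_cost,
                         double correlation)
    {
        double c = clamp_correlation(correlation);
        double csquared = c * c;

        /* csquared == 1 collapses to the best (fully sequential) case. */
        return max_io_cost + csquared * (min_io_cost - max_io_cost);
    }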

I suspect that the cause is related to this comment in nbtinsert.c:

   "If the new key is equal to one or more existing keys, we can
    legitimately place it anywhere in the series of equal keys"

This is true, but arbitrarily ordering repeated keys breaks the
correlation between the index order and the table data order.
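
To make that concrete, here is a small self-contained simulation (my
own illustration, not postgres code; the constants are scaled down
from the table described above).  The value-versus-heap-position
correlation, which is in the spirit of what ANALYZE measures, comes
out essentially 1, yet almost no two consecutive index entries point
at adjacent heap rows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define NROWS 1000000       /* rows in the (scaled-down) table */
    #define DUPS  1000          /* rows per distinct timestamp */

    static long heap_of_index[NROWS];  /* heap position of each index entry */

    int
    main(void)
    {
        long i, sequential = 0;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0, n = NROWS, r;

        /* Build the index order: runs of equal keys, scrambled within
         * each run, as the nbtinsert.c comment allows. */
        for (i = 0; i < NROWS; i++)
            heap_of_index[i] = i;
        for (i = 0; i < NROWS; i++)
        {
            long j = (i / DUPS) * DUPS + rand() % DUPS; /* stay in the run */
            long tmp = heap_of_index[i];

            heap_of_index[i] = heap_of_index[j];
            heap_of_index[j] = tmp;
        }

        /* Correlation of key value against heap position; the shuffle
         * above does not enter into this at all. */
        for (i = 0; i < NROWS; i++)
        {
            double x = (double) i;          /* heap position */
            double y = (double) (i / DUPS); /* timestamp stored there */

            sx += x; sy += y;
            sxx += x * x; syy += y * y; sxy += x * y;
        }
        r = (n * sxy - sx * sy) /
            sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));

        /* What an index-scan sees: how often is the next heap fetch
         * the row immediately after the previous one? */
        for (i = 1; i < NROWS; i++)
            if (heap_of_index[i] == heap_of_index[i - 1] + 1)
                sequential++;

        printf("correlation        = %.8f\n", r);
        printf("sequential fetches = %ld of %d\n", sequential, NROWS - 1);
        return 0;
    }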

I think that either pg_stats.correlation should reflect that fact, or
better still, the btree insert should attempt to match the on-disk
data order.  [For my use case, inserting repeated keys after existing
ones would be right; I am not sure whether that is always true.]

Cheers,
Ralph Loader.

PS.  A tangentially related comment: bitmap scans achieve good ordering
for the data reads, but not necessarily for the index look-up.  A
breadth-first btree lookup that takes disk order into account would give
a significant performance gain for large un-cached indexes.


Re: Bad pg_stats.correlation with repeated keys.

From: Tom Lane
Date:
[ catching up on old email ]

Ralph Loader <ralph.loader@emulex.com> writes:
> I'm experiencing a performance problem with SQL queries on a table with
> large numbers of repeated keys.
> I've tracked it down to the pg_stats.correlation being overly
> optimistic ...
> [ for an index on a low-resolution time-of-insertion column ]

> * The pg_stats.correlation value on the index is 0.999993 (varies
> slightly, nearly always 5 nines).  The strace output indicates that this
> is completely wrong.

> * Tweaking the postgres code to limit the correlation used to the range
> -0.99 to 0.99 immediately got the query planner using bitmap-scan.

> I suspect that the cause is related to this comment in nbtinsert.c:

>    "If the new key is equal to one or more existing keys, we can
>     legitimately place it anywhere in the series of equal keys"

> This is true, but arbitrarily ordering repeated keys breaks the
> correlation between the index order and the table data order.

> I think that either pg_stats.correlation should reflect that fact, or
> better still, the btree insert should attempt to match the on-disk
> data order.  [For my use case, inserting repeated keys after existing
> ones would be right; I am not sure whether that is always true.]

I don't like the idea of monkeying with the correlation estimate here.

It seems like the ideal fix is to get rid of the randomized choice in
btree insertion in favor of breaking ties using the TID to be inserted;
that is, for insertion attempts, we'd view the tuple TID as an additional
low-order index column, and thus there would never be any true duplicates
and every incoming index entry would have a unique place where it should
go.  This would guarantee that reading a long sequence of duplicate keys
from the index would visit the heap in TID order.
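
The comparison rule would look roughly like this (a sketch with
simplified stand-in types, not the real nbtree structures):

    #include <stdint.h>

    typedef struct IndexEntry
    {
        int64_t  key;       /* the indexed value, e.g. a timestamp */
        uint32_t block;     /* heap block number  -- together these   */
        uint16_t offset;    /* heap line pointer  -- stand in for TID */
    } IndexEntry;

    /* Treat the heap TID as a trailing low-order column: equal keys
     * sort in heap order, so every incoming entry has exactly one
     * legal position and true duplicates never arise. */
    int
    compare_entries(const IndexEntry *a, const IndexEntry *b)
    {
        if (a->key != b->key)
            return (a->key < b->key) ? -1 : 1;
        if (a->block != b->block)
            return (a->block < b->block) ? -1 : 1;
        if (a->offset != b->offset)
            return (a->offset < b->offset) ? -1 : 1;
        return 0;
    }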

However, I'm unsure what such a rule would do to insertion costs.  The
point of the existing code is to avoid unnecessary index page splits.
Before we had that logic, a bunch of equal-keyed insertions would all
land at the same spot in the index, and thus result in a series of
page splits, even though there might be plenty of free space in nearby
pages.  I'm worried that if we're inserting tuples in TID order (which'd
usually be the case for an insert-mostly table like yours), we'd end up
with the same problem only at the right end instead of the left end of
the run of duplicates.

Another problem is backward compatibility with existing btree indexes in
which equal-keyed entries fail to appear in TID order.  The tree descent
algorithms would see that as an index with corrupt ordering, and possibly
misbehave.  It might be that no serious harm would ensue, and you'd still
get the new index entry inserted at some legal place within the run of
matching key values.  Or maybe not; it'd take some investigation to be
sure.

Anyway, I'm not terribly excited about this scenario myself, and am not
volunteering to do any of the above.  Just throwing out some thoughts
in case anybody else wants to work on it.

            regards, tom lane