Eliminating CREATE INDEX comparator TID tie-breaker overhead - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Eliminating CREATE INDEX comparator TID tie-breaker overhead
Date
Msg-id CAM3SWZQpZygWdVUC6H_rctP5_FS7g7_EKz7fqeOa6hYiULU6pA@mail.gmail.com
Whole thread Raw
Responses Re: Eliminating CREATE INDEX comparator TID tie-breaker overhead
List pgsql-hackers
I have one more idea for accelerating sorting-related operations that
is both fairly effective and relatively easy to implement. This just
about clears my backlog of those, though.

There is an open item on the TODO list entitled "Consider whether
duplicate keys should be sorted by block/offset" [1].

Currently, comparetup_index_btree() and comparetup_index_hash() do a
tie-breaker on ItemPointer (heap TID). This started as insurance
against bad qsort() implementations that do badly with many equal
keys, but it was also thought that there is some value in having equal
keys be in physical (heap) order. Clearly the first concern has been
obsolete since 2006, when a high quality qsort() was added to
Postgres, but the second concern is probably still valid.

Overhead
-------------

Tom said in 2008 that the CREATE INDEX overhead of doing this may be
about 7% [2]. It seems even worse now, presumably due to SortSupport
reducing costs elsewhere. Several years ago, I suggested ripping out
the tie-breaker too, but didn't get very far with that idea due to the
potential downsides for index scans. I'm not about to revisit the
discussion from 2011 about whether or not it should be torn out. I'd
rather just (mostly) fix the real problem without changing the
behavior of comparetup_index_btree() (or comparetup_index_hash()).

The real problem is that we do pointer chasing to get to the
IndexTuple ("tuple proper"), where we might get away with only
examining the SortTuple, as would happen in comparetup_heap() in the
event of an equivalent heap tuplesort, for example. The added memory
latency hurts lower cardinality attributes (when only one attribute
appears in the Index definition and the IndexTuple isn't cache
resident), and wastes memory bandwidth on the system, which is
something that there is virtually never enough of.

Patch
=====

Attached patch adds a new tie-breaker which is used only when the
tuples being sorted still fit in memory. Currently, the field
SortTuple.tupindex has a different purpose depending on whether we're
building initial runs or merging runs. However, SortTuple.tupindex
currently has no purpose when a sort can complete entirely in memory,
so that case is available to support a third meaning:
SortTuple.tupindex can be used to hold a simple ordinal number. When
our status is TSS_INITIAL (when we qsort()), this is used as a proxy
for what the TID tie-breaker would ordinarily return.

This reliably preserves the existing behavior because index tuplesort
clients invariably scan heap tuples in physical order, and so nothing
is lost.

Performance
-----------------

The patch can make CREATE INDEX with an unlogged B-Tree index with a
single int4 attribute as much as 15% faster (although I think under
10% is much more likely). That seems like an improvement that makes
the patch worthwhile.

Design considerations and consequences
--------------------------------------------------------

I don't think that there is much point in getting rid of
SortTuple.tupindex for in-memory sorts, which this patch closes off
the possibility of. I tested the performance of a SortTuple struct
with only a datum1 and "tuple proper" for in-memory sorting at one
time, and it was disappointing, and in any case unworkably invasive.

I also don't think it's worth worrying about the fact that tupindex is
assigned an ordinal number even in cases where it won't be used inside
the comparator. The overhead has to be indistinguishable from zero
anyway, and the right place to put that assignment happens to be where
there is generic TSS_INITIAL copying within puttuple_common().

I'm not concerned about synchronized scans breaking my assumption of a
physical ordering of heaptuples being fed to tuplesort.c. I think that
it is unlikely to ever be worth seriously considering this case.

I have a hard time imagining anything (beyond synchronous scans)
breaking my assumption that index tuplesorts receive tuples in heap
physical order. If anything was to break that in the future (e.g.
parallelizing the heap scan for index builds), then IMV the onus
should be on that new case to take appropriate precautions against
breaking my assumption.

[1] https://wiki.postgresql.org/wiki/Todo#Sorting
[2] http://www.postgresql.org/message-id/23321.1205726381@sss.pgh.pa.us
--
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: brin index vacuum versus transaction snapshots
Next
From: Robert Haas
Date:
Subject: Re: [PROPOSAL] VACUUM Progress Checker.