Re: Batching in executor - Mailing list pgsql-hackers
| From | Amit Langote |
|---|---|
| Subject | Re: Batching in executor |
| Date | |
| Msg-id | CA+HiwqGkway9dcpTVmRs0Q=ZsQQ8Nx0MRjy9jfDmR8QFLnd-Mg@mail.gmail.com |
| In response to | Re: Batching in executor (Junwang Zhao <zhjwpku@gmail.com>) |
| Responses | Re: Batching in executor |
| List | pgsql-hackers |
Hi,

Here is a significantly revised version of the patch series. A lot has changed since the January submission, so I want to summarize the design changes before getting into the patches.

I think this addresses the points in the two reviews that landed since v5, though a number of those points may have become moot after my rewrite of the relevant portions (thanks Junwang and ChangAo for the reviews in any case).

At this point it might be better to think of this as targeting v20. That said, if there is review bandwidth in the remaining two weeks before the v19 feature freeze, the rs_vistuples[] change described below, a standalone improvement to the existing pagemode scan path, could be considered for v19, though even that is an optimistic scenario.

It is also worth noting that Andres identified a number of inefficiencies in the existing scan path in:

Re: unnecessary executor overheads around seqscans
https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw

that are worth fixing independently of batching. Some of those fixes may be better pursued first, both because they benefit all scan paths and because they would make batching's gains more honest.

Separately, after looking at the previous version, Andres pointed out offlist two fundamental issues with the patch's design:

* The heapam implementation (in a version of the patch I didn't post to the thread) duplicated heap_prepare_pagescan() logic in a separate batch-specific code path, which is not acceptable: changes should benefit the existing slot interface too, and the duplication would hurt future maintainability. The v5 version of that code is not great in that respect either; it instead duplicated heapgettup_pagemode() to slap batching on it.
* Allocating executor_batch_rows slots on the executor side to receive rows from the AM adds significant overhead for slot initialization and management, and for non-row-organized AMs that do not produce individual rows at all, those slots would never be meaningfully populated. In any case, he was never a fan of the slot-array approach from the moment I mentioned it.

For context, the previous version had two slot arrays, inslots and outslots, of TTSOpsHeapTuple type (not TTSOpsBufferHeapTuple, because buffer pins were managed by the batch code, which has its own modularity/correctness issues), populated via a materialize_all callback. A batch qual evaluator would copy qualifying tuples into outslots, with an activeslots pointer switching between the two depending on whether batch qual evaluation was used.

The new design addresses both issues and differs from the previous version in several other ways:

* Single slot instead of slot arrays: there is a single TupleTableSlot, reusing the scan node's ss_ScanTupleSlot, whose type was already determined by the AM via table_slot_callbacks(). The slot is re-pointed to each HeapTuple in the current buffer page via a new repoint_slot AM callback, with no materialization or copying. Tuples are returned one by one from the executor's perspective, but the AM serves them in page-sized batches from pre-built HeapTupleData descriptors in rs_vistuples[], avoiding repeated descent into heapam per tuple. This is heapam's implementation of the batch interface; there is no intention to force other AMs into the same row-oriented model.

* Batch qual evaluator not included: with the single-slot model, quals are evaluated per tuple via the existing ExecQual path after each repoint_slot call.
A natural next step would be a new opcode (EEOP) that calls repoint_slot() internally within expression evaluation, allowing ExecQual to advance through multiple tuples from the same batch without returning to the scan node each time, with qual results accumulated in a bitmask in ExprState. The details of that will be worked out in a follow-on series.

* heapgettup_pagemode_batch() gone: patch 0001 (described below) makes HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which allows heap_getnextbatch() to simply advance a slice pointer into that array without any additional copying or re-entering heap code, making a separate batch-specific scan function unnecessary.

* TupleBatch renamed to RowBatch: "row batch" is more natural terminology for this concept and is also consistent with how similar abstractions are named in columnar and OLAP systems.

* AM callbacks now take RowBatch directly: previously heap_getnextbatch() returned a void pointer that the executor would store into RowBatch.am_payload, because only the executor knew the internals of RowBatch. Now the AM receives RowBatch directly as a parameter and can populate it without the executor acting as an intermediary. This is also why RowBatch is introduced in its own patch ahead of the AM API addition, so the struct definition is available to both sides.

Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries instead of OffsetNumbers, as a standalone improvement to the existing pagemode scan path. Measured on a pg_prewarm'd (and, in the all-visible case, VACUUM FREEZE'd) table with 1M/5M/10M rows:

| query | all-visible | not-all-visible |
|---|---|---|
| count(*) | -0.2% to +0.9% | -0.4% to +0.5% |
| count(*) WHERE id % 10 = 0 | -1.1% to +3.4% | +0.2% to +1.5% |
| SELECT * LIMIT 1 OFFSET N | -2.2% to -0.6% | -0.9% to +6.6% |
| SELECT * WHERE id%10=0 LIMIT | -0.8% to +3.9% | +0.9% to +9.6% |

No significant regression on either page type.
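To make the rs_vistuples[] and heap_getnextbatch() mechanics above concrete, here is a minimal, self-contained C model of the idea. All names (MiniScan, MiniBatch, mini_getnextbatch, BATCH_ROWS) are hypothetical stand-ins, not the patch's actual API; the point is only that once the array holds full tuple descriptors, handing out a batch is just advancing a slice pointer, with no copying and no re-entry into AM code:

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_ROWS 2            /* stand-in for executor_batch_rows */

typedef struct MiniTuple        /* stand-in for HeapTupleData */
{
    int id;
} MiniTuple;

typedef struct MiniScan         /* stand-in for HeapScanDesc */
{
    MiniTuple vistuples[8];     /* full tuples, as in patched rs_vistuples[] */
    int       nvistuples;       /* visible tuples on the current page */
    int       curindex;         /* next tuple to hand out */
} MiniScan;

typedef struct MiniBatch        /* stand-in for RowBatch */
{
    MiniTuple *tuples;          /* slice into vistuples[], no copying */
    int        ntuples;
} MiniBatch;

/*
 * Model of heap_getnextbatch(): point the batch at the next slice of
 * vistuples[] and advance; returns 0 when the page is exhausted.
 */
static int
mini_getnextbatch(MiniScan *scan, MiniBatch *batch)
{
    int remaining = scan->nvistuples - scan->curindex;

    if (remaining <= 0)
        return 0;
    batch->tuples = &scan->vistuples[scan->curindex];
    batch->ntuples = remaining < BATCH_ROWS ? remaining : BATCH_ROWS;
    scan->curindex += batch->ntuples;
    return 1;
}
```

With three visible tuples and a batch size of two, successive calls yield slices of two and one tuples and then report exhaustion, all aliasing the same backing array.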
The structural improvement is most visible on not-all-visible pages, where HeapTupleSatisfiesMVCCBatch() already reads every tuple header during visibility checks, so persisting the result into rs_vistuples[] eliminates the downstream re-read (in heapgettup_pagemode()) with no measurable overhead. That said, these numbers are somewhat noisy on my machine. Results on other machines would be welcome.

Patches 0002-0005 add, respectively: the RowBatch infrastructure, the batch AM API and heapam implementation, seqscan variants that use the new scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support. With batching enabled (executor_batch_rows=300, ~MaxHeapTuplesPerPage):

| query | all-visible | not-all-visible |
|---|---|---|
| count(*) | +11 to +15% | +9 to +13% |
| count(*) WHERE id % 10 = 0 | +6 to +11% | +10 to +14% |
| SELECT * LIMIT 1 OFFSET N | +16 to +19% | +16 to +22% |
| SELECT * WHERE id%10=0 LIMIT | +8 to +10% | +8 to +13% |

With executor_batch_rows=0, results are within noise of master across all query types and sizes, confirming no regression from the infrastructure changes themselves.

The not-all-visible results tend to show slightly higher gains than the all-visible case. This is likely because the existing heapam code is more optimized for the all-visible path; the not-all-visible path, which goes through HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has more headroom that batching can exploit.

Setting aside the current series for a moment, there are some broader design questions worth raising while we have attention on this area. Some of these echo points Tomas raised in his first reply on this thread. I am reiterating them deliberately: I have not managed to fully address them on my own (or simply did not need to for TAM-to-scan-node batching), and I think they would benefit from wider input rather than just my own iteration.
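Before getting to those questions, the single-slot consumption model described earlier can also be sketched in miniature. This is a hypothetical, self-contained C model (RowSlot, RowTuple, count_matching are invented names, and the modulo check merely stands in for ExecQual): the scan slot is a borrowed pointer that is re-aimed at each tuple of the batch, so per-tuple qual evaluation proceeds without any materialization or copying:

```c
#include <assert.h>
#include <stddef.h>

typedef struct RowTuple         /* stand-in for a HeapTupleData entry */
{
    int id;
} RowTuple;

typedef struct RowSlot          /* stand-in for the reused ss_ScanTupleSlot */
{
    const RowTuple *tuple;      /* borrowed pointer; nothing is copied */
} RowSlot;

/* Model of the repoint_slot AM callback: just re-aim the slot. */
static void
repoint_slot(RowSlot *slot, const RowTuple *tup)
{
    slot->tuple = tup;
}

/* Per-tuple qual evaluation over a batch using one reused slot. */
static int
count_matching(RowSlot *slot, const RowTuple *batch, int n)
{
    int nmatch = 0;

    for (int i = 0; i < n; i++)
    {
        repoint_slot(slot, &batch[i]);  /* no copy, just re-point */
        if (slot->tuple->id % 10 == 0)  /* stand-in for ExecQual */
            nmatch++;
    }
    return nmatch;
}
```

After the loop the slot still aliases the last batch entry, which is why, in the real design, the AM's batch (and its buffer pin) must outlive the slot's use of it.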
We should also start thinking about other ways the executor can consume batch rows, not always assuming they are presented as HeapTupleData. For instance, an AM could expose decoded column arrays directly to operators that can consume them, bypassing slot-based deform entirely, or a columnar AM could implement scan_getnextbatch by decoding column strips directly into the batch without going through per-tuple HeapTupleData at all. Feedback on whether the current RowBatch design, and the choices made in the scan_getnextbatch and RowBatchOps APIs, make that sort of thing harder than it needs to be would be appreciated.

For example, heapam's implementation of scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed to HeapTupleData entries one at a time via repoint_slot in RowBatchHeapOps. That works for heapam, but a columnar AM could implement scan_getnextbatch to decode column strips directly into arrays in the batch, with no per-row repoint step needed at all. Any adjustments that would make RowBatch more AM-agnostic are worth discussing now, before the design hardens.

There are also broader open questions about how far the batch model can extend beyond the scan node. Qual pushdown into the AM has been discussed in nearby threads and would be one way to allow expression evaluation to happen before data reaches the executor proper, though that is a separate effort. For the purposes of this series, expression evaluation still happens in the executor after scan_getnextbatch returns. If the scan node does not project, the buffer heap slot is passed directly to the parent node, which calls slot callbacks to deform as needed. But once a node above projects, aggregates, or joins, the notion of a page-sized batch from a single AM loses its meaning and virtual slots take over.
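To make the columnar possibility concrete, here is a minimal, hypothetical sketch (ColBatch and colbatch_sum_id are invented names, not anything in the patch) of a batch whose payload is decoded column arrays with a validity bitmask rather than per-row tuples; an operator that understands the layout can run a tight loop directly over the array with no slot deform and no per-row repoint step:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical column-array batch payload: one typed packed array per
 * column (only "id" here) plus a validity bitmask for NULLs. Limited to
 * 64 rows per batch by the single-word mask, purely for brevity.
 */
typedef struct ColBatch
{
    const int32_t *id;          /* packed native values, not Datum-boxed */
    uint64_t       validity;    /* bit i set => row i is non-NULL */
    int            nrows;
} ColBatch;

/* A tight loop an aggregate could run directly over the column array. */
static long
colbatch_sum_id(const ColBatch *b)
{
    long sum = 0;

    for (int i = 0; i < b->nrows; i++)
    {
        if (b->validity & (UINT64_C(1) << i))
            sum += b->id[i];
    }
    return sum;
}
```

Nothing in the sketch is row-shaped, which is the kind of consumer the RowBatch API would ideally not rule out.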
Whether RowBatch is usable or meaningful beyond the scan/TAM boundary in any form, and whether the core executor will ever have non-HeapTupleData batch consumption paths or will leave that entirely to extensions, are open questions worth discussing.

For RowBatch to eventually play the role that TupleTableSlot plays for row-at-a-time execution, something inside it would need to serve as the common currency for batch data, analogous to TupleTableSlot's datum/isnull arrays. Column arrays are the obvious direction, but even that leaves open the question of representation. PostgreSQL's Datum is a pointer-sized abstraction that boxes everything, whereas vectorized systems use typed packed arrays of native types with validity bitmasks, which is a significant part of why tight vectorized loops are fast there. Whether column arrays of Datum would be good enough, or whether going further toward typed packed arrays would be necessary to get meaningful vectorization, is a deeper design question that this series deliberately does not try to answer.

Even though the focus is on getting batching working at the scan/TAM boundary first, thoughts on any of these points would be welcome.

--
Thanks, Amit Langote
Attachment
- v6-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch
- v6-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch
- v6-0003-Add-batch-table-AM-API-and-heapam-implementation.patch
- v6-0004-SeqScan-add-batch-driven-variants-returning-slots.patch
- v6-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch