
From Andres Freund
Subject Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits
Date
Msg-id 20190408010427.4l63qr7h2fjcyp77@alap3.anarazel.de
In response to Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2019-04-04 21:04:49 -0700, Andres Freund wrote:
> On 2019-04-04 23:57:58 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > I think the right approach would be to do all of this in heap_insert and
> > > heap_multi_insert. Whenever starting to work on a page, if INSERT_FROZEN
> > > is specified, remember whether it is either currently empty, or is
> > > already marked as all-visible. If previously empty, mark it as all
> > > visible at the end. If already all visible, there's no need to change
> > > that. If not yet all-visible, no need to do anything, since it can't
> > > have been inserted with COPY FREEZE.  Do you see any problem doing it
> > > that way?
> > 
> > Do we want to add overhead to these hot-spot routines for this purpose?
> 
> For heap_multi_insert I can't see it being a problem - it's only used
> from copy.c, and the cost should be "smeared" over many tuples.  I'd
> assume that compared with locking a page, WAL logging, etc, it'd not
> even meaningfully show up for heap_insert. Especially because we already
> have codepaths for options & HEAP_INSERT_FROZEN in
> heap_prepare_insert(), and I'd assume those could be combined.
> 
> I think we should measure it, but I don't think that one or two
> additional, very well predicted, branches are going to be measurable
> in those routines.
> 
> The patch, as implemented, has modifications in
> RelationGetBufferForTuple() that seem like they'd be more expensive.

Here's a *prototype* patch for this.  It only implements what I
described for heap_multi_insert, not for plain inserts. I wanted to see
what others think before investing additional time.

I don't think the overhead of the changed code will be visible in
heap_multi_insert(), and - with somewhat less confidence - I expect it will
also be OK for heap_insert(). But we have to measure that.
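
Very roughly, the shape of that change in heap_multi_insert() looks like the
following (just a sketch to illustrate the idea, not the patch itself; the
local variable name is made up, and this version still goes through today's
visibilitymap_set(), i.e. it still emits the extra WAL record discussed
below):

    bool    starting_with_empty_page = false;

    /* while holding the exclusive content lock on the heap buffer */
    if (options & HEAP_INSERT_FROZEN)
        starting_with_empty_page = PageIsEmpty(page);

    /* ... place the frozen tuples, MarkBufferDirty(), WAL log the inserts ... */

    if ((options & HEAP_INSERT_FROZEN) && starting_with_empty_page)
    {
        /* every tuple on the page is ours and already frozen */
        PageSetAllVisible(page);
        visibilitymap_pin(relation, BufferGetBlockNumber(buffer), &vmbuffer);
        visibilitymap_set(relation, BufferGetBlockNumber(buffer), buffer,
                          InvalidXLogRecPtr, vmbuffer, InvalidTransactionId,
                          VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
    }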


This avoids an extra WAL record for setting empty pages to all visible,
by adding XLH_INSERT_ALL_VISIBLE_SET & XLH_INSERT_ALL_FROZEN_SET, and
setting those when appropriate in heap_multi_insert.  Unfortunately
visibilitymap_set() currently doesn't really allow doing this properly, as it
has the heap's WAL logging embedded in it.
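
Roughly, and with illustrative bit values (the existing XLH_INSERT_* flags
occupy the low bits), the WAL side would be along these lines:

    /* heapam_xlog.h: two new bits for xl_heap_multi_insert.flags */
    #define XLH_INSERT_ALL_VISIBLE_SET      (1 << 4)
    #define XLH_INSERT_ALL_FROZEN_SET       (1 << 5)

    /* heap_multi_insert(): the page started out empty and only received
     * frozen tuples, so the insert record itself can carry the fact that
     * the page ends up all-visible/all-frozen */
    if ((options & HEAP_INSERT_FROZEN) && starting_with_empty_page)
        xlrec->flags |= XLH_INSERT_ALL_VISIBLE_SET | XLH_INSERT_ALL_FROZEN_SET;

    /* heap_xlog_multi_insert(): replay sets PD_ALL_VISIBLE and the VM bits
     * based on these flags, instead of needing a separate XLOG_HEAP2_VISIBLE
     * record generated by visibilitymap_set(). */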

I think we should remove the WAL logging from visibilitymap_set() and move it
into a separate, heap-specific, function. Right now other table AMs would
e.g. have to reimplement visibilitymap_set(), so that's a second reason to
separate out that functionality.  Let me try to come up with a proposal.
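
I.e. something like the following split (both function names here are just
placeholders, not a final proposal):

    /* VM side only: set the bits and stamp the VM page with the caller's
     * LSN, but do no WAL logging of its own; usable by any table AM */
    extern void visibilitymap_set_bits(Relation rel, BlockNumber heapBlk,
                                       XLogRecPtr recptr, Buffer vmBuf,
                                       uint8 flags);

    /* heap-specific wrapper: emits the XLOG_HEAP2_VISIBLE record (or, as
     * above, piggybacks on the insert record) and then sets the bits */
    extern void heap_visibilitymap_set(Relation rel, BlockNumber heapBlk,
                                       Buffer heapBuf, Buffer vmBuf,
                                       TransactionId cutoff_xid, uint8 flags);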


The patch currently does a visibilitymap_pin() while holding an exclusive
lwlock on the heap page. That's something we normally try to avoid - but I
think it's probably OK here, because INSERT_FROZEN can only be used when
inserting into a new relfilenode (which thus no other session can see). I
think it's preferable to have this logic be specific to the INSERT_FROZEN
path, rather than adding nontrivial complications to
RelationGetBufferForTuple().

I noticed that, before this patch, we do a
    if (vmbuffer != InvalidBuffer)
        ReleaseBuffer(vmbuffer);
after every filled page. That doesn't strike me as particularly smart: it's
pretty likely that the next heap page to be filled maps to the same VM page
as the previous iteration.
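
Since visibilitymap_pin() already checks whether the buffer it is handed
covers the right VM page, the buffer could simply be kept pinned across heap
pages and released once at the end, roughly like this ("targetBlock" is just
an illustrative name for the next heap block to fill):

    /* instead of ReleaseBuffer(vmbuffer) after every filled heap page: */
    visibilitymap_pin(relation, targetBlock, &vmbuffer);   /* re-pins only when
                                                             * crossing to a new
                                                             * VM page */

    /* ... fill the heap page ... */

    /* once, after the last page has been filled */
    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);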


I noticed one small oddity that I think is common to all the approaches
presented in this thread so far. After

BEGIN;
TRUNCATE foo;
COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
COPY foo(i) FROM '/tmp/foo' WITH FREEZE;
COMMIT;

we currently end up with pages like:
┌───────┬───────────┬──────────┬───────┬───────┬───────┬─────────┬──────────┬─────────┬───────────┐
│ blkno │    lsn    │ checksum │ flags │ lower │ upper │ special │ pagesize │ version │ prune_xid │
├───────┼───────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼───────────┤
│     0 │ 0/50B5488 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     1 │ 0/50B6360 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     2 │ 0/50B71B8 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     3 │ 0/50B8028 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     4 │ 0/50B8660 │        0 │     4 │   408 │  5120 │    8192 │     8192 │       4 │         0 │
│     5 │ 0/50B94B8 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     6 │ 0/50BA328 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     7 │ 0/50BB180 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     8 │ 0/50BBFD8 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│     9 │ 0/50BCF88 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│    10 │ 0/50BDDE0 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│    11 │ 0/50BEC50 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│    12 │ 0/50BFAA8 │        0 │     4 │   928 │   960 │    8192 │     8192 │       4 │         0 │
│    13 │ 0/50C06F8 │        0 │     4 │   792 │  2048 │    8192 │     8192 │       4 │         0 │
└───────┴───────────┴──────────┴───────┴───────┴───────┴─────────┴──────────┴─────────┴───────────┘
(14 rows)

Note how block 4 has more free space. That's because the visibilitymap_pin()
called by the first COPY has to extend the visibility map via vm_extend(),
which in turn does:

    /*
     * Send a shared-inval message to force other backends to close any smgr
     * references they may have for this rel, which we are about to change.
     * This is a useful optimization because it means that backends don't have
     * to keep checking for creation or extension of the file, which happens
     * infrequently.
     */
    CacheInvalidateSmgr(rel->rd_smgr->smgr_rnode);

which invalidates ->rd_smgr->smgr_targblock *after* the first COPY, because
that's when the pending smgr invalidations are sent out - so the second COPY
doesn't resume filling the partly empty block 4, it starts on a fresh page.
That's far from great, but it doesn't seem to be this patch's fault.

It seems to me we need a separate invalidation that doesn't close the
whole smgr relation, but just invalidates the VM specific fields.
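
Something along these lines, presumably - a narrower invalidation whose
handler only resets the cached VM fork size, instead of smgrclosenode()
throwing away the whole entry including smgr_targblock (the function name is
made up, and how such a message would be routed is glossed over):

    static void
    smgr_invalidate_vm(SMgrRelation reln)
    {
        /* keep smgr_targblock; just force the next VM access to re-check
         * the fork's size */
        reln->smgr_vm_nblocks = InvalidBlockNumber;
    }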

Greetings,

Andres Freund
