Thread: POC: Cleaning up orphaned files using undo logs
Hello hackers,

The following sequence creates an orphaned file:

BEGIN;
CREATE TABLE t ();
<kill -9 this backend>

Occasionally there are reports of systems that have managed to produce a lot of them, perhaps through ENOSPC-induced panics, OOM signals or buggy/crashing extensions etc. The most recent example I found in the archives involved 1.7TB of unexpected files and some careful cleanup work.

Relation files are created eagerly, and rollback is handled by pushing PendingRelDelete objects onto the pendingDeletes list, to be discarded on commit or processed on abort. That's effectively a kind of specialised undo log, but it's in memory only, so it's less persistent than the effects it is supposed to undo.

Here's a proof-of-concept patch that plugs the gap using the undo log technology we're developing as part of the zheap project. Note that zheap is not involved here: the SMGR module is acting as a direct client of the undo machinery.

Example:

postgres=# begin;
BEGIN
postgres=# create table t1 ();
CREATE TABLE
postgres=# create table t2 ();
CREATE TABLE

... now we can see that this transaction has some undo data (discard < insert):

postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs;
 logno |     discard      |      insert      | xid |  pid
-------+------------------+------------------+-----+-------
     0 | 00000000000021EF | 0000000000002241 | 581 | 18454
(1 row)

... and, if the test_undorecord module is installed, we can inspect the records it holds:

postgres=# call dump_undo_records(0);
NOTICE:  0000000000002224: Storage: CREATE dbid=12655, tsid=1663, relfile=24594
NOTICE:  00000000000021EF: Storage: CREATE dbid=12655, tsid=1663, relfile=24591
CALL

If we COMMIT, the undo data is discarded by advancing the discard pointer (tail) to match the insert pointer (head). If we ROLLBACK, either explicitly or automatically by crashing and recovering, then the files will be unlinked and the insert pointer will be rewound; either way the undo log eventually finishes up "empty" again (discard == insert). This is done with a system of per-rmgr-ID record types and callbacks, similar to redo. The rollback actions are either executed immediately or offloaded to an undo worker process, depending on simple heuristics.

Of course this isn't free, and the current patch makes table creation slower. The goal is to make sure that there is no scenario (kill -9, power cut etc) in which there can be a new relation file on disk, but not a corresponding undo record that would unlink that file if the transaction later has to roll back. Currently, that means that we need to flush the WAL record that will create the undo record that will unlink the file *before* we create the relation file. I suspect that could be mitigated quite easily by deferring file creation in a backend-local queue until forced by access or commit. I didn't try to do that in this basic version.

There are probably other ways to solve the specific problem of orphaned files, but this approach is built on a general reusable facility and I think it is a nice way to show the undo concepts, and how they are separate from zheap. Specifically, it demonstrates the more traditional of the two uses for undo logs: a reliable way to track actions that must be performed on rollback. (The other use is: seeing past versions of data, for vacuumless MVCC; that's a topic for later.)
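To make the per-rmgr callback idea above a bit more concrete, here is a minimal sketch of what an SMGR undo handler could look like. To be clear, this is illustrative only and not code from the patch: the function name, signature, payload access and the unlink call are all assumptions; only UnpackedUndoRecord and the dbid/tsid/relfile fields shown by dump_undo_records() correspond to things that actually appear in the patch set.

/*
 * Illustrative sketch only: rollback hands the rmgr every undo record the
 * transaction wrote, and the handler redoes the pendingDeletes work, but
 * driven from persistent undo records instead of backend-local memory.
 */
typedef struct UndoRecStorageCreate
{
    Oid     dbid;       /* database */
    Oid     tsid;       /* tablespace */
    Oid     relfile;    /* relfilenode to unlink on rollback */
} UndoRecStorageCreate;

static void
smgr_undo(int nrecords, UnpackedUndoRecord *records)
{
    for (int i = 0; i < nrecords; i++)
    {
        UndoRecStorageCreate *create;
        RelFileNode rnode;

        /* Payload layout assumed from the dump_undo_records() output. */
        create = (UndoRecStorageCreate *) records[i].uur_payload.data;
        rnode.spcNode = create->tsid;
        rnode.dbNode = create->dbid;
        rnode.relNode = create->relfile;

        /* Unlink the file created eagerly by CREATE TABLE (call assumed). */
        smgrdounlink(smgropen(rnode, InvalidBackendId), false);
    }
}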
Patches 0001-0006 are development snapshots of material posted on other threads already[1][2], hacked around by me to make this possible (see those threads for further developments in those patches including some major strengthening work, coming soon). The subject of this thread is 0007, the core of which is just a couple of hundred lines written by me, based on an idea from Robert Haas.

Personally I think it'd be a good feature to get into PostgreSQL 12, and I will add it to the CF that is about to start to seek feedback. It passes make check on Unix and Windows, though currently it's failing some of the TAP tests for reasons I'm looking into (possibly due to bugs in the lower level patches, not sure).

Thanks for reading,

[1] https://www.postgresql.org/message-id/flat/CAEepm=2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
While I've been involved in the design discussions for this patch set, I haven't looked at any of the code personally in a very long time. I certainly don't claim to be an independent reviewer, and I encourage others to review this work also. That said, here are some review comments.

I decided to start with 0005, as that has the user-facing documentation for this feature.

There is a spurious whitespace-only hunk in monitoring.sgml.

+         <entry>Process ID of the backend currently attached to this undo log
+          for writing.</entry>

or NULL/0/something if none?

+   each undo log that exists.  Undo logs are extents within a contiguous
+   addressing space that have their own head and tail pointers.

This sentence seems to me to have so little detail that it's not going to help anyone, and it also seems somewhat out-of-place here. I think it would be better to link to the longer explanation in the new storage section instead.

+   Each backend that has written undo data is associated with one or more undo

extra space

+<para>
+Undo logs hold data that is used for rolling back and for implementing
+MVCC in access managers that are undo-aware (currently "zheap"). The storage
+format of undo logs is optimized for reusing existing files.
+</para>

I think the mention of zheap should be removed here since the hope is that the undo stuff can be committed independently of and prior to zheap. I think you mean access methods, not access managers. I suggest making that an xref. Maybe add a little more detail, e.g.

Undo logs provide a place for access methods to store data that can be used to perform necessary cleanup operations after a transaction abort. The data will be retained after a transaction abort until the access method successfully performs the required cleanup operations. After a transaction commit, undo data will be retained until the transaction is all-visible. This makes it possible for access methods to use undo data to implement MVCC. Since in most cases undo data is discarded very quickly, the undo system has been optimized to minimize writes to disk and to reuse existing files efficiently.

+<para>
+Undo data exists in a 64 bit address space broken up into numbered undo logs
+that represent 1TB extents, for efficient management. The space is further
+broken up into 1MB segment files, for physical storage. The name of each file
+is the address of of the first byte in the file, with a period inserted after
+the part that indicates the undo log number.
+</para>

I cannot read this section and know what an undo filename is going to look like. Also, the remarks about efficient management seem like they might be unclear to someone not already familiar with how this works. Maybe something like:

Undo data exists in a 64-bit address space divided into 2^34 undo logs, each with a theoretical capacity of 1TB. The first time a backend writes undo, it attaches to an existing undo log whose capacity is not yet exhausted and which is not currently being used by any other backend; or if no suitable undo log already exists, it creates a new one. To avoid wasting space, each undo log is further divided into 1MB segment files, so that segments which are no longer needed can be removed (possibly recycling the underlying file by renaming it) and segments which are not yet needed do not need to be physically created on disk. An undo segment file has a name like <example>, where <thing> is the undo log number and <thang> is the segment number.

I think it's good to spell out the part about attaching to undo logs here, because when people look at pg_undo, the number of files will be roughly proportional to the number of backends, and we should try to help them understand - at least in general terms - why that happens.
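As an aside, the naming scheme that wording describes can be sketched in a few lines of C. This is purely illustrative: the 40-bit offset split is inferred from the 1TB extents mentioned above, UndoRecPtrGetLogNo appears elsewhere in the patches, and everything else here (including the digit widths in the filename) is made up.

typedef uint64 UndoRecPtr;

#define UNDO_OFFSET_BITS 40              /* inferred from the 1TB extents */
#define UNDO_SEGMENT_SIZE (1024 * 1024)  /* 1MB segment files */

#define UndoRecPtrGetLogNo(urp) ((urp) >> UNDO_OFFSET_BITS)
#define UndoRecPtrGetOffset(urp) \
    ((urp) & ((UINT64CONST(1) << UNDO_OFFSET_BITS) - 1))

/* Hypothetical: build the name of the segment file containing 'urp'. */
static void
undo_segment_filename(UndoRecPtr urp, char *path)
{
    uint64  segstart = UndoRecPtrGetOffset(urp) -
        UndoRecPtrGetOffset(urp) % UNDO_SEGMENT_SIZE;

    /* The address of the first byte, with a period after the log number. */
    snprintf(path, MAXPGPATH, "%06X.%010llX",
             (unsigned int) UndoRecPtrGetLogNo(urp),
             (unsigned long long) segstart);
}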
+<para>
+Just as relations can have one of the three persistence levels permanent,
+unlogged or temporary, the undo data that is generated by modifying them must
+be stored in an undo log of the same persistence level.  This enables the
+undo data to be discarded at appropriate times along with the relations that
+reference it.
+</para>

This is not quite general, because we're not necessarily talking about modifications to the files. In fact, in this POC, we're explicitly talking about the cleanup of the files themselves. Also, it's not technically correct to say that the persistence level has to match. You could put everything in permanent undo logs. It would just suck.

Moving on to 0003, the developer documentation:

+The undo log subsystem provides a way to store data that is needed for
+a limited time.  Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.  See src/backend/access/zheap/README for
+more information on zheap.  The undo log subsystem is concerned with

Again, I think this should be rewritten to make it independent of zheap. We hope that this facility is not only usable by but will actually be used by other AMs.

+their location within a 64 bit address space.  Unlike redo data, the
+addressing space is internally divided up unto multiple numbered logs.

Except it's not totally unlike; cf. the log and seg arguments to XLogFileNameById. The xlog division is largely a historical accident of having to support systems with 32-bit arithmetic and has minimal consequences in practice, and it's a lot less noticeable now than it used to be, but it does still kinda exist. I would try to sharpen this wording a bit to de-emphasize the contrast over whether a log/seg distinction exists and instead just contrast multiple insertion points vs. a single one.

+level code (zheap) is largely oblivious to this internal structure and

Another zheap reference.

+eviction provoked by memory pressure, then no disk IO is generated.

I/O?

+Keeping the undo data physically separate from redo data and accessing
+it though the existing shared buffers mechanism allows it to be
+accessed efficiently for MVCC purposes.

And also non-MVCC purposes. I mean, it's not very feasible to do post-abort cleanup driven solely off the WAL, because the WAL segments might've been archived or recycled and there's no easy way to access the bits we want. Saying this is for MVCC purposes specifically seems misleading.

+shared memory and can be inspected in the pg_stat_undo_logs view.  For

Replace "in" with "via" or "through" or something?

+shared memory and can be inspected in the pg_stat_undo_logs view.  For
+each undo log, a set of properties called the undo log's meta-data are
+tracked:

"called the undo log's meta-data" seems a bit awkward.

+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point

why ; for the first and : for the others?

+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty

I think you should either remove "discard, insert and end" from this sentence, relying on people to remember the list they just read, or else punctuate it like this: The three pointers -- discard, insert, and end -- move...

+logs are held in a fixed-sized pool in shared memory.  The size of
+the array is a multiple of max_connections, and limits the total size of
+transactions.

I think you should elaborate on "limits the total size of transactions."

+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the

Even unlogged and temporary undo logs?

+level of the relation being modified and the current value of the GUC

Suggest: the corresponding relation

+suitable undo log must be either found or created.  The system should
+stabilize on one undo log per active writing backend (or more if
+different tablespaces are persistence levels are used).

Won't edge effects drive the number up considerably?

+and they cannot be accessed by other backend including undo workers.

Grammar. Also, begs the question "so how does this work if the undo workers are frozen out?"

+Responsibility for WAL-logging the contents of the undo log lies with
+client code (ie zheap).  While undolog.c WAL-logs all meta-data

Another zheap reference.

+hard coded to use md.c unconditionally, PostgreSQL 12 routes IO for the undo

Suggest I/O rather than IO.

I'll see if I can find time to actually review some of this code at some point.

Regarding 0006, I can't help but notice that it is completely devoid of documentation and README updates, which will not do.

Regarding 0007, that's an impressively small patch.

...Robert
Hello Thomas,

On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> It passes make check on Unix and Windows, though currently it's
> failing some of the TAP tests for reasons I'm looking into (possibly
> due to bugs in the lower level patches, not sure).
>
I looked into the regression failures when the tap-tests are enabled. It seems that we're not estimating and allocating the shared memory for rollback-hash tables correctly. I've added a patch to fix the same.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 5, 2018 at 5:13 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> Hello Thomas,
>
> On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> >
> > It passes make check on Unix and Windows, though currently it's
> > failing some of the TAP tests for reasons I'm looking into (possibly
> > due to bugs in the lower level patches, not sure).
> >
> I looked into the regression failures when the tap-tests are enabled.
> It seems that we're not estimating and allocating the shared memory
> for rollback-hash tables correctly. I've added a patch to fix the
> same.
>
I have included your fix in the latest version of the undo-worker patch[1]

[1] https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > It passes make check on Unix and Windows, though currently it's
> > failing some of the TAP tests for reasons I'm looking into (possibly
> > due to bugs in the lower level patches, not sure).
> >
> I looked into the regression failures when the tap-tests are enabled.
> It seems that we're not estimating and allocating the shared memory
> for rollback-hash tables correctly. I've added a patch to fix the
> same.

Thanks Kuntal.

--
Thomas Munro
http://www.enterprisedb.com
> On Thu, Nov 8, 2018 at 4:03 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> > > It passes make check on Unix and Windows, though currently it's
> > > failing some of the TAP tests for reasons I'm looking into (possibly
> > > due to bugs in the lower level patches, not sure).
> > >
> > I looked into the regression failures when the tap-tests are enabled.
> > It seems that we're not estimating and allocating the shared memory
> > for rollback-hash tables correctly. I've added a patch to fix the
> > same.
>
> Thanks Kuntal.

Thanks for the patch,

Unfortunately, cfbot complains about these patches and can't apply them for some reason, so I did this manually to check it out. All of them (including the fix from Kuntal) were applied without conflicts, but compilation stopped here

undoinsert.c: In function ‘UndoRecordAllocateMulti’:
undoinsert.c:547:18: error: ‘urec’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  urec->uur_info = 0; /* force recomputation of info bits */
  ~~~~~~~~~~~~~~~^~~

Could you please post a fixed version of the patch?
On Sat, Dec 1, 2018 at 5:12 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Thu, Nov 8, 2018 at 4:03 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > On Tue, Nov 6, 2018 at 12:42 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > > On Thu, Nov 1, 2018 at 8:53 AM Thomas Munro
> > > <thomas.munro@enterprisedb.com> wrote:
> > > > It passes make check on Unix and Windows, though currently it's
> > > > failing some of the TAP tests for reasons I'm looking into (possibly
> > > > due to bugs in the lower level patches, not sure).
> > > >
> > > I looked into the regression failures when the tap-tests are enabled.
> > > It seems that we're not estimating and allocating the shared memory
> > > for rollback-hash tables correctly. I've added a patch to fix the
> > > same.
> >
> > Thanks Kuntal.
>
> Thanks for the patch,
>
> Unfortunately, cfbot complains about these patches and can't apply them for
> some reason, so I did this manually to check it out. All of them (including the
> fix from Kuntal) were applied without conflicts, but compilation stopped here
>
> undoinsert.c: In function ‘UndoRecordAllocateMulti’:
> undoinsert.c:547:18: error: ‘urec’ may be used uninitialized in this
> function [-Werror=maybe-uninitialized]
>   urec->uur_info = 0; /* force recomputation of info bits */
>   ~~~~~~~~~~~~~~~^~~
>
> Could you please post a fixed version of the patch?

Sorry for my silence... I got stuck on a design problem with the lower level undo log management code that I'm now close to having figured out. I'll have a new patch soon.

--
Thomas Munro
http://www.enterprisedb.com
Hi,

On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> On Sat, Dec 1, 2018 at 5:12 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > [...]
> > Could you please post a fixed version of the patch?
>
> Sorry for my silence... I got stuck on a design problem with the lower
> level undo log management code that I'm now close to having figured
> out. I'll have a new patch soon.

Given this patch has been in waiting for author for ~two months, I'm unfortunately going to have to mark it as returned with feedback. Please resubmit once refreshed.

Greetings,

Andres Freund
On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> > Sorry for my silence... I got stuck on a design problem with the lower
> > level undo log management code that I'm now close to having figured
> > out. I'll have a new patch soon.

Hello all,

Here's a new WIP version of this patch set. It builds on a fairly deep stack of patches being developed by several people. As mentioned before, it's a useful crash-test dummy for a whole stack of technology we're working on, but it's also aiming to solve a real problem.

It currently fails in one regression test for a well understood reason, fix on the way (see end), and there are some other stability problems being worked on.

Here's a quick tour of the observable behaviour, having installed the pg_buffercache and test_undorecord extensions:

==================

postgres=# begin;
BEGIN
postgres=# create table foo ();
CREATE TABLE

Check if our transaction has generated undo data:

postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs;
 logno |     discard      |      insert      | xid |  pid
-------+------------------+------------------+-----+-------
     0 | 0000000000002CD9 | 0000000000002D1A | 476 | 39169
(1 row)

Here, we see that undo log number 0 has some undo data because discard < insert. We can find out what it says:

postgres=# call dump_undo_records(0);
NOTICE:  0000000000002CD9: Storage: CREATE dbid=12916, tsid=1663, relfile=16386; xid=476, next xact=0
CALL

The undo record shown there lives in shared buffers, and we can see that it's in there with pg_buffercache (the new column smgrid 1 means undo data; 0 is regular relation data):

postgres=# select bufferid, smgrid, relfilenode, relblocknumber, isdirty, usagecount from pg_buffercache where smgrid = 1;
 bufferid | smgrid | relfilenode | relblocknumber | isdirty | usagecount
----------+--------+-------------+----------------+---------+------------
        3 |      1 |           0 |              1 | t       |          5
(1 row)

Even though that's just a dirty page in shared buffers, if we crash now and recover, it'll be recreated by a new WAL record that was flushed *before* creating the relation file. We can see that with pg_waldump:

rmgr: Storage ... PRECREATE base/12916/16384, blkref #0: smgr 1 rel 1663/0/0 blk 1 FPW
rmgr: Storage ... CREATE base/12916/16384

The PRECREATE record dirtied block 1 of undo log 0. In this case it happened to include a FPW of the undo log page too, following the usual rules. FPWs are rare for undo pages because of the REGBUF_WILL_INIT optimisation that applies to the zeroed out pages (which is most undo pages, due to the append-mostly access pattern).

Finally, if we commit we see the undo data is discarded by a background worker, and if we roll back explicitly or crash and run recovery, the file is unlinked. Here's an example of the crash case:

postgres=# begin;
BEGIN
postgres=# create table foo ();
CREATE TABLE
postgres=# select relfilenode from pg_class where relname = 'foo';
 relfilenode
-------------
       16395
(1 row)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          39169
(1 row)

$ kill -9 39169

... server restarts, recovers ...

$ ls pgdata/base/12916/16395
pgdata/base/12916/16395

It's still there, though it's been truncated by an undo worker (see end of email). And finally, after the next checkpoint:

$ ls pgdata/base/12916/16395
ls: pgdata/base/12916/16395: No such file or directory

That's the end of the quick tour.
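To restate the ordering rule the tour demonstrates, here is a pseudo-C sketch. XLogBeginInsert/XLogInsert/XLogFlush/smgrcreate are the usual routines; the PRECREATE info flag and the undo calls named in the comments are assumptions based on the description above, not the patch's actual code.

/*
 * Pseudo-C sketch of the invariant: the WAL that recreates the undo
 * record must be durable before the relation file can exist on disk.
 */
static void
smgr_create_with_undo(SMgrRelation srel)
{
    XLogRecPtr  lsn;

    /* 1. Prepare an undo record describing the file we are about to
     *    create (PrepareUndoInsert() in the interface patches). */

    START_CRIT_SECTION();

    /* 2. Copy the prepared record into shared undo buffers
     *    (InsertPreparedUndo() in the interface patches). */

    /* 3. WAL-log the undo insertion; this is the PRECREATE record seen
     *    in the pg_waldump output above (info flag assumed). */
    XLogBeginInsert();
    /* ... XLogRegisterData()/XLogRegisterBuffer() calls elided ... */
    lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_PRECREATE);

    END_CRIT_SECTION();

    /* 4. Make the undo record durable before the file can exist... */
    XLogFlush(lsn);

    /* 5. ...so no crash can leave a file on disk without an undo record
     *    that would unlink it after a rollback. */
    smgrcreate(srel, MAIN_FORKNUM, false);
}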
Most of these patches should probably be discussed in other threads, but I'm posting a snapshot of the full stack here anyway. Here's a patch-by-patch summary:

=== 0001 "Refactor the fsync mechanism to support future SMGR implementations." ===

The 0001 patch has its own CF thread https://commitfest.postgresql.org/22/1829/ and is from Shawn Debnath (based on earlier work by me), but I'm including a copy here for convenience/cfbot.

=== 0002 "Add SmgrId to smgropen() and BufferTag." ===

This is new, and is based on the discussion from another recent thread[1] about how we should identify buffers belonging to different storage managers. In earlier versions of the patch-set I had used a special reserved DB OID for undo data. Tom Lane didn't like that idea much, and Anton Shyrabokau (via Shawn Debnath) suggested making ForkNumber narrower so we can add a new field to BufferTag, and Andres Freund +1'd my proposal to add the extra value as a parameter to smgropen(). So, here is a patch that tries those ideas.

Another way to do this would be to widen RelFileNode instead, to avoid having to pass around the SMGR ID separately in various places. Looking at the number of places that have to change, you can probably see why we wanted to use a magic DB OID instead, and I'm not entirely convinced that it wasn't better that way, or that I've found all the places that need to carry an smgrid alongside a RelFileNode.

Archeological note: smgropen() was like that ~15 years ago before commit 87bd9563, but buffer tags didn't include the SMGR ID.

I decided to call md.c's ID "SMGR_RELATION", describing what it really holds -- regular relations -- rather than perpetuating the doubly anachronistic "magnetic disk" name.

While here, I resurrected the ancient notion of a per-SMGR 'open' routine, so that a small amount of md.c-specific stuff could be kicked out of smgr.c and future implementations can do their own thing here too.

While doing that work I realised that at least pg_rewind needs to learn about how different storage managers map blocks to files, so that's a new TODO item requiring more thought. I wonder what other places know how to map { RelFileNode, ForkNumber, BlockNumber } to a path + offset, and I wonder what to think about the fact that some of them may be non-backend code...

=== 0003 "Add undo log manager." ===

This and the next couple of patches live in CF thread https://commitfest.postgresql.org/22/1649/ but here's a much newer snapshot that hasn't been posted there yet.

Manages a set of undo logs in shared memory, manages undo segment files, tracks discard, insert, end pointers visible in pg_stat_undo_logs. With this patch you can allocate and discard space in undo logs using the UndoRecPtr type to refer to addresses, but there is no access to the data yet. Improvements since the last version are not requiring DSM segments, proper FPW support and reduced WAL traffic. Previously there were extra per-xact and per-checkpoint records requiring retry-loops in code that inserted undo data.

=== 0004 "Provide access to undo log data via the buffer manager." ===

Provide SMGR_UNDO. While the 0003 patch deals with allocating and discarding undo address space and makes sure that backing files exist, this patch lets you read and write buffered data in them.

=== 0005 "Allow WAL record data on first modification after a checkpoint." ===

Provide a way for data to be attached to a WAL-registered block that is only included if this turns out to be the first WAL record that touches the block after a checkpoint.
This is a bit like FPW images, except that it's arbitrary extra data and happens even if FPW is off. This is used to capture a copy of the (tiny) undo log meta-data (primarily the insertion pointer) to fix a consistency problem when recovering from an online checkpoint.

=== 0006 + 0007 "Provide interfaces to store and fetch undo records." ===

This is a snapshot of work by my colleagues Dilip, Rafia and others based on earlier prototyping by Robert. While the earlier patches give you buffered binary undo data, this patch introduces the concept of high level undo records that can be inserted, and read back given an UndoRecPtr. This is a version presented on another thread already; here it's lightly changed due to rebasing by me.

Undo-aware modules should design a set of undo record types, and insert exactly the same ones at do and undo time.

The 0007 patch is fixups from me to bring that code into line with changes to the lower level patches. Future versions will be squashed and tidied up; still working on that.

=== 0008 + 0009 "Undo worker and transaction rollback" ===

This has a CF thread at https://commitfest.postgresql.org/22/1828/ and again this is a snapshot of work from Dilip, Rafia and others, with a fixup from me. Still working on coordinating that for the next version.

This provides a way for RMGR modules to register a callback function that will receive all the undo records they inserted during a given [sub]transaction if it rolls back. It also provides a system of background workers that can execute those undo records in case the rollback happens after crash recovery, or in case the work can be usefully pushed into the background during a regular online rollback. This is a complex topic and I'm not attempting to explain it here. There are a few known problems with this and Dilip is working on a more sophisticated worker management system, but I'll let him write about that, over in that other thread. I think it'd probably be a good idea to split this patch into two or three; the RMGR undo support, the xact.c integration and the worker machinery. But maybe that's just me.

Archeological note: XXXX_undo() callback functions registered via rmgrlist.h a bit like this originally appeared in the work by Vadim Mikheev (author of WAL) in commit b58c0411bad4, but that was apparently never completed once people figured out that you can make a force, steal, redo, no-undo database work (curiously I saw a slide from a university lecture somewhere saying that would be impossible). The stub functions were removed from the tree in 4c8495a1. Our new work differs from Vadim's original vision by putting undo data in a separate place from the WAL, and accessing it via shared buffers. I guess that might be because Vadim planned to use undo for rollback only, not for MVCC (but I might be wrong about that). That difference might explain why eg Vadim's function heap_undo() took an XLogRecord, whereas our proposal takes a different type. Our proposal also passes more than one record at a time to the undo handler; in future this will allow us to collect up all undo records relating to a page of (eg) zheap, and process them together for mechanical sympathy.

=== 0010 "Add developer documentation for the undo log storage subsystem." ===

Updated based on Robert's review up-thread. No coverage of background workers yet -- that is under development.

=== 0011 "Add user-facing documentation for undo logs." ===

Updated based on Robert's review up-thread.

=== 0012 "Add test_undorecord test module." ===
Provides quick and dirty dump_undo_records() procedure for testing.

=== 0013 "Use undo-based rollback to clean up files on abort." ===

Finally, this is the actual feature that this CF item is about.

The main improvement here is that the previous version unlinked files immediately when executing undo actions, which broke the protocol established by commit 6cc4451b, namely that you can't reuse a relfilenode until after the next checkpoint, and the existence of an (empty) first relation segment in the filesystem is the only thing preventing that. That is fixed in this version (but see problem 2 below).

Known problems:

1. A couple of tests fail with "ERROR: buffer is pinned in InvalidateBuffer". That's because ROLLBACK TO SAVEPOINT is executing the undo actions that drop the buffers for a newly created table before the subtransaction has been cleaned up. Amit is working on a solution to that. More soon.

2. There are two levels of deferment of file unlinking in current PostgreSQL. First, when you create a new relation, it is pushed on pendingDeletes; this patch-set replaces that in-memory list with persistent undo records as discussed. There is a second level of deferment: we unlink all the segments of the file except the first one, which we truncate, and then finally the zero-length file is unlinked after the next checkpoint; this is an important part of PostgreSQL's protocol for not reusing relfilenodes too soon. That means that there is still a very narrow window after the checkpoint is logged but before we've unlinked that file where you could still crash and leak a zero-length file. I've thought about a couple of solutions to close that window, including a file renaming scheme where .zombie files get cleaned up on crash, but that seemed like something that could be improved later.

There is something else that goes wrong under parallel make check, which I must have introduced recently but haven't tracked down yet. I wanted to post a snapshot version for discussion anyway. More soon.

This code is available at https://github.com/EnterpriseDB/zheap/tree/undo.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BDE0mmiBZMtZyvwWtgv1sZCniSVhXYsXkvJ_Wo%2B83vvw%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
On Tue, Mar 12, 2019 at 6:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> > > Sorry for my silence... I got stuck on a design problem with the lower
> > > level undo log management code that I'm now close to having figured
> > > out. I'll have a new patch soon.
>
> [... quick tour and patch-by-patch summary trimmed ...]
>
> === 0006 + 0007 "Provide interfaces to store and fetch undo records." ===
>
> This is a snapshot of work by my colleagues Dilip, Rafia and others
> based on earlier prototyping by Robert. While the earlier patches
> give you buffered binary undo data, this patch introduces the concept
> of high level undo records that can be inserted, and read back given
> an UndoRecPtr. This is a version presented on another thread already;
> here it's lightly changed due to rebasing by me.
>
> Undo-aware modules should design a set of undo record types, and
> insert exactly the same ones at do and undo time.
>
> The 0007 patch is fixups from me to bring that code into line with
> changes to the lower level patches. Future versions will be squashed
> and tidied up; still working on that.

Currently, undo branch[1] contains an older version of the (undo interface + some fixup). Now, I have merged the latest changes from the zheap branch[2] to the undo branch[1] which can be applied on top of the undo storage commit[3]. For merging those changes, I need to add some changes to the undo log storage patch as well for handling the multi log transaction. So I have attached two patches, 1) improvement to undo log storage 2) complete undo interface patch which includes 0006+0007 from undo branch[1] + new changes on the zheap branch.

[1] https://github.com/EnterpriseDB/zheap/tree/undo
[2] https://github.com/EnterpriseDB/zheap
[3] https://github.com/EnterpriseDB/zheap/tree/undo (b397d96176879ed5b09cf7322b8d6f2edd8043a5)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Apr 19, 2019 at 6:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Currently, undo branch[1] contains an older version of the (undo
> interface + some fixup). Now, I have merged the latest changes from
> the zheap branch[2] to the undo branch[1]
> which can be applied on top of the undo storage commit[3]. For
> merging those changes, I need to add some changes to the undo log
> storage patch as well for handling the multi log transaction. So I
> have attached two patches, 1) improvement to undo log storage 2)
> complete undo interface patch which includes 0006+0007 from undo
> branch[1] + new changes on the zheap branch.

Some review comments:

+#define AtAbort_ResetUndoBuffers() ResetUndoBuffers()

I don't think this really belongs in xact.c. Seems like we ought to declare it in the appropriate header file. Perhaps we also ought to consider using a static inline function rather than a macro, although I guess it doesn't really matter.

+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+       UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+       UndoPersistence upersistence = log->meta.persistence;

Could we arrange things so that the caller passes the persistence level, instead of having to recalculate it here? This doesn't seem to be a particularly cheap operation. UndoLogGet() will call get_undo_log() which I guess will normally just do a hash table lookup but which in the worst case will take an LWLock and perform a linear search of a shared memory array. But at PrepareUndoInsert time we knew the persistence level already, so I don't see why InsertPreparedUndo should have to force it to be looked up all over again -- and while in a critical section, no less.

Another thing that is strange about this patch is that it adds these start_urec_ptr and latest_urec_ptr arrays and then uses them for absolutely nothing. I think that's a pretty clear sign that the division of this work into multiple patches is not correct. We shouldn't have one patch that tracks some information that is used nowhere for anything and then another patch that adds a user of that information -- the two should go together.

Incidentally, wouldn't StartTransaction() need to reinitialize these fields?

+ * When the undorecord for a transaction gets inserted in the next log then we

undo record

+ * insert a transaction header for the first record in the new log and update
+ * the transaction header with this new logs location. We will also keep

This appears to be nonsensical. You're saying that you add a transaction header to the new log and then update it with its own location. That can't be what you mean.

+ * Incase, this is not the first record in new log (aka new log already

"Incase," -> "If"
"aka" -> "i.e."

Overall this whole paragraph is a bit hard to understand.

+ * same transaction spans across multiple logs depending on which log is

delete "across"

+ * processed first by the discard worker. If it processes the first log which
+ * contains the transactions first record, then it can get the last record
+ * of that transaction even if it is in different log and then processes all
+ * the undo records from last to first. OTOH, if the next log get processed

Not sure how that would work if the number of logs is >2. This whole paragraph is also hard to understand.
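For what it's worth, the SetCurrentUndoLocation suggestion above amounts to a hypothetical signature change along these lines (a sketch, not code from any of the patches):

void
SetCurrentUndoLocation(UndoRecPtr urec_ptr, UndoPersistence upersistence)
{
    /*
     * The caller already knew the persistence level at PrepareUndoInsert
     * time, so no UndoLogGet() lookup is needed here, inside the
     * critical section.
     */
    /* ... remember urec_ptr as the start/latest location as before ... */
}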
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int buffer_idx;

This is a file-global variable with a very generic name that is very similar to a local variable name used by multiple functions within the file (bufidx) and no comments. Ugh.

+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)

The locking regime for this function is really confusing. It requires that the caller hold discard_lock on entry, and on exit the lock will still be held if the return value is true but will no longer be held if the return value is false. Yikes! Anybody who reads code that uses this function is not going to guess that it has such strange behavior. I'm not exactly sure how to redesign this, but I think it's not OK the way you have it. One option might be to inline the logic into each call site.

+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer. This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo.  This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)

It bugs me that this function goes back in to reacquire the discard lock for the purpose of preventing a concurrent undo discard. Apparently, if the other transaction's undo has been discarded between the prepare phase and where we are now, we're OK with that and just exit without doing anything; otherwise, we update the previous transaction header. But that seems wrong. When we enter a critical section, I think we should aim to know exactly what modifications we are going to make within that critical section. I also wonder how the concurrent discard could really happen. We must surely be holding exclusive locks on the relevant buffers -- can undo discard really discard undo when the relevant buffers are x-locked?

It seems to me that remaining_bytes is a crock that should be ripped out entirely, both here and in InsertUndoRecord. It seems to me that UndoRecordUpdateTransInfo can just contrive to set remaining_bytes correctly. e.g.

do
{
    // stuff
    if (!BufferIsValid(buffer))
    {
        Assert(InRecovery);
        already_written += (BLCKSZ - starting_byte);
        done = (already_written >= undo_len);
    }
    else
    {
        page = BufferGetPage(buffer);
        done = InsertUndoRecord(...);
        MarkBufferDirty(buffer);
    }
} while (!done);

InsertPreparedUndo needs similar treatment. To make this work, I guess the long string of assignments in InsertUndoRecord will need to be made unconditional, but that's probably pretty cheap. As a fringe benefit, all of those work_blah variables that are currently referencing file-level globals can be replaced with something local to this function, which will surely please the coding style police.

+ * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
+ * it refers to the top transaction id because undo log only stores mapping
+ * for the top most transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, FullTransactionId fxid,

xid vs fxid

+       urec->uur_xidepoch = EpochFromFullTransactionId(fxid);

We need to make some decisions about how we're going to handle 64-bit XIDs vs. 32-bit XIDs in undo. This doesn't look like a well-considered scheme. In general, PrepareUndoInsert expects the caller to have populated the UnpackedUndoRecord, but here, uur_xidepoch is getting overwritten with the high bits of the caller-specified XID.
The low-order bits aren't stored anywhere by this function, but the caller is presumably expected to have placed them inside urec->uur_xid. And it looks like the low-order bits (urec->uur_xid) get stored for every undo record, but the high-order bits (urec->xidepoch) only get stored when we emit a transaction header. This all seems very confusing. I would really like to see us replace uur_xid and uur_xidepoch with a FullTransactionId; now that we have that concept, it seems like bad practice to break what is really a FullTransactionId into two halves and store them separately.

However, it would be a bit unfortunate to store an additional 4 bytes of mostly-static data in every undo record. What if we went the other way? That is, remove urec_xid from UndoRecordHeader and from UnpackedUndoRecord. Make it the responsibility of whoever is looking up an undo record to know which transaction's undo they are searching. zheap, at least, generally does know this: if it's starting from a page, then it has the XID + epoch available from the transaction slot, and if it's starting from an undo record, you need to know the XID for which you are searching, I guess from uur_prevxid.

I also think that we need to remove uur_prevxid. That field does not seem to be properly a general-purpose part of the undo machinery, but a zheap-specific consideration. I think its job is to tell you which transaction last modified the current tuple, but zheap can put that data in the payload if it likes. It is a waste to store it in every undo record, because it's not needed if the older undo has already been discarded or if the operation is an insert.

+ * Insert a previously-prepared undo record.  This will write the actual undo

Looks like this now inserts all previously-prepared undo records (rather than just a single one).

+ * in page.  We start writting immediately after the block header.

Spelling.

+ * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
+ * by urp and unpack the record into urec.  This function will not release the
+ * pin on the buffer if complete record is fetched from one buffer, so caller
+ * can reuse the same urec to fetch the another undo record which is on the
+ * same block.  Caller will be responsible to release the buffer inside urec
+ * and set it to invalid if it wishes to fetch the record from another block.
+ */
+UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+                 UndoPersistence persistence)

I don't really understand why uur_buffer is part of an UnpackedUndoRecord. It doesn't look like the other fields, which tell you about the contents of an undo record that you have created or that you have parsed. Instead, it's some kind of scratch space for keeping track of a buffer that we're using in the process of reading an undo record. It looks like it should be an argument to UndoGetOneRecord() and ResetUndoRecord(). I also wonder whether it's really a good design to make the caller responsible for invalidating the buffer before accessing another block. Maybe it would be simpler to have this function just check whether the buffer it's been given is the right one; if not, unpin it and pin the new one instead. But I'm not really sure...
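In other words, something like this hypothetical revision of the signature quoted above (only the original four parameters come from the patch; the extra argument is the suggestion, not existing code):

/*
 * Hypothetical revision: the scratch buffer travels as an explicit
 * in/out argument instead of hiding in uur_buffer, and the function
 * itself swaps the pin when the target block changes.
 */
static UnpackedUndoRecord *
UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
                 UndoPersistence persistence, Buffer *curbuf);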
+ /* If we already have a buffer pin then no need to allocate a new one. */
+ if (!BufferIsValid(buffer))
+ {
+     buffer = ReadBufferWithoutRelcache(SMGR_UNDO,
+                                        rnode, UndoLogForkNum, cur_blk,
+                                        RBM_NORMAL, NULL,
+                                        RelPersistenceForUndoPersistence(persistence));
+
+     urec->uur_buffer = buffer;
+ }

I think you should move this code inside the loop that follows. Then at the bottom of that loop, instead of making a similar call, just set buffer = InvalidBuffer. Then when you loop around it'll do the right thing and you'll need less code.

Notice that having both the local variable buffer and the structure member urec->uur_buffer is actually making this code more complex. You are already setting urec->uur_buffer = InvalidBuffer when you do UnlockReleaseBuffer(). If you didn't have a separate 'buffer' variable you wouldn't need to clear them both here. In fact I think what you should have is an argument Buffer *curbuf, or something like that, and no uur_buffer at all.

+ /*
+  * If we have copied the data then release the buffer, otherwise, just
+  * unlock it.
+  */
+ if (is_undo_rec_split)
+     UnlockReleaseBuffer(buffer);
+ else
+     LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

Ugh. I think what's going on here is: UnpackUndoRecord copies the data if it switches buffers, but not otherwise. So if the record is split, we can release the lock and pin, but otherwise we have to keep the pin to avoid having the data get clobbered. But having this code know about what UnpackUndoRecord does internally seems like an abstraction violation. It's also probably not right if we ever want to fetch undo records in bulk, as I see that the latest code in zheap master does. I think we should switch UnpackUndoRecord over to always copying the data and just avoid all this. (To make that cheaper, we may want to teach UnpackUndoRecord to store data into scratch space provided by the caller rather than using palloc to get its own space, but I'm not actually sure that's (a) worth it or (b) actually better.)

[ Skipping over some things ]

+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+                 int *already_decoded, bool header_only)

I think we should split this function into three functions that use a context object, call it say UnpackUndoContext. The caller will do:

BeginUnpackUndo(&ucontext);  // just once
UnpackUndoData(&ucontext, page, starting_byte);  // any number of times
FinishUnpackUndo(&uur, &ucontext);  // just once

The undo context will store an enum value that tells us the "stage" of decoding:

- UNDO_DECODE_STAGE_HEADER: We have not yet decoded even the record header; we need to do that next.
- UNDO_DECODE_STAGE_RELATION_DETAILS: The next thing to be decoded is the relation details, if present.
- UNDO_DECODE_STAGE_BLOCK: The next thing to be decoded is the block details, if present.
- UNDO_DECODE_STAGE_TRANSACTION: The next thing to be decoded is the transaction details, if present.
- UNDO_DECODE_STAGE_PAYLOAD: The next thing to be decoded is the payload details, if present.
- UNDO_DECODE_STAGE_DONE: Decoding is complete.

It will also store the number of bytes that have already been copied as part of whichever stage is current. A caller who wants only part of the record can stop when ucontext.stage > desired_stage; e.g. the current header_only flag corresponds to stopping when ucontext.stage > UNDO_DECODE_STAGE_HEADER, and the potential optimization mentioned in UndoGetOneRecord could be done by stopping when ucontext.stage > UNDO_DECODE_STAGE_BLOCK (although I don't know if that's worth doing).
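To visualise that, the proposed context object could look roughly like this. It's only a sketch of the suggestion above: the stage values are the ones just listed and UndoRecordHeader is mentioned elsewhere in this review, but the remaining field and type names are placeholders for the current work_blah globals.

typedef enum UndoDecodeStage
{
    UNDO_DECODE_STAGE_HEADER,
    UNDO_DECODE_STAGE_RELATION_DETAILS,
    UNDO_DECODE_STAGE_BLOCK,
    UNDO_DECODE_STAGE_TRANSACTION,
    UNDO_DECODE_STAGE_PAYLOAD,
    UNDO_DECODE_STAGE_DONE
} UndoDecodeStage;

typedef struct UnpackUndoContext
{
    UndoDecodeStage stage;          /* which part of the record comes next */
    int             partial_bytes;  /* bytes already copied in this stage */

    /* staging areas replacing the file-level work_blah globals */
    UndoRecordHeader work_hdr;
    /* ... one member per optional record part: relation details, block,
     * transaction and payload structs ... */
} UnpackUndoContext;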
In this scheme, BeginUnpackUndo just needs to set the stage to UNDO_DECODE_STAGE_HEADER and the number of bytes copied to 0. The context object contains all the work_blah things (no more global variables!), but BeginUnpackUndo does not need to initialize them, since they will be overwritten before they are examined. And FinishUnpackUndo just needs to copy all of the fields from the work_blah things into the UnpackedUndoRecord. The tricky part is UnpackUndoData itself, which I propose should look like a big switch where all branches fall through. Roughly:

switch (ucontext->stage)
{
    case UNDO_DECODE_STAGE_HEADER:
        if (!ReadUndoBytes(...))
            return;
        stage = UNDO_DECODE_STAGE_RELATION_DETAILS;
        /* FALLTHROUGH */
    case UNDO_DECODE_STAGE_RELATION_DETAILS:
        if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
        {
            if (!ReadUndoBytes(...))
                return;
        }
        stage = UNDO_DECODE_STAGE_BLOCK;
        /* FALLTHROUGH */

etc.

ReadUndoBytes would need some adjustments in this scheme; it wouldn't need my_bytes_read any more since it would only get called for structures that are not yet completely read. (Regardless of whether we adopt this idea, the nocopy flag to ReadUndoBytes appears to be unused and can be removed.)

We could use a similar context object for InsertUndoRecord. BeginInsertUndoRecord(&ucontext, &uur) would initialize all of the work_blah structures within the context object. InsertUndoData will be a big switch. Maybe no need for a "finish" function here. There can also be a SkipInsertingUndoData function that can be called instead of InsertUndoData if the page is discarded.

I think this would be more elegant than what we've got now.

This is not a complete review, but I'm out of time and energy for the moment...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 13, 2019 at 02:20:29AM +1300, Thomas Munro wrote:
> === 0002 "Add SmgrId to smgropen() and BufferTag." ===
>
> This is new, and is based on the discussion from another recent thread[1] about how we should identify buffers belonging to different storage managers. In earlier versions of the patch-set I had used a special reserved DB OID for undo data. Tom Lane didn't like that idea much, and Anton Shyrabokau (via Shawn Debnath) suggested making ForkNumber narrower so we can add a new field to BufferTag, and Andres Freund +1'd my proposal to add the extra value as a parameter to smgropen(). So, here is a patch that tries those ideas.
>
> Another way to do this would be to widen RelFileNode instead, to avoid having to pass around the SMGR ID separately in various places. Looking at the number of places that have to change, you can probably see why we wanted to use a magic DB OID instead, and I'm not entirely convinced that it wasn't better that way, or that I've found all the places that need to carry an smgrid alongside a RelFileNode.
>
> Archeological note: smgropen() was like that ~15 years ago before commit 87bd9563, but buffer tags didn't include the SMGR ID.
>
> I decided to call md.c's ID "SMGR_RELATION", describing what it really holds -- regular relations -- rather than perpetuating the doubly anachronistic "magnetic disk" name.
>
> While here, I resurrected the ancient notion of a per-SMGR 'open' routine, so that a small amount of md.c-specific stuff could be kicked out of smgr.c and future implementations can do their own thing here too.
>
> While doing that work I realised that at least pg_rewind needs to learn about how different storage managers map blocks to files, so that's a new TODO item requiring more thought. I wonder what other places know how to map { RelFileNode, ForkNumber, BlockNumber } to a path + offset, and I wonder what to think about the fact that some of them may be non-backend code...

Given the scope of this patch, it might be prudent to start a separate thread for it. So far, this discussion has been buried within other discussions and I want to ensure folks don't miss this. Thanks.

--
Shawn Debnath
Amazon Web Services (AWS)
On Fri, Apr 19, 2019 at 10:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Apr 19, 2019 at 6:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Currently, undo branch[1] contain an older version of the (undo interface + some fixup). Now, I have merged the latest changes from the zheap branch[2] to the undo branch[1] which can be applied on top of the undo storage commit[3]. For merging those changes, I need to add some changes to the undo log storage patch as well for handling the multi log transaction. So I have attached two patches, 1) improvement to undo log storage 2) complete undo interface patch which include 0006+0007 from undo branch[1] + new changes on the zheap branch.

Thanks for the review, Robert. Please find my reply inline.

> Some review comments:
>
> +#define AtAbort_ResetUndoBuffers() ResetUndoBuffers()
>
> I don't think this really belongs in xact.c. Seems like we ought to declare it in the appropriate header file. Perhaps we also ought to consider using a static inline function rather than a macro, although I guess it doesn't really matter.

Moved to undoinsert.h.

> +void
> +SetCurrentUndoLocation(UndoRecPtr urec_ptr)
> +{
> + UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
> + UndoPersistence upersistence = log->meta.persistence;

Right. This should not be part of this patch, so removed.

> + * When the undorecord for a transaction gets inserted in the next log then we
>
> undo record

Changed.

> + * insert a transaction header for the first record in the new log and update
> + * the transaction header with this new logs location. We will also keep
>
> This appears to be nonsensical. You're saying that you add a transaction header to the new log and then update it with its own location. That can't be what you mean.

Actually, what I meant is that we update the transaction's header which is in the old log. I have changed the comments.

> + * Incase, this is not the first record in new log (aka new log already
>
> "Incase," -> "If"
> "aka" -> "i.e."

Done.

> Overall this whole paragraph is a bit hard to understand.

I tried to improve it in the newer version.

> + * same transaction spans across multiple logs depending on which log is
>
> delete "across"

Fixed.

> + * processed first by the discard worker. If it processes the first log which
> + * contains the transactions first record, then it can get the last record
> + * of that transaction even if it is in different log and then processes all
> + * the undo records from last to first. OTOH, if the next log get processed
>
> Not sure how that would work if the number of logs is >2.
> This whole paragraph is also hard to understand.

Actually, what I meant is that a transaction may be spread over multiple logs, for example three logs (1, 2, 3). If the discard worker checks log 1 first, then for an aborted transaction it will follow the chain of undo headers, register a complete rollback request, and apply all the undo actions in logs 1, 2 and 3 together. Whereas if it encounters log 2 first, it will register a request for the undo actions in logs 2 and 3, and similarly if it encounters log 3 first, it will only process that log. We have done it this way so that we can maintain the order in which the undo is applied. However, we could instead always collect all the undo and apply it together, but for that we would need to add one more pointer to the transaction header (to the transaction's undo header in the previous log).
Maybe we can keep the next log pointer in a separate header instead of in the transaction header, so that it only occupies space on a log switch. I think this comment doesn't belong here anyway; it's more related to undo discard processing, so I have removed it.

> +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
> +static int buffer_idx;
>
> This is a file-global variable with a very generic name that is very similar to a local variable name used by multiple functions within the file (bufidx) and no comments. Ugh.

Comment added.

> +UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
>
> The locking regime for this function is really confusing. It requires that the caller hold discard_lock on entry, and on exit the lock will still be held if the return value is true but will no longer be held if the return value is false. Yikes! Anybody who reads code that uses this function is not going to guess that it has such strange behavior. I'm not exactly sure how to redesign this, but I think it's not OK the way you have it. One option might be to inline the logic into each call site.

I think the simple solution is that inside the UndoRecordIsValid function we can directly check UndoLogIsDiscarded if oldest_data is not yet initialized; I don't think we need to release the discard lock for that. So I have changed it like that.

> +/*
> + * Overwrite the first undo record of the previous transaction to update its
> + * next pointer. This will just insert the already prepared record by
> + * UndoRecordPrepareTransInfo. This must be called under the critical section.
> + * This will just overwrite the undo header not the data.
> + */
> +static void
> +UndoRecordUpdateTransInfo(int idx)
>
> It bugs me that this function goes back in to reacquire the discard lock for the purpose of preventing a concurrent undo discard. Apparently, if the other transaction's undo has been discarded between the prepare phase and where we are now, we're OK with that and just exit without doing anything; otherwise, we update the previous transaction header. But that seems wrong. When we enter a critical section, I think we should aim to know exactly what modifications we are going to make within that critical section.
>
> I also wonder how the concurrent discard could really happen. We must surely be holding exclusive locks on the relevant buffers -- can undo discard really discard undo when the relevant buffers are x-locked?
>
> It seems to me that remaining_bytes is a crock that should be ripped out entirely, both here and in InsertUndoRecord. It seems to me that UndoRecordUpdateTransInfo can just contrive to set remaining_bytes correctly. e.g.
>
> do
> {
>     // stuff
>     if (!BufferIsValid(buffer))
>     {
>         Assert(InRecovery);
>         already_written += (BLCKSZ - starting_byte);
>         done = (already_written >= undo_len);
>     }
>     else
>     {
>         page = BufferGetPage(buffer);
>         done = InsertUndoRecord(...);
>         MarkBufferDirty(buffer);
>     }
> } while (!done);
>
> InsertPreparedUndo needs similar treatment.
>
> To make this work, I guess the long string of assignments in InsertUndoRecord will need to be made unconditional, but that's probably pretty cheap. As a fringe benefit, all of those work_blah variables that are currently referencing file-level globals can be replaced with something local to this function, which will surely please the coding style police.

That got fixed as part of the fix for the last comment, where we introduced SkipInsertingUndoData and moved the globals into the context.
> + * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
> + * it refers to the top transaction id because undo log only stores mapping
> + * for the top most transactions.
> + */
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, FullTransactionId fxid,
>
> xid vs fxid
>
> + urec->uur_xidepoch = EpochFromFullTransactionId(fxid);
>
> We need to make some decisions about how we're going to handle 64-bit XIDs vs. 32-bit XIDs in undo. This doesn't look like a well-considered scheme. In general, PrepareUndoInsert expects the caller to have populated the UnpackedUndoRecord, but here, uur_xidepoch is getting overwritten with the high bits of the caller-specified XID. The low-order bits aren't stored anywhere by this function, but the caller is presumably expected to have placed them inside urec->uur_xid. And it looks like the low-order bits (urec->uur_xid) get stored for every undo record, but the high-order bits (urec->xidepoch) only get stored when we emit a transaction header. This all seems very confusing.

Yeah, it does seem a bit confusing. Actually, the discard worker processes the transaction chain from one transaction header to the next, so we need the epoch only for the first record of a transaction, and currently we set all the header information inside PrepareUndoInsert. The xid is stored by the caller, since the caller needs it for MVCC purposes. I think the caller can always set it, and if a transaction header gets added then it will be stored, otherwise not. So I think we can remove setting it here.

> I would really like to see us replace uur_xid and uur_xidepoch with a FullTransactionId; now that we have that concept, it seems like bad practice to break what is really a FullTransactionId into two halves and store them separately. However, it would be a bit unfortunate to store an additional 4 bytes of mostly-static data in every undo record. What if we went the other way? That is, remove urec_xid from UndoRecordHeader and from UnpackedUndoRecord. Make it the responsibility of whoever is looking up an undo record to know which transaction's undo they are searching. zheap, at least, generally does know this: if it's starting from a page, then it has the XID + epoch available from the transaction slot, and if it's starting from an undo record, you need to know the XID for which you are searching, I guess from uur_prevxid.

Right, from uur_prevxid we would know which xid's undo we are looking for, but without having uur_xid in the undo record itself, how would we know which undo record was inserted by the xid we are looking for? In zheap, while following the undo chain, if the slot got switched then there is a possibility (because of slot reuse) that we might get some other transaction's undo record for the same zheap tuple, but we want to keep traversing back to find the record inserted by uur_prevxid. So we need uur_xid as well, to tell us who inserted this undo record.

> I also think that we need to remove uur_prevxid. That field does not seem to be properly a general-purpose part of the undo machinery, but a zheap-specific consideration. I think its job is to tell you which transaction last modified the current tuple, but zheap can put that data in the payload if it likes. It is a waste to store it in every undo record, because it's not needed if the older undo has already been discarded or if the operation is an insert.

Done.

> + * Insert a previously-prepared undo record. This will write the actual undo
>
> Looks like this now inserts all previously-prepared undo records (rather than just a single one).

Fixed.

> + * in page. We start writting immediately after the block header.
>
> Spelling.

Done.

> + * Helper function for UndoFetchRecord. It will fetch the undo record pointed
> + * by urp and unpack the record into urec. This function will not release the
> + * pin on the buffer if complete record is fetched from one buffer, so caller
> + * can reuse the same urec to fetch the another undo record which is on the
> + * same block. Caller will be responsible to release the buffer inside urec
> + * and set it to invalid if it wishes to fetch the record from another block.
> + */
> +UnpackedUndoRecord *
> +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
> +                 UndoPersistence persistence)
>
> I don't really understand why uur_buffer is part of an UnpackedUndoRecord. It doesn't look like the other fields, which tell you about the contents of an undo record that you have created or that you have parsed. Instead, it's some kind of scratch space for keeping track of a buffer that we're using in the process of reading an undo record. It looks like it should be an argument to UndoGetOneRecord() and ResetUndoRecord().
>
> I also wonder whether it's really a good design to make the caller responsible for invalidating the buffer before accessing another block. Maybe it would be simpler to have this function just check whether the buffer it's been given is the right one; if not, unpin it and pin the new one instead. But I'm not really sure...

I am not sure which will be better here, but my thought was that since the caller has to release the last buffer anyway, why not make it responsible for keeping track of the first buffer of the undo record? The caller understands better that it needs to hold on to the first buffer of the undo record, because it hopes that the previous undo record in the chain might fall in the same buffer. Maybe we can make the caller completely responsible for reading the buffer for the first block of the undo record, so that it always passes a valid buffer, and UndoGetOneRecord only needs to read additional buffers if the undo record is split, releasing them right there. Then the caller would always keep track of the first buffer where the undo record starts, and whenever the undo record pointer changes, it would be responsible for changing the buffer.

> + /* If we already have a buffer pin then no need to allocate a new one. */
> + if (!BufferIsValid(buffer))
> + {
> +     buffer = ReadBufferWithoutRelcache(SMGR_UNDO,
> +                                        rnode, UndoLogForkNum, cur_blk,
> +                                        RBM_NORMAL, NULL,
> +                                        RelPersistenceForUndoPersistence(persistence));
> +
> +     urec->uur_buffer = buffer;
> + }
>
> I think you should move this code inside the loop that follows. Then at the bottom of that loop, instead of making a similar call, just set buffer = InvalidBuffer. Then when you loop around it'll do the right thing and you'll need less code.

Done.

> Notice that having both the local variable buffer and the structure member urec->uur_buffer is actually making this code more complex. You are already setting urec->uur_buffer = InvalidBuffer when you do UnlockReleaseBuffer(). If you didn't have a separate 'buffer' variable you wouldn't need to clear them both here. In fact I think what you should have is an argument Buffer *curbuf, or something like that, and no uur_buffer at all.
Done.

> + /*
> +  * If we have copied the data then release the buffer, otherwise, just
> +  * unlock it.
> +  */
> + if (is_undo_rec_split)
> +     UnlockReleaseBuffer(buffer);
> + else
> +     LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>
> Ugh. I think what's going on here is: UnpackUndoRecord copies the data if it switches buffers, but not otherwise. So if the record is split, we can release the lock and pin, but otherwise we have to keep the pin to avoid having the data get clobbered. But having this code know about what UnpackUndoRecord does internally seems like an abstraction violation. It's also probably not right if we ever want to fetch undo records in bulk, as I see that the latest code in zheap master does. I think we should switch UnpackUndoRecord over to always copying the data and just avoid all this.

Done.

> (To make that cheaper, we may want to teach UnpackUndoRecord to store data into scratch space provided by the caller rather than using palloc to get its own space, but I'm not actually sure that's (a) worth it or (b) actually better.)
>
> [ Skipping over some things ]
>
> +bool
> +UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
> +                 int *already_decoded, bool header_only)
>
> I think we should split this function into three functions that use a context object, call it say UnpackUndoContext. The caller will do:
>
> BeginUnpackUndo(&ucontext);  // just once
> UnpackUndoData(&ucontext, page, starting_byte);  // any number of times
> FinishUnpackUndo(&uur, &ucontext);  // just once
>
> The undo context will store an enum value that tells us the "stage" of decoding:
>
> - UNDO_DECODE_STAGE_HEADER: We have not yet decoded even the record header; we need to do that next.
> - UNDO_DECODE_STAGE_RELATION_DETAILS: The next thing to be decoded is the relation details, if present.
> - UNDO_DECODE_STAGE_BLOCK: The next thing to be decoded is the block details, if present.
> - UNDO_DECODE_STAGE_TRANSACTION: The next thing to be decoded is the transaction details, if present.
> - UNDO_DECODE_STAGE_PAYLOAD: The next thing to be decoded is the payload details, if present.
> - UNDO_DECODE_STAGE_DONE: Decoding is complete.
>
> It will also store the number of bytes that have already been copied as part of whichever stage is current. A caller who wants only part of the record can stop when ucontext.stage > desired_stage; e.g. the current header_only flag corresponds to stopping when ucontext.stage > UNDO_DECODE_STAGE_HEADER, and the potential optimization mentioned in UndoGetOneRecord could be done by stopping when ucontext.stage > UNDO_DECODE_STAGE_BLOCK (although I don't know if that's worth doing).
>
> In this scheme, BeginUnpackUndo just needs to set the stage to UNDO_DECODE_STAGE_HEADER and the number of bytes copied to 0. The context object contains all the work_blah things (no more global variables!), but BeginUnpackUndo does not need to initialize them, since they will be overwritten before they are examined. And FinishUnpackUndo just needs to copy all of the fields from the work_blah things into the UnpackedUndoRecord. The tricky part is UnpackUndoData itself, which I propose should look like a big switch where all branches fall through.
> Roughly:
>
> switch (ucontext->stage)
> {
>     case UNDO_DECODE_STAGE_HEADER:
>         if (!ReadUndoBytes(...))
>             return;
>         stage = UNDO_DECODE_STAGE_RELATION_DETAILS;
>         /* FALLTHROUGH */
>     case UNDO_DECODE_STAGE_RELATION_DETAILS:
>         if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
>         {
>             if (!ReadUndoBytes(...))
>                 return;
>         }
>         stage = UNDO_DECODE_STAGE_BLOCK;
>         /* FALLTHROUGH */
>     etc.
>
> ReadUndoBytes would need some adjustments in this scheme; it wouldn't need my_bytes_read any more since it would only get called for structures that are not yet completely read.

Yeah, so we can jump directly to the header that is not yet completely read, but if any header is partially read then we need to maintain some kind of partial-read variable; otherwise, from 'already read' we wouldn't be able to know how many bytes of the header were read in the last call, unless we either calculate that from uur_info or maintain a partial_read in the context, as I have done in the new version.

> (Regardless of whether we adopt this idea, the nocopy flag to ReadUndoBytes appears to be unused and can be removed.)

Yup.

> We could use a similar context object for InsertUndoRecord. BeginInsertUndoRecord(&ucontext, &uur) would initialize all of the work_blah structures within the context object. InsertUndoData will be a big switch. Maybe no need for a "finish" function here. There can also be a SkipInsertingUndoData function that can be called instead of InsertUndoData if the page is discarded. I think this would be more elegant than what we've got now.

Done. Notes:

- I think the ucontext->stage values are the same for insert and decode; can we just declare one enum and give it a generic name, e.g. UNDO_PROCESS_STAGE_HEADER?
- In SkipInsertingUndoData I also have to go through all the stages, so that if we find some valid block the stage is right for inserting the partial record. Do you think I could have avoided that?

Apart from these changes, I have also included UndoRecordBulkFetch in undoinsert.c. I have tested this patch with my local test modules, which basically insert, fetch and bulk-fetch multiple records and compare the contents. My test patch is still not in good shape, so I will post the test module later.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 25, 2019 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > +static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
> > +static int buffer_idx;
> >
> > This is a file-global variable with a very generic name that is very similar to a local variable name used by multiple functions within the file (bufidx) and no comments. Ugh.
>
> Comment added.

The variable name is still bad, and the comment isn't very helpful either. First, you can't tell by looking at the name that it has anything to do with the undo_buffers variable, because undo_buffers and buffer_idx are not obviously related names. Second, it's not an index; it's a count. A count tells you how many of something you have; an index tells you which one of those you are presently thinking about. Third, the undo_buffer array is itself poorly named, because it's not an array of all the undo buffers in the world or anything like that, but rather an array of undo buffers for some particular purpose. "static UndoBuffers *undo_buffer" is about as helpful as "int integer" and I hope I don't need to explain why that isn't usually a good thing to write. Maybe prepared_undo_buffer for the array and nprepared_undo_buffer for the count, or something like that.

> I think the simple solution is that inside the UndoRecordIsValid function we can directly check UndoLogIsDiscarded if oldest_data is not yet initialized; I don't think we need to release the discard lock for that. So I have changed it like that.

Can we instead eliminate the special case? It seems like the if (log->oldest_data == InvalidUndoRecPtr) case will be taken very rarely, so if it's buggy, we might not notice.

> Right, from uur_prevxid we would know which xid's undo we are looking for, but without having uur_xid in the undo record itself, how would we know which undo record was inserted by the xid we are looking for? In zheap, while following the undo chain, if the slot got switched then there is a possibility (because of slot reuse) that we might get some other transaction's undo record for the same zheap tuple, but we want to keep traversing back to find the record inserted by uur_prevxid. So we need uur_xid as well, to tell us who inserted this undo record.

It seems desirable to me to make this the caller's problem. When we are processing undo a transaction at a time, we'll know the XID because it will be available from the transaction header. If a system like zheap maintains a pointer to an undo record someplace in the middle of a transaction, it can also store the XID if it needs it.

The thing is, the zheap code almost works that way already. Transaction slots within a page store both the undo record pointer and the XID. The only case where zheap doesn't store the undo record pointer and the XID is when a slot switch occurs, but that could be changed.

If we moved urec_xid into UndoRecordTransaction, we'd save 4 bytes per undo record across the board. When zheap emits undo records for updates or deletes, they would need to store an UndoRecPtr (8 bytes) + FullTransactionId (8 bytes) in the payload unless the previous change to that TID is all-visible or the previous change to that TID was made by the same transaction. Also, zheap would no longer need to store the slot number in the payload in any case, because this would substitute for that (and permit more efficient lookups, to boot).
So the overall impact on zheap update and delete records would be somewhere between -4 bytes (when we save the space used by XID and incur no other cost) and +12 bytes (when we lose the XID but gain the UndoRecPtr + FullTransactionId).

That worst case could be further optimized. For example, instead of storing a FullTransactionId, zheap could store the difference between the XID to which the current record pertains (which in this model the caller is required to know) and the XID of whoever last modified the tuple. That difference certainly can't exceed 4 billion (or even 2 billion) so 4 bytes is enough. That reduces the worst-case difference to +8 bytes. Probably zheap could use payloads with some kind of variable-length encoding and squeeze out even more in common cases, but I'm not sure that's necessary or worth the complexity.

Let's also give uur_blkprev its own UREC_INFO_* constant and omit it when this is the first time we're touching this block in this transaction and thus the value is InvalidUndoRecPtr. In the pretty-common case where a transaction updates one tuple on the page and never comes back, this - together with the optimization in the previous paragraph - will cause zheap to come out even on undo, because it'll save 4 bytes by omitting urec_xid and 8 bytes by omitting uur_blkprev, and it'll lose 8 bytes storing an UndoRecPtr in the payload and 4 bytes storing an XID-difference.

Even with those changes, zheap's update and delete could still come out a little behind on undo volume if hitting many tuples on the same page, because for every tuple they hit after the first, we'll still need the UndoRecPtr for the previous change to that page (uur_blkprev) and we'll also have the UndoRecPtr extracted from the tuple's previous slot, stored in the payload. So we'll end up +8 bytes in this case. I think that's acceptable, because it often won't happen, it's hardly catastrophic if it does, and saving 4 bytes on every insert, and on every update or delete where the old undo is already discarded, is pretty sweet.

> Yeah, so we can jump directly to the header that is not yet completely read, but if any header is partially read then we need to maintain some kind of partial-read variable; otherwise, from 'already read' we wouldn't be able to know how many bytes of the header were read in the last call, unless we either calculate that from uur_info or maintain a partial_read in the context, as I have done in the new version.

Right, we need to know the bytes already read for the next header.

> Notes:
> - I think the ucontext->stage values are the same for insert and decode; can we just declare one enum and give it a generic name, e.g. UNDO_PROCESS_STAGE_HEADER?

I agree. Maybe UNDO_PACK_STAGE_WHATEVER or, more briefly, UNDO_PACK_WHATEVER.

> - In SkipInsertingUndoData I also have to go through all the stages, so that if we find some valid block the stage is right for inserting the partial record. Do you think I could have avoided that?

Hmm, I didn't foresee that, but no, I don't think you can avoid that. That problem wouldn't occur before we added the stage stuff, since we'd just go through all the stages every time and each one would know its own size and do nothing if that number of bytes had already been passed, but with this design there seems to be no way around it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
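[Editor's note: to make the arithmetic above concrete, here is a rough sketch of the XID-difference encoding being described, with full transaction IDs shown as plain uint64 values for brevity (the real code would presumably use FullTransactionId); the function names are illustrative assumptions.]

static inline uint32
encode_prev_xid_delta(uint64 record_fxid, uint64 prev_fxid)
{
    /* The prior modifier can't be more than ~2 billion XIDs behind, so 4 bytes suffice. */
    Assert(record_fxid >= prev_fxid);
    Assert(record_fxid - prev_fxid <= PG_UINT32_MAX);
    return (uint32) (record_fxid - prev_fxid);
}

static inline uint64
decode_prev_xid_delta(uint64 record_fxid, uint32 delta)
{
    /* The reader is required to know record_fxid, so the full value is recoverable. */
    return record_fxid - delta;
}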
Replying to myself to resend to the list, since my previous attempt seems to have been eaten by a grue.

On Tue, Apr 30, 2019 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Like previous version these patch set also applies on:
> > https://github.com/EnterpriseDB/zheap/tree/undo
> > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
>
> Some more review of 0003:
>
> There is a whitespace-only hunk in xact.c.
>
> It would be nice if AtAbort_ResetUndoBuffers didn't need to exist at all. Then, this patch would make no changes whatsoever to xact.c. We'd still need such changes in other patches, because the whole idea of undo is tightly bound up with the concept of transactions. Yet, this particular patch wouldn't touch that file, and that would be nice. In general, the reason why we need AtCommit/AtAbort/AtEOXact callbacks is to adjust the values of global variables (or the data structures to which they point) at commit or abort time. And that is also the case here. The thing is, I'm not sure why we need these particular global variables. Is there some way that we can get by without them? The obvious thing to do seems to be to invent yet another context object, allocated via a new function, which can then be passed to PrepareUndoInsert, InsertPreparedUndo, UndoLogBuffersSetLSN, UnlockReleaseUndoBuffers, etc. This would obsolete UndoSetPrepareSize, since you'd instead pass the size to the context allocator function.
>
> UndoRecordUpdateTransInfo should declare a local variable XactUndoRecordInfo *something = &xact_urec_info[idx] and use that variable wherever possible.
>
> It should also probably use while (1) { ... } rather than do { ... } while (true).
>
> In UndoBufferGetSlot you could replace 'break;' with 'return i;' and then more than half the function would need one less level of indentation. This function should also declare PreparedUndoBuffer *something and set that variable equal to &prepared_undo_buffers[i] at the top of the loop and again after the loop, and that variable should then be used whenever possible.
>
> UndoRecordRelationDetails seems to need renaming now that it's down to a single member.
>
> The comment for UndoRecordBlock needs updating, as it still refers to blkprev.
>
> The comment for UndoRecordBlockPrev refers to "Identifying information" but it's not identifying anything. I think you could just delete "Identifying information for" from this sentence and not lose anything. And I think several of the nearby comments that refer to "Identifying information" could more simply and more correctly just refer to "Information".
>
> I don't think that SizeOfUrecNext needs to exist.
>
> xact.h should not add an include for undolog.h. There are no other changes to this header, so presumably the header does not depend on anything in undolog.h. If .c files that include xact.h need undolog.h, then the header should be included in those files, not in the header itself. That way, we avoid making partial recompiles more painful than necessary.
>
> UndoGetPrevUndoRecptr looks like a strange interface. Why not just arrange not to call the function in the first place if prevurp is valid?
>
> Every use of palloc0 in this code should be audited to check whether it is really necessary to zero the memory before use. If it will be initialized before it's used for anything anyway, it doesn't need to be pre-zeroed.
>
> FinishUnpackUndo looks nice! But it is missing a blank line in one place, and 'if it presents' should be 'if it is present' in a whole bunch of places.
>
> BeginInsertUndo also looks to be missing a few blank lines.
>
> Still need to do some cleanup of the UnpackedUndoRecord, e.g. unifying uur_xid and uur_xidepoch into uur_fxid.
>
> InsertUndoData ends in an unnecessary return, as does SkipInsertingUndoData.
>
> I think the argument to SkipInsertingUndoData should be the number of bytes to skip, rather than the starting byte within the block.
>
> Function header comment formatting is not consistent. Compare FinishUnpackUndo (function name recapitulated on first line of comment) to ReadUndoBytes (not recapitulated) to UnpackUndoData (entire header comment jammed into one paragraph with function name at start). I prefer the style where the function name is not restated, but YMMV. Anyway, it has to be consistent.
>
> UndoGetPrevRecordLen should not declare char *page instead of Page page, I think.
>
> UndoRecInfo looks a bit silly, I think. Isn't index just the index of this entry in the array? You can always figure that out by ptr - array_base. Instead of having UndoRecPtr urp in this structure, how about adding that to UnpackedUndoRecord? When inserting, caller leaves it unset and InsertPreparedUndo sets it; when retrieving, UndoFetchRecord or UndoRecordBulkFetch sets it. With those two changes, I think you can get rid of UndoRecInfo entirely and just return an array of UnpackedUndoRecords. This might also eliminate the need for an 'urp' member in PreparedUndoSpace.
>
> I'd probably use UREC_INFO_BLKPREV rather than UREC_INFO_BLOCKPREV for consistency.
>
> Similarly, UndoFetchRecord and UndoRecordBulkFetch seem a bit inconsistent. Perhaps UndoBulkFetchRecord.
>
> In general, I find the code for updating transaction headers to be really hard to understand. I'm not sure exactly what can be done about that. Like, why is UndoRecordPrepareTransInfo unpacking undo? Why does it take two undo record pointers as arguments and how are they different?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 1, 2019 at 6:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Replying to myself to resend to the list, since my previous attempt seems to have been eaten by a grue.
>
> On Tue, Apr 30, 2019 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > Like previous version these patch set also applies on:
> > > https://github.com/EnterpriseDB/zheap/tree/undo
> > > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
> >
> > Some more review of 0003:

Another suggestion:

+/*
+ * Insert a previously-prepared undo records. This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty. This step should be performed after entering a
+ * criticalsection; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
..
..
+
+    /* Advance the insert pointer past this record. */
+    UndoLogAdvance(urp, size);
+ }
..
}

UndoLogAdvance internally takes an LWLock, and we don't recommend doing that inside a critical section; but that is exactly what will happen here, since this function is supposed to be invoked inside a critical section, as its comments say.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thomas told me offlist that this email of mine didn't hit pgsql-hackers, so trying it again by resending.

On Mon, Apr 29, 2019 at 3:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 19, 2019 at 3:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Mar 12, 2019 at 6:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > Currently, undo branch[1] contain an older version of the (undo interface + some fixup). Now, I have merged the latest changes from the zheap branch[2] to the undo branch[1] which can be applied on top of the undo storage commit[3]. For merging those changes, I need to add some changes to the undo log storage patch as well for handling the multi log transaction. So I have attached two patches, 1) improvement to undo log storage 2) complete undo interface patch which include 0006+0007 from undo branch[1] + new changes on the zheap branch.
> >
> > [1] https://github.com/EnterpriseDB/zheap/tree/undo
> > [2] https://github.com/EnterpriseDB/zheap
> > [3] https://github.com/EnterpriseDB/zheap/tree/undo (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
>
> Dilip has posted the patch for the "undo record interface"; next in the series is a patch that handles transaction rollbacks (the machinery to perform undo actions) and the background workers that manage undo.
>
> Transaction Rollbacks
> ----------------------------------
> We always perform rollback actions after cleaning up the current (sub)transaction. This ensures that we perform the actions immediately after an error (and release the locks) rather than when the user issues a Rollback command at some later point in time. We release the locks only after the undo actions are applied. The reason to delay lock release is that if we release locks before applying undo actions, then a parallel session can acquire the lock before us, which can lead to deadlock.
>
> We promote the error to a FATAL error if it occurred while applying undo for a subtransaction. The reason we can't proceed without applying the subtransaction's undo is that the modifications made in that case must not be visible even if the main transaction commits. Normally, the backend that receives the request to perform Rollback (To Savepoint) applies the undo actions, but there are cases where it is preferable to push the request to a background worker. The main reasons to push requests to background workers are: (a) the request is for a very large rollback; offloading it allows us to return control to the user quickly (there is a GUC, rollback_overflow_size, which indicates that rollbacks greater than the configured size are performed lazily by background workers); and (b) if there is an error while applying the undo actions, we push such a request to a background worker.
>
> Undo Requests and Undo workers
> --------------------------------------------------
> To improve the efficiency of rollbacks, we create three queues and a hash table for the rollback requests. An XID-based priority queue allows us to process the requests of older transactions first and helps us move oldestXidHavingUndo (an xid-horizon below which all transactions are visible) forward. A size-based queue helps us perform the rollbacks of larger aborts in a timely fashion, so that we don't get stuck on them while discarding the logs. An error queue holds the requests of transactions that failed to apply their undo.
> The rollback hash table is used to avoid duplicate undo requests from backends and the discard worker.
>
> The undo launcher is responsible for launching workers iff there is some work available in one of the work queues and more workers are available. A worker is launched to handle requests for a particular database. Each undo worker then starts reading requests for that particular database from one of the queues. A worker will peek into each queue for requests from a particular database if it would otherwise need to switch databases in less than undo_worker_quantum ms (10s by default) after starting. Also, if there is no work, it lingers for UNDO_WORKER_LINGER_MS (10s by default). This avoids restarting the workers too frequently.
>
> The discard worker is responsible for discarding the undo logs of transactions that are committed and all-visible, or are rolled back. It also registers requests for aborted transactions in the work queues. It iterates through all the active logs one-by-one and tries to discard the transactions that are old enough to matter.
>
> The details of how all of this works are described in src/backend/access/undo/README.UndoProcessing. The main idea of keeping a README is to allow reviewers to understand this patch; later we can decide which parts of it to move into code comments and which into the main undo README.
>
> Question: Currently, TwoPhaseFileHeader stores just a TransactionId, so for systems (like zheap) that support FullTransactionId, two-phase transactions will be tricky to support, as we need the FullTransactionId during rollbacks. Is it a good idea to store a FullTransactionId in TwoPhaseFileHeader?
>
> Credits:
> --------------
> Designed by: Andres Freund, Amit Kapila, Robert Haas, and Thomas Munro
> Author: Amit Kapila, Dilip Kumar, Kuntal Ghosh, and Thomas Munro
>
> This patch is based on Dilip's latest patch for the undo record interface. The branch can be accessed at https://github.com/EnterpriseDB/zheap/tree/undoprocessing
>
> Inputs on design/code are welcome.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
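[Editor's note: the foreground-vs-background decision described above could look roughly like the following sketch; rollback_overflow_size is the GUC named in the mail, while the function name, its signature, and the GUC's unit are illustrative assumptions.]

static bool
PerformUndoActionsInForeground(uint64 rollback_size, bool prior_attempt_failed)
{
    /* A request that already failed once goes to the error queue instead. */
    if (prior_attempt_failed)
        return false;

    /*
     * Large rollbacks are offloaded to an undo worker so that control
     * returns to the user quickly.  (The GUC is assumed to be in MB here.)
     */
    if (rollback_size > (uint64) rollback_overflow_size * 1024 * 1024)
        return false;

    return true;
}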
On Tue, Apr 30, 2019 at 11:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The attached patch provides a mechanism for masking the necessary bits in undo pages, to support consistency checking of undo pages. Ideally we could merge this patch with the main interface patch, but currently I have kept it separate, mainly because a) this is still a WIP patch and b) reviewing the changes will be easier this way.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 30, 2019 at 8:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> UndoRecInfo looks a bit silly, I think. Isn't index just the index of this entry in the array? You can always figure that out by ptr - array_base. Instead of having UndoRecPtr urp in this structure, how about adding that to UnpackedUndoRecord? When inserting, caller leaves it unset and InsertPreparedUndo sets it; when retrieving, UndoFetchRecord or UndoRecordBulkFetch sets it. With those two changes, I think you can get rid of UndoRecInfo entirely and just return an array of UnpackedUndoRecords. This might also eliminate the need for an 'urp' member in PreparedUndoSpace.

Yeah, at least in this patch it looks silly. Actually, I added that index to make the qsort stable when execute_undo_action sorts the records in block order. But as part of this patch we don't have that processing, so either we can remove this structure completely as you suggested (and the undo processing patch can add it back), or we can just add a comment explaining why we added this index field. I am OK with the other comments and will work on them.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, May 2, 2019 at 5:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah, at least in this patch it looks silly. Actually, I added that index to make the qsort stable when execute_undo_action sorts the records in block order. But as part of this patch we don't have that processing, so either we can remove this structure completely as you suggested (and the undo processing patch can add it back), or we can just add a comment explaining why we added this index field.

Well, the qsort comparator could compute the index as ptr - array_base just like any other code, couldn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 2, 2019 at 7:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 2, 2019 at 5:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Yeah, at least in this patch it looks silly. Actually, I added that index to make the qsort stable when execute_undo_action sorts the records in block order. But as part of this patch we don't have that processing, so either we can remove this structure completely as you suggested (and the undo processing patch can add it back), or we can just add a comment explaining why we added this index field.
>
> Well, the qsort comparator could compute the index as ptr - array_base just like any other code, couldn't it?

I might be completely missing something, but (ptr - array_base) is only valid when you first get the array; qsort will swap the elements around, and after that you will never be able to make out which element was at the lower index and which one was at the higher index. Basically, our goal is to preserve the order of the undo records for the same block, but their order might change due to swaps while they are being compared with undo record pointers for another block, and once the order has been swapped we will never know their initial positions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 3, 2019 at 12:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I might be completely missing something, but (ptr - array_base) is only valid when you first get the array; qsort will swap the elements around, and after that you will never be able to make out which element was at the lower index and which one was at the higher index. Basically, our goal is to preserve the order of the undo records for the same block, but their order might change due to swaps while they are being compared with undo record pointers for another block, and once the order has been swapped we will never know their initial positions.

*facepalm* Yeah, you're right.

Still, I think we should see if there's some way of getting rid of that structure, or at least making it an internal detail that is used by the code that's doing the sorting rather than something that is exposed as an external interface.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
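[Editor's note: for what it's worth, the stable-sort trick being discussed can be kept internal to the sorting code along these lines. This is only a sketch: the struct mirrors the UndoRecInfo described upthread, and the exact field names are assumptions.]

/* Wrapper built just before sorting; 'index' records the element's original position. */
typedef struct UndoRecInfo
{
    int                 index;  /* position when the array was built */
    UnpackedUndoRecord *uur;    /* the record to sort */
} UndoRecInfo;

static int
undo_record_comparator(const void *a, const void *b)
{
    const UndoRecInfo *left = (const UndoRecInfo *) a;
    const UndoRecInfo *right = (const UndoRecInfo *) b;

    if (left->uur->uur_block != right->uur->uur_block)
        return (left->uur->uur_block < right->uur->uur_block) ? -1 : 1;

    /* Same block: compare original positions, which makes the sort stable. */
    return (left->index < right->index) ? -1 : 1;
}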
On Wed, May 1, 2019 at 10:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Thomas told me offlist that this email of mine didn't hit pgsql-hackers, so trying it again by resending.

Attached is the next version of the patch, with minor improvements:
a. use FullTransactionId
b. improve comments
c. remove some functions

The branch can be accessed at https://github.com/EnterpriseDB/zheap/tree/undoprocessing. It is on top of Thomas and Dilip's patches related to undo logs and undo records, though not everything is synced up from both branches yet, as they are also actively working on their sets of patches.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 30, 2019 at 8:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 30, 2019 at 2:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Like previous version these patch set also applies on:
> > https://github.com/EnterpriseDB/zheap/tree/undo
> > (b397d96176879ed5b09cf7322b8d6f2edd8043a5)
>
> Some more review of 0003:
>
> There is a whitespace-only hunk in xact.c.

Fixed.

> It would be nice if AtAbort_ResetUndoBuffers didn't need to exist at all. Then, this patch would make no changes whatsoever to xact.c. We'd still need such changes in other patches, because the whole idea of undo is tightly bound up with the concept of transactions. Yet, this particular patch wouldn't touch that file, and that would be nice. In general, the reason why we need AtCommit/AtAbort/AtEOXact callbacks is to adjust the values of global variables (or the data structures to which they point) at commit or abort time. And that is also the case here. The thing is, I'm not sure why we need these particular global variables. Is there some way that we can get by without them? The obvious thing to do seems to be to invent yet another context object, allocated via a new function, which can then be passed to PrepareUndoInsert, InsertPreparedUndo, UndoLogBuffersSetLSN, UnlockReleaseUndoBuffers, etc. This would obsolete UndoSetPrepareSize, since you'd instead pass the size to the context allocator function.

I have moved all the global variables into a context. Now I think we don't need AtAbort_ResetUndoBuffers, as the memory will be freed with the transaction context.

> UndoRecordUpdateTransInfo should declare a local variable XactUndoRecordInfo *something = &xact_urec_info[idx] and use that variable wherever possible.

Done.

> It should also probably use while (1) { ... } rather than do { ... } while (true).

OK.

> In UndoBufferGetSlot you could replace 'break;' with 'return i;' and then more than half the function would need one less level of indentation. This function should also declare PreparedUndoBuffer *something and set that variable equal to &prepared_undo_buffers[i] at the top of the loop and again after the loop, and that variable should then be used whenever possible.

Done.

> UndoRecordRelationDetails seems to need renaming now that it's down to a single member.

I have moved that member directly into UndoPackContext.

> The comment for UndoRecordBlock needs updating, as it still refers to blkprev.

Done.

> The comment for UndoRecordBlockPrev refers to "Identifying information" but it's not identifying anything. I think you could just delete "Identifying information for" from this sentence and not lose anything. And I think several of the nearby comments that refer to "Identifying information" could more simply and more correctly just refer to "Information".

Done.

> I don't think that SizeOfUrecNext needs to exist.

Removed.

> xact.h should not add an include for undolog.h. There are no other changes to this header, so presumably the header does not depend on anything in undolog.h. If .c files that include xact.h need undolog.h, then the header should be included in those files, not in the header itself. That way, we avoid making partial recompiles more painful than necessary.

Right, fixed.

> UndoGetPrevUndoRecptr looks like a strange interface. Why not just arrange not to call the function in the first place if prevurp is valid?
Done.

> Every use of palloc0 in this code should be audited to check whether it is really necessary to zero the memory before use. If it will be initialized before it's used for anything anyway, it doesn't need to be pre-zeroed.

Yeah, I found a few places where it was not required, so fixed.

> FinishUnpackUndo looks nice! But it is missing a blank line in one place, and 'if it presents' should be 'if it is present' in a whole bunch of places.
>
> BeginInsertUndo also looks to be missing a few blank lines.

Fixed.

> Still need to do some cleanup of the UnpackedUndoRecord, e.g. unifying uur_xid and uur_xidepoch into uur_fxid.

I will work on this.

> InsertUndoData ends in an unnecessary return, as does SkipInsertingUndoData.

Done.

> I think the argument to SkipInsertingUndoData should be the number of bytes to skip, rather than the starting byte within the block.

Done.

> Function header comment formatting is not consistent. Compare FinishUnpackUndo (function name recapitulated on first line of comment) to ReadUndoBytes (not recapitulated) to UnpackUndoData (entire header comment jammed into one paragraph with function name at start). I prefer the style where the function name is not restated, but YMMV. Anyway, it has to be consistent.

Fixed.

> UndoGetPrevRecordLen should not declare char *page instead of Page page, I think.
>
> UndoRecInfo looks a bit silly, I think. Isn't index just the index of this entry in the array? You can always figure that out by ptr - array_base. Instead of having UndoRecPtr urp in this structure, how about adding that to UnpackedUndoRecord? When inserting, caller leaves it unset and InsertPreparedUndo sets it; when retrieving, UndoFetchRecord or UndoRecordBulkFetch sets it. With those two changes, I think you can get rid of UndoRecInfo entirely and just return an array of UnpackedUndoRecords. This might also eliminate the need for an 'urp' member in PreparedUndoSpace.

As discussed upthread, I will work on fixing this.

> I'd probably use UREC_INFO_BLKPREV rather than UREC_INFO_BLOCKPREV for consistency.
>
> Similarly, UndoFetchRecord and UndoRecordBulkFetch seem a bit inconsistent. Perhaps UndoBulkFetchRecord.

Done.

> In general, I find the code for updating transaction headers to be really hard to understand. I'm not sure exactly what can be done about that. Like, why is UndoRecordPrepareTransInfo unpacking undo?

It's only unpacking the header. But yeah, we can do better: instead of unpacking, we can just read the main header, calculate the exact offset of uur_next from uur_info, and then in UndoRecordUpdateTransInfo directly update only uur_next by writing at that offset, instead of overwriting the complete header.

> Why does it take two undo record pointers as arguments and how are they different?

One is the previous transaction's start header, which we want to update; the other is the current transaction's undo record pointer, which we want to set as uur_next in the previous transaction's start header.

Just for tracking, the open comments which still need to be worked on:

1. Avoid the special case in UndoRecordIsValid.
> Can we instead eliminate the special case? It seems like the if (log->oldest_data == InvalidUndoRecPtr) case will be taken very rarely, so if it's buggy, we might not notice.

2. While updating the previous transaction header, instead of unpacking the complete header and writing it back, we can just unpack the main header, calculate the offset of uur_next, and then update it directly.

3. Unifying uur_xid and uur_xidepoch into uur_fxid.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
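[Editor's note: the "calculate the exact offset of uur_next from uur_info" idea in item 2 could look something like the sketch below. The UREC_INFO_* flags and SizeOf* constants echo the patch's naming, but the exact set of optional structures and their order are assumptions.]

static Size
UndoRecordOffsetOfUurNext(uint16 uur_info)
{
    Size        offset = SizeOfUndoRecordHeader;   /* fixed header first */

    /* Skip whichever optional structures precede the transaction header. */
    if ((uur_info & UREC_INFO_RELATION_DETAILS) != 0)
        offset += SizeOfUndoRecordRelationDetails;
    if ((uur_info & UREC_INFO_BLOCK) != 0)
        offset += SizeOfUndoRecordBlock;

    /* uur_next is assumed to live at the start of the transaction header. */
    return offset + offsetof(UndoRecordTransaction, urec_next);
}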
On Mon, May 6, 2019 at 8:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > In general, I find the code for updating transaction headers to be really hard to understand. I'm not sure exactly what can be done about that. Like, why is UndoRecordPrepareTransInfo unpacking undo?
>
> It's only unpacking the header. But yeah, we can do better: instead of unpacking, we can just read the main header, calculate the exact offset of uur_next from uur_info, and then in UndoRecordUpdateTransInfo directly update only uur_next by writing at that offset, instead of overwriting the complete header.

Hmm. I think it's reasonable to use the unpack infrastructure to figure out where uur_next is. I don't know whether a bespoke method of figuring that out would be any better. At least the comments probably need some work.

> > Why does it take two undo record pointers as arguments and how are they different?
>
> One is the previous transaction's start header, which we want to update; the other is the current transaction's undo record pointer, which we want to set as uur_next in the previous transaction's start header.

So put some comments.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 1, 2019 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 1, 2019 at 6:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Replying to myself to resend to the list, since my previous attempt seems to have been eaten by a grue.
> >
> > On Tue, Apr 30, 2019 at 11:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > Some more review of 0003:
>
> Another suggestion:
>
> +/*
> + * Insert a previously-prepared undo records. This will write the actual undo
> + * record into the buffers already pinned and locked in PreparedUndoInsert,
> + * and mark them dirty. This step should be performed after entering a
> + * criticalsection; it should never fail.
> + */
> +void
> +InsertPreparedUndo(void)
> +{
> ..
> ..
> +
> +    /* Advance the insert pointer past this record. */
> +    UndoLogAdvance(urp, size);
> + }
> ..
> }
>
> UndoLogAdvance internally takes an LWLock, and we don't recommend doing that inside a critical section; but that is exactly what will happen here, since this function is supposed to be invoked inside a critical section, as its comments say.

I think we can call UndoLogAdvanceFinal in FinishUndoRecordInsert instead, because that function is called outside the critical section. And now we already have the undo record size inside UndoRecordInsertContext.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
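[Editor's note: in other words, the insertion flow would defer the lock-taking pointer advance until after the critical section, roughly as sketched below. The function names follow the mails above, but the exact signatures are assumptions.]

void
ExampleInsertUndo(UndoRecordInsertContext *context, UnpackedUndoRecord *urec)
{
    /* Pins and locks buffers and may allocate: done outside the critical section. */
    PrepareUndoInsert(context, urec);

    START_CRIT_SECTION();
    InsertPreparedUndo(context);        /* pure buffer writes; must not fail */
    END_CRIT_SECTION();

    /*
     * Advancing the insert pointer takes the undo log's LWLock, so it is
     * deferred until after the critical section has ended.
     */
    FinishUndoRecordInsert(context);    /* would call UndoLogAdvanceFinal() */
}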
On Mon, May 6, 2019 at 5:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Just for tracking, the open comments which still need to be worked on:
>
> 1. Avoid the special case in UndoRecordIsValid.
> > Can we instead eliminate the special case? It seems like the if (log->oldest_data == InvalidUndoRecPtr) case will be taken very rarely, so if it's buggy, we might not notice.

I have worked on this comment and included the changes in the latest patch.

> 2. While updating the previous transaction header, instead of unpacking the complete header and writing it back, we can just unpack the main header, calculate the offset of uur_next, and then update it directly.

For this one, as you suggested, I am not changing the code; I have updated the comments instead.

> 3. Unifying uur_xid and uur_xidepoch into uur_fxid.

Still open.

I have also added the README.

Patches can be applied on top of undo branch [1] commit: (cb777466d008e656f03771cf16ec7ef9d6f2778b)

[1] https://github.com/EnterpriseDB/zheap/tree/undo

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, May 9, 2019 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Patches can be applied on top of undo branch [1] commit: > (cb777466d008e656f03771cf16ec7ef9d6f2778b) Hello all, Here is a new patch set which includes all of the patches discussed in this thread in one go, rebased on today's master. To summarise the main layers, from the top down we have: 0013: undo-based orphaned file clean-up ($SUBJECT, a demo of undo technology) 0009-0010: undo processing (execution of undo actions when rolling back) 0008: undo records 0001-0007: undo storage The main changes to the storage layer since the last time I posted the full patch stack: * pg_upgrade support: you can't have any live undo logs (much like 2PC transactions, we want to be free to change the format), but some work was required to make sure that all "discarded" undo record pointers from the old cluster still appear as discarded in the new cluster, as well as any from the new cluster * tweaks to various other src/bin tools that are aware of files under pgdata and were confused by undo segment files * the fsync of undo log segment files when they're created or recycled is now handed off to the checkpointer (this was identified as a performance problem for zheap) * code tidy-up, removing dead code (undo log rewind, prevlen, prevlog were no longer needed by patches higher up in the stack), removing global variables, noisy LOG messages about undo segment files now reduced to DEBUG1 * new extension contrib/undoinspect, for developer use, showing what will be undone if you abort: postgres=# begin; BEGIN postgres=# create table t(); CREATE TABLE postgres=# select * from undoinspect(); urecptr | rmgr | flags | xid | description ------------------+---------+-------+-----+--------------------------------------------- 00000000000032FA | Storage | P,T | 487 | CREATE dbid=12934, tsid=1663, relfile=16393 (1 row) One silly detail: I had to change the default max_worker_processes from 8 to 12, because otherwise a couple of tests run with fewer parallel workers than they expect, due to undo worker processes using up slots. There is probably a better solution to that problem. I put the patches in a tarball here, but they are also available from https://github.com/EnterpriseDB/zheap/tree/undo. -- Thomas Munro https://enterprisedb.com
Attachment
Hello Thomas,
In pg_buffercache contrib module, the file pg_buffercache--1.3--1.4.sql is missing. AFAICS, this file should be added as part of the following commit:
Add SmgrId to smgropen() and BufferTag
Otherwise, I'm not able to compile the contrib modules. I've also attached the patch to fix the same.
Attachment
On Fri, May 10, 2019 at 10:46 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > In pg_buffercache contrib module, the file pg_buffercache--1.3--1.4.sql is missing. AFAICS, this file should be added as part of the following commit: > Add SmgrId to smgropen() and BufferTag > > Otherwise, I'm not able to compile the contrib modules. I've also attached the patch to fix the same. Oops, thanks Kuntal. Fixed, along with some compiler warnings from MSVC and GCC. I added a quick tour of this to a README.md visible here: https://github.com/EnterpriseDB/zheap/tree/undo -- Thomas Munro https://enterprisedb.com
Attachment
On Thu, May 9, 2019 at 12:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, May 6, 2019 at 5:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Just for tracking, open comments which still need to be worked on. > > > > 1. Avoid special case in UndoRecordIsValid. > > > Can we instead eliminate the special case? It seems like the if > > > (log->oldest_data == InvalidUndoRecPtr) case will be taken very > > > rarely, so if it's buggy, we might not notice. > > I have worked on this comment and added the changes in the latest patch. > > > > 2. While updating the previous transaction header, instead of unpacking the > > complete header and writing it back, we can just unpack the main header, > > calculate the offset of uur_next, and then update it directly. > > For this, as you suggested, I am not changing the approach; I have updated the comments instead. > > > > 3. unifying uur_xid and uur_xidepoch into uur_fxid. > Still open. > > I have also added the README. > > Patches can be applied on top of undo branch [1] commit: > (cb777466d008e656f03771cf16ec7ef9d6f2778b) > > [1] https://github.com/EnterpriseDB/zheap/tree/undo > I have removed some of the globals and also improved some comments. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, May 12, 2019 at 2:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have removed some of the globals and also improved some comments. I don't like the discard_lock very much. Perhaps it's OK, but I hope that there are better alternatives. One problem, which Thomas Munro pointed out to me in off-list discussion, is that the discard_lock has to be held by anyone reading undo even if the undo they are reading and the undo that the discard worker wants to discard are in completely different parts of the undo log. Somebody could be trying to read an undo page written 1 second ago while the discard worker is trying to discard an undo page written to the same undo log 1 hour ago. Those things need not block each other, but with this design they will. Another problem is that we end up holding it across an I/O; there's precedent for that, but it's not particularly good precedent. Let's see if we can do better. My first idea was that we should just make this the caller's problem instead of handling it in this layer. Undo is retained for committed transactions until they are all-visible, and the reason for that is that we presume that nobody can be interested in the data for MVCC purposes unless there's a snapshot that can't see the results of the transaction in question. Once the committed transaction is all-visible, that's nobody, so it should be fine to just discard the undo any time we like. That won't work with the existing zheap code, which currently sometimes follows undo chains for transactions that are all-visible, but I think that's a problem we should fix rather than something we should force the undo layer to support. We'd still need something kinda like the discard_lock for aborted transactions, though, because as soon as you release the buffer lock on a table page, the undo workers could apply all the undo to that page and then discard it, and then you could afterwards try to look up the undo pointer which you had retrieved from that page and stored in backend-local memory. One thing we could probably do is make that a heavyweight lock on the XID itself, so if you observe that an XID is aborted, you have to go get this lock in ShareLock mode, then recheck the page, and only then consult the undo; discarding the undo for an aborted transaction would require AccessExclusiveLock on the XID. This solution gets rid of the LWLock for committed undo; for aborted undo, it avoids the false sharing and non-interruptibility that an LWLock imposes. But then I had what I think may be a better idea. Let's add a new ReadBufferMode that suppresses the actual I/O; if the buffer is not already present in shared_buffers, it allocates a buffer but returns it without doing any I/O, so the caller must be prepared for BM_VALID to be unset. I don't know what to call this, so I'll call it RBM_ALLOCATE (leaving room for possible future variants like RBM_ALLOCATE_AND_LOCK). Then, the protocol for reading an undo buffer would go like this: 1. Read the buffer with RBM_ALLOCATE, thus acquiring a pin on the relevant buffer. 2. Check whether the buffer precedes the discard horizon for that undo log stored in shared memory. 3. If so, use the ForgetBuffer() code we have in the zheap branch to deallocate the buffer and stop here. The undo is not available to be read, whether it's still physically present or not. 4. Otherwise, if the buffer is not valid, call ReadBufferExtended again, or some new function, to make it so. Remember to release all of our pins.
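In code form, the proposed reader protocol might look something like the sketch below. RBM_ALLOCATE and ForgetBuffer() are the proposals from this mail, not existing APIs, and UndoRecPtrIsDiscarded() and BufferIsValidated() are assumed helper names.

Buffer
ReadUndoBufferIfNotDiscarded(Relation rel, BlockNumber blkno, UndoRecPtr urp)
{
    /* 1. Allocate and pin a buffer without performing any I/O. */
    Buffer buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                    RBM_ALLOCATE, NULL);

    /* 2 & 3. If the undo precedes the discard horizon kept in shared
     * memory, deallocate the buffer and give up: the undo is not
     * available, whether it is still physically present or not. */
    if (UndoRecPtrIsDiscarded(urp))
    {
        ForgetBuffer(buf);
        return InvalidBuffer;
    }

    /* 4. If we were handed an invalid buffer (BM_VALID not set), now
     * perform the actual read. */
    if (!BufferIsValidated(buf))
        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                 RBM_NORMAL, NULL);

    return buf;
}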
The protocol for discarding an undo buffer would go like this: 1. Advance the discard horizon in shared memory. 2. Take a cleanup lock on each buffer that ought to be discarded. Remember the dirty ones and forget the others. 3. WAL-log the discard operation. 4. Revisit the dirty buffers we remembered in step 2 and forget them. (A code sketch of this protocol appears after this message.) The idea is that, once we've advanced the discard horizon in shared memory, any readers that come along later are responsible for making sure that they never do I/O on any older undo. They may create some invalid buffers in shared memory, but they'll hopefully also get rid of them if they do, and if they error out for some reason before doing so, that buffer should age out naturally. So, the discard worker just needs to worry about buffers that already exist. Once it's taken a cleanup lock on each buffer, it knows that there are no I/O operations and in fact no buffer usage of any kind still in progress from before it moved the in-memory discard horizon. Anyone new that comes along will clean up after themselves. We postpone forgetting dirty buffers until after we've successfully WAL-logged the discard, in case we fail to do so. With this design, we don't add any new cases where a lock of any kind must be held across an I/O, and there's also no false sharing. Furthermore, unlike the previous proposal, this will work nicely with something like old_snapshot_threshold. The previous design relies on undo not getting discarded while anyone still cares about it, but old_snapshot_threshold, if applied to zheap, would have the express goal of discarding undo while somebody still cares about it. With this design, we could support old_snapshot_threshold by having undo readers error out in step #2 if the transaction is committed and not visible to our snapshot and yet the undo is discarded. Heck, we can do that anyway as a safety check, basically for free, and just tailor the error message depending on whether old_snapshot_threshold is such that the condition is expected to be possible. While I'm kvetching, I can't help noticing that undoinsert.c contains functions both for inserting undo and also for reading it, which seems like a loose end that needs to be tied up somehow. I'm mildly inclined to think that we should rename the file to something more generic (e.g. undoaccess.c) rather than splitting it into two files (e.g. undoinsert.c and undoread.c). Also, it looks to me like you need to go through what is currently undoinsert.h and look for stuff that can be made private to the .c file. I don't see why things like MAX_PREPARED_UNDO need to be exposed at all, and for things like PreparedUndoSpace it seems like it would suffice to just do 'struct PreparedUndoSpace; typedef struct PreparedUndoSpace PreparedUndoSpace;' in the header and put the actual 'struct PreparedUndoSpace { ... };' definition in the .c file. And UnlockReleaseUndoBuffers has a declaration but no longer has a definition, so I think that can go away too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
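And the discard side of the proposal, in the same sketch style. Every helper name here (UndoLogSetDiscardHorizon, next_buffer_below_horizon, BufferIsDirty, LogUndoLogDiscard, MAX_UNDO_DISCARD_BUFFERS) is an assumption for illustration, and locking details are elided.

void
DiscardUndoBuffers(UndoLogSlot *slot, UndoRecPtr new_horizon)
{
    Buffer  dirty[MAX_UNDO_DISCARD_BUFFERS];
    int     ndirty = 0;
    Buffer  buf;

    /* 1. Advance the discard horizon in shared memory first, so that
     * readers arriving after this point clean up after themselves. */
    UndoLogSetDiscardHorizon(slot, new_horizon);

    /* 2. Cleanup-lock each existing buffer below the horizon to wait
     * out in-progress users; forget the clean ones now, remember the
     * dirty ones for later. */
    while ((buf = next_buffer_below_horizon(slot, new_horizon)) != InvalidBuffer)
    {
        LockBufferForCleanup(buf);
        if (BufferIsDirty(buf))
            dirty[ndirty++] = buf;
        else
            ForgetBuffer(buf);
    }

    /* 3. WAL-log the discard operation. */
    LogUndoLogDiscard(slot, new_horizon);

    /* 4. Only now forget the dirty buffers, in case step 3 failed. */
    for (int i = 0; i < ndirty; i++)
        ForgetBuffer(dirty[i]);
}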
On Mon, May 13, 2019 at 11:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sun, May 12, 2019 at 2:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have removed some of the globals and also improved some comments. > > I don't like the discard_lock very much. Perhaps it's OK, but I hope > that there are better alternatives. One problem with Thomas Munro > pointed out to me in off-list discussion is that the discard_lock has > to be held by anyone reading undo even if the undo they are reading > and the undo that the discard worker wants to discard are in > completely different parts of the undo log. Somebody could be trying > to read an undo page written 1 second ago while the discard worker is > trying to discard an undo page written to the same undo log 1 hour > ago. Those things need not block each other, but with this design > they will. > Yeah, this doesn't appear to be a good way to deal with the problem. > Another problem is that we end up holding it across an > I/O; there's precedent for that, but it's not particularly good > precedent. Let's see if we can do better. > > But then I had what I think may be a better idea. > +1. I also think the below idea is better than the previous one. > Let's add a new > ReadBufferMode that suppresses the actual I/O; if the buffer is not > already present in shared_buffers, it allocates a buffer but returns > it without doing any I/O, so the caller must be prepared for BM_VALID > to be unset. I don't know what to call this, so I'll call it > RBM_ALLOCATE (leaving room for possible future variants like > RBM_ALLOCATE_AND_LOCK). Then, the protocol for reading an undo buffer > would go like this: > > 1. Read the buffer with RBM_ALLOCATE, thus acquiring a pin on the > relevant buffer. > 2. Check whether the buffer precedes the discard horizon for that undo > log stored in shared memory. > 3. If so, use the ForgetBuffer() code we have in the zheap branch to > deallocate the buffer and stop here. The undo is not available to be > read, whether it's still physically present or not. > 4. Otherwise, if the buffer is not valid, call ReadBufferExtended > again, or some new function, to make it so. Remember to release all > of our pins. > > The protocol for discarding an undo buffer would go like this: > > 1. Advance the discard horizon in shared memory. > 2. Take a cleanup lock on each buffer that ought to be discarded. > Remember the dirty ones and forget the others. > 3. WAL-log the discard operation. > 4. Revisit the dirty buffers we remembered in step 2 and forget them. > > The idea is that, once we've advanced the discard horizon in shared > memory, any readers that come along later are responsible for making > sure that they never do I/O on any older undo. They may create some > invalid buffers in shared memory, but they'll hopefully also get rid > of them if they do, and if they error out for some reason before doing > so, that buffer should age out naturally. So, the discard worker just > needs to worry about buffers that already exist. Once it's taken a > cleanup lock on each buffer, it knows that there are no I/O operations > and in fact no buffer usage of any kind still in progress from before > it moved the in-memory discard horizon. Anyone new that comes along > will clean up after themselves. We postpone forgetting dirty buffers > until after we've successfully WAL-logged the discard, in case we fail > to do so. > I have spent some time thinking over this and couldn't see any problem with this. 
So, +1 for trying this out along the lines of what you have described above. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, May 13, 2019 at 11:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > > While I'm kvetching, I can't help noticing that undoinsert.c contains > functions both for inserting undo and also for reading it, which seems > like a loose end that needs to be tied up somehow. I'm mildly > inclined to think that we should rename the file to something more > generic (e.g. undoaccess.c) rather than splitting it into two files > (e.g. undoinsert.c and undoread.c). Changed to undoaccess. > Also, it looks to me like you > need to go through what is currently undoinsert.h and look for stuff > that can be made private to the .c file. I don't see why things like > MAX_PREPARED_UNDO need to be exposed at all, Ideally, my previous patch should have got rid of MAX_PREPARED_UNDO, as we are now always allocating memory for the prepared space, but by mistake I left it in this file. Now, I have removed it. > and for things like > PreparedUndoSpace it seems like it would suffice to just do 'struct > PreparedUndoSpace; typedef struct PreparedUndoSpace > PreparedUndoSpace;' in the header and put the actual 'struct > PreparedUndoSpace { ... };' definition in the .c file. Changed; 'typedef struct PreparedUndoSpace PreparedUndoSpace;' in the header with the 'struct PreparedUndoSpace { ... };' definition in the .c file seems fine. > And > UnlockReleaseUndoBuffers has a declaration but no longer has a > definition, so I think that can go away too. Removed, and also cleaned up some other such declarations. Pending items to be worked upon: a) Get rid of UndoRecInfo b) Get rid of xid in generic undo code and unify epoch and xid to fxid c) Get rid of discard lock d) Move log switch related information from transaction header to new log switch header -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi, On 2019-05-05 10:28:21 +0530, Amit Kapila wrote: > From 5d9e179bd481b5ed574b6e7117bf3eb62b5dc003 Mon Sep 17 00:00:00 2001 > From: Amit Kapila <amit.kapila@enterprisedb.com> > Date: Sat, 4 May 2019 16:52:01 +0530 > Subject: [PATCH] Allow undo actions to be applied on rollbacks and discard > unwanted undo. I think this needs to be split into some constituent parts, to be reviewable. Discussing 270kb of patch at once is just too much. My first guess for a viable split would be: 1) undoaction related infrastructure 2) xact.c integration et al 3) binaryheap changes etc 4) undo worker infrastructure It probably should be split even further, by moving things like: - oldestXidHavingUndo infrastructure - discard infrastructure Some small remarks: > > + { > + {"disable_undo_launcher", PGC_POSTMASTER, DEVELOPER_OPTIONS, > + gettext_noop("Decides whether to launch an undo worker."), > + NULL, > + GUC_NOT_IN_SAMPLE > + }, > + &disable_undo_launcher, > + false, > + NULL, NULL, NULL > + }, > + We don't normally formulate GUCs in the negative like that. C.F. autovacuum etc. > +/* Extract xid from a value comprised of epoch and xid */ > +#define GetXidFromEpochXid(epochxid) \ > + ((uint32) (epochxid) & 0XFFFFFFFF) > + > +/* Extract epoch from a value comprised of epoch and xid */ > +#define GetEpochFromEpochXid(epochxid) \ > + ((uint32) ((epochxid) >> 32)) > + Why do these exist? This should all go through FullTransactionId. > /* End-of-list marker */ > { > {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL > @@ -2923,6 +2935,16 @@ static struct config_int ConfigureNamesInt[] = > 5000, 1, INT_MAX, > NULL, NULL, NULL > }, > + { > + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, > + gettext_noop("Rollbacks greater than this size are done lazily"), > + NULL, > + GUC_UNIT_MB > + }, > + &rollback_overflow_size, > + 64, 0, MAX_KILOBYTES, > + NULL, NULL, NULL > + }, rollback_foreground_size? rollback_background_size? I don't think overflow is particularly clear. > @@ -1612,6 +1635,85 @@ FinishPreparedTransaction(const char *gid, bool isCommit) > > MyLockedGxact = NULL; > > + /* > + * Perform undo actions, if there are undologs for this transaction. We > + * need to perform undo actions while we are still in transaction. Never > + * push rollbacks of temp tables to undo worker. > + */ > + for (i = 0; i < UndoPersistenceLevels; i++) > + { This should be in a separate function. And it'd be good if more code between this and ApplyUndoActions() would be shared. > + /* > + * Here, we just detect whether there are any pending undo actions so that > + * we can skip releasing the locks during abort transaction. We don't > + * release the locks till we execute undo actions otherwise, there is a > + * risk of deadlock. > + */ > + SetUndoActionsInfo(); This function name is so generic that it gives the reader very little information about why it's called here (and in other similar places). Greetings, Andres Freund
On Tue, May 21, 2019 at 1:18 PM Andres Freund <andres@anarazel.de> wrote: > I think this needs to be split into some constituent parts, to be > reviewable. Discussing 270kb of patch at once is just too much. +1. > > + { > > + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, > > + gettext_noop("Rollbacks greater than this size are done lazily"), > > + NULL, > > + GUC_UNIT_MB > > + }, > > + &rollback_overflow_size, > > + 64, 0, MAX_KILOBYTES, > > + NULL, NULL, NULL > > + }, > > rollback_foreground_size? rollback_background_size? I don't think > overflow is particularly clear. The problem with calling it 'rollback' is that a rollback is a general PostgreSQL term that gives no hint the proposed undo facility is involved. I'm not exactly sure what to propose but I think it's got to have the word 'undo' in there someplace (or some new term we invent that is only used in connection with undo). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 21, 2019 at 10:47 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-05-05 10:28:21 +0530, Amit Kapila wrote: > > From 5d9e179bd481b5ed574b6e7117bf3eb62b5dc003 Mon Sep 17 00:00:00 2001 > > From: Amit Kapila <amit.kapila@enterprisedb.com> > > Date: Sat, 4 May 2019 16:52:01 +0530 > > Subject: [PATCH] Allow undo actions to be applied on rollbacks and discard > > unwanted undo. > > I think this needs to be split into some constituent parts, to be > reviewable. Okay. > Discussing 270kb of patch at once is just too much. My first > guess for a viable split would be: > > 1) undoaction related infrastructure > 2) xact.c integration et al > 3) binaryheap changes etc > 4) undo worker infrastructure > > It probably should be split even further, by moving things like: > - oldestXidHavingUndo infrastructure > - discard infrastructure > Okay, I will think about this and split the patch. > Some small remarks: > > > > > + { > > + {"disable_undo_launcher", PGC_POSTMASTER, DEVELOPER_OPTIONS, > > + gettext_noop("Decides whether to launch an undo worker."), > > + NULL, > > + GUC_NOT_IN_SAMPLE > > + }, > > + &disable_undo_launcher, > > + false, > > + NULL, NULL, NULL > > + }, > > + > > We don't normally formulate GUCs in the negative like that. C.F. > autovacuum etc. > Okay, will change. Actually, this is just for development purposes. It can help us in testing cases where we have pushed the undo, but it won't be applied, so whenever the foreground process encounters such a transaction, it will perform the page-wise undo. I am not 100% sure if we need this for the final version. Similarly, for testing purposes, we might need enable_discard_worker to test the cases where discard doesn't happen for a long time. > > > +/* Extract xid from a value comprised of epoch and xid */ > > +#define GetXidFromEpochXid(epochxid) \ > > + ((uint32) (epochxid) & 0XFFFFFFFF) > > + > > +/* Extract epoch from a value comprised of epoch and xid */ > > +#define GetEpochFromEpochXid(epochxid) \ > > + ((uint32) ((epochxid) >> 32)) > > + > > Why do these exist? > We don't need the second one (GetEpochFromEpochXid), but the first one is required. Basically, the oldestXidHavingUndo computation does consider oldestXmin (which is still a TransactionId) as we can't retain undo which is 2^31 transactions old due to other limitations, like clog/snapshots still having a limit of 4-byte transaction ids. Slightly unrelated, but we do want to improve the undo retention in a subsequent version such that we won't allow pending undo for transactions whose age is more than 2^31. > > > /* End-of-list marker */ > > { > > {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL > > @@ -2923,6 +2935,16 @@ static struct config_int ConfigureNamesInt[] = > > 5000, 1, INT_MAX, > > NULL, NULL, NULL > > }, > > + { > > + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, > > + gettext_noop("Rollbacks greater than this size are done lazily"), > > + NULL, > > + GUC_UNIT_MB > > + }, > > + &rollback_overflow_size, > > + 64, 0, MAX_KILOBYTES, > > + NULL, NULL, NULL > > + }, > > rollback_foreground_size? rollback_background_size? I don't think > overflow is particularly clear. > How about rollback_undo_size or abort_undo_size or undo_foreground_size or pending_undo_size? > > > @@ -1612,6 +1635,85 @@ FinishPreparedTransaction(const char *gid, bool isCommit) > > > > MyLockedGxact = NULL; > > > > + /* > > + * Perform undo actions, if there are undologs for this transaction. We > > + * need to perform undo actions while we are still in transaction.
Never > > + * push rollbacks of temp tables to undo worker. > > + */ > > + for (i = 0; i < UndoPersistenceLevels; i++) > > + { > > This should be in a separate function. And it'd be good if more code > between this and ApplyUndoActions() would be shared. > makes sense, will try. > > > + /* > > + * Here, we just detect whether there are any pending undo actions so that > > + * we can skip releasing the locks during abort transaction. We don't > > + * release the locks till we execute undo actions otherwise, there is a > > + * risk of deadlock. > > + */ > > + SetUndoActionsInfo(); > > This function name is so generic that it gives the reader very little > information about why it's called here (and in other similar places). > NeedToPerformUndoActions()? UndoActionsRequired()? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 22, 2019 at 7:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > +/* Extract xid from a value comprised of epoch and xid */ > > > +#define GetXidFromEpochXid(epochxid) \ > > > + ((uint32) (epochxid) & 0XFFFFFFFF) > > > + > > > +/* Extract epoch from a value comprised of epoch and xid */ > > > +#define GetEpochFromEpochXid(epochxid) \ > > > + ((uint32) ((epochxid) >> 32)) > > > + > > > > Why do these exist? > > > > We don't need the second one (GetEpochFromEpochXid), but the first one > is required. Basically, the oldestXidHavingUndo computation does > consider oldestXmin (which is still a TransactionId) as we can't > retain undo which is 2^31 transactions old due to other limitations > like clog/snapshots still has a limit of 4-byte transaction ids. > Slightly unrelated, but we do want to improve the undo retention in a > subsequent version such that we won't allow pending undo for > transaction whose age is more than 2^31. The point is that we now have EpochFromFullTransactionId and XidFromFullTransactionId. You shouldn't be inventing your own version of that infrastructure. Use FullTransactionId, not a uint64, and then use the functions for dealing with full transaction IDs from transam.h. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
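For reference, the helpers Robert mentions already live in access/transam.h, so both directions are covered without any new macros; something like the following (the epoch and xid variables are placeholders for whatever values are at hand):

#include "access/transam.h"

/* Build a 64-bit xid from an (epoch, xid) pair... */
FullTransactionId fxid = FullTransactionIdFromEpochAndXid(epoch, xid);

/* ...and take it apart again, instead of hand-rolled shift/mask macros. */
uint32        xid_epoch = EpochFromFullTransactionId(fxid);
TransactionId plain_xid = XidFromFullTransactionId(fxid);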
On Wed, May 22, 2019 at 5:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 22, 2019 at 7:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > +/* Extract xid from a value comprised of epoch and xid */ > > > > +#define GetXidFromEpochXid(epochxid) \ > > > > + ((uint32) (epochxid) & 0XFFFFFFFF) > > > > + > > > > +/* Extract epoch from a value comprised of epoch and xid */ > > > > +#define GetEpochFromEpochXid(epochxid) \ > > > > + ((uint32) ((epochxid) >> 32)) > > > > + > > > > > > Why do these exist? > > > > > > > We don't need the second one (GetEpochFromEpochXid), but the first one > > is required. Basically, the oldestXidHavingUndo computation does > > consider oldestXmin (which is still a TransactionId) as we can't > > retain undo which is 2^31 transactions old due to other limitations > > like clog/snapshots still has a limit of 4-byte transaction ids. > > Slightly unrelated, but we do want to improve the undo retention in a > > subsequent version such that we won't allow pending undo for > > transaction whose age is more than 2^31. > > The point is that we now have EpochFromFullTransactionId and > XidFromFullTransactionId. You shouldn't be inventing your own version > of that infrastructure. Use FullTransactionId, not a uint64, and then > use the functions for dealing with full transaction IDs from > transam.h. > Okay, I misunderstood the comment. I'll change accordingly. Thanks for pointing out. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, May 22, 2019 at 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, May 21, 2019 at 10:47 PM Andres Freund <andres@anarazel.de> wrote: > > > Some small remarks: > > > > > > > > + { > > > + {"disable_undo_launcher", PGC_POSTMASTER, DEVELOPER_OPTIONS, > > > + gettext_noop("Decides whether to launch an undo worker."), > > > + NULL, > > > + GUC_NOT_IN_SAMPLE > > > + }, > > > + &disable_undo_launcher, > > > + false, > > > + NULL, NULL, NULL > > > + }, > > > + > > > > We don't normally formulate GUCs in the negative like that. C.F. > > autovacuum etc. > > > > Okay, will change. Actually, this is just for development purposes. > It can help us in testing cases where we have pushed the undo, but it > won't be applied, so whenever the foreground process encounters such a > transaction, it will perform the page-wise undo. I am not 100% sure > if we need this for the final version. Similarly, for testing > purposes, we might need enable_discard_worker to test the cases where > discard doesn't happen for a long time. > Changed. > > > > > +/* Extract xid from a value comprised of epoch and xid */ > > > +#define GetXidFromEpochXid(epochxid) \ > > > + ((uint32) (epochxid) & 0XFFFFFFFF) > > > + > > > +/* Extract epoch from a value comprised of epoch and xid */ > > > +#define GetEpochFromEpochXid(epochxid) \ > > > + ((uint32) ((epochxid) >> 32)) > > > + > > > > Why do these exist? > > > > We don't need the second one (GetEpochFromEpochXid), but the first one > > is required. Basically, the oldestXidHavingUndo computation does > > consider oldestXmin (which is still a TransactionId) as we can't > > retain undo which is 2^31 transactions old due to other limitations, > > like clog/snapshots still having a limit of 4-byte transaction ids. > > Slightly unrelated, but we do want to improve the undo retention in a > > subsequent version such that we won't allow pending undo for > > transactions whose age is more than 2^31. > > Removed both the above defines. > > > > > > /* End-of-list marker */ > > > { > > > {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL > > > @@ -2923,6 +2935,16 @@ static struct config_int ConfigureNamesInt[] = > > > 5000, 1, INT_MAX, > > > NULL, NULL, NULL > > > }, > > > + { > > > + {"rollback_overflow_size", PGC_USERSET, RESOURCES_MEM, > > > + gettext_noop("Rollbacks greater than this size are done lazily"), > > > + NULL, > > > + GUC_UNIT_MB > > > + }, > > > + &rollback_overflow_size, > > > + 64, 0, MAX_KILOBYTES, > > > + NULL, NULL, NULL > > > + }, > > > > rollback_foreground_size? rollback_background_size? I don't think > > overflow is particularly clear. > > > > How about rollback_undo_size or abort_undo_size or > > undo_foreground_size or pending_undo_size? > I think we need some more discussion on this before we change it, as Robert seems to feel that we should have 'undo' someplace in the name. Please let me know your preference. > > > > > > @@ -1612,6 +1635,85 @@ FinishPreparedTransaction(const char *gid, bool isCommit) > > > > > > MyLockedGxact = NULL; > > > > > > + /* > > > + * Perform undo actions, if there are undologs for this transaction. We > > > + * need to perform undo actions while we are still in transaction. Never > > > + * push rollbacks of temp tables to undo worker. > > > + */ > > > + for (i = 0; i < UndoPersistenceLevels; i++) > > > + { > > > > This should be in a separate function. And it'd be good if more code > > between this and ApplyUndoActions() would be shared. > > > > makes sense, will try. > Done.
Now, there is a common function that is used in twophase.c and ApplyUndoActions. > > > > > + /* > > > + * Here, we just detect whether there are any pending undo actions so that > > > + * we can skip releasing the locks during abort transaction. We don't > > > + * release the locks till we execute undo actions otherwise, there is a > > > + * risk of deadlock. > > > + */ > > > + SetUndoActionsInfo(); > > > > This function name is so generic that it gives the reader very little > > information about why it's called here (and in other similar places). > > > > NeedToPerformUndoActions()? UndoActionsRequired()? > Changed to UndoActionsRequired and added comments atop the function to make it clear why and when this function needs to be used. Apart from fixing the above comments, the patch is rebased on the latest undo patchset. As of now, I have split the binaryheap.c changes into a separate patch. We are still enhancing the patch to compute oldestXidHavingUnappliedUndo, which touches various parts of the patch, so splitting further without completing that can make it a bit difficult to work on that. Pending work ------------------- 1. Enhance uur_progress so that it updates undo action apply progress at regular intervals. 2. Enhance to support oldestXidHavingUnappliedUndo, more on that later. 3. Split the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
My understanding is that the smgr pendingDeletes infrastructure will be replaced by these patches. I still see CommitTransaction() calling smgrDoPendingDeletes() in the latest patch set. Am I missing something?
Asim
On Mon, Jun 10, 2019 at 5:35 AM Asim R P <apraveen@pivotal.io> wrote: > My understanding is that the smgr pendingDeletes infrastructure will be replaced by these patches. I still see CommitTransaction() calling smgrDoPendingDeletes() in the latest patch set. Am I missing something? Hi Asim, Thanks for looking at the patch. The pendingDeletes list is used both for files that should be deleted if we commit and files that should be deleted if we abort. This patch deals only with the abort case, using the undo log instead of pendingDeletes. That is the file leak scenario that has an arbitrarily wide window controlled by the user and is probably the source of almost all cases that you hear of disks filling up with orphaned junk AFAICS. There could in theory be a persistent stuff-to-do-if-we-commit system exactly unlike undo logs (records to be discarded on abort, executed on commit). I haven't thought much about how it'd work, but Andres did suggest something like that for another purpose just the other day, and although it's hard to think of a name for it, it doesn't seem crazy as long as it doesn't add overheads when you're not using it. Without such a mechanism, you can probably leak files belonging to tables that you have dropped in a committed transaction, if you die in CommitTransaction() after it has called RecordTransactionCommit() but before it reaches smgrDoPendingDeletes(), and even then probably only if there is a super well-timed checkpoint so that you recover without replaying the drop. I'm not trying to tackle that today. BTW, there is yet another kind of deferred unlinking going on. In SyncPostCheckpoint() (formerly known as mdpostckpt()) we defer the last bit of the job until after the next checkpoint. At that point we only expect the first segment to exist and we expect it to be empty. That's a mechanism introduced by commit 6cc4451b5c47 to make sure that we don't reuse relfilenode numbers too soon in some crash scenarios. That means there is another very narrow window there to leak a file (though these ones are empty): you die after the checkpoint is logged but before SyncPostCheckpoint() is run, or even after that but before the operating system has flushed the directory. -- Thomas Munro https://enterprisedb.com
On Mon, May 27, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Apart from fixing the above comments, the patch is rebased on the latest > undo patchset. As of now, I have split the binaryheap.c changes into > a separate patch. We are still enhancing the patch to compute > oldestXidHavingUnappliedUndo which touches various parts of the patch, so > splitting further without completing that can make it a bit difficult > to work on that. Some review comments around execute_undo_actions: The 'nopartial' argument to execute_undo_actions is confusing. First, it would probably be worth spelling it out rather than abbreviating: not_partial_transaction rather than nopartial. Second, it is usually better to phrase parameter names in terms of what they are rather than in terms of what they are not: complete_transaction rather than not_partial_transaction. Third, it's unclear from these comments why we'd be undoing something other than a complete transaction. It looks as though the answer is that this flag will be false when we're undoing a subxact -- in which case, why not invert the sense of the flag and call it 'bool subxact'? I might be wrong, but it seems like that would be a whole lot clearer. Fourth, the block at the top of this function, guarded by nopartial, seems like it must be vulnerable to race conditions. If we're undoing the complete transaction, then it checks whether UndoFetchRecord() still returns anything. But that could change not just at the beginning of the function, but also any time in the middle, or so it seems to me. I doubt that this is the right level at which to deal with this sort of interlocking. I think there should be some higher-level mechanism that prevents two processes from trying to undo the same transaction at the same time, like a heavyweight lock or some kind of flag in the shared memory data structure that keeps track of pending undo, so that we never even reach this code unless we know that this XID needs undo work and no other process is already doing it. If you're the only one undoing XID 123456, then there shouldn't be any chance of the undo disappearing from underneath you. And we definitely want to guarantee that only one process is undoing any given XID at a time. The 'blk_chain_complete' variable which is set in this function and passed down to execute_undo_actions_page() and then to the rmgr's rm_undo callback also looks problematic. First, not every AM that uses undo may even have the notion of a 'block chain'; zedstore for example uses TIDs as a 48-bit integer, not a block + offset number, so it's really not going to have a 'block chain.' Second, even in zheap's usage, it seems to me that the block chain could be complete even when this gets set to false. It gets set to true when we're undoing a toplevel transaction (not a subxact) and we were able to fetch all of the undo for that toplevel transaction. But even if that's not true, the chain for an individual block could still be complete, because all the remaining undo for the block at issue might've been in the chunk of undo we already read; the remaining undo could be for other blocks. For that reason, I can't see how the zheap code that relies on this value can be correct; it uses this value to decide whether to stick zeroes in the transaction slot, but if the scenario described above happened, then I suppose the XID would not get cleared from the slot during undo.
Maybe zheap is just relying on that being harmless, since if all of the undo actions have been correctly executed for the page, the fact that the transaction slot is still bogusly used by an aborted xact won't matter; nothing will refer to it. However, it seems to me that it would be better for zheap to set things up so that the first undo record for a particular txn/page combination is flagged in some way (in the payload!) so that undo can zero the slot if the action being undone is the one that claimed the slot. That seems cleaner on principle, and it also avoids having supposedly AM-independent code pass down details that are driven by zheap's particular needs. While it's probably moot since I think this code should go away anyway, I find it poor style to write something like: + if (nopartial && !UndoRecPtrIsValid(urec_ptr)) + blk_chain_complete = true; + else + blk_chain_complete = false; "if (x) y = true; else y = false;" can be more compactly written as "y = x;", like this: blk_chain_complete = nopartial && !UndoRecPtrIsValid(urec_ptr); I think that the signature for rm_undo can be simplified considerably. I think blk_chain_complete should go away for the reasons discussed above. Also, based on our conversations with Heikki at PGCon, we decided that we should not presume that the AM wants the records grouped by block, so the blkno argument should go away. In addition, I don't see much reason to have a first_idx argument. Instead of passing a pointer to the caller's entire array and telling the callback where to start looking, couldn't we just pass a pointer to the first record the callback should examine, i.e. instead of passing urp_array, pass urp_array + first_idx. Then instead of having a last_idx argument, have an argument for the number of entries in the array, computed as last_idx - first_idx + 1. With those changes, rm_undo would look like this: bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid, FullTransactionId full_xid); Now for the $10m question: why even pass reloid and full_xid? Aren't those values going to be present inside every UnpackedUndoRecord? Why not just let the callback get them from the first record (or however it wants to do things)? Perhaps there is some documentation value here in that it implies that the value will be the same for every record, but we could also handle that by just documenting in the appropriate places that undo is done by transaction and relation and therefore the callback is entitled to assume that the same value will be present in every record. Then again, I am not sure we really want the callback to assume that reloid doesn't change. I don't see a reason offhand not to just pass as many records as we have for a given transaction and let the callback do what it likes. So maybe that's another reason to get rid of the reloid argument, at least. And then we could document that all the records will have the same full_xid (unless we decide that we don't want to guarantee that either). Additionally, it strikes me that urp_array is not the greatest name. Generally, putting _array into the name of the variable to indicate that it's an array doesn't seem all that great from a coding-style perspective. I mean, sometimes it's the best you can do, but it's not amazing. And urp seems like it's using an abbreviation without any real reason.
For contrast, consider this existing precedent: extern SysScanDesc systable_beginscan_ordered(Relation heapRelation, Relation indexRelation, Snapshot snapshot, int nkeys, ScanKey key); Or this one: extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs); Notice that in each case the array parameter (which is the last one) is named based on what data it contains rather than on the fact that it is an array. Finally, I observe that rm_undo returns a Boolean, but it's not used for anything. The only call to rm_undo in the current patch set is in execute_undo_actions_page, which returns that value to the caller, but the callers just discard it. I suppose maybe this was intended to report success or failure, but I think the way that rm_undo will report failure is to ERROR. Or, if we want to allow a fail-soft behavior for some reason, then the callers all need to check the value. I'm not sure whether there's a use case for that or not. Putting all that together, I suggest a signature like this: void (*rm_undo) (int nrecords, UndoRecInfo *records); Or if we decide we need to have a fail-soft behavior, then like this: bool (*rm_undo) (int nrecords, UndoRecInfo *records); -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
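To make the proposed shape concrete, an AM's callback might end up looking like the sketch below; zheap_undo here and the UndoRecInfo field name are illustrative assumptions, not code from the patch.

static bool
zheap_undo(int nrecords, UndoRecInfo *records)
{
    for (int i = 0; i < nrecords; i++)
    {
        UnpackedUndoRecord *uur = records[i].uur;   /* field name assumed */

        /* Every record belongs to the same transaction; the callback
         * reads reloid/xid from the record itself rather than taking
         * them as separate arguments. */
        /* ... apply this record's undo action here ... */
    }

    return true;    /* all actions applied (or legitimately skipped) */
}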
On Mon, Jun 10, 2019 at 3:00 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Mon, Jun 10, 2019 at 5:35 AM Asim R P <apraveen@pivotal.io> wrote: > > My understanding is smgr pendingDeletes infrastructure will be replaced by these patches. I still see CommitTransaction()calling smgrDoPendingDeletes() in the latest patch set. Am I missing something? > Thanks for looking at the patch. Hello, Here is a new rebased version of the full patch set for orphaned file cleanup. The orphaned file cleanup code itself hasn't changed but there are some changes in lower down patches: * getting rid of more global variables, instead using eg CurrentSession->attached_undo_logs (the session.h infrastructure that is intended to avoid creating more multithreading-hostile code) * using undo log "slots" in various APIs to make it clearer that slots can be recycled, which has locking implications, plus several locking bug fixes that motivated that change * current versions of the record and worker code discussed upthread by Amit and others The code is also at https://github.com/EnterpriseDB/zheap/tree/undo and includes patches from https://github.com/EnterpriseDB/zheap/tree/undoprocessing and https://github.com/EnterpriseDB/zheap/tree/undo_interface_v1 where some parts of this stack (workers etc) are being developed. -- Thomas Munro https://enterprisedb.com
Attachment
On Fri, Jun 14, 2019 at 8:26 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > * current versions of the record and worker code discussed upthread by > Amit and others > Thanks for posting the complete patchset. Last time, I mentioned the remaining work in the undo-processing patchset, the status of which is as follows: 1. Enhance uur_progress so that it updates undo action apply progress at regular intervals. This has been done. The idea is that we update the transaction's undo apply progress at regular intervals so that after a crash we can skip already applied undo. The undo apply progress is updated in terms of the number of blocks processed. I think it is better to change the name of uur_progress to something like uur_apply_progress. Any suggestions? 2. Enhance to support oldestXidHavingUnappliedUndo, more on that later. This has been done. The idea here is that we register all the undo apply (transaction abort) requests in the hash table (referred to as the Rollback Hash Table in the patch) and we have a hard limit (after that we won't allow new transactions to write undo) on how many such requests can be pending. So scanning this table gives us the value of oldestXidHavingUnappliedUndo (actually the value for this will be the smallest of 'xid having pending undo' and 'oldestXmin'). As this rollback hash table is not persistent, after a restart, we need to take a pass over the undo logs to register all the pending abort requests in the rollback hash table. This value serves two main purposes: (a) any Xid below this is all-visible, so it can help in visibility checks; (b) it can help us implement the rule that "No aborted XID with an age >2^31 can have unapplied undo." This helps us decide when to truncate the clog, because we can't truncate the clog for transactions that still have undo. (A code sketch of this computation appears after the patch descriptions below.) 3. Split the patch. The patch is split into five patches. I will give a brief description of each patch, which to a good extent is mentioned in the commit message for each patch as well: 0010-Extend-binary-heap-functionality - This patch adds the routines to allocate a binary heap in shared memory and to remove the nth element from a binary heap. These routines will be used by a later patch that will allow an efficient way to process the pending rollback requests. 0011-Infrastructure-to-register-and-fetch-undo-action-req - This patch provides an infrastructure to register and fetch undo action requests. This infrastructure provides a way to allow execution of undo actions. One might think that we can always execute undo actions on error or explicit rollback by the user; however, there are cases when that is not possible. For example, (a) if the system crashes while doing the operation, then after startup, we need a way to perform undo actions; (b) if we get an error while performing undo actions. Apart from this, when there are large rollback requests, it is quite inefficient to perform all the undo actions and then return control to the user. 0012-Infrastructure-to-execute-pending-undo-actions - This provides an infrastructure to execute pending undo actions. To apply the undo actions, we collect the undo records in bulk and try to process them together. We ensure to update the transaction's progress at regular intervals so that after a crash we can skip already applied undo. This needs some more work to generalize the processing of undo records so that this infrastructure can be used by other AMs as well.
0013-Allow-foreground-transactions-to-perform-undo-action - This patch allows foreground transactions to perform undo actions on abort. We always perform rollback actions after cleaning up the current (sub)transaction. This will ensure that we perform the actions immediately after an error (and release the locks) rather than when the user issues a ROLLBACK command at some later point in time. We are releasing the locks after the undo actions are applied. The reason to delay lock release is that if we release locks before applying undo actions, then a parallel session could acquire the lock before us, which can lead to deadlock. 0014-Allow-execution-and-discard-of-undo-by-background-wo- - This patch allows execution and discard of undo by background workers. The undo launcher is responsible for launching the workers iff there is some work available in one of the work queues and there are more workers available. The worker is launched to handle requests for a particular database. The discard worker is responsible for discarding the undo log of transactions that are committed and all-visible or are rolled back. It also registers the requests for aborted transactions in the work queues. It iterates through all the active logs one-by-one and tries to discard the transactions that are old enough to matter. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
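As a sketch of the computation item 2 above describes: the value is the smaller of oldestXmin and the oldest xid with a pending rollback request. RollbackHashTable, RollbackHashEntry and the use of RollbackRequestLock here are assumptions based on that description, not the actual patch code.

FullTransactionId
OldestXidHavingUnappliedUndo(FullTransactionId oldest_xmin)
{
    FullTransactionId oldest = oldest_xmin;
    HASH_SEQ_STATUS status;
    RollbackHashEntry *ent;

    /* Scan the (shared, bounded) rollback hash table for the oldest
     * transaction that still has unapplied undo. */
    LWLockAcquire(RollbackRequestLock, LW_SHARED);
    hash_seq_init(&status, RollbackHashTable);
    while ((ent = hash_seq_search(&status)) != NULL)
    {
        if (FullTransactionIdPrecedes(ent->full_xid, oldest))
            oldest = ent->full_xid;
    }
    LWLockRelease(RollbackRequestLock);

    return oldest;
}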
On Thu, Jun 13, 2019 at 3:13 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, May 27, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Apart from fixing the above comments, the patch is rebased on the latest > > undo patchset. As of now, I have split the binaryheap.c changes into > > a separate patch. We are still enhancing the patch to compute > > oldestXidHavingUnappliedUndo which touches various parts of the patch, so > > splitting further without completing that can make it a bit difficult > > to work on that. > > Some review comments around execute_undo_actions: > > The 'nopartial' argument to execute_undo_actions is confusing. First, > it would probably be worth spelling it out rather than abbreviating: > not_partial_transaction rather than nopartial. Second, it is usually > better to phrase parameter names in terms of what they are rather than > in terms of what they are not: complete_transaction rather than > not_partial_transaction. Third, it's unclear from these comments why > we'd be undoing something other than a complete transaction. It looks > as though the answer is that this flag will be false when we're > undoing a subxact -- in which case, why not invert the sense of the > flag and call it 'bool subxact'? I might be wrong, but it seems like > that would be a whole lot clearer. > The idea was that it could be used for multiple purposes like 'rolling back complete xact', 'rolling back subxact', 'rollback at page-level' or any similar future need, even though not all code paths use that function. I am not wedded to any particular name here, but among your suggestions complete_transaction sounds better to me. Are you okay going with that? > Fourth, the block at the top of > this function, guarded by nopartial, seems like it must be vulnerable > to race conditions. If we're undoing the complete transaction, then > it checks whether UndoFetchRecord() still returns anything. But that > could change not just at the beginning of the function, but also any > time in the middle, or so it seems to me. > It won't change in between because we have ensured at the top level that no two processes can start executing pending undo at the same time. Basically, anyone who wants to execute the undo actions will have an entry in the rollback hash table, and that will be marked as in-progress. As mentioned in the comments, the race is only "after the discard worker fetches the record and finds that this transaction needs to be rolled back, a backend might concurrently execute the actions and remove the request from the rollback hash table." > I doubt that this is the > right level at which to deal with this sort of interlocking. I think > there should be some higher-level mechanism that prevents two > processes from trying to undo the same transaction at the same time, > like a heavyweight lock or some kind of flag in the shared memory data > structure that keeps track of pending undo, so that we never even > reach this code unless we know that this XID needs undo work > Introducing a heavyweight lock can create a different sort of problem because we need to hold it till all the actions are applied, to avoid what I have mentioned above. The problem will be that the discard worker will be blocked till the backend/undo worker applies the complete set of actions, unless we just take this lock conditionally in the discard worker.
Another way could be that we re-fetch the undo record when we are registering the undo request under RollbackRequestLock and check its status again, because in that case the backend or another undo worker won't be able to remove the request from the hash table concurrently. However, the advantage of checking it in execute_undo_actions is that we can optimize it in the future to avoid re-fetching this record when actually fetching the records to apply undo actions. > and no > other process is already doing it. > This part is already ensured in the current code. > > The 'blk_chain_complete' variable which is set in this function and > passed down to execute_undo_actions_page() and then to the rmgr's > rm_undo callback also looks problematic. > I agree this parameter should go away from the generic interface considering the requirements from zedstore. > First, not every AM that > uses undo may even have the notion of a 'block chain'; zedstore for > example uses TIDs as a 48-bit integer, not a block + offset number, so > it's really not going to have a 'block chain.' Second, even in > zheap's usage, it seems to me that the block chain could be complete > even when this gets set to false. It gets set to true when we're > undoing a toplevel transaction (not a subxact) and we were able to > fetch all of the undo for that toplevel transaction. But even if > that's not true, the chain for an individual block could still be > complete, because all the remaining undo for the block at issue > might've been in the chunk of undo we already read; the remaining undo > could be for other blocks. For that reason, I can't see how the zheap > code that relies on this value can be correct; it uses this value to > decide whether to stick zeroes in the transaction slot, but if the > scenario described above happened, then I suppose the XID would not > get cleared from the slot during undo. Maybe zheap is just relying on > that being harmless, since if all of the undo actions have been > correctly executed for the page, the fact that the transaction slot is > still bogusly used by an aborted xact won't matter; nothing will refer > to it. However, it seems to me that it would be better for zheap to > set things up so that the first undo record for a particular txn/page > combination is flagged in some way (in the payload!) so that undo can > zero the slot if the action being undone is the one that claimed the > slot. That seems cleaner on principle, and it also avoids having > supposedly AM-independent code pass down details that are driven by > zheap's particular needs. > Yeah, we can do what you are suggesting for zheap, or in many cases, we should be able to detect it via uur_blkprev of the last record of the page. The invalid value will indicate that the chain for the page is complete.
Then instead of having a > last_idx argument, have an argument for the number of entries in the > array, computed as last_idx - first_idx + 1. With those changes, > rm_undo would look like this: > > bool (*rm_undo) (UndoRecInfo *urp_array, int count, Oid reloid, > FullTransactionId full_xid); > I agree. > Now for the $10m question: why even pass reloid and full_xid? Aren't > those values going to be present inside every UnpackedUndoRecord? Why > not just let the callback get them from the first record (or however > it wants to do things)? Perhaps there is some documentation value > here in that it implies that the value will be the same for every > record, but we could also handle that by just documenting in the > appropriate places that undo is done by transaction and relation and > therefore the callback is entitled to assume that the same value will > be present in every record. Then again, I am not sure we really want > the callback to assume that reloid doesn't change. I don't see a > reason offhand not to just pass as many records as we have for a given > transaction and let the callback do what it likes. So maybe that's > another reason to get rid of the reloid argument, at least. And then > we could document that all the record will have the same full_xid > (unless we decide that we don't want to guarantee that either). > > Additionally, it strikes me that urp_array is not the greatest name. > Generally, putting _array into the name of the variable to indicate > that it's an array doesn't seem all that great from a coding-style > perspective. I mean, sometimes it's the best you can do, but it's not > amazing. And urp seems like it's using an abbreviation without any > real reason. For contrast, consider this existing precedent: > > extern SysScanDesc systable_beginscan_ordered(Relation heapRelation, > Relation indexRelation, > Snapshot snapshot, > int nkeys, ScanKey key); > > Or this one: > > extern TupleDesc CreateTupleDesc(int natts, Form_pg_attribute *attrs); > > Notice that in each case the array parameter (which is the last one) > is named based on what data it contains rather than on the fact that > it is an array. > Agreed, will change accordingly. > Finally, I observe that rm_undo returns a Boolean, but it's not used > for anything. The only call to rm_undo in the current patch set is in > execute_undo_actions_page, which returns that value to the caller, but > the callers just discard it. I suppose maybe this was intended to > report success or failure, but I think the way that rm_undo will > report failure is to ERROR. > For Error case, it is fine to report failure, but there can be cases where we don't need to apply undo actions like when the relation is dropped/truncated, undo actions are already applied. The original idea was to cover such cases by the return value. I agree that currently, caller ignores this value, but there is some value in keeping it. So, I am in favor of a signature with bool as the return value. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jun 17, 2019 at 6:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > The idea was that it could be use for multiple purposes like 'rolling > back complete xact', 'rolling back subxact', 'rollback at page-level' > or any similar future need even though not all code paths use that > function. I am not wedded to any particular name here, but among your > suggestions complete_transaction sounds better to me. Are you okay > going with that? Sure, let's try that for now and see how it looks. We can always change it again if it seems to be a good idea later. > It won't change in between because we have ensured at top-level that > no two processes can start executing pending undo at the same time. > Basically, anyone wants to execute the undo actions will have an entry > in rollback hash table and that will be marked as in-progress. As > mentioned in comments, the race is only "after discard worker > fetches the record and found that this transaction need to be rolled > back, backend might concurrently execute the actions and remove the > request from rollback hash table." > > [ discussion of alternatives ] I'm not precisely sure what the best thing to do here is, but I'm skeptical that the code in question belongs in this function. There are two separate things going on here: one is this revalidation that the undo hasn't been discarded, and the other is executing the undo actions. Those are clearly separate tasks, and they are not tasks that always get done together: sometimes we do only one, and sometimes we do both. Any function that looks like this is inherently suspicious: whatever(....., bool flag) { if (flag) { // lengthy block of code } // another lengthy block of code } There has to be a reason not to just split this into two functions and let the caller decide whether to call one or both. > For Error case, it is fine to report failure, but there can be cases > where we don't need to apply undo actions like when the relation is > dropped/truncated, undo actions are already applied. The original > idea was to cover such cases by the return value. I agree that > currently, caller ignores this value, but there is some value in > keeping it. So, I am in favor of a signature with bool as the return > value. OK. So then the callers can't keep ignoring it... and there should be some test framework that verifies the behavior when the return value is false. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 17, 2019 at 7:30 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jun 17, 2019 at 6:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I'm not precisely sure what the best thing to do here is, but I'm > skeptical that the code in question belongs in this function. There > are two separate things going on here: one is this revalidation that > the undo hasn't been discarded, and the other is executing the undo > actions. Those are clearly separate tasks, and they are not tasks that > always get done together: sometimes we do only one, and sometimes we > do both. Any function that looks like this is inherently suspicious: > > whatever(....., bool flag) > { > if (flag) > { > // lengthy block of code > } > > // another lengthy block of code > } > > There has to be a reason not to just split this into two functions and > let the caller decide whether to call one or both. > Yeah, because some of the information required to perform the necessary steps (in the code under the flag) is quite central to this function (see undo apply progress update part) and it is used at more than one place in this function. I have refactored the code in this function, see if it makes sense now. You need to check patch 0012-Infrastructure-to-execute-pending-undo-actions.patch for these changes. > > For Error case, it is fine to report failure, but there can be cases > > where we don't need to apply undo actions like when the relation is > > dropped/truncated, undo actions are already applied. The original > > idea was to cover such cases by the return value. I agree that > > currently, caller ignores this value, but there is some value in > > keeping it. So, I am in favor of a signature with bool as the return > > value. > > OK. So then the callers can't keep ignoring it... > I again thought about this but couldn't come up with anything meaningful. The idea is to ignore some undo records if they belong to the same relation which is already gone. I think we can do something about it in zheap specific code and make the generic code return void. I have fixed the other comments raised by you. See 0012-Infrastructure-to-execute-pending-undo-actions.patch Apart from the changes related to the undo apply, this patch series contains changes for making the transaction header at a location immediately after UndoRecordHeader which makes it easy to update the same. The changes are in patches 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch and 0012-Infrastructure-to-execute-pending-undo-actions.patch. There are no changes in undo log module patches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0005-Add-prefetch-support-for-the-undo-log.patch
- 0001-Add-SmgrId-to-smgropen-and-BufferTag.patch
- 0004-Allow-WAL-record-data-on-first-modification-after-a-.patch
- 0002-Move-tablespace-dir-creation-from-smgr.c-to-md.c.patch
- 0003-Add-undo-log-manager.patch
- 0006-Defect-and-enhancement-in-multi-log-support.patch
- 0008-Test-module-for-undo-api.patch
- 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
- 0010-Extend-binary-heap-functionality.patch
- 0009-undo-page-consistency-checker.patch
- 0012-Infrastructure-to-execute-pending-undo-actions.patch
- 0011-Infrastructure-to-register-and-fetch-undo-action-req.patch
- 0013-Allow-foreground-transactions-to-perform-undo-action.patch
- 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch
On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > [ new patches ] I tried writing some code that throws an error from an undo log handler and the results were not good. It appears that the code will retry in a tight loop: 2019-06-18 13:58:53.262 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo 2019-06-18 13:58:53.264 EDT [42803] ERROR: robert_undo It seems clear that the error-handling aspect of this patch has not been given enough thought. It's debatable what strategy should be used when undo fails, but retrying 40 times per millisecond isn't the right answer. I assume we want some kind of cool-down between retries. 10 seconds? A minute? Some kind of back-off algorithm that gradually increases the retry time up to some maximum? Should there be one or more GUCs? Another thing that is not very nice is that when I tried to shut down the server via 'pg_ctl stop' while the above was happening, it did not shut down. I had to use an immediate shutdown. That's clearly not OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [ new patches ] > > I tried writing some code that throws an error from an undo log > handler and the results were not good. I discovered another bothersome thing here: if you have a single transaction that generates a bunch of undo records, the first one has uur_dbid set correctly and the remaining records have uur_dbid set to 0. That means if you try to write a sanity check like if (record->uur_dbid != MyDatabaseId) elog(ERROR, "the undo system messed up") it fails. The original idea of UnpackedUndoRecord was this: you would put a bunch of data into an UnpackedUndoRecord that you wanted written to undo, and the undo system would find ways to compress stuff out of the on-disk representation by e.g. omitting the fork number if it's MAIN_FORKNUM. Then, when you read an undo record, it would decompress so that you ended up with the same UnpackedUndoRecord that you had at the beginning. However, the inclusion of transaction headers has made this a bit confusing: that stuff isn't being added by the user but by the undo system itself. It's not very clear from the comments what the contract is around these things: do you need to set uur_dbid to MyDatabaseId when preparing to insert an undo record? Or can you just leave it unset and then it'll possibly be set at decoding time? The comments for the UnpackedUndoRecord structure don't explain this. I'd really like to see this draw a cleaner distinction between the stuff that the user is expected to set and the other stuff we deal with internally to the undo subsystem. For example, suppose that UnpackedUndoRecord didn't include any of the fields that are only present in the transaction header. Maybe there's another structure, like UndoTransactionHeader, that includes those fields. The client of the undo subsystem creates a bunch of UnpackedUndoRecords and inserts them. At undo time, the callback gets back an identical set of UnpackedUndoRecords. And maybe it also gets a pointer to the UndoTransactionHeader which contains all of the system-generated stuff. Under this scheme, uur_xid, uur_xidepoch (which still need to be combined into uur_fxid), uur_progress, uur_dbid, uur_next, uur_prevlogstart, and uur_prevurp would all move out of the UnpackedUndoRecord and into the UndoTransactionHeader. The user would supply none of those things when inserting undo records, but the rm_undo callback could examine those values if it wished. A weaker approach would be to at least clean up the structure definition so that the transaction-header fields set by the system are clearly segregated from the per-record fields set by the undo-inserter, with comments explaining that those fields don't need to be set but will (or may?) be set at undo time. That would be better than what we have right now - because it would hopefully make it much more clear which fields need to be set on insert and which fields can be expected to be set when decoding - but I think it's probably not going far enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 18, 2019 at 11:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [ new patches ] > > I tried writing some code that throws an error from an undo log > handler and the results were not good. It appears that the code will > retry in a tight loop: > > 2019-06-18 13:58:53.262 EDT [42803] ERROR: robert_undo > 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo > 2019-06-18 13:58:53.263 EDT [42803] ERROR: robert_undo .. > > It seems clear that the error-handling aspect of this patch has not > been given enough thought. It's debatable what strategy should be > used when undo fails, but retrying 40 times per millisecond isn't the > right answer. > The reason for the same is that currently, the undo worker keep on executing the requests if there are any. I think this is good when there are different requests, but getting the same request from error queue and doing it, again and again, doesn't seem to be good and I think it will not help either. > I assume we want some kind of cool-down between retries. > 10 seconds? A minute? Some kind of back-off algorithm that gradually > increases the retry time up to some maximum? > Yeah, something on these lines would be good. How about if we add failure_count with each request in error queue? Now, it will get incremented on each retry and we can wait in proportion to that, say 10s after the first retry, 20s after second and so on and maximum up to 10 failure_count (100s) will be allowed after which worker will exit considering it has no more work to do. Actually, we also need to think about what we should with such requests because even if undo worker exits after retrying for some threshold number of times, undo launcher will again launch a new worker for this request unless we have some special handling for the same. We can issue some WARNING once any particular request reached the maximum number of retries but not sure if that is enough because the user might not notice the same or didn't take any action. Do we want to PANIC at some point of time, if so, when or the other alternative is we can try at regular intervals till we succeed? > Should there be one or > more GUCs? > Yeah, we can do that, something like undo_apply_error_retry_count, but I am not completely sure about this, maybe some pre-defined number say 10 or 20 should be enough. However, I am fine if you or others think that a guc can help users in this case. > Another thing that is not very nice is that when I tried to shut down > the server via 'pg_ctl stop' while the above was happening, it did not > shut down. I had to use an immediate shutdown. That's clearly not > OK. > CHECK_FOR_INTERRUPTS is missing at one place, will fix. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 19, 2019 at 2:40 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > [ new patches ] > > > > I tried writing some code that throws an error from an undo log > > handler and the results were not good. > > I discovered another bothersome thing here: if you have a single > transaction that generates a bunch of undo records, the first one has > uur_dbid set correctly and the remaining records have uur_dbid set to > 0. That means if you try to write a sanity check like if > (record->uur_dbid != MyDatabaseId) elog(ERROR, "the undo system messed > up") it fails. > > The original idea of UnpackedUndoRecord was this: you would put a > bunch of data into an UnpackedUndoRecord that you wanted written to > undo, and the undo system would find ways to compress stuff out of the > on-disk representation by e.g. omitting the fork number if it's > MAIN_FORKNUM. Then, when you read an undo record, it would decompress > so that you ended up with the same UnpackedUndoRecord that you had at > the beginning. However, the inclusion of transaction headers has made > this a bit confusing: that stuff isn't being added by the user but by > the undo system itself. It's not very clear from the comments what the > contract is around these things: do you need to set uur_dbid to > MyDatabaseId when preparing to insert an undo record? Or can you just > leave it unset and then it'll possibly be set at decoding time? The > comments for the UnpackedUndoRecord structure don't explain this. > > I'd really like to see this draw a cleaner distinction between the > stuff that the user is expected to set and the other stuff we deal > with internally to the undo subsystem. For example, suppose that > UnpackedUndoRecord didn't include any of the fields that are only > present in the transaction header. Maybe there's another structure, > like UndoTransactionHeader, that includes those fields. The client of > the undo subsystem creates a bunch of UnpackedUndoRecords and inserts > them. At undo time, the callback gets back an identical set of > UnpackedUndoRecords. And maybe it also gets a pointer to the > UndoTransactionHeader which contains all of the system-generated > stuff. Under this scheme, uur_xid, uur_xidepoch (which still need to > be combined into uur_fxid), uur_progress, uur_dbid, uur_next, > uur_prevlogstart, and uur_prevurp would all move out of the > UnpackedUndoRecord and into the UndoTransactionHeader. The user would > supply none of those things when inserting undo records, but the > rm_undo callback could examine those values if it wished. > > A weaker approach would be to at least clean up the structure > definition so that the transaction-header fields set by the system are > clearly segregated from the per-record fields set by the > undo-inserter, with comments explaining that those fields don't need > to be set but will (or may?) be set at undo time. That would be better > than what we have right now - because it would hopefully make it much > more clear which fields need to be set on insert and which fields can > be expected to be set when decoding - but I think it's probably not > going far enough. I think it's a fair point. We can keep pointer to UndoRecordTransaction(urec_progress, dbid, uur_next) and UndoRecordLogSwitch(urec_prevurp, urec_prevlogstart) in UnpackedUndoRecord and include them whenever undo record contain these headers. 
Transaction header in the first record of the transaction and log-switch header in the first record after undo-log switch during a transaction. IMHO uur_fxid, we can keep as part of the main UnpackedUndoRecord, because as part of the other work "Compression for undo records to consider rmgrid, xid,cid,reloid for each record", the FullTransactionId, will be present in every UnpackedUndoRecord (although it will not be stored in every undo record). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > The reason for the same is that currently, the undo worker keep on > executing the requests if there are any. I think this is good when > there are different requests, but getting the same request from error > queue and doing it, again and again, doesn't seem to be good and I > think it will not help either. Even if there are multiple requests involved, you don't want a tight loop like this. > > I assume we want some kind of cool-down between retries. > > 10 seconds? A minute? Some kind of back-off algorithm that gradually > > increases the retry time up to some maximum? > > Yeah, something on these lines would be good. How about if we add > failure_count with each request in error queue? Now, it will get > incremented on each retry and we can wait in proportion to that, say > 10s after the first retry, 20s after second and so on and maximum up > to 10 failure_count (100s) will be allowed after which worker will > exit considering it has no more work to do. > > Actually, we also need to think about what we should with such > requests because even if undo worker exits after retrying for some > threshold number of times, undo launcher will again launch a new > worker for this request unless we have some special handling for the > same. > > We can issue some WARNING once any particular request reached the > maximum number of retries but not sure if that is enough because the > user might not notice the same or didn't take any action. Do we want > to PANIC at some point of time, if so, when or the other alternative > is we can try at regular intervals till we succeed? PANIC is a terrible idea. How would that fix anything? You'll very possibly still have the same problem after restarting, and so you'll just keep on hitting the PANIC. That will mean that in addition to whatever problem with undo you already had, you now have a system that you can't use for anything at all, because it keeps restarting. The design goal here should be that if undo for a transaction fails, we keep retrying periodically, but with minimal adverse impact on the rest of the system. That means you can't retry in a loop. It also means that the system needs to provide fairness: that is, it shouldn't be possible to create a system where one or more transactions for which undo keeps failing cause other transactions that could have been undone to get starved. It seems to me that thinking of this in terms of what the undo worker does and what the undo launcher does is probably not the right approach. We need to think of it more as an integrated system. Instead of storing a failure_count with each request in the error queue, how about storing a next retry time? I think the error queue needs to be ordered by database_id, then by next_retry_time, and then by order of insertion. (The last part is important because next_retry_time is going to be prone to having ties, and we need to break those ties in the right way.) So, when a per-database worker starts up, it's pulling from the queues in alternation, ignoring items that are not for the current database. When it pulls from the error queue, it looks at the item for the current database that has the lowest retry time - if that's still in the future, then it ignores the queue until something new (perhaps with a lower retry_time) is added, or until the first next_retry_time arrives. If the item that it pulls again fails, it gets inserted back into the error queue but with a higher next retry time. 
This might not be exactly right, but the point is that there should probably be NO logic that causes a worker to retry the same transaction immediately afterward, with or without a delay. It should be all be driven off what gets pulled out of the error queue. In the above sketch, if a worker gets to the point where there's nothing in the error queue for the current database with a timestamp that is <= the current time, then it can't pull anything else from that queue; if there's no other work to do, it exits. If there is other work to do, it does that and then maybe enough time will have passed to allow something to be pulled from the error queue, or maybe not. Meanwhile some other worker running in the same database might pull the item before the original worker gets back to it. Meanwhile if the worker exits because there's nothing more to do in that database, the launcher can also see the error queue. When enough time has passed, it can notice that there is an item (or items) that could be pulled from the error queue for that database and launch a worker for that database if necessary (or else let an existing worker take care of it). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [ new patches ] > > I tried writing some code [ to use these patches ]. I spent some more time experimenting with this patch set today and I think that the UndoFetchRecord interface is far too zheap-centric. I expected that I would be able to do this: UnpackedUndoRecord *uur = UndoFetchRecord(urp); // do stuff with uur UndoRecordRelease(uur); But I can't, because the UndoFetchRecord API requires me to pass not only an undo record but also a block number, an offset number, an XID, and a callback. I think I could get the effect that I want by defining a callback that always returns true. Then I could do: UndoRecPtr junk; UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber, InvalidOffsetNumber, &junk, always_returns_true); // do stuff with uur UndoRecordRelease(uur); That seems ridiculously baroque. I think the most common thing that an AM will want to do with an UndoRecPtr is look up that exact record; that is, for example, what zedstore will want to do. However, even if some AMs, like zheap, want to search backward through a chain of records, there's no real reason to suppose that all of them will want to search by block number + offset. They might want to search by some bit of data buried in the payload, for example. I think the basic question here is whether we really need anything more complicated than: extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp); I mean, if you had that, the caller can implement looping easily enough, and insert any test they want: for (;;) { UnpackedUndoRecord *uur = UndoFetchRecord(urp); if (i like this one) break; urp = uur->uur_blkprev; // should be renamed, since zedstore + probably others will have tuple chains not block chains UndoRecordRelease(uur); } The question in my mind is whether there's some performance advantage of having the undo layer manage the looping rather than the caller do it. If there is, then there's a lot of zheap code that ought to be changed to use it, because it's just using the same satisfies-callback everywhere. If there's not, we should just simplify the undo record lookup along the lines mentioned above and put all the looping into the callers. zheap could provide a wrapper around UndoFetchRecord that does a search by block and offset, so that we don't have to repeat that logic in multiple places. BTW, an actually generic iterator interface would probably look more like this: typedef bool (*SatisfyUndoRecordCallback)(void *callback_data, UnpackedUndoRecord *uur); extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr *found, SatisfyUndoRecordCallback callback, void *callback_data); Now we're not assuming anything about what parts of the record the callback wants to examine. It can do whatever it likes. Typically with this sort of interface a caller will define a file-private struct that is known to both the callback and the caller of UndoFetchRecord, but not elsewhere. If we decide we need an iterator within the undo machinery itself, then I think it should look like the above, and I think it should accept NULL for found, callback, and callback_data, so that somebody who wants to just look up a record, full stop, can do just: UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL); which seems tolerable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 19, 2019 at 9:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think it's a fair point. We can keep pointer to > UndoRecordTransaction(urec_progress, dbid, uur_next) and > UndoRecordLogSwitch(urec_prevurp, urec_prevlogstart) in > UnpackedUndoRecord and include them whenever undo record contain these > headers. Transaction header in the first record of the transaction > and log-switch header in the first record after undo-log switch during > a transaction. IMHO uur_fxid, we can keep as part of the main > UnpackedUndoRecord, because as part of the other work "Compression > for undo records to consider rmgrid, xid,cid,reloid for each record", > the FullTransactionId, will be present in every UnpackedUndoRecord > (although it will not be stored in every undo record). I agree that fxid needs to be set all the time. I'm not sure I'm entirely following the rest of what you are saying here, but let me say again that I don't think UnpackedUndoRecord should include a bunch of stuff that callers (1) don't need to set when inserting and (2) can't count on having set when fetching. Stuff of that type should be handled in some way that spares clients of the undo system from having to worry about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 19, 2019 at 8:25 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > The reason for the same is that currently, the undo worker keep on > > executing the requests if there are any. I think this is good when > > there are different requests, but getting the same request from error > > queue and doing it, again and again, doesn't seem to be good and I > > think it will not help either. > > Even if there are multiple requests involved, you don't want a tight > loop like this. > Okay, one reason that comes to mind is we don't want to choke the system as applying undo can consume CPU and generate a lot of I/O. Is that you have in mind or something else? I see an advantage in having some sort of throttling here, so we can have some wait time (say 100ms) between processing requests. Do we see any need of guc here? I think on one side it seems a good idea to have multiple guc's for tuning undo worker machinery because whatever default values we pick might not be good for some of the users. OTOH, giving too many guc's can also make the system difficult to understand and can confuse users or at the least, they won't know how exactly to use those. It seems to me that we should first complete the entire patch and then we can decide which all things need separate guc. > > > I assume we want some kind of cool-down between retries. > > > 10 seconds? A minute? Some kind of back-off algorithm that gradually > > > increases the retry time up to some maximum? > > > > Yeah, something on these lines would be good. How about if we add > > failure_count with each request in error queue? Now, it will get > > incremented on each retry and we can wait in proportion to that, say > > 10s after the first retry, 20s after second and so on and maximum up > > to 10 failure_count (100s) will be allowed after which worker will > > exit considering it has no more work to do. > > > > Actually, we also need to think about what we should with such > > requests because even if undo worker exits after retrying for some > > threshold number of times, undo launcher will again launch a new > > worker for this request unless we have some special handling for the > > same. > > > > We can issue some WARNING once any particular request reached the > > maximum number of retries but not sure if that is enough because the > > user might not notice the same or didn't take any action. Do we want > > to PANIC at some point of time, if so, when or the other alternative > > is we can try at regular intervals till we succeed? > > PANIC is a terrible idea. How would that fix anything? You'll very > possibly still have the same problem after restarting, and so you'll > just keep on hitting the PANIC. That will mean that in addition to > whatever problem with undo you already had, you now have a system that > you can't use for anything at all, because it keeps restarting. > > The design goal here should be that if undo for a transaction fails, > we keep retrying periodically, but with minimal adverse impact on the > rest of the system. That means you can't retry in a loop. It also > means that the system needs to provide fairness: that is, it shouldn't > be possible to create a system where one or more transactions for > which undo keeps failing cause other transactions that could have been > undone to get starved. > Agreed. 
> It seems to me that thinking of this in terms of what the undo worker > does and what the undo launcher does is probably not the right > approach. We need to think of it more as an integrated system. Instead > of storing a failure_count with each request in the error queue, how > about storing a next retry time? > I think both failure_count and next_retry_time can work in a similar way. I think incrementing next retry time in multiples will be a bit tricky. Say first-time error occurs at X hours. We can say that next_retry_time will X+10s=Y and error_occured_at will be X. The second time it again failed, how will we know that we need set next_retry_time as Y+20s, maybe we can do something like Y-X and then add 10s to it and add the result to the current time. Now whenever the worker or launcher finds this request, they can check if the current_time is greater than or equal to next_retry_time, if so they can pick that request, otherwise, they check request in next queue. The failure_count can also work in a somewhat similar fashion. Basically, we can use error_occurred at and failure_count to compute the required time. So, if error is occurred at say X hours and failure count is 3, then we can check if current_time is greater than X+(3 * 10s), then we will allow the entry to be processed, otherwise, it will check other queues for work. > I think the error queue needs to be > ordered by database_id, then by next_retry_time, and then by order of > insertion. (The last part is important because next_retry_time is > going to be prone to having ties, and we need to break those ties in > the right way.) > I think it makes sense to order requests by next_retry_time, error_occurred_at (this will ensure the order of insertion). However, I am not sure if there is a need to club the requests w.r.t database id. It can starve the error requests from other databases. Moreover, we already have a functionality wherein if the undo worker doesn't encounter the next request from the same database on which it is operating for a certain amount of time, then it will peek ahead (few entries) in each queue to get the request for the same database. We don't sort by db_id in other queues as well, so it will be consistent for this queue if we just sort by next_retry_time and error_occurred_at. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > [ new patches ] > > > > I tried writing some code [ to use these patches ]. > > I spent some more time experimenting with this patch set today and I > think that the UndoFetchRecord interface is far too zheap-centric. I > expected that I would be able to do this: > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > // do stuff with uur > UndoRecordRelease(uur); > > But I can't, because the UndoFetchRecord API requires me to pass not > only an undo record but also a block number, an offset number, an XID, > and a callback. I think I could get the effect that I want by > defining a callback that always returns true. Then I could do: > > UndoRecPtr junk; > UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber, > InvalidOffsetNumber, &junk, always_returns_true); > // do stuff with uur > UndoRecordRelease(uur); > > That seems ridiculously baroque. I think the most common thing that > an AM will want to do with an UndoRecPtr is look up that exact record; > that is, for example, what zedstore will want to do. However, even if > some AMs, like zheap, want to search backward through a chain of > records, there's no real reason to suppose that all of them will want > to search by block number + offset. They might want to search by some > bit of data buried in the payload, for example. > > I think the basic question here is whether we really need anything > more complicated than: > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp); > > I mean, if you had that, the caller can implement looping easily > enough, and insert any test they want: > > for (;;) > { > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > if (i like this one) > break; > urp = uur->uur_blkprev; // should be renamed, since zedstore + > probably others will have tuple chains not block chains > UndoRecordRelease(uur); > } The idea behind having the loop inside the undo machinery was that while traversing the blkprev chain, we can read all the undo records on the same undo page under one buffer lock. > > The question in my mind is whether there's some performance advantage > of having the undo layer manage the looping rather than the caller do > it. If there is, then there's a lot of zheap code that ought to be > changed to use it, because it's just using the same satisfies-callback > everywhere. If there's not, we should just simplify the undo record > lookup along the lines mentioned above and put all the looping into > the callers. zheap could provide a wrapper around UndoFetchRecord > that does a search by block and offset, so that we don't have to > repeat that logic in multiple places. > > BTW, an actually generic iterator interface would probably look more like this: > > typedef bool (*SatisfyUndoRecordCallback)(void *callback_data, > UnpackedUndoRecord *uur); Right, it should be this way. > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr > *found, SatisfyUndoRecordCallback callback, void *callback_data); > > Now we're not assuming anything about what parts of the record the > callback wants to examine. It can do whatever it likes. Typically > with this sort of interface a caller will define a file-private struct > that is known to both the callback and the caller of UndoFetchRecord, > but not elsewhere. 
> > If we decide we need an iterator within the undo machinery itself, > then I think it should look like the above, and I think it should > accept NULL for found, callback, and callback_data, so that somebody > who wants to just look up a record, full stop, can do just: > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL); > > which seems tolerable. > I agree with this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 20, 2019 at 2:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > [ new patches ] > > > > > > I tried writing some code [ to use these patches ]. > > > > I spent some more time experimenting with this patch set today and I > > think that the UndoFetchRecord interface is far too zheap-centric. I > > expected that I would be able to do this: > > > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > // do stuff with uur > > UndoRecordRelease(uur); > > > > But I can't, because the UndoFetchRecord API requires me to pass not > > only an undo record but also a block number, an offset number, an XID, > > and a callback. I think I could get the effect that I want by > > defining a callback that always returns true. Then I could do: > > > > UndoRecPtr junk; > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber, > > InvalidOffsetNumber, &junk, always_returns_true); > > // do stuff with uur > > UndoRecordRelease(uur); > > > > That seems ridiculously baroque. I think the most common thing that > > an AM will want to do with an UndoRecPtr is look up that exact record; > > that is, for example, what zedstore will want to do. However, even if > > some AMs, like zheap, want to search backward through a chain of > > records, there's no real reason to suppose that all of them will want > > to search by block number + offset. They might want to search by some > > bit of data buried in the payload, for example. > > > > I think the basic question here is whether we really need anything > > more complicated than: > > > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp); > > > > I mean, if you had that, the caller can implement looping easily > > enough, and insert any test they want: > > > > for (;;) > > { > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > if (i like this one) > > break; > > urp = uur->uur_blkprev; // should be renamed, since zedstore + > > probably others will have tuple chains not block chains > > UndoRecordRelease(uur); > > } > > The idea behind having the loop inside the undo machinery was that > while traversing the blkprev chain, we can read all the undo records > on the same undo page under one buffer lock. > I think if we want we can hold this buffer and allow it to be released in UndoRecordRelease. However, this buffer needs to be stored in some common structure which can be then handed over to UndoRecordRelease. Another thing is that as of now the API allocates the memory just once for UnpackedUndoRecord whereas in the new scheme it needs to be allocated again and again. I think this is a relatively minor thing, but it might be better if we can avoid palloc again and again. BTW, while looking at the code of UndoFetchRecord, I see some problem. There is a coding pattern like if() { } else { LWLockAcquire() .. .. } LWLockRelease(). I think this is not correct. > > > > BTW, an actually generic iterator interface would probably look more like this: > > > > typedef bool (*SatisfyUndoRecordCallback)(void *callback_data, > > UnpackedUndoRecord *uur); > Right, it should be this way. 
> > > extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp, UndoRecPtr > > *found, SatisfyUndoRecordCallback callback, void *callback_data); > > > > Now we're not assuming anything about what parts of the record the > > callback wants to examine. It can do whatever it likes. Typically > > with this sort of interface a caller will define a file-private struct > > that is known to both the callback and the caller of UndoFetchRecord, > > but not elsewhere. > > > > If we decide we need an iterator within the undo machinery itself, > > then I think it should look like the above, and I think it should > > accept NULL for found, callback, and callback_data, so that somebody > > who wants to just look up a record, full stop, can do just: > > > > UnpackedUndoRecord *uur = UndoFetchRecord(urp, NULL, NULL, NULL); > > > > which seems tolerable. > > > I agree with this. > +1. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 19, 2019 at 11:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 18, 2019 at 7:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > [ new patches ] > > > > I tried writing some code [ to use these patches ]. > > > for (;;) > { > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > if (i like this one) > break; > urp = uur->uur_blkprev; // should be renamed, since zedstore + > probably others will have tuple chains not block chains .. +1 for renaming this variable. How about uur_prev_ver or uur_prevver or uur_verprev? Any other suggestions? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Okay, one reason that comes to mind is we don't want to choke the > system as applying undo can consume CPU and generate a lot of I/O. Is > that you have in mind or something else? Yeah, mainly that, but also things like log spam, and even pressure on the lock table. If we are trying over and over again to take useless locks, it can affect other things on the system. The main thing, however, is the CPU and I/O consumption. > I see an advantage in having some sort of throttling here, so we can > have some wait time (say 100ms) between processing requests. Do we > see any need of guc here? I don't think that is the right approach. As I said in my previous reply, we need a way of holding off the retry of the same error for a certain amount of time, probably measured in seconds or tens of seconds. Introducing a delay before processing every request is an inferior alternative: if there are a lot of rollbacks, it can cause the system to lag; and in the case where there's just one rollback that's failing, it will still be way too much log spam (and probably CPU time too). Nobody wants 10 failure messages per second in the log. > > It seems to me that thinking of this in terms of what the undo worker > > does and what the undo launcher does is probably not the right > > approach. We need to think of it more as an integrated system. Instead > > of storing a failure_count with each request in the error queue, how > > about storing a next retry time? > > I think both failure_count and next_retry_time can work in a similar way. > > I think incrementing next retry time in multiples will be a bit > tricky. Say first-time error occurs at X hours. We can say that > next_retry_time will X+10s=Y and error_occured_at will be X. The > second time it again failed, how will we know that we need set > next_retry_time as Y+20s, maybe we can do something like Y-X and then > add 10s to it and add the result to the current time. Now whenever > the worker or launcher finds this request, they can check if the > current_time is greater than or equal to next_retry_time, if so they > can pick that request, otherwise, they check request in next queue. > > The failure_count can also work in a somewhat similar fashion. > Basically, we can use error_occurred at and failure_count to compute > the required time. So, if error is occurred at say X hours and > failure count is 3, then we can check if current_time is greater than > X+(3 * 10s), then we will allow the entry to be processed, otherwise, > it will check other queues for work. Meh. Don't get stuck on one particular method of calculating the next retry time. We want to be able to change that easily if whatever we try first doesn't work out well. I am not convinced that we need anything more complex than a fixed retry time, probably controlled by a GUC (undo_failure_retry_time = 10s?). An escalating time between retries would be important and advantageous if we expected the sizes of these queues to grow into the millions, but the current design seems to be contemplating something more in the tends-of-thousands range and I am not sure we're going to need it at that level. We should try simple things first and then see where we need to make it more complex. At some basic level, the queue needs to be ordered by increasing retry time. You can do that with your design, but you have to recompute the next retry time from the error_occurred_at and failure_count values every time you examine an entry. 
It's almost certainly better to store the next_retry_time explicitly. That way, if for example we change the logic for computing the next_retry_time to something really complicated, it doesn't have any effect on the code that keeps the queue in order -- it just looks at the computed value. If we end up with something very simple, like error_occurred_at + constant, it may end up seeming a little silly, but I think that's a price well worth paying for code maintainability. If we end up with error_occurred_at + Min(failure_count * 10, 100) or something of that sort, then we can also store failure_count in each record, but it will just be part of the payload, not the sort key, so adding it or removing it won't affect the code that maintains the queue ordering. > > I think the error queue needs to be > > ordered by database_id, then by next_retry_time, and then by order of > > insertion. (The last part is important because next_retry_time is > > going to be prone to having ties, and we need to break those ties in > > the right way.) > > I think it makes sense to order requests by next_retry_time, > error_occurred_at (this will ensure the order of insertion). However, > I am not sure if there is a need to club the requests w.r.t database > id. It can starve the error requests from other databases. Moreover, > we already have a functionality wherein if the undo worker doesn't > encounter the next request from the same database on which it is > operating for a certain amount of time, then it will peek ahead (few > entries) in each queue to get the request for the same database. We > don't sort by db_id in other queues as well, so it will be consistent > for this queue if we just sort by next_retry_time and > error_occurred_at. You're misunderstanding my point. We certainly do not wish to always pick the request from the database with the lowest OID, or anything like that. However, we do need a worker for a particular database to find the work pending for that database efficiently. Peeking ahead a few requests is a version of that, but I'm not sure it's going to be good enough. Suppose we look ahead 3 requests but there are 10 databases. Then, if all 10 databases have requests pending, it is likely that we won't find the next request for our particular database even though it exists -- the first 3 may easily be for all other databases. If you look ahead more requests, that doesn't really fix it - it just means you need more databases for the problem to become likely. And note that this problem happens even if every database contains a worker. Some of those workers will erroneously think that they should exit. I'm not sure exactly what to do about this. My first thought was that for all of the queues we might need to have a queue per database (or something equivalent) rather than just one big queue. But that has problems too: it will mean that a database worker will never exit as long as there is any work at all to be done in that database, even if some other database is getting starved. Somehow we need to balance the efficiency of having a worker for a particular database process many requests before exiting against the need to ensure fairness across databases, and it doesn't sound to me like we quite know what exactly we ought to do there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 20, 2019 at 6:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > for (;;) > > { > > UnpackedUndoRecord *uur = UndoFetchRecord(urp); > > if (i like this one) > > break; > > urp = uur->uur_blkprev; // should be renamed, since zedstore + > > probably others will have tuple chains not block chains > .. > > +1 for renaming this variable. How about uur_prev_ver or uur_prevver > or uur_verprev? Any other suggestions? Maybe just uur_previous or uur_prevundo or something like that. We've already got a uur_prevurp, but that's really pretty misnamed and IMHO it doesn't belong in this structure anyway. (uur_next is also a bad name and also doesn't belong in this structure.) I don't think we want to use 'ver' because that supposes that undo is being used to track tuple versions, which is a likely use but perhaps not the only one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 20, 2019 at 6:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > BTW, while looking at the code of UndoFetchRecord, I see some problem. > There is a coding pattern like > if() > { > } > else > { > LWLockAcquire() > .. > .. > } > > LWLockRelease(). > > I think this is not correct. Independently of that problem, I think it's probably bad that we're not maintaining the same shared memory state on the master and the standby. Doing the same check in one way on the master and in a different way on the standby is a recipe for surprising and probably bad behavior differences between master and standby servers. Those could be simple things like lock acquire/release not matching, but they could also be things like performance or correctness differences that only materialize under certain scenarios. This is not the only place in the patch set where we have this kind of thing, and I hate them all. I don't exactly know what the solution is, either, but I suspect it will involve either having the recovery process do a more thorough job updating the shared memory state when it does undo-related stuff, or running some of the undo-specific processes on the standby just for the purpose of getting these updates done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 20, 2019 at 8:01 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Okay, one reason that comes to mind is we don't want to choke the > > system as applying undo can consume CPU and generate a lot of I/O. Is > > that you have in mind or something else? > > Yeah, mainly that, but also things like log spam, and even pressure on > the lock table. If we are trying over and over again to take useless > locks, it can affect other things on the system. The main thing, > however, is the CPU and I/O consumption. > > > I see an advantage in having some sort of throttling here, so we can > > have some wait time (say 100ms) between processing requests. Do we > > see any need of guc here? > > I don't think that is the right approach. As I said in my previous > reply, we need a way of holding off the retry of the same error for a > certain amount of time, probably measured in seconds or tens of > seconds. Introducing a delay before processing every request is an > inferior alternative: > This delay is for *not* choking the system by constantly performing undo requests that consume a lot of CPU and I/O as discussed in above point. For holding off the same error request to be re-tried, we need next_retry_time type of method as discussed below. if there are a lot of rollbacks, it can cause > the system to lag; and in the case where there's just one rollback > that's failing, it will still be way too much log spam (and probably > CPU time too). Nobody wants 10 failure messages per second in the > log. > > > > It seems to me that thinking of this in terms of what the undo worker > > > does and what the undo launcher does is probably not the right > > > approach. We need to think of it more as an integrated system. Instead > > > of storing a failure_count with each request in the error queue, how > > > about storing a next retry time? > > > > I think both failure_count and next_retry_time can work in a similar way. > > > > I think incrementing next retry time in multiples will be a bit > > tricky. Say first-time error occurs at X hours. We can say that > > next_retry_time will X+10s=Y and error_occured_at will be X. The > > second time it again failed, how will we know that we need set > > next_retry_time as Y+20s, maybe we can do something like Y-X and then > > add 10s to it and add the result to the current time. Now whenever > > the worker or launcher finds this request, they can check if the > > current_time is greater than or equal to next_retry_time, if so they > > can pick that request, otherwise, they check request in next queue. > > > > The failure_count can also work in a somewhat similar fashion. > > Basically, we can use error_occurred at and failure_count to compute > > the required time. So, if error is occurred at say X hours and > > failure count is 3, then we can check if current_time is greater than > > X+(3 * 10s), then we will allow the entry to be processed, otherwise, > > it will check other queues for work. > > Meh. Don't get stuck on one particular method of calculating the next > retry time. We want to be able to change that easily if whatever we > try first doesn't work out well. I am not convinced that we need > anything more complex than a fixed retry time, probably controlled by > a GUC (undo_failure_retry_time = 10s?). > IIRC, then you only seem to have suggested that we need a kind of back-off algorithm that gradually increases the retry time up to some maximum [1]. 
I think that is a good way to de-prioritize requests that are repeatedly failing. Say, there is a request that has already failed for 5 times and the worker queues it to get executed after 10s. Immediately after that, another new request has failed for the first time for the same database and it also got queued to get executed after 10s. In this scheme the request that has already failed for 5 times will get a chance before the request that has failed for the first time. [1] - https://www.postgresql.org/message-id/CA%2BTgmoYHBkm7M8tNk6Z9G_aEOiw3Bjdux7v9%2BUzmdNTdFmFzjA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > This delay is for *not* choking the system by constantly performing > undo requests that consume a lot of CPU and I/O as discussed in above > point. For holding off the same error request to be re-tried, we need > next_retry_time type of method as discussed below. Oh. That's not what I thought we were talking about. It's not unreasonable to think about trying to rate limit undo application just like we do for vacuum, but a fixed delay between requests would be a completely inadequate way of attacking that problem. If the individual requests are short, it will create too much delay, and if they are long, it will not create enough. We would need delays within a transaction, not just between transactions, similar to how the vacuum cost delay stuff works. I suggest that we leave that to one side for now. It seems like something that could be added later, maybe in a more general way, and not something that needs to be or should be closely connected to the logic for deciding the order in which we're going to process different transactions in undo. > > Meh. Don't get stuck on one particular method of calculating the next > > retry time. We want to be able to change that easily if whatever we > > try first doesn't work out well. I am not convinced that we need > > anything more complex than a fixed retry time, probably controlled by > > a GUC (undo_failure_retry_time = 10s?). > > IIRC, then you only seem to have suggested that we need a kind of > back-off algorithm that gradually increases the retry time up to some > maximum [1]. I think that is a good way to de-prioritize requests > that are repeatedly failing. Say, there is a request that has already > failed for 5 times and the worker queues it to get executed after 10s. > Immediately after that, another new request has failed for the first > time for the same database and it also got queued to get executed > after 10s. In this scheme the request that has already failed for 5 > times will get a chance before the request that has failed for the > first time. Sure, that's an advantage of increasing back-off times -- you can keep the stuff that looks hopeless from interfering too much with the stuff that is more likely to work out. However, I don't think we've actually done enough testing to know for sure what algorithm will work out best. Do we want linear back-off (10s, 20s, 30s, ...)? Exponential back-off (1s, 2s, 4s, 8s, ...)? No back-off (10s, 10s, 10s, 10s)? Some algorithm that depends on the size of the failed transaction, so that big things get retried less often? I think it's important to design the code in such a way that the algorithm can be changed easily later, because I don't think we can be confident that whatever we pick for the first attempt will prove to be best. I'm pretty sure that storing the failure count INSTEAD OF the next retry time is going to make it harder to experiment with different algorithms later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
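One way to keep the algorithm swappable, as suggested above, is to funnel every policy decision through a single function that yields the next retry time, so an experiment only touches one place. A hypothetical sketch (none of these names are from the patch set):

    /*
     * Fixed 10s delay today; a linear or exponential back-off, or a
     * size-dependent policy, would only change the body of this function.
     * The unused prior_failures parameter shows where such a policy would
     * get its input; the queue entry itself would store only the result.
     */
    static TimestampTz
    ComputeNextRetryTime(TimestampTz error_time, int prior_failures)
    {
        return TimestampTzPlusMilliseconds(error_time, 10 * 1000);
    }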
On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > The idea behind having the loop inside the undo machinery was that > while traversing the blkprev chain, we can read all the undo records > on the same undo page under one buffer lock. That's not a bad goal, although invoking a user-supplied callback while holding a buffer lock is a little scary. If we stick with that, it had better be clearly documented. Perhaps worth noting: ReadBuffer() is noticeably more expensive in terms of CPU consumption than LockBuffer(). So it may work out that we keep a pin to avoid redoing that and worry less about retaking the buffer locks. But I'm not sure: avoiding buffer locks is clearly good, too. I have a couple of observations after poking at this some more. One is that it's not necessarily adequate to have an interface that iterates backward through the undo chain until the callback returns true. Here's an example: suppose you want to walk backward through the undo chain until you find the first undo record that corresponds to a change you can see, and then return the undo record immediately prior to that. zheap doesn't really need this, because it (mostly?) stores the XID of the next record we're going to look up in the current record, and the XID of the first record we're going to look up in the chain -- so it can tell from the record it's found so far whether it should bother looking up the next record. That, however, would not necessarily be true for some other AM. Suppose you just store an undo pointer in the tuple header, as Heikki proposed to do for zedstore. Suppose further that each record has the undo pointer for the previous record that modified that TID but not necessarily the XID. Then imagine a TID where we do an insert and a bunch of in-place updates. Then, a scan comes along with an old snapshot. It seems to me that what you need to do is walk backward through the chain of undo records until you see one that has an XID that is visible to your snapshot, and then the version of the tuple that you want is in the payload of the next-newer undo record. So what you want to do is something like this: look up the undo pointer in the tuple. call that the current undo record. loop: - look up the undo pointer in the current undo record. call that the previous undo record. - if the XID from the previous undo record is visible, then stop; use the tuple version from the current undo record. - release the current undo record and let the new current undo record be the previous undo record. I'm not sure if this is actually a reasonable design from a performance point of view, but it doesn't sound totally crazy, and it's just to illustrate the point that there might be cases that are too complicated for a loop-until-true model. In this kind of loop, at any given time you are holding onto two undo records, working your way back through the undo log, and you just can't make this work with the UndoFetchRecord callback interface. Possibly you could have a context object that could hold onto one or a few buffer pins: BeginUndoFetch(&cxt); uur = UndoFetchRecord(&cxt, urp); // maybe do this a bunch of times FinishUndoFetch(&cxt); ...but I'm not sure if that's exactly what we want either. Still, if there is a significant savings from avoiding repinning and relocking the buffer, we want to make it easy for people to get that advantage as often as possible. Another point that has occurred to me is that it is probably impossible to avoid a fairly large number of duplicate undo fetches. 
For instance, suppose somebody runs an UPDATE on a tuple that has been recently updated. The tuple_update method just gets a TID + snapshot, so the AM basically has to go look up the tuple all over again, including checking whether the latest version of the tuple is the one visible to our snapshot. So that means repinning and relocking the same buffers and decoding the same undo record all over again. I'm not exactly sure what to do about this, but it seems like a potential performance problem. I wonder if it's feasible to cache undo lookups so that in common cases we can just reuse the result of a previous lookup instead of doing a new one, and I wonder whether it's possible to make that fast enough that it actually helps... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
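As a sketch of the two-record walk described above, using the hypothetical context API from this message (the visibility test, ReleaseUndoRecord, and the uur_* field names are stand-ins, not the patch's actual interface):

    UndoFetchContext cxt;
    UnpackedUndoRecord *cur, *prev;

    BeginUndoFetch(&cxt);
    cur = UndoFetchRecord(&cxt, tuple_undo_ptr);        /* newest record for this TID */
    for (;;)
    {
        prev = UndoFetchRecord(&cxt, cur->uur_prevptr); /* one step older */
        if (prev == NULL || XidIsVisibleToSnapshot(prev->uur_xid, snapshot))
            break;                  /* the version we want is in cur's payload */
        ReleaseUndoRecord(cur);     /* never hold more than two records */
        cur = prev;
    }
    /* ... decode the tuple version from cur ... */
    FinishUndoFetch(&cxt);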
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> The idea behind having the loop inside the undo machinery was that >> while traversing the blkprev chain, we can read all the undo records >> on the same undo page under one buffer lock. > That's not a bad goal, although invoking a user-supplied callback > while holding a buffer lock is a little scary. I nominate Robert for Understater of the Year. I think there's pretty much 0 chance of that working reliably. regards, tom lane
On Fri, Jun 21, 2019 at 6:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > That's not a bad goal, although invoking a user-supplied callback > > while holding a buffer lock is a little scary. > > I nominate Robert for Understater of the Year. I think there's pretty > much 0 chance of that working reliably. It's an honor to be nominated, although I am pretty sure this is not my best work in category, even for 2019. There are certainly useful things that could be done by such a callback without doing anything that touches shared memory and without doing anything that consumes more than a handful of CPU cycles, so it doesn't seem utterly crazy to think that such a design might survive. However, the constraints we'd have to impose might chafe. I am more inclined to ditch the callback model altogether in favor of putting any necessary looping logic on the caller side. That seems a lot more flexible, and the only trick is figuring out how to keep it cheap. Providing some kind of context object that can hold onto one or more pins seems like the most reasonable approach. Last week it seemed to me that we would need several, but at the moment I can't think of a reason why we would need more than one. I think we just want to optimize the case where several undo lookups in quick succession are actually reading from the same page, and we don't want to go to the expense of looking that page up multiple times. It doesn't seem at all likely that we would have a chain of undo records that leaves a certain page and then comes back to it later, because this is a log that grows forward, not some kind of random-access thing. So a cache of size >1 probably wouldn't help. Unless I'm still confused. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
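To picture the size-one cache: the context just remembers the last pinned buffer, so consecutive fetches that land on the same undo page skip the ReadBuffer() call. A hedged sketch with made-up names:

    typedef struct UndoFetchContext
    {
        Buffer      last_buffer;    /* pinned across calls, or InvalidBuffer */
        BlockNumber last_blkno;
    } UndoFetchContext;

    /* inside a hypothetical UndoFetchRecord(cxt, urp): */
    BlockNumber blkno = UndoRecPtrToBlockNumber(urp);   /* made-up helper */

    if (!BufferIsValid(cxt->last_buffer) || cxt->last_blkno != blkno)
    {
        if (BufferIsValid(cxt->last_buffer))
            ReleaseBuffer(cxt->last_buffer);        /* drop the old pin */
        cxt->last_buffer = ReadBuffer(rel, blkno);  /* the expensive part, now amortized */
        cxt->last_blkno = blkno;
    }
    LockBuffer(cxt->last_buffer, BUFFER_LOCK_SHARE);
    /* ... decode the record, then LockBuffer(..., BUFFER_LOCK_UNLOCK) ... */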
On Thu, Jun 20, 2019 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > IIRC, then you only seem to have suggested that we need a kind of > > back-off algorithm that gradually increases the retry time up to some > > maximum [1]. I think that is a good way to de-prioritize requests > > that are repeatedly failing. Say, there is a request that has already > > failed for 5 times and the worker queues it to get executed after 10s. > > Immediately after that, another new request has failed for the first > > time for the same database and it also got queued to get executed > > after 10s. In this scheme the request that has already failed for 5 > > times will get a chance before the request that has failed for the > > first time. > > Sure, that's an advantage of increasing back-off times -- you can keep > the stuff that looks hopeless from interfering too much with the stuff > that is more likely to work out. However, I don't think we've actually > done enough testing to know for sure what algorithm will work out > best. Do we want linear back-off (10s, 20s, 30s, ...)? Exponential > back-off (1s, 2s, 4s, 8s, ...)? No back-off (10s, 10s, 10s, 10s)? > Some algorithm that depends on the size of the failed transaction, so > that big things get retried less often? I think it's important to > design the code in such a way that the algorithm can be changed easily > later, because I don't think we can be confident that whatever we pick > for the first attempt will prove to be best. I'm pretty sure that > storing the failure count INSTEAD OF the next retry time is going to > make it harder to experiment with different algorithms later. > Fair enough. I have implemented it based on next_retry_at and used a constant 10s for the next retry. I have used a #define instead of a GUC, as all the other constants for similar things are defined that way as of now. One thing to note is that we want the linger time (defined as UNDO_WORKER_LINGER_MS) for an undo worker to be more than the failure retry time (defined as UNDO_FAILURE_RETRY_DELAY_MS); otherwise, the undo worker can exit before retrying the failed requests. The changes for this are in patches 0011-Infrastructure-to-register-and-fetch-undo-action-req.patch and 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Apart from these, there are a few other changes in the patch series: 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch: 1. Allowed the undo workers to respond to a cancel command from the user. CHECK_FOR_INTERRUPTS was missing while the worker was checking for the next undo request in a loop. 2. Changed the value of UNDO_WORKER_LINGER_MS to 20s, so that it is more than UNDO_FAILURE_RETRY_DELAY_MS. 3. Handled the SIGTERM signal for the undo launcher and workers. 4. Fixed a bug where CommitTransaction could be called when one of the workers failed to register, even though there was no StartTransaction to match it. This was left over from the previous approach. 0012-Infrastructure-to-execute-pending-undo-actions.patch: 1. Fixed a compiler warning. 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch: 1. Fixed a bug so that the buffer is unlocked while resetting the unpacked undo record. 2. Fixed the spurious release of the lock in UndoFetchRecord. 3. Removed the pointer to the previous undo in a different log from the UndoRecordTransaction structure. A separate log_switch header now contains the same. 
0007-Provide-interfaces-to-store-and-fetch-undo-records.patch is Dilip's patch; he has modified it, but the changes were small, so there was not much sense in posting it separately. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Add-SmgrId-to-smgropen-and-BufferTag.patch
- 0002-Move-tablespace-dir-creation-from-smgr.c-to-md.c.patch
- 0004-Allow-WAL-record-data-on-first-modification-after-a-.patch
- 0005-Add-prefetch-support-for-the-undo-log.patch
- 0003-Add-undo-log-manager.patch
- 0006-Defect-and-enhancement-in-multi-log-support.patch
- 0008-Test-module-for-undo-api.patch
- 0009-undo-page-consistency-checker.patch
- 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
- 0010-Extend-binary-heap-functionality.patch
- 0013-Allow-foreground-transactions-to-perform-undo-action.patch
- 0011-Infrastructure-to-register-and-fetch-undo-action-req.patch
- 0012-Infrastructure-to-execute-pending-undo-actions.patch
- 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch
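As a footnote to the linger-versus-retry point above, the intended interaction might be pictured like this; a hypothetical sketch only, since none of these helper names exist in the patches:

    /*
     * Sketch: the worker lingers long enough that failed requests come up
     * for retry, which is why UNDO_WORKER_LINGER_MS must exceed
     * UNDO_FAILURE_RETRY_DELAY_MS.
     */
    while (!got_sigterm)
    {
        RollbackRequest *req;

        CHECK_FOR_INTERRUPTS();     /* lets the user cancel the worker */

        req = GetNextUndoRequest(MyDatabaseId); /* skips entries whose
                                                 * next_retry_at is in the future */
        if (req != NULL)
            ExecuteUndoRequest(req);
        else if (TimeSinceLastRequest() > UNDO_WORKER_LINGER_MS)
            break;                  /* give other databases a chance */
        else
        {
            (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 100L, 0);
            ResetLatch(MyLatch);
        }
    }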
On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > [ new patches ] I happened to open up 0001 from this series, which is from Thomas, and I do not think that the pg_buffercache changes are correct. The idea here is that the customer might install version 1.3 or any prior version on an old release, then upgrade to PostgreSQL 13. When they do, they will be running with the old SQL definitions and the new binaries. At that point, it sure looks to me like the code in pg_buffercache_pages.c is going to do the Wrong Thing. There's some existing code in there to omit the pinning_backends column if the SQL definitions don't know about it; instead of adding similar code for the newly-added smgrid column, this patch rips out the existing backward-compatibility code. I think that's a double fail. At some later point, the customer is going to run ALTER EXTENSION pg_buffercache UPDATE. At that point they are going to be expecting that their SQL definitions now match the state that would have been created had they installed pg_buffercache for the first time on PG13, starting with pg_buffercache v1.4. Since the view is now going to have a new column, that would seem to require dropping and recreating the view, but pg_buffercache--1.3--1.4.sql doesn't do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
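For reference, the backward-compatibility code being ripped out keys off the number of attributes in the TupleDesc that the caller's SQL definitions expect, roughly like this (simplified from pg_buffercache_pages.c, not the exact code):

    TupleDesc   expected_tupledesc;

    if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    /*
     * Old SQL definitions from a pre-upgrade install have fewer columns,
     * so columns they don't know about must not be emitted.
     */
    if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_ELEM)
        /* ... build tuples without the newer column(s) ... */

The same natts test is what would let the new smgrid column coexist with v1.3 SQL definitions.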
On Sat, Jun 22, 2019 at 2:51 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 20, 2019 at 4:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > The idea behind having the loop inside the undo machinery was that > > while traversing the blkprev chain, we can read all the undo records > > on the same undo page under one buffer lock. > > [...] 
> Still, if there is a significant savings from avoiding repinning and relocking > the buffer, we want to make it easy for people to get that advantage > as often as possible. IIUC, you are proposing to retain the pin and then lock/unlock the buffer each time in this API. I think there is no harm in trying out something along these lines to see if there is any impact on some of the common scenarios. > Another point that has occurred to me is that it is probably > impossible to avoid a fairly large number of duplicate undo fetches. > For instance, suppose somebody runs an UPDATE on a tuple that has been > recently updated. The tuple_update method just gets a TID + snapshot, > so the AM basically has to go look up the tuple all over again, > including checking whether the latest version of the tuple is the one > visible to our snapshot. So that means repinning and relocking the > same buffers and decoding the same undo record all over again. I'm > not exactly sure what to do about this, but it seems like a potential > performance problem. I wonder if it's feasible to cache undo lookups > so that in common cases we can just reuse the result of a previous > lookup instead of doing a new one, and I wonder whether it's possible > to make that fast enough that it actually helps... I think it will be helpful if we can have such a cache, but OTOH, we can also try out such optimizations after the first version, by analyzing their benefit. For zheap, in many cases, the version in the heap itself is all-visible or is visible as per the current snapshot, and that can be detected by looking at the transaction slot; however, it might be tricky for zedstore. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 28, 2019 at 6:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > I happened to open up 0001 from this series, which is from Thomas, and > I do not think that the pg_buffercache changes are correct. The idea > here is that the customer might install version 1.3 or any prior > version on an old release, then upgrade to PostgreSQL 13. When they > do, they will be running with the old SQL definitions and the new > binaries. At that point, it sure looks to me like the code in > pg_buffercache_pages.c is going to do the Wrong Thing. [...] Yep, that was completely wrong. Here's a new version. I tested that I can install 1.3 in an older release, then pg_upgrade to master, then look at the view without the new column, then UPDATE the extension to 1.4, and then the new column appears. Other new stuff in this tarball (and also at https://github.com/EnterpriseDB/zheap/tree/undo): Based on hallway track discussions at PGCon, I have made a few modifications to the undo log storage and record layer to support "shared" record sets. They are groups of records that can be used as temporary storage space for anything that needs to outlive a whole set of transactions. The intended usage is extra transaction slots for updaters and lockers when there isn't enough space on a zheap (or other AM) page. The idea is to avoid the need to have in-heap overflow pages for transient transaction management data, and instead put that stuff on the conveyor belt of perfectly timed doom[1] along with old tuple versions. "Shared" undo records are never executed (that is, they don't really represent rollback actions); they are just used for storage space that is eventually discarded. (I experimented with a way to use these also to perform rollback actions to clean up stuff like the junk left behind by aborted CREATE INDEX CONCURRENTLY commands, which seemed promising, but it turned out to be quite tricky so I abandoned that for now). Details: 1. Renamed UndoPersistence to UndoLogCategory everywhere, and added a fourth category UNDO_SHARED where transactions can write 'out of band' data that relates to more than one transaction. 2. Introduced a new RMGR callback rm_undo_status. It is used to decide when record sets in the UNDO_SHARED category should be discarded (instead of the usual single xid-based rules). The possible answers are "discard me now!", "ask me again when a given XID is all visible", and "ask me again when a given XID is no longer running". 3. Recognise UNDO_SHARED record set boundaries differently. Whereas undolog.c recognises transaction boundaries automatically for the other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for UNDO_SHARED the 4. Add some quick-and-dirty throw-away test stuff to demonstrate that. SELECT test_multixact([1234, 2345]) will create a new record set that will survive until the given array of transactions is no longer running, and then it'll be discarded. You can see that with SELECT * FROM undoinspect('shared'). Or look at SELECT * FROM pg_stat_undo_logs. This test simply writes all the xids into its payload, and then has an rm_undo_status function that returns the first xid it finds in the list that is still running, or if none are running returns UNDO_STATUS_DISCARD. Currently you can only return UNDO_STATUS_WAIT_XMIN, that is, wait for an xid to be older than the oldest xmin; presumably it'd be useful to be able to discard as soon as an xid is no longer active, which could be a bit sooner. 
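To make that concrete, an rm_undo_status callback for such a test module might look roughly like this; the signature and field names here are guesses based on the description, not the patch's actual API:

    /* Decide whether a shared record set can be discarded yet. */
    static UndoStatus
    test_undo_status(UnpackedUndoRecord *rec, TransactionId *wait_for)
    {
        TransactionId *xids = (TransactionId *) rec->uur_payload.data;
        int         nxids = rec->uur_payload.len / sizeof(TransactionId);

        for (int i = 0; i < nxids; i++)
        {
            if (TransactionIdIsInProgress(xids[i]))
            {
                *wait_for = xids[i];            /* ask again about this xid */
                return UNDO_STATUS_WAIT_XMIN;   /* wait for it to fall behind xmin */
            }
        }
        return UNDO_STATUS_DISCARD;             /* nothing still running */
    }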
Another small change: several people commented that UndoLogIsDiscarded(ptr) ought to have some kind of fast path that doesn't acquire locks since it'll surely be hammered. Here's an attempt at that: it provides an inlined function that uses a per-backend recent_discard value to avoid doing more work in the (hopefully) common case that you mostly encounter discarded undo pointers. I hope this change will show up in profiles of some zheap workloads, but that hasn't been tested yet. Another small change/review: the function UndoLogGetNextInsertPtr() previously took a transaction ID, but I'm not sure if that made sense, I need to think about it some more. I pulled in the latest patches from the "undoprocessing" branch as of late last week, and most of the above is implemented as fixup commits on top of that. Next I'm working on DBA facilities for forcing undo records to be discarded (which consists mostly of sorting out the interlocking to make that work safely). And also testing facilities for simulating undo log switching (when you fill up one log and move to another, which is a rarely run code path, so we need a good way to make it not rare). [1] https://speakerdeck.com/macdice/transactions-in-postgresql-and-other-animals?slide=23 -- Thomas Munro https://enterprisedb.com
Attachment
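The fast path mentioned above might have roughly this shape (a guess at the structure, not the committed interface):

    /* Per-backend cache of the highest discard pointer this backend has seen. */
    static UndoRecPtr recent_discard;

    static inline bool
    UndoLogIsDiscarded(UndoRecPtr ptr)
    {
        if (ptr < recent_discard)
            return true;                    /* common case: no locks at all */
        return UndoLogIsDiscardedSlow(ptr); /* hypothetical slow path: takes
                                             * locks and refreshes recent_discard */
    }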
On Mon, Jul 1, 2019 at 7:53 PM Thomas Munro <thomas.munro@gmail.com> wrote: > 3. Recognise UNDO_SHARED record set boundaries differently. Whereas > undolog.c recognises transaction boundaries automatically for the > other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for > UNDO_SHARED the ... set of records inserted between BeginUndoRecordInsert() and FinishUndoRecordInsert() calls is eventually discarded as a unit, and the rm_undo_status() callback for the calling AM decides when that is allowed. In contrast, for the other categories there may be records from any number of undo-aware AMs that are entirely unaware of each other, and they must all be discarded together if the transaction commits and becomes all visible, so undolog.c automatically manages the boundaries to make that work when inserting. -- Thomas Munro https://enterprisedb.com
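In code form, the UNDO_SHARED contract described here might be used something like this (illustrative only; the argument lists are not the real ones):

    /*
     * Everything inserted between these two calls forms one record set,
     * discarded as a unit once the AM's rm_undo_status callback allows it.
     */
    BeginUndoRecordInsert(&context, UNDO_SHARED, nrecords, NULL);
    /* ... PrepareUndoInsert() / InsertPreparedUndo() for each record ... */
    FinishUndoRecordInsert(&context);

For the other categories the same calls are made, but undolog.c groups the records by transaction on its own.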
On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Another small change/review: the function UndoLogGetNextInsertPtr() > previously took a transaction ID, but I'm not sure if that made sense, > I need to think about it some more. > The changes you have made related to UndoLogGetNextInsertPtr() don't seem correct to me. @@ -854,7 +854,9 @@ FindUndoEndLocationAndSize(UndoRecPtr start_urecptr, * has already started in this log then lets re-fetch the undo * record. */ - next_insert = UndoLogGetNextInsertPtr(slot->logno, uur->uur_xid); + next_insert = UndoLogGetNextInsertPtr(slot->logno); + + /* TODO this can't happen */ if (!UndoRecPtrIsValid(next_insert)) I think this is a possible case. Say, while the discard worker tries to register the rollback request from some log, after it fetches the undo record corresponding to the start location in this function, another backend adds new transaction undo. The same is mentioned in comments as well. Can you explain what makes you think that this can't happen? If we don't want to pass the xid to UndoLogGetNextInsertPtr, then I think we need to get the insert location before fetching the record. I will think more on it to see if there is any other problem with the same. 2. @@ -167,25 +205,14 @@ UndoDiscardOneLog(UndoLogSlot *slot, TransactionId xmin, bool *hibernate) + if (!TransactionIdIsValid(wait_xid) && !pending_abort) { UndoRecPtr next_insert = InvalidUndoRecPtr; - /* - * If more undo has been inserted since we checked last, then - * we can process that as well. - */ - next_insert = UndoLogGetNextInsertPtr(logno, undoxid); - if (!UndoRecPtrIsValid(next_insert)) - continue; + next_insert = UndoLogGetNextInsertPtr(logno); This change is also not safe. It can lead to discarding the undo of some random transaction, because new undo records from some other transaction could have been added since we last fetched the undo record. This can be fixed by just removing the call to UndoLogGetNextInsertPtr. I have done so in the undoprocessing branch and added a comment as well. I think the common problem with the above changes is that they assume that new undo can't be added to the undo logs while the discard worker is processing them. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 4, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > Another small change/review: the function UndoLogGetNextInsertPtr() > > > previously took a transaction ID, but I'm not sure if that made sense, > > > I need to think about it some more. > > > > > The changes you have made related to UndoLogGetNextInsertPtr() don't > > seem correct to me. > > > > @@ -854,7 +854,9 @@ FindUndoEndLocationAndSize(UndoRecPtr start_urecptr, > > * has already started in this log then lets re-fetch the undo > > * record. > > */ > > - next_insert = UndoLogGetNextInsertPtr(slot->logno, uur->uur_xid); > > + next_insert = UndoLogGetNextInsertPtr(slot->logno); > > + > > + /* TODO this can't happen */ > > if (!UndoRecPtrIsValid(next_insert)) > > > > I think this is a possible case. Say, while the discard worker tries > > to register the rollback request from some log, after it fetches > > the undo record corresponding to the start location in this function, > > another backend adds new transaction undo. The same is mentioned > > in comments as well. Can you explain what makes you think that this > > can't happen? If we don't want to pass the xid to > > UndoLogGetNextInsertPtr, then I think we need to get the insert > > location before fetching the record. I will think more on it to see > > if there is any other problem with the same. > Pushed a fix along the above lines in the undoprocessing branch. It will be available in the next set of patches we post. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 5, 2019 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jul 4, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [...] > > Pushed a fix along the above lines in the undoprocessing branch. Just in case anyone wants to look at the undoprocessing branch, it is available at https://github.com/EnterpriseDB/zheap/tree/undoprocessing -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Fair enough. I have implemented it based on next_retry_at and used a > constant 10s for the next retry. I have used a #define instead of a > GUC, as all the other constants for similar things are defined that way as of > now. One thing to note is that we want the linger time (defined as > UNDO_WORKER_LINGER_MS) for an undo worker to be more than the failure retry > time (defined as UNDO_FAILURE_RETRY_DELAY_MS); otherwise, the undo > worker can exit before retrying the failed requests. Uh, I think we want exactly the opposite. We want the workers to exit before retrying, so that there's a chance for other databases to get processed, I think. Am I confused? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > [ new patches ] I took a look at 0012 today, Amit's patch for extending the binary heap machinery, and 0013, Amit's patch for "Infrastructure to register and fetch undo action requests." I don't think that binaryheap_allocate_shm() is a good design. It presupposes that we want to store the binary heap as its own chunk of shared memory allocated via ShmemInitStruct(), but we might want to do something else, like embed it in another structure, store it in a DSM or DSA, etc., and this function can't do any of that. I think we should have something more like: extern Size binaryheap_size(int capacity); extern void binaryheap_initialize(binaryheap *, int capacity, binaryheap_comparator compare, void *arg); Then the caller can do something like: sz = binaryheap_size(capacity); bh = ShmemInitStruct(name, sz, &found); if (!found) binaryheap_initialize(bh, capacity, comparator, whatever); If it wants to get the memory in some other way, it just needs to initialize bh differently; the rest is the same. Note that there is no need, in this design, for binaryheap_size/initialize to make use of "shared" memory. They could equally well be used on backend-local memory. They do not need to care. You just provide the memory, and they do their thing. I wasn't very happy about binaryheap_nth(), binaryheap_remove_nth(), and binaryheap_remove_nth_unordered() and started looking at how they are used to try to see if there might be a better way. That led me to look at 0013. Unfortunately, I find it really hard to understand what this code is actually doing. There's a lot of redundant and badly-written stuff in here. As a general principle, if you have two or three data structures of some particular type, you don't write a separate family of functions for manipulating each one. You write one function for each operation, and you pass the particular copy of the data structure with which you are working as an argument. In the lengthy section of macro definitions at the top of undorequest.c, we have macros InitXidQueue, XidQueueIsEmpty, GetXidQueueSize, GetXidQueueElem, GetXidQueueTopElem, GetXidQueueNthElem, and SetXidQueueElem. Several of these are used in only one place or are not used anywhere at all; those should be removed altogether and inlined into the single call site if there is one. Then, after this, there is a matching set of macros, InitSizeQueue, SizeQueueIsEmpty, GetSizeQueueSize, GetSizeQueueElem, GetSizeQueueTopElem, GetSizeQueueNthElem, and SetSizeQueueElem. Many of these macros are exactly the same as the previous set of macros except that they operate on a different queue, which, as I mentioned in the previous paragraph, is not a good design. It leads to extensive code duplication. Look, for example, at RemoveOldElemsFromSizeQueue and RemoveOldElemsFromXidQueue. They are basically identical except for s/Size/Xid/g and s/SIZE/XID/g, but you can't unify them easily because they are calling different functions. However, if you didn't have one function called GetSizeQueueSize and another called GetXidQueueSize, but just had a pointer to the relevant binary heap, then both functions could just call binaryheap_empty() on it, which would be better style, use fewer macros, generate less machine code, and be easier to read. Ideally, you'd get to the point where you could just have one function rather than two, and pass the queue upon which it should operate as an argument. 
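For illustration, once the heap to operate on is a parameter, the two Remove functions above might collapse into something like this (a sketch; UndoRequest and the dequeue test are placeholders, binaryheap_* is the real API):

    /* One function serves the XID queue, the size queue, and the error queue. */
    static void
    RemoveOldElemsFromQueue(binaryheap *queue)
    {
        while (!binaryheap_empty(queue))
        {
            UndoRequest *req =
                (UndoRequest *) DatumGetPointer(binaryheap_first(queue));

            if (!RequestCanBeDequeued(req))     /* placeholder for the real test */
                break;
            (void) binaryheap_remove_first(queue);
        }
    }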
There seems to be a good deal of this kind of duplication in this file and it really needs to be cleaned up. Now, one objection to the above line of attack is that the different queues actually contain different types of elements. Apparently, the XID queue contains elements of type UndoXidQueue and the size queue contains elements of type UndoSizeQueue. It is worth noting here that these are bad type names, because they sound like they are describing a type of queue, but it seems that they are actually describing an element in the queue. However, there are two larger problems: 1. I don't think we should have three different kinds of objects for each of the three different queues. It seems like it would be much simpler and easier to just have one kind of object that stores all the information we need (full_xid, start_urec_ptr, dbid, request_size, next_retry_at, error_occurred_at) and use that everywhere. You could object that this would increase the storage space requirement, but it wouldn't be enough to make any real difference and it probably would be well worth it for the avoidance of complexity. 2. However, I don't think we should have a separate request object for each queue anyway. We should insert pointers to the same objects in all the relevant queues (either size + XID, or else error). So instead of having three sets of objects, one for each queue, we'd just have one set of objects and point to them with as many as two pointers. We'd therefore need LESS memory than we're using today, because we wouldn't have separate arrays for XID, size, and error queue elements. In fact, it seems to me that we shouldn't have any such thing as "queue entries" at all. The queues should just be pointing to RollbackHashEntry *, and we should add all the fields there that are present in any of the "queue entry" structures. This would use less memory still. I also think we should be using simplehash rather than dynahash. I'm not sure that I would really say that simplehash is "simple," but it does have a nicer API and simpler memory management. There's just a big old block of memory, and there's no incremental allocation. That keeps things simple for the code that wants to go through the queues and remove dangling pointers. I think that the way this should work is that each RollbackHashEntry * should contain a field "bool active." Then: 1. When we pull an item out of one of the binary heaps, we check the active flag. If it's clear, we ignore the entry and pull the next item. If it's set, we clear the flag and process the item, so that if it's subsequently pulled from the other queue it will be ignored. 2. If a binary heap is full when we need to insert into it, we can iterate over all of the elements and throw away any that are !active. They've already been dequeued and processed from some other queue, so they're not "really" in this queue any more, even though we haven't gone to the trouble of actually kicking them out yet. On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork() and it passes the depth argument down to GetRollbackHashKeyFromQueue, which then does binaryheap_nth() on the relevant queue. Note that this function is another place that ends up duplicating code because of the questionable decision to have separate types of queue entries for each different queue; otherwise, it could probably just take the binary heap into which it's peeking as an argument instead of having three different cases. But that's not the main point here. 
The main point is that it calls a function for whichever type of queue we've got and gets some kind of queue entry using binaryheap_nth(). But binaryheap_nth(whatever, 2) does not give you the third-smallest element in the binary heap. It gives you the third entry in the array, which may or may not have the heap property, but even if it does, the third element could be huge. Consider this binary heap: 0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006 This satisfies the binary heap property, because the element at position n is always smaller than the elements at positions 2n+1 and 2n+2 (assuming 0-based indexing). But if you want to look at the smallest three elements in the heap, you can't just look at indexes 0..2. The second-smallest element must be at index 1 or 2, but it could be either place. The third-smallest element could be the other of 1 and 2, or it could be either child of the smaller one, so there are three places it might be. In general, a binary heap is not a good data structure for finding the smallest N elements of a collection unless N is 1, and what's going to happen with what you've got here is that we'll sometimes prioritize an item that would not have been pulled from the queue for a long time over one that would have otherwise been processed much sooner. I'm not sure that's a show-stopper, but it doesn't seem good, and the current patch doesn't seem to have any comments justifying it, or at least not in the places nearby to where this is actually happening. I think there are more problems here, too. Let's suppose that we fixed the problem described in the previous paragraph somehow, or decided that it won't actually make a big difference and just ignored it. Suppose further that we have N active databases which are generating undo requests. Luckily, we happen to also have N undo workers available, and let's suppose that as of a certain moment in time there is exactly one worker in each database. Think about what will happen when one of those workers goes to look for the next undo request. It's likely that the first request in the queue will be for some other database, so it's probably going to have to peek ahead to find a request for the database to which it's connected -- let's just assume that there is one. How far will it have to peek ahead? Well, if the requests are uniformly distributed across databases, each request has a 1-in-N chance of being the right one. 
I wrote a little Perl program to estimate the probability that we won't find the next request for our databases within 10 requests as a function of the number of databases: 1 databases => failure chance with 10 lookahead is 0.00% 2 databases => failure chance with 10 lookahead is 0.10% 3 databases => failure chance with 10 lookahead is 1.74% 4 databases => failure chance with 10 lookahead is 5.66% 5 databases => failure chance with 10 lookahead is 10.74% 6 databases => failure chance with 10 lookahead is 16.18% 7 databases => failure chance with 10 lookahead is 21.45% 8 databases => failure chance with 10 lookahead is 26.31% 9 databases => failure chance with 10 lookahead is 30.79% 10 databases => failure chance with 10 lookahead is 34.91% 11 databases => failure chance with 10 lookahead is 38.58% 12 databases => failure chance with 10 lookahead is 41.85% 13 databases => failure chance with 10 lookahead is 44.91% 14 databases => failure chance with 10 lookahead is 47.69% 15 databases => failure chance with 10 lookahead is 50.12% 16 databases => failure chance with 10 lookahead is 52.34% 17 databases => failure chance with 10 lookahead is 54.53% 18 databases => failure chance with 10 lookahead is 56.39% 19 databases => failure chance with 10 lookahead is 58.18% 20 databases => failure chance with 10 lookahead is 59.86% Assuming my script (attached) doesn't have a bug, with only 8 databases, there's better than a 1-in-4 chance that we'll fail to find the next entry for the current database within the lookahead window. That's bad, because then the worker will be sitting around waiting when it should be doing stuff. Maybe it will even exit, even though there's work to be done, and even though all the other databases have their own workers already. You can construct way worse examples than this one, too: imagine that there are two databases, each with a worker, and one has 99% of the requests and the other one has 1% of the requests. It's really unlikely that there's going to be an entry for the second database within the lookahead window. And note that increasing the window doesn't really help either: you just need more databases than the size of the lookahead window, or even almost as many as the lookahead window, and things are going to stop working properly. On the other hand, suppose that you have 10 databases and one undo worker. One database is pretty active and generates a continuous stream of undo requests at exactly the same speed we can process them. The others all have 1 pending undo request. Now, what's going to happen is that you'll always find the undo request for the current database within the lookahead window. So, you'll never exit. But that means the undo requests in the other 9 databases will just sit there for all eternity, because there's no other worker to process them. On the other hand, if you had 11 databases, there's a good chance it would work fine, because the new request for the active database would likely be outside the lookahead window, and so you'd find no work to do and exit, allowing a worker to be started up in some other database. It would in turn exit and so on and you'd clear the backlog for the other databases at least for a while, until you picked the active database again. Actually, I haven't looked at the whole patch set, so perhaps there is some solution to this problem contemplated somewhere, but I consider this argument to be pretty good evidence that a fixed lookahead distance is probably the wrong thing. 
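(For what it's worth, those numbers agree with the closed form for a uniform distribution: the chance that none of the next 10 requests belongs to our database is (1 - 1/N)^10, e.g. (7/8)^10 ≈ 26.3% for 8 databases, matching the simulated 26.31%.)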
The right things to do about these problems probably need some discussion, but here's the best idea I have off-hand: instead of just have 3 binary heaps (size, XID, error), have N+1 "undo work trackers", each of which contains 3 binary heaps (size, XID, error). Undo work tracker #0 contains all requests that are not assigned to any other undo work tracker. Each of undo work trackers #1..N contains all the requests for one particular database, but they all start out unused. Before launching an undo worker for a particular database, the launcher must check whether it has an undo work tracker allocated to that database. If not, it allocates one and moves all the work for that database out of tracker #0 and into the newly-allocated tracker. If there are none free, it must first deallocate an undo work tracker, moving any remaining work for that tracker back into tracker #0. With this approach, there's no need for lookahead, because every worker is always pulling from a queue that is database-specific, so the next entry is always guaranteed to be relevant. And you choose N to be equal to the number of workers, so that even if every worker is in a separate database there will be enough trackers for all workers to have one, plus tracker #0 for whatever's left. There still remains the problem of figuring out when a worker should terminate to allow for new workers to be launched, which is a fairly complex problem that deserves its own discussion, but I think this design helps. At the very least, you can see whether tracker #0 is empty. If it is, you might still want to rebalance workers between databases, but you don't really need to worry about databases getting starved altogether, because you know that you can run a worker for every database that has any pending undo. If tracker #0 is non-empty but you have unused workers, you can just allocate trackers for the databases in tracker #0 and move stuff over there to be processed. If tracker #0 is non-empty and all workers are allocated, you are going to need to ask one of them to exit at some point, to avoid starvation. I don't know exactly what the algorithm for that should be; I do have some ideas. I'm not going to include them in this email though, because this email is already long and I don't have time to make it longer right now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
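Sketching the "bool active" lazy-deletion idea from the message above, with hypothetical names (the real RollbackHashEntry layout will differ):

    typedef struct RollbackHashEntry
    {
        FullTransactionId full_xid;
        UndoRecPtr  start_urec_ptr;
        Oid         dbid;
        Size        request_size;
        TimestampTz next_retry_at;
        bool        active;         /* cleared once dequeued via any queue */
    } RollbackHashEntry;

    /* Dequeue, skipping entries already consumed through another queue. */
    static RollbackHashEntry *
    DequeueRequest(binaryheap *queue)
    {
        while (!binaryheap_empty(queue))
        {
            RollbackHashEntry *entry =
                (RollbackHashEntry *) DatumGetPointer(binaryheap_remove_first(queue));

            if (entry->active)
            {
                entry->active = false;  /* other queues will now skip it */
                return entry;
            }
            /* stale pointer: this request was handled via another queue */
        }
        return NULL;
    }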
On Fri, Jul 5, 2019 at 7:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jun 25, 2019 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Fair enough. I have implemented it based on next_retry_at and used a > > constant 10s for the next retry. I have used a #define instead of a > > GUC, as all the other constants for similar things are defined that way as of > > now. One thing to note is that we want the linger time (defined as > > UNDO_WORKER_LINGER_MS) for an undo worker to be more than the failure retry > > time (defined as UNDO_FAILURE_RETRY_DELAY_MS); otherwise, the undo > > worker can exit before retrying the failed requests. > > Uh, I think we want exactly the opposite. We want the workers to exit > before retrying, so that there's a chance for other databases to get > processed, I think. > The workers will exit if there is any chance for other databases to get processed. Basically, we linger only when we find there is no work in other databases. Not only that: even if some new work is added to the queues for some other database, we stop the lingering worker if there is no worker available for the new request that has arrived. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 4, 2019 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > PFA, the latest version of the undo interface and undo processing patches. Summary of the changes in the patch set 1. Undo Interface - Rebased over the latest undo storage code - Implemented undo page compression (we don't store the common fields in all the records; instead we get them from the first complete record of the page). - As per Robert's comment, UnpackedUndoRecord is divided into two parts: a) all fields which are set by the caller, and b) pointers to structures which are set internally. - Epoch and transaction ID are unified as a full transaction ID - Fixed handling of dbid during recovery (TODO in PrepareUndoInsert) Pending: - Move the loop in UndoFetchRecord to the outside and test performance with keeping a pin vs pin+lock across undo records. This will be done after testing performance over the zheap code. - I need to investigate whether discard checking can be unified between master and HotStandby in the UndoFetchRecord function. 2. Undo Processing - Defect fix in multi-log rollback for subtransactions. - Assorted defect fixes. Others - Fixup for undo log code to handle the full transaction id in UndoLogSlot for discard, and other bug fixes in the undo log. - Fixup for orphan file cleanup to pass dbid in PrepareUndoInsert -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sat, Jul 6, 2019 at 1:47 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > [ new patches ] > > I took a look at 0012 today, Amit's patch for extending the binary > heap machinery, and 0013, Amit's patch for "Infrastructure to register > and fetch undo action requests." > Thanks for looking into the patches. > I don't think that binaryheap_allocate_shm() is a good design. It > presupposes that we want to store the binary heap as its own chunk of > shared memory allocated via ShmemInitStruct(), but we might want to do > something else, like embed it in another structure, store it in a DSM or > DSA, etc., and this function can't do any of that. I think we should > have something more like: > > extern Size binaryheap_size(int capacity); > extern void binaryheap_initialize(binaryheap *, int capacity, > binaryheap_comparator compare, void *arg); > > Then the caller can do something like: > > sz = binaryheap_size(capacity); > bh = ShmemInitStruct(name, sz, &found); > if (!found) > binaryheap_initialize(bh, capacity, comparator, whatever); > > If it wants to get the memory in some other way, it just needs to > initialize bh differently; the rest is the same. Note that there is > no need, in this design, for binaryheap_size/initialize to make use of > "shared" memory. They could equally well be used on backend-local > memory. They do not need to care. You just provide the memory, and > they do their thing. > I didn't have other use cases in mind, and I think to some extent this argument holds true for the existing binaryheap_allocate. If we want to make it more generic, then shouldn't we also change the existing binaryheap_allocate to use this new model, so that the binary heap allocation API is more generic? .. .. > > Now, one objection to the above line of attack is that the different queues > actually contain different types of elements. Apparently, the XID > queue contains elements of type UndoXidQueue and the size queue > contains elements of type UndoSizeQueue. It is worth noting here that > these are bad type names, because they sound like they are describing > a type of queue, but it seems that they are actually describing an > element in the queue. However, there are two larger problems: > > 1. I don't think we should have three different kinds of objects for > each of the three different queues. It seems like it would be much > simpler and easier to just have one kind of object that stores all the > information we need (full_xid, start_urec_ptr, dbid, request_size, > next_retry_at, error_occurred_at) and use that everywhere. You could > object that this would increase the storage space requirement, Yes, this was the reason to keep them separate, but I see your point. > but it > wouldn't be enough to make any real difference and it probably would > be well worth it for the avoidance of complexity. > Okay, will give it a try and see if it can avoid some code complexity. Along with this, I will investigate your other suggestions related to code improvements as well. > > On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork() > and it passes the depth argument down to GetRollbackHashKeyFromQueue, > which then does binaryheap_nth() on the relevant queue. 
> Note that this function is another place that ends up duplicating code > because of the questionable decision to have separate types of queue > entries for each different queue; otherwise, it could probably just > take the binary heap into which it's peeking as an argument instead of > having three different cases. But that's not the main point here. The > main point is that it calls a function for whichever type of queue > we've got and gets some kind of queue entry using binaryheap_nth(). > But binaryheap_nth(whatever, 2) does not give you the third-smallest > element in the binary heap. It gives you the third entry in the > array, which may or may not have the heap property, but even if it > does, the third element could be huge. Consider this binary heap: > > 0 1 100000 2 3 100001 100002 4 5 6 7 100003 100004 100005 100006 > > This satisfies the binary heap property, because the element at > position n is always smaller than the elements at positions 2n+1 and > 2n+2 (assuming 0-based indexing). But if you want to look at the > smallest three elements in the heap, you can't just look at indexes > 0..2. The second-smallest element must be at index 1 or 2, but it > could be either place. The third-smallest element could be the other > of 1 and 2, or it could be either child of the smaller one, so there > are three places it might be. In general, a binary heap is not a good > data structure for finding the smallest N elements of a collection > unless N is 1, and what's going to happen with what you've got here is > that we'll sometimes prioritize an item that would not have been > pulled from the queue for a long time over one that would have > otherwise been processed much sooner. > You are right that it won't be the nth smallest element from the queue, and we don't even care about that here. The peeking logic is not to find the next prioritized element but to check if we can find some element for the same database in the next few entries, to avoid frequent undo worker restarts. > I'm not sure that's a > show-stopper, but it doesn't seem good, and the current patch doesn't > seem to have any comments justifying it, or at least not in the places > nearby to where this is actually happening. > I agree that we should add more comments explaining this. > I think there are more problems here, too. Let's suppose that we > fixed the problem described in the previous paragraph somehow, or > decided that it won't actually make a big difference and just ignored > it. Suppose further that we have N active databases which are > generating undo requests. Luckily, we happen to also have N undo > workers available, and let's suppose that as of a certain moment in > time there is exactly one worker in each database. Think about what > will happen when one of those workers goes to look for the next undo > request. It's likely that the first request in the queue will be for > some other database, so it's probably going to have to peek ahead to > find a request for the database to which it's connected -- let's just > assume that there is one. How far will it have to peek ahead? Well, > if the requests are uniformly distributed across databases, each > request has a 1-in-N chance of being the right one. 
> I wrote a little Perl program to estimate the probability that we won't find the next request for our databases within 10 requests as a function of the number of databases:
>
> 1 databases => failure chance with 10 lookahead is 0.00%
> 2 databases => failure chance with 10 lookahead is 0.10%
> 3 databases => failure chance with 10 lookahead is 1.74%
> 4 databases => failure chance with 10 lookahead is 5.66%
> 5 databases => failure chance with 10 lookahead is 10.74%
> 6 databases => failure chance with 10 lookahead is 16.18%
> 7 databases => failure chance with 10 lookahead is 21.45%
> 8 databases => failure chance with 10 lookahead is 26.31%
> 9 databases => failure chance with 10 lookahead is 30.79%
> 10 databases => failure chance with 10 lookahead is 34.91%
> 11 databases => failure chance with 10 lookahead is 38.58%
> 12 databases => failure chance with 10 lookahead is 41.85%
> 13 databases => failure chance with 10 lookahead is 44.91%
> 14 databases => failure chance with 10 lookahead is 47.69%
> 15 databases => failure chance with 10 lookahead is 50.12%
> 16 databases => failure chance with 10 lookahead is 52.34%
> 17 databases => failure chance with 10 lookahead is 54.53%
> 18 databases => failure chance with 10 lookahead is 56.39%
> 19 databases => failure chance with 10 lookahead is 58.18%
> 20 databases => failure chance with 10 lookahead is 59.86%
>
> Assuming my script (attached) doesn't have a bug, with only 8 > databases, there's better than a 1-in-4 chance that we'll fail to find > the next entry for the current database within the lookahead window.
This is a good test scenario, but I think it does not take into account that there are multiple queues and we peek into each one. > That's bad, because then the worker will be sitting around waiting > when it should be doing stuff. Maybe it will even exit, even though > there's work to be done, and even though all the other databases have > their own workers already. > I think we should first try an actual program that can test such a scenario on the undo patches before reaching any conclusion. I or one of my colleagues will work on this and report back the results. > You can construct way worse examples than > this one, too: imagine that there are two databases, each with a > worker, and one has 99% of the requests and the other one has 1% of > the requests. It's really unlikely that there's going to be an entry > for the second database within the lookahead window. > I am not sure that is the case, because as soon as the request from the other database gets prioritized (say because its XID becomes older) and comes up as the first request in one of the queues, the undo worker will exit (provided it has worked for some threshold time (10s) in that database) and allow the request from another database to be processed. > And note that > increasing the window doesn't really help either: you just need more > databases than the size of the lookahead window, or even almost as > many as the lookahead window, and things are going to stop working > properly. > > On the other hand, suppose that you have 10 databases and one undo > worker. One database is pretty active and generates a continuous > stream of undo requests at exactly the same speed we can process them. > The others all have 1 pending undo request. Now, what's going to > happen is that you'll always find the undo request for the current > database within the lookahead window. So, you'll never exit.
> Following the logic given above, I think here also the worker will exit as soon as the request from the other database gets prioritized. > But > that means the undo requests in the other 9 databases will just sit > there for all eternity, because there's no other worker to process > them. On the other hand, if you had 11 databases, there's a good > chance it would work fine, because the new request for the active > database would likely be outside the lookahead window, and so you'd > find no work to do and exit, allowing a worker to be started up in > some other database. > As explained above, I think it will work the same way for both 10 and 11 databases. Note that we don't always try to look ahead. We look ahead when we have not worked on the current database for some threshold amount of time. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
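For reference, the failure probabilities in Robert's table match the closed form ((N-1)/N)^10, the chance that ten independent, uniformly distributed requests all belong to other databases; the small deviations from the table suggest the attached script estimates the probability by simulation rather than computing it directly. A minimal C sketch of the direct calculation (an illustration, not the attached Perl script):

#include <math.h>
#include <stdio.h>

/*
 * Chance that none of the next `depth` queue entries is for our
 * database, assuming requests are uniformly distributed across
 * `ndb` databases.
 */
int
main(void)
{
	const int	depth = 10;

	for (int ndb = 1; ndb <= 20; ndb++)
	{
		double		miss = pow((double) (ndb - 1) / ndb, depth);

		printf("%2d databases => failure chance with %d lookahead is %5.2f%%\n",
			   ndb, depth, miss * 100.0);
	}
	return 0;
}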
On Mon, Jul 8, 2019 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I didn't have other use cases in mind, and I think to some extent this > argument holds true for the existing binaryheap_allocate. If we > want to make it more generic, then shouldn't we try to change the > existing binaryheap_allocate to use this new model as well, so that the > binary heap allocation API becomes more generic? No. binaryheap_allocate is fine for simple cases and there's no reason that I can see to change it. > You are right that it won't be the nth-smallest element from the queue, and > we don't even care about that here. The peeking logic is not to find > the next prioritized element but to check if we can find some element > for the same database in the next few entries, to avoid frequent undo > worker restarts. You *should* care for that here. The whole purpose of a binary heap is to help us choose which task we should do first and which ones should be done later. I don't see why it's OK to decide that we only care about doing the tasks in priority order sometimes, and other times it's OK to just pick semi-randomly. > This is a good test scenario, but I think it does not take into account > that there are multiple queues and we peek into each one. I think that makes very little difference, so I don't see why it should be considered. It's true that this will sometimes mask the problem, but so what? An algorithm that works 90% of the time is not much better than one that works 80% of the time, and neither is up to the caliber of work we expect to see in PostgreSQL. > I think we should first try an actual program that can test such > a scenario on the undo patches before reaching any conclusion. I or > one of my colleagues will work on this and report back the results. There is certainly a place for empirical testing of a patch like this (perhaps even before posting it). It does not substitute for a good theoretical explanation of why the algorithm is correct, and I don't think it is. > > You can construct way worse examples than > > this one, too: imagine that there are two databases, each with a > > worker, and one has 99% of the requests and the other one has 1% of > > the requests. It's really unlikely that there's going to be an entry > > for the second database within the lookahead window. > > I am not sure that is the case, because as soon as the request from > the other database gets prioritized (say because its XID becomes older) and > comes up as the first request in one of the queues, the undo worker will > exit (provided it has worked for some threshold time (10s) in that > database) and allow the request from another database to be processed. I don't see how this responds to what I wrote. Neither worker needs to exit in this scenario, but the worker from the less-popular database is likely to exit anyway, which seems like it's probably not the right thing. > > And note that > > increasing the window doesn't really help either: you just need more > > databases than the size of the lookahead window, or even almost as > > many as the lookahead window, and things are going to stop working > > properly. > > > > On the other hand, suppose that you have 10 databases and one undo > > worker. One database is pretty active and generates a continuous > > stream of undo requests at exactly the same speed we can process them. > > The others all have 1 pending undo request. Now, what's going to > > happen is that you'll always find the undo request for the current > > database within the lookahead window.
So, you'll never exit. > > Following the logic given above, I think here also the worker will exit as > soon as the request from the other database gets prioritized. OK. > > But > > that means the undo requests in the other 9 databases will just sit > > there for all eternity, because there's no other worker to process > > them. On the other hand, if you had 11 databases, there's a good > > chance it would work fine, because the new request for the active > > database would likely be outside the lookahead window, and so you'd > > find no work to do and exit, allowing a worker to be started up in > > some other database. > > As explained above, I think it will work the same way for both 10 and > 11 databases. Note that we don't always try to look ahead. We look > ahead when we have not worked on the current database for some > threshold amount of time. That's interesting, and it means that some of the scenarios that I mentioned are not problems. However, I don't believe it means that your code is actually correct. It just means that it's wrong in different ways. The point is that, with the way you've implemented this, whenever you do lookahead, you will, basically randomly, sometimes find the next entry for the current database within the lookahead window, and sometimes you won't. And sometimes it will be the next-highest-priority request, and sometimes it won't. That just cannot possibly be the right thing to do. Would you propose to commit a patch that implemented the following pseudocode?

find-next-thing-to-do:
    see if the highest-priority task in any database is for our database.
    if it is, do it and stop here.
    if it is not, and if we haven't worked on the current database for at least 10 seconds, look for an item in the current database.
    ...but don't look very hard, so that we'll sometimes, semi-randomly, find nothing even when there is something we could do.
    ...and also, sometimes find a lower-priority item that we can do, possibly much lower-priority, instead of the highest-priority thing we can do.

Because that's what your patch is doing. In contrast, the algorithm that I proposed would work like this:

find-next-thing-to-do:
    find the highest-priority item for the current database.
    do it.

I venture to propose that the second one is the superior algorithm here. One problem with the second algorithm, which I pointed out in my previous email, is that sometimes we might want the worker to exit even though there is work to do in the current database. My algorithm makes no provision for that, and yours does. However, yours does that in a way that's totally unprincipled: it just sometimes fails to find any work that it could do even though there is work that it could do. No amount of testing or argumentation is going to convince me that this is a good approach. The decision about when a worker should exit to allow a new one to be launched needs to be based on clear, understandable rules, not on something that happens semi-randomly when a haphazard search for the next entry fails, as if by chance, to find it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Jul 6, 2019 at 8:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jul 4, 2019 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > PFA, the latest version of the undo interface and undo processing patches.
>
> Summary of the changes in the patch set
>
> 1. Undo Interface
> - Rebased over latest undo storage code
> - Implemented undo page compression (don't store the common fields in all the records; instead, we get them from the first complete record of the page).
> - As per Robert's comment, UnpackedUndoRecord is divided into two parts: a) all fields which are set by the caller, b) pointers to structures which are set internally.
> - Epoch and the transaction id are unified as a full transaction id
> - Fixed handling of dbid during recovery (TODO in PrepareUndoInsert)
>
> Pending:
> - Move the loop in UndoFetchRecord to outside and test performance with keeping pin vs pin+lock across undo records. This will be done after testing performance over the zheap code.
> - I need to investigate whether Discard checking can be unified in master and HotStandby in the UndoFetchRecord function.
>
> 2. Undo Processing
> - Defect fix in multi-log rollback for subtransactions.
> - Assorted defect fixes.
>
> Others
> - Fixup for undo log code to handle full transaction id in UndoLogSlot for discard, and other bug fixes in undo log.
> - Fixup for Orphan file cleanup to pass dbid in PrepareUndoInsert

PFA, updated patch version which includes
- One defect fix in undo interface related to undo page compression for handling persistence level
- Implemented pending TODO optimization in undo page compression.
- One defect fix in undo processing related to the prepared transaction

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
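To illustrate the page-compression item in the summary above: records after the first complete record on an undo page omit header fields that are common across a transaction's records, and a reader fills them back in from that first record. A rough sketch of the read side; every identifier here is illustrative rather than taken from the patch:

/* Rough sketch of decompressing an undo record header; all names are
 * illustrative, not the patch's actual identifiers. */
UnpackedUndoRecord uur;

UndoUnpackRecord(page, offset, &uur);
if ((uur.uur_info & UREC_INFO_RMID) == 0)
{
	UnpackedUndoRecord first;

	/* Field was omitted: fetch it from the page's first complete record. */
	UndoUnpackRecord(page, UndoPageGetFirstRecordOffset(page), &first);
	uur.uur_rmid = first.uur_rmid;
}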
On Sat, Jul 6, 2019 at 1:47 AM Robert Haas <robertmhaas@gmail.com> wrote: > > In fact, it seems to me that we shouldn't have any such thing as > "queue entries" at all. The queues should just be pointing to > RollbackHashEntry *, and we should add all the fields there that are > present in any of the "queue entry" structures. This would use less > memory still. > As of now, after we finish executing the rollback actions, the entry from the hash table is removed. Now, at a later time (when queues are full and we want to insert a new entry) when we access the queue entry (to check whether we can remove it) corresponding to the removed hash table entry, will it be safe to access it? The hash table entry might have been freed or would have been reused as some other entry by the time we try to access it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 10, 2019 at 2:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > As of now, after we finish executing the rollback actions, the entry > from the hash table is removed. Now, at a later time (when queues are > full and we want to insert a new entry) when we access the queue entry > (to check whether we can remove it) corresponding to the removed hash > table entry, will it be safe to access it? The hash table entry might > have been freed or would have been reused as some other entry by the > time we try to access it. Hmm, yeah, that's a problem. I think we could possibly fix this by having the binary heaps just store a FullTransactionId rather than a pointer to the RollBackHashEntry. Then, if you get a FullTransactionId from the binary heap, you just do a hash table lookup to find the RollBackHashEntry instead of accessing it directly. If it doesn't exist, then you can just discard the entry: it's for some old transaction that's no longer relevant. However, there are a few problems with that idea. One is that I see that you've made the hash table keyed by full_xid + start_urec_ptr rather than just full_xid, so if the queues just point to an XID, it's not enough to find the hash table entry. The comment claims that this is necessary because "in the same transaction, there could be rollback requests for both logged and unlogged relations," but I don't understand why that means we need start_urec_ptr in the hash table key. It would seem more natural to me to have a single entry that covers both the logged and the unlogged undo for that transaction. (Incidentally, I don't think it's correct that RollbackHashEntry starts with FullTransactionId full_xid + UndoRecPtr start_urec_ptr declared separately; I think it should start with RollbackHashKey - although if we change the key back to just a FullTransactionId then we don't need to worry separately about fixing this issue.) Another problem is that on a 64-bit system, we can pass a FullTransactionId by value, but on a 32-bit system we can't. That's awkward, because if we can't pass the XID by value, then we're back to needing a separately-allocated structure for the queue entries, which I was really hoping to avoid. A second possible approach to this problem is to just reset all the binary heaps (using binaryheap_reset) whenever we insert a new entry into the hash table, and rebuild them the next time they're needed by reinserting all of the current entries in the hash table. That might be too inefficient. You can insert a bunch of things in a row without re-heaping, and you can dequeue a bunch of things in a row without re-heaping, but if they alternate you'll re-heap a lot. I don't know whether that costs enough to worry about; it might be fine. A third possible approach is to allocate a separate array whose entries are reused, and to maintain a freelist of entries from that array. All the real data is stored in this array, and the binary heaps and hash table entries just point to it. When the freelist is empty, the next allocate scans all the binary heaps and removes any pointers to inactive entries; it then puts all inactive entries back onto the freelist. This is more complex than the previous approach, and it doesn't totally avoid re-heaping, because removing pointers to inactive entries from the binary heaps will necessitate a re-heap on next access.
However, if the total capacity of the data structures is large compared to the number of entries actually in use, which will usually be true, we'll have to re-heap much less often, because we only have to do it when the number of allocations exhausts *everything* on the free-list, rather than after every allocation. A fourth possible approach is to enhance the simplehash mechanism to allow us to do cleanup when an item to which there might still be residual pointers is reused. We could allow some code supplied by the definer of an individual simplehash implementation to be executed inside SH_INSERT, just at the point where we're going to set an entry's status to SH_STATUS_IN_USE. What we'd do is add a flag to the structure indicating whether there might be deferred cleanup work for that entry. Maybe it would be called something like 'bool processed' and set when we process the undo work for that entry. If, when we're about to reuse an entry, that flag is set, then we go scan all the binary heaps and remove all entries for which that flag is set. And then we unset the flag for all of those entries. Like the previous approach, this is basically a refinement of the second approach in that it tries to avoid re-heaping too often. Here, instead of re-heaping once we've been through the entire free-list, we'll re-heap when we (more or less randomly) happen to reuse a hash table entry that's been reused, but we avoid it when we happen to snag a hash table entry that hasn't been reused recently. This is probably less efficient at avoiding re-heaping than the previous approach, but it avoids a separately-allocated data structure, which is nice. Broadly, you are correct to point out that you need to avoid chasing stale pointers, and there are a bunch of ways to accomplish that: approach #1 avoids using real pointers, and the rest just make sure that any stale pointers don't stick around long enough to cause any harm. There are probably also several other totally realistic alternatives, and I don't know for sure what is best, or how much it matters. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
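For concreteness, approach #1 would make the dequeue path look something like the sketch below. binaryheap_remove_first() and hash_search() are the existing APIs; the queue and entry names are illustrative, and this assumes a 64-bit build where a FullTransactionId fits in a Datum:

/*
 * Sketch of approach #1: the queues store bare FullTransactionIds, so a
 * dequeuing worker re-looks the request up in the hash table and simply
 * skips entries whose transactions have already been undone.
 */
static RollbackHashEntry *
GetNextRollbackRequest(binaryheap *queue)
{
	while (!binaryheap_empty(queue))
	{
		FullTransactionId fxid;
		RollbackHashEntry *rh;
		bool		found;

		fxid.value = DatumGetUInt64(binaryheap_remove_first(queue));
		rh = (RollbackHashEntry *) hash_search(RollbackRequestHash,
											   &fxid, HASH_FIND, &found);
		if (found)
			return rh;			/* live request */
		/* otherwise: stale entry for an already-processed xact; skip it */
	}
	return NULL;
}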
On Tue, Jul 9, 2019 at 6:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > PFA, updated patch version which includes > - One defect fix in undo interface related to undo page compression > for handling persistence level > - Implemented pending TODO optimization in undo page compression. > - One defect fix in undo processing related to the prepared transaction Looking at 0002 a bit, it seems to me that you really need to spend some energy getting things into a consistent order all across the patch. For example, UndoPackStage uses the ordering: HEADER, TRANSACTION, RMID, RELOID, XID, CID... But the declarations of the UREC_INFO constants go in a different order: TRANSACTION, FORK, BLOCK, BLKPREV... The comments defining those go in a different order and some of them are missing. The definitions of the UndoRecordBlah structures go in a different order still: Transaction, Block, LogSwitch, Payload. UndoRecordHeaderSize goes with FORK, BLOCK, BLPREV, TRANSACTION, LOGSWITCH, .... That really needs to be straightened out and made consistent. You (still) need to rename blkprev to something more generic, as mentioned in previous rounds of review. I think it would be a good idea to avoid complex macros in favor of functions where possible, e.g. UNDO_PAGE_PARTIAL_REC_SIZE. If performance is a concern, it could be declared static inline, which should be as good as a macro. I don't like the fact that undoaccess.c has a new global, undo_compression_info. I haven't read the code thoroughly, but do we really need that? I think it's never modified (so it could just be declared const), and I also think it's just all zeroes (so initializing it isn't really necessary), and I also think that it's just used for initializing other UndoCompressionInfos (so we could just initialize them directly, either by setting the members individually or just zeroing them). It seems like UndoRecordPrepareTransInfo ought to have an Assert(index < some_limit) in the loop. A comment in PrepareUndoInsert refers to "low switch" where it means "log switch." This is by no means a complete review, for which I unfortunately lack the time at present. Just some initial observations. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
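To illustrate the macro-versus-function point with a made-up example (the real UNDO_PAGE_PARTIAL_REC_SIZE computation lives in the patch and is not shown here; the field names below are invented):

/* A macro version evaluates its argument more than once and gets no
 * type checking: */
#define UNDO_PAGE_PARTIAL_REC_SIZE(phdr) \
	((phdr)->rec_len - (phdr)->bytes_left_on_prev_page)

/* The static inline form costs the same after inlining, but is
 * type-checked and evaluates its argument exactly once: */
static inline Size
UndoPagePartialRecSize(const UndoPageHeaderData *phdr)
{
	return phdr->rec_len - phdr->bytes_left_on_prev_page;
}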
On Wed, Jul 10, 2019 at 12:36 PM Robert Haas <robertmhaas@gmail.com> wrote: > Broadly, you are correct to point out that you need to avoid chasing > stale pointers, and there are a bunch of ways to accomplish that: > approach #1 avoids using real pointers, and the rest just make sure > that any stale pointers don't stick around long enough to cause any > harm. There are probably also several other totally realistic > alternatives, and I don't know for sure what is best, or how much it > matters. After some off-list discussion with Andres ... Another possible approach here, which I think I like better, is to switch from using a binary heap to using an rbtree. That wouldn't work well in DSM because of the way it uses pointers, but here we're putting data in the shared memory segment so it seems like it should work. The idea would be to allocate an array of entries with a freelist, and then have allocfunc and freefunc defined to push and pop the freelist. Unlike a binary heap, an rbtree lets us (a) do peek-ahead in sorted order and (b) delete elements from an arbitrary position without rebuilding anything. If we adopt this approach, then I think a bunch of the problems we've been talking about actually get a lot easier. If we pull an item from the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can remove it from the other one cheaply, because we can store a pointer to the RBTNode in the main object. So then we never have any stale pointers in any data structure, which means we don't have to have a strategy to avoid accidentally following them. The fact that we can peek ahead correctly without any new code is also very nice. I'm still concerned that peeking ahead isn't the right approach in general, but if we're going to do it, peeking ahead to the actually-next-highest-priority item is a lot better than peeking ahead to some-item-that-may-be-fairly-high-priority. One problem which Andres spotted is that rbt_delete() can actually move content around, so if you just cache the RBTNode returned by rbt_insert(), it might not be the right one by the time you rbt_delete(), if other stuff has been deleted first. There are several possible approaches to that problem, but one that I'm wondering about is modifying rbt_delete_node() so that it doesn't rely on rbt_copy_data. The idea is that if y != z, instead of copying the data from y to z, copy the left/right/parent pointers from z into y, and make z's left, right, and parent nodes point to y instead. Then we always end up removing the correct node, which would make things much easier for us and might well be helpful to other code that uses rbtree as well. Another small problem, also spotted by Andres, is that rbt_create() uses palloc. That seems easy to work around: just provide an rbt_initialize() function that a caller can use instead if it wants to initialize an already-allocated block of memory. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
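To make the freelist idea concrete: the rbtree callbacks could hand out nodes from a preallocated shared-memory array, using the void *arg hook that rbt_create() already accepts. Everything below except the rule that the RBTNode must be the first member is illustrative:

typedef struct UndoRequestNode
{
	RBTNode		rbtnode;		/* rbtree header; must be first */
	FullTransactionId full_xid; /* request payload (abbreviated) */
	UndoRecPtr	start_urec_ptr;
	Size		request_size;
	struct UndoRequestNode *next_free;	/* threads the freelist */
} UndoRequestNode;

typedef struct UndoRequestNodePool
{
	UndoRequestNode *freelist;	/* head of the free node list */
} UndoRequestNodePool;

static RBTNode *
undo_rbt_alloc(void *arg)
{
	UndoRequestNodePool *pool = (UndoRequestNodePool *) arg;
	UndoRequestNode *node = pool->freelist;

	Assert(node != NULL);		/* pool is sized for the hard limit */
	pool->freelist = node->next_free;
	return &node->rbtnode;
}

static void
undo_rbt_free(RBTNode *x, void *arg)
{
	UndoRequestNodePool *pool = (UndoRequestNodePool *) arg;
	UndoRequestNode *node = (UndoRequestNode *) x;

	node->next_free = pool->freelist;
	pool->freelist = node;
}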
On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 9, 2019 at 6:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > PFA, updated patch version which includes > > - One defect fix in undo interface related to undo page compression > > for handling persistence level > > - Implemented pending TODO optimization in undo page compression. > > - One defect fix in undo processing related to the prepared transaction > > Looking at 0002 a bit, it seems to me that you really need to spend > some energy getting things into a consistent order all across the > patch. For example, UndoPackStage uses the ordering: HEADER, > TRANSACTION, RMID, RELOID, XID, CID... But the declarations of the > UREC_INFO constants go in a different order: TRANSACTION, FORK, BLOCK, > BLKPREV... The comments defining those go in a different order and > some of them are missing. The definitions of the UndoRecordBlah > structures go in a different order still: Transaction, Block, > LogSwitch, Payload. UndoRecordHeaderSize goes with FORK, BLOCK, > BLPREV, TRANSACTION, LOGSWITCH, .... That really needs to be > straightened out and made consistent. > Thanks for the review, I will work on this. > You (still) need to rename blkprev to something more generic, as > mentioned in previous rounds of review. I will change this. > > I think it would be a good idea to avoid complex macros in favor of > functions where possible, e.g. UNDO_PAGE_PARTIAL_REC_SIZE. If > performance is a concern, it could be declared static inline, which > should be as good as a macro. ok > > I don't like the fact that undoaccess.c has a new global, > undo_compression_info. I haven't read the code thoroughly, but do we > really need that? I think it's never modified (so it could just be > declared const), Actually, this will get modified; otherwise, across undo record insertions, how will we know the values of the common fields in the first record of the page? Another option could be that every time we insert a record, we read the values from the first complete undo record on the page, but that would be costly, because for every new insertion we would need to read the first undo record of the page. Currently, we are doing it like this:

a) BeginUndoRecordInsert - copy the global "undo_compression_info" into our local context, for handling multi-prepare; for multi-prepare we don't want to update the global value until we have successfully inserted the undo record.

b) PrepareUndoInsert - operate on the context and update context->undo_compression_info if required (page changed).

c) InsertPrepareUndo - after we have inserted successfully, copy context->undo_compression_info back to the global "undo_compression_info", so that the next undo insertion can get the right information.

and I also think it's just all zeroes (so > initializing it isn't really necessary), and I also think that it's > just used for initializing other UndoCompressionInfos (so we could > just initialize them directly, either by setting the members > individually or just zeroing them). Initially, I was doing that, but later I thought that since InvalidUndoRecPtr is a macro (although its value is 0), shouldn't we initialize all UndoRecPtr variables with InvalidUndoRecPtr instead of using 0 directly? So I changed it like this. > > It seems like UndoRecordPrepareTransInfo ought to have an Assert(index > < some_limit) in the loop. > > A comment in PrepareUndoInsert refers to "low switch" where it means > "log switch." I will fix.
> > This is by no means a complete review, for which I unfortunately lack > the time at present. Just some initial observations. > ok -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 10, 2019 at 10:06 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jul 10, 2019 at 2:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > As of now, after we finish executing the rollback actions, the entry > > from the hash table is removed. Now, at a later time (when queues are > > full and we want to insert a new entry) when we access the queue entry > > (to check whether we can remove it) corresponding to the removed hash > > table entry, will it be safe to access it? The hash table entry might > > have been freed or would have been reused as some other entry by the > > time we try to access it. > > Hmm, yeah, that's a problem. I think we could possibly fix this by > having the binary heaps just store a FullTransactionId rather than a > pointer to the RollBackHashEntry. Then, if you get a > FullTransactionId from the binary heap, you just do a hash table > lookup to find the RollBackHashEntry instead of accessing it directly. > If it doesn't exist, then you can just discard the entry: it's for > some old transaction that's no longer relevant. > > However, there are a few problems with that idea. One is that I see > that you've made the hash table keyed by full_xid + start_urec_ptr > rather than just full_xid, so if the queues just point to an XID, it's > not enough to find the hash table entry. The comment claims that this > is necessary because "in the same transaction, there could be rollback > requests for both logged and unlogged relations," but I don't > understand why that means we need start_urec_ptr in the hash table > key. It would seem more natural to me to have a single entry that > covers both the logged and the unlogged undo for that transaction. > The data for logged and unlogged undo are in separate logs, so the discard worker can encounter them at different times. It is quite possible that by the time it encounters the second request, some undo worker is already halfway through processing the first request. It might be feasible to combine them during foreground work, but after startup, or at other times when the discard worker has to register the request, it won't be feasible to have one entry; at the least, we would need more smarts to ensure that we can always edit the hash table entry at a later time to append the request. I have thought about keeping full_xid + persistence_level/undo_category as a key, but as we need start_ptr for the request anyway, it seems appealing to use the same. Also, even if we try to support one entry for logged and unlogged undo, it won't always be possible to have one request for it, as in the case explained for the discard worker. > (Incidentally, I don't think it's correct that RollbackHashEntry > starts with FullTransactionId full_xid + UndoRecPtr start_urec_ptr > declared separately; I think it should start with RollbackHashKey - > although if we change the key back to just a FullTransactionId then we > don't need to worry separately about fixing this issue.) > Agreed. It seems that before we analyze or discuss in detail the other solutions related to dangling entries, it is better to investigate the rbtree idea you and Andres came up with, as on a quick look it seems that might avoid creating the dangling entries in the first place. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 11, 2019 at 1:29 AM Robert Haas <robertmhaas@gmail.com> wrote: > > After some off-list discussion with Andres ... > > Another possible approach here, which I think I like better, is to > switch from using a binary heap to using an rbtree. That wouldn't > work well in DSM because of the way it uses pointers, but here we're > putting data in the shared memory segment so it seems like it should > work. The idea would be to allocate an array of entries with a > freelist, and then have allocfunc and freefunc defined to push and pop > the freelist. Unlike a binary heap, an rbtree lets us (a) do > peek-ahead in sorted order and (b) delete elements from an arbitrary > position without rebuilding anything. > > If we adopt this approach, then I think a bunch of the problems we've > been talking about actually get a lot easier. If we pull an item from > the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can > remove it from the other one cheaply, because we can store a pointer > to the RBTNode in the main object. So then we never have any stale > pointers in any data structure, which means we don't have to have a > strategy to avoid accidentally following them. > > The fact that we can peek ahead correctly without any new code is also > very nice. I'm still concerned that peeking ahead isn't the right > approach in general, but if we're going to do it, peeking ahead to the > actually-next-highest-priority item is a lot better than peeking ahead > to some-item-that-may-be-fairly-high-priority. > > One problem which Andres spotted is that rbt_delete() can actually > move content around, so if you just cache the RBTNode returned by > rbt_insert(), it might not be the right one by the time you > rbt_delete(), if other stuff has been deleted first. There are > several possible approaches to that problem, but one that I'm > wondering about is modifying rbt_delete_node() so that it doesn't rely > on rbt_copy_data. The idea is that if y != z, instead of copying the > data from y to z, copy the left/right/parent pointers from z into y, > and make z's left, right, and parent nodes point to y instead. > I am not sure, but don't we need to retain the color of z as well? Apart from this, the handling of duplicate keys (e.g., for the size queue, the sizes of two requests can be the same) might need some work. Basically, either a special combiner function needs to be written (not sure yet what we should do there) or we always need to ensure that the key is unique, like (size + start_urec_ptr). If the size is the same, then we can decide based on start_urec_ptr. I think we can change the implementation to an rbtree with some enhancements instead of the binary heap, or alternatively, we can use one of the two ideas suggested by you in the email above [1] to simplify the code and keep using the binary heap for now. Especially, I like the below one. "2. However, I don't think we should have a separate request object for each queue anyway. We should insert pointers to the same objects in all the relevant queues (either size + XID, or else error). So instead of having three sets of objects, one for each queue, we'd just have one set of objects and point to them with as many as two pointers. We'd therefore need LESS memory than we're using today, because we wouldn't have separate arrays for XID, size, and error queue elements." I think even if we currently go with a binary heap, it will be possible to change it to an rbtree later, but I am fine either way.
[1] - https://www.postgresql.org/message-id/CA%2BTgmoZ5g7UzMvM_42YMG8nbhOYpH%2Bu5OMMnePJkYtT5HWotUw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
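For concreteness, idea 2 amounts to a single shared request object carrying the fields Robert listed earlier in the thread, enqueued by pointer in both orderings; the layout and names below are illustrative:

typedef struct UndoRequestData
{
	FullTransactionId full_xid;
	UndoRecPtr	start_urec_ptr;
	Oid			dbid;
	Size		request_size;
	TimestampTz next_retry_at;
	TimestampTz error_occurred_at;
} UndoRequestData;

UndoRequestData *req;			/* one shared object per rollback request */

/* The XID and size queues both point at the same object: */
binaryheap_add(undo_xid_queue, PointerGetDatum(req));
binaryheap_add(undo_size_queue, PointerGetDatum(req));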
On Fri, Jul 12, 2019 at 5:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I am not sure, but don't we need to retain the color of z as well? I believe that would be very wrong. If you recolor an internal node, you'll break the constant-black-height invariant. > Apart from this, the handling of duplicate keys (e.g., for the size queue, > the sizes of two requests can be the same) might need some work. Basically, > either a special combiner function needs to be written (not sure yet what > we should do there) or we always need to ensure that the key is unique, > like (size + start_urec_ptr). If the size is the same, then we can > decide based on start_urec_ptr. I think that this problem is somewhat independent of whether we use an rbtree or a binaryheap or some other data structure. I would be inclined to use XID as a tiebreak for the size queue, so that it's more likely to match the ordering of the XID queue, but if that's inconvenient, then some other arbitrary value like start_urec_ptr should be fine. > I think we can change the implementation to an rbtree with some > enhancements instead of the binary heap, or alternatively, we can > use one of the two ideas suggested by you in the email above [1] to > simplify the code and keep using the binary heap for now. Especially, > I like the below one. > "2. However, I don't think we should have a separate request object > for each queue anyway. We should insert pointers to the same objects > in all the relevant queues (either size + XID, or else error). So > instead of having three sets of objects, one for each queue, we'd just > have one set of objects and point to them with as many as two > pointers. > We'd therefore need LESS memory than we're using today, because we > wouldn't have separate arrays for XID, size, and error queue > elements." > > I think even if we currently go with a binary heap, it will be > possible to change it to an rbtree later, but I am fine either way. Well, I don't see much point in revising all of this logic twice. We should pick the way we want it to work and make it work that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 12, 2019 at 7:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 12, 2019 at 5:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Apart from this, the handling of duplicate keys (e.g., for the size queue, > > the sizes of two requests can be the same) might need some work. Basically, > > either a special combiner function needs to be written (not sure yet what > > we should do there) or we always need to ensure that the key is unique, > > like (size + start_urec_ptr). If the size is the same, then we can > > decide based on start_urec_ptr. > > I think that this problem is somewhat independent of whether we use an > rbtree or a binaryheap or some other data structure. > I think then I am missing something, because what I am talking about is the below code in rbt_insert:

rbt_insert()
{
..
    cmp = rbt->comparator(data, current, rbt->arg);
    if (cmp == 0)
    {
        /*
         * Found node with given key.  Apply combiner.
         */
        rbt->combiner(current, data, rbt->arg);
        *isNew = false;
        return current;
    }
..
}

As you can see, here it doesn't add the duplicate key to the tree, which is not the case with binary_heap as far as I can tell. > I would be > inclined to use XID as a tiebreak for the size queue, so that it's > more likely to match the ordering of the XID queue, but if that's > inconvenient, then some other arbitrary value like start_urec_ptr > should be fine. > I think it would be better to use start_urec_ptr, because XID can be non-unique in our case. As I explained in one of the emails above [1], we register the requests for logged and unlogged relations separately, so XID can be non-unique. > > > > I think even if we currently go with a binary heap, it will be > > possible to change it to an rbtree later, but I am fine either way. > > Well, I don't see much point in revising all of this logic twice. We > should pick the way we want it to work and make it work that way. > Yeah, I agree. So, I am assuming here that since you have discussed this idea with Andres off-list, he is on board with changing it, as he had originally suggested using binary_heap. Andres, do let us know if you think differently here. It would be good if anyone else following the thread could also weigh in. [1] - https://www.postgresql.org/message-id/CAA4eK1LEKyPZD5Dy4j1u2smUUyMzxgC2YLj8E%2BaJpsvG7sVJYA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
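Concretely, a size-queue comparator along the lines Amit describes, written against the real rbt_comparator signature and reusing the illustrative UndoRequestNode from the freelist sketch above:

static int
undo_size_comparator(const RBTNode *a, const RBTNode *b, void *arg)
{
	const UndoRequestNode *ra = (const UndoRequestNode *) a;
	const UndoRequestNode *rb = (const UndoRequestNode *) b;

	if (ra->request_size != rb->request_size)
		return (ra->request_size < rb->request_size) ? -1 : 1;

	/* Tiebreak on start_urec_ptr, which is unique per request. */
	if (ra->start_urec_ptr != rb->start_urec_ptr)
		return (ra->start_urec_ptr < rb->start_urec_ptr) ? -1 : 1;

	return 0;
}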
On Sat, Jul 13, 2019 at 6:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I think then I am missing something, because what I am talking about is > the below code in rbt_insert: What you're saying here is that, with an rbtree, an exact match will result in a merging of requests which we don't want, so we have to make them always unique. That's fine, but even if you used a binary heap where it wouldn't be absolutely required that you break the ties, you'd still want to think at least a little bit about what behavior is best in case of a tie, just from the point of view of making the system efficient. > I think it would be better to use start_urec_ptr, because XID can be > non-unique in our case. As I explained in one of the emails above [1], > we register the requests for logged and unlogged relations > separately, so XID can be non-unique. Yeah. I didn't understand that explanation. It seems to me that one of the fundamental design questions for this system is whether we should allow there to be an unbounded number of transactions that are pending undo application, or whether it's OK to enforce a hard limit. Either way, there should certainly be pressure applied to try to keep the number low, like forcing undo application into the foreground when a backlog is accumulating, but the question is what to do when that's insufficient. My original idea was that we should not have a hard limit, in which case the shared memory data on what is pending might be incomplete, in which case we would need the discard workers to discover transactions needing undo and add them to the shared memory data structures, and if those structures are full, then we'd just skip adding those details and rediscover those transactions again at some future point. But, my understanding of the current design being implemented is that there is a hard limit on the number of transactions that can be pending undo and the in-memory data structures are sized accordingly. In such a system, we cannot rely on the discard worker(s) to (re)discover transactions that need undo, because if there can be transactions that need undo that we don't know about, then we can't enforce a hard limit correctly. The exception, I suppose, is that after a crash, we'll need to scan all the undo logs and figure out which transactions are pending, but that doesn't preclude using a single queue entry covering both the logged and the unlogged portion of a transaction that has written undo of both kinds. We've got to scan all of the undo logs before we allow any new undo-using transactions to start, and so we can create one fully-up-to-date entry that reflects the data for both persistence levels before any concurrent activity happens. I am wondering (and would love to hear other opinions on) the question of which kind of design we ought to be pursuing, but it's got to be one or the other, not something in the middle. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > 2. Introduced a new RMGR callback rm_undo_status. It is used to > decide when record sets in the UNDO_SHARED category should be > discarded (instead of the usual single xid-based rules). The possible > answers are "discard me now!", "ask me again when a given XID is all > visible", and "ask me again when a given XID is no longer running". From the minor nitpicking department, the patches from this stack that are updating rmgrlist.h are consistently failing to update the comment line preceding the list of PG_RMGR() lines. This looks to be patches 0014 and 0015 in this stack; 0015 seems to need to be squashed into 0014. Reviewing Amit's 0016: performUndoActions appears to be badly designed. For starters, it's sometimes wrong: the only place it gets set to true is in UndoActionsRequired (which is badly named, because from the name you expect it to return a Boolean and to not have side effects, but instead it doesn't return anything and does have side effects). UndoActionsRequired() only gets called from selected places, like AbortCurrentTransaction(), so the rest of the time it just returns a wrong answer. Now maybe it's never called at those times, but there's no guard to prevent a function like CanPerformUndoActions() (which is also badly named, because performUndoActions tells you whether you need to perform undo actions, not whether it's possible to perform undo actions) from being called before the flag is set. I think that this flag should be either (1) maintained eagerly - so that wherever we set start_urec_ptr we also set the flag right away or (2) removed - so when we need to know, we just loop over all of the undo categories on the spot, which is not that expensive because there aren't that many of them. It seems pointless to make PrepareTransaction() take undo pointers as arguments, because those pointers are just extracted from the transaction state, to which PrepareTransaction() has a pointer. Thomas has already objected to another proposal to add functions that turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in predicting that he will likewise object to GetEpochForXid. I think this needs to be changed somehow, maybe by doing what the XXX comment you added suggests. This patch has some problems with naming consistency. There's a function called PushUndoRequest() which calls a function called RegisterRollbackReq() to do the heart of the work. So, is it undo or rollback? Are we pushing or registering? Is it a request or a req? For bonus points, the flag that the function sets is called undo_req_pushed, which is halfway in between the two competing terminologies. Other gripes about PushUndoRequest: push is vague and doesn't really explain what's happening, "apllying" is a typo, per_level is a poor variable name and shouldn't be declared volatile. This function has problems with naming in other places, too; please go through all of the names carefully and make them consistent and adequately descriptive. I am not a fan of applying_subxact_undo. I think we should look for a better design there. A couple of things occur to me. One is that we don't necessarily need to go to FATAL; we could just force the current transaction and all of its subtransactions to fail all the way out to the top level, but then perhaps allow new transactions to be started afterwards. I'm not sure that's worth it, but it would work, and I think it has precedent in SxactIsDoomed.
Assuming we're going to stick with the current FATAL plan, I think we should do something like invent a new kind of critical section that forces ERROR to be promoted to FATAL and then use it here. We could call it a semi-critical or locally-critical section, and the undo machinery could use it, but then also so could other things. I've wanted that sort of concept before, so I think it's a good idea to try to have something general and independent of undo. The same concept could be used in PerformUndoActions() instead of having to invent pg_rethrow_as_fatal(), so we'd have two uses for this mechanism right away. FinishPreparedTransactions() tries to apply undo actions while interrupts are still held. Is that necessary? Can we avoid it? It seems highly likely that the logic added to the TBLOCK_SUBCOMMIT case inside CommitTransactionCommand and also into ReleaseCurrentSubTransaction should have been added to CommitSubTransaction instead. If that's not true, then we have to believe that the TBLOCK_SUBRELEASE call to CommitSubTransaction needs different treatment from the other two cases, which sounds unlikely; we also have to explain why undo is somehow different from all of these other releases that are already handled in that function, not in its callers. I also strongly suspect it is altogether wrong to do this before CommitSubTransaction sets s->state to TRANS_COMMIT; what if a subxact callback throws an error? For related reasons, I don't think that the changes to ReleaseSavepoint() are right either. Notice the header comment: "As above, we don't actually do anything here except change blockState." The "as above" part of the comment probably didn't originally refer to DefineSavepoint(), which definitely does do other stuff, but to something like EndImplicitTransactionBlock() or EndTransactionBlock(), and DefineSavepoint() got stuck in the middle later. Anyway, your patch makes the comment false by doing actual state changes in this function, rather than just marking the subtransactions for commit. But why should that be right? If none of the many other bits of state are manipulated here rather than in CommitSubTransaction(), why is undo the one thing that is different? I guess this is basically just compensation for the lack of any of this code in the TBLOCK_SUBRELEASE path which I noted in the previous paragraph, but I still think the right answer is to put it all in CommitSubTransaction() *after* we set TRANS_COMMIT. There are a number of things I either don't like or don't understand about PerformUndoActions. One is that undo_req_pushed gets passed to this function. That just looks really odd from an abstraction point of view. Basically, we have a function whose job is to "perform undo actions," and it gets a flag as an argument that tells it to not actually perform some of the undo actions: that's odd. I think the reason it's like that is because of the issue we've been discussing elsewhere that there's a separate undo request for each category. If you didn't have that, you wouldn't need to do this here. I'm not saying that proves that the one-request-per-persistence-level design is definitely wrong, but this is certainly not a point in its favor, at least IMHO.
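Returning to the semi-critical section idea above, a minimal sketch, modeled on the existing critical-section machinery that promotes ERROR to PANIC; all of the names here are hypothetical:

/* Hypothetical counter, analogous to CritSectionCount. */
extern volatile uint32 SemiCritSectionCount;

#define START_SEMI_CRIT_SECTION()	(SemiCritSectionCount++)
#define END_SEMI_CRIT_SECTION() \
	do { \
		Assert(SemiCritSectionCount > 0); \
		SemiCritSectionCount--; \
	} while (0)

/* ... and in errstart(), next to the existing CritSectionCount test
 * that promotes ERROR to PANIC: */
if (elevel == ERROR && SemiCritSectionCount > 0)
	elevel = FATAL;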
PerformUndoActions() also thinks that there is a possibility of failing to insert a failed request into the error queue, and makes reference to such requests being rediscovered by the discard worker, but I thought (as I said in my previous email) that we had abandoned that approach in favor of always having enough space in shared memory to record everything. Among other problems, if you want oldestXidHavingUndo to be calculated based on the information in shared memory, then you have to have all the records in shared memory, not lose some of them temporarily and have them get re-inserted into the error queue. It also feels to me like there may be a conflict between the everything-must-fit approach and the one-request-per-persistence-level thing you've got here. I believe Andres's idea was one-request-per-transaction, so the idea is something like:

- When your transaction first tries to attach to an undo log, you make a hash table entry.
- If that fails, you error out, but you have no undo, so it's OK.
- If it works, then you know that there's no chance of aborting without making a hash table entry, because you already did it.
- If you commit, you remove the entry, because your transaction does not need to be undone.
- If you abort, you process the entry in the foreground if it's small or if the number of hash table slots remaining is < max_connections. Otherwise you leave it for the background worker to handle.

If you have one request per persistence level, you could make an entry for the first persistence level, and then find that you are out of room when trying to make an entry for the second persistence level. I guess that doesn't break anything: the changes from the first persistence level would get undone, and the second persistence level wouldn't get any undo. Maybe that's OK, but again it doesn't seem all that nice, so maybe we need to think about it some more. I think there's more, but I am out of time for the moment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
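A sketch of the attach-time registration in that scheme: reserve the rollback slot before writing any undo, so that slot exhaustion can only fail a transaction while it still has nothing to undo. The hash table name and error message are illustrative:

RollbackHashEntry *rh;
bool		found;

rh = (RollbackHashEntry *) hash_search(RollbackRequestHash, &fxid,
									   HASH_ENTER_NULL, &found);
if (rh == NULL)
	ereport(ERROR,
			(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
			 errmsg("too many transactions with pending undo actions")));
/* From here on, an abort is guaranteed to find its hash table entry. */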
On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > PerformUndoActions() also thinks that there is a possibility of > failing to insert a failed request into the error queue, and makes > reference to such requests being rediscovered by the discard worker, > but I thought (as I said in my previous email) that we had abandoned > that approach in favor of always having enough space in shared memory > to record everything. Among other problems, if you want > oldestXidHavingUndo to be calculated based on the information in > shared memory, then you have to have all the records in shared memory, > not lose some of them temporarily and have them get re-inserted into > the error queue. > The idea is that the queues can get full, but not the rollback hash table. In the case where the error queue gets full, we mark the entry as Invalid in the hash table, and later, when the discard worker again encounters this request, it adds it to the queue if there is space available and marks the entry in the hash table as valid. This allows us to keep the information about all xacts having pending undo in shared memory. > It also feels to me like there may be a conflict > between the everything-must-fit approach and the > one-request-per-persistence-level thing you've got here. I believe > Andres's idea was one-request-per-transaction, so the idea is > something like: > > - When your transaction first tries to attach to an undo log, you make > a hash table entry. .. .. > - If you commit, you remove the entry, because your transaction does > not need to be undone. I think this can regress performance when there are many concurrent sessions, unless there is a way to add/remove requests without a lock. As of now, we don't enter any request or block any space in shared memory related to pending undo until there is an error or the user explicitly rolls back the transaction. We could surely do it some other way as well, but this way we won't have any overhead in the commit path for successful transactions. > > If you have one request per persistence level, you could make an entry > for the first persistence level, and then find that you are out of > room when trying to make an entry for the second persistence level. I > guess that doesn't break anything: the changes from the first > persistence level would get undone, and the second persistence level > wouldn't get any undo. Maybe that's OK, but again it doesn't seem all > that nice, so maybe we need to think about it some more. > Coming again to the question of whether we need single or multiple entries for one request per persistence level: the reason discussed so far is that the discard worker can register the requests for them while scanning undo logs at different times. However, there are a few more things. For example, what if, while applying the actions, the actions for logged relations succeed and those for unlogged relations fail? Keeping them separate allows better processing: if one fails, register its request in the error queue and try to process the request for the other persistence level. The undo for different persistence levels is kept in separate logs, which makes processing them separately easier. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 15, 2019 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Jul 13, 2019 at 6:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think then I am missing something, because what I am talking about is > > the below code in rbt_insert: > > What you're saying here is that, with an rbtree, an exact match will > result in a merging of requests which we don't want, so we have to > make them always unique. That's fine, but even if you used a binary > heap where it wouldn't be absolutely required that you break the ties, > you'd still want to think at least a little bit about what behavior is > best in case of a tie, just from the point of view of making the > system efficient. > Okay. > > I think it would be better to use start_urec_ptr, because XID can be > > non-unique in our case. As I explained in one of the emails above [1], > > we register the requests for logged and unlogged relations > > separately, so XID can be non-unique. > > Yeah. I didn't understand that explanation. It seems to me that one > of the fundamental design questions for this system is whether we > should allow there to be an unbounded number of transactions that are > pending undo application, or whether it's OK to enforce a hard limit. > Either way, there should certainly be pressure applied to try to keep > the number low, like forcing undo application into the foreground when > a backlog is accumulating, but the question is what to do when that's > insufficient. My original idea was that we should not have a hard > limit, in which case the shared memory data on what is pending might > be incomplete, in which case we would need the discard workers to > discover transactions needing undo and add them to the shared memory > data structures, and if those structures are full, then we'd just skip > adding those details and rediscover those transactions again at some > future point. > > But, my understanding of the current design being implemented is that > there is a hard limit on the number of transactions that can be > pending undo and the in-memory data structures are sized accordingly. > Yes, that is correct. > In such a system, we cannot rely on the discard worker(s) to > (re)discover transactions that need undo, because if there can be > transactions that need undo that we don't know about, then we can't > enforce a hard limit correctly. > I have responded to this point in the email above. > The exception, I suppose, is that > after a crash, we'll need to scan all the undo logs and figure out > which transactions are pending, but that doesn't preclude using a > single queue entry covering both the logged and the unlogged portion > of a transaction that has written undo of both kinds. We've got to > scan all of the undo logs before we allow any new undo-using > transactions to start, and so we can create one fully-up-to-date entry > that reflects the data for both persistence levels before any > concurrent activity happens. > It is correct that no new undo-using transaction can start, but nothing prevents the undo launcher from starting undo workers to process the already-registered requests, which can lead to some concurrent activity. > I am wondering (and would love to hear other opinions on) the question > of which kind of design we ought to be pursuing, but it's got to be > one or the other, not something in the middle. > I agree that it should not be in the middle.
It is possible that I am missing or misunderstanding something here, but AFAIU, the current design, and implementation allows us to maintain the pending undo state in-memory. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Jul 9, 2019 at 6:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > PFA, updated patch version which includes
> > > - One defect fix in undo interface related to undo page compression for handling persistence level
> > > - Implemented pending TODO optimization in undo page compression.
> > > - One defect fix in undo processing related to the prepared transaction
> >
> > Looking at 0002 a bit, it seems to me that you really need to spend some energy getting things into a consistent order all across the patch. For example, UndoPackStage uses the ordering: HEADER, TRANSACTION, RMID, RELOID, XID, CID... But the declarations of the UREC_INFO constants go in a different order: TRANSACTION, FORK, BLOCK, BLKPREV... The comments defining those go in a different order and some of them are missing. The definition of the UndoRecordBlah structures go in a different order still: Transaction, Block, LogSwitch, Payload. UndoRecordHeaderSize goes with FORK, BLOCK, BLPREV, TRANSACTION, LOGSWITCH, .... That really needs to be straightened out and made consistent.

I have worked on this part; please check the latest patch. Some of the headers, i.e. RMID, RELOID, XID, CID, FORK, PREVUNDO, and BLOCK, have only one member, so there are no structures for them; apart from that, I think the others are now in a consistent order.

> > You (still) need to rename blkprev to something more generic, as mentioned in previous rounds of review.
> I will change this.

Changed to prevundo.

> > I think it would be a good idea to avoid complex macros in favor of functions where possible, e.g. UNDO_PAGE_PARTIAL_REC_SIZE. If performance is a concern, it could be declared static inline, which should be as good as a macro.
> ok

Done.

> > I don't like the fact that undoaccess.c has a new global, undo_compression_info. I haven't read the code thoroughly, but do we really need that? I think it's never modified (so it could just be declared const),
>
> Actually, this will get modified; otherwise, across undo record insertions, how will we know the values of the common fields in the first record of the page? Another option could be that every time we insert a record, we read the values from the first complete undo record on the page, but that would be costly because for every new insertion we would need to read the first undo record of the page.
>
> Currently, we are doing it like this:
>
> a) BeginUndoRecordInsert - Copy the global "undo_compression_info" to our local context for handling multi-prepare, because for multi-prepare we don't want to update the global value until we have successfully inserted the undo record.
>
> b) PrepareUndoInsert - Operate on the context and update context->undo_compression_info if required (page changed).
>
> c) InsertPreparedUndo - After we have inserted successfully, overwrite the global "undo_compression_info" with context->undo_compression_info, so that the next undo insertion can get the right information.
>
> > and I also think it's just all zeroes (so initializing it isn't really necessary), and I also think that it's just used for initializing other UndoCompressionInfos (so we could just initialize them directly, either by setting the members individually or just zeroing them).
>
> Initially, I was doing that, but later I thought that since InvalidUndoRecPtr is a macro (although the value is 0), shouldn't we initialize all UndoRecPtr variables with InvalidUndoRecPtr instead of directly using 0? So I changed it like this.

> > It seems like UndoRecordPrepareTransInfo ought to have an Assert(index < some_limit) in the loop.

Done.

> > A comment in PrepareUndoInsert refers to "low switch" where it means "log switch."
> I will fix.

Fixed.

> > This is by no means a complete review, for which I unfortunately lack the time at present. Just some initial observations.
> ok

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
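A minimal sketch of that copy/update/write-back flow, with assumed structure and function names (the real code lives in undoaccess.c and differs in detail):

#include "access/rmgr.h"
#include "access/transam.h"

/* common fields stored once in the first complete record of a page */
typedef struct UndoCompressionInfo
{
    RmgrId          rmid;
    Oid             reloid;
    FullTransactionId xid;
    CommandId       cid;
} UndoCompressionInfo;

typedef struct UndoRecordInsertContext
{
    UndoCompressionInfo compression_info;
    /* ... prepared records, buffers, etc. ... */
} UndoRecordInsertContext;

/* one per undo log category in the real patch; simplified to one here */
static UndoCompressionInfo undo_compression_info;

/* a) take a private copy, so a failed multi-prepare can't clobber the global */
static void
begin_undo_record_insert(UndoRecordInsertContext *context)
{
    context->compression_info = undo_compression_info;
}

/* b) PrepareUndoInsert then updates only context->compression_info ... */

/* c) ... and only a successful insertion publishes it back */
static void
insert_prepared_undo(UndoRecordInsertContext *context)
{
    /* ... write the prepared records into undo pages ... */
    undo_compression_info = context->compression_info;
}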
On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Reviewing Amit's 0016:
>
> performUndoActions appears to be badly-designed. For starters, it's sometimes wrong: the only place it gets set to true is in UndoActionsRequired (which is badly named, because from the name you expect it to return a Boolean and to not have side effects, but instead it doesn't return anything and does have side effects). UndoActionsRequired() only gets called from selected places, like AbortCurrentTransaction(), so the rest of the time it just returns a wrong answer. Now maybe it's never called at those times, but there's no guard to prevent a function like CanPerformUndoActions() (which is also badly named, because performUndoActions tells you whether you need to perform undo actions, not whether it's possible to perform undo actions) from being called before the flag is set. I think that this flag should be either (1) maintained eagerly - so that wherever we set start_urec_ptr we also set the flag right away or (2) removed - so when we need to know, we just loop over all of the undo categories on the spot, which is not that expensive because there aren't that many of them.

I would prefer to go with (2). So, I will change the function CanPerformUndoActions() to loop over the categories and return whether there is a need to perform undo actions. Also, I will rename CanPerformUndoActions to NeedToPerformUndoActions or UndoActionsRequired; any better suggestions?

> It seems pointless to make PrepareTransaction() take undo pointers as arguments, because those pointers are just extracted from the transaction state, to which PrepareTransaction() has a pointer.

Agreed, will remove.

> Thomas has already objected to another proposal to add functions that turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in predicting that he will likewise object to GetEpochForXid. I think this needs to be changed somehow, maybe by doing what the XXX comment you added suggests.

We can do what the comment says, but there is one more similar usage in undodiscard.c as well, so I am not sure if that is the right thing. I think Thomas is suggesting that we open-code its usage where it is safe to do so and required. I have responded to his email; let us see what he has to say, and based on that we can modify this patch.

> This patch has some problems with naming consistency. There's a function called PushUndoRequest() which calls a function called RegisterRollbackReq() to do the heart of the work. So, is it undo or rollback? Are we pushing or registering? Is it a request or a req?

I think we can rename PushUndoRequest as RegisterUndoRequest and RegisterRollbackReq as RegisterUndoRequestGuts.

> For bonus points, the flag that the function sets is called undo_req_pushed, which is halfway in between the two competing terminologies. Other gripes about PushUndoRequest: push is vague and doesn't really explain what's happening, "apllying" is a typo, per_level is a poor variable name and shouldn't be declared volatile. This function has problems with naming in other places, too; please go through all of the names carefully and make them consistent and adequately descriptive.

Okay, will change as per your suggestions.

> I am not a fan of applying_subxact_undo. I think we should look for a better design there. A couple of things occur to me. One is that we don't necessarily need to go to FATAL; we could just force the current transaction and all of its subtransactions to fail all the way out to the top level, but then perhaps allow new transactions to be started afterwards. I'm not sure that's worth it, but it would work, and I think it has precedent in SxactIsDoomed. Assuming we're going to stick with the current FATAL plan, I think we should do something like invent a new kind of critical section that forces ERROR to be promoted to FATAL and then use it here. We could call it a semi-critical or locally-critical section, and the undo machinery could use it, but then also so could other things. I've wanted that sort of concept before, so I think it's a good idea to try to have something general and independent of undo. The same concept could be used in PerformUndoActions() instead of having to invent pg_rethrow_as_fatal(), so we'd have two uses for this mechanism right away.

Okay, I will investigate along the lines of the semi-critical section.

> FinishPreparedTransactions() tries to apply undo actions while interrupts are still held. Is that necessary? Can we avoid it?

I don't think so. I'll think some more and report back if I see any problem; otherwise, I will do RESUME_INTERRUPTS before performing the actions.

> It seems highly likely that the logic added to the TBLOCK_SUBCOMMIT case inside CommitTransactionCommand and also into ReleaseCurrentSubTransaction should have been added to CommitSubTransaction instead. If that's not true, then we have to believe that the TBLOCK_SUBRELEASE call to CommitSubTransaction needs different treatment from the other two cases, which sounds unlikely; we also have to explain why undo is somehow different from all of these other releases that are already handled in that function, not in its callers.

Yeah, it is better to move that code from ReleaseSavepoint to here, or rather to move it to CommitSubTransaction as you suggested.

> I also strongly suspect it is altogether wrong to do this before CommitSubTransaction sets s->state to TRANS_COMMIT; what if a subxact callback throws an error?

Are you worried that it might lead to the execution of actions twice? If so, I think we prevent that during replay of the actions, and it can happen in other ways too. I am not saying that we should not move that code block to the location you are suggesting, but I think the current code is also not wrong.

> For related reasons, I don't think that the changes to ReleaseSavepoint() are right either. Notice the header comment: "As above, we don't actually do anything here except change blockState." The "as above" part of the comment probably didn't originally refer to DefineSavepoint(), which definitely does do other stuff, but to something like EndImplicitTransactionBlock() or EndTransactionBlock(), and DefineSavepoint() got stuck in the middle later. Anyway, your patch makes the comment false by doing actual state changes in this function, rather than just marking the subtransactions for commit. But why should that be right? If none of the many other bits of state are manipulated here rather than in CommitSubTransaction(), why is undo the one thing that is different? I guess this is basically just compensation for the lack of any of this code in the TBLOCK_SUBRELEASE path which I noted in the previous paragraph, but I still think the right answer is to put it all in CommitSubTransaction() *after* we set TRANS_COMMIT.

Agreed, will change accordingly.

> There are a number of things I either don't like or don't understand about PerformUndoActions. One is that undo_req_pushed gets passed to this function. That just looks really odd from an abstraction point of view. Basically, we have a function whose job is to "perform undo actions," and it gets a flag as an argument that tells it to not actually perform some of the undo actions: that's odd. I think the reason it's like that is because of the issue we've been discussing elsewhere that there's a separate undo request for each category.

The reason was that if we don't have that check here, then we need to do the same in both callers. As there are just two places, moving it to the caller should be okay. If we do that, then the loop over persistence levels can probably also be moved into the caller.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
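As a rough illustration of the semi-critical section idea floated above; nothing like this exists in core today, so the names and the elog hook are assumptions:

/* Sketch: promote ERROR to FATAL while inside a semi-critical section. */
static int  SemiCriticalSectionCount = 0;

#define START_SEMI_CRIT_SECTION()   (SemiCriticalSectionCount++)
#define END_SEMI_CRIT_SECTION() \
    do { \
        Assert(SemiCriticalSectionCount > 0); \
        SemiCriticalSectionCount--; \
    } while (0)

/*
 * errstart() would then promote the level, roughly:
 *
 *     if (elevel == ERROR && SemiCriticalSectionCount > 0)
 *         elevel = FATAL;
 *
 * mirroring the way a regular critical section (CritSectionCount)
 * already promotes ERROR to PANIC.
 */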
On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> 1. Renamed UndoPersistence to UndoLogCategory everywhere, and added a fourth category UNDO_SHARED where transactions can write 'out of band' data that relates to more than one transaction.
>
> 2. Introduced a new RMGR callback rm_undo_status. It is used to decide when record sets in the UNDO_SHARED category should be discarded (instead of the usual single xid-based rules). The possible answers are "discard me now!", "ask me again when a given XID is all visible", and "ask me again when a given XID is no longer running".
>
> 3. Recognise UNDO_SHARED record set boundaries differently. Whereas undolog.c recognises transaction boundaries automatically for the other categories (UNDO_PERMANENT, UNDO_UNLOGGED, UNDO_TEMP), for UNDO_SHARED the boundaries are marked explicitly by the client code.
>
> 4. Added some quick-and-dirty throw-away test stuff to demonstrate that. SELECT test_multixact([1234, 2345]) will create a new record set that will survive until the given array of transactions is no longer running, and then it'll be discarded. You can see that with SELECT * FROM undoinspect('shared'). Or look at SELECT pg_stat_undo_logs. This test simply writes all the xids into its payload, and then has an rm_undo_status function that returns the first xid it finds in the list that is still running, or if none are running returns UNDO_STATUS_DISCARD.
>
> Currently you can only return UNDO_STATUS_WAIT_XMIN, to wait for an xid to be older than the oldest xmin; presumably it'd be useful to be able to discard as soon as an xid is no longer active, which could be a bit sooner.
>
> Another small change: several people commented that UndoLogIsDiscarded(ptr) ought to have some kind of fast path that doesn't acquire locks since it'll surely be hammered. Here's an attempt at that, providing an inlined function that uses a per-backend recent_discard to avoid doing more work in the (hopefully) common case that you mostly encounter discarded undo pointers. I hope this change will show up in profilers in some zheap workloads, but this hasn't been tested yet.
>
> Another small change/review: the function UndoLogGetNextInsertPtr() previously took a transaction ID, but I'm not sure if that made sense; I need to think about it some more.
>
> I pulled in the latest patches from the "undoprocessing" branch as of late last week, and most of the above is implemented as fixup commits on top of that.
>
> Next I'm working on DBA facilities for forcing undo records to be discarded (which consists mostly of sorting out the interlocking to make that work safely). And also testing facilities for simulating undo log switching (when you fill up each log and move to another one; these are rarely-run code paths, so we need a good way to make them not rare).

In 0003-Add-undo-log-manager:

/* If we discarded everything, the slot can be given up. */
+ if (entirely_discarded)
+ free_undo_log_slot(slot);

I have noticed that when the undo log is detached and full, if we discard the complete log, we release its slot. But what is bothering me is: should we add that log to the free list? Or am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 16, 2019 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > This patch has some problems with naming consistency. There's a function called PushUndoRequest() which calls a function called RegisterRollbackReq() to do the heart of the work. So, is it undo or rollback? Are we pushing or registering? Is it a request or a req?
>
> I think we can rename PushUndoRequest as RegisterUndoRequest and RegisterRollbackReq as RegisterUndoRequestGuts.

One thing I am not sure about in the above suggestion is whether it is a good idea to expose a function whose name ends with 'Guts'. I have checked and found that there are a few similar precedents like ExecuteTruncateGuts. Another idea could be to rename RegisterRollbackReq as RegisterUndoRequestInternal. We have a few precedents for that as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 16, 2019 at 10:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jul 16, 2019 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > This patch has some problems with naming consistency. There's a > > > function called PushUndoRequest() which calls a function called > > > RegisterRollbackReq() to do the heart of the work. So, is it undo or > > > rollback? Are we pushing or registering? Is it a request or a req? > > > > I think we can rename PushUndoRequest as RegisterUndoRequest and > > RegisterRollbackReq as RegisterUndoRequestGuts. > > One thing I am not sure about the above suggestion is whether it is a > good idea to expose a function which ends with 'Guts'. I have checked > and found that there are a few similar precedents like > ExecuteTruncateGuts. Another idea could be to rename > RegisterRollbackReq as RegisterUndoRequestInternal. We have few > precedents for that as well. I don't personally like Guts, not only because bringing human (or animal) body parts into this seems unnecessary, but more importantly because it's not at all descriptive. Internal is no better. The point is that you need to give the functions names that make it clear how what one function does is different from what another function does, and neither Guts nor Internal is going to help with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 16, 2019 at 12:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The idea is that the queues can get full, but not the rollback hash table. In the case where the error queue gets full, we mark the entry as Invalid in the hash table and later, when the discard worker again encounters this request, it adds it to the queue if space is available and marks the entry in the hash table as valid. This allows us to keep the information of all xacts having pending undo in shared memory.

I don't understand. How is it OK to have entries in the hash table but not the queues? And why would that ever happen, anyway? If you make the queues as big as the hash table is, then they should never fill up (or, if using binary heaps with lazy removal rather than rbtrees, they might fill up, but if they do, you can always make space by cleaning out the stale entries).

> I think this can regress performance when there are many concurrent sessions unless there is a way to add/remove requests without a lock. As of now, we don't enter any request or block any space in shared memory related to pending undo until there is an error or the user explicitly rolls back the transaction. We can surely do it some other way as well, but this way we won't have any overhead in the commit or successful transaction's path.

Well, we're already incurring some overhead to attach to an undo log, and that probably involves some locking. I don't see why this would be any worse, and maybe it could piggyback on the existing work. Anyway, if you don't like this solution, propose something else. It's impossible to correctly implement a hard limit unless the number of aborted-but-not-yet-undone transactions is bounded by (HARD_LIMIT - ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW). If there are 100 transactions each bound to 2 undo logs, and you crash, you will need to (as you have it designed now) add another 200 transactions to the hash table upon recovery, and that will make you exceed the hard limit unless you were at least 200 transactions below the limit before the crash. Have you handled that somehow? If so, how? It seems to me that you MUST - at a minimum - keep a count of undo logs attached to in-progress transactions, if not the actual hash table entries.

> Again coming to the question of whether we need single or multiple entries for one request per persistence level, the reason we have discussed so far is that the discard worker can register the requests for them while scanning undo logs at different times.

Yeah, but why do we need that in the first place? I wrote something about that in a previous email, but you haven't responded to it here.

> However, there are a few more things: for example, what if, while applying the actions, the actions for the logged part succeed and the unlogged part fails? Keeping them separate allows better processing. If one fails, register its request in the error queue and try to process the request for the other persistence level. The undo for different persistence levels is also kept in separate logs, which makes processing them separately easier.

I don't find this convincing. It's not really an argument, just a vague list of issues. If you want to convince me, you'll need to be much more precise. It seems to me that it is generally undesirable to undo the unlogged part of a transaction separately from the logged part of the transaction. But even if we want to support that, having one entry per XID rather than one entry per <XID, persistence level> doesn't preclude it. Even if you discover the entries at different times, you can still handle that by updating the existing entry rather than making a new one. There might be a good reason to do it the way you are describing, but I don't see that you've made the argument for it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
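For reference, a minimal sketch of the "lazy removal" variant Robert mentions above, using PostgreSQL's existing binaryheap; the QueueEntry type and the already_processed flag are hypothetical:

#include "postgres.h"
#include "access/transam.h"
#include "lib/binaryheap.h"

typedef struct QueueEntry
{
    FullTransactionId full_xid;
    bool        already_processed;  /* set when handled via another queue */
} QueueEntry;

/*
 * Pop the highest-priority live request, discarding entries that were
 * already processed via one of the other queues.
 */
static QueueEntry *
dequeue_skipping_stale(binaryheap *heap)
{
    while (!binaryheap_empty(heap))
    {
        QueueEntry *e = (QueueEntry *)
            DatumGetPointer(binaryheap_remove_first(heap));

        if (!e->already_processed)
            return e;
        /* stale: drop it on the floor, freeing a slot in the heap */
    }
    return NULL;
}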
On Tue, Jul 16, 2019 at 7:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I also strongly suspect it is altogether wrong to do > > this before CommitSubTransaction sets s->state to TRANS_COMMIT; what > > if a subxact callback throws an error? > > Are you worried that it might lead to the execution of actions twice? No, I'm worried that you are running code that is part of the commit path before the transaction has actually committed. CommitSubTransaction() is full of stuff which basically propagates whatever the subtransaction did out to the parent transaction, and all of that code runs after we've ruled out the possibility of an abort, but this very-similar-looking code runs while it's still possible for an abort to happen. That seems unlikely to be correct, and even if it is, it seems needlessly inconsistent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a new version.

Here's a relatively complete review of 0019 and 0020 and a remark or two on the beginning of 0003.

Regarding 0020:

The documentation claims that undo data exists in a 64-bit address space divided into 2^34 undo logs, each with a theoretical capacity of 1TB, but that would require 74 bits.

I am mildly suspicious that, on a busy system, the use of 1MB segment files could result in slowdowns due to frequent filesystem operations. We just recently made it more convenient to change the WAL segment size, mostly so that people on very busy systems could crank it up from 16MB to, say, 64MB or 256MB. It's true that the considerations are a bit different here, because undo logs don't have to be archived, and because we might be using many undo logs simultaneously rather than only 1 for the whole system, but it's still true that if you've got a bunch of backends blasting out undo at top speed, you're going to have to recycle files *extremely* quickly. How much performance testing have you done to assess the effect of segment size? Do you think there's an argument for making this 1MB size configurable at initdb-time? Or even variable at runtime, so that we use larger files if we're filling them up in < 100ms or whatever?

I don't think the last paragraph is entirely accurate. The access method gets to control what records are written, but the general format of the records is fixed by the undo system. Perhaps the undo log code isn't what cares about that, but whether it's the undo log code or the undo access code or the undo processing code isn't likely to seem relevant to developers.

Regarding 0019:

I think there's a substantial amount of duplication between 0019 and 0020, and I'm not sure that we ought to have both. They both talk about the purpose of undo, the way the address space is divided, etc. I understand that it would be a little weird to include all of the information from 0019 in the user-facing documentation, and I also understand that it won't work to have no user-facing documentation at all, but it still seems a little odd to me. Possibly 0019 could refer to the SGML documentation for preliminaries and then add only those details that are not covered there.

How could we avoid the limit on the total size of an active transaction mentioned here? And what would be the cost of such a scheme? If we've filled an undo log and moved on to another one, why can't we evict the one that's full and reuse the shared memory slot, bringing it back in later when required? I suspect the answer is that there is a locking rule involved. I think this README would be a good place to document things like locking rules, or at least to refer to where they are documented. I also think we should mull over whether we could relax the rule without too much pain. I expect that at least part of the problem is that somebody might have a pointer to an UndoLogSlot which could become stale if we recycle a slot, but that can already happen at least when the log is fully discarded, so maybe allowing it to happen in other cases wouldn't be too bad.

I know you're laughing at me on the inside, worrying about a transaction that touches so many TB of data that it manages to exhaust all the undo log slots, but I don't think that's a completely crazy scenario. There are PB-scale databases out there, and it would be nice to think that PostgreSQL could capture more of those workloads. They will probably become more common over time.
Reading the section on persistence levels and tablespaces makes me wonder what happens to address space that gets allocated to temporary and unlogged undo logs. It seems pretty important to make sure that we at least don't leak anything significant, and maybe that we actually recycle the address space or share it across backends. That is, if several backends are all writing temporary undo, there's no intrinsic reason why they can't all be using the same temporary undo logs, as long as the file naming works OK for that (e.g. if it follows the same pattern we use for relation names). Any undo logs that get allocated to unlogged undo can be recycled - either for unlogged undo or otherwise - after a crash, and any that are partially filled can be rewound. I don't know how much effort we're expending on any of that right now, but it seems like it would be worth discussing in this README, and possibly improving. When the undo log contents section mentions that "client code is responsible for stepping over the page headers and advancing to the next page," that's again a somewhat middle-of-the-patch stack perspective. I am not sure exactly how this should be phrased, but the point is that the client code we're talking about is not the AM but the next patch in the stack. I think developers will view the AM as the client and our wording probably ought to reflect that. "keepign" is not spelled correctly. A little later on, "checkpoin" is missing a letter. I think it would be worth mentioning how you solved the problem of inferring during recovery the position within the page where the record needs to be placed. The bit about checkpoint files written to pg_undo being potentially inconsistent is confusing. If the files are written before the checkpoint is completed, fsync'd, and not modified afterwards, how can they be inconsistent? Regarding 0003: UndoLogSharedData could use a more extensive comment. It's not very clear what low_logno and next_logno are, and it also seems like it would be worth mentioning how the free lists are linked. On a similar note, I think the file header comment ought to reference the undo README added by 0019 and perhaps also the documentation added by 0020, and I think 0019 and 0020 ought to be flattened into 0003. I meant to write more about 0003 before sending this, but I am out of time and it seems more useful to send what I have now than to wait until I have more... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2019-07-13 15:55:51 +0530, Amit Kapila wrote:
> On Fri, Jul 12, 2019 at 7:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I think even if we currently go with a binary heap, it will be possible to change it to rbtree later, but I am fine either way.
> >
> > Well, I don't see much point in revising all of this logic twice. We should pick the way we want it to work and make it work that way.
>
> Yeah, I agree. So, I am assuming here that as you have discussed this idea with Andres offlist, he is on board with changing it as he has originally suggested using binary_heap. Andres, do let us know if you think differently here. It would be good if anyone else following the thread can also weigh in.

Yes, I think using an rbtree makes sense.

I'm not yet sure whether we'd want the rbtree nodes being pointed to directly by the hashtable, or whether we'd want one indirection.

e.g. either something like:

typedef struct UndoWorkerQueue
{
    /* priority ordered tree */
    RBTree     *tree;
    ....
}

typedef struct UndoWorkerQueueEntry
{
    RBTNode     tree_node;

    /*
     * Reference hashtable via key, not pointers, entries might be
     * moved.
     */
    RollbackHashKey rollback_key;
    ...
} UndoWorkerQueueEntry;

typedef struct RollbackHashEntry
{
    ...
    UndoWorkerQueueEntry *queue_memb_size;
    UndoWorkerQueueEntry *queue_memb_age;
    UndoWorkerQueueEntry *queue_memb_error;
}

and call rbt_delete() for any non-NULL queue_memb_* whenever an entry is dequeued via one of the queues (after setting the one already dequeued from to NULL, of course). Which requires - as Robert mentioned - that rbtree pointers remain stable after insertions.

Alternatively we can have a more complicated arrangement without the "stable pointer" requirement (which'd also similarly work for a binary heap):

typedef struct UndoWorkerQueue
{
    /* information about work needed, not meaningfully ordered */
    UndoWorkerQueueEntry *entries;

    /*
     * Priority ordered references into ->entries, using
     * UndoWorkerQueueTreeEntry as members.
     */
    RBTree     *tree;

    /* unused elements in ->entries, UndoWorkerQueueEntry members */
    slist_head  freelist;

    /*
     * Number of entries in ->entries and tree that can be pruned by
     * doing a scan of both.
     */
    int         num_prunable_entries;
}

typedef struct UndoWorkerQueueEntry
{
    /*
     * Reference hashtable via key, not pointers, entries might be
     * moved.
     */
    RollbackHashKey rollback_key;

    /*
     * As members of UndoWorkerQueue->tree can be moved in memory,
     * RollbackHashEntry cannot directly point to them. Instead the
     * hashtable points to this stable entry.
     */
    bool        already_processed;
    ...
    slist_node  freelist_node;
} UndoWorkerQueueEntry;

typedef struct UndoWorkerQueueTreeEntry
{
    RBTNode     tree_node;
    /* offset into UndoWorkerQueue->entries */
    int         off;
} UndoWorkerQueueTreeEntry;

and again

typedef struct RollbackHashEntry
{
    ...
    UndoWorkerQueueEntry *queue_memb_size;
    UndoWorkerQueueEntry *queue_memb_age;
    UndoWorkerQueueEntry *queue_memb_error;
}

Because the tree entries are not members of the tree itself, pointers to them would be stable, regardless of rbtree (or binary heap) moving them around. The cost of that would be more complicated datastructures, and insertion/deletion/dequeuing operations:

insertion:
    if (slist_is_empty(&queue->freelist))
        prune();
    if (slist_is_empty(&queue->freelist))
        elog(ERROR, "full");

    UndoWorkerQueueEntry *entry =
        slist_container(UndoWorkerQueueEntry, freelist_node,
                        slist_pop_head_node(&queue->freelist));
    UndoWorkerQueueTreeEntry tree_entry;

    entry->already_processed = false;
    entry->... = ...;
    tree_entry.off = entry - queue->entries;    /* calculate offset */
    rbt_insert(queue->tree, &tree_entry.tree_node, NULL);

prune:
    if (queue->num_prunable_entries > 0)
        RBTreeIterator iter;

        rbt_begin_iterate(queue->tree, LeftRightWalk, &iter);
        while ((tnode = rbt_iterate(&iter)) != NULL)
            node = (UndoWorkerQueueTreeEntry *) tnode;
            if (queue->entries[node->off].already_processed)
                rbt_delete(queue->tree, tnode);
                /* XXX: Have to stop here, the iterator is invalid -
                 * probably should add a rbt_delete_current(iterator); */
                break;

dequeue:
    while ((tnode = rbt_leftmost(queue->tree)) != NULL)
        node = (UndoWorkerQueueTreeEntry *) tnode;
        entry = &queue->entries[node->off];
        rbt_delete(queue->tree, tnode);

        /* check if the entry has already been processed via another queue */
        if (entry->already_processed)
            slist_push_head(&queue->freelist, &entry->freelist_node);
        else
            /* found it */
            return entry;
    return NULL;

delete (i.e. processed in another queue):
    /*
     * Queue entry will only be reusable when the corresponding tree
     * entry has been removed. That'll happen either when new entries
     * are needed (cf prune), or when the entry is dequeued (cf dequeue).
     */
    entry->already_processed = true;

I think the first approach is clearly preferable from a simplicity POV, but the second approach would be a bit more generic (applicable to a binary heap as well) and wouldn't require adjusting the rbtree code.

Greetings,

Andres Freund
On Tue, Jul 16, 2019 at 11:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> /* If we discarded everything, the slot can be given up. */
> + if (entirely_discarded)
> + free_undo_log_slot(slot);
>
> I have noticed that when the undo log is detached and full, if we discard the complete log, we release its slot. But what is bothering me is: should we add that log to the free list? Or am I missing something?

Stepping back a bit: The free lists are for undo logs that someone might want to attach to and insert into. If it's full, we probably can't insert anything into it again (well, technically someone else who wants to insert something a bit smaller might be able to, but that's not an interesting case to worry about). So it doesn't need to go back on a free list, but it still needs to exist (= occupy a slot) as long as there is undiscarded data in it, because that data is needed and we need to be able to test URPs against its discard pointer. But once its data is entirely discarded, it ceases to exist -- there is no reason to waste a slot on it, and any URP in this undo log will be considered to be discarded (because we can't find a slot, and we also cache that fact in recent_discard so lookups are fast and lock-free), and therefore it'll not be checkpointed or reloaded at next startup; then we couldn't put it on a free list even if we wanted to, because there is nothing left of it ("logs" don't really exist in memory, only "slots", currently holding the meta-data for a log, which is why I renamed UndoLog to UndoLogSlot to reduce confusion on that point). One of the goals here is to make a system that doesn't require an increasing amount of memory as time goes on -- hence the desire to completely remove state relating to entirely discarded undo logs. (You might point out that the recent_discard cache would get arbitrarily large after we chew through millions of undo logs, but there is another defence against that in the form of low_logno, which isn't used in that test yet but could be used to minimise that effect.) Does this make sense, and do you see a problem?

--
Thomas Munro
https://enterprisedb.com
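A rough sketch of the lock-free fast path mentioned above; the function names, the per-log caching scheme, and the bit split are all assumptions rather than the patch set's actual code:

#include "postgres.h"

typedef uint64 UndoRecPtr;
typedef int UndoLogNumber;

/* assumed 24-bit logno / 40-bit offset split; the patch defines the real one */
#define UndoRecPtrGetLogNo(urp) ((UndoLogNumber) ((urp) >> 40))

static bool undo_rec_ptr_is_discarded_slow(UndoRecPtr pointer);

/* per-backend cache: discard pointer most recently observed, and its log */
static UndoLogNumber recent_logno = -1;
static UndoRecPtr recent_discard = 0;

static inline bool
undo_rec_ptr_is_discarded(UndoRecPtr pointer)
{
    /*
     * Lock-free fast path: within one log, the discard pointer only ever
     * advances, so anything below a discard pointer we've already seen
     * must itself be discarded.
     */
    if (UndoRecPtrGetLogNo(pointer) == recent_logno &&
        pointer < recent_discard)
        return true;

    /*
     * Slow path (not shown): look up the slot under its lock, test the
     * real discard pointer, and refresh recent_logno/recent_discard.
     * If no slot exists at all, the entire log has been discarded.
     */
    return undo_rec_ptr_is_discarded_slow(pointer);
}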
Hi, On 2019-07-15 12:26:21 -0400, Robert Haas wrote: > Yeah. I didn't understand that explanation. It seems to me that one > of the fundamental design questions for this system is whether we > should allow there to be an unbounded number of transactions that are > pending undo application, or whether it's OK to enforce a hard limit. > Either way, there should certainly be pressure applied to try to keep > the number low, like forcing undo application into the foreground when > a backlog is accumulating, but the question is what to do when that's > insufficient. My original idea was that we should not have a hard > limit, in which case the shared memory data on what is pending might > be incomplete, in which case we would need the discard workers to > discover transactions needing undo and add them to the shared memory > data structures, and if those structures are full, then we'd just skip > adding those details and rediscover those transactions again at some > future point. > > But, my understanding of the current design being implemented is that > there is a hard limit on the number of transactions that can be > pending undo and the in-memory data structures are sized accordingly. My understanding is that that's really just an outcome of needing to maintain oldestXidHavingUndo accurately, right? I know I asked this before, but I didn't feel like the answer was that clear (probably due to my own haziness). To me it seems very important to understand whether / how much we can separate the queuing/worker logic from the question of how to maintain oldestXidHavingUndo. > In such a system, we cannot rely on the discard worker(s) to > (re)discover transactions that need undo, because if there can be > transactions that need undo that we don't know about, then we can't > enforce a hard limit correctly. The exception, I suppose, is that > after a crash, we'll need to scan all the undo logs and figure out > which transactions are pending, but that doesn't preclude using a > single queue entry covering both the logged and the unlogged portion > of a transaction that has written undo of both kinds. We've got to > scan all of the undo logs before we allow any new undo-using > transactions to start, and so we can create one fully-up-to-date entry > that reflects the data for both persistence levels before any > concurrent activity happens. Yea, that seems like a question independent of the "completeness" requirement. If desirable, it seems trivial to either have RollbackHashEntry have per-persistence level status (for one entry per xid), or not (for per-persistence entries). Greetings, Andres Freund
On Wed, Jul 17, 2019 at 3:53 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-07-15 12:26:21 -0400, Robert Haas wrote: > > Yeah. I didn't understand that explanation. It seems to me that one > > of the fundamental design questions for this system is whether we > > should allow there to be an unbounded number of transactions that are > > pending undo application, or whether it's OK to enforce a hard limit. > > Either way, there should certainly be pressure applied to try to keep > > the number low, like forcing undo application into the foreground when > > a backlog is accumulating, but the question is what to do when that's > > insufficient. My original idea was that we should not have a hard > > limit, in which case the shared memory data on what is pending might > > be incomplete, in which case we would need the discard workers to > > discover transactions needing undo and add them to the shared memory > > data structures, and if those structures are full, then we'd just skip > > adding those details and rediscover those transactions again at some > > future point. > > > > But, my understanding of the current design being implemented is that > > there is a hard limit on the number of transactions that can be > > pending undo and the in-memory data structures are sized accordingly. > > My understanding is that that's really just an outcome of needing to > maintain oldestXidHavingUndo accurately, right? > Yes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 17, 2019 at 3:48 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Jul 16, 2019 at 11:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > /* If we discarded everything, the slot can be given up. */
> > + if (entirely_discarded)
> > + free_undo_log_slot(slot);
> >
> > I have noticed that when the undo log is detached and full, if we discard the complete log, we release its slot. But what is bothering me is: should we add that log to the free list? Or am I missing something?
>
> Stepping back a bit: The free lists are for undo logs that someone might want to attach to and insert into. If it's full, we probably can't insert anything into it again (well, technically someone else who wants to insert something a bit smaller might be able to, but that's not an interesting case to worry about). So it doesn't need to go back on a free list, but it still needs to exist (= occupy a slot) as long as there is undiscarded data in it, because that data is needed and we need to be able to test URPs against its discard pointer. But once its data is entirely discarded, it ceases to exist -- there is no reason to waste a slot on it,

Right, actually I got that point. But I was thinking that we are wasting one logno from the undo log addressing space, no? Instead, if we can keep it attached to the slot and somehow manage to add it to the free list, then the same logno can be used by someone else?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 17, 2019 at 3:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Right, actually I got that point. But I was thinking that we are wasting one logno from the undo log addressing space, no? Instead, if we can keep it attached to the slot and somehow manage to add it to the free list, then the same logno can be used by someone else?

We can never reuse log numbers. UndoRecPtr values containing that log number could exist in permanent storage anywhere (zheap, zedstore etc) and must appear to be discarded forever if anyone asks. Now, it so happens that the current coding in zheap has fxid + urp for each transaction slot and always checks the fxid first, so it probably wouldn't ask about discarded urps too much, but I don't think that policy is a requirement, and the undo layer can't count on it. I think I heard that zedstore is planning to check urp only.

--
Thomas Munro
https://enterprisedb.com
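The "discarded forever" property follows from the address layout. As a sketch, assuming a 64-bit UndoRecPtr with the log number in the high 24 bits and a 40-bit byte offset (1TB per log); the exact split is defined by the patch set and may differ:

typedef uint64 UndoRecPtr;

#define UndoRecPtrGetLogNo(urp)  ((urp) >> 40)
#define UndoRecPtrGetOffset(urp) ((urp) & (((UndoRecPtr) 1 << 40) - 1))

/*
 * A pointer like this may be stored persistently in a zheap or zedstore
 * page. If logno 42 were ever recycled, an old stored pointer into the
 * previous incarnation of log 42 would silently alias new undo data
 * instead of appearing discarded -- hence log numbers are never reused.
 */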
On Wed, Jul 17, 2019 at 9:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jul 17, 2019 at 3:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Right, actually I got that point. But I was thinking that we are wasting one logno from the undo log addressing space, no? Instead, if we can keep it attached to the slot and somehow manage to add it to the free list, then the same logno can be used by someone else?
>
> We can never reuse log numbers. UndoRecPtr values containing that log number could exist in permanent storage anywhere (zheap, zedstore etc) and must appear to be discarded forever if anyone asks.

Yeah, right. I knew that we cannot reuse an UndoRecPtr but forgot that if we reuse a logno then it is the same as reusing an UndoRecPtr. Sorry for the noise.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 16, 2019 at 9:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 16, 2019 at 12:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The idea is that the queues can get full, but not the rollback hash table. In the case where the error queue gets full, we mark the entry as Invalid in the hash table and later, when the discard worker again encounters this request, it adds it to the queue if space is available and marks the entry in the hash table as valid. This allows us to keep the information of all xacts having pending undo in shared memory.
>
> I don't understand. How is it OK to have entries in the hash table but not the queues? And why would that ever happen, anyway?

We add entries to the queues only when we want them to be processed by background workers, whereas the hash table contains entries for all the pending undo requests, irrespective of whether they are executed by the foreground transaction or by background workers. Once a request is processed, we remove it from the hash table. The reasons for keeping all the pending abort requests in the hash table are, first, that it allows us to compute oldestXidHavingUnappliedUndo, and second, that it prevents duplicate undo requests from backends and the discard worker. In short, there is no reason to keep all the entries in the queues, but there are reasons to keep all the aborted xact entries in the hash table. There is some more explanation about the queues and the hash table in README.UndoProcessing, which again might not be sufficient to convey all the details, but it can still help.

> If you make the queues as big as the hash table is, then they should never fill up (or, if using binary heaps with lazy removal rather than rbtrees, they might fill up, but if they do, you can always make space by cleaning out the stale entries).
>
> > I think this can regress performance when there are many concurrent sessions unless there is a way to add/remove requests without a lock. As of now, we don't enter any request or block any space in shared memory related to pending undo until there is an error or the user explicitly rolls back the transaction. We can surely do it some other way as well, but this way we won't have any overhead in the commit or successful transaction's path.
>
> Well, we're already incurring some overhead to attach to an undo log, and that probably involves some locking. I don't see why this would be any worse, and maybe it could piggyback on the existing work.

We attach to the undo log only once per backend (unless the user changes the undo tablespace in between, or the space in the current log is exhausted) and then use it for all transactions via that backend. We don't take any global lock for undo for each transaction, so here we would need something different. Also, we would need it at commit time as well.

> Anyway, if you don't like this solution, propose something else. It's impossible to correctly implement a hard limit unless the number of aborted-but-not-yet-undone transactions is bounded by (HARD_LIMIT - ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW). If there are 100 transactions each bound to 2 undo logs, and you crash, you will need to (as you have it designed now) add another 200 transactions to the hash table upon recovery, and that will make you exceed the hard limit unless you were at least 200 transactions below the limit before the crash. Have you handled that somehow? If so, how?

Yeah, we have handled it by reserving space for MaxBackends entries; the limit is UndoRollbackHashTableSize() - MaxBackends. There is a bug in the current patch in that it should reserve space for 2 * MaxBackends so that we are safe after recovery, but that can be fixed.

> It seems to me that you MUST - at a minimum - keep a count of undo logs attached to in-progress transactions, if not the actual hash table entries.
>
> > Again coming to the question of whether we need single or multiple entries for one request per persistence level, the reason we have discussed so far is that the discard worker can register the requests for them while scanning undo logs at different times.
>
> Yeah, but why do we need that in the first place? I wrote something about that in a previous email, but you haven't responded to it here.

I have responded to it in a separate email, but let's discuss it here. So, you are right that the only time we need to scan the undo logs to find all pending aborted xacts is immediately after startup. But we can't create a fully up-to-date entry from both the logs unless we also make the undo launcher wait to process anything until we are done. We are not doing this in the current patch, but we can do it if we want. This would be an additional restriction, one that is not required by the current approach. Another related thing is that to update an existing entry for the queues, we would need to delete and re-insert the entry after we find the request in a different log category. Again, if we point queue entries at the hash table, we might not have this additional work, but that has its own set of complexities.

> > However, there are a few more things: for example, what if, while applying the actions, the actions for the logged part succeed and the unlogged part fails? Keeping them separate allows better processing. If one fails, register its request in the error queue and try to process the request for the other persistence level. The undo for different persistence levels is also kept in separate logs, which makes processing them separately easier.
>
> I don't find this convincing. It's not really an argument, just a vague list of issues. If you want to convince me, you'll need to be much more precise.

I think it is implementation-wise simpler to have one entry per persistence level. It is not that we can't deal with all the problems being discussed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
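A sketch of the sizing rule under discussion. ROLLBACK_REQUEST_QUEUE_SIZE and both function names are hypothetical, and the 2 * MaxBackends headroom reflects the fix Amit mentions (up to one request per persistence level per backend can appear after a crash):

#include "miscadmin.h"

#define ROLLBACK_REQUEST_QUEUE_SIZE 1024    /* hypothetical capacity knob */

/* Sketch: capacity of the rollback request hash table. */
static Size
undo_rollback_hash_table_size(void)
{
    /*
     * Room for the configured number of background-processable requests,
     * plus headroom so that requests re-discovered during recovery
     * (logged + unlogged, per backend) always fit.
     */
    return ROLLBACK_REQUEST_QUEUE_SIZE + 2 * MaxBackends;
}

/* Backends may register new requests only while below the soft limit. */
static bool
rollback_hash_table_has_room(long nentries)
{
    return nentries <
        (long) (undo_rollback_hash_table_size() - 2 * MaxBackends);
}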
On Wed, Jul 17, 2019 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-07-15 12:26:21 -0400, Robert Haas wrote:

Responding again with some more details.

> > But, my understanding of the current design being implemented is that there is a hard limit on the number of transactions that can be pending undo and the in-memory data structures are sized accordingly.
>
> My understanding is that that's really just an outcome of needing to maintain oldestXidHavingUndo accurately, right?

Yes.

> I know I asked this before, but I didn't feel like the answer was that clear (probably due to my own haziness). To me it seems very important to understand whether / how much we can separate the queuing/worker logic from the question of how to maintain oldestXidHavingUndo.

I am not sure there is any tight coupling between the queuing/worker logic and computing the oldestXid* value. The main requirement for computing the oldestXid* value is that we need to know the xids of all the pending abort transactions. We had already decided from the very beginning that the hash table will contain all the abort requests, irrespective of whether they are being processed by a foreground process or a background process; this helps us avoid duplicate entries from backends and background workers. Later, we decided that if we can have a hard limit on how many pending undo requests can be present in the system, then we can find the value of oldestXid* from the hash table. I don't know how much this helps, and you might already know all of it, but I thought it better to summarize to avoid any confusion.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
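Given that invariant (every transaction with pending undo has a hash table entry), computing the oldest such XID is a plain scan. A minimal sketch using dynahash, with a hypothetical entry type:

#include "postgres.h"
#include "access/transam.h"
#include "utils/hsearch.h"

typedef struct RollbackHashEntry
{
    FullTransactionId full_xid;
    /* ... start/end undo record pointers, dbid, etc. ... */
} RollbackHashEntry;

static FullTransactionId
compute_oldest_xid_having_undo(HTAB *rollback_hash)
{
    HASH_SEQ_STATUS status;
    RollbackHashEntry *entry;
    FullTransactionId oldest = InvalidFullTransactionId;

    hash_seq_init(&status, rollback_hash);
    while ((entry = (RollbackHashEntry *) hash_seq_search(&status)) != NULL)
    {
        if (!FullTransactionIdIsValid(oldest) ||
            FullTransactionIdPrecedes(entry->full_xid, oldest))
            oldest = entry->full_xid;
    }

    return oldest;
}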
On Tue, Jul 16, 2019 at 9:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 16, 2019 at 7:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I also strongly suspect it is altogether wrong to do > > > this before CommitSubTransaction sets s->state to TRANS_COMMIT; what > > > if a subxact callback throws an error? > > > > Are you worried that it might lead to the execution of actions twice? > > No, I'm worried that you are running code that is part of the commit > path before the transaction has actually committed. > CommitSubTransaction() is full of stuff which basically propagates > whatever the subtransaction did out to the parent transaction, and all > of that code runs after we've ruled out the possibility of an abort, > but this very-similar-looking code runs while it's still possible for > an abort to happen. That seems unlikely to be correct, and even if it > is, it seems needlessly inconsistent. > Fair point, will change as per your suggestion. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 17, 2019 at 3:37 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-07-13 15:55:51 +0530, Amit Kapila wrote:
> > On Fri, Jul 12, 2019 at 7:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > > I think even if we currently go with a binary heap, it will be possible to change it to rbtree later, but I am fine either way.
> > >
> > > Well, I don't see much point in revising all of this logic twice. We should pick the way we want it to work and make it work that way.
> >
> > Yeah, I agree. So, I am assuming here that as you have discussed this idea with Andres offlist, he is on board with changing it as he has originally suggested using binary_heap. Andres, do let us know if you think differently here. It would be good if anyone else following the thread can also weigh in.
>
> Yes, I think using an rbtree makes sense.

Okay.

> I'm not yet sure whether we'd want the rbtree nodes being pointed to directly by the hashtable, or whether we'd want one indirection.
>
> e.g. either something like:
>
> typedef struct UndoWorkerQueue
> {
>     /* priority ordered tree */
>     RBTree *tree;
>     ....
> }

I think we also need the size of the rbtree (i.e., how many nodes/undo requests it has) to know whether we can add more. This information is available in a binary heap, but here I think we need to track it in UndoWorkerQueue. Basically, at each enqueue/dequeue, we need to increment/decrement it. (See the sketch below.)

> typedef struct UndoWorkerQueueEntry
> {
>     RBTNode tree_node;
>
>     /*
>      * Reference hashtable via key, not pointers, entries might be
>      * moved.
>      */
>     RollbackHashKey rollback_key;
>     ...
> } UndoWorkerQueueEntry;

In UndoWorkerQueueEntry, we might also want to include some other info like dbid, request_size, next_retry_at, and err_occurred_at, so that while accessing a queue entry in comparator functions or at other times, we don't always need to perform a hash table search. OTOH, we could do hash_search as well, but maybe code-wise it will be better to keep the additional information. Another thing is that we need some freelist/array for UndoWorkerQueueEntries equivalent to the size of the three queues?

> typedef struct RollbackHashEntry
> {
>     ...
>     UndoWorkerQueueEntry *queue_memb_size;
>     UndoWorkerQueueEntry *queue_memb_age;
>     UndoWorkerQueueEntry *queue_memb_error;
> }
>
> and call rbt_delete() for any non-NULL queue_memb_* whenever an entry is dequeued via one of the queues (after setting the one already dequeued from to NULL, of course). Which requires - as Robert mentioned - that rbtree pointers remain stable after insertions.

Right. BTW, do you have any preference for using dynahash or simplehash for RollbackHashTable?

> Alternatively we can have a more complicated arrangement without the "stable pointer" requirement (which'd also similarly work for a binary heap):
>
> I think the first approach is clearly preferable from a simplicity POV, but the second approach would be a bit more generic (applicable to a binary heap as well) and wouldn't require adjusting the rbtree code.

+1 for the first approach; the second one appears quite complicated compared to the first.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
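A small sketch of the bookkeeping Amit describes, layered on the first arrangement above; the entry type is the hypothetical one from that arrangement, while rbt_insert is the real rbtree API:

#include "lib/rbtree.h"

typedef struct UndoWorkerQueueEntry
{
    RBTNode     tree_node;
    /* ... rollback_key and ordering fields, as sketched above ... */
} UndoWorkerQueueEntry;

typedef struct UndoWorkerQueue
{
    RBTree     *tree;           /* priority-ordered requests */
    int         nentries;       /* current number of requests */
    int         max_entries;    /* capacity of this queue */
} UndoWorkerQueue;

static bool
undo_queue_insert(UndoWorkerQueue *queue, UndoWorkerQueueEntry *entry)
{
    bool        isNew;

    if (queue->nentries >= queue->max_entries)
        return false;           /* caller marks the hash table entry invalid */

    rbt_insert(queue->tree, &entry->tree_node, &isNew);
    if (isNew)
        queue->nentries++;
    return true;
}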
Hi, On 2019-07-18 11:15:05 +0530, Amit Kapila wrote: > On Wed, Jul 17, 2019 at 3:37 AM Andres Freund <andres@anarazel.de> wrote: > > I'm not yet sure whether we'd want the rbtree nodes being pointed to > > directly by the hashtable, or whether we'd want one indirection. > > > > e.g. either something like: > > > > > > typedef struct UndoWorkerQueue > > { > > /* priority ordered tree */ > > RBTree *tree; > > .... > > } > > > > I think we also need the size of rbtree (aka how many nodes/undo > requests it has) to know whether we can add more. This information is > available in binary heap, but here I think we need to track it in > UndoWorkerQueue. Basically, at each enqueue/dequeue, we need to > increment/decrement the same. > > > typedef struct UndoWorkerQueueEntry > > { > > RBTNode tree_node; > > > > /* > > * Reference hashtable via key, not pointers, entries might be > > * moved. > > */ > > RollbackHashKey rollback_key > > ... > > } UndoWorkerQueueEntry; > > > > In UndoWorkerQueueEntry, we might also want to include some other info > like dbid, request_size, next_retry_at, err_occurred_at so that while > accessing queue entry in comparator functions or other times, we don't > always need to perform hash table search. OTOH, we can do hash_search > as well, but may be code-wise it will be better to keep additional > information. The dots signal that additional fields are needed in those places. > Another thing is we need some freelist/array for > UndoWorkerQueueEntries equivalent to size of three queues? I think using the slist as I proposed for the second alternative is better? > BTW, do you have any preference for using dynahash or simplehash for > RollbackHashTable? I find simplehash nicer to use in code, personally, and it's faster in most cases... Greetings, Andres Freund
On Tue, Jul 16, 2019 at 8:39 AM Robert Haas <robertmhaas@gmail.com> wrote: > Thomas has already objected to another proposal to add functions that > turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in > predicting that he will likewise object to GetEpochForXid. I think > this needs to be changed somehow, maybe by doing what the XXX comment > you added suggests. Perhaps we should figure out how to write GetOldestFullXmin() and friends. For FinishPreparedTransaction(), the XXX comment sounds about right (TwoPhaseFileHeader should hold an fxid). -- Thomas Munro https://enterprisedb.com
On Tue, Jul 16, 2019 at 2:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Few comments on the new patch: 1. Additionally, +there is a mechanism for multi-insert, wherein multiple records are prepared +and inserted at a time. Which mechanism are you talking about here? By any chance is this related to some old code? 2. +Fetching and undo record +------------------------ +To fetch an undo record, a caller must provide a valid undo record pointer. +Optionally, the caller can provide a callback function with the information of +the block and offset, which will help in faster retrieval of undo record, +otherwise, it has to traverse the undo-chain. I think this is outdated information. You seem to have forgotten to update the README after the latest changes in the API. 3. + * The cid/xid/reloid/rmid information will be added in the undo record header + * in the following cases: + * a) The first undo record of the transaction. + * b) First undo record of the page. + * c) All subsequent record for the transaction which is not the first + * transaction on the page. + * Except above cases, If the rmid/reloid/xid/cid is same in the subsequent + * records this information will not be stored in the record, these information + * will be retrieved from the first undo record of that page. + * If any of the member rmid/reloid/xid/cid has changed, the changed information + * will be stored in the undo record and the remaining information will be + * retrieved from the first complete undo record of the page + */ +UndoCompressionInfo undo_compression_info[UndoLogCategories]; a. Do we want to compress fork_number also? It is an optional field and is only included when the undo record is not for MAIN_FORKNUM. For zheap, this means it will never be included, but in the future, it could be included for some other AM or some other use case. So, not sure if there is any benefit in compressing the same. b. cid/xid/reloid/rmid - I think it is better to write it as rmid, reloid, xid, cid in the same order as you declare them in UndoPackStage. c. Some minor corrections. /Except above/Except for above/; /, If the/, if the/; /is same/is the same/; /record, these information/record rather this information/ d. I think there is no need to start the line "If any of the..." from a new line, it can be continued where the previous line ends. Also, at the end of that line, add a full stop. 4. /* + * Copy the compression global compression info to our context before + * starting prepare because this value might get updated multiple time in + * case of multi-prepare but the global value should be updated only after + * we have successfully inserted the undo record. + */ In the above comment, the first 'compression' is not required. /time/times/ 5. +/* + * The below common information will be stored in the first undo record of the page. + * Every subsequent undo record will not store this information, if required this information + * will be retrieved from the first undo record of the page. + */ +typedef struct UndoCompressionInfo The line length in the above comments exceeds the 80-char limit. You might want to run pgindent to avoid such problems. 6. +/* + * Exclude the common info in undo record flag and also set the compression + * info in the context. + * 'flag' seems to be a redundant word here? 7. 
+UndoSetCommonInfo(UndoCompressionInfo *compressioninfo, + UnpackedUndoRecord *urec, UndoRecPtr urp, + Buffer buffer) +{ + + /* + * If we have valid compression info and the for the same transaction and + * the current undo record is on the same block as the last undo record + * then exclude the common information which are same as first complete + * record on the page. + */ + if (compressioninfo->valid && + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) && + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp)) Here the comment is just a verbal form of the if-check. How about writing it as: "Exclude the common information from the record which is the same as the first record on the page." 8. UndoSetCommonInfo() { .. if (compressioninfo->valid && + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) && + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp)) + { + urec->uur_info &= ~UREC_INFO_XID; + + /* Don't include rmid if it's same. */ + if (urec->uur_rmid == compressioninfo->rmid) + urec->uur_info &= ~UREC_INFO_RMID; + + /* Don't include reloid if it's same. */ + if (urec->uur_reloid == compressioninfo->reloid) + urec->uur_info &= ~UREC_INFO_RELOID; In all the checks except the transaction id check, urec's info is on the left side. I think all the checks can be made consistent. These are some of the things I noticed while skimming through this patch. I will do some more detailed review later. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Jun 28, 2019 at 6:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I happened to open up 0001 from this series, which is from Thomas, and > > I do not think that the pg_buffercache changes are correct. The idea > > here is that the customer might install version 1.3 or any prior > > version on an old release, then upgrade to PostgreSQL 13. When they > > do, they will be running with the old SQL definitions and the new > > binaries. At that point, it sure looks to me like the code in > > pg_buffercache_pages.c is going to do the Wrong Thing. [...] > > Yep, that was completely wrong. Here's a new version. > One comment/question related to 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch. +make_undo_smgr_create(RelFileNode *rnode, FullTransactionId fxid, + XLogReaderState *xlog_record) +{ + UnpackedUndoRecord undorecord = {0}; + UndoRecordInsertContext context; + + undorecord.uur_rmid = RM_SMGR_ID; + undorecord.uur_type = UNDO_SMGR_CREATE; + undorecord.uur_info = UREC_INFO_PAYLOAD; + undorecord.uur_dbid = rnode->dbNode; + undorecord.uur_xid = XidFromFullTransactionId(fxid); + undorecord.uur_cid = InvalidCommandId; + undorecord.uur_fork = InvalidForkNumber; While reviewing Dilip's patch (undo-record-interface), I noticed that we include Fork_Num in the undo record if it is not MAIN_FORKNUM. So, in this patch's case, we will always include it, as you are passing InvalidForkNumber. I also see that the patch doesn't use uur_fork in the undo record handler, so I think you don't care what its value is. I am not sure what the best thing to do here is, but it might be better if we can avoid adding fork_num to each undo record. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
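For illustration, the packing rule under discussion might look something like the following sketch (the UREC_INFO_FORK flag bit is an assumption; the real patch may name it differently):

/* Sketch: spend record space on the fork number only when it differs
 * from MAIN_FORKNUM; callers such as smgr undo that don't care about
 * the fork could then pass MAIN_FORKNUM and the field would be omitted. */
static void
undo_record_set_fork(UnpackedUndoRecord *urec, ForkNumber fork)
{
    urec->uur_fork = fork;
    if (fork != MAIN_FORKNUM)
        urec->uur_info |= UREC_INFO_FORK;   /* assumed flag bit */
}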
On Wed, Jul 17, 2019 at 2:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > We add entries in queues only when we want them to be processed by > background workers whereas hash table will contain the entries for all > the pending undo requests irrespective of whether they are executed by > foreground-transaction or by background workers. Once the request is > processed, we remove it from the hash table. The reasons for keeping > all the pending abort requests in hash table is that it allows us to > compute oldestXidHavingUnappliedUndo and second is it avoids us to > have duplicate undo requests by backends and discard worker. In > short, there is no reason to keep all the entries in queues, but there > are reasons to keep all the aborted xact entries in hash table. I think we're drifting off on a tangent here. That does make sense, but my original comment that led to this discussion was "PerformUndoActions() also thinks that there is a possibility of failing to insert a failed request into the error queue, and makes reference to such requests being rediscovered by the discard worker, ..." and none of what you've written explains why there is or should be a possibility of failing to insert a request into the error queue. I feel like we've discussed this point to death. You just make the maximum size of the queue equal to the maximum size of the hash table, and it can't ever fail to have room for a new entry. If you remove entries lazily, then it can, but any time it does, you can just go and clean out all of the dead entries and you're guaranteed to then have enough room. And if we switch to rbtree then we won't do lazy removal any more, and it won't matter anyway. > > Anyway, if you don't like this solution, propose something else. It's > > impossible to correctly implement a hard limit unless the number of > > aborted-but-not-yet-undone transaction is bounded to (HARD_LIMIT - > > ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW). > > If there are 100 transactions each bound to 2 undo logs, and you > > crash, you will need to (as you have it designed now) add another 200 > > transactions to the hash table upon recovery, and that will make you > > exceed the hard limit unless you were at least 200 transactions below > > the limit before the crash. Have you handled that somehow? If so, > > how? > > Yeah, we have handled it by reserving the space of MaxBackends. It is > UndoRollbackHashTableSize() - MaxBackends. There is a bug in the > current patch which is that it should reserve space for 2 * > MaxBackends so that after recovery, we are safe, but that can be > fixed. One of us is REALLY confused here. Nothing you do in UndoRollbackHashTableSize() can possibly fix the problem that I'm talking about. Suppose the system gets to a point where all of the rollback hash table entries are in use - there are some entries that are used because work was pushed into the background, and then there are other entries that are present because those transactions are being rolled back in the foreground. Now at this point you crash. Now when you start up, all the hash table entries, including the reserved ones, are already in use before any running transactions start. Now if you allow transactions to start before some of the rollbacks complete, you have got big problems. The system might crash again, and if it does, when it restarts, the total amount of outstanding requests will no longer fit in the hash table, which was the whole premise of this design. 
Maybe that doesn't make sense, so think about it this way. Suppose the following happens repeatedly: the system starts, someone begins a transaction that writes an undo record, the rollback workers start up but don't make very much progress because the system is heavily loaded or whatever reason, the system crashes, rinse, repeat. Since no transactions got successfully rolled back and 1 new transaction that needs roll back got added, the number of transactions pending rollback has increased by one. Now, however big you made the hash table, just repeat this process that number of times plus one, and the hash table overflows. The only way you can prevent that is if you stop the transaction from writing undo when the hash table is already too full. > I have responded to it as a separate email, but let's discuss it here. > So, you are right that only time we need to scan the undo logs to find > all pending aborted xacts is immediately after startup. But, we can't > create a fully update-to-date entry from both the logs unless we make > undo launcher to also wait to process anything till we are done. We > are not doing this in the current patch but we can do it if we want. > This will be an additional restriction we have to put which is not > required for the current approach. I mean, that is just not true. There's no fundamental difference between having two possible entries each of which looks like this: struct entry { txn_details d; }; And having a single entry that looks like this: struct entry { txn_details permanent; txn_details unlogged; bool using_permanent; bool using_unlogged; }; I mean, I'm not saying you would actually want to do exactly the second thing, but arguing that something cannot be done with one design or the other is just not correct. > Another related thing is that to update the existing entry for queues, > we need to delete and re-insert the entry after we find the request in > a different log category. Again it depends if we point queue entries > to hash table, then we might not have this additional work but that > has its own set of complexities. I don't follow this. If you have a hash table where the key is XID, there is no need to delete and reinsert anything just because you discover that the XID has not only permanent undo but also unlogged undo, or something of that sort. > I think it is implementation wise simpler to have one entry per > persistence level. It is not that we can't deal with all the > problems being discussed. It's possible that it's simpler, but I'm not finding the arguments you're making very convincing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 19, 2019 at 12:28 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jul 17, 2019 at 2:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Anyway, if you don't like this solution, propose something else. It's > > > impossible to correctly implement a hard limit unless the number of > > > aborted-but-not-yet-undone transaction is bounded to (HARD_LIMIT - > > > ENTRIES_THAT_WOULD_BE_ADDED_AFTER_RECOVERY_IF_THE_SYSTEM_CRASHED_NOW). > > > If there are 100 transactions each bound to 2 undo logs, and you > > > crash, you will need to (as you have it designed now) add another 200 > > > transactions to the hash table upon recovery, and that will make you > > > exceed the hard limit unless you were at least 200 transactions below > > > the limit before the crash. Have you handled that somehow? If so, > > > how? > > > > Yeah, we have handled it by reserving the space of MaxBackends. It is > > UndoRollbackHashTableSize() - MaxBackends. There is a bug in the > > current patch which is that it should reserve space for 2 * > > MaxBackends so that after recovery, we are safe, but that can be > > fixed. > > One of us is REALLY confused here. Nothing you do in > UndoRollbackHashTableSize() can possibly fix the problem that I'm > talking about. Suppose the system gets to a point where all of the > rollback hash table entries are in use - there are some entries that > are used because work was pushed into the background, and then there > are other entries that are present because those transactions are > being rolled back in the foreground. > We are doing exactly what you have written in the last line of the next paragraph "stop the transaction from writing undo when the hash table is already too full.". So we will never face the problems related to repeated crash recovery. The definition of too full is that we stop allowing new transactions that can write undo once the hash table already has entries equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this make sense? > Now at this point you crash. Now > when you start up, all the hash table entries, including the reserved > ones, are already in use before any running transactions start. Now > if you allow transactions to start before some of the rollbacks > complete, you have got big problems. The system might crash again, > and if it does, when it restarts, the total amount of outstanding > requests will no longer fit in the hash table, which was the whole > premise of this design. > > Maybe that doesn't make sense, so think about it this way. > All you are saying makes sense and I think I can understand the problem you are trying to describe, but we have thought about the same thing and have the algorithm/code in place that won't allow such situations. > > > Another related thing is that to update the existing entry for queues, > > we need to delete and re-insert the entry after we find the request in > > a different log category. Again it depends if we point queue entries > > to hash table, then we might not have this additional work but that > > has its own set of complexities. > > I don't follow this. If you have a hash table where the key is XID, > there is no need to delete and reinsert anything just because you > discover that the XID has not only permanent undo but also unlogged > undo, or something of that sort. > The total size of the undo to be processed will change if we later find undo at another persistence level (permanent or unlogged). Based on the size, the entry's location in the size queue needs to change. 
-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
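For illustration, the admission test described above might look something like this sketch (the lock name and the PendingUndoRequestCount() helper are assumptions, and the locking here is deliberately simplified):

/* Called before a transaction attaches to an undo log for the first
 * time.  MaxBackends entries are held in reserve so that, after a
 * crash, each transaction aborted by the crash can still be entered
 * into the hash table during recovery. */
static bool
UndoRequestSlotAvailable(void)
{
    bool        ok;

    LWLockAcquire(RollbackRequestLock, LW_SHARED);      /* assumed lock */
    ok = PendingUndoRequestCount() <                    /* assumed helper */
        UndoRollbackHashTableSize() - MaxBackends;
    LWLockRelease(RollbackRequestLock);
    return ok;
}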
On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > I don't like the fact that undoaccess.c has a new global, > > undo_compression_info. I haven't read the code thoroughly, but do we > > really need that? I think it's never modified (so it could just be > > declared const), > > Actually, this will get modified otherwise across undo record > insertion how we will know what was the values of the common fields in > the first record of the page. Another option could be that every time > we insert the record, read the value from the first complete undo > record on the page but that will be costly because for every new > insertion we need to read the first undo record of the page. > This information won't be shared across transactions, so can't we keep it in the top transaction's state? It seems to me that would be better than maintaining it as global state. Few more comments on this patch: 1. PrepareUndoInsert() { .. + if (logswitched) + { .. + } + else + { .. + resize = true; .. + } + .. + + do + { + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, rbm); .. + rbm = RBM_ZERO; + cur_blk++; + } while (cur_size < size); + + /* + * Set/overwrite compression info if required and also exclude the common + * fields from the undo record if possible. + */ + if (UndoSetCommonInfo(compression_info, urec, urecptr, + context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)) + resize = true; + + if (resize) + size = UndoRecordExpectedSize(urec); I see that some of the cases where resize is possible are checked before buffer allocation and some after. Isn't it better to do all these checks before buffer allocation? Also, isn't it better to compute the changed size before buffer allocation as well, as that might sometimes result in fewer buffer allocations? Can you find a better way to write context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf? It makes the line too long and difficult to understand. Check for similar instances in the patch and if possible, change them as well. 2. +InsertPreparedUndo(UndoRecordInsertContext *context) { .. /* + * Try to insert the record into the current page. If it + * doesn't succeed then recall the routine with the next page. + */ + InsertUndoData(&ucontext, page, starting_byte); + if (ucontext.stage == UNDO_PACK_STAGE_DONE) + { + MarkBufferDirty(buffer); + break; + } + MarkBufferDirty(buffer); .. } Can't we call MarkBufferDirty(buffer) just before the 'if' check? That will avoid calling it twice. 3. + * Later, during insert phase we will write actual records into thse buffers. + */ +struct PreparedUndoBuffer /thse/these 4. + /* + * If we are writing first undo record for the page the we can set the + * compression so that subsequent records from the same transaction can + * avoid including common information in the undo records. + */ + if (first_complete_undo) /page the we/page then we 5. PrepareUndoInsert() { .. After + * allocation We'll only advance by as many bytes as we turn out to need. + */ + UndoRecordSetInfo(urec); Change the beginning of the comment to: "After allocation, we'll .." 6. PrepareUndoInsert() { .. * TODO: instead of storing this in the transaction header we can + * have separate undo log switch header and store it there. 
+ */ + prevlogurp = + MakeUndoRecPtr(UndoRecPtrGetLogNo(prevlog_insert_urp), + (UndoRecPtrGetOffset(prevlog_insert_urp) - prevlen)); + I don't think this TODO is valid anymore because now the patch has a separate log-switch header. 7. /* + * If undo log is switched then set the logswitch flag and also reset the + * compression info because we can use same compression info for the new + * undo log. + */ + if (UndoRecPtrIsValid(prevlog_xact_start)) /can/can't -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
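On the readability complaint in point 1, one possibility is simply a well-named local variable, as in this sketch:

/* Name the first prepared buffer instead of repeating the long
 * subscript expression (PreparedUndoBuffer is the struct quoted in
 * point 3 above). */
PreparedUndoBuffer *first_buf;

first_buf = &context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]];
if (UndoSetCommonInfo(compression_info, urec, urecptr, first_buf->buf))
    resize = true;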
On Thu, 9 May 2019 at 12:04, Dilip Kumar <dilipbalaut@gmail.com> wrote: > Patches can be applied on top of undo branch [1] commit: > (cb777466d008e656f03771cf16ec7ef9d6f2778b) > > [1] https://github.com/EnterpriseDB/zheap/tree/undo Below are some review points for 0009-undo-page-consistency-checker.patch: + /* Calculate the size of the partial record. */ + partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + + phdr->tuple_len + phdr->payload_len - + phdr->record_offset; There is already an UndoPagePartialRecSize() function which calculates the size of the partial record, which seems to do the same as above. If this is the same, you can omit the above code, and instead down below where you increment next_record, you can do "next_record += UndoPagePartialRecSize()". Also, I see an extra sizeof(uint16) added in UndoPagePartialRecSize(). Not sure which one is correct and which one is wrong, unless I am wrong in assuming that the above calculation and the function definition do the same thing. ------------------ + * We just want to mask the cid in the undo record header. So + * only if the partial record in the current page include the undo + * record header then we need to mask the cid bytes in this page. + * Otherwise, directly jump to the next record. Here, I think you mean: "So only if the partial record in the current page includes the *cid* bytes", rather than "includes the undo record header". Maybe we can say: We just want to mask the cid. So do the partial record masking only if the current page includes the cid bytes from the partial record header. ---------------- + if (phdr->record_offset < (cid_offset + sizeof(CommandId))) + { + char *cid_data; + Size mask_size; + + mask_size = Min(cid_offset - phdr->record_offset, + sizeof(CommandId)); + + cid_data = next_record + cid_offset - phdr->record_offset; + memset(&cid_data, MASK_MARKER, mask_size); + Here, if record_offset lies *between* cid start and cid end, then cid_offset - phdr->record_offset will be negative, and so will be mask_size. Probably abs() should do the work. Also, an Assert(cid_data + mask_size <= page_end) would be nice. I know the cid position of a partial record cannot go beyond the page boundary, but it's better to have this Assert as a sanity check. + * Process the undo record of the page and mask their cid filed. filed => field -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
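For what it's worth, one possible shape for that branch, clamping rather than using abs(), might be the following sketch (it reuses the patch's variable names, which should be treated as assumptions here; note it also passes cid_data rather than &cid_data to memset, since it is the pointed-to bytes that should be masked):

/* The cid occupies [cid_offset, cid_offset + sizeof(CommandId)) within
 * the record, and this page holds the record's bytes from record_offset
 * onwards, starting at next_record.  Mask only the cid bytes that
 * actually fall on this page. */
if (phdr->record_offset < cid_offset + sizeof(CommandId))
{
    Size        skipped = 0;
    Size        mask_size = sizeof(CommandId);
    char       *cid_data;

    if (phdr->record_offset > cid_offset)
    {
        /* The front part of the cid was on the previous page. */
        skipped = phdr->record_offset - cid_offset;
        mask_size -= skipped;
    }
    cid_data = next_record + (cid_offset + skipped - phdr->record_offset);
    Assert(cid_data + mask_size <= page_end);
    memset(cid_data, MASK_MARKER, mask_size);
}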
On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > We are doing exactly what you have written in the last line of the > next paragraph "stop the transaction from writing undo when the hash > table is already too full.". So we will never face the problems related > to repeated crash recovery. The definition of too full is that we stop > allowing new transactions that can write undo once the hash table > already has entries equivalent to (UndoRollbackHashTableSize() - > MaxBackends). Does this make sense? Oops, I was looking in the wrong place. Yes, that makes sense, but: 1. It looks like the test you added to PrepareUndoInsert has no locking, and I don't see how that can be right. 2. It seems like this would result in testing for each new undo insertion that gets prepared, whereas surely we would want to test only when first attaching to an undo log. If you've already attached to the undo log, there's no reason not to continue inserting into it, because doing so doesn't increase the number of transactions (or transaction-persistence level combinations) that need undo. 3. I don't think the test itself is correct. It can fire even when there's no problem. It is correct (or would be if it said 2 * MaxBackends) if every other backend in the system is already attached to an undo log (or two). But if they are not, it will block transactions from being started for no reason. For instance, suppose max_connections = 100 and there are another 100 slots for background rollbacks. Now suppose that the system crashes when 101 slots are in use -- 100 pushed into the background plus 1 that was aborted by the crash. On recovery, this test will refuse to let any new transaction start. Actually it is OK for up to 99 transactions to write undo, just not 100. Or, given that you have a slot per persistence level, it's OK to have up to 199 transaction-persistence-level combinations in flight, just not 200. And that is the difference between the system being unusable after the crash until a rollback succeeds and being almost fully usable immediately. > > I don't follow this. If you have a hash table where the key is XID, > > there is no need to delete and reinsert anything just because you > > discover that the XID has not only permanent undo but also unlogged > > undo, or something of that sort. > > The total size of the undo to be processed will change if we later find > undo at another persistence level (permanent or unlogged). Based on the > size, the entry's location in the size queue needs to change. OK, true. But that's not a significant cost, either in runtime or code complexity. I still don't really see any good reason for the hash table key to be anything other than XID, or really, FXID. I mean, sure, the data structure manipulations are a little different, but not in any way that really matters. And it seems to me that there are some benefits, the biggest of which is that the system becomes easier for users to understand. We can simply say that there is a limit on the number of transactions that either (1) are in progress and have written undo or (2) have aborted and not all of the undo has been processed. If the key is XID + persistence level, then it's a limit on the number of transaction-and-persistence-level combinations, which I feel is not so easy to understand. 
In most but not all scenarios, it means that the limit is about double what you think the limit is, and as the mistake in the current version of the patch makes clear, even the people writing the code can forget about that factor of two. It affects a few other things, too. If you made the key XID and fixed problems (2) and (3) from above, then you'd have a situation where a transaction could fail at only one point: either it bombs the first time it tries to write undo, or it works. As it is, there is a second failure scenario: you do a bunch of work on permanent (or unlogged) tables and then try to write to an unlogged (or permanent) table and it fails because there are not enough slots. Is that the end of the world? No, certainly not. The situation should be rare. But if we have to fail transactions, it's best to fail them before they've started doing any work, because that minimizes the amount of work we waste by having to retry. Of course, a transaction that fails midway through when it tries to write at a second persistence level is also consuming an undo slot in a situation where we're short of undo slots. Another thing which Andres pointed out to me off-list is that we might want to have a function that takes a transaction ID as an argument and tells you the status of that transaction from the point of view of the undo machinery: does it have any undo, and if so how much? As you have it now, such a function would require searching the whole hash table, because the user won't be able to provide an UndoRecPtr to go with the XID. If the hash table key were in fact <XID, undo persistence level> rather than <XID, UndoRecPtr>, then you could do it with two lookups; if it were XID alone, you could do it with one lookup. The difference between one lookup and two is not significant, but having to search the whole hash table is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
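For illustration, with the hash table keyed by FullTransactionId alone, the introspection function mentioned above is a single probe. A sketch (all struct and field names here are assumptions, not the patch's):

#include "access/transam.h"
#include "utils/hsearch.h"

typedef struct RollbackHashEntry
{
    FullTransactionId fxid;     /* hash key: the transaction needing undo */
    Size        undo_size[UndoLogCategories];   /* pending undo per level */
} RollbackHashEntry;

/* Report how much undo is still pending for a given transaction. */
static Size
PendingUndoSize(HTAB *rollback_ht, FullTransactionId fxid)
{
    RollbackHashEntry *ent;
    Size        total = 0;
    int         i;

    ent = (RollbackHashEntry *) hash_search(rollback_ht, &fxid,
                                            HASH_FIND, NULL);
    if (ent == NULL)
        return 0;               /* no pending undo for this xid */
    for (i = 0; i < UndoLogCategories; i++)
        total += ent->undo_size[i];
    return total;
}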
On Fri, Jul 19, 2019 at 7:54 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > + * We just want to mask the cid in the undo record header. So > + * only if the partial record in the current page include the undo > + * record header then we need to mask the cid bytes in this page. > + * Otherwise, directly jump to the next record. > Here, I think you mean : "So only if the partial record in the current > page includes the *cid* bytes", rather than "includes the undo record > header" > May be we can say : > We just want to mask the cid. So do the partial record masking only if > the current page includes the cid bytes from the partial record > header. Hmm, but why is it correct to mask the CID at all? Shouldn't that match? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 16, 2019 at 6:23 PM Andres Freund <andres@anarazel.de> wrote: > Yea, that seems like a question independent of the "completeness" > requirement. If desirable, it seems trivial to either have > RollbackHashEntry have per-persistence level status (for one entry per > xid), or not (for per-persistence entries). I want to talk more about the "completeness" issue, which is basically a question of whether we should (a) put a hard limit on the number of transactions that have unprocessed undo and that are either aborted or in progress (and thus capable of aborting) as proposed by Andres or (b) not have such a hard limit, as originally proposed by me. I think everyone who is not working on this code has installed an automatic filter rule to send the original thread to /dev/null, so I'm changing the subject line in the hopes of getting some of those people to pay attention. If it doesn't work, at least the concerns will be memorialized in case it comes up later. I originally proposed (b) because if undo application is failing for some reason, it seemed better not to bring the whole system to a screeching halt, but rather just cause incremental performance degradation or something. However, Andres has pointed out this may postpone remedial action and let bugs go undetected, so it might not actually be better. Also, some of the things we need to do, like computing the oldest XID whose undo has not been retired, are tricky/expensive if you don't have complete data in memory, and you can't be sure of having complete data in shared memory without a hard limit. No matter which way we go, failures to apply undo had better be really rare, or we're going to be in really serious trouble, so we're only concerned here with how to handle what is hopefully a very rare scenario, not the common case. I want to consider three specific scenarios that could cause undo application to fail, and then offer some observations about them. Scenario #1: 1. Sessions 1..N each begin a transaction and write a bunch of data to a table (at least enough that we'll try to perform undo in the background). 2. Session N+1 begins a transaction and tries to lock the same table. It blocks. 3. Sessions 1..N abort, successfully pushing the undo work into the background. 4. Session N+1 now acquires the lock and sits on it. 5. Optionally, repeat steps 1-4 K times, each time for a different table. Scenario #2: 1. Any number of sessions begin a transaction, write a bunch of data, and then abort. 2. They all try to perform undo in the foreground. 3. They get killed using pg_terminate_backend(). Scenario #3: 1. A transaction begins, does some work, and then aborts. 2. When undo processing occurs, 1% of such transactions fail during undo apply because of a bug in the table AM. 3. When undo processing retries after a failure, it fails again because the bug is triggered by something about the contents of the undo record, rather than by, say, concurrency. In scenario one, the problem is mostly self-correcting. When we decide that we've got too many things queued up for background processing, and start to force undo processing to happen in the foreground, it will start succeeding, because the foreground process will have retained the lock that it took before writing any data and can therefore undo those writes without having to wait for the lock. However, this will do nothing to help the requests that are already in the background, which will just sit there until they can get the lock. 
I think there is a good argument that they should not actually wait for the lock, or should wait only for a certain time period, and then give up and put the transaction on the error queue for reprocessing at a later time. Otherwise, we're pinning down undo workers, which could easily lead to starvation, just as it does for autovacuum. On the whole, this doesn't sound too bad. We shouldn't be able to fill up the queue with small transactions, because of the restriction that we only push undo work into the background when the transaction is big enough, and if we fill it up with big transactions, then (1) back-pressure will prevent the problem from crippling the system and (2) eventually the problem will be self-correcting, because when the transaction in session N+1 ends, the undo will all go through and everything will be fine. The only real user impact of this scenario is that unrelated work on the system might notice that large rollbacks are now happening in the foreground rather than the background, and if that causes a problem, the DBA can fix it by terminating session N+1. Even if she doesn't, you shouldn't ever hit the hard cap. However, if prepared transactions are in use, we could have a variant of scenario #1 in which each transaction is first prepared, and then the prepared transaction is rolled back. Unlike the ordinary case, this can lead to a nearly-unbounded growth in the number of transactions that are pending undo, because we don't have a way to transfer the locks held by the PGPROC used for the prepare to some running session that could perform the undo. It's not necessary to have a large value for max_prepared_transactions; it only has to be greater than 0, because we can keep reusing the same slots with different tables. That is, let N = max_prepared_xacts, and let K be anything at all; session N+1 can just stay in the same transaction and keep on taking new locks one at a time until the lock table fills up; not sure exactly how long that will take, but it's probably a five digit number of transactions, or maybe six. In this case, we can't force undo into the foreground, so we can exceed the number of transactions that are supposed to be backgrounded. We'll eventually have to just start refusing new transactions permission to attach to an undo log, and they'll error out. Although unpleasant, I don't think that this scenario is a death sentence for the idea of having a hard cap on the table size, because if the cap is 100k or so, you shouldn't really hit it unless you specifically make it your goal to do so. At least, not this way. But if you have a lower cap, like 1k, it doesn't seem crazy to think that you could hit this in a non-artificial scenario; you just need lots of rolled-back prepared transactions plus some long-running DDL. We could mitigate the prepared transaction scenario by providing a way to transfer locks from the prepared transaction to the backend doing the ROLLBACK PREPARED and then make it try to execute the undo actions. I think that would bring this scenario into parity with the non-prepared case. We could still try to background large rollbacks, but if the queue gets too full then ROLLBACK PREPARED would do the work instead, and, with the hypothetical lock transfer mechanism, that would dodge the locking issues. In scenario #2, the undo work is going to have to be retried in the background, and perforce that means reacquiring locks that have been released, and so there is a chance of long lock waits and/or deadlock that cannot really be avoided. 
I think there is basically no way at all to avoid an unbounded accumulation of transactions requiring undo in this case, just as in the similar case where the cluster is repeatedly shut down or repeatedly crashes. Eventually, if you have a hard cap on the number of transactions requiring undo, you're going to hit it, and have to start refusing new undo-using transactions. As Thomas pointed out, that might still be better than some other systems which use undo, where the system doesn't open for any transactions at all after a restart until all undo is retired, and/or where undo is never processed in the background. But it's a possible concern. On the other hand, if you don't have a hard cap, the system may just get further and further behind until it eventually melts, and that's also a possible concern. How plausible is this scenario? For most users, cluster restarts and crashes are uncommon, so that variant isn't likely to happen unless something else is going badly wrong. As to the scenario as written, it's not crazy to think that a DBA might try to kill off sessions that are sitting there stuck in undo processing for long periods of time, but that doesn't make it a good idea. Whatever problems it causes are analogous to the problems you get if you keep killing off autovacuum processes: the system is trying to make you do the right thing, and if you fight it, you will have some kind of trouble no matter what design decisions we make. In scenario #3, the hard limit is likely to bring things to a screeching halt pretty quickly; you'll just run out of space in the in-memory data structures. Otherwise, the problem will not be obvious unless you're keeping an eye on error messages in your logs; the first sign of trouble may be that the undo logs fill up the disk. It's not really clear which is better. There is value in knowing about the problem sooner (because then you can file a bug report right away and get a fix sooner) but there is also value in having the system limp along instead of grinding to a halt (because then you might not be totally down while you're waiting for that bug fix to become available). One other thing that seems worth noting is that we have to consider what happens after a restart. After a crash, and depending on exactly how we design it perhaps also after a non-crash restart, we won't immediately know how many outstanding transactions need undo; we'll have to grovel through the undo logs to find out. If we've got a hard cap, we can't allow new undo-using transactions to start until we finish that work. It's possible that, at the moment of the crash, the maximum number of items had already been pushed into the background, and every foreground session was busy trying to undo an abort as well. If so, we're already up against the limit. We'll have to scan through all of the undo logs and examine each transaction to get a count of how many transactions are already in a needs-undo-work state; only once we have that value do we know whether it's OK to admit new transactions to using the undo machinery, and how many we can admit. In typical cases, that won't take long at all, because there won't be any pending undo work, or not much, and we'll very quickly read the handful of transaction headers that we need to consult and away we go. However, if the hard limit is pretty big, and we're pretty close to it, counting might take a long time. It seems bothersome to have this interval between when we start accepting transactions and when we can accept transactions that use undo. 
Instead of throwing an ERROR, we can probably just teach the system to wait for the background process to finish doing the counting; that's what Amit's patch does currently. Or, we could not even open for connections until the counting has been completed. When I first thought about this, I was really concerned about the idea of a hard limit, but the more I think about it the less problematic it seems. I think in the end it boils down to a question of: when things break, what behavior would users prefer? You can either have a fairly quick, hard breakage which will definitely get your attention, or you can have a long, slow process of gradual degradation that doesn't actually stop the system until, say, the XIDs stuck in the undo processing queue become old enough to threaten wraparound, or the disk fills up. Which is less evil? Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-07-19 13:28:14 -0400, Robert Haas wrote: > I want to consider three specific scenarios that could cause undo > application to fail, and then offer some observations about them. > > Scenario #1: > > 1. Sessions 1..N each begin a transaction and write a bunch of data to > a table (at least enough that we'll try to perform undo in the > background). > 2. Session N+1 begins a transaction and tries to lock the same table. > It blocks. > 3. Sessions 1..N abort, successfully pushing the undo work into the background. > 4. Session N+1 now acquires the lock and sits on it. > 5. Optionally, repeat steps 1-4 K times, each time for a different table. > > Scenario #2: > > 1. Any number of sessions begin a transaction, write a bunch of data, > and then abort. > 2. They all try to perform undo in the foreground. > 3. They get killed using pg_terminate_backend(). > > Scenario #3: > > 1. A transaction begins, does some work, and then aborts. > 2. When undo processing occurs, 1% of such transactions fail during > undo apply because of a bug in the table AM. > 3. When undo processing retries after a failure, it fails again > because the bug is triggered by something about the contents of the > undo record, rather than by, say, concurrency. > However, if prepared transactions are in use, we could have a variant > of scenario #1 in which each transaction is first prepared, and then > the prepared transaction is rolled back. Unlike the ordinary case, > this can lead to a nearly-unbounded growth in the number of > transactions that are pending undo, because we don't have a way to > transfer the locks held by the PGPROC used for the prepare to some running > session that could perform the undo. It doesn't seem that hard - and kind of required for robustness independent of the decision around "completeness" - to find a way to use the locks already held by the prepared transaction. > It's not necessary to have a > large value for max_prepared_transactions; it only has to be greater > than 0, because we can keep reusing the same slots with different > tables. That is, let N = max_prepared_xacts, and let K be anything at > all; session N+1 can just stay in the same transaction and keep on > taking new locks one at a time until the lock table fills up; not sure > exactly how long that will take, but it's probably a five digit number > of transactions, or maybe six. In this case, we can't force undo into > the foreground, so we can exceed the number of transactions that are > supposed to be backgrounded. I'm not following, unfortunately. I don't understand exactly what scenario you are referring to. You say "session N+1 can just stay in the same transaction", but then you also reference something taking "probably a five digit number of transactions". Are those transactions the prepared ones? Also, if somebody fills up the entire lock table, then the system is effectively down - independent of UNDO, and no meaningful amount of UNDO is going to be written. Perhaps we need some better resource control, but that's really independent of UNDO. Perhaps you can just explain the scenario in a few more words? My comments regarding it probably make no sense, given how little I understand what the scenario is. > In scenario #2, the undo work is going to have to be retried in the > background, and perforce that means reacquiring locks that have been > released, and so there is a chance of long lock waits and/or deadlock > that cannot really be avoided. 
I think there is basically no way at > all to avoid an unbounded accumulation of transactions requiring undo > in this case, just as in the similar case where the cluster is > repeatedly shut down or repeatedly crashes. Eventually, if you have a > hard cap on the number of transactions requiring undo, you're going to > hit it, and have to start refusing new undo-using transactions. As > Thomas pointed out, that might still be better than some other systems > which use undo, where the system doesn't open for any transactions at > all after a restart until all undo is retired, and/or where undo is > never processed in the background. But it's a possible concern. On the > other hand, if you don't have a hard cap, the system may just get > further and further behind until it eventually melts, and that's also > a possible concern. You could force new connections to complete the rollback processing of the terminated connection, if there's too much pending UNDO. That'd be a way of providing back-pressure against such crazy scenarios. Seems again that it'd be good to have that pressure, independent of the decision on completeness. > One other thing that seems worth noting is that we have to consider > what happens after a restart. After a crash, and depending on exactly > how we design it perhaps also after a non-crash restart, we won't > immediately know how many outstanding transactions need undo; we'll > have to grovel through the undo logs to find out. If we've got a hard > cap, we can't allow new undo-using transactions to start until we > finish that work. Couldn't we record the outstanding transactions in the checkpoint, and then recompute the changes to that record during WAL replay? > When I first thought about this, I was really concerned about the idea > of a hard limit, but the more I think about it the less problematic it > seems. I think in the end it boils down to a question of: when things > break, what behavior would users prefer? You can either have a fairly > quick, hard breakage which will definitely get your attention, or you > can have a long, slow process of gradual degradation that doesn't > actually stop the system until, say, the XIDs stuck in the undo > processing queue become old enough to threaten wraparound, or the disk > fills up. Which is less evil? Yea, I think that's what it boils down to... Would be good to have a few more opinions on this. Greetings, Andres Freund
On Fri, Jul 19, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote: > It doesn't seem that hard - and kind of required for robustness > independent of the decision around "completeness" - to find a way to use > the locks already held by the prepared transaction. I'm not wild about finding more subtasks to put on the must-do list, but I agree it's doable. > I'm not following, unfortunately. > > I don't understand exactly what scenario you are referring to. You say > "session N+1 can just stay in the same transaction", but then you also > reference something taking "probably a five digit number of > transactions". Are those transactions the prepared ones? So you open a bunch of sessions. All but one of them begin a transaction, insert data into a table, and then prepare. The last one begins a transaction and locks the table. Now you roll back all the prepared transactions. Those sessions now begin new transactions, insert data into a second table, and prepare the second set of transactions. The last session, which still has the first table locked, now locks the second table in addition. Now you again roll back all the prepared transactions. At this point you have 2 * max_prepared_transactions that are waiting for undo, all blocked on that last session that holds locks on both tables. So now you go have all of those sessions begin a third transaction, and they all insert into a third table, and prepare. The last session now attempts AEL on that third table, and once it's waiting, you roll back all the prepared transactions, after which that last session successfully picks up its third table lock. You can keep repeating this, locking a new table each time, until you run out of lock table space, by which time you will have roughly max_prepared_transactions * size_of_lock_table transactions waiting for undo processing. > You could force new connections to complete the rollback processing of > the terminated connection, if there's too much pending UNDO. That'd be a > way of providing back-pressure against such crazy scenarios. Seems > again that it'd be good to have that pressure, independent of the > decision on completeness. That would definitely provide a whole lot of back-pressure, but it would also make the system unusable if the undo handler finds a way to FATAL, or just hangs for some stupid reason (stuck I/O?). It would be a shame if the administrative action needed to fix the problem were prevented by the back-pressure mechanism. One thing I've thought about, which I think would be helpful for a variety of scenarios, is to have a facility that forces a computed delay at the start of each write transaction (when it first writes WAL, or when an XID is assigned), or we could adapt that to this case and apply it at the beginning of each undo-using transaction. So for example if you are about to run out of space in pg_wal, you can slow things down to let the checkpoint complete, or if you are about to run out of XIDs, you can slow things down to let autovacuum complete, or if you are about to run out of undo slots, you can slow things down to let some undo complete. The trick is to make sure that you only wait when it's likely to do some good; if you wait because you're running out of XIDs and the reason you're running out of XIDs is because somebody left a replication slot or a prepared transaction around, the back-pressure is useless. > Couldn't we record the outstanding transactions in the checkpoint, and > then recompute the changes to that record during WAL replay? 
Hmm, that's not a bad idea. So the transactions would have to "count" the moment they insert their first undo record, which is exactly the right thing anyway. Hmm, but what about transactions that are only touching unlogged tables? > Yea, I think that's what it boils down to... Would be good to have a few > more opinions on this. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
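For illustration, the computed-delay idea from the previous message might look something like this sketch (PendingUndoRequestCount() and the ramp constants are assumptions, not anything posted):

/* Called before a transaction writes its first undo record: sleep in
 * proportion to how full the rollback hash table is, so that foreground
 * work slows down and undo workers get a chance to catch up. */
static void
UndoBackPressureDelay(void)
{
    double      fill = (double) PendingUndoRequestCount() /
                       (double) UndoRollbackHashTableSize();

    /* No delay below half full; then ramp linearly up to ~1s at full. */
    if (fill > 0.5)
        pg_usleep((long) ((fill - 0.5) * 2.0 * 1000000.0));
}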
On Fri, Jul 19, 2019 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote: > In scenario #2, the undo work is going to have to be retried in the > background, and perforce that means reacquiring locks that have been > released, and so there is a chance of long lock waits and/or deadlock > that cannot really be avoided. I haven't studied the UNDO or zheap stuff in any detail, but I am concerned about rollbacks that deadlock. I'd feel a lot better about it if forward progress was guaranteed, somehow. That seems to imply that locks are retained, which is probably massively inconvenient to ensure. Not least because it probably requires cooperation from underlying access methods. -- Peter Geoghegan
Hi, On 2019-07-19 14:50:22 -0400, Robert Haas wrote: > On Fri, Jul 19, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote: > > It doesn't seem that hard - and kind of required for robustness > > independent of the decision around "completeness" - to find a way to use > > the locks already held by the prepared transaction. > > I'm not wild about finding more subtasks to put on the must-do list, > but I agree it's doable. Isn't that pretty inherently required? How are you otherwise ever going to be able to roll back a transaction that holds an AEL on a relation it also modifies? I might be standing on my own head here, though. > > You could force new connections to complete the rollback processing of > > the terminated connection, if there's too much pending UNDO. That'd be a > > way of providing back-pressure against such crazy scenarios. Seems > > again that it'd be good to have that pressure, independent of the > > decision on completeness. > > That would definitely provide a whole lot of back-pressure, but it > would also make the system unusable if the undo handler finds a way to > FATAL, or just hangs for some stupid reason (stuck I/O?). It would be > a shame if the administrative action needed to fix the problem were > prevented by the back-pressure mechanism. Well, then perhaps that admin ought not to constantly terminate connections... I was thinking that new connections wouldn't be forced to do that if there were still a lot of headroom regarding #transactions-to-be-rolled-back. And if undo workers kept up, you'd also not hit this. > > Couldn't we record the outstanding transactions in the checkpoint, and > > then recompute the changes to that record during WAL replay? > > Hmm, that's not a bad idea. So the transactions would have to "count" > the moment they insert their first undo record, which is exactly the > right thing anyway. > > Hmm, but what about transactions that are only touching unlogged tables? Wouldn't we throw all that UNDO away in a crash restart? There's no underlying table data anymore, after all. And for proper shutdown checkpoints they could just be included. Greetings, Andres Freund
On Fri, Jul 19, 2019 at 2:57 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jul 19, 2019 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote: > > In scenario #2, the undo work is going to have to be retried in the > > background, and perforce that means reacquiring locks that have been > > released, and so there is a chance of long lock waits and/or deadlock > > that cannot really be avoided. > > I haven't studied the UNDO or zheap stuff in any detail, but I am > concerned about rollbacks that deadlock. I'd feel a lot better about > it if forward progress was guaranteed, somehow. That seems to imply > that locks are retained, which is probably massively inconvenient to > ensure. Not least because it probably requires cooperation from > underlying access methods. Right, that's definitely a big part of the concern here, but I don't really believe that retaining locks is absolutely required, or even necessarily desirable. For instance, suppose that I create a table, bulk-load a whole lotta data into it, and then abort. Further suppose that by the time we start trying to process the undo in the background, we can't get the lock. Well, that probably means somebody is performing DDL on the table. If they just did LOCK TABLE or ALTER TABLE SET STATISTICS, we are going to need to execute that same undo once the DDL is complete. However, if the DDL is DROP TABLE, we're going to find that once we can get the lock, the undo is obsolete, and we don't need to worry about it any more. Had we made it 100% certain that the DROP TABLE couldn't go through until the undo was performed, we could avoid having to worry about the undo having become obsolete ... but that's hardly a win. We're better off allowing the drop and then just chucking the undo. Likely, something like CLUSTER or VACUUM FULL would take care of removing any rows created by aborted transactions along the way, so the undo could be thrown away afterwards without processing it. Point being - there's at least some chance that the operations which block forward progress also represent progress of another sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 19, 2019 at 3:12 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-07-19 14:50:22 -0400, Robert Haas wrote: > > On Fri, Jul 19, 2019 at 2:04 PM Andres Freund <andres@anarazel.de> wrote: > > > It doesn't seem that hard - and kind of required for robustness > > > independent of the decision around "completeness" - to find a way to use > > > the locks already held by the prepared transaction. > > > > I'm not wild about finding more subtasks to put on the must-do list, > > but I agree it's doable. > > Isn't that pretty inherently required? How are you otherwise ever going > to be able to roll back a transaction that holds an AEL on a relation it > also modifies? I might be standing on my own head here, though. I think you are. If a transaction holds an AEL on a relation it also modifies, we still only need something like RowExclusiveLock to roll it back. If we retain the transaction's locks until undo is complete, we will not deadlock, but we'll also hold AccessExclusiveLock for a long time. If we release the transaction's locks, we can perform the undo in the background with only RowExclusiveLock, which is full of win. Even if you insist that the undo task should acquire the same lock the transaction held, which seems entirely excessive to me, that hardly prevents undo from being applied. Once the original transaction has released its locks, the undo system can acquire them the next time the relation isn't busy (or when it gets to the head of the lock queue). As far as I can see, the only reason why you would care about this is to make the back-pressure system effective against prepared transactions. Different people may want that more or less, but I have a little trouble with the idea that it is a hard requirement. > Well, then perhaps that admin ought not to constantly terminate > connections... I was thinking that new connections wouldn't be forced > to do that if there were still a lot of headroom regarding > #transactions-to-be-rolled-back. And if undo workers kept up, you'd > also not hit this. Sure, but cascading failure scenarios suck. > > Hmm, that's not a bad idea. So the transactions would have to "count" > > the moment they insert their first undo record, which is exactly the > > right thing anyway. > > > > Hmm, but what about transactions that are only touching unlogged tables? > > Wouldn't we throw all that UNDO away in a crash restart? There's no > underlying table data anymore, after all. > > And for proper shutdown checkpoints they could just be included. On thirty seconds thought, that sounds like it would work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: should there be a hard-limit on the number of transactions pending undo?
From: Andres Freund
On 2019-07-19 15:57:45 -0400, Robert Haas wrote: > On Fri, Jul 19, 2019 at 3:12 PM Andres Freund <andres@anarazel.de> wrote: > > Isn't that pretty inherently required? How are otherwise ever going to > > be able to roll back a transaction that holds an AEL on a relation it > > also modifies? I might be standing on my own head here, though. > > I think you are. If a transaction holds an AEL on a relation it also > modifies, we still only need something like RowExclusiveLock to roll > it back. If we retain the transaction's locks until undo is complete, > we will not deadlock, but we'll also hold AccessExclusiveLock for a > long time. If we release the transaction's locks, we can perform the > undo in the background with only RowExclusiveLock, which is full of > win. Even if you insist that the undo task should acquire the same lock > the original transaction held on the relation, which seems entirely > excessive to me, that hardly prevents undo from being applied. Once > the original transaction has released its locks, the undo system can > acquire those locks the next time the relation isn't busy (or when it > gets to the head of the lock queue). Good morning, Mr Freund. Not sure what you were thinking there.
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Fri, Jul 19, 2019 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > Right, that's definitely a big part of the concern here, but I don't > really believe that retaining locks is absolutely required, or even > necessarily desirable. For instance, suppose that I create a table, > bulk-load a whole lotta data into it, and then abort. Further suppose > that by the time we start trying to process the undo in the > background, we can't get the lock. Well, that probably means somebody > is performing DDL on the table. I believe that the primary reason why certain other database systems retain locks until rollback completes (or release their locks in reverse order, as UNDO processing progresses) is that application code will often repeat exactly the same actions on receiving a transient error, until the action finally completes successfully. Just like with serialization failures, or with manually implemented UPSERT loops that must sometimes retry. This is why UNDO is often (or always) processed synchronously, blocking progress of the client connection as its xact rolls back. Obviously these other systems could easily hand off the work of rolling back the transaction to an asynchronous worker process, and return success to the client that encounters an error (or asks to abort/roll back) almost immediately. I have to imagine that they haven't implemented this straightforward optimization because it makes sense that the cost of rolling back the transaction is primarily borne by the client that actually rolls back. And, as I said, because a lot of application code will immediately retry on failure, which needs to not deadlock with an asynchronous rollback process. > If they just did LOCK TABLE or ALTER > TABLE SET STATISTICS, we are going to need to execute that same undo > once the DDL is complete. However, if the DDL is DROP TABLE, we're > going to find that once we can get the lock, the undo is obsolete, and > we don't need to worry about it any more. Had we made it 100% certain > that the DROP TABLE couldn't go through until the undo was performed, > we could avoid having to worry about the undo having become obsolete > ... but that's hardly a win. We're better off allowing the drop and > then just chucking the undo. I'm sure that there are cases like that. And, I'm pretty sure that at least one of the other database systems that I'm thinking of isn't as naive as I suggest, without being sure of the specifics. The classic approach is to retain the locks, even though that sucks in some cases. That doesn't mean that you have to do it that way, but it's probably a good idea to present your design in a way that compares and contrasts with the classic approach. I'm pretty sure that this is related to the way in which other systems retain coarse-grained locks when bitmap indexes are used, even though that makes them totally unusable with OLTP apps. It seems like it would help users a lot if their bitmap indexes didn't come with that problem, but it's a price that they continue to have to pay. > Point being - there's at least some chance that the operations which > block forward progress also represent progress of another sort. That's good, provided that there isn't observable lock starvation. I don't think that you need to eliminate the theoretical risk of lock starvation. It deserves careful, ongoing consideration, though.
It's difficult to codify exactly what I have in mind, but I can give you an informal definition now: It's probably okay if there is the occasional implementation-level deadlock because the user got unlucky once. However, it's not okay for there to be *continual* deadlocks because the user got unlucky just once. Even if the user had *extraordinarily* bad luck that one time. In short, my sense is that it's never okay for the system as a whole to "get stuck" in a deadlock or livelock loop. Actually, it might even be okay if somebody had a test case that exhibits "getting stuck" behavior, provided the test case is very delicate, and looks truly adversarial (i.e. it goes beyond being extraordinarily unlucky). I know that this is all pretty hand-wavy, and I don't expect you to have a definitive response. These are some high level concerns that I have, that may or may not apply to what you're trying to do. -- Peter Geoghegan
On Fri, Jul 19, 2019 at 6:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > I believe that the primary reason why certain other database systems > retain locks until rollback completes (or release their locks in > reverse order, as UNDO processing progresses) is that application code > will often repeat exactly the same actions on receiving a transient > error, until the action finally completes successfully. Just like with > serialization failures, or with manually implemented UPSERT loops that > must sometimes retry. This is why UNDO is often (or always) processed > synchronously, blocking progress of the client connection as its xact > rolls back. I don't think this matters here at all. As long as there's only DML involved, there won't be any lock conflicts anyway - everybody's taking RowExclusiveLock or less, and it's all fine. If you update a row in zheap, abort, and then try to update again before the rollback happens, we'll do a page-at-a-time rollback in the foreground, and proceed with the update; when we get around to applying the undo, we'll notice that page has already been handled and skip the undo records that pertain to it. To get the kinds of problems I'm on about here, somebody's got to be taking some more serious locks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Fri, Jul 19, 2019 at 4:14 PM Robert Haas <robertmhaas@gmail.com> wrote: > I don't think this matters here at all. As long as there's only DML > involved, there won't be any lock conflicts anyway - everybody's > taking RowExclusiveLock or less, and it's all fine. If you update a > row in zheap, abort, and then try to update again before the rollback > happens, we'll do a page-at-a-time rollback in the foreground, and > proceed with the update; when we get around to applying the undo, > we'll notice that page has already been handled and skip the undo > records that pertain to it. To get the kinds of problems I'm on about > here, somebody's got to be taking some more serious locks. If I'm not mistaken, you're tacitly assuming that you'll always be using zheap, or something sufficiently similar to zheap. It'll probably never be possible to UNDO changes to something like a GIN index on a zheap table, because you can never do that with sensible concurrency/deadlock behavior. I don't necessarily have a problem with that. I don't pretend to understand how much of a problem it is. Obviously it partially depends on what your ambitions are for this infrastructure. Still, assuming that I have it right, ISTM that UNDO/zheap/whatever should explicitly own this restriction. -- Peter Geoghegan
On Fri, Jul 19, 2019 at 6:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 7:54 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > + * We just want to mask the cid in the undo record header. So > > + * only if the partial record in the current page include the undo > > + * record header then we need to mask the cid bytes in this page. > > + * Otherwise, directly jump to the next record. > > Here, I think you mean : "So only if the partial record in the current > > page includes the *cid* bytes", rather than "includes the undo record > > header" > > May be we can say : > > We just want to mask the cid. So do the partial record masking only if > > the current page includes the cid bytes from the partial record > > header. > > Hmm, but why is it correct to mask the CID at all? Shouldn't that match? We don't write the CID in the WAL, because in hot standby or after recovery we don't need the actual CID for visibility. So during REDO, while generating the undo record, we set the CID to 'FirstCommandId', which can differ from its value at DO time. That's the reason we mask it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
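For concreteness, a minimal sketch of that masking rule (not the patch's actual code): overwrite every cid on the undo page before the DO/REDO consistency comparison, so that both images carry FirstCommandId. The record-walking shape follows the fragment quoted above; the urec_cid field placement and the record-size helper are assumptions, and the partial-record-at-a-page-boundary case raised later in this thread is ignored here.

    /* Sketch: normalize every cid on the undo page so the DO image
     * (real cid) and the REDO image (FirstCommandId) compare equal. */
    char *next_record = page_start + first_record_offset;

    while (next_record < page_end)
    {
        UndoRecordHeader *header = (UndoRecordHeader *) next_record;

        /* if this record carries a cid, overwrite it */
        if ((header->urec_info & UREC_INFO_CID) != 0)
            header->urec_cid = FirstCommandId;

        next_record += UndoRecordSize(header);  /* assumed helper */
    }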
On Fri, Jul 19, 2019 at 6:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > We are doing exactly what you have written in the last line of the > > next paragraph "stop the transaction from writing undo when the hash > > table is already too full.". So we will > > never face the problems related to repeated crash recovery. The > > definition of too full is that we stop allowing the new transactions > > that can write undo when we have the hash table already have entries > > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > > make sense? > > Oops, I was looking in the wrong place. Yes, that makes sense, but: > > 1. It looks like the test you added to PrepareUndoInsert has no > locking, and I don't see how that can be right. > +if (ProcGlobal->xactsHavingPendingUndo > +(UndoRollbackHashTableSize() - MaxBackends)) The actual HARD_LIMIT is UndoRollbackHashTableSize(), but we only allow a new backend to prepare an undo record if there are MaxBackends empty slots in the hash table. This guarantees that we always have at least one slot in the hash table for our current prepare, even if every backend running a transaction has aborted and inserted an entry in the hash table. I think the problem with this check is that for any backend to prepare an undo record there must be MaxBackends empty slots in the hash table, so that every concurrent backend can still insert its request, and this seems too restrictive. Having said that, I think we must ensure MaxBackends * 2 empty slots in the hash table, as each backend can enter 2 requests in the hash table. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
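As a hedged sketch of the check being discussed (ProcGlobal->xactsHavingPendingUndo, UndoRollbackHashTableSize and MaxBackends are from the quoted patch; the errcode and message are illustrative), the 2 * MaxBackends variant would look something like:

    /* Sketch only: keep enough headroom for every backend to enqueue its
     * two possible requests (one per persistence level). */
    if (ProcGlobal->xactsHavingPendingUndo >
        UndoRollbackHashTableSize() - 2 * MaxBackends)
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                 errmsg("too many transactions have pending undo work")));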
On Fri, Jul 19, 2019 at 10:58 PM Robert Haas <robertmhaas@gmail.com> wrote: > > One other thing that seems worth noting is that we have to consider > what happens after a restart. After a crash, and depending on exactly > how we design it perhaps also after a non-crash restart, we won't > immediately know how many outstanding transactions need undo; we'll > have to grovel through the undo logs to find out. If we've got a hard > cap, we can't allow new undo-using transactions to start until we > finish that work. It's possible that, at the moment of the crash, the > maximum number of items had already been pushed into the background, > and every foreground session was busy trying to undo an abort as well. > If so, we're already up against the limit. We'll have to scan through > all of the undo logs and examine each transaction to get a count on > how many transactions are already in a needs-undo-work state; only > once we have that value do we know whether it's OK to admit new > transactions to using the undo machinery, and how many we can admit. > In typical cases, that won't take long at all, because there won't be > any pending undo work, or not much, and we'll very quickly read the > handful of transaction headers that we need to consult and away we go. > However, if the hard limit is pretty big, and we're pretty close to > it, counting might take a long time. It seems bothersome to have this > interval between when we start accepting transactions and when we can > accept transactions that use undo. Instead of throwing an ERROR, we > can probably just teach the system to wait for the background process > to finish doing the counting; that's what Amit's patch does currently. > Yeah; however, we wait for a certain threshold period of time (one minute) for the counting to finish and then error out. We could wait until the counting is finished, but I am not sure that is a good idea, because the user can anyway try again after some time. > Or, we could not even open for connections until the counting has been > completed. > > When I first thought about this, I was really concerned about the idea > of a hard limit, but the more I think about it the less problematic it > seems. I think in the end it boils down to a question of: when things > break, what behavior would users prefer? > One minor thing I would like to add here is that we are providing some knobs so that systems having a larger number of rollbacks can be configured with a much higher hard limit, such that it won't be hit on those systems. I know it is not always easy to find the right value, but I guess users can learn from the behavior and then change it to avoid the problem in the future. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
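For what it's worth, such a knob would presumably end up as an ordinary GUC; the following guc.c-style table entry is purely an illustration (the name, default, and category are invented, not from the patch):

    /* Sketch: a hypothetical knob for the hard limit; all names invented. */
    static int max_xacts_pending_undo = 1024;

    /* entry for the ConfigureNamesInt[] table in guc.c */
    {
        {"max_xacts_pending_undo", PGC_POSTMASTER, RESOURCES,
            gettext_noop("Maximum number of transactions that may have pending undo."),
            NULL
        },
        &max_xacts_pending_undo,
        1024, 1, INT_MAX,
        NULL, NULL, NULL
    },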
On Sat, Jul 20, 2019 at 4:17 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jul 19, 2019 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Right, that's definitely a big part of the concern here, but I don't > > really believe that retaining locks is absolutely required, or even > > necessarily desirable. For instance, suppose that I create a table, > > bulk-load a whole lotta data into it, and then abort. Further suppose > > that by the time we start trying to process the undo in the > > background, we can't get the lock. Well, that probably means somebody > > is performing DDL on the table. > > I believe that the primary reason why certain other database systems > retain locks until rollback completes (or release their locks in > reverse order, as UNDO processing progresses) is that application code > will often repeat exactly the same actions on receiving a transient > error, until the action finally completes successfully. Just like with > serialization failures, or with manually implemented UPSERT loops that > must sometimes retry. This is why UNDO is often (or always) processed > synchronously, blocking progress of the client connection as its xact > rolls back. > > Obviously these other systems could easily hand off the work of > rolling back the transaction to an asynchronous worker process, and > return success to the client that encounters an error (or asks to > abort/roll back) almost immediately. I have to imagine that they > haven't implemented this straightforward optimization because it makes > sense that the cost of rolling back the transaction is primarily borne > by the client that actually rolls back. > It is also possible that there are other disadvantages or technical challenges in those other systems due to which they decided not to have such a mechanism. I think one such database prepares a consistent copy of pages during read operations based on an SCN or something like that. It might not be as easy for such a system to check whether there is some pending undo which needs to be consulted. I am not saying that there are no ways to overcome such things, but they might have incurred much more cost or had some other disadvantages. I am not sure it is straightforward to guess why some other system does things in some particular way unless there is explicit documentation about it. Having said that, I agree that there are a good number of advantages to performing the actions in the client that actually rolls back, and we should try to do that where it is not a good idea to transfer the work to background workers, such as for short transactions. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 19, 2019 at 6:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > We are doing exactly what you have written in the last line of the > > next paragraph "stop the transaction from writing undo when the hash > > table is already too full.". So we will > > never face the problems related to repeated crash recovery. The > > definition of too full is that we stop allowing the new transactions > > that can write undo when we have the hash table already have entries > > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > > make sense? > > Oops, I was looking in the wrong place. Yes, that makes sense, but: > > 1. It looks like the test you added to PrepareUndoInsert has no > locking, and I don't see how that can be right. > > 2. It seems like this is would result in testing for each new undo > insertion that gets prepared, whereas surely we would want to only > test when first attaching to an undo log. If you've already attached > to the undo log, there's no reason not to continue inserting into it, > because doing so doesn't increase the number of transactions (or > transaction-persistence level combinations) that need undo. > I agree that it should not be done for each undo insertion, but rather whenever a transaction attaches to an undo log. > 3. I don't think the test itself is correct. It can fire even when > there's no problem. It is correct (or would be if it said 2 * > MaxBackends) if every other backend in the system is already attached > to an undo log (or two). But if they are not, it will block > transactions from being started for no reason. > Right, we should find a way to know the exact number of transactions that are attached to undo logs at any point in time; then we can have a more precise check. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 20, 2019 at 12:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jul 19, 2019 at 6:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Fri, Jul 19, 2019 at 12:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > We are doing exactly what you have written in the last line of the > > > next paragraph "stop the transaction from writing undo when the hash > > > table is already too full.". So we will > > > never face the problems related to repeated crash recovery. The > > > definition of too full is that we stop allowing the new transactions > > > that can write undo when we have the hash table already have entries > > > equivalent to (UndoRollbackHashTableSize() - MaxBackends). Does this > > > make sense? > > > > Oops, I was looking in the wrong place. Yes, that makes sense, but: > > > > 1. It looks like the test you added to PrepareUndoInsert has no > > locking, and I don't see how that can be right. > > > > 2. It seems like this is would result in testing for each new undo > > insertion that gets prepared, whereas surely we would want to only > > test when first attaching to an undo log. If you've already attached > > to the undo log, there's no reason not to continue inserting into it, > > because doing so doesn't increase the number of transactions (or > > transaction-persistence level combinations) that need undo. > > > > I agree that it should not be done for each undo insertion, but rather > > whenever a transaction attaches to an undo log. > > > 3. I don't think the test itself is correct. It can fire even when > > there's no problem. It is correct (or would be if it said 2 * > > MaxBackends) if every other backend in the system is already attached > > to an undo log (or two). But if they are not, it will block > > transactions from being started for no reason. > > > > Right, we should find a way to know the exact number of transactions > > that are attached to undo logs at any point in time; then we can have a > > more precise check.

Maybe we can make ProcGlobal->xactsHavingPendingUndo an atomic variable. We can increment its value atomically whenever:
a) a transaction writes its first undo record for each persistence level
b) an abort request is inserted by the 'StartupPass'

And we will decrement it when:
a) the transaction commits (decrement by 1 for each persistence level it has written undo for)
b) a rollback request is processed

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
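A minimal sketch of that counter (assuming a pg_atomic_uint32 field in PROC_HDR; xactsHavingPendingUndo is the patch's name, the call sites shown are illustrative):

    #include "port/atomics.h"

    /* in PROC_HDR (shared memory): pg_atomic_uint32 xactsHavingPendingUndo; */

    /* increment: first undo record for a persistence level, or an abort
     * request queued during startup */
    pg_atomic_fetch_add_u32(&ProcGlobal->xactsHavingPendingUndo, 1);

    /* decrement: commit (once per persistence level that wrote undo), or a
     * rollback request fully processed */
    pg_atomic_fetch_sub_u32(&ProcGlobal->xactsHavingPendingUndo, 1);

    /* the headroom check can read it without any lock */
    uint32 pending = pg_atomic_read_u32(&ProcGlobal->xactsHavingPendingUndo);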
On Thu, Jul 18, 2019 at 4:41 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Tue, Jul 16, 2019 at 8:39 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Thomas has already objected to another proposal to add functions that > > turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in > > predicting that he will likewise object to GetEpochForXid. I think > > this needs to be changed somehow, maybe by doing what the XXX comment > > you added suggests. > > Perhaps we should figure out how to write GetOldestFullXmin() and friends. > > For FinishPreparedTransaction(), the XXX comment sounds about right > (TwoPhaseFileHeader should hold an fxid). > I think we can do that, but what about subxids in TwoPhaseFileHeader? Shall we store them as fxids as well? If we don't, it will appear inconsistent; and if we want to store subxids as fxids, then we need to track them as fxids in TransactionStateData. It might not be a very big change, but it is certainly more work than just storing the top-level fxid or using GetEpochForXid as we currently do in the patch. Another thing is that changing subxids to fxids can increase the size of the two-phase file for an xact having many sub-transactions, which again might be okay, but I am not completely sure. Thoughts? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
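To illustrate the narrower of the two options (widen only the top-level xid, keep subxids 32-bit), a sketch might look like the following; this is not the actual TwoPhaseFileHeader layout, just the shape of the change:

    #include "access/transam.h"     /* FullTransactionId */

    /* Sketch only: illustrative two-phase file header fields. */
    typedef struct TwoPhaseFileHeaderSketch
    {
        FullTransactionId fxid;     /* top-level transaction, with epoch */
        int32       nsubxacts;      /* number of following subxact XIDs */
        /* TransactionId subxids[nsubxacts] follows, still 32 bits each */
    } TwoPhaseFileHeaderSketch;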
On Mon, Jul 22, 2019 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have reviewed 0012-Infrastructure-to-execute-pending-undo-actions; please find my comments so far.

1. + /* It shouldn't be discarded. */ + Assert(!UndoRecPtrIsDiscarded(xact_urp)); I think comments can be added to explain why it shouldn't be discarded.

2. + /* Compute the offset of the uur_next in the undo record. */ + offset = SizeOfUndoRecordHeader + + offsetof(UndoRecordTransaction, urec_progress); + In the comment: /uur_next/uur_progress

3. +/* + * undo_record_comparator + * + * qsort comparator to handle undo record for applying undo actions of the + * transaction. + */ Function header formatting is not in sync with other functions.

4. +void +undoaction_redo(XLogReaderState *record) +{ + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + switch (info) + { + case XLOG_UNDO_APPLY_PROGRESS: + undo_xlog_apply_progress(record); + break; For hot standby it doesn't make sense to apply this WAL, as this progress is only required when we try to apply the undo actions after a restart, but on a hot standby we never apply undo actions.

5. + Assert(from_urecptr != InvalidUndoRecPtr); + Assert(to_urecptr != InvalidUndoRecPtr); We can use the UndoRecPtrIsValid macro instead of checking like this.

6. + if ((slot == NULL) || (UndoRecPtrGetLogNo(urecptr) != slot->logno)) + slot = UndoLogGetSlot(UndoRecPtrGetLogNo(urecptr), false); + + Assert(slot != NULL); We are passing missing_ok as false to UndoLogGetSlot, but I am not sure why we expect that the undo log cannot be dropped. In a multi-log transaction, isn't it possible that the tablespace containing the next undo log has already been dropped?

7. + */ + do + { + BlockNumber progress_block_num = InvalidBlockNumber; + int i; + int nrecords; ..... + */ + if (!UndoRecPtrIsValid(urec_ptr)) + break; + } while (true); I think we can convert the above loop to while (true) instead of do..while, because there is no need for a do..while loop here.

8. + if (last_urecinfo->uur->uur_info & UREC_INFO_LOGSWITCH) + { + UndoRecordLogSwitch *logswitch = last_urecinfo->uur->uur_logswitch; IMHO, the caller of UndoFetchRecord should directly check uur->uur_logswitch instead of uur_info & UREC_INFO_LOGSWITCH. Actually, uur_info is set internally for inserting the record and is checked there to know what to insert and fetch, but I think the caller of UndoFetchRecord should rely directly on the field, because ideally all the fields in UnpackUndoRecord must be set, and uur_txt or uur_logswitch will be allocated when those headers are present. I think this needs to be improved in the undo interface patch as well (in UndoBulkFetchRecord).

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 20, 2019 at 11:28 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jul 19, 2019 at 4:14 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I don't think this matters here at all. As long as there's only DML > > involved, there won't be any lock conflicts anyway - everybody's > > taking RowExclusiveLock or less, and it's all fine. If you update a > > row in zheap, abort, and then try to update again before the rollback > > happens, we'll do a page-at-a-time rollback in the foreground, and > > proceed with the update; when we get around to applying the undo, > > we'll notice that page has already been handled and skip the undo > > records that pertain to it. To get the kinds of problems I'm on about > > here, somebody's got to be taking some more serious locks. > > If I'm not mistaken, you're tacitly assuming that you'll always be > using zheap, or something sufficiently similar to zheap. It'll > probably never be possible to UNDO changes to something like a GIN > index on a zheap table, because you can never do that with sensible > concurrency/deadlock behavior. > > I don't necessarily have a problem with that. I don't pretend to > understand how much of a problem it is. Obviously it partially depends > on what your ambitions are for this infrastructure. Still, assuming > that I have it right, ISTM that UNDO/zheap/whatever should explicitly > own this restriction. I had a similar thought: you might regret that choice if you were wanting to implement an AM with lock table-based concurrency control (meaning that there are lock ordering concerns for row and page locks, for DML statements, not just DDL). That seemed a bit too far-fetched to mention before, but are you saying the same sort of concerns might come up with indexes that support true undo (as opposed to indexes that still need VACUUM)? For comparison, ARIES[1] has no-deadlock rollbacks as a basic property and reacquires locks during restart before new transactions are allowed to execute. In its model, the locks in question can be on things like rows and pages. We don't even use our lock table for those (except for non-blocking SIREAD locks, irrelevant here). After crash recovery, if zheap encounters a row with pending rollback from an aborted transaction, as usual it either needs to read an older version from an undo log (for reads) or help execute the rollback before updating (for writes). That only requires page-at-a-time LWLocks ("latching"), so it's deadlock-free. The only deadlock risk comes from the need to acquire heavyweight locks on relations, which typically only conflict when you run DDL, so yeah, it's tempting to worry a lot less about those than about the fine-grained lock traffic from DML statements that DB2 and others have to deal with. So, to spell out the two options again: A. Rollback can't deadlock. You have to make sure you reliably hold locks until rollback is completed (including some tricky new lock transfer magic), and then reacquire them after recovery before new transactions are allowed. You could trivially achieve the restart part by simply waiting until all rollback is executed before you allow new transactions, but other systems including DB2 first acquire all the locks in an earlier scan through the log, then allow new connections, and then execute the rollback. Acquiring them before new transactions are allowed means that they must fit in the lock table, and there must be no conflicts among them if they were all granted as of the moment you crashed or shut down. B.
Rollback can deadlock or exhaust the lock table, because we release locks and reacquire them some arbitrary time later. There is then no choice but to keep retrying if anything goes wrong, so rollback is theoretically not guaranteed to complete, and you can contrive a workload that will never make progress. This amounts to betting that these problems will be rare enough not to matter, that rollback will eventually make progress, and that it should be fairly clear what's happening and why. I might as well put the quote marks on now: "Perhaps we could implement A later." [1] https://cs.stanford.edu/people/chrismre/cs345/rl/aries.pdf -- Thomas Munro https://enterprisedb.com
On Mon, 22 Jul 2019 at 14:21, Amit Kapila <amit.kapila16@gmail.com> wrote: I have started review of 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below are some quick comments to start with: +++ b/src/backend/access/undo/undoworker.c +#include "access/xact.h" +#include "access/undorequest.h" Order is not alphabetical + * Each undo worker then start reading from one of the queue the requests for start=>starts queue=>queues ------------- + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 10L, WAIT_EVENT_BGWORKER_STARTUP); + + /* emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); I think now, thanks to commit cfdf4dc4fc9635a, you don't have to explicitly handle postmaster death; instead you can use WL_EXIT_ON_PM_DEATH. Please check at all such places where this is done in this patch. ------------- +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) +{ + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + + MyUndoWorker = &UndoApplyCtx->workers[slot]; Not sure why MyUndoWorker is used here. Can't we use a local variable ? Or do we intentionally attach to the slot as a side-operation ? ------------- + * Get the dbid where the wroker should connect to and get the worker wroker=>worker ------------- + BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0); 0, 0 => InvalidOid, 0 + * Set the undo worker request queue from which the undo worker start + * looking for a work. start => should start a work => work -------------- + if (!InsertRequestIntoErrorUndoQueue(urinfo)) I was thinking what happens if for some reason InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the entry will not be marked invalid, and so there will be no undo action carried out because I think the undo worker will exit. What happens next with this entry ?
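For reference, the WL_EXIT_ON_PM_DEATH suggestion above would reduce the quoted wait loop to something like this minimal sketch (the flag is the real one added by commit cfdf4dc4; the timeout and wait event are carried over from the quoted patch):

    /* Let the latch machinery handle postmaster death for us. */
    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);
    /* no explicit WL_POSTMASTER_DEATH check or proc_exit(1) needed */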
On Fri, 19 Jul 2019 at 17:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Thu, 9 May 2019 at 12:04, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Patches can be applied on top of undo branch [1] commit: > > (cb777466d008e656f03771cf16ec7ef9d6f2778b) > > > > [1] https://github.com/EnterpriseDB/zheap/tree/undo > > Below are some review points for 0009-undo-page-consistency-checker.patch : Another point that I missed : + * Process the undo record of the page and mask their cid filed. + */ + while (next_record < page_end) + { + UndoRecordHeader *header = (UndoRecordHeader *) next_record; + + /* If this undo record has cid present, then mask it */ + if ((header->urec_info & UREC_INFO_CID) != 0) Here, even though next record starts in the current page, the urec_info itself may or may not lie on this page. I hope this possibility is also considered when populating the partial-record-specific details in the page header. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On 2019-07-22 14:21:36 +0530, Amit Kapila wrote: > Another thing is changing subxids to fxids can increase the size of > two-phase file for a xact having many sub-transactions which again > might be okay, but not completely sure. I can't see that being a problem.
On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Mon, 22 Jul 2019 at 14:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > ------------- > > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + > + MyUndoWorker = &UndoApplyCtx->workers[slot]; > Not sure why MyUndoWorker is used here. Can't we use a local variable > ? Or do we intentionally attach to the slot as a side-operation ? > > ------------- > I think we can use a local variable here as well. Do you see any problem with the current code, or do you think it is better to use a local variable here? > -------------- > > + if (!InsertRequestIntoErrorUndoQueue(urinfo)) > I was thinking what happens if for some reason > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the > entry will not be marked invalid, and so there will be no undo action > carried out because I think the undo worker will exit. What happens > next with this entry ? The same entry is present in two queues, xid and size, so next time it will be executed from the second queue based on its priority in that queue. However, if it fails a second time in the same way, then we will be in trouble, because now the hash table has the entry but none of the queues does, so none of the workers will attempt to execute it again. Also, when the discard worker again tries to register it, we won't allow adding the entry to a queue, thinking either some backend is executing the same request or it must already be part of some queue. One possibility to deal with this could be to somehow allow the discard worker to register it again in the queue, or we could do this in a critical section so that it forces a system restart on error. However, the main question is: can InsertRequestIntoErrorUndoQueue fail at all, unless there is some bug in the code? If not, we might want to have an Assert for this rather than handling the condition. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, 23 Jul 2019 at 08:48, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > > On Mon, 22 Jul 2019 at 14:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > ------------- > > > > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > > +{ > > + /* Block concurrent access. */ > > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > > + > > + MyUndoWorker = &UndoApplyCtx->workers[slot]; > > Not sure why MyUndoWorker is used here. Can't we use a local variable > > ? Or do we intentionally attach to the slot as a side-operation ? > > > > ------------- > > > > I think we can use a local variable here as well. Do you see any > problem with the current code, or do you think it is better to use a > local variable here? I think, even though there might not be a correctness issue with the current code as it stands, we should still use a local variable. Updating MyUndoWorker is a big side-effect, which the caller is not supposed to be aware of, because all that function should do is just get the slot info. (A minimal sketch follows at the end of these review comments.) > > > -------------- > > > > + if (!InsertRequestIntoErrorUndoQueue(urinfo)) > > I was thinking what happens if for some reason > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the > > entry will not be marked invalid, and so there will be no undo action > > carried out because I think the undo worker will exit. What happens > > next with this entry ? > > The same entry is present in two queues, xid and size, so next time it > will be executed from the second queue based on its priority in that > queue. However, if it fails a second time in the same way, then we > will be in trouble, because now the hash table has the entry but none > of the queues does, so none of the workers will attempt to execute it > again. Also, when the discard worker again tries to register it, > we won't allow adding the entry to a queue, thinking either some backend > is executing the same request or it must already be part of some queue. > > One possibility to deal with this could be to somehow allow the > discard worker to register it again in the queue, or we could do this in > a critical section so that it forces a system restart on error. However, > the main question is: can InsertRequestIntoErrorUndoQueue fail at all, > unless there is some bug in the code? If not, we might want > to have an Assert for this rather than handling the condition. Yes, I also think that the function would error out only because of can't-happen cases, like "too many locks taken" or "out of binary heap slots" or "out of memory" (this last one is not such a can't-happen case). These cases probably happen due to some bugs, I suppose. But I was wondering: generally, when the code errors out with such can't-happen elog() calls, the worst thing that happens is that the transaction gets aborted. Whereas in this case, the worst thing that could happen is that the undo action would never get executed, which means selects for this tuple will keep on accessing the undo log. That does not sound like a data consistency issue, so we should be fine after all? -------------------- Some further review comments for undoworker.c: +/* Sets the worker's lingering status. */ +static void +UndoWorkerIsLingering(bool sleep) The function name sounds like "is the worker lingering?". Can we rename it to something like "UndoWorkerSetLingering"?
------------- + errmsg("undo worker slot %d is empty, cannot attach", + slot))); + } + + if (MyUndoWorker->proc) + { + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is already used by " + "another worker, cannot attach", slot))); These two error messages can have a common error message "could not attach to worker slot", with errdetail separate for each of them : slot %d is empty. slot %d is already used by another worker. -------------- +static int +IsUndoWorkerAvailable(void) +{ + int i; + int alive_workers = 0; + + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); Should have bool return value. Also, why is it keeping track of number of alive workers ? Sounds like earlier it used to return number of alive workers ? If it indeed needs to just return true/false, we can do away with alive_workers. Also, *exclusive* lock is unnecessary. -------------- +if (UndoGetWork(false, false, &urinfo, NULL) && + IsUndoWorkerAvailable()) + UndoWorkerLaunch(urinfo); There is no lock acquired between IsUndoWorkerAvailable() and UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable() returns true, there is a small window where UndoWorkerLaunch() does not find any worker slot with in_use false, causing assertion failure for (worker != NULL). -------------- +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) +{ + /* Block concurrent access. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); *Exclusive* lock is unnecessary. ------------- + LWLockRelease(UndoWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("undo worker slot %d is empty", + slot))); I believe there is no need to explicitly release an lwlock before raising an error, since the lwlocks get released during error recovery. Please check all other places where this is done. ------------- + * Start new undo apply background worker, if possible otherwise return false. worker, if possible otherwise => worker if possible, otherwise ------------- +static bool +UndoWorkerLaunch(UndoRequestInfo urinfo) We don't check UndoWorkerLaunch() return value. Can't we make it's return value type void ? Also, it would be better to have urinfo as pointer to UndoRequestInfo rather than UndoRequestInfo, so as to avoid structure copy. ------------- +{ + BackgroundWorker bgw; + BackgroundWorkerHandle *bgw_handle; + uint16 generation; + int i; + int slot = 0; We can remove variable i, and use slot variable in place of i. ----------- + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker"); I think it would be trivial to also append the worker->generation in the bgw_name. ------------- + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle)) + { + /* Failed to start worker, so clean up the worker slot. */ + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); + UndoWorkerCleanup(worker); + LWLockRelease(UndoWorkerLock); + + return false; + } Is it intentional that there is no (warning?) message logged when we can't register a bg worker ? ------------- -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Please find my review comments for 0013-Allow-foreground-transactions-to-perform-undo-action + /* initialize undo record locations for the transaction */ + for (i = 0; i < UndoLogCategories; i++) + { + s->start_urec_ptr[i] = InvalidUndoRecPtr; + s->latest_urec_ptr[i] = InvalidUndoRecPtr; + s->undo_req_pushed[i] = false; + } Can't we just memset this memory? (A sketch follows at the end of this review.) + * We can't postpone applying undo actions for subtransactions as the + * modifications made by aborted subtransaction must not be visible even if + * the main transaction commits. + */ + if (IsSubTransaction()) + return; I am not completely sure, but is it possible for the outer functions CommitTransactionCommand/AbortCurrentTransaction to avoid calling this function in the switch case, based on the current state, so that under a subtransaction this is never called? + /* + * Prepare required undo request info so that it can be used in + * exception. + */ + ResetUndoRequestInfo(&urinfo); + urinfo.dbid = dbid; + urinfo.full_xid = fxid; + urinfo.start_urec_ptr = start_urec_ptr[per_level]; + I see that we are preparing urinfo before execute_undo_actions so that in case of an error in CATCH we can use it to insert into the queue, but can't we just initialize urinfo right there, before inserting into the queue? We have all the information at that point. Am I missing something? + + /* + * We need the locations of the start and end undo record pointers when + * rollbacks are to be performed for prepared transactions using undo-based + * relations. We need to store this information in the file as the user + * might rollback the prepared transaction after recovery and for that we + * need it's start and end undo locations. + */ + UndoRecPtr start_urec_ptr[UndoLogCategories]; + UndoRecPtr end_urec_ptr[UndoLogCategories]; it's -> its + bool undo_req_pushed[UndoLogCategories]; /* undo request pushed + * to worker? */ + bool performUndoActions; + struct TransactionStateData *parent; /* back link to parent */ We must have some comments to explain how performUndoActions is used and where it's set. If it's explained somewhere else, then we can give a reference to that code. + for (i = 0; i < UndoLogCategories; i++) + { + if (s->latest_urec_ptr[i]) + { + s->performUndoActions = true; + break; + } + } I think we should check UndoRecPtrIsValid(s->latest_urec_ptr[i]). + PG_TRY(); + { + /* + * Prepare required undo request info so that it can be used in + * exception. + */ + ResetUndoRequestInfo(&urinfo); + urinfo.dbid = dbid; + urinfo.full_xid = fxid; + urinfo.start_urec_ptr = start_urec_ptr[per_level]; + + /* for subtransactions, we do partial rollback. */ + execute_undo_actions(urinfo.full_xid, + end_urec_ptr[per_level], + start_urec_ptr[per_level], + !isSubTrans); + } + PG_CATCH(); Wouldn't it be good to explain in the comments why we are not rethrowing the error in PG_CATCH: we don't want the main transaction to get an error if there is an error while applying the undo actions, and the transaction will be aborted in the caller of this function anyway? +tables are only accessible in the backend that has created them. We can't +postpone applying undo actions for subtransactions as the modifications +made by aborted subtransaction must not be visible even if the main transaction +commits.
I think we need to give detailed reasoning for why subtransaction changes would be visible if we don't apply their undo and the main transaction commits, by mentioning that we don't use a separate transaction id for the subtransaction, so a commit would make all changes made under that transaction id visible at once. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
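As promised above, the memset form of that initialization; note that it is valid only on the assumption that InvalidUndoRecPtr is numerically zero (and false is zero), which would deserve a comment in the code if taken:

    /* Sketch: equivalent to the per-category loop, assuming
     * InvalidUndoRecPtr == 0 so zeroed bytes are a valid initializer. */
    memset(s->start_urec_ptr, 0, sizeof(s->start_urec_ptr));
    memset(s->latest_urec_ptr, 0, sizeof(s->latest_urec_ptr));
    memset(s->undo_req_pushed, 0, sizeof(s->undo_req_pushed));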
Hi,
I have started review of 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch; here are a few
quick comments.
1) README.undointerface should provide more information like API details or
the sequence in which the APIs should be called.
2) Information about the APIs in the undoaccess.c file header block would be
good. For reference, please look at heapam.c.
3) typo
+ * Later, during insert phase we will write actual records into thse buffers.
+ */
%s/thse/these
4) UndoRecordUpdateTransInfo()'s comments say that it must be called inside
a critical section, but it seems that undo_xlog_apply_progress() calls it
outside of a critical section. Is there an exception? If so, we should add
comments; or am I missing something?
5) In function UndoBlockGetFirstUndoRecord() below code:
/* Calculate the size of the partial record. */
partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
phdr->tuple_len + phdr->payload_len -
phdr->record_offset;
can directly use UndoPagePartialRecSize().
6)
+static int
+UndoGetBufferSlot(UndoRecordInsertContext *context,
+ RelFileNode rnode,
+ BlockNumber blk,
+ ReadBufferMode rbm)
+{
+ int i;
In the above code variable "i" is mean "block index". It would be good
to give some valuable name to the variable, maybe "blockIndex" ?
7)
* We will also keep a previous undo record pointer to the first and last undo
* record of the transaction in the previous log. The last undo record
* location is used find the previous undo record pointer during rollback.
%s/used find/used to find
8)
/*
* Defines the number of times we try to wait for rollback hash table to get
* initialized. After these many attempts it will return error and the user
* can retry the operation.
*/
#define ROLLBACK_HT_INIT_WAIT_TRY 60
%s/error/an error
9)
* we can get the exact size of partial record in this page.
*/
%s/of partial/of the partial"
10)
* urecptr - current transaction's undo record pointer which need to be set in
* the previous transaction's header.
%s/need/needs
11)
/*
* If we are writing first undo record for the page the we can set the
* compression so that subsequent records from the same transaction can
* avoid including common information in the undo records.
*/
%s/the page the/the page then
12)
/*
* If the transaction's undo records are split across the undo logs. So
* we need to update our own transaction header in the previous log.
*/
double space between "to" and "update"
13)
* The undo record should be freed by the caller by calling ReleaseUndoRecord.
* This function will old the pin on the buffer where we read the previous undo
* record so that when this function is called repeatedly with the same context
%s/old/hold
I will continue further review of the same patch.
Regards,
Rushabh Lathia
On Wed, Jul 24, 2019 at 11:28 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > > Hi, > > I have started review of > 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch; here are a few > quick comments. Thanks for the review; I will work on them soon and post the updated patch along with other comments. I have noticed that some of the comments point to code which is not part of this patch, for example: > > 5) In function UndoBlockGetFirstUndoRecord() below code: > > /* Calculate the size of the partial record. */ > partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + > phdr->tuple_len + phdr->payload_len - > phdr->record_offset; > > can directly use UndoPagePartialRecSize(). UndoBlockGetFirstUndoRecord is added under the 0014 patch; I think you got confused because this code is in the undoaccess.c file, but a later patch in the set adds some code under undoaccess.c. Basically, this comment needs to be addressed, but under another patch. I am pointing it out so that we don't miss it. > 8) > > /* > * Defines the number of times we try to wait for rollback hash table to get > * initialized. After these many attempts it will return error and the user > * can retry the operation. > */ > #define ROLLBACK_HT_INIT_WAIT_TRY 60 > > %s/error/an error > This macro is also added under 0014. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > On Fri, Jun 28, 2019 at 6:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > I happened to open up 0001 from this series, which is from Thomas, and > > > I do not think that the pg_buffercache changes are correct. The idea > > > here is that the customer might install version 1.3 or any prior > > > version on an old release, then upgrade to PostgreSQL 13. When they > > > do, they will be running with the old SQL definitions and the new > > > binaries. At that point, it sure looks to me like the code in > > > pg_buffercache_pages.c is going to do the Wrong Thing. [...] > > > > Yep, that was completely wrong. Here's a new version. > > > > One comment/question related to > 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch. > I have done some more review of the undolog patch series and here are my comments: 0003-Add-undo-log-manager.patch 1. allocate_empty_undo_segment() { .. .. if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST) + { + char *parentdir; + + if (errno != ENOENT || !InRecovery) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", + undo_path))); + + /* + * In recovery, it's possible that the tablespace directory + * doesn't exist because a later WAL record removed the whole + * tablespace. In that case we create a regular directory to + * stand in for it. This is similar to the logic in + * TablespaceCreateDbspace(). + */ + + /* create two parents up if not exist */ + parentdir = pstrdup(undo_path); + get_parent_directory(parentdir); + get_parent_directory(parentdir); + /* Can't create parent and it doesn't already exist? */ + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST) All of this code is almost the same as the code in TablespaceCreateDbspace; still, there are small differences, like here you are using mkdir instead of MakePGDirectory, which as far as I can see uses similar permissions for creating the directory. Also, that code checks whether the directory exists before trying to create it. Is there a reason why we need to do a few things differently here? If not, can't both places use one common function? 2. allocate_empty_undo_segment() { .. .. /* Flush the contents of the file to disk before the next checkpoint. */ + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace); .. } +void +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace) +{ + char path[MAXPGPATH]; + FileTag tag; + + INIT_UNDOFILETAG(tag, logno, tablespace, segno); + + /* Try to send to the checkpointer, but if out of space, do it here. */ + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false)) The comment in allocate_empty_undo_segment indicates that the code wants to flush before the checkpoint, but the actual function tries to register the request with the checkpointer. Shouldn't this be similar to XLogFileInit, where we use pg_fsync to flush the contents immediately? I guess that would avoid what you have written in the comments in the same function (we just want to make sure that the filesystem has allocated physical blocks for it so that non-COW filesystems will report ENOSPC now rather than later when space is needed). OTOH, I think it is better performance-wise to postpone the work to the checkpointer.
If we want to push this work to the checkpointer, then we might need to change the comments; alternatively, we might want to use bigger segment sizes to mitigate the performance effect. If my above understanding is correct and the reason to fsync immediately is to reserve space now, then we also need to think about whether we are always safe in postponing the work. Basically, if this means that it can fail when we are actually trying to write undo, then it could be risky, because we could be in a critical section at that time. I am not sure about this point; it is rather just to discuss whether there are any impacts of postponing the fsync work. Another thing is that recently in commit 475861b261 (a commit by you), we introduced a mechanism to not fill the files with zeroes for certain filesystems like ZFS. Do we want similar behavior for undo files? 3. +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) +{ + UndoLogSlot *slot; + size_t end; + + slot = find_undo_log_slot(logno, false); + + /* TODO review interlocking */ + + Assert(slot != NULL); + Assert(slot->meta.end % UndoLogSegmentSize == 0); + Assert(new_end % UndoLogSegmentSize == 0); + Assert(InRecovery || + CurrentSession->attached_undo_slots[slot->meta.category] == slot); Can you write some comments explaining the above Asserts? Also, can you explain what interlocking issues you are worried about here? 4. while (end < new_end) + { + allocate_empty_undo_segment(logno, slot->meta.tablespace, end); + end += UndoLogSegmentSize; + } + + /* Flush the directory entries before next checkpoint. */ + undofile_request_sync_dir(slot->meta.tablespace); I see that in two places, after allocating an empty undo segment, the patch performs undofile_request_sync_dir, whereas it doesn't do the same in UndoLogNewSegment. Is there a reason for that, or is it missing from one of the places? 5. +static void +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) { .. /* + * We didn't need to acquire the mutex to read 'end' above because only + * we write to it. But we need the mutex to update it, because the + * checkpointer might read it concurrently. Is this assumption correct? It seems the patch also modifies slot->meta.end during discard, in the function UndoLogDiscard. I am referring to the below code: +UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid) { .. + /* Update shmem to show the new discard and end pointers. */ + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); + slot->meta.discard = discard; + slot->meta.end = end; + LWLockRelease(&slot->mutex); .. } 6. extend_undo_log() { .. .. if (!InRecovery) + { + xl_undolog_extend xlrec; + XLogRecPtr ptr; + + xlrec.logno = logno; + xlrec.end = end; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); + XLogFlush(ptr); + } It is not obvious to me why we need to perform XLogFlush here; can you explain? 7. +attach_undo_log(UndoLogCategory category, Oid tablespace) { .. if (candidate->meta.tablespace == tablespace) + { + logno = *place; + slot = candidate; + *place = candidate->next_free; + break; + } Here, the code is breaking from the loop, so why do we need to set *place? Am I missing something obvious? 8. + /* WAL-log the creation of this new undo log.
*/
+ {
+ xl_undolog_create xlrec;
+
+ xlrec.logno = logno;
+ xlrec.tablespace = slot->meta.tablespace;
+ xlrec.category = slot->meta.category;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));

Here and in most other places in this patch you are using sizeof(xlrec) for
the xlog static data. However, as far as I know, elsewhere in the code we
define the size using the offset of the last member of the corresponding
structure, to avoid any inconsistency in WAL record size across different
platforms. Is there a reason to do it differently in this patch? See below
for an example:

typedef struct xl_hash_add_ovfl_page
{
uint16 bmsize;
bool bmpage_found;
} xl_hash_add_ovfl_page;

#define SizeOfHashAddOvflPage \
(offsetof(xl_hash_add_ovfl_page, bmpage_found) + sizeof(bool))

9.
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+ xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+ UndoLogSlot *slot;
+
+ /* Create meta-data space in shared memory. */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+ /* TODO: assert that it doesn't exist already? */
+
+ slot = allocate_undo_log_slot();
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);

Why do we need to acquire locks during recovery?

10.
I think UndoLogAllocate can leak allocation of slots. It first allocates
the slot for a new log from the free pool if there is no existing slot/log,
writes a WAL record, and then at a later point of time it actually creates
the required physical space in the log via extend_undo_log, which also
writes a separate WAL record. Now, if there is an error between these two
operations, then we will have a redundant slot allocated. What if there are
repeated errors of this kind from multiple backends, after which the system
crashes? Then, after restart, we will allocate multiple slots for different
lognos which don't have any actual (physical) logs. This might not be a big
problem in practice because the chances of an error between the two
operations are small, but can't we delay the WAL logging for the allocation
of a slot for a new log?

11.
+UndoLogAllocate()
{
..
..
+ /*
+ * Maintain our tracking of the and the previous transaction start
+ * locations.
+ */
+ if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
+ {
+ slot->meta.unlogged.last_xact_start =
+ slot->meta.unlogged.this_xact_start;
+ slot->meta.unlogged.this_xact_start = slot->meta.unlogged.insert;
+ }

".. of the and the ..": after the first "the", something is missing.

12.
UndoLogAllocate()
{
..
..
+ /*
+ * We don't need to acquire log->mutex to read log->meta.insert and
+ * log->meta.end, because this backend is the only one that can
+ * modify them.
+ */
+ if (unlikely(new_insert > slot->meta.end))

I might be confused, but slot->meta.end is modified by the discard process
also, so how is this safe? If it is safe, maybe adding a comment to explain
why would be good. Also, I think in the comments "log" should be replaced
with "slot".

13.
UndoLogAllocate()
{
..
+ /* This undo log is entirely full. Get a new one. */
+ if (logxid == GetTopTransactionId())
+ {
+ /*
+ * If the same transaction is split over two undo logs then
+ * store the previous log number in new log. See detailed
+ * comments in undorecord.c file header.
+ */
..
}

The reference to undorecord.c should be changed to undoaccess.c.

14.
UndoLogAllocate()
{
..
+ if (logxid != GetTopTransactionId())
+ {
+ /*
+ * While we have the lock, check if we have been forcibly detached by
+ * DROP TABLESPACE. That can only happen between transactions (see
+ * DropUndoLogsInsTablespace()).
+ */

/DropUndoLogsInsTablespace/DropUndoLogsInTablespace

15.
UndoLogSegmentPath()
{
..
/*
+ * Build the path from log number and offset. The pathname is the
+ * UndoRecPtr of the first byte in the segment in hexadecimal, with a
+ * period inserted between the components.
+ */
+ snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
+ segno * UndoLogSegmentSize);
..
}

a. It is not very clear from the above code why we are multiplying segno by
UndoLogSegmentSize. I see that many of the callers pass segno as
segno/UndoLogSegmentSize. Won't it be better if the caller takes care of
passing the correct value of segno?
b. In the comment above, instead of "offset", shouldn't it say "segment
number"?

16.
UndoLogGetLastXactStartPoint is not used anywhere. I think this was
required in a previous version of the patch set; now we can remove it.

17.
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com

This discussion link seems to be from an old discussion/thread, not this
one.

0019-Add-developer-documentation-for-the-undo-log-storage
18.
+each undo log, a set of meta-data properties is tracked:
+tracked, including:
+
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged or temporary)

Here, don't we want to refer to UndoLogCategory rather than the persistence
level? Also, "tracked, including:" seems a bit confusing.

0020-Add-user-facing-documentation-for-undo-logs
19.
<row>
+ <entry><structfield>persistence</structfield></entry>
+ <entry><type>text</type></entry>
+ <entry>Persistence level of data stored in this undo log; one of
+ <literal>permanent</literal>, <literal>unlogged</literal> or
+ <literal>temporary</literal>.</entry>
+ </row>

Don't we want to cover the new (shared) undolog category here?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
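For reference, applying the offsetof() convention from point 8 to the
xl_undolog_create record quoted above would look something like this (a
sketch only; the struct fields are taken from the quoted patch excerpt, and
the macro name is made up):

    typedef struct xl_undolog_create
    {
        UndoLogNumber   logno;
        Oid             tablespace;
        UndoLogCategory category;
    } xl_undolog_create;

    #define SizeOfUndologCreate \
        (offsetof(xl_undolog_create, category) + sizeof(UndoLogCategory))

    ...
    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, SizeOfUndologCreate);

That way any trailing padding in the struct never becomes part of the WAL
record.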
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > Yep, that was completely wrong. Here's a new version.
>
> 10.
> I think UndoLogAllocate can leak allocation of slots. It first allocates
> the slot for a new log from the free pool if there is no existing slot/log,
> writes a WAL record, and then at a later point of time it actually creates
> the required physical space in the log via extend_undo_log, which also
> writes a separate WAL record. Now, if there is an error between these two
> operations, then we will have a redundant slot allocated. What if there are
> repeated errors of this kind from multiple backends, after which the system
> crashes? Then, after restart, we will allocate multiple slots for different
> lognos which don't have any actual (physical) logs. This might not be a big
> problem in practice because the chances of an error between the two
> operations are small, but can't we delay the WAL logging for the allocation
> of a slot for a new log?
>

After sending this email, I was browsing the previous comments I had raised
for this patch, and it seems this same point was raised previously [1] as
well, and there were a few additional questions related to it (see point 1
in email [1]).

[1] - https://www.postgresql.org/message-id/CAA4eK1LDctrYeZ8ev1N1v-8KwiigAmNMx%3Dt-UTs9qgEFt%2BP0XQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > 7. > +attach_undo_log(UndoLogCategory category, Oid tablespace) > { > .. > if (candidate->meta.tablespace == tablespace) > + { > + logno = *place; > + slot = candidate; > + *place = candidate->next_free; > + break; > + } > > Here, the code is breaking from the loop, so why do we need to set > *place? Am I missing something obvious? > I think I know what I was missing. It seems here you are removing an element from the freelist. One point related to detach_current_undo_log. + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); + slot->pid = InvalidPid; + slot->meta.unlogged.xid = InvalidTransactionId; + if (full) + slot->meta.status = UNDO_LOG_STATUS_FULL; + LWLockRelease(&slot->mutex); If I read the comments in structure UndoLogMetaData, it is mentioned that 'status' is changed by explicit WAL record whereas there is no WAL record in code to change the status. I see the problem as well if we don't WAL log this change. Suppose after changing the status of this log, we allocate a new log and insert some records in that log as well for the same transaction for which we have inserted records in the log which we just marked as FULL. Now, here we form the link between two logs as the same transaction has overflowed into a new log. Say, we crash after this. Now, after recovery the log won't be marked as FULL which means there is a chance that it can be used for some other transaction, if that happens, then our link for a transaction spanning to different log will break and we won't be able to access the data in another log. In short, I think it is important to WAL log this status change unless I am missing something. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
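To make that suggestion concrete, WAL-logging the status change could
follow the pattern of the other undo log records quoted in this thread,
roughly like this (a sketch; the record struct, its name, and the
XLOG_UNDOLOG_MARK_FULL symbol are hypothetical):

    typedef struct xl_undolog_mark_full
    {
        UndoLogNumber logno;    /* log transitioning to UNDO_LOG_STATUS_FULL */
    } xl_undolog_mark_full;

    /* in detach_current_undo_log(), before updating shared memory */
    if (full && !InRecovery)
    {
        xl_undolog_mark_full xlrec;

        xlrec.logno = slot->logno;
        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
        XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_MARK_FULL);
    }

with a corresponding redo routine that re-marks the slot as full, so that
after crash recovery the log can never be handed out to another
transaction.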
Hi, I have done some review of undolog patch series and here are my comments: 0003-Add-undo-log-manager.patch 1) As undo log is being created in tablespace, if the tablespace is dropped later, will it have any impact? +void +UndoLogDirectory(Oid tablespace, char *dir) +{ + if (tablespace == DEFAULTTABLESPACE_OID || + tablespace == InvalidOid) + snprintf(dir, MAXPGPATH, "base/undo"); + else + snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo", + tablespace, TABLESPACE_VERSION_DIRECTORY); +} 2) Header file exclusion a) The following headers can be excluded in undolog.c +#include "access/transam.h" +#include "access/undolog.h" +#include "access/xlogreader.h" +#include "catalog/catalog.h" +#include "nodes/execnodes.h" +#include "storage/buf.h" +#include "storage/bufmgr.h" +#include "storage/fd.h" +#include "storage/lwlock.h" +#include "storage/shmem.h" +#include "storage/standby.h" +#include "storage/sync.h" +#include "utils/memutils.h" b) The following headers can be excluded from undofile.c +#include "access/undolog.h" +#include "catalog/database_internal.h" +#include "miscadmin.h" +#include "postmaster/bgwriter.h" +#include "storage/fd.h" +#include "storage/smgr.h" +#include "utils/memutils.h" 3) Some macro replacement. a)Session.h +++ b/src/include/access/session.h @@ -17,6 +17,9 @@ /* Avoid including typcache.h */ struct SharedRecordTypmodRegistry; +/* Avoid including undolog.h */ +struct UndoLogSlot; + /* * A struct encapsulating some elements of a user's session. For now this * manages state that applies to parallel query, but it principle it could @@ -27,6 +30,10 @@ typedef struct Session dsm_segment *segment; /* The session-scoped DSM segment. */ dsa_area *area; /* The session-scoped DSA area. */ + /* State managed by undolog.c. */ + struct UndoLogSlot *attached_undo_slots[4]; /* UndoLogCategories */ + bool need_to_choose_undo_tablespace; + Should we change 4 to UndoLogCategories or suitable macro? b) +static inline size_t +UndoLogNumSlots(void) +{ + return MaxBackends * 4; +} Should we change 4 to UndoLogCategories or suitable macro c) +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, + UndoLogOffset end) +{ + struct stat stat_buffer; + off_t size; + char path[MAXPGPATH]; + void *zeroes; + size_t nzeroes = 8192; + int fd; should we use BLCKSZ instead of 8192? 4) Should we add a readme file for undolog as it does a fair amount of work and is core part of the undo system? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com On Wed, Jul 24, 2019 at 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 7. > > +attach_undo_log(UndoLogCategory category, Oid tablespace) > > { > > .. > > if (candidate->meta.tablespace == tablespace) > > + { > > + logno = *place; > > + slot = candidate; > > + *place = candidate->next_free; > > + break; > > + } > > > > Here, the code is breaking from the loop, so why do we need to set > > *place? Am I missing something obvious? > > > > I think I know what I was missing. It seems here you are removing an > element from the freelist. > > One point related to detach_current_undo_log. 
> > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->pid = InvalidPid; > + slot->meta.unlogged.xid = InvalidTransactionId; > + if (full) > + slot->meta.status = UNDO_LOG_STATUS_FULL; > + LWLockRelease(&slot->mutex); > > If I read the comments in structure UndoLogMetaData, it is mentioned > that 'status' is changed by explicit WAL record whereas there is no > WAL record in code to change the status. I see the problem as well if > we don't WAL log this change. Suppose after changing the status of > this log, we allocate a new log and insert some records in that log as > well for the same transaction for which we have inserted records in > the log which we just marked as FULL. Now, here we form the link > between two logs as the same transaction has overflowed into a new > log. Say, we crash after this. Now, after recovery the log won't be > marked as FULL which means there is a chance that it can be used for > some other transaction, if that happens, then our link for a > transaction spanning to different log will break and we won't be able > to access the data in another log. In short, I think it is important > to WAL log this status change unless I am missing something. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > > -- Regards, vignesh Have a nice day
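To make vignesh's point 3 concrete: assuming UndoLogCategories is the count
member of the category enum (the /* UndoLogCategories */ comment in the
quoted code suggests it is), the literal constants could be replaced along
these lines (a sketch):

    /* 3(b), in undolog.c */
    static inline size_t
    UndoLogNumSlots(void)
    {
        return MaxBackends * UndoLogCategories;
    }

    /* 3(c), in allocate_empty_undo_segment() */
    size_t      nzeroes = BLCKSZ;   /* instead of the literal 8192 */

For 3(a) it is less direct, because session.h deliberately avoids including
undolog.h; the array size there would have to come from a mirrored #define
rather than from the enum itself.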
On Wed, Jul 24, 2019 at 11:04 PM vignesh C <vignesh21@gmail.com> wrote: > > Hi, > > I have done some review of undolog patch series > and here are my comments: > 0003-Add-undo-log-manager.patch > > 1) As undo log is being created in tablespace, > if the tablespace is dropped later, will it have any impact? > Yes, it drops the undo logs present in tablespace being dropped. See DropUndoLogsInTablespace() in the same patch. > > 4) Should we add a readme file for undolog as it does a fair amount of work > and is core part of the undo system? > The Readme is already present in the patch series posted by Thomas. See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in email [1]. [1] - https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 25, 2019 at 7:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 11:04 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > Hi,
> >
> > I have done some review of undolog patch series
> > and here are my comments:
> > 0003-Add-undo-log-manager.patch
> >
> > 1) As undo log is being created in tablespace,
> > if the tablespace is dropped later, will it have any impact?

Thanks Amit, that clarifies the problem I was thinking of. I have another
question regarding a drop tablespace failure, but I don't have a better
solution for that problem. Let me think more about it and then discuss.

> Yes, it drops the undo logs present in tablespace being dropped. See
> DropUndoLogsInTablespace() in the same patch.
>
> >
> > 4) Should we add a readme file for undolog as it does a fair amount of work
> > and is core part of the undo system?

Thanks Amit, I found the details in the readme.

> The Readme is already present in the patch series posted by Thomas.
> See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in
> email [1].
>
> [1] - https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hello Thomas,

Here are some review comments on 0003-Add-undo-log-manager.patch. I've
tried to avoid duplicate comments as much as possible.

1. In UndoLogAllocate,
+ * time this backend as needed to write to an undo log at all or because
s/as/has

+ * Maintain our tracking of the and the previous transaction start
Do you mean the current log's transaction start as well?

2. In UndoLogAllocateInRecovery, we try to find the current log from the
first undo buffer. So, after a log switch, we always have to register at
least one buffer from the current undo log first. If we're updating
something in the previous log, the respective buffer should be registered
after that. I think we should document this in the comments.

3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full), it seems the
'full' parameter is not used anywhere. Do we still need this?

+ /* It's been recycled. SO it must have been entirely discarded. */
s/SO/So

4. In CleanUpUndoCheckPointFiles, we can emit a debug2 message with
something similar to: 'removed unreachable undo metadata files'

+ if (unlink(path) != 0)
+ elog(ERROR, "could not unlink file \"%s\": %m", path);
According to my observation, whenever we deal with a file operation, we
usually emit an ereport message with errcode_for_file_access(). Should we
change it to ereport? There are other file operations as well, including
read(), OpenTransientFile() etc.

5. In CheckPointUndoLogs,
+ /* Capture snapshot while holding each mutex. */
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ serialized[num_logs++] = slot->meta;
+ LWLockRelease(&slot->mutex);
Why do we need an exclusive lock to read something from the slot? A share
lock seems to be sufficient.

pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called after
pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE) without calling
pgstat_report_wait_end(). I think you've done that to avoid an extra
function call, but it differs from other places in the PG code. Perhaps we
should follow this approach everywhere.

6. In StartupUndoLogs,
+ if (fd < 0)
+ elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
Assuming you agree to change the above elog to ereport, the message should
be more user-friendly, maybe something like 'cannot open pg_undo file'.

+ if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != sizeof(slot->meta))
The usage of sizeof doesn't look like a problem, but we can save some extra
padding bytes at the end if we use the (offsetof + sizeof) approach,
similar to other places in PG.

7. In free_undo_log_slot,
+ /*
+ * When removing an undo log from a slot in shared memory, we acquire
+ * UndoLogLock, log->mutex and log->discard_lock, so that other code can
+ * hold any one of those locks to prevent the slot from being recycled.
+ */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ Assert(slot->logno != InvalidUndoLogNumber);
+ slot->logno = InvalidUndoLogNumber;
+ memset(&slot->meta, 0, sizeof(slot->meta));
+ LWLockRelease(&slot->mutex);
+ LWLockRelease(UndoLogLock);
You've not taken the discard_lock as mentioned in the comment.

8. In find_undo_log_slot,
+ * 1. If the calling code knows that it is attached to this lock or is the
s/lock/slot

+ * 2. All other code should acquire log->mutex before accessing any members,
+ * and after doing so, check that the logno hasn't moved. If it is not, the
+ * entire undo log must be assumed to be discarded (as if this function
+ * returned NULL) and the caller must behave accordingly.
Perhaps, you meant '..check that the logno remains same. If it is not..'. + /* + * If we didn't find it, then it must already have been entirely + * discarded. We create a negative cache entry so that we can answer + * this question quickly next time. + * + * TODO: We could track the lowest known undo log number, to reduce + * the negative cache entry bloat. + */ This is an interesting thought. But, I'm wondering how we are going to search the discarded logno in the simple hash. I guess that's why it's in the TODO list. 9. In attach_undo_log, + * For now we have a simple linked list of unattached undo logs for each + * persistence level. We'll grovel though it to find something for the + * tablespace you asked for. If you're not using multiple tablespaces s/though/through + if (slot == NULL) + { + if (UndoLogShared->next_logno > MaxUndoLogNumber) + { + /* + * You've used up all 16 exabytes of undo log addressing space. + * This is a difficult state to reach using only 16 exabytes of + * WAL. + */ + elog(ERROR, "undo log address space exhausted"); + } looks like a potential unlikely() condition. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
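For what it's worth, the condition from point 9 wrapped in unlikely() would
simply be:

    if (unlikely(UndoLogShared->next_logno > MaxUndoLogNumber))
        elog(ERROR, "undo log address space exhausted");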
On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have done some more review of the undolog patch series and here are my comments:

Hi Amit,

Thanks! There are a number of actionable changes in your review. I'll be
posting a new patch set soon that will address most of your complaints
individually. In this message I want to respond to one topic area, because
the answer is long enough already:

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }
>
> +void
> +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> +{
> + char path[MAXPGPATH];
> + FileTag tag;
> +
> + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> +
> + /* Try to send to the checkpointer, but if out of space, do it here. */
> + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
>
>
> The comment in allocate_empty_undo_segment indicates that the code wants to
> flush before the checkpoint, but the actual function tries to register the
> request with the checkpointer. Shouldn't this be similar to XLogFileInit,
> where we use pg_fsync to flush the contents immediately? I guess that would
> avoid what you have written in the comments in the same function (we just
> want to make sure that the filesystem has allocated physical blocks for it
> so that non-COW filesystems will report ENOSPC now rather than later when
> space is needed). OTOH, I think it is performance-wise better to postpone
> the work to the checkpointer. If we want to push this work to checkpointer,
> then we might need to change comments or alternatively, we might want to
> use bigger segment sizes to mitigate the performance effect.

In an early version I was doing the fsync() immediately. While testing
zheap, Mithun CY reported that whenever segments couldn't be recycled in
the background, such as during a somewhat long-running transaction, he
could measure ~6% of the time being spent waiting for fsync(), and
throughput increased with bigger segments (and thus fewer files to
fsync()). Passing the work off to the checkpointer is better not only
because it's done in the background but also because there is a chance that
the work can be consolidated with other sync requests, and perhaps even
avoided completely if the file is discarded and unlinked before the next
checkpoint.

I'll update the comment to make it clearer.

> If my above understanding is correct and the reason to fsync
> immediately is to reserve space now, then we also need to think
> whether we are always safe in postponing the work? Basically, if this
> means that it can fail when we are actually trying to write undo, then
> it could be risky because we could be in the critical section at that
> time. I am not sure about this point, rather it is just to discuss if
> there are any impacts of postponing the fsync work.

Here is my theory for why this arrangement is safe, and why it differs from
what we're doing with WAL segments and regular relation files. First, let's
review why those things work the way they do (as I understand it):

1. WAL's use of fdatasync(): The reason we fill and then fsync() newly
created WAL files up front is because we want to make sure the blocks are
definitely on disk.
The comment doesn't spell out exactly why the author considered later fdatasync() calls to be insufficient, but they were: it was many years after commit 33cc5d8a4d0d that Linux ext3/4 filesystems began flushing file size changes to disk in fdatasync()[1][2]. I don't know if its original behaviour was intentional or not. So, if you didn't use the bigger fsync() hammer on that OS, you might lose the end of a recently extended file in a power failure even though fdatasync() had returned success. By my reading of POSIX, that shouldn't be necessary on a conforming implementation of fdatasync(), and that was fixed years ago in Linux. I'm not proposing any changes there, and I'm not proposing to take advantage of that in the new code. I'm pointing out that that we don't have to worry about that for these undo segments, because they are already flushed with fsync(), not fdatasync(). (To understand POSIX's descriptions of fsync() and fdatasync() you have to find the meanings of "Synchronized I/O Data Integrity Completion" and "Synchronized I/O File Integrity Completion" elsewhere in the spec. TL;DR: fdatasync() is only allowed to skip flushing attributes like the modified time, it's not allowed to skip flushing a file size change since that would interfere with retrieving the data.) 2. Time of reservation: Although they don't call fsync(), regular relations and these new undo files still write zeroes up front (respectively, for a new block and for a new segment). One reason for that is that most popular filesystems reserve space at write time, so you'll get ENOSPC when trying to allocate undo space, and that's a non-fatal ERROR. If we deferred until writing back buffer contents, we might get file holes, and deferred ENOSPC is much harder to report to users and for users to deal with. You can still get a ENOSPC at checkpoint write-back time on COW systems like ZFS, and there is not much I can do about that. You can still get ENOSPC at checkpoint fsync() time on NFS, and there's not much we can do about that for now except panic (without direct IO, or other big changes). 3. Separate size tracking: Another reason that regular relations write out zeroes at relation-extension time is that that's the only place that the size of a relation is recorded. PostgreSQL doesn't track the number of blocks itself, so we can't defer file extension until write-back from our buffer pool. Undo doesn't rely on the filesystem to track the amount of undo data, it has its own crash-safe tracking of the discard and end pointers, which can be used to know which segment files exist and what ranges contain data. That allows us to work in whole files at a time, like WAL logs, even though we still have checkpoint-based flushing rules. To summarise, we write zeroes so we can report ENOSPC errors as early as possible, but we defer and consolidate fsync() calls because the files' contents and names don't actually have to survive power loss until a checkpoint says they existed at that point in the WAL stream. Does this make sense? BTW we could probably use posix_fallocate() instead of writing zeroes; I think Andres mentioned that recently. I see also that someone tried that for WAL and it got reverted back in 2013 (commit b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down the discussion). [1] https://lkml.org/lkml/2012/9/3/83 [2] https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5 -- Thomas Munro https://enterprisedb.com
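Putting that argument into code form: the allocation path sketched by the
quoted allocate_empty_undo_segment() boils down to something like the
following (a simplified sketch; error handling and recycling logic
trimmed):

    char    zeroes[8192] = {0};
    size_t  written_so_far = 0;
    int     fd;

    fd = OpenTransientFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
    if (fd < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create file \"%s\": %m", path)));

    /*
     * Write real zeroes so non-COW filesystems reserve physical blocks and
     * report ENOSPC now, while it is still a non-fatal ERROR.
     */
    while (written_so_far < UndoLogSegmentSize)
    {
        ssize_t rc = write(fd, zeroes,
                           Min(sizeof(zeroes),
                               UndoLogSegmentSize - written_so_far));

        if (rc < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write to file \"%s\": %m", path)));
        written_so_far += rc;
    }
    CloseTransientFile(fd);

    /*
     * Durability is deferred: the checkpointer will fsync() the file before
     * the next checkpoint completes, possibly consolidating the request
     * with others, or skipping it if the segment is unlinked first.
     */
    undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);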
Hi Thomas,

I have started reviewing 0003-Add-undo-log-manager. I haven't reviewed it
fully yet, but I noticed some places where, instead of UndoRecPtr, you are
directly using UndoLogOffset, which seem like bugs to me.

1.
+UndoRecPtr
+UndoLogAllocateInRecovery(UndoLogAllocContext *context,
+ TransactionId xid,
+ uint16 size,
+ bool *need_xact_header,
+ UndoRecPtr *last_xact_start,
....
+ *need_xact_header =
+ context->try_location == InvalidUndoRecPtr &&
+ slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
+ *last_xact_start = slot->meta.unlogged.last_xact_start;

The output parameter last_xact_start is of type UndoRecPtr whereas
slot->meta.unlogged.last_xact_start is of type UndoLogOffset; shouldn't we
use MakeUndoRecPtr(logno, offset) here?

2.
+ slot = find_undo_log_slot(logno, false);
+ if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
+ {
+ *need_xact_header = false;
+ return try_offset;
+ }

Here also you are directly returning try_offset instead of an UndoRecPtr.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
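If these are indeed bugs, the fixes would presumably be along the lines
Dilip suggests, using the MakeUndoRecPtr(logno, offset) conversion he names
(a sketch):

    /* 1: convert the offset to an UndoRecPtr before returning it */
    *last_xact_start = MakeUndoRecPtr(slot->logno,
                                      slot->meta.unlogged.last_xact_start);

    /* 2: likewise for the early-return path */
    if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
    {
        *need_xact_header = false;
        return MakeUndoRecPtr(logno, try_offset);
    }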
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > Yep, that was completely wrong. Here's a new version.
> > >
> >
> > One comment/question related to
> > 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch.
> >
>
> I have done some more review of the undolog patch series and here are my comments:
> 0003-Add-undo-log-manager.patch
>

Some more review of the same patch:

1.
+typedef struct UndoLogSharedData
+{
+ UndoLogNumber free_lists[UndoLogCategories];
+ UndoLogNumber low_logno;

What is the use of low_logno? I don't see it being assigned any value
anywhere in the code. Is it for some future use?

2.
+void
+CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
{
..
+ /* Compute header checksum. */
+ INIT_CRC32C(crc);
+ COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
+ COMP_CRC32C(crc, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno));
+ COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
+ FIN_CRC32C(crc);
+
+ /* Write out the number of active logs + crc. */
+ if ((write(fd, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno)) ||
+ (write(fd, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno)) != sizeof(UndoLogShared->next_logno)) ||

Is it safe to read UndoLogShared without UndoLogLock? All other places
accessing UndoLogShared use UndoLogLock, so if this usage is safe, maybe it
is better to add a comment.

3.
UndoLogAllocateInRecovery()
{
..
/*
+ * Otherwise we need to do our own transaction tracking
+ * whenever we see a new xid, to match the logic in
+ * UndoLogAllocate().
+ */
+ if (xid != slot->meta.unlogged.xid)
+ {
+ slot->meta.unlogged.xid = xid;
+ if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert)
+ slot->meta.unlogged.last_xact_start =
+ slot->meta.unlogged.this_xact_start;
+ slot->meta.unlogged.this_xact_start =
+ slot->meta.unlogged.insert;

The code doesn't follow the comment. In UndoLogAllocate, both
last_xact_start and this_xact_start are assigned in the if block, so the
same should be the case here.

4.
UndoLogAllocateInRecovery()
{
..
+ /*
+ * Just as in UndoLogAllocate(), the caller may be extending an existing
+ * allocation before committing with UndoLogAdvance().
+ */
+ if (context->try_location != InvalidUndoRecPtr)
+ {
..
}

I am not sure how this will work because, unlike UndoLogAllocate, this
function doesn't set try_location initially. It will be set later by
UndoLogAdvance, which can easily go wrong because that doesn't include
UndoLogBlockHeaderSize.

5.
+UndoLogAdvance(UndoLogAllocContext *context, size_t size)
+{
+ context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location,
+ size);
+}

Here, you are using an UndoRecPtr whereas UndoLogOffsetPlusUsableBytes
expects an offset.

6.
UndoLogAllocateInRecovery()
{
..
+ /*
+ * At this stage we should have an undo log that can handle this
+ * allocation. If we don't, something is screwed up.
+ */
+ if (UndoLogOffsetPlusUsableBytes(slot->meta.unlogged.insert, size) > slot->meta.end)
+ elog(ERROR,
+ "cannot allocate %d bytes in undo log %d",
+ (int) size, slot->logno);
..
}

Similar to point 5, here you are using a pointer instead of an offset.

7.
UndoLogAllocateInRecovery()
{
..
+ /* We found a reference to a different (or first) undo log. */
+ slot = find_undo_log_slot(logno, false);
..
+ /* TODO: check locking against undo log slot recycling? */
..
}

I think it is better to have an Assert here that slot can't be NULL.
AFAICS, slot can't be NULL unless there is some bug. I also don't
understand this 'TODO' comment.

8.
+ {
+ {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the tablespace(s) to use for undo logs."),
+ NULL,
+ GUC_LIST_INPUT | GUC_LIST_QUOTE
+ },
+ &undo_tablespaces,
+ "",
+ check_undo_tablespaces, assign_undo_tablespaces, NULL
+ },

It seems you need to update variable_is_guc_list_quote for this variable.

9.
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+ xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+ XLogFlush(ptr);
+ }
..
}

Do we need this for the temporary/unlogged persistence levels? Similarly,
there is WAL logging in attach_undo_log; I can't understand why that would
be required for the temporary/unlogged persistence levels either.

10.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ oid = get_tablespace_oid(name, true);
+ if (oid == InvalidOid)
..
}

Do we need to check permissions to see if the current user is allowed to
create in this tablespace?

11.
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char *rawname;
+ List *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of string. */
+ rawname = pstrdup(undo_tablespaces);

I don't see rawname being used outside this function; isn't it better to
free it? I understand that this function won't be called frequently enough
to matter, but still, there is some theoretical danger if the user
continuously changes undo_tablespaces.

12.
+find_undo_log_slot(UndoLogNumber logno, bool locked)
{
..
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the negative cache entry bloat.
+ */
+ if (result == NULL)
+ {
..
}

Do we have any mechanism to clear this bloat, or will it stay till the end
of the session? If it is the latter, then I think it might be good to take
care of this TODO. This is not a blocker, but a good-to-have kind of thing.

13.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
}

What will happen if the transaction creating the undo log segment rolls
back? Do we want to have pendingDeletes stuff as we have for normal
relation files? This might also help in clearing the shared memory state
(undo log slots), if any.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
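Regarding point 8: variable_is_guc_list_quote() is the hard-coded list of
list-quoted GUC names, which looks roughly like this today, so the patch
would presumably just add its new variable (a sketch):

    bool
    variable_is_guc_list_quote(const char *name)
    {
        if (pg_strcasecmp(name, "temp_tablespaces") == 0 ||
            pg_strcasecmp(name, "session_preload_libraries") == 0 ||
            pg_strcasecmp(name, "shared_preload_libraries") == 0 ||
            pg_strcasecmp(name, "local_preload_libraries") == 0 ||
            pg_strcasecmp(name, "search_path") == 0 ||
            pg_strcasecmp(name, "undo_tablespaces") == 0)   /* new entry */
            return true;
        else
            return false;
    }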
Hi Thomas,

A few review comments on 0003-Add-undo-log-manager.patch:

1) Upgrade may fail

+/*
+ * Compute the new redo, and move the pg_undo file to match if necessary.
+ * Rather than renaming it, we'll create a new copy, so that a failure that
+ * occurs before the controlfile is rewritten won't be fatal.
+ */
+static void
+AdjustRedoLocation(const char *DataDir)
+{
+ uint64 old_redo = ControlFile.checkPointCopy.redo;
+ char old_pg_undo_path[MAXPGPATH];
+ char new_pg_undo_path[MAXPGPATH];
+ int old_fd;
+ int new_fd;
+ ssize_t nread;
+ ssize_t nwritten;
+ char buffer[1024];
+
+ /*
+ * Adjust fields as needed to force an empty XLOG starting at
+ * newXlogSegNo.
+ */

During the upgrade we delete the undo files present in the new cluster and
copy the undo files from the old cluster to the new cluster. Then we try to
readjust the redo location using pg_resetwal. While trying to readjust, we
get the control file details from the current cluster and try to open the
undo file it references. As the undo files from the current cluster have
been removed and replaced with the old cluster's contents, the file open
will fail. Attached is a patch to solve this problem.

2) Drop tablespace failure in a corner case

+ else
+ {
+ /*
+ * There is data we need in this undo log. We can't force it to
+ * be detached.
+ */
+ ok = false;
+ }
+ LWLockRelease(&slot->mutex);
+ /* If we failed, then give up now and report failure. */
+ if (!ok)
+ return false;

One thought: can we discard the current tablespace's entries and try not to
fail?

3) There will be a problem if deletion succeeds for some files but fails
for others; the meta contents holding the end details also need to be
applied accordingly, and we need to handle the case where further undo is
created after the rollback.

+ while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+ {
+ char segment_path[MAXPGPATH];
+
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+ snprintf(segment_path, sizeof(segment_path), "%s/%s",
+ undo_path, de->d_name);
+ if (unlink(segment_path) < 0)
+ elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+ }

4) In the error case, the "unlinked undo segment" message will still be
logged

+ while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+ {
+ char segment_path[MAXPGPATH];
+
+ if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+ continue;
+ snprintf(segment_path, sizeof(segment_path), "%s/%s",
+ undo_path, de->d_name);
+ elog(DEBUG1, "unlinked undo segment \"%s\"", segment_path);
+ if (unlink(segment_path) < 0)
+ elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+ }
+ FreeDir(dir);

The DEBUG1 message is emitted before the unlink, so in the error case the
success message will already have been logged.

5) UndoRecPtrIsValid can be used to check against InvalidUndoRecPtr

+ /*
+ * 'size' is expressed in usable non-header bytes. Figure out how far we
+ * have to move insert to create space for 'size' usable bytes, stepping
+ * over any intervening headers.
+ */ + Assert(slot->meta.unlogged.insert % BLCKSZ >= UndoLogBlockHeaderSize); + if (context->try_location != InvalidUndoRecPtr) Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com On Thu, Jul 25, 2019 at 9:30 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Jul 25, 2019 at 7:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jul 24, 2019 at 11:04 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > Hi, > > > > > > I have done some review of undolog patch series > > > and here are my comments: > > > 0003-Add-undo-log-manager.patch > > > > > > 1) As undo log is being created in tablespace, > > > if the tablespace is dropped later, will it have any impact? > > Thanks Amit, that clarifies the problem I was thinking. > I have another question regarding drop table space failure, but I > don't have a better solution for that problem. > Let me think more about it and discuss. > > > > Yes, it drops the undo logs present in tablespace being dropped. See > > DropUndoLogsInTablespace() in the same patch. > > > > > > > > 4) Should we add a readme file for undolog as it does a fair amount of work > > > and is core part of the undo system? > > > > Thanks Amit, I could get the details of readme. > > > > The Readme is already present in the patch series posted by Thomas. > > See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in > > email [1]. > > > > [1] - https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com > > > > -- > > With Regards, > > Amit Kapila. > > EnterpriseDB: http://www.enterprisedb.com > > -- > Regards, > Vignesh > EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 25, 2019 at 11:22 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have done some more review of the undolog patch series and here are my comments:
>
> Hi Amit,
>
> Thanks! There are a number of actionable changes in your review. I'll be
> posting a new patch set soon that will address most of your complaints
> individually. In this message I want to respond to one topic area, because
> the answer is long enough already:
>
> > 2.
> > allocate_empty_undo_segment()
> > {
> > ..
> > ..
> > /* Flush the contents of the file to disk before the next checkpoint. */
> > + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> > ..
> > }
> >
> > +void
> > +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> > +{
> > + char path[MAXPGPATH];
> > + FileTag tag;
> > +
> > + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> > +
> > + /* Try to send to the checkpointer, but if out of space, do it here. */
> > + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
> >
> >
> > The comment in allocate_empty_undo_segment indicates that the code wants to
> > flush before the checkpoint, but the actual function tries to register the
> > request with the checkpointer. Shouldn't this be similar to XLogFileInit,
> > where we use pg_fsync to flush the contents immediately? I guess that would
> > avoid what you have written in the comments in the same function (we just
> > want to make sure that the filesystem has allocated physical blocks for it
> > so that non-COW filesystems will report ENOSPC now rather than later when
> > space is needed). OTOH, I think it is performance-wise better to postpone
> > the work to the checkpointer. If we want to push this work to checkpointer,
> > then we might need to change comments or alternatively, we might want to
> > use bigger segment sizes to mitigate the performance effect.
>
> In an early version I was doing the fsync() immediately. While testing
> zheap, Mithun CY reported that whenever segments couldn't be recycled in
> the background, such as during a somewhat long-running transaction, he
> could measure ~6% of the time being spent waiting for fsync(), and
> throughput increased with bigger segments (and thus fewer files to
> fsync()). Passing the work off to the checkpointer is better not only
> because it's done in the background but also because there is a chance that
> the work can be consolidated with other sync requests, and perhaps even
> avoided completely if the file is discarded and unlinked before the next
> checkpoint.
>
> I'll update the comment to make it clearer.
>

Okay, that makes sense.

> > If my above understanding is correct and the reason to fsync
> > immediately is to reserve space now, then we also need to think
> > whether we are always safe in postponing the work? Basically, if this
> > means that it can fail when we are actually trying to write undo, then
> > it could be risky because we could be in the critical section at that
> > time. I am not sure about this point, rather it is just to discuss if
> > there are any impacts of postponing the fsync work.
>
> Here is my theory for why this arrangement is safe, and why it differs
> from what we're doing with WAL segments and regular relation files.
> First, let's review why those things work the way they do (as I
> understand it):
>
> 1. WAL's use of fdatasync():
>

I was referring to the function XLogFileInit, which doesn't appear to be
directly using fdatasync.

>
> 3. Separate size tracking: Another reason that regular relations
> write out zeroes at relation-extension time is that that's the only
..
>
> To summarise, we write zeroes so we can report ENOSPC errors as early
> as possible, but we defer and consolidate fsync() calls because the
> files' contents and names don't actually have to survive power loss
> until a checkpoint says they existed at that point in the WAL stream.
>
> Does this make sense?
>

Yes, this makes sense. However, I wonder if we need some special handling
for ENOSPC while writing to the file in this function
(allocate_empty_undo_segment): basically, unlink/remove the file if we fail
to write it because the disk is full, something similar to what we do in
XLogFileInit.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
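The XLogFileInit behaviour Amit refers to saves errno, removes the
partly-written file, and only then reports the error, roughly like this (a
sketch adapted to the undo segment case):

    if (write(fd, zeroes, nzeroes) != (ssize_t) nzeroes)
    {
        int     save_errno = errno;

        /* Remove the partially-written segment so a retry can succeed. */
        CloseTransientFile(fd);
        unlink(path);

        /* If write didn't set errno, assume the problem is disk full. */
        errno = save_errno ? save_errno : ENOSPC;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write to file \"%s\": %m", path)));
    }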
On Thu, Jul 25, 2019 at 11:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Hi Thomas,
>
> I have started reviewing 0003-Add-undo-log-manager. I haven't reviewed it
> fully yet, but I noticed some places where, instead of UndoRecPtr, you are
> directly using UndoLogOffset, which seem like bugs to me.
>
> 1.
> +UndoRecPtr
> +UndoLogAllocateInRecovery(UndoLogAllocContext *context,
> + TransactionId xid,
> + uint16 size,
> + bool *need_xact_header,
> + UndoRecPtr *last_xact_start,
> ....
> + *need_xact_header =
> + context->try_location == InvalidUndoRecPtr &&
> + slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
> + *last_xact_start = slot->meta.unlogged.last_xact_start;
>
> The output parameter last_xact_start is of type UndoRecPtr whereas
> slot->meta.unlogged.last_xact_start is of type UndoLogOffset; shouldn't we
> use MakeUndoRecPtr(logno, offset) here?
>
> 2.
> + slot = find_undo_log_slot(logno, false);
> + if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
> + {
> + *need_xact_header = false;
> + return try_offset;
> + }
>
> Here also you are directly returning try_offset instead of an UndoRecPtr.

+UndoLogRegister(UndoLogAllocContext *context, uint8 block_id, UndoLogNumber logno)
+{
+ int i;
+
+ for (i = 0; i < context->num_meta_data_images; ++i)
+ {
+ if (context->meta_data_images[i].logno == logno)
+ {
+ XLogRegisterBufData(block_id,
+ (char *) &context->meta_data_images[i].data,
+ sizeof(context->meta_data_images[i].data));
+ return;
+ }
+ }
+}

I have observed one more thing: you are registering the "meta_data_images"
with each buffer of that log. Suppose one undo record is spread across two
undo blocks; then both blocks will include a duplicate copy of this
metadata image if it is the first change after a checkpoint. It will not
cause any issue, but IMHO we can avoid including two copies of the same
meta_data_image.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
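One way to avoid the duplicate copy might be to remember which images have
already been registered for the record being inserted, e.g. with a flag
that is reset once the record is complete (the 'registered' field here is
hypothetical):

    for (i = 0; i < context->num_meta_data_images; ++i)
    {
        if (context->meta_data_images[i].logno == logno &&
            !context->meta_data_images[i].registered)   /* hypothetical */
        {
            XLogRegisterBufData(block_id,
                                (char *) &context->meta_data_images[i].data,
                                sizeof(context->meta_data_images[i].data));
            context->meta_data_images[i].registered = true;
            return;
        }
    }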
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Tue, 23 Jul 2019 at 08:48, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > --------------
> > >
> > > + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> > > I was thinking what happens if for some reason
> > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> > > entry will not be marked invalid, and so there will be no undo action
> > > carried out because I think the undo worker will exit. What happens
> > > next with this entry ?
> >
> > The same entry is present in two queues xid and size, so next time it
> > will be executed from the second queue based on it's priority in that
> > queue. However, if it fails again a second time in the same way, then
> > we will be in trouble because now the hash table has entry, but none
> > of the queues has entry, so none of the workers will attempt to
> > execute again. Also, when discard worker again tries to register it,
> > we won't allow adding the entry to queue thinking either some backend
> > is executing the same or it must be part of some queue.
> >
> > The one possibility to deal with this could be that we somehow allow
> > discard worker to register it again in the queue or we can do this in
> > critical section so that it allows system restart on error. However,
> > the main thing is it possible that InsertRequestIntoErrorUndoQueue
> > will fail unless there is some bug in the code? If so, we might want
> > to have an Assert for this rather than handling this condition.
>
> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering : Generally when the code errors out with such
> can't-happen elog() calls, worst thing that happens is that the
> transaction gets aborted. Whereas, in this case, the worst thing that
> could happen is : the undo action would never get executed, which
> means selects for this tuple will keep on accessing the undo log ?
>

Yeah. Also, in zheap we have a page-wise rollback facility, which rolls
back the transaction for a particular page (this gets triggered whenever we
try to update/delete a tuple that was last updated by an aborted xact, or
when we try to reuse the slot of an aborted xact), and for that we don't
need to traverse the undo chain.

> This does not sound like any data consistency issue, so we should be
> fine after all ?
>

I will see if we can have an Assert in the code for this.

> > --------------
>
> +if (UndoGetWork(false, false, &urinfo, NULL) &&
> + IsUndoWorkerAvailable())
> + UndoWorkerLaunch(urinfo);
>
> There is no lock acquired between IsUndoWorkerAvailable() and
> UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
> returns true, there is a small window where UndoWorkerLaunch() does
> not find any worker slot with in_use false, causing assertion failure
> for (worker != NULL).
> --------------
>

Yeah, I think UndoWorkerLaunch should be able to return without launching a
worker in such a case.

> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?)
message logged when we > can't register a bg worker ? > ------------- I don't think it was intentional. I think it will be good to have a warning here. I agree with all your other comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
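For the missing message, a warning like the one the logical replication
launcher emits in the same situation would fit (a sketch):

    if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
    {
        /* Failed to start worker, so warn and clean up the worker slot. */
        ereport(WARNING,
                (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                 errmsg("out of background worker slots"),
                 errhint("You might need to increase max_worker_processes.")));

        LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
        UndoWorkerCleanup(worker);
        LWLockRelease(UndoWorkerLock);

        return false;
    }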
On Wed, Jul 24, 2019 at 10:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Please find my review comments for
> 0013-Allow-foreground-transactions-to-perform-undo-action
>
> + * We can't postpone applying undo actions for subtransactions as the
> + * modifications made by aborted subtransaction must not be visible even if
> + * the main transaction commits.
> + */
> + if (IsSubTransaction())
> + return;
>
> I am not completely sure but is it possible that the outer function
> CommitTransactionCommand/AbortCurrentTransaction can avoid
> calling this function in the switch case based on the current state,
> so that under subtransaction this will never be called?

We can do that, and we can also have an additional check similar to
"if (!s->performUndoActions)", but that check would have to be added in all
places from which this function is called. I feel that will make the code
less readable in many places.

> + bool undo_req_pushed[UndoLogCategories]; /* undo request pushed
> + * to worker? */
> + bool performUndoActions;
> +
> struct TransactionStateData *parent; /* back link to parent */
>
> We must have some comments to explain how performUndoActions is used,
> where it's set. If it's explained somewhere else then we can
> give reference to that code.

I am planning to remove this variable in the next version and have an
explicit check as we have in UndoActionsRequired.

I agree with your other comments and will address them in the next version
of the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, 26 Jul 2019 at 12:25, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I agree with all your other comments.

Thanks for addressing the comments. Below is the continuation of my
comments on 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch:

+ * Perform rollback request. We need to connect to the database for first
+ * request and that is required because we access system tables while
for first request and that is required => for the first request. This is required
---------------

+UndoLauncherShmemSize(void)
+{
+ Size size;
+
+ /*
+ * Need the fixed struct and the array of LogicalRepWorker.
+ */
+ size = sizeof(UndoApplyCtxStruct);

The fixed structure size should be offsetof(UndoApplyCtxStruct, workers)
rather than sizeof(UndoApplyCtxStruct).
---------------

In UndoWorkerCleanup(), we set individual fields of the UndoApplyWorker
structure, whereas in UndoLauncherShmemInit(), for all the UndoApplyWorker
array elements, we just memset all the UndoApplyWorker structure elements
to 0. I think we should be consistent for the two cases. I guess we can
just memset to 0 as you do in UndoLauncherShmemInit(), but this will cause
worker->undo_worker_queue to be 0, i.e. XID_QUEUE, whereas in
UndoWorkerCleanup() it is set to -1. Is the -1 value essential, or can we
just set it to XID_QUEUE initially? Also, if we just use memset in
UndoWorkerCleanup(), we need to first save the generation field into a temp
variable and restore it after the memset(). That brought me to another
point: we already have a macro ResetUndoRequestInfo(), so
UndoWorkerCleanup() can just call ResetUndoRequestInfo().
------------

+ bool allow_peek;
+
+ CHECK_FOR_INTERRUPTS();
+
+ allow_peek = !TimestampDifferenceExceeds(started_at,

Some comments would be good about what allow_peek is used for. Something
like: "Arrange to prevent the worker from restarting quickly to switch
databases".
-----------------

+++ b/src/backend/access/undo/README.UndoProcessing
-----------------

+worker then start reading from one of the queues the requests for that
start=>starts
---------------

+work, it lingers for UNDO_WORKER_LINGER_MS (10s as default). This avoids
As per the latest definition, it is 20s. IMHO, there's no need to mention
the default value in the readme.
---------------

+++ b/src/backend/access/undo/discardworker.c
---------------

+ * portion of transaction that is overflowed into a separate log can be processed
This crosses the 80-column limit.

+#include "access/undodiscard.h"
+#include "access/discardworker.h"
Not in alphabetical order.

+++ b/src/backend/access/undo/undodiscard.c
---------------

+ next_insert = UndoLogGetNextInsertPtr(logno);

I checked the UndoLogGetNextInsertPtr() definition. It calls
find_undo_log_slot() to get back the slot from the logno. Why not make it
accept a slot instead of a logno? In all other places the slot->logno is
passed, so it would be convenient to just pass the slot there. And in
UndoDiscardOneLog(), first call find_undo_log_slot() just before the above
line (or call it at the end of the do-while loop). This way, during each of
the UndoLogGetNextInsertPtr() calls in undorequest.c, we will have one less
find_undo_log_slot() call. My suggestion is of course valid only under the
assumption that when you call UndoLogGetNextInsertPtr(fooslot->logno),
find_undo_log_slot() inside UndoLogGetNextInsertPtr() will return the same
fooslot.
-------------

In UndoDiscardOneLog(), there are at least two variable declarations that
can be moved inside the do-while loop: uur and next_insert.
I am not sure about the other variables viz : undofxid and latest_discardxid. Values of these variables in one iteration continue across to the second iteration. For latest_discardxid, it looks like we do want its value to be carried forward, but is it also true for undofxid ? + /* If we reach here, this means there is something to discard. */ + need_discard = true; + } while (true); Also, about need_discard; there is no place where need_discard is set to false. That means, from 2nd iteration onwards, it will never be false. So even if the code that explicitly sets need_discard to true does not get run, still the undolog will be discarded. Is this expected ? ------------- + if (request_rollback && dbid_exists(uur->uur_txn->urec_dbid)) + { + (void) RegisterRollbackReq(InvalidUndoRecPtr, + undo_recptr, + uur->uur_txn->urec_dbid, + uur->uur_fxid); + + pending_abort = true; + } We can get rid of request_rollback variable. Whatever the "if" block above is doing, do it in this upper condition : if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) Something like this : if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) { if (dbid_exists(uur->uur_txn->urec_dbid)) { (void) RegisterRollbackReq(InvalidUndoRecPtr, undo_recptr, uur->uur_txn->urec_dbid, uur->uur_fxid); pending_abort = true; } } ------------- + UndoRecordRelease(uur); + uur = NULL; + } ..... ..... + Assert(uur == NULL); + + /* If we reach here, this means there is something to discard. */ + need_discard = true; + } while (true); Looks like it is neither necessary to set uur to NULL, nor is it necessary to have the Assert(uur == NULL). At the start of each iteration uur is anyway assigned a fresh value, which may or may not be NULL. ------------- + * over undo logs is complete, new undo can is allowed to be written in the new undo can is allowed => new undo is allowed + * hash table size. So before start allowing any new transaction to write the before start allowing => before allowing any new transactions to start writing the ------------- + /* Get the smallest of 'xid having pending undo' and 'oldestXmin' */ + oldestXidHavingUndo = RollbackHTGetOldestFullXid(oldestXidHavingUndo); + .... + .... + if (FullTransactionIdIsValid(oldestXidHavingUndo)) + pg_atomic_write_u64(&ProcGlobal->oldestFullXidHavingUnappliedUndo, + U64FromFullTransactionId(oldestXidHavingUndo)); Is it possible that the FullTransactionId returned by RollbackHTGetOldestFullXid() could be invalid ? If not, then the if condition above can be changed to an Assert(). ------------- + * If the log is already discarded, then we are done. It is important + * to first check this to ensure that tablespace containing this log + * doesn't get dropped concurrently. + */ + LWLockAcquire(&slot->mutex, LW_SHARED); + /* + * We don't have to worry about slot recycling and check the logno + * here, since we don't care about the identity of this slot, we're + * visiting all of them. I guess, it's accidental that the LWLockAcquire() call is *between* the two comments ? ----------- + if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED) + { + /* + * For the "shared" category, we only discard when the + * rm_undo_status callback tells us we can. + */ + status = RmgrTable[uur->uur_rmid].rm_undo_status(uur, &wait_xid); status variable could be declared in this block itself. ------------- Some variable declaration alignments and comments spacing need changes as per pgindent. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
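On the UndoLauncherShmemSize() point: the logical replication launcher
computes its size as the fixed header plus the worker array, so the
equivalent here might be (a sketch; max_undo_workers stands for whatever
constant or GUC actually sizes the workers array):

    Size
    UndoLauncherShmemSize(void)
    {
        Size    size;

        /* Fixed part of UndoApplyCtxStruct, up to the flexible array. */
        size = offsetof(UndoApplyCtxStruct, workers);
        /* Plus one UndoApplyWorker slot per launchable worker. */
        size = add_size(size, mul_size(max_undo_workers,
                                       sizeof(UndoApplyWorker)));
        return size;
    }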
Hi,

On 2019-07-25 17:51:33 +1200, Thomas Munro wrote:
> 1. WAL's use of fdatasync(): The reason we fill and then fsync()
> newly created WAL files up front is because we want to make sure the
> blocks are definitely on disk. The comment doesn't spell out exactly
> why the author considered later fdatasync() calls to be insufficient,
> but they were: it was many years after commit 33cc5d8a4d0d that Linux
> ext3/4 filesystems began flushing file size changes to disk in
> fdatasync()[1][2]. I don't know if its original behaviour was
> intentional or not. So, if you didn't use the bigger fsync() hammer
> on that OS, you might lose the end of a recently extended file in a
> power failure even though fdatasync() had returned success.
>
> By my reading of POSIX, that shouldn't be necessary on a conforming
> implementation of fdatasync(), and that was fixed years ago in Linux.
> I'm not proposing any changes there, and I'm not proposing to take
> advantage of that in the new code. I'm pointing out that that we
> don't have to worry about that for these undo segments, because they
> are already flushed with fsync(), not fdatasync().
>
> (To understand POSIX's descriptions of fsync() and fdatasync() you
> have to find the meanings of "Synchronized I/O Data Integrity
> Completion" and "Synchronized I/O File Integrity Completion" elsewhere
> in the spec. TL;DR: fdatasync() is only allowed to skip flushing
> attributes like the modified time, it's not allowed to skip flushing a
> file size change since that would interfere with retrieving the data.)

Note that there are very good performance reasons for trying to avoid metadata changes at e.g. commit time. They're commonly journaled at the FS level, which can add a good chunk of IO and synchronization to operations that we commonly want to be as fast as possible. Basically you often at least double the amount of synchronous writes. And for the potential future where we use async direct IO, writes that change the file size take considerably slower codepaths, and add a lot of synchronization.

I suspect that's much more likely to be the reason for the preallocation in 33cc5d8a4d0d than avoiding an ext* bug (I doubt the bug you reference existed back then; IIUC it didn't apply to ext2, and ext3 was introduced after 33cc5d8a4d0d).

> 2. Time of reservation: Although they don't call fsync(), regular
> relations and these new undo files still write zeroes up front
> (respectively, for a new block and for a new segment). One reason for
> that is that most popular filesystems reserve space at write time, so
> you'll get ENOSPC when trying to allocate undo space, and that's a
> non-fatal ERROR. If we deferred until writing back buffer contents,
> we might get file holes, and deferred ENOSPC is much harder to report
> to users and for users to deal with.

FWIW, I don't quite buy the bit about holes - we could zero the hole at that time (and not be worse off than today, except that it might be done by somebody that didn't cause the extension), or even better just look up the buffers between the FS end of the relation and the block currently written, and write them out in order. The point that deferred ENOSPC is harder to report to users is obviously true regardless of that.

> BTW we could probably use posix_fallocate() instead of writing zeroes;
> I think Andres mentioned that recently. I see also that someone tried
> that for WAL and it got reverted back in 2013 (commit
> b1892aaeaaf34d8d1637221fc1cbda82ac3fcd71, I didn't try to hunt down
> the discussion).
IIRC the problem from back then was that while the space is reserved on the FS level, the actual blocks don't contain zeroes at that time. Which means that a) small writes need to write more, because the surrounding data also needs to be zeroed (annoying but not terrible), and b) writes into the fallocated but not yet written range IIRC effectively cause metadata writes, because while the "allocated file ending" doesn't change anymore, the new "non-zero written to" file ending does need to be journaled to disk before an f[data]sync - otherwise you could end up with the old value after a crash, and would read spurious zeroes. That's quite bad.

Those don't necessarily apply to e.g. extending relations, as we don't granularly fsync them. Although even there the performance picture is mixed - it helps a lot in certain workloads, but there are others where it mildly regresses performance on ext4. Not sure why yet; possibly it's due to more heavyweight locking needed when later changing the "non-zero size", or it's the additional metadata changes. I suspect those would be mostly gone if we didn't write back blocks in random order under memory pressure.

Note that neither of those means that it's not a good idea to posix_fallocate() and *then* write zeroes, when initializing. For several filesystems that's more likely to result in more optimally sized filesystem extents, reducing fragmentation. And without an intervening f[data]sync, there's not much additional metadata journalling. Although that's less of an issue on some newer filesystems, IIRC (due to delayed allocation).

Greetings,

Andres Freund
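In code, the initialization pattern described here might look like the sketch below (assumptions: a HAVE_POSIX_FALLOCATE-style configure guard, pg_pwrite() for positioned writes, and abbreviated error handling):

	/*
	 * Sketch: reserve the whole segment first so the filesystem can choose
	 * large extents, then overwrite it with real zeroes so no block is left
	 * merely reserved-but-unwritten.
	 */
	static void
	allocate_zeroed_segment(int fd, off_t segment_size)
	{
		char		zeroes[8192] = {0};
		off_t		offset;

	#ifdef HAVE_POSIX_FALLOCATE
		/* posix_fallocate() returns an error number rather than setting errno */
		if ((errno = posix_fallocate(fd, 0, segment_size)) != 0)
			elog(ERROR, "could not reserve segment space: %m");
	#endif

		/*
		 * Writing zeroes now avoids journaled "written-up-to" metadata
		 * updates at the first real write to each block later.
		 */
		for (offset = 0; offset < segment_size; offset += sizeof(zeroes))
		{
			if (pg_pwrite(fd, zeroes, sizeof(zeroes), offset) != (ssize_t) sizeof(zeroes))
				elog(ERROR, "could not zero segment: %m");
		}
	}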
On Sat, Jul 27, 2019 at 2:27 PM Andres Freund <andres@anarazel.de> wrote: > Note that neither of those mean that it's not a good idea to > posix_fallocate() and *then* write zeroes, when initializing. For > several filesystems that's more likely to result in more optimally sized > filesystem extents, reducing fragmentation. And without an intervening > f[data]sync, there's not much additional metadata journalling. Although > that's less of an issue on some newer filesystems, IIRC (due to delayed > allocation). Interesting. One way to bring back posix_fallocate() without upsetting people on some filesystem out there would be to turn the new wal_init_zero GUC into a choice: write (current default, and current behaviour for 'on'), pwrite_hole (write just the final byte, current behaviour for 'off'), posix_fallocate (like that 2013 patch that was reverted) and posix_fallocate_and_write (do both as you said, to try to solve that problem you mentioned that led to the revert). I suppose there'd be a parallel GUC undo_init_zero. Or some more general GUC for any fixed-sized preallocated files like that (for example if someone were to decide to do the same for SLRU files instead of growing them block-by-block), called something like file_init_zero. -- Thomas Munro https://enterprisedb.com
Hi,

On 2019-06-26 01:29:57 +0530, Amit Kapila wrote:
> From 67845a7afa675e973bd0ea9481072effa1eb219d Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilipkumar@localhost.localdomain>
> Date: Wed, 24 Apr 2019 14:36:28 +0530
> Subject: [PATCH 05/14] Add prefetch support for the undo log
>
> Add prefetching function for undo smgr and also provide mechanism
> to prefetch without relcache.

> +#ifdef USE_PREFETCH
> /*
> - * PrefetchBuffer -- initiate asynchronous read of a block of a relation
> + * PrefetchBufferGuts -- Guts of prefetching a buffer.
>  * No-op if prefetching isn't compiled in.

This isn't true for this function, as you've defined it?

> diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c
> index 2aa4952..14ccc52 100644
> --- a/src/backend/storage/smgr/undofile.c
> +++ b/src/backend/storage/smgr/undofile.c
> @@ -117,7 +117,18 @@ undofile_extend(SMgrRelation reln, ForkNumber forknum,
>  void
>  undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
>  {
> -	elog(ERROR, "undofile_prefetch is not supported");
> +#ifdef USE_PREFETCH
> +	File		file;
> +	off_t		seekpos;
> +
> +	Assert(forknum == MAIN_FORKNUM);
> +	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
> +	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
> +
> +	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
> +
> +	(void) FilePrefetch(file, seekpos, BLCKSZ, WAIT_EVENT_UNDO_FILE_PREFETCH);
> +#endif							/* USE_PREFETCH */
>  }

This looks like it should be part of the commit that introduces undofile_prefetch(), rather than separately? Afaics there's no reason to have it in this commit.

> From 7206c40e4cee3391c537cdb22c854889bb417d0e Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Wed, 6 Mar 2019 16:46:04 +1300
> Subject: [PATCH 03/14] Add undo log manager.

> +/*
> + * If the caller doesn't know the the block_id, but does know the RelFileNode,
> + * forknum and block number, then we try to find it.
> + */
> +XLogRedoAction
> +XLogReadBufferForRedoBlock(XLogReaderState *record,
> +						   SmgrId smgrid,
> +						   RelFileNode rnode,
> +						   ForkNumber forknum,
> +						   BlockNumber blockno,
> +						   ReadBufferMode mode,
> +						   bool get_cleanup_lock,
> +						   Buffer *buf)

I find that a somewhat odd function comment. Nor does the function name tell me much. A buffer is always block sized. And you pass in a block number.

> @@ -347,7 +409,8 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
>  	 * Make sure that if the block is marked with WILL_INIT, the caller is
>  	 * going to initialize it.  And vice versa.
>  	 */
> -	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
> +	zeromode = (mode == RBM_ZERO || mode == RBM_ZERO_AND_LOCK ||
> +				mode == RBM_ZERO_AND_CLEANUP_LOCK);
>  	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
>  	if (willinit && !zeromode)
>  		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");

> @@ -463,7 +526,7 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
>  	{
>  		/* page exists in file */
>  		buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum, blkno,
> -										   mode, NULL);
> +										   mode, NULL, RELPERSISTENCE_PERMANENT);
>  	}
>  	else
>  	{
> @@ -488,7 +551,8 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
>  			ReleaseBuffer(buffer);
>  		}
>  		buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum,
> -										   P_NEW, mode, NULL);
> +										   P_NEW, mode, NULL,
> +										   RELPERSISTENCE_PERMANENT);
>  	}
>  	while (BufferGetBlockNumber(buffer) < blkno);
>  	/* Handle the corner case that P_NEW returns non-consecutive pages */
> @@ -498,7 +562,8 @@ XLogReadBufferExtended(SmgrId smgrid, RelFileNode rnode, ForkNumber forknum,
>  			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>  			ReleaseBuffer(buffer);
>  			buffer = ReadBufferWithoutRelcache(smgrid, rnode, forknum, blkno,
> -											   mode, NULL);
> +											   mode, NULL,
> +											   RELPERSISTENCE_PERMANENT);
>  		}
>  	}

Not this patch's fault, but it strikes me as a bad idea to just hardcode RELPERSISTENCE_PERMANENT. E.g. it can totally make sense to WAL log some records for an unlogged table, e.g. to create the init fork.

> +/*
> + * Main control structure for undo log management in shared memory.
> + * UndoLogSlot objects are arranged in a fixed-size array, with no particular
> + * ordering.
> + */
> +typedef struct UndoLogSharedData
> +{
> +	UndoLogNumber free_lists[UndoPersistenceLevels];
> +	UndoLogNumber low_logno;
> +	UndoLogNumber next_logno;
> +	UndoLogNumber nslots;
> +	UndoLogSlot slots[FLEXIBLE_ARRAY_MEMBER];
> +} UndoLogSharedData;

Would be good to document at least low_logno - at least to me it's not obvious what that means by name. Also, some higher-level comments about what the shared memory layout is wouldn't hurt.

> +/*
> + * How many undo logs can be active at a time?  This creates a theoretical
> + * maximum amount of undo data that can exist, but if we set it to a multiple
> + * of the maximum number of backends it will be a very high limit.
> + * Alternative designs involving demand paging or dynamic shared memory could
> + * remove this limit but would be complicated.
> + */
> +static inline size_t
> +UndoLogNumSlots(void)
> +{
> +	return MaxBackends * 4;
> +}

I'd put this factor in a macro (or a named integer constant). It's a) nice to have all such numbers defined in one place, and b) it makes it easier to understand where the four comes from.

> +/*
> + * Initialize the undo log subsystem.  Called in each backend.
> + */
> +void
> +UndoLogShmemInit(void)
> +{
> +	bool		found;
> +
> +	UndoLogShared = (UndoLogSharedData *)
> +		ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);
> +
> +	/* The postmaster initialized the shared memory state. */
> +	if (!IsUnderPostmaster)
> +	{
> +		int			i;
> +
> +		Assert(!found);

I don't quite understand putting this under IsUnderPostmaster, rather than found (and then potentially having an IsUnderPostmaster assert). I know that a few other places do it this way too.

> +/*
> + * Iterate through the set of currently active logs.  Pass in NULL to get the
> + * first undo log.

Not a fan of APIs like this.
Harder to understand at callsites.

> NULL indicates the end of the set of logs.

+ "A return value of"? Right now this sounds a bit like it's referencing the NULL argument.

>  The caller
> + * must lock the returned log before accessing its members, and must skip if
> + * logno is not valid.
> + */
> +UndoLogSlot *
> +UndoLogNextSlot(UndoLogSlot *slot)
> +{

> +/*
> + * Create a new empty segment file on disk for the byte starting at 'end'.
> + */
> +static void
> +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
> +							UndoLogOffset end)
> +{
> +	struct stat stat_buffer;
> +	off_t		size;
> +	char		path[MAXPGPATH];
> +	void	   *zeroes;
> +	size_t		nzeroes = 8192;
> +	int			fd;
> +
> +	UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path);
> +
> +	/*
> +	 * Create and fully allocate a new file.  If we crashed and recovered
> +	 * then the file might already exist, so use flags that tolerate that.
> +	 * It's also possible that it exists but is too short, in which case
> +	 * we'll write the rest.  We don't really care what's in the file, we
> +	 * just want to make sure that the filesystem has allocated physical
> +	 * blocks for it, so that non-COW filesystems will report ENOSPC now
> +	 * rather than later when the space is needed and we'll avoid creating
> +	 * files with holes.
> +	 */
> +	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);

As I said somewhere nearby, I think it might make sense to optionally first fallocate, and then zero. Is there a good reason to not just use O_TRUNC here, and enter the zeroing path without a stat? We could potentially end up with holes this way, I think (if the writes didn't make it to disk, but the metadata operation did). Also, it seems better to just start from a consistently zeroed-out block, rather than sometimes having old data in there.

> +	/*
> +	 * If we're not in recovery, we need to WAL-log the creation of the new
> +	 * file(s).  We do that after the above filesystem modifications, in
> +	 * violation of the data-before-WAL rule as exempted by
> +	 * src/backend/access/transam/README.  This means that it's possible for
> +	 * us to crash having made some or all of the filesystem changes but
> +	 * before WAL logging, but in that case we'll eventually try to create the
> +	 * same segment(s) again, which is tolerated.
> +	 */

Perhaps explain *why* the rule is violated here?

> +/*
> + * Advance the insertion pointer in this context by 'size' usable (non-header)
> + * bytes.  This is the next place we'll try to allocate a record, if it fits.
> + * This is not committed to shared memory until after we've WAL-logged the
> + * record and UndoLogAdvanceFinal() is called.
> + */
> +void
> +UndoLogAdvance(UndoLogAllocContext *context, size_t size)
> +{
> +	context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location,
> +														 size);
> +}
> +
> +/*
> + * Advance the insertion pointer to 'size' usable (non-header) bytes past
> + * insertion_point.
> + */
> +void
> +UndoLogAdvanceFinal(UndoRecPtr insertion_point, size_t size)

I think this comment should explain how this differs from UndoLogAdvance().

> +	/*
> +	 * We acquire UndoLogLock to prevent any undo logs from being created or
> +	 * discarded while we build a snapshot of them.  This isn't expected to
> +	 * take long on a healthy system because the number of active logs should
> +	 * be around the number of backends.  Holding this lock won't prevent
> +	 * concurrent access to the undo log, except when segments need to be
> +	 * added or removed.
> +	 */
> +	LWLockAcquire(UndoLogLock, LW_SHARED);

s/the undo log/undo logs/?

> +	/* Dump into a file under pg_undo. */
> +	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
> +			 checkPointRedo);
> +	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
> +	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
> +	if (fd < 0)
> +		ereport(ERROR,
> +				(errcode_for_file_access(),
> +				 errmsg("could not create file \"%s\": %m", path)));
> +
> +	/* Compute header checksum. */
> +	INIT_CRC32C(crc);
> +	COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno));
> +	COMP_CRC32C(crc, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno));
> +	COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
> +	FIN_CRC32C(crc);
> +
> +	/* Write out the number of active logs + crc. */
> +	if ((write(fd, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno)) ||
> +		(write(fd, &UndoLogShared->next_logno, sizeof(UndoLogShared->next_logno)) != sizeof(UndoLogShared->next_logno)) ||
> +		(write(fd, &num_logs, sizeof(num_logs)) != sizeof(num_logs)) ||
> +		(write(fd, &crc, sizeof(crc)) != sizeof(crc)))
> +		ereport(ERROR,
> +				(errcode_for_file_access(),
> +				 errmsg("could not write to file \"%s\": %m", path)));

I'd prefix it with some magic value. It provides a way to do version bumps if really necessary (or just provide an explicit version), makes it easier to distinguish proper checksum failures from zeroed-out files, and helps identify the files after FS corruption.

> +	/* Write out the meta data for all active undo logs. */
> +	data = (char *) serialized;
> +	INIT_CRC32C(crc);
> +	serialized_size = num_logs * sizeof(UndoLogMetaData);
> +	while (serialized_size > 0)
> +	{
> +		ssize_t		written;
> +
> +		written = write(fd, data, serialized_size);
> +		if (written < 0)
> +			ereport(ERROR,
> +					(errcode_for_file_access(),
> +					 errmsg("could not write to file \"%s\": %m", path)));
> +		COMP_CRC32C(crc, data, written);
> +		serialized_size -= written;
> +		data += written;
> +	}
> +	FIN_CRC32C(crc);
> +
> +	if (write(fd, &crc, sizeof(crc)) != sizeof(crc))
> +		ereport(ERROR,
> +				(errcode_for_file_access(),
> +				 errmsg("could not write to file \"%s\": %m", path)));

The number of small writes here makes me wonder if this shouldn't either use fopen()/fwrite() or a manual buffer.

> +	/* Flush file and directory entry. */
> +	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
> +	pg_fsync(fd);
> +	if (CloseTransientFile(fd) < 0)
> +		ereport(data_sync_elevel(ERROR),
> +				(errcode_for_file_access(),
> +				 errmsg("could not close file \"%s\": %m", path)));
> +	fsync_fname("pg_undo", true);
> +	pgstat_report_wait_end();

Is there a risk of crashing during this and leaving an incomplete file in place? Presumably not, because the checkpoint wouldn't exist?

> +/*
> + * Find the UndoLogSlot object for a given log number.
> + *
> + * The caller may or may not already hold UndoLogLock, and should indicate
> + * this by passing 'locked'.  We'll acquire it in the slow path if necessary.
> + * If it is not held by the caller, the caller must deal with the possibility
> + * that the returned UndoLogSlot no longer contains the requested logno by the
> + * time it is accessed.
> + *
> + * To do that, one of the following approaches must be taken by the calling
> + * code:
> + *
> + * 1. If the calling code knows that it is attached to this lock or is the

*this "log", not "lock", right?
> +static void
> +attach_undo_log(UndoPersistence persistence, Oid tablespace)
> +{
> +	UndoLogSlot *slot = NULL;
> +	UndoLogNumber logno;
> +	UndoLogNumber *place;
> +
> +	Assert(!InRecovery);
> +	Assert(CurrentSession->attached_undo_slots[persistence] == NULL);
> +
> +	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
> +
> +	/*
> +	 * For now we have a simple linked list of unattached undo logs for each
> +	 * persistence level.  We'll grovel though it to find something for the
> +	 * tablespace you asked for.  If you're not using multiple tablespaces
> +	 * it'll be able to pop one off the front.  We might need a hash table
> +	 * keyed by tablespace if this simple scheme turns out to be too slow when
> +	 * using many tablespaces and many undo logs, but that seems like an
> +	 * unusual use case not worth optimizing for.
> +	 */
> +	place = &UndoLogShared->free_lists[persistence];
> +	while (*place != InvalidUndoLogNumber)
> +	{
> +		UndoLogSlot *candidate = find_undo_log_slot(*place, true);
> +
> +		/*
> +		 * There should never be an undo log on the freelist that has been
> +		 * entirely discarded, or hasn't been created yet.  The persistence
> +		 * level should match the freelist.
> +		 */
> +		if (unlikely(candidate == NULL))
> +			elog(ERROR,
> +				 "corrupted undo log freelist, no such undo log %u", *place);
> +		if (unlikely(candidate->meta.persistence != persistence))
> +			elog(ERROR,
> +				 "corrupted undo log freelist, undo log %u with persistence %d found on freelist %d",
> +				 *place, candidate->meta.persistence, persistence);
> +
> +		if (candidate->meta.tablespace == tablespace)
> +		{
> +			logno = *place;
> +			slot = candidate;
> +			*place = candidate->next_free;
> +			break;
> +		}
> +		place = &candidate->next_free;
> +	}

I'd replace the linked list with ilist.h ones.

< more tomorrow >

Greetings,

Andres Freund
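Two quick sketches of the suggestions above. First, a magic value for the pg_undo checkpoint files; the names and the constant here are invented, and the header would be written first and included in the CRC:

	#define UNDO_CHECKPOINT_MAGIC	0x55304C47	/* arbitrary illustration value */
	#define UNDO_CHECKPOINT_VERSION	1

	typedef struct UndoCheckpointHeader
	{
		uint32		magic;		/* tells real files from zeroed/garbage ones */
		uint32		version;	/* explicit version makes format bumps cheap */
	} UndoCheckpointHeader;

Second, what the freelist scan might look like using ilist.h, assuming an slist_node member (here called freelist_node) is added to UndoLogSlot; a dlist would work equally well:

	#include "lib/ilist.h"

	/* In UndoLogSharedData: one freelist per persistence level. */
	slist_head	free_lists[UndoPersistenceLevels];

	/* Scanning for a slot in the requested tablespace. */
	slist_mutable_iter iter;

	slist_foreach_modify(iter, &UndoLogShared->free_lists[persistence])
	{
		UndoLogSlot *candidate = slist_container(UndoLogSlot, freelist_node,
												 iter.cur);

		if (candidate->meta.tablespace == tablespace)
		{
			slot = candidate;
			slist_delete_current(&iter);
			break;
		}
	}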
On Sun, Jul 28, 2019 at 9:38 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Interesting. One way to bring back posix_fallocate() without > upsetting people on some filesystem out there would be to turn the new > wal_init_zero GUC into a choice: write (current default, and current > behaviour for 'on'), pwrite_hole (write just the final byte, current > behaviour for 'off'), posix_fallocate (like that 2013 patch that was > reverted) and posix_fallocate_and_write (do both as you said, to try > to solve that problem you mentioned that led to the revert). > > I suppose there'd be a parallel GUC undo_init_zero. Or some more > general GUC for any fixed-sized preallocated files like that (for > example if someone were to decide to do the same for SLRU files > instead of growing them block-by-block), called something like > file_init_zero. I think it's pretty sane to have a GUC for how we extend files, but to me it seems like overkill to have one for every separate kind of file. It's not theoretically impossible that you could have the data and WAL on separate partitions on separate mount points with, consequently, separate needs, and the data (including undo) could be split among multiple tablespaces each of which uses a different filesystem. Probably, the right design would be a per-tablespace storage option plus an overall default that is always used for WAL. However, that strikes me as a lot of complexity for a pretty marginal use case: most people have a favorite filesystem and stick with it. And all of that seems like something a bit separate from coming up with a good undo framework. Why doesn't undo just do this like we do it elsewhere, and leave the question of changing the way we do extend-and-zero for another thread? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 19, 2019 at 7:28 PM Peter Geoghegan <pg@bowt.ie> wrote: > If I'm not mistaken, you're tacitly assuming that you'll always be > using zheap, or something sufficiently similar to zheap. It'll > probably never be possible to UNDO changes to something like a GIN > index on a zheap table, because you can never do that with sensible > concurrency/deadlock behavior. I mean, essentially any well-designed framework intended for any sort of task whatsoever is going to have a design center where one can foresee that it will work well, and then as a result of working well for the thing for which it was designed, it will also work well for other things that are sufficiently similar. So, I think you're correct, but I also don't think that's really saying very much. The trick is to figure out whether and how the ideas you have could be generalized with reasonable effort to handle other cases, and that's easier with some projects than others. I think when it comes to UNDO, it's actually really hard. The system has some assumptions built into it which are probably required for good performance and reasonable complexity, and it's probably got other assumptions in it which are unnecessary and could be eliminated if we only realized that we were making those assumptions in the first place. The more involvement we get from people who aren't coming at this from the point of view of zheap, the more likely it is that we'll be able to find those assumptions and wipe them out before they get set in concrete. Unfortunately, we haven't had many takers so far -- thanks for chiming in. I don't really understand your comments about GIN. My simplistic understanding of GIN is that it's not very different from btree in this regard. Suppose we insert a row, and then the insert aborts; suppose also that the index wants to use UNDO. In the case of a btree index, we're going to go insert an index entry for the new row; upon abort, we should undo the index insertion by removing that index tuple or at least marking it dead. Unless a page split has happened, triggered either by the insertion itself or by subsequent activity, this puts the index in a state that is almost perfectly equivalent to where we were before the now-aborted transaction did any work. If a page split has occurred, trying to undo the index insertion is going to run into two problems. One, we probably can't undo the page split, so the index will be logically equivalent but not physically equivalent after we get rid of the new tuple. Two, if the page split happened after the insertion of the new tuple rather than at the same time, the index tuple may not be on the page where we left it. Possibly we can walk right (or left, or sideways, or diagonally at a 35 degree angle, my index-fu is not great here) and be sure of finding it, assuming the index is not corrupt. Now, my mental model of a GIN index is that you go find N>=0 index keys inside each value and do basically the same thing as you would for a btree index for each one of them. Therefore it seems to me, possibly stupidly, that you're going to have basically the same problems, except each problem will now potentially happen up to N times instead of up to 1 time. I assume here that in either case - GIN or btree - you would tentatively record where you left the tuple that now needs to be zapped and that you can jump to that place directly to try to zap it. 
Possibly those assumptions are bad and maybe that's where you're seeing a concurrency/deadlock problem; if so, a more detailed explanation would be very helpful. To me, based on my more limited knowledge of indexing, I'm not really seeing a concurrency/deadlock issue, but I do see that there's going to be a horrid efficiency problem if page splits are common.

Suppose for example that you bulk-load a bunch of rows into an indexed table in descending order according to the indexed column, with all the new values being larger than any existing values in that column. The insertion point basically doesn't change: you're always inserting after what was the original high value in the column, and that point is always on the same page, but that page is going to be repeatedly split, so that, at the end of the load, almost none of the newly-inserted rows are going to be on the page into which they were originally inserted. Now if you abort, you're going to either have to walk right a long way from the original insertion point to find each tuple, or re-find each tuple by traversing from the root of the tree instead of remembering where you left it. Doing the first for N tuples is O(N^2), and doing the second is O(N*H) where H is the height of the btree. The latter is almost like O(N) given the high fanout of a btree, but with a much higher constant factor than the remember-where-you-put-it strategy would be in cases where no splits have occurred. Neither seems very good. This seems to be a very general problem with making undo and indexes work nicely together: almost any index type has to sometimes move tuples around to different pages, which makes finding them a lot more expensive than re-finding a heap tuple.

I think that most of the above is a bit of a diversion from the original topic of the thread. I think I see the connection you're making between the two topics: the more likely undo application is to fail, the more worrying a hard limit is, and deadlocks are a way for undo application to fail, and if that's likely to be common when undo is applied to indexes, then undo failure will be common and a hard limit is bad. However, I think the solution to that problem is page-at-a-time undo: if a foreground process needs to modify a page with pending undo, and if the modification it wants to make can't be done sensibly unless the undo is applied first, it should be prepared to apply that undo itself - just for that page - rather than wait for somebody else to get it done. That's important not only for deadlock avoidance - though deadlock avoidance is certainly a legitimate concern - but also because the change might be part of some gigantic rollback that's going to take an hour, and waiting for the undo to hit all the other pages before it gets to this one will make users very sad.

Assuming page-at-a-time undo is possible for all undo-using AMs, which I believe to be more or less a requirement if you want to have something production-grade, I don't really see what common deadlock scenario could exist. Either we're talking about LWLocks -- in which case we've got a bug in the code -- or we're talking about heavyweight locks -- in which case we're dealing with a rare scenario where undo work is piling up behind strategically-acquired AELs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
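In rough pseudo-C, the page-at-a-time idea might look like the fragment below; every undo-related helper named here is hypothetical, used only to pin down the control flow:

	Buffer		buffer;
	Page		page;

	/* A foreground writer about to modify a page that has pending undo. */
	buffer = ReadBuffer(relation, blkno);
	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
	page = BufferGetPage(buffer);

	/*
	 * If an aborted transaction left unapplied undo on this page and our
	 * modification can't proceed sensibly until it's applied, apply it
	 * ourselves -- for this page only -- rather than waiting for a
	 * background worker to finish a possibly enormous rollback.
	 */
	while (PageHasUnappliedUndo(page))			/* hypothetical */
		ApplyUndoForPage(relation, buffer);		/* hypothetical */

	/* ... proceed with the original modification ... */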
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 22, 2019 at 4:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I had a similar thought: you might regret that choice if you were
> wanting to implement an AM with lock table-based concurrency control
> (meaning that there are lock ordering concerns for row and page locks,
> for DML statements, not just DDL).  That seemed a bit too far fetched
> to mention before, but are you saying the same sort of concerns might
> come up with indexes that support true undo (as opposed to indexes
> that still need VACUUM)?

Yes. It doesn't really make any difference with B-Trees, because the locks there are very similar to row locks (you still need forwarding UNDO metadata in index pages, probably for checking the visibility of index tuples that have their ghost bit set). But when you need to undo changes to an index with coarse-grained index tuples (e.g. in a GIN index), the transaction needs to roll back the index tuple as a whole, necessitating that locks be held.

Heap TIDs need to be completely stable to avoid a VACUUM-like mechanism -- you cannot just create a new HOT chain. You even have to be willing to store a single heap row across two heap pages in extreme cases where an UPDATE makes it impossible to fit a new row on the same heap page as the original -- this is called row forwarding. Once heap TIDs are guaranteed to be associated with a logical row for the lifetime of that row, and once you lock index entries, you're always able to cleanly undo the changes in the index (i.e. remove new tuples on abort). Then you have indexes that don't need VACUUMing, and that have cheap index-only scans.

> For comparison, ARIES[1] has no-deadlock rollbacks as a basic property
> and reacquires locks during restart before new transactions are allow
> to execute.  In its model, the locks in question can be on things like
> rows and pages.  We don't even use our lock table for those (except
> for non-blocking SIREAD locks, irrelevant here).

Right. ARIES has plenty to say about concurrency control, even though we often think of it as something that is only concerned with crash recovery. The undo phase is tied to how concurrency control works in general in ARIES. There is something called ARIES/KVL, and something else called ARIES/IM [1].

> After crash
> recovery, if zheap encounters a row with pending rollback from an
> aborted transaction, as usual it either needs to read an older version
> from an undo log (for reads) or help execute the rollback before
> updating (for writes).  That only requires page-at-a-time LWLocks
> ("latching"), so it's deadlock-free.  The only deadlock risk comes
> from the need to acquire heavyweight locks on relations which
> typically only conflict when you run DDL, so yeah, it's tempting to
> worry a lot less about those than the fine grained lock traffic from
> DML statements that DB2 and others have to deal with.

I think that DB2 index deletes are synchronous, and immediately remove space from a leaf page. Rollbacks will re-insert the deleted tuple. Systems that use a limited form of MVCC based on 2PL [2] set a ghost bit instead of physically removing the tuple immediately. But I don't think that that's actually very different to the DB2 classic 2PL approach, since there is forwarding undo information that makes it possible to reclaim tuples with the ghost bit set at the earliest possible opportunity. And because you can immediately do an in-place update of an index tuple's heap TID in the case of unique indexes, which can be optimized as a special case.
Queries like "UPDATE tab set tab_pk = tab_pk + 1" work per the SQL standard (no duplicate violation), and don't even bloat the index, because the changes in the index can happen almost entirely in-place. > I might as well put the quote marks on now: "Perhaps we could > implement A later." I don't claim to have any real answers here. I don't claim to understand how much of a problem this is. [1] https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf [2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf -- See "6.7 Standard Practice" -- Peter Geoghegan
On Tue, Jul 23, 2019 at 10:42 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I think, even though there might not be a correctness issue with the
> current code as it stands, we should still use a local variable.
> Updating MyUndoWorker is a big side-effect, which the caller is not
> supposed to be aware of, because all that function should do is just
> get the slot info.

Absolutely right. It's just routine good practice to avoid using global variables when there is no compelling reason to do otherwise. The reason you state here is one of several good ones.

> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering : Generally when the code errors out with such
> can't-happen elog() calls, worst thing that happens is that the
> transaction gets aborted. Whereas, in this case, the worst thing that
> could happen is : the undo action would never get executed, which
> means selects for this tuple will keep on accessing the undo log ?
> This does not sound like any data consistency issue, so we should be
> fine after all ?

I don't think so. Every XID present in undo has to be something we can look up in CLOG to figure out which transactions are aborted and which transactions are committed, so that we know which transactions need undo. If we forget to undo the transaction, we can't discard it, which means we can't advance the CLOG transaction horizon, which means we'll eventually start failing to assign XIDs, leading to a refusal of all write transactions. Oops.

More generally, it's not OK for the generic undo layer to make assumptions about whether the operations performed by the undo handlers are essential or not. We don't want to impose a design constraint that undo can only be used for things that are not actually critical, because that will make it hard to write AMs that use it. And there's no reason to live with such a design constraint anyway, because, as noted above, CLOG truncation requires it.

More generally still, some can't-happen situations should be checked via Assert() and others via elog(). For example, consider some code that looks up a syscache tuple and pulls data from the returned tuple. If the code that handles DDL is written in such a way that the tuple should always exist, then this is a can't-happen situation, but generally the code checks this via elog(), not Assert(), because it could also happen due to the catalog contents being corrupted. If Assert() were used, the checks would not run in production builds, and a corrupt catalog would lead to a seg fault. An elog() is much friendlier. As a general principle, when a certain thing ought to always be true, but it being true depends on a whole lot of assumptions elsewhere in the code, and especially if it also depends on assumptions like "the database is not corrupted," I think elog() is preferable. Assert() is better for things that are more localized and that really can't go wrong for any reason other than a bug. In this case, I think I would tend towards elog(PANIC), but it's arguable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
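For illustration, the syscache pattern described above is the standard PostgreSQL idiom:

	/*
	 * elog(), not Assert(): the tuple "can't" be missing, but that relies
	 * on non-local assumptions (how DDL behaves, catalogs not being
	 * corrupted), so a clean error beats a segfault in production builds.
	 */
	tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for relation %u", relid);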
On Mon, Jul 29, 2019 at 2:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Yes. It doesn't really make any difference with B-Trees, because the > locks there are very similar to row locks (you still need forwarding > UNDO metadata in index pages, probably for checking the visibility of > index tuples that have their ghost bit set). But when you need to undo > changes to an indexes with coarse grained index tuples (e.g. in a GIN > index), the transaction needs to roll back the index tuple as a whole, > necessitating that locks be held. Heap TIDs need to be completely > stable to avoid a VACUUM-like mechanism -- you cannot just create a > new HOT chain. You even have to be willing to store a single heap row > across two heap pages in extreme cases where an UPDATE makes it > impossible to fit a new row on the same heap page as the original -- > this is called row forwarding. I find this hard to believe, because an UPDATE can always be broken up into a DELETE and an INSERT. If that were to be done, you would not have a stable heap TID and you would have a "new HOT chain," or your AM's equivalent of that concept. So if we can't handle an UPDATE that changes the TID, then we also can't handle a DELETE + INSERT. But surely handling that case is a hard requirement for any AM. Sorry if I'm being dense here, but I feel like you're making some assumptions that I'm not quite following. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 29, 2019 at 9:35 AM Robert Haas <robertmhaas@gmail.com> wrote: > I mean, essentially any well-designed framework intended for any sort > of task whatsoever is going to have a design center where one can > foresee that it will work well, and then as a result of working well > for the thing for which it was designed, it will also work well for > other things that are sufficiently similar. So, I think you're > correct, but I also don't think that's really saying very much. I agree that it's quite unclear how important this is. I don't necessarily think it matters if zheap doesn't do that well with GIN indexes. I think it's probably going to be useful to imagine how GIN indexing might work for zheap because it clarifies the strengths and weaknesses of your design. It's perfectly fine for there to be weaknesses, provided that they are well understood. > The > trick is to figure out whether and how the ideas you have could be > generalized with reasonable effort to handle other cases, and that's > easier with some projects than others. I think when it comes to UNDO, > it's actually really hard. I agree. > Unfortunately, we haven't had many takers so far -- thanks for chiming > in. I don't have the ability to express my general concerns here in a very crisp way. This is complicated stuff. Thanks for tolerating the hand-wavy nature of my feedback about this. > I don't really understand your comments about GIN. My simplistic > understanding of GIN is that it's not very different from btree in > this regard. GIN is quite similar to btree from a Postgres point of view -- GIN is simply a btree that is good at storing duplicates (and has higher level infrastructure to make things like FTS work). So I'd say that your understanding is fairly complete, at least as far as traditional Postgres goes. But if we imagine a system in which we have to roll back in indexes, it's quite a different story. See my remarks to Thomas just now about that. > Suppose we insert a row, and then the insert aborts; > suppose also that the index wants to use UNDO. In the case of a btree > index, we're going to go insert an index entry for the new row; upon > abort, we should undo the index insertion by removing that index tuple > or at least marking it dead. Unless a page split has happened, > triggered either by the insertion itself or by subsequent activity, > this puts the index in a state that is almost perfectly equivalent to > where we were before the now-aborted transaction did any work. If a > page split has occurred, trying to undo the index insertion is going > to run into two problems. One, we probably can't undo the page split, > so the index will be logically equivalent but not physically > equivalent after we get rid of the new tuple. Two, if the page split > happened after the insertion of the new tuple rather than at the same > time, the index tuple may not be on the page where we left it. Actually, page splits are the archetypal case where undo cannot restore the original physical state. In general, we cannot expect the undo process to reverse page splits. Undo might be able to merge the pages together, but it also might not be able to. It won't be terribly different to the situation with deletes where the transaction commits, most likely. Some other systems have something called "system transactions" for things like page splits. They don't need to have their commit record flushed synchronously, and occur in the foreground of the xact that needs to split the page. 
That way, rollback doesn't have to concern itself with rolling back things that are pretty much impossible to roll back, like page splits.

> Now, my mental model of a GIN index is that you go find N>=0 index
> keys inside each value and do basically the same thing as you would
> for a btree index for each one of them. Therefore it seems to me,
> possibly stupidly, that you're going to have basically the same
> problems, except each problem will now potentially happen up to N
> times instead of up to 1 time. I assume here that in either case -
> GIN or btree - you would tentatively record where you left the tuple
> that now needs to be zapped and that you can jump to that place
> directly to try to zap it. Possibly those assumptions are bad and
> maybe that's where you're seeing a concurrency/deadlock problem; if
> so, a more detailed explanation would be very helpful.

Imagine a world in which zheap cannot just create a new TID (or HOT chain) for the same logical tuple, which is something that I believe should be an important goal for zheap (again, see my remarks to Thomas). Simplicity for rollbacks in access methods like GIN demands that you lock the entire index tuple, which may point to hundreds of logical rows (or TIDs, since they have a 1:1 correspondence with logical rows in this imaginary world). Rolling back with more granular locking seems very hard for the same reason that rolling back a page split would be very hard -- you cannot possibly have enough bookkeeping information to make that work in a sane way in the face of concurrent insertions that may also commit or abort unpredictably. It seems necessary to bake concurrency control into rollback at the index access method level in order to get significant benefits from a design like zheap.

Now, maybe zheap should be permitted to not work particularly well with GIN, while teaching btree to take advantage of the common case where we can roll everything back, even in indexes (so zheap behaves much more like heapam when you have a GIN index, which is hopefully not that common). That could be a perfectly reasonable restriction. But ISTM that you need to make heap TIDs completely stable for the case that zheap is expected to excel at. You also need to teach nbtree to take advantage of this by rolling back if and when it's safe to do so (when we know that heap TIDs are stable for the indexed table).

In general, the only way that rolling back changes to indexes can work is by making heap TIDs completely stable. Any design for rollback in nbtree that allows there to be multiple entries for the same logical row in the index seems like a disaster to me. Are you really going to put forwarding information in the index that mirrors what has happened in the table?

> To me, based on my more limited knowledge of indexing, I'm not really
> seeing a concurrency/deadlock issue, but I do see that there's going
> to be a horrid efficiency problem if page splits are common.

I'm not worried about rolling back page splits. That seems to present us with exactly the same issues as rolling back in GIN indexes reliably (i.e. problems that are practically impossible to solve, or at least don't seem worth solving).

> This seems to be a very
> general problem with making undo and indexes work nicely together:
> almost any index type has to sometimes move tuple around to different
> pages, which makes finding them a lot more expensive than re-finding a
> heap tuple.

Right. That's why undo is totally logical in indexes.
And it's why you cannot expect to roll back page splits. > I think that most of the above is a bit of a diversion from the > original topic of the thread. I think I see the connection you're > making between the two topics: the more likely undo application is to > fail, the more worrying a hard limit is, and deadlocks are a way for > undo application to fail, and if that's likely to be common when undo > is applied to indexes, then undo failure will be common and a hard > limit is bad. This is an awkward thing to discuss, because it involves so many interrelated moving parts. And because I know that I could easily miss quite a bit about the zheap design. Forgive me if I've hijacked the thread. > However, I think the solution to that problem is > page-at-a-time undo: if foreground process needs to modify a page with > pending undo, and if the modification it wants to make can't be done > sensibly unless the undo is applied first, it should be prepared to > apply that undo itself - just for that page - rather than wait for > somebody else to get it done. That's important not only for deadlock > avoidance - though deadlock avoidance is certainly a legitimate > concern - but also because the change might be part of some gigantic > rollback that's going to take an hour, and waiting for the undo to hit > all the other pages before it gets to this one will make users very > sad. It's something that users in certain other systems (though certainly not all other systems) have had to live with for some time. SQL Server 2019 has something called "instantaneous transaction rollback", which seems to make SQL Server optionally behave a lot more like Postgres [1], apparently with many of the same disadvantages as Postgres. I agree that there is probably a middle way that more or less has the advantages of both approaches. I don't really know what that should look like, though. [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/06/p700-antonopoulos.pdf -- Peter Geoghegan
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 29, 2019 at 12:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I find this hard to believe, because an UPDATE can always be broken up
> into a DELETE and an INSERT.  If that were to be done, you would not
> have a stable heap TID and you would have a "new HOT chain," or your
> AM's equivalent of that concept.  So if we can't handle an UPDATE that
> changes the TID, then we also can't handle a DELETE + INSERT.  But
> surely handling that case is a hard requirement for any AM.

I'm not saying you can't handle it. But that necessitates "write amplification", in the sense that you must now create new index tuples even for indexes where the indexed columns were not logically altered. Isn't zheap supposed to fix that problem, at least in version 2 or version 3? I also think that stable heap TIDs make index-only scans a lot easier and more effective.

I think that indexes (or at least B-Tree indexes) will ideally almost always have tuples that are the latest versions with zheap. The exception is tuples whose ghost bit is set, whose visibility varies based on the MVCC snapshot in use. But the instant that the deleting/updating xact commits it becomes legal to recycle the old heap TID. We don't need to go back to the index to permanently zap the tuple whose ghost bit we already set, because there is an undo pointer in the same leaf page, so nobody is in danger of getting confused and following the now-recycled heap TID.

This ghost bit design owes plenty to 2PL (which will fully remove the index tuple synchronously, rather than just setting a ghost bit). You could say that it's a 2PL/MVCC hybrid, while classic Postgres is "pure" MVCC because it uses explicit row versioning -- it doesn't need to impose restrictions on TID stability. Which seems to be why we offer such a large variety of index access methods -- it's relatively straightforward for Postgres to add niche index AMs, such as SP-GiST.

--
Peter Geoghegan
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 29, 2019 at 12:39 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that indexes (or at least B-Tree indexes) will ideally almost > always have tuples that are the latest versions with zheap. The > exception is tuples whose ghost bit is set, whose visibility varies > based on the MVCC snapshot in use. But the instant that the > deleting/updating xact commits it becomes legal to recycle the old > heap TID. Sorry, I meant the instant the ghost bit index tuple cannot be visible to any possible MVCC snapshot. Which, in general, will be pretty soon after the deleting/updating xact commits. -- Peter Geoghegan
On Mon, Jul 29, 2019 at 3:39 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm not saying you can't handle it. But that necessitates "write > amplification", in the sense that you must now create new index tuples > even for indexes where the indexed columns were not logically altered. > Isn't zheap supposed to fix that problem, at least at in version 2 or > version 3? I also think that stable heap TIDs make index-only scans a > lot easier and more effective. I think there's a cost-benefit analysis here. You're completely correct that inserting new index tuples causes write amplification and, yeah, that's bad. On the other hand, row forwarding has its own costs. If a row ends up persistently moved to someplace else, then every subsequent access to that row has an extra level of indirection. If it ends up split between two places, every read of that row incurs two reads. The "someplace else" where moved rows or ends of split rows are stored has to be skipped by sequential scans, which is complex and possibly inefficient if it breaks up a sequential I/O pattern. Those things are bad, too. It's a little difficult to compare the kinds of badness. My thought is that in the short run, the redirect strategy probably wins, because there could be and likely are a bunch of indexes and it's cheaper to just insert one redirect. But in the long term, the redirect thing seems like a loser, because you have to keep following it. That (perhaps naive) analysis is why zheap doesn't try to maintain TID stability. Instead it wants to do in-place updates (no new TID) as often as possible, but the fallback strategy is simply to do a non-in-place update (new TID) rather than a redirect. > I think that indexes (or at least B-Tree indexes) will ideally almost > always have tuples that are the latest versions with zheap. The > exception is tuples whose ghost bit is set, whose visibility varies > based on the MVCC snapshot in use. But the instant that the > deleting/updating xact commits it becomes legal to recycle the old > heap TID. We don't need to go back to the index to permanently zap the > tuple whose ghost bit we already set, because there is an undo pointer > in the same leaf page, so nobody is in danger of getting confused and > following the now-recycled heap TID. I haven't run across the "ghost bit" terminology before. Is there a good place to read about the technique you're assuming here? A major question is how you handle inserted rows, that are new now and thus not yet visible to everyone, but which will later become all-visible. One idea is: if the undo pointer is new enough that a write transaction which modified the page could still be in-flight, check the undo log to ascertain visibility of index tuples. If not, then any potentially-deleted index tuples are in fact deleted, and any others are all-visible. With this design, you don't set the ghost bit on new tuples, but are still able to stop following the undo pointers for them after a while. To put that another way, there seems to be pretty clearly a need for a bit, but what does the bit mean? It could mean "please check the undo log," in which case it'd have to be set on insert, eventually cleared, and then reset on delete, but I think that's likely to suck. I think therefore that the bit should mean is-deleted-but-not-necessarily-all-visible-yet, which avoids that problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
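A sketch of the visibility rule Robert lands on here, with the bit meaning is-deleted-but-not-necessarily-all-visible-yet; every name in this fragment is hypothetical, used only to pin down the logic:

	/* Hypothetical index-tuple visibility test under this scheme. */
	if (IndexTupleIsGhost(itup))
	{
		/* Deleted, but the deleter may not be all-visible yet: ask undo. */
		return UndoCheckVisibility(page, itup, snapshot);
	}

	/*
	 * Not ghost, so the tuple is inserted.  If the page's undo pointer is
	 * recent enough that the inserting transaction could still be in
	 * flight, consult the undo log; otherwise the tuple is all-visible
	 * and the undo pointer need never be followed for it again.
	 */
	if (PageUndoPtrIsRecent(page))
		return UndoCheckVisibility(page, itup, snapshot);

	return true;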
Re: should there be a hard-limit on the number of transactions pending undo?
From: Peter Geoghegan
On Mon, Jul 29, 2019 at 1:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think there's a cost-benefit analysis here. You're completely
> correct that inserting new index tuples causes write amplification
> and, yeah, that's bad. On the other hand, row forwarding has its own
> costs. If a row ends up persistently moved to someplace else, then
> every subsequent access to that row has an extra level of indirection.

The devil is in the details. It doesn't seem that optimistic to assume that a good implementation could practically always avoid it, by being clever about heap fillfactor. It can work a bit like external TOAST pointers. The oversized datums can go on the other heap page, which will presumably not be in the SELECT list of most queries. It won't be one of the indexed columns in typical cases, so index scans will generally only have to visit one heap page.

It occurs to me that the zheap design is still sensitive to heap fillfactor in much the same way as it would be with reliably-stable TIDs, combined with some amount of row forwarding. It's not essential for correctness that you avoid creating a new HOT chain (or whatever it's called in zheap) with new index tuples, but it is still quite preferable on performance grounds. It's still worth going to a lot of work to avoid having that happen, such as using external TOAST pointers with some of the larger datums on the existing heap page.

> If it ends up split between two places, every read of that row incurs
> two reads. The "someplace else" where moved rows or ends of split rows
> are stored has to be skipped by sequential scans, which is complex and
> possibly inefficient if it breaks up a sequential I/O pattern. Those
> things are bad, too.
>
> It's a little difficult to compare the kinds of badness.

I would say that it's extremely difficult. I'm not going to speculate about how the two approaches might compare today.

> I haven't run across the "ghost bit" terminology before. Is there a
> good place to read about the technique you're assuming here?

"5.2 Key Range Locking and Ghost Records" from "A Survey of B-Tree Locking Techniques" seems like a good place to start. As I said earlier, the paper is available from: https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf

This description won't define the term ghost record/bit in a precise way that you can just adopt, since the details will vary somewhat based on considerations like whether or not MVCC is used. But you'll get the general idea from the paper, I think.

> To put that another way, there seems to be pretty clearly a need for a
> bit, but what does the bit mean? It could mean "please check the undo
> log," in which case it'd have to be set on insert, eventually cleared,
> and then reset on delete, but I think that's likely to suck. I think
> therefore that the bit should mean
> is-deleted-but-not-necessarily-all-visible-yet, which avoids that
> problem.

That sounds about right to me.

--
Peter Geoghegan
On Tue, Jul 30, 2019 at 7:12 AM Peter Geoghegan <pg@bowt.ie> wrote: > SQL Server 2019 has something called "instantaneous transaction > rollback", which seems to make SQL Server optionally behave a lot more > like Postgres [1], apparently with many of the same disadvantages as > Postgres. I agree that there is probably a middle way that more or > less has the advantages of both approaches. I don't really know what > that should look like, though. > > [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/06/p700-antonopoulos.pdf Thanks for sharing that. I see they're giving that paper at VLDB next month in LA... I hope the talk video will be published on the web. While we've been working on a hybrid vacuum/undo design, they've built a hybrid undo/vacuum system. I've only skimmed this, but one of their concerns that caught my eye is log volume in the presence of long running transactions ("3.6 Aggressive Log Truncation"). IIUC they have only a single log for both redo and undo, so a long running transaction requires them to keep all log data around as long as it might be needed for that transaction, in traditional SQL Server. That's basically the flip side of the problem we're trying to solve, in-heap bloat. I think we might have a different solution to that problem, with our finer grained undo logs. Our undo data is not mixed in with redo data (though redo can recreate it, it's not needed after that), and we have multiple undo logs with their own discard pointers, so a long running transaction prevents only one undo log from being truncated, while other undo logs holding other transactions can be truncated as soon as those transactions are committed/rolled back and are either all visible (admittedly tracked with a system-wide xmin approach for now, but could probably be made more granular) or a snapshot-too-old threshold is reached (not implemented yet). -- Thomas Munro https://enterprisedb.com
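PS: For anyone skimming, the property doing the work in that last paragraph is simply that each undo log has its own tail; schematically (a simplification of the real per-log meta-data, illustrative only):

    typedef struct UndoLogTailState
    {
        UndoRecPtr discard;    /* oldest record someone might still need */
        UndoRecPtr insert;     /* where the next record will be written */
    } UndoLogTailState;

A long running transaction holds back 'discard' only in the log it is writing to; every other log keeps advancing its own.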
Hi, I realize that this might not be the absolutely newest version of the undo storage part of this patchset - but I'm trying to understand the whole context, and that's hard without reading through the whole stack in a situation where the layers actually fit together. On 2019-07-29 01:48:30 -0700, Andres Freund wrote: > < more tomorrow > > + /* Move the high log number pointer past this one. */ > + ++UndoLogShared->next_logno; Fwiw, I find having "next" and "low" as variable names, and then describing "next" as high in comments somewhat confusing. > +/* check_hook: validate new undo_tablespaces */ > +bool > +check_undo_tablespaces(char **newval, void **extra, GucSource source) > +{ > + char *rawname; > + List *namelist; > + > + /* Need a modifiable copy of string */ > + rawname = pstrdup(*newval); > + > + /* > + * Parse string into list of identifiers, just to check for > + * well-formedness (unfortunateley we can't validate the names in the > + * catalog yet). > + */ > + if (!SplitIdentifierString(rawname, ',', &namelist)) > + { > + /* syntax error in name list */ > + GUC_check_errdetail("List syntax is invalid."); > + pfree(rawname); > + list_free(namelist); > + return false; > + } Why can't you validate the catalog here? In a lot of cases this will be called in a transaction, especially when changing it in a session. E.g. temp_tablespaces does so? > + /* > + * Make sure we aren't already in a transaction that has been assigned an > + * XID. This ensures we don't detach from an undo log that we might have > + * started writing undo data into for this transaction. > + */ > + if (GetTopTransactionIdIfAny() != InvalidTransactionId) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + (errmsg("undo_tablespaces cannot be changed while a transaction is in progress")))); Hm. Is this really a great proxy? Seems like it'll block changing the tablespace unnecessarily in a lot of situations, and like there could even be holes in the future - it doesn't seem crazy that we'd want to emit undo without assigning an xid in some situations (e.g. for deleting files in error cases, or for more aggressive cleanup of dead index entries during reads or such). It seems like it'd be pretty easy to just check CurrentSession->attached_undo_slots[i].slot->meta.unlogged.this_xact_start or such? > +static bool > +choose_undo_tablespace(bool force_detach, Oid *tablespace) > +{ > + else > + { > + /* > + * Choose an OID using our pid, so that if several backends have the > + * same multi-tablespace setting they'll spread out. We could easily > + * do better than this if more serious load balancing is judged > + * useful. > + */ We're not really choosing an oid, we're choosing a tablespace. Obviously one can understand it as is, but it confused me for a second. > + int index = MyProcPid % length; Hm. Is MyProcPid a good proxy here? Wouldn't it be better to use MyProc->pgprocno or such? That's much more guaranteed to space out somewhat evenly? > + int first_index = index; > + Oid oid = InvalidOid; > + > + /* > + * Take the tablespace create/drop lock while we look the name up. > + * This prevents the tablespace from being dropped while we're trying > + * to resolve the name, or while the called is trying to create an > + * undo log in it. The caller will have to release this lock. > + */ > + LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE); Why exclusive?
I think any function that acquires a lock it doesn't release (or the reverse) ought to have a big honking comment in its header warning of that. And an explanation as to why that is. > + for (;;) > + { > + const char *name = list_nth(namelist, index); > + > + oid = get_tablespace_oid(name, true); > + if (oid == InvalidOid) > + { > + /* Unknown tablespace, try the next one. */ > + index = (index + 1) % length; > + /* > + * But if we've tried them all, it's time to complain. We'll > + * arbitrarily complain about the last one we tried in the > + * error message. > + */ > + if (index == first_index) > + ereport(ERROR, > + (errcode(ERRCODE_UNDEFINED_OBJECT), > + errmsg("tablespace \"%s\" does not exist", name), > + errhint("Create the tablespace or set undo_tablespaces to a valid or empty list."))); > + continue; Wouldn't it be better to simply include undo_tablespaces in the error messages? Something roughly like 'none of the tablespaces in undo_tablespaces = \"%s\" exists"? > + /* > + * If we came here because the user changed undo_tablesaces, then detach > + * from any undo logs we happen to be attached to. > + */ > + if (force_detach) > + { > + for (i = 0; i < UndoPersistenceLevels; ++i) > + { > + UndoLogSlot *slot = CurrentSession->attached_undo_slots[i]; > + > + if (slot != NULL) > + { > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->pid = InvalidPid; > + slot->meta.unlogged.xid = InvalidTransactionId; > + LWLockRelease(&slot->mutex); Would it make sense to re-assert here that the current transaction didn't write undo? > +bool > +DropUndoLogsInTablespace(Oid tablespace) > +{ > + DIR *dir; > + char undo_path[MAXPGPATH]; > + UndoLogSlot *slot = NULL; > + int i; > + > + Assert(LWLockHeldByMe(TablespaceCreateLock)); IMO this ought to be mentioned in a function header comment. > + /* First, try to kick everyone off any undo logs in this tablespace. */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + bool ok; > + bool return_to_freelist = false; > + > + /* Skip undo logs in other tablespaces. */ > + if (slot->meta.tablespace != tablespace) > + continue; > + > + /* Check if this undo log can be forcibly detached. */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + if (slot->meta.discard == slot->meta.unlogged.insert && > + (slot->meta.unlogged.xid == InvalidTransactionId || > + !TransactionIdIsInProgress(slot->meta.unlogged.xid))) > + { Not everyone will agree, but this looks complicated enough that I'd put it just in a simple wrapper function. If this were if (CanDetachUndoForcibly(slot)) you'd not need a comment either... Also, isn't the slot->meta.discard == slot->meta.unlogged.insert a separate concern from detaching? My understanding is that it'll be perfectly normal to have undo logs with undiscarded data that nobody is attached to? In fact, I got confused below, because I initially didn't spot any place that implemented the check referenced in the caller: > + * Drop the undo logs in this tablespace. This will fail (without > + * dropping anything) if there are undo logs that we can't afford to drop > + * because they contain non-discarded data or a transaction is in > + * progress. Since we hold TablespaceCreateLock, no other session will be > + * able to attach to an undo log in this tablespace (or any tablespace > + * except default) concurrently. 
> + */ > + if (!DropUndoLogsInTablespace(tablespaceoid)) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs", > + tablespacename))); > + > + /* > + else > + { > + /* > + * There is data we need in this undo log. We can't force it to > + * be detached. > + */ > + ok = false; > + } Seems like we ought to return more information here. An error message like: > /* > + * Drop the undo logs in this tablespace. This will fail (without > + * dropping anything) if there are undo logs that we can't afford to drop > + * because they contain non-discarded data or a transaction is in > + * progress. Since we hold TablespaceCreateLock, no other session will be > + * able to attach to an undo log in this tablespace (or any tablespace > + * except default) concurrently. > + */ > + if (!DropUndoLogsInTablespace(tablespaceoid)) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs", > + tablespacename))); doesn't really allow a DBA to do anything about the issue. Seems we ought to at least include the pid in the error message? I'd perhaps just move the error message from DropTableSpace() into DropUndoLogsInTablespace(). I don't think that's worse from a layering perspective, and allows to raise a more precise error, and simplifies the API. > + /* > + * Put this undo log back on the appropriate free-list. No one can > + * attach to it while we hold TablespaceCreateLock, but if we return > + * earlier in a future go around this loop, we need the undo log to > + * remain usable. We'll remove all appropriate logs from the > + * free-lists in a separate step below. > + */ > + if (return_to_freelist) > + { > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + slot->next_free = UndoLogShared->free_lists[slot->meta.persistence]; > + UndoLogShared->free_lists[slot->meta.persistence] = slot->logno; > + LWLockRelease(UndoLogLock); > + } There's multiple places that put logs onto the freelist. I'd put them into one small function. Not primarily because it'll be easier to read, but because it makes it easier to search for places that do so. > + /* > + * We detached all backends from undo logs in this tablespace, and no one > + * can attach to any non-default-tablespace undo logs while we hold > + * TablespaceCreateLock. We can now drop the undo logs. > + */ > + slot = NULL; > + while ((slot = UndoLogNextSlot(slot))) > + { > + /* Skip undo logs in other tablespaces. */ > + if (slot->meta.tablespace != tablespace) > + continue; > + > + /* > + * Make sure no buffers remain. When that is done by > + * UndoLogDiscard(), the final page is left in shared_buffers because > + * it may contain data, or at least be needed again very soon. Here > + * we need to drop even that page from the buffer pool. > + */ > + forget_undo_buffers(slot->logno, slot->meta.discard, slot->meta.discard, true); > + > + /* > + * TODO: For now we drop the undo log, meaning that it will never be > + * used again. That wastes the rest of its address space. Instead, > + * we should put it onto a special list of 'offline' undo logs, ready > + * to be reactivated in some other tablespace. Then we can keep the > + * unused portion of its address space. 
> + */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->meta.status = UNDO_LOG_STATUS_DISCARDED; > + LWLockRelease(&slot->mutex); > + } Before I looked up forget_undo_buffers()'s implementation I wrote: Hm. Iterating through shared buffers several times, especially when there possibly could be a good sized number of undo logs, seems a bit superfluous. This probably isn't going to be that frequently used in practice, so it's perhaps ok. But it seems like this might be used when things are bad (i.e. there's a lot of UNDO). But I still wonder about that. Especially when there's a lot of UNDO (most of it not in shared buffers), this could end up doing a *crapton* of buffer lookups. I'm inclined to think that this case - probably in contrast to the discard case - would be better served using DropRelFileNodeBuffers(). > + /* Remove all dropped undo logs from the free-lists. */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + for (i = 0; i < UndoPersistenceLevels; ++i) > + { > + UndoLogSlot *slot; > + UndoLogNumber *place; > + > + place = &UndoLogShared->free_lists[i]; > + while (*place != InvalidUndoLogNumber) > + { > + slot = find_undo_log_slot(*place, true); > + if (!slot) > + elog(ERROR, > + "corrupted undo log freelist, unknown log %u", *place); > + if (slot->meta.status == UNDO_LOG_STATUS_DISCARDED) > + *place = slot->next_free; > + else > + place = &slot->next_free; > + } > + } > + LWLockRelease(UndoLogLock); Hm, shouldn't this check that the log is actually in the being-dropped tablespace? > +void > +ResetUndoLogs(UndoPersistence persistence) > +{ This imo ought to explain why one would want/need to do that. As far as I can tell this implementation for example wouldn't be correct in all that many situations, because it e.g. doesn't drop the relevant buffers? Seems like this would need to assert that persistence isn't PERMANENT? This is made more "problematic" by the fact that there's no caller for this in this commit, only being used much later in the series. But I think the comment should be there anyway. Hard to review (and understand) otherwise. Why is it correct not to take any locks here? The caller in 0014 afaict is when we're already in hot standby, which means people will possibly read undo? > + UndoLogSlot *slot = NULL; > + > + while ((slot = UndoLogNextSlot(slot))) > + { > + DIR *dir; > + struct dirent *de; > + char undo_path[MAXPGPATH]; > + char segment_prefix[MAXPGPATH]; > + size_t segment_prefix_size; > + > + if (slot->meta.persistence != persistence) > + continue; > + > + /* Scan the directory for files belonging to this undo log. */ > + snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", slot->logno); > + segment_prefix_size = strlen(segment_prefix); > + UndoLogDirectory(slot->meta.tablespace, undo_path); > + dir = AllocateDir(undo_path); > + if (dir == NULL) > + continue; > + while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL) > + { > + char segment_path[MAXPGPATH]; > + > + if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0) > + continue; I'm perfectly fine with using MAXPGPATH buffers. But I do find it confusing that in some places you're using dynamic allocations (in some cases quite repeatedly, like in allocate_empty_undo_segment()), but here you don't? Hm, isn't this kinda O(#slot*#total_size_of_undo) due to going over the whole tablespace for each log?
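E.g. something very roughly like this (untested, error handling elided) would visit each undo directory only once, by mapping segment file names back to their log instead of re-scanning the directory per slot:

    while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
    {
        unsigned int logno;
        UndoLogSlot *slot;

        /* segment files start with the log number in hex */
        if (sscanf(de->d_name, "%6X.", &logno) != 1)
            continue;
        slot = find_undo_log_slot(logno, false);
        if (slot == NULL || slot->meta.persistence != persistence)
            continue;
        /* ... unlink the segment as before ... */
    }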
> + snprintf(segment_path, sizeof(segment_path), "%s/%s", > + undo_path, de->d_name); > + elog(DEBUG1, "unlinked undo segment \"%s\"", segment_path); > + if (unlink(segment_path) < 0) > + elog(LOG, "couldn't unlink file \"%s\": %m", segment_path); > + } > + FreeDir(dir); I think the LOG should be done alternatively to the DEBUG1, otherwise it's going to be confusing. Should this really only be a LOG? Going to be hard to clean up for a DBA later. > +Datum > +pg_stat_get_undo_logs(PG_FUNCTION_ARGS) > +{ > +#define PG_STAT_GET_UNDO_LOGS_COLS 9 > + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; > + TupleDesc tupdesc; > + Tuplestorestate *tupstore; > + MemoryContext per_query_ctx; > + MemoryContext oldcontext; > + char *tablespace_name = NULL; > + Oid last_tablespace = InvalidOid; > + int i; > + > + /* check to see if caller supports us returning a tuplestore */ > + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("set-valued function called in context that cannot accept a set"))); > + if (!(rsinfo->allowedModes & SFRM_Materialize)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("materialize mode required, but it is not " \ > + "allowed in this context"))); > + > + /* Build a tuple descriptor for our result type */ > + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) > + elog(ERROR, "return type must be a row type"); I wish we'd encapsulate this in one place instead of copying it over and over. Imo it's bad style to break error messages over multiple lines - it makes them harder to grep for. > + /* Scan all undo logs to build the results. */ > + for (i = 0; i < UndoLogShared->nslots; ++i) > + { > + UndoLogSlot *slot = &UndoLogShared->slots[i]; > + char buffer[17]; > + Datum values[PG_STAT_GET_UNDO_LOGS_COLS]; > + bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false }; > + Oid tablespace; Uncommented numbers like '17' for buffer lengths make me nervous. > + values[0] = ObjectIdGetDatum((Oid) slot->logno); > + values[1] = CStringGetTextDatum( > + slot->meta.persistence == UNDO_PERMANENT ? "permanent" : > + slot->meta.persistence == UNDO_UNLOGGED ? "unlogged" : > + slot->meta.persistence == UNDO_TEMP ? "temporary" : "<uknown>"); s/uknown/unknown/ > + tablespace = slot->meta.tablespace; > + > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, slot->meta.discard)); > + values[3] = CStringGetTextDatum(buffer); > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, slot->meta.unlogged.insert)); > + values[4] = CStringGetTextDatum(buffer); > + snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, > + MakeUndoRecPtr(slot->logno, slot->meta.end)); > + values[5] = CStringGetTextDatum(buffer); Makes me wonder if we shouldn't have a type for undo pointers. > + if (slot->meta.unlogged.xid == InvalidTransactionId) > + nulls[6] = true; > + else > + values[6] = TransactionIdGetDatum(slot->meta.unlogged.xid); > + if (slot->pid == InvalidPid) > + nulls[7] = true; > + else > + values[7] = Int32GetDatum((int32) slot->pid); > + switch (slot->meta.status) > + { > + case UNDO_LOG_STATUS_ACTIVE: > + values[8] = CStringGetTextDatum("ACTIVE"); break; > + case UNDO_LOG_STATUS_FULL: > + values[8] = CStringGetTextDatum("FULL"); break; > + default: > + nulls[8] = true; > + } Don't think this'll survive pgindent. > + /* > + * Deal with potentially slow tablespace name lookup without the lock.
> + * Avoid making multiple calls to that expensive function for the > + * common case of repeating tablespace. > + */ > + if (tablespace != last_tablespace) > + { > + if (tablespace_name) > + pfree(tablespace_name); > + tablespace_name = get_tablespace_name(tablespace); > + last_tablespace = tablespace; > + } If we need to do this repeatedly, I think we ought to add a syscache for tablespace names. > + if (tablespace_name) > + { > + values[2] = CStringGetTextDatum(tablespace_name); > + nulls[2] = false; > + } > + else > + nulls[2] = true; > + > + tuplestore_putvalues(tupstore, tupdesc, values, nulls); Seems like a CHECK_FOR_INTERRUPTS() in this loop wouldn't hurt. > + } > + > + if (tablespace_name) > + pfree(tablespace_name); That seems a bit superfluous, given we're leaking plenty of other memory (which is perfectly fine). > +/* > + * replay the creation of a new undo log > + */ > +static void > +undolog_xlog_create(XLogReaderState *record) > +{ > + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record); > + UndoLogSlot *slot; > + > + /* Create meta-data space in shared memory. */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + > + /* TODO: assert that it doesn't exist already? */ > + slot = allocate_undo_log_slot(); Doesn't this need some error checking? allocate_undo_log_slot() will return NULL if there are no slots left. E.g. restarting a server with a lower max_connections could have one run into this easily? > +/* > + * Drop all buffers for the given undo log, from the old_discard to up > + * new_discard. If drop_tail is true, also drop the buffer that holds > + * new_discard; this is used when discarding undo logs completely, for example > + * via DROP TABLESPACE. If it is false, then the final buffer is not dropped > + * because it may contain data. > + * > + */ > +static void > +forget_undo_buffers(int logno, UndoLogOffset old_discard, > + UndoLogOffset new_discard, bool drop_tail) > +{ > + BlockNumber old_blockno; > + BlockNumber new_blockno; > + RelFileNode rnode; > + > + UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard)); > + old_blockno = old_discard / BLCKSZ; > + new_blockno = new_discard / BLCKSZ; > + if (drop_tail) > + ++new_blockno; > + while (old_blockno < new_blockno) > + { Hm. This'll be quite bad if you have a lot more undo than shared_buffers. Taking the partition lwlocks this many times will hurt. OTOH, scanning all of shared buffers every time we truncate a few hundred bytes of undo away is obviously also not going to work. > + ForgetBuffer(SMGR_UNDO, rnode, UndoLogForkNum, old_blockno); > + ForgetLocalBuffer(SMGR_UNDO, rnode, UndoLogForkNum, old_blockno++); This seems odd to me - why do we need to scan both? We ought to know which one is needed, right? > + } > +} > +/* > + * replay an undo segment discard record > + */ Missing newline between functions. > +static void > +undolog_xlog_discard(XLogReaderState *record) > +{ > + /* > + * We're about to discard undologs. In Hot Standby mode, ensure that > + * there's no queries running which need to get tuple from discarded undo. nitpick: s/undologs/undo logs/? I think most other comments split it? > + * XXX we are passing empty rnode to the conflict function so that it can > + * check conflict in all the backend regardless of which database the > + * backend is connected. > + */ > + if (InHotStandby && TransactionIdIsValid(xlrec->latestxid)) > + ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode); Hm.
Perhaps it'd be better to change ResolveRecoveryConflictWithSnapshot's API to just accept the database (OTOH, we perhaps ought to be more granular in conflict processing). Or just mention that it's ok to pass in an invalid rnode? > + /* > + * See if we need to unlink or rename any files, but don't consider it an > + * error if we find that files are missing. Since UndoLogDiscard() > + * performs filesystem operations before WAL logging or updating shmem > + * which could be checkpointed, a crash could have left files already > + * deleted, but we could replay WAL that expects the files to be there. > + */ Or we could have crashed/restarted during WAL replay and be processing the same WAL again. Not sure if that's worth mentioning. > + /* Unlink or rename segments that are no longer in range. */ > + while (old_segment_begin < new_segment_begin) > + { > + char discard_path[MAXPGPATH]; > + > + /* Tell the checkpointer that the file is going away. */ > + undofile_forget_sync(slot->logno, > + old_segment_begin / UndoLogSegmentSize, > + slot->meta.tablespace); > + > + UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize, > + slot->meta.tablespace, discard_path); > + > + /* Can we recycle the oldest segment? */ > + if (end < xlrec->end) > + { > + char recycle_path[MAXPGPATH]; > + > + UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize, > + slot->meta.tablespace, recycle_path); > + if (rename(discard_path, recycle_path) == 0) > + { > + elog(DEBUG1, "recycled undo segment \"%s\" -> \"%s\"", > + discard_path, recycle_path); > + end += UndoLogSegmentSize; > + } > + else > + { > + elog(LOG, "could not rename \"%s\" to \"%s\": %m", > + discard_path, recycle_path); > + } > + } > + else > + { > + if (unlink(discard_path) == 0) > + elog(DEBUG1, "unlinked undo segment \"%s\"", discard_path); > + else > + elog(LOG, "could not unlink \"%s\": %m", discard_path); > + } > + old_segment_begin += UndoLogSegmentSize; > + } The code to recycle or delete one segment exists in multiple places (at least also in UndoLogDiscard()). Think it's long enough that it's easily worthwhile to share. > +/* > @@ -1418,12 +1418,18 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf > segmentpath = strstr(filename, "."); > if (segmentpath != NULL) > { > - segmentno = atoi(segmentpath + 1); > - if (segmentno == 0) > + char *end; > + if (strstr(readfilename, "undo")) > + first_blkno = strtol(segmentpath + 1, &end, 16) / BLCKSZ; > + else > + first_blkno = strtol(segmentpath + 1, &end, 10) * RELSEG_SIZE; > + if (*end != '\0') > ereport(ERROR, > - (errmsg("invalid segment number %d in file \"%s\"", > - segmentno, filename))); > + (errmsg("invalid segment number in file \"%s\"", > + filename))); > } > + else > + first_blkno = 0; > } > } Hm. Not a fan of just using strstr() here. Can't quite articulate why. Just somehow rubs me wrong. > /* > * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require > * a relcache entry for the relation. > - * > - * NB: At present, this function may only be used on permanent relations, which > - * is OK, because we only use it during XLOG replay. If in the future we > - * want to use it on temporary or unlogged relations, we could pass additional > - * parameters.
> */ > Buffer > ReadBufferWithoutRelcache(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum, > BlockNumber blockNum, ReadBufferMode mode, > - BufferAccessStrategy strategy) > + BufferAccessStrategy strategy, > + char relpersistence) > { > bool hit; > > - SMgrRelation smgr = smgropen(smgrid, rnode, InvalidBackendId); > - > - Assert(InRecovery); > + SMgrRelation smgr = smgropen(smgrid, rnode, > + relpersistence == RELPERSISTENCE_TEMP > + ? MyBackendId : InvalidBackendId); > > - return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum, > + return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum, > mode, strategy, &hit); > } Hm. Using this for undo access means that we don't do any buffer read/hit counting that can be associated with the relation causing undo to be read/written. That seems like a sizable monitoring deficiency. > /* > + * ForgetBuffer -- drop a buffer from shared buffers > + * > + * If the buffer isn't present in shared buffers, nothing happens. If it is > + * present, it is discarded without making any attempt to write it back out to > + * the operating system. The caller must therefore somehow be sure that the > + * data won't be needed for anything now or in the future. It assumes that > + * there is no concurrent access to the block, except that it might be being > + * concurrently written. > + */ > +void > +ForgetBuffer(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum, > + BlockNumber blockNum) > +{ > + SMgrRelation smgr = smgropen(smgrid, rnode, InvalidBackendId); > + BufferTag tag; /* identity of target block */ > + uint32 hash; /* hash value for tag */ > + LWLock *partitionLock; /* buffer partition lock for it */ > + int buf_id; > + BufferDesc *bufHdr; > + uint32 buf_state; > + > + /* create a tag so we can lookup the buffer */ > + INIT_BUFFERTAG(tag, smgrid, smgr->smgr_rnode.node, forkNum, blockNum); > + > + /* determine its hash code and partition lock ID */ > + hash = BufTableHashCode(&tag); > + partitionLock = BufMappingPartitionLock(hash); > + > + /* see if the block is in the buffer pool */ > + LWLockAcquire(partitionLock, LW_SHARED); > + buf_id = BufTableLookup(&tag, hash); > + LWLockRelease(partitionLock); > + > + /* didn't find it, so nothing to do */ > + if (buf_id < 0) > + return; > + > + /* take the buffer header lock */ > + bufHdr = GetBufferDescriptor(buf_id); > + buf_state = LockBufHdr(bufHdr); > + /* > + * The buffer might been evicted after we released the partition lock and > + * before we acquired the buffer header lock. If so, the buffer we've > + * locked might contain some other data which we shouldn't touch. If the > + * buffer hasn't been recycled, we proceed to invalidate it. > + */ > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && > + bufHdr->tag.blockNum == blockNum && > + bufHdr->tag.forkNum == forkNum) > + InvalidateBuffer(bufHdr); /* releases spinlock */ > + else > + UnlockBufHdr(bufHdr, buf_state); Phew, I don't like this code one bit. It imo is a bad idea / unnecessary to look up the buffer, unlock the partition lock, and then recheck identity. And do exactly the same thing again in InvalidateBuffer() (including making a copy of the tag while holding the buffer header lock).
Seems like this should be something roughly like ReservePrivateRefCountEntry(); LWLockAcquire(partitionLock, LW_SHARED); buf_id = BufTableLookup(&tag, hash); if (buf_id >= 0) { bufHdr = GetBufferDescriptor(buf_id); buf_state = LockBufHdr(bufHdr); /* * Temporarily acquire pin - that prevents the buffer * from being replaced with one that we did not intend * to target. * * XXX: */ ref = PinBuffer_Locked(bufHdr, strategy); /* release partition lock, acquire exclusively so we can drop */ LWLockRelease(partitionLock); /* loop until nobody else has the buffer pinned */ while (true) { LWLockAcquire(partitionLock, LW_EXCLUSIVE); buf_state = LockBufHdr(buf); /* * Check if somebody else is busy writing the buffer (we * have one pin). */ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1) break; // XXX: Should we assert IO_IN_PROGRESS? Ought to be the // only way to get here. /* wait for IO to finish, without holding locks */ UnlockBufHdr(buf, buf_state); LWLockRelease(partitionLock); Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) == 1); WaitIO(buf); /* buffer identity can't change, we've a pin */ // XXX: Assert that the buffer isn't dirty anymore? There // ought to be no possibility for it to get dirty now. } Assert(!(buf_state & BM_PIN_COUNT_WAITER)); /* * Clear out the buffer's tag and flags. We must do this to ensure that * linear scans of the buffer array don't think the buffer is valid. */ oldFlags = buf_state & BUF_FLAG_MASK; CLEAR_BUFFERTAG(buf->tag); buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK); /* remove our refcount */ buf_state -= BUF_REFCOUNT_ONE; UnlockBufHdr(buf, buf_state); /* * Remove the buffer from the lookup hashtable, if it was in there. */ if (oldFlags & BM_TAG_VALID) BufTableDelete(&oldTag, oldHash); /* * Done with mapping lock. */ LWLockRelease(oldPartitionLock); Assert(ref->refcount == 1); ForgetPrivateRefCountEntry(ref); ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf)); } or something in that vein. Now you can validly argue that this is more complicated - but I also think that this is going to be a much hotter path than normal relation drops. <more after some errands> Greetings, Andres Freund
Re: should there be a hard-limit on the number of transactions pending undo?
From
Peter Geoghegan
On Mon, Jul 29, 2019 at 2:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Thanks for sharing that. I see they're giving that paper at VLDB next > month in LA... I hope the talk video will be published on the web. > While we've been working on a hybrid vacuum/undo design, they've built > a hybrid undo/vacuum system. It seems that this will be in a stable release soon, so it's not pie-in-the-sky stuff. AFAICT, they have indexes that always point to the latest row version. Getting an old version always requires working backwards from the latest. Perhaps the constant time recovery stuff is somewhat like Postgres heapam when it comes to SELECTs, INSERTs, and DELETEs, but much less similar when it comes to UPDATEs. This seems like it might be an important distinction. As the MVCC survey paper out of CMU [1] from a couple of years back says: "The main idea of using logical pointers is that the DBMS uses a fixed identifier that does not change for each tuple in its index entry. Then, as shown in Fig. 5a, the DBMS uses an indirection layer that maps a tuple’s identifier to the HEAD of its version chain. This avoids the problem of having to update all of a table’s indexes to point to a new physical location whenever a tuple is modified. (even if the indexed attributes were not changed)." To me, this suggests that zheap ought to make heap TIDs "more logical" than they are with heapam today (heap TIDs are hybrid physical/logical identifiers today). "Row forwarding" across heap pages is the traditional way of ensuring that TIDs in indexes are stable even in the worst case, apparently, but other approaches also seem possible. [1] http://www.vldb.org/pvldb/vol10/p781-Wu.pdf -- Peter Geoghegan
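PS: To illustrate what the paper means by logical pointers (the names here are invented; nothing like this exists in Postgres or zheap today):

    /*
     * Index tuples store a stable logical identifier rather than a
     * physical TID; an indirection layer maps it to the current head of
     * the version chain.  An UPDATE only changes the mapping, so none of
     * the table's indexes need new entries.
     */
    typedef struct VersionChainMapping
    {
        uint64          logical_id;   /* what index tuples point at */
        ItemPointerData chain_head;   /* current physical location */
    } VersionChainMapping;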
On Tue, Jul 30, 2019 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Jul 23, 2019 at 10:42 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > > Yes, I also think that the function would error out only because of > > can't-happen cases, like "too many locks taken" or "out of binary heap > > slots" or "out of memory" (this last one is not such a can't happen > > case). These cases happen probably due to some bugs, I suppose. But I > > was wondering : Generally when the code errors out with such > > can't-happen elog() calls, worst thing that happens is that the > > transaction gets aborted. Whereas, in this case, the worst thing that > > could happen is : the undo action would never get executed, which > > means selects for this tuple will keep on accessing the undo log ? > > This does not sound like any data consistency issue, so we should be > > fine after all ? > > I don't think so. Every XID present in undo has to be something we > can look up in CLOG to figure out which transactions are aborted and > which transactions are committed, so that we know which transactions > need undo. If we forget to undo the transaction, we can't discard it, > which means we can't advance the CLOG transaction horizon, which means > we'll eventually start failing to assign XIDs, leading to a refusal of > all write transactions. Oops. > > More generally, it's not OK for the generic undo layer to make > assumptions about whether the operations performed by the undo > handlers are essential or not. We don't want to impose a design > constraint the undo can only be used for things that are not actually > critical, because that will make it hard to write AMs that use it. > And there's no reason to live with such a design constraint anyway, > because, as noted above, CLOG truncation requires it. > > More generally still, some can't-happen situations should be checked > via Assert() and others via elog(). For example, consider some code > that looks up a syscache tuple and pulls data from the returned tuple. > If the code that handles DDL is written in such a way that the tuple > should always exist, then this is a can't-happen situation, but > generally the code checks this via elog(), not Assert(), because it > could also happen due to the catalog contents being corrupted. If > Assert() were used, the checks would not run in production builds, and > a corrupt catalog would lead to a seg fault. An elog() is much > friendlier. As a general principle, when a certain thing ought to > always be true, but it being true depends on a whole lot of > assumptions elsewhere in the code, and especially if it also depends > on assumptions like "the database is not corrupted," I think elog() is > preferable. Assert() is better for things that are more localized and > that really can't go wrong for any reason other than a bug. In this > case, I think I would tend towards elog(PANIC), but it's arguable. > Agreed, elog(PANIC) seems like a better way for this as compared to Assert. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
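To illustrate the distinction with a made-up example from the undo apply path (the field name is invented):

    /* Localized invariant that only a bug in this code could break: */
    Assert(BufferIsValid(buffer));

    /*
     * Depends on distant assumptions (undo data not corrupted, CLOG in
     * sync), so keep the check in production builds:
     */
    if (!TransactionIdIsValid(uur->uur_xid))
        elog(PANIC, "undo record has invalid transaction id");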
On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > I don't like the fact that undoaccess.c has a new global, > > > undo_compression_info. I haven't read the code thoroughly, but do we > > > really need that? I think it's never modified (so it could just be > > > declared const), > > > > Actually, this will get modified otherwise across undo record > > insertion how we will know what was the values of the common fields in > > the first record of the page. Another option could be that every time > > we insert the record, read the value from the first complete undo > > record on the page but that will be costly because for every new > > insertion we need to read the first undo record of the page. > > > > This information won't be shared across transactions, so can't we keep > it in top transaction's state? It seems to me that will be better > than to maintain it as a global state. I think this idea is good for the DO time but during REDO time it will not work as we will not have the transaction state. Having said that the current idea of keeping in the global variable will also not work during REDO time because the WAL from different transactions can be interleaved. There are a few ideas to handle this issue: 1. At DO time keep in TopTransactionState as you suggested and during recovery time read from the first complete record on the page. 2. Just to keep the code uniform always read from the first complete record of the page. After putting more thought into it, I am more inclined towards idea-2. Because we are anyway inserting our current record into that page, we have already read the buffer and also hold the exclusive lock on it. So reading a few extra bytes from the buffer will not hurt us IMHO. If someone has a better solution please suggest. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
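PS: To sketch idea-2 (the field and helper names here are invented, just to show the shape):

    /*
     * At both DO and REDO time we already hold the buffer exclusively
     * while inserting, so the common fields can simply be re-read from
     * the first complete record on the page:
     */
    page = BufferGetPage(buffer);
    phdr = (UndoPageHeader) page;
    first_rec = (char *) page + phdr->first_record_offset;   /* hypothetical */
    compression_info = UndoRecordGetCommonInfo(first_rec);   /* hypothetical */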
On Tue, Jul 30, 2019 at 5:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > I don't like the fact that undoaccess.c has a new global, > > > > undo_compression_info. I haven't read the code thoroughly, but do we > > > > really need that? I think it's never modified (so it could just be > > > > declared const), > > > > > > Actually, this will get modified otherwise across undo record > > > insertion how we will know what was the values of the common fields in > > > the first record of the page. Another option could be that every time > > > we insert the record, read the value from the first complete undo > > > record on the page but that will be costly because for every new > > > insertion we need to read the first undo record of the page. > > > > > > > This information won't be shared across transactions, so can't we keep > > it in top transaction's state? It seems to me that will be better > > than to maintain it as a global state. > > I think this idea is good for the DO time but during REDO time it will > not work as we will not have the transaction state. Having said that > the current idea of keeping in the global variable will also not work > during REDO time because the WAL from different transactions can be > interleaved. There are a few ideas to handle this issue: > > 1. At DO time keep in TopTransactionState as you suggested and during > recovery time read from the first complete record on the page. > 2. Just to keep the code uniform always read from the first complete > record of the page. > > After putting more thought into it, I am more inclined towards idea-2. Because > we are anyway inserting our current record into that page, we have > already read the buffer and also hold the exclusive lock on it. > So reading a few extra bytes from the buffer will not hurt us IMHO. > > If someone has a better solution please suggest. Hi Dilip, Here's some initial review of the following patch (from your public undo_interface_v1 branch as of this morning). I haven't tested this version yet, because my means of testing this stuff involves waiting for undoprocessing to be rebased, so that I can test it with my orphaned files stuff and other test programs. It contains another suggestion for that problem you just mentioned (and also me pointing out what you just pointed out, since I wrote it earlier) though I'm not sure if it's better than your options above. > commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e > Author: Dilip Kumar <dilipkumar@localhost.localdomain> > Date: Thu May 2 11:28:13 2019 +0530 > > Provide interfaces to store and fetch undo records. +#include "commands/tablecmds.h" +#include "storage/block.h" +#include "storage/buf.h" +#include "storage/buf_internals.h" +#include "storage/bufmgr.h" +#include "miscadmin.h" "miscadmin.h" comes before "storage...". +/* + * Compute the size of the partial record on the undo page. + * + * Compute the complete record size by uur_info and variable field length + * stored in the page header and then subtract the offset of the record so that + * we can get the exact size of partial record in this page.
+ */ +static inline Size +UndoPagePartialRecSize(UndoPageHeader phdr) +{ + Size size; We decided to use size_t everywhere in new code (except perhaps functions conforming to function pointer types that historically use Size in their type). + /* + * Compute the header size from undo record uur_info, stored in the page + * header. + */ + size = UndoRecordHeaderSize(phdr->uur_info); + + /* + * Add length of the variable part and undo length. Now, we know the + * complete length of the undo record. + */ + size += phdr->tuple_len + phdr->payload_len + sizeof(uint16); + + /* + * Subtract the size which is stored in the previous page to get the + * partial record size stored in this page. + */ + size -= phdr->record_offset; + + return size; This is probably a stupid question but why isn't it enough to just store the offset of the first record that begins on this page, or 0 for none yet? Why do we need to worry about the partial record's payload etc? +UndoRecPtr +PrepareUndoInsert(UndoRecordInsertContext *context, + UnpackedUndoRecord *urec, + Oid dbid) +{ ... + /* Fetch compression info for the transaction. */ + compression_info = GetTopTransactionUndoCompressionInfo(category); How can this work correctly in recovery? [Edit: it doesn't, as you just pointed out] I had started reviewing an older version of your patch (the version that had made it as far as the undoprocessing branch as of recently), before I had the bright idea to look for a newer version. I was going to object to the global variable you had there in the earlier version. It seems to me that you have to be able to reproduce the exact same compression in recovery that you produced at "do" time, no? How can TopTransactionStateData be the right place for this in recovery? One data structure that could perhaps hold this would be UndoLogTableEntry (the per-backend cache, indexed by undo log number, with pretty fast lookups; used for things like UndoLogNumberGetCategory()). As long as you never want to have inter-transaction compression, that should have the right scope to give recovery per-undo log tracking. If you ever wanted to do compression between transactions too, maybe UndoLogSlot could work, but that'd have more complications. +/* + * Read undo records of the transaction in bulk + * + * Read undo records between from_urecptr and to_urecptr until we exhaust the + * the memory size specified by undo_apply_size. If we could not read all the + * records till to_urecptr then the caller should consume current set of records + * and call this function again. + * + * from_urecptr - Where to start fetching the undo records. If we can not + * read all the records because of memory limit then this + * will be set to the previous undo record pointer from where + * we need to start fetching on next call. Otherwise it will + * be set to InvalidUndoRecPtr. + * to_urecptr - Last undo record pointer to be fetched. + * undo_apply_size - Memory segment limit to collect undo records. + * nrecords - Number of undo records read. + * one_page - Caller is applying undo only for one block not for + * complete transaction. If this is set true then instead + * of following transaction undo chain using prevlen we will + * follow the block prev chain of the block so that we can + * avoid reading many unnecessary undo records of the + * transaction.
+ */ +UndoRecInfo * +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, + int undo_apply_size, int *nrecords, bool one_page) Could you please make it clear in comments and assertions what the relation between from_urecptr and to_urecptr is and what they mean (they must be in the same undo log, one must be <= the other, both point to the *start* of a record, so it's not the same as the total range of undo)? undo_apply_size is not a good parameter name, because the function is useful for things other than applying records -- like the undoinspect() extension (or some better version of that), for example. Maybe max_result_size or something like that? +{ ... + /* Allocate memory for next undo record. */ + uur = palloc0(sizeof(UnpackedUndoRecord)); ... + + size = UnpackedUndoRecordSize(uur); + total_size += size; I see, so the unpacked records are still allocated one at a time. I guess that's OK for now. From some earlier discussion I had been expecting an arrangement where the actual records were laid out contiguously with their subcomponents (things they point to in palloc()'d memory) nearby. +static uint16 +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, + UndoLogCategory category) +{ ... + char prevlen[2]; ... + prev_rec_len = *(uint16 *) (prevlen); I don't think that's OK, and might crash on a non-Intel system. How about using a union of uint16 and char[2]? + /* Copy undo record transaction header if it is present. */ + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); I was wondering why you don't use D = S instead of memcpy(&D, &S, size) wherever you can, until I noticed you use these SizeOfXXX macros that don't include trailing padding from structs, and that's also how you allocate objects. Hmm. So if I were to complain about you not using plain old assignment whenever you can, I'd also have to complain about that. I think that that technique of defining a SizeOfXXX macro that excludes trailing bytes makes sense for writing into WAL or undo log buffers using memcpy(). I'm not sure it makes sense for palloc() and copying into typed variables like you're doing here and I think I'd prefer the notational simplicity of using the (very humble) type system facilities C gives us. (Some memory checker might not like it if you palloc(the shorter size) and then use = if the compiler chooses to implement it as memcpy sizeof().) +/* + * The below common information will be stored in the first undo record of the + * page. Every subsequent undo record will not store this information, if + * required this information will be retrieved from the first undo record of the + * page. + */ +typedef struct UndoCompressionInfo Shouldn't this say "Every subsequent record will not store this information *if it's the same as the relevant fields in the first record*"? +#define UREC_INFO_TRANSACTION 0x001 +#define UREC_INFO_RMID 0x002 +#define UREC_INFO_RELOID 0x004 +#define UREC_INFO_XID 0x008 Should we call this UREC_INFO_FXID, since it refers to a FullTransactionId? +/* + * Every undo record begins with an UndoRecordHeader structure, which is + * followed by the additional structures indicated by the contents of + * urec_info. All structures are packed into the alignment without padding + * bytes, and the undo record itself need not be aligned either, so care + * must be taken when reading the header.
+ */ I think you mean "All structures are packed into undo pages without considering alignment and without trailing padding bytes"? This comes from the definition of the SizeOfXXX macros IIUC. There might still be padding between members of some of those structs, no? Like this one, that has the second member at offset 2 on my system: +typedef struct UndoRecordHeader +{ + uint8 urec_type; /* record type code */ + uint16 urec_info; /* flag bits */ +} UndoRecordHeader; + +#define SizeOfUndoRecordHeader \ + (offsetof(UndoRecordHeader, urec_info) + sizeof(uint16)) +/* + * Information for a transaction to which this undo belongs. This + * also stores the dbid and the progress of the undo apply during rollback. + */ +typedef struct UndoRecordTransaction +{ + /* + * Undo block number where we need to start reading the undo for applying + * the undo action. InvalidBlockNumber means undo applying hasn't + * started for the transaction and MaxBlockNumber mean undo completely + * applied. And, any other block number means we have applied partial undo + * so next we can start from this block. + */ + BlockNumber urec_progress; + Oid urec_dbid; /* database id */ + UndoRecPtr urec_next; /* urec pointer of the next transaction */ +} UndoRecordTransaction; I propose that we rename this to UndoRecordGroupHeader (or something like that... maybe "Set", but we also use "set" as a verb in various relevant function names): 1. We'll also use these for the new "shared" records we recently invented that don't relate to a transaction. This is really about defining the unit of discarding; we throw away the whole set of records at once, which is why it's basically about providing a space for "urec_next". 2. Though it also holds rollback progress information, which is a transaction-specific concept, there can be more than one of these sets of records for a single transaction anyway. A single transaction can write undo stuff in more than one undo log (different categories perm/temp/unlogged/shared and also due to log switching when they are full). So really it's just a header for an arbitrary set of records, used to track when and how to discard them. If you agree with that idea, perhaps urec_next should become something like urec_next_group, too. "next" is a bit vague, especially for something as untyped as UndoRecPtr: someone might think it points to the next record. More soon. -- Thomas Munro https://enterprisedb.com
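PS: Concretely, the rename I'm proposing would look something like this (same layout as UndoRecordTransaction above, only the names change):

    typedef struct UndoRecordGroupHeader
    {
        BlockNumber urec_progress;    /* undo apply progress, as before */
        Oid         urec_dbid;        /* database id */
        UndoRecPtr  urec_next_group;  /* start of the next record group */
    } UndoRecordGroupHeader;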
Hi, Amit, short note: The patches aren't attached in patch order. Obviously a minuscule thing, but still nicer if that's not the case. Dilip, this also contains the start of a review for the undo record interface further down. On 2019-07-29 16:35:20 -0700, Andres Freund wrote: > <more after some errands> Here we go. I'm a bit worried about expanding the use of ReadBufferWithoutRelcache(). Not so much because of the relcache itself, but because it requires doing separate smgropen() calls. While not crazily expensive, it's also not free. Especially combined with closing all such relations at transaction end (c.f. AtEOXact_SMgr). I'm somewhat inclined to think that this requires a slightly bigger refactoring than done in this patch. Imo at the very least the smgr entries ought not to be unowned. But working towards not having to re-open the smgr entry for every single trivial request ought to be part of this too. > /* > + * ForgetLocalBuffer - drop a buffer from local buffers > + * > + * This is similar to bufmgr.c's ForgetBuffer, except that we do not need > + * to do any locking since this is all local. As with that function, this > + * must be used very carefully, since we'll cheerfully throw away dirty > + * buffers without any attempt to write them. > + */ > +void > +ForgetLocalBuffer(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum, > + BlockNumber blockNum) > +{ > + SMgrRelation smgr = smgropen(smgrid, rnode, BackendIdForTempRelations()); > + BufferTag tag; /* identity of target block */ > + LocalBufferLookupEnt *hresult; > + BufferDesc *bufHdr; > + uint32 buf_state; > + > + /* > + * If somehow this is the first request in the session, there's nothing to > + * do. (This probably shouldn't happen, though.) > + */ > + if (LocalBufHash == NULL) > + return; Given that the call to ForgetLocalBuffer() currently is unconditional, rather than checking the persistence of the undo log, I don't see why this wouldn't happen? > + /* mark buffer invalid */ > + bufHdr = GetLocalBufferDescriptor(hresult->id); > + CLEAR_BUFFERTAG(bufHdr->tag); > + buf_state = pg_atomic_read_u32(&bufHdr->state); > + buf_state &= ~(BM_VALID | BM_TAG_VALID | BM_DIRTY); > + pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); Shouldn't this also clear out at least the usagecount? I'd probably just use buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK); like InvalidateBuffer() does. I'd probably also add an assert ensuring that the refcount is zero. > @@ -97,7 +116,6 @@ static dlist_head unowned_relns; > /* local function prototypes */ > static void smgrshutdown(int code, Datum arg); > > - > /* > * smgrinit(), smgrshutdown() -- Initialize or shut down storage > * managers. spurious change. > +/* > + * While md.c expects random access and has a small number of huge > + * segments, undofile.c manages a potentially very large number of smaller > + * segments and has a less random access pattern. Therefore, instead of > + * keeping a potentially huge array of vfds we'll just keep the most > + * recently accessed N. > + * > + * For now, N == 1, so we just need to hold onto one 'File' handle. > + */ > +typedef struct UndoFileState > +{ > + int mru_segno; > + File mru_file; > +} UndoFileState; IMO N==1 gotta change before this is committable. There are too many design issues that could creep in without fixing this (e.g. not being careful enough about closing cached file handles after certain operations etc), that will be harder to fix later.
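To be concrete, I'd expect at least something along these lines (the size is obviously debatable, this is just a rough sketch):

    #define UNDOFILE_MRU_SIZE 8     /* cached segment file handles */

    typedef struct UndoFileState
    {
        int     mru_segno[UNDOFILE_MRU_SIZE];
        File    mru_file[UNDOFILE_MRU_SIZE];
        int     mru_next_victim;    /* simple round-robin eviction */
    } UndoFileState;

so that a workload alternating between a handful of segments doesn't thrash open()/close().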
> +void > +undofile_open(SMgrRelation reln) > +{ > + UndoFileState *state; > + > + state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState)); > + reln->private_data = state; > +} Hm, I don't quite like this 'private_data' design. Was that design discussed anywhere? Intuitively ISTM that it'd be better if SMgrRelation were embedded in a per-SMGR type struct. Obviously that'd not quite work as things are set up, because the size has to be constant due to SMgrRelationHash. But I think it might be good anyway if that hash just stored a pointer to the relevant SMgrRelation. > +void > +undofile_close(SMgrRelation reln, ForkNumber forknum) > +{ > +} Hm, aren't we leaking private_data right now? > +void > +undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo) > +{ > + /* > + * File creation is managed by undolog.c, but xlogutils.c likes to call > + * this just in case. Ignore. > + */ > +} Phew, this is not pretty. > +bool > +undofile_exists(SMgrRelation reln, ForkNumber forknum) > +{ > + elog(ERROR, "undofile_exists is not supported"); > + > + return false; /* not reached */ > +} This one I actually find bad. It seems pretty reasonable for SMGR-kind agnostic code to be able to know whether a file exists or not. > +void > +undofile_extend(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, char *buffer, > + bool skipFsync) > +{ > + elog(ERROR, "undofile_extend is not supported"); > +} This one I have much less problems with. > +void > +undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, > + char *buffer) > +{ > + File file; > + off_t seekpos; > + int nbytes; > + > + Assert(forknum == MAIN_FORKNUM); > + file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE); > + seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE)); > + Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE); I'd not name this seekpos, given that we're not seeking. > +BlockNumber > +undofile_nblocks(SMgrRelation reln, ForkNumber forknum) > +{ > + /* > + * xlogutils.c likes to call this to decide whether to read or extend; for > + * now we lie and say the relation is big as possible. > + */ > + return UndoLogMaxSize / BLCKSZ; > +} That's imo not ok. > /* > + * check_for_live_undo_data() > + * > + * Make sure there are no live undo records (aborted transactions that have > + * not been rolled back, or committed transactions whose undo data has not > + * yet been discarded). > + */ > +static void > +check_for_undo_data(ClusterInfo *cluster) > +{ > + PGresult *res; > + PGconn *conn = connectToServer(cluster, "template1"); > + > + if (GET_MAJOR_VERSION(old_cluster.major_version) < 1200) > + return; Needs to be updated now. > --- a/src/bin/pg_upgrade/exec.c > +++ b/src/bin/pg_upgrade/exec.c > @@ -351,6 +351,10 @@ check_data_dir(ClusterInfo *cluster) > check_single_dir(pg_data, "pg_clog"); > else > check_single_dir(pg_data, "pg_xact"); > + > + /* pg_undo is new in v13 */ > + if (GET_MAJOR_VERSION(cluster->major_version) >= 1200) > + check_single_dir(pg_data, "pg_undo"); > } The comment talks about v13, but the code checks for v12? > +++ b/src/bin/pg_upgrade/undo.c > @@ -0,0 +1,292 @@ > +/* > + * undo.c > + * > + * Support for upgrading undo logs. > + * Copyright (c) 2019, PostgreSQL Global Development Group > + * src/bin/pg_upgrade/undo.c > + */ A small design note here seems like a good idea. > +/* Undo log statuses.
*/ > +typedef enum > +{ > + UNDO_LOG_STATUS_UNUSED = 0, > + UNDO_LOG_STATUS_ACTIVE, > + UNDO_LOG_STATUS_FULL, > + UNDO_LOG_STATUS_DISCARDED > +} UndoLogStatus; An explanation of what these mean would be good. > +/* > + * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence > + * enumerator. > + */ > +#define UndoPersistenceForRelPersistence(rp) \ > + ((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT : \ > + (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP) > + > +/* > + * Convert from UndoPersistence to a relpersistence value. > + */ > +#define RelPersistenceForUndoPersistence(up) \ > + ((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT : \ > + (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED : \ > + RELPERSISTENCE_TEMP) We shouldn't add macros with multiple evaluation hazards without need. > +/* > + * Properties of an undo log that don't have explicit WAL records logging > + * their changes, to reduce WAL volume. Instead, they change incrementally > + * whenever data is inserted as a result of other WAL records. Since the > + * values recorded in an online checkpoint may be out of the sync (ie not the > + * correct values as at the redo LSN), these are backed up in buffer data on > + * first change after each checkpoint. > + */ s/on first/on the first/? > +/* > + * Instantiate fast inline hash table access functions. We use an identity > + * hash function for speed, since we already have integers and don't expect > + * many collisions. > + */ > +#define SH_PREFIX undologtable > +#define SH_ELEMENT_TYPE UndoLogTableEntry > +#define SH_KEY_TYPE UndoLogNumber > +#define SH_KEY number > +#define SH_HASH_KEY(tb, key) (key) > +#define SH_EQUAL(tb, a, b) ((a) == (b)) > +#define SH_SCOPE static inline > +#define SH_DECLARE > +#define SH_DEFINE > +#include "lib/simplehash.h" > + > +extern PGDLLIMPORT undologtable_hash *undologtable_cache; Why isn't this defined in a .c file? I've a bit of a hard time believing that making UndoLogGetTableEntry() an extern function would be a meaningful overhead compared to the operations this is used for. Not exposing those details seems nicer to me. > +/* Create a new undo log. */ > +typedef struct xl_undolog_create > +{ > + UndoLogNumber logno; > + Oid tablespace; > + UndoPersistence persistence; > +} xl_undolog_create; > + > +/* Extend an undo log by adding a new segment. */ > +typedef struct xl_undolog_extend > +{ > + UndoLogNumber logno; > + UndoLogOffset end; > +} xl_undolog_extend; > + > +/* Discard space, and possibly destroy or recycle undo log segments. */ > +typedef struct xl_undolog_discard > +{ > + UndoLogNumber logno; > + UndoLogOffset discard; > + UndoLogOffset end; > + TransactionId latestxid; /* latest xid whose undolog are discarded. */ > + bool entirely_discarded; > +} xl_undolog_discard; > + > +/* Switch undo log. */ > +typedef struct xl_undolog_switch > +{ > + UndoLogNumber logno; > + UndoRecPtr prevlog_xact_start; > + UndoRecPtr prevlog_last_urp; > +} xl_undolog_switch; I'd add flags to these. Perhaps I'm overly cautious, but I found that extremely valuable when having to fix bugs in already released versions. And these aren't so frequent that that'd hurt. Obviously entirely_discarded would then be a flag. 
> +extern void undofile_init(void); > +extern void undofile_shutdown(void); > +extern void undofile_open(SMgrRelation reln); > +extern void undofile_close(SMgrRelation reln, ForkNumber forknum); > +extern void undofile_create(SMgrRelation reln, ForkNumber forknum, > + bool isRedo); > +extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum); > +extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, > + bool isRedo); > +extern void undofile_extend(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, char *buffer, bool skipFsync); > +extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum); > +extern void undofile_read(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, char *buffer); > +extern void undofile_write(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, char *buffer, bool skipFsync); > +extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum, > + BlockNumber blocknum, BlockNumber nblocks); > +extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum); > +extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum, > + BlockNumber nblocks); > +extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum); > + > +/* Callbacks used by sync.c. */ > +extern int undofile_syncfiletag(const FileTag *tag, char *path); > +extern bool undofile_filetagmatches(const FileTag *tag, const FileTag *candidate); > + > +/* Management of checkpointer requests. */ > +extern void undofile_request_sync(UndoLogNumber logno, BlockNumber segno, > + Oid tablespace); > +extern void undofile_forget_sync(UndoLogNumber logno, BlockNumber segno, > + Oid tablespace); > +extern void undofile_forget_sync_tablespace(Oid tablespace); > +extern void undofile_request_sync_dir(Oid tablespace); Istm that it'd be better to only have mdopen/... referenced from smgrsw() and then have f_smgr be included (as a const f_smgr* const) as part of SMgrRelation. For one, that'll allow us to hide a lot more of this into md.c/undofile.c. It's also a pathway to being more extensible. I think performance ought to be at least as good (as we currently have to read SMgrRelation->smgr_which, then read the callback from smgrsw (which is probably combined into one op in many platforms), and then call it). Could then also just make the simple smgr* functions static inline wrappers, as that doesn't require exposing f_smgr anymore. > From 880f25a543783f8dc3784a51ab1c29b72f6b5b27 Mon Sep 17 00:00:00 2001 > From: Dilip Kumar <dilip.kumar@enterprisedb.com> > Date: Fri, 7 Jun 2019 15:03:37 +0530 > Subject: [PATCH 06/14] Defect and enhancement in multi-log support That's imo not a good thing to have in patch series intended to be reviewed, especially relatively early in the series. At least the commit message ought to include an explanation. > Subject: [PATCH 07/14] Provide interfaces to store and fetch undo records. > > Add the capability to form undo records and store them in undo logs. We > also provide the capability to fetch the undo records. This layer will use > undo-log-storage to reserve the space for the undo records and buffer > management routines to write and read the undo records. > > Undo records are stored in sequential order in the undo log. Maybe "In each und log undo records are stored in sequential order."? 
> +++ b/src/backend/access/undo/README.undointerface > @@ -0,0 +1,29 @@ > +Undo record interface layer > +--------------------------- > +This is the next layer which sits on top of the undo log storage, which will > +provide an interface for prepare, insert, or fetch the undo records. This > +layer will use undo-log-storage to reserve the space for the undo records > +and buffer management routine to write and read the undo records. The reference to "undo log storage" kinda seems like a reference into nothingness... > +Writing an undo record > +---------------------- > +To prepare an undo record, first, it will allocate required space using > +undo log storage module. Next, it will pin and lock the required buffers and > +return an undo record pointer where it will insert the record. Finally, it > +calls the Insert routine for final insertion of prepared record. Additionally, > +there is a mechanism for multi-insert, wherein multiple records are prepared > +and inserted at a time. I'm not sure whta this is telling me. Who is "it"? To me the filename ("interface"), and the title of this section, suggests this provides documentation on how to write code to insert undo records. But I don't think this does. > +Fetching and undo record > +------------------------ > +To fetch an undo record, a caller must provide a valid undo record pointer. > +Optionally, the caller can provide a callback function with the information of > +the block and offset, which will help in faster retrieval of undo record, > +otherwise, it has to traverse the undo-chain. > +There is also an interface to bulk fetch the undo records. Where the caller > +can provide a TO and FROM undo record pointer and the memory limit for storing > +the undo records. This API will return all the undo record between FROM and TO > +undo record pointers if they can fit into provided memory limit otherwise, it > +return whatever can fit into the memory limit. And, the caller can call it > +repeatedly until it fetches all the records. There's a lot of terminology in this file that's not been introduced. I think this needs to be greatly expanded and restructured to allow people unfamiliar with the code to benefit. > +/*------------------------------------------------------------------------- > + * > + * undoaccess.c > + * entry points for inserting/fetching undo records > + * NOTES: > + * Undo record layout: > + * > + * Undo records are stored in sequential order in the undo log. Each undo > + * record consists of a variable length header, tuple data, and payload > + * information. Is that actually true? There's records without tuples, no? > The first undo record of each transaction contains a > + * transaction header that points to the next transaction's start > header. Seems like this needs to reference different persistence levels, otherwise it seems misleading, given there can be multiple first records in multiple undo logs? > + * This allows us to discard the entire transaction's log at one-shot > rather s/at/in/ > + * than record-by-record. The callers are not aware of transaction header, s/of/of the/ > + * this is entirely maintained and used by undo record layer. See s/this/it/ > + * undorecord.h for detailed information about undo record header. 
s/undo record/the undo record/ I think at the very least there's explanations missing for: - what is the locking protocol for multiple buffers - what are the contexts for insertion - what phases an undo insertion happens in - updating previous records in general - what "packing" actually is > + > +/* Prototypes for static functions. */ Don't think we commonly include that... > +static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec, > + UndoRecPtr urp, RelFileNode rnode, > + UndoPersistence persistence, > + Buffer *prevbuf); > +static int UndoRecordPrepareTransInfo(UndoRecordInsertContext *context, > + UndoRecPtr xact_urp, int size, int offset); > +static void UndoRecordUpdateTransInfo(UndoRecordInsertContext *context, > + int idx); > +static void UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context, > + UndoRecPtr urecptr, UndoRecPtr xact_urp); > +static int UndoGetBufferSlot(UndoRecordInsertContext *context, > + RelFileNode rnode, BlockNumber blk, > + ReadBufferMode rbm); > +static uint16 UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > + UndoPersistence upersistence); > + > +/* > + * Structure to hold the prepared undo information. > + */ > +struct PreparedUndoSpace > +{ > + UndoRecPtr urp; /* undo record pointer */ > + UnpackedUndoRecord *urec; /* undo record */ > + uint16 size; /* undo record size */ > + int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array > + * index */ > +}; > + > +/* > + * This holds undo buffers information required for PreparedUndoSpace during > + * prepare undo time. Basically, during prepare time which is called outside > + * the critical section we will acquire all necessary undo buffers pin and lock. > + * Later, during insert phase we will write actual records into thse buffers. > + */ > +struct PreparedUndoBuffer > +{ > + UndoLogNumber logno; /* Undo log number */ > + BlockNumber blk; /* block number */ > + Buffer buf; /* buffer allocated for the block */ > + bool zero; /* new block full of zeroes */ > +}; Most files define datatypes before function prototypes, because functions may reference the datatypes. > +/* > + * Prepare to update the transaction header > + * > + * It's a helper function for PrepareUpdateNext and > + * PrepareUpdateUndoActionProgress This doesn't really explain much. PrepareUpdateUndoActionProgress doesnt' exist. I assume it's UndoRecordPrepareApplyProgress from 0012? > + * xact_urp - undo record pointer to be updated. > + * size - number of bytes to be updated. > + * offset - offset in undo record where to start update. > + */ These comments seem redundant with the parameter names. > +static int > +UndoRecordPrepareTransInfo(UndoRecordInsertContext *context, > + UndoRecPtr xact_urp, int size, int offset) > +{ > + BlockNumber cur_blk; > + RelFileNode rnode; > + int starting_byte; > + int bufidx; > + int index = 0; > + int remaining_bytes; > + XactUndoRecordInfo *xact_info; > + > + xact_info = &context->xact_urec_info[context->nxact_urec_info]; > + > + UndoRecPtrAssignRelFileNode(rnode, xact_urp); > + cur_blk = UndoRecPtrGetBlockNum(xact_urp); > + starting_byte = UndoRecPtrGetPageOffset(xact_urp); > + > + /* Remaining bytes on the current block. */ > + remaining_bytes = BLCKSZ - starting_byte; > + > + /* > + * Is there some byte of the urec_next on the current block, if not then > + * start from the next block. > + */ This comment needs rephrasing. > + /* Loop until we have fetched all the buffers in which we need to write. 
*/ > + while (size > 0) > + { > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL); > + xact_info->idx_undo_buffers[index++] = bufidx; > + size -= (BLCKSZ - starting_byte); > + starting_byte = UndoLogBlockHeaderSize; > + cur_blk++; > + } So, this locks a very large number of undo buffers at the same time, do I see that correctly? What guarantees that there are no deadlocks due to multiple buffers locked at the same time (I guess the order inside the log)? What guarantees that this is a small enough number that we can even lock all of them at the same time? Why do we need to lock all of them at the same time? That's not clear to me. Also, why do we need code to lock an unbounded number here? It seems hard to imagine we'd ever want to update more than something around 8 bytes? Shouldn't that at the most require two buffers? > +/* > + * Prepare to update the previous transaction's next undo pointer. > + * > + * We want to update the next transaction pointer in the previous transaction's > + * header (first undo record of the transaction). In prepare phase we will > + * unpack that record and lock the necessary buffers which we are going to > + * overwrite and store the unpacked undo record in the context. Later, > + * UndoRecordUpdateTransInfo will overwrite the undo record. > + * > + * xact_urp - undo record pointer of the previous transaction's header > + * urecptr - current transaction's undo record pointer which need to be set in > + * the previous transaction's header. > + */ > +static void > +UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context, > + UndoRecPtr urecptr, UndoRecPtr xact_urp) That name imo is confusing - it's not clear that it's not actually about the next record or such. > +{ > + UndoLogSlot *slot; > + int index = 0; > + int offset; > + > + /* > + * The absence of previous transaction's undo indicate that this backend *indicates > + /* > + * Acquire the discard lock before reading the undo record so that discard > + * worker doesn't remove the record while we are in process of reading it. > + */ *the discard worker > + LWLockAcquire(&slot->discard_update_lock, LW_SHARED); > + /* Check if it is already discarded. */ > + if (UndoLogIsDiscarded(xact_urp)) > + { > + /* Release lock and return. */ > + LWLockRelease(&slot->discard_update_lock); > + return; > + } Ho, hum. I don't quite remember what we decided in the discussion about not having to use the discard lock for this purpose. > + /* Compute the offset of the uur_next in the undo record. */ > + offset = SizeOfUndoRecordHeader + > + offsetof(UndoRecordTransaction, urec_next); > + > + index = UndoRecordPrepareTransInfo(context, xact_urp, > + sizeof(UndoRecPtr), offset); > + /* > + * Set the next pointer in xact_urec_info, this will be overwritten in > + * actual undo record during update phase. > + */ > + context->xact_urec_info[index].next = urecptr; What does "this will be overwritten mean"? It sounds like "context->xact_urec_info[index].next" would be overwritten, but that can't be true. > + /* We can now release the discard lock as we have read the undo record. */ > + LWLockRelease(&slot->discard_update_lock); > +} Hm. Because you expect it to be blocked behind the content lwlocks for the buffers? > +/* > + * Overwrite the first undo record of the previous transaction to update its > + * next pointer. > + * > + * This will insert the already prepared record by UndoRecordPrepareTransInfo. It doesn't actually appear to insert any records. 
At least not a record in the way the rest of the file uses that term? > + * This must be called under the critical section. s/under the/in a/ Think that should be asserted. > + /* > + * Start writing directly from the write offset calculated during prepare > + * phase. And, loop until we write required bytes. > + */ Why do we do offset calculations multiple times? Seems like all the offsets, and the split, should be computed in exactly one place. > +/* > + * Find the block number in undo buffer array > + * > + * If it is present then just return its index otherwise search the buffer and > + * insert an entry and lock the buffer in exclusive mode. > + * > + * Undo log insertions are append-only. If the caller is writing new data > + * that begins exactly at the beginning of a page, then there cannot be any > + * useful data after that point. In that case RBM_ZERO can be passed in as > + * rbm so that we can skip a useless read of a disk block. In all other > + * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't > + * happen to be already in the buffer pool. > + */ > +static int > +UndoGetBufferSlot(UndoRecordInsertContext *context, > + RelFileNode rnode, > + BlockNumber blk, > + ReadBufferMode rbm) > +{ > + int i; > + Buffer buffer; > + XLogRedoAction action = BLK_NEEDS_REDO; > + PreparedUndoBuffer *prepared_buffer; > + UndoPersistence persistence = context->alloc_context.persistence; > + > + /* Don't do anything, if we already have a buffer pinned for the block. */ As the code stands, it's locked, not just pinned. > + for (i = 0; i < context->nprepared_undo_buffer; i++) > + { How large do we expect this to get at most? > + /* > + * We did not find the block so allocate the buffer and insert into the > + * undo buffer array. > + */ > + if (InRecovery) > + action = XLogReadBufferForRedoBlock(context->alloc_context.xlog_record, > + SMGR_UNDO, > + rnode, > + UndoLogForkNum, > + blk, > + rbm, > + false, > + &buffer); Why is not locking the buffer correct here? Can't there be concurrent reads during hot standby? > +/* > + * This function must be called before all the undo records which are going to > + * get inserted under a single WAL record. How can a function be called "before all the undo records"? > + * nprepared - This defines the max number of undo records that can be > + * prepared before inserting them. > + */ > +void > +BeginUndoRecordInsert(UndoRecordInsertContext *context, > + UndoPersistence persistence, > + int nprepared, > + XLogReaderState *xlog_record) There definitely needs to be explanation about xlog_record. But also about memory management etc. Looks like one e.g. can't call this from a short lived memory context. > +/* > + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you > + * intended to insert. Upon return, the necessary undo buffers are pinned and > + * locked. Again, how is deadlocking / max number of buffers handled, and why do they all need to be locked at the same time? > + /* > + * We don't yet know if this record needs a transaction header (ie is the > + * first undo record for a given transaction in a given undo log), because > + * you can only find out by allocating. We'll resolve this circularity by > + * allocating enough space for a transaction header. We'll only advance > + * by as many bytes as we turn out to need. > + */ Why can we only find this out by allocating? This seems like an API deficiency of the storage layer to me. The information is in the und log slot's metadata, no? 
> + urec->uur_next = InvalidUndoRecPtr; > + UndoRecordSetInfo(urec); > + urec->uur_info |= UREC_INFO_TRANSACTION; > + urec->uur_info |= UREC_INFO_LOGSWITCH; > + size = UndoRecordExpectedSize(urec); > + > + /* Allocate space for the record. */ > + if (InRecovery) > + { > + /* > + * We'll figure out where the space needs to be allocated by > + * inspecting the xlog_record. > + */ > + Assert(context->alloc_context.persistence == UNDO_PERMANENT); > + urecptr = UndoLogAllocateInRecovery(&context->alloc_context, > + XidFromFullTransactionId(txid), > + size, > + &need_xact_header, > + &last_xact_start, > + &prevlog_xact_start, > + &prevlogurp); > + } > + else > + { > + /* Allocate space for writing the undo record. */ That's basically the same comment as before the if. > + urecptr = UndoLogAllocate(&context->alloc_context, > + size, > + &need_xact_header, &last_xact_start, > + &prevlog_xact_start, &prevlog_insert_urp); > + > + /* > + * If prevlog_xact_start is a valid undo record pointer that means > + * this transaction's undo records are split across undo logs. > + */ > + if (UndoRecPtrIsValid(prevlog_xact_start)) > + { > + uint16 prevlen; > + > + /* > + * If undo log is switch during transaction then we must get a "is switch" is right. > +/* > + * Insert a previously-prepared undo records. s/a// More tomorrow. Greetings, Andres Freund
On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Tue, Jul 30, 2019 at 5:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I think this idea is good for the DO time but during REDO time it will > > not work as we will not have the transaction state. Having said that > > the current idea of keeping in the global variable will also not work > > during REDO time because the WAL from the different transaction can be > > interleaved. There are few ideas to handle this issue > > > > 1. At DO time keep in TopTransactionState as you suggested and during > > recovery time read from the first complete record on the page. > > 2. Just to keep the code uniform always read from the first complete > > record of the page. > > It contains another > suggestion for that problem you just mentioned (and also me pointing > out what you just pointed out, since I wrote it earlier) though I'm > not sure if it's better than your options above. Thanks, Thomas for your review, Currently, I am replying to the problem which both of us has identified and found a different set of solutions. I will go through other comments soon and work on those. > +UndoRecPtr > +PrepareUndoInsert(UndoRecordInsertContext *context, > + UnpackedUndoRecord *urec, > + Oid dbid) > +{ > ... > + /* Fetch compression info for the transaction. */ > + compression_info = GetTopTransactionUndoCompressionInfo(category); > > How can this work correctly in recovery? [Edit: it doesn't, as you > just pointed out] > > > One data structure that could perhaps hold this would be > UndoLogTableEntry (the per-backend cache, indexed by undo log number, > with pretty fast lookups; used for things like > UndoLogNumberGetCategory()). As long as you never want to have > inter-transaction compression, that should have the right scope to > give recovery per-undo log tracking. If you ever wanted to do > compression between transactions too, maybe UndoLogSlot could work, > but that'd have more complications. I think this could be a good idea. I had thought of keeping in the slot as my 3rd option but later I removed it thinking that we need to expose the compression field to the undo log layer. I think keeping in the UndoLogTableEntry is a better idea than keeping in the slot. But, I still have the same problem that we need to expose undo record-level fields to undo log layer to compute the cache entry size. OTOH, If we decide to get from the first record of the page (as I mentioned up thread) then I don't think there is any performance issue because we are inserting on the same page. But, for doing that we need to unpack the complete undo record (guaranteed to be on one page). And, UnpackUndoData will internally unpack the payload data as well which is not required in our case unless we change UnpackUndoData such that it unpacks only what the caller wants (one input parameter will do). I am not sure out of these two which idea is better? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi Amit I've been testing some undo worker workloads (more on that soon), but here's a small thing: I managed to reach an LWLock self-deadlock in the undo worker launcher: diff --git a/src/backend/access/undo/undorequest.c b/src/backend/access/undo/undorequest.c ... +bool +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo *urinfo, ... + /* Search the queues under lock as they can be modified concurrently. */ + LWLockAcquire(RollbackRequestLock, LW_EXCLUSIVE); ... + RollbackHTRemoveEntry(rh->full_xid, rh->start_urec_ptr); ^ but that function acquires the same lock, leading to: (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00007fff5d110106 libsystem_kernel.dylib`semop + 10 frame #1: 0x0000000104bbf24c postgres`PGSemaphoreLock(sema=0x000000010e216a08) at pg_sema.c:428:15 frame #2: 0x0000000104c90186 postgres`LWLockAcquire(lock=0x000000010e218300, mode=LW_EXCLUSIVE) at lwlock.c:1246:4 frame #3: 0x000000010487463d postgres`RollbackHTRemoveEntry(full_xid=(value = 89144), start_urec_ptr=20890721090967) at undorequest.c:1717:2 frame #4: 0x0000000104873dbe postgres`UndoGetWork(allow_peek=false, remove_from_queue=false, urinfo=0x00007ffeeb4d3e30, in_other_db_out=0x0000000000000000) at undorequest.c:1388:5 frame #5: 0x0000000104876211 postgres`UndoLauncherMain(main_arg=0) at undoworker.c:607:7 ... (lldb) print held_lwlocks[0] (LWLockHandle) $0 = { lock = 0x000000010e218300 mode = LW_EXCLUSIVE } -- Thomas Munro https://enterprisedb.com
On Tue, Jul 30, 2019 at 1:32 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > Amit, short note: The patches aren't attached in patch order. Obviously > a miniscule thing, but still nicer if that's not the case. > Noted, I will try to ensure that patches are in order in future posts. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Hi Amit > > I've been testing some undo worker workloads (more on that soon), > One small point, there is one small bug in the error queues which is that the element pushed into error queue doesn't have an updated value of to_urec_ptr which is important to construct the hash key. This will lead to undolauncher/worker think that the action for the same is already processed and it removes the same from the hash table. I have a fix for the same which I will share in next version of the patch (which I am going to share in the next day or two). > but > here's a small thing: I managed to reach an LWLock self-deadlock in > the undo worker launcher: > I could see the problem, will fix in next version. Thank you for reviewing and testing this. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 16, 2019 at 2:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > 2. Introduced a new RMGR callback rm_undo_status. It is used to > > decide when record sets in the UNDO_SHARED category should be > > discarded (instead of the usual single xid-based rules). The possible > > answers are "discard me now!", "ask me again when a given XID is all > > visible", and "ask me again when a given XID is no longer running". > > From the minor nitpicking department, the patches from this stack that > are updating rmgrlist.h are consistently failing to update the comment > line preceding the list of PG_RMGR() lines. This looks to be patches > 0014 and 0015 in this stack; 0015 seems to need to be squashed into > 0014. > Fixed. You can verify in patch 0011-Infrastructure-to-execute-pending-undo-actions. The 'mask' was missing in the list as well which I have added here, we might want to commit that separately. > Reviewing Amit's 0016: > > performUndoActions appears to be badly-designed. For starters, it's > sometimes wrong: the only place it gets set to true is in > UndoActionsRequired (which is badly named, because from the name you > expect it to return a Boolean and to not have side effects, but > instead it doesn't return anything and does have side effects). > UndoActionsRequired() only gets called from selected places, like > AbortCurrentTransaction(), so the rest of the time it just returns a > wrong answer. Now maybe it's never called at those times, but there's > no guard to prevent a function like CanPerformUndoActions() (which is > also badly named, because performUndoActions tells you whether you > need to perform undo actions, not whether it's possible to perform > undo actions) from being called before the flag is set. I think that > this flag should be either (1) maintained eagerly - so that wherever > we set start_urec_ptr we also set the flag right away or (2) removed - > so when we need to know, we just loop over all of the undo categories > on the spot, which is not that expensive because there aren't that > many of them. > I have taken approach-2 to fix this. > It seems pointless to make PrepareTransaction() take undo pointers as > arguments, because those pointers are just extracted from the > transaction state, to which PrepareTransaction() has a pointer. > Fixed. > Thomas has already objected to another proposal to add functions that > turn 32-bit XIDs into 64-bit XIDs. Therefore, I feel confident in > predicting that he will likewise object to GetEpochForXid. I think > this needs to be changed somehow, maybe by doing what the XXX comment > you added suggests. > I will fix this later. I think we can separately write a patch to extend Two-phase file to use fulltransactionid and then use it here. > This patch has some problems with naming consistency. There's a > function called PushUndoRequest() which calls a function called > RegisterRollbackReq() to do the heart of the work. So, is it undo or > rollback? Are we pushing or registering? Is it a request or a req? > For bonus points, the flag that the function sets is called > undo_req_pushed, which is halfway in between the two competing > terminologies. Other gripes about PushUndoRequest: push is vague and > doesn't really explain what's happening, "apllying" is a typo, > per_level is a poor variable name and shouldn't be declared volatile. 
> This function has problems with naming in other places, too; please go > through all of the names carefully and make them consistent and > adequately descriptive. > I have changed the namings to make them consistent. If you see anything else, then do let me know. > I am not a fan of applying_subxact_undo. I think we should look for a > better design there. A couple of things occur to me. One is that we > don't necessarily need to go to FATAL; we could just force the current > transaction and all of its subtransactions fail all the way out to the > top level, but then perhaps allow new transactions to be started > afterwards. I'm not sure that's worth it, but it would work, and I > think it has precedent in SxactIsDoomed. Assuming we're going to stick > with the current FATAL plan, I think we should do something like > invent a new kind of critical section that forces ERROR to be promoted > to FATAL and then use it here. We could call it a semi-critical or > locally-critical section, and the undo machinery could use it, but > then also so could other things. I've wanted that sort of concept > before, so I think it's a good idea to try to have something general > and independent of undo. The same concept could be used in > PerformUndoActions() instead of having to invent > pg_rethrow_as_fatal(), so we'd have two uses for this mechanism right > away. > Okay, I have developed the concept of semi-critical section and used it for sub-transactions and temp tables. Kindly check if this is something that you have in mind? > FinishPreparedTransactions() tries to apply undo actions while > interrupts are still held. Is that necessary? Can we avoid it? > Fixed. > It seems highly likely that the logic added to the TBLOCK_SUBCOMMIT > case inside CommitTransactionCommand and also into > ReleaseCurrentSubTransaction should have been added to > CommitSubTransaction instead. If that's not true, then we have to > believe that the TBLOCK_SUBRELEASE call to CommitSubTransaction needs > different treatment from the other two cases, which sounds unlikely; > we also have to explain why undo is somehow different from all of > these other releases that are already handled in that function, not in > its callers. I also strongly suspect it is altogether wrong to do > this before CommitSubTransaction sets s->state to TRANS_COMMIT; what > if a subxact callback throws an error? > > For related reasons, I don't think that the change ReleaseSavepoint() > are right either. Notice the header comment: "As above, we don't > actually do anything here except change blockState." The "as above" > part of the comment probably didn't originally refer to > DefineSavepoint(), which definitely does do other stuff, but to > something like EndImplicitTransactionBlock() or EndTransactionBlock(), > and DefineSavepoint() got stuck in the middle later. Anyway, your > patch makes the comment false by doing actual state changes in this > function, rather than just marking the subtransactions for commit. > But why should that be right? If none of the many other bits of state > are manipulated here rather than in CommitSubTransaction(), why is > undo the one thing that is different? I guess this is basically just > compensation for the lack of any of this code in the TBLOCK_SUBRELEASE > path which I noted in the previous paragraph, but I still think the > right answer is to put it all in CommitSubTransaction() *after* we set > TRANS_COMMIT. > Changed as per suggestion. 
> There are a number of things I either don't like or don't understand > about PerformUndoActions. One is that undo_req_pushed gets passed to > this function. That just looks really odd from an abstraction point > of view. Basically, we have a function whose job is to "perform undo > actions," and it gets a flag as an argument that tells it to not > actually perform some of the undo actions: that's odd. I think the > reason it's like that is because of the issue we've been discussing > elsewhere that there's a separate undo request for each category. If > you didn't have that, you wouldn't need to do this here. I'm not > saying that proves that the one-request-per-persistence-level design > is definitely wrong, but this is certainly not a point in its favor, > at least IMHO. > I think we have discussed in detail about one-request-per-persistence-level design and I will investigate it to see if we can make it one-request-per-transaction and if not what are the challenges and can we overcome them without significantly more work and complexity. So for now, I have not changed anything related to this point. Apart from these comments, I have changed a few more things: a. Changed TWOPHASE_MAGIC define as we are changing TwoPhaseFileHeader. b. Fixed comments by Dilip on same patch [1]. I will respond to them separately. c. Fixed the problem reported by Thomas [2] and one similar problem in an error queue noticed by me. I have still not addressed all the comments raised. This is mainly to unblock Thomas's test and share whatever is done until now. I am posting all the patches, but have not modified anything related to undo-log and undo-interface patches (aka from 0001 to 0008). [1] - https://www.postgresql.org/message-id/CAFiTN-tObs5BQZETqK12QuOz7nPSXb90PdG49AzK2ZJ4ts1c5g%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CA%2BhUKGLv016-1y%3DCwx%2Bmme%2BcFRD5Bn03%3D2JVFnRB7JMLsA35%3Dw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Move-some-md.c-specific-logic-from-smgr.c-to-md.c.patch
- 0002-Prepare-to-support-multiple-SMGR-implementations.patch
- 0003-Add-undo-log-manager.patch
- 0004-Allow-WAL-record-data-on-first-modification-after-a-.patch
- 0005-Add-prefetch-support-for-the-undo-log.patch
- 0006-Defect-and-enhancement-in-multi-log-support.patch
- 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
- 0008-undo-page-consistency-checker.patch
- 0008-undo-page-consistency-checker-1.patch
- 0009-Extend-binary-heap-functionality.patch
- 0010-Infrastructure-to-register-and-fetch-undo-action-req.patch
- 0011-Infrastructure-to-execute-pending-undo-actions.patch
- 0012-Allow-foreground-transactions-to-perform-undo-action.patch
- 0013-Allow-execution-and-discard-of-undo-by-background-wo.patch
On Wed, Jul 24, 2019 at 10:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Please find my review comments for > 0013-Allow-foreground-transactions-to-perform-undo-action > > + /* initialize undo record locations for the transaction */ > + for (i = 0; i < UndoLogCategories; i++) > + { > + s->start_urec_ptr[i] = InvalidUndoRecPtr; > + s->latest_urec_ptr[i] = InvalidUndoRecPtr; > + s->undo_req_pushed[i] = false; > + } > > Can't we just memset this memory? > Yeah, that sounds better, so changed. > > + * We can't postpone applying undo actions for subtransactions as the > + * modifications made by aborted subtransaction must not be visible even if > + * the main transaction commits. > + */ > + if (IsSubTransaction()) > + return; > > I am not completely sure but is it possible that the outer function > CommitTransactionCommand/AbortCurrentTransaction can avoid > calling this function in the switch case based on the current state, > so that under subtransaction this will never be called? > I have already explained as a separate response to this email why I don't think this is a very good idea. > > + /* > + * Prepare required undo request info so that it can be used in > + * exception. > + */ > + ResetUndoRequestInfo(&urinfo); > + urinfo.dbid = dbid; > + urinfo.full_xid = fxid; > + urinfo.start_urec_ptr = start_urec_ptr[per_level]; > + > > I see that we are preparing urinfo before execute_undo_actions so that > in case of an error in CATCH we can use that to > insert into the queue, but can we just initialize urinfo right there > before inserting into the queue, we have all the information > Am I missing something? > IIRC, the only idea was that we can use the same variable (urinfo.full_xid) in execute_undo_actions call and in the catch block, but I think your suggestion sounds better as we can avoid declaring urinfo as volatile in that case. > + > + /* > + * We need the locations of the start and end undo record pointers when > + * rollbacks are to be performed for prepared transactions using undo-based > + * relations. We need to store this information in the file as the user > + * might rollback the prepared transaction after recovery and for that we > + * need it's start and end undo locations. > + */ > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > > it's -> its > > .. > > We must have some comments to explain how performUndoActions is used, > where it's set. If it's explained somewhere else then we can > give reference to that code. > > + for (i = 0; i < UndoLogCategories; i++) > + { > + if (s->latest_urec_ptr[i]) > + { > + s->performUndoActions = true; > + break; > + } > + } > > I think we should chek UndoRecPtrIsValid(s->latest_urec_ptr[i]) > Changed as per suggestion. > + PG_TRY(); > + { > + /* > + * Prepare required undo request info so that it can be used in > + * exception. > + */ > + ResetUndoRequestInfo(&urinfo); > + urinfo.dbid = dbid; > + urinfo.full_xid = fxid; > + urinfo.start_urec_ptr = start_urec_ptr[per_level]; > + > + /* for subtransactions, we do partial rollback. 
*/ > + execute_undo_actions(urinfo.full_xid, > + end_urec_ptr[per_level], > + start_urec_ptr[per_level], > + !isSubTrans); > + } > + PG_CATCH(); > > Wouldn't it be good to explain in comments that we are not rethrowing > the error in PG_CATCH but because we don't want the main > transaction to get an error if there is an error while applying to > undo action for the main transaction and we will abort the transaction > in the caller of this function? > I have added a comment atop of the function containing this code. > +tables are only accessible in the backend that has created them. We can't > +postpone applying undo actions for subtransactions as the modifications > +made by aborted subtransaction must not be visible even if the main transaction > +commits. > > I think we need to give detail reasoning why subtransaction changes > will be visible if we don't apply it's undo and the main > the transaction commits by mentioning that we don't use separate > transaction id for the subtransaction and that will make all the > changes of the transaction id visible when it commits. > I have added a detailed explanation in execute_undo_actions() and given a reference of same here. The changes are present in the patch series just posted by me [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > Hi Amit > > > > I've been testing some undo worker workloads (more on that soon), > > > > One small point, there is one small bug in the error queues which is > that the element pushed into error queue doesn't have an updated value > of to_urec_ptr which is important to construct the hash key. This > will lead to undolauncher/worker think that the action for the same is > already processed and it removes the same from the hash table. I have > a fix for the same which I will share in next version of the patch > (which I am going to share in the next day or two). > > > but > > here's a small thing: I managed to reach an LWLock self-deadlock in > > the undo worker launcher: > > > > I could see the problem, will fix in next version. > Fixed both of these problems in the patch just posted by me [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 22, 2019 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have reviewed 0012-Infrastructure-to-execute-pending-undo-actions, > Please find my comment so far. .. > 4. > +void > +undoaction_redo(XLogReaderState *record) > +{ > + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; > + > + switch (info) > + { > + case XLOG_UNDO_APPLY_PROGRESS: > + undo_xlog_apply_progress(record); > + break; > > For HotStandby it doesn't make sense to apply this wal as this > progress is only required when we try to apply the undo action after > restart > but in HotStandby we never apply undo actions. > Hmm, I think it is required. Think what if Hotstandby is later promoted to master and a large part of undo is already applied? In such a case, we can skip the already applied undo. > 6. > + if ((slot == NULL) || (UndoRecPtrGetLogNo(urecptr) != slot->logno)) > + slot = UndoLogGetSlot(UndoRecPtrGetLogNo(urecptr), false); > + > + Assert(slot != NULL); > We are passing missing_ok as false in UndoLogGetSlot. But, not sure > why we are expecting that undo lot can not be dropped. In multi-log > transaction it's possible > that the tablespace in which next undolog is there is already dropped? > If the transaction spans multiple logs, then both the logs should be in the same tablespace. So, how is it possible to drop the tablespace when part of undo is still pending? AFAICS, the code in choose_undo_tablespace() doesn't seem to allow switching tablespace for the same transaction, but I agree if someone used a different algorithm, then it might be possible. I think the important question is whether we should allow the same transactions undo to span across tablespaces? If so, then what you are telling makes sense and we should handle that, if not, then I think we are fine here. One might argue that there should be some more strong checks to ensure that the same transaction will always get the undo logs from the same tablespace, but I think that is a different thing then what you are raising here. Thomas, others, do you have any opinion on this matter? In FindUndoEndLocationAndSize, there is a check if the next log is discarded (Case 4: If the transaction is overflowed to ...), won't this case (considering it is possible) get detected by that check? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
I had a look at the UNDO patches at https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, and at the patch to use the UNDO logs to clean up orphaned files, from undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to review? Thanks Thomas and Amit and others for working on this! Orphaned relfiles has been an ugly wart forever. It's a small thing, but really nice to fix that finally. This has been a long thread, and I haven't read it all, so please forgive me if I repeat stuff that's already been discussed. There are similar issues in CREATE/DROP DATABASE code. If you crash in the middle of CREATE DATABASE, you can be left with orphaned files in the data directory, or if you crash in the middle of DROP DATABASE, the data might be gone already but the pg_database entry is still there. We should plug those holes too. There's a lot of stuff in the patches that are not relevant for cleaning up orphaned files. I know this cleaning up orphaned files work is mainly a vehicle to get the UNDO log committed, so that's expected. If we only cared about orphaned files, I'm sure the patches wouldn't spend so much effort on concurrency, for example. Nevertheless, I think we should leave out some stuff that's clearly unused, for now. For example, a bunch of fields in the record format: uur_block, uur_offset, uur_tuple. You can add them later, as part of the patches that actually need them, but for now they just make the patch larger to review. Some more thoughts on the record format: I feel that the level of abstraction is not quite right. There are a bunch of fields, like uur_block, uur_offset, uur_tuple, that are probably useful for some UNDO resource managers (zheap I presume), but seem kind of arbitrary. How is uur_tuple different from uur_payload? Should they be named more generically as uur_payload1 and uur_payload2? And why two, why not three or four different payloads? In the WAL record format, there's a concept of "block id", which allows you to store N number of different payloads in the record, I think that would be a better approach. Or only have one payload, and let the resource manager code divide it as it sees fit. Many of the fields support a primitive type of compression, where a field can be omitted if it has the same value as on the first record on an UNDO page. That's handy. But again I don't like the fact that the fields have been hard-coded into the UNDO record format. I can see e.g. the relation oid to be useful for many AMs. But not all. And other AMs might well want to store and deduplicate other things, aside from the fields that are in the patch now. I'd like to move most of the fields to AM specific code, and somehow generalize the compression. One approach would be to let the AM store an arbitrary struct, and run it through a general-purpose compression algorithm, using the UNDO page's first record as the "dictionary". Or make the UNDO page's first record available in whole to the AM specific code, and let the AM do the deduplication. For cleaning up orphaned files, though, we don't really care about any of that, so I'd recommend just ripping it out for now. Compression/deduplication can be added later as a separate patch. The orphaned-file cleanup patch doesn't actually use the uur_reloid field. It stores the RelFileNode instead, in the paylod. I think that's further evidence that the hard-coded fields in the record format are not quite right. 
I don't like the way UndoFetchRecord returns a palloc'd UnpackedUndoRecord. I would prefer something similar to the xlogreader API, where a new call to UndoFetchRecord invalidates the previous result. On efficiency grounds, to avoid the palloc, but also to be consistent with xlogreader. In the UNDO page header, there are a bunch of fields like pd_lower/pd_upper/pd_special that are copied from the "standard" page header, that are unused. There's a FIXME comment about that too. Let's remove them, there's no need for UNDO pages to look like standard relation pages. The LSN needs to be at the beginning, to work with the buffer manager, but that's the only requirement. Could we leave out the UNDO and discard worker processes for now? Execute all UNDO actions immediately at rollback, and after crash recovery. That would be fine for cleaning up orphaned files, and it would cut down the size of the patch to review. Can this race condition happen: Transaction A creates a table and an UNDO record to remember it. The transaction is rolled back, and the file is removed. Another transaction, B, creates a different table, and chooses the same relfilenode. It loads the table with data, and commits. Then the system crashes. After crash recovery, the UNDO record for the first transaction is applied, and it removes the file that belongs to the second table, created by transaction B. - Heikki
On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I had a look at the UNDO patches at > https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, > and at the patch to use the UNDO logs to clean up orphaned files, from > undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to > review? > Yes, I am not sure of cleanup orphan file patch (Thomas can confirm the same), but others are latest. > Thanks Thomas and Amit and others for working on this! Orphaned relfiles > has been an ugly wart forever. It's a small thing, but really nice to > fix that finally. This has been a long thread, and I haven't read it > all, so please forgive me if I repeat stuff that's already been discussed. > > There are similar issues in CREATE/DROP DATABASE code. If you crash in > the middle of CREATE DATABASE, you can be left with orphaned files in > the data directory, or if you crash in the middle of DROP DATABASE, the > data might be gone already but the pg_database entry is still there. We > should plug those holes too. > +1. Interesting. > There's a lot of stuff in the patches that are not relevant for cleaning > up orphaned files. I know this cleaning up orphaned files work is mainly > a vehicle to get the UNDO log committed, so that's expected. If we only > cared about orphaned files, I'm sure the patches wouldn't spend so much > effort on concurrency, for example. Nevertheless, I think we should > leave out some stuff that's clearly unused, for now. For example, a > bunch of fields in the record format: uur_block, uur_offset, uur_tuple. > You can add them later, as part of the patches that actually need them, > but for now they just make the patch larger to review. > > Some more thoughts on the record format: > > I feel that the level of abstraction is not quite right. There are a > bunch of fields, like uur_block, uur_offset, uur_tuple, that are > probably useful for some UNDO resource managers (zheap I presume), but > seem kind of arbitrary. How is uur_tuple different from uur_payload? > The uur_tuple field can only store tuple whereas uur_payload can have miscellaneous information. For ex. in zheap, we store transaction information like CID, CTID, some information related to TPD, etc. in the payload. Basically, I think eventually payload will have some bitmap to indicate what all is stored in it. OTOH, I agree that if we want we can store tuple as well in the payload. > Should they be named more generically as uur_payload1 and uur_payload2? > And why two, why not three or four different payloads? In the WAL record > format, there's a concept of "block id", which allows you to store N > number of different payloads in the record, I think that would be a > better approach. Or only have one payload, and let the resource manager > code divide it as it sees fit. > For payload, something like what you describe here sounds like a good idea, but I feel we can have tuple as a separate field. It will help in accessing tuple quickly and easily during visibility or rollbacks for some AM's like zheap. > Many of the fields support a primitive type of compression, where a > field can be omitted if it has the same value as on the first record on > an UNDO page. That's handy. But again I don't like the fact that the > fields have been hard-coded into the UNDO record format. I can see e.g. > the relation oid to be useful for many AMs. But not all. 
And other AMs > might well want to store and deduplicate other things, aside from the > fields that are in the patch now. I'd like to move most of the fields to > AM specific code, and somehow generalize the compression. One approach > would be to let the AM store an arbitrary struct, and run it through a > general-purpose compression algorithm, using the UNDO page's first > record as the "dictionary". Or make the UNDO page's first record > available in whole to the AM specific code, and let the AM do the > deduplication. For cleaning up orphaned files, though, we don't really > care about any of that, so I'd recommend just ripping it out for now. > Compression/deduplication can be added later as a separate patch. > I think this will make the undorecord-interface patch a bit simpler as well. > The orphaned-file cleanup patch doesn't actually use the uur_reloid > field. It stores the RelFileNode instead, in the paylod. I think that's > further evidence that the hard-coded fields in the record format are not > quite right. > > > I don't like the way UndoFetchRecord returns a palloc'd > UnpackedUndoRecord. I would prefer something similar to the xlogreader > API, where a new call to UndoFetchRecord invalidates the previous > result. On efficiency grounds, to avoid the palloc, but also to be > consistent with xlogreader. > > In the UNDO page header, there are a bunch of fields like > pd_lower/pd_upper/pd_special that are copied from the "standard" page > header, that are unused. There's a FIXME comment about that too. Let's > remove them, there's no need for UNDO pages to look like standard > relation pages. The LSN needs to be at the beginning, to work with the > buffer manager, but that's the only requirement. > > Could we leave out the UNDO and discard worker processes for now? > Execute all UNDO actions immediately at rollback, and after crash > recovery. That would be fine for cleaning up orphaned files, > Even if we execute all the undo actions on rollback, we need discard worker to discard undo at regular intervals. Also, what if we get an error while applying undo actions during rollback? Right now, we have a mechanism to push such a request to background worker and allow the session to continue. Instead, we might want to Panic in such cases if we don't want to have background undo workers. > and it > would cut down the size of the patch to review. > If we can find some way to handle all cases and everyone agrees to it, that would be good. In fact, we can try to get the basic stuff committed first and then try to get the rest (undo-worker machinery) done. > Can this race condition happen: Transaction A creates a table and an > UNDO record to remember it. The transaction is rolled back, and the file > is removed. Another transaction, B, creates a different table, and > chooses the same relfilenode. It loads the table with data, and commits. > Then the system crashes. After crash recovery, the UNDO record for the > first transaction is applied, and it removes the file that belongs to > the second table, created by transaction B. > I don't think such a race exists, but we should verify it once. Basically, once the rollback is complete, we mark the transaction rollback as complete in the transaction header in undo and write a WAL for it. After crash-recovery, we will skip such a transaction. Isn't that sufficient to prevent such a race condition? Thank you for looking into this work. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I had a look at the UNDO patches at > > https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com, > > and at the patch to use the UNDO logs to clean up orphaned files, from > > undo-2019-05-10.tgz earlier in this thread. Are these the latest ones to > > review? > > Yes, I am not sure of cleanup orphan file patch (Thomas can confirm > the same), but others are latest. I have a new patch set to post soon, handling all the feedback that arrived in the past couple of weeks from 5 different reviewers (thanks all!). > > There are similar issues in CREATE/DROP DATABASE code. If you crash in > > the middle of CREATE DATABASE, you can be left with orphaned files in > > the data directory, or if you crash in the middle of DROP DATABASE, the > > data might be gone already but the pg_database entry is still there. We > > should plug those holes too. > > > > +1. Interesting. Huh. Right. > > Could we leave out the UNDO and discard worker processes for now? > > Execute all UNDO actions immediately at rollback, and after crash > > recovery. That would be fine for cleaning up orphaned files, > > > > Even if we execute all the undo actions on rollback, we need discard > worker to discard undo at regular intervals. Also, what if we get an > error while applying undo actions during rollback? Right now, we have > a mechanism to push such a request to background worker and allow the > session to continue. Instead, we might want to Panic in such cases if > we don't want to have background undo workers. > > > and it > > would cut down the size of the patch to review. > > If we can find some way to handle all cases and everyone agrees to it, > that would be good. In fact, we can try to get the basic stuff > committed first and then try to get the rest (undo-worker machinery) > done. I think it's definitely worth exploring. > > Can this race condition happen: Transaction A creates a table and an > > UNDO record to remember it. The transaction is rolled back, and the file > > is removed. Another transaction, B, creates a different table, and > > chooses the same relfilenode. It loads the table with data, and commits. > > Then the system crashes. After crash recovery, the UNDO record for the > > first transaction is applied, and it removes the file that belongs to > > the second table, created by transaction B. > > I don't think such a race exists, but we should verify it once. > Basically, once the rollback is complete, we mark the transaction > rollback as complete in the transaction header in undo and write a WAL > for it. After crash-recovery, we will skip such a transaction. Isn't > that sufficient to prevent such a race condition? The usual protection against relfilenode recycling applies: we don't actually remove the files on disk until after the next checkpoint, following the successful rollback. That is, executing the rollback doesn't actually remove any files immediately, so you can't reuse the OID yet. There might be some problems like that if we tried to handle the CREATE DATABASE orphans you mentioned too naively though. Not sure. > Thank you for looking into this work. +1 -- Thomas Munro https://enterprisedb.com
On 05/08/2019 07:23, Thomas Munro wrote: > On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> Could we leave out the UNDO and discard worker processes for now? >>> Execute all UNDO actions immediately at rollback, and after crash >>> recovery. That would be fine for cleaning up orphaned files, >> >> Even if we execute all the undo actions on rollback, we need discard >> worker to discard undo at regular intervals. Also, what if we get an >> error while applying undo actions during rollback? Right now, we have >> a mechanism to push such a request to background worker and allow the >> session to continue. Instead, we might want to Panic in such cases if >> we don't want to have background undo workers. >> >>> and it >>> would cut down the size of the patch to review. >> >> If we can find some way to handle all cases and everyone agrees to it, >> that would be good. In fact, we can try to get the basic stuff >> committed first and then try to get the rest (undo-worker machinery) >> done. > > I think it's definitely worth exploring. Yeah. For cleaning up orphaned files, if unlink() fails, we can just log the error and move on. That's what we do in the main codepath, too. For any other error, PANIC seems ok. We're not expecting any errors during undo processing, so it doesn't seem safe to continue running. Hmm. Since applying the undo record is WAL-logged, you could run out of disk space while creating the WAL record. That seems unpleasant. >>> Can this race condition happen: Transaction A creates a table and an >>> UNDO record to remember it. The transaction is rolled back, and the file >>> is removed. Another transaction, B, creates a different table, and >>> chooses the same relfilenode. It loads the table with data, and commits. >>> Then the system crashes. After crash recovery, the UNDO record for the >>> first transaction is applied, and it removes the file that belongs to >>> the second table, created by transaction B. >> >> I don't think such a race exists, but we should verify it once. >> Basically, once the rollback is complete, we mark the transaction >> rollback as complete in the transaction header in undo and write a WAL >> for it. After crash-recovery, we will skip such a transaction. Isn't >> that sufficient to prevent such a race condition? Ok, I didn't realize there's a flag in the undo record to mark it as applied. Yeah, that fixes it. Seems a bit heavy-weight, but I guess it's fine. Do you do something different in zheap? I presume writing a WAL record for every applied undo record would be too heavy there. This needs some performance testing. We're creating one extra WAL record and one UNDO record for every file creation, and another WAL record on abort. It's probably cheap compared to all the other work done during table creation, but we should still get some numbers on it. Some regression tests would be nice too. - Heikki
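For illustration, Heikki's "log the error and move on" suggestion for the undo-apply path might look roughly like this (a sketch only; the surrounding apply function is imagined):

    /* While applying a Storage/CREATE undo record, try to remove the file. */
    if (unlink(path) < 0 && errno != ENOENT)
        ereport(LOG,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", path)));

Any other error raised while applying undo would then be promoted to PANIC, since without background undo workers there is no way to park the request and retry it later.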
On Mon, Aug 5, 2019 at 12:09 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 05/08/2019 07:23, Thomas Munro wrote: > > On Mon, Aug 5, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> On Sun, Aug 4, 2019 at 2:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >>> Could we leave out the UNDO and discard worker processes for now? > >>> Execute all UNDO actions immediately at rollback, and after crash > >>> recovery. That would be fine for cleaning up orphaned files, > >> > >> Even if we execute all the undo actions on rollback, we need discard > >> worker to discard undo at regular intervals. Also, what if we get an > >> error while applying undo actions during rollback? Right now, we have > >> a mechanism to push such a request to background worker and allow the > >> session to continue. Instead, we might want to Panic in such cases if > >> we don't want to have background undo workers. > >> > >>> and it > >>> would cut down the size of the patch to review. > >> > >> If we can find some way to handle all cases and everyone agrees to it, > >> that would be good. In fact, we can try to get the basic stuff > >> committed first and then try to get the rest (undo-worker machinery) > >> done. > > > > I think it's definitely worth exploring. > > Yeah. For cleaning up orphaned files, if unlink() fails, we can just log > the error and move on. That's what we do in the main codepath, too. For > any other error, PANIC seems ok. We're not expecting any errors during > undo processing, so it doesn't seem safe to continue running. > > Hmm. Since applying the undo record is WAL-logged, you could run out of > disk space while creating the WAL record. That seems unpleasant. > We might get away with doing some minimal error handling for the orphaned-file cleanup patch, but this facility was supposed to be a generic one. Even assuming all of us agree on the error-handling stuff, I still think we won't be able to avoid the requirement for a discard worker to discard the logs. > >>> Can this race condition happen: Transaction A creates a table and an > >>> UNDO record to remember it. The transaction is rolled back, and the file > >>> is removed. Another transaction, B, creates a different table, and > >>> chooses the same relfilenode. It loads the table with data, and commits. > >>> Then the system crashes. After crash recovery, the UNDO record for the > >>> first transaction is applied, and it removes the file that belongs to > >>> the second table, created by transaction B. > >> > >> I don't think such a race exists, but we should verify it once. > >> Basically, once the rollback is complete, we mark the transaction > >> rollback as complete in the transaction header in undo and write a WAL > >> for it. After crash-recovery, we will skip such a transaction. Isn't > >> that sufficient to prevent such a race condition? > > Ok, I didn't realize there's a flag in the undo record to mark it as > applied. Yeah, that fixes it. Seems a bit heavy-weight, but I guess it's > fine. Do you do something different in zheap? I presume writing a WAL > record for every applied undo record would be too heavy there. > For zheap, we collect all the records of a page, apply them together and then write the entire page in WAL. The progress of transaction is updated at either transaction end (rollback complete) or after processing some threshold of undo records. So, generally, the WAL won't be for each undo record apply. > This needs some performance testing.
We're creating one extra WAL record > and one UNDO record for every file creation, and another WAL record on > abort. It's probably cheap compared to all the other work done during > table creation, but we should still get some numbers on it. > Makes sense. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
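For illustration, the per-page batching Amit describes might look roughly like this (a sketch with invented helper names, not the actual zheap code):

    /* Apply every collected undo record for one page, then log the page once. */
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    START_CRIT_SECTION();
    for (int i = 0; i < nrecords; i++)
        apply_one_undo_record(page, records[i]);    /* hypothetical */
    MarkBufferDirty(buffer);
    log_newpage_buffer(buffer, true);   /* one full-page WAL record for the page */
    END_CRIT_SECTION();
    UnlockReleaseBuffer(buffer);

That amortizes the WAL overhead over however many undo records touched the page.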
On Mon, Aug 5, 2019 at 6:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > For zheap, we collect all the records of a page, apply them together > and then write the entire page in WAL. The progress of transaction is > updated at either transaction end (rollback complete) or after > processing some threshold of undo records. So, generally, the WAL > won't be for each undo record apply. This explanation omits a crucial piece of the mechanism, because Heikki is asking what keeps the undo from being applied multiple times. When we apply the undo records to a page, we also adjust the undo pointers in the page. Since we have an undo pointer per transaction slot, and each transaction has its own slot, if we apply all the undo for a transaction to a page, we can just clear the slot; if we somehow end up back at the same point later, we'll know not to apply the undo a second time because we'll see that there's no transaction slot pointing to the undo we were thinking of applying. If we roll back to a savepoint, or for some other reason choose to apply only some of the undo to a page, we can set the undo record pointer for the transaction back to the value it had before we generated any newer undo. Then, we'll know that the newer undo doesn't need to be applied but the older undo can be applied. At least, I think that's how it's supposed to work. If you just update the progress field, it doesn't guarantee anything, because in the event of a crash, we could end up keeping the page changes but losing the update to the progress, as they are part of separate undo records. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
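Spelled out as code, the invariant might look like this (all names invented for illustration):

    /* Before applying one undo record for fxid to a page: */
    slot = page_get_transaction_slot(page, fxid);       /* hypothetical */
    if (slot == NULL || slot->urec_ptr < this_urec_ptr)
        return;     /* slot cleared or already rewound past us: already applied */

    apply_undo_to_page(page, record);                   /* hypothetical */
    slot->urec_ptr = record->prev_urec_ptr;             /* rewind; clear when fully undone */

Because the slot pointer and the page contents are updated under the same buffer lock and covered by the same WAL record, replaying the same undo record a second time falls out as a no-op.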
On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I feel that the level of abstraction is not quite right. There are a > bunch of fields, like uur_block, uur_offset, uur_tuple, that are > probably useful for some UNDO resource managers (zheap I presume), but > seem kind of arbitrary. How is uur_tuple different from uur_payload? > Should they be named more generically as uur_payload1 and uur_payload2? > And why two, why not three or four different payloads? In the WAL record > format, there's a concept of "block id", which allows you to store N > number of different payloads in the record, I think that would be a > better approach. Or only have one payload, and let the resource manager > code divide it as it sees fit. > > Many of the fields support a primitive type of compression, where a > field can be omitted if it has the same value as on the first record on > an UNDO page. That's handy. But again I don't like the fact that the > fields have been hard-coded into the UNDO record format. I can see e.g. > the relation oid to be useful for many AMs. But not all. And other AMs > might well want to store and deduplicate other things, aside from the > fields that are in the patch now. I'd like to move most of the fields to > AM specific code, and somehow generalize the compression. One approach > would be to let the AM store an arbitrary struct, and run it through a > general-purpose compression algorithm, using the UNDO page's first > record as the "dictionary". I thought about this, too. I agree that there's something a little unsatisfying about the current structure, but I haven't been able to come up with something that seems definitively better. I think something along the lines of what you are describing here might work well, but I am VERY doubtful about the idea of a fixed-size struct. I think AMs are going to want to store variable-length data: especially tuples, but maybe also other stuff. For instance, imagine some AM that wants to implement locking that's more fine-grained than the four levels of tuple locks we have today: instead of just having key locks and all-columns locks, you could want to store the exact columns to be locked. Or maybe your TIDs are variable-width. And the problem is that as soon as you move to something where you pack in a bunch of variable-sized fields, you lose the ability to refer to things using reasonable names. That's where I came up with the idea of an UnpackedUndoRecord: give the common fields that "everyone's going to need" human-readable names, and jam only the strange, AM-specific stuff into the payload. But if those needs are not actually universal but very much AM-specific, then I'm afraid we're going to end up with deeply inscrutable code for packing and unpacking records. I imagine it's possible to come up with a good structure for that, but I don't think we have one today. > I don't like the way UndoFetchRecord returns a palloc'd > UnpackedUndoRecord. I would prefer something similar to the xlogreader > API, where a new call to UndoFetchRecord invalidates the previous > result. On efficiency grounds, to avoid the palloc, but also to be > consistent with xlogreader. I don't think that's going to work very well, because we often need to deal with multiple records at a time.
There is (or was) a bulk-fetch interface, but I've also found while experimenting with this code that it can be useful to do things like:

current = undo_fetch(starting_record);
loop:
    next = undo_fetch(current->next_record_ptr);
    if some_test(next):
        break;
    undo_free(current);
    current = next;

I think we shouldn't view such cases as exceptions to the general paradigm of looking at undo records one at a time, but instead as the normal case for which everything is optimized. Cases like orphaned file cleanup where the number of undo records is probably small and they're all independent of each other will, I think, turn out to be the exception rather than the rule. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 30, 2019 at 4:02 AM Andres Freund <andres@anarazel.de> wrote: > I'm a bit worried about expanding the use of > ReadBufferWithoutRelcache(). Not so much because of the relcache itself, > but because it requires doing separate smgropen() calls. While not > crazily expensive, it's also not free. Especially combined with closing > all such relations at transaction end (c.f. AtEOXact_SMgr). > > I'm somewhat inclined to think that this requires a slightly bigger > refactoring than done in this patch. Imo at the very least the smgr > entries ought not to be unowned. But working towards not haven to > re-open the smgr entry for every single trival request ought to be part > of this too. I spent some time trying to analyze this today and I agree with you that there seems to be room for improvement here. When I first looked at your comments, I wasn't too convinced, because access patterns that skip around between undo logs seem like they may be fairly common. Admittedly, there are cases where we want to read from just one undo log over and over again, and it would be good to optimize those, but I was initially a bit unconvinced that there was a problem here worth being concerned about. Then I realized that you would also repeat the smgropen() if you read a single record that happens to be split across two pages, which seems a little silly. But then I realized that we're being a little silly even in the case where we're reading a single undo record that is stored entirely on a single page. We are certainly going to need to look up the undo log, but as things stand, we'll basically do it twice. For example, in the write path, we'll call UndoLogAllocate() and it will look up an UndoLogControl object for the undo log of interest, and then we'll call ReadBufferWithoutRelcache() which will call smgropen() which will do a hash table lookup to find the SMgrRelation associated with that undo log. That's not a large cost, as you say, but it does seem like it might be better to avoid having two different lookups in the same commonly-used code path, each of which peeks into a different backend-private data structure for information about the very same undo log. The obvious thing to do seems to be to have UndoLogControl objects own SmgrRelations. That would be something of a novelty, since it looks like currently only a Relation ever owns an SMgrRelation, but the smgr infrastructure seems to have been set up in a generic way so as to permit that sort of thing, so it seems like it should be workable. Perhaps the UndoLogAllocate() function could return a pointer to the UndoLogControl object as well as UndoRecPtr. Then, there could be a function UndoLogWrite(UndoLogControl *, UndoRecPtr, char *, Size). On the read side, instead of calling UndoRecPtrAssignRelFileNode, maybe the undo log storage layer should provide a function that again returns an UndoLogControl, and then we could have a matching function UndoLogRead(UndoLogControl *, UndoRecPtr, char *, Size). I think this kind of design would address your concerns about using the unowned list, too, since the UndoLogControl objects would be owning the SMgrRelations. It took me a while to understand why you were concerned about using the unowned list, so I'm going to repeat it in my own words to make sure I've got it right, and also to possibly help out anyone else who may also have had difficulty grokking your concern.
If we have a bunch of short transactions each of which accesses the same relation, the relcache entry will remain open and the file won't get closed in between, but if we have a bunch of short transactions each of which accesses the same undo log, the undo log will be closed and reopened at the operating system level for each individual transaction. That happens because when an SMgrRelation is "owned," the owner takes care of closing it, and so can keep it open across transactions, but when it's "unowned," it's automatically closed during transaction cleanup. And we should fix it, because closing and reopening the same file for every transaction unnecessarily might be expensive enough to matter, at least a little bit. How does all that sound? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
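As a sketch, that ownership change might look like this (struct and function names invented; smgropen() and smgrsetowner() are the existing smgr APIs):

    typedef struct UndoLogControl
    {
        UndoLogNumber   logno;
        RelFileNode     rnode;
        SMgrRelation    smgr;       /* owned, so AtEOXact_SMgr() leaves it open */
        /* ... other existing fields ... */
    } UndoLogControl;

    static void
    undo_log_attach_smgr(UndoLogControl *log)
    {
        if (log->smgr == NULL)
        {
            log->smgr = smgropen(log->rnode, InvalidBackendId);
            smgrsetowner(&log->smgr, log->smgr);    /* register ownership */
        }
    }

UndoLogWrite()/UndoLogRead() would then use log->smgr directly, giving one lookup per operation instead of two.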
Hi, On 2019-08-05 11:25:10 -0400, Robert Haas wrote: > The obvious thing to do seems to be to have UndoLogControl objects own > SmgrRelations. That would be something of a novelty, since it looks > like currently only a Relation ever owns an SMgrRelation, but the smgr > infrastructure seems to have been set up in a generic way so as to > permit that sort of thing, so it seems like it should be workable. Yea, I think that'd be a good step. I'm not 100% convinced it's quite enough, due to the way the undo smgr only ever has a single file descriptor open, and that undo log segments are fairly small, and that there'll often be multiple persistence levels active at the same time. But the undo fd handling is probably a separate concern from who owns the smgr relations. > I think this kind of design would address your concerns about using > the unowned list, too, since the UndoLogControl objects would be > owning the SMgrRelations. Yup. > How does all that sound? A good move in the right direction, imo. Greetings, Andres Freund
Hi, (As I was out of context due to dealing with bugs, I've switched to looking at the current zheap/undoprocessing branch.) On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > +/* > + * Insert a previously-prepared undo records. > + * > + * This function will write the actual undo record into the buffers which are > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This > + * step should be performed inside a critical section. > + */ Again, I think it's not ok to just assume you can lock an essentially unbounded number of buffers. This seems almost guaranteed to result in deadlocks. And there's limits on how many lwlocks one can hold etc. As far as I can tell there's simply no deadlock avoidance scheme in use here *at all*? I must be missing something. > + /* Main loop for writing the undo record. */ > + do > + { I'd prefer this to not be a do{} while(true) loop - as written I need to read to the end to see what the condition is. I don't think we have any loops like that in the code. > + /* > + * During recovery, there might be some blocks which are already > + * deleted due to some discard command so we can just skip > + * inserting into those blocks. > + */ > + if (!BufferIsValid(buffer)) > + { > + Assert(InRecovery); > + > + /* > + * Skip actual writing just update the context so that we have > + * write offset for inserting into next blocks. > + */ > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + break; > + } How exactly can this happen? > + else > + { > + page = BufferGetPage(buffer); > + > + /* > + * Initialize the page whenever we try to write the first > + * record in page. We start writing immediately after the > + * block header. > + */ > + if (starting_byte == UndoLogBlockHeaderSize) > + UndoPageInit(page, BLCKSZ, prepared_undo->urec->uur_info, > + ucontext.already_processed, > + prepared_undo->urec->uur_tuple.len, > + prepared_undo->urec->uur_payload.len); > + > + /* > + * Try to insert the record into the current page. If it > + * doesn't succeed then recall the routine with the next page. > + */ > + InsertUndoData(&ucontext, page, starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + MarkBufferDirty(buffer); > + break; At this point we're five indentation levels deep. I'd extract at least either the per prepared undo code or the code performing the writing across block boundaries into a separate function. Perhaps both. > +/* > + * Helper function for UndoGetOneRecord > + * > + * If any of rmid/reloid/xid/cid is not available in the undo record, then > + * it will get the information from the first complete undo record in the > + * page. > + */ > +static void > +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp, > + RelFileNode rnode, UndoLogCategory category, Buffer buffer) > +{ > + /* > + * If any of the common header field is not available in the current undo > + * record then we must read it from the first complete record of the page. > + */ How is it guaranteed that the first record on the page is actually from the current transaction? Can't there be a situation where that's from another transaction? > +/* > + * Helper function for UndoFetchRecord and UndoBulkFetchRecord > + * > + * curbuf - If an input buffer is valid then this function will not release the > + * pin on that buffer.
If the buffer is not valid then it will assign curbuf > + * with the first buffer of the current undo record and also it will keep the > + * pin and lock on that buffer in a hope that while traversing the undo chain > + * the caller might want to read the previous undo record from the same block. > + */ Wait, so at exit *curbuf is pinned but not locked, if passed in, but is pinned *and* locked when not? That'd not be a sane API. I don't think the code works like that atm though. > +static UnpackedUndoRecord * > +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode, > + UndoLogCategory category, Buffer *curbuf) > +{ > + Page page; > + int starting_byte = UndoRecPtrGetPageOffset(urp); > + BlockNumber cur_blk; > + UndoPackContext ucontext = {{0}}; > + Buffer buffer = *curbuf; > + > + cur_blk = UndoRecPtrGetBlockNum(urp); > + > + /* Initiate unpacking one undo record. */ > + BeginUnpackUndo(&ucontext); > + > + while (true) > + { > + /* If we already have a buffer then no need to allocate a new one. */ > + if (!BufferIsValid(buffer)) > + { > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, > + RBM_NORMAL, NULL, > + RelPersistenceForUndoLogCategory(category)); > + > + /* > + * Remember the first buffer where this undo started as next undo > + * record what we fetch might fall on the same buffer. > + */ > + if (!BufferIsValid(*curbuf)) > + *curbuf = buffer; > + } > + > + /* Acquire shared lock on the buffer before reading undo from it. */ > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = BufferGetPage(buffer); > + > + UnpackUndoData(&ucontext, page, starting_byte); > + > + /* > + * We are done if we have reached to the done stage otherwise move to > + * next block and continue reading from there. > + */ > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + if (buffer != *curbuf) > + UnlockReleaseBuffer(buffer); > + > + /* > + * Get any of the missing fields from the first record of the > + * page. > + */ > + GetCommonUndoRecInfo(&ucontext, urp, rnode, category, *curbuf); > + break; > + } > + > + /* > + * The record spans more than a page so we would have copied it (see > + * UnpackUndoRecord). In such cases, we can release the buffer. > + */ Where would it have been copied? Presumably in UnpackUndoData()? Imo the comment should say so. I'm a bit confused by the use of "would" in that comment. Either we have, or not? > + if (buffer != *curbuf) > + UnlockReleaseBuffer(buffer); Wait, so we *keep* the buffer locked if it the same as *curbuf? That can't be right. > + * Fetch the undo record for given undo record pointer. > + * > + * This will internally allocate the memory for the unpacked undo record which > + * intern will "intern" should probably be internally? But I'm not sure what the two "internally"s really add here. > +/* > + * Release the memory of the undo record allocated by UndoFetchRecord and > + * UndoBulkFetchRecord. > + */ > +void > +UndoRecordRelease(UnpackedUndoRecord *urec) > +{ > + /* Release the memory of payload data if we allocated it. */ > + if (urec->uur_payload.data) > + pfree(urec->uur_payload.data); > + > + /* Release memory of tuple data if we allocated it. */ > + if (urec->uur_tuple.data) > + pfree(urec->uur_tuple.data); > + > + /* Release memory of the transaction header if we allocated it. */ > + if (urec->uur_txn) > + pfree(urec->uur_txn); > + > + /* Release memory of the logswitch header if we allocated it. 
*/ > + if (urec->uur_logswitch) > + pfree(urec->uur_logswitch); > + > + /* Release the memory of the undo record. */ > + pfree(urec); > +} Those comments before each pfree are not useful. Also, isn't this both fairly slow and fairly failure prone? The next record is going to need all that memory again, no? It seems to me that there should be one record that's allocated once, and then reused over multiple fetches, increasing the size if necesssary. I'm very doubtful that all this freeing of individual allocations in the undo code makes sense. Shouldn't this just be done in short lived memory contexts, that then get reset as a whole? That's both far less failure prone, and faster. > + * one_page - Caller is applying undo only for one block not for > + * complete transaction. If this is set true then instead > + * of following transaction undo chain using prevlen we will > + * follow the block prev chain of the block so that we can > + * avoid reading many unnecessary undo records of the > + * transaction. > + */ > +UndoRecInfo * > +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, > + int undo_apply_size, int *nrecords, bool one_page) There's no caller for one_page mode in the series - I assume that's for later, during page-wise undo? It seems to behave in quite noticably different ways, is that really OK? Makes the code quite hard to understand. Also, it seems quite poorly named to me. It sounds like it's about fetching a single undo page (which makes no sense, obviously). But what it does is to switch to an entirely different way of traversing the undo chains. > + /* > + * In one_page mode we are fetching undo only for one page instead of > + * fetching all the undo of the transaction. Basically, we are fetching > + * interleaved undo records. So it does not make sense to do any prefetch > + * in that case. What does "interleaved" mean here? I assume that there will often be other UNDO records interspersed? But that's not guaranteed at all, right? In fact, for a lot of workloads it seems likely that there will be many consecutive undo records for a single page? In fact, won't that be the majority of cases? Thus it's not obvious to me that there's not often going to be consecutive pages for this case too. I'd even say that minimizing IO delay is *MORE* important during page-wise undo, as that happens in the context of client accesses, and it's not incurring cost on the party that performed DML, but on some random third party. I'm doubtful this is a sane interface. There's a lot of duplication between one_page and not one_page. It presupposes specific ways of constructing chains that are likely to depend on the AM. to_urecptr is only used in certain situations. E.g. I strongly suspect that for zheap's visibility determinations we'd want to concurrently follow all the necessary chains to determine visibility for all all tuples on the page, far enough to find the visible tuple - for seqscan's / bitmap heap scans / everything using page mode scans, that'll be way more efficient than doing this one-by-one and possibly even repeatedly. But what is exactly the right thing to do is going to be highly AM specific. I vaguely suspect what you'd want is an interface where the "bulk fetch" context basically has a FIFO queue of undo records to fetch, and a function to actually perform fetching. Whenever a record has been retrieved, a callback determines whether additional records are needed. 
In the case of fetching all the undo for a transaction, you'd just queue - probably in a more efficient representation - all the necessary undo. In case of page-wise undo, you'd queue the first record of the chain you'd want to undo, with a callback for queuing the next record. For visibility determinations in zheap, you'd queue all the different necessary chains, with a callback that queues the next necessary record if still needed for visibility determination. And then I suspect you'd have a separate callback whenever records have been fetched, with all the 'unconsumed' records. That then can, e.g. based on memory consumption, decide to process them or not. For visibility information you'd probably just want to condense the records to the minimum necessary (i.e. visibility information for the relevant tuples, and the visibile tuple when encountered) as soon as available. Obviously that's pretty handwavy. > Also, if we are fetching undo records from more than one > + * log, we don't know the boundaries for prefetching. Hence, we can't use > + * prefetching in this case. > + */ Hm. Why don't we know the boundaries (or cheaply infer them)? > + /* > + * If prefetch_pages are half of the prefetch_target then it's time to > + * prefetch again. > + */ > + if (prefetch_pages < prefetch_target / 2) > + PrefetchUndoPages(rnode, prefetch_target, &prefetch_pages, to_blkno, > + from_blkno, category); Hm. Why aren't we prefetching again as soon as possible? Given the current code there's not really benefit in fetching many adjacent pages at once. And this way it seems we're somewhat likely to cause fairly bursty IO? > + /* > + * In one_page mode it's possible that the undo of the transaction > + * might have been applied by worker and undo got discarded. Prevent > + * discard worker from discarding undo data while we are reading it. > + * See detail comment in UndoFetchRecord. In normal mode we are > + * holding transaction undo action lock so it can not be discarded. > + */ I don't really see a comment explaining this in UndoFetchRecord. Are you referring to InHotStandby? Because there's no comment about one_page mode as far as I can tell? The comment is clearly referring to that, rather than InHotStandby? > + if (one_page) > + { > + /* Refer comments in UndoFetchRecord. */ Missing "to". > + if (InHotStandby) > + { > + if (UndoRecPtrIsDiscarded(urecptr)) > + break; > + } > + else > + { > + LWLockAcquire(&slot->discard_lock, LW_SHARED); > + if (slot->logno != logno || urecptr < slot->oldest_data) > + { > + /* > + * The undo log slot has been recycled because it was > + * entirely discarded, or the data has been discarded > + * already. > + */ > + LWLockRelease(&slot->discard_lock); > + break; > + } > + } I find this deeply unsatisfying. It's repeated in a bunch of places. There's completely different behaviour between the hot-standby and !hot-standby case. There's UndoRecPtrIsDiscarded for the HS case, but we do a different test for !HS. There's no explanation as to why this is even reachable. > + /* Read the undo record. */ > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > + > + /* Release the discard lock after fetching the record. */ > + if (!InHotStandby) > + LWLockRelease(&slot->discard_lock); > + } > + else > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); And then we do none of this in !one_page mode. > + /* > + * As soon as the transaction id is changed we can stop fetching the > + * undo record. 
Ideally, to_urecptr should control this but while > + * reading undo only for a page we don't know what is the end undo > + * record pointer for the transaction. > + */ > + if (one_page) > + { > + if (!FullTransactionIdIsValid(fxid)) > + fxid = uur->uur_fxid; > + else if (!FullTransactionIdEquals(fxid, uur->uur_fxid)) > + break; > + } > + > + /* Remember the previous undo record pointer. */ > + prev_urec_ptr = urecptr; > + > + /* > + * Calculate the previous undo record pointer of the transaction. If > + * we are reading undo only for a page then follow the blkprev chain > + * of the page. Otherwise, calculate the previous undo record pointer > + * using transaction's current undo record pointer and the prevlen. If > + * undo record has a valid uur_prevurp, this is the case of log switch > + * during the transaction so we can directly use uur_prevurp as our > + * previous undo record pointer of the transaction. > + */ > + if (one_page) > + urecptr = uur->uur_prevundo; > + else if (uur->uur_logswitch) > + urecptr = uur->uur_logswitch->urec_prevurp; > + else if (prev_urec_ptr == to_urecptr || > + uur->uur_info & UREC_INFO_TRANSACTION) > + urecptr = InvalidUndoRecPtr; > + else > + urecptr = UndoGetPrevUndoRecptr(prev_urec_ptr, buffer, category); > + FWIW, this is one of those concerns I was referring to above. What exactly needs to happen seems highly AM specific. > +/* > + * Read length of the previous undo record. > + * > + * This function will take an undo record pointer as an input and read the > + * length of the previous undo record which is stored at the end of the previous > + * undo record. If the undo record is split then this will add the undo block > + * header size in the total length. > + */ This should add some note as to when it's expected to be necessary. I was kind of concerned that this can be necessary, but it's only needed during log switches, which disarms that concern. > +static uint16 > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > + UndoLogCategory category) > +{ > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp); > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp); > + Buffer buffer = input_buffer; > + Page page = NULL; > + char *pagedata = NULL; > + char prevlen[2]; > + RelFileNode rnode; > + int byte_to_read = sizeof(uint16); Shouldn't it be byte_to_read? And the sizeof a type that's tied with the actual undo format? Imagine we'd ever want to change the length format for undo records - this would be hard to find. > + char persistence; > + uint16 prev_rec_len = 0; > + > + /* Get relfilenode. */ > + UndoRecPtrAssignRelFileNode(rnode, urp); > + persistence = RelPersistenceForUndoLogCategory(category); > + > + if (BufferIsValid(buffer)) > + { > + page = BufferGetPage(buffer); > + pagedata = (char *) page; > + } > + > + /* > + * Length if the previous undo record is store at the end of that record > + * so just fetch last 2 bytes. > + */ > + while (byte_to_read > 0) > + { Why does this need a loop around the number of bytes? Can there ever be a case where this is split across a record? If so, isn't that a bad idea anyway? > + /* Read buffer if the current buffer is not valid. 
*/ > + if (!BufferIsValid(buffer)) > + { > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, > + cur_blk, RBM_NORMAL, NULL, > + persistence); > + > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = BufferGetPage(buffer); > + pagedata = (char *) page; > + } > + > + page_offset -= 1; > + > + /* > + * Read current prevlen byte from current block if page_offset hasn't > + * reach to undo block header. Otherwise, go to the previous block > + * and continue reading from there. > + */ > + if (page_offset >= UndoLogBlockHeaderSize) > + { > + prevlen[byte_to_read - 1] = pagedata[page_offset]; > + byte_to_read -= 1; > + } > + else > + { > + /* > + * Release the current buffer if it is not provide by the caller. > + */ > + if (input_buffer != buffer) > + UnlockReleaseBuffer(buffer); > + > + /* > + * Could not read complete prevlen from the current block so go to > + * the previous block and start reading from end of the block. > + */ > + cur_blk -= 1; > + page_offset = BLCKSZ; > + > + /* > + * Reset buffer so that we can read it again for the previous > + * block. > + */ > + buffer = InvalidBuffer; > + } > + } I can't help but think that this shouldn't be yet another copy of logic for how to read undo pages. Need to do something else for a bit. More later. Greetings, Andres Freund
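One possible shape for that shared logic, sketched with invented names (and glossing over details like buffer reuse that the real code would want):

    /*
     * Copy nbytes of undo data starting at offset 'off' in block 'blk',
     * crossing block boundaries as required.
     */
    static void
    undo_read_bytes(RelFileNode rnode, BlockNumber blk, int off,
                    char *dest, int nbytes, char persistence)
    {
        while (nbytes > 0)
        {
            int     avail = Min(nbytes, BLCKSZ - off);
            Buffer  buf = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, blk,
                                                    RBM_NORMAL, NULL, persistence);

            LockBuffer(buf, BUFFER_LOCK_SHARE);
            memcpy(dest, (char *) BufferGetPage(buf) + off, avail);
            UnlockReleaseBuffer(buf);

            dest += avail;
            nbytes -= avail;
            blk++;
            off = UndoLogBlockHeaderSize;   /* resume just past the next block header */
        }
    }

UndoGetOneRecord(), UndoGetPrevRecordLen() and the insertion path could then all sit on top of one routine like this instead of each hand-rolling the page-crossing loop.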
On Mon, Aug 5, 2019 at 12:42 PM Andres Freund <andres@anarazel.de> wrote: > A good move in the right direction, imo. I spent some more time thinking about this and talking to Thomas about it and I'd like to propose a somewhat more aggressive restructuring proposal, with the aim of getting a cleaner separation between layers of this patch set. Right now, the undo log storage stuff knows nothing about the contents of an undo log, whereas the undo interface storage knows everything about the contents of an undo log. In particular, it knows that it's a series of records, and those records are grouped into transactions, and it knows both the format of the individual records and also the details of how transaction headers work. Nothing can use the undo log storage system except for the undo interface layer, because the undo interface layer assumes that all the data in the undo storage system conforms to the record/recordset format which it defines. However, there are a few warts: while the undo log storage patch doesn't know anything about the contents of undo logs, it does know that that transaction boundaries matter, and it signals to the undo interface layer whether a transaction header should be inserted for a new record. That's a strange thing for the storage layer to be doing. Also, in addition to three persistence levels, it knows about a fourth undo log category for "special" data for multixact or TPD-like things. That's another wart. Suppose that we instead invent a new layer which sits on top of the undo log storage layer. This layer manages what I'm going to call GHOBs, growable hunks of bytes. (This is probably not the best name, but I thought of it in 5 seconds during a complex technical conversation, so bear with me.) The GHOB layer supports open/close/grow/write/overwrite operations. Conceptually, you open a GHOB with an initial size and a persistence level, and then you can subsequently grow it unless you fill up the undo log in which case you can't grow it any more; when you're done, you close it. Opening and closing a GHOB are operations that only make in-memory state changes. Opening a GHOB finds a place where you could write the initial amount of data you specify, but it doesn't actually write any data or change any persistent state yet, except for making sure that nobody else can grab that space as long as you have the GHOB open. Closing a GHOB tells the system that you're not going to grow the object any more, which means some other GHOB can be placed immediately after the last data you wrote. Growing a GHOB doesn't do anything persistent either; it just tests whether there would be room to write those bytes. So, the only operations that make actual persistent changes are write and overwrite. These operations just copy data into shared buffers and mark them dirty, but they are set up so that you can integrate this with whatever WAL-logging your doing for those operations, so that you can make the same writes happen at redo time. Then, on top of the GHOB layer, you have separate submodules for different kinds of GHOBs. Most importantly, you have a transaction-GHOB manager, which opens a GHOB per persistence level the first time somebody wants to write to it and closes those GHOBs at end-of-xact. AMs push records into the transaction-GHOB manager, and it pushes them into GHOBs on the other side. Then you can also have a multi-GHOB manager, which would replace what Thomas now has as a separate undo log category. 
The undo-log-storage layer wouldn't have any fixed limit on the number of GHOBs that could be open at the same time; it would just be the sum of whatever the individual GHOB type managers can open. It would be important to keep that number fairly small since there's not an unlimited supply of undo logs, but that doesn't seem like a problem for any of the uses we currently have in mind. Each GHOB would begin with a magic number identifying the GHOB type, and would have callbacks for everything else, like "how big is this GHOB?" and "is it discardable?". I'm not totally sure I've thought through all of the problems here, but it seems like this might help us fix some of the aforementioned layering inversions. The undo log storage system only knows about storage: it doesn't have to help with things like transaction boundaries any more, and it continues to be indifferent to the actual contents of the storage. At the GHOB layer, we know that we've got chunks of storage which are the unit of undo discard, and we know that they start with a magic number that identifies the type, but it doesn't know whether they are internally broken into records or, if so, how those records are organized. The individual GHOB managers do know that stuff; for example, the transaction-GHOB manager would know that AMs insert undo records and how those records are compressed and so forth. One thing that feels good about this system is that you could actually write something like the test_undo module that Thomas had in an older patch set. He threw it away because it doesn't play nice with the way the undorecord/undoaccess stuff works: that stuff thinks that all undo records have to be in the format that it knows about, and if they're not, it will barf. With this, test_undo could define its own kind of GHOB that keeps stuff until it's explicitly told to throw it away, and that'd be fine for 'make check' (but not 'make installcheck', probably). Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
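In header form, the interface might look something like this (purely a sketch of the idea; every name is provisional):

    typedef struct Ghob Ghob;   /* opaque, backend-local handle */

    /* Reserve space at a given persistence level; in-memory state change only. */
    extern Ghob *ghob_open(Size initial_size, char relpersistence, uint32 magic);

    /* Reserve room to append nbytes; fails if the undo log can't fit them. */
    extern bool ghob_grow(Ghob *ghob, Size nbytes);

    /* Copy data into shared buffers; to be paired with the caller's WAL logging. */
    extern void ghob_write(Ghob *ghob, Size offset, const char *data, Size nbytes);
    extern void ghob_overwrite(Ghob *ghob, Size offset, const char *data, Size nbytes);

    /* Stop growing; space after the end becomes available to the next GHOB. */
    extern void ghob_close(Ghob *ghob);

The transaction-GHOB manager would then be the only module that knows about undo records and transaction headers, and something like test_undo could define its own GHOB type with its own magic number and callbacks.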
On Mon, Aug 5, 2019 at 6:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Aug 5, 2019 at 6:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > For zheap, we collect all the records of a page, apply them together > > and then write the entire page in WAL. The progress of transaction is > > updated at either transaction end (rollback complete) or after > > processing some threshold of undo records. So, generally, the WAL > > won't be for each undo record apply. > > This explanation omits a crucial piece of the mechanism, because > Heikki is asking what keeps the undo from being applied multiple > times. > Okay, I didn't realize that. > When we apply the undo records to a page, we also adjust the > undo pointers in the page. Since we have an undo pointer per > transaction slot, and each transaction has its own slot, if we apply > all the undo for a transaction to a page, we can just clear the slot; > if we somehow end up back at the same point later, we'll know not to > apply the undo a second time because we'll see that there's no > transaction slot pointing to the undo we were thinking of applying. If > we roll back to a savepoint, or for some other reason choose to apply > only some of the undo to a page, we can set the undo record pointer > for the transaction back to the value it had before we generated any > newer undo. Then, we'll know that the newer undo doesn't need to be > applied but the older undo can be applied. > > At least, I think that's how it's supposed to work. > Right, this is how it works. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > Need to do something else for a bit. More later. Here we go. > + /* > + * Compute the header size of the undo record. > + */ > +Size > +UndoRecordHeaderSize(uint16 uur_info) > +{ > + Size size; > + > + /* Add fixed header size. */ > + size = SizeOfUndoRecordHeader; > + > + /* Add size of transaction header if it presets. */ > + if ((uur_info & UREC_INFO_TRANSACTION) != 0) > + size += SizeOfUndoRecordTransaction; > + > + /* Add size of rmid if it presets. */ > + if ((uur_info & UREC_INFO_RMID) != 0) > + size += sizeof(RmgrId); > + > + /* Add size of reloid if it presets. */ > + if ((uur_info & UREC_INFO_RELOID) != 0) > + size += sizeof(Oid); > + > + /* Add size of fxid if it presets. */ > + if ((uur_info & UREC_INFO_XID) != 0) > + size += sizeof(FullTransactionId); > + > + /* Add size of cid if it presets. */ > + if ((uur_info & UREC_INFO_CID) != 0) > + size += sizeof(CommandId); > + > + /* Add size of forknum if it presets. */ > + if ((uur_info & UREC_INFO_FORK) != 0) > + size += sizeof(ForkNumber); > + > + /* Add size of prevundo if it presets. */ > + if ((uur_info & UREC_INFO_PREVUNDO) != 0) > + size += sizeof(UndoRecPtr); > + > + /* Add size of the block header if it presets. */ > + if ((uur_info & UREC_INFO_BLOCK) != 0) > + size += SizeOfUndoRecordBlock; > + > + /* Add size of the log switch header if it presets. */ > + if ((uur_info & UREC_INFO_LOGSWITCH) != 0) > + size += SizeOfUndoRecordLogSwitch; > + > + /* Add size of the payload header if it presets. */ > + if ((uur_info & UREC_INFO_PAYLOAD) != 0) > + size += SizeOfUndoRecordPayload; There's numerous blocks with one if for each type, and the body copied basically the same for each alternative. That doesn't seem like a reasonable approach to me. Means that many places need to be adjusted when we invariably add another type, and seems likely to lead to bugs over time. > + /* Add size of the payload header if it presets. */ FWIW, repeating the same comment, with or without minor differences, 10 times is a bad idea. Especially when the comment doesn't add *any* sort of information. Also, "if it presets" presumably is a typo? > +/* > + * Compute and return the expected size of an undo record. > + */ > +Size > +UndoRecordExpectedSize(UnpackedUndoRecord *uur) > +{ > + Size size; > + > + /* Header size. */ > + size = UndoRecordHeaderSize(uur->uur_info); > + > + /* Payload data size. */ > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > + { > + size += uur->uur_payload.len; > + size += uur->uur_tuple.len; > + } > + > + /* Add undo record length size. */ > + size += sizeof(uint16); > + > + return size; > +} > + > +/* > + * Calculate the size of the undo record stored on the page. > + */ > +static inline Size > +UndoRecordSizeOnPage(char *page_ptr) > +{ > + uint16 uur_info = ((UndoRecordHeader *) page_ptr)->urec_info; > + Size size; > + > + /* Header size. */ > + size = UndoRecordHeaderSize(uur_info); > + > + /* Payload data size. */ > + if ((uur_info & UREC_INFO_PAYLOAD) != 0) > + { > + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr + size); > + > + size += payload->urec_payload_len; > + size += payload->urec_tuple_len; > + } > + > + return size; > +} > + > +/* > + * Compute size of the Unpacked undo record in memory > + */ > +Size > +UnpackedUndoRecordSize(UnpackedUndoRecord *uur) > +{ > + Size size; > + > + size = sizeof(UnpackedUndoRecord); > + > + /* Add payload size if record contains payload data. 
*/ > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > + { > + size += uur->uur_payload.len; > + size += uur->uur_tuple.len; > + } > + > + return size; > +} These functions are all basically the same. We shouldn't copy code over and over like this. > +/* > + * Initiate inserting an undo record. > + * > + * This function will initialize the context for inserting and undo record > + * which will be inserted by calling InsertUndoData. > + */ > +void > +BeginInsertUndo(UndoPackContext *ucontext, UnpackedUndoRecord *uur) > +{ > + ucontext->stage = UNDO_PACK_STAGE_HEADER; > + ucontext->already_processed = 0; > + ucontext->partial_bytes = 0; > + > + /* Copy undo record header. */ > + ucontext->urec_hd.urec_type = uur->uur_type; > + ucontext->urec_hd.urec_info = uur->uur_info; > + > + /* Copy undo record transaction header if it is present. */ > + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) > + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); > + > + /* Copy rmid if present. */ > + if ((uur->uur_info & UREC_INFO_RMID) != 0) > + ucontext->urec_rmid = uur->uur_rmid; > + > + /* Copy reloid if present. */ > + if ((uur->uur_info & UREC_INFO_RELOID) != 0) > + ucontext->urec_reloid = uur->uur_reloid; > + > + /* Copy fxid if present. */ > + if ((uur->uur_info & UREC_INFO_XID) != 0) > + ucontext->urec_fxid = uur->uur_fxid; > + > + /* Copy cid if present. */ > + if ((uur->uur_info & UREC_INFO_CID) != 0) > + ucontext->urec_cid = uur->uur_cid; > + > + /* Copy undo record relation header if it is present. */ > + if ((uur->uur_info & UREC_INFO_FORK) != 0) > + ucontext->urec_fork = uur->uur_fork; > + > + /* Copy prev undo record pointer if it is present. */ > + if ((uur->uur_info & UREC_INFO_PREVUNDO) != 0) > + ucontext->urec_prevundo = uur->uur_prevundo; > + > + /* Copy undo record block header if it is present. */ > + if ((uur->uur_info & UREC_INFO_BLOCK) != 0) > + { > + ucontext->urec_blk.urec_block = uur->uur_block; > + ucontext->urec_blk.urec_offset = uur->uur_offset; > + } > + > + /* Copy undo record log switch header if it is present. */ > + if ((uur->uur_info & UREC_INFO_LOGSWITCH) != 0) > + memcpy(&ucontext->urec_logswitch, uur->uur_logswitch, > + SizeOfUndoRecordLogSwitch); > + > + /* Copy undo record payload header and data if it is present. */ > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > + { > + ucontext->urec_payload.urec_payload_len = uur->uur_payload.len; > + ucontext->urec_payload.urec_tuple_len = uur->uur_tuple.len; > + ucontext->urec_payloaddata = uur->uur_payload.data; > + ucontext->urec_tupledata = uur->uur_tuple.data; > + } > + else > + { > + ucontext->urec_payload.urec_payload_len = 0; > + ucontext->urec_payload.urec_tuple_len = 0; > + } > + > + /* Compute undo record expected size and store in the context. */ > + ucontext->undo_len = UndoRecordExpectedSize(uur); > +} It really can't be right to have all these fields basically twice, in UnackedUndoRecord, and UndoPackContext. And then copy them one-by-one. I mean there's really just some random differences (ordering, some field names) between the structures, but otherwise they're the same? What on earth do we gain by this? This entire intermediate stage makes no sense at all to me. We copy data into an UndoRecord, then we copy into an UndoRecordContext, with essentially a field-by-field copy logic. Then we have another field-by-field logic that copies the data into the page. > +/* > + * Insert the undo record into the input page from the unpack undo context. 
> + * > + * Caller can call this function multiple times until desired stage is reached. > + * This will write the undo record into the page. > + */ > +void > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > +{ > + char *writeptr = (char *) page + starting_byte; > + char *endptr = (char *) page + BLCKSZ; > + > + switch (ucontext->stage) > + { > + case UNDO_PACK_STAGE_HEADER: > + /* Insert undo record header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_hd, > + SizeOfUndoRecordHeader, &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION; > + /* fall through */ > + > + case UNDO_PACK_STAGE_TRANSACTION: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_TRANSACTION) != 0) > + { > + /* Insert undo record transaction header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_txn, > + SizeOfUndoRecordTransaction, > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_RMID; > + /* fall through */ > + > + case UNDO_PACK_STAGE_RMID: > + /* Write rmid(if needed and not already done). */ > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RMID) != 0) > + { > + if (!InsertUndoBytes((char *) &(ucontext->urec_rmid), sizeof(RmgrId), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_RELOID; > + /* fall through */ > + > + case UNDO_PACK_STAGE_RELOID: > + /* Write reloid(if needed and not already done). */ > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RELOID) != 0) > + { > + if (!InsertUndoBytes((char *) &(ucontext->urec_reloid), sizeof(Oid), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_XID; > + /* fall through */ > + > + case UNDO_PACK_STAGE_XID: > + /* Write xid(if needed and not already done). */ > + if ((ucontext->urec_hd.urec_info & UREC_INFO_XID) != 0) > + { > + if (!InsertUndoBytes((char *) &(ucontext->urec_fxid), sizeof(FullTransactionId), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_CID; > + /* fall through */ > + > + case UNDO_PACK_STAGE_CID: > + /* Write cid(if needed and not already done). */ > + if ((ucontext->urec_hd.urec_info & UREC_INFO_CID) != 0) > + { > + if (!InsertUndoBytes((char *) &(ucontext->urec_cid), sizeof(CommandId), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_FORKNUM; > + /* fall through */ > + > + case UNDO_PACK_STAGE_FORKNUM: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_FORK) != 0) > + { > + /* Insert undo record fork number. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_fork, > + sizeof(ForkNumber), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_PREVUNDO; > + /* fall through */ > + > + case UNDO_PACK_STAGE_PREVUNDO: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PREVUNDO) != 0) > + { > + /* Insert undo record blkprev. 
*/ > + if (!InsertUndoBytes((char *) &ucontext->urec_prevundo, > + sizeof(UndoRecPtr), > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_BLOCK; > + /* fall through */ > + > + case UNDO_PACK_STAGE_BLOCK: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_BLOCK) != 0) > + { > + /* Insert undo record block header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_blk, > + SizeOfUndoRecordBlock, > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_LOGSWITCH; > + /* fall through */ > + > + case UNDO_PACK_STAGE_LOGSWITCH: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_LOGSWITCH) != 0) > + { > + /* Insert undo record transaction header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_logswitch, > + SizeOfUndoRecordLogSwitch, > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD; > + /* fall through */ > + > + case UNDO_PACK_STAGE_PAYLOAD: > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PAYLOAD) != 0) > + { > + /* Insert undo record payload header. */ > + if (!InsertUndoBytes((char *) &ucontext->urec_payload, > + SizeOfUndoRecordPayload, > + &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD_DATA; > + /* fall through */ > + > + case UNDO_PACK_STAGE_PAYLOAD_DATA: > + { > + int len = ucontext->urec_payload.urec_payload_len; > + > + if (len > 0) > + { > + /* Insert payload data. */ > + if (!InsertUndoBytes((char *) ucontext->urec_payloaddata, > + len, &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_TUPLE_DATA; > + } > + /* fall through */ > + > + case UNDO_PACK_STAGE_TUPLE_DATA: > + { > + int len = ucontext->urec_payload.urec_tuple_len; > + > + if (len > 0) > + { > + /* Insert tuple data. */ > + if (!InsertUndoBytes((char *) ucontext->urec_tupledata, > + len, &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + } > + ucontext->stage = UNDO_PACK_STAGE_UNDO_LENGTH; > + } > + /* fall through */ > + > + case UNDO_PACK_STAGE_UNDO_LENGTH: > + /* Insert undo length. */ > + if (!InsertUndoBytes((char *) &ucontext->undo_len, > + sizeof(uint16), &writeptr, endptr, > + &ucontext->already_processed, > + &ucontext->partial_bytes)) > + return; > + > + ucontext->stage = UNDO_PACK_STAGE_DONE; > + /* fall through */ > + > + case UNDO_PACK_STAGE_DONE: > + /* Nothing to be done. */ > + break; > + > + default: > + Assert(0); /* Invalid stage */ > + } > +} I don't understand. The only purpose of this is that we can partially write a packed-but-not-actually-packed record onto a bunch of pages? And for that we have an endless chain of copy and pasted code calling InsertUndoBytes()? Copying data into shared buffers in tiny increments? If we need to do this, what is the whole packed record format good for? Except for adding a bunch of functions with 10++ ifs and nearly identical code? Copying data is expensive. Copying data in tiny increments is more expensive. Copying data in tiny increments, with a bunch of branches, is even more expensive. Copying data in tiny increments, with a bunch of branches, is even more expensive, especially when it's shared memory.
Copying data in tiny increments, with a bunch of branches, is even more expensive, especially when it's shared memory, especially when all that shared memory is locked at once. > +/* > + * Read the undo record from the input page to the unpack undo context. > + * > + * Caller can call this function multiple times until desired stage is reached. > + * This will read the undo record from the page and store the data into unpack > + * undo context, which can be later copied to unpacked undo record by calling > + * FinishUnpackUndo. > + */ > +void > +UnpackUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > +{ > + char *readptr = (char *) page + starting_byte; > + char *endptr = (char *) page + BLCKSZ; > + > + switch (ucontext->stage) > + { > + case UNDO_PACK_STAGE_HEADER: You know roughly what I'm thinking. > commit 95d10fb308e3ec6ac8a7b4b5e7af78f6825f4dc8 > Author: Amit Kapila <amit.kapila@enterprisedb.com> > AuthorDate: 2019-06-13 15:10:06 +0530 > Commit: Amit Kapila <amit.kapila@enterprisedb.com> > CommitDate: 2019-07-31 16:36:52 +0530 > > Infrastructure to register and fetch undo action requests. I'm pretty sure I suggested that before, but this seems the wrong order. We should have very basic undo functionality in place, even if it can't actually guarantee that undo gets processed, before this. The design of this piece depends on understanding the later parts too much. > This infrasture provides a way to allow execution of undo actions. One > might think that we can always execute undo actions on error or explicit > rollabck by user, however there are cases when that is not posssible. s/rollabck by user/rollback by a user/ > For example, (a) if the system crash while doing operation, then after > startup, we need a way to perform undo actions; (b) If we get error while > performing undo actions. "doing operation" doesn't sound right. Maybe "performing an operation"? > Apart from this, when there are large rollback requests, then it is quite > inefficient to perform all the undo actions and then return control to > user. I don't think efficiency is the right word to describe that. I'd argue that it's probably often at least as efficient to let that rollback be processed in that context (higher cache locality, preventing that backend from creating further undo). It's just that doing so has a bad effect on latency. > To allow efficient execution of the undo actions, we create three queues > and a hash table for the rollback requests. Again I don't think efficient is the right descriptor. My understanding of the goals of having multiple queues is that it helps to achieve forward progress among separate goals, without losing too much efficiency. > A Xid based priority queue > which will allow us to process the requests of older transactions and help > us to move oldesdXidHavingUnappliedUndo (this is a xid-horizon below which > all the transactions are visible) forward. "This is an important concern, because ..." > +/* > + * Returns the undo record pointer corresponding to first record in the given > + * block.
> + */ > +UndoRecPtr > +UndoBlockGetFirstUndoRecord(BlockNumber blkno, UndoRecPtr urec_ptr, > + UndoLogCategory category) > +{ > + Buffer buffer; > + Page page; > + UndoPageHeader phdr; > + RelFileNode rnode; > + UndoLogOffset log_cur_off; > + Size partial_rec_size; > + int offset_cur_page; > + > + if (!BlockNumberIsValid(blkno)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid undo block number"))); > + > + UndoRecPtrAssignRelFileNode(rnode, urec_ptr); > + > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, blkno, > + RBM_NORMAL, NULL, > + RelPersistenceForUndoLogCategory(category)); > + > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = BufferGetPage(buffer); > + phdr = (UndoPageHeader)page; > + > + /* Calculate the size of the partial record. */ > + partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + > + phdr->tuple_len + phdr->payload_len - > + phdr->record_offset; > + > + /* calculate the offset in current log. */ > + offset_cur_page = SizeOfUndoPageHeaderData + partial_rec_size; > + log_cur_off = (blkno * BLCKSZ) + offset_cur_page; > + > + UnlockReleaseBuffer(buffer); > + > + /* calculate the undo record pointer based on current offset in log. */ > + return MakeUndoRecPtr(UndoRecPtrGetLogNo(urec_ptr), log_cur_off); > +} Yet another function reading undo blocks. No. > The undo requests must appear in both xid and size > + * requests queues or neither. Why? > As of now we, process the requests from these > + * queues in a round-robin fashion to give equal priority to all three type > + * of requests. *types > + * The rollback requests exceeding a certain threshold are pushed into both > + * xid and size based queues. They are also registered in the hash table. Why aren't rollbacks below the threshold in the hashtable? > + * To ensure that backend and discard worker don't register the same request > + * in the hash table, we always register the request with full_xid and the > + * start pointer for the transaction in the hash table as key. Backends > + * always remember the value of start pointer, but discard worker doesn't know *the discard worker There's no explanation as to why we need more than the full_xid (presumably persistency levels). Nor why you chose not to include those. > + * the actual start value in case transaction's undo spans across multiple > + * logs. The reason for the same is that discard worker might encounter the > + * log which has overflowed undo records of the transaction first. "the log which has overflowed undo records of the transaction first" is confusing. Perhaps "the undo log into which the logically earlier undo overflowed before encountering the logically earlier undo"? > In such > + * cases, we need to compute the actual start position. The first record of a > + * transaction in each undo log contains a reference to the first record of > + * this transaction in the previous log. By following the previous log chain > + * of this transaction, we find the initial location which is used to register > + * the request. It seems wrong that the undo request layer needs to care about any of this. > +/* Each worker queue is a binary heap. */ > +typedef struct > +{ > + binaryheap *bh; > + union > + { > + UndoXidQueue *xid_elems; > + UndoSizeQueue *size_elems; > + UndoErrorQueue *error_elems; > + } q_choice; > +} UndoWorkerQueue; As we IIRC have decided to change this into an rbtree, I'll ignore related parts of the current code. What is the status of that work?
I've checked the git trees, without seeing anything? Your last mail with patches https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com doesn't seem to contain that either? > +/* Different operations for XID queue */ > +#define InitXidQueue(bh, elems) \ > +( \ > + UndoWorkerQueues[XID_QUEUE].bh = bh, \ > + UndoWorkerQueues[XID_QUEUE].q_choice.xid_elems = elems \ > +) > + > +#define XidQueueIsEmpty() \ > + (binaryheap_empty(UndoWorkerQueues[XID_QUEUE].bh)) > + > +#define GetXidQueueSize() \ > + (binaryheap_cur_size(UndoWorkerQueues[XID_QUEUE].bh)) > + > +#define GetXidQueueElem(elem) \ > + (UndoWorkerQueues[XID_QUEUE].q_choice.xid_elems[elem]) > + > +#define GetXidQueueTopElem() \ > +( \ > + AssertMacro(!binaryheap_empty(UndoWorkerQueues[XID_QUEUE].bh)), \ > + DatumGetPointer(binaryheap_first(UndoWorkerQueues[XID_QUEUE].bh)) \ > +) > + > +#define GetXidQueueNthElem(n) \ > +( \ > + AssertMacro(!XidQueueIsEmpty()), \ > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[XID_QUEUE].bh, n)) \ > +) > + > +#define SetXidQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr) \ > +( \ > + GetXidQueueElem(elem).dbid = e_dbid, \ > + GetXidQueueElem(elem).full_xid = e_full_xid, \ > + GetXidQueueElem(elem).start_urec_ptr = e_start_urec_ptr \ > +) > + > +/* Different operations for SIZE queue */ > +#define InitSizeQueue(bh, elems) \ > +( \ > + UndoWorkerQueues[SIZE_QUEUE].bh = bh, \ > + UndoWorkerQueues[SIZE_QUEUE].q_choice.size_elems = elems \ > +) > + > +#define SizeQueueIsEmpty() \ > + (binaryheap_empty(UndoWorkerQueues[SIZE_QUEUE].bh)) > + > +#define GetSizeQueueSize() \ > + (binaryheap_cur_size(UndoWorkerQueues[SIZE_QUEUE].bh)) > + > +#define GetSizeQueueElem(elem) \ > + (UndoWorkerQueues[SIZE_QUEUE].q_choice.size_elems[elem]) > + > +#define GetSizeQueueTopElem() \ > +( \ > + AssertMacro(!SizeQueueIsEmpty()), \ > + DatumGetPointer(binaryheap_first(UndoWorkerQueues[SIZE_QUEUE].bh)) \ > +) > + > +#define GetSizeQueueNthElem(n) \ > +( \ > + AssertMacro(!SizeQueueIsEmpty()), \ > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[SIZE_QUEUE].bh, n)) \ > +) > + > +#define SetSizeQueueElem(elem, e_dbid, e_full_xid, e_size, e_start_urec_ptr) \ > +( \ > + GetSizeQueueElem(elem).dbid = e_dbid, \ > + GetSizeQueueElem(elem).full_xid = e_full_xid, \ > + GetSizeQueueElem(elem).request_size = e_size, \ > + GetSizeQueueElem(elem).start_urec_ptr = e_start_urec_ptr \ > +) > + > +/* Different operations for Error queue */ > +#define InitErrorQueue(bh, elems) \ > +( \ > + UndoWorkerQueues[ERROR_QUEUE].bh = bh, \ > + UndoWorkerQueues[ERROR_QUEUE].q_choice.error_elems = elems \ > +) > + > +#define ErrorQueueIsEmpty() \ > + (binaryheap_empty(UndoWorkerQueues[ERROR_QUEUE].bh)) > + > +#define GetErrorQueueSize() \ > + (binaryheap_cur_size(UndoWorkerQueues[ERROR_QUEUE].bh)) > + > +#define GetErrorQueueElem(elem) \ > + (UndoWorkerQueues[ERROR_QUEUE].q_choice.error_elems[elem]) > + > +#define GetErrorQueueTopElem() \ > +( \ > + AssertMacro(!binaryheap_empty(UndoWorkerQueues[ERROR_QUEUE].bh)), \ > + DatumGetPointer(binaryheap_first(UndoWorkerQueues[ERROR_QUEUE].bh)) \ > +) > + > +#define GetErrorQueueNthElem(n) \ > +( \ > + AssertMacro(!ErrorQueueIsEmpty()), \ > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[ERROR_QUEUE].bh, n)) \ > +) -ETOOMANYMACROS I think nearly all of these shouldn't exist. See further below. 
> +#define SetErrorQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr, e_retry_at, e_occurred_at) \ > +( \ > + GetErrorQueueElem(elem).dbid = e_dbid, \ > + GetErrorQueueElem(elem).full_xid = e_full_xid, \ > + GetErrorQueueElem(elem).start_urec_ptr = e_start_urec_ptr, \ > + GetErrorQueueElem(elem).next_retry_at = e_retry_at, \ > + GetErrorQueueElem(elem).err_occurred_at = e_occurred_at \ > +) It's very very rarely a good idea to have macros that evaluate their arguments multiple times. It'll also never be a good idea to get the same element multiple times from a queue. If needed - I'm very doubtful of that, given that there's a single caller - it should be a static inline function that gets the element once, stores it in a local variable, and then updates all the fields (see the sketch below). > +/* > + * Binary heap comparison function to compare the size of transactions. > + */ > +static int > +undo_size_comparator(Datum a, Datum b, void *arg) > +{ > + UndoSizeQueue *sizeQueueElem1 = (UndoSizeQueue *) DatumGetPointer(a); > + UndoSizeQueue *sizeQueueElem2 = (UndoSizeQueue *) DatumGetPointer(b); > It's very odd that elements are named 'Queue' rather than a queue element. > +/* > + * Binary heap comparison function to compare the time at which an error > + * occurred for transactions. > + * > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently, > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so > + * it doesn't make much sense to sort by both values. However, in future, if > + * we have some different algorithm for next_retry_at, then it will work > + * seamlessly. > + */ Why is it useful to have error_occurred_at be part of the comparison at all? If we need a tiebreaker, err_occurred_at isn't that (if we can get conflicts for next_retry_at, then we can also get conflicts in err_occurred_at). Seems better to use something actually guaranteed to be unique for a tiebreaker. > +int > +UndoRollbackHashTableSize() > +{ missing void, at least compared to our common style. > + /* > + * The rollback hash table is used to avoid duplicate undo requests by > + * backends and discard worker. The table must be able to accomodate all > + * active undo requests. The undo requests must appear in both xid and > + * size requests queues or neither. In same transaction, there can be two > + * requests one for logged relations and another for unlogged relations. > + * So, the rollback hash table size should be equal to two request queues, > + * an error queue (currently this is same as request queue) and max "the same"? I assume this is intended to mean the same size? > + * backends. This will ensure that it won't get filled. > + */ How does this ensure anything? > +static int > +RemoveOldElemsFromXidQueue() void. > +/* > + * Traverse the queue and remove dangling entries, if any. The queue > + * entry is considered dangling if the hash table doesn't contain the > + * corresponding entry. > + */ > +static int > +RemoveOldElemsFromSizeQueue() void. We shouldn't need this in this form anymore after the rbtree conversion - but because it again highlights one of my main complaints of all this work: Don't have multiple copies of essentially equivalent non-trivial functions. Especially not in the same file. This is a near verbatim copy of RemoveOldElemsFromXidQueue. Without any explanations why it's needed. Even if you intended it only as a short-term workaround (e.g.
for the queues not sharing enough of a common base-layout to be able to share one cleanup routine), at the very least you need to add a FIXME or such explaining that this needs to be fixed. > +/* > + * Traverse the queue and remove dangling entries, if any. The queue > + * entry is considered dangling if the hash table doesn't contain the > + * corresponding entry. > + */ > +static int > +RemoveOldElemsFromErrorQueue() > +{ Another copy. > +/* > + * Returns true, if there is some valid request in the given queue, false, > + * otherwise. > + * > + * It fills hkey with hash key corresponding to the nth element of the > + * specified queue. > + */ > +static bool > +GetRollbackHashKeyFromQueue(UndoWorkerQueueType cur_queue, int n, > + RollbackHashKey *hkey) > +{ > + if (cur_queue == XID_QUEUE) > + { > + UndoXidQueue *elem; > + > + /* check if there is a work in the next queue */ > + if (GetXidQueueSize() <= n) > + return false; > + > + elem = (UndoXidQueue *) GetXidQueueNthElem(n); > + hkey->full_xid = elem->full_xid; > + hkey->start_urec_ptr = elem->start_urec_ptr; > + } This is a slightly different form of copying code repeatedly. Instead of passing in the queue type, this should get a pointer to the queue passed in. Functions like Get*QueueSize(), GetErrorQueueNthElem() shouldn't exist once for each queue type, they should be agnostic as to what the queue type is, and accept a queue as the parameter. Yes, there'd still be one additional queue type specific check, for the time. But that's still a lot less copied code. I also don't think it's a good idea to use RollbackHashKey as the parameter/function name here. This function doesn't need to know that it's for a hash table lookup. > +/* > + * Fetch the end urec pointer for the transaction and the undo request size. > + * > + * end_urecptr_out - This is an INOUT parameter. If end undo pointer is > + * specified, we use the same to calculate the size. Else, we calculate > + * the end undo pointer and return the same. > + * > + * last_log_start_urec_ptr_out - This is an OUT parameter. If a transaction > + * writes undo records in multiple undo logs, this is set to the start undo > + * record pointer of this transaction in the last log. If the transaction > + * writes undo records only in single undo log, it is set to start_urec_ptr. > + * This value is used to update the rollback progress of the transaction in > + * the last log. Once, we have start location in last log, the start location > + * in all the previous logs can be computed. See execute_undo_actions for > + * more details. > + * > + * XXX: We don't calculate the exact undo size. We always skip the size of > + * the last undo record (if not already discarded) from the calculation. This > + * optimization allows us to skip fetching an undo record for the most > + * frequent cases where the end pointer and current start pointer belong to > + * the same log. A simple subtraction between them gives us the size. In > + * future this function can be modified if someone needs the exact undo size. > + * As of now, we use this function to calculate the undo size for inserting > + * in the pending undo actions in undo worker's size queue. > + */ > +uint64 > +FindUndoEndLocationAndSize(UndoRecPtr start_urecptr, > + UndoRecPtr *end_urecptr_out, > + UndoRecPtr *last_log_start_urecptr_out, > + FullTransactionId full_xid) > +{ This really can't be the right place for this function. 
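To be concrete about the direction I'd prefer for this queue code: accessors that take an UndoWorkerQueue pointer, plus static inline functions instead of the multiple-evaluation macros. An untested sketch, reusing the UndoWorkerQueue struct and the element fields quoted earlier (I'm guessing TimestampTz for the retry/occurred timestamps; the names are mine):

static inline int
UndoQueueSize(UndoWorkerQueue *queue)
{
	return binaryheap_cur_size(queue->bh);
}

static inline bool
UndoQueueIsEmpty(UndoWorkerQueue *queue)
{
	return binaryheap_empty(queue->bh);
}

static inline void *
UndoQueueNthElem(UndoWorkerQueue *queue, int n)
{
	Assert(!UndoQueueIsEmpty(queue));
	return DatumGetPointer(binaryheap_nth(queue->bh, n));
}

/* Fetch the element once, then assign the fields - no multiple evaluation. */
static inline void
UndoErrorQueueSetElem(UndoWorkerQueue *queue, int elem, Oid dbid,
					  FullTransactionId full_xid, UndoRecPtr start_urec_ptr,
					  TimestampTz next_retry_at, TimestampTz err_occurred_at)
{
	UndoErrorQueue *entry = &queue->q_choice.error_elems[elem];

	entry->dbid = dbid;
	entry->full_xid = full_xid;
	entry->start_urec_ptr = start_urec_ptr;
	entry->next_retry_at = next_retry_at;
	entry->err_occurred_at = err_occurred_at;
}

With something like that, GetRollbackHashKeyFromQueue() only needs one small queue-type-specific branch for extracting the key fields, rather than duplicating the whole lookup per queue.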
> +/* > + * Returns true, if we can push the rollback request to undo wrokers, false, *workers Also, it's not really queued to workers. Something like "can queue the rollback request to be executed in the background" would be more accurate afaict. > + * otherwise. > + */ > +static bool > +CanPushReqToUndoWorker(UndoRecPtr start_urec_ptr, UndoRecPtr end_urec_ptr, > + uint64 req_size) > +{ > + /* > + * This must be called after acquring RollbackRequestLock as we will check *acquiring > + * the binary heaps which can change. > + */ > + Assert(LWLockHeldByMeInMode(RollbackRequestLock, LW_EXCLUSIVE)); > + > + /* > + * We normally push the rollback request to undo workers if the size of > + * same is above a certain threshold. > + */ > + if (req_size >= rollback_overflow_size * 1024 * 1024) > + { Why is this being checked with the lock held? Seems like this should be handled in a pre-check? > +/* > + * Initialize the hash-table and priority heap based queues for rollback > + * requests in shared memory. > + */ > +void > +PendingUndoShmemInit(void) > +{ > + HASHCTL info; > + bool foundXidQueue = false; > + bool foundSizeQueue = false; > + bool foundErrorQueue = false; > + binaryheap *bh; > + UndoXidQueue *xid_elems; > + UndoSizeQueue *size_elems; > + UndoErrorQueue *error_elems; > + > + MemSet(&info, 0, sizeof(info)); > + > + info.keysize = sizeof(TransactionId) + sizeof(UndoRecPtr); > + info.entrysize = sizeof(RollbackHashEntry); > + info.hash = tag_hash; > + > + RollbackHT = ShmemInitHash("Undo Actions Lookup Table", > + UndoRollbackHashTableSize(), > + UndoRollbackHashTableSize(), &info, > + HASH_ELEM | HASH_FUNCTION | HASH_FIXED_SIZE); > + > + bh = binaryheap_allocate_shm("Undo Xid Binary Heap", > + pending_undo_queue_size, > + undo_age_comparator, > + NULL); > + > + xid_elems = (UndoXidQueue *) ShmemInitStruct("Undo Xid Queue Elements", > + UndoXidQueueElemsShmSize(), > + &foundXidQueue); > + > + Assert(foundXidQueue || !IsUnderPostmaster); > + > + if (!IsUnderPostmaster) > + memset(xid_elems, 0, sizeof(UndoXidQueue)); > + > + InitXidQueue(bh, xid_elems); > + > + bh = binaryheap_allocate_shm("Undo Size Binary Heap", > + pending_undo_queue_size, > + undo_size_comparator, > + NULL); > + size_elems = (UndoSizeQueue *) ShmemInitStruct("Undo Size Queue Elements", > + UndoSizeQueueElemsShmSize(), > + &foundSizeQueue); > + Assert(foundSizeQueue || !IsUnderPostmaster); > + > + if (!IsUnderPostmaster) > + memset(size_elems, 0, sizeof(UndoSizeQueue)); > + > + InitSizeQueue(bh, size_elems); > + > + bh = binaryheap_allocate_shm("Undo Error Binary Heap", > + pending_undo_queue_size, > + undo_err_time_comparator, > + NULL); > + > + error_elems = (UndoErrorQueue *) ShmemInitStruct("Undo Error Queue Elements", > + UndoErrorQueueElemsShmSize(), > + &foundErrorQueue); > + Assert(foundErrorQueue || !IsUnderPostmaster); > + > + if (!IsUnderPostmaster) > + memset(error_elems, 0, sizeof(UndoSizeQueue)); > + > + InitErrorQueue(bh, error_elems); Hm. Aren't you overwriting previously initialized data here with memset and Init*Queue, when using an EXEC_BACKEND build (e.g. Windows)? I think all the initialization should only be done once, e.g. if ShmemInitStruct() sets the *found to true. And then the other elements should be asserted to also exist/not exist. Also, what is the memset() here supposed to be doing? Aren't you just memsetting() the first element in the queue? Since the queue is dynamically sized, a static length (sizeof(UndoSizeQueue)) memset() obviously cannot initialize the members.
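For the initialization part, the usual pattern would be to initialize only when ShmemInitStruct() reports that the structure didn't exist yet, and to zero the whole array rather than a single element. A sketch for the xid queue (the other two queues would be analogous, binaryheap_allocate_shm() presumably needs the same treatment, and I'm assuming UndoWorkerQueues itself is backend-local):

	xid_elems = (UndoXidQueue *) ShmemInitStruct("Undo Xid Queue Elements",
												 UndoXidQueueElemsShmSize(),
												 &foundXidQueue);
	if (!foundXidQueue)
	{
		/* First creation: zero the whole array, not sizeof(UndoXidQueue). */
		memset(xid_elems, 0, UndoXidQueueElemsShmSize());
	}

	/* Setting up backend-local pointers is fine to redo on every attach. */
	InitXidQueue(bh, xid_elems);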
Also, this again is repeating code unnecessarily. > +/* Insert the request into an error queue. */ > +bool > +InsertRequestIntoErrorUndoQueue(volatile UndoRequestInfo * urinfo) > +{ > + RollbackHashEntry *rh; > + > + LWLockAcquire(RollbackRequestLock, LW_EXCLUSIVE); > + > + /* We can't insert into an error queue if it is already full. */ > + if (GetErrorQueueSize() >= pending_undo_queue_size) > + { > + int num_removed = 0; > + > + /* Try to remove few elements */ > + num_removed = RemoveOldElemsFromErrorQueue(); If we kept this, I'd rename these as Prune* and reword the comments to match. This makes the code look like we're actually removing valid entries. > +/* > + * Get the next set of pending rollback request for undo worker. "set"? We only remove one, no? > + * allow_peek - if true, peeks a few element from each queue to check whether > + * any request matches current dbid. > + * remove_from_queue - if true, picks an element from the queue whose dbid > + * matches current dbid and remove it from the queue before returning the same > + * to caller. > + * urinfo - this is an OUT parameter that returns the details of undo request > + * whose undo action is still pending. > + * in_other_db_out - this is an OUT parameter. If we've not found any work > + * for current database, but there is work for some other database, we set > + * this parameter as true. > + */ > +bool > +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo *urinfo, > + bool *in_other_db_out) > +{ > + /* > + * If some undo worker is already processing the rollback request or > + * it is already processed, then we drop that request from the queue > + * and fetch the next entry from the queue. > + */ > + if (!rh || UndoRequestIsInProgress(rh)) > + { > + RemoveRequestFromQueue(cur_queue, 0); > + cur_undo_queue++; > + continue; > + } When is it possible to hit the in-progress case? > + /* > + * We've found a work for some database. If we don't want to remove > + * the request, we return from here and spawn a worker process to > + * apply the same. > + */ > + if (!remove_from_queue) > + { > + bool exists; > + > + StartTransactionCommand(); > + exists = dbid_exists(rh->dbid); > + CommitTransactionCommand(); > + > + /* > + * If the database doesn't exist, just remove the request since we > + * no longer need to apply the undo actions. > + */ > + if (!exists) > + { > + RemoveRequestFromQueue(cur_queue, 0); > + RollbackHTRemoveEntry(rh->full_xid, rh->start_urec_ptr, true); > + cur_undo_queue++; > + continue; > + } I still think there never should be a case in which this is possible. Dropping a database ought to remove all the associated undo. > + /* > + * The worker can perform this request if it is either not > + * connected to any database or the request belongs to the > + * same database to which it is connected. > + */ > + if ((MyDatabaseId == InvalidOid) || > + (MyDatabaseId != InvalidOid && MyDatabaseId == rh->dbid)) > + { > + /* found a work for current database */ > + if (in_other_db_out) > + *in_other_db_out = false; > + > + /* > + * Mark the undo request in hash table as in_progress so > + * that other undo worker doesn't pick the same entry for > + * rollback. > + */ > + rh->status = UNDO_REQUEST_INPROGRESS; > + > + /* set the undo request info to process */ > + SetUndoRequestInfoFromRHEntry(urinfo, rh, cur_queue); > + > + /* > + * Remove the request from queue so that other undo worker > + * doesn't process the same entry. 
*/ > + RemoveRequestFromQueue(cur_queue, depth); > + LWLockRelease(RollbackRequestLock); > + return true; Copy of code from above. > +/* > + * This function registers the rollback requests. > + * > + * Returns true, if the request is registered and will be processed by undo > + * worker at some later point of time, false, otherwise in which case caller > + * can process the undo request by itself. > + * > + * The caller may execute undo actions itself if the request is not already > + * present in rollback hash table and can't be pushed to pending undo request > + * queues. The two reasons why request can't be pushed are (a) the size of > + * request is smaller than a threshold and the request is not from discard > + * worker, (b) the undo request queues are full. > + * > + * It is not advisable to apply the undo actions of a very large transaction > + * in the foreground as that can lead to a delay in retruning the control back *returning > +/* different types of undo worker */ > +typedef enum > +{ > + XID_QUEUE = 0, > + SIZE_QUEUE = 1, > + ERROR_QUEUE > +} UndoWorkerQueueType; IMO odd to explicitly number two elements of an enum, but not the third. > +/* This is an entry for undo request queue that is sorted by xid. */ > +typedef struct UndoXidQueue > +{ > + FullTransactionId full_xid; > + UndoRecPtr start_urec_ptr; > + Oid dbid; > +} UndoXidQueue; As I said before, this isn't a queue, it's a queue entry. > +/* Reset the undo request info */ > +#define ResetUndoRequestInfo(urinfo) \ > +( \ > + (urinfo)->full_xid = InvalidFullTransactionId, \ > + (urinfo)->start_urec_ptr = InvalidUndoRecPtr, \ > + (urinfo)->end_urec_ptr = InvalidUndoRecPtr, \ > + (urinfo)->last_log_start_urec_ptr = InvalidUndoRecPtr, \ > + (urinfo)->dbid = InvalidOid, \ > + (urinfo)->request_size = 0, \ > + (urinfo)->undo_worker_queue = InvalidUndoWorkerQueue \ > +) > + > +/* set the undo request info from the rollback request */ > +#define SetUndoRequestInfoFromRHEntry(urinfo, rh, cur_queue) \ > +( \ > + urinfo->full_xid = rh->full_xid, \ > + urinfo->start_urec_ptr = rh->start_urec_ptr, \ > + urinfo->end_urec_ptr = rh->end_urec_ptr, \ > + urinfo->last_log_start_urec_ptr = rh->last_log_start_urec_ptr, \ > + urinfo->dbid = rh->dbid, \ > + urinfo->undo_worker_queue = cur_queue \ > +) See my other complaint about such macros. Multiple evaluation hazard etc. Also, the different formatting in two consecutively defined macros is odd. > +/*------------------------------------------------------------------------- > + * > + * undoaction.c > + * execute undo actions > + * > + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group > + * Portions Copyright (c) 1994, Regents of the University of California > + * > + * src/backend/access/undo/undoaction.c > + * > + * To apply the undo actions, we collect the undo records in bulk and try to s/the//g > + * process them together. We ensure to update the transaction's progress at > + * regular intervals so that after a crash we can skip already applied undo. > + * The undo apply progress is updated in terms of the number of blocks > + * processed. Undo apply progress value XACT_APPLY_PROGRESS_COMPLETED > + * indicates that all the undo is applied, XACT_APPLY_PROGRESS_NOT_STARTED > + * indicates that no undo action has been applied yet and any other value > + * indicates that we have applied undo partially and after crash recovery, we > + * need to start processing the undo from the same location.
> + *------------------------------------------------------------------------- > +/* > + * UpdateUndoApplyProgress - Updates how far undo actions from a particular > + * log have been applied while rolling back a transaction. This progress is > + * measured in terms of undo block number of the undo log till which the > + * undo actions have been applied. > + */ > +static void > +UpdateUndoApplyProgress(UndoRecPtr progress_urec_ptr, > + BlockNumber block_num) > +{ > + UndoLogCategory category; > + UndoRecordInsertContext context = {{0}}; > + > + category = > + UndoLogNumberGetCategory(UndoRecPtrGetLogNo(progress_urec_ptr)); > + > + /* > + * We don't need to update the progress for temp tables as they get > + * discraded after startup. > + */ > + if (category == UNDO_TEMP) > + return; > + > + BeginUndoRecordInsert(&context, category, 1, NULL); > + > + /* > + * Prepare and update the undo apply progress in the transaction header. > + */ > + UndoRecordPrepareApplyProgress(&context, progress_urec_ptr, block_num); > + > + START_CRIT_SECTION(); > + > + /* Update the progress in the transaction header. */ > + UndoRecordUpdateTransInfo(&context, 0); > + > + /* WAL log the undo apply progress. */ > + { > + XLogRecPtr lsn; > + xl_undoapply_progress xlrec; > + > + xlrec.urec_ptr = progress_urec_ptr; > + xlrec.progress = block_num; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > + > + RegisterUndoLogBuffers(&context, 1); > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS); > + UndoLogBuffersSetLSN(&context, lsn); > + } > + > + END_CRIT_SECTION(); > + > + /* Release undo buffers. */ > + FinishUndoRecordInsert(&context); > +} This whole prepare/execute split for updating apply progress and next undo pointers makes no sense to me. > +/* > + * UndoAlreadyApplied - Retruns true, if the actions are already applied, *returns > + * false, otherwise. > + */ > +static bool > +UndoAlreadyApplied(FullTransactionId full_xid, UndoRecPtr to_urecptr) > +{ > + UnpackedUndoRecord *uur = NULL; > + UndoRecordFetchContext context; > + > + /* Fetch the undo record. */ > + BeginUndoFetch(&context); > + uur = UndoFetchRecord(&context, to_urecptr); > + FinishUndoFetch(&context); Literally all the places that fetch a record, fetch them with exactly this combination of calls. If that's the pattern, what do we gain by this split? Note that UndoBulkFetchRecord does *NOT* use an UndoRecordFetchContext, for reasons that are beyond me. (Sketch below.) > +static void > +ProcessAndApplyUndo(FullTransactionId full_xid, UndoRecPtr from_urecptr, > + UndoRecPtr to_urecptr, UndoRecPtr last_log_start_urec_ptr, > + bool complete_xact) > +{ > + UndoRecInfo *urecinfo; > + UndoRecPtr urec_ptr = from_urecptr; > + int undo_apply_size; > + > + /* > + * We choose maintenance_work_mem to collect the undo records for > + * rollbacks as most of the large rollback requests are done by > + * background worker which can be considered as maintainence operation. > + * However, we can introduce a new guc for this as well. > + */ > + undo_apply_size = maintenance_work_mem * 1024L; > + > + /* > + * Fetch the multiple undo records that can fit into undo_apply_size; sort > + * them and then rmgr specific callback to process them. Repeat this > + * until we process all the records for the transaction being rolled back. > + */ > + do > + { use for(;;) or while (true).
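To spell out the UndoFetchRecord point: if every caller does the same Begin/Fetch/Finish dance, the API may as well provide a one-shot wrapper - or lose the context argument entirely. A trivial sketch (the name is mine):

static UnpackedUndoRecord *
UndoFetchRecordSingle(UndoRecPtr urec_ptr)
{
	UndoRecordFetchContext context;
	UnpackedUndoRecord *uur;

	BeginUndoFetch(&context);
	uur = UndoFetchRecord(&context, urec_ptr);
	FinishUndoFetch(&context);

	return uur;
}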
> + BlockNumber progress_block_num = InvalidBlockNumber; > + int i; > + int nrecords; > + bool log_switched = false; > + bool rollback_completed = false; > + bool update_progress = false; > + UndoRecPtr progress_urec_ptr = InvalidUndoRecPtr; > + UndoRecInfo *first_urecinfo; > + UndoRecInfo *last_urecinfo; > + > + CHECK_FOR_INTERRUPTS(); > + > + /* > + * Fetch multiple undo records at once. > + * > + * At a time, we only fetch the undo records from a single undo log. > + * Once, we process all the undo records from one undo log, we update s/Once, we process/Once we have processed/ > + * the last_log_start_urec_ptr and proceed to the previous undo log. > + */ > + urecinfo = UndoBulkFetchRecord(&urec_ptr, last_log_start_urec_ptr, > + undo_apply_size, &nrecords, false); > + > + /* > + * Since the rollback of this transaction is in-progress, there will be > + * at least one undo record which is not yet discarded. > + */ > + Assert(nrecords > 0); > + > + /* > + * Get the required information from first and last undo record before > + * we sort all the records. > + */ > + first_urecinfo = &urecinfo[0]; > + last_urecinfo = &urecinfo[nrecords - 1]; > + if (last_urecinfo->uur->uur_info & UREC_INFO_LOGSWITCH) > + { > + UndoRecordLogSwitch *logswitch = last_urecinfo->uur->uur_logswitch; > + > + /* > + * We have crossed the log boundary. The rest of the undo for > + * this transaction is in some other log, the location of which > + * can be found from this record. See commets atop undoaccess.c. *comments > + /* > + * We need to save the undo record pointer of the last record from > + * previous undo log. We will use the same as from location in > + * next iteration of bulk fetch. > + */ > + Assert(UndoRecPtrIsValid(logswitch->urec_prevurp)); > + urec_ptr = logswitch->urec_prevurp; > + > + /* > + * The last fetched undo record corresponds to the first undo > + * record of the current log. Once, the undo actions are performed > + * from this log, we've to mark the progress as completed. > + */ > + progress_urec_ptr = last_urecinfo->urp; > + > + /* > + * We also need to save the start location of this transaction in > + * previous log. This will be used in the next iteration of bulk > + * fetch and updating progress location. > + */ > + if (complete_xact) > + { > + Assert(UndoRecPtrIsValid(logswitch->urec_prevlogstart)); > + last_log_start_urec_ptr = logswitch->urec_prevlogstart; > + } > + > + /* We've to update the progress for the current log as completed. */ > + update_progress = true; > + } > + else if (complete_xact) > + { > + if (UndoRecPtrIsValid(urec_ptr)) > + { > + /* > + * There are still some undo actions pending in this log. So, > + * just update the progress block number. > + */ > + progress_block_num = UndoRecPtrGetBlockNum(last_urecinfo->urp); > + > + /* > + * If we've not fetched undo records for more than one undo > + * block, we can't update the progress block number. Because, > + * there can still be undo records in this block that needs to > + * be applied for rolling back this transaction. > + */ > + if (UndoRecPtrGetBlockNum(first_urecinfo->urp) > progress_block_num) > + { > + update_progress = true; > + progress_urec_ptr = last_log_start_urec_ptr; > + } > + } > + else > + { > + /* > + * Invalid urec_ptr indicates that we have executed all the undo > + * actions for this transaction. So, mark current log header > + * as complete. 
*/ > + Assert(last_log_start_urec_ptr == to_urecptr); > + rollback_completed = true; > + update_progress = true; > + progress_urec_ptr = last_log_start_urec_ptr; > + } > + } This should be in a separate function. > + /* Free all undo records. */ > + for (i = 0; i < nrecords; i++) > + UndoRecordRelease(urecinfo[i].uur); > + > + /* Free urp array for the current batch of undo records. */ > + pfree(urecinfo); As noted elsewhere, I think that's the wrong memory management strategy. We should be using a memory context for undo processing, and then just reset it as a whole. For one, freeing granularly is inefficient. But more than that, it also means there's nothing to prevent memory leaks here. (Sketch below.) > +/* > + * execute_undo_actions - Execute the undo actions That's just a restatement of the function name. > + * full_xid - Transaction id that is getting rolled back. > + * from_urecptr - undo record pointer from where to start applying undo > + * actions. > + * to_urecptr - undo record pointer up to which the undo actions need to be > + * applied. > + * complete_xact - true if rollback is for complete transaction. > + */ > +void > +execute_undo_actions(FullTransactionId full_xid, UndoRecPtr from_urecptr, > + UndoRecPtr to_urecptr, bool complete_xact) > +{ Why is this lower case, but ApplyUndo() camel case? How is a reader supposed to know which one is used for what? > typedef struct TwoPhaseFileHeader > { > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader > uint16 gidlen; /* length of the GID - GID follows the header */ > XLogRecPtr origin_lsn; /* lsn of this record at origin node */ > TimestampTz origin_timestamp; /* time of prepare at origin node */ > + > + /* > + * We need the locations of the start and end undo record pointers when > + * rollbacks are to be performed for prepared transactions using undo-based > + * relations. We need to store this information in the file as the user > + * might rollback the prepared transaction after recovery and for that we > + * need its start and end undo locations. > + */ > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > } TwoPhaseFileHeader; Why do we not need that knowledge for undo processing of a non-prepared transaction? > @@ -191,6 +195,16 @@ typedef struct TransactionStateData > bool didLogXid; /* has xid been included in WAL record? */ > int parallelModeLevel; /* Enter/ExitParallelMode counter */ > bool chain; /* start a new block after this one */ > + > + /* start and end undo record location for each log category */ > + UndoRecPtr startUrecPtr[UndoLogCategories]; /* this is 'to' location */ > + UndoRecPtr latestUrecPtr[UndoLogCategories]; /* this is 'from' > + * location */ > + /* > + * whether the undo request is registered to be processed by worker later? > + */ > + bool undoRequestResgistered[UndoLogCategories]; > + s/Resgistered/Registered/ > @@ -2906,9 +2942,18 @@ CommitTransactionCommand(void) > * StartTransactionCommand didn't set the STARTED state > * appropriately, while TBLOCK_PARALLEL_INPROGRESS should be ended > * by EndParallelWorkerTransaction(), not this function. > + * > + * TBLOCK_(SUB)UNDO means the error has occurred while applying > + * undo for a (sub)transaction. We can't reach here as while s/We can't reach here as while/This can't be reached while/ > + * applying undo via top-level transaction, if we get an error, > + * then it is handled by ReleaseResourcesAndProcessUndo Where and how does it handle that? Maybe I misunderstand what you mean?
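Back to the memory management point above: what I'd expect is one context that each batch is allocated in, reset once per loop iteration. An untested sketch, reusing the names from the quoted loop:

	MemoryContext undo_apply_context;
	MemoryContext oldcontext;

	undo_apply_context = AllocSetContextCreate(CurrentMemoryContext,
											   "undo apply",
											   ALLOCSET_DEFAULT_SIZES);

	do
	{
		oldcontext = MemoryContextSwitchTo(undo_apply_context);

		/* Everything the batch allocates, leaks included, lands here. */
		urecinfo = UndoBulkFetchRecord(&urec_ptr, last_log_start_urec_ptr,
									   undo_apply_size, &nrecords, false);

		/* ... sort the records and call the rmgr undo callbacks ... */

		MemoryContextSwitchTo(oldcontext);

		/* One cheap reset replaces per-record UndoRecordRelease()/pfree(). */
		MemoryContextReset(undo_apply_context);
	} while (UndoRecPtrIsValid(urec_ptr));

	MemoryContextDelete(undo_apply_context);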
> + case TBLOCK_UNDO: > + /* > + * We reach here when we got error while applying undo > + * actions, so we don't want to again start applying it. Undo > + * workers can take care of it. > + * > + * AbortTransaction is already done, still need to release > + * locks and perform cleanup. > + */ > + ResetUndoActionsInfo(); > + ResourceOwnerRelease(s->curTransactionOwner, > + RESOURCE_RELEASE_LOCKS, > + false, > + true); > + s->state = TRANS_ABORT; > CleanupTransaction(); Hm. Why is it ok that we only perform that cleanup action? Either the rest of potentially held resources will get cleaned up somehow as well, in which case this ResourceOwnerRelease() ought to be redundant, or we're potentially leaking important resources like buffer pins, relcache references and whatnot here? > +/* > + * CheckAndRegisterUndoRequest - Register the request for applying undo > + * actions. > + * > + * It sets the transaction state to indicate whether the request is pushed to > + * the background worker which is used later to decide whether to apply the > + * actions. > + * > + * It is important to do this before marking the transaction as aborted in > + * clog otherwise, it is quite possible that discard worker miss this rollback > + * request from the computation of oldestXidHavingUnappliedUndo. This is > + * because it might do that computation before backend can register it in the > + * rollback hash table. So, neither oldestXmin computation will consider it > + * nor the hash table pass would have that value. > + */ > +static void > +CheckAndRegisterUndoRequest() (void) > +{ > + TransactionState s = CurrentTransactionState; > + bool result; > + int i; > + > + /* > + * We don't want to apply the undo actions when we are already cleaning up > + * for FATAL error. See ReleaseResourcesAndProcessUndo. > + */ > + if (SemiCritSectionCount > 0) > + { > + ResetUndoActionsInfo(); > + return; > + } Wait what? Semi critical sections? > + for (i = 0; i < UndoLogCategories; i++) > + { > + /* > + * We can't push the undo actions for temp table to background > + * workers as the the temp tables are only accessible in the > + * backend that has created them. > + */ > + if (i != UNDO_TEMP && UndoRecPtrIsValid(s->latestUrecPtr[i])) > + { > + result = RegisterUndoRequest(s->latestUrecPtr[i], > + s->startUrecPtr[i], > + MyDatabaseId, > + GetTopFullTransactionId()); > + s->undoRequestResgistered[i] = result; > + } > + } Given code like this, I have a hard time seeing what the point of having separate queue entries for the different persistency levels is. > +void > +ReleaseResourcesAndProcessUndo(void) > +{ > + TransactionState s = CurrentTransactionState; > + > + /* > + * We don't want to apply the undo actions when we are already cleaning up > + * for FATAL error. One of the main reasons is that we might be already > + * processing undo actions for a (sub)transaction when we reach here > + * (for ex. error happens while processing undo actions for a > + * subtransaction). > + */ > + if (SemiCritSectionCount > 0) > + { > + ResetUndoActionsInfo(); > + return; > + } > + > + if (!NeedToPerformUndoActions()) > + return; > + > + /* > + * State should still be TRANS_ABORT from AbortTransaction(). > + */ > + if (s->state != TRANS_ABORT) > + elog(FATAL, "ReleaseResourcesAndProcessUndo: unexpected state %s", > + TransStateAsString(s->state)); > + > + /* > + * Do abort cleanup processing before applying the undo actions. We must > + * do this before applying the undo actions to remove the effects of > + * failed transaction.
> + */ > + if (IsSubTransaction()) > + { > + AtSubCleanup_Portals(s->subTransactionId); > + s->blockState = TBLOCK_SUBUNDO; > + } > + else > + { > + AtCleanup_Portals(); /* now safe to release portal memory */ > + AtEOXact_Snapshot(false, true); /* and release the transaction's > + * snapshots */ Why do precisely these actions need to be performed here? > + s->fullTransactionId = InvalidFullTransactionId; > + s->subTransactionId = TopSubTransactionId; > + s->blockState = TBLOCK_UNDO; > + } > + > + s->state = TRANS_UNDO; This seems guaranteed to constantly be out of date with other modifications of the commit/abort sequence. > +bool > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid, > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr, > + bool *undoRequestResgistered, bool isSubTrans) > +{ > + UndoRequestInfo urinfo; > + int i; > + uint32 save_holdoff; > + bool success = true; > + > + for (i = 0; i < UndoLogCategories; i++) > + { > + if (end_urec_ptr[i] && !undoRequestResgistered[i]) > + { > + save_holdoff = InterruptHoldoffCount; > + > + PG_TRY(); > + { > + /* for subtransactions, we do partial rollback. */ > + execute_undo_actions(fxid, > + end_urec_ptr[i], > + start_urec_ptr[i], > + !isSubTrans); > + } > + PG_CATCH(); > + { > + /* > + * Add the request into an error queue so that it can be > + * processed in a timely fashion. > + * > + * If we fail to add the request in an error queue, then mark > + * the entry status as invalid and continue to process the > + * remaining undo requests if any. This request will be later > + * added back to the queue by discard worker. > + */ > + ResetUndoRequestInfo(&urinfo); > + urinfo.dbid = dbid; > + urinfo.full_xid = fxid; > + urinfo.start_urec_ptr = start_urec_ptr[i]; > + if (!InsertRequestIntoErrorUndoQueue(&urinfo)) > + RollbackHTMarkEntryInvalid(urinfo.full_xid, > + urinfo.start_urec_ptr); > + /* > + * Errors can reset holdoff count, so restore back. This is > + * required because this function can be called after holding > + * interrupts. > + */ > + InterruptHoldoffCount = save_holdoff; > + > + /* Send the error only to server log. */ > + err_out_to_client(false); > + EmitErrorReport(); > + > + success = false; > + > + /* We should never reach here when we are in a semi-critical-section. */ > + Assert(SemiCritSectionCount == 0); This seems entirely and completely broken. You can't just catch an exception and continue. What if somebody held an lwlock when the error was thrown? A buffer pin? As far as I can tell the semi crit section stuff doesn't protect you against anything here, because it's not used exclusively. > +to complete the requests by themselves. There is an exception to it where when > +error queue becomes full, we just mark the request as 'invalid' and continue to > +process other requests if any. The discard worker will find this errored > +transaction at later point of time and again add it to the request queues. You say it's an exception, but you do not explain why that exception is there. Nor why that's not a problem for: > +We have the hard limit (proportional to the size of the rollback hash table) > +for the number of transactions that can have pending undo. This can help us > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to > +accumulate pending undo for a long time which will eventually block the > +discard of undo. > + * The main responsibility of the discard worker is to discard the undo log > + * of transactions that are committed and all-visible or are rolledback. 
It *rolled back > + * also registers the request for aborted transactions in the work queues. > + * To know more about work queues, see undorequest.c. It iterates through all > + * the active logs one-by-one and try to discard the transactions that are old > + * enough to matter. > + * > + * For tranasctions that spans across multiple logs, the log for committed and *transactions > + * all-visible transactions are discarded seprately for each log. This is *separately > + * possible as the transactions that span across logs have separate transaction > + * header for each log. For aborted transactions, we try to process the actions *transaction headers > + * of entire transaction at one-shot as we need to perform the actions starting *an entire transaction in one shot > + * from end location to start location. However, it is possbile that the later *possible > + * portion of transaction that is overflowed into a separate log can be processed *a transaction > + * separately if we encounter the corresponding log first. If we want we can > + * combine the log for processing in that case as well, but there is no clear > + * advantage of the same. *of doing so > +void > +DiscardWorkerRegister(void) > +{ > + BackgroundWorker bgw; > + > + memset(&bgw, 0, sizeof(bgw)); > + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | > + BGWORKER_BACKEND_DATABASE_CONNECTION; Why is a database needed? > + /* > + * Scan all the undo logs and intialize the rollback hash table with all > + * the pending rollback requests. This need to be done as a first step > + * because only after this the transactions will be allowed to write new > + * undo. See comments atop UndoLogProcess. > + */ > + UndoLogProcess(); Too generic name. > @@ -668,6 +676,50 @@ PrepareUndoInsert(UndoRecordInsertContext *context, > UndoCompressionInfo *compression_info = > &context->undo_compression_info[context->alloc_context.category]; > > + if (!InRecovery && IsUnderPostmaster) > + { > + int try_count = 0; > + > + /* > + * If we are not in a recovery and not in a single-user-mode, then undo s/in a single-user-mode/in single-user-mode/ (although I'd also remove the dashes) > + * generation should not be allowed until we have scanned all the undo > + * logs and initialized the hash table with all the aborted > + * transaction entries. See detailed comments in UndoLogProcess. > + */ > + while (!ProcGlobal->rollbackHTInitialized) > + { > + /* Error out after trying for one minute. */ > + if (try_count > ROLLBACK_HT_INIT_WAIT_TRY) > + ereport(ERROR, > + (errcode(ERRCODE_E_R_E_MODIFYING_SQL_DATA_NOT_PERMITTED), > + errmsg("rollback hash table is not yet initialized, wait for sometime and try again"))); > + > + /* > + * Rollback hash table is not yet intialized, sleep for 1 second > + * and try again. > + */ > + pg_usleep(1000000L); > + try_count++; > + } > + } I think it's wrong to do this here. We shouldn't open the database for writes before having performed sufficient initialization. If done like that, we shouldn't ever get here. Without such sequencing it's actually not possible to bring up a standby and allow writes in a normal way - the first few transactions will just fail. That's not ok. Nor are new retry loops with sleeps ok IMO. > + /* > + * If the rollback hash table is already full (excluding one additional > + * space for each backend) then don't allow to generate any new undo until > + * we apply some of the pending requests and create some space in the hash > + * table to accept new rollback requests.
Leave the enough slots in the > + * hash table so that there is space for all the backends to register at > + * least one request. This is to protect the situation where one backend > + * keep consuming slots reserve for the other backends and suddenly there > + * is concurrent undo request from all the backends. So we always keep > + * the space reserve for MaxBackends. > + */ > + if (ProcGlobal->xactsHavingPendingUndo > > + (UndoRollbackHashTableSize() - MaxBackends)) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_RESOURCES), > + errmsg("max limit for pending rollback request has reached, wait for sometime and try again"))); > + Why do we need to do this work every time we're inserting undo? Shouldn't that just happen once, when first accessing an undo log in a transaction? > + /* There might not be any undo log and hibernation might be needed. */ > + *hibernate = true; > + > + StartTransactionCommand(); Why do we need this? I assume it's so we can have a resource owner? Out of energy. Greetings, Andres Freund
Hi, On 2019-08-06 00:56:26 -0700, Andres Freund wrote: > Out of energy. Here's the last section of my low-level review. Plan to write a higher level summary afterwards, now that I have a better picture of the code. > +static void > +UndoDiscardOneLog(UndoLogSlot *slot, TransactionId xmin, bool *hibernate) I think the naming here is pretty confusing. We have UndoDiscard(), UndoDiscardOneLog(), UndoLogDiscard(). I don't think anybody really can be expected to understand what is supposed to be what from these names. > + /* Loop until we run out of discardable transactions in the given log. */ > + do > + { for(;;) or while (true) > + TransactionId wait_xid = InvalidTransactionId; > + bool pending_abort = false; > + bool request_rollback = false; > + UndoStatus status; > + UndoRecordFetchContext context; > + > + next_insert = UndoLogGetNextInsertPtr(logno); > + > + /* There must be some undo data for a transaction. */ > + Assert(next_insert != undo_recptr); > + > + /* Fetch the undo record for the given undo_recptr. */ > + BeginUndoFetch(&context); > + uur = UndoFetchRecord(&context, undo_recptr); > + FinishUndoFetch(&context); > + > + if (uur != NULL) > + { > + if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED) FWIW, this is precisely my problem with exposing such small informational functions, which actually have to perform some work. As is, there are several places looking up the underlying undo slot, within just these lines of code. We do it once in UndoLogGetNextInsertPtr(). Then again in UndoFetchRecord(). And then again in UndoRecPtrGetCategory(). And then later again multiple times when actually discarding. That perhaps doesn't matter from a performance POV, but for me that indicates that the APIs aren't quite right. > + { > + /* > + * For the "shared" category, we only discard when the > + * rm_undo_status callback tells us we can. > + */ Is there a description as to what the rm_undo_status callback is intended to do? It currently is mandatory, is that intended? Why does this only apply to shared records? And why just for SHARED, not for any of the others? > + else > + { > + TransactionId xid = XidFromFullTransactionId(uur->uur_fxid); > + > + /* > + * Otherwise we use the CLOG and xmin to decide whether to > + * wait, discard or roll back. > + * > + * XXX: We've added the transaction-in-progress check to > + * avoid xids of in-progress autovacuum as those are not > + * computed for oldestxmin calculation. Hm. xids of autovacuum? The concern here is the xid that autovacuum might acquire when locking a relation for truncating a table at the end, with wal_level=replica? Because otherwise it shouldn't have any xids? > See > + * DiscardWorkerMain. Hm. This actually reminds me of a complaint I have about this. ISTM that the logic for discarding itself should be separate from the discard worker. I'd just add that, and a UDF to invoke it, in a separate commit. > + /* > + * Add the aborted transaction to the rollback request queues. > + * > + * We can ignore the abort for transactions whose corresponding > + * database doesn't exist. > + */ > + if (request_rollback && dbid_exists(uur->uur_txn->urec_dbid)) > + { > + (void) RegisterUndoRequest(InvalidUndoRecPtr, > + undo_recptr, > + uur->uur_txn->urec_dbid, > + uur->uur_fxid); > + > + pending_abort = true; > + } As I think I said before: This imo should not be necessary. > + > + /* > + * We can discard upto this point when one of following conditions is *up to > + * met: (a) we need to wait for a transaction first.
(b) there is no > + * more log to process. (c) the transaction undo in current log is > + * finished. (d) there is a pending abort. > + */ This comment is hard to understand. Perhaps you're missing some words? Because it's e.g. not clear what it means that "we can discard up to this point", when we "need to wait for a transaction first". Those seem strictly contradictory. I assume what this is trying to say is that we now have reached the end of the range of undo that can be discarded, so we should do so now? But it's really quite muddled, because we don't actually necessarily discard here, because we might have a wait_xid, for example? > + if (TransactionIdIsValid(wait_xid) || > + next_urecptr == InvalidUndoRecPtr || > + UndoRecPtrGetLogNo(next_urecptr) != logno || > + pending_abort) Hm. Is it guaranteed that wait_xid isn't actually old enough that we could discard further? I haven't figured out what precisely the purpose of rm_undo_status is, so I'm not sure. But the alternative seems to be that the callback would need to perform its own GetOldestXmin() computations etc, which seems to make no sense? It seems to me that the whole DidCommit/!InProgress/ block should not be part of the if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED) else if block, but follow it? I.e. the only thing inside the else should be XidFromFullTransactionId(uur->uur_fxid), and then we check afterwards whether it, or rm_undo_status()'s return value requires waiting? > + { > + /* Hey, I got some undo log to discard, can not hibernate now. */ > + *hibernate = false; I don't understand why this block sets *hibernate to false. I mean need_discard is not guaranteed to be true at this point, no? > + /* > + * If we don't need to wait for this transaction and this is not > + * an aborted transaction, then we can discard it as well. > + */ > + if (!TransactionIdIsValid(wait_xid) && !pending_abort) > + { > + /* > + * It is safe to use next_insert as the location till which we > + * want to discard in this case. If something new has been > + * added after we have fetched this transaction's record, it > + * won't be considered in this pass of discard. > + */ > + undo_recptr = next_insert; > + latest_discardxid = XidFromFullTransactionId(undofxid); > + need_discard = true; > + > + /* We don't have anything more to discard. */ > + undofxid = InvalidFullTransactionId; > + } > + /* Update the shared memory state. */ > + LWLockAcquire(&slot->discard_lock, LW_EXCLUSIVE); > + > + /* > + * If the slot has been recycling while we were thinking about it, *recycled > + * we have to abandon the operation. > + */ > + if (slot->logno != logno) > + { > + LWLockRelease(&slot->discard_lock); > + break; > + } > + > + /* Update the slot information for the next pass of discard. */ > + slot->wait_fxmin = undofxid; > + slot->oldest_data = undo_recptr; Perhaps 'next pass of UndoDiscard()' instead? I found it confusing that UndoDiscardOneLog() is a loop, meaning that the "next pass" could perhaps reference the next pass through UndoDiscardOneLog()'s loop. But it's for UndoDiscard(). > + LWLockRelease(&slot->discard_lock); > + > + if (need_discard) > + { > + LWLockAcquire(&slot->discard_update_lock, LW_EXCLUSIVE); > + UndoLogDiscard(undo_recptr, latest_discardxid); > + LWLockRelease(&slot->discard_update_lock); > + } > + > + break; > + } It seems to me that the entire block above just shouldn't be inside the loop. As far as I can tell the point of the loop is to figure out up to where we can discard.
Putting the actual discarding inside that loop is just confusing (and requires deeper indentation than necessary). > +/* > + * Scan all the undo logs and register the aborted transactions. This is > + * called as a first function from the discard worker and only after this pass "a first function"? There can only be one first function, no? Also, what does "first function" really mean? As I wrote earlier, I think this function name is too generic, it doesn't explain anything. And I think it's not OK for it to be called (the bgworker is started with BgWorkerStart_RecoveryFinished) after the system is supposed to be ready (i.e. StartupXlog() has finished, we signal that we're up to pg_ctl etc, and allow writing transactions), while at the same time being necessary before writes can be allowed (we throw errors in PrepareUndoInsert()). > + * over undo logs is complete, new undo can is allowed to be written in the "undo can"? > + * system. This is required because after crash recovery we don't know the > + * exact number of aborted transactions whose rollback request is pending and > + * we can not allow new undo request if we already have the request equal to > + * hash table size. So before start allowing any new transaction to write the > + * undo we need to make sure that we know exact number of pending requests. > + */ > +void > +UndoLogProcess() (void) > +{ > + UndoLogSlot *slot = NULL; > + > + /* > + * We need to perform this in a transaction because (a) we need resource > + * owner to scan the logs and (b) TransactionIdIsInProgress requires us to > + * be in transaction. > + */ > + StartTransactionCommand(); The need for resowners does not imply needing transactions. I think nearly all aux processes, for example, don't use transactions, but do have a resowner. > + /* > + * Loop through all the valid undo logs and scan them transaction by > + * transaction to find non-commited transactions if any and register them > + * in the rollback hash table. > + */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + UndoRecPtr undo_recptr; > + UnpackedUndoRecord *uur = NULL; > + > + /* We do not execute shared (non-transactional) undo records. */ > + if (slot->meta.category == UNDO_SHARED) > + continue; > + > + /* Start scanning the log from the last discard point. */ > + undo_recptr = UndoLogGetOldestRecord(slot->logno, NULL); > + > + /* Loop until we scan complete log. */ > + while (1) > + { > + TransactionId xid; > + UndoRecordFetchContext context; > + > + /* Done with this log. */ > + if (!UndoRecPtrIsValid(undo_recptr)) > + break; Why isn't this loop while(UndoRecPtrIsValid(undo_recptr))? > + /* > + * Register the rollback request for all uncommitted and not in > + * progress transactions whose undo apply progress is still not > + * completed. Even though we don't allow any new transactions to > + * write undo until this first pass is completed, there might be > + * some prepared transactions which are still in progress, so we > + * don't include such transactions. > + */ > + if (!TransactionIdDidCommit(xid) && > + !TransactionIdIsInProgress(xid) && > + !IsXactApplyProgressCompleted(uur->uur_txn->urec_progress)) > + { > + (void) RegisterUndoRequest(InvalidUndoRecPtr, undo_recptr, > + uur->uur_txn->urec_dbid, > + uur->uur_fxid); > + } > + > + /* > + * Go to the next transaction in the same log. If uur_next is > + * point to the undo record pointer in the different log then we are "is point" > + * done with this log so just set undo_recptr to InvalidUndoRecPtr.
> + */ > + if (UndoRecPtrGetLogNo(undo_recptr) == > + UndoRecPtrGetLogNo(uur->uur_txn->urec_next)) > + undo_recptr = uur->uur_txn->urec_next; > + else > + undo_recptr = InvalidUndoRecPtr; > + > + /* Release memory for the current record. */ > + UndoRecordRelease(uur); > + } > + } > + * XXX Ideally we can arrange undo logs so that we can efficiently find > + * those with oldest_xid < oldestXmin, but for now we'll just scan all of > + * them. > + */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + /* > + * If the log is already discarded, then we are done. It is important > + * to first check this to ensure that tablespace containing this log > + * doesn't get dropped concurrently. > + */ > + LWLockAcquire(&slot->mutex, LW_SHARED); > + /* > + * We don't have to worry about slot recycling and check the logno > + * here, since we don't care about the identity of this slot, we're > + * visiting all of them. > + */ > + if (slot->meta.discard == slot->meta.unlogged.insert) > + { > + LWLockRelease(&slot->mutex); > + continue; > + } > + LWLockRelease(&slot->mutex); I'm fairly sure that pgindent will add some newlines here... It's a good practice to re-pgindent patches. > + /* We can't process temporary undo logs. */ > + if (slot->meta.category == UNDO_TEMP) > + continue; > + > + /* > + * If the first xid of the undo log is smaller than the xmin then try > + * to discard the undo log. > + */ > + if (!FullTransactionIdIsValid(slot->wait_fxmin) || > + FullTransactionIdPrecedes(slot->wait_fxmin, oldestXidHavingUndo)) So the comment describes something different than what's happening, while otherwise not adding much over the code... That's imo confusing. > + { > + /* Process the undo log. */ > + UndoDiscardOneLog(slot, oldestXmin, hibernate); That comment seems unhelpful. > + * XXX: In future, if multiple workers can perform discard then we may > + * need to use compare and swap for updating the shared memory value. > + */ > + if (FullTransactionIdIsValid(oldestXidHavingUndo)) > + pg_atomic_write_u64(&ProcGlobal->oldestFullXidHavingUnappliedUndo, > + U64FromFullTransactionId(oldestXidHavingUndo)); Seems like a lock would be more appropriate if we ever needed that - only other discard workers would need it, so ... > +/* > + * Discard all the logs. This is particularly required in single user mode > + * where at the commit time we discard all the undo logs. > + */ > +void > +UndoLogDiscardAll(void) > +{ > + UndoLogSlot *slot = NULL; > + > + Assert(!IsUnderPostmaster); > + > + /* > + * No locks are required for discard, since this called only in single > + * user mode. > + */ > + while ((slot = UndoLogNextSlot(slot))) > + { > + /* If the log is already discarded, then we are done. */ > + if (slot->meta.discard == slot->meta.unlogged.insert) > + continue; > + > + /* > + * Process the undo log. > + */ > + UndoLogDiscard(MakeUndoRecPtr(slot->logno, slot->meta.unlogged.insert), > + InvalidTransactionId); > + } > + > +} Uh. So. What happens if we start up in single user mode while transactions that haven't been rolled back yet exist? Which seems like a pretty typical situation for single user mode, because usually something has gone wrong before, which means it's quite likely that there are transactions that effectively aborted and haven't processed undo? How is this not entirely broken? > +/* > + * Discard the undo logs for temp tables. > + */ > +void > +TempUndoDiscard(UndoLogNumber logno) > +{ The only callsite for this is: + case ONCOMMIT_TEMP_DISCARD: + /* Discard temp table undo logs for temp tables. 
*/ + TempUndoDiscard(oc->relid); + break; Which looks mightily odd, given that relid doesn't really sound like an undo log number. There's also no code actually registering an ONCOMMIT_TEMP_DISCARD callback. Nor is it clear to me why it, in general, would be correct to drop undo pre-commit, even for temp relations. It's fine for ON COMMIT DROP relations, but what about temporary relations that are longer lived than that? As the transaction can still fail at this stage - e.g. due to serialization failures - we'd just throw undo away that we'll need later? > @@ -943,9 +1077,24 @@ CanPushReqToUndoWorker(UndoRecPtr start_urec_ptr, UndoRecPtr end_urec_ptr, > > /* > * We normally push the rollback request to undo workers if the size of > - * same is above a certain threshold. > + * same is above a certain threshold. However, discard worker is allowed *the discard worker > * The request can't be pushed into the undo worker queue. The I don't think 'undo worker queue' is really correct. It's not one worker, and it's not one queue. And we're not queueing for a specific worker. > - * backends will try executing by itself. "Executing by itself" doesn't sound right. Execute the undo itself? > + * backends will try executing by itself. The discard worker will > + * keep the entry into the rollback hash table with "will keep the entry into" doesn't sound right. Insert? > + * UNDO_REQUEST_INVALID status. Such requests will be added in the > + * undo worker queues in the subsequent passes over undo logs by > + * discard worker. > */ > - else > + else if (!IsDiscardProcess()) > rh->status = UNDO_REQUEST_INPROGRESS; > + else > + rh->status = UNDO_REQUEST_INVALID; > } I don't understand what the point of this is. We add an entry into the hashtable, but mark it as invalid? How does this not allow to run out of memory? > + * To know more about work queues, see undorequest.c. The worker is launched > + * to handle requests for a particular database. I thought we had agreed that workers pick databases after they're started? There seems to be plenty code in here that does not implement that. > +/* SIGTERM: set flag to exit at next convenient time */ > +static void > +UndoworkerSigtermHandler(SIGNAL_ARGS) > +{ > + got_SIGTERM = true; > + > + /* Waken anything waiting on the process latch */ > + SetLatch(MyLatch); > +} > + > +/* SIGHUP: set flag to reload configuration at next convenient time */ > +static void > +UndoLauncherSighup(SIGNAL_ARGS) > +{ > + int save_errno = errno; > + > + got_SIGHUP = true; > + > + /* Waken anything waiting on the process latch */ > + SetLatch(MyLatch); > + > + errno = save_errno; > +} So one handler saves errno, the other doesn't... > +/* > + * Wait for a background worker to start up and attach to the shmem context. > + * > + * This is only needed for cleaning up the shared memory in case the worker > + * fails to attach. > + */ > +static void > +WaitForUndoWorkerAttach(UndoApplyWorker * worker, > + uint16 generation, > + BackgroundWorkerHandle *handle) Once we have undo workers pick their db, this should not be needed anymore. The launcher shouldn't even prepare anything in shared memory for it. > +/* > + * Returns whether an undo worker is available. > + */ > +static int > +IsUndoWorkerAvailable(void) > +{ > + int i; > + int alive_workers = 0; > + > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + > + /* Search for attached workers. 
*/ > + for (i = 0; i < max_undo_workers; i++) > + { > + UndoApplyWorker *w = &UndoApplyCtx->workers[i]; > + > + if (w->in_use) > + alive_workers++; > + } > + > + LWLockRelease(UndoWorkerLock); > + > + return (alive_workers < max_undo_workers); > +} > + > +/* Sets the worker's lingering status. */ > +static void > +UndoWorkerIsLingering(bool sleep) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + > + MyUndoWorker->lingering = sleep; > + > + LWLockRelease(UndoWorkerLock); > +} > + > +/* Get the dbid and undo worker queue set by the undo launcher. */ > +static void > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > + > + MyUndoWorker = &UndoApplyCtx->workers[slot]; > + > + if (!MyUndoWorker->in_use) > + { > + LWLockRelease(UndoWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("undo worker slot %d is empty", > + slot))); > + } > + > + urinfo->dbid = MyUndoWorker->dbid; > + urinfo->undo_worker_queue = MyUndoWorker->undo_worker_queue; > + > + LWLockRelease(UndoWorkerLock); > +} Why do all these need an exclusive lock? > +/* > + * Perform rollback request. We need to connect to the database for first > + * request and that is required because we access system tables while > + * performing undo actions. > + */ > +static void > +UndoWorkerPerformRequest(UndoRequestInfo * urinfo) > +{ > + bool error = false; > + > + /* must be connected to the database. */ > + Assert(MyDatabaseId != InvalidOid); The comment above says "We need to connect to the database", yet we assert here that we "must be connected to the database". > +/* > + * UndoLauncherRegister > + * Register a background worker running the undo worker launcher. > + */ > +void > +UndoLauncherRegister(void) > +{ > + BackgroundWorker bgw; > + > + if (max_undo_workers == 0) > + return; > + > + memset(&bgw, 0, sizeof(bgw)); > + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | > + BGWORKER_BACKEND_DATABASE_CONNECTION; > + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; > + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); > + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "UndoLauncherMain"); > + snprintf(bgw.bgw_name, BGW_MAXLEN, > + "undo worker launcher"); > + snprintf(bgw.bgw_type, BGW_MAXLEN, > + "undo worker launcher"); > + bgw.bgw_restart_time = 5; > + bgw.bgw_notify_pid = 0; > + bgw.bgw_main_arg = (Datum)0; > + > + RegisterBackgroundWorker(&bgw); > +} > + > +/* > + * Main loop for the undo worker launcher process. > + */ > +void > +UndoLauncherMain(Datum main_arg) > +{ > + UndoRequestInfo urinfo; > + > + ereport(DEBUG1, > + (errmsg("undo launcher started"))); > + > + before_shmem_exit(UndoLauncherOnExit, (Datum) 0); > + > + Assert(UndoApplyCtx->launcher_pid == 0); > + UndoApplyCtx->launcher_pid = MyProcPid; > + > + /* Establish signal handlers. */ > + pqsignal(SIGHUP, UndoLauncherSighup); > + pqsignal(SIGTERM, UndoworkerSigtermHandler); > + BackgroundWorkerUnblockSignals(); > + > + /* Establish connection to nailed catalogs. */ > + BackgroundWorkerInitializeConnection(NULL, NULL, 0); Why do we need to be connected in the launcher? I assume that's because we still do checks on the database? Greetings, Andres Freund
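PS: To be explicit about the errno point above, both handlers should follow the usual pattern (sketch of the corrected SIGTERM handler):

/* SIGTERM: set flag to exit at next convenient time */
static void
UndoworkerSigtermHandler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	got_SIGTERM = true;

	/* Waken anything waiting on the process latch */
	SetLatch(MyLatch);

	errno = save_errno;
}

SetLatch() can clobber errno (it may end up in kill() or write()), so a handler that doesn't save and restore it can corrupt errno in the interrupted code.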
Hi, I'll be responding to a bunch of long review emails in this thread point by point separately, but just picking out a couple of points here that jumped out at me: On Wed, Aug 7, 2019 at 9:18 AM Andres Freund <andres@anarazel.de> wrote: > > + { > > + /* > > + * For the "shared" category, we only discard when the > > + * rm_undo_status callback tells us we can. > > + */ > > Is there a description as to what the rm_status callback is intended to > do? It currently is mandatory, is that intended? Why does this only > apply to shared records? And why just for SHARED, not for any of the others? Yeah, I will respond to this. After recent discussions with Robert the whole UNDO_SHARED concept looks a bit shaky, and there's a better way trying to get out -- more on that soon. > > See > > + * DiscardWorkerMain. > > Hm. This actually reminds me of a complaint I have about this. ISTM that > the logic for discarding itself should be separate from the discard > worker. I'd just add that, and a UDF to invoke it, in a separate commit. That's not a bad idea -- I have a 'pg_force_discard()' SP, currently a bit raw, which I'll include in my next patchset and plan to make a bit smarter -- it might make sense to use the same code path for that. -- Thomas Munro https://enterprisedb.com
Hello Andres, On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > +/* Each worker queue is a binary heap. */ > > +typedef struct > > +{ > > + binaryheap *bh; > > + union > > + { > > + UndoXidQueue *xid_elems; > > + UndoSizeQueue *size_elems; > > + UndoErrorQueue *error_elems; > > + } q_choice; > > +} UndoWorkerQueue; > > As we IIRC have decided to change this into a rbtree, I'll ignore > related parts of the current code. What is the status of that work? > I've checked the git trees, without seeing anything? Your last mail with > patches > https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com > doesn't seem to contain that either? > Yeah, we're changing this into a rbtree. This is still work-in-progress. > > > ......... > > +#define GetErrorQueueNthElem(n) \ > > +( \ > > + AssertMacro(!ErrorQueueIsEmpty()), \ > > + DatumGetPointer(binaryheap_nth(UndoWorkerQueues[ERROR_QUEUE].bh, n)) \ > > +) > > > -ETOOMANYMACROS > > I think nearly all of these shouldn't exist. See further below. > > > > +#define SetErrorQueueElem(elem, e_dbid, e_full_xid, e_start_urec_ptr, e_retry_at, e_occurred_at) \ > > +( \ > > + GetErrorQueueElem(elem).dbid = e_dbid, \ > > + GetErrorQueueElem(elem).full_xid = e_full_xid, \ > > + GetErrorQueueElem(elem).start_urec_ptr = e_start_urec_ptr, \ > > + GetErrorQueueElem(elem).next_retry_at = e_retry_at, \ > > + GetErrorQueueElem(elem).err_occurred_at = e_occurred_at \ > > +) > > It's very very rarely a good idea to have macros that evaluate their > arguments multiple times. It'll also never be a good idea to get the > same element multiple times from a queue. If needed - I'm very doubtful > of that, given that there's a single caller - it should be a static > inline function that gets the element once, stores it in a local > variable, and then updates all the fields. > Noted. Earlier, Robert also raised the point of using so many macros. He also suggested to use a single type of object that stores all the information we need. It'll make things simpler and easier to understand. In the upcoming patch set, we're removing all these changes. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
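PS: For the archives, the static-inline shape suggested above would look roughly like this (sketch; UndoErrorQueueElem is a stand-in name for the queue's element type, the rest follows the current patch):

static inline void
SetErrorQueueElem(int elem, Oid dbid, FullTransactionId full_xid,
				  UndoRecPtr start_urec_ptr, TimestampTz next_retry_at,
				  TimestampTz err_occurred_at)
{
	/* fetch the element exactly once */
	UndoErrorQueueElem *e = &GetErrorQueueElem(elem);

	e->dbid = dbid;
	e->full_xid = full_xid;
	e->start_urec_ptr = start_urec_ptr;
	e->next_retry_at = next_retry_at;
	e->err_occurred_at = err_occurred_at;
}

Each argument is evaluated exactly once, and the queue lookup happens only once.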
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > > Need to do something else for a bit. More later. > > Here we go. > Thanks for the review. I will work on them. Currently, I need suggestions on some of the review comments. > > > + /* > > + * Compute the header size of the undo record. > > + */ > > +Size > > +UndoRecordHeaderSize(uint16 uur_info) > > +{ > > + Size size; > > + > > + /* Add fixed header size. */ > > + size = SizeOfUndoRecordHeader; > > + > > + /* Add size of transaction header if it presets. */ > > + if ((uur_info & UREC_INFO_TRANSACTION) != 0) > > + size += SizeOfUndoRecordTransaction; > > + > > + /* Add size of rmid if it presets. */ > > + if ((uur_info & UREC_INFO_RMID) != 0) > > + size += sizeof(RmgrId); > > + > > + /* Add size of reloid if it presets. */ > > + if ((uur_info & UREC_INFO_RELOID) != 0) > > + size += sizeof(Oid); > > + > There's numerous blocks with one if for each type, and the body copied basically the same for each alternative. That doesn't seem like a reasonable approach to me. Means that many places need to be adjusted when we invariably add another type, and seems likely to lead to bugs over time. > I agree with the point that we are repeating this in a couple of functions and doing different actions, e.g. in this function we are computing the size and in some other function we are copying the fields. I am not sure what would be the best way to handle it. One approach could be to just write one function which handles all these cases, with the caller specifying what action to take. Basically, it will look like this. Function (uur_info, action) { if ((uur_info & UREC_INFO_TRANSACTION) != 0) { // if action is compute header size size += SizeOfUndoRecordTransaction; //else if action is copy to dest dest = src ... } Repeat for other types } But, IMHO, it will be confusing for anyone to see what exactly that function is trying to achieve. If anyone has a better idea, please suggest it. > > +/* > > + * Insert the undo record into the input page from the unpack undo context. > > + * > > + * Caller can call this function multiple times until desired stage is reached. > > + * This will write the undo record into the page. > > + */ > > +void > > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > > +{ > > + char *writeptr = (char *) page + starting_byte; > > + char *endptr = (char *) page + BLCKSZ; > > + > > + switch (ucontext->stage) > > + { > > + case UNDO_PACK_STAGE_HEADER: > > + /* Insert undo record header. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_hd, > > + SizeOfUndoRecordHeader, &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION; > > + /* fall through */ > > + > > I don't understand. The only purpose of this is that we can partially write a packed-but-not-actually-packed record onto a bunch of pages? And for that we have an endless chain of copy and pasted code calling InsertUndoBytes()? Copying data into shared buffers in tiny increments? > > If we need to this, what is the whole packed record format good for? Except for adding a bunch of functions with 10++ ifs and nearly identical code? > > Copying data is expensive. Copying data in tiny increments is more expensive. Copying data in tiny increments, with a bunch of branches, is even more expensive.
Copying data in tiny increments, with a bunch of > branches, is even more expensive, especially when it's shared > memory. Copying data in tiny increments, with a bunch of branches, is > even more expensive, especially when it's shared memory, especially when > all that shared memory is locked at once. My idea is, instead of keeping all these fields duplicated in the context, to just allocate a single memory segment equal to the expected record size (maybe the payload data can be kept separate). Now, based on uur_info, pack all the fields of UnpackedUndoRecord into that memory segment. After that, in InsertUndoData, we just need one call to InsertUndoBytes to copy the complete header in one shot and another call to copy the payload data. Does this sound reasonable to you? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
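PS: In code, roughly this (sketch; PackUndoRecordHeader, uur_hd, uur_txn_hd and uur_rmid are placeholder names, the real struct layout will differ):

/*
 * Pack all header fields selected by uur_info into one contiguous buffer,
 * so that InsertUndoData() needs only one InsertUndoBytes() call for the
 * header (plus one more for the payload).
 */
static char *
PackUndoRecordHeader(UnpackedUndoRecord *uur, Size *len)
{
	Size		size = UndoRecordHeaderSize(uur->uur_info);
	char	   *buf = palloc(size);
	char	   *p = buf;

	/* fixed header first */
	memcpy(p, &uur->uur_hd, SizeOfUndoRecordHeader);
	p += SizeOfUndoRecordHeader;

	/* optional chunks, in the same order as UndoRecordHeaderSize() */
	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
	{
		memcpy(p, &uur->uur_txn_hd, SizeOfUndoRecordTransaction);
		p += SizeOfUndoRecordTransaction;
	}
	if ((uur->uur_info & UREC_INFO_RMID) != 0)
	{
		memcpy(p, &uur->uur_rmid, sizeof(RmgrId));
		p += sizeof(RmgrId);
	}
	/* ... and so on for reloid, xid, cid, fork, prevundo, block, ... */

	Assert(p == buf + size);
	*len = size;
	return buf;
}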
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > > Need to do something else for a bit. More later. > > > + * false, otherwise. > > + */ > > +static bool > > +UndoAlreadyApplied(FullTransactionId full_xid, UndoRecPtr to_urecptr) > > +{ > > + UnpackedUndoRecord *uur = NULL; > > + UndoRecordFetchContext context; > > + > > + /* Fetch the undo record. */ > > + BeginUndoFetch(&context); > > + uur = UndoFetchRecord(&context, to_urecptr); > > + FinishUndoFetch(&context); > > Literally all the places that fetch a record, fetch them with exactly > this combination of calls. If that's the pattern, what do we gain by > this split? Note that UndoBulkFetchRecord does *NOT* use an > UndoRecordFetchContext, for reasons that are beyond me. Actually, the split is for zheap or any other AM that needs to traverse a transaction's undo chain. For example, in zheap we will get the latest undo record pointer from the slot, but we need to traverse the undo record chain backward using the prevundo pointer stored in the undo record to find the undo record for a particular tuple. Earlier, there was a loop in UndoFetchRecord which traversed the undo chain until it found the matching record; the record was matched using a callback. There was also an optimization that if the current record doesn't satisfy the callback then we keep the pin held on the buffer and go to the previous record in the chain. Later, based on the review comments by Robert, we decided that finding the matching undo record should be the caller's responsibility, so we have moved the loop out of UndoFetchRecord and kept it in the zheap code. The reason for keeping the context is that we can keep the buffer pin held and remember that buffer in the context, so that the caller can call UndoFetchRecord in a loop and the pin will be held on the buffer from which we read the last undo record. I agree that in the undoprocessing patch set we always need to fetch one record, so instead of repeating this pattern everywhere we can write one function and move this sequence of calls into it (sketch below). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
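PS: i.e. something like this tiny wrapper (only the wrapper's name is made up, the rest is the existing API):

/*
 * Fetch a single undo record when no chain traversal is needed, hiding
 * the fetch-context dance from the caller.  The result must still be
 * freed with UndoRecordRelease().
 */
static UnpackedUndoRecord *
UndoFetchOneRecord(UndoRecPtr urecptr)
{
	UndoRecordFetchContext context;
	UnpackedUndoRecord *uur;

	BeginUndoFetch(&context);
	uur = UndoFetchRecord(&context, urecptr);
	FinishUndoFetch(&context);

	return uur;
}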
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > I am responding to some of the points where I need more input or some discussion is required. Some of the things need more thought, and I will respond to those later; some are quite straightforward and don't need much discussion. > > > +/* > > + * Binary heap comparison function to compare the time at which an error > > + * occurred for transactions. > > + * > > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently, > > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so > > + * it doesn't make much sense to sort by both values. However, in future, if > > + * we have some different algorithm for next_retry_at, then it will work > > + * seamlessly. > > + */ > > Why is it useful to have error_occurred_at be part of the comparison at > all? If we need a tiebraker, err_occurred_at isn't that (if we can get > conflicts for next_retry_at, then we can also get conflicts in > err_occurred_at). Seems better to use something actually guaranteed to > be unique for a tiebreaker. > This was to distinguish the case where the request has failed multiple times from the case where it failed for the first time. I agree that we need a better unique identifier like FullTransactionId though. Do let me know if you have any other suggestion? > > > > + /* > > + * The rollback hash table is used to avoid duplicate undo requests by > > + * backends and discard worker. The table must be able to accomodate all > > + * active undo requests. The undo requests must appear in both xid and > > + * size requests queues or neither. In same transaction, there can be two > > + * requests one for logged relations and another for unlogged relations. > > + * So, the rollback hash table size should be equal to two request queues, > > + * an error queue (currently this is same as request queue) and max > > "the same"? I assume this intended to mean the same size? > Yes. I will add the word size to be more clear. > > > + * backends. This will ensure that it won't get filled. > > + */ > > How does this ensure anything? > Because based on this we will have a hard limit on the number of undo requests after which we won't allow more requests. See some more detailed explanation for the same later in this email. I think the comment needs to be updated. > > > + * the binary heaps which can change. > > + */ > > + Assert(LWLockHeldByMeInMode(RollbackRequestLock, LW_EXCLUSIVE)); > > + > > + /* > > + * We normally push the rollback request to undo workers if the size of > > + * same is above a certain threshold. > > + */ > > + if (req_size >= rollback_overflow_size * 1024 * 1024) > > + { > > Why is this being checked with the lock held? Seems like this should be > handled in a pre-check? > Yeah, it can be a pre-check, but I thought it is better to encapsulate everything in the function as this is not an expensive check. I think we can move it outside the lock to avoid any such confusion. > > > + * allow_peek - if true, peeks a few element from each queue to check whether > > + * any request matches current dbid. > > + * remove_from_queue - if true, picks an element from the queue whose dbid > > + * matches current dbid and remove it from the queue before returning the same > > + * to caller. > > + * urinfo - this is an OUT parameter that returns the details of undo request > > + * whose undo action is still pending. > > + * in_other_db_out - this is an OUT parameter.
If we've not found any work > > + * for current database, but there is work for some other database, we set > > + * this parameter as true. > > + */ > > +bool > > +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo *urinfo, > > + bool *in_other_db_out) > > +{ > > > > + /* > > + * If some undo worker is already processing the rollback request or > > + * it is already processed, then we drop that request from the queue > > + * and fetch the next entry from the queue. > > + */ > > + if (!rh || UndoRequestIsInProgress(rh)) > > + { > > + RemoveRequestFromQueue(cur_queue, 0); > > + cur_undo_queue++; > > + continue; > > + } > > When is it possible to hit the in-progress case? > The same request is in two queues. It is possible that when the request is being processed from xid queue by one of the workers, the request from another queue is picked by another worker. I think this case won't exist after making rbtree based queues. > > +/* > > + * UpdateUndoApplyProgress - Updates how far undo actions from a particular > > + * log have been applied while rolling back a transaction. This progress is > > + * measured in terms of undo block number of the undo log till which the > > + * undo actions have been applied. > > + */ > > +static void > > +UpdateUndoApplyProgress(UndoRecPtr progress_urec_ptr, > > + BlockNumber block_num) > > +{ > > + UndoLogCategory category; > > + UndoRecordInsertContext context = {{0}}; > > + > > + category = > > + UndoLogNumberGetCategory(UndoRecPtrGetLogNo(progress_urec_ptr)); > > + > > + /* > > + * We don't need to update the progress for temp tables as they get > > + * discraded after startup. > > + */ > > + if (category == UNDO_TEMP) > > + return; > > + > > + BeginUndoRecordInsert(&context, category, 1, NULL); > > + > > + /* > > + * Prepare and update the undo apply progress in the transaction header. > > + */ > > + UndoRecordPrepareApplyProgress(&context, progress_urec_ptr, block_num); > > + > > + START_CRIT_SECTION(); > > + > > + /* Update the progress in the transaction header. */ > > + UndoRecordUpdateTransInfo(&context, 0); > > + > > + /* WAL log the undo apply progress. */ > > + { > > + XLogRecPtr lsn; > > + xl_undoapply_progress xlrec; > > + > > + xlrec.urec_ptr = progress_urec_ptr; > > + xlrec.progress = block_num; > > + > > + XLogBeginInsert(); > > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > > + > > + RegisterUndoLogBuffers(&context, 1); > > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS); > > + UndoLogBuffersSetLSN(&context, lsn); > > + } > > + > > + END_CRIT_SECTION(); > > + > > + /* Release undo buffers. */ > > + FinishUndoRecordInsert(&context); > > +} > > This whole prepare/execute split for updating apply pregress, and next > undo pointers makes no sense to me. > Can you explain what is your concern here? Basically, in the prepare phase, we do read and lock the buffer and in the actual update phase (which is under critical section), we update the contents in the shared buffer. This is the same idea as we use in many places in code. 
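For reference, the general idiom looks like this (a simplified sketch of the usual buffer-update pattern, not the undo-specific code; RM_FOO_ID and XLOG_FOO_OP are made-up placeholders):

static void
wal_logged_update(Relation rel, BlockNumber blkno)
{
	Buffer		buf;
	Page		page;
	XLogRecPtr	recptr;

	/* Prepare phase: pin and lock the buffer, outside any critical section. */
	buf = ReadBuffer(rel, blkno);
	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
	page = BufferGetPage(buf);

	/* Update phase: the modification and its WAL record, atomically. */
	START_CRIT_SECTION();
	/* ... modify the page contents here ... */
	MarkBufferDirty(buf);

	XLogBeginInsert();
	XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
	recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OP);
	PageSetLSN(page, recptr);
	END_CRIT_SECTION();

	UnlockReleaseBuffer(buf);
}

The prepare steps may fail or error out safely; once inside the critical section any failure is a PANIC, so only the actual modification happens there.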
> > > typedef struct TwoPhaseFileHeader > > { > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader > > uint16 gidlen; /* length of the GID - GID follows the header */ > > XLogRecPtr origin_lsn; /* lsn of this record at origin node */ > > TimestampTz origin_timestamp; /* time of prepare at origin node */ > > + > > + /* > > + * We need the locations of the start and end undo record pointers when > > + * rollbacks are to be performed for prepared transactions using undo-based > > + * relations. We need to store this information in the file as the user > > + * might rollback the prepared transaction after recovery and for that we > > + * need its start and end undo locations. > > + */ > > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > > } TwoPhaseFileHeader; > > Why do we not need that knowledge for undo processing of a non-prepared > transaction? > The non-prepared transaction also needs to be aware of that. It is stored in TransactionStateData. I am not sure if I understand your question here. > > > + * applying undo via top-level transaction, if we get an error, > > + * then it is handled by ReleaseResourcesAndProcessUndo > > Where and how does it handle that? Maybe I misunderstand what you mean? > It is handled in ProcessUndoRequestForEachLogCat which is called from ReleaseResourcesAndProcessUndo. Basically, the error is handled in catch and we insert the request in error queue. The function name should be changed in comments. > > > + case TBLOCK_UNDO: > > + /* > > + * We reach here when we got error while applying undo > > + * actions, so we don't want to again start applying it. Undo > > + * workers can take care of it. > > + * > > + * AbortTransaction is already done, still need to release > > + * locks and perform cleanup. > > + */ > > + ResetUndoActionsInfo(); > > + ResourceOwnerRelease(s->curTransactionOwner, > > + RESOURCE_RELEASE_LOCKS, > > + false, > > + true); > > + s->state = TRANS_ABORT; > > CleanupTransaction(); > > Hm. Why is it ok that we only perform that cleanup action? Either the > rest of potentially held resources will get cleaned up somehow as well, > in which case this ResourceOwnerRelease() ought to be redundant, or > we're potentially leaking important resources like buffer pins, relcache > references and whatnot here? > I had initially used AbortTransaction() here for such things, but I was not sure whether that is the right thing when we reach here in this state. Because AbortTransaction is already done once we reach here. The similar thing happens for the TBLOCK_SUBUNDO state few lines below where I had used AbortSubTransaction. Now, one problem I faced when AbortSubTransaction got invoked in this code path was it internally invokes RecordTransactionAbort->XidCacheRemoveRunningXids which result in the error "did not find subXID %u in MyProc". The reason is obvious which is that we had already removed it when AbortSubTransaction was invoked before applying undo actions. The releasing of locks was the thing which we have delayed to allow undo actions to be applied which is done here. The other idea here I had was to call AbortTransaction/AbortSubTransaction but somehow avoid calling RecordTransactionAbort when in this state. Do you have any suggestion to deal with this? > > > +{ > > + TransactionState s = CurrentTransactionState; > > + bool result; > > + int i; > > + > > + /* > > + * We don't want to apply the undo actions when we are already cleaning up > > + * for FATAL error. 
See ReleaseResourcesAndProcessUndo. > > + */ > > + if (SemiCritSectionCount > 0) > > + { > > + ResetUndoActionsInfo(); > > + return; > > + } > > Wait what? Semi critical sections? > Robert up thread suggested this idea [1] (See paragraph starting with "I am not a fan of applying_subxact_undo....") to deal with cases where we get an error while applying undo actions and we need to promote the error to FATAL. We have two such cases as of now in this patch, one is when we process temp log category log and other is when we are rolling back sub-transactions. The detailed reasons are mentioned in function execute_undo_actions. I think this can be used for other things as well in the future. > > > > + for (i = 0; i < UndoLogCategories; i++) > > + { > > + /* > > + * We can't push the undo actions for temp table to background > > + * workers as the the temp tables are only accessible in the > > + * backend that has created them. > > + */ > > + if (i != UNDO_TEMP && UndoRecPtrIsValid(s->latestUrecPtr[i])) > > + { > > + result = RegisterUndoRequest(s->latestUrecPtr[i], > > + s->startUrecPtr[i], > > + MyDatabaseId, > > + GetTopFullTransactionId()); > > + s->undoRequestResgistered[i] = result; > > + } > > + } > > Give code like this I have a hard time seing what the point of having > separate queue entries for the different persistency levels is. > It is not for this case, rather, it is for the case of discard worker (background worker) where we process the transactions at log level. The permanent and unlogged transactions will be in a separate log and can be encountered at different times, so this leads to having separate entries for them. I am planning to give a try to unify them based on some of the discussions in this email chain. > > > +void > > +ReleaseResourcesAndProcessUndo(void) > > +{ > > + TransactionState s = CurrentTransactionState; > > + > > + /* > > + * We don't want to apply the undo actions when we are already cleaning up > > + * for FATAL error. One of the main reasons is that we might be already > > + * processing undo actions for a (sub)transaction when we reach here > > + * (for ex. error happens while processing undo actions for a > > + * subtransaction). > > + */ > > + if (SemiCritSectionCount > 0) > > + { > > + ResetUndoActionsInfo(); > > + return; > > + } > > + > > + if (!NeedToPerformUndoActions()) > > + return; > > + > > + /* > > + * State should still be TRANS_ABORT from AbortTransaction(). > > + */ > > + if (s->state != TRANS_ABORT) > > + elog(FATAL, "ReleaseResourcesAndProcessUndo: unexpected state %s", > > + TransStateAsString(s->state)); > > + > > + /* > > + * Do abort cleanup processing before applying the undo actions. We must > > + * do this before applying the undo actions to remove the effects of > > + * failed transaction. > > + */ > > + if (IsSubTransaction()) > > + { > > + AtSubCleanup_Portals(s->subTransactionId); > > + s->blockState = TBLOCK_SUBUNDO; > > + } > > + else > > + { > > + AtCleanup_Portals(); /* now safe to release portal memory */ > > + AtEOXact_Snapshot(false, true); /* and release the transaction's > > + * snapshots */ > > Why do precisely these actions need to be performed here? > This is to get a transaction into a clean state. Before calling this function AbortTransaction has been performed and there were few more things we need to do for cleanup. 
> > > + s->fullTransactionId = InvalidFullTransactionId; > > + s->subTransactionId = TopSubTransactionId; > > + s->blockState = TBLOCK_UNDO; > > + } > > + > > + s->state = TRANS_UNDO; > > This seems guaranteed to constantly be out of date with other > modifications of the commit/abort sequence. > It is similar to how we change state in Abort(Sub)Transaction and we change the state back to TRANS_ABORT after applying undo in this function. So not sure, how it can be out-of-date. Do you have any better suggestion here? > > > > +bool > > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid, > > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr, > > + bool *undoRequestResgistered, bool isSubTrans) > > +{ > > + UndoRequestInfo urinfo; > > + int i; > > + uint32 save_holdoff; > > + bool success = true; > > + > > + for (i = 0; i < UndoLogCategories; i++) > > + { > > + if (end_urec_ptr[i] && !undoRequestResgistered[i]) > > + { > > + save_holdoff = InterruptHoldoffCount; > > + > > + PG_TRY(); > > + { > > + /* for subtransactions, we do partial rollback. */ > > + execute_undo_actions(fxid, > > + end_urec_ptr[i], > > + start_urec_ptr[i], > > + !isSubTrans); > > + } > > + PG_CATCH(); > > + { > > + /* > > + * Add the request into an error queue so that it can be > > + * processed in a timely fashion. > > + * > > + * If we fail to add the request in an error queue, then mark > > + * the entry status as invalid and continue to process the > > + * remaining undo requests if any. This request will be later > > + * added back to the queue by discard worker. > > + */ > > + ResetUndoRequestInfo(&urinfo); > > + urinfo.dbid = dbid; > > + urinfo.full_xid = fxid; > > + urinfo.start_urec_ptr = start_urec_ptr[i]; > > + if (!InsertRequestIntoErrorUndoQueue(&urinfo)) > > + RollbackHTMarkEntryInvalid(urinfo.full_xid, > > + urinfo.start_urec_ptr); > > + /* > > + * Errors can reset holdoff count, so restore back. This is > > + * required because this function can be called after holding > > + * interrupts. > > + */ > > + InterruptHoldoffCount = save_holdoff; > > + > > + /* Send the error only to server log. */ > > + err_out_to_client(false); > > + EmitErrorReport(); > > + > > + success = false; > > + > > + /* We should never reach here when we are in a semi-critical-section. */ > > + Assert(SemiCritSectionCount == 0); > > This seems entirely and completely broken. You can't just catch an > exception and continue. What if somebody held an lwlock when the error > was thrown? A buffer pin? > The caller deals with that. For example, when this is called from FinishPreparedTransaction, we do AbortOutOfAnyTransaction and when called from ReleaseResourcesAndProcessUndo, we just release locks. I think we might need to do something additional for ReleaseResourcesAndProcessUndo. Earlier here also, I had AbortTransaction but was not sure whether that is the right thing to do especially because it will lead to RecordTransactionAbort called twice, once when we do AbortTransaction before applying undo actions and once when we do it after catching the exception. Like as I said earlier maybe the right way is to just avoid calling RecordTransactionAbort again. > > > +to complete the requests by themselves. There is an exception to it where when > > +error queue becomes full, we just mark the request as 'invalid' and continue to > > +process other requests if any. The discard worker will find this errored > > +transaction at later point of time and again add it to the request queues. 
> > You say it's an exception, but you do not explain why that exception is > there. The exception is when the error queue becomes full. The idea is that individual queues can be full but not the hash table. > Nor why that's not a problem for: > > > +We have the hard limit (proportional to the size of the rollback hash table) > > +for the number of transactions that can have pending undo. This can help us > > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to > > +accumulate pending undo for a long time which will eventually block the > > +discard of undo. > The reason why it is not a problem is that we don't remove the entry from the hash table, rather we just mark it such that the discard worker can later add it back to the queues. I am not sure if I understood your question completely, but let me try to explain this idea in a bit more detail. The basic idea is that the rollback hash table has space equivalent to all the three queues plus (2 * MaxBackends). Now, we will stop allowing new transactions that want to write undo once the hash table has entries equivalent to all three queues and we have 2 * Max_Backends already attached to undo logs that are not committed. Assume we have each queue size as 5 and Max_Backends = 10; then ideally we can have 35 entries (3 * 5 + 2 * 10) in the hash table. The way all this is related to the error queue being full is like this: Say, we have a number of hash table entries equal to 15, which indicates all queues are full, and now 10 backends are each attached to two different logs (permanent and unlogged). Next, one of the transactions errors out and tries to roll back; at this stage, it will add an entry in the hash table and try to execute the actions. While executing the actions, it gets an error and can't add the request to the error queue because that is full, so at this stage it just marks the hash table entry as invalid and proceeds (consider this happens for both the logged and unlogged categories). So, at this stage, we will have 17 entries in the hash table and the other 9 backends attached to 18 logs, which adds up to exactly the 35-entry capacity if the system crashes at this stage. The backend which errored out again tries to perform an operation for which it needs to write undo. Now, we won't allow this backend to perform that action, because if it crashed after performing the operation and before committing, the hash table would overflow. Currently, there are some problems with the hash table overflow checks in the code that need to be fixed. > > > + /* There might not be any undo log and hibernation might be needed. */ > > + *hibernate = true; > > + > > + StartTransactionCommand(); > > Why do we need this? I assume it's so we can have a resource owner? > Yes, and another reason is we are using dbid_exists in this function. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
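PS: In code form, the sizing rule described above is essentially (sketch; pending_undo_queue_size is a stand-in name for the per-queue capacity):

/*
 * Entries for everything the three request queues (xid, size, error) can
 * hold, plus two potential entries (logged + unlogged) for each backend
 * that may be attached to undo logs but not yet present in any queue.
 */
#define RollbackHTSize(pending_undo_queue_size) \
	(3 * (pending_undo_queue_size) + 2 * MaxBackends)

With a queue size of 5 and MaxBackends = 10, that gives 3 * 5 + 2 * 10 = 35 entries, matching the example above.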
On Tue, Jul 30, 2019 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > One data structure that could perhaps hold this would be > > UndoLogTableEntry (the per-backend cache, indexed by undo log number, > > with pretty fast lookups; used for things like > > UndoLogNumberGetCategory()). As long as you never want to have > > inter-transaction compression, that should have the right scope to > > give recovery per-undo log tracking. If you ever wanted to do > > compression between transactions too, maybe UndoLogSlot could work, > > but that'd have more complications. > > I think this could be a good idea. I had thought of keeping in the > slot as my 3rd option but later I removed it thinking that we need to > expose the compression field to the undo log layer. I think keeping > in the UndoLogTableEntry is a better idea than keeping in the slot. > But, I still have the same problem that we need to expose undo > record-level fields to undo log layer to compute the cache entry size. > OTOH, If we decide to get from the first record of the page (as I > mentioned up thread) then I don't think there is any performance issue > because we are inserting on the same page. But, for doing that we > need to unpack the complete undo record (guaranteed to be on one > page). And, UnpackUndoData will internally unpack the payload data > as well which is not required in our case unless we change > UnpackUndoData such that it unpacks only what the caller wants (one > input parameter will do). > > I am not sure out of these two which idea is better? > I have one more problem related to compression of the command id field. Basically, the problem is that we don't set the command id in the WAL and we will always store FirstCommandId in the undo[1]. So suppose there were 2 operations under different CIDs; then during DO time both the undo records will store the CID field, but during REDO time all the commands will store the same CID (FirstCommandId), so as per the compression logic the subsequent record for the same transaction will not store the CID field. I am not sure what is the best way to handle this but I have a few ideas. 1) Don't compress the CID field ever. 2) Write CID in WAL, but just for compressing the CID field in undo (which may not necessarily go to disk) we don't want to add an extra 4 bytes to the WAL. Any better idea to handle this? [1] https://www.postgresql.org/message-id/CAFiTN-u2Ny2E-NgT8nmE65awJ7keOzePODZTEg98ceF%2BsNhRtw%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 05/08/2019 16:24, Robert Haas wrote: > On Sun, Aug 4, 2019 at 5:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> I feel that the level of abstraction is not quite right. There are a >> bunch of fields, like uur_block, uur_offset, uur_tuple, that are >> probably useful for some UNDO resource managers (zheap I presume), but >> seem kind of arbitrary. How is uur_tuple different from uur_payload? >> Should they be named more generically as uur_payload1 and uur_payload2? >> And why two, why not three or four different payloads? In the WAL record >> format, there's a concept of "block id", which allows you to store N >> number of different payloads in the record, I think that would be a >> better approach. Or only have one payload, and let the resource manager >> code divide it as it sees fit. >> >> Many of the fields support a primitive type of compression, where a >> field can be omitted if it has the same value as on the first record on >> an UNDO page. That's handy. But again I don't like the fact that the >> fields have been hard-coded into the UNDO record format. I can see e.g. >> the relation oid to be useful for many AMs. But not all. And other AMs >> might well want to store and deduplicate other things, aside from the >> fields that are in the patch now. I'd like to move most of the fields to >> AM specific code, and somehow generalize the compression. One approach >> would be to let the AM store an arbitrary struct, and run it through a >> general-purpose compression algorithm, using the UNDO page's first >> record as the "dictionary". > > I thought about this, too. I agree that there's something a little > unsatisfying about the current structure, but I haven't been able to > come up with something that seems definitively better. I think > something along the lines of what you are describing here might work > well, but I am VERY doubtful about the idea of a fixed-size struct. I > think AMs are going to want to store variable-length data: especially > tuples, but maybe also other stuff. For instance, imagine some AM that > wants to implement locking that's more fine-grained than the four > levels of tuple locks we have today: instead of just having key locks > and all-columns locks, you could want to store the exact columns to be > locked. Or maybe your TIDs are variable-width. Sure, a fixed-size struct is quite limiting. My point is that all that packing of data into UNDO records should be AM-specific. Maybe it would be handy to have a few common fields in the undo record header itself, but most data should be in the AM-specific payload, because it varies across AMs. > And the problem is that as soon as you move to something where you > pack in a bunch of variable-sized fields, you lose the ability to > refer to things using reasonable names. That's where I came up with > the idea of an UnpackedUndoRecord: give the common fields that > "everyone's going to need" human-readable names, and jam only the > strange, AM-specific stuff into the payload. But if those needs are > not actually universal but very much AM-specific, then I'm afraid > we're going to end up with deeply inscrutable code for packing and > unpacking records. I imagine it's possible to come up with a good > structure for that, but I don't think we have one today. Yeah, that's also a problem with complicated WAL record types. Hopefully the complex cases are an exception, not the norm. A complex case is unlikely to fit any pre-defined set of fields anyway. (We could look at how e.g.
protobuf works, if this is really a big problem. I'm not suggesting that we add a dependency just for this, but there might be some patterns or interfaces that we could mimic.) If you remember, we did a big WAL format refactoring in 9.5, which moved some information from AM-specific structs to the common headers. Namely, the information on the relation blocks that the WAL record applies to. That was a very handy refactoring, and allowed tools like pg_waldump to print more detailed information about all WAL record types. For WAL records, moving the block information was natural, because there was special handling for full-page images anyway. However, I don't think we have enough experience with UNDO log yet, to know which fields would be best to include in the common undo header, and which to leave as AM-specific payload. I think we should keep the common header slim, and delegate to the AM routines. For UNDO records, having an XID on every record probably makes sense; all the use cases for UNDO log we've discussed are transactional. The rules on which UNDO records to apply and what/when to discard, depend on whether a transaction committed or aborted and when, so you need the XID for that. Although, the rule also depends on the AM; for cleaning up orphaned files, an UNDO record for a committed transaction can be discarded immediately, while zheap and zedstore records need to be kept around longer. So the exact rules for that will need to be AM-specific, too. Or maybe there are only a few different cases and we can enumerate them, so that an AM can just set a flag on the UNDO record to indicate when it can be discarded, instead of having a callback or some other totally generic approach. In short, I think we should keep the common code that deals with UNDO records more dumb, and delegate to the AMs more. That's enough for cleaning up orphaned files, we don't really need the more complicated stuff for that. We probably need more smarts for zheap/zedstore, but we don't quite know what it should look like yet. Let's keep it simple for now, so that we can get something we can review and commit sooner, and we can build on top of that later. >> I don't like the way UndoFetchRecord returns a palloc'd >> UnpackedUndoRecord. I would prefer something similar to the xlogreader >> API, where a new call to UndoFetchRecord invalidates the previous >> result. On efficiency grounds, to avoid the palloc, but also to be >> consistent with xlogreader. > > I don't think that's going to work very well, because we often need to > deal with multiple records at a time. There is (or was) a bulk-fetch > interface, but I've also found while experimenting with this code that > it can be useful to do things like: > > current = undo_fetch(starting_record); > loop: > next = undo_fetch(current->next_record_ptr); > if some_test(next): > break; > undo_free(current); > current = next; > > I think we shouldn't view such cases as exceptions to the general > paradigm of looking at undo records one at a time, but instead as the > normal case for which everything is optimized. Cases like orphaned > file cleanup where the number of undo records is probably small and > they're all independent of each other will, I think, turn out to be > the exception rather than the rule. Hmm. If you're following an UNDO chain, from newest to oldest, I would assume that the newer record has enough information to decide whether you need to look at the previous record. 
If the previous record is no longer interesting, it might already have been discarded, after all. I tried to browse through the zheap code but couldn't see that pattern. I'm not too familiar with the code, so I might've looked in the wrong place, though. - Heikki
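PS: To make the "enumerate the cases" idea above concrete, the per-record flag could be as simple as this (a sketch, not from any patch; the enum and its values are made up):

/* When may this UNDO record be discarded?  Set by the AM at insert time. */
typedef enum UndoDiscardRule
{
	UNDO_DISCARD_ON_COMMIT,		/* e.g. orphaned-file cleanup records */
	UNDO_DISCARD_AT_XID_HORIZON	/* e.g. zheap/zedstore, needed by old snapshots */
} UndoDiscardRule;

The common discard code would then only need the transaction's commit/abort status and the xmin horizon, with no AM callback on that path.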
On 07/08/2019 13:52, Dilip Kumar wrote: > I have one more problem related to compression of the command id > field. Basically, the problem is that we don't set the command id in > the WAL and we will always store FirstCommandId in the undo[1]. So > suppose there were 2 operations under different CIDs; then during DO > time both the undo records will store the CID field, but during REDO > time all the commands will store the same CID (FirstCommandId), so as > per the compression logic the subsequent record for the same > transaction will not store the CID field. I am not sure what is the > best way to handle this but I have a few ideas. > > 1) Don't compress the CID field ever. > 2) Write CID in WAL, but just for compressing the CID field in undo > (which may not necessarily go to disk) we don't want to add an extra 4 > bytes to the WAL. Most transactions have only a few commands, so you could optimize for that. If you use some kind of a variable-byte encoding for it, it could be a single byte or even just a few bits, for the common cases. For the first version, I'd suggest keeping it simple, though, and optimizing later. - Heikki
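PS: For example, a basic 7-bits-per-byte encoding (self-contained sketch; encode_varint32 is made up, not from any patch):

#include <stdint.h>

/* Encode a 32-bit command id into 1-5 bytes; returns the number of bytes. */
static int
encode_varint32(uint32_t value, uint8_t *buf)
{
	int			n = 0;

	while (value >= 0x80)
	{
		buf[n++] = (uint8_t) (value & 0x7F) | 0x80;
		value >>= 7;
	}
	buf[n++] = (uint8_t) value;
	return n;
}

With this, any command id below 128 costs a single byte, which covers the vast majority of transactions.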
On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > but > > > here's a small thing: I managed to reach an LWLock self-deadlock in > > > the undo worker launcher: > > > > > > > I could see the problem, will fix in next version. > > Fixed both of these problems in the patch just posted by me [1]. I reran the script that found that problem, so I could play with the linger logic. It creates N databases, and then it creates tables in random databases (because I'm testing with the orphaned table cleanup patch) and commits or rolls back at (say) 100 tx/sec. While it's doing that, you can look at the pg_stat_undo_logs view to see the discard and insert pointers whizzing along nicely, but if you look at the process table with htop or similar you can see that it's forking undo apply workers at 100/sec (the pid keeps changing), whenever there is more than one database involved. With a single database it lingers as I was expecting (and then creates problems when you want to drop the database). What I was expecting to see is that if you configure the test to generate undo work in 2, 3 or 4 dbs, and you have max_undo_workers set to 4, then you should finish up with 4 undo apply workers hanging around to service the work calmly without any new forking happening. If you generate undo work in more than 4 databases, I was expecting to see the undo workers exiting and being forked so that a total of 4 workers (at any time) can work their way around the more-than-4 databases, but not switching as fast as they can, so that we don't waste all our energy on forking and setup (how fast exactly they should switch, I don't know, that's what I wanted to see). A more advanced thing to worry about, not yet tested, is how well they'll handle asymmetrical work distributions (not enough workers, but some databases producing a lot and some a little undo work). Script attached. -- Thomas Munro https://enterprisedb.com
Attachment
On Wed, Aug 7, 2019 at 5:06 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Aug 1, 2019 at 1:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > but > > > > here's a small thing: I managed to reach an LWLock self-deadlock in > > > > the undo worker launcher: > > > > > > > > > > I could see the problem, will fix in next version. > > > > Fixed both of these problems in the patch just posted by me [1]. > > I reran the script that found that problem, so I could play with the > linger logic. > Thanks for the test. I will look into it and get back to you. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2019-08-07 14:50:17 +0530, Amit Kapila wrote:
> On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > > +/*
> > > + * Binary heap comparison function to compare the time at which an error
> > > + * occurred for transactions.
> > > + *
> > > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently,
> > > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so
> > > + * it doesn't make much sense to sort by both values. However, in future, if
> > > + * we have some different algorithm for next_retry_at, then it will work
> > > + * seamlessly.
> > > + */
> >
> > Why is it useful to have error_occurred_at be part of the comparison at
> > all? If we need a tiebreaker, err_occurred_at isn't that (if we can get
> > conflicts for next_retry_at, then we can also get conflicts in
> > err_occurred_at). Seems better to use something actually guaranteed to
> > be unique for a tiebreaker.
>
> This was to distinguish the case where the request is failing
> multiple times from the case where the request failed the first time. I
> agree that we need a better unique identifier like FullTransactionid
> though. Do let me know if you have any other suggestions.

Sure, I get why you have the field. Even if it were just for debugging or such. Was just commenting upon it being used as part of the comparison. I'd just go for (next_retry_at, fxid).

> > > + * backends. This will ensure that it won't get filled.
> > > + */
> >
> > How does this ensure anything?
>
> Because based on this we will have a hard limit on the number of undo
> requests, after which we won't allow more requests. See some more
> detailed explanation for the same later in this email. I think the
> comment needs to be updated.

Well, as your code stands, I don't think there is an actual hard limit on the number of transactions needing to be undone, due to the way errors are handled. There's no consideration of prepared transactions.

> > > + START_CRIT_SECTION();
> > > +
> > > + /* Update the progress in the transaction header. */
> > > + UndoRecordUpdateTransInfo(&context, 0);
> > > +
> > > + /* WAL log the undo apply progress. */
> > > + {
> > > + XLogRecPtr lsn;
> > > + xl_undoapply_progress xlrec;
> > > +
> > > + xlrec.urec_ptr = progress_urec_ptr;
> > > + xlrec.progress = block_num;
> > > +
> > > + XLogBeginInsert();
> > > + XLogRegisterData((char *) &xlrec, sizeof(xlrec));
> > > +
> > > + RegisterUndoLogBuffers(&context, 1);
> > > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS);
> > > + UndoLogBuffersSetLSN(&context, lsn);
> > > + }
> > > +
> > > + END_CRIT_SECTION();
> > > +
> > > + /* Release undo buffers. */
> > > + FinishUndoRecordInsert(&context);
> > > +}
> >
> > This whole prepare/execute split for updating apply progress, and next
> > undo pointers, makes no sense to me.
>
> Can you explain what your concern is here? Basically, in the prepare
> phase, we read and lock the buffer, and in the actual update phase
> (which is under a critical section), we update the contents in the
> shared buffer. This is the same idea as we use in many places in
> the code.

I'll comment on the concerns with the whole API separately.
> > > typedef struct TwoPhaseFileHeader > > > { > > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader > > > uint16 gidlen; /* length of the GID - GID follows the header */ > > > XLogRecPtr origin_lsn; /* lsn of this record at origin node */ > > > TimestampTz origin_timestamp; /* time of prepare at origin node */ > > > + > > > + /* > > > + * We need the locations of the start and end undo record pointers when > > > + * rollbacks are to be performed for prepared transactions using undo-based > > > + * relations. We need to store this information in the file as the user > > > + * might rollback the prepared transaction after recovery and for that we > > > + * need its start and end undo locations. > > > + */ > > > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > > > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > > > } TwoPhaseFileHeader; > > > > Why do we not need that knowledge for undo processing of a non-prepared > > transaction? > The non-prepared transaction also needs to be aware of that. It is > stored in TransactionStateData. I am not sure if I understand your > question here. My concern is that I think it's fairly ugly to store data like this in the 2pc state file. And it's not an insubstantial amount of additional data either, compared to the current size, even when no undo is in use. There's a difference between an unused feature increasing backend local memory and increasing the size of WAL logged data. Obviously it's not by a huge amount, but still. It also just feels wrong to me. We don't need the UndoRecPtr's when recovering from a crash/restart to process undo. Now we obviously don't want to unnecessarily search for data that is expensive to gather, which is a good reason for keeping track of this data. But I do wonder if this is the right approach. I know that Robert is working on a patch that revises the undo request layer somewhat, it's possible that this is best discussed afterwards. > > > + case TBLOCK_UNDO: > > > + /* > > > + * We reach here when we got error while applying undo > > > + * actions, so we don't want to again start applying it. Undo > > > + * workers can take care of it. > > > + * > > > + * AbortTransaction is already done, still need to release > > > + * locks and perform cleanup. > > > + */ > > > + ResetUndoActionsInfo(); > > > + ResourceOwnerRelease(s->curTransactionOwner, > > > + RESOURCE_RELEASE_LOCKS, > > > + false, > > > + true); > > > + s->state = TRANS_ABORT; > > > CleanupTransaction(); > > > > Hm. Why is it ok that we only perform that cleanup action? Either the > > rest of potentially held resources will get cleaned up somehow as well, > > in which case this ResourceOwnerRelease() ought to be redundant, or > > we're potentially leaking important resources like buffer pins, relcache > > references and whatnot here? > > > > I had initially used AbortTransaction() here for such things, but I > was not sure whether that is the right thing when we reach here in > this state. Because AbortTransaction is already done once we reach > here. The similar thing happens for the TBLOCK_SUBUNDO state few > lines below where I had used AbortSubTransaction. Now, one problem I > faced when AbortSubTransaction got invoked in this code path was it > internally invokes RecordTransactionAbort->XidCacheRemoveRunningXids > which result in the error "did not find subXID %u in MyProc". The > reason is obvious which is that we had already removed it when > AbortSubTransaction was invoked before applying undo actions. 
> The releasing of locks was the thing which we have delayed to allow undo
> actions to be applied, which is done here. The other idea I had here
> was to call AbortTransaction/AbortSubTransaction but somehow avoid
> calling RecordTransactionAbort when in this state. Do you have any
> suggestions to deal with this?

Well, what I'm asking is how this possibly could be correct. Perhaps I'm just missing something, in which case I don't yet want to make suggestions for how this should look.

My concern is that you seem to have added a state where we process quite a lot of code - the undo actions, which use buffer pins, lwlocks, sometimes heavyweight locks, potentially even relcache, much more - but we don't actually clean up any of those in case of error, *except* for *some* resowner-managed things. I just don't understand how that could possibly be correct. I'm also fairly certain that we had discussed that we can't actually execute undo outside of a somewhat valid transaction environment - and as far as I can tell, there's nothing of that here.

Even in the path without an error during UNDO, I see code like:

+ else
+ {
+ AtCleanup_Portals(); /* now safe to release portal memory */
+ AtEOXact_Snapshot(false, true); /* and release the transaction's
+ * snapshots */
+ s->fullTransactionId = InvalidFullTransactionId;
+ s->subTransactionId = TopSubTransactionId;
+ s->blockState = TBLOCK_UNDO;
+ }

without any comments on why exactly these two cleanup callbacks need to be called, and no others. See also below.

And then when UNDO errors out, I see:

+ for (i = 0; i < UndoLogCategories; i++)
+ {
+ PG_CATCH();
+ {
...
+ /* We should never reach here when we are in a semi-critical-section. */
+ Assert(SemiCritSectionCount == 0);
+ }
+ PG_END_TRY();

meaning that we'll just move on to undo the next persistency category after an error. But there's absolutely no resource cleanup here. Which, to me, means we'll very easily self-deadlock and things like that. Consider an error thrown during undo, while holding an lwlock. If the next persistence category acquires that lock again, we'll self-deadlock. There are a lot of other similar issues.

So I just don't understand the current model of the xact.c integration. That might be because I just don't understand the current design, or because the current design is pretty broken.

> > > +{
> > > + TransactionState s = CurrentTransactionState;
> > > + bool result;
> > > + int i;
> > > +
> > > + /*
> > > + * We don't want to apply the undo actions when we are already cleaning up
> > > + * for FATAL error. See ReleaseResourcesAndProcessUndo.
> > > + */
> > > + if (SemiCritSectionCount > 0)
> > > + {
> > > + ResetUndoActionsInfo();
> > > + return;
> > > + }
> >
> > Wait what? Semi critical sections?
>
> Robert upthread suggested this idea [1] (See paragraph starting with
> "I am not a fan of applying_subxact_undo....") to deal with cases
> where we get an error while applying undo actions and we need to
> promote the error to FATAL.

Well, my problem with this starts with the fact that I don't see a reason why we would want to promote subtransaction failures to FATAL. Or why that would be OK - losing reliability when using savepoints seems pretty dubious to me. And sometimes we can expect to get errors when savepoints are in use, e.g. out-of-memory errors. And they're often going to happen again during undo processing. So this isn't an "oh, it never realistically happens" scenario imo.
There are two comments about this:

+We promote the error to FATAL error if it occurred while applying undo for a
+subtransaction. The reason we can't proceed without applying subtransaction's
+undo is that the modifications made in that case must not be visible even if
+the main transaction commits.

+ * (a) Subtransactions. We can't proceed without applying
+ * subtransaction's undo as the modifications made in that case must not
+ * be visible even if the main transaction commits. The reason why that
+ * can happen is because for undo-based AM's we don't need to have a
+ * separate transaction id for subtransactions and once the main
+ * transaction commits the tuples modified by subtransactions will become
+ * visible.

But that only means we can't allow such errors to be caught - there should be much less harsh ways to achieve that than throwing a FATAL error. We could e.g. just walk up the transaction stack and mark the transaction levels as failed or something. So if somebody catches the error, any database access done will just cause a failure again.

There's also:

+ * (b) Temp tables. We don't expect background workers to process undo of
+ * temporary tables as the same won't be accessible.

But I fail to see why that requires FATALing either. Isn't the worst outcome here that we'll have some unnecessary undo around?

> > Given code like this I have a hard time seeing what the point of having
> > separate queue entries for the different persistency levels is.
>
> It is not for this case, rather, it is for the case of the discard worker
> (background worker) where we process the transactions at log level.
> The permanent and unlogged transactions will be in a separate log and
> can be encountered at different times, so this leads to having
> separate entries for them.

Given a hashtable over fxid, that doesn't seem like a counter-argument. We can just do an fxid lookup, and if there's already an entry, update it to reference the additional persistence level.

One question of understanding: Why do we ever want to register undo requests for transactions that did not start in the log the discard worker is currently looking at? It seems to me that there's some complexity involved due to wanting to do that? We might have already processed the portion of the transaction in the later log, but I don't see why that'd be a problem?
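To sketch the fxid-lookup approach I mean, roughly - every name here is invented, not from the patch:

    RollbackHashEntry *entry;   /* hypothetical entry type */
    bool        found;

    entry = hash_search(RollbackRequestHash, &fxid, HASH_ENTER, &found);
    if (!found)
        /* fresh entry; assumes InvalidUndoRecPtr is zero */
        memset(entry->start_urec_ptr, 0, sizeof(entry->start_urec_ptr));
    /* Record the additional persistence level in the same entry. */
    entry->start_urec_ptr[category] = start_urec_ptr;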
> > > + */ > > > + if (IsSubTransaction()) > > > + { > > > + AtSubCleanup_Portals(s->subTransactionId); > > > + s->blockState = TBLOCK_SUBUNDO; > > > + } > > > + else > > > + { > > > + AtCleanup_Portals(); /* now safe to release portal memory */ > > > + AtEOXact_Snapshot(false, true); /* and release the transaction's > > > + * snapshots */ > > > > Why do precisely these actions need to be performed here? > > > > This is to get a transaction into a clean state. Before calling this > function AbortTransaction has been performed and there were few more > things we need to do for cleanup. That doesn't answer my question. Why is it specifically these ones that need to be called "manually"? Why no others? Where is that explained? I assume you just copied them from CleanupTransaction() - but there's no reference to that fact on either side, which means nobody would know to keep them in sync. I'll also note that the way it's currently set up, we don't delete the transaction context before processing undo, at least as far as I can see. Which seems that some OOM cases won't be able to roll back, even if there'd be plenty memory except for the memory used by the transaction. The portal cleanup will allow for some, but not all of that, I think. > > > +bool > > > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid, > > > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr, > > > + bool *undoRequestResgistered, bool isSubTrans) > > > +{ > > > + UndoRequestInfo urinfo; > > > + int i; > > > + uint32 save_holdoff; > > > + bool success = true; > > > + > > > + for (i = 0; i < UndoLogCategories; i++) > > > + { > > > + if (end_urec_ptr[i] && !undoRequestResgistered[i]) > > > + { > > > + save_holdoff = InterruptHoldoffCount; > > > + > > > + PG_TRY(); > > > + { > > > + /* for subtransactions, we do partial rollback. */ > > > + execute_undo_actions(fxid, > > > + end_urec_ptr[i], > > > + start_urec_ptr[i], > > > + !isSubTrans); > > > + } > > > + PG_CATCH(); > > > + { > > > + /* > > > + * Add the request into an error queue so that it can be > > > + * processed in a timely fashion. > > > + * > > > + * If we fail to add the request in an error queue, then mark > > > + * the entry status as invalid and continue to process the > > > + * remaining undo requests if any. This request will be later > > > + * added back to the queue by discard worker. > > > + */ > > > + ResetUndoRequestInfo(&urinfo); > > > + urinfo.dbid = dbid; > > > + urinfo.full_xid = fxid; > > > + urinfo.start_urec_ptr = start_urec_ptr[i]; > > > + if (!InsertRequestIntoErrorUndoQueue(&urinfo)) > > > + RollbackHTMarkEntryInvalid(urinfo.full_xid, > > > + urinfo.start_urec_ptr); > > > + /* > > > + * Errors can reset holdoff count, so restore back. This is > > > + * required because this function can be called after holding > > > + * interrupts. > > > + */ > > > + InterruptHoldoffCount = save_holdoff; > > > + > > > + /* Send the error only to server log. */ > > > + err_out_to_client(false); > > > + EmitErrorReport(); > > > + > > > + success = false; > > > + > > > + /* We should never reach here when we are in a semi-critical-section. */ > > > + Assert(SemiCritSectionCount == 0); > > > > This seems entirely and completely broken. You can't just catch an > > exception and continue. What if somebody held an lwlock when the error > > was thrown? A buffer pin? > > > > The caller deals with that. 
For example, when this is called from > FinishPreparedTransaction, we do AbortOutOfAnyTransaction and when > called from ReleaseResourcesAndProcessUndo, we just release locks. I don't see the caller being able to do anything here - the danger is that a previous category of undo processing might have acquired resources, and they're not cleaned up on failure, as you've set things up. > Earlier here also, I had AbortTransaction but was not sure whether > that is the right thing to do especially because it will lead to > RecordTransactionAbort called twice, once when we do AbortTransaction > before applying undo actions and once when we do it after catching the > exception. Like as I said earlier maybe the right way is to just > avoid calling RecordTransactionAbort again. I think that "just" means that you've not divorced the state in which undo processing is happening well enough from the "original" transaction. I stand by my suggestion that what needs to happen is roughly 1) re-assign locks from failed (sub-)transaction to a special "undo" resource owner 2) completely abort (sub-)transaction 3) start a new (sub-)transaction 4) process undo 5) commit/abort that (sub-)transaction 6) release locks from "undo" resource owner > > Nor why that's not a problem for: > > > > > +We have the hard limit (proportional to the size of the rollback hash table) > > > +for the number of transactions that can have pending undo. This can help us > > > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to > > > +accumulate pending undo for a long time which will eventually block the > > > +discard of undo. > > > > The reason why it is not a problem is that we don't remove the entry > from the hash table rather just mark it such that later discard worker > can add it to the queues. I am not sure if I understood your question > completely, but let me try to explain this idea in a bit more detail. > > The basic idea is that Rollback Hash Table has space equivalent to all > the three queues plus (2 * MaxBackends). Now, we will stop allowing > the new transactions that want to write undo once the hash table has > entries equivalent to all three queues and we have 2 * Max_Backends > already attached to undo logs that are not committed. Assume we have > each queue size as 5 and Max_Backends =10, then ideally we can 35 > entries (3 * 5 + 2 * 10) in the hash table. The way all this is > related to the error queue being full is like this: > > Say, we have a number of hash table entries equal to 15 which > indicates all queues are full and now 10 backends connected to two > different logs (permanent and unlogged). Next one of the transaction > errors out and try to rollback, at this stage, it will add an entry in > the hash table and try to execute the actions. While executing > actions, it got an error and couldn't add to error queue because it > was full, so at this stage, it just marks the hash table entry as > invalid and proceeds (consider this happens for both logged and > unlogged categories). So, at this stage, we will have 17 entries in > the hash table and the other 9 backends attached to 18 logs which > makes space for 35 xacts if the system crashes at this stage. The > backend which errored out again tries to perform an operation for > which it needs to perform undo. Now, we won't allow this backend to > perform that action because if it crashed after performing the > operation and before committing, the hash table will overflow. 
What I don't understand is why there's any need for these "in hash table, but not in any queue, and not being processed" type entries. All that avoiding that seems to require is making the error queue a bit bigger?

> > > + /* There might not be any undo log and hibernation might be needed. */
> > > + *hibernate = true;
> > > +
> > > + StartTransactionCommand();
> >
> > Why do we need this? I assume it's so we can have a resource owner?
>
> Yes, and another reason is we are using dbid_exists in this function.

I think it'd be good to avoid needing any database access in both the discard worker and the undo launcher. They really shouldn't need catalog access architecturally, and in the case of the discard worker we'd add another process that'd potentially hold the xmin horizon down for a while in some situations. We could of course add exceptions like we have for vacuum, but I think we really shouldn't need that.
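To be concrete about the locks/undo sequence I suggested earlier (steps 1-6), I'm imagining something roughly like this untested sketch - the lock-transfer helpers and the undo-apply call are hypothetical, the rest are existing xact.c entry points:

    ResourceOwner undo_owner = ResourceOwnerCreate(NULL, "undo");

    TransferLocksToOwner(undo_owner);   /* 1) hypothetical: keep the failed
                                         *    xact's locks alive */
    AbortCurrentTransaction();          /* 2) complete normal abort cleanup */
    StartTransactionCommand();          /* 3) fresh transaction environment */
    PerformUndoActions(fxid);           /* 4) hypothetical: apply the undo */
    CommitTransactionCommand();         /* 5) or abort this xact on failure */
    ReleaseLocksFromOwner(undo_owner);  /* 6) finally drop the locks */

Greetings,

Andres Freund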
On Thu, Aug 8, 2019 at 9:31 AM Andres Freund <andres@anarazel.de> wrote:
> I know that Robert is working on a patch that revises the undo request
> layer somewhat, it's possible that this is best discussed afterwards.

Here's what I have at the moment. This is not by any means a complete replacement for Amit's undo worker machinery, but it is a significant redesign of (and, I believe, a significant improvement to) the queue management stuff from Amit's patch. I wrote this pretty quickly, so while it passes simple testing, it probably has a number of bugs, and to actually use it, it would need to be integrated with xact.c; right now it's just a standalone module that doesn't do anything except let itself be tested.

Some of the ways it is different from Amit's patches include:

* It uses RBTree rather than binaryheap, so when we look ahead, we look ahead in the right order.

* There's no limit to the lookahead distance; when looking ahead, it will search the entirety of all 3 RBTrees for an entry from the right database.

* It doesn't have a separate hash table keyed by XID. I didn't find that necessary.

* It's better-isolated, as you can see from the fact that I've included a test module that tests this code without actually ever putting an UndoRequestManager in shared memory. I would've liked to expand this test module, but I don't have time to do that today and felt it better to get this much sent out.

* It has a lot of comments explaining the design and how it's intended to integrate with the rest of the system.

Broadly, my vision for how this would get used is:

- Create an UndoRequestManager in shared memory.

- Before a transaction first attaches to a permanent or unlogged undo log, xact.c would call RegisterUndoRequest(); thereafter, xact.c would store a pointer to the UndoRequest for the lifetime of the toplevel transaction.

- Immediately after attaching to a permanent or unlogged undo log, xact.c would call UndoRequestSetLocation.

- xact.c would track the number of bytes of permanent and unlogged undo records the transaction generates. If the transaction goes on to abort, it reports these by calling FinalizeUndoRequest.

- If the transaction commits, it doesn't need that information, but does need to call UnregisterUndoRequest() as a post-commit step in CommitTransaction().

- In the case of an abort, after calling FinalizeUndoRequest, xact.c would call PerformUndoInBackground() to find out whether to do undo in the background or the foreground. If undo is to be done in the foreground, the backend must go on to call UnregisterUndoRequest() if undo succeeds, and RescheduleUndoRequest() if it fails.

- In the case of a prepared transaction, a pointer to the UndoRequest would get stored in the GlobalTransaction (but nothing extra would get stored in the twophase state file).

- COMMIT PREPARED calls UnregisterUndoRequest().

- ROLLBACK PREPARED calls PerformUndoInBackground; if told to do undo in the foreground, it must go on to call either UnregisterUndoRequest() on success or RescheduleUndoRequest() on failure, just like in the regular abort case.

- After a crash, once recovery is complete but before we open for connections, or at least before we allow any new undo activity, the discard worker scans all the logs and makes a bunch of calls to RecreateUndoRequest(). Then, for each prepared transaction that still exists, it calls SuspendPreparedUndoRequest() and uses the return value to reset the UndoRequest pointer in the GlobalTransaction.
Only once both of those steps are completed can undo workers be safely started.

- Undo workers call GetNextUndoRequest() to get the next task that they should perform, and once they do, they "own" the undo request. When undo succeeds or fails, they must call either UnregisterUndoRequest() or RescheduleUndoRequest(), as appropriate, just like for foreground undo.

Making sure this is water-tight will probably require some well-done integration with xact.c, so that an undo request that we "own" because we got it in a background undo apply process looks exactly the same as one we "own" because it's our transaction originally.
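In code form, the abort path above would look roughly like this - an untested sketch; the exact signatures are guesses, and ExecuteUndoActions stands in for whatever actually applies the undo records (it's not part of this module):

    /* Hypothetical sketch of the abort-path protocol. */
    FinalizeUndoRequest(req, perm_undo_bytes, unlogged_undo_bytes);

    if (!PerformUndoInBackground(req))
    {
        /* Foreground undo: this backend owns the request until done. */
        if (ExecuteUndoActions(req))        /* hypothetical */
            UnregisterUndoRequest(req);
        else
            RescheduleUndoRequest(req);
    }
    /* Otherwise, a background undo worker now owns the request. */

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company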
Attachment
On Fri, Aug 9, 2019 at 1:57 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 8, 2019 at 9:31 AM Andres Freund <andres@anarazel.de> wrote:
> > I know that Robert is working on a patch that revises the undo request
> > layer somewhat, it's possible that this is best discussed afterwards.
>
> Here's what I have at the moment. This is not by any means a complete
> replacement for Amit's undo worker machinery, but it is a significant
> redesign of (and, I believe, a significant improvement to) the queue
> management stuff from Amit's patch.

Thanks for working on this. Neither Kuntal nor I have had time to look into this part in detail.

> I wrote this pretty quickly, so
> while it passes simple testing, it probably has a number of bugs, and
> to actually use it, it would need to be integrated with xact.c;

I can look into this and integrate it with the other parts of the patch next week, unless you are planning to do so. Right now, I am working on fixing up some other comments raised on the patches, which I will share today or early next week, after which I can start looking into this. I hope that is fine with you.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 22, 2019 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have reviewed 0012-Infrastructure-to-execute-pending-undo-actions, > Please find my comment so far. > > 1. > + /* It shouldn't be discarded. */ > + Assert(!UndoRecPtrIsDiscarded(xact_urp)); > > I think comments can be added to explain why it shouldn't be discarded. > > 2. > + /* Compute the offset of the uur_next in the undo record. */ > + offset = SizeOfUndoRecordHeader + > + offsetof(UndoRecordTransaction, urec_progress); > + > in comment /uur_next/uur_progress > > 3. > +/* > + * undo_record_comparator > + * > + * qsort comparator to handle undo record for applying undo actions of the > + * transaction. > + */ > Function header formating is not in sync with other functions. > Fixed all the above comments in the attached patch. > 4. > +void > +undoaction_redo(XLogReaderState *record) > +{ > + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; > + > + switch (info) > + { > + case XLOG_UNDO_APPLY_PROGRESS: > + undo_xlog_apply_progress(record); > + break; > > For HotStandby it doesn't make sense to apply this wal as this > progress is only required when we try to apply the undo action after > restart > but in HotStandby we never apply undo actions. > I have already responded in my earlier email on why this is required [1]. > 5. > + Assert(from_urecptr != InvalidUndoRecPtr); > + Assert(to_urecptr != InvalidUndoRecPtr); > > we can use macros UndoRecPtrIsValid instead of checking like this. > Fixed. > 6. > + if ((slot == NULL) || (UndoRecPtrGetLogNo(urecptr) != slot->logno)) > + slot = UndoLogGetSlot(UndoRecPtrGetLogNo(urecptr), false); > + > + Assert(slot != NULL); > We are passing missing_ok as false in UndoLogGetSlot. But, not sure > why we are expecting that undo lot can not be dropped. In multi-log > transaction it's possible > that the tablespace in which next undolog is there is already dropped? > Already responded on this in my earlier reply [1]. > 7. > + */ > + do > + { > + BlockNumber progress_block_num = InvalidBlockNumber; > + int i; > + int nrecords; > ..... > + */ > + if (!UndoRecPtrIsValid(urec_ptr)) > + break; > + } while (true); > > I think we can convert above loop to while(true) instead of do..while, > because there is no need for do while loop. > > 8. > + if (last_urecinfo->uur->uur_info & UREC_INFO_LOGSWITCH) > + { > + UndoRecordLogSwitch *logswitch = last_urecinfo->uur->uur_logswitch; > > IMHO, the caller of UndoFetchRecord should directly check > uur->uur_logswitch instead of uur_info & UREC_INFO_LOGSWITCH. > Actually, uur_info is internally set > for inserting the tuple and check there to know what to insert and > fetch but I think caller of UndoFetchRecord should directly rely on > the field because ideally all > the fields in UnpackUndoRecord must be set and uur_txt or > uur_logswitch will be allocated when those headers present. I think > this needs to be improved in undo interface patch > as well (in UndoBulkFetchRecord). > Okay, fixed both of the above. I have exposed a new macro IsUndoLogSwitched from undorecord.h which you might also want to use in your patch. Apart from this, in the attached patches, I have fixed various comments raised in this thread from Amit Khandekar. I'll respond to them separately. I have yet to address various comments raised by Andres and Robert which also includes integration with the latest patch on queues posted by Robert. 
Note - The patches for undo-log and undo-interface has not been rebased as others are working actively on their branches. The branch where this code resides can be accessed at https://github.com/EnterpriseDB/zheap/tree/undoprocessing [1] - https://www.postgresql.org/message-id/CAA4eK1KoA0L%3DPNBc_uu2v8H0%3DLA_Cm%3Do9GyFm6i6DSD6mUMppg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Move-some-md.c-specific-logic-from-smgr.c-to-md.c.patch
- 0002-Prepare-to-support-multiple-SMGR-implementations.patch
- 0003-Add-undo-log-manager.patch
- 0004-Allow-WAL-record-data-on-first-modification-after-a-.patch
- 0005-Add-prefetch-support-for-the-undo-log.patch
- 0006-Defect-and-enhancement-in-multi-log-support.patch
- 0007-Provide-interfaces-to-store-and-fetch-undo-records.patch
- 0008-undo-page-consistency-checker.patch
- 0009-Extend-binary-heap-functionality.patch
- 0010-Infrastructure-to-register-and-fetch-undo-action-req.patch
- 0011-Infrastructure-to-execute-pending-undo-actions.patch
- 0012-Allow-foreground-transactions-to-perform-undo-action.patch
- 0013-Allow-execution-and-discard-of-undo-by-background-wo.patch
On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Mon, 22 Jul 2019 at 14:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have started review of
> 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below
> are some quick comments to start with:
>
> +++ b/src/backend/access/undo/undoworker.c
>
> +#include "access/xact.h"
> +#include "access/undorequest.h"
> Order is not alphabetical
>

Fixed this and a few others.

> + * Each undo worker then start reading from one of the queue the requests for
> start=>starts
> queue=>queues
>
> -------------
>
> + rc = WaitLatch(MyLatch,
> + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
> + 10L, WAIT_EVENT_BGWORKER_STARTUP);
> +
> + /* emergency bailout if postmaster has died */
> + if (rc & WL_POSTMASTER_DEATH)
> + proc_exit(1);
> I think now, thanks to commit cfdf4dc4fc9635a, you don't have to
> explicitly handle postmaster death; instead you can use
> WL_EXIT_ON_PM_DEATH. Please check at all such places where this is
> done in this patch.
>
> -------------
>

Fixed both of the above issues.

> +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> +{
> + /* Block concurrent access. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> +
> + MyUndoWorker = &UndoApplyCtx->workers[slot];
> Not sure why MyUndoWorker is used here. Can't we use a local variable ?
> Or do we intentionally attach to the slot as a side-operation ?
>
> -------------
>

I have changed the code around this such that we first attach to the slot and then get the required info. Also, I don't see the need for an exclusive lock here, so I changed it to a shared lock.

> + * Get the dbid where the wroker should connect to and get the worker
> wroker=>worker
>
> -------------
>
> + BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0);
> 0, 0 => InvalidOid, 0
>
> + * Set the undo worker request queue from which the undo worker start
> + * looking for a work.
> start => should start
> a work => work
>
> --------------
>

Fixed both of these.

> + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> I was thinking what happens if for some reason
> InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> entry will not be marked invalid, and so there will be no undo action
> carried out because I think the undo worker will exit. What happens
> next with this entry ?

I think this will change after integration with Robert's latest patch on queues, so I will address it along with that if required.
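With WL_EXIT_ON_PM_DEATH, the waits mentioned above now reduce to something like this, with no explicit postmaster-death check needed:

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com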
On Tue, Jul 23, 2019 at 8:12 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > -------------------- > > Some further review comments for undoworker.c : > > > +/* Sets the worker's lingering status. */ > +static void > +UndoWorkerIsLingering(bool sleep) > The function name sounds like "is the worker lingering ?". Can we > rename it to something like "UndoWorkerSetLingering" ? > makes sense, changed as per suggestion. > ------------- > > + errmsg("undo worker slot %d is empty, cannot attach", > + slot))); > > + } > + > + if (MyUndoWorker->proc) > + { > + LWLockRelease(UndoWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("undo worker slot %d is already used by " > + "another worker, cannot attach", slot))); > > These two error messages can have a common error message "could not > attach to worker slot", with errdetail separate for each of them : > slot %d is empty. > slot %d is already used by another worker. > > -------------- > Changed as per suggestion. > +static int > +IsUndoWorkerAvailable(void) > +{ > + int i; > + int alive_workers = 0; > + > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > > Should have bool return value. > > Also, why is it keeping track of number of alive workers ? Sounds like > earlier it used to return number of alive workers ? If it indeed needs > to just return true/false, we can do away with alive_workers. > > Also, *exclusive* lock is unnecessary. > > -------------- > Changed as per suggestion. Additionally, I changed the name of the function to UndoWorkerIsAvailable(), so that it is similar to other functions in the file. > +if (UndoGetWork(false, false, &urinfo, NULL) && > + IsUndoWorkerAvailable()) > + UndoWorkerLaunch(urinfo); > > There is no lock acquired between IsUndoWorkerAvailable() and > UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable() > returns true, there is a small window where UndoWorkerLaunch() does > not find any worker slot with in_use false, causing assertion failure > for (worker != NULL). > -------------- > I have removed the assert and instead added a warning. I have also added a comment from the place where we call UndoWorkerLaunch to mention the race condition. > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > +{ > + /* Block concurrent access. */ > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > *Exclusive* lock is unnecessary. > ------------- > Right, changed to share lock. > + LWLockRelease(UndoWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("undo worker slot %d is empty", > + slot))); > I believe there is no need to explicitly release an lwlock before > raising an error, since the lwlocks get released during error > recovery. Please check all other places where this is done. > ------------- > Fixed. > + * Start new undo apply background worker, if possible otherwise return false. > worker, if possible otherwise => worker if possible, otherwise > ------------- > > +static bool > +UndoWorkerLaunch(UndoRequestInfo urinfo) > We don't check UndoWorkerLaunch() return value. Can't we make it's > return value type void ? > Now, the function returns void and accordingly I have adjusted the comment which should address both the above comments. > Also, it would be better to have urinfo as pointer to UndoRequestInfo > rather than UndoRequestInfo, so as to avoid structure copy. > ------------- > Okay, changed as per suggestion. 
> +{
> + BackgroundWorker bgw;
> + BackgroundWorkerHandle *bgw_handle;
> + uint16 generation;
> + int i;
> + int slot = 0;
> We can remove variable i, and use slot variable in place of i.
> -----------
>
> + snprintf(bgw.bgw_name, BGW_MAXLEN, "undo apply worker");
> I think it would be trivial to also append the worker->generation in
> the bgw_name.
> -------------
>

I am not sure if adding 'generation' is of any use. It might be better to add the database id, as each worker can work on a particular database, so that could be useful information.

>
> + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
> + {
> + /* Failed to start worker, so clean up the worker slot. */
> + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> + UndoWorkerCleanup(worker);
> + LWLockRelease(UndoWorkerLock);
> +
> + return false;
> + }
>
> Is it intentional that there is no (warning?) message logged when we
> can't register a bg worker ?
> -------------
>

Added a warning in that code path.
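The new code is roughly along these lines (the exact errcode and message wording in the posted patch may differ; note the function now returns void):

    if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
    {
        /* Failed to start worker, so clean up the worker slot. */
        LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
        UndoWorkerCleanup(worker);
        LWLockRelease(UndoWorkerLock);

        ereport(WARNING,
                (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                 errmsg("out of background worker slots"),
                 errhint("You might need to increase max_worker_processes.")));
        return;
    }

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com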
On Fri, Jul 26, 2019 at 9:57 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote: > > On Fri, 26 Jul 2019 at 12:25, Amit Kapila <amit.kapila16@gmail.com> wrote: > > I agree with all your other comments. > > Thanks for addressing the comments. Below is the continuation of my > comments from 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch > : > > > + * Perform rollback request. We need to connect to the database for first > + * request and that is required because we access system tables while > for first request and that is required => for the first request. This > is required > Changed as per suggestion. > --------------- > > +UndoLauncherShmemSize(void) > +{ > + Size size; > + > + /* > + * Need the fixed struct and the array of LogicalRepWorker. > + */ > + size = sizeof(UndoApplyCtxStruct); > > The fixed structure size should be offsetof(UndoApplyCtxStruct, > workers) rather than sizeof(UndoApplyCtxStruct) > > --------------- > Why? I see the similar code in ApplyLauncherShmemSize. If there is any problem with this, then we might have similar problem in existing code as well. > In UndoWorkerCleanup(), we set individual fields of the > UndoApplyWorker structure, whereas in UndoLauncherShmemInit(), for all > the UndoApplyWorker array elements, we just memset all the > UndoApplyWorker structure elements to 0. I think we should be > consistent for the two cases. I guess we can just memset to 0 as you > do in UndoLauncherShmemInit(), but this will cause the > worker->undo_worker_queue to be 0 i.e. XID_QUEUE , whereas in > UndoWorkerCleanup(), it is set to -1. Is the -1 value essential, or we > can just set it to XID_QUEUE initially ? Either is fine, because before launching the worker we set the valid value. It is better to set it as InvalidUndoWorkerQueue though. > Also, if we just use memset in UndoWorkerCleanup(), we need to first > save generation into a temp variable, and then after memset(), restore > it back. > This sounds like an unnecessary trickery. > That brought me to another point : > We already have a macro ResetUndoRequestInfo(), so UndoWorkerCleanup() > can just call ResetUndoRequestInfo(). > ------------ > Hmm, both (UndoRequestInfo and UndoApplyWorker) are separate structures, so how can we reuse them? > + bool allow_peek; > + > + CHECK_FOR_INTERRUPTS(); > + > + allow_peek = !TimestampDifferenceExceeds(started_at, > Some comments would be good about what is allow_peek used for. Something like : > "Arrange to prevent the worker from restarting quickly to switch databases" > Added a slightly different comment. > ----------------- > +++ b/src/backend/access/undo/README.UndoProcessing > ----------------- > > +worker then start reading from one of the queues the requests for that > start=>starts > --------------- > > +work, it lingers for UNDO_WORKER_LINGER_MS (10s as default). This avoids > As per the latest definition, it is 20s. IMHO, there's no need to > mention the default value in the readme. > --------------- > > +++ b/src/backend/access/undo/discardworker.c > --------------- > > + * portion of transaction that is overflowed into a separate log can > be processed > 80-col crossed. > > +#include "access/undodiscard.h" > +#include "access/discardworker.h" > Not in alphabetical order > Fixed all of the above four points. > > +++ b/src/backend/access/undo/undodiscard.c > --------------- > > + next_insert = UndoLogGetNextInsertPtr(logno); > I checked UndoLogGetNextInsertPtr() definition. It calls > find_undo_log_slot() to get back the slot from logno. 
> Why not make it
> accept slot as against logno ? At all other places, the slot->logno is
> passed, so it is convenient to just pass the slot there. And in
> UndoDiscardOneLog(), first call find_undo_log_slot() just before the
> above line (or call it at the end of the do-while loop).
>

I am not sure if this is a good idea, because find_undo_log_slot is purely undolog module internals; exposing it outside doesn't seem like a good idea to me.

> This way,
> during each of the UndoLogGetNextInsertPtr() calls in undorequest.c,
> we will have one less find_undo_log_slot() call.
>

I am not sure there is any performance benefit either, because there is a cache for the slots and the lookup should be satisfied from there very quickly. I think we can avoid this repeated call, though, and I have done so in the attached patch.

> -------------
>
> In UndoDiscardOneLog(), there are at least 2 variable declarations
> that can be moved inside the do-while loop : uur and next_insert. I am
> not sure about the other variables viz : undofxid and
> latest_discardxid. Values of these variables in one iteration continue
> across to the second iteration. For latest_discardxid, it looks like
> we do want its value to be carried forward, but is it also true for
> undofxid ?
>

undofxid can be moved inside the loop; fixed that and the other variables pointed out by you.

> + /* If we reach here, this means there is something to discard. */
> + need_discard = true;
> + } while (true);
>
> Also, about need_discard; there is no place where need_discard is set
> to false. That means, from 2nd iteration onwards, it will never be
> false. So even if the code that explicitly sets need_discard to true
> does not get run, still the undolog will be discarded. Is this
> expected ?
> -------------

Yes. We will discard once we have even one transaction's data to discard. For example, say we decided that we can discard the data for transaction id 501, and then the next transaction 502 is aborted and its actions are not yet applied; in that case, we will still discard the data of transaction 501. I hope this answers your question.

>
> + if (request_rollback && dbid_exists(uur->uur_txn->urec_dbid))
> + {
> + (void) RegisterRollbackReq(InvalidUndoRecPtr,
> + undo_recptr,
> + uur->uur_txn->urec_dbid,
> + uur->uur_fxid);
> +
> + pending_abort = true;
> + }
> We can get rid of request_rollback variable. Whatever the "if" block
> above is doing, do it in this upper condition :
> if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress))
>
> Something like this :
>
> if (!IsXactApplyProgressCompleted(uur->uur_txn->urec_progress))
> {
> if (dbid_exists(uur->uur_txn->urec_dbid))
> {
> (void) RegisterRollbackReq(InvalidUndoRecPtr,
> undo_recptr,
> uur->uur_txn->urec_dbid,
> uur->uur_fxid);
>
> pending_abort = true;
> }
> }

Hmm, you also need to check that the transaction is not in-progress along with it. I think there will be more movement of checks, and that will make the code look less readable than it is now.

> -------------
>
> + UndoRecordRelease(uur);
> + uur = NULL;
> + }
.....
.....
+ Assert(uur == NULL);
> +
> + /* If we reach here, this means there is something to discard. */
> + need_discard = true;
> + } while (true);
>
> Looks like it is neither necessary to set uur to NULL, nor is it
> necessary to have the Assert(uur == NULL). At the start of each
> iteration uur is anyway assigned a fresh value, which may or may not
> be NULL.
> -------------
>

I think there is no harm in doing what you are saying, but the idea here is to not miss releasing the undo record.
Basically, if we have fetched a valid undo record, then it must be released. I understand this is not a bullet-proof Assert, because one might set it to NULL without actually releasing the memory. For now, I have added a comment before the Assert; see if that makes sense.

> + * over undo logs is complete, new undo can is allowed to be written in the
> new undo can is allowed => new undo is allowed
>
> + * hash table size. So before start allowing any new transaction to write the
> before start allowing => before allowing any new transactions to start
> writing the
> -------------
>

Changed as per suggestion.

> + /* Get the smallest of 'xid having pending undo' and 'oldestXmin' */
> + oldestXidHavingUndo = RollbackHTGetOldestFullXid(oldestXidHavingUndo);
> + ....
> + ....
> + if (FullTransactionIdIsValid(oldestXidHavingUndo))
> + pg_atomic_write_u64(&ProcGlobal->oldestFullXidHavingUnappliedUndo,
> + U64FromFullTransactionId(oldestXidHavingUndo));
>
> Is it possible that the FullTransactionId returned by
> RollbackHTGetOldestFullXid() could be invalid ? If not, then the if
> condition above can be changed to an Assert().
> -------------
>

Yeah, it could be changed to an Assert.

>
> + * If the log is already discarded, then we are done. It is important
> + * to first check this to ensure that tablespace containing this log
> + * doesn't get dropped concurrently.
> + */
> + LWLockAcquire(&slot->mutex, LW_SHARED);
> + /*
> + * We don't have to worry about slot recycling and check the logno
> + * here, since we don't care about the identity of this slot, we're
> + * visiting all of them.
> I guess, it's accidental that the LWLockAcquire() call is *between*
> the two comments ?
> -----------
>

I think it is better to have them as a single comment before acquiring the lock, so I changed it that way.

> + if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED)
> + {
> + /*
> + * For the "shared" category, we only discard when the
> + * rm_undo_status callback tells us we can.
> + */
> + status = RmgrTable[uur->uur_rmid].rm_undo_status(uur, &wait_xid);
> status variable could be declared in this block itself.
> -------------
>

Thomas mentioned that he is planning to change the implementation of shared undo logs, so let's keep this as it is for now.

>
> Some variable declaration alignments and comments spacing need changes
> as per pgindent.
>

I have left this for now; I will take care of it in the next version.

Thanks, Amit Khandekar, for all your review comments. As far as I know, I have addressed all of your review comments raised so far related to the undo-processing patches. Do let me know if I have missed anything. Please find the latest patches in my email upthread [1].

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BMcY0qGaak0AHyzdgAn%2BF6dyxcpDwp9ifGg%3D1WVDadeQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 7, 2019 at 6:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Yeah, that's also a problem with complicated WAL record types. Hopefully
> the complex cases are an exception, not the norm. A complex case is
> unlikely to fit any pre-defined set of fields anyway. (We could look at
> how e.g. protobuf works, if this is really a big problem. I'm not
> suggesting that we add a dependency just for this, but there might be
> some patterns or interfaces that we could mimic.)

I think what you're calling the complex cases are going to be pretty normal cases, not something exotic, but I do agree with you that making the infrastructure more generic is worth considering. One idea I had is to use the facilities from pqformat.h: have the generic code read whatever the common fields are, and then pass the StringInfo to the AM, which can do whatever it wants with the rest of the record; these facilities would probably make it pretty easy to handle either a series of fixed-length fields or, alternatively, variable-length data. What do you think of that idea? (That would not preclude doing compression on top, although I think that feeding everything through pglz or even lz4/snappy may eat more CPU cycles than we can really afford. The option is there, though.)

> If you remember, we did a big WAL format refactoring in 9.5, which moved
> some information from AM-specific structs to the common headers. Namely,
> the information on the relation blocks that the WAL record applies to.
> That was a very handy refactoring, and allowed tools like pg_waldump to
> print more detailed information about all WAL record types. For WAL
> records, moving the block information was natural, because there was
> special handling for full-page images anyway. However, I don't think we
> have enough experience with UNDO log yet, to know which fields would be
> best to include in the common undo header, and which to leave as
> AM-specific payload. I think we should keep the common header slim, and
> delegate to the AM routines.

Yeah, I remember. I'm not really sure I totally buy your argument that we don't know what besides XID should go into an undo record: tuples are a pretty important concept, and although there might be some exceptions here and there, I have a hard time imagining that undo is going to be primarily about anything other than identifying a tuple and recording something you did to it. On the other hand, you might want to identify several tuples, or identify a tuple with a TID that's not 6 bytes, so that's a good reason for allowing more flexibility.

Another point in favor of being more flexible is that it's not clear that there's any use case for third-party tools that work using undo. WAL drives replication and logical decoding and could be used to drive incremental backup, but it's not really clear that similar applications exist for undo. If it's just private to the AM, the AM might as well be responsible for it. If that leads to code duplication, we can create a library of common routines and AM users can use them if they want.

> Hmm. If you're following an UNDO chain, from newest to oldest, I would
> assume that the newer record has enough information to decide whether
> you need to look at the previous record. If the previous record is no
> longer interesting, it might already be discarded away, after all.

I actually thought zedstore might need this pattern.
If you store an XID with each undo pointer, as the current zheap code mostly does, then you have enough information to decide whether you care about the previous undo record before you fetch it. But if a tuple stores only an undo pointer, and you determine that the undo isn't discarded, you have to fetch the record first and then possibly decide that you had the right version in the first place. Now, maybe that pattern doesn't repeat, because the undo records could be set up to contain both an XMIN and an XMAX, but not necessarily. I don't know exactly what you have in mind, but it doesn't seem totally crazy that an undo record might contain the XID that created that version but not the XID that created the prior version, and if so, you'll iterate backwards until you either hit the end of undo or go one undo record past the version you can see.
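Roughly the loop I'm imagining, as an untested sketch - the visibility test and the uur_blkprev field name are guesses, not the actual API:

    /* Walk the undo chain backwards until we find a visible version. */
    UnpackedUndoRecord *cur = UndoFetchRecord(tuple_undo_ptr);

    while (cur != NULL && !VersionVisibleToSnapshot(cur, snapshot))
    {
        UndoRecPtr  prev = cur->uur_blkprev;    /* assumed field name */

        UndoRecordRelease(cur);
        cur = NULL;
        if (!UndoRecPtrIsValid(prev) || UndoRecPtrIsDiscarded(prev))
            break;      /* hit the end of undo, one record past our version */
        cur = UndoFetchRecord(prev);
    }

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company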
Hi All,

Please find the updated patch for the undo interface layer. I have rebased the undoprocessing patches on top of that, and there are also some changes in the undo storage patch for handling multi-log transactions, for which I am attaching a separate patch [0006-Defect-and-enhancement-in-multi-log-support.patch].

Mainly, the new patch includes:

1. Improvement in log-switch handling during recovery: earlier we were detecting a log switch during recovery by adding a separate WAL record, but in this version we detect it from the registered buffer in the WAL. By doing this we avoid the extra WAL, and this method is more in sync with how the undo log is identified in UndoLogAllocateInRecovery.

2. Improved mechanism of undo compression: instead of keeping the compression info in a global variable, we read it from the page into which we are inserting the undo record.

3. Improved README file.

Apart from this, I have worked on the review comments posted in this thread. I will reply to all those emails separately.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Jul 18, 2019 at 4:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 16, 2019 at 2:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Few comments on the new patch:
>
> 1.
> Additionally,
> +there is a mechanism for multi-insert, wherein multiple records are prepared
> +and inserted at a time.
>
> Which mechanism are you talking about here? By any chance is this
> related to some old code?

In the current code also, we have the option to prepare multiple records and insert them at once. I have enhanced the comments to make this more clear.

>
> 2.
> +Fetching and undo record
> +------------------------
> +To fetch an undo record, a caller must provide a valid undo record pointer.
> +Optionally, the caller can provide a callback function with the information of
> +the block and offset, which will help in faster retrieval of undo record,
> +otherwise, it has to traverse the undo-chain.
>
> I think this is out-dated information. You seem to forget updating
> README after latest changes in API.

Right, fixed.

>
> 3.
> + * The cid/xid/reloid/rmid information will be added in the undo record header
> + * in the following cases:
> + * a) The first undo record of the transaction.
> + * b) First undo record of the page.
> + * c) All subsequent record for the transaction which is not the first
> + * transaction on the page.
> + * Except above cases, If the rmid/reloid/xid/cid is same in the subsequent
> + * records this information will not be stored in the record, these information
> + * will be retrieved from the first undo record of that page.
> + * If any of the member rmid/reloid/xid/cid has changed, the changed information
> + * will be stored in the undo record and the remaining information will be
> + * retrieved from the first complete undo record of the page
> + */
> +UndoCompressionInfo undo_compression_info[UndoLogCategories];
>
> a. Do we want to compress fork_number also? It is an optional field
> and is only include when undo record is for not MAIN_FORKNUM. For
> zheap, this means it will never be included, but in future, it could
> be included for some other AM or some other use case. So, not sure if
> there is any benefit in compressing the same.

Yeah, so as of now I haven't compressed the forkno.

>
> b. cid/xid/reloid/rmid - I think it is better to write it as rmid,
> reloid, xid, cid in the same order as you declare them in
> UndoPackStage.
>
> c. Some minor corrections. /Except above/Except for above/; /, If
> the/, if the/; /is same/is the same/; /record, these
> information/record rather this information/
>
> d. I think there is no need to start the line "If any of the..." from
> a new line, it can be continued where the previous line ends. Also,
> at the end of that line, add a full stop.

These comments are removed in the new patch.

>
> 4.
> /*
> + * Copy the compression global compression info to our context before
> + * starting prepare because this value might get updated multiple time in
> + * case of multi-prepare but the global value should be updated only after
> + * we have successfully inserted the undo record.
> + */
>
> In the above comment, the first 'compression' is not required. /time/times/

These comments are changed now, as the design is changed.

>
> 5.
> +/*
> + * The below common information will be stored in the first undo record of the page.
> + * Every subsequent undo record will not store this information, if required this information
> + * will be retrieved from the first undo record of the page.
> + */ > +typedef struct UndoCompressionInfo > > The line length in the above comments exceeds the 80-char limit. You > might want to run pgindent to avoid such problems. Fixed. > > 6. > +/* > + * Exclude the common info in undo record flag and also set the compression > + * info in the context. > + * > > 'flag' seems to be a redundant word here? This comment is obsolete as per the new changes. > > 7. > +UndoSetCommonInfo(UndoCompressionInfo *compressioninfo, > + UnpackedUndoRecord *urec, UndoRecPtr urp, > + Buffer buffer) > +{ > + > + /* > + * If we have valid compression info and the for the same transaction and > + * the current undo record is on the same block as the last undo record > + * then exclude the common information which are same as first complete > + * record on the page. > + */ > + if (compressioninfo->valid && > + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) && > + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp)) > > Here the comment is just a verbal for of if-check. How about writing > it as: "Exclude the common information from the record which is same > as the first record on the page." Tried to improve this in the new code. > > 8. > UndoSetCommonInfo() > { > .. > if (compressioninfo->valid && > + FullTransactionIdEquals(compressioninfo->fxid, urec->uur_fxid) && > + UndoRecPtrGetBlockNum(urp) == UndoRecPtrGetBlockNum(lasturp)) > + { > + urec->uur_info &= ~UREC_INFO_XID; > + > + /* Don't include rmid if it's same. */ > + if (urec->uur_rmid == compressioninfo->rmid) > + urec->uur_info &= ~UREC_INFO_RMID; > + > + /* Don't include reloid if it's same. */ > + if (urec->uur_reloid == compressioninfo->reloid) > + urec->uur_info &= ~UREC_INFO_RELOID; > > In all the checks except for transaction id, urec's info is on the > left side. I think all the checks can be consistent. > > These are some of the things I noticed while skimming through this > patch. I will do some more detailed review later. > This code is changed now. Please see the latest patch at https://www.postgresql.org/message-id/CAFiTN-uf4Bh0FHwec%2BJGbiLq%2Bj00V92W162SLd_JVvwW-jwREg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
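The reader side of the scheme discussed in items 7 and 8 is the mirror image: any field whose UREC_INFO_* bit is clear is filled in from the first complete record of the same page. A hedged sketch follows; UndoPageGetFirstCompleteRecord() and the CID flag/field names are assumptions, not the patch's actual API.

    /*
     * Sketch only: reconstruct fields that the writer omitted.  The helper
     * function and the CID names are illustrative.
     */
    static void
    UndoRecordFillCommonFields(UnpackedUndoRecord *urec, Page page)
    {
        UnpackedUndoRecord *first = UndoPageGetFirstCompleteRecord(page);

        if ((urec->uur_info & UREC_INFO_XID) == 0)
            urec->uur_fxid = first->uur_fxid;
        if ((urec->uur_info & UREC_INFO_RMID) == 0)
            urec->uur_rmid = first->uur_rmid;
        if ((urec->uur_info & UREC_INFO_RELOID) == 0)
            urec->uur_reloid = first->uur_reloid;
        if ((urec->uur_info & UREC_INFO_CID) == 0)     /* assumed flag name */
            urec->uur_cid = first->uur_cid;
    }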
On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > I don't like the fact that undoaccess.c has a new global, > > > undo_compression_info. I haven't read the code thoroughly, but do we > > > really need that? I think it's never modified (so it could just be > > > declared const), > > > > Actually, this will get modified otherwise across undo record > > insertion how we will know what was the values of the common fields in > > the first record of the page. Another option could be that every time > > we insert the record, read the value from the first complete undo > > record on the page but that will be costly because for every new > > insertion we need to read the first undo record of the page. > > > > This information won't be shared across transactions, so can't we keep > it in top transaction's state? It seems to me that will be better > than to maintain it as a global state. As replied separately, during recovery we would not have the transaction state, so I have decided to read it from the first record on the page; please check the latest patch. > > Few more comments on this patch: > 1. > PrepareUndoInsert() > { > .. > + if (logswitched) > + { > .. > + } > + else > + { > .. > + resize = true; > .. > + } > + > .. > + > + do > + { > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, rbm); > .. > + rbm = RBM_ZERO; > + cur_blk++; > + } while (cur_size < size); > + > + /* > + * Set/overwrite compression info if required and also exclude the common > + * fields from the undo record if possible. > + */ > + if (UndoSetCommonInfo(compression_info, urec, urecptr, > + context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)) > + resize = true; > + > + if (resize) > + size = UndoRecordExpectedSize(urec) > > I see that in some cases where resize is possible are checked before > buffer allocation and some are after. Isn't it better to do all these > checks before buffer allocation? Also, isn't it better to even > compute changed size before buffer allocation as that might sometimes > help in lesser buffer allocations? Right, fixed. > > Can you find a better way to write > :context->prepared_undo_buffers[prepared_undo->undo_buffer_idx[0]].buf)? > It makes the line too long and difficult to understand. Check for > similar instances in the patch and if possible, change them as well. This code is gone. While replying I realised that I haven't scanned the complete code for such occurrences. I will work on that in the next version. > > 2. > +InsertPreparedUndo(UndoRecordInsertContext *context) > { > .. > /* > + * Try to insert the record into the current page. If it > + * doesn't succeed then recall the routine with the next page. > + */ > + InsertUndoData(&ucontext, page, starting_byte); > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > + { > + MarkBufferDirty(buffer); > + break; > + } > + MarkBufferDirty(buffer); > .. > } > > Can't we call MarkBufferDirty(buffer) just before 'if' check? That > will avoid calling it twice. Done (the reshuffled loop is sketched after this message). > > 3. > + * Later, during insert phase we will write actual records into thse buffers. > + */ > +struct PreparedUndoBuffer > > /thse/these Done > > 4. > + /* > + * If we are writing first undo record for the page the we can set the > + * compression so that subsequent records from the same transaction can > + * avoid including common information in the undo records.
> + */ > + if (first_complete_undo) > > /page the we/page then we This code is gone > > 5. > PrepareUndoInsert() > { > .. > After > + * allocation We'll only advance by as many bytes as we turn out to need. > + */ > + UndoRecordSetInfo(urec); > > Change the beginning of comment as: "After allocation, we'll .." Done > > 6. > PrepareUndoInsert() > { > .. > * TODO: instead of storing this in the transaction header we can > + * have separate undo log switch header and store it there. > + */ > + prevlogurp = > + MakeUndoRecPtr(UndoRecPtrGetLogNo(prevlog_insert_urp), > + (UndoRecPtrGetOffset(prevlog_insert_urp) - prevlen)); > + > > I don't think this TODO is valid anymore because now the patch has a > separate log-switch header. Yup. Anyway now the log switch design is changed. > > 7. > /* > + * If undo log is switched then set the logswitch flag and also reset the > + * compression info because we can use same compression info for the new > + * undo log. > + */ > + if (UndoRecPtrIsValid(prevlog_xact_start)) > > /can/can't Right. But now compression code is changed so this comment does not exist. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
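As a footnote to review item 2 in the message above, the deduplicated loop body would look roughly like this, a fragment-level sketch based on the quoted code:

    /* Dirty the buffer exactly once, whether or not the record is done. */
    InsertUndoData(&ucontext, page, starting_byte);
    MarkBufferDirty(buffer);
    if (ucontext.stage == UNDO_PACK_STAGE_DONE)
        break;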
On Wed, Jul 24, 2019 at 11:28 AM Rushabh Lathia <rushabh.lathia@gmail.com> wrote: > > Hi, > > I have stated review of > 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch, here are few > quick comments. > > 1) README.undointerface should provide more information like API details or > the sequence in which API should get called. I have improved the readme where I am describing the more user specific details based on Robert's suggestions offlist. I think I need further improvement which can describe the order of api's to be called. Unfortunately that is not yet included in this patch set. > > 2) Information about the API's in the undoaccess.c file header block would > good. For reference please look at heapam.c. Done > > 3) typo > > + * Later, during insert phase we will write actual records into thse buffers. > + */ > > %s/thse/these Fixed > > 4) UndoRecordUpdateTransInfo() comments says that this must be called under > the critical section, but seems like undo_xlog_apply_progress() do call it > outside of critical section? Is there exception, then should add comments? > or Am I missing anything? During recovery, there is an exception but we can add comments for the same. I think I missed this in the latest patch, I will keep a note of it and will do this in the next version. > > > 5) In function UndoBlockGetFirstUndoRecord() below code: > > /* Calculate the size of the partial record. */ > partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) + > phdr->tuple_len + phdr->payload_len - > phdr->record_offset; > > can directly use UndoPagePartialRecSize(). This function is part of another patch in undoprocessing patch set > > 6) > > +static int > +UndoGetBufferSlot(UndoRecordInsertContext *context, > + RelFileNode rnode, > + BlockNumber blk, > + ReadBufferMode rbm) > +{ > + int i; > > In the above code variable "i" is mean "block index". It would be good > to give some valuable name to the variable, maybe "blockIndex" ? > Fixed > 7) > > * We will also keep a previous undo record pointer to the first and last undo > * record of the transaction in the previous log. The last undo record > * location is used find the previous undo record pointer during rollback. > > > %s/used fine/used to find Fixed > > 8) > > /* > * Defines the number of times we try to wait for rollback hash table to get > * initialized. After these many attempts it will return error and the user > * can retry the operation. > */ > #define ROLLBACK_HT_INIT_WAIT_TRY 60 > > %s/error/an error This is part of different patch in undoprocessing patch set > > 9) > > * we can get the exact size of partial record in this page. > */ > > %s/of partial/of the partial" This comment is removed in the latest code > > 10) > > * urecptr - current transaction's undo record pointer which need to be set in > * the previous transaction's header. > > %s/need/needs Done > > 11) > > /* > * If we are writing first undo record for the page the we can set the > * compression so that subsequent records from the same transaction can > * avoid including common information in the undo records. > */ > > > %s/the page the/the page then > > 12) > > /* > * If the transaction's undo records are split across the undo logs. So > * we need to update our own transaction header in the previous log. > */ > > double space between "to" and "update" Fixed > > 13) > > * The undo record should be freed by the caller by calling ReleaseUndoRecord. 
> * This function will old the pin on the buffer where we read the previous undo > * record so that when this function is called repeatedly with the same context > > %s/old/hold Fixed > > I will continue further review for the same patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
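Since the partial-record computation in item 5 above comes up again later in the thread, a worked example may help. The numbers below are invented; the formula follows UndoPagePartialRecSize() as quoted in the next message, where the extra sizeof(uint16) accounts for the trailing record-length word:

    /*
     * Example with made-up numbers: a record whose header is 24 bytes,
     * carrying a 100-byte payload and no tuple data, of which 60 bytes
     * were already written on the previous page.
     */
    Size header_size   = 24;    /* UndoRecordHeaderSize(phdr->uur_info) */
    Size tuple_len     = 0;     /* phdr->tuple_len */
    Size payload_len   = 100;   /* phdr->payload_len */
    Size record_offset = 60;    /* bytes stored on the previous page */

    Size partial = header_size + tuple_len + payload_len
                   + sizeof(uint16)         /* trailing record length */
                   - record_offset;         /* = 66 bytes on this page */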
On Tue, Jul 30, 2019 at 12:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Hi Dilip, > > > commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e > > Author: Dilip Kumar <dilipkumar@localhost.localdomain> > > Date: Thu May 2 11:28:13 2019 +0530 > > > > Provide interfaces to store and fetch undo records. > > +#include "commands/tablecmds.h" > +#include "storage/block.h" > +#include "storage/buf.h" > +#include "storage/buf_internals.h" > +#include "storage/bufmgr.h" > +#include "miscadmin.h" > > "miscadmin.h" comes before "storage...". Right, fixed. > > +/* > + * Compute the size of the partial record on the undo page. > + * > + * Compute the complete record size by uur_info and variable field length > + * stored in the page header and then subtract the offset of the record so that > + * we can get the exact size of partial record in this page. > + */ > +static inline Size > +UndoPagePartialRecSize(UndoPageHeader phdr) > +{ > + Size size; > > We decided to use size_t everywhere in new code (except perhaps > functions conforming to function pointer types that historically use > Size in their type). > > + /* > + * Compute the header size from undo record uur_info, stored in the page > + * header. > + */ > + size = UndoRecordHeaderSize(phdr->uur_info); > + > + /* > + * Add length of the variable part and undo length. Now, we know the > + * complete length of the undo record. > + */ > + size += phdr->tuple_len + phdr->payload_len + sizeof(uint16); > + > + /* > + * Subtract the size which is stored in the previous page to get the > + * partial record size stored in this page. > + */ > + size -= phdr->record_offset; > + > + return size; > > This is probably a stupid question but why isn't it enough to just > store the offset of the first record that begins on this page, or 0 > for none yet? Why do we need to worry about the partial record's > payload etc? Right, as this patch stand it would be enough to just store the offset where the first complete record start. But for undo page consistency checker we need to mask the CID field in the partial record as well. So we need to know how many bytes of the partial records are already written in the previous page (phdr->record_offset), what all fields are there in the partial record (uur_info) and the variable part to compute the next record offset. Currently, I have improved it by storing the complete record length instead of payload and tuple length but this we can further improve by storing the next record offset directly that will avoid some computation. I haven't worked on undo consistency patch much in this version so I will analyze this further in the next version. > > +UndoRecPtr > +PrepareUndoInsert(UndoRecordInsertContext *context, > + UnpackedUndoRecord *urec, > + Oid dbid) > +{ > ... > + /* Fetch compression info for the transaction. */ > + compression_info = GetTopTransactionUndoCompressionInfo(category); > > How can this work correctly in recovery? [Edit: it doesn't, as you > just pointed out] > > I had started reviewing an older version of your patch (the version > that had made it as far as the undoprocessing branch as of recently), > before I had the bright idea to look for a newer version. I was going > to object to the global variable you had there in the earlier version. > It seems to me that you have to be able to reproduce the exact same > compression in recovery that you produced as "do" time, no? How can > TopTranasctionStateData be the right place for this in recovery? 
> > One data structure that could perhaps hold this would be > UndoLogTableEntry (the per-backend cache, indexed by undo log number, > with pretty fast lookups; used for things like > UndoLogNumberGetCategory()). As long as you never want to have > inter-transaction compression, that should have the right scope to > give recovery per-undo log tracking. If you ever wanted to do > compression between transactions too, maybe UndoLogSlot could work, > but that'd have more complications. Currently, I have read it from the first record on the page. > > +/* > + * Read undo records of the transaction in bulk > + * > + * Read undo records between from_urecptr and to_urecptr until we exhaust the > + * the memory size specified by undo_apply_size. If we could not read all the > + * records till to_urecptr then the caller should consume current set > of records > + * and call this function again. > + * > + * from_urecptr - Where to start fetching the undo records. > If we can not > + * read all the records because of memory limit then this > + * will be set to the previous undo record > pointer from where > + * we need to start fetching on next call. > Otherwise it will > + * be set to InvalidUndoRecPtr. > + * to_urecptr - Last undo record pointer to be fetched. > + * undo_apply_size - Memory segment limit to collect undo records. > + * nrecords - Number of undo records read. > + * one_page - Caller is applying undo only for one block not for > + * complete transaction. If this is set true then instead > + * of following transaction undo chain using > prevlen we will > + * follow the block prev chain of the block so that we can > + * avoid reading many unnecessary undo records of the > + * transaction. > + */ > +UndoRecInfo * > +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, > + int undo_apply_size, int *nrecords, bool one_page) > > Could you please make it clear in comments and assertions what the > relation between from_urecptr and to_urecptr is and what they mean > (they must be in the same undo log, one must be <= the other, both > point to the *start* of a record, so it's not the same as the total > range of undo)? I have enhanced the comments for the same. > > undo_apply_size is not a good parameter name, because the function is > useful for things other than applying records -- like the > undoinspect() extension (or some better version of that), for example. > Maybe max_result_size or something like that? Changed. > > +{ > ... > + /* Allocate memory for next undo record. */ > + uur = palloc0(sizeof(UnpackedUndoRecord)); > ... > + > + size = UnpackedUndoRecordSize(uur); > + total_size += size; > > I see, so the unpacked records are still allocated one at a time. I > guess that's OK for now. From some earlier discussion I had been > expecting an arrangement where the actual records were laid out > contiguously with their subcomponents (things they point to in > palloc()'d memory) nearby. In an earlier version I was allocating one single chunk of memory and then packing the records into it. But there we needed to take care of the alignment of each unpacked undo record so that we could access them directly, so we have changed it this way. > > +static uint16 > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > + UndoLogCategory category) > +{ > ... > + char prevlen[2]; > ... > + prev_rec_len = *(uint16 *) (prevlen); > > I don't think that's OK, and might crash on a non-Intel system. How > about using a union of uint16 and char[2]?
changed > > + /* Copy undo record transaction header if it is present. */ > + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) > + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); > > I was wondering why you don't use D = S instead of mempcy(&D, &S, > size) wherever you can, until I noticed you use these SizeOfXXX macros > that don't include trailing padding from structs, and that's also how > you allocate objects. Hmm. So if I were to complain about you not > using plain old assignment whenever you can, I'd also have to complain > about that. Fixed > > I think that that technique of defining a SizeOfXXX macro that > excludes trailing bytes makes sense for writing into WAL or undo log > buffers using mempcy(). I'm not sure it makes sense for palloc() and > copying into typed variables like you're doing here and I think I'd > prefer the notational simplicity of using the (very humble) type > system facilities C gives us. (Some memory checker might not like it > you palloc(the shorter size) and then use = if the compiler chooses to > implement it as memcpy sizeof().) > > +/* > + * The below common information will be stored in the first undo record of the > + * page. Every subsequent undo record will not store this information, if > + * required this information will be retrieved from the first undo > record of the > + * page. > + */ > +typedef struct UndoCompressionInfo > > Shouldn't this say "Every subsequent record will not store this > information *if it's the same as the relevant fields in the first > record*"? > > +#define UREC_INFO_TRANSACTION 0x001 > +#define UREC_INFO_RMID 0x002 > +#define UREC_INFO_RELOID 0x004 > +#define UREC_INFO_XID 0x008 > > Should we call this UREC_INFO_FXID, since it refers to a FullTransactionId? Done > > +/* > + * Every undo record begins with an UndoRecordHeader structure, which is > + * followed by the additional structures indicated by the contents of > + * urec_info. All structures are packed into the alignment without padding > + * bytes, and the undo record itself need not be aligned either, so care > + * must be taken when reading the header. > + */ > > I think you mean "All structures are packed into undo pages without > considering alignment and without trailing padding bytes"? This comes > from the definition of the SizeOfXXX macros IIUC. There might still > be padding between members of some of those structs, no? Like this > one, that has the second member at offset 2 on my system: Done > > +typedef struct UndoRecordHeader > +{ > + uint8 urec_type; /* record type code */ > + uint16 urec_info; /* flag bits */ > +} UndoRecordHeader; > + > +#define SizeOfUndoRecordHeader \ > + (offsetof(UndoRecordHeader, urec_info) + sizeof(uint16)) > > +/* > + * Information for a transaction to which this undo belongs. This > + * also stores the dbid and the progress of the undo apply during rollback. > + */ > +typedef struct UndoRecordTransaction > +{ > + /* > + * Undo block number where we need to start reading the undo for applying > + * the undo action. InvalidBlockNumber means undo applying hasn't > + * started for the transaction and MaxBlockNumber mean undo completely > + * applied. And, any other block number means we have applied partial undo > + * so next we can start from this block. 
> + */ > + BlockNumber urec_progress; > + Oid urec_dbid; /* database id */ > + UndoRecPtr urec_next; /* urec pointer of the next transaction */ > +} UndoRecordTransaction; > > I propose that we rename this to UndoRecordGroupHeader (or something > like that... maybe "Set", but we also use "set" as a verb in various > relevant function names): I have changed this > > 1. We'll also use these for the new "shared" records we recently > invented that don't relate to a transaction. This is really about > defining the unit of discarding; we throw away the whole set of > records at once, which is why it's basically about proividing a space > for "urec_next". > > 2. Though it also holds rollback progress information, which is a > transaction-specific concept, there can be more than one of these sets > of records for a single transaction anyway. A single transaction can > write undo stuff in more than one undo log (different categories > perm/temp/unlogged/shared and also due to log switching when they are > full). > > So really it's just a header for an arbitrary set of records, used to > track when and how to discard them. > > If you agree with that idea, perhaps urec_next should become something > like urec_next_group, too. "next" is a bit vague, especially for > something as untyped as UndoRecPtr: someone might think it points to > the next record. Changed > > More soon. the latest patch at https://www.postgresql.org/message-id/CAFiTN-uf4Bh0FHwec%2BJGbiLq%2Bj00V92W162SLd_JVvwW-jwREg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
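To show how the UndoBulkFetchRecord() contract quoted earlier in this message is meant to be consumed, here is a hedged caller-side sketch. start_urecptr, the 64MB cap, and process_undo_records() are placeholders; per the discussion above the one_page flag was later removed, so the false argument reflects the signature as quoted here.

    /*
     * Sketch only: fetch a transaction's undo in memory-bounded batches.
     * UndoBulkFetchRecord() rewinds from_urecptr to where the next call
     * should resume, or sets it to InvalidUndoRecPtr when done.
     */
    UndoRecPtr  from_urecptr = start_urecptr;   /* last record, working back */
    int         nrecords;

    while (UndoRecPtrIsValid(from_urecptr))
    {
        UndoRecInfo *records;

        records = UndoBulkFetchRecord(&from_urecptr, to_urecptr,
                                      64 * 1024 * 1024,  /* result size cap */
                                      &nrecords, false);
        process_undo_records(records, nrecords);  /* caller-defined */
    }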
On Tue, Jul 30, 2019 at 1:32 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > Amit, short note: The patches aren't attached in patch order. Obviously > a miniscule thing, but still nicer if that's not the case. > > Dilip, this also contains the start of a review for the undo record > interface further down. > > Subject: [PATCH 07/14] Provide interfaces to store and fetch undo records. > > > > Add the capability to form undo records and store them in undo logs. We > > also provide the capability to fetch the undo records. This layer will use > > undo-log-storage to reserve the space for the undo records and buffer > > management routines to write and read the undo records. > > > > > Undo records are stored in sequential order in the undo log. > > Maybe "In each und log undo records are stored in sequential order."? Done > > > > > +++ b/src/backend/access/undo/README.undointerface > > @@ -0,0 +1,29 @@ > > +Undo record interface layer > > +--------------------------- > > +This is the next layer which sits on top of the undo log storage, which will > > +provide an interface for prepare, insert, or fetch the undo records. This > > +layer will use undo-log-storage to reserve the space for the undo records > > +and buffer management routine to write and read the undo records. > > The reference to "undo log storage" kinda seems like a reference into > nothingness... Changed > > > > +Writing an undo record > > +---------------------- > > +To prepare an undo record, first, it will allocate required space using > > +undo log storage module. Next, it will pin and lock the required buffers and > > +return an undo record pointer where it will insert the record. Finally, it > > +calls the Insert routine for final insertion of prepared record. Additionally, > > +there is a mechanism for multi-insert, wherein multiple records are prepared > > +and inserted at a time. > > I'm not sure whta this is telling me. Who is "it"? > > To me the filename ("interface"), and the title of this section, > suggests this provides documentation on how to write code to insert undo > records. But I don't think this does. I have improved it > > > > +Fetching and undo record > > +------------------------ > > +To fetch an undo record, a caller must provide a valid undo record pointer. > > +Optionally, the caller can provide a callback function with the information of > > +the block and offset, which will help in faster retrieval of undo record, > > +otherwise, it has to traverse the undo-chain. > > > +There is also an interface to bulk fetch the undo records. Where the caller > > +can provide a TO and FROM undo record pointer and the memory limit for storing > > +the undo records. This API will return all the undo record between FROM and TO > > +undo record pointers if they can fit into provided memory limit otherwise, it > > +return whatever can fit into the memory limit. And, the caller can call it > > +repeatedly until it fetches all the records. > > There's a lot of terminology in this file that's not been introduced. I > think this needs to be greatly expanded and restructured to allow people > unfamiliar with the code to benefit. I have improved it, but I think still I need to work on it to introduce the terminology used. > > > > +/*------------------------------------------------------------------------- > > + * > > + * undoaccess.c > > + * entry points for inserting/fetching undo records > > > + * NOTES: > > + * Undo record layout: > > + * > > + * Undo records are stored in sequential order in the undo log. 
Each undo > > + * record consists of a variable length header, tuple data, and payload > > + * information. > > Is that actually true? There's records without tuples, no? Right, changed this > > > The first undo record of each transaction contains a > > + * transaction header that points to the next transaction's start > > header. > > Seems like this needs to reference different persistence levels, > otherwise it seems misleading, given there can be multiple first records > in multiple undo logs? I have changed it. > > > > + * This allows us to discard the entire transaction's log at one-shot > > rather > > s/at/in/ Fixed > > > + * than record-by-record. The callers are not aware of transaction header, > > s/of/of the/ Fixed > > > + * this is entirely maintained and used by undo record layer. See > > s/this/it/ Fixed > > > + * undorecord.h for detailed information about undo record header. > > s/undo record/the undo record/ Fixed > > > I think at the very least there's explanations missing for: > - what is the locking protocol for multiple buffers > - what are the contexts for insertion > - what phases an undo insertion happens in > - updating previous records in general > - what "packing" actually is > > > > + > > +/* Prototypes for static functions. */ > > > Don't think we commonly include that... Changed, removed all unwanted prototypes > > > +static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec, > > + UndoRecPtr urp, RelFileNode rnode, > > + UndoPersistence persistence, > > + Buffer *prevbuf); > > +static int UndoRecordPrepareTransInfo(UndoRecordInsertContext *context, > > + UndoRecPtr xact_urp, int size, int offset); > > +static void UndoRecordUpdateTransInfo(UndoRecordInsertContext *context, > > + int idx); > > +static void UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context, > > + UndoRecPtr urecptr, UndoRecPtr xact_urp); > > +static int UndoGetBufferSlot(UndoRecordInsertContext *context, > > + RelFileNode rnode, BlockNumber blk, > > + ReadBufferMode rbm); > > +static uint16 UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > > + UndoPersistence upersistence); > > + > > +/* > > + * Structure to hold the prepared undo information. > > + */ > > +struct PreparedUndoSpace > > +{ > > + UndoRecPtr urp; /* undo record pointer */ > > + UnpackedUndoRecord *urec; /* undo record */ > > + uint16 size; /* undo record size */ > > + int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array > > + * index*/ > > +}; > > + > > +/* > > + * This holds undo buffers information required for PreparedUndoSpace during > > + * prepare undo time. Basically, during prepare time which is called outside > > + * the critical section we will acquire all necessary undo buffers pin and lock. > > + * Later, during insert phase we will write actual records into thse buffers. > > + */ > > +struct PreparedUndoBuffer > > +{ > > + UndoLogNumber logno; /* Undo log number */ > > + BlockNumber blk; /* block number */ > > + Buffer buf; /* buffer allocated for the block */ > > + bool zero; /* new block full of zeroes */ > > +}; > > Most files define datatypes before function prototypes, because > functions may reference the datatypes. done > > > > +/* > > + * Prepare to update the transaction header > > + * > > + * It's a helper function for PrepareUpdateNext and > > + * PrepareUpdateUndoActionProgress > > This doesn't really explain much. PrepareUpdateUndoActionProgress > doesnt' exist. I assume it's UndoRecordPrepareApplyProgress from 0012? 
Enhanced the comments. > > > > + * xact_urp - undo record pointer to be updated. > > + * size - number of bytes to be updated. > > + * offset - offset in undo record where to start update. > > + */ > > These comments seem redundant with the parameter names. Fixed. > > > > +static int > > +UndoRecordPrepareTransInfo(UndoRecordInsertContext *context, > > + UndoRecPtr xact_urp, int size, int offset) > > +{ > > + BlockNumber cur_blk; > > + RelFileNode rnode; > > + int starting_byte; > > + int bufidx; > > + int index = 0; > > + int remaining_bytes; > > + XactUndoRecordInfo *xact_info; > > + > > + xact_info = &context->xact_urec_info[context->nxact_urec_info]; > > + > > + UndoRecPtrAssignRelFileNode(rnode, xact_urp); > > + cur_blk = UndoRecPtrGetBlockNum(xact_urp); > > + starting_byte = UndoRecPtrGetPageOffset(xact_urp); > > + > > + /* Remaining bytes on the current block. */ > > + remaining_bytes = BLCKSZ - starting_byte; > > + > > + /* > > + * Is there some byte of the urec_next on the current block, if not then > > + * start from the next block. > > + */ > > This comment needs rephrasing. Done > > > > + /* Loop until we have fetched all the buffers in which we need to write. */ > > + while (size > 0) > > + { > > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL); > > + xact_info->idx_undo_buffers[index++] = bufidx; > > + size -= (BLCKSZ - starting_byte); > > + starting_byte = UndoLogBlockHeaderSize; > > + cur_blk++; > > + } > > So, this locks a very large number of undo buffers at the same time, do > I see that correctly? What guarantees that there are no deadlocks due > to multiple buffers locked at the same time (I guess the order inside > the log)? What guarantees that this is a small enough number that we can > even lock all of them at the same time? I think we are locking them in block order, and that should avoid deadlocks. I have explained this in the comments. > > Why do we need to lock all of them at the same time? That's not clear to > me. Because this is called outside the critical section, we keep locks on all the buffers that we want to update inside the critical section for a single WAL record. > > Also, why do we need code to lock an unbounded number here? It seems > hard to imagine we'd ever want to update more than something around 8 > bytes? Shouldn't that at the most require two buffers? Right, it should lock at most 2 buffers. I have now added an assert for that. Basically, it can lock either 1 or 2 buffers, so I am not sure what the best condition to break the loop is. I guess our target is to write 8 bytes, so the breaking condition must be the number of bytes. I agree that we should never go beyond two buffers, but for that we can add an assert. Do you have another opinion on this? > > > > +/* > > + * Prepare to update the previous transaction's next undo pointer. > > + * > > + * We want to update the next transaction pointer in the previous transaction's > > + * header (first undo record of the transaction). In prepare phase we will > > + * unpack that record and lock the necessary buffers which we are going to > > + * overwrite and store the unpacked undo record in the context. Later, > > + * UndoRecordUpdateTransInfo will overwrite the undo record. > > + * > > + * xact_urp - undo record pointer of the previous transaction's header > > + * urecptr - current transaction's undo record pointer which need to be set in > > + * the previous transaction's header.
> > + */ > > +static void > > +UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context, > > + UndoRecPtr urecptr, UndoRecPtr xact_urp) > > That name imo is confusing - it's not clear that it's not actually about > the next record or such. I agree. I think I will think about what to name it. I am planning to unify 2 function UndoRecordPrepareUpdateNext and PrepareUpdateUndoActionProgress then we can directly name it PrepareUndoRecordUpdate. But for that, I need to get the progress update code in my patch. > > > > +{ > > + UndoLogSlot *slot; > > + int index = 0; > > + int offset; > > + > > + /* > > + * The absence of previous transaction's undo indicate that this backend > > *indicates > Done > > > + /* > > + * Acquire the discard lock before reading the undo record so that discard > > + * worker doesn't remove the record while we are in process of reading it. > > + */ > > *the discard worker Done > > > > + LWLockAcquire(&slot->discard_update_lock, LW_SHARED); > > + /* Check if it is already discarded. */ > > + if (UndoLogIsDiscarded(xact_urp)) > > + { > > + /* Release lock and return. */ > > + LWLockRelease(&slot->discard_update_lock); > > + return; > > + } > > Ho, hum. I don't quite remember what we decided in the discussion about > not having to use the discard lock for this purpose. I think we haven't concluded an alternative solution for this and planned to keep it as is for now. Please correct me if anyone else has a different opinion. > > > > + /* Compute the offset of the uur_next in the undo record. */ > > + offset = SizeOfUndoRecordHeader + > > + offsetof(UndoRecordTransaction, urec_next); > > + > > + index = UndoRecordPrepareTransInfo(context, xact_urp, > > + sizeof(UndoRecPtr), offset); > > + /* > > + * Set the next pointer in xact_urec_info, this will be overwritten in > > + * actual undo record during update phase. > > + */ > > + context->xact_urec_info[index].next = urecptr; > > What does "this will be overwritten mean"? It sounds like "context->xact_urec_info[index].next" > would be overwritten, but that can't be true. > > > > + /* We can now release the discard lock as we have read the undo record. */ > > + LWLockRelease(&slot->discard_update_lock); > > +} > > Hm. Because you expect it to be blocked behind the content lwlocks for > the buffers? Yes, I added comments. > > > > +/* > > + * Overwrite the first undo record of the previous transaction to update its > > + * next pointer. > > + * > > + * This will insert the already prepared record by UndoRecordPrepareTransInfo. > > It doesn't actually appear to insert any records. At least not a record > in the way the rest of the file uses that term? I think this was old comments. Fixed it. > > > > + * This must be called under the critical section. > > s/under the/in a/ I think I missed in my last patch. Will fix in next version. > > Think that should be asserted. Added the assert. > > > > + /* > > + * Start writing directly from the write offset calculated during prepare > > + * phase. And, loop until we write required bytes. > > + */ > > Why do we do offset calculations multiple times? Seems like all the > offsets, and the split, should be computed in exactly one place. Sorry, I did not understand this, we are calculating the offset in the prepare phase. Do you want to point out something else? > > > > +/* > > + * Find the block number in undo buffer array > > + * > > + * If it is present then just return its index otherwise search the buffer and > > + * insert an entry and lock the buffer in exclusive mode. 
> > + * > > + * Undo log insertions are append-only. If the caller is writing new data > > + * that begins exactly at the beginning of a page, then there cannot be any > > + * useful data after that point. In that case RBM_ZERO can be passed in as > > + * rbm so that we can skip a useless read of a disk block. In all other > > + * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't > > + * happen to be already in the buffer pool. > > + */ > > +static int > > +UndoGetBufferSlot(UndoRecordInsertContext *context, > > + RelFileNode rnode, > > + BlockNumber blk, > > + ReadBufferMode rbm) > > +{ > > + int i; > > + Buffer buffer; > > + XLogRedoAction action = BLK_NEEDS_REDO; > > + PreparedUndoBuffer *prepared_buffer; > > + UndoPersistence persistence = context->alloc_context.persistence; > > + > > + /* Don't do anything, if we already have a buffer pinned for the block. */ > > As the code stands, it's locked, not just pinned. Changed > > > > + for (i = 0; i < context->nprepared_undo_buffer; i++) > > + { > > How large do we expect this to get at most? > In BeginUndoRecordInsert we are computing it + /* Compute number of buffers. */ + nbuffers = (nprepared + MAX_UNDO_UPDATE_INFO) * MAX_BUFFER_PER_UNDO; > > > + /* > > + * We did not find the block so allocate the buffer and insert into the > > + * undo buffer array. > > + */ > > + if (InRecovery) > > + action = XLogReadBufferForRedoBlock(context->alloc_context.xlog_record, > > + SMGR_UNDO, > > + rnode, > > + UndoLogForkNum, > > + blk, > > + rbm, > > + false, > > + &buffer); > > Why is not locking the buffer correct here? Can't there be concurrent > reads during hot standby? because XLogReadBufferForRedoBlock is locking it internally. I have added this in coment in new patch. > > > > +/* > > + * This function must be called before all the undo records which are going to > > + * get inserted under a single WAL record. > > How can a function be called "before all the undo records"? "before all the undo records which are getting inserted under single WAL" because it will set the prepare limit and allocate appropriate memory for that. So I am not sure what is your point here? why can't we call it before all the undo record we are inserting? > > > + * nprepared - This defines the max number of undo records that can be > > + * prepared before inserting them. > > + */ > > +void > > +BeginUndoRecordInsert(UndoRecordInsertContext *context, > > + UndoPersistence persistence, > > + int nprepared, > > + XLogReaderState *xlog_record) > > There definitely needs to be explanation about xlog_record. But also > about memory management etc. Looks like one e.g. can't call this from a > short lived memory context. I have added coments for this. > > > > +/* > > + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you > > + * intended to insert. Upon return, the necessary undo buffers are pinned and > > + * locked. > > Again, how is deadlocking / max number of buffers handled, and why do > they all need to be locked at the same time? > > > > + /* > > + * We don't yet know if this record needs a transaction header (ie is the > > + * first undo record for a given transaction in a given undo log), because > > + * you can only find out by allocating. We'll resolve this circularity by > > + * allocating enough space for a transaction header. We'll only advance > > + * by as many bytes as we turn out to need. > > + */ > > Why can we only find this out by allocating? This seems like an API > deficiency of the storage layer to me. 
The information is in the und log > slot's metadata, no? I agree with this. I think if Thomas agree we can provide an API in undo log which can provide us this information before we do the actual allocation. > > > > + urec->uur_next = InvalidUndoRecPtr; > > + UndoRecordSetInfo(urec); > > + urec->uur_info |= UREC_INFO_TRANSACTION; > > + urec->uur_info |= UREC_INFO_LOGSWITCH; > > + size = UndoRecordExpectedSize(urec); > > + > > + /* Allocate space for the record. */ > > + if (InRecovery) > > + { > > + /* > > + * We'll figure out where the space needs to be allocated by > > + * inspecting the xlog_record. > > + */ > > + Assert(context->alloc_context.persistence == UNDO_PERMANENT); > > + urecptr = UndoLogAllocateInRecovery(&context->alloc_context, > > + XidFromFullTransactionId(txid), > > + size, > > + &need_xact_header, > > + &last_xact_start, > > + &prevlog_xact_start, > > + &prevlogurp); > > + } > > + else > > + { > > + /* Allocate space for writing the undo record. */ > > That's basically the same comment as before the if. Removed > > > > + urecptr = UndoLogAllocate(&context->alloc_context, > > + size, > > + &need_xact_header, &last_xact_start, > > + &prevlog_xact_start, &prevlog_insert_urp); > > + > > + /* > > + * If prevlog_xact_start is a valid undo record pointer that means > > + * this transaction's undo records are split across undo logs. > > + */ > > + if (UndoRecPtrIsValid(prevlog_xact_start)) > > + { > > + uint16 prevlen; > > + > > + /* > > + * If undo log is switch during transaction then we must get a > > "is switch" is right. This code is removed now. > > > +/* > > + * Insert a previously-prepared undo records. > > s/a// Fixed > > > More tomorrow. > refer the latest patch at https://www.postgresql.org/message-id/CAFiTN-uf4Bh0FHwec%2BJGbiLq%2Bj00V92W162SLd_JVvwW-jwREg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
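Pulling together the functions discussed in this review, the intended caller sequence appears to be roughly the following. This is a sketch under assumptions: BeginUndoRecordInsert(), PrepareUndoInsert() and InsertPreparedUndo() are the names quoted in this thread, while FinishUndoRecordInsert() is an invented name for the final release step, which the excerpts above do not show.

    UndoRecordInsertContext context;
    UnpackedUndoRecord urec = {0};
    UndoRecPtr  urecptr;

    /* ... fill in urec: uur_rmid, uur_reloid, payload, etc. ... */

    /* 1. Set up the context; we will prepare a single record. */
    BeginUndoRecordInsert(&context, UNDO_PERMANENT, 1, NULL);

    /*
     * 2. Reserve undo space and pin/lock the needed undo buffers,
     *    outside any critical section.
     */
    urecptr = PrepareUndoInsert(&context, &urec, MyDatabaseId);

    /* 3. Write the record and the covering WAL atomically. */
    START_CRIT_SECTION();
    InsertPreparedUndo(&context);
    /* ... XLogInsert() of the WAL record describing this change ... */
    END_CRIT_SECTION();

    /* 4. Release the buffers (hypothetical name for the cleanup step). */
    FinishUndoRecordInsert(&context);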
On Fri, Aug 9, 2019 at 1:57 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Aug 8, 2019 at 9:31 AM Andres Freund <andres@anarazel.de> wrote: > > I know that Robert is working on a patch that revises the undo request > > layer somewhat, it's possible that this is best discussed afterwards. > > Here's what I have at the moment. This is not by any means a complete > replacement for Amit's undo worker machinery, but it is a significant > redesign (and I believe a significant improvement to) the queue > management stuff from Amit's patch. I wrote this pretty quickly, so > while it passes simple testing, it probably has a number of bugs, and > to actually use it, it would need to be integrated with xact.c; right > now it's just a standalone module that doesn't do anything except let > itself be tested. > > Some of the ways it is different from Amit's patches include: > > * It uses RBTree rather than binaryheap, so when we look ahead, we > look ahead in the right order. > > * There's no limit to the lookahead distance; when looking ahead, it > will search the entirety of all 3 RBTrees for an entry from the right > database. > > * It doesn't have a separate hash table keyed by XID. I didn't find > that necessary. > > * It's better-isolated, as you can see from the fact that I've > included a test module that tests this code without actually ever > putting an UndoRequestManager in shared memory. I would've liked to > expand this test module, but I don't have time to do that today and > felt it better to get this much sent out. > > * It has a lot of comments explaining the design and how it's intended > to integrate with the rest of the system. > > Broadly, my vision for how this would get used is: > > - Create an UndoRecordManager in shared memory. > - Before a transaction first attaches to a permanent or unlogged undo > log, xact.c would call RegisterUndoRequest(); thereafter, xact.c would > store a pointer to the UndoRecord for the lifetime of the toplevel > transaction. So, for a top-level transaction's rollback, we can get the start and end locations directly from the UndoRequest *. But what should we do for sub-transactions (rollback to savepoint)? One related point is that we also need information about last_log_start_undo_location to update the undo apply progress (the basic idea is that if the transaction's undo is spread across multiple logs, we update the progress in each of the logs). We can remember that in the transaction state or in the UndoRequest *. Any suggestions? > - Immediately after attaching to a permanent or unlogged undo log, > xact.c would call UndoRequestSetLocation. > - xact.c would track the number of bytes of permanent and unlogged > undo records the transaction generates. If the transaction goes onto > abort, it reports these by calling FinalizeUndoRequest. > - If the transaction commits, it doesn't need that information, but > does need to call UnregisterUndoRequest() as a post-commit step in > CommitTransaction(). > IIUC, for each transaction, we have to take a lock the first time it attaches to a log and then the same lock at commit time. The work done under the lock is small, but still, can't this cause contention? It seems to me this is similar to what we saw in ProcArrayLock, where the work under the lock was a few instructions, but each backend acquiring and releasing the lock at commit time caused a bottleneck.
It might be that for some reason this won't matter in a similar way, in which case we can find out after integrating it with the other patches from the undo processing machinery and rebasing the zheap branch over it. How will the computation of oldestXidHavingUnappliedUndo work? We can probably check the fxid queue and the error queue to get that value. However, I am not sure that is sufficient, because if we perform the request in the foreground, it won't be present in the queues. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
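For readers following along, the UndoRequest life cycle Robert describes maps onto xact.c roughly as below. The function names come from his message; the arguments and surrounding variables are invented for illustration, since the signatures are not shown in this thread.

    /*
     * Sketch only.  Before the xact first attaches to a permanent or
     * unlogged undo log; xact.c keeps req for the lifetime of the
     * toplevel transaction.
     */
    UndoRequest *req = RegisterUndoRequest(undo_request_manager,
                                           MyDatabaseId);     /* assumed args */

    /* Immediately after attaching to an undo log. */
    UndoRequestSetLocation(req, start_urec_ptr);              /* assumed args */

    if (aborted)
    {
        /*
         * Report how much permanent/unlogged undo was written, so the
         * request can be executed now or queued for an undo worker.
         */
        FinalizeUndoRequest(req, permanent_undo_bytes, unlogged_undo_bytes);
    }
    else
    {
        /* Post-commit step in CommitTransaction(). */
        UnregisterUndoRequest(req);
    }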
On Thu, Aug 8, 2019 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > On 2019-08-07 14:50:17 +0530, Amit Kapila wrote: > > On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > > > > > typedef struct TwoPhaseFileHeader > > > > { > > > > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader > > > > uint16 gidlen; /* length of the GID - GID follows the header */ > > > > XLogRecPtr origin_lsn; /* lsn of this record at origin node */ > > > > TimestampTz origin_timestamp; /* time of prepare at origin node */ > > > > + > > > > + /* > > > > + * We need the locations of the start and end undo record pointers when > > > > + * rollbacks are to be performed for prepared transactions using undo-based > > > > + * relations. We need to store this information in the file as the user > > > > + * might rollback the prepared transaction after recovery and for that we > > > > + * need its start and end undo locations. > > > > + */ > > > > + UndoRecPtr start_urec_ptr[UndoLogCategories]; > > > > + UndoRecPtr end_urec_ptr[UndoLogCategories]; > > > > } TwoPhaseFileHeader; > > > > > > Why do we not need that knowledge for undo processing of a non-prepared > > > transaction? > > > The non-prepared transaction also needs to be aware of that. It is > > stored in TransactionStateData. I am not sure if I understand your > > question here. > > My concern is that I think it's fairly ugly to store data like this in > the 2pc state file. And it's not an insubstantial amount of additional > data either, compared to the current size, even when no undo is in > use. There's a difference between an unused feature increasing backend > local memory and increasing the size of WAL logged data. Obviously it's > not by a huge amount, but still. It also just feels wrong to me. > > We don't need the UndoRecPtr's when recovering from a crash/restart to > process undo. Now we obviously don't want to unnecessarily search for > data that is expensive to gather, which is a good reason for keeping > track of this data. But I do wonder if this is the right approach. > > I know that Robert is working on a patch that revises the undo request > layer somewhat, it's possible that this is best discussed afterwards. > Okay, we have started working on integrating with Robert's patch. I think not only this but many of the other things will also change. So, I will respond to other comments after integrating with Robert's patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Aug 5, 2019 at 11:59 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > (as I was out of context due to dealing with bugs, I've switched to > looking at the current zheap/undoprocessing branch.) > > On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > > +/* > > + * Insert a previously-prepared undo records. > > + * > > + * This function will write the actual undo record into the buffers which are > > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This > > + * step should be performed inside a critical section. > > + */ > > Again, I think it's not ok to just assume you can lock an essentially > unbounded number of buffers. This seems almost guaranteed to result in > deadlocks. And there's limits on how many lwlocks one can hold etc. I think for controlling that we need to put a limit on the max prepared undo. I am not sure of any other way of limiting the number of buffers, because we must lock all the buffers in which we are going to insert undo records under one WAL-logged operation. > > As far as I can tell there's simply no deadlock avoidance scheme in use > here *at all*? I must be missing something. We are always locking buffers in block order, so I am not sure how it can deadlock. Am I missing something? > > > > + /* Main loop for writing the undo record. */ > > + do > > + { > > I'd prefer this to not be a do{} while(true) loop - as written I need to > read to the end to see what the condition is. I don't think we have any > loops like that in the code. Right, changed. > > > > + /* > > + * During recovery, there might be some blocks which are already > > + * deleted due to some discard command so we can just skip > > + * inserting into those blocks. > > + */ > > + if (!BufferIsValid(buffer)) > > + { > > + Assert(InRecovery); > > + > > + /* > > + * Skip actual writing just update the context so that we have > > + * write offset for inserting into next blocks. > > + */ > > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte); > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + break; > > + } > > How exactly can this happen? Suppose you insert one record for the transaction, which is split across block 1 and block 2. Now, before these blocks actually go to disk, the transaction commits and becomes all-visible, and the undo logs are discarded. It's possible that block 1 is completely discarded but block 2 is not, because it might hold undo for the next transaction. Now, during recovery (with FPW off), if block 1 is missing but block 2 is there, we need to skip inserting undo into block 1, as it does not exist. > > > > + else > > + { > > + page = BufferGetPage(buffer); > > + > > + /* > > + * Initialize the page whenever we try to write the first > > + * record in page. We start writing immediately after the > > + * block header. > > + */ > > + if (starting_byte == UndoLogBlockHeaderSize) > > + UndoPageInit(page, BLCKSZ, prepared_undo->urec->uur_info, > > + ucontext.already_processed, > > + prepared_undo->urec->uur_tuple.len, > > + prepared_undo->urec->uur_payload.len); > > + > > + /* > > + * Try to insert the record into the current page. If it > > + * doesn't succeed then recall the routine with the next page. > > + */ > > + InsertUndoData(&ucontext, page, starting_byte); > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + { > > + MarkBufferDirty(buffer); > > + break; > > At this point we're five indentation levels deep.
I'd extract at least > either the the per prepared undo code or the code performing the writing > across block boundaries into a separate function. Perhaps both. I have moved it to the separate function. > > > > > +/* > > + * Helper function for UndoGetOneRecord > > + * > > + * If any of rmid/reloid/xid/cid is not available in the undo record, then > > + * it will get the information from the first complete undo record in the > > + * page. > > + */ > > +static void > > +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp, > > + RelFileNode rnode, UndoLogCategory category, Buffer buffer) > > +{ > > + /* > > + * If any of the common header field is not available in the current undo > > + * record then we must read it from the first complete record of the page. > > + */ > > How is it guaranteed that the first record on the page is actually from > the current transaction? Can't there be a situation where that's from > another transaction? If the first record is not from the same transaction then the record must have all those fields in it so it should never try to access the first record. I have updated the comments for the same. > > > > > +/* > > + * Helper function for UndoFetchRecord and UndoBulkFetchRecord > > + * > > + * curbuf - If an input buffer is valid then this function will not release the > > + * pin on that buffer. If the buffer is not valid then it will assign curbuf > > + * with the first buffer of the current undo record and also it will keep the > > + * pin and lock on that buffer in a hope that while traversing the undo chain > > + * the caller might want to read the previous undo record from the same block. > > + */ > > Wait, so at exit *curbuf is pinned but not locked, if passed in, but is > pinned *and* locked when not? That'd not be a sane API. I don't think > the code works like that atm though. Comments were wrong, I have fixed. > > > > +static UnpackedUndoRecord * > > +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode, > > + UndoLogCategory category, Buffer *curbuf) > > +{ > > + Page page; > > + int starting_byte = UndoRecPtrGetPageOffset(urp); > > + BlockNumber cur_blk; > > + UndoPackContext ucontext = {{0}}; > > + Buffer buffer = *curbuf; > > + > > + cur_blk = UndoRecPtrGetBlockNum(urp); > > + > > + /* Initiate unpacking one undo record. */ > > + BeginUnpackUndo(&ucontext); > > + > > + while (true) > > + { > > + /* If we already have a buffer then no need to allocate a new one. */ > > + if (!BufferIsValid(buffer)) > > + { > > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk, > > + RBM_NORMAL, NULL, > > + RelPersistenceForUndoLogCategory(category)); > > + > > + /* > > + * Remember the first buffer where this undo started as next undo > > + * record what we fetch might fall on the same buffer. > > + */ > > + if (!BufferIsValid(*curbuf)) > > + *curbuf = buffer; > > + } > > + > > + /* Acquire shared lock on the buffer before reading undo from it. */ > > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > > + > > + page = BufferGetPage(buffer); > > + > > + UnpackUndoData(&ucontext, page, starting_byte); > > + > > + /* > > + * We are done if we have reached to the done stage otherwise move to > > + * next block and continue reading from there. > > + */ > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > + { > > + if (buffer != *curbuf) > > + UnlockReleaseBuffer(buffer); > > + > > + /* > > + * Get any of the missing fields from the first record of the > > + * page. 
> > + */ > > + GetCommonUndoRecInfo(&ucontext, urp, rnode, category, *curbuf); > > + break; > > + } > > + > > + /* > > + * The record spans more than a page so we would have copied it (see > > + * UnpackUndoRecord). In such cases, we can release the buffer. > > + */ > > Where would it have been copied? Presumably in UnpackUndoData()? Imo the > comment should say so. > > I'm a bit confused by the use of "would" in that comment. Either we > have, or not? This comment is obsolete so removed. > > + if (buffer != *curbuf) > > + UnlockReleaseBuffer(buffer); > > Wait, so we *keep* the buffer locked if it the same as *curbuf? That > can't be right. At the end we are releasing the lock on the *curbuf. But now I have changed it so that it is more readable. > > > > + * Fetch the undo record for given undo record pointer. > > + * > > + * This will internally allocate the memory for the unpacked undo record which > > + * intern will > > "intern" should probably be internally? But I'm not sure what the two > "internally"s really add here. > > > > > +/* > > + * Release the memory of the undo record allocated by UndoFetchRecord and > > + * UndoBulkFetchRecord. > > + */ > > +void > > +UndoRecordRelease(UnpackedUndoRecord *urec) > > +{ > > + /* Release the memory of payload data if we allocated it. */ > > + if (urec->uur_payload.data) > > + pfree(urec->uur_payload.data); > > + > > + /* Release memory of tuple data if we allocated it. */ > > + if (urec->uur_tuple.data) > > + pfree(urec->uur_tuple.data); > > + > > + /* Release memory of the transaction header if we allocated it. */ > > + if (urec->uur_txn) > > + pfree(urec->uur_txn); > > + > > + /* Release memory of the logswitch header if we allocated it. */ > > + if (urec->uur_logswitch) > > + pfree(urec->uur_logswitch); > > + > > + /* Release the memory of the undo record. */ > > + pfree(urec); > > +} > > Those comments before each pfree are not useful. Removed > > > Also, isn't this both fairly slow and fairly failure prone? The next > record is going to need all that memory again, no? It seems to me that > there should be one record that's allocated once, and then reused over > multiple fetches, increasing the size if necesssary. > > I'm very doubtful that all this freeing of individual allocations in the > undo code makes sense. Shouldn't this just be done in short lived memory > contexts, that then get reset as a whole? That's both far less failure > prone, and faster. > > > > + * one_page - Caller is applying undo only for one block not for > > + * complete transaction. If this is set true then instead > > + * of following transaction undo chain using prevlen we will > > + * follow the block prev chain of the block so that we can > > + * avoid reading many unnecessary undo records of the > > + * transaction. > > + */ > > +UndoRecInfo * > > +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr, > > + int undo_apply_size, int *nrecords, bool one_page) > > > There's no caller for one_page mode in the series - I assume that's for > later, during page-wise undo? It seems to behave in quite noticably > different ways, is that really OK? Makes the code quite hard to > understand. > Also, it seems quite poorly named to me. It sounds like it's about > fetching a single undo page (which makes no sense, obviously). But what > it does is to switch to an entirely different way of traversing the undo > chains. one_page was zheap specific so I have removed it. I think in zheap specific function we can implement it by UndoFetchRecord in a loop. 
> > > > > + /* > > + * In one_page mode we are fetching undo only for one page instead of > > + * fetching all the undo of the transaction. Basically, we are fetching > > + * interleaved undo records. So it does not make sense to do any prefetch > > + * in that case. > > What does "interleaved" mean here? I meant that for one page we are following blockprev chain instead of complete transaction undo chain so there is no guarantee that the undo records are together. Basically, the undo records for the different blocks can be interleaved so I am not sure should we prefetch or not. I assume that there will often be > other UNDO records interspersed? But that's not guaranteed at all, > right? In fact, for a lot of workloads it seems likely that there will > be many consecutive undo records for a single page? In fact, won't that > be the majority of cases? Ok, that point makes sense to me but I thought if we always assume this we will do unwanted prefetch where this is not the case and we will put unnecessary load on the I/O. Currently, I have moved that code out of the undo layer so we can take a call while designing zheap specific function. > > Thus it's not obvious to me that there's not often going to be > consecutive pages for this case too. I'd even say that minimizing IO > delay is *MORE* important during page-wise undo, as that happens in the > context of client accesses, and it's not incurring cost on the party > that performed DML, but on some random third party. > > > I'm doubtful this is a sane interface. There's a lot of duplication > between one_page and not one_page. It presupposes specific ways of > constructing chains that are likely to depend on the AM. to_urecptr is > only used in certain situations. E.g. I strongly suspect that for > zheap's visibility determinations we'd want to concurrently follow all > the necessary chains to determine visibility for all all tuples on the > page, far enough to find the visible tuple - for seqscan's / bitmap heap > scans / everything using page mode scans, that'll be way more efficient > than doing this one-by-one and possibly even repeatedly. But what is > exactly the right thing to do is going to be highly AM specific. > > I vaguely suspect what you'd want is an interface where the "bulk fetch" > context basically has a FIFO queue of undo records to fetch, and a > function to actually perform fetching. Whenever a record has been > retrieved, a callback determines whether additional records are needed. > In the case of fetching all the undo for a transaction, you'd just queue > - probably in a more efficient representation - all the necessary > undo. In case of page-wise undo, you'd queue the first record of the > chain you'd want to undo, with a callback for queuing the next > record. For visibility determinations in zheap, you'd queue all the > different necessary chains, with a callback that queues the next > necessary record if still needed for visibility determination. > > And then I suspect you'd have a separate callback whenever records have > been fetched, with all the 'unconsumed' records. That then can, > e.g. based on memory consumption, decide to process them or not. For > visibility information you'd probably just want to condense the records > to the minimum necessary (i.e. visibility information for the relevant > tuples, and the visibile tuple when encountered) as soon as available. I haven't think on this part yet. I will analyze part. > > Obviously that's pretty handwavy. 
> > > > > > Also, if we are fetching undo records from more than one > > + * log, we don't know the boundaries for prefetching. Hence, we can't use > > + * prefetching in this case. > > + */ > > Hm. Why don't we know the boundaries (or cheaply infer them)? I have added comments for that. Basically, when we get the undo records from the different log (from and to pointers are in the different log) we don't know in latest undo log till what point the undo are from this transaction. We may consider prefetching to the start of the current log but there is no guarantee that all the blocks of the current logs are valid and not yet discarded. Ideally, the better fix would be that the caller always pass the from and to pointer from the same undo log. > > > > + /* > > + * If prefetch_pages are half of the prefetch_target then it's time to > > + * prefetch again. > > + */ > > + if (prefetch_pages < prefetch_target / 2) > > + PrefetchUndoPages(rnode, prefetch_target, &prefetch_pages, to_blkno, > > + from_blkno, category); > > Hm. Why aren't we prefetching again as soon as possible? Given the > current code there's not really benefit in fetching many adjacent pages > at once. And this way it seems we're somewhat likely to cause fairly > bursty IO? Hmm right, we can always prefetch as soon as we are behind the prefetch target. Done that way. > > > > + /* > > + * In one_page mode it's possible that the undo of the transaction > > + * might have been applied by worker and undo got discarded. Prevent > > + * discard worker from discarding undo data while we are reading it. > > + * See detail comment in UndoFetchRecord. In normal mode we are > > + * holding transaction undo action lock so it can not be discarded. > > + */ > > I don't really see a comment explaining this in UndoFetchRecord. Are > you referring to InHotStandby? Because there's no comment about one_page > mode as far as I can tell? The comment is clearly referring to that, > rather than InHotStandby? I have removed one_page code. > > > > > + if (one_page) > > + { > > + /* Refer comments in UndoFetchRecord. */ > > Missing "to". > > > > + if (InHotStandby) > > + { > > + if (UndoRecPtrIsDiscarded(urecptr)) > > + break; > > + } > > + else > > + { > > + LWLockAcquire(&slot->discard_lock, LW_SHARED); > > + if (slot->logno != logno || urecptr < slot->oldest_data) > > + { > > + /* > > + * The undo log slot has been recycled because it was > > + * entirely discarded, or the data has been discarded > > + * already. > > + */ > > + LWLockRelease(&slot->discard_lock); > > + break; > > + } > > + } > > I find this deeply unsatisfying. It's repeated in a bunch of > places. There's completely different behaviour between the hot-standby > and !hot-standby case. There's UndoRecPtrIsDiscarded for the HS case, > but we do a different test for !HS. There's no explanation as to why > this is even reachable. I have added comments in UndoFetchRecord. > > > > + /* Read the undo record. */ > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > + > > + /* Release the discard lock after fetching the record. */ > > + if (!InHotStandby) > > + LWLockRelease(&slot->discard_lock); > > + } > > + else > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > > And then we do none of this in !one_page mode. UndoBulkFetchRecord is always called from the aborted transaction so its undo can never get discarded concurrently so ideally, we don't need to check for discard. 
But, during one_page mode, we follow the when it comes from zheap for the one page it is possible that the undo for the transaction are applied from the worker for the complete transaction and its undo logs are discarded. But, I think this is highly am specific so I have removed one_page mode from here. > > > > + /* > > + * As soon as the transaction id is changed we can stop fetching the > > + * undo record. Ideally, to_urecptr should control this but while > > + * reading undo only for a page we don't know what is the end undo > > + * record pointer for the transaction. > > + */ > > + if (one_page) > > + { > > + if (!FullTransactionIdIsValid(fxid)) > > + fxid = uur->uur_fxid; > > + else if (!FullTransactionIdEquals(fxid, uur->uur_fxid)) > > + break; > > + } > > + > > + /* Remember the previous undo record pointer. */ > > + prev_urec_ptr = urecptr; > > + > > + /* > > + * Calculate the previous undo record pointer of the transaction. If > > + * we are reading undo only for a page then follow the blkprev chain > > + * of the page. Otherwise, calculate the previous undo record pointer > > + * using transaction's current undo record pointer and the prevlen. If > > + * undo record has a valid uur_prevurp, this is the case of log switch > > + * during the transaction so we can directly use uur_prevurp as our > > + * previous undo record pointer of the transaction. > > + */ > > + if (one_page) > > + urecptr = uur->uur_prevundo; > > + else if (uur->uur_logswitch) > > + urecptr = uur->uur_logswitch->urec_prevurp; > > + else if (prev_urec_ptr == to_urecptr || > > + uur->uur_info & UREC_INFO_TRANSACTION) > > + urecptr = InvalidUndoRecPtr; > > + else > > + urecptr = UndoGetPrevUndoRecptr(prev_urec_ptr, buffer, category); > > + > > FWIW, this is one of those concerns I was referring to above. What > exactly needs to happen seems highly AM specific. 1. one_page check is gone 2. uur->uur_info & UREC_INFO_TRANSACTION is also related to one_page so removed this too. 3. else if (uur->uur_logswitch) -> I think this is also related to the incapability of the caller that it can not identify the log switch but expect the bulk fetch to detect it and break fetching so that we can update the progress in the transaction header of the current log. I think we can solve these issue by callback as well as you suggested above. > > > > +/* > > + * Read length of the previous undo record. > > + * > > + * This function will take an undo record pointer as an input and read the > > + * length of the previous undo record which is stored at the end of the previous > > + * undo record. If the undo record is split then this will add the undo block > > + * header size in the total length. > > + */ > > This should add some note as to when it's expected to be necessary. I > was kind of concerned that this can be necessary, but it's only needed > during log switches, which disarms that concern. I think this is a normal case because the undo_len store the actual length of the record. 
But, if the undo record split across 2 pages and if we are at the end of the undo record (start of the next record) then for computing the absolute start offset of the previous undo record we need the exact distance between these two records and that will be current_offset - (the actual length of the previous record + Undo record header if the previous log is split across 2 pages) > > > > +static uint16 > > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > > + UndoLogCategory category) > > +{ > > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp); > > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp); > > + Buffer buffer = input_buffer; > > + Page page = NULL; > > + char *pagedata = NULL; > > + char prevlen[2]; > > + RelFileNode rnode; > > + int byte_to_read = sizeof(uint16); > > Shouldn't it be byte_to_read? And the sizeof a type that's tied with the > actual undo format? Imagine we'd ever want to change the length format > for undo records - this would be hard to find. I did not get this comments. Do you mean that we should not rely on undo format i.e. we should not assume that undo length is stored at the end of the undo record? > > > > + char persistence; > > + uint16 prev_rec_len = 0; > > + > > + /* Get relfilenode. */ > > + UndoRecPtrAssignRelFileNode(rnode, urp); > > + persistence = RelPersistenceForUndoLogCategory(category); > > + > > + if (BufferIsValid(buffer)) > > + { > > + page = BufferGetPage(buffer); > > + pagedata = (char *) page; > > + } > > + > > + /* > > + * Length if the previous undo record is store at the end of that record > > + * so just fetch last 2 bytes. > > + */ > > + while (byte_to_read > 0) > > + { > > Why does this need a loop around the number of bytes? Can there ever be > a case where this is split across a record? If so, isn't that a bad idea > anyway? Yes, as of now, undo record can be splitted at any point even the undo length can be split acorss 2 pages. I think we can reduce complexity by making sure undo length doesn't get split acorss pages. But for handling that while allocating the undo we need to detect this whether the undo length can get splitted by checking the space in the current page and the undo record length and based on that we need to allocate 1 extra byte in the undo log. Seems that will add an extra complexity. > > > > + /* Read buffer if the current buffer is not valid. */ > > + if (!BufferIsValid(buffer)) > > + { > > + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, > > + cur_blk, RBM_NORMAL, NULL, > > + persistence); > > + > > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > > + > > + page = BufferGetPage(buffer); > > + pagedata = (char *) page; > > + } > > + > > + page_offset -= 1; > > + > > + /* > > + * Read current prevlen byte from current block if page_offset hasn't > > + * reach to undo block header. Otherwise, go to the previous block > > + * and continue reading from there. > > + */ > > + if (page_offset >= UndoLogBlockHeaderSize) > > + { > > + prevlen[byte_to_read - 1] = pagedata[page_offset]; > > + byte_to_read -= 1; > > + } > > + else > > + { > > + /* > > + * Release the current buffer if it is not provide by the caller. > > + */ > > + if (input_buffer != buffer) > > + UnlockReleaseBuffer(buffer); > > + > > + /* > > + * Could not read complete prevlen from the current block so go to > > + * the previous block and start reading from end of the block. 
> > + */ > > + cur_blk -= 1; > > + page_offset = BLCKSZ; > > + > > + /* > > + * Reset buffer so that we can read it again for the previous > > + * block. > > + */ > > + buffer = InvalidBuffer; > > + } > > + } > > I can't help but think that this shouldn't be yet another copy of logic > for how to read undo pages. I haven't yet thought but I will try to unify this with ReadUndoBytes. Actually, I didn't do that already because ReadUndoByte needs a start pointer where we need to read the given number of bytes but here we have an end pointer. May be by this logic we can compute the start pointer but that will look equally complex. I will work on this and try to figure out something. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-05 11:29:34 -0700, Andres Freund wrote: > > Need to do something else for a bit. More later. > > > > + /* > > + * Compute the header size of the undo record. > > + */ > > +Size > > +UndoRecordHeaderSize(uint16 uur_info) > > +{ > > + Size size; > > + > > + /* Add fixed header size. */ > > + size = SizeOfUndoRecordHeader; > > + > > + /* Add size of transaction header if it presets. */ > > + if ((uur_info & UREC_INFO_TRANSACTION) != 0) > > + size += SizeOfUndoRecordTransaction; > > + > > + /* Add size of rmid if it presets. */ > > + if ((uur_info & UREC_INFO_RMID) != 0) > > + size += sizeof(RmgrId); > > + > > + /* Add size of reloid if it presets. */ > > + if ((uur_info & UREC_INFO_RELOID) != 0) > > + size += sizeof(Oid); > > + > > + /* Add size of fxid if it presets. */ > > + if ((uur_info & UREC_INFO_XID) != 0) > > + size += sizeof(FullTransactionId); > > + > > + /* Add size of cid if it presets. */ > > + if ((uur_info & UREC_INFO_CID) != 0) > > + size += sizeof(CommandId); > > + > > + /* Add size of forknum if it presets. */ > > + if ((uur_info & UREC_INFO_FORK) != 0) > > + size += sizeof(ForkNumber); > > + > > + /* Add size of prevundo if it presets. */ > > + if ((uur_info & UREC_INFO_PREVUNDO) != 0) > > + size += sizeof(UndoRecPtr); > > + > > + /* Add size of the block header if it presets. */ > > + if ((uur_info & UREC_INFO_BLOCK) != 0) > > + size += SizeOfUndoRecordBlock; > > + > > + /* Add size of the log switch header if it presets. */ > > + if ((uur_info & UREC_INFO_LOGSWITCH) != 0) > > + size += SizeOfUndoRecordLogSwitch; > > + > > + /* Add size of the payload header if it presets. */ > > + if ((uur_info & UREC_INFO_PAYLOAD) != 0) > > + size += SizeOfUndoRecordPayload; > > There's numerous blocks with one if for each type, and the body copied > basically the same for each alternative. That doesn't seem like a > reasonable approach to me. Means that many places need to be adjusted > when we invariably add another type, and seems likely to lead to bugs > over time. I think I have expressed my thought on this in another email [https://www.postgresql.org/message-id/CAFiTN-vDrXuL6tHK1f_V9PAXp2%2BEFRpPtxCG_DRx08PZXAPkyw%40mail.gmail.com] > > > + /* Add size of the payload header if it presets. */ > > FWIW, repeating the same comment, with or without minor differences, 10 > times is a bad idea. Especially when the comment doesn't add *any* sort > of information. Ok, fixed > > Also, "if it presets" presumably is a typo? Fixed > > > > +/* > > + * Compute and return the expected size of an undo record. > > + */ > > +Size > > +UndoRecordExpectedSize(UnpackedUndoRecord *uur) > > +{ > > + Size size; > > + > > + /* Header size. */ > > + size = UndoRecordHeaderSize(uur->uur_info); > > + > > + /* Payload data size. */ > > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + size += uur->uur_payload.len; > > + size += uur->uur_tuple.len; > > + } > > + > > + /* Add undo record length size. */ > > + size += sizeof(uint16); > > + > > + return size; > > +} > > + > > +/* > > + * Calculate the size of the undo record stored on the page. > > + */ > > +static inline Size > > +UndoRecordSizeOnPage(char *page_ptr) > > +{ > > + uint16 uur_info = ((UndoRecordHeader *) page_ptr)->urec_info; > > + Size size; > > + > > + /* Header size. */ > > + size = UndoRecordHeaderSize(uur_info); > > + > > + /* Payload data size. 
*/ > > + if ((uur_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr + size); > > + > > + size += payload->urec_payload_len; > > + size += payload->urec_tuple_len; > > + } > > + > > + return size; > > +} > > + > > +/* > > + * Compute size of the Unpacked undo record in memory > > + */ > > +Size > > +UnpackedUndoRecordSize(UnpackedUndoRecord *uur) > > +{ > > + Size size; > > + > > + size = sizeof(UnpackedUndoRecord); > > + > > + /* Add payload size if record contains payload data. */ > > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + size += uur->uur_payload.len; > > + size += uur->uur_tuple.len; > > + } > > + > > + return size; > > +} > > These functions are all basically the same. We shouldn't copy code over > and over like this. UnpackedUndoRecordSize -> computes the size of the unpacked undo record so it's different from above two, just payload part is common so moved payload size to common function. UndoRecordExpectedSize and UndoRecordSizeOnPage are two different functions except for the header size computation so I already had the common function for the header. UndoRecordExpectedSize computes the expected record size so it can access the payload directly from the unpack undo record whereas the UndoRecordSizeOnPage needs to calculate the record size by the record pointer which is already stored on the page so actually it doesn't have the unpacked undo record instead it first need to compute the header size and then it needs to reach to the payload data. Typecast that to payload header and compute the length. In unpack undo record payload is stored as StringInfoData whereas on the page it is packed as UndoRecordPayload header. So I am not sure how to unify them. Anyway, UndoRecordSizeOnPage is required only for undo page consistency checker patch so I have moved out of this patch. Later, I am planning to handle the comments of the undo page consistency checker patch so I will try to work on this function if I can improve it. > > > > +/* > > + * Initiate inserting an undo record. > > + * > > + * This function will initialize the context for inserting and undo record > > + * which will be inserted by calling InsertUndoData. > > + */ > > +void > > +BeginInsertUndo(UndoPackContext *ucontext, UnpackedUndoRecord *uur) > > +{ > > + ucontext->stage = UNDO_PACK_STAGE_HEADER; > > + ucontext->already_processed = 0; > > + ucontext->partial_bytes = 0; > > + > > + /* Copy undo record header. */ > > + ucontext->urec_hd.urec_type = uur->uur_type; > > + ucontext->urec_hd.urec_info = uur->uur_info; > > + > > + /* Copy undo record transaction header if it is present. */ > > + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0) > > + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction); > > + > > + /* Copy rmid if present. */ > > + if ((uur->uur_info & UREC_INFO_RMID) != 0) > > + ucontext->urec_rmid = uur->uur_rmid; > > + > > + /* Copy reloid if present. */ > > + if ((uur->uur_info & UREC_INFO_RELOID) != 0) > > + ucontext->urec_reloid = uur->uur_reloid; > > + > > + /* Copy fxid if present. */ > > + if ((uur->uur_info & UREC_INFO_XID) != 0) > > + ucontext->urec_fxid = uur->uur_fxid; > > + > > + /* Copy cid if present. */ > > + if ((uur->uur_info & UREC_INFO_CID) != 0) > > + ucontext->urec_cid = uur->uur_cid; > > + > > + /* Copy undo record relation header if it is present. 
*/ > > + if ((uur->uur_info & UREC_INFO_FORK) != 0) > > + ucontext->urec_fork = uur->uur_fork; > > + > > + /* Copy prev undo record pointer if it is present. */ > > + if ((uur->uur_info & UREC_INFO_PREVUNDO) != 0) > > + ucontext->urec_prevundo = uur->uur_prevundo; > > + > > + /* Copy undo record block header if it is present. */ > > + if ((uur->uur_info & UREC_INFO_BLOCK) != 0) > > + { > > + ucontext->urec_blk.urec_block = uur->uur_block; > > + ucontext->urec_blk.urec_offset = uur->uur_offset; > > + } > > + > > + /* Copy undo record log switch header if it is present. */ > > + if ((uur->uur_info & UREC_INFO_LOGSWITCH) != 0) > > + memcpy(&ucontext->urec_logswitch, uur->uur_logswitch, > > + SizeOfUndoRecordLogSwitch); > > + > > + /* Copy undo record payload header and data if it is present. */ > > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + ucontext->urec_payload.urec_payload_len = uur->uur_payload.len; > > + ucontext->urec_payload.urec_tuple_len = uur->uur_tuple.len; > > + ucontext->urec_payloaddata = uur->uur_payload.data; > > + ucontext->urec_tupledata = uur->uur_tuple.data; > > + } > > + else > > + { > > + ucontext->urec_payload.urec_payload_len = 0; > > + ucontext->urec_payload.urec_tuple_len = 0; > > + } > > + > > + /* Compute undo record expected size and store in the context. */ > > + ucontext->undo_len = UndoRecordExpectedSize(uur); > > +} > > It really can't be right to have all these fields basically twice, in > UnackedUndoRecord, and UndoPackContext. And then copy them one-by-one. > I mean there's really just some random differences (ordering, some field > names) between the structures, but otherwise they're the same? > > What on earth do we gain by this? This entire intermediate stage makes > no sense at all to me. We copy data into an UndoRecord, then we copy > into an UndoRecordContext, with essentially a field-by-field copy > logic. Then we have another field-by-field logic that copies the data > into the page. The idea was that in UnpackedUndoRecord we have all member as a field by field but in context, we can keep them in headers for example UndoRecordHeader, UndoRecordGroup, UndoRecordBlock. And, the idea behind this is that during InsertUndoData instead of calling InsertUndoByte field by field we call it once for each header because either we have to write all field of that header or none. But later we end up having a lot of optional headers and most of them have just one field in it so it appears that we are copying field by field. One alternative could be that we palloc a memory in context and then pack each field in that memory (except the payload and tuple data) then in one InsertUndoByte call we can insert complete header part and in we can have 2 more calls to InsertUndoBytes for writing payload and tuple data. What's your thought on this. > > > > > > +/* > > + * Insert the undo record into the input page from the unpack undo context. > > + * > > + * Caller can call this function multiple times until desired stage is reached. > > + * This will write the undo record into the page. > > + */ > > +void > > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > > +{ > > + char *writeptr = (char *) page + starting_byte; > > + char *endptr = (char *) page + BLCKSZ; > > + > > + switch (ucontext->stage) > > + { > > + case UNDO_PACK_STAGE_HEADER: > > + /* Insert undo record header. 
*/ > > + if (!InsertUndoBytes((char *) &ucontext->urec_hd, > > + SizeOfUndoRecordHeader, &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_TRANSACTION: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_TRANSACTION) != 0) > > + { > > + /* Insert undo record transaction header. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_txn, > > + SizeOfUndoRecordTransaction, > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_RMID; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_RMID: > > + /* Write rmid(if needed and not already done). */ > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RMID) != 0) > > + { > > + if (!InsertUndoBytes((char *) &(ucontext->urec_rmid), sizeof(RmgrId), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_RELOID; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_RELOID: > > + /* Write reloid(if needed and not already done). */ > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RELOID) != 0) > > + { > > + if (!InsertUndoBytes((char *) &(ucontext->urec_reloid), sizeof(Oid), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_XID; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_XID: > > + /* Write xid(if needed and not already done). */ > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_XID) != 0) > > + { > > + if (!InsertUndoBytes((char *) &(ucontext->urec_fxid), sizeof(FullTransactionId), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_CID; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_CID: > > + /* Write cid(if needed and not already done). */ > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_CID) != 0) > > + { > > + if (!InsertUndoBytes((char *) &(ucontext->urec_cid), sizeof(CommandId), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_FORKNUM; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_FORKNUM: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_FORK) != 0) > > + { > > + /* Insert undo record fork number. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_fork, > > + sizeof(ForkNumber), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_PREVUNDO; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_PREVUNDO: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PREVUNDO) != 0) > > + { > > + /* Insert undo record blkprev. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_prevundo, > > + sizeof(UndoRecPtr), > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_BLOCK; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_BLOCK: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_BLOCK) != 0) > > + { > > + /* Insert undo record block header. 
*/ > > + if (!InsertUndoBytes((char *) &ucontext->urec_blk, > > + SizeOfUndoRecordBlock, > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_LOGSWITCH; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_LOGSWITCH: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_LOGSWITCH) != 0) > > + { > > + /* Insert undo record transaction header. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_logswitch, > > + SizeOfUndoRecordLogSwitch, > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_PAYLOAD: > > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PAYLOAD) != 0) > > + { > > + /* Insert undo record payload header. */ > > + if (!InsertUndoBytes((char *) &ucontext->urec_payload, > > + SizeOfUndoRecordPayload, > > + &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD_DATA; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_PAYLOAD_DATA: > > + { > > + int len = ucontext->urec_payload.urec_payload_len; > > + > > + if (len > 0) > > + { > > + /* Insert payload data. */ > > + if (!InsertUndoBytes((char *) ucontext->urec_payloaddata, > > + len, &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_TUPLE_DATA; > > + } > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_TUPLE_DATA: > > + { > > + int len = ucontext->urec_payload.urec_tuple_len; > > + > > + if (len > 0) > > + { > > + /* Insert tuple data. */ > > + if (!InsertUndoBytes((char *) ucontext->urec_tupledata, > > + len, &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + } > > + ucontext->stage = UNDO_PACK_STAGE_UNDO_LENGTH; > > + } > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_UNDO_LENGTH: > > + /* Insert undo length. */ > > + if (!InsertUndoBytes((char *) &ucontext->undo_len, > > + sizeof(uint16), &writeptr, endptr, > > + &ucontext->already_processed, > > + &ucontext->partial_bytes)) > > + return; > > + > > + ucontext->stage = UNDO_PACK_STAGE_DONE; > > + /* fall through */ > > + > > + case UNDO_PACK_STAGE_DONE: > > + /* Nothing to be done. */ > > + break; > > + > > + default: > > + Assert(0); /* Invalid stage */ > > + } > > +} > > I don't understand. The only purpose of this is that we can partially > write a packed-but-not-actually-packed record onto a bunch of pages? And > for that we have an endless chain of copy and pasted code calling > InsertUndoBytes()? Copying data into shared buffers in tiny increments? > > If we need to this, what is the whole packed record format good for? > Except for adding a bunch of functions with 10++ ifs and nearly > identical code? > > Copying data is expensive. Copying data in tiny increments is more > expensive. Copying data in tiny increments, with a bunch of branches, is > even more expensive. Copying data in tiny increments, with a bunch of > branches, is even more expensive, especially when it's shared > memory. Copying data in tiny increments, with a bunch of branches, is > even more expensive, especially when it's shared memory, especially when > all that shared meory is locked at once. 
> > > > +/* > > + * Read the undo record from the input page to the unpack undo context. > > + * > > + * Caller can call this function multiple times until desired stage is reached. > > + * This will read the undo record from the page and store the data into unpack > > + * undo context, which can be later copied to unpacked undo record by calling > > + * FinishUnpackUndo. > > + */ > > +void > > +UnpackUndoData(UndoPackContext *ucontext, Page page, int starting_byte) > > +{ > > + char *readptr = (char *) page + starting_byte; > > + char *endptr = (char *) page + BLCKSZ; > > + > > + switch (ucontext->stage) > > + { > > + case UNDO_PACK_STAGE_HEADER: > > You know roughly what I'm thinking. I have expressed my thought on this in last comment. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 13, 2019 at 6:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > So, for top-level transactions rollback, we can directly refer from > UndoRequest *, the start and end locations. But, what should we do > for sub-transactions (rollback to savepoint)? One related point is > that we also need information about last_log_start_undo_location to > update the undo apply progress (The basic idea is if the transactions > undo is spanned across multiple logs, we update the progress in each > of the logs.). We can remember that in the transaction state or > undorequest *. Any suggestion? The UndoRequest is only for top-level rollback. Any state that you need in order to do subtransaction rollback needs to be maintained someplace else, probably in the transaction state state, or some subsidiary data structure. The point here is that the UndoRequest is going to be stored in shared memory, but there is no reason ever to store the information about a subtransaction in shared memory, because that undo always has to be completed by the backend that is responsible for that transaction. Those things should not get mixed together. > IIUC, for each transaction, we have to take a lock first time it > attaches to a log and then the same lock at commit time. It seems the > work under lock is less, but still, can't this cause a contention? It > seems to me this is similar to what we saw in ProcArrayLock where work > under lock was few instructions, but acquiring and releasing the lock > by each backend at commit time was causing a bottleneck. LWLocks are pretty fast these days and the critical section is pretty short, so I think there's a chance it'll be just fine, but maybe it'll cause enough cache line bouncing to be problematic. If so, I think there are several possible ways to redesign the locking to improve things, but it made sense to me to try the simple approach first. > How will computation of oldestXidHavingUnappliedUndo will work? > > We can probably check the fxid queue and error queue to get that > value. However, I am not sure if that is sufficient because incase we > perform the request in the foreground, it won't be present in queues. Oh, I forgot about that requirement. I think I can fix it so it does that fairly easily, but it will require a little bit of redesign which I won't have time to do this week. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-08-06 14:18:42 -0700, Andres Freund wrote: > Here's the last section of my low-leve review. Plan to write a higher > level summary afterwards, now that I have a better picture of the code. General comments: - For a feature of this complexity, there's very little architectural documentation. Some individual parts have a bit, but there's basically nothing over-arching. That makes it extremely hard for anybody that is not already involved to understand the design constraints, and even for people involved it's hard to understand. I think it's very important for this to have a document that first explains what the goals, and non-goals, of this feature are. And then secondly explains the chosen architecture referencing those constraints. Once that's there it's a lot easier to review this patchset, to discuss the overall architecture, etc. - There are too many different email threads and git branches. The same patches are discussed in different threads, layers exist in slightly diverging versions in different git trees. Again making it very hard for anybody not primarily focussing on undo to join the discussion. I think most of the older git branches should be renamed into something indicating their historic status. The remaining branches should be referenced from a wiki page (linked to in each submission of a new patch version), explaining what they're about. I don't think it's realistic to have people infer meaning from the current branch names (undo, proposal-undo-log, undo-log-storage, undo-log-storage-v2, undo_interface_v1, undoprocessing). Given the size of the the overall project it's quite possibly not realistic to manage the whole work in a single git branch. With separate git branches, as currently done, it's however hard to understand which version of a what layer is used. I think at the very least higher layers need to indicate the version of the underlying layers is being used. I suggest just adding a "v01: " style prefix to all commit subjects a branch is rebased onto. It's also currently hard to understand what version of a layer is being discussed. I think submissions all need to include a version number (c.f. git format-patch's -v option), and that version ought to be included in the subject line. Each new major version of a patch should be started as a reply to the first message of a thread, to keep the structure of a discussion in a managable shape. New versions should include explicit explanations about the major changes compared to the last version. - The code quality of pretty significant parts of the patchset is not even close to being good enough. There are areas with a lot of code duplication. There are very few higher level explanations for interfaces. There's a lot of "i++; /* increment i to increment it */" style comments, but not enough higher level comments. There are significant issues with parts of the code that aren't noted anywhere in comments, leading to reviewers having to repeatedly re-discover them (and wasting time on that). There's different naming styles in related code without a discernible pattern (e.g. UndoRecordSetInfo being followed by get_undo_rec_cid_offset). The word-ordering of different layers is confusing (e.g. BeginUndoRecordInsert vs UndoLogBeginInsert vs PrepareUndoInsert). Different important externally visible functions have names that don't allow to determine which is supposed to do what (PrepareUndoInsert vs BeginUndoRecordInsert). 
More specific comments: - The whole sequencing of undo record insertion in combination with WAL logging does not appear to be right. It's a bit hard to say, because there's very little documentation on what the intended default sequence of operations is. My understanding is that the currently expected pattern is to: 1) Collect information / perform work needed to perform the action that needs to be UNDO logged. E.g. perform visibility determinations, wait for lockers, compute new infomask, etc. This will likely end with the "content" page(s) (e.g. a table's page) being exclusively locked. 2) Estimate space needed for all UNDO logging (BeginUndoRecordInsert) 3) Prepare for each undo record, this includes building the content for each undo record. PrepareUndoInsert(). This acquires, pins and locks buffers for undo. 4) begin a critical section 5) modify content page, mark buffer dirty 6) write UNDO, using InsertPreparedUndo() 7) associate undo with WAL record (RegisterUndoLogBuffers) 8) write WAL 9) End critical section But despite reading through the code, including READMEs, I'm doubtful that's quite the intended pattern. It REALLY can't be right that one needs to parse many function headers to figure out how the most basic use of undo could possibly work. There needs to be very clear documentation about how to write undo records. Functions that sound like they're involved, need actually useful comments, rather than just restatements of their function names (cf RegisterUndoLogBuffers, UndoLogBuffersSetLSN, UndoLogRegister). - I think there's two fairly fundamental, and related, problems with the sequence outlined above: - We can't search for target buffers to store undo data, while holding the "table" page content locked. That can involve writing out multiple pages till we find a usable victim buffer. That can take a pretty long time. While that's happening the content page would currently be locked. Note how e.g. heapam.c is careful to not hold *any* content locks while potentially performing IO. I think the current interface makes that hard. The easy way to solve would be to require sites performing UNDO logging to acquire victim pages before even acquiring other content locks. Perhaps the better approach could be for the undo layer to hold onto a number of clean buffers, and to keep the last page in an already written to undo log pinned. - We can't search for victim buffers for further undo data while already holding other undo pages content locked. Doing so means that while we're e.g. doing IO to clean out the new page, old undo data on the previous page can't be read. This seems easier to fix. Instead of PrepareUndoInsert() acquiring, pinning and locking buffers, it'd need to be split into two operations. One that acquires buffers and pins them, and one that locks them. I think it's quite possible that the locking operation could just be delayed until InsertPreparedUndo(). But if we solve the above problem, most of this might already be solved. - To me the current split between the packed and unpacked UNDO record formats makes very little sense, the reasoning behind having them is poorly if at all documented, results in extremely verbose code, and isn't extensible. When preparing to insert an undo record the in-buffer size is computed with UndoRecordHeaderSize() (needs to know about all optional data) from within PrepareUndoInsert() (which has a bunch a bunch of additional knowledge about the record format). 
Then during insertion InsertPreparedUndo(), first copies the UnpackedUndoRecord into an UndoPackContext (again needing ...), and then, via InsertUndoData(), copies that in small increments into the respective buffers (again needing knowledge about the complete record format, two copies even). Beside the code duplication, that also means the memory copies are very inefficient, because they're all done in tiny increments, multiple times. When reading undo it's smilar: UnpackUndoData(), again in small chunks, reads the buffer data into an UndoPackContext (another full copy of the unpacked record format). But then FinishUnpackUndo() *again* copies all that data, into an actual UnpackedUndoRecord (again, with a copy of the record format, albeit slightly different looking). I'm not convinced by Heikki's argument that we shouldn't have structure within undo records. In my opinion that is a significant weakness of how WAL was initially designed, and even after Heikki's work, still is a problem. But this isn't the right design either. Even if were to stay with the current fixed record format, I think the current code needs a substantial redesign: - I think 'packing' during insertion needs to serialize into a char* allocation during PrepareUndoInsert computing the size in parallel (or perhaps in InsertPreparedUndo, but probably not). The size of the record should never be split across record boundaries (i.e. we'll leave that space unused if we otherwise would need to split the size). The actual insertion would be a maximally sized memcpy() (so we'd as many memcpys as the buffer fits in, rather than one for each sub-type of a record). That allows to remove most of the duplicated knowledge of the record format, and makes insertions faster (by doing only large memcpys while holding exclusive content locks). - When reading an undo record, the whole stage of UnpackUndoData() reading data into a the UndoPackContext is omitted, reading directly into the UnpackedUndoRecord. That removes one further copy of the record format. - To avoid having separate copies of the record format logic, I'd probably encode it into *one* array of metadata. If we had {offsetoff(UnpackedUndoRecord, member), membersize(UnpackedUndoRecord, member), flag} we could fairly trivially remove most knowledge from the places currently knowing about the record format. I have some vague ideas for how to specify the format in a way that is more extensible, but with more structure than just a blob of data. But I don't think they're there yet. - The interface to read undo also doesn't seem right to me. For one there's various different ways to read records, with associated code duplication (batch, batch in "one page" mode - but that's being removed now I think, single record mode). I think the batch mode is too restrictive. We might not need this during the first merged version, but I think before long we're going to want to be able to efficiently traverse all the undo chains we need to determine the visibility of all tuples on a page. Otherwise we'll cause a lot of additional synchronous read IO, and will repeatedly re-fetch information, especially during sequential scans for an older snapshot. I think I briefly outlined this in an earlier email - my current though is that the batch interface (which the non-batch interface should just be a tiny helper around), should basically be a queue of "to-be-fetched" undo records. When batching reading an entire transaction, all blocks get put onto that queue. 
When traversing multiple chains, the chains are processed in a breadth-first fashion (by just looking at the queue, and pushing additional work to the end). That allows to efficiently issue prefetch requests for blocks to be read in the near future. I think that batch reading should just copy the underlying data into a char* buffer. Only the records that currently are being used by higher layers should get exploded into an unpacked record. That will reduce memory usage quite noticably (and I suspect it also drastically reduce the overhead due to a large context with a lot of small allocations that then get individually freed). That will make the sorting of undo a bit more CPU inefficient, because individual records will need to be partially unpacked for comparison, but I think that's going to be a far smaller loss than the win. - My reading of the current xact.c integration is that it's not workable as is. Undo is executed outside of a valid transaction state, exceptions aren't properly undone, logic would need to be duplicated to a significant degree, new kind of critical section. Greetings, Andres Freund
On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > - I think there's two fairly fundamental, and related, problems with > the sequence outlined above: > > - We can't search for target buffers to store undo data, while holding > the "table" page content locked. > > The easy way to solve would be to require sites performing UNDO > logging to acquire victim pages before even acquiring other content > locks. Perhaps the better approach could be for the undo layer to > hold onto a number of clean buffers, and to keep the last page in an > already written to undo log pinned. > > - We can't search for victim buffers for further undo data while > already holding other undo pages content locked. Doing so means that > while we're e.g. doing IO to clean out the new page, old undo data > on the previous page can't be read. > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring, > pinning and locking buffers, it'd need to be split into two > operations. One that acquires buffers and pins them, and one that > locks them. I think it's quite possible that the locking operation > could just be delayed until InsertPreparedUndo(). But if we solve > the above problem, most of this might already be solved. Basically, that means - the caller should call PreparedUndoInsert before acquiring table page content lock right? because the PreparedUndoInsert just compute the size, allocate the space and pin+lock the buffers and for pinning the buffers we must compute the size and allocate the space using undo storage layer. - So basically, if we delay the lock till InsertPreparedUndo and call PrepareUndoInsert before acquiring table page content lock this problem is solved? Although I haven't yet analyzed the AM specific part that whether it's always possible to call the PrepareUndoInsert(basically getting all the undo record ready) before the page content lock. But, I am sure that won't be much difficult part. > - To me the current split between the packed and unpacked UNDO record > formats makes very little sense, the reasoning behind having them is > poorly if at all documented, results in extremely verbose code, and > isn't extensible. > > When preparing to insert an undo record the in-buffer size is computed > with UndoRecordHeaderSize() (needs to know about all optional data) > from within PrepareUndoInsert() (which has a bunch a bunch of > additional knowledge about the record format). Then during insertion > InsertPreparedUndo(), first copies the UnpackedUndoRecord into an > UndoPackContext (again needing ...), and then, via InsertUndoData(), > copies that in small increments into the respective buffers (again > needing knowledge about the complete record format, two copies > even). Beside the code duplication, that also means the memory copies > are very inefficient, because they're all done in tiny increments, > multiple times. > > When reading undo it's smilar: UnpackUndoData(), again in small > chunks, reads the buffer data into an UndoPackContext (another full > copy of the unpacked record format). But then FinishUnpackUndo() > *again* copies all that data, into an actual UnpackedUndoRecord > (again, with a copy of the record format, albeit slightly different > looking). > > I'm not convinced by Heikki's argument that we shouldn't have > structure within undo records. In my opinion that is a significant > weakness of how WAL was initially designed, and even after Heikki's > work, still is a problem. But this isn't the right design either. 
> > Even if were to stay with the current fixed record format, I think > the current code needs a substantial redesign: > > - I think 'packing' during insertion needs to serialize into a char* > allocation during PrepareUndoInsert ok computing the size in parallel > (or perhaps in InsertPreparedUndo, but probably not). The size of > the record should never be split across record boundaries > (i.e. we'll leave that space unused if we otherwise would need to > split the size). I think before UndoRecordAllocate we need to detect this part that whether the size of the record will start from the last byte of the page and if so then allocate one extra byte for the undo record. Or always allocate one extra byte for the undo record for handling this case. And, in FinalizeUndoAdvance only pass the size how much we have actually consumed. The actual insertion would be a maximally sized > memcpy() (so we'd as many memcpys as the buffer fits in, rather than > one for each sub-type of a record). > > That allows to remove most of the duplicated knowledge of the record > format, and makes insertions faster (by doing only large memcpys > while holding exclusive content locks). Right. > > - When reading an undo record, the whole stage of UnpackUndoData() > reading data into a the UndoPackContext is omitted, reading directly > into the UnpackedUndoRecord. That removes one further copy of the > record format. So we will read member by member to UnpackedUndoRecord? because in context we have at least a few headers packed and we can memcpy one header at a time like UndoRecordHeader, UndoRecordBlock. But that just a few of them so if we copy field by field in the UnpackedUndoRecord then we can get rid of copying in context then copy it back to the UnpackedUndoRecord. Is this is what in your mind or you want to store these structures (UndoRecordHeader, UndoRecordBlock) directly into UnpackedUndoRecord? > > - To avoid having separate copies of the record format logic, I'd > probably encode it into *one* array of metadata. If we had > {offsetoff(UnpackedUndoRecord, member), > membersize(UnpackedUndoRecord, member), > flag} > we could fairly trivially remove most knowledge from the places > currently knowing about the record format. Seems interesting. I will work on this. > > > I have some vague ideas for how to specify the format in a way that is > more extensible, but with more structure than just a blob of data. But > I don't think they're there yet. > > > - The interface to read undo also doesn't seem right to me. For one > there's various different ways to read records, with associated code > duplication (batch, batch in "one page" mode - but that's being > removed now I think, single record mode). > > I think the batch mode is too restrictive. We might not need this > during the first merged version, but I think before long we're going > to want to be able to efficiently traverse all the undo chains we need > to determine the visibility of all tuples on a page. Otherwise we'll > cause a lot of additional synchronous read IO, and will repeatedly > re-fetch information, especially during sequential scans for an older > snapshot. I think I briefly outlined this in an earlier email - my > current though is that the batch interface (which the non-batch > interface should just be a tiny helper around), should basically be a > queue of "to-be-fetched" undo records. When batching reading an entire > transaction, all blocks get put onto that queue. 
When traversing > multiple chains, the chains are processed in a breadth-first fashion > (by just looking at the queue, and pushing additional work to the > end). That allows to efficiently issue prefetch requests for blocks to > be read in the near future. I need to analyze this part. > > I think that batch reading should just copy the underlying data into a > char* buffer. Only the records that currently are being used by > higher layers should get exploded into an unpacked record. That will > reduce memory usage quite noticably (and I suspect it also drastically > reduce the overhead due to a large context with a lot of small > allocations that then get individually freed). Ok, I got your idea. I will analyze it further and work on this if there is no problem. That will make the > sorting of undo a bit more CPU inefficient, because individual records > will need to be partially unpacked for comparison, but I think that's > going to be a far smaller loss than the win. Right. > > > - My reading of the current xact.c integration is that it's not workable > as is. Undo is executed outside of a valid transaction state, > exceptions aren't properly undone, logic would need to be duplicated > to a significant degree, new kind of critical section. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-08-13 13:53:59 +0530, Dilip Kumar wrote: > On Tue, Jul 30, 2019 at 1:32 PM Andres Freund <andres@anarazel.de> wrote: > > > + /* Loop until we have fetched all the buffers in which we need to write. */ > > > + while (size > 0) > > > + { > > > + bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL); > > > + xact_info->idx_undo_buffers[index++] = bufidx; > > > + size -= (BLCKSZ - starting_byte); > > > + starting_byte = UndoLogBlockHeaderSize; > > > + cur_blk++; > > > + } > > > > So, this locks a very large number of undo buffers at the same time, do > > I see that correctly? What guarantees that there are no deadlocks due > > to multiple buffers locked at the same time (I guess the order inside > > the log)? What guarantees that this is a small enough number that we can > > even lock all of them at the same time? > > I think we are locking them in the block order and that should avoid > the deadlock. I have explained in the comments. Sorry for harping on this so much: But please, please, *always* document things like this *immediately*. This is among the most crucial things to document. There shouldn't need to be a reviewer prodding you to do so many months after the code has been written. For one you've likely forgotten details by then, but more importantly dependencies on the locking scheme will have crept into further places - if it's not well thought through that can be hrad to undo. And it wastes reviewer / reader bandwidth. > > Why do we need to lock all of them at the same time? That's not clear to > > me. > > Because this is called outside the critical section so we keep all the > buffers locked what we want to update inside the critical section for > single wal record. I don't understand this explanation. What does keeping the buffers locked have to do with the critical section? As explained in a later email, I think the current approach is not acceptable - but even without those issues, I don't see why we couldn't just lock the buffers at a later stage? > > > + for (i = 0; i < context->nprepared_undo_buffer; i++) > > > + { > > > > How large do we expect this to get at most? > > > In BeginUndoRecordInsert we are computing it > > + /* Compute number of buffers. */ > + nbuffers = (nprepared + MAX_UNDO_UPDATE_INFO) * MAX_BUFFER_PER_UNDO; Since nprepared is variable, that doesn't really answer the question. Greetings, Andres Freund
Hi, On 2019-08-13 17:05:27 +0530, Dilip Kumar wrote: > On Mon, Aug 5, 2019 at 11:59 PM Andres Freund <andres@anarazel.de> wrote: > > (as I was out of context due to dealing with bugs, I've switched to > > looking at the current zheap/undoprocessing branch. > > > > On 2019-07-30 01:02:20 -0700, Andres Freund wrote: > > > +/* > > > + * Insert a previously-prepared undo records. > > > + * > > > + * This function will write the actual undo record into the buffers which are > > > + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This > > > + * step should be performed inside a critical section. > > > + */ > > > > Again, I think it's not ok to just assume you can lock an essentially > > unbounded number of buffers. This seems almost guaranteed to result in > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > I think for controlling that we need to put a limit on max prepared > undo? I am not sure any other way of limiting the number of buffers > because we must lock all the buffer in which we are going to insert > the undo record under one WAL logged operation. I heard that a number of times. But I still don't know why that'd actually be true. Why would it not be sufficient to just lock the buffer currently being written to, rather than all buffers? It'd require a bit of care updating the official current "logical end" of a log, but otherwise ought to not be particularly hard? Only one backend can extend the log after all, and until the log is externally visibly extended, nobody can read or write those buffers, no? > > > > As far as I can tell there's simply no deadlock avoidance scheme in use > > here *at all*? I must be missing something. > > We are always locking buffer in block order so I am not sure how it > can deadlock? Am I missing something? Do we really in all circumstances? Note that we update the transinfo (and also progress) from earlier in the log. But my main point is more that there's no documented deadlock avoidance scheme. Which imo means there's none, because nobody will know to maintain it. > > > + /* > > > + * During recovery, there might be some blocks which are already > > > + * deleted due to some discard command so we can just skip > > > + * inserting into those blocks. > > > + */ > > > + if (!BufferIsValid(buffer)) > > > + { > > > + Assert(InRecovery); > > > + > > > + /* > > > + * Skip actual writing just update the context so that we have > > > + * write offset for inserting into next blocks. > > > + */ > > > + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte); > > > + if (ucontext.stage == UNDO_PACK_STAGE_DONE) > > > + break; > > > + } > > > > How exactly can this happen? > > Suppose you insert one record for the transaction which split in > block1 and 2. Now, before this block is actually going to the disk > the transaction committed and become all visible the undo logs are > discarded. It's possible that block 1 is completely discarded but > block 2 is not because it might have undo for the next transaction. > Now, during recovery (FPW is off) if block 1 is missing but block 2 is > there so we need to skip inserting undo for block 1 as it does not > exist. Hm. I'm quite doubtful this is a good idea. How will this not force us to emit a lot more expensive durable operations while writing undo? And doesn't this reduce error detection quite remarkably? Thomas, Robert? > > > + /* Read the undo record.
*/ > > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > > + > > > + /* Release the discard lock after fetching the record. */ > > > + if (!InHotStandby) > > > + LWLockRelease(&slot->discard_lock); > > > + } > > > + else > > > + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer); > > > > > > And then we do none of this in !one_page mode. > UndoBulkFetchRecord is always called from the aborted transaction so > its undo can never get discarded concurrently so ideally, we don't > need to check for discard. That's an undocumented assumption. Why would anybody reading the interface know that? > > > +static uint16 > > > +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer, > > > + UndoLogCategory category) > > > +{ > > > + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp); > > > + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp); > > > + Buffer buffer = input_buffer; > > > + Page page = NULL; > > > + char *pagedata = NULL; > > > + char prevlen[2]; > > > + RelFileNode rnode; > > > + int byte_to_read = sizeof(uint16); > > > > Shouldn't it be byte_to_read? Err, *bytes*_to_read. > > And the sizeof a type that's tied with the actual undo format? > > Imagine we'd ever want to change the length format for undo records > > - this would be hard to find. > > Do you mean that we should not rely on undo format i.e. we should not > assume that undo length is stored at the end of the undo record? I was referencing the use of sizeof(uint16). I think this should either reference an UndoRecLen typedef or something like it, or use something roughly like #define member_size(type, member) (sizeof((type){0}.member)) and then have bytes_to_read be set to something like member_size(PackedUndoRecord, len) > > > + char persistence; > > > + uint16 prev_rec_len = 0; > > > + > > > + /* Get relfilenode. */ > > > + UndoRecPtrAssignRelFileNode(rnode, urp); > > > + persistence = RelPersistenceForUndoLogCategory(category); > > > + > > > + if (BufferIsValid(buffer)) > > > + { > > > + page = BufferGetPage(buffer); > > > + pagedata = (char *) page; > > > + } > > > + > > > + /* > > > + * Length if the previous undo record is store at the end of that record > > > + * so just fetch last 2 bytes. > > > + */ > > > + while (byte_to_read > 0) > > > + { > > > > Why does this need a loop around the number of bytes? Can there ever be > > a case where this is split across a record? If so, isn't that a bad idea > > anyway? > Yes, as of now, undo record can be split at any point even the undo > length can be split across 2 pages. I think we can reduce complexity > by making sure undo length doesn't get split across pages. I think we definitely should do that. I'd probably even include more than just the size in the header that's not allowed to be split across pages. > But for handling that while allocating the undo we need to detect this > whether the undo length can get split by checking the space in the > current page and the undo record length and based on that we need to > allocate 1 extra byte in the undo log. Seems that will add an extra > complexity. That seems fairly straightforward? Greetings, Andres Freund
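Spelled out, the member_size() suggestion is plain C99: a compound literal provides an object whose member sizeof can inspect, so no variable of the type is needed. A self-contained sketch, assuming a packed record whose trailing length field is a uint16 as in the code quoted above:

#include <stdint.h>
#include <stdio.h>

/* sizeof a struct member, via a compound literal */
#define member_size(type, member) (sizeof(((type){0}).member))

typedef struct PackedUndoRecord
{
	uint16_t	len;			/* stored at the end of each on-disk record */
	/* ... other packed fields would go here ... */
} PackedUndoRecord;

int
main(void)
{
	/* A later change of the length type needs updating only the struct. */
	size_t		bytes_to_read = member_size(PackedUndoRecord, len);

	printf("%zu\n", bytes_to_read);		/* prints 2 */
	return 0;
}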
Hi, On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote: > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > - I think there's two fairly fundamental, and related, problems with > > the sequence outlined above: > > > > - We can't search for target buffers to store undo data, while holding > > the "table" page content locked. > > > > The easy way to solve would be to require sites performing UNDO > > logging to acquire victim pages before even acquiring other content > > locks. Perhaps the better approach could be for the undo layer to > > hold onto a number of clean buffers, and to keep the last page in an > > already written to undo log pinned. > > > > - We can't search for victim buffers for further undo data while > > already holding other undo pages content locked. Doing so means that > > while we're e.g. doing IO to clean out the new page, old undo data > > on the previous page can't be read. > > > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring, > > pinning and locking buffers, it'd need to be split into two > > operations. One that acquires buffers and pins them, and one that > > locks them. I think it's quite possible that the locking operation > > could just be delayed until InsertPreparedUndo(). But if we solve > > the above problem, most of this might already be solved. > > Basically, that means > - the caller should call PreparedUndoInsert before acquiring table > page content lock right? because the PreparedUndoInsert just compute > the size, allocate the space and pin+lock the buffers and for pinning > the buffers we must compute the size and allocate the space using undo > storage layer. I don't think we can normally pin the undo buffers properly at that stage. Without knowing the correct contents of the table page - which we can't know without holding some form of lock preventing modifications - we can't know how big our undo records are going to be. And we can't just have buffers that don't exist on disk in shared memory, and we don't want to allocate undo that we then don't need. So I think what we'd have to do at that stage, is to "pre-allocate" buffers for the maximum amount of UNDO needed, but mark the associated bufferdesc as not yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID would not be set. So at the start of a function that will need to insert undo we'd need to pre-reserve the maximum number of buffers we could potentially need. That reservation stage would a) pin the page with the current end of the undo b) if needed pin the page of older undo that we need to update (e.g. to update the next pointer) c) perform clock sweep etc to acquire (find or create) enough clean buffers to hold the maximum amount of undo needed. These buffers would be marked as !BM_TAG_VALID | BUF_REFCOUNT_ONE. I assume that we'd make a) cheap by keeping it pinned for undo logs that a backend is actively attached to. b) should only be needed once in a transaction, so it's not too bad. c) we'd probably need to amortize across multiple undo insertions, by keeping the unused buffers pinned until the end of the transaction. I assume that having the infrastructure for c) might also make some code already in postgres easier. There are obviously some issues around guaranteeing that the maximum number of such buffers isn't high. > - So basically, if we delay the lock till InsertPreparedUndo and call > PrepareUndoInsert before acquiring table page content lock this > problem is solved?
> > Although I haven't yet analyzed the AM specific part that whether it's > always possible to call the PrepareUndoInsert(basically getting all > the undo record ready) before the page content lock. But, I am sure > that won't be much difficult part. I think that is somewhere between not possible, and so expensive in a lot of cases that we'd not want to do it anyway. You'd at least have to first acquire a content lock on the page, mark the target tuple as locked, then unlock the page, reserve undo, lock the table page, actually update it. > > - When reading an undo record, the whole stage of UnpackUndoData() > > reading data into the UndoPackContext is omitted, reading directly > > into the UnpackedUndoRecord. That removes one further copy of the > > record format. > So we will read member by member to UnpackedUndoRecord? because in > context we have at least a few headers packed and we can memcpy one > header at a time like UndoRecordHeader, UndoRecordBlock. Well, right now you then copy them again later, so not much is gained by that (although that later copy can happen without the content lock held). As I think I suggested before, I suspect that the best way would be to just memcpy() the data from the page(s) into an appropriately sized buffer with the content lock held, and then perform unpacking directly into UnpackedUndoRecord. Especially with the bulk API that will avoid having to do much work with locks held, and reduce memory usage by only unpacking the record(s) in a batch that are currently being looked at. > But that just a few of them so if we copy field by field in the > UnpackedUndoRecord then we can get rid of copying in context then copy > it back to the UnpackedUndoRecord. Is this what you have in mind or > you want to store these structures (UndoRecordHeader, UndoRecordBlock) > directly into UnpackedUndoRecord? I at the moment see no reason not to? > > Greetings, Andres Freund
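To restate the reservation idea in code form, here is a very rough sketch of the interface shape it seems to imply; every name here (UndoBufferReservation, MAX_UNDO_BUFFERS_PER_OP and the three functions) is invented for illustration, and nothing like it exists in the patch set yet:

/* Hypothetical reservation API for the pre-pinning scheme described above. */
typedef struct UndoBufferReservation
{
	Buffer	bufs[MAX_UNDO_BUFFERS_PER_OP];	/* pinned; some may be !BM_TAG_VALID */
	int		nbufs;
} UndoBufferReservation;

/*
 * Call before taking any table page content lock: pin the current
 * end-of-undo page (a), the older undo page whose next-pointer may need
 * updating (b), and clock-sweep enough clean victim buffers for the
 * worst-case undo size (c).
 */
extern void UndoReserveBuffers(UndoBufferReservation *res, int max_needed);

/* Call once record sizes are known, with content locks held. */
extern Buffer UndoUseReservedBuffer(UndoBufferReservation *res, BlockNumber blk);

/* Call at end of transaction: drop pins on reserved-but-unused buffers. */
extern void UndoReleaseReservation(UndoBufferReservation *res);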
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote: > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > > - I think there's two fairly fundamental, and related, problems with > > > the sequence outlined above: > > > > > > - We can't search for target buffers to store undo data, while holding > > > the "table" page content locked. > > > > > > The easy way to solve would be to require sites performing UNDO > > > logging to acquire victim pages before even acquiring other content > > > locks. Perhaps the better approach could be for the undo layer to > > > hold onto a number of clean buffers, and to keep the last page in an > > > already written to undo log pinned. > > > > > > - We can't search for victim buffers for further undo data while > > > already holding other undo pages content locked. Doing so means that > > > while we're e.g. doing IO to clean out the new page, old undo data > > > on the previous page can't be read. > > > > > > This seems easier to fix. Instead of PrepareUndoInsert() acquiring, > > > pinning and locking buffers, it'd need to be split into two > > > operations. One that acquires buffers and pins them, and one that > > > locks them. I think it's quite possible that the locking operation > > > could just be delayed until InsertPreparedUndo(). But if we solve > > > the above problem, most of this might already be solved. > > > > Basically, that means > > - the caller should call PreparedUndoInsert before acquiring table > > page content lock right? because the PreparedUndoInsert just compute > > the size, allocate the space and pin+lock the buffers and for pinning > > the buffers we must compute the size and allocate the space using undo > > storage layer. > > I don't think we can normally pin the undo buffers properly at that > stage. Without knowing the correct contents of the table page - which we > can't know without holding some form of lock preventing modifications - > we can't know how big our undo records are going to be. And we can't > just have buffers that don't exist on disk in shared memory, and we > don't want to allocate undo that we then don't need. So I think what > we'd have to do at that stage, is to "pre-allocate" buffers for the > maximum amount of UNDO needed, but mark the associated bufferdesc as not > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID > would not be set. > > So at the start of a function that will need to insert undo we'd need to > pre-reserve the maximum number of buffers we could potentially > need. That reservation stage would Maybe we can provide an interface where the caller will input the max prepared undo (maybe in BeginUndoRecordInsert) and based on that we can compute the max number of buffers we could potentially need for this particular operation. Most operations (insert/update/delete) will need 1 or 2 undo records, so we can avoid pinning a very high number of buffers in most cases. Currently, only the multi-insert implementation of zheap might need multiple undo records (1 undo record per range of records). > a) pin the page with the current end of the undo > b) if needed pin the page of older undo that we need to update (e.g. to > update the next pointer) > c) perform clock sweep etc to acquire (find or create) enough clean buffers to > hold the maximum amount of undo needed. These buffers would be marked > as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
> > I assume that we'd make a) cheap by keeping it pinned for undo logs that > a backend is actively attached to. b) should only be needed once in a > transaction, so it's not too bad. c) we'd probably need to amortize > across multiple undo insertions, by keeping the unused buffers pinned > until the end of the transaction. > > I assume that having the infrastructure for c) might also make some code > already in postgres easier. There are obviously some issues around > guaranteeing that the maximum number of such buffers isn't high. > > > > - So basically, if we delay the lock till InsertPreparedUndo and call > > PrepareUndoInsert before acquiring table page content lock this > > problem is solved? > > > > Although I haven't yet analyzed the AM specific part that whether it's > > always possible to call the PrepareUndoInsert(basically getting all > > the undo record ready) before the page content lock. But, I am sure > > that won't be much difficult part. > > I think that is somewhere between not possible, and so expensive in a > lot of cases that we'd not want to do it anyway. You'd at least have to > first acquire a content lock on the page, mark the target tuple as > locked, then unlock the page, reserve undo, lock the table page, > actually update it. > > > > > - When reading an undo record, the whole stage of UnpackUndoData() > > > reading data into the UndoPackContext is omitted, reading directly > > > into the UnpackedUndoRecord. That removes one further copy of the > > > record format. > > So we will read member by member to UnpackedUndoRecord? because in > > context we have at least a few headers packed and we can memcpy one > > header at a time like UndoRecordHeader, UndoRecordBlock. > > Well, right now you then copy them again later, so not much is gained by > that (although that later copy can happen without the content lock > held). As I think I suggested before, I suspect that the best way would > be to just memcpy() the data from the page(s) into an appropriately > sized buffer with the content lock held, and then perform unpacking > directly into UnpackedUndoRecord. Especially with the bulk API that will > avoid having to do much work with locks held, and reduce memory usage by > only unpacking the record(s) in a batch that are currently being looked > at. ok. > > > But that just a few of them so if we copy field by field in the > > UnpackedUndoRecord then we can get rid of copying in context then copy > > it back to the UnpackedUndoRecord. Is this what you have in mind or > > you want to store these structures (UndoRecordHeader, UndoRecordBlock) > > directly into UnpackedUndoRecord? > > I at the moment see no reason not to? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
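The memcpy-with-lock-held, unpack-later shape described in the quoted paragraph might look roughly like this; UnpackUndoRecordFromBuffer is a hypothetical helper, and the real unpack machinery in the patches is organized differently:

/* Sketch: cheap copy under the content lock, unpacking done without it. */
static void
copy_undo_span(Buffer buf, int start, int len, char *dst)
{
	LockBuffer(buf, BUFFER_LOCK_SHARE);
	memcpy(dst, (char *) BufferGetPage(buf) + start, len);
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}

/* Caller side: */
UnpackedUndoRecord uur;
char	   *copy = palloc(len);

copy_undo_span(buffer, starting_byte, len, copy);	/* lock held only here */
UnpackUndoRecordFromBuffer(&uur, copy, len);		/* hypothetical; no lock */
pfree(copy);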
On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > I think that batch reading should just copy the underlying data into a > > char* buffer. Only the records that currently are being used by > > higher layers should get exploded into an unpacked record. That will > > reduce memory usage quite noticeably (and I suspect it also drastically > > reduce the overhead due to a large context with a lot of small > > allocations that then get individually freed). > > Ok, I got your idea. I will analyze it further and work on this if > there is no problem. I think there is one problem: currently while unpacking the undo record, if the record is compressed (i.e. some of the fields do not exist in the record) then we read those fields from the first record on the page. But, if we just memcpy the undo pages to the buffers and delay the unpacking until it's needed, it seems that we would need to know the page boundary, and also we need to know the offset of the first complete record on the page from where we can get that information (which is currently in the undo page header). Even if we leave this issue apart, as of now I am not very clear what benefit you are seeing in the way you are describing compared to the way I am doing it now: a) Is it the multiple pallocs? If so then we can allocate memory at once and flatten the undo records in that. Earlier, I was doing that, but we need to align each unpacked undo record so that we can access them directly, and based on Robert's suggestion I have modified it to multiple pallocs. b) Is it the memory size problem, that the unpacked undo record will take more memory compared to the packed record? c) Do you think that we will not need to unpack all the records? But, I think eventually, at the higher level we will have to unpack all the undo records (I understand that it will be one at a time). Or am I completely missing something here? > > That will make the > > sorting of undo a bit more CPU inefficient, because individual records > > will need to be partially unpacked for comparison, but I think that's > > going to be a far smaller loss than the win. > Right. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
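For the record, the flatten-into-one-allocation variant mentioned in (a) would look roughly like this; record_size() and unpack_record_into() are invented placeholders, while MAXALIGN is the usual PostgreSQL alignment macro that keeps each record directly addressable:

/* Sketch: one palloc for a whole batch of unpacked records. */
Size		total = 0;
char	   *base,
		   *cur;
UnpackedUndoRecord **recs;

for (int i = 0; i < nrecords; i++)
	total += MAXALIGN(record_size(i));		/* record_size is hypothetical */

base = palloc(total);
recs = palloc(nrecords * sizeof(UnpackedUndoRecord *));

cur = base;
for (int i = 0; i < nrecords; i++)
{
	recs[i] = (UnpackedUndoRecord *) cur;
	unpack_record_into(recs[i], i);			/* hypothetical */
	cur += MAXALIGN(record_size(i));		/* keep each record aligned */
}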
Hi Amit, I've combined three of your messages into one below, and responded inline. New patch set to follow shortly, with the fixes listed below (and others from other reviewers). On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > 0003-Add-undo-log-manager.patch > 1. > allocate_empty_undo_segment() ... > + /* create two parents up if not exist */ > + parentdir = pstrdup(undo_path); > + get_parent_directory(parentdir); > + get_parent_directory(parentdir); > + /* Can't create parent and it doesn't already exist? */ > + if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST) > > > All of this code is almost same as we have code in > TablespaceCreateDbspace still we have small differences like here you > are using mkdir instead of MakePGDirectory which as far as I can see > use similar permissions for creating directory. Also, it checks > whether the directory exists before trying to create it. Is there a > reason why we need to do a few things differently here if not, they > can both the places use one common function? Right, MakePGDirectory() arrived in commit da9b580d8990, and should probably be used everywhere we create directories under pgdata. Fixed. Yeah, I think we could just use TablespaceCreateDbspace() for this, if we are OK with teaching GetDatabasePath() and GetRelationPath() about how to make special undo paths, OR we are OK with just using "standard" paths, where undo files just live under database 9 (instead of the special "undo" directory). I stopped using a "9" directory in earlier versions because undo moved to a separate namespace when we agreed to use an extra discriminator in buffer tags and so forth; now that we're back to using database number 9, the question of whether to reflect that on the filesystem is back. I have had some trouble deciding which parts of the system should treat undo logs as some kind of 'relation' (and the SLRU project will have to tackle the same questions). I'll think about that some more before making the change. > 2. > allocate_empty_undo_segment() > { > .. > .. > /* Flush the contents of the file to disk before the next checkpoint. */ > + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace); > .. > } > The comment in allocate_empty_undo_segment indicates that the code > wants to flush before checkpoint, but the actual function tries to > register the request with checkpointer. Shouldn't this be similar to > XLogFileInit where we use pg_fsync to flush the contents immediately? I responded to the general question about when we sync files in an earlier email. I've updated the comments to make it clearer that it's handing the work off, not doing it now. > Another thing is that recently in commit 475861b261 (commit by you), > we have introduced a mechanism to not fill the files with zero's for > certain filesystems like ZFS. Do we want similar behavior for undo > files? Good point. I will create a separate thread to discuss how the creation of a central file allocation routine (possibly with a GUC), and see if we can come up with something reusable for this, but independently committable. > 3. 
> +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) > +{ > + UndoLogSlot *slot; > + size_t end; > + > + slot = find_undo_log_slot(logno, false); > + > + /* TODO review interlocking */ > + > + Assert(slot != NULL); > + Assert(slot->meta.end % UndoLogSegmentSize == 0); > + Assert(new_end % UndoLogSegmentSize == 0); > + Assert(InRecovery || > + CurrentSession->attached_undo_slots[slot->meta.category] == slot); > > Can you write some comments explaining the above Asserts? Also, can > you explain what interlocking issues are you worried about here? I added comments about the assertions. I will come back to the interlocking in another message, which I've now addressed (alluded to below as well). > 4. > while (end < new_end) > + { > + allocate_empty_undo_segment(logno, slot->meta.tablespace, end); > + end += UndoLogSegmentSize; > + } > + > + /* Flush the directory entries before next checkpoint. */ > + undofile_request_sync_dir(slot->meta.tablespace); > > I see that at two places after allocating empty undo segment, the > patch performs undofile_request_sync_dir whereas it doesn't perform > the same in UndoLogNewSegment? Is there a reason for the same or is it > missed from one of the places? You're right. Done. > 5. > +static void > +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) > { > .. > /* > + * We didn't need to acquire the mutex to read 'end' above because only > + * we write to it. But we need the mutex to update it, because the > + * checkpointer might read it concurrently. > > Is this assumption correct? It seems patch also modified > slot->meta.end during discard in function UndoLogDiscard. I am > referring below code: > > +UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid) > { > .. > + /* Update shmem to show the new discard and end pointers. */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->meta.discard = discard; > + slot->meta.end = end; > + LWLockRelease(&slot->mutex); > .. > } Yeah, the assumption was wrong, and that's what that other TODO note about interlocking was referring to. I have redesigned this so that there is a separate per-undo log extend_lock that allows UndoLogDiscard() (in a background worker or superuser command) and UndoLogAllocate() to serialise extension of the undo log. Hopefully foreground processes don't often have to wait (a discard worker will recycle segments fast enough), but if it ever does have to wait, it's waiting for another backend to rename() a fully allocated file, which is hopefully still better than writing a load of zeroes into a new file. > 6. > extend_undo_log() > { > .. > .. > if (!InRecovery) > + { > + xl_undolog_extend xlrec; > + XLogRecPtr ptr; > + > + xlrec.logno = logno; > + xlrec.end = end; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); > + XLogFlush(ptr); > + } > > It is not obvious to me why we need to perform XLogFlush here, can you explain? It's not needed, and I've removed it. The important thing here is that we insert the record after creating the files and telling the checkpointer to flush them; there's no benefit to flushing the WAL record. 
There are three crash recovery possibilities: (1) We recover from a checkpoint after this record, and the files are already durable, (2) we recover from a checkpoint before this record, replay this record, the file(s) may or may not be present but we'll tolerate them if they are and overwrite, (3) we recover from a checkpoint before this record was written, and this WAL record is never replayed because it wasn't flushed, and then there may or may not be some orphaned files but we'll eventually try to create files with the same names as we extend the undo log and tolerate their existence. > 7. > +attach_undo_log(UndoLogCategory category, Oid tablespace) > { > .. > if (candidate->meta.tablespace == tablespace) > + { > + logno = *place; > + slot = candidate; > + *place = candidate->next_free; > + break; > + } > > Here, the code is breaking from the loop, so why do we need to set > *place? Am I missing something obvious? (See further down). > 8. > + /* WAL-log the creation of this new undo log. */ > + { > + xl_undolog_create xlrec; > + > + xlrec.logno = logno; > + xlrec.tablespace = slot->meta.tablespace; > + xlrec.category = slot->meta.category; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > > Here and in most other places in this patch you are using > sizeof(xlrec) for xlog static data. However, as far as I know in > other places in the code we define the size using offset of the last > parameter of corresponding structure to avoid any inconsistency in WAL > record size across different platforms. Is there a reason to go > differently with this patch? See below one for example: > > typedef struct xl_hash_add_ovfl_page > { > uint16 bmsize; > bool bmpage_found; > } xl_hash_add_ovfl_page; > > #define SizeOfHashAddOvflPage > \ > (offsetof(xl_hash_add_ovfl_page, bmpage_found) + sizeof(bool)) I see. Apparently we don't always do that: tmunro@dogmatix $ git grep RegisterData | grep sizeof | wc -l 60 tmunro@dogmatix $ git grep RegisterData | grep Size | wc -l 63 I've now done it for all of these structs so that we trim the padding in some cases, even though in some cases it'll make no difference. > 9. > +static void > +undolog_xlog_create(XLogReaderState *record) > +{ > + xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record); > + UndoLogSlot *slot; > + > + /* Create meta-data space in shared memory. */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + > + /* TODO: assert that it doesn't exist already? */ > + > + slot = allocate_undo_log_slot(); > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > > Why do we need to acquire locks during recovery? Mostly because allocate_undo_log_slot() asserts that the lock is held. We probably don't need this in recovery but it doesn't seem like a problem, it'll never be contended. > 10. > I think UndoLogAllocate can leak allocation of slots. It first > allocates the slot for a new log from the free pool in there is no > existing slot/log, writes a WAL record and then at a later point of > time it actually creates the required physical space in the log via > extend_undo_log which also writes a separate WAL. Now, if there is a > error between these two operations, then we will have a redundant slot > allocated. What if there are repeated errors for similar thing from > multiple backends after which system crashes. Now, after restart, we > will allocate multiple slots for different lognos which don't have any > actual (physical) logs. 
This might not be a big problem in practice > because the chances of error between two operations are less, but > can't we delay the WAL logging for allocation of a slot for a new log. I don't think it leaks anything, and the undo log is not redundant, it's free/available. An undo log is allowed to have no space allocated (discard == end). In fact that can happen a few different ways: for example after crash recovery, we unlink all files belonging to unlogged undo logs by scanning the filesystem, and set discard == end, on a segment boundary. That's also the way new undo logs are born, and I think that's OK. If you crash and recover up to the point the undo log creation was WAL-logged, you'll now have a log with no space, and then the first person to try to allocate something in it will extend it (= create the file, move the end pointer) in the process of allocating space. > 11. > +UndoLogAllocate() > { > .. > .. > + /* > + * Maintain our tracking of the and the previous transaction start > + * locations. > + */ > + if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert) > + { > + slot->meta.unlogged.last_xact_start = > + slot->meta.unlogged.this_xact_start; > + slot->meta.unlogged.this_xact_start = slot->meta.unlogged.insert; > + } > > ".. of the and the ..", after first the, something is missing. Fixed. > 12. > UndoLogAllocate() > { > .. > .. > + /* > + * We don't need to acquire log->mutex to read log->meta.insert and > + * log->meta.end, because this backend is the only one that can > + * modify them. > + */ > + if (unlikely(new_insert > slot->meta.end)) > > I might be confused but slot->meta.end is modified by discard process > also, so how is it safe? If so, may be adding a comment to explain > the same would be good. Also, I think in the comments log should be > replaced with the slot. Right, now fixed. I fixed s/log->/slot->/ here and elsewhere in comments. > 13. > UndoLogAllocate() > { > .. > + /* This undo log is entirely full. Get a new one. */ > + if (logxid == GetTopTransactionId()) > + { > + /* > + * If the same transaction is split over two undo logs then > + * store the previous log number in new log. See detailed > + * comments in undorecord.c file header. > + */ > .. > } > > The undorecord.c should be renamed to undoaccess.c Fixed. > 14. > UndoLogAllocate() > { > .. > + if (logxid != GetTopTransactionId()) > + { > + /* > + * While we have the lock, check if we have been forcibly detached by > + * DROP TABLESPACE. That can only happen between transactions (see > + * DropUndoLogsInsTablespace()). > + */ > > /DropUndoLogsInsTablespace/DropUndoLogsInTablespace Fixed. > 15. > UndoLogSegmentPath() > { > .. > /* > + * Build the path from log number and offset. The pathname is the > + * UndoRecPtr of the first byte in the segment in hexadecimal, with a > + * period inserted between the components. > + */ > + snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno, > + segno * UndoLogSegmentSize); > .. > } > > a. It is not very clear from the above code why are we multiplying > segno with UndoLogSegmentSize? I see that many of the callers pass > segno as segno/UndoLogSegmentSize. Won't it be better if the caller > take care of passing correct value of segno? We want "the UndoRecPtr of the first byte in the segment [...] with a period inserted between the components". Seems clear? 
So undo log 7, segno 0 will be 000007.0000000000 and undo log 7, segno 1 will be 000007.0000100000, and UndoRecPtr of its first byte is at 0000070000100000 (so when you're looking at pg_stat_undo_logs or undoinspect() or any other representation of undo record pointers, you can easily see which files they are referring to). It's true that we could pass in the offset of the first byte, instead of the segment number, but some other callers have a segment number (see undofile.c). > b. In the comment above, instead of offset, shouldn't there be segment number. No, segno * segment size == offset (the offset part of an UndoRecPtr is the lower 48 bits; the upper 24 bits are the undo log number). > 16. UndoLogGetLastXactStartPoint is not used any where. I think this > was required in previous version of patchset, now, we can remove it. Done, thanks. > 17. > Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com > > This discussion link seems to be from old discussion/thread, not this one. Will reference this one. > 0019-Add-developer-documentation-for-the-undo-log-storage > 18. > +each undo log, a set of meta-data properties is tracked: > +tracked, including: > + > +* the tablespace that holds its segment files > +* the persistence level (permanent, unlogged or temporary) > > Here, don't we want to refer to UndoLogCategory rather than > persistence level? "tracked, including:" seems bit confusing. Fixed here and elsewhere. > 0020-Add-user-facing-documentation-for-undo-logs > 19. > <row> > + <entry><structfield>persistence</structfield></entry> > + <entry><type>text</type></entry> > + <entry>Persistence level of data stored in this undo log; one of > + <literal>permanent</literal>, <literal>unlogged</literal> or > + <literal>temporary</literal>.</entry> > + </row> > > Don't we want to cover the new (shared) undolog category here? Done (though I have mixed feelings about this shared category; more on that soon). On Thu, Jul 25, 2019 at 12:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 7. > > +attach_undo_log(UndoLogCategory category, Oid tablespace) > > { > > .. > > if (candidate->meta.tablespace == tablespace) > > + { > > + logno = *place; > > + slot = candidate; > > + *place = candidate->next_free; > > + break; > > + } > > > > Here, the code is breaking from the loop, so why do we need to set > > *place? Am I missing something obvious? > > > > I think I know what I was missing. It seems here you are removing an > element from the freelist. Right. > One point related to detach_current_undo_log. > > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + slot->pid = InvalidPid; > + slot->meta.unlogged.xid = InvalidTransactionId; > + if (full) > + slot->meta.status = UNDO_LOG_STATUS_FULL; > + LWLockRelease(&slot->mutex); > > If I read the comments in structure UndoLogMetaData, it is mentioned > that 'status' is changed by explicit WAL record whereas there is no > WAL record in code to change the status. I see the problem as well if > we don't WAL log this change. Suppose after changing the status of > this log, we allocate a new log and insert some records in that log as > well for the same transaction for which we have inserted records in > the log which we just marked as FULL. Now, here we form the link > between two logs as the same transaction has overflowed into a new > log. 
Say, we crash after this. Now, after recovery the log won't be > marked as FULL which means there is a chance that it can be used for > some other transaction, if that happens, then our link for a > transaction spanning to different log will break and we won't be able > to access the data in another log. In short, I think it is important > to WAL log this status change unless I am missing something. I thought it was OK to relax that and was going to just fix the comment, but the case you describe seems important. It seems we could fix that either by WAL-logging the status changes as you said, or by making sure the links have enough information to handle that. I'll think about that some more. On Thu, Jul 25, 2019 at 10:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Some more review of the same patch: > 1. > +typedef struct UndoLogSharedData > +{ > + UndoLogNumber free_lists[UndoLogCategories]; > + UndoLogNumber low_logno; > > What is the use of low_logno? I don't see anywhere in the code this > being assigned any value. Is it for some future use? Yeah, fixed, and now used. It reduces the need for 'negative cache entries' after a backend has been running for a very long time. > 2. > +void > +CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo) > { > .. > + /* Compute header checksum. */ > + INIT_CRC32C(crc); > + COMP_CRC32C(crc, &UndoLogShared->low_logno, sizeof(UndoLogShared->low_logno)); > + COMP_CRC32C(crc, &UndoLogShared->next_logno, > sizeof(UndoLogShared->next_logno)); > + COMP_CRC32C(crc, &num_logs, sizeof(num_logs)); > + FIN_CRC32C(crc); > + > + /* Write out the number of active logs + crc. */ > + if ((write(fd, &UndoLogShared->low_logno, > sizeof(UndoLogShared->low_logno)) != sizeof(UndoLogShared->low_logno)) > || > + (write(fd, &UndoLogShared->next_logno, > sizeof(UndoLogShared->next_logno)) != > sizeof(UndoLogShared->next_logno)) || > > Is it safe to read UndoLogShared without UndoLogLock? All other > places accessing UndoLogShared uses UndoLogLock, so if this usage is > safe, maybe it is better to add a comment. Fixed for next_logno. And the other one, low_logno is no longer written to disk (it can be computed). > 3. > UndoLogAllocateInRecovery() > { > .. > /* > + * Otherwise we need to do our own transaction tracking > + * whenever we see a new xid, to match the logic in > + * UndoLogAllocate(). > + */ > + if (xid != slot->meta.unlogged.xid) > + { > + slot->meta.unlogged.xid = xid; > + if (slot->meta.unlogged.this_xact_start != slot->meta.unlogged.insert) > + slot->meta.unlogged.last_xact_start = > + slot->meta.unlogged.this_xact_start; > + slot->meta.unlogged.this_xact_start = > + slot->meta.unlogged.insert; > > The code doesn't follow the comment. In UndoLogAllocate, both > last_xact_start and this_xact_start are assigned in if block, so the > should be the case here. True, in "do" I only did the assignment if the values were different, and in "redo" I did the assignment even if they were the same, which has the same effect, but is indeed distracting. I've made them the same. > 4. > UndoLogAllocateInRecovery() > { > .. > + /* > + * Just as in UndoLogAllocate(), the caller may be extending an existing > + * allocation before committing with UndoLogAdvance(). > + */ > + if (context->try_location != InvalidUndoRecPtr) > + { > .. > } > > I am not sure how will this work because unlike UndoLogAllocate, this > function doesn't set try_location initially.
It will be set later by > UndoLogAdvance which can easily go wrong because that doesn't include > UndoLogBlockHeaderSize. Hmm, yeah, I need to look into that some more. The 'advance' function does consider header bytes, though. I do admit that this code is very hard to follow. It got that way by being developed before the 'context' existed. It needs to be rewritten in a much clearer way; I'm going to do that. > 5. > +UndoLogAdvance(UndoLogAllocContext *context, size_t size) > +{ > + context->try_location = UndoLogOffsetPlusUsableBytes(context->try_location, > + size); > +} > > Here, you are using UndoRecPtr whereas UndoLogOffsetPlusUsableBytes > expects offset. Yeah that is ugly. I created UndoRecPtrPlusUsableBytes(). > 6. > UndoLogAllocateInRecovery() > { > .. > + /* > + * At this stage we should have an undo log that can handle this > + * allocation. If we don't, something is screwed up. > + */ > + if (UndoLogOffsetPlusUsableBytes(slot->meta.unlogged.insert, size) > > slot->meta.end) > + elog(ERROR, > + "cannot allocate %d bytes in undo log %d", > + (int) size, slot->logno); > .. > } > > Similar to point-5, here you are using a pointer instead of offset. Fixed. > 7. > UndoLogAllocateInRecovery() > { > .. > + /* We found a reference to a different (or first) undo log. */ > + slot = find_undo_log_slot(logno, false); > .. > + /* TODO: check locking against undo log slot recycling? */ > .. > } > > I think it is better to have an Assert here that slot can't be NULL. > AFAICS, slot can't be NULL unless there is some bug. I don't > understand this 'TODO' comment. Yeah. I just removed it. "slot" was already dereferenced above so an assertion that it's not NULL is too late to have any useful effect. > 8. > + { > + {"undo_tablespaces", PGC_USERSET, > CLIENT_CONN_STATEMENT, > + gettext_noop("Sets the > tablespace(s) to use for undo logs."), > + NULL, > + > GUC_LIST_INPUT | GUC_LIST_QUOTE > + }, > + > &undo_tablespaces, > + "", > + check_undo_tablespaces, > assign_undo_tablespaces, NULL > + }, > > It seems you need to update variable_is_guc_list_quote for this variable. Huh. Right. Done. > 9. > +extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end) > { > .. > + if (!InRecovery) > + { > + xl_undolog_extend xlrec; > + XLogRecPtr ptr; > + > + xlrec.logno = logno; > + xlrec.end = end; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); > + ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND); > + XLogFlush(ptr); > + } > .. > } > > Do we need it for temporary/unlogged persistence level? Similarly, > there is a WAL logging in attach_undo_log which I can't understand why > it would be required for temporary/unlogged persistence levels. You're right, if we crash we don't care about any data in temporary/unlogged undo logs, since that data belongs to temporary/unlogged zheap (etc) tables. We destroy all of their files at startup in ResetUndoLogs(). So therefore we might as well not bother to log the extension stuff. Done. > 10. > +choose_undo_tablespace(bool force_detach, Oid *tablespace) > { > .. > + oid = get_tablespace_oid(name, true); > + if (oid == InvalidOid) > .. > } > > Do we need to check permissions to see if the current user is allowed > to create in this tablespace? Yeah, right. I added a pg_tablespace_aclcheck() check postgres=> set undo_tablespaces = ts1; SET postgres=> create table t (); ERROR: permission denied for tablespace ts1 > 11.
> +static bool > +choose_undo_tablespace(bool force_detach, Oid *tablespace) > +{ > + char *rawname; > + List *namelist; > + bool > need_to_unlock; > + int length; > + int > i; > + > + /* We need a modifiable copy of string. */ > + rawname = > pstrdup(undo_tablespaces); > > I don't see the usage of rawname outside this function, isn't it > better to free it? I understand that this function won't be called > frequently enough to matter, but still, there is some theoretical > danger if the user continuously changes undo_tablespaces. Fixed by freeing both rawname and namelist. > 12. > +find_undo_log_slot(UndoLogNumber logno, bool locked) > { > .. > + * TODO: We could track the lowest known undo log > number, to reduce > + * the negative cache entry bloat. > + */ > + if (result == NULL) > + { > .. > } > > Do we have any mechanism to clear this bloat or will it stay till the > end of the session? If it is later, then I think it might be good to > take care of this TODO. I think this is not a blocker, but good to > have kind of stuff. I did the TODO, so now we can drop negative cache entries below low_logno from the cache. There are probably more things we could do here to be more aggressive but that's a start. > 13. > +static void > +allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace, > + UndoLogOffset end) > { > .. > } > > What will happen if the transaction creating undolog segment rolls > back? Do we want to have pendingDeletes stuff as we have for normal > relation files? This might also help in clearing the shared memory > state (undo log slots) if any. No, that's non-transactional. The undo log segment remains created, just like various other things stay permanently even if the transaction that created them aborts (relation extension, btree splits, ...). -- Thomas Munro https://enterprisedb.com
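As a concrete illustration of the offsetof-based sizing convention from item 8 above, applied to the xl_undolog_create record quoted earlier (SizeOfUndologCreate is an assumed macro name, following the existing SizeOfHashAddOvflPage pattern):

typedef struct xl_undolog_create
{
	UndoLogNumber logno;
	Oid			tablespace;
	UndoLogCategory category;	/* last field used by redo */
} xl_undolog_create;

/* Size up to and including the last field, excluding trailing padding. */
#define SizeOfUndologCreate \
	(offsetof(xl_undolog_create, category) + sizeof(UndoLogCategory))

/* ... */
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfUndologCreate);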
Hi Kuntal, On Thu, Jul 25, 2019 at 5:40 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > Here are some review comments on 0003-Add-undo-log-manager.patch. I've > tried to avoid duplicate comments as much as possible. Thanks! Replies inline. I'll be posting a new patch set shortly with these and other fixes. > 1. In UndoLogAllocate, > + * time this backend as needed to write to an undo log at all or because > s/as/has Fixed. > + * Maintain our tracking of the and the previous transaction start > Do you mean current log's transaction start as well? Right, fixed. > 2. In UndoLogAllocateInRecovery, > we try to find the current log from the first undo buffer. So, after a > log switch, we always have to register at least one buffer from the > current undo log first. If we're updating something in the previous > log, the respective buffer should be registered after that. I think we > should document this in the comments. I'm not sure I understand. Is this working correctly today? > 3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full), > it seems the 'full' parameter is not used anywhere. Do we still need this? > > + /* It's been recycled. SO it must have been entirely discarded. */ > s/SO/So Fixed. > 4. In CleanUpUndoCheckPointFiles, > we can emit a debug2 message with something similar to : 'removed > unreachable undo metadata files' Done. > + if (unlink(path) != 0) > + elog(ERROR, "could not unlink file \"%s\": %m", path); > according to my observation, whenever we deal with a file operation, > we usually emit a ereport message with errcode_for_file_access(). > Should we change it to ereport? There are other file operations as > well including read(), OpenTransientFile() etc. Done. > 5. In CheckPointUndoLogs, > + /* Capture snapshot while holding each mutex. */ > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + serialized[num_logs++] = slot->meta; > + LWLockRelease(&slot->mutex); > why do we need an exclusive lock to read something from the slot? A > share lock seems to be sufficient. OK. > pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called > after pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE) > without calling pgstat_report_wait_end(). I think you've done the > same to avoid an extra function call. But, it differs from other > places in the PG code. Perhaps, we should follow this approach > everywhere. Ok, changed. > 6. In StartupUndoLogs, > + if (fd < 0) > + elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path); > assuming your agreement upon changing above elog to ereport, the > message should be more user friendly. May be something like 'cannot > open pg_undo file'. Done. > + if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != sizeof(slot->meta)) > The usage of sizeof doesn't look like a problem. But, we can save > some extra padding bytes at the end if we use (offsetof + sizeof) > approach similar to other places in PG. It currently ends in a 64-bit value, so there is no padding here. > 7. In free_undo_log_slot, > + /* > + * When removing an undo log from a slot in shared memory, we acquire > + * UndoLogLock, log->mutex and log->discard_lock, so that other code can > + * hold any one of those locks to prevent the slot from being recycled.
> + */ > + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); > + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE); > + Assert(slot->logno != InvalidUndoLogNumber); > + slot->logno = InvalidUndoLogNumber; > + memset(&slot->meta, 0, sizeof(slot->meta)); > + LWLockRelease(&slot->mutex); > + LWLockRelease(UndoLogLock); > you've not taken the discard_lock as mentioned in the comment. Right, I was half-way between two different ideas about how that interlocking should work, but I have straightened this out now, and will write about the overall locking model separately. > 8. In find_undo_log_slot, > + * 1. If the calling code knows that it is attached to this lock or is the s/lock/slot Fixed. BTW I am experimenting with macros that would actually make assertions about those programming rules. > + * 2. All other code should acquire log->mutex before accessing any members, > + * and after doing so, check that the logno hasn't moved. If it is not, the > + * entire undo log must be assumed to be discarded (as if this function > + * returned NULL) and the caller must behave accordingly. > Perhaps, you meant '..check that the logno remains same. If it is not..'. Fixed. > + /* > + * If we didn't find it, then it must already have been entirely > + * discarded. We create a negative cache entry so that we can answer > + * this question quickly next time. > + * > + * TODO: We could track the lowest known undo log number, to reduce > + * the negative cache entry bloat. > + */ > This is an interesting thought. But, I'm wondering how we are going to > search the discarded logno in the simple hash. I guess that's why it's > in the TODO list. Done. Each backend tracks its idea of the lowest undo log that exists. There is a shared low_logno that is recomputed whenever a slot is freed (ie a log is entirely discarded). Whenever a backend sees that its own value is too low, it walks forward dropping cache entries. Perhaps this could be made more proactive later by using sinval, but I didn't look into that. > 9. In attach_undo_log, > + * For now we have a simple linked list of unattached undo logs for each > + * persistence level. We'll grovel though it to find something for the > + * tablespace you asked for. If you're not using multiple tablespaces s/though/through Fixed. > + if (slot == NULL) > + { > + if (UndoLogShared->next_logno > MaxUndoLogNumber) > + { > + /* > + * You've used up all 16 exabytes of undo log addressing space. > + * This is a difficult state to reach using only 16 exabytes of > + * WAL. > + */ > + elog(ERROR, "undo log address space exhausted"); > + } > looks like a potential unlikely() condition. Done. Yeah, actually every branch containing an unconditional elog() at ERROR or higher (or maybe even lower) must surely be considered unlikely, and it'd be nice to tell the leading compilers about that, but the last thread about that hasn't made it as far as a useful patch for some technical reason that didn't seem fatal to the concept, IIRC. I'd be curious to know what sort of effect that sort of rule would have on the whole tree, in terms of code locality, even if you have to hack the compiler to find out... -- Thomas Munro https://enterprisedb.com
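The unlikely() change mentioned for attach_undo_log is tiny; PostgreSQL's c.h already provides the macro (expanding to __builtin_expect on supported compilers), so the branch just becomes, roughly:

if (unlikely(UndoLogShared->next_logno > MaxUndoLogNumber))
{
	/*
	 * You've used up all 16 exabytes of undo log addressing space.
	 * This is a difficult state to reach using only 16 exabytes of WAL.
	 */
	elog(ERROR, "undo log address space exhausted");
}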
Hi, On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote: > On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > > > I think that batch reading should just copy the underlying data into a > > > char* buffer. Only the records that currently are being used by > > > higher layers should get exploded into an unpacked record. That will > > > reduce memory usage quite noticeably (and I suspect it also drastically > > > reduce the overhead due to a large context with a lot of small > > > allocations that then get individually freed). > > > > Ok, I got your idea. I will analyze it further and work on this if > > there is no problem. > > I think there is one problem that currently while unpacking the undo > record if the record is compressed (i.e. some of the fields does not > exist in the record) then we read those fields from the first record > on the page. But, if we just memcpy the undo pages to the buffers and > delay the unpacking whenever it's needed seems that we would need to > know the page boundary and also we need to know the offset of the > first complete record on the page from where we can get that > information (which is currently in undo page header). I don't understand why that's a problem? > As of now even if we leave this issue apart I am not very clear what > benefit you are seeing in the way you are describing compared to the > way I am doing it now? > > a) Is it the multiple palloc? If so then we can allocate memory at > once and flatten the undo records in that. Earlier, I was doing that > but we need to align each unpacked undo record so that we can access > them directly and based on Robert's suggestion I have modified it to > multiple palloc. Part of it. > b) Is it the memory size problem that the unpack undo record will take > more memory compared to the packed record? Part of it. > c) Do you think that we will not need to unpack all the records? But, > I think eventually, at the higher level we will have to unpack all the > undo records ( I understand that it will be one at a time) Part of it. There's a *huge* difference between having a few hundred to a thousand unpacked records, each consisting of several independent allocations, in memory and having one large block containing all packed records in a batch, and a few allocations for the few unpacked records that need to exist. There's also d) we don't need separate tiny memory copies while holding buffer locks etc. Greetings, Andres Freund
On Fri, Aug 16, 2019 at 10:56 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-16 09:44:25 +0530, Dilip Kumar wrote: > > On Wed, Aug 14, 2019 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > I think that batch reading should just copy the underlying data into a > > > > char* buffer. Only the records that currently are being used by > > > > higher layers should get exploded into an unpacked record. That will > > > > reduce memory usage quite noticeably (and I suspect it also drastically > > > > reduce the overhead due to a large context with a lot of small > > > > allocations that then get individually freed). > > > > > > Ok, I got your idea. I will analyze it further and work on this if > > > there is no problem. > > > > I think there is one problem that currently while unpacking the undo > > record if the record is compressed (i.e. some of the fields does not > > exist in the record) then we read those fields from the first record > > on the page. But, if we just memcpy the undo pages to the buffers and > > delay the unpacking whenever it's needed seems that we would need to > > know the page boundary and also we need to know the offset of the > > first complete record on the page from where we can get that > > information (which is currently in undo page header). > > I don't understand why that's a problem? Okay, I was assuming that we would only be copying the data part, not the complete page including the page header. If we copy the page header as well, we might be able to unpack the compressed records too. > > > > As of now even if we leave this issue apart I am not very clear what > > benefit you are seeing in the way you are describing compared to the > > way I am doing it now? > > > > a) Is it the multiple palloc? If so then we can allocate memory at > > once and flatten the undo records in that. Earlier, I was doing that > > but we need to align each unpacked undo record so that we can access > > them directly and based on Robert's suggestion I have modified it to > > multiple palloc. > > Part of it. > > > b) Is it the memory size problem that the unpack undo record will take > > more memory compared to the packed record? > > Part of it. > > > c) Do you think that we will not need to unpack all the records? But, > > I think eventually, at the higher level we will have to unpack all the > > undo records ( I understand that it will be one at a time) > > Part of it. There's a *huge* difference between having a few hundred to > thousand unpacked records, each consisting of several independent > allocations, in memory and having one large block containing all > packed records in a batch, and a few allocations for the few unpacked > records that need to exist. > > There's also d) we don't need separate tiny memory copies while holding > buffer locks etc. Yeah, that too. Yet another problem could be: how are we going to process those records? For that we need to know all the undo record pointers between start_urecptr and end_urecptr, right? We just have the big memory chunk, and we have no idea how many undo records are there and what their undo record pointers are. Without knowing that information, I am unable to imagine how we are going to sort them based on block number. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
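One possible answer to the how-do-we-find-the-record-pointers question, assuming the copied chunk retains whole pages including their headers as discussed above: walk each page forward from the first-complete-record offset recorded in its page header, collecting record starts as you go. Both the header field and the helper below are invented names for the sketch:

/* Sketch: enumerate record start offsets within one copied undo page. */
static int
collect_record_offsets(char *pagecopy, int *offsets, int max)
{
	UndoPageHeader hdr = (UndoPageHeader) pagecopy;
	int			off = hdr->first_record_offset;		/* assumed header field */
	int			n = 0;

	while (off < BLCKSZ && n < max)
	{
		offsets[n++] = off;
		off += packed_record_length(pagecopy + off);	/* hypothetical */
	}
	return n;
}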
On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > Again, I think it's not ok to just assume you can lock an essentially > > > unbounded number of buffers. This seems almost guaranteed to result in > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > I think for controlling that we need to put a limit on max prepared > > undo? I am not sure any other way of limiting the number of buffers > > because we must lock all the buffer in which we are going to insert > > the undo record under one WAL logged operation. > > I heard that a number of times. But I still don't know why that'd > actually be true. Why would it not be sufficient to just lock the buffer > currently being written to, rather than all buffers? It'd require a bit > of care updating the official current "logical end" of a log, but > otherwise ought to not be particularly hard? Only one backend can extend > the log after all, and until the log is externally visibily extended, > nobody can read or write those buffers, no? Well, I don't understand why you're on about this. We've discussed it a number of times but I'm still confused. I'll repeat my previous arguments on-list: 1. It's absolutely fine to just put a limit on this, because the higher-level facilities that use this shouldn't be doing a single WAL-logged operation that touches a zillion buffers. We have been careful to avoid having WAL-logged operations touch an unbounded number of buffers in plenty of other places, like the btree code, and we are going to have to be similarly careful here for multiple reasons, deadlock avoidance being one. So, saying, "hey, you're going to lock an unlimited number of buffers" is a straw man. We aren't. We can't. 2. The write-ahead logging protocol says that you're supposed to lock all the buffers at once. See src/backend/access/transam/README. If you want to go patch that file, then this patch can follow whatever the locking rules in the patched version are. But until then, the patch should follow *the actual rules* not some other protocol based on a hand-wavy explanation in an email someplace. Otherwise, you've got the same sort of undocumented disaster-waiting-to-happen that you keep complaining about in other parts of this patch. We need fewer of those, not more! 3. There is no reason to care about all of the buffers being locked at once, because they are not unlimited in number (see point #1) and nobody else is looking at them anyway (see the final sentence of what I quoted above). I think we are, or ought to be, talking about locking 2 (or maybe in rare cases 3 or 4) undo buffers in connection with a single WAL record. If we're talking about more than that, then I think the higher-level code needs to be changed. If we're talking about that many, then we don't need to be clever. We can just do the standard thing that the rest of the system does, and it will be fine just like it is everywhere else. > > Suppose you insert one record for the transaction which split in > > block1 and 2. Now, before this block is actually going to the disk > > the transaction committed and become all visible the undo logs are > > discarded. It's possible that block 1 is completely discarded but > > block 2 is not because it might have undo for the next transaction. > > Now, during recovery (FPW is off) if block 1 is missing but block 2 is > > their so we need to skip inserting undo for block 1 as it does not > > exist. > > Hm. I'm quite doubtful this is a good idea. 
How will this not force us > to a emit a lot more expensive durable operations while writing undo? > And doesn't this reduce error detection quite remarkably? > > Thomas, Robert? I think you're going to need to spell out your assumptions in order for me to be able to comment intelligently. This is another thing that seems pretty normal to me. Generally, WAL replay might need to recreate objects whose creation is not separately WAL-logged, and it might need to skip operations on objects that have been dropped later in the WAL stream and thus don't exist any more. This seems like an instance of the latter pattern. There's no reason to try to put valid data into pages that we know have been discarded, and both inserting and discarding undo data need to be logged anyway. As a general point, I think the hope is that undo generated by short-running transactions that commit and become all-visible quickly will be cheap. We should be able to dirty shared buffers but then discard the data without ever writing it out to disk if we've logged a discard of that data. Obviously, if you've got long-running transactions that are either generating undo or holding old snapshots, you're going to have to really flush the data, but we want to avoid that when we can. And the same is true on the standby: even if we write the dirty data into shared buffers instead of skipping the write altogether, we hope to be able to forget about those buffers when we encounter a discard record before the next checkpoint. One idea we could consider, if it makes the code sufficiently simpler and doesn't cost too much performance, is to remove the facility for skipping over bytes to be written and instead write any bytes that we don't really want to write to an entirely-fake buffer (e.g. a backend-private page in a static variable). That seems a little silly to me; I suspect there's a better way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
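For readers following along, the recipe Robert is invoking (from src/backend/access/transam/README) looks roughly like this for an operation touching a small, bounded number of buffers -- a generic sketch, with the rmgr ID, info flag, and surrounding variables left as placeholders:

    Buffer      bufs[NBUFS];
    XLogRecPtr  recptr;
    int         i;

    /* Pin everything first; no locks held yet, so I/O here is safe. */
    for (i = 0; i < NBUFS; i++)
        bufs[i] = ReadBuffer(rel, blknos[i]);

    /* Lock all buffers, in a deterministic order, for deadlock avoidance. */
    for (i = 0; i < NBUFS; i++)
        LockBuffer(bufs[i], BUFFER_LOCK_EXCLUSIVE);

    START_CRIT_SECTION();

    /* ... apply the changes to the pages ... */

    for (i = 0; i < NBUFS; i++)
        MarkBufferDirty(bufs[i]);

    XLogBeginInsert();
    for (i = 0; i < NBUFS; i++)
        XLogRegisterBuffer(i, bufs[i], REGBUF_STANDARD);
    recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_OPERATION);

    /*
     * Stamp every page before unlocking, so that no page can reach disk
     * ahead of the WAL record that describes the change.
     */
    for (i = 0; i < NBUFS; i++)
        PageSetLSN(BufferGetPage(bufs[i]), recptr);

    END_CRIT_SECTION();

    for (i = 0; i < NBUFS; i++)
        UnlockReleaseBuffer(bufs[i]);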
Hi, On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > I think for controlling that we need to put a limit on max prepared > > > undo? I am not sure any other way of limiting the number of buffers > > > because we must lock all the buffer in which we are going to insert > > > the undo record under one WAL logged operation. > > > > I heard that a number of times. But I still don't know why that'd > > actually be true. Why would it not be sufficient to just lock the buffer > > currently being written to, rather than all buffers? It'd require a bit > > of care updating the official current "logical end" of a log, but > > otherwise ought to not be particularly hard? Only one backend can extend > > the log after all, and until the log is externally visibily extended, > > nobody can read or write those buffers, no? > > Well, I don't understand why you're on about this. We've discussed it > a number of times but I'm still confused. There are two reasons here: The primary one in the context here is that if we do *not* have to lock the buffers all ahead of time, we can simplify the interface. We certainly can't lock the buffers over IO (due to buffer reclaim) as we're doing right now, so we'd need another phase, called by the "user" during undo insertion. But if we do not need to lock the buffers before the insertion as a whole starts, the inserting location doesn't have to care. Secondarily, all the reasoning for needing to lock all buffers ahead of time was imo fairly unconvincing. Following the "recipe" for WAL insertions is a good idea when writing a new run-of-the-mill WAL inserting location - but when writing a new fundamental facility, that already needs to modify how WAL works, then I find that much less convincing. > 1. It's absolutely fine to just put a limit on this, because the > higher-level facilities that use this shouldn't be doing a single > WAL-logged operation that touches a zillion buffers. We have been > careful to avoid having WAL-logged operations touch an unbounded > number of buffers in plenty of other places, like the btree code, and > we are going to have to be similarly careful here for multiple > reasons, deadlock avoidance being one. So, saying, "hey, you're going > to lock an unlimited number of buffers" is a straw man. We aren't. > We can't. Well, in the version of code that I was reviewing here, I don't think there is such a limit (there is a limit for buffers per undo record, but no limit on the number of records inserted together). I think Dilip added a limit since. And we have the issue of a lot of IO happening while holding content locks on several pages. So I don't think it's a straw man at all. > 2. The write-ahead logging protocol says that you're supposed to lock > all the buffers at once. See src/backend/access/transam/README. If > you want to go patch that file, then this patch can follow whatever > the locking rules in the patched version are. But until then, the > patch should follow *the actual rules* not some other protocol based > on a hand-wavy explanation in an email someplace. Otherwise, you've > got the same sort of undocumented disaster-waiting-to-happen that you > keep complaining about in other parts of this patch.
We need fewer of > those, not more! But that's not what I'm asking for? I don't even know where you got the idea that I don't want this to be documented. I'm mainly asking for a comment explaining why the current behaviour is what it is. Because I don't think an *implicit* "normal WAL logging rules" is sufficient explanation, because all the locking here happens one or two layers away from the WAL logging site - so it's absolutely *NOT* obvious that that's the explanation. And I don't think any of the locking sites actually has comments explaining why the locks are acquired at that time (in fact, IIRC until the review some even only mentioned pinning, not locking). > > > Suppose you insert one record for the transaction which split in > > > block1 and 2. Now, before this block is actually going to the disk > > > the transaction committed and become all visible the undo logs are > > > discarded. It's possible that block 1 is completely discarded but > > > block 2 is not because it might have undo for the next transaction. > > > Now, during recovery (FPW is off) if block 1 is missing but block 2 is > > > their so we need to skip inserting undo for block 1 as it does not > > > exist. > > > > Hm. I'm quite doubtful this is a good idea. How will this not force us > > to a emit a lot more expensive durable operations while writing undo? > > And doesn't this reduce error detection quite remarkably? > > > > Thomas, Robert? > > I think you're going to need to spell out your assumptions in order > for me to be able to comment intelligently. This is another thing > that seems pretty normal to me. Generally, WAL replay might need to > recreate objects whose creation is not separately WAL-logged, and it > might need to skip operations on objects that have been dropped later > in the WAL stream and thus don't exist any more. This seems like an > instance of the latter pattern. There's no reason to try to put valid > data into pages that we know have been discarded, and both inserting > and discarding undo data need to be logged anyway. Yea, I was "intentionally" vague here. I didn't have a concrete scenario that I was concerned about, but it somehow didn't quite seem right, and I didn't encounter an explanation of why it's guaranteed to be safe. So more eyes seemed like a good idea. I'm not at all sure that there is an actual problem here - I'm mostly trying to understand this code, from the perspective of somebody reading it for the first time. I think what primarily makes me concerned is that it's not clear to me what guarantees that discard is the only reason for the block to potentially be missing. In contrast to most other similar cases, where WAL replay simply re-creates the objects when trying to replay an action affecting such an object, here we simply skip over the WAL-logged operation. So if e.g. the entire underlying UNDO file got lost, we neither re-create it with valid content, nor error out. Which means we have got to be absolutely sure that all undo files are created in a persistent manner, at their full size. And that there's no way that data could get lost, without forcing us to perform REDO up to at least the relevant point again. While it appears that we always WAL log the undo extension, I am not convinced the recovery interlock is strong enough.
For one, UndoLogDiscard() unlinks segments before WAL-logging their removal - which means if we crash after unlink() and before the XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in practice we might be fine, because there ought to be nobody still referencing that UNDO - but I don't think that's actually guaranteed as is). Nor do I see where we're updating minRecoveryLocation when replaying an XLOG_UNDOLOG_DISCARD, which means that a restart during recovery could be stopped before the discard has been replayed, leaving us with wrong UNDO, but allowing write access. Seems we'd at least need a few more XLogFlush() calls. > One idea we could consider, if it makes the code sufficiently simpler > and doesn't cost too much performance, is to remove the facility for > skipping over bytes to be written and instead write any bytes that we > don't really want to write to an entirely-fake buffer (e.g. a > backend-private page in a static variable). That seems a little silly > to me; I suspect there's a better way. I suspect so too. Greetings, Andres Freund
On Sat, Aug 17, 2019 at 9:35 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > I think for controlling that we need to put a limit on max prepared > > > undo? I am not sure any other way of limiting the number of buffers > > > because we must lock all the buffer in which we are going to insert > > > the undo record under one WAL logged operation. > > > > I heard that a number of times. But I still don't know why that'd > > actually be true. Why would it not be sufficient to just lock the buffer > > currently being written to, rather than all buffers? It'd require a bit > > of care updating the official current "logical end" of a log, but > > otherwise ought to not be particularly hard? Only one backend can extend > > the log after all, and until the log is externally visibily extended, > > nobody can read or write those buffers, no? > > Well, I don't understand why you're on about this. We've discussed it > a number of times but I'm still confused. I'll repeat my previous > arguments on-list: > > 1. It's absolutely fine to just put a limit on this, because the > higher-level facilities that use this shouldn't be doing a single > WAL-logged operation that touches a zillion buffers. We have been > careful to avoid having WAL-logged operations touch an unbounded > number of buffers in plenty of other places, like the btree code, and > we are going to have to be similarly careful here for multiple > reasons, deadlock avoidance being one. So, saying, "hey, you're going > to lock an unlimited number of buffers" is a straw man. We aren't. > We can't. Right. So basically, we need to put a limit on how many undo records can be prepared under a single WAL-logged operation, and that will internally put a limit on the number of undo buffers. Suppose we limit max_prepared_undo to 2; then we need to lock at most 5 undo buffers. We also need to deal with the multi-insert code in zheap, because there, when inserting into a single page, we write one undo record per range if the tuples we are inserting on that page are interleaved. But maybe we can handle that by just inserting one undo record which can have multiple ranges. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
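A cap of that kind can be fairly mechanical; a sketch (the constant, the field name, and the error text are illustrative only, not taken from the patch):

    #define MAX_PREPARED_UNDO   2   /* undo records per WAL-logged operation */

    if (context->nprepared_undo >= MAX_PREPARED_UNDO)
        elog(ERROR, "too many undo records prepared for a single WAL record");

The per-record page-span limit times this constant then bounds the total number of undo buffers that can be locked at once.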
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres@anarazel.de> wrote: > > > > > - When reading an undo record, the whole stage of UnpackUndoData() > > > reading data into a the UndoPackContext is omitted, reading directly > > > into the UnpackedUndoRecord. That removes one further copy of the > > > record format. > > So we will read member by member to UnpackedUndoRecord? because in > > context we have at least a few headers packed and we can memcpy one > > header at a time like UndoRecordHeader, UndoRecordBlock. > > Well, right now you then copy them again later, so not much is gained by > that (although that later copy can happen without the content lock > held). As I think I suggested before, I suspect that the best way would > be to just memcpy() the data from the page(s) into an appropriately > sized buffer with the content lock held, and then perform unpacking > directly into UnpackedUndoRecord. Especially with the bulk API that will > avoid having to do much work with locks held, and reduce memory usage by > only unpacking the record(s) in a batch that are currently being looked > at. > > > > But that just a few of them so if we copy field by field in the > > UnpackedUndoRecord then we can get rid of copying in context then copy > > it back to the UnpackedUndoRecord. Is this is what in your mind or > > you want to store these structures (UndoRecordHeader, UndoRecordBlock) > > directly into UnpackedUndoRecord? > > I at the moment see no reason not to? Currently, in UnpackedUndoRecord we store directly all the members that are set by the caller. We store pointers to some headers that are allocated internally by the undo layer, so the caller need not worry about setting them. So now you are suggesting putting the other headers into UnpackedUndoRecord as structures as well. I don't have much of a problem with doing that, but Robert originally designed the UnpackedUndoRecord structure this way, so it would be good to have his opinion on it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 17, 2019 at 10:58 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > > > I think for controlling that we need to put a limit on max prepared > > > > undo? I am not sure any other way of limiting the number of buffers > > > > because we must lock all the buffer in which we are going to insert > > > > the undo record under one WAL logged operation. > > > > > > I heard that a number of times. But I still don't know why that'd > > > actually be true. Why would it not be sufficient to just lock the buffer > > > currently being written to, rather than all buffers? It'd require a bit > > > of care updating the official current "logical end" of a log, but > > > otherwise ought to not be particularly hard? Only one backend can extend > > > the log after all, and until the log is externally visibily extended, > > > nobody can read or write those buffers, no? > > > > Well, I don't understand why you're on about this. We've discussed it > > a number of times but I'm still confused. > > There's two reasons here: > > The primary one in the context here is that if we do *not* have to lock > the buffers all ahead of time, we can simplify the interface. We > certainly can't lock the buffers over IO (due to buffer reclaim) as > we're doing right now, so we'd need another phase, called by the "user" > during undo insertion. But if we do not need to lock the buffers before > the insertion over all starts, the inserting location doesn't have to > care. > > Secondarily, all the reasoning for needing to lock all buffers ahead of > time was imo fairly unconvincing. Following the "recipe" for WAL > insertions is a good idea when writing a new run-of-the-mill WAL > inserting location - but when writing a new fundamental facility, that > already needs to modify how WAL works, then I find that much less > convincing. > One point to remember in this regard is that we do need to modify the LSN in undo pages after writing WAL, so all the undo pages need to be locked by that time or we again need to take the lock on them. > > > 1. It's absolutely fine to just put a limit on this, because the > > higher-level facilities that use this shouldn't be doing a single > > WAL-logged operation that touches a zillion buffers. Right, by default a WAL log can only cover 4 buffers. If we need to touch more buffers, then the caller needs to call XLogEnsureRecordSpace. So, I agree with the point that generally, it should be few buffers (2 or 3) of undo that need to be touched in a single operation and if there are more, either callers need to change or at the very least they need to be careful about the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
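For callers that legitimately need to register more buffers than the default allows, the existing escape hatch is XLogEnsureRecordSpace(), which must be called outside any critical section, before constructing the record; a usage sketch (nbuffers is a placeholder):

    /*
     * Block IDs are zero-based, so registering nbuffers buffers requires
     * space for block IDs up to nbuffers - 1.  The second argument grows
     * the main-data chunk limit, which we leave at the default here.
     */
    XLogEnsureRecordSpace(nbuffers - 1, 0);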
On 2019-08-19 17:52:24 +0530, Amit Kapila wrote: > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2019-08-17 12:05:21 -0400, Robert Haas wrote: > > > On Wed, Aug 14, 2019 at 12:39 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Again, I think it's not ok to just assume you can lock an essentially > > > > > > unbounded number of buffers. This seems almost guaranteed to result in > > > > > > deadlocks. And there's limits on how many lwlocks one can hold etc. > > > > > > > > > > I think for controlling that we need to put a limit on max prepared > > > > > undo? I am not sure any other way of limiting the number of buffers > > > > > because we must lock all the buffer in which we are going to insert > > > > > the undo record under one WAL logged operation. > > > > > > > > I heard that a number of times. But I still don't know why that'd > > > > actually be true. Why would it not be sufficient to just lock the buffer > > > > currently being written to, rather than all buffers? It'd require a bit > > > > of care updating the official current "logical end" of a log, but > > > > otherwise ought to not be particularly hard? Only one backend can extend > > > > the log after all, and until the log is externally visibily extended, > > > > nobody can read or write those buffers, no? > > > > > > Well, I don't understand why you're on about this. We've discussed it > > > a number of times but I'm still confused. > > > > There's two reasons here: > > > > The primary one in the context here is that if we do *not* have to lock > > the buffers all ahead of time, we can simplify the interface. We > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > we're doing right now, so we'd need another phase, called by the "user" > > during undo insertion. But if we do not need to lock the buffers before > > the insertion over all starts, the inserting location doesn't have to > > care. > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > time was imo fairly unconvincing. Following the "recipe" for WAL > > insertions is a good idea when writing a new run-of-the-mill WAL > > inserting location - but when writing a new fundamental facility, that > > already needs to modify how WAL works, then I find that much less > > convincing. > > One point to remember in this regard is that we do need to modify the > LSN in undo pages after writing WAL, so all the undo pages need to be > locked by that time or we again need to take the lock on them. Well, my main point, which so far has largely been ignored, was that we may not acquire page locks when we still need to search for victim buffers later. If we don't need to lock the pages up-front, but only do so once we're actually copying the records into the undo pages, then we don't need a separate phase to acquire the locks. We can still hold all of the page locks at the same time, as long as we just acquire them at the later stage. My secondary point was that *none* of this actually is documented, even though it's entirely unobvious to the reader that the relevant code can only run during WAL insertion, due to being pretty far removed from that. Greetings, Andres Freund
On Tue, Aug 20, 2019 at 2:46 AM Andres Freund <andres@anarazel.de> wrote: > > On 2019-08-19 17:52:24 +0530, Amit Kapila wrote: > > On Sat, Aug 17, 2019 at 10:58 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > > Well, I don't understand why you're on about this. We've discussed it > > > > a number of times but I'm still confused. > > > > > > There's two reasons here: > > > > > > The primary one in the context here is that if we do *not* have to lock > > > the buffers all ahead of time, we can simplify the interface. We > > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > > we're doing right now, so we'd need another phase, called by the "user" > > > during undo insertion. But if we do not need to lock the buffers before > > > the insertion over all starts, the inserting location doesn't have to > > > care. > > > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > > time was imo fairly unconvincing. Following the "recipe" for WAL > > > insertions is a good idea when writing a new run-of-the-mill WAL > > > inserting location - but when writing a new fundamental facility, that > > > already needs to modify how WAL works, then I find that much less > > > convincing. > > > > > > > One point to remember in this regard is that we do need to modify the > > LSN in undo pages after writing WAL, so all the undo pages need to be > > locked by that time or we again need to take the lock on them. > > Well, my main point, which so far has largely been ignored, was that we > may not acquire page locks when we still need to search for victim > buffers later. If we don't need to lock the pages up-front, but only do > so once we're actually copying the records into the undo pages, then we > don't a separate phase to acquire the locks. We can still hold all of > the page locks at the same time, as long as we just acquire them at the > later stage. > Okay, IIUC, this means that we should have a separate phase where we call LockUndoBuffers (or something like that) before InsertPreparedUndo and after PrepareUndoInsert. The LockUndoBuffers will lock all the buffers pinned during PrepareUndoInsert. We can probably call LockUndoBuffers before entering the critical section to avoid any kind of failure in critical section. If so, that sounds reasonable to me. > My secondary point was that *none* of this actually is > documented, even if it's entirely unobvious to the reader that the > relevant code can only run during WAL insertion, due to being pretty far > removed from that. > I think this can be clearly mentioned in README or someplace else. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
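In outline, the sequence Amit proposes would look like the sketch below; LockUndoBuffers() is hypothetical, while PrepareUndoInsert() and InsertPreparedUndo() are from the patch set (their argument lists are elided here). Note that Robert argues downthread for taking the locks inside the critical section instead, which would fold the locking into InsertPreparedUndo():

    /* Pin buffers and reserve undo space; may perform I/O, so no locks yet. */
    urecptr = PrepareUndoInsert(...);

    /*
     * Hypothetical separate phase: lock all the buffers pinned above.
     * Done before the critical section so a failure here cannot PANIC.
     */
    LockUndoBuffers(...);

    START_CRIT_SECTION();
    InsertPreparedUndo(...);        /* copy records into the locked pages */
    recptr = XLogInsert(...);       /* then stamp pages with the LSN, unlock */
    END_CRIT_SECTION();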
Hello, Aside from code changes based on review (and I have more to come of those), the attached experimental patchset (also at https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol that, I hope, allows for better concurrency, reliability and readability, and removes a bunch of TODO notes about questionable interlocking. However, I'm not quite done figuring out if the bufmgr interaction is right and will be manageable on the undoaccess side, so I'm hoping to get some feedback, not asking for anyone to rebase on top of it yet. Previously, there were two LWLocks used to make sure that all discarding was prevented while anyone was reading or writing data in any part of an undo log, and (probably more problematically) vice versa. Here's a new approach that removes that blocking: 1. Anyone is allowed to try to read or write data at any UndoRecPtr that has been allocated, through the buffer pool (though you'd usually want to check it with UndoRecPtrIsDiscarded() first, and only rely on the system I'm describing to deal with races). 2. ReadBuffer() might return InvalidBuffer. This can happen for a cache miss, if the smgrread implementation wants to indicate that the buffer has been discarded/truncated and that is expected (md.c won't ever do that, but undofile.c can). 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently unpinned buffers, and marks as BM_DISCARDED any that happen to be pinned right now, so they can't be immediately invalidated. Such buffers are never written back and are eligible for reuse on the next clock sweep, even if they're written into by a backend that managed to do that when we were trying to discard. 4. In order to make this work, I needed to track an extra offset 'begin' that says what physical storage exists. So [begin, end) give you the range of physical undo space (that is, files that exist on disk) and [discard, insert) give you the range of active data within it. There are now four offsets per log in shm and in the pg_stat_undo_logs view. 5. Separating begin from discard allows the WAL logging for UndoLogDiscard() to do filesystem actions before logging, and other effects after logging, which have several nice properties if you work through the various crash scenarios. This allowed a lot of direct UndoLogSlot access and locking code to be removed from undodiscard.c and undoaccess.c, because now they can just proceed as normal, as long as they are prepared to give up whenever the buffer manager tells them the buffer they're asking for has evaporated. Once they've pinned a buffer, they don't need to care if it becomes (or already was) BM_DISCARDED; attempts to dirty it will be silently ignored and eventually it'll be reclaimed. It also gets rid of 'oldest_data', which was another scheme tagging along behind the discard pointer. So now I'd like to get feedback on the sanity of this scheme. I'm not saying it doesn't have bugs right now -- I've been trying to figure out good ways to test it and I'm not quite there yet -- but it's the concept I'd like feedback on. One observation I have is that there were already code paths in undoaccess.c that can tolerate InvalidBuffer in recovery, due to the potentially different discard timing for DO vs REDO. I think that's a point in favour of this scheme, but I can see that it's inconvenient to have to deal with InvalidBuffer whenever you read. Some other changes in this patch set: 1. There is a superuser-only procedure pg_force_discard_undo(logno) that can discard on command.
This can be used to get a system unwedged if rollback actions are failing. For example, if you insert an elog(ERROR, "boo!") into smgr_undo() and then roll back a table creation, you'll see a discard worker repeatedly reporting the error, and pg_stat_undo_logs will show that the undo log space never gets freed. This can be fixed with CALL pg_force_discard_undo(<logno>). 2. There is a superuser-only, testing-only procedure pg_force_switch_undo(logno) that can be used to force a transaction that is currently writing to that log number to switch to a new one, as if it had hit the end of the undo log (the 1TB address space within each undo log). This is good for exercising code that e.g. rolls back stuff spread over two undo logs. 3. When I was removing oldest_data from UndoLogSlot, I started wondering why wait_fxmin was in there, as it was almost the last reason why discard worker code needed to know about slots. Since we currently have only a single discard worker, and no facility for coordinating more than one discard worker, I think its bookkeeping might as well be backend local. Here I made the stupidest change that would work: a hash table to hold per-logno wait_fxmin. I'm not entirely sure what data structure we really want for this -- it's all a bit brute force right now. Thoughts? I pulled in the latest code from undoprocessing as of today, and I might be a bit confused about "Defect and enhancement in multi-log support", some of which I have squashed into the make undolog patch. BTW undoprocessing builds with uninitialized variable warnings in xact.c on clang today. -- Thomas Munro https://enterprisedb.com
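From a caller's perspective, reading under this protocol might look like the following sketch; UndoRecPtrIsDiscarded() is from the patch set, while undo_read_buffer() stands in for whatever the undoaccess layer actually uses to go through the buffer pool:

    /* Advisory early-out; a concurrent discard can still beat us. */
    if (UndoRecPtrIsDiscarded(urecptr))
        return false;

    buffer = undo_read_buffer(urecptr);     /* may return InvalidBuffer */
    if (!BufferIsValid(buffer))
    {
        /*
         * undofile.c signalled that this page was discarded or truncated
         * underneath us.  That is expected under the new protocol, so
         * treat the data as gone rather than erroring out.
         */
        return false;
    }

The invariant behind the four offsets is begin <= discard <= insert <= end: [begin, end) is the physical storage on disk, [discard, insert) the live data within it.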
On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote: > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > I don't think we can normally pin the undo buffers properly at that > stage. Without knowing the correct contents of the table page - which we > can't know without holding some form of lock preventing modifications - > we can't know how big our undo records are going to be. And we can't > just have buffers that don't exist on disk in shared memory, and we > don't want to allocate undo that we then don't need. So I think what > we'd have to do at that stage, is to "pre-allocate" buffers for the > maximum amount of UNDO needed, but mark the associated bufferdesc as not > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID > would not be set. > > So at the start of a function that will need to insert undo we'd need to > pre-reserve the maximum number of buffers we could potentially > need. That reservation stage would > > a) pin the page with the current end of the undo > b) if needed pin the page of older undo that we need to update (e.g. to > update the next pointer) > c) perform clock sweep etc to acquire (find or create) enough clean to > hold the maximum amount of undo needed. These buffers would be marked > as !BM_TAG_VALID | BUF_REFCOUNT_ONE. > > I assume that we'd make a) cheap by keeping it pinned for undo logs that > a backend is actively attached to. b) should only be needed once in a > transaction, so it's not too bad. c) we'd probably need to amortize > across multiple undo insertions, by keeping the unused buffers pinned > until the end of the transaction. > > I assume that having the infrastructure c) might also make some code > for already in postgres easier. There's obviously some issues around > guaranteeing that the maximum number of such buffers isn't high. I have analyzed this further, and I think there is a problem if the record(s) will not fit into the current undo log and we have to switch logs. Before knowing the actual record length we are not sure whether the undo log will switch or not, or which undo log we will get. And without knowing the logno (rnode), how are we going to pin the buffers? Am I missing something? Thomas, do you think we can get around this problem? Apart from this, while analyzing other code I have noticed that the current PG code has a few places where we try to read a buffer while already holding a buffer lock.

1. In gistplacetopage:

    {
        ...
        for (; ptr; ptr = ptr->next)
        {
            /* Allocate new page */
            ptr->buffer = gistNewBuffer(rel);
            GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
            ptr->page = BufferGetPage(ptr->buffer);
            ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
        }
        ...
    }

2. During a page split we find a new buffer while holding the lock on the current buffer. That doesn't mean we can't do better, but I am just pointing at existing code that already has such issues. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
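Restating the quoted reservation scheme as pseudo-code may help; every function name here is hypothetical, and Dilip's log-switch objection (the target logno cannot be known before the record size is known) is exactly the part this sketch glosses over:

    /* (a) page at the current insertion point: kept pinned while the
     *     backend stays attached to this undo log, so usually free */
    end_buf = PinCurrentUndoEndPage(log);

    /* (b) older undo page whose next-pointer must be updated: needed at
     *     most once per transaction */
    prev_buf = PinPreviousUndoPage(log);

    /*
     * (c) clock-sweep enough clean victim buffers for the worst case;
     * each is pinned (refcount 1) but has no valid tag (!BM_TAG_VALID),
     * so it maps to no disk block yet and no other backend can find it.
     * Unused ones stay pinned and are reused by later undo insertions.
     */
    for (i = 0; i < max_pages_needed; i++)
        reserved[i] = ReserveUntaggedVictimBuffer();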
On Sat, Aug 17, 2019 at 1:28 PM Andres Freund <andres@anarazel.de> wrote: > The primary one in the context here is that if we do *not* have to lock > the buffers all ahead of time, we can simplify the interface. We > certainly can't lock the buffers over IO (due to buffer reclaim) as > we're doing right now, so we'd need another phase, called by the "user" > during undo insertion. But if we do not need to lock the buffers before > the insertion over all starts, the inserting location doesn't have to > care. > > Secondarily, all the reasoning for needing to lock all buffers ahead of > time was imo fairly unconvincing. Following the "recipe" for WAL > insertions is a good idea when writing a new run-of-the-mill WAL > inserting location - but when writing a new fundamental facility, that > already needs to modify how WAL works, then I find that much less > convincing. So, you seem to be talking about something here which is different than what I thought we were talking about. One question is whether we need to lock all of the buffers "ahead of time," and I think the answer to that question is probably "no." Since nobody else can be writing to those buffers, and probably also nobody can be reading them except maybe for some debugging tool, it should be fine if we enter the critical section and then lock them at the point when we write the bytes. I mean, there shouldn't be any contention, and I don't see any other problems. The other question is whether need to hold all of the buffer locks at the same time, and that seems a lot more problematic to me. It's hard to say exactly whether this unsafe, because it depends on exactly what you think we're doing here, and I don't see that you've really spelled that out. The normal thing to do is call PageSetLSN() on every page before releasing the buffer lock, and that means holding all the buffer locks until after we've called XLogInsert(). Now, you could argue that we should skip setting the page LSN because the data ahead of the insertion pointer is effectively invisible anyway, but I bet that causes problems with checksums, at least, since they rely on the page LSN being accurate to know whether to emit WAL when a buffer is written. You could argue that we could do the XLogInsert() first and only after that lock and dirty the pages one by one, but I think that might break checkpoint interlocking, since it would then be possible for the checkpoint scan to pass over a buffer that does not appear to need writing for the current checkpoint but later gets dirtied and stamped with an LSN that would have caused it to be written had it been there at the time the checkpoint scan reached it. I really can't rule out the possibility that there's some way to make something in this area work, but I don't know what it is, and I think it's a fairly risky area to go tinkering. > Well, in the version of code that I was reviewing here, I don't there is > such a limit (there is a limit for buffers per undo record, but no limit > on the number of records inserted together). I think Dilip added a limit > since. And we have the issue of a lot of IO happening while holding > content locks on several pages. So I don't think it's a straw man at > all. Hmm, what do you mean by "a lot of IO happening while holding content locks on several pages"? We might XLogInsert() but there shouldn't be any buffer I/O going on at that point. If there is, I think that should be redesigned. We should collect buffer pins first, without locking. Then lock. Then write. 
Or maybe lock-and-write, but only after everything's pinned. The fact of calling XLogInsert() while holding buffer locks is not great, but I don't think it's any worse here than in any other part of the system, because the undo buffers aren't going to be suffering concurrent access from any other backend, and because there shouldn't be more than a few of them. > > 2. The write-ahead logging protocol says that you're supposed to lock > > all the buffers at once. See src/backend/access/transam/README. If > > you want to go patch that file, then this patch can follow whatever > > the locking rules in the patched version are. But until then, the > > patch should follow *the actual rules* not some other protocol based > > on a hand-wavy explanation in an email someplace. Otherwise, you've > > got the same sort of undocumented disaster-waiting-to-happen that you > > keep complaining about in other parts of this patch. We need fewer of > > those, not more! > > But that's not what I'm asking for? I don't even know where you got the > idea that I don't want this to be documented. I'm mainly asking for a > comment explaining why the current behaviour is what it is. Because I > don't think an *implicit* "normal WAL logging rules" is sufficient > explanation, because all the locking here happens one or two layers away > from the WAL logging site - so it's absolutely *NOT* obvious that that's > the explanation. And I don't think any of the locking sites actually has > comments explaining why the locks are acquired at that time (in fact, > IIRC until the review some even only mentioned pinning, not locking). I didn't intend to suggest that you don't want this to be documented. What I intended to suggest was that you seem to want to deviate from the documented rules, and it seems to me that we shouldn't do that unless we change the rules first, and I don't know what you think the rules should be or why those rules are safe. I think I basically agree with you about the rest of this: the API needs to be non-confusing and adequately documented, and it should avoid acquiring buffer locks until we have all the relevant pins. > I think what primarily makes me concerned is that it's not clear to me > what guarantees that discard is the only reason for the block to > potentially be missing. In contrast to most other similar cases, where WAL > replay simply re-creates the objects when trying to replay an action > affecting such an object, here we simply skip over the WAL-logged > operation. So if e.g. the entire underlying UNDO file got lost, we > neither re-create it with valid content, nor error out. Which means we > have got to be absolutely sure that all undo files are created in a > persistent manner, at their full size. And that there's no way that data > could get lost, without forcing us to perform REDO up to at least the > relevant point again. I think the crucial question for me here is the extent to which we're cross-checking against the discard pointer. If we're like, "oh, this undo data isn't on disk any more, it must've already been discarded, let's ignore the write," that doesn't sound particularly great, because files sometimes go missing. But, if we're like, "oh, we dirtied this undo buffer but now that undo has been discarded so we don't need to write the data back to the backing file," that seems fine. The discard pointer is a fully durable, WAL-logged thing; if it's somehow wrong, we have got huge problems anyway.
> While it appears that we always WAL log the undo extension, I am not > convinced the recovery interlock is strong enough. For one > UndoLogDiscard() unlinks segments before WAL logging their removal - > which means if we crash after unlink() and before the > XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in > practice we might be fine, because there ought to be nobody still > referencing that UNDO - but I don't think that's actually guaranteed as > is). Hmm, that sounds a little worrying. I think there are two options here: unlike what we do with buffers, where we can use buffer locking etc. to make the insertion of the WAL record effectively simultaneous with the changes to the data pages, the removal of old undo files has to happen either before or after XLogInsert(). I think "after" would be better. If we do it before, then upon starting up, we have to accept that there might be undo which is not officially discarded which nevertheless no longer exists on disk; but that might also cause us to ignore real corruption. If we do it after, then we can just treat it as a non-critical cleanup that can be performed lazily and at leisure: at any time, without warning, the system may choose to remove any or all undo backing files all of whose address space is discarded. If we fail to remove files, we can just emit a WARNING and maybe retry later at some convenient point in time, or perhaps even just accept that we'll leak the file in that case. > Nor do I see where we're updating minRecoveryLocation when > replaying a XLOG_UNDOLOG_DISCARD, which means that a restart during > recovery could be stopped before the discard has been replayed, leaving > us with wrong UNDO, but allowing write acess. Seems we'd at least need a > few more XLogFlush() calls. That sounds like a problem, but it seems like it might be better to make sure that minRecoveryLocation gets bumped, rather than adding XLogFlush() calls. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
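The "after" ordering Robert favours could look like the following sketch; XLOG_UNDOLOG_DISCARD is the patch's record type, while the record struct and the segment-path helper are illustrative assumptions:

    /* Make the discard durable first. */
    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, sizeof(xlrec));
    recptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
    XLogFlush(recptr);

    /*
     * Only then remove the backing files.  A failure here merely leaks
     * disk space and can be retried at leisure, so a WARNING suffices.
     */
    for (segno = first_old_segno; segno < new_discard_segno; segno++)
    {
        UndoLogSegmentPath(logno, segno, path);     /* hypothetical helper */
        if (unlink(path) != 0)
            ereport(WARNING,
                    (errcode_for_file_access(),
                     errmsg("could not remove undo segment \"%s\": %m",
                            path)));
    }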
On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Currently, In UnpackedUndoRecord we store all members directly which > are set by the caller. We store pointers to some header which are > allocated internally by the undo layer and the caller need not worry > about setting them. So now you are suggesting to put other headers > also as structures in UnpackedUndoRecord. I as such don't have much > problem in doing that but I think initially Robert designed > UnpackedUndoRecord structure this way so it will be good if Robert > provides his opinion on this. I don't believe that's what is being suggested. It seems to me that the thing Andres is complaining about here has roots in the original sketch that I did for this code. The oldest version I can find is here: https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c In this version, and I think still in the current version, there is a two-stage marshaling strategy. First, the individual fields from the UnpackedUndoRecord get copied into global variables (yes, that was my fault, too, at least in part!) which are structures. Then, the structures get copied into the target buffer. The idea of that design was to keep the code simple, but it didn't really work out, because things got a lot more complicated between the time I wrote those 3244 lines of code and the >3000 lines of code that live in this patch today. One thing that changed was that we moved more and more in the direction of considering individual fields as separate objects to be separately included or excluded, whereas when I wrote that code I thought we were going to have groups of related fields that stood or fell together. That idea turned out to be wrong. (There is the even-larger question here of whether we ought to take Heikki's suggestion and make this whole thing a lot more generic, but let's start by discussing how the design that we have today could be better implemented.) If I understand Andres correctly, he's arguing that we ought to get rid of the two-stage marshaling strategy. During decoding, he wants data to go directly from the buffer that contains it to the UnpackedUndoRecord without ever being stored in the UnpackUndoContext. During insertion, he wants data to go directly from the UnpackedUndoRecord to the buffer that contains it. Or, at least, if there has to be an intermediate format, he wants it to be just a chunk of raw bytes, rather than a bunch of individual fields like we have in UndoPackContext currently. I think that's a reasonable goal. I'm not as concerned about it as he is from a performance point of view, but I think it would make the code look nicer, and that would be good. If we save CPU cycles along the way, that is also good. In broad outline, what this means is: 1. Any field in the UndoPackContext that starts with urec_ goes away. 2. Instead of something like InsertUndoBytes((char *) &(ucontext->urec_fxid), ...) we'd write InsertUndoBytes((char *) &uur->uur_fxid, ...). 3. Similarly instead of ReadUndoBytes((char *) &ucontext->urec_fxid, ...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...). 4. It seems slightly trickier to handle the cases where we've got a structure instead of individual fields, like urec_hd. But those could be broken down into field-by-field reads and writes, e.g. in this case one call for urec_type and a second for urec_info. 5. For uur_group and uur_logswitch, the code would need to allocate those subsidiary structures before copying into them.
To me, that seems like it ought to be a pretty straightforward change that just makes things simpler. We'd probably need to pass the UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and keep a pointer to it in the UnpackUndoContext, but that seems fine. FinishUnpackUndo would end up just about empty, maybe entirely empty. Is that a reasonable idea? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
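Sketched out, the decoding side of that change could look like this; the uur_* and UREC_INFO_* names follow the patch's conventions, but the exact fields and the ReadUndoBytes() signature are assumed here, not copied from it:

    /* BeginUnpackUndo remembers the caller's record as the target. */
    BeginUnpackUndo(&context, uur);

    /* Header fields go straight into the record, one field at a time. */
    ReadUndoBytes((char *) &uur->uur_type, sizeof(uur->uur_type), &context);
    ReadUndoBytes((char *) &uur->uur_info, sizeof(uur->uur_info), &context);
    ReadUndoBytes((char *) &uur->uur_fxid, sizeof(uur->uur_fxid), &context);

    /* Optional chunks allocate their subsidiary struct, then fill it in. */
    if (uur->uur_info & UREC_INFO_GROUP)
    {
        uur->uur_group = palloc(sizeof(UndoRecordGroup));
        ReadUndoBytes((char *) uur->uur_group, sizeof(UndoRecordGroup),
                      &context);
    }

    /* FinishUnpackUndo(&context) is now (nearly) a no-op. */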
On Mon, Aug 19, 2019 at 8:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > One point to remember in this regard is that we do need to modify the > LSN in undo pages after writing WAL, so all the undo pages need to be > locked by that time or we again need to take the lock on them. Uh, but a big part of the point of setting the LSN on the pages is to keep them from being written out before the corresponding WAL is flushed to disk. If you released and reacquired the lock, the page could be written out during the window in the middle. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 19, 2019 at 5:16 PM Andres Freund <andres@anarazel.de> wrote: > Well, my main point, which so far has largely been ignored, was that we > may not acquire page locks when we still need to search for victim > buffers later. If we don't need to lock the pages up-front, but only do > so once we're actually copying the records into the undo pages, then we > don't a separate phase to acquire the locks. We can still hold all of > the page locks at the same time, as long as we just acquire them at the > later stage. +1 for that approach. I am in complete agreement. > My secondary point was that *none* of this actually is > documented, even if it's entirely unobvious to the reader that the > relevant code can only run during WAL insertion, due to being pretty far > removed from that. +1 also for properly documenting stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Well, my main point, which so far has largely been ignored, was that we > > may not acquire page locks when we still need to search for victim > > buffers later. If we don't need to lock the pages up-front, but only do > > so once we're actually copying the records into the undo pages, then we > > don't a separate phase to acquire the locks. We can still hold all of > > the page locks at the same time, as long as we just acquire them at the > > later stage. > > Okay, IIUC, this means that we should have a separate phase where we > call LockUndoBuffers (or something like that) before > InsertPreparedUndo and after PrepareUndoInsert. The LockUndoBuffers > will lock all the buffers pinned during PrepareUndoInsert. We can > probably call LockUndoBuffers before entering the critical section to > avoid any kind of failure in critical section. If so, that sounds > reasonable to me. I'm kind of scratching my head here, because this is clearly different than what Andres said in the quoted text to which you were replying. He clearly implied that we should acquire the buffer locks within the critical section during InsertPreparedUndo, and you responded by proposing to do it outside the critical section in a separate step. Regardless of which way is actually better, when somebody says "hey, let's do A!" and you respond by saying "sounds good, I'll go implement B!" that's not really helping us to get toward a solution. FWIW, although I also thought of doing what you are describing here, I think Andres's proposal is probably preferable, because it's simpler. There's not really any reason why we can't take the buffer locks from within the critical section, and that way callers don't have to deal with the extra step. > > My secondary point was that *none* of this actually is > > documented, even if it's entirely unobvious to the reader that the > > relevant code can only run during WAL insertion, due to being pretty far > > removed from that. > > I think this can be clearly mentioned in README or someplace else. It also needs to be adequately commented in the files and functions involved. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-08-20 09:08:29 -0400, Robert Haas wrote: > On Sat, Aug 17, 2019 at 1:28 PM Andres Freund <andres@anarazel.de> wrote: > > The primary one in the context here is that if we do *not* have to lock > > the buffers all ahead of time, we can simplify the interface. We > > certainly can't lock the buffers over IO (due to buffer reclaim) as > > we're doing right now, so we'd need another phase, called by the "user" > > during undo insertion. But if we do not need to lock the buffers before > > the insertion over all starts, the inserting location doesn't have to > > care. > > > > Secondarily, all the reasoning for needing to lock all buffers ahead of > > time was imo fairly unconvincing. Following the "recipe" for WAL > > insertions is a good idea when writing a new run-of-the-mill WAL > > inserting location - but when writing a new fundamental facility, that > > already needs to modify how WAL works, then I find that much less > > convincing. > > So, you seem to be talking about something here which is different > than what I thought we were talking about. One question is whether we > need to lock all of the buffers "ahead of time," and I think the > answer to that question is probably "no." Since nobody else can be > writing to those buffers, and probably also nobody can be reading them > except maybe for some debugging tool, it should be fine if we enter > the critical section and then lock them at the point when we write the > bytes. I mean, there shouldn't be any contention, and I don't see any > other problems. Right. As long as we are as restrictive about the number of buffers per undo record, and the number of records per WAL insertion, I don't see any need to go further than that. > > Well, in the version of code that I was reviewing here, I don't think there is > > such a limit (there is a limit for buffers per undo record, but no limit > > on the number of records inserted together). I think Dilip added a limit > > since. And we have the issue of a lot of IO happening while holding > > content locks on several pages. So I don't think it's a straw man at > > all. > > Hmm, what do you mean by "a lot of IO happening while holding content > locks on several pages"? We might XLogInsert() but there shouldn't be > any buffer I/O going on at that point. That's my primary complaint with how the code is structured right now. Right now we potentially perform IO while holding exclusive content locks, often multiple ones even. When acquiring target pages for undo, we currently already hold the table page exclusively locked, and if there's more than one buffer for undo, we'll also hold the previous buffers locked. And acquiring a buffer will often have to write out a dirty buffer to the OS, and a lot of times that will then also require the kernel to flush data out. That's imo an absolute no-go for the general case. > If there is, I think that should be redesigned. We should collect > buffer pins first, without locking. Then lock. Then write. Right. It's easy enough to do that for the locks on undo pages themselves. The harder part is the content lock on the "table page" - we don't accurately know how many undo buffers we will need, without holding the table lock (or preventing modifications in some other manner).
I tried to outline the problem and potential solutions in more detail in: https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de > The fact of calling XLogInsert() while holding buffer locks is not > great, but I don't think it's any worse here than in any other part of > the system, because the undo buffers aren't going to be suffering > concurrent access from any other backend, and because there shouldn't > be more than a few of them. Yea. That's obviously undesirable, but also fundamentally required at least in the general case. And it's not at all specific to undo. > [ WAL logging protocol ] > I didn't intend to suggest that you don't want this to be documented. > What I intended to suggest was that you seem to want to deviate from > the documented rules, and it seems to me that we shouldn't do that > unless we change the rules first, and I don't know what you think the > rules should be or why those rules are safe. IDK. We have at least five different places that at the very least bend the rules - but with a comment explaining why it's safe in the specific case. Personally I don't really think the generic guideline needs to list every potential edge-case. > > I think what primarily makes me concerned is that it's not clear to me > > what guarantees that discard is the only reason for the block to > > potentially be missing. In contrast to most other similar cases, where WAL > > replay simply re-creates the objects when trying to replay an action > > affecting such an object, here we simply skip over the WAL-logged > > operation. So if e.g. the entire underlying UNDO file got lost, we > > neither re-create it with valid content, nor error out. Which means we > > have got to be absolutely sure that all undo files are created in a > > persistent manner, at their full size. And that there's no way that data > > could get lost, without forcing us to perform REDO up to at least the > > relevant point again. > > I think the crucial question for me here is the extent to which we're > cross-checking against the discard pointer. If we're like, "oh, this > undo data isn't on disk any more, it must've already been discarded, > let's ignore the write," that doesn't sound particularly great, > because files sometimes go missing. Right. > But, if we're like, "oh, we dirtied this undo buffer but now that undo > has been discarded so we don't need to write the data back to the > backing file," that seems fine. The discard pointer is a fully > durable, WAL-logged thing; if it's somehow wrong, we have got huge > problems anyway. There is some cross-checking against the discard pointer while reading, but it's not obvious to me that there is in all places. In particular for insertions. UndoGetBufferSlot() itself doesn't have a crosscheck afaict, and I don't see anything in InsertPreparedUndo() either. It's possible that somehow it's indirectly guaranteed, but if so, it'd be far from obvious. > > While it appears that we always WAL log the undo extension, I am not > > convinced the recovery interlock is strong enough. For one > > UndoLogDiscard() unlinks segments before WAL logging their removal - > > which means if we crash after unlink() and before the > > XLogInsert(XLOG_UNDOLOG_DISCARD) we'd theoretically be in trouble (in > > practice we might be fine, because there ought to be nobody still > > referencing that UNDO - but I don't think that's actually guaranteed as > > is). > > Hmm, that sounds a little worrying.
I think there are two options > here: unlike what we do with buffers, where we can use buffer locking > etc. to make the insertion of the WAL record effectively simultaneous > with the changes to the data pages, the removal of old undo files has > to happen either before or after XLogInsert(). I think "after" would > be better. Right. > > Nor do I see where we're updating minRecoveryLocation when > > replaying an XLOG_UNDOLOG_DISCARD, which means that a restart during > > recovery could be stopped before the discard has been replayed, leaving > > us with wrong UNDO, but allowing write access. Seems we'd at least need a > > few more XLogFlush() calls. > > That sounds like a problem, but it seems like it might be better to > make sure that minRecoveryLocation gets bumped, rather than adding > XLogFlush() calls. XLogFlush() so far is the way to update minRecoveryLocation:

    /*
     * During REDO, we are reading not writing WAL.  Therefore, instead of
     * trying to flush the WAL, we should update minRecoveryPoint instead. We
     * test XLogInsertAllowed(), not InRecovery, because we need checkpointer
     * to act this way too, and because when it tries to write the
     * end-of-recovery checkpoint, it should indeed flush.
     */
    if (!XLogInsertAllowed())
    {
        UpdateMinRecoveryPoint(record, false);
        return;
    }

I don't think there's currently any other interface available to redo functions to update minRecoveryLocation. And we already use XLogFlush() for that purpose in numerous redo routines. Greetings, Andres Freund
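For concreteness, a discard redo routine that bumps minRecoveryPoint through XLogFlush() might look roughly like this; the record type, its fields and UndoLogAdvanceDiscard() are placeholders rather than the actual patch code:

    void
    undolog_redo_discard(XLogReaderState *record)
    {
        xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record);

        /*
         * During REDO, XLogFlush() updates minRecoveryPoint rather than
         * flushing WAL, so a crash-restart of recovery cannot stop before
         * this discard has been replayed.
         */
        XLogFlush(record->EndRecPtr);

        UndoLogAdvanceDiscard(xlrec->logno, xlrec->discard);    /* placeholder */
    }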
Hi, On 2019-08-20 21:02:18 +1200, Thomas Munro wrote: > Aside from code changes based on review (and I have more to come of > those), the attached experimental patchset (also at > https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol > that, I hope, allows for better concurrency, reliability and > readability, and removes a bunch of TODO notes about questionable > interlocking. However, I'm not quite done figuring out if the bufmgr > interaction is right and will be manageable on the undoaccess side, so > I'm hoping to get some feedback, not asking for anyone to rebase on > top of it yet. > > Previously, there were two LWLocks used to make sure that all > discarding was prevented while anyone was reading or writing data in > any part of an undo log, and (probably more problematically) vice > versa. Here's a new approach that removes that blocking: > > 1. Anyone is allowed to try to read or write data at any UndoRecPtr > that has been allocated, through the buffer pool (though you'd usually > want to check it with UndoRecPtrIsDiscarded() first, and only rely on > the system I'm describing to deal with races). > > 2. ReadBuffer() might return InvalidBuffer. This can happen for a > cache miss, if the smgrread implementation wants to indicate that the > buffer has been discarded/truncated and that is expected (md.c won't > ever do that, but undofile.c can). Hm. This gives me a bit of a stomach ache. It somehow feels like a weird form of signalling. Can't quite put my finger on why it makes me feel queasy. > 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently > unpinned buffers, and marks as BM_DISCARDED any that happen to be > pinned right now, so they can't be immediately invalidated. Such > buffers are never written back and are eligible for reuse on the next > clock sweep, even if they're written into by a backend that managed to > do that when we were trying to discard. Hm. When is it legitimate for a backend to write into such a buffer? I guess that's about updating the previous transaction's next pointer? Or progress info? > 5. Separating begin from discard allows the WAL logging for > UndoLogDiscard() to do filesystem actions before logging, and other > effects after logging, which have several nice properties if you work > through the various crash scenarios. Hm. ISTM we always need to log before doing some filesystem operation (see also my recent complaint Robert and I are discussing at the bottom of [1]). It's just that we can have a separate stage afterwards? [1] https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com > So now I'd like to get feedback on the sanity of this scheme. I'm not > saying it doesn't have bugs right now -- I've been trying to figure > out good ways to test it and I'm not quite there yet -- but the > concept. One observation I have is that there were already code paths > in undoaccess.c that can tolerate InvalidBuffer in recovery, due to > the potentially different discard timing for DO vs REDO. I think > that's a point in favour of this scheme, but I can see that it's > inconvenient to have to deal with InvalidBuffer whenever you read. FWIW, I'm far from convinced that those are currently quite right. See discussion pointed to above. > I pulled in the latest code from undoprocessing as of today, and I > might be a bit confused about "Defect and enhancement in multi-log > support" some of which I have squashed into the make undolog patch. 
> BTW undoprocessing builds with uninitialized variable warnings in xact.c > on clang today. I've complained about the existence of that commit multiple times now. So far without any comments. Greetings, Andres Freund
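To make the cross-checking concern from earlier concrete, a read-path sketch that validates a missing page against the durable discard pointer could look like this; UndoReadBufferMaybeDiscarded() is a hypothetical wrapper, while UndoRecPtrIsDiscarded() is the check named in Thomas's protocol:

    Buffer      buf = UndoReadBufferMaybeDiscarded(urecptr);    /* hypothetical */

    if (!BufferIsValid(buf))
    {
        /*
         * The undo smgr signalled a discarded/truncated page.  Trust the
         * durable, WAL-logged discard pointer, not a merely-missing file.
         */
        if (!UndoRecPtrIsDiscarded(urecptr))
            elog(ERROR, "undo data unexpectedly missing at " UINT64_FORMAT, urecptr);
        return false;           /* legitimately discarded; caller skips */
    }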
Hi, On 2019-08-20 17:11:38 +0530, Dilip Kumar wrote: > On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <andres@anarazel.de> wrote: > > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote: > > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <andres@anarazel.de> wrote: > > > I don't think we can normally pin the undo buffers properly at that > > stage. Without knowing the correct contents of the table page - which we > > can't know without holding some form of lock preventing modifications - > > we can't know how big our undo records are going to be. And we can't > > just have buffers that don't exist on disk in shared memory, and we > > don't want to allocate undo that we then don't need. So I think what > > we'd have to do at that stage, is to "pre-allocate" buffers for the > > maximum amount of UNDO needed, but mark the associated bufferdesc as not > > yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID > > would not be set. > > > > So at the start of a function that will need to insert undo we'd need to > > pre-reserve the maximum number of buffers we could potentially > > need. That reservation stage would > > > > a) pin the page with the current end of the undo > > b) if needed pin the page of older undo that we need to update (e.g. to > > update the next pointer) > > c) perform clock sweep etc to acquire (find or create) enough clean buffers > > to hold the maximum amount of undo needed. These buffers would be marked > > as !BM_TAG_VALID | BUF_REFCOUNT_ONE. > > > > I assume that we'd make a) cheap by keeping it pinned for undo logs that > > a backend is actively attached to. b) should only be needed once in a > > transaction, so it's not too bad. c) we'd probably need to amortize > > across multiple undo insertions, by keeping the unused buffers pinned > > until the end of the transaction. > > > > I assume that having the infrastructure for c) might also make some code > > already in postgres easier. There are obviously some issues around > > guaranteeing that the maximum number of such buffers isn't high. > > I have analyzed this further, and I think there is a problem if the > record(s) will not fit into the current undo log and we will have to > switch the log. Because before knowing the actual record length, we > are not sure whether the undo log will switch or not and which undo > log we will get. And, without knowing the logno (rnode), how are we > going to pin the buffers? Am I missing something? That's precisely why I was suggesting (at the start of the quoted block above) to not associate the buffers with pages at that point. Instead just have clean, pinned, *unassociated* buffers. Which can be re-associated without any IO. > Apart from this, while analyzing the other code I have noticed that in > the current PG code we have a few occurrences where we try to read a > buffer while already holding a buffer lock. > 1. In gistplacetopage:
>
>     ...
>     for (; ptr; ptr = ptr->next)
>     {
>         /* Allocate new page */
>         ptr->buffer = gistNewBuffer(rel);
>         GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
>         ptr->page = BufferGetPage(ptr->buffer);
>         ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
>     }
>
> 2. During page split we find a new buffer while holding the lock on the > current buffer. > > That doesn't mean that we can't do better but I am just referring to > the existing code where we already have such issues. Those are pretty clearly edge-cases, whereas the undo case at hand is a very common path. Note again that heapam.c goes to considerable trouble to never do this for common cases.
Greetings, Andres Freund
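In outline, the reservation scheme sketched above might look like this; ReserveUnassociatedBuffer() and UndoAssignBufferTag() are invented names for this sketch, not the actual bufmgr API:

    /* Up front: collect enough clean, pinned, tag-less victim buffers. */
    for (int i = 0; i < max_undo_pages; i++)
        reserved[i] = ReserveUnassociatedBuffer();  /* may write out a dirty page */

    /*
     * Later, with the table page lock held and the record size known:
     * associate the buffers with concrete undo blocks.  No I/O can be
     * needed here, because the buffers are already clean and pinned.
     */
    for (int i = 0; i < npages_needed; i++)
        UndoAssignBufferTag(reserved[i], logno, firstblock + i);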
On 2019-08-20 09:44:23 -0700, Andres Freund wrote: > On 2019-08-20 21:02:18 +1200, Thomas Munro wrote: > > Aside from code changes based on review (and I have more to come of > > those), the attached experimental patchset (also at > > https://github.com/EnterpriseDB/zheap/tree/undo) has a new protocol > > that, I hope, allows for better concurrency, reliability and > > readability, and removes a bunch of TODO notes about questionable > > interlocking. However, I'm not quite done figuring out if the bufmgr > > interaction is right and will be manageable on the undoaccess side, so > > I'm hoping to get some feedback, not asking for anyone to rebase on > > top of it yet. > > > > Previously, there were two LWLocks used to make sure that all > > discarding was prevented while anyone was reading or writing data in > > any part of an undo log, and (probably more problematically) vice > > versa. Here's a new approach that removes that blocking: Oh, one more point I forgot to add: Cool!
On Tue, Aug 20, 2019 at 5:02 AM Thomas Munro <thomas.munro@gmail.com> wrote: > 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently > unpinned buffers, and marks as BM_DISCARDED any that happen to be > pinned right now, so they can't be immediately invalidated. Such > buffers are never written back and are eligible for reuse on the next > clock sweep, even if they're written into by a backend that managed to > do that when we were trying to discard. This is definitely more concurrent, but it might be *too much* concurrency. Suppose that backend #1 is inserting a row and updating the transaction header for the previous transaction; meanwhile, backend #2 is discarding the previous transaction. It could happen that backend #1 locks the transaction header for the previous transaction and is all set to log the insertion ... but then gets context-switched out. Now backend #2 swoops in and logs the discard. Backend #1 now wakes up and finishes logging a change to a page that, according to the logic of the WAL stream, no longer exists. It's probably possible to make this work by ignoring WAL references to discarded pages during replay, but that seems a bit dangerous. At least, it loses some sanity checking that you might like to have. It seems to me that you can avoid this if you require that a backend that wants to set BM_DISCARDED acquire at least a shared content lock before doing so. If you do that, then once a backend acquires content lock(s) on the page(s) containing the transaction header for the purposes of updating it, it can notice that the BM_DISCARDED flag is set and choose not to update those pages after all. I think that would be a smart design choice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
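A sketch of the interlock Robert proposes, using real bufmgr primitives but otherwise illustrative (BM_DISCARDED is the patchset's flag; the discarder/writer split is for this example only):

    BufferDesc *bufHdr;
    uint32      buf_state;

    /* Discarder: take at least a share content lock before flagging. */
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    bufHdr = GetBufferDescriptor(buf - 1);
    buf_state = LockBufHdr(bufHdr);
    UnlockBufHdr(bufHdr, buf_state | BM_DISCARDED);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    /* Writer: with the content lock held, the flag can be trusted. */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    if (pg_atomic_read_u32(&bufHdr->state) & BM_DISCARDED)
    {
        /* Transaction header already discarded; skip the update. */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        return;
    }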
On Tue, Aug 13, 2019 at 8:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > > We can probably check the fxid queue and error queue to get that > > value. However, I am not sure if that is sufficient because in case we > > perform the request in the foreground, it won't be present in queues. > > Oh, I forgot about that requirement. I think I can fix it so it does > that fairly easily, but it will require a little bit of redesign which > I won't have time to do this week. Here's a version with a quick (possibly buggy) prototype of the oldest-FXID support. It also includes a bunch of comment changes, pgindent, and a few other tweaks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Aug 20, 2019 at 7:57 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Aug 19, 2019 at 2:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Currently, In UnpackedUndoRecord we store all members directly which > > are set by the caller. We store pointers to some header which are > > allocated internally by the undo layer and the caller need not worry > > about setting them. So now you are suggesting to put other headers > > also as structures in UnpackedUndoRecord. I as such don't have much > > problem in doing that but I think initially Robert designed > > UnpackedUndoRecord structure this way so it will be good if Robert > > provides his opinion on this. > > I don't believe that's what is being suggested. It seems to me that > the thing Andres is complaining about here has roots in the original > sketch that I did for this code. The oldest version I can find is > here: > > https://github.com/EnterpriseDB/zheap/commit/7d194824a18f0c5e85c92451beab4bc6f044254c > > In this version, and I think still in the current version, there is a > two-stage marshaling strategy. First, the individual fields from the > UnpackedUndoRecord get copied into global variables (yes, that was my > fault, too, at least in part!) which are structures. Then, the > structures get copied into the target buffer. The idea of that design > was to keep the code simple, but it didn't really work out, because > things got a lot more complicated between the time I wrote those 3244 > lines of code and the >3000 lines of code that live in this patch > today. One thing that changed was that we moved more and more in the > direction of considering individual fields as separate objects to be > separately included or excluded, whereas when I wrote that code I > thought we were going to have groups of related fields that stood or > fell together. That idea turned out to be wrong. (There is the > even-larger question here of whether we ought to take Heikki's > suggestion and make this whole thing a lot more generic, but let's > start by discussing how the design that we have today could be > better-implemented.) > > If I understand Andres correctly, he's arguing that we ought to get > rid of the two-stage marshaling strategy. During decoding, he wants > data to go directly from the buffer that contains it to the > UnpackedUndoRecord without ever being stored in the UnpackUndoContext. > During insertion, he wants data to go directly from the > UnpackedUndoRecord to the buffer that contains it. Or, at least, if > there has to be an intermediate format, he wants it to be just a chunk > of raw bytes, rather than a bunch of individual fields like we have in > UndoPackContext currently. I think that's a reasonable goal. I'm not > as concerned about it as he is from a performance point of view, but I > think it would make the code look nicer, and that would be good. If > we save CPU cycles along the way, that is also good. > > In broad outline, what this means is: > > 1. Any field in the UndoPackContext that starts with urec_ goes away. > 2. Instead of something like InsertUndoBytes((char *) > &(ucontext->urec_fxid), ...) we'd write InsertUndobytes((char *) > &uur->uur_fxid, ...). > 3. Similarly instead of ReadUndoBytes((char *) &ucontext->urec_fxid, > ...) we'd write ReadUndoBytes((char *) &uur->uur_fxid, ...). > 4. It seems slightly trickier to handle the cases where we've got a > structure instead of individual fields, like urec_hd. But those could > be broken down into field-by-field reads and writes, e.g.
in this case > one call for urec_type and a second for urec_info. > 5. For uur_group and uur_logswitch, the code would need to allocate > those subsidiary structures before copying into them. > > To me, that seems like it ought to be a pretty straightforward change > that just makes things simpler. We'd probably need to pass the > UnpackedUndoRecord to BeginUnpackUndo instead of FinishUnpackUndo, and > keep a pointer to it in the UnpackUndoContext, but that seems fine. > FinishUnpackUndo would end up just about empty, maybe entirely empty. > > Is that a reasonable idea? > I have already attempted that part and I feel it is not making code any simpler than what we have today. For packing, it's fine because I can process all the member once and directly pack it into one memory chunk and I can insert that to the buffer by one call of InsertUndoBytes and that will make the code simpler. But, while unpacking if I directly unpack to the UnpackUndoRecord then there are few complexities. I am not saying those are difficult to implement but code may not look better. a) First, we need to add extra stages for unpacking as we need to do field by field. b) Some of the members like uur_payload and uur_tuple are not the same type in the UnpackUndoRecord compared to how it is stored in the page. In UnpackUndoRecord those are StringInfoData whereas on the page we store it as UndoRecordPayload header followed by the actual data. I am not saying we can not unpack this directly we can do it like, first read the payload length from the page in uur_payload.len then read tuple length in uur_tuple.len then read both the data. And, for that, we will have to add extra stages. c) Currently, in UnpackUndoContext the members are stored in the same order in which we are storing them to the page whereas in UnpackUndoRecord they are stored in the order such that they are more convenient for them to understand, like all the fields which are set by the caller are separate from the fields which are allocated internally by the undo layer (transaction header and the log switch header). Now, for directly unpacking to the UnpackUndoRecord, we need to read them out of order which will make code more unreadable. Another option could be that we unpack some part directly into the UnapackUndoRecord (individual fields) and other parts to UnpackUndoContext (structures, payload) and in Finalise only copy those parts from UnpackUndoContext to UnapackUndoRecord. The code might look bit confusing though. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 20, 2019 at 8:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Well, my main point, which so far has largely been ignored, was that we > > > may not acquire page locks when we still need to search for victim > > > buffers later. If we don't need to lock the pages up-front, but only do > > > so once we're actually copying the records into the undo pages, then we > > > don't need a separate phase to acquire the locks. We can still hold all of > > > the page locks at the same time, as long as we just acquire them at the > > > later stage. > > > > Okay, IIUC, this means that we should have a separate phase where we > > call LockUndoBuffers (or something like that) before > > InsertPreparedUndo and after PrepareUndoInsert. The LockUndoBuffers > > will lock all the buffers pinned during PrepareUndoInsert. We can > > probably call LockUndoBuffers before entering the critical section to > > avoid any kind of failure in the critical section. If so, that sounds > > reasonable to me. > > I'm kind of scratching my head here, because this is clearly different > than what Andres said in the quoted text to which you were replying. > He clearly implied that we should acquire the buffer locks within the > critical section during InsertPreparedUndo, and you responded by > proposing to do it outside the critical section in a separate step. > Regardless of which way is actually better, when somebody says "hey, > let's do A!" and you respond by saying "sounds good, I'll go implement > B!" that's not really helping us to get toward a solution. > I got confused by the statement "We can still hold all of the page locks at the same time, as long as we just acquire them at the later stage." > FWIW, although I also thought of doing what you are describing here, I > think Andres's proposal is probably preferable, because it's simpler. > There's not really any reason why we can't take the buffer locks from > within the critical section, and that way callers don't have to deal > with the extra step. > IIRC, the reason this was done before starting the critical section was because of the coding convention mentioned in src/access/transam/README (Section: Write-Ahead Log Coding). It says to first pin and exclusive-lock the shared buffers and then start the critical section. It might be that we can bypass that convention here, but I guess it is mainly to avoid any error in the critical section. I have checked the LWLockAcquire path and there doesn't seem to be any reason that it will throw an error except when the caller has acquired too many locks at the same time, which is not the case here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have already attempted that part and I feel it is not making the code > any simpler than what we have today. For packing, it's fine because I > can process all the members at once and directly pack them into one memory > chunk, and I can insert that into the buffer with one call of > InsertUndoBytes, and that will make the code simpler. OK... > But while unpacking, if I directly unpack into the UnpackedUndoRecord then > there are a few complexities. I am not saying those are difficult to > implement, but the code may not look better. > > a) First, we need to add extra stages for unpacking, as we need to do > it field by field. > > b) Some of the members, like uur_payload and uur_tuple, are not the same > type in the UnpackedUndoRecord as how they are stored on the page. > In the UnpackedUndoRecord those are StringInfoData, whereas on the page we > store them as an UndoRecordPayload header followed by the actual data. I > am not saying we cannot unpack this directly; we can do it like this: > first read the payload length from the page into uur_payload.len, then > read the tuple length into uur_tuple.len, then read both sets of data. And, for > that, we will have to add extra stages. I don't think that this is true; or at least, I think it's avoidable. The idea of an unpacking stage is that you refuse to advance to the next stage until you've got a certain number of bytes of data; and then you unpack everything that pertains to that stage. If you have 2 4-byte fields that you want to unpack together, you can just wait until you've got 8 bytes of data and then unpack both. You don't really need 2 separate stages. (Similarly, your concern about fields being in a different order seems like it should be resolved by agreeing on one ordering and having everything use it; I don't know why there should be one order that is better in memory and another order that is better on disk.) The bigger issue here is that we don't seem to be making very much progress toward improving the overall design. Several significant improvements have been suggested: 1. Thomas suggested quite some time ago that we should make sure that the transaction header is the first optional header. If we do that, then I'm not clear on why we even need this incremental unpacking stuff any more. The only reason for all of this machinery was so that we could find the transaction header at an unknown offset inside a complex record format; there is, if I understand correctly, no other case in which we want to incrementally decode a record. But if the transaction header is at a fixed offset, then there seems to be no need to even have incremental decoding at all. Or it can be very simple, with three stages: (1) we don't yet have enough bytes to figure out how big the record is; (2) we have enough bytes to figure out how big the record is and we have figured that out but we don't yet have all of those bytes; and (3) we have the whole record, we can decode the whole thing and we're done. 2. Based on a discussion with Thomas, I suggested the GHOB stuff, which gets rid of the idea of transaction headers inside individual records altogether; instead, there is one undo blob per transaction (or maybe several if we overflow to another undo log) which begins with a sentinel byte that identifies it as owned by a transaction, and then the transaction header immediately follows that without being part of any record, and the records follow that data.
As with the previous idea, this gets rid of the need for incremental decoding because it gets rid of the need to find the transaction header inside of a bigger record. As Thomas put it to me off-list, it puts the records inside of a larger chunk of space owned by the transaction instead of putting the transaction header inside of some particular record; that seems more correct than the way we have it now. 3. Heikki suggested that the format should be much more generic and that more should be left up to the AM. While neither Andres nor I are convinced that we should go as far in that direction as Heikki is proposing, the idea clearly has some merit, and we should probably be moving that way to some degree. For instance, the idea that we should store a block number and TID is a bit sketchy when you consider that a multi-insert operation really wants to store a TID list. The zheap tree has a ridiculous kludge to work around that problem; clearly we need something better. We've also mentioned that, in the future, we might want to support TIDs that are not 6 bytes, and that even just looking at what's currently under development, zedstore wants to treat TIDs as 48-bit opaque quantities, not a 4-byte block number and a 2-byte item pointer offset. So, there is clearly a need to go through the whole way we're doing this and rethink which parts are generic and which parts are AM-specific. 4. A related problem, which has been mentioned or at least alluded to by both Heikki and by me, is that we need a better way of handling the AM-specific data. Right now, the zheap code packs fixed-size things into the payload data and then finds them by knowing the offset where any particular data is within that field, but that's an unmaintainable mess. The zheap code could be improved by at least defining those offsets as constants someplace and adding some comments explaining the payload formats of various undo records, but even if we do that, it's not going to generalize very well to anything more complicated than a few fixed-size bits of data. I suggested using the pqformat stuff to try to structure that -- a suggestion to which Heikki has unfortunately not responded, because I'd really like to get his thoughts on it -- but whether we do that particular thing or not, I think we need to do something. In addition to wanting a better way of handling packing and unpacking for payload data, there's also a desire to have it participate in record compression, for which we don't seem to have any kind of plan. 5. Andres suggested multiple ideas for cleaning up and improving this code in https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de - which include the idea currently under discussion, several of the same ideas that I mentioned above, and a number of other things, such as making packing serialize to a char * rather than some ad-hoc intermediate format and having a metadata array over which we can loop rather than having multiple places where there's a separate bit of code for every field type. I don't think those suggestions are entirely unproblematic; for instance, the metadata array would probably work a lot better if we moved the transaction and log-switch headers outside of individual records as suggested in (2) above. Otherwise, the metadata would have to include not only data-structure offsets but some kind of a flag indicating which of several data structures ought to contain the relevant information, which would make the whole thing a lot messier.
And depending on what we do about (4), this might become moot or the details might change quite a bit, because if we no longer have a long list of "generic" fields, then we also won't have a bunch of places in the code that deal with that long list of generic fields, which means the metadata array might not be necessary, or might be simpler or smaller or designed differently. All of which is to make the point that responding to Andres's feedback will require a bunch of decisions about which parts of the feedback to take (because some of them are mutually exclusive, as he acknowledges himself) and what to do about them (because some of them are vague); yet, taken together, they seem to amount to the need for significant design changes, as do (1)-(4). Now, just to be clear, the code we're talking about here is mostly based on an original design by me, and whatever defects were present in that original design are nobody's fault but mine. And that list of defects includes pretty much everything in the above list. But, what we need to figure out at this point is how we're going to get those things fixed, and it seems to me that we're going to need a pretty substantial redesign, but this discussion is kind of down in the weeds. I mean, what are we gaining by arguing about how many stages we need for incremental unpacking if the real thing we need to do is get rid of that concept altogether? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
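For reference, the metadata-array idea would amount to something like the following sketch; all names are invented for illustration, and as noted above its usefulness depends on the other design decisions:

    typedef struct UndoFieldDesc
    {
        uint16      info_flag;  /* uur_info bit that enables this field */
        Size        offset;     /* offsetof() within UnpackedUndoRecord */
        Size        size;       /* number of bytes to copy */
    } UndoFieldDesc;

    static const UndoFieldDesc undo_fields[] =
    {
        {UREC_INFO_RMID, offsetof(UnpackedUndoRecord, uur_rmid), sizeof(RmgrId)},
        {UREC_INFO_RELOID, offsetof(UnpackedUndoRecord, uur_reloid), sizeof(Oid)},
        {UREC_INFO_BLOCK, offsetof(UnpackedUndoRecord, uur_block), sizeof(BlockNumber)},
        /* ... one entry per optional field ... */
    };

    /* Packing and unpacking then become a single loop over undo_fields[]. */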
On Wed, Aug 21, 2019 at 6:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > FWIW, although I also thought of doing what you are describing here, I > > think Andres's proposal is probably preferable, because it's simpler. > > There's not really any reason why we can't take the buffer locks from > > within the critical section, and that way callers don't have to deal > > with the extra step. > > IIRC, the reason this was done before starting critical section was > because of coding convention mentioned in src/access/transam/README > (Section: Write-Ahead Log Coding). It says first pin and exclusive > lock the shared buffers and then start critical section. It might be > that we can bypass that convention here, but I guess it is mainly to > avoid any error in the critical section. I have checked the > LWLockAcquire path and there doesn't seem to be any reason that it > will throw error except when the caller has acquired many locks at the > same time which is not the case here. Yeah, I think it's fine to deviate from that convention in this respect. We treat LWLockAcquire() as a no-fail operation in many places; in my opinion, that elog(ERROR) that we have for too many LWLocks should be changed to elog(PANIC) precisely because we do treat LWLockAcquire() as no-fail in lots of places in the code, but I think I suggested that once and got slapped down, and I haven't had the energy to fight about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On August 21, 2019 8:36:34 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote: > We treat LWLockAcquire() as a no-fail operation in many >places; in my opinion, that elog(ERROR) that we have for too many >LWLocks should be changed to elog(PANIC) precisely because we do treat >LWLockAcquire() as no-fail in lots of places in the code, but I think >I suggested that once and got slapped down, and I haven't had the >energy to fight about it. Fwiw, that proposal has my vote. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Aug 14, 2019 at 2:57 AM Andres Freund <andres@anarazel.de> wrote: > - My reading of the current xact.c integration is that it's not workable > as is. Undo is executed outside of a valid transaction state, > exceptions aren't properly undone, logic would need to be duplicated > to a significant degree, new kind of critical section. Regarding this particular point: ReleaseResourcesAndProcessUndo() is only supposed to be called after AbortTransaction(), and the additional steps it performs -- AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction. That's not crazy; the other steps in Cleanup(Sub)Transaction() look like stuff that's intended to be performed when we're totally done with this TransactionState stack entry, whereas these things are slightly higher-level cleanups that might even block undo (e.g. undropped portal prevents orphaned file cleanup). Granted, there are no comments explaining why those particular cleanup steps are performed here, and it's possible some other approach is better, but I think perhaps it's not quite as flagrantly broken as you think. I am also not convinced that semi-critical sections are a bad idea, although the if (SemiCritSectionCount > 0) test at the start of ReleaseResourcesAndProcessUndo() looks wrong. To roll back a subtransaction, we must perform undo in the foreground, and if that fails, the toplevel transaction can't be allowed to commit, full stop. Since we expect this to be a (very) rare scenario, I don't know why escalating to FATAL is a catastrophe. The only other option is to do something along the lines of SxactIsDoomed(), where we force all subsequent commits (and sub-commits?) within the toplevel xact to fail. You can argue that the latter is a better user experience, and for SSI I certainly agree, but this case isn't quite the same: there's a good chance we're dealing with a corrupted page or system administrator intervention to try to kill off a long-running undo task, and continuing in such cases seems a lot more dubious than after a serializability failure, where retrying is the expected recovery mode. The other case is where toplevel undo for a temporary table fails. It is unclear to me what, other than FATAL, could suffice there. I guess you could just let the session continue and leave the transaction undone, letting whatever MVCC machinery the table AM may have look through it, but that sounds inferior to me. Rip the bandaid off. Some general complaints from my side about the xact.c changes: 1. The code structure doesn't seem quite right. For example: 1a. ProcessUndoRequestForEachLogCat has a try/catch block, but it seems to me that the job of a try/catch block is to provide structured error-handling for resources for which there's no direct handling in xact.c or resowner.c. Here, we're inside of xact.c, so why are we adding a try/catch block? 1b. ReleaseResourcesAndProcessUndo does part of the work of cleaning up a failed transaction but not all of it, the rest being done by AbortTransaction, which is called before entering it, plus it also kicks off the actual undo work. I would expect a cleaner division of responsibility. 1c. Having an undo request per UndoLogCategory rather than one per transaction doesn't seem right to me; hopefully that will get cleaned up when the undorequest.c stuff I sent before is integrated. 1d.
The code at the end of FinishPreparedTransaction() seems to expect that the called code will catch any error, but that clearing the error state might need to happen here, and also that we should fire up a new transaction; I suspect, but am not entirely sure, that that is not the way it should work. The code added earlier in that function also looks suspicious, because it's filling up what is basically a high-level control function with a bunch of low-level undo-specific details. In both places, the undo-specific concerns probably need to be better-isolated. 2. Signaling is done using some odd-looking mechanisms. For instance: 2a. The SemiCritSectionCount > 0 test at the top of ReleaseResourcesAndProcessUndo that I complained about earlier looks like a guard against reentrancy, but that must be the wrong way to get there; it makes it impossible to reuse what is ostensibly a general-purpose facility for any non-undo related purpose without maybe breaking something. 2b. ResetUndoActionsInfo() is called from a bunch of places, but only 2 of those places have a comment explaining why, and the function comment is pretty unilluminating. This looks like some kind of signaling machinery, but it's not very clear to me what it's actually trying to do. 2c. ResourceOwnerReleaseInternal() is directly calling NeedToPerformUndoActions(), which feels like a layering violation. 2d. I'm not really sure that TRANS_UNDO is serving any useful purpose; I think we need TBLOCK_UNDO and TBLOCK_SUBUNDO, but I'm not really sure TRANS_UNDO is doing anything useful; the change to SubTransactionIsActive() looks wrong to me, and the other changes I think would mostly go away if we just used TRANS_INPROGRESS. 2e. I'm skeptical that the err_out_to_client() stuff is the right way to suppress undo failure messages from being sent to the client. That needs to be done, but this doesn't seem like the right way. This is related to my complaint above about using a try/catch block inside xact.c. 3. I noticed a few other mistakes when reading through this again which I include here for the sake of completeness: 3a. memset(..., InvalidUndoRecPtr, ...) will only happen to work if every byte of InvalidUndoRecPtr happens to have the same value. That happens to be true, because it's defined as 8 bytes of zeroes, but it's not OK to code it like this. 3b. "undoRequestResgistered" is a typo. 3c. GetEpochForXid definitely shouldn't exist any more... as has been reported in the past. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 21, 2019 at 9:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 21, 2019 at 3:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have already attempted that part and I feel it is not making the code > > any simpler than what we have today. For packing, it's fine because I > > can process all the members at once and directly pack them into one memory > > chunk, and I can insert that into the buffer with one call of > > InsertUndoBytes, and that will make the code simpler. > > OK... > > The bigger issue here is that we don't seem to be making very much > progress toward improving the overall design. Several significant > improvements have been suggested: > > 1. Thomas suggested quite some time ago that we should make sure that > the transaction header is the first optional header. I think this was already done at least 2-3 versions ago. So now we are updating the transaction header directly by writing at that offset, and we don't need staging for this. > If we do that, > then I'm not clear on why we even need this incremental unpacking > stuff any more. The only reason for all of this machinery was so that > we could find the transaction header at an unknown offset inside a > complex record format; there is, if I understand correctly, no other > case in which we want to incrementally decode a record. > But if the > transaction header is at a fixed offset, then there seems to be no > need to even have incremental decoding at all. Or it can be very > simple, with three stages: (1) we don't yet have enough bytes to > figure out how big the record is; (2) we have enough bytes to figure > out how big the record is and we have figured that out but we don't > yet have all of those bytes; and (3) we have the whole record, we can > decode the whole thing and we're done. We cannot know the complete size of the record even by reading the header, because we have a payload that is the variable part, and the payload length is stored in the payload header, which again can be at a random offset. But maybe we can still follow this idea, which will make unpacking far simpler. I have a few ideas: a) Store the payload header right after the transaction header so that we can easily know the complete record size. b) Once we decode the first header by uur_info, we can compute the exact offset of the payload header and from there we can know the record size. > > 2. Based on a discussion with Thomas, I suggested the GHOB stuff, > which gets rid of the idea of transaction headers inside individual > records altogether; instead, there is one undo blob per transaction > (or maybe several if we overflow to another undo log) which begins > with a sentinel byte that identifies it as owned by a transaction, and > then the transaction header immediately follows that without being > part of any record, and the records follow that data. As with the > previous idea, this gets rid of the need for incremental decoding > because it gets rid of the need to find the transaction header inside > of a bigger record. As Thomas put it to me off-list, it puts the > records inside of a larger chunk of space owned by the transaction > instead of putting the transaction header inside of some particular > record; that seems more correct than the way we have it now. > > 3. Heikki suggested that the format should be much more generic and > that more should be left up to the AM.
While neither Andres nor I are > convinced that we should go as far in that direction as Heikki is > proposing, the idea clearly has some merit, and we should probably be > moving that way to some degree. For instance, the idea that we should > store a block number and TID is a bit sketchy when you consider that a > multi-insert operation really wants to store a TID list. The zheap > tree has a ridiculous kludge to work around that problem; clearly we > need something better. We've also mentioned that, in the future, we > might want to support TIDs that are not 6 bytes, and that even just > looking at what's currently under development, zedstore wants to treat > TIDs as 48-bit opaque quantities, not a 4-byte block number and a > 2-byte item pointer offset. So, there is clearly a need to go through > the whole way we're doing this and rethink which parts are generic and > which parts are AM-specific. > > 4. A related problem, which has been mentioned or at least alluded to > by both Heikki and by me, is that we need a better way of handling the > AM-specific data. Right now, the zheap code packs fixed-size things > into the payload data and then finds them by knowing the offset where > any particular data is within that field, but that's an unmaintainable > mess. The zheap code could be improved by at least defining those > offsets as constants someplace and adding some comments explaining the > payload formats of various undo records, but even if we do that, it's > not going to generalize very well to anything more complicated than a > few fixed-size bits of data. I suggested using the pqformat stuff to > try to structure that -- a suggestion to which Heikki has > unfortunately not responded, because I'd really like to get his > thoughts on it -- but whether we do that particular thing or not, I > think we need to do something. In addition to wanting a better way of > handling packing and unpacking for payload data, there's also a desire > to have it participate in record compression, for which we don't seem > to have any kind of plan. > > 5. Andres suggested multiple ideas for cleaning up and improving this > code in https://www.postgresql.org/message-id/20190814065745.2faw3hirvfhbrdwe%40alap3.anarazel.de > - which include the idea currently under discussion, several of the > same ideas that I mentioned above, and a number of other things, such > as making packing serialize to a char * rather than some ad-hoc > intermediate format I have implemented this patch. I will post this along with other changes. and having a metadata array over which we can loop > rather than having multiple places where there's a separate bit of > code for every field type. I don't think those suggestions are > entirely unproblematic; for instance, the metadata array would > probably work a lot better if we moved the transaction and log-switch > headers outside of individual records as suggested in (2) above. > Otherwise, the metadata would have to include not only data-structure > offsets but some kind of a flag indicating which of several data > structures ought to contain the relevant information, which would make > the whole thing a lot messier.
And depending on what we do about (4), > this might become moot or the details might change quite a bit, > because if we no longer have a long list of "generic" fields, then we > also won't have a bunch of places in the code that deal with that long > list of generic fields, which means the metadata array might not be > necessary, or might be simpler or smaller or designed differently. > All of which is to make the point that responding to Andres's feedback > will require a bunch of decisions about which parts of the feedback to > take (because some of them are mutually exclusive, as he acknowledges > himself) and what to do about them (because some of them are vague); > yet, taken together, they seem to amount to the need for significant > design changes, as do (1)-(4). > > Now, just to be clear, the code we're talking about here is mostly > based on an original design by me, and whatever defects were present > in that original design are nobody's fault but mine. And that list of > defects includes pretty much everything in the above list. But, what > we need to figure out at this point is how we're going to get those > things fixed, and it seems to me that we're going to need a pretty > substantial redesign, but this discussion is kind of down in the > weeds. I mean, what are we gaining by arguing about how many stages > we need for incremental unpacking if the real thing we need to do is get > rid of that concept altogether? Actually, in my local changes, I have already got rid of the multiple stages because I am packing all fields in one char * as suggested in the first part of 4). I had a problem while unpacking because we don't know the complete size of the record beforehand, especially because of the payload data. I have suggested a couple of points above as part of 1) for handling the payload size. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > We cannot know the complete size of the record even by reading the > header, because we have a payload that is the variable part, and the payload > length is stored in the payload header, which again can be at a random > offset. Wait, but that's just purely self-inflicted damage, no? The initial length just needs to include the payload. And all this is not an issue anymore? Greetings, Andres Freund
On Thu, Aug 22, 2019 at 9:58 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > > > We cannot know the complete size of the record even by reading the > > > header, because we have a payload that is the variable part, and the payload > > > length is stored in the payload header, which again can be at a random > > > offset. > > > > Wait, but that's just purely self-inflicted damage, no? The initial > > length just needs to include the payload. And all this is not an issue > > anymore? > Actually, we store the undo length only at the end of the record and that is for traversing the transaction's undo record chain during bulk fetch. As such, in the beginning of the record we don't have the undo length. We do have uur_info, but that just tells us which optional headers are included in the record. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote: > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > > > > We cannot know the complete size of the record even by reading the > > > > header, because we have a payload that is the variable part, and the payload > > > > length is stored in the payload header, which again can be at a random > > > > offset. > > > > > > Wait, but that's just purely self-inflicted damage, no? The initial > > > length just needs to include the payload. And all this is not an issue > > > anymore? > > > Actually, we store the undo length only at the end of the record and > > that is for traversing the transaction's undo record chain during bulk > > fetch. As such, in the beginning of the record we don't have the undo > > length. We do have uur_info, but that just tells us which optional > > headers are included in the record. But why? It makes a *lot* more sense to have it in the beginning. I don't think bulk-fetch really requires it to be in the end - we can still process records forward on a page-by-page basis. Greetings, Andres Freund
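With a leading length word, the page-at-a-time forward scan could be as simple as the following sketch; SizeOfUndoPageHeader and process_undo_record() are illustrative names, and records crossing page boundaries are ignored for brevity:

    Size        off = SizeOfUndoPageHeader;     /* first record on this page */

    while (off + sizeof(uint16) <= BLCKSZ)
    {
        uint16      len;

        memcpy(&len, page + off, sizeof(uint16));
        if (len == 0)
            break;              /* remainder of the page is unused */
        process_undo_record(page + off, len);   /* caller-supplied */
        off += len;
    }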
On Thu, Aug 22, 2019 at 10:24 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote: > > > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > Hi, > > > > > > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > > > > > We cannot know the complete size of the record even by reading the > > > > > header, because we have a payload that is the variable part, and the payload > > > > > length is stored in the payload header, which again can be at a random > > > > > offset. > > > > > > > > Wait, but that's just purely self-inflicted damage, no? The initial > > > > length just needs to include the payload. And all this is not an issue > > > > anymore? > > > > > > > Actually, we store the undo length only at the end of the record and > > > that is for traversing the transaction's undo record chain during bulk > > > fetch. As such, in the beginning of the record we don't have the undo > > > length. We do have uur_info, but that just tells us which optional > > > headers are included in the record. > > But why? It makes a *lot* more sense to have it in the beginning. I > don't think bulk-fetch really requires it to be in the end - we can > still process records forward on a page-by-page basis. Yeah, we can handle the bulk fetch as you suggested and it will make it a lot easier. But, currently while registering the undo request (especially during the first pass) we need to compute the from_urecptr and the to_urecptr. And, for computing the from_urecptr, we have the end location of the transaction because we have the uur_next in the transaction header and that will tell us the end of our transaction but we still don't know the undo record pointer of the last record of the transaction. As of now, we read the previous 2 bytes from the end of the transaction to know the length of the last record and from there we can compute the undo record pointer of the last record and that is our from_urecptr. Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
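In code form, the computation Dilip describes is roughly the following; UndoGetPrevRecordLen() is a stand-in for reading those two trailing bytes, and UndoRecPtr arithmetic across page and segment boundaries is glossed over:

    uint16      last_record_len;
    UndoRecPtr  from_urecptr;

    /* uur_next in our transaction header gives the next transaction's start. */
    last_record_len = UndoGetPrevRecordLen(next_xact_start);   /* hypothetical */
    from_urecptr = next_xact_start - last_record_len;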
On Thu, Aug 22, 2019 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 22, 2019 at 10:24 AM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2019-08-22 10:19:04 +0530, Dilip Kumar wrote: > > > On Thu, Aug 22, 2019 at 9:58 AM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > Hi, > > > > > > > > On 2019-08-22 09:51:22 +0530, Dilip Kumar wrote: > > > > > We cannot know the complete size of the record even by reading the > > > > > header, because we have a payload that is the variable part, and the payload > > > > > length is stored in the payload header, which again can be at a random > > > > > offset. > > > > > > > > Wait, but that's just purely self-inflicted damage, no? The initial > > > > length just needs to include the payload. And all this is not an issue > > > > anymore? > > > > > > > Actually, we store the undo length only at the end of the record and > > > that is for traversing the transaction's undo record chain during bulk > > > fetch. As such, in the beginning of the record we don't have the undo > > > length. We do have uur_info, but that just tells us which optional > > > headers are included in the record. > > > > But why? It makes a *lot* more sense to have it in the beginning. I > > don't think bulk-fetch really requires it to be in the end - we can > > still process records forward on a page-by-page basis. > > Yeah, we can handle the bulk fetch as you suggested and it will make > it a lot easier. But, currently while registering the undo request > (especially during the first pass) we need to compute the from_urecptr > and the to_urecptr. And, for computing the from_urecptr, we have the > end location of the transaction because we have the uur_next in the > transaction header and that will tell us the end of our transaction > but we still don't know the undo record pointer of the last record of > the transaction. As of now, we read the previous 2 bytes from the end of > the transaction to know the length of the last record and from there > we can compute the undo record pointer of the last record and that is > our from_urecptr. > How about if we store the location of the last record of the transaction instead of the location of the next transaction in the transaction header? I think if we do that then the discard worker might need to do some additional work in some cases, as it needs to tell the location up to which discard is possible; however, many other cases might get simplified. With this also, when the log is switched while writing records for the same transaction, the transaction header in the first log will store the start location of the same transaction's records in the next log. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 22, 2019 at 12:54 AM Andres Freund <andres@anarazel.de> wrote: > But why? It makes a *lot* more sense to have it in the beginning. I > don't think bulk-fetch really requires it to be in the end - we can > still process records forward on a page-by-page basis. There are two separate needs here: to be able to go forward, and to be able to go backward. We have the length at the end of each record not because we're stupid, but so that we can back up. If we have another way of backing up, then the thing to do is not to move that to beginning of the record but to remove it entirely as unnecessary wastage. We can also think about how to improve forward traversal. Considering each problem separately: For forward traversal, we could simplify things somewhat by having only 3 decoding stages instead of N decoding stages. We really only need (1) a stage for accumulating bytes until we have uur_info, and then (2) a stage for accumulating bytes until we know the payload and tuple lengths, and then (3) a stage for accumulating bytes until we have the whole record. We have a lot more stages than that right now but I don't think we really need them for anything. Originally we had them so that we could do incremental decoding to find the transaction header in the record, but now that the transaction header is at a fixed offset, I think the multiplicity of stages is just baggage. We could simplify things more by deciding that the first two bytes of the record are going to contain the record size. That would increase the size of the record by 2 bytes, but we could (mostly) claw those bytes back by not storing the size of both uur_payload and uur_tuple. The size of the other one would be computed by subtraction: take the total record size, subtract the size of whichever of those two things we store, subtract the mandatory and optional headers that are present, and the rest must be the other value. That would still add 2 bytes for records that contain neither a payload nor a tuple, but that would probably be OK given that (a) a lot of records wouldn't be affected, (b) the code would get simpler, and (c) something like this seems necessary anyway given that we want to make the record format more generic. With this approach instead of 3 stages we only need 2: (1) accumulating bytes until we have the 2-byte length word, and (2) accumulating bytes until we have the whole record. For backward traversal, as I see it, there are basically two options. One is to do what we're doing right now, and store the record length at the end of the record. (That might mean that a record both begins and ends with its own length, which is not a crazy design.) The other is to do what I think you are proposing here: locate the beginning of the first record on the page, presumably based on some information stored in the page header, and then work forward through the page to figure out where all the records start. Then process them in reverse order. That saves 2 bytes per record. It's a little more expensive in terms of CPU cycles, especially if you only need some of the records on the page but not all of them, but that's probably not too bad. I think I'm basically agreeing with what you are proposing but I think it's important to spell out the underlying concerns, because otherwise I'm afraid we might think we have a meeting of the minds when we don't really. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
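Concretely, the subtraction Robert describes might look like this; UndoRecordHeaderSize() is a hypothetical helper that sums the mandatory header plus whichever optional headers uur_info says are present:

    Size        header_size = UndoRecordHeaderSize(uur->uur_info);
    Size        payload_len;

    /*
     * Only one of the two variable-length sizes (here, uur_tuple.len) is
     * stored on disk; the payload length is whatever remains once the
     * headers and the tuple data are accounted for.
     */
    payload_len = total_record_size - header_size - uur->uur_tuple.len;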
On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Yeah, we can handle the bulk fetch as you suggested and it will make > it a lot easier. But, currently while registering the undo request > (especially during the first pass) we need to compute the from_urecptr > and the to_urecptr. And, for computing the from_urecptr, we have the > end location of the transaction because we have the uur_next in the > transaction header and that will tell us the end of our transaction > but we still don't know the undo record pointer of the last record of > the transaction. As of know, we read previous 2 bytes from the end of > the transaction to know the length of the last record and from there > we can compute the undo record pointer of the last record and that is > our from_urecptr.= I don't understand this. If we're registering an undo request at "do" time, we don't need to compute the starting location; we can just remember the UndoRecPtr of the first record we inserted. If we're reregistering an undo request after a restart, we can (and, I think, should) work forward from the discard location rather than backward from the insert location. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 22, 2019 at 7:34 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Yeah, we can handle the bulk fetch as you suggested and it will make > > it a lot easier. But, currently while registering the undo request > > (especially during the first pass) we need to compute the from_urecptr > > and the to_urecptr. And, for computing the from_urecptr, we have the > > end location of the transaction because we have the uur_next in the > > transaction header and that will tell us the end of our transaction > > but we still don't know the undo record pointer of the last record of > > the transaction. As of know, we read previous 2 bytes from the end of > > the transaction to know the length of the last record and from there > > we can compute the undo record pointer of the last record and that is > > our from_urecptr.= > > I don't understand this. If we're registering an undo request at "do" > time, we don't need to compute the starting location; we can just > remember the UndoRecPtr of the first record we inserted. If we're > reregistering an undo request after a restart, we can (and, I think, > should) work forward from the discard location rather than backward > from the insert location. Right, we work froward from the discard location. So after the discard location, while traversing the undo log when we encounter an aborted transaction we need to register its rollback request. And, for doing that we need 1) start location of the first undo record . 2) start location of the last undo record (last undo record pointer). We already have 1). But we have to compute 2). For doing that if we unpack the first undo record we will know the start of the next transaction. From there if we read the last two bytes then that will have the length of the last undo record of our transaction. So we can compute 2) with below formula start of the last undo record = start of the next transaction - length of our transaction's last record. Am I making sense here? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 22, 2019 at 9:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Aug 22, 2019 at 7:34 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, Aug 22, 2019 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > Yeah, we can handle the bulk fetch as you suggested and it will make > > > it a lot easier. But, currently while registering the undo request > > > (especially during the first pass) we need to compute the from_urecptr > > > and the to_urecptr. And, for computing the from_urecptr, we have the > > > end location of the transaction because we have the uur_next in the > > > transaction header and that will tell us the end of our transaction > > > but we still don't know the undo record pointer of the last record of > > > the transaction. As of know, we read previous 2 bytes from the end of > > > the transaction to know the length of the last record and from there > > > we can compute the undo record pointer of the last record and that is > > > our from_urecptr.= > > > > I don't understand this. If we're registering an undo request at "do" > > time, we don't need to compute the starting location; we can just > > remember the UndoRecPtr of the first record we inserted. If we're > > reregistering an undo request after a restart, we can (and, I think, > > should) work forward from the discard location rather than backward > > from the insert location. > > Right, we work froward from the discard location. So after the > discard location, while traversing the undo log when we encounter an > aborted transaction we need to register its rollback request. And, > for doing that we need 1) start location of the first undo record . 2) > start location of the last undo record (last undo record pointer). > > We already have 1). But we have to compute 2). For doing that if we > unpack the first undo record we will know the start of the next > transaction. From there if we read the last two bytes then that will > have the length of the last undo record of our transaction. So we can > compute 2) with below formula > > start of the last undo record = start of the next transaction - length > of our transaction's last record. Maybe I am saying that because I am just thinking how the requests are registered as per the current code. But, those requests will ultimately be used for collecting the record by the bulk fetch. So if we are planning to change the bulk fetch to read forward then maybe we don't need the valid last undo record pointer because that we will anyway get while processing forward. So just knowing the end of the transaction is sufficient for us to know where to stop. I am not sure if this solution has any problem. Probably I should think again in the morning when my mind is well-rested. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi On August 22, 2019 9:14:10 AM PDT, Dilip Kumar <dilipbalaut@gmail.com> wrote: > But, those requests will >ultimately be used for collecting the record by the bulk fetch. So if >we are planning to change the bulk fetch to read forward then maybe we >don't need the valid last undo record pointer because that we will >anyway get while processing forward. So just knowing the end of the >transaction is sufficient for us to know where to stop. I am not sure >if this solution has any problem. Probably I should think again in >the morning when my mind is well-rested. I don't think we can easily do so for bulk apply without incurring significant overhead. It's pretty cheap to read in forwardorder and then process backwards on a page level - but for an entire transactions undo the story is different. Wecan't necessarily keep all of it in memory, so we'd have to read the undo twice to find the end. Right? Andres Andres Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Aug 22, 2019 at 9:55 PM Andres Freund <andres@anarazel.de> wrote: > > Hi > > On August 22, 2019 9:14:10 AM PDT, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > But, those requests will > >ultimately be used for collecting the record by the bulk fetch. So if > >we are planning to change the bulk fetch to read forward then maybe we > >don't need the valid last undo record pointer because that we will > >anyway get while processing forward. So just knowing the end of the > >transaction is sufficient for us to know where to stop. I am not sure > >if this solution has any problem. Probably I should think again in > >the morning when my mind is well-rested. > > I don't think we can easily do so for bulk apply without incurring significant overhead. It's pretty cheap to read in forwardorder and then process backwards on a page level - but for an entire transactions undo the story is different. Wecan't necessarily keep all of it in memory, so we'd have to read the undo twice to find the end. Right? > I was not talking about the entire transaction, I was also telling about the page level as you suggested. I was just saying that we may not need the start position of the last undo record of the transaction for registering the rollback request (which we currently do). However, we need to know the end of the transaction to know the last page from which we need to start reading forward. Let me explain with an example Transaction1 first, undo start at 10 first, undo end at 100 second, undo start at 101 second, undo end at 200 ...... last, undo start at 1000 last, undo end at 1100 Transaction2 first, undo start at 1101 first, undo end at 1200 second, undo start at 1201 second, undo end at 1300 Suppose we want to register the request for Transaction1. Then currently we need to know the start undo record pointer (10 as per above example) and the last undo record pointer (1000). But, we only know the start undo record pointer(10) and the start of the next transaction(1101). So for calculating the start of the last record, we use 1101 - 101 (length of the last record store 2 bytes before 1101). So, now I am saying that maybe we don't need to compute the start of last undo record (1000) because it's enough to know the end of the last undo record(1100). Because on whichever page the last undo record ends, we can start from that page and read forward on that page. * All numbers I used in the above example can be considered as undo record pointers. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 21, 2019 at 4:44 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-08-20 21:02:18 +1200, Thomas Munro wrote: > > 1. Anyone is allowed to try to read or write data at any UndoRecPtr > > that has been allocated, through the buffer pool (though you'd usually > > want to check it with UndoRecPtrIsDiscarded() first, and only rely on > > the system I'm describing to deal with races). > > > > 2. ReadBuffer() might return InvalidBuffer. This can happen for a > > cache miss, if the smgrread implementation wants to indicate that the > > buffer has been discarded/truncated and that is expected (md.c won't > > ever do that, but undofile.c can). > > Hm. This gives me a bit of a stomach ache. It somehow feels like a weird > form of signalling. Can't quite put my finger on why it makes me feel > queasy. Well, if we're going to allow concurrent access and discarding, there has to be some part of the system where you can discover that the thing you wanted is gone. What's wrong with here? Stepping back a bit... why do we have a new concept here? The reason we don't have anything like this for relations currently is that we don't have live references to blocks that are expected to be concurrently truncated, due to heavyweight locking. But the whole purpose of the undo system is to allow cheap truncation/discarding of data that you *do* have live references to, and furthermore the discarding is expected to be frequent. The present experiment is about trying to do so without throwing our hands up and using a pessimistic lock. At one point Robert and I discussed some kind of scheme where you'd register your interest in a range of the log before you begin reading (some kind of range locks or hazard pointers), so that you would block discarding in that range, but the system would still allow someone to read in the middle of the log while the discard worker concurrently discards non-overlapping data at the end. But I kept returning to the idea that the buffer pool already has block-level range locking of various kinds. You can register your interest in a range by pinning the buffers. That's when you'll find out if the range is already gone. We could add an extra layer of range locking around that, but it wouldn't be any better, it'd just thrash your bus a bit more, and require more complexity in the discard worker (it has to defer discarding a bit, and either block or go away and come back later). > > 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently > > unpinned buffers, and marks as BM_DISCARDED any that happen to be > > pinned right now, so they can't be immediately invalidated. Such > > buffers are never written back and are eligible for reuse on the next > > clock sweep, even if they're written into by a backend that managed to > > do that when we were trying to discard. > > Hm. When is it legitimate for a backend to write into such a buffer? I > guess that's about updating the previous transaction's next pointer? Or > progress info? Yes, previous transaction header's next pointer, and progress counter during rollback. We're mostly interested in the next pointer here, because the progress counter update would normally not be updated at a time when the page might be concurrently discarded. The exception to that is a superuser running CALL pg_force_discard_undo() (a data-eating operation designed to escape a system that can't successfully roll back and gets stuck, blowing away not-yet-rolled-back undo records). 
Here are some other ideas about how to avoid conflicts between discarding and transaction header update: 1. Lossy self-update-only: You could say that transactions are only allowed to write to their own transaction header, and then have them periodically update their own length in their own transaction header, and then teach the discard worker that the length information is only a starting point for a linear search for the next transaction based on page header information. That removes complexity for writers, but adds complexity and IO and CPU to the discard worker. Bleugh. 2. Strict self-update-only: We could update it as part of transaction cleanup. That is, when you commit or abort, probably at some time when your transaction is still advertised as running, you go and update your own transaction header with your the size. If you never reach that stage, I think you can fix it during crash recovery, during the startup scan that feeds the rollback request queues. That is, if you encounter a transaction header with length == 0, it must be the final one and its length is therefore known and can be updated, before you allow new transactions to begin. There are some complications like backends that exit without crashing, which I haven't thought about. As Amit just pointed out to me, that means that the update is not folded into the same buffer access as the next transaction, but perhaps you can mitigate that by not updating your header if the next header will be on the same page -- the next transaction can do it safely then (this page with the insert pointer on it can't be discarded). As Dilip just pointed out to me, it means that you always do an update that you might not never need to do if the transaction is discarded, to which I have no answer. Bleugh. 3. Perhaps there is a useful middle ground between ideas 1 and 2: if it's 0, the discard worker will perform a scan of page headers to compute the correct value, but if it's non-zero it will consider it to be correct and trust that value. The extra work would only happen after crash recovery or things like elog(FATAL, ...). 4. You could keep one extra transaction around all the time. That is, because we know we only ever want to stamp the transaction header of the previous transaction, don't let a transaction that hasn't been stamped with a length yet be discarded. But now we waste more space! 5. You could figure out a scheme to advertise the block number of the start of the previous transaction. You could have an LWLock that you have to take to stamp the transaction header of the previous transaction, and UndoLogDiscard() only has to take the lock if it wants to discard a range that overlaps with that block. This avoids contention for some workloads, but not others, so it seems like a half measure, and again you still have to deal with InvalidBuffer when reading. It's basically range locking; the buffer pool is already a kind of range locking scheme! These schemes are all about avoiding conflicts between discarding and writing, but you'd still have to tolerate InvalidBuffer for reads (ie reading zheap records) with this scheme, so I suppose you might as well just treat updates the same and not worry about any of the above. > > 5. Separating begin from discard allows the WAL logging for > > UndoLogDiscard() to do filesystem actions before logging, and other > > effects after logging, which have several nice properties if you work > > through the various crash scenarios. > > Hm. 
ISTM we always need to log before doing some filesystem operation > (see also my recent complaint Robert and I are discussing at the bottom > of [1]). It's just that we can have a separate stage afterwards? > > [1] https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com I talked about this a bit with Robert and he pointed out that it's probably not actually necessary to WAL-log these operations at all, now that 'begin' and 'end' (= physical storage range) have been separated from 'discard' and 'insert' (active undo data range). Instead you could do it like this: 1. Maintain begin and end pointers in shared memory only, no WAL, no checkpoint. 2. Compute their initial values by scanning the filesystem at startup time. 3. Treat (logical) discard and insert pointers as today; WAL before shm, checkpoint. 4. begin must be <= discard, and end must be >= insert, or else PANIC. I'm looking into that. > > So now I'd like to get feedback on the sanity of this scheme. I'm not > > saying it doesn't have bugs right now -- I've been trying to figure > > out good ways to test it and I'm not quite there yet -- but the > > concept. One observation I have is that there were already code paths > > in undoaccess.c that can tolerate InvalidBuffer in recovery, due to > > the potentially different discard timing for DO vs REDO. I think > > that's a point in favour of this scheme, but I can see that it's > > inconvenient to have to deal with InvalidBuffer whenever you read. > > FWIW, I'm far from convinced that those are currently quite right. See > discussion pointed to above. Yeah. It seems highly desirable to make it so that all decisions about whether a write to an undo block is required or should be skipped are made on the primary, so that WAL reply just does what it's told. I am working on that. -- Thomas Munro https://enterprisedb.com
On Fri, Aug 23, 2019 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote: > 2. Strict self-update-only: We could update it as part of > transaction cleanup. That is, when you commit or abort, probably at > some time when your transaction is still advertised as running, you go > and update your own transaction header with your the size. If you > never reach that stage, I think you can fix it during crash recovery, > during the startup scan that feeds the rollback request queues. That > is, if you encounter a transaction header with length == 0, it must be > the final one and its length is therefore known and can be updated, > before you allow new transactions to begin. There are some > complications like backends that exit without crashing, which I > haven't thought about. As Amit just pointed out to me, that means > that the update is not folded into the same buffer access as the next > transaction, but perhaps you can mitigate that by not updating your > header if the next header will be on the same page -- the next > transaction can do it safely then (this page with the insert pointer > on it can't be discarded). As Dilip just pointed out to me, it means > that you always do an update that you might not never need to do if > the transaction is discarded, to which I have no answer. Bleugh. Andres and I have spent a lot of time on the phone over the last couple of days and I think we both kind of like this option. I don't think that the costs are likely to be very significant: you're talking about pinning, locking, dirtying, unlocking, and unpinning one buffer at commit time, or maybe two if your transaction touched both logged and unlogged tables. If the transaction is short enough for that overhead to matter, that buffer is probably already in shared_buffers, is probably already dirty, and is probably already in your CPU's cache. So I think the overhead will turn out to be low. Moreover, I doubt that we want to separately discard every transaction anyway. If you have very light-weight transactions, you don't want to add an extra WAL record per transaction anyway. Increasing the number of separate WAL records per transaction from say 5 to 6 would be a significant additional cost. You probably want to perform a discard, say, every 5 seconds or sooner if you can discard at least 64kB of undo, or something of that sort. So we're not going to save the overhead of updating the previous transaction header often enough to make much difference unless we're discarding so aggressively that we incur a much larger overhead elsewhere. I think. I am a little concerned about the backends that exit without crashing. Andres seems to want to treat that case as a bug to be fixed, but I doubt whether that's going to be practical. We're really only talking about extreme corner cases here, because before_shmem_exit(ShutdownPostgres, 0) means we'll AbortOutOfAnyTransaction() which should RecordTransactionAbort(). Only if we fail in the AbortTransaction() prior to reaching RecordTransactionAbort() will we manage to reach the later cleanup stages without having written an abort record. I haven't scrutinized that code lately to see exactly how things can go wrong there, but there shouldn't be a whole lot. However, there's probably a few things, like maybe a really poorly-timed malloc() failure. A zero-order solution would be to install a deadman switch. At on_shmem_exit time, you must detach from any undo log to which you are connected, so that somebody else can attach to it later. 
We can stick in a cross-check there that you haven't written any undo bytes to that log and PANIC if you have. Then the system must be water-tight. Perhaps it's possible to do better: if we could identify the cases in which such logic gets reached, we could try to guarantee that WAL is written and the undo log safely detached before we get there. But at the various least we can promote ERROR/FATAL to PANIC in the relevant case. A better solution would be to detect the problem and make sure we recover from it before reusing the undo log. Suppose each undo log has three states: (1) nobody's attached, (2) somebody's attached, and (3) nobody's attached but the last record might need a fixup. When we start up, all undo logs are in state 3, and the discard worker runs around and puts them into state 1. Subsequently, they alternate between states 1 and 2 for as long as the system remains up. But if as an exceptional case we reach on_shmem_exit without having detached the undo log, because of cascading failures, then we put the undo log in state 3. The discard worker already knows how to move undo logs from state 3 to state 1, and it can do the same thing here. Until it does nobody else can reuse that undo log. I might be missing something, but I think that would nail this down pretty tightly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Aug 21, 2019 at 4:34 PM Robert Haas <robertmhaas@gmail.com> wrote: > ReleaseResourcesAndProcessUndo() is only supposed to be called after > AbortTransaction(), and the additional steps it performs -- > AtCleanup_Portals() and AtEOXact_Snapshot() or alternatively > AtSubCleanup_Portals -- are taken from Cleanup(Sub)Transaction. > That's not crazy; the other steps in Cleanup(Sub)Transaction() look > like stuff that's intended to be performed when we're totally done > with this TransactionState stack entry, whereas these things are > slightly higher-level cleanups that might even block undo (e.g. > undropped portal prevents orphaned file cleanup). Granted, there are > no comments explaining why those particular cleanup steps are > performed here, and it's possible some other approach is better, but I > think perhaps it's not quite as flagrantly broken as you think. Andres smacked me with the clue-bat off-list and now I understand why this is broken: there's no guarantee that running the various AtEOXact/AtCleanup functions actually puts the transaction back into a good state. They *might* return the system to the state that it was in immediately following StartTransaction(), but they also might not. Moreover, ProcessUndoRequestForEachLogCat uses PG_TRY/PG_CATCH and then discards the error without performing *any cleanup at all* but then goes on and tries to do undo for other undo log categories anyway. That is totally unsafe. I think that there should only be one chance to perform undo actions, and as I said or at least alluded to before, if that throws an error, it shouldn't be caught by a TRY/CATCH block but should be handled by the state machine in xact.c. If we're not going to make the state machine handle these conditions, the addition of TRANS_UNDO/TBLOCK_UNDO/TBLOCK_SUBUNDO is really missing the boat. I'm still not quite sure of the exact sequence of steps: we clearly need AtCleanup_Portals() and a bunch of the other stuff that happens during CleanupTransaction(), ideally including the freeing of memory, to happen before we try undo. But I don't have a great feeling for how to make that happen, and it seems more desirable for undo to begin as soon as the transaction fails rather than waiting until Cleanup(Sub)Transaction() time. I think some more research is needed here. > I am also not convinced that semi-critical sections are a bad idea, Regarding this, after further thought and discussion with Andres, there are two cases here that call for somewhat different handling: temporary undo, and subtransaction abort. In the case of a subtransaction abort, we can't proceed with the toplevel transaction unless we succeed in applying the subtransaction's undo, but that does not require killing off the backend. It might be a better idea to just fail the containing subtransaction with the error that occurred during undo apply; if there are multiple levels of subtransactions present then we might fail in the same way several times, but eventually we'll fail at the top level, forcibly kick the undo into the background, and the session can continue. The background workers will, hopefully, eventually recover the situation. Even if they can't, because, say, the failure is due to a bug or whatever, killing off the session doesn't really help. In the case of temporary undo, killing the session is a much more appealing approach. If we don't, how will that undo ever get processed? We could retry at some point (like every time we return to the toplevel command prompt?) 
or just ignore the fact that we didn't manage to perform those undo actions and leave that undo there like an albatross, accumulating more and more undo behind it until the session exits or the disk fills up. The latter strikes me as a terrible user experience, especially because for wire protocol reasons we'd have to swallow the errors or at best convert them to warnings, but YMMV. Anyway, probably these cases should not be handled exactly the same way, but exactly what to do probably depends on the previous question: how exactly does the integration into xact.c's state machine work, anyway? Meanwhile, I've been working up a prototype of how the undorequest.c stuff I sent previously could be integrated with xact.c. In working on that, I've realized that there seem to be two different tasks. One is tracking the information that we'll need to have available to perform undo actions. The other is the actual transaction state manipulation: when and how do we abort transactions, cleanup transactions, start new transactions specifically for undo? How are transactions performing undo specially marked, if at all? The attached patch includes a new module, undostate.c/h, which tries to handle the first of those things; this is just a prototype, and is missing some pieces marked with XXX, but I think it's probably the right general direction. It will still need to be plugged into a framework for launching undo apply background workers (which might require some API adjustments) and it needs xact.c to handle the core transactional stuff. But hopefully it will help to illustrate how the undorequest.c stuff that I sent before can actually be put to use. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Hello Thomas, I was doing some testing for the scenario where the undo written by a transaction overflows to multiple undo logs. For that I've modified the following macro: #define UndoLogMaxSize (1024 * 1024) /* 1MB undo log size */ (I should have used the provided pg_force_switch_undo though..) I'm getting the following assert failure while performing the recovery with the same. "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL", File: "undolog.c", Line: 997)" I found that we don't emit an WAL record when we update the slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after crash recovery, some new transaction may use that undo log which is wrong, IMHO. Am I missing something? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > I'm getting the following assert failure while performing the recovery > with the same. > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL", > File: "undolog.c", Line: 997)" > > I found that we don't emit an WAL record when we update the > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after > crash recovery, some new transaction may use that undo log which is > wrong, IMHO. Am I missing something? Thanks, right, that status logging is wrong, will fix in next version. -- Thomas Munro https://enterprisedb.com
Hi Thomas, While testing one of the recovery scenarios I found one issue: FailedAssertion("!(logno == context->recovery_logno) The details of the same is mentioned below: The context's try_location was not updated in UndoLogAllocateInRecovery, in PrepareUndoInsert the try_location was updated with the undo record size. In the subsequent UndoLogAllocateInRecovery as the value for try_location was not initialized but only updated with the size the logno will always not match if the recovery_logno is non zero and the assert fails. Fixed by setting the try_location in UndoLogAllocateInRecovery, similar to try_location setting in UndoLogAllocate. Patch for the same is attached. Please have a look and add the changes in one of the upcoming version. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com On Mon, Sep 2, 2019 at 9:53 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > I'm getting the following assert failure while performing the recovery > > with the same. > > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL", > > File: "undolog.c", Line: 997)" > > > > I found that we don't emit an WAL record when we update the > > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after > > crash recovery, some new transaction may use that undo log which is > > wrong, IMHO. Am I missing something? > > Thanks, right, that status logging is wrong, will fix in next version. > > -- > Thomas Munro > https://enterprisedb.com > >
Attachment
On 2019-Sep-06, vignesh C wrote: > Hi Thomas, > > While testing one of the recovery scenarios I found one issue: > FailedAssertion("!(logno == context->recovery_logno) I marked this patch Waiting on Author. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Thomas, While testing zheap over undo apis, we've found the following issues/scenarios that might need some fixes/discussions: 1. In UndoLogAllocateInRecovery, when we find the current log number from the list of registered blocks, we don't check whether the block->in_use flag is true or not. In XLogResetInsertion, we just reset in_use flag without reseting the blocks[]->rnode information. So, if we don't check the in_use flag, it's possible that we'll consult some block information from the previous WAL record. IMHO, just adding an in_use check in UndoLogAllocateInRecovery will solve the problem. 2. A transaction, inserts one undo record and generated a WAL record for the same, say at WAL location 0/2000A000. Next, the undo record gets discarded and WAL is generated to update the meta.discard pointer at location 0/2000B000 At the same time, an ongoing checkpoint with checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer. Now, the system crashes. Now, the recovery starts from the location 0/20000000. When the recovery of 0/2000A000 happens, it sees the undo record that it's about to insert, is already discarded as per meta.discard (flushed by checkpoint). In this case, should we just skip inserting the undo record? 3. Currently, we create a backup image of the unlogged part of the undo log's metadata only when some backend allocates some space from the undo log (in UndoLogAllocate). This helps us restore the unlogged meta part after a checkpoint. When we perform an undo action, we also update the undo action progress and emit an WAL record. The same operation can performed by the undo worker which doesn't allocate any space from the undo log. So, if an undo worker emits an WAL record to update undo action progress after a checkpoint, it'll not be able to WAL log the backup image of the meta unlogged part. IMHO, this breaks the recovery logic of unlogged part of undo meta. Thoughts? On Mon, Sep 2, 2019 at 9:47 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > > I'm getting the following assert failure while performing the recovery > > with the same. > > "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL", > > File: "undolog.c", Line: 997)" > > > > I found that we don't emit an WAL record when we update the > > slot->meta.status as UNDO_LOG_STATUS_FULL. If we don't that, after > > crash recovery, some new transaction may use that undo log which is > > wrong, IMHO. Am I missing something? > > Thanks, right, that status logging is wrong, will fix in next version. > > -- > Thomas Munro > https://enterprisedb.com -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 16, 2019 at 5:27 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > While testing zheap over undo apis, we've found the following > issues/scenarios that might need some fixes/discussions: Thanks! > 1. In UndoLogAllocateInRecovery, when we find the current log number > from the list of registered blocks, we don't check whether the > block->in_use flag is true or not. In XLogResetInsertion, we just > reset in_use flag without reseting the blocks[]->rnode information. > So, if we don't check the in_use flag, it's possible that we'll > consult some block information from the previous WAL record. IMHO, > just adding an in_use check in UndoLogAllocateInRecovery will solve > the problem. Agreed. I added a line to break out of that loop if !block->in_use. BTW I am planning to simplify that code considerably, based on a plan to introduce a new rule: there can be only one undo record and therefore only one undo allocation per WAL record. > 2. A transaction, inserts one undo record and generated a WAL record > for the same, say at WAL location 0/2000A000. Next, the undo record > gets discarded and WAL is generated to update the meta.discard pointer > at location 0/2000B000 At the same time, an ongoing checkpoint with > checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer. > Now, the system crashes. > Now, the recovery starts from the location 0/20000000. When the > recovery of 0/2000A000 happens, it sees the undo record that it's > about to insert, is already discarded as per meta.discard (flushed by > checkpoint). In this case, should we just skip inserting the undo > record? I see two options: 1. We make it so that if you're allocating in recovery and discard > insert, we'll just set discard = insert so you can proceed. The code in undofile_get_segment_file() already copes with missing files during recovery. 2. We skip the insert as you said. I think option 1 is probably best, otherwise you have to cope with failure to insert by skipping, as you said. > 3. Currently, we create a backup image of the unlogged part of the > undo log's metadata only when some backend allocates some space from > the undo log (in UndoLogAllocate). This helps us restore the unlogged > meta part after a checkpoint. > When we perform an undo action, we also update the undo action > progress and emit an WAL record. The same operation can performed by > the undo worker which doesn't allocate any space from the undo log. > So, if an undo worker emits an WAL record to update undo action > progress after a checkpoint, it'll not be able to WAL log the backup > image of the meta unlogged part. IMHO, this breaks the recovery logic > of unlogged part of undo meta. I thought that was OK because those undo data updates don't depend on the insert pointer. But I see what you mean: the next modification of the page that DOES depend on the insert pointer might not log the meta-data if it's not the first WAL record to touch it after a checkpoint. Rats. I'll have to think about that some more. -- Thomas Munro https://enterprisedb.com
Hello Thomas, On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > > 1. In UndoLogAllocateInRecovery, when we find the current log number > > from the list of registered blocks, we don't check whether the > > block->in_use flag is true or not. In XLogResetInsertion, we just > > reset in_use flag without reseting the blocks[]->rnode information. > > So, if we don't check the in_use flag, it's possible that we'll > > consult some block information from the previous WAL record. IMHO, > > just adding an in_use check in UndoLogAllocateInRecovery will solve > > the problem. > > Agreed. I added a line to break out of that loop if !block->in_use. > I think we should skip the block if !block->in_use. Because, the undo buffer can be registered in a subsequent block as well. For different operations, we can use different block_id to register the undo buffer in the redo record. > BTW I am planning to simplify that code considerably, based on a plan > to introduce a new rule: there can be only one undo record and > therefore only one undo allocation per WAL record. > Okay. In that case, we need to rethink the cases for multi-inserts and non-inlace updates both of which currently inserts multiple undo record corresponding to a single WAL record. For multi-inserts, it can be solved easily by moving all the offset information in the payload. But, for non-inlace updates, we insert one undo record for the update and one for the insert. Wondering whether we've to insert two WAL records - one for update and one for the new insert. > > 2. A transaction, inserts one undo record and generated a WAL record > > for the same, say at WAL location 0/2000A000. Next, the undo record > > gets discarded and WAL is generated to update the meta.discard pointer > > at location 0/2000B000 At the same time, an ongoing checkpoint with > > checkpoint.redo at 0/20000000 flushes the latest meta.discard pointer. > > Now, the system crashes. > > Now, the recovery starts from the location 0/20000000. When the > > recovery of 0/2000A000 happens, it sees the undo record that it's > > about to insert, is already discarded as per meta.discard (flushed by > > checkpoint). In this case, should we just skip inserting the undo > > record? > > I see two options: > > 1. We make it so that if you're allocating in recovery and discard > > insert, we'll just set discard = insert so you can proceed. The code > in undofile_get_segment_file() already copes with missing files during > recovery. > Interesting. This should work. > > > 3. Currently, we create a backup image of the unlogged part of the > > undo log's metadata only when some backend allocates some space from > > the undo log (in UndoLogAllocate). This helps us restore the unlogged > > meta part after a checkpoint. > > When we perform an undo action, we also update the undo action > > progress and emit an WAL record. The same operation can performed by > > the undo worker which doesn't allocate any space from the undo log. > > So, if an undo worker emits an WAL record to update undo action > > progress after a checkpoint, it'll not be able to WAL log the backup > > image of the meta unlogged part. IMHO, this breaks the recovery logic > > of unlogged part of undo meta. > > I thought that was OK because those undo data updates don't depend on > the insert pointer. But I see what you mean: the next modification of > the page that DOES depend on the insert pointer might not log the > meta-data if it's not the first WAL record to touch it after a > checkpoint. 
Rats. I'll have to think about that some more. Cool. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 16, 2019 at 11:09 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > Okay. In that case, we need to rethink the cases for multi-inserts and > non-inlace updates both of which currently inserts multiple undo > record corresponding to a single WAL record. For multi-inserts, it can > be solved easily by moving all the offset information in the payload. > But, for non-inlace updates, we insert one undo record for the update > and one for the insert. Wondering whether we've to insert two WAL > records - one for update and one for the new insert. No, I think the solution is to put the information about both halves of the non-in-place update in the same undo record. I think the only reason why that's tricky is because we've got two block numbers and two offsets, and the only reason that's a problem is because UnpackedUndoRecord only has one field for each of those things, and that goes right back to Heikki's comments about the format not being flexible enough. If you see some other problem, it would be interesting to know what it is. One thing I've been thinking about is: suppose that you're following the undo chain for a tuple and you come to a non-in-place update record. Can you get confused? I don't think so, because you can compare the TID for which you're following the chain to the new TID and the old TID in the record and it should match one or the other but not both. But I don't think you even really need to do that much: if you started with a deleted item, the first thing in the undo chain has to be a delete or non-in-place update that got rid of it. And if you started with a non-deleted item, then the beginning of the undo chain, if it hasn't been discarded yet, will be the insert or non-in-place update that created it. There's nowhere else that you can hit a non-in-place update, and no room (that I can see) for any ambiguity. It seems to me that zheap went wrong in ending up with separate undo types for in-place and non-in-place updates. Why not just have ONE kind of undo record that describes an update, and allow that update to have either one TID or two TIDs depending on which kind of update it is? There may be a reason, but I don't know what it is, unless it's just that the UnpackedUndoRecord idea that I invented wasn't flexible enough and nobody thought of generalizing it. Curious to hear your thoughts on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 17, 2019 at 3:09 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote: > On Mon, Sep 16, 2019 at 11:23 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > Agreed. I added a line to break out of that loop if !block->in_use. > > > I think we should skip the block if !block->in_use. Because, the undo > buffer can be registered in a subsequent block as well. For different > operations, we can use different block_id to register the undo buffer > in the redo record. Oops, right. So it should just be added to the if condition. Will do. -- Thomas Munro https://enterprisedb.com
On Mon, Sep 16, 2019 at 10:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > > It seems to me that zheap went wrong in ending up with separate undo > types for in-place and non-in-place updates. Why not just have ONE > kind of undo record that describes an update, and allow that update to > have either one TID or two TIDs depending on which kind of update it > is? There may be a reason, but I don't know what it is, unless it's > just that the UnpackedUndoRecord idea that I invented wasn't flexible > enough and nobody thought of generalizing it. Curious to hear your > thoughts on this. > I think not only TID's, but we also need to two uur_prevundo (previous undo of the block) pointers. This is required both when we have to perform page-wise undo and chain traversal during visibility checks. So, we can keep a combination of TID and prevundo. The other thing is that during rollback when we collect the undo for each page, applying the action for this undo need some thoughts. For example, we can't apply the undo to rollback both Insert and non-inplace-update as both are on different pages. The reason is that the page where non-inplace-update has happened might have more undos that need to be applied before this. We can somehow make this undo available to apply while collecting undo for both the heap pages. I think there is also a need to identify which TID is for Insert and which is for non-inplace-update part of the operation because we won't know that while applying undo unless we check the state of a tuple on the page. So, with this idea, we will make one undo record part of multiple chains which might need some consideration at different places like above. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote: > Oops, right. So it should just be added to the if condition. Will do. It's been a couple of months and the discussion has stale. It seems also that the patch was waiting for an update. So I am marking it as RwF for now. Please feel free to update it if you feel that's not adapted. -- Michael
Attachment
On Thu, Nov 28, 2019 at 3:45 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote: > > Oops, right. So it should just be added to the if condition. Will do. > > It's been a couple of months and the discussion has stale. It seems > also that the patch was waiting for an update. So I am marking it as > RwF for now. Please feel free to update it if you feel that's not > adapted. Thanks. We decided to redesign a couple of aspects of the undo storage and record layers that this patch was intended to demonstrate, and work on that is underway. More on that soon.
Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Nov 28, 2019 at 3:45 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote: > > > Oops, right. So it should just be added to the if condition. Will do. > > > > It's been a couple of months and the discussion has stale. It seems > > also that the patch was waiting for an update. So I am marking it as > > RwF for now. Please feel free to update it if you feel that's not > > adapted. > > Thanks. We decided to redesign a couple of aspects of the undo > storage and record layers that this patch was intended to demonstrate, > and work on that is underway. More on that soon. As my boss expressed in his recent blog post, we'd like to contribute to the zheap development, and a couple of developers from other companies are interested in this as well. Amit Kapila suggested that the "cleanup of orphaned files" feature is a good start point in getting the code into PG core, so I've spent some time on it and tried to rebase the patch set. In fact what I did is not mere rebasing against the current master branch - I've also (besides various bug fixes) done some design changes. Incorporated the new Undo Record Set (URS) infrastructure --------------------------------------------------------- This is also pointed out in [0]. I started from [1] and tried to implement some missing parts (e.g. proper closing of the URSs after crash), introduced UNDO_DEBUG preprocessor macro which makes the undo log segments very small and fixed some bugs that the small segments exposed. The most significant change I've done was removal of the undo requests from checkpoint. I could not find any particular bug / race conditions related to including the requests into the checkpoint, but I concluded that it's easier to think about consistency and checkpoint timings if we scan the undo log on restart (after recovery has finished) and create the requests from scratch. [2] shows where I ended up before I started to rebase this patchset. No background undo ------------------ Reduced complexity of the patch seems to be the priority at the moment. Amit suggested that cleanup of an orphaned relation file is simple enough to be done on foreground and I agree. "undo worker" is still there, but it only processes undo requests after server restart because relation data can only be changed in a transaction - it seems cleaner to launch a background worker for this than to hack the startup process. Since the concept of undo requests is closely related to the undo worker, I removed undorequest.c too. The new (much simpler) undo worker gets the information on incomplete / aborted transactions from the undo log as mentioned above. SMGR enhancement ---------------- I used the 0001 patch from [3] rather than [4], although it's more invasive because I noticed somewhere in the discussion that there should be no reserved database OID for the undo log. (InvalidOid cannot be used because it's already in use for shared catalogs.) Components added ---------------- pg_undo_dump utility and test framework for undoread.c. BTW, undoread.c seems to need some refactoring. Following are a few areas which are not implemented yet because more discussion is needed there: Discarding ---------- There's no discard worker for the URS infrastructure yet. 
I thought about discarding the undo log during checkpoint, but checkpoint should probably do more straightforward tasks than the calculation of a new discard pointer for each undo log, so a background worker is needed. A few notes on that: * until the zheap AM gets added, only the transaction that creates the undo records needs to access them. This assumption should make the discarding algorithm a bit simpler. Note that with zheap, the other transactions need to look for old versions of tuples, so the concept of oldestXidHavingUndo variable is needed there. * it's rather simple to pass pointer the URS pointer to the discard worker when transaction either committed or the undo has been executed. If the URS only consists of one chunk, the discard pointer can simply be advanced to the end of the chunk. But if there are multiple chunks, the discard worker might need to scan quite some amount of the undo log because (IIUC) chunks of different URSs can be interleaved (if there's not enough space for a record in the log 1, log 2 is used, but before we get to discarding, another transaction could have added its chunk to the log 1) and because the chunks only contain links backwards, not forward. If we added the forward link to the chunk header, it would make chunk closing more complex. How about storing the type header (which includes XID) in each chunk instead of only the first chunk of the URS? Thus we'd be able to check for each chunk separately whether it can be discarded. * if the URS belongs to an aborted transaction or a transaction that could not finish due to server crash, the transaction status alone does not justify discarding: we also need to be sure that the underlying undo records have been applied. So if we want to do without the oldestXidHavingUndo variable, some sort of undo progress tracking is needed, see below. Do not execute the same undo record multiple times -------------------------------------------------- Although I've noticed in the zheap code that it checks whether particular undo action was already undone, I think this functionality fits better in the URS layer. Also note in [1] (i.e. the undo layer, no zheap) that the header comment of AtSubAbort_XactUndo() refers to this problem. I've tried to implement such a thing (not included in this patch) by adding last_rec_applied field to UndoRecordSetChunkHeader. When the UNDO stage of the transaction starts, this field is set to the last undo record of given chunk, and once that record is applied, the pointer moves to the previous record in terms of undo pointer (i.e. the next record to be applied - the records are applied in reverse order) and so on. For recovery purposes, the pointer is maintained in a similar way as the ud_insertion_point field of UndoPageHeaderData. However, although I haven't tested performance yet, I wonder if it's o.k. to lock the buffer containing the chunk header exclusively for each undo record execution. I wonder if there's a better place to store the progress information, maybe at page level? I can spend more time on this project, but need a hint which part I should focus on. Other hackers might have the same problem. Thanks for any suggestions. 
[0] https://www.postgresql.org/message-id/CA%2BTgmoZwkqXs3hpT_nd17fyMnZDkg8yU%3D5kG%2BHQw%2B80rumiwUA%40mail.gmail.com [1] https://github.com/EnterpriseDB/zheap/tree/undo-record-set [2] https://github.com/cybertec-postgresql/postgres/tree/undo-record-set-ah [3] https://www.postgresql.org/message-id/CA%2BhUKGJfznxutTwpMLKPMjU_k9GhERoogyxx2Sf105LOA2La2A%40mail.gmail.com [4] https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Thu, Nov 12, 2020 at 10:15 PM Antonin Houska <ah@cybertec.at> wrote: > Thomas Munro <thomas.munro@gmail.com> wrote: > > Thanks. We decided to redesign a couple of aspects of the undo > > storage and record layers that this patch was intended to demonstrate, > > and work on that is underway. More on that soon. > > As my boss expressed in his recent blog post, we'd like to contribute to the > zheap development, and a couple of developers from other companies are > interested in this as well. Amit Kapila suggested that the "cleanup of > orphaned files" feature is a good start point in getting the code into PG > core, so I've spent some time on it and tried to rebase the patch set. Hi Antonin, I saw that -- great news! -- and have been meaning to write for a while. I think I am nearly ready to talk about it again. I agree 100% that it's worth trying to do something much simpler than a new access manager, and this was the simplest useful feature solving a real-world-problem-that-people-actually-have we could come up with (based on an idea from Robert). I think it needs a convincing explanation for why there is no scenario where the relfilenode is recycled for a new unlucky table before the rollback is executed, which might depend on details that you might be working on/changing (scenarios where you execute undo twice because you forgot you already did it). > In fact what I did is not mere rebasing against the current master branch - > I've also (besides various bug fixes) done some design changes. > > Incorporated the new Undo Record Set (URS) infrastructure > --------------------------------------------------------- > > This is also pointed out in [0]. > > I started from [1] and tried to implement some missing parts (e.g. proper > closing of the URSs after crash), introduced UNDO_DEBUG preprocessor macro > which makes the undo log segments very small and fixed some bugs that the > small segments exposed. Cool! Getting up to speed on all these made up concepts like URS, and getting all these pieces assembled and rebased and up and running is already quite something, let alone adding missing parts and debugging. > The most significant change I've done was removal of the undo requests from > checkpoint. I could not find any particular bug / race conditions related to > including the requests into the checkpoint, but I concluded that it's easier > to think about consistency and checkpoint timings if we scan the undo log on > restart (after recovery has finished) and create the requests from scratch. Interesting. I guess that would be closer to textbook three-phase ARIES. > [2] shows where I ended up before I started to rebase this patchset. > > No background undo > ------------------ > > Reduced complexity of the patch seems to be the priority at the moment. Amit > suggested that cleanup of an orphaned relation file is simple enough to be > done on foreground and I agree. > > "undo worker" is still there, but it only processes undo requests after server > restart because relation data can only be changed in a transaction - it seems > cleaner to launch a background worker for this than to hack the startup > process. I suppose the simplest useful system would be one does the work at startup before allowing connections, and also in regular backends, and panics if a backend ever exits while it has pending undo (panic = "goto crash recovery"). 
Then you don't have to deal with undo workers running at the same time as regular sessions, which might run into trouble reacquiring locks (for an AM I mean), or due to OIDs being recycled with multiple checkpoints, or undo work that gets deferred until the next restart of the server. > Since the concept of undo requests is closely related to the undo worker, I > removed undorequest.c too. The new (much simpler) undo worker gets the > information on incomplete / aborted transactions from the undo log as > mentioned above. > > SMGR enhancement > ---------------- > > I used the 0001 patch from [3] rather than [4], although it's more invasive > because I noticed somewhere in the discussion that there should be no reserved > database OID for the undo log. (InvalidOid cannot be used because it's already > in use for shared catalogs.) I gave up thinking about the colour of the BufferTag shed and went back to magic database 9, mainly because there seemed to be more pressing matters. I don't even think it's that crazy to store this type of system-wide data in pseudo databases, and I know of other systems that do similar sorts of things without blinking... > Following are a few areas which are not implemented yet because more > discussion is needed there: Hmm. I'm thinking about these questions.
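To make the "simplest useful system" above concrete, here is a minimal sketch of that policy. This is illustration only: the types and helper functions below are hypothetical, not taken from any of the patches.

    #include "postgres.h"

    typedef struct UndoLogState
    {
        uint64      discard;    /* oldest byte of undo not yet discarded */
        uint64      insert;     /* where the next undo byte would go */
    } UndoLogState;

    /* Hypothetical: applies one record and advances 'discard' on success. */
    static bool apply_one_undo_record(UndoLogState *log);

    /*
     * Run at the end of crash recovery, before connections are allowed:
     * anything between 'discard' and 'insert' that belongs to an aborted
     * transaction is applied and then discarded, so the system never opens
     * for business with pending undo.
     */
    static void
    ApplyPendingUndoBeforeConnections(UndoLogState *logs, int nlogs)
    {
        for (int i = 0; i < nlogs; i++)
        {
            while (logs[i].discard < logs[i].insert)
            {
                if (!apply_one_undo_record(&logs[i]))
                    elog(PANIC, "could not apply pending undo");
            }
        }
    }

In normal running, an aborting backend would apply its own undo in the foreground; if it cannot, it PANICs, i.e. "goto crash recovery", and the loop above runs again.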
On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > No background undo > ------------------ > > Reduced complexity of the patch seems to be the priority at the moment. Amit > suggested that cleanup of an orphaned relation file is simple enough to be > done on foreground and I agree. > Yeah, I think we should try and see if we can make it work but I noticed that there are a few places like AbortOutOfAnyTransaction where we have the assumption that undo will be executed in the background. We need to deal with it. > "undo worker" is still there, but it only processes undo requests after server > restart because relation data can only be changed in a transaction - it seems > cleaner to launch a background worker for this than to hack the startup > process. > But, I see there are still multiple undoworkers that are getting launched and I am not sure if that works correctly because a particular undoworker is connected to a database and then it starts processing all the pending undo. > Since the concept of undo requests is closely related to the undo worker, I > removed undorequest.c too. The new (much simpler) undo worker gets the > information on incomplete / aborted transactions from the undo log as > mentioned above. > > SMGR enhancement > ---------------- > > I used the 0001 patch from [3] rather than [4], although it's more invasive > because I noticed somewhere in the discussion that there should be no reserved > database OID for the undo log. (InvalidOid cannot be used because it's already > in use for shared catalogs.) > > Components added > ---------------- > > pg_undo_dump utility and test framework for undoread.c. BTW, undoread.c seems > to need some refactoring. > > > Following are a few areas which are not implemented yet because more > discussion is needed there: > > Discarding > ---------- > > There's no discard worker for the URS infrastructure yet. I thought about > discarding the undo log during checkpoint, but checkpoint should probably do > more straightforward tasks than the calculation of a new discard pointer for > each undo log, so a background worker is needed. A few notes on that: > > * until the zheap AM gets added, only the transaction that creates the undo > records needs to access them. This assumption should make the discarding > algorithm a bit simpler. Note that with zheap, the other transactions need > to look for old versions of tuples, so the concept of oldestXidHavingUndo > variable is needed there. > > * it's rather simple to pass the URS pointer to the discard worker > when the transaction either committed or the undo has been executed. > Why can't we have a separate discard worker which keeps on scanning the undo records and discards accordingly? Putting the onus on the foreground process might be tricky because, say, the discard worker may not be up to speed and we run out of space to pass such information for each commit/abort request. > > Do not execute the same undo record multiple times > -------------------------------------------------- > > Although I've noticed in the zheap code that it checks whether a particular undo > action was already undone, I think this functionality fits better in the URS > layer. > If you want to track at undo record level, then won't it lead to performance overhead and probably additional WAL overhead considering this action needs to be WAL-logged? I think recording at page-level might be a better idea. > > I can spend more time on this project, but need a hint which part I should focus on.
I can easily imagine that this needs a lot of work and I can try to help with this as much as possible from my side. I feel at this stage we should try to focus on undo-related work (to start with you can look at finishing the undo-processing work for which I have shared some thoughts) and then probably at some point in time we need to rebase zheap over this. -- With Regards, Amit Kapila.
Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Nov 12, 2020 at 10:15 PM Antonin Houska <ah@cybertec.at> wrote: > I saw that -- great news! -- and have been meaning to write for a > while. I think I am nearly ready to talk about it again. I'm looking forward to it :-) > 100% that it's worth trying to do something much simpler than a new > access manager, and this was the simplest useful feature solving a > real-world-problem-that-people-actually-have we could come up with > (based on an idea from Robert). I think it needs a convincing > explanation for why there is no scenario where the relfilenode is > recycled for a new unlucky table before the rollback is executed, > which might depend on details that you might be working on/changing > (scenarios where you execute undo twice because you forgot you already > did it). Oh, I haven't thought about this problem yet. That might be another reason for the undo log infrastructure to record the progress somehow. > > No background undo > > ------------------ > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > suggested that cleanup of an orphaned relation file is simple enough to be > > done on foreground and I agree. > > > > "undo worker" is still there, but it only processes undo requests after server > > restart because relation data can only be changed in a transaction - it seems > > cleaner to launch a background worker for this than to hack the startup > > process. > > I suppose the simplest useful system would be one that does the work at > startup before allowing connections, and also in regular backends, and > panics if a backend ever exits while it has pending undo (panic = > "goto crash recovery"). Then you don't have to deal with undo workers > running at the same time as regular sessions, which might run into > trouble reacquiring locks (for an AM I mean), or due to OIDs being > recycled with multiple checkpoints, or undo work that gets deferred > until the next restart of the server. I think that zheap can recognize that a page has unapplied undo, so we don't need to reacquire any page lock on restart. However I agree that the background undo might introduce other concurrency issues. At least for now it's worth trying to move the cleanup into the startup process. We can reconsider this when implementing more expensive undo actions, especially the zheap rollback. > > Since the concept of undo requests is closely related to the undo worker, I > > removed undorequest.c too. The new (much simpler) undo worker gets the > > information on incomplete / aborted transactions from the undo log as > > mentioned above. > > > > SMGR enhancement > > ---------------- > > > > I used the 0001 patch from [3] rather than [4], although it's more invasive > > because I noticed somewhere in the discussion that there should be no reserved > > database OID for the undo log. (InvalidOid cannot be used because it's already > > in use for shared catalogs.) > > I gave up thinking about the colour of the BufferTag shed and went > back to magic database 9, mainly because there seemed to be more > pressing matters. I don't even think it's that crazy to store this > type of system-wide data in pseudo databases, and I know of other > systems that do similar sorts of things without blinking... ok -- Antonin Houska Web: https://www.cybertec-postgresql.com
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > No background undo > > ------------------ > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > suggested that cleanup of an orphaned relation file is simple enough to be > > done on foreground and I agree. > > > > Yeah, I think we should try and see if we can make it work but I > noticed that there are a few places like AbortOutOfAnyTransaction where > we have the assumption that undo will be executed in the background. > We need to deal with it. I think this is o.k. if we always check for unapplied undo during startup. > > "undo worker" is still there, but it only processes undo requests after server > > restart because relation data can only be changed in a transaction - it seems > > cleaner to launch a background worker for this than to hack the startup > > process. > > > > But, I see there are still multiple undoworkers that are getting > launched and I am not sure if that works correctly because a > particular undoworker is connected to a database and then it starts > processing all the pending undo. Each undo worker applies only transactions for its own database, see ProcessExistingUndoRequests():

    /* We can only process undo of the database we are connected to. */
    if (xact_hdr.dboid != MyDatabaseId)
        continue;

Nevertheless, as I've just mentioned in my response to Thomas, I admit that we should try to live w/o the undo worker altogether. > > Discarding > > ---------- > > > > There's no discard worker for the URS infrastructure yet. I thought about > > discarding the undo log during checkpoint, but checkpoint should probably do > > more straightforward tasks than the calculation of a new discard pointer for > > each undo log, so a background worker is needed. A few notes on that: > > > > * until the zheap AM gets added, only the transaction that creates the undo > > records needs to access them. This assumption should make the discarding > > algorithm a bit simpler. Note that with zheap, the other transactions need > > to look for old versions of tuples, so the concept of oldestXidHavingUndo > > variable is needed there. > > > > * it's rather simple to pass the URS pointer to the discard worker > > when the transaction either committed or the undo has been executed. > > > > Why can't we have a separate discard worker which keeps on scanning > the undo records and discards accordingly? Putting the onus on the foreground > process might be tricky because, say, the discard worker may not be up to speed > and we run out of space to pass such information for each commit/abort > request. Sure, there should be a discard worker. The question is how to make its work efficient. The initial run after restart probably needs to scan everything between 'discard' and 'insert' pointers, but then it should process only the parts created by individual transactions. > > > > Do not execute the same undo record multiple times > > -------------------------------------------------- > > > > Although I've noticed in the zheap code that it checks whether a particular undo > > action was already undone, I think this functionality fits better in the URS > > layer. > > > > If you want to track at undo record level, then won't it lead to > performance overhead and probably additional WAL overhead considering > this action needs to be WAL-logged? I think recording at page-level > might be a better idea.
I'm not worried about WAL because the undo execution needs to be WAL-logged anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be evaluated regarding performance is the (exclusive) locking of the page that carries the progress information. I'm still not sure whether this info should be on every page or only in the chunk header. In either case, we have a problem if there are two or more chunks created by different transactions on the same page, and if more than one of these transactions needs to perform undo. I tend to believe that this should happen rarely though. > > I can spend more time on this project, but need a hint which part I should > > focus on. > > > > I can easily imagine that this needs a lot of work and I can try to > help with this as much as possible from my side. I feel at this stage > we should try to focus on undo-related work (to start with you can > look at finishing the undo-processing work for which I have shared some > thoughts) and then probably at some point in time we need to rebase > zheap over this. I agree, thanks! -- Antonin Houska Web: https://www.cybertec-postgresql.com
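For illustration, the initial discard pass described above might look roughly like this. Everything here except TransactionIdDidCommit() is a hypothetical name, and the real patches may divide the work differently:

    /* Hypothetical sketch of the discard worker's first pass after restart. */
    static void
    InitialDiscardScan(UndoLog *log)            /* hypothetical type */
    {
        UndoRecPtr  cur = log->discard;

        while (cur < log->insert)
        {
            /* read_chunk_header() is hypothetical: one chunk per transaction. */
            UndoChunkHeader hdr = read_chunk_header(log, cur);

            if (TransactionIdDidCommit(hdr.xid))
                cur = hdr.end;          /* committed: chunk is discardable */
            else if (chunk_fully_applied(&hdr)) /* hypothetical progress check */
                cur = hdr.end;          /* aborted, but undo already executed */
            else
                break;                  /* pending abort: execute undo first */
        }

        advance_discard_pointer(log, cur);      /* hypothetical */
    }

Subsequent passes would only need to look at the chunks added by individual transactions since the last pass, rather than rescanning the whole range.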
On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > No background undo > > ------------------ > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > suggested that cleanup of an orphaned relation file is simple enough to be > > done on foreground and I agree. > > > > Yeah, I think we should try and see if we can make it work but I > noticed that there are a few places like AbortOutOfAnyTransaction where > we have the assumption that undo will be executed in the background. > We need to deal with it. > > I think this is o.k. if we always check for unapplied undo during startup. > Hmm, how is it ok to leave undo (and rely on startup) unless it is a PANIC error? IIRC, this path is invoked in non-panic errors as well. Basically, we won't be able to discard such undo, which doesn't seem like a good idea. > > "undo worker" is still there, but it only processes undo requests after server > > restart because relation data can only be changed in a transaction - it seems > > cleaner to launch a background worker for this than to hack the startup > > process. > > > > But, I see there are still multiple undoworkers that are getting > launched and I am not sure if that works correctly because a > particular undoworker is connected to a database and then it starts > processing all the pending undo. > > Each undo worker applies only transactions for its own database, see > ProcessExistingUndoRequests():
>
> /* We can only process undo of the database we are connected to. */
> if (xact_hdr.dboid != MyDatabaseId)
>     continue;
>
> Nevertheless, as I've just mentioned in my response to Thomas, I admit that we > should try to live w/o the undo worker altogether. > Okay, but keep in mind that there could be a large amount of undo (unlike redo, which has some limit as we can replay it from the last checkpoint) which needs to be processed, but it might be okay to live with that for now. Another thing is that it seems we need to connect to the database to perform it, which might appear a bit odd: we don't allow users to connect to the database, but internally we are connecting to it. These are just some points to consider while finalizing the solution to this. > > > Discarding > > > ---------- > > > > > > There's no discard worker for the URS infrastructure yet. I thought about > > > discarding the undo log during checkpoint, but checkpoint should probably do > > > more straightforward tasks than the calculation of a new discard pointer for > > > each undo log, so a background worker is needed. A few notes on that: > > > > > > * until the zheap AM gets added, only the transaction that creates the undo > > > records needs to access them. This assumption should make the discarding > > > algorithm a bit simpler. Note that with zheap, the other transactions need > > > to look for old versions of tuples, so the concept of oldestXidHavingUndo > > > variable is needed there. > > > > > > * it's rather simple to pass the URS pointer to the discard worker > > > when the transaction either committed or the undo has been executed. > > > > Why can't we have a separate discard worker which keeps on scanning > the undo records and discards accordingly?
Putting the onus on the foreground > > process might be tricky because, say, the discard worker may not be up to speed > > and we run out of space to pass such information for each commit/abort > > request. > > Sure, there should be a discard worker. The question is how to make its work > efficient. The initial run after restart probably needs to scan everything > between 'discard' and 'insert' pointers, > Yeah, such an initial scan would be helpful to identify pending aborts and allow them to be processed. > but then it should process only the > parts created by individual transactions. > Yeah, it needs to process transaction-by-transaction to see which of them we can discard. Also, note that in Single-User mode we need to discard undo after commit. I think we also need to maintain oldestXidHavingUndo for CLOG truncation and transaction-wraparound. We can't allow CLOG truncation for the transaction whose undo is not discarded as that could be required by some other transaction. For similar reasons, we can't allow transaction-wraparound and we need to integrate this into the existing xid-allocation mechanism. I have found one of the old patches (Allow-execution-and-discard-of-undo-by-background-wo) attached where all these concepts were implemented. Unless you have a reason why we don't need these things, you might want to refer to the attached patch to either re-use or adapt these ideas. There are a few other things like undorequest and some undoworker mechanism which you can ignore. -- With Regards, Amit Kapila.
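The CLOG-truncation interaction above can be illustrated with a one-clamp sketch; GetOldestXidHavingUndo() is a hypothetical accessor for the variable being discussed:

    /*
     * Hypothetical sketch: clamp the CLOG truncation cutoff so that the
     * commit status of any transaction with undiscarded undo remains
     * available for undo processing.
     */
    static TransactionId
    ClampClogTruncationPoint(TransactionId cutoff)
    {
        TransactionId oldestXidHavingUndo = GetOldestXidHavingUndo();   /* hypothetical */

        if (TransactionIdIsValid(oldestXidHavingUndo) &&
            TransactionIdPrecedes(oldestXidHavingUndo, cutoff))
            cutoff = oldestXidHavingUndo;

        return cutoff;
    }

The same clamp would apply to the xid-allocation wraparound check: nextXid must not be allowed to approach oldestXidHavingUndo from behind.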
On Sun, Nov 15, 2020 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > No background undo > > > > ------------------ > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > done on foreground and I agree. > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > noticed that there are few places like AbortOutOfAnyTransaction where > > > we have the assumption that undo will be executed in the background. > > > We need to deal with it. > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a > PANIC error. IIRC, this path is invoked in non-panic errors as well. > Basically, we won't be able to discard such an undo which doesn't seem > like a good idea. > > > > > "undo worker" is still there, but it only processes undo requests after server > > > > restart because relation data can only be changed in a transaction - it seems > > > > cleaner to launch a background worker for this than to hack the startup > > > > process. > > > > > > > > > > But, I see there are still multiple undoworkers that are getting > > > launched and I am not sure if that works correctly because a > > > particular undoworker is connected to a database and then it starts > > > processing all the pending undo. > > > > Each undo worker applies only transactions for its own database, see > > ProcessExistingUndoRequests(): > > > > /* We can only process undo of the database we are connected to. */ > > if (xact_hdr.dboid != MyDatabaseId) > > continue; > > > > Nevertheless, as I've just mentioned in my response to Thomas, I admit that we > > should try to live w/o the undo worker altogether. > > > > Okay, but keep in mind that there could be a large amount of undo > (unlike redo which has some limit as we can replay it from the last > checkpoint) which needs to be processed but it might be okay to live > with that for now. Another thing is that it seems we need to connect > to the database to perform it which might appear a bit odd that we > don't allow users to connect to the database but internally we are > connecting it. These are just some points to consider while finalizing > the solution to this. > > > > > Discarding > > > > ---------- > > > > > > > > There's no discard worker for the URS infrastructure yet. I thought about > > > > discarding the undo log during checkpoint, but checkpoint should probably do > > > > more straightforward tasks than the calculation of a new discard pointer for > > > > each undo log, so a background worker is needed. A few notes on that: > > > > > > > > * until the zheap AM gets added, only the transaction that creates the undo > > > > records needs to access them. This assumption should make the discarding > > > > algorithm a bit simpler. Note that with zheap, the other transactions need > > > > to look for old versions of tuples, so the concept of oldestXidHavingUndo > > > > variable is needed there. > > > > > > > > * it's rather simple to pass pointer the URS pointer to the discard worker > > > > when transaction either committed or the undo has been executed. 
> > > > > > > > > > Why can't we have a separate discard worker which keeps on scanning > > > the undorecords and discard accordingly? Giving the onus of foreground > > > process might be tricky because say discard worker is not up to speed > > > and we ran out of space to pass such information for each commit/abort > > > request. > > > > Sure, there should be a discard worker. The question is how to make its work > > efficient. The initial run after restart probably needs to scan everything > > between 'discard' and 'insert' pointers, > > > > Yeah, such an initial scan would be helpful to identify pending aborts > and allow them to be processed. > > > but then it should process only the > > parts created by individual transactions. > > > > Yeah, it needs to process transaction-by-transaction to see which all > we can discard. Also, note that in Single-User mode we need to discard > undo after commit. I think we also need to maintain > oldestXidHavingUndo for CLOG truncation and transaction-wraparound. We > can't allow CLOG truncation for the transaction whose undo is not > discarded as that could be required by some other transaction. For > similar reasons, we can't allow transaction-wraparound and we need to > integrate this into the existing xid-allocation mechanism. I have > found one of the old patch > (Allow-execution-and-discard-of-undo-by-background-wo) attached > oops, forgot to attach the patch, doing now. -- With Regards, Amit Kapila.
On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > If you want to track at undo record level, then won't it lead to > > performance overhead and probably additional WAL overhead considering > > this action needs to be WAL-logged? I think recording at page-level > > might be a better idea. > > I'm not worried about WAL because the undo execution needs to be WAL-logged > anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be > evaluated regarding performance is the (exclusive) locking of the page that > carries the progress information. > That is just for one kind of smgr; think about how you will do it for something like zheap. Their idea is to collect all the undo records (unless the undo for a transaction is very large) for one zheap-page and apply them together, so maintaining the status at each undo record level will surely lead to a large amount of additional WAL. See below how and why we have decided to do it differently. > I'm still not sure whether this info should > be on every page or only in the chunk header. In either case, we have a > problem if there are two or more chunks created by different transactions on > the same page, and if more than one of these transactions needs to perform > undo. I tend to believe that this should happen rarely though. > I think we need to maintain this information at the transaction level and need to update it after processing a few blocks, at least that is what was decided and implemented earlier. We also need to update it when the log is switched or all the actions of the transaction were applied. The reasoning is that for short transactions it won't matter and for larger transactions, it is good to update it after a few pages to avoid WAL and locking overhead. Also, it is better if we collect the undo in bulk; this has proved to be beneficial for large transactions. The earlier version of the patch, with all these ideas implemented, is attached (Infrastructure-to-execute-pending-undo-actions and Provide-interfaces-to-store-and-fetch-undo-records). The second one has some APIs used by the first one but the main concepts were implemented in the first one (Infrastructure-to-execute-pending-undo-actions). I see that in the current version these can't be used as-is, but they can still give us a good starting point and we might be able to re-use some code and/or ideas from these patches. -- With Regards, Amit Kapila.
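As an illustration of the bulk-apply idea, one possible shape is below; all names are invented for the sketch, not taken from the attached patches:

    #define PROGRESS_UPDATE_INTERVAL 8  /* invented: blocks between progress updates */

    typedef struct UndoBlockGroup       /* hypothetical: records grouped per page */
    {
        Buffer      buffer;
        /* ... the undo records targeting this block ... */
    } UndoBlockGroup;

    static void
    apply_undo_in_bulk(UndoBlockGroup *groups, int ngroups)
    {
        for (int i = 0; i < ngroups; i++)
        {
            /* Apply every record for one page under a single lock/WAL cycle. */
            LockBuffer(groups[i].buffer, BUFFER_LOCK_EXCLUSIVE);
            apply_group_records(&groups[i]);    /* hypothetical */
            UnlockReleaseBuffer(groups[i].buffer);

            /* Transaction-level progress, updated only every few blocks. */
            if ((i + 1) % PROGRESS_UPDATE_INTERVAL == 0)
                update_transaction_progress();  /* hypothetical, WAL-logged */
        }

        update_transaction_progress();          /* final update on completion */
    }

This keeps the per-record WAL and locking cost out of the common path: short transactions never hit the interval, and long ones pay for a progress update only once per batch.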
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > No background undo > > > > ------------------ > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > done on foreground and I agree. > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > noticed that there are a few places like AbortOutOfAnyTransaction where > > > we have the assumption that undo will be executed in the background. > > > We need to deal with it. > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > Hmm, how is it ok to leave undo (and rely on startup) unless it is a > PANIC error? IIRC, this path is invoked in non-panic errors as well. > Basically, we won't be able to discard such undo, which doesn't seem > like a good idea. Since failure to apply leaves inconsistent data, I assume it should always cause PANIC, shouldn't it? (Thomas seems to assume the same in [1].) > > > > "undo worker" is still there, but it only processes undo requests after server > > > > restart because relation data can only be changed in a transaction - it seems > > > > cleaner to launch a background worker for this than to hack the startup > > > > process. > > > > > > > > > > But, I see there are still multiple undoworkers that are getting > > > launched and I am not sure if that works correctly because a > > > particular undoworker is connected to a database and then it starts > > > processing all the pending undo. > > > > Each undo worker applies only transactions for its own database, see > > ProcessExistingUndoRequests():
> >
> > /* We can only process undo of the database we are connected to. */
> > if (xact_hdr.dboid != MyDatabaseId)
> >     continue;
> >
> > Nevertheless, as I've just mentioned in my response to Thomas, I admit that we > > should try to live w/o the undo worker altogether. > > Okay, but keep in mind that there could be a large amount of undo > (unlike redo, which has some limit as we can replay it from the last > checkpoint) which needs to be processed, but it might be okay to live > with that for now. Yes, the information needed to remove a relation file does not take much space in the undo log. > Another thing is that it seems we need to connect to the database to perform > it, which might appear a bit odd: we don't allow users to connect to the > database, but internally we are connecting to it. I think the implementation will need to follow the outcome of the part of the discussion that starts at [2], but I see your concern. I'm wondering why a database connection is not needed to apply WAL but is needed for UNDO. I think locks make the difference. So maybe we can make the RMGR-specific callbacks (rm_undo) aware of the fact that the cluster is still in the startup state, so the relations should be opened in NoLock mode? > > > > Discarding > > > > ---------- > > > > > > > > There's no discard worker for the URS infrastructure yet. I thought about > > > > discarding the undo log during checkpoint, but checkpoint should probably do > > > > more straightforward tasks than the calculation of a new discard pointer for > > > > each undo log, so a background worker is needed.
A few notes on that: > > > > * until the zheap AM gets added, only the transaction that creates the undo > > records needs to access them. This assumption should make the discarding > > algorithm a bit simpler. Note that with zheap, the other transactions need > > to look for old versions of tuples, so the concept of oldestXidHavingUndo > > variable is needed there. > > > > * it's rather simple to pass the URS pointer to the discard worker > > when the transaction either committed or the undo has been executed. > > > > > > Why can't we have a separate discard worker which keeps on scanning > > > the undo records and discards accordingly? Putting the onus on the foreground > > > process might be tricky because, say, the discard worker may not be up to speed > > > and we run out of space to pass such information for each commit/abort > > > request. > > > > Sure, there should be a discard worker. The question is how to make its work > > efficient. The initial run after restart probably needs to scan everything > > between 'discard' and 'insert' pointers, > > > > Yeah, such an initial scan would be helpful to identify pending aborts > and allow them to be processed. > > > but then it should process only the > > parts created by individual transactions. > > > > Yeah, it needs to process transaction-by-transaction to see which of them > we can discard. Also, note that in Single-User mode we need to discard > undo after commit. ok, I missed this problem so far. > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and > transaction-wraparound. We can't allow CLOG truncation for the transaction > whose undo is not discarded as that could be required by some other > transaction. Good point. Even the discard worker might need to check the transaction status when deciding whether the undo log of that transaction should be discarded. > For similar reasons, we can't allow transaction-wraparound and > we need to integrate this into the existing xid-allocation mechanism. I have > found one of the old patches > (Allow-execution-and-discard-of-undo-by-background-wo) attached where all > these concepts were implemented. Unless you have a reason why we don't need these > things, you might want to refer to the attached patch to either re-use or > adapt these ideas. There are a few other things like undorequest and some > undoworker mechanism which you can ignore. Thanks. [1] https://www.postgresql.org/message-id/CA%2BhUKGJL4X1em70rxN1d_EC3rxiVhVd1woHviydW%3DHr2PeGBpg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAH2-Wzk06ypb40z3B8HFiSsTVg961%3DE0%3DuQvqARJgT8_4QB2Mg%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
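As a sketch of that last idea, the startup-awareness could be a single flag passed down to the callback; the context struct and callback below are hypothetical, not the patch's actual rm_undo interface:

    typedef struct RmgrUndoContext
    {
        Oid         reloid;         /* hypothetical: target relation */
        bool        in_startup;     /* true while connections are not yet allowed */
    } RmgrUndoContext;

    static void
    example_rm_undo(RmgrUndoContext *ctx)
    {
        /*
         * During startup no other session can hold a conflicting lock, so a
         * heavyweight lock would be pointless; a normal backend must lock
         * the relation as usual.
         */
        LOCKMODE    mode = ctx->in_startup ? NoLock : AccessExclusiveLock;
        Relation    rel = relation_open(ctx->reloid, mode);

        /* ... apply this RMGR's undo records against rel ... */

        relation_close(rel, mode);
    }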
On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > > > > No background undo > > > > ------------------ > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > done on foreground and I agree. > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > noticed that there are a few places like AbortOutOfAnyTransaction where > > > we have the assumption that undo will be executed in the background. > > > We need to deal with it. > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > Hmm, how is it ok to leave undo (and rely on startup) unless it is a > PANIC error? IIRC, this path is invoked in non-panic errors as well. > Basically, we won't be able to discard such undo, which doesn't seem > like a good idea. > > Since failure to apply leaves inconsistent data, I assume it should always > cause PANIC, shouldn't it? > But how can we ensure that AbortOutOfAnyTransaction will be called only in that scenario? > (Thomas seems to assume the same in [1].) > > > > > "undo worker" is still there, but it only processes undo requests after server > > > > restart because relation data can only be changed in a transaction - it seems > > > > cleaner to launch a background worker for this than to hack the startup > > > > process. > > > > > > > > > > But, I see there are still multiple undoworkers that are getting > > > launched and I am not sure if that works correctly because a > > > particular undoworker is connected to a database and then it starts > > > processing all the pending undo. > > > > Each undo worker applies only transactions for its own database, see > > ProcessExistingUndoRequests():
> >
> > /* We can only process undo of the database we are connected to. */
> > if (xact_hdr.dboid != MyDatabaseId)
> >     continue;
> >
> > Nevertheless, as I've just mentioned in my response to Thomas, I admit that we > > should try to live w/o the undo worker altogether. > > Okay, but keep in mind that there could be a large amount of undo > (unlike redo, which has some limit as we can replay it from the last > checkpoint) which needs to be processed, but it might be okay to live > with that for now. > > Yes, the information needed to remove a relation file does not take much space in the > undo log. > > > Another thing is that it seems we need to connect to the database to perform > > it, which might appear a bit odd: we don't allow users to connect to the > > database, but internally we are connecting to it. > > I think the implementation will need to follow the outcome of the part of the > discussion that starts at [2], but I see your concern. I'm wondering why > a database connection is not needed to apply WAL but is needed for UNDO. I think > locks make the difference. > Yeah, it would probably be a good idea to see if we can make undo apply work without a db connection, especially if we want to do it before allowing connections. The other possibility could be to let the discard worker do this work lazily after allowing connections.
-- With Regards, Amit Kapila.
Antonin Houska <ah@cybertec.at> wrote: > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and > > transaction-wraparound. We can't allow CLOG truncation for the transaction > > whose undo is not discarded as that could be required by some other > > transaction. > > Good point. Even the discard worker might need to check the transaction status > when deciding whether the undo log of that transaction should be discarded. In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to OldestXmin:

    OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
                               PROCARRAY_FLAGS_VACUUM);

    oldestXidHavingUndo =
        GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));

    /*
     * Call the discard routine if there oldestXidHavingUndo is lagging
     * behind OldestXmin.
     */
    if (OldestXmin != InvalidTransactionId &&
        TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
    {
        UndoDiscard(OldestXmin, &hibernate);

and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared memory. I'm not sure this is correct because, IMO, OldestXmin can advance as soon as AbortTransaction() has cleared both xid and xmin fields of the transaction's PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding undo log may still be waiting for processing. Am I wrong? I think that oldestXidHavingUndo should be advanced at the time a transaction commits or when the undo log of an aborted transaction has been applied. Then the discard worker would simply discard the undo log up to oldestXidHavingUndo. However, as the transactions whose undo is still not applied may no longer be registered in the shared memory (proc array), I don't know how to determine the next value of oldestXidHavingUndo. Also I wonder if FullTransactionId is needed for oldestXidHavingUndo in the shared memory rather than plain TransactionId (see oldestXidWithEpochHavingUndo in PROC_HDR). I think that the value cannot lag behind nextFullXid by more than 2 billion transactions anyway because in that case it would cause XID wraparound. (That in turn makes me think that VACUUM FREEZE should be able to discard undo log too.) [1] https://github.com/EnterpriseDB/zheap/tree/master -- Antonin Houska Web: https://www.cybertec-postgresql.com
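As a side note on the FullTransactionId question, the epoch is exactly what makes "older than" well defined beyond the wraparound window. This much is ordinary PostgreSQL API (access/transam.h) rather than patch code:

    /*
     * Plain TransactionId comparisons are circular (modulo 2^32), so they
     * are only meaningful within about 2 billion xids.  FullTransactionId
     * carries the epoch, so a simple 64-bit comparison always works.
     */
    static void
    fxid_epoch_demo(void)
    {
        FullTransactionId a = FullTransactionIdFromEpochAndXid(5, 100);
        FullTransactionId b = FullTransactionIdFromEpochAndXid(6, 100);

        Assert(FullTransactionIdPrecedes(a, b));    /* unambiguous across epochs */
        Assert(U64FromFullTransactionId(b) - U64FromFullTransactionId(a) ==
               UINT64CONST(1) << 32);
    }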
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > > > > > > > No background undo > > > > > > ------------------ > > > > > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > > > done on foreground and I agree. > > > > > > > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > > > noticed that there are few places like AbortOutOfAnyTransaction where > > > > > we have the assumption that undo will be executed in the background. > > > > > We need to deal with it. > > > > > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > > > > > > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a > > > PANIC error. IIRC, this path is invoked in non-panic errors as well. > > > Basically, we won't be able to discard such an undo which doesn't seem > > > like a good idea. > > > > Since failure to apply leaves unconsistent data, I assume it should always > > cause PANIC, shouldn't it? > > > > But how can we ensure that AbortOutOfAnyTransaction will be called > only in that scenario? I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that there is unapplied undo, so nothing changes for its callers. Do I still miss something? > > > Another thing is that it seems we need to connect to the database to perform > > > it which might appear a bit odd that we don't allow users to connect to the > > > database but internally we are connecting it. > > > > I think the implementation will need to follow the outcome of the part of the > > discussion that starts at [2], but I see your concern. I'm thinking why > > database connection is not needed to apply WAL but is needed for UNDO. I think > > locks make the difference. > > > > Yeah, it would be probably a good idea to see if we can make undo > apply work without db-connection especially if we want to do before > allowing connections. The other possibility could be to let discard > worker do this work lazily after allowing connections. Actually I hit the problem of missing connection when playing with the "undoxacttest" module. Those tests use table_open() / table_close() functions, but it might not be necessary for the real RMGRs. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Nov 25, 2020 at 8:00 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > > > > > > > > > > > > No background undo > > > > > > > ------------------ > > > > > > > > > > > > > > Reduced complexity of the patch seems to be the priority at the moment. Amit > > > > > > > suggested that cleanup of an orphaned relation file is simple enough to be > > > > > > > done on foreground and I agree. > > > > > > > > > > > > > > > > > > > Yeah, I think we should try and see if we can make it work but I > > > > > > noticed that there are few places like AbortOutOfAnyTransaction where > > > > > > we have the assumption that undo will be executed in the background. > > > > > > We need to deal with it. > > > > > > > > > > I think this is o.k. if we always check for unapplied undo during startup. > > > > > > > > > > > > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a > > > > PANIC error. IIRC, this path is invoked in non-panic errors as well. > > > > Basically, we won't be able to discard such an undo which doesn't seem > > > > like a good idea. > > > > > > Since failure to apply leaves unconsistent data, I assume it should always > > > cause PANIC, shouldn't it? > > > > > > > But how can we ensure that AbortOutOfAnyTransaction will be called > > only in that scenario? > > I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that > there is unapplied undo, so nothing changes for its callers. Do I still miss > something? > Adding PANIC in some generic code-path sounds scary. Why can't we simply try to execute undo? > > > > Another thing is that it seems we need to connect to the database to perform > > > > it which might appear a bit odd that we don't allow users to connect to the > > > > database but internally we are connecting it. > > > > > > I think the implementation will need to follow the outcome of the part of the > > > discussion that starts at [2], but I see your concern. I'm thinking why > > > database connection is not needed to apply WAL but is needed for UNDO. I think > > > locks make the difference. > > > > > > > Yeah, it would be probably a good idea to see if we can make undo > > apply work without db-connection especially if we want to do before > > allowing connections. The other possibility could be to let discard > > worker do this work lazily after allowing connections. > > Actually I hit the problem of missing connection when playing with the > "undoxacttest" module. Those tests use table_open() / table_close() functions, > but it might not be necessary for the real RMGRs. > How can we apply the action on a page without opening the relation? -- With Regards, Amit Kapila.
On Wed, Nov 25, 2020 at 7:47 PM Antonin Houska <ah@cybertec.at> wrote: > > Antonin Houska <ah@cybertec.at> wrote: > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and > > > transaction-wraparound. We can't allow CLOG truncation for the transaction > > > whose undo is not discarded as that could be required by some other > > > transaction. > > > > Good point. Even the discard worker might need to check the transaction status > > when deciding whether the undo log of that transaction should be discarded. > > In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to > OldestXmin:
>
> OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
>                            PROCARRAY_FLAGS_VACUUM);
>
> oldestXidHavingUndo =
>     GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));
>
> /*
>  * Call the discard routine if there oldestXidHavingUndo is lagging
>  * behind OldestXmin.
>  */
> if (OldestXmin != InvalidTransactionId &&
>     TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
> {
>     UndoDiscard(OldestXmin, &hibernate);
>
> and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared > memory. > > I'm not sure this is correct because, IMO, OldestXmin can advance as soon as > AbortTransaction() has cleared both xid and xmin fields of the transaction's > PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding > undo log may still be waiting for processing. Am I wrong? > The UndoDiscard->UndoDiscardOneLog ensures that we don't discard the undo if there is a pending abort. > I think that oldestXidHavingUndo should be advanced at the time a transaction > commits or when the undo log of an aborted transaction has been applied. > We can't advance oldestXidHavingUndo just on commit because later we need to rely on it for visibility: basically, any transaction older than oldestXidHavingUndo should be all-visible. > Then > the discard worker would simply discard the undo log up to > oldestXidHavingUndo. However, as the transactions whose undo is still not > applied may no longer be registered in the shared memory (proc array), I don't > know how to determine the next value of oldestXidHavingUndo. > > Also I wonder if FullTransactionId is needed for oldestXidHavingUndo in the > shared memory rather than plain TransactionId (see > oldestXidWithEpochHavingUndo in PROC_HDR). I think that the value cannot lag > behind nextFullXid by more than 2 billion transactions anyway because in that > case it would cause XID wraparound. > You are right, but still it is better to keep it as FullTransactionId because (a) zheap uses FullTransactionId and we need to compare it with oldestXidWithEpochHavingUndo for visibility purposes, and (b) in the future, we want to get rid of this limitation for undo as well. -- With Regards, Amit Kapila.
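So the variable doubles as a visibility horizon. A hypothetical sketch of the fast path it enables (the accessor name is invented):

    /*
     * Hypothetical sketch: any transaction older than oldestXidHavingUndo
     * has no undo left -- either it committed, or its undo has been fully
     * applied and discarded -- so its effects can be treated as permanent.
     */
    static bool
    XidIsCertainlyAllVisible(FullTransactionId fxid)
    {
        FullTransactionId horizon = GetOldestFullXidHavingUndo();   /* hypothetical */

        return FullTransactionIdPrecedes(fxid, horizon);
    }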
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 25, 2020 at 7:47 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Antonin Houska <ah@cybertec.at> wrote: > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and > > > > transaction-wraparound. We can't allow CLOG truncation for the transaction > > > > whose undo is not discarded as that could be required by some other > > > > transaction. > > > > > > Good point. Even the discard worker might need to check the transaction status > > > when deciding whether the undo log of that transaction should be discarded. > > > > In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to > > OldestXmin:
> >
> > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
> >                            PROCARRAY_FLAGS_VACUUM);
> >
> > oldestXidHavingUndo =
> >     GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));
> >
> > /*
> >  * Call the discard routine if there oldestXidHavingUndo is lagging
> >  * behind OldestXmin.
> >  */
> > if (OldestXmin != InvalidTransactionId &&
> >     TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
> > {
> >     UndoDiscard(OldestXmin, &hibernate);
> >
> > and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared > > memory. > > > > I'm not sure this is correct because, IMO, OldestXmin can advance as soon as > > AbortTransaction() has cleared both xid and xmin fields of the transaction's > > PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding > > undo log may still be waiting for processing. Am I wrong? > The UndoDiscard->UndoDiscardOneLog ensures that we don't discard the > undo if there is a pending abort. ok, I should have dug deeper than just reading the header comment of UndoDiscard(). I've checked now and seem to understand why no information is lost. Nevertheless, I see in the zheap code that the discard worker may need to scan a lot of undo log each time. While the oldest_xid and oldest_data fields of UndoLogControl help to skip parts of the log, I'm not sure such information fits into the undo-record-set (URS) approach. For now I'm inclined to implement the "exhaustive" scan for the URS too, and later we can teach the discard worker to store some metadata so that the processing becomes incremental. > > I think that oldestXidHavingUndo should be advanced at the time a transaction > > commits or when the undo log of an aborted transaction has been applied. > > > > We can't advance oldestXidHavingUndo just on commit because later we > need to rely on it for visibility: basically, any transaction older > than oldestXidHavingUndo should be all-visible. ok -- Antonin Houska Web: https://www.cybertec-postgresql.com
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 25, 2020 at 8:00 PM Antonin Houska <ah@cybertec.at> wrote: > > I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that > > there is unapplied undo, so nothing changes for its callers. Do I still miss > > something? > > > > Adding PANIC in some generic code-path sounds scary. Why can't we > simply try to execute undo? Indeed it should try. I imagined it this way but probably got distracted by some other thought when writing the email :-) > > Actually I hit the problem of missing connection when playing with the > > "undoxacttest" module. Those tests use table_open() / table_close() functions, > > but it might not be necessary for the real RMGRs. > > > > How can we apply the action on a page without opening the relation? If the undo record contains RelFileNode, ReadBufferWithoutRelcache() can be used, just like it happens with WAL. Not sure how much it would affect zheap. -- Antonin Houska Web: https://www.cybertec-postgresql.com
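To illustrate, an undo handler that carries the RelFileNode in its records could touch the target page the same way redo does, without any relcache entry. The record layout below is invented, and the ReadBufferWithoutRelcache() signature is quoted from memory (see bufmgr.h):

    typedef struct UndoRecordSample     /* hypothetical record payload */
    {
        RelFileNode rnode;              /* physical relation identity */
        BlockNumber blkno;              /* target block */
    } UndoRecordSample;

    static void
    apply_undo_without_relcache(UndoRecordSample *rec)
    {
        Buffer      buf = ReadBufferWithoutRelcache(rec->rnode, MAIN_FORKNUM,
                                                    rec->blkno, RBM_NORMAL,
                                                    NULL);

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        /* ... revert the change on the page and WAL-log the modification ... */
        UnlockReleaseBuffer(buf);
    }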
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > If you want to track at undo record level, then won't it lead to > > > performance overhead and probably additional WAL overhead considering > > > this action needs to be WAL-logged? I think recording at page-level > > > might be a better idea. > > > > I'm not worried about WAL because the undo execution needs to be WAL-logged > > anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be > > evaluated regarding performance is the (exclusive) locking of the page that > > carries the progress information. > > > > That is just for one kind of smgr; think about how you will do it for > something like zheap. Their idea is to collect all the undo records > (unless the undo for a transaction is very large) for one zheap-page > and apply them together, so maintaining the status at each undo record > level will surely lead to a large amount of additional WAL. See below > how and why we have decided to do it differently. > > > I'm still not sure whether this info should > > be on every page or only in the chunk header. In either case, we have a > > problem if there are two or more chunks created by different transactions on > > the same page, and if more than one of these transactions needs to perform > > undo. I tend to believe that this should happen rarely though. > > > > I think we need to maintain this information at the transaction level > and need to update it after processing a few blocks, at least that is > what was decided and implemented earlier. We also need to update it > when the log is switched or all the actions of the transaction were > applied. The reasoning is that for short transactions it won't matter > and for larger transactions, it is good to update it after a few pages > to avoid WAL and locking overhead. Also, it is better if we collect > the undo in bulk; this has proved to be beneficial for large > transactions. Attached is what I originally did not include in the patch series; see part 0012. I have no better idea so far. The progress information is stored in the chunk header. To avoid too frequent locking, maybe the UpdateLastAppliedRecord() function can be modified so it recognizes when it's necessary to update the progress info. Also the caller (zheap) should consider when to call the function. Since I've included 0012 now as a prerequisite for discarding (0013), currently it's only necessary to update the progress at an undo log chunk boundary. In this version of the patch series I wanted to publish the remaining ideas I haven't published yet. > The earlier version of the patch, with all these ideas > implemented, is attached > (Infrastructure-to-execute-pending-undo-actions and > Provide-interfaces-to-store-and-fetch-undo-records). The second one > has some APIs used by the first one but the main concepts were > implemented in the first one > (Infrastructure-to-execute-pending-undo-actions). I see that in the > current version these can't be used as-is, but they can still give us > a good starting point and we might be able to re-use some code > and/or ideas from these patches. Is there a branch with these patches applied? They reference some functions that I don't see in [1]. I'd like to examine if / how my approach can be aligned with the current zheap design.
[1] https://github.com/EnterpriseDB/zheap/tree/master -- Antonin Houska Web: https://www.cybertec-postgresql.com
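One cheap way to let UpdateLastAppliedRecord() recognize that on its own is to throttle on the amount of undo applied since the last report. A sketch under that assumption, with the threshold and the distance helper invented:

    #define PROGRESS_REPORT_THRESHOLD   (8 * 1024)     /* invented tuning knob */

    static void
    MaybeUpdateLastAppliedRecord(UndoRecPtr applied, UndoRecPtr *last_reported,
                                 bool chunk_boundary)
    {
        /*
         * Skip the exclusive page lock and the WAL record unless we reached
         * a chunk boundary or applied enough undo since the last report;
         * undo_bytes_between() is a hypothetical helper.
         */
        if (chunk_boundary ||
            undo_bytes_between(*last_reported, applied) >= PROGRESS_REPORT_THRESHOLD)
        {
            UpdateLastAppliedRecord(applied);   /* the patch's function */
            *last_reported = applied;
        }
    }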
On Fri, Dec 4, 2020 at 1:50 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > The earlier version of the patch, with all these ideas > > implemented, is attached > > (Infrastructure-to-execute-pending-undo-actions and > > Provide-interfaces-to-store-and-fetch-undo-records). The second one > > has some APIs used by the first one but the main concepts were > > implemented in the first one > > (Infrastructure-to-execute-pending-undo-actions). I see that in the > > current version these can't be used as-is, but they can still give us > > a good starting point and we might be able to re-use some code > > and/or ideas from these patches. > > Is there a branch with these patches applied? They reference some functions > that I don't see in [1]. I'd like to examine if / how my approach can be > aligned with the current zheap design. > Can you check the patch set attached to the email at [1]? [1] - https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com -- With Regards, Amit Kapila.
> On Fri, Dec 04, 2020 at 10:22:42AM +0100, Antonin Houska wrote: > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > If you want to track at undo record level, then won't it lead to > > > > performance overhead and probably additional WAL overhead considering > > > > this action needs to be WAL-logged? I think recording at page-level > > > > might be a better idea. > > > > > > I'm not worried about WAL because the undo execution needs to be WAL-logged > > > anyway - see smgr_undo() in the 0005- part of the patch set. What needs to be > > > evaluated regarding performance is the (exclusive) locking of the page that > > > carries the progress information. > > > > > > > That is just for one kind of smgr; think about how you will do it for > > something like zheap. Their idea is to collect all the undo records > > (unless the undo for a transaction is very large) for one zheap-page > > and apply them together, so maintaining the status at each undo record > > level will surely lead to a large amount of additional WAL. See below > > how and why we have decided to do it differently. > > > > > I'm still not sure whether this info should > > > be on every page or only in the chunk header. In either case, we have a > > > problem if there are two or more chunks created by different transactions on > > > the same page, and if more than one of these transactions needs to perform > > > undo. I tend to believe that this should happen rarely though. > > > > > > > I think we need to maintain this information at the transaction level > > and need to update it after processing a few blocks, at least that is > > what was decided and implemented earlier. We also need to update it > > when the log is switched or all the actions of the transaction were > > applied. The reasoning is that for short transactions it won't matter > > and for larger transactions, it is good to update it after a few pages > > to avoid WAL and locking overhead. Also, it is better if we collect > > the undo in bulk; this has proved to be beneficial for large > > transactions. > > Attached is what I originally did not include in the patch series; see > part 0012. I have no better idea so far. The progress information is stored in > the chunk header. > > To avoid too frequent locking, maybe the UpdateLastAppliedRecord() function > can be modified so it recognizes when it's necessary to update the progress > info. Also the caller (zheap) should consider when to call the function. > Since I've included 0012 now as a prerequisite for discarding (0013), > currently it's only necessary to update the progress at an undo log chunk > boundary. > > In this version of the patch series I wanted to publish the remaining ideas I > haven't published yet. Thanks for the updated patch. As I've mentioned off the list, I'm slowly looking through it with the intent to concentrate on undo progress tracking. But before I post anything I want to mention a couple of strange issues I see, otherwise I will forget for sure. Maybe it's already known, but running 'make installcheck' several times against a freshly built postgres with the patch applied, I observe various errors from time to time.
This one happens during crash recovery; it seems like UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in the replay process:

TRAP: FailedAssertion("page_offset + this_page_bytes <= uph->ud_insertion_point", File: "undopage.c", Line: 300)
postgres: startup recovering 000000010000000000000012(ExceptionalCondition+0xa1)[0x558b38b8a350]
postgres: startup recovering 000000010000000000000012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e]
postgres: startup recovering 000000010000000000000012(UndoReplay+0xa1d)[0x558b38766f32]
postgres: startup recovering 000000010000000000000012(XactUndoReplay+0x77)[0x558b38769281]
postgres: startup recovering 000000010000000000000012(smgr_redo+0x1af)[0x558b387aa7bd]

This one is somewhat similar:

TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: "undopage.c", Line: 287)
postgres: undo worker for database 36893 (ExceptionalCondition+0xa1)[0x5559c90f1350]
postgres: undo worker for database 36893 (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3]
postgres: undo worker for database 36893 (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008]
postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989]

There are also, here and there, messages about undo files that cannot be found:

ERROR: cannot open undo segment file 'base/undo/000008.0000020000': No such file or directory
WARNING: failed to undo transaction

I haven't found the trigger yet, but I got the impression that it happens after the create_table tests.
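The two assertions above amount to a range check on where within an undo page an overwrite may land. Here is a minimal paraphrase in C (not the actual undopage.c code; the function and parameter names are made up for illustration):

#include <stdbool.h>
#include <stddef.h>

/*
 * Paraphrase of the invariants behind the two failed assertions: an
 * overwrite must start past the page header (the Line 287 check) and
 * must end no later than the page's current insertion point (the Line
 * 300 check).  A violation means the caller computed the offset from
 * invalid state, e.g. an uninitialized record set type during replay.
 */
static bool
undo_overwrite_range_ok(size_t page_offset, size_t nbytes,
                        size_t page_header_size, size_t insertion_point)
{
    return page_offset >= page_header_size &&
           page_offset + nbytes <= insertion_point;
}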
Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Thanks for the updated patch. As I've mentioned off the list I'm slowly > looking through it with the intent to concentrate on undo progress > tracking. But before I will post anything I want to mention couple of > strange issues I see, otherwise I will forget for sure. Maybe it's > already known, but running several times 'make installcheck' against a > freshly build postgres with the patch applied from time to time I > observe various errors. > > This one happens on a crash recovery, seems like > UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in > the replay process: > > TRAP: FailedAssertion("page_offset + this_page_bytes <= uph->ud_insertion_point", File: "undopage.c", Line: 300) > postgres: startup recovering 000000010000000000000012(ExceptionalCondition+0xa1)[0x558b38b8a350] > postgres: startup recovering 000000010000000000000012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e] > postgres: startup recovering 000000010000000000000012(UndoReplay+0xa1d)[0x558b38766f32] > postgres: startup recovering 000000010000000000000012(XactUndoReplay+0x77)[0x558b38769281] > postgres: startup recovering 000000010000000000000012(smgr_redo+0x1af)[0x558b387aa7bd] > > This one is somewhat similar: > > TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: "undopage.c", Line: 287) > postgres: undo worker for database 36893 (ExceptionalCondition+0xa1)[0x5559c90f1350] > postgres: undo worker for database 36893 (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3] > postgres: undo worker for database 36893 (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008] > postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989] Well, on repeated run of the test I could also hit the first one. I could fix it and will post a new version of the patch (along with some other small changes) this week. > There are also here and there messages about not found undo files: > > ERROR: cannot open undo segment file 'base/undo/000008.0000020000': No such file or directory > WARNING: failed to undo transaction I don't see this one in the log so far, will try again. Thanks for the report! -- Antonin Houska Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote: > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > Thanks for the updated patch. As I've mentioned off the list I'm slowly > > looking through it with the intent to concentrate on undo progress > > tracking. But before I will post anything I want to mention couple of > > strange issues I see, otherwise I will forget for sure. Maybe it's > > already known, but running several times 'make installcheck' against a > > freshly build postgres with the patch applied from time to time I > > observe various errors. > > > > This one happens on a crash recovery, seems like > > UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in > > the replay process: > > > > TRAP: FailedAssertion("page_offset + this_page_bytes <= uph->ud_insertion_point", File: "undopage.c", Line: 300) > > postgres: startup recovering 000000010000000000000012(ExceptionalCondition+0xa1)[0x558b38b8a350] > > postgres: startup recovering 000000010000000000000012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e] > > postgres: startup recovering 000000010000000000000012(UndoReplay+0xa1d)[0x558b38766f32] > > postgres: startup recovering 000000010000000000000012(XactUndoReplay+0x77)[0x558b38769281] > > postgres: startup recovering 000000010000000000000012(smgr_redo+0x1af)[0x558b387aa7bd] > > > > This one is somewhat similar: > > > > TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: "undopage.c", Line: 287) > > postgres: undo worker for database 36893 (ExceptionalCondition+0xa1)[0x5559c90f1350] > > postgres: undo worker for database 36893 (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3] > > postgres: undo worker for database 36893 (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008] > > postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989] > > Well, on repeated run of the test I could also hit the first one. I could fix > it and will post a new version of the patch (along with some other small > changes) this week.

Attached is the next version. Changes done:

* Removed the progress tracking and implemented undo discarding in a simpler way. Now, instead of maintaining the pointer to the last record applied, only a boolean field in the chunk header is set when ROLLBACK is done. This helps to determine whether the undo of a non-committed transaction can be discarded.

* Removed the "undo worker" that the previous version only used to apply the undo after crash recovery. The startup process does the work now.

* Implemented cleanup after crashed CREATE DATABASE and ALTER DATABASE ... SET TABLESPACE. BTW, I wonder if this change allows these commands to be executed in a transaction block. I think the reason to prohibit that is to minimize the window between creation of the files and transaction commit - if the server crashes in that window, the new database files survive but the catalog changes don't. But maybe there are other reasons. (I don't claim it's terribly useful to create a database in a transaction block though, because the client cannot connect to it w/o leaving the current transaction.)

* Reordered the diffs, i.e. moved the discarding in front of the actual features.

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
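To make the new discard rule concrete, here is a sketch of the decision it enables. The struct layout and function names are hypothetical (the real chunk header and retention rules are richer), and TransactionIdIsInProgress/TransactionIdDidCommit merely stand in for the real transaction status checks:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical view of the relevant undo log chunk header fields. */
typedef struct UndoChunkHeader
{
    TransactionId xid;      /* transaction that wrote the chunk */
    bool          applied;  /* set once ROLLBACK has executed the undo */
} UndoChunkHeader;

/* Stand-ins for the real transaction status checks. */
extern bool TransactionIdIsInProgress(TransactionId xid);
extern bool TransactionIdDidCommit(TransactionId xid);

/*
 * Undo of a committed transaction can be discarded (no rollback will
 * ever need it); undo of an aborted transaction only after the undo
 * actions actually ran, which is what the new boolean records.
 */
static bool
chunk_is_discardable(const UndoChunkHeader *hdr)
{
    if (TransactionIdIsInProgress(hdr->xid))
        return false;
    if (TransactionIdDidCommit(hdr->xid))
        return true;
    return hdr->applied;
}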
On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > Well, on repeated run of the test I could also hit the first one. I could fix > > it and will post a new version of the patch (along with some other small > > changes) this week. > > Attached is the next version. Changes done: Yikes, this patch is 23k lines, and most of it looks like added lines of code. Is this size expected? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Bruce Momjian <bruce@momjian.us> wrote: > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > > Antonin Houska <ah@cybertec.at> wrote: > > > Well, on repeated run of the test I could also hit the first one. I could fix > > > it and will post a new version of the patch (along with some other small > > > changes) this week. > > > > Attached is the next version. Changes done: > > Yikes, this patch is 23k lines, and most of it looks like added lines of > code. Is this size expected? Yes, earlier versions of this patch, e.g. [1], were of comparable size. It's not really an "ordinary patch". [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhRq-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
From: Antonin Houska <ah@cybertec.at> > not really an "ordinary patch". > > [1] > https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhR > q-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com I'm a bit interested in zheap-related topics. I'm reading this discussion to see what I can do. (But this thread is too long... there are still 13,000 lines out of 45,000 lines.) What's the latest patch set to look at to achieve the undo infrastructure and its would-be first user, orphan file cleanup? As far as I've read, multiple people posted multiple patch sets, and I don't see how they are related. Regards Takayuki Tsunakawa
On Wed, Feb 3, 2021 at 2:45 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Antonin Houska <ah@cybertec.at> > > not really an "ordinary patch". > > > > [1] > > https://www.postgresql.org/message-id/CA%2BhUKG%2BMpzRsZFE7ChhR > > q-Br5VYYi6mafVQ73Af7ahioWo5o8w%40mail.gmail.com > > I'm a bit interested in zheap-related topics. I'm reading this discussion to see what I can do. (But this thread is toolong... there are still 13,000 lines out of 45,000 lines.) > > What's the latest patch set to look at to achieve the undo infrastructure and its would-be first user, orphan file cleanup? As far as I've read, multiple people posted multiple patch sets, and I don't see how they are related. > I feel it is good to start with the latest patch-set posted by Antonin [1]. [1] - https://www.postgresql.org/message-id/87363.1611941415%40antos -- With Regards, Amit Kapila.
tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > I'm crawling like a snail to read the patch set. Below are my first set of review comments, which are all minor. Thanks. > > (1) > + <indexterm><primary>tablespace</primary><secondary>temporary</secondary></indexterm> > > temporary -> undo Fixed. > > (2) > <term><varname>undo_tablespaces</varname> (<type>string</type>) > + > ... > + The value is a list of names of tablespaces. When there is more than > + one name in the list, <productname>PostgreSQL</productname> chooses an > + arbitrary one. If the name doesn't correspond to an existing > + tablespace, the next name is tried, and so on until all names have > + been tried. If no valid tablespace is specified, an error is raised. > + The validation of the name doesn't happen until the first attempt to > + write undo data. > > CREATE privilege needs to be mentioned like temp_tablespaces. Fixed. > > (3) > + The variable can only be changed before the first statement is > + executed in a transaction. > > Does it include any first statement that doesn't emit undo? Yes, it does. As soon as XID is assigned, the variable can no longer be set. > (4) > + <entry>One row for each undo log, showing current pointers, > + transactions and backends. > + See <xref linkend="pg-stat-undo-logs-view"/> for details. > > I think this can just be written like "showing usage information about the > undo log" just like other statistics views. That way, we can avoid having > to modify this sentence when we want to change the content of the view > later. Done. > > (5) > + <entry><structfield>discard</structfield></entry> > + <entry><type>text</type></entry> > + <entry>Location of the oldest data in this undo log.</entry> > > The name does not match the description intuitively. Maybe "oldest"? Discarding of the undo log is an important term used in the code. > BTW, how does this information help users? (I don't mean to say we > shouldn't output information that users cannot interpret; other DBMSs output > such information probably for technical support staff.) It's for DBA rather than a user. The value indicates whether discarding is working well or if it's blocked for some reason. If the latter happens, the undo log can pile up and consume too much disk space. > (6) > + <entry><structfield>insert</structfield></entry> > + <entry><type>text</type></entry> > + <entry>Location where the next data will be written in this undo > + log.</entry> > ... > + <entry><structfield>end</structfield></entry> > + <entry><type>text</type></entry> > + <entry>Location one byte past the end of the allocated physical storage > + backing this undo log.</entry> > > Again, how can these be used? If they are useful to calculate the amount of used space, shouldn't they be bigint? bigint is signed, so it cannot express 64-bit number. I think this deserves a new SQL type for the undo pointer, like pg_lsn for XLOG. > > (7) > @@ -65,7 +65,7 @@ > <structfield>smgrid</structfield> <type>integer</type> > </para> > <para> > - Block storage manager ID. 0 for regular relation data.</entry> > + Block storage manager ID. 0 for regular relation data. > </para></entry> > </row> > > I guess this change is mistakenly included? Fixed. > > (8) > diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml > @@ -216,6 +216,7 @@ Complete list of usable sgml source files in this directory. 
> <!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml"> > <!ENTITY pgupgrade SYSTEM "pgupgrade.sgml"> > <!ENTITY pgwaldump SYSTEM "pg_waldump.sgml"> > +<!ENTITY pgundodump SYSTEM "pg_undo_dump.sgml"> > <!ENTITY postgres SYSTEM "postgres-ref.sgml"> > > @@ -286,6 +286,7 @@ > &pgtesttiming; > &pgupgrade; > &pgwaldump; > + &pgundodump; > &postgres; > > It looks like this list needs to be ordered alphabetically. So, the new line is better placed between pg_test_timing and pg_upgrade? Fixed. > > (9) > I don't want to be disliked because of being picky, but maybe pg_undo_dump should be pg_undodump. Existing commands don't use '_' to separate words after pg_, except for pg_test_fsync and pg_test_timing. Done. > > (10) > + This utility can only be run by the user who installed the server, because > + it requires read-only access to the data directory. > > I guess you copied this from pg_waldump or pg_resetwal, but I'm afraid this should be as follows, which is an excerpt from pg_controldata's page. (The pages for pg_waldump and pg_resetwal should be fixed in a separate thread.) > > This utility can only be run by the user who initialized the cluster because it requires read access to the data directory. Fixed. > > (11) > + The <option>-m</option> option cannot be used if > + either <option>-c</option> or <option>-l</option> is used. > > -l -> -r Fixed. > Or, why don't we align option characters with pg_waldump? pg_waldump uses -r to filter by rmgr. pg_undodump can output record contents by default like pg_waldump. Considering pg_dump and pg_dumpall also output all data by default, that seems how PostgreSQL commands behave. I've made the -r value (print out the undo records) the default, will consider using -r for filtering by rmgr. > > (12) > + <arg choice="opt"><option>startseg</option><arg choice="opt"><option>endseg</option></arg></arg> > > startseg and endseg are not described. Fixed. (Of course, this is evidence that I used pg_waldump as a skeleton :-)) > > (13) > +Undo files backing undo logs in the default tablespace are stored under > ... > +Undo log files contain standard page headers as described in the next section, > > Fluctuations in expressions can be seen: undo file and undo log file. I think the following "undo data file" fits best. What do you think? > > + <entry><literal>UndoFileRead</literal></entry> > + <entry>Waiting for a read from an undo data file.</entry> > "Undo files backing undo logs ..." My feeling is that "data files" would be distracting here. I think the point of this sentence is simply that something resides in a file. "Undo log files contain standard page headers as described in the next section" I'm not opposed to "data files" here as there are also other kinds of files written by undo (at least the metadata written during checkpoint). Changed. > (14) > +Undo data exists in a 64-bit address space divided into 2^34 undo > +logs, each with a theoretical capacity of 1TB. The first time a > +backend writes undo, it attaches to an existing undo log whose > +capacity is not yet exhausted and which is not currently being used by > +any other backend; or if no suitable undo log already exists, it > +creates a new one. To avoid wasting space, each undo log is further > +divided into 1MB segment files, so that segments which are no longer > +needed can be removed (possibly recycling the underlying file by > +renaming it) and segments which are not yet needed do not need to be > +physically created on disk.
An undo segment file has a name like > +<filename>000004.0001200000</filename>, where > +<filename>000004</filename> is the undo log number and > +<filename>0001200000</filename> is the offset of the first byte > +held in the file. > > The number of undo logs is not 2^34 but 2^24 (2^(64-40), since each log covers 2^40 bytes = 1 TB). Fixed. > (15) src/backend/access/undo/README > \ No newline at end of file > > Let's add a newline. > Fixed. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
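As a side note on items (6) and (14), the pointer arithmetic is easy to sketch. The macro names below are invented, but the layout (24-bit log number, 40-bit offset, 1MB segments, hex file names) follows the quoted documentation; the example also shows why a signed bigint cannot carry the full unsigned 64-bit range:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t UndoRecPtr;    /* unsigned: the top bit is meaningful */

#define UNDO_OFFSET_BITS 40     /* 2^40 bytes = 1 TB per undo log */
#define UNDO_LOG_NUMBER(p)  ((p) >> UNDO_OFFSET_BITS)   /* 24 bits remain */
#define UNDO_LOG_OFFSET(p)  ((p) & ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1))
#define UNDO_SEGMENT_SIZE   (UINT64_C(1) << 20)         /* 1MB segments */

int
main(void)
{
    /* Reconstructs the documented example file name 000004.0001200000. */
    UndoRecPtr p = (UINT64_C(4) << UNDO_OFFSET_BITS) | UINT64_C(0x1200000);

    printf("%06llX.%010llX\n",
           (unsigned long long) UNDO_LOG_NUMBER(p),
           (unsigned long long) (UNDO_LOG_OFFSET(p)
                                 - UNDO_LOG_OFFSET(p) % UNDO_SEGMENT_SIZE));
    return 0;
}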
Antonin Houska <ah@cybertec.at> wrote: > tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > > I'm crawling like a snail to read the patch set. Below are my first set of review comments, which are all minor. > > Thanks. I've added the patch to the upcoming CF [1], so it possibly gets more review and makes some progress. I've marked myself as the author so it's clear who will try to respond to the reviews. It's clear that other people did much more work on the feature than I did so far - they are welcome to add themselves to the author list. [1] https://commitfest.postgresql.org/33/3228/ -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Jun 30, 2021 at 11:10 PM Antonin Houska <ah@cybertec.at> wrote: > > Antonin Houska <ah@cybertec.at> wrote: > > > tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > > > > I'm crawling like a snail to read the patch set. Below are my first set of review comments, which are all minor. > > > > Thanks. > > I've added the patch to the upcoming CF [1], so it possibly gets more review > and makes some progress. I've marked myself as the author so it's clear who > will try to respond to the reviews. It's clear that other people did much more > work on the feature than I did so far - they are welcome to add themselves to > the author list. > > [1] https://commitfest.postgresql.org/33/3228/ > The patch does not apply on Head anymore, could you rebase and post a patch. I'm changing the status to "Waiting for Author". Regards, Vignesh
> On Wed, Jun 30, 2021 at 07:41:16PM +0200, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > > tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > > > > I'm crawling like a snail to read the patch set. Below are my first set of review comments, which are all minor. > > > > Thanks. > > I've added the patch to the upcoming CF [1], so it possibly gets more review > and makes some progress. I've marked myself as the author so it's clear who > will try to respond to the reviews. It's clear that other people did much more > work on the feature than I did so far - they are welcome to add themselves to > the author list. > > [1] https://commitfest.postgresql.org/33/3228/

Hi, I'm crawling through the patch set like an even slower creature than a snail, sorry for the long absence. I'm reading the latest version posted here and, although it's hard to give any high-level design comments on it yet, I thought it could be useful to post a few findings and questions in the meantime.

* One question about the functionality: > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > Attached is the next version. Changes done: > > * Removed the progress tracking and implemented undo discarding in a simpler > way. Now, instead of maintaining the pointer to the last record applied, > only a boolean field in the chunk header is set when ROLLBACK is > done. This helps to determine whether the undo of a non-committed > transaction can be discarded. Just to clarify, the whole feature was removed for the sake of simplicity, right?

* By throwing `make installcheck` at the patchset, from time to time I'm getting an error on restart:

TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", File: "undorecordset.c", Line: 1098, PID: 6055)

From what I see XLogReadBufferForRedoExtended finds an invalid buffer and returns BLK_NOTFOUND. The commentary says: If the block was not found, then it must be discarded later in the WAL. and continues with skip = false, but fails to get a page from an invalid buffer a few lines later. It seems that the skip flag is supposed to be used in this situation; should it also guard the BufferGetPage part?

* Another interesting issue I've found happened inside DropUndoLogsInTablespace, when the process got SIGTERM. It seems processing is stuck on:

slist_foreach_modify(iter, &UndoLogShared->shared_free_lists[i])

iterating on the same element over and over. My guess is clear_private_free_lists was called and caused such an unexpected outcome; should the access to shared_free_lists be somehow protected?

* I also wonder about the segments in base/undo; the commentary in pg_undodump says: Since the UNDO log is a continuous stream of changes, any hole terminates processing. It looks like it's relatively easy to end up with such holes, and pg_undodump ends up with a message ("found" is added by me and contains a found offset which does not match the expected value):

pg_undodump: error: segment 0000000000 missing in log 2, found 0000100000

This does not seem to cause any real issues, but it's not clear to me whether such a situation with gaps is fine or a problem.

Other than that, one more time thank you for this tremendous work; I find the topic to be of extreme importance.
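For readers following along, the replay pattern in question looks roughly like this. XLogReadBufferForRedoExtended, BLK_NOTFOUND and BufferGetPage are the real core APIs, but the function itself is a simplified sketch of the guard Dmitry suggests (buffer locking and the other XLogRedoAction cases are omitted):

#include "postgres.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"

/*
 * Returns NULL when the block no longer exists (it is discarded later
 * in the WAL), so the caller skips the redo work instead of calling
 * BufferGetPage() on an invalid buffer -- the crash seen above.
 */
static Page
undo_redo_get_page(XLogReaderState *record, uint8 block_id, Buffer *buffer)
{
    if (XLogReadBufferForRedoExtended(record, block_id, RBM_NORMAL,
                                      false, buffer) == BLK_NOTFOUND)
        return NULL;

    return BufferGetPage(*buffer);
}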
Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Hi, > > I'm crawling through the patch set like even slower creature than a snail, > sorry for long absence. I'm reading the latest version posted here and, > although it's hard to give any high level design comments on it yet, I thought > it could be useful to post a few findings and questions in the meantime. > > * One question about the functionality: > > > On Fri, Jan 29, 2021 at 06:30:15PM +0100, Antonin Houska wrote: > > Attached is the next version. Changes done: > > > > * Removed the progress tracking and implemented undo discarding in a simpler > > way. Now, instead of maintaining the pointer to the last record applied, > > only a boolean field in the chunk header is set when ROLLBACK is > > done. This helps to determine whether the undo of a non-committed > > transaction can be discarded. > > Just to clarify, the whole feature was removed for the sake of > simplicity, right? Amit Kapila told me that zheap can recognize that a particular undo record was already applied, and I could eventually find the corresponding code. So I removed the tracking from the undo log layer, although I still think it'd fit there. However, I then found out that at least a boolean flag in the chunk header is needed to handle the discarding, so I implemented it. > * By throwing at the patchset `make installcheck` I'm getting from time to time > and error on the restart: > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > File: "undorecordset.c", Line: 1098, PID: 6055) > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer and > returns BLK_NOTFOUND. The commentary says: > > If the block was not found, then it must be discarded later in > the WAL. > > and continues with skip = false, but fails to get a page from an invalid > buffer few lines later. It seems that the skip flag is supposed to be used > this situation, should it also guard the BufferGetPage part? I could see this sometimes too, but can't reproduce it now. It's also not clear to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the whole undo log segment is created at once, even if only part of it is needed - see allocate_empty_undo_segment(). > * Another interesting issue I've found happened inside > DropUndoLogsInTablespace, when the process got SIGTERM. It seems processing > stuck on: > > slist_foreach_modify(iter, &UndoLogShared->shared_free_lists[i]) > > iterating on the same element over and over. My guess is > clear_private_free_lists was called and caused such unexpected outcome, > should the access to shared_free_lists be somehow protected? Well, I could get this error on a repeated run of the test too. Thanks for the report. The list is protected by UndoLogLock. I found out that the problem was that free_undo_log_slot() "freed" the slot but didn't remove it from the shared freelist. Then some other backend thought it was free, picked it from the shared slot array, used it, and pushed it back onto the shared freelist. If the same item is already at the list head, slist_push_head() makes the initial node point to itself. I fixed it by removing the slot from the freelist before calling free_undo_log_slot() from CheckPointUndoLogs(). (The other call site DropUndoLogsInTablespace() was o.k.) > * I also wonder about the segments in base/undo, the commentary in pg_undodump > says: > > Since the UNDO log is a continuous stream of changes, any hole > terminates processing.
> > It looks like it's relatively easy to end up with such holes, and pg_undodump > ends up with a message (found is added by me and contains a found offset > which do not match the expected value): > > pg_undodump: error: segment 0000000000 missing in log 2, found 0000100000 > > This seems to be not causing any real issues, but it's not clear for me if > such situation with gaps is fine or is it a problem? OK, I missed the point that the initial segment (or the initial sequence of segments) of the log can be missing due to discarding and segment recycling. I've fixed that, but if a segment is missing in the middle, it's still considered an error. > Other than that one more time thank you for this tremendous work, I find that > the topic is of extreme importance. I'm just trying to continue the tremendous work of others :-) Thanks for your review! -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
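The self-linking failure Antonin describes can be illustrated with the core slist API alone (lib/ilist.h is the real header; the slot struct is hypothetical). This is a demonstration of the hazard, not code from the patch:

#include "postgres.h"
#include "lib/ilist.h"

typedef struct UndoLogSlot      /* hypothetical, for illustration */
{
    slist_node freelist_node;
} UndoLogSlot;

static void
demonstrate_freelist_self_link(void)
{
    slist_head freelist;
    UndoLogSlot slot;

    slist_init(&freelist);
    slist_push_head(&freelist, &slot.freelist_node);

    /*
     * Pushing the node again while it is already the list head makes it
     * point to itself: slist_push_head() sets node->next to the current
     * head, i.e. to the node itself.  Any later slist_foreach_modify()
     * then visits this element forever, which is the hang observed in
     * DropUndoLogsInTablespace().
     */
    slist_push_head(&freelist, &slot.freelist_node);
}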
The cfbot complained that the patch series no longer applies, so I've rebased it and also tried to make sure that the other flags become green.

One particular problem was that pg_upgrade complained that "live undo data" remains in the old cluster. I found out that the temporary undo log causes the problem, so I've adjusted the query in check_for_undo_data() accordingly until the problem gets fixed properly.

The problem of the temporary undo log is that it's loaded into local buffers and that a backend can exit w/o flushing local buffers to disk, and thus we are not guaranteed to find enough information when trying to discard the undo log the backend wrote. I'm thinking about the following solutions:

1. Let the backend manage temporary undo log on its own (even the slot metadata would stay outside the shared memory, and in particular the insertion pointer could start from 1 for each session) and remove the segment files at the same moment the temporary relations are removed.

However, by moving the temporary undo slots away from the shared memory, computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would be affected. It might seem that a transaction which only writes undo log for temporary relations does not need to affect oldestFullXidHavingUndo, but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo prevents transactions from being truncated from the CLOG too early, I wonder if the following is possible (this scenario is only applicable to the zheap storage engine [1], which is not included in this patch, but should already be considered):

A transaction creates a temporary table, does some (many) changes and then gets rolled back. The undo records are being applied and it takes some time. Since the XID of the transaction did not affect oldestFullXidHavingUndo, the XID can disappear from the CLOG due to truncation. However zundo.c in [1] indicates that the transaction status *is* checked during undo execution, so we might have a problem. Or am I missing something?

UndoDiscard() in zheap seems to ignore temporary undo:

/* We can't process temporary undo logs. */
if (log->meta.persistence == UNDO_TEMP)
    continue;

2. Do not load the temporary undo into local buffers. If it's always in the shared buffers, we should never see incomplete data when trying to discard undo. In this case, persistence levels UNDOPERSISTENCE_UNLOGGED and UNDOPERSISTENCE_TEMP could be merged into a single level.

3. Implement the discarding in another way, but I don't have a new idea right now. Suggestions are welcome.

[1] https://github.com/EnterpriseDB/zheap/tree/master

-- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
Antonin Houska <ah@cybertec.at> wrote: > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > * By throwing at the patchset `make installcheck` I'm getting from time to time > > and error on the restart: > > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > > File: "undorecordset.c", Line: 1098, PID: 6055) > > > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer and > > returns BLK_NOTFOUND. The commentary says: > > > > If the block was not found, then it must be discarded later in > > the WAL. > > > > and continues with skip = false, but fails to get a page from an invalid > > buffer few lines later. It seems that the skip flag is supposed to be used > > this situation, should it also guard the BufferGetPage part? > > I could see this sometime too, but can't reproduce it now. It's also not clear > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the > whole undo log segment is created at once, even if only part of it is needed - > see allocate_empty_undo_segment(). I could eventually reproduce the problem. The root cause was that WAL records were created even for temporary / unlogged undo, and thus only empty pages could be found during replay. I've fixed that and also set up a regular test for the BLK_NOTFOUND value. That required a few more fixes to UndoReplay(). Attached is a new version. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
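The shape of that fix can be summarized as follows. This is a condensed sketch, not the patch's actual insert path; the UNDOPERSISTENCE_PERMANENT spelling is an assumption modeled on the level names mentioned earlier in the thread, and log_undo_insert() is a stand-in for the real WAL-logging step:

#include <stdbool.h>

typedef enum UndoPersistenceLevel
{
    UNDOPERSISTENCE_PERMANENT,  /* assumed name */
    UNDOPERSISTENCE_UNLOGGED,
    UNDOPERSISTENCE_TEMP
} UndoPersistenceLevel;

extern void log_undo_insert(void);  /* stand-in for the WAL-logging step */

/*
 * Only permanent undo may be WAL-logged.  Temporary and unlogged undo
 * does not survive a crash in a usable form, so replaying WAL for it
 * can only find empty pages -- the BLK_NOTFOUND symptom discussed
 * above.
 */
static void
undo_insert_finish(UndoPersistenceLevel level)
{
    if (level == UNDOPERSISTENCE_PERMANENT)
        log_undo_insert();
}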
On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > The cfbot complained that the patch series no longer applies, so I've rebased > it and also tried to make sure that the other flags become green. > > One particular problem was that pg_upgrade complained that "live undo data" > remains in the old cluster. I found out that the temporary undo log causes the > problem, so I've adjusted the query in check_for_undo_data() accordingly until > the problem gets fixed properly. > > The problem of the temporary undo log is that it's loaded into local buffers > and that backend can exit w/o flushing local buffers to disk, and thus we are > not guaranteed to find enough information when trying to discard the undo log > the backend wrote. I'm thinking about the following solutions: > > 1. Let the backend manage temporary undo log on its own (even the slot > metadata would stay outside the shared memory, and in particular the > insertion pointer could start from 1 for each session) and remove the > segment files at the same moment the temporary relations are removed. > > However, by moving the temporary undo slots away from the shared memory, > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > be affected. It might seem that a transaction which only writes undo log > for temporary relations does not need to affect oldestFullXidHavingUndo, > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > prevents transactions to be truncated from the CLOG too early, I wonder if > the following is possible (This scenario is only applicable to the zheap > storage engine [1], which is not included in this patch, but should already > be considered.): > > A transaction creates a temporary table, does some (many) changes and then > gets rolled back. The undo records are being applied and it takes some > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > the XID can disappear from the CLOG due to truncation. > By above do you mean to say that in zheap code, we don't consider XIDs that operate on temp table/undo for oldestFullXidHavingUndo? > However zundo.c in > [1] indicates that the transaction status *is* checked during undo > execution, so we might have a problem. > It would be easier to follow if you can tell which exact code are you referring here? -- With Regards, Amit Kapila.
> On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > Antonin Houska <ah@cybertec.at> wrote: > > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > * By throwing at the patchset `make installcheck` I'm getting from time to time > > > and error on the restart: > > > > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > > > File: "undorecordset.c", Line: 1098, PID: 6055) > > > > > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer and > > > returns BLK_NOTFOUND. The commentary says: > > > > > > If the block was not found, then it must be discarded later in > > > the WAL. > > > > > > and continues with skip = false, but fails to get a page from an invalid > > > buffer few lines later. It seems that the skip flag is supposed to be used > > > this situation, should it also guard the BufferGetPage part? > > > > I could see this sometime too, but can't reproduce it now. It's also not clear > > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the > > whole undo log segment is created at once, even if only part of it is needed - > > see allocate_empty_undo_segment(). > > I could eventually reproduce the problem. The root cause was that WAL records > were created even for temporary / unlogged undo, and thus only empty pages > could be found during replay. I've fixed that and also setup regular test for > the BLK_NOTFOUND value. That required a few more fixes to UndoReplay(). > > Attached is a new version.

Yep, makes sense, thanks. I have a few more questions:

* The use case with orphaned files is working somewhat differently after the rebase on the latest master; do you observe it as well? The difference is that ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up an orphaned relation file immediately (only later, at a checkpoint) because of empty pendingUnlinks. I haven't investigated more yet, but it seems like after this commit:

commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
Author: Thomas Munro <tmunro@postgresql.org>
Date: Mon Aug 2 17:32:20 2021 +1200

Run checkpointer and bgwriter in crash recovery.

Start up the checkpointer and bgwriter during crash recovery (except in --single mode), as we do for replication. This wasn't done back in commit cdd46c76 out of caution. Now it seems like a better idea to make the environment as similar as possible in both cases. There may also be some performance advantages.

something has to be updated (pendingOps are empty right now, so no unlink request is remembered).

* What happened with the idea of abandoning the discard worker for the sake of simplicity? From what I see, limiting everything to foreground undo could reduce the core of the patch series to the first four patches (forgetting about tests and docs, but I guess it would be enough at least for the design review), which is already less overwhelming.
On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > * What happened with the idea of abandoning discard worker for the sake > of simplicity? From what I see limiting everything to foreground undo > could reduce the core of the patch series to the first four patches > (forgetting about test and docs, but I guess it would be enough at > least for the design review), which is already less overwhelming. > I think the discard worker would be required even if we decide to apply all the undo in the foreground. We need to forget/remove the undo of committed transactions as well which we can't remove immediately after the commit. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > * What happened with the idea of abandoning discard worker for the sake > > of simplicity? From what I see limiting everything to foreground undo > > could reduce the core of the patch series to the first four patches > > (forgetting about test and docs, but I guess it would be enough at > > least for the design review), which is already less overwhelming. > > > > I think the discard worker would be required even if we decide to > apply all the undo in the foreground. We need to forget/remove the > undo of committed transactions as well which we can't remove > immediately after the commit. I think I proposed foreground discarding at some point, but you reminded me that the undo may still be needed for some time even after transaction commit. Thus the discard worker is indispensable. What we can do without, at least for the cleanup of the orphaned files, is the *undo worker*. In this patch series the cleanup is handled by the startup process. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > and that backend can exit w/o flushing local buffers to disk, and thus we are > > > not guaranteed to find enough information when trying to discard the undo log > > > the backend wrote. I'm thinking about the following solutions: > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > metadata would stay outside the shared memory, and in particular the > > > insertion pointer could start from 1 for each session) and remove the > > > segment files at the same moment the temporary relations are removed. > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > be affected. It might seem that a transaction which only writes undo log > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > prevents transactions to be truncated from the CLOG too early, I wonder if > > > the following is possible (This scenario is only applicable to the zheap > > > storage engine [1], which is not included in this patch, but should already > > > be considered.): > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > gets rolled back. The undo records are being applied and it takes some > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > the XID can disappear from the CLOG due to truncation. > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > that operate on temp table/undo for oldestFullXidHavingUndo? I was referring to the code

/* We can't process temporary undo logs. */
if (log->meta.persistence == UNDO_TEMP)
    continue;

in undodiscard.c:UndoDiscard(). > > > However zundo.c in > > [1] indicates that the transaction status *is* checked during undo > > execution, so we might have a problem. > > > > It would be easier to follow if you can tell which exact code are you > referring here? I meant the call of TransactionIdDidCommit() in zundo.c:zheap_exec_pending_rollback(). -- Antonin Houska Web: https://www.cybertec-postgresql.com
Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > Antonin Houska <ah@cybertec.at> wrote: > > > > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > * By throwing at the patchset `make installcheck` I'm getting from time to time > > > > and error on the restart: > > > > > > > > TRAP: FailedAssertion("BufferIsValid(buffers[nbuffers].buffer)", > > > > File: "undorecordset.c", Line: 1098, PID: 6055) > > > > > > > > From what I see XLogReadBufferForRedoExtended finds an invalid buffer and > > > > returns BLK_NOTFOUND. The commentary says: > > > > > > > > If the block was not found, then it must be discarded later in > > > > the WAL. > > > > > > > > and continues with skip = false, but fails to get a page from an invalid > > > > buffer few lines later. It seems that the skip flag is supposed to be used > > > > this situation, should it also guard the BufferGetPage part? > > > > > > I could see this sometime too, but can't reproduce it now. It's also not clear > > > to me how XLogReadBufferForRedoExtended() can return BLK_NOTFOUND, as the > > > whole undo log segment is created at once, even if only part of it is needed - > > > see allocate_empty_undo_segment(). > > > > I could eventually reproduce the problem. The root cause was that WAL records > > were created even for temporary / unlogged undo, and thus only empty pages > > could be found during replay. I've fixed that and also setup regular test for > > the BLK_NOTFOUND value. That required a few more fixes to UndoReplay(). > > > > Attached is a new version. > > Yep, makes sense, thanks. I have few more questions: > > * The use case with orphaned files is working somewhat differently after > the rebase on the latest master, do you observe it as well? The > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > an orphaned relation file immediately (only later on checkpoint) > because of empty pendingUnlinks. I haven't investigated more yet, but > seems like after this commit: > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > Author: Thomas Munro <tmunro@postgresql.org> > Date: Mon Aug 2 17:32:20 2021 +1200 > > Run checkpointer and bgwriter in crash recovery. > > Start up the checkpointer and bgwriter during crash recovery (except in > --single mode), as we do for replication. This wasn't done back in > commit cdd46c76 out of caution. Now it seems like a better idea to make > the environment as similar as possible in both cases. There may also be > some performance advantages. > > something has to be updated (pendingOps are empty right now, so no > unlink request is remembered). I haven't been debugging that part recently, but yes, this commit is relevant, thanks for pointing that out! Attached is a patch that should fix it. I'll include it in the next version of the patch series, unless you tell me that something is still wrong. -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/access/undo/undorecordset.c b/src/backend/access/undo/undorecordset.c
index 59eba7dfb6..9d05824141 100644
--- a/src/backend/access/undo/undorecordset.c
+++ b/src/backend/access/undo/undorecordset.c
@@ -2622,14 +2622,6 @@ ApplyPendingUndo(void)
 		}
 	}
 
-	/*
-	 * Some undo actions may unlink files. Since the checkpointer is not
-	 * guaranteed to be up, it seems simpler to process the undo request
-	 * ourselves in the way the checkpointer would do.
-	 */
-	SyncPreCheckpoint();
-	SyncPostCheckpoint();
-
 	/* Cleanup. */
 	chunktable_destroy(sets);
 }
On Tue, 21 Sep 2021 09:00 Antonin Houska, <ah@cybertec.at> wrote:
Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Yep, makes sense, thanks. I have few more questions:
>
> * The use case with orphaned files is working somewhat differently after
> the rebase on the latest master, do you observe it as well? The
> difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up
> an orphaned relation file immediately (only later on checkpoint)
> because of empty pendingUnlinks. I haven't investigated more yet, but
> seems like after this commit:
>
> commit 7ff23c6d277d1d90478a51f0dd81414d343f3850
> Author: Thomas Munro <tmunro@postgresql.org>
> Date: Mon Aug 2 17:32:20 2021 +1200
>
> Run checkpointer and bgwriter in crash recovery.
>
> Start up the checkpointer and bgwriter during crash recovery (except in
> --single mode), as we do for replication. This wasn't done back in
> commit cdd46c76 out of caution. Now it seems like a better idea to make
> the environment as similar as possible in both cases. There may also be
> some performance advantages.
>
> something has to be updated (pendingOps are empty right now, so no
> unlink request is remembered).
I haven't been debugging that part recently, but yes, this commit is relevant,
thanks for pointing that out! Attached is a patch that should fix it. I'll
include it in the next version of the patch series, unless you tell me that
something is still wrong.
Sure, but I can take a look only in a couple of days.
On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > of simplicity? From what I see limiting everything to foreground undo > > > could reduce the core of the patch series to the first four patches > > > (forgetting about test and docs, but I guess it would be enough at > > > least for the design review), which is already less overwhelming. > > > > > > > I think the discard worker would be required even if we decide to > > apply all the undo in the foreground. We need to forget/remove the > > undo of committed transactions as well which we can't remove > > immediately after the commit. > > I think I proposed foreground discarding at some point, but you reminded me > that the undo may still be needed for some time even after transaction > commit. Thus the discard worker is indispensable. Right. > What we can miss, at least for the cleanup of the orphaned files, is the *undo > worker*. In this patch series the cleanup is handled by the startup process. > Okay, I think various people at different points in time have suggested that idea. I think one thing we might need to consider is what to do in case of a FATAL error. In case of a FATAL error, it won't be advisable to execute undo immediately, so would we upgrade the error to PANIC in such cases? I vaguely remember that for cleanup of orphaned files, which should happen rarely, someone suggested upgrading the error to PANIC in such a case, but I don't remember the exact details. -- With Regards, Amit Kapila.
On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > and that backend can exit w/o flushing local buffers to disk, and thus we are > > > not guaranteed to find enough information when trying to discard the undo log > > > the backend wrote. I'm thinking about the following solutions: > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > metadata would stay outside the shared memory, and in particular the > > > insertion pointer could start from 1 for each session) and remove the > > > segment files at the same moment the temporary relations are removed. > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > be affected. It might seem that a transaction which only writes undo log > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > prevents transactions to be truncated from the CLOG too early, I wonder if > > > the following is possible (This scenario is only applicable to the zheap > > > storage engine [1], which is not included in this patch, but should already > > > be considered.): > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > gets rolled back. The undo records are being applied and it takes some > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > the XID can disappear from the CLOG due to truncation. > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > that operate on temp table/undo for oldestFullXidHavingUndo? > > I was referring to the code > > /* We can't process temporary undo logs. */ > if (log->meta.persistence == UNDO_TEMP) > continue; > > in undodiscard.c:UndoDiscard(). > Here, I think it will just skip undo of temporary undo logs and oldestFullXidHavingUndo should be advanced after skipping it. > > > > > However zundo.c in > > > [1] indicates that the transaction status *is* checked during undo > > > execution, so we might have a problem. > > > > > > > It would be easier to follow if you can tell which exact code are you > > referring here? > > In meant the call of TransactionIdDidCommit() in > zundo.c:zheap_exec_pending_rollback(). > IIRC, this should be called for temp tables after they have exited as this is only to apply the pending undo actions if any, and in case of temporary undo after session exit, we shouldn't need it. I am not able to understand what exact problem you are facing for temp tables after the session exit. Can you please explain it a bit more? -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > > of simplicity? From what I see limiting everything to foreground undo > > > > could reduce the core of the patch series to the first four patches > > > > (forgetting about test and docs, but I guess it would be enough at > > > > least for the design review), which is already less overwhelming. > > What we can miss, at least for the cleanup of the orphaned files, is the *undo > > worker*. In this patch series the cleanup is handled by the startup process. > > > > Okay, I think various people at different point of times has suggested > that idea. I think one thing we might need to consider is what to do > in case of a FATAL error? In case of FATAL error, it won't be > advisable to execute undo immediately, so would we upgrade the error > to PANIC in such cases. I remember vaguely that for clean up of > orphaned files that can happen rarely and someone has suggested > upgrading the error to PANIC in such a case but I don't remember the > exact details. Do you mean FATAL error during normal operation? As far as I understand, even zheap does not rely on immediate UNDO execution (otherwise it'd never introduce the undo worker), so FATAL only means that the undo needs to be applied later so it can be discarded. As for the orphaned files cleanup feature with no undo worker, we might need PANIC to ensure that the undo is applied during restart and that it can be discarded, otherwise the unapplied undo log would stay there until the next (regular) restart and it would block discarding. However upgrading FATAL to PANIC just because the current transaction created a table seems quite rude. So the undo worker might be needed even for this patch? Or do you mean FATAL error when executing the UNDO? -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > > > of simplicity? From what I see limiting everything to foreground undo > > > > > could reduce the core of the patch series to the first four patches > > > > > (forgetting about test and docs, but I guess it would be enough at > > > > > least for the design review), which is already less overwhelming. > > > > What we can miss, at least for the cleanup of the orphaned files, is the *undo > > > worker*. In this patch series the cleanup is handled by the startup process. > > > > > > > Okay, I think various people at different point of times has suggested > > that idea. I think one thing we might need to consider is what to do > > in case of a FATAL error? In case of FATAL error, it won't be > > advisable to execute undo immediately, so would we upgrade the error > > to PANIC in such cases. I remember vaguely that for clean up of > > orphaned files that can happen rarely and someone has suggested > > upgrading the error to PANIC in such a case but I don't remember the > > exact details. > > Do you mean FATAL error during normal operation? > Yes. > As far as I understand, even > zheap does not rely on immediate UNDO execution (otherwise it'd never > introduce the undo worker), so FATAL only means that the undo needs to be > applied later so it can be discarded. > Yeah, zheap either applies undo later via background worker or next time before dml operation if there is a need. > As for the orphaned files cleanup feature with no undo worker, we might need > PANIC to ensure that the undo is applied during restart and that it can be > discarded, otherwise the unapplied undo log would stay there until the next > (regular) restart and it would block discarding. However upgrading FATAL to > PANIC just because the current transaction created a table seems quite > rude. > True, I guess but we can once see in what all scenarios it can generate FATAL during that operation. > So the undo worker might be needed even for this patch? > I think we can keep undo worker as a separate patch and for base patch keep the idea of promoting FATAL to PANIC. This will at the very least make the review easier. > Or do you mean FATAL error when executing the UNDO? > No. -- With Regards, Amit Kapila.
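To make the proposal concrete, the promotion Amit describes could look roughly like this. Everything here is hypothetical: the predicate name is made up, and exactly where such a check would live in the error-handling path is the open design question:

#include "postgres.h"

extern bool CurrentTransactionHasUndo(void);    /* made-up predicate */

/*
 * Sketch of the idea: a FATAL exit would leave this transaction's undo
 * unapplied until the next restart, blocking discard, while PANIC
 * forces crash recovery, which applies it.  Whether this trade-off is
 * acceptable is what the thread is debating.
 */
static int
adjust_elevel_for_undo(int elevel)
{
    if (elevel == FATAL && CurrentTransactionHasUndo())
        elevel = PANIC;
    return elevel;
}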
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote: > > > > > > > > > > > > * What happened with the idea of abandoning discard worker for the sake > > > > > > of simplicity? From what I see limiting everything to foreground undo > > > > > > could reduce the core of the patch series to the first four patches > > > > > > (forgetting about test and docs, but I guess it would be enough at > > > > > > least for the design review), which is already less overwhelming. > > > > > > What we can miss, at least for the cleanup of the orphaned files, is the *undo > > > > worker*. In this patch series the cleanup is handled by the startup process. > > > > > > > > > > Okay, I think various people at different point of times has suggested > > > that idea. I think one thing we might need to consider is what to do > > > in case of a FATAL error? In case of FATAL error, it won't be > > > advisable to execute undo immediately, so would we upgrade the error > > > to PANIC in such cases. I remember vaguely that for clean up of > > > orphaned files that can happen rarely and someone has suggested > > > upgrading the error to PANIC in such a case but I don't remember the > > > exact details. > > > > Do you mean FATAL error during normal operation? > > > > Yes. > > > As far as I understand, even > > zheap does not rely on immediate UNDO execution (otherwise it'd never > > introduce the undo worker), so FATAL only means that the undo needs to be > > applied later so it can be discarded. > > > > Yeah, zheap either applies undo later via background worker or next > time before dml operation if there is a need. > > > As for the orphaned files cleanup feature with no undo worker, we might need > > PANIC to ensure that the undo is applied during restart and that it can be > > discarded, otherwise the unapplied undo log would stay there until the next > > (regular) restart and it would block discarding. However upgrading FATAL to > > PANIC just because the current transaction created a table seems quite > > rude. > > > > True, I guess but we can once see in what all scenarios it can > generate FATAL during that operation. By "that operation" you mean "CREATE TABLE"? It's not about FATAL during CREATE TABLE, rather it's about FATAL anytime during a transaction. Whichever operation caused the FATAL error, we'd need to upgrade it to PANIC as long as the transaction has some undo. Although the postgres core probably does not raise FATAL errors too often (OOM conditions seem to be the typical cause), I'm still not enthusiastic about idea that the undo feature turns such errors into PANIC. I wonder what the reason to avoid undoing transaction on FATAL is. If it's about possibly long duration of the undo execution, deletion of orphaned files (relations or the whole databases) via undo shouldn't make things worse because currently FATAL also triggers this sort of cleanup immediately, it's just implemented in different ways. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Mon, Sep 27, 2021 at 7:43 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > Although the postgres core probably does not raise FATAL errors too often (OOM > conditions seem to be the typical cause), I'm still not enthusiastic about > the idea that the undo feature turns such errors into PANIC. > > I wonder what the reason is to avoid undoing the transaction on FATAL. If it's > about the possibly long duration of undo execution, deletion of orphaned files > (relations or whole databases) via undo shouldn't make things worse, because > currently FATAL also triggers this sort of cleanup immediately; it's just > implemented in a different way. > During FATAL processing, we don't want to perform further operations that could make the situation worse. Say we are already short of memory (OOM); undo execution might try to allocate more memory, which won't do any good. Depending on the implementation, undo execution might sometimes need to perform WAL writes or data writes, which we don't want to do during FATAL error processing. -- With Regards, Amit Kapila.
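To make the alternative being discussed concrete, here is a minimal sketch of the FATAL-to-PANIC promotion; this is not code from the patch, and CurrentTransactionHasUndo() is a hypothetical helper:

extern bool CurrentTransactionHasUndo(void);	/* hypothetical */

/*
 * Sketch only: escalate FATAL to PANIC when the dying transaction still
 * has unapplied undo, so that the undo is executed by crash recovery
 * instead of by the dying backend. FATAL and PANIC are the elevels from
 * elog.h.
 */
static int
undo_adjust_elevel(int elevel)
{
	if (elevel == FATAL && CurrentTransactionHasUndo())
		return PANIC;			/* restart; recovery applies the undo */
	return elevel;
}

The cost of this approach is exactly what Antonin points out above: any FATAL error in a transaction that wrote undo, however unrelated to the undo itself, would force a cluster-wide restart.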
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > > and that a backend can exit w/o flushing local buffers to disk, and thus we are > > > > not guaranteed to find enough information when trying to discard the undo log > > > > the backend wrote. I'm thinking about the following solutions: > > > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > > metadata would stay outside the shared memory, and in particular the > > > > insertion pointer could start from 1 for each session) and remove the > > > > segment files at the same moment the temporary relations are removed. > > > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > > be affected. It might seem that a transaction which only writes undo log > > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > > prevents transactions from being truncated from the CLOG too early, I wonder if > > > > the following is possible (This scenario is only applicable to the zheap > > > > storage engine [1], which is not included in this patch, but should already > > > > be considered.): > > > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > > gets rolled back. The undo records are being applied and it takes some > > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > > the XID can disappear from the CLOG due to truncation. > > > > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > > that operate on temp table/undo for oldestFullXidHavingUndo? > > > > I was referring to the code > > > > /* We can't process temporary undo logs. */ > > if (log->meta.persistence == UNDO_TEMP) > > continue; > > > > in undodiscard.c:UndoDiscard(). > > > > Here, I think it will just skip undo of temporary undo logs and > oldestFullXidHavingUndo should be advanced after skipping it. Right, it'll be advanced, but the transaction XID (if the transaction wrote only to temporary relations) might still be needed. > > > > > > > However zundo.c in > > > > [1] indicates that the transaction status *is* checked during undo > > > > execution, so we might have a problem. > > > > > > > > > > It would be easier to follow if you can tell which exact code you are > > > referring to here? > > > > I meant the call of TransactionIdDidCommit() in > > zundo.c:zheap_exec_pending_rollback(). > > > > IIRC, this should be called for temp tables after they have exited as > this is only to apply the pending undo actions if any, and in case of > temporary undo after session exit, we shouldn't need it. I see (had to play with the debugger a bit). Currently this works because the temporary relations are dropped by AbortTransaction() -> smgrDoPendingDeletes(), before the undo execution starts.
The situation will change as soon as the file removal is also handled by the undo subsystem; however, I'm still not sure how to hit the TransactionIdDidCommit() call for an XID already truncated from the CLOG. I'm starting to admit that there's no issue here: temporary undo is always applied immediately in the foreground, and thus the zheap_exec_pending_rollback() function never needs to examine an XID which no longer exists in the CLOG. > I am not able to understand what exact problem you are facing for temp > tables after the session exit. Can you please explain it a bit more? The problem is that the temporary undo is loaded into backend-local buffers. Thus there's no guarantee that we'll find consistent information in the undo file even if the backend exited cleanly (local buffers are not flushed at backend exit and there's no WAL for them). However, we need to read the undo file to find out if (part of) it can be discarded. I'm trying to find out whether we can ignore the temporary undo when trying to advance oldestFullXidHavingUndo or not. If we can, then each backend can manage its temporary undo on its own and - instead of checking which chunks can be discarded - simply delete the undo files on exit as a whole, just like it deletes temporary relations. Thus we wouldn't need to pay any special attention to discarding. Also, if backends managed the temporary undo this way, it wouldn't be necessary to track it via shared memory (UndoLogMetaData). (With this approach, the undo record to delete the temporary relation must not be temporary, but this should not be an issue.) -- Antonin Houska Web: https://www.cybertec-postgresql.com
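To illustrate what that per-backend management could look like, here is a minimal sketch; the directory name, the file-name pattern and the place of registration are assumptions for illustration, not the patch's actual scheme:

#include <dirent.h>				/* plus postgres.h, miscadmin.h, storage/ipc.h */

/*
 * Sketch only: remove this backend's temporary undo segment files at
 * backend exit, the way temporary relation files are removed.
 * Whole-file removal replaces the discard bookkeeping that shared undo
 * logs need.
 */
static void
RemoveTempUndoFiles(int code, Datum arg)
{
	DIR		   *dir = opendir("base/undo_tmp");	/* assumed location */
	struct dirent *de;
	char		prefix[32];

	if (dir == NULL)
		return;
	snprintf(prefix, sizeof(prefix), "t%d_", MyProcPid);
	while ((de = readdir(dir)) != NULL)
	{
		if (strncmp(de->d_name, prefix, strlen(prefix)) == 0)
		{
			char		path[MAXPGPATH];

			snprintf(path, sizeof(path), "base/undo_tmp/%s", de->d_name);
			unlink(path);		/* nothing to discard, just delete */
		}
	}
	closedir(dir);
}

/* registered once during backend startup */
static void
TempUndoInit(void)
{
	before_shmem_exit(RemoveTempUndoFiles, 0);
}

The point of the sketch is merely that whole-file deletion at exit would make the shared-memory slot metadata and the chunk-level discard logic unnecessary for temporary undo.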
> On Tue, Sep 21, 2021 at 10:07:55AM +0200, Dmitry Dolgov wrote: > On Tue, 21 Sep 2021 09:00 Antonin Houska, <ah@cybertec.at> wrote: > > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > Yep, makes sense, thanks. I have a few more questions: > > > > > > * The use case with orphaned files is working somewhat differently after > > > the rebase on the latest master, do you observe it as well? The > > > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > > > an orphaned relation file immediately (only later on checkpoint) > > > because of empty pendingUnlinks. I haven't investigated more yet, but > > > seems like after this commit: > > > > > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > > > Author: Thomas Munro <tmunro@postgresql.org> > > > Date: Mon Aug 2 17:32:20 2021 +1200 > > > > > > Run checkpointer and bgwriter in crash recovery. > > > > > > Start up the checkpointer and bgwriter during crash recovery (except in > > > --single mode), as we do for replication. This wasn't done back in > > > commit cdd46c76 out of caution. Now it seems like a better idea to make > > > the environment as similar as possible in both cases. There may also be > > > some performance advantages. > > > > > > something has to be updated (pendingOps are empty right now, so no > > > unlink request is remembered). > > > > I haven't been debugging that part recently, but yes, this commit is relevant, > > thanks for pointing that out! Attached is a patch that should fix it. I'll > > include it in the next version of the patch series, unless you tell me that > > something is still wrong. > > > > Sure, but I can take a look only in a couple of days. Thanks for the patch. Hm, maybe there is some misunderstanding. My question above was about the changed behaviour, where orphaned files (e.g. relation files left behind after the backend was killed) are removed only by the checkpointer when it kicks in. As far as I understand, the original intention was to do this job right away; that's why SyncPre/PostCheckpoint was invoked. But the recent changes around the checkpointer make the current implementation insufficient. The patch you've proposed removes the invocation of SyncPre/PostCheckpoint; do I see it correctly? In this sense it doesn't change anything, except removing non-functioning code of course. But the question, probably reformulated from a more design-oriented point of view, stays the same: when, and by which process, do such orphaned files have to be removed? I've assumed that by removing them right away the previous version was trying to avoid any kind of thundering effect from removing too many files at once, but maybe I'm mistaken here.
Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > On Tue, Sep 21, 2021 at 10:07:55AM +0200, Dmitry Dolgov wrote: > > On Tue, 21 Sep 2021 09:00 Antonin Houska, <ah@cybertec.at> wrote: > > > > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > > > > > Yep, makes sense, thanks. I have a few more questions: > > > > > > > > * The use case with orphaned files is working somewhat differently after > > > > the rebase on the latest master, do you observe it as well? The > > > > difference is ApplyPendingUndo -> SyncPostCheckpoint doesn't clean up > > > > an orphaned relation file immediately (only later on checkpoint) > > > > because of empty pendingUnlinks. I haven't investigated more yet, but > > > > seems like after this commit: > > > > > > > > commit 7ff23c6d277d1d90478a51f0dd81414d343f3850 > > > > Author: Thomas Munro <tmunro@postgresql.org> > > > > Date: Mon Aug 2 17:32:20 2021 +1200 > > > > > > > > Run checkpointer and bgwriter in crash recovery. > > > > > > > > Start up the checkpointer and bgwriter during crash recovery (except in > > > > --single mode), as we do for replication. This wasn't done back in > > > > commit cdd46c76 out of caution. Now it seems like a better idea to make > > > > the environment as similar as possible in both cases. There may also be > > > > some performance advantages. > > > > > > > > something has to be updated (pendingOps are empty right now, so no > > > > unlink request is remembered). > > > > > > I haven't been debugging that part recently, but yes, this commit is relevant, > > > thanks for pointing that out! Attached is a patch that should fix it. I'll > > > include it in the next version of the patch series, unless you tell me that > > > something is still wrong. > > > > Sure, but I can take a look only in a couple of days. Thanks for the patch. > > Hm, maybe there is some misunderstanding. My question above was about > the changed behaviour, where orphaned files (e.g. relation files left behind > after the backend was killed) are removed only by the checkpointer when it > kicks in. As far as I understand, the original intention was to do this job > right away; that's why SyncPre/PostCheckpoint was invoked. But the > recent changes around the checkpointer make the current implementation > insufficient. Yes, it sounds like a misunderstanding. I thought you were complaining about code which is no longer needed. The original intention was to make sure that the files get unlinked at all. IIRC, before commit 7ff23c6d27 the calls to SyncPre/PostCheckpoint were necessary because the checkpointer wasn't running that early during startup. Without these calls the startup process would exit without doing anything. Sorry, I see now that the comment incorrectly says "... it seems simpler ...", but in fact it was necessary. > The patch you've proposed removes the invocation of SyncPre/PostCheckpoint; > do I see it correctly? In this sense it doesn't change anything, except > removing non-functioning code of course. But the question, probably > reformulated from a more design-oriented point of view, stays the same: when, > and by which process, do such orphaned files have to be removed? I've > assumed that by removing them right away the previous version was trying to > avoid any kind of thundering effect from removing too many files at once, but > maybe I'm mistaken here. I'm just trying to use the existing infrastructure: the effects of DROP TABLE also appear to be performed by the checkpointer. However I don't know why the unlinks need to be performed by the checkpointer.
-- Antonin Houska Web: https://www.cybertec-postgresql.com
On Wed, Sep 29, 2021 at 8:18 AM Antonin Houska <ah@cybertec.at> wrote: > I'm just trying to use the existing infrastructure: the effect of DROP TABLE > also appear to be performed by the checkpointer. However I don't know why the > unlinks need to be performed by the checkpointer. For DROP TABLE, we leave an empty file (I've been calling it a "tombstone file") so that GetNewRelFileNode() won't let you reuse the same relfilenode in the same checkpoint cycle. One reason is that wal_level=minimal has a data-eating crash recovery failure mode if you reuse a relfilenode in a checkpoint cycle.
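For illustration, a simplified sketch of that tombstone behaviour (the real logic lives in md.c and goes through the sync-request machinery; queue_unlink_after_checkpoint() below stands in for that and is not a real function):

#include <fcntl.h>
#include <unistd.h>

extern void queue_unlink_after_checkpoint(const char *path);	/* hypothetical */

/*
 * Sketch only: instead of unlinking the main fork at DROP, truncate it
 * to zero length and leave the empty file in place. GetNewRelFileNode()
 * probes for an existing file, so the tombstone prevents the same
 * relfilenode from being handed out again within this checkpoint cycle;
 * the file is finally unlinked after the next checkpoint.
 */
static void
drop_main_fork(const char *path)
{
	int			fd = open(path, O_WRONLY | O_TRUNC, 0);

	if (fd >= 0)
		close(fd);				/* zero-length tombstone remains */
	queue_unlink_after_checkpoint(path);
}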
Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Sep 29, 2021 at 8:18 AM Antonin Houska <ah@cybertec.at> wrote: > > I'm just trying to use the existing infrastructure: the effects of DROP TABLE > > also appear to be performed by the checkpointer. However I don't know why the > > unlinks need to be performed by the checkpointer. > > For DROP TABLE, we leave an empty file (I've been calling it a > "tombstone file") so that GetNewRelFileNode() won't let you reuse the > same relfilenode in the same checkpoint cycle. One reason is that > wal_level=minimal has a data-eating crash recovery failure mode if you > reuse a relfilenode in a checkpoint cycle. Interesting. Is the problem that REDO of the DROP TABLE command deletes the relfilenode which already contains the new data, but the new data cannot be recovered because (due to wal_level=minimal) it's not present in WAL? In this case I suppose that the checkpoint just ensures that the DROP TABLE won't be replayed during the next crash recovery. BTW, does the attached comment fix make sense to you? The corresponding code in InitSync() is

/*
 * Create pending-operations hashtable if we need it. Currently, we need
 * it if we are standalone (not under a postmaster) or if we are a
 * checkpointer auxiliary process.
 */
if (!IsUnderPostmaster || AmCheckpointerProcess())

I suspect this is also related to commit 7ff23c6d27. Thanks for your answer, I was considering adding you to CC :-) -- Antonin Houska Web: https://www.cybertec-postgresql.com

diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 1c78581354..ae6c5ff8e4 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -563,7 +563,7 @@ RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
 
 	if (pendingOps != NULL)
 	{
-		/* standalone backend or startup process: fsync state is local */
+		/* standalone backend or checkpointer process: fsync state is local */
 		RememberSyncRequest(ftag, type);
 		return true;
 	}
On Tue, Sep 28, 2021 at 7:36 PM Antonin Houska <ah@cybertec.at> wrote: > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska <ah@cybertec.at> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > > > and that backend can exit w/o flushing local buffers to disk, and thus we are > > > > > not guaranteed to find enough information when trying to discard the undo log > > > > > the backend wrote. I'm thinking about the following solutions: > > > > > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > > > metadata would stay outside the shared memory, and in particular the > > > > > insertion pointer could start from 1 for each session) and remove the > > > > > segment files at the same moment the temporary relations are removed. > > > > > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > > > be affected. It might seem that a transaction which only writes undo log > > > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > > > prevents transactions to be truncated from the CLOG too early, I wonder if > > > > > the following is possible (This scenario is only applicable to the zheap > > > > > storage engine [1], which is not included in this patch, but should already > > > > > be considered.): > > > > > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > > > gets rolled back. The undo records are being applied and it takes some > > > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > > > the XID can disappear from the CLOG due to truncation. > > > > > > > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > > > that operate on temp table/undo for oldestFullXidHavingUndo? > > > > > > I was referring to the code > > > > > > /* We can't process temporary undo logs. */ > > > if (log->meta.persistence == UNDO_TEMP) > > > continue; > > > > > > in undodiscard.c:UndoDiscard(). > > > > > > > Here, I think it will just skip undo of temporary undo logs and > > oldestFullXidHavingUndo should be advanced after skipping it. > > Right, it'll be adavanced, but the transaction XID (if the transaction wrote > only to temporary relations) might still be needed. > > > > > > > > > > However zundo.c in > > > > > [1] indicates that the transaction status *is* checked during undo > > > > > execution, so we might have a problem. > > > > > > > > > > > > > It would be easier to follow if you can tell which exact code are you > > > > referring here? > > > > > > In meant the call of TransactionIdDidCommit() in > > > zundo.c:zheap_exec_pending_rollback(). > > > > > > > IIRC, this should be called for temp tables after they have exited as > > this is only to apply the pending undo actions if any, and in case of > > temporary undo after session exit, we shouldn't need it. > > I see (had to play with debugger a bit). Currently this works because the > temporary relations are dropped by AbortTransaction() -> > smgrDoPendingDeletes(), before the undo execution starts. 
> The situation will change as soon as the file removal is also handled by the > undo subsystem; however, I'm still not sure how to hit the > TransactionIdDidCommit() call for an XID already truncated from the CLOG. > > I'm starting to admit that there's no issue here: temporary undo is always > applied immediately in the foreground, and thus the zheap_exec_pending_rollback() > function never needs to examine an XID which no longer exists in the CLOG. > > > I am not able to understand what exact problem you are facing for temp > > tables after the session exit. Can you please explain it a bit more? > > The problem is that the temporary undo is loaded into backend-local > buffers. Thus there's no guarantee that we'll find consistent information in > the undo file even if the backend exited cleanly (local buffers are not > flushed at backend exit and there's no WAL for them). However, we need to read > the undo file to find out if (part of) it can be discarded. > > I'm trying to find out whether we can ignore the temporary undo when trying to > advance oldestFullXidHavingUndo or not. > It seems this is the crucial point. In the code you pointed at, we ignore the temporary undo while advancing oldestFullXidHavingUndo, but if you find any case where that is not true then we need to discuss the best way to solve it. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Sep 28, 2021 at 7:36 PM Antonin Houska <ah@cybertec.at> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska <ah@cybertec.at> wrote: > > > > > > > > > > The problem of the temporary undo log is that it's loaded into local buffers > > > > > > and that backend can exit w/o flushing local buffers to disk, and thus we are > > > > > > not guaranteed to find enough information when trying to discard the undo log > > > > > > the backend wrote. I'm thinking about the following solutions: > > > > > > > > > > > > 1. Let the backend manage temporary undo log on its own (even the slot > > > > > > metadata would stay outside the shared memory, and in particular the > > > > > > insertion pointer could start from 1 for each session) and remove the > > > > > > segment files at the same moment the temporary relations are removed. > > > > > > > > > > > > However, by moving the temporary undo slots away from the shared memory, > > > > > > computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would > > > > > > be affected. It might seem that a transaction which only writes undo log > > > > > > for temporary relations does not need to affect oldestFullXidHavingUndo, > > > > > > but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo > > > > > > prevents transactions to be truncated from the CLOG too early, I wonder if > > > > > > the following is possible (This scenario is only applicable to the zheap > > > > > > storage engine [1], which is not included in this patch, but should already > > > > > > be considered.): > > > > > > > > > > > > A transaction creates a temporary table, does some (many) changes and then > > > > > > gets rolled back. The undo records are being applied and it takes some > > > > > > time. Since XID of the transaction did not affect oldestFullXidHavingUndo, > > > > > > the XID can disappear from the CLOG due to truncation. > > > > > > > > > > > > > > > > By above do you mean to say that in zheap code, we don't consider XIDs > > > > > that operate on temp table/undo for oldestFullXidHavingUndo? > > > > > > > > I was referring to the code > > > > > > > > /* We can't process temporary undo logs. */ > > > > if (log->meta.persistence == UNDO_TEMP) > > > > continue; > > > > > > > > in undodiscard.c:UndoDiscard(). > > > > > > > > > > Here, I think it will just skip undo of temporary undo logs and > > > oldestFullXidHavingUndo should be advanced after skipping it. > > > > Right, it'll be adavanced, but the transaction XID (if the transaction wrote > > only to temporary relations) might still be needed. > > > > > > > > > > > > > However zundo.c in > > > > > > [1] indicates that the transaction status *is* checked during undo > > > > > > execution, so we might have a problem. > > > > > > > > > > > > > > > > It would be easier to follow if you can tell which exact code are you > > > > > referring here? > > > > > > > > In meant the call of TransactionIdDidCommit() in > > > > zundo.c:zheap_exec_pending_rollback(). > > > > > > > > > > IIRC, this should be called for temp tables after they have exited as > > > this is only to apply the pending undo actions if any, and in case of > > > temporary undo after session exit, we shouldn't need it. > > > > I see (had to play with debugger a bit). 
> > Currently this works because the temporary relations are dropped by > > AbortTransaction() -> smgrDoPendingDeletes(), before the undo execution > > starts. The situation will change as soon as the file removal is also > > handled by the undo subsystem; however, I'm still not sure how to hit the > > TransactionIdDidCommit() call for an XID already truncated from the CLOG. > > > > I'm starting to admit that there's no issue here: temporary undo is always > > applied immediately in the foreground, and thus the zheap_exec_pending_rollback() > > function never needs to examine an XID which no longer exists in the CLOG. > > > > > I am not able to understand what exact problem you are facing for temp > > > tables after the session exit. Can you please explain it a bit more? > > > > The problem is that the temporary undo is loaded into backend-local > > buffers. Thus there's no guarantee that we'll find consistent information in > > the undo file even if the backend exited cleanly (local buffers are not > > flushed at backend exit and there's no WAL for them). However, we need to read > > the undo file to find out if (part of) it can be discarded. > > > > I'm trying to find out whether we can ignore the temporary undo when trying to > > advance oldestFullXidHavingUndo or not. > > > > It seems this is the crucial point. In the code you pointed at, we ignore the temporary undo while advancing oldestFullXidHavingUndo, but if you find any case where that is not true then we need to discuss the best way to solve it. As I already said above, I now think that the computation of oldestFullXidHavingUndo can actually ignore the temporary undo, as happens in the zheap fork of postgres. At least I eventually could not find a corner case that would break the current solution. So it should be ok if the temporary undo is managed and discarded by individual backends. Patch 0005 of the new series tries to do that. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Hi, On Thu, Nov 25, 2021 at 10:00 PM Antonin Houska <ah@cybertec.at> wrote: > > So it should be ok if the temporary undo is managed and discarded by > individual backends. Patch 0005 of the new series tries to do that. The cfbot reports that at least the 0001 patch doesn't apply anymore: http://cfbot.cputube.org/patch_36_3228.log > === applying patch ./undo-20211125/0001-Add-SmgrId-to-smgropen-and-BufferTag.patch > [...] > patching file src/bin/pg_waldump/pg_waldump.c > Hunk #1 succeeded at 480 (offset 17 lines). > Hunk #2 FAILED at 500. > Hunk #3 FAILED at 531. > 2 out of 3 hunks FAILED -- saving rejects to file src/bin/pg_waldump/pg_waldump.c.rej Could you send a rebased version? In the meantime I'll switch the cf entry to Waiting on Author.
Hi, Antonin. I am quite interested in zheap and have recently been reviewing the patch you submitted.
When I used the pg_undodump tool to dump the undo page chunks, I found that some chunk headers were abnormal.
After reading the relevant code in 0006-The-initial-implementation-of-the-pg_undodump-tool.patch,
I think there is a bug in the function parse_undo_page.
According to my understanding, the size in the chunk header includes the chunk header + type-specific header + undo records.
If the chunk spans pages, the size of the page header also needs to be added.
But I found that currently only the scenario of the chunk header spanning pages is considered; the scenario of the type-specific header spanning pages is not.
/*
* The page header size must eventually be subtracted from
* chunk_bytes_left because it's included in the chunk size. However,
* since chunk_bytes_left is unsigned, we do not subtract anything from it
* if it's still zero. This can happen if we're still reading the chunk
* header or the type-specific header. (The underflow should not be a
* problem because the chunk size will eventually be added, but it seems
* ugly and it makes debugging less convenient.)
*/
if (s->chunk_bytes_left > 0)
{
/* Chunk should not end within page header. */
Assert(s->chunk_bytes_left >= SizeOfUndoPageHeaderData);
s->chunk_bytes_left -= SizeOfUndoPageHeaderData;
s->chunk_bytes_to_skip = 0;
}
/* Processing the chunk header? */
else if (s->chunk_hdr_bytes_left > 0)
s->chunk_bytes_to_skip = SizeOfUndoPageHeaderData;
------------------------------------------------
Should this code be fixed like this? When the type-specific header spans the undo page, the page header should also be skipped:
else if (s->chunk_hdr_bytes_left > 0 || s->type_hdr_bytes_left > 0)
s->chunk_bytes_to_skip = SizeOfUndoPageHeaderData;
------------------------------------------------------------------
From: Antonin Houska <ah@cybertec.at>
Sent: Tuesday, March 29, 2022 17:25
To: Dmitry Dolgov <9erthalion6@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
The cfbot complained that the patch series no longer applies, so I've rebased
it and also tried to make sure that the other cfbot checks become green.
One particular problem was that pg_upgrade complained that "live undo data"
remains in the old cluster. I found out that the temporary undo log causes the
problem, so I've adjusted the query in check_for_undo_data() accordingly until
the problem gets fixed properly.
The problem of the temporary undo log is that it's loaded into local buffers
and that a backend can exit without flushing local buffers to disk, and thus we are
not guaranteed to find enough information when trying to discard the undo log
the backend wrote. I'm thinking about the following solutions:
1. Let the backend manage temporary undo log on its own (even the slot
metadata would stay outside the shared memory, and in particular the
insertion pointer could start from 1 for each session) and remove the
segment files at the same moment the temporary relations are removed.
However, by moving the temporary undo slots away from the shared memory,
computation of oldestFullXidHavingUndo (see the PROC_HDR structure) would
be affected. It might seem that a transaction which only writes undo log
for temporary relations does not need to affect oldestFullXidHavingUndo,
but it needs to be analyzed thoroughly. Since oldestFullXidHavingUndo
prevents transactions from being truncated from the CLOG too early, I wonder if
the following is possible (This scenario is only applicable to the zheap
storage engine [1], which is not included in this patch, but should already
be considered.):
A transaction creates a temporary table, does some (many) changes and then
gets rolled back. The undo records are being applied and it takes some
time. Since XID of the transaction did not affect oldestFullXidHavingUndo,
the XID can disappear from the CLOG due to truncation. However zundo.c in
[1] indicates that the transaction status *is* checked during undo
execution, so we might have a problem.
Or am I missing something? UndoDiscard() in zheap seems to ignore temporary
undo:
/* We can't process temporary undo logs. */
if (log->meta.persistence == UNDO_TEMP)
continue;
2. Do not load the temporary undo into local buffers. If it's always in the
shared buffers, we should never see incomplete data when trying to discard
undo. In this case, persistence levels UNDOPERSISTENCE_UNLOGGED and
UNDOPERSISTENCE_TEMP could be merged into a single level.
3. Implement the discarding in another way, but I don't have a new idea right
now.
Suggestions are welcome.
[1] https://github.com/EnterpriseDB/zheap/tree/master
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
孔凡深(云梳) <fanshen.kfs@alibaba-inc.com> wrote: > Hi, Antonin. I am quite interested in zheap and have recently been reviewing the patch you submitted. > When I used the pg_undodump tool to dump the undo page chunks, I found that some chunk headers were abnormal. > After reading the relevant code in 0006-The-initial-implementation-of-the-pg_undodump-tool.patch, > I think there is a bug in the function parse_undo_page. Thanks, I'll take a look if the project happens to continue. Currently it seems that another approach is more likely to be taken: https://www.postgresql.org/message-id/CA%2BTgmoa_VNzG4ZouZyQQ9h%3DoRiy%3DZQV5%2BxHQXxMWmep4Ygg8Dg%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com