Re: Refactor pg_rewind code and make it work against a standby - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Refactor pg_rewind code and make it work against a standby
Date
Msg-id 4744063e-1a60-1b24-e1b7-7ace5e65f0d2@iki.fi
Whole thread Raw
In response to Re: Refactor pg_rewind code and make it work against a standby  (Soumyadeep Chakraborty <soumyadeep2007@gmail.com>)
Responses Re: Refactor pg_rewind code and make it work against a standby
List pgsql-hackers
On 20/09/2020 23:44, Soumyadeep Chakraborty wrote:
> Before getting into the code review for the patch, I wanted to know why
> we don't use a Bitmapset for target_modified_pages?

Bitmapset is not available in client code. Perhaps it could be moved to 
src/common with some changes, but doesn't seem worth it until there's 
more client code that would need it.

I'm not sure that a bitmap is the best data structure for tracking the 
changed blocks in the first place. A hash table might be better if there 
are only a few changed blocks, or something like 
src/backend/lib/integerset.c if there are many. But as long as the 
simple bitmap gets the job done, let's keep it simple.

> 2. Rename target_modified_pages to target_pages_to_overwrite?
> target_modified_pages can lead to confusion as to whether it includes pages
> that were modified on the target but not even present in the source and
> the other exclusionary cases. target_pages_to_overwrite is much clearer.

Agreed, I'll rename it.

Conceptually, while we're scanning source WAL, we're just making note of 
the modified blocks. The decision on what to do with them happens only 
later, in decide_file_action(). The difference is purely theoretical, 
though. There is no real decision to be made, all the modified blocks 
will be overwritten. So on the whole, I agree 'target_page_to_overwrite' 
is better.

>> /*
>>   * If this is a relation file, copy the modified blocks.
>>   *
>>   * This is in addition to any other changes.
>>   */
>> iter = datapagemap_iterate(&entry->target_modified_pages);
>> while (datapagemap_next(iter, &blkno))
>> {
>> offset = blkno * BLCKSZ;
>>
>> source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
>> }
>> pg_free(iter);
> 
> Can we put this hunk into a static function overwrite_pages()?

Meh, it's only about 10 lines of code, and one caller.

> 4.
> 
>> * block that have changed in the target system.  It makes note of all the
>> * changed blocks in the pagemap of the file.
> 
> Can we replace the above with:
> 
>> * block that has changed in the target system.  It decides if the given
> blkno in the target relfile needs to be overwritten from the source.

Ok. Again conceptually though, process_target_wal_block_change() just 
collects information, and the decisions are made later. But you're right 
that we do leave out truncated-away blocks here, so we are doing more 
than just noting all the changed blocks.

>> /*
>>   * Doesn't exist in either server. Why does it have an entry in the
>>   * first place??
>>   */
>> return FILE_ACTION_NONE;
> 
> Can we delete the above hunk and add the following assert to the very
> top of decide_file_action():
> 
> Assert(entry->target_exists || entry->source_exists);

I would like to keep the check even when assertions are not enabled. 
I'll add an Assert(false) there.

> 7. Please address the FIXME for the symlink case:
> /* FIXME: Check if it points to the same target? */

It's not a new issue. Would be nice to fix, of course. I'm not sure what 
the right thing to do would be. If you have e.g. replaced 
postgresql.conf with a symlink that points outside the data directory, 
would it be appropriate to overwrite it? Or perhaps we should throw an 
error? We also throw an error if a file is a symlink in the source but a 
regular file in the target, or vice versa.

> 8.
> 
> * it anyway. But there's no harm in copying it now.)
> 
> and
> 
> * copy them here. But we don't know which scenario we're
> * dealing with, and there's no harm in copying the missing
> * blocks now, so do it now.
> 
> Could you add a line or two explaining why there is "no harm" in these
> two hunks above?

The previous sentences explain that there's a WAL record covering them. 
So they will be overwritten by WAL replay anyway. Does it need more 
explanation?

> 14. queue_overwrite_range(), finish_overwrite() instead of
> queue_fetch_range(), finish_fetch()? Similarly update\
> *_fetch_file_range() and *_finish_fetch()
> 
> 
> 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c

Good idea! And fetch.h -> rewind_source.h.

I also moved the code in execute_file_actions() function to pg_rewind.c, 
into a new function: perform_rewind(). It felt a bit silly to have just 
execute_file_actions() in a file of its own. perform_rewind() now does 
all the modifications to the data directory, writing the backup file. 
Except for writing the recovery config: that also needs to be when 
there's no rewind to do, so it makes sense to keep it separate. What do 
you think?

> 16.
> 
>> conn = PQconnectdb(connstr_source);
>>
>> if (PQstatus(conn) == CONNECTION_BAD)
>> pg_fatal("could not connect to server: %s",
>> PQerrorMessage(conn));
>>
>> if (showprogress)
>> pg_log_info("connected to server");
> 
> 
> The above hunk should be part of init_libpq_source(). Consequently,
> init_libpq_source() should take a connection string instead of a conn.

The libpq connection is also needed by WriteRecoveryConfig(), that's why 
it's not fully encapsulated in libpq_source.

> 19.
> 
>> typedef struct
>> {
>> const char *path; /* path relative to data directory root */
>> uint64 offset;
>> uint32 length;
>> } fetch_range_request;
> 
> offset should be of type off_t

The 'offset' argument to the queue_fetch_range function is uint64, and 
the argument to the SQL-callable pg_read_binary_file() isint8, so it's 
consistent with them. Then again, the 'len' argument to 
queue_fetch_range() is a size_t, and to pg_read_binary_file() int8, so 
it's not fully consistent with that either. I'll try to make it more 
consistent.

Thanks for the review! Attached is a new version of the patch set.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: James Coleman
Date:
Subject: Incremental sort docs and release announcement
Next
From: Pavel Stehule
Date:
Subject: Re: proposal: possibility to read dumped table's name from file