Re: pg_walfile_name_offset can return inconsistent values - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: pg_walfile_name_offset can return inconsistent values |
Date | |
Msg-id | ZU0xZuARuDbpUMf2@momjian.us Whole thread Raw |
In response to | Re: pg_walfile_name_offset can return inconsistent values (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>) |
Responses |
Re: pg_walfile_name_offset can return inconsistent values
|
List | pgsql-hackers |
On Fri, Jul 26, 2019 at 11:30:19AM +0200, Jehan-Guillaume de Rorthais wrote: > On Fri, 26 Jul 2019 17:21:20 +0900 (Tokyo Standard Time) > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > Hello. > > > > While looking [1], I noticed that pg_walfile_name_offset behaves > > somewhat oddly at segment boundary. > > > > select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as > > t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn); > > lsn | file_name | file_offset > > ------------+--------------------------+------------- > > 0/16ffffff | 000000020000000000000016 | 16777215 > > 0/17000000 | 000000020000000000000016 | 0 > > 0/17000001 | 000000020000000000000017 | 1 > > > > > > The file names are right as defined, but the return value of the > > second line wrong, or at least misleading. > > +1 > I noticed it as well and put this report on hold while working on my patch. > Thanks for reporting this! > > > It should be (16, 1000000) or (16, FFFFFF). The former is out-of-domain so we > > would have no way than choosing the latter. I'm not sure the purpose of > > the second output parameter, thus the former might be right > > decision. > > > > The function returns the following result after this patch is > > applied. > > > > select * from (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as > > t(lsn), lateral pg_walfile_name_offset(lsn::pg_lsn); > > lsn | file_name | file_offset > > ------------+--------------------------+------------- > > 0/16ffffff | 000000020000000000000016 | 16777214 > > 0/17000000 | 000000020000000000000016 | 16777215 > > 0/17000001 | 000000020000000000000017 | 0 > > So you shift the file offset for all LSN by one byte? This could lead to > regression in various tools relying on this function. > > Moreover, it looks weird as the LSN doesn't reflect the given offset anymore > (FFFFFF <> 16777214, 000001 <> 0, etc). > > Another solution might be to return the same result when for both 0/16ffffff and > 0/17000000, but it doesn't feel right either. > > So in fact, returning 0x1000000 seems to be the cleaner result to me. I know this bug report is four years old, but it is still a pg_walfile_name_offset() bug. Here is the bug: SELECT * FROM (VALUES ('0/16ffffff'), ('0/17000000'), ('0/17000001')) AS t(lsn), LATERAL pg_walfile_name_offset(lsn::pg_lsn); lsn | file_name | file_offset ------------+--------------------------+------------- 0/16ffffff | 000000010000000000000016 | 16777215 --> 0/17000000 | 000000010000000000000016 | 0 0/17000001 | 000000010000000000000017 | 1 The bug is in the indicated line --- it shows the filename as 00016 but offset as zero, when clearly the LSN is pointing to 17/0. The bug is essentially that the code for pg_walfile_name_offset() uses the exact offset from the LSN, but uses the file name from the previous byte of the LSN. The fix involves deciding what the description or purpose of pg_walfile_name_offset() means, and adjusting it to be clearer. The current documentation says: Converts a write-ahead log location to a WAL file name and byte offset within that file. Fix #1: If we assume write-ahead log location means LSN, it is saying show the file/offset of the LSN, and that is most clearly: lsn | file_name | file_offset ------------+--------------------------+------------- 0/16ffffff | 000000010000000000000016 | 16777215 0/17000000 | 000000010000000000000017 | 0 0/17000001 | 000000010000000000000017 | 1 Fix #2: Now, there are some who have said they want the output to be the last written WAL byte (the byte before the LSN), not the current LSN, for archiving purposes. However, if we do that, we have to update the docs to clarify it. Its output would be: lsn | file_name | file_offset ------------+--------------------------+------------- 0/16ffffff | 000000010000000000000016 | 16777214 0/17000000 | 000000010000000000000016 | 16777215 0/17000001 | 000000010000000000000017 | 0 The email thread also considered having the second row offset be 16777216 (2^24), which is not a valid offset for a file if we assume a zero-based offset. Looking further, pg_walfile_name() also returns the filename for the previous byte: SELECT * FROM (values ('0/16ffffff'), ('0/17000000'), ('0/17000001')) as t(lsn), LATERAL pg_walfile_name_offset(lsn::pg_lsn), LATERAL pg_walfile_name(lsn::pg_lsn); lsn | file_name | file_offset | pg_walfile_name ------------+--------------------------+-------------+-------------------------- 0/16ffffff | 000000010000000000000016 | 16777215 | 000000010000000000000016 0/17000000 | 000000010000000000000016 | 0 | 000000010000000000000016 0/17000001 | 000000010000000000000017 | 1 | 000000010000000000000017 I have attached fix #1 as offset1.diff and fix #2 as offset2.diff. I think the most logical fix is #1, but pg_walfile_name() would need to be modified. If the previous file/byte offset are what is desired, fix #2 will need doc changes for both functions. This probably needs to be documented as a backward incompatibility for either fix. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Attachment
pgsql-hackers by date: