Thread: Add pg_walinspect function with block info columns
Hi, When using pg_walinspect, and calling functions like pg_get_wal_records_info(), I often wish that the various information in the block_ref column was separated out into columns so that I could easily access them and pass them to various other functions to add information -- like getting the relname from pg_class like this: SELECT n.nspname, c.relname, wal_info.* FROM pg_get_wal_records_extended_info(:start_lsn, :end_lsn) wal_info JOIN pg_class c ON wal_info.relfilenode = pg_relation_filenode(c.oid) AND wal_info.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) JOIN pg_namespace n ON n.oid = c.relnamespace; This has been mentioned in [1] amongst other places. So, attached is a patch with pg_get_wal_records_extended_info(). I suspect the name is not very good. Also, it is nearly a direct copy of pg_get_wal_fpi_infos() except for the helper called to fill in the tuplestore, so it might be worth doing something about that. However, I am mainly looking for feedback about whether or not others would find this useful, and, if so, what columns they would like to see in the returned tuplestore. Note that I didn't include the cumulative fpi_len for all the pages since pg_get_wal_fpi_info() now exists. I noticed that pg_get_wal_fpi_info() doesn't list compression information (which is in the block_ref column of pg_get_wal_records_info()). I don't know if this is worth including in my proposed function pg_get_wal_records_extended_info(). - Melanie [1] https://www.postgresql.org/message-id/CAH2-Wz%3DacGKoP8cZ%2B6Af2inoai0N5cZKCY13DaqXCwQNupK8qg%40mail.gmail.com
Attachment
On Wed, Mar 1, 2023 at 12:51 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > When using pg_walinspect, and calling functions like > pg_get_wal_records_info(), I often wish that the various information in > the block_ref column was separated out into columns so that I could > easily access them and pass them to various other functions to add > information -- like getting the relname from pg_class like this: > > SELECT n.nspname, c.relname, wal_info.* > FROM pg_get_wal_records_extended_info(:start_lsn, :end_lsn) wal_info > JOIN pg_class c > ON wal_info.relfilenode = pg_relation_filenode(c.oid) AND > wal_info.reldatabase IN (0, (SELECT oid FROM pg_database > WHERE datname = current_database())) > JOIN pg_namespace n ON n.oid = c.relnamespace; > > > This has been mentioned in [1] amongst other places. > > So, attached is a patch with pg_get_wal_records_extended_info(). I > suspect the name is not very good. Also, it is nearly a direct copy of > pg_get_wal_fpi_infos() except for the helper called to fill in the > tuplestore, so it might be worth doing something about that. > > However, I am mainly looking for feedback about whether or not others > would find this useful, and, if so, what columns they would like to see > in the returned tuplestore. > > Note that I didn't include the cumulative fpi_len for all the pages > since pg_get_wal_fpi_info() now exists. I noticed that > pg_get_wal_fpi_info() doesn't list compression information (which is in > the block_ref column of pg_get_wal_records_info()). I don't know if this > is worth including in my proposed function > pg_get_wal_records_extended_info(). Thinking about this more, it could make sense to have a function which gives you this extended block information and has a parameter like with_fpi which would include the information returned by pg_get_wal_fpi_info(). It might be nice to have it still include the information about the record itself as well. I don't know if it would be instead of pg_get_wal_fpi_info(), though. The way I would use this is when I want to see the record level information but with some additional information aggregated across the relevant blocks. For example, I could group by the record information and relfilenode and using the query in my example above, see all the information for the record along with the relname (when possible). - Melanie
On Thu, Mar 02, 2023 at 11:17:05AM -0500, Melanie Plageman wrote: > Thinking about this more, it could make sense to have a function which > gives you this extended block information and has a parameter like > with_fpi which would include the information returned by > pg_get_wal_fpi_info(). It might be nice to have it still include the > information about the record itself as well. Hmm. I am OK if you want to include more information about the blocks, and it may be nicer to not bloat the interface with more functions than necessary. > I don't know if it would be instead of pg_get_wal_fpi_info(), though. > > The way I would use this is when I want to see the record level > information but with some additional information aggregated across the > relevant blocks. For example, I could group by the record information > and relfilenode and using the query in my example above, see all the > information for the record along with the relname (when possible). As far as I know, a block reference could have some data or a FPW, so it is true that pg_get_wal_fpi_info() is not extensive enough if you want to get more information about the blocks in use for each record, especially if there is some data, and grouping the information about whole set of blocks into a single function call can some time. In order to satisfy your case, why not having one function that does everything, looping over the blocks of a single record as long as XLogRecHasBlockRef() is satisfied, returning the FPW if the block includes an image (or NULL if !XLogRecHasBlockImage()), as well as its data in bytea if XLogRecGetData() gives something (?). I am not sure that this should return anything about the record itself except its ReadRecPtr, though, as ReadRecPtr would be enough to cross-check with the information provided by GetWALRecordInfo() with a join. Hence, I guess that we could update the existing FPI function with: - the addition of some of the flags of bimg_info, like the compression type, if they apply, with a text[]. - the addition of bimg_len, if the block has a FPW, or NULL if none. - the addition of apply_image, if the block has a FPW, or NULL if none. - the addition of the block data, if any, or NULL if there is no data. - an update for the FPW handling, where we would return NULL if there is no FPW references in the block, but still return the full, decompressed 8kB image if it is there. -- Michael
Attachment
On Thu, Mar 2, 2023 at 9:47 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > > However, I am mainly looking for feedback about whether or not others > > would find this useful, and, if so, what columns they would like to see > > in the returned tuplestore. IMO, pg_get_wal_records_extended_info as proposed doesn't look good to me as it outputs most of the columns that are already given by pg_get_wal_records_info.What I think the best way at this point is to make it return the following: lsn pg_lsn block_id int8 spcOid oid dbOid oid relNumber oid forkNames text fpi bytea fpi_info text So, there can be multiple columns for the same record LSN, which means, essentially (lsn, block_id) can be a unique value for the row. If a block has FPI, fpi and fpi_info are non-null, otherwise, nulls. If needed, this output can be joined with pg_get_wal_records_info on lsn, to get all the record level details.What do you think? Will this serve your purpose? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Mar 2, 2023 at 9:47 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > > However, I am mainly looking for feedback about whether or not others > > > would find this useful, and, if so, what columns they would like to see > > > in the returned tuplestore. > > IMO, pg_get_wal_records_extended_info as proposed doesn't look good to > me as it outputs most of the columns that are already given by > pg_get_wal_records_info.What I think the best way at this point is to > make it return the following: > lsn pg_lsn > block_id int8 > spcOid oid > dbOid oid > relNumber oid > forkNames text > fpi bytea > fpi_info text Wouldn't it be more useful to have type PgXlogRecordBlock ( block_id int ... , blockimage bytea , data bytea ) type PgXlogRecord ( start lsn , ... , blocks PgXlogRecordBlock[] -- array of record's registered blocks, c.q. DecodedBkpBlock->blocks , main_data bytea ) which is returned by one sql function, and then used and processed (unnest()ed) in the relevant views? It would allow anyone to build their own processing on pg_walinspect where they want or need it, without us having to decide what the user wants, and without having to associate blocks with the main xlog record data through the joining of several (fairly expensive) xlog decoding passes. The basic idea is to create a single entrypoint to all relevant data from DecodedXLogRecord in SQL, not multiple. Kind regards, Matthias van de Meent
On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote: > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to >> me as it outputs most of the columns that are already given by >> pg_get_wal_records_info.What I think the best way at this point is to >> make it return the following: >> lsn pg_lsn >> block_id int8 >> spcOid oid >> dbOid oid >> relNumber oid >> forkNames text >> fpi bytea >> fpi_info text I would add the length of the block data (without the hole and compressed, as the FPI data should always be presented as uncompressed), and the block data if any (without the block data length as one can guess it based on the bytea data length). Note that a block can have both a FPI and some data assigned to it, as far as I recall. > The basic idea is to create a single entrypoint to all relevant data > from DecodedXLogRecord in SQL, not multiple. While I would agree with this principle on simplicity's ground in terms of minimizing the SQL interface and the pg_wal/ lookups, I disagree about it on unsability ground, because we can avoid extra SQL tweaks with more functions. One recent example I have in mind is partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE on the catalogs. There are of course various degrees of complexity, and perhaps unnest() cannot qualify as one, but having two functions returning normalized records (one for the record information, and a second for the block information), is a rather good balance between usability and interface complexity, in my experience. If you have two functions, a JOIN is enough to cross-check the block data and the record data, while an unnest() heavily bloats the main function output (aka byteas of FPIs in a single array). -- Michael
Attachment
At Tue, 7 Mar 2023 09:34:24 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote: > > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy > >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to > >> me as it outputs most of the columns that are already given by > >> pg_get_wal_records_info.What I think the best way at this point is to > >> make it return the following: > >> lsn pg_lsn > >> block_id int8 > >> spcOid oid > >> dbOid oid > >> relNumber oid > >> forkNames text > >> fpi bytea > >> fpi_info text > > I would add the length of the block data (without the hole and > compressed, as the FPI data should always be presented as > uncompressed), and the block data if any (without the block data > length as one can guess it based on the bytea data length). Note that > a block can have both a FPI and some data assigned to it, as far as I > recall. +1 > > The basic idea is to create a single entrypoint to all relevant data > > from DecodedXLogRecord in SQL, not multiple. > > While I would agree with this principle on simplicity's ground in > terms of minimizing the SQL interface and the pg_wal/ lookups, I > disagree about it on unsability ground, because we can avoid extra SQL > tweaks with more functions. One recent example I have in mind is > partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE > on the catalogs. There are of course various degrees of complexity, > and perhaps unnest() cannot qualify as one, but having two functions > returning normalized records (one for the record information, and a > second for the block information), is a rather good balance between > usability and interface complexity, in my experience. If you have two > functions, a JOIN is enough to cross-check the block data and the > record data, while an unnest() heavily bloats the main function output > (aka byteas of FPIs in a single array). FWIW, my initial thought about the proposal was similar to Matthias, and tried a function that would convert (for simplicity) the block_ref string to a json object. Although this approach did work, I was not satisfied with its limited usability and poor performance (mainly the poor performance is due to text->json conversion, though).. Finally, I realized that the initial discomfort I experienced stemmed from the name of the function, which suggests that it returns information of "records". This discomfort would disappear if the function were instead named pg_get_wal_blockref_info() or something similar. Thus I'm inclined to agree with Michael's suggestion of creating a new normalized set-returning function that returns information of "blocks". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Mar 07, 2023 at 11:17:45AM +0900, Kyotaro Horiguchi wrote: > Thus I'm inclined to agree with Michael's suggestion of creating a new > normalized set-returning function that returns information of > "blocks". Just to be clear here, I am not suggesting to add a new function for only the block information, just a rename of the existing pg_get_wal_fpi_info() to something like pg_get_wal_block_info() that includes both the FPI (if any or NULL if none) and the block data (if any or NULL is none) so as all of them are governed by the same lookup at pg_wal/. The fpi information (aka compression type) is displayed if there is a FPI in the block. -- Michael
Attachment
At Tue, 7 Mar 2023 14:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Tue, Mar 07, 2023 at 11:17:45AM +0900, Kyotaro Horiguchi wrote: > > Thus I'm inclined to agree with Michael's suggestion of creating a new > > normalized set-returning function that returns information of > > "blocks". > > Just to be clear here, I am not suggesting to add a new function for > only the block information, just a rename of the existing > pg_get_wal_fpi_info() to something like pg_get_wal_block_info() that > includes both the FPI (if any or NULL if none) and the block data (if > any or NULL is none) so as all of them are governed by the same lookup > at pg_wal/. The fpi information (aka compression type) is displayed > if there is a FPI in the block. Ah. Yes, that expansion sounds sensible. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote: > Ah. Yes, that expansion sounds sensible. Okay, so, based on this idea, I have hacked on this stuff and finish with the attached that shows block data if it exists, as well as FPI stuff if any. bimg_info is showed as a text[] for its flags. I guess that I'd better add a test that shows correctly a record with some block data attached to it, on top of the existing one for FPIs.. Any suggestions? Perhaps just a heap/heap2 record? Thoughts? -- Michael
Attachment
At Tue, 7 Mar 2023 16:18:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote: > > Ah. Yes, that expansion sounds sensible. > > Okay, so, based on this idea, I have hacked on this stuff and finish > with the attached that shows block data if it exists, as well as FPI > stuff if any. bimg_info is showed as a text[] for its flags. # The naming convetion looks inconsistent between # pg_get_wal_records_info and pg_get_wal_block_info but it's not an # issue of this patch.. > I guess that I'd better add a test that shows correctly a record with > some block data attached to it, on top of the existing one for FPIs.. > Any suggestions? Perhaps just a heap/heap2 record? > > Thoughts? I thought that we needed a test for block data when I saw the patch. I don't have great idea but a single insert should work. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Mar 7, 2023 at 12:48 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote: > > Ah. Yes, that expansion sounds sensible. > > Okay, so, based on this idea, I have hacked on this stuff and finish > with the attached that shows block data if it exists, as well as FPI > stuff if any. bimg_info is showed as a text[] for its flags. +1. > I guess that I'd better add a test that shows correctly a record with > some block data attached to it, on top of the existing one for FPIs.. > Any suggestions? Perhaps just a heap/heap2 record? > > Thoughts? That would be a lot better. Not just the test, but also the documentation can have it. Simple way to generate such a record (both block data and FPI) is to just change the wal_level to logical in walinspect.conf [1], see code around REGBUF_KEEP_DATA and RelationIsLogicallyLogged in heapam.c I had the following comments and fixed them in the attached v2 patch: 1. Still a trace of pg_get_wal_fpi_info in docs, removed it. 2. Used int4 instead of int for fpilen just to be in sync with fpi_length of pg_get_wal_record_info. 3. Changed to be consistent and use just FPI or "F/full page". /* FPI flags */ /* No full page image, so store NULLs for all its fields */ /* Full-page image */ /* Full page exists, so let's save it. */ * and end LSNs. This produces information about the full page images with * to a record. Decompression is applied to the full-page images, if 4. I think we need to free raw_data, raw_page and flags as we loop over multiple blocks (XLR_MAX_BLOCK_ID) and will leak memory which can be a problem if we have many blocks assocated with a single WAL record. flags = (Datum *) palloc0(sizeof(Datum) * bitcnt); Also, we will leak all CStringGetTextDatum memory in the block_id for loop. Another way is to use and reset temp memory context in the for loop over block_ids. I prefer this approach over multiple pfree()s in block_id for loop. 5. I think it'd be good to say if the FPI is for WAL_VERIFICATION, so I changed it to the following. Feel free to ignore this if you think it's not required. if (blk->apply_image) flags[cnt++] = CStringGetTextDatum("APPLY"); else flags[cnt++] = CStringGetTextDatum("WAL_VERIFICATION"); 6. Did minor wordsmithing. 7. Added test case which shows both block data and fpi in the documentation. 8. Changed wal_level to logical in walinspect.conf to test case with block data. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, 7 Mar 2023 at 01:34, Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote: > > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy > >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to > >> me as it outputs most of the columns that are already given by > >> pg_get_wal_records_info.What I think the best way at this point is to > >> make it return the following: > >> lsn pg_lsn > >> block_id int8 > >> spcOid oid > >> dbOid oid > >> relNumber oid > >> forkNames text > >> fpi bytea > >> fpi_info text > > I would add the length of the block data (without the hole and > compressed, as the FPI data should always be presented as > uncompressed), and the block data if any (without the block data > length as one can guess it based on the bytea data length). Note that > a block can have both a FPI and some data assigned to it, as far as I > recall. > > > The basic idea is to create a single entrypoint to all relevant data > > from DecodedXLogRecord in SQL, not multiple. > > While I would agree with this principle on simplicity's ground in > terms of minimizing the SQL interface and the pg_wal/ lookups, I > disagree about it on unsability ground, because we can avoid extra SQL > tweaks with more functions. One recent example I have in mind is > partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE > on the catalogs. Correct, but in that case the user would build the same query (or at least with the same complexity) as what we're executing under the hood, right? > There are of course various degrees of complexity, > and perhaps unnest() cannot qualify as one, but having two functions > returning normalized records (one for the record information, and a > second for the block information), is a rather good balance between > usability and interface complexity, in my experience. I would agree, if it weren't for the reasons written below. > If you have two > functions, a JOIN is enough to cross-check the block data and the > record data, Joins are expensive on large datasets; and because WAL is one of the largest datasets in the system, why would we want to force the user to JOIN them if we can produce the data in one pre-baked data structure without a need to join? > while an unnest() heavily bloats the main function output > (aka byteas of FPIs in a single array). I don't see how that would be bad. You can select a subset of columns without much issue, which can allow you to ignore any and all bloat. It is also not easy to imagine that we'd have arguments in the function that determine whether it includes the largest fields (main data, blocks, block data, and block images) or leaves them NULL so that we need to pass less data around if the user doesn't want the data. Matthias van de Meent
On Tue, Mar 07, 2023 at 03:56:22PM +0530, Bharath Rupireddy wrote: > That would be a lot better. Not just the test, but also the > documentation can have it. Simple way to generate such a record (both > block data and FPI) is to just change the wal_level to logical in > walinspect.conf [1], see code around REGBUF_KEEP_DATA and > RelationIsLogicallyLogged in heapam.c I don't agree that we need to go down to wal_level=logical for this. The important part is to check that the non-NULL and NULL paths for the block data and FPI data are both taken, making 4 paths to check. So we need two tests at minimum, which would be either: - One SQL generating no FPI with no block data and a second generating a FPI with block data. v2 was doing that but did not cover the first case. - One SQL generating a FPI with no block data and a second generating no FPI with block data. So let's just geenrate a heap record on an UPDATE, for example, like in the version attached. > 2. Used int4 instead of int for fpilen just to be in sync with > fpi_length of pg_get_wal_record_info. Okay. > 3. Changed to be consistent and use just FPI or "F/full page". > /* FPI flags */ > /* No full page image, so store NULLs for all its fields */ > /* Full-page image */ > /* Full page exists, so let's save it. */ > * and end LSNs. This produces information about the full page images with > * to a record. Decompression is applied to the full-page images, if Fine by me. > 4. I think we need to free raw_data, raw_page and flags as we loop > over multiple blocks (XLR_MAX_BLOCK_ID) and will leak memory which can > be a problem if we have many blocks assocated with a single WAL > record. > flags = (Datum *) palloc0(sizeof(Datum) * bitcnt); > Also, we will leak all CStringGetTextDatum memory in the block_id for loop. > Another way is to use and reset temp memory context in the for loop > over block_ids. I prefer this approach over multiple pfree()s in > block_id for loop. I disagree, this was on purpose in the last version. This version finishes by calling AllocSetContextCreate() and MemoryContextDelete() once per *record*, which will not be free, and we are arguing about resetting the memory context after scanning up to XLR_MAX_BLOCK_ID blocks, or 32 blocks which would go up to 32kB per page in the worst case. That's not going to matter in a large scan for each record, but the extra AllocSet*() calls could. And we basically do the same thing on HEAD. > 5. I think it'd be good to say if the FPI is for WAL_VERIFICATION, so > I changed it to the following. Feel free to ignore this if you think > it's not required. > if (blk->apply_image) > flags[cnt++] = CStringGetTextDatum("APPLY"); > else > flags[cnt++] = CStringGetTextDatum("WAL_VERIFICATION"); Disagreed here as well. WAL_VERIFICATION does not map with any of the internal flags, and actually it may be finished by not being used at replay if the LSN of the page read if higher than what the WAL stores. > 7. Added test case which shows both block data and fpi in the > documentation. Okay on that. > 8. Changed wal_level to logical in walinspect.conf to test case with block data. This change is not necessary, per the argument above. Any comments? -- Michael
Attachment
On Wed, Mar 8, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Mar 07, 2023 at 03:56:22PM +0530, Bharath Rupireddy wrote: > > That would be a lot better. Not just the test, but also the > > documentation can have it. Simple way to generate such a record (both > > block data and FPI) is to just change the wal_level to logical in > > walinspect.conf [1], see code around REGBUF_KEEP_DATA and > > RelationIsLogicallyLogged in heapam.c > > I don't agree that we need to go down to wal_level=logical for this. > The important part is to check that the non-NULL and NULL paths for > the block data and FPI data are both taken, making 4 paths to check. > So we need two tests at minimum, which would be either: > - One SQL generating no FPI with no block data and a second generating > a FPI with block data. v2 was doing that but did not cover the first > case. > - One SQL generating a FPI with no block data and a second generating > no FPI with block data. > > So let's just geenrate a heap record on an UPDATE, for example, like > in the version attached. Yup, that should work too because block data gets logged [1]. > > 4. I think we need to free raw_data, raw_page and flags as we loop > > over multiple blocks (XLR_MAX_BLOCK_ID) and will leak memory which can > > be a problem if we have many blocks assocated with a single WAL > > record. > > flags = (Datum *) palloc0(sizeof(Datum) * bitcnt); > > Also, we will leak all CStringGetTextDatum memory in the block_id for loop. > > Another way is to use and reset temp memory context in the for loop > > over block_ids. I prefer this approach over multiple pfree()s in > > block_id for loop. > > I disagree, this was on purpose in the last version. This version > finishes by calling AllocSetContextCreate() and MemoryContextDelete() > once per *record*, which will not be free, and we are arguing about > resetting the memory context after scanning up to XLR_MAX_BLOCK_ID > blocks, or 32 blocks which would go up to 32kB per page in the worst > case. That's not going to matter in a large scan for each record, but > the extra AllocSet*() calls could. And we basically do the same thing > on HEAD. It's not just 32kB per page right? 32*8KB on HEAD (no block data, flags and CStringGetTextDatum there). With the patch, the number of pallocs for each block_id = 6 CStringGetTextDatum + BLCKSZ (8KB) + flags (5*size of ptr) + block data_len. In the worst case, all XLR_MAX_BLOCK_ID can have both FPIs and block data. Furthermore, imagine if someone initialized their cluster with a higher BLCKSZ (>= 8KB), then the memory leak happens noticeably on a lower-end system. I understand that performance is critical here but we need to ensure memory is used wisely. Therefore, I'd still vote to free at least the major contributors here, that is, pfree(raw_data);, pfree(raw_page); and pfree(flags); right after they are done using. I'm sure pfree()s don't hurt more than resetting memory context for every block_id. > Any comments? I think we need to output block data length (blk->data_len) similar to fpilen to save users from figuring out how to get the length of a bytea column. This will also keep block data in sync with FPI info. [1] needs_backup = (page_lsn <= RedoRecPtr); (gdb) p page_lsn $2 = 21581544 (gdb) p RedoRecPtr $3 = 21484808 (gdb) p needs_backup $4 = false (gdb) (gdb) bt #0 XLogRecordAssemble (rmid=10 '\n', info=64 '@', RedoRecPtr=21484808, doPageWrites=true, fpw_lsn=0x7ffde118d640, num_fpi=0x7ffde118d634, topxid_included=0x7ffde118d633) at xloginsert.c:582 #1 0x00005598cd9c3ef7 in XLogInsert (rmid=10 '\n', info=64 '@') at xloginsert.c:497 #2 0x00005598cd930452 in log_heap_update (reln=0x7f4a4c7cd808, oldbuf=136, newbuf=136, oldtup=0x7ffde118d820, newtup=0x5598d00cb098, old_key_tuple=0x0, all_visible_cleared=false, new_all_visible_cleared=false) at heapam.c:8473 #3 0x00005598cd92876e in heap_update (relation=0x7f4a4c7cd808, otid=0x7ffde118dab2, newtup=0x5598d00cb098, cid=0, crosscheck=0x0, wait=true, tmfd=0x7ffde118db60, lockmode=0x7ffde118da74) at heapam.c:3741 -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Mar 08, 2023 at 04:01:56PM +0530, Bharath Rupireddy wrote: > I understand that performance is critical here but we need to ensure > memory is used wisely. Therefore, I'd still vote to free at least the > major contributors here, that is, pfree(raw_data);, pfree(raw_page); > and pfree(flags); right after they are done using. I'm sure pfree()s > don't hurt more than resetting memory context for every block_id. Okay by me to have intermediate pfrees between each block scanned if you feel strongly about it. > I think we need to output block data length (blk->data_len) similar to > fpilen to save users from figuring out how to get the length of a > bytea column. This will also keep block data in sync with FPI info. length() works fine on bytea, so it can be used on the block data. fpilen is a very different matter as it would be the length of a page without a hole, or just something compressed. -- Michael
Attachment
On Wed, Mar 8, 2023 at 4:23 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Mar 08, 2023 at 04:01:56PM +0530, Bharath Rupireddy wrote: > > I understand that performance is critical here but we need to ensure > > memory is used wisely. Therefore, I'd still vote to free at least the > > major contributors here, that is, pfree(raw_data);, pfree(raw_page); > > and pfree(flags); right after they are done using. I'm sure pfree()s > > don't hurt more than resetting memory context for every block_id. > > Okay by me to have intermediate pfrees between each block scanned if > you feel strongly about it. Thanks. Attached v4 with that change. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
At Wed, 8 Mar 2023 20:18:06 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > On Wed, Mar 8, 2023 at 4:23 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Wed, Mar 08, 2023 at 04:01:56PM +0530, Bharath Rupireddy wrote: > > > I understand that performance is critical here but we need to ensure > > > memory is used wisely. Therefore, I'd still vote to free at least the > > > major contributors here, that is, pfree(raw_data);, pfree(raw_page); > > > and pfree(flags); right after they are done using. I'm sure pfree()s > > > don't hurt more than resetting memory context for every block_id. > > > > Okay by me to have intermediate pfrees between each block scanned if > > you feel strongly about it. > > Thanks. Attached v4 with that change. Although I'm not strongly opposed to pfreeing them, I'm not sure I like the way the patch frees them. The life times of all of raw_data, raw_page and flags are within a block. They can be freed unconditionally after they are actually used and the scope of the pointer variables can be properly narowed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Mar 09, 2023 at 09:46:12AM +0900, Kyotaro Horiguchi wrote: > Although I'm not strongly opposed to pfreeing them, I'm not sure I > like the way the patch frees them. The life times of all of raw_data, > raw_page and flags are within a block. They can be freed > unconditionally after they are actually used and the scope of the > pointer variables can be properly narowed. The thing is that you cannot keep them inside each individual blocks because they have to be freed once their values are stored in the tuplestore, which is why I guess Bharath has done things this way. After sleeping on that, I tend to prefer the simplicity of v3 where we keep track of the block and fpi data in each of their respective blocks. It means that we lose track of them each time we go to a different block, but the memory context reset done after each record means that scanning through a large WAL history will not cause a leak across the function call. The worst scenario with v3 is a record that makes use of all the 32 blocks with a hell lot of block data in each one of them, which is possible in theory, but very unlikely in practice except if someone uses a custom RGMR to generate crazily-shaped WAL records. I am aware of the fact that it is possible to generate such records if you are really willing to do so, aka this thread: https://www.postgresql.org/message-id/flat/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com In short, my choice would still be simplicity here with v3, I guess. -- Michael
Attachment
At Thu, 9 Mar 2023 10:15:39 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Thu, Mar 09, 2023 at 09:46:12AM +0900, Kyotaro Horiguchi wrote: > > Although I'm not strongly opposed to pfreeing them, I'm not sure I > > like the way the patch frees them. The life times of all of raw_data, > > raw_page and flags are within a block. They can be freed > > unconditionally after they are actually used and the scope of the > > pointer variables can be properly narowed. > > The thing is that you cannot keep them inside each individual blocks > because they have to be freed once their values are stored in the > tuplestore, which is why I guess Bharath has done things this way. Ugh.. Right. > After sleeping on that, I tend to prefer the simplicity of v3 where we > keep track of the block and fpi data in each of their respective > blocks. It means that we lose track of them each time we go to a > different block, but the memory context reset done after each record > means that scanning through a large WAL history will not cause a leak > across the function call. > > The worst scenario with v3 is a record that makes use of all the 32 > blocks with a hell lot of block data in each one of them, which is > possible in theory, but very unlikely in practice except if someone > uses a custom RGMR to generate crazily-shaped WAL records. I am aware > of the fact that it is possible to generate such records if you are > really willing to do so, aka this thread: > https://www.postgresql.org/message-id/flat/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com I agree to the view that that "leakage" for at-most 32 blocks and typically 0 to 2 blcoks won't be a matter. > In short, my choice would still be simplicity here with v3, I guess. FWIW, I slightly prefer v3 for the reason I mentioned above. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Mar 9, 2023 at 7:34 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > In short, my choice would still be simplicity here with v3, I guess. > > FWIW, I slightly prefer v3 for the reason I mentioned above. Hm, then, +1 for v3. FWIW, I quickly tried to hit that case where a single WAL record has max_block_id = XLR_MAX_BLOCK_ID with both FPIs and block data, but I couldn't. I could generate WAL records with 45K FPIs, 10mn block data and the total palloc'd length in the block_id for loop has not crossed 8K. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Mar 09, 2023 at 09:52:57AM +0530, Bharath Rupireddy wrote: > Hm, then, +1 for v3. Okay, thanks. Let's use that, then. -- Michael
Attachment
On Thu, Mar 09, 2023 at 03:37:21PM +0900, Michael Paquier wrote: > Okay, thanks. Let's use that, then. I have done one pass over that today, and applied it. Thanks! I'd really like to do something about the errors we raise in the module when specifying LSNs in the future for this release, now. I got annoyed by it again this morning while doing \watch queries that kept failing randomly while stressing this patch. -- Michael
Attachment
On Fri, Mar 10, 2023 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote: > > I'd really like to do something about the errors we raise in the > module when specifying LSNs in the future for this release, now. I > got annoyed by it again this morning while doing \watch queries that > kept failing randomly while stressing this patch. Perhaps what is proposed here https://www.postgresql.org/message-id/CALj2ACWqJ+m0HoQj9qkAV2uQfq97yk5jN2MOdfKcXusXsyptKQ@mail.gmail.com might help and avoid many errors around input LSN validations. Let's discuss that in that thread. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Mar 10, 2023 at 06:50:05AM +0530, Bharath Rupireddy wrote: > Perhaps what is proposed here > https://www.postgresql.org/message-id/CALj2ACWqJ+m0HoQj9qkAV2uQfq97yk5jN2MOdfKcXusXsyptKQ@mail.gmail.com > might help and avoid many errors around input LSN validations. Let's > discuss that in that thread. Yep, I am going to look at your proposal. -- Michael
Attachment
I'm excited to see that pg_get_wal_block_info() was merged. Thanks for working on this! Apologies for jumping back in here a bit late. I've been playing around with it and wanted to comment on the performance of JOINing to pg_get_wal_records_info(). On Tue, Mar 7, 2023 at 7:08 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Tue, 7 Mar 2023 at 01:34, Michael Paquier <michael@paquier.xyz> wrote: > > > > On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote: > > > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy > > >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to > > >> me as it outputs most of the columns that are already given by > > >> pg_get_wal_records_info.What I think the best way at this point is to > > >> make it return the following: > > >> lsn pg_lsn > > >> block_id int8 > > >> spcOid oid > > >> dbOid oid > > >> relNumber oid > > >> forkNames text > > >> fpi bytea > > >> fpi_info text > > > > I would add the length of the block data (without the hole and > > compressed, as the FPI data should always be presented as > > uncompressed), and the block data if any (without the block data > > length as one can guess it based on the bytea data length). Note that > > a block can have both a FPI and some data assigned to it, as far as I > > recall. > > > > > The basic idea is to create a single entrypoint to all relevant data > > > from DecodedXLogRecord in SQL, not multiple. > > > > While I would agree with this principle on simplicity's ground in > > terms of minimizing the SQL interface and the pg_wal/ lookups, I > > disagree about it on unsability ground, because we can avoid extra SQL > > tweaks with more functions. One recent example I have in mind is > > partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE > > on the catalogs. > > Correct, but in that case the user would build the same query (or at > least with the same complexity) as what we're executing under the > hood, right? > > > There are of course various degrees of complexity, > > and perhaps unnest() cannot qualify as one, but having two functions > > returning normalized records (one for the record information, and a > > second for the block information), is a rather good balance between > > usability and interface complexity, in my experience. > > I would agree, if it weren't for the reasons written below. > > > If you have two > > functions, a JOIN is enough to cross-check the block data and the > > record data, > > Joins are expensive on large datasets; and because WAL is one of the > largest datasets in the system, why would we want to force the user to > JOIN them if we can produce the data in one pre-baked data structure > without a need to join? I wanted to experiment to see how much slower it is to do the join between pg_get_wal_block_info() and pg_get_wal_records_info() and profile where the time was spent. I saved the wal lsn before and after a bunch of inserts (generates ~1,000,000 records). On master, I did a join like this: SELECT count(*) FROM pg_get_wal_block_info(:start_lsn, :end_lsn) b JOIN pg_get_wal_records_info(:start_lsn, :end_lsn) w ON w.start_lsn = b.lsn; which took 1191 ms. After patching master to add in the columns from pg_get_wal_records_info() which are not returned by pg_get_wal_block_info() (except block_ref column of course), this query: SELECT COUNT(*) FROM pg_get_wal_block_info(:start_lsn, :end_lsn); took 467 ms. Perhaps this difference isn't important, but I found it noticeable. A large chunk of the time is spent joining the tuplestores. The second largest chunk of time seems to be in GetWalRecordInfo()'s calls to XLogRecGetBlockRefInfo(), which spends quite a bit of time in string construction -- mainly for strings we wouldn't end up needing after joining to the block info function. Surprisingly, the string construction seemed to overshadow the performance impact of doubling the decoding passes over the WAL records. So maybe it is worth including more record-level info? On an unrelated note, I had thought that we should have some kind of parameter to pg_get_wal_block_info() to control whether or not we output the FPI to save us from wasting time decompressing the FPIs. AFAIK, we don't have access to projection information from inside the function, so a parameter like "output_fpi" or the like would have to do. I wanted to share what I found in trying this in case someone else had had that thought. TL;DR, it doesn't seem to matter from a performance perspective if we skip decompressing the FPIs in pg_get_wal_block_info(). I hacked "output_fpi" into pg_get_wal_block_info(), enabled pglz wal compression, and generated a boatload of FPIs by dirtying buffers, doing a checkpoint and then updating those pages again right after the checkpoint. With output_fpi = true (same as master), my call to pg_get_wal_block_info() took around 7 seconds and with output_fpi = false, it took around 6 seconds. Not an impressive difference. I noticed that only around 2% of the time is spent in pglz_decompress(). Most of the time (about half) is spent in building tuples for the tuplestore, copying memory around, and writing the tuplestore out to a file. Another 10-20% is spent decoding the records--which has to be done regardless. I wonder if there are cases where the decompression overhead would matter. My conclusion is that it isn't worth bothering with such a parameter. - Melanie
On Tue, Mar 14, 2023 at 3:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > After patching master to add in the columns from > pg_get_wal_records_info() which are not returned by > pg_get_wal_block_info() (except block_ref column of course), this query: > > SELECT COUNT(*) FROM pg_get_wal_block_info(:start_lsn, :end_lsn); > > took 467 ms. > > Perhaps this difference isn't important, but I found it noticeable. This seems weird to me too. It's not so much the performance overhead that bothers me (though that's not great either). It seems *illogical* to me. The query you end up writing must do two passes over the WAL records, but its structure almost suggests that it's necessary to do two separate passes over distinct "streams". Why doesn't it already work like this? Why do we need a separate pg_get_wal_block_info() function at all? -- Peter Geoghegan
On Tue, Mar 14, 2023 at 6:57 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Mar 14, 2023 at 3:34 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > After patching master to add in the columns from > > pg_get_wal_records_info() which are not returned by > > pg_get_wal_block_info() (except block_ref column of course), this query: > > > > SELECT COUNT(*) FROM pg_get_wal_block_info(:start_lsn, :end_lsn); > > > > took 467 ms. > > > > Perhaps this difference isn't important, but I found it noticeable. > > This seems weird to me too. It's not so much the performance overhead > that bothers me (though that's not great either). It seems *illogical* > to me. The query you end up writing must do two passes over the WAL > records, but its structure almost suggests that it's necessary to do > two separate passes over distinct "streams". > > Why doesn't it already work like this? Why do we need a separate > pg_get_wal_block_info() function at all? Well, I think if you only care about the WAL record-level information and not the block-level information, having the WAL record information denormalized like that with all the block information would be a nuisance. But, perhaps you are suggesting a parameter to pg_get_wal_records_info() like "with_block_info" or something, which produces the full denormalized block + record output? - Melanie
On Tue, Mar 14, 2023 at 5:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Tue, Mar 14, 2023 at 6:57 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Why doesn't it already work like this? Why do we need a separate > > pg_get_wal_block_info() function at all? > > Well, I think if you only care about the WAL record-level information > and not the block-level information, having the WAL record information > denormalized like that with all the block information would be a > nuisance. I generally care about both. When I want to look at things at the pg_get_wal_records_info() level (as opposed to a summary), the block_ref information is *always* of primary importance. I don't want to have to write my own bug-prone parser for block_ref, but why should the only alternative be joining against pg_get_wal_block_info()? The information that I'm interested in is "close at hand" to pg_get_wal_records_info() already. I understand that in the general case there might be quite a few blocks associated with a WAL record. For complicated cases, pg_get_wal_block_info() does make sense. However, the vast majority of individual WAL records (and possibly most WAL record types) are related to one block only. One block that is generally from the relation's main fork. > But, perhaps you are suggesting a parameter to pg_get_wal_records_info() > like "with_block_info" or something, which produces the full > denormalized block + record output? I was thinking of something like that, yes -- though it wouldn't necessarily have to be the *full* denormalized block_ref info, the FPI itself, etc. Just the more useful stuff. It occurs to me that my concern about the information that pg_get_wal_records_info() lacks could be restated as a concern about what pg_get_wal_block_info() lacks: pg_get_wal_block_info() fails to show basic information about the WAL record whose blocks it reports on, even though it could easily show all of the pg_get_wal_records_info() info once per block (barring block_ref). So addressing my concern by adjusting pg_get_wal_block_info() might be the best approach. I'd probably be happy with that -- I'd likely just stop using pg_get_wal_records_info() completely under this scheme. Overall, I'm concerned that we may have missed the opportunity to make simple things easier. Again, wanting to see (say) all of the PRUNE records and VACUUM records with an "order by relfilenode, block_number, lsn" seems likely to be a very common requirement to me. It's exactly the kind of thing that you'd expect an SQL interface to make easy. -- Peter Geoghegan
On Wed, Mar 15, 2023 at 7:20 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > But, perhaps you are suggesting a parameter to pg_get_wal_records_info() > > like "with_block_info" or something, which produces the full > > denormalized block + record output? > > I was thinking of something like that, yes -- though it wouldn't > necessarily have to be the *full* denormalized block_ref info, the FPI > itself, etc. Just the more useful stuff. > > It occurs to me that my concern about the information that > pg_get_wal_records_info() lacks could be restated as a concern about > what pg_get_wal_block_info() lacks: pg_get_wal_block_info() fails to > show basic information about the WAL record whose blocks it reports > on, even though it could easily show all of the > pg_get_wal_records_info() info once per block (barring block_ref). So > addressing my concern by adjusting pg_get_wal_block_info() might be > the best approach. I'd probably be happy with that -- I'd likely just > stop using pg_get_wal_records_info() completely under this scheme. How about something like the attached? It adds the per-record columns to pg_get_wal_block_info() avoiding "possibly expensive" joins with pg_get_wal_records_info(). With this, pg_get_wal_records_info() too will be useful for users scanning WAL at record level. That is to say that we can retain both pg_get_wal_records_info() and pg_get_wal_block_info(). -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Mar 14, 2023 at 06:50:15PM -0700, Peter Geoghegan wrote: > On Tue, Mar 14, 2023 at 5:34 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: >> Well, I think if you only care about the WAL record-level information >> and not the block-level information, having the WAL record information >> denormalized like that with all the block information would be a >> nuisance. > > I generally care about both. When I want to look at things at the > pg_get_wal_records_info() level (as opposed to a summary), the > block_ref information is *always* of primary importance. I don't want > to have to write my own bug-prone parser for block_ref, but why should > the only alternative be joining against pg_get_wal_block_info()? The > information that I'm interested in is "close at hand" to > pg_get_wal_records_info() already. > I am not sure to get the concern here. As long as one is smart enough with SQL, there is no need to perform a double scan of the contents of pg_wal with a large scan on the start LSN. If one wishes to only extract some block for a given record type, or for a filter of your choice, it is possible to use a LATERAL on pg_get_wal_block_info(), say: SELECT r.start_lsn, b.blockid FROM pg_get_wal_records_info('0/01000028', '0/1911AA8') AS r, LATERAL pg_get_wal_block_info(start_lsn, end_lsn) as b WHERE r.resource_manager = 'Heap2'; This will extract the block information that you'd want for a given record type. > I understand that in the general case there might be quite a few > blocks associated with a WAL record. For complicated cases, > pg_get_wal_block_info() does make sense. However, the vast majority of > individual WAL records (and possibly most WAL record types) are > related to one block only. One block that is generally from the > relation's main fork. Sure, though there may be more complicated scenarios, like custom RMGRs. At the end it comes to how much normalization should be applied to the data extracted. FWIW, I think that the current interface is a pretty good balance in usability. -- Michael
Attachment
On Wed, Mar 15, 2023 at 12:13:56PM +0530, Bharath Rupireddy wrote: > How about something like the attached? It adds the per-record columns > to pg_get_wal_block_info() avoiding "possibly expensive" joins with > pg_get_wal_records_info(). > > With this, pg_get_wal_records_info() too will be useful for users > scanning WAL at record level. That is to say that we can retain both > pg_get_wal_records_info() and pg_get_wal_block_info(). FWIW, I am not convinced that there is any need to bloat more the attributes of these functions, as filters for records could basically touch all the fields returned by pg_get_wal_records_info(). What about adding an example in the docs with the LATERAL query I mentioned previously? -- Michael
Attachment
On Wed, Mar 15, 2023 at 12:20 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Mar 14, 2023 at 06:50:15PM -0700, Peter Geoghegan wrote: > > On Tue, Mar 14, 2023 at 5:34 PM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > >> Well, I think if you only care about the WAL record-level information > >> and not the block-level information, having the WAL record information > >> denormalized like that with all the block information would be a > >> nuisance. > > > > I generally care about both. When I want to look at things at the > > pg_get_wal_records_info() level (as opposed to a summary), the > > block_ref information is *always* of primary importance. I don't want > > to have to write my own bug-prone parser for block_ref, but why should > > the only alternative be joining against pg_get_wal_block_info()? The > > information that I'm interested in is "close at hand" to > > pg_get_wal_records_info() already. > > > > I am not sure to get the concern here. As long as one is smart enough > with SQL, there is no need to perform a double scan of the contents of > pg_wal with a large scan on the start LSN. If one wishes to only > extract some block for a given record type, or for a filter of your > choice, it is possible to use a LATERAL on pg_get_wal_block_info(), > say: > SELECT r.start_lsn, b.blockid > FROM pg_get_wal_records_info('0/01000028', '0/1911AA8') AS r, > LATERAL pg_get_wal_block_info(start_lsn, end_lsn) as b > WHERE r.resource_manager = 'Heap2'; > > This will extract the block information that you'd want for a given > record type. It looks like nested-loop join is chosen for LATERAL query [1], that is, for every start_lsn and end_lsn that we get from pg_get_wal_records_info, pg_get_wal_block_info gets called. Whereas, for non-LATERAL join [2], hash/merge join is chosen which is pretty fast (5x) over 5mn WAL records. Therefore, I'm not sure if adding the LATERAL query as an example is a better idea. IIUC, the concern raised so far in this thread is not just on the performance of JOIN queries to get both block info and record level info, but on ease of using pg_walinspect functions. If pg_get_wal_block_info emits the record level information too (which turns out to be 50 LOC more), one doesn't have to be expert at writing JOIN queries or such, but just can run the function, which actually takes way less time (3sec) to scan the same 5mn WAL records [3]. [1] postgres=# EXPLAIN (ANALYZE) SELECT * FROM pg_get_wal_records_info(:'start_lsn', :'end_lsn') AS r, LATERAL pg_get_wal_block_info(start_lsn, end_lsn) AS b WHERE r.resource_manager = 'Heap'; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.01..112.50 rows=5000 width=330) (actual time=3175.114..49596.749 rows=5000019 loops=1) -> Function Scan on pg_get_wal_records_info r (cost=0.00..12.50 rows=5 width=168) (actual time=3175.058..4142.507 rows=4000019 loops=1) Filter: (resource_manager = 'Heap'::text) Rows Removed by Filter: 52081 -> Function Scan on pg_get_wal_block_info b (cost=0.00..10.00 rows=1000 width=162) (actual time=0.011..0.011 rows=1 loops=4000019) Planning Time: 0.076 ms Execution Time: 49998.850 ms (7 rows) Time: 49999.203 ms (00:49.999) [2] postgres=# EXPLAIN (ANALYZE) SELECT * FROM pg_get_wal_block_info(:'start_lsn', :'end_lsn') AS b JOIN pg_get_wal_records_info(:'start_lsn', :'end_lsn') AS w ON w.start_lsn = b.lsn WHERE w.resource_manager = 'Heap'; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=12.57..26.57 rows=25 width=330) (actual time=6241.449..9901.715 rows=5000019 loops=1) Hash Cond: (b.lsn = w.start_lsn) -> Function Scan on pg_get_wal_block_info b (cost=0.00..10.00 rows=1000 width=162) (actual time=1415.815..1870.522 rows=5067960 loops=1) -> Hash (cost=12.50..12.50 rows=5 width=168) (actual time=4665.292..4665.292 rows=4000019 loops=1) Buckets: 65536 (originally 1024) Batches: 128 (originally 1) Memory Usage: 7681kB -> Function Scan on pg_get_wal_records_info w (cost=0.00..12.50 rows=5 width=168) (actual time=3160.010..3852.332 rows=4000019 loops=1) Filter: (resource_manager = 'Heap'::text) Rows Removed by Filter: 52081 Planning Time: 0.082 ms Execution Time: 10159.066 ms (10 rows) Time: 10159.465 ms (00:10.159) [3] postgres=# EXPLAIN ANALYZE SELECT * FROM pg_get_wal_block_info(:'start_lsn', :'end_lsn'); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Function Scan on pg_get_wal_block_info (cost=0.00..10.00 rows=1000 width=286) (actual time=2617.755..3081.526 rows=5004478 loops=1) Planning Time: 0.039 ms Execution Time: 3301.217 ms (3 rows) Time: 3301.817 ms (00:03.302) postgres=# -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Mar 16, 2023 at 2:19 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > On Wed, Mar 15, 2023 at 12:20 PM Michael Paquier <michael@paquier.xyz> wrote: > > I am not sure to get the concern here. As long as one is smart enough > > with SQL, there is no need to perform a double scan of the contents of > > pg_wal with a large scan on the start LSN. If one wishes to only > > extract some block for a given record type, or for a filter of your > > choice, it is possible to use a LATERAL on pg_get_wal_block_info(), > > say: > > SELECT r.start_lsn, b.blockid > > FROM pg_get_wal_records_info('0/01000028', '0/1911AA8') AS r, > > LATERAL pg_get_wal_block_info(start_lsn, end_lsn) as b > > WHERE r.resource_manager = 'Heap2'; > > > > This will extract the block information that you'd want for a given > > record type. The same information *already* appears in pg_get_wal_records_info()'s block_ref output! Why should the user be expected to use a LATERAL join (or any type of join) to get _the same information_, just in a usable form? > IIUC, the concern raised so far in this thread is not just on the > performance of JOIN queries to get both block info and record level > info, but on ease of using pg_walinspect functions. If > pg_get_wal_block_info emits the record level information too (which > turns out to be 50 LOC more), one doesn't have to be expert at writing > JOIN queries or such, but just can run the function, which actually > takes way less time (3sec) to scan the same 5mn WAL records [3]. That's exactly my concern, yes. As you say, it's not just the performance aspect. Requiring users to write a needlessly ornamental query is actively misleading. It suggests that block_ref is distinct information from the blocks output by pg_get_wal_block_info(). -- Peter Geoghegan
On Fri, Mar 17, 2023 at 7:33 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > IIUC, the concern raised so far in this thread is not just on the > > performance of JOIN queries to get both block info and record level > > info, but on ease of using pg_walinspect functions. If > > pg_get_wal_block_info emits the record level information too (which > > turns out to be 50 LOC more), one doesn't have to be expert at writing > > JOIN queries or such, but just can run the function, which actually > > takes way less time (3sec) to scan the same 5mn WAL records [3]. > > That's exactly my concern, yes. As you say, it's not just the > performance aspect. Requiring users to write a needlessly ornamental > query is actively misleading. It suggests that block_ref is distinct > information from the blocks output by pg_get_wal_block_info(). +1 for pg_get_wal_block_info emitting per-record WAL info too along with block info, attached v2 patch does that. IMO, usability wins the race here. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Fri, Mar 17, 2023 at 12:20 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > +1 for pg_get_wal_block_info emitting per-record WAL info too along > with block info, attached v2 patch does that. IMO, usability wins the > race here. I think that this direction makes a lot of sense. Under this scheme, we still have pg_get_wal_records_info(), which is more or less an SQL interface to the information that pg_waldump presents by default. That's important when the record view of things is of primary importance. But now you also have a "block oriented" view of WAL presented by pg_get_wal_block_info(), which is useful when particular blocks are of primary interest. I think that I'll probably end up using both, while primarily using pg_get_wal_block_info() for more advanced analysis that focuses on what happened to particular blocks over time. It makes sense to present pg_get_wal_block_info() immediately after pg_get_wal_records_info() in the documentation under this scheme, since they're closely related. It would make sense to explain the relationship directly: pg_get_wal_block_info() doesn't have the block_ref column because it breaks that same information out by block instead, occasionally showing multiple rows for particular record types (which is what its "extra" columns describe). And, pg_get_wal_block_info() won't show anything for those records whose block_ref column is null according to pg_get_wal_records_info(), such as commit records. (Checks pg_walinspect once more...) Actually, I now see that block_ref won't be NULL for those records that have no block references at all -- it just outputs an empty string. But wouldn't it be better if it actually output NULL? Better for its own sake, but also better because doing so enables describing the relationship between the two functions with reference to block_ref. It seems particularly helpful to me to be able to say that pg_get_wal_block_info() doesn't show anything for precisely those WAL records whose block_ref is NULL according to pg_get_wal_records_info(). -- Peter Geoghegan
On Fri, Mar 17, 2023 at 12:36:04PM -0700, Peter Geoghegan wrote: > I think that this direction makes a lot of sense. Under this scheme, > we still have pg_get_wal_records_info(), which is more or less an SQL > interface to the information that pg_waldump presents by default. > That's important when the record view of things is of primary > importance. But now you also have a "block oriented" view of WAL > presented by pg_get_wal_block_info(), which is useful when particular > blocks are of primary interest. I think that I'll probably end up > using both, while primarily using pg_get_wal_block_info() for more > advanced analysis that focuses on what happened to particular blocks > over time. FWIW, I am not sure that it is a good idea and that we'd better not encourage too much the use of block_info() across a large range of WAL, which is what this function will make users eager to do in this case as it is possible to apply directly more filters to it. This is a community, so, anyway, if you feel strongly about doing this change, feel free to. -- Michael
Attachment
On Fri, Mar 17, 2023 at 4:11 PM Michael Paquier <michael@paquier.xyz> wrote: > FWIW, I am not sure that it is a good idea and that we'd better not > encourage too much the use of block_info() across a large range of > WAL, which is what this function will make users eager to do in this > case as it is possible to apply directly more filters to it. I'm sure that they will do that much more than they would have otherwise. Since we'll have made pg_get_wal_block_info() so much more useful than pg_get_wal_records_info() for many important use cases. Why is that a bad thing? Are you concerned about the overhead of pulling in FPIs when pg_get_wal_block_info() is run, if Bharath's patch is committed? That could be a problem, I suppose -- but it would be good to get more data on that. Do you think that this will be much of an issue, Bharath? I have pushed pg_walinspect to its limits myself (which is how I found that memory leak). Performance matters a great deal when you're doing an analysis of how blocks change over time, on a system that has written a realistically large amount of WAL over minutes or even hours. Why shouldn't that be a priority for pg_walinspect? My concerns have little to do with aesthetics, and everything to do with making those kinds of queries feasible. If the FPI thing is a problem then it seems to me that it should be addressed directly. For example, perhaps it would make sense to add a way to not incur the overhead of decompressing FPIs uselessly in cases where they're of no interest to us (likely the majority of cases once the patch is committed). It also might well make sense to rename pg_get_wal_block_info() to something more general, to reflect its more general purpose once it is expanded by Bharath's patch. As I said, it will become a lot closer to pg_get_wal_records_info(). We should be clear on that. -- Peter Geoghegan
On Fri, Mar 17, 2023 at 04:36:58PM -0700, Peter Geoghegan wrote: > I'm sure that they will do that much more than they would have > otherwise. Since we'll have made pg_get_wal_block_info() so much more > useful than pg_get_wal_records_info() for many important use cases. > Why is that a bad thing? Are you concerned about the overhead of > pulling in FPIs when pg_get_wal_block_info() is run, if Bharath's > patch is committed? That could be a problem, I suppose -- but it would > be good to get more data on that. Do you think that this will be much > of an issue, Bharath? Yes. The CPU cost is one thing, but I am also worrying about the I/O cost with a tuplestore spilling to disk a large number of FPIs, and some workloads can generate WAL so as FPIs is what makes for most of the contents stored in the WAL. (wal_compression is very effective in such cases, for example.) It is true that it is possible to tweak SQLs that exactly do that with a large amount of data materialized, or just eat so much CPU that they basically DoS the backend. Still I'd rather keep a minimalistic design for each function with block_info having only one field able to track back to which record a block information refers to, and I'd like to think one able to look at WAL internals will be smart enough to write SQL in such a way that they avoid that on a production machine. The current design allows to do that in this view, but that's just one way I see how to represent structures at SQL level. Extending block_info() with more record-level attributes allows that as well, still it bloats its interface unnecessarily. Designing software is hard, and it looks like our point of view on that is different. If you wish to change the current interface of block_info, feel free to do so. It does not mean that it cannot be changed, just that I recommend not to do that, and that's just one opinion. This said, your point about having rec_blk_ref reported as an empty string rather than NULL if there are no block references does not feel natural to me, either.. Reporting NULL would be better. -- Michael
Attachment
On Fri, Mar 17, 2023 at 5:51 PM Michael Paquier <michael@paquier.xyz> wrote: > Yes. The CPU cost is one thing, but I am also worrying about the > I/O cost with a tuplestore spilling to disk a large number of FPIs, > and some workloads can generate WAL so as FPIs is what makes for most > of the contents stored in the WAL. (wal_compression is very effective > in such cases, for example.) > > It is true that it is possible to tweak SQLs that exactly do that with > a large amount of data materialized, or just eat so much CPU that they > basically DoS the backend. Still I'd rather keep a minimalistic > design for each function with block_info having only one field able to > track back to which record a block information refers to, and I'd like > to think one able to look at WAL internals will be smart enough to > write SQL in such a way that they avoid that on a production machine. > The current design allows to do that in this view, but that's just one > way I see how to represent structures at SQL level. Not really. It has nothing to do with some abstract ideal about how the data should be logically structured. It is about how the actual underlying physical data structures work, and are accessed in practice, during query execution. And its about the constraints placed on us by the laws of physics. Some ways of doing this are measurably, provably much faster than other ways. It's very much not like we're querying tables whose general structure is under our control, via schema design, where the optimizer could reasonably be expected to make better choices as the data distribution changes. So why treat it like that? Right now, you're still basically standing by a design that is *fundamentally* less efficient for certain types of queries -- queries that I am very interested in, that I'm sure many of us will be interested in. It's not a matter of opinion. It is very much in evidence from Bharath's analysis. If a similar analysis reached the opposite conclusion, then you would be right and I would be wrong. It really is that simple. > This said, your point about having rec_blk_ref reported as an empty > string rather than NULL if there are no block references does not feel > natural to me, either.. Reporting NULL would be better. You have it backwards. It outputs an empty string right now. I want to change that, so that it outputs NULLs instead. -- Peter Geoghegan
On Fri, Mar 17, 2023 at 06:09:05PM -0700, Peter Geoghegan wrote: >> This said, your point about having rec_blk_ref reported as an empty >> string rather than NULL if there are no block references does not feel >> natural to me, either.. Reporting NULL would be better. > > You have it backwards. It outputs an empty string right now. I want to > change that, so that it outputs NULLs instead. My previous paragraph means exactly the same? I have just read what I wrote again. Twice. So, yes, I agree with this point. Sorry if my words meant the contrary to you. :) -- Michael
Attachment
On Sat, Mar 18, 2023 at 1:06 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Mar 17, 2023 at 12:20 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > +1 for pg_get_wal_block_info emitting per-record WAL info too along > > with block info, attached v2 patch does that. IMO, usability wins the > > race here. > > I think that this direction makes a lot of sense. Under this scheme, > we still have pg_get_wal_records_info(), which is more or less an SQL > interface to the information that pg_waldump presents by default. > That's important when the record view of things is of primary > importance. But now you also have a "block oriented" view of WAL > presented by pg_get_wal_block_info(), which is useful when particular > blocks are of primary interest. I think that I'll probably end up > using both, while primarily using pg_get_wal_block_info() for more > advanced analysis that focuses on what happened to particular blocks > over time. Hm. > It makes sense to present pg_get_wal_block_info() immediately after > pg_get_wal_records_info() in the documentation under this scheme, > since they're closely related. -1. I don't think we need that and even if we did, it's hard to maintain that ordering in future. One who knows to use these functions will anyway get to know how they're related. > (Checks pg_walinspect once more...) > > Actually, I now see that block_ref won't be NULL for those records > that have no block references at all -- it just outputs an empty > string. Yes, that's unnecessary. > But wouldn't it be better if it actually output NULL? +1 done so in the attached 0001 patch. > Better > for its own sake, but also better because doing so enables describing > the relationship between the two functions with reference to > block_ref. It seems particularly helpful to me to be able to say that > pg_get_wal_block_info() doesn't show anything for precisely those WAL > records whose block_ref is NULL according to > pg_get_wal_records_info(). Hm. Attaching v3 patch set - 0001 optimizations around block references, 0002 enables pg_get_wal_block_info() to emit per-record info. Any thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
At Sat, 18 Mar 2023 10:08:53 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > On Sat, Mar 18, 2023 at 1:06 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Fri, Mar 17, 2023 at 12:20 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > +1 for pg_get_wal_block_info emitting per-record WAL info too along > > > with block info, attached v2 patch does that. IMO, usability wins the > > > race here. > > > > I think that this direction makes a lot of sense. Under this scheme, > > we still have pg_get_wal_records_info(), which is more or less an SQL > > interface to the information that pg_waldump presents by default. > > That's important when the record view of things is of primary > > importance. But now you also have a "block oriented" view of WAL > > presented by pg_get_wal_block_info(), which is useful when particular > > blocks are of primary interest. I think that I'll probably end up > > using both, while primarily using pg_get_wal_block_info() for more > > advanced analysis that focuses on what happened to particular blocks > > over time. > > Hm. Even though I haven't explored every aspect of the view, I believe it makes sense to have at least the record_type in the block data. I don't know how often it will be used, but considering that, I have no objections to adding all the record information (apart from the block data itself) to the block data. > > It makes sense to present pg_get_wal_block_info() immediately after > > pg_get_wal_records_info() in the documentation under this scheme, > > since they're closely related. > > -1. I don't think we need that and even if we did, it's hard to > maintain that ordering in future. One who knows to use these functions > will anyway get to know how they're related. The documentation has just one section titled "General Functions" which directly contains detailed explation of four functions, making it hard to get clear understanding of the available functions. I considered breaking it down into a few subsections, but that wouldn't look great since most of them would only contain one function. However, I feel it would be helpful to add a list of all functions at the beginning of the section. > > (Checks pg_walinspect once more...) > > > > Actually, I now see that block_ref won't be NULL for those records > > that have no block references at all -- it just outputs an empty > > string. > > Yes, that's unnecessary. > > > But wouldn't it be better if it actually output NULL? > > +1 done so in the attached 0001 patch. > > > Better > > for its own sake, but also better because doing so enables describing > > the relationship between the two functions with reference to > > block_ref. It seems particularly helpful to me to be able to say that > > pg_get_wal_block_info() doesn't show anything for precisely those WAL > > records whose block_ref is NULL according to > > pg_get_wal_records_info(). > > Hm. I agree that adding a note about the characteristics would helpful to avoid the misuse of pg_get_wal_block_info(). How about something like, "Note that pg_get_wal_block_info() omits records that contains no block references."? > Attaching v3 patch set - 0001 optimizations around block references, > 0002 enables pg_get_wal_block_info() to emit per-record info. Any > thoughts? + /* Get block references, if any, otherwise continue. */ + if (!XLogRecHasAnyBlockRefs(xlogreader)) + continue; I'm not sure, but the "continue" might be confusing since the code "continue"s if the condition is true and continues the process otherwise.. And it seems like a kind of "explaination of what the code does". I feel we don't need the a comment there. It is not an issue with this patch, but as I look at this version, I'm starting to feel uneasy about the subtle differences between what GetWALRecordsInfo and GetWALBlockInfo do. One solution might be to have GetWALBlockInfo return a values array for each block, but that could make things more complex than needed. Alternatively, could we get GetWALRecordsInfo to call tuplestore_putvalues() internally? This way, both functions can manage the temporary memory context within themselves. About 0002: + /* Reset only per-block output columns, keep per-record info as-is. */ + memset(&nulls[PG_GET_WAL_BLOCK_INFO_PER_RECORD_COLS], 0, + PG_GET_WAL_BLOCK_INFO_PER_RECORD_COLS * sizeof(bool)); + memset(&values[PG_GET_WAL_BLOCK_INFO_PER_RECORD_COLS], 0, + PG_GET_WAL_BLOCK_INFO_PER_RECORD_COLS * sizeof(bool)); sizeof(*values) is not sizeof(bool), but sizeof(Datum). It seems to me that the starting elemnt of the arrays is (PG_GET_WAL_BLOCK_INFO_COLS - PG_GET_WAL_BLOCK_INFO_PER_RECORD_COLS). But I don't think simply rewriting that way is great. #define PG_GET_WAL_RECORD_INFO_COLS 11 ... +#define PG_GET_WAL_BLOCK_INFO_PER_RECORD_COLS 9 This means GetWALBlockInfo overwrites the last two columns generated by GetWalRecordInfo, but I don't think this approach is clean and stable. I agree we don't want the final columns in a block info tuple but we don't want to duplicate the common code path. I initially thought we could devide the function into GetWALCommonInfo(), GetWALRecordInfo() and GetWALBlockInfo(), but it doesn't seem that simple.. In the end, I think we should have separate GetWALRecordInfo() and GetWALBlockInfo() that have duplicate "values[i++] = .." lines. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Sun, Mar 19, 2023 at 8:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > The documentation has just one section titled "General Functions" > which directly contains detailed explation of four functions, making > it hard to get clear understanding of the available functions. I > considered breaking it down into a few subsections, but that wouldn't > look great since most of them would only contain one function. > However, I feel it would be helpful to add a list of all functions at > the beginning of the section. I like the idea of sections, even if there is only one function per section in some cases. I also think that we should add a "Tip" that advises users that they may use an "end LSN" that is the largest possible LSN, 'FFFFFFFF/FFFFFFFF' to get information about records up until the current LSN of the cluster (per commit 5c1b6628). Is there a straightforward way to get a usable LSN constant for this purpose? The simplest way I could come up with quickly is "SELECT pg_lsn(2^64.-1)" -- which still isn't very simple. Actually, it might be even worse than 'FFFFFFFF/FFFFFFFF', so perhaps we should just use that in the docs new "Tip". > I agree that adding a note about the characteristics would helpful to > avoid the misuse of pg_get_wal_block_info(). How about something like, > "Note that pg_get_wal_block_info() omits records that contains no > block references."? This should be a strict invariant. In other words, it should be part of the documented contract of pg_get_wal_block_info and pg_get_wal_records_info. The two functions should be defined in terms of each other. Their relationship is important. Users should be able to safely assume that the records that have a NULL block_ref according to pg_get_wal_records_info are *precisely* those records that won't have any entries within pg_get_wal_block_info (assuming that the same LSN range is used with both functions). pg_walinspect should explicitly promise this, and promise the corollary condition around non-NULL block_ref records. It is a useful promise from the point of view of users. It also makes it easier to understand what's really going on here without any ambiguity. I don't completely disagree with Michael about the redundancy. I just think that it's worth it on performance grounds. We might want to say that directly in the docs, too. > > Attaching v3 patch set - 0001 optimizations around block references, > > 0002 enables pg_get_wal_block_info() to emit per-record info. Any > > thoughts? > > + /* Get block references, if any, otherwise continue. */ > + if (!XLogRecHasAnyBlockRefs(xlogreader)) > + continue; > > I'm not sure, but the "continue" might be confusing since the code > "continue"s if the condition is true and continues the process > otherwise.. And it seems like a kind of "explaination of what the > code does". I feel we don't need the a comment there. +1. Also, if GetWALBlockInfo() is now supposed to only be called when XLogRecHasAnyBlockRefs() now then it should probably have an assertion to verify the precondition. > It is not an issue with this patch, but as I look at this version, I'm > starting to feel uneasy about the subtle differences between what > GetWALRecordsInfo and GetWALBlockInfo do. One solution might be to > have GetWALBlockInfo return a values array for each block, but that > could make things more complex than needed. Alternatively, could we > get GetWALRecordsInfo to call tuplestore_putvalues() internally? This > way, both functions can manage the temporary memory context within > themselves. Agreed. I'm also not sure what to do about it, though. > This means GetWALBlockInfo overwrites the last two columns generated > by GetWalRecordInfo, but I don't think this approach is clean and > stable. I agree we don't want the final columns in a block info tuple > but we don't want to duplicate the common code path. > I initially thought we could devide the function into > GetWALCommonInfo(), GetWALRecordInfo() and GetWALBlockInfo(), but it > doesn't seem that simple.. In the end, I think we should have separate > GetWALRecordInfo() and GetWALBlockInfo() that have duplicate > "values[i++] = .." lines. I agree. A little redundancy is better when the alternative is fragile code, and I'm pretty sure that that applies here -- there won't be very many duplicated lines, and the final code will be significantly clearer. There can be a comment about keeping GetWALRecordInfo and GetWALBlockInfo in sync. -- Peter Geoghegan
On Mon, Mar 20, 2023 at 4:34 PM Peter Geoghegan <pg@bowt.ie> wrote: > I agree. A little redundancy is better when the alternative is fragile > code, and I'm pretty sure that that applies here -- there won't be > very many duplicated lines, and the final code will be significantly > clearer. There can be a comment about keeping GetWALRecordInfo and > GetWALBlockInfo in sync. The new pg_get_wal_block_info outputs columns in an order that doesn't seem like the most useful possible order to me. This gives us another reason to have separate GetWALRecordInfo and GetWALBlockInfo utility functions rather than sharing logic for building output tuples. Specifically, I think that pg_get_wal_block_info should ouput the "primary key" columns first: reltablespace, reldatabase, relfilenode, blockid, start_lsn, end_lsn Next comes the columns that duplicate the columns output by pg_get_wal_records_info, in the same order as they appear in pg_get_wal_records_info. (Obviously this won't include block_ref). -- Peter Geoghegan
On Mon, Mar 20, 2023 at 4:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > The new pg_get_wal_block_info outputs columns in an order that doesn't > seem like the most useful possible order to me. This gives us another > reason to have separate GetWALRecordInfo and GetWALBlockInfo utility > functions rather than sharing logic for building output tuples. One more piece of feedback for Bharath: I think that we should also make the description output column display NULLs for those records that don't output any description string. This at least includes the "FPI" record type from the "XLOG" rmgr. Alternatively, we could find a way of making it show a description. -- Peter Geoghegan
On Mon, Mar 20, 2023 at 05:00:25PM -0700, Peter Geoghegan wrote: > I think that we should also make the description output column display > NULLs for those records that don't output any description string. This > at least includes the "FPI" record type from the "XLOG" rmgr. > Alternatively, we could find a way of making it show a description. An empty StringInfo is sent to rm_desc, so hardcoding something in pg_walinspect to show some data does not look right to me. Saying that, using NULL when there is no description is OK by me, as much as is using an empty string because it maps with the reality of the empty StringInfo sent to the rm_desc callback. -- Michael
Attachment
On Mon, Mar 20, 2023 at 04:51:19PM -0700, Peter Geoghegan wrote: > On Mon, Mar 20, 2023 at 4:34 PM Peter Geoghegan <pg@bowt.ie> wrote: >> I agree. A little redundancy is better when the alternative is fragile >> code, and I'm pretty sure that that applies here -- there won't be >> very many duplicated lines, and the final code will be significantly >> clearer. There can be a comment about keeping GetWALRecordInfo and >> GetWALBlockInfo in sync. Well. Vox populi, Vox dei. I am deadly outnumbered. > The new pg_get_wal_block_info outputs columns in an order that doesn't > seem like the most useful possible order to me. This gives us another > reason to have separate GetWALRecordInfo and GetWALBlockInfo utility > functions rather than sharing logic for building output tuples. > > Specifically, I think that pg_get_wal_block_info should ouput the > "primary key" columns first: > > reltablespace, reldatabase, relfilenode, blockid, start_lsn, end_lsn It seems to me that this is up to the one using this SQL? I am not sure to follow why this is important. For the cases you have poked at, I guess it is, but is strikes me that it is just natural to shape that to match the C structures we use for the WAL records themselves, so the other way around. > Next comes the columns that duplicate the columns output by > pg_get_wal_records_info, in the same order as they appear in > pg_get_wal_records_info. (Obviously this won't include block_ref). Hmm. Once you add more record data into pg_get_wal_block_info(), I am not sure to agree with this statement, actually. I would think that the most logical order would be the start_lsn, the end_lsn, the record information that you want for your quals, the block ID for the block registered in the record, and finally then the rest of the block information. So this puts the record-level data first, and the block info after. This would be a bit closer with the order of the WAL record structure itself, aka XLogRecord and such. Hence, it seems to me that 0002 has the order pretty much right. What's the point in adding the description, by the way? Only consistency with the other function? Is that really useful if you want to apply more quals when retrieving some block data? Calling GetWALRecordInfo() as part of GetWALBlockInfo() leads to a rather confusing result, IMO... @@ -377,6 +385,12 @@ pg_get_wal_block_info(PG_FUNCTION_ARGS) while (ReadNextXLogRecord(xlogreader) && xlogreader->EndRecPtr <= end_lsn) { + CHECK_FOR_INTERRUPTS(); + + /* Get block references, if any, otherwise continue. */ + if (!XLogRecHasAnyBlockRefs(xlogreader)) + continue; This early shortcut in 0001 is a good idea. -- Michael
Attachment
On Fri, Mar 17, 2023 at 8:51 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Mar 17, 2023 at 04:36:58PM -0700, Peter Geoghegan wrote: > > I'm sure that they will do that much more than they would have > > otherwise. Since we'll have made pg_get_wal_block_info() so much more > > useful than pg_get_wal_records_info() for many important use cases. > > Why is that a bad thing? Are you concerned about the overhead of > > pulling in FPIs when pg_get_wal_block_info() is run, if Bharath's > > patch is committed? That could be a problem, I suppose -- but it would > > be good to get more data on that. Do you think that this will be much > > of an issue, Bharath? > > Yes. The CPU cost is one thing, but I am also worrying about the > I/O cost with a tuplestore spilling to disk a large number of FPIs, > and some workloads can generate WAL so as FPIs is what makes for most > of the contents stored in the WAL. (wal_compression is very effective > in such cases, for example.) I had done some analysis about CPU costs for decompressing FPI upthread in [1], finding that adding a parameter to allow skipping outputting FPI would not have much impact when FPI are compressed, as decompressing the images comprised very little of the overall time. After reading what you said, I was interested to see how substantial the I/O cost with non-compressed FPI would be. Using a patch with a parameter to pg_get_wal_block_info() to skip outputting FPI, I found that on a fast local nvme ssd, the timing difference between doing so and not still isn't huge -- 9 seconds when outputting the FPI vs 8.5 seconds when skipping outputting FPI. (with ~50,000 records all with non-compressed FPIs). However, perhaps obviously, the I/O cost is worse. Doing nothing but SELECT * FROM pg_get_wal_block_info(:start_lsn, :end_lsn, true) where fpi is not null; per iostat, the write latency was double for the query which output fpi from the one that didn't and the wkB/s was much higher. This is probably obvious, but I'm just wondering if it makes sense to have such a parameter to avoid impacting a system which is doing concurrent I/O with walinspect. I have had use for block info without seeing the FPIs, personally. - Melanie [1] https://www.postgresql.org/message-id/CAAKRu_bJvbcYBRj2cN6G2xV7B7-Ja%2BpjTO1nEnEhRR8OXYiABA%40mail.gmail.com
On Wed, Mar 22, 2023 at 11:35 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Mar 17, 2023 at 8:51 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Fri, Mar 17, 2023 at 04:36:58PM -0700, Peter Geoghegan wrote: > > > I'm sure that they will do that much more than they would have > > > otherwise. Since we'll have made pg_get_wal_block_info() so much more > > > useful than pg_get_wal_records_info() for many important use cases. > > > Why is that a bad thing? Are you concerned about the overhead of > > > pulling in FPIs when pg_get_wal_block_info() is run, if Bharath's > > > patch is committed? That could be a problem, I suppose -- but it would > > > be good to get more data on that. Do you think that this will be much > > > of an issue, Bharath? > > > > Yes. The CPU cost is one thing, but I am also worrying about the > > I/O cost with a tuplestore spilling to disk a large number of FPIs, > > and some workloads can generate WAL so as FPIs is what makes for most > > of the contents stored in the WAL. (wal_compression is very effective > > in such cases, for example.) > > I had done some analysis about CPU costs for decompressing FPI upthread > in [1], finding that adding a parameter to allow skipping outputting FPI > would not have much impact when FPI are compressed, as decompressing the > images comprised very little of the overall time. > > After reading what you said, I was interested to see how substantial the > I/O cost with non-compressed FPI would be. > > Using a patch with a parameter to pg_get_wal_block_info() to skip > outputting FPI, I found that on a fast local nvme ssd, the timing > difference between doing so and not still isn't huge -- 9 seconds when > outputting the FPI vs 8.5 seconds when skipping outputting FPI. (with > ~50,000 records all with non-compressed FPIs). > > However, perhaps obviously, the I/O cost is worse. > Doing nothing but > > SELECT * FROM pg_get_wal_block_info(:start_lsn, :end_lsn, true) > where fpi is not null; Sorry, I should have been more clear: similar results with a select list simply excluding fpi and no where clause. - Melanie
On Mon, Mar 20, 2023 at 7:34 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Mar 19, 2023 at 8:21 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > It is not an issue with this patch, but as I look at this version, I'm > > starting to feel uneasy about the subtle differences between what > > GetWALRecordsInfo and GetWALBlockInfo do. One solution might be to > > have GetWALBlockInfo return a values array for each block, but that > > could make things more complex than needed. Alternatively, could we > > get GetWALRecordsInfo to call tuplestore_putvalues() internally? This > > way, both functions can manage the temporary memory context within > > themselves. > > Agreed. I'm also not sure what to do about it, though. > > > This means GetWALBlockInfo overwrites the last two columns generated > > by GetWalRecordInfo, but I don't think this approach is clean and > > stable. I agree we don't want the final columns in a block info tuple > > but we don't want to duplicate the common code path. > > > I initially thought we could devide the function into > > GetWALCommonInfo(), GetWALRecordInfo() and GetWALBlockInfo(), but it > > doesn't seem that simple.. In the end, I think we should have separate > > GetWALRecordInfo() and GetWALBlockInfo() that have duplicate > > "values[i++] = .." lines. > > I agree. A little redundancy is better when the alternative is fragile > code, and I'm pretty sure that that applies here -- there won't be > very many duplicated lines, and the final code will be significantly > clearer. There can be a comment about keeping GetWALRecordInfo and > GetWALBlockInfo in sync. So, I also agree that it is better to have the two separate functions instead of overwriting the last two columns. As for keeping them in sync, we could define the number of common columns as a macro like: #define WALINSPECT_INFO_NUM_COMMON_COLS 10 and use that to calculate the size of the values/nulls array in GetWalRecordInfo() and GetWALBlockInfo() (assuming a new version where those two functions duplicate the setting of values[x] = y). That way, if a new column of information is added and one of the two functions forgets to set it in the values array, it would still cause an empty column and it will be easier for the programmer to see it needs to be added. We could even define an enum like: typedef enum walinspect_common_col { WALINSPECT_START_LSN, WALINSPECT_END_LSN, WALINSPECT_PREV_LSN, WALINSPECT_XID, WALINSPECT_RMGR, WALINSPECT_REC_TYPE, WALINSPECT_REC_LENGTH, WALINSPECT_MAIN_DATA_LENGTH, WALINSPECT_FPILEN, WALINSPECT_DESC, WALINSPECT_NUM_COMMON_COL, } walinspect_common_col; and set values in both functions like values[WALINSPECT_FPILEN] = y if we kept the order of common columns the same and as the first N columns for both functions. This would keep us from having to manually update a macro like WALINSPECT_INFO_NUM_COMMON_COLS. Though, I'm not sure how much value that actually adds. - Melanie
On Wed, Mar 22, 2023 at 8:35 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > After reading what you said, I was interested to see how substantial the > I/O cost with non-compressed FPI would be. > > Using a patch with a parameter to pg_get_wal_block_info() to skip > outputting FPI, I found that on a fast local nvme ssd, the timing > difference between doing so and not still isn't huge -- 9 seconds when > outputting the FPI vs 8.5 seconds when skipping outputting FPI. (with > ~50,000 records all with non-compressed FPIs). > > However, perhaps obviously, the I/O cost is worse. > Doing nothing but > > SELECT * FROM pg_get_wal_block_info(:start_lsn, :end_lsn, true) > where fpi is not null; > > per iostat, the write latency was double for the query which output fpi > from the one that didn't and the wkB/s was much higher. I think that we should also have something like the patch that you wrote to skip FPIs. It's not something that I feel as strongly about as the main point about including all the fields from pg_get_wal_records_info. but it does seem worth doing. > I have had use for block info without seeing the FPIs, personally. I'd go further than that myself: I haven't had any use for FPIs at all. If I was going to do something with FPIs then I'd just use pg_waldump, since I'd likely want to get them onto the filesystem for analysis anyway. (Just my experience.) -- Peter Geoghegan
On Tue, Mar 21, 2023 at 11:33 PM Michael Paquier <michael@paquier.xyz> wrote: > > The new pg_get_wal_block_info outputs columns in an order that doesn't > > seem like the most useful possible order to me. This gives us another > > reason to have separate GetWALRecordInfo and GetWALBlockInfo utility > > functions rather than sharing logic for building output tuples. > > > > Specifically, I think that pg_get_wal_block_info should ouput the > > "primary key" columns first: > > > > reltablespace, reldatabase, relfilenode, blockid, start_lsn, end_lsn > > It seems to me that this is up to the one using this SQL? If you see it that way, then why does it matter what I may want to do with the declared order? > I am not > sure to follow why this is important. For the cases you have poked > at, I guess it is, but is strikes me that it is just natural to shape > that to match the C structures we use for the WAL records > themselves, so the other way around. I don't feel very strongly about it, but it seems better to highlight the difference that exists between this and pg_get_wal_records_info. > Hence, it seems to me that 0002 has the order pretty much right. > What's the point in adding the description, by the way? Only > consistency with the other function? Is that really useful if you > want to apply more quals when retrieving some block data? I don't understand. It's useful to include the description for the same reason as it's useful to include it in pg_get_wal_records_info. Why wouldn't it be useful? Most individual records that have any block_ref blocks have exactly one. Most individual WAL records are very simple record types. So pg_get_wal_block_info just isn't going to look that different to pg_get_wal_records_info, once they share most of the same columns. The way that pg_get_wal_block_info disaggregates on block won't make the output look all that different. So each distinct "description" will usually only appear once in pg_get_wal_block_info anyway. -- Peter Geoghegan
On Wed, Mar 22, 2023 at 05:05:10PM -0700, Peter Geoghegan wrote: > I'd go further than that myself: I haven't had any use for FPIs at > all. If I was going to do something with FPIs then I'd just use > pg_waldump, since I'd likely want to get them onto the filesystem for > analysis anyway. (Just my experience.) FWIW, being able to get access to raw FPIs with a SQL interface is useful if you cannot log into the host, which is something that a lot of cloud providers don't allow, and not everybody is able to have access to archived WAL segments in a different host. -- Michael
Attachment
On Wed, Mar 22, 2023 at 5:14 PM Michael Paquier <michael@paquier.xyz> wrote: > On Wed, Mar 22, 2023 at 05:05:10PM -0700, Peter Geoghegan wrote: > > I'd go further than that myself: I haven't had any use for FPIs at > > all. If I was going to do something with FPIs then I'd just use > > pg_waldump, since I'd likely want to get them onto the filesystem for > > analysis anyway. (Just my experience.) > > FWIW, being able to get access to raw FPIs with a SQL interface is > useful if you cannot log into the host, which is something that a lot > of cloud providers don't allow, and not everybody is able to have > access to archived WAL segments in a different host. I'm not saying that it's not ever useful. Just that finding FPIs interesting isn't necessarily all that common when using pg_get_wal_block_info (or won't be, once it has those additional columns from pg_get_wal_records_info). -- Peter Geoghegan
On Mon, Mar 20, 2023 at 8:51 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > + /* Get block references, if any, otherwise continue. */ > + if (!XLogRecHasAnyBlockRefs(xlogreader)) > > code does". I feel we don't need the a comment there. Removed. > This means GetWALBlockInfo overwrites the last two columns generated > by GetWalRecordInfo, but I don't think this approach is clean and > stable. I agree we don't want the final columns in a block info tuple > but we don't want to duplicate the common code path. > > I initially thought we could devide the function into > GetWALCommonInfo(), GetWALRecordInfo() and GetWALBlockInfo(), but it > doesn't seem that simple.. In the end, I think we should have separate > GetWALRecordInfo() and GetWALBlockInfo() that have duplicate > "values[i++] = .." lines. Done as per Peter's suggestion (keeping primary key columns first and having a bit of code duplicated instead of making it complex in the name of deduplication). Please see the attached v4 patch set. On Tue, Mar 21, 2023 at 5:04 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Sun, Mar 19, 2023 at 8:21 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > The documentation has just one section titled "General Functions" > > which directly contains detailed explation of four functions, making > > it hard to get clear understanding of the available functions. I > > considered breaking it down into a few subsections, but that wouldn't > > look great since most of them would only contain one function. > > However, I feel it would be helpful to add a list of all functions at > > the beginning of the section. > > I like the idea of sections, even if there is only one function per > section in some cases. Hm, -1 for now. Most of the extensions that I have seen don't have anything like that. If needed, someone can start a separate thread for such a proposal for all of the extensions. > I also think that we should add a "Tip" that advises users that they > may use an "end LSN" that is the largest possible LSN, > 'FFFFFFFF/FFFFFFFF' to get information about records up until the > current LSN of the cluster (per commit 5c1b6628). > > Is there a straightforward way to get a usable LSN constant for this > purpose? The simplest way I could come up with quickly is "SELECT > pg_lsn(2^64.-1)" -- which still isn't very simple. Actually, it might > be even worse than 'FFFFFFFF/FFFFFFFF', so perhaps we should just use > that in the docs new "Tip". Done. > > I agree that adding a note about the characteristics would helpful to > > avoid the misuse of pg_get_wal_block_info(). How about something like, > > "Note that pg_get_wal_block_info() omits records that contains no > > block references."? > > This should be a strict invariant. In other words, it should be part > of the documented contract of pg_get_wal_block_info and > pg_get_wal_records_info. The two functions should be defined in terms > of each other. Their relationship is important. > > Users should be able to safely assume that the records that have a > NULL block_ref according to pg_get_wal_records_info are *precisely* > those records that won't have any entries within pg_get_wal_block_info > (assuming that the same LSN range is used with both functions). > pg_walinspect should explicitly promise this, and promise the > corollary condition around non-NULL block_ref records. It is a useful > promise from the point of view of users. It also makes it easier to > understand what's really going on here without any ambiguity. > > I don't completely disagree with Michael about the redundancy. I just > think that it's worth it on performance grounds. We might want to say > that directly in the docs, too. Added a note in the docs. > Also, if GetWALBlockInfo() is now supposed to only be called when > XLogRecHasAnyBlockRefs() now then it should probably have an assertion > to verify the precondition. Done. > > I initially thought we could devide the function into > > GetWALCommonInfo(), GetWALRecordInfo() and GetWALBlockInfo(), but it > > doesn't seem that simple.. In the end, I think we should have separate > > GetWALRecordInfo() and GetWALBlockInfo() that have duplicate > > "values[i++] = .." lines. > > I agree. A little redundancy is better when the alternative is fragile > code, and I'm pretty sure that that applies here -- there won't be > very many duplicated lines, and the final code will be significantly > clearer. There can be a comment about keeping GetWALRecordInfo and > GetWALBlockInfo in sync. Done. On Tue, Mar 21, 2023 at 5:21 AM Peter Geoghegan <pg@bowt.ie> wrote: > > The new pg_get_wal_block_info outputs columns in an order that doesn't > seem like the most useful possible order to me. This gives us another > reason to have separate GetWALRecordInfo and GetWALBlockInfo utility > functions rather than sharing logic for building output tuples. > > Specifically, I think that pg_get_wal_block_info should ouput the > "primary key" columns first: > > reltablespace, reldatabase, relfilenode, blockid, start_lsn, end_lsn > > Next comes the columns that duplicate the columns output by > pg_get_wal_records_info, in the same order as they appear in > pg_get_wal_records_info. (Obviously this won't include block_ref). Done. On Tue, Mar 21, 2023 at 5:30 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I think that we should also make the description output column display > NULLs for those records that don't output any description string. This > at least includes the "FPI" record type from the "XLOG" rmgr. > Alternatively, we could find a way of making it show a description. Done. Please see the attached v4 patch set addressing all the review comments. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 23, 2023 at 10:54:40PM +0530, Bharath Rupireddy wrote: > Please see the attached v4 patch set addressing all the review comments. - desc = GetRmgr(XLogRecGetRmid(record)); - id = desc.rm_identify(XLogRecGetInfo(record)); - - if (id == NULL) - id = psprintf("UNKNOWN (%x)", XLogRecGetInfo(record) & ~XLR_INFO_MASK); - - initStringInfo(&rec_desc); - desc.rm_desc(&rec_desc, record); - - /* Block references. */ - initStringInfo(&rec_blk_ref); - XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len); - - main_data_len = XLogRecGetDataLen(record); I don't see any need to move this block of code? This leads to unnecessary diffs, potentially making backpatch a bit harder. Either way is not a big deal, still.. Except for this bit, 0001 looks fine by me. OUT reltablespace oid, OUT reldatabase oid, OUT relfilenode oid, OUT relblocknumber int8, + OUT blockid int2, + OUT start_lsn pg_lsn, + OUT end_lsn pg_lsn, + OUT prev_lsn pg_lsn, I'd still put the LSN data before the three OIDs for consistency with the structures, though my opinion does not seem to count much.. -- Michael
Attachment
At Sat, 25 Mar 2023 12:12:50 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Thu, Mar 23, 2023 at 10:54:40PM +0530, Bharath Rupireddy wrote: > OUT reltablespace oid, > OUT reldatabase oid, > OUT relfilenode oid, > OUT relblocknumber int8, > + OUT blockid int2, > + OUT start_lsn pg_lsn, > + OUT end_lsn pg_lsn, > + OUT prev_lsn pg_lsn, > > I'd still put the LSN data before the three OIDs for consistency with > the structures, though my opinion does not seem to count much.. I agree with Michael on this point. Also, although it may not be significant for SQL, the rows are output in lsn order from the function. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Mar 27, 2023 at 9:11 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Sat, 25 Mar 2023 12:12:50 +0900, Michael Paquier <michael@paquier.xyz> wrote in > > On Thu, Mar 23, 2023 at 10:54:40PM +0530, Bharath Rupireddy wrote: > > OUT reltablespace oid, > > OUT reldatabase oid, > > OUT relfilenode oid, > > OUT relblocknumber int8, > > + OUT blockid int2, > > + OUT start_lsn pg_lsn, > > + OUT end_lsn pg_lsn, > > + OUT prev_lsn pg_lsn, > > > > I'd still put the LSN data before the three OIDs for consistency with > > the structures, though my opinion does not seem to count much.. > > I agree with Michael on this point. Also, although it may not be > significant for SQL, the rows are output in lsn order from the > function. Done that way. On Sat, Mar 25, 2023 at 8:42 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Mar 23, 2023 at 10:54:40PM +0530, Bharath Rupireddy wrote: > > Please see the attached v4 patch set addressing all the review comments. > > - desc = GetRmgr(XLogRecGetRmid(record)); > - id = desc.rm_identify(XLogRecGetInfo(record)); > - > - if (id == NULL) > - id = psprintf("UNKNOWN (%x)", XLogRecGetInfo(record) & ~XLR_INFO_MASK); > - > - initStringInfo(&rec_desc); > - desc.rm_desc(&rec_desc, record); > - > - /* Block references. */ > - initStringInfo(&rec_blk_ref); > - XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len); > - > - main_data_len = XLogRecGetDataLen(record); > > I don't see any need to move this block of code? This leads to > unnecessary diffs, potentially making backpatch a bit harder. Either > way is not a big deal, still.. Except for this bit, 0001 looks fine > by me. It's a cosmetic change - I wanted to keep the calculation of column values closer to where they're assigned to Datum values. I agree to not cause too much diff and removed them. Please see the attached v5 patch set. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Sat, Mar 25, 2023 at 12:12:50PM +0900, Michael Paquier wrote: > I don't see any need to move this block of code? This leads to > unnecessary diffs, potentially making backpatch a bit harder. Either > way is not a big deal, still.. Except for this bit, 0001 looks fine > by me. FYI, I have gone through 0001 and applied it, after tweaking a bit the part about block references so as we have only one XLogRecHasAnyBlockRefs, with its StringInfoData used only locally. -- Michael
Attachment
On Mon, Mar 27, 2023 at 9:49 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Sat, Mar 25, 2023 at 12:12:50PM +0900, Michael Paquier wrote: > > I don't see any need to move this block of code? This leads to > > unnecessary diffs, potentially making backpatch a bit harder. Either > > way is not a big deal, still.. Except for this bit, 0001 looks fine > > by me. > > FYI, I have gone through 0001 and applied it, after tweaking a bit the > part about block references so as we have only one > XLogRecHasAnyBlockRefs, with its StringInfoData used only locally. Thanks. Here's the v6 patch (last patch that I have with me for pg_walinspect) for adding per-record info to pg_get_wal_block_info. Note that I addressed all review comments received so far. Any thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Sun, Mar 26, 2023 at 8:41 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > I'd still put the LSN data before the three OIDs for consistency with > > the structures, though my opinion does not seem to count much.. > > I agree with Michael on this point. Also, although it may not be > significant for SQL, the rows are output in lsn order from the > function. I guess that it makes sense to have the LSN data first, but to have the other columns after that. I certainly don't dislike that approach. I just noticed is that "forkname" appears towards the end among declared output parameters, even in Bharath's v6. I think that it should be after "relblocknumber" instead, because it is conceptually part of the "composite primary key", and so belongs right next to "relblocknumber". I failed to mention this detail upthread, I think. -- Peter Geoghegan
On Mon, Mar 27, 2023 at 12:42 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > Thanks. Here's the v6 patch (last patch that I have with me for > pg_walinspect) for adding per-record info to pg_get_wal_block_info. > Note that I addressed all review comments received so far. Any > thoughts? Looking at this now, with the intention of committing it for 16. In addition to what I said a little while ago about the forknum parameter and parameter ordering, I have a concern about the data type: perhaps the forknum paramater should be declared as "relforknumber smallint", instead of using text? That would match the approach taken by pg_buffercache, and would be more efficient. I don't think that using a text column with the fork name adds too much, since this is after all supposed to be a tool used by experts. Plus it's usually pretty clear what it is from context. Not that many WAL records touch the visibility map, and those that do make it relatively obvious which block is from the VM based on other details. Details such as blockid and relblocknumber (the VM is approximately 32k times smaller than the heap). Once I see that the record is (say) a VISIBLE record, I'm already looking at the order of each block reference, and maybe at relblocknumber -- I'm not likely to visually scan the forknum column at all. -- Peter Geoghegan
On Mon, Mar 27, 2023 at 4:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > Looking at this now, with the intention of committing it for 16. I see a bug on HEAD, following yesterday's commit 0276ae42dd. GetWALRecordInfo() will now output the value of the fpi_len variable before it has actually been set by our call to XXXX. So it'll always be 0. Can you post a bugfix patch for this, Bharath? -- Peter Geoghegan
On Mon, Mar 27, 2023 at 4:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > Looking at this now, with the intention of committing it for 16. Attached revision v7 adjusts the column order. This is still WIP, but gives a good idea of the direction I'm going in. v7 makes the column output look like this: pg@regression:5432 [1209761]=# select * from pg_get_wal_block_info('0/10D8280', '0/10D82B8'); ┌─[ RECORD 1 ]─────┬──────────────────────────────────────────────────┐ │ start_lsn │ 0/10D8280 │ │ end_lsn │ 0/10D82B8 │ │ prev_lsn │ 0/10D8208 │ │ blockid │ 0 │ │ reltablespace │ 1,663 │ │ reldatabase │ 1 │ │ relfilenode │ 2,610 │ │ relforknumber │ 0 │ │ relblocknumber │ 0 │ │ xid │ 0 │ │ resource_manager │ Heap2 │ │ record_type │ PRUNE │ │ record_length │ 56 │ │ main_data_length │ 8 │ │ block_fpi_length │ 0 │ │ block_fpi_info │ ∅ │ │ description │ snapshotConflictHorizon 10 nredirected 0 ndead 1 │ │ block_data │ \x2b00 │ │ fpi_data │ ∅ │ └──────────────────┴──────────────────────────────────────────────────┘ Few things to note here: * blockid is now given more prominence, in that it appears just after the LSN related output params. The idea is that the blockid is conceptually closer to the LSN stuff. * There is now a smallint relblocknumber, for consistency with pg_buffercache. This replaces the previous text column. As I mentioned earlier on, I don't think that a text based output param adds much. * The integer fields record_length, main_data_length, block_fpi_length all now all appear together. This is for consistency with the similar output params from the other function. v7 allows block_fpi_length to be 0 instead of NULL, for consistency with the fpi_length param from the other function. The intention is to make it relatively obvious which information "comes from the record" and which information "comes from the block reference". * The block_fpi_info output param appears right after block_fpi_info. This is not very verbose, and I find that it is hard to find by scrolling horizontally in pspg if it gets placed after either block_data or fpi_data, which tend to have at least some huge/wide outputs. It seemed sensible to place block_fpi_info next to the param I'm now calling block_fpi_length, after it was moved next to the other "length" fields. How do people feel about this approach? I'll need to write documentation to help the user to understand what's really going on here. -- Peter Geoghegan
Attachment
On Tue, Mar 28, 2023 at 5:29 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Mar 27, 2023 at 12:42 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > Thanks. Here's the v6 patch (last patch that I have with me for > > pg_walinspect) for adding per-record info to pg_get_wal_block_info. > > Note that I addressed all review comments received so far. Any > > thoughts? > > Looking at this now, with the intention of committing it for 16. > > In addition to what I said a little while ago about the forknum > parameter and parameter ordering, I have a concern about the data > type: perhaps the forknum paramater should be declared as > "relforknumber smallint", instead of using text? That would match the > approach taken by pg_buffercache, and would be more efficient. > > I don't think that using a text column with the fork name adds too > much, since this is after all supposed to be a tool used by experts. > Plus it's usually pretty clear what it is from context. Not that many > WAL records touch the visibility map, and those that do make it > relatively obvious which block is from the VM based on other details. > Details such as blockid and relblocknumber (the VM is approximately > 32k times smaller than the heap). Once I see that the record is (say) > a VISIBLE record, I'm already looking at the order of each block > reference, and maybe at relblocknumber -- I'm not likely to visually > scan the forknum column at all. Hm, agreed. Changed in the attached v7-0002 patch. We can as well write a case statement in the create function SQL to output forkname instead forknumber, but I'd stop doing that to keep in sync with pg_buffercache. On Tue, Mar 28, 2023 at 6:37 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Mar 27, 2023 at 4:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Looking at this now, with the intention of committing it for 16. > > I see a bug on HEAD, following yesterday's commit 0276ae42dd. > > GetWALRecordInfo() will now output the value of the fpi_len variable > before it has actually been set by our call to XXXX. So it'll always > be 0. > > Can you post a bugfix patch for this, Bharath? Oh, thanks for finding it out. Fixed in the attached v7-0001 patch. I also removed the "invalid fork number" error as users can figure that out if at all the fork number is wrong. On the ordering of the columns, I kept start_lsn, end_lsn and prev_lsn first and then the rel** columns (this rel** columns order follows pg_buffercache) and then block data related columns. Michael and Kyotaro are of the opinion that it's better to keep LSNs first to be consistent and also given that this function is WAL related, it makes sense to have LSNs first. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Mar 27, 2023 at 06:07:09PM -0700, Peter Geoghegan wrote: > I see a bug on HEAD, following yesterday's commit 0276ae42dd. > > GetWALRecordInfo() will now output the value of the fpi_len variable > before it has actually been set by our call to XXXX. So it'll always > be 0. Indeed, good catch. It looks like I was not careful enough with the block controlled by XLogRecHasAnyBlockRefs(). -- Michael
Attachment
On Mon, Mar 27, 2023 at 7:47 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > Hm, agreed. Changed in the attached v7-0002 patch. We can as well > write a case statement in the create function SQL to output forkname > instead forknumber, but I'd stop doing that to keep in sync with > pg_buffercache. I just don't see much value in any textual representation of fork name, however generated. In practice it's just not adding very much useful information. It is mostly useful as a way of filtering block references, which makes simple integers more natural. > Oh, thanks for finding it out. Fixed in the attached v7-0001 patch. I > also removed the "invalid fork number" error as users can figure that > out if at all the fork number is wrong. Pushed just now. > On the ordering of the columns, I kept start_lsn, end_lsn and prev_lsn > first and then the rel** columns (this rel** columns order follows > pg_buffercache) and then block data related columns. Michael and > Kyotaro are of the opinion that it's better to keep LSNs first to be > consistent and also given that this function is WAL related, it makes > sense to have LSNs first. Right, but I didn't change that part in the revision of the patch I posted. Those columns still came first, and were totally consistent with the pg_get_wal_record_info function. I think that there was a "mid air collision" here, where we both posted patches that we each called v7 within minutes of each other. Just to be clear, I ended up with a column order as described here in my revision: https://postgr.es/m/CAH2-WzmzO-AU4QSbnzzANBkrpg=4CuOd3scVtv+7x65e+QKBZw@mail.gmail.com It now occurs to me that "fpi_data" should perhaps be called "block_fpi_data". What do you think? -- Peter Geoghegan
On Tue, Mar 28, 2023 at 11:15:17AM -0700, Peter Geoghegan wrote: > On Mon, Mar 27, 2023 at 7:47 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: >> Hm, agreed. Changed in the attached v7-0002 patch. We can as well >> write a case statement in the create function SQL to output forkname >> instead forknumber, but I'd stop doing that to keep in sync with >> pg_buffercache. > > I just don't see much value in any textual representation of fork > name, however generated. In practice it's just not adding very much > useful information. It is mostly useful as a way of filtering block > references, which makes simple integers more natural. I disagree with this argument. Personally, I have a *much* better experience with textual representation because there is no need to cross-check the internals of the code in case you don't remember what a given number means in an enum or in a set of bits, especially if you're in a hurry of looking at a production or customer deployment. In short, it makes for less mistakes because you don't have to think about some extra mapping between some integers and what they actually mean through text. The clauses you'd apply for a group by on the forks, or for a filter with IN clauses don't change, they're just made easier to understand for the common user, and that includes experienced people. We'd better think about that like the text[] arrays we use for the flag values, like the FPI flags, or why we've introduced text[] for the HEAP_* flags in the heap functions of pageinspect. There's even more consistency with pageinspect in using a fork name, where we can pass down a fork name to get a raw page. -- Michael
Attachment
On Tue, Mar 28, 2023 at 3:34 PM Michael Paquier <michael@paquier.xyz> wrote: > I disagree with this argument. Personally, I have a *much* better > experience with textual representation because there is no need to > cross-check the internals of the code in case you don't remember what > a given number means in an enum or in a set of bits, especially if > you're in a hurry of looking at a production or customer deployment. I couldn't tell you which fork number is which right off the top of my head, either (except that main fork is 0). But it doesn't matter. There are only ever two relevant fork numbers. And even if that changes, it's perfectly clear which is which from context. There will be two block references for a given (say) VISIBLE record, one of which is obviously for the main fork, the other of which is obviously for the VM fork. Plus it's just more consistent that way. The existing pg_get_wal_block_info() output parameters look like they were directly copied from pg_buffercache (the same names are already used), so why not do the same with relforknumber? > In short, it makes for less mistakes because you don't have to think > about some extra mapping between some integers and what they actually > mean through text. The clauses you'd apply for a group by on the > forks, or for a filter with IN clauses don't change, they're just made > easier to understand for the common user, and that includes > experienced people. It's slightly harder to write a query that filters on text, and it won't perform as well. > We'd better think about that like the text[] > arrays we use for the flag values, like the FPI flags, or why we've > introduced text[] for the HEAP_* flags in the heap functions of > pageinspect. I think that those other things are fine, because they're much less obvious, and are very unlikely to ever appear in a query predicate. > There's even more consistency with pageinspect in using a fork name, > where we can pass down a fork name to get a raw page. pageinspect provides a way of mapping raw infomask/informask2 fields to status flags, through its heap_tuple_infomask_flags function. But the actual function that returns information from tuples (heap_page_items) faithfully represents the raw on-disk format directly -- the heap_tuple_infomask_flags part is totally optional. So that doesn't seem like an example that supports your argument -- quite the contrary. I wouldn't mind adding something to the docs about fork number. In fact it's definitely going to be necessary to have an explanation that at least matches the one from the pg_buffercache docs. -- Peter Geoghegan
On Mon, Mar 27, 2023 at 7:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached revision v7 adjusts the column order. This is still WIP, but > gives a good idea of the direction I'm going in. A couple of small tweaks to this appear in the attached revision, v8. Now it looks like this: pg@regression:5432 [1294231]=# select * from pg_get_wal_block_info('0/10E9D80' , '0/10E9DC0'); ┌─[ RECORD 1 ]──────┬────────────────────────────────────┐ │ start_lsn │ 0/10E9D80 │ │ end_lsn │ 0/10E9DC0 │ │ prev_lsn │ 0/10E9860 │ │ blockid │ 0 │ │ reltablespace │ 1,663 │ │ reldatabase │ 1 │ │ relfilenode │ 2,690 │ │ relforknumber │ 0 │ │ relblocknumber │ 5 │ │ xid │ 117 │ │ resource_manager │ Btree │ │ record_type │ INSERT_LEAF │ │ record_length │ 64 │ │ main_data_length │ 2 │ │ block_data_length │ 16 │ │ block_fpi_length │ 0 │ │ block_fpi_info │ ∅ │ │ description │ off 14 │ │ block_data │ \x00005400020010001407000000000000 │ │ block_fpi_data │ ∅ │ └───────────────────┴────────────────────────────────────┘ This is similar to what I showed recently for v7. Just two changes: * The parameter formerly called fpi_data is now called block_fpi_data, to enforce the idea that it's block specific (and for consistency with the related block_fpi_length param). * There is now a new column, which makes the size of block_data explicit: block_data_length This made sense on consistency grounds, since we already had a block_fpi_length. But it also seems quite useful. In this example, I can immediately see that this INSERT_LEAF record needed 2 bytes for the block offset number (indicating off 14), and 16 bytes of block data for the IndexTuple data itself. There is a more recognizable pattern to things, since the size of tuples for a given relation tends to be somewhat homogenous. block_data_length also seems like it could provide users with a handy way of filtering out definitely-irrelevant block references. -- Peter Geoghegan
Attachment
On Wed, Mar 29, 2023 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > A couple of small tweaks to this appear in the attached revision, v8. I spent some time on the documentation today, too. Attached is v9, which seems pretty close to being committable. I hope to commit what I have here (or something very close to it) in the next couple of days. Note that I've relocated the documentation for pg_get_wal_block_info() right after pg_get_wal_records_info(), despite getting some push back on that before now. It just doesn't make sense to leave it where it is, since the documentation now explains the new functionality by directly comparing the two functions. I also noticed that the docs were never updated following the end_lsn changes in commit 5c1b6628 (they still said that you needed an end_lsn before the server's current LSN). I've fixed that in passing, and added a new "Tip" that advertises the permissive interpretation around end_lsn values in a general sort of way (since it applies equally to all but one of the pg_walinspect functions). I've also done a little bit of restructuring of some of the other functions, to keep things consistent with what I want to do with pg_get_wal_block_info. -- Peter Geoghegan
Attachment
On Thu, Mar 30, 2023 at 5:15 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Mar 29, 2023 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > > A couple of small tweaks to this appear in the attached revision, v8. > > I spent some time on the documentation today, too. Attached is v9, > which seems pretty close to being committable. I hope to commit what I > have here (or something very close to it) in the next couple of days. > > Note that I've relocated the documentation for pg_get_wal_block_info() > right after pg_get_wal_records_info(), despite getting some push back > on that before now. It just doesn't make sense to leave it where it > is, since the documentation now explains the new functionality by > directly comparing the two functions. > > I also noticed that the docs were never updated following the end_lsn > changes in commit 5c1b6628 (they still said that you needed an end_lsn > before the server's current LSN). I've fixed that in passing, and > added a new "Tip" that advertises the permissive interpretation around > end_lsn values in a general sort of way (since it applies equally to > all but one of the pg_walinspect functions). I've also done a little > bit of restructuring of some of the other functions, to keep things > consistent with what I want to do with pg_get_wal_block_info. I took a look at v9 and LGTM. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Mar 29, 2023 at 8:28 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > I took a look at v9 and LGTM. Pushed, thanks. There is still an outstanding question around the overhead of outputting FPIs and even block data from pg_get_wal_block_info(). At one point Melanie suggested that we'd need to do something about that, and I tend to agree. Attached patch provides an optional parameter that will make pg_get_wal_block_info return NULLs for both block_data and block_fpi_data, no matter whether or not there is something to show. Note that this only affects those two bytea columns; we'll still show everything else, including valid block_data_length and block_fpi_length values (so the metadata describing the on-disk size of block_data and block_fpi_data is unaffected). To test this patch, I ran pgbench for about 5 minutes, using a fairly standard configuration with added indexes and with wal_log_hints enabled. I ended up with the following WAL records afterwards: pg@regression:5432 [1402115]=# SELECT "resource_manager/record_type" t, pg_size_pretty(combined_size) s, fpi_size_percentage perc_fpi FROM pg_get_wal_Stats ('0/10E9D80', 'FFFFFFFF/FFFFFFFF', FALSE) where combined_size > 0; ┌─[ RECORD 1 ]──────────────────┐ │ t │ XLOG │ │ s │ 1557 MB │ │ perc_fpi │ 22.029466865781302 │ ├─[ RECORD 2 ]──────────────────┤ │ t │ Transaction │ │ s │ 49 MB │ │ perc_fpi │ 0 │ ├─[ RECORD 3 ]──────────────────┤ │ t │ Storage │ │ s │ 13 kB │ │ perc_fpi │ 0 │ ├─[ RECORD 4 ]──────────────────┤ │ t │ CLOG │ │ s │ 1380 bytes │ │ perc_fpi │ 0 │ ├─[ RECORD 5 ]──────────────────┤ │ t │ Database │ │ s │ 118 bytes │ │ perc_fpi │ 0 │ ├─[ RECORD 6 ]──────────────────┤ │ t │ RelMap │ │ s │ 565 bytes │ │ perc_fpi │ 0 │ ├─[ RECORD 7 ]──────────────────┤ │ t │ Standby │ │ s │ 30 kB │ │ perc_fpi │ 0 │ ├─[ RECORD 8 ]──────────────────┤ │ t │ Heap2 │ │ s │ 4235 MB │ │ perc_fpi │ 0.6731388657682449 │ ├─[ RECORD 9 ]──────────────────┤ │ t │ Heap │ │ s │ 4482 MB │ │ perc_fpi │ 54.46811493602934 │ ├─[ RECORD 10 ]─────────────────┤ │ t │ Btree │ │ s │ 1786 MB │ │ perc_fpi │ 22.829279332421116 │ └──────────┴────────────────────┘ Time: 3618.693 ms (00:03.619) So about 12GB of WAL -- certainly enough to be a challenge for pg_walinspect. I then ran the following query several times over the same LSN range as before, with my patch applied, but with behavior equivalent to current git HEAD (this is with outputting block_data and block_fpi_data values still turned on): pg@regression:5432 [1402115]=# SELECT count(*) FROM pg_get_wal_block_info ('0/10E9D80', 'FFFFFFFF/FFFFFFFF', false); ┌─[ RECORD 1 ]───────┐ │ count │ 17,031,979 │ └───────┴────────────┘ Time: 35171.463 ms (00:35.171) The time shown here is typical of what I saw. And now the same query, but without any overhead for outputting block_data and block_fpi_data values: pg@regression:5432 [1402115]=# SELECT count(*) FROM pg_get_wal_block_info ('0/10E9D80', 'FFFFFFFF/FFFFFFFF', true); ┌─[ RECORD 1 ]───────┐ │ count │ 17,031,979 │ └───────┴────────────┘ Time: 15235.499 ms (00:15.235) This time is also typical of what I saw. The variance was fairly low, so I won't bother describing it. I think that this is a compelling reason to apply the patch. It would be possible to get about 75% of the benefit shown here by just suppressing block_fpi_data output, without suppressing block_data, but I think that it makes sense to either suppress both or neither. Things like page split records can write a fairly large amount of WAL in a way that resembles an FPI, even though technically no FPI is involved. If there are no objections, I'll move ahead with committing something along the lines of this patch in the next couple of days. -- Peter Geoghegan
Attachment
On Thu, Mar 30, 2023 at 2:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > pg@regression:5432 [1402115]=# SELECT > count(*) > FROM > pg_get_wal_block_info ('0/10E9D80', 'FFFFFFFF/FFFFFFFF', true); > ┌─[ RECORD 1 ]───────┐ > │ count │ 17,031,979 │ > └───────┴────────────┘ > > Time: 15235.499 ms (00:15.235) > > This time is also typical of what I saw. The variance was fairly low, > so I won't bother describing it. If I rerun the same test case with pg_get_wal_records_info (same WAL records, same system) then I find that it takes about 16 and a half seconds. So my patch makes pg_get_wal_block_info a little bit faster than pg_get_wal_records_info for this test case, and likely many interesting cases (assuming that the user opts out of fetching block_data and block_fpi_data values when running pg_get_wal_block_info, per the patch). This result closely matches what I was expecting. We're doing almost the same amount of work when each function is called, so naturally the runtime almost matches. Note that pg_get_wal_records_info does slightly *more* work here, since it alone must output rows for commit records. Unlike pg_get_wal_block_info, which (by design) never outputs rows for WAL records that lack block references. -- Peter Geoghegan
On Thu, Mar 30, 2023 at 2:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > There is still an outstanding question around the overhead of > outputting FPIs and even block data from pg_get_wal_block_info(). At > one point Melanie suggested that we'd need to do something about that, > and I tend to agree. Attached patch provides an optional parameter > that will make pg_get_wal_block_info return NULLs for both block_data > and block_fpi_data, no matter whether or not there is something to > show. Note that this only affects those two bytea columns; we'll still > show everything else, including valid block_data_length and > block_fpi_length values (so the metadata describing the on-disk size > of block_data and block_fpi_data is unaffected). I pushed this patch just now. Except that the final commited version had the "suppress_block_data" output parameter name flipped. It was inverted and renamed to "show_data" (and made "DEFAULT true"). This is closer to how the pg_stat_statements() function handles a similar issue with overly large query texts. I'm very happy with the end result of the work on this thread. It works a lot better for the sorts of queries I am interested in. Thanks to all involved, particularly Bharath. -- Peter Geoghegan