Thread: why do hash index builds use smgrextend() for new splitpoint pages
I'm trying to understand why hash indexes are built primarily in shared
buffers, except when allocating a new splitpoint's worth of bucket pages --
which is done with smgrextend() directly in _hash_alloc_buckets().

Is this just so that the value returned by smgrnblocks() includes the new
splitpoint's worth of bucket pages?

All writes of tuple data to pages in this new splitpoint will go through
shared buffers (via _hash_getnewbuf()).

I asked this and got some thoughts from Robert in [1], but I still don't
really get it.

When a new page is needed during the hash index build, why can't
_hash_expandtable() just call ReadBufferExtended() with P_NEW instead of
_hash_getnewbuf()? Does it have to do with the BUCKET_TO_BLKNO mapping?

- Melanie

[1] https://www.postgresql.org/message-id/CA%2BTgmoa%2BQFFhkHgPxyxv6t8aVU0r7GZmu7z8io4vGG7RHPpGzA%40mail.gmail.com
On Fri, Feb 25, 2022 at 4:41 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> I'm trying to understand why hash indexes are built primarily in shared
> buffers except when allocating a new splitpoint's worth of bucket pages
> -- which is done with smgrextend() directly in _hash_alloc_buckets().
>
> Is this just so that the value returned by smgrnblocks() includes the
> new splitpoint's worth of bucket pages?
>
> All writes of tuple data to pages in this new splitpoint will go
> through shared buffers (via _hash_getnewbuf()).
>
> I asked this and got some thoughts from Robert in [1], but I still don't
> really get it.
>
> When a new page is needed during the hash index build, why can't
> _hash_expandtable() just call ReadBufferExtended() with P_NEW instead of
> _hash_getnewbuf()? Does it have to do with the BUCKET_TO_BLKNO mapping?
>

We allocate the chunk of pages (power-of-2 groups) at the time of a
split, which allows them to appear consecutively in the index. This lets
us compute the physical block number from the bucket number easily (the
BUCKET_TO_BLKNO mapping) with minimal control information.

--
With Regards,
Amit Kapila.
On Fri, Feb 25, 2022 at 8:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 25, 2022 at 4:41 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > I'm trying to understand why hash indexes are built primarily in shared
> > buffers except when allocating a new splitpoint's worth of bucket pages
> > -- which is done with smgrextend() directly in _hash_alloc_buckets().
> >
> > Is this just so that the value returned by smgrnblocks() includes the
> > new splitpoint's worth of bucket pages?
> >
> > All writes of tuple data to pages in this new splitpoint will go
> > through shared buffers (via _hash_getnewbuf()).
> >
> > I asked this and got some thoughts from Robert in [1], but I still don't
> > really get it.
> >
> > When a new page is needed during the hash index build, why can't
> > _hash_expandtable() just call ReadBufferExtended() with P_NEW instead of
> > _hash_getnewbuf()? Does it have to do with the BUCKET_TO_BLKNO mapping?
> >
>
> We allocate the chunk of pages (power-of-2 groups) at the time of
> split which allows them to appear consecutively in an index.
>

I think allocating chunks of pages via "ReadBufferExtended() with P_NEW"
would be time-consuming compared to what we do now, since P_NEW extends
the relation only one block at a time.

--
With Regards,
Amit Kapila.
On Thu, Feb 24, 2022 at 10:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 25, 2022 at 4:41 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > I'm trying to understand why hash indexes are built primarily in shared
> > buffers except when allocating a new splitpoint's worth of bucket pages
> > -- which is done with smgrextend() directly in _hash_alloc_buckets().
> >
> > Is this just so that the value returned by smgrnblocks() includes the
> > new splitpoint's worth of bucket pages?
> >
> > All writes of tuple data to pages in this new splitpoint will go
> > through shared buffers (via _hash_getnewbuf()).
> >
> > I asked this and got some thoughts from Robert in [1], but I still don't
> > really get it.
> >
> > When a new page is needed during the hash index build, why can't
> > _hash_expandtable() just call ReadBufferExtended() with P_NEW instead of
> > _hash_getnewbuf()? Does it have to do with the BUCKET_TO_BLKNO mapping?
> >
>
> We allocate the chunk of pages (power-of-2 groups) at the time of
> split which allows them to appear consecutively in an index. This
> helps us to compute the physical block number from bucket number
> easily (BUCKET_TO_BLKNO mapping) with some minimal control
> information.

Got it, thanks.

Since _hash_alloc_buckets() WAL-logs the last page of the splitpoint, is
it safe to skip the smgrimmedsync()? What if the last page of the
splitpoint never has any tuples added to it during the index build, the
redo pointer moves past the WAL for this page, and then there is a crash
before this page makes it to permanent storage? Does it matter that this
page is lost? If not, then why bother WAL-logging it?

- Melanie
On Sat, Feb 26, 2022 at 3:01 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Since _hash_alloc_buckets() WAL-logs the last page of the
> splitpoint, is it safe to skip the smgrimmedsync()? What if the last
> page of the splitpoint doesn't end up having any tuples added to it
> during the index build and the redo pointer is moved past the WAL for
> this page and then later there is a crash sometime before this page
> makes it to permanent storage. Does it matter that this page is lost? If
> not, then why bother WAL-logging it?
>

I think we don't care if the page is lost before we update the meta-page
in the caller, because we will try to reallocate in that case. But we do
care after the meta-page update (which records this extension via the
various masks), in which case we won't lose this last page, because the
sync request for it would have been registered via smgrextend() before
the meta-page update.

--
With Regards,
Amit Kapila.
On Fri, Feb 25, 2022 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Feb 26, 2022 at 3:01 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > Since _hash_alloc_buckets() WAL-logs the last page of the
> > splitpoint, is it safe to skip the smgrimmedsync()? What if the last
> > page of the splitpoint doesn't end up having any tuples added to it
> > during the index build and the redo pointer is moved past the WAL for
> > this page and then later there is a crash sometime before this page
> > makes it to permanent storage. Does it matter that this page is lost? If
> > not, then why bother WAL-logging it?
> >
>
> I think we don't care if the page is lost before we update the
> meta-page in the caller because we will try to reallocate in that
> case. But we do care after meta page update (having the updated
> information about this extension via different masks) in which case we
> won't lose this last page because it would have registered the sync
> request for it via smgrextend before meta page update.

And could it happen that during smgrextend() for the last page, a
checkpoint starts and finishes between FileWrite() and
register_dirty_segment(), then the index build finishes, and then a
crash occurs before another checkpoint completes the pending fsync for
that last page?
On Sat, Feb 26, 2022 at 9:17 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Fri, Feb 25, 2022 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Feb 26, 2022 at 3:01 AM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > >
> > > Since _hash_alloc_buckets() WAL-logs the last page of the
> > > splitpoint, is it safe to skip the smgrimmedsync()? What if the last
> > > page of the splitpoint doesn't end up having any tuples added to it
> > > during the index build and the redo pointer is moved past the WAL for
> > > this page and then later there is a crash sometime before this page
> > > makes it to permanent storage. Does it matter that this page is lost? If
> > > not, then why bother WAL-logging it?
> > >
> >
> > I think we don't care if the page is lost before we update the
> > meta-page in the caller because we will try to reallocate in that
> > case. But we do care after meta page update (having the updated
> > information about this extension via different masks) in which case we
> > won't lose this last page because it would have registered the sync
> > request for it via smgrextend before meta page update.
>
> and could it happen that during smgrextend() for the last page, a
> checkpoint starts and finishes between FileWrite() and
> register_dirty_segment(), then index build finishes, and then a crash
> occurs before another checkpoint completes the pending fsync for that
> last page?
>

Yeah, this seems to be possible, and then the problem could be that the
index's idea and smgr's idea of the EOF differ, which could lead to a
problem when we try to get a new page via _hash_getnewbuf(). If this
theory turns out to be true then, probably, we can get an error either
because of a disk-full condition or because the index requests a block
that is beyond the EOF as determined by
RelationGetNumberOfBlocksInFork() in _hash_getnewbuf(). Can we try to
reproduce this scenario with the help of a debugger to see if we are
missing something?
--
With Regards,
Amit Kapila.