Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Andres Freund
Subject Re: problems with making relfilenodes 56-bits
Date 2022-10-03
Msg-id 20221003170125.7xd2vdadjak74tnm@awork3.anarazel.de
In response to Re: problems with making relfilenodes 56-bits  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: problems with making relfilenodes 56-bits  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it'd be interesting to look at per-record-type stats between two
> > equivalent workloads, to see where practical workloads suffer the most
> > (possibly with fpw=off, to make things more repeatable).
>
> I would expect, and Dilip's results seem to confirm, the effect to be
> pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> That's because most records contain at least one block reference, and
> if they contain multiple block references, likely all but one will be
> marked BKPBLOCK_SAME_REL, so we pay the cost just once.

But it doesn't really matter that much if an already large record gets a bit
bigger; it does matter for a small record, where those 4 bytes are a much
larger relative increase. Focusing on optimizing the record types where the
relative increase is large seems like a potential way forward to me, even if
we can't find something generic.
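
For reference, the 4 bytes Robert mentions come from the RelFileLocator that
each block reference carries unless it is marked BKPBLOCK_SAME_REL. A rough
sketch, simplified from src/include/storage/relfilelocator.h; the 56-bit
variant's names are invented here for illustration, not taken from the patch:

/* Assumes the usual PostgreSQL typedefs (Oid = uint32, uint64). */
typedef Oid RelFileNumber;              /* today: 32 bits */

typedef struct RelFileLocator
{
    Oid           spcOid;               /* tablespace */
    Oid           dbOid;                /* database */
    RelFileNumber relNumber;            /* relation */
} RelFileLocator;                       /* 12 bytes per block reference */

/*
 * With 56-bit relfilenumbers, relNumber has to become an 8-byte field, so
 * every block reference that stores a RelFileLocator grows by 4 bytes
 * (hypothetical names):
 */
typedef uint64 RelFileNumber56;

typedef struct RelFileLocator56
{
    Oid             spcOid;
    Oid             dbOid;
    RelFileNumber56 relNumber;          /* only 56 bits used */
} RelFileLocator56;                     /* 16 bytes per block reference */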


> I thought about trying to buy back some space elsewhere, and I think
> that would be a reasonable approach to getting this committed if we
> could find a way to do it. However, I don't see a terribly obvious way
> of making it happen.

I think there's plenty of potential...


> Trying to do it by optimizing specific WAL record
> types seems like a real pain in the neck, because there's tons of
> different WAL records that all have the same problem.

I am not so sure about that. Improving a bunch of the most frequent small
records might buy you back enough on just about every workload to be OK.

I put the top record sizes for an installcheck run with full_page_writes off
at the bottom. Certainly our regression tests aren't generally
representative, but I think the numbers still highlight reasonably well how
improving just a few record types could buy back more than enough. For
example, Heap/DELETE below averages 54 bytes per record (13546008 / 250852),
so 4 extra bytes would be roughly 7% on that record type alone.


> Trying to do it in a generic way makes more sense, and the fact that we have
> 2 padding bytes available in XLogRecord seems like a place to start looking,
> but the way forward from there is not clear to me.

Random idea: xl_prev is large. Store the full xl_prev in the page header, and
within each record store only a 2-byte offset relative to that page-header
xl_prev.
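
Roughly along the lines of the sketch below. Field and function names are
hypothetical, not the actual definitions in xlogrecord.h / xlog_internal.h
(which keep a full 8-byte xl_prev in every record); it assumes the usual
PostgreSQL typedefs (XLogRecPtr, TransactionId, TimeLineID, RmgrId,
pg_crc32c).

/* Hypothetical page header: one full prev pointer per WAL page. */
typedef struct XLogPageHeaderData
{
    uint16      xlp_magic;
    uint16      xlp_info;
    TimeLineID  xlp_tli;
    XLogRecPtr  xlp_pageaddr;   /* WAL address of this page */
    XLogRecPtr  xlp_prev;       /* NEW: start of the last record that began
                                 * on an earlier page */
    uint32      xlp_rem_len;
} XLogPageHeaderData;

/* Hypothetical record header: 2-byte offset instead of 8-byte xl_prev. */
typedef struct XLogRecord
{
    uint32      xl_tot_len;
    TransactionId xl_xid;
    uint16      xl_prev_off;    /* NEW: offset of the previous record's start
                                 * within this page; 0 means the previous
                                 * record began on an earlier page, so use
                                 * xlp_prev from the page header */
    uint8       xl_info;
    RmgrId      xl_rmid;
    pg_crc32c   xl_crc;
} XLogRecord;

/* Readers would reconstruct the full prev pointer like this: */
static inline XLogRecPtr
xlog_reconstruct_prev(const XLogPageHeaderData *hdr, const XLogRecord *rec)
{
    if (rec->xl_prev_off == 0)
        return hdr->xlp_prev;
    return hdr->xlp_pageaddr + rec->xl_prev_off;
}

That could shave up to 6 bytes off every record header, at the cost of a
slightly larger page header and a bit more work when following the backwards
chain.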

Greetings,

Andres Freund

by total size:

Type                             N      (%)  Record size      (%)   FPI size      (%) Combined size      (%)
----                             -      ---  -----------      ---   --------      --- -------------      ---
Heap/INSERT                1041666 ( 50.48)    106565255 ( 50.54)          0 (  0.00)     106565255 ( 43.92)
Btree/INSERT_LEAF           352196 ( 17.07)     24067672 ( 11.41)          0 (  0.00)      24067672 (  9.92)
Heap/DELETE                 250852 ( 12.16)     13546008 (  6.42)          0 (  0.00)      13546008 (  5.58)
Hash/INSERT                 108499 (  5.26)      7811928 (  3.70)          0 (  0.00)       7811928 (  3.22)
Transaction/COMMIT           16053 (  0.78)      6402657 (  3.04)          0 (  0.00)       6402657 (  2.64)
Gist/PAGE_UPDATE             57225 (  2.77)      5217100 (  2.47)          0 (  0.00)       5217100 (  2.15)
Gin/UPDATE_META_PAGE         23943 (  1.16)      4539970 (  2.15)          0 (  0.00)       4539970 (  1.87)
Gin/INSERT                   27004 (  1.31)      3623998 (  1.72)          0 (  0.00)       3623998 (  1.49)
Gist/PAGE_SPLIT                448 (  0.02)      3391244 (  1.61)          0 (  0.00)       3391244 (  1.40)
SPGist/ADD_LEAF              38968 (  1.89)      3341696 (  1.58)          0 (  0.00)       3341696 (  1.38)
...
XLOG/FPI                      7228 (  0.35)       378924 (  0.18)   29788166 ( 93.67)      30167090 ( 12.43)
...
Gin/SPLIT                      141 (  0.01)        13011 (  0.01)    1187588 (  3.73)       1200599 (  0.49)
...
                          --------              --------            --------              --------
Total                      2063609             210848282 [86.89%]   31802766 [13.11%]    242651048 [100%]

(Included XLOG/FPI and Gin/SPLIT to explain why there's FPIs despite running with fpw=off)

sorted by number of records:
Type                             N      (%)  Record size      (%)   FPI size      (%) Combined size      (%)
----                             -      ---  -----------      ---   --------      --- -------------      ---
Heap/INSERT                1041666 ( 50.48)    106565255 ( 50.54)          0 (  0.00)     106565255 ( 43.92)
Btree/INSERT_LEAF           352196 ( 17.07)     24067672 ( 11.41)          0 (  0.00)      24067672 (  9.92)
Heap/DELETE                 250852 ( 12.16)     13546008 (  6.42)          0 (  0.00)      13546008 (  5.58)
Hash/INSERT                 108499 (  5.26)      7811928 (  3.70)          0 (  0.00)       7811928 (  3.22)
Gist/PAGE_UPDATE             57225 (  2.77)      5217100 (  2.47)          0 (  0.00)       5217100 (  2.15)
SPGist/ADD_LEAF              38968 (  1.89)      3341696 (  1.58)          0 (  0.00)       3341696 (  1.38)
Gin/INSERT                   27004 (  1.31)      3623998 (  1.72)          0 (  0.00)       3623998 (  1.49)
Gin/UPDATE_META_PAGE         23943 (  1.16)      4539970 (  2.15)          0 (  0.00)       4539970 (  1.87)
Standby/LOCK                 18451 (  0.89)       775026 (  0.37)          0 (  0.00)        775026 (  0.32)
Transaction/COMMIT           16053 (  0.78)      6402657 (  3.04)          0 (  0.00)       6402657 (  2.64)
 


