Re: Improvements in pg_dump/pg_restore toc format and performances - Mailing list pgsql-hackers

From Pierre Ducroquet
Subject Re: Improvements in pg_dump/pg_restore toc format and performances
Date
Msg-id 8312205.NyiUUSuA9g@peanuts2
Whole thread Raw
In response to Re: Improvements in pg_dump/pg_restore toc format and performances  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
On Monday, September 18, 2023 11:52:47 PM CEST Nathan Bossart wrote:
> On Thu, Jul 27, 2023 at 10:51:11AM +0200, Pierre Ducroquet wrote:
> > I ended up writing several patches that shaved some time for pg_restore
> > -l,
> > and reduced the toc.dat size.
> 
> I've only just started taking a look at these patches, and I intend to do a
> more thorough review in the hopefully-not-too-distant future.

Thank you very much.

> Since cfbot is failing on some pg_upgrade and pg_dump tests, I've set this
> to waiting-on-author.

I did not notice anything running meson test -v, I'll look further into it in 
the next days.

> > First patch is "finishing" the job of removing has oids support. When this
> > support was removed, instead of dropping the field from the dumps and
> > increasing the dump versions, the field was kept as is. This field stores
> > a
> > boolean as a string, "true" or "false". This is not free, and requires 10
> > bytes per toc entry.
> 
> This sounds reasonable to me.  I wonder why this wasn't done when WITH OIDS
> was removed in v12.

I suppose it is an oversight, or not wanting to increase the dump version for 
no reason.

> > The second patch removes calls to sscanf and replaces them with strtoul.
> > This was the biggest speedup for pg_restore -l.
> 
> Nice.
> 
> > The third patch changes the dump format further to remove these strtoul
> > calls and store the integers as is instead.
> 
> Do we need to worry about endianness here?

I used the ReadInt/WriteInt functions already defined in pg_dump that take care 
of this issue, so there should be no need to worry.

> > The fourth patch is dirtier and does more changes to the dump format.
> > Instead of storing the owner, tablespace, table access method and schema
> > of each object as a string, pg_dump builds an array of these, stores them
> > at the beginning of the file and replaces the strings with integer fields
> > in the dump. This reduces the file size further, and removes a lot of
> > calls to ReadStr, thus saving quite some time.
> 
> This sounds promising.
> 
> > Patch    Toc size    Dump -s duration    pg_restore -l duration
> > HEAD    214M    23.1s    1.27s
> > #1 (has oid)    210M    22.9s    1.26s
> > #2 (scanf)    210M    22.9s    1.07s
> > #3 (no strtoul)    202M    22.8s    0.94s
> > #4 (string list)    181M    23.1s    0.87s
> 
> At a glance, the size improvements in 0004 look the most interesting to me.

Yes it is, and the speed benefits are interesting too (at least for my usecase)







pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: Improvements in pg_dump/pg_restore toc format and performances
Next
From: Peter Geoghegan
Date:
Subject: Re: Index range search optimization