Re: Improvements in pg_dump/pg_restore toc format and performances - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: Improvements in pg_dump/pg_restore toc format and performances
Date
Msg-id 20230918215247.GA2661288@nathanxps13
Whole thread Raw
In response to Improvements in pg_dump/pg_restore toc format and performances  (Pierre Ducroquet <p.psql@pinaraf.info>)
Responses Re: Improvements in pg_dump/pg_restore toc format and performances
Re: Improvements in pg_dump/pg_restore toc format and performances
List pgsql-hackers
On Thu, Jul 27, 2023 at 10:51:11AM +0200, Pierre Ducroquet wrote:
> I ended up writing several patches that shaved some time for pg_restore -l, 
> and reduced the toc.dat size.

I've only just started taking a look at these patches, and I intend to do a
more thorough review in the hopefully-not-too-distant future.

> First patch is "finishing" the job of removing has oids support. When this 
> support was removed, instead of dropping the field from the dumps and 
> increasing the dump versions, the field was kept as is. This field stores a 
> boolean as a string, "true" or "false". This is not free, and requires 10 
> bytes per toc entry.

This sounds reasonable to me.  I wonder why this wasn't done when WITH OIDS
was removed in v12.

> The second patch removes calls to sscanf and replaces them with strtoul. This 
> was the biggest speedup for pg_restore -l.

Nice.

> The third patch changes the dump format further to remove these strtoul calls 
> and store the integers as is instead.

Do we need to worry about endianness here?

> The fourth patch is dirtier and does more changes to the dump format. Instead 
> of storing the owner, tablespace, table access method and schema of each 
> object as a string, pg_dump builds an array of these, stores them at the 
> beginning of the file and replaces the strings with integer fields in the dump. 
> This reduces the file size further, and removes a lot of calls to ReadStr, thus 
> saving quite some time.

This sounds promising.

> Patch    Toc size    Dump -s duration    pg_restore -l duration
> HEAD    214M    23.1s    1.27s
> #1 (has oid)    210M    22.9s    1.26s
> #2 (scanf)    210M    22.9s    1.07s
> #3 (no strtoul)    202M    22.8s    0.94s
> #4 (string list)    181M    23.1s    0.87s

At a glance, the size improvements in 0004 look the most interesting to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: Inefficiency in parallel pg_restore with many tables
Next
From: Nathan Bossart
Date:
Subject: Re: Improvements in pg_dump/pg_restore toc format and performances