Improvements in pg_dump/pg_restore toc format and performances - Mailing list pgsql-hackers

From Pierre Ducroquet
Subject Improvements in pg_dump/pg_restore toc format and performances
Msg-id 2656000.KRxA6XjA2N@peanuts2
List pgsql-hackers
Hi

Following the thread "Inefficiency in parallel pg_restore with many tables", I 
started digging into why the toc.dat files are that big and where time is spent 
when parsing them.

I ended up writing several patches that shave some time off pg_restore -l
and reduce the toc.dat size.

The first patch "finishes" the job of removing has-oids support. When this
support was removed, the field was kept as is instead of being dropped from
the dumps along with a bump of the dump version. This field stores a boolean
as a string, "true" or "false". This is not free: it costs 10 bytes per toc
entry.
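As a rough check of the 10-byte figure, here is a minimal sketch. It assumes
that a string in the toc is written as a length prefix (one sign byte plus a
4-byte integer) followed by the raw characters; that layout and the helper
names encoded_str_size and total_overhead are assumptions made for
illustration, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed width of the integer used for the length prefix. */
enum { ASSUMED_INT_SIZE = 4 };

/*
 * Size of one length-prefixed string under the assumed layout:
 * 1 sign byte + ASSUMED_INT_SIZE length bytes + the characters themselves.
 */
static size_t
encoded_str_size(const char *s)
{
    assert(s != NULL);
    return 1 + ASSUMED_INT_SIZE + strlen(s);
}

/* Total cost of storing the same string value once per toc entry. */
static size_t
total_overhead(size_t entries, const char *value)
{
    return entries * encoded_str_size(value);
}
```

Under these assumptions "false" costs 10 bytes, and with the 453999 entries
of the test toc reported further down this adds up to about 4.5 MB, which is
in line with the 214M to 210M drop measured for this patch.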

The second patch removes calls to sscanf and replaces them with strtoul. This 
was the biggest speedup for pg_restore -l.
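To illustrate the kind of replacement involved, here is a minimal sketch of
parsing an unsigned decimal field with strtoul rather than sscanf(s, "%u",
...). The helper name parse_u32 is hypothetical and the error handling is my
own choice; the actual patch may well be structured differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal string into a 32-bit unsigned value.
 * Returns false if the string is empty, has trailing characters, or does
 * not fit in 32 bits. Unlike sscanf, no format string has to be interpreted
 * on every call.
 */
static bool
parse_u32(const char *s, uint32_t *out)
{
    char *end;
    unsigned long v;

    assert(s != NULL && out != NULL);

    errno = 0;
    v = strtoul(s, &end, 10);

    /* Reject "no digits", trailing characters and out-of-range values. */
    if (end == s || *end != '\0' || errno == ERANGE || v > UINT32_MAX)
        return false;

    *out = (uint32_t) v;
    return true;
}
```

A caller would use it as parse_u32(tmp, &value) and treat a false return as
a corrupted toc entry.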

The third patch changes the dump format further to remove these strtoul calls
and stores the integers directly instead of as strings.
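As a generic illustration of storing an integer as bytes rather than as a
decimal string, here is a small sketch. The fixed 4-byte little-endian layout
and the names put_u32 and get_u32 are assumptions for the example; the
representation actually used by the patch is not described in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write v into buf as 4 bytes, least significant byte first.
 * Returns the number of bytes written.
 */
static size_t
put_u32(unsigned char *buf, uint32_t v)
{
    assert(buf != NULL);
    buf[0] = (unsigned char) (v & 0xFF);
    buf[1] = (unsigned char) ((v >> 8) & 0xFF);
    buf[2] = (unsigned char) ((v >> 16) & 0xFF);
    buf[3] = (unsigned char) ((v >> 24) & 0xFF);
    return 4;
}

/* Read back a value written by put_u32; no text parsing is needed. */
static uint32_t
get_u32(const unsigned char *buf)
{
    assert(buf != NULL);
    return (uint32_t) buf[0]
        | ((uint32_t) buf[1] << 8)
        | ((uint32_t) buf[2] << 16)
        | ((uint32_t) buf[3] << 24);
}
```

Writing the bytes in a fixed order keeps the file readable on machines with a
different endianness, and reading becomes a few shifts instead of a string
read followed by a conversion.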

The fourth patch is dirtier and makes more changes to the dump format. Instead
of storing the owner, tablespace, table access method and schema of each
object as a string, pg_dump builds an array of these strings, stores it at the
beginning of the file and replaces the strings with integer fields in the
dump. This reduces the file size further and removes a lot of calls to
ReadStr, which saves quite some time.
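The idea can be sketched as a small string table: each distinct string is
stored once and every entry keeps only its index. This is a minimal sketch
using plain malloc/realloc; the StringTable type and the string_table_*
functions are hypothetical names for the example, not the code from the
patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Growable array of distinct strings; an entry stores only an index. */
typedef struct StringTable
{
    char  **items;
    size_t  count;
    size_t  capacity;
} StringTable;

/*
 * Return the index of s, adding it if it is not present yet.
 * Returns -1 on allocation failure. The linear search assumes there are
 * few distinct values (owners, schemas, tablespaces, access methods).
 */
static int
string_table_intern(StringTable *t, const char *s)
{
    size_t  i;
    char   *copy;

    assert(t != NULL && s != NULL);

    for (i = 0; i < t->count; i++)
        if (strcmp(t->items[i], s) == 0)
            return (int) i;

    if (t->count == t->capacity)
    {
        size_t  ncap = t->capacity ? t->capacity * 2 : 8;
        char  **nitems = realloc(t->items, ncap * sizeof(char *));

        if (nitems == NULL)
            return -1;
        t->items = nitems;
        t->capacity = ncap;
    }

    copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, s);

    t->items[t->count] = copy;
    return (int) t->count++;
}

/* Look up a string by index; returns NULL for an invalid index. */
static const char *
string_table_get(const StringTable *t, int id)
{
    assert(t != NULL);
    if (id < 0 || (size_t) id >= t->count)
        return NULL;
    return t->items[id];
}

/* Release all memory and reset the table to its empty state. */
static void
string_table_free(StringTable *t)
{
    size_t i;

    assert(t != NULL);
    for (i = 0; i < t->count; i++)
        free(t->items[i]);
    free(t->items);
    t->items = NULL;
    t->count = 0;
    t->capacity = 0;
}
```

On the dump side, each entry would store the result of
string_table_intern(&table, owner) instead of the owner string; on the
restore side, the table is read once and string_table_get resolves the
index.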

The toc has 453999 entries.

Patch               Toc size   Dump -s duration   pg_restore -l duration
HEAD                214M       23.1s              1.27s
#1 (has oid)        210M       22.9s              1.26s
#2 (scanf)          210M       22.9s              1.07s
#3 (no strtoul)     202M       22.8s              0.94s
#4 (string list)    181M       23.1s              0.87s

Patch four is likely to require more changes. I don't know the PostgreSQL code
well enough to do better than calling pg_malloc/pg_realloc and maintaining a
char** manually; I guess there are structs and functions that do that in a
better way. The location of the string tables in the file and in the
structures is probably not acceptable either; I suppose these should go to the
toc header instead.

I am still submitting these for comments and a first review.

Best regards

 Pierre Ducroquet
