Hi
Following the thread "Inefficiency in parallel pg_restore with many tables", I
started digging into why the toc.dat files are so big and where time is spent
when parsing them.
I ended up writing several patches that shaved some time off pg_restore -l and
reduced the toc.dat size.
The first patch "finishes" the job of removing has-oids support. When that
support was removed, the field was kept as is instead of being dropped from the
dumps and the dump format version being bumped. This field stores a boolean as
a string, "true" or "false"; this is not free, and costs 10 bytes per TOC
entry.
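To illustrate where those bytes go (a sketch, not the literal archiver code):
WriteStr() emits a length prefix (an integer plus its sign byte, 5 bytes with
the default 4-byte integers) followed by the characters themselves.

    /* write side: a constant string per TOC entry */
    WriteStr(AH, "false");      /* 5-byte length prefix + 5 chars = 10 bytes */

    /* read side: the string is read and parsed, only to be discarded */
    tmp = ReadStr(AH);
    free(tmp);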
The second patch removes calls to sscanf and replaces them with strtoul. This
was the biggest speedup for pg_restore -l.
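To give an idea of the kind of change involved (a sketch around the catalog
OID field, not the patch itself):

    /* before: a format string is interpreted for every TOC entry */
    sscanf(tmp, "%u", &te->catalogId.oid);

    /* after: the decimal digits are converted directly */
    te->catalogId.oid = strtoul(tmp, NULL, 10);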
The third patch changes the dump format further: the integers are stored
directly instead of as strings, removing these strtoul calls altogether.
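Conceptually (again a sketch, glossing over signedness and cross-version
compatibility, which the real patch has to handle):

    /* before: the value makes a round-trip through text */
    snprintf(buf, sizeof(buf), "%u", te->catalogId.oid);   /* dump side */
    WriteStr(AH, buf);
    tmp = ReadStr(AH);                                      /* restore side */
    te->catalogId.oid = strtoul(tmp, NULL, 10);
    free(tmp);

    /* after: the value stays in the archive's integer format */
    WriteInt(AH, te->catalogId.oid);                        /* dump side */
    te->catalogId.oid = ReadInt(AH);                        /* restore side */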
The fourth patch is dirtier and makes more changes to the dump format. Instead
of storing the owner, tablespace, table access method and schema of each
object as a string, pg_dump builds an array of the distinct values, stores it
at the beginning of the file, and replaces the strings with integer fields in
the dump. This reduces the file size further and removes a lot of calls to
ReadStr, saving quite some time.
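The core of it is just a small string pool: each distinct owner, schema,
tablespace or table access method is stored once, and TOC entries refer to it
by index. A minimal sketch of such a pool (the names are mine, not the
patch's), using the frontend memory helpers:

    #include <string.h>
    #include "common/fe_memutils.h"     /* pg_realloc(), pg_strdup() */

    typedef struct StringPool
    {
        char  **strings;        /* unique strings, in first-seen order */
        int     count;
        int     allocated;
    } StringPool;

    /* Return the index of "value" in the pool, adding it if necessary. */
    static int
    string_pool_intern(StringPool *pool, const char *value)
    {
        for (int i = 0; i < pool->count; i++)
        {
            if (strcmp(pool->strings[i], value) == 0)
                return i;
        }
        if (pool->count >= pool->allocated)
        {
            pool->allocated = pool->allocated ? pool->allocated * 2 : 64;
            pool->strings = pg_realloc(pool->strings,
                                       pool->allocated * sizeof(char *));
        }
        pool->strings[pool->count] = pg_strdup(value);
        return pool->count++;
    }

A linear scan is enough since the number of distinct owners, schemas and
tablespaces is tiny compared to the number of TOC entries. pg_dump would write
the pool once near the header and store only an index per entry (e.g.
WriteInt(AH, string_pool_intern(&owners, te->owner))), while pg_restore would
read the pool first and turn indexes back into pointers without any per-entry
ReadStr call.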
The TOC has 453999 entries.
Patch             TOC size   pg_dump -s duration   pg_restore -l duration
HEAD              214M       23.1s                 1.27s
#1 (has oid)      210M       22.9s                 1.26s
#2 (scanf)        210M       22.9s                 1.07s
#3 (no strtoul)   202M       22.8s                 0.94s
#4 (string list)  181M       23.1s                 0.87s
Patch four is likely to require more changes. I don't know the PostgreSQL code
well enough to do better than calling pg_malloc/pg_realloc and maintaining a
char ** manually; I guess there are structs and functions that do this in a
better way. And the location of the string tables in the file and in the
structures is probably not acceptable; I suppose they should go into the TOC
header instead.
I nevertheless submit these patches for comments and a first review.
Best regards
Pierre Ducroquet