Hi,
On 2021-10-21 18:27:25 -0400, Tom Lane wrote:
> Today, pg_dump does a lot of internal lookups via binary search
> in presorted arrays. I thought it might improve matters
> to replace those binary searches with hash tables, theoretically
> converting O(log N) searches into O(1) searches. So I tried making
> a hash table indexed by CatalogId (tableoid+oid) with simplehash.h,
> and replacing as many data structures as I could with that.
That does sound like a good idea in theory...
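FWIW, with simplehash.h I'd expect the table definition to come out roughly
like the sketch below. The entry layout and names are just my guess at the
shape (hash_bytes() over the CatalogId struct, pg_malloc0 as the frontend
allocator), not necessarily what your patch does:

#include "common/hashfn.h"      /* for hash_bytes() */

typedef struct CatalogIdMapEntry
{
    CatalogId   catId;          /* hash key: (tableoid, oid) */
    uint32      status;         /* hash entry status, used by simplehash */
    uint32      hashval;        /* saved hash value, for SH_STORE_HASH */
    DumpableObject *dobj;       /* the indexed object */
} CatalogIdMapEntry;

#define SH_PREFIX       catalogid
#define SH_ELEMENT_TYPE CatalogIdMapEntry
#define SH_KEY_TYPE     CatalogId
#define SH_KEY          catId
#define SH_HASH_KEY(tb, key) \
    hash_bytes((const unsigned char *) &(key), sizeof(CatalogId))
#define SH_EQUAL(tb, a, b) \
    ((a).oid == (b).oid && (a).tableoid == (b).tableoid)
#define SH_STORE_HASH
#define SH_GET_HASH(tb, a)  (a)->hashval
#define SH_SCOPE        static inline
#define SH_RAW_ALLOCATOR pg_malloc0 /* frontend code, no palloc */
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

With that, findObjectByCatalogId() and friends reduce to a single
catalogid_lookup() call returning the entry (or NULL), assuming the table is
built once up front with catalogid_create() + catalogid_insert().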
> This makes the code shorter and (IMO anyway) cleaner, but
>
> (a) the executable size increases by a few KB --- apparently, even
> the minimum subset of simplehash.h's functionality is code-wasteful.
Hm. I'm a bit surprised by that. In an optimized build the difference is
smaller, at least:
optimized:
  text   data   bss     dec    hex  filename
448066   7048  1368  456482  6f722  src/bin/pg_dump/pg_dump
447530   7048  1496  456074  6f58a  src/bin/pg_dump/pg_dump.orig
debug:
  text   data   bss     dec    hex  filename
516883   7024  1352  525259  803cb  src/bin/pg_dump/pg_dump
509819   7024  1480  518323  7e8b3  src/bin/pg_dump/pg_dump.orig
The fact that optimization plays such a role makes me wonder if a good chunk
of the difference is due to the slightly more complicated
find{Type,Func,...}ByOid() functions.
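For comparison, the pre-patch lookup is basically just a tight loop over a
presorted array, which compiles down to very little code. A simplified sketch
of that shape (from memory, not the exact pg_dump code):

static DumpableObject *
findObjectByOid(Oid oid, DumpableObject **indexArray, int numObjs)
{
    int     low = 0;
    int     high = numObjs - 1;

    while (low <= high)
    {
        int     middle = (low + high) / 2;
        Oid     probe = indexArray[middle]->catId.oid;

        if (oid == probe)
            return indexArray[middle];
        else if (oid < probe)
            high = middle - 1;
        else
            low = middle + 1;
    }
    return NULL;
}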
> (b) I couldn't measure any change in performance at all. I tried
> it on the regression database and on a toy DB with 10000 simple
> tables. Maybe on a really large DB you'd notice some difference,
> but I'm not very optimistic now.
Did you measure the runtime of pg_dump, or how much CPU it used? I think a lot
of the time the backend is a bigger bottleneck than pg_dump...
For the regression test DB, the majority of the time seems to be spent below
two things:
1) libpq
2) sortDumpableObjects().
I don't think 2) hits the binary search / hashtable path?
It does seem interesting that a substantial part of the time is spent in/below
PQexec() and PQfnumber(). Especially the latter shouldn't be too hard to
optimize away...
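PQfnumber() does a linear scan over the result's column names on every call,
so calling it once per row makes the cost O(rows * cols). One easy way out
would be hoisting the lookups out of the row loop; a sketch with made-up
column names, just to show the pattern:

/* look the column numbers up once, not once per row */
int     i_oid = PQfnumber(res, "oid");
int     i_relname = PQfnumber(res, "relname");

for (int i = 0; i < PQntuples(res); i++)
{
    Oid     oid = atooid(PQgetvalue(res, i, i_oid));
    char   *relname = pg_strdup(PQgetvalue(res, i, i_relname));

    /* ... */
}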
Greetings,
Andres Freund