Re: Experimenting with hash tables inside pg_dump - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Experimenting with hash tables inside pg_dump
Date
Msg-id 20211021233757.s7i55dpubw7rjf4a@alap3.anarazel.de
Whole thread Raw
In response to Experimenting with hash tables inside pg_dump  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Experimenting with hash tables inside pg_dump  ("Bossart, Nathan" <bossartn@amazon.com>)
Re: Experimenting with hash tables inside pg_dump  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Experimenting with hash tables inside pg_dump  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2021-10-21 18:27:25 -0400, Tom Lane wrote:
> Today, pg_dump does a lot of internal lookups via binary search
> in presorted arrays.  I thought it might improve matters
> to replace those binary searches with hash tables, theoretically
> converting O(log N) searches into O(1) searches.  So I tried making
> a hash table indexed by CatalogId (tableoid+oid) with simplehash.h,
> and replacing as many data structures as I could with that.

That does sound like a good idea in theory...


> This makes the code shorter and (IMO anyway) cleaner, but
> 
> (a) the executable size increases by a few KB --- apparently, even
> the minimum subset of simplehash.h's functionality is code-wasteful.

Hm. Surprised a bit by that. In an optimized build the difference is a
smaller, at least.

optimized:
   text       data        bss        dec        hex    filename
 448066       7048       1368     456482      6f722    src/bin/pg_dump/pg_dump
 447530       7048       1496     456074      6f58a    src/bin/pg_dump/pg_dump.orig

debug:
   text       data        bss        dec        hex    filename
 516883       7024       1352     525259      803cb    src/bin/pg_dump/pg_dump
 509819       7024       1480     518323      7e8b3    src/bin/pg_dump/pg_dump.orig

The fact that optimization plays such a role makes me wonder if a good chunk
of the difference is the slightly more complicated find{Type,Func,...}ByOid()
functions.


> (b) I couldn't measure any change in performance at all.  I tried
> it on the regression database and on a toy DB with 10000 simple
> tables.  Maybe on a really large DB you'd notice some difference,
> but I'm not very optimistic now.

Did you measure runtime of pg_dump, or how much CPU it used?  I think a lot of
the time the backend is a bigger bottleneck than pg_dump...

For the regression test DB the majority of the time seems to be spent below
two things:
1) libpq
2) sortDumpableObjects().

I don't think 2) hits the binary search / hashtable path?


It does seem interesting that a substantial part of the time is spent in/below
PQexec() and PQfnumber(). Especially the latter shouldn't be too hard to
optimize away...


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Mark Dilger
Date:
Subject: Re: CREATEROLE and role ownership hierarchies
Next
From: "Bossart, Nathan"
Date:
Subject: Re: Experimenting with hash tables inside pg_dump