Re: Experimenting with hash tables inside pg_dump - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Experimenting with hash tables inside pg_dump
Date
Msg-id 20211022055939.z6fihsm7hdzbjttf@alap3.anarazel.de
Whole thread Raw
In response to Re: Experimenting with hash tables inside pg_dump  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Experimenting with hash tables inside pg_dump  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2021-10-21 22:13:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I wonder though if for some of them we should instead replace the per-object
> > queries with one query returning the information for all objects of a type. It
> > doesn't make all that much sense that we build and send one query for each
> > table and index.
> 
> The trick is the problem I alluded to in another thread: it's not safe to
> do stuff like pg_get_expr() on tables we don't have lock on.

I was looking at getTableAttrs() - sending one query instead of #tables
queries yields a quite substantial speedup in a quick prototype. And I don't
think it changes anything around locking semantics.


> I've thought about doing something like
> 
> SELECT unsafe-functions FROM pg_class WHERE oid IN (someoid, someoid, ...)
> 
> but in cases with tens of thousands of tables, it seems unlikely that
> that's going to behave all that nicely.

That's kinda what I'm doing in the quick hack. But instead of using IN(...) I
made it unnest('{oid, oid, ...}'), that scales much better.

A pg_dump --schema-only of the regression database goes from

real    0m0.675s
user    0m0.039s
sys    0m0.029s

to

real    0m0.477s
user    0m0.037s
sys    0m0.020s

which isn't half-bad.

There's a few more cases like this I think. But most are harder because the
dumping happens one-by-one from dumpDumpableObject(). The relatively easy but
substantial cases I could find quickly were getIndexes(), getConstraints(),
getTriggers()


To see where it's worth putting in time it'd be useful if getSchemaData() in
verbose mode printed timing information...


> The *real* fix, I suppose, would be to fix all those catalog-inspection
> functions so that they operate with respect to the query's snapshot.
> But that's not a job I'm volunteering for.  Besides which, pg_dump
> still has to cope with back-rev servers where it wouldn't be safe.

Yea, that's not a small change :(. I suspect that we'd need a bunch of new
caching infrastructure to make that reasonably performant, since this
presumably couldn't use syscache etc.

Greetings,

Andres Freund

Attachment

pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Parallel vacuum workers prevent the oldest xmin from advancing
Next
From: Alexander Pyhalov
Date:
Subject: Re: Partial aggregates pushdown