Home > mailing lists

Re: Experimenting with hash tables inside pg_dump - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: Experimenting with hash tables inside pg_dump
Date	October 22, 2021 05:59:39
Msg-id	20211022055939.z6fihsm7hdzbjttf@alap3.anarazel.de Whole thread Raw
In response to	Re: Experimenting with hash tables inside pg_dump (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Experimenting with hash tables inside pg_dump
List	pgsql-hackers

Tree view

Hi,

On 2021-10-21 22:13:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I wonder though if for some of them we should instead replace the per-object
> > queries with one query returning the information for all objects of a type. It
> > doesn't make all that much sense that we build and send one query for each
> > table and index.
> 
> The trick is the problem I alluded to in another thread: it's not safe to
> do stuff like pg_get_expr() on tables we don't have lock on.

I was looking at getTableAttrs() - sending one query instead of #tables
queries yields a quite substantial speedup in a quick prototype. And I don't
think it changes anything around locking semantics.

> I've thought about doing something like
> 
> SELECT unsafe-functions FROM pg_class WHERE oid IN (someoid, someoid, ...)
> 
> but in cases with tens of thousands of tables, it seems unlikely that
> that's going to behave all that nicely.

That's kinda what I'm doing in the quick hack. But instead of using IN(...) I
made it unnest('{oid, oid, ...}'), that scales much better.

A pg_dump --schema-only of the regression database goes from

real    0m0.675s
user    0m0.039s
sys    0m0.029s

to

real    0m0.477s
user    0m0.037s
sys    0m0.020s

which isn't half-bad.

There's a few more cases like this I think. But most are harder because the
dumping happens one-by-one from dumpDumpableObject(). The relatively easy but
substantial cases I could find quickly were getIndexes(), getConstraints(),
getTriggers()

To see where it's worth putting in time it'd be useful if getSchemaData() in
verbose mode printed timing information...

> The *real* fix, I suppose, would be to fix all those catalog-inspection
> functions so that they operate with respect to the query's snapshot.
> But that's not a job I'm volunteering for.  Besides which, pg_dump
> still has to cope with back-rev servers where it wouldn't be safe.

Yea, that's not a small change :(. I suspect that we'd need a bunch of new
caching infrastructure to make that reasonably performant, since this
presumably couldn't use syscache etc.

Greetings,

Andres Freund

Attachment

pg_dump-bulk-gettableattrs.diff

pgsql-hackers by date:

From: Masahiko Sawada
Date: 22 October 2021, 05:38:02
Subject: Re: Parallel vacuum workers prevent the oldest xmin from advancing

From: Alexander Pyhalov
Date: 22 October 2021, 06:26:50
Subject: Re: Partial aggregates pushdown

Re: Experimenting with hash tables inside pg_dump - Mailing list pgsql-hackers

Attachment

Previous

Next