Re: estimating # of distinct values - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: estimating # of distinct values
Date:
Msg-id: AANLkTik7C9RD64qwv2UV3NWgJ1NwSKS9vJyWw9dbxQTQ@mail.gmail.com
In response to: Re: estimating # of distinct values  (tv@fuzzy.cz)
Responses: Re: estimating # of distinct values  (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
On Tue, Jan 18, 2011 at 4:53 AM,  <tv@fuzzy.cz> wrote:
> So the most important question is how to intercept the new/updated rows,
> and where to store them. I think each backend should maintain its own
> private list of new records and forward them only in case of commit. Does
> that sound reasonable?
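
(A rough user-level sketch of the "keep a private list and forward it only at
commit" idea Tomas describes, for readers who want a concrete picture.  All
object names are invented for illustration, and the actual proposal would of
course live in backend C code rather than in triggers; a deferred constraint
trigger just happens to give similar commit-time hand-off semantics.)

    -- Stand-in for whatever the stats machinery would eventually receive.
    CREATE TABLE pending_ndistinct_samples (col_value text);

    -- Example table being tracked; both tables are placeholders.
    CREATE TABLE big_table (id int PRIMARY KEY, some_column text);

    CREATE FUNCTION forward_new_value() RETURNS trigger AS $$
    BEGIN
        INSERT INTO pending_ndistinct_samples VALUES (NEW.some_column::text);
        RETURN NULL;   -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    -- Deferred constraint trigger: events are queued per backend, fired only
    -- at COMMIT, and discarded on ROLLBACK, i.e. "forward only on commit".
    CREATE CONSTRAINT TRIGGER forward_on_commit
        AFTER INSERT OR UPDATE ON big_table
        DEFERRABLE INITIALLY DEFERRED
        FOR EACH ROW EXECUTE PROCEDURE forward_new_value();

Even this analogue runs into the cost discussed below: the per-row trigger
event queue has to be kept around until commit, so a transaction touching
millions of rows pays for it in backend memory.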

At the risk of sounding demoralizing, nothing about this proposal
sounds very promising to me, and that sounds like a particularly bad
idea.  What happens if the transaction updates a billion records?  Or
even a million records?  Are you going to store all of those in
backend-local memory until commit time?  Or spool them in a
potentially-gigantic disk file somewhere?  That memory allocation - or
file - could grow to be larger than the size of the entire database in
the worst case.  And COMMIT could take an awfully long time if it has
to spool megabytes or gigabytes of data off to some other process.
And what happens if there's a crash after the COMMIT but before all
that data is sent?  The estimates become permanently wrong?

And are we doing all of this just to get a more accurate estimate of
ndistinct?  For the amount of effort that it will probably take to get
this working at all, you could probably implement index-only scans and
have enough bandwidth left over to tackle global temporary tables.
And unless I'm missing something, the chances that the performance
consequences will be tolerable are pretty much zero.  And it would
only benefit the tiny fraction of users for whom bad n_distinct
estimates cause bad plans, and then the even tinier percentage of
those who can't conveniently fix it by using the manual override that
we already have in place - which presumably means people who have
gigantic tables that are regularly rewritten with massive changes in
the data distribution that affect plan choice.  Is that more than the
empty set?
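
(For reference, the existing override is the per-column n_distinct storage
parameter, available since 9.0; table and column names here are placeholders.)

    -- Pin the estimate to an absolute number of distinct values...
    ALTER TABLE big_table ALTER COLUMN some_column SET (n_distinct = 50000);
    -- ...or to a fraction of the row count (-1 would mean "all distinct").
    ALTER TABLE big_table ALTER COLUMN some_column SET (n_distinct = -0.05);
    -- The override takes effect at the next ANALYZE.
    ANALYZE big_table;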

Maybe the idea here is that this wouldn't fix just ndistinct estimates
but would also help with multi-column statistics.  Even if that's the
case, I think it's almost certainly a dead end from a performance
standpoint.  Some kind of manual ANALYZE process that can be invoked
when needed to scan the entire table would be painful but maybe
acceptable for certain people with a problem they can't fix any other
way, but doing this automatically for everyone seems like a really bad
idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

