Re: estimating # of distinct values - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: estimating # of distinct values
Date
Msg-id 4D34E0AB.9080901@fuzzy.cz
Whole thread Raw
In response to Re: estimating # of distinct values  (Jim Nasby <jim@nasby.net>)
Responses Re: estimating # of distinct values  (Jim Nasby <jim@nasby.net>)
List pgsql-hackers
Dne 9.1.2011 13:58, Jim Nasby napsal(a):
>> A resource fork? Not sure what you mean, could you describe it in more
>> detail?
> 
> Ooops, resource forks are a filesystem thing; we call them relation forks. >From src/backend/storage/smgr/README:

OK, I think using relation forks seems like a good solution. I've done
some basic research and I think these are the basic steps when adding a
new fork:

1) define a new item in the ForkNum enum (relfilenode.h) - this should  be somethink like DISTINCT_FORK I guess

2) modify the ForkNames (catalog.c) and the relpathbackend so that the  proper filename is assigned to the fork

And then it will be accessed through smgr (smgrcreate etc.). Am I right
or is there something else I need to do?

There are a few open questions though:

1) Forks are 'per relation' but the distinct estimators are 'per  column' (or 'per group of columns') so I'm not sure
whetherthe file  should contain all the estimators for the table, or if there should  be one fork for each estimator.
Theformer is a bit difficult to  manage, the latter somehow breaks the current fork naming convention.
 

2) Where to keep the info that there is an estimator for a column? I  guess we could put this into pg_attribute (it's
oneboolean). But  what about the estimators for groups of columns? Because that's why  I'm building this - to get
distinctestimates for groups of columns.
 
  I guess we'll need a new system catalog to track this? (The same  will be true for multi-dimensional histograms
anyway).

3) I still am not sure how to manage the updates, i.e. how to track the  new values.
  One possibility might be to do that synchronously - whenever a new  item is inserted into the table, check if there's
anestimator and  update it. Updating the estimators is quite efficient (e.g. the  bitmap manages to do 10.000.000
insertsin 9 seconds on my ancient  workstation) although there might be issues with locking etc.
 
  The other possibility is to update the estimator asynchronously, i.e.  store the new values somewhere (or just ctid
ofthe row), and then  process it periodically. I'm not sure how to intercept the new rows  and where to store them. In
anotherfork? Somewhere else?
 

regards
Tomas


pgsql-hackers by date:

Previous
From: Greg Smith
Date:
Subject: Re: Spread checkpoint sync
Next
From: Jan Urbański
Date:
Subject: Re: REVIEW: PL/Python validator function