Re: Extracting only the columns needed for a query - Mailing list pgsql-hackers

From Dmitry Dolgov
Subject Re: Extracting only the columns needed for a query
Date
Msg-id 20191217105950.slmha6jrqy5j3qo7@localhost
Whole thread Raw
In response to Re: Extracting only the columns needed for a query  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Extracting only the columns needed for a query
Re: Extracting only the columns needed for a query
List pgsql-hackers
> On Tue, Jul 16, 2019 at 06:49:10PM -0700, Melanie Plageman wrote:
>
> We implemented Approach B in the attached patch set (patch 0001) and
> then implemented Approach A (patch 0002) to sanity check the pruned
> list of columns to scan we were getting at plan-time.
> We emit a notice in SeqNext() if the two column sets differ.
> Currently, for all of the queries in the regression suite, no
> differences are found.
> We did notice that none of the current regression tests for triggers
> are showing a difference between columns that can be extracted at plan
> time and those that can be extracted at execution time--though we
> haven't dug into this at all.

Thanks for the patch! If I understand correctly from this thread,
approach B is more preferable, so I've concentrated on the patch 0001
and have a few commentaries/questions:

* The idea is to collect columns that have being used for selects/updates
  (where it makes sense for columnar storage to avoid extra work), do I see it
  right? If it's the case, then scanCols could be a bit misleading, since it
  gives an impression that it's only about reads.

* After a quick experiment, it seems that extract_used_columns is invoked for
  updates, but collects too many colums, e.g:

    create table test (id int primary key, a text, b text, c text);
    update test set a = 'something' where id = 1;

  collects into scanCols all columns (varattno from 1 to 4) + again the first
  column from baserestrictinfo. Is it correct?

* Not sure if it supposed to be covered by this functionality, but if we do

    insert ... on conflict (uniq_col) do update set other_col = 'something'

  and actually have to perform an update, extract_used_columns is not called.

* Probably it also makes sense to check IS_DUMMY_REL in extract_used_columns?

> > On Sat, Jun 15, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Another reason for having the planner do this is that presumably, in
> > an AM that's excited about this, the set of fetched columns should
> > play into the cost estimates for the scan.  I've not been paying
> > enough attention to the tableam work to know if we've got hooks for
> > the AM to affect scan costing ... but if we don't, that seems like
> > a hole that needs plugged.
>
> AM callback relation_estimate_size exists currently which planner leverages.
> Via this callback it fetches tuples, pages, etc.. So, our thought is to extend
> this API if possible to pass down needed column and help perform better costing
> for the query. Though we think if wish to leverage this function, need to know
> list of columns before planning hence might need to use query tree.

I believe it would be beneficial to add this potential API extension patch into
the thread (as an example of an interface defining how scanCols could be used)
and review them together.



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: automating pg_config.h.win32 maintenance
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: error context for vacuum to include block number