Thread: How to know referenced sub-fields of a composite type?

How to know referenced sub-fields of a composite type?

From
Kohei KaiGai
Date:
Hello,

A recent revision of PG-Strom has its columnar-storage using Apache
Arrow format files on
FDW infrastructure. Because of the columnar nature, it allows to load
the values which are
referenced by the query, thus, maximizes efficiency of the storage bandwidth.
http://heterodb.github.io/pg-strom/arrow_fdw/

Apache Arrow defines various primitive types that can be mapped on
PostgreSQL data types.
For example, FloatingPoint (precision=Single) on Arrow is equivalent
to float4 of PostgreSQL.
One interesting data type in Apache Arrow is "Struct" data type. It is
equivalent to composite
type in PostgreSQL. The "Struct" type has sub-fields, and individual
sub-fields have its own
values array for each.

It means we can skip to load the sub-fields unreferenced, if
query-planner can handle
referenced and unreferenced sub-fields correctly.
On the other hands, it looks to me RelOptInfo or other optimizer
related structure don't have
this kind of information. RelOptInfo->attr_needed tells extension
which attributes are referenced
by other relation, however, its granularity is not sufficient for sub-fields.

Probably, all we can do right now is walk-on the RelOptInfo list to
lookup FieldSelect node
to see the referenced sub-fields. Do we have a good idea instead of
this expensive way?
# Right now, PG-Strom loads all the sub-fields of Struct column from
arrow_fdw foreign-table
# regardless of referenced / unreferenced sub-fields. Just a second best.

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>



Re: How to know referenced sub-fields of a composite type?

From
Amit Langote
Date:
Kaigai-san,

On 2019/05/29 12:13, Kohei KaiGai wrote:
> One interesting data type in Apache Arrow is "Struct" data type. It is
> equivalent to composite
> type in PostgreSQL. The "Struct" type has sub-fields, and individual
> sub-fields have its own
> values array for each.
> 
> It means we can skip to load the sub-fields unreferenced, if
> query-planner can handle
> referenced and unreferenced sub-fields correctly.
> On the other hands, it looks to me RelOptInfo or other optimizer
> related structure don't have
> this kind of information. RelOptInfo->attr_needed tells extension
> which attributes are referenced
> by other relation, however, its granularity is not sufficient for sub-fields.

Isn't that true for some other cases as well, like when a query accesses
only some sub-fields of a json(b) column?  In that case too, planner
itself can't optimize away access to other sub-fields.  What it can do
though is match a suitable index to the operator used to access the
individual sub-fields, so that the index (if one is matched and chosen)
can optimize away accessing unnecessary sub-fields.  IOW, it seems to me
that the optimizer leaves it up to the indexes (and plan nodes) to further
optimize access to within a field.  How is this case any different?

> Probably, all we can do right now is walk-on the RelOptInfo list to
> lookup FieldSelect node
> to see the referenced sub-fields. Do we have a good idea instead of
> this expensive way?
> # Right now, PG-Strom loads all the sub-fields of Struct column from
> arrow_fdw foreign-table
> # regardless of referenced / unreferenced sub-fields. Just a second best.

I'm missing something, but if PG-Strom/arrow_fdw does look at the
FieldSelect nodes to see which sub-fields are referenced, why doesn't it
generate a plan that will only access those sub-fields or why can't it?

Thanks,
Amit




Re: How to know referenced sub-fields of a composite type?

From
Kohei KaiGai
Date:
Hi Amit,

2019年5月29日(水) 13:26 Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>:
>
> Kaigai-san,
>
> On 2019/05/29 12:13, Kohei KaiGai wrote:
> > One interesting data type in Apache Arrow is "Struct" data type. It is
> > equivalent to composite
> > type in PostgreSQL. The "Struct" type has sub-fields, and individual
> > sub-fields have its own
> > values array for each.
> >
> > It means we can skip to load the sub-fields unreferenced, if
> > query-planner can handle
> > referenced and unreferenced sub-fields correctly.
> > On the other hands, it looks to me RelOptInfo or other optimizer
> > related structure don't have
> > this kind of information. RelOptInfo->attr_needed tells extension
> > which attributes are referenced
> > by other relation, however, its granularity is not sufficient for sub-fields.
>
> Isn't that true for some other cases as well, like when a query accesses
> only some sub-fields of a json(b) column?  In that case too, planner
> itself can't optimize away access to other sub-fields.  What it can do
> though is match a suitable index to the operator used to access the
> individual sub-fields, so that the index (if one is matched and chosen)
> can optimize away accessing unnecessary sub-fields.  IOW, it seems to me
> that the optimizer leaves it up to the indexes (and plan nodes) to further
> optimize access to within a field.  How is this case any different?
>
I think it is a little bit different scenario.
Even if an index on sub-fields can indicate the tuples to be fetched,
the fetched tuple contains all the sub-fields because heaptuple is
row-oriented data.
For example, if WHERE-clause checks a sub-field: "x" then aggregate
function references other sub-field "y", Scan/Join node has to return
a tuple that contains both "x" and "y". IndexScan also pops up a tuple
with a full composite type, so here is no problem if we cannot know
which sub-fields are referenced in the later stage.
Maybe, if IndexOnlyScan supports to return a partial composite type,
it needs similar infrastructure that can be used for a better composite
type support on columnar storage.

> > Probably, all we can do right now is walk-on the RelOptInfo list to
> > lookup FieldSelect node
> > to see the referenced sub-fields. Do we have a good idea instead of
> > this expensive way?
> > # Right now, PG-Strom loads all the sub-fields of Struct column from
> > arrow_fdw foreign-table
> > # regardless of referenced / unreferenced sub-fields. Just a second best.
>
> I'm missing something, but if PG-Strom/arrow_fdw does look at the
> FieldSelect nodes to see which sub-fields are referenced, why doesn't it
> generate a plan that will only access those sub-fields or why can't it?
>
Likely, it is not a technical problem but not a smart implementation.
If I missed some existing infrastructure we can apply, it may be more suitable
than query/expression tree walking.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>



Re: How to know referenced sub-fields of a composite type?

From
Haribabu Kommi
Date:

On Wed, May 29, 2019 at 4:51 PM Kohei KaiGai <kaigai@heterodb.com> wrote:
Hi Amit,

2019年5月29日(水) 13:26 Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>:
>
> Kaigai-san,
>
> On 2019/05/29 12:13, Kohei KaiGai wrote:
> > One interesting data type in Apache Arrow is "Struct" data type. It is
> > equivalent to composite
> > type in PostgreSQL. The "Struct" type has sub-fields, and individual
> > sub-fields have its own
> > values array for each.
> >
> > It means we can skip to load the sub-fields unreferenced, if
> > query-planner can handle
> > referenced and unreferenced sub-fields correctly.
> > On the other hands, it looks to me RelOptInfo or other optimizer
> > related structure don't have
> > this kind of information. RelOptInfo->attr_needed tells extension
> > which attributes are referenced
> > by other relation, however, its granularity is not sufficient for sub-fields.
>
> Isn't that true for some other cases as well, like when a query accesses
> only some sub-fields of a json(b) column?  In that case too, planner
> itself can't optimize away access to other sub-fields.  What it can do
> though is match a suitable index to the operator used to access the
> individual sub-fields, so that the index (if one is matched and chosen)
> can optimize away accessing unnecessary sub-fields.  IOW, it seems to me
> that the optimizer leaves it up to the indexes (and plan nodes) to further
> optimize access to within a field.  How is this case any different?
>
I think it is a little bit different scenario.
Even if an index on sub-fields can indicate the tuples to be fetched,
the fetched tuple contains all the sub-fields because heaptuple is
row-oriented data.
For example, if WHERE-clause checks a sub-field: "x" then aggregate
function references other sub-field "y", Scan/Join node has to return
a tuple that contains both "x" and "y". IndexScan also pops up a tuple
with a full composite type, so here is no problem if we cannot know
which sub-fields are referenced in the later stage.
Maybe, if IndexOnlyScan supports to return a partial composite type,
it needs similar infrastructure that can be used for a better composite
type support on columnar storage.

There is another issue related to the columnar store that needs targeted
columns for projection from the scan is discussed in zedstore [1]. 
Projecting all columns from a columnar store is quite expensive than
the row store. 

 
Regards,
Haribabu Kommi
Fujitsu Australia

Re: How to know referenced sub-fields of a composite type?

From
Amit Langote
Date:
On 2019/05/29 15:50, Kohei KaiGai wrote:
> 2019年5月29日(水) 13:26 Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>:
>>> It means we can skip to load the sub-fields unreferenced, if
>>> query-planner can handle
>>> referenced and unreferenced sub-fields correctly.
>>> On the other hands, it looks to me RelOptInfo or other optimizer
>>> related structure don't have
>>> this kind of information. RelOptInfo->attr_needed tells extension
>>> which attributes are referenced
>>> by other relation, however, its granularity is not sufficient for sub-fields.
>>
>> Isn't that true for some other cases as well, like when a query accesses
>> only some sub-fields of a json(b) column?  In that case too, planner
>> itself can't optimize away access to other sub-fields.  What it can do
>> though is match a suitable index to the operator used to access the
>> individual sub-fields, so that the index (if one is matched and chosen)
>> can optimize away accessing unnecessary sub-fields.  IOW, it seems to me
>> that the optimizer leaves it up to the indexes (and plan nodes) to further
>> optimize access to within a field.  How is this case any different?
>
> I think it is a little bit different scenario.
> Even if an index on sub-fields can indicate the tuples to be fetched,
> the fetched tuple contains all the sub-fields because heaptuple is
> row-oriented data.
>
> For example, if WHERE-clause checks a sub-field: "x" then aggregate
> function references other sub-field "y", Scan/Join node has to return
> a tuple that contains both "x" and "y". IndexScan also pops up a tuple
> with a full composite type, so here is no problem if we cannot know
> which sub-fields are referenced in the later stage.
> Maybe, if IndexOnlyScan supports to return a partial composite type,
> it needs similar infrastructure that can be used for a better composite
> type support on columnar storage.

Ah indeed.  I think I had misunderstood your intent.  Indexes have to do
with optimizing the "filtering" of complex/nested type (json, Arrow
Struct, etc.) values, where unnecessary sub-fields need not be read before
filtering, whereas you're interested in optimizing "projections" of
complex types, where sub-fields that are not used anywhere in the query
need not be read from the stored values.

>>> Probably, all we can do right now is walk-on the RelOptInfo list to
>>> lookup FieldSelect node
>>> to see the referenced sub-fields. Do we have a good idea instead of
>>> this expensive way?
>>> # Right now, PG-Strom loads all the sub-fields of Struct column from
>>> arrow_fdw foreign-table
>>> # regardless of referenced / unreferenced sub-fields. Just a second best.
>>
>> I'm missing something, but if PG-Strom/arrow_fdw does look at the
>> FieldSelect nodes to see which sub-fields are referenced, why doesn't it
>> generate a plan that will only access those sub-fields or why can't it?
>>
> Likely, it is not a technical problem but not a smart implementation.
> If I missed some existing infrastructure we can apply, it may be more suitable
> than query/expression tree walking.

There is no infrastructure for this as far as I know.  Maybe, some will be
built in the future now that storage format is pluggable.

Thanks,
Amit




Re: How to know referenced sub-fields of a composite type?

From
Kohei KaiGai
Date:
2019/05/30 16:33、Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>のメール:

>> On 2019/05/29 15:50, Kohei KaiGai wrote:
>> 2019年5月29日(水) 13:26 Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>:
>>>> It means we can skip to load the sub-fields unreferenced, if
>>>> query-planner can handle
>>>> referenced and unreferenced sub-fields correctly.
>>>> On the other hands, it looks to me RelOptInfo or other optimizer
>>>> related structure don't have
>>>> this kind of information. RelOptInfo->attr_needed tells extension
>>>> which attributes are referenced
>>>> by other relation, however, its granularity is not sufficient for sub-fields.
>>>
>>> Isn't that true for some other cases as well, like when a query accesses
>>> only some sub-fields of a json(b) column?  In that case too, planner
>>> itself can't optimize away access to other sub-fields.  What it can do
>>> though is match a suitable index to the operator used to access the
>>> individual sub-fields, so that the index (if one is matched and chosen)
>>> can optimize away accessing unnecessary sub-fields.  IOW, it seems to me
>>> that the optimizer leaves it up to the indexes (and plan nodes) to further
>>> optimize access to within a field.  How is this case any different?
>>
>> I think it is a little bit different scenario.
>> Even if an index on sub-fields can indicate the tuples to be fetched,
>> the fetched tuple contains all the sub-fields because heaptuple is
>> row-oriented data.
>>
>> For example, if WHERE-clause checks a sub-field: "x" then aggregate
>> function references other sub-field "y", Scan/Join node has to return
>> a tuple that contains both "x" and "y". IndexScan also pops up a tuple
>> with a full composite type, so here is no problem if we cannot know
>> which sub-fields are referenced in the later stage.
>> Maybe, if IndexOnlyScan supports to return a partial composite type,
>> it needs similar infrastructure that can be used for a better composite
>> type support on columnar storage.
>
> Ah indeed.  I think I had misunderstood your intent.  Indexes have to do
> with optimizing the "filtering" of complex/nested type (json, Arrow
> Struct, etc.) values, where unnecessary sub-fields need not be read before
> filtering, whereas you're interested in optimizing "projections" of
> complex types, where sub-fields that are not used anywhere in the query
> need not be read from the stored values.
>
>>>> Probably, all we can do right now is walk-on the RelOptInfo list to
>>>> lookup FieldSelect node
>>>> to see the referenced sub-fields. Do we have a good idea instead of
>>>> this expensive way?
>>>> # Right now, PG-Strom loads all the sub-fields of Struct column from
>>>> arrow_fdw foreign-table
>>>> # regardless of referenced / unreferenced sub-fields. Just a second best.
>>>
>>> I'm missing something, but if PG-Strom/arrow_fdw does look at the
>>> FieldSelect nodes to see which sub-fields are referenced, why doesn't it
>>> generate a plan that will only access those sub-fields or why can't it?
>>>
>> Likely, it is not a technical problem but not a smart implementation.
>> If I missed some existing infrastructure we can apply, it may be more suitable
>> than query/expression tree walking.
>
> There is no infrastructure for this as far as I know.  Maybe, some will be
> built in the future now that storage format is pluggable.

If we design a common infrastructure for both of built-in and extension features, it makes sense for the kinds of
storagesystem. 
IndexOnlyScan is one of the built-in feature that is beneficial by the information of projection. Currently, we always
don’tchoose IndexOnlyScan if index is on sub-field of composite. 

 Best regards,