tuplestore API problem

From: Tom Lane
By chance I discovered that this query in the regression tests

SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four FROM tenk1 LIMIT 2;

stops working if work_mem is small enough: it either dumps core or
delivers wrong answers depending on platform.

After some tracing I found out the reason.  ExecWindowAgg() does this:
    if (!tuplestore_gettupleslot(winstate->buffer, true,
                                 winstate->ss.ss_ScanTupleSlot))
        elog(ERROR, "unexpected end of tuplestore");

and then goes off and calls the window functions (ntile() here), and
expects the ScanTupleSlot to still be valid afterwards.  However,
ntile() forces us to read to the end of the input to find out the number
of rows.  If work_mem is small enough, that means the tuplestore is
forced into dump-to-disk mode, which means it releases all its in-memory
tuples.  And guess what: the ScanTupleSlot is pointing at one of those,
it doesn't have its own copy of the tuple.  So we wind up trying to read
from a trashed bit of memory.

A brute-force solution is to change tuplestore_gettupleslot() so that it
always copies the tuple, but this would be wasted cycles for most uses
of tuplestores.  I'm thinking of changing tuplestore_gettupleslot's API
to add a bool parameter specifying whether the caller wants to force
a copy.

Comments, better ideas?

BTW: this tells me that no one has tried to apply window functions
to nontrivial problems yet ... we'll need to encourage beta testers
to stress that code.
        regards, tom lane


Re: tuplestore API problem

From: Simon Riggs
On Thu, 2009-03-26 at 12:57 -0400, Tom Lane wrote:
> If work_mem is small enough, that means the tuplestore is
> forced into dump-to-disk mode, which means it releases all its
> in-memory tuples.  And guess what: the ScanTupleSlot is pointing at
> one of those, it doesn't have its own copy of the tuple.  So we wind
> up trying to read from a trashed bit of memory.
> 
> A brute-force solution is to change tuplestore_gettupleslot() so that
> it always copies the tuple, but this would be wasted cycles for most
> uses of tuplestores.  I'm thinking of changing
> tuplestore_gettupleslot's API
> to add a bool parameter specifying whether the caller wants to force
> a copy.
> 
> Comments, better ideas?

Sounds very similar to the solution that you just removed from the hash
join code for performance reasons. Flushing memory when we overflow
sounds like an artifact from the time when tuplestore split from
tuplesort. Can't we keep the appropriate rows in memory and scroll
through them?

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: tuplestore API problem

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> Sounds very similar to the solution that you just removed from the hash
> join code for performance reasons. Flushing memory when we overflow
> sounds like an artifact from the time when tuplestore split from
> tuplesort. Can't we keep the appropriate rows in memory and scroll
> through them?

We're not doing a fundamental redesign of the window function support
right now.
        regards, tom lane


Re: tuplestore API problem

From: Hitoshi Harada
2009/3/27 Tom Lane <tgl@sss.pgh.pa.us>:
> By chance I discovered that this query in the regression tests
>
> SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four FROM tenk1 LIMIT 2;
>
> stops working if work_mem is small enough: it either dumps core or
> delivers wrong answers depending on platform.
>
> After some tracing I found out the reason.  ExecWindowAgg() does this:
>
>    if (!tuplestore_gettupleslot(winstate->buffer, true,
>                                 winstate->ss.ss_ScanTupleSlot))
>        elog(ERROR, "unexpected end of tuplestore");
>
> and then goes off and calls the window functions (ntile() here), and
> expects the ScanTupleSlot to still be valid afterwards.  However,
> ntile() forces us to read to the end of the input to find out the number
> of rows.  If work_mem is small enough, that means the tuplestore is
> forced into dump-to-disk mode, which means it releases all its in-memory
> tuples.  And guess what: the ScanTupleSlot is pointing at one of those,
> it doesn't have its own copy of the tuple.  So we wind up trying to read
> from a trashed bit of memory.
>
> A brute-force solution is to change tuplestore_gettupleslot() so that it
> always copies the tuple, but this would be wasted cycles for most uses
> of tuplestores.  I'm thinking of changing tuplestore_gettupleslot's API
> to add a bool parameter specifying whether the caller wants to force
> a copy.
>
> Comments, better ideas?

Is this a tuplestore API problem? ISTM it is a window function problem. I think my early code held a HeapTuple instead of a TupleTableSlot for the current row. At a glance, the issue arises only for the window function's current row, which is fetched and then used later, in some cases after subsequent rows have been stored. So a brute-force solution might be for ExecWindowAgg() to copy the current row out of the tuplestore instead of pointing directly into the tuplestore's memory, without changing the tuplestore API.

Regards,


--
Hitoshi Harada


Re: tuplestore API problem

From: Hitoshi Harada
2009/3/27 Hitoshi Harada <umi.tanuki@gmail.com>:
> 2009/3/27 Tom Lane <tgl@sss.pgh.pa.us>:
>> By chance I discovered that this query in the regression tests
>>
>> SELECT ntile(NULL) OVER (ORDER BY ten, four), ten, four FROM tenk1 LIMIT 2;
>>
>> stops working if work_mem is small enough: it either dumps core or
>> delivers wrong answers depending on platform.
>>
>> After some tracing I found out the reason.  ExecWindowAgg() does this:
>>
>>    if (!tuplestore_gettupleslot(winstate->buffer, true,
>>                                 winstate->ss.ss_ScanTupleSlot))
>>        elog(ERROR, "unexpected end of tuplestore");
>>
>> and then goes off and calls the window functions (ntile() here), and
>> expects the ScanTupleSlot to still be valid afterwards.  However,
>> ntile() forces us to read to the end of the input to find out the number
>> of rows.  If work_mem is small enough, that means the tuplestore is
>> forced into dump-to-disk mode, which means it releases all its in-memory
>> tuples.  And guess what: the ScanTupleSlot is pointing at one of those,
>> it doesn't have its own copy of the tuple.  So we wind up trying to read
>> from a trashed bit of memory.
>>
>> A brute-force solution is to change tuplestore_gettupleslot() so that it
>> always copies the tuple, but this would be wasted cycles for most uses
>> of tuplestores.  I'm thinking of changing tuplestore_gettupleslot's API
>> to add a bool parameter specifying whether the caller wants to force
>> a copy.
>>
>> Comments, better ideas?
>
> Is this tuplestore API problem? ISTM this is window function's
> problem. I think my early code was holding heaptuple instead of
> tupleslot for the current row. At a glance, the issue appears in only
> current row in window function, which fetches row and uses it later
> after storing following rows in some cases. So a brute-force solution
> might be that ExecWindowAgg() copies the current row from tuplestore
> instead of pointing directly to inside tuplestore memory, not changing
> tuplestore API.
Here's the patch. I hope there are no more bugs with the same cause. It seems we would need to implement something like a garbage collector in tuplestore, marking and tracing each row reference, if a complete solution were required.

Regards,

--
Hitoshi Harada


Re: tuplestore API problem

From: Tom Lane
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> 2009/3/27 Hitoshi Harada <umi.tanuki@gmail.com>:
>> 2009/3/27 Tom Lane <tgl@sss.pgh.pa.us>:
>>> A brute-force solution is to change tuplestore_gettupleslot() so that it
>>> always copies the tuple, but this would be wasted cycles for most uses
>>> of tuplestores.  I'm thinking of changing tuplestore_gettupleslot's API
>>> to add a bool parameter specifying whether the caller wants to force
>>> a copy.

> Here's the patch. Hope there are no more on the same reason. It seems
> that we'd need to implement something like garbage collector in
> tuplestore, marking and tracing each row references, if the complete
> solution is required.

I don't like this; I'm planning to go with the aforementioned API
change instead.  The way you have it guarantees an extra copy cycle
even when tuplestore is already making a copy internally; and it doesn't
help if we find similar problems elsewhere.  (While I'm making the
API change I'll take a close look at each call site to see if it has
any similar risk.)
        regards, tom lane


Re: tuplestore API problem

From: Hitoshi Harada
2009/3/28 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> 2009/3/27 Hitoshi Harada <umi.tanuki@gmail.com>:
>>> 2009/3/27 Tom Lane <tgl@sss.pgh.pa.us>:
>>>> A brute-force solution is to change tuplestore_gettupleslot() so that it
>>>> always copies the tuple, but this would be wasted cycles for most uses
>>>> of tuplestores.  I'm thinking of changing tuplestore_gettupleslot's API
>>>> to add a bool parameter specifying whether the caller wants to force
>>>> a copy.
>
>> Here's the patch. Hope there are no more on the same reason. It seems
>> that we'd need to implement something like garbage collector in
>> tuplestore, marking and tracing each row references, if the complete
>> solution is required.
>
> I don't like this; I'm planning to go with the aforementioned API
> change instead.  The way you have it guarantees an extra copy cycle
> even when tuplestore is already making a copy internally; and it doesn't
> help if we find similar problems elsewhere.  (While I'm making the
> API change I'll take a close look at each call site to see if it has
> any similar risk.)

You're right. It kills performance even after dumptuples(). Thinking about it more, I found the cause lies only around dumptuples(). If you can trace the TupleTableSlots that point to memtuples inside the tuplestore, you can materialize them just before WRITETUP() in dumptuples(). So I tried passing EState.es_tupleTable to tuplestore_begin_heap() to trace those TupleTableSlots. Note that if you pass NULL the behavior is as before, so nothing's broken. Regression passes, but I'm not quite sure EState.es_tupleTable is the only place that holds TupleTableSlots passed to the tuplestore...

I know you propose adding a should_copy boolean parameter to tuplestore_gettupleslot(). But that always adds overhead, even when the tuplestore *doesn't* dump its tuples; in that case there is no need to copy them, I think.

Regards,



--
Hitoshi Harada


Re: tuplestore API problem

From: Tom Lane
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> So I tried pass EState.es_tupleTables to tuplestore_begin_heap() to
> trace those TupleTableSlots. Note that if you pass NULL the behavior
> is as before so nothing's broken. Regression passes but I'm not quite
> sure EState.es_tupleTable is only place that holds TupleTableSlots
> passed to tuplestore...

I've got zero confidence in that, and it seems pretty horrid from a
system structural point of view even if it worked: neither tuplestore.c
nor its direct callers have any business knowing where all the slots
are.  What might be a bit saner is to remember the slot last passed to
tuplestore_gettupleslot for each read pointer.  The implication would be
that we'd be assuming only one slot is used to fetch from any one read
pointer, but that is probably a reasonable restriction.

> I know you propose should_copy boolean parameter would be added to
> tuplestore_gettupleslot().

I already did it, and concluded that not only were all the
tuplestore_gettupleslot calls in nodeWindowAgg broken, but so was
nodeCtescan.  I'm open to reverting that patch if you have a better
solution though.
        regards, tom lane


Re: tuplestore API problem

From: Hitoshi Harada
2009/3/29 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> So I tried pass EState.es_tupleTables to tuplestore_begin_heap() to
>> trace those TupleTableSlots. Note that if you pass NULL the behavior
>> is as before so nothing's broken. Regression passes but I'm not quite
>> sure EState.es_tupleTable is only place that holds TupleTableSlots
>> passed to tuplestore...
>
> I've got zero confidence in that, and it seems pretty horrid from a
> system structural point of view even if it worked: neither tuplestore.c
> nor its direct callers have any business knowing where all the slots
> are.  What might be a bit saner is to remember the slot last passed to
> tuplestore_gettupleslot for each read pointer.  The implication would be
> that we'd be assuming only one slot is used to fetch from any one read
> pointer, but that is probably a reasonable restriction.

Hm, this choice is better than mine. But if we take it, I suppose we need to consider a way to lift the restriction, for the case where we are forced to use many TupleTableSlots with one read pointer. Without that, I don't think this approach is a good idea.

>> I know you propose should_copy boolean parameter would be added to
>> tuplestore_gettupleslot().
>
> I already did it, and concluded that not only were all the
> tuplestore_gettupleslot calls in nodeWindowAgg broken, but so was
> nodeCtescan.  I'm open to reverting that patch if you have a better
> solution though.

For now, yours is better than everything else. As we're not staying here forever, let's go with the copying approach. I'll keep this in mind and revisit it when I work on improving tuplestore performance, together with the later window function patch.


Regards,

--
Hitoshi Harada


Re: tuplestore API problem

From: Tom Lane
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> 2009/3/29 Tom Lane <tgl@sss.pgh.pa.us>:
>> ... What might be a bit saner is to remember the slot last passed to
>> tuplestore_gettupleslot for each read pointer.  The implication would be
>> that we'd be assuming only one slot is used to fetch from any one read
>> pointer, but that is probably a reasonable restriction.

> Hm, this choice is better than mine. But if we take this, I suppose we
> need to consider the way to break the restriction, for the case we
> will be forced to use many TupleTableSlots on one read pointer.

I don't think we'd ever be "forced" to do that; and it would be easy to
add an Assert to tuplestore_gettupleslot to check that it gets the same
slot on each call.  Someone who needed to save previous tuples would be
advised to copy older tuples to some other slot after fetching them.
        regards, tom lane


Re: tuplestore API problem

From: Hitoshi Harada
2009/3/30 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> 2009/3/29 Tom Lane <tgl@sss.pgh.pa.us>:
>>> ... What might be a bit saner is to remember the slot last passed to
>>> tuplestore_gettupleslot for each read pointer.  The implication would be
>>> that we'd be assuming only one slot is used to fetch from any one read
>>> pointer, but that is probably a reasonable restriction.
>
>> Hm, this choice is better than mine. But if we take this, I suppose we
>> need to consider the way to break the restriction, for the case we
>> will be forced to use many TupleTableSlots on one read pointer.
>
> I don't think we'd ever be "forced" to do that; and it would be easy to
> add an Assert to tuplestore_gettupleslot to check that it gets the same
> slot on each call.  Someone who needed to save previous tuples would be
> advised to copy older tuples to some other slot after fetching them.

I don't think advising or documenting such a restriction for future programmers is a good idea from the standpoint of encapsulation. I've come up with an idea: each read pointer remembers its last slot, as you suggest, but when a different slot comes in through tuplestore_gettupleslot(), it materializes the former one and then forgets it. With this you don't need the restriction above, and the minimal added penalty is paid if and only if you pass more than one slot for the same read pointer, which doesn't seem to happen currently.

Regards,



--
Hitoshi Harada


Re: tuplestore API problem

From: Tom Lane
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> I don't think advising or documenting such restriction to the future
> programmer is a good idea from the point of encapsulation. I've come
> up with an idea, that the read pointers remember their last slot as
> you suggest and materialize it when another slot comes in
> tuplestore_gettupleslot() and forget the former one. By this, you
> don't need the restriction above, adding minimum penalty that is paid
> if and only if you pass more than one tupleslot to tuplestore, which
> doesn't seem to be occurred currently.

I think that the problem I found a couple days ago
http://archives.postgresql.org/pgsql-hackers/2009-03/msg01247.php
probably blows a hole in all these schemes.  After-the-fact
materialization of a TupleTableSlot won't protect any pass-by-reference
Datums that have already been fetched from that slot.  Perhaps we
could invent a coding rule that would prevent the situation, but
I think it would be awfully easy to mess up in any case where you
actually had a need to keep track of more than one current tuple.

I now think that the CVS-HEAD arrangement is about as good as we
should expect to get for 8.4.  The experiments I've been doing
suggest that the "extra" tuple copying isn't a major bottleneck
anyway...
        regards, tom lane