Thread: drop/truncate table sucks for large values of shared buffers

drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
Some time back, on one of the PostgreSQL blogs [1], there was a
discussion about the performance of drop/truncate table for
large values of shared_buffers, and it seems that as the value
of shared_buffers increases, the performance of drop/truncate
table becomes worse.  I think those are not often-used operations,
so it never became a priority to look into improving them if possible.

I have looked into it and found that the main reason for such
behaviour is that those operations traverse the whole of
shared_buffers, and it seems to me that we don't need to do that,
especially for not-so-big tables.  We can optimize that path
by looking into the buffer mapping table for the pages that exist in
shared_buffers when the table size is less than some
threshold (say 25%) of shared_buffers.
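
Roughly, the fast path works as in the sketch below; this is a
simplified illustration rather than the attached patch itself (the
patch also handles the other forks and applies the size threshold):

/*
 * Sketch only: probe the buffer mapping table for each block of one
 * fork instead of scanning all of shared_buffers.  Assumes it lives
 * in bufmgr.c; local buffers and error handling are omitted.
 */
static void
DropForkSpecificBuffers(RelFileNode rnode, ForkNumber forkNum,
                        BlockNumber nblocks)
{
    BlockNumber blkno;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partitionLock;
        int         buf_id;
        BufferDesc *bufHdr;

        INIT_BUFFERTAG(tag, rnode, forkNum, blkno);
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        /* Look the block up in the buffer mapping table. */
        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        if (buf_id < 0)
            continue;           /* block is not in shared_buffers */

        /*
         * Recheck the tag under the buffer header lock: the buffer
         * could have been evicted and reused after we released the
         * mapping partition lock.
         */
        bufHdr = GetBufferDescriptor(buf_id);
        LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
            bufHdr->tag.forkNum == forkNum &&
            bufHdr->tag.blockNum == blkno)
            InvalidateBuffer(bufHdr);   /* releases the header lock */
        else
            UnlockBufHdr(bufHdr);
    }
}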

The attached patch implements the above idea, and I found that
performance doesn't dip much with the patch even with large values
of shared_buffers.  I have also attached the script and SQL file used
to collect the performance data.

m/c configuration
--------------------------
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB



Shared_buffers (MB)      8      32     128    1024    8192
----------------------------------------------------------
HEAD (Tps)             138     130     124     103      48
Patch (Tps)            138     132     132     130     133
I have observed that this optimization has no effect if the value of
shared_buffers is small (say 8MB, 16MB, ...), so I have used it only
when the value of shared_buffers is greater than or equal to 32MB.

We might want to use a similar optimisation for DropRelFileNodeBuffers()
as well.

Suggestions?




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: drop/truncate table sucks for large values of shared buffers

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> I have looked into it and found that the main reason for such
> a behaviour is that for those operations it traverses whole
> shared_buffers and it seems to me that we don't need that
> especially for not-so-big tables.  We can optimize that path
> by looking into buff mapping table for the pages that exist in
> shared_buffers for the case when table size is less than some
> threshold (say 25%) of shared buffers.

I don't like this too much because it will fail badly if the caller
is wrong about the maximum possible page number for the table, which
seems not exactly far-fetched.  (For instance, remember those kernel bugs
we've seen that cause lseek to lie about the EOF position?)  It also
offers no hope of a fix for the other operations that scan the whole
buffer pool, such as DROP TABLESPACE and DROP DATABASE.

In the past we've speculated about fixing the performance of these things
by complicating the buffer lookup mechanism enough so that it could do
"find any page for this table/tablespace/database" efficiently.
Nobody's had ideas that seemed sane performance-wise though.
        regards, tom lane



Re: drop/truncate table sucks for large values of shared buffers

From
Gurjeet Singh
Date:
On Fri, Jun 26, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Sometime back on one of the PostgreSQL blog [1], there was
discussion about the performance of drop/truncate table for
large values of shared_buffers and it seems that as the value
of shared_buffers increase the performance of drop/truncate
table becomes worse.  I think those are not often used operations,
so it never became priority to look into improving them if possible.

I have looked into it and found that the main reason for such
a behaviour is that for those operations it traverses whole
shared_buffers and it seems to me that we don't need that
especially for not-so-big tables.  We can optimize that path
by looking into buff mapping table for the pages that exist in
shared_buffers for the case when table size is less than some
threshold (say 25%) of shared buffers.

Attached patch implements the above idea and I found that
performance doesn't dip much with patch even with large value
of shared_buffers.  I have also attached script and sql file used
to take performance data.

+1 for the effort to improve this.

With your technique added, there are 3 possible ways the search can happen: a) scan NBuffers and scan the list of relations, b) scan NBuffers and bsearch the list of relations, and c) scan the list of relations and then invalidate the blocks of each fork from shared buffers. Would it be worth finding one technique that serves decently from low-end values of shared_buffers to the high end?

On patch:

There are multiple naming styles being used in DropForkSpecificBuffers(); my_name and myName. Given this is a new function, it'd help to be consistent.

s/blk_count/blockNum/

s/new//, e.g. newTag, because there's no corresponding tag/oldTag variable in the function.

s/blocksToDel/blocksToDrop/. BTW, we never pass anything other than the total number of blocks in the fork, so we may as well call it just numBlocks.

s/traverse_buf_freelist/scan_shared_buffers/, because when it is true, we scan the whole shared_buffers.

s/rel_count/rel_num/

Reduce indentation/tab in header-comments of DropForkSpecificBuffers(). But I see there's precedent in neighboring functions, so this may be okay.

Doing pfree() of num_blocks, num_fsm_blocks and num_vm_blocks in one place (instead of two, at different indentation levels) would help readability.

Best regards,
-- 

Re: drop/truncate table sucks for large values of shared buffers

From
Andres Freund
Date:
On 2015-06-27 10:10:04 -0400, Tom Lane wrote:
> In the past we've speculated about fixing the performance of these things
> by complicating the buffer lookup mechanism enough so that it could do
> "find any page for this table/tablespace/database" efficiently.
> Nobody's had ideas that seemed sane performance-wise though.

I've started to play around with doing that a year or three back. My
approach was to use a Linux-style radix tree for the buffer mapping
table.  Besides lack of time, what made it hard to be efficient was the
size of our buffer tags, which requires rather deep trees.
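
For reference, the tag in question is defined in
src/include/storage/buf_internals.h (comments paraphrased here); it's
roughly 20 bytes of key, which is what forces the depth:

typedef struct buftag
{
    RelFileNode rnode;      /* tablespace OID, database OID, relfilenode */
    ForkNumber  forkNum;    /* main, fsm, vm or init fork */
    BlockNumber blockNum;   /* block number within the fork */
} BufferTag;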

I think I was considering playing around with a two-level tree (so we
could cache a pointer in Relation or such), but the memory management
requirements for that made my head hurt too much. The other alternative
is to work on having a much simpler buffer tag.

If you have a buffer mapping that allows orderly traversal, it becomes a)
much easier to do efficient readahead, as it's possible to read uncached
areas, and b) possible to write-combine neighbouring pages when writing out
from a backend.



Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > I have looked into it and found that the main reason for such
> > a behaviour is that for those operations it traverses whole
> > shared_buffers and it seems to me that we don't need that
> > especially for not-so-big tables.  We can optimize that path
> > by looking into buff mapping table for the pages that exist in
> > shared_buffers for the case when table size is less than some
> > threshold (say 25%) of shared buffers.
>
> I don't like this too much because it will fail badly if the caller
> is wrong about the maximum possible page number for the table, which
> seems not exactly far-fetched.  (For instance, remember those kernel bugs
> we've seen that cause lseek to lie about the EOF position?)

Considering we already hold an exclusive lock while doing this operation
and nobody else can write to this file, wouldn't closing and
reopening it avoid such problems?  In the patch that is already
done (smgrexists()) for two kinds of forks and can be done for the third kind
as well.

>  It also
> offers no hope of a fix for the other operations that scan the whole
> buffer pool, such as DROP TABLESPACE and DROP DATABASE.
>

True, but it doesn't foreclose anybody's ideas for optimizing those
paths.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Sat, Jun 27, 2015 at 8:06 PM, Gurjeet Singh <gurjeet@singh.im> wrote:
>
> On Fri, Jun 26, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Attached patch implements the above idea and I found that
>> performance doesn't dip much with patch even with large value
>> of shared_buffers.  I have also attached script and sql file used
>> to take performance data.
>
>
> +1 for the effort to improve this.
>

Thanks.

> With your technique added, there are 3 possible ways the search can happen a) Scan NBuffers and scan list of relations, b) Scan NBuffers and bsearch list of relations, and c) Scan list of relations and then invalidate blocks of each fork from shared buffers. Would it be worth it finding one technique that can serve decently from the low-end shared_buffers to the high-end.
>

Yes, it would be great if we could have such a technique, but in the
absence of one, the current optimization would suffice unless
there are loopholes in it.

> On patch:
>
> There are multiple naming styles being used in DropForkSpecificBuffers(); my_name and myName. Given this is a new function, it'd help to be consistent.
>

Thanks for the suggestions; I will improve the patch along those lines if
there is no problem with the idea at a broad level.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Robert Haas
Date:
On Sat, Jun 27, 2015 at 11:38 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-06-27 10:10:04 -0400, Tom Lane wrote:
>> In the past we've speculated about fixing the performance of these things
>> by complicating the buffer lookup mechanism enough so that it could do
>> "find any page for this table/tablespace/database" efficiently.
>> Nobody's had ideas that seemed sane performance-wise though.
>
> I've started to play around with doing that a year or three back. My
> approach was to use a linux style radix tree for the buffer mapping
> table.  Besides lack of time what made it hard to be efficient was the
> size of our buffer tags requiring rather deep trees.
>
> I think I was considering playing around with a two-level tree (so we
> could cache a pointer in Relation or such), but the memory management
> requirements for that made my head hurt too much. The other alternative
> is to work on having a much simpler buffer tag

Wouldn't even a two-level tree have the same problem you complained
about vis-a-vis chash?  In that case, you were of the opinion that
even ONE extra level of indirection was enough to pinch.  (I'm still
hoping there is a way to fix that, but even so.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: drop/truncate table sucks for large values of shared buffers

From
Andrew Dunstan
Date:
On 06/27/2015 10:10 AM, Tom Lane wrote:
> It also
> offers no hope of a fix for the other operations that scan the whole
> buffer pool, such as DROP TABLESPACE and DROP DATABASE.
>

Improving DROP TABLE / TRUNCATE would still be a significant advance. 
These cases cause far more real world pain than the ones you mention, in 
my experience.

cheers

andrew





Re: drop/truncate table sucks for large values of shared buffers

From
Andres Freund
Date:
On 2015-06-28 09:11:29 -0400, Robert Haas wrote:
> On Sat, Jun 27, 2015 at 11:38 AM, Andres Freund <andres@anarazel.de> wrote:
> > I've started to play around with doing that a year or three back. My
> > approach was to use a linux style radix tree for the buffer mapping
> > table.  Besides lack of time what made it hard to be efficient was the
> > size of our buffer tags requiring rather deep trees.
> >
> > I think I was considering playing around with a two-level tree (so we
> > could cache a pointer in Relation or such), but the memory management
> > requirements for that made my head hurt too much. The other alternative
> > is to work on having a much simpler buffer tag
> 
> Wouldn't even a two-level tree have the same problem you complained
> about vis-a-vis chash?

I was hoping to avoid the upper tree (mapping relfilenodes to the block
tree) in the majority of the cases by caching that mapping in struct
Relation or so.

But generally, yes, a tree will have more indirections. But they're
often of a different quality than with a hash table. There's a high
amount of spatial locality when looking up blocks: you're much more
likely to look up a block close to one recently looked up than a
randomly different one. Hash tables don't have a way to benefit from that
- tree structures sometimes do.

I don't think using a radix tree in itself will have significant
performance benefits over the hash table. But it allows for a bunch of
cool further optimizations like the aforementioned 'intelligent'
readahead, combining writes when flushing buffers, and - which made me
look into it originally - tagging inner nodes in the tree with
information about dirtiness to avoid scanning large amounts of non-dirty
buffers.

> In that case, you were of the opinion that even ONE extra level of
> indirection was enough to pinch.  (I'm still hoping there is a way to
> fix that, but even so.)

Me too.


Andres



Re: drop/truncate table sucks for large values of shared buffers

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't like this too much because it will fail badly if the caller
>> is wrong about the maximum possible page number for the table, which
>> seems not exactly far-fetched.  (For instance, remember those kernel bugs
>> we've seen that cause lseek to lie about the EOF position?)

> Considering we already have exclusive lock while doing this operation
> and nobody else can perform write on this file, won't closing and
> opening it again would avoid such problems.

On what grounds do you base that touching faith?  Quite aside from
outright bugs, having a lock on a table has nothing to do with whether
low-level processes such as the checkpointer can touch it.
        regards, tom lane



Re: drop/truncate table sucks for large values of shared buffers

From
Simon Riggs
Date:
On 27 June 2015 at 15:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
> I have looked into it and found that the main reason for such
> a behaviour is that for those operations it traverses whole
> shared_buffers and it seems to me that we don't need that
> especially for not-so-big tables.  We can optimize that path
> by looking into buff mapping table for the pages that exist in
> shared_buffers for the case when table size is less than some
> threshold (say 25%) of shared buffers.

I don't like this too much because it will fail badly if the caller
is wrong about the maximum possible page number for the table, which
seems not exactly far-fetched.  (For instance, remember those kernel bugs
we've seen that cause lseek to lie about the EOF position?)  It also
offers no hope of a fix for the other operations that scan the whole
buffer pool, such as DROP TABLESPACE and DROP DATABASE.

If that is true, then our reliance on lseek elsewhere could also cause data loss, for example by failing to scan data during a seq scan.

The consequences of failure of lseek in this case are nowhere near as dire, since by definition the data is being destroyed by the user.

So ISTM that we should be able to use this technique.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: drop/truncate table sucks for large values of shared buffers

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 27 June 2015 at 15:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't like this too much because it will fail badly if the caller
>> is wrong about the maximum possible page number for the table, which
>> seems not exactly far-fetched.  (For instance, remember those kernel bugs
>> we've seen that cause lseek to lie about the EOF position?)

> If that is true, then our reliance on lseek elsewhere could also cause data
> loss, for example by failing to scan data during a seq scan.

The lseek point was a for-example, not the entire universe of possible
problem sources for this patch.  (Also, underestimating the EOF point in
a seqscan is normally not an issue since any rows in a just-added page
are by definition not visible to the scan's snapshot.  But I digress.)

> The consequences of failure of lseek in this case are nowhere near as dire,
> since by definition the data is being destroyed by the user.

I'm not sure what you consider "dire", but missing a dirty buffer
belonging to the to-be-destroyed table would result in the system being
permanently unable to checkpoint, because attempts to write out the buffer
to the no-longer-extant file would fail.  You could only get out of the
situation via a forced database crash (immediate shutdown), followed by
replaying all the WAL since the time of the problem.  In production
contexts that could be pretty dire.
        regards, tom lane



Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Sun, Jun 28, 2015 at 9:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I don't like this too much because it will fail badly if the caller
> >> is wrong about the maximum possible page number for the table, which
> >> seems not exactly far-fetched.  (For instance, remember those kernel bugs
> >> we've seen that cause lseek to lie about the EOF position?)
>
> > Considering we already have exclusive lock while doing this operation
> > and nobody else can perform write on this file, won't closing and
> > opening it again would avoid such problems.
>
> On what grounds do you base that touching faith?

Sorry, but I don't get what problem you see with this.

> Quite aside from
> outright bugs, having lock on a table has nothing to do with whether
> low-level processes such as the checkpointer can touch it.
>

I thought that this problem (lseek lying about EOF) would only occur
if the file has recently been extended; could mere writes to existing
pages also cause this problem?

Though we should ideally take care of any failures of OS APIs, especially
if they could lead to data loss, I am not sure we can do that reliably
in all possible cases.  Can we safely assume that, because we haven't
encountered such a problem with any other API, everything is good?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Sun, Jun 28, 2015 at 9:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On 27 June 2015 at 15:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I don't like this too much because it will fail badly if the caller
> >> is wrong about the maximum possible page number for the table, which
> >> seems not exactly far-fetched.  (For instance, remember those kernel bugs
> >> we've seen that cause lseek to lie about the EOF position?)
>
> > If that is true, then our reliance on lseek elsewhere could also cause data
> > loss, for example by failing to scan data during a seq scan.
>
> The lseek point was a for-example, not the entire universe of possible
> problem sources for this patch.  (Also, underestimating the EOF point in
> a seqscan is normally not an issue since any rows in a just-added page
> are by definition not visible to the scan's snapshot.

How do we ensure that a just-added page is before or after the scan's snapshot?
If it is before, then the point mentioned above by Simon is valid.  Does this
mean that all other usages of smgrnblocks()/mdnblocks() are safe with respect
to this issue, or that the consequences will not be as bad as for this usage?

>  But I digress.)
>
> > The consequences of failure of lseek in this case are nowhere near as dire,
> > since by definition the data is being destroyed by the user.
>
> I'm not sure what you consider "dire", but missing a dirty buffer
> belonging to the to-be-destroyed table would result in the system being
> permanently unable to checkpoint, because attempts to write out the buffer
> to the no-longer-extant file would fail.

So another idea here could be that, instead of failing, we just ignore the
error when the object (to which that page belongs) doesn't exist, and
we make Drop free by not invalidating anything in shared_buffers for
Drop/Truncate at all.  This might not be a sane idea, though, as we would need
a way to look up objects from the checkpointer and would have to handle the case
where the same OID could be assigned to new objects (after wraparound?).



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Robert Haas
Date:
On Sun, Jun 28, 2015 at 12:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not sure what you consider "dire", but missing a dirty buffer
> belonging to the to-be-destroyed table would result in the system being
> permanently unable to checkpoint, because attempts to write out the buffer
> to the no-longer-extant file would fail.  You could only get out of the
> situation via a forced database crash (immediate shutdown), followed by
> replaying all the WAL since the time of the problem.  In production
> contexts that could be pretty dire.

Hmm, that is kind of ugly.  Is the write actually going to fail,
though, or is it just going to create a sparse file?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: drop/truncate table sucks for large values of shared buffers

From
Simon Riggs
Date:
On 28 June 2015 at 17:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 27 June 2015 at 15:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't like this too much because it will fail badly if the caller
>> is wrong about the maximum possible page number for the table, which
>> seems not exactly far-fetched.  (For instance, remember those kernel bugs
>> we've seen that cause lseek to lie about the EOF position?)

> If that is true, then our reliance on lseek elsewhere could also cause data
> loss, for example by failing to scan data during a seq scan.

The lseek point was a for-example, not the entire universe of possible
problem sources for this patch.  (Also, underestimating the EOF point in
a seqscan is normally not an issue since any rows in a just-added page
are by definition not visible to the scan's snapshot.  But I digress.)

> The consequences of failure of lseek in this case are nowhere near as dire,
> since by definition the data is being destroyed by the user.

I'm not sure what you consider "dire", but missing a dirty buffer
belonging to the to-be-destroyed table would result in the system being
permanently unable to checkpoint, because attempts to write out the buffer
to the no-longer-extant file would fail.  You could only get out of the
situation via a forced database crash (immediate shutdown), followed by
replaying all the WAL since the time of the problem.  In production
contexts that could be pretty dire.

Yes, it's bad, but we do notice that it has happened. We can also put in code to specifically avoid this error at checkpoint time.

If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong.

So I think this patch still has legs. We can check that the cleanup has been 100% complete when we do the buffer scan at the start of the checkpoint - that way we do just one scan of the buffer pool and move a time-consuming operation into a background process.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 28 June 2015 at 17:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> I'm not sure what you consider "dire", but missing a dirty buffer
>> belonging to the to-be-destroyed table would result in the system being
>> permanently unable to checkpoint, because attempts to write out the buffer
>> to the no-longer-extant file would fail.  You could only get out of the
>> situation via a forced database crash (immediate shutdown), followed by
>> replaying all the WAL since the time of the problem.  In production
>> contexts that could be pretty dire.
>
>
> Yes, its bad, but we do notice that has happened. We can also put in code to specifically avoid this error at checkpoint time.
>
> If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just added pages aren't the only thing we might miss if lseek is badly wrong.
>

So for the purpose of this patch, do we need to assume that
lseek can give us the wrong size of the file, and should we add preventive
checks and other handling for that?
I am okay with changing it that way if we are going to make that an
assumption in our code wherever we use lseek, now or in future; otherwise
we will end up with some preventive checks which are not actually required.

Another idea is to use some way other than lseek to know
the size of the file.  Do you think we can use stat() for this purpose?  We
are already using it in fd.c.
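
For illustration only, the two ways of asking the kernel for a file's
size look like this in plain POSIX terms (md.c actually goes through
the fd.c File API rather than raw descriptors):

#include <sys/stat.h>
#include <unistd.h>

/* What mdnblocks() effectively relies on today: seek to the end. */
static off_t
file_size_via_lseek(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/* Possible alternative: read the size from the file's metadata. */
static off_t
file_size_via_stat(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return st.st_size;
}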

> So, I think this patch still has legs. We can check that the clean up has been 100% when we do the buffer scan at the start of the checkpoint
>

One way to ensure that is to verify that each buffer header tag is
valid (which essentially means checking whether the object exists); do
you have something else in mind to accomplish this part if we
decide to go this route?

> - that way we do just one scan of the buffer pool and move a time-consuming operation into a background process.
>

Agreed, and if that turns out to be cheap, then we might want to
optimize Drop Database and others in the same way.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Mon, Jun 29, 2015 at 5:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jun 28, 2015 at 9:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> I don't like this too much because it will fail badly if the caller
> > >> is wrong about the maximum possible page number for the table, which
> > >> seems not exactly far-fetched.  (For instance, remember those kernel bugs
> > >> we've seen that cause lseek to lie about the EOF position?)
> >
> > > Considering we already have exclusive lock while doing this operation
> > > and nobody else can perform write on this file, won't closing and
> > > opening it again would avoid such problems.
> >
> > On what grounds do you base that touching faith?
>
> Sorry, but I don't get what problem do you see in this touching?
>

Thinking about it again, I think you are worried that if we close
and reopen the file, it would break any flush operation that could
happen in parallel via checkpoint.  Still, I am not clear whether
we want to assume that we can't rely on lseek for the size of the file
if there can be parallel write activity on the file (even if that write
doesn't increase the size of the file).

If yes, then we have the options below:

a.  Have some protection mechanism for file access (ignore the error
for a file not present or not accessible during flush) and clean the
buffers containing invalid objects, as is being discussed up-thread.

b. Use some other API, like stat, to get the size of the file.

Do you prefer either of these, or if you have any better idea, please
do share it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Simon Riggs
Date:
On 30 June 2015 at 05:02, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 28 June 2015 at 17:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> I'm not sure what you consider "dire", but missing a dirty buffer
>> belonging to the to-be-destroyed table would result in the system being
>> permanently unable to checkpoint, because attempts to write out the buffer
>> to the no-longer-extant file would fail.  You could only get out of the
>> situation via a forced database crash (immediate shutdown), followed by
>> replaying all the WAL since the time of the problem.  In production
>> contexts that could be pretty dire.
>
>
> Yes, its bad, but we do notice that has happened. We can also put in code to specifically avoid this error at checkpoint time.
>
> If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just added pages aren't the only thing we might miss if lseek is badly wrong.
>

So for the purpose of this patch, do we need to assume that
lseek can give us wrong size of file and we should add preventive
checks and other handling for the same?
I am okay to change that way, if we are going to have that as assumption
in out code wherever we are using it or will use it in-future, otherwise
we will end with some preventive checks which are actually not required.

They're preventative checks. You always hope it is wasted effort.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 30 June 2015 at 05:02, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >
>> > On 28 June 2015 at 17:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >>
>> > If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just added pages aren't the only thing we might miss if lseek is badly wrong.
>> >
>>
>> So for the purpose of this patch, do we need to assume that
>> lseek can give us wrong size of file and we should add preventive
>> checks and other handling for the same?
>> I am okay to change that way, if we are going to have that as assumption
>> in out code wherever we are using it or will use it in-future, otherwise
>> we will end with some preventive checks which are actually not required.
>
>
> They're preventative checks. You always hope it is wasted effort.
>

I am not sure preventative checks (without a real need) are okay if they
are not cheap, which could happen in this case.  I think validating a
buffer tag would require a rel or sys cache lookup.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Simon Riggs
Date:
On 30 June 2015 at 07:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 30 June 2015 at 05:02, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >
>> > On 28 June 2015 at 17:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >>
>> > If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just added pages aren't the only thing we might miss if lseek is badly wrong.
>> >
>>
>> So for the purpose of this patch, do we need to assume that
>> lseek can give us wrong size of file and we should add preventive
>> checks and other handling for the same?
>> I am okay to change that way, if we are going to have that as assumption
>> in out code wherever we are using it or will use it in-future, otherwise
>> we will end with some preventive checks which are actually not required.
>
>
> They're preventative checks. You always hope it is wasted effort.
>

I am not sure if Preventative checks (without the real need) are okay if they
are not-cheap which could happen in this case.  I think Validating buffer-tag
would require rel or sys cache lookup.

True, so don't do that.

Keep a list of dropped relations and have the checkpoint process scan the buffer pool every 64 tables, kinda like AbsorbFsync

All the heavy lifting gets done in a background process and we know we're safe.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Tue, Jun 30, 2015 at 12:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 30 June 2015 at 07:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >
>> > On 30 June 2015 at 05:02, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >>
>> >> On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >> >
>> >> > On 28 June 2015 at 17:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> >>
>> >> > If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just added pages aren't the only thing we might miss if lseek is badly wrong.
>> >> >
>> >>
>> >> So for the purpose of this patch, do we need to assume that
>> >> lseek can give us wrong size of file and we should add preventive
>> >> checks and other handling for the same?
>> >> I am okay to change that way, if we are going to have that as assumption
>> >> in out code wherever we are using it or will use it in-future, otherwise
>> >> we will end with some preventive checks which are actually not required.
>> >
>> >
>> > They're preventative checks. You always hope it is wasted effort.
>> >
>>
>> I am not sure if Preventative checks (without the real need) are okay if they
>> are not-cheap which could happen in this case.  I think Validating buffer-tag
>> would require rel or sys cache lookup.
>
>
> True, so don't do that.
>
> Keep a list of dropped relations and have the checkpoint process scan the buffer pool every 64 tables, kinda like AbsorbFsync
>

Okay. I think we can maintain the list in a similar way as we do for
UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but
why wait till 64 tables?  We already scan the whole buffer list in each
checkpoint cycle, so during that scan we can consult this dropped-relation
list and avoid syncing such buffer contents.  Also, for ENOENT error
handling in FileWrite, we can use this list to know the relations for which
we need to ignore the error.  I think we are already doing something similar
in mdsync to avoid the problem of dropped tables, so it seems okay to
have it in mdwrite as well.

The crucial thing to think about in this idea is avoiding reassignment of
the relfilenode (due to wrapped OIDs) before we have ensured that none of
the buffers contains a tag for that relfilenode.  Currently we avoid this for
the fsync case by retaining the first segment of the relation (which prevents
reassignment of the relfilenode) till the checkpoint ends.  I think if we just
postpone unlinking until we have validated it in shared_buffers, then we can
avoid this problem in the new scheme, and it should be a delay of at most one
checkpoint cycle for unlinking such a file, assuming we consult the
dropped-relation list in each checkpoint cycle during the buffer scan.
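
Something along the following lines during the checkpoint's pass over
the buffer pool is what I have in mind (purely a sketch;
RelFileNodeIsPendingDrop() stands in for a lookup of the
dropped-relation list and doesn't exist today):

static void
AbsorbPendingRelationDrops(void)
{
    int     i;

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);

        /* Unlocked pre-check, as in DropRelFileNodeBuffers(). */
        if (!RelFileNodeIsPendingDrop(bufHdr->tag.rnode))
            continue;

        LockBufHdr(bufHdr);
        if (RelFileNodeIsPendingDrop(bufHdr->tag.rnode))
            InvalidateBuffer(bufHdr);   /* releases the header lock */
        else
            UnlockBufHdr(bufHdr);
    }
}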

Does that make sense?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Simon Riggs
Date:
On 1 July 2015 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
 
Okay. I think we can maintain the list in similar way as we do for
UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but
why to wait till 64 tables? 

I meant once per checkpoint cycle OR every N tables, whichever is sooner. N would need to vary according to the size of NBuffers.
 
We already scan whole buffer list in each
checkpoint cycle, so during that scan we can refer this dropped relation
list and avoid syncing such buffer contents.  Also for ENOENT error
handling for FileWrite, we can use this list to refer relations for which we
need to ignore the error.  I think we are already doing something similar in
mdsync to avoid the problem of Dropped tables, so it seems okay to
have it in mdwrite as well.

The crucial thing in this idea to think about is avoiding reassignment of
relfilenode (due to wrapped OID's) before we have ensured that none of
the buffers contains tag for that relfilenode.  Currently we avoid this for
Fsync case by retaining the first segment of relation (which will avoid
reassignment of relfilenode) till checkpoint ends, I think if we just postpone
it till we have validated it in shared_buffers, then we can avoid this problem
in new scheme and it should be delay of maximum one checkpoint cycle
for unlinking such file assuming we refer dropped relation list in each checkpoint
cycle during buffer scan.

Yes 

So you are keeping more data around for longer, right? I think we would need some way to trigger a scan when the amount of deferred dropped data files hits a certain size.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Wed, Jul 1, 2015 at 8:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On 1 July 2015 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
>  
>>
>> Okay. I think we can maintain the list in similar way as we do for
>> UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but
>> why to wait till 64 tables?
>
>
> I meant once per checkpoint cycle OR every N tables, whichever is sooner. N would need to vary according to size of Nbuffers.
>  

That sounds sensible to me; see my reply below for what I think we can do
to make this work.

>>
>> We already scan whole buffer list in each
>> checkpoint cycle, so during that scan we can refer this dropped relation
>> list and avoid syncing such buffer contents.  Also for ENOENT error
>> handling for FileWrite, we can use this list to refer relations for which we
>> need to ignore the error.  I think we are already doing something similar in
>> mdsync to avoid the problem of Dropped tables, so it seems okay to
>> have it in mdwrite as well.
>>
>> The crucial thing in this idea to think about is avoiding reassignment of
>> relfilenode (due to wrapped OID's) before we have ensured that none of
>> the buffers contains tag for that relfilenode.  Currently we avoid this for
>> Fsync case by retaining the first segment of relation (which will avoid
>> reassignment of relfilenode) till checkpoint ends, I think if we just postpone
>> it till we have validated it in shared_buffers, then we can avoid this problem
>> in new scheme and it should be delay of maximum one checkpoint cycle
>> for unlinking such file assuming we refer dropped relation list in each checkpoint
>> cycle during buffer scan.
>
>
> Yes
>
> So you are keeping more data around for longer, right?

Yes, and we already do it for the sake of fsyncs.

> I think we would need some way to trigger a scan when the amount of deferred dropped data files hits a certain size.
>

Okay, I think we can use the point when the number of dropped
relations reaches 64 or 0.01% (or 0.1%) of shared_buffers,
whichever is smaller, as the trigger for the checkpoint.
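
In code form the trigger could be something as simple as the following
(names purely illustrative):

/* Trigger the scan once the dropped-relation list reaches either 64
 * entries or 0.01% of shared_buffers, whichever is smaller. */
#define PENDING_DROP_SCAN_THRESHOLD \
    Min(64, Max(1, NBuffers / 10000))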


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Heikki Linnakangas
Date:
On 06/27/2015 07:45 AM, Amit Kapila wrote:
> Sometime back on one of the PostgreSQL blog [1], there was
> discussion about the performance of drop/truncate table for
> large values of shared_buffers and it seems that as the value
> of shared_buffers increase the performance of drop/truncate
> table becomes worse.  I think those are not often used operations,
> so it never became priority to look into improving them if possible.
>
> I have looked into it and found that the main reason for such
> a behaviour is that for those operations it traverses whole
> shared_buffers and it seems to me that we don't need that
> especially for not-so-big tables.  We can optimize that path
> by looking into buff mapping table for the pages that exist in
> shared_buffers for the case when table size is less than some
> threshold (say 25%) of shared buffers.
>
> Attached patch implements the above idea and I found that
> performance doesn't dip much with patch even with large value
> of shared_buffers.  I have also attached script and sql file used
> to take performance data.

I'm marking this as "returned with feedback" in the commitfest. In
addition to the issues raised so far, ISTM that the patch makes dropping
a very large table with small shared_buffers slower
(DropForkSpecificBuffers() is O(n), where n is the size of the relation,
while the current method is O(shared_buffers)).

I concur that we should explore using a radix tree or something else 
that would naturally allow you to find all buffers for relation/database 
X quickly.

- Heikki




Re: drop/truncate table sucks for large values of shared buffers

From
Andres Freund
Date:
On 2015-07-02 16:08:03 +0300, Heikki Linnakangas wrote:
> I'm marking this as "returned with feedback" in the commitfest. In addition
> to the issues raised so far, ISTM that the patch makes dropping a very large
> table with small shared_buffers slower (DropForkSpecificBuffers() is O(n)
> where n is the size of the relation, while the current method is
> O(shared_buffers))

I think upthread there was talk about only using the O(relsize) approach
if relsize << NBuffers. That's actually a pretty common scenario,
especially in testsuites.

> I concur that we should explore using a radix tree or something else that
> would naturally allow you to find all buffers for relation/database X
> quickly.

I unsurprisingly think that's the right way to go. But I'm not sure it
isn't worth adding another layer of band-aid till we're there...

Greetings,

Andres Freund



Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 06/27/2015 07:45 AM, Amit Kapila wrote:
>>
>> Sometime back on one of the PostgreSQL blog [1], there was
>> discussion about the performance of drop/truncate table for
>> large values of shared_buffers and it seems that as the value
>> of shared_buffers increase the performance of drop/truncate
>> table becomes worse.  I think those are not often used operations,
>> so it never became priority to look into improving them if possible.
>>
>> I have looked into it and found that the main reason for such
>> a behaviour is that for those operations it traverses whole
>> shared_buffers and it seems to me that we don't need that
>> especially for not-so-big tables.  We can optimize that path
>> by looking into buff mapping table for the pages that exist in
>> shared_buffers for the case when table size is less than some
>> threshold (say 25%) of shared buffers.
>>
>> Attached patch implements the above idea and I found that
>> performance doesn't dip much with patch even with large value
>> of shared_buffers.  I have also attached script and sql file used
>> to take performance data.
>
>
> I'm marking this as "returned with feedback" in the commitfest. In addition to the issues raised so far, 

Don't you think the approach discussed (delayed cleanup of buffers
during the checkpoint scan) is sufficient to fix the issues raised?

> ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers))
>

Can you please explain how you reached that conclusion?
It does that only when the relation size is less than 25% of
shared_buffers.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Simon Riggs
Date:
On 2 July 2015 at 14:08, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 06/27/2015 07:45 AM, Amit Kapila wrote:
Sometime back on one of the PostgreSQL blog [1], there was
discussion about the performance of drop/truncate table for
large values of shared_buffers and it seems that as the value
of shared_buffers increase the performance of drop/truncate
table becomes worse.  I think those are not often used operations,
so it never became priority to look into improving them if possible.

I have looked into it and found that the main reason for such
a behaviour is that for those operations it traverses whole
shared_buffers and it seems to me that we don't need that
especially for not-so-big tables.  We can optimize that path
by looking into buff mapping table for the pages that exist in
shared_buffers for the case when table size is less than some
threshold (say 25%) of shared buffers.

Attached patch implements the above idea and I found that
performance doesn't dip much with patch even with large value
of shared_buffers.  I have also attached script and sql file used
to take performance data.

I'm marking this as "returned with feedback" in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers))

There are no unresolved issues with the approach, nor is it true that it is slower. If you think there are some, you should say what they are, not act high-handedly to reject a patch without reason.

I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly.

I doubt that it can be managed efficiently while retaining optimal memory management. If it can, it's going to be very low priority (or should be).

This approach works, so let's do it now. If someone comes up with a better way later, great.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: drop/truncate table sucks for large values of shared buffers

From
Heikki Linnakangas
Date:
On 07/02/2015 04:18 PM, Amit Kapila wrote:
> On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> On 06/27/2015 07:45 AM, Amit Kapila wrote:
>>>
>>> Sometime back on one of the PostgreSQL blog [1], there was
>>> discussion about the performance of drop/truncate table for
>>> large values of shared_buffers and it seems that as the value
>>> of shared_buffers increase the performance of drop/truncate
>>> table becomes worse.  I think those are not often used operations,
>>> so it never became priority to look into improving them if possible.
>>>
>>> I have looked into it and found that the main reason for such
>>> a behaviour is that for those operations it traverses whole
>>> shared_buffers and it seems to me that we don't need that
>>> especially for not-so-big tables.  We can optimize that path
>>> by looking into buff mapping table for the pages that exist in
>>> shared_buffers for the case when table size is less than some
>>> threshold (say 25%) of shared buffers.
>>>
>>> Attached patch implements the above idea and I found that
>>> performance doesn't dip much with patch even with large value
>>> of shared_buffers.  I have also attached script and sql file used
>>> to take performance data.
>>
>> I'm marking this as "returned with feedback" in the commitfest. In
> addition to the issues raised so far,
>
> Don't you think the approach discussed (delayed cleanup of buffers
> during checkpoint scan) is sufficient to fix the issues raised.

Dunno, but there is no patch for that.

- Heikki




Re: drop/truncate table sucks for large values of shared buffers

From
Heikki Linnakangas
Date:
On 07/02/2015 04:33 PM, Simon Riggs wrote:
> On 2 July 2015 at 14:08, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> On 06/27/2015 07:45 AM, Amit Kapila wrote:
>>
>>> Sometime back on one of the PostgreSQL blog [1], there was
>>> discussion about the performance of drop/truncate table for
>>> large values of shared_buffers and it seems that as the value
>>> of shared_buffers increase the performance of drop/truncate
>>> table becomes worse.  I think those are not often used operations,
>>> so it never became priority to look into improving them if possible.
>>>
>>> I have looked into it and found that the main reason for such
>>> a behaviour is that for those operations it traverses whole
>>> shared_buffers and it seems to me that we don't need that
>>> especially for not-so-big tables.  We can optimize that path
>>> by looking into buff mapping table for the pages that exist in
>>> shared_buffers for the case when table size is less than some
>>> threshold (say 25%) of shared buffers.
>>>
>>> Attached patch implements the above idea and I found that
>>> performance doesn't dip much with patch even with large value
>>> of shared_buffers.  I have also attached script and sql file used
>>> to take performance data.
>>>
>>
>> I'm marking this as "returned with feedback" in the commitfest. In
>> addition to the issues raised so far, ISTM that the patch makes dropping a
>> very large table with small shared_buffers slower
>> (DropForkSpecificBuffers() is O(n) where n is the size of the relation,
>> while the current method is O(shared_buffers))
>
> There are no unresolved issues with the approach, nor is it true it is
> slower. If you think there are some, you should say what they are, not act
> high handedly to reject a patch without reason.

Oh, I missed the NBuffers / 4 threshold.

(The total_blocks calculation is prone to overflowing, btw, if you have
a table close to 32 TB in size. But that's trivial to fix.)
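
For instance, a table near 32 TB has close to 2^32 8 kB blocks, so
summing the block counts of the forks in a 32-bit variable can wrap;
widening the accumulator is enough (variable names here are
illustrative, not the patch's):

uint64      total_blocks;

/* Cast first so the additions are done in 64-bit arithmetic. */
total_blocks = (uint64) main_blocks + fsm_blocks + vm_blocks;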

>> I concur that we should explore using a radix tree or something else that
>> would naturally allow you to find all buffers for relation/database X
>> quickly.
>
> I doubt that it can be managed efficiently while retaining optimal memory
> management. If it can its going to be very low priority (or should be).
>
> This approach works, so lets do it, now. If someone comes up with a better
> way later, great.

*shrug*. I don't think this is ready to be committed. I can't stop you 
from working on this, but as far as this commitfest is concerned, case 
closed.

- Heikki




Re: drop/truncate table sucks for large values of shared buffers

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 2 July 2015 at 14:08, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I'm marking this as "returned with feedback" in the commitfest.

> There are no unresolved issues with the approach, nor is it true it is
> slower. If you think there are some, you should say what they are, not act
> high handedly to reject a patch without reason.

Have you read the thread?  There were plenty of objections, as well as
a design for a better solution.  In addition, we should think about
more holistic solutions such as Andres' nearby proposal (which I doubt
will work as-is, but maybe somebody will think of how to fix it).
Committing a band-aid will just make it harder to pursue real fixes.
        regards, tom lane



Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Thu, Jul 2, 2015 at 7:27 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 07/02/2015 04:18 PM, Amit Kapila wrote:
>>
>>
>> Don't you think the approach discussed (delayed cleanup of buffers
>> during checkpoint scan) is sufficient to fix the issues raised.
>
>
> Dunno, but there is no patch for that.
>

That's fair enough; however, your reply sounded to me like there is
no further need to explore this idea unless we have something like
a radix-tree-based approach.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: drop/truncate table sucks for large values of shared buffers

From
Amit Kapila
Date:
On Thu, Jul 2, 2015 at 7:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On 2 July 2015 at 14:08, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> I'm marking this as "returned with feedback" in the commitfest.
>
> > There are no unresolved issues with the approach, nor is it true it is
> > slower. If you think there are some, you should say what they are, not act
> > high handedly to reject a patch without reason.
>
> Have you read the thread?

I have read the thread again with a focus on the various problems
(objections) raised for this patch, and here is a summary of
my reading.

1. lseek can lie about EOF, which could lead to a serious problem
during checkpoint if DROP TABLE misses removing a shared buffer
belonging to the table being dropped.

This problem can be solved by maintaining a list of dropped relations
and then, during checkpoint, cleaning such buffers as part of the
buffer-pool scan.  Something similar is already used to avoid comparable
problems during fsync.

2. The patch doesn't cater to DROP TABLESPACE and DROP DATABASE
operations.

It would have been better if there were a simple solution for these
operations as well, but even if we come up with something generic
enough to avoid problems for these operations, I think there is no
reason why it couldn't easily be adopted for the DROP TABLE operation,
as the changes proposed by this patch are very much localized to one
function (which we would have to change anyway, even without this
patch, if we came up with a generic mechanism); the other changes
required to avoid problem 1 (the lseek problem) would still be needed
even once a patch for the generic approach is ready.  As mentioned by
Andrew, another thing to note is that these operations are much less
used compared to Drop/Truncate Table, so I think optimizing them
is of somewhat lower priority.
  

3. We can't close and reopen a file (to avoid lseek lying about EOF or
otherwise) as that might lead to a failure if there is a flush operation
for the file in parallel.

I haven't checked this, but I think we can find some way
to check whether the vm and fsm files exist before asking for the number
of blocks in those files, as sketched below.
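
Roughly (variable names illustrative; the patch already does this via
smgrexists() for two of the forks):

/* Ask for a fork's length only if that fork actually exists on disk. */
if (smgrexists(reln, FSM_FORKNUM))
    fsm_blocks = smgrnblocks(reln, FSM_FORKNUM);

if (smgrexists(reln, VISIBILITYMAP_FORKNUM))
    vm_blocks = smgrnblocks(reln, VISIBILITYMAP_FORKNUM);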


Apart from the above, Heikki mentioned an overflow in the calculation of
the total number of blocks, which I think is a relatively simple problem
to fix.


>  There were plenty of objections, as well as
> a design for a better solution.

I think here by a better solution you mean the radix-tree-based approach
or something similar.  First, I don't see any clear design for it,
and second, even if tomorrow we have a patch for it ready,
it's not very difficult to change the proposed solution even after
it is committed, as the changes are very much localized to one
function.

>  In addition, we should think about
> more holistic solutions such as Andres' nearby proposal (which I doubt
> will work as-is, but maybe somebody will think of how to fix it).
> Committing a band-aid will just make it harder to pursue real fixes.
>

Right, but OTOH waiting a long time for something much
more generic doesn't sound wise either, especially when the proposed
solution can later be replaced by the generic one without much difficulty.

Having said that, I am not wedded to working on this idea, so if you
and/or others have no inclination for work in this direction, then
I will stop and let's wait for the day when we have a clear idea for
some generic approach.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com