Re: COPY FROM WHEN condition - Mailing list pgsql-hackers

From Andres Freund
Subject Re: COPY FROM WHEN condition
Msg-id 20190402011126.jvy5oun2nsdy4pqr@alap3.anarazel.de
In response to Re: COPY FROM WHEN condition  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: COPY FROM WHEN condition
List pgsql-hackers
Hi,

On 2019-04-02 14:06:52 +1300, David Rowley wrote:
> On Tue, 2 Apr 2019 at 13:59, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2019-04-02 13:41:57 +1300, David Rowley wrote:
> > > On Tue, 2 Apr 2019 at 05:19, Andres Freund <andres@anarazel.de> wrote:
> > > > Thanks! I'm not quite clear whether you're planning to continue working on
> > > > this, or whether this is a handoff? Either is fine with me, just trying
> > > > to avoid unnecessary work / delay.
> > >
> > > I can, if you've not. I was hoping to gauge if you thought the
> > > approach was worth pursuing.
> >
> > I think it's worth pursuing, with the caveats below. I'm going to focus
> > on docs the not-very-long rest of today, but I definitely could work on
> > this afterwards. But I also would welcome any help. Let me know...
> 
> I'm looking now. I'll post something when I get it into some better
> shape than it is now.

Cool.


> > > > It still seems wrong to me to just perform a second hashtable search
> > > > here, given that we've already done the partition dispatch.
> > >
> > > The reason I thought this was a good idea is that if we use the
> > > ResultRelInfo to buffer the tuples then there's no end to how many
> > > tuple slots can exist as the code in copy.c has no control over how
> > > many ResultRelInfos are created.
> >
> > To me those aren't contradictory - we're going to have a ResultRelInfo
> > for each partition either way, but there's nothing preventing copy.c
> > from cleaning up subsidiary data in it.  What I was thinking is that
> > we'd just keep track of a list of ResultRelInfos with bulk insert slots,
> > and occasionally clean them up. That way we avoid the secondary lookup,
> > while also managing the amount of slots.
> 
> The problem that I see with that is you can't just add to that list
> when the partition changes. You must check if the ResultRelInfo is
> already in the list or not since we could change partitions and change
> back again. For a list with just a few elements checking
> list_member_ptr should be pretty cheap, but I randomly did choose that
> we try to keep just the last 16 partitions worth of buffers. I don't
> think checking list_member_ptr in a 16 element list is likely to be
> faster than a hash table lookup, do you?

Why do we need that list membership check? If you append the
ResultRelInfo to the list when creating the ResultRelInfo's slots array,
you don't need to touch the list after a partition lookup - you know
it's a member if the ResultRelInfo has a slot array.  Then you only need
to iterate the list when you want to drop slots to avoid using too much
memory - and that's also a sequential scan of the hash table in your
approach, right?
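
To make that concrete, here is a minimal standalone sketch of the idea. It
is not code from copy.c: DemoRelInfo, DemoCopyState, demo_ensure_slots,
demo_trim and both constants are hypothetical stand-ins, and dropping the
oldest-registered entries first is just an assumption to keep the example
short. The point it illustrates is that a non-NULL slot array already
implies list membership, so the per-tuple path never searches the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical limits; 16 echoes the "last 16 partitions" figure above. */
#define MAX_BUFFERED_RELS 16
#define SLOTS_PER_REL 4

/* Hypothetical stand-in for ResultRelInfo, NOT the real PostgreSQL struct. */
typedef struct DemoRelInfo
{
    int         partition_id;
    void      **slots;          /* NULL: no buffer, hence not on the list */
    struct DemoRelInfo *next;   /* intrusive link for the buffered list */
} DemoRelInfo;

/* Hypothetical stand-in for the COPY state that owns the list. */
typedef struct DemoCopyState
{
    DemoRelInfo *head;          /* oldest registered entry */
    DemoRelInfo *tail;          /* newest registered entry */
    int          nbuffered;
} DemoCopyState;

/*
 * Called after partition dispatch. The list is touched only when the slot
 * array is first created; otherwise this is a single pointer test.
 */
static void
demo_ensure_slots(DemoCopyState *cs, DemoRelInfo *rel)
{
    assert(cs != NULL && rel != NULL);

    if (rel->slots != NULL)
        return;                 /* already has slots, so already listed */

    rel->slots = calloc(SLOTS_PER_REL, sizeof(void *));
    if (rel->slots == NULL)
        abort();

    rel->next = NULL;
    if (cs->tail != NULL)
        cs->tail->next = rel;
    else
        cs->head = rel;
    cs->tail = rel;
    cs->nbuffered++;
}

/*
 * Occasional cleanup: the only place the list is iterated. Frees slot
 * arrays, oldest first, until we are back under the cap. "keep" is the
 * partition currently being inserted into and is never dropped.
 */
static void
demo_trim(DemoCopyState *cs, DemoRelInfo *keep)
{
    DemoRelInfo *prev = NULL;
    DemoRelInfo *cur = cs->head;

    while (cur != NULL && cs->nbuffered > MAX_BUFFERED_RELS)
    {
        DemoRelInfo *next = cur->next;

        if (cur == keep)
        {
            prev = cur;
            cur = next;
            continue;
        }

        /* Unlink cur and release its buffer. */
        if (prev != NULL)
            prev->next = next;
        else
            cs->head = next;
        if (cs->tail == cur)
            cs->tail = prev;

        free(cur->slots);
        cur->slots = NULL;      /* marks it as "not on the list" again */
        cur->next = NULL;
        cs->nbuffered--;
        cur = next;
    }
}
```

In this sketch a caller would run demo_ensure_slots() each time the target
partition changes and demo_trim() only once in a while, e.g. when flushing
buffers. Switching back to a partition that still has its slots costs one
NULL check, and a partition whose slots were dropped simply gets
re-registered the next time it is hit.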

Greetings,

Andres Freund


