Re: [HACKERS] UPDATE of partition key - Mailing list pgsql-hackers

From Amit Khandekar
Subject Re: [HACKERS] UPDATE of partition key
Date
Msg-id CAJ3gD9d8DLw+e_jNOmyGDaHXkmww48_1U22brKNB7O7sS1igdQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] UPDATE of partition key  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] UPDATE of partition key
List pgsql-hackers
On 9 January 2018 at 23:07, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 4, 2018 at 1:18 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> ------------------
>> 1. ExecUpdate() needs to revert back tcs_map value changed by ExecInsert()
>> ------------------
>>
>>>> It seems like the ON CONFLICT stuff handled that by adding a
>>>> second TransitionCaptureState pointer to ModifyTable, thus
>>>> mt_transition_capture and mt_oc_transition_capture.  By that
>>>> precedent, we could add mt_utr_transition_capture or similar, and
>>>> maybe that's the way to go.  It seems a bit unsatisfying, but so does
>>>> what you have now.
>>>
>>> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
>>> triggers referencing transition tables, both of the triggers need to
>>> independently populate their own transition tables, and hence the need
>>> for two separate transition states : mt_transition_capture and
>>> mt_oc_transition_capture. But in case of update-tuple-routing, the
>>> INSERT statement trigger won't come into picture. So the same
>>> mt_transition_capture can serve the purpose of populating the
>>> transition table with OLD and NEW rows. So I think it would be too
>>> redundant, if not incorrect, to have a whole new transition state for
>>> update tuple routing.
>>>
>>> I will see if it turns out better to have two tcs_maps in
>>> TransitionCaptureState, one for update and one for insert. But this,
>>> on first look, does not look good.
>>
>> Suppose TransitionCaptureState has separate maps, upd_del_tcs_maps and
>> insert_tcs_maps for UPDATE/DELETE and INSERT events respectively.
>
> That's not what I suggested.  If you look at what I wrote, I floated
> the idea of having two TransitionCaptureStates, not two separate maps
> within the same TransitionCaptureState.

In the first paragraph of my explanation, I was explaining why two
Transition capture states does not look like a good idea to me :

>>> In case of ON CONFLICT, if there are both INSERT and UPDATE statement
>>> triggers referencing transition tables, both of the triggers need to
>>> independently populate their own transition tables, and hence the need
>>> for two separate transition states : mt_transition_capture and
>>> mt_oc_transition_capture. But in case of update-tuple-routing, the
>>> INSERT statement trigger won't come into picture. So the same
>>> mt_transition_capture can serve the purpose of populating the
>>> transition table with OLD and NEW rows. So I think it would be too
>>> redundant, if not incorrect, to have a whole new transition state for
>>> update tuple routing.

And in the next para, I explained about the other alternative of
having two separate maps as against transition states.

>
>> ------------------
>> 2. mt_childparent_tupconv_maps is indexed by subplan or partition leaf index.
>> ------------------
>> ------------------
>> 3. Renaming of mt_transition_tupconv_maps to mt_childparent_tupconv_maps
>> ------------------
>>
>> We need to change it's name because now this map is not only used for
>> transition capture, but also for update-tuple-routing. Does it look ok
>> for you if, for readability, we keep the childparent tag ? Or else, we
>> can just make it "mt_tupconv_maps", but "mt_childparent_tupconv_maps"
>> looks more informative.
>
> I see your point: the array is being renamed because it now has more
> than one purpose.  But that's also what I'm complaining about with
> regard to point #2: the same array is being used for more than one
> purpose.  That's generally bad style.  If you have two loops in a
> function, it's best to declare two separate loop variables rather than
> reusing the same variable.  This lets the compiler detect, for
> example, an error where the second loop variable is used before it's
> initialized, which would be undetectable if you reused the same
> variable in both places.  Although that particular benefit doesn't
> pertain in this case, I maintain that having a single structure member
> that is indexed one of two different ways is a bad idea.
>
> If I understand correctly, the way we got here is that, in earlier
> patch versions, you had two arrays of maps, but it wasn't clear why we
> needed both of them, and David suggested replacing one of them with an
> array of indexes instead, in the hopes of reducing confusion.

Slight correction; it was suggested by Amit Langote; not by David.

> However, it looks to me like that didn't really work out.  If we
> always needed both maps, or even if we always needed the per-leaf map,
> it would have been a good idea, but it seems here that we can need
> either the per-leaf map or the per-subplan map or both or neither, and
> we want to avoid computing all of the per-leaf conversion maps if we
> only need per-subplan access.

I was ok with either mine or Amit Langote's approach. His approach
uses array of offsets to leaf-partition array, which sounded to me
like it may be re-usable for some similar purpose later.

>
> I think one way to fix this might be to build the per-leaf maps on
> demand.  Just because we're doing UPDATE tuple routing doesn't
> necessarily mean we'll actually need a TupleConversionMap for every
> child.  So we could allocate an array with one byte per leaf, where 0
> means we don't know whether tuple conversion is necessary, 1 means it
> is not, and 2 means it is, or something like that.  Then we have a
> second array with conversion maps.  We provide a function
> tupconv_map_for_leaf() or similar that checks the array; if it finds
> 1, it returns NULL; if it finds 2, it returns the conversion map
> previously calculated. If it finds 0, it calls convert_tuples_by_name,
> caches the result for later, updates the one-byte-per-leaf array with
> the appropriate value, and returns the just-computed conversion map.
> (The reason I'm suggesting 0/1/2 instead of just true/false is to
> reduce cache misses; if we find a 1 in the first array we don't need
> to access the second array at all.)
>
> If that doesn't seem like a good idea for some reason, then my second
> choice would be to leave mt_transition_tupconv_maps named the way it
> is currently and have a separate mt_update_tupconv_maps, with the two
> pointing, if both are initialized and as far as possible, to the same
> TupleConversionMap objects.

So there are two independent optimizations we are talking about :

1. Create the map only when needed. We may not require a map for a
leaf partition if there is no insert happening to that partition. And,
the insert may be part of update-tuple-routing or a plain INSERT
tuple-routing. Also, we may not require map for *every* subplan. It
may happen that many of the update subplans do not return any tuples,
in which case we don't require the maps for the partitions
corresponding to those subplans. This optimization was also suggested
by Thomas Munro initially.

2. In case of UPDATE, for partitions that take part in update scans,
there should be a single map; there should not be two separate maps,
one for accessing per-subplan and the other for accessing per-leaf. My
approach for this was to have a per-leaf array and a per-subplan
array, but they should share the maps wherever possible. I think this
is what you are suggesting in your second choice. The other approach
is as suggested by Amit Langote (which is present in the latest
versions of the patch), where we have an array of maps, and a
subplan-offsets array.


So your preference is for #1. But I think this optimization is not
specific for update-tuple-routing. This was applicable for inserts
also, from the beginning. And we can do this on-demand stuff for
subplan maps also.

Both optimizations are good, and they are independently required. But
I think optimization#2 is purely relevant to update-tuple-routing, so
we should do it now. We can do optimization #1 as a general
optimization, over and above optimization #2. So my opinion is, we do
#1 not as part of update-tuple-routing patch.

For optimization#2 (i.e. your second choice), I can revert back to the
way I had earlier used two different arrays, with per-leaf array
re-using the per-subplan maps.

Let me know if you are ok with this plan.

Then later once we do optimization #1, the maps will not be just
shared between per-subplan and per-leaf arrays, they will also be
created only when required.


Regarding the array names ...

Regardless of any approach, we are going to require two array maps,
one is per-subplan, and the other per-leaf. Now, for transition
capture, we would require both of these maps: per-subplan for
capturing updated rows, and per-leaf for routed rows. And during
update-tuple-routing, for converting the tuple from source partition
to root partition, we require only per-subplan map.

So if we name the per-subplan map as mt_transition_tupconv_maps, it
implies the per-leaf map is not used for transition capture, which is
incorrect. Similar thing, if we name the per-leaf map as
mt_transition_tupconv_maps.

Update-tuple-routing uses only per-subplan map. So per-subplan map can
be named mt_update_tupconv_maps. But again, how can we name the
per-leaf map ?

Noting all this, I feel we can go with names according to the
structure of maps. Something like : mt_perleaf_tupconv_maps, and
mt_persubplan_tupconv_maps. Other suggestions welcome.


>
>> -------------------
>> 4. Explicit signaling for "we are only here for transition tables"
>> -------------------
>>
>> I had given a thought on this earlier. I felt, even the pre-existing
>> conditions like "!trigdesc->trig_update_after_row" are all indirect
>> ways to determine that this function is called only to capture
>> transition tables, and thought that it may have been better to have
>> separate parameter transition_table_only.
>
> I see your point. I guess it's not really this patch's job to solve
> this problem, although I think this is going to need some refactoring
> in the not-too-distant future.  So I think the way you did it is
> probably OK.
>
>> Instead of adding another parameter to AfterTriggerSaveEvent(), I had
>> also considered another approach: Put the transition-tuples-capture
>> logic part of AfterTriggerSaveEvent() into a helper function
>> CaptureTransitionTables(). In ExecInsert() and ExecDelete(), instead
>> of calling ExecARUpdateTriggers(), call this function
>> CaptureTransitionTables(). I then dropped this idea and thought rather
>> to call ExecARUpdateTriggers() which neatly does the required checks
>> and other things like locking the old tuple via GetTupleForTrigger().
>> So if we go by CaptureTransitionTables(), we would need to do what
>> ExecARUpdateTriggers() does before calling CaptureTransitionTables().
>> This is doable. If you think this is worth doing so as to get rid of
>> the "(oldtup == NULL) ^ (newtup == NULL)" condition, we can do that.
>
> Duplicating logic elsewhere to avoid this problem here doesn't seem
> like a good plan.

Yeah, ok.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


pgsql-hackers by date:

Previous
From: Jeevan Chalke
Date:
Subject: Re: [HACKERS] Partition-wise aggregation/grouping
Next
From: Tatsuro Yamada
Date:
Subject: add queryEnv to ExplainOneQuery_hook