Re: Shared detoast Datum proposal - Mailing list pgsql-hackers

From Andy Fan
Subject Re: Shared detoast Datum proposal
Date
Msg-id 87ikz5i9m1.fsf@163.com
In response to Re: Shared detoast Datum proposal  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Shared detoast Datum proposal
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:

> On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213@163.com> wrote:
>> One more thing I want to highlight: "syscache" is used for metadata
>> and *detoast cache* is used for user data. User data is more
>> likely bigger than metadata, so cache size control (hence eviction logic
>> running more often) is more necessary in this case. The first graph in [1]
>> (including the quoted text) highlights this idea.
>
> Yes.
>
>> I think that is same idea as design a.
>
> Sounds similar.
>

This is really great to know!

>> I think the key points include:
>>
>> 1. Where to put the *detoast value* so that the ExprEval system can
>> find it cheaply. The solution in A is slot->tts_values[*].
>>
>> 2. How to manage the memory for holding the detoast value.
>>
>> The solution in A is:
>>
>> - using a lazily created MemoryContext in the slot.
>> - letting the detoast system detoast the datum into the designated
>> MemoryContext.
>>
>> 3. How to let Expr evaluation run the extra cycles when needed.
>
> I don't understand (3) but I agree that (1) and (2) are key points. In
> many - nearly all - circumstances just overwriting slot->tts_values is
> unsafe and will cause problems. To make this work, we've got to find
> some way of doing this that is compatible with general design of
> TupleTableSlot, and I think in that API, just about the only time you
> can touch slot->tts_values is when you're trying to populate a virtual
> slot. But often we'll have some other kind of slot. And even if we do
> have a virtual slot, I think you're only allowed to manipulate
> slot->tts_values up until you call ExecStoreVirtualTuple.

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 JOIN t2 ...; as an example.

When we read the Datum of a Var, we read it from slot->tts_values[*],
no matter what kind of TupleTableSlot it is.  So if we can put the
detoasted datum back into the "original" tts_values[*], nothing else
needs to change.

Aside from efficiency considerations, safety and design soundness are
also important. So what I think now is:

- tts_values[*] holds the Datums of the slot, and the slot has
  operations like slot_getsomeattrs, ExecMaterializeSlot, ExecCopySlot,
  ExecClearTuple and so on. None of these semantics depend on what kind
  of slot it is.

- Saving the detoasted version of a datum into tts_values[*] doesn't
  break the above semantics, and the ExprEval engine simply gets an
  already-detoasted version, so it doesn't need to detoast it again.

- The key point is memory management and efficiency. For now I think
  every place where slot->tts_nvalid is set to 0 means the
  tts_values[*] are no longer valid, so that is the place where we
  should release the memory for the detoasted datums.  The other
  places like ExecMaterializeSlot or ExecCopySlot also need to take
  the detoasted datums into account.
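To make the intended lifecycle concrete, here is a deliberately tiny
standalone C simulation. ToySlot, slot_detoast and slot_clear are
invented names that only loosely mirror the TupleTableSlot fields;
this is a sketch of the idea, not the patch's code:

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the slot lifecycle described above. */
#define TOY_NATTS 4

typedef struct ToySlot
{
    int   nvalid;                 /* like tts_nvalid */
    char *values[TOY_NATTS];      /* like tts_values[] */
    char *detoast_buf[TOY_NATTS]; /* slot-owned detoast memory */
} ToySlot;

/* "Detoast" attnum: expand into slot-owned memory and overwrite
 * values[attnum], as the proposal stores the detoasted datum back
 * into tts_values[*]. */
static void
slot_detoast(ToySlot *slot, int attnum, const char *expanded)
{
    size_t len = strlen(expanded) + 1;

    free(slot->detoast_buf[attnum]);
    slot->detoast_buf[attnum] = malloc(len);
    memcpy(slot->detoast_buf[attnum], expanded, len);
    slot->values[attnum] = slot->detoast_buf[attnum];
}

/* The point where tts_nvalid would go to 0 is where all the detoast
 * memory is released, mirroring the idea behind ExecFreePreDetoastDatum. */
static void
slot_clear(ToySlot *slot)
{
    for (int i = 0; i < TOY_NATTS; i++)
    {
        free(slot->detoast_buf[i]);
        slot->detoast_buf[i] = NULL;
        slot->values[i] = NULL;
    }
    slot->nvalid = 0;
}
```

The point is simply that the release of the detoast memory piggybacks
on the existing "tts_nvalid = 0" invalidation point, so no new
lifecycle events are needed.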

> That's why I mentioned using something temporary. You could legally
> (a) materialize the existing slot, (b) create a new virtual slot, (c)
> copy over the tts_values and tts_isnull arrays, (c) detoast any datums
> you like from tts_values and store the new datums back into the array,
> (d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
> of the original one. That would avoid repeated detoasting, but it
> would also have a bunch of overhead.

Yes, I agree about the overhead, so I prefer to find out first whether
my current strategy has any fundamental issue.

> Having to fully materialize the
> existing slot is a cost that you wouldn't always need to pay if you
> didn't do this, but you have to do it to make this work. Creating the
> new virtual slot costs something. Copying the arrays costs something.
> Detoasting costs quite a lot potentially, if you don't end up using
> the values. If you end up needing a heap tuple representation of the
> slot, you're going to now have to make a new one rather than just
> using the one that came straight from the buffer.

> But I do agree with you
> that *if* we could make it work, it would probably be better than a
> longer-lived cache.


I noticed the "if", and it is great to know that :), especially for
the efficiency & memory usage control purposes.

Another big challenge is how to decide whether a Var needs this
pre-detoasting strategy. We have to consider:

- The in-place tts_values[*] update makes the slot bigger, which is
  harmful for the CP_SMALL_TLIST (createplan.c) purpose.
- Checking at ExprEval run time whether a Var is toastable is an
  overhead; it must be decided at ExecInitExpr time.

The code complexity caused by these two factors makes people
(including me) unhappy.

===
Yesterday I did some worst-case testing of the current strategy. The
first case showed a ~1% *unexpected* performance regression, and I
tried to track it down with perf record/report, with no luck; that's
the reason why I didn't send a complete post yesterday.

For now, since we are talking about the high-level design, I think it
is OK to post the improved design document, the incomplete worst-case
testing data, and my plan.

Current Design
--------------

The high-level design is to let createplan.c and setrefs.c decide
which Vars can use this feature, and then have the executor save the
detoasted datum back into slot->tts_values[*] during the ExprEvalStep
of EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons include:

- The existing expression engine reads datums from tts_values[*], so
  no extra work needs to be done.
- We reuse the lifespan of the TupleTableSlot system to manage memory.
  It is natural to think of the detoasted datum as a tts_value that
  just happens to be in detoasted format. Since TupleTableSlot already
  has a clear lifecycle (ExecClearTuple, ExecCopySlot, etc.), it is
  easy to reuse it for managing the detoasted datum's memory.
- The existing projection method copies the detoasted datum (a
  pointer-sized Datum) to the next node's slot automatically while
  keeping the ownership unchanged, so only the slot where the
  detoasting actually happened is in charge of its lifespan.
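The ownership rule in the last bullet can be simulated in a few lines
of standalone C. MiniSlot and these helpers are invented for
illustration, not the executor's real API:

```c
#include <stdlib.h>
#include <string.h>

/* Toy illustration of "copy the datum, keep the ownership":
 * projection copies only the pointer, and only the slot that
 * performed the detoast frees the buffer. */
typedef struct MiniSlot
{
    char *value;     /* pointer-sized datum, possibly shared */
    char *owned_buf; /* non-NULL only in the slot that detoasted */
} MiniSlot;

static void
mini_detoast(MiniSlot *slot, const char *expanded)
{
    size_t len = strlen(expanded) + 1;

    slot->owned_buf = malloc(len);
    memcpy(slot->owned_buf, expanded, len);
    slot->value = slot->owned_buf;
}

/* Projection to the next node's slot: pointer copy only. */
static void
mini_project(MiniSlot *dst, const MiniSlot *src)
{
    dst->value = src->value;
    dst->owned_buf = NULL;  /* ownership stays with src */
}

/* Clearing a slot frees only what it owns, so shared pointers
 * cannot be double-freed. */
static void
mini_clear(MiniSlot *slot)
{
    free(slot->owned_buf);
    slot->owned_buf = NULL;
    slot->value = NULL;
}
```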

Assume that which Vars should use this feature has already been
decided in createplan.c and setrefs.c. The following strategy is then
used to reduce the run-time overhead.

1. Put the detoasted datum into slot->tts_values[*], so that the
   ExprEval system doesn't need any extra effort to find it.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        if (unlikely(slot->tts_data_mctx == NULL))
            slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
                                                          "tts_value_ctx",
                                                          ALLOCSET_START_SMALL_SIZES);

        /* Save the detoasted value into the slot's dedicated MemoryContext. */
        slot->tts_values[attnum] =
            PointerGetDatum(detoast_attr_ext((struct varlena *) slot->tts_values[attnum],
                                             slot->tts_data_mctx));
    }
}

2. Use a dedicated, lazily created memory context under the
   TupleTableSlot so that the memory can be released efficiently.

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    if (slot->tts_data_mctx)
        MemoryContextResetOnly(slot->tts_data_mctx);
}

3. Control the memory usage by releasing it whenever the
   slot->tts_values[*] are deemed invalid, so the lifespan is likely
   short.

4. To reduce the run-time overhead of checking whether a Var should
   use the pre-detoast feature, three new steps are introduced,
   EEOP_{INNER,OUTER,SCAN}_VAR_TOAST, so that all other Vars are
   completely unaffected.
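The idea behind the dedicated steps can be shown with a toy
interpreter: the toastability decision is baked into the opcode at
"compile" time, so rows with ordinary Vars never test for it. All the
names here (ToyOp, toy_eval, ...) are invented, not the actual EEOP
machinery:

```c
/* Toy model of an expression step list: each attribute got its opcode
 * chosen once (like at ExecInitExpr time), so the per-row loop does
 * no toastability checks for plain Vars. */
typedef enum ToyOp { OP_VAR, OP_VAR_TOAST, OP_DONE } ToyOp;

static int detoast_calls;   /* counts how often we paid the detoast cost */

static const char *
toy_detoast(const char *val)
{
    detoast_calls++;
    return val;             /* pretend we expanded it */
}

/* Interpret a compiled step list over one "row" of values. */
static const char *
toy_eval(const ToyOp *ops, const char **vals)
{
    const char *last = 0;

    for (int i = 0; ops[i] != OP_DONE; i++)
    {
        switch (ops[i])
        {
            case OP_VAR:        /* plain Var: no toast check at all */
                last = vals[i];
                break;
            case OP_VAR_TOAST:  /* only this opcode pays the detoast cost */
                last = toy_detoast(vals[i]);
                break;
            default:
                break;
        }
    }
    return last;
}
```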

Now comes the createplan.c/setrefs.c part, which decides which Vars
should use the shared-detoast feature. The guidelines are:

1. The previous logic would need a detoast for the given expression
   anyway.
2. It should not break the CP_SMALL_TLIST design. Since we save the
   detoasted datum back into tts_values[*], the tuple becomes bigger;
   if we did this blindly, it would be harmful to the SORT / HASH
   style nodes.

A high-level data flow is:

1. In createplan.c, we walk the plan tree to gather the attributes
   that must stay narrow because of the CP_SMALL_TLIST requirement of
   SORT/HASH style nodes, and save that information to
   Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. In setrefs.c, fix_{scan|join}_expr recurses down to the Vars of
   each expression, so it is a good time to track the attribute
   numbers and see whether a Var is accessed directly or indirectly.
   Indirect access of a Var usually means a detoast will happen, for
   example in an expression like a > 3. However, some known
   expressions are ignored, for example: NullTest and
   pg_column_compression, which need the raw datum, and
   starts_with/substring, which need slice detoasting. Currently this
   is hard-coded; we may need a pg_proc.detoasting_requirement flag to
   make it generic.  The output is {Scan|Join}.xxx_reference_attrs.

Note that I used '_reference_' rather than '_detoast_' here because at
this stage I still don't know whether an attribute is toastable; that
is only known at the MakeTupleTableSlot stage.

3. At the InitPlan stage, we compute the final xxx_pre_detoast_attrs
   in ScanState & JoinState. They are passed into the expression
   engine at the ExecInitExprRec stage, where the
   EEOP_{INNER|OUTER|SCAN}_VAR_TOAST steps are generated, and finally
   everything is connected through ExecSlotDetoastDatum!
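As a toy sketch of that final computation (the real patch presumably
uses Bitmapset; here plain bit masks stand in, with bit n meaning
attribute n, and the function name is invented):

```c
/* Hypothetical InitPlan-stage computation: pre-detoast the attributes
 * that setrefs.c saw referenced indirectly, minus those forbidden by
 * CP_SMALL_TLIST, intersected with the attributes that are actually
 * toastable (known once the slot's tuple descriptor exists). */
static unsigned
compute_pre_detoast_attrs(unsigned reference_attrs,
                          unsigned forbid_attrs,
                          unsigned toastable_attrs)
{
    return reference_attrs & ~forbid_attrs & toastable_attrs;
}
```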

Bad-case testing of the current design:
=======================================

- The extra effort in createplan.c, setrefs.c, execExpr.c and InitPlan:

CREATE TABLE t(a int, b numeric);

q1:
explain select * from t where a > 3;

q2:
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_nestloop to on;

explain select * from t join t t2 using(a);

q3:
set enable_hashjoin to off;
set enable_mergejoin to on;
set enable_nestloop to off;
explain select * from t join t t2 using(a);


q4:
set enable_hashjoin to on;
set enable_mergejoin to off;
set enable_nestloop to off;
explain select * from t join t t2 using(a);


pgbench -f x.sql -n postgres -M simple -T10

| query ID | patched (ms)                        | master (ms)                         | regression(%) |
|----------+-------------------------------------+-------------------------------------+---------------|
|        1 | {0.088, 0.088, 0.088, 0.087, 0.089} | {0.087, 0.086, 0.086, 0.087, 0.087} |         1.14% |
|        2 | not-done-yet                        |                                     |               |
|        3 | not-done-yet                        |                                     |               |
|        4 | not-done-yet                        |                                     |               |

- The extra effort of ExecInterpExpr

insert into t select i, i FROM generate_series(1, 100000) i;

SELECT count(b) from t WHERE b > 0;

In this query, the extra executor code runs but doesn't buy anything.

| master | patched | regression |
|--------+---------+------------|
|        |         |            |

What I am planning now:
- Gather more feedback on the overall strategy.
- Figure out the 1% unexpected regression. I don't have a clear
  direction for now; my current approach is analyzing the perf report,
  but I can't find the root cause, and reading the code hasn't
  revealed it either.
- Simplify the "planner part" that decides which Vars should use this
  strategy. I don't have a clear direction here either.
- Test and improve more worst cases.

Attached is a version that applies and is testable against the current
master (e87e73245).

--
Best Regards
Andy Fan

