Re: Shared detoast Datum proposal - Mailing list pgsql-hackers

From: Andy Fan
Subject: Re: Shared detoast Datum proposal
Msg-id: 87ikz5i9m1.fsf@163.com
In response to: Re: Shared detoast Datum proposal (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:

> On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213@163.com> wrote:
>> One more thing I want to highlight: "syscache" is used for metadata
>> and *detoast cache* is used for user data. User data is more likely
>> bigger than metadata, so cache size control (hence eviction logic
>> running more often) is more necessary in this case. The first graph
>> in [1] (including the quoted text) highlights this idea.
>
> Yes.
>
>> I think that is the same idea as design a.
>
> Sounds similar.

This is really great to know!

>> I think the key points include:
>>
>> 1. Where to put the *detoast value* so that the ExprEval system can
>>    find it cheaply. The solution in A is slot->tts_values[*].
>>
>> 2. How to manage the memory for holding the detoast value.
>>
>>    The solution in A is:
>>
>>    - using a lazily created MemoryContext in the slot.
>>    - letting the detoast system detoast the datum into that
>>      MemoryContext.
>>
>> 3. How to let expression evaluation run the extra cycles when needed.
>
> I don't understand (3) but I agree that (1) and (2) are key points. In
> many - nearly all - circumstances just overwriting slot->tts_values is
> unsafe and will cause problems. To make this work, we've got to find
> some way of doing this that is compatible with general design of
> TupleTableSlot, and I think in that API, just about the only time you
> can touch slot->tts_values is when you're trying to populate a virtual
> slot. But often we'll have some other kind of slot. And even if we do
> have a virtual slot, I think you're only allowed to manipulate
> slot->tts_values up until you call ExecStoreVirtualTuple.

Please give me one more chance to explain this. What I mean is:

Take "SELECT f(a) FROM t1 JOIN t2 ...;" for example. When we read the
Datum of a Var, we read it from slot->tts_values[*], no matter what kind
of TupleTableSlot it is. So if we can put the "detoast datum" back into
the "original" tts_values, then nothing else needs to be changed.
Aside from efficiency considerations, safety and soundness are also
important considerations. So what I think now is:

- tts_values[*] means the Datums from the slot, and it has its
  operations like slot_getsomeattrs, ExecMaterializeSlot, ExecCopySlot,
  ExecClearTuple and so on. None of these characteristics depend on
  what kind of slot it is.

- Saving the "detoast datum" version into tts_values[*] doesn't break
  the above semantics; the ExprEval engine just gets a detoasted
  version, so it doesn't need to detoast it again.

- The key point is the memory management and efficiency. For now I
  think all the places where "slot->tts_nvalid" is set to 0 mean the
  tts_values[*] are no longer valid, so that is the place we should
  release the memory for the "detoast datum". All the other places like
  ExecMaterializeSlot or ExecCopySlot also need to think about the
  "detoast datum" as well.

> That's why I mentioned using something temporary. You could legally
> (a) materialize the existing slot, (b) create a new virtual slot, (c)
> copy over the tts_values and tts_isnull arrays, (c) detoast any datums
> you like from tts_values and store the new datums back into the array,
> (d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
> of the original one. That would avoid repeated detoasting, but it
> would also have a bunch of overhead.

Yes, I agree about the overhead, so I prefer to find out first whether
my current strategy has a fundamental issue.

> Having to fully materialize the
> existing slot is a cost that you wouldn't always need to pay if you
> didn't do this, but you have to do it to make this work. Creating the
> new virtual slot costs something. Copying the arrays costs something.
> Detoasting costs quite a lot potentially, if you don't end up using
> the values. If you end up needing a heap tuple representation of the
> slot, you're going to now have to make a new one rather than just
> using the one that came straight from the buffer.
> But I do agree with you
> that *if* we could make it work, it would probably be better than a
> longer-lived cache.

I noticed the "if" and it is great to know that :), especially for the
efficiency & memory usage control purposes.

Another big challenge of this is how to decide whether a Var needs this
pre-detoasting strategy; we have to consider:

- The in-place tts_values[*] update makes the slot bigger, which is
  harmful for the CP_SMALL_TLIST (createplan.c) purpose.

- ExprEval run-time checking of whether a Var is toastable is an
  overhead, so it must be decided at ExecInitExpr time.

The code complexity caused by the above 2 factors makes people
(including me) unhappy.

===

Yesterday I did some worst-case testing of the current strategy, and the
first case showed me a ~1% *unexpected* performance regression. I tried
to track it down with perf record/report, with no luck; that's the
reason why I didn't send a complete post yesterday. Since we are talking
about the high-level design for now, I think it is OK to post the
improved design document, the incomplete worst-case testing data and my
plan.

Current Design
--------------

The high-level design is to let createplan.c and setrefs.c decide which
Vars can use this feature, and then have the executor save the detoasted
datum back to slot->tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons include:

- The existing expression engine reads datums from tts_values[*], so no
  extra work needs to be done there.

- It reuses the lifespan of the TupleTableSlot system to manage memory.
  It is natural to think of the detoasted datum as a tts_value that just
  happens to be in a detoasted format. Since we already have a clear
  lifespan for a TupleTableSlot, via ExecClearTuple, ExecCopySlot etc.,
  it is easy to reuse them for managing the detoasted datum's memory.
- The existing projection method copies the detoasted datum (a
  pointer-sized Datum) automatically to the next node's slot while
  keeping the ownership unchanged, so only the slot where the detoast
  really happened is in charge of its lifespan.

Assuming it has already been decided in createplan.c and setrefs.c which
Vars should use this feature, the following strategy is used to reduce
the run-time overhead:

1. Put the detoasted datum into slot->tts_values[*], so the ExprEval
   system doesn't need any extra effort to find it.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        if (unlikely(slot->tts_data_mctx == NULL))
            slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
                                                          "tts_value_ctx",
                                                          ALLOCSET_START_SMALL_SIZES);
        slot->tts_values[attnum] =
            PointerGetDatum(detoast_attr_ext((struct varlena *) slot->tts_values[attnum],
                                             /* save the detoast value to the given MemoryContext. */
                                             slot->tts_data_mctx));
    }
}

2. Use a dedicated, lazily created memory context under the
   TupleTableSlot so that the memory can be released effectively.

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    if (slot->tts_data_mctx)
        MemoryContextResetOnly(slot->tts_data_mctx);
}

3. Control the memory usage by releasing it whenever
   slot->tts_values[*] is deemed invalid, so the lifespan is likely
   short.

4. To reduce the run-time overhead of checking whether a Var should use
   the pre-detoast feature, 3 new steps are introduced:
   EEOP_{INNER,OUTER,SCAN}_VAR_TOAST, so that all other Vars are
   completely unaffected.

Now comes the createplan.c/setrefs.c part, which decides which Vars
should use the shared detoast feature. The guideline is:

1. A detoast is needed for the given expression in the existing logic.

2. It should not break the CP_SMALL_TLIST design. Since we save the
   detoasted datum back to tts_values[*], the tuple becomes bigger;
if we do this blindly, it would be harmful to the ORDER / HASH style
nodes.

A high-level data flow is:

1. In createplan.c, we walk the plan tree to gather the CP_SMALL_TLIST
   information caused by SORT/HASH style nodes, and save it to
   Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. In setrefs.c, fix_{scan|join}_expr recurses down to the Vars of each
   expression, so that is a good time to track the attribute numbers
   and see whether a Var is accessed directly or indirectly. Usually
   indirect access to a Var means a detoast would happen, for example
   in an expression like "a > 3". However, some known expressions are
   ignored: for example NullTest and pg_column_compression, which need
   the raw datum, or starts_with/substring, which need a slice
   detoast. Currently this is somewhat hard-coded here; we may need a
   pg_proc.detoasting_requirement flag to make it generic. The output
   is {Scan|Join}.xxx_reference_attrs. Note that I used '_reference_'
   rather than '_detoast_' here because at this stage I still don't
   know whether it is a toastable attribute, which is only known at the
   MakeTupleTableSlot stage.

3. At the InitPlan stage, we calculate the final xxx_pre_detoast_attrs
   in ScanState & JoinState, which is passed into the expression engine
   at the ExecInitExprRec stage, and the EEOP_{INNER|OUTER|SCAN}_VAR_TOAST
   steps are finally generated. Then everything is connected with
   ExecSlotDetoastDatum!
Bad case testing of current design
==================================

- The extra effort of createplan.c & setrefs.c & execExpr.c & InitPlan:

CREATE TABLE t(a int, b numeric);

q1:
explain select * from t where a > 3;

q2:
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_nestloop to on;
explain select * from t join t t2 using(a);

q3:
set enable_hashjoin to off;
set enable_mergejoin to on;
set enable_nestloop to off;
explain select * from t join t t2 using(a);

q4:
set enable_hashjoin to on;
set enable_mergejoin to off;
set enable_nestloop to off;
explain select * from t join t t2 using(a);

pgbench -f x.sql -n postgres -M simple -T10

| query ID | patched (ms)                        | master (ms)                         | regression(%) |
|----------+-------------------------------------+-------------------------------------+---------------|
|        1 | {0.088, 0.088, 0.088, 0.087, 0.089} | {0.087, 0.086, 0.086, 0.087, 0.087} | 1.14%         |
|        2 | not-done-yet                        |                                     |               |
|        3 | not-done-yet                        |                                     |               |
|        4 | not-done-yet                        |                                     |               |

- The extra effort of ExecInterpExpr:

insert into t select i, i FROM generate_series(1, 100000) i;
SELECT count(b) from t WHERE b > 0;

In this query, the extra execution code is run but it doesn't buy
anything.

| master | patched | regression |
|--------+---------+------------|
|        |         |            |

What I am planning now is:

- Gather more feedback on the overall strategy.

- Figure out the 1% unexpected regression. I don't have a clear
  direction for now; my current approach is analyzing the perf report,
  but I can't find the root cause, and reading the code doesn't reveal
  it either.

- Simplify the "planner part" for deciding which Vars should use this
  strategy. I don't have a clear direction here either.

- Test and improve more worst cases.

Attached is a version that applies to and can be tested against current
master (e87e73245).

--
Best Regards
Andy Fan