Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM) - Mailing list pgsql-hackers

From Pavan Deolasee
Subject Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)
Date
Msg-id CABOikdPEhYdsDqa07QR+x-TAeBsaNdSUNoosSzo8xTQ79W-UUw@mail.gmail.com
In response to Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers


On Wed, Mar 22, 2017 at 4:06 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Mar 21, 2017 at 6:47 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>>
>
> Please find attached rebased patches.
>

Few comments on 0005_warm_updates_v18.patch:


Thanks a lot Amit for review comments.
 
1.
@@ -806,20 +835,35 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
{
..
- if (callback && callback(htup, callback_state))
+ if (callback)
  {
- kill_tuple = true;
-
- if (tuples_removed)
- *tuples_removed += 1;
+ result = callback(htup, is_warm, callback_state);
+ if (result == IBDCR_DELETE)
+ {
+ kill_tuple = true;
+ if (tuples_removed)
+ *tuples_removed += 1;
+ }
+ else if (result == IBDCR_CLEAR_WARM)
+ {
+ clear_tuple = true;
+ }
  }
  else if (split_cleanup)
..
}

I think this will break the existing mechanism of split cleanup.  We
need to check for split cleanup if the tuple is not deletable by the
callback.  This is not merely an optimization but a must condition,
because we will clear the split-cleanup flag after this bucket is
scanned completely.


Ok, I see. Fixed, but please check if this looks good.
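To make the intended control flow concrete, here is a small standalone sketch of the decision logic (a toy model, not the patch code; `decide_tuple` and its parameters are illustrative, only the IBDCR_* names come from the diff). The key point from your comment: when the callback does not ask for deletion, the split-cleanup test must still run.

```c
#include <assert.h>
#include <stdbool.h>

/* Modeled after the patch's callback result codes. */
typedef enum
{
    IBDCR_KEEP,
    IBDCR_DELETE,
    IBDCR_CLEAR_WARM
} IndexBulkDeleteCallbackResult;

/*
 * Decide the fate of one index tuple.  Even when the callback only asks
 * to clear the WARM flag (or to keep the tuple), split cleanup may still
 * have to remove it, because the split-cleanup flag is cleared once the
 * bucket has been scanned completely.
 */
static void
decide_tuple(IndexBulkDeleteCallbackResult result,
             bool split_cleanup, bool tuple_belongs_elsewhere,
             bool *kill_tuple, bool *clear_tuple)
{
    *kill_tuple = false;
    *clear_tuple = false;

    if (result == IBDCR_DELETE)
        *kill_tuple = true;
    else
    {
        if (result == IBDCR_CLEAR_WARM)
            *clear_tuple = true;
        /* not deleted by the callback: split cleanup may still apply */
        if (split_cleanup && tuple_belongs_elsewhere)
            *kill_tuple = true;
    }
}
```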
 
2.
- PageIndexMultiDelete(page, deletable, ndeletable);
+ /*
+  * Clear the WARM pointers.
+  *
+  * We must do this before dealing with the dead items because
+  * PageIndexMultiDelete may move items around to compactify the
+  * array and hence offnums recorded earlier won't make any sense
+  * after PageIndexMultiDelete is called.
+  */
+ if (nclearwarm > 0)
+ _hash_clear_items(page, clearwarm, nclearwarm);
+
+ /*
+  * And delete the deletable items
+  */
+ if (ndeletable > 0)
+ PageIndexMultiDelete(page, deletable, ndeletable);

I think this assumes that the items where we need to clear warm flag
are not deletable, otherwise what is the need to clear the flag if we
are going to delete the tuple.  The deletable tuple can have a warm
flag if it is deletable due to split cleanup.


Yes. Since the callback will either say IBDCR_DELETE or IBDCR_CLEAR_WARM, I don't think we will ever have a situation where a tuple is both deleted and cleared. I also checked that the bucket split code carries the WARM flag correctly to the new bucket.

Based on your first comment, I believe the rearranged code will take care of deleting a tuple even if the WARM flag is set, when the deletion is because of a bucket split.
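To illustrate why the clearing has to happen before the deletion, here is a toy model of a compacting delete (illustrative only; the array-of-ints "page" and `multi_delete` are stand-ins, not PostgreSQL's page format). Like PageIndexMultiDelete, it shifts later items down, so any offsets recorded before the call stop being meaningful afterwards.

```c
#include <assert.h>
#include <stdbool.h>

#define WARM_FLAG 0x01

/*
 * Compacting delete in the spirit of PageIndexMultiDelete: surviving
 * items are moved down to close the gaps, invalidating previously
 * recorded offsets.
 */
static void
multi_delete(int *items, int *nitems, const int *deletable, int ndeletable)
{
    int dst = 0;

    for (int src = 0; src < *nitems; src++)
    {
        bool drop = false;

        for (int i = 0; i < ndeletable; i++)
            if (deletable[i] == src)
                drop = true;
        if (!drop)
            items[dst++] = items[src];
    }
    *nitems = dst;
}
```

Clearing the WARM flag at the recorded offset first, and only then calling the delete, keeps every recorded offset valid at the moment it is used.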

3.
+ /*
+  * HASH indexes compute a hash value of the key and store that in the
+  * index. So we must first obtain the hash of the value obtained from
+  * the heap and then do a comparison.
+  */
+ _hash_convert_tuple(indexRel, values, isnull, values2, isnull2);

I think here you need to handle the case where the heap has a NULL value,
as the hash index doesn't contain NULL values; otherwise, the code in
the function below can return true, which is not right.


I think we can simply conclude that hashrecheck fails the equality check if the heap has a NULL value, because such a tuple could not have been reached via the hash index unless a non-NULL hash key was later updated to a NULL key, right?
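Spelling out that rule as a minimal sketch (a hypothetical helper, not the patch's hashrecheck; the real code also walks the index tuple's attributes):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hash indexes never store NULLs, so a NULL heap value can never match a
 * stored hash code: report "not equal" immediately.  Otherwise compare
 * the freshly computed hash of the heap value against the stored one.
 */
static bool
hash_recheck_one(bool heap_isnull, uint32_t heap_hash, uint32_t index_hash)
{
    if (heap_isnull)
        return false;           /* NULL never matches a stored hash */
    return heap_hash == index_hash;
}
```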
 
4.
+bool
+hashrecheck(Relation indexRel, IndexInfo *indexInfo, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
{
..
+ att = indexRel->rd_att->attrs[i - 1];
+ if (!datumIsEqual(values2[i - 1], indxvalue, att->attbyval,
+ att->attlen))
+ {
+ equal = false;
+ break;
+ }
..
}

Hash values are always uint32 and attlen can be different for
different datatypes, so I think above doesn't seem to be the right way
to do the comparison.


Since we're referring to the attr from the index relation, wouldn't it tell us the attribute specs of what gets stored in the index rather than what's in the heap? I could be wrong, but some quick tests show that pg_attribute.attlen for the index relation always returns 4, irrespective of the underlying heap data type.
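The point being that whatever the heap datatype, what the hash index stores and compares is a fixed-width uint32 hash code. A quick standalone illustration (using FNV-1a as a stand-in hash; PostgreSQL of course uses its own per-type hash functions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in 32-bit hash (FNV-1a).  The input can be any length -- a text
 * key, an int8, anything -- but the stored/compared value is always a
 * 4-byte hash code, which is why the index attribute's attlen is 4.
 */
static uint32_t
fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;

    while (len--)
    {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}
```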
 
5.
@@ -274,6 +301,8 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
  OffsetNumber offnum;
  ItemPointer current;
  bool res;
+ IndexTuple itup;
+

  /* Hash indexes are always lossy since we store only the hash code */
  scan->xs_recheck = true;

@@ -316,8 +345,6 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
  offnum <= maxoffnum;
  offnum = OffsetNumberNext(offnum))
  {
- IndexTuple itup;
-
Why above change?


Seems spurious. Fixed.
 

6.
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_indexvac_phase1, (void *) vacrelstats);
+ ereport(elevel,
+ (errmsg("scanned index \"%s\" to remove %d row version, found "
+ "%0.f warm pointers, %0.f clear pointers, removed "
+ "%0.f warm pointers, removed %0.f clear pointers",
+ RelationGetRelationName(indrel),
+ vacrelstats->num_dead_tuples,
+ (*stats)->num_warm_pointers,
+ (*stats)->num_clear_pointers,
+ (*stats)->warm_pointers_removed,
+ (*stats)->clear_pointers_removed)));
+
+ (*stats)->num_warm_pointers = 0;
+ (*stats)->num_clear_pointers = 0;
+ (*stats)->warm_pointers_removed = 0;
+ (*stats)->clear_pointers_removed = 0;
+ (*stats)->pointers_cleared = 0;
+
+ *stats = index_bulk_delete(&ivinfo, *stats,
+ lazy_indexvac_phase2, (void *) vacrelstats);

To convert WARM chains, we need to do two index passes for all the
indexes.  I think it can substantially increase the random I/O.  This
can help us in doing more WARM updates, but I don't see how the
downside of that (increased random I/O) will be acceptable for all
kinds of cases.


Yes, this is a very fair point. The way I proposed to address this upthread is by introducing a set of threshold/scale GUCs specific to WARM, so users can control when to invoke WARM cleanup. We do the two index scans only when WARM cleanup is required; otherwise vacuum works the way it works today, without any additional overhead.

We already have some intelligence to skip the second index scan if we did not find any WARM candidate chains during the first heap scan. This should take care of the majority of users, who never update their indexed columns. For the others, we need either a knob or some built-in way to deduce whether to do WARM cleanup.

Does that seem worthwhile?
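For concreteness, the kind of check I have in mind would mirror the autovacuum threshold/scale-factor pattern, something like the sketch below (the GUC names and values here are purely illustrative, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical knobs, modeled on autovacuum_vacuum_threshold and
 * autovacuum_vacuum_scale_factor; names are illustrative. */
static double warm_cleanup_scale_factor = 0.2;
static int    warm_cleanup_threshold = 50;

/*
 * Run the two-pass WARM index cleanup only once enough candidate chains
 * have accumulated relative to table size; otherwise vacuum behaves
 * exactly as it does today and pays no extra random I/O.
 */
static bool
need_warm_cleanup(double reltuples, long warm_chains)
{
    return warm_chains >
        warm_cleanup_threshold + warm_cleanup_scale_factor * reltuples;
}
```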
 

+exists. Since index vacuum may visit these pointers in any order, we will need
+another index pass to remove dead index pointers. So in the first index pass we
+check which WARM candidates have 2 index pointers. In the second pass, we
+remove the dead pointer and clear the INDEX_WARM_POINTER flag if that's the
+surviving index pointer.

I think there is some mismatch between README and code.  In README, it
is mentioned that dead pointers will be removed in the second phase,
but I think the first phase code lazy_indexvac_phase1() will also
allow to delete the dead pointers (it can return IBDCR_DELETE which
will allow index am to remove dead items.).  Am I missing something
here?


Hmm.. fixed the README. Clearly we do allow removal, in the first index pass itself, of pointers that are known to be certainly dead. Other pointers can be removed during the second scan, once we know about the existence or non-existence of WARM index pointers.
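In other words, the two passes make these decisions (a toy model of the split of responsibilities; the `phase1`/`phase2` helpers and parameters are illustrative, the real logic lives in lazy_indexvac_phase1/lazy_indexvac_phase2):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    IBDCR_KEEP,
    IBDCR_DELETE,
    IBDCR_CLEAR_WARM
} Action;

/* Pass 1: pointers known to be certainly dead can go immediately; for
 * the rest we only record which WARM candidates have two index pointers. */
static Action
phase1(bool certainly_dead)
{
    return certainly_dead ? IBDCR_DELETE : IBDCR_KEEP;
}

/* Pass 2: with full knowledge of the chain, remove the remaining dead
 * pointer and clear the WARM flag on the surviving pointer if set. */
static Action
phase2(bool pointer_is_dead, bool survivor_is_warm)
{
    if (pointer_is_dead)
        return IBDCR_DELETE;
    return survivor_is_warm ? IBDCR_CLEAR_WARM : IBDCR_KEEP;
}
```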
 

7.
+ * For CLEAR chains, we just kill the WARM pointer, if it exist,s and keep
+ * the CLEAR pointer.

typo (exist,s)


Fixed.
 
8.
+/*
+ * lazy_indexvac_phase2() -- run first pass of index vacuum

Shouldn't this be -- run the second pass


Yes, fixed.
 
9.
- indexInfo); /* index AM may need this */
+indexInfo, /* index AM may need this */
+(modified_attrs != NULL)); /* type of uniqueness check to do */

comment for the last parameter seems to be wrong.


Fixed.
 
10.
+follow the update chain everytime to the end to see check if this is a WARM
+chain.

"see check" - seems one of those words is sufficient to explain the meaning.


Fixed.
 
11.
+chain. This simplifies the design and addresses certain issues around
+duplicate scans.

"duplicate scans" - shouldn't be duplicate key scans.


Ok, seems better. Fixed.
 
12.
+index on the table, irrespective of whether the key pertaining to the
+index changed or not.

typo.
/index changed/index is changed


Fixed.
 
13.
+For example, if we have a table with two columns and two indexes on each
+of the column. When a tuple is first inserted the table, we have exactly

typo.
/inserted the table/inserted in the table


Fixed.
 
14.
+ lp [1]  [2]
+ [1111, aaaa]->[111, bbbb]

Here, after the update, the first column should be 1111.


Fixed.
 
Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services