Re: contrib/cache_scan (Re: What's needed for cache-only table scan?) - Mailing list pgsql-hackers

From Kohei KaiGai
Subject Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Msg-id CADyhKSXmK6g37RRTVexFc8b1wT=M7Mc6iWN3Btck+mVMPaOx4Q@mail.gmail.com
In response to Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)  (Haribabu Kommi <kommi.haribabu@gmail.com>)
Responses Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)  (Haribabu Kommi <kommi.haribabu@gmail.com>)
List pgsql-hackers
Hello,

The attached patch is a revised version of the cache-only scan module
on top of the custom-scan interface. Please check it.
The points I changed are listed below.

Also, the custom-scan patch, which this functionality is built on, is still
looking for volunteers to review it:
  https://commitfest.postgresql.org/action/patch_view?id=1282

> 1. +# contrib/dbcache/Makefile
>
>    Makefile header comment is not matched with file name location.
>
Fixed.

> 2.+   /*
> +   * Estimation of average width of cached tuples - it does not make
> +   * sense to construct a new cache if its average width is more than
> +   * 30% of the raw data.
> +   */
>
>
>    Move the estimation of the average width of cached tuples into
> the case where the cache is not found;
>    otherwise it is an overhead for the cache-hit scenario.
>
This logic was moved into the block that runs only when no existing cache
is hit, so the usual (cache-hit) path does not perform the average-width
calculation.

In addition, I added a new GUC (cache_scan.width_threshold) to specify
the threshold that decides whether a cache-scan is used.
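
Just for illustration, such a GUC would typically be registered from the
module's _PG_init() roughly as below. Only the GUC name
cache_scan.width_threshold comes from the patch; the backing variable,
default value and description text are placeholders of mine.

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* hypothetical backing variable for cache_scan.width_threshold */
static int  cache_scan_width_threshold = 30;

void
_PG_init(void)
{
    DefineCustomIntVariable("cache_scan.width_threshold",
                            "Do not build a column cache whose average tuple "
                            "width exceeds this percentage of the raw width",
                            NULL,
                            &cache_scan_width_threshold,
                            30,     /* boot value (assumed) */
                            1,      /* min */
                            100,    /* max */
                            PGC_USERSET,
                            0,
                            NULL, NULL, NULL);
}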

> 3. + if (old_cache)
> + attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
>
>
>    Don't we need a check to see whether the average width is more than 30%?
> During the estimation it doesn't
>    include the other existing attributes.
>
See the new ccache_new_attribute_set(). It works almost like bms_union()
as long as the total width of the union of attribute sets is less than the
threshold. Otherwise, it drops attributes that are included in the existing
cache but not required by this scan. Ideally, I would like statistical
information about how frequently/recently each column is referenced, but no
such information is available right now, so I put in logic that drops the
larger columns first.
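
To illustrate the idea (this is not the actual patch code), a self-contained
sketch that unions the attribute sets and then drops the widest non-required
columns until the estimated width fits the threshold; attribute numbers are
assumed to be 1-based and attwidth[] is a per-column width estimate:

#include "postgres.h"
#include "nodes/bitmapset.h"

static Bitmapset *
choose_cached_attributes(const int *attwidth, int natts,
                         const Bitmapset *attrs_used,   /* needed by this scan */
                         const Bitmapset *old_attrs,    /* cached so far */
                         int width_threshold)
{
    Bitmapset  *result = bms_union(attrs_used, old_attrs);
    int         total = 0;
    int         attno;

    /* estimated total width of the union set */
    for (attno = 1; attno <= natts; attno++)
    {
        if (bms_is_member(attno, result))
            total += attwidth[attno - 1];
    }

    /* drop the widest attribute not required by this scan, until we fit */
    while (total > width_threshold)
    {
        int         victim = -1;

        for (attno = 1; attno <= natts; attno++)
        {
            if (bms_is_member(attno, result) &&
                !bms_is_member(attno, attrs_used) &&
                (victim < 0 || attwidth[attno - 1] > attwidth[victim - 1]))
                victim = attno;
        }
        if (victim < 0)
            break;              /* only required attributes remain */
        total -= attwidth[victim - 1];
        result = bms_del_member(result, victim);
    }
    return result;
}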

> 4. + lchunk->right = cchunk;
> + lchunk->l_depth = TTREE_DEPTH(lchunk->right);
>
>
>   I think it should be lchunk->r_depth that needs to be set in a clockwise
> rotation.
>
Fixed. Also, I found a bug: the upper pointer of the child nodes was not
updated when a rotation was triggered.
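
For reference, a clockwise (right) rotation with those two fixes would look
roughly like the sketch below; the struct layout and the TTREE_DEPTH macro
are my guesses based on the snippets quoted above, not the patch's actual
definitions.

#include "postgres.h"

typedef struct ccache_chunk
{
    struct ccache_chunk *upper;
    struct ccache_chunk *left;
    struct ccache_chunk *right;
    int         l_depth;
    int         r_depth;
    /* ... item pointers, usage, deadspace, etc. ... */
} ccache_chunk;

#define TTREE_DEPTH(c)  ((c) == NULL ? 0 : Max((c)->l_depth, (c)->r_depth) + 1)

/*
 * Rotate clockwise around cchunk; its left child becomes the new local root.
 * Both the depth fields and the upper pointers of the re-parented children
 * must be refreshed.
 */
static ccache_chunk *
ttree_rotate_right(ccache_chunk *cchunk)
{
    ccache_chunk *lchunk = cchunk->left;

    cchunk->left = lchunk->right;
    if (cchunk->left)
        cchunk->left->upper = cchunk;   /* update upper of the moved subtree */
    cchunk->l_depth = TTREE_DEPTH(cchunk->left);

    lchunk->right = cchunk;
    lchunk->upper = cchunk->upper;
    cchunk->upper = lchunk;
    lchunk->r_depth = TTREE_DEPTH(lchunk->right);   /* r_depth, not l_depth */

    return lchunk;      /* caller re-links this under the old parent */
}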

> 5. can you add some comments in the code with how the block is used?
>
I added a source code comment around the definition of shmseg_head.
Is it understandable?

> 6. In do_insert_tuple function I felt moving the tuples and rearranging
> their addresses is little bit costly. How about the following way?
>
>    Always insert the tuple from the bottom of the block where the empty
> space is started and store their corresponding reference pointers
>    in the starting of the block in an array. As and when the new tuple inserts
> this array increases from block start and tuples from block end.
>    Just need to sort this array based on item pointers, no need to update
> their reference pointers.
>
>    In this case the movement is required only when the tuple is moved from
> one block to another block and also whenever if the continuous
>    free space is not available to insert the new tuple. you can decide based
> on how frequent the sorting will happen in general.
>
I added a new "deadspace" field to ccache_chunk. It tracks the payload
consumed by tuples that have already been deleted. Because of the format of
ccache_chunk, the actual free space is
cchunk->usage - offsetof(ccache_chunk, tuples[cchunk->ntups]),
whereas deadspace is only potential free space. So I adjusted do_insert_tuple
to kick the chunk compaction code when the target chunk has enough "potential"
free space but not enough "actual" free space to store the new tuple.
This keeps the overhead of tuple insertion small, since all we need to move
are the pointers within cchunk->tuples, not the tuple contents themselves.
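
A sketch of that decision on the insert path, under the assumption that the
item-pointer array tuples[] grows from the head of the chunk while tuple
bodies are stacked from the tail, with usage marking the lowest occupied
byte; the struct is reduced to the fields used here and the two helper
functions are hypothetical names, not necessarily the patch's actual ones.

#include "postgres.h"
#include "access/htup.h"

typedef struct ccache_chunk
{
    uint32      usage;      /* offset where tuple bodies begin (grows down) */
    uint32      deadspace;  /* bytes still held by already-deleted tuples */
    uint32      ntups;      /* number of live item pointers */
    uint32      tuples[FLEXIBLE_ARRAY_MEMBER];  /* offsets of tuple bodies */
} ccache_chunk;

static void ccache_compact_chunk(ccache_chunk *cchunk);             /* hypothetical */
static void do_insert_tuple(ccache_chunk *cchunk, HeapTuple tuple); /* hypothetical */

static bool
ccache_try_insert(ccache_chunk *cchunk, HeapTuple tuple)
{
    Size        required = MAXALIGN(tuple->t_len) + sizeof(cchunk->tuples[0]);
    Size        actual_free = cchunk->usage -
                    offsetof(ccache_chunk, tuples[cchunk->ntups]);

    if (required > actual_free)
    {
        /*
         * Not enough contiguous space.  If the potential free space
         * (actual free space plus dead space) is large enough, compact
         * the chunk first; otherwise give up on this chunk.
         */
        if (required > actual_free + cchunk->deadspace)
            return false;
        ccache_compact_chunk(cchunk);
    }
    do_insert_tuple(cchunk, tuple);     /* shifts only pointers in tuples[] */
    return true;
}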

> 8. I am not able to find a protection mechanism in insert/delete and etc
> of a tuple in Ttree. As this is a shared memory it can cause problems.
>
I didn't change the existing logic (a giant lock per ccache_head) yet,
because I am still concerned that this kind of optimization would sacrifice
the simplicity of an implementation that is meant as a proof of concept for
the hook. Is it really needed for this purpose?

In addition, it seems to me that we don't have an intent-lock feature, even
though the paper you suggested gave me the impression that it relies on one...

> 9. + /* merge chunks if this chunk has enough space to merge */
> + ccache_merge_chunk(ccache, cchunk);
>
>   Calling the chunk merge for every heap-page-prune callback is an
> overhead for vacuum. Also, after the merge, new data may again lead
>   to node splits.
>
I adjusted the logic a little. Now ccache_merge_chunk() is called only when
the vacuumed cache chunk is different from the previous one, which reduces
the number of (trial) merge operations.
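
In pseudo-C, the idea is roughly the following; the wrapper function name and
the static variable are illustrative, while ccache_merge_chunk() and
ccache_delete_tuple() are the routines mentioned in this thread.

static ccache_chunk *prev_vacuumed_chunk = NULL;

static void
ccache_vacuum_tuple(ccache_head *ccache, ccache_chunk *cchunk, ItemPointer ctid)
{
    ccache_delete_tuple(ccache, cchunk, ctid);

    /* try a merge only when vacuum has moved on to a different chunk */
    if (prev_vacuumed_chunk != NULL && cchunk != prev_vacuumed_chunk)
        ccache_merge_chunk(ccache, cchunk);
    prev_vacuumed_chunk = cchunk;
}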

> 10. "columner" is present in some places of the patch. correct it.
>
Fixed.

> 11. In the cache_scan_next function, when the cache insert fails because of
> shared memory, the tuple pointer is not reset and the cache is NULL.
>     Because of this, the next record fetch trips the assert that cache !=
> NULL.
>
Fixed.

> 12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
>       + cs_put_ccache(ccache);
>
>     The cache is created with refcnt = 2; sometimes put-cache is called twice
> to eliminate it, and sometimes a different approach is used.
>     It is a little bit confusing. Can you explain with comments why
> 2 is required and how it is maintained?
>
I adjusted the logic around the reference count. When a cache is created,
the creator process does not decrease the reference count if creation
completed successfully. If an error happens during creation, the count is
decreased, so the cache can be dropped with a single "put" operation.
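
Roughly, the convention can be pictured as below (a simplified sketch:
cs_put_ccache() is the routine quoted above, the struct is reduced to its
reference counter, and cs_release_ccache() and the failure helper are
hypothetical names for the actual cleanup path):

typedef struct ccache_head
{
    int         refcnt;     /* 2 right after creation */
    /* ... status, attribute set, T-tree root, lock, etc. ... */
} ccache_head;

static void cs_release_ccache(ccache_head *ccache);    /* hypothetical */

static void
cs_put_ccache(ccache_head *ccache)
{
    Assert(ccache->refcnt > 0);
    if (--ccache->refcnt == 0)
        cs_release_ccache(ccache);  /* unlink and free the cache */
}

/*
 * Only on an error during construction does the creator give back its own
 * reference, so a single cs_put_ccache() afterwards drops the half-built
 * cache.  On success nothing is decreased and the creator keeps its
 * reference.
 */
static void
ccache_creation_failed(ccache_head *ccache)
{
    ccache->refcnt--;
    cs_put_ccache(ccache);
}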

> 13. A performance report is required to see how much impact it can cause
> on insert/delete/update operations because of cache synchronizer.
>
I assigned the synchronizer trigger to "pgbench_accounts" and then ran
pgbench against the table with and without the columnar cache.
It seems to me there is no difference larger than the usual run-to-run
variance.

* with columnar-cache

$ pgbench -T 15 postgres
tps = 113.586430 (including connections establishing)
tps = 113.597228 (excluding connections establishing)

* without columnar-cache
$ pgbench -T 15 postgres
tps = 110.765080 (including connections establishing)
tps = 110.775890 (excluding connections establishing)

> 14. The GUC variable "cache_scan_disabled" is missing from the docs description.
>
Its description was added.

Thanks,

2014-02-14 14:21 GMT+09:00 Haribabu Kommi <kommi.haribabu@gmail.com>:
> On Thu, Feb 13, 2014 at 3:27 PM, Kouhei Kaigai  wrote:
>
>> >       > 8. I am not able to find a protection mechanism in insert/delete
>> > and etc of
>> >       > a tuple in Ttree. As this is a shared memory it can cause
>> > problems.
>> >       >
>> >
>> >       For design simplification, I put a giant lock per columnar-cache.
>> >       So, routines in cscan.c acquires exclusive lwlock prior to
>> > invocation of
>> >       ccache_insert_tuple / ccache_delete_tuple.
>> >
>> >
>> > Correct. But this lock can be a bottleneck for concurrency. Better to
>> > analyze it once we have the performance report.
>> >
>> Well, concurrent updates of a particular table may cause lock contention
>> due to the giant lock.
>> On the other hand, my headache is how to avoid deadlocks if we try to
>> implement it with finer-grained locking. Consider per-chunk locking: it
>> also needs to take a lock on the neighbor nodes when a record is moved
>> out. Concurrently, some other process may try to move another record in
>> the inverse order. That is a recipe for deadlock.
>>
>> Is there any idea or reference for implementing concurrent updates to a
>> tree structure?
>>
>> Anyway, it is a good idea to measure the impact of concurrent updates on
>> cached tables, to find out how significant lock splitting would be.
>
>
> We can do some of the following things:
> 1. Let only insert take the exclusive lock.
> 2. Always follow the locking order from root to children.
> 3. For delete, take the exclusive lock only on the exact node where the
> delete is happening.
> We will identify more such rules based on the performance data.
>
> And one more interesting document I found on the net while searching for
> concurrency in T-trees: it says that a B-tree can also outperform a T-tree
> as an in-memory index, at the cost of somewhat higher memory usage.
> The document is attached to the mail.
>
> Regards,
> Hari Babu
> Fujitsu Australia



--
KaiGai Kohei <kaigai@kaigai.gr.jp>

