Re: contrib/cache_scan (Re: What's needed for cache-only table scan?) - Mailing list pgsql-hackers

From Kouhei Kaigai
Subject Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Msg-id 9A28C8860F777E439AA12E8AEA7694F8F8382D@BPXM15GP.gisp.nec.co.jp
In response to Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)  (Kouhei Kaigai <kaigai@ak.jp.nec.com>)
Responses Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
List pgsql-hackers
Sorry, the previous one still has "columner" in the sgml files.
Please see the attached one, instead.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Tuesday, March 04, 2014 12:35 PM
> To: Haribabu Kommi; Kohei KaiGai
> Cc: Tom Lane; PgHacker; Robert Haas
> Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for cache-only
> table scan?)
>
> Thanks for your reviewing.
>
> According to the discussion in the Custom-Scan API thread, I moved all the
> supplemental facilities (like bms_to/from_string) into the main patch, so
> you don't need to apply the ctidscan and postgres_fdw patches for testing
> any more.
> (I'll submit the revised one later.)
>
> > 1. memcpy(dest, tuple, HEAPTUPLESIZE);
> > + memcpy((char *)dest + HEAPTUPLESIZE,
> >
> > +   tuple->t_data, tuple->t_len);
> >
> >   For a normal tuple these two addresses are different, but in the
> > case of ccache they are one contiguous memory block.
> >   Better to write a comment noting that even though the memory is
> > contiguous, the two parts are still treated as separate.
> >
> OK, I put a source code comment as follows:
>
>         /*
>          * Even though we usually put the body of HeapTupleHeaderData just
>          * after HeapTupleData, there is no guarantee that both data
>          * structures are located at contiguous memory addresses.
>          * So, we explicitly adjust tuple->t_data to point to the area just
>          * behind itself, so that a HeapTuple on the columnar cache can be
>          * referenced just like a regular one.
>          */
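>
> For clarity, here is a minimal sketch of the copy path this comment
> describes (the helper name and surrounding code are mine for
> illustration, not the exact patch code):
>
>     /*
>      * Copy a HeapTuple into one flat palloc'd area: the HeapTupleData
>      * header first, the tuple body just behind it. The copied t_data
>      * would still point into the source tuple, so repoint it at the
>      * body that now follows the header.
>      */
>     static HeapTuple
>     ccache_copy_tuple(HeapTuple tuple)
>     {
>         HeapTuple   dest = palloc(HEAPTUPLESIZE + MAXALIGN(tuple->t_len));
>
>         memcpy(dest, tuple, HEAPTUPLESIZE);
>         memcpy((char *) dest + HEAPTUPLESIZE, tuple->t_data, tuple->t_len);
>         dest->t_data = (HeapTupleHeader) ((char *) dest + HEAPTUPLESIZE);
>         return dest;
>     }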
>
> > 2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
> >
> >   t_len is already maxaligned, so there is no problem with aligning
> > it again, but the required-length calculation differs from function
> > to function.
> >   For example, in a later part of the same function, the same t_len
> > is used directly. That doesn't cause any problem, but it may cause
> > some confusion.
> >
> I initially tried to trust that t_len is already aligned; however, an
> Assert() told me that is not a correct assumption. See
> heap_compute_data_size(): it computes the length of the tuple body and
> adjusts alignment according to the "attalign" value in pg_attribute,
> which is not necessarily the same as sizeof(Datum).
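>
> As a small worked example (mine, not from the patch): with 8-byte
> maximum alignment, a tuple whose body length came out to 45 bytes must
> still reserve a rounded-up slot in the chunk, or the next tuple packed
> behind it would start misaligned:
>
>     uint32  t_len    = 45;                /* not a multiple of 8 */
>     uint32  required = HEAPTUPLESIZE + MAXALIGN(t_len);
>                                           /* HEAPTUPLESIZE + 48 */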
>
> > 3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> > + if (pchunk != NULL && pchunk != cchunk)
> >
> > + ccache_merge_chunk(ccache, pchunk);
> >
> > + pchunk = cchunk;
> >
> >
> >   ccache_merge_chunk() is called only when the heap tuples are spread
> > across two cache chunks. Actually, one cache chunk can accommodate
> > one or more heap pages, so this needs some other way of handling.
> >
> I adjusted the logic to merge the chunks as follows:
>
> Once a tuple is vacuumed from a chunk, we also check whether the chunk
> can be merged with its child leafs. A chunk has up to two child leafs;
> the left one has smaller ctids than the parent, and the right one has
> greater ctids. That means a chunk without a right child in the left
> sub-tree, or a chunk without a left child in the right sub-tree, is a
> neighbor of the chunk being vacuumed. In addition, if the vacuumed
> chunk does not have one (or both) of its children, it can be merged
> with its parent node.
> I modified ccache_vacuum_tuple() to merge chunks during the t-tree
> walk-down, if the vacuumed chunk has enough free space.
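>
> A rough sketch of that check (the types and names here are illustrative
> and assume PostgreSQL's usual headers; they are not the patch's actual
> definitions):
>
>     typedef struct ccache_chunk
>     {
>         struct ccache_chunk *left;      /* subtree with smaller ctids */
>         struct ccache_chunk *right;     /* subtree with larger ctids */
>         uint32      usage;              /* bytes currently in use */
>         uint32      limit;              /* capacity of this chunk */
>     } ccache_chunk;
>
>     static void merge_into(ccache_chunk *dst, ccache_chunk *src);
>
>     /*
>      * After tuples are vacuumed out of 'chunk', try to pull its
>      * in-order neighbors back into it: the rightmost leaf of the left
>      * sub-tree (no right child) and the leftmost leaf of the right
>      * sub-tree (no left child), whenever the survivors fit into the
>      * free space of the vacuumed chunk.
>      */
>     static void
>     try_merge_after_vacuum(ccache_chunk *chunk)
>     {
>         if (chunk->left != NULL)
>         {
>             ccache_chunk *prev = chunk->left;
>
>             while (prev->right != NULL)
>                 prev = prev->right;
>             if (chunk->usage + prev->usage <= chunk->limit)
>                 merge_into(chunk, prev);
>         }
>         if (chunk->right != NULL)
>         {
>             ccache_chunk *next = chunk->right;
>
>             while (next->left != NULL)
>                 next = next->left;
>             if (chunk->usage + next->usage <= chunk->limit)
>                 merge_into(chunk, next);
>         }
>     }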
>
> > 4. for (i=0; i < 20; i++)
> >
> >    Better to replace this magic number with a meaningful macro.
> >
> On reflection, there is no good reason to construct multiple
> ccache_entries at once, so I adjusted the code to create a new
> ccache_entry on demand, to track a columnar cache being acquired.
>
> > 5. "columner" is present in the sgml file. Correct it.
> >
> Sorry, fixed it.
>
> > 6. The "max_cached_attnum" value is documented as 128 by default,
> > but in the code it is set to 256.
> >
> Sorry, fixed it.
>
>
> Also, I tried to run a benchmark of cache_scan in the case where this
> module performs most effectively.
>
> The table t1 is declared as follows:
>   create table t1 (a int, b float, c float, d text, e date, f char(200));
> Each record is roughly 256 bytes wide and the table contains 4 million
> records, so the total table size is almost 1GB.
>
> * 1st trial - it takes longer than a sequential scan because of
>               columnar-cache construction
> postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
>                                                              QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=200791.62..200791.64 rows=1 width=0) (actual time=63105.036..63105.037 rows=1 loops=1)
>    ->  Custom Scan (cache scan) on t1  (cost=0.00..200741.62 rows=20000 width=0) (actual time=7.397..62832.728 rows=400000 loops=1)
>          Filter: ((a % 10) = 5)
>          Rows Removed by Filter: 3600000
>  Planning time: 214.506 ms
>  Total runtime: 64629.296 ms
> (6 rows)
>
> * 2nd trial - it runs much faster than a sequential scan because there
>               is no disk access
> postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
>                                                             QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=67457.53..67457.54 rows=1 width=0) (actual time=7833.313..7833.313 rows=1 loops=1)
>    ->  Custom Scan (cache scan) on t1  (cost=0.00..67407.53 rows=20000 width=0) (actual time=0.154..7615.914 rows=400000 loops=1)
>          Filter: ((a % 10) = 5)
>          Rows Removed by Filter: 3600000
>  Planning time: 1.019 ms
>  Total runtime: 7833.761 ms
> (6 rows)
>
> * 3rd trial - turn off cache_scan, so the planner chooses the built-in
>               SeqScan
> postgres=# set cache_scan.enabled = off;
> SET
> postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
>                                                       QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=208199.08..208199.09 rows=1 width=0) (actual time=59700.810..59700.810 rows=1 loops=1)
>    ->  Seq Scan on t1  (cost=0.00..208149.08 rows=20000 width=0) (actual time=715.489..59518.095 rows=400000 loops=1)
>          Filter: ((a % 10) = 5)
>          Rows Removed by Filter: 3600000
>  Planning time: 0.630 ms
>  Total runtime: 59701.104 ms
> (6 rows)
>
> The reason for such an extreme result is this: I adjusted the system
> page cache usage to constrain disk-cache hits at the operating-system
> level, so the sequential scan is dominated by disk-access performance
> in this case. On the other hand, the columnar cache was able to hold
> all of the records, because it omits caching of unreferenced columns.
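>
> Back-of-envelope (my arithmetic, ignoring per-tuple overhead): the heap
> is 4M rows x ~256 bytes = ~1GB, but the only referenced column is "a",
> so the columnar cache holds roughly 4M x 4 bytes = ~16MB of column
> data, which easily stays memory-resident while the full heap does not.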
>
> * GUCs
> shared_buffers = 512MB
> shared_preload_libraries = 'cache_scan'
> cache_scan.num_blocks = 400
>
> [kaigai@iwashi backend]$ free -m
>              total       used       free     shared    buffers     cached
> Mem:          7986       7839        146          0          2        572
> -/+ buffers/cache:       7265        721
> Swap:         8079        265       7814
>
>
> Please don't throw stones at me. :-)
> The primary purpose of this extension is to demonstrate usage of the
> custom-scan interface and heap_page_prune_hook().
>
> Thanks,
> --
> NEC OSS Promotion Center / PG-Strom Project
> KaiGai Kohei <kaigai@ak.jp.nec.com>
>
>
> > -----Original Message-----
> > From: Haribabu Kommi [mailto:kommi.haribabu@gmail.com]
> > Sent: Monday, February 24, 2014 12:42 PM
> > To: Kohei KaiGai
> > Cc: Kaigai, Kouhei(海外, 浩平); Tom Lane; PgHacker; Robert Haas
> > Subject: Re: contrib/cache_scan (Re: [HACKERS] What's needed for
> > cache-only table scan?)
> >
> > On Fri, Feb 21, 2014 at 2:19 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> >
> >
> >     Hello,
> >
> >     The attached patch is a revised one for cache-only scan module
> >     on top of custom-scan interface. Please check it.
> >
> >
> >
> > Thanks for the revised patch.  Please find some minor comments.
> >
> > 1. memcpy(dest, tuple, HEAPTUPLESIZE);
> > + memcpy((char *)dest + HEAPTUPLESIZE,
> >
> > +   tuple->t_data, tuple->t_len);
> >
> >
> >   For a normal tuple these two addresses are different, but in the
> > case of ccache they are one contiguous memory block.
> >   Better to write a comment noting that even though the memory is
> > contiguous, the two parts are still treated as separate.
> >
> > 2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
> >
> >   t_len is already maxaligned, so there is no problem with aligning
> > it again, but the required-length calculation differs from function
> > to function.
> >   For example, in a later part of the same function, the same t_len
> > is used directly. That doesn't cause any problem, but it may cause
> > some confusion.
> >
> > 3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> > + if (pchunk != NULL && pchunk != cchunk)
> >
> > + ccache_merge_chunk(ccache, pchunk);
> >
> > + pchunk = cchunk;
> >
> >
> >   ccache_merge_chunk() is called only when the heap tuples are spread
> > across two cache chunks. Actually, one cache chunk can accommodate
> > one or more heap pages, so this needs some other way of handling.
> >
> > 4. for (i=0; i < 20; i++)
> >
> >    Better to replace this magic number with a meaningful macro.
> >
> > 5. "columner" is present in the sgml file. Correct it.
> >
> > 6. The "max_cached_attnum" value is documented as 128 by default,
> > but in the code it is set to 256.
> >
> > I will start regression and performance tests, and will inform you
> > once I finish.
> >
> >
> > Regards,
> > Hari Babu
> >
> > Fujitsu Australia

