Re: Google Summer of code 2013 - Mailing list pgsql-students
From | Karel K. Rozhoň
---|---
Subject | Re: Google Summer of code 2013
Date |
Msg-id | 20130415175122.ca9c29b83ccd4d5f4cd490f6@gmail.com
In response to | Re: Google Summer of code 2013 (Stephen Frost <sfrost@snowman.net>)
Responses | Re: Google Summer of code 2013
List | pgsql-students
At first I would like to thank you for your responses and opinions. Of course I don't see all aspects of this problem, so I cannot tell what would be good for the future. But I have done some profiling of a GROUP BY select, and I believe that calling some hash procedures in parallel could help. Take an example:

```
postgres=# \d charlie
     Table "public.charlie"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer |
 c      | integer |
Indexes:
    "charlie_pkey" PRIMARY KEY, btree (a)

insert into charlie (a,b,c) values (generate_series(1,15000000), 5, generate_series(1,100000));

postgres=# explain select count(*) from (select min(a), max(b) from charlie group by c offset 0) x;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Aggregate  (cost=3196549.18..3196549.19 rows=1 width=0)
   ->  Limit  (cost=3044301.02..3195298.60 rows=100047 width=12)
         ->  GroupAggregate  (cost=3044301.02..3195298.60 rows=100047 width=12)
               ->  Sort  (cost=3044301.02..3081800.29 rows=14999711 width=12)
                     Sort Key: charlie.c
                     ->  Seq Scan on charlie  (cost=0.00..231079.11 rows=14999711 width=12)
(6 rows)
```

Oprofile output:

```
CPU: Intel Westmere microarchitecture, speed 2.267e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted)
with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
85519    38.0101  hash_search_with_hash_value
14009     6.2265  slot_deform_tuple
11808     5.2482  ExecProject
 7675     3.4113  slot_getattr
 7539     3.3508  advance_aggregates
 6932     3.0810  ExecAgg
```

Of course I know this simple case is only theoretical and the data in real tables are much more complicated, but as I can see, almost 40% of the CPU time was spent in a single hash function: hash_search_with_hash_value.
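To make the experiment easier to reproduce, here is a minimal sketch of the setup, reconstructed from the \d output above (the PostgreSQL version and storage settings are assumptions, and the enable_sort step at the end is only a quick way to compare the sort-based and hash-based aggregation plans):

```sql
-- Reconstructed from the \d charlie output; everything beyond the listed
-- columns and the primary key is an assumption.
CREATE TABLE charlie (
    a integer NOT NULL PRIMARY KEY,  -- backed by the btree index charlie_pkey
    b integer,
    c integer
);

-- 15M rows; c cycles through 1..100000, so GROUP BY c yields ~100k groups.
INSERT INTO charlie (a, b, c)
VALUES (generate_series(1, 15000000), 5, generate_series(1, 100000));
ANALYZE charlie;

-- Discourage the GroupAggregate + Sort plan shown above; with enough
-- work_mem the planner should switch to a HashAggregate, whose hash table
-- lookups go through hash_search_with_hash_value.
SET enable_sort = off;
EXPLAIN SELECT min(a), max(b) FROM charlie GROUP BY c;
```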
On Sun, 14 Apr 2013 21:54:00 -0400, Stephen Frost <sfrost@snowman.net> wrote:

> Tomas,
>
> * Tomas Vondra (tv@fuzzy.cz) wrote:
> > There's certainly a lot of groundwork to do, and I do share the concern
> > that the project will have to deal with a lot of dirty work (e.g. when
> > transferring data between the processes). But couldn't it be a useful
> > part of the discussion?
>
> A one-off implementation of parallelized hash table building and/or
> usage..? No, I don't see that as particularly relevant to the
> discussion around how to parallelize queries. There are a ton of
> examples already of parallel hash table building and various other
> independent pieces (parallel sort, parallel aggregation, etc.). What is
> needed for parallel query processing in PG is to figure out what we mean
> by it and how to actually implement it. Following that would be making
> the query planner and optimizer aware of it; last would be picking a
> specific parallelized implementation of each node and writing it (and
> that, really, is the 'easy' part in all of this...).
>
> > I don't expect a committable patch at the end, but rather something that
> > "works" and may be used as a basis for improvements and to build the
> > actual groundwork.
>
> I don't think it'd really help us get any farther with parallel query
> execution. To be honest, I'd be a bit surprised if this hasn't been
> done already and patches posted to the list in the past..
>
> > I think that depends on the workload type. For example, for databases
> > handling DWH-like queries, parallel hash aggregate is going to be a
> > major improvement.
>
> DWH is what I deal with day-in and day-out, and I certainly agree that
> parallelizing hash builds would be wonderful - but that doesn't mean that
> a patch which implements it without any consideration for the rest of
> the challenges around parallel query execution will actually move us, as
> a project, any closer to getting it. In fact, I'd expect most DWH
> implementations to do what we've done already - massive partitioning and
> parallelizing through multiple client connections.
>
> > Karel mentioned he's currently working on his bachelor thesis, which is
> > about hash tables too. That's another reason why he proposed this topic.
>
> That's wonderful; I'd love to hear about some ways to improve our
> hashing system (I've even proposed one modification a few days ago that
> I'd like to see tested more). I believe that costing around hashing
> needs to be improved too.
>
> Parallel-anything is a 'sexy' project, but unless it's focused on how we
> answer the hard questions around how we do parallel work efficiently
> while maintaining scalability and portability, then it's not moving us
> forward.
>
> Thanks,
>
> Stephen
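For illustration, the "partitioning and parallelizing through multiple client connections" approach on the example table might look like this (a sketch only; the two-way split and the boundary on c are arbitrary choices for this example):

```sql
-- Each query runs in its own client connection/backend. Because the
-- predicates partition the grouping key c, no group spans both slices,
-- so the partial results can simply be concatenated by the client.

-- connection 1:
SELECT c, min(a) AS min_a, max(b) AS max_b
FROM charlie WHERE c <= 50000 GROUP BY c;

-- connection 2:
SELECT c, min(a) AS min_a, max(b) AS max_b
FROM charlie WHERE c > 50000 GROUP BY c;
```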
--
Karel K. Rozhoň <karel.rozhon@gmail.com>