Re: Google Summer of code 2013 - Mailing list pgsql-students

From: Karel K. Rozhoň
Subject: Re: Google Summer of code 2013
Date:
Msg-id: 20130415175122.ca9c29b83ccd4d5f4cd490f6@gmail.com
In response to: Re: Google Summer of code 2013 (Stephen Frost <sfrost@snowman.net>)
Responses: Re: Google Summer of code 2013
List: pgsql-students
First of all, I would like to thank you for your responses and opinions.

Of course I don't see all aspects of this problem, so I cannot tell what would be best for the future. But I have done
some profiling of a GROUP BY select, and I believe that calling some of the hash procedures in parallel could help.

Take an example:

postgres=# \d charlie
    Table "public.charlie"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer |
 c      | integer |
Indexes:
    "charlie_pkey" PRIMARY KEY, btree (a)

insert into charlie (a,b,c) values (generate_series(1,15000000), 5, generate_series(1,100000));

postgres=# explain select count(*) from (select min(a), max(b) from charlie group by c offset 0) x;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=3196549.18..3196549.19 rows=1 width=0)
   ->  Limit  (cost=3044301.02..3195298.60 rows=100047 width=12)
         ->  GroupAggregate  (cost=3044301.02..3195298.60 rows=100047 width=12)
               ->  Sort  (cost=3044301.02..3081800.29 rows=14999711 width=12)
                     Sort Key: charlie.c
                     ->  Seq Scan on charlie  (cost=0.00..231079.11 rows=14999711 width=12)
(6 rows)

Oprofile output:
CPU: Intel Westmere microarchitecture, speed 2.267e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
85519    38.0101  hash_search_with_hash_value
14009     6.2265  slot_deform_tuple
11808     5.2482  ExecProject
7675      3.4113  slot_getattr
7539      3.3508  advance_aggregates
6932      3.0810  ExecAgg

Of course I know this simple case is only theoretical and the data in real tables are much more complicated, but as far
as I can see, almost 40% of the CPU time was spent in a single hash function: hash_search_with_hash_value.
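
To illustrate what I have in mind, below is a minimal, hypothetical sketch in plain C with pthreads. It is not PostgreSQL code and does not use its APIs; all names in it (Row, Group, Worker, hash_int32) are made up for illustration. The idea: rows are partitioned by their hash value, every worker builds a private hash table for its own partition, and because the partitions are disjoint the per-group results only need to be collected at the end, with no locking around the hot lookup loop.

/*
 * Hypothetical sketch, NOT PostgreSQL code: parallel hash aggregation by
 * partitioning rows on their hash value.  Each worker owns one partition
 * and therefore one private hash table, so the lookup/insert work that
 * dominates the profile above runs without any locking.  The aggregates
 * mirror the example query: min(a), max(b) grouped by c.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4
#define NBUCKETS 131072                  /* buckets per private hash table */

typedef struct Row   { int32_t a, b, c; } Row;
typedef struct Group {                   /* one entry per distinct c */
    int32_t key, min_a, max_b;
    struct Group *next;
} Group;
typedef struct Worker {
    const Row *rows;
    size_t     nrows;
    int        id;
    pthread_t  thread;
    Group     *buckets[NBUCKETS];        /* private table: no locks needed */
} Worker;

static uint32_t hash_int32(uint32_t h)   /* stand-in for a real hash function */
{
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    return h ^ (h >> 16);
}

static void *worker_main(void *arg)
{
    Worker *w = arg;

    for (size_t i = 0; i < w->nrows; i++)
    {
        const Row *r = &w->rows[i];
        uint32_t   h = hash_int32((uint32_t) r->c);

        /* Each worker handles only the keys that hash to its partition,
         * so every group lands in exactly one table and no merge is needed. */
        if (h % NWORKERS != (uint32_t) w->id)
            continue;

        Group **slot = &w->buckets[(h / NWORKERS) % NBUCKETS];
        Group  *g;

        for (g = *slot; g && g->key != r->c; g = g->next)
            ;
        if (g == NULL)                   /* first row of this group */
        {
            g = malloc(sizeof(Group));
            g->key = r->c;
            g->min_a = INT32_MAX;
            g->max_b = INT32_MIN;
            g->next = *slot;
            *slot = g;
        }
        if (r->a < g->min_a) g->min_a = r->a;
        if (r->b > g->max_b) g->max_b = r->b;
    }
    return NULL;
}

int main(void)
{
    /* Toy stand-in for table "charlie": a = 1..N, b = 5, c = a % 1000. */
    size_t nrows = 1000000;
    Row   *rows = malloc(nrows * sizeof(Row));
    for (size_t i = 0; i < nrows; i++)
        rows[i] = (Row) { (int32_t) (i + 1), 5, (int32_t) (i % 1000) };

    static Worker workers[NWORKERS];     /* static => zero-initialized */
    for (int i = 0; i < NWORKERS; i++)
    {
        workers[i].rows = rows;
        workers[i].nrows = nrows;
        workers[i].id = i;
        pthread_create(&workers[i].thread, NULL, worker_main, &workers[i]);
    }

    size_t ngroups = 0;
    for (int i = 0; i < NWORKERS; i++)
    {
        pthread_join(workers[i].thread, NULL);
        for (size_t b = 0; b < NBUCKETS; b++)
            for (Group *g = workers[i].buckets[b]; g; g = g->next)
                ngroups++;
    }
    printf("distinct groups: %zu\n", ngroups);   /* expect 1000 */
    return 0;
}

(Compile with e.g. cc -O2 -pthread sketch.c.) Of course this ignores the hard part - how the tuples would actually be distributed between backends - which is exactly the "dirty work" Tomas mentions in the quoted discussion below; the sketch is only meant to show where the parallelism in the hash lookups would come from.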


On Sun, 14 Apr 2013 21:54:00 -0400
Stephen Frost <sfrost@snowman.net> wrote:

> Tomas,
>
> * Tomas Vondra (tv@fuzzy.cz) wrote:
> > There's certainly a lot of groundwork to do, and I do share the concern
> > that the project will have to deal with a lot of dirty work (e.g. when
> > transfering data between the processes). But couldn't it be a useful
> > part of the discussion?
>
> A one-off implementation of parallelized hash table building and/or
> usage..?  No, I don't see that as particularly relevant to the
> discussion around how to parallelize queries.  There are a ton of
> examples already of parallel hash table building and various other
> independent pieces (parallel sort, parallel aggregation, etc).  What is
> needed for parallel query processing in PG is to figure out what we mean
> by it and how to actually implement it.  Following that would be making
> the query planner and optimizer aware of it, last would be picking a
> specific parallelized implementation of each node and writing it (and
> that, really, is the 'easy' part in all of this...).
>
> > I don't expect a commitable patch at the end, but rather something that
> > "works" and may be used as a basis for improvements and to build the
> > actual groundwork.
>
> I don't think it'd really help us get any farther with parallel query
> execution.  To be honest, I'd be a bit surprised if this hasn't been
> done already and patches posted to the list in the past..
>
> > I think that depends on the workload type. For example for databases
> > handling DWH-like queries, parallel hash aggregate is going to be a
> > major improvement.
>
> DWH is what I deal with day-in and day-out and I certainly agree that
> parallelizing hash builds would be wonderful- but that doesn't mean that
> a patch which implements it without any consideration for the rest of
> the challenges around parallel query execution will actually move us, as
> a project, any closer to getting it.  In fact, I'd expect most DWH
> implementations to do what we've done already- massive partitioning and
> parallelizing through multiple client connections.
>
> > Karel mentioned he's currently working on his bachelor thesis, which is
> > about hash tables too. That's another reason why he proposed this topic.
>
> That's wonderful, I'd love to hear about some ways to improve our
> hashing system (I've even proposed one modification a few days ago that
> I'd like to see tested more).  I believe that costing around hashing
> needs to be improved too.
>
> Parallel-anything is a 'sexy' project, but unless it's focused on how we
> answer the hard questions around how do we do parallel work efficiently
> while maintaining scalability and portability then it's not moving us
> forward.
>
>     Thanks,
>
>         Stephen


--
Karel K. Rozhoň <karel.rozhon@gmail.com>

