Thread: Google Summer of Code 2013

Google Summer of Code 2013

From: "Karel K. Rozhoň"
Hello everybody,

My name is Karel Rozhoň and I am in the third year of bachelor studies at CTU, Faculty of Information Technology in
Prague (Czech Republic).

I would like to contribute to PostgreSQL and participate in the Google Summer of Code 2013. I would like to implement
parallelization of a hash table. I am considering using processes, because PostgreSQL is not thread safe, which means
that threads could cause problems with memory management. I would also like to learn more about the background workers
(bgworkers) coming in version 9.3 and when it is appropriate to use them. At the end of this project I would like to
have a working version of a parallelized hash aggregate. I have some basic knowledge of hacking PostgreSQL - at the
moment I am working on a few similar things with POSIX threads as part of my bachelor thesis (my thesis mentor is
Pavel Stěhule).
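To make the idea more concrete, here is a rough, untested sketch of how the workers could be registered from an
extension's _PG_init(), loosely following the worker_spi example that ships with 9.3 (hash_worker_main and the
"hash worker" name are only placeholders of mine):

#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"

PG_MODULE_MAGIC;

void _PG_init(void);

/* Placeholder: each worker would build the hash table for its partition here. */
static void
hash_worker_main(Datum main_arg)
{
    int partition = DatumGetInt32(main_arg);

    /* We run as a separate process; signal handlers would be set up here,
     * then signals unblocked (see contrib/worker_spi for the full dance). */
    BackgroundWorkerUnblockSignals();

    elog(LOG, "hash worker for partition %d started", partition);

    /* ... aggregate the rows belonging to this partition ... */

    proc_exit(0);
}

void
_PG_init(void)
{
    BackgroundWorker worker;
    int         i;

    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    worker.bgw_main = hash_worker_main;

    for (i = 0; i < 2; i++)     /* a fixed number of workers, for illustration */
    {
        snprintf(worker.bgw_name, BGW_MAXLEN, "hash worker %d", i);
        worker.bgw_main_arg = Int32GetDatum(i);
        RegisterBackgroundWorker(&worker);
    }
}

As far as I understand, in 9.3 workers can only be registered at library load time, so the number of workers would
have to be fixed up front - that is one of the things I want to explore.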

I have spoken about this project with Tomáš Vondra and he agreed to be my mentor.

What do you think about this project?

Thank you for your response and advice
Karel K. Rozhoň

--
Karel K. Rozhoň <karel.rozhon@gmail.com>


Re: Google Summer of Code 2013

From: Stephen Frost
* Karel K. Rozhoň (karel.rozhon@gmail.com) wrote:
> What do you think about this project?

Parallelization in PG is a pretty massive discussion and probably not
suited to a GSoC project.  Once we have the groundwork for doing
parallel processing for a given query, I doubt that hash tables will be
the first piece implemented.

    Thanks,

        Stephen


Re: Google Summer of Code 2013

From: Tomas Vondra
Hi,

I've briefly discussed the project with the student a while back, so a
few thoughts on the project:

On 15.4.2013 00:16, Stephen Frost wrote:
> * Karel K. Rozhoň (karel.rozhon@gmail.com) wrote:
>> What do you think about this project?
>
> Parallelization in PG is a pretty massive discussion and probably not
> suited to a GSoC project.  Once we have the groundwork for doing

There's certainly a lot of groundwork to do, and I do share the concern
that the project will have to deal with a lot of dirty work (e.g. when
transferring data between the processes; see the sketch below). But
couldn't it be a useful part of the discussion?
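To show the kind of plumbing I mean, a tiny standalone C toy (plain POSIX, nothing PostgreSQL-specific) of a
parent collecting a partial aggregate from a forked child through anonymous shared memory:

/* Generic illustration only - not PostgreSQL code. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    /* One shared slot for the child's partial aggregate result. */
    long *partial = mmap(NULL, sizeof(long),
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (partial == MAP_FAILED)
        return 1;

    pid_t pid = fork();
    if (pid == 0)
    {
        long sum = 0;
        for (long i = 1; i <= 1000000; i++)   /* the child's share of the rows */
            sum += i;
        *partial = sum;                       /* publish through shared memory */
        _exit(0);
    }

    waitpid(pid, NULL, 0);   /* the result is safe to read only after the child finishes */
    printf("child's partial sum: %ld\n", *partial);
    munmap(partial, sizeof(long));
    return 0;
}

Real executor tuples are variable-length and the workers need proper queues and synchronization, which is exactly
where the dirty work hides.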

I don't expect a committable patch at the end, but rather something that
"works" and may be used as a basis for improvements and to build the
actual groundwork.


> parallel processing for a given query, I doubt that hash tables will be
> the first piece implemented.

I think that depends on the workload type. For example, for databases
handling DWH-like queries, a parallel hash aggregate is going to be a
major improvement.

Karel mentioned he's currently working on his bachelor thesis, which is
about hash tables too. That's another reason why he proposed this topic.


regards
Tomas


Re: Google Summer of Code 2013

From: Stephen Frost
Tomas,

* Tomas Vondra (tv@fuzzy.cz) wrote:
> There's certainly a lot of groundwork to do, and I do share the concern
> that the project will have to deal with a lot of dirty work (e.g. when
> transferring data between the processes). But couldn't it be a useful
> part of the discussion?

A one-off implementation of parallelized hash table building and/or
usage..?  No, I don't see that as particularly relevant to the
discussion around how to parallelize queries.  There are a ton of
examples already of parallel hash table building and various other
independent pieces (parallel sort, parallel aggregation, etc).  What is
needed for parallel query processing in PG is to figure out what we mean
by it and how to actually implement it.  Following that would be making
the query planner and optimizer aware of it, last would be picking a
specific parallelized implementation of each node and writing it (and
that, really, is the 'easy' part in all of this...).

> I don't expect a committable patch at the end, but rather something that
> "works" and may be used as a basis for improvements and to build the
> actual groundwork.

I don't think it'd really help us get any farther with parallel query
execution.  To be honest, I'd be a bit surprised if this hasn't been
done already and patches posted to the list in the past..

> I think that depends on the workload type. For example, for databases
> handling DWH-like queries, a parallel hash aggregate is going to be a
> major improvement.

DWH is what I deal with day-in and day-out and I certainly agree that
parallelizing hash builds would be wonderful- but that doesn't mean that
a patch which implements it without any consideration for the rest of
the challenges around parallel query execution will actually move us, as
a project, any closer to getting it.  In fact, I'd expect most DWH
implementations to do what we've done already- massive partitioning and
parallelizing through multiple client connections.

> Karel mentioned he's currently working on his bachelor thesis, which is
> about hash tables too. That's another reason why he proposed this topic.

That's wonderful, I'd love to hear about some ways to improve our
hashing system (I've even proposed one modification a few days ago that
I'd like to see tested more).  I believe that costing around hashing
needs to be improved too.

Parallel-anything is a 'sexy' project, but unless it's focused on how we
answer the hard questions around how we do parallel work efficiently
while maintaining scalability and portability, it's not moving us
forward.

    Thanks,

        Stephen


Re: Google Summer of Code 2013

From: "Karel K. Rozhoň"
First of all, I would like to thank you for your responses and opinions.

Of course I don't see all aspects of this problem, so I cannot tell what would be best for the future. But I have
profiled a GROUP BY select and I believe that calling some of the hash procedures in parallel could help.

Take an example:

postgres=# \d charlie
    Table "public.charlie"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer |
 c      | integer |
Indexes:
    "charlie_pkey" PRIMARY KEY, btree (a)

insert into charlie (a,b,c) values (generate_series(1,15000000), 5, generate_series(1,100000));

postgres=# explain select count(*) from (select min(a), max(b) from charlie group by c offset 0) x;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=3196549.18..3196549.19 rows=1 width=0)
   ->  Limit  (cost=3044301.02..3195298.60 rows=100047 width=12)
         ->  GroupAggregate  (cost=3044301.02..3195298.60 rows=100047 width=12)
               ->  Sort  (cost=3044301.02..3081800.29 rows=14999711 width=12)
                     Sort Key: charlie.c
                     ->  Seq Scan on charlie  (cost=0.00..231079.11 rows=14999711 width=12)
(6 rows)

Oprofile output:
CPU: Intel Westmere microarchitecture, speed 2.267e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
85519    38.0101  hash_search_with_hash_value
14009     6.2265  slot_deform_tuple
11808     5.2482  ExecProject
7675      3.4113  slot_getattr
7539      3.3508  advance_aggregates
6932      3.0810  ExecAgg

Of course I know this simple case is only theoretical and the data in real tables are much more complicated, but as I
can see, almost 40% of the CPU time was spent in a single hash routine: hash_search_with_hash_value.
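To sketch the direction I imagine (a standalone toy in C with threads, not PostgreSQL code, and all the names are
mine - the real thing would use processes, as discussed above), the groups can be partitioned by their key so that
every worker owns a disjoint part of the hash table and needs no locks:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS  4
#define NGROUPS   100000            /* like the "c" column above */
#define NROWS     15000000

static long counts[NGROUPS];        /* stand-in for per-group hash table entries */

static void *
worker(void *arg)
{
    long tid = (long) arg;

    /* Scan all rows, but aggregate only the groups this thread owns. */
    for (long row = 0; row < NROWS; row++)
    {
        long group = row % NGROUPS;      /* the grouping key */
        if (group % NTHREADS == tid)     /* partition by key: disjoint ownership */
            counts[group]++;             /* no other thread ever touches this slot */
    }
    return NULL;
}

int
main(void)
{
    pthread_t threads[NTHREADS];

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *) t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("count for group 0: %ld\n", counts[0]);
    return 0;
}

Every thread scanning all the rows is of course wasteful; a real implementation would scatter tuples to per-worker
queues instead, and that is where the hard questions about processes and communication come in.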




--
Karel K. Rozhoň <karel.rozhon@gmail.com>


Re: Google Summer of Code 2013

From: Stephen Frost
* Karel K. Rozhoň (karel.rozhon@gmail.com) wrote:
> Of course I don't see all aspects of this problem, so I cannot tell what
> would be best for the future. But I have profiled a GROUP BY select and I
> believe that calling some of the hash procedures in parallel could help.

There seems to be some confusion here.  It's certainly true that *many*
(most?  all?) pieces of query processing would benefit from parallel
execution; there is no debate on that.

The issue is that PG is not currently set up to do *any* per-query
parallel processing and it is *not* a trivial thing to change that.  We
can talk all day about how wonderful it'd be to do parallel hashing,
parallel sorting, etc, but until PG has a way to parallelize query
processing, there's really no point to writing code to parallelize
individual nodes.

> Of course I know this simple case is only theoretical and the data in real
> tables are much more complicated, but as I can see, almost 40% of the CPU
> time was spent in a single hash routine: hash_search_with_hash_value.

Improvements to that would be great, but you can't simply call
pthread_create() in a PG backend and expect things to work.

    Thanks,

        Stephen
