Thread: Google Summer of Code 2013
Hello everybody,

My name is Karel Rozhoň and I am in the third year of bachelor study at CTU, Faculty of Information Technology in Prague (Czech Republic). I would like to contribute to PostgreSQL and participate in the Google Summer of Code 2013.

I would like to implement parallelization of a hash table. I am considering using processes, because PostgreSQL is not thread safe, which means threads could cause problems with memory management. I would also like to learn more about the background workers (bgworkers) that arrive in version 9.3, and when it is appropriate to use them. At the end of this project I would like to have a functional version of a parallelized hash aggregate.

I have some basic knowledge of hacking PostgreSQL - I am working on a few similar things with POSIX threads as part of my bachelor thesis at the moment (my thesis advisor is Pavel Stěhule). I have spoken about this project with Tomáš Vondra and he agreed to be my mentor.

What do you think about this project?

Thank you for your response and advice,

Karel K. Rozhoň

-- 
Karel K. Rozhoň <karel.rozhon@gmail.com>
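For reference, registering a worker with the 9.3 background worker API looks roughly like the sketch below. It follows the pattern of the worker_spi example that ships with 9.3; the module and function names (my_hash_worker_main and so on) are hypothetical, and a real worker would also install its signal handlers before doing any work:

/* Hypothetical extension module; the pattern follows the 9.3
 * worker_spi example.  Names like my_hash_worker_main are made up. */
#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"        /* proc_exit */

PG_MODULE_MAGIC;

void _PG_init(void);

static void
my_hash_worker_main(Datum main_arg)
{
    /* A real worker would set up signal handling, attach to shared
     * memory and loop doing its share of the hash build here. */
    proc_exit(0);
}

void
_PG_init(void)
{
    BackgroundWorker worker;

    MemSet(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "parallel hash worker");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;   /* needs shared memory */
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    worker.bgw_main = my_hash_worker_main;      /* 9.3 API; later versions
                                                 * use bgw_library_name /
                                                 * bgw_function_name */
    worker.bgw_main_arg = (Datum) 0;
    RegisterBackgroundWorker(&worker);
}

Workers registered this way are started by the postmaster at server start, which is part of why per-query parallelism needs more groundwork than just this API.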
* Karel K. Rozhoň (karel.rozhon@gmail.com) wrote:
> What do you think about this project?

Parallelization in PG is a pretty massive discussion and probably not suited to a GSoC project. Once we have the groundwork for doing parallel processing for a given query, I doubt that hash tables will be the first piece implemented.

Thanks,

Stephen
Hi,

I've briefly discussed the project with the student a while back, so a few thoughts on the project:

On 15.4.2013 00:16, Stephen Frost wrote:
> * Karel K. Rozhoň (karel.rozhon@gmail.com) wrote:
>> What do you think about this project?
>
> Parallelization in PG is a pretty massive discussion and probably not
> suited to a GSoC project. Once we have the groundwork for doing

There's certainly a lot of groundwork to do, and I do share the concern that the project will have to deal with a lot of dirty work (e.g. when transferring data between the processes). But couldn't it be a useful part of the discussion? I don't expect a committable patch at the end, but rather something that "works" and may be used as a basis for improvements and to build the actual groundwork.

> parallel processing for a given query, I doubt that hash tables will be
> the first piece implemented.

I think that depends on the workload type. For example, for databases handling DWH-like queries, a parallel hash aggregate is going to be a major improvement.

Karel mentioned he's currently working on his bachelor thesis, which is about hash tables too. That's another reason why he proposed this topic.

regards
Tomas
Tomas,

* Tomas Vondra (tv@fuzzy.cz) wrote:
> There's certainly a lot of groundwork to do, and I do share the concern
> that the project will have to deal with a lot of dirty work (e.g. when
> transferring data between the processes). But couldn't it be a useful
> part of the discussion?

A one-off implementation of parallelized hash table building and/or usage..? No, I don't see that as particularly relevant to the discussion around how to parallelize queries. There are a ton of examples already of parallel hash table building and various other independent pieces (parallel sort, parallel aggregation, etc). What is needed for parallel query processing in PG is to figure out what we mean by it and how to actually implement it. Following that would be making the query planner and optimizer aware of it; last would be picking a specific parallelized implementation of each node and writing it (and that, really, is the 'easy' part in all of this...).

> I don't expect a committable patch at the end, but rather something that
> "works" and may be used as a basis for improvements and to build the
> actual groundwork.

I don't think it'd really help us get any farther with parallel query execution. To be honest, I'd be a bit surprised if this hasn't been done already and patches posted to the list in the past..

> I think that depends on the workload type. For example, for databases
> handling DWH-like queries, a parallel hash aggregate is going to be a
> major improvement.

DWH is what I deal with day-in and day-out and I certainly agree that parallelizing hash builds would be wonderful - but that doesn't mean that a patch which implements it without any consideration for the rest of the challenges around parallel query execution will actually move us, as a project, any closer to getting it. In fact, I'd expect most DWH implementations to do what we've done already - massive partitioning and parallelizing through multiple client connections.

> Karel mentioned he's currently working on his bachelor thesis, which is
> about hash tables too. That's another reason why he proposed this topic.

That's wonderful, I'd love to hear about some ways to improve our hashing system (I've even proposed one modification a few days ago that I'd like to see tested more). I believe that costing around hashing needs to be improved too.

Parallel-anything is a 'sexy' project, but unless it's focused on answering the hard questions around how we do parallel work efficiently while maintaining scalability and portability, it's not moving us forward.

Thanks,

Stephen
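For what it's worth, the "massive partitioning and multiple client connections" approach Stephen mentions can be sketched with nothing but stock libpq: send one partial aggregate per partition asynchronously and combine the results on the client. The partition names below (charlie_p0 and friends) are hypothetical:

/* Sketch only: client-side parallelism over partitions with stock
 * libpq.  The partition names are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

#define NPARTS 4

int
main(void)
{
    const char *parts[NPARTS] = { "charlie_p0", "charlie_p1",
                                  "charlie_p2", "charlie_p3" };
    PGconn     *conns[NPARTS];
    long        total = 0;
    int         i;

    /* One connection per partition; each backend scans independently. */
    for (i = 0; i < NPARTS; i++)
    {
        char sql[128];

        conns[i] = PQconnectdb("dbname=postgres");
        if (PQstatus(conns[i]) != CONNECTION_OK)
        {
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conns[i]));
            return 1;
        }
        snprintf(sql, sizeof(sql), "SELECT count(*) FROM %s", parts[i]);
        PQsendQuery(conns[i], sql);     /* asynchronous: do not wait here */
    }

    /* Collect the partial aggregates and combine them on the client. */
    for (i = 0; i < NPARTS; i++)
    {
        PGresult *res;

        while ((res = PQgetResult(conns[i])) != NULL)
        {
            if (PQresultStatus(res) == PGRES_TUPLES_OK)
                total += atol(PQgetvalue(res, 0, 0));
            PQclear(res);
        }
        PQfinish(conns[i]);
    }

    printf("total rows: %ld\n", total);
    return 0;
}

This is essentially what existing DWH deployments do today, just wrapped in application code rather than handled by the planner.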
At first I would like to thank you for your responses and opinions.

Of course I don't see all aspects of this problem, so I cannot say what will be best for the future. But I have profiled a GROUP BY select and I believe that calling some of the hash procedures in parallel could help. Consider an example:

postgres=# \d charlie
     Table "public.charlie"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer |
 c      | integer |
Indexes:
    "charlie_pkey" PRIMARY KEY, btree (a)

insert into charlie (a,b,c)
    values (generate_series(1,15000000), 5, generate_series(1,100000));

postgres=# explain select count(*) from (select min(a), max(b) from charlie group by c offset 0) x;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Aggregate  (cost=3196549.18..3196549.19 rows=1 width=0)
   ->  Limit  (cost=3044301.02..3195298.60 rows=100047 width=12)
         ->  GroupAggregate  (cost=3044301.02..3195298.60 rows=100047 width=12)
               ->  Sort  (cost=3044301.02..3081800.29 rows=14999711 width=12)
                     Sort Key: charlie.c
                     ->  Seq Scan on charlie  (cost=0.00..231079.11 rows=14999711 width=12)
(6 rows)

Oprofile output:

CPU: Intel Westmere microarchitecture, speed 2.267e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
85519    38.0101  hash_search_with_hash_value
14009     6.2265  slot_deform_tuple
11808     5.2482  ExecProject
 7675     3.4113  slot_getattr
 7539     3.3508  advance_aggregates
 6932     3.0810  ExecAgg

Of course I know this simple case is only theoretical and real data is much more complicated, but as I can see, almost 40% of the CPU time was spent in a single hash function: hash_search_with_hash_value.

On Sun, 14 Apr 2013 21:54:00 -0400 Stephen Frost <sfrost@snowman.net> wrote:
> [...]

-- 
Karel K. Rozhoň <karel.rozhon@gmail.com>
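To make the idea concrete outside the backend (where, as Stephen notes below, threads are not an option), here is a minimal standalone sketch of a partitioned hash aggregate: each thread builds a private hash table over its slice of the input, so the build phase needs no locking, and a single-threaded merge combines the partial counts at the end. Every name and size in it is made up for illustration, and none of it touches PostgreSQL code:

/* Standalone sketch, not PostgreSQL code.  Private per-thread hash
 * tables are merged once at the end; all sizes are illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4
#define NKEYS    100000              /* cardinality of the "c" column */
#define NROWS    15000000
#define TABSIZE  (2 * NKEYS)         /* open addressing, 50% fill */

typedef struct { int key; long count; } Slot;

typedef struct {
    int        from, to;             /* this thread's slice of the input */
    const int *input;
    Slot      *tab;                  /* private table: no locking needed */
} Work;

static void
add_key(Slot *tab, int key, long count)
{
    unsigned h = (unsigned) key % TABSIZE;

    while (tab[h].count > 0 && tab[h].key != key)   /* linear probing */
        h = (h + 1) % TABSIZE;
    tab[h].key = key;
    tab[h].count += count;
}

static void *
worker(void *arg)
{
    Work *w = arg;
    int   i;

    for (i = w->from; i < w->to; i++)
        add_key(w->tab, w->input[i], 1);
    return NULL;
}

int
main(void)
{
    int       *input = malloc(NROWS * sizeof(int));
    Slot      *merged = calloc(TABSIZE, sizeof(Slot));
    pthread_t  tid[NTHREADS];
    Work       work[NTHREADS];
    long       groups = 0;
    int        i, t;

    for (i = 0; i < NROWS; i++)       /* synthetic "GROUP BY c" input */
        input[i] = i % NKEYS;

    for (t = 0; t < NTHREADS; t++)
    {
        work[t].from = t * (NROWS / NTHREADS);
        work[t].to = (t == NTHREADS - 1) ? NROWS : (t + 1) * (NROWS / NTHREADS);
        work[t].input = input;
        work[t].tab = calloc(TABSIZE, sizeof(Slot));
        pthread_create(&tid[t], NULL, worker, &work[t]);
    }

    for (t = 0; t < NTHREADS; t++)    /* wait, then merge the partials */
    {
        pthread_join(tid[t], NULL);
        for (i = 0; i < TABSIZE; i++)
            if (work[t].tab[i].count > 0)
                add_key(merged, work[t].tab[i].key, work[t].tab[i].count);
    }

    for (i = 0; i < TABSIZE; i++)
        if (merged[i].count > 0)
            groups++;
    printf("distinct groups: %ld\n", groups);
    return 0;
}

The build-private-then-merge shape is exactly what would have to be recast in terms of processes and shared memory inside a backend, which is where the groundwork discussed in this thread comes in.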
* Karel K. Rozhoň (karel.rozhon@gmail.com) wrote:
> Of course I don't see all aspects of this problem, so I cannot say what
> will be best for the future. But I have profiled a GROUP BY select and I
> believe that calling some of the hash procedures in parallel could help.

There seems to be some confusion here. It's certainly true that *many* (most? all?) pieces of query processing would benefit from parallel execution; there is no debate on that. The issue is that PG is not currently set up to do *any* per-query parallel processing and it is *not* a trivial thing to change that. We can talk all day about how wonderful it'd be to do parallel hashing, parallel sorting, etc, but until PG has a way to parallelize query processing, there's really no point in writing code to parallelize individual nodes.

> Of course I know this simple case is only theoretical and real data is
> much more complicated, but as I can see, almost 40% of the CPU time was
> spent in a single hash function: hash_search_with_hash_value.

Improvements to that would be great, but you can't simply call pthread_create() in a PG backend and expect things to work.

Thanks,

Stephen