Thread: Google Summer of code 2013

From: Akansha Singh

Hi,

I guess parallel query won't be fruitful, or most probably it has
already been implemented...


> There's certainly a lot of groundwork to do, and I do share the concern
> that the project will have to deal with a lot of dirty work (e.g. when
> transferring data between the processes). But couldn't it be a useful
> part of the discussion?

A one-off implementation of parallelized hash table building and/or
usage..?  No, I don't see that as particularly relevant to the
discussion around how to parallelize queries.  There are a ton of
examples already of parallel hash table building and various other
independent pieces (parallel sort, parallel aggregation, etc).  What is
needed for parallel query processing in PG is to figure out what we mean
by it and how to actually implement it.  Following that would be making
the query planner and optimizer aware of it; last would be picking a
specific parallelized implementation of each node and writing it (and
that, really, is the 'easy' part in all of this...).
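
Just to illustrate why I call that the 'easy' part: a one-off parallel
hash build is mechanically simple.  Below is a toy pthreads sketch,
purely illustrative and nothing like PostgreSQL executor code (the
modulo "hash" and the fixed bucket count are invented for the example):
each worker builds a private table over its slice of the input, and the
tables are merged at the end.  Everything hard (planner and optimizer
awareness, moving tuples between processes, shared memory) is exactly
what it leaves out.

    /* Toy example only: per-worker private hash tables over slices of
     * an integer array, merged at the end.  Not PostgreSQL code. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NWORKERS 4
    #define NBUCKETS 1024

    typedef struct { long count[NBUCKETS]; } HashTable;

    typedef struct {
        const int *data;            /* shared, read-only input */
        size_t     start, end;      /* this worker's slice */
        HashTable  ht;              /* private table, no locking needed */
    } WorkerArg;

    static void *build_worker(void *p)
    {
        WorkerArg *a = p;

        for (size_t i = a->start; i < a->end; i++)
            a->ht.count[a->data[i] % NBUCKETS]++;   /* trivial "hash" */
        return NULL;
    }

    int main(void)
    {
        size_t      n = 1000000;
        int        *data = malloc(n * sizeof(int));
        pthread_t   tid[NWORKERS];
        WorkerArg   arg[NWORKERS];
        HashTable   merged;

        for (size_t i = 0; i < n; i++)
            data[i] = rand();

        for (int w = 0; w < NWORKERS; w++)
        {
            arg[w].data  = data;
            arg[w].start = w * n / NWORKERS;
            arg[w].end   = (w + 1) * n / NWORKERS;
            memset(&arg[w].ht, 0, sizeof(HashTable));
            pthread_create(&tid[w], NULL, build_worker, &arg[w]);
        }

        /* merge the private tables once all workers are done */
        memset(&merged, 0, sizeof(HashTable));
        for (int w = 0; w < NWORKERS; w++)
        {
            pthread_join(tid[w], NULL);
            for (int b = 0; b < NBUCKETS; b++)
                merged.count[b] += arg[w].ht.count[b];
        }

        printf("bucket 0 holds %ld entries\n", merged.count[0]);
        free(data);
        return 0;
    }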

> I don't expect a commitable patch at the end, but rather something that
> "works" and may be used as a basis for improvements and to build the
> actual groundwork.

I don't think it'd really help us get any further with parallel query
execution.  To be honest, I'd be a bit surprised if this hasn't been
done already, with patches posted to the list in the past.

> I think that depends on the workload type. For example for databases
> handling DWH-like queries, parallel hash aggregate is going to be a
> major improvement.

DWH is what I deal with day-in and day-out, and I certainly agree that
parallelizing hash builds would be wonderful, but that doesn't mean that
a patch which implements it without any consideration for the rest of
the challenges around parallel query execution will actually move us, as
a project, any closer to getting it.  In fact, I'd expect most DWH
implementations to do what we've done already: massive partitioning and
parallelizing through multiple client connections.
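
For what it's worth, the shape of that approach is roughly the
following: one client connection per partition, each running its own
per-partition aggregate, with the partial results combined by the
application.  This is only an illustrative libpq sketch; the database
name, the sales_p<N> partition tables, and the amount column are all
made up, and error handling is trimmed.

    /* Illustrative only: one libpq connection per partition, each
     * running its own aggregate over its partition table. */
    #include <libpq-fe.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NPART 4

    static void *run_partition(void *p)
    {
        int      part = *(int *) p;
        char     sql[128];
        PGconn  *conn = PQconnectdb("dbname=dwh");   /* made-up dbname */

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            PQfinish(conn);
            return NULL;
        }

        snprintf(sql, sizeof(sql),
                 "SELECT sum(amount) FROM sales_p%d", part);

        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("partition %d: %s\n", part, PQgetvalue(res, 0, 0));

        PQclear(res);
        PQfinish(conn);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NPART];
        int       part[NPART];

        for (int i = 0; i < NPART; i++)
        {
            part[i] = i;
            pthread_create(&tid[i], NULL, run_partition, &part[i]);
        }
        for (int i = 0; i < NPART; i++)
            pthread_join(tid[i], NULL);

        /* combining the per-partition sums is left to the caller */
        return 0;
    }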

> Karel mentioned he's currently working on his bachelor thesis, which is
> about hash tables too. That's another reason why he proposed this topic.

That's wonderful; I'd love to hear about some ways to improve our
hashing system (I even proposed one modification a few days ago that
I'd like to see tested more).  I believe that the costing around
hashing needs to be improved too.

Parallel-anything is a 'sexy' project, but unless it's focused on
answering the hard questions (how do we do parallel work efficiently
while maintaining scalability and portability?), it's not moving us
forward.

regards
Akansha Singh