
Are there any options to parallelize queries? - Mailing list pgsql-general

From	Seref Arikan
Subject	Are there any options to parallelize queries?
Date	August 21, 2012 08:45:44
Msg-id	CA+4ThdoofztNGdw8d1g7CBc4XE81fSydekvDV_0-5YTk7A_sog@mail.gmail.com
Responses	Re: Are there any options to parallelize queries? (3 replies)
List	pgsql-general


Dear all,
I am designing an electronic health record repository which uses PostgreSQL as its RDBMS. For those who may find the topic interesting, the EHR standard I specialize in is openEHR: http://www.openehr.org/

My design uses parallel execution in the layers above the database, and it seems to scale quite well. However, I have a scaling problem at hand. A single patient can accumulate up to 1 million clinical data entries after a few decades of use. Clinicians do love their data, and especially in chronic disease management, they demand access to whatever data exists. If you have 20 years of data for a diabetic patient, for example, they'll want to look for trends in it, or even scroll through all of it, maybe with some filtering.
My requirement is to process those 1 million records as fast as possible. For population queries, we're talking about billions of records. Each clinical record (even with all the optimizations our domain has developed in the last 30 or so years) leads to a number of rows, so you can see that this is really big data. (Imagine a national diabetes registry with lifetime data for a few million patients.)
I am ready to consider Hadoop or other non-transactional approaches for population queries, but clinical care still requires that I process millions of records for a single patient.

Parallel software frameworks such as Erlang's OTP or Scala's Akka do help a lot, but it would be a lot better if I could feed those frameworks with data faster. So, what options do I have to execute queries in parallel, assuming a transactional system running on PostgreSQL? For example, I'd like to get the last 10 years' records in chunks of 2 years of data, or chunks of 5K records, fed to N parallel processing machines. The clinical system should keep functioning in the meantime, with new records being added, etc.
PGPool looks like a good option, but I'd appreciate your input. Any proven best practices, architectures, products?
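
To make the chunking idea above concrete, here is a minimal sketch in Python of splitting the last 10 years into 2-year windows and fetching them concurrently. It is only an illustration under assumptions: `year_chunks`, `shift_years`, `fetch_parallel`, and the `fetch_chunk` callback are hypothetical names, and no schema is implied. In a real setup, `fetch_chunk` would presumably open its own database connection and run a range query on a timestamp column, since a single PostgreSQL connection executes one query at a time.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def shift_years(d, years):
    """Return `d` moved back by `years`, clamping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:  # Feb 29 has no counterpart in a non-leap year
        return d.replace(year=d.year - years, day=28)


def year_chunks(end, total_years=10, chunk_years=2):
    """Split the `total_years` before `end` into half-open [start, stop) windows."""
    windows = []
    for offset in range(total_years, 0, -chunk_years):
        start = shift_years(end, offset)
        stop = shift_years(end, max(offset - chunk_years, 0))
        windows.append((start, stop))
    return windows


def fetch_parallel(fetch_chunk, windows, workers=4):
    """Call the (hypothetical) fetch_chunk(start, stop) for each window concurrently.

    Results come back in the same order as `windows`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda w: fetch_chunk(*w), windows))
```

Half-open windows keep a record on a boundary date from being fetched twice. Note that separate connections do not share a snapshot by default, so records added while the chunks are being read may appear in some windows and not others.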

Best regards
Seref
