Re: Calculating statistic via function rather than with query is slowing my query - Mailing list pgsql-performance

From Anish Kejariwal
Subject Re: Calculating statistic via function rather than with query is slowing my query
Date
Msg-id CAOpcnr8K0Rw6_=vZh4-cR9=6CXhBQvEiK_8as5NcC8O4WW4fqw@mail.gmail.com
In response to Re: Calculating statistic via function rather than with query is slowing my query  (Craig Ringer <ringerc@ringerc.id.au>)
List pgsql-performance
Hi Craig,

Fair point.  For now, I mean "just fast" - right now the query takes 5-15 seconds, but I'd like to get it down to the 1-2 second range.

From the query I provided, I have approximately 30,000 unique keys (what I called primary_id) that I'm grouping by, and each key has a series of numerical values for each of the type_ids.  I'm looking at averages, stddev, and other statistics across a few hundred type_ids (where agg.type_id in ....).  The part of the query that varies is the user-specified set of type_ids, which makes it impossible to precalculate my statistics.
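
To make the shape concrete, the query is roughly of this form (the value column and the literal type_ids are just placeholders here, not my actual schema, since I'm not repeating the full query):

    SELECT agg.primary_id,
           avg(agg.value)    AS avg_value,
           stddev(agg.value) AS stddev_value
    FROM   agg
    WHERE  agg.type_id IN (101, 102, 103)   -- user-specified list of a few hundred type_ids
    GROUP  BY agg.primary_id;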

I'd like this to eventually scale to a million unique keys, and a thousand type_ids.

For now, Postgres has been great for modeling the data, understanding where I hit performance bottlenecks, and providing a fast enough user interface.  But I'm definitely starting to think about whether I can cache my data (with millions of keys and thousands of type_ids, the data might be too large), and whether to look into distributed databases (even though I can't precompute the stats, my queries are easily distributable across multiple processors, since each processor could take a batch of keys).  I might even want to consider a column-oriented database - since my keys don't change often, I could potentially add new columns when there are new type_ids.
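
For example, the batching idea would amount to each worker running the same aggregate over its own disjoint slice of keys (the range boundaries and value column below are just illustrative):

    SELECT agg.primary_id,
           avg(agg.value)    AS avg_value,
           stddev(agg.value) AS stddev_value
    FROM   agg
    WHERE  agg.type_id IN (101, 102, 103)        -- user-specified type_ids (placeholders)
      AND  agg.primary_id BETWEEN 1 AND 250000   -- this worker's batch of keys
    GROUP  BY agg.primary_id;

Since each worker gets a non-overlapping primary_id range and the grouping is on primary_id, the per-worker results could simply be concatenated with no merge step.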

I've been thinking of looking into memcached or hbase.  If you have any suggestions on which options I should explore, I'd greatly appreciate it.

Sorry for veering off topic a bit from Postgres.

thanks,
Anish

On Wed, Aug 17, 2011 at 10:32 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
On 18/08/2011 9:03 AM, Anish Kejariwal wrote:
Thanks for the help Pavel and Craig.  I really appreciate it.  I'm going to try a couple of these different options (write a C function, use a SQL function with CASE statements, and use plperl), so I can see which gives me the realtime performance that I need and works best for clean code in my particular case.
Do you really mean "realtime"? Or just "fast"?

If you have strongly bounded latency requirements, any SQL-based, disk-based system is probably not for you. Especially not one that relies on a statistics-based query planner, caching, and periodic checkpoints. I'd be looking into in-memory databases designed for realtime environments where latency is critical.

Hard realtime: If this system fails to respond within <x> milliseconds, all the time, every time, then something will go "smash" or "boom" expensively and unrecoverably.

Soft realtime: If this system responds late, the late response is expensive or less useful. Frequent late responses are unacceptable but the occasional one might be endurable.

Just needs to be fast: If it responds late, the user gets irritated because they're sitting and waiting for a response. Regular long stalls are unacceptable, but otherwise the user can put up with it. You're more concerned with average latency than maximum latency.

--
Craig Ringer
