Re: Query completed in < 1s in PG 9.1 and ~ 700s in PG 9.2 - Mailing list pgsql-performance

From: Rodrigo Rosenfeld Rosas
Subject: Re: Query completed in < 1s in PG 9.1 and ~ 700s in PG 9.2
Msg-id: 509A4917.2080303@gmail.com
In response to: Re: Query completed in < 1s in PG 9.1 and ~ 700s in PG 9.2 (Merlin Moncure <mmoncure@gmail.com>)
List: pgsql-performance
On 06-11-2012 19:48, Merlin Moncure wrote:
> On Tue, Nov 6, 2012 at 12:57 PM, Rodrigo Rosenfeld Rosas
> <rr.rosas@gmail.com>  wrote:
>>> I would strongly consider investigation of hstore type along with
>>> gist/gin index.
>>> select * from company_transaction where contract_attributes @>
>>> 'State=>Delaware, Paid=>Y';
>>> etc
>>
>>
>> I'm not very familiar with hstore yet, but this was one of the reasons I
>> wanted to migrate to PG 9.2. However, I won't be able to migrate the
>> application to hstore quickly.
> sure -- it's a major change.  note though that 9.1 hstore has
> everything you need.

Great to know.
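
Just to check my understanding, I suppose you mean something like this
(the table and column names below are just made up for the example):

    -- hstore ships as a contrib extension in 9.1
    CREATE EXTENSION hstore;

    CREATE TABLE company_transaction (
        id bigserial PRIMARY KEY,
        contract_attributes hstore
    );

    -- GIN index so @> containment searches can avoid a sequential scan
    CREATE INDEX company_transaction_attrs_idx
        ON company_transaction USING gin (contract_attributes);

    SELECT * FROM company_transaction
     WHERE contract_attributes @> 'State=>Delaware, Paid=>Y';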

>> Also, I'm not sure if hstore allows us to be as flexible as we currently are
>> (c1 and (c2 or c3 and not (c4 and c5))). c == condition
> you're not gated from that functionality, although making complicated
> expressions might require some thought and defeat some or all of GiST
> optimization. that said, nothing is keeping you from doing:
>
> where fields @> 'c1=>true, c2=>45' and not (fields @> 'c3=>false, c4=>xyz');
>
> range searches would completely bypass GiST.  so that:
> select * from foo where attributes -> 'somekey' between 'value1' and 'value2';
>
> would work but would be brute force.  Still, with a little bit of
> thought, you should be able to optimize most common cases and when it
> boils down to a straight filter (a and b and c) you'll get an
> orders-of-magnitude faster query.
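
If I follow, a mixed query would then look something like this (keys and
values invented), with the @> part using the index and the range
comparison applied as a plain filter on top:

    SELECT * FROM company_transaction
     WHERE contract_attributes @> 'State=>Delaware'         -- indexable
       AND (contract_attributes -> 'Price')::numeric
           BETWEEN 100000 AND 500000;                       -- brute force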

Then I'm not sure hstore would speed up anything for us, because except
for boolean fields most of our field types won't use the equality (=)
operator.

For instance, for numeric types (number, percent, currency) and dates it
is more usual to use (>), (<) or BETWEEN than (=). For strings we use the
ILIKE operator instead of (=).

>>> Barring that, I would then consider complete elimination of integer
>>> proxies for your variables.  They make your query virtually impossible
>>> to read/write, and they don't help.
>>
>> I'm not sure if I understood what you're talking about. The template is
>> dynamic and contains lots of information for each field, like type (number,
>> percent, string, date, etc), parent_id (auto-referencing), aggregator_id
>> (also auto-referencing) and several other columns. But the values associate
>> the field id (type_id) and the transaction id in a unique way (see unique
>> index in my first message of the thread). Then I need different tables to
>> store the actual value because we're using SQL instead of MongoDB or
>> something else. The table that stores the value depends on the field type.
> Well, that's probably a mistake.  It's probably better to have a
> single table with a text field (which is basically a variant) and a
> 'type' column storing the type of it if you need special handling down
> the line.
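
Just so we're talking about the same thing, I read that as something like
this (column names invented):

    CREATE TABLE transaction_field_value (
        transaction_id bigint NOT NULL,
        field_id       bigint NOT NULL,
        value          text,            -- the "variant", always stored as text
        value_type     text NOT NULL,   -- 'string', 'date', 'numeric', ...
        UNIQUE (transaction_id, field_id)
    );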

This would require tons of run-time conversions (dates, numbers, etc.)
that would not be indexable. I would expect that approach to be much
slower. The user can also sort the results by any field, and the sort
operation could become too slow with all those run-time conversions in
place.

> One thing I'm sure of is that abstracting type behind
> type_id is doing nothing but creating needless extra work.

You said that in the other message, and I asked for an example when I
explained why I need a separate table to store each field type's data. I
still don't understand what you mean, which is why I asked for an
example. I guess the main problem here is terminology, because when I
joined this project I had the same trouble you seem to be having in
understanding the query.

Currently there is a "condition_type" table that should really be called
"contract_fields", as it contains the possible fields to be extracted
from a contract using our clients' terminology. In this table we find the
label of the field and its actual data type (string, currency, date,
etc.), among several other columns.

So, "type_id" should actually be called "field_id" or
"contract_field_id". It doesn't hold only the data type.

Then we have a table called "transaction_condition" that I would rather
call "field_value" or "transaction_field_value". (I simplified before: a
transaction can have multiple contracts, but a field actually belongs to
the transaction rather than to a single contract - we have a references
table that joins the contract and the position (paragraph, etc.) within
the contract to the transaction.)

So I can see two options here. We could either have a column for each
data type in "transaction_condition" (or "field_value" as I would call
it) and create an index on each column, or we could have different tables
to store the values. It wasn't me who decided which approach to take when
this database was designed some years ago (I hadn't joined the project by
then), and I'm not sure which approach I would have taken either. I would
probably perform some benchmarks before deciding. The first option would
look something like the sketch below.
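
Again with invented column names, the first option would be roughly:

    CREATE TABLE transaction_condition (    -- a.k.a. "field_value"
        transaction_id bigint NOT NULL,
        type_id        bigint NOT NULL REFERENCES condition_type (id),
        string_value   text,
        numeric_value  numeric,
        date_value     date,
        boolean_value  boolean,
        UNIQUE (transaction_id, type_id)
    );

    CREATE INDEX ON transaction_condition (string_value);
    CREATE INDEX ON transaction_condition (numeric_value);
    CREATE INDEX ON transaction_condition (date_value);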

But I guess you're seeing a third approach that I'm failing to grasp, and
I'd love to understand your proposal. Could you please provide some
example?

> You're doing all kinds of acrobatics to fight the schema by hiding it under
> various layers of abstraction.

We do that because we don't see another option. We'd love to know about
any suggestions to improve our design. We don't like to complicate
simple stuff ;)

Thanks in advance,
Rodrigo.


