Thread: Operator performance question

Operator performance question

From: Alban Hertroys
Date: Tuesday, January 09, 2007 11:07 AM
Hi all,

I need your help on a small performance problem.

I have a table on which I have to do a bunch of counts for various
conditions. The worst-case scenario, where I have to iterate over every
record in the table, performs just a little too slowly (800ms). That
particular query will be hit a lot (it will be on the index page of our
web app).

PostgreSQL uses a sequential scan (as it should, IMO) - I think the
bottleneck is in the operators on the various columns.

My queries look like this:

SELECT COUNT(NULLIF(max_persons BETWEEN 5 AND 8, false)) AS "persons 5-8",
-- And other variations

COUNT(NULLIF(country_id = 74, false)) AS "LOCATION_NETHERLANDS",
-- Basically for every country in Europe

COUNT(NULLIF(specifications & '00000000000000000000000001000000',
0::bit(32))) AS "washing machine",
-- And a bunch more of these; the bit mask is almost fully covered

COUNT(*) AS "all"
FROM fewo_property_location
WHERE location_id IS NOT NULL AND property_state_id = 3;

The plan is:
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=7371.23..7371.55 rows=1 width=18) (actual time=803.374..803.376 rows=1 loops=1)
   ->  Seq Scan on fewo_property_location  (cost=0.00..828.84 rows=41538 width=18) (actual time=0.036..147.211 rows=41492 loops=1)
         Filter: ((location_id IS NOT NULL) AND (property_state_id = 3))
 Total runtime: 804.398 ms
(4 rows)

The table definition is like:
      Column       |   Type   |      Modifiers
-------------------+----------+----------------------
 property_id       | integer  | not null
 property_state_id | integer  | not null
 location_id       | integer  |
 min_persons       | smallint | not null
 max_persons       | smallint | not null
 specifications    | bit(32)  | default (0)::bit(32)
 country_id        | integer  |
Indexes:
    "fewo_property_location_pkey" PRIMARY KEY, btree (property_id)
    "fewo_property_location_country_idx" btree (country_id) WHERE location_id IS NOT NULL
    "fewo_property_location_country_location_idx" btree (country_id, location_id) CLUSTER
    "fewo_property_location_location_online_idx" btree (location_id) WHERE location_id IS NOT NULL AND property_state_id = 3
    "fewo_property_location_property_location_idx" btree (property_id, location_id) WHERE location_id IS NOT NULL AND property_state_id = 3
    "fewo_property_location_specifications_idx" btree (specifications)
Foreign-key constraints:
    "fewo_property_location_location_id_fkey" FOREIGN KEY (location_id) REFERENCES fewo_location(location_id) MATCH FULL
    "fewo_property_location_property_state_id_fkey" FOREIGN KEY (property_state_id) REFERENCES fewo_property_state(property_state_id) MATCH FULL

My conclusion is that this query time is mostly limited to the somewhat
complex COUNT expressions. Is there any way to do this more efficiently?

For the record, if I constrain this query to specific countries it
performs in about 80ms (10x as fast).
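
That constrained variant is basically the same query with an extra
country filter, along these lines (country 74 is just an example):

SELECT COUNT(NULLIF(max_persons BETWEEN 5 AND 8, false)) AS "persons 5-8",
       -- ... the same COUNT expressions as above ...
       COUNT(*) AS "all"
  FROM fewo_property_location
 WHERE location_id IS NOT NULL
   AND property_state_id = 3
   AND country_id = 74; -- or an IN-list for a set of countries

Presumably most of the gain simply comes from the much smaller number
of rows that qualify.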

The hardware is a dual Opteron64x2 with 4 GB RAM and some kind of
software RAID setup (I don't know what type), running in a Xen host -
it's our development DB server.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //

Re: Operator performance question

From: Brandon Aiken
Shouldn't these be using HAVING?

SELECT COUNT(max_persons) ...
GROUP BY NULL
HAVING max_persons >= 5 AND max_persons <= 8;

--
Brandon Aiken
CS/IT Systems Engineer

Re: Operator performance question

From: Tom Lane
Alban Hertroys <alban@magproductions.nl> writes:
> My conclusion is that this query time is mostly limited to the somewhat
> complex COUNT expressions. Is there any way to do this more efficiently?

Offhand I would bet on the bitstring-AND operations being the
bottleneck; you could test this by comparing the speed of queries that
are doing different mixes of the same number of COUNT()s.  If you're
happy with a fixed-width 32-bit field, consider using an integer field
and integer & operations, instead of bitstring.  Bitstring is a
pass-by-reference type and so inherently a lot less efficient than an
integer.
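
For instance, if specifications became a plain int4, the washing-machine
test could be written as something like

    -- 64 is the integer value of the '...01000000' mask in your example
    COUNT(NULLIF(specifications & 64, 0)) AS "washing machine"

which avoids the bitstring operators entirely.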

Another suggestion is to replace

    count(nullif(boolean_expr, false))

with

    sum((boolean_expr)::int)
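
That is, the counts in your query would become something like

    sum((max_persons BETWEEN 5 AND 8)::int) AS "persons 5-8",
    sum((country_id = 74)::int) AS "LOCATION_NETHERLANDS",
    sum(((specifications & '00000000000000000000000001000000')
         <> 0::bit(32))::int) AS "washing machine",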

I think this would be a marginal speed win at best (basically replacing
a Const and a NullIf node with a Cast node), but it just seems to me
to be more natural ... it took me a bit to figure out what your query
was trying to accomplish.

            regards, tom lane

Re: Operator performance question

From: Alban Hertroys
Brandon Aiken wrote:
> Shouldn't these be using HAVING?
>
> SELECT COUNT(max_persons) ...
> GROUP BY NULL
> HAVING max_persons >= 5 AND max_persons <= 8;

I don't think so; I need multiple COUNT results for multiple
combinations; 5-8 persons isn't the only one. I also need counts for
1-4, 9-12 and 13 and up.

The output I currently get is something like this:
"persons 1-4"    | 123
"persons 5-8"    | 234
"persons 9-12"    | 345
...
"LOCATION_NETHERLANDS"    | 987
"LOCATION_BELGIUM"    | 765
...
"washing machine"    | 432
"dish washer"        | 210

I don't see how I would get that with a HAVING clause. And I don't think
you can have multiple HAVING clauses...

The crux of the query is that I get all kinds of conditions counted -
with a single query, output as a single record.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //

Re: Operator performance question

From: Alban Hertroys
Tom Lane wrote:
> Alban Hertroys <alban@magproductions.nl> writes:
>> My conclusion is that this query time is mostly limited to the somewhat
>> complex COUNT expressions. Is there any way to do this more efficiently?
>
> Offhand I would bet on the bitstring-AND operations being the
> bottleneck; you could test this by comparing the speed of queries that
> are doing different mixes of the same number of COUNT()s.  If you're
> happy with a fixed-width 32-bit field, consider using an integer field
> and integer & operations, instead of bitstring.  Bitstring is a
> pass-by-reference type and so inherently a lot less efficient than an
> integer.

Hmm... I picked bitstrings because a quick test seemed to show them
performing better than ints. Apparently my test wasn't right; definitely
something to test again.

So far I have some bits of the 32 to spare, but the set of options to
filter on isn't fixed yet - I expect it to grow, and I don't know by how
much yet. I might (although I doubt it) get beyond 64 bits, in which
case even a bigint wouldn't suffice...

> Another suggestion is to replace
>
>     count(nullif(boolean_expr, false))
>
> with
>
>     sum((boolean_expr)::int)

I hadn't realized true::int = 1 and false::int = 0. Thanks for the
suggestions.

> I think this would be a marginal speed win at best (basically replacing
> a Const and a NullIf node with a Cast node), but it just seems to me
> to be more natural ... it took me a bit to figure out what your query
> was trying to accomplish.

It took me a bit to come up with that solution; I agree it's not a very
obvious one. I basically hacked my expression to evaluate to NULL if
false, so that count wouldn't pick it up.

NULLIF did almost what I wanted. I would've been happier with a function
NULLIF(boolean_expr) - with just one boolean parameter. It might be even
better to have a COUNTIF(boolean_expr) aggregate that only counts rows
where the expression evaluates to true - come to think of it, that looks
awfully familiar... Oracle, maybe?
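
Rolling one by hand shouldn't be too hard either; something like this
(names made up, untested):

    CREATE FUNCTION countif_acc(bigint, boolean) RETURNS bigint AS
        'SELECT $1 + CASE WHEN $2 THEN 1 ELSE 0 END'
        LANGUAGE SQL IMMUTABLE;

    CREATE AGGREGATE countif (
        BASETYPE = boolean,
        SFUNC = countif_acc,
        STYPE = bigint,
        INITCOND = '0'
    );

That would let me write countif(max_persons BETWEEN 5 AND 8) directly.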

Anyway, I now have something to go on again. Thanks.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //