Real vs Int performance

From: David Greco
Subject: Real vs Int performance
Date: ,
Msg-id: CDFA9340E95A764E9366B4EDF3A43125BCACF7E68D@VA3DIAXVS091.RED001.local
(view: Whole thread, Raw)
Responses: Re: Real vs Int performance  ("Kevin Grittner")
Re: Real vs Int performance  (Tom Lane)
List: pgsql-performance

Tree view

Real vs Int performance  (David Greco, )
 Re: Real vs Int performance  ("Kevin Grittner", )
 Re: Real vs Int performance  (Tom Lane, )
  Re: Real vs Int performance  (David Greco, )
   Re: Real vs Int performance  (Tom Lane, )
    Re: Real vs Int performance  (Shaun Thomas, )
     Re: Real vs Int performance  (Mladen Gogala, )
  Re: Real vs Int performance  ("Igor Neyman", )

New to Postgres and am prototyping a migration from Oracle to Postgres 9.0.1 on Linux. Starting with the data warehouse. Current process is to load the data from

our OLTP (Oracle), dump it into another instance of Oracle for staging and manipulation, then extract it and load it into Infobright. I am trying

to replace the Oracle instance used for staging and manipulation with Postgres. Kettle (PDI), a Java ETL tool, is used for this process.

 

Came across a problem I find perplexing. I recreated the dimensional tables in Oracle and the fields that are integers in Oracle became integers

in Postgres. Was experiencing terrible performance during the load and narrowed down to a particular dimensional lookup problem. The table

dim_carrier holds about 80k rows. You can see the actual query issued by Kettle below, but basically I am looking up using the business key from

our OLTP system. This field is carrier_source_id and is indexed as you can see below. If I change this field from an integer to a real, I get

about a 70x increase in performance of the query. The EXPLAIN ANALYZE output is nearly identical, except for the casting of 1 to a real when the column

is a real. In real life, this query is actually bound and parameterized, but I wished to simplify things a bit here (and don't yet know how to EXPLAIN ANALYZE a parameterized

query). Now in terms of actual performance, the same query executed about 25k times takes 7 seconds with the real column, and 500 seconds with the integer column.

 

What gives here? Seems like integer (or serial) is a pretty common choice for primary key columns, and therefore what I'm experiencing must be an anomoly.

 

 

 

                Table "hits_olap.dim_carrier"

      Column       |            Type             | Modifiers

-------------------+-----------------------------+-----------

 carrier_id        | integer                     | not null

 dim_version       | smallint                    |

 dim_effect_date   | timestamp without time zone |

 dim_expire_date   | timestamp without time zone |

 carrier_source_id | integer                     |

 carrier_name      | character varying(30)       |

 carrier_type      | character varying(30)       |

 carrier_scac      | character varying(4)        |

 carrier_currency  | character varying(3)        |

 current_row       | smallint                    | default 0

Indexes:

    "dim_carrier_pk" PRIMARY KEY, btree (carrier_id)

    "idx_dim_carrier_lookup" btree (carrier_source_id)

 

VACUUM

ANALYZE

REINDEX

 

 EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1'  AND now() >= DIM_EFFECT_DATE

 AND now() < DIM_EXPIRE_DATE;

 

                                                             QUERY PLAN                                                              

-------------------------------------------------------------------------------------------------------------------------------------

 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 rows=2 width=6) (actual time=0.076..0.077 rows=1 loops=1)

   Index Cond: (carrier_source_id = 1)

   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))  Total runtime: 0.108 ms

(4 rows)

 

ALTER TABLE

ALTER TABLE

                Table "hits_olap.dim_carrier"

      Column       |            Type             | Modifiers

-------------------+-----------------------------+-----------

 carrier_id        | integer                     | not null

 dim_version       | smallint                    |

 dim_effect_date   | timestamp without time zone |

 dim_expire_date   | timestamp without time zone |

 carrier_source_id | real                        |

 carrier_name      | character varying(30)       |

 carrier_type      | character varying(30)       |

 carrier_scac      | character varying(4)        |

 carrier_currency  | character varying(3)        |

 current_row       | smallint                    | default 0

Indexes:

    "dim_carrier_pk" PRIMARY KEY, btree (carrier_id)

    "idx_dim_carrier_lookup" btree (carrier_source_id)

 

VACUUM

ANALYZE

REINDEX

 

 EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1'  AND now() >= DIM_EFFECT_DATE

 AND now() < DIM_EXPIRE_DATE;

 

                                                             QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------

 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 rows=2 width=6) (actual time=0.068..0.069 rows=1 loops=1)

   Index Cond: (carrier_source_id = 1::real)

   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))  Total runtime: 0.097 ms

(4 rows)

 

 

 

Thanks for the help,

 

Dave Greco


pgsql-performance by date:

From: "mark"
Date:
Subject: Re: Queries becoming slow under heavy load
From: Scott Marlowe
Date:
Subject: Re: FW: Queries becoming slow under heavy load