Thread: Real vs Int performance
I'm new to Postgres and am prototyping a migration from Oracle to Postgres
9.0.1 on Linux, starting with the data warehouse. The current process is to
load the data from our OLTP system (Oracle), dump it into another Oracle
instance for staging and manipulation, then extract it and load it into
Infobright. I am trying to replace the Oracle instance used for staging and
manipulation with Postgres. Kettle (PDI), a Java ETL tool, is used for this
process.
I came across a problem I find perplexing. I recreated the dimensional tables
from Oracle in Postgres, and the fields that are integers in Oracle became
integers in Postgres. I was experiencing terrible performance during the load
and narrowed it down to a particular dimensional lookup. The table dim_carrier
holds about 80k rows. You can see the actual query issued by Kettle below;
basically I am looking up rows using the business key from our OLTP system.
That field is carrier_source_id and is indexed, as you can see below. If I
change this field from an integer to a real, I get about a 70x increase in the
performance of the query. The EXPLAIN ANALYZE output is nearly identical,
except for the cast of 1 to a real when the column is a real. In real life,
this query is actually bound and parameterized, but I wanted to simplify
things a bit here (and don't yet know how to EXPLAIN ANALYZE a parameterized
query). In terms of actual performance, the same query executed about 25k
times takes 7 seconds with the real column and 500 seconds with the integer
column.

What gives here? Integer (or serial) seems like a pretty common choice for
primary key columns, so what I'm experiencing must be an anomaly.
Table "hits_olap.dim_carrier"
Column | Type | Modifiers
-------------------+-----------------------------+-----------
carrier_id | integer | not null
dim_version | smallint |
dim_effect_date | timestamp without time zone |
dim_expire_date | timestamp without time zone |
carrier_source_id | integer |
carrier_name | character varying(30) |
carrier_type | character varying(30) |
carrier_scac | character varying(4) |
carrier_currency | character varying(3) |
current_row | smallint | default 0
Indexes:
"dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
"idx_dim_carrier_lookup" btree (carrier_source_id)
VACUUM
ANALYZE
REINDEX
EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1' AND now() >= DIM_EFFECT_DATE
AND now() < DIM_EXPIRE_DATE;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 rows=2 width=6) (actual time=0.076..0.077 rows=1 loops=1)
   Index Cond: (carrier_source_id = 1)
   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))
 Total runtime: 0.108 ms
(4 rows)
ALTER TABLE
ALTER TABLE
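The statements that produced the two ALTER TABLE confirmations above are not
shown in the transcript; presumably the column type was switched with
something along these lines (the exact statements, including the USING cast,
are an assumption):

    -- Change the lookup column from integer to real; the dependent index
    -- idx_dim_carrier_lookup is rebuilt automatically as part of this.
    ALTER TABLE hits_olap.dim_carrier
        ALTER COLUMN carrier_source_id TYPE real
        USING carrier_source_id::real;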
Table "hits_olap.dim_carrier"
Column | Type | Modifiers
-------------------+-----------------------------+-----------
carrier_id | integer | not null
dim_version | smallint |
dim_effect_date | timestamp without time zone |
dim_expire_date | timestamp without time zone |
carrier_source_id | real |
carrier_name | character varying(30) |
carrier_type | character varying(30) |
carrier_scac | character varying(4) |
carrier_currency | character varying(3) |
current_row | smallint | default 0
Indexes:
"dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
"idx_dim_carrier_lookup" btree (carrier_source_id)
VACUUM
ANALYZE
REINDEX
EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1' AND now() >= DIM_EFFECT_DATE
AND now() < DIM_EXPIRE_DATE;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 rows=2 width=6) (actual time=0.068..0.069 rows=1 loops=1)
   Index Cond: (carrier_source_id = 1::real)
   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))
 Total runtime: 0.097 ms
(4 rows)
Thanks for the help,
Dave Greco
David Greco <David_Greco@harte-hanks.com> wrote:

> If I change this field from an integer to a real, I get about a
> 70x increase in performance of the query.

> I wished to simplify things a bit here (and don't yet know how to
> EXPLAIN ANALYZE a parameterized query).

> carrier_source_id | integer |
> runtime: 0.108 ms

> carrier_source_id | real |
> runtime: 0.097 ms

This doesn't show the problem, so it's hard to guess the cause.
Perhaps you can do it with a prepared statement?:

http://www.postgresql.org/docs/9.0/interactive/sql-prepare.html

Also, plans can be completely different based on the number of rows,
width of the rows, distribution of values, etc. You may want to
select against the actual tables where you've seen the problem.

One tip -- if size permits, try to CLUSTER both tables to avoid any
bloat issues, and VACUUM ANALYZE the tables to ensure that hint bits
are set and statistics are up to date before running the tests. Run
each test several times in a row to see what effect caching has on
the issue.

-Kevin
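For reference, a minimal sketch of the prepared-statement approach Kevin
points to, using the table and lookup value from the original post (the
parameter type declared here is only an assumption about what the client
actually binds):

    -- Prepare the lookup with an explicitly typed parameter, then ask for
    -- the plan the prepared statement actually gets.
    PREPARE carrier_lookup (integer) AS
        SELECT carrier_id, dim_version
          FROM hits_olap.dim_carrier
         WHERE carrier_source_id = $1
           AND now() >= dim_effect_date
           AND now() <  dim_expire_date;

    EXPLAIN ANALYZE EXECUTE carrier_lookup(1);

    DEALLOCATE carrier_lookup;

Changing the declared parameter type (for example to numeric) and re-running
the EXPLAIN is one way to see how the driver's choice of bind type changes
the plan.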
David Greco <David_Greco@harte-hanks.com> writes:
> Came across a problem I find perplexing. I recreated the dimensional
> tables from Oracle in Postgres, and the fields that are integers in
> Oracle became integers in Postgres. Was experiencing terrible
> performance during the load and narrowed down to a particular
> dimensional lookup problem. The table dim_carrier holds about 80k
> rows. You can see the actual query issued by Kettle below, but
> basically I am looking up using the business key from our OLTP
> system. This field is carrier_source_id and is indexed as you can
> see below. If I change this field from an integer to a real, I get
> about a 70x increase in performance of the query.

That's really, really hard to believe, given that all else is equal ---
so I'm betting it isn't. I suspect that what is really happening is
that you're passing non-integral comparison constants in your queries.
For example, if carrier_id is an integer, then

	SELECT ... WHERE carrier_id = 42

is indexable, but this isn't:

	SELECT ... WHERE carrier_id = 42.0

The latter case however *would* be indexable if carrier_id were float.

The examples you show fail to show any performance difference at all,
but that's probably because you used quoted literals ('42' not 42),
which prevents the parser from deciding that a cross-type comparison
is demanded. I believe Oracle handles such things differently, so
running into this type of issue during an Oracle port isn't too
surprising.

> In real life, this query is actually bound and parameterized,

In that case, an EXPLAIN using literal constants is next door to
useless in terms of telling you what will happen in real life. You
need to pay attention to exactly how the parameterization is done.
Again, I'm suspecting a wrong datatype indication.

			regards, tom lane
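To make Tom's point concrete, here is the kind of experiment one could run
against the table from the original post; the comments describe the behavior
one would expect rather than captured output:

    -- Integral constant: same type as the integer column, so the btree
    -- index on carrier_source_id can be used.
    EXPLAIN SELECT carrier_id FROM hits_olap.dim_carrier
     WHERE carrier_source_id = 42;

    -- Non-integral constant: parsed as numeric, so the comparison becomes
    -- integer-vs-numeric and the index no longer applies; expect a
    -- sequential scan with a filter instead.
    EXPLAIN SELECT carrier_id FROM hits_olap.dim_carrier
     WHERE carrier_source_id = 42.0;

    -- A bound parameter declared as numeric behaves like the second case;
    -- this mimics a driver binding a decimal value.
    PREPARE lookup_numeric (numeric) AS
        SELECT carrier_id FROM hits_olap.dim_carrier
         WHERE carrier_source_id = $1;
    EXPLAIN EXECUTE lookup_numeric(42);

    -- The quoted literal '42' used in the original EXPLAINs is an untyped
    -- literal, so it resolves to the column's own type and stays indexable.
    EXPLAIN SELECT carrier_id FROM hits_olap.dim_carrier
     WHERE carrier_source_id = '42';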
Right you are. Kettle is turning the number(11) field from Oracle into a
BigNumber, which is a decimal. If I cast the field into an Integer in
Kettle and keep the field an integer in Postgres, I get good performance.
Suspect the correct course of action would simply be to make number(11)
fields in Oracle numeric(11,0) fields in Postgres.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, January 26, 2011 5:12 PM
To: David Greco
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Real vs Int performance

> That's really, really hard to believe, given that all else is equal ---
> so I'm betting it isn't. I suspect that what is really happening is
> that you're passing non-integral comparison constants in your queries.
> [...]
> In that case, an EXPLAIN using literal constants is next door to
> useless in terms of telling you what will happen in real life. You
> need to pay attention to exactly how the parameterization is done.
> Again, I'm suspecting a wrong datatype indication.
>
> 			regards, tom lane
David Greco <David_Greco@harte-hanks.com> writes:
> Right you are. Kettle is turning the number(11) field from Oracle into
> a BigNumber, which is a decimal. If I cast the field into an Integer
> in Kettle and keep the field an integer in Postgres, I get good
> performance. Suspect the correct course of action would simply be to
> make number(11) fields in Oracle numeric(11,0) fields in Postgres.

Not if you can persuade the client-side code to output integers as
integers. "numeric" type is orders of magnitude slower than integers.

			regards, tom lane
On 01/27/2011 08:18 AM, Tom Lane wrote:

> Not if you can persuade the client-side code to output integers as
> integers. "numeric" type is orders of magnitude slower than integers.

I sadly have to vouch for this. My company converted an old Oracle app
and they changed all their primary keys (and foreign keys, and random
larger int fields) to NUMERIC(19)'s. I've convinced them all new stuff
should be BIGINT if they need that level of coverage, but the damage is
already done.

I'm not sure about orders of magnitude on the storage/index side, but
my tests gave us a 10% boost if just the keys are switched over to INT
or BIGINT.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com
On 1/27/2011 9:30 AM, Shaun Thomas wrote:

> I'm not sure about orders of magnitude on the storage/index side, but my
> tests gave us a 10% boost if just the keys are switched over to INT or
> BIGINT.

Well, it depends on what you're doing. Searching by an integer vs.
searching by a text string will probably not make much of a difference.
However, if you are calculating sums or averages, there will be a huge
difference.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
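A rough way to see the arithmetic gap Mladen describes; this is a sketch
only, with an arbitrary row count, and the exact ratio will depend on
hardware and version:

    -- Compare aggregation over integer vs. numeric columns
    -- (\timing in psql shows the difference).
    CREATE TEMP TABLE t_int AS
        SELECT g::integer       AS v FROM generate_series(1, 1000000) g;
    CREATE TEMP TABLE t_num AS
        SELECT g::numeric(19,0) AS v FROM generate_series(1, 1000000) g;

    SELECT sum(v), avg(v) FROM t_int;  -- integer aggregation
    SELECT sum(v), avg(v) FROM t_num;  -- numeric aggregation, noticeably slower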
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, January 26, 2011 5:12 PM
> To: David Greco
> Cc: pgsql-performance@postgresql.org
> Subject: Re: Real vs Int performance
>
> David Greco <David_Greco@harte-hanks.com> writes:
> > Came across a problem I find perplexing. I recreated the dimensional
> > tables from Oracle in Postgres, and the fields that are integers in
> > Oracle became integers in Postgres. Was experiencing terrible
> > performance during the load and narrowed down to a particular
> > dimensional lookup problem.
> .......................................
> .......................................
> > In real life, this query is actually bound and parameterized,
>
> In that case, an EXPLAIN using literal constants is next door
> to useless in terms of telling you what will happen in real
> life. You need to pay attention to exactly how the
> parameterization is done. Again, I'm suspecting a wrong
> datatype indication.
>
> 			regards, tom lane

To see what happens with a parameterized query in "real life" you could
try the "auto_explain" contrib module.

Regards,
Igor Neyman
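A minimal auto_explain setup along the lines Igor suggests; it can be done
per session as below (LOAD requires superuser), and the threshold shown is
just an example value:

    -- Log the actual plan of every statement in this session, including
    -- the parameterized queries sent by the ETL tool.
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = 0;   -- log plans for all statements
    SET auto_explain.log_analyze = true;     -- include actual times/row counts

The logged plans then show up in the server log, which makes it possible to
see the plan a bound query really gets without rewriting it with literal
constants.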