Thread: Real vs Int performance

Real vs Int performance

From
David Greco
Date:

New to Postgres and am prototyping a migration from Oracle to Postgres 9.0.1 on Linux, starting with the data warehouse. The current process is to load the data from our OLTP system (Oracle), dump it into another instance of Oracle for staging and manipulation, then extract it and load it into Infobright. I am trying to replace the Oracle instance used for staging and manipulation with Postgres. Kettle (PDI), a Java ETL tool, is used for this process.

Came across a problem I find perplexing. I recreated the dimensional tables in Postgres, and the fields that are integers in Oracle became integers in Postgres. I was experiencing terrible performance during the load and narrowed it down to a particular dimensional lookup. The table dim_carrier holds about 80k rows. You can see the actual query issued by Kettle below; basically, I am looking up rows by the business key from our OLTP system. This field is carrier_source_id and is indexed, as shown below. If I change this field from an integer to a real, I get about a 70x increase in the performance of the query. The EXPLAIN ANALYZE output is nearly identical in the two cases, except for the cast of 1 to real when the column is a real. In real life this query is bound and parameterized, but I wanted to simplify things a bit here (and don't yet know how to EXPLAIN ANALYZE a parameterized query). In terms of actual performance, the same query executed about 25k times takes 7 seconds with the real column and 500 seconds with the integer column.

What gives here? Integer (or serial) seems like a pretty common choice for primary key columns, so what I'm experiencing must be an anomaly.

                Table "hits_olap.dim_carrier"

      Column       |            Type             | Modifiers

-------------------+-----------------------------+-----------

 carrier_id        | integer                     | not null

 dim_version       | smallint                    |

 dim_effect_date   | timestamp without time zone |

 dim_expire_date   | timestamp without time zone |

 carrier_source_id | integer                     |

 carrier_name      | character varying(30)       |

 carrier_type      | character varying(30)       |

 carrier_scac      | character varying(4)        |

 carrier_currency  | character varying(3)        |

 current_row       | smallint                    | default 0

Indexes:

    "dim_carrier_pk" PRIMARY KEY, btree (carrier_id)

    "idx_dim_carrier_lookup" btree (carrier_source_id)

 

VACUUM

ANALYZE

REINDEX

 

 EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER WHERE CARRIER_SOURCE_ID = '1'  AND now() >= DIM_EFFECT_DATE

 AND now() < DIM_EXPIRE_DATE;

 

                                                             QUERY PLAN                                                              

-------------------------------------------------------------------------------------------------------------------------------------

 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 rows=2 width=6) (actual time=0.076..0.077 rows=1 loops=1)

   Index Cond: (carrier_source_id = 1)

   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))  Total runtime: 0.108 ms

(4 rows)

 

ALTER TABLE
ALTER TABLE
                Table "hits_olap.dim_carrier"
      Column       |            Type             | Modifiers
-------------------+-----------------------------+-----------
 carrier_id        | integer                     | not null
 dim_version       | smallint                    |
 dim_effect_date   | timestamp without time zone |
 dim_expire_date   | timestamp without time zone |
 carrier_source_id | real                        |
 carrier_name      | character varying(30)       |
 carrier_type      | character varying(30)       |
 carrier_scac      | character varying(4)        |
 carrier_currency  | character varying(3)        |
 current_row       | smallint                    | default 0
Indexes:
    "dim_carrier_pk" PRIMARY KEY, btree (carrier_id)
    "idx_dim_carrier_lookup" btree (carrier_source_id)

VACUUM
ANALYZE
REINDEX

EXPLAIN ANALYZE SELECT CARRIER_ID, DIM_VERSION FROM HITS_OLAP.DIM_CARRIER
 WHERE CARRIER_SOURCE_ID = '1' AND now() >= DIM_EFFECT_DATE AND now() < DIM_EXPIRE_DATE;

                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_dim_carrier_lookup on dim_carrier  (cost=0.00..12.10 rows=2 width=6) (actual time=0.068..0.069 rows=1 loops=1)
   Index Cond: (carrier_source_id = 1::real)
   Filter: ((now() >= dim_effect_date) AND (now() < dim_expire_date))
 Total runtime: 0.097 ms
(4 rows)

 

 

 

Thanks for the help,

Dave Greco

Re: Real vs Int performance

From
"Kevin Grittner"
Date:
David Greco <David_Greco@harte-hanks.com> wrote:

> If I change this field from an integer to a real, I get about a
> 70x increase in performance of the query.

> I wished to simplify things a bit here (and don't yet know how to
> EXPLAIN ANALYZE a parameterized query).

>  carrier_source_id | integer                     |

> runtime: 0.108 ms

>  carrier_source_id | real                        |

> runtime: 0.097 ms

This doesn't show the problem, so it's hard to guess the cause.
Perhaps you can do it with a prepared statement?:

http://www.postgresql.org/docs/9.0/interactive/sql-prepare.html
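For example, something along these lines (a sketch against the table shown earlier; the statement name `carrier_lookup` is made up):

```sql
-- Prepare the lookup with an explicit integer parameter type,
-- then time the plan the server builds for the prepared statement.
PREPARE carrier_lookup(integer) AS
    SELECT carrier_id, dim_version
      FROM hits_olap.dim_carrier
     WHERE carrier_source_id = $1
       AND now() >= dim_effect_date
       AND now() <  dim_expire_date;

EXPLAIN ANALYZE EXECUTE carrier_lookup(1);

DEALLOCATE carrier_lookup;
```

Declaring the parameter as `numeric` instead of `integer` in the PREPARE should reproduce whatever type mismatch the application's driver is causing.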

Also, plans can be completely different based on the number of rows,
width of the rows, distribution of values, etc.  You may want to
select against the actual tables where you've seen the problem.

One tip -- if size permits, try to CLUSTER both tables to avoid any
bloat issues, and VACUUM ANALYZE the tables to ensure that hint bits
are set and statistics are up to date before running the tests.  Run
each test several times in a row to see what effect caching has on
the issue.

-Kevin

Re: Real vs Int performance

From
Tom Lane
Date:
David Greco <David_Greco@harte-hanks.com> writes:
> Came across a problem I find perplexing. I recreated the dimensional
> tables in Oracle and the fields that are integers in Oracle became
> integers in Postgres. Was experiencing terrible performance during the
> load and narrowed down to a particular dimensional lookup problem. The
> table dim_carrier holds about 80k rows. You can see the actual query
> issued by Kettle below, but basically I am looking up using the
> business key from our OLTP system. This field is carrier_source_id and
> is indexed as you can see below. If I change this field from an
> integer to a real, I get about a 70x increase in performance of the
> query.

That's really, really hard to believe, given that all else is equal ---
so I'm betting it isn't.  I suspect that what is really happening is
that you're passing non-integral comparison constants in your queries.
For example, if carrier_id is an integer, then

    SELECT ... WHERE carrier_id = 42

is indexable, but this isn't:

    SELECT ... WHERE carrier_id = 42.0

The latter case however *would* be indexable if carrier_id were float.

The examples you show fail to show any performance difference at all,
but that's probably because you used quoted literals ('42' not 42),
which prevents the parser from deciding that a cross-type comparison
is demanded.
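A quick way to see the distinction is to compare plans for the three literal forms (a hypothetical session against the table above; exact costs will vary):

```sql
-- integer = integer: the btree index on carrier_source_id is usable
EXPLAIN SELECT carrier_id FROM hits_olap.dim_carrier
 WHERE carrier_source_id = 42;

-- integer = numeric: a cross-type comparison, so the integer index
-- cannot be used and the planner falls back to a seq scan
EXPLAIN SELECT carrier_id FROM hits_olap.dim_carrier
 WHERE carrier_source_id = 42.0;

-- a quoted literal starts out as "unknown" and is resolved to the
-- column's type, so it behaves like the first case
EXPLAIN SELECT carrier_id FROM hits_olap.dim_carrier
 WHERE carrier_source_id = '42';
```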

I believe Oracle handles such things differently, so running into this
type of issue during an Oracle port isn't too surprising.

> In real life, this query is actually bound and parameterized,

In that case, an EXPLAIN using literal constants is next door to useless
in terms of telling you what will happen in real life.  You need to pay
attention to exactly how the parameterization is done.  Again, I'm
suspecting a wrong datatype indication.

            regards, tom lane

Re: Real vs Int performance

From
David Greco
Date:
Right you are. Kettle is turning the number(11) field from Oracle into a BigNumber, which is a decimal. If I cast the field into an Integer in Kettle and keep the field an integer in Postgres, I get good performance. Suspect the correct course of action would simply be to make number(11) fields in Oracle numeric(11,0) fields in Postgres.
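If the driver's binding can't be changed, one hypothetical SQL-side workaround is to cast the bound parameter back to the column's type inside the lookup query itself:

```sql
-- Illustrative only: "?" stands for the JDBC-style placeholder Kettle
-- binds. Casting the decimal parameter to integer restores an
-- integer = integer comparison, so the index stays usable.
SELECT carrier_id, dim_version
  FROM hits_olap.dim_carrier
 WHERE carrier_source_id = CAST(? AS integer)
   AND now() >= dim_effect_date
   AND now() <  dim_expire_date;
```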
 



Re: Real vs Int performance

From
Tom Lane
Date:
David Greco <David_Greco@harte-hanks.com> writes:
> Right you are. Kettle is turning the number(11) field from Oracle into
> a BigNumber, which is a decimal. If I cast the field into an Integer
> in Kettle and keep the field an integer in Postgres, I get good
> performance. Suspect the correct course of action would simply be to
> make number(11) fields in Oracle numeric(11,0) fields in Postgres.

Not if you can persuade the client-side code to output integers as
integers.  "numeric" type is orders of magnitude slower than integers.

            regards, tom lane
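The arithmetic gap is easy to measure directly (a sketch with made-up temp table names; `\timing` is a psql feature, and absolute numbers will vary by hardware):

```sql
-- Same million values stored once as integer, once as numeric.
CREATE TEMP TABLE t_int AS
    SELECT g::integer AS v FROM generate_series(1, 1000000) AS g;
CREATE TEMP TABLE t_num AS
    SELECT g::numeric AS v FROM generate_series(1, 1000000) AS g;

\timing on
SELECT sum(v) FROM t_int;   -- machine-native integer arithmetic
SELECT sum(v) FROM t_num;   -- variable-length numeric arithmetic, noticeably slower
```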

Re: Real vs Int performance

From
Shaun Thomas
Date:
On 01/27/2011 08:18 AM, Tom Lane wrote:

> Not if you can persuade the client-side code to output integers as
> integers.  "numeric" type is orders of magnitude slower than integers.

I sadly have to vouch for this. My company converted an old Oracle app
and they changed all their primary keys (and foreign keys, and random
larger int fields) to NUMERIC(19)'s. I've convinced them all new stuff
should be BIGINT if they need that level of coverage, but the damage is
already done.

I'm not sure about orders of magnitude on the storage/index side, but my
tests gave us a 10% boost if just the keys are switched over to INT or
BIGINT.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com


Re: Real vs Int performance

From
Mladen Gogala
Date:
On 1/27/2011 9:30 AM, Shaun Thomas wrote:
> I'm not sure about orders of magnitude on the storage/index side, but my
> tests gave us a 10% boost if just the keys are switched over to INT or
> BIGINT.

Well, it depends on what you're doing. Searching by an integer vs.
searching by a text string will probably not make much of a difference.
However, if you are calculating sums or averages, there will be a huge
difference.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Real vs Int performance

From
"Igor Neyman"
Date:

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Wednesday, January 26, 2011 5:12 PM
> To: David Greco
> Cc: pgsql-performance@postgresql.org
> Subject: Re: Real vs Int performance
>
> David Greco <David_Greco@harte-hanks.com> writes:
> > Came across a problem I find perplexing. I recreated the
> > dimensional tables in Oracle and the fields that are integers in
> > Oracle became integers in Postgres. Was experiencing terrible
> > performance during the load and narrowed down to a particular
> > dimensional lookup problem.
> [...]
> > In real life, this query is actually bound and parameterized,
>
> In that case, an EXPLAIN using literal constants is next door
> to useless in terms of telling you what will happen in real
> life.  You need to pay attention to exactly how the
> parameterization is done.  Again, I'm suspecting a wrong
> datatype indication.
>
>             regards, tom lane
>

To see what happens with a parameterized query in "real life", you could
try the "auto_explain" contrib module.

Regards,
Igor Neyman