Thread: Scaling 10 million records in PostgreSQL table

Scaling 10 million records in PostgreSQL table

From
Navaneethan R
Date:
Hi all,

      I have 10 million records in my Postgres table. I am running the database on an Amazon EC2 medium instance. I need to
access the last week's data from the table.
Even a simple query takes a very long time to process, so it throws a timeout exception.

The query is:
     select count(*) from dealer_vehicle_details where modified_on between '2012-10-01' and '2012-10-08' and dealer_id=270001;

After a long time it returns a count of 1184.

What can I do to improve the performance of this query?

Insertions also happen in parallel, since the table is updated in real time every day.

What exactly could be the reason for this poor performance?


Re: Scaling 10 million records in PostgreSQL table

From
Merlin Moncure
Date:
On Mon, Oct 8, 2012 at 10:26 AM, Navaneethan R <nava@gridlex.com> wrote:
> Hi all,
>
>       I have 10 million records in my Postgres table. I am running the database on an Amazon EC2 medium instance. I need to
> access the last week's data from the table.
> Even a simple query takes a very long time to process, so it throws a timeout exception.
>
> The query is:
>      select count(*) from dealer_vehicle_details where modified_on between '2012-10-01' and '2012-10-08' and dealer_id=270001;
>
> After a long time it returns a count of 1184.
>
> What can I do to improve the performance of this query?
>
> Insertions also happen in parallel, since the table is updated in real time every day.
>
> What exactly could be the reason for this poor performance?

can you send explain analyze?  also table structure?
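For reference, an illustrative way to gather both (run in psql against the database in question; the plan and timings will of course depend on the actual data):

```sql
-- Show the actual plan, row counts, and timings for the slow query:
EXPLAIN ANALYZE
SELECT count(*)
FROM dealer_vehicle_details
WHERE modified_on BETWEEN '2012-10-01' AND '2012-10-08'
  AND dealer_id = 270001;

-- Show the table structure, including indexes (psql meta-command):
-- \d+ dealer_vehicle_details
```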

merlin


Re: Scaling 10 million records in PostgreSQL table

From
Julien Cigar
Date:
On 10/08/2012 17:26, Navaneethan R wrote:
> Hi all,

Hello,

>        I have 10 million records in my Postgres table. I am running the database on an Amazon EC2 medium instance. I need to
> access the last week's data from the table.
> Even a simple query takes a very long time to process, so it throws a timeout exception.
>
> The query is:
>       select count(*) from dealer_vehicle_details where modified_on between '2012-10-01' and '2012-10-08' and dealer_id=270001;

Please show us an EXPLAIN ANALYZE of the query.

> After a long time it returns a count of 1184.
>
> What can I do to improve the performance of this query?
>
> Insertions also happen in parallel, since the table is updated in real time every day.
>
> What exactly could be the reason for this poor performance?

A missing index, wrong configuration, ...
Please also note that, generally, all this "cloud stuff" has
very poor I/O performance.




Re: Scaling 10 million records in PostgreSQL table

From
Larry Rosenman
Date:
On 2012-10-08 10:26, Navaneethan R wrote:
> Hi all,
>
>       I have 10 million records in my Postgres table. I am running the
> database on an Amazon EC2 medium instance. I need to access the last
> week's data from the table.
> Even a simple query takes a very long time to process, so it throws a
> timeout exception.
>
> The query is:
>      select count(*) from dealer_vehicle_details where modified_on
> between '2012-10-01' and '2012-10-08' and dealer_id=270001;
>
> After a long time it returns a count of 1184.
>
> What can I do to improve the performance of this query?
>
> Insertions also happen in parallel, since the table is updated in
> real time every day.
>
> What exactly could be the reason for this poor performance?
What indexes do you have on your table?

I'll bet none.

What does

     explain select count(*) from dealer_vehicle_details where modified_on
     between '2012-10-01' and '2012-10-08' and dealer_id=270001;

show?

I have a 380Million row table, with LOTS of indexing, and we perform
very well.

Without indexes, the query has to sequentially scan all 10 million rows.
That's going to be bad on ANY database.
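A sketch of the kind of index that would let this query avoid the sequential scan (the index name is hypothetical; the equality column goes first, then the range column):

```sql
-- Composite index matching the query's WHERE clause:
-- equality on dealer_id, then the range on modified_on.
CREATE INDEX idx_dealer_modified
    ON dealer_vehicle_details (dealer_id, modified_on);

-- The planner can then answer the query with an index range scan
-- instead of reading all 10 million rows:
EXPLAIN SELECT count(*)
FROM dealer_vehicle_details
WHERE modified_on BETWEEN '2012-10-01' AND '2012-10-08'
  AND dealer_id = 270001;
```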




Re: Scaling 10 million records in PostgreSQL table

From
Steve Crawford
Date:
On 10/08/2012 08:26 AM, Navaneethan R wrote:
> Hi all,
>
>        I have 10 million records in my Postgres table. I am running the database on an Amazon EC2 medium instance. I need to
> access the last week's data from the table.
> Even a simple query takes a very long time to process, so it throws a timeout exception.
>
> The query is:
>       select count(*) from dealer_vehicle_details where modified_on between '2012-10-01' and '2012-10-08' and dealer_id=270001;
>
> After a long time it returns a count of 1184.
>
> What can I do to improve the performance of this query?
>
> Insertions also happen in parallel, since the table is updated in real time every day.
>
> What exactly could be the reason for this poor performance?
What version of PostgreSQL? You can use "select version();" and note
that 9.2 has index-only scans which can result in a substantial
performance boost for queries of this type.
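For illustration, with 9.2+ and an index covering both columns the query touches, the EXPLAIN output can show an index-only scan (the index name and plan shape below are illustrative, not actual output from this table):

```sql
-- Assuming an index on (dealer_id, modified_on) and a recently
-- VACUUMed table (so the visibility map is set), EXPLAIN may show:
--   Aggregate
--     ->  Index Only Scan using idx_dealer_modified on dealer_vehicle_details
--           Index Cond: ((dealer_id = 270001) AND (modified_on >= '2012-10-01')
--                        AND (modified_on <= '2012-10-08'))
EXPLAIN SELECT count(*)
FROM dealer_vehicle_details
WHERE modified_on BETWEEN '2012-10-01' AND '2012-10-08'
  AND dealer_id = 270001;
```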

What is the structure of your table? You can use "\d+
dealer_vehicle_details" in psql.

Have you tuned PostgreSQL in any way? If so, what?

Cheers,
Steve


Re: Scaling 10 million records in PostgreSQL table

From
Craig Ringer
Date:
On 10/08/2012 11:26 PM, Navaneethan R wrote:
> Hi all,
>
>        I have 10 million records in my Postgres table. I am running the database on an Amazon EC2 medium instance.

EC2 usually means "my I/O performance is terrible", and "medium instance"
means "I don't have enough RAM for caching to make up for my terrible
I/O" at the database sizes you're talking about.

Anything that hits most of the database is likely to perform pretty
poorly on something like EC2. It might be worth considering one of the
high memory or high I/O instances, but unfortunately they only come in
"really big and really expensive".

If you already have appropriate indexes and have used `explain analyze`
to verify that the query isn't doing anything slow and expensive, it's
possible the easiest way to improve performance is to set up async
replication or log shipping to a local hot standby on real physical
hardware, then do the query there.
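For context, a minimal sketch of what such a standby setup involves, assuming PostgreSQL 9.0+ streaming replication (parameter names differ in other versions, and the host is a placeholder):

```
# primary postgresql.conf (sketch):
wal_level = hot_standby
max_wal_senders = 2

# standby postgresql.conf (sketch) - allow read-only queries:
hot_standby = on

# standby recovery.conf (sketch):
standby_mode = 'on'
primary_conninfo = 'host=<primary-host> user=replication'
```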

--
Craig Ringer



Re: Scaling 10 million records in PostgreSQL table

From
Samuel Gendler
Date:


On Mon, Oct 8, 2012 at 1:27 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:

If you already have appropriate indexes and have used `explain analyze` to verify that the query isn't doing anything slow and expensive, it's possible the easiest way to improve performance is to set up async replication or log shipping to a local hot standby on real physical hardware, then do the query there.

I've run postgresql on medium instances using elastic block store for the storage and had no difficulty running queries like this one on tables of comparable (and larger) size. It might not come back in 10ms, but such queries weren't so slow that I would describe the wait as "a lot of time" either.  My guess is that this is a sequential scan on a 10 million record table with lots of bloat due to updates.  Without more info about table structure and explain analyze output, we are all just guessing, though.  Please read the wiki page which describes how to submit performance problems and restate your question.
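A quick way to check the bloat hypothesis is the standard statistics view (illustrative queries; column names are those of `pg_stat_user_tables`):

```sql
-- Compare live vs. dead tuples and see when autovacuum last ran:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'dealer_vehicle_details';

-- If dead tuples dominate, reclaim them and refresh statistics:
VACUUM ANALYZE dealer_vehicle_details;
```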


Re: Scaling 10 million records in PostgreSQL table

From
Navaneethan R
Date:
On Tuesday, October 9, 2012 1:40:08 AM UTC+5:30, Steve Crawford wrote:
> What version of PostgreSQL? You can use "select version();" and note
> that 9.2 has index-only scans which can result in a substantial
> performance boost for queries of this type.
>
> What is the structure of your table? You can use "\d+
> dealer_vehicle_details" in psql.
>
> Have you tuned PostgreSQL in any way? If so, what?
>
> Cheers,
> Steve


version():

  PostgreSQL 8.4.8 on i686-pc-linux-gnu, compiled by GCC gcc-4.5.real (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 32-bit

Desc:
                                          Table "public.dealer_vehicle_details"
     Column     |           Type           |                                Modifiers                                 | Storage | Description
----------------+--------------------------+--------------------------------------------------------------------------+---------+-------------
 id             | integer                  | not null default nextval('dealer_vehicle_details_new_id_seq'::regclass)  | plain   |
 vin_id         | integer                  | not null                                                                 | plain   |
 vin_details_id | integer                  |                                                                          | plain   |
 price          | integer                  |                                                                          | plain   |
 mileage        | double precision         |                                                                          | plain   |
 dealer_id      | integer                  | not null                                                                 | plain   |
 created_on     | timestamp with time zone | not null                                                                 | plain   |
 modified_on    | timestamp with time zone | not null                                                                 | plain   |
Indexes:
    "dealer_vehicle_details_pkey" PRIMARY KEY, btree (id)
    "idx_dealer_sites_id" UNIQUE, btree (id) WHERE dealer_id = 270001
    "idx_dealer_sites_id_526889" UNIQUE, btree (id) WHERE dealer_id = 526889
    "idx_dealer_sites_id_9765" UNIQUE, btree (id, vin_id) WHERE dealer_id = 9765
    "idx_dealer_sites_id_9765_all" UNIQUE, btree (id, vin_id, price, mileage, modified_on, created_on, vin_details_id) WHERE dealer_id = 9765
    "mileage_idx" btree (mileage)
    "price_idx" btree (price)
    "vehiclecre_idx" btree (created_on)
    "vehicleid_idx" btree (id)
    "vehiclemod_idx" btree (modified_on)
    "vin_details_id_idx" btree (vin_details_id)
    "vin_id_idx" btree (vin_id)
Foreign-key constraints:
    "dealer_vehicle_master_dealer_id_fkey" FOREIGN KEY (dealer_id) REFERENCES dealer_dealer_master(id) DEFERRABLE INITIALLY DEFERRED
    "dealer_vehicle_master_vehicle_id_fkey" FOREIGN KEY (vin_id) REFERENCES dealer_vehicle(id) DEFERRABLE INITIALLY DEFERRED
    "dealer_vehicle_master_vin_details_id_fkey" FOREIGN KEY (vin_details_id) REFERENCES vin_lookup_table(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no


   After creating the index for the WHERE clause "WHERE dealer_id = 270001", it is performing better. I have more dealer ids.
Should I do it for each dealer_id?

The insertion service is also running in parallel in the background.

So, what steps should I follow regularly to keep the database healthy?

Since insertions happen all the time, the table will soon grow to many millions of rows. What precautions should be
taken?
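The kind of routine maintenance this question is about mostly comes down to vacuum and statistics upkeep; a sketch (illustrative; exact thresholds are workload-dependent, and autovacuum handles most of this when enabled):

```sql
-- Verify autovacuum is enabled:
SHOW autovacuum;

-- Keep planner statistics fresh on a heavily updated table:
ANALYZE dealer_vehicle_details;

-- Spot tables where dead rows are accumulating faster than
-- autovacuum reclaims them:
SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```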


Re: Scaling 10 million records in PostgreSQL table

From
Samuel Gendler
Date:
On Mon, Oct 8, 2012 at 1:25 PM, Navaneethan R <nava@gridlex.com> wrote:

   After creating the index for the WHERE clause "WHERE dealer_id = 270001", it is performing better. I have more dealer ids. Should I do it for each dealer_id?


All you've really done is confuse the issue.  Please read the wiki page on how to submit performance questions and actually follow the directions.  Show us the table structure when the query is performing poorly ALONG WITH explain analyze output, so we can see how the query is being handled by the db.  Adding indexes for just one particular value isn't likely a great solution unless there's a reason why that value is special or performance for that value needs to be particularly good.  Far better to get at the root problem of performance issues on that table, whether it is table bloat, insufficient indexes, invalid statistics, or something else.

Re: Scaling 10 million records in PostgreSQL table

From
Daniel Farina
Date:
On Mon, Oct 8, 2012 at 1:25 PM, Navaneethan R <nava@gridlex.com> wrote:
> On Tuesday, October 9, 2012 1:40:08 AM UTC+5:30, Steve Crawford wrote:
>> What version of PostgreSQL? You can use "select version();" and note
>> that 9.2 has index-only scans which can result in a substantial
>> performance boost for queries of this type.
>>
>> What is the structure of your table? You can use "\d+
>> dealer_vehicle_details" in psql.
>>
>> Have you tuned PostgreSQL in any way? If so, what?
>
> version():
>
>   PostgreSQL 8.4.8 on i686-pc-linux-gnu, compiled by GCC gcc-4.5.real (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 32-bit
>
> Desc:
> [column definitions snipped; the full \d+ output is quoted earlier in the thread]
> Indexes:
>     "dealer_vehicle_details_pkey" PRIMARY KEY, btree (id)
>     "idx_dealer_sites_id" UNIQUE, btree (id) WHERE dealer_id = 270001
>     "idx_dealer_sites_id_526889" UNIQUE, btree (id) WHERE dealer_id = 526889
>     "idx_dealer_sites_id_9765" UNIQUE, btree (id, vin_id) WHERE dealer_id = 9765
>     "idx_dealer_sites_id_9765_all" UNIQUE, btree (id, vin_id, price, mileage, modified_on, created_on, vin_details_id) WHERE dealer_id = 9765
>     "mileage_idx" btree (mileage)
>     "price_idx" btree (price)
>     "vehiclecre_idx" btree (created_on)
>     "vehicleid_idx" btree (id)
>     "vehiclemod_idx" btree (modified_on)
>     "vin_details_id_idx" btree (vin_details_id)
>     "vin_id_idx" btree (vin_id)
>
>    After creating the index for the WHERE clause "WHERE dealer_id = 270001", it is performing better. I have more dealer ids.
> Should I do it for each dealer_id?

You seem to have created a partial index.  Normally, that's not what
you want.  You just want an index on the field "dealer_id", without
the conditional index.  Conditional indexes are useful when you have a
lot of queries with the same WHERE clause entry, such as "WHERE
deleted_at IS NULL" or whatnot where most of the table has been
soft-deleted.
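A sketch of the difference (index names here are illustrative):

```sql
-- What was created: a partial index that only helps queries
-- repeating this exact predicate.
CREATE UNIQUE INDEX one_dealer_idx
    ON dealer_vehicle_details (id) WHERE dealer_id = 270001;

-- What the workload wants: a plain index on the column,
-- usable for any dealer_id value.
CREATE INDEX dealer_id_idx
    ON dealer_vehicle_details (dealer_id);

-- Better still for this particular query: dealer_id plus the
-- date column used in the range condition.
CREATE INDEX dealer_id_modified_on_idx
    ON dealer_vehicle_details (dealer_id, modified_on);
```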

Here's a recent blog post discussing the topic that doesn't presume
much familiarity with database performance. It's geared towards
application developers writing OLTP applications, which this seems
to be:

    http://www.craigkerstiens.com/2012/10/01/understanding-postgres-performance/


--
fdr