Thread: Seq scan on 10 million record table.. why?

Seq scan on 10 million record table.. why?

From: Vincenzo Melandri
Hi all

I have a problem with a data import procedure that involves the following query:

select a,b,c,d
from big_table b
join data_sequences_table ds
on b.key1 = ds.key1 and b.key2 = ds.key2
where ds.import_id=xxxxxxxxxx

The "big table" has something like 10.000.000 records ore more
(depending on the table, there are more than one of them).
The data are uploaded in 20k record blocks, and the keys are written
on "data_sequences_table". The keys are composite (key1,key2), and
every 5-10 sequences (depending on the size of the upload) the
data_sequences_table records are deleted.
I have indexes on both the key on the big table and the import_id on
the sequence table.

The query plan evaluates like this:

Merge Join  (cost=2604203.98..2774528.51 rows=129904 width=20)
  Merge Cond: ((((big_table.key1)::numeric) = data_sequences_table.key1) AND (((big_table.key2)::numeric) = data_sequences_table.key2))
  ->  Sort  (cost=2602495.47..2635975.81 rows=13392135 width=20)
        Sort Key: ((big_table.key1)::numeric), ((big_table.key2)::numeric)
        ->  Seq Scan on big_table  (cost=0.00..467919.35 rows=13392135 width=20)
  ->  Sort  (cost=1708.51..1709.48 rows=388 width=32)
        Sort Key: data_sequences_table.key1, data_sequences_table.key2
        ->  Seq Scan on data_sequences_table  (cost=0.00..1691.83 rows=388 width=32)
              Filter: (import_id = 1351592072::numeric)

It executes in something like 80 seconds. The import procedure has
more than 500 occurrences of this situation. :(
Why is the big table evaluated with a seq scan? The result is 0 to
20,000 records (the query returns the records that already exist and
should be updated, not inserted). Can I do something to speed this
up?

--
Vincenzo.
Imola Informatica



Re: Seq scan on 10 million record table.. why?

From: Gabriele Bartolini
 Hi Vincenzo,

 On Tue, 30 Oct 2012 13:15:10 +0100, Vincenzo Melandri
 <vmelandri@imolinfo.it> wrote:
> I have indexes on both the key on the big table and the import_id on
> the sequence table.

 Forgive my quick answer, but it might be that the data you are
 retrieving is scattered throughout the whole table, so the index scan
 does not kick in (as it is more expensive to perform lots of random
 fetches than a single sequential scan).
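
 As a quick sanity check (assuming the table really is called big_table
 and the key columns key1/key2, as in your query), you can look at the
 planner's correlation statistic for those columns; values close to
 zero mean the matching rows are spread all over the heap, which makes
 index scans expensive:

   -- physical ordering of the key columns vs. their logical ordering
   SELECT attname, correlation
     FROM pg_stats
    WHERE tablename = 'big_table'
      AND attname IN ('key1', 'key2');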

 To be able to help you though, I'd need to take a deep look at the
 ETL process - I am afraid you need to use a different approach,
 involving either queues or partitioning.

 Sorry for not being able to help you more in this case.

 Cheers,
 Gabriele
--
  Gabriele Bartolini - 2ndQuadrant Italia
  PostgreSQL Training, Services and Support
  Gabriele.Bartolini@2ndQuadrant.it - www.2ndQuadrant.it


Re: Seq scan on 10 million record table.. why?

From: Shaun Thomas
On 10/30/2012 07:15 AM, Vincenzo Melandri wrote:

> Merge Join  (cost=2604203.98..2774528.51 rows=129904 width=20)
>    Merge Cond: ((((big_table.key1)::numeric) = data_sequences_table.key1) AND (((big_table.key2)::numeric) = data_sequences_table.key2))
>    ->  Sort  (cost=2602495.47..2635975.81 rows=13392135 width=20)
>          Sort Key: ((big_table.key1)::numeric), ((big_table.key2)::numeric)
>          ->  Seq Scan on big_table  (cost=0.00..467919.35 rows=13392135 width=20)
>    ->  Sort  (cost=1708.51..1709.48 rows=388 width=32)
>          Sort Key: data_sequences_table.key1, data_sequences_table.key2
>          ->  Seq Scan on data_sequences_table  (cost=0.00..1691.83 rows=388 width=32)
>                Filter: (import_id = 1351592072::numeric)

As always, we need to see an EXPLAIN ANALYZE, not just an EXPLAIN. We
also need to know the version of PostgreSQL and your server settings.
Please refer to this:

http://wiki.postgresql.org/wiki/Slow_Query_Questions
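
For example, something along these lines (your own query, using the
import_id literal that appears in your plan) would show actual row
counts and timings rather than just estimates; on 9.0 or later you can
also add BUFFERS:

  EXPLAIN ANALYZE
  SELECT a, b, c, d
    FROM big_table b
    JOIN data_sequences_table ds
      ON b.key1 = ds.key1
     AND b.key2 = ds.key2
   WHERE ds.import_id = 1351592072;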

I see a lot of NUMERIC conversions in there, which suggests you're using
NUMERIC for your keys. That's not really recommended practice, and it
also suggests the possibility that your types are not all the same. So it
would be very helpful to also see the actual CREATE TABLE and CREATE
INDEX statements for those tables.
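
Purely as an illustration (we haven't seen your DDL, so every name and
type below is a guess), the casts in your plan are what you'd see with
a schema like this, where the big table uses integer keys and the
sequence table uses numeric - the planner then has to cast big_table's
keys, and its index can no longer be used for the join:

  -- hypothetical schema that would reproduce the ::numeric casts
  CREATE TABLE big_table (
      key1 integer,
      key2 integer,
      a text,
      b text,
      c text,
      d text
  );

  CREATE INDEX big_table_key_idx ON big_table (key1, key2);

  CREATE TABLE data_sequences_table (
      key1 numeric,
      key2 numeric,
      import_id numeric
  );

  CREATE INDEX data_sequences_import_id_idx
      ON data_sequences_table (import_id);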

We can't help you with this limited information. Sorry.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Seq scan on 10 million record table.. why?

From: Віталій Тимчишин
It seems that your tables have different types for the key columns - see the "::numeric" in "(((big_table.key2)::numeric) = data_sequences_table.key2)".
PostgreSQL always uses the widening conversion, so it can't use the index. There are a few options to fix this:
1) Make all the types the same.
2) If you are using some narrow type for big_table (say, int2) to save space, you can force a narrowing conversion, e.g. "b.key1 = ds.key1::int2". Note that if ds.key1 has any values that don't fit into int2, you will have problems. And of course, use your actual type instead of int2.
3) Create an index on (key1::numeric), (key2::numeric). This is the last resort, as that index will be very specific to this query (or similar ones). See the sketch below.
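
A rough sketch of options 2 and 3 (assuming, purely as an example, that big_table's keys are integer while data_sequences_table's keys are numeric - substitute your real types):

  -- Option 2: cast the sequence table's keys down to the big table's
  -- type, so the existing index on big_table (key1, key2) can be used.
  SELECT a, b, c, d
    FROM big_table b
    JOIN data_sequences_table ds
      ON b.key1 = ds.key1::integer
     AND b.key2 = ds.key2::integer
   WHERE ds.import_id = 1351592072;

  -- Option 3: an expression index that matches the casts in the plan.
  CREATE INDEX big_table_keys_numeric_idx
      ON big_table ((key1::numeric), (key2::numeric));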

Best regards, Vitalii Tymchyshyn


Re: Seq scan on 10 million record table.. why?

From: Vincenzo Melandri
> 1) Make all the types the same.
> 2) If you are using some narrow type for big_table (say, int2) to save
> space, you can force a narrowing conversion, e.g. "b.key1 = ds.key1::int2".
> Note that if ds.key1 has any values that don't fit into int2, you will
> have problems. And of course, use your actual type instead of int2.
>
> Best regards, Vitalii Tymchyshyn
>

This fixed my problem :)
Thanks Vitalii!

As for the other suggestions from Gabriele, unfortunately I can't do
proper data partitioning because (obviously) it would be quite a lot
of work, and the customer has used up the budget for this year, so
unless I choose to work for free... ;)


--
Vincenzo.