Thread: Small Queries Really Fast, Large Queries Really Slow...

Small Queries Really Fast, Large Queries Really Slow...

From: Tom Wilcox
Hi again!

I have finally got my Ubuntu VirtualBox VM running PostgreSQL with PL/Python and am now looking at performance.

So here's the scenario:

We have a great big table:

cse=# \d nlpg.match_data
                                         Table "nlpg.match_data"
      Column       |   Type   |                                Modifiers                                
-------------------+----------+--------------------------------------------------------------------------
 premise_id        | integer  |
 usrn              | bigint   |
 org               | text     |
 sao               | text     |
 level             | text     |
 pao               | text     |
 name              | text     |
 street            | text     |
 town              | text     |
 postcode          | text     |
 match_data_id     | integer  | not null default nextval('nlpg.match_data_match_data_id_seq1'::regclass)
 addr_str          | text     |
 tssearch_name     | tsvector |
 tssearch_street   | tsvector |
 tssearch_addr_str | tsvector |
Indexes:
    "match_data_pkey1" PRIMARY KEY, btree (match_data_id)
    "index_match_data_mid" btree (match_data_id)
    "index_match_data_pid" btree (premise_id)
    "index_match_data_tssearch_addr_str" gin (tssearch_addr_str)
    "index_match_data_tssearch_name" gin (tssearch_name)
    "index_match_data_tssearch_street" gin (tssearch_street)
    "index_match_data_usrn" btree (usrn)

KEY NOTE:
nlpg.match_data has approximately 27,000,000 rows.

Running this query:

EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org WHERE match_data_id < 1000000;

I get this:

"Index Scan using match_data_pkey1 on match_data  (cost=0.00..1452207.14 rows=1913756 width=302) (actual time=23.448..61559.652 rows=999999 loops=1)"
"  Index Cond: (match_data_id < 1000000)"
"Total runtime: 403855.675 ms"

I copied a chunk of the table like this:

CREATE TABLE nlpg.md_copy AS SELECT * FROM nlpg.match_data WHERE match_data_id < 1000000;

Then ran the same query on the smaller copy table:

EXPLAIN ANALYZE UPDATE nlpg.md_copy SET org = org WHERE match_data_id < 1000000;

And got this:

"Seq Scan on md_copy  (cost=0.00..96935.99 rows=999899 width=301) (actual time=26.745..33944.923 rows=999999 loops=1)"
"  Filter: (match_data_id < 1000000)"
"Total runtime: 57169.419 ms"

As you can see, this is much faster per row on the smaller table chunk. I then tried running the same first query with 10 times the number of rows:

EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org WHERE match_data_id < 10000000;

This takes a massive amount of time (it is still running), and the run time is clearly growing non-linearly compared with the previous query.

EXPLAIN UPDATE nlpg.match_data SET org = org WHERE match_data_id < 10000000;
"Seq Scan on match_data  (cost=0.00..3980053.11 rows=19172782 width=302)"
"  Filter: (match_data_id < 10000000)"

Any suggestions on what I can do to speed things up? I presume that if I turn off sequential scans the planner might default to an index scan. Is there anything else?
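
By turning off sequential scans I mean the planner setting - a minimal sketch of what I'd try, just for the session:

SET enable_seqscan = off;
EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org WHERE match_data_id < 10000000;
RESET enable_seqscan;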

Cheers,
Tom

Re: Small Queries Really Fast, Large Queries Really Slow...

From: tv@fuzzy.cz
> Any suggestions on what I can do to speed things up? I presume if I turn
> off
> Sequential Scan then it might default to Index Scan.. Is there anything
> else?
>
> Cheers,
> Tom

Well, I doubt turning off sequential scans will improve performance in this
case - the slow query (the one running 400 seconds) already uses an index
scan, while the 'fast' one uses a sequential scan.

Actually I'd try exactly the opposite - disabling the index scan, i.e.
forcing it to use a sequential scan for the first query. You're updating
about 4% of the rows, but we don't know how spread out those rows are
throughout the table. It may well be that PostgreSQL has to read nearly all
the blocks of the table anyway, in which case the random access of an index
scan just adds overhead.
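
A quick way to test that (a minimal sketch - disable index scans just for
your session and re-run the query):

SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org WHERE match_data_id < 1000000;
RESET enable_indexscan;
RESET enable_bitmapscan;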

This might be improved by clustering - create an index on the
'match_data_id' column and then run

CLUSTER match_data_id_idx ON match_data;

This will sort the table according to the match_data_id column, which should
improve the performance. But it won't last forever - the ordering degrades
over time, so you'll have to re-cluster once in a while (and CLUSTER takes an
exclusive lock on the table, so be careful).
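
In your case match_data_id is already covered by the primary key index
(match_data_pkey1), so a minimal sketch - assuming your version supports the
CLUSTER ... USING form - would be:

CLUSTER nlpg.match_data USING match_data_pkey1;
ANALYZE nlpg.match_data;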

How large is the table anyway? How many rows / pages are there? Try
something like this:

SELECT reltuples, relpages FROM pg_class WHERE relname = 'match_data';

Multiply relpages by the 8 kB block size and you'll get the space the table
occupies. How much is it? How much memory (shared_buffers) do you have?
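
Or, as a shortcut (a sketch, assuming pg_relation_size / pg_size_pretty are
available in your version):

SELECT pg_size_pretty(pg_relation_size('nlpg.match_data'));
SHOW shared_buffers;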

You could try partitioning according to the match_data_id column, but there
are various disadvantages related to foreign keys etc. and it's often a
major change in the application, so I'd consider other solutions first.
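
Just to illustrate what that would involve (a rough sketch using table
inheritance - the child table names and boundaries are made up, and you'd
also need indexes on each child, an INSERT trigger or rule, and
constraint_exclusion enabled):

CREATE TABLE nlpg.match_data_0 (
    CHECK (match_data_id < 10000000)
) INHERITS (nlpg.match_data);

CREATE TABLE nlpg.match_data_1 (
    CHECK (match_data_id >= 10000000)
) INHERITS (nlpg.match_data);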

BTW I have no experience with running PostgreSQL inside a VirtualBox VM, so
that might be another source of problems. I do remember we had some serious
problems with I/O (network and disks) when running VMware, but that was a
long time ago and it works fine now. But maybe this is the root cause? Can
you run dstat / vmstat / iostat or something like that in the host OS to see
which resource (if any) is the bottleneck?

Tomas