Specifying many rows in a table

From: Steve Atkins
I have a large table (potentially tens or hundreds of millions of rows) and
I need to extract some number of these rows, defined by an integer primary
key.

So, the obvious answer is

  select * from table where id in (1,3,4);

But I may want to extract a large number of rows, many thousands

  select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);

This falls over when it exceeds the maximum expression depth of 10,000.
And I have a sneaky feeling that increasing max_expr_depth isn't the
right fix.

Performance is pretty important, so does anyone have a good suggestion
for how to phrase this query so that it'll at worst only be a single
seq-scan through the mondo table, and make a reasonable choice as to
whether to use an index-scan or seq-scan, depending on the number
of rows I'm pulling out?

Cheers,
  Steve

Re: Specifying many rows in a table

From: "scott.marlowe"
On Wed, 28 Jan 2004, Steve Atkins wrote:

> I have a large table (potentially tens or hundreds of millions of rows) and
> I need to extract some number of these rows, defined by an integer primary
> key.
>
> So, the obvious answer is
>
>   select * from table where id in (1,3,4);
>
> But I may want to extract a large number of rows, many thousands
>
>   select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);

If the ranges are sequential, then between would work.  I have a feeling
that they aren't though, looking at your examples.

> This falls over when it exceeds the maximum expression depth of 10,000.
> And I have a sneaky feeling that increasing max_expr_depth isn't the
> right fix.

The optimizations made for in() queries in the 7.4 branch only really work
when there's a subselect / table inside the IN().  You could try inserting
those numbers into a temp table and subselecting it.
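
Roughly like this (untested sketch, with made-up names - "bigtable" stands
in for your big table):

  create temp table wanted (id integer);
  insert into wanted values (1);
  insert into wanted values (3);
  -- ... one insert per key ...
  select * from bigtable where id in (select id from wanted);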



Re: Specifying many rows in a table

From: Alvaro Herrera
On Wed, Jan 28, 2004 at 01:15:27PM -0700, scott.marlowe wrote:
> On Wed, 28 Jan 2004, Steve Atkins wrote:
>
> >   select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);
>
> > This falls over when it exceeds the maximum expression depth of 10,000.
> > And I have a sneaky feeling that increasing max_expr_depth isn't the
> > right fix.
>
> The optimizations made for in() queries in the 7.4 branch only really work
> when there's a subselect / table inside the IN().  You could try inserting
> those numbers into a temp table and subselecting it.

I think using IN with such a big distinct set would suck performance-wise,
because it would have to resort to the old method, which was slow.
I could be mistaken though.  If IN (SELECT...) doesn't work, I'd try
rewriting the query to use an EXISTS pulling the values from a temp
table instead ...
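
That is, something along these lines (untested; "bigtable" and "tmp_ids"
are invented names):

  create temp table tmp_ids (id integer);
  -- ... fill tmp_ids with the wanted keys ...
  select * from bigtable b
  where exists (select 1 from tmp_ids t where t.id = b.id);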

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Before you were born your parents weren't as boring as they are now. They
got that way paying your bills, cleaning up your room and listening to you
tell them how idealistic you are."  -- Charles J. Sykes' advice to teenagers

Re: Specifying many rows in a table

From: "John Sidney-Woollett"
If you could insert the "in" values into a temporary table and then join
against your *large* table, you may find that you get better performance, at
the expense of having to populate the temp table with your "in" values.

Rather than having to populate and delete your "in" values ready for the
next query, consider adding a search counter column to the table, eg

(in pseudo code)
create temporary table INTABLE (searchid integer, invalue integer);

create a function InsertInValues(searchid integer, values text) which
takes a comma-delimited list of "in" values and uses a loop to insert
them into your (temp) INTABLE, as sketched below.
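
A rough 7.4-era sketch of such a function (untested; assumes split_part(),
a well-formed comma-delimited list, and that plpgsql is installed):

  create function InsertInValues(integer, text) returns integer as '
  declare
      sid  alias for $1;
      list alias for $2;
      item text;
      i    integer := 1;
  begin
      loop
          item := split_part(list, '','', i);
          exit when item = '''';
          insert into intable (searchid, invalue)
              values (sid, item::integer);
          i := i + 1;
      end loop;
      return i - 1;  -- number of values inserted
  end;
  ' language plpgsql;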

now your query is

select * from table t, intable it
where t.keyvalue = it.invalue
and it.searchid = XXXX;

For the next query, call InsertInValues() again with the next set of
data, incrementing the searchid value for the function call, and then
perform the next search.

Or write a function SearchUsingInValues(invalue integer) which does the
two steps above within the one function (generating its own internal
searchid), and returns a setof values.

Or do away with the searchid stuff, and either drop/recreate the temp
table between calls, or truncate the data.
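
i.e. between searches just do:

  truncate table intable;
  -- or: drop table intable; and recreate it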

Someone else may be able to suggest a better/faster way of generating the
"in" values...

I don't know if that's any use. Hope so.

John Sidney-Woollett

Steve Atkins said:
> I have a large table (potentially tens or hundreds of millions of rows)
> and
> I need to extract some number of these rows, defined by an integer primary
> key.
>
> So, the obvious answer is
>
>   select * from table where id in (1,3,4);
>
> But I may want to extract a large number of rows, many thousands
>
>   select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);
>
> This falls over when it exceeds the maximum expression depth of 10,000.
> And I have a sneaky feeling that increasing max_expr_depth isn't the
> right fix.
>
> Performance is pretty important, so does anyone have a good suggestion
> for how to phrase this query so that it'll at worst only be a single
> seq-scan through the mondo table, and make a reasonable choice as to
> whether to use an index-scan or seq-scan, depending on the number
> of rows I'm pulling out?
>
> Cheers,
>   Steve
>


Re: Specifying many rows in a table

From: Tom Lane
"scott.marlowe" <scott.marlowe@ihs.com> writes:
> On Wed, 28 Jan 2004, Steve Atkins wrote:
>> But I may want to extract a large number of rows, many thousands
>> select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);

> You could try inserting those
> numbers into a temp table and subselecting it.

That's what I'd try too; it should work reasonably well in 7.4.  Two
thoughts:

1. Be sure to "analyze" the temp table after filling it.  The planner
must know the number of rows in the table to have any shot at choosing a
good plan.

2. If you know that the set of numbers you have emitted are all
different, don't use IN at all, but a straight join to the temp table:

    select table.* from table, tmp where table.id = tmp.id;

With IN the system will have to work harder to eliminate duplicates from
the temp table; why make it do that if you know it's not necessary?
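
In other words, the full sequence would be something like (names invented):

    create temp table tmp (id integer);
    insert into tmp values (1);
    insert into tmp values (3);
    -- ... etc ...
    analyze tmp;
    select table.* from table, tmp where table.id = tmp.id;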

            regards, tom lane

Re: Specifying many rows in a table

From: Steve Atkins
On Wed, Jan 28, 2004 at 06:13:36PM -0500, Tom Lane wrote:
> "scott.marlowe" <scott.marlowe@ihs.com> writes:
> > You could try inserting those
> > numbers into a temp table and subselecting it.
>
> That's what I'd try too; it should work reasonably well in 7.4.  Two
> thoughts:
>
> 1. Be sure to "analyze" the temp table after filling it.  The planner
> must know the number of rows in the table to have any shot at choosing a
> good plan.
>
> 2. If you know that the set of numbers you have emitted are all
> different, don't use IN at all, but a straight join to the temp table:
>
>     select table.* from table, tmp where table.id = tmp.id;
>
> With IN the system will have to work harder to eliminate duplicates from
> the temp table; why make it do that if you know it's not necessary?

Yup.

I'm trying a number of variants (in(list), in(select), exists, join) and
benchmarking all of them across a range of numbers, randomly chosen or
clustered, with appropriate flushing of the DB and disk buffers between
runs.

Once the perl script has finished I'll post a pointer to the data.

An interesting thing, though, is that for 10,000,000 rows in big and
one row in little, with everything indexed and analyzed, an EXISTS query
is painfully bad...

  select * from big where exists (select 1 from little where s=big.id)

...expands to nested seqscans over big and little, rather than the
seqscan over little with a nested index scan over big I was expecting.

I'm pretty sure that in(list) will be a clear win for a small number
of rows, and a simple join will win for a larger number, but I'm
benchmarking the others out of random interest.

Cheers,
  Steve

Re: Specifying many rows in a table

From: Tom Lane
Steve Atkins <steve@blighty.com> writes:
> An interesting thing, though, is that for 10,000,000 rows in big and
> one row in little, with everything indexed and analyzed, an EXISTS query
> is painfully bad...

>   select * from big where exists (select 1 from little where s=big.id)

> ...expands to nested seqscans over big and little, rather than the
> seqscan over little with a nested index scan over big I was expecting.

The planner can figure this out for straight join and (as of 7.4) for
IN, but not for EXISTS().  I've forgotten the details, but there were
some semantic fine points that discouraged me from trying to tackle the
EXISTS() case at the same time as IN.

            regards, tom lane

Re: Specifying many rows in a table

From: Steve Atkins
On Wed, Jan 28, 2004 at 04:14:21PM -0800, Steve Atkins wrote:
> I'm trying a number of variants (in(list), in(select), exists, join) and
> benchmarking all of them across a range of numbers, randomly chosen or
> clustered, with appropriate flushing of the DB and disk buffers between
> runs.
>
> Once the perl script has finished I'll post a pointer to the data.

On the off-chance anyone else is interested in the performance
variations (or is searching the archives from the year 2020)...

I have a table of 10,000,000 rows, each row having a few hundred bytes
of data. It has an integer primary key, and is clustered on that key.
Real data, not synthetic, not that it makes a difference for this
test.

I have in the application layer a list of the rows I want to retrieve
(a sorted list of unique integers) and want to retrieve all those rows
from the big table.

The test system is a single processor Athlon 1800 with a Reiser
filesystem on two striped decent scsi drives. PostgreSQL is pretty
well tuned. Between each test run PostgreSQL was shut down, a gig of
data read from the disk to flush OS caches and PostgreSQL restarted.

The big table is indexed and analysed. All temporary tables are analysed
(times include construction, population, any indexing and analysing of
the table, but not its destruction, nor the time to transfer data to the
client).

The sequence of ids to be retrieved is identical in each run. In the
first batch of tests it's completely random, but sorted. In the second batch
it's a sequence (42, 43, 44, 45...) with a random starting point.


inscalar: select * from big where id in (1, 17, 43 ...., 2071)
inselect: select * from big where id in (select id from small)
exists:   select * from big where exists (select 1 from small where id=big.id)
join:     select * from big, small where big.id = small.id;

inselect-idx: As inselect, but small is indexed
exists-idx:   As exists, but small is indexed
join-idx:     As join, but small is indexed

In all but the first case, small is created as a temporary table, populated
with a sequence of inserts (inside a transaction), possibly indexed then
analysed.

                             number of random rows retrieved
                1        10       100      1000      10000    100000
inscalar       0.394    0.818    3.80    23.5
inselect       0.752    0.770    0.751    0.977     0.798    0.790
join           0.728    0.746    0.625    0.766     0.703    0.806
inselect-idx   0.655    0.642    0.748    0.822     0.777    0.820
exists-idx    50.7     49.4     49.4     50.0      49.4     51.2
join-idx       0.657    0.703    0.760    0.730     0.731    0.815

                             number of sequential rows retrieved
                1        10       100      1000      10000    100000
inscalar       0.464    0.474    0.488    0.529
inselect       0.820    0.826    0.778    0.776     0.612    0.770
join           0.754    0.855    0.831    0.829     0.899    0.742
inselect-idx   0.686    0.679    0.703    0.747     0.747    0.729
exists-idx    49.2     79.8     49.5     49.3      49.3     53.7
join-idx       0.625    0.754    0.697    0.770     0.750    0.783



Overall, IN(scalar) looks pretty good for single-digit row counts (no
inscalar numbers for the two largest sets - lists that long run into the
max_expr_depth problem). A simple join looks nice for anything else - and
indexing the temporary table doesn't seem to buy anything.

Pretty impressive overall, though.

Cheers,
  Steve

Re: Specifying many rows in a table

From: Greg Stark
"scott.marlowe" <scott.marlowe@ihs.com> writes:

> The optimizations made for in() queries in the 7.4 branch only really work
> when there's a subselect / table inside the IN().  You could try inserting
> those numbers into a temp table and subselecting it.

I'll second this option.

But the other choice is to use a prepared statement like "select * from tab
where tab_id = ?" and simply execute it 10,000 times with different
parameters.

This involves lots of round trips to the server and is a lot less efficient,
but it doesn't require building a temporary table, which might be an
advantage. In fact, if the data is coming from outside the database then
you'll probably end up populating the temporary table with lots of
executions like this anyway, so it might not really be a disadvantage.

In the future the ideal solution might be to have array-processing support
in postgres, where you prepare a statement, then ship the server an array of
parameter lists, and the server executes the statement once for each array
element. Ideally it would let you stream the array to the server and have the
statements executed as fast as you can ship the data, avoiding round-trip
latency. But I don't think there's any such support in postgres currently.

--
greg

Re: Specifying many rows in a table

From: "NTPT"
----- Original Message -----
From: "Steve Atkins" <steve@blighty.com>
To: <pgsql-general@postgresql.org>
Sent: 28 January 2004 20:32
Subject: [GENERAL] Specifying many rows in a table


> I have a large table (potentially tens or hundreds of millions of rows) and
> I need to extract some number of these rows, defined by an integer primary
> key.
>
> So, the obvious answer is
>
>   select * from table where id in (1,3,4);


Shouldn't it be   select * from table where id in ('1','3','4'); ?
I.e., add single quotes around the numbers? I think I read a statement in the
documentation that, without the single quotes, an index scan may not always
be used?


>
> But I may want to extract a large number of rows, many thousands
>
>   select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);
>
> This falls over when it exceeds the maximum expression depth of 10,000.
> And I have a sneaky feeling that increasing max_expr_depth isn't the
> right fix.
>
> Performance is pretty important, so does anyone have a good suggestion
> for how to phrase this query so that it'll at worst only be a single
> seq-scan through the mondo table, and make a reasonable choice as to
> whether to use an index-scan or seq-scan, depending on the number
> of rows I'm pulling out?
>
> Cheers,
>   Steve

Re: Specifying many rows in a table

From: Steve Atkins
On Fri, Jan 30, 2004 at 12:38:43PM +0100, NTPT wrote:

> > I have a large table (potentially tens or hundreds of millions of rows) and
> > I need to extract some number of these rows, defined by an integer primary
> > key.
> >
> > So, the obvious answer is
> >
> >   select * from table where id in (1,3,4);
>
>
> Shouldn't it be   select * from table where id in ('1','3','4'); ?
> I.e., add single quotes around the numbers? I think I read a statement in
> the documentation that, without the single quotes, an index scan may not
> always be used?
>

That's a good rule in general, as an index will usually only be considered
if the type of the indexed field and the type of the value in the where
clause are the same. The usual example is an indexed bigint field
"foo" and the query "select * from table where foo=5". That will not use
the index, because "5" will be seen as an integer, and an integer isn't
a bigint.

"select * from table where foo=5::bigint" would work, but is clumsy.
Putting quotes around the "5" - "select * from table where foo='5'" -
delays deciding what type the "5" is until late enough in the planning
process that it can be treated as a bigint, so the index will be
considered.
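
To illustrate, with a hypothetical table:

  create table t (foo bigint);
  create index t_foo_idx on t (foo);

  select * from t where foo = 5;          -- "5" is an integer: index not used
  select * from t where foo = 5::bigint;  -- works, but clumsy
  select * from t where foo = '5';        -- '5' typed late: index considered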

That's not the issue here, though.

Cheers,
  Steve