Thread: GIN: any ordering guarantees for the hits returned?

GIN: any ordering guarantees for the hits returned?

From
adrobj
Date:
Hello,

I have a moderately large (~10-20GB) table:

CREATE TABLE msgs (
  msg varchar(2048),
  msg_tsv tsvector,
  posted timestamp
);

CREATE INDEX msgs_i ON msgs USING gin(msg_tsv);

The table never gets updated (more specifically, it gets re-created once a
day with no updates in between).

I want to run queries of the following form:

SELECT msg, posted FROM msgs WHERE 'blah blah'::tsquery @@ msg_tsv ORDERED
BY posted DESC;
(with various LIMIT/OFFSET)

Which obviously may get too expensive, for it will cause reading and sorting
of all rows meeting the condition, i.e. too many disk reads.

On the other hand, (as far as I understand) GIN always produces hits already
sorted in the insertion order.

So - what if I just populate my table in the order of decreasing 'posted',
remove the "ORDERED BY" clause and just hope for the best? Will the correct
ordering be guaranteed?

If not, are there any other ideas around?

Thanks,
- adrobj
--
View this message in context:
http://www.nabble.com/GIN%3A-any-ordering-guarantees-for-the-hits-returned--tf4825287.html#a13805836
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: GIN: any ordering guarantees for the hits returned?

From
David Fetter
Date:
On Fri, Nov 16, 2007 at 07:56:45PM -0800, adrobj wrote:
>
> Hello,
>
> I have a moderately large (~10-20GB) table:
>
> CREATE TABLE msgs (
>   msg varchar(2048),
>   msg_tsv tsvector,
>   posted timestamp
> );
>
> CREATE INDEX msgs_i ON msgs USING gin(msg_tsv);
>
> The table never gets updated (more specifically, it gets re-created
> once a day with no updates in between).
>
> I want to run queries of the following form:
>
> SELECT msg, posted FROM msgs WHERE 'blah blah'::tsquery @@ msg_tsv
> ORDERED BY posted DESC; (with various LIMIT/OFFSET)
>
> Which obviously may get too expensive, for it will cause reading and
> sorting of all rows meeting the condition, i.e. too many disk reads.
>
> On the other hand, (as far as I understand) GIN always produces hits
> already sorted in the insertion order.
>
> So - what if I just populate my table in the order of decreasing
> 'posted', remove the "ORDERED BY" clause and just hope for the best?
> Will the correct ordering be guaranteed?

Ordering is never guaranteed without an ORDER BY, except in the time
between a CLUSTER and the first write operation after it.

> If not, are there any other ideas around?

Rather than assuming you know where problems will arise, do some
profiling and find out where they actually do :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: GIN: any ordering guarantees for the hits returned?

From
Sam Mason
Date:
On Sun, Nov 18, 2007 at 09:59:41AM -0800, David Fetter wrote:
> Ordering is never guaranteed without an ORDER BY, except in the time
> between a CLUSTER and the first write operation after it.

It's my understanding that with the new "seqscan piggy-backing" in 8.3
even this will go.  I'm not sure if this would affect the behaviour of
GIN indexes, but I'd doubt it.


A quick question for the OP; if the results from these queries are going
to be displayed for human consumption, does it matter much if they're in
perfect sorted order?  For computer algorithms it generally matters much
more, but people are generally a bit more flexible.


  Sam

Re: GIN: any ordering guarantees for the hits returned?

From
Alex Drobychev
Date:


David Fetter <david@fetter.org> wrote:
On Fri, Nov 16, 2007 at 07:56:45PM -0800, adrobj wrote:
>
> Hello,
>
> I have a moderately large (~10-20GB) table:
>
> CREATE TABLE msgs (
> msg varchar(2048),
> msg_tsv tsvector,
> posted timestamp
> );
>
> CREATE INDEX msgs_i ON msgs USING gin(msg_tsv);
>
> The table never gets updated (more specifically, it gets re-created
> once a day with no updates in between).
>
> I want to run queries of the following form:
>
> SELECT msg, posted FROM msgs WHERE 'blah blah'::tsquery @@ msg_tsv
> ORDERED BY posted DESC; (with various LIMIT/OFFSET)
>
> Which obviously may get too expensive, for it will cause reading and
> sorting of all rows meeting the condition, i.e. too many disk reads.
>
> On the other hand, (as far as I understand) GIN always produces hits
> already sorted in the insertion order.
>
> So - what if I just populate my table in the order of decreasing
> 'posted', remove the "ORDERED BY" clause and just hope for the best?
> Will the correct ordering be guaranteed?

Ordering is never guaranteed without an ORDER BY, except in the time
between a CLUSTER and the first write operation after it.
Which sound like my case - there are no writes to the table!
 
Do I really need to CLUSTER - or just doing INSERTs in the right order would be sufficient?
 

> If not, are there any other ideas around?

Rather than assuming you know where problems will arise, do some
profiling and find out where they actually do :)
 
I agree with this maybe 98% - but not 100%. :-) Unfortunately performance can change rather unpredictably when the DB stops fitting in memory - say, 3-4 months after a production roll-out, too late for profiling experiments. :-(


Cheers,
David.
--
David Fetter http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Never miss a thing. Make Yahoo your homepage.

Re: GIN: any ordering guarantees for the hits returned?

From
Alvaro Herrera
Date:
Alex Drobychev wrote:

>   I agree with this maybe 98% - but not 100%. :-) Unfortunately
>   performance can change rather unpredictably when the DB stops
>   fitting in memory - say, 3-4 months after a production roll-out, too
>   late for profiling experiments. :-(

Surely you're capable of inventing random data to simulate the load
you'll have in 3-4 months or even a year?

David is correct in that the order is not guaranteed.  It's not just a
matter of which order the rows were inserted -- the executor can do a
lot of things internally that would make the result appear in a
different order.  Even when the data is CLUSTER'ed the ordering can be
lost.  If you want to have a guaranteed order, use ORDER BY.

--
Alvaro Herrera                          Developer, http://www.PostgreSQL.org/
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end." (2nd Commandment for C programmers)