Thread: How to check a table content efficiently? With LIMIT and OFFSET?

How to check a table content efficiently? With LIMIT and OFFSET?

From

Stefan Keller

Date:

28 May 2011, 19:13:20

Hi,

I’d like to monitor a table in a read-only Postgres database by
writing a PL/pgSQL function.
The database gets periodically overwritten by a mirroring loading process.
The success criteria is, that the table contains at least some records
(actually at least more than 100000).

The first idea which comes to one’s mind is probably a count(*):

SELECT CASE WHEN (count(*) > 10000) THEN 'yes' ELSE 'no' END
FROM planet_osm_point;

But my goal is to consume as few Postgres server resources as possible.
So my idea is to retrieve an arbitrary record, like this:

SELECT CASE WHEN (osm_id::text > '') THEN 'yes' ELSE 'no' END
FROM planet_osm_point LIMIT 1 OFFSET 1000000;

Is there anyone having an even better  better idea?

Yours, Stefan

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Stefan Keller

Date:

28 May 2011, 21:45:18

Hi,

That's my solution candidate:

CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS '
  SELECT (count(*) = 1)
  FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp
' LANGUAGE SQL;

Yours, Stefan

2011/5/28 Stefan Keller <sfkeller@gmail.com>:
> Hi,
>
> I’d like to monitor a table in a read-only Postgres database by
> writing a PL/pgSQL function.
> The database gets periodically overwritten by a mirroring loading process.
> The success criteria is, that the table contains at least some records
> (actually at least more than 100000).
>
> The first idea which comes to one’s mind is probably a count(*):
>
> SELECT CASE WHEN (count(*) > 10000) THEN 'yes' ELSE 'no' END
> FROM planet_osm_point;
>
> But my goal is to consume as few Postgres server resources as possible.
> So my idea is to retrieve an arbitrary record, like this:
>
> SELECT CASE WHEN (osm_id::text > '') THEN 'yes' ELSE 'no' END
> FROM planet_osm_point LIMIT 1 OFFSET 1000000;
>
> Is there anyone having an even better  better idea?
>
> Yours, Stefan
>

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Craig Ringer

Date:

29 May 2011, 02:32:30

On 05/29/2011 05:45 AM, Stefan Keller wrote:
> Hi,
>
> That's my solution candidate:
>
> CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS '
>    SELECT (count(*) = 1)
>    FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp
> ' LANGUAGE SQL;

LIMIT and OFFSET are often no more efficient than count(*). You're still
likely to need a full table scan.

Here's how I'd do it: I'd ANALYZE the table, then check the table
statistics to see that they looked to be within reasonable bounds. That
way you not only check the import, but in the process you ensure the
statistics used by the query planner are up to date. Since ANALYZE only
tests a sampling of records it does pretty much what you want, something
that it's not so easy to do in SQL.

--
Craig Ringer

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Stefan Keller

Date:

29 May 2011, 17:45:24

Hi Craig

Thanks for the answer. I also thought about this. You mean something like this?

SELECT reltuples FROM pg_class WHERE relname = 'mytable';
182820 (rows)

That seams reasonably fast compared to count(*).

But I'm hesitating to use ANALYZE for two reasons:
1. It's very slow: it repeadly takes 59000 ms on my machine.
2. There's an autovacuum background process which already does the
job, doesn't it?

Yours, Stefan

2011/5/29 Craig Ringer <craig@postnewspapers.com.au>:
> On 05/29/2011 05:45 AM, Stefan Keller wrote:
>>
>> Hi,
>>
>> That's my solution candidate:
>>
>> CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS '
>>   SELECT (count(*) = 1)
>>   FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp
>> ' LANGUAGE SQL;
>
> LIMIT and OFFSET are often no more efficient than count(*). You're still
> likely to need a full table scan.
>
> Here's how I'd do it: I'd ANALYZE the table, then check the table statistics
> to see that they looked to be within reasonable bounds. That way you not
> only check the import, but in the process you ensure the statistics used by
> the query planner are up to date. Since ANALYZE only tests a sampling of
> records it does pretty much what you want, something that it's not so easy
> to do in SQL.
>
> --
> Craig Ringer
>

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Alban Hertroys

Date:

29 May 2011, 21:21:33

On 29 May 2011, at 19:45, Stefan Keller wrote:

> But I'm hesitating to use ANALYZE for two reasons:
> 1. It's very slow: it repeadly takes 59000 ms on my machine.

ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (it has about 180k rows, you did
providethat information, but that's not much at all) and how many indexes are on it? Are you sure you're not
overburdeningyour hardware in some way? 

Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM
ANALYZE<table>? 

> 2. There's an autovacuum background process which already does the
> job, doesn't it?

Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse
immediatelyon that table. 

Also, on this mailing-list people don't appreciate it if you top-post. It makes the context hard to decipher and
sometimeseven makes it difficult to give an accurate answer because the information people want to refer to is far
separatedfrom the bit where they're trying to reply to something you said/asked. Remember, people aren't here for your
sake.

> 2011/5/29 Craig Ringer <craig@postnewspapers.com.au>:
>> On 05/29/2011 05:45 AM, Stefan Keller wrote:
>>>
>>> Hi,
>>>
>>> That's my solution candidate:
>>>
>>> CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS '
>>>   SELECT (count(*) = 1)
>>>   FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp
>>> ' LANGUAGE SQL;
>>
>> LIMIT and OFFSET are often no more efficient than count(*). You're still
>> likely to need a full table scan.
>>
>> Here's how I'd do it: I'd ANALYZE the table, then check the table statistics
>> to see that they looked to be within reasonable bounds. That way you not
>> only check the import, but in the process you ensure the statistics used by
>> the query planner are up to date. Since ANALYZE only tests a sampling of
>> records it does pretty much what you want, something that it's not so easy
>> to do in SQL.
>>
>> --
>> Craig Ringer
>>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.

!DSPAM:737,4de2b6f311926797845409!

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Stefan Keller

Date:

29 May 2011, 21:55:09

Hi Alban

On 2011/5/29 Alban Hertroys wrote:
> On 29 May 2011, at 19:45, Stefan Keller wrote:
>
>> But I'm hesitating to use ANALYZE for two reasons:
>> 1. It's very slow: it repeadly takes 59000 ms on my machine.
>
> ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (it has about 180k rows, you did
providethat information, but that's not much at all) and how many indexes are on it? Are you sure you're not
overburdeningyour hardware in some way? 
>
> Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM
ANALYZE<table>? 

You are right: I used ANALYZE (without specifying a table). But this
still takes about 1 to 3 sec which is about 100 times slower than

SELECT (count(*) = 1) FROM (SELECT osm_id FROM planet_osm_point LIMIT
1 OFFSET 100000) tmp;
or
SELECT reltuples FROM pg_class WHERE relname = 'planet_osm_point';

>> 2. There's an autovacuum background process which already does the
>> job, doesn't it?
>
> Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse
immediatelyon that table. 

My table is a read-only table after all.
That's another reason why I'm reluctant using ANALYZE <table>.

> Also, on this mailing-list people don't appreciate it if you top-post. It makes the context
> hard to decipher and sometimes even makes it difficult to give an accurate answer
> because the information people want to refer to is far separated from the bit where
> they're trying to reply to something you said/asked. Remember, people aren't here for your sake.

Thank you for the hint, which I didn't know:
Is this really still part of this elderly USENET netiquette here?

Yours, Stefan

>> 2011/5/29 Craig Ringer <craig@postnewspapers.com.au>:
>>> On 05/29/2011 05:45 AM, Stefan Keller wrote:
>>>>
>>>> Hi,
>>>>
>>>> That's my solution candidate:
>>>>
>>>> CREATE OR REPLACE FUNCTION isnotempty() RETURNS boolean AS '
>>>>   SELECT (count(*) = 1)
>>>>   FROM (SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000) tmp
>>>> ' LANGUAGE SQL;
>>>
>>> LIMIT and OFFSET are often no more efficient than count(*). You're still
>>> likely to need a full table scan.
>>>
>>> Here's how I'd do it: I'd ANALYZE the table, then check the table statistics
>>> to see that they looked to be within reasonable bounds. That way you not
>>> only check the import, but in the process you ensure the statistics used by
>>> the query planner are up to date. Since ANALYZE only tests a sampling of
>>> records it does pretty much what you want, something that it's not so easy
>>> to do in SQL.
>>>
>>> --
>>> Craig Ringer
>>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
> Alban Hertroys
>
> --
> Screwing up is an excellent way to attach something to the ceiling.
>
>
> !DSPAM:1205,4de2b6e411923449910736!
>
>
>

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Craig Ringer

Date:

29 May 2011, 23:35:57

On 05/30/2011 05:55 AM, Stefan Keller wrote:
> Hi Alban
>
> On 2011/5/29 Alban Hertroys wrote:
>> On 29 May 2011, at 19:45, Stefan Keller wrote:
>>
>>> But I'm hesitating to use ANALYZE for two reasons:
>>> 1. It's very slow: it repeadly takes 59000 ms on my machine.
>>
>> ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (it has about 180k rows, you did
providethat information, but that's not much at all) and how many indexes are on it? Are you sure you're not
overburdeningyour hardware in some way? 
>>
>> Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM
ANALYZE<table>?
>
> You are right: I used ANALYZE (without specifying a table). But this
> still takes about 1 to 3 sec which is about 100 times slower than

Hmm, ok. I would've expected ANALYZE on the table to be much, much
faster than your LIMIT ... OFFSET query. If the LIMIT ... OFFSET
approach works better for you, use that.

> SELECT (count(*) = 1) FROM (SELECT osm_id FROM planet_osm_point LIMIT
> 1 OFFSET 100000) tmp;

Instead of (count(*)=1) try writing:

SELECT EXISTS(SELECT osm_id FROM planet_osm_point LIMIT 1 OFFSET 100000);

(untested but I'm pretty sure that's right). Just a readability tweak,
it shouldn't make any real difference in performance.

> SELECT reltuples FROM pg_class WHERE relname = 'planet_osm_point';

That is using cached statistics. If you have just done a batch update
then it is *not* trustworthy without running ANALYZE tablename; first .

--
Craig Ringer

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Jaime Casanova

Date:

30 May 2011, 01:27:17

On Sun, May 29, 2011 at 4:55 PM, Stefan Keller <sfkeller@gmail.com> wrote:
>
>>> 2. There's an autovacuum background process which already does the
>>> job, doesn't it?
>>
>> Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse
immediatelyon that table. 
>
> My table is a read-only table after all.
> That's another reason why I'm reluctant using ANALYZE <table>.
>

sorry, i don't follow that... why do you think that a read-only table
doesn't need an ANALYZE?

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Stefan Keller

Date:

30 May 2011, 18:27:52

Hi Jaime

2011/5/30 Jaime Casanova <jaime@2ndquadrant.com> wrote:
> On Sun, May 29, 2011 at 4:55 PM, Stefan Keller <sfkeller@gmail.com> wrote:
>>
>>>> 2. There's an autovacuum background process which already does the
>>>> job, doesn't it?
>>>
>>> Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse
immediatelyon that table. 
>>
>> My table is a read-only table after all.
>> That's another reason why I'm reluctant using ANALYZE <table>.
>>
>
> sorry, i don't follow that... why do you think that a read-only table
> doesn't need an ANALYZE?

Thanks for joining the discussion.

I'm only reluctant to do an ANALYZE as part of a perdiodical (hourly)
"check table contents" function.

Such an ANALYZE command is already included in the perdiodical
(nightly) update script which mirrors OpenStreetMap data.

BTW: I've asked before for best parameters over at pgsql-performance
("How to configure a read-only database server?" 19. April 2011) and I
am still happy about any hint.

Yours, Stefan

Re: How to check a table content efficiently? With LIMIT and OFFSET?

From

Alban Hertroys

Date:

30 May 2011, 21:06:34

On 29 May 2011, at 23:55, Stefan Keller wrote:

> Hi Alban
>
> On 2011/5/29 Alban Hertroys wrote:
>> On 29 May 2011, at 19:45, Stefan Keller wrote:
>>
>>> But I'm hesitating to use ANALYZE for two reasons:
>>> 1. It's very slow: it repeadly takes 59000 ms on my machine.
>>
>> ANALYZE on a single table takes 59s?!? That's _really_ long. How big is that table (it has about 180k rows, you did
providethat information, but that's not much at all) and how many indexes are on it? Are you sure you're not
overburdeningyour hardware in some way? 
>>
>> Or are you in fact talking about a different command? For example, ANALYZE (without specifying a table) or VACUUM
ANALYZE<table>? 
>
> You are right: I used ANALYZE (without specifying a table). But this
> still takes about 1 to 3 sec which is about 100 times slower than
>
> SELECT (count(*) = 1) FROM (SELECT osm_id FROM planet_osm_point LIMIT
> 1 OFFSET 100000) tmp;
> or

Still 1 to 3 seconds? It should be faster than a full table-scan, as it just takes samples across the table. Perhaps
youhave many indexes or some of an uncommon type? Or is your system I/O-bound perhaps? It does need to write those
statisticsto a system table at some point, in which an I/O-bound system won't perform very well of course. 

> SELECT reltuples FROM pg_class WHERE relname = 'planet_osm_point';

That will only work if you want to know the total amount of rows in the table. If you need to know how many rows will
matcha specific WHERE-clause this falls on its behind. For cases like that you can use the output of EXPLAIN <query>,
butthat will not be very accurate since it uses statistical information about value distribution and such of your data
(whichis gathered by ANALYSE). 

If you really only need the total number of records and if you're indeed inserting/deleting in batches, then it's
probablybest to create statement-level INSERT/DELETE triggers (that could call the same function). I've done this in
thepast using a row-level trigger, but my data came in live. For batches of data it's probably more efficient to call a
statement-leveltrigger once per query than a row-level one for each row. I think you can obtain the number of modified
rowsfrom GET DIAGNOSTICS, off the top of my head. 

>>> 2. There's an autovacuum background process which already does the
>>> job, doesn't it?
>>
>> Yes, but in its own time. If you know there has been a batch of inserts/deletes you might as well run analyse
immediatelyon that table. 
>
> My table is a read-only table after all.
> That's another reason why I'm reluctant using ANALYZE <table>.

You probably won't need to run it as often as every time you need to know the number of rows in it. If the data doesn't
change,then the row-count in the statistics won't either. 
You probably do want to run this after a batch-INSERT/DELETE, or your row-counts will be inaccurate until auto-vacuum
comesalong. 

>> Also, on this mailing-list people don't appreciate it if you top-post. It makes the context
>> hard to decipher and sometimes even makes it difficult to give an accurate answer
>> because the information people want to refer to is far separated from the bit where
>> they're trying to reply to something you said/asked. Remember, people aren't here for your sake.
>
> Thank you for the hint, which I didn't know:
> Is this really still part of this elderly USENET netiquette here?

There's nothing elderly about it. If you're communicating with a large number of people at once, the requirements
change.You don't want to make it difficult on people to follow a thread they possibly didn't follow earlier, or they
eitherwon't bother to answer or they only pick up the latest bit of the thread. In both cases the chances that their
answerswill be irrelevant are quite significant, provided they even do reply. 

In the end it has little to do with style and much more with common sense. The format used here (as well as in USENET)
ismore suitable for mailing lists. 

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4de33d4211921620335251!