Thread: SELECTing every Nth record for better performance

SELECTing every Nth record for better performance

From
Tom
Date:
I have a big table that is used for datalogging. I'm designing a
graphing interface that will visualise the data. When the user is
looking at a small date range I want the database to be queried for all
records, but when the user is 'zoomed out', looking at an overview, I
want to run a query that skips every nth record and returns a manageable
dataset that still gives a correct overview of the data without
slowing the programme down. Is there an easy way to do this that I
have overlooked? I looked at:


1. Taking the primary key (which is an auto-incrementing integer) and
running modulo on it. This will not work in my case, as the database is
being used for a number of different logging 'runs', and these runs
are not necessarily equally distributed over the primary keys.
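
A minimal sketch of what I mean here, with the table name and the step
as placeholders:

SELECT * FROM datalog WHERE id % 10 = 0;  -- keeps roughly every 10th row by primary key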


2. Subqueries that do SELECT COUNT to generate a row number are too
slow, as I'm dealing with thousands to tens of thousands of records.
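
The kind of query I mean (a sketch with placeholder names; the
correlated count is what makes it quadratic in the number of rows):

SELECT *
FROM (SELECT t.*,
             (SELECT count(*) FROM datalog t2 WHERE t2.id <= t.id) AS rn
      FROM datalog t) sub
WHERE rn % 10 = 0;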


3. My last idea was to create a sequence with CYCLE and min_value 0
and max_value 1:

SELECT * FROM (SELECT *, (SELECT nextval('counter_seq')) as counter
FROM table) as sub WHERE counter = 0

this didn't work (counter was always the same for all rows), so I put
SELECT nextval('counter_seq') in a function called counter():

SELECT *, counter() as counter FROM table

this gives the table I'm looking for; however, I am unable to use
WHERE counter = 0. When I run EXPLAIN, it tells me that it is actually
not looking at the values in the table but just running the function
again to filter. So I tried this:

SELECT *, (counter()+id-id) as counter FROM table

where id is the primary key of the table. I'm trying to fool the
planner into looking at the table instead of running the function
itself. Again, this query generates the right table. So I tried
adding WHERE counter = 0. Again it doesn't work: it returns the same
number of rows, but changes the values of all rows in the 'counter'
column to 1. EXPLAIN does not help me (it gives no information about
the filtering).
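
For reference, the sequence and wrapper function behind these attempts
would have been created along these lines (a sketch; the exact
definitions may have differed):

-- cycles 0, 1, 0, 1, ... on successive nextval() calls
CREATE SEQUENCE counter_seq MINVALUE 0 MAXVALUE 1 START WITH 0 CYCLE;

-- wrapper so nextval() can be called per output row
CREATE OR REPLACE FUNCTION counter() RETURNS bigint AS $$
    SELECT nextval('counter_seq');
$$ LANGUAGE sql;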


Any general thoughts on how to achieve my original goal, or on how to
fix the issues with my third attempt, are appreciated.

Tom

Re: SELECTing every Nth record for better performance

From
Richard Broersma
Date:
On Thu, Dec 3, 2009 at 9:26 PM, Tom <tom@cstcomposites.com> wrote:

> I
> want to run a query that skips every nth record and returns a manageable
> dataset that still gives a correct overview of the data without
> slowing the programme down. Is there an easy way to do this that I
> have overlooked? I looked at:

I've played with datalogging.  It was very easy to find nth records
when using date_trunc() on a timestamp.   The only minor problem with
date_trunc() was that I couldn't create arbitrary granularity.   For
example, it is easy to date_trunc() on a year, month, week, day, hour
or a minute but I wanted 5, 10 and 15 minute increments.  I bet there
could be a solution to this, but I never looked into it.


To improve the select performance, I created functional indexes using
different date_trunc() granularities.
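
A sketch of what I mean, with the table and column names as placeholders:

-- expression ("functional") index on the truncated timestamp
CREATE INDEX datalog_ts_hour_idx ON datalog (date_trunc('hour', ts));

-- a filter on the same expression can then use that index,
-- e.g. everything logged during one particular hour:
SELECT * FROM datalog WHERE date_trunc('hour', ts) = timestamp '2009-12-01 14:00';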

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug

Re: SELECTing every Nth record for better performance

From
"A. Kretschmer"
Date:
In response to Tom :
> I have a big table that is used for datalogging. I'm designing a
> graphing interface that will visualise the data. When the user is
> looking at a small date range I want the database to be queried for all
> records, but when the user is 'zoomed out', looking at an overview, I
> want to run a query that skips every nth record and returns a manageable
> dataset that still gives a correct overview of the data without
> slowing the programme down. Is there an easy way to do this that I
> have overlooked? I looked at:


Do you have 8.4? If yes:

test=# create table data as select s as s from generate_Series(1,1000) s;
SELECT



test=*# select s from (select *, row_number() over (order by s) from
data) foo where row_number % 3 = 0 limit 10;
 s
----
  3
  6
  9
 12
 15
 18
 21
 24
 27
 30
(10 rows)


-- or skip every 5th record:

test=*# select s from (select *, row_number() over (order by s) from
data) foo where row_number % 5 != 0 limit 10;
 s
----
  1
  2
  3
  4
  6
  7
  8
  9
 11
 12
(10 rows)


Regards, Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99

Re: SELECTing every Nth record for better performance

From
"A. Kretschmer"
Date:
In response to Richard Broersma :
> On Thu, Dec 3, 2009 at 9:26 PM, Tom <tom@cstcomposites.com> wrote:
>
> > I
> > want to run a query that skips every nth record and returns a manageable
> > dataset that still gives a correct overview of the data without
> > slowing the programme down. Is there an easy way to do this that I
> > have overlooked? I looked at:
>
> I've played with datalogging.  It was very easy to find nth records
> when using date_trunc() on a timestamp.   The only minor problem with
> date_trunc() was that I couldn't create arbitrary granularity.   For
> example, it is easy to date_trunc() on a year, month, week, day, hour
> or a minute but I wanted 5, 10 and 15 minute increments.  I bet there
> could be a solution to this, but I never looked into it.

How about:


test=# select * from data limit 10;
         ts
---------------------
 2009-12-01 00:00:00
 2009-12-01 00:01:00
 2009-12-01 00:02:00
 2009-12-01 00:03:00
 2009-12-01 00:04:00
 2009-12-01 00:05:00
 2009-12-01 00:06:00
 2009-12-01 00:07:00
 2009-12-01 00:08:00
 2009-12-01 00:09:00
(10 rows)

-- now with 5 minute increments, using date_trunc and extract:

test=# select * from data where extract(epoch from date_trunc('minute', ts))::int % (5*60) = 0 limit 10;
         ts
---------------------
 2009-12-01 00:00:00
 2009-12-01 00:05:00
 2009-12-01 00:10:00
 2009-12-01 00:15:00
 2009-12-01 00:20:00
 2009-12-01 00:25:00
 2009-12-01 00:30:00
 2009-12-01 00:35:00
 2009-12-01 00:40:00
 2009-12-01 00:45:00
(10 rows)



Regards, Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99

Re: SELECTing every Nth record for better performance

From
Ivan Voras
Date:
A. Kretschmer wrote:
> In response to Tom :
>> I have a big table that is used for datalogging. I'm designing a
>> graphing interface that will visualise the data. When the user is
>> looking at a small date range I want the database to be queried for all
>> records, but when the user is 'zoomed out', looking at an overview, I
>> want to run a query that skips every nth record and returns a manageable
>> dataset that still gives a correct overview of the data without
>> slowing the programme down. Is there an easy way to do this that I
>> have overlooked? I looked at:
>
>
> Do you have 8.4? If yes:
>
> test=# create table data as select s as s from generate_Series(1,1000) s;
> SELECT
>
>
>
> test=*# select s from (select *, row_number() over (order by s) from
> data) foo where row_number % 3 = 0 limit 10;

Won't this still read in the entire table and only then filter the
records out?

Re: SELECTing every Nth record for better performance

From
"A. Kretschmer"
Date:
In response to Ivan Voras :
> A. Kretschmer wrote:
> >In response to Tom :
> >>I have a big table that is used for datalogging. I'm designing a
> >>graphing interface that will visualise the data. When the user is
> >>looking at a small date range I want the database to be queried for all
> >>records, but when the user is 'zoomed out', looking at an overview, I
> >>want to run a query that skips every nth record and returns a manageable
> >>dataset that still gives a correct overview of the data without
> >>slowing the programme down. Is there an easy way to do this that I
> >>have overlooked? I looked at:
> >
> >
> >Do you have 8.4? If yes:
> >
> >test=# create table data as select s as s from generate_Series(1,1000) s;
> >SELECT
> >
> >
> >
> >test=*# select s from (select *, row_number() over (order by s) from
> >data) foo where row_number % 3 = 0 limit 10;
>
> Won't this still read in the entire table and only then filter the
> records out?

Yes.

But I think the problem is the graphing interface. He only wants a
'zoomed out' overview; he needs a 'manageable dataset'.



Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99

Re: SELECTing every Nth record for better performance

From
Grzegorz Jaśkiewicz
Date:
Why create another table? This is straightforward on 8.4:
SELECT * FROM (SELECT *, (row_number() OVER (ORDER BY id)) % 10 AS rn FROM table) sa WHERE sa.rn = 1 LIMIT 10;

Re: SELECTing every Nth record for better performance

From
Greg Smith
Date:
Tom wrote:
> 1. Taking the primary key (which is an auto-incrementing integer) and
> running modulo on it. This will not work in my case, as the database is
> being used for a number of different logging 'runs', and these runs
> are not necessarily equally distributed over the primary keys.
>
Problem #1 with what you're trying to do is that it's tricky to get SQL
to have a notion of "row number" in a result set, so that you can then
filter on that number.  The best approach to this is to use PostgreSQL
8.4 where the SQL Window functions can be used for this purpose.

> 2. Subqueries that do SELECT COUNT to generate a row number are too
> slow, as I'm dealing with thousands to tens of thousands of records.
>
Problem #2 is that if you're looking at only a mildly filtered version
of your data, you're going to pull the whole set in anyway.  Random data
point in this area:  if you do a scan that needs to look at 20% of a
table using an index, what will happen when you execute it?  The correct
answer is likely "sequential scan of the entire table", because that's
actually more efficient than trying to grab only a few records once the
percentage gets large enough.  The way multiple records get packed onto
a single page, you're likely to read every page of the data anyway even
when trying to grab a subset of them, unless the subset is very small
relative to the data and you can traverse an index usefully.

Once the "n" in your zoom gets large enough, it's possible to make this
worthwhile.  Unless your records are really wide, I would guess that it
would take a 1000:1 compression or more before you'd end up with a query
that's truly shorter than scanning the whole set.  And that wouldn't
work like what you're trying to do right now at all:  you'd instead have
to know the bounds of the data set, generate a sequence of points from
within that set, and then grab the records best matching those to get a
useful zoomed-out subset.  Basically, determine where the sequence of
records you need should be, then go into the data set to find just them
using something like "WHERE ts>x LIMIT 1"; that's the only way to not
scan the whole thing.
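
A rough sketch of that idea, assuming a timestamp column ts with an
ordinary index on it (the table name, time window and step are made up):

-- pick ~200 evenly spaced points across the zoom window, then fetch
-- the first logged row at or after each point; each point can be
-- resolved with index probes instead of scanning the whole range
SELECT d.*
FROM generate_series(timestamp '2009-12-01 00:00',
                     timestamp '2009-12-08 00:00',
                     interval '50 minutes') AS pt,
     datalog d
WHERE d.ts = (SELECT min(ts) FROM datalog WHERE ts >= pt);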

I think this whole approach isn't likely to ever converge on what you
want.  The direction I think you should be going is to consider whether
it's possible to create materialized views of your data that summarize
it at wider time scales.  You can't compute such a thing in real-time
usefully without reading the whole data set, and once you realize that
you might as well figure out how to only compute the summarized version
once.  The last comment in this thread as I write this, from Grzegorz,
suggests one approach for something like that.
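
For example, a sketch of such a summary table (8.4 has no built-in
materialized views, so a plain table refreshed by the loader or a cron
job is the usual substitute; the names, aggregates and hourly grain here
are assumptions):

-- one row per hour instead of one row per reading
CREATE TABLE datalog_hourly AS
SELECT date_trunc('hour', ts) AS ts,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value
FROM datalog
GROUP BY date_trunc('hour', ts);

CREATE INDEX datalog_hourly_ts_idx ON datalog_hourly (ts);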

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


import warnings ?

From
Malm Paul
Date:
Hi list,
I'm using FWTOOLS ogr2ogr to import an S-57 chart. I'm getting a warning when I'm importing in Linux but not in Windows XP,
see below.
Could anyone tell me why and what the difference is in the result (I can't see it because the lnam_refs field is empty),
and will I miss anything?

ogr2ogr -f "PostgreSQL" PG:"host=localhost dbname=US3MI01M user=ddd password=pwd" ./US3MI01M/US3MI01M.000 -lco
OVERWRITE=yes -nln depare DEPARE -t_srs "EPSG:4326"

Warning 6: Can't create field lnam_refs with type StringList on PostgreSQL layers.  Creating as VARCHAR.

Re: import warnings ?

From
Adrian Klaver
Date:
On Monday 21 December 2009 2:24:02 am Malm Paul wrote:
> Hi list,
> I'm using FWTOOLS ogr2ogr to import an S-57 chart. I'm getting a warning when
> I'm importing in Linux but not in Windows XP, see below. Could anyone tell
> me why and what the difference is in the result (I can't see it because the
> lnam_refs field is empty), and will I miss anything?
>
> ogr2ogr -f "PostgreSQL" PG:"host=localhost dbname=US3MI01M user=ddd
> password=pwd" ./US3MI01M/US3MI01M.000 -lco OVERWRITE=yes -nln depare DEPARE
> -t_srs "EPSG:4326"
>
> Warning 6: Can't create field lnam_refs with type StringList on PostgreSQL
> layers.  Creating as VARCHAR.

The StringList type is not a native Postgres type. It would seem to be something
FWTOOLS/OGR is supposed to supply and is not, in your Linux setup. My guess is
you would get a better answer from the FWTOOLS mailing list:
http://lists.maptools.org/mailman/listinfo/fwtools

--
Adrian Klaver
aklaver@comcast.net