Re: Large Table - Slow Window Functions (Better Approach?) - Mailing list pgsql-performance

From Jeff Adams - NOAA Affiliate
Subject Re: Large Table - Slow Window Functions (Better Approach?)
Date
Msg-id CA+BdxK8uH4ex7nL3FUF9r6PjgD-Ut0-rLkJT7p6FyJA=21Kdbw@mail.gmail.com
Whole thread Raw
In response to Re: Large Table - Slow Window Functions (Better Approach?)  (Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>)
List pgsql-performance
So, I tested out the cursor approach, and it still chugs along for hours. If the result set is large (and the available memory to process small), does it matter what goes on within the cursor. Will it still choke trying assemble and spit out the large result set?

On Mon, Mar 11, 2013 at 11:48 AM, Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov> wrote:
Thanks again. The sorting does appear to be the issue. I will test out your cursor idea...


On Mon, Mar 11, 2013 at 11:34 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
> Pavel,
>
> Thanks for the response. I have not yet had the opportunity to use cursors,
> but am now curious. Could you perhaps provide a bit more detail as to what
> the implementation of your suggested approach would look like?

an example:

$$
DECLARE
  r record;
  prev_r record;

BEGIN
  FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
  LOOP
    IF prev_r IS NOT NULL THEN
      /* do some counting */
      prev_r contains previous row, r contains current row
      do some
      RETURN NEXT .. /* return data in defined order */
    END IF;
    prev_r = r;
  END LOOP;


Probably slow part of your query is sorting - first can be accelerated
by index, but second (as CTE result cannot) - you can try increase
work_mem ??

Regards

Pavel

>
>
> On Mon, Mar 11, 2013 at 11:03 AM, Pavel Stehule <pavel.stehule@gmail.com>
> wrote:
>>
>> Hello
>>
>> you can try procedural solution - use a cursor over ordered data in
>> plpgsql and returns table
>>
>> Regards
>>
>> Pavel Stehule
>>
>> 2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
>> > Greetings,
>> >
>> >
>> >
>> > I have a large table (~90 million rows) containing vessel positions. In
>> > addition to a column that contains the location information (the_geom),
>> > the
>> > table also contains two columns that are used to uniquely identify the
>> > vessel (mmsi and name) and a column containing the Unix time (epoch) at
>> > which the position information was logged. I frequently need to assign
>> > records to vessel transits. To do this, I currently create a CTE that
>> > uses a
>> > Window function (partitioning the data by mmsi and name ordered by
>> > epoch) to
>> > examine the time that has elapsed between successive position reports
>> > for
>> > individual vessels. For every position record for a vessel (as
>> > identified
>> > using mmsi and name), if the time elapsed between the current position
>> > record and the previous record (using the lag function) is less than or
>> > equal to 2 hours, I assign the record a value of 0 to a CTE column named
>> > tr_index. If the time elapsed is greater than 2 hours, I assign the
>> > record a
>> > value of 1 to the tr_index column. I then use the CTE to generate
>> > transit
>> > numbers by summing the values in the tr_index field across a Window that
>> > also partitions the data by mmsi and name and is ordered by epoch. This
>> > works, but is very slow (hours). The table is indexed (multi-column
>> > index on
>> > mmsi, name and index on epoch). Does anyone see a way to get what I am
>> > after
>> > in a more efficient manner. What I am after is an assignment of transit
>> > number to vessels' position records based on whether the records were
>> > within
>> > two hours of each other. The SQL that I used is provided below. Any
>> > advice
>> > would be greatly appreciated...
>> >
>> >
>> >
>> > WITH
>> >
>> > cte_01 AS
>> >
>> > (
>> >
>> > SELECT
>> >
>> > a.id,
>> >
>> > a.mmsi,
>> >
>> > a.name,
>> >
>> > a.epoch,
>> >
>> > a.the_geom
>> >
>> > CASE
>> >
>> >   WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
>> >
>> >   ELSE 0
>> >
>> > END AS tr_index
>> >
>> > FROM table a
>> >
>> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >
>> > )
>> >
>> >
>> >
>> >
>> >
>> > SELECT
>> >
>> > a.id,
>> >
>> > a.mmsi,
>> >
>> > a.name,
>> >
>> > a.epoch,
>> >
>> > a.the_geom,
>> >
>> > 1 + sum(a.tr_index) OVER w AS transit,
>> >
>> > a.active
>> >
>> > FROM cte_01 a
>> >
>> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >
>> >
>> >
>> > --
>> > Jeff



--
Jeffrey D. Adams
Contractor
OAI, Inc.
In support of:
National Marine Fisheries Service
Office of Protected Resources
1315 East West Hwy, Building SSMC3
Silver Spring, MD 20910-3282
phone: (301) 427-8434
fax: (301) 713-0376

pgsql-performance by date:

Previous
From: Misa Simic
Date:
Subject: Slow concurrent processing
Next
From: Victor Yegorov
Date:
Subject: Re: Large Table - Slow Window Functions (Better Approach?)