advice sought - general approaches to optimizing queries around "event streams" - Mailing list pgsql-general
| From | Jonathan Vanasco |
|---|---|
| Subject | advice sought - general approaches to optimizing queries around "event streams" |
| Date | |
| Msg-id | 23BD7FFD-FB8F-40CF-B569-B5530BCEDA06@2xlp.com |
| List | pgsql-general |
I have a growing database with millions of rows that track resources against an event stream.

I have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans and to offer .05s partial scans.

No matter how I restructure queries, I can't seem to get around a few bottlenecks, and I wanted to know if there were any tips/tricks from the community on how to approach them.

A simple form of my database would be:
```sql
-- ~1k rows
create table stream (
    id int not null primary key
);

-- ~1MM rows
create table resource (
    id int not null primary key,
    col_a bool,
    col_b bool,
    col_c text
);

-- ~10MM rows
create table streamevent (
    id int not null primary key,
    event_timestamp timestamp not null,
    stream_id int not null references stream(id)
);

-- ~10MM rows
create table resource_2_stream_event (
    resource_id int not null references resource(id),
    streamevent_id int not null references streamevent(id)
);
```
Everything is running off of indexes; there are no seq scans.

I've managed to optimize my queries by avoiding joins against tables, and by turning the stream interaction into a subquery or CTE.

Better performance has come from limiting the number of "stream events" (which are only the timestamp and resource_id off a joined table).
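The post doesn't include the actual queries, so the following is only a hedged sketch of the subquery/CTE shape described, built against the schema above (the stream_id value and the LIMIT are made-up illustrations):

```sql
-- hypothetical sketch: restrict the stream's events in a CTE first,
-- then join the (now small) event set out to the resource mapping
WITH recent_events AS (
    SELECT se.id, se.event_timestamp
    FROM streamevent se
    WHERE se.stream_id = 55            -- hypothetical stream
    ORDER BY se.event_timestamp DESC
    LIMIT 1000                         -- the window being limited
)
SELECT r2s.resource_id, re.event_timestamp
FROM recent_events re
JOIN resource_2_stream_event r2s
  ON r2s.streamevent_id = re.id;
```

The point of the shape is that the expensive join only ever sees the limited window, not the full 10MM-row event table.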
The bottlenecks I've encountered have primarily been:

1. When interacting with a stream, ordering by event_timestamp and deduplicating resources becomes an issue. I've figured out a novel way to work with the most recent events, but distant events are troublesome:
   - using no limit, the query takes 3500ms
   - using a limit of 10000, the query takes 320ms
   - using a limit of 1000, the query takes 20ms

   There is a dedicated index on event_timestamp (desc), and according to the planner it is being used. Finding all the records is fine; merging into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max + GROUP BY).
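For concreteness, a hedged sketch of one way the deduplication step might be written against the schema above; `DISTINCT ON` is the Postgres idiom for "one row per resource, keeping the most recent event" (the stream_id is hypothetical, and the actual query may differ):

```sql
-- one most-recent event per resource within a stream, via DISTINCT ON;
-- the equivalent max + GROUP BY form has the same sort/aggregate cost profile
SELECT DISTINCT ON (r2s.resource_id)
       r2s.resource_id,
       se.event_timestamp
FROM streamevent se
JOIN resource_2_stream_event r2s
  ON r2s.streamevent_id = se.id
WHERE se.stream_id = 55               -- hypothetical stream
ORDER BY r2s.resource_id, se.event_timestamp DESC;
```

Note the ORDER BY that `DISTINCT ON` requires (the distinct columns first) conflicts with the timestamp-only ordering the index provides, which is consistent with the symptom that the sort/merge, not the index lookup, is where the time goes.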
2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin-based search).

   I thought about limiting the query by finding matching resources first, then locking it to an event stream, but scanning the entire table for a term takes about 10 seconds on an initial hit; subsequent queries for the same terms end up using the cache and complete within 20ms.

   I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events.
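As a sketch of the "calculate the event stream, then search it" approach: assuming the searchable text lives in resource.col_c (an assumption; the post doesn't say where the documents are), the GIN index and windowed search could look like this:

```sql
-- hypothetical expression index supporting the tsquery search
CREATE INDEX idx_resource_col_c_fts
    ON resource USING gin (to_tsvector('english', col_c));

-- compute the stream's resources first, then full-text search only those;
-- stream_id, the 30-day window, and the term are all illustrative
WITH stream_resources AS (
    SELECT DISTINCT r2s.resource_id
    FROM streamevent se
    JOIN resource_2_stream_event r2s
      ON r2s.streamevent_id = se.id
    WHERE se.stream_id = 55
      AND se.event_timestamp >= now() - interval '30 days'
)
SELECT r.id
FROM resource r
JOIN stream_resources sr ON sr.resource_id = r.id
WHERE to_tsvector('english', r.col_c) @@ plainto_tsquery('english', 'some term');
```

The trade-off described in the post shows up here directly: the CTE bounds the search set, but computing it reintroduces the event-window cost from bottleneck 1.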
I didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than I am with raw SQL right now.

I'm hoping someone can enlighten me into looking at new ways to solve these problems. I think I've learned more about postgres/sql in the past 48 hours than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data, rather than from optimizing the current queries.