advice sought - general approaches to optimizing queries around "event streams" - Mailing list pgsql-general
| From | Jonathan Vanasco |
|---|---|
| Subject | advice sought - general approaches to optimizing queries around "event streams" |
| Date | |
| Msg-id | 23BD7FFD-FB8F-40CF-B569-B5530BCEDA06@2xlp.com |
| List | pgsql-general |
I have a growing database with millions of rows that track resources against an event stream.

I have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans and to offer .05s partial scans.

No matter how I restructure queries, I can't seem to get around a few bottlenecks, and I wanted to know if there were any tips/tricks from the community on how to approach them.

A simple form of my database would be:
```sql
-- ~1k rows
create table stream (
    id int not null primary key
);

-- ~1MM rows
create table resource (
    id int not null primary key,
    col_a bool,
    col_b bool,
    col_c text
);

-- ~10MM rows
create table streamevent (
    id int not null primary key,
    event_timestamp timestamp not null,
    stream_id int not null references stream(id)
);

-- ~10MM rows
create table resource_2_stream_event (
    resource_id int not null references resource(id),
    streamevent_id int not null references streamevent(id)
);
```
Everything is running off of indexes; there are no seq scans.

I've managed to optimize my queries by avoiding joins against tables, and by turning the stream interaction into a subquery or CTE.

Better performance has come from limiting the number of "stream events" (which are only the timestamp and resource_id off a joined table).
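The post doesn't include the actual queries, so the following is only a hedged sketch of the subquery/CTE shape described, built against the schema above (the stream_id value and the LIMIT are made-up illustrations):

```sql
-- hypothetical sketch: restrict the stream's events in a CTE first,
-- then join the (now small) event set out to the resource mapping
WITH recent_events AS (
    SELECT se.id, se.event_timestamp
    FROM streamevent se
    WHERE se.stream_id = 55            -- hypothetical stream
    ORDER BY se.event_timestamp DESC
    LIMIT 1000                         -- the window being limited
)
SELECT r2s.resource_id, re.event_timestamp
FROM recent_events re
JOIN resource_2_stream_event r2s
  ON r2s.streamevent_id = re.id;
```

The point of the shape is that the expensive join only ever sees the limited window, not the full 10MM-row event table.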
The bottlenecks I've encountered have primarily been:

1. When interacting with a stream, ordering by event_timestamp and deduplicating resources becomes an issue. I've figured out a novel way to work with the most recent events, but distant events are troublesome:
   - using no limit, the query takes 3500ms
   - using a limit of 10000, the query takes 320ms
   - using a limit of 1000, the query takes 20ms

   There is a dedicated index on event_timestamp (desc), and according to the planner it is being used. Finding all the records is fine; merging into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max + GROUP BY).
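For concreteness, a hedged sketch of one way the deduplication step might be written against the schema above; `DISTINCT ON` is the Postgres idiom for "one row per resource, keeping the most recent event" (the stream_id is hypothetical, and the actual query may differ):

```sql
-- one most-recent event per resource within a stream, via DISTINCT ON;
-- the equivalent max + GROUP BY form has the same sort/aggregate cost profile
SELECT DISTINCT ON (r2s.resource_id)
       r2s.resource_id,
       se.event_timestamp
FROM streamevent se
JOIN resource_2_stream_event r2s
  ON r2s.streamevent_id = se.id
WHERE se.stream_id = 55               -- hypothetical stream
ORDER BY r2s.resource_id, se.event_timestamp DESC;
```

Note the ORDER BY that `DISTINCT ON` requires (the distinct columns first) conflicts with the timestamp-only ordering the index provides, which is consistent with the symptom that the sort/merge, not the index lookup, is where the time goes.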
2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin-based search).

   I thought about limiting the query by finding matching resources first, then locking it to an event stream, but scanning the entire table for a term takes about 10 seconds on an initial hit; subsequent queries for the same terms end up using the cache and complete within 20ms.

   I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events.
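As a sketch of the "calculate the event stream, then search it" approach: assuming the searchable text lives in resource.col_c (an assumption; the post doesn't say where the documents are), the GIN index and windowed search could look like this:

```sql
-- hypothetical expression index supporting the tsquery search
CREATE INDEX idx_resource_col_c_fts
    ON resource USING gin (to_tsvector('english', col_c));

-- compute the stream's resources first, then full-text search only those;
-- stream_id, the 30-day window, and the term are all illustrative
WITH stream_resources AS (
    SELECT DISTINCT r2s.resource_id
    FROM streamevent se
    JOIN resource_2_stream_event r2s
      ON r2s.streamevent_id = se.id
    WHERE se.stream_id = 55
      AND se.event_timestamp >= now() - interval '30 days'
)
SELECT r.id
FROM resource r
JOIN stream_resources sr ON sr.resource_id = r.id
WHERE to_tsvector('english', r.col_c) @@ plainto_tsquery('english', 'some term');
```

The trade-off described in the post shows up here directly: the CTE bounds the search set, but computing it reintroduces the event-window cost from bottleneck 1.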
I didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than I am with raw SQL right now.

I'm hoping someone can enlighten me into looking at new ways to solve these problems. I think I've learned more about postgres/sql in the past 48 hours than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data, rather than from optimizing the current queries.