I have an existing heavy ETL that serially loads tons of data to PostgreSQL.
This is done using a CLI tool, processing one project after another.
I'd like to parallelize / distribute the work, which I could do from my CLI
tool, but 1) that would be confined to a single machine, and 2) we'd like to
provide a web UI to report progress and logs from the ETL workers / processes.
I mentioned in the past that I have a basic LISTEN/NOTIFY-based work queue
in PostgreSQL, which would do nicely for this, but I'm wondering how best
to deal with the logs.
The ETL process is quite chatty and can generate quite a bit of logs.
For the Web UI (developed independently, by a separate team) to show them,
I'd want those logs in PostgreSQL as well. The ETL worker wants to write them
almost continuously, and the Web UI wants to read them concurrently, to
display them, ideally in a timely fashion.
My experience with CI tools like Jenkins is that logs are "laggy" and
arrive in big chunks, dozens of seconds apart, which is a subpar experience.
That's not PostgreSQL-related, but it illustrates what I'd like to avoid.
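To make the "timely" part concrete, here's roughly the reader I'm picturing
on the Web UI side. This is just a psycopg2 sketch; the etl_log table, its
columns, and the notification channel name are placeholders I made up for
illustration. The idea is that the workers NOTIFY after committing a batch of
log rows, and the reader wakes up and fetches only the rows it hasn't shown yet:

import select
import psycopg2

conn = psycopg2.connect("dbname=etl")
conn.autocommit = True          # keep LISTEN outside explicit transactions

cur = conn.cursor()
cur.execute("LISTEN etl_log")

last_id = 0                     # highest etl_log.id already shown in the UI

while True:
    # Wait (up to 5s) for a notification from a worker, then drain them all.
    if select.select([conn], [], [], 5) == ([], [], []):
        continue                # timeout, nothing new
    conn.poll()
    conn.notifies.clear()

    # Fetch only the rows we haven't displayed yet.
    cur.execute(
        "SELECT id, job_id, logged_at, line FROM etl_log"
        " WHERE id > %s ORDER BY id",
        (last_id,),
    )
    for row_id, job_id, logged_at, line in cur:
        last_id = row_id
        print(f"[{job_id} {logged_at}] {line}")   # hand off to the Web UI here

That way the latency would be bounded by how often the workers flush,
rather than by a fixed polling interval.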
Has anyone done anything along these lines?
Given the way MVCC works in PostgreSQL, updating (by appending to) a
"file-like" bytea or text[] column seems like a bad idea. So should each log
line be its own row? Or should lines with close timestamps be aggregated
together, to limit the number of rows generated?
If granular at the line level, will tons of small transactions be a problem?
And blow up our "OID budget" too rapidly?
Am I worrying too much? :)
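For what it's worth, the write side I'm considering looks roughly like the
sketch below (again psycopg2, with the same made-up etl_log table and
channel): one row per log line, but buffered client-side and flushed as a
single multi-row INSERT every N lines or T milliseconds, so the number of
transactions stays small no matter how chatty the ETL gets.

import time
import psycopg2
from psycopg2.extras import execute_values

# Assumed (made-up) table:
#   etl_log(id bigserial primary key, job_id text,
#           logged_at timestamptz default now(), line text)
conn = psycopg2.connect("dbname=etl")

buffer = []                    # log lines not yet flushed to the database
last_flush = time.monotonic()

def log_line(job_id, line, max_lines=100, max_delay=0.5):
    """Buffer a log line; flush as one multi-row INSERT every max_lines
    lines or max_delay seconds, so we issue a couple of transactions per
    second instead of one per line."""
    global last_flush
    buffer.append((job_id, line))
    if len(buffer) >= max_lines or time.monotonic() - last_flush >= max_delay:
        with conn:                      # one transaction per batch
            with conn.cursor() as cur:
                execute_values(
                    cur,
                    "INSERT INTO etl_log (job_id, line) VALUES %s",
                    buffer,
                )
                cur.execute("NOTIFY etl_log")   # wake up the Web UI reader
        buffer.clear()
        last_flush = time.monotonic()

The NOTIFY is only delivered at commit, which is exactly when the rows become
visible to the reader, so the two sides should stay in sync.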
I'd appreciate any advice or experience-based stories around this
use case, please.
Thanks, --DD