Bharath-san, all,
Hmm, I didn't experience performance degradation on my poor-man's Linux VM (4 CPU, 4 GB RAM, HDD)...
[benchmark preparation]
autovacuum = off
shared_buffers = 1GB
checkpoint_timeout = 1h
max_wal_size = 8GB
min_wal_size = 8GB
(other settings to enable parallelism)
CREATE UNLOGGED TABLE a (c char(1100));
INSERT INTO a SELECT i FROM generate_series(1, 300000) i;
(the table size is 335 MB)
[benchmark]
CREATE TABLE b AS SELECT * FROM a;
DROP TABLE a;
CHECKPOINT;
(measure only CTAS)
[results]
parallel_leader_participation = off
  workers  time(ms)
  0  3921
  2  3290
  4  3132
parallel_leader_participation = on
  workers  time(ms)
  2  3266
  4  3247
Although this should be a controversial and may be crazy idea, the following change brought 4-11% speedup.  This is
becauseI thought parallel workers might contend for WAL flush as a result of them using the limited ring buffer and
flushingdirty buffers when the ring buffer is filled.  Can we take advantage of this?
 
[GetBulkInsertState]
/*  bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);*/
    bistate->strategy = NULL;
[results]
parallel_leader_participation = off
  workers  time(ms)
  0  3695  (5% reduction)
  2  3135  (4% reduction)
  4  2767  (11% reduction)
Regards
Takayuki Tsunakawa