Re: Double insertion from scala spark job - Mailing list pgsql-jdbc

From Dave Cramer
Subject Re: Double insertion from scala spark job
Date
Msg-id CADK3HH+ycBStbXk2KjrFWt+-17gS0WE0t_KmKFxNhcFoGUfAwA@mail.gmail.com
In response to Double insertion from scala spark job  (Antoine DUBOIS <antoine.dubois@cc.in2p3.fr>)
Responses Re: Double insertion from scala spark job
List pgsql-jdbc


On Tue, 9 Feb 2021 at 06:48, Antoine DUBOIS <antoine.dubois@cc.in2p3.fr> wrote:
Hello

I'm working with Spark and PostgreSQL to compute statistics.
I have run into a strange behaviour in my job: when writing output to PostgreSQL, I sometimes get a double insertion into my table, which violates a constraint.
Detail: Key (xxx, xxx, xxx, xxx, xxx, xxxx, xxxx)=(2021-02-05 00:00:00, data, moredate, evenmoredata, somuchmoredata, dataagain, somuchofit) already exists.  Call getNextException to see other errors in the batch.
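The error text above suggests calling getNextException to see the other errors in the batch. A minimal sketch of walking that chain in Scala, using only java.sql, could look like the following. The names BatchErrors and chainedMessages are made up for illustration, and the exceptions built in main are stand-ins rather than real driver output; in a Spark job the SQLException may also be nested as the cause of another exception, so it might need to be unwrapped first.

```scala
import java.sql.{BatchUpdateException, SQLException}

object BatchErrors {
  // Collects the message of every exception linked through getNextException,
  // starting with the exception that was caught.
  def chainedMessages(first: SQLException): List[String] =
    Iterator
      .iterate(first)(_.getNextException)
      .takeWhile(_ != null)
      .map(_.getMessage)
      .toList

  def main(args: Array[String]): Unit = {
    // Stand-in exceptions (hypothetical), built by hand to exercise the helper.
    val root = new BatchUpdateException("Batch entry 0 failed", Array.emptyIntArray)
    root.setNextException(new SQLException("duplicate key value violates unique constraint", "23505"))
    chainedMessages(root).foreach(println)
  }
}
```

Logging the whole chain should show which rows of the batch were rejected, not just the first one.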
The data itself does not seem to be generated with duplicates: if I write the same data into MySQL or into a Parquet file, with the same input and processing, I don't observe this behaviour.
Dev spec:
Scala 2.12
Spark Version 3.0.1
JDK 8
jdbc  "org.postgresql" % "postgresql" % "42.2.18"

PostgreSQL 12.5

My code is pretty simple: it applies a SQL query to a Parquet file and writes the result like this:

outputDF.write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  .option("dbtable", "mytable")
  .mode("append")
  .save()
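One way to check whether the DataFrame already contains repeated key tuples before the write above is to group by the constrained columns and keep the groups that occur more than once. This is only a sketch: the key columns are redacted in the error message, so the column names (day, label, value), the sample rows, and the helper name duplicateKeys are all placeholders, and it assumes Spark is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DuplicateKeyCheck {
  // Returns one row per key combination that occurs more than once in df,
  // together with the number of occurrences in a "count" column.
  def duplicateKeys(df: DataFrame, keyColumns: Seq[String]): DataFrame =
    df.groupBy(keyColumns.map(col): _*)
      .count()
      .filter(col("count") > 1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("duplicate-key-check")
      .getOrCreate()
    import spark.implicits._

    // Placeholder data standing in for outputDF; the first two rows share a key.
    val outputDF = Seq(
      ("2021-02-05", "a", 1),
      ("2021-02-05", "a", 2),
      ("2021-02-06", "b", 3)
    ).toDF("day", "label", "value")

    duplicateKeys(outputDF, Seq("day", "label")).show(false)
    spark.stop()
  }
}
```

If this returns no rows for the real outputDF and the constraint violation still occurs, the repeated rows are more likely introduced during the write than by the query.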

What leads me to think it's a PostgreSQL JDBC bug more than anything else is that the same command, writing to MySQL or to a Parquet file, produces no duplicates in this particular edge case, which I only see with some of my input files.
If any of you has an idea of what could cause such behaviour (a special character in the input, a misconfiguration, or an option I don't know about that could help solve this issue), I would be glad to hear it.

I've come to a point where I'm not sure of anything any longer.
I hope someone will have some thoughts about it.

You are the first person to report such a problem.

Without additional information, such as your code, there's little we can do.


Dave Cramer
www.postgres.rocks
