Hello
I'm working with Spark and PostgreSQL to compute statistics.
I've run into a strange behaviour in my job: when writing to PostgreSQL, I sometimes get a duplicate insertion into my table (violating a unique constraint).
Detail: Key (xxx, xxx, xxx, xxx, xxx, xxxx, xxxx)=(2021-02-05 00:00:00, data, moredate, evenmoredata, somuchmoredata, dataagain, somuchofit) already exists. Call getNextException to see other errors in the batch.
The data themselves don't seem to be generated with duplicates: if I write the same data into MySQL or into a Parquet file, with the same input and the same processing, I don't observe this behaviour.
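To rule out duplicates in the data itself, a check along these lines can be run on the DataFrame before the write (a sketch; the column names are placeholders for the actual key columns from the error):

import org.apache.spark.sql.functions.col

// Placeholder key columns; substitute the real constraint columns.
val keyCols = Seq("col1", "col2", "col3")

// Group on the key and keep any key that occurs more than once.
val dupes = outputDF
  .groupBy(keyCols.map(col): _*)
  .count()
  .filter(col("count") > 1)

// Empty output here means the DataFrame carries no duplicate keys.
dupes.show(20, truncate = false)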
Dev spec:
Scala 2.12
Spark Version 3.0.1
JDK 8
jdbc "org.postgresql" % "postgresql" % "42.2.18"
PostgreSQL 12.5
My code is pretty simple: it applies a SQL query to a Parquet file and writes the result like this:
import org.apache.spark.sql.SaveMode

outputDF.write
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  .option("dbtable", "mytable")
  .mode(SaveMode.Append)
  .save()
What leads me to think it's a PostgreSQL JDBC issue more than anything else is that this same command, writing to MySQL or to a Parquet file, produces no duplicates in this particular edge case, which I hit with only some of my input files.
If any of you have an idea what could cause such a behaviour (a special character in the input, something I've misconfigured, maybe an option I don't know about that could help solve this issue), I'd be glad to hear it.
I've reached a point where I'm not sure of anything any longer.
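If it does turn out to be a re-executed insert rather than a driver bug, the only workaround I can think of is making the insert idempotent on the PostgreSQL side with ON CONFLICT DO NOTHING. A rough sketch, bypassing Spark's JDBC writer entirely (table and column names are placeholders):

import java.sql.DriverManager
import org.apache.spark.sql.Row

// Each executor opens its own connection and inserts its partition.
// ON CONFLICT DO NOTHING makes a re-run of the same rows a no-op.
outputDF.rdd.foreachPartition { rows: Iterator[Row] =>
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://<HOST>:<PORT>/<SCHEMA>?user=<USERNAME>&password=<PASSWORD>")
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO mytable (col1, col2, col3) VALUES (?, ?, ?) ON CONFLICT DO NOTHING")
    rows.foreach { row =>
      stmt.setObject(1, row.get(0).asInstanceOf[AnyRef])
      stmt.setObject(2, row.get(1).asInstanceOf[AnyRef])
      stmt.setObject(3, row.get(2).asInstanceOf[AnyRef])
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    conn.close()
  }
}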
I hope someone will have some thoughts about it.