Hi everyone,
I previously raised this issue (GCPStorage sink plugin committed offsets) regarding GCS bucket versioning and offset commits. When versioning is enabled and a file (Avro or Parquet) is uploaded, the offset of the last record in the file isn't committed. Suppose we send 5 messages to a topic (offsets 1 through 5): on upload of the resulting Avro file, named {topic-name}(46_000000000005).avro after the offset of the last written record, only offset 4 is committed. If the pod or application restarts, Kafka resumes from offset 5. If the next flush contains only one record, the connector writes a file with the exact same name, overwriting the previous file and effectively deleting records 1-4.
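To make the collision concrete, here is a minimal, self-contained sketch (not the connector's actual code; the fileName helper just mimics the naming scheme) of why naming files by the last offset alone leads to overwrites:

```scala
object NameCollisionDemo extends App {
  // Hypothetical helper mirroring the "{topic}({partition}_{lastOffset}).avro" scheme.
  def fileName(topic: String, partition: Int, lastOffset: Long): String =
    f"$topic($partition%d_$lastOffset%012d).avro"

  // First flush: records at offsets 1..5 land in one file named after offset 5.
  val beforeRestart = fileName("my-topic", 46, 5L)

  // Offset 4 is committed, the pod restarts, Kafka resumes at offset 5.
  // The next flush holds only the record at offset 5 -- same name again.
  val afterRestart = fileName("my-topic", 46, 5L)

  // Same object key: the second upload replaces the file that held records
  // 1..5, silently losing records 1..4.
  assert(beforeRestart == afterRestart)
  println(s"Both flushes write to: $beforeRestart")
}
```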
We have managed to reproduce this in a test as well. The issue does not occur when the connect.gcpstorage.exactly.once.enable property is set to true, but since that prevents us from replaying data, which we occasionally need to do, we can't use that feature.
Can you suggest a workaround or raise a bug to address this, please?
Having a peek inside your code, the problem seems to be centered around line 89 in writer.scala and line 62 in FileNamer.scala.
We are using the TopicFileNamePartitioner by virtue of using field partitioning. Even if you fix this by committing the last record's offset (currently last record - 1 is committed), a commit failure would still leave us with overwrites and data loss.
One reliable fix would be to include the processed timestamp in TopicPartitionOffsetFileNamer (in addition to topic, partition, and offset), ensuring unique filenames and preventing overwrites.
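For illustration, a rough sketch of that idea, assuming a namer interface roughly like the connector's (the trait and class names below are hypothetical, not the actual FileNamer API):

```scala
import java.time.Instant

// Hypothetical interface, sketched for illustration; not the real FileNamer trait.
trait SketchFileNamer {
  def name(topic: String, partition: Int, lastOffset: Long): String
}

// Appends the processing time of the flush, so replaying the same last offset
// after a restart yields a distinct object key instead of an overwrite.
class TimestampedOffsetFileNamer(extension: String) extends SketchFileNamer {
  override def name(topic: String, partition: Int, lastOffset: Long): String = {
    val processedAt = Instant.now().toEpochMilli
    f"$topic($partition%d_$lastOffset%012d_$processedAt%d).$extension"
  }
}
```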
Let us know your thoughts or if you have a preferred alternative. Once you indicate the direction you’d like to take, we can prepare a PR accordingly.
Hi Lenses devs,
Thinking more about it (I am a colleague of Umar), I think that if you fix the commit (currently the last record on each partition is never committed, so lag never reaches zero even when there is no traffic on the topic), the above would not be such a huge issue: the file containing the last record would either not be rewritten at all, or the whole file would be regenerated.
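To be concrete about the off-by-one, here is a minimal sketch of the commit we would expect, assuming standard Kafka consumer-group semantics (where the committed offset is the position of the next record to read); the object and method names are ours, not the connector's:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.clients.consumer.OffsetAndMetadata

object CommitSketch {
  // Under consumer-group semantics the committed offset is the NEXT offset to
  // read: after processing the record at offset N, commit N + 1 so that lag
  // can reach zero. Committing anything less leaves permanent lag.
  def offsetToCommit(tp: TopicPartition,
                     lastProcessed: Long): (TopicPartition, OffsetAndMetadata) =
    tp -> new OffsetAndMetadata(lastProcessed + 1)
}
```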
Nevertheless, it is probably worth adding more data to the filename, e.g. the lowest offset, rather than a timestamp: Kafka timestamps are not guaranteed to be strictly monotonically increasing, and they can all be identical.
But even then, after a reconfiguration we might still rewrite the same file with fewer records.
My colleague will raise a PR adding an option to supply a custom namer class (which could, for example, append a random string to the name to avoid clashes); see the sketch below.
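Something along these lines (purely illustrative; the trait and class names are not the connector's real API):

```scala
import java.util.UUID

// Hypothetical interface, sketched for illustration; not the real FileNamer trait.
trait SketchFileNamer {
  def name(topic: String, partition: Int, lastOffset: Long): String
}

class RandomSuffixFileNamer(extension: String) extends SketchFileNamer {
  override def name(topic: String, partition: Int, lastOffset: Long): String = {
    // A short random suffix makes two flushes ending at the same offset land
    // on different object keys, so neither can silently replace the other.
    val suffix = UUID.randomUUID().toString.take(8)
    f"$topic($partition%d_$lastOffset%012d_$suffix).$extension"
  }
}
```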
Another option could be to add the starting offset to the filename, not only the last offset (again, timestamps are not guaranteed to be strictly monotonically increasing or unique, so a filename built from a timestamp and the last offset would not reliably prevent rewriting the file either).
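A sketch of that range-based naming (names illustrative): encoding both the first and last offset makes each object key describe exactly which records the file contains.

```scala
// Illustrative sketch: a key that encodes the offset RANGE a file covers.
// A full flush of offsets 1..5 and a post-restart re-flush of just offset 5
// yield different names, so the smaller file cannot replace the larger one.
object OffsetRangeNamer {
  def name(topic: String, partition: Int,
           firstOffset: Long, lastOffset: Long, extension: String): String =
    f"$topic($partition%d_$firstOffset%012d_$lastOffset%012d).$extension"
}
```

Combined with the commit fix above, range-based keys would also make it obvious from a bucket listing which offsets each file covers.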