I have a question regarding the filename configuration in the GCP Storage Sink Connector.
According to the documentation (GCP Storage | Lenses Docs), the blob name can be customized in several ways. For the query:
```
INSERT INTO [bucketName]:dir1/dir2
SELECT * FROM [topicName]
PARTITIONBY _header.dt, _header.event_id
STOREAS `AVRO`
PROPERTIES(...)
```
The resulting filename is structured as:

```
dir1/dir2/dt=2024/09/03/eventId=123/topicName(partition_topicOffset).ext
```

We would like to change the last path element to a format such as `date(optional(partition_topicOffset)).ext` instead of `topicName(partition_topicOffset).ext`.
Could you confirm if this is supported and, if so, what configuration options we should specify?
Hi @Umar_Cheema,
The docs do indeed describe how the blob key is set, but the key name format is not configurable at the moment, so what you're asking for is not currently possible. We would appreciate it if you could share your use case with us.

A word of caution on the format you propose: if you use only the date and there is more than one flush within a day, each flush overwrites the same blob, which leads to data loss unless object versioning is enabled. Even if you use the tuple date + partition + offset, when the connector reads from a multi-partition topic there is still a chance that data from different partitions generates the same blob key, again leading to data loss.
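To make the collision risk concrete, here is a minimal sketch (hypothetical key formats for illustration only, not actual connector configuration) showing how a date-only key, and a date + offset key without the partition, can both map different batches to the same blob name:

```python
from datetime import date

# Simulated flushed batches from a two-partition topic:
# (partition, last offset in the batch)
batches = [(0, 100), (1, 100), (0, 200)]

def date_only_key(partition, offset):
    # Date-only key: every flush on the same day maps to the same blob,
    # so each flush overwrites the previous one.
    return f"dt={date(2024, 9, 3).isoformat()}/data.avro"

def date_offset_key(partition, offset):
    # Date + offset without the partition: partitions 0 and 1 can reach
    # the same offset and therefore produce the same blob key.
    return f"dt={date(2024, 9, 3).isoformat()}/{offset}.avro"

for fn in (date_only_key, date_offset_key):
    keys = [fn(p, o) for p, o in batches]
    print(fn.__name__, "collisions:", len(keys) - len(set(keys)))
```

With versioning disabled, each collision above means an earlier blob is silently overwritten.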
Let's say the connector writes up to 1000 records per file to ephemeral disk storage and uploads them out of order to GCP; it must then wait for a batch of files before committing an offset. If the connector crashes and restarts with a new configuration that sets 250 records per file (due to an optimization we introduced), the existing files would be rewritten with the new record count. However, by including a timestamp in the filename after the last '/', the files are not rewritten; instead, new files with different timestamps are created, which can cause duplicates that then have to be worked around with some form of idempotency.
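The duplication scenario above can be sketched as follows (hypothetical blob names and offset ranges, assuming the restart re-flushes part of an already-uploaded offset range under a new timestamp; this is an illustration, not the connector's actual behavior):

```python
# Each upload is (offset_range, timestamp); the timestamp sits after the
# last '/', so re-flushing the same range creates a NEW blob rather than
# overwriting the old one.
uploads = [
    ((0, 999), "2024-09-03T10:00:00"),    # first run: 1000 records/file
    ((0, 249), "2024-09-03T10:05:00"),    # after restart: 250 records/file
    ((250, 499), "2024-09-03T10:05:01"),
]

blobs = {f"dt=2024/09/03/{ts}.avro": rng for rng, ts in uploads}

# Downstream idempotency: deduplicate by record offset, not by blob name,
# so overlapping blobs do not yield duplicate records.
seen = set()
deduped = []
for rng in blobs.values():
    for off in range(rng[0], rng[1] + 1):
        if off not in seen:
            seen.add(off)
            deduped.append(off)

print(len(blobs), "blobs,", len(deduped), "unique records")
```

The three blobs overlap on offsets 0-499, but deduplicating on the offset keeps each record exactly once.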