I have a question regarding the filename configuration in the GCP Storage Sink Connector.
According to the documentation (GCP Storage | Lenses Docs), the blob name can be customized in several ways. For the query:
```
INSERT INTO [bucketName]:dir1/dir2
SELECT * FROM [topicName]
PARTITIONBY _header.dt, _header.event_id
STOREAS `AVRO`
PROPERTIES(...)
```
The resulting filename is structured as:

```
dir1/dir2/dt=2024/09/03/eventId=123/topicName(partition_topicOffset).ext
```

We would like to change the last path element to a format such as `date(optional(partition_topicOffset)).ext` instead of `topicName(partition_topicOffset).ext`.
Could you confirm if this is supported and, if so, what configuration options we should specify?
Hi @Umar_Cheema,
The docs do indeed describe how the blob key is set, but the key name format is not configurable at the moment, so what you're asking for is not currently possible. We would appreciate it if you could share your use case with us.

A word of caution on the format you propose: if you use only the date and there is more than one flush within a day, each flush overwrites the same blob, which leads to data loss unless object versioning is enabled. Even if you use the tuple date + partition + offset, when the connector reads from a multi-partition topic there is still a chance that data from different partitions generates the same blob key, again leading to data loss.
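To make the collision risk concrete, here is a minimal sketch (hypothetical key formats for illustration only, not actual connector configuration) showing how a date-only key, and a date + offset key without the partition, can both map different batches to the same blob name:

```python
from datetime import date

# Simulated flushed batches from a two-partition topic:
# (partition, last offset in the batch)
batches = [(0, 100), (1, 100), (0, 200)]

def date_only_key(partition, offset):
    # Date-only key: every flush on the same day maps to the same blob,
    # so each flush overwrites the previous one.
    return f"dt={date(2024, 9, 3).isoformat()}/data.avro"

def date_offset_key(partition, offset):
    # Date + offset without the partition: partitions 0 and 1 can reach
    # the same offset and therefore produce the same blob key.
    return f"dt={date(2024, 9, 3).isoformat()}/{offset}.avro"

for fn in (date_only_key, date_offset_key):
    keys = [fn(p, o) for p, o in batches]
    print(fn.__name__, "collisions:", len(keys) - len(set(keys)))
```

With versioning disabled, each collision above means an earlier blob is silently overwritten.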
Let's say the connector writes up to 1000 records per file to ephemeral disk storage and uploads them out of order to GCP; it must then wait for a batch of files before committing an offset. If the connector crashes and restarts with a new configuration that sets 250 records per file (due to an optimization we introduced), the existing files would be rewritten with the new record count. However, by including a timestamp in the filename after the last '/', the files are not rewritten; instead, new files with different timestamps are created, which can cause duplicates that then have to be worked around with some form of idempotency.
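The duplication scenario above can be sketched as follows (hypothetical blob names and offset ranges, assuming the restart re-flushes part of an already-uploaded offset range under a new timestamp; this is an illustration, not the connector's actual behavior):

```python
# Each upload is (offset_range, timestamp); the timestamp sits after the
# last '/', so re-flushing the same range creates a NEW blob rather than
# overwriting the old one.
uploads = [
    ((0, 999), "2024-09-03T10:00:00"),    # first run: 1000 records/file
    ((0, 249), "2024-09-03T10:05:00"),    # after restart: 250 records/file
    ((250, 499), "2024-09-03T10:05:01"),
]

blobs = {f"dt=2024/09/03/{ts}.avro": rng for rng, ts in uploads}

# Downstream idempotency: deduplicate by record offset, not by blob name,
# so overlapping blobs do not yield duplicate records.
seen = set()
deduped = []
for rng in blobs.values():
    for off in range(rng[0], rng[1] + 1):
        if off not in seen:
            seen.add(off)
            deduped.append(off)

print(len(blobs), "blobs,", len(deduped), "unique records")
```

The three blobs overlap on offsets 0-499, but deduplicating on the offset keeps each record exactly once.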