S3, GCS and Azure Datalake source connectors are ignoring objects

stheppi · October 22, 2024, 10:09am

I’m noticing that the S3 source connector is not picking up all of the data in the bucket. Nearly half of it seems to be ignored and never makes it into the topic.

An example object in the bucket is named 2024/49/prod-events-1-2024-10-04-00-00-00-30c458b5-f08d-4ca1-98a7-6583e7dfe46b. Each file can contain a few hundred JSON events, one per line.
Our config:

{
  "connector.class": "io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector",
  "connect.s3.source.partition.search.interval": "60000",
  "connect.s3.source.partition.search.continuous": "true",
  "connect.s3.kcql": "insert into event_pipeline.raw select * from ourbucketname STOREAS `json`",
  "tasks.max": "10",
  "connect.s3.aws.auth.mode": "Default",
  "name": "events-s3-source",
  "connect.s3.aws.region": "us-east-1",
  "value.converter": "org.apache.kafka.connect.storage.StringConverter",
  "connect.s3.source.partition.search.recurse.levels": "1"
}

The second directory level in the path is where I split objects into 60 possible partitions, 00-59. It is simply the seconds value of the timestamp at which the event was created, and Kinesis Firehose sets this prefix. My thinking was to use it to scale out to multiple tasks (currently 10), so I believe this setup is more than adequately provisioned. We’re only talking about 15k events per minute.

stheppi · October 22, 2024, 10:12am

It looks like your configuration is using the default bucket object ordering, which is alphanumeric. This is generally the most efficient approach, as it takes advantage of S3’s built-in ability to list objects from a watermark in lexicographic order. However, if your objects arrive at different times, any object whose name sorts before the last processed watermark will be skipped.
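To illustrate the skipping behaviour, here is a minimal simulation. It is not the connector's actual code: the `poll_after` helper and the object keys are hypothetical, and the sketch only assumes that each poll returns keys sorting strictly after the last processed key.

```python
def poll_after(bucket_keys, watermark):
    """Hypothetical listing: keys that sort strictly after the watermark."""
    return sorted(k for k in bucket_keys if watermark is None or k > watermark)


def simulate():
    bucket = set()
    processed = []
    watermark = None

    # First poll: two objects already exist under prefixes 10/ and 49/.
    bucket.update({"2024/10/events-a", "2024/49/events-b"})
    for key in poll_after(bucket, watermark):
        processed.append(key)
        watermark = key  # watermark is now "2024/49/events-b"

    # Two objects arrive later. One sorts before the watermark, one after.
    bucket.add("2024/07/events-c")
    bucket.add("2024/52/events-d")
    for key in poll_after(bucket, watermark):
        processed.append(key)
        watermark = key

    skipped = sorted(bucket - set(processed))
    return processed, skipped


processed, skipped = simulate()
print("processed:", processed)
print("skipped:  ", skipped)
```

In this sketch the object under 2024/07/ is never read, because its name sorts before the watermark left by the first poll. Prefixes that cycle through 00-59 would hit this case repeatedly.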

From the object names you’re using, it seems this might be happening in your case. To resolve this, you should change the ordering by setting connect.s3.ordering.type to LastModified. With this setting, objects are processed in order of their last-modified timestamps rather than their names. You can find more details in our documentation here.

Please note that using LastModified requires the connector to list all objects and sort them by their modification time, which impacts performance. That cost increases as the number of objects in the bucket grows.
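The trade-off can be sketched roughly as follows. This is an illustration under assumptions, not the connector's implementation: `poll_by_last_modified`, the in-memory `listing` dict, and the keys and timestamps are all invented for the example.

```python
from datetime import datetime, timedelta


def poll_by_last_modified(listing, watermark):
    """Hypothetical poll: sort the full listing by (modified time, key).

    Returns the entries newer than the watermark and the number of
    objects that had to be inspected to find them.
    """
    ordered = sorted((ts, key) for key, ts in listing.items())
    fresh = [item for item in ordered if watermark is None or item > watermark]
    return fresh, len(ordered)


t0 = datetime(2024, 10, 4, 0, 0, 0)
listing = {
    "2024/10/events-a": t0,
    "2024/49/events-b": t0 + timedelta(seconds=30),
}

fresh, inspected = poll_by_last_modified(listing, None)
first_batch = [key for _, key in fresh]
watermark = fresh[-1]  # (timestamp, key) of the newest processed object

# A late object whose name sorts before the earlier keys is still picked up,
# because ordering is by modification time rather than by name.
listing["2024/07/events-c"] = t0 + timedelta(seconds=60)
fresh, inspected = poll_by_last_modified(listing, watermark)
second_batch = [key for _, key in fresh]

print(first_batch, second_batch, inspected)
```

Note that the second poll inspects all three objects to find the single new one, which is the performance cost described above.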
