(reposting discussion from Slack)
I’m noticing that the S3 source connector (version 7.4.4) is not actually grabbing all the data in the bucket: nearly half of it is ignored and never makes it into the topic. The connector is part of a PoC with a mirrored streaming pipeline, where one branch feeds our legacy system and the other our replacement system. Comparing events between the two, we see a large discrepancy. No idea how long this has been going on; I suspect it has been happening since day 1 of turning it on.
An example object key in the bucket: 2024/49/prod-events-1-2024-10-04-00-00-00-30c458b5-f08d-4ca1-98a7-6583e7dfe46b
Each such file can contain a few hundred JSON events separated by newlines. Our config:
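For reference, here's a minimal sketch (assuming plain newline-delimited JSON, which is the layout described above; the sample events and field names are invented for illustration) of how one of these objects splits into individual events:

```python
import json

def parse_ndjson(body: str) -> list[dict]:
    """Split an S3 object body into individual JSON events, one per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]

# Hypothetical two-event object body, mimicking the files in the bucket
sample = '{"event_id": "a1", "type": "click"}\n{"event_id": "b2", "type": "view"}\n'
events = parse_ndjson(sample)
print(len(events))  # → 2
```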
{
  "name": "events-s3-source",
  "connector.class": "io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector",
  "connect.s3.kcql": "insert into event_pipeline.raw select * from ourbucketname STOREAS `json`",
  "connect.s3.source.partition.search.interval": "60000",
  "connect.s3.source.partition.search.continuous": "true",
  "connect.s3.source.partition.search.recurse.levels": "1",
  "connect.s3.aws.auth.mode": "Default",
  "connect.s3.aws.region": "us-east-1",
  "value.converter": "org.apache.kafka.connect.storage.StringConverter",
  "tasks.max": "10"
}
The second directory level in the key is how I partition objects into 60 possible partitions (00-59): it's simply the seconds value of the timestamp when the event is created, and Kinesis Firehose sets this prefix. My thought was to use this to scale out to multiple tasks (currently 10), so I think I'm comfortably over-provisioned for this setup. We're only talking about 15k events per minute.
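To illustrate the key scheme (the function name here is mine, not from our actual pipeline): the second path segment is just the seconds field of the event's creation timestamp, so objects spread across at most 60 prefixes:

```python
from datetime import datetime, timezone

def key_prefix(created: datetime) -> str:
    """Build the 'year/second' prefix described above from an event timestamp."""
    return f"{created:%Y}/{created:%S}/"

# An event created at second 49 lands under the 2024/49/ prefix
ts = datetime(2024, 10, 4, 0, 0, 49, tzinfo=timezone.utc)
print(key_prefix(ts))  # → 2024/49/
```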
I’ve verified via Athena (searching for missing event IDs) that the S3 bucket holds the same data as the legacy flow.
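The reconciliation itself is just a set difference; a hypothetical sketch of what the Athena comparison amounts to (the IDs are invented for illustration):

```python
def missing_ids(source_ids: set[str], topic_ids: set[str]) -> set[str]:
    """Event IDs present on the S3 side but never seen in the Kafka topic."""
    return source_ids - topic_ids

# Invented sample IDs: four events landed in S3, only two reached the topic
s3_side = {"a1", "b2", "c3", "d4"}
kafka_side = {"a1", "c3"}
print(sorted(missing_ids(s3_side, kafka_side)))  # → ['b2', 'd4']
```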
We’re running Kafka Connect 3.5 on Kubernetes; this connector has a dedicated cluster of 3 worker nodes, each with plenty of resources (4 CPUs, 5 GB of memory), and it’s not using anywhere near those limits.