I’m noticing that the S3 source connector is not actually picking up all of the data in the bucket for some reason. It seems nearly half of it is ignored and never makes it into the topic.
An example file in the bucket is named 2024/49/prod-events-1-2024-10-04-00-00-00-30c458b5-f08d-4ca1-98a7-6583e7dfe46b, and each file can contain a few hundred JSON events separated by newlines.
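To illustrate the format (the field names here are made up, just to show the newline-delimited layout), the body of one of these files looks something like:

{"eventId": "a1b2c3d4", "type": "click", "timestamp": "2024-10-04T00:00:00Z"}
{"eventId": "e5f6a7b8", "type": "view", "timestamp": "2024-10-04T00:00:01Z"}
...a few hundred more lines like this...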
our config:
{
"connector.class": "io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector",
"connect.s3.source.partition.search.interval": "60000",
"connect.s3.source.partition.search.continuous": "true",
"connect.s3.kcql": "insert into event_pipeline.raw select * from ourbucketname STOREAS `json`",
"tasks.max": "10",
"connect.s3.aws.auth.mode": "Default",
"name": "events-s3-source",
"connect.s3.aws.region": "us-east-1",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"connect.s3.source.partition.search.recurse.levels": "1"
}
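For reference, this is the body we submit to the Kafka Connect REST API when creating/updating the connector (the worker host and port below are just placeholders for ours):

curl -X PUT -H "Content-Type: application/json" \
  --data @events-s3-source.json \
  http://localhost:8083/connectors/events-s3-source/config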
The second directory level in the full structure is where I try to partition things into 60 possible prefixes, 00-59; it's simply the seconds value of the timestamp at the moment the event is created. Kinesis Firehose sets this prefix. My thought was to use it to scale out to multiple tasks (currently 10), so I think I'm comfortably over-provisioned for this setup. We're only talking about about 15k events per minute.
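For context, the bucket layout as I understand it looks roughly like this (I'm reading the first prefix as the year and the second as the 00-59 seconds prefix that Firehose adds; everything except the example path from above is illustrative):

ourbucketname/
  2024/
    00/prod-events-1-2024-10-04-00-12-07-<uuid>
    01/prod-events-1-2024-10-04-03-45-33-<uuid>
    ...
    49/prod-events-1-2024-10-04-00-00-00-30c458b5-f08d-4ca1-98a7-6583e7dfe46b
    ...
    59/prod-events-1-2024-10-04-18-22-02-<uuid>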