So you want to move some data from A to B in AWS
December 14, 2022
At my day job I’ve recently been thrown head-first into designing, building, and improving an AWS setup.
One of our requirements was to have a script in ECS run against some JSON files when these files are either uploaded or updated within S3. Simple enough.
Given that AWS has a thousand ways of doing anything, I assumed there was probably an easy way to expose S3 to ECS as some sort of filesystem. It turns out there is, but only if you are running ECS on EC2 rather than Fargate. Coming from a heavy Docker and Kubernetes background, I figured there must be some way to expose a “volume” to ECS, and indeed there is: AWS EFS.
AWS EFS is a managed network file system (NFS, not object storage like S3) that, simply put, lets you attach a shared filesystem to an ECS task.
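The wiring happens in the ECS task definition: you declare an EFS-backed volume and mount it into the container. A minimal sketch of the relevant fragment, in the shape boto3's `ecs.register_task_definition` expects (the file system ID, image name, and paths are placeholders):

```python
# Sketch of the EFS-related pieces of an ECS task definition.
# "fs-12345678", "my-worker:latest" and "/mnt/data" are placeholders.
task_definition_fragment = {
    "volumes": [
        {
            "name": "shared-data",
            "efsVolumeConfiguration": {
                "fileSystemId": "fs-12345678",    # your EFS file system
                "rootDirectory": "/",
                # Recommended; required if you use IAM authorization
                # for the mount.
                "transitEncryption": "ENABLED",
            },
        }
    ],
    "containerDefinitions": [
        {
            "name": "worker",
            "image": "my-worker:latest",
            "mountPoints": [
                {"sourceVolume": "shared-data", "containerPath": "/mnt/data"}
            ],
        }
    ],
}
```

With this in place, the container sees the EFS contents as a normal directory at `/mnt/data`.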
Once I figured this out, the next issue was: how do I get my data from S3 to EFS? I thought this would be a good use case for Lambda and in the course of my reserach I discovered AWS DataSync.
DataSync offers a managed way to sync data within AWS. It supports multiple sources and destinations and can even sync data in and out of AWS. Both S3 and EFS are supported, so this looked like the perfect solution to our problem.
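Kicking off a sync is a single API call against a pre-configured DataSync task. A hedged sketch using the boto3 DataSync client (the task ARN is a placeholder, and the client is passed in as a parameter so it can be stubbed out in testing):

```python
def start_sync(datasync_client, task_arn):
    """Start a DataSync task execution and return its execution ARN.

    datasync_client is expected to behave like boto3.client("datasync");
    task_arn is the ARN of an already-configured S3 -> EFS DataSync task.
    """
    response = datasync_client.start_task_execution(TaskArn=task_arn)
    return response["TaskExecutionArn"]
```

In real use you would call `start_sync(boto3.client("datasync"), task_arn)`; the source and destination (S3 location, EFS location) live on the DataSync task itself, not in this call.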
After multiple false starts, I got DataSync configured and working. The next issue: how do I trigger DataSync when the S3 upload happens?
My research led me to S3 Bucket Notifications - you can configure an S3 bucket to notify a Lambda when its contents change. This resulted in the following stack of dominos setup:
S3 Bucket -> S3 Bucket Notification -> Lambda -> Lambda uses AWS API to trigger DataSync and wait for completion -> DataSync Task to copy from S3 to EFS -> Lambda uses AWS API to trigger ECS Task -> ECS Task.
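The Lambda end of that chain is mostly plumbing: pull the bucket and key out of each notification record, then call the DataSync and ECS APIs. A sketch of the event-parsing half, assuming the standard S3 notification payload shape (note that S3 URL-encodes object keys, so they need decoding):

```python
import urllib.parse

def changed_objects(event):
    """Extract (bucket, key) pairs from an S3 bucket notification event.

    S3 URL-encodes object keys in the notification payload (spaces
    arrive as '+'), so we decode each key before using it.
    """
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs
```

A Lambda handler would call this on its `event` argument and then start the DataSync task for the affected objects.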
This setup worked… but not too well. It had a number of disadvantages:
- The S3 bucket notification fires for every changed file, and the user was re-uploading all of their files, even the unchanged ones.
This triggered a flurry of Lambdas. I worked around it by limiting the Lambda’s concurrency to 1, and did some research into buffering and consuming bucket notifications through something like SQS.
- The Lambda spent a LONG time waiting.
DataSync tasks take on the order of several minutes to start, even for tiny amounts of data (we were moving around 50KB). The Lambda therefore had to sit there polling the API to check whether the DataSync task had completed yet, all time we were paying for. There was also no easy way to surface this status to our end user, who had no idea whether the process had finished or even started, and would likely retry their upload, exacerbating the problem.
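The waiting itself was just a poll loop against `describe_task_execution`. A sketch of what the Lambda was doing, with the client and sleep function injected so the loop can be exercised without AWS (the statuses follow the documented DataSync execution lifecycle, which ends in SUCCESS or ERROR):

```python
import time

def wait_for_datasync(datasync_client, execution_arn,
                      poll_seconds=5, timeout_seconds=900, sleep=time.sleep):
    """Poll a DataSync task execution until it reaches a terminal status.

    datasync_client is expected to behave like boto3.client("datasync").
    Returns the terminal status ("SUCCESS" or "ERROR"); raises
    TimeoutError if the execution is still running after the timeout.
    """
    waited = 0
    while True:
        status = datasync_client.describe_task_execution(
            TaskExecutionArn=execution_arn)["Status"]
        if status in ("SUCCESS", "ERROR"):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"DataSync execution still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Every pass through that loop is billed Lambda time, which is exactly the cost problem described above.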
On paper DataSync was the perfect solution, but increasingly it felt like I was just trying to convince myself that nothing better existed.