Design an AWS Pipeline to Ingest Millions of Records into DynamoDB

Sudheer Kumar
Nov 27, 2023

Problem Statement: You are receiving data in an S3 bucket in CSV format, and it needs to be ingested into a DynamoDB table in the most efficient way in terms of time, cost, and resources. This is not a one-time task; it should be repeatable, with error handling and so on.

For data ingestion, AWS has the following high-level options:

  • AWS Data Pipeline — This tool is being deprecated as per AWS support and is not available in many regions, hence it is ruled out.
  • AWS EMR — Amazon EMR (Elastic MapReduce) is a cloud-based big data platform for processing large datasets using popular frameworks such as Apache Spark and Hadoop. EMR is powerful and flexible, but for a straightforward CSV-to-DynamoDB load it requires considerably more cluster configuration and management than AWS Glue, hence it is ruled out.
  • AWS Glue — Glue is a serverless ETL (Extract, Transform, Load) service and is an option for us.
  • AWS Lambda and an API endpoint: A Lambda will call a custom API endpoint that does batch inserts of the data from the CSV.

Since the first two options are ruled out, let's look into the third and fourth in detail.

AWS Glue

  • Architecture: We will have a Lambda that is triggered when a new file arrives in the S3 bucket. To process the file, the Lambda will start the AWS Glue job; the Lambda cannot wait for the Glue job to finish (a sketch of this triggering Lambda follows the Glue script below). There is an option to configure another Lambda to run when the Glue job finishes, which has to perform the post-processing operations, such as updating a table with the processing status and moving the processed file to a "complete" or "error" folder in the bucket.
  • Glue Job: The Glue DynamoDB connector writes records individually rather than giving you control over batching, but the job parallelizes the writes across Spark partitions, which improves processing speed. The lack of explicit batch writing can, however, increase network traffic. The steps involved in Glue are:
  • 1. Create a Crawler for S3 Bucket: Create a Glue Crawler that points to your S3 bucket where the CSV files are stored. The crawler will read the CSV files and infer a schema which will be stored in the Glue Data Catalog.
  • 2. Create a Glue Job: Create a Glue ETL job that reads from the source (S3 bucket), transforms the data if necessary, and then writes to the target (DynamoDB table). A high-level pseudocode of what the Glue job script might look like:
# Import necessary libraries
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Initialize the GlueContext and the job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the CSV data that the crawler catalogued from the S3 bucket
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="s3_database",
    table_name="s3_table"
)

# Write to DynamoDB. Glue distributes the write across Spark partitions,
# so records are loaded in parallel; dynamodb.throughput.write.percent
# controls how much of the table's write capacity the job may consume.
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "your_dynamodb_table",
        "dynamodb.throughput.write.percent": "1.0"
    }
)

job.commit()
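For reference, here is a minimal sketch of the triggering Lambda described above; the Glue job name and argument names are placeholders, not the actual project's values:

import boto3

# Glue client used to start the ETL job
glue = boto3.client("glue")

def lambda_handler(event, context):
    # The S3 event notification carries the bucket and key of the new CSV file
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Start the Glue job asynchronously; this Lambda does not wait for it to finish
    response = glue.start_job_run(
        JobName="csv-to-dynamodb-job",  # hypothetical job name
        Arguments={"--s3_bucket": bucket, "--s3_key": key},
    )
    return {"JobRunId": response["JobRunId"]}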

Other important aspects to consider when designing the Glue job are:

  • Ingestion Concurrency and Identity: Are you going to receive multiple files to be ingested concurrently? Will the ingestion always be an INSERT operation, or will it involve UPDATEs as well? Will there be conflicts in data between the batches (updates to the same record appearing in multiple concurrent batches)? Note that AWS Glue is primarily designed for ETL (Extract, Transform, Load) operations, and it doesn't natively support updating existing records in a target table; it is more geared towards inserting new records or replacing the entire dataset. If your target is RDS, you can use a staging table and merge the records in a post-processing step. If the target is DynamoDB, you can use a custom API.
  • Error Handling: If any error occurs in the triggering Lambda, it can post a message to an SNS topic, which can be routed to email or Slack channels. If an error occurs in the AWS Glue job, you will learn about it in the callback Lambda configured above, which has to take action accordingly.
  • Retrying on Error: Say you are ingesting a 4-million-record CSV file and it fails in the middle. AWS Glue does not natively support resuming a job from where it left off in the event of a failure, so you either have to reprocess the whole file (and make the writes idempotent) or track progress yourself.

Now let us explore the next option.

AWS Lambda and Custom API Endpoint

  • Architecture: We will have a Lambda that is triggered when a new file arrives in the S3 bucket. To process the file, we pass the S3 file path to a custom API endpoint.
  • API Design: The API will read the file using an S3 stream and do batch writes to DynamoDB. The following aspects have to be considered here:
  • 1. Concurrency: We are using an S3 stream to read the file because it could be huge, and there is no way to read a single stream concurrently. The alternative for concurrency is to split the input CSV into chunks and process them in parallel.
  • 2. Error Handling: If any error occurs in the triggering Lambda, it can post a message to an SNS topic, which can be routed to email or Slack channels. If an error occurs in the API while processing the file, it should report the failure in the same way and mark the batch as failed so it can be investigated.
  • 3. Batch Insert: The advantages of batch insert are listed below (a sketch of this flow follows this list):
  • 3.1 Reduced Network Overhead: With BatchWriteItem, you can write up to 25 items in a single operation, which reduces the network overhead compared to making 25 separate PutItem calls.
  • 3.2 Performance: BatchWriteItem is often faster than individual PutItem calls, especially for larger batches, because it reduces the time spent on network round trips.
  • 3.3 Partial Failures: Note that BatchWriteItem is not atomic. Individual puts within a batch can fail, and DynamoDB returns them in UnprocessedItems, which your code (or the SDK's batch writer) must retry. If you need all-or-nothing semantics across items, use TransactWriteItems instead.
  • 3.4 Limits: A batch can contain at most 25 records, each item can be at most 400 KB, and a whole batch request can be at most 16 MB. Note that there is no cost difference between inserting in a batch and inserting records individually, but batching saves time as well as network traffic.
  • 4. Retrying on Error: Before writing each batch, persist how far you have got in a durable store, such as another table. To retry, you manually provide the S3 path and the last record inserted (through a Lambda environment variable), and the API can resume the task where it stopped.
  • Development and deployment tooling: We chose the Serverless Framework for the Lambda (it automates wiring up the S3 trigger and takes care of deployment to different environments such as dev and preprod), and our custom APIs are deployed on AWS ECS.
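To make the batch-insert flow concrete, here is a rough sketch (assuming boto3, a CSV with batchId and email columns, and placeholder bucket/table names; not the actual API implementation):

import codecs
import csv

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("your_dynamodb_table")  # placeholder table name

def load_csv(bucket: str, key: str) -> int:
    """Stream a CSV file from S3 and batch-write its rows to DynamoDB."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    # Decode the S3 stream line by line instead of loading the whole file into memory
    reader = csv.DictReader(codecs.getreader("utf-8")(body))

    written = 0
    # batch_writer groups puts into BatchWriteItem calls (up to 25 items)
    # and retries any UnprocessedItems returned by DynamoDB
    with table.batch_writer() as batch:
        for row in reader:
            # Any other CSV columns would be added here as extra attributes
            batch.put_item(Item={"batchId": row["batchId"], "email": row["email"]})
            written += 1
            # A real implementation would checkpoint `written` periodically
            # so a failed run can resume, as described above
    return written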

DynamoDB Table Design

DynamoDB is designed to handle large amounts of data and traffic. However, there are a few things to consider:

  • Hot Partitions: DynamoDB distributes data and traffic for a table over multiple partitions based on the partition key value; hence, a large number of writes or reads of items with the same partition key value can lead to “hot” partitions, where one partition receives a higher volume of read/write requests than others. This can potentially throttle your requests.
  • Item Collection Size: When a table has a local secondary index, DynamoDB limits each item collection (all items sharing the same partition key value, plus their index entries) to 10 GB. If your ID could have more than 10 GB of associated emails, you would exceed this limit.
  • Query Patterns: Your PK and SK have to be decided based on the query patterns. In our case we had a batchId with millions of records per batch, and we access a record using the batchId and emailId.
  • In Amazon DynamoDB, all items with the same partition key (PK) are stored together, in sorted order by the sort key (if one exists), on the same partition.
  • DynamoDB uses the partition key’s value to determine which partition to store the data. This design allows for efficient querying and scalability. However, it’s important to design your partition key and sort key in a way that the data is evenly distributed across various partitions to avoid potential throttling due to “hot” partitions.

If a PK's item collection exceeds 10 GB, what will happen? On a table with a local secondary index, if the total size of all items with the same partition key value exceeds 10 GB, you will not be able to write any more data under that partition key value, because DynamoDB limits each item collection to 10 GB.

This includes all of the items with the same partition key, their attributes, and any local secondary indexes. If you try to add more data to a partition key that already contains 10 GB of data, the write operation will fail with an exception.

To avoid this, it’s important to design your table in a way that distributes data evenly across different partition keys. If you expect a partition key to have more than 10 GB of data, you might need to introduce some form of sharding or use a different attribute as your partition key.
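One common way to do that sharding, sketched below with an arbitrarily chosen shard count, is to append a calculated suffix to the partition key so one logical key is spread over several physical ones:

import hashlib

NUM_SHARDS = 10  # assumption: tune this based on the expected volume per batchId

def sharded_pk(batch_id: str, email: str) -> str:
    """Spread one logical batchId over NUM_SHARDS physical partition keys."""
    # Hash the email so the same record always lands on the same shard
    shard = int(hashlib.md5(email.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS
    return f"{batch_id}#{shard}"

# Writes use sharded_pk(batch_id, email) as the partition key;
# reading a whole batch means querying each of the NUM_SHARDS keys and merging the results.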

Now let us design the DynamoDB table for a scenario where you have a batchId and each batchId has millions of records.

Access Pattern 1: Check whether a record exists for a given batchId and emailId

Access Pattern 2: Get all the emailIds for a given batchId

Design 1:

PK: batchId, SK: email

This design works nicely with Access Pattern 1 and Access Pattern 2. For Access Pattern 2, you can easily fetch all items with a specific partition key (batchId). However, if you ever need to fetch data based on the email alone, you would need to scan the entire table or create a Global Secondary Index (GSI) with email as the partition key.

But there is one problem with this design: if the total size of a batch, including local secondary indexes, ever crosses 10 GB, you cannot insert any more records into that partition.
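Setting that caveat aside, a quick sketch of how Design 1 serves both access patterns (assuming boto3 and placeholder table and key values):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("your_dynamodb_table")  # placeholder

# Access Pattern 1: does a record exist for this batchId + email?
resp = table.get_item(Key={"batchId": "batch-123", "email": "a@example.com"})
exists = "Item" in resp

# Access Pattern 2: all emails for a batch (a real caller would page with LastEvaluatedKey)
resp = table.query(KeyConditionExpression=Key("batchId").eq("batch-123"))
emails = [item["email"] for item in resp["Items"]]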

Design 2:

PK: email, SK: batchId

This design serves Access Pattern 1 well, but it fails to serve Access Pattern 2. To support Access Pattern 2 in this case, we can define a Global Secondary Index on batchId to avoid a table scan when fetching the emails for a specific batchId.

Switching the partition key (PK) to email and the sort key (SK) to batchId also helps avoid exceeding the 10 GB limit for a single partition key, because the data is now spread across millions of email values instead of a handful of batchIds.
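A matching sketch for Design 2, assuming a hypothetical GSI named batchId-index with batchId as its partition key:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("your_dynamodb_table")  # placeholder

# Access Pattern 1: direct key lookup on the base table
resp = table.get_item(Key={"email": "a@example.com", "batchId": "batch-123"})
exists = "Item" in resp

# Access Pattern 2: query the GSI so we avoid a full table scan
resp = table.query(
    IndexName="batchId-index",  # hypothetical GSI name
    KeyConditionExpression=Key("batchId").eq("batch-123"),
)
emails = [item["email"] for item in resp["Items"]]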

I have an ID column as PK; is there any difference if I store it as a Number or a String?

In DynamoDB, both Number and String can be used as a partition key (PK). However, there are some differences that might influence your choice:

  1. Sort Order: Number and String types are sorted differently. Numbers are sorted numerically, while strings are sorted lexicographically (alphabetically). This can affect the order in which items are returned in a query.
  2. Storage Size: The storage size of Number and String types can be different. For example, the number 12345678901234567890 takes less space when stored as a Number than when it’s stored as a String.
  3. Querying: When querying, you must use the correct data type. If your application logic is easier with one type over the other, it might influence your choice.

So you can see that the design of a DynamoDB table depends entirely on the access patterns.

Cost Analysis

Does it cost the same if I insert using BatchWriteItem or as individual records?

Yes, there is no difference in price if you use batching for inserting. Batching only helps reduce network traffic and slightly improves insertion speed.

This is because, even in a batch, the individual records are charged one by one. Writes are billed in WCUs (Write Capacity Units), and one WCU covers one write per second of an item up to 1 KB.

Will I be charged 1 WCU, if my single record is only 100 bytes?

Yes. Any item of 1 KB or less consumes 1 WCU per write.
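To make the WCU arithmetic concrete, here is a rough back-of-the-envelope calculation (the record size, count, and capacity figures are purely illustrative):

import math

RECORD_SIZE_BYTES = 100       # illustrative: each CSV row becomes a ~100-byte item
RECORDS = 4_000_000           # illustrative: one large input file
PROVISIONED_WCU = 1_000       # illustrative: provisioned write capacity

# Every write of an item up to 1 KB consumes 1 WCU, so a 100-byte item still costs 1 WCU
wcu_per_record = math.ceil(RECORD_SIZE_BYTES / 1024)
total_writes = RECORDS * wcu_per_record

# At a steady 1,000 WCU the table absorbs 1,000 of these writes per second
seconds = total_writes / PROVISIONED_WCU
print(f"{total_writes} write units, ~{seconds / 60:.0f} minutes at {PROVISIONED_WCU} WCU")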

When is it advantageous to use provisioned capacity rather than auto scaling for DynamoDB?

Provisioned capacity for DynamoDB is advantageous in the following scenarios:

  1. Predictable Workloads: If your application has predictable traffic patterns and you know your read and write capacity requirements, provisioned capacity can be more cost-effective.
  2. Cost Control: With provisioned capacity, you pay for a specified amount of read and write capacity. This can help control costs as you know exactly what your bill will be based on the capacity you’ve provisioned.
  3. Avoid Throttling: If your application has large, sudden bursts of read or write requests, auto scaling might not be able to scale up quickly enough to meet demand, resulting in throttled requests. With provisioned capacity, you can ensure that you have enough capacity to handle these bursts.

Will there be data loss when I use provisioned capacity for a DynamoDB table and throttling happens?

No, there won’t be data loss when throttling happens in DynamoDB. Throttling occurs when read or write request rates exceed the provisioned capacity. When a request is throttled, it fails with an HTTP 400 code (Bad Request) and a ProvisionedThroughputExceededException.

AWS SDKs have built-in support for automatic retries and exponential backoff, which means they will automatically retry throttled requests. If a request still fails after those retries, the client has to retry it itself.

It’s important to handle these exceptions in your application code and implement proper retry mechanisms to ensure the requests eventually succeed when capacity becomes available.
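As a sketch of that handling with boto3 (the table name and item values are placeholders), you can raise the SDK's retry budget and still catch the throttling exception as a last resort:

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Increase retry attempts and let botocore adapt its request rate under throttling
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
table = boto3.resource("dynamodb", config=config).Table("your_dynamodb_table")

try:
    table.put_item(Item={"batchId": "batch-123", "email": "a@example.com"})
except ClientError as err:
    if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
        # SDK retries were exhausted; re-queue the record or surface the failure
        raise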

Conclusion:

High-volume data ingestion needs careful design, and you should know the behaviour of each component/service in the system. You should also consider aspects such as speed, pricing, error handling, and retry options.

Sudheer Kumar

Experienced Cloud Architect, Infrastructure Automation, Technical Mentor