Achieving Data Agility With The Combined Strengths of AWS and Confluent

Aaron Lieberman
8 min read · Aug 30, 2021

To create a genuinely reliable and highly scalable platform in the cloud, companies often find they must combine technologies to meet their real-time demands. Many businesses struggle with real-time streaming because it is challenging to implement elegantly while still meeting requirements for effective data transformation and data movement.

To create a cloud-based powerhouse, you need to combine tools that can complement each other’s strengths. One of our favorite combinations is AWS and Confluent. AWS is the standard cloud provider for scalability and flexibility, especially for those using AWS serverless services. Confluent is best-of-breed for messaging, streaming, and real-time needs. Together, they create a synergy that helps businesses respond to their data agility needs.

Strengths of AWS

AWS services combined with one another can form a solid solution on their own. Pairing them with Confluent, though, significantly extends and enhances what AWS can do.

DATA TRANSFORMATIONS (ESPECIALLY AT SCALE)

AWS excels at running large data transformations at scale and in parallel.

Amazon EMR, AWS Batch, and AWS Lambda are all suitable for this, each with its own purpose.

  • Amazon EMR can be used if you want to dedicate a cluster to large data transformations
  • AWS Batch can be used for large batch processes that scale up and down automatically
  • AWS Lambda can be used to execute custom data transformations in parallel and on demand (see the sketch after this list)
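
To make the Lambda option concrete, here is a minimal sketch of a custom transformation handler. The event shape and the "name" field are hypothetical stand-ins for your own payload; the handler interface comes from the aws-lambda-java-core library.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.HashMap;
import java.util.Map;

// A minimal sketch of a custom data transformation in AWS Lambda (Java runtime).
// AWS runs many copies of this handler in parallel as events arrive.
public class TransformHandler implements RequestHandler<Map<String, Object>, Map<String, Object>> {

    @Override
    public Map<String, Object> handleRequest(Map<String, Object> event, Context context) {
        Map<String, Object> transformed = new HashMap<>();
        // Example transformation: normalize a hypothetical "name" field
        // and stamp the processing time.
        Object name = event.get("name");
        transformed.put("name", name == null ? null : name.toString().trim().toLowerCase());
        transformed.put("processedAt", System.currentTimeMillis());
        return transformed;
    }
}
```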

Confluent is an excellent partner for AWS data transformation processes as an inbound or outbound connection point. If the source system, such as Salesforce, connects easily to Confluent through a pre-built Confluent Source Connector, Confluent can act as an inbound gateway.

Another scenario runs in the opposite direction: using Confluent for outbound connectivity for the transformed data. In this scenario, AWS sends the processed data back to Confluent, and a Salesforce Sink Connector attached to a Confluent Topic delivers the data to Salesforce.

APIS

Amazon API Gateway permits developers to quickly create RESTful APIs and easily integrate them with other RESTful APIs or Amazon services; on the AWS side, API creation centers on this one service.

APIs are another case where Confluent is the perfect “other half” of the Amazon equation. Confluent doesn’t offer the ability to create APIs, so when you’re using Confluent, you need a tool that can — and AWS is a natural and logical choice for this purpose.

Combined with Confluent in this way, AWS acts as the API provider: Amazon API Gateway integrates with a service like AWS Lambda to accept requests and perform data cleansing and processing. The Lambda can then send the requests along to a Confluent Topic, allowing the data to reach subscribers of the topic.

It’s worth noting that this is great for asynchronous API responses but not for APIs that require synchronous responses where data is needed immediately. Asynchronous responses work because the Amazon API Gateway can provide a 202 Accepted response after sending the event to the Confluent topic.
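
Here is a minimal sketch of that asynchronous pattern, assuming a hypothetical topic name and a BOOTSTRAP_SERVERS environment variable: the Lambda forwards the request body to a Confluent topic and immediately returns 202 Accepted.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class IngestHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    // Reused across invocations so each warm Lambda keeps one Kafka connection.
    private static final KafkaProducer<String, String> PRODUCER = buildProducer();

    private static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", System.getenv("BOOTSTRAP_SERVERS"));
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent Cloud typically also needs security.protocol, sasl.mechanism,
        // and sasl.jaas.config populated from your API key and secret.
        return new KafkaProducer<>(props);
    }

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent request, Context context) {
        // Send the event to the topic ("inbound-events" is a placeholder name)...
        PRODUCER.send(new ProducerRecord<>("inbound-events", request.getBody()));
        // ...and acknowledge asynchronously, as described above.
        return new APIGatewayProxyResponseEvent().withStatusCode(202).withBody("Accepted");
    }
}
```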

LONG-TERM STORAGE

Few integrations lack a requirement for long-term data storage. Whether you are storing events for lookup years later, maintaining a ledger of information, retaining system events and logs, or meeting other needs, long-term storage is imperative.

AWS has several services that help with this:

  • Amazon S3
  • Amazon RDS
  • Amazon Elasticsearch Service

Combining these storage solutions with Confluent satisfies many use cases.

  • Confluent Log Storage: Confluent logs can be shipped to AWS and stored in Amazon Elasticsearch Service. You can then define KPIs for your systems and visualize performance against those KPIs using Kibana dashboards.
  • Raw message storage and processing: Amazon API Gateway can accept raw messages and store them in S3 so that the data is never lost (see the sketch after this list). Then, as described in the previous section, API Gateway can send the events to Confluent through AWS Lambda.
  • Pairing with SQL: Confluent has a MySQL connector, as well as several others. Events can be taken in on Confluent Topics and then piped to Amazon RDS running MySQL.
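
As a sketch of the raw message storage bullet above, the helper below persists a raw payload to S3 before any processing happens. The bucket name and key scheme are assumptions; this uses the AWS SDK for Java v2.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.util.UUID;

public class RawMessageArchiver {

    private final S3Client s3 = S3Client.create();

    // Writes the raw body to S3 under a timestamped key so the original is never lost.
    public String archive(String rawBody) {
        String key = "raw/" + System.currentTimeMillis() + "-" + UUID.randomUUID() + ".json";
        s3.putObject(PutObjectRequest.builder()
                        .bucket("my-raw-message-bucket") // placeholder bucket name
                        .key(key)
                        .build(),
                RequestBody.fromString(rawBody));
        return key; // returned so the caller can correlate the archived object with the event
    }
}
```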

PROCESSING LAYER

Many Confluent customers need a processing layer to host Kafka Streams applications or do additional processing on messages that come through the Confluent platform. This isn’t an uncommon requirement; it’s a basic need for almost every integration project, and architectural best practice is to decouple the messaging or streaming layer from the processing layer.

On the Amazon side, AWS Lambda and Amazon EC2 are two of the services that make a processing layer possible.

There are two clear use cases when combining this with Confluent.

In the first, Confluent accepts messages and streams them to Kafka Streams applications that reside on Amazon EC2, built with the Java Kafka Streams API. These applications then process data inside the Amazon ecosystem and send messages back to Confluent, or the messages can land in the AWS ecosystem and persist to an AWS database.
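
A minimal sketch of such an application, assuming hypothetical topic names and a placeholder transformation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ProcessingApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "processing-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, System.getenv("BOOTSTRAP_SERVERS"));
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("raw-events");
        events.mapValues(value -> value.toUpperCase()) // placeholder transformation
              .to("processed-events");                 // results flow back to Confluent

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```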

The second uses the Confluent AWS Lambda Sink Connector to send messages from Confluent to AWS Lambda. In this scenario, Lambda is invoked to process messages in the language of your choice. It can send messages back to Confluent, or, as above, it can land them in the AWS ecosystem and persist them to an AWS database or other storage.
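
Below is a minimal sketch of a Lambda on the receiving end of the sink connector. The exact invocation payload depends on the connector’s configuration, so the generic record shape and the "value" field used here are assumptions to verify against a real invocation.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.List;
import java.util.Map;

public class SinkHandler implements RequestHandler<List<Map<String, Object>>, Void> {

    @Override
    public Void handleRequest(List<Map<String, Object>> records, Context context) {
        for (Map<String, Object> record : records) {
            // "value" as the field holding the message payload is an assumption;
            // inspect an actual invocation payload to confirm.
            Object value = record.get("value");
            context.getLogger().log("Processing record: " + value);
            // ...apply business logic, persist to a database, or produce back to Confluent.
        }
        return null;
    }
}
```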

DEVOPS

DevOps should be baked into any modern integration project. We feel so strongly about this idea that we wrote this ebook to help organizations implement DevOps in a serverless environment. (https://bigcompass.ac-page.com/serverless-devops-ebook)

AWS provides excellent tools for DevOps processes, including automation for CI/CD pipelines and source control. There are a host of Amazon services that play well with and support DevOps:

  • AWS CodePipeline
  • AWS CodeBuild
  • AWS CodeDeploy
  • AWS CodeCommit
  • AWS Elastic Beanstalk
  • AWS SAM
  • Custom scripts on AWS

Automation is a crucial benefit of combining AWS DevOps tools with Confluent. For instance, AWS can automate governance, testing, and deployment for all the code used in your system. If you use the Java Kafka Streams API to develop streams applications, for example, you can check that code into AWS CodeCommit and automate its deployment to an auto-scaling Amazon EC2 cluster using either Elastic Beanstalk or a combination of CodePipeline, CodeBuild, and CodeDeploy.

You can also automate the creation of Confluent topics using the Confluent API or CLI. The custom scripts you create to accomplish this can live on AWS and be managed through your CI/CD services to automate your Confluent deployments.
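
As one example, topic creation can be scripted with the standard Kafka AdminClient, which works against both Confluent Platform and Confluent Cloud. The topic names, partition counts, and replication factors below are placeholders; a CI/CD step would run this after deployment.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class TopicProvisioner {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", System.getenv("BOOTSTRAP_SERVERS"));
        // Confluent Cloud also requires SASL_SSL settings from your API key and secret.

        try (AdminClient admin = AdminClient.create(props)) {
            List<NewTopic> topics = List.of(
                    new NewTopic("inbound-events", 6, (short) 3),
                    new NewTopic("processed-events", 6, (short) 3));
            admin.createTopics(topics).all().get(); // blocks until creation completes
        }
    }
}
```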

SERVERS, NETWORKING, AND FILE SYSTEM

AWS is a no-brainer for organizations with tight network security needs or those that need to work with a file system for their integrations. AWS offers multiple server, networking, and file system solutions, depending on your specific requirements and usage:

  • Amazon EC2
  • Amazon ECS
  • Amazon EKS
  • Amazon EFS
  • Amazon VPC

When combining Confluent with an Amazon solution, your Kafka Streams applications can be deployed to an EC2 auto-scaling cluster, with access to a shared file system living on Amazon EFS. The same design can be implemented using Amazon ECS and EFS rather than Amazon EC2. In either case, you can use Amazon VPC components, including security groups and network ACLs, to lock down your network security.
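
A minimal sketch of what shared file access looks like from application code, assuming the EFS file system is mounted at /mnt/efs on every instance in the cluster:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SharedFileWriter {

    // /mnt/efs is an assumed mount point; use whatever path you attach EFS to.
    private static final Path SHARED_DIR = Path.of("/mnt/efs/integration");

    public static void append(String line) throws IOException {
        Files.createDirectories(SHARED_DIR);
        // Every instance in the auto-scaling group sees the same file via EFS.
        Files.writeString(SHARED_DIR.resolve("audit.log"), line + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```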

Strengths of Confluent

You may come at this combination from the other direction; perhaps it’s the value of Confluent that has you considering ways to make it more powerful and valuable for your business. Here are some of the strengths of Confluent and how AWS can enhance them.

DATA STAGING AND AGGREGATION

It’s common for integrations to relate one object to another, or even to many others that have previously come through the integration system. Confluent excels at this because you can maintain the historical record in KTables or ksqlDB. This makes it possible to relate new events to old ones, aggregate the data, and stage it over a rolling window (say, 10 minutes, or whatever is appropriate for your application) while waiting for new updates.

Confluent services associated with this include:

  • KTable
  • ksqlDB

Again, data staging and aggregation is an area where Confluent shines. Paired with AWS, Confluent can receive data for staging and aggregation that has been sent over from AWS after the events have been accepted via API. Confluent can then send the aggregated and staged data back to AWS through the Confluent Lambda Sink Connector for further processing and delivery via a RESTful HTTPS connection to a downstream web server.
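
A minimal Kafka Streams sketch of the rolling-window staging idea, assuming hypothetical topic names, string values, and default serdes configured as in the earlier processing-layer example:

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;

public class AggregationTopology {

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("order-updates")
               .groupByKey()
               // Stage related events over a rolling 10-minute window.
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(10)))
               // Concatenating values stands in for whatever aggregation your use case needs.
               .reduce((oldValue, newValue) -> oldValue + "," + newValue)
               .toStream()
               // Unwrap the windowed key before sending the staged result downstream.
               .map((windowedKey, value) -> KeyValue.pair(windowedKey.key(), value))
               .to("staged-aggregates");
        return builder;
    }
}
```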

ORDERED EVENTS

For many integration patterns, including system synchronization, referential integrity is a major concern. Most tools struggle to ensure that data references are valid before sending objects to the target system. Confluent accomplishes this with KTables or ksqlDB.

On the Confluent side, the services that are important for ordered events include:

  • KTable
  • ksqlDB
  • KStream
  • Topics

How can AWS make this better? Imagine Confluent accepting platform events from Salesforce using the Confluent Salesforce Source Connector. If a child data object arrives first from Salesforce, Confluent can hold it for a configurable length of time while waiting for the parent object. Once the parent arrives, Confluent can re-order the events in a Confluent Topic and send them out to AWS in order, so that AWS can pass those messages along to a downstream web server properly: parent first, then child. This applies, for example, to relationships between employees and supervisors.
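
One way to approximate this hold-and-release behavior with Kafka Streams is to keep parents in a KTable and join children against it, routing children whose parent has not yet arrived to a retry topic. Everything here (topic names, the parent-id extraction, the retry wiring) is an assumption for illustration.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class OrderingTopology {

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Parents keyed by their own id; the table always holds the latest version.
        KTable<String, String> parents = builder.table("parent-events");

        // Children re-keyed by parent id so the join can find the matching parent.
        KStream<String, String> children = builder.<String, String>stream("child-events")
                .selectKey((key, value) -> extractParentId(value));

        // Parent already present: the child may proceed in order.
        children.leftJoin(parents, (child, parent) -> parent == null ? null : child)
                .filter((parentId, child) -> child != null)
                .to("ordered-child-events");

        // Parent missing: hold the child by sending it to a retry topic that is
        // re-consumed after a delay. (A production topology would branch once
        // instead of joining twice.)
        children.leftJoin(parents, (child, parent) -> parent == null ? child : null)
                .filter((parentId, child) -> child != null)
                .to("child-events-retry");

        return builder;
    }

    private static String extractParentId(String childJson) {
        // Placeholder: parse the parent id out of the child payload in your format.
        return childJson;
    }
}
```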

CONNECTORS

Accelerating integration connectivity is paramount in today’s fast-paced world. Turnkey solutions help move the needle on innovation, which is a significant driver for integrations. Confluent has focused on building out its connector ecosystem, both developing connectors itself and sourcing them from the community. This allows it to establish connectivity with some of the most in-demand systems, like Salesforce, MongoDB, and Oracle. The combination of Confluent Connectors and Confluent Topics enables these integrations to be set up seamlessly and elegantly.

In a similar scenario to the one described above, requests can be accepted on AWS via Amazon API Gateway, then processed in Lambda. Lambda can then send messages to a Confluent Topic so that Confluent can use one of its connectors to send the data to a destination system such as MongoDB.

REAL-TIME STREAMING

Real-time streaming is becoming more and more prevalent in the integration space. As usage increases, more live data is required, and Confluent is well suited to address these needs. Confluent offers excellent scalability along with low-latency message processing, even when performing lightweight transformations and data enrichment. This means Confluent can meet the most demanding real-time streaming requirements.

Confluent Topics and KStreams make this possible, and AWS comes into play to support the scalability of Kafka. Kafka demands robust infrastructure, ample memory, and a fast file system, all of which AWS can provide. Kafka can be installed on Amazon EC2 auto-scaling groups, Amazon EKS, or Amazon ECS, combined with Amazon EFS for a highly scalable file system that allows Kafka to process millions of records per second. Streaming data from Confluent Topics to Kafka Streams applications hosted on AWS completes this use case.
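
For reference, a few producer settings are commonly tuned for high-volume streaming into Kafka; the values below are illustrative starting points, not recommendations.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class HighThroughputProducer {

    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, System.getenv("BOOTSTRAP_SERVERS"));
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);           // batch small messages together
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 256 * 1024);  // larger batches per partition
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // cheap compression at volume
        return new KafkaProducer<>(props);
    }
}
```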

Conclusion

Confluent sets the standard for data movement and data streaming. AWS is the perfect partner for Confluent because it rounds out Confluent’s data agility expertise with highly scalable services that support the high demands of real-time streaming. AWS also meets Confluent’s needs in the areas of data storage and data at rest, both necessary for a robust integration strategy.

If you’re looking for more on how Confluent can increase your data agility, check out our Confluent Cloud Integration 30-day Sprint to help get you up and running fast. While you’re there, sign up for the Big Compass newsletter and be the first to find out about our upcoming event on Confluent and AWS.


Aaron Lieberman

Aaron’s passion for technology drives him to find innovative ways to help organizations advance.