Learnings From Creating an IoT Data Pipeline on AWS

10 min Read


Handling IoT devices and performing computations on their readings is always a tedious process, as it requires both hardware and software expertise. It becomes even more complex when you need to transfer data using traditional MQTT protocols and design your own servers and infrastructure to handle the flow of data from IoT devices to your software platform.

But what if you could hand the entire infrastructure and data flow over to someone else, initialize the operations yourself, and then sit back and relax? Sounds interesting, doesn't it? The AWS cloud platform provides a wide variety of services with which you can set up the entire data flow of your IoT devices in the cloud, while security and data backups are managed by AWS itself.

AWS provides several ways to set up IoT data streaming in your software. One of them is explained below:

EDGE → API GATEWAY → KINESIS DATA STREAMS → FIREHOSE → S3 → ATHENA

  • AWS Kinesis Data Streams: Used for streaming real-time sensor data to your dashboards.
  • AWS Cognito: Validates the source of the data.
  • AWS API Gateway: A serverless API for injecting data into Kinesis Data Streams. This is the endpoint the IoT edge client uses to insert data into your pipeline (a minimal client sketch follows the list).
  • AWS Kinesis Firehose: Dumps the data received from Kinesis Data Streams into S3.
  • AWS Athena: Performs aggregations and analysis on the real-time data stored in S3.
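
To make the entry point concrete, here is a minimal sketch of an edge client posting one sensor reading to the API Gateway endpoint. The URL, Cognito ID token, and payload fields are hypothetical placeholders, not values from the project.

```python
import json

import requests

# Hypothetical API Gateway endpoint and Cognito ID token; replace them with
# the values from your own deployment and user pool.
API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/telemetry"
ID_TOKEN = "<cognito-id-token>"

# One sensor reading; the field names are illustrative only.
reading = {
    "device_id": "sensor-001",
    "ts": "2021-06-01T10:15:00Z",
    "voltage": 231.4,
}

# The Cognito token goes in the Authorization header so the API Gateway
# authorizer can validate the source of the data before it reaches Kinesis.
response = requests.post(
    API_URL,
    headers={"Authorization": ID_TOKEN, "Content-Type": "application/json"},
    data=json.dumps(reading),
    timeout=10,
)
response.raise_for_status()
```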

Challenges for the Developer

Although this pipeline seems simple and straightforward, there are certain areas where the developer may be challenged and will have to put in extra effort to set up the flow:

  1. Sending Data from API Gateway to Kinesis Data Streams:

    • Challenge: Kinesis Data Streams is designed to send a blob of data to Firehose, which in turn sends the data to its destination. But what if an array of records needs to be sent in each iteration? This is where the challenge arises: Kinesis can easily send a single record, but with multiple records you may run into serialization errors.
    • Solution: Use the right Kinesis action when sending data through API Gateway: PutRecord for a single record and PutRecords for multiple records. However, you need to be careful when designing the mapping templates in API Gateway for the two actions. A sketch of the two calls follows.
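
As a hedged illustration, the boto3 calls below show the shape of the two Kinesis actions that the API Gateway mapping templates ultimately map to; the stream name and payload fields are hypothetical.

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "sensor-stream"  # hypothetical stream name

single = {"device_id": "sensor-001", "ts": "2021-06-01T10:15:00Z", "voltage": 231.4}
batch = [
    {"device_id": "sensor-001", "ts": "2021-06-01T10:15:00Z", "voltage": 231.4},
    {"device_id": "sensor-002", "ts": "2021-06-01T10:15:00Z", "voltage": 229.8},
]

# One reading -> PutRecord: a single Data blob plus a PartitionKey.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps(single).encode("utf-8"),
    PartitionKey=single["device_id"],
)

# An array of readings -> PutRecords: a Records list where each entry carries
# its own Data blob and PartitionKey. Pushing the whole array as one PutRecord
# blob is what typically leads to serialization errors downstream.
kinesis.put_records(
    StreamName=STREAM,
    Records=[
        {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": r["device_id"]}
        for r in batch
    ],
)
```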

  2. Dumping Data into the Kinesis Firehose Destination in the Correct Format:

    • Challenge: Kinesis Firehose offers different formats a developer can use when creating data dumps in S3. The available formats are CSV, JSON, Parquet, and ORC. Initially you might wonder how a data format could pose a challenge. However, once the data size grows steadily over time, it does: CSV and JSON become very bulky data sets in the long run.
    • Solution: The right format is Parquet. The other formats store data row by row, while Parquet stores it in a columnar format, which makes query execution from Athena faster. Parquet also strips unnecessary space and blank fields while storing, which saves S3 space as well. A configuration sketch follows.
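
A hedged boto3 sketch of a Firehose delivery stream with JSON-to-Parquet conversion enabled; the ARNs, bucket, Glue database, and table names are placeholders, and the column schema is assumed to already exist in Glue.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="sensor-delivery",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/sensor-stream",
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::sensor-data-bucket",
        "Prefix": "telemetry/",
        # Format conversion requires a buffer of at least 64 MB.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Incoming records are JSON; they are written out as Parquet.
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # The column schema is read from an existing Glue table.
            "SchemaConfiguration": {
                "DatabaseName": "sensor_db",
                "TableName": "sensor_readings",
                "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
            },
        },
    },
)
```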

  3. Querying Partitioned Data from S3 in Athena:

    • Challenge: Firehose offers the bonus of storing data in S3 partitioned by date. This is a big gift when you are querying large S3 data sets from Athena, because every time you query S3 data from Athena, Athena scans a chunk of data in S3. Two cases can arise:
      1. If the data is not partitioned: Athena will scan the entire chunk of data present in the S3 bucket. That is, if you have 10 years of records in the bucket, every query that fetches the latest record will scan all 10 years of data, resulting in a large number of GET requests, which in turn increases the S3 cost significantly.
      2. If the data is partitioned: Athena will scan only the latest day's records, even if you have 10 years of data in your S3 bucket, and your GET requests won't increase.

    The challenge here is: how do you read partitioned data in Athena?

    • Solution: Use Partition Projection when reading the partitioned data in S3. You can set up Partition Projection while creating the Glue table for Firehose. Partition Projection lets a query read data for a particular timestamp range only, so you end up scanning just the records that are actually required to execute the query. A hedged table definition is sketched below.
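
As one possible shape of that table, the DDL below defines a projected datehour partition over Firehose's default YYYY/MM/dd/HH prefix layout and runs it through boto3; the bucket, database, table, and column names are hypothetical.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Table over Parquet files written by Firehose under a YYYY/MM/dd/HH prefix.
# Partition projection derives the datehour values automatically instead of
# requiring MSCK REPAIR TABLE or explicit ALTER TABLE ADD PARTITION calls.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sensor_db.sensor_readings_projected (
  device_id string,
  ts string,
  voltage double
)
PARTITIONED BY (datehour string)
STORED AS PARQUET
LOCATION 's3://sensor-data-bucket/telemetry/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.datehour.type' = 'date',
  'projection.datehour.format' = 'yyyy/MM/dd/HH',
  'projection.datehour.range' = '2021/01/01/00,NOW',
  'projection.datehour.interval' = '1',
  'projection.datehour.interval.unit' = 'HOURS',
  'storage.location.template' = 's3://sensor-data-bucket/telemetry/${datehour}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://sensor-query-results/"},
)
```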

Project Case Study:

We recently created an IoT-based product for a power-sector client. The product receives data from IoT sensors installed at edge locations and displays this information on a web-based application.

Data from the IoT sensors travels to the cloud (AWS) encrypted, so that data integrity and security are ensured. The backend infrastructure is built on Amazon Web Services (AWS) and Django REST Framework. We used Kinesis for real-time streaming on our dashboard and Athena for displaying aggregated results for time-based filters (an example query is sketched below).
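
For illustration only, here is the kind of query a dashboard's time-based filter might run against the projected table from the previous section; the table, columns, and date range are hypothetical, and the datehour predicate is what keeps Athena from scanning the whole bucket.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hourly average voltage per device for a single day. Filtering on the
# projected datehour column limits the scan to that day's Parquet files.
query = """
SELECT device_id,
       date_trunc('hour', from_iso8601_timestamp(ts)) AS reading_hour,
       avg(voltage) AS avg_voltage
FROM sensor_db.sensor_readings_projected
WHERE datehour BETWEEN '2021/06/01/00' AND '2021/06/01/23'
GROUP BY 1, 2
ORDER BY 2
"""

execution = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://sensor-query-results/"},
)
print(execution["QueryExecutionId"])
```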

While setting up the infrastructure, we initially faced some bottlenecks in the AWS Kinesis data pipeline, for example while reading long JSON arrays in Kinesis or while querying the partitioned data stored in the S3 bucket. But with proper research and collaborative engineering we achieved our goal and completed the product.

Every new recipe may not be perfect on the first go, but if we take lessons from other chefs' experiences, we may well end up making a delicious dish!

Chitrank Tyagi

Lead Software Engineer

LinkedIn

Chitrank is a tech enthusiast who loves to code and is fond of playing with AWS services. He believes one should never be afraid of taking the first step, as that one step can make the difference in transforming you into a better version of yourself.

