How Kafka is changing the Big Data ecosystem - and how SolDevelo uses it

May 13, 2020

admin

In today’s highly digital world, the constantly growing amount of data available to us can be overwhelming. The Internet of Things (IoT) alone has given rise to many new data sources: devices such as smartphones, smartwatches, fitness trackers, and smart-home appliances generate more data every day. This data, large in volume and irregular in nature, is Big Data that needs to be analyzed in order to extract critical information and to understand users’ behavioral patterns.

To cope with these continuous streams of data, there is growing interest in tools that can serve many business use cases at once.

As Big Data grows in popularity, new technologies keep emerging around this ecosystem.

In the last few years, a new style of system and architecture has emerged which is built not just around passive storage but around the flow of real-time data streams. This is exactly what Apache Kafka is all about.

What Kafka really is

Apache Kafka, an open-source stream-processing platform, was originally developed at LinkedIn and open-sourced in 2011. Since then, it has become fundamental infrastructure for thousands of companies, from Airbnb to Netflix. And no wonder: as a reliable way to ingest and move large amounts of data very quickly, it is a very useful tool in the big data space.

Kafka serves as a central hub for data streams. It provides a framework for storing, reading, and analyzing streaming data, and it moves and distributes that data to multiple destinations at high speed. Designed as a distributed system, Kafka can store a high volume of data on commodity hardware: it runs across many servers and makes use of the additional processing power and storage capacity they bring. Because of its distributed nature and the streamlined way it manages incoming data, it operates very quickly. A cluster can process millions of messages per second, which makes it possible to react to streaming data in real time. And last but not least, thanks to built-in redundancy (data is replicated across servers), it can provide the reliability needed for mission-critical data.

The fact that it is open-source software makes it even more advantageous. It is essentially free to use and has a large community of users and developers. All of them have access to the source code, so they can analyze errors and fix them. They can also contribute modifications, updates, and new features, as well as offer support to new users.

How Kafka works

Over the past few years, the number of use cases solved by Kafka has increased. As more data from different sources (e.g. websites, financial transactions) has to be delivered to a wide range of target systems (e.g. databases, email systems), developers would otherwise have to write a separate integration for each source and target pair. That is not a convenient process: delivering data this way is slow and involves many steps. Kafka acts as an intermediary: it receives data from source systems and then makes this data available to target systems as a real-time stream, ready for consumption.
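To make the integration problem concrete, here is a small back-of-the-envelope sketch. It assumes the worst case, in which every source has to feed every target; the function names are ours and purely illustrative.

```python
def point_to_point(sources: int, targets: int) -> int:
    """Without an intermediary, each source needs its own integration with each target."""
    return sources * targets


def via_intermediary(sources: int, targets: int) -> int:
    """With an intermediary, each system integrates once, with the intermediary only."""
    return sources + targets


for s, t in [(2, 2), (4, 6), (10, 20)]:
    direct = point_to_point(s, t)
    hub = via_intermediary(s, t)
    print(f"{s} sources, {t} targets: {direct} direct integrations vs {hub} via an intermediary")
```

The gap widens quickly as systems are added, which is one way to read the appeal of a central hub.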

How does it look in detail? Kafka takes information, which can come from a huge number of data sources, and organizes it into “topics”. Applications write to topics through an interface known as the Producer. Each topic is stored as an ordered, segmented log, known as the Kafka topic log. Another interface, the Consumer, reads the topic logs and passes the information stored in them on to other applications that may need it. Put together with the other common elements of a Big Data analytics framework, Kafka works as the “central nervous system”: it collects large quantities of data from user interactions, logs, application metrics, IoT devices, etc., and delivers it as a real-time data stream ready for use.
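The producer, topic log, and consumer roles can be illustrated with a toy in-memory model. This is only a sketch of the idea, not the Kafka client API: the classes `TopicLog`, `Producer`, and `Consumer` and the topic name `page-views` are made up for this example, and a real cluster adds partitioning, persistence, and replication on top.

```python
from collections import defaultdict


class TopicLog:
    """Toy stand-in for a broker: one append-only list per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def append(self, topic, record):
        log = self._logs[topic]
        log.append(record)
        return len(log) - 1  # offset of the record just written

    def read(self, topic, offset):
        # Return every record at or after `offset`; reading removes nothing.
        return self._logs[topic][offset:]


class Producer:
    """Interface that applications use to write records to a topic."""

    def __init__(self, broker):
        self._broker = broker

    def send(self, topic, record):
        return self._broker.append(topic, record)


class Consumer:
    """Tracks its own read position, so consumers do not affect each other."""

    def __init__(self, broker, topic):
        self._broker = broker
        self._topic = topic
        self._offset = 0

    def poll(self):
        records = self._broker.read(self._topic, self._offset)
        self._offset += len(records)
        return records


broker = TopicLog()
producer = Producer(broker)
analytics = Consumer(broker, "page-views")

producer.send("page-views", {"user": "u1", "page": "/home"})
producer.send("page-views", {"user": "u2", "page": "/pricing"})

print(analytics.poll())  # both records, in the order they were written
print(analytics.poll())  # empty list: nothing new since the last poll

# A consumer added later still reads the whole log from the beginning.
archiver = Consumer(broker, "page-views")
print(len(archiver.poll()))
```

Because each consumer keeps its own offset and records stay in the log after being read, a new consuming system can be attached without touching the producer or the existing consumers.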

One of Kafka’s great advantages is that we can always add a new specialized system to consume data already published to Kafka, without changing the producers. This matters a great deal for the growth of a Big Data ecosystem.

How SolDevelo uses Kafka and Big Data

SolDevelo, as a company providing digital technology solutions, also uses Apache Kafka and Big Data. We use them to check whether the network in which the end user is located works correctly. To do so, we perform various assessment tests.

On the user’s device (a router) there is a service responsible for performing tests, while on a server located in the network there is a module that helps perform the tests and collects the results. Tests over various protocols (UDP or TCP) are executed between these services periodically.

Before the user-side service can run a test, it asks the load balancer to point it to an appropriate server-side service to test against. For the load balancer to know which services exist in a given area, the services announce via Kafka that they are ready for testing. Kafka handles real-time data from multiple nodes, and its unified handling of input streams is designed to provide high throughput and low latency.
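The readiness announcements described above could be handled roughly as in the following sketch. Everything here is an assumption made for illustration: the message fields `service_id` and `area`, the `LoadBalancer` class, and the round-robin selection are not taken from the actual system, and the list of messages stands in for records consumed from a Kafka topic.

```python
from collections import defaultdict


class LoadBalancer:
    """Hypothetical registry built from readiness announcements."""

    def __init__(self):
        self._ready = defaultdict(list)  # area -> service ids, in arrival order
        self._next = {}                  # area -> how many picks were made so far

    def on_ready(self, message):
        """Handle one announcement; duplicates from the same service are ignored."""
        services = self._ready[message["area"]]
        if message["service_id"] not in services:
            services.append(message["service_id"])

    def pick(self, area):
        """Return a ready service for the area (round-robin), or None if there is none."""
        services = self._ready.get(area)
        if not services:
            return None
        i = self._next.get(area, 0)
        self._next[area] = i + 1
        return services[i % len(services)]


# Stand-in for messages read from a readiness topic.
announcements = [
    {"service_id": "srv-a", "area": "north"},
    {"service_id": "srv-b", "area": "north"},
    {"service_id": "srv-c", "area": "south"},
]

lb = LoadBalancer()
for msg in announcements:
    lb.on_ready(msg)

print([lb.pick("north") for _ in range(3)])  # ['srv-a', 'srv-b', 'srv-a']
print(lb.pick("west"))                       # None: no service announced for this area
```

Round-robin is just the simplest policy to show; a real load balancer would likely also track load and remove services that stop announcing themselves.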

When the user-side service performs the test, it sends its result to the collector, which in turn sends the data to Kafka. When Kafka receives messages on the right topic, they are automatically forwarded to a non-relational database (MongoDB). Before this happens, however, the appropriate connector calculates the average values for certain parameters, so we do not have to compute them when reading data from the database. The database, just like Kafka, can be freely configured to handle a large amount of data in the shortest possible time.

Services that are responsible for testing do not have direct access to the database. Only the service responsible for displaying reports for the administrator has direct access to the data, so that the status of the entire application can be checked.
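The averaging step performed before storage might look like the sketch below. The record layout, the field names `latency_ms` and `jitter_ms`, and the `summarize` function are hypothetical, and a plain list stands in for the MongoDB collection; the real connector and its parameters are not described in this article.

```python
from statistics import mean


def summarize(result, fields=("latency_ms", "jitter_ms")):
    """Return a copy of a test result with avg_<field> values instead of raw samples."""
    doc = {key: value for key, value in result.items() if key != "samples"}
    for field in fields:
        values = [sample[field] for sample in result["samples"] if field in sample]
        if values:  # skip parameters that were not measured in this test
            doc[f"avg_{field}"] = mean(values)
    return doc


# Hypothetical raw test result as it might arrive on the topic.
raw = {
    "device": "router-17",
    "protocol": "UDP",
    "samples": [
        {"latency_ms": 10.0, "jitter_ms": 1.0},
        {"latency_ms": 14.0, "jitter_ms": 3.0},
    ],
}

store = []  # stand-in for the database collection
store.append(summarize(raw))
print(store[0])
# {'device': 'router-17', 'protocol': 'UDP', 'avg_latency_ms': 12.0, 'avg_jitter_ms': 2.0}
```

Computing the averages once, on the way in, means the reporting side reads ready-made values. Dropping the raw samples, as this sketch does, is a design choice that trades detail for smaller documents.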

Currently, the test configuration is performed manually by an administrator. However, in the near future, we intend to use machine learning so that the selection of tests and parameters is even more convenient and takes place automatically.