Using Kafka for Analytic Processing

When Messages Are Not Transactions

Apache Kafka is a handy data processing tool, but it suffers from a lot of mythology about what it can, cannot, or even should do. It has been called just another pub/sub bus. A database turned inside out. A central nervous system for the digital enterprise. Like all mythology, each of these contains a shred of truth.

My point of view is from a DB/OS kernel perspective, which is an odd one, so I’ll try to limit the ratholes. I used to be an Oracle kernel developer specializing in performance and scalability. In addition to working on compilers, clusters, and supercomputers, I spent most of my career working in the lower half of kernels. To me, Oracle was never a relational database, it was some random OS kernel that happened to run in user space. So is Kafka. Some call Kafka just an application, yet another bag of jars running in the Java virtual OS. But it sure looks like the lower half of a distributed DB kernel to me.

Most data processing kernels do things that a regular OS does: transactions, recovery, shuffling data, and scheduling the hardware. In the case of Kafka, there is the JVM, there might be a container, then a Linux kernel, then maybe a hypervisor, which are each OS kernels in their own right. For now I’ll ignore the streaming access method (strAM) thing, which is not found in conventional DBMS products. Like all databases, Kafka has a persistent state store, so although it’s unique as a streaming database, it’s still a database.

State of Play

For the first time since punch cards were invented, we no longer have to use batch databases for data processing. With Kafka, all those batch businesses can finally become realtime businesses. This is a huge deal. Every batch database ever invented can be transformed by Kafka. The total addressable market for Kafka is not mere billions of dollars but hundred$ of billion$. Due to this market trend, I suspect that streaming will eventually appear in traditional RDBMS products.

Kafka is a streaming database, a database in motion instead of the static classics like DB2, Oracle, MS/SQL, etc. It processes data as a stream of events that occur in time. Most databases keep the last value of any given event (and an event often corresponds to a row in a traditional database), but Kafka keeps a persistent running log of every event it was ever sent. That running history can be very useful in understanding customer dynamics or the health of a natural gas pipeline. Within Kafka, topics/queues are persisted by commit logs that are append-only: once a record is written, it is read-only. In-situ updates are not supported, nor are random deletes. You want to change things? You have to write a whole new topic. This is important for forensic/regulatory use cases like hard-masking PII data, but it also changes the way things get processed, in both good and bad ways. There are some new things to get used to, but they’re small tradeoffs for being able to improve business and situational awareness in near realtime.

Keeping track of a current event, say your bank balance, makes databases good systems of record for accounting. By default, Kafka keeps track of every event or message, such as your current bank balance and all previous transactions. (Kafka can also be set up to operate like a vintage database and keep only the last message, using a compacted topic.) Processing the dynamic history of a set of events is a decent definition of streaming analytics. What Kafka has in common with the batch behemoths, and what distinguishes it from being just another pub/sub product, is that it is stateful. Since every event is recorded and recovered persistently, immutably, and temporally, Kafka can function as a Netflix-scale flight data recorder.
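
If you want to play with the vintage mode, here is a minimal sketch using the Java AdminClient. The topic name, partition count, replication factor, and broker address below are invented for illustration; the only part that matters is the cleanup.policy=compact setting.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CompactedTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative address

            try (AdminClient admin = AdminClient.create(props)) {
                // cleanup.policy=compact keeps only the latest value per key,
                // which is the "vintage database" behavior described above.
                NewTopic balances = new NewTopic("account-balances", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
                admin.createTopics(List.of(balances)).all().get();
            }
        }
    }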

Analytics is Not Accounting

There are two ways of processing data. OK, probably more than two. There is the classic, punch card, transactional data like your paycheck (OLTP) and there is analytic data (OLAP) where data scientists aggregate hundreds of thousands of paychecks looking for patterns that might be useful to the business. These patterns could help the business be more operationally aware or situationally aware. Unlike almost every other database on the market, Kafka can be optimally configured to do both OLTP and OLAP in realtime. The brilliance of Kafka is that every persistent queue can have its own dedicated transaction layer, so you can tune a queue for the accountants and a few others for the analysts.

When IBM and Oracle realized that a lot of historical transaction data had piled up in their customers’ OLTP databases, which treated every row and column as if it were precious paycheck data, they bolted OLAP capabilities onto an OLTP database, and that created conflicts. Asking a DB built for transaction processing to suddenly not care about the sanctity of the row was going to be a train wreck. However, it wasn’t that simple to separate the workloads, since data scientists always wanted current data, as close in time as possible to what just happened. Analysts piled onto the main operational OLTP database in search of OLAP because they had no choice. Their queries would constantly sweep hundreds of millions of rows while the accountants were trying to work, creating both technical and organizational conflicts.

Kafka is used as a spinal tap to intercept transactional data on the way to the precious OLTP DB and both filter and re-route the traffic to either OLTP or OLAP destinations. This may seem complicated, but at scale, it simplifies things. (If you don’t need to scale, then go ahead and keep running your OLAP monthlies in your Teradata billing system.) Because Kafka can operate in accounting or analytic mode—and do it on a topic-by-topic basis—it can solve a problem that has nothing to do with streaming analytics.  

The legendary ACID acronym (atomicity, consistency, isolation, durability) has been at the heart of OLTP reliability since, well, punch cards. If you have to analyze patterns of columns and rows, then acidic meticulousness comes with a cost in terms of performance, scalability, and sometimes operational stability. To keep the accountants happy, Kafka topics can be configured to be meticulous. Kafka will create two or more fully synchronous, strictly coherent copies of every message it is sent. With the typical replication factor of three, every message written (produced) to Kafka is forwarded to two other Kafka message-processing nodes called brokers.

You can also program an individual topic to consider the message safely recorded (committed) only after all the other brokers acknowledge that they have successfully written your message to their persistent media. This process is at the mercy of the CPU, DIMMs, network, and the storage layer’s ability to pull this off quickly. This latency is also at the mercy of how busy the cluster might be, since it isn’t all about you. This all takes time, but if you ask Kafka to treat a message like it’s a precious paycheck, it will. And you will wait. Doing analytics on a platform tuned for accounting is throwing good throughput after bad since no analyst or data scientist cares if your paycheck is correct, or if you even got one. No offense, none taken.
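
To make the accounting mode concrete, here is a rough sketch of a paycheck-grade producer in Java. The topic name, key, and broker address are invented for illustration, and the comment about topic-side settings assumes a typical three-way replicated setup rather than anything prescribed in this post.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PaycheckGradeProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Wait for every in-sync replica to persist the message before
            // calling it committed; this is the "meticulous" accounting mode.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // The topic side would typically pair this with replication.factor=3
            // and min.insync.replicas=2 so a write survives a broker failure.

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("payroll", "employee-42", "paycheck"));
            }
        }
    }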

Redo Rathole

This is the rathole part of the program. Oracle and other DBs use redo logs that are really only a performance hack to improve transaction rates. Unfortunately, Kafka calls its stateful media commit logs, but commit logs are not redo logs. Kafka doesn’t do lossy encoding, but it does do lossless encoding like space compression and encryption on what it sends to its commit logs. Oracle redo logs are encoded versions of the changes made to a row. If you know Oracle, think about how Flashback query is implemented. Kafka logs are sorta like persistent UNDO segments. </rathole>

Kounting Kafka

Kafka has a feature called idempotent operation. I did not learn that word in middle school, perhaps you did. This operation is used in conjunction with exactly-once transaction semantics, so only one copy of a paycheck gets recorded for each pay period. Kafka normally achieves Netflix-level scalability because it allows a topic to be sliced into a dozen or so partitions and lets anybody write into any of those partitions in parallel. The exactly-once feature puts an end to the tomfoolery of an employee getting 12 deposits of the same paycheck. It limits the scalability of an accounting-centric topic, but it doesn’t mean you can’t have other topics that permit parallel insertion (or consumption) of data, which is ideal for the analysts because the odd extra copy of your paycheck is just a little extra noise from their perspective. Personally, I’m OK if you get 11 extra paychecks every payday, but then I’m not an accountant.
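
Here is a minimal sketch of what that accounting-grade, exactly-once producer looks like in Java. The transactional id and topic name are made up for illustration, and a full exactly-once pipeline would also involve consumers reading with read_committed isolation.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ExactlyOncePayroll {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Idempotence de-duplicates broker-side retries; the transactional id
            // enables all-or-nothing writes across partitions.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payroll-run-2021-02"); // illustrative id

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("payroll", "employee-42", "one paycheck, exactly once"));
                producer.commitTransaction();
            }
        }
    }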

If you operate in exactly-once hyper-safe mode and are doing 100,000 transactions a day, who cares? But if you’re doing 100,000/sec, you might care. If you are looking for patterns, you really don’t want the topic wired for accountants. A common myth, even in signal intelligence and measurement intelligence, is that there is some magical needle in the haystack, and that dropping just one needle or event is catastrophic. In analytics, there is rarely a single needle. Instead, it’s an aggregate of signal needles. The “needle” is a signature of interest. This meta-needle is found in analytics, signal processing, noise shaping, etc. While accountants have single needles like a $36B SWIFT transaction, analysts do signal processing.

After spending the last few paragraphs trashing the use of an OLTP platform for analytics, I should concede that some OLTP platforms must handle sensor streams that are transactional, where the contents of a single message could mean certain death, for example in airframe avionics. Some data from an Airbus is used for forensic analysis; it’s not always just for the NTSB. It’s always good to find one tiny bit of data, one needle in the haystack, if it indicates the horizontal stabilizer is jammed.

Aggregating signals into signatures and doing that at scale is what Kafka does for cyber security threat surfaces. It ingests and blends telemetry from logs, firewalls, and intrusion sensors. Kafka can also see the entire sensor surface and can detect horizontal anomalies that specialized sensors might miss since they tend to be focused on doing their one thing very well. This aggregation capability forms a more actionable view of the threat surface, whether it’s on a corporate network, a power transmission grid, or a factory floor making hand sanitizer.

Signature Processing at Scale

Kafka is a flexible signature processing substrate for improving any form of awareness in agencies and organizations. At Confluent, I work with customers doing SIEM, Cyber, and Operational Technology (OT) optimization where the telemetry is not paychecks. The streams of signals are processed to improve the signal fidelity before being passed on to an analytic sink like a SIEM. Kafka can also analyze signals for anomalies using that Kafka Streams processing thing, which we haven’t talked about yet. The velocity in these use cases is usually brisk, like 100,000/sec. At that rate, it is unfashionably inefficient to handle every single message with a pair of Lanvin gloves as if it were a paycheck, but when Kafka is configured and optimized as a high-velocity signal processor, it can outrun any batch database by getting to the data faster without getting bogged down being precious.

Streaming OLAP is Signal Processing

Unlike in the classic OLTP world, the data scientist is the application developer. The micro-services they develop are designed to analyze the data by looking for patterns. A signature is an aggregation of signals and all signals have noise. Data analysts do both signal and signature processing to find patterns of interest or improve the fidelity of the signature to make it easier to find patterns faster. When looking for patterns, data scientists and analysts treat non-transactional data as signals. 

Kafka can’t tell the difference between the signal and the noise because that is in the eye of the analyst. The signal could have noise or the noise, in the form of outliers, could be the signal. If they are working on the noise side, it’s called filtering or noise shaping. If working on the signal side, it’s called anomaly or pattern detection. In either case, the processed streams can still be routed to model trainers, forensic analyzers, or into a regulatory sink where the stream is conditioned for long-term storage. Being able to trim even a couple of bytes from a conditioned forensic stream can make a 7-year archive that much more storable.
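
To make that concrete, here is a minimal Kafka Streams sketch of both sides, noise shaping and anomaly routing. The topic names and the crude string predicates are stand-ins for whatever your telemetry and rules actually look like.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class NoiseShaper {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "noise-shaper");      // illustrative id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> raw = builder.stream("raw-telemetry");

            // Noise shaping: drop heartbeat chatter before it reaches the SIEM.
            raw.filterNot((key, value) -> value.contains("heartbeat"))
               .to("conditioned-telemetry");

            // Anomaly side: route outliers to their own topic for the analysts.
            raw.filter((key, value) -> value.contains("ALERT"))
               .to("anomalies");

            new KafkaStreams(builder.build(), props).start();
        }
    }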

Curation Fabric Topology

At Confluent, I’ve pioneered the concept of a curation fabric topology that sits between a sensor fabric and an analytic fabric. In the cyber network world, the sensor fabric usually consists of firewalls, logs from everywhere, and intrusion detection sensors. If you’re monitoring a gas pipeline, the sensor array is different, but the curation fabric is not. On the other side, the analytic fabric historically consisted of batch (woohoo!) databases that come with all those fun features of a batch warehouse: latency, write congestion, inability to crazy scale, abysmal recovery performance, and little tolerance for degraded operation. In the logs world, these are SIEMs (batch data warehouses), and if your SIEM is treating telemetry like paychecks, then you’ll need to buy another pair of expensive leather gloves. In many cases, SIEM customers use Confluent’s curation fabric to offload write-intensive operations that clutter up their SIEMs.

[Diagram: the curation fabric sits between the sensor fabric and the analytic fabric]

Out of Time

Finally! The stream processing thing. Stream processing is a very powerful access method that works a lot differently than all the other access methods found in classic databases. In DB-land, you have a table. You need to sift through it. You often have to start at one end and sweep the whole thing. When set up to meet your query needs, indexes can reduce most of the sifting, but sometimes you need to sweep all of it looking for patterns. To keep the query current, you have to continuously extract, transform, and load new data into the index every 5 minutes. You have to wait for the ETL job to finish. If the ETL job doesn’t run continuously, then you will…wait for it…wait again. You have more than one index on the table to accommodate a variety of queries? More waiting, and lots more write overhead.

Kafka Streams can filter, ETL, and detect weird things in the data as it flies under your feet. The tricky bit is that stream processing is a temporal access method. The data is analyzed in-flight. Using the streaming SQL language interface KSQL, a query doesn’t have to sift through the data. It waits for the data to come to it. Using a KSQL language feature called WINDOW, a query can run forever, look at the last 30 seconds, or look at the last 30 minutes. Finding the last 30 minutes in a classic batch system means running continuous ETL and hoping your indexes are well organized. Even if you don’t need to do realtime analytics, Kafka is often used to augment a classic warehouse because continuous ETL is a perfect fit for stream processing and any other time-centric (or literally, time-series) use cases.
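
In KSQL that is literally a WINDOW clause in the query; the Kafka Streams analogue looks roughly like the sketch below on recent releases. The topic and application names are invented, and the result is simply printed to stdout to keep the sketch short; the 30-minute count updates continuously as events arrive instead of waiting for an ETL job.

    import java.time.Duration;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Printed;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class ThirtyMinuteWindow {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "thirty-minute-window"); // illustrative
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // illustrative
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> telemetry = builder.stream("conditioned-telemetry");

            // A continuously updating count of events per key over the last
            // 30 minutes, recomputed as data arrives rather than via ETL.
            telemetry.groupByKey()
                     .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(30)))
                     .count()
                     .toStream()
                     .print(Printed.toSysOut());

            new KafkaStreams(builder.build(), props).start();
        }
    }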

However, all magic comes with a price, and it might take some time to get your noggin around the notion that the data is always in motion, if there even is data! It can be very unsettling if you are used to the IBM 029 keypunch approach. A KSQL query might sit there for 30 minutes and not return anything because nothing has arrived that matches your predicate. It seems hung. From a batch query perspective, this looks super bad, since a batch query runs, looks for stuff, doesn’t find anything, stops, and tells you it found nothing. I can’t believe I’m about to use a fishing metaphor, but stream processing is like pole fishing from a bridge over a river and batch processing is like trolling in a lake with a fancy set of nets.

Discrete and Continuous Streaming

A stream side-step question that comes up a lot with signal processing customers is: can Kafka stream audio and video? The crucial difference between AV streams and Kafka streams is that an AV stream consists of a continuous stream of adjacent “events”. They are not discrete but are inter-connected frames, so what comes out of your DAC is a continuous sine wave or a continuous Lady Gaga. In video, frames follow each other so you get motion pictures. Adjacent Kafka events can be completely discrete, so using Kafka events to capture streams of video isn’t a good use of this technology.

Using Kafka to capture the odd still or a subframe to temporally correlate with other sensor data is a good use case. A B&W CCTV camera often produces compressed frames that are a few MB even at 1K resolution. They can be grabbed and timestamp-tagged alongside data from other sensors. Some industrial customers have old equipment with analog dials that need to be monitored. A camera is pointed at the dial and a frame is snapped. A machine vision process converts the picture of the dial into a number, which can then be combined with data coming from conventional digital sensors that use SCADA, etc.
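
Here is a small sketch of that last hop, assuming the machine vision step has already turned the frame into a number. The topic, key, and reading are invented for illustration; the point is that the record carries the frame’s own timestamp so it lines up with the rest of the sensor data.

    import java.time.Instant;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DialReadingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Pretend a machine vision step has already turned the snapshot of the
            // analog dial into a number; publish it with the frame's timestamp so
            // it correlates temporally with the SCADA telemetry.
            long frameTimestamp = Instant.now().toEpochMilli();
            String reading = "42.7"; // hypothetical gauge value

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>(
                    "dial-readings",      // illustrative topic
                    null,                 // let Kafka pick the partition
                    frameTimestamp,       // event time taken from the captured frame
                    "boiler-7-pressure",  // illustrative sensor key
                    reading));
            }
        }
    }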

Using Confluent connectors, KSQL, and the Apache Kafka ecosystem to do signal processing opens up a huge set of use cases from SIEM optimization and process control modernization to industrial use cases. My next posts will dive into SIEM and Kafka-in-a-backpack craziness, so stay tuned!