Matt Howlett

every blog should have a tagline

Stream Processing in C#/.NET

2019-06-02

The v1.0 release of Confluent's .NET Kafka client brings with it a completely revamped API. This API introduces a number of ideas not yet found in other clients (for example, check out the specifics of the Message class). It's now - in my quite biased opinion - one of the nicer Kafka APIs out there. But at the end of the day, it stays pretty close to the API of clients in other languages - in particular, to the underlying librdkafka C API.
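To give a flavor of the 1.0 API, here's a minimal produce example (a sketch - the broker address, topic name and message contents are all placeholders). Note how Message<TKey, TValue> bundles the key and value together with headers and a timestamp:

```csharp
using System;
using System.Threading.Tasks;
using Confluent.Kafka;

class Program
{
    static async Task Main()
    {
        var config = new ProducerConfig { BootstrapServers = "localhost:9092" };

        using (var producer = new ProducerBuilder<string, string>(config).Build())
        {
            // Message<TKey, TValue> carries the key and value, along with
            // (optional) headers and a timestamp.
            var result = await producer.ProduceAsync("my-topic",
                new Message<string, string> { Key = "user-42", Value = "hello" });

            Console.WriteLine($"delivered to: {result.TopicPartitionOffset}");
        }
    }
}
```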

Kafka clients are very high level compared to the Kafka protocol - they provide a lot of functionality for managing complex error scenarios and broker failover, amongst other things. But they're pretty low level compared to many of the tasks you'll want to use them for, and I think there's a need for a toolbox of pre-built, high level abstractions for common use-cases.

Stream Processing

Something people often want to build on top of Kafka are stream processing applications. By that, I mean horizontally scalable applications that read from one or more Kafka topics, do some potentially stateful processing on that data, and write the result back to one or more Kafka topics. Apache Kafka comes with a stream processing library called Kafka Streams, which is just a bunch of functionality built on top of the basic Java producer and consumer. Unlike many other data processing systems, this is just a library. To write a Kafka Streams application, you simply link against this library and use the abstractions it provides. You don't need to submit your application to a special purpose cluster in order to execute it - you run your application as usual, executing multiple instances of it in order to scale. There are many general purpose tools that can help you manage this, Kubernetes being a high profile example.

But Kafka Streams is a Java library - it can only be used with JVM languages. Confluent currently recommends users of other languages consider KSQL, which is a layer over Kafka Streams that allows you to rapidly define stream processing jobs using a SQL syntax. KSQL is pretty awesome, growing quickly, and I think well positioned to become the primary way that people interact with streaming data. But (today) it only solves a subset of the space of streaming problems. One limitation is that it can only work with data in specific formats/schemas. Another is that it can be extended, but only in a limited way via UDFs and UDAFs that must be written in Java.

So there's still a role for stream processing libraries such as Kafka Streams to play, and I'd like to see (as I think many other people would!) a Kafka-centric option along the lines of Kafka Streams available for my go-to programming language, C#.

Keeping it Simple

I also think that quite a lot of value can be achieved with relatively little effort. In particular, I don't think building out capability similar to that in Kafka Streams for defining arbitrary, strongly-typed DAG processing topologies is a prerequisite for achieving something useful. Many high profile use-cases of Kafka are the combination of a relatively small number of simple processing steps (operating at large scale), and for these, the ability to express that simple structure in code doesn't really provide any value.

So to begin with, I just want to build some simple foundational processing abstractions. For example: stateless processing / filtering, windowed aggregations and stream-stream joins. For each of these abstractions, all that you'll essentially need to specify is an appropriate processing function plus the input and output topic(s) - super simple!
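As a rough sketch of what the stateless filter abstraction might wrap internally, here's that pattern written directly against the 1.0 client (the group id is made up, and error handling and commit management are elided - offsets are auto-committed by default):

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

class FilterProcessor
{
    // A minimal stateless filter: consume from one topic, apply a predicate,
    // and produce matching messages to another topic.
    public static void Run(
        string brokers, string inputTopic, string outputTopic,
        Func<string, bool> predicate, CancellationToken ct)
    {
        var cConfig = new ConsumerConfig
        {
            BootstrapServers = brokers,
            GroupId = "filter-processor",
            AutoOffsetReset = AutoOffsetReset.Earliest
        };
        var pConfig = new ProducerConfig { BootstrapServers = brokers };

        using (var consumer = new ConsumerBuilder<string, string>(cConfig).Build())
        using (var producer = new ProducerBuilder<string, string>(pConfig).Build())
        {
            consumer.Subscribe(inputTopic);
            while (!ct.IsCancellationRequested)
            {
                var cr = consumer.Consume(ct);
                if (predicate(cr.Message.Value))
                {
                    // Forward the message (key, value, headers) unchanged.
                    producer.Produce(outputTopic, cr.Message);
                }
            }
        }
    }
}
```

The abstraction I have in mind would hide all of this - you'd supply just the predicate and the topic names.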

This API simplicity allows for a lot of flexibility - there will be no constraints imposed by an overriding framework on what can and can't be done. For example, the (first?!) window aggregator I want to make won't have record-at-a-time processing in the way that naturally follows from the KStream and KTable abstractions in Kafka Streams - the only event that will result in window aggregate results being emitted, and input topic offsets being committed, will be a window being closed off to future updates. The advantage of this approach is that it allows for non-linear processing functions, like median calculations. The downside is giving up on record-at-a-time processing. Preliminary window aggregate results could potentially still be emitted, but these updates would always come with at-least-once semantics (even after transaction support, which enables exactly-once processing, gets added to the .NET client).
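To illustrate what I mean by a non-linear processing function: a median can't be maintained incrementally as records arrive - it needs all of the values in the window at once, which is exactly what a close-of-window-only aggregator can provide. A sketch (the name and signature here are hypothetical, not a real API):

```csharp
using System.Collections.Generic;
using System.Linq;

static class WindowAggregates
{
    // A non-linear aggregate: unlike a sum or count, the median requires
    // the entire window's values to compute. Assumes a non-empty window.
    public static double Median(IReadOnlyList<double> windowValues)
    {
        var sorted = windowValues.OrderBy(v => v).ToList();
        int mid = sorted.Count / 2;
        return sorted.Count % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```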

But Also Far From Simple

So far, I've conveniently avoided talking about materializing change-log record streams. For some use-cases, this is best handled by a third-party data store anyway. For streaming use-cases (e.g. stream-table joins), it's critical that this be integrated into the stream processing library.

What's motivating me here is actually the potential for Kafka to be used as a system-of-record. If you start trying to use Kafka in this way, you'll likely run into the issue of how to enforce various data constraints. For example, say you're storing change-log data for a user table in a compacted topic where each user has a unique id (the message key). You'll often also want to enforce that certain fields in the message value are unique - for example, username. How do you do this? I hope to go into more detail in a future blog post about what specifically the problem is and how you might go about solving it (after I've worked out more of the details myself!). But for now, I want to say that these problems turn out to be quite burdensome to solve in application code, and as far as I'm aware, they're generally left as an exercise for the user by streaming frameworks.
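To make the difficulty a bit more concrete, here's a sketch (all names hypothetical) of the naive approach - materialize a username index locally from the compacted topic, and check it before producing - and why it isn't enough on its own:

```csharp
using System.Collections.Generic;

class User { public string Username { get; set; } }

class UsernameIndex
{
    // Naive approach: materialize username -> user id into a local
    // dictionary as change-log messages are consumed back.
    private readonly Dictionary<string, string> usernameToId =
        new Dictionary<string, string>();

    public void Apply(string userId, User user)
        => usernameToId[user.Username] = userId;

    public bool TryRegister(string userId, User user)
    {
        // The problem: this check-then-produce sequence is racy. Another
        // application instance (or an in-flight message we haven't consumed
        // back yet) may have claimed the same username, so the constraint
        // can still be violated without some additional coordination
        // mechanism.
        if (usernameToId.ContainsKey(user.Username))
            return false;

        // ... produce the change-log message to the user topic here ...
        return true;
    }
}
```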

So... I'm aspiring to build an easy-to-use solution to this class of problem (which I see as super important to solve), and having done this, make sure it also fits nicely into the stream-processing paradigm to allow building out processing abstractions such as stream-table joins.

This is all pretty ambitious.

Other Abstractions

There are some other simple abstractions too that would be very useful. For example, a wrapper around the Producer that allows it to be used directly with the ASP.NET dependency injection framework, and a consumer API that doesn't require an explicit poll loop. I plan to make some of these as well.
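A minimal sketch of what the DI wrapper might look like (AddKafkaProducer is a made-up name, not an existing API) - register a single long-lived producer with the container so it can be injected wherever it's needed:

```csharp
using Confluent.Kafka;
using Microsoft.Extensions.DependencyInjection;

public static class KafkaServiceCollectionExtensions
{
    // Register one long-lived producer as a singleton - producers are
    // thread safe and expensive to create, so one per app is the norm.
    public static IServiceCollection AddKafkaProducer(
        this IServiceCollection services, string bootstrapServers)
    {
        return services.AddSingleton<IProducer<Null, string>>(_ =>
            new ProducerBuilder<Null, string>(
                new ProducerConfig { BootstrapServers = bootstrapServers })
            .Build());
    }
}
```

An application would then call services.AddKafkaProducer("localhost:9092") at startup and take IProducer<Null, string> as a constructor parameter.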

That pretty much covers my thoughts on what I want to make and why.

To summarize, this project is motivated by two things:

  1. With the basic client now getting quite mature, there's a lot of value that can be realized quite easily by building some simple higher level abstractions - I want to get these out quickly!

  2. I think there's potential for breaking new ground in how Kafka can be used as a backing for powerful, fit-for-purpose state stores. I'm motivated to work on this because I like Kafka so much I want to use it for everything :-).

Comments? Feedback? Feel free to engage on Twitter (@matt_howlett), or open an issue on the project's GitHub repo.

Disclaimers: This is a side-project. I may or may not follow through on the ideas outlined above - I'm currently excited by a few different project ideas and can't possibly tackle them all (if you want me to work on this one, you know what to do ...). These are my own thoughts and opinions. I work for Confluent, but I'm not involved with the stream processing team, haven't put these ideas past them, and have only really just started to think about this topic - i.e. my thoughts on it are likely to evolve.