A few months ago I started working on a new project at work. Its business logic is basically obtaining data from a source, applying some calculations, and writing the results out to a database. Sounds simple enough. Although the source data is provided at a specified time in a batch fashion, there is a future requirement for an hourly feed, so it seemed like a fairly reasonable idea to try to work with this in a streaming computation model. Given the rise of the Confluent Platform, our engineering organization thought we should give Kafka Streams a try. This was an adventure, because we had traditionally run Spark batch jobs and had limited experience with Kafka itself.
I was inclined to use Kafka Streams for the following reasons:
I had never used Kafka Streams before, so I was curious
After using Azkaban (and Luigi before it), I had a strong sentiment that tasks should be organized in an event-triggered fashion
I personally did not enjoy using the existing tech stack and wanted to use something completely new
I think I now feel comfortable enough with the framework itself to offer meaningful insight, and I want to write a retrospective for myself on the three points above.
Takeaway: Don’t underestimate the effort of doing something you haven’t done before.
All the business requirements were laid out in user stories, and all of them were fairly simple because they had been broken down quite a lot. Read in this data source, convert it into Avro objects. Ok cool. Seems simple enough. Based on my previous experience with MapReduce and Thrift, it's a fairly trivial task. I'm sure there are some things I need to learn about Kafka Connect, but it should still be a trivial task. <- This is what I thought, but experience proved otherwise.
Avro doesn't have a native notion of an optional field, so you need to search for the best way to model one. Now, why does the Schema Registry keep rejecting my messages? Ah, my schema change isn't backward compatible; that took a while to google. Hmm, Avro doesn't have a concept of datetime? It's actually an int/long epoch time with a logical type? Uhh, and I'm supposed to use this Kafka Connect library to convert from Java dates to Avro dates? And I have to use java.util.Date? Oh crap, and java.util.Date isn't thread-safe?
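For what it's worth, here is a minimal sketch of where I eventually landed, using Avro's Java SchemaBuilder: an "optional" field is a union of null and the real type with a default of null, and a timestamp is just a long carrying the timestamp-millis logical type. The record and field names here are made up for illustration.

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ReadingSchema {
    public static Schema build() {
        // Avro has no datetime type: a timestamp is a long annotated
        // with the timestamp-millis logical type.
        Schema timestampMillis = LogicalTypes.timestampMillis()
                .addToSchema(Schema.create(Schema.Type.LONG));

        return SchemaBuilder.record("Reading").namespace("com.example")
                .fields()
                .requiredString("id")
                // "Optional" in Avro = union of null and the type, defaulting to null.
                .optionalString("comment")
                .name("eventTime").type(timestampMillis).noDefault()
                .endRecord();
    }
}
```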
The above questions were all eventually answered, and the correct implementation took a few weeks to write, because there were numerous unexpected differences between my prior experience and the new stack. When planning tasks on a tech stack you haven't worked with, aggressively overestimate and treat every unturned stone as a potential landmine.
Takeaway: Read the documentation.
I had a left join that was basically combining two Avro objects into one. And since joins are always key-based in the Kafka Streams DSL, this join just had to work. However, I noticed that only some keys were producing the right values and some weren't. What the heck? Source topics A and B were easy to verify, I could literally see the data there, so why wasn't the join triggering?
Turns out, joins require co-partitioning of the data, and that is outlined in great detail in the documentation. Kafka Streams is actually one of the best-documented projects I've ever used, and I'd call its documentation the #1 resource that every Kafka Streams user should know from start to end.
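To make the co-partitioning point concrete, here is a minimal sketch of a DSL left join. The topic names and String value types are made up (my real job joined Avro objects, and this assumes String default serdes are configured): both input topics must share the key, the number of partitions, and the partitioning strategy, or records with the same key never land on the same task and the join silently misses them.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class JoinTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Both topics must be co-partitioned: same key, same partition count,
        // same partitioner — otherwise matching records never meet.
        KTable<String, String> left = builder.table("topic-a");
        KTable<String, String> right = builder.table("topic-b");

        // Left join: emits on every left-side update; b is null until the
        // right side has a record for the same key.
        left.leftJoin(right, (a, b) -> b == null ? a : a + "|" + b)
                .toStream()
                .to("joined-output");

        return builder;
    }
}
```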
Lurking around the Kafka project on GitHub and the Google group for the Confluent Platform taught me a lot about the platform as well. I feel fortunate that I got to work with frameworks that have a great community.
In the next article I will write about why I think Kafka Streams is pretty cool.