Case Studies#


How I do Case Studies#

Data Science at Airbnb#

| Podcast Episode: #063 Data Engineering At Airbnb Case Study |
| ------------------ |
| How is Airbnb doing data engineering? Let’s check it out. Watch on YouTube / Listen on Anchor |


Airbnb Engineering Blog:

Data Infrastructure:

Scaling the serving tier:

Druid Analytics:

Spark Streaming for logging events:

Druid Wiki:

Data Science at Amazon#

Data Science at Baidu#

Data Science at Blackrock#

Data Science at BMW#

Data Science at

| Podcast Episode: #064 Data Engineering at Case Study |
| ------------------ |
| How is doing data engineering? Let’s check it out. Watch on YouTube / Listen on Anchor |



Kafka Architecture:

Confluent Platform:

Data Science at CERN#

| Podcast Episode: #065 Data Engineering At CERN Case Study |
| ------------------ |
| How is CERN doing data engineering? They must get huge amounts of data from the Large Hadron Collider. Let’s check it out. Watch on YouTube / Listen on Anchor |


Data Science at Disney#

Data Science at DLR#

Data Science at Drivetribe#

Data Science at Dropbox#

Data Science at Ebay#

Data Science at Expedia#

Data Science at Facebook#

Data Science at Google#

Data Science at Grammarly#

Data Science at ING Fraud#

Data Science at Instagram#

Data Science at LinkedIn#

| Podcast Episode: #073 Data Engineering At LinkedIn Case Study |
| ------------------ |
| Let’s check out how LinkedIn is processing data :) Watch on YouTube / Listen on Anchor |


Data Science at Lyft#

Data Science at NASA#

| Podcast Episode: #067 Data Engineering At NASA Case Study |
| ------------------ |
| A look into how NASA is doing data engineering. Watch on YouTube / Listen on Anchor |


Data Science at Netflix#

| Podcast Episode: #062 Data Engineering At Netflix Case Study |
| ------------------ |
| How Netflix is doing data engineering using their Keystone platform. Watch on YouTube / Listen on Anchor |

Netflix revolutionized how we watch movies and TV. Currently over 75 million users watch 125 million hours of Netflix content every day!

Netflix's revenue comes from a monthly subscription service. So, the goal for Netflix is to keep you subscribed and to get new subscribers.

To achieve this, Netflix is licensing movies from studios as well as creating its own original movies and TV series.

But offering new content is not everything. It is also very important to keep you watching the content that already exists.

To recommend content, Netflix collects data from its users. And it collects a lot.

Currently, Netflix analyses about 500 billion user events per day. That results in a stunning 1.3 Petabytes every day.

All this data allows Netflix to build recommender systems for you. The recommenders show you content you might like, based on your viewing habits or on what is currently trending.

The Netflix batch processing pipeline#

When Netflix started out, they had a very simple batch processing system architecture.

The key components were Chukwa, a scalable data collection system, Amazon S3, and Elastic MapReduce.

Figure: Old Netflix Batch Processing Pipeline

Chukwa wrote incoming messages into Hadoop sequence files stored in Amazon S3. These files could then be analysed by Elastic MapReduce jobs.

Jobs were executed regularly on a daily and hourly basis. As a result, Netflix could learn how people used the service every hour or once a day.
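To illustrate what such an hourly batch job could compute, here is a minimal stdlib-only sketch that aggregates play events into per-hour view counts. The record fields and titles are made-up examples, not Netflix's actual schema, and a real job would run as Elastic MapReduce over sequence files in S3 rather than over an in-memory list:

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records, standing in for the Hadoop sequence files
# Chukwa wrote to S3 (field names are assumptions for illustration).
events = [
    {"user": "u1", "title": "House of Cards", "ts": "2016-05-01T08:13:00"},
    {"user": "u2", "title": "House of Cards", "ts": "2016-05-01T08:45:00"},
    {"user": "u1", "title": "Narcos",         "ts": "2016-05-01T09:02:00"},
]

def hourly_view_counts(events):
    """Aggregate play events into (hour, title) view counts,
    the kind of result an hourly batch job could produce."""
    counts = Counter()
    for e in events:
        hour = datetime.fromisoformat(e["ts"]).strftime("%Y-%m-%d %H:00")
        counts[(hour, e["title"])] += 1
    return counts

print(hourly_view_counts(events))
```

The important property is the batch cadence: such a job only answers "what happened in the last hour/day", which is exactly the limitation the next sections discuss.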

Know what customers want#

Because you are looking at the big picture, you can create new products. Netflix uses insights from big data to create new TV shows and movies.

They created House of Cards based on data. There is a very interesting TED talk about this you should watch:

How to use data to make a hit TV show | Sebastian Wernicke

Batch processing also helps Netflix to know the exact episode of a TV show that gets you hooked. Not only globally but for every country where Netflix is available.

Check out the article from The Verge:

They know exactly what show works in what country and what show does not.

It helps them create shows that work everywhere, or select which shows to license in different countries. Germany for instance does not have the full library that Americans have :(

We have to put up with only a small portion of TV shows and movies. If you have to select, why not select those that work best.

Batch processing is not enough#

As a data platform for generating insights, the Chukwa pipeline was a good start. It is very important to be able to create hourly and daily aggregated views of user behavior.

To this day Netflix is still doing a lot of batch processing jobs.

The only problem is: With batch processing you are basically looking into the past.

For Netflix, and data driven companies in general, looking into the past is not enough. They want a live view of what is happening.

The trending now feature#

One of the newer Netflix features is "Trending Now". To the average user it looks like "Trending Now" simply means "currently most watched".

This is what was displayed to me as trending while I was writing this on a Saturday morning at 8:00 in Germany. But there is so much more to it.

What is currently being watched is only a part of the data that is used to generate "Trending Now".

Figure: Netflix Trending Now Feature

"Trending now" is created based on two types of data sources: Play events and Impression events.

Exactly what those two event types contain is not officially communicated by Netflix. I did some research on the Netflix Techblog and this is what I found out:

Play events include which title you watched last, where you stopped watching, when you used the 30-second rewind, and so on. Impression events are collected as you browse the Netflix library: scrolling up and down, scrolling left or right, clicking on a movie, and so on.

Basically, play events log what you do while you are watching. Impression events capture what you do on Netflix while you are not watching something.
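To make the distinction concrete, here is a minimal sketch of the two event families as Python dicts. The field names are assumptions for illustration only; Netflix has not published its actual event schema:

```python
# Hypothetical payloads for the two event types the text describes.
# All field names are assumptions, not Netflix's real schema.
play_event = {
    "kind": "play",
    "title": "Stranger Things",
    "stopped_at_s": 1410,      # where the viewer stopped watching
    "used_30s_rewind": True,
}

impression_event = {
    "kind": "impression",
    "action": "scroll_right",  # browsing behaviour: scrolls, clicks, ...
    "row": "Trending Now",
}

def is_play_event(event):
    """Play events describe what you do while watching;
    impression events describe what you do while browsing."""
    return event["kind"] == "play"

print(is_play_event(play_event), is_play_event(impression_event))
```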

Netflix real-time streaming architecture#

Netflix uses three internet-facing services to exchange data with the clients' browsers or mobile apps. These services are simple Apache Tomcat based web services.

The service for receiving play events is called "Viewing History". Impression events are collected with the "Beacon" service.

The "Recommender Service" makes recommendations based on trend data available for clients.

Messages from the Beacon and Viewing History services are put into Apache Kafka. It acts as a buffer between the data services and the analytics.

Beacon and Viewing History publish messages to Kafka topics. The analytics system subscribes to these topics and gets the messages delivered automatically in first-in, first-out fashion.
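The publish/subscribe buffering described above can be sketched in plain Python. This is an in-memory stand-in to illustrate the FIFO topic idea, not the real Kafka API; topic names and message fields are assumptions:

```python
from collections import defaultdict, deque

class TopicBuffer:
    """Minimal in-memory stand-in for Kafka's role in the pipeline:
    producers append to a topic, consumers read in FIFO order."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        # Producers (Beacon, Viewing History) append at the tail.
        self.topics[topic].append(message)

    def consume(self, topic):
        # Consumers (the analytics system) read from the head: FIFO.
        return self.topics[topic].popleft()

buffer = TopicBuffer()
# Beacon and Viewing History each publish to their own topic.
buffer.publish("impressions", {"action": "click", "title": "Narcos"})
buffer.publish("play-events", {"title": "Narcos", "stopped_at_s": 90})

print(buffer.consume("impressions"))  # delivered first in, first out
```

The real Kafka adds partitioning, replication, and durable storage on top of this idea, which is what lets it act as a reliable buffer between the data services and the analytics.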

After the analytics, the workflow is straightforward. The trending data is stored in a Cassandra key-value store. The recommender service has access to Cassandra and makes the data available to the Netflix client.

Figure: Netflix Streaming Pipeline

The algorithms the analytics system uses to process all this data are not known to the public. They are a trade secret of Netflix.

What is known is the analytics tool they use. Back in February 2015 they wrote on the tech blog that they were using a custom-made tool.

They also stated that Netflix was going to replace the custom-made analytics tool with Apache Spark Streaming in the future. My guess is that they made the switch to Spark some time ago, because that post is more than a year old.
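Since the actual trending algorithm is secret, here is only a generic sketch of the kind of streaming aggregation "Trending Now" implies: counting plays per title over a sliding window of recent events. The window size, titles, and the whole approach are assumptions for illustration:

```python
from collections import Counter, deque

class TrendingWindow:
    """Sketch of a streaming 'trending' aggregation: count plays per
    title over the last N events. A stand-in only; Netflix's actual
    analytics (custom tool, later Spark Streaming) is not public."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest event as new ones arrive,
        # giving a simple sliding window over the stream.
        self.window = deque(maxlen=window_size)

    def record_play(self, title):
        self.window.append(title)

    def top(self, n=3):
        # Most-played titles within the current window.
        return Counter(self.window).most_common(n)

trending = TrendingWindow(window_size=5)
for title in ["Narcos", "Narcos", "House of Cards", "Narcos", "Ozark"]:
    trending.record_play(title)

print(trending.top(1))  # → [('Narcos', 3)]
```

In a real pipeline the window would be time-based rather than count-based, and the result would be written to the Cassandra key-value store for the recommender service to read.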

Data Science at OLX#

| Podcast Episode: #083 Data Engineering at OLX Case Study |
| ------------------ |
| This podcast is a case study about OLX with Senior Data Scientist Alexey Grigorev as guest. It was super fun. Watch on YouTube / Listen on Anchor |


Data Science at OTTO#

Data Science at Paypal#

Data Science at Pinterest#

| Podcast Episode: #069 Engineering Culture At Pinterest |
| ------------------ |
| In this podcast we look into the data platform and processing at Pinterest. Watch on YouTube / Listen on Anchor |


Data Science at Salesforce#

Data Science at Siemens Mindsphere#

| Podcast Episode: #059 What Is The Siemens Mindsphere IoT Platform? |
| ------------------ |
| The Internet of Things is a huge deal. There are many platforms available. But which one is actually good? Join me on a 50 minute dive into the Siemens Mindsphere online documentation. I have to say I was super unimpressed by what I found. Many limitations, unclear architecture and no pricing available? Not good! Watch on YouTube / Listen on Anchor |

Data Science at Slack#

Data Science at Spotify#

| Podcast Episode: #071 Data Engineering At Spotify Case Study |
| ------------------ |
| In this episode we are looking at data engineering at Spotify, my favorite music streaming service. How do they process all that data? Watch on YouTube / Listen on Anchor |


Data Science at Symantec#

Data Science at Tinder#

Data Science at Twitter#

| Podcast Episode: #072 Data Engineering At Twitter Case Study |
| ------------------ |
| How is Twitter doing data engineering? Oh man, they have a lot of cool things in place to share all these tweets. Watch on YouTube / Listen on Anchor |


Data Science at Uber#

Data Science at Upwork#

Data Science at Woot#

Data Science at Zalando#

| Podcast Episode: #087 Data Engineering At Zalando Case Study Talk |
| ------------------ |
| I had a great conversation about data engineering for online retailing with Michal Gancarski and Max Schultze. They showed Zalando’s data platform and how they build data pipelines. Super interesting, especially for AWS users. Watch on YouTube |

Do me a favor and give these guys a follow on LinkedIn:

LinkedIn of Michal:

LinkedIn of Max:

Zalando has a tech blog with more info, and there is also a meetup in Berlin:

Zalando Blog:

Next Zalando Data Engineering Meetup:

Interesting tools:


Delta Lake:

AWS Step Functions:

AWS States Language:

YouTube channel of the meetup:

Talk at Spark+AI Summit about Zalando's processing platform:

Talk at Strata London slides: