Case Studies#


How I do Case Studies#

Data Science at Airbnb#

| Podcast Episode: #063 Data Engineering At Airbnb Case Study |
| ------------------ |
| How is Airbnb doing data engineering? Let’s check it out. Watch on YouTube / Listen on Anchor |


Airbnb Engineering Blog:

Data Infrastructure:

Scaling the serving tier:

Druid Analytics:

Spark Streaming for logging events:

Druid Wiki:

Data Science at Amazon#

Data Science at Baidu#

Data Science at Blackrock#

Data Science at BMW#

Data Science at

| Podcast Episode: #064 Data Engineering at Case Study |
| ------------------ |
| How is doing data engineering? Let’s check it out. Watch on YouTube / Listen on Anchor |



Kafka Architecture:

Confluent Platform:

Data Science at CERN#

| Podcast Episode: #065 Data Engineering At CERN Case Study |
| ------------------ |
| How is CERN doing data engineering? They must get huge amounts of data from the Large Hadron Collider. Let’s check it out. Watch on YouTube / Listen on Anchor |


Data Science at Disney#

Data Science at DLR#

Data Science at Drivetribe#

Data Science at Dropbox#

Data Science at Ebay#

Data Science at Expedia#

Data Science at Facebook#

Data Science at Google#

Data Science at Grammarly#

Data Science at ING Fraud#

Data Science at Instagram#

Data Science at LinkedIn#

| Podcast Episode: #073 Data Engineering At LinkedIn Case Study |
| ------------------ |
| Let’s check out how LinkedIn is processing data :) Watch on YouTube / Listen on Anchor |


Data Science at Lyft#

Data Science at NASA#

| Podcast Episode: #067 Data Engineering At NASA Case Study |
| ------------------ |
| A look into how NASA is doing data engineering. Watch on YouTube / Listen on Anchor |


Data Science at Netflix#

| Podcast Episode: #062 Data Engineering At Netflix Case Study |
| ------------------ |
| How Netflix is doing data engineering using their Keystone platform. Watch on YouTube / Listen on Anchor |

Netflix revolutionized how we watch movies and TV. Currently over 75 million users watch 125 million hours of Netflix content every day!

Netflix's revenue comes from a monthly subscription service. So, the goal for Netflix is to keep you subscribed and to get new subscribers.

To achieve this, Netflix is licensing movies from studios as well as creating its own original movies and TV series.

But offering new content is not everything. It is also very important to keep you watching the content that already exists.

To recommend content, Netflix collects data from its users. And it collects a lot.

Currently, Netflix analyses about 500 billion user events per day. That results in a stunning 1.3 Petabytes every day.

All this data allows Netflix to build recommender systems for you. The recommenders show you content you might like, based on your viewing habits or on what is currently trending.

The Netflix batch processing pipeline#

When Netflix started out, they had a very simple batch processing system architecture.

The key components were Chukwa, a scalable data collection system, Amazon S3, and Elastic MapReduce.

Figure: Old Netflix Batch Processing Pipeline

Chukwa wrote incoming messages into Hadoop sequence files stored in Amazon S3. These files could then be analysed by Elastic MapReduce jobs.

Jobs were executed regularly on a daily and hourly basis. As a result, Netflix could learn how people used the service every hour or once a day.
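To illustrate what such an hourly batch job could compute, here is a minimal stdlib-only sketch that aggregates play events into per-hour view counts. The record fields and titles are made-up examples, not Netflix's actual schema, and a real job would run as Elastic MapReduce over sequence files in S3 rather than over an in-memory list:

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records, standing in for the Hadoop sequence files
# Chukwa wrote to S3 (field names are assumptions for illustration).
events = [
    {"user": "u1", "title": "House of Cards", "ts": "2016-05-01T08:13:00"},
    {"user": "u2", "title": "House of Cards", "ts": "2016-05-01T08:45:00"},
    {"user": "u1", "title": "Narcos",         "ts": "2016-05-01T09:02:00"},
]

def hourly_view_counts(events):
    """Aggregate play events into (hour, title) view counts,
    the kind of result an hourly batch job could produce."""
    counts = Counter()
    for e in events:
        hour = datetime.fromisoformat(e["ts"]).strftime("%Y-%m-%d %H:00")
        counts[(hour, e["title"])] += 1
    return counts

print(hourly_view_counts(events))
```

The important property is the batch cadence: such a job only answers "what happened in the last hour/day", which is exactly the limitation the next sections discuss.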

Know what customers want#

Because you are looking at the big picture, you can create new products. Netflix uses insights from big data to create new TV shows and movies.

They created House of Cards based on data. There is a very interesting TED talk about this you should watch:

How to use data to make a hit TV show | Sebastian Wernicke

Batch processing also helps Netflix to know the exact episode of a TV show that gets you hooked. Not only globally but for every country where Netflix is available.

Check out the article from The Verge:

They know exactly what show works in what country and what show does not.

It helps them create shows that work everywhere, or select which shows to license in different countries. Germany for instance does not have the full library that Americans have :(

We have to put up with only a small portion of TV shows and movies. If you have to select, why not select those that work best.

Batch processing is not enough#

As a data platform for generating insights, the Chukwa pipeline was a good start. It is very important to be able to create hourly and daily aggregated views of user behavior.

To this day Netflix is still doing a lot of batch processing jobs.

The only problem is: With batch processing you are basically looking into the past.

For Netflix, and data driven companies in general, looking into the past is not enough. They want a live view of what is happening.

The trending now feature#

One of the newer Netflix features is "Trending Now". To the average user it looks like "Trending Now" simply means "currently most watched".

This is what was displayed to me as trending while I was writing this on a Saturday morning at 8:00 in Germany. But there is so much more to it.

What is currently being watched is only a part of the data that is used to generate "Trending Now".

Figure: Netflix Trending Now Feature

"Trending now" is created based on two types of data sources: Play events and Impression events.

Exactly what those two event types contain is not officially communicated by Netflix. I did some research on the Netflix Techblog and this is what I found out:

Play events include which title you watched last, where you stopped watching, when you used the 30-second rewind, and so on. Impression events are collected as you browse the Netflix library: scrolling up and down, scrolling left or right, clicking on a movie, and so on.

Basically, play events log what you do while you are watching. Impression events capture what you do on Netflix while you are not watching something.
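To make the distinction concrete, here is a minimal sketch of the two event families as Python dicts. The field names are assumptions for illustration only; Netflix has not published its actual event schema:

```python
# Hypothetical payloads for the two event types the text describes.
# All field names are assumptions, not Netflix's real schema.
play_event = {
    "kind": "play",
    "title": "Stranger Things",
    "stopped_at_s": 1410,      # where the viewer stopped watching
    "used_30s_rewind": True,
}

impression_event = {
    "kind": "impression",
    "action": "scroll_right",  # browsing behaviour: scrolls, clicks, ...
    "row": "Trending Now",
}

def is_play_event(event):
    """Play events describe what you do while watching;
    impression events describe what you do while browsing."""
    return event["kind"] == "play"

print(is_play_event(play_event), is_play_event(impression_event))
```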

Netflix real-time streaming architecture#

Netflix uses three internet-facing services to exchange data with the clients' browsers or mobile apps. These services are simple Apache Tomcat based web services.

The service for receiving play events is called "Viewing History". Impression events are collected with the "Beacon" service.

The "Recommender Service" makes recommendations based on trend data available for clients.

Messages from the Beacon and Viewing History services are put into Apache Kafka. It acts as a buffer between the data services and the analytics.

Beacon and Viewing History publish messages to Kafka topics. The analytics system subscribes to these topics and gets the messages delivered automatically in first-in, first-out fashion.
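The publish/subscribe buffering described above can be sketched in plain Python. This is an in-memory stand-in to illustrate the FIFO topic idea, not the real Kafka API; topic names and message fields are assumptions:

```python
from collections import defaultdict, deque

class TopicBuffer:
    """Minimal in-memory stand-in for Kafka's role in the pipeline:
    producers append to a topic, consumers read in FIFO order."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        # Producers (Beacon, Viewing History) append at the tail.
        self.topics[topic].append(message)

    def consume(self, topic):
        # Consumers (the analytics system) read from the head: FIFO.
        return self.topics[topic].popleft()

buffer = TopicBuffer()
# Beacon and Viewing History each publish to their own topic.
buffer.publish("impressions", {"action": "click", "title": "Narcos"})
buffer.publish("play-events", {"title": "Narcos", "stopped_at_s": 90})

print(buffer.consume("impressions"))  # delivered first in, first out
```

The real Kafka adds partitioning, replication, and durable storage on top of this idea, which is what lets it act as a reliable buffer between the data services and the analytics.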

After the analytics, the workflow is straightforward. The trending data is stored in a Cassandra key-value store. The recommender service has access to Cassandra and makes the data available to the Netflix client.

Figure: Netflix Streaming Pipeline

The algorithms the analytics system uses to process all this data are not known to the public. They are a trade secret of Netflix.

What is known is the analytics tool they use. Back in February 2015 they wrote on the tech blog that they were using a custom-made tool.

They also stated that Netflix was going to replace the custom-made analytics tool with Apache Spark Streaming in the future. My guess is that they made the switch to Spark some time ago, because that post is more than a year old.
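Since the actual trending algorithm is secret, here is only a generic sketch of the kind of streaming aggregation "Trending Now" implies: counting plays per title over a sliding window of recent events. The window size, titles, and the whole approach are assumptions for illustration:

```python
from collections import Counter, deque

class TrendingWindow:
    """Sketch of a streaming 'trending' aggregation: count plays per
    title over the last N events. A stand-in only; Netflix's actual
    analytics (custom tool, later Spark Streaming) is not public."""

    def __init__(self, window_size):
        # deque with maxlen drops the oldest event as new ones arrive,
        # giving a simple sliding window over the stream.
        self.window = deque(maxlen=window_size)

    def record_play(self, title):
        self.window.append(title)

    def top(self, n=3):
        # Most-played titles within the current window.
        return Counter(self.window).most_common(n)

trending = TrendingWindow(window_size=5)
for title in ["Narcos", "Narcos", "House of Cards", "Narcos", "Ozark"]:
    trending.record_play(title)

print(trending.top(1))  # → [('Narcos', 3)]
```

In a real pipeline the window would be time-based rather than count-based, and the result would be written to the Cassandra key-value store for the recommender service to read.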

Data Science at OLX#

| Podcast Episode: #083 Data Engineering at OLX Case Study |
| ------------------ |
| This podcast is a case study about OLX with Senior Data Scientist Alexey Grigorev as guest. It was super fun. Watch on YouTube / Listen on Anchor |


Data Science at OTTO#

Data Science at Paypal#

Data Science at Pinterest#

| Podcast Episode: #069 Engineering Culture At Pinterest |
| ------------------ |
| In this podcast we look into the data platform and processing at Pinterest. Watch on YouTube / Listen on Anchor |


Data Science at Salesforce#

Data Science at Siemens Mindsphere#

| Podcast Episode: #059 What Is The Siemens Mindsphere IoT Platform? |
| ------------------ |
| The Internet of Things is a huge deal. There are many platforms available. But which one is actually good? Join me on a 50 minute dive into the Siemens Mindsphere online documentation. I have to say I was super unimpressed by what I found. Many limitations, unclear architecture and no pricing available? Not good! Watch on YouTube / Listen on Anchor |

Data Science at Slack#

Data Science at Spotify#

| Podcast Episode: #071 Data Engineering At Spotify Case Study |
| ------------------ |
| In this episode we are looking at data engineering at Spotify, my favorite music streaming service. How do they process all that data? Watch on YouTube / Listen on Anchor |


Data Science at Symantec#

Data Science at Tinder#

Data Science at Twitter#

| Podcast Episode: #072 Data Engineering At Twitter Case Study |
| ------------------ |
| How is Twitter doing data engineering? Oh man, they have a lot of cool things in place to share all these tweets. Watch on YouTube / Listen on Anchor |


Data Science at Uber#

Data Science at Upwork#

Data Science at Woot#

Data Science at Zalando#

| Podcast Episode: #087 Data Engineering At Zalando Case Study Talk |
| ------------------ |
| I had a great conversation about data engineering for online retailing with Michal Gancarski and Max Schultze. They showed Zalando’s data platform and how they build data pipelines. Super interesting, especially for AWS users. Watch on YouTube |

Do me a favor and give these guys a follow on LinkedIn:

LinkedIn of Michal:

LinkedIn of Max:

Zalando has a tech blog with more info, and there is also a meetup in Berlin:

Zalando Blog:

Next Zalando Data Engineering Meetup:

Interesting tools:


Delta Lake:

AWS Step Functions:

AWS States Language:

YouTube channel of the meetup:

Talk at Spark+AI Summit about Zalando's processing platform:

Talk at Strata London slides: