- Data Science @Airbnb
- Data Science @Amazon
- Data Science @Baidu
- Data Science @Blackrock
- Data Science @BMW
- Data Science @Booking.com
- Data Science @CERN
- Data Science @Disney
- Data Science @DLR
- Data Science @Drivetribe
- Data Science @Dropbox
- Data Science @Ebay
- Data Science @Expedia
- Data Science @Facebook
- Data Science @Google
- Data Science @Grammarly
- Data Science @ING Fraud
- Data Science @Instagram
- Data Science @LinkedIn
- Data Science @Lyft
- Data Science @NASA
- Data Science @Netflix
- Data Science @OLX
- Data Science @OTTO
- Data Science @Paypal
- Data Science @Pinterest
- Data Science @Salesforce
- Data Science @Siemens Mindsphere
- Data Science @Slack
- Data Science @Spotify
- Data Science @Symantec
- Data Science @Tinder
- Data Science @Twitter
- Data Science @Uber
- Data Science @Upwork
- Data Science @Woot
- Data Science @Zalando
Airbnb Engineering Blog: https://medium.com/airbnb-engineering
Data Infrastructure: https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
Spark Streaming for logging events: https://medium.com/airbnb-engineering/scaling-spark-streaming-for-logging-event-ingestion-4a03141d135d
-Druid Wiki: https://en.wikipedia.org/wiki/Apache_Druid
Kafka Architecture: https://data-flair.training/blogs/kafka-architecture/
Confluent Platform: https://www.confluent.io/product/confluent-platform/
| Podcast Episode: #065 Data Engineering At CERN Case Study |------------------| |How is CERN doing Data Engineering? They must get huge amounts of data from the Large Hadron Collider. Let’s check it out. | Watch on YouTube \ Listen on Anchor|
http://www.unofficialgoogledatascience.com/\ https://ai.google/research/teams/ai-fundamentals-applications/\ https://cloud.google.com/solutions/big-data/\ https://datafloq.com/read/google-applies-big-data-infographic/385
Netflix revolutionized how we watch movies and TV. Currently over 75 million users watch 125 million hours of Netflix content every day!
Netflix's revenue comes from a monthly subscription service. So, the goal for Netflix is to keep you subscribed and to get new subscribers.
To achieve this, Netflix is licensing movies from studios as well as creating its own original movies and TV series.
But offering new content is not everything. What is also very important is, to keep you watching content that already exists.
To be able to recommend you content, Netflix is collecting data from users. And it is collecting a lot.
Currently, Netflix analyses about 500 billion user events per day. That results in a stunning 1.3 Petabytes every day.
All this data allows Netflix to build recommender systems for you. The recommenders are showing you content that you might like, based on your viewing habits, or what is currently trending.
When Netflix started out, they had a very simple batch processing system architecture.
The key components were Chuckwa, a scalable data collection system, Amazon S3 and Elastic MapReduce.
Chuckwa wrote incoming messages into Hadoop sequence files, stored in Amazon S3. These files then could be analysed by Elastic MapReduce jobs.
Netflix batch processing pipeline Jobs were executed regularly on a daily and hourly basis. As a result, Netflix could learn how people used the services every hour or once a day.
Because you are looking at the big picture you can create new products. Netflix uses insight from big data to create new TV shows and movies.
They created House of Cards based on data. There is a very interesting TED talk about this you should watch:
Batch processing also helps Netflix to know the exact episode of a TV show that gets you hooked. Not only globally but for every country where Netflix is available.
Check out the article from TheVerge
They know exactly what show works in what country and what show does not.
It helps them create shows that work in everywhere or select the shows to license in different countries. Germany for instance does not have the full library that Americans have :(
We have to put up with only a small portion of TV shows and movies. If you have to select, why not select those that work best.
As a data platform for generating insight the Cuckwa pipeline was a good start. It is very important to be able to create hourly and daily aggregated views for user behavior.
To this day Netflix is still doing a lot of batch processing jobs.
The only problem is: With batch processing you are basically looking into the past.
For Netflix, and data driven companies in general, looking into the past is not enough. They want a live view of what is happening.
One of the newer Netflix features is "Trending now". To the average user it looks like that "Trending Now" means currently most watched.
This is what I get displayed as trending while I am writing this on a Saturday morning at 8:00 in Germany. But it is so much more.
What is currently being watched is only a part of the data that is used to generate "Trending Now".
"Trending now" is created based on two types of data sources: Play events and Impression events.
What messages those two types actually include is not really communicated by Netflix. I did some research on the Netflix Techblog and this is what I found out:
Play events include what title you have watched last, where you did stop watching, where you used the 30s rewind and others. Impression events are collected as you browse the Netflix Library like scroll up and down, scroll left or right, click on a movie and so on.
Basically, play events log what you do while you are watching. Impression events are capturing what you do on Netflix, while you are not watching something.
Netflix uses three internet facing services to exchange data with the client's browser or mobile app. These services are simple Apache Tomcat based web services.
The service for receiving play events is called "Viewing History". Impression events are collected with the "Beacon" service.
The "Recommender Service" makes recommendations based on trend data available for clients.
Messages from the Beacon and Viewing History services are put into Apache Kafka. It acts as a buffer between the data services and the analytics.
Beacon and Viewing History publish messages to Kafka topics. The analytics subscribes to the topics and gets the messages automatically delivered in a first in first out fashion.
After the analytics the workflow is straight forward. The trending data is stored in a Cassandra Key-Value store. The recommender service has access to Cassandra and is making the data available to the Netflix client.
The algorithms how the analytics system is processing all this data is not known to the public. It is a trade secret of Netflix.
What is known, is the analytics tool they use. Back in Feb 2015 they wrote in the tech blog that they use a custom made tool.
They also stated, that Netflix is going to replace the custom made analytics tool with Apache Spark streaming in the future. My guess is, that they did the switch to Spark some time ago, because their post is more than a year old.
| Podcast Episode: #083 Data Engineering at OLX Case Study |------------------| |This podcast is a case study about OLX with Senior Data Scientist Alexey Grigorev as guest. It was super fun. | Watch on YouTube \ Listen on Anchor|
| Podcast Episode: #059 What Is The Siemens Mindsphere IoT Platform? |------------------| |The Internet of things is a huge deal. There are many platforms available. But, which one is actually good? Join me on a 50 minute dive into the Siemens Mindsphere online documentation. I have to say I was super unimpressed by what I found. Many limitations, unclear architecture and no pricing available? Not good! | Watch on YouTube \ Listen on Anchor|
| Podcast Episode: #071 Data Engineering At Spotify Case Study |------------------| |In this episode we are looking at data engineering at Spotify, my favorite music streaming service. How do they process all that data? | Watch on YouTube \ Listen on Anchor|
| Podcast Episode: #072 Data Engineering At Twitter Case Study |------------------| |How is Twitter doing data engineering? Oh man, they have a lot of cool things to share these tweets. | Watch on YouTube \ Listen on Anchor|
| Podcast Episode: #087 Data Engineering At Zalando Case Study Talk |------------------| |I had a great conversation about data engineering for online retailing with Michal Gancarski and Max Schultze. They showed Zalando’s data platform and how they build data pipelines. Super interesting especially for AWS users. | Watch on YouTube
Do me a favor and give these guys a follow on LinkedIn:
LinkedIn of Michal: https://www.linkedin.com/in/michalgancarski/
LinkedIn of Max: https://www.linkedin.com/in/max-schultze-b11996110/
Zalando has a tech blog with more infos and there is also a meetup in Berlin:
Zalando Blog: https://jobs.zalando.com/tech/blog/
Next Zalando Data Engineering Meetup: https://www.meetup.com/Zalando-Tech-Events-Berlin/events/262032282/
Delta Lake: https://delta.io/
AWS Step Functions: [https://aws.amazon.com/step-functions/ AWS State Language: https://states-language.net/spec.html](https://aws.amazon.com/step-functions/ AWS State Language: https://states-language.net/spec.html)
Youtube channel of the meetup: [https://www.youtube.com/channel/UCxwul7aBm2LybbpKGbCOYNA/playlists talk at Spark+AI](https://www.youtube.com/channel/UCxwul7aBm2LybbpKGbCOYNA/playlists talk at Spark+AI)
Summit about Zalando's Processing Platform: https://databricks.com/session/continuous-applications-at-scale-of-100-teams-with-databricks-delta-and-structured-streaming