Data Engineering Course: Building A Data Platform#

What We Want To Do#

  • Use Twitter data to predict the best time to post with the hashtag datascience or ai

  • Find the top tweets of the day

  • Find the top users

  • Analyze sentiment and keywords

Thoughts On Choosing A Development Environment#

A local development environment needs a reasonably capable PC. I put some thought into a budget build for around 1,000 dollars or euros.

| Podcast Episode: #068 How to Build a Budget Data Science PC |
|------------------|
| In this podcast we look into configuring a sub 1000 dollar PC for data engineering and machine learning. |
| Watch on YouTube \ Listen on Anchor |

A Look Into the Twitter API#

| Podcast Episode: #081 Twitter API Research |
|------------------|
| In this podcast we look into how the Twitter API works and how you get access to it. |
| Watch on YouTube |

Ingesting Tweets with Apache Nifi#

| Podcast Episode: #082 Reading Tweets With Apache Nifi & IaaS vs PaaS vs SaaS |
|------------------|
| In this podcast we are trying to read Twitter data with Nifi. |
| Watch on YouTube |

| Podcast Episode: #085 Trying to read Tweets with Nifi Part 2 |
|------------------|
| We are looking into the Big Data landscape chart and we are trying to read Twitter data with Nifi again. |
| Watch on YouTube |

Writing from Nifi to Apache Kafka#

| Podcast Episode: #086 How to Write from Nifi to Kafka Part 1 |
|------------------|
| I’ve been working a lot on the cookbook, because it’s so much fun, and I tell you what I added. Then we try to write the tweets from Apache Nifi into Kafka and also talk about Kafka basics. |
| Watch on YouTube |

| Podcast Episode: #088 How to Write from Nifi to Kafka Part 2 |
|------------------|
| In this podcast we finally figure out how to write to Kafka from Nifi. The problem was the network configuration of the Docker containers. |
| Watch on YouTube |

Apache Zeppelin#

Install and Ingest Kafka Topic#

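Start the Zeppelin container. The command maps host port 8081 to port 8080 inside the container, mounts local folders for logs and notebooks, and attaches the container to the app-tier Docker network: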

docker run -d -p 8081:8080 --rm \
-v /Users/xxxx/Documents/DockerFiles/logs:/logs \
-v /Users/xxxx/Documents/DockerFiles/Notebooks:/notebook \
-e ZEPPELIN_LOG_DIR='/logs' \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
--network app-tier --name zeppelin apache/zeppelin:0.7.3
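
The command above attaches the container to a Docker network named app-tier, and `docker run` fails if that network does not exist. Assuming the network has not already been created in an earlier step (for example while setting up the other containers), a small helper along these lines could create it on demand. The function name `ensure_network` is hypothetical and only used for illustration; this is a sketch, not part of the original setup.

```shell
# Hypothetical helper: create a Docker network only if it is missing.
# Assumes the docker CLI is installed and the daemon is running.
ensure_network() {
  # "docker network inspect" exits non-zero when the network is unknown.
  if ! docker network inspect "$1" >/dev/null 2>&1; then
    docker network create "$1"
  fi
}
```

Running `ensure_network app-tier` once before the `docker run` command should be enough. Calling it again is harmless, because the network is only created when the inspect step fails.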

Processing Messages with Spark and SparkSQL#

Visualizing Data#

Switch Processing from Zeppelin to Spark#

Install Spark#

Ingest Messages from Kafka#

Writing from Spark to Kafka#

Move Zeppelin Code to Spark#