Data Engineering Course: Building A Data Platform


What We Want To Do

  • Use Twitter data to predict the best time to post with the hashtag #datascience or #ai

  • Find the top tweets of the day

  • Identify the top users

  • Analyze sentiment and keywords
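The first three goals above come down to simple aggregations over tweet records. The following is only a minimal sketch of that idea in plain Python, not the course's implementation. The sample records and the field names (`user`, `created_at`, `likes`, `retweets`) are hypothetical and do not mirror the Twitter API schema, and "engagement" is assumed here to mean likes plus retweets.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample records; field names and values are illustrative only
# and do not reflect the real Twitter API response format.
tweets = [
    {"user": "alice", "text": "Intro to #datascience",
     "created_at": "2020-01-06T09:15:00", "likes": 12, "retweets": 3},
    {"user": "bob", "text": "New #ai paper",
     "created_at": "2020-01-06T09:40:00", "likes": 30, "retweets": 10},
    {"user": "alice", "text": "More #ai notes",
     "created_at": "2020-01-06T18:05:00", "likes": 4, "retweets": 1},
]


def engagement(tweet):
    """Assumed engagement score: likes plus retweets."""
    return tweet["likes"] + tweet["retweets"]


def best_hour(tweets):
    """Return the hour of day with the highest average engagement."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for tweet in tweets:
        hour = datetime.fromisoformat(tweet["created_at"]).hour
        totals[hour] += engagement(tweet)
        counts[hour] += 1
    return max(totals, key=lambda h: totals[h] / counts[h])


def top_tweets(tweets, n=2):
    """Return the n tweets with the highest engagement."""
    return sorted(tweets, key=engagement, reverse=True)[:n]


def top_users(tweets, n=2):
    """Return (user, total engagement) pairs, highest first."""
    per_user = defaultdict(int)
    for tweet in tweets:
        per_user[tweet["user"]] += engagement(tweet)
    return sorted(per_user.items(), key=lambda kv: kv[1], reverse=True)[:n]


print(best_hour(tweets))
print([t["text"] for t in top_tweets(tweets, n=1)])
print(top_users(tweets))
```

On the platform itself the same aggregations would run over the ingested tweet stream rather than an in-memory list, but the grouping logic stays the same.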

Thoughts On Choosing A Development Environment

A local development environment needs a capable PC. I considered a budget build costing around 1,000 dollars or euros.

A Look Into the Twitter API

Ingesting Tweets with Apache Nifi

Writing from Nifi to Apache Kafka

Apache Zeppelin

Install and Ingest Kafka Topic

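Start the Zeppelin container. The command publishes the Zeppelin web UI (container port 8080) on host port 8081, mounts local folders for logs and notebooks, and expects the Docker network app-tier to exist already: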

docker run -d -p 8081:8080 --rm \
-v /Users/xxxx/Documents/DockerFiles/logs:/logs \
-v /Users/xxxx/Documents/DockerFiles/Notebooks:/notebook \
-e ZEPPELIN_LOG_DIR='/logs' \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
--network app-tier --name zeppelin apache/zeppelin:0.7.3

Processing Messages with Spark and SparkSQL

Visualizing Data

Switch Processing from Zeppelin to Spark

Install Spark

Ingest Messages from Kafka

Writing from Spark to Kafka

Move Zeppelin Code to Spark