Real-time analysis of Twitter data using Kubernetes, PubSub and BigQuery
Google Cloud PubSub provides many-to-many, asynchronous messaging that decouples senders and receivers. It allows for secure and highly available communication between independently written applications and delivers low-latency, durable messaging. It has just gone to Beta, and is available for anyone to try.
This example Kubernetes app shows how to build a ‘pipeline’ to stream Twitter data into BigQuery using PubSub.
The app uses uses PubSub to buffer the data coming in from Twitter and to decouple ingestion from processing. One of the Kubernetes app pods reads the data from Twitter and publishes it to a PubSub topic. Other pods subscribe to the PubSub topic, grab data in small batches, and stream it into BigQuery. The figure below suggests this flow.
This app can be thought of as a ‘workflow’ type of app– it doesn’t have a web front end (though Kubernetes is great for those types of apps as well). Instead, it is designed to continously run a scalable data ingestion pipeline.
Find the code and more detail here.