Tutorial: Building An Analytics Data Pipeline In Python
If you've ever wanted to work with streaming data, or data that changes quickly, you may be familiar with the concept of a data pipeline. Data pipelines allow you to transform data from one representation to another through a series of steps. They're a key part of data engineering, which we teach in our new Data Engineer Path. In this tutorial, we'll walk through building a data pipeline using Python and SQL.
A common use case for a data pipeline is figuring out information about the visitors to your website. If you're familiar with Google Analytics, you know the value of seeing real-time and historical information on visitors. In this post, we'll use data from web server logs to answer questions about our visitors.
In case you're not familiar: every time you visit a web page, such as the Dataquest Blog, your browser receives data from a web server. To host this blog, we use a high-performance web server called Nginx.
The process of sending a request from a web browser to a server.
First, the client sends a request to the web server asking for a certain page. The web server then fetches the page from the filesystem and returns it to the client (the web server could also generate a dynamic page, but we won't worry about that case right now). While serving requests, the web server writes lines to a log file on the filesystem that contain metadata about the client and the request. This log enables someone to later see who visited which pages on the website at what time, and perform other analysis.
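For instance, a line in Nginx's commonly used "combined" log format looks something like this (the values here are invented for illustration):

```
127.0.0.1 - - [09/Mar/2017:01:15:59 +0000] "GET /blog/ HTTP/1.0" 200 3437 "https://www.google.com/" "Mozilla/5.0 ..."
```

Each line records the client IP, the request time, the request itself, the response status, the number of bytes sent, the referrer, and the user agent.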
One example would be figuring out how many visitors from each country visit your site each day. This can help you decide which countries to focus your marketing efforts on. At the simplest level, just knowing how many visitors you have per day can help you understand whether your marketing efforts are working.
To calculate these metrics, we need to parse the log files and analyze them. In order to do this, we need to construct a data pipeline.
Thinking about the data pipeline
Here's a simple example of a data pipeline that calculates how many visitors have visited the site each day:
We go from raw log data to a dashboard where we can see visitor counts per day. Note that this pipeline runs continuously: as new entries are added to the server log, it grabs them and processes them. There are a few things to note about how we've structured the pipeline (a short sketch illustrating these ideas follows the list):
Each pipeline component is separated from the others, takes in a defined input, and returns a defined output.
Although we don't show it here, those outputs can be cached or persisted for further analysis.
We store the raw log data in a database. This ensures that if we ever want to run a different analysis, we have access to all of the raw data.
We remove duplicate records. It's very easy to introduce duplicate data into your analysis process, so deduplicating before passing data through the pipeline is critical.
Each pipeline component feeds data into another component. We want to keep each component as small as possible, so that we can individually scale pipeline components up, or use their outputs for a different type of analysis.
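As a toy illustration of that component structure (the function names here are made up, not from the repo), each stage can be a small generator that feeds the next:

```python
raw_log_lines = [
    '127.0.0.1 - - [09/Mar/2017:01:15:59 +0000] "GET / HTTP/1.0" 200 512',
    '127.0.0.1 - - [09/Mar/2017:01:15:59 +0000] "GET / HTTP/1.0" 200 512',  # duplicate
]

def dedupe(lines):
    """Drop exact duplicate lines before they move down the pipeline."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

def parse(lines):
    """Split each raw log line into whitespace-separated fields."""
    for line in lines:
        yield line.split()

# Each component's output is the next component's input, so any stage
# can be swapped out or scaled independently.
for fields in parse(dedupe(raw_log_lines)):
    print(fields)
```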
Now that we've seen what this pipeline looks like at a high level, let's implement it in Python.
Processing and storing web server logs
In order to create our data pipeline, we'll need access to web server log data. We created a script that will continuously generate fake (but somewhat realistic) log data. Here's how to follow along with this post:
Clone this repo.
Follow the README to install the Python requirements.
Run python log_generator.py.
After running the script, you'll see new entries written to log_a.txt in the same folder. Once 100 lines have been written to log_a.txt, the script will rotate to writing log_b.txt. It keeps switching between the two files every 100 lines.
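The repo's log_generator.py takes care of this; as a rough sketch of the rotation idea (the line format, timing, and structure here are invented, not the actual script), it might look like:

```python
import random
import time

LOG_FILES = ["log_a.txt", "log_b.txt"]
LINES_PER_FILE = 100

current, count = 0, 0
while True:
    # Fabricate a log line with a random client IP.
    ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
    line = f'{ip} - - [09/Mar/2017:01:15:59 +0000] "GET / HTTP/1.0" 200 512\n'
    with open(LOG_FILES[current], "a") as f:
        f.write(line)
    count += 1
    if count % LINES_PER_FILE == 0:
        # Rotate to the other file every 100 lines.
        current = (current + 1) % len(LOG_FILES)
    time.sleep(0.1)
```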
Once we've started the script, we just need to write some code to ingest (or read in) the logs. The script will need to:
Open the log files and read from them line by line.
Parse each line into fields.
Write each line and the parsed fields to a database.
Ensure that duplicate lines aren't written to the database.
The code for this is in the store_logs.py file in this repo if you want to follow along.
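To give a feel for steps 2 through 4, here's a minimal sketch using SQLite. The table layout, field handling, and database file name below are assumptions for illustration, not the repo's actual schema:

```python
import sqlite3

conn = sqlite3.connect("logs.db")  # hypothetical database file
# The UNIQUE constraint makes re-inserting a duplicate line a silent no-op.
conn.execute(
    "CREATE TABLE IF NOT EXISTS logs "
    "(raw_line TEXT UNIQUE, remote_addr TEXT, request TEXT, status TEXT)"
)

def parse_line(line):
    """Naively split a combined-format log line into a few fields."""
    parts = line.split()
    remote_addr = parts[0]
    request = " ".join(parts[5:8]).strip('"')  # e.g. 'GET / HTTP/1.0'
    status = parts[8]
    return remote_addr, request, status

def store_line(line):
    """Write the raw line plus parsed fields, skipping duplicates."""
    remote_addr, request, status = parse_line(line)
    conn.execute(
        "INSERT OR IGNORE INTO logs VALUES (?, ?, ?, ?)",
        (line, remote_addr, request, status),
    )
    conn.commit()
```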
In order to achieve our first goal, we can open the files and keep trying to read lines from them.
The code below will:
Open both log files in read mode.
Loop forever.
Figure out where the current read position is in both files (using the tell method).
Attempt to read a single line from both files (using the readline method).
If neither file had a line written to it, sleep for a bit and then try again.
Before sleeping, set the read position back to where we originally were (before calling readline), so that we don't miss anything (using the seek method).
If one file had a line written to it, grab that line. Recall that only one file can be written to at a time, so we can't grab lines from both files.
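Here's a sketch of that loop; see store_logs.py in the repo for the exact version:

```python
import time

LOG_FILE_A = "log_a.txt"
LOG_FILE_B = "log_b.txt"

f_a = open(LOG_FILE_A, "r")
f_b = open(LOG_FILE_B, "r")

while True:
    # Remember where we are in each file before attempting a read.
    where_a = f_a.tell()
    line_a = f_a.readline()
    where_b = f_b.tell()
    line_b = f_b.readline()

    if not line_a and not line_b:
        # Neither file has a new line yet -- rewind and wait before retrying.
        time.sleep(1)
        f_a.seek(where_a)
        f_b.seek(where_b)
        continue

    # Only one file is written to at a time, so at most one line is non-empty.
    line = line_a if line_a else line_b
    # ...hand the line off to the next pipeline step (parsing and storage)...
```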