Web Analytics Application
A Cask Data Application Platform (CDAP) tutorial demonstrating how to perform analytics using access logs.
Overview
This tutorial provides the basic steps for the development of a data application using the Cask Data Application Platform (CDAP). We will use a web analytics application to demonstrate how to develop with CDAP and how CDAP helps when building data applications that run in the Hadoop ecosystem.
Web analytics applications are commonly used to generate statistics and to provide insights about web usage through the analysis of web traffic. A typical web analytics application consists of three components:
- Data Collection: Collects and persists web logs for further processing;
- Data Analysis: Analyses the collected data and produces different measurements; and
- Insights Discovery: Extracts insights from the analysis results.
Additionally, it's important that the application be scalable, fault tolerant, and easy to operate. It needs to support ever-increasing amounts of data and be flexible enough in its design to accommodate new application requirements.
In this tutorial, we'll show how easy it is to build a web analytics application with CDAP. In particular, we'll use these CDAP components:
- A stream for web server log collection and persistence to the file system;
- A flow for real-time data analysis over collected logs; and
- SQL Queries to explore and develop insights from the data.
How It Works
In this section, we'll go through the details of developing a web analytics application using CDAP.
Data Collection with a Stream
The sole data source that the web analytics application uses is web server logs. Log events are ingested into a stream called log using the RESTful API provided by CDAP.
Once an event is ingested into a stream, it is persisted and available for processing.
Data Analysis using a Flow
The web analytics application uses a flow, the real-time data processor in CDAP, to produce real-time analytics from the web server logs. A flow contains one or more flowlets that are wired together into a directed acyclic graph or DAG.
To keep the example simple, we only compute the total visit count for each IP visiting the site.
We use a flowlet of type UniqueVisitor to keep track of the unique visit counts from each client.
The flowlet does this in three steps:
- Read a log event from the log stream;
- Parse the client IP from the log event; and
- Increment the visit count of that client IP by 1 and persist the change.
The result of the increment is persisted to a custom dataset, UniqueVisitCount.
Here is what the UniqueVisitor flowlet looks like:
public class UniqueVisitor extends AbstractFlowlet {

  // Request an instance of UniqueVisitCount Dataset
  @UseDataSet("UniqueVisitCount")
  private UniqueVisitCount table;

  @ProcessInput
  public void process(StreamEvent streamEvent) {
    // Decode the log line as String
    String event = Charset.forName("UTF-8").decode(streamEvent.getBody()).toString();
    // The first entry in the log event is the IP address
    String ip = event.substring(0, event.indexOf(' '));
    // Increments the visit count of a given IP by 1
    table.increment(ip, 1L);
  }
}
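The parsing and counting logic of the flowlet can be tried outside CDAP. The following is a minimal sketch and not part of the example: the VisitCounterSketch class and its in-memory HashMap are hypothetical stand-ins for the flowlet and the UniqueVisitCount dataset. Unlike the flowlet above, it treats a line that contains no space as the IP itself instead of throwing an exception.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the UniqueVisitor flowlet; it does not use the CDAP API.
class VisitCounterSketch {

  // In-memory replacement for the UniqueVisitCount dataset (an assumption for illustration)
  private final Map<String, Long> counts = new HashMap<>();

  // Returns the text before the first space, or the whole line if it has no space
  static String parseIp(String logLine) {
    int space = logLine.indexOf(' ');
    return space < 0 ? logLine : logLine.substring(0, space);
  }

  // Mirrors table.increment(ip, 1L): adds 1 to the count stored for the client IP
  void process(String logLine) {
    counts.merge(parseIp(logLine), 1L, Long::sum);
  }

  // Returns the visit count recorded for an IP, or 0 if the IP has not been seen
  long countFor(String ip) {
    return counts.getOrDefault(ip, 0L);
  }

  public static void main(String[] args) {
    VisitCounterSketch counter = new VisitCounterSketch();
    counter.process("192.168.252.135 - - [14/Jan/2014:00:12:51 -0400] 'GET /products HTTP/1.1' 500 182");
    counter.process("192.168.252.135 - - [14/Jan/2014:00:12:52 -0400] 'GET /about HTTP/1.1' 200 97");
    System.out.println(counter.countFor("192.168.252.135"));
  }
}
```

In the real flowlet the count lives in a dataset, so it is persisted by CDAP rather than held in process memory as it is here.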
The UniqueVisitCount dataset provides an abstraction of the data logic for incrementing the visit count for a given IP. It exposes an increment method, implemented as:
@WriteOnly
public void increment(String ip, long amount) {
  // Delegates to the system KeyValueTable for actual storage operation
  keyValueTable.increment(Bytes.toBytes(ip), amount);
}
The complete source code of the UniqueVisitCount class can be found in the example in src/main/java/co/cask/cdap/examples/webanalytics/UniqueVisitCount.java.
To connect the UniqueVisitor flowlet to read from the log stream, we define a WebAnalyticsFlow class that specifies the flow:
@Override
protected void configure() {
  setName("WebAnalyticsFlow");
  setDescription("Web Analytics Flow");
  // Only one Flowlet in this Flow
  addFlowlet("UniqueVisitor", new UniqueVisitor());
  // Feed events written to the "log" Stream to UniqueVisitor
  connectStream("log", "UniqueVisitor");
}
Lastly, we bundle the dataset and the flow we've defined into an application that can be deployed and executed in CDAP:
public class WebAnalytics extends AbstractApplication {

  @Override
  public void configure() {
    addStream(new Stream("log"));
    addFlow(new WebAnalyticsFlow());
    createDataset("UniqueVisitCount", UniqueVisitCount.class,
                  DatasetProperties.builder().setDescription("Unique Visit Counts").build());
    setName("WebAnalytics");
    setDescription("Web Analytics Application");
  }
}
Building and Starting
You can build the example as described in Building an Example Application.
Start CDAP (as described in Starting and Stopping CDAP).
Deploy the application, as described in Deploying an Application. For example, from the CDAP Local Sandbox home directory, use the Command Line Interface (CLI):
$ cdap cli load artifact examples/WebAnalytics/target/WebAnalytics-5.1.2.jar
Successfully added artifact with name 'WebAnalytics'
$ cdap cli create app WebAnalytics WebAnalytics 5.1.2 user
Successfully created application

> cdap cli load artifact examples\WebAnalytics\target\WebAnalytics-5.1.2.jar
Successfully added artifact with name 'WebAnalytics'
> cdap cli create app WebAnalytics WebAnalytics 5.1.2 user
Successfully created application
Once the application has been deployed, you can start its components, as described in Starting an Application, and detailed at the start of running the example.
Once all components are started, run the example.
When finished, you can stop and remove the application.
Running the Example
Starting the Flow
Using the CDAP UI, go to the WebAnalytics application overview page, programs tab, click WebAnalyticsFlow to get to the flow detail page, then click the Start button; or
From the CDAP Local Sandbox home directory, use the Command Line Interface:
$ cdap cli start flow WebAnalytics.WebAnalyticsFlow
Successfully started flow 'WebAnalyticsFlow' of application 'WebAnalytics' with stored runtime arguments '{}'

> cdap cli start flow WebAnalytics.WebAnalyticsFlow
Successfully started flow 'WebAnalyticsFlow' of application 'WebAnalytics' with stored runtime arguments '{}'
Injecting Log Events
To inject a single log event, you can use the curl command:
$ curl -d "192.168.252.135 - - [14/Jan/2014:00:12:51 -0400] 'GET /products HTTP/1.1' 500 182 'http://www.example.org' 'Mozilla/5.0'" \
"http://localhost:11015/v3/namespaces/default/streams/log"
> curl -d "192.168.252.135 - - [14/Jan/2014:00:12:51 -0400] 'GET /products HTTP/1.1' 500 182 'http://www.example.org' 'Mozilla/5.0'" ^
"http://localhost:11015/v3/namespaces/default/streams/log"
This sends the log event (formatted in the Common Log Format, or CLF) to the CDAP instance located at localhost and listening on port 11015.
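The same request can be made from Java. The sketch below is an illustration under stated assumptions rather than part of the example: the StreamClientSketch class and its streamUrl and sendEvent helpers are hypothetical names, and it uses only the standard HttpURLConnection class to POST the event as the request body to the endpoint shown in the curl command. It assumes a CDAP instance is listening on localhost at port 11015.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of CDAP or of the WebAnalytics example.
class StreamClientSketch {

  // Builds the stream endpoint in the form used by the curl command
  static String streamUrl(String host, int port, String namespace, String stream) {
    return "http://" + host + ":" + port + "/v3/namespaces/" + namespace + "/streams/" + stream;
  }

  // POSTs one event as the request body and returns the HTTP status code
  static int sendEvent(String url, String event) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(event.getBytes(StandardCharsets.UTF_8));
    }
    try {
      return conn.getResponseCode();
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws IOException {
    String url = streamUrl("localhost", 11015, "default", "log");
    String event = "192.168.252.135 - - [14/Jan/2014:00:12:51 -0400] "
        + "'GET /products HTTP/1.1' 500 182 'http://www.example.org' 'Mozilla/5.0'";
    // Requires a running CDAP instance; prints the HTTP status code of the response
    System.out.println(sendEvent(url, event));
  }
}
```

Keeping the URL construction in its own method makes it easy to point the sketch at a different host, namespace, or stream.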
The application includes sample logs, located in examples/resources/accesslog.txt, that you can inject using the CDAP Command Line Interface:
Query the Unique Visitor Page Views
Once the log data has been processed by the WebAnalyticsFlow, we can explore the dataset UniqueVisitCount with a SQL query. You can easily execute SQL queries against datasets using the CDAP UI by going to the Data page showing All Datasets, entering UniqueVisitCount in the search box, and clicking on the UniqueVisitCount dataset:
Then, once at the dataset detail page, select the Explore tab:
You can then run SQL queries against the dataset. Let's try to find the top five IP addresses that visited the site by running a SQL query:
SELECT * FROM dataset_uniquevisitcount ORDER BY value DESC LIMIT 5
You can copy and paste the above SQL into the text box as shown below (replacing the default query that is there) and click the Execute SQL button to run it. It may take a moment for the query to finish.
Once it's finished, click on the preview button on the right side of the Results table:
This will display the first five rows of the query results.
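To make clear what the ORDER BY value DESC LIMIT 5 query computes, here is a minimal Java sketch of the same ranking over an in-memory map of IP addresses to visit counts. It is an illustration only: the TopVisitorsSketch class, the topN method, and the sample counts are hypothetical and do not read from the UniqueVisitCount dataset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration of the SQL query's ranking; it does not query CDAP.
class TopVisitorsSketch {

  // Sorts the entries by count in descending order and keeps the first n
  static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
    return counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(n)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Made-up counts, used only to exercise the method
    Map<String, Long> counts = new HashMap<>();
    counts.put("192.168.252.135", 12L);
    counts.put("10.0.0.1", 4L);
    counts.put("10.0.0.2", 9L);
    for (Map.Entry<String, Long> entry : topN(counts, 5)) {
      System.out.println(entry.getKey() + " " + entry.getValue());
    }
  }
}
```

As with the SQL query, the relative order of entries that share the same count is not guaranteed by this sketch.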
Stopping and Removing the Application
Once done, you can stop the application—if it hasn't stopped already—as described in Stopping an Application. Here is an example-specific description of the steps:
Stopping the Flow
Using the CDAP UI, go to the WebAnalytics application overview page, programs tab, click WebAnalyticsFlow to get to the flow detail page, then click the Stop button; or
From the CDAP Local Sandbox home directory, use the Command Line Interface:
Removing the Application
You can now remove the application as described in Removing an Application, or:
Using the CDAP UI, go to the WebAnalytics application overview page, programs tab, click the Actions menu on the right side and select Manage to go to the Management pane for the application, then click the Actions menu on the right side and select Delete to delete the application; or
From the CDAP Local Sandbox home directory, use the Command Line Interface: