🔗Web Analytics Application

A Cask Data Application Platform (CDAP) tutorial demonstrating how to perform analytics using access logs.

🔗Overview

This tutorial provides the basic steps for the development of a data application using the Cask Data Application Platform (CDAP). We will use a web analytics application to demonstrate how to develop with CDAP and how CDAP helps when building data applications that run in the Hadoop ecosystem.

Web analytics applications are commonly used to generate statistics and to provide insights about web usage through the analysis of web traffic. A typical web analytics application consists of three components:

  • Data Collection: Collects and persists web logs for further processing;
  • Data Analysis: Analyses the collected data and produces different measurements; and
  • Insights Discovery: Extracts insights from the analysis results.

Additionally, it's important that the application be scalable, fault tolerant, and easy to operate. It needs to support ever-increasing amounts of data and be flexible enough in its design to accommodate new application requirements.

In this tutorial, we'll show how easy it is to build a web analytics application with CDAP. In particular, we'll use these CDAP components:

  • A stream for web server log collection and persistence to the file system;
  • A flow for real-time data analysis over collected logs; and
  • SQL Queries to explore and develop insights from the data.

🔗How It Works

In this section, we’ll go through the details about how to develop a web analytics application using CDAP.

🔗Data Collection with a Stream

The sole data source that the web analytics application uses is web server logs. Log events are ingested into a stream called log using the RESTful API provided by CDAP.

Once an event is ingested into a stream, it is persisted and available for processing.

🔗Data Analysis using a Flow

The web analytics application uses a flow, the real-time data processor in CDAP, to produce real-time analytics from the web server logs. A flow contains one or more flowlets that are wired together into a directed acyclic graph or DAG.

To keep the example simple, we compute only the total visit count for each IP address visiting the site. A flowlet of type UniqueVisitor keeps track of the visit count for each client IP. It does this in three steps:

  1. Read a log event from the log stream;
  2. Parse the client IP from the log event; and
  3. Increment the visit count of that client IP by 1 and persist the change.

The result of the increment is persisted to a custom dataset UniqueVisitCount.

Here is what the UniqueVisitor flowlet looks like:

public class UniqueVisitor extends AbstractFlowlet {

  // Request an instance of UniqueVisitCount Dataset
  @UseDataSet("UniqueVisitCount")
  private UniqueVisitCount table;

  @ProcessInput
  public void process(StreamEvent streamEvent) {
    // Decode the log line as String
    String event = Charset.forName("UTF-8").decode(streamEvent.getBody()).toString();

    // The first entry in the log event is the IP address
    String ip = event.substring(0, event.indexOf(' '));

    // Increments the visit count of a given IP by 1
    table.increment(ip, 1L);
  }
}
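Note that the substring call above throws an exception if an event contains no space character. A minimal sketch of a more defensive parse is shown here; the class and method names are hypothetical and are not part of the WebAnalytics example or the CDAP API:

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the example source.
final class LogLineParser {

  private LogLineParser() { }

  /**
   * Returns the first space-delimited token of a log line (the client IP in
   * the access-log format used in this tutorial), or empty if the line is
   * null, blank, or contains no space.
   */
  static Optional<String> clientIp(String line) {
    if (line == null) {
      return Optional.empty();
    }
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return Optional.empty();
    }
    int space = trimmed.indexOf(' ');
    // A line without any space is treated as malformed rather than as an IP.
    if (space < 0) {
      return Optional.empty();
    }
    return Optional.of(trimmed.substring(0, space));
  }
}
```

With a helper like this, the flowlet could skip the increment whenever the returned Optional is empty, rather than failing on a malformed event.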

The UniqueVisitCount dataset provides an abstraction of the data logic for incrementing the visit count for a given IP. It exposes an increment method, implemented as:

@WriteOnly
public void increment(String ip, long amount) {
  // Delegates to the system KeyValueTable for actual storage operation
  keyValueTable.increment(Bytes.toBytes(ip), amount);
}

The complete source code of the UniqueVisitCount class can be found in the example, in src/main/java/co/cask/cdap/examples/webanalytics/UniqueVisitCount.java.
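To illustrate the semantics of increment without a running CDAP instance, here is a rough in-memory stand-in. It is only a sketch: the class name and the getCount method are assumptions made for illustration, and the real dataset delegates to a persisted KeyValueTable rather than a HashMap.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in; the real UniqueVisitCount delegates to a
// CDAP KeyValueTable, so its counts are persisted.
final class InMemoryVisitCount {

  private final Map<String, Long> counts = new HashMap<>();

  /** Adds amount to the count stored for ip, starting from zero if absent. */
  void increment(String ip, long amount) {
    counts.merge(ip, amount, Long::sum);
  }

  /** Returns the current count for ip, or 0 if the ip has not been seen. */
  long getCount(String ip) {
    return counts.getOrDefault(ip, 0L);
  }
}
```

Calling increment twice for the same IP with an amount of 1 leaves a count of 2 for that IP, while an IP that was never incremented reads as 0.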

To connect the UniqueVisitor flowlet to read from the log stream, we define a WebAnalyticsFlow class that specifies the flow:

@Override
protected void configure() {
  setName("WebAnalyticsFlow");
  setDescription("Web Analytics Flow");
  // Only one Flowlet in this Flow
  addFlowlet("UniqueVisitor", new UniqueVisitor());
  // Feed events written to the "log" Stream to UniqueVisitor
  connectStream("log", "UniqueVisitor");
}

Lastly, we bundle up the dataset and the flow we've defined together to form an application that can be deployed and executed in CDAP:

public class WebAnalytics extends AbstractApplication {

  @Override
  public void configure() {
    addStream(new Stream("log"));
    addFlow(new WebAnalyticsFlow());
    createDataset("UniqueVisitCount", UniqueVisitCount.class,
                  DatasetProperties.builder().setDescription("Unique Visit Counts").build());

    setName("WebAnalytics");
    setDescription("Web Analytics Application");
  }
}

🔗Building and Starting

  • You can build the example as described in Building an Example Application.

  • Start CDAP (as described in Starting and Stopping CDAP).

  • Deploy the application, as described in Deploying an Application. For example, from the Standalone CDAP SDK directory, use the Command Line Interface (CLI):

    $ cdap cli load artifact examples/WebAnalytics/target/WebAnalytics-4.1.1.jar
    
    Successfully added artifact with name 'WebAnalytics'
    
    $ cdap cli create app WebAnalytics WebAnalytics 4.1.1 user
    
    Successfully created application
    
    > cdap cli load artifact examples\WebAnalytics\target\WebAnalytics-4.1.1.jar
    
    Successfully added artifact with name 'WebAnalytics'
    
    > cdap cli create app WebAnalytics WebAnalytics 4.1.1 user
    
    Successfully created application
    
  • Once the application has been deployed, you can start its components, as described in Starting an Application and detailed at the start of Running the Example.

  • Once all components are started, run the example.

  • When finished, you can stop and remove the application.

🔗Running the Example

🔗Starting the Flow

  • Using the CDAP UI, go to the programs tab of the WebAnalytics application overview page, click WebAnalyticsFlow to open the flow detail page, then click the Start button; or

  • From the Standalone CDAP SDK directory, use the Command Line Interface:

    $ cdap cli start flow WebAnalytics.WebAnalyticsFlow
    
    Successfully started flow 'WebAnalyticsFlow' of application 'WebAnalytics' with stored runtime arguments '{}'
    
    > cdap cli start flow WebAnalytics.WebAnalyticsFlow
    
    Successfully started flow 'WebAnalyticsFlow' of application 'WebAnalytics' with stored runtime arguments '{}'
    

🔗Injecting Log Events

To inject a single log event, you can use the curl command:

$ curl -d "192.168.252.135 - - [14/Jan/2014:00:12:51 -0400] 'GET /products HTTP/1.1' 500 182 'http://www.example.org' 'Mozilla/5.0'" \
"http://localhost:11015/v3/namespaces/default/streams/log"

> curl -d "192.168.252.135 - - [14/Jan/2014:00:12:51 -0400] 'GET /products HTTP/1.1' 500 182 'http://www.example.org' 'Mozilla/5.0'" ^
"http://localhost:11015/v3/namespaces/default/streams/log"

This sends the log event (formatted in the Common Log Format or CLF) to the CDAP instance located at localhost and listening on port 11015.
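The same request can be made from Java. The following is a minimal sketch using only the JDK's HttpURLConnection; the class and method names are hypothetical, and only the URL layout is taken from the curl command above. It assumes the stream endpoint accepts a plain POST body, as the curl example does.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical client helper; only the URL layout comes from the curl example.
final class StreamEventSender {

  private StreamEventSender() { }

  /** Builds the stream endpoint URL in the same layout as the curl example. */
  static String streamUrl(String host, int port, String namespace, String stream) {
    return "http://" + host + ":" + port
        + "/v3/namespaces/" + namespace + "/streams/" + stream;
  }

  /** POSTs one event body to the given URL and returns the HTTP status code. */
  static int send(String url, String event) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try {
      conn.setRequestMethod("POST");
      conn.setDoOutput(true);
      byte[] body = event.getBytes(StandardCharsets.UTF_8);
      conn.setFixedLengthStreamingMode(body.length);
      try (OutputStream out = conn.getOutputStream()) {
        out.write(body);
      }
      return conn.getResponseCode();
    } finally {
      conn.disconnect();
    }
  }
}
```

For the local instance used in this tutorial, the URL would be built with streamUrl("localhost", 11015, "default", "log") and passed to send together with the log line.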

The application includes sample logs, located in examples/resources/accesslog.txt, that you can inject using the CDAP Command Line Interface:

$ cdap cli load stream log examples/resources/accesslog.txt

Successfully loaded file to stream 'log'

> cdap cli load stream log examples\resources\accesslog.txt

Successfully loaded file to stream 'log'

🔗Query the Unique Visitor Page Views

Once the log data has been processed by the WebAnalyticsFlow, we can explore the dataset UniqueVisitCount with a SQL query. You can easily execute SQL queries against datasets using the CDAP UI by going to the Data page showing All Datasets, entering UniqueVisitCount in the search box, and clicking on the UniqueVisitCount dataset:

../_images/web-analytics-0.png

Then, once at the dataset detail page, select the Explore tab:

../_images/web-analytics-1.png

You can then run SQL queries against the dataset. Let's try to find the top five IP addresses that visited the site by running a SQL query:

SELECT * FROM dataset_uniquevisitcount ORDER BY value DESC LIMIT 5

You can copy and paste the above SQL into the text box as shown below (replacing the default query that is there) and click the Execute SQL button to run it. It may take a moment for the query to finish.

../_images/web-analytics-2.png

Once it's finished, click on the preview button on the right side of the Results table:

../_images/web-analytics-3.png

This will display the first five rows of the query results:

../_images/web-analytics-4.png
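For intuition, the ORDER BY value DESC LIMIT 5 query in this section corresponds roughly to the following in-memory computation over a map of IP addresses to visit counts. This is only an illustrative sketch with hypothetical names; it is not how CDAP executes the query.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper mirroring "ORDER BY value DESC LIMIT n" in plain Java.
final class TopVisitors {

  private TopVisitors() { }

  /** Returns up to limit entries, ordered by visit count from highest to lowest. */
  static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int limit) {
    return counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
        .limit(limit)
        .collect(Collectors.toList());
  }
}
```

As with the SQL query, the relative order of IP addresses that have equal counts is not defined by this sketch.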

🔗Stopping and Removing the Application

Once done, you can stop the application, if it hasn't stopped already, as described in Stopping an Application. Here is an example-specific description of the steps:

Stopping the Flow

  • Using the CDAP UI, go to the programs tab of the WebAnalytics application overview page, click WebAnalyticsFlow to open the flow detail page, then click the Stop button; or

  • From the Standalone CDAP SDK directory, use the Command Line Interface:

    $ cdap cli stop flow WebAnalytics.WebAnalyticsFlow
    
    Successfully stopped flow 'WebAnalyticsFlow' of application 'WebAnalytics'
    
    > cdap cli stop flow WebAnalytics.WebAnalyticsFlow
    
    Successfully stopped flow 'WebAnalyticsFlow' of application 'WebAnalytics'
    

Removing the Application

You can now remove the application as described in Removing an Application, or:

  • Using the CDAP UI, go to the programs tab of the WebAnalytics application overview page, click the Actions menu on the right side and select Manage to open the Management pane for the application, then click the Actions menu on the right side of that pane and select Delete; or

  • From the Standalone CDAP SDK directory, use the Command Line Interface:

    $ cdap cli delete app WebAnalytics
    
    > cdap cli delete app WebAnalytics