Decision Tree Regression

A Cask Data Application Platform (CDAP) example demonstrating Spark2.

Overview

This example demonstrates a Spark2 application training a machine-learning model using the Decision Tree Regression method.

Labeled data in libsvm format is uploaded to a CDAP Service by a RESTful call. This data is processed by the ModelTrainer Spark program, which divides the labeled data into a test set and a training set. A model is trained using the Decision Tree Regression method, and metadata about the model, such as the root-mean-square error, are stored in an ObjectMappedTable dataset.
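
To make the input format and the error metric concrete, here is a minimal sketch in plain Java with no Spark or CDAP dependencies. It parses one libsvm line (a label followed by 1-based index:value pairs) and computes a root-mean-square error over a handful of predictions. The class LibSvmSketch and its methods are hypothetical names introduced for illustration; the ModelTrainer program relies on Spark for this work and is not shown here.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the DecisionTreeRegression example.
final class LibSvmSketch {

  /** A label plus sparse features keyed by their 1-based libsvm index. */
  static final class LabeledPoint {
    final double label;
    final Map<Integer, Double> features;

    LabeledPoint(double label, Map<Integer, Double> features) {
      this.label = label;
      this.features = features;
    }
  }

  /** Parses one libsvm line of the form "<label> <index>:<value> <index>:<value> ...". */
  static LabeledPoint parseLine(String line) {
    String[] tokens = line.trim().split("\\s+");
    double label = Double.parseDouble(tokens[0]);
    Map<Integer, Double> features = new TreeMap<>();
    for (int i = 1; i < tokens.length; i++) {
      String[] pair = tokens[i].split(":", 2);
      features.put(Integer.parseInt(pair[0]), Double.parseDouble(pair[1]));
    }
    return new LabeledPoint(label, features);
  }

  /** Root-mean-square error between true labels and predictions of equal length. */
  static double rmse(double[] labels, double[] predictions) {
    if (labels.length == 0 || labels.length != predictions.length) {
      throw new IllegalArgumentException("Need two non-empty arrays of equal length");
    }
    double sumOfSquares = 0.0;
    for (int i = 0; i < labels.length; i++) {
      double error = labels[i] - predictions[i];
      sumOfSquares += error * error;
    }
    return Math.sqrt(sumOfSquares / labels.length);
  }

  public static void main(String[] args) {
    LabeledPoint point = parseLine("1 3:0.5 10:2.0");
    System.out.println(point.label + " " + point.features);
    // One wrong prediction out of four: sqrt(1/4) = 0.5
    System.out.println(rmse(new double[] {1, 0, 1, 0}, new double[] {1, 0, 0, 0}));
  }
}
```

The sparse map mirrors the libsvm convention that only non-zero features are listed; a real trainer would convert this into whatever vector type its library expects.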

Once the ModelTrainer program completes, you can list the models trained by querying the models endpoint of the ModelDataService. You can fetch metadata about a specific model using the models/{model-id} endpoint of the ModelDataService. It will respond with metadata, such as how many data points were tested, how many data points were correctly labeled, and the root-mean-square error.

Let's look at some of these components, and then run the application and see the results.

The DecisionTreeRegression Application

As in the other examples, the components of the application are tied together by the class DecisionTreeRegressionApp:

public class DecisionTreeRegressionApp extends AbstractApplication {
  public static final String TRAINING_DATASET = "trainingData";
  public static final String MODEL_DATASET = "models";
  public static final String MODEL_META = "modelMeta";

  @Override
  public void configure() {
    addService(new ModelDataService());
    addSpark(new ModelTrainer());

    createDataset(TRAINING_DATASET, FileSet.class.getName(), DatasetProperties.EMPTY);
    createDataset(MODEL_DATASET, FileSet.class.getName(), DatasetProperties.EMPTY);
    try {
      createDataset(MODEL_META, ObjectMappedTable.class.getName(),
                    ObjectMappedTableProperties.builder()
                      .setType(ModelMeta.class)
                      .setRowKeyExploreName("id")
                      .setRowKeyExploreType(Schema.Type.STRING)
                      .setExploreTableName(MODEL_META)
                      .build());
    } catch (UnsupportedTypeException e) {
      // will never happen
      throw new IllegalStateException("ModelMeta has an unsupported schema.", e);
    }
  }
. . .

The trainingData and models FileSet Data Storage

The labeled data is stored in a FileSet dataset, trainingData. Trained models are stored in a FileSet dataset, models.

The modelMeta ObjectMappedTable Data Storage

Metadata about trained models are stored in an ObjectMappedTable dataset, modelMeta.

The ModelDataService Service

This service has three endpoints:

  • labels endpoint is used to upload labeled data for training and testing
  • models endpoint is used to list the IDs of all models trained to date
  • models/{model-id} endpoint is used to retrieve metadata about a specific model
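
As a rough sketch of how these endpoints map onto URLs, the following Java class builds the address of each one. The base URL is taken from the curl commands used later in this example and assumes a local CDAP Sandbox on port 11015 and the default namespace. The class ModelDataServiceUrls is a hypothetical helper, not part of the example's source.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the example's source.
final class ModelDataServiceUrls {

  // Assumed base: local CDAP Sandbox, port 11015, "default" namespace,
  // as used by the curl commands in this example.
  static final String BASE =
      "http://localhost:11015/v3/namespaces/default/apps/DecisionTreeRegression"
          + "/services/ModelDataService/methods/";

  /** Target of a PUT request that uploads labeled data. */
  static URI labels() {
    return URI.create(BASE + "labels");
  }

  /** Target of a GET request that lists the IDs of trained models. */
  static URI models() {
    return URI.create(BASE + "models");
  }

  /** Target of a GET request that retrieves metadata for one model. */
  static URI model(String modelId) {
    return URI.create(BASE + "models/" + modelId);
  }

  public static void main(String[] args) {
    System.out.println("PUT " + labels());
    System.out.println("GET " + models());
    System.out.println("GET " + model("92f9da09-71c3-45b0-aec5-2eb100cfbbac"));
  }
}
```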

Building and Starting

  • You can build the example as described in Building an Example Application.

  • Start CDAP (as described in Starting and Stopping CDAP).

  • Deploy the application, as described in Deploying an Application. For example, from the CDAP Local Sandbox home directory, use the Command Line Interface (CLI):

    $ cdap cli load artifact examples/DecisionTreeRegression/target/DecisionTreeRegression-4.3.4.jar
    
    Successfully added artifact with name 'DecisionTreeRegression'
    
    $ cdap cli create app DecisionTreeRegression DecisionTreeRegression 4.3.4 user
    
    Successfully created application
    
    > cdap cli load artifact examples\DecisionTreeRegression\target\DecisionTreeRegression-4.3.4.jar
    
    Successfully added artifact with name 'DecisionTreeRegression'
    
    > cdap cli create app DecisionTreeRegression DecisionTreeRegression 4.3.4 user
    
    Successfully created application
    
  • Once the application has been deployed, you can start its components as described in Starting an Application; the steps specific to this example are given at the start of Running the Example.

  • Once all components are started, run the example.

  • When finished, you can stop and remove the application.

Running the Example

Setting the Spark Version

This example uses Spark2, so the CDAP Sandbox must be configured to use the Spark2 runtime instead of the default, Spark1. To do this, set the property app.program.spark.compat to spark2_2.11 in the conf/cdap-site.xml file of the CDAP Sandbox, then restart CDAP if it is currently running.
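
Assuming conf/cdap-site.xml uses the usual Hadoop-style layout of name/value properties inside a configuration element, the entry might look like the fragment below. This is a sketch of the one property named above, not a complete file; if the property is already present, edit its value instead of adding a second entry.

```xml
<!-- conf/cdap-site.xml: place inside the existing <configuration> element -->
<property>
  <name>app.program.spark.compat</name>
  <value>spark2_2.11</value>
</property>
```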

Starting the Service

  • Using the CDAP UI, go to the DecisionTreeRegression application overview page, programs tab, click ModelDataService to get to the service detail page, then click the Start button; or

  • From the CDAP Local Sandbox home directory, use the Command Line Interface:

    $ cdap cli start service DecisionTreeRegression.ModelDataService
    
    Successfully started service 'ModelDataService' of application 'DecisionTreeRegression' with stored runtime arguments '{}'
    
    > cdap cli start service DecisionTreeRegression.ModelDataService
    
    Successfully started service 'ModelDataService' of application 'DecisionTreeRegression' with stored runtime arguments '{}'
    

Uploading Labeled Data

Upload labeled data in libsvm format by running this command from the CDAP Local Sandbox home directory, using the CDAP Command Line Interface:

$ cdap cli call service DecisionTreeRegression.ModelDataService PUT labels body:file examples/DecisionTreeRegression/src/test/resources/sample_libsvm_data.txt
> cdap cli call service DecisionTreeRegression.ModelDataService PUT labels body:file examples\DecisionTreeRegression\src\test\resources\sample_libsvm_data.txt

Running the Spark Program

There are three ways to start the Spark program:

  1. Using the CDAP UI, go to the DecisionTreeRegression application overview page, programs tab, click ModelTrainer to get to the Spark program detail page, then click the Start button; or

  2. Use the Command Line Interface:

    $ cdap cli start spark DecisionTreeRegression.ModelTrainer
    
    > cdap cli start spark DecisionTreeRegression.ModelTrainer
    
  3. Send a query via an HTTP request using the curl command:

    $ curl -w"\n" -X POST \
    "http://localhost:11015/v3/namespaces/default/apps/DecisionTreeRegression/spark/ModelTrainer/start"
    
    > curl -X POST ^
    "http://localhost:11015/v3/namespaces/default/apps/DecisionTreeRegression/spark/ModelTrainer/start"
    

Querying the Results

Once the trainer has completed, you can retrieve the ID of the trained model. (If it has not completed, the examples in this section will return no results, and can be retried until they return results.)
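
The retry described above can be sketched as a small generic polling helper. The helper knows nothing about CDAP: the fetch function is a stand-in that you would replace with a real call to the service, and it is up to that function to decide what counts as "no results" (for example, an empty list). PollSketch and pollUntilPresent are hypothetical names introduced for illustration.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of the example's source.
final class PollSketch {

  /**
   * Calls fetch up to maxAttempts times, sleeping delayMillis between attempts,
   * and returns the first non-empty result (or empty if none arrived).
   */
  static <T> Optional<T> pollUntilPresent(
      Supplier<Optional<T>> fetch, int maxAttempts, long delayMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      Optional<T> result = fetch.get();
      if (result.isPresent()) {
        return result;
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt and give up
          return Optional.empty();
        }
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Stand-in for a real query: empty twice, then a placeholder result.
    Optional<String> ids = pollUntilPresent(
        () -> ++calls[0] < 3
            ? Optional.<String>empty()
            : Optional.of("[ \"example-model-id\" ]"),
        5, 10L);
    System.out.println(ids.orElse("no result") + " after " + calls[0] + " attempts");
  }
}
```

Bounding the number of attempts keeps a script from waiting forever if the Spark program failed rather than merely being slow.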

To list the IDs of trained models using the ModelDataService, you can:

  • Use the Command Line Interface:

    $ cdap cli call service DecisionTreeRegression.ModelDataService GET models
    
    [ "92f9da09-71c3-45b0-aec5-2eb100cfbbac" ]
    
    > cdap cli call service DecisionTreeRegression.ModelDataService GET models
    
    [ "92f9da09-71c3-45b0-aec5-2eb100cfbbac" ]
    
  • Send a query via an HTTP request using the curl command. For example:

    $ curl -w"\n" -X GET "http://localhost:11015/v3/namespaces/default/apps/DecisionTreeRegression/services/ModelDataService/methods/models"
    
    [ "92f9da09-71c3-45b0-aec5-2eb100cfbbac" ]
    
    > curl -X GET "http://localhost:11015/v3/namespaces/default/apps/DecisionTreeRegression/services/ModelDataService/methods/models"
    
    [ "92f9da09-71c3-45b0-aec5-2eb100cfbbac" ]
    

To retrieve metadata about a specific model using the ModelDataService, you can:

  • Use the Command Line Interface:

    $ cdap cli call service DecisionTreeRegression.ModelDataService GET models/92f9da09-71c3-45b0-aec5-2eb100cfbbac
    
    > cdap cli call service DecisionTreeRegression.ModelDataService GET models/92f9da09-71c3-45b0-aec5-2eb100cfbbac
    
  • Send a query via an HTTP request using the curl command. For example:

    $ curl -w"\n" -X GET "http://localhost:11015/v3/namespaces/default/apps/DecisionTreeRegression/services/ModelDataService/methods/models/92f9da09-71c3-45b0-aec5-2eb100cfbbac"
    
    {
      "numFeatures": 692,
      "numPredictions": 37,
      "numPredictionsCorrect": 35,
      "numPredictionsWrong": 2,
      "rmse": 0.2324952774876386,
      "trainingPercentage": 0.7
    }
    
    > curl -X GET "http://localhost:11015/v3/namespaces/default/apps/DecisionTreeRegression/services/ModelDataService/methods/models/92f9da09-71c3-45b0-aec5-2eb100cfbbac"
    
    {
      "numFeatures": 692,
      "numPredictions": 37,
      "numPredictionsCorrect": 35,
      "numPredictionsWrong": 2,
      "rmse": 0.2324952774876386,
      "trainingPercentage": 0.7
    }
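
The fields in the sample response above can be cross-checked with a little arithmetic. The sketch below assumes that every test label and every prediction is exactly 0 or 1, so that each wrong prediction contributes a squared error of 1; that assumption is ours and is not stated by the example. Under it, the correct and wrong counts add up to numPredictions (35 + 2 = 37), and the square root of 2/37 is approximately 0.2325, in line with the reported rmse. ModelMetaCheck is a hypothetical class introduced for illustration.

```java
// Hypothetical helper for illustration; not part of the example's source.
final class ModelMetaCheck {

  /** True when the correct and wrong counts account for every prediction. */
  static boolean countsAddUp(int numPredictions, int numCorrect, int numWrong) {
    return numCorrect + numWrong == numPredictions;
  }

  /**
   * RMSE under the assumption that labels and predictions are all 0 or 1:
   * each wrong prediction has squared error 1, each correct one has 0.
   */
  static double binaryRmse(int numPredictions, int numWrong) {
    if (numPredictions <= 0 || numWrong < 0 || numWrong > numPredictions) {
      throw new IllegalArgumentException("Counts are out of range");
    }
    return Math.sqrt((double) numWrong / numPredictions);
  }

  public static void main(String[] args) {
    // Values copied from the sample response in this example.
    System.out.println(countsAddUp(37, 35, 2));
    System.out.println(binaryRmse(37, 2));
  }
}
```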
    

Stopping and Removing the Application

Once done, you can stop the application (if it hasn't stopped already) as described in Stopping an Application. Here are the steps specific to this example:

Stopping the Spark Program

  • Using the CDAP UI, go to the DecisionTreeRegression application overview page, programs tab, click ModelTrainer to get to the Spark program detail page, then click the Stop button; or

  • From the CDAP Local Sandbox home directory, use the Command Line Interface:

    $ cdap cli stop spark DecisionTreeRegression.ModelTrainer
    
    Successfully stopped Spark 'ModelTrainer' of application 'DecisionTreeRegression'
    
    > cdap cli stop spark DecisionTreeRegression.ModelTrainer
    
    Successfully stopped Spark 'ModelTrainer' of application 'DecisionTreeRegression'
    

Stopping the Service

  • Using the CDAP UI, go to the DecisionTreeRegression application overview page, programs tab, click ModelDataService to get to the service detail page, then click the Stop button; or

  • From the CDAP Local Sandbox home directory, use the Command Line Interface:

    $ cdap cli stop service DecisionTreeRegression.ModelDataService
    
    Successfully stopped service 'ModelDataService' of application 'DecisionTreeRegression'
    
    > cdap cli stop service DecisionTreeRegression.ModelDataService
    
    Successfully stopped service 'ModelDataService' of application 'DecisionTreeRegression'
    

Removing the Application

You can now remove the application as described in Removing an Application, or:

  • Using the CDAP UI, go to the DecisionTreeRegression application overview page, programs tab, click the Actions menu on the right side and select Manage to go to the Management pane for the application, then click the Actions menu on the right side and select Delete to delete the application; or

  • From the CDAP Local Sandbox home directory, use the Command Line Interface:

    $ cdap cli delete app DecisionTreeRegression
    
    > cdap cli delete app DecisionTreeRegression