Installation and Configuration

Introduction

This manual describes how to install and configure the Cask Data Application Platform (CDAP). It covers the system, network, and software requirements; the packaging options; and instructions for installing and verifying the CDAP components so that they work with your existing Hadoop cluster. Specific instructions are included for upgrading existing CDAP installations.

These are the CDAP components:

  • CDAP UI: User interface—the Console—for managing CDAP applications;
  • CDAP Router: Service supporting REST endpoints for CDAP;
  • CDAP Master: Service for managing runtime, lifecycle and resources of CDAP applications;
  • CDAP Kafka: Metrics and logging transport service, using an embedded version of Kafka; and
  • CDAP Authentication Server: Performs client authentication for CDAP when security is enabled.

Before installing the CDAP components, you must first install a Hadoop cluster with HDFS, YARN, HBase, and Zookeeper. In order to use the ad-hoc querying capabilities of CDAP, you will also need Hive. All CDAP components can be installed on the same boxes as your Hadoop cluster, or on separate boxes that can connect to the Hadoop services.

Our recommended installation is to use two boxes for the CDAP components; the hardware requirements are relatively modest, as most of the work is done by the Hadoop cluster. These two boxes provide high availability; at any one time, one of them is the leader providing services while the other is a follower providing failover support.

Some CDAP components run on YARN, while others orchestrate “containers” in the Hadoop cluster. The CDAP Router service starts a router instance on each of the local boxes and instantiates one or more gateway instances on YARN as determined by the gateway service configuration.

Specific hardware, network, and prerequisite software requirements, detailed below, must be met before you install the CDAP components.

System Requirements

Hardware Requirements

Systems hosting the CDAP components must meet these hardware specifications, in addition to having CPUs with a minimum speed of 2 GHz:

CDAP Component               Hardware Component   Specifications
CDAP UI                      RAM                  1 GB minimum, 2 GB recommended
CDAP Router                  RAM                  2 GB minimum, 4 GB recommended
CDAP Master                  RAM                  2 GB minimum, 4 GB recommended
CDAP Kafka                   RAM                  1 GB minimum, 2 GB recommended
CDAP Kafka                   Disk space           CDAP Kafka maintains a data cache in a configurable
                                                  data directory. Required space depends on the number
                                                  of CDAP applications deployed and running in CDAP
                                                  and the quantity of logs and metrics they generate.
CDAP Authentication Server   RAM                  1 GB minimum, 2 GB recommended

Network Requirements

CDAP components communicate over your network with HBase, HDFS, and YARN. For the best performance, CDAP components should be located on the same LAN, ideally running at 1 Gbps or faster. A good rule of thumb is to treat CDAP components as you would Hadoop DataNodes.

Software Prerequisites

You’ll need this software installed:

  • Java runtime (on CDAP and Hadoop nodes)
  • Node.js runtime (on CDAP nodes)
  • Hadoop and HBase (and optionally Hive) environment to run against
  • CDAP nodes require Hadoop and HBase client installation and configuration. Note: No Hadoop services need to be running.

Java Runtime

The latest JDK or JRE version 1.7.xx for Linux and Solaris must be installed in your environment; we recommend the Oracle JDK.

To check the Java version installed, run the command:

$ java -version

CDAP is tested with the Oracle JDKs; it may work with other JDKs such as OpenJDK, but it has not been tested with them.

Once you have installed the JDK, you’ll need to set the JAVA_HOME environment variable.
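For example (a sketch; the JDK path shown is illustrative and depends on your installation):

$ export JAVA_HOME=/usr/java/jdk1.7.0_75   # substitute your JDK location
$ $JAVA_HOME/bin/java -version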

Node.js Runtime

Node.js (any version from v0.10.* through v0.12.*) is available from nodejs.org:

  1. Download the appropriate Linux or Solaris binary .tar.gz from nodejs.org/download/.
  2. Extract it somewhere such as /opt/node-[version]/.
  3. If you are building Node.js from source instead, instructions that may assist are available at GitHub.
  4. Ensure that the node binary is in the $PATH. One method is to use a symlink from the installation: ln -s /opt/node-[version]/bin/node /usr/bin/node
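For example, installing a binary distribution might look like this (a sketch; the version shown is illustrative, so substitute one in the supported range):

$ curl -O http://nodejs.org/dist/v0.12.7/node-v0.12.7-linux-x64.tar.gz
$ sudo tar -xzf node-v0.12.7-linux-x64.tar.gz -C /opt
$ sudo ln -s /opt/node-v0.12.7-linux-x64/bin/node /usr/bin/node
$ node --version   # should print v0.12.7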

Hadoop/HBase Environment

For a distributed enterprise, you must install these Hadoop components:

Component   Source          Required Version
HDFS        Apache Hadoop   2.0.2-alpha through 2.5.0
            CDH or HDP      (CDH) 4.2.x through 5.3.3 or (HDP) 2.0 or 2.1
YARN        Apache Hadoop   2.0.2-alpha through 2.5.0
            CDH or HDP      (CDH) 4.2.x through 5.3.3 or (HDP) 2.0 or 2.1
HBase       Apache          0.94.2+, 0.96.x, and 0.98.x
            CDH or HDP      (CDH) 4.2.x through 5.3.3 or (HDP) 2.0 or 2.1
Zookeeper   Apache          Version 3.4.3 through 3.4.5
            CDH or HDP      (CDH) 4.2.x through 5.3.3 or (HDP) 2.0 or 2.1
Hive        Apache          Version 0.12.0 through 0.13.1
            CDH or HDP      (CDH) 4.3.x through 5.3.3 or (HDP) 2.0 or 2.1

Note: Component versions shown in this table are those that we have tested and are confident of their suitability and compatibility. Later versions of components may work, but have not necessarily been tested or confirmed compatible.

Note: Certain CDAP components need to reference your Hadoop, HBase, YARN (and possibly Hive) cluster configurations by adding your configuration to their class paths.

Note: Zookeeper’s maxClientCnxns must be raised from its default. We suggest setting it to zero (unlimited connections). As each YARN container launched by CDAP makes a connection to Zookeeper, the number of connections required is a function of usage.
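For example, in the Zookeeper configuration file zoo.cfg (its location varies by distribution):

# Allow unlimited concurrent client connections per host
maxClientCnxns=0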

Deployment Architectures

CDAP Minimal Deployment

Note: A minimal deployment runs all the services on a single host.

[Diagram: CDAP minimal deployment]

CDAP High Availability and Highly Scalable Deployment

Note: Each CDAP component is horizontally scalable. This diagram shows a highly available and highly scalable deployment; the number of nodes for each component can be changed based on your requirements.

[Diagram: CDAP high availability and highly scalable deployment]

Preparing the Cluster

To prepare your cluster so that CDAP can write to its default namespace, create a top-level /cdap directory in HDFS, owned by the HDFS user yarn:

$ sudo -u hdfs hadoop fs -mkdir /cdap
$ sudo -u hdfs hadoop fs -chown yarn /cdap

In the CDAP packages, the default HDFS namespace is /cdap and the default HDFS user is yarn. If you set up your cluster as above, no further changes are required.

If your cluster is not set up with these defaults, you’ll need to edit your CDAP configuration once you have downloaded and installed the packages, and prior to starting services.

Packaging

CDAP components are available as either Yum .rpm or APT .deb packages. There is one package for each CDAP component, and each component may have multiple services. Additionally, there is a base CDAP package, with three utility packages (for HBase compatibility), that creates the base configuration and the cdap user. We provide packages for Ubuntu 12 and CentOS 6.

Available packaging types:

  • RPM: YUM repo
  • Debian: APT repo
  • Tar: For specialized installations only

Note: If you are using Chef to install CDAP, an official cookbook is available.

Preparing Package Managers

RPM using Yum

Download the Cask Yum repo definition file:

$ sudo curl -o /etc/yum.repos.d/cask.repo http://repository.cask.co/centos/6/x86_64/cdap/3.0/cask.repo

This will create the file /etc/yum.repos.d/cask.repo with:

[cask]
name=Cask Packages
baseurl=http://repository.cask.co/centos/6/x86_64/cdap/3.0
enabled=1
gpgcheck=1

Add the Cask Public GPG Key to your repository:

$ sudo rpm --import http://repository.cask.co/centos/6/x86_64/cdap/3.0/pubkey.gpg

Update your Yum cache:

$ sudo yum makecache

Debian using APT

Download the Cask Apt repo definition file:

$ sudo curl -o /etc/apt/sources.list.d/cask.list http://repository.cask.co/ubuntu/precise/amd64/cdap/3.0/cask.list

This will create the file /etc/apt/sources.list.d/cask.list with:

deb [ arch=amd64 ] http://repository.cask.co/ubuntu/precise/amd64/cdap/3.0 precise cdap

Add the Cask Public GPG Key to your repository:

$ curl -s http://repository.cask.co/ubuntu/precise/amd64/cdap/3.0/pubkey.gpg | sudo apt-key add -

Update your APT-cache:

$ sudo apt-get update

Installation

Install the CDAP packages by using one of these methods:

Using Chef:

If you are using Chef to install CDAP, an official cookbook is available.

Using Yum:

$ sudo yum install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui

Using APT:

$ sudo apt-get install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui

Do this on each of the boxes that are being used for the CDAP components; our recommended installation is a minimum of two boxes.

This will download and install the latest version of CDAP with all of its dependencies.

Configuration

CDAP packages utilize a central configuration, stored by default in /etc/cdap.

When you install the CDAP base package, a default configuration is placed in /etc/cdap/conf.dist. The cdap-site.xml file is a placeholder where you can define your specific configuration for all CDAP components. The cdap-site.xml.example file shows the properties that usually require customization for all installations.

Similar to Hadoop, CDAP utilizes the alternatives framework to allow you to easily switch between multiple configurations. The alternatives system is used for ease of management and allows you to choose between different directories to fulfill the same purpose.

Simply copy the contents of /etc/cdap/conf.dist into a directory of your choice (such as /etc/cdap/conf.mycdap) and make all of your customizations there. Then run the alternatives command to point the /etc/cdap/conf symlink to your custom directory.
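For example, on a Debian/Ubuntu system (a sketch; on CentOS the command is alternatives rather than update-alternatives, and the alternative name cdap-conf is illustrative):

$ sudo cp -r /etc/cdap/conf.dist /etc/cdap/conf.mycdap
$ sudo update-alternatives --install /etc/cdap/conf cdap-conf /etc/cdap/conf.mycdap 50
$ sudo update-alternatives --set cdap-conf /etc/cdap/conf.mycdap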

Configure the cdap-site.xml after you have installed the CDAP packages.

To configure your particular installation, follow one of these two approaches:

  1. Modify cdap-site.xml, using cdap-site.xml.example as a model to follow.

    To make alterations to your configuration, create (or edit, if it already exists) the file conf/cdap-site.xml (see the Appendix: cdap-site.xml) and set the appropriate properties.

  2. Add these properties to cdap-site.xml; they are the minimal required configuration:

<configuration>

  <!--
    Cluster configurations
  -->

  <property>
    <name>root.namespace</name>
    <value>cdap</value>
    <description>Specifies the root namespace</description>
  </property>

  <!-- Substitute the zookeeper quorum for components here -->
  <property>
    <name>zookeeper.quorum</name>
    <value>FQDN1:2181,FQDN2:2181/${root.namespace}</value>
    <description>Specifies the zookeeper host:port</description>
  </property>

  <property>
    <name>hdfs.namespace</name>
    <value>/${root.namespace}</value>
    <description>Namespace for HDFS files</description>
  </property>

  <property>
    <name>hdfs.user</name>
    <value>yarn</value>
    <description>User name for accessing HDFS</description>
  </property>

  <!--
    Router configuration
  -->
  <!-- Substitute the IP address to which the Router service should bind and listen -->
  <property>
    <name>router.bind.address</name>
    <value>LOCAL-ROUTER-IP</value>
    <description>Specifies the inet address on which the Router service will listen</description>
  </property>

  <!--
    App Fabric configuration
  -->
  <!-- Substitute the IP address to which the App-Fabric service should bind and listen -->
  <property>
    <name>app.bind.address</name>
    <value>LOCAL-APP-FABRIC-IP</value>
    <description>Specifies the inet address on which the app fabric service will listen</description>
  </property>

  <!--
    Data Fabric configuration
  -->
  <!-- Substitute the IP address to which the Data-Fabric transaction service should bind and listen -->
  <property>
    <name>data.tx.bind.address</name>
    <value>LOCAL-DATA-FABRIC-IP</value>
    <description>Specifies the inet address on which the transaction service will listen</description>
  </property>

  <!-- 
    Kafka Configuration
  -->
  <property>
    <name>kafka.log.dir</name>
    <value>/data/cdap/kafka-logs</value>
    <description>Directory to store Kafka logs</description>
  </property>

  <!-- Substitute with a list of all machines which will run the Kafka component -->
  <property>
    <name>kafka.seed.brokers</name>
    <value>FQDN1:9092,FQDN2:9092</value>
    <description>List of Kafka brokers (comma separated)</description>
  </property>

  <!-- Must be <= the number of kafka.seed.brokers configured above.  For HA this should be at least 2. -->
  <property>
    <name>kafka.default.replication.factor</name>
    <value>1</value>
    <description>Kafka replication factor</description>
  </property>

  <!--
    Watchdog Configuration
  -->
  <!-- Substitute the IP address to which the metrics-query service should bind and listen -->
  <property>
    <name>metrics.query.bind.address</name>
    <value>LOCAL-WATCHDOG-IP</value>
    <description>Specifies the inet address on which the metrics-query service will listen</description>
  </property>

  <!--
    Web-App Configuration
  -->
  <property>
    <name>dashboard.bind.port</name>
    <value>9999</value>
    <description>Specifies the port on which dashboard listens</description>
  </property>

  <!-- Substitute the IP of the Router service to which the UI should connect -->
  <property>
    <name>router.server.address</name>
    <value>ROUTER-HOST-IP</value>
    <description>Specifies the destination IP where Router service is running</description>
  </property>

  <property>
    <name>router.server.port</name>
    <value>10000</value>
    <description>Specifies the destination Port where Router service is listening</description>
  </property>

</configuration>
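After editing, a quick sanity check that the file is well-formed XML (assuming xmllint from libxml2 is installed):

$ xmllint --noout /etc/cdap/conf/cdap-site.xml && echo "well-formed"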

Depending on your installation, you may want to set these properties:

  • If you want to use an HDFS directory with a name other than /cdap:

    1. Create the HDFS directory you want to use, such as /myhadoop/myspace.

    2. Create an hdfs.namespace property for the HDFS directory in conf/cdap-site.xml:

      <property>
        <name>hdfs.namespace</name>
        <value>/myhadoop/myspace</value>
        <description>Default HDFS namespace</description>
      </property>
      
    3. Ensure that the default HDFS user yarn owns that HDFS directory. (A shell sketch of these steps follows this list.)

  • If you want to use a different HDFS user than yarn:

    1. Check that there is—and create if necessary—a corresponding user on all machines in the cluster on which YARN is running (typically, all of the machines).

    2. Create an hdfs.user property for that user in conf/cdap-site.xml:

      <property>
        <name>hdfs.user</name>
        <value>my_username</value>
        <description>User for accessing HDFS</description>
      </property>
      
    3. Check that the HDFS user owns the HDFS directory described by hdfs.namespace on all machines.

  • Set the router.server.address property in conf/cdap-site.xml to the hostname of the CDAP Router. The CDAP UI uses this property to connect to the Router:

    <property>
      <name>router.server.address</name>
      <value>{router-host-name}</value>
    </property>
    
  • To use the ad-hoc querying capabilities of CDAP, enable the CDAP Explore Service in conf/cdap-site.xml (by default, it is disabled):

    <property>
      <name>cdap.explore.enabled</name>
      <value>true</value>
      <description>Enable Explore functionality</description>
    </property>
    

    This feature cannot be used unless the cluster has a correct version of Hive installed. See the section on Hadoop/HBase Environment. This feature is currently not supported on secure Hadoop clusters.

    Note: Some versions of Hive contain a bug that may prevent the CDAP Explore Service from starting up. See CDAP-1865 for more information about the issue. If the CDAP Explore Service fails to start and you see a javax.jdo.JDODataStoreException: Communications link failure in the log, try adding this property to the Hive hive-site.xml file:

    <property>
      <name>datanucleus.connectionPoolingType</name>
      <value>DBCP</value>
    </property>
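
As noted in the first item above, creating and assigning ownership of a custom HDFS directory might look like this (a sketch, run as a user with HDFS superuser privileges):

$ sudo -u hdfs hadoop fs -mkdir -p /myhadoop/myspace
$ sudo -u hdfs hadoop fs -chown yarn /myhadoop/myspace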
    

Secure Hadoop

When running CDAP on top of Secure Hadoop and HBase (using Kerberos authentication), the CDAP Master process will need to obtain Kerberos credentials in order to authenticate with Hadoop and HBase. In this case, the setting for hdfs.user in cdap-site.xml will be ignored and the CDAP Master process will be identified as the Kerberos principal it is authenticated as.

In order to configure CDAP Master for Kerberos authentication:

  • Create a Kerberos principal for the user running CDAP Master. The principal name should be in the form username/hostname@REALM, creating a separate principal for each host where the CDAP Master will run. This prevents simultaneous login attempts from multiple hosts from being mistaken for a replay attack by the Kerberos KDC.

  • Generate a keytab file for each CDAP Master Kerberos principal, and place the file as /etc/security/keytabs/cdap.keytab on the corresponding CDAP Master host. The file should be readable only by the user running the CDAP Master process. (A sketch of principal and keytab creation follows this list.)

  • Edit /etc/default/cdap-master, substituting the Kerberos principal for <cdap-principal>:

    CDAP_KEYTAB="/etc/security/keytabs/cdap.keytab"
    CDAP_PRINCIPAL="<cdap-principal>@EXAMPLE.REALM.COM"
    
  • Edit /etc/cdap/conf/cdap-site.xml, substituting the Kerberos principal for <cdap-principal> when adding these two properties:

    <property>
      <name>cdap.master.kerberos.keytab</name>
      <value>/etc/security/keytabs/cdap.keytab</value>
      <description>The full path to the Kerberos keytab file containing the CDAP Master's
      credentials.</description>
    </property>
    <property>
      <name>cdap.master.kerberos.principal</name>
      <value><cdap-principal>/_HOST@EXAMPLE.COM</value>
      <description>The Kerberos principal name that should be used to login the CDAP Master
      process. The string "_HOST" will be substituted with the local hostname.</description>
    </property>
    
  • The <cdap-principal> is shown in the commands that follow as cdap; however, you are free to use a different appropriate name.

  • The /cdap directory needs to be owned by the <cdap-principal>; you can set that by running the following command as the hdfs user:

    $ hadoop fs -mkdir /cdap && hadoop fs -chown cdap /cdap
    
  • When running on a secure HBase cluster, as the hbase user, issue the command:

    $ echo "grant 'cdap', 'ACRW'" | hbase shell
    
  • When CDAP Master is started, it will log in using the configured keytab file and principal.
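
Creating the principal and keytab (the first two items above) might look like this on an MIT Kerberos KDC (a sketch; the host and realm names are illustrative):

# On the KDC, as an administrator
$ sudo kadmin.local -q "addprinc -randkey cdap/host1.example.com@EXAMPLE.REALM.COM"
$ sudo kadmin.local -q "xst -k /etc/security/keytabs/cdap.keytab cdap/host1.example.com@EXAMPLE.REALM.COM"
# On the CDAP Master host, restrict the keytab to the user running CDAP Master
$ sudo chown cdap /etc/security/keytabs/cdap.keytab
$ sudo chmod 600 /etc/security/keytabs/cdap.keytab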

Note: CDAP support for secure Hadoop clusters is limited to CDH 5.0 through CDH 5.3.x, and HDP 2.0 or higher.

ULIMIT Configuration

When you install the CDAP packages, the ulimit settings for the CDAP user are specified in the /etc/security/limits.d/cdap.conf file. On Ubuntu, they won’t take effect unless you make changes to the /etc/pam.d/common-session file. You can check this setting with the command ulimit -n when logged in as the CDAP user. For more information, refer to the ulimit discussion in the Apache HBase Reference Guide.
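A sketch of the relevant files on Ubuntu (the nofile values shown are illustrative, not necessarily the packaged defaults):

# /etc/security/limits.d/cdap.conf: raise open-file limits for the cdap user
cdap  soft  nofile  32768
cdap  hard  nofile  32768

# /etc/pam.d/common-session: required on Ubuntu for the limits to take effect
session required pam_limits.so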

Writing to Temp Files

There are two temp directories utilized by CDAP (both specified in Appendix: cdap-site.xml):

  • app.temp.dir (default: /tmp)
  • kafka.log.dir (default: /tmp/kafka-logs)

The CDAP user should be able to write to both of these directories, as they are used for deploying applications and operating CDAP.
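For example, if you pointed kafka.log.dir at a dedicated directory as in the configuration shown earlier, create it and give the CDAP user ownership:

$ sudo mkdir -p /data/cdap/kafka-logs
$ sudo chown cdap /data/cdap/kafka-logs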

Configuring Security

For instructions on enabling CDAP Security, see CDAP Security; and in particular, see the instructions for configuring the properties of cdap-site.xml.

Starting Services

When all the packages and dependencies have been installed, and the configuration parameters set, you can start the services on each of the CDAP boxes by running the command:

$ for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i restart ; done

When all the services have completed starting, the CDAP UI should then be accessible through a browser at port 9999.

The URL will be http://<host>:9999 where <host> is the IP address of one of the machines where you installed the packages and started the services.

Making CDAP Highly Available

Repeat these steps on additional boxes. The configurations needed to support high availability are:

  • kafka.seed.brokers: 127.0.0.1:9092,...
    • Kafka brokers list (comma separated)
  • kafka.default.replication.factor: 2
    • Used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure.
    • The recommended setting is to run at least two Kafka servers.
    • Set this to the number of Kafka servers.
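
For example, a two-broker cdap-site.xml fragment (substitute your hostnames for FQDN1 and FQDN2):

<property>
  <name>kafka.seed.brokers</name>
  <value>FQDN1:9092,FQDN2:9092</value>
  <description>List of Kafka brokers (comma separated)</description>
</property>
<property>
  <name>kafka.default.replication.factor</name>
  <value>2</value>
  <description>Kafka replication factor</description>
</property>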

Getting a Health Check

Administrators can check the health of various services in the system. (In these examples, substitute for <host> the host name or IP address of the CDAP server.)

  • To retrieve the health check of the CDAP UI, make a GET request to the URI:

    http://<host>:9999/status
    
  • To retrieve the health check of the CDAP Router, make a GET request to the URI:

    http://<host>:10000/status
    
  • To retrieve the health check of the CDAP Authentication Server, make a GET request to the URI:

    http://<host>:10009/status
    

On success, the calls return a valid HTTP response with a 200 code.
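
For example, checking the CDAP Router with curl (substitute your host; a 200 indicates success):

$ curl -s -o /dev/null -w "%{http_code}\n" http://<host>:10000/status
200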

  • To retrieve the health check of all the services running in YARN, make a GET request to the URI:

    http://<host>:10000/v3/system/services
    

    On success, the call returns a JSON string with component names and their corresponding statuses (reformatted to fit):

    [{"name":"appfabric","description":"Service for managing application
      lifecycle.","status":"OK","logs":"OK","min":1,"max":1,"requested":1,"provisioned":1},
     {"name":"dataset.executor","description":"Service to perform Dataset
      operations.","status":"OK","logs":"OK","min":1,"max":1,"requested":1,"provisioned":1},
     {"name":"explore.service","description":"Service to run Ad-hoc
      queries.","status":"OK","logs":"OK","min":1,"max":1,"requested":1,"provisioned":1},
     {"name":"log.saver","description":"Service to collect and store
      logs.","status":"OK","logs":"NOTOK","min":1,"max":1,"requested":1,"provisioned":1},
     {"name":"metrics","description":"Service to handle metrics
      requests.","status":"OK","logs":"OK","min":1,"max":1,"requested":1,"provisioned":1},
     {"name":"metrics.processor","description":"Service to process application and system
      metrics.","status":"OK","logs":"NOTOK","min":1,"max":1,"requested":1,"provisioned":1},
     {"name":"streams","description":"Service that handles stream data
      ingestion.","status":"OK","logs":"OK","min":1,"max":1,"requested":1,"provisioned":1},
     {"name":"transaction","description":"Service that maintains transaction
      states.","status":"OK","logs":"NOTOK","min":1,"max":1,"requested":1,"provisioned":1}]
    

Verification

To verify that the CDAP software is successfully installed and that you are able to use your Hadoop cluster, run an example application. For convenience, pre-built JAR files are included in our SDK.

  1. Download and install the latest CDAP Software Development Kit (SDK).
  2. Extract to a folder (CDAP_HOME).
  3. Open a command prompt and navigate to CDAP_HOME/examples.
  4. Each example folder has a .jar file in its target directory. For verification, we will use the WordCount example.
  5. Open a web browser to the CDAP UI. It is located on port 9999 of the box where you installed CDAP.
  6. On the UI, click the button Add App.
  7. Find the pre-built WordCount-3.0.0.jar using the dialog box to navigate to CDAP_HOME/examples/WordCount/target/.
  8. Once the application is deployed, instructions on running the example can be found at the WordCount example.
  9. You should be able to start the application, inject sentences, and retrieve results.
  10. When finished, you can stop and remove the application as described in the section on Building and Running CDAP Applications.

Upgrading an Existing Version

When upgrading an existing CDAP installation from a previous version, you will need to make sure the CDAP table definitions in HBase are up-to-date.

These steps will stop CDAP, update the installation, run an upgrade tool for the table definitions, and then restart CDAP.

These steps will upgrade from CDAP 2.8.0 to 3.0.0. (Note: Apps need to be both recompiled and re-deployed.)

  1. Stop all Flows, Services, and other Programs in all your applications.

  2. Stop all CDAP processes:

    $ for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i stop ; done
    
  3. Update the CDAP packages by running either of these methods:

    • Using Yum (on one line):

      $ sudo yum install cdap cdap-gateway \
            cdap-hbase-compat-0.94 cdap-hbase-compat-0.96 cdap-hbase-compat-0.98 \
            cdap-kafka cdap-master cdap-security cdap-ui
      
    • Using APT (on one line):

      $ sudo apt-get install cdap cdap-gateway \
            cdap-hbase-compat-0.94 cdap-hbase-compat-0.96 cdap-hbase-compat-0.98 \
            cdap-kafka cdap-master cdap-security cdap-ui
      

    Note: We have deprecated the cdap-web-app package in favor of the cdap-ui package.

  4. Copy the logback-container.xml into your conf directory; please see Configuration. (A sketch of this step follows this list.)

  5. If you are upgrading a secure Hadoop cluster, you should authenticate with kinit before the next step (upgrade tool):

    $ kinit -kt <keytab> <principal>
    
  6. Run the upgrade tool:

    $ /opt/cdap/master/bin/svc-master run co.cask.cdap.data.tools.UpgradeTool upgrade
    
  7. Restart the CDAP processes:

    $ for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done
    
  8. Run the Flow Queue pending metrics corrector:

    $ /opt/cdap/master/bin/svc-master run co.cask.cdap.data.tools.flow.FlowQueuePendingCorrector
    

    This will correct the pending metrics for flows. This is a new metric that was introduced in CDAP 3.0; flows that existed before the upgrade to 3.0 do not have a correct value for this metric and running the tool provides a one-time correction.

  9. You must recompile and then redeploy your applications.
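
The copy in step 4 might look like this (a sketch, assuming the packaged defaults directory described under Configuration; the source location of logback-container.xml may differ in your installation):

$ sudo cp /etc/cdap/conf.dist/logback-container.xml /etc/cdap/conf/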

Troubleshooting

  • If you have YARN configured to use LinuxContainerExecutor (see the setting for yarn.nodemanager.container-executor.class), the cdap user needs to be present on all Hadoop nodes.

  • If you are using a LinuxContainerExecutor, and the UID for the cdap user is less than 500, you will need to add the cdap user to the allowed users configuration for the LinuxContainerExecutor in YARN by editing the /etc/hadoop/conf/container-executor.cfg file. Change the line for allowed.system.users to:

    allowed.system.users=cdap