🔗Manual Installation using Packages



This section describes installing CDAP on Hadoop clusters that are:

  • Generic Apache Hadoop distributions;
  • CDH (Cloudera Distribution of Apache Hadoop) clusters not managed with Cloudera Manager; or
  • HDP (Hortonworks Data Platform) clusters not managed with Apache Ambari.

Cloudera Manager (CDH), Apache Ambari (HDP), and MapR distributions should be installed with our other distribution instructions.

  • As CDAP depends on HDFS, YARN, HBase, ZooKeeper, and (optionally) Hive and Spark, it must be installed on cluster host(s) with full client configurations for these dependent services.
  • The CDAP Master Service must be co-located on a cluster host with an HDFS client, a YARN client, an HBase client, and—optionally—Hive or Spark clients.
  • Note that these clients are redundant if you are co-locating the CDAP Master on a cluster host (or hosts, in the case of a deployment with high availability) with actual services, such as the HDFS Namenode, the YARN resource manager, or the HBase Master.
  • You can download the Hadoop client and HBase client libraries, and then install them on the hosts running CDAP services. No Hadoop or HBase services need be running.
  • All services run as the 'cdap' user installed by the package manager.
  • If you are installing CDAP with the intention of using replication, see these instructions on CDAP Replication before installing or starting CDAP.

🔗Preparing the Cluster

Please review the Software Prerequisites: a configured Hadoop, HBase, and Hive (plus an optional Spark client) must be available on the node(s) where CDAP will run.

🔗Hadoop Configuration

  1. ZooKeeper's maxClientCnxns must be raised from its default. We suggest setting it to zero (0: unlimited connections). As each YARN container launched by CDAP makes a connection to ZooKeeper, the number of connections required is a function of usage.

  2. Ensure that YARN has sufficient memory capacity by lowering the default minimum container size (controlled by the property yarn.scheduler.minimum-allocation-mb). Lack of YARN memory capacity is the leading cause of apparent failures that we see reported. We recommend starting with these settings:

    • yarn.nodemanager.delete.debug-delay-sec: 43200 (see note below)
    • yarn.scheduler.minimum-allocation-mb: 512 MB

    The value we recommend for yarn.nodemanager.delete.debug-delay-sec (43200 or 12 hours) is what we use internally at Cask for testing as that provides adequate time to capture the logs of any failures. However, you should use an appropriate non-zero value specific to your environment. A large value can be expensive from a storage perspective.

    Please ensure your yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb settings are set sufficiently to run CDAP, as described in the CDAP Memory and Core Requirements.
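To see why lowering the minimum container size frees up capacity, here is a rough sketch of the arithmetic. It assumes that the scheduler rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb, so the number of smallest-size containers a node can hold is its memory divided by that minimum. The function name and the 8192 MB node size are illustrative assumptions, not CDAP recommendations.

```shell
#!/bin/sh
# Sketch: upper bound on minimum-size containers per NodeManager.
#   $1 = yarn.nodemanager.resource.memory-mb (example value below is made up)
#   $2 = yarn.scheduler.minimum-allocation-mb
containers_per_node() (
  node_mb=$1
  min_alloc_mb=$2
  # Integer division: YARN cannot hand out a fraction of a container.
  echo $(( node_mb / min_alloc_mb ))
)

containers_per_node 8192 1024   # prints 8  (1024 is the stock YARN default)
containers_per_node 8192 512    # prints 16 (the setting suggested above)
```

Halving the minimum allocation doubles the number of small containers that fit, which matters because CDAP launches many modest-sized containers.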

🔗HDFS Permissions

Ensure YARN is configured properly to run MapReduce programs. Often, this includes ensuring that the HDFS /user/yarn and /user/cdap directories exist with proper permissions:

$ su hdfs
$ hadoop fs -mkdir -p /user/yarn && hadoop fs -chown yarn:yarn /user/yarn
$ hadoop fs -mkdir -p /user/cdap && hadoop fs -chown cdap:cdap /user/cdap

🔗Downloading and Distributing Packages

🔗Preparing Package Managers

CDAP components are available as either Yum .rpm or APT .deb packages. There is one package for each CDAP component, and each component may have multiple services. Additionally, there is a base CDAP package, installed along with three utility packages (for HBase compatibility), which creates the base configuration and the cdap user.

We provide packages for Ubuntu 12.04+ and CentOS 6.0+. While these are the only packages we currently provide, they contain no distribution version-specific code, and the same packages will work on equivalent OSes.

Available packaging types:

  • RPM: Yum repo
  • Debian: APT repo
  • Tar: For specialized installations only

Note: If you are using Chef to install CDAP, an official cookbook is available.

🔗On RPM using Yum

Download the Cask Yum repo definition file:

$ sudo curl -o /etc/yum.repos.d/cask.repo http://repository.cask.co/centos/6/x86_64/cdap/4.1/cask.repo

This will create the file /etc/yum.repos.d/cask.repo with:

name=Cask Packages

Add the Cask Public GPG Key to your repository:

$ sudo rpm --import http://repository.cask.co/centos/6/x86_64/cdap/4.1/pubkey.gpg

Update your Yum cache:

$ sudo yum makecache

🔗On Debian using APT

Download the Cask APT repo definition file:

$ sudo curl -o /etc/apt/sources.list.d/cask.list http://repository.cask.co/ubuntu/precise/amd64/cdap/4.1/cask.list

This will create the file /etc/apt/sources.list.d/cask.list with:

deb [ arch=amd64 ] http://repository.cask.co/ubuntu/precise/amd64/cdap/4.1 precise cdap

Add the Cask Public GPG Key to your repository:

$ curl -s http://repository.cask.co/ubuntu/precise/amd64/cdap/4.1/pubkey.gpg | sudo apt-key add -

Update your APT-cache:

$ sudo apt-get update

🔗Using Tar

Download the appropriate CDAP tar file (the RPM bundle or the Debian bundle), and then unpack it to an appropriate directory (indicated by $dir).

For the RPM bundle:

$ curl -O http://repository.cask.co/downloads/co/cask/cdap/cdap-distributed-rpm-bundle/4.1/cdap-distributed-rpm-bundle-4.1.tgz
$ tar xf cdap-distributed-rpm-bundle-4.1.tgz -C $dir

For the Debian bundle:

$ curl -O http://repository.cask.co/downloads/co/cask/cdap/cdap-distributed-deb-bundle/4.1/cdap-distributed-deb-bundle-4.1.tgz
$ tar xf cdap-distributed-deb-bundle-4.1.tgz -C $dir

🔗Installing CDAP Services

🔗Package Installation

Install the CDAP packages by using one of the following methods. Do this on each of the boxes that are being used for the CDAP components; our recommended installation is a minimum of two boxes.

This will download and install the latest version of CDAP with all of its dependencies.

To install the optional CDAP CLI on a node, add the cdap-cli package to the list of packages in the commands below.

🔗Using Chef

If you are using Chef to install CDAP, an official cookbook is available.

To install the optional CDAP CLI on a node, use the fullstack recipe.

🔗On RPM using Yum

$ sudo yum install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui

🔗On Debian using APT

$ sudo apt-get install cdap-gateway cdap-kafka cdap-master cdap-security cdap-ui

🔗Using Tar

Having previously downloaded and unpacked the appropriate tar file to a directory $dir, use the commands that match your package type.

For RPM packages:

$ sudo yum localinstall $dir/*.rpm

For Debian packages:

$ sudo dpkg -i $dir/*.deb
$ sudo apt-get install -f

🔗Create Required Directories

To prepare your cluster so that CDAP can write to its default namespace, create a top-level /cdap directory in HDFS, owned by the HDFS user yarn:

$ su hdfs
$ hadoop fs -mkdir -p /cdap && hadoop fs -chown yarn /cdap

In the CDAP packages, the default property hdfs.namespace is /cdap and the default property hdfs.user is yarn.

Also, create a tx.snapshot subdirectory:

$ su hdfs
$ hadoop fs -mkdir -p /cdap/tx.snapshot && hadoop fs -chown yarn /cdap/tx.snapshot

Note: If you have customized (or will be customizing) the property data.tx.snapshot.dir in your CDAP configuration, use that value instead for /cdap/tx.snapshot.

If your cluster is not set up with these defaults, you'll need to edit your CDAP configuration prior to starting services.

🔗CDAP Configuration

This section describes how to configure the CDAP components so they work with your existing Hadoop cluster. Certain Hadoop components may need changes, as described below, for CDAP to run successfully.

  1. CDAP packages utilize a central configuration, stored by default in /etc/cdap.

    When you install the CDAP base package, a default configuration is placed in /etc/cdap/conf.dist. The cdap-site.xml file is a placeholder where you can define your specific configuration for all CDAP components. The cdap-site.xml.example file shows the properties that usually require customization for all installations.

    Similar to Hadoop, CDAP utilizes the alternatives framework to allow you to easily switch between multiple configurations. The alternatives system is used for ease of management and allows you to choose between different directories to fulfill the same purpose.

    Simply copy the contents of /etc/cdap/conf.dist into a directory of your choice (such as /etc/cdap/conf.mycdap) and make all of your customizations there. Then run the alternatives command to point the /etc/cdap/conf symlink to your custom directory /etc/cdap/conf.mycdap:

    $ sudo cp -r /etc/cdap/conf.dist /etc/cdap/conf.mycdap
    $ sudo update-alternatives --install /etc/cdap/conf cdap-conf /etc/cdap/conf.mycdap 10
  2. Configure the cdap-site.xml after you have installed the CDAP packages.

    To configure your particular installation, modify cdap-site.xml, using cdap-site.xml.example as a model. (See the appendix for a listing of cdap-site.xml.example, the minimal cdap-site.xml file required.)

    Customize your configuration by creating (or editing if existing) an .xml file conf/cdap-site.xml and set appropriate properties:

    $ sudo cp -f /etc/cdap/conf.mycdap/cdap-site.xml.example /etc/cdap/conf.mycdap/cdap-site.xml
    $ sudo vi /etc/cdap/conf.mycdap/cdap-site.xml
  3. If necessary, customize the file cdap-env.sh after you have installed the CDAP packages.

    Environment variables set in the cdap-env.sh file, usually at /etc/cdap/conf/cdap-env.sh, are included in the environment used when launching CDAP.

    This is only necessary if you need to customize the environment launching CDAP, such as described below under Local Storage Configuration.

  4. Depending on your installation, you may need to set these properties:

    1. Check that the zookeeper.quorum property in conf/cdap-site.xml is set to the ZooKeeper quorum string, a comma-delimited list of fully-qualified domain names for the ZooKeeper quorum:

          ZooKeeper quorum string; specifies the ZooKeeper host:port;
          substitute the quorum for the components shown here (FQDN1:2181,FQDN2:2181)
    2. Check that the router.server.address property in conf/cdap-site.xml is set to the hostname of the CDAP Router. The CDAP UI uses this property to connect to the Router:

        <description>CDAP Router address to which CDAP UI connects</description>
    3. Check that there exists in HDFS a user directory for the hdfs.user property of conf/cdap-site.xml. By default, the HDFS user is yarn. If necessary, create the directory:

      $ su hdfs
      $ hadoop fs -mkdir -p /user/yarn && hadoop fs -chown yarn:yarn /user/yarn
    4. If you want to use an HDFS directory with a name other than /cdap:

      1. Create the HDFS directory you want to use, such as /myhadoop/myspace.

      2. Create an hdfs.namespace property for the HDFS directory in conf/cdap-site.xml:

          <description>Default HDFS namespace</description>
      3. Check that the default HDFS user yarn owns that HDFS directory.

    5. If you want to use an HDFS user other than yarn, such as my_username:

      1. Check that there is—and create if necessary—a corresponding user on all machines in the cluster on which YARN is running (typically, all of the machines).

      2. Create an hdfs.user property for that user in conf/cdap-site.xml:

          <description>User for accessing HDFS</description>
      3. Check that the HDFS user owns the HDFS directory described by hdfs.namespace on all machines.

      4. Check that there exists in HDFS a /user/ directory for that HDFS user, as described above, such as:

        $ su hdfs
        $ hadoop fs -mkdir -p /user/my_username && hadoop fs -chown my_username:my_username /user/my_username
      5. If you use an HDFS user other than yarn, you must use either a secure cluster or use the LinuxContainerExecutor instead of the DefaultContainerExecutor. (Because of how DefaultContainerExecutor works, other containers will launch as yarn rather than the specified hdfs.user.) On Kerberos-enabled clusters, you must use LinuxContainerExecutor as the DefaultContainerExecutor will not work correctly.

    6. To use the ad-hoc querying capabilities of CDAP, ensure the cluster has a compatible version of Hive installed. See the section on Hadoop Compatibility. To use this feature on secure Hadoop clusters, please see these instructions on configuring secure Hadoop.

      Note: Some versions of Hive contain a bug that may prevent the CDAP Explore Service from starting up. See CDAP-1865 for more information about the issue. If the CDAP Explore Service fails to start and you see a javax.jdo.JDODataStoreException: Communications link failure in the log, try adding this property to the Hive hive-site.xml file:

    7. If Hive is not going to be installed, disable the CDAP Explore Service in conf/cdap-site.xml (by default, it is enabled):

        <description>Enable Explore functionality</description>
    8. If you'd like to publish metadata updates to an external Apache Kafka instance, CDAP has the capability of publishing notifications upon metadata updates. Details on the configuration settings and an example output are shown in the Audit logging section of the Developers' Manual.
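As an illustration of the zookeeper.quorum format described in the steps above, the following sketch joins a list of ZooKeeper hosts into a comma-delimited host:port string. The function name and the zk*.example.com host names are hypothetical; port 2181 is the one shown in the example in the text.

```shell
#!/bin/sh
# Sketch: build a quorum string suitable for the zookeeper.quorum property.
#   $1      = client port shared by all ZooKeeper hosts
#   $2...   = fully-qualified host names (the ones below are made up)
zk_quorum() (
  port=$1
  shift
  quorum=""
  for host in "$@"; do
    # Append "host:port", inserting a comma only between entries.
    quorum="${quorum:+$quorum,}$host:$port"
  done
  echo "$quorum"
)

zk_quorum 2181 zk1.example.com zk2.example.com zk3.example.com
# prints zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```

The printed string is what would go into the value of zookeeper.quorum in conf/cdap-site.xml.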

🔗ULIMIT Configuration

When you install the CDAP packages, the ulimit settings for the CDAP user are specified in the /etc/security/limits.d/cdap.conf file. On Ubuntu, they won't take effect unless you make changes to the /etc/pam.d/common-session file. You can check this setting with the command ulimit -n when logged in as the CDAP user. For more information, refer to the ulimit discussion in the Apache HBase Reference Guide.
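A quick way to act on the ulimit -n check is sketched below. The helper name is hypothetical, and the threshold of 32768 is only an assumed example; compare against the value actually present in /etc/security/limits.d/cdap.conf on your hosts.

```shell
#!/bin/sh
# Sketch: compare an open-file limit with a required minimum.
#   $1 = limit as reported by `ulimit -n` (a number or "unlimited")
#   $2 = required minimum (assumed example value; check cdap.conf)
nofile_ok() (
  current=$1
  minimum=$2
  # "unlimited" always satisfies the requirement.
  [ "$current" = "unlimited" ] && exit 0
  [ "$current" -ge "$minimum" ]
)

# Run this while logged in as the CDAP user.
if nofile_ok "$(ulimit -n)" 32768; then
  echo "open-file limit is sufficient"
else
  echo "open-file limit is too low: $(ulimit -n)"
fi
```

If the limit is reported as too low on Ubuntu, the /etc/pam.d/common-session change mentioned above is the first thing to check.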

🔗Local Storage Configuration

Local storage directories, which vary by distribution, are used by CDAP for deploying applications and for operating CDAP.

The CDAP user (the cdap system user) must be able to write to all of these directories.

  • List of local storage directories

    • Properties specified in the cdap-site.xml file, as described in the Appendix: cdap-site.xml, cdap-default.xml:
      • app.temp.dir (default: /tmp)
      • kafka.server.log.dirs (default: /tmp/kafka-logs)
      • local.data.dir (default: data; if this is instead an absolute path, needs to be writable)
    • Additional directories:
      • /var/cdap/run (used as a PID directory, created by the packages)
      • /var/log/cdap (used as log directory, created by the packages)
      • /var/run/cdap (default CDAP user's home directory, created by the packages)
      • /var/tmp/cdap (default LOCAL_DIR—see below—defined and created in the CDAP init scripts)
  • Note that local.data.dir—which defines the directory for program jar storage when deploying to YARN—is set in the cdap-site.xml and defaults to the relative path data. If the value of local.data.dir is relative, it is put under LOCAL_DIR, such as /var/tmp/cdap/data. However, if instead it is an absolute path, that alone is used as the value. This is desirable so you can easily configure this directory to be elsewhere.

  • The CDAP Master service is governed by environment variables, which set the directories it uses:

    • TEMP_DIR (default: /tmp): The directory serving as the java.io.tmpdir directory
    • LOCAL_DIR (default: /var/tmp/cdap): The directory serving as the user directory for CDAP Master

    These variables can be set in the file /etc/cdap/conf/cdap-env.sh and will be included in the environment when launching CDAP. See CDAP Configuration for details of the central configuration used by CDAP and how to implement this.

  • As in all installations, the kafka.server.log.dirs may need to be created locally. If you configure kafka.server.log.dirs (or any of the other settable parameters) to a particular directory or directories, you need to make sure that the directories exist and that they are writable by the CDAP user.
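The rule for local.data.dir described above (relative values are placed under LOCAL_DIR, absolute values are used as-is) can be sketched as follows. This only illustrates the rule and is not CDAP's own code; the function name and the /data/cdap path are hypothetical.

```shell
#!/bin/sh
# Sketch: work out which directory local.data.dir ends up pointing at.
#   $1 = value of local.data.dir from cdap-site.xml
#   $2 = value of LOCAL_DIR from cdap-env.sh
resolve_local_data_dir() (
  data_dir=$1
  local_dir=$2
  case "$data_dir" in
    /*) echo "$data_dir" ;;              # absolute path: used as-is
    *)  echo "$local_dir/$data_dir" ;;   # relative path: placed under LOCAL_DIR
  esac
)

resolve_local_data_dir data /var/tmp/cdap         # prints /var/tmp/cdap/data
resolve_local_data_dir /data/cdap /var/tmp/cdap   # prints /data/cdap
```

Whichever directory results must exist and be writable by the CDAP user.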

🔗Configuring Hortonworks Data Platform

Beginning with Hortonworks Data Platform (HDP) 2.2, the MapReduce libraries are in HDFS. This requires an addition be made to the file cdap-env.sh to indicate the version of HDP:

export OPTS="${OPTS} -Dhdp.version=<version>"

where <version> matches the HDP version of the cluster. The build iteration must be included, so if the cluster version of HDP is, use:

export OPTS="${OPTS} -Dhdp.version="

The file cdap-env.sh is located in the central configuration directory, as described above under CDAP Configuration.

In addition, the property app.program.jvm.opts must be set in the cdap-site.xml:

  <value>-XX:MaxPermSize=128M ${twill.jvm.gc.opts} -Dhdp.version=<version> -Dspark.yarn.am.extraJavaOptions=-Dhdp.version=<version></value>
  <description>Java options for all program containers</description>

Using the same example as above, substituting for <version>, as:

  <value>-XX:MaxPermSize=128M ${twill.jvm.gc.opts} -Dhdp.version= -Dspark.yarn.am.extraJavaOptions=-Dhdp.version=</value>
  <description>Java options for all program containers</description>
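Because the same version string appears in cdap-env.sh and twice in app.program.jvm.opts, it can help to generate both from a single variable. The sketch below does that; the helper names are hypothetical, and HDP_VERSION is deliberately left as the <version> placeholder, to be replaced with the full HDP version of your cluster, including the build iteration.

```shell
#!/bin/sh
# Sketch: derive both HDP-specific settings from one version string.

hdp_env_opts() (
  # Line to append to cdap-env.sh. ${OPTS} stays literal here so that it
  # is expanded later, when cdap-env.sh itself is sourced.
  printf 'export OPTS="${OPTS} -Dhdp.version=%s"\n' "$1"
)

hdp_program_jvm_opts() (
  # Value for app.program.jvm.opts. ${twill.jvm.gc.opts} stays literal so
  # that CDAP, not the shell, substitutes it.
  printf '%s -Dhdp.version=%s -Dspark.yarn.am.extraJavaOptions=-Dhdp.version=%s\n' \
    '-XX:MaxPermSize=128M ${twill.jvm.gc.opts}' "$1" "$1"
)

HDP_VERSION='<version>'   # placeholder: substitute your cluster's HDP version
hdp_env_opts "$HDP_VERSION"
hdp_program_jvm_opts "$HDP_VERSION"
```

The first printed line belongs in cdap-env.sh; the second is the text for the value element of app.program.jvm.opts in cdap-site.xml.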

🔗Starting CDAP Services

When all the packages and dependencies have been installed, and the configuration parameters set, you can start the services on each of the CDAP boxes by running the command:

$ for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done

When all the services have completed starting, the CDAP UI should then be accessible through a browser at port 11011.

The URL will be http://<host>:11011 where <host> is the IP address of one of the machines where you installed the packages and started the services.
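If the UI does not come up, checking each service individually is a reasonable first step. The sketch below lists the CDAP init scripts with a shell glob and asks each for its status. It assumes the init scripts accept a status argument, as most init scripts do, and the function name is hypothetical. Unlike the start loop above, it matches only names that begin with cdap.

```shell
#!/bin/sh
# Sketch: list CDAP init scripts found in a directory.
#   $1 = init script directory (defaults to /etc/init.d)
cdap_services() (
  dir=${1:-/etc/init.d}
  for script in "$dir"/cdap*; do
    # Skip the unexpanded pattern when nothing matches.
    [ -e "$script" ] || continue
    basename "$script"
  done
)

# Report the status of every CDAP service installed on this host.
for svc in $(cdap_services); do
  sudo service "$svc" status
done
```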

Note: Service-specific Java heap memory settings (that override the default values) can be created by setting these environment variables:


such as:

$ export AUTH_JAVA_HEAPMAX="-Xmx1024m"

Add any overriding settings to a file, usually /etc/cdap/conf/cdap-env.sh. As described above (in CDAP Configuration), the location of this file will depend on your particular configuration.


🔗CDAP Smoke Test

The CDAP UI may initially show errors while all of the CDAP YARN containers are starting up. Allow for up to a few minutes for this.

The Administration page of the CDAP UI shows the status of the CDAP services. It can be reached at http://<cdap-host>:11011/cdap/administration, substituting for <cdap-host> the host name or IP address of the CDAP server:


[Figure: CDAP UI showing the Administration page after startup.]

Further instructions for verifying your installation are contained in Verification.

🔗Advanced Topics

🔗Enabling Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security, authorization, impersonation and secure storage.

Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

Authorization provides a way of enforcing access control on CDAP entities.

Impersonation ensures that programs inside CDAP are run as configured users at the namespace level. When enabled, it guarantees that all actions on datasets, streams and other resources happen as the configured user.

We recommend that CDAP security always be used in conjunction with secure Hadoop clusters. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively "trusted". Although there is still value in having perimeter security, authorization enforcement and secure storage in that situation, whenever possible a secure Hadoop cluster should be employed with CDAP security.

For instructions on enabling CDAP Security, see CDAP Security.

🔗Enabling Kerberos

When running CDAP on top of a secure Hadoop cluster (using Kerberos authentication), the CDAP processes will need to obtain Kerberos credentials in order to authenticate with Hadoop, HBase, ZooKeeper, and (optionally) Hive. In this case, the setting for hdfs.user in cdap-site.xml will be ignored and the CDAP processes will be identified by the default authenticated Kerberos principal.

Note: CDAP support for secure Hadoop clusters is limited to the latest versions of CDH, HDP, MapR, and Apache BigTop; currently, Amazon EMR is not supported on secure Hadoop clusters.

  1. In order to configure CDAP for Kerberos authentication:

    1. Create a Kerberos principal for the user running CDAP. The principal name should be in the form username/hostname@REALM, creating a separate principal for each host where a CDAP service will run. This prevents simultaneous login attempts from multiple hosts from being mistaken for a replay attack by the Kerberos KDC.

    2. Generate a keytab file for each CDAP Master Kerberos principal, and place the file as /etc/security/keytabs/cdap.keytab on the corresponding CDAP Master host. The file should be readable only by the user running the CDAP Master service.

    3. Edit /etc/cdap/conf/cdap-site.xml on each host running a CDAP service, substituting the Kerberos primary (user) for <cdap-principal>, and your Kerberos authentication realm for EXAMPLE.COM, when adding these two properties:

    4. The <cdap-principal> is shown in the commands that follow as cdap; however, you are free to use a different appropriate name.

    5. The /cdap directory needs to be owned by the <cdap-principal>; you can set that by running the following command as the hdfs user (change the ownership in the command from cdap to whatever is the <cdap-principal>):

      $ su hdfs
      $ hadoop fs -mkdir -p /cdap && hadoop fs -chown cdap /cdap
    6. When running on a secure HBase cluster, as the hbase user, issue the command:

      $ echo "grant 'cdap', 'RWCA'" | hbase shell
    7. When CDAP Master is started, it will login using the configured keytab file and principal.

  2. In order to configure YARN for secure Hadoop: the <cdap-principal> user must be able to launch YARN containers, either by adding it to the YARN allowed.system.users whitelist (preferred) or by adjusting the YARN min.user.id to include the <cdap-principal> user.

  3. In order to configure CDAP Explore Service for secure Hadoop:

    1. To allow CDAP to act as a Hive client, it must be given proxyuser permissions and allowed from all hosts. For example: set the following properties in the configuration file core-site.xml, where cdap is a system group of which the cdap user is a member:

    2. To execute Hive queries on a secure cluster, the cluster must be running the MapReduce JobHistoryServer service. Consult your distribution documentation on the proper configuration of this service.

    3. To execute Hive queries on a secure cluster using the CDAP Explore Service, the Hive MetaStore service must be configured for Kerberos authentication. Consult your distribution documentation on the proper configuration of the Hive MetaStore service.

    With all these properties set, the CDAP Explore Service will run on secure Hadoop clusters.
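The per-host principal naming from the Kerberos steps above (username/hostname@REALM, one principal per host) can be sketched as below. The function name and the host names are hypothetical; cdap and EXAMPLE.COM follow the placeholders used in the text.

```shell
#!/bin/sh
# Sketch: print one Kerberos principal per CDAP host.
#   $1    = Kerberos primary (the user running CDAP)
#   $2    = Kerberos realm
#   $3... = hosts where a CDAP service will run (made-up names below)
cdap_principals() (
  user=$1
  realm=$2
  shift 2
  for host in "$@"; do
    echo "$user/$host@$realm"
  done
)

cdap_principals cdap EXAMPLE.COM master1.example.com master2.example.com
# prints:
#   cdap/master1.example.com@EXAMPLE.COM
#   cdap/master2.example.com@EXAMPLE.COM
```

Each printed name is a principal to create in your KDC and export to the keytab on the corresponding host.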

🔗Enabling CDAP HA

In addition to having a cluster architecture that supports HA (high availability), these additional configuration steps need to be followed and completed:

🔗CDAP Components

For each of the CDAP components listed below (Master, Router, Kafka, UI, Authentication Server), these comments apply:

  • Sync the configuration files (such as cdap-site.xml and cdap-security.xml) on all the nodes.
  • While the default bind.address settings (used for app.bind.address, data.tx.bind.address, router.bind.address, and so on) can be synced across hosts, if you customize them to a particular IP address, they will, as a result, be different on different hosts.
  • Starting services is described in Starting CDAP Services.

🔗CDAP Master

The CDAP Master service primarily performs coordination tasks and can be scaled for redundancy. The instances coordinate amongst themselves, electing one as a leader at all times.

  • Install the cdap-master package on different nodes.
  • Ensure they are configured identically (/etc/cdap/conf/cdap-site.xml).
  • Start the cdap-master service on each node.

🔗CDAP Router

The CDAP Router service is a stateless API endpoint for CDAP, and simply routes requests to the appropriate service. It can be scaled horizontally for performance. A load balancer, if desired, can be placed in front of the nodes running the service.

  • Install the cdap-gateway package on different nodes.
  • The router.bind.address may need to be customized on each box if it is not set to the default wildcard address.
  • Start the cdap-router service on each node.

🔗CDAP Kafka

  • Install the cdap-kafka package on different nodes.

  • Two properties need to be set in the cdap-site.xml files on each node:

    • The Kafka seed brokers list is a comma-separated list of hosts, followed by /${root.namespace}:

      kafka.seed.brokers: myhost.example.com:9092,.../${root.namespace}

      Substitute appropriate addresses for myhost.example.com in the above example.

    • The replication factor is used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure:

      kafka.default.replication.factor: 2

  • The recommended setting is to run at least two Kafka brokers with a minimum replication factor of two; set this property to the maximum number of tolerated machine failures plus one (assuming you have that number of machines). For example, if you were running five Kafka brokers, and would tolerate two of those failing, you would set the replication factor to three. The number of Kafka brokers listed should always be equal to or greater than the replication factor.

  • Start the cdap-kafka service on each node.
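The replication-factor rule above (tolerated failures plus one, never more than the number of brokers) works out as in this sketch. The function name is hypothetical; the 5-broker, 2-failure case is the example from the text.

```shell
#!/bin/sh
# Sketch: compute kafka.default.replication.factor.
#   $1 = number of Kafka brokers
#   $2 = number of broker failures to tolerate
kafka_replication_factor() (
  brokers=$1
  tolerated=$2
  factor=$(( tolerated + 1 ))
  # The factor cannot exceed the number of brokers available.
  if [ "$factor" -gt "$brokers" ]; then
    echo "need at least $factor brokers, have $brokers" >&2
    exit 1
  fi
  echo "$factor"
)

kafka_replication_factor 5 2   # prints 3
```

With the recommended minimum of two brokers and one tolerated failure, the same calculation gives a replication factor of two.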


🔗CDAP UI

  • Install the cdap-ui package on different nodes.
  • Start the cdap-ui service on each node.

🔗CDAP Authentication Server

  • Install the cdap-security package (the CDAP Authentication Server) on different nodes.
  • Start the cdap-security service on each node.
  • Note that when an unauthenticated request is made in a secure HA setup, a list of all running authentication endpoints will be returned in the body of the response.

🔗Hive Execution Engines

CDAP Explore has support for additional execution engines such as Apache Spark and Apache Tez. Details on specifying these engines and configuring CDAP are in the Developers' Manual section on Data Exploration, Hive Execution Engines.