Installation using Apache Ambari



  • Apache Ambari can only be used to add CDAP to an existing Hadoop cluster, one that already has the required services (Hadoop: HDFS, YARN, HBase, ZooKeeper, and—optionally—Hive and Spark) installed.
  • Ambari is designed for setting up HDP (Hortonworks Data Platform) on bare clusters; it cannot be used with clusters where HDP is already installed but the original installation was not performed with Ambari.
  • A number of features are currently planned to be added.
  • If you are installing CDAP with the intention of using replication, see these instructions on CDAP Replication before installing or starting CDAP.

Preparing the Cluster

Hadoop Configuration

  1. ZooKeeper's maxClientCnxns must be raised from its default. We suggest setting it to zero (0, meaning unlimited connections). Because each YARN container launched by CDAP makes a connection to ZooKeeper, the number of connections required is a function of usage.

  2. Ensure that YARN has sufficient memory capacity by lowering the default minimum container size (controlled by the property yarn.scheduler.minimum-allocation-mb). Lack of YARN memory capacity is the leading cause of apparent failures that we see reported. We recommend starting with these settings:

    • yarn.nodemanager.delete.debug-delay-sec: 43200 (see note below)
    • yarn.scheduler.minimum-allocation-mb: 512 (MB)

    The value we recommend for yarn.nodemanager.delete.debug-delay-sec (43200 or 12 hours) is what we use internally at Cask for testing as that provides adequate time to capture the logs of any failures. However, you should use an appropriate non-zero value specific to your environment. A large value can be expensive from a storage perspective.

    Please ensure your yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb settings are set sufficiently to run CDAP, as described in the CDAP Memory and Core Requirements.

You can make these changes during the configuration of your cluster using Ambari.
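As a sketch, the raw configuration entries behind these recommendations (normally set through the Ambari configuration screens rather than by editing files directly) would look like the fragment below for yarn-site.xml, plus `maxClientCnxns=0` in ZooKeeper's zoo.cfg:

```xml
<!-- yarn-site.xml fragment (illustrative; set these via Ambari in practice) -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <!-- 12 hours; use a non-zero value appropriate to your environment -->
  <value>43200</value>
</property>
```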

HDFS Permissions

Ensure YARN is configured properly to run MapReduce programs. Often, this includes ensuring that the HDFS /user/yarn and /user/cdap directories exist with proper permissions:

$ su hdfs
$ hadoop fs -mkdir -p /user/yarn && hadoop fs -chown yarn:yarn /user/yarn
$ hadoop fs -mkdir -p /user/cdap && hadoop fs -chown cdap:cdap /user/cdap

Downloading and Distributing Packages

Downloading CDAP Ambari Service

To install CDAP on a cluster managed by Ambari, packages are available for RHEL-compatible and Ubuntu systems, which you can install onto your Ambari management server. These packages add CDAP to the list of available services which Ambari can install.

To install the cdap-ambari-service package, first add the appropriate CDAP repository to your system’s package manager by following the steps below. These steps will install a Cask repository on your Ambari server.

The repository version (shown in the commands below as cdap/5.1) must match the CDAP series which you'd like installed on your cluster. For example, to install the latest version of the CDAP 5.1 series, you would install the CDAP 5.1 repository.

In the commands that follow on this page, replace all references to cdap/5.1 with the CDAP repository corresponding to the version that you would like to use (such as cdap/4.0 for CDAP 4.0.x):

Supported Hortonworks Data Platform (HDP) Distributions
CDAP Series or Release Hadoop Distributions
CDAP 4.1.1, 4.2.x HDP 2.0 through HDP 2.6
CDAP 4.1.0 HDP 2.0 through HDP 2.5
CDAP 4.0.x HDP 2.0 through HDP 2.5
CDAP 3.6.x HDP 2.0 through HDP 2.4
CDAP 3.5.x HDP 2.0 through HDP 2.4
CDAP 3.4.x HDP 2.0 through HDP 2.4
CDAP 3.3.x HDP 2.0 through HDP 2.3
CDAP 3.2.x HDP 2.0 through HDP 2.3
CDAP 3.1.x HDP 2.0 through HDP 2.2
CDAP 3.0.x HDP 2.0 and HDP 2.1


  • The CDAP Ambari service has been tested on Ambari Server 2.3 through 2.5, as supplied from Hortonworks.
  • To install a version lower than the highest current version (such as CDAP 4.1.0 when 4.1.1 is available), you will need to downgrade your repo after installing it.

On RPM using Yum

Download the Cask Yum repo definition file:

$ sudo curl -o /etc/yum.repos.d/cask.repo

This will create the file /etc/yum.repos.d/cask.repo with:

name=Cask Packages

Add the Cask Public GPG Key to your repository:

$ sudo rpm --import

Update your Yum cache:

$ sudo yum makecache

On Debian using APT

Download the Cask APT repo definition file:

$ sudo curl -o /etc/apt/sources.list.d/cask.list

This will create the file /etc/apt/sources.list.d/cask.list with:

deb [ arch=amd64 ] precise cdap

Add the Cask Public GPG Key to your repository:

$ curl -s | sudo apt-key add -

Update your APT-cache:

$ sudo apt-get update

Installing CDAP Ambari Service

Now, install the cdap-ambari-service package from the repo you specified above:

Installing the CDAP Service via YUM

$ sudo yum install -y cdap-ambari-service
$ sudo ambari-server restart

Installing the CDAP Service via APT

$ sudo apt-get install -y cdap-ambari-service
$ sudo ambari-server restart

Installing CDAP Services

You can now install CDAP using the Ambari Service Wizard.

Start the Ambari Service Wizard

  1. In the Ambari UI (the Ambari Dashboard), start the Add Service Wizard.


    Ambari Dashboard: Starting the Add Service Wizard

  2. Select CDAP from the list and click Next. If there are core dependencies which are not currently installed on the cluster, Ambari will prompt you to install them.


    Ambari Dashboard: Selecting CDAP

Assign CDAP Services to Hosts

  1. Next, assign CDAP services to hosts.

    CDAP consists of five daemons:

    1. Master: Coordinator service which launches CDAP system services into YARN
    2. Router: Serves HTTP endpoints for CDAP applications and REST API
    3. Auth Server: For managing authentication tokens on CDAP clusters with perimeter security enabled
    4. Kafka Server: For transporting CDAP metrics and CDAP system service log data
    5. UI: Web interface to CDAP and CDAP Studio

    Ambari Dashboard: Assigning Masters

    We recommend you install all CDAP services onto an edge node (or the NameNode, for smaller clusters), as in our example above. After assigning the master hosts, click Next.

  2. Select hosts for the CDAP CLI client. This should be installed on every edge node on the cluster or, for smaller clusters, on the same node as the CDAP services.


    Ambari Dashboard: Selecting hosts for CDAP

  3. Click Next to customize the CDAP installation.

Customize CDAP

  1. On the Customize Services screen, you can configure both CDAP features and the environment settings for CDAP and the CDAP services which run on the edge nodes. At the bottom of the Settings tab are settings for common CDAP features and Java services.


    Ambari Dashboard: Customizing Services, CDAP Features and Java Services

  2. On the Customize Services screen, click the Advanced tab to bring up the complete CDAP configuration. Under Advanced cdap-env, you can configure environment settings such as heap sizes and the directories used to store logs and PIDs for the CDAP services which run on the edge nodes.


    Ambari Dashboard: Customizing Services 2

    Under Advanced cdap-site, you can configure all options for the operation and running of CDAP and CDAP applications.

    Additional CDAP configuration properties, not shown in the web interface, can be added using Ambari's advanced custom properties at the end of the page. Documentation of the available CDAP properties is in the Appendix: cdap-site.xml, cdap-default.xml.

    For a complete explanation of these options, refer to the CDAP documentation of cdap-site.xml.

    Additional environment variables can be set, as required, using Ambari's Configs > Advanced > Advanced cdap-env.

    When finished with configuration changes, click Next.

Starting CDAP Services

Deploying CDAP

  1. Review the desired service layout and click Deploy to begin the actual deployment of CDAP.


    Ambari Dashboard: Summary of Services

  2. Ambari will install CDAP and start the services.


    Ambari Dashboard: Install, Start, and Test

  3. After the services are installed and started, click Next to reach the Summary screen.

  4. This screen shows a summary of the changes that were made to the cluster. No services should need to be restarted following this operation.


    Ambari Dashboard: Summary

  5. Click Complete to complete the CDAP installation.

CDAP Started

  1. You should now see CDAP listed on the main summary screen for your cluster.

Ambari Dashboard: Selecting CDAP


Service Checks in Apache Ambari

  1. Selecting CDAP from the left sidebar, or choosing it from the Services drop-down menu, will take you to the CDAP service screen.

Ambari Dashboard: CDAP Service Screen

CDAP is now running on your cluster, managed by Ambari. You can log in to the CDAP UI at the address of the node running the CDAP UI service, at port 11011. The Quick Links drop-down menu has a menu item linking directly to the CDAP UI.

CDAP Smoke Test

The CDAP UI may initially show errors while the CDAP YARN containers are starting up; allow up to a few minutes for this.

The Administration page of the CDAP UI shows the status of the CDAP services. It can be reached at http://<cdap-host>:11011/cdap/administration, substituting for <cdap-host> the host name or IP address of the CDAP server:


CDAP UI: Showing started-up, Administration page.

Further instructions for verifying your installation are contained in Verification.
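As a quick command-line smoke test (a sketch, assuming the default CDAP Router port of 11015; substitute for <cdap-host> the host running the CDAP Router service):

```
$ curl -sf http://<cdap-host>:11015/ping && echo "CDAP router is up"
```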

Advanced Topics

Enabling Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security, authorization, impersonation and secure storage.

Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

Authorization provides a way of enforcing access control on CDAP entities.

Impersonation ensures that programs inside CDAP are run as configured users at the namespace level. When enabled, it guarantees that all actions on datasets, streams and other resources happen as the configured user.

We recommend that CDAP security always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively "trusted". Although there is still value in having perimeter security, authorization enforcement, and secure storage in that situation, a secure Hadoop cluster should be employed with CDAP security whenever possible.

For instructions on enabling CDAP Security, see CDAP Security.

CDAP Security is configured by setting the appropriate settings under Ambari for your environment.

Enabling Kerberos

Kerberos support in CDAP is automatically enabled when enabling Kerberos security on your cluster via Ambari. Consult the appropriate Ambari documentation for instructions on enabling Kerberos support for your cluster.

The cdap user must be able to launch YARN containers, which can be accomplished by lowering the YARN min.user.id (to 500) to include the cdap user. (The YARN allowed.system.users property is the preferred method of enabling the cdap user, as it is more precise and limited; however, as Ambari does not have a mechanism for setting it, min.user.id needs to be used instead.)

  1. If you are adding CDAP to an existing Kerberos cluster, in order to configure CDAP for Kerberos authentication:

    1. The <cdap-principal> is shown in the commands that follow as cdap; however, you are free to use a different appropriate name.

    2. When running on a secure HBase cluster, as the hbase user, issue the command:

      $ echo "grant 'cdap', 'RWCA'" | hbase shell
    3. In order to configure CDAP Explore Service for secure Hadoop:

      1. To allow CDAP to act as a Hive client, it must be given proxyuser permissions and allowed from all hosts. For example: set the following properties in the configuration file core-site.xml, where cdap is a system group of which the cdap user is a member:

      2. To execute Hive queries on a secure cluster, the cluster must be running the MapReduce JobHistoryServer service. Consult your distribution documentation on the proper configuration of this service.

      3. To execute Hive queries on a secure cluster using the CDAP Explore Service, the Hive MetaStore service must be configured for Kerberos authentication. Consult your distribution documentation on the proper configuration of the Hive MetaStore service.

      With all these properties set, the CDAP Explore Service will run on secure Hadoop clusters.
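      As a sketch of the proxyuser properties referred to in step 1 above (assuming cdap is the system group of which the cdap user is a member; substitute your own group name):

```xml
<!-- core-site.xml: allow the cdap user to impersonate members of the cdap group -->
<property>
  <name>hadoop.proxyuser.cdap.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.cdap.groups</name>
  <value>cdap</value>
</property>
```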

  2. If you are adding Kerberos to an existing cluster, in order to configure CDAP for Kerberos authentication:

    1. The /cdap directory needs to be owned by the <cdap-principal>; you can set that by running the following command as the hdfs user (change the ownership in the command from cdap to whatever is the <cdap-principal>):

      $ su - hdfs
      $ hadoop fs -mkdir -p /cdap && hadoop fs -chown cdap /cdap
    2. When converting an existing CDAP cluster to being Kerberos-enabled, you may run into YARN usercache directory permission problems. A non-Kerberos cluster with default settings will run CDAP containers as the user yarn. A Kerberos cluster will run them as the user cdap. When converting, the usercache directory that YARN creates will already exist and be owned by a different user. On all datanodes, run this command, substituting in the correct value of the YARN parameter yarn.nodemanager.local-dirs:

      $ rm -rf <YARN.NODEMANAGER.LOCAL-DIRS>/usercache/cdap

      (As yarn.nodemanager.local-dirs can be a comma-separated list of directories, you may need to run this command multiple times, once for each entry.)

      If, for example, the setting for yarn.nodemanager.local-dirs is /yarn/nm, you would use:

      $ rm -rf /yarn/nm/usercache/cdap

      Restart CDAP after removing the usercache(s).

Enabling CDAP HA

In addition to having a cluster architecture that supports HA (high availability), these additional configuration steps need to be followed and completed:

CDAP Components

For each of the CDAP components listed below (Master, Router, Kafka, UI, Authentication Server), these comments apply:

  • Sync the configuration files (such as cdap-site.xml and cdap-security.xml) on all the nodes.
  • While the default bind.address settings (0.0.0.0, used for app.bind.address, data.tx.bind.address, router.bind.address, and so on) can be synced across hosts, if you customize them to a particular IP address they will, as a result, be different on different hosts. This can be controlled by the settings for an individual Role Instance.
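As an illustrative cdap-site.xml fragment (the default 0.0.0.0 binds to all interfaces and is safe to sync across hosts; a host-specific IP address would make the file differ per host):

```xml
<!-- cdap-site.xml: default bind address, identical on every node -->
<property>
  <name>router.bind.address</name>
  <value>0.0.0.0</value>
</property>
```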

CDAP Master

The CDAP Master service primarily performs coordination tasks and can be scaled for redundancy. The instances coordinate amongst themselves, electing one as a leader at all times.

  • Using the Ambari UI, add additional hosts for the CDAP Master Service to additional machines.

CDAP Router

The CDAP Router service is a stateless API endpoint for CDAP, and simply routes requests to the appropriate service. It can be scaled horizontally for performance. A load balancer, if desired, can be placed in front of the nodes running the service.

  • Using the Ambari UI, add additional hosts for the CDAP Router Service to additional machines.
  • Start each CDAP Router Service role.

CDAP Kafka

  • Using the Ambari UI, add additional hosts for the CDAP Kafka Service to additional machines.
  • Two properties govern the Kafka setting in the cluster:
    • The list of Kafka seed brokers is generated automatically, but the replication factor (kafka.server.default.replication.factor) is not set automatically. Instead, it needs to be set manually.
    • The replication factor is used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure.
  • The recommended setting is to run at least two Kafka brokers with a minimum replication factor of two; set this property to the maximum number of tolerated machine failures plus one (assuming you have that number of machines). For example, if you were running five Kafka brokers, and would tolerate two of those failing, you would set the replication factor to three. The number of Kafka brokers listed should always be equal to or greater than the replication factor.
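For example, a two-broker setup tolerating one machine failure could set the replication factor as below in cdap-site.xml (a sketch; adjust the value to your tolerated machine failures plus one):

```xml
<property>
  <name>kafka.server.default.replication.factor</name>
  <value>2</value>
</property>
```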


CDAP UI

  • Using the Ambari UI, add additional hosts for the CDAP UI Service to additional machines.

CDAP Authentication Server

  • Using the Ambari UI, add additional hosts for the CDAP Security Auth Service (the CDAP Authentication Server) to additional machines.
  • Note that when an unauthenticated request is made in a secure HA setup, a list of all running authentication endpoints will be returned in the body of the response.

Hive Execution Engines

CDAP Explore has support for additional execution engines such as Apache Spark and Apache Tez. Details on specifying these engines and configuring CDAP are in the Developer Manual section on Data Exploration, Hive Execution Engines.

Enabling Spark2

In order to use Spark2, you must first install Spark2 on your cluster. If both Spark1 and Spark2 are installed, you must modify cdap-env to set SPARK_MAJOR_VERSION and SPARK_HOME:

export SPARK_MAJOR_VERSION=2
export SPARK_HOME=/usr/hdp/{{hdp_version}}/spark2

When Spark2 is in use, Spark1 programs cannot be run. Similarly, when Spark1 is in use, Spark2 programs cannot be run.

When CDAP starts up, it detects the Spark version and uploads the corresponding pipeline system artifacts. If you have already started CDAP with Spark1, you will need to delete the Spark1 pipeline system artifacts and then reload them in order to use the Spark2 versions. After CDAP has been restarted with Spark2, use the RESTful API, substituting for <router-host> the host running the CDAP Router service (shown with the default router port, 11015):

$ curl -X DELETE "http://<router-host>:11015/v3/namespaces/system/artifacts/cdap-data-pipeline/versions/5.1.2"
$ curl -X DELETE "http://<router-host>:11015/v3/namespaces/system/artifacts/cdap-data-streams/versions/5.1.2"
$ curl -X POST "http://<router-host>:11015/v3/namespaces/system/artifacts"