Installation using Apache Ambari

Notes

  • Apache Ambari can only be used to add CDAP to an existing Hadoop cluster, one that already has the required services (Hadoop: HDFS, YARN, HBase, ZooKeeper, and, optionally, Hive and Spark) installed.
  • Ambari is for setting up HDP (Hortonworks Data Platform) on bare clusters; it can't be used on a cluster where HDP is already installed unless that original installation was performed with Ambari.
  • A number of features are currently planned to be added, including:

Preparing the Cluster

Hadoop Configuration

  1. ZooKeeper's maxClientCnxns must be raised from its default. We suggest setting it to zero (0: unlimited connections). As each YARN container launched by CDAP makes a connection to ZooKeeper, the number of connections required is a function of usage.

  2. Ensure that YARN has sufficient memory capacity by lowering the default minimum container size (controlled by the property yarn.scheduler.minimum-allocation-mb). Lack of YARN memory capacity is the leading cause of apparent failures that we see reported. We recommend starting with these settings:

    • yarn.nodemanager.delete.debug-delay-sec: 43200
    • yarn.scheduler.minimum-allocation-mb: 512 (MB)

    Please ensure your yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb settings are set sufficiently to run CDAP, as described in the CDAP Memory and Core Requirements.

You can make these changes during the configuration of your cluster using Ambari.
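As a rough sanity check of the two settings above, the following sketch reads maxClientCnxns from a ZooKeeper properties file and estimates how many minimum-size containers fit on one NodeManager. The function names are hypothetical, and the file path and the 8192 MB figure in the usage comments are assumptions for illustration, not values from this guide.

```shell
# Print the maxClientCnxns value found in a zoo.cfg-style properties file
# (prints nothing if the property is absent).
zk_max_client_cnxns() {
  sed -n 's/^[[:space:]]*maxClientCnxns[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# Upper bound on the number of minimum-size containers per NodeManager:
#   floor(yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb)
# This ignores vcores and any larger container requests.
#   $1: NodeManager memory in MB
#   $2: minimum allocation in MB
max_min_size_containers() {
  echo $(( $1 / $2 ))
}

# Example usage (the path is an assumption; adjust it for your cluster):
#   zk_max_client_cnxns /etc/zookeeper/conf/zoo.cfg
#   max_min_size_containers 8192 512
```

A result of 0 from the first function corresponds to the suggested "unlimited connections" setting.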

HDFS Permissions

Ensure YARN is configured properly to run MapReduce programs. Often, this includes ensuring that the HDFS /user/yarn directory exists with proper permissions:

# su hdfs
$ hdfs dfs -mkdir -p /user/yarn && hadoop fs -chown yarn /user/yarn && hadoop fs -chgrp yarn /user/yarn
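To confirm the result of the commands above, a check along these lines may help. It is a minimal sketch: the function name is hypothetical, and it assumes the directory should end up owned by user yarn and group yarn, as the chown and chgrp above imply. It reads the owner and group from the third and fourth columns of the `hdfs dfs -ls -d` output.

```shell
# Succeeds only if the given HDFS directory exists and is owned by yarn:yarn.
#   $1: directory to check (defaults to /user/yarn)
check_yarn_home() {
  dir="${1:-/user/yarn}"
  if ! hdfs dfs -test -d "$dir"; then
    echo "missing: $dir"
    return 1
  fi
  # Listing line layout: permissions, replication, owner, group, size, date, time, path
  owner="$(hdfs dfs -ls -d "$dir" | tail -n 1 | awk '{print $3 ":" $4}')"
  if [ "$owner" != "yarn:yarn" ]; then
    echo "unexpected owner for $dir: $owner"
    return 1
  fi
  echo "ok: $dir is owned by $owner"
}

# Example usage, run as a user that can read HDFS:
#   check_yarn_home /user/yarn
```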

Downloading and Distributing Packages

Downloading CDAP Ambari Service

To install CDAP on a cluster managed by Ambari, packages are available for RHEL-compatible and Ubuntu systems, which you can install onto your Ambari management server. These packages add CDAP to the list of available services which Ambari can install.

To install the cdap-ambari-service package, first add the appropriate CDAP repository to your system's package manager by following the steps below. These steps will install a Cask repository on your Ambari server.

The repository version (shown in the commands below as cdap/3.5) must match the CDAP series which you'd like installed on your cluster. To install the latest version of the CDAP 3.0 series, you would install the CDAP 3.0 repository. The default (in the commands below) is to use cdap/3.5, which has the widest compatibility with the Ambari-supported Hadoop distributions.

In the commands that follow on this page, replace all references to cdap/3.5 with the CDAP Repository from the list below that you would like to use:

Supported Hortonworks Data Platform (HDP) Distributions

CDAP Series    CDAP Repository    Hadoop Distributions
CDAP 3.5.x     cdap/3.5           HDP 2.0 through HDP 2.4
CDAP 3.4.x     cdap/3.4           HDP 2.0 through HDP 2.4
CDAP 3.3.x     cdap/3.3           HDP 2.0 through HDP 2.3
CDAP 3.2.x     cdap/3.2           HDP 2.0 through HDP 2.3
CDAP 3.1.x     cdap/3.1           HDP 2.0 through HDP 2.2
CDAP 3.0.x     cdap/3.0           HDP 2.0 and HDP 2.1

Note: The CDAP Ambari service has been tested on Ambari Server 2.0 and 2.1, as supplied from Hortonworks.

On RPM using Yum

Download the Cask Yum repo definition file:

$ sudo curl -o /etc/yum.repos.d/cask.repo http://repository.cask.co/centos/6/x86_64/cdap/3.5/cask.repo

This will create the file /etc/yum.repos.d/cask.repo with:

[cask]
name=Cask Packages
baseurl=http://repository.cask.co/centos/6/x86_64/cdap/3.5
enabled=1
gpgcheck=1

Add the Cask Public GPG Key to your repository:

$ sudo rpm --import http://repository.cask.co/centos/6/x86_64/cdap/3.5/pubkey.gpg

Update your Yum cache:

$ sudo yum makecache

On Debian using APT

Download the Cask APT repo definition file:

$ sudo curl -o /etc/apt/sources.list.d/cask.list http://repository.cask.co/ubuntu/precise/amd64/cdap/3.5/cask.list

This will create the file /etc/apt/sources.list.d/cask.list with:

deb [ arch=amd64 ] http://repository.cask.co/ubuntu/precise/amd64/cdap/3.5 precise cdap

Add the Cask Public GPG Key to your repository:

$ curl -s http://repository.cask.co/ubuntu/precise/amd64/cdap/3.5/pubkey.gpg | sudo apt-key add -

Update your APT cache:

$ sudo apt-get update
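If you are installing a series other than cdap/3.5, a small helper can keep the substitution consistent across the Yum and APT commands above. This is only a sketch: the function name is hypothetical, and it assumes every series in the table follows the same URL layout as the cdap/3.5 URLs shown on this page.

```shell
# Build the Cask repository base URL for a CDAP series, following the URL
# layout used in the commands on this page.
#   $1: package system, "yum" or "apt"
#   $2: CDAP series, for example 3.4
cask_repo_base_url() {
  case "$1" in
    yum) echo "http://repository.cask.co/centos/6/x86_64/cdap/$2" ;;
    apt) echo "http://repository.cask.co/ubuntu/precise/amd64/cdap/$2" ;;
    *)   echo "unknown package system: $1" >&2; return 1 ;;
  esac
}

# Example usage: fetch the Yum repo definition for the CDAP 3.4 series.
#   sudo curl -o /etc/yum.repos.d/cask.repo "$(cask_repo_base_url yum 3.4)/cask.repo"
```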

Installing CDAP Ambari Service

Now, install the cdap-ambari-service package from the repo you specified above:

Installing the CDAP Service via YUM

$ sudo yum install -y cdap-ambari-service
$ sudo ambari-server restart

Installing the CDAP Service via APT

$ sudo apt-get install -y cdap-ambari-service
$ sudo ambari-server restart
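Before moving on to the wizard, you may want to confirm that the package actually landed on the Ambari server. The sketch below is a rough check with a hypothetical function name. It queries rpm on RHEL-compatible systems and dpkg on Ubuntu, whichever is found first.

```shell
# Return 0 if the cdap-ambari-service package is installed, 1 if it is not,
# and 2 if neither rpm nor dpkg is available.
cdap_ambari_service_installed() {
  if command -v rpm >/dev/null 2>&1; then
    rpm -q cdap-ambari-service >/dev/null 2>&1
  elif command -v dpkg >/dev/null 2>&1; then
    dpkg -s cdap-ambari-service >/dev/null 2>&1
  else
    return 2
  fi
}

# Example usage:
#   cdap_ambari_service_installed && echo "cdap-ambari-service is installed"
#   sudo ambari-server status
```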

Installing CDAP Services

You can now install CDAP using the Ambari Service Wizard.

Start the Ambari Service Wizard

  1. In the Ambari UI (the Ambari Dashboard), start the Add Service Wizard.

    ../_images/ss01-add-service.png

    Ambari Dashboard: Starting the Add Service Wizard

  2. Select CDAP from the list and click Next. If there are core dependencies which are not currently installed on the cluster, Ambari will prompt you to install them.

    ../_images/ss02-select-cdap.png

    Ambari Dashboard: Selecting CDAP

Assign CDAP Services to Hosts

  1. Next, assign CDAP services to hosts.

    CDAP consists of five daemons:

    1. Master: Coordinator service which launches CDAP system services into YARN
    2. Router: Serves HTTP endpoints for CDAP applications and REST API
    3. Auth Server: For managing authentication tokens on CDAP clusters with perimeter security enabled
    4. Kafka Server: For transporting CDAP metrics and CDAP system service log data
    5. UI: Web interface to CDAP and Cask Hydrator (for CDAP 3.2.x and later installations)
    ../_images/ss03-assign-masters.png

    Ambari Dashboard: Assigning Masters

    We recommend that you install all CDAP services onto an edge node (or the NameNode, for smaller clusters), as in our example above. After assigning the master hosts, click Next.

  2. Select hosts for the CDAP CLI client. This should be installed on every edge node on the cluster or, for smaller clusters, on the same node as the CDAP services.

    ../_images/ss04-choose-clients.png

    Ambari Dashboard: Selecting hosts for CDAP

  3. Click Next to customize the CDAP installation.

Customize CDAP

  1. On the Customize Services screen, click the Advanced tab to bring up the CDAP configuration. Under Advanced cdap-env, you can configure environment settings such as heap sizes and the directories used to store logs and pids for the CDAP services which run on the edge nodes.

    ../_images/ss05-config-cdap-env.png

    Ambari Dashboard: Customizing Services 1

  2. Under Advanced cdap-site, you can configure all options for the operation and running of CDAP and CDAP applications.

    ../_images/ss06-config-cdap-site.png

    Ambari Dashboard: Customizing Services 2

  3. Router Bind Port, Router Server Port: These two ports should match; Router Server Port is used by the CDAP UI to connect to the CDAP Router service.

    ../_images/ss07-config-enable-explore.png

    Ambari Dashboard: Enabling CDAP Explore

    Additional CDAP configuration properties, not shown in the web interface, can be added using Ambari's advanced custom properties at the end of the page. Documentation of the available CDAP properties is in the Appendix: cdap-site.xml, cdap-default.xml.

    For a complete explanation of these options, refer to the CDAP documentation of cdap-site.xml.

    Additional environment variables can be set, as required, using Ambari's "Configs > Advanced > Advanced cdap-env".

    Note: Service-specific Java heap memory settings (that override the default values) can be created by setting these environment variables:

    AUTH_JAVA_HEAPMAX
    KAFKA_JAVA_HEAPMAX
    MASTER_JAVA_HEAPMAX
    ROUTER_JAVA_HEAPMAX
    

    When finished with configuration changes, click Next.
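The service-specific heap variables listed in the note above could be set under Advanced cdap-env along these lines. The sizes are illustrative only, and the sketch assumes each variable takes a JVM maximum-heap option of the form -Xmx; check the cdap-env template shipped with your CDAP version before relying on that format.

```shell
# Illustrative values only; size these for your own cluster.
# Assumption: each variable holds a JVM maximum-heap option (-Xmx...).
export MASTER_JAVA_HEAPMAX="-Xmx2048m"
export ROUTER_JAVA_HEAPMAX="-Xmx1024m"
export KAFKA_JAVA_HEAPMAX="-Xmx1024m"
export AUTH_JAVA_HEAPMAX="-Xmx1024m"
```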

Starting CDAP Services

Deploying CDAP

  1. Review the desired service layout and click Deploy to begin the actual deployment of CDAP.

    ../_images/ss08-review-deploy.png

    Ambari Dashboard: Summary of Services

  2. Ambari will install CDAP and start the services.

    ../_images/ss09-install-start-test.png

    Ambari Dashboard: Install, Start, and Test

  3. After the services are installed and started, click Next to get to the Summary screen.

  4. This screen shows a summary of the changes that were made to the cluster. No services should need to be restarted following this operation.

    ../_images/ss10-post-install-summary.png

    Ambari Dashboard: Summary

  5. Click Complete to complete the CDAP installation.

CDAP Started

  1. You should now see CDAP listed on the main summary screen for your cluster.
../_images/ss11-main-screen.png

Ambari Dashboard: Selecting CDAP

Verification

Service Checks in Apache Ambari

  1. Selecting CDAP from the left sidebar, or choosing it from the Services drop-down menu, will take you to the CDAP service screen.
../_images/ss12-cdap-screen.png

Ambari Dashboard: CDAP Service Screen

CDAP is now running on your cluster, managed by Ambari. You can log in to the CDAP UI at the address of the node running the CDAP UI service, on port 9999.
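A quick way to see whether the UI is answering on that port from the command line is sketched below. The function name and the hostname in the usage comment are placeholders; the only value taken from this page is port 9999.

```shell
# Print the HTTP status code returned by the CDAP UI on a given host.
# curl prints 000 if no connection could be made.
#   $1: host running the CDAP UI service
#   $2: port (defaults to 9999)
cdap_ui_status() {
  curl -s -o /dev/null -w '%{http_code}' "http://$1:${2:-9999}/"
}

# Example usage (the hostname is a placeholder):
#   cdap_ui_status edge-node.example.com
```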

CDAP Smoke Test

The CDAP UI may initially show errors while all of the CDAP YARN containers are starting up. Allow up to a few minutes for this. The System Health link on the left side of the CDAP UI shows the status of the CDAP services.

../_images/console-distributed.png

CDAP UI: Showing started-up before data or applications are deployed.

Further instructions for verifying your installation are contained in Verification.
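Because the containers can take a few minutes to start, a scripted smoke test usually needs to poll rather than check once. The following is a generic retry helper, offered as a sketch; the function name, the attempt count, the delay and the hostname are all assumptions to adapt.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
#   $1: maximum number of attempts
#   $2: seconds to sleep between attempts
#   remaining arguments: the command to run
retry_until_ok() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example usage: poll the CDAP UI for roughly five minutes (30 tries, 10 s apart).
#   retry_until_ok 30 10 curl -sf -o /dev/null http://edge-node.example.com:9999/
```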

Advanced Topics

Enabling Security

Cask Data Application Platform (CDAP) supports securing clusters using perimeter security, authorization, impersonation and secure storage.

Network (or cluster) perimeter security limits outside access, providing a first level of security. However, perimeter security itself does not provide the safeguards of authentication, authorization and service request management that a secure Hadoop cluster provides.

Authorization provides a way of enforcing access control on CDAP entities.

Impersonation ensures that programs inside CDAP are run as configured users at the namespace level. When enabled, it guarantees that all actions on datasets, streams and other resources happen as the configured user.

We recommend that CDAP security always be used in conjunction with a secure Hadoop cluster. Where secure Hadoop is not or cannot be used, the cluster is inherently insecure and any applications running on it are effectively "trusted". Although there is still value in having perimeter security, authorization enforcement and secure storage in that situation, whenever possible a secure Hadoop cluster should be employed with CDAP security.

For instructions on enabling CDAP Security, see CDAP Security.

CDAP Security is configured by setting the appropriate settings under Ambari for your environment.

Enabling Kerberos

Kerberos support in CDAP is automatically enabled when enabling Kerberos security on your cluster via Ambari. Consult the appropriate Ambari documentation for instructions on enabling Kerberos support for your cluster.

Enabling CDAP HA

In addition to having a cluster architecture that supports HA (high availability), these additional configuration steps need to be followed and completed:

CDAP Components

For each of the CDAP components listed below (Master, Router, Kafka, UI, Authentication Server), these comments apply:

  • Sync the configuration files (such as cdap-site.xml and cdap-security.xml) on all the nodes.
  • While the default bind.address settings (0.0.0.0, used for app.bind.address, data.tx.bind.address, router.bind.address, and so on) can be synced across hosts, if you customize them to a particular IP address, they will, as a result, be different on different hosts. This can be controlled by the settings for an individual Role Instance.
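One way to spot configuration drift between nodes is to compare checksums of the same file on each host. The sketch below only does the comparison; the function name, the hostnames and the /etc/cdap/conf path in the usage comment are assumptions, and files that legitimately differ per host (for example, because of a customized bind address) will of course report as out of sync.

```shell
# Read "host checksum" pairs on stdin, one per line.
# Succeeds only if at least one pair was read and every checksum is identical.
configs_in_sync() {
  awk '{ seen[$2] = 1 } END { n = 0; for (c in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Example usage (hostnames and path are placeholders):
#   for h in cdap1.example.com cdap2.example.com; do
#     printf '%s %s\n' "$h" "$(ssh "$h" md5sum /etc/cdap/conf/cdap-site.xml | cut -d' ' -f1)"
#   done | configs_in_sync && echo "cdap-site.xml matches on all hosts"
```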

CDAP Master

The CDAP Master service primarily performs coordination tasks and can be scaled for redundancy. The instances coordinate amongst themselves, electing one as a leader at all times.

  • Using the Ambari UI, add additional hosts for the CDAP Master Service to additional machines.

CDAP Router

The CDAP Router service is a stateless API endpoint for CDAP, and simply routes requests to the appropriate service. It can be scaled horizontally for performance. A load balancer, if desired, can be placed in front of the nodes running the service.

  • Using the Ambari UI, add additional hosts for the CDAP Router Service to additional machines.
  • Start each CDAP Router Service role.

CDAP Kafka

  • Using the Ambari UI, add additional hosts for the CDAP Kafka Service to additional machines.
  • Two properties govern the Kafka setting in the cluster:
    • The list of Kafka seed brokers is generated automatically, but the replication factor (kafka.default.replication.factor) is not set automatically. Instead, it needs to be set manually.
    • The replication factor is used to replicate Kafka messages across multiple machines to prevent data loss in the event of a hardware failure.
  • The recommended setting is to run at least two Kafka brokers with a minimum replication factor of two; set this property to the maximum number of tolerated machine failures plus one (assuming you have that number of machines). For example, if you were running five Kafka brokers, and would tolerate two of those failing, you would set the replication factor to three. The number of Kafka brokers listed should always be equal to or greater than the replication factor.
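The rule in the last bullet can be written out as a small calculation. This is a sketch with a hypothetical function name: it takes the number of brokers and the number of failures to tolerate, applies the "failures plus one" rule with the recommended floor of two, and refuses a result that exceeds the broker count.

```shell
# Replication factor = tolerated broker failures + 1, with a floor of 2.
# Fails if there are fewer brokers than the resulting factor.
#   $1: number of Kafka brokers
#   $2: number of broker failures to tolerate
kafka_replication_factor() {
  factor=$(( $2 + 1 ))
  if [ "$factor" -lt 2 ]; then
    factor=2
  fi
  if [ "$1" -lt "$factor" ]; then
    echo "need at least $factor brokers, have $1" >&2
    return 1
  fi
  echo "$factor"
}

# Example usage, matching the five-broker example above:
#   kafka_replication_factor 5 2
```

The printed value is what you would then enter for kafka.default.replication.factor.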

CDAP UI

  • Using the Ambari UI, add additional hosts for the CDAP UI Service to additional machines.

CDAP Authentication Server

  • Using the Ambari UI, add additional hosts for the CDAP Security Auth Service (the CDAP Authentication Server) to additional machines.
  • Note that when an unauthenticated request is made in a secure HA setup, a list of all running authentication endpoints will be returned in the body of the response.