CDAP Included Applications

Introduction to Included Applications

CDAP comes packaged with several system artifacts that let you create two types of applications, ETL (Extract, Transform, and Load) pipelines and Data Quality applications, simply by configuring the artifacts, without writing any code.

An application created from a configured system artifact following the ETL pattern is referred to as an ETL pipeline or (interchangeably) as an ETL application. Similarly, an application built following the Data Quality pattern is referred to as a Data Quality application.
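As a rough illustration of what configuring a system artifact can look like, the sketch below builds (but does not send) an HTTP request that would create an application from a system artifact. The endpoint shape, the router address `http://localhost:10000`, the artifact name `cdap-etl-batch`, the version `3.3.0`, the application name `myPipeline`, and the contents of `config` are all assumptions made for illustration; consult the documentation for your CDAP release for the exact values.

```python
import json
from urllib.parse import quote


def build_create_app_request(base_url, namespace, app_name,
                             artifact_name, artifact_version, config):
    """Return (method, url, body) for creating an app from a system artifact.

    Nothing is sent over the network; the caller decides how to issue
    the request. The URL layout is an assumption, not taken from this page.
    """
    url = "{}/v3/namespaces/{}/apps/{}".format(
        base_url.rstrip("/"),
        quote(namespace, safe=""),
        quote(app_name, safe=""),
    )
    body = {
        # Which system artifact to configure (assumed field names).
        "artifact": {
            "name": artifact_name,
            "version": artifact_version,
            "scope": "SYSTEM",
        },
        # Artifact-specific configuration; its schema depends on the artifact.
        "config": config,
    }
    return "PUT", url, json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    # Placeholder config: a real ETL pipeline would also need a source,
    # one or more sinks, and optional transforms.
    method, url, body = build_create_app_request(
        "http://localhost:10000", "default", "myPipeline",
        "cdap-etl-batch", "3.3.0", {"schedule": "0 * * * *"},
    )
    print(method, url)
    print(body)
```

Keeping request construction separate from sending makes the configuration easy to inspect and version before it is applied to a running CDAP instance.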

In the future, a variety of system artifacts will be delivered. The framework is extensible: users can write their own artifacts if they so choose, and can manage the lifecycle of their custom applications using CDAP.

Cask Hydrator and ETL Pipelines

ETL is the Extract, Transform, and Load of data, a common first step in any data application. CDAP endeavors to make ETL possible out of the box without writing code; instead, you configure CDAP appropriately and operate it.

In this CDAP release, we have added support for self-service batch and real-time data ingestion, combined with ETL, for building Hadoop data lakes. Called Cask Hydrator, it gives CDAP users a seamless and easy way to configure and operate ingestion pipelines from different types of sources and data.

Cask Hydrator provides an easy method of configuring pipelines using a visual editor. You drag and drop sources, transformations, and sinks, configuring an ETL pipeline within minutes. Cask Hydrator also provides an operational view of the resulting ETL pipeline that allows for monitoring of metrics, logs, and other run-time information.

_images/hydrator-pipelines.png

CDAP Hydrator Pipelines: Administration of created pipelines with current status

_images/hydrator-studio.png

CDAP Hydrator Studio: Visual editor showing the creation of an ETL pipeline

These sections describe:

  • ETL Overview: An introduction to ETL, ETL applications, and ETL plugins.
  • ETL Plugins: Details on ETL plugins and exploring available plugins using RESTful APIs.

The lifecycle of ETL applications is managed using CDAP’s Lifecycle HTTP RESTful API.
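The two REST-based operations mentioned above, exploring available plugins and managing an ETL application's lifecycle, can be sketched as simple URL builders. This is a minimal sketch under stated assumptions: the path layouts, the `scope` query parameter, the router address `http://localhost:10000`, and the example names (`cdap-etl-batch`, `3.3.0`, `myPipeline`, `workflows`, `ETLWorkflow`) are not taken from this page and should be checked against the RESTful API documentation for your release. No request is sent.

```python
from urllib.parse import quote, urlencode


def _seg(value):
    """Percent-encode a single URL path segment."""
    return quote(str(value), safe="")


def plugin_types_url(base_url, namespace, artifact_name,
                     artifact_version, scope="system"):
    """URL that would list the plugin types an artifact can be extended with.

    The path and the 'scope' parameter are assumptions for illustration.
    """
    path = "/v3/namespaces/{}/artifacts/{}/versions/{}/extensions".format(
        _seg(namespace), _seg(artifact_name), _seg(artifact_version))
    return base_url.rstrip("/") + path + "?" + urlencode({"scope": scope})


def program_request(base_url, namespace, app_name,
                    program_type, program_name, action):
    """Return (method, url) for a lifecycle action on one program of an app."""
    # Assumed convention: reading status is a GET, state changes are POSTs.
    methods = {"status": "GET", "start": "POST", "stop": "POST"}
    if action not in methods:
        raise ValueError("unsupported action: {!r}".format(action))
    path = "/v3/namespaces/{}/apps/{}/{}/{}/{}".format(
        _seg(namespace), _seg(app_name), _seg(program_type),
        _seg(program_name), action)
    return methods[action], base_url.rstrip("/") + path


if __name__ == "__main__":
    base = "http://localhost:10000"  # hypothetical router address
    print(plugin_types_url(base, "default", "cdap-etl-batch", "3.3.0"))
    # Program type and name below are hypothetical placeholders.
    print(program_request(base, "default", "myPipeline",
                          "workflows", "ETLWorkflow", "start"))
```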

Data Quality Application

The Data Quality application is an extensible CDAP application that helps users determine the quality of their data. Users can assess data quality using its out-of-the-box functionality and libraries. The application can be extended with custom aggregation functions, and it can be queried through a RESTful API to obtain the results of the quality metric computations.