🔗CDAP Metadata UI

🔗Introduction

The CDAP Metadata UI ("metadata management") lets you see how data is flowing into and out of datasets, streams, and stream views.

It allows you to perform impact and root-cause analysis, delivers an audit-trail for auditability and compliance, and allows you to preview data. Metadata management furnishes access to structured information that describes, explains, and locates data, making it easier to retrieve, use, and manage datasets.

Metadata management also allows users to update metadata for datasets and streams. Users can add, remove, and update tags and user properties directly in the UI. It allows users to set a preferred dictionary of tags so that teams can use the same lexicon when updating metadata.

Metadata management's UI shows a graphical visualization of the lineage of an entity. For a specified time range, a lineage shows all data access of the entity and details of where that access originated.

Metadata management also captures activity metrics for datasets. You can see which datasets are used the most and view usage metrics for each dataset, which helps teams determine the appropriate dataset to use for an analysis. The metadata management meter (currently in beta) rates each dataset on a scale that shows how active the dataset is in the system.

Metadata management provides users with the ability to define a Data Dictionary that can be applied across all datasets in a namespace. The Data Dictionary allows users to standardize column names and types, record whether a column contains Personally Identifiable Information (PII), and provide a description of each column. This can be useful for new team members, allowing them to understand the data stored in datasets quickly.

Harvest, Index, Track, and Analyze Datasets

  • Immediate, timely, and seamless capture of technical, business, and operational metadata, enabling faster and better traceability of all datasets.
  • Through its use of lineage, it lets you understand the impact of changing a dataset on other datasets, processes or queries.
  • Tracks the flow of data across enterprise systems and data lakes.
  • Lets you view and update complete metadata on datasets, enabling traceability to resolve data issues and to improve data quality.
  • Collects usage metrics about datasets so that you know which datasets are being used most often.
  • Provides the ability to designate certain tags as "preferred" so that teams can easily find and tag datasets.
  • Allows users to preview data directly in the UI.

Supports Standardization, Governance, and Compliance Needs

  • Provides IT with the traceability needed to govern datasets and apply compliance rules through seamless integration with other extensions.
  • Maintains consistent metadata definitions describing the data, which helps reconcile differences in terminology.
  • Helps you understand the lineage of your business-critical data.
  • The Data Dictionary allows you to standardize column names and definitions across datasets.

Blends Metadata Analytics and Integrations

  • See how your datasets are being created, accessed, and processed.
  • Extensible integrations are available with enterprise-grade MDM (master data management) systems such as Cloudera Navigator, for centralizing the metadata repository and delivering complete, accurate, and correct data.

🔗Example Use Case

An example use case describes how metadata management was employed in the data cleansing and validating of three billion records.

🔗Entity Details

Clicking on a name in the search results list will take you to details for a particular entity. Details are provided on the tabs Metadata, Lineage, Audit Log, Preview (included if the dataset is explorable), and Usage.

Metadata

The Metadata tab provides lists of the System Tags, User Tags, Schema, User Properties, and System Properties that were found for the entity. The values shown will vary depending on the type of entity and each individual entity. For instance, a stream may have a schema attached, and if so, it will be displayed.

../_images/tracker-metadata.png

You can add user tags to any entity by clicking the plus button in the UI, and remove a tag by hovering over it and clicking the x. You can also add and remove User Properties for a dataset or stream, which is useful for storing additional details about the dataset for others to see.

Lineage

The Lineage tab shows the relationship between an entity and the programs that are interacting with it. Different lineage diagrams can be created for the same entity, depending on the particular set of programs selected to construct the diagram. A green button in the shape of an arrow cycles through the different lineage diagrams that a particular entity participates in.

A date menu on the left side of the diagram lets you control the time range that the diagram displays. By default, the last seven days are used, though a custom range can be specified, in addition to common time ranges (two weeks to one year).

../_images/tracker-lineage.png
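The time-range behavior described above can be sketched in Python. This is only an illustration: the function name and the preset labels are hypothetical and are not part of the CDAP UI or API. Only the seven-day default and the two-week and one-year ranges come from the text.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical preset labels mirroring the ranges mentioned above;
# the actual menu entries in the UI may differ.
PRESETS = {
    "7d": timedelta(days=7),    # default: last seven days
    "14d": timedelta(days=14),  # two weeks
    "1y": timedelta(days=365),  # one year (approximated as 365 days)
}


def lineage_window(preset="7d", now=None):
    """Return (start, end) datetimes for a preset range ending at `now`."""
    end = now or datetime.now(timezone.utc)
    return end - PRESETS[preset], end


# Example: the default window ending at a fixed point in time.
end = datetime(2017, 1, 8, tzinfo=timezone.utc)
start, _ = lineage_window(now=end)
print(start.isoformat(), "to", end.isoformat())
```

A custom range would simply bypass the presets and supply both endpoints directly.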

Audit Log

The Audit Log tab shows each record in the _auditLog dataset that has been created for that particular entity, displayed in reverse chronological order. Because of how datasets work in CDAP, reading from or writing to a dataset from a flow or service shows an access of "UNKNOWN" rather than indicating whether it was read or write access. This will be addressed in a future release.

A date menu on the left side of the log lets you control the time range that the log displays. By default, the last seven days are used, though a custom range can be specified, in addition to common time ranges (two weeks to one year).

../_images/tracker-audit-log.png

Preview

The Preview tab (if available) shows a preview for the dataset. It is available for all datasets that are explorable. You can scroll for up to 500 records. For additional analysis, use the Jump menu to go into CDAP and explore the dataset using a custom query.

../_images/tracker-preview.png

Usage

The Usage tab shows a set of graphs displaying usage metrics for the dataset. At the top is a histogram of all audit messages for a particular dataset. Along the bottom of the screen is a set of charts displaying the Applications and Programs that are accessing the dataset, and a table showing the last time a specific message was received about the dataset. Clicking the Application name or the Program name will take you to that entity in the main CDAP UI.

../_images/tracker-usage.png

Preferred Tags

The Tags tab at the top of the page allows you to enter a common set of preferred terms to use when adding tags to datasets. Preferred tags show up first when adding tags, and will guide your team to use the same terminology. Any preferred tag that has not been attached to any entities can be deleted by clicking the red trashcan icon. If a preferred tag has been added to an entity, you cannot delete it, but you can demote it back to just being a user tag.

../_images/tracker-tags.png

To add preferred tags, click the Add Preferred Tags button and use the UI to add or import a list of tags that you would like to be "preferred". If the tag already exists in CDAP, it will be promoted from being a user tag to being a preferred tag. If it is a new tag in CDAP, it will be added in the Preferred Tags list.

../_images/tracker-tags-upload.png
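The promotion, deletion, and demotion rules for preferred tags described above can be modeled with a small in-memory sketch. This is a toy model for illustration, not CDAP code: the class and method names are hypothetical, and the real implementation stores preferred tags in the _auditTagsTable dataset.

```python
class TagRegistry:
    """Toy in-memory model of user tags and preferred tags (not CDAP code)."""

    def __init__(self):
        self.user_tags = set()
        self.preferred_tags = set()
        self.attached = {}  # tag -> set of entity names carrying that tag

    def attach(self, tag, entity):
        """Attach a tag to an entity; tags that are not preferred are user tags."""
        if tag not in self.preferred_tags:
            self.user_tags.add(tag)
        self.attached.setdefault(tag, set()).add(entity)

    def add_preferred(self, tag):
        """Promote an existing user tag, or add a brand-new preferred tag."""
        self.user_tags.discard(tag)
        self.preferred_tags.add(tag)

    def delete_preferred(self, tag):
        """Delete a preferred tag only if no entity is using it."""
        if self.attached.get(tag):
            raise ValueError(f"{tag!r} is attached to entities; demote it instead")
        self.preferred_tags.discard(tag)

    def demote(self, tag):
        """Turn a preferred tag back into a plain user tag."""
        if tag in self.preferred_tags:
            self.preferred_tags.remove(tag)
            self.user_tags.add(tag)


registry = TagRegistry()
registry.attach("pii", "customers")  # "pii" starts out as a user tag
registry.add_preferred("pii")        # promoted to a preferred tag
registry.demote("pii")               # in use, so demote rather than delete
print(sorted(registry.user_tags), sorted(registry.preferred_tags))
```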

Data Dictionary

The Dictionary tab at the top of the page allows you to add a set of columns and descriptions that can be viewed by anyone in the namespace. This allows you to provide more detailed descriptions of columns, as well as the preferred naming convention, the type, and whether the column contains personally identifiable information (PII). These definitions will be applied to all datasets in the namespace. For example, any dataset containing the column customerId will have the same definition and type.

../_images/tracker-dictionary.png
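To make the shape of a Data Dictionary entry concrete, here is a minimal Python sketch. The field names, the sample entry, and the check_schema helper are all hypothetical and chosen for illustration; they do not reflect how the _dataDictionary dataset is actually laid out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDefinition:
    """One Data Dictionary entry; field names are illustrative only."""
    name: str
    type: str
    is_pii: bool
    description: str


# A namespace-wide dictionary keyed by column name (sample data, made up).
dictionary = {
    "customerId": ColumnDefinition(
        name="customerId",
        type="string",
        is_pii=True,
        description="Unique customer identifier",
    ),
}


def check_schema(schema, dictionary):
    """Return the columns whose type disagrees with the dictionary definition."""
    return [
        column
        for column, column_type in schema.items()
        if column in dictionary and dictionary[column].type != column_type
    ]


# A dataset schema that declares customerId with a conflicting type.
print(check_schema({"customerId": "long", "amount": "double"}, dictionary))
```

Columns that are not in the dictionary, such as amount here, are ignored by the check.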

🔗Integrations

Metadata management allows for easy integration with Cloudera Navigator by providing a UI for connecting to a Navigator instance:

../_images/tracker-integration-configuration.png

Details on completing this form are described in CDAP's documentation on the Navigator Integration Application.

🔗Administrating Metadata Management

CDAP metadata management consists of an application in CDAP with two programs and six datasets:

  • _Tracker application: its name begins with an underscore
  • TrackerService: Service exposing the metadata management API endpoints
  • AuditLogFlow: Flow that subscribes to Kafka audit messages and stores them in the _auditLog dataset
  • _auditLog: Custom dataset for storing audit messages
  • _auditMetrics: Custom cube dataset for collecting dataset metrics
  • _auditTagsTable: Custom dataset for storing preferred tags
  • _timeSinceTable: Custom dataset for storing the last time a specific audit message was received
  • _dataDictionary: A Table dataset containing the columns and definitions of the Data Dictionary
  • _configurationTable: A Key-value table containing metadata management configuration options

The metadata management UI is shipped with CDAP and, in Standalone CDAP, is started automatically as part of the CDAP UI. It is available at:

http://localhost:11011/tracker/ns/default

or (Distributed CDAP):

http://<host>:<dashboard-bind-port>/tracker/ns/default

The application is built from a system artifact included with CDAP, tracker-0.4.1.jar.

To administer metadata management, an HTTP RESTful API is available.

🔗Installation

The CDAP Metadata Management Application is deployed from its system artifact included with CDAP. A CDAP administrator does not need to build anything to add metadata management to CDAP; they merely need to enable the application after starting CDAP.

🔗Enabling Metadata Management

Metadata management is enabled automatically in Standalone CDAP and the UI is available at http://localhost:11011/tracker/ns/default. In the Distributed version of CDAP, you must manually enable metadata management in each namespace by visiting http://<host>:<dashboard-bind-port>/tracker/ns/default and pressing the "Enable" button.

Once pressed, the application will be deployed, the datasets created (if necessary), the flow and service started, and search and audit logging will become available.
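Because metadata management must be enabled per namespace in Distributed CDAP, it can help to list the URLs to visit. The following Python sketch builds them from the pattern shown above. It assumes that the final path segment is the namespace name (the text only shows default), and the host and the marketing namespace are made-up examples.

```python
def tracker_url(host="localhost", port=11011, namespace="default"):
    """Build the metadata management UI URL for one namespace.

    Assumption: the last path segment is the namespace name. The port
    defaults to the Standalone value; in Distributed CDAP, pass the
    dashboard bind port configured for your cluster.
    """
    return f"http://{host}:{port}/tracker/ns/{namespace}"


# Hypothetical host and namespaces, for illustration only.
for namespace in ["default", "marketing"]:
    print(tracker_url("cdap.example.com", 11011, namespace))
```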

If you are enabling metadata management from outside the UI, you will need to follow these steps:

  • Using the CDAP CLI, load the artifact (tracker-0.4.1.jar):

    cdap > load artifact target/tracker-0.4.1.jar
    
  • Create an application configuration file (appconfig.txt) that contains the Audit Log reader configuration (the property auditLogConfig). For example:

    {
      "config": {
        "auditLogConfig" : {
          "topic" : "<audit.topic>",
          "zookeeperString" : "<zookeeper.quorum>"
        }
      }
    }
    

    replacing <audit.topic> and <zookeeper.quorum> with the appropriate values from cdap-site.xml.

  • Create a CDAP application using the configuration file:

    cdap > create app TrackerApp tracker 0.4.1 USER appconfig.txt
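If you prefer to generate the configuration file rather than edit it by hand, a short Python sketch such as the following can produce the same JSON structure. The function names are hypothetical, and the topic and ZooKeeper values shown are placeholders; use the actual values from your cdap-site.xml.

```python
import json


def build_app_config(audit_topic, zookeeper_quorum):
    """Return the application configuration as a dictionary."""
    return {
        "config": {
            "auditLogConfig": {
                "topic": audit_topic,
                "zookeeperString": zookeeper_quorum,
            }
        }
    }


def write_app_config(path, audit_topic, zookeeper_quorum):
    """Write the configuration to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_app_config(audit_topic, zookeeper_quorum), handle, indent=2)


# Placeholder values; substitute the settings from cdap-site.xml.
print(json.dumps(build_app_config("<audit.topic>", "<zookeeper.quorum>"), indent=2))
# To create the file itself:
# write_app_config("appconfig.txt", "<audit.topic>", "<zookeeper.quorum>")
```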
    

🔗Restarting CDAP

As metadata management is an application running inside CDAP, it does not start up automatically when CDAP is restarted. Each time that you start CDAP, you will need to re-enable metadata management. Re-enabling metadata management does not recreate the datasets; instead, the datasets from previous runs are reused.

If you are using the audit log feature of metadata management, it is best to enable metadata management before you start any other applications.

If the installation of CDAP is an upgrade from a previous version, all activity and datasets prior to the enabling of metadata management will not be available or seen in the CDAP UI.

🔗Disabling and Removing Metadata Management

If you need to disable or remove metadata management, you need to:

  • stop all programs of the _Tracker application
  • delete the metadata management application
  • delete the metadata management datasets
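The steps above can be sketched as an ordered list of HTTP requests. The Python below only builds the plan; it does not send anything. The paths follow the general shape of CDAP's v3 lifecycle and dataset RESTful APIs as an assumption, so verify them against the HTTP RESTful API reference for your CDAP version before use. The program and dataset names come from the component list earlier on this page, and the application name defaults to _Tracker.

```python
# Programs of the application as (program type, program name) pairs.
PROGRAMS = [("services", "TrackerService"), ("flows", "AuditLogFlow")]

# Datasets used by metadata management.
DATASETS = [
    "_auditLog",
    "_auditMetrics",
    "_auditTagsTable",
    "_timeSinceTable",
    "_dataDictionary",
    "_configurationTable",
]


def teardown_plan(namespace="default", app="_Tracker"):
    """Return (method, path) pairs in the order they should be issued.

    The paths are assumptions modeled on CDAP's v3 RESTful API.
    """
    base = f"/v3/namespaces/{namespace}"
    # 1. Stop all programs of the application.
    plan = [
        ("POST", f"{base}/apps/{app}/{program_type}/{name}/stop")
        for program_type, name in PROGRAMS
    ]
    # 2. Delete the application.
    plan.append(("DELETE", f"{base}/apps/{app}"))
    # 3. Delete the datasets.
    plan.extend(("DELETE", f"{base}/data/datasets/{dataset}") for dataset in DATASETS)
    return plan


for method, path in teardown_plan():
    print(method, path)
```

Keeping the plan as data makes it easy to review the requests before issuing them with an HTTP client of your choice.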