Cask Data Application Platform Release Notes

Release 4.1.2

Improvements

  • CDAP-12020 - Reuse network connections for TMS client.
  • CDAP-11959 - Added a way to limit the frequency of retrieving the MapReduce task report, which could cause network load for very large jobs.
  • CDAP-11949 - Added the ability to configure the HBase client scanner cache for a dataset.
  • CDAP-11594 - Added a startup check for the CDAP Master that errors out if configurations for HBaseDDLExecutor extensions are provided but the extension JAR cannot be loaded.
  • CDAP-11444 - Upgraded the IntelliJ IDEA IDE in the CDAP SDK VM to the 2017.1.3 release.
  • CDAP-11398 - Upgraded the Eclipse IDE in the CDAP SDK VM to the Neon 3 release.
  • CDAP-9515 - Added the ability to denormalize data into individual records, by splitting on delimiter text or by flattening arrays, as a point-and-click directive in the DataPrep UI.
  • CDAP-9514 - Added the ability to apply some DataPrep directives on multiple columns, starting with Join columns and Swap columns. Multiple columns can be selected by checking the checkbox next to each column's name, then selecting a directive in the directive dropdown.
  • CDAP-9507 - Added the ability to format data (date-time formatting, string formatting, etc.) as a point-and-click directive in the DataPrep UI.
  • CDAP-9523 - Added the ability to extract text using regex patterns as a point-and-click directive in the DataPrep UI.
  • CDAP-9096 - Added feature where macro arguments are also listed in the runtime arguments of preview mode, just like when running a new pipeline.
  • CDAP-9094 - Added feature where values of macro arguments are automatically populated and shown in the UI when running a pipeline, if those values exist as Preferences.
  • CDAP-6329 - Enabled GC logging for CDAP services.

Bug Fixes

  • CDAP-11985 - Fixed a bug where the UGI provider returned old and incorrect UGI information.
  • CDAP-11955 - Fixed a bug where the wrong user was sometimes used in Explore, causing namespace deletion to fail.
  • CDAP-11948 - Fixed a bug where committed data could be removed during HBase table flush or compaction.
  • CDAP-11937 - Fixed an issue where a failed MapReduce run was marked as successful.
  • CDAP-11880 - Fixed a bug where Hydrator pipelines and other programs did not create datasets at runtime with the correct impersonated user.
  • CDAP-11815 - Fixed impersonation when upgrading datasets in the UpgradeTool.
  • CDAP-11795 - Fixed an issue with retrieving workflow state if it contains an exception without a message.
  • CDAP-11783 - HBaseDDLExecutor implementation is now localized to the containers without adding it in the container classpath.
  • CDAP-10488 - Fixed the delete button on action plugins so that users can delete them easily.
  • CDAP-9456 - Fixed a bug where an impersonated workflow did not create local datasets with the correct impersonated user.
  • CDAP-8963 - Fixed an issue in the Explore preview where the UI was not displaying boolean values correctly.
  • CDAP-5067 - Fixed an issue where the Workflow driver was restarted when it ran out of memory, causing the Workflow to be executed again from the start node.

Release 4.1.1

Summary

  1. Data Preparation: Point-and-click interactions and integration with the rest of CDAP including, but not limited to, namespaces, security, and pipelines.
  2. Upgrade: Significant reduction in downtime during CDAP upgrades, by removing some data migration and doing required migration in the background after CDAP starts up.
  3. Pipeline Previews: Added logs, better error messaging, ability to read from existing datasets, and a better stop experience.
  4. Logs: Added a condensed view of logs for CDAP pipelines and programs that does not include logs emitted by the CDAP platform and libraries. The condensed view only contains lifecycle logs, logs emitted by the program or pipeline, and errors.
  5. Schedules: Added the ability to update schedules without redeploying the application.

New Features

Data Preparation

  • CDAP-9235 - Users can now interact with and manage multiple workspaces in Data Preparation.
  • WRANGLER-77 - Added point-and-click interactions for applying directives such as parsing, splitting, find and replace, filling null or empty rows, copying and deleting columns in Data Preparation. They can be invoked by using the dropdown menu for each column.

Logs

  • CDAP-9117 - Added option to the log viewer to only show "user" condensed logs.
  • HYDRATOR-1316 - Logs for previews of CDAP pipelines are now available in the CDAP UI via the Logs button in Preview mode.

Schedules

  • CDAP-8902 - Added support for adding, deleting, updating, and retrieving workflow schedules.

Other New Features

  • CDAP-8872 - Upgraded Apache Tephra dependency to the 0.11.0-incubating version.

  • CDAP-9141, HYDRATOR-1453 - Users can now deploy CDAP pipelines with a single action plugin. This feature can be used to run external Apache Spark programs as CDAP pipelines.

    Added a sparkprogram plugin type that can be used to run arbitrary Spark code at the beginning or end of a pipeline. An external Spark program can be added by clicking the "plus" ("+") button in the CDAP UI, choosing Library, and specifying sparkprogram as the type. It is then available as an Action plugin in the CDAP Studio.

  • CDAP-9250 - Added support for HDP 2.6.

  • CDAP-9281 - Added support for CDH 5.11.0.

  • CDAP-9311 - Added support that allows plugin developers to integrate with CDAP services by exposing CDAP service discovery capabilities in the plugin context.

Improvements

Upgrade

  • CDAP-9278 - HBase coprocessor upgrades now run concurrently on CDAP Datasets.
  • CDAP-9282, CDAP-9283 - Improved the CDAP upgrade process to minimize the downtime needed to upgrade, by performing data migration in the background.

Pipeline Previews

  • CDAP-9017 - Simplified the status, next runtime of pipelines, total number of running pipelines, and drafts in the pipeline list view UI.

Schedules

  • CDAP-8942 - Allow administrators to enable or disable updating schedules using the property "app.deploy.update.schedules" in cdap-site.xml. Users can override this to enable or disable updating schedules during deployment of an application using the same property specified in the configuration of the application.
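
As a rough illustration of CDAP-8942, the sketch below renders the property as a Hadoop-style <property> element of the kind used in cdap-site.xml. The helper name render_property is hypothetical, and the value "false" is an assumption about how the setting is disabled; check the cdap-site.xml appendix for the accepted values.

```python
import xml.etree.ElementTree as ET


def render_property(name: str, value: str) -> str:
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Assumption: "false" turns off schedule updates on redeployment.
fragment = render_property("app.deploy.update.schedules", "false")
print(fragment)
```

The printed element would be placed inside the <configuration> root of cdap-site.xml; the same property name can also be supplied in an application's own configuration to override the site-wide setting.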

Other Improvements

  • CDAP-7731 - Added fetch size and transaction flush interval configurations to the Kafka Consumer Flowlet.
  • CDAP-8430 - Users can now see a contextual message with appropriate call(s) to action when no entities are found on the Overview page.
  • CDAP-8990 - Added new configurations to control the YARN application master container memory size, maximum heap memory size, and maximum non-heap memory size: twill.java.heap.memory.ratio, twill.yarn.am.memory.mb, and twill.yarn.am.reserved.memory.mb.
  • CDAP-9003 - Increased the default memory allocation for the CDAP Explore service container to 2048MB.
  • CDAP-9027 - Users can now grant and revoke privileges for UNIX groups and users when using Apache Sentry as the authorization extension for CDAP.
  • CDAP-9077 - Added a "cdap apply-pack [pack]" command to the "cdap" script that allows for upgrading of individual CDAP components.

Bug Fixes

Upgrade

  • CDAP-9185 - Fixed an issue with the pipeline upgrade tool that caused it to skip CDAP 4.0.x pipelines.

Pipeline Previews

  • CDAP-7884 - Fixed a bug where a preview could not read from datasets in the real space.
  • CDAP-8013 - When previewing a pipeline in the CDAP Studio, disabled all writes to sinks. Incoming data to sinks can be viewed in the preview tab of the sink, but is not written to the sink.
  • CDAP-9333 - Fixed an issue where preview of CDAP pipelines did not show data for successful stages if a particular stage failed.

Logs

  • CDAP-7138 - Fixed a problem that caused duplicate logs to show up for a running pipeline.
  • CDAP-9248 - Fixed a bug where the "Total Messages/Errors/Warnings" at the top of the log viewer showed incorrect values.

Schedules

  • CDAP-8918 - Fixed an issue where redeployment of an application with a deleted schedule would fail.

Other Bug Fixes

  • CDAP-4213 - Removed the requirement of being an admin to run the CDAP startup script for Windows.
  • CDAP-5715 - Made Plugin Endpoint invocation more robust. If a plugin's parent can't instantiate the plugin necessary for the invocation, CDAP will try to instantiate it using the plugin's other parents before returning an error.
  • CDAP-6348 - Fixed an issue with namespace deletion which caused CDAP Application test cases to fail in a Windows environment.
  • CDAP-8862 - Fixed an issue where a few metrics were lost when a container was shut down.
  • CDAP-8888 - Fixed an issue with the YARN container allocation logic so that the correct container size is used.
  • CDAP-8913 - Improved the serializability of Tables and IndexedTables when used in Spark programs.
  • CDAP-8945 - Moved the "add plugin" behavior from a plugin's left panel to an "Add Entity" button in the CDAP Studio UI.
  • CDAP-8950 - Fixed an issue in the CDAP UI where navigating from a stream card to an overview and then to a detail page made the detail page show a spinner icon indefinitely.
  • CDAP-8980, CDAP-9314 - Fixed an issue with the Spark program runtime so that the Kryo serializer can be used.
  • CDAP-9005 - Fixed an issue where the HBase Queue Debugging Tool failed when authorization was enabled.
  • CDAP-9029, CDAP-9035 - Fixed an issue where users could not grant and revoke privileges for UNIX groups and users when using Apache Sentry as the authorization extension for CDAP.
  • CDAP-9046 - Fixed an issue where revoking privileges from a role caused the privilege to be revoked from all roles.
  • CDAP-9086 - Fixed an issue with the Window plugin so that it propagates schema properly.
  • CDAP-9087 - Fixed the Overview panel in home page of the CDAP UI to handle unknown entities appropriately.
  • CDAP-9114 - Added the retrying of local dataset operations when a failure happens.
  • CDAP-9142 - Fixed an issue with the binary format in the Kafka streaming source that prevented pipeline deployment.
  • CDAP-9160 - Fixed an issue that caused YARN containers to be killed due to excessive memory usage when impersonation is enabled.
  • CDAP-9216 - Fixed a bug where navigation links referenced the default namespace instead of the current namespace.
  • HYDRATOR-703 - Improved error messages for the 'Get Schema' functionality of Database plugins in CDAP Pipelines.

Known Issues

  • CDAP-9151 - The CDAP CLI commands for getting and setting preferences introduced in CDAP 4.1.0 (such as set app preferences <app-id> <preferences>) are not working correctly. Use the previous commands (marked as deprecated), such as set preferences app <runtime-args> <app-id>, as a workaround.
  • CDAP-9388 - When creating a stream and uploading data from the wizard in the CDAP resource center, the metrics on the cards in the overview do not show appropriate numbers. They show zero for both the number of events and the bytes.

API Changes

Logs

  • CDAP-9084 - The CDAP Logging APIs now return a 404 status code if the entity (the run id) for which logs are requested does not exist.
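
A client of the Logging APIs may want to treat the new 404 as "no such run" rather than as a generic failure. The sketch below shows one possible way to do that; the function name interpret_log_response is hypothetical, no real HTTP call is made, and the handling of other status codes is an assumption.

```python
from typing import Optional


def interpret_log_response(status: int, body: str) -> Optional[str]:
    """Return the log text, or None when the requested run does not exist."""
    if status == 404:
        # Per CDAP-9084: the entity (run id) was not found.
        return None
    if status == 200:
        return body
    # Assumption: any other status is treated as an error by this client.
    raise RuntimeError(f"unexpected status {status}")


print(interpret_log_response(404, ""))
print(interpret_log_response(200, "log line"))
```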

Release 4.1.0

New Features

Secure Impersonation

  • CDAP-8110 - Added support for fine-grained impersonation at the CDAP application, dataset, and stream level.
  • CDAP-8355 - Impersonated namespaces can be configured to disallow the impersonation of the namespace owner when running CDAP Explore queries.

Replication and Resiliency

  • CDAP-7685 - Provided SPI hooks that users can implement for performing HBase DDL operations.
  • CDAP-8025 - Added a tool to check a cluster's replication status.
  • CDAP-8032 - CDAP context methods will now be retried according to a program's retry policy. These are governed by these properties:
    • custom.action.retry.policy.base.delay.ms
    • custom.action.retry.policy.max.delay.ms
    • custom.action.retry.policy.max.retries
    • custom.action.retry.policy.max.time.secs
    • custom.action.retry.policy.type
    • flow.retry.policy.base.delay.ms
    • flow.retry.policy.max.delay.ms
    • flow.retry.policy.max.retries
    • flow.retry.policy.max.time.secs
    • flow.retry.policy.type
    • mapreduce.retry.policy.base.delay.ms
    • mapreduce.retry.policy.max.delay.ms
    • mapreduce.retry.policy.max.retries
    • mapreduce.retry.policy.max.time.secs
    • mapreduce.retry.policy.type
    • service.retry.policy.base.delay.ms
    • service.retry.policy.max.delay.ms
    • service.retry.policy.max.retries
    • service.retry.policy.max.time.secs
    • service.retry.policy.type
    • spark.retry.policy.base.delay.ms
    • spark.retry.policy.max.delay.ms
    • spark.retry.policy.max.retries
    • spark.retry.policy.max.time.secs
    • spark.retry.policy.type
    • system.log.process.retry.policy.base.delay.ms
    • system.log.process.retry.policy.max.retries
    • system.log.process.retry.policy.max.time.secs
    • system.log.process.retry.policy.type
    • system.metrics.retry.policy.base.delay.ms
    • system.metrics.retry.policy.max.retries
    • system.metrics.retry.policy.max.time.secs
    • system.metrics.retry.policy.type
    • worker.retry.policy.base.delay.ms
    • worker.retry.policy.max.delay.ms
    • worker.retry.policy.max.retries
    • worker.retry.policy.max.time.secs
    • worker.retry.policy.type
    • workflow.retry.policy.base.delay.ms
    • workflow.retry.policy.max.delay.ms
    • workflow.retry.policy.max.retries
    • workflow.retry.policy.max.time.secs
    • workflow.retry.policy.type
  • CDAP-8037 - Added a master.manage.hbase.coprocessors setting that can be set to false on clusters where the CDAP coprocessors are deployed on every HBase node.
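
The retry policy properties listed under CDAP-8032 follow a regular naming pattern per program type. The sketch below builds those names and illustrates how a base delay, maximum delay, and retry count could interact under a generic doubling backoff. The doubling rule is an assumption for illustration only: these notes do not describe how CDAP computes delays or which values the type property accepts, and max.time.secs is not modeled here.

```python
from typing import List

# Suffixes as they appear in the property list above. The system.log.process
# and system.metrics prefixes are listed without a max.delay.ms entry.
SUFFIXES = ("base.delay.ms", "max.delay.ms", "max.retries", "max.time.secs", "type")


def policy_keys(program_type: str) -> List[str]:
    """Build the retry policy property names for one program type."""
    return [f"{program_type}.retry.policy.{suffix}" for suffix in SUFFIXES]


def backoff_delays(base_delay_ms: int, max_delay_ms: int, max_retries: int) -> List[int]:
    """Generic doubling backoff, capped at max_delay_ms (illustrative only)."""
    delays = []
    delay = base_delay_ms
    for _ in range(max_retries):
        delays.append(min(delay, max_delay_ms))
        delay *= 2
    return delays


print(policy_keys("workflow"))
print(backoff_delays(100, 1000, 5))
```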

Enhancements to the New CDAP UI

  • CDAP-8021 - Added the management of preferences at the application and program levels.
  • CDAP-8198, CDAP-8199, CDAP-8214, CDAP-8217 - The CDAP UI added dataset and stream detail and overviews.
  • CDAP-8203 - The CDAP UI added a "call-to-action" dialog after entity creation, so users can easily perform actions on the newly-created entities.
  • CDAP-8282, CDAP-8376 - Users can now view events and logs of programs in the new CDAP UI using the events and log view "fast-action" dialogs.
  • CDAP-8398 - Users now see on the CDAP UI homepage a "Just Added" section, listing and highlighting any entities added in the last five minutes.
  • HYDRATOR-208 - The CDAP UI added a duration timer to CDAP pipelines.

Logs

  • CDAP-7676, CDAP-9999 - Added a prototype implementation for a rolling HDFS log appender.
  • CDAP-7962 - Program context information, including namespace, program name, and program type, are now available in the MDC property of each ILoggingEvent emitted from a program container.
  • CDAP-8108 - Revised the CDAP Log Appender to use Logback's Appender interface.
  • CDAP-8231 - The log file cleaner thread will remove metadata and, for successfully deleted metadata entries, it will delete the corresponding log files. The log file cleaner thread will only remove the metadata entries for the old (pre-4.1.0) log format.
  • CDAP-8261 - Logs collected by the CDAP Log Appender will be stored at a common <cdap>/logs path, owned by the cdap user. For security, it is readable only by the cdap user.
  • CDAP-8428 - Added additional metrics about the status of the log framework: log.process.min.delay and log.process.max.delay.

New CDAP Pipeline Plugins

Dataset Improvements

  • CDAP-7596 - Added the ability to reuse an existing file system location and Hive table when creating a partitioned file set.
  • CDAP-7597 - Added configuring the CDAP Explore database and table name for a dataset using dataset properties.
  • CDAP-7683 - Added a tool that pre-builds and loads the HBase coprocessors required by CDAP onto HDFS.
  • CDAP-8070 - Added control of group ownership and permissions through dataset properties.

Other New Features

  • CDAP-4556 - CDAP now uses environment variables in the spark-env.sh and properties in the spark-defaults.conf when launching Spark programs.
  • CDAP-5107 - Added an HTTP RESTful endpoint to retrieve a specific property for a specific version of an artifact in the system scope.
  • CDAP-8122 - Made headers and the request/response bodies available in audit logs for certain RESTful endpoints.
  • CDAP-8292 - Added support for CDH 5.10.0.

Improvements

  • CDAP-3383 - Enabled invalid transaction list pruning in CDAP, a new feature introduced in Apache Tephra. This automates the pruning of the invalid transaction list after data for the invalid transaction has been dropped.
  • CDAP-6046 - Added an easier, additional syntax for the CDAP CLI set/get/load/delete <type> preferences commands, with the preferences at the end of the syntax, such as set workflow preferences MyApp.My.WF 'a=b c=d'.
  • CDAP-7835 - The Metadata Service upgrades the metadata dataset to reduce the time required by the upgrade tool during a CDAP upgrade.
  • CDAP-8019 - Added a configuration to control the timeout of CDAP Explore operations: set explore.http.timeout in the cdap-site.xml file.
  • CDAP-8061 - Moved the Cask Market Path to the cdap-defaults.xml file. Users can now configure the path to a private Cask Market using the configuration setting market.base.url.
  • CDAP-8075 - The CDAP UI added one-step deploy wizards for the Cask Market. Users can now deploy applications and plugins from the Cask Market with a single click, instead of downloading them from the market and then uploading them.
  • CDAP-8152 - StreamingSource plugins now have access to the CDAP SparkExecutionContext to read from datasets and streams.
  • CDAP-8183 - The CDAP UI now automatically retries loading the homepage when the CDAP Server is not up and ready yet.
  • CDAP-8250 - Reduced non-informative stacktrace information in the log when a connection to the CDAP Router is closed prematurely.
  • CDAP-8565 - Improved the master process stop procedure to support fast failover when running with HA. Added a new kill command to force-kill CDAP processes.
  • HYDRATOR-282 - Updated the CSVParser plugin to change "PDL" to "Pipe Delimited" and "TDF" to "Tab Delimited".
  • HYDRATOR-577 - Changed the Table sink plugin to make using the schema.row.field optional, which allows the schema.row.field to be used as a column in the output.
  • HYDRATOR-1006 - Updated the Tokenizer plugin to be more forgiving when parsing tokens by accepting regex with white spaces; the output schema now contains all the fields that were in the input schema and not only the column that is being tokenized.
  • HYDRATOR-1028 - Changed the Data Generator configuration to be easier to use; as the type parameter can only be one of "stream" or "table", changed to using a select widget to configure it.
  • HYDRATOR-1144 - Updated the use of "true/false" select boxes to be consistent in their ordering.
  • HYDRATOR-1149 - Added the ability to read recursive directories to the File source plugin.
  • HYDRATOR-1162 - Added logging to an error-dataset to the LogParser and XMLMultiParser plugins.
  • HYDRATOR-1177 - Plugins can now retrieve the input and output schema of their stage in their initialize methods.
  • WRANGLER-3 - The CDAP UI's Wrangler modal dialog will give a warning when you try to close or exit out of it without confirmation.

Bug Fixes

  • CDAP-2543 - Fixed an issue of a hanging application in the case that a user program JAR is missing dependencies.
  • CDAP-4739 - Fixed an issue to make artifact, datasets, logs, and coprocessor JAR locations resilient to an HDFS Namenode HA upgrade.
  • CDAP-5717 - Fixed an issue with starting the CDAP CLI and the CDAP Standalone when the on-disk path has a space in it.
  • CDAP-6690 - Fixed issues with the formatting of dataset instance properties in the output of the CDAP CLI.
  • CDAP-6704 - Fixed issues with, and clarified, some of the CDAP CLI help text and error messages.
  • CDAP-7155 - Fixed a problem where the Dataset Service failed to start up if authorization was enabled and the authorization plugin was slow to respond.
  • CDAP-7228 - Empty and null metadata tags are now removed in the metadata upgrade step of the CDAP Upgrade Tool.
  • CDAP-7302 - Fixed an issue that caused the CDAP Master to die if HBase was down when a follower became the leader.
  • CDAP-7694 - Fixed an issue where the CDAP service scripts could cause a terminal session to not echo characters.
  • CDAP-7813 - The security policies for accessing entities have been changed and the documentation updated to reflect these changes.
  • CDAP-7911 - The error messages returned for bad requests to the metadata search RESTful APIs have been improved.
  • CDAP-7930 - Performing a metadata search now returns the correct total, even if the offset is very large.
  • CDAP-7935 - Fixed an issue with the CDAP Standalone not starting and stopping correctly.
  • CDAP-7991 - The Cask Market now shows only those entities that are valid for the specific version of CDAP viewing them.
  • CDAP-8001 - Fixed an issue with the retrieving of logs when a namespace was deleted and then recreated with same name.
  • CDAP-8041 - Fixed an issue where the CDAP Master process would hang during a shutdown.
  • CDAP-8086 - Removed an obsolete Update Dataset Specifications step in the CDAP Upgrade tool. This step was required only for upgrading from CDAP versions lower than 3.2 to CDAP version 3.2.
  • CDAP-8087 - Provided a workaround for Scala bug SI-6240 (issues.scala-lang.org/browse/SI-6240) to allow concurrent execution of Spark programs in CDAP Workflows.
  • CDAP-8088 - Fixed the CDAP UI pipeline detail view so that it can be rendered in older browsers.
  • CDAP-8094 - Fixed an issue where the number of records processed during a preview run of the realtime data pipeline was being incremented incorrectly.
  • CDAP-8133 - Fixed an issue with metadata searches with certain offsets overflowing and returning an error.
  • CDAP-8180 - Fixed an issue with the CDAP Standalone not correctly warning about the absence of Node.js.
  • CDAP-8229 - Fixed the CDAP UpgradeTool so that it does not rely on the existence of a 'default' namespace.
  • CDAP-8313 - Fixed an issue where system artifacts would continuously be loaded if there was a partial JAR in the system artifacts directory.
  • CDAP-8342 - Fixed an issue where CDAP Explore operations from a program container running as a user were impersonating the namespace owner. Now they impersonate the respective program container users.
  • CDAP-8367 - Fixed issues with "Hive-on-Spark" on newer versions of CDH failing to run Spark jobs due to permission and configuration errors.
  • CDAP-8442 - Fixed an issue in the CDAP UI where the "Stop Program" modal dialog kept loading (showing a spinning wheel) even after the program had been stopped.
  • CDAP-8446 - Fixed an issue where the Transactional.run method could throw the wrong exception if the transaction service was unavailable when it was finishing a transaction.
  • CDAP-8509 - Fixed an issue in the Transactional Messaging System (TMS) table upgrade, where the TMS table could be left in a disabled state if the upgrade tool is run after an upgraded CDAP Master is started and then stopped.
  • CDAP-8544 - Lowered the RPC timeout and number of retries for the HBase operations performed by CDAP Master services.
  • CDAP-8628 - Fixed an issue in the log saver and the metrics processor where, if an exception was thrown while changing the number of instances, a container JVM process could be left running without performing any work.
  • CDAP-8634 - Corrected the Javadoc of the PluginConfig's containsMacro() method to reflect that it always returns false at runtime.
  • CDAP-8636 - Fixed an issue with Spark programs not working against CDH 5.8.4.
  • CDAP-8672 - Fixed the CDAP Router so that it does not log an error when it cannot discover a service. Previously, the message was logged at the debug level.
  • CDAP-8687 - Fixed an issue where a user who attempted to create an already-existing stream created by a different user received all the privileges, and the original user had their privileges revoked.
  • CDAP-8694 - Fixed an issue with properly locating CDAP_HOME in Distributed CDAP instances outside the default /opt/cdap directory.
  • HYDRATOR-1085 - Fixed an issue where the File Sink plugin was failing when writing byte array records.
  • HYDRATOR-1096 - Fixed an issue with the macro substitution of a Table dataset name.
  • HYDRATOR-1158 - Fixed an issue with the JSON parser failing if no data was present for a nullable field.
  • HYDRATOR-1212 - Fixed an issue where runtime arguments were not being passed correctly for the pipeline preview run in the CDAP UI.
  • HYDRATOR-1219 - Fixed an issue in the Wrangler transform with the handling of escaped characters.
  • HYDRATOR-1226 - Fixed an issue where pipeline previews would not run in a non-default namespace.
  • HYDRATOR-1238 - Fixed an issue where the RunTransform plugin was not checking for null fields.
  • HYDRATOR-1246 - Fixed an issue with the DateTransform plugin and the handling of null values.
  • HYDRATOR-1377 - Fixed an issue with the S3 source and sink plugins in the CDAP Standalone.
  • TRACKER-264 - Fixed an issue with the Data Dictionary's validate API not accepting CDAP-schema JSON.
  • WRANGLER-12 - Added to Wrangler an option to convert column names to be schema-compatible.

Known Issues

  • CDAP-7770 - The current CDAP UI build process does not work on Microsoft Windows.
  • CDAP-8375 - Invalid Transaction Pruning does not work on a replicated cluster and needs to be disabled by setting the configuration parameter data.tx.prune.enable to false in the cdap-site.xml file.
  • CDAP-8494 - If users navigate to the classic CDAP UI, they cannot come back to the new CDAP UI if they click the browser back button.
  • CDAP-8531, CDAP-8659, CDAP-8791 - If the property hive.compute.query.using.stats is true in HDP 2.5.x clusters, CDAP Explore queries that trigger a MapReduce program can fail.
  • CDAP-8663 - If a user revokes a privilege on a namespace, the privileges on all entities in that namespace are also revoked.
  • CDAP-8789 - On the CDAP UI, program logs show error logs correctly. When switched to "Raw Logs", the error logs are missing. (The same behavior is seen in the classic CDAP UI.) CDAP CLI shows all logs correctly.
  • CDAP-8812 - Long plugin names don't show up in the left sidebar of the CDAP Studio when running on Microsoft Windows.
  • CDAP-8818 - Local datasets appear on the CDAP UI overview page even though they are temporary datasets that should be filtered out.
  • HYDRATOR-1389 - On Windows, users of CDAP Studio must double-click plugin icons in order for their node configuration panels to open.

API Changes

  • CDAP-6642 - Attempting to delete a system artifact by specifying a user namespace (that previously returned a 200, even though the artifact was not deleted) will now return a 404, as that combination of system and user will never occur.
  • CDAP-8445 - The stream endpoint to enqueue messages now returns a 503 instead of a 500 if it failed because the dataset service was unavailable.
  • CDAP-8448 - In general, changed the HTTP RESTful endpoints to return a 503 instead of a 500 when the transaction service was unavailable.
  • CDAP-8606 - New properties have been added to CDAP, including new log saver properties that replace the previous ones. As a consequence, the previous properties will no longer work. See the Appendix: cdap-site.xml for details on these properties.

    Old Properties

    • log.cleanup.max.num.files
    • log.cleanup.run.interval.mins
    • log.retention.duration.days

    New Properties

    • custom.action.retry.policy.base.delay.ms
    • custom.action.retry.policy.max.delay.ms
    • custom.action.retry.policy.max.retries
    • custom.action.retry.policy.max.time.secs
    • custom.action.retry.policy.type
    • data.tx.prune.enable
    • data.tx.prune.plugins
    • data.tx.prune.state.table
    • data.tx.pruning.plugin.class
    • explore.http.timeout
    • flow.retry.policy.base.delay.ms
    • flow.retry.policy.max.delay.ms
    • flow.retry.policy.max.retries
    • flow.retry.policy.max.time.secs
    • flow.retry.policy.type
    • hbase.client.retries.number
    • hbase.rpc.timeout
    • log.pipeline.cdap.dir.permissions
    • log.pipeline.cdap.file.cleanup.interval.mins
    • log.pipeline.cdap.file.cleanup.transaction.timeout
    • log.pipeline.cdap.file.max.lifetime.ms
    • log.pipeline.cdap.file.max.size.bytes
    • log.pipeline.cdap.file.permissions
    • log.pipeline.cdap.file.retention.duration.days
    • log.pipeline.cdap.file.sync.interval.bytes
    • log.process.pipeline.auto.buffer.ratio
    • log.process.pipeline.buffer.size
    • log.process.pipeline.checkpoint.interval.ms
    • log.process.pipeline.config.dir
    • log.process.pipeline.event.delay.ms
    • log.process.pipeline.kafka.fetch.size
    • log.process.pipeline.lib.dir
    • log.process.pipeline.logger.cache.expiration.ms
    • log.process.pipeline.logger.cache.size
    • log.publish.partition.key
    • mapreduce.retry.policy.base.delay.ms
    • mapreduce.retry.policy.max.delay.ms
    • mapreduce.retry.policy.max.retries
    • mapreduce.retry.policy.max.time.secs
    • mapreduce.retry.policy.type
    • market.base.url
    • master.manage.hbase.coprocessors
    • metrics.kafka.meta.table
    • metrics.kafka.topic.prefix
    • metrics.messaging.fetcher.limit
    • metrics.messaging.meta.table
    • metrics.messaging.topic.num
    • metrics.topic.prefix
    • router.audit.path.check.enabled
    • security.keytab.path
    • service.retry.policy.base.delay.ms
    • service.retry.policy.max.delay.ms
    • service.retry.policy.max.retries
    • service.retry.policy.max.time.secs
    • service.retry.policy.type
    • spark.retry.policy.base.delay.ms
    • spark.retry.policy.max.delay.ms
    • spark.retry.policy.max.retries
    • spark.retry.policy.max.time.secs
    • spark.retry.policy.type
    • system.log.process.retry.policy.base.delay.ms
    • system.log.process.retry.policy.max.retries
    • system.log.process.retry.policy.max.time.secs
    • system.log.process.retry.policy.type
    • system.metrics.retry.policy.base.delay.ms
    • system.metrics.retry.policy.max.retries
    • system.metrics.retry.policy.max.time.secs
    • system.metrics.retry.policy.type
    • twill.location.cache.dir
    • worker.retry.policy.base.delay.ms
    • worker.retry.policy.max.delay.ms
    • worker.retry.policy.max.retries
    • worker.retry.policy.max.time.secs
    • worker.retry.policy.type
    • workflow.retry.policy.base.delay.ms
    • workflow.retry.policy.max.delay.ms
    • workflow.retry.policy.max.retries
    • workflow.retry.policy.max.time.secs
    • workflow.retry.policy.type
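
Because the old log saver properties listed under CDAP-8606 no longer work, it can be useful to scan an existing cdap-site.xml for them before upgrading. The following is a minimal sketch, assuming the file uses the usual Hadoop-style <configuration>/<property> layout; the function name and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# The "Old Properties" listed above.
REMOVED = {
    "log.cleanup.max.num.files",
    "log.cleanup.run.interval.mins",
    "log.retention.duration.days",
}


def find_removed_properties(site_xml: str) -> List[str]:
    """Return the removed property names that still appear in the given XML."""
    root = ET.fromstring(site_xml)
    names = [p.findtext("name", default="").strip() for p in root.iter("property")]
    return sorted(n for n in names if n in REMOVED)


# Sample content with made-up values, for illustration only.
SAMPLE = """<configuration>
  <property><name>log.retention.duration.days</name><value>7</value></property>
  <property><name>explore.http.timeout</name><value>60000</value></property>
</configuration>"""

print(find_removed_properties(SAMPLE))
```

Any name this reports would need to be replaced with the corresponding new log.pipeline.* or log.process.pipeline.* setting, as described in the cdap-site.xml appendix.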

Deprecated and Removed Features

  • See API Changes, CDAP-8606 above for removed properties.
  • CDAP-8753 - Deprecated the waitForFinish() method in the ProgramManager and added the waitForRun() method to replace it; the new method waits for actual run records with the given status.

Release 4.0.1

Improvement

  • CDAP-8047 - Added a step in the CDAP Upgrade Tool to disable the TMS (Transactional Messaging System) message and payload tables. The TMS TwillRunnable will update the coprocessors of those tables if required and enable the tables.

Bug Fixes

  • CDAP-7694 - Fixed an issue where the CDAP service scripts could cause a terminal session to not echo characters.
  • CDAP-7992 - The CDAP Security service under Standalone CDAP is no longer forced to bind to localhost.
  • CDAP-8000 - To avoid transaction timeouts, log cleanup is now done in configurable batches (controlled by the property log.cleanup.max.num.files) instead of a single short transaction.
  • CDAP-8007 - Fixed a bug in the TMS (Transaction Messaging Service) message and payload table coprocessors by changing the accessing of CDAP configuration and TMS metadata tables from reading them inline to reading them in a separate thread.
  • CDAP-8023 - Changed the default CDAP UI port to 11011 to match the CDAP 4.0.0 release.
  • CDAP-8086 - Removed an obsolete Update Dataset Specifications step in the CDAP Upgrade tool. This step was required only for upgrading from CDAP versions lower than 3.2 to CDAP Version 3.2.
  • CDAP-8087 - Provided a workaround for Scala bug SI-6240 (https://issues.scala-lang.org/browse/SI-6240) to allow concurrent execution of Spark programs in CDAP Workflows.
  • CDAP-8088 - Fixed the CDAP Hydrator detail view so that it can be rendered in older browsers.
  • CDAP-8094 - Fixed an issue where the number of records processed during a preview run of the realtime data pipeline was being incremented incorrectly.
  • CDAP-8126 - Fixed an issue with the flag used by the Node proxy to enable SSL between the CDAP UI and CDAP Router.
  • CDAP-8137 - Fixed an issue with the CDAP CLI where execute commands may be interpreted incorrectly.
  • CDAP-8148 - Fixed an issue in the template path used with the original CDAP UI when rendering a dataset detailed view.
  • CDAP-8158 - Fixed issues with the Ambari UI "Quick Links" and alerts definitions for SSL and non-default ports and the writing of the cdap-security.xml file when configured under the CDAP Ambari Service.
  • HYDRATOR-1212 - Fixed an issue where runtime arguments were not being passed for the preview run correctly in the CDAP UI.
  • HYDRATOR-1226 - Fixed an issue where previews would not run in a non-default namespace.

Release 4.0.0

New Features

  • Cask Market
    • CDAP-7203 - Adds Cask Market: Cask's Big Data app store, providing an ecosystem of pre-built Hadoop solutions, re-usable templates, and plugins. Within CDAP, users can access the market and create Hadoop solutions or Big Data applications with easy-to-use guided wizards.
  • Cask Wrangler
    • WRANGLER-2 - Added Cask Wrangler: a new CDAP extension for interactive data preparation.
  • CDAP Transactional Messaging System
    • CDAP-7211 - Adds a transactional messaging system that is used for reliable communication of messages between components. In CDAP 4.0.0, the transactional messaging system replaces Kafka for publishing and subscribing to the audit logs that are used within CDAP for computing data lineage.
  • Operational Statistics
    • CDAP-7670 - Added a pluggable extension to retrieve operational statistics in CDAP. Provided extensions for operational stats from YARN, HDFS, HBase, and CDAP.
    • CDAP-7703 - Added reporting operational statistics for YARN. They can be retrieved using JMX with the domain name co.cask.cdap.operations and the property name set to yarn.
    • CDAP-7704 - Added reporting operational statistics for HBase. They can be retrieved using JMX with the domain name co.cask.cdap.operations and the property name set to hbase as well as through the CDAP UI Administration page.
  • Dynamic Log Level
    • CDAP-5479 - Allow updating or resetting of log levels for program types worker, flow, and service dynamically using REST endpoints.
    • CDAP-7214 - Allow setting the log levels for all program types through runtime arguments or preferences.
  • New Versions of Distributions Supported
    • CDAP-6938 - Added support for Amazon EMR 4.6.0+ installation of CDAP via a bootstrap action script.
    • CDAP-7249 - Added support for HDInsight 3.5.
    • CDAP-7291 - Added support for CDH 5.9.
    • CDAP-7901 - Added support for HDP 2.5.
  • New Hydrator Plugins Added
    • HYDRATOR-504 - Added to the Hydrator plugins a Tokenizer Spark compute plugin.
    • HYDRATOR-512 - Added to the Hydrator plugins a Sink plugin to write to Solr search.
    • HYDRATOR-517 - Added to the Hydrator plugins a Logistic Regression Spark Machine Learning plugin.
    • HYDRATOR-668 - Added to the Hydrator plugins a Decision Tree Regression Spark Machine Learning plugin.
    • HYDRATOR-909 - Added to the Hydrator plugins a SparkCompute Hydrator plugin to compute N-Grams of any given String.
    • HYDRATOR-935 - Added to the Hydrator plugins a Windows share copy Action plugin.
    • HYDRATOR-971 - Added to the Hydrator plugins a Hydrator plugin that watches a directory and streams file content when new files are added.
    • HYDRATOR-973 - Added to the Hydrator plugins an HTTP Poller source plugin for streaming pipelines.
    • HYDRATOR-977 - Added to the Hydrator plugins an XML parser plugin that can parse out multiple records from a single XML document.
    • HYDRATOR-981 - Added to the Hydrator plugins an Action plugin to run any executable binary.
    • HYDRATOR-1029 - Added to the Hydrator plugins an Action plugin to export data in an Oracle database.
    • HYDRATOR-1091 - Added the ability to run a Hydrator pipeline in a preview mode without publishing. It allows users to view the data in each stage of the preview run.
    • HYDRATOR-1111 - Added to the Hydrator plugins a plugin for transforming data according to commands provided by the Cask Wrangler tool.
    • HYDRATOR-1146 - Added to the Hydrator plugins a Sink plugin to write to Amazon Kinesis from Batch pipelines.
  • Cask Tracker
    • TRACKER-233 - Added a data dictionary to Cask Tracker for users to define columns for datasets, enforce a common naming convention, and apply masking to PII (personally identifiable information).
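
The JMX lookup described under Operational Statistics (CDAP-7703 and CDAP-7704) could be sketched roughly as follows. This is only an illustration: the domain name comes from the notes above, but the class and method names are hypothetical, and it assumes that "the property name set to yarn" means an ObjectName key property called name. It queries the local platform MBean server, so it only finds the beans when run inside the JVM that registers them; a remote MBeanServerConnection offers the same queryNames() call.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of CDAP.
final class OperationalStatsLookup {
  // Domain name taken from the release notes.
  static final String DOMAIN = "co.cask.cdap.operations";

  // Builds a pattern matching every MBean in the domain whose "name" key
  // property equals the given service (assumed key; for example "yarn" or "hbase").
  static ObjectName patternFor(String serviceName) throws MalformedObjectNameException {
    return new ObjectName(DOMAIN + ":name=" + serviceName + ",*");
  }

  // Returns the names of all registered MBeans matching the pattern.
  static Set<ObjectName> find(MBeanServer server, String serviceName)
      throws MalformedObjectNameException {
    return server.queryNames(patternFor(serviceName), null);
  }

  public static void main(String[] args) throws Exception {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    for (ObjectName name : find(server, "yarn")) {
      System.out.println(name);
    }
  }
}
```

The trailing ",*" in the pattern lets the lookup match beans that carry additional key properties, which the notes do not enumerate.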

Improvements

  • CDAP-1280 - Merged various shell scripts into a single script to interface with CDAP, called cdap, shipped with both the SDK and Distributed CDAP.
  • CDAP-1696 - Updated the default CDAP Router port to 11015 to avoid conflicting with HiveServer2's default port.
  • CDAP-3262 - Fixed an issue with the CDAP scripts under Windows not handling a JAVA_HOME path with spaces in it correctly. CDAP SDK home directories with spaces in the path are not supported (due to issues with the product) and the scripts now exit if such a path is detected.
  • CDAP-4322 - For MapReduce programs using a PartitionedFileSet as input, the partition key corresponding to the input split is now exposed to the mapper.
  • CDAP-4901 - Fixed an issue where an exception from an HttpContentConsumer was being silently ignored.
  • CDAP-5068 - Added pagination for the search RESTful API. Pagination is achieved via the offset, limit, numCursors, and cursor parameters in the RESTful API.
  • CDAP-5632 - New menu option in Cloudera Manager when running the CDAP CSD enables running utilities such as the HBaseQueueDebugger.
  • CDAP-6183 - Added the property program.container.dist.jars to set extra jars to be localized to every program container and to be added to classpaths of CDAP programs.
  • CDAP-6425 - Fixed an issue that allowed a FileSet to be created if its corresponding directory already existed.
  • CDAP-6572 - The namespace that integration test cases run against by default has been made configurable.
  • CDAP-6577 - Improved the UpgradeTool to upgrade tables in namespaces with impersonation configured.
  • CDAP-6587 - Added support for impersonation with CDAP Explore (Hive) operations, including enabling exploring of a dataset or running queries against it.
  • CDAP-6635 - Added a feature that implements caching of user credentials in CDAP system services.
  • CDAP-6837 - Fixed an issue in WorkerContext that did not properly implement the contract of the Transactional interface. Note that this fix may cause incompatibilities with previous releases in certain cases. See API Changes, CDAP-6837 for more details.
  • CDAP-6862 - Updated more system services to respect the cdap-site parameter "master.service.memory.mb".
  • CDAP-6885 - Added support for concurrent runs of a Spark program.
  • CDAP-6937 - Added support for running CDAP on Apache HBase 1.2.
  • CDAP-6938 - Added support for Amazon EMR 4.6.0+ installation of CDAP via a bootstrap action script.
  • CDAP-6984 - Added support for enabling SSL between the CDAP Router and CDAP Master.
  • CDAP-6995 - Added the capability to clean up log files that do not have corresponding metadata.
  • CDAP-7117 - Added support for checkpointing in Spark Streaming programs to persist checkpoints transactionally.
  • CDAP-7181 - Updated the Windows start scripts to match the new shell script functionality.
  • CDAP-7192 - Added the ability to specify an announce address and port for the CDAP AppFabric and Dataset services. Deprecated the properties app.bind.address and dataset.service.bind.address, replacing them with master.services.bind.address as the bind address for master services. Added the properties master.services.announce.address, app.announce.port, and dataset.service.announce.port for use as announce addresses that are different from the bind address.
  • CDAP-7208 - Improved CDAP Master logging of events related to programs that it launches.
  • CDAP-7240 - Fixed a NullPointerException being logged on closing network connection.
  • CDAP-7284 - Upgraded the Apache Tephra version to 0.10-incubating.
  • CDAP-7287 - Added support for enabling client certificate-based authentication to the CDAP Authentication server.
  • CDAP-7291 - Added support for CDH 5.9.
  • CDAP-7319 - Provided programs more control over when and how transactions are executed.
  • CDAP-7385 - The Log HTTP Handler and Router have been fixed to allow the streaming of larger log files.
  • CDAP-7393 - Revised the documentation on the recommended setting for yarn.nodemanager.delete.debug-delay-sec.
  • CDAP-7439 - Removed the requirement in the documentation of running kinit prior to running the CDAP Upgrade Tool when upgrading a package installation of CDAP on a secure Hadoop cluster.
  • CDAP-7476 - Improves how MapReduce configures its inputs, such that failures surface immediately.
  • CDAP-7477 - Fixed an issue in MapReduce that caused skipping the destroy() method if the committing of any of the dataset outputs failed.
  • CDAP-7557 - DynamicPartitioner can now limit the number of open RecordWriters to one, if the output partition keys are grouped.
  • CDAP-7659 - Added support for specifying the Hive execution engine at runtime (dynamically).
  • CDAP-7761 - Adds the cluster.name property that identifies a cluster; this property can be set in the cdap-site.xml file.
  • CDAP-7797 - Added a step in the CDAP Upgrade Tool to upgrade the specification of the MetadataDataset.
  • HYDRATOR-197 - Included an example of an action and post-run plugin in the cdap-data-pipeline-plugins-archetype.
  • HYDRATOR-947 - Improved the MockSource unit test plugin so that it can be configured to set an output schema, allowing subsequent plugins in the pipeline to have non-null input schemas.
  • HYDRATOR-966 - Enabled macros for the Hive database, table name, and metastore URI properties for the Hive plugins.
  • HYDRATOR-976 - Added compression options to the HDFS sink plugin.
  • HYDRATOR-996 - Enhanced the Kafka streaming source to support configurable partitions and initial offsets, and to support optionally including the partition and offset in the output records.
  • HYDRATOR-1004 - The File Batch source in Hydrator now ignores empty directories.
  • HYDRATOR-1069 - The CSV parser can now accept a custom delimiter for parsing CSV files.
  • HYDRATOR-1072 - The Script filter plugin has been removed from Hydrator; the JavaScript filter can be used instead.
  • TRACKER-167 - Cask Tracker now includes "unknown" accesses when finding top datasets.

Bug Fixes

  • CDAP-2945 - A MapReduce job using either a FileSet or PartitionedFileSet as input no longer fails if there are no input partitions.
  • CDAP-4535 - The Authentication server announce address is now configurable.
  • CDAP-5012 - Fixed a problem with downloading of large (multiple gigabyte) CDAP Explore queries.
  • CDAP-5061 - Fixed an issue where the metadata of streams was not being updated when the stream's schema was altered.
  • CDAP-5372 - Fixed an issue where a warning was logged instead of an error when a MapReduce job failed in the CDAP SDK.
  • CDAP-5897 - Updated the default CDAP UI port to 11011 to avoid conflicting with Accumulo and Cloudera Manager's Activity Monitor.
  • CDAP-6398 - Authentication handler APIs have been updated to restrict which cdap-site.xml and cdap-security.xml properties are available to it.
  • CDAP-6404 - Fixed an issue with searching for an entity in Cask Tracker by metadata after a tag with the same prefix has been removed.
  • CDAP-7031 - Fixed an issue with misleading log messages from the RunRecord corrector.
  • CDAP-7116 - Fixed an issue so as to significantly reduce the chance of a schedule misfire in the case where the CPU cannot trigger a schedule within a certain time threshold.
  • CDAP-7138 - Fixed a problem with duplicate logs showing for a running program.
  • CDAP-7154 - On an incorrect ZooKeeper quorum configuration, the CDAP Upgrade Tool and other services such as Master, Router, and Kafka will time out with an error instead of hanging indefinitely.
  • CDAP-7175 - Fixed an issue in the CDAP Upgrade Tool to allow it to run on a CDAP instance with authorization enabled.
  • CDAP-7177 - Fixed an issue where macros were not being substituted for postaction plugins.
  • CDAP-7204 - Lineage information is now returned for deleted datasets.
  • CDAP-7248 - Fixed an issue with the FileBatchSource not working with Azure Blob Storage.
  • CDAP-7249 - Fixed an issue with CDAP Explore using Tez on Azure HDInsight.
  • CDAP-7250 - Fixed an issue where dataset usage was not being recorded after an application was deleted.
  • CDAP-7256 - Fixed an issue with the leaking of Hive classes to programs in the CDAP SDK.
  • CDAP-7259 - Added a warning when a PartitionFilter addresses a non-existent field.
  • CDAP-7285 - Fixed an issue that prevented launching of MapReduce jobs on a Hadoop-2.7 cluster.
  • CDAP-7292 - Fixed an issue in the KMeans example that caused it to calculate the wrong cluster centroids.
  • CDAP-7314 - Fixed an issue with the documentation example links to the CDAP ETL Guide.
  • CDAP-7317 - Fixed a misleading error message that occurred when the updating of a CDAP Explore table for a dataset failed.
  • CDAP-7318 - Fixed an issue that would cause MapReduce and Spark programs to fail if too many macros were being used.
  • CDAP-7321 - Fixed an issue with upgrading CDAP using the CDAP Upgrade Tool.
  • CDAP-7324 - Fixed an issue with the CDAP Upgrade Tool while upgrading HBase coprocessors.
  • CDAP-7361 - Fixed an issue with log file corruption if the log saver container crashed due to being killed by YARN.
  • CDAP-7374 - Fixed an issue with Hydrator Studio in the Windows version of Chrome that prevented users from opening and editing a node configuration.
  • CDAP-7394 - Fixed an issue that prevented impersonation in flows from working correctly, by not re-using HBaseAdmin across different UGI.
  • CDAP-7417 - Fixes an issue where the partitions of a PartitionedFileSet were not cleaned up properly after a transaction failure.
  • CDAP-7428 - Fixed an issue preventing having CustomAction and Spark as inner classes.
  • CDAP-7442 - CDAP Ambari Service's required version of Ambari Server was increased to 2.2 to support the empty-value-valid configuration attribute.
  • CDAP-7473 - Fixed the logback-container.xml to work on clusters with multiple log directories configured for YARN.
  • CDAP-7482 - Fixed an issue in CDAP logging that caused system logs from Kafka to not be saved after an upgrade and for previously-saved logs to become inaccessible.
  • CDAP-7483 - Fixes an issue where a MapReduce using DynamicPartitioner would leave behind output files if it failed.
  • CDAP-7500 - Fixed an issue where a MapReduce classloader gets closed prematurely.
  • CDAP-7514 - Fixed an issue preventing proper class loading isolation for explicit transactions executed by programs.
  • CDAP-7522 - Improved the documentation for read-less increments.
  • CDAP-7524 - Adds a missing @Override annotation for the WorkerContext.execute() method.
  • CDAP-7527 - Fixed an issue that prevented the using of the logback.xml from an application JAR.
  • CDAP-7548 - Fixed an issue in integration tests to allow JDBC connections against authorization-enabled and SSL-enabled CDAP instances.
  • CDAP-7566 - Improved the usability of ServiceManager in integration tests. The getServiceURL() method now waits for the service to be discoverable before returning the service's URL.
  • CDAP-7612 - Fixed an issue where Spark programs could not be started after a master failover or restart.
  • CDAP-7624 - Fixed an issue where readless increments from different MapReduce tasks cancelled each other out.
  • CDAP-7629 - Added additional tests for read-less increments in HBase.
  • CDAP-7648, CDAP-7663 - Added support for Amazon EMR 4.6.0.
  • CDAP-7652 - Startup checks now validate the HBase version and error out if the HBase version is not supported.
  • CDAP-7660 - The CDAP Ambari service was updated to use scripts for Auth Server/Router alerts in Ambari due to Ambari not supporting CDAP's /status endpoint with WEB check.
  • CDAP-7664 - CDAP Quick Links in the CDAP Ambari Service now correctly link to the CDAP UI.
  • CDAP-7666 - Fixed the YARN startup check to fail instead of warning if the cluster does not have enough capacity to run CDAP services.
  • CDAP-7680 - Fixed an issue in the CDAP Sentry Extension by which privileges were not being deleted when the CDAP entity was deleted.
  • CDAP-7707 - Files installed by the "cdap" package under /etc are now properly marked as config files for RPM packages.
  • CDAP-7724 - Fixed an issue that could cause Spark and MapReduce programs to stop improperly, resulting in a failed run record instead of a killed run record.
  • CDAP-7737 - Fixed the cdap-data-pipeline-plugins-archetype to export everything in the provided groupId and fixed the archetype to use the provided groupId as the Java package instead of using a hardcoded value.
  • CDAP-7742 - Fixed the ordering of search results by relevance in the search RESTful API.
  • CDAP-7757 - Now uses the OpenJDK for redistributable images, such as Docker and Virtual Machine images.
  • CDAP-7819 - The Node.js version check in the CDAP SDK was updated to properly handle patch-level comparisons.
  • HYDRATOR-89 - Batch Hydrator pipelines will now log an error instead of a warning if they fail in the CDAP SDK.
  • HYDRATOR-471 - The Database Batch Source now handles $CONDITIONS when getting a schema.
  • HYDRATOR-499 - GetSchema for an aggregator now fails if there are duplicate names.
  • HYDRATOR-791 - Fixed an issue where Hydrator pipelines using a DBSource were not working in an HDP cluster.
  • HYDRATOR-915 - Fixed an issue where pipelines with multiple sinks connected to the same action could fail to publish.
  • HYDRATOR-948 - Fixed an issue with Spark data pipelines not supporting argument values in excess of 64K characters.
  • HYDRATOR-950 - Password field is now masked in the Email post-run plugin.
  • HYDRATOR-968 - Fixed an issue so that the CDAP UI does not parse macros when starting a pipeline in Hydrator.
  • HYDRATOR-978 - Fixed an issue where macros were not being evaluated in streaming source Hydrator plugins.
  • HYDRATOR-987 - Fixed the UI widget for the S3 source to make its output schema non-editable.
  • HYDRATOR-994 - Stream source duration in the stream source hydrator plugin is now macro-enabled.
  • HYDRATOR-1010 - The Python evaluator can now handle float and double data types.
  • HYDRATOR-1025 - Fixed an issue to format XML correctly in the XML reader plugin.
  • HYDRATOR-1062 - Fixed a serialization issue with StructuredRecords that use primitive arrays.
  • HYDRATOR-1126 - Fixed an issue where the outputSchema plugin function expected an input schema to be present.
  • HYDRATOR-1131 - Added the ability to write malformed CSV rows to an error dataset while parsing with the CSV parser.
  • HYDRATOR-1132 - A Hydrator application can now set reducer task resources as a per-worker resource provided for MapReduce pipelines.
  • HYDRATOR-1168 - Spark pipelines now use 1024mb of memory by default for the Spark client that submits the job.
  • HYDRATOR-1189 - Any Hydrator pipelines that use S3 (either as an S3 source or an S3 sink) based on core-plugins version 1.4 (used in CDAP prior to 4.0.0) will not execute on a 4.0.x cluster. A workaround is to recreate (clone) the pipeline using a newer version of core-plugins (version 1.5 or higher).
  • TRACKER-217 - Fixed an issue preventing the adding of additional tags after an existing tag had been deleted.
  • TRACKER-225 - Fixed an issue where Cask Tracker was creating too many connections to ZooKeeper.
  • TRACKER-229 - Fixed an issue that was sending program run ids instead of program names.

Known Issues

  • CDAP-6099 - Due to a limitation in the CDAP MapReduce implementation, writing to a dataset does not work in a MapReduce Mapper's destroy() method.
  • CDAP-7444 - If a MapReduce program fails during startup, the program's destroy() method is never called, preventing any cleanup or action there being taken.

API Changes

  • CDAP-1696 - Updated the default CDAP Router port to 11015 to avoid conflicting with HiveServer2's default port. Note that this change may cause incompatibilities with previous releases if hardcoded in scripts or other programs.
  • CDAP-5897 - Updated the default CDAP UI port to 11011 to avoid conflicting with Accumulo and Cloudera Manager's Activity Monitor. Note that this change may cause incompatibilities with previous releases if hardcoded in scripts or other programs.
  • CDAP-6837 - Fixed an issue in WorkerContext that did not properly implement the contract of the Transactional interface. Note that this fix may cause incompatibilities with previous releases in certain cases. See below for details on how to handle this change in existing code.

    The Transactional API defines:

    void execute(TxRunnable runnable) throws TransactionFailureException;
    

    and WorkerContext implements Transactional. However, it declares this method to not throw checked exceptions:

    void execute(TxRunnable runnable);
    

    That means that any TransactionFailureException thrown from a WorkerContext.execute() is wrapped into a RuntimeException, and callers must write code similar to this to handle the exception:

    try {
      getContext().execute(...);
    } catch (Exception e) {
      if (e.getCause() instanceof TransactionFailureException) {
        // Handle it
      } else {
        // What else to expect? It's not clear...
        throw Throwables.propagate(e);
      }
    }
    

    This is ugly and inconsistent with other implementations of Transactional. We have addressed this by altering the WorkerContext to directly raise the TransactionFailureException. However, code must change to accommodate this.

    To address this in existing code, such that it will work both in 4.0.0 and earlier versions of CDAP, use code similar to this:

    @Override
    public void run() {
      try {
        getContext().execute(new TxRunnable() {
          @Override
          public void run(DatasetContext context) throws Exception {
            if (getContext().getRuntimeArguments().containsKey("fail")) {
              throw new RuntimeException("fail");
            }
          }
        });
      } catch (Exception e) {
        if (e instanceof TransactionFailureException) {
          LOG.error("transaction failure");
        } else if (e.getCause() instanceof TransactionFailureException) {
          LOG.error("exception with cause transaction failure");
        } else {
          LOG.error("other failure");
        }
      }
    }
    

    This code works with both versions because it handles the "new style", in which the WorkerContext directly throws a TransactionFailureException, as well as the previous style, in which the TransactionFailureException is wrapped in a RuntimeException.

    Code that is only used in CDAP 4.0.0 and higher can use a simpler version of this:

    @Override
    public void run() {
      try {
        getContext().execute(new TxRunnable() {
          @Override
          public void run(DatasetContext context) throws Exception {
            if (getContext().getRuntimeArguments().containsKey("fail")) {
              throw new RuntimeException("fail");
            }
          }
        });
      } catch (TransactionFailureException e) {
        ...
      }
    }
    
  • CDAP-7544 - The Metadata HTTP RESTful API has been modified to support sorting and pagination. To do so, the API now uses additional parameters (sort, offset, limit, numCursors, and cursor), and the format of the results returned when searching has changed. Whereas prior to CDAP 4.0.0 the API returned a list of results, the API now returns the results as a field in a JSON object.
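
    As a rough illustration of how a client might assemble a paginated search request with these parameters, consider the sketch below. The parameter names (offset, limit, numCursors) come from the note above; the class name, the search path, and the query value are assumptions made for the example and should be checked against the Metadata HTTP RESTful API documentation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of CDAP.
final class MetadataSearchQuery {
  // Appends the given parameters to the base path as a URL-encoded query string,
  // preserving the iteration order of the map.
  static String build(String basePath, Map<String, String> params) {
    StringBuilder sb = new StringBuilder(basePath);
    char separator = '?';
    for (Map.Entry<String, String> entry : params.entrySet()) {
      sb.append(separator)
        .append(encode(entry.getKey()))
        .append('=')
        .append(encode(entry.getValue()));
      separator = '&';
    }
    return sb.toString();
  }

  private static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is always available on a conforming JVM.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("query", "purchase");   // hypothetical search term
    params.put("offset", "20");        // skip the first 20 results
    params.put("limit", "10");         // return at most 10 results
    params.put("numCursors", "2");     // request cursors for following pages
    // The path below is an assumed example, not taken from these notes.
    System.out.println(build("/v3/namespaces/default/metadata/search", params));
  }
}
```

    Because the response is now a JSON object rather than a bare list, client code written against earlier releases needs to read the results from a field of that object.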

  • CDAP-7796 - Two properties are changing in version 4.0.0 of the CSD:

    • log.saver.run.memory.megs is replaced with log.saver.container.memory.mb
    • log.saver.run.num.cores is replaced with log.saver.container.num.cores

    Anyone who has modified these properties in previous versions will have to update them after upgrading.
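
    The rename could be applied to a set of saved settings along the lines of the sketch below. It is only an illustration under stated assumptions: the property names come from the list above, while the class name, the map-based representation, the sample values, and the rule that an explicitly set new name wins over an old one are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of CDAP or the CSD.
final class LogSaverPropertyMigration {
  // Old property name -> new property name, as listed in the release notes.
  static final Map<String, String> RENAMES = new LinkedHashMap<>();
  static {
    RENAMES.put("log.saver.run.memory.megs", "log.saver.container.memory.mb");
    RENAMES.put("log.saver.run.num.cores", "log.saver.container.num.cores");
  }

  // Returns a copy of the settings in which old names are replaced by new ones.
  static Map<String, String> migrate(Map<String, String> settings) {
    Map<String, String> result = new LinkedHashMap<>();
    // First pass: keep every setting that is not one of the old names.
    for (Map.Entry<String, String> entry : settings.entrySet()) {
      if (!RENAMES.containsKey(entry.getKey())) {
        result.put(entry.getKey(), entry.getValue());
      }
    }
    // Second pass: carry an old value over only if the new name is not already set.
    for (Map.Entry<String, String> entry : settings.entrySet()) {
      String newName = RENAMES.get(entry.getKey());
      if (newName != null) {
        result.putIfAbsent(newName, entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> settings = new LinkedHashMap<>();
    settings.put("log.saver.run.memory.megs", "2048"); // sample value
    settings.put("log.saver.run.num.cores", "2");      // sample value
    System.out.println(migrate(settings));
  }
}
```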

Deprecated and Removed Features

  • CDAP-5246 - Removed the deprecated Kafka feed for metadata updates. Users should instead subscribe to the CDAP Audit feed, which contains metadata update notifications in messages with audit type METADATA_CHANGE.
  • CDAP-6862 - Deprecated "log.saver.run.memory.megs" and "log.saver.run.num.cores", in favor of "log.saver.container.memory.mb" and "log.saver.container.num.cores", respectively.
  • CDAP-7475 - Removes deprecated methods setInputDataset(), setOutputDataset(), and useStreamInput() from the MapReduce API, and related methods from the MapReduceContext.
  • CDAP-7718 - Removed the deprecated StreamBatchReadable class.
  • CDAP-7127 - The deprecated CDAP Explore service instance property has been removed.
  • CDAP-7205 - Removes the deprecated useDatasets() method from API and documentation.
  • CDAP-7563 - Removed the usage of deprecated methods from examples.
  • HYDRATOR-1094 - Removed the deprecated cdap-etl-batch-source-archetype, cdap-etl-batch-sink-archetype, and cdap-etl-transform-archetype in favor of the cdap-data-pipeline-plugins-archetype.

Release 3.6.0

Improvements

  • CDAP-5771 - Allow concurrent runs of different versions of a service. A RouteConfig can be uploaded to configure the percentage of requests that need to be sent to the different versions.
  • CDAP-7281 - Improved the PartitionedFileSet to validate the schema of a partition key. Note that this will break code that uses incorrect partition keys, which was previously silently ignored.
  • CDAP-7343 - All non-versioned endpoints are now directed to applications with a default version. Added test cases with a mixed usage of the new versioned endpoints and the corresponding non-versioned endpoints.
  • CDAP-7366 - Added an upgrade step that adds a default version ID to jobs and triggers in the Schedule Store.
  • CDAP-7385 - The Log HTTP Handler and Router have been fixed to allow the streaming of larger log files.
  • CDAP-7264 - Added an HTTP RESTful API to create applications with a version.
  • CDAP-7265 - Added an HTTP RESTful API to start or stop programs of a specific application version.
  • CDAP-7266 - Added an upgrade step that adds a default application version to existing applications.
  • CDAP-7268 - Added an HTTP RESTful API to store, fetch, and delete RouteConfigs for user service endpoint routing control.
  • CDAP-7272 - User services now include their application version in the payload when they announce themselves in Apache Twill.
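
The idea behind a RouteConfig (CDAP-5771 and CDAP-7268 above), splitting requests between service versions by percentage, can be illustrated with the sketch below. It is not CDAP's router implementation: the class name, the version labels, and the requirement that the percentages sum to 100 are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of percentage-based routing, not CDAP code.
final class RouteConfigSketch {
  // Picks a version for a roll in [0, 100) by walking the cumulative percentages
  // in the iteration order of the map.
  static String pick(Map<String, Integer> routes, int roll) {
    int total = 0;
    for (int percentage : routes.values()) {
      if (percentage < 0) {
        throw new IllegalArgumentException("percentages must not be negative");
      }
      total += percentage;
    }
    if (total != 100) {
      throw new IllegalArgumentException("percentages must sum to 100");
    }
    if (roll < 0 || roll >= 100) {
      throw new IllegalArgumentException("roll must be in [0, 100)");
    }
    int cumulative = 0;
    for (Map.Entry<String, Integer> route : routes.entrySet()) {
      cumulative += route.getValue();
      if (roll < cumulative) {
        return route.getKey();
      }
    }
    throw new IllegalStateException("unreachable when percentages sum to 100");
  }

  public static void main(String[] args) {
    Map<String, Integer> routes = new LinkedHashMap<>();
    routes.put("v1", 80); // hypothetical version labels and weights
    routes.put("v2", 20);
    int roll = ThreadLocalRandom.current().nextInt(100);
    System.out.println("request routed to " + pick(routes, roll));
  }
}
```

With the weights above, rolls 0 through 79 select v1 and rolls 80 through 99 select v2, so roughly 80 percent of requests reach the first version.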

Bug Fixes

  • CDAP-3822 - The unit test framework can now exclude Scala, so users can depend on their own version of the library.
  • CDAP-7250 - Fixed an issue where dataset usage was not being recorded after an application was deleted.
  • CDAP-7314 - Fixed a problem with the documentation example links to the CDAP ETL Guide.
  • CDAP-7321 - Fixed a problem with upgrading CDAP using the CDAP Upgrade Tool.
  • CDAP-7324 - Fixed a problem with the upgrade tool while upgrading HBase coprocessors.
  • CDAP-7334 - Fixed a problem with the listing of applications not returning the application version correctly.
  • CDAP-7353 - Fixed a problem with using "Download All" logs in the browser log viewer by having it fetch and stream the response to the client.
  • CDAP-7359 - Fixed a problem with NodeJS buffering a response before sending it to a client.
  • CDAP-7361 - Fixed a problem with log file corruption if the log saver container crashes due to being killed by YARN.
  • CDAP-7364 - Fixed a problem with the CDAP UI not handling "5xx" error codes correctly.
  • CDAP-7374 - Fixed Hydrator Studio in the Windows version of Chrome to allow users to open and edit a node configuration.
  • CDAP-7386 - Fixed an error in the "CDAP Introduction" tutorial's "Transforming Your Data" example of an application configuration.
  • CDAP-7391 - Fixed an issue that caused unit test failures when using org.hamcrest classes.
  • CDAP-7392 - Fixed an issue where the Java process corresponding to the MapReduce application master kept running even if the application was moved to the FINISHED state.
  • HYDRATOR-791 - Fixed a problem with Hydrator pipelines using a DBSource not working in an HDP cluster.
  • HYDRATOR-948 - Fixed a problem with Spark data pipelines not supporting argument values in excess of 64K characters.

Release 3.5.2

Known Issues

  • CDAP-7179 - In CDAP 3.5.0, new kafka.server.* properties replace older properties such as kafka.log.dir, as described in the Administration Manual: Appendices: cdap-site.xml.

    If you are upgrading from CDAP 3.4.x to 3.5.x and you have set a value for kafka.log.dir by using Cloudera Manager's safety-valve mechanism, you need to change to the new property kafka.server.log.dirs, as the deprecated kafka.log.dir is being ignored in favor of the new property. If you don't, your custom value will be replaced with the default value.

  • CDAP-7608 - When running in Standalone CDAP, the Cask Hydrator plugin NaiveBayesTrainer has a permgen memory leak that leads to an out-of-memory error if the plugin is used repeatedly, after as few as six runs. The only workaround is to reset the memory by restarting Standalone CDAP.

Improvements

  • CDAP-3262 - Fixed an issue with the CDAP scripts under Windows not handling a JAVA_HOME path with spaces in it correctly. CDAP SDK home directories with spaces in the path are not supported (due to issues with the product) and the scripts now exit if such a path is detected.
  • CDAP-4322 - For MapReduce programs using a PartitionedFileSet as input, expose the partition key corresponding to the input split to the mapper.
  • CDAP-6183 - Added the property program.container.dist.jars to set extra jars to be localized to every program container and to be added to classpaths of CDAP programs.
  • CDAP-6572 - The namespace that integration test cases run against by default has been made configurable.
  • CDAP-6577 - Improve UpgradeTool to upgrade tables in namespaces with impersonation configured.
  • CDAP-6885 - Added support for concurrent runs of a Spark program.
  • CDAP-6587 - Added support for impersonation with CDAP Explore (Hive) operations, such as enabling exploring of a dataset or running queries against it.
  • CDAP-7291 - Added support for CDH 5.9.
  • CDAP-7385 - The Log HTTP Handler and Router have been fixed to allow the streaming of larger log files.
  • CDAP-7387 - Added support to LogSaver for impersonation.
  • CDAP-7404 - Added authorization for schedules in CDAP.
  • CDAP-7529 - Improved error handling upon failures in namespace creation.
  • CDAP-7557 - DynamicPartitioner can now limit the number of open RecordWriters to one, if the output partition keys are grouped.
  • CDAP-7682 - Added a property kafka.zookeeper.quorum to be used across all internal clients using Kafka.
  • CDAP-7761 - Adds cluster.name as a property that identifies a cluster; this property can be set in the cdap-site.xml.
  • HYDRATOR-979 - Added the Windows Share Copy plugin to the Hydrator plugins.
  • HYDRATOR-997 - The SSH hostname and the command to be executed are now macro-enabled for the SSH action plugin.
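
The limit described in CDAP-7557 above rests on a general property of grouped output: when all records for one partition key arrive consecutively, a writer can be closed as soon as the key changes, so at most one is open at any time. The Python sketch below models only that idea. The names write_grouped and FakeWriter are hypothetical and are not part of the CDAP DynamicPartitioner API.

```python
class FakeWriter:
    """Hypothetical stand-in for a per-partition record writer."""
    open_count = 0   # writers currently open
    max_open = 0     # high-water mark of simultaneously open writers

    def __init__(self, key):
        self.key = key
        self.records = []
        FakeWriter.open_count += 1
        FakeWriter.max_open = max(FakeWriter.max_open, FakeWriter.open_count)

    def write(self, record):
        self.records.append(record)

    def close(self):
        FakeWriter.open_count -= 1


def write_grouped(records, key_of):
    """Write records whose partition keys are grouped, keeping one writer open.

    Assumes all records with the same key are adjacent in `records`.
    Returns (key, records) pairs in the order the writers were closed.
    """
    closed = []
    writer = None
    for record in records:
        key = key_of(record)
        if writer is None or writer.key != key:
            # The key changed: the previous partition is complete.
            if writer is not None:
                writer.close()
                closed.append((writer.key, writer.records))
            writer = FakeWriter(key)
        writer.write(record)
    if writer is not None:
        writer.close()
        closed.append((writer.key, writer.records))
    return closed


# Made-up sample data: (partition key, payload), already grouped by key.
events = [("2016-10-01", "a"), ("2016-10-01", "b"), ("2016-10-02", "c")]
result = write_grouped(events, key_of=lambda r: r[0])
```

If the keys were not grouped, the same key would reappear after its writer had been closed, which is why the single-writer limit applies only to grouped output.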

Bug Fixes

  • CDAP-6981 - Fixed an issue that prevented macros from being used with a secure KMS store.
  • CDAP-7116 - Significantly reduced the chance of a schedule misfire in the case where the CPU cannot trigger a schedule within a certain time threshold.
  • CDAP-7177 - Fixed an issue where macros were not being substituted for postaction plugins.
  • CDAP-7250 - Fixed an issue where dataset usage was not being recorded after an application was deleted.
  • CDAP-7318 - Fixed an issue that would cause MapReduce and Spark programs to fail if too many macros were being used.
  • CDAP-7391 - Fixed TestFramework classloading to support classes that depend on org.hamcrest.
  • CDAP-7392 - Fixed an issue where the Java process corresponding to the MapReduce application master kept running even if the application was moved to the FINISHED state.
  • CDAP-7394 - Fixed an issue where impersonation did not work in flows; HBaseAdmin is no longer re-used across different UGIs.
  • CDAP-7396 - Fixed an issue which prevented scheduled jobs from running on a namespace with impersonation.
  • CDAP-7398 - Fixed an issue which prevented an app in a namespace from being deleted if a program for the same app is running in a different namespace.
  • CDAP-7403 - Fixed an issue that prevented the CDAP UI from starting if the logback.xml was configured to log at the INFO or lower level.
  • CDAP-7420 - Avoid the caching of YarnClient in order to fix a problem that occurred in namespaces with impersonation configured.
  • CDAP-7433 - Fixed an issue that prevented HBaseQueueDebugger from running in an impersonated namespace.
  • CDAP-7435 - Fixed an error which prevented the downloading of large logs using the CDAP UI.
  • CDAP-7438, CDAP-7439 - Removed the requirement of running "kinit" prior to running either the Upgrade or Transaction Debugger tools of CDAP on a secure Hadoop cluster.
  • CDAP-7458 - Fixed an issue that prevented the CDAP Upgrade Tool from being run for a namespace with authorization turned on.
  • CDAP-7473 - Fix logback-container.xml to work on clusters with multiple log directories configured for YARN.
  • CDAP-7482 - Fixed a problem in CDAP logging that caused system logs from Kafka to not be saved after an upgrade and for previously-saved logs to become inaccessible.
  • CDAP-7500 - Fixed cases where the MapReduce classloader was being closed prematurely.
  • CDAP-7527 - Fixed a problem that prevented the use of a logback.xml from an application jar.
  • CDAP-7548 - Fixed a problem in integration tests to allow JDBC connections against authorization-enabled and SSL-enabled CDAP instances.
  • CDAP-7566 - Improved the usability of ServiceManager in integration tests. The getServiceURL method now waits for the service to be discoverable before returning the service's URL.
  • CDAP-7612 - Fixed cases where Spark programs could not be started after a master failover or restart.
  • CDAP-7660 - The CDAP Ambari service was updated to use scripts for Auth Server/Router alerts in Ambari due to Ambari not supporting CDAP's /status endpoint with WEB check.
  • HYDRATOR-1125 - Fixed a problem that prevented the adding of a schema with hyphens in the Hydrator UI.

Release 3.5.1

Known Issues

  • CDAP-7175 - If you are upgrading an authorization-enabled CDAP instance, you will need to give the cdap user ADMIN privileges on all existing CDAP namespaces. See the Administration Manual: Upgrading for your distribution for details.

  • CDAP-7179 - In CDAP 3.5.0, new kafka.server.* properties replace older properties such as kafka.log.dir, as described in the Administration Manual: Appendices: cdap-site.xml.

    If you are upgrading from CDAP 3.4.x to 3.5.x and you have set a value for kafka.log.dir by using Cloudera Manager's safety-valve mechanism, you need to change to the new property kafka.server.log.dirs, as the deprecated kafka.log.dir is being ignored in favor of the new property. If you don't, your custom value will be replaced with the default value.

Improvements

  • CDAP-7192 - Added the ability to specify an announce address and port for the appfabric and dataset services.

    Deprecated the properties app.bind.address and dataset.service.bind.address, replacing them with master.services.bind.address as the bind address for master services.

    Added the properties master.services.announce.address, app.announce.port, and dataset.service.announce.port for use as announce addresses that are different from the bind address.

  • CDAP-7240 - Upgraded the version of netty-http used in CDAP to version 0.15, resolving a problem with a NullPointerException being logged on the closing of a network connection.

  • HYDRATOR-578 - Snapshot sinks now allow users to specify a property cleanPartitionsOlderThan that cleans up any snapshots older than "x" days.
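
As an illustration of the retention rule in HYDRATOR-578 above, the Python sketch below selects the snapshots that fall outside a retention window of a given number of days. The function name snapshots_to_clean and the use of plain datetime values are assumptions made for the example. The value format that cleanPartitionsOlderThan actually accepts is defined in the plugin's reference documentation and is not reproduced here.

```python
from datetime import datetime, timedelta


def snapshots_to_clean(snapshot_times, now, older_than_days):
    """Return the snapshot times that are older than `older_than_days` days."""
    cutoff = now - timedelta(days=older_than_days)
    return [t for t in snapshot_times if t < cutoff]


# Made-up sample timestamps.
now = datetime(2016, 9, 30, 12, 0)
snapshots = [
    datetime(2016, 9, 1, 12, 0),   # 29 days old
    datetime(2016, 9, 23, 11, 0),  # 7 days and 1 hour old
    datetime(2016, 9, 29, 12, 0),  # 1 day old
]
expired = snapshots_to_clean(snapshots, now, older_than_days=7)
```

With a seven-day window, the first two snapshots are selected for cleanup and the most recent one is kept.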

Bug Fixes

  • CDAP-6215 - PartitionConsumer appropriately drops partitions that have been deleted from a corresponding PartitionedFileSet.
  • CDAP-6404 - Fixed an issue with searching for an entity in Cask Tracker by metadata after a tag with the same prefix has been removed.
  • CDAP-7138 - Fixed a problem with duplicate logs showing for a running program.
  • CDAP-7175 - Fixed a bug in the upgrade tool to allow it to run on a CDAP instance with authorization enabled.
  • CDAP-7178 - Fixed an issue with uploading an application JAR or file to a stream through the CDAP UI.
  • CDAP-7187 - Fixed a problem with the property dataset.service.bind.address having no effect.
  • CDAP-7199 - Corrected errors in the documentation to correctly show how to set the schema on an existing table.
  • CDAP-7204 - Lineage information is now returned for deleted datasets.
  • CDAP-7222 - Fixed a problem with being unable to delete a namespace if a configured keytab file doesn't exist.
  • CDAP-7235 - Fixed a problem with a NullPointerException when the CDAP UI fetches a log.
  • CDAP-7237 - Prevented accidental grant of additional actions to a user as part of a grant operation when using Apache Sentry as the authorization provider.
  • CDAP-7248 - Fixed a problem with the FileBatchSource not working with Azure Blob Storage.
  • CDAP-7249 - Fixed a problem with CDAP Explore using Tez on Azure HDInsight.
  • HYDRATOR-912 - Fixed an issue where the Joiner plugin was failing in Hydrator pipelines executing in a Spark environment.
  • HYDRATOR-922 - Fixed a bug that caused the Database Source, Joiner, GroupByAggregate, and Deduplicate plugins to fail on certain versions of Spark.
  • HYDRATOR-932 - Fixed an error in the documentation of the HDFS Source and Sink with respect to the alias under high-availability.
  • TRACKER-217 - Fixed an issue preventing the adding of additional tags after an existing tag had been deleted.

Release 3.5.0

Known Issues

  • CDAP-7179 - In CDAP 3.5.0, new kafka.server.* properties replace older properties such as kafka.log.dir, as described in the Administration Manual: Appendices: cdap-site.xml.

    If you are upgrading from CDAP 3.4.x to 3.5.x, and you have set a value for kafka.log.dir by using Cloudera Manager's safety-valve mechanism, you need to change to the new property kafka.server.log.dirs, as the deprecated kafka.log.dir is being ignored in favor of the new property. If you don't, your custom value will be replaced with the default value.
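
One way to check for the deprecated property before an upgrade is a small script such as the Python sketch below. It assumes the settings are available as a complete document with a configuration root element and Hadoop-style property/name/value entries; a safety-valve snippet that contains only bare property elements would need to be wrapped first. The function name migrate_kafka_log_dir and the sample path are hypothetical.

```python
import xml.etree.ElementTree as ET

OLD_NAME = "kafka.log.dir"
NEW_NAME = "kafka.server.log.dirs"


def migrate_kafka_log_dir(xml_text):
    """Rename kafka.log.dir to kafka.server.log.dirs, keeping its value.

    Returns (xml_text, changed). If the new property is already present,
    the input is returned untouched so an explicit value is never overwritten.
    """
    root = ET.fromstring(xml_text)
    names = [p.find("name") for p in root.iter("property")]
    names = [n for n in names if n is not None]
    if any((n.text or "").strip() == NEW_NAME for n in names):
        return xml_text, False
    changed = False
    for name in names:
        if (name.text or "").strip() == OLD_NAME:
            name.text = NEW_NAME
            changed = True
    if not changed:
        return xml_text, False
    return ET.tostring(root, encoding="unicode"), True


# Made-up sample configuration for illustration only.
sample = """<configuration>
  <property>
    <name>kafka.log.dir</name>
    <value>/data/cdap/kafka-logs</value>
  </property>
</configuration>"""

migrated, changed = migrate_kafka_log_dir(sample)
```

Running the function a second time on its own output reports no change, so the check can be repeated safely.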

API Changes

  • CDAP-4860 - Introduced an "available" (/available) endpoint for Services to check their availability.
  • CDAP-5279 - The beforeSubmit and onFinish methods of the MapReduce and Spark APIs have been deprecated. Changes to the API include:
    1. AbstractMapReduce and AbstractSpark now implement ProgramLifeCycle
    2. AbstractMapReduce and AbstractSpark now have a final initialize(context) method
    3. AbstractMapReduce and AbstractSpark now have a protected initialize() method, the default implementation of which calls beforeSubmit()
    4. User programs will override the no-arg initialize method
    5. Driver will call both versions of the initialize method
  • CDAP-6150 - The isSuccessful() method of the WorkflowContext is replaced by the getState() method, which returns the state of the workflow.
  • CDAP-6930 - Incompatible Change: Updated the "cdap-clients" to throw UnauthorizedException when an operation returns 403 - Forbidden from CDAP. Users of "cdap-clients" may need to update their code to handle these exceptions.
  • TRACKER-21 - Renamed the AuditLog service to the TrackerService.

New Features

  • CDAP-2963 - All HBase Tables created through CDAP will now have a key cdap.version in the HTableDescriptor.
  • CDAP-3368 - Add location for cdap cli to PATH in distributed CDAP packages.
  • CDAP-3890 - Improved performance of the Dataset Service.
  • CDAP-4106 - Created pre-defined alert definitions in the CDAP Ambari Service.
  • CDAP-4107 - Support for HA CDAP installations in the CDAP Ambari Service.
  • CDAP-4109 - Support for Kerberos-enabled clusters via the CDAP Ambari service.
  • CDAP-4110 - CDAP Auth Server is now supported in the CDAP Ambari Service on Ambari clusters which have Kerberos enabled.
  • CDAP-4288 - Added an authorization extension backed by Apache Sentry to enforce authorization on CDAP entities.
  • CDAP-4913 - Added a way to cache authorization policies so every authorization enforcement request does not have to make a remote call. Caching is configurable: it can be enabled by setting security.authorization.cache.enabled to true. The TTL for cache entries (security.authorization.cache.ttl.secs) and the refresh interval (security.authorization.cache.refresh.interval.secs) are also configurable.
  • CDAP-5740 - Provided access to Partitioner and Comparator classes to the MapReduceTaskContext by implementing ProgramLifeCycle.
  • CDAP-5770 - Provided setting of YARN container resources requirements for all program types via preferences and runtime arguments.
  • CDAP-6062 - Added protection for a partition of a file set from being deleted while a query is reading the partition.
  • CDAP-6153 - CDAP namespaces can now be mapped to custom namespaces in storage providers. While creating a namespace, users can specify the Filesystem directory, HBase namespace and Hive database for that namespace. These settings cannot be changed once the namespace has been successfully created.
  • CDAP-6168 - Enable authorization, lineage, and audit log at the data operation level for all Datasets.
  • CDAP-6174 - Added a new log viewer across CDAP, Cask Hydrator, and Cask Tracker, wherever appropriate. Provides easier navigation and debugging functionality for logs of different entities.
  • CDAP-6235 - Added an indicator in the UI of the CDAP mode (distributed or standalone, secure or insecure).
  • CDAP-6393 - Added authorization to the Secure Key HTTP RESTful APIs. To create a secure key, a user needs WRITE privilege on the namespace in which the key is being created. Users can only view secure keys that they have access to. To delete a key, ADMIN privilege is required.
  • CDAP-6456 - Exposed the secure store APIs to Programs.
  • CDAP-6516 - Added authorization for listing and viewing CDAP entities.
  • CDAP-7002 - Fixed an issue where the UI would ignore the configured port when connecting to the CDAP Router.
  • HYDRATOR-156 - Added an alpha feature: Hydrator Data Pipeline preview (CDAP SDK only).
  • HYDRATOR-162 - Added support for executing custom actions in the Cask Hydrator pipelines.
  • HYDRATOR-168 - Re-organized the bottom panel in Cask Hydrator to be in-context. Pipeline-level information is moved to a top panel and plugin-level information is moved to a modal dialog.
  • HYDRATOR-379 - Re-organized the left panel in Cask Hydrator studio view to have a maximum of four categories of plugin types: Source, Transform, Sink, and Actions. All other types are consolidated into one of these types.
  • HYDRATOR-501 - Implemented the Value Mapper plugin for Cask Hydrator plugins. This is a type of transform that maps string values of a field in the input record to another value.
  • HYDRATOR-502 - Added the XML Parser Transform plugin to Cask Hydrator plugins. This plugin uses XPath to extract fields from a complex XML Event. It is generally used in conjunction with the XML Reader Source Plugin.
  • HYDRATOR-503 - Added the XML Reader Source Plugin to Cask Hydrator plugins. This plugin allows users to read XML files stored on HDFS.
  • HYDRATOR-506 - Implemented the Cask Hydrator plugin for Row Denormalizer aggregator. This plugin converts raw data into de-normalized data based on a key column. De-normalized data can be easier and faster to query.
  • HYDRATOR-507 - Added the Cobol Copybook source plugin to Cask Hydrator plugins. This source plugin allows users to read and process mainframe files defined using COBOL Copybook.
  • HYDRATOR-514 - Added the Excel Reader Source plugin to Cask Hydrator Plugins. This plugin provides the ability to read data from one or more Excel file(s).
  • HYDRATOR-629 - Adds macros to pipeline plugin configurations. This allows users to set macros for plugin properties which can be provided as runtime arguments while scheduling and running the pipeline.
  • HYDRATOR-634 - Adds a new Run Configuration player for published pipeline views. This allows users to set runtime arguments while scheduling or running a pipeline.
  • HYDRATOR-685 - Added a Twitter source for Spark Streaming pipelines.
  • TRACKER-96 - Added the ability to edit user properties for a dataset directly in Cask Tracker.
  • TRACKER-98 - Added the Cask Tracker Meter to measure how active a dataset is in a cluster on a scale of zero to 100.
  • TRACKER-100 - Added the ability to add, remove, and manage a common dictionary of Preferred Tags in Cask Tracker and apply them to datasets.
  • TRACKER-104 - Added the ability to preview data directly in the Cask Tracker UI.
  • TRACKER-105 - Added the ability to view usage metrics about datasets in Cask Tracker. Users can view how many applications and programs are accessing each dataset using service endpoints and the Tracker UI.
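
To show why the policy cache in CDAP-4913 above reduces remote calls, the Python sketch below implements a minimal time-to-live cache. It is not CDAP's implementation: the class TtlCache, the lookup function, and the 300-second value are hypothetical, and the separate refresh interval is not modeled.

```python
import time


class TtlCache:
    """Minimal time-to-live cache; entries expire `ttl_secs` after loading."""

    def __init__(self, ttl_secs, clock=time.monotonic):
        self.ttl_secs = ttl_secs
        self.clock = clock          # injectable so the example can control time
        self._entries = {}          # key -> (value, expiry time)
        self.remote_calls = 0       # how often the loader was invoked

    def get(self, key, load):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]         # still fresh: no remote call
        self.remote_calls += 1
        value = load(key)
        self._entries[key] = (value, now + self.ttl_secs)
        return value


def lookup(principal):
    """Hypothetical stand-in for a remote privilege lookup."""
    return {"alice": {"READ"}}.get(principal, set())


fake_now = [0.0]
cache = TtlCache(ttl_secs=300, clock=lambda: fake_now[0])

first = cache.get("alice", lookup)    # miss: calls lookup
second = cache.get("alice", lookup)   # hit: served from the cache
fake_now[0] = 301.0
third = cache.get("alice", lookup)    # expired: calls lookup again
```

Three enforcement requests result in two lookups here; a longer TTL trades fewer remote calls for slower pickup of policy changes.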

Changes

  • CDAP-5263 - The "CDAP Applications" section in the documentation has been split into two separate sections under "CDAP Extensions": "Cask Hydrator" and "Cask Tracker".
  • CDAP-5833 - Eliminated some misleading warnings in the Purchase example.
  • CDAP-6143 - Added metadata tag for the local datasets.
  • CDAP-6596 - CDAP Security Extensions are packaged with CDAP Master packages and CDAP Parcel.
  • HYDRATOR-527 - The Script Transform (previously deprecated) has been removed, and is replaced with the JavaScript Transform.
  • HYDRATOR-528 - Secure Store APIs in Hydrator Actions are now exposed.
  • HYDRATOR-649 - A widget textbox can have a configurable placeholder.
  • HYDRATOR-653 - Additional Custom Action Hydrator Plugins have been added.
  • HYDRATOR-682 - The directory containing the Spark Streaming Hydrator plugins has been renamed from batch.spark to spark.
  • TRACKER-155 - An upgrade process has been added to the UI for Cask Tracker.
  • CDAP-886, CDAP-882 - Access to CDAP Streams via the RESTful API, CDAP-CLI, or programmatic API can be authorized through the Security Authorization feature.
  • CDAP-888, CDAP-882 - Enforced authorization in Dataset RESTful APIs. Dataset modules, types, and instances are now governed by authorization policies.
  • CDAP-5691, CDAP-5685 - Improved the performance of the Dataset Service, with server-side caching of getDataset() in DatasetService.
  • CDAP-6154, CDAP-6153 - CDAP Namespaces can now use an existing custom HDFS directory. The custom HDFS directory, whose creation/deletion is managed by the user, can be specified during the creation of a CDAP Namespace as part of its configuration.
  • CDAP-6155, CDAP-6153 - CDAP Namespaces can now use an existing custom HBase namespace. The custom HBase namespace, whose creation/deletion is managed by the user, can be specified during the creation of a CDAP Namespace as part of its configuration.
  • CDAP-6156, CDAP-6153 - CDAP Namespaces can now use an existing custom Hive database. The custom Hive database, whose creation/deletion is managed by the user, can be specified during the creation of a CDAP Namespace as part of its configuration.
  • CDAP-6158, CDAP-6157 - Added support for accessing (read/write) dataset across namespaces in CDAP Spark and MapReduce programs.
  • CDAP-6159, CDAP-6157 - Added support for accessing streams (read only) across namespaces in CDAP Spark and MapReduce programs.
  • HYDRATOR-174, HYDRATOR-157 - Refactored the Spark engine in data pipelines to run all non-action pipeline stages in a single Spark program.
  • HYDRATOR-175, HYDRATOR-157 - Added a streaming pipeline type (the Data Streams artifact) to Cask Hydrator for realtime pipelines run using Spark Streaming.
  • HYDRATOR-177, HYDRATOR-157 - Added a Kafka source for streaming pipelines.
  • HYDRATOR-178, HYDRATOR-157 - Added a window plugin to Cask Hydrator that enables the creation of sliding windows in a streaming pipeline.
  • HYDRATOR-181, HYDRATOR-158 - Added an experimental feature in the CDAP SDK which allows users to preview the Hydrator pipelines.
  • HYDRATOR-182 - Hydrator MapReduce or Spark jobs now support multiple inputs. This will enable more efficient physical workflow generation due to the reduction in the number of MapReduce or Spark programs required for a logical pipeline.
  • HYDRATOR-165 - Support multiple sources as input to a stage.
  • HYDRATOR-748 - Re-organizes batch pipeline settings to a top panel to schedule a batch pipeline, add post-run actions and set pipeline resources and the engine used.
  • HYDRATOR-712 - Added to Cask Hydrator a Batch Pipeline Configuration Schedule.
  • TRACKER-108, TRACKER-98 - Adds to Cask Tracker a tracker meter widget in the UI, including search results page and details page. It displays a metric that determines the 'truthfulness' of a dataset/stream being used in CDAP or Cask Hydrator.
  • TRACKER-109, TRACKER-100 - Adds a separate section for tags in Cask Tracker. This lists all available tags in CDAP.
  • TRACKER-149, TRACKER-105 - Adds a histogram for audit log in Cask Tracker for easier visualization of usage of a dataset.

Improvements

  • CDAP-1545 - Created a Docker-specific ENTRYPOINT script to support passing arguments.
  • CDAP-4065 - Improved the way that MapReduce failures are reported.
  • CDAP-4775 - Warns if either the app-fabric or router bind addresses are configured with a loopback address.
  • CDAP-5000 - The number of containers for the CDAP Explore service is no longer configurable and will be ignored upon specification. It will always be set to one (1).
  • CDAP-5336 - Now publishing stdout and stderr logs for MapReduce containers to CDAP.
  • CDAP-5601 - Allowed the batch size for flowlet process methods to be set via preferences and runtime arguments.
  • CDAP-5794 - Added support for long-running Spark jobs in a Kerberos-enabled cluster.
  • CDAP-5874 - Added support for starting extensions in distributed mode.
  • CDAP-5959 - Setting the JAVA_LIBRARY_PATH now causes CDAP Master to load Hadoop native libraries at startup.
  • CDAP-5969 - CDAP Upgrade tasks are now available in the CDAP Ambari Service.
  • CDAP-6034 - CDAP's Tephra dependency has been changed to depend on the Apache Incubator Tephra project.
  • CDAP-6206 - Improved the error message given on application deployment failure due to a missing Spark library.
  • CDAP-6216 - Added support in the log API for field suppression in JSON format.
  • CDAP-6246 - Added the ability to specify a CDAP Master's temporary directory.
  • CDAP-6276 - Introduced new experimental dataset APIs for updating a dataset's properties.
  • CDAP-6327 - Allowed specifying individual Java heap sizes for Java services in cdap-env.sh.
  • CDAP-6350 - Declared startup script contents as read-only to prevent them from being overridden by a user in cdap-env.sh.
  • CDAP-6361 - Added "Quick Links" for the CDAP UI, Cask Hydrator, and Cask Tracker in the Ambari 2.3+ UI.
  • CDAP-6362 - Added support for CDAP services over SSL in Ambari.
  • CDAP-6363 - Provided service dependencies for Ambari (requires Ambari 2.2+).
  • CDAP-6384 - Updated the Standalone CDAP VM version of IntelliJ IDE to 2016.1.3.
  • CDAP-6573 - Added a tool that allows bringing Hive in-sync with the partitions of a (time-)partitioned fileset.
  • CDAP-6880 - Users can now configure timeouts for internal HTTP connections and reads in cdap-site.xml. These are used for all internal HTTP calls.
  • CDAP-6901 - Added a bootstrap step for authorization in CDAP. As part of this step:
    1. The user that CDAP runs as now receives "admin" privileges on the CDAP instance, as well as "all" privileges on the system namespace.
    2. The list of users specified in the parameter security.authorization.admin.users in cdap-site.xml receives "admin" privileges on the CDAP instance so that they can create namespaces.
  • CDAP-6913 - Changed to use YarnClient instead of the YARN HTTP API to fetch node reports.
  • CDAP-7021 - Improved program launch performance to avoid large CPU spikes when multiple programs are launched at the same time.
  • CDAP-7046 - At configure time, containsMacro(.) on plugin properties that were provided macro syntax will return true. At runtime, all properties will have containsMacro(.) return false.
  • HYDRATOR-219 - Added a new editor for complex schema in the Cask Hydrator UI.
  • HYDRATOR-244 - Added support for macros in plugins. This allows Cask Hydrator plugin fields to accept macros.
  • HYDRATOR-289 - Added support to join data from multiple sources in Cask Hydrator.
  • HYDRATOR-392 - Enhanced the Cask Hydrator upgrade tool to upgrade 3.4.x pipelines to 3.5.x pipelines.
  • HYDRATOR-560 - The plugins NaiveBayesTrainer and NaiveBayesClassifier now have an optional configurable features property. If specified as none, 100 is used as the number of features.
  • HYDRATOR-578 - Snapshot sinks now allow users to specify a property cleanPartitionsOlderThan that cleans up any snapshots older than x days.
  • HYDRATOR-606 - Changed the DBSource plugin to override user-specified output schema.
  • HYDRATOR-607 - Fixed an issue that prevented TPFS sources and sinks created by Hydrator pipelines from being used as either input or output for MapReduce or Spark.
  • HYDRATOR-686 - Many existing Hydrator batch and spark plugins now have macro-enabled properties, as specified in their reference documentation.
  • HYDRATOR-713 - Added Encryptor and Decryptor plugins to Cask Hydrator that can encrypt or decrypt record fields.
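
The configure-time and runtime behavior described in CDAP-7046 above can be pictured with the Python sketch below. It assumes a macro is written as ${name} and is replaced by the runtime argument of the same name. The helpers contains_macro and substitute are hypothetical models, not the CDAP plugin API, and nested macros and macro functions are not handled.

```python
import re

# Assumed macro syntax for this sketch: ${name}
MACRO = re.compile(r"\$\{([^}]+)\}")


def contains_macro(value):
    """Return True if the property value still holds an unresolved macro."""
    return MACRO.search(value) is not None


def substitute(properties, runtime_args):
    """Replace every ${name} with runtime_args[name]; KeyError if missing."""
    return {
        key: MACRO.sub(lambda m: runtime_args[m.group(1)], value)
        for key, value in properties.items()
    }


# Made-up plugin properties and runtime arguments.
configured = {"path": "/data/${input.dir}/events", "format": "csv"}
at_configure_time = {k: contains_macro(v) for k, v in configured.items()}

resolved = substitute(configured, {"input.dir": "2016-09-30"})
at_runtime = {k: contains_macro(v) for k, v in resolved.items()}
```

Before substitution only the path property reports a macro; after the runtime arguments are applied, no property does.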

Bug Fixes

  • CDAP-2501 - The CDAP Router and UI no longer need to be colocated using Cloudera Manager.
  • CDAP-3131 - Running the endpoint of the Program Lifecycle RESTful API now returns 404 instead of an empty list if a specified application is not found.
  • CDAP-3732 - Fixed an issue where deploying an application was trying to enable CDAP Explore on system tables.
  • CDAP-3750 - Datasets that use reserved Hive keywords will now have their column names properly escaped when executing Hive DDL commands.
  • CDAP-4007 - Fixed an issue when running multiple unit tests in the same JVM.
  • CDAP-4434 - CDAP startup scripts return success (exit 0) if calling a service that is already running.
  • CDAP-5135 - Fixed an issue where the status of a program that was killed through YARN showed in CDAP as having been completed successfully.
  • CDAP-5291 - Fixed a problem in the fit-to-screen functionality of flow diagrams.
  • CDAP-5536 - Fixed a problem with users putting back a partition to PartitionConsumer without processing it.
  • CDAP-5643 - Fixed certain test cases to not depend on US as the system locale.
  • CDAP-5676 - Upgraded the Hive version used by the CDAP SDK to Hive-1.2.1 in order to pick up a fix for parquet tables.
  • CDAP-5875 - Require Spark on clusters configured for Hive on Spark and CDAP Explore service.
  • CDAP-5882 - Removed conditional restart on distributed CDAP package upgrades.
  • CDAP-6026 - Fixed an issue where an exception thrown in the initialize method of the Workflow was causing the Workflow container not to be terminated.
  • CDAP-6035 - Fixed a problem with correctly setting the context classloader for the Workflow initialize() and destroy() methods, to provide a consistent classloading behavior across all program types.
  • CDAP-6045 - Fixed an issue where application deployment was failing on Windows because of a colon (":") character in the filename.
  • CDAP-6052 - Fixed a bug that prevented the setting of local.data.dir in cdap-site.xml to an absolute path.
  • CDAP-6109 - Fixed a NullPointerException issue in Spark when saving RDD to a PartitionedFileSet dataset.
  • CDAP-6115 - Fixed a bug in the Flow system where usage of the primitive byte, short, or char types caused exceptions.
  • CDAP-6121 - Fixed a bug in Spark where using @UseDataset caused a NullPointerException.
  • CDAP-6127 - Fixed a bug not allowing the transaction service to bind to a configurable port.
  • CDAP-6147 - Improved the error message in the authorization and lineage clients when a 404 is returned from the server side.
  • CDAP-6170 - Fixed an issue that caused an error if an application or program attempted to override input/output format properties that were already defined in the dataset properties.
  • CDAP-6280 - Fixed a problem with allowing FileSets and PartitionedFileSets to be tagged as explorable in the CDAP UI.
  • CDAP-6311 - Fixed a bug where the program run record was not correctly reflected in CDAP if the corresponding YARN application failed to start.
  • CDAP-6378 - Fixed the classpath of the MapReduce program launched by CDAP to include the CDAP classes before the Apache Twill classes.
  • CDAP-6386 - Fixed an issue where updating the properties of a dataset deleted all of its partitions in Hive.
  • CDAP-6452 - Add a check for the environment variable CDAP_UI_COMPRESSION_ENABLED to disable UI compression.
  • CDAP-6455 - Fixed the classpath of a MapReduce program launched by the explore service to include the cdap-common.jar at the beginning.
  • CDAP-6486 - Fixed an issue that caused a Zookeeper watch to leak memory every time a program was started.
  • CDAP-6510 - Fixed an issue where the ExploreService was attempting (with no effect except for a slow down) to run the upgrade procedure for all explorable datasets.
  • CDAP-6515 - Fixed classloading issues related to using Guava's Optional class in Spark, allowing programs to perform left-outer and full-outer joins on RDDs.
  • CDAP-6524 - Plugins now support the char primitive as a property type.
  • CDAP-6643 - Fixed an issue that caused massive log messages when there was an underlying HDFS issue.
  • CDAP-6783 - Fixed the classpath ordering in Spark to load the classes from cdap-common first.
  • CDAP-6829 - Fixed issues that prevented the Log Saver from performing cleanup when metadata is present for a non-existing file.
  • CDAP-6852 - Fixed issues to make the Log Saver more resilient to errors while checkpointing.
  • CDAP-6860 - Improved performance in cube datasets when querying for more than one measure in a query. This will also improve metrics query performance.
  • CDAP-6929 - Logs from Spark driver and executors are now collected.
  • CDAP-6935 - Fixed a bug where the live-info endpoint was not working for Workflows, MapReduce, Worker, and Spark.
  • CDAP-6939 - Added support in the CDAP UI for Google Chrome releases prior to version 44.
  • CDAP-7026 - Upon namespace creation, all privileges are granted to both the user who created the namespace as well as the user that programs will run as in the new namespace.
  • CDAP-7066 - Restart of system services now kills containers if the containers are unresponsive so as to not leave stray containers.
  • CDAP-7082 - Removed bundling the parquet JAR from the com.twitter package with CDAP Master.
  • CDAP-7128 - Fixed a bug on changing the number of Worker instances in CDAP Distributed mode.
  • HYDRATOR-47 - The DBSource plugin now casts TINYINT and SMALLINT to INT type correctly.
  • HYDRATOR-54 - The Validator UI configuration is now preserved in a cloned pipeline.
  • HYDRATOR-80 - Fixed an issue where the configuration of the FileSource was failing while setting the properties for the FileInputFormat.
  • HYDRATOR-133 - HDFSSink can now be used alongside other sinks in a Hydrator pipeline.
  • HYDRATOR-149 - Removes the dependency of using labels from plugins in pipelines being imported in UI. Any pipeline configuration publishable from the CDAP-CLI or the Artifact RESTful HTTP API should now be publishable from UI.
  • HYDRATOR-398 - Adds the ability to view properties of plugins in pipelines created in older versions of Cask Hydrator.
  • HYDRATOR-438 - Fixed the Hydrator CSVParser plugin so that a nullable field is only set to null if the parsed value is an empty string and the field is not either a string or nullable string type.
  • HYDRATOR-451 - The CSVParser plugin now supports accepting a nullable string as a field to parse. If the field is null, all other fields are propagated and those that would otherwise be parsed by the CSVParser are set to null.
  • HYDRATOR-459 - Fixed a bug that caused the UPPER to lower transform not to be applied correctly to all columns for DBSink.
  • HYDRATOR-705 - Fixed an issue with record serialization for non-ASCII values in the shuffle phase of Hydrator pipelines.
  • HYDRATOR-790 - CDAP 3.4.0 introduced infinite scroll for the input and output schemas, but the version of the infinite scroll component used (1.2.2) had performance issues. The component has been downgraded to restore performance in Hydrator views.
  • TRACKER-42 - Fixed integrating the navigator app in the Cask Tracker UI. The POST body request that was sent while deploying the navigator app was using an older, deprecated property (UI was using metadataKafkaConfig instead of auditKafkaConfig). This should enable using the navigator app in the Cask Tracker UI.
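
The empty-string rule described in HYDRATOR-438 above can be written out as the small Python sketch below. The function parsed_value and its arguments are hypothetical; it models only the decision between null and the raw value, and leaves out the type conversion that a real parser would also perform.

```python
def parsed_value(raw, field_type, nullable):
    """Apply the empty-string rule to one parsed CSV cell.

    An empty string becomes None only for a nullable field whose type is
    not "string"; string fields keep the empty string as it is.
    """
    if raw == "" and nullable and field_type != "string":
        return None
    return raw


nullable_int = parsed_value("", "int", nullable=True)        # None
nullable_string = parsed_value("", "string", nullable=True)  # ""
plain_string = parsed_value("", "string", nullable=False)    # ""
```

Under this rule an empty cell in a nullable string column is kept as an empty string and is never turned into a null.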

Release 3.4.1

Bug Fixes

  • CDAP-4388 - Fixed a race condition bug in ResourceCoordinator that prevented performing partition assignment in the correct order. It affects the metrics processor and stream coordinator.
  • CDAP-5855 - Avoid the cancellation of delegation tokens upon completion of Explore-launched MapReduce and Spark jobs, as these delegation tokens are shared by CDAP system services.
  • CDAP-5868 - Removed 'SNAPSHOT' from the artifact version of apps created by default by the CDAP UI. This fixes deploying Cask Tracker and Navigator apps, enabling Cask Tracker from the CDAP UI.
  • CDAP-5884 - Fixed a bug that caused SDK builds to fail when using 3.3.x versions of maven.
  • CDAP-5887 - Fixed the Hydrator upgrade tool to correctly write out pipeline configs that failed to upgrade.
  • CDAP-5889 - The CDAP Standalone now deploys and starts the Cask Tracker app in the default namespace if the Tracker artifact is present.
  • CDAP-5898 - Shut down external processes started by CDAP (Zookeeper and Kafka) when there is an error during either startup or shutdown of CDAP.
  • CDAP-5907 - Fixed an issue where parsing of an Avro schema was failing when it included optional fields such as 'doc' or 'default'.
  • CDAP-5947 - Fixed a bug in the BatchReadableRDD so that it won't skip records when used by DataFrame.

Known Issues

  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report that data in the old format cannot be processed.
  • CDAP-797 - When running secure Hadoop clusters, debug logs from MapReduce programs are not available.
  • CDAP-1007 - If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive but becomes unusable. To correct this, restart the CDAP Master (which will restart all services) as described under "Starting CDAP Services" for your particular Hadoop distribution in the Installation documentation.
  • CDAP-1587 - CDAP internally creates tables in the "user" space that begin with the word "system". User datasets with names starting with "system" can conflict if they were to match one of those names. To avoid this, do not start any datasets with the word "system".
  • CDAP-2632 - The application in the cdap-kafka-ingest-guide does not run on Ubuntu 14.x as of CDAP 3.0.x.
  • CDAP-2721 - Metrics for FileSets can show zero values even if there is data present, because FileSets do not emit metrics (CDAP-587).
  • CDAP-2831 - A workflow that is scheduled by time will not be run between the failure of the primary master and the time that the secondary takes over. This scheduled run will not be triggered at all.
  • CDAP-2920 - Spark jobs on a Kerberos-enabled CDAP cluster cannot run longer than the delegation token expiration.
  • CDAP-2945 - If the input partition filter for a PartitionedFileSet does not match any partitions, MapReduce jobs can fail.
  • CDAP-3000 - The Workflow token is in an inconsistent state for nodes in a fork while the nodes of the fork are still running. It becomes consistent after the join.
  • CDAP-3221 - When running in Standalone CDAP, if a MapReduce job fails repeatedly, then the SDK hits an out-of-memory exception due to perm gen. The Standalone needs restarting at this point.
  • CDAP-3262 - For Microsoft Windows, the Standalone CDAP scripts can fail when used with a JAVA_HOME that is defined as a path with spaces in it. A workaround is to use a definition of JAVA_HOME that does not include spaces, such as C:\PROGRA~1\Java\jdk1.7.0_79\bin or C:\ProgramData\Oracle\Java\javapath.
  • CDAP-3492 - In the CDAP CLI, executing select * from a dataset with many fields generates an error.
  • CDAP-3641 - A RESTful API call to retrieve workflow statistics hangs if units (such as "s" for seconds) are not provided as part of the query.
  • CDAP-3750 - If a table schema contains a field name that is a reserved word in the Hive DDL, 'enable explore' fails.
  • CDAP-5900 - During the upgrade to CDAP 3.4.1, publishing to Kafka is halted because the CDAP Kafka service is not running. As a consequence, any applications that sync to the CDAP metadata will become out-of-sync as changes to the metadata made by the upgrade tool will not be published.
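
Two of the known issues above (CDAP-1587 and CDAP-3262) can be caught with a simple pre-flight check before deploying. The sketch below is only an illustration: the function names are hypothetical, and treating the "system" prefix as case-insensitive is an assumption that the release note does not confirm.

```python
def conflicts_with_system_tables(dataset_name: str) -> bool:
    """Return True if the dataset name starts with the word "system" (CDAP-1587)."""
    # Assumption: the prefix is compared case-insensitively.
    return dataset_name.lower().startswith("system")


def java_home_has_spaces(java_home: str) -> bool:
    """Return True if the JAVA_HOME path contains a space (CDAP-3262)."""
    return " " in java_home


# Example paths; the first is a typical Windows location that contains a space.
print(java_home_has_spaces(r"C:\Program Files\Java\jdk1.7.0_79"))
print(java_home_has_spaces(r"C:\PROGRA~1\Java\jdk1.7.0_79\bin"))
print(conflicts_with_system_tables("system.purchases"))
```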

Release 3.4.0

API Changes

  • CDAP-5082 - Added a new Spark Java and Scala API.

New Features

  • CDAP-20 - Removed dependency on the Guava library from the cdap-api module. Applications are now free to use a Guava library version of their choice.
  • CDAP-3051 - Added capability for programs to perform administrative dataset operations (create, update, truncate, drop).
  • CDAP-3854 - Added the capability to configure Kafka topic for logs and notifications using the cdap-site.xml.
  • CDAP-3980 - MapReduce programs submitted via CDAP now support multiple configured inputs.
  • CDAP-4807 - Added an ODBC 3.0 Driver for CDAP Datasets for Windows-based applications that support an ODBC interface.
  • CDAP-4970 - Added capability to fetch the schema from a JDBC source specified for a Database plugin from inside Cask Hydrator.
  • CDAP-5011 - Added a CDAP extension Cask Tracker: data discovery with metadata, audit, and lineage.
  • CDAP-5146 - Added a new Cask Hydrator batchaggregator plugin type. An aggregator operates on a collection of records, grouping them by a key and performing an aggregation on each group.
  • CDAP-5172 - Added support for authorization extensions in CDAP. Extensions extend an Authorizer class and provide a bundle jar containing all their required dependencies. This jar is then specified using the property security.authorization.extension.jar.path in the cdap-site.xml.
  • CDAP-5191 - Added an FTPBatchSource that can fetch data from an FTP server in a batch pipeline of Cask Hydrator.
  • CDAP-5205 - Added a global search across all CDAP entities in the CDAP UI.
  • CDAP-5274 - The Cask Hydrator Studio now includes the capability to configure a new type of pipeline, a "data pipeline" (beta feature).
  • CDAP-5360 - The CDAP UI now supports Sparksink and Sparkcompute plugin types, included in a new "data pipeline" artifact.
  • CDAP-5361 - Added a SparkTransform plugin type, which allows the running of a Spark job that operates as a transform in an ETL batch pipeline.
  • CDAP-5362 - Added a SparkSink plugin type, which allows the running of a Spark job (such as machine learning) on the output of an ETL batch pipeline.
  • CDAP-5392 - Added support for FormatSpecification in Spark when consuming data from a stream.
  • CDAP-5446 - Added an example application demonstrating the use of Spark Streaming with machine-learning and spam classifying.
  • CDAP-5504 - Added experimental support for using Spark as an execution engine for CDAP Explore.
  • CDAP-5707 - Added support for using Tez as an execution engine for CDAP Explore.
  • CDAP-5846 - Bundled Node.js with the CDAP UI RPM and DEB packages and with the CDAP Parcels.
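
The aggregator behavior described in CDAP-5146 (group records by a key, then aggregate each group) can be pictured with a small sketch. This is not the Hydrator plugin API; the function name, the record layout, and the field names are assumptions chosen for illustration.

```python
from collections import defaultdict


def aggregate(records, key_field, value_field, aggregate_fn=sum):
    """Group records by key_field and apply aggregate_fn to each group's values."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_field]].append(record[value_field])
    return {key: aggregate_fn(values) for key, values in groups.items()}


# Hypothetical input records.
purchases = [
    {"customer": "alice", "amount": 10},
    {"customer": "bob", "amount": 5},
    {"customer": "alice", "amount": 7},
]
totals = aggregate(purchases, "customer", "amount")
print(totals)  # {'alice': 17, 'bob': 5}
```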

Improvements

  • CDAP-4071 - MapReduce programs can now be configured to write metadata for each partition created using a DynamicPartitioner.
  • CDAP-4117 - Fixed an issue of not using the correct user account to access HDFS when submitting a YARN application through Apache Twill, which caused a cleanup failure (and a confusing error message) upon application termination.
  • CDAP-4644 - Workflow logs now contain logs from all of the actions executed by a workflow.
  • CDAP-4842 - Added a hydrator-test module that contains mock plugins for unit testing Hydrator plugins.
  • CDAP-4925 - Added to the CDAP test framework the ability to delete applications and artifacts, retrieve application information, update an application, and write and remove properties for artifacts.
  • CDAP-4955 - Added a 'postaction' Cask Hydrator plugin type that runs at the end of a pipeline run, regardless of whether the run succeeded or failed.
  • CDAP-5001 - Downloading an explore query from the CDAP UI will now stream the results directly to the client.
  • CDAP-5037 - Added a configuration property to Cask Hydrator TimePartitionedFileSet (TPFS) sinks that will clean out data that is older than a threshold amount of time.
  • CDAP-5039 - Added runtime macros to database and post-action Cask Hydrator plugins.
  • CDAP-5042 - Added a numSplits configuration property to Cask Hydrator database sources to allow users to configure how many splits should be used for an import query.
  • CDAP-5046 - The CDAP UI now allows a plugin developer to use a "textarea" in node configurations for displaying a plugin property.
  • CDAP-5075 - Programs now have a logical.start.time runtime argument that is populated by the system to be the start time of the program. The argument can be overridden just as other runtime arguments.
  • CDAP-5082 - Added support for Spark streaming (to interact with the transactional datasets in CDAP), and support for concurrent Spark execution through Workflow forking.
  • CDAP-5178 - Changed the format of the Cask Hydrator configuration. All pipeline stages are now together in a "stages" array instead of being broken up into separate "source", "transforms", and "sinks" arrays.
  • CDAP-5181 - Added an HTTP RESTful endpoint to retrieve the state of all nodes in a workflow.
  • CDAP-5182 - Added an API to retrieve the properties that were used to configure (or reconfigure) a dataset.
  • CDAP-5207 - Removed dependency on Guava from the cdap-proto module.
  • CDAP-5228 - Added support for CDH 5.7.
  • CDAP-5330 - The stream creation endpoint now accepts a stream configuration (with TTL, description, format specification, and notification threshold).
  • CDAP-5376 - Added an API for MapReduce to retrieve information about the enclosing workflow, including its run ID.
  • CDAP-5378 - Added access to workflow information in a Spark program when it is executed inside a workflow.
  • CDAP-5424 - Added the ability to track the lineage of external sources and sinks in a Cask Hydrator pipeline.
  • CDAP-5512 - Extended the workflow APIs to allow the use of plugins.
  • CDAP-5664 - Introduced a referenceName property (used for lineage and annotation metadata) into all external sources and sinks. This needs to be set before using any of these plugins.
  • CDAP-5779 - Upgraded the Tephra version in CDAP to 0.7.1.
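
The configuration change in CDAP-5178 amounts to collecting the former "source", "transforms", and "sinks" entries into one "stages" array. The sketch below shows only that reshaping. The function name and the sample fields are hypothetical, and a real pipeline upgrade may require further changes, so treat this as a picture of the new layout rather than a migration tool.

```python
OLD_KEYS = ("source", "transforms", "sinks")


def to_stages_layout(old_config):
    """Return a copy of old_config with all stages gathered under "stages"."""
    stages = []
    for key in OLD_KEYS:
        value = old_config.get(key, [])
        # Accept either a single stage object or a list of stages.
        if isinstance(value, dict):
            value = [value]
        stages.extend(value)
    new_config = {k: v for k, v in old_config.items() if k not in OLD_KEYS}
    new_config["stages"] = stages
    return new_config


# Hypothetical old-style configuration.
old = {
    "source": {"name": "Stream"},
    "transforms": [{"name": "Projection"}],
    "sinks": [{"name": "Table"}],
}
print([stage["name"] for stage in to_stages_layout(old)["stages"]])
# ['Stream', 'Projection', 'Table']
```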

Bug Fixes

  • CDAP-3498 - Upgraded CDAP to use Apache Twill 0.7.0-incubating with numerous new features, improvements, and bug fixes. See the Apache Twill release notes for details.
  • CDAP-3584 - Upon transaction rollback, a PartitionedFileSet now rolls back the files for the partitions that were added and/or removed in that transaction.
  • CDAP-3749 - Fixed a bug with the database plugins that required a password to be specified if the user was specified, even if the password was empty.
  • CDAP-4060 - Added the status for custom actions in workflow diagrams.
  • CDAP-4143 - Fixed a problem with the database source where a semicolon at the end of the query would cause an error.
  • CDAP-4692 - The CDAP UI now prevents users from accidentally losing their DAG by showing a browser-native popup for a confirmation before navigating away from the Cask Hydrator Studio view.
  • CDAP-4695 - Fixed an issue in the Windows CDAP SDK where streams could not be deleted.
  • CDAP-4735 - Fixed an issue that made Java extensions unavailable to programs, fixing the JavaScript-based Hydrator transforms under Java 8.
  • CDAP-4908 - Removed tableName as a required setting from database sources, since the importQuery is sufficient.
  • CDAP-4921 - Renamed the Hydrator Teradata batch source to Database. The previous Database source is no longer supported.
  • CDAP-4982 - Changed the Cask Hydrator LogParser transform logFormat field from a textbox to a dropdown.
  • CDAP-5041 - Changed several ExploreConnection methods to be no-ops instead of throwing exceptions.
  • CDAP-5062 - Added a fetch.size connection setting to the JDBC driver to control the number of rows fetched per database cursor, and increased the default fetch size from 50 to 1000.
  • CDAP-5092 - Fixed a problem that prevented applications written in Scala from being deployed.
  • CDAP-5103 - Fixed a problem so that when the schema for a view was not explicitly specified, the view system metadata will include the default schema for the specified format if that is available.
  • CDAP-5131 - Fixed a problem when filtering plugins by their extension plugin type; filtering by the extensions plugin type was returning extra results for any plugins that did not have an extension.
  • CDAP-5177 - Fixed a problem with PartitionConsumer not appropriately handling partitions that had been deleted since they were added to the working set.
  • CDAP-5241 - Fixed a problem with metadata for a dataset not being deleted when a dataset was deleted.
  • CDAP-5267 - Fixed a problem with the PartitionFilter.ALWAYS_MATCH not working as an input partition filter. PartitionFilter is now serialized into one key of the runtime arguments, to support serialization of PartitionFilter.ALWAYS_MATCH. If there are additional fields in the PartitionFilter that do not exist in the partitioning, the filter will then never match.
  • CDAP-5272 - Fixed a problem with a null pointer exception when null values were written to a database sink in Cask Hydrator.
  • CDAP-5280 - Corrected the documentation of the Query HTTP RESTful API for the retrieving of the status of a query.
  • CDAP-5297 - Fixed a problem with the CDAP UI not supporting pipelines created using previous versions of Cask Hydrator. The UI now shows appropriate information to upgrade the pipeline to be able to view it in the UI.
  • CDAP-5417 - Fixed an issue with running the CDAP examples in the CDAP SDK under Windows by setting appropriate memory requirements in the cdap.bat start script.
  • CDAP-5460 - Fixed a problem with the status of a Spark program not being updated on the program list screen of the CDAP UI when the program is run as part of a workflow.
  • CDAP-5463 - Fixed an issue when changing the number of instances of a worker or service.
  • CDAP-5513 - Fixed a problem with the update of metadata indexes so that search results reflect metadata updates correctly.
  • CDAP-5550 - Fixed a problem with the workflow statistics HTTP RESTful endpoint. The endpoint now has a default limit of 10 and a default interval of 10 seconds.
  • CDAP-5557 - Fixed a problem of not showing an appropriate error message in the node configuration when the CDAP backend returns 404 for a plugin property.
  • CDAP-5583 - Added the ability to support multiple sources in the CDAP UI while constructing a pipeline.
  • CDAP-5619 - Fixed a problem with the import of a pipeline configuration. If the imported pipeline config doesn't have artifact information for a plugin, the CDAP UI now defaults to the latest artifact from the list of artifacts sent by the backend.
  • CDAP-5629 - Fixed a problem with losing metadata after changing the stream format on a MapR cluster by avoiding the use of Hive keywords in the CLF format field names; the 'date' field was renamed to 'request_time'.
  • CDAP-5634 - Fixed a performance issue when rendering/scrolling through large input or output schemas for a plugin in the CDAP UI.
  • CDAP-5652 - Added command line interface command to retrieve the workflow node states.
  • CDAP-5793 - CDAP Explore jobs properly use the latest/updated delegation tokens.
  • CDAP-5844 - Fixed a problem with the updating of the HDFS delegation token for HA mode.
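
The matching rule described in CDAP-5267 can be modeled in a few lines: an empty filter always matches, and a filter that names a field outside the partitioning never matches. This is a simplified model, not the CDAP PartitionFilter class; it handles equality conditions only, and all names are hypothetical.

```python
def filter_matches(conditions, partition_key, partitioning_fields):
    """Return True if partition_key satisfies every condition in the filter."""
    # A condition on a field that is not part of the partitioning can never match.
    if any(field not in partitioning_fields for field in conditions):
        return False
    # An empty filter has no condition that can fail, so it always matches.
    return all(partition_key.get(field) == value
               for field, value in conditions.items())


fields = {"year", "month"}
key = {"year": 2016, "month": 3}
print(filter_matches({}, key, fields))                # True
print(filter_matches({"year": 2016}, key, fields))    # True
print(filter_matches({"region": "eu"}, key, fields))  # False
```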

Deprecated and Removed Features

  • See the CDAP 3.4.0 Javadocs for a list of deprecated and removed APIs.
  • As of CDAP v3.4.0, Metadata Update Notifications have been deprecated, pending removal in a later version. The CDAP Audit Notifications contain notifications for metadata changes. Please change all uses of Metadata Update Notifications to consume only those messages from the audit feed that have the type field set to METADATA_CHANGE.
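
A consumer that previously read Metadata Update Notifications could instead filter the audit feed on the type field, as the note above recommends. The sketch below assumes each audit message arrives as a JSON object with a top-level "type" field; the function name, the "entity" field, and the "ACCESS" type in the sample are hypothetical.

```python
import json


def metadata_changes(raw_messages):
    """Yield only the audit messages whose type is METADATA_CHANGE."""
    for raw in raw_messages:
        message = json.loads(raw)
        if message.get("type") == "METADATA_CHANGE":
            yield message


# Hypothetical sample messages from the audit feed.
feed = [
    '{"type": "METADATA_CHANGE", "entity": "dataset:purchases"}',
    '{"type": "ACCESS", "entity": "dataset:purchases"}',
]
for change in metadata_changes(feed):
    print(change["entity"])  # dataset:purchases
```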

Known Issues

  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report that data in the old format cannot be processed.
  • CDAP-797 - When running secure Hadoop clusters, debug logs from MapReduce programs are not available.
  • CDAP-1007 - If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive but becomes unusable. To correct this, restart the CDAP Master (which will restart all services) as described under "Starting CDAP Services" for your particular Hadoop distribution in the Installation documentation.
  • CDAP-1587 - CDAP internally creates tables in the "user" space that begin with the word "system". User datasets with names starting with "system" can conflict if they were to match one of those names. To avoid this, do not start any datasets with the word "system".
  • CDAP-2632 - The application in the cdap-kafka-ingest-guide does not run on Ubuntu 14.x as of CDAP 3.0.x.
  • CDAP-2721 - Metrics for FileSets can show zero values even if there is data present, because FileSets do not emit metrics (CDAP-587).
  • CDAP-2831 - A workflow that is scheduled by time will not be run between the failure of the primary master and the time that the secondary takes over. This scheduled run will not be triggered at all.
  • CDAP-2920 - Spark jobs on a Kerberos-enabled CDAP cluster cannot run longer than the delegation token expiration.
  • CDAP-2945 - If the input partition filter for a PartitionedFileSet does not match any partitions, MapReduce jobs can fail.
  • CDAP-3000 - The Workflow token is in an inconsistent state for nodes in a fork while the nodes of the fork are still running. It becomes consistent after the join.
  • CDAP-3221 - When running in Standalone CDAP, if a MapReduce job fails repeatedly, then the SDK hits an out-of-memory exception due to perm gen. The Standalone needs restarting at this point.
  • CDAP-3262 - For Microsoft Windows, the Standalone CDAP scripts can fail when used with a JAVA_HOME that is defined as a path with spaces in it. A workaround is to use a definition of JAVA_HOME that does not include spaces, such as C:\PROGRA~1\Java\jdk1.7.0_79\bin or C:\ProgramData\Oracle\Java\javapath.
  • CDAP-3492 - In the CDAP CLI, executing select * from a dataset with many fields generates an error.
  • CDAP-3641 - A RESTful API call to retrieve workflow statistics hangs if units (such as "s" for seconds) are not provided as part of the query.
  • CDAP-3750 - If a table schema contains a field name that is a reserved word in the Hive DDL, 'enable explore' fails.

Release 3.3.3

Bug Fix

  • CDAP-5350 - Fixed an issue that prevented MapReduce programs from running on clusters with encryption.

Release 3.3.2

Improvements

Bug Fixes

  • CDAP-4967 - Fixed a schema-parsing bug that prevented the use of schemas where a record is used both as a top-level field and also used inside a different record field.
  • CDAP-5019 - Worked around two issues (SPARK-13441 and YARN-4727) that prevented launching Spark jobs on CDH (Cloudera Data Hub) clusters managed with Cloudera Manager when using Spark 1.4 or greater.
  • CDAP-5063 - Fixed a problem with the CDAP Master not starting when CDAP and the HiveServer2 services are running on the same node in an Ambari cluster.
  • CDAP-5076 - Fixed a problem with the CDAP CLI command "update app" that was parsing the application config incorrectly.
  • CDAP-5094 - Fixed a problem where the explore schema fileset property was being ignored unless an explore format was also present.
  • CDAP-5137 - Fixed a problem with Spark jobs not being submitted to the appropriate YARN scheduler queue set for the namespace.

Release 3.3.1

Improvements

  • CDAP-4602 - Updated CDAP to use Tephra 0.6.5.
  • CDAP-4708 - Added system metadata to existing entities.
  • CDAP-4723 - Improved the Hydrator plugin archetypes to include build steps to build the deployment JSON for the artifact.
  • CDAP-4773 - Improved the error logging for the Master Stream service when it can't connect to the CDAP AppFabric server.

Bug Fixes

  • CDAP-4117 - Fixed an issue of not using the correct user to access HDFS when submitting a YARN application through Apache Twill, which caused cleanup failure on application termination.
  • CDAP-4613 - Fixed a problem with tooltips not appearing in Flow and Workflow diagrams displayed in the Firefox browser.
  • CDAP-4679 - The Hydrator UI now prevents drafts from being created with a name of an already-existing draft. This prevents overwriting of existing drafts.
  • CDAP-4688 - Improved the metadata search to return matching entities from both the specified namespace and the system namespace.
  • CDAP-4689 - Fixed a problem when using an HBase sink as one of multiple sinks in a Hydrator pipeline.
  • CDAP-4720 - Fixed an issue where system metadata updates were not being published to Kafka.
  • CDAP-4721 - Fixed an issue where metadata updates wouldn't be sent when certain entities were deleted.
  • CDAP-4740 - Added validation to the JSON imported in the Hydrator UI.
  • CDAP-4741 - Fixed a bug with deleting artifact metadata when an artifact was deleted.
  • CDAP-4743 - Fixed the Node.js server proxy to handle all backend errors (with and without statusCodes).
  • CDAP-4745 - Fixed a bug in the Hydrator upgrade tool which caused drafts to not get upgraded.
  • CDAP-4753 - Fixed the Hydrator Stream source to not assume an output schema. This is valid when a pipeline is created outside Hydrator UI.
  • CDAP-4754 - Fixed ObjectStore to work when parameterized with custom classes.
  • CDAP-4767 - Fixed an issue where the delegation token cancellation of a CDAP program was affecting CDAP master services.
  • CDAP-4770 - Fixed the Cask Hydrator UI to automatically reconnect with the CDAP backend when the backend restarts.
  • CDAP-4771 - Fixed an issue in Cloudera Manager installations where CDAP container logs would go to the stdout file instead of the master log.
  • CDAP-4784 - Fixed an issue where the IndexedTable was dropping indices upon row updates.
  • CDAP-4785 - Fixed a problem in the upgrade tool where deleted datasets would cause it to throw a NullPointerException.
  • CDAP-4790 - Fixed an issue where the HBase implementation of the Table API returned all rows, when the correct response should have been an empty set of columns.
  • CDAP-4800 - Fixed a problem with the error message returned when loading an artifact with an invalid range.
  • CDAP-4806 - Fixed the PartitionedFileSet's DynamicPartitioner to work with Avro OutputFormats.
  • CDAP-4829 - Fixed a Validator Transform function generator in the Hydrator UI.
  • CDAP-4831 - Allows user-scoped plugins to surface the correct widget JSON in the Hydrator UI.
  • CDAP-4832 - Added the ErrorDataset as an option on widget JSON in Hydrator plugins.
  • CDAP-4836 - Fixed a spacing issue for metrics showing in Pipeline diagrams of the Hydrator UI.
  • CDAP-4853 - Fixed issues with the Hydrator UI widgets for the Hydrator Kafka real-time source, JMS real-time source, and CloneRecord transform.
  • CDAP-4865 - Enhanced the CDAP SDK to be able to publish metadata updates to an external Kafka, identified by the configuration property metadata.updates.kafka.broker.list. Publishing can be enabled by setting metadata.updates.publish.enabled to true. Updates are published to the Kafka topic identified by the property metadata.updates.kafka.topic.
  • CDAP-4877 - Fixed errors in Cask Hydrator Plugins. Two plugin documents (core-plugins/docs/Database-batchsink.md and core-plugins/docs/Database-batchsource.md) were removed, as the plugins have been moved from core-plugins to database-plugins (to database-plugins/docs/Database-batchsink.md and database-plugins/docs/Database-batchsource.md).
  • CDAP-4889 - Fixed an issue with upgrading HBase tables while using the CDAP Upgrade Tool.
  • CDAP-4894 - Fixed an issue with CDAP coprocessors that caused HBase tables to be disabled after upgrading the cluster to a highly-available file system.
  • CDAP-4906 - Fixed the CDAP Upgrade Tool to return a non-zero exit status upon error during upgrade.
  • CDAP-4924 - Fixed a PermGen memory leak that occurred while deploying multiple applications with database plugins.
  • CDAP-4927 - Fixed the CDAP Explore Service JDBC driver to do nothing instead of throwing an exception when a commit is called.
  • CDAP-4950 - Added an 'enableAutoCommit' property to the Cask Hydrator database plugins to enable the use of JDBC drivers that, similar to the Hive JDBC driver, do not allow commits.
  • CDAP-4951 - Changed the upload timeout from the CDAP CLI from 15 seconds to unlimited.
  • CDAP-4975 - Pass ResourceManager delegation tokens in the proper format in secure Hadoop HA clusters.
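
The three properties named in CDAP-4865 could be written out as a configuration fragment along the following lines. The sketch assumes that cdap-site.xml uses the Hadoop-style configuration/property/name/value layout; the function name, the broker address, and the topic name are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def render_site_properties(settings):
    """Render a dict of settings as a Hadoop-style <configuration> fragment."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


snippet = render_site_properties({
    "metadata.updates.publish.enabled": "true",
    # Hypothetical broker list and topic name; substitute your own values.
    "metadata.updates.kafka.broker.list": "kafka.example.com:9092",
    "metadata.updates.kafka.topic": "cdap-metadata-updates",
})
print(snippet)
```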

Deprecated and Removed Features

  • See the CDAP 3.3.1 Javadocs for a list of deprecated and removed APIs.
  • The properties router.ssl.webapp.bind.port, router.webapp.bind.port, router.webapp.enabled have been deprecated and will be removed in a future version.

Known Issues

  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report that data in the old format cannot be processed.
  • CDAP-797 - When running secure Hadoop clusters, debug logs from MapReduce programs are not available.
  • CDAP-1007 - If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive but becomes unusable. To correct this, restart the CDAP Master (which will restart all services) as described under "Starting CDAP Services" for your particular Hadoop distribution in the Installation documentation.
  • CDAP-1587 - CDAP internally creates tables in the "user" space that begin with the word "system". User datasets with names starting with "system" can conflict if they were to match one of those names. To avoid this, do not start any datasets with the word "system".
  • CDAP-2632 - The application in the cdap-kafka-ingest-guide does not run on Ubuntu 14.x as of CDAP 3.0.x.
  • CDAP-2721 - Metrics for FileSets can show zero values even if there is data present, because FileSets do not emit metrics (CDAP-587).
  • CDAP-2831 - A workflow that is scheduled by time will not be run between the failure of the primary master and the time that the secondary takes over. This scheduled run will not be triggered at all.
  • CDAP-2945 - If the input partition filter for a PartitionedFileSet does not match any partitions, MapReduce jobs can fail.
  • CDAP-3000 - The Workflow token is in an inconsistent state for nodes in a fork while the nodes of the fork are still running. It becomes consistent after the join.
  • CDAP-3221 - When running in Standalone CDAP, if a MapReduce job fails repeatedly, then the SDK hits an out-of-memory exception due to perm gen. The Standalone needs restarting at this point.
  • CDAP-3262 - For Microsoft Windows, the Standalone CDAP scripts can fail when used with a JAVA_HOME that is defined as a path with spaces in it. A workaround is to use a definition of JAVA_HOME that does not include spaces, such as C:\PROGRA~1\Java\jdk1.7.0_79\bin or C:\ProgramData\Oracle\Java\javapath.
  • CDAP-3492 - In the CDAP CLI, executing select * from a dataset with many fields generates an error.
  • CDAP-3641 - A RESTful API call to retrieve workflow statistics hangs if units (such as "s" for seconds) are not provided as part of the query.
  • CDAP-3750 - If a table schema contains a field name that is a reserved word in the Hive DDL, 'enable explore' fails.

Release 3.3.0

New Features

  • CDAP-961 - Added on-demand (dynamic) dataset instantiation through the program runtime context.
  • CDAP-2303 - Added lookup capability in context that can be used in existing Script, ScriptFilter and Validator transforms.
  • CDAP-3514 - Added an endpoint to get a count of active queries: /v3/namespaces/<namespace-id>/data/explore/queries/count.
  • CDAP-3857 - Added experimental support for running ETL Batch applications on Spark. Introduced an 'engine' setting in the configuration that defaults to 'mapreduce', but can be set to 'spark'.
  • CDAP-3944 - Added support to PartitionConsumer for concurrency, plus a limit and filter on read.
  • CDAP-3945 - Added support for limiting the number of concurrent schedule runs.
  • CDAP-4016 - Added Java-8 support for Script transforms.
  • CDAP-4022 - Added RESTful APIs to start or stop multiple programs.
  • CDAP-4023 - Added CLI commands to stop, start, restart, or get status of programs in an application.
  • CDAP-4043 - Added support for ETL transforms written in Python.
  • CDAP-4128 - Added a new JavaScript transform that can emit records using an emitter.
  • CDAP-4135 - Added the capability for MapReduce and Spark programs to localize additional resources during setup.
  • CDAP-4228 - Added the ability to configure which artifact a Hydrator plugin should use.
  • CDAP-4230 - Added DAGs to ETL pipelines, which will allow users to fork and merge. ETLConfig has been updated to allow representing a DAG.
  • CDAP-4235 - Added AuthorizationPlugin, for pluggable authorization.
  • CDAP-4263 - Added metadata support for stream views.
  • CDAP-4270 - Added CLI support for metadata and lineage.
  • CDAP-4280 - Added the ability to add metadata to artifacts.
  • CDAP-4289 - Added RESTful APIs to set and get properties for an artifact.
  • CDAP-4264 - Added support for automatically annotating CDAP entities with system metadata when they are created or updated.
  • CDAP-4285 - Added an authorization plugin that uses a system dataset to manage ACLs.
  • CDAP-4403 - Moved the Hydrator plugins (previously cdap-etl-lib in the CDAP repository) into their own repository.
  • CDAP-4591 - Improved Metadata Indexing and Search to support searches on words in value and tags.
  • CDAP-4592 - Schema fields are stored as Metadata and are searchable.
  • CDAP-4658 - Added capability in CDAP UI to display system tags.
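
The path of the active-query count endpoint from CDAP-3514 can be assembled as shown below. Only the path comes from the release note; the function name, host, and port are hypothetical, and no assumption is made here about the format of the response.

```python
from urllib.parse import quote


def active_query_count_url(base_url, namespace_id):
    """Build the URL of the active explore query count endpoint for a namespace."""
    namespace = quote(namespace_id, safe="")
    return f"{base_url.rstrip('/')}/v3/namespaces/{namespace}/data/explore/queries/count"


# Hypothetical router address.
print(active_query_count_url("http://cdap.example.com:10000/", "default"))
# http://cdap.example.com:10000/v3/namespaces/default/data/explore/queries/count
```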

Improvements

  • CDAP-3079 - Table datasets, and any other dataset that implements RecordWritable<StructuredRecord>, can now be written to using Hive.
  • CDAP-3887 - The CDAP Router now has a configurable timeout for idle connections, with a default timeout of 15 seconds.
  • CDAP-4045 - A new property master.collect.containers.log has been added to cdap-site.xml, which determines if container logs are streamed back to the cdap-master process log. (This has always been the default behavior). For MapR installations, this must be turned off (set to false).
  • CDAP-4133 - Added ability to retrieve the live-info for the AppFabric system service.
  • CDAP-4209 - Added a method to ObjectMappedTable and ObjectStore to retrieve a specific number of splits between a start key and an end key.
  • CDAP-4233 - Messages logged by Hydrator are now prefixed with the name of the stage that logged them.
  • CDAP-4301 - Added support for CDH 5.5.
  • CDAP-4392 - Upgraded netty-http dependency in CDAP to 0.14.0.
  • CDAP-4444 - Make xmllint dependency optional and allow setting variables to skip configuration file parsing.
  • CDAP-4453 - Added schema validation (for sources, transforms, and sinks) that validates the schemas of the pipeline stages during deployment and reports any issues.
  • CDAP-4518 - CDAP Master service will now log important configuration settings on startup.
  • CDAP-4523 - Added the config setting master.startup.checks.enabled to control whether CDAP Master startup checks are run or not.
  • CDAP-4536 - Improved the installation experience by adding to the CDAP Master service checks of pre-requisites such as file system permissions, availability of components such as YARN and HBase, resource availability during startup, and to error out if any of the pre-requisites fail.
  • CDAP-4548 - Added a config setting 'master.collect.app.containers.log' that can be set to 'false' to disable streaming of application logs back to the CDAP Master log.
  • CDAP-4598 - Added an error message when a required field is not provided when configuring a Hydrator pipeline.
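
For MapR installations, CDAP-4045 requires master.collect.containers.log to be set to false. A quick check of a configuration file might look like the sketch below. It assumes cdap-site.xml uses the Hadoop-style property layout and that a missing property means the default (collection enabled); the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Parse a Hadoop-style configuration document into a name-to-value dict."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.iter("property")}


def container_logs_collected(properties):
    """Return True unless master.collect.containers.log is explicitly false."""
    # Assumption: an absent property means container logs are collected.
    value = properties.get("master.collect.containers.log", "true")
    return value.strip().lower() != "false"


site_xml = """
<configuration>
  <property>
    <name>master.collect.containers.log</name>
    <value>false</value>
  </property>
</configuration>
"""
print(container_logs_collected(read_site_properties(site_xml)))  # False
```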

Bug Fixes

  • CDAP-1174 - Prefix start script functions with 'cdap' to prevent namespace collisions.
  • CDAP-2470 - Added a check to cause a DB (source or sink) pipeline to fail during deployment if the table (source or sink) was not found, or if an incorrect connection string was provided.
  • CDAP-3345 - Fixed a bug where the TTL for datasets was incorrect; it was reduced by a factor of 1000 after an upgrade. After running the upgrade tool, please make sure the TTL values of tables are as expected.
  • CDAP-3542 - Fixed an issue where the failure of a program running in a workflow fork node was causing other programs in the same fork node to remain in the RUNNING state, even after the Workflow was completed.
  • CDAP-3694 - Fixed test failures in the PurchaseHistory, StreamConversion, and WikipediaPipeline example apps included in the CDAP SDK.
  • CDAP-3742 - Fixed a bug where certain MapReduce metrics were not being properly emitted when using multiple outputs.
  • CDAP-3761 - Fixed a problem with DBSink column names not being used to filter input record fields before writing to a DBSink.
  • CDAP-3807 - Added a fix for case sensitivity handling in DBSink.
  • CDAP-3815 - Fixed an issue where the regex filter for S3 Batch Source wasn't getting applied correctly.
  • CDAP-3861 - Fixed an issue about stopping all dependent services when a service is stopped.
  • CDAP-3900 - Fixed a bug when querying for logs of deleted program runs.
  • CDAP-3902 - Fixed a problem with dataset performance degradation because of making multiple remote calls for each "get dataset" request.
  • CDAP-3924 - Fixed QueryClient to work against HTTPS.
  • CDAP-4000 - Fixed an issue where a stream that has a view could not be deleted cleanly.
  • CDAP-4067 - Fixed an issue where socket connections to the TransactionManager were not being closed.
  • CDAP-4092 - Fixed an issue that caused worker threads to go into an infinite recursion while exceptions were being thrown in channel handlers.
  • CDAP-4112 - Fixed a bug that prevented applications from using HBase directly.
  • CDAP-4119 - Fixed a problem where, when the CDAP Master switched from active to standby, the programs that were running were marked as failed.
  • CDAP-4240 - Fixed a problem in the CLI command used to load an artifact, where the wrong artifact name and version was used if the artifact name ends with a number.
  • CDAP-4294 - Fixed a problem where plugins from another namespace were visible when creating an application using a system artifact.
  • CDAP-4316 - Fixed a problem with the CLI attempting to connect to CDAP when the hostname and port were incorrect.
  • CDAP-4366 - Improved error message when stream views were not found.
  • CDAP-4393 - Fixed an issue where tag searches were failing for certain tags.
  • CDAP-4141 - Fixed node.js version checking for the cdap sdk script in the CDAP SDK.
  • CDAP-4373 - Fixed a problem that prevented MapReduce jobs from being run when the Resource Manager switches from active to standby in a Kerberos-enabled HA cluster.
  • CDAP-4384 - Fixed an issue that prevents streams from being read in HA HDFS mode.
  • CDAP-4526 - Fixed init scripts to print service status when stopped.
  • CDAP-4534 - Added configuration 'router.bypass.auth.regex' to exempt certain URLs from authentication.
  • CDAP-4539 - Fixed a problem in the init scripts that forced cdap-kafka-server, cdap-router, and cdap-auth-server to have the Hive client installed.
  • CDAP-4678 - Fixed an issue where the logs and history list on a Hydrator pipeline view were not updating on new runs.
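
The 'router.bypass.auth.regex' setting from CDAP-4534 exempts URLs that match a regular expression from authentication. The sketch below shows the general idea only. The pattern, the request paths, and the choice to match the whole path are assumptions made for illustration, not the router's documented behavior.

```python
import re

# Hypothetical value for router.bypass.auth.regex; not taken from the CDAP documentation.
BYPASS_REGEX = r"/v3/(ping|status)"


def bypasses_auth(path, pattern=BYPASS_REGEX):
    """Return True if the whole request path matches the bypass pattern (assumed semantics)."""
    return re.fullmatch(pattern, path) is not None


for path in ("/v3/ping", "/v3/namespaces/default/apps"):
    print(path, bypasses_auth(path))
```

Testing a candidate pattern this way against a list of known paths can help confirm that it does not exempt more URLs than intended.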

Deprecated and Removed Features

  • See the CDAP 3.3.0 Javadocs for a list of deprecated and removed APIs.
  • CDAP-2481 - Removed a deprecated endpoint to retrieve the status of a currently running node in a workflow.
  • CDAP-2943 - Removed the deprecated builder-style Flow API.
  • CDAP-4128 - Deprecated the Script transform.
  • CDAP-4217 - Deprecated createDataSchedule and createTimeSchedule methods in Schedules class and removed deprecated Schedule constructor.
  • CDAP-4251 - Removed deprecated fluent style API for Flow configuration. The only supported API is now the configurer style.

Known Issues

  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report the inability to process old data in the old format.
  • CDAP-797 - When running secure Hadoop clusters, debug logs from MapReduce programs are not available.
  • CDAP-1007 - If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive, but becomes unusable. To correct, restart the CDAP Master (which will restart all services) as described under "Starting CDAP Services" for your particular Hadoop distribution in the Installation documentation.
  • CDAP-1587 - CDAP internally creates tables in the "user" space that begin with the word "system". User datasets with names starting with "system" can conflict if they were to match one of those names. To avoid this, do not start any datasets with the word "system".
  • CDAP-2632 - The application in the cdap-kafka-ingest-guide does not run on Ubuntu 14.x as of CDAP 3.0.x.
  • CDAP-2721 - Metrics for FileSets can show zero values even if there is data present, because FileSets do not emit metrics (CDAP-587 <https://issues.cask.co/browse/CDAP-587>).
  • CDAP-2831 - A workflow that is scheduled by time will not be run between the failure of the primary master and the time that the secondary takes over. This scheduled run will not be triggered at all.
  • CDAP-2945 - If the input partition filter for a PartitionedFileSet does not match any partitions, MapReduce jobs can fail.
  • CDAP-3000 - The Workflow token is in an inconsistent state for nodes in a fork while the nodes of the fork are still running. It becomes consistent after the join.
  • CDAP-3221 - When running in Standalone CDAP, if a MapReduce job fails repeatedly, then the SDK hits an out-of-memory exception due to perm gen. The Standalone needs restarting at this point.
  • CDAP-3262 - For Microsoft Windows, the Standalone CDAP scripts can fail when used with a JAVA_HOME that is defined as a path with spaces in it. A workaround is to use a definition of JAVA_HOME that does not include spaces, such as C:\PROGRA~1\Java\jdk1.7.0_79\bin or C:\ProgramData\Oracle\Java\javapath.
  • CDAP-3492 - In the CDAP CLI, executing select * from a dataset with many fields generates an error.
  • CDAP-3641 - A RESTful API call to retrieve workflow statistics hangs if units (such as "s" for seconds) are not provided as part of the query.
  • CDAP-3750 - If a table schema contains a field name that is a reserved word in the Hive DDL, 'enable explore' fails.
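
Two of the known issues above (CDAP-1587 and CDAP-3262) can be avoided with simple checks before deployment. The following is a minimal sketch of such a check. The function name preflight_warnings is hypothetical, and treating the "system" prefix case-insensitively is a conservative assumption rather than documented behavior.

```python
def preflight_warnings(dataset_names, java_home):
    """Collect warnings for dataset names and a JAVA_HOME value that may trigger known issues."""
    warnings = []
    for name in dataset_names:
        # CDAP-1587: names starting with "system" can collide with internal tables.
        if name.lower().startswith("system"):
            warnings.append(f"dataset '{name}' starts with 'system'")
    # CDAP-3262: a JAVA_HOME containing spaces can break the Standalone scripts on Windows.
    if " " in java_home:
        warnings.append("JAVA_HOME contains spaces")
    return warnings


issues = preflight_warnings(
    ["purchases", "system.audit"],
    r"C:\Program Files\Java\jdk1.7.0_79",
)
for issue in issues:
    print(issue)
```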

Release 3.2.1

New Features

  • CDAP-3951 - Added the ability for S3 batch sources and sinks to set additional file system properties.

Improvements

  • CDAP-3870 - Added logging and metrics support for Script, ScriptFilter, and Validator transforms.
  • CDAP-3939 - Improved artifact and application deployment failure handling.

Bug Fixes

  • CDAP-3342 - Fixed a problem with the CDAP SDK being unable to start on certain Windows machines by updating the Hadoop native library in CDAP with a version that does not have a dependency on a debug version of the Microsoft msvcr100.dll.
  • CDAP-3815 - Fixed an issue where the regex filter for S3 batch sources wasn't being applied correctly.
  • CDAP-3829 - Fixed snapshot sinks so that the data is explorable as a PartitionedFileSet.
  • CDAP-3833 - Fixed snapshot sinks so that they can be read safely.
  • CDAP-3859 - Fixed a compilation error in the Maven application archetype.
  • CDAP-3860 - Fixed a bug where plugins, packaged in the same artifact as an application class, could not be used by that application class.
  • CDAP-3891 - Updated the documentation to remove references to application templates and adaptors that were removed as of CDAP 3.2.0.
  • CDAP-3949 - Fixed a problem with running certain examples on Linux systems by increasing the maximum Java heap size of the Standalone SDK on Linux systems to 2048m.
  • CDAP-3961 - Fixed a missing dependency on cdap-hbase-compat-1.1 package in the CDAP Master package.

Known Issues

  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report the inability to process old data in the old format.
  • CDAP-797 - When running secure Hadoop clusters, debug logs from MapReduce programs are not available.
  • CDAP-1007 - If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive, but becomes unusable. To correct, restart the CDAP Master (which will restart all services) as described under "Starting CDAP Services" for your particular Hadoop distribution in the Installation documentation.
  • CDAP-1587 - CDAP internally creates tables in the "user" space that begin with the word "system". User datasets with names starting with "system" can conflict if they were to match one of those names. To avoid this, do not start any datasets with the word "system".
  • CDAP-2632 - The application in the cdap-kafka-ingest-guide does not run on Ubuntu 14.x as of CDAP 3.0.x.
  • CDAP-2721 - Metrics for FileSets can show zero values even if there is data present, because FileSets do not emit metrics (CDAP-587 <https://issues.cask.co/browse/CDAP-587>).
  • CDAP-2831 - A workflow that is scheduled by time will not be run between the failure of the primary master and the time that the secondary takes over. This scheduled run will not be triggered at all.
  • CDAP-2945 - If the input partition filter for a PartitionedFileSet does not match any partitions, MapReduce jobs can fail.
  • CDAP-3000 - The Workflow token is in an inconsistent state for nodes in a fork while the nodes of the fork are still running. It becomes consistent after the join.
  • CDAP-3221 - When running in Standalone CDAP, if a MapReduce job fails repeatedly, then the SDK hits an out-of-memory exception due to perm gen. The Standalone needs restarting at this point.
  • CDAP-3262 - For Microsoft Windows, the Standalone CDAP scripts can fail when used with a JAVA_HOME that is defined as a path with spaces in it. A workaround is to use a definition of JAVA_HOME that does not include spaces, such as C:\PROGRA~1\Java\jdk1.7.0_79\bin or C:\ProgramData\Oracle\Java\javapath.
  • CDAP-3492 - In the CDAP CLI, executing select * from a dataset with many fields generates an error.
  • CDAP-3641 - A RESTful API call to retrieve workflow statistics hangs if units (such as "s" for seconds) are not provided as part of the query.
  • CDAP-3750 - If a table schema contains a field name that is a reserved word in the Hive DDL, 'enable explore' fails.

Release 3.2.0

New Features

  • CDAP-2556 - Added support for HBase 1.1.
  • CDAP-2666 - Added a new API for creating an application from an artifact.
  • CDAP-2756 - Added the ability to write to multiple outputs from a MapReduce job.
  • CDAP-2757 - Added the ability to dynamically write to multiple partitions of a PartitionedFileSet dataset as the output of a MapReduce job.
  • CDAP-3253 - Added a Stream and Dataset Widget to the CDAP UI.
  • CDAP-3390 - Added stream views, enabling reading from a single stream using various formats and schemas.
  • CDAP-3476 - Added a Validator Transform that can be used to validate records based on a set of available validators and configured to write invalid records to an error dataset.
  • CDAP-3516 - Added a service to manage the metadata of CDAP entities.
  • CDAP-3518 - Added the publishing of metadata change notifications to Apache Kafka.
  • CDAP-3519 - Added the ability to compute lineage of a CDAP dataset or stream in a given time window.
  • CDAP-3520 - Added RESTful APIs for adding/retrieving/deleting of metadata for apps/programs/datasets/streams.
  • CDAP-3521 - Added the ability to record a dataset or stream access by a CDAP program.
  • CDAP-3522 - Added the capability to search CDAP entities based on their metadata.
  • CDAP-3523 - Added RESTful APIs for searching CDAP entities based on business metadata.
  • CDAP-3527 - Added a data store to manage business metadata of CDAP entities.
  • CDAP-3549 - Added SSH port forwarding to the CDAP virtual machine.
  • CDAP-3556 - Added a data store for recording data accesses by CDAP programs and computing lineage.
  • CDAP-3590 - Added the ability to write to multiple sinks in ETL real-time and batch applications.
  • CDAP-3591 - Added the ability for real-time ETL pipelines to write to multiple sinks.
  • CDAP-3592 - Added the ability for batch ETL pipelines to write to multiple sinks.
  • CDAP-3626 - For the CSV and TSV stream formats, a "mapping" setting can now be specified, mapping stream event columns to schema columns.
  • CDAP-3693 - Added support for CDAP to work with HDP 2.3.
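
To illustrate the idea behind the "mapping" setting for CSV and TSV stream formats (CDAP-3626), the sketch below maps selected columns of a delimited line to named schema fields. The index:field syntax, the helper names, and the sample data are assumptions chosen for illustration; they are not the documented format of the setting.

```python
def parse_mapping(mapping):
    """Parse 'index:field' pairs (hypothetical syntax) into {column_index: field_name}."""
    result = {}
    for pair in mapping.split(","):
        index, field = pair.split(":", 1)
        result[int(index.strip())] = field.strip()
    return result


def apply_mapping(line, mapping, delimiter=","):
    """Build a record from a delimited line using only the mapped columns."""
    columns = line.split(delimiter)
    return {field: columns[i] for i, field in parse_mapping(mapping).items()}


# Column 1 is not mapped, so it is dropped from the resulting record.
record = apply_mapping("alice,ignored,42", "0:user,2:count")
print(record)
```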

Improvements

  • CDAP-1914 - Added documentation of the RESTful endpoint to retrieve the properties of a stream.
  • CDAP-2514 - Added an interface to load a file into a stream from the CDAP UI.
  • CDAP-2809 - The CDAP UI "Errors" pop-up in the main screen now displays the time and date for each error.
  • CDAP-2872 - Updated the Cloudera Manager CSD to use support for logback.
  • CDAP-2950 - Cleaned up the messages shown in the errors dropdown in the CDAP UI.
  • CDAP-3147 - Added a CDAP CLI command to stop a workflow.
  • CDAP-3179 - Added support for upgrading the Hadoop distribution or the HBase version that CDAP is running on.
  • CDAP-3257 - Revised the documentation of the file cdap-default.xml, removed properties no longer in use, and corrected discrepancies between the documentation and the shipped XML file.
  • CDAP-3270 - Improved the help provided in the CDAP CLI for the setting of stream formats.
  • CDAP-3275 - Upgraded netty-http version to 0.12.0.
  • CDAP-3282 - Added a HTTP RESTful API to update the application configuration and artifact version.
  • CDAP-3332 - Added a "clear" button in the CDAP UI for cases where a user decides not to use a pre-populated schema.
  • CDAP-3351 - Defined a directory structure to be used for predefined applications.
  • CDAP-3357 - Added documentation in the source code on adding new commands and completers to the CDAP CLI.
  • CDAP-3393 - In the CDAP UI, added visualization for Workflow tokens in Workflows.
  • CDAP-3419 - HBaseQueueDebugger now shows the minimum queue event transaction write pointer both for each queue and for all queues.
  • CDAP-3443 - Added an example cdap-env.sh to the shipped packages.
  • CDAP-3464 - Added an example in the documentation explaining how to prune invalid transactions from the transaction manager.
  • CDAP-3490 - Modified the CDAP upgrade tool to delete all adapters and the ETLBatch and ETLRealtime ApplicationTemplates.
  • CDAP-3495 - Added the ability to persist the runtime arguments with which a program was run.
  • CDAP-3550 - Added support for writing to Amazon S3 in Avro and Parquet formats from batch ETL applications.
  • CDAP-3564 - Updated CDAP to use Tephra 0.6.2.
  • CDAP-3610 - Updated the transaction debugger client to print checkpoint information.

Bug Fixes

  • CDAP-1697 - Fixed an issue where failed dataset operations via Explore queries did not invalidate the associated transaction.
  • CDAP-1864 - Fixed a problem where users got an incorrect message while creating a dataset in a non-existent namespace.
  • CDAP-1892 - Fixed a problem with services returning the same message for all failures.
  • CDAP-1984 - Fixed a problem where a dataset could be created in a non-existent namespace in standalone mode.
  • CDAP-2428 - Fixed a problem with the CDAP CLI creating file logs.
  • CDAP-2521 - Fixed a problem with the CDAP CLI not auto-completing when setting a stream format.
  • CDAP-2785 - Fixed a problem in the CDAP UI with buttons staying 'in focus' after clicking.
  • CDAP-2809 - The CDAP UI "Errors" pop-up in the main screen now displays the time and date for each error.
  • CDAP-2892 - Fixed a problem with schedules not being deployed in suspended mode.
  • CDAP-3014 - Fixed a problem where failure of a spark node would cause a workflow to restart indefinitely.
  • CDAP-3073 - Fixed an issue with the Standalone CDAP process periodically crashing with Out-of-Memory errors when writing to an Oracle table.
  • CDAP-3101 - Fixed a problem with workflow runs not getting scheduled due to Quartz exceptions.
  • CDAP-3121 - Fixed a problem with discrepancies between the documentation and the defaults actually used by CDAP.
  • CDAP-3200 - Fixed a problem in the CDAP UI with the clone button in an incorrect position when using Firefox.
  • CDAP-3201 - Fixed a problem in the CDAP UI with an incorrect tabbing order when using Firefox.
  • CDAP-3219 - Fixed a problem when specifying the HBase version using the HBASE_VERSION environment variable.
  • CDAP-3233 - Fixed a problem in the CDAP UI with error pop-ups not having a default focus on a button.
  • CDAP-3243 - Fixed a problem in the CDAP UI with the default schema shown for streams.
  • CDAP-3260 - Fixed a problem in the CDAP UI with scrolling on the namespaces dropdown on certain pages.
  • CDAP-3261 - Fixed a problem on Distributed CDAP with the serializing of the metadata artifact causing a stack overflow.
  • CDAP-3305 - Fixed a problem with the CDAP UI not warning users if they exit or close their browser without saving.
  • CDAP-3313 - Fixed a problem in the CDAP UI with refreshing always returning to the overview page.
  • CDAP-3326 - Fixed a problem with the table batch source requiring a row key to be set.
  • CDAP-3343 - Fixed a problem with the application deployment for apps that contain Spark.
  • CDAP-3349 - Fixed a problem with the display of ETL application metrics in the CDAP UI.
  • CDAP-3355 - Fixed a problem in the CDAP examples with the use of a runtime argument, min.pages.threshold.
  • CDAP-3362 - Fixed a problem with the logback-container.xml not being copied into master services.
  • CDAP-3374 - Fixed a problem with warning messages in the logs indicating that programs were running that actually were not running.
  • CDAP-3376 - Fixed a problem with being unable to deploy the SparkPageRank example application on a cluster.
  • CDAP-3386 - Fixed a problem with the Spark classes not being found when running a Spark program through a Workflow in Distributed CDAP on HDP 2.2.
  • CDAP-3394 - Fixed a problem with the deployment of applications through the CDAP UI.
  • CDAP-3399 - Fixed a problem with the SparkPageRankApp example spawning multiple containers in distributed mode due to its number of services.
  • CDAP-3400 - Fixed an issue with warning messages about the notification system every time the CDAP Standalone is restarted.
  • CDAP-3408 - Fixed a problem with running the CDAP Explore Service on CDH 5.[2,3].
  • CDAP-3432 - Fixed a bug where connecting with a certain namespace from the CLI would not immediately display that namespace in the CLI prompt.
  • CDAP-3435 - Fixed an issue where the program status was shown as running even after the program was stopped.
  • CDAP-3442 - Fixed a problem that caused application creation to fail if a config setting was given to an application that does not use a config.
  • CDAP-3449 - Fixed a problem with the readless increment co-processor not handling multiple readless increment columns in the same row.
  • CDAP-3452 - Fixed a problem that prevented the Explore Service from working on clusters with secure Hive 0.14.
  • CDAP-3458 - Fixed a problem where stream events that had already been processed were re-processed in flows.
  • CDAP-3470 - Fixed an issue with error messages being logged during a master process restart.
  • CDAP-3472 - Fixed the error message returned when trying to stop a program started by a workflow.
  • CDAP-3473 - Fixed a problem with a workflow failure not updating a run record for the inner program.
  • CDAP-3530 - Fixed a problem with the CDAP UI performance when rendering flow diagrams with a large number of nodes.
  • CDAP-3563 - Removed faulty and unused metrics around CDAP file resource usage.
  • CDAP-3574 - Fixed an issue with Explore not working on HDP Hive 0.12.
  • CDAP-3603 - Fixed an issue with configuration properties for ETL Transforms being validated at runtime instead of when an application is created.
  • CDAP-3618 - Fixed a problem where suspended schedules were lost when CDAP master was restarted.
  • CDAP-3660 - Fixed an issue where the Hadoop filesystem object was getting instantiated before the Kerberos keytab login was completed, leading to CDAP processes failing after the initial ticket expired.
  • CDAP-3700 - Fixed an issue with the log saver having numerous open connections to HBase, causing it to go Out-of-Memory.
  • CDAP-3711 - Fixed an issue that prevented the downloading of Explore results on a secure cluster.
  • CDAP-3713 - Fixed an issue where certain RESTful APIs were not returning appropriate error messages for internal server errors.
  • CDAP-3716 - Fixed a possible deadlock when CDAP master is restarted with an existing app running on a cluster.

API Changes

  • CDAP-2763 - Added RESTful APIs for managing artifacts.
  • CDAP-2956 - Deprecated the existing API for configuring a workflow action, replacing it with a simpler API.
  • CDAP-3063 - Added CLI commands for managing artifacts.
  • CDAP-3064 - Added an ArtifactClient to interact with Artifact HTTP RESTful APIs.
  • CDAP-3283 - Added artifact information to Application RESTful APIs and the means to filter applications by artifact name and version.
  • CDAP-3324 - Added a RESTful API for creating an application from an artifact.
  • CDAP-3367 - Added the ability to delete an artifact.
  • CDAP-3488 - Changed the ETLBatchTemplate from an ApplicationTemplate to an Application.
  • CDAP-3535 - Added an API for programs to retrieve their application specification at runtime.
  • CDAP-3554 - Changed the plugin types from 'source' to either 'batchsource' or 'realtimesource', and from 'sink' to either 'batchsink' or 'realtimesink' to reflect that the plugins implement different interfaces.
  • CDAP-1554 - Moved constants for default and system namespaces from Common to Id.
  • CDAP-3388 - Added interfaces to cdap-spi that abstract StreamEventRecordFormat (and dependent interfaces) so users can extend the cdap-spi interfaces.
  • CDAP-3583 - Added a RESTful API for retrieving the metadata associated with a particular run of a CDAP program.
  • CDAP-3632 - Added a RESTful API for computing lineage of a CDAP dataset or stream.
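
The plugin type renaming described in CDAP-3554 can be pictured as a small mapping from the old type names to the new ones. This is a minimal sketch, not part of any CDAP tool: the function migrate_plugin_type is hypothetical, and passing other plugin types through unchanged is an assumption based on the note mentioning only 'source' and 'sink'.

```python
def migrate_plugin_type(old_type, pipeline_kind):
    """Map a pre-3.2.0 plugin type to its 3.2.0 name.

    pipeline_kind is either 'batch' or 'realtime'.
    """
    if pipeline_kind not in ("batch", "realtime"):
        raise ValueError(f"unknown pipeline kind: {pipeline_kind}")
    if old_type in ("source", "sink"):
        # 'source' -> 'batchsource' / 'realtimesource', 'sink' -> 'batchsink' / 'realtimesink'
        return pipeline_kind + old_type
    # Assumption: types not named in CDAP-3554 keep their names.
    return old_type


for kind in ("batch", "realtime"):
    print(kind, migrate_plugin_type("source", kind), migrate_plugin_type("sink", kind))
```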

Deprecated and Removed Features

  • See the CDAP 3.2.0 Javadocs for a list of deprecated and removed APIs.
  • CDAP-2667 - Removed application templates and adapters RESTful APIs, as these templates and adapters have been replaced with applications that can be controlled with the Lifecycle HTTP RESTful API.
  • CDAP-2951 - Removed deprecated methods in cdap-client.
  • CDAP-3596 - Replaced the ETL ApplicationTemplates with the new ETL Applications.

Known Issues

  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report the inability to process old data in the old format.
  • CDAP-797 - When running secure Hadoop clusters, debug logs from MapReduce programs are not available.
  • CDAP-1007 - If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive, but becomes unusable. To correct, restart the CDAP Master (which will restart all services) as described under "Starting CDAP Services" for your particular Hadoop distribution in the Installation documentation.
  • CDAP-1587 - CDAP internally creates tables in the "user" space that begin with the word "system". User datasets with names starting with "system" can conflict if they were to match one of those names. To avoid this, do not start any datasets with the word "system".
  • CDAP-2632 - The application in the cdap-kafka-ingest-guide does not run on Ubuntu 14.x as of CDAP 3.0.x.
  • CDAP-2721 - Metrics for FileSets can show zero values even if there is data present, because FileSets do not emit metrics (CDAP-587 <https://issues.cask.co/browse/CDAP-587>).
  • CDAP-2831 - A workflow that is scheduled by time will not be run between the failure of the primary master and the time that the secondary takes over. This scheduled run will not be triggered at all.
  • CDAP-2945 - If the input partition filter for a PartitionedFileSet does not match any partitions, MapReduce jobs can fail.
  • CDAP-3000 - The Workflow token is in an inconsistent state for nodes in a fork while the nodes of the fork are still running. It becomes consistent after the join.
  • CDAP-3221 - When running in Standalone CDAP, if a MapReduce job fails repeatedly, then the SDK hits an out-of-memory exception due to perm gen. The Standalone needs restarting at this point.
  • CDAP-3262 - For Microsoft Windows, the Standalone CDAP scripts can fail when used with a JAVA_HOME that is defined as a path with spaces in it. A workaround is to use a definition of JAVA_HOME that does not include spaces, such as C:\PROGRA~1\Java\jdk1.7.0_79\bin or C:\ProgramData\Oracle\Java\javapath.
  • CDAP-3492 - In the CDAP CLI, executing select * from a dataset with many fields generates an error.
  • CDAP-3641 - A RESTful API call to retrieve workflow statistics hangs if units (such as "s" for seconds) are not provided as part of the query.
  • CDAP-3697 - CDAP Explore is broken on secure CDH 5.1.
  • CDAP-3698 - CDAP Explore is unable to get a delegation token while fetching next results on HDP 2.0.
  • CDAP-3749 - The DBSource plugin does not allow a username with an empty password.
  • CDAP-3750 - If a table schema contains a field name that is a reserved word in the Hive DDL, 'enable explore' fails.
  • CDAP-3819 - The Cassandra source does not handle spaces properly in column fields which require a comma-separated list.

Release 3.1.0

New Features

MapR 4.1 Support, HDP 2.2 Support, CDH 5.4 Support

  • CDAP-1614 - Added HBase 1.0 support.
  • CDAP-2318 - Made CDAP work on the HDP 2.2 distribution.
  • CDAP-2786 - Added support to CDAP 3.1.0 for the MapR 4.1 distro.
  • CDAP-2798 - Added Hive 0.14 support.
  • CDAP-2801 - Added CDH 5.4 Hive 1.1 support.
  • CDAP-2836 - Added support for restart of specific CDAP System Services Instances.
  • CDAP-2853 - Completed certification process for MapR on CDAP.
  • CDAP-2879 - Added Hive 1.0 in Standalone.
  • CDAP-2881 - Added support for HDP 2.2.x.
  • CDAP-2891 - Documented cdap-env.sh and settings OPTS for HDP 2.2.
  • CDAP-2898 - Added Hive 1.1 in Standalone.
  • CDAP-2953 - Added HiveServer2 support in a secure cluster.

Spark

  • CDAP-344 - Users can now run Spark in distributed mode.
  • CDAP-1993 - Added ability to manipulate the SparkConf.
  • CDAP-2700 - Added the ability for Spark programs to discover CDAP services in distributed mode.
  • CDAP-2701 - Spark programs are able to collect Metrics in distributed mode.
  • CDAP-2703 - Users are able to collect/view logs from Spark programs in distributed mode.
  • CDAP-2705 - Added examples, guides, and documentation for Spark in distributed mode. The LogAnalysis application demonstrates parallel execution of Spark and MapReduce programs using Workflows.
  • CDAP-2923 - Added support for the WorkflowToken in the Spark programs.
  • CDAP-2936 - Spark programs can now specify resource usage for the driver and executor processes in distributed mode.

Workflows

  • CDAP-1983 - Added example application for processing and analyzing Wikipedia data using Workflows.
  • CDAP-2709 - Added ability to add generic keys to the WorkflowToken.
  • CDAP-2712 - Added ability to update the WorkflowToken in MapReduce and Spark programs.
  • CDAP-2713 - Added ability to persist the WorkflowToken per run of the Workflow.
  • CDAP-2714 - Added ability to query the WorkflowToken for the past as well as currently running Workflow runs.
  • CDAP-2752 - Added ability for custom actions to access the CDAP datasets and services.
  • CDAP-2894 - Added an API to retrieve the system properties (e.g. MapReduce counters in the case of a MapReduce program) from the WorkflowToken.
  • CDAP-2923 - Added support for the WorkflowToken in the Spark programs.
  • CDAP-2982 - Added verification that the Workflow contains all programs/custom actions with a unique name.

Datasets

  • CDAP-347 - Users can now use datasets in beforeSubmit and afterFinish.

  • CDAP-585 - Changes to Spark program runner to use File dataset in Spark. Spark programs can now use file-based datasets.

  • CDAP-2734 - Added PartitionedFileSet support for setting and getting properties at the partition level.

  • CDAP-2746 - PartitionedFileSets now record the creation time of each partition in the metadata.

  • CDAP-2747 - PartitionedFileSets now index the creation time of partitions to allow selection of partitions that were created after a given time. Introduced BatchPartitionConsumer as a way to incrementally consume new data in a PartitionedFileSet.

  • CDAP-2752 - Added ability for custom actions to access the CDAP datasets and services.

  • CDAP-2758 - FileSets now support existing HDFS locations.

    Base paths that start with "/" are now treated as absolute in the file system. Previously, an absolute base path for a (Partitioned)FileSet was interpreted as relative to the namespace's data directory. Newly created FileSets interpret absolute base paths as absolute in the file system.

    Introduced a new property for (Partitioned)FileSets named "data.external". If true, the base path of the FileSet is assumed to be managed by some external process. That is, the FileSet will not attempt to create the directory, it will not delete any files when the FileSet is dropped or truncated, and it will not allow adding or deleting files or partitions. In other words, the FileSet is read-only.

  • CDAP-2784 - Added support to write to PartitionedFileSet Partition metadata from MapReduce.

  • CDAP-2822 - IndexedTable now supports scans on the indexed field.
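
The base path rules and the "data.external" property described under CDAP-2758 can be summarized in a short sketch. It models the described behavior only and is not CDAP code: the helper names, the example namespace data directory, and the handling of the property value as a string are assumptions.

```python
import posixpath


def resolve_base_path(base_path, namespace_data_dir):
    """Resolve a FileSet base path as described for newly created FileSets."""
    if base_path.startswith("/"):
        # Absolute base paths are used as-is in the file system.
        return base_path
    # Relative base paths are resolved against the namespace's data directory.
    return posixpath.join(namespace_data_dir, base_path)


def is_read_only(properties):
    """An external FileSet is managed elsewhere, so it is treated as read-only."""
    return str(properties.get("data.external", "false")).lower() == "true"


# Hypothetical namespace data directory, used only for this example.
data_dir = "/cdap/namespaces/default/data"
print(resolve_base_path("events/raw", data_dir))
print(resolve_base_path("/landing/events", data_dir))
print(is_read_only({"data.external": "true"}))
```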

Metrics

  • CDAP-2975 - Added pre-split FactTables.
  • CDAP-2326 - Added better unit-test coverage for Cube dataset.
  • CDAP-1853 - Metrics processor scaling no longer needs a master services restart.
  • CDAP-2844 - MapReduce metrics collection no longer uses counters, and instead reports directly to Kafka.
  • CDAP-2701 - Spark programs are able to collect Metrics in distributed mode.
  • CDAP-2466 - Added CLI for metrics search and query.
  • CDAP-2236 - New CDAP UI switched over to using newer search/query APIs.
  • CDAP-1998 - Removed deprecated Context - Query param in Metrics v3 API.

Miscellaneous New Features

  • CDAP-332 - Added a RESTful endpoint for deleting Streams.
  • CDAP-1483 - QueueAdmin now uses Id.Namespace instead of simply String.
  • CDAP-1584 - CDAP CLI now shows the username in the CLI prompt.
  • CDAP-2139 - Removed a duplicate Table of Contents on the Documentation Search page.
  • CDAP-2515 - Added a metrics client for search and query by tags.
  • CDAP-2582 - Documented the licenses of the shipped CDAP UI components.
  • CDAP-2595 - Added data modelling of flows.
  • CDAP-2596 - Added data modelling of MapReduce.
  • CDAP-2617 - Added the capability to get logs for a given time range from CLI.
  • CDAP-2618 - Simplified the Cube sink configurations.
  • CDAP-2670 - Added Parquet sink with time partitioned file dataset.
  • CDAP-2739 - Added S3 batch source for ETLbatch.
  • CDAP-2802 - Stopped using HiveConf.ConfVars.defaultValue, to support Hive >0.13.
  • CDAP-2847 - Added ability to add custom filters to FileBatchSource.
  • CDAP-2893 - Custom Transform now parses log formats for ETL.
  • CDAP-2913 - Provided installation method for EMR.
  • CDAP-2915 - Added an SQS real-time plugin for ETL.
  • CDAP-3022 - Added Cloudfront format option to LogParserTransform.
  • CDAP-3032 - Documented TestConfiguration class usage in unit-test framework.

Improvements

  • CDAP-593 - Spark no longer determines the mode through MRConfig.FRAMEWORK_NAME.
  • CDAP-595 - Refactored SparkRuntimeService and SparkProgramWrapper.
  • CDAP-665 - Documentation received a product-specific 404 Page.
  • CDAP-683 - Changed all README files from markdown to rst format.
  • CDAP-1132 - Improved the CDAP Doc Search Result Sorting.
  • CDAP-1416 - Added links to upper level pages on Docs.
  • CDAP-1572 - Standardized Id classes.
  • CDAP-1583 - Refactored InMemoryWorkerRunner and ServiceProgramRunnner after ServiceWorkers were removed.
  • CDAP-1918 - Switched to using the Spark 1.3.0 release.
  • CDAP-1926 - Streams endpoint accepts "now", "now-30s", etc., for time ranges.
  • CDAP-2007 - CLI output for "call service" is rendered in a copy-pastable manner.
  • CDAP-2310 - Kafka Source is now able to apply a schema to the payload received.
  • CDAP-2388 - Added Java 8 support to CDAP.
  • CDAP-2422 - Removed redundant catch blocks in AdapterHttpHandler.
  • CDAP-2455 - Version in CDAP UI footer is dynamic.
  • CDAP-2482 - Reduced excessive capitalisation in documentation.
  • CDAP-2531 - Adapter details made available through CDAP UI.
  • CDAP-2539 - Added a build identifier (branch, commit) in header of Documentation HTML pages.
  • CDAP-2552 - Documentation Build script now flags errors.
  • CDAP-2554 - Documented that streams can now be deleted.
  • CDAP-2557 - Non-handler logic moved out of DatasetInstanceHandler.
  • CDAP-2570 - CLI prompt changes to 'DISCONNECTED' after CDAP is stopped.
  • CDAP-2578 - Ability to look at configs of created adapters.
  • CDAP-2585 - Use Id in cdap-client rather than Id.Namespace + String.
  • CDAP-2588 - Improvements to the MetricsClient APIs.
  • CDAP-2590 - Switching namespaces when in CDAP UI Operations screens.
  • CDAP-2620 - CDAP clients now use Id classes from cdap proto, instead of plain strings.
  • CDAP-2628 - CDAP UI: Breadcrumbs in Workflow/MapReduce work as expected.
  • CDAP-2644 - In cdap-clients, no longer need to retrieve runtime arguments before starting a program.
  • CDAP-2651 - CDAP UI: the Namespace is made more prominent.
  • CDAP-2681 - CDAP UI: scrolling no longer enlarges the workflow diagram instead of scrolling through.
  • CDAP-2683 - CDAP UI: added remove icons for Fork and Join.
  • CDAP-2684 - CDAP UI: workflow diagrams are directed graphs.
  • CDAP-2688 - CDAP UI: added search & pagination for lists of apps and datasets.
  • CDAP-2689 - CDAP UI: shows which application is a part of which dataset.
  • CDAP-2691 - CDAP UI: added ability to delete streams.
  • CDAP-2692 - CDAP UI: added pagination for logs.
  • CDAP-2694 - CDAP UI: added a loading icon/UI element when creating an adapter.
  • CDAP-2695 - CDAP UI: long names of adapters are replaced by a short version ending in an ellipsis.
  • CDAP-2697 - CDAP UI: added tab names during adapter creation.
  • CDAP-2716 - CDAP UI: when creating an adapter, the tabbing order shows correctly.
  • CDAP-2733 - Implemented a TimePartitionedFileSet source.
  • CDAP-2811 - Improved Hive version detection.
  • CDAP-2921 - Removed backward-compatibility for pre-2.8 TPFS.
  • CDAP-2938 - Implemented new ETL application template creation.
  • CDAP-2983 - Spark program runner now calls onFailure() of the DatasetOutputCommitter.
  • CDAP-2986 - Spark programs are now able to specify runtime arguments when reading or writing a dataset.
  • CDAP-2987 - Added an example for Spark using datasets directly.
  • CDAP-2989 - Added an example for Spark using FileSets.
  • CDAP-3018 - Updated workflow guides for workflow token.
  • CDAP-3028 - Improved the system service restart endpoint to handle illegal instance IDs and "service not available".
  • CDAP-3053 - Added schema javadocs that explain how to write the schema to JSON.
  • CDAP-3077 - Add the ability in TableSink to find schema.row.field case-insensitively.
  • CDAP-3144 - Changed CLI command descriptions to use consistent element case.
  • CDAP-3152 - Refactored ETLBatch sources and sinks.
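
CDAP-1926 above notes that the Streams endpoint accepts relative expressions such as "now" and "now-30s" for time ranges. As a rough illustration of how a client might resolve such an expression to an epoch timestamp, here is a minimal Python sketch. The function name `resolve_relative_time`, the units other than seconds, and the support for a `+` offset are assumptions made for this example; they are not taken from the CDAP implementation.

```python
import re

# Hypothetical unit table: the release note only shows "s" (seconds);
# "m", "h", and "d" are assumptions made for this sketch.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Matches "now" on its own, or "now" followed by a signed offset such as "-30s".
_PATTERN = re.compile(r"^now(?:([+-])(\d+)([smhd]))?$")


def resolve_relative_time(expr: str, now: int) -> int:
    """Resolve 'now' or 'now-30s' style expressions to epoch seconds.

    `now` is the reference time in epoch seconds, passed in explicitly so
    that the function is deterministic.
    """
    match = _PATTERN.match(expr.strip())
    if match is None:
        raise ValueError(f"unsupported time expression: {expr!r}")
    sign, amount, unit = match.groups()
    if sign is None:
        # Plain "now": no offset to apply.
        return now
    offset = int(amount) * _UNIT_SECONDS[unit]
    return now - offset if sign == "-" else now + offset
```

With a reference time of 1000, `resolve_relative_time("now-30s", 1000)` yields 970, so a range from "now-30s" to "now" covers the last 30 seconds.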

Bug Fixes

  • CDAP-23 - Fixed a problem with the DatasetFramework not loading a given dataset with the same classloader across calls.
  • CDAP-68 - Made sure all network services in Singlenode only bind to localhost.
  • CDAP-376 - Fixed a problem with HBaseOrderedTable never calling HTable.close().
  • CDAP-550 - Consolidated Examples, Guides, and Tutorials styles.
  • CDAP-598 - Fixed problems with the CDAP ClassLoading model.
  • CDAP-674 - Fixed problems with CDAP code examples and versioning.
  • CDAP-814 - Resolved issues in the documentation about element versus program.
  • CDAP-1042 - Fixed a problem with specifying dataset selection as input for Spark job.
  • CDAP-1145 - Fixed the PurchaseAppTest.
  • CDAP-1184 - Fixed a problem with the DELETE call not clearing queue metrics.
  • CDAP-1273 - Fixed a problem with the ProgramClassLoader getResource.
  • CDAP-1457 - Fixed a memory leak of user class after running Spark program.
  • CDAP-1552 - Fixed a problem with MapReduce progress metrics not being interpolated.
  • CDAP-1868 - Fixed a problem with Java Client and CLI not setting dataset properties on existing datasets.
  • CDAP-1873 - Fixed a problem with warnings and errors when CDAP-Master starts up.
  • CDAP-1967 - Fixed a problem with CDAP-Master failing to start up due to conflicting dependencies.
  • CDAP-1976 - Fixed a problem with examples not following the same pattern.
  • CDAP-1988 - Fixed a problem with creating a Dataset through REST API failing if no properties are provided.
  • CDAP-2081 - Fixed a problem with StreamSizeSchedulerTest failing randomly.
  • CDAP-2140 - Fixed a problem with the CDAP UI not showing system service status when system services are down.
  • CDAP-2177 - Enabled and fixed LogSaverPluginTest.
  • CDAP-2208 - Fixed a problem with CDAP-Explore service failing on wrapped indexedTable with Avro (specific record) contents.
  • CDAP-2228 - Fixed a problem with MapReduce not working in Hadoop 2.2.
  • CDAP-2254 - Fixed a problem with an incorrect error message returned by HTTP RESTful Handler.
  • CDAP-2258 - Fixed a problem with an internal error when attempting to start a non-existing program.
  • CDAP-2279 - Fixed a problem with namespace and gear widgets disappearing when the browser window is too narrow.
  • CDAP-2280 - Fixed a problem where starting a flow from the GUI did not fully refresh the page.
  • CDAP-2341 - Fixed a problem where a MapReduce program that failed to start could no longer be started or stopped.
  • CDAP-2343 - Fixed a problem in the CDAP UI where MapReduce logs were mixed with system logs.
  • CDAP-2344 - Fixed a problem with the formatting of logs in the CDAP UI.
  • CDAP-2355 - Fixed a problem with an Adapter CLI help error.
  • CDAP-2356 - Fixed a problem with CLI autocompletion results not sorted in alphabetical order.
  • CDAP-2365 - Fixed a problem that when restarting CDAP-Master, the CDAP UI oscillates between being up and down.
  • CDAP-2376 - Fixed a problem with logs from mapper and reducer not being collected.
  • CDAP-2444 - Fixed problems in the Cloudera configuration documentation.
  • CDAP-2446 - Fixed a problem with examples needing to be updated for the new CDAP UI.
  • CDAP-2454 - Fixed a problem with Proto class RunRecord containing the Apache Twill RunId when serialized in REST API response.
  • CDAP-2459 - Fixed a problem with the CDAP UI going into a loop when the Router returns 200 and App Fabric is not up.
  • CDAP-2474 - Fixed a problem with changing the format of the name for the connectionfactory in JMS source plugin.
  • CDAP-2475 - Fixed a problem with JMS source accepting the type and name of the JMS provider plugin.
  • CDAP-2480 - Fixed a problem with the Workflow current run info endpoint missing a /runs/ in the path.
  • CDAP-2489 - Fixed a problem when, in distributed mode and CDAP master restarted, status of the running PROGRAM is always returned as STOPPED.
  • CDAP-2490 - Fixed a problem with checking if invalid Run Records for Spark and MapReduce are part of run from Workflow child programs.
  • CDAP-2491 - Fixed a problem with the MapReduce program in standalone mode not always using LocalJobRunnerWithFix.
  • CDAP-2493 - After the fix for CDAP-2474 (ConnectionFactory in JMS source), the JSON file requires updating for the change to reflect in CDAP UI.
  • CDAP-2496 - Fixed a problem with CDAP using its own transaction snapshot codec.
  • CDAP-2498 - Fixed a problem with validation while creating adapters only by types and not also by values.
  • CDAP-2517 - Fixed a problem with Explore docs not mentioning partitioned file sets.
  • CDAP-2520 - Fixed a problem with StreamSource not liking values of '0m'.
  • CDAP-2522 - Fixed a problem with TransactionStateCache needing to reference Tephra SnapshotCodecV3.
  • CDAP-2529 - Fixed a problem with CLI not printing an error if it can't connect to CDAP.
  • CDAP-2530 - Fixed a problem with custom RecordScannable<StructuredRecord> datasets not being explorable.
  • CDAP-2535 - Fixed a problem with IntegrationTestManager deployApplication not being namespaced.
  • CDAP-2538 - Fixed a problem with handling extra whitespace in CLI input.
  • CDAP-2540 - Fixed a problem with the Preferences Namespace CLI help having errors.
  • CDAP-2541 - Added the ability to stop the particular run of a program. Allows concurrent runs of the MapReduce and Workflow programs and the ability to stop programs at a per-run level.
  • CDAP-2547 - Fixed a problem with the Kafka Source not using the persisted offset when the Adapter is restarted.
  • CDAP-2549 - Fixed a problem with a suspended workflow run record not being removed upon app/namespace delete.
  • CDAP-2562 - Fixed a problem with the automated Doc Build failing in develop.
  • CDAP-2564 - Improved the management of dataset resources.
  • CDAP-2565 - Fixed a problem with the transaction latency metric being of incorrect type.
  • CDAP-2569 - Fixed a problem with master process not being resilient to zookeeper exceptions.
  • CDAP-2571 - Fixed a problem with the RunRecord thread not being resilient to errors.
  • CDAP-2587 - Fixed a problem with being unable to create default namespaces on starting up SDK.
  • CDAP-2635 - Fixed a problem with Namespace Create ignoring the properties' config field.
  • CDAP-2636 - Fixed a problem with "out of perm gen" space in CDAP Explore service.
  • CDAP-2654 - Fixed a problem with False values showing up as 'false null' in the CDAP Explore UI.
  • CDAP-2685 - Fixed a problem with the CDAP UI: no empty box for transforms.
  • CDAP-2729 - Fixed a problem with CDAP UI not handling downstream system services gracefully.
  • CDAP-2740 - Fixed a problem with CDAP UI not gracefully handling when the nodejs server goes down.
  • CDAP-2748 - Fixed a problem with the currently running and completed status of Spark programs in a workflow not highlighted in the UI.
  • CDAP-2765 - Fixed a problem with security warnings when CLI starts up.
  • CDAP-2766 - Fixed a problem with CLI asking for the user/password twice.
  • CDAP-2767 - Fixed a problem with incorrect error messages for namespace deletion.
  • CDAP-2768 - Fixed a problem with CLI and UI listing system.queue as a dataset.
  • CDAP-2769 - Changed InMemoryServiceProgramRunner to use co.cask.cdap.common.app.RunIds instead of org.apache.twill.internal.RunIds.
  • CDAP-2787 - Fixed a problem with the number of MapReduce task metrics going over the limit and causing MapReduce to fail.
  • CDAP-2796 - Fixed a problem with emitting duplicate metrics for dataset ops.
  • CDAP-2803 - Fixed a problem with scan operations not reflecting in dataset ops metrics.
  • CDAP-2804 - Fixed a problem with DataSetRecordReader incorrectly reporting dataset ops metrics.
  • CDAP-2810 - Fixed a problem with IncrementAndGet, CompareAndSwap, and Delete ops on Table incorrectly reporting two writes each.
  • CDAP-2821 - Fixed a problem with a Spark native library linkage error causing Standalone CDAP to stop.
  • CDAP-2823 - Fixed a problem with the conversion from Avro and to Avro not taking into account nested records.
  • CDAP-2830 - Fixed a problem with CDAP UI dying when CDAP Master is killed.
  • CDAP-2832 - Fixed a problem where suspending a schedule takes a long time and the CDAP UI does not provide any indication.
  • CDAP-2838 - Fixed a problem with a poor error message when there is a mistake in the security configuration.
  • CDAP-2839 - Fixed a problem with the CDAP start script needing updating for the correct Node.js version.
  • CDAP-2848 - Fixed a problem with the Preferences Client test.
  • CDAP-2849 - Fixed a problem with the FileBatchSource reading files in twice if it takes longer than one workflow cycle to complete the job.
  • CDAP-2851 - Fixed a problem with RPM and DEB release artifacts being uploaded to incorrect staging directory.
  • CDAP-2854 - Fixed a problem with the instructions for using Docker.
  • CDAP-2855 - Fixed a problem with the example builds in VM failing with a Maven dependency error.
  • CDAP-2860 - Fixed a problem with the documentation for updating dataset properties.
  • CDAP-2861 - Fixed a problem with CDAP UI not mentioning required fields in all entry forms.
  • CDAP-2862 - Fixed a problem with CDAP UI creating multiple namespaces with the same name.
  • CDAP-2866 - Fixed a problem with FileBatchSource not reattempting to read in files if there is a failure.
  • CDAP-2870 - Fixed a problem with Workflow Diagrams.
  • CDAP-2871 - Fixed a problem with the Cloudera Manager Hbase Gateway dependency.
  • CDAP-2895 - Fixed a problem with a put operation on the WorkflowToken not throwing an exception.
  • CDAP-2899 - Fixed a problem with MapReduce local dirs not getting cleaned up.
  • CDAP-2900 - Fixed a problem with exposing app.template.dir as a config option.
  • CDAP-2904 - Fixed a problem with "Make Request" button overlapping with paths when a path is long.
  • CDAP-2912 - Fixed a problem with HBaseQueueDebugger not sorting queue barriers correctly.
  • CDAP-2922 - Fixed a problem with datasets created through DynamicDatasetContext not having metrics context. Datasets in MapReduce and Spark programs, and workers, were not emitting metrics.
  • CDAP-2925 - Fixed a problem with the documentation on how to create datasets with properties.
  • CDAP-2932 - Fixed a problem with the AdapterClient getRuns method constructing a malformed URL.
  • CDAP-2935 - Fixed a problem with the logs endpoint to retrieve the latest entry not working correctly.
  • CDAP-2940 - Fixed a problem with the test case ArtifactStoreTest#testConcurrentSnapshotWrite.
  • CDAP-2941 - Fixed a problem with the ScriptTransform failing to initialize.
  • CDAP-2942 - Fixed a problem with the CDAP UI namespace dropdown failing on standalone restart.
  • CDAP-2948 - Fixed a problem with creating Adapters.
  • CDAP-2952 - Fixed a problem with the plugin avro library not being accessible to MapReduce.
  • CDAP-2955 - Fixed a problem with a NoSuchMethodException when trying to explore Datasets/Stream.
  • CDAP-2971 - Fixed a problem with the dataset registration not registering datasets for applications upon deploy.
  • CDAP-2972 - Fixed a problem with being unable to instantiate dataset in ETLWorker initialization.
  • CDAP-2981 - Fixed a problem with undoing a FileSets upgrade in favor of versioning and backward-compatibility.
  • CDAP-2991 - Fixed a problem with Explore not working when it launches a MapReduce job.
  • CDAP-2992 - Fixed a problem with CLI broken for secure CDAP.
  • CDAP-2996 - Fixed a problem with CDAP UI: Stop Run and Suspend Run buttons needed styling updates.
  • CDAP-2997 - Fixed a problem with SparkProgramRunnerTest failing randomly.
  • CDAP-2999 - Fixed a problem with MapReduce jobs showing the duration for tasks as 17 days before the mapper starts.
  • CDAP-3001 - Fixed a problem with truncating a custom dataset failing with internal server error.
  • CDAP-3002 - Fixed a problem with tick initialDelay not working properly.
  • CDAP-3003 - Fixed a problem with user metrics emitted from flowlets not being queryable using the flow's tags.
  • CDAP-3006 - Fixed a problem with updating cdap-spark-* archetypes.
  • CDAP-3007 - Fixed a problem with testing all Spark apps/guides to work with 3.1 (in dist mode).
  • CDAP-3009 - Fixed a problem with the stream conversion upgrade being in the upgrade tool in 3.1.
  • CDAP-3010 - Fixed a problem with a Bower Dependency Error.
  • CDAP-3011 - Fixed a problem with the IncrementSummingScannerTest failing intermittently.
  • CDAP-3012 - Fixed a problem with the DistributedWorkflowProgramRunner localizing the spark-assembly.jar even if the workflow does not contain a Spark program.
  • CDAP-3013 - Fixed a problem with excluding a Spark assembly jar when building a MapReduce job jar.
  • CDAP-3019 - Fixed a problem with the PartitionedFileSet dropPartition not deleting files under the partition.
  • CDAP-3021 - Fixed a problem with allowing Cloudfront data to use BatchFileFilter.
  • CDAP-3023 - Fixed a problem with flowlet instance count defaulting to 1.
  • CDAP-3024 - Fixed a problem with surfacing more logs in CDAP UI for System Services.
  • CDAP-3026 - Fixed a problem with updating SparkPageRank example docs.
  • CDAP-3027 - Fixed a problem with the DFSStreamHeartbeatsTest failing on clusters.
  • CDAP-3030 - Fixed a problem with the loading of custom datasets being broken after upgrading.
  • CDAP-3031 - Fixed a problem with deploying an app with a dataset with an invalid base path returning an "internal error".
  • CDAP-3037 - Fixed a problem with not being able to use a PartitionedFileSet in a custom dataset. If a custom dataset embedded a Table and a PartitionedFileSet, loading the dataset at runtime would fail.
  • CDAP-3038 - Fixed a problem with logs not showing up in UI when using Spark.
  • CDAP-3039 - Fixed a problem with worker not stopping at the end of a run method in standalone.
  • CDAP-3040 - Fixed a problem with flowlet and stream metrics not being available in distributed mode.
  • CDAP-3042 - Fixed a problem with the BufferingTable not merging buffered writes with multi-get results.
  • CDAP-3043 - Fixed a problem with the Javadocs being broken.
  • CDAP-3044 - Fixed a problem with the user service 'methods' field in service specifications being inaccurate.
  • CDAP-3058 - Fixed a problem with the NamespacedLocationFactory not appending correctly.
  • CDAP-3066 - Fixed a problem with FileBatchSource not failing properly.
  • CDAP-3067 - Fixed a problem with the UpgradeTool throwing a NullPointerException during UsageRegistry.upgrade().
  • CDAP-3070 - Fixed a problem on Ubuntu 14.10 where removing JSON files from templates/plugins/ETLBatch breaks adapters.
  • CDAP-3072 - Fixed a problem with a documentation JavaScript bug.
  • CDAP-3073 - Fixed a problem with out-of-memory perm gen space.
  • CDAP-3085 - Fixed a problem with adding integration tests for datasets.
  • CDAP-3086 - Fixed a problem with the CDAP UI current adapter UI.
  • CDAP-3087 - Fixed a problem with CDAP UI: a session timeout on secure mode.
  • CDAP-3088 - Fixed a problem with CDAP UI: application types need to be updated.
  • CDAP-3092 - Fixed a problem with reading multiple files with one mapper in FileBatchSource.
  • CDAP-3096 - Fixed a problem with running MapReduce on HDP 2.2.
  • CDAP-3098 - Fixed problems with the CDAP UI Adapter UI.
  • CDAP-3099 - Fixed a problem in the CDAP UI where settings icons shift 2px when you click on them.
  • CDAP-3104 - Fixed a problem with CDAP Explore throwing an exception if a Table dataset does not set schema.
  • CDAP-3105 - Fixed a problem with LogParserTransform needing to emit HTTP status code info.
  • CDAP-3106 - Fixed a problem with Hive query - local MapReduce task failure on CDH-5.4.
  • CDAP-3125 - Fixed a problem with the WorkerProgramRunnerTest failing intermittently.
  • CDAP-3127 - Fixed a problem with the Kafka guide not working with CDAP 3.1.0.
  • CDAP-3132 - Fixed a problem with the ProgramLifecycleHttpHandlerTest failing intermittently.
  • CDAP-3145 - Fixed a problem with the Metrics processor not processing metrics.
  • CDAP-3146 - Fixed a problem with the CDAP VM build failing to install the Eclipse plugin.
  • CDAP-3148 - Fixed a problem with CDAP Explore MapReduce queries failing due to MR-framework being localized in the mapper container.
  • CDAP-3149 - Fixed a problem with cycles in an adapter create page causing the browser to freeze.
  • CDAP-3151 - Fixed a problem with CDAP examples shipped with SDK using JDK 1.6.
  • CDAP-3161 - Fixed a problem with MapReduce no longer working with default Cloudera manager settings.
  • CDAP-3173 - Fixed a problem with upgrading to 3.1.0 crashing the HBase co-processor.
  • CDAP-3174 - Fixed a problem with the ETL source/transform/sinks descriptions and documentation.
  • CDAP-3175 - Fixed a problem with the AbstractFlowlet constructors being deprecated when they should not be.

Deprecated and Removed Features

Known Issues

  • See the above section (API Changes) for alterations that can affect existing installations.
  • CDAP has been tested on and supports CDH 4.2.x through CDH 5.4.4, HDP 2.0 through 2.6, and Apache Bigtop 0.8.0. It has not been tested on more recent versions of CDH. See our Hadoop/HBase Environment configurations.
  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report that data in the old format cannot be processed.
  • Retrieving multiple metrics (by issuing an HTTP POST request with a JSON list as the request body that enumerates the name and attributes for each metric) is currently not supported in the Metrics HTTP RESTful API v3. Instead, use the v2 API. It will be supported in a future release.
  • CDAP-797 - When running secure Hadoop clusters, metrics and debug logs from MapReduce programs are not available.
  • CDAP-1007 - If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive, but becomes unusable. To correct, restart the CDAP Master (which will restart all services) as described under "Starting CDAP Services" for your particular Hadoop distribution in the Installation documentation.
  • CDAP-1587 - CDAP internally creates tables in the "user" space that begin with the word "system". User datasets with names starting with "system" can conflict if they were to match one of those names. To avoid this, do not start any datasets with the word "system".
  • CDAP-1864 - Creating a dataset in a non-existent namespace manifests in the RESTful API with an incorrect error message.
  • CDAP-2632 - The application in the cdap-kafka-ingest-guide does not run on Ubuntu 14.x as of CDAP 3.0.x.
  • CDAP-2785 - In the CDAP UI, many buttons will remain in focus after being clicked, even if they should not retain focus.
  • CDAP-2831 - A workflow that is scheduled by time will not be run between the failure of the primary master and the time that the secondary takes over. This scheduled run will not be triggered at all.
  • CDAP-2878 - The semantics for TTL are confusing, in that the Table TTL property is interpreted as milliseconds in some contexts: DatasetDefinition.configure() and getAdmin().
  • CDAP-2945 - If the input partition filter for a PartitionedFileSet does not match any partitions, MapReduce jobs can fail.
  • CDAP-3000 - The Workflow token is in an inconsistent state for nodes in a fork while the nodes of the fork are still running. It becomes consistent after the join.
  • CDAP-3101 - If there are more than 30 concurrent runs of a workflow, the runs will not be scheduled due to a Quartz exception.
  • CDAP-3179 - If you are using CDH 5.3 (CDAP 3.0.0) and are upgrading to CDH 5.4 (CDAP 3.1.0), you must first upgrade the underlying HBase before you upgrade CDAP. This means that you perform the CDH upgrade before upgrading CDAP.
  • CDAP-3189 - Large MapReduce jobs can cause excessive logging in the CDAP logs.
  • CDAP-3221 - When running in Standalone CDAP, if a MapReduce job fails repeatedly, then the SDK hits an out-of-memory exception due to perm gen. The Standalone needs restarting at this point.
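
The unit mismatch described in CDAP-2878 is easy to underestimate. The sketch below is illustrative Python only: it assumes that the contexts not listed in the note read the TTL value as seconds, which the note implies but does not state, and the helper `ttl_as_seconds` is hypothetical rather than part of any CDAP API.

```python
def ttl_as_seconds(ttl_value: int, unit: str) -> float:
    """Return the effective TTL in seconds for a raw property value.

    `unit` is "s" or "ms" and names how a given context interprets the
    raw number. Which context uses which unit is an assumption here.
    """
    if unit == "s":
        return float(ttl_value)
    if unit == "ms":
        return ttl_value / 1000.0
    raise ValueError(f"unknown unit: {unit!r}")


# The same raw property value, read under the two interpretations.
raw_ttl = 3600
as_seconds = ttl_as_seconds(raw_ttl, "s")    # one hour
as_millis = ttl_as_seconds(raw_ttl, "ms")    # under four seconds
```

A value intended as one hour would expire data after 3.6 seconds in a context that reads it as milliseconds, so it is worth checking which interpretation applies before setting the property.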

Release 3.0.3

Bug Fix

  • Fixed a Bower dependency error in the CDAP UI (CDAP-3010).

Known Issues

Release 3.0.2

Bug Fixes

Known Issues

Release 3.0.1

New Features

  • In the CDAP UI, mandatory parameters for Application Template creation are marked with asterisks, and if a user tries to create a template without one of those parameters, the missing parameter is highlighted (CDAP-2499).

Improvements

Tools

CDAP UI

CDAP SDK VM

  • Added the Apache Flume agent flume-ng to the CDAP SDK VM (CDAP-2612).
  • Added the ability to copy and paste to the CDAP SDK VM (CDAP-2611).
  • Pre-downloaded the example dependencies into the CDAP SDK VM to speed building of the CDAP examples (CDAP-2613).

Bug Fixes

General

  • Fixed a problem with the HBase store and flows with multiple queues, where one queue name is a prefix of another queue name (CDAP-1996).
  • Fixed a problem with namespaces with underscores in the name crashing the Hadoop HBase region servers (CDAP-2110).
  • Removed the requirement to specify the JDBC driver class property twice in the adapter configuration for Database Sources and Sinks (CDAP-2453).
  • Fixed a problem in Distributed CDAP where the status of a running program is always returned as "STOPPED" when the CDAP Master is restarted (CDAP-2489).
  • Fixed a problem with invalid RunRecords for Spark and MapReduce programs that are run as part of a Workflow (CDAP-2490).
  • Fixed a problem with the CDAP Master not being HA (highly available) when a leadership change happens (CDAP-2495).
  • Fixed a problem with upgrading of queues with the UpgradeTool (CDAP-2502).
  • Fixed a problem with ObjectMappedTables not deleting missing fields when updating a row (CDAP-2523, CDAP-2524).
  • Fixed a problem with a stream not being created properly when deploying an application after the default namespace was deleted (CDAP-2537).
  • Fixed a problem with the Application Template Kafka Source not using the persisted offset when the Adapter is restarted (CDAP-2547).
  • A problem with CDAP using its own transaction snapshot codec, leading to huge snapshot files and OutOfMemory exceptions, and transaction snapshots that can't be read using Tephra's tools, has been resolved by replacing the codec with Tephra's SnapshotCodecV3 (CDAP-2563, CDAP-2946, TEPHRA-101).
  • Fixed a problem with CDAP Master not being resilient in the handling of ZooKeeper exceptions (CDAP-2569).
  • Fixed a problem with RunRecords not being cleaned up correctly after certain exceptions (CDAP-2584).
  • Fixed a problem with the CDAP Maven archetype having an incorrect CDAP version in it (CDAP-2634).
  • Fixed a problem with the description of the TwitterSource not describing its output (CDAP-2648).
  • Fixed a problem with the Twitter Source not handling missing fields correctly and as a consequence producing tweets (with errors) that were then not stored on disk (CDAP-2653).
  • Fixed a problem with the TwitterSource not calculating the time of tweet correctly (CDAP-2656).
  • Fixed a problem with the JMS Real-time Source failing to load required plugin sources (CDAP-2661).
  • Fixed a problem with executing Hive queries on a distributed CDAP due to a failure to load Grok classes (CDAP-2678).
  • Fixed a problem with CDAP Program jars not being cleaned up from the temporary directory (CDAP-2698).
  • Fixed a problem with ProjectionTransforms not handling input data fields with null values correctly (CDAP-2719).
  • Fixed a problem with the CDAP SDK running out of memory when MapReduce jobs are run repeatedly (CDAP-2743).
  • Fixed a problem with not using CDAP RunIDs in the in-memory version of the CDAP SDK (CDAP-2769).

CDAP CLI

  • Fixed a problem with the CDAP CLI not printing an error if it is unable to connect to a CDAP instance (CDAP-2529).
  • Fixed a problem with extra whitespace in commands entered into the CDAP CLI causing errors (CDAP-2538).

CDAP SDK Standalone

  • Updated the messages displayed when starting the Standalone CDAP SDK as to components and the JVM required (CDAP-2445).
  • Fixed a problem with the creation of the default namespace upon starting the CDAP SDK (CDAP-2587).

CDAP SDK VM

  • Fixed a problem with using the default namespace on the CDAP SDK Virtual Machine Image (CDAP-2500).
  • Fixed a problem with the VirtualBox VM retaining a MAC address obtained from the build host (CDAP-2640).

CDAP UI

  • Fixed a problem with incorrect flow metrics showing in the CDAP UI (CDAP-2494).
  • Fixed a problem in the CDAP UI with the properties in the Projection Transform being displayed inconsistently (CDAP-2525).
  • Fixed a problem in the CDAP UI not automatically updating the number of flowlet instances (CDAP-2534).
  • Fixed a problem in the CDAP UI with a window resize preventing clicking of the Adapter Template drop down menu (CDAP-2573).
  • Fixed a problem with the CDAP UI not performing validation of mandatory parameters before the creation of an adapter (CDAP-2575).
  • Fixed a problem with an incorrect version of CDAP being shown in the CDAP UI (CDAP-2586).
  • Reduced the number of clicks required to navigate and perform actions within the CDAP UI (CDAP-2622, CDAP-2625).
  • Fixed a problem with an additional forward-slash character in the URL causing a "page not found error" in the CDAP UI (CDAP-2624).
  • Fixed a problem with the error dropdown of the CDAP UI not scrolling when it has a large number of errors (CDAP-2633).
  • Fixed a problem in the CDAP UI with the Twitter Source's consumer key secret not being treated as a password field (CDAP-2649).
  • Fixed a problem with the CDAP UI attempting to create an adapter without a name (CDAP-2652).
  • Fixed a problem with the CDAP UI not being able to find the ETL plugin templates on distributed CDAP (CDAP-2655).
  • Fixed a problem with the CDAP UI's System Dashboard chart having a y-axis starting at "-200" (CDAP-2699).
  • Fixed a problem with the rendering of stack trace logs in the CDAP UI (CDAP-2745).
  • Fixed a problem with the CDAP UI not working with secure CDAP instances, either clusters or standalone (CDAP-2770).
  • Fixed a problem with the coloring of completed runs of Workflow DAGs in the CDAP UI (CDAP-2781).

Documentation

  • Fixed errors with the documentation examples of the ETL Plugins (CDAP-2503).
  • Documented the licenses of all shipped CDAP UI components (CDAP-2582).
  • Corrected issues with the building of Javadocs used on the website and removed Javadocs previously included in the SDK (CDAP-2730).
  • Added a recommended version (v.12.0) of Node.js to the documentation (CDAP-2762).

Known Issues

  • The application in the cdap-kafka-ingest-guide does not run on Ubuntu 14.x and CDAP 3.0.x (CDAP-2632, CDAP-2749).

  • Metrics for TimePartitionedFileSets can show zero values even if there is data present (CDAP-2721).

  • In the CDAP UI: many buttons will remain in focus after being clicked, even if they should not retain focus (CDAP-2785).

  • When the CDAP-Master dies, the CDAP UI does not respond appropriately, and instead of waiting for routing to the secondary master to begin, it loses its connection (CDAP-2830).

  • A workflow that is scheduled by time will not be run between the failure of the primary master and the time that the secondary takes over. This scheduled run will not be triggered at all. There are no warnings or messages about the missed run of the workflow (CDAP-2831).

  • CDAP has been tested on and supports CDH 4.2.x through CDH 5.3.x, HDP 2.0 through 2.1, and Apache Bigtop 0.8.0. It has not been tested on more recent versions of CDH. See our Hadoop/HBase Environment configurations.

  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report that data in the old format cannot be processed.

  • See the above section (API Changes) for alterations that can affect existing installations.

  • When running secure Hadoop clusters, metrics and debug logs from MapReduce programs are not available (CDAP-797).

  • If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive, but becomes unusable. To correct, restart the CDAP Master, which will restart all services (CDAP-1007).

  • User datasets with names starting with "system" can potentially cause conflicts (CDAP-1587).

  • Scaling the number of metrics processor instances doesn't automatically distribute the processing load to the newer instances of the metrics processor. The CDAP Master needs to be restarted to effectively distribute the processing across all metrics processor instances (CDAP-1853).

  • Creating a dataset in a non-existent namespace manifests in the RESTful API with an incorrect error message (CDAP-1864).

  • Retrieving multiple metrics (by issuing an HTTP POST request with a JSON list as the request body that enumerates the name and attributes for each metric) is currently not supported in the Metrics HTTP RESTful API v3. Instead, use the v2 API. It will be supported in a future release.

  • Typically, datasets are bundled as part of applications. When an application is upgraded and redeployed, any changes in datasets will not be redeployed. This is because datasets can be shared across applications, and an incompatible schema change can break other applications that are using the dataset. A workaround (CDAP-1253) is to allow unchecked dataset upgrades. Upgrades cause the dataset metadata, i.e. its specification including properties, to be updated. The dataset runtime code is also updated. To prevent data loss the existing data and the underlying HBase tables remain as-is.

    You can allow unchecked dataset upgrades by setting the configuration property dataset.unchecked.upgrade to true in cdap-site.xml. This ensures that datasets are upgraded when the application is redeployed. When this property is set, the recommended process for deploying an upgraded dataset is to first stop all applications that use the dataset, and then deploy the new version of the application. This lets all containers (flows, services, etc.) pick up the new dataset changes. When datasets are upgraded using dataset.unchecked.upgrade, the system performs no schema compatibility checks. It is therefore very important that the developer verifies backward compatibility and makes sure that other applications that use the dataset can work with the new changes.
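
As a rough illustration of the setting described above, the sketch below embeds a sample cdap-site.xml entry and checks whether dataset.unchecked.upgrade is enabled. It assumes that cdap-site.xml uses the Hadoop-style property/name/value layout. The class UncheckedUpgradeCheck and its methods are hypothetical helpers, not part of CDAP, and treating a missing property as disabled is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of CDAP: inspects a Hadoop-style site file
// and reports whether unchecked dataset upgrades are enabled.
class UncheckedUpgradeCheck {

  // Assumed layout of the relevant cdap-site.xml entry.
  static final String SAMPLE_SITE_XML =
      "<configuration>\n"
      + "  <property>\n"
      + "    <name>dataset.unchecked.upgrade</name>\n"
      + "    <value>true</value>\n"
      + "  </property>\n"
      + "</configuration>\n";

  static boolean isUncheckedUpgradeEnabled(String siteXml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(siteXml.getBytes(StandardCharsets.UTF_8)));
      NodeList properties = doc.getElementsByTagName("property");
      for (int i = 0; i < properties.getLength(); i++) {
        Element property = (Element) properties.item(i);
        if ("dataset.unchecked.upgrade".equals(textOf(property, "name"))) {
          // parseBoolean(null) is false, so a missing <value> counts as disabled.
          return Boolean.parseBoolean(textOf(property, "value"));
        }
      }
      return false; // Assumption: an absent property means the feature is off.
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse site XML", e);
    }
  }

  // Returns the trimmed text of the first child element with the given tag,
  // or null if there is no such element.
  private static String textOf(Element parent, String tag) {
    NodeList nodes = parent.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
  }

  public static void main(String[] args) {
    System.out.println(isUncheckedUpgradeEnabled(SAMPLE_SITE_XML));
  }
}
```

A check like this could run before redeployment, as a reminder that no schema compatibility checks will be performed when the property is true.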

Release 3.0.0

New Features

New User Interface

  • Introduced a new UI, organized by namespaces and users.
  • Users can switch between namespaces.
  • Uses web sockets to retrieve updates from the backend.
  • Development Section
    • Introduces a UI for programs based on run-ids.
    • Users can view logs and, in certain cases (flows), flowlets of a program based on run ids.
    • Shows a list of datasets and streams used by a program, and shows programs using a specific dataset and stream.
    • Shows the history of a program (list of runs).
    • Datasets or streams are explorable on a dataset/stream level or on a global level.
    • Shows program-level metrics under each program.
  • Operations section
    • Introduces an operations section to explore metrics.
    • Allows users to create custom dashboards with custom metrics.
    • Users can add different types of charts (line, bar, area, pie, donut, scatter, spline, area-spline, area-spline-stacked, area-stacked, step, table).
    • Users can add multiple metrics on a single dashboard, or on a single widget on a single dashboard.
    • Users can organize the widgets in a two-, three-, or four-column layout.
    • Users can toggle the frequency at which data is polled for a metric.
    • Users can toggle the resolution of data displayed in a graph.
  • Admin Section
    • Users can manage different objects of CDAP (applications, programs, datasets, and streams).
    • Users can create namespaces.
    • Through the Admin view, users can configure their preferences at the CDAP level, namespace level, or application level.
    • Users can manage the system services, applications, and streams through the Admin view.
  • Adapters
    • Users can create ETLBatch and ETLRealtime adapters from within the UI.
    • Users can choose from a list of plugins that comes by default with CDAP when creating an adapter.
    • Users can save an adapter as a draft, to be created at a later point-in-time.
    • Users can configure plugin properties with appropriate editors from within the UI when creating an adapter.
  • The Old CDAP Console has been deprecated.

Improvement

Bug Fixes

  • The CDAP Authentication server now reports the port correctly when the port is set to 0 (CDAP-614).
  • History of the programs running under workflow (Spark and MapReduce) is now updated correctly (CDAP-1293).
  • Programs running under a workflow now receive a unique run-id (CDAP-2025).
  • RunRecords are now updated with the RuntimeService to account for node failures (CDAP-2202).
  • MapReduce metrics are now available on a secure cluster (CDAP-64).

API Changes

  • The endpoint (POST '<base-url>/metrics/search?target=childContext[&context=<context>]') that searched for the available contexts of metrics has been deprecated, pending removal in a later version of CDAP (CDAP-1998). A replacement endpoint is available.
  • The endpoint (POST '<base-url>/metrics/search?target=metric&context=<context>') that searched for metrics in a specified context has been deprecated, pending removal in a later version of CDAP (CDAP-1998). A replacement endpoint is available.
  • The endpoint (POST '<base-url>/metrics/query?context=<context>[&groupBy=<tags>]&metric=<metric>&<time-range>') that queried for a metric has been deprecated, pending removal in a later version of CDAP (CDAP-1998). A replacement endpoint is available.
  • Metrics: The tag name for service handlers in previous releases was wrongly "runnable", and internally represented as "srn". These metrics are now tagged as "handler" ("hnd"), and metrics queries will only account for this tag name. If you need to query historic metrics that were emitted with the old tag "runnable", use "srn" to query them (instead of either "runnable" or "handler").
  • The CDAP CLI startup options have been changed to accommodate a new option of executing a file containing a series of CLI commands, line-by-line.
  • The metrics system APIs have been improved (CDAP-1596).
  • The rules for determining the resolution when using resolution=auto in a metrics query have been changed (CDAP-1922).
  • Backward-incompatible changes have been made to InputFormatProvider and OutputFormatProvider. They do not affect user code that uses FileSet or PartitionedFileSet; they affect only classes that implement InputFormatProvider or OutputFormatProvider:
    • InputFormatProvider.getInputFormatClass() has been removed and replaced with InputFormatProvider.getInputFormatClassName();
    • OutputFormatProvider.getOutputFormatClass() has been removed and replaced with OutputFormatProvider.getOutputFormatClassName().
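
For classes affected by the InputFormatProvider change above, the migration might look roughly like the following sketch. The interface declared here is a stand-in so that the example compiles on its own; only the method name getInputFormatClassName() comes from the release note, and its String return type is inferred from that name. The class LogFileInputProvider is hypothetical, and the Hadoop TextInputFormat class name is used purely as sample data.

```java
// Hypothetical provider written against the 3.0.0 style of the interface.
class LogFileInputProvider implements InputFormatProvider {

  // Earlier versions exposed getInputFormatClass() instead (presumably
  // returning a Class object); the provider now returns the fully
  // qualified class name as a String.
  @Override
  public String getInputFormatClassName() {
    return "org.apache.hadoop.mapreduce.lib.input.TextInputFormat";
  }

  public static void main(String[] args) {
    System.out.println(new LogFileInputProvider().getInputFormatClassName());
  }
}

// Stand-in for the CDAP interface (assumption): the real interface may
// declare additional methods that are not shown here.
interface InputFormatProvider {
  String getInputFormatClassName();
}
```

An OutputFormatProvider implementation would change in the same way, returning the name from getOutputFormatClassName().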

Deprecated and Removed Features

  • The File DropZone and File Tailer are no longer supported as of Release 3.0.
  • Support for procedures has been removed. After upgrading, an application that contained a procedure must be redeployed.
  • Support for service workers has been removed. After upgrading, an application that contained a service worker must be redeployed.
  • The old CDAP Console has been deprecated.
  • Support for JDK/JRE 1.6 (Java 6) has ended; JDK/JRE 1.7 (Java 7) is now required for CDAP or the CDAP SDK.

Known Issues

  • CDAP has been tested on and supports CDH 4.2.x through CDH 5.3.x, HDP 2.0 through 2.1, and Apache Bigtop 0.8.0. It has not been tested on more recent versions of CDH. See our Hadoop/HBase Environment configurations.

  • After upgrading CDAP from a pre-3.0 version, any unprocessed metrics data in Kafka will be lost, and WARN log messages will report the inability to process old data in the old format.

  • See the above section (API Changes) for alterations that can affect existing installations.

  • When running secure Hadoop clusters, metrics and debug logs from MapReduce programs are not available (CDAP-797).

  • If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive, but becomes unusable. To correct, restart the CDAP Master, which will restart all services (CDAP-1007).

  • User datasets with names starting with "system" can potentially cause conflicts (CDAP-1587).

  • Scaling the number of metrics processor instances doesn't automatically distribute the processing load to the newer instances of the metrics processor. The CDAP Master needs to be restarted to effectively distribute the processing across all metrics processor instances (CDAP-1853).

  • Creating a dataset in a non-existent namespace manifests in the RESTful API with an incorrect error message (CDAP-1864).

  • Retrieving multiple metrics (by issuing an HTTP POST request with a JSON list as the request body that enumerates the name and attributes for each metric) is currently not supported in the Metrics HTTP RESTful API v3. Instead, use the v2 API. It will be supported in a future release.

  • Typically, datasets are bundled as part of applications. When an application is upgraded and redeployed, any changes in datasets will not be redeployed. This is because datasets can be shared across applications, and an incompatible schema change can break other applications that are using the dataset. A workaround (CDAP-1253) is to allow unchecked dataset upgrades. Upgrades cause the dataset metadata, i.e. its specification including properties, to be updated. The dataset runtime code is also updated. To prevent data loss the existing data and the underlying HBase tables remain as-is.

    You can allow unchecked dataset upgrades by setting the configuration property dataset.unchecked.upgrade to true in cdap-site.xml. This ensures that datasets are upgraded when the application is redeployed. When this property is set, the recommended process for deploying an upgraded dataset is to first stop all applications that use the dataset, and then deploy the new version of the application. This lets all containers (flows, services, etc.) pick up the new dataset changes. When datasets are upgraded using dataset.unchecked.upgrade, the system performs no schema compatibility checks. It is therefore very important that the developer verifies backward compatibility and makes sure that other applications that use the dataset can work with the new changes.

Release 2.8.0

General

New Features

  • Command Line Interface (CLI)
    • CLI can now directly connect to a CDAP instance of your choice at startup by using cdap cli --uri <uri>.
    • Support for runtime arguments, which can be listed by running "cdap cli --help".
    • Table rendering can be configured using "cli render as <alt|csv>". The option "alt" is the default, with "csv" available for copy & pasting.
    • Stream statistics can be computed using "get stream-stats <stream-id>".
  • Datasets
    • Added an ObjectMappedTable dataset that maps object fields to table columns and that is also explorable.
    • Added a PartitionedFileSet dataset that allows addressing files by metadata and that is also explorable.
    • Table datasets now support a multi-get operation for batched reads.
    • Allow an unchecked dataset upgrade upon application deployment (CDAP-1574).
  • Metrics
    • Added new APIs for exploring available metrics, including drilling down into the context of emitted metrics
    • Added the ability to explore (search) all metrics; previously, this was restricted to custom user metrics
    • There are new APIs for querying metrics
    • New capability to break down a metrics time series using the values of one or more tags in its context
  • Namespaces
    • Applications and programs are now managed within namespaces.
    • Application logs are available within namespaces.
    • Metrics are now collected and queried within namespaces.
    • Datasets can now be created and managed within namespaces.
    • Streams are now namespaced for ingestion, fetching, and consuming by programs.
    • Explore operations are now namespaced.
  • Preferences
    • Users can store preferences (a property map) at the instance, namespace, application, or program level.
  • Spark
    • Spark programs are now specified using a configurer-style API (CDAP-382).
  • Workflows
    • Users can schedule a workflow based on increments of data being ingested into a stream.
    • Workflows can be stopped.
    • The execution of a workflow can be forked into parallelized branches.
    • The runtime arguments for a workflow can be scoped.
  • Workers
    • Added Worker, a new program type that can be added to CDAP applications to run background processes; as a beta feature, workers can write to streams through the WorkerContext.
  • Upgrade and Data Migration Tool
    • Added an automated upgrade tool which supports upgrading from 2.6.x to 2.8.0. (Note: Apps need to be both recompiled and re-deployed.) Upgrade from 2.7.x to 2.8.0 is not currently supported. If you have a use case for it, please reach out to us at cdap-user@googlegroups.com.
    • Added a metric migration tool which migrates old metrics to the new 2.8 format.

Improvement

  • Improved flow performance and scalability with a new distributed queue implementation.

API Changes

  • The endpoint (GET <base-url>/data/explore/datasets/<dataset-name>/schema) that retrieved the schema of a dataset's underlying Hive table has been removed (CDAP-1603).
  • Endpoints have been added to retrieve the CDAP version and the current configurations of CDAP and HBase (Configuration HTTP RESTful API).

Known Issues

  • When running secure Hadoop clusters, metrics and debug logs from MapReduce programs are not available (CDAP-64 and CDAP-797).

  • If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive, but becomes unusable. To correct, restart the CDAP Master, which will restart all services (CDAP-1007).

  • User datasets with names starting with "system" can potentially cause conflicts (CDAP-1587).

  • Scaling the number of metrics processor instances doesn't automatically distribute the processing load to the newer instances of the metrics processor. The CDAP Master needs to be restarted to effectively distribute the processing across all metrics processor instances (CDAP-1853).

  • Creating a dataset in a non-existent namespace manifests in the RESTful API with an incorrect error message (CDAP-1864).

  • Retrieving multiple metrics (by issuing an HTTP POST request with a JSON list as the request body that enumerates the name and attributes for each metric) is currently not supported in the Metrics HTTP RESTful API v3. Instead, use the v2 API. It will be supported in a future release.

  • Typically, datasets are bundled as part of applications. When an application is upgraded and redeployed, any changes in datasets will not be redeployed. This is because datasets can be shared across applications, and an incompatible schema change can break other applications that are using the dataset. A workaround (CDAP-1253) is to allow unchecked dataset upgrades. Upgrades cause the dataset metadata, i.e. its specification including properties, to be updated. The dataset runtime code is also updated. To prevent data loss the existing data and the underlying HBase tables remain as-is.

    You can allow unchecked dataset upgrades by setting the configuration property dataset.unchecked.upgrade to true in cdap-site.xml. This ensures that datasets are upgraded when the application is redeployed. When this property is set, the recommended process for deploying an upgraded dataset is to first stop all applications that use the dataset, and then deploy the new version of the application. This lets all containers (flows, services, etc.) pick up the new dataset changes. When datasets are upgraded using dataset.unchecked.upgrade, the system performs no schema compatibility checks. It is therefore very important that the developer verifies backward compatibility and makes sure that other applications that use the dataset can work with the new changes.

  • A race condition resulting in a deadlock can occur when a TwillRunnable container shuts down while it still has ZooKeeper events to process. This occasionally surfaces when running with OpenJDK or JDK7, though not with Oracle JDK6. It is caused by a change in the ThreadPoolExecutor implementation between Oracle JDK6 and OpenJDK/JDK7. Until Apache Twill is updated in a future version of CDAP, a workaround is to kill the errant process. The YARN command to list all running applications and their app-ids is:

    yarn application -list -appStates RUNNING
    

    The command to kill a process is:

    yarn application -kill <app-id>
    

    All versions of CDAP running Apache Twill version 0.4.0 with this configuration can exhibit this problem (TWILL-110).

Release 2.7.1

API Changes

  • The property security.auth.server.address has been deprecated and replaced with security.auth.server.bind.address (CDAP-639, CDAP-1078).
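
A minimal sketch of how deployment tooling might read the address while configurations are being migrated, preferring the new property and falling back to the deprecated one. The fallback order and the AuthServerAddress class are assumptions made for illustration; the release note confirms only the two property names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of CDAP.
class AuthServerAddress {
  static final String BIND_ADDRESS_KEY = "security.auth.server.bind.address";
  static final String DEPRECATED_ADDRESS_KEY = "security.auth.server.address";

  // Returns the value of the new property if present, otherwise the value of
  // the deprecated property, otherwise null (assumed precedence).
  static String resolve(Map<String, String> conf) {
    String address = conf.get(BIND_ADDRESS_KEY);
    return address != null ? address : conf.get(DEPRECATED_ADDRESS_KEY);
  }

  public static void main(String[] args) {
    // A configuration that still uses only the deprecated property.
    Map<String, String> oldStyle = new HashMap<>();
    oldStyle.put(DEPRECATED_ADDRESS_KEY, "10.0.0.5");
    System.out.println(resolve(oldStyle));
  }
}
```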

New Features

  • Spark
    • Spark programs are now specified using a configurer-style API (CDAP-382).
    • Spark can now run as a part of a workflow (CDAP-465).
  • Security
    • CDAP Master now obtains and refreshes Kerberos tickets programmatically (CDAP-1134).
  • Datasets
    • A new, experimental dataset type to support time-partitioned File sets has been added.
    • Time-partitioned File sets can be queried with Impala on CDH distributions (CDAP-926).
    • Streams can be made queryable with Impala by deploying an adapter that periodically converts it into partitions of a time-partitioned File set (CDAP-1129).
    • Support for different levels of conflict detection: ROW, COLUMN, or NONE (CDAP-1016).
    • Removed support for @DisableTransaction (CDAP-1279).
    • Support for annotating a stream with a schema (CDAP-606).
    • A new API for uploading entire files to a stream has been added (CDAP-411).
  • Workflow
    • Workflows are now specified using a configurer-style API (CDAP-1207).
    • Multiple instances of a workflow can be run concurrently (CDAP-513).
    • Programs are no longer part of a workflow; instead, they are added in the application and are referenced by a workflow using their names (CDAP-1116).
    • Schedules are now at the application level and properties can be specified for Schedules; these properties will be passed to the scheduled program as runtime arguments (CDAP-1148).

Known Issues

  • When upgrading an existing CDAP installation to 2.7.1, all metrics are reset.

  • When running secure Hadoop clusters, metrics and debug logs from MapReduce programs are not available (CDAP-64 and CDAP-797).

  • When upgrading a cluster from an earlier version of CDAP, warning messages may appear in the master log indicating that in-transit (emitted, but not yet processed) metrics system messages could not be decoded (Failed to decode message to MetricsRecord). This is because of a change in the format of emitted metrics, and can result in a small amount of metrics data points being lost (CDAP-745).

  • A race condition resulting in a deadlock can occur when a TwillRunnable container shuts down while it still has ZooKeeper events to process. This occasionally surfaces when running with OpenJDK or JDK7, though not with Oracle JDK6. It is caused by a change in the ThreadPoolExecutor implementation between Oracle JDK6 and OpenJDK/JDK7. Until Apache Twill is updated in a future version of CDAP, a workaround is to kill the errant process. The YARN command to list all running applications and their app-ids is:

    yarn application -list -appStates RUNNING
    

    The command to kill a process is:

    yarn application -kill <app-id>
    

    All versions of CDAP running Apache Twill version 0.4.0 with this configuration can exhibit this problem (TWILL-110).

  • Typically, datasets are bundled as part of applications. When an application is upgraded and redeployed, any changes in datasets will not be redeployed. This is because datasets can be shared across applications, and an incompatible schema change can break other applications that are using the dataset. A workaround (CDAP-1253) is to allow unchecked dataset upgrades. Upgrades cause the dataset metadata, i.e. its specification including properties, to be updated. The dataset runtime code is also updated. To prevent data loss the existing data and the underlying HBase tables remain as-is.

    You can allow unchecked dataset upgrades by setting the configuration property dataset.unchecked.upgrade to true in cdap-site.xml. This ensures that datasets are upgraded when the application is redeployed. When this property is set, the recommended process for deploying an upgraded dataset is to first stop all applications that use the dataset, and then deploy the new version of the application. This lets all containers (flows, services, etc.) pick up the new dataset changes. When datasets are upgraded using dataset.unchecked.upgrade, the system performs no schema compatibility checks. It is therefore very important that the developer verifies backward compatibility and makes sure that other applications that use the dataset can work with the new changes.

Release 2.6.1

CDAP Bug Fixes

  • Allow an unchecked dataset upgrade upon application deployment (CDAP-1253).
  • Update the Hive dataset table when a dataset is updated (CDAP-71).
  • Use Hadoop configuration files bundled with the Explore Service (CDAP-1250).

Known Issues

  • When running secure Hadoop clusters, metrics and debug logs from MapReduce programs are not available (CDAP-64 and CDAP-797).

  • When upgrading a cluster from an earlier version of CDAP, warning messages may appear in the master log indicating that in-transit (emitted, but not yet processed) metrics system messages could not be decoded (Failed to decode message to MetricsRecord). This is because of a change in the format of emitted metrics, and can result in a small amount of metrics data points being lost (CDAP-745).

  • A race condition resulting in a deadlock can occur when a TwillRunnable container shuts down while it still has ZooKeeper events to process. This occasionally surfaces when running with OpenJDK or JDK7, though not with Oracle JDK6. It is caused by a change in the ThreadPoolExecutor implementation between Oracle JDK6 and OpenJDK/JDK7. Until Apache Twill is updated in a future version of CDAP, a workaround is to kill the errant process. The YARN command to list all running applications and their app-ids is:

    yarn application -list -appStates RUNNING
    

    The command to kill a process is:

    yarn application -kill <app-id>
    

    All versions of CDAP running Apache Twill version 0.4.0 with this configuration can exhibit this problem (TWILL-110).

  • Typically, datasets are bundled as part of applications. When an application is upgraded and redeployed, any changes in datasets will not be redeployed. This is because datasets can be shared across applications, and an incompatible schema change can break other applications that are using the dataset. A workaround (CDAP-1253) is to allow unchecked dataset upgrades. Upgrades cause the dataset metadata, i.e. its specification including properties, to be updated. The dataset runtime code is also updated. To prevent data loss the existing data and the underlying HBase tables remain as-is.

    You can allow unchecked dataset upgrades by setting the configuration property dataset.unchecked.upgrade to true in cdap-site.xml. This ensures that datasets are upgraded when the application is redeployed. When this property is set, the recommended process for deploying an upgraded dataset is to first stop all applications that use the dataset, and then deploy the new version of the application. This lets all containers (flows, services, etc.) pick up the new dataset changes. When datasets are upgraded using dataset.unchecked.upgrade, the system performs no schema compatibility checks. It is therefore very important that the developer verifies backward compatibility and makes sure that other applications that use the dataset can work with the new changes.

Release 2.6.0

API Changes

  • The API for specifying services and MapReduce programs has been changed to use a "configurer" style; user classes implementing either MapReduce or service will need to be modified, as the interfaces have changed (CDAP-335).

New Features

  • General
    • Health checks are now available for CDAP system services (CDAP-663).
  • Applications
    • Jar deployment now uses a chunked request and writes to a local temp file (CDAP-91).
  • MapReduce
    • MapReduce programs can now read binary stream data (CDAP-331).
  • Datasets
    • Added FileSet, a new core dataset type for working with sets of files (CDAP-1).
  • Spark
    • Spark programs now emit system and custom user metrics (CDAP-346).
    • Services can be called from Spark programs and its worker nodes (CDAP-348).
    • Spark programs can now read from streams (CDAP-403).
    • Added Spark support to the CDAP CLI (Command-line Interface) (CDAP-425).
    • Improved speed of Spark unit tests (CDAP-600).
    • Spark programs now display system metrics in the CDAP Console (CDAP-652).
  • Procedures
    • Procedures have been deprecated in favor of services (CDAP-413).
  • Services
    • Added an HTTP endpoint that returns the endpoints a particular service exposes (CDAP-412).
    • Added an HTTP endpoint that lists all services (CDAP-469).
    • Default metrics for services have been added to the CDAP Console (CDAP-512).
    • The annotations @QueryParam and @DefaultValue are now supported in custom service handlers (CDAP-664).
  • Metrics
    • System and user metrics now support gauge metrics (CDAP-484).
    • Metrics can be queried using a program's run-ID (CDAP-620).
  • Documentation

CDAP Bug Fixes

  • Fixed a problem with readless increments not being used when they were enabled in a dataset (CDAP-383).
  • Fixed a problem with applications, whose Spark or Scala user classes were not extended from either JavaSparkProgram or ScalaSparkProgram, failing with a class loading error (CDAP-599).
  • Fixed a problem with the CDAP upgrade tool not preserving the coprocessor configuration during an upgrade for tables with readless increments enabled (CDAP-1044).
  • Fixed a problem with the readless increment implementation dropping increment cells when a region flush or compaction occurred (CDAP-1062).

Known Issues

  • When running secure Hadoop clusters, metrics and debug logs from MapReduce programs are not available (CDAP-64 and CDAP-797).

  • When upgrading a cluster from an earlier version of CDAP, warning messages may appear in the master log indicating that in-transit (emitted, but not yet processed) metrics system messages could not be decoded (Failed to decode message to MetricsRecord). This is because of a change in the format of emitted metrics, and can result in a small amount of metrics data points being lost (CDAP-745).

  • A race condition resulting in a deadlock can occur when a TwillRunnable container shuts down while it still has ZooKeeper events to process. This occasionally surfaces when running with OpenJDK or JDK7, though not with Oracle JDK6. It is caused by a change in the ThreadPoolExecutor implementation between Oracle JDK6 and OpenJDK/JDK7. Until Apache Twill is updated in a future version of CDAP, a workaround is to kill the errant process. The YARN command to list all running applications and their app-ids is:

    yarn application -list -appStates RUNNING
    

    The command to kill a process is:

    yarn application -kill <app-id>
    

    All versions of CDAP running Apache Twill version 0.4.0 with this configuration can exhibit this problem (TWILL-110).

Release 2.5.2

CDAP Bug Fixes

  • Fixed a problem with a Coopr-provisioned secure cluster failing to start due to a classpath issue (CDAP-478).
  • Fixed a problem with the WISE app zip distribution not being packaged correctly; a new version (0.2.1) has been released (CDAP-533).
  • Fixed a problem with the examples and tests incorrectly using the ByteBuffer.array method when reading a stream event (CDAP-549).
  • Fixed a problem with the Authentication Server so that it can now communicate with an LDAP instance over SSL (CDAP-556).
  • Fixed a problem with the program class loader to allow applications to use a different version of a library than the one that the CDAP platform uses; for example, a different Kafka library (CDAP-559).
  • Fixed a problem with CDAP master not obtaining new delegation tokens after running for hbase.auth.key.update.interval milliseconds (CDAP-562).
  • Fixed a problem with the transaction not being rolled back when a user service handler throws an exception (CDAP-607).

Other Changes

  • Improved the CDAP documentation:
    • Re-organized the documentation into three manuals (Developers' Manual, Administration Manual, Reference Manual) and a set of examples, how-to guides and tutorials;
    • Documents are now in smaller chapters, with numerous updates and revisions;
    • Added a link for downloading an archive of the documentation for offline use;
    • Added links to examples relevant to a particular component;
    • Added suggested deployment architectures for Distributed CDAP installations;
    • Added a glossary;
    • Added navigation aids at the bottom of each page; and
    • Tested and updated the Standalone CDAP examples and their documentation.

Known Issues

  • Currently, applications that include Spark or Scala classes in user classes not extended from either JavaSparkProgram or ScalaSparkProgram (depending upon the language) fail with a class loading error. Spark or Scala classes should not be used outside of the Spark program. (CDAP-599)

  • Metrics for MapReduce programs aren't populated on secure Hadoop clusters

  • The metric for the number of cores shown in the Resources view of the CDAP Console will be zero unless YARN has been configured to enable virtual cores

  • Writing to datasets through Hive is not supported in CDH4.x (CDAP-988).

  • A race condition resulting in a deadlock can occur when a TwillRunnable container shuts down while it still has ZooKeeper events to process. This occasionally surfaces when running with OpenJDK or JDK7, though not with Oracle JDK6. It is caused by a change in the ThreadPoolExecutor implementation between Oracle JDK6 and OpenJDK/JDK7. Until Apache Twill is updated in a future version of CDAP, a workaround is to kill the errant process. The YARN command to list all running applications and their app-ids is:

    yarn application -list -appStates RUNNING
    

    The command to kill a process is:

    yarn application -kill <app-id>
    

    All versions of CDAP running Apache Twill version 0.4.0 with this configuration can exhibit this problem (TWILL-110).

Release 2.5.1

CDAP Bug Fixes

  • Improved the documentation of the CDAP authentication and stream clients, both Java and Python APIs.
  • Fixed problems with the CDAP Command Line Interface (CLI):
    • Did not work in non-interactive mode;
    • Printed excessive debug log messages;
    • Relative paths did not work as expected; and
    • Failed to execute SQL queries.
  • Removed dependencies on SNAPSHOT artifacts for netty-http and auth-clients.
  • Corrected an error in the message printed by the startup script cdap sdk.
  • Resolved a problem with the CDAP Flume Client of the CDAP Ingest library reading the properties file without first checking if authentication was enabled.

Other Changes

  • The scripts send-query.sh, access-token.sh and access-token.bat have been replaced by the CDAP Command Line Interface, cdap cli.
  • The CDAP Command Line Interface now uses and saves access tokens when connecting to a secure CDAP instance.
  • The CDAP Java Stream Client now allows empty String events to be sent.
  • The CDAP Python Authentication Client's configure() method now takes a dictionary rather than a filepath.

Known Issues

  • Metrics for MapReduce programs aren't populated on secure Hadoop clusters.

  • The metric for the number of cores shown in the Resources view of the CDAP Console will be zero unless YARN has been configured to enable virtual cores.

  • A race condition resulting in a deadlock can occur when a TwillRunnable container shuts down while it still has ZooKeeper events to process. This occasionally surfaces when running with OpenJDK or JDK7, though not with Oracle JDK6. It is caused by a change in the ThreadPoolExecutor implementation between Oracle JDK6 and OpenJDK/JDK7. Until Apache Twill is updated in a future version of CDAP, a workaround is to kill the errant process. The YARN command to list all running applications and their app-ids is:

    yarn application -list -appStates RUNNING
    

    The command to kill a process is:

    yarn application -kill <app-id>
    

    All versions of CDAP running Apache Twill version 0.4.0 with this configuration can exhibit this problem (TWILL-110).

Release 2.5.0

New Features

Ad-hoc querying

  • Capability to write to datasets using SQL
  • Added a CDAP JDBC driver allowing connections from Java applications and third-party business intelligence tools
  • Ability to perform ad-hoc queries from the CDAP Console:
    • Execute a SQL query from the Console
    • View the list of active and completed queries
    • Download query results

Datasets

  • Datasets can be tested with TestBase outside of the context of an application
  • CDAP now checks datasets for compatibility in a verification stage
  • The Transaction engine uses server-side filtering for efficient transactional reads
  • Dataset specifications can now be dynamically reconfigured through the use of RESTful endpoints
  • The Bundle jar format is now used for dataset libs
  • Increments on datasets are now read-less

Services

  • Added simplified APIs for using services from other programs such as MapReduce, flows and Procedures
  • Added an API for creating services and handlers that can use datasets transactionally
  • Added a RESTful API to make requests to a service via the Router
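
As an illustration of calling a service method through the Router, here is a minimal sketch. The URL layout (`/v2/apps/<app-id>/services/<service-id>/methods/<path>`) is an assumption based on the v2 RESTful API conventions of this release line and should be checked against the HTTP API reference. `service_url` is a hypothetical helper, and the application, service, and method names are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: build the URL of a service method routed through the CDAP Router.
# Assumption: the /v2/apps/<app-id>/services/<service-id>/methods/<path> layout.
# $1 = router host:port, $2 = app-id, $3 = service-id, $4 = method path
service_url() {
  printf 'http://%s/v2/apps/%s/services/%s/methods/%s\n' "$1" "$2" "$3" "$4"
}

# Usage (not run here; MyApp, MyService and ping are placeholder names):
# curl -s "$(service_url localhost:10000 MyApp MyService ping)"
```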

Security

  • Added authorization logging
  • Added Kerberos authentication to ZooKeeper secret keys
  • Added support for SSL

Spark Integration

  • Supports running Spark programs as a part of CDAP applications in Standalone mode
  • Supports running Spark programs written with Spark versions 1.0.1 or 1.1.0
  • Supports Spark's MLlib and GraphX modules
  • Includes three examples demonstrating CDAP Spark programs
  • Adds display of Spark program logs and history in the CDAP Console

Streams

  • Added a collection of applications, tools and APIs specifically for the ETL (Extract, Transform, and Load) of data
  • Added support for asynchronously writing to streams

Clients

  • Added a Command Line Interface
  • Added a Java Client Interface

Major CDAP Bug Fixes

  • Fixed a problem with a HADOOP_HOME exception stacktrace when unit-testing an application
  • Fixed an issue with Hive creating directories in /tmp in the Standalone and unit-test frameworks
  • Fixed a problem with type inconsistency of service API calls, where numbers were showing up as strings
  • Fixed an issue with the premature expiration of long-term Authentication Tokens
  • Fixed an issue with the dataset size metric showing data operations size instead of resource usage

Known Issues

  • Metrics for MapReduce programs aren't populated on secure Hadoop clusters.

  • The metric for the number of cores shown in the Resources view of the CDAP Console will be zero unless YARN has been configured to enable virtual cores.

  • A race condition resulting in a deadlock can occur when a TwillRunnable container shuts down while it still has ZooKeeper events to process. This occasionally surfaces when running with OpenJDK or JDK7, though not with Oracle JDK6. It is caused by a change in the ThreadPoolExecutor implementation between Oracle JDK6 and OpenJDK/JDK7. Until Apache Twill is updated in a future version of CDAP, a workaround is to kill the errant process. The YARN command to list all running applications and their app-ids is:

    yarn application -list -appStates RUNNING
    

    The command to kill a process is:

    yarn application -kill <app-id>
    

    All versions of CDAP running Apache Twill version 0.4.0 with this configuration can exhibit this problem (TWILL-110).