HDP 2.2 brings substantial innovations in Apache Hadoop YARN, enabling users of Apache Hadoop to efficiently store their data in a single repository and interact with it simultaneously using a wide variety of engines. This functionality makes YARN particularly attractive for the integration of many distributed Long-Running services.
In this release, we also introduced a new framework, Apache™ Slider, for easy onboarding of Long-Running services on top of YARN. This framework enables Long-Running applications or services to be deployed in a YARN environment. By adopting Slider, distributed Long-Running applications that aren't YARN-aware can now participate in the YARN ecosystem with no code modification. We also released the Hadoop-based NoSQL databases Apache HBase and Apache Accumulo, and the stream processing system Apache Storm, on top of Slider. More applications are being ported to Slider to take advantage of YARN integration.
Applications can integrate with YARN natively for complete control where needed, or use the Slider framework on top of YARN for rapid integration and additional capabilities in a low-cost, future-proof way.
Native Integration with YARN
As part of HDP 2.2, we are bringing native support for Long-Running services to existing Hadoop YARN deployments. Deploying Long-Running services on YARN is fundamentally similar to deploying short-lived applications, with a few differences.
Steps to integrate a basic application with YARN natively:
- Write a client to submit the application to YARN
  - Create a YARN Application
  - Submit Application
- Write an ApplicationMaster to do the following at minimum
  - Register Application with YARN
  - Negotiate and launch YARN containers
  - And then run application code in those containers
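The steps above can be sketched as a toy, in-memory model. This is not the YARN API: `ToyResourceManager`, `submit_and_run` and `run_application_master` are hypothetical names invented for illustration, and the "containers" are just integers. A real integration is typically written in Java, with the client built on `YarnClient` and the ApplicationMaster on `AMRMClient` and `NMClient`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for the YARN ResourceManager (illustration only)."""
    next_app_id: int = 1
    next_container_id: int = 1
    apps: Dict[int, str] = field(default_factory=dict)  # app id -> state

    # Calls made by the client.
    def create_application(self) -> int:
        app_id = self.next_app_id
        self.next_app_id += 1
        self.apps[app_id] = "NEW"
        return app_id

    def submit_application(self, app_id: int) -> None:
        self.apps[app_id] = "SUBMITTED"

    # Calls made by the ApplicationMaster.
    def register_application_master(self, app_id: int) -> None:
        self.apps[app_id] = "RUNNING"

    def allocate(self, app_id: int, count: int) -> List[int]:
        if self.apps.get(app_id) != "RUNNING":
            raise RuntimeError("ApplicationMaster must register before allocating")
        ids = list(range(self.next_container_id, self.next_container_id + count))
        self.next_container_id += count
        return ids


def run_application_master(rm: ToyResourceManager, app_id: int,
                           task: Callable[[int], str], containers: int) -> List[str]:
    """ApplicationMaster side: register, negotiate containers, run code in them."""
    rm.register_application_master(app_id)
    granted = rm.allocate(app_id, containers)
    return [task(container_id) for container_id in granted]


def submit_and_run(rm: ToyResourceManager, task: Callable[[int], str],
                   containers: int) -> List[str]:
    """Client side: create the application, then submit it."""
    app_id = rm.create_application()
    rm.submit_application(app_id)
    # In this toy model the ApplicationMaster is simply called directly.
    return run_application_master(rm, app_id, task, containers)


if __name__ == "__main__":
    rm = ToyResourceManager()
    for line in submit_and_run(rm, lambda cid: f"container {cid}: work done", 2):
        print(line)
```

The model keeps only the ordering constraints from the list: an application is created before it is submitted, and the ApplicationMaster registers before it negotiates containers.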
Native Integration using YARN API enables:
- The application's own Application Master to control container placement and fault handling
- An IPC API for callers to manipulate the application
- The Application Master to send out event notifications
To integrate an enterprise-ready application, a few more capabilities need to be covered in the Client and Application Master modules of the application:
- Fault tolerance & recovery
- Intra-application priorities, ordering and placement
- Security
- Upgrade/Downgrade of the application package
Handling these aspects of an enterprise-ready application involves a learning curve for the application developer, along with testing effort and ongoing maintenance against future Hadoop releases.
Integration with YARN via Slider
As part of HDP 2.2, we are also introducing Slider, a framework to make it easy to deploy and manage existing applications in a Hadoop cluster on YARN.
Slider manages applications by launching a YARN Application Master for every application instance, and agents in every resource container that YARN allocates based on the application's resource requirements. After launch, the Application Master can allocate or de-allocate resources and stop or start application instances in response to an application administrator's request through the Slider client, YARN's scheduling pre-emptions, or the Ambari integration.
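One way to picture how an Application Master might reconcile a requested size with what is actually running is a small planning function. This is a sketch under stated assumptions, not Slider's implementation: `plan_flex`, the component names and the container ids are all hypothetical.

```python
from typing import Dict, List, Tuple


def plan_flex(desired: Dict[str, int],
              running: Dict[str, List[str]]) -> Tuple[Dict[str, int], List[str]]:
    """Return (containers to request per component, container ids to release).

    desired: component name -> number of instances the administrator wants.
    running: component name -> ids of containers currently running it.
    """
    to_request: Dict[str, int] = {}
    to_release: List[str] = []
    for component in sorted(set(desired) | set(running)):
        want = desired.get(component, 0)
        have = running.get(component, [])
        if want > len(have):
            to_request[component] = want - len(have)
        elif want < len(have):
            # Release the surplus from the end of the list
            # (an arbitrary policy chosen for this sketch).
            to_release.extend(have[want:])
    return to_request, to_release


if __name__ == "__main__":
    desired = {"MASTER": 1, "WORKER": 4}
    running = {"MASTER": ["c1"], "WORKER": ["c2", "c3"]}
    print(plan_flex(desired, running))
```

The same function covers a lost container: once a failed or pre-empted container is dropped from `running`, the next pass asks for a replacement.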
Integration of Long-Running services via Slider provides the following benefits without any additional code:
- Comprehensive Application Management Framework: Framework handles YARN integration with best-effort placement, fault handling, security integration.
- On-demand scale with Multi-tenancy: Enable running multiple application instances in the same Hadoop cluster with different configurations, resource footprint and versions.
- Integrated Management: Simple API/Ambari Web UI for application lifecycle management including upgrade.
Slider views any application as a set of components. Each component is a daemon or executable with its own configuration, scripts, data files, etc. Components may have one or more instances. Slider manages application instances by managing component instances.
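The component view described above can be modeled with two small data classes. This is an illustrative sketch only: the `Component` and `Application` classes, their fields, and the script paths are hypothetical and do not mirror Slider's internal types.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    """One daemon or executable, with its own script and configuration."""
    name: str
    command_script: str                       # hypothetical path to its script
    config: Dict[str, str] = field(default_factory=dict)
    instances: int = 1                        # a component may run many instances


@dataclass
class Application:
    """An application is a named set of components."""
    name: str
    components: List[Component]

    def instance_names(self) -> List[str]:
        """Enumerate every component instance the framework would manage."""
        return [f"{c.name}-{i}" for c in self.components for i in range(c.instances)]


if __name__ == "__main__":
    app = Application("hbase", [
        Component("HBASE_MASTER", "scripts/hbase_master.py"),
        Component("HBASE_REGIONSERVER", "scripts/hbase_regionserver.py", instances=2),
    ])
    print(app.instance_names())
```

Managing the application then reduces to managing the flat list of component instances that `instance_names` enumerates.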
Integrating an application with Slider involves the following steps:
- Install Slider if it is not already installed [non-Hortonworks Hadoop distro using Hadoop 2.6]
- Package the Application in a specified format
  - Tar the entire application content
  - Write or copy and change metainfo.xml to describe the application structure
  - Write or copy and change lifecycle hook scripts [leverage existing python library]
  - Write default application configurations and default resource requirements
  - Put all of them in specified directory structure
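The packaging steps above might be scripted roughly as follows. Treat this as a sketch under assumptions: only `metainfo.xml` is named in the text, so the directory layout, the other file names (`appConfig.json`, `resources.json`), the XML elements and the JSON keys are all illustrative guesses. The Slider documentation defines the actual package format.

```python
import json
import tarfile
import tempfile
from pathlib import Path

# Assumed, minimal description of the application structure.
METAINFO = """<metainfo>
  <application>
    <name>MYAPP</name>
    <components>
      <component>
        <name>MYAPP_SERVER</name>
        <commandScript>
          <script>scripts/myapp_server.py</script>
          <scriptType>PYTHON</scriptType>
        </commandScript>
      </component>
    </components>
  </application>
</metainfo>
"""


def build_package(root: Path, app_tarball: bytes) -> Path:
    """Lay out a hypothetical application package under `root` and tar it."""
    pkg = root / "myapp-package"
    scripts = pkg / "package" / "scripts"
    files = pkg / "package" / "files"
    scripts.mkdir(parents=True)
    files.mkdir(parents=True)

    # The application content itself, already tarred.
    (files / "myapp.tar.gz").write_bytes(app_tarball)
    # Description of the application structure.
    (pkg / "metainfo.xml").write_text(METAINFO)
    # Placeholder for the lifecycle hook script.
    (scripts / "myapp_server.py").write_text("# install/start/stop hooks go here\n")
    # Default configuration and default resource requirements (assumed keys).
    (pkg / "appConfig.json").write_text(json.dumps({"global": {}}, indent=2))
    resources = {"components": {"MYAPP_SERVER": {"yarn.component.instances": "1",
                                                 "yarn.memory": "256"}}}
    (pkg / "resources.json").write_text(json.dumps(resources, indent=2))

    archive = root / "myapp-package.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(pkg.rglob("*")):
            tar.add(path, arcname=path.relative_to(pkg).as_posix(), recursive=False)
    return archive


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = build_package(Path(tmp), b"placeholder application bytes")
        with tarfile.open(archive) as tar:
            print("\n".join(tar.getnames()))
```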
Apart from enabling the application to integrate with YARN, the Slider framework provides many other features that are critical for any Long-Running service on top of YARN:
- Expand / shrink application instances on-demand
- Manage application, component and container failures
- Allocate application ports dynamically
- Use YARN application registry to publish application configuration to enable dynamic client lookup
- Deploy applications in Kerberos secured cluster
- Aggregate application logs from different containers
- Make it easy for applications to achieve semi-fixed or completely flexible placement within clusters using YARN node labels
- Manage the application lifecycle, tightly integrated with Ambari, Hadoop's only open source management tool, usually with minimal effort.
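Two items in the list above, dynamic port allocation and registry-based client lookup, work together, and the idea can be sketched in a few lines. This shows the general technique, not Slider's code: `allocate_free_port` asks the operating system for a free port by binding to port 0, and `ToyRegistry` is a hypothetical in-memory stand-in for the YARN application registry.

```python
import socket
from typing import Dict


def allocate_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently free TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


class ToyRegistry:
    """Hypothetical in-memory registry: application instance -> published config."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, str]] = {}

    def publish(self, app_instance: str, config: Dict[str, str]) -> None:
        self._entries[app_instance] = dict(config)

    def lookup(self, app_instance: str, key: str) -> str:
        return self._entries[app_instance][key]


if __name__ == "__main__":
    registry = ToyRegistry()
    port = allocate_free_port()
    # The application publishes the port it was given...
    registry.publish("myapp-1", {"server.port": str(port)})
    # ...and a client discovers it at run time instead of hard-coding it.
    print(registry.lookup("myapp-1", "server.port"))
```

Because the socket is closed before the application starts, another process could claim the port in between. A real service would either keep the socket open or retry on a bind failure.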
YARN provides multiple ways to integrate Long-Running services with it. Native YARN API based integration is ideal for large-scale distributed algorithms like MapReduce or for Long-Running services with specific placement and scheduling needs. For any other Long-Running service or application, a framework like Slider should be considered for its ease of integration and the other value-added features it offers.