The Apache community released Apache Pig 0.15.0 last week. Although there are many new features in Apache Pig 0.15.0, we would like to highlight two major improvements:
- Pig on Tez enhancements
- Using Hive UDFs inside Pig
Below are some details about these important features. For the complete list of features, improvements, and bug fixes, please see the release notes.
Notable Changes
1. Pig on Tez enhancements
Scalability of Pig on Tez
Yahoo! recently deployed Pig on Tez on a production cluster and found certain issues at large scale. As a result, the Pig community has made improvements to Tez AM scalability as well as to Pig on Tez internals to address these issues.
Tez UI and Tez local mode
The community worked closely with the Tez team to get the Tez UI and Tez local mode working for Pig on Tez.
The Tez UI is now fully functional, and you can view the Pig plan DAG, vertices, tasks, and task attempts both at runtime and after the job finishes.
Thanks to the Tez team's effort to make Tez local mode stable, we were able to migrate more than 2000 Pig unit tests, originally designed to run in MR local mode, to Tez local mode. This drastically increases the test coverage for Pig on Tez.
Tez Grace auto-parallelism
The degree of parallelism used for processing the query has implications on latency and cluster resource utilization. Pig-on-Tez tries to pick the sweet spot for the user.
Though Tez can do auto-parallelism at runtime based on the input size for each vertex, it suffers from two issues. First, auto-parallelism can only decrease parallelism, not increase it. The reason is that the upstream data is already partitioned; increasing parallelism requires repartitioning the incoming data, which is complex, and Tez has not implemented this functionality. Second, even when decreasing parallelism, Tez needs to merge smaller partitions into bigger ones at some cost.
In this release, we developed Grace auto-parallelism to alleviate these problems. The idea is that, as the DAG progresses, we can adjust the parallelism of a downstream vertex before it even starts. By doing so, we can partition the upstream data freely. With grace auto-parallelism, a vertex runs with a more accurate degree of parallelism.
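As a rough sketch of how a script might opt in to this behavior: the two property names below are assumptions, so check them against the `pig.properties` file shipped with your release. The `logs` input and its schema are hypothetical. The script leaves out any `PARALLEL` clause so that the reducer count is left to Pig on Tez.

```pig
-- Assumed property names; verify against the pig.properties of your release.
set pig.tez.auto.parallelism true;
set pig.tez.grace.parallelism true;

-- Hypothetical input and schema, for illustration only.
logs = LOAD 'logs' AS (user:chararray, bytes:long);

-- No PARALLEL clause here, so the parallelism of the grouping vertex
-- is not fixed by the script.
grouped = GROUP logs BY user;
totals = FOREACH grouped GENERATE group AS user, SUM(logs.bytes) AS total_bytes;

STORE totals INTO 'totals_by_user';
```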
2. Using Hive UDFs inside Pig
We can use all types of Hive UDFs (UDF/GenericUDF/UDAF/GenericUDAF/GenericUDTF) inside Pig through the newly introduced HiveUDF/HiveUDAF/HiveUDTF wrapper UDFs in Pig.
Here is one example:
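A minimal sketch of such a script follows. It assumes a hypothetical `student` input file with a `gpa` column; the alias `hive_sin` is an illustrative name. It wraps Hive's built-in `sin` function with HiveUDF and calls it like any other Pig function.

```pig
-- Bind Hive's built-in 'sin' function to a Pig alias through the HiveUDF wrapper.
-- 'hive_sin' is an illustrative alias name.
define hive_sin HiveUDF('sin');

-- 'student' is a hypothetical input file; the schema is assumed for illustration.
A = LOAD 'student' AS (name:chararray, age:int, gpa:double);

-- Invoke the wrapped Hive UDF on a column.
B = FOREACH A GENERATE name, hive_sin(gpa);

DUMP B;
```

Hive aggregate and table-generating functions would be declared the same way, using HiveUDAF and HiveUDTF in place of HiveUDF.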
One tested use case for HiveUDF is Hivemall; you can find the documentation for invoking Hivemall inside Pig on GitHub.