Support to 2 level nested foreach

AniQuet

Short description: Pig currently supports DISTINCT, FILTER, LIMIT, and ORDER BY inside a nested foreach statement. Support for a FOREACH nested inside a foreach is highly desirable.

Additional info: https://issues.apache.org/jira/browse/PIG-1631

Proposal Title: Support to 2 level nested foreach

Student Name: Aniket Mokashi

Student E-mail: amokashi@andrew.cmu.edu

 

Organization/Project: The Apache Software Foundation

Assigned Mentor:  Ashutosh Chauhan, Daniel Dai

 

Proposal Abstract:

Pig currently supports DISTINCT, FILTER, LIMIT, and ORDER BY inside a nested foreach statement. Support for a FOREACH nested inside a foreach is highly desirable.

Detailed Description:

As mentioned on the GSoC wiki page, some of this functionality can be achieved through an accumulator, a bag UDF, or a query rewrite. However, such solutions are sometimes non-intuitive to write. It would be highly desirable to have a nested foreach that handles this complexity in the most optimized way.

In general, we want to support:

C = foreach B {
     C1 = foreach A generate ..;
     ...
     generate group, ..;
}

where A is an inner bag of B. This lets us iterate over the bag and apply expressions (UDFs) to its contents without having to develop a custom UDF that takes a bag. This is analogous to the nested projection approach, in which we stream the inner bag A into the projection to apply an expression to it. For example, consider the following Pig script:

A = load 'loc_a' as (a0, a1);
B = group A by a0;
C = foreach B {
     C1 = A.a0;
     C2 = filter C1 by a0 == 0;
     generate C2;
}

The current implementation supports the above query and generates the following fragment in the logical plan:

 c1: (Name: LOForEach Schema: a0#62:int)
       |           |   |
       |           |   (Name: LOGenerate[false] Schema: a0#62:int)
       |           |   |   |
       |           |   |   a0:(Name: Project Type: int Uid: 62 Input: 0 Column: (*))
       |           |   |
       |           |   |---(Name: LOInnerLoad[0] Schema: a0#62:int)
       |           |
       |           |---a: (Name: LOInnerLoad[1] Schema: a0#62:int,a1#63:int)

Here c1 is inside the inner plan of c and is itself a foreach operator, which shows that a nested foreach already exists in the logical plan. Note, however, that this foreach consists only of LOInnerLoad and LOGenerate, so support for foreach at this level is currently very limited. We can extend it to let users specify more sophisticated expressions (UDFs).
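
The intended semantics can be illustrated outside Pig. The following is a minimal Python sketch, not Pig code and not the proposed implementation. It assumes in-memory lists in place of bags; the helper names group_by and nested_foreach, the sample rows, and the doubling expression are all hypothetical.

```python
from collections import defaultdict


def group_by(rows, key_index):
    """Mimic `B = group A by a0;`: map each key to its inner bag of tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return groups


def nested_foreach(groups, expr):
    """Mimic `C = foreach B { C1 = foreach A generate expr(..); generate group, C1; }`
    by applying expr to every tuple of each inner bag."""
    return [(key, [expr(t) for t in bag]) for key, bag in groups.items()]


# Hypothetical rows standing in for `A = load 'loc_a' as (a0, a1);`
A = [(0, 10), (0, 20), (1, 30)]
B = group_by(A, 0)
# The lambda stands in for an arbitrary expression or UDF over one tuple.
C = nested_foreach(B, lambda t: (t[1] * 2,))
print(C)  # [(0, [(20,), (40,)]), (1, [(60,)])]
```

The point of the sketch is that the per-tuple expression is written once and applied inside each group, which is what a user would otherwise have to package as a bag-taking UDF.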

 

Changes Required

  • First, the parser needs changes to support a nested foreach.
    • A nested block (nested_blk) reduces to nested_command_list, which finally reduces to nested_op. We need to add support for nested_foreach here.
    • nested_foreach will be similar to foreach_simple_statement but will only allow nested_op_input as its input.
    • We do not support further nesting, so reusing the simple statement form prevents deeper nesting.
    • In summary, the parser grammar needs a new alternative in nested_op and a new rule, nested_foreach: FOREACH^ nested_op_input generate_clause
  • Second, LogicalPlanBuilder needs supporting code to generate a logical plan from the new syntax.
    • This is done by adding nested_foreach[String alias] returns[Operator op] to LogicalPlanGenerator.g.
    • We need to add buildNestedForeachOp to LogicalPlanBuilder. It will generate the required operators, adding an LOInnerLoad for the bag streaming.
    • The generate plan here can be fairly sophisticated (UDFs), so LOGenerate will be added by the builder. However, we need to restrict LOGenerate to operate on projections (star) of the inner bag and throw a parser exception otherwise.
  • The connections between the operators of the inner plan in the DAG will be handled in LogicalPlanBuilder, which gives us the required dependencies among LOGenerate, LOUserFunc, LOInnerLoad, etc.
  • Next, flatten_generated_item supports the flatten clause, col_range, expr, and star. We need to verify that all the changes required to support these are in place, which will need more testing.
  • The transformations required for this change could be added at different stages in the plan builder.
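
The two-level limit described above can be pictured with a toy check. This is a hedged Python sketch rather than ANTLR or Pig code: the proposed grammar rules out deeper nesting syntactically, whereas this sketch walks a made-up dictionary tree to illustrate the same rule. NestingError, check_nesting, and the dictionary layout are hypothetical and do not correspond to Pig's operator classes.

```python
class NestingError(Exception):
    """Hypothetical stand-in for the parser exception."""


def check_nesting(op, depth=1, max_depth=2):
    """Raise NestingError if a foreach appears deeper than max_depth levels.

    `op` is a toy dict such as {"type": "foreach", "body": [...]}; it is
    not Pig's real operator model.
    """
    if op["type"] != "foreach":
        return
    if depth > max_depth:
        raise NestingError(
            f"foreach nested {depth} levels deep; at most {max_depth} supported"
        )
    for child in op.get("body", []):
        check_nesting(child, depth + 1, max_depth)


# foreach B { C1 = foreach A ...; C2 = filter ...; } is accepted.
two_level = {"type": "foreach", "body": [
    {"type": "foreach", "body": []},
    {"type": "filter"},
]}
check_nesting(two_level)

# A third level of foreach is rejected.
three_level = {"type": "foreach", "body": [
    {"type": "foreach", "body": [{"type": "foreach", "body": []}]},
]}
try:
    check_nesting(three_level)
    rejected = False
except NestingError:
    rejected = True
print(rejected)  # True
```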

My background:

I recently graduated from the master's program in Information Networking at Carnegie Mellon University. I am currently pursuing my interests in distributed computing under Optional Practical Training. I have a strong interest in Hadoop and related technologies. Pig is a powerful tool for harnessing Hadoop's power to perform computation; it reduces development cost by orders of magnitude, and Pig users do not have to worry about the internals of MapReduce. Still, it is very beneficial to understand how Pig works in order to make optimal use of it. My interest is to use this opportunity to learn how Pig works and to understand how its various operations are designed to run on Hadoop.

Last summer, I worked with the Pig team at Yahoo, under Daniel's guidance, on the development of features for the Pig 0.8 release. I would like to build on my experience of working with the Yahoo team to learn about Pig and Hadoop in more detail. I have a good understanding of Java internals and of the application of design patterns in OO design.

I hope to learn more from the Pig team and make the best use of this opportunity.

 

Project Breakdown:

Before 25 April:

Identify various use cases of nested foreach and explore a suitable substitute operator (or the possibility of a new operator).

April 25 - May 23:

Learn ANTLR and understand the new Pig parser.

Make changes on the parser side and add test cases to support the changes.

Submit patches with one approach.

May 23 - June 20:

Explore multiple approaches and identify the best possible approach. Complete the code related to that approach.

Add test cases to verify the approach.

Perform multiple iterations of discussion with Pig developers and identify challenges.

June 20 - July 11:

Submit the final patch and run it through all the test cases listed in the first phase of the project.

Identify areas for optimization and improvement.

July 11 - August 1:

Identify opportunities to merge the approach with the Nested Cross implementation. Make the required code changes.

August 1 - August 22:

Final pencils down for changes.

Work on other small related issues from JIRA.