Custom Processing using Apache Pig UDFs (User Defined Functions)

15.04.2015

Pig UDFs can be implemented easily in Java. The steps below create a UDF using Eclipse.

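  • Create a normal Java project and a Java class (the UDF) that extends one of Pig's base classes: EvalFunc, FilterFunc, LoadFunc or StoreFunc.
  • For an EvalFunc or FilterFunc, override the exec() function to write the implementation (load and store functions override other methods, such as getNext() and putNext()).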

Download pig.jar (ideally the version that matches your Pig installation) and include it in the build path, otherwise the code will not compile. Every new function has to extend either the EvalFunc class or one of the other base classes, such as LoadFunc. All of these classes reside in pig.jar.
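As an illustration of what goes into exec(), here is a minimal sketch of the logic a trim-style UDF might run. It is written as plain Java so that it compiles without pig.jar: the class name TrimSketch and the use of a List in place of Pig's Tuple are assumptions made for this example only. In a real UDF the class would extend org.apache.pig.EvalFunc<String> and exec() would take an org.apache.pig.data.Tuple.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of the logic a trim-style UDF would run inside exec().
 * Hypothetical stand-in: a real UDF extends org.apache.pig.EvalFunc<String>
 * and receives an org.apache.pig.data.Tuple instead of a List.
 */
public class TrimSketch {

    // Stand-in for exec(Tuple input): takes the first field and trims it.
    // Returning null for missing or null input passes nulls through
    // instead of failing on bad records.
    public static String exec(List<Object> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().trim();
    }

    public static void main(String[] args) {
        // prints "[IBM]"
        System.out.println("[" + exec(Arrays.<Object>asList("  IBM ")) + "]");
    }
}
```

Checking for empty or null input before touching the field keeps a single malformed row from aborting the whole job.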

  • Export the Java project as a JAR file (in Eclipse: File > Export > Java > JAR file).

  • To use this jar, register it at the Grunt prompt:
grunt> register 'your_path_to_jar/NewUDF.jar';
  • Define a name for the UDF

The define statement assigns a short alias to the fully qualified class name. This makes the script more readable and avoids repeating the full package path everywhere the UDF is called.

grunt> define TRIM com.hadoop.pig.Trim();
  • Using the UDF
grunt> divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
       date:chararray, dividends:float);
grunt> trimmed = foreach divs generate TRIM(symbol);

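Alternatively, the udf.import.list property can be passed on the command line to give Pig a list of packages to search when it looks for UDFs, so that scripts can call them by their short class names.

The invocation then becomes (register.pig being the script to run):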

pig -Dudf.import.list=org.apache.pig.piggybank.evaluation.string register.pig
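The package search that udf.import.list enables can be pictured with a small Java sketch. This is not Pig's actual resolver: the class name ImportListSketch, the resolve() helper and the use of JDK packages as stand-ins for UDF packages are all assumptions for illustration. It tries the name as written, then each listed package in order.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch of resolving a short class name against a list of
 * packages, in the spirit of udf.import.list. Not Pig's real implementation.
 */
public class ImportListSketch {

    // Tries the name as given, then pkg + "." + name for each package in
    // order, and returns the first class that loads.
    public static Class<?> resolve(String name, List<String> packages) {
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            // Not a fully qualified name; fall through to the package list.
        }
        for (String pkg : packages) {
            try {
                return Class.forName(pkg + "." + name);
            } catch (ClassNotFoundException e) {
                // Not in this package; try the next one.
            }
        }
        throw new IllegalArgumentException("Could not resolve " + name);
    }

    public static void main(String[] args) {
        // JDK packages stand in for UDF packages here.
        List<String> importList = Arrays.asList("java.lang", "java.util");
        // prints "java.util.ArrayList"
        System.out.println(resolve("ArrayList", importList).getName());
    }
}
```

Because the first match wins, the order of the packages matters whenever two of them contain a class with the same short name.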

A second property, pig.additional.jars, removes the need for the register command as well.

If the option below is added to the command line, the register command is no longer necessary:

-Dpig.additional.jars=/usr/local/pig/piggybank/piggybank.jar

 

Creating a UDF (without eclipse)

  • Create a folder myNewUdf
  • Create a UDF in a java file, say 'TrimTo.java'

The class name should also be TrimTo, matching the file name. The package name should be the same as the name of the folder where the Java file resides, i.e. 'myNewUdf'.

 $ cd myNewUdf/
 $ ls -l
 total 8
-rw-rw-r-- 1 userName userName 1162 Feb 21 16:33 TrimTo.java
  • Compile the Java file
:~/pig/myNewUdf$ javac -classpath /home/userName/pig/trunk/pig.jar TrimTo.java

The class file is now visible:

:~/pig/myNewUdf$ ls -l
total 8
-rw-rw-r-- 1 userName userName 1917 Feb 21 16:45 TrimTo.class
-rw-rw-r-- 1 userName userName 1162 Feb 21 16:33 TrimTo.java
  • Go up one level and create the jar with the same name as the folder, i.e. 'myNewUdf.jar'
:~/pig/myNewUdf$ cd ..
:~/pig$ jar cf myNewUdf.jar myNewUdf
  • At the Grunt prompt
grunt> REGISTER /home/userName/pig/myNewUdf.jar;
grunt> GrocPricesTrim = FOREACH GrocPrices generate myNewUdf.TrimTo(PRODUCTNAME);
grunt> ILLUSTRATE GrocPricesTrim;

 

Creating and using Macros

Macros are declared with the define statement. A macro takes a set of input parameters, which are string values substituted for the parameter names (written as $name inside the body) when the macro is expanded. The name of the output relation is given in the returns clause. The operators of the macro are enclosed in braces ({}).

-------- Macro.pig --------

define dividend_analysis (daily, year, daily_symbol, daily_open, daily_close)
returns analyzed {
    divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividends:float);
    divsthisyear = filter divs by date matches '$year-.*';
    dailythisyear = filter $daily by date matches '$year-.*';
    jnd = join divsthisyear by symbol, dailythisyear by $daily_symbol;
    $analyzed = foreach jnd generate dailythisyear::symbol, $daily_close
        - $daily_open;
};
------- on the Grunt shell ---------
daily = load '/home/share/Customer-Bigdata-Analysis/NYSE_daily.txt'
as (exchange:chararray, symbol:chararray,date:chararray, open:float,
high:float, low:float, close:float,volume:int, adj_close:float);
import '/home/cs246/PigPPT/macro.pig';
results = dividend_analysis(daily, '2009', 'symbol', 'open', 'close');
describe results;
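
The parameter substitution that happens when the macro above is expanded can be approximated with a short Java sketch. This is only a model of the idea, not Pig's implementation: the class MacroExpansionSketch and its expand() helper are hypothetical, and Pig's real expansion does more (for example, it also renames aliases defined inside the macro to avoid clashes).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical model of macro parameter substitution: every $name in the
 * macro body is replaced by the string bound to that name.
 */
public class MacroExpansionSketch {

    // Matches $name, where name consists of letters, digits or underscores.
    // Matching the whole identifier keeps $daily from clobbering $daily_open.
    private static final Pattern PARAM = Pattern.compile("\\$(\\w+)");

    // Replaces each bound $name with its value; unbound names stay as they are.
    public static String expand(String body, Map<String, String> bindings) {
        Matcher m = PARAM.matcher(body);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = bindings.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Bindings taken from the call
        // results = dividend_analysis(daily, '2009', 'symbol', 'open', 'close');
        Map<String, String> bindings = new LinkedHashMap<>();
        bindings.put("daily", "daily");
        bindings.put("year", "2009");
        bindings.put("daily_symbol", "symbol");
        bindings.put("daily_open", "open");
        bindings.put("daily_close", "close");
        bindings.put("analyzed", "results");

        String line = "dailythisyear = filter $daily by date matches '$year-.*';";
        // prints "dailythisyear = filter daily by date matches '2009-.*';"
        System.out.println(expand(line, bindings));
    }
}
```

Matching the full identifier after the dollar sign is the important design choice here: a naive search-and-replace of $daily would also corrupt $daily_symbol, $daily_open and $daily_close.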
