Monday, February 28, 2011

Using the HDFS APIs without the hadoop main

I recently spent way too much time trying to run a simple Java application that uses the HDFS APIs to copy files into HDFS.

While using these APIs works great from within an application launched via the 'hadoop' command line, building one that is called from a Java main() was more challenging.
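For context, the HDFS part of the application boils down to a few calls against the FileSystem API. Here is a minimal sketch (the class name and the local source path are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml off the classpath, which is
        // why $HADOOP_HOME/conf has to be on it (see below).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (both paths are placeholders).
        fs.copyFromLocalFile(new Path("/tmp/local_file"), new Path("/my_dir/my_file"));
        fs.close();
    }
}

Launched via 'hadoop', this just works; launched via plain 'java', it needs the setup described below.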

Why would I want to do this? Mainly because the process that looks for files and figures out where they should live in HDFS also selects from and updates an external database, using Hibernate and Spring. But the Cascading application that uses the data in HDFS doesn't need Hibernate or Spring, so building a single jar that supports both seemed like overkill (and had its own issues with dependencies and the final size of the jar).

The first challenge was figuring out which parts of the hadoop shell script to duplicate, since not everything was needed or welcome. It turns out the information lives in two places: $HADOOP_HOME/bin/hadoop-config.sh and $HADOOP_HOME/conf/hadoop-env.sh.

My first try was to set $HADOOP_HOME and source '. $HADOOP_HOME/bin/hadoop-config.sh' in the script that calls my Java main().

Except this stepped on HADOOP_HOME. It turns out hadoop-config.sh plays some games with the path used to invoke the script in order to figure out what to set HADOOP_HOME to, so sourcing it from my own script left HADOOP_HOME pointing at the wrong place.

So calling '/opt/hadoop/bin/hadoop-config.sh' by its full path (note: no '.' and no shell variable!) set the path correctly.

Then you can use '. $HADOOP_HOME/conf/hadoop-env.sh' to pick up the rest of the environment.

Finally, in your Java classpath, $HADOOP_HOME/conf needs to come before your own classes (that's where the cluster configuration files live).

For example: java -classpath .:$HADOOP_HOME/conf:/mycode ...
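Putting it all together, a minimal launcher script looks something like the sketch below. It assumes a 0.20-style install under /opt/hadoop (core jar at the top of the install, dependencies under lib/), application classes under /mycode, and a made-up main class; it also exports HADOOP_HOME directly rather than relying on hadoop-config.sh to derive it.

#!/bin/sh
# Minimal launcher sketch; paths and the main class are placeholders.

# Export HADOOP_HOME explicitly instead of letting hadoop-config.sh
# work it out from the calling script's path.
export HADOOP_HOME=/opt/hadoop

# Pick up JAVA_HOME and the rest of the Hadoop environment.
. $HADOOP_HOME/conf/hadoop-env.sh

# The conf directory must come before your own classes.
CLASSPATH=.:$HADOOP_HOME/conf:/mycode

# Add the Hadoop core jar and its library dependencies.
for f in $HADOOP_HOME/hadoop-*-core.jar $HADOOP_HOME/lib/*.jar; do
  CLASSPATH=$CLASSPATH:$f
done

java -classpath "$CLASSPATH" com.example.MyMain "$@"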

For the record, here is the error I was getting that eventually led me to all of this:

java.io.IOException: Mkdirs failed to create /my_dir/my_file

(Without the conf directory on the classpath, the default configuration points at the local file system instead of HDFS, so the copy was trying to create /my_dir on local disk and failing.)
