Monday, February 28, 2011

Using the HDFS APIs without the hadoop main

I recently spent way too much time trying to run a simple Java application that uses the HDFS APIs to copy files into HDFS.

While these APIs work great from within an application launched via the 'hadoop' command line, building one that is called from a plain Java main() was more challenging.

Why would I want to do this? Mainly because the process that looks for files and figures out where they should live in HDFS also selects from and updates an external database, using Hibernate and Spring. But the Cascading application that uses the data in HDFS doesn't need Hibernate or Spring, so building a single jar that supports both seemed like overkill (and had its own issues with dependencies and the final size of the jar).
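For reference, the main() in question is little more than a thin wrapper around Hadoop's FileSystem API, something like this minimal sketch (the class name and argument handling are mine for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml etc. from the classpath, which is why
        // $HADOOP_HOME/conf has to be on it (see below).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] = local source, args[1] = HDFS destination
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        fs.close();
    }
}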

The first challenge was figuring out which parts of the 'hadoop' shell script to duplicate, since not everything was needed or welcome. It turns out the information is in two places: $HADOOP_HOME/bin/hadoop-config.sh and $HADOOP_HOME/conf/hadoop-env.sh.

My first try was to set $HADOOP_HOME and source '. $HADOOP_HOME/bin/hadoop-config.sh' in the script that calls my Java main().

Except this stepped on HADOOP_HOME for some reason. It turns out hadoop-config.sh plays games with the path used to invoke the script in order to figure out what to set HADOOP_HOME to.

So calling '/opt/hadoop/bin/hadoop-config.sh' by its full literal path (note: no '.' and no shell variable!) set HADOOP_HOME correctly.

Then you can source '. $HADOOP_HOME/conf/hadoop-env.sh'.

Finally, in your Java classpath you need to put '$HADOOP_HOME/conf' before your own classes.

For example: java -classpath .:$HADOOP_HOME/conf:/mycode ...

For the record, here is the error I was receiving that helped me find this:

java.io.IOException: Mkdirs failed to create /my_dir/my_file

(Without the conf directory on the classpath, Hadoop's Configuration falls back to its default of the local filesystem, so the code was trying, and failing, to create my HDFS path on the local disk.)

Thursday, February 10, 2011

Tired of Hibernate lazy load errors on collections?

I have a love/hate relationship with Hibernate (Spring with Hibernate, to be fair). It makes some things very, very easy and makes others difficult and convoluted.

For example, what should be trivial turns out to be a major pain:

Take an @Entity object that contains two child collections. For example:

@Entity
public class Client implements Serializable {
    ...

    @OneToMany(mappedBy = "client")
    private List<User> users;

    @OneToMany(mappedBy = "client")
    private List<Site> sites;
    ...
}

Where User and Site are also simple @Entity classes.

By default, access to these collections is lazily loaded. But when you want to access both of them, you run into problems. Setting both to eager fetching gets you an error about simultaneously fetching multiple bags. Loading one eagerly and then passing the object to, say, a JSP will get you the infamous:

org.hibernate.LazyInitializationException: failed to lazily initialize a collection of role: ...Client.organizations, no session or session was closed

(Search Google or Stack Overflow for this error; there are no good solutions.)

So the hack/workaround? Create a custom find method on the DAO and EXPLICITLY tell Hibernate to load the collection.

Note that using the collection directly in 'normal' code doesn't work, since the getter only hands back Hibernate's lazy proxy; nothing actually forces the load while the session is still open.

However, this works:

Hibernate.initialize(rtn.getOrganizations());

called right after your default 'find' method returns, while the session is still open.
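Here's a minimal sketch of what that custom DAO find method might look like, assuming a JPA-style Spring DAO for the Client entity above (the method name, getters, and injection style are my assumptions, not the actual project code):

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

import org.hibernate.Hibernate;
import org.springframework.stereotype.Repository;
import org.springframework.transaction.annotation.Transactional;

@Repository
public class ClientDao {

    @PersistenceContext
    private EntityManager em;

    // Runs inside a transaction, so the Hibernate session is still open here.
    @Transactional(readOnly = true)
    public Client findWithChildren(Long id) {
        Client client = em.find(Client.class, id);
        // Explicitly force both lazy collections to load now,
        // before the session closes and the JSP touches them.
        Hibernate.initialize(client.getUsers());
        Hibernate.initialize(client.getSites());
        return client;
    }
}

With that, the JSP can iterate both collections without tripping the LazyInitializationException.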

Details of how/why this works here.

Even more bizarre: when debugging this, if you set a breakpoint in the 'find' method, everything works correctly. Remove the breakpoint and you get the exception. (Presumably the debugger inspecting the object is what initializes the collection.)

(Thanks to Scott Mitchell for his help on pointing me to this solution.)