Thursday, February 9, 2012

Finding "Number of Under-Replicated Blocks" in Hadoop

This one was bugging me for a long time. Even with the cluster idle, the Name Node summary would tell me there were a number of Under-Replicated Blocks in the system.

Turns out that all the Name Node problems we've been having were leaving 'temporary' files in HDFS, and for whatever reason restarting the Name Node didn't clean them up.

I found them under: /log/hadoop/tmp/mapred/staging/<>/.staging/job_*

After confirming that the users weren't running active jobs, removing these directories via the command line reduced the number of blocks in the report and eventually all were cleared.

FYI our Name Node problems APPEAR to have been resolved in Cloudera CDH3 u3. Name Node has been up for 3 days now. Previously we were lucky if it lasted 48 hours.

Monday, February 28, 2011

Using the HDFS APIs without the hadoop main

I recently spent way too much time trying to run a simple Java application that uses the HDFS APIs to copy files into HDFS.

While using these APIs works great from within an application launched via the 'hadoop' command line, building one that is called from a Java main() was more challenging.

Why would I want to do this? Mainly because the process that looks for files and figures out where they should live in HDFS also selects from and updates an external database, using Hibernate and Spring. But the Cascading application that uses the data in HDFS doesn't need Hibernate or Spring. So building a single jar that supports both seemed like overkill (and had its own issues with dependencies and the final size of the jar).

First challenge was figuring out what parts of the hadoop shell file to duplicate, since not everything was needed or welcome. Turns out the information is in two places: /bin/hadoop-config.sh and /conf/hadoop-env.sh.

First try was to set $HADOOP_HOME and try '. $HADOOP_HOME/bin/hadoop-config.sh' in the script that calls my java main().

Except this stepped on HADOOP_HOME for some reason. Turns out the hadoop-config.sh file is playing some games with the path used to call the script to figure out what to set HADOOP_HOME to.

So calling '/opt/hadoop/bin/hadoop-config.sh' (note no '.' or shell variable!) set the path correctly.

Then you can use '. $HADOOP_HOME/conf/hadoop-env.sh'.

Finally, in your java classpath you need to set '$HADOOP_HOME/conf' before your classes.

For example: java -classpath .:$HADOOP_HOME/conf:/mycode ...

For the record, here is the error I was receiving that helped me find this:

java.io.IOException: Mkdirs failed to create /my_dir/my_file

Thursday, February 10, 2011

Tired of hibernate lazy load errors on collections?

I have a love/hate relationship with Hibernate (Spring w/Hibernate to be fair). It makes some things very, very easy and others so difficult or convoluted.

For example, what should be trivial turns out to be a major pain:

Take an Entity object that contains two child collections. For example:

@Entity
public class Client implements Serializable {
    ...

    @OneToMany(mappedBy = "client")
    private List users;

    @OneToMany(mappedBy = "client")
    private List sites;
    ...

}

Where User and Sites are also simple @Entity classes.

By default, access to these collections is lazy loaded. But when you want to access them both you run into problems. Setting both to 'eager' gets an error about too many buckets (Hibernate words it as being unable to simultaneously fetch multiple bags). Loading one eagerly, then passing the object to a JSP for example, will get you the infamous:

org.hibernate.LazyInitializationException: failed to lazily initialize a collection of role: ...Client.organizations, no session or session was closed

(Look on Google or StackOverflow for this error; there are no good solutions.)

So the hack/workaround? Create a custom Find method on the DAO and EXPLICITLY tell Hibernate to load the collection.

Note that using the collection directly in 'normal' code doesn't work since Spring/Hibernate/the compiler sees you aren't using the loaded collection and doesn't do the load.

However this works:

Hibernate.initialize(rtn.getOrganizations());

after your default 'find' method returns.

Details of how/why this works here.

Even more bizarre: when debugging this with a breakpoint set in the 'find' method, everything works correctly. No breakpoint and you get the exception.

(Thanks to Scott Mitchell for his help on pointing me to this solution.)

Tuesday, August 24, 2010

Logging in Groovy shouldn't be this hard

Spent about 30 minutes this morning doing what should have been easy: logging from within my application.

Groovy includes/wraps Log4j so I thought it would be easy. All the documentation I found suggested it would be easy.

However all the examples left off one key thing: Defining the 'root' logger.

So, in your Config.groovy, find the log4j section and add/uncomment:

appenders {
    console name:'stdout', layout:pattern(conversionPattern: '%c{2} %m%n')
}


Then add below the standard error and warn items:

root {
    info 'console'
}

Now in your code you can add log.info 'blah blah' and it will appear on the console. The 'appenders' section is where you can add your rolling file loggers for production.

Here is what mine looks like:

// log4j configuration
log4j = {
    // Example of changing the log pattern for the default console
    // appender:
    //
    appenders {
        console name:'stdout', layout:pattern(conversionPattern: '%c{2} %m%n')
    }

    error 'org.codehaus.groovy.grails.web.servlet', // controllers
          'org.codehaus.groovy.grails.web.pages', // GSP
          'org.codehaus.groovy.grails.web.sitemesh', // layouts
          'org.codehaus.groovy.grails.web.mapping.filter', // URL mapping
          'org.codehaus.groovy.grails.web.mapping', // URL mapping
          'org.codehaus.groovy.grails.commons', // core / classloading
          'org.codehaus.groovy.grails.plugins', // plugins
          'org.codehaus.groovy.grails.orm.hibernate', // hibernate integration
          'org.springframework',
          'org.hibernate',
          'net.sf.ehcache.hibernate'

    warn 'org.mortbay.log'

    // the piece the examples leave out: the root logger
    root {
        info 'console'
    }
}

Monday, June 14, 2010

Grails keeps rebuilding classes

I found an interesting problem today after replacing my laptop. My Grails 1.2.2 application that ran fine on Friday wasn't running today on 1.3.1 and the new machine.

I thought it was a 1.3.1 issue, but it wasn't. Turns out that if you have a file under src/groovy NOT grails-app AND the filename isn't the name of a class within the file, Grails will keep rebuilding the source file and clearing the Tomcat cache.

Basically, I had a file 'GroupByCreator.groovy' which didn't have a class named "GroupByCreator" in it. For some reason IntelliJ doesn't complain and everything compiled correctly.

Instead at runtime I kept getting:

[groovyc] Compiling 1 source file to C:\Development\clouds\reportingui\target\classes

[groovyc] Compiling 2 source files to C:\Development\clouds\reportingui\target\classes

[delete] Deleting directory C:\Documents and Settings\ccurtin\.grails\1.3.1\projects\reportingui\tomcat

Running Grails application..
Server running. Browse to http://localhost:8080/reportingui

[groovyc] Compiling 1 source file to C:\Development\clouds\reportingui\target\classes

[groovyc] Compiling 2 source files to C:\Development\clouds\reportingui\target\classes

[delete] Deleting directory C:\Documents and Settings\ccurtin\.grails\1.3.1\projects\reportingui\tomcat

Running Grails application..
Server running. Browse to http://localhost:8080/reportingui


To figure this out, I went to the \reportingui\target\classes directory and sorted by last modified time. From that it was obvious which classes were being rebuilt, just not why. Finally, after looking at the name of the file and the classes (for a few hours. Doh!), I figured it out.

Arrgh.
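A check like this could have saved those hours. What follows is only a minimal sketch in plain Java, not anything Grails ships: given a file name and its source text, it reports whether the file declares a class, interface or enum named after the file. The class name FileNameClassCheck, the method name and the 'GroupByHelper' sample are all made up for illustration, and the regular expression is deliberately naive (it will not understand comments or strings that happen to look like declarations).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: does a Groovy source file declare a type named after the file? */
final class FileNameClassCheck {

    // Naive match for "class Foo", "interface Foo" or "enum Foo",
    // optionally preceded by modifiers or annotations on the same line.
    private static final Pattern TYPE_DECL =
            Pattern.compile("(?m)^\\s*(?:[\\w@]+\\s+)*(?:class|interface|enum)\\s+(\\w+)");

    /** Returns true when some declared type name equals the file name minus ".groovy". */
    static boolean declaresTypeNamedAfterFile(String fileName, String source) {
        String expected = fileName.replaceFirst("\\.groovy$", "");
        Matcher m = TYPE_DECL.matcher(source);
        while (m.find()) {
            if (m.group(1).equals(expected)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical file contents: the class name does not match the file name.
        String source = "class GroupByHelper {\n}\n";
        System.out.println(declaresTypeNamedAfterFile("GroupByCreator.groovy", source)); // prints false
        System.out.println(declaresTypeNamedAfterFile("GroupByHelper.groovy", source));  // prints true
    }
}
```

Run over everything under src/groovy, any file that comes back false is a candidate for the endless rebuild loop shown in the log above.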

Tuesday, May 18, 2010

GORM Criteria using child to parent relationship

I was surprised that I couldn’t find an example of this in any of the Grails and GORM docs or online examples.

I want to find the names and number of occurrences of the CHILD in a 1 to many association based on an attribute in the parent.

The domain classes are:

class Parent {
    static hasMany = [children:Child]
    String firstname
}

class Child {
    static belongsTo = [parent:Parent]
    String name
}

To find all the children and # of occurrences across ANY Parent where firstname is 'John':

def fName = 'John'
// Query from the Child side: 'name' and the 'parent' association both live on Child
def children = Child.withCriteria {
    projections {
        groupProperty 'name'
        count 'id'
    }
    parent {
        eq 'firstname', fName
    }
}
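To spell out what that criteria is meant to compute, here is the same filter, group and count written against plain in-memory Java objects. This is a sketch for illustration only, not GORM or Hibernate code: the ChildCountSketch class, its nested Parent and Child classes and the sample names are all hypothetical stand-ins for the domain classes above. The criteria itself should hand back one row per child name with its count, which this sketch models as a sorted map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Sketch: the child-to-parent criteria expressed over in-memory objects. */
final class ChildCountSketch {

    static final class Parent {
        final String firstname;
        Parent(String firstname) { this.firstname = firstname; }
    }

    static final class Child {
        final String name;
        final Parent parent;
        Child(String name, Parent parent) { this.name = name; this.parent = parent; }
    }

    /** Counts children per name, keeping only children whose parent has the given first name. */
    static Map<String, Long> countChildrenByName(List<Child> children, String parentFirstname) {
        return children.stream()
                // corresponds to: parent { eq 'firstname', fName }
                .filter(c -> c.parent.firstname.equals(parentFirstname))
                // corresponds to: groupProperty 'name' plus count 'id'
                .collect(Collectors.groupingBy(c -> c.name, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Parent john1 = new Parent("John");
        Parent john2 = new Parent("John");
        Parent mary = new Parent("Mary");
        List<Child> children = Arrays.asList(
                new Child("Ann", john1), new Child("Bob", john1),
                new Child("Ann", john2), new Child("Ann", mary));
        System.out.println(countChildrenByName(children, "John")); // prints {Ann=2, Bob=1}
    }
}
```

Note that the grouping runs across every parent named John, not per parent, which is the "across ANY Parent" part of the question.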

Wednesday, March 10, 2010

March 2010 AWSome Meeting - Cloud Security

Last night’s AWSome Atlanta meeting was about Virtualization and Cloud Security topics. Taylor Banks presented about 50 slides on the different things to pay attention to, first in a virtualized environment, then in a Cloud. He correctly pointed out that all the issues with virtualization are also present in a Cloud environment, so make sure you get them right the first time.

Taylor was a good presenter, though I do wonder when he managed to get a breath in. He talked a lot, but he wasn’t rambling. He was noticeably excited and involved in the materials he was presenting.

Very early on he presented a new acronym that I think I’ll start using: K.I.S.S.M.Y.A.S.S

Meaning:
Keep It Simple, Stupid: Make Your Architecture Simpler to Secure

Yes, he mentioned it a few more times during the presentation.

He also said something very profound:
Cloud Security is often more of a process than technology
Yes securing the servers and services is important, but aren’t you already doing that in your self-hosted environments? So Cloud Security is about making decisions about what data you are sharing, how important the data is, what formats you are protecting it in and who can access it.

Taylor also took a few shots at the pundits around Cloud Security and did a pretty funny impression of a DBA. He had a comical routine about giving two different DBAs identical SQL Server instances in a VM, but only telling one of them that it was a VM. You can guess what he was making fun of ;-)

The presentation is here.

Taylor’s twitter handle is @TaylorBanks.

Wednesday, February 10, 2010

AWSome Atlanta February 2010

Last night’s AWSome Atlanta meeting was one of the best attended that I can remember. The main topic was the Chef configuration management tool for computer infrastructures. The crowd was a mix of the ‘usual suspects’ and quite a few new faces.

During the usual introduction from John Willis we learned there were a number of consultants, end users and a few academics in the crowd. Probably half indicated they were starting to learn about Cloud Computing.

Josh Timberman (@jtimberman) from Opscode presented on Chef, the concepts behind it and ways to use it in your environment. His presentation was good, but it took a few minutes to get a ‘big picture’ view of what he was talking about. However, once you understood the goal was to stand up a new server (or upgrade an existing server) consistently, the material made a lot of sense.

One of the more interesting things I learned about Chef is they are in beta of a SaaS version where they (Opscode) will host the Chef server which can then be reached via a client within EC2, Rackspace etc. or within your own environment. This is interesting because it removes the need to have a server hosted someplace other than the IaaS provider for the Chef cookbooks and recipes. (Thus no need for IT resources.)

The 30 second overview of Chef: you create recipes and cookbooks for the applications and configurations you need to stand up a server. So you can create an “Apache 2, Tomcat 6, JDK 6.10, MySoftware 2.00” recipe that knows how to install and configure an exact copy of the environment you want. And do it repeatedly without any intervention or manual steps. (The full explanation took an hour, so there is a lot more to it though.)

Very useful when you are in the cloud and spinning up new instances, but also useful in an internal environment when you need to bring up a new server due to hardware failure (or faster boxes!) or when you want to quickly deploy a new version of software.

Consider building a new Chef recipe for the next release of your software. You create a new server, validate against QA then just change the configuration that defines the production locations/databases etc. No manual check lists, no forgetting about a new service or cron job.

After Josh’s presentation we had about an hour of open discussion. Lots of topics, including EC2, where the cloud is going to impact business, my views on the return of the ISV and a short religious discussion on Ruby ;-) (Sorry Keith)

I hope the new folks got enough out of the session to keep coming back.

Opscode can be found here.

AWSome Atlanta can be found here.

Thursday, January 28, 2010

In my last post I commented that sometimes “drinking the Kool Aid” is a bad thing. Here’s an example that bit me for a couple of hours this week.

Groovy and Grails have done a lot to ‘make things work’ by expanding many of the core Java classes to remove the need for all the ‘work’ needed to get something to work. For example, database connections ‘just work’ and you don’t need tons of exception handling code or thinking about how to release them.

Groovy also makes interacting with existing Java code trivial.

However, the combination of the two leads to some problems. Consider my case. I wanted to upload a comma separated value (CSV) file in Grails, split the file apart and build objects from it. Pretty straightforward, right? Done it many times in Java with Struts or GWT.

Searching for examples also showed a number of ways to do it. I took the ‘approved’ version from the Grails site.

Searching for ‘groovy csv’ also showed a lot of examples, including the use of CSVReader, which I’ve used before.

Putting the two examples together was trivial and took a few minutes to test, and everything went perfectly. Until I tried to upload the same file a second time to test the ‘update’ logic. I received an exception: because the file was locked on the host, I couldn’t replace the ‘working’ copy.

Turns out that CSVReader is a Java class with very specific lifecycle steps that hadn’t been ‘groovy-fied’ yet. So while Groovy handled all the exception framework, collection logic etc., it didn’t know to close the CSVReader at the end. I added an explicit close() on the CSVReader object and things work now.
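The general shape of the fix, sketched in plain Java with only java.io classes standing in for CSVReader: read everything inside a try-with-resources block so the reader is closed whether or not parsing throws. CsvUploadSketch and readRows are hypothetical names, and the comma split is deliberately naive (no quoted fields). Whether the same block works directly with the CSVReader version used here depends on it implementing Closeable, which is an assumption this sketch does not check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch: read all CSV rows and always release the underlying reader. */
final class CsvUploadSketch {

    /** Reads every row; the reader (and whatever it wraps) is closed on exit, even on failure. */
    static List<String[]> readRows(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            List<String[]> rows = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(",", -1)); // naive split: no quoted-field handling
            }
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A StringReader stands in for the uploaded working copy.
        List<String[]> rows = readRows(new StringReader("id,name\n1,Ann\n2,Bob\n"));
        System.out.println(rows.size());    // prints 3
        System.out.println(rows.get(1)[1]); // prints Ann
    }
}
```

With a file-backed reader, closing it is what releases the handle, so a second upload can replace the working copy.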

I was (am?) trying to learn the Groovy paradigm, not just the language, so I’m deliberately not writing ‘Java in Groovy’. So I didn’t look at exceptions or lifecycle events, since for a lot of other ‘common’ things Groovy just works. I’ll pay a little more attention to lifecycle from now on.

Tuesday, January 26, 2010

This is the first of hopefully several posts about my experiences and frustrations learning Groovy and Grails. I’ve been programming for over 20 years and have seen a lot of techniques, technologies and tools that claim to make a developer more productive. Many did make you more productive. Some did until you needed to modify the code a year later, or modify someone else’s code. So I’m pretty skeptical about all the claims from fan boys about any technology.

So with the hype around dynamic languages including Ruby and Groovy, I started looking into how useful they could be. For the record, when Bruce Tate talked about Ruby at AJUG in August 2005 I was both skeptical and annoyed with what he was presenting. As a ‘fan boy’ he could see nothing wrong with Ruby or the impact it would have on a production system. In particular we got into a mini-debate about the real costs of running a production server vs. the benefits of Rails for faster development, based on the runtime performance back then.

In November 2009, Pratik Patel did a Grails and Groovy presentation to AJUG, which was right after my NOSQL East presentation where I was asked why I don’t use Groovy instead of Java for my applications. Pratik’s presentation got me thinking more about Groovy and researching the differences and benefits from the Java world I’ve been working in for over 9 years now.

So, I’ve started my first Groovy/Grails project. I’ve been doing it part time for a couple of weeks and have some initial impressions.

Pros
  • GORM for ORM hides a lot of the data tier and is pretty straightforward, especially if you’ve ever fought with understanding Hibernate. Some of the bizarre things are still there, but for the most part it’s understandable
  • Auto creation of the UI and the scaffolding in Grails is impressive. After years of screwing up Struts and Tiles configurations and having no idea why, this is cool
  • Being able to call down to Java as needed. I have A LOT of helper code and business logic that I don’t want to rewrite
  • ‘duck typing’

Cons
  • GORM. It took me most of a week and a lot of trial and error to figure out how to return certain fields from a multiple table query in a reporting interface without dropping into HQL. (Yes, I plan on writing a post about this soon)
  • Dynamic language support in tools. I use IntelliJ which is supposed to have the best Groovy/Grails support and it still sucks. I don’t want to wait until runtime to find out that I spelt a parameter name wrong. Or that a variable name is a reserved word in Groovy (Category anyone?)
  • Real examples. The IBM series is pretty good, but I spent a lot of time banging my head on file uploads and GORM beyond the basics.
  • Are some of the ‘features’ really a step back? Naming parameters on method calls has been available since PL/1 (at least) and has always been criticized as too verbose. It just 'feels' wrong to be using them

The biggest negative so far is drinking too much of the Kool Aid ;-) Grails/Groovy makes some things so simple that when you get outside what they’ve “groovy-fied” you need to think again like a Java programmer. I’ll give an example of this in my next post.

So far though I’m impressed with the language and the ease of doing things.