Thursday, June 18, 2015

Gotcha: CDH manager killed, but no error log, on VM

I recently ran into a problem where the Cloudera Manager server silently died during startup, or a few seconds after. There were no useful entries in the error log, nothing. If it had been a heap space problem, I would have seen an OutOfMemoryError; for most other problems, I would have expected at least some kind of log entry. But the OS just killed the service for some reason.

I was using a VM managed by Vagrant, and it turned out the base box I was using had its memory set to around 512 MB by default, while the Cloudera Manager was configured with a maximum heap size of 2 GB. Ouch. At some point the service exhausted the available memory, and the kernel's OOM killer terminated it. That also explains why the service's own log showed nothing: the OOM killer leaves its traces in the kernel log (dmesg or /var/log/messages), not in the logs of the process it kills.

I finally found the important information hidden on the requirements page: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_ig_cm_requirements.html

There it says:

  • RAM - 4 GB is recommended for most cases and is required when using Oracle databases. 2 GB may be sufficient for non-Oracle deployments with fewer than 100 hosts. However, to run the Cloudera Manager Server on a machine with 2 GB of RAM, you must tune down its maximum heap size (by modifying -Xmx in /etc/default/cloudera-scm-server). Otherwise the kernel may kill the Server for consuming too much RAM.

Well, thank you. May I suggest putting something like that on the Troubleshooting page as well?
After seeing this, it didn't take long to figure out what was going on (that was after roughly two or three days of debugging...).

Wednesday, June 3, 2015

Connecting to Hadoop CDH with a Windows client

Random notes and links on getting Windows clients to work with CDH-5.3.1:

  1. Use the CDH releases of your Hadoop libraries, see http://blog.cloudera.com/blog/2012/08/developing-cdh-applications-with-maven-and-eclipse/ (for other Hadoop distributions this simply means: use the same versions on the client as in the cluster).
  2. Set the necessary properties for cross-OS functioning (why, oh why is such a thing necessary?) and get winutils.exe, see https://github.com/spring-projects/spring-hadoop/wiki/Using-a-Windows-client-together-with-a-Linux-cluster (a sketch covering points 2, 3, and 8 follows after this list).
  3. Set the environment variable HADOOP_USER_NAME to an appropriate value, see http://stackoverflow.com/a/11062529/1319284
  4. If you are using HDFS (which you probably are), you need to add hadoop-hdfs to your classpath if it is not there already, see http://stackoverflow.com/a/24492225/1319284
  5. Check the firewall rules of the nodes; on CentOS you can do this with system-config-firewall, see https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-fw.html. A list of the ports used by the various CDH components is at http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_ports_cdh5.html
  6. Make sure all configured host names can be resolved, and edit the hosts file if necessary (on Windows it is located at C:\Windows\System32\drivers\etc\hosts).
  7. Make sure your compiler compliance level is set to target the right Java version; right now this is 1.7. Failing to do so produces errors like "Unsupported major.minor version 52.0" (class files compiled for Java 8 running on a Java 7 JVM). See for example http://stackoverflow.com/questions/22489398/unsupported-major-minor-version-52-0
  8. When using Eclipse, make sure to export a jar (or build one with Maven) and add it to the classpath of the launch configuration. That way, Job.setJarByClass will find the jar, which can then be uploaded to the cluster. Granted, this is a little hacky, but it works.
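
To tie points 2, 3, and 8 together, here is a minimal sketch of what the client-side driver can look like under CDH 5.3.1. Treat it as an illustration rather than a verified template: the paths, NameNode host name, and user name are placeholders, and the cross-platform property is the setting described in the spring-hadoop wiki linked above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WindowsClientDriver {
        public static void main(String[] args) throws Exception {
            // Point 2: point Hadoop at the directory whose bin\ contains
            // winutils.exe (placeholder path); setting the HADOOP_HOME
            // environment variable works as well.
            System.setProperty("hadoop.home.dir", "C:\\hadoop-2.5.0-cdh5.3.1");

            // Point 3: act as a user that has the required permissions on the
            // cluster (can also be set as an environment variable before launching).
            System.setProperty("HADOOP_USER_NAME", "hdfs");

            Configuration conf = new Configuration();
            // Placeholder NameNode address; in practice the cluster's *-site.xml
            // files on the classpath are the cleaner way to configure this.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            // One of the cross-OS settings described in the spring-hadoop wiki.
            conf.setBoolean("mapreduce.app-submission.cross-platform", true);

            Job job = Job.getInstance(conf, "job-from-windows");
            // Point 8: ship the exported jar to the cluster. setJarByClass works when
            // this class was loaded from that jar; job.setJar("path/to/job.jar") is
            // the explicit alternative.
            job.setJarByClass(WindowsClientDriver.class);

            // Set mapper, reducer and output key/value classes here as usual.

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }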

After doing all this, I successfully ran my MapReduce job from Eclipse.

[Update]
For CDH 5.5.0 (Hadoop 2.6.0), a binary build with winutils.exe can be downloaded from http://www.barik.net/archive/2015/01/19/172716/
In addition to setting hadoop.home.dir, java.library.path must be set to that build's bin directory. Note that changing java.library.path at runtime via System.setProperty is not reliably picked up, so pass it as a VM argument (-Djava.library.path=...) in the launch configuration instead.
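
A quick way to verify both settings before submitting a job is a small check along these lines (the class name and the paths are made up for illustration):

    import java.io.File;

    // Sanity check for the Windows client setup. Launch with VM arguments along
    // the lines of (example paths):
    //   -Dhadoop.home.dir=C:\hadoop-2.6.0 -Djava.library.path=C:\hadoop-2.6.0\bin
    public class HadoopClientCheck {
        public static void main(String[] args) {
            String home = System.getProperty("hadoop.home.dir");
            System.out.println("hadoop.home.dir   = " + home);
            System.out.println("java.library.path = " + System.getProperty("java.library.path"));
            if (home != null) {
                File winutils = new File(home, "bin" + File.separator + "winutils.exe");
                System.out.println("winutils.exe found: " + winutils.isFile());
            }
        }
    }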