Wednesday, June 3, 2015

Connecting to Hadoop CDH with a Windows client

Random notes and links on getting Windows clients to work with CDH-5.3.1:

  1. Use the CDH releases of your Hadoop libraries, see http://blog.cloudera.com/blog/2012/08/developing-cdh-applications-with-maven-and-eclipse/ (for other Hadoop distributions, this just means: use the same versions on the client as in the cluster)
  2. Set the necessary properties for cross-os functioning (why, oh why is such a thing necessary?), and get winutils.exe, see https://github.com/spring-projects/spring-hadoop/wiki/Using-a-Windows-client-together-with-a-Linux-cluster
  3. Set the environment variable HADOOP_USER_NAME to an appropriate value, see http://stackoverflow.com/a/11062529/1319284
  4. If you are using HDFS (which you probably are), you need to add hadoop-hdfs to your classpath if it is not already there, see http://stackoverflow.com/a/24492225/1319284
  5. Check the firewall rules of the nodes. On CentOS you can do this with system-config-firewall, see https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-fw.html. For a list of the ports used by the various CDH components, see http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_ports_cdh5.html
  6. Make sure all configured host names can be resolved, and edit the hosts file if necessary. (It is located under C:\Windows\System32\drivers\etc\hosts)
  7. Make sure your compiler compliance level is set to target the right Java version. Right now this is 1.7. Failing to do so generates errors like Unsupported major.minor version 52.0. See here for example http://stackoverflow.com/questions/22489398/unsupported-major-minor-version-52-0
  8. When using Eclipse, make sure to export a jar (or build with Maven), and then add it to the classpath of the launch configuration. That way, Job.setJarByClass will find the jar, which can then be uploaded to the cluster. Granted, this is a little hacky, but it works.
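
For steps 2 and 3, the settings can also be applied from code. The following is only a sketch of how that might look, not taken from the linked pages: the property name mapreduce.app-submission.cross-platform is, as far as I know, available since Hadoop 2.4, and HADOOP_USER_NAME is, to my knowledge, also read as a system property when the environment variable is not set. The path C:\hadoop and the user name hdfs are made-up values. The sketch deliberately avoids Hadoop classes so that it runs on its own; with Hadoop on the classpath, the map entries would go into a Configuration via conf.set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WindowsClientSettings {

    /** Configuration entries for a Windows client submitting to a Linux cluster. */
    static Map<String, String> crossPlatformConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Assumption: asks the client to generate OS-neutral launch commands (Hadoop 2.4+).
        conf.put("mapreduce.app-submission.cross-platform", "true");
        return conf;
    }

    /** JVM-wide settings that Hadoop reads from system properties. */
    static void applySystemProperties(String hadoopHome, String remoteUser) {
        // Hadoop looks for bin\winutils.exe below this directory.
        System.setProperty("hadoop.home.dir", hadoopHome);
        // Assumption: used as a fallback when the environment variable is not set.
        System.setProperty("HADOOP_USER_NAME", remoteUser);
    }

    public static void main(String[] args) {
        // Hypothetical values, replace with your own.
        applySystemProperties("C:\\hadoop", "hdfs");
        for (Map.Entry<String, String> e : crossPlatformConf().entrySet()) {
            // With Hadoop on the classpath: conf.set(e.getKey(), e.getValue())
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

The system properties have to be set before the first Hadoop class is used, so the call belongs at the very top of main.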

After doing all this, I successfully ran my MapReduce job from Eclipse.
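
For step 7, it can help to check which Java version a compiled class actually targets before it goes to the cluster. A minimal sketch, assuming you pass the path to a .class file as the first argument (the class name ClassVersionCheck is made up): every class file starts with the magic number 0xCAFEBABE, followed by the minor and the major version, where major version 51 means Java 7 and 52 means Java 8.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ClassVersionCheck {

    /** Extracts the major version from the first eight bytes of a class file. */
    static int majorVersion(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (classBytes.length < 8 || buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // skip the minor version
        return buf.getShort() & 0xFFFF; // major version, read as unsigned
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: ClassVersionCheck <path to .class file>");
            return;
        }
        int major = majorVersion(Files.readAllBytes(Paths.get(args[0])));
        // 51 = Java 7, 52 = Java 8
        System.out.println("major version: " + major);
    }
}
```

If this reports 52 for your mapper or reducer while the cluster runs Java 7, the job will fail with the Unsupported major.minor version 52.0 error.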

[Update]
For CDH 5.5.0 (Hadoop 2.6.0), a binary build with winutils.exe can be downloaded from http://www.barik.net/archive/2015/01/19/172716/
In addition to setting hadoop.home.dir, java.library.path must be set to the bin directory.

Wednesday, December 4, 2013

The mean is a horrible measure for profiling/benchmarking

The mean is a horrible measure for profiling/benchmarking. When the computation chokes, the result is generally a spike in execution time. The mean is not robust against extreme outliers like that, because a single spike can dominate it. The median is much more robust in such cases.


So: Don't average over your execution time when you benchmark an algorithm, take the median!
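
To illustrate, here is a small sketch with made-up timings (the numbers and the class name BenchmarkStats are invented for this example). Six runs take about 10 ms, and one run chokes and takes 500 ms.

```java
import java.util.Arrays;

public class BenchmarkStats {

    static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    static double median(long[] samples) {
        long[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        // Even number of samples: average the two middle values.
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up timings in milliseconds; the last run hit a hiccup.
        long[] millis = {10, 11, 10, 12, 11, 10, 500};
        System.out.println("mean   = " + mean(millis));
        System.out.println("median = " + median(millis));
    }
}
```

The mean comes out at roughly 80.6 ms, which describes none of the runs, while the median is 11 ms.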

Tuesday, November 20, 2012

My Code is in the Juno release train!

I just wanted to point out that Code Recommenders, and with it Jayes, is in the Juno release train! (see http://www.eclipse.org/org/press-release/20120627_junorelease.php) How sweet is that? Code I wrote as my Bachelor's thesis is now at Eclipse :-)

The Code Recommenders feature is distributed as part of the "Eclipse IDE for Java Developers" Download at Eclipse, thus my code now lives on thousands of developer machines :-D

Friday, June 29, 2012

Minimizing Horn formulas using domain knowledge

Assume you have several implications

(x1 & x2) -> x3
(x1 & !x2) -> x3
...

Often you will want to query whether x3 needs to be true under many different truth assignments of your variables, so you might want to optimize your formulas with respect to the number of literals.

For a given right-hand-side variable, this is equivalent to a (DNF) logic minimization task that can be done, e.g., with the Quine-McCluskey algorithm: the disjunction of your left-hand sides is a formula in DNF, and its terms are exactly those left-hand sides.

The formulas above can be optimized by minimizing (x1 & x2) || (x1 & !x2), which yields the formula x1, and thus your optimized implication x1 -> x3.
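
The core step of such a minimization can be sketched in a few lines. This is only the combining step of Quine-McCluskey (two terms over the same variables that differ in exactly one literal are merged), not the full algorithm, and the representation of a term as a map from variable name to polarity is my own choice for this example.

```java
import java.util.Map;
import java.util.TreeMap;

public class TermMerge {

    /** Builds a conjunction from literals such as "x1" or "!x2". */
    static Map<String, Boolean> term(String... literals) {
        Map<String, Boolean> t = new TreeMap<>();
        for (String lit : literals) {
            boolean negated = lit.startsWith("!");
            t.put(negated ? lit.substring(1) : lit, !negated);
        }
        return t;
    }

    /** Returns the merged term, or null if the two terms cannot be combined. */
    static Map<String, Boolean> merge(Map<String, Boolean> a, Map<String, Boolean> b) {
        if (!a.keySet().equals(b.keySet())) {
            return null; // different variables
        }
        String differing = null;
        for (String var : a.keySet()) {
            if (!a.get(var).equals(b.get(var))) {
                if (differing != null) {
                    return null; // more than one difference
                }
                differing = var;
            }
        }
        if (differing == null) {
            return null; // identical terms, nothing to merge
        }
        Map<String, Boolean> merged = new TreeMap<>(a);
        merged.remove(differing); // the differing variable no longer matters
        return merged;
    }

    public static void main(String[] args) {
        // (x1 & x2) || (x1 & !x2)
        System.out.println(merge(term("x1", "x2"), term("x1", "!x2")));
    }
}
```

Running main prints {x1=true}, i.e. the left-hand side of the optimized implication x1 -> x3.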

What does this have to do with HornSAT? As you may have noticed, my examples are not Horn formulas. However, in practice you will sometimes model a problem as HornSAT that has additional, implicit constraints. For example, if some variable v potentially has n states, you may model this as n logical variables in your formulas.

These variables have properties that you normally don't explicitly model in your formulas, because they only blow up the logical representation without helping the original problem. The property I am referring to is the assumption that v will take exactly one of its n states, thus exactly one of the n logical variables will be true at any time.

For ordinary Boolean variables, the analogous property is the principle of bivalence (every variable is either true or false), and that is exactly the property that allows logic minimization! Thus, using domain knowledge, one can, with a little alteration of "normal" logic minimization algorithms, derive domain-specific logic minimization algorithms. I did this once for extracting logical rules that were implicitly contained inside Bayesian networks (it didn't help for inference, but gave me the idea).
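
A sketch of what such a domain-specific merge step might look like (this is an illustration, not the code I wrote back then; all names are made up): if a variable v has the states a, b and c, modelled as the logical variables v_a, v_b and v_c, and there is one implication per state with otherwise identical left-hand sides, the state literals can be dropped altogether.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MultiStateMerge {

    /** A left-hand side: a conjunction of positive literals. */
    static Set<String> body(String... literals) {
        return new TreeSet<>(Arrays.asList(literals));
    }

    /**
     * All bodies imply the same right-hand side. stateLiterals encodes one
     * multi-state variable, of which exactly one literal is true at any time.
     * Returns the shared context if the bodies cover every state, else null.
     */
    static Set<String> collapse(List<Set<String>> bodies, Set<String> stateLiterals) {
        Set<String> context = null;
        Set<String> seen = new HashSet<>();
        for (Set<String> b : bodies) {
            Set<String> states = new HashSet<>(b);
            states.retainAll(stateLiterals);
            if (states.size() != 1) {
                return null; // each body must name exactly one state
            }
            Set<String> rest = new TreeSet<>(b);
            rest.removeAll(stateLiterals);
            if (context == null) {
                context = rest;
            } else if (!context.equals(rest)) {
                return null; // the remaining literals differ
            }
            seen.addAll(states);
        }
        return seen.equals(stateLiterals) ? context : null;
    }

    public static void main(String[] args) {
        Set<String> states = body("v_a", "v_b", "v_c");
        // (x1 & v_a) -> y, (x1 & v_b) -> y, (x1 & v_c) -> y
        List<Set<String>> bodies =
                Arrays.asList(body("x1", "v_a"), body("x1", "v_b"), body("x1", "v_c"));
        System.out.println(collapse(bodies, states));
    }
}
```

Running main prints [x1], so the three implications collapse into x1 -> y. Note that the result stays a Horn formula, because no negated literals are needed.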

Note that, while HornSAT is an easy problem (it can be solved in linear time), logic minimization is hard (NP-hard).

Monday, May 28, 2012

Jayes 0.2.0 release

Jayes 0.2.0 is here!

I finally removed the SNAPSHOT marker from the plugin.xml and pom.xml files - which means I consider this a release version. Find it at GitHub

What is Jayes?


Jayes is an open-source Bayesian Network library written in pure Java. I wrote it as part of my Bachelor's thesis. It is used by the Code Recommenders project at Eclipse, where it predicts method calls for code completion. It performs very well, as I showed in my Bachelor's thesis (also available in the repository), and can compete even with native Bayesian Network libraries.

Why version it 0.2.0?


The reason the current version number is 0.2.0 is that at the point where I first wrote Jayes, the Code Recommenders project was at version 0.2.0. This version number is in no way related to the state of development of the library.

Features in version 0.2.0


- exact inference in Bayesian Networks with discrete-valued variables
- File formats: XMLBIF v0.3, GeNIe's XDSL format


License


Jayes is licensed under the EPL 1.0

Wednesday, April 18, 2012

Tools I don't want to miss


Inkscape

Inkscape is an open-source SVG editor. Vector graphics have many advantages over the pixel graphics you get from e.g. GIMP, Paint etc.
I had to draw several diagrams for my Bachelor's thesis, and I found Inkscape offers lots of functions that make it suitable for that:
  • alignment of different objects
  • drawing perfect squares / circles (press Ctrl while you draw the shape)
  • (very important!) export to PDF and PNG format
The only thing I feel it lacks (maybe I just haven't found that function?) is something modern pixel graphics programs have: multiple layers. But apart from that, it is perfect for drawing diagrams etc., especially when you will need to scale them, as SVG is scale-independent.


Skype / VoIP

Because instant messaging and e-mail are sometimes not productive enough. I saved so much time at university by communicating with my fellow students through Skype, it's ridiculous. Which medium is best for your communication really depends on your task, but sometimes calling someone is just better than messaging!

Cygwin

Yes, I'm still a Windows user, but I can't live without my shell anymore. I do most things through my IDE or some GUI, but often enough, something needs to be automated through a script, and then I open up my Cygwin shell and just love it :-)

SSH clients (e.g. PuTTY)

It happens more and more that I need to run stuff remotely, so...

Wednesday, March 28, 2012

Collaboration Tool: Online Screen Sharing

While this can also be done with e.g. Skype, maybe you find this useful:
http://www.screenleap.com/ allows for the sharing of screens through a browser interface. Have fun!