Thursday, September 27, 2018

Please do your SEO somewhere else

Recently people have started to post spam comments that only serve to advertise their sites/trainings in India. For the time being I have deactivated the comment function. Do your SEO somewhere else.

Thursday, June 18, 2015

Gotcha: CDH manager killed, but no error log, on VM

I recently experienced a problem where the Cloudera Manager silently died during startup or a few seconds after. No useful entries in the error log, nothing. If this had been a problem with heap space, I would have seen an OutOfMemoryError. For most other problems, I would probably have seen some kind of log entry. But the OS just killed the service for some reason.

I was using a VM managed by Vagrant, and it turns out the base box I was using had its memory configured to around 512 MB by default, while the Cloudera Manager was configured with a maximum heap size of 2 GB. Ouch. What happened was that the service would at some point exhaust the available memory, and the OS killed it.

I finally found the important information hidden on the requirements page: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_ig_cm_requirements.html

There it says:

  • RAM - 4 GB is recommended for most cases and is required when using Oracle databases. 2 GB may be sufficient for non-Oracle deployments with fewer than 100 hosts. However, to run the Cloudera Manager Server on a machine with 2 GB of RAM, you must tune down its maximum heap size (by modifying -Xmx in /etc/default/cloudera-scm-server). Otherwise the kernel may kill the Server for consuming too much RAM.

Well, thank you; may I suggest you put something like that on the Troubleshooting page as well?
After seeing this, it didn't take long to figure out what was going on (that was after roughly two or three days of debugging...).
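
The fix is then either to give the VM more memory in the Vagrantfile or to tune down the Cloudera Manager heap as the documentation suggests. A minimal sketch of the latter, assuming the heap settings in /etc/default/cloudera-scm-server live in a variable along the lines of CMF_JAVA_OPTS (that name is from memory and may differ between CDH versions, so check the file on your installation):

# /etc/default/cloudera-scm-server (excerpt), heap lowered to fit a small VM
export CMF_JAVA_OPTS="-Xmx1G"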

Wednesday, June 3, 2015

Connecting to Hadoop CDH with a windows client

Random notes and links on getting Windows clients to work with CDH-5.3.1:

  1. Use the CDH releases of your Hadoop libraries, see http://blog.cloudera.com/blog/2012/08/developing-cdh-applications-with-maven-and-eclipse/ (for other Hadoop distributions, this simply means: use the same versions on the client as in the cluster)
  2. Set the necessary properties for cross-OS operation (why, oh why is such a thing necessary?) and get winutils.exe, see https://github.com/spring-projects/spring-hadoop/wiki/Using-a-Windows-client-together-with-a-Linux-cluster and the sketch further below
  3. Set the environment variable HADOOP_USER_NAME (or the system property of the same name) to an appropriate value, see http://stackoverflow.com/a/11062529/1319284
  4. If you are using HDFS (which you probably are), you need to add hadoop-hdfs to your classpath if it is not there already, see http://stackoverflow.com/a/24492225/1319284
  5. Check the firewall rules of the nodes; on CentOS you can do this with system-config-firewall, see https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-fw.html. A list of the ports used by the various CDH components is here: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_ports_cdh5.html
  6. Make sure all configured host names can be resolved, and edit the hosts file if necessary (it is located under C:\Windows\System32\drivers\etc\hosts)
  7. Make sure your compiler compliance level is set to target the right Java version; at the time of writing this is 1.7. Failing to do so produces errors like "Unsupported major.minor version 52.0", see for example http://stackoverflow.com/questions/22489398/unsupported-major-minor-version-52-0
  8. When using Eclipse, make sure to export a jar (or build with Maven) and then add it to the classpath of the launch configuration. That way, Job.setJarByClass will find the jar, which can then be uploaded to the cluster (see the sketch below). Granted, this is a little hacky, but it works.

After doing all this, I successfully ran my MapReduce job from Eclipse.
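
To give an idea of how steps 2, 3 and 8 come together, here is a minimal driver sketch. The host name, paths and user name are placeholders, and depending on your setup additional properties (e.g. the resource manager address) may be needed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WindowsClientDriver {
    public static void main(String[] args) throws Exception {
        // step 2: tell the Hadoop libraries where winutils.exe lives (in hadoop.home.dir\bin)
        System.setProperty("hadoop.home.dir", "C:\\hadoop");
        // step 3: act as the user that owns the files on the cluster
        // (the system property works like the environment variable here)
        System.setProperty("HADOOP_USER_NAME", "hdfs");

        Configuration conf = new Configuration();
        // placeholder address, take the real one from your cluster configuration
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        Job job = Job.getInstance(conf, "windows client test");
        // step 8: the exported jar must be on the launch classpath so that setJarByClass finds it
        job.setJarByClass(WindowsClientDriver.class);
        // set mapper, reducer, key/value classes etc. as usual
        FileInputFormat.addInputPath(job, new Path("/user/hdfs/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hdfs/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}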

[Update]
For CDH 5.5.0 (Hadoop 2.6.0), a binary build with winutils.exe can be downloaded from http://www.barik.net/archive/2015/01/19/172716/
In addition to setting hadoop.home.dir, java.library.path must be set to the bin directory.
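
Both can be passed as VM arguments in the launch configuration; the paths below are just examples for wherever that build is unpacked:

-Dhadoop.home.dir=C:\hadoop-2.6.0
-Djava.library.path=C:\hadoop-2.6.0\bin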

Wednesday, December 4, 2013

The mean is a horrible measure for profiling/benchmarking

The mean is a horrible measure for profiling and benchmarking. When the computation chokes, the run shows up as a peak in execution time. The mean is not a good measure when you have extreme outliers like that; the median is much more robust in such cases.


So: don't average over your execution times when you benchmark an algorithm; take the median!
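
A quick illustration with made-up timings (in milliseconds), where a single choked run drags the mean far away from the typical execution time while the median barely moves:

import java.util.Arrays;

public class MeanVsMedian {
    public static void main(String[] args) {
        // nine ordinary runs and one run where the computation choked
        double[] times = {102, 98, 101, 99, 100, 103, 97, 100, 102, 1450};

        double mean = Arrays.stream(times).average().getAsDouble();

        double[] sorted = times.clone();
        Arrays.sort(sorted);
        // median for an even number of samples: mean of the two middle values
        double median = (sorted[sorted.length / 2 - 1] + sorted[sorted.length / 2]) / 2;

        System.out.println("mean:   " + mean);   // 235.2, dominated by the outlier
        System.out.println("median: " + median); // 100.5, close to a typical run
    }
}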

Tuesday, November 20, 2012

My Code is in the Juno release train!

I just wanted to point out that Code Recommenders, and with it Jayes, is in the Juno release train! (see http://www.eclipse.org/org/press-release/20120627_junorelease.php) How sweet is that? Code I wrote for my Bachelor's thesis is now at Eclipse :-)

The Code Recommenders feature is distributed as part of the "Eclipse IDE for Java Developers" download at Eclipse, so my code now lives on thousands of developer machines :-D

Friday, June 29, 2012

Minimizing Horn formulas using domain knowledge

Assume you have several implications:

(x1 & x2) -> x3
(x1 & !x2) -> x3
...

Since you will often want to query whether x3 needs to be true for different truth values of your variables, you might want to optimize your formulas with respect to the number of literals.

For a given right-hand-side variable, this is equivalent to a logic minimization task in disjunctive normal form (DNF), which can be solved e.g. with the Quine-McCluskey algorithm: the disjunction of all left-hand sides that imply this variable forms a formula in DNF, and the minterms of this formula are exactly those left-hand sides.

The implications above can be optimized by minimizing (x1 & x2) || (x1 & !x2), which yields x1, and thus the optimized implication x1 -> x3.

What does this have to do with HornSAT? As you may have noticed, my examples are not Horn formulas. However, in practice you will sometimes model a problem as HornSAT that has additional, implicit constraints. For example, if some variable v potentially has n states, you may model this as n logical variables in your formulas.

These variables have properties that you normally don't explicitly model in your formulas, because they would only blow up the logical representation without helping with the original problem. The property I am referring to is the assumption that v will take exactly one of its n states, so exactly one of the n logical variables will be true at any time.

This property (for ordinary two-valued logic it is the principle of bivalence) is the same property that allows logic minimization in the first place! Thus, using domain knowledge, one can, with a small alteration of "normal" logic minimization algorithms, derive domain-specific logic minimization algorithms. I did this once to extract logical rules that were implicitly contained in Bayesian networks (it didn't help for inference, but it gave me the idea).
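
As a small illustration (the variable names are made up): suppose v has the three states a, b and c, modeled as the logical variables va, vb and vc, of which exactly one is true. Then the implications

(va & y) -> x3
(vb & y) -> x3
(vc & y) -> x3

collapse to the single implication y -> x3: whichever state v is in, one of the three premises is satisfied whenever y is. A purely logical minimizer cannot make this merge, because without the exactly-one constraint the case where va, vb and vc are all false is not covered.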

Note that, while HornSAT is an easy problem (solvable in linear time), logic minimization is hard (NP-hard).

Monday, May 28, 2012

Jayes 0.2.0 release

Jayes 0.2.0 is here!

I finally removed the SNAPSHOT marker from the plugin.xml and pom.xml files, which means I consider this a release version. Find it on GitHub.

What is Jayes?


Jayes is an open-source Bayesian network library written in pure Java. I wrote it as part of my Bachelor's thesis. It is used by the Code Recommenders project at Eclipse, where it predicts method calls for code completion. It performs very well, as I showed in my Bachelor's thesis (also available in the repository), and can compete even with native Bayesian network libraries.

Why version it 0.2.0?


The reason the current version number is 0.2.0 is that when I first wrote Jayes, the Code Recommenders project was at version 0.2.0. This version number is in no way related to the state of development of the library.

Features in version 0.2.0


- Exact inference in Bayesian networks with discrete-valued variables
- File formats: XMLBIF v0.3, GeNIe's XDSL format


License


Jayes is licensed under the EPL 1.0