Tuesday, September 13, 2011

What do chatbots say to each other when no human is around?

Researchers asked exactly that question, probably just for fun - and what came out is a pretty funny (albeit a little artificial) dialogue. Watch the video here.
I find it hilarious :-D

Thursday, September 1, 2011

The greatness of CSV

CSV - comma-separated values - is a simple file format you all should know. It just consists of values, separated by commas (or tabs, or whatever), with line breaks separating, well, the different lines (or records, or whatever your data may look like). You probably already knew that.

Now, let's suppose you have some data in a CSV table. Let's also suppose the first line contains the names of the columns, like this:

"No.", "Name", "Price"
1, "Chocolade", 0.99
2, "Noodles", 0.39

and so on. As you can see, really simple. Actually, why do I even feel like writing about this stuff? It looks so simple you wouldn't think it's worth talking about at all. I mean, there are things like JSON nowadays...

The reason is, I found you can do an incredible lot with those simple CSV tables:

1. easy to generate, easy to parse (see the little sketch after this list), and thus also easy to transform into other formats
2. easy to combine, if you have several matching tables. A quick sed '1d' file1 >> file2 does the job
3. after stripping the header, it is also easy to sort the lines using the sort command (see its manpage)
4. easy to import into programs like Excel, OpenOffice Calc, etc.
5. direct plotting with gnuplot http://www.gnuplot.info/docs_4.2/gnuplot.html#x1-17200043.14
6. many database programs support CSV as a data source, so SQL querying is also an option. I haven't found a command-line tool that integrates SQL queries into a scripting workflow, but writing such a tool would not take much effort (there are open-source CSV database drivers and the like out there; for example, h2database has commands to read from and write to CSV files)
7. as I wrote here, there are also tools to directly transform a CSV table into a LaTeX table.
8. data-mining applications like Weka...
9. you name it ;-)
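
To back up point 1, here is a minimal Java sketch for reading a table like the one above. Note that prices.csv is a made-up file name, and the naive split(",") does not handle commas inside quoted fields - use a real CSV library for that:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvDemo {
 public static void main(String[] args) throws IOException {
  BufferedReader in = new BufferedReader(new FileReader("prices.csv"));
  String header = in.readLine(); // first line: the column names
  String line;
  while ((line = in.readLine()) != null) {
   String[] fields = line.split(","); // naive: no quoted commas
   for (int i = 0; i < fields.length; i++)
    fields[i] = fields[i].trim().replaceAll("^\"|\"$", ""); // strip quotes
   System.out.println(fields[1] + " costs " + fields[2]);
  }
  in.close();
 }
}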

Tuesday, July 26, 2011

Converting an Excel/CSV table to LaTeX

Hi! Obviously, if you are reading this, you want a way to convert your Excel/CSV tables to a LaTeX table. There are some tools out there (csv2latex), but I also found this simple online converter I'd like to share:

Excel To Latex

Works with Excel tables, CSV, and tables with other separators as well. It does not, however, generate the table header, so you have to do that on your own.
I find this a little easier to use than a command-line tool, but of course such a tool is still useful for some. Just search for csv2latex.
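
By the way, the part you have to write yourself looks roughly like this - a minimal skeleton, where the column format rlr and the caption are just examples, and the converter only gives you the rows in between:

\begin{table}
 \centering
 \begin{tabular}{rlr}
  No. & Name & Price \\
  \hline
  1 & Chocolate & 0.99 \\
  2 & Noodles & 0.39 \\
 \end{tabular}
 \caption{An example table}
 \label{tab:example}
\end{table}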

Saturday, July 16, 2011

LaTeX: Referencing Figures

You may have had the following problem:
You want to reference a figure in your LaTeX document, but instead of its number, you get the section/subsection number.

First, the solution: You simply have to put the \label after the \caption of the figure.

Second, why: It seems the caption is the thing that actually SETS the number. This is a "feature" that allows multiple captions/labels per figure.
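
In code, the correct order looks like this (myplot.pdf is just a placeholder name):

\begin{figure}
 \centering
 \includegraphics{myplot.pdf}
 \caption{My plot} % the caption is what sets the figure number...
 \label{fig:myplot} % ...so the label has to come after it
\end{figure}

As Figure \ref{fig:myplot} shows ...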

Wednesday, June 8, 2011

Can you boot from an email-account?

Hi! You know how there are programs like Gmail Drive that let you use your email account as normal storage space? And you also know how you can boot from the network, right?

Now, if we had our OS mirrored to our Gmail account using Gmail Drive, and had some other PC forward the virtual drive to us over the LAN, could we actually boot from that?

I'd love to see anyone try this and tell us about the experience. Of course there is absolutely no use for this, but I felt it would be a funny experiment.

Monday, June 6, 2011

Google Flu Trends

Hi, check this out: Google Flu Trends

Apparently, Google is generating flu statistics from their search statistics - and it seems to work! Makes you wonder what other kind of information might be hidden in that statistical data, doesn't it?

Infinity: Hypercubes

We all know what a square and a cube are, and their generalization to higher dimensions is called a hypercube.

Now, let's see... A square has four vertices (0,0), (0,1), (1,0), (1,1), thus representing 2 bits of information in its vertices. A cube has 8 = 2^3 vertices, and so on... You get the picture.

According to Wikipedia, the human genome has 3 billion DNA base pairs. We all know there are four bases, thus every base pair encodes 2 bits of information. Any human's DNA is therefore one vertex of a 6-billion-dimensional hypercube. Somewhat scary, isn't it? Even more so if we realize that all humans together, who ever lived and will live, only cover an incredibly small subset of the vertices of that hypercube.

Now, let's get to something bigger - infinity. We can encode every natural number as an infinite sequence of 0's and 1's. Therefore, every natural number is a vertex of an infinite-dimensional hypercube. Note that every set in the power set of the natural numbers can just as well be expressed as an infinite sequence of 0's and 1's, showing whether each number is in the subset or not. However, this does not show that N and P(N) are equally large, as N does not cover any vertices with an infinite amount of 1's in the sequence, while P(N) does. P(N) actually covers ALL vertices of that hypercube.
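
To spell the correspondence out (just formalizing the paragraph above): a subset S ⊆ N corresponds to its characteristic sequence

S ↔ (χ_S(0), χ_S(1), χ_S(2), ...), where χ_S(i) = 1 if i ∈ S and χ_S(i) = 0 otherwise,

while a natural number n, written in binary starting from the least significant bit, gives a sequence with only finitely many 1's - which is why N only reaches those "finite" vertices.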

Now to an even fancier vector space: the space of functions R -> R is also a vector space, but this space has uncountably many basis vectors. The functions R -> {0,1} are a subset of that space - our hypercube again (this time with uncountably infinite dimensions).

... Wait a second, didn't that last one look familiar? Right, it looks like binary classification, doesn't it? Binary classifiers are the vertices of that last hypercube. This can be generalized to R^d -> R, but I won't discuss that further. Can we do search in such a space, maybe to find a binary classifier? Of course this is not as good as an SVM or even a neural network, but for the heck of it: why not build a bounding volume hierarchy (bounding intervals, in the 1D case)? Here is how I imagine such an algorithm:

let train(i) be the i'th training data point (assume the points are sorted by position) and label(train(i)) its label. Let f(]x,y[) be defined such that for all x' in ]x,y[, f(x') = f(]x,y[) -> we are building some sort of hierarchical structure

1.) set f(]-∞,∞[) = 0
2.) partition(f, -∞, ∞, 1, n)

partition(f,a,b,i,j):-
3.) if i > j, return
4.) let m = (i + j)/2
5.) set f(]a,b[) = label(train(m))
6.) partition(f, a, train(m), i, m - 1)
7.) partition(f, train(m), b, m + 1, j)

This should run in O(n) or O(n log n), depending on our data structures.
This can be generalized to more than one dimension, but it's obvious that this yields a classifier prone to overfitting (and also, I didn't want to invest too much time in something obviously bad).
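
For what it's worth, here is a little Java sketch of the resulting classifier. Instead of materializing f interval by interval, it uses the fact that deeper intervals override shallower ones, so classifying a point is just a binary-search descent over the sorted training points (all the names are made up, and this is a toy, not a serious learner):

public class IntervalClassifier {
 private final double[] points; // training positions, sorted ascending
 private final int[] labels;    // 0/1 label of each training point

 public IntervalClassifier(double[] points, int[] labels) {
  this.points = points; // assumed already sorted
  this.labels = labels;
 }

 // Descend like the recursive partition above: the label of the last
 // median we pass is the label of the deepest interval containing x.
 public int classify(double x) {
  int lo = 0, hi = points.length - 1;
  int label = 0; // the initial f(]-∞,∞[) = 0
  while (lo <= hi) {
   int m = (lo + hi) / 2;
   label = labels[m]; // deeper intervals override shallower ones
   if (x < points[m]) hi = m - 1;
   else if (x > points[m]) lo = m + 1;
   else break; // x is exactly a training point
  }
  return label;
 }

 public static void main(String[] args) {
  double[] xs = {1.0, 2.0, 3.0, 4.0};
  int[] ys = {0, 0, 1, 1};
  IntervalClassifier c = new IntervalClassifier(xs, ys);
  System.out.println(c.classify(2.7)); // prints 1
 }
}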

How secure are my windows passwords?

I just stumbled upon this guide for cracking passwords, through a reference on heise.de.
As you might know, passwords are most of the time (hopefully) not stored in clear text, but in the form of one-way hashes. Thus, even if an attacker got hold of a complete copy of the user data (username + password hash), this does not automatically mean he will be able to access the user's data immediately.
To be secure, hash functions have to have the property that, given a hash value hash(x), it is difficult to determine one possible x. Some hash functions, however, are a little weak - or at least they become weak as the computational power of PCs grows.
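
As a quick illustration of one-way hashing, here is a tiny Java snippet using SHA-256 via the standard MessageDigest class (just a sketch of the concept - real password storage should additionally use salting and a deliberately slow hash, which I'm skipping here):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashDemo {
 public static void main(String[] args) throws NoSuchAlgorithmException {
  MessageDigest sha = MessageDigest.getInstance("SHA-256");
  byte[] digest = sha.digest("hunter2".getBytes(StandardCharsets.UTF_8));
  StringBuilder hex = new StringBuilder();
  for (byte b : digest)
   hex.append(String.format("%02x", b)); // the stored value is hash(password)...
  System.out.println(hex);               // ...recovering the password means inverting this
 }
}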

For short passwords, Windows (before Vista) seems to store such a weak hash value. The guide above has links to the Windows articles showing how to deactivate this behavior. Also, the guide gives a way of encrypting the hash file, making things a little more complicated for the potential attacker.

Anyway, the truth is: no one actually ever needs your Windows password to access your data. If someone has access to the PC, he/she can just boot it from a boot disk and access any data that is stored on the hard drive. Since cracking the passwords as described above requires access to the PC, we can assume the attacker has that access - and in that case, it is much easier to just bypass the OS and boot from a CD.

The only protection against this is encrypting the hard drive and/or setting up a BIOS password. If you lose your password, however, you will lose your data (in the case of the encrypted hard drive) or your laptop altogether (on a desktop PC, the BIOS password can be reset, but it's a hassle). Also, if you only use a BIOS password, your data could still be read by stealing the (non-encrypted) hard drive.


Another easy way to get access to a Windows account, described in the guide above, is to reset the account's password (which is far easier than cracking it). There are programs which can do this; see the guide for details. The good news about this is that you can reset your password if you have forgotten it! Plus, if someone did this to you, you would at least be able to tell there was an attack.

I also found a link about hard drive password protection (which is not like encryption, but rather like the BIOS password) here.

Friday, May 27, 2011

Solving HornSAT in linear time

Recently, I wanted to program a simple Horn SAT solver for a project (turned out I didn't need it, oh well).
I couldn't find anything on the net (in reasonable time) about how to solve HornSAT in linear time, only references saying that it's possible. The naive approach, however - at least the one I took - has quadratic worst-case complexity. So I thought I might spread the knowledge a bit about how to solve the Horn satisfiability problem in linear time.
See also Unit Propagation.

1.) build a directed (bipartite) graph, connecting each Horn clause to its positive literal (if it has one), and connecting each literal to all the clauses where it appears negated.
2.) [UPDATE] for every clause c not containing any negated literal, do propagate(c)

3.) finally, the function for the unit propagation (note that the unsatisfiability check has to happen here too, since propagation can empty the body of a clause without a positive literal):
propagate(c) :-
if c has no positive literal, STOP -> unsatisfiable
let p be the positive literal of c
if p is not yet marked true {
 mark p true
 for all edges (p,c') in the edge set of the graph {
  remove p from the negated literals of c'
  if c' has no more negated literals, propagate(c')
 }
}

The marked literals form the minimal model of that Horn formula.

Because every variable is only marked true once, and every edge of the graph is traversed at most once, the whole algorithm runs in linear time (with respect to the number of literals, because of the graph building).
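
Here is a compact Java sketch of the whole thing. I'm assuming each clause is given as an optional positive literal ("head", -1 if absent) plus an array of negated variables ("body") - the graph from step 1 then becomes the occurs lists:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class HornSat {
 // head[c] = positive variable of clause c, or -1; body[c] = negated variables.
 // Returns null if unsatisfiable, else the minimal model.
 public static boolean[] solve(int numVars, int[] head, int[][] body) {
  boolean[] marked = new boolean[numVars];
  int[] remaining = new int[head.length]; // not-yet-removed negated literals
  List<List<Integer>> occurs = new ArrayList<List<Integer>>();
  for (int v = 0; v < numVars; v++) occurs.add(new ArrayList<Integer>());
  Deque<Integer> unit = new ArrayDeque<Integer>(); // clauses whose body is gone
  for (int c = 0; c < head.length; c++) {
   remaining[c] = body[c].length;
   for (int v : body[c]) occurs.get(v).add(c);
   if (remaining[c] == 0) unit.add(c);
  }
  while (!unit.isEmpty()) {
   int c = unit.poll();
   if (head[c] == -1) return null; // body satisfied, no head -> unsatisfiable
   int p = head[c];
   if (marked[p]) continue; // already marked true
   marked[p] = true;
   for (int c2 : occurs.get(p)) // "remove" p from every body it appears in
    if (--remaining[c2] == 0) unit.add(c2);
  }
  return marked; // the marked variables form the minimal model
 }
}

For example, the formula (x0) ∧ (¬x0 ∨ x1) ∧ (¬x1 ∨ ¬x2) would be head = {0, 1, -1} and body = {{}, {0}, {1, 2}}, and solve returns the minimal model {x0, x1}.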

[Update]
While thinking about this some more, I figured the above algorithm is only better than the naive one if the number of literals is in O(#clauses²). For completeness, here is the algorithm I call the "naive" one:


while there exists a clause c without negated literals {
 if c has no positive literal, then STOP -> formula not satisfiable;
 mark the variable of the positive literal of c as true;
 for all clauses c', remove the just-marked literal from the list of negated literals of c'
}

Let c be the number of clauses and l the number of literals.
With the proper implementation, the naive algorithm can run in O(min(c², c·l)). This is better than the algorithm above if l is in Ω(c²), because the graph building alone already costs O(l).
We can easily check which case we are in by summing the sizes of the sets of negated literals, which can be done in O(c). If we wire this all together, we get an algorithm in O(min(l, c²)), as c is in O(l).

Here is also a paper on linear time algorithms for Horn Sat: Link

Wednesday, May 25, 2011

What to do when you cannot post comments on blogger.com

Hi! Just a short notice:
Some people have the problem of not being able to post comments on Blogger. It actually happened to me.
The reason is that Blogger uses third-party cookies for commenting. I have no clue why they would do that (probably they are trying to set a cookie from blogger.com or google.com while you are actually at blogspot.com).
Anyway, the solution is to accept third-party cookies. You can do that through your browser configuration.


What are third-party cookies?

Websites use third-party cookies for tracking people - to identify users when they visit (revisit, actually) their site. This could ALSO be done without cookies, through the IP address (except for some providers where the IP address changes repeatedly, or for someone using Tor). But the IP address changes every time you connect to the Internet (normally, that is), so it can only be used for short-term tracking. The really interesting data comes from long-term tracking, as websites can get a feeling for how many regular visitors they really have. So, third-party cookies don't harm you physically, but to a certain extent, they violate your privacy.

Tuesday, May 24, 2011

Family Search and Machine Learning

I watched these two videos on YouTube lately: FamilySearch Granite Mountain Vault - Part 1 and FamilySearch Granite Mountain Vault - Part 2

Apart from thinking "man, these guys are serious about long-term storage", I was also impressed by the huge amount of handwritten documents that will be available in digital form ten years from now (or, in other words, that are becoming available more and more as we speak). I mean, billions of documents. Would that not be enough training data to train a really well-performing handwriting recognition engine?

It seems there might already be such engines, but if there are, why are they not used, let's say, as some kind of recommendation inside new.familysearch's indexing program? If there are none - oh well, here's a massive amount of to-be-indexed handwriting. But this dataset could even go so far as to support recognizing writing like old German handwriting (which fewer and fewer people are able to read) and other old scripts - and it would only really be useful if it did. Combined with high-level domain knowledge about existing cities and common names at the time, this could make up quite a tight expert system. I would never dare to let such a system do the indexing by itself, but help is always appreciated, isn't it :-)

Sometimes it would actually already be a help just to have character extraction to tell the single characters apart. Oh well, that might be the most difficult part of the work the machine would have to do...

Saturday, May 21, 2011

How to access Wiktionary to check for the existence of a word

I was just listening to this great presentation on robotics and cloud computing when it struck me that I could share some simple code I wrote a while ago, with an application to cloud computing (although I wouldn't call it that).

This was for a simple game I wrote in Java, where you had to find words in a character matrix (on Facebook there is a similar game called Scramble). In order for this to work, I needed a dictionary of valid words. I didn't know where to get such a word list from, so I decided to let the program learn the words by itself - through user input, but also through access to online dictionaries. Here's the code:


import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

protected boolean checkThroughWiki(String word) {
 URL url = null;
 try {
     // every Wiktionary entry lives under /wiki/<word>;
     // encode the word so umlauts etc. survive the URL
     url = new URL("http://de.wiktionary.org/wiki/"
             + URLEncoder.encode(word, "UTF-8"));
 } catch (IOException e) { // malformed URL or unsupported encoding
     return false;
 }
 try {
  HttpURLConnection con = (HttpURLConnection) url.openConnection();
  con.connect();
  // the entry page exists exactly if the server answers 200 OK
  return con.getResponseCode() == 200;
 } catch (IOException ex) {
  ex.printStackTrace();
  return false;
 }
}


Doesn't look too difficult, does it? Plus, it worked the last time I checked. Of course this is for the German Wiktionary, so you would need to change the URL to point to the Wiktionary version of your choice.
What it does is simply check whether the page that should contain the word exists.

Voila, now it's up to you to make a cloud translation library - or whatever - out of this. ;-)