Tuesday, May 24, 2011

Family Search and Machine Learning

I recently watched these two videos on YouTube: FamilySearch Granite Mountain Vault - Part 1 and FamilySearch Granite Mountain Vault - Part 2.

Apart from thinking, "Man, these guys are serious about long-term storage", I was also impressed by the huge amount of handwritten documents that will be available in digital form ten years from now (or, in other words, are becoming available more and more as we speak). I mean, billions of documents. Wouldn't that be enough training data to train a really well-performing handwriting recognition engine?

It seems such engines might already exist, but if they do, why are they not used, let's say, as some kind of recommendation feature inside new.familysearch's indexing program? If they don't, oh well, here's a massive amount of handwriting waiting to be indexed. An engine trained on this dataset could even go so far as to recognize old German handwriting (which fewer and fewer people are able to read) and other old scripts, and it would only be really useful if it did. Combined with high-level domain knowledge about the cities that existed and the names that were common at the time, this could make up quite a tight expert system. I would never dare to let such a system do the indexing by itself, but help is always appreciated, isn't it? :-)
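To make the "domain knowledge" idea a bit more concrete, here is a minimal sketch of how a raw reading from a recognizer could be turned into suggestions for a human indexer. Everything in it is an assumption for illustration: the `KNOWN_PLACES` list, the `suggest` function and the cutoff value are made up, and it has nothing to do with how FamilySearch's indexing program actually works. It just uses fuzzy string matching from Python's standard library.

```python
import difflib

# Hypothetical lexicon. A real one would come from historical gazetteers
# and lists of names that were common in the period.
KNOWN_PLACES = ["Stuttgart", "Heilbronn", "Tübingen", "Esslingen", "Ludwigsburg"]


def suggest(raw_reading, lexicon, max_suggestions=3, cutoff=0.6):
    """Return lexicon entries that look similar to a (possibly wrong) reading.

    The comparison ignores case. An empty list means "no idea", which
    leaves the decision entirely to the human indexer.
    """
    by_lower = {entry.lower(): entry for entry in lexicon}
    matches = difflib.get_close_matches(
        raw_reading.lower(), list(by_lower), n=max_suggestions, cutoff=cutoff
    )
    return [by_lower[match] for match in matches]


if __name__ == "__main__":
    # Pretend the recognizer dropped a letter.
    print(suggest("Stutgart", KNOWN_PLACES))
```

The point of returning a list instead of silently "correcting" the reading is the same as above: the machine recommends, the person decides.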

Sometimes it would already help just to have character extraction that tells the individual characters apart. Oh well, that might be the most difficult part of the work the machine would have to do...
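For what it's worth, the naive version of character extraction is easy to sketch: count the ink in every pixel column and cut wherever a column is empty. The `split_columns` function below is my own toy illustration and assumes a binarized image given as rows of 0 (background) and 1 (ink). It only works when characters do not touch, which is exactly what cursive handwriting does not give you, so take it as a picture of the problem rather than a solution.

```python
def split_columns(image):
    """Split a binary image into runs of columns that contain ink.

    image: list of rows, each row a list of 0 (background) or 1 (ink).
    Returns a list of (start, end) column ranges, with end exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: how many ink pixels each column holds.
    ink_per_column = [sum(row[x] for row in image) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(ink_per_column):
        if ink > 0 and start is None:
            start = x                      # a new run of ink begins
        elif ink == 0 and start is not None:
            segments.append((start, x))    # an empty column ends the run
            start = None
    if start is not None:
        segments.append((start, width))    # run reaches the right edge
    return segments


if __name__ == "__main__":
    tiny = [
        [1, 1, 0, 0, 1],
        [1, 0, 0, 1, 1],
    ]
    print(split_columns(tiny))
```

As soon as two letters are joined by a stroke, they share a non-empty column and come out as one segment, which is why I suspect this step is the hard one.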
