Thursday, November 29, 2007

Another week - another success

Finally got Cyc to work just like I wanted it to!
I contacted Larry from Cyc and he was able to give me a version which did actually work the way I wanted it to.
Now I am back on track.
Did a lot of work for the "Searching for Sense" blog, which is private, so you won't be able to see those results.

As a matter of fact, I will now check my reasoning ideas in the live environment of my Cyc-world.
The game of Ludo will serve as a template for potential problems.

But first I will have to look into some more papers to get this straight.

For those who've been following my blog:
Annotation of text is necessary to "transfer" text into something the computer can process. Fully automated, this process does not work that well: parsers cannot cope with semantics and therefore make many mistakes.
My goal is to minimize those mistakes by applying knowledge/reasoning to the texts that are processed. So far the average detection rate is close to 30%.
If my reasoning can lift this rate to 50% or higher, my goal will be reached.
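
To make that goal concrete: the detection rate could be computed as plain token-level accuracy against the hand-made gold annotation. This is only my illustrative sketch; the exact metric behind the 30% figure, and all names here, are assumptions, not a fixed part of AutoModel:

```python
def detection_rate(gold, predicted):
    """Fraction of tokens whose predicted theta-role matches the
    hand-annotated gold role (simple token-level accuracy).
    Illustrative only; the real evaluation metric is assumed here."""
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)

# two of three roles match -> a rate of 2/3; the 30% and 50% figures
# above would be measured the same way over a whole document
rate = detection_rate(["ACT", "POSS", "HAB"], ["ACT", "POSS", "STAT"])
```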

Human beings tend to regularly hit the 99-100% mark, by the way.
But then again, they don't evaluate a 200-page document in 15 minutes, but rather in 5 weeks.

Have a great weekend.

Wednesday, November 21, 2007

Damn I was busy ...

Hey folks,

sorry for not having posted in almost 2 weeks.
I was busy doing stuff which does not directly concern AutoModel.

I finally managed to get Cyc back on track, but my machine is going to kill me. It just takes forever to process anything. I need more RAM. And now comes the fun part: my 10-month-old machine can only handle 2GB of RAM. Well, that is anything but sufficient for what ResearchCyc demands from my main memory. Too bad.
The 8-core servers at our institute are of no use since Cyc does not run properly under Win2k3 or Ubuntu. It needs Win2k/XP or Suse 9.x.
Again - too bad.

I have updated my mindmap and added some words to my diss.


As for what I am working on in the next few days, that would be:
  • gaining deeper insight in handling inferences with Cyc
  • using the Cyc Java API to handle requests and publish those in a very rudimentary way
  • finding out which queries/requests would be most helpful while annotating the text.
That should be it for today.

Tuesday, November 6, 2007

Ontologies - don't always work the way you want them to

I talked to the guys from Cyc today since their Cyc system would not let me rebuild "my world". I had already tried compiling the world on several systems, namely:
  • Ubuntu 7.1
  • Debian 4.0
  • Windows XP SP2
  • Windows Vista
Most of the time, unpacking the zips didn't work and I got a CRC error. Therefore I contacted Cyc and they said they would look into the issue and tell me what was going on.

Let's see what the folks have to say tomorrow. So far I was able to resurrect an old backup of my Cyc-Database, which seems to work just fine.
Working with this all day long, I found out that I might need a server to run Cyc on. Having the whole system in a virtual machine on my desktop just isn't what I would call "fast".

Cyc just released a new version of their "world" (V.1.1.2), which I already downloaded but could not get to work either. It's kind of frustrating.
Last time I dealt with it, it worked like a charm. I'll give you an update.

Monday, November 5, 2007

Stability issues with ResearchCyc

The next weeks will be all about ResearchCyc.
Unfortunately I crashed my Cyc-installation last week and couldn't get it back up and running till today. Therefore I planned to switch to a Unix-based server system for Cyc.
Well - damn me, but I should have read the instructions carefully. My favourite Linux (debian, that is) wouldn't be able to run ResearchCyc - it's only meant for RedHat and Suse.
Those guys are using some very specific binaries which couldn't be launched on debian etch. Too bad.
Well, I guess I am going back to another virtual Windows machine. Preferably Windows XP - that has worked without any problems so far. Well, I still have a couple of hours left here ... that should suffice to get the system up and running.

Slides are online

The slides are online now.
You can get them here.

Monday, October 29, 2007

Summing up my work in 19 slides

I just finished summing up the work of the last 12 months into 19 slides for my PhD-Meeting tomorrow.
I will upload the files as soon as I get the web server back online.

Tomorrow will be another "try to get used to working with (Research-)Cyc" day. That thing is so powerful and yet so awkward to use every once in a while ... and the stubby Java interface they ship with it doesn't help a lot.
Well, I guess I'll have to deal with it.

I will give you a short sum-up of my slides later tonight after I refined them once more.

So long.

Wednesday, October 24, 2007

Preparing for PhD Meeting

I have a meeting with my fellow PhD students and my professor on Tuesday evening.
There I will present my results: what I have done so far, what I have achieved and what the future possibly holds for me.
This also includes feedback on my work and directions for the next few weeks/months of work.

I will offer my presentation as a download here as soon as it's done. I should be finished with it this Sunday. At least that's my plan.

See you guys soon.

Tuesday, October 9, 2007

Bordering aspects

The system we dream of is supposed to generate program code from natural language.
So far, so good.

My job is to put enough "common sense" into the processing task so that most of the mistakes machines still make can be avoided.
Or to quote Voltaire here: the problem with common sense is that it is not so common.

But what if we succeed in this task? Somebody will still have to work with the piece of software we generated. Somebody will have to use it. So what are the interfaces? How can our software interact? How easy will it be to understand that piece of software?
Will it make up for the time we saved generating the code? What do we have to look out for?

To sum it up: What happens around our scenario? Do we have to address that or can we just ignore how to deal with what the machine is supposed to deliver? I don't know yet.

Questions upon questions, which I will try to address in the next few weeks as well.
Thanks to my friend Georg for this valuable tip.

Monday, October 8, 2007

Annotating text - questions raised

Hey, it's been almost 2 weeks.
But I do have some results for you. They look like this:

"Leaving one’s own king under attack, exposing one’s own king to attack and also ’capturing’ the opponent’s king are not allowed."

This would be the original text from the specification.
In order to transform this into a graph, thematic (theta) roles have to be assigned to each necessary part of the sentence. The redundant/disposable words are simply "sharpened out" by marking them with a #.

The output of this sentence would be something like this:

[ { [ Leaving|ACT one`s1|POSS #own king|{HAB,STATII} under_attack|STAT, ] , [ exposing|ACT one`s2|POSS #own king|{HAB,STATII} to_attack|STAT ] , [ #and #also capturing|ACT #the opponent`s|POSS king|HAB ] }|MODII #are not_allowed|MOD. ]

one`s1 <= They
one`s2 <= They
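
To give a feel for how such an annotation could be machine-read, here is a tiny sketch. The token grammar (word|ROLE, word|{ROLE1,ROLE2}, '#' for sharpened-out words) is inferred from the single example above, and all names are my own illustration:

```python
import re

# Token grammar assumed from the example above:
#   word|ROLE   or   word|{ROLE1,ROLE2}
# '#' marks words that were "sharpened out" and carry no role.
TOKEN = re.compile(r'(#?)([^\s|\[\]{},]+)\|(\{[^}]+\}|[A-Z]+)')

def word_roles(annotated):
    """Return a list of (word, [roles]) pairs for all role-bearing
    words. Clause-level roles attached to closing braces (e.g. }|MODII)
    are ignored in this sketch."""
    pairs = []
    for m in TOKEN.finditer(annotated):
        if m.group(1) == '#':      # defensive: skip sharpened-out words
            continue
        pairs.append((m.group(2), m.group(3).strip('{}').split(',')))
    return pairs

word_roles("Leaving|ACT one`s1|POSS #own king|{HAB,STATII} under_attack|STAT,")
# -> [('Leaving', ['ACT']), ('one`s1', ['POSS']),
#     ('king', ['HAB', 'STATII']), ('under_attack', ['STAT'])]
```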

This might look a little confusing at first, but it is also quite impressive how easy it is for us humans to grasp the relations and concepts of a sentence. Sitting there and annotating the text by hand quickly shows that many things are processed by our brain implicitly and are actually quite hard to put on paper.

To put it short:
Reading the above sentence does not make you think of possessors, habitums and stati at once, does it? We recognize verbs and nouns, the rest just seems to come "naturally".

This is in my opinion the biggest obstacle when it comes to machine understanding.

Anyway, several questions, especially concerning reasoning, were raised. These were:
  • How will we be dealing with numerals after all?
  • When is a word a numeral, when an article?
    • e.g.: "one can find ..." or "you can move with only one player"
      • one == same?
      • one == 1?
      • one == one/you?
  • How can relations between numerals be detected?
    • e.g.: "The chessboard has 8x8 fields. Those 64 fields ..."
  • What happens to prepositions which seem unnecessary during annotation but actually do or can change the semantics of the sentence?
    • e.g.: "the near corner square to the right of the player is white"
    • "to the right of the player" (shows a location) is different from "the right of the player" (could also mean the right in a jurisdictional way)
  • Difference between verbs and their tense:
    • e.g.: "checkmate" vs. "checkmated" which mean something different
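
Just to illustrate the "one" ambiguity from the list: a toy heuristic could look at the following word, though this is purely my sketch (a real solution would need parsing plus commonsense reasoning, which is the whole point of the questions above):

```python
def classify_one(tokens, i):
    """Toy disambiguation of 'one' at position i: a modal verb right
    after it suggests the generic-person reading ('one can find ...'),
    otherwise treat it as the numeral ('only one player').
    Purely illustrative; the word lists are made up."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    if nxt in {"can", "may", "must", "should", "could", "cannot"}:
        return "pronoun"   # one == one/you
    return "numeral"       # one == 1
```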
Well, a lot of new stuff to think about I guess ...

Wednesday, September 26, 2007

ResearchCyc as the ontology of choice

Hi folks,

first of all, I have to say that I won my battle against TeX. The initial setup for the dissertation with which I will start annotating the papers is done.
I wrote short summaries of all papers and marked which parts of the articles I would like to mention to motivate my goals.

Next week we're gonna sit down with a student working on the concept of annotating textual specifications with theta-roles. This is going to be extremely interesting since we do not quite know what to expect.
Many concepts just take place in a human's brain while reading text, but when one actually has to mark the words with the right roles which apply in the given context... well, that seems to be something different and quite strenuous. Let's see what insight we gain during this test. This will most likely affect the direction in which the research in our area will head in the next couple of weeks/months.

In addition to all that, I jumped back to the (Research-)Cyc ontology for reasoning purposes. It might already help while annotating text. But it could also be very helpful on the transformation side, when the sentence has already been parsed into a graph. Well, I guess we'll find out.

Saturday, September 22, 2007

Thematic roles introduction

As I promised you a couple of days ago - here's the explanation of the graph transformation based on thematic roles and how we plan to reason on these models.

Thematic roles - also known as theta-roles - are best described in the following two articles here and here. I do not want to go too deep into explaining that, since the articles are quite voluminous.
If you're capable of understanding German, you can find a more detailed, easier to understand and better covered explanation here and here.

By Monday at the latest, I will add a simple example of how text is annotated and what the result looks like.

Other than that, I've been struggling with TeX quite a bit while summing up the knowledge of about 55 articles which might be of use for the "state of the art/related work" part of the dissertation.

Well, I'll be back on Monday with the news I promised you. Have a safe weekend.

Tuesday, September 18, 2007

A short explanation of how this is all supposed to work

Hey folks,

for those of you who have followed my website, here's an update on what the solution is going to look like.
For those of you who see this for the first time - well, be glad you don't have to witness the omnipotence of change in our business *smile*.

The slide above shows a rough sketch of how the tools are supposed to interconnect, exchange their data and finally lead to the UML/Program code of our choice.
The orange "NLP" box represents all possibilities for processing natural language. As I already mentioned, there are many - one of which will be taken into closer consideration for AutoModel. We are still comparing the various methods, trying to find the one that works best for us. That's going to be a student thesis.

At this moment, textual transformation (and therefore understanding) takes place by annotating the thematic/theta-roles in the given text. This is still a process which needs a lot of manual labour, but we are already working on an automatic approach to that.

What thematic roles look like, what this is all about and how we later transform these into graphs (all part of Tom's work), I will tell you in the next post.

Another thing (quite annoying, I have to admit) I was struggling with today was creating the initial dissertation template in LaTeX so that I can start casting my thoughts onto paper.
Like probably everybody else on this planet using Windows, I use MikTeX and TeXnicCenter to get this job done. I still haven't managed to include my JabRef bib-files as the bibliography. Well, I guess Rome wasn't built in a day either.

That's it for tonight.

It's papers, papers and guess what? - Papers.

Hey, haven't written much in the last 5 days.
They were all about reading papers concerning NLP.

Next up, I will tell you about our approach and how it differs from others so far (thematic roles vs. well-known NLP approaches).
After that I might just explain our graph theory approach and where the challenge lies.

The other questions are:
  • What's the state of the art?
  • Where are the gaps?
  • Which are the big questions?
  • Where do we have to narrow our field (since we do not claim to have a solution which is universal)?
All these questions have to be accumulated and cast into papers. And once you have a bunch of papers, you align 'em and make your dissertation out of those.
Well, it's not quite that easy, but that is the approximate approach.

More info to come in the next few days. Have a good one.

Tuesday, September 11, 2007

Mindmapping the papers, ideas for an article

I spent the last two days gaining an overview of another 10 or so papers about commonsense reasoning on natural language and processing of natural language and possible user interaction.
Quite a wide field.
I also talked to my colleague Tom and we have several approaches which we need to address in the next couple of weeks:
  • First of all, we found the perfect set of words/text which we would like to interpret. It's a strict set of rules, it makes sense, everybody knows it and can relate to it, and we will not have to bother too much with ambiguities and weird word usage.
    What am I talking about? Well, I am talking about the official rules of chess.
  • We will come up with an article concerning the state-of-the-art of processing natural language and converting into program code or anything similar.
  • We will then add our thoughts and the extensions which we intend to introduce over the next couple of months to round off the complete picture.
So far, there are many approaches of dealing with natural language.
One uses the semantics of the English language and transfers those into programmatic semantics. Others rely on controlled languages and specified domains. Some out there intertwine their concepts with MDA, and some have started to reason/infer on natural sentences.

All of these ideas bring us closer to what we want, but the complete picture is now clearer:
We want to give the program the chess specification and receive a UML model from which code can be generated that can actually "play chess".

Tom's approach with graphs (I will explain that in a later post) abstracts away from many other solutions because it initially relies on thematic roles. From then on, it's all graph transformations including reasoning. The latter will be my part. No specified objects etc. are necessary after the initial prose has been annotated.

The disadvantage of many approaches so far is that they mostly rely on the specifics of the English language. We understand that this whole concept has to work with any language out there - or at least a great deal of them.

The steps to be fulfilled and realized therefore are:

  1. Annotation:
    (Half-)Automatically annotate the initial text with its thematic roles
  2. Processing
    Process the annotated text and create an initial graph
    Use graph-transformation to create an initial UML-model
  3. Reasoning
    Use reasoning to get rid of ambiguities or duplicated objects which belong together.
    Also use reasoning to split obvious "objects" into other objects with certain properties, e.g. "The cold bottle" could be one object. But what if a "warm bottle" comes around the corner later? Is this a new object, or do you just have an object "bottle" which has the property "temperature" with its possible values "cold, hot"?
    Good question, huh? Well, we'll try to do the latter one - it just makes more sense.
    Reasoning will supposedly also take place by just having graph-transformations done.
  4. Processing
    Process the results of reasoning again and create the new UML-model.
    Then transfer this model into code using any of the popular methods to create code from UML.
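
A tiny sketch of the "bottle" merge in step 3, under the assumption that annotated mentions arrive as (adjective, noun) pairs. The adjective-to-property lexicon and all names here are made up for illustration; they are not part of AutoModel:

```python
# Assumed lexicon mapping adjectives to the property they describe.
ADJ_PROPERTY = {"cold": "temperature", "warm": "temperature",
                "hot": "temperature"}

def merge_mentions(mentions):
    """Fold mentions such as ('cold', 'bottle') and ('warm', 'bottle')
    into one object per noun, collecting the adjectives as values of a
    shared property instead of creating separate objects."""
    objects = {}
    for adj, noun in mentions:
        props = objects.setdefault(noun, {})
        prop = ADJ_PROPERTY.get(adj)
        if prop:
            props.setdefault(prop, set()).add(adj)
    return objects

merge_mentions([("cold", "bottle"), ("warm", "bottle")])
# -> {'bottle': {'temperature': {'cold', 'warm'}}}
```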
That's it for today - more at the end of the week. I still have loads of papers in front of me which I have to read ...

Monday, September 10, 2007

Finally back from my trip to Australia/USA

Alright.
I am finally back from my 7-week trip which led me all the way up and down the Australian east coast and through the outback. After another 9-day stopover in California, I eventually arrived back here in "kind-a-cold" but good old Germany.
This week I will try to get myself back up to speed with what I'd been dealing with before I left 2 months ago.
First I need to have another look at natural language interpretations and representations from other colleagues around the world. Some medical informatics guys have already achieved quite a lot dealing with natural language reports on patients (see here, here and here).

I also have to meet up with my friend and co-worker Tom to get our targets aligned again. This will include involving, teaching and mentoring some graduate students who want to support us in our work. Well, I'll give you an update as soon as I know more.