0212.98 Data Migration

My current project is data migration. I started to find it a little
boring, so my brain cross-referenced it with NLP to make it more
interesting. :) Check it out:

Start with a simple table. Each row represents, say, a person. Each
column represents some attribute of the person:


name     age  hobbies            hero
-------  ---  -----------------  ---------------
joe      12   comics, nintendo   superman
wilma    24   basketweaving      milton erickson
fred     42   nintendo, cars     superman
ted      34   basketweeving      m. erikson


.. Except there might be thousands of rows, and more interesting tables.
Notice that names are unique. Ages are simple values that go with names.
But hobbies and heroes are different: things repeat. Not only
that, but whoever entered ted's record didn't spell very well.

If I want to go in here and say, "show me everyone who has milton 
erickson for a hero," I'll get "wilma"... but I WANT "wilma and ted".

And what about hobbies? Fred and joe both have more than one. If I
say "show me everyone who likes nintendo".. I can't be sure that both
guys will show up, because I didn't mention comics or cars, and the
search engine might not be smart enough.
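
Here's a rough sketch of that problem in Python (my pick for illustration;
the list-of-dicts layout and the variable names are assumptions, not part of
any real system). It treats every field as plain text and shows what a naive
exact-match search gives back:

```python
# The flat table from above, typos and all.
people = [
    {"name": "joe",   "age": 12, "hobbies": "comics, nintendo", "hero": "superman"},
    {"name": "wilma", "age": 24, "hobbies": "basketweaving",    "hero": "milton erickson"},
    {"name": "fred",  "age": 42, "hobbies": "nintendo, cars",   "hero": "superman"},
    {"name": "ted",   "age": 34, "hobbies": "basketweeving",    "hero": "m. erikson"},
]

# Exact match on hero: ted's misspelled entry is silently missed.
erickson_fans = [p["name"] for p in people if p["hero"] == "milton erickson"]

# Exact match on the whole hobbies field: nobody's field is just "nintendo".
nintendo_exact = [p["name"] for p in people if p["hobbies"] == "nintendo"]

# A smarter search has to know the field is really a comma-separated list.
nintendo_split = [
    p["name"] for p in people
    if "nintendo" in [h.strip() for h in p["hobbies"].split(",")]
]

print(erickson_fans)   # ['wilma']
print(nintendo_exact)  # []
print(nintendo_split)  # ['joe', 'fred']
```

The third search works, but only because the code knows how the hobbies
field happens to be packed. Nothing fixes the hero search short of cleaning
up the data.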


This kind of reminds me of thinking about a person or idea in different
contexts... For example, I know someone who is a very skilled communicator
in one context, but in another is quite clueless. (Just about everyone
I know falls into this category, actually)... But I want to be able to
look at the whole person, not just their skill or ineptitude. In a sense,
I'm asking my brain to "show me ALL the contexts that relate to this
person"... I don't want to scan through each picture in my timeline or
elsewhere for anything that might or might not be about this person.
Obviously, I'd be doing it unconsciously anyway, but... 


Anyway, to clean up the table, you could do this:

table 1: person

id    name	age	heroID
--    -----	---	------
1     joe	12	2
2     wilma	24	1
3     fred	42	2
4     ted	34	1


table 2: hero (you could even put these in the person table..)

id    hero
--    ------
1     milton
2     superman


table 3: hobby

id    hobby
--    -----
1     comics
2     nintendo
3     basketweeving
4     cars

table 4: person-hobby

personid  hobbyid
--------  -------
1         1
1         2
2         3
3         2
3         4
4         3


Now, it's a little harder for a human to read this way, but the people
who use the database never have to see it. The computer turns all the
numbers into the human-readable values.
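
To make that concrete, here's a minimal sketch using the sqlite3 module from
Python's standard library with a throwaway in-memory database. The column
types are my assumptions, and I've named the link table person_hobby (with an
underscore) because a hyphen is awkward in SQL names. It loads the four tables
and joins the numbers back into readable rows:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hero   (id INTEGER PRIMARY KEY, hero TEXT);
    -- person.heroID holds an id from the hero table
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER,
                         heroID INTEGER);
    CREATE TABLE hobby  (id INTEGER PRIMARY KEY, hobby TEXT);
    CREATE TABLE person_hobby (personid INTEGER, hobbyid INTEGER);

    INSERT INTO hero VALUES (1, 'milton'), (2, 'superman');
    INSERT INTO person VALUES (1, 'joe', 12, 2), (2, 'wilma', 24, 1),
                              (3, 'fred', 42, 2), (4, 'ted', 34, 1);
    INSERT INTO hobby VALUES (1, 'comics'), (2, 'nintendo'),
                             (3, 'basketweeving'), (4, 'cars');
    INSERT INTO person_hobby VALUES (1, 1), (1, 2), (2, 3),
                                    (3, 2), (3, 4), (4, 3);
""")

# Follow the ids through the link table: one result row per (person, hobby).
rows = db.execute("""
    SELECT p.name, p.age, hero.hero, h.hobby
    FROM person p
    JOIN hero ON hero.id = p.heroID
    JOIN person_hobby ph ON ph.personid = p.id
    JOIN hobby h ON h.id = ph.hobbyid
    ORDER BY p.id, h.id
""").fetchall()

# Fold the rows back into one readable record per person.
readable = {}
for name, age, hero, hobby in rows:
    entry = readable.setdefault(name, {"age": age, "hero": hero, "hobbies": []})
    entry["hobbies"].append(hobby)

for name, info in readable.items():
    print(name, info["age"], ", ".join(info["hobbies"]), info["hero"])
# joe 12 comics, nintendo superman
# wilma 24 basketweeving milton
# fred 42 nintendo, cars superman
# ted 34 basketweeving milton
```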

But! Suppose I notice later that I misspelled "basketweaving". I can
go back and make it right by changing one record in the hobby table,
rather than fixing it for every single person who likes basketweaving.
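
A small sketch of that one-record fix, again with Python's sqlite3 and an
in-memory database (only the hobby table and the link table are set up here,
and the link table's name person_hobby is my own choice):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hobby (id INTEGER PRIMARY KEY, hobby TEXT);
    CREATE TABLE person_hobby (personid INTEGER, hobbyid INTEGER);
    INSERT INTO hobby VALUES (1, 'comics'), (2, 'nintendo'),
                             (3, 'basketweeving'), (4, 'cars');
    INSERT INTO person_hobby VALUES (1, 1), (1, 2), (2, 3),
                                    (3, 2), (3, 4), (4, 3);
""")

# One UPDATE on the hobby table fixes the spelling for everyone.
cur = db.execute("UPDATE hobby SET hobby = 'basketweaving' WHERE id = 3")
rows_changed = cur.rowcount

# Persons 2 (wilma) and 4 (ted) both point at hobby 3, so both see the fix.
fixed = db.execute("""
    SELECT ph.personid, h.hobby
    FROM person_hobby ph
    JOIN hobby h ON h.id = ph.hobbyid
    WHERE h.id = 3
    ORDER BY ph.personid
""").fetchall()

print(rows_changed)  # 1
print(fixed)         # [(2, 'basketweaving'), (4, 'basketweaving')]
```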

Now if I think about a person this way, I might picture many many
contexts in which that person operates. The person doesn't have to be
in the picture. Instead, I might imagine all these contexts linked by
a thread to a single picture of the person. These little threads can
act like pipelines, so that I can pour everything I know about the person
from each context into the one, single picture, and gain a fuller picture
of them.

These kinds of databases are called relational databases, because they
let me see how people relate to hobbies, or how hobbies relate to 
heroes. (Show me all the hobbies liked by people whose hero is milton
erickson).. they make stuff like that easy.
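
For instance, that question could be asked roughly like this. It's a sketch
with Python's sqlite3 and an in-memory copy of the four tables; the name
person_hobby and the column types are my assumptions. The second query goes
back to the earlier "everyone who likes nintendo" question:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hero   (id INTEGER PRIMARY KEY, hero TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER,
                         heroID INTEGER);
    CREATE TABLE hobby  (id INTEGER PRIMARY KEY, hobby TEXT);
    CREATE TABLE person_hobby (personid INTEGER, hobbyid INTEGER);

    INSERT INTO hero VALUES (1, 'milton'), (2, 'superman');
    INSERT INTO person VALUES (1, 'joe', 12, 2), (2, 'wilma', 24, 1),
                              (3, 'fred', 42, 2), (4, 'ted', 34, 1);
    INSERT INTO hobby VALUES (1, 'comics'), (2, 'nintendo'),
                             (3, 'basketweeving'), (4, 'cars');
    INSERT INTO person_hobby VALUES (1, 1), (1, 2), (2, 3),
                                    (3, 2), (3, 4), (4, 3);
""")

# Hobbies liked by people whose hero is milton: hero -> person -> hobby.
milton_hobbies = [row[0] for row in db.execute("""
    SELECT DISTINCT h.hobby
    FROM hero
    JOIN person p ON p.heroID = hero.id
    JOIN person_hobby ph ON ph.personid = p.id
    JOIN hobby h ON h.id = ph.hobbyid
    WHERE hero.hero = ?
    ORDER BY h.hobby
""", ("milton",))]

# Everyone who likes nintendo, whatever else they like: hobby -> person.
nintendo_fans = [row[0] for row in db.execute("""
    SELECT p.name
    FROM hobby h
    JOIN person_hobby ph ON ph.hobbyid = h.id
    JOIN person p ON p.id = ph.personid
    WHERE h.hobby = ?
    ORDER BY p.id
""", ("nintendo",))]

print(milton_hobbies)  # ['basketweeving']
print(nintendo_fans)   # ['joe', 'fred']
```

Both wilma and ted count toward the first answer, because they share hero
id 1 instead of two differently spelled strings.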

Since the brain is pretty flexible, there's no problem switching back
and forth between ways of representing data.. So as NLPers, we could use
this relational view whenever it came in handy.

So, other than what I already mentioned, where could this come in handy?


- michal