0212.98 Data Migration

My current project is data migration. I started to find it a little boring, so my brain cross-referenced it with NLP to make it more interesting. :) Check it out:

Start with a simple table. Each row represents, say, a person. Each column represents some attribute of the person:

name   age  hobbies           hero
-----  ---  ----------------  ---------------
joe    12   comics, nintendo  superman
wilma  24   basketweaving     milton erickson
fred   42   nintendo, cars    superman
ted    34   basketweeving     m. erikson
..

Except there might be thousands of rows, and more interesting tables.

Notice that names are unique. Ages are simple values that go with names. But hobbies and heroes are different: things repeat. Not only that, but whoever entered ted's record didn't spell very well. If I want to go in here and say, "show me everyone who has milton erickson for a hero," I'll get "wilma"... but I WANT "wilma and ted".

And what about hobbies? Fred and joe both have more than one. If I say "show me everyone who likes nintendo", I can't be sure that both guys will show up, because I didn't mention comics or cars, and the search engine might not be smart enough.

This kind of reminds me of thinking about a person or idea in different contexts... For example, I know someone who is a very skilled communicator in one context, but in another is quite clueless. (Just about everyone I know falls into this category, actually)... But I want to be able to look at the whole person, not just their skill or ineptitude. In a sense, I'm asking my brain to "show me ALL the contexts that relate to this person"... I don't want to scan through each picture in my timeline or elsewhere for anything that might or might not be about this person. Obviously, I'd be doing it unconsciously anyway, but...
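
Back on the flat table for a second: here is a rough Python sketch of why those two searches go wrong. The helper names (exact_match, list_contains) are made up for illustration, and a real search engine may well behave differently; this only assumes the simplest possible matching.

```python
# The flat table from above, one dict per row (typos kept on purpose).
people = [
    {"name": "joe", "age": 12, "hobbies": "comics, nintendo", "hero": "superman"},
    {"name": "wilma", "age": 24, "hobbies": "basketweaving", "hero": "milton erickson"},
    {"name": "fred", "age": 42, "hobbies": "nintendo, cars", "hero": "superman"},
    {"name": "ted", "age": 34, "hobbies": "basketweeving", "hero": "m. erikson"},
]


def exact_match(rows, column, value):
    """Names of rows whose column equals value exactly (a naive search)."""
    return [row["name"] for row in rows if row[column] == value]


def list_contains(rows, column, value):
    """Names of rows whose comma-separated column includes value."""
    return [
        row["name"]
        for row in rows
        if value in [item.strip() for item in row[column].split(",")]
    ]


# ted is missed because his hero is spelled differently.
erickson_fans = exact_match(people, "hero", "milton erickson")

# Nobody matches: no hobbies cell is exactly "nintendo".
nintendo_exact = exact_match(people, "hobbies", "nintendo")

# Splitting the cell by hand finds joe and fred, but every query has to know to do it.
nintendo_split = list_contains(people, "hobbies", "nintendo")

print(erickson_fans, nintendo_exact, nintendo_split)
```

Splitting the hobbies cell fixes the nintendo search, but nothing short of cleaning the data fixes ted's hero.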
Anyway, to clean up the table, you could do this:

table 1: person

id  name   age  heroID
--  -----  ---  ------
1   joe    12   2
2   wilma  24   1
3   fred   42   2
4   ted    34   1

table 2: hero (you could even put these in the person table..)

id  hero
--  --------
1   milton
2   superman

table 3: hobby

id  hobby
--  -------------
1   comics
2   nintendo
3   basketweeving
4   cars

table 4: person-hobby

personid  hobbyid
--------  -------
1         1
1         2
2         3
3         2
3         4
4         3

Now, it's a little harder for a human to read this way, but the people who use the database never have to see it. The computer turns all the numbers into the human-readable values.

But! Suppose I notice later that I misspelled "basketweaving". I can go back and make it right by changing one record in the hobby table, instead of having to do it for every single person who liked basketweaving.

Now if I think about a person this way, I might picture many, many contexts in which that person operates. The person doesn't have to be in the picture. Instead, I might imagine all these contexts linked by a thread to a single picture of the person. These little threads can act like pipelines, so that I can pour everything I know about the person from each context into the one, single picture, and gain a fuller picture of them.

These kinds of databases are called relational databases, because they let me see how people relate to hobbies, or how hobbies relate to heroes. (Show me all the hobbies liked by people whose hero is milton erickson).. they make stuff like that easy.

Since the brain is pretty flexible, there's no problem switching back and forth between ways of representing data.. So as NLPers, we could use this relational view whenever it came in handy.

So, other than what I already mentioned, where could this come in handy?

- michal
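
For anyone who wants to poke at the four tables, here is a minimal sketch using Python's built-in sqlite3 module. It is one possible way to set this up, not the only one. Assumptions: the link table is called person_hobby (an unquoted hyphen isn't allowed in a SQL name), and the function names hobbies_for_hero and people_who_like are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE hero   (id INTEGER PRIMARY KEY, hero TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER,
                         heroID INTEGER REFERENCES hero(id));
    CREATE TABLE hobby  (id INTEGER PRIMARY KEY, hobby TEXT);
    CREATE TABLE person_hobby (personid INTEGER REFERENCES person(id),
                               hobbyid  INTEGER REFERENCES hobby(id));
""")

# Same rows as the tables above, including the misspelled hobby.
conn.executemany("INSERT INTO hero VALUES (?, ?)",
                 [(1, "milton"), (2, "superman")])
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)",
                 [(1, "joe", 12, 2), (2, "wilma", 24, 1),
                  (3, "fred", 42, 2), (4, "ted", 34, 1)])
conn.executemany("INSERT INTO hobby VALUES (?, ?)",
                 [(1, "comics"), (2, "nintendo"),
                  (3, "basketweeving"), (4, "cars")])
conn.executemany("INSERT INTO person_hobby VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3), (3, 2), (3, 4), (4, 3)])


def hobbies_for_hero(hero_name):
    """All hobbies liked by people whose hero is hero_name."""
    rows = conn.execute("""
        SELECT DISTINCT hobby.hobby
        FROM person
        JOIN hero         ON hero.id = person.heroID
        JOIN person_hobby ON person_hobby.personid = person.id
        JOIN hobby        ON hobby.id = person_hobby.hobbyid
        WHERE hero.hero = ?
        ORDER BY hobby.hobby
    """, (hero_name,))
    return [row[0] for row in rows]


def people_who_like(hobby_name):
    """Names of everyone linked to hobby_name, in id order."""
    rows = conn.execute("""
        SELECT person.name
        FROM person
        JOIN person_hobby ON person_hobby.personid = person.id
        JOIN hobby        ON hobby.id = person_hobby.hobbyid
        WHERE hobby.hobby = ?
        ORDER BY person.id
    """, (hobby_name,))
    return [row[0] for row in rows]


nintendo_fans = people_who_like("nintendo")
before_fix = hobbies_for_hero("milton")

# Fix the spelling once, in one record of the hobby table.
conn.execute("UPDATE hobby SET hobby = 'basketweaving' WHERE id = 3")
after_fix = hobbies_for_hero("milton")

print(nintendo_fans, before_fix, after_fix)
```

The single UPDATE corrects the hobby for both wilma and ted, since their rows only hold the number 3, not the word itself.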