|
CS: Text Indexing [1021.1998]

There's a Perl script called ICE that does local site indexing. I plan to use it for this site. It uses a very simple format:
    Apple 1 3
    Bob 1
    Zebra 2 3
    ----
    1 bobeatsanapple.txt
    2 whatisazebra.txt
    3 azebravisitstheorchard.txt

So, "Apple" was in files 1 and 3, etc. If you wanted to test how close two words were (the NEAR operator), you could modify the format by adding a bitmap. Divide the document into eighths (or 16ths, or whatever) and assign one bit to each section. If the word appears in that section, turn the bit on. Then you have:
    Apple 1 [0001] 2 [1001]
    Bob 1 [0001] 2 [0010]
    Zebra 2 [1000]

If you AND those bitmaps (four sections per file in this example), you can tell that in file 1, "Apple" is NEAR "Bob" because (0001 AND 0001) is nonzero. In file 2, "Zebra" is NEAR "Apple" because (1001 AND 1000) is nonzero, but NOT NEAR "Bob" because (0010 AND 1000) is zero. I wonder if that's how the rest of the computer science world does it. |
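Here's a rough Python sketch of the bitmap idea, just to check that the arithmetic works. It is not ICE's code or its real file format. The function names (`section_bitmap`, `near`, `near_in_file`), the four-section split, and the choice that the leftmost bit stands for the first section of the document are all my own assumptions.

```python
SECTIONS = 4  # assumption: four sections per document, matching the 4-bit examples


def section_bitmap(words, target, sections=SECTIONS):
    """Return an int whose bits mark the sections of `words` that contain `target`."""
    bitmap = 0
    total = len(words)
    for pos, word in enumerate(words):
        if word.lower() == target.lower():
            section = pos * sections // total        # 0 .. sections-1
            bitmap |= 1 << (sections - 1 - section)  # assumption: leftmost bit = first section
    return bitmap


def near(bitmap_a, bitmap_b):
    """Two words are NEAR if they share at least one section (bitwise AND is nonzero)."""
    return (bitmap_a & bitmap_b) != 0


# The example index from above: word -> {file number: section bitmap}
index = {
    "Apple": {1: 0b0001, 2: 0b1001},
    "Bob":   {1: 0b0001, 2: 0b0010},
    "Zebra": {2: 0b1000},
}


def near_in_file(index, word_a, word_b, file_id):
    """Look up both words in one file; a missing word counts as an all-zero bitmap."""
    bits_a = index.get(word_a, {}).get(file_id, 0)
    bits_b = index.get(word_b, {}).get(file_id, 0)
    return near(bits_a, bits_b)


if __name__ == "__main__":
    print(near_in_file(index, "Apple", "Bob", 1))    # file 1: 0001 AND 0001
    print(near_in_file(index, "Zebra", "Apple", 2))  # file 2: 1000 AND 1001
    print(near_in_file(index, "Zebra", "Bob", 2))    # file 2: 1000 AND 0010
```

One thing the sketch makes obvious: two words sitting right next to each other but on opposite sides of a section boundary will not count as NEAR, so the number of sections trades index size against accuracy.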