I am working with names. I have a large file of names in which some entries consist of several words written together (up to 4 or 5), and the corresponding single forms are also present in the same word list.
An example should make this clear:
annamarie
mariechristine
johnsmith
johnjosephsmith
john
smith
anna
marie
mary
christine
The program should split the compound entries in the list using the single forms that are present in it. Thus:
annamarie anna marie
mariechristine marie christine
johnsmith john smith
johnjosephsmith
In the last case, since
joseph
is missing from the list, the program could tag the missing element and show the word as
john !joseph! smith
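
To make the desired behaviour concrete, here is a minimal sketch in Python (not the awk script mentioned below). It assumes one name per entry, uses the list itself as the vocabulary, and picks the split that leaves the fewest letters unexplained. The function names `split_compound` and `split_all` and the `min_len` parameter are made up for this illustration.

```python
def split_compound(word, vocab, min_len=3):
    """Split `word` into pieces found in `vocab`; unknown stretches become !...!.

    Returns a list of pieces, or None when no known piece was found
    (the word is then treated as a single form and left alone).
    """
    n = len(word)
    # best[i] = (unknown_letters, piece_count, pieces) for the prefix word[:i]
    best = [None] * (n + 1)
    best[0] = (0, 0, [])
    for i in range(1, n + 1):
        for j in range(i):
            chunk = word[j:i]
            cost, count, pieces = best[j]
            # A chunk is "known" if it is in the list, is not the word itself,
            # and is not too short (min_len is an assumed safety threshold).
            if chunk in vocab and chunk != word and len(chunk) >= min_len:
                cand = (cost, count + 1, pieces + [chunk])
            else:
                cand = (cost + len(chunk), count + 1, pieces + ["!" + chunk + "!"])
            # Prefer fewer unknown letters, then fewer pieces.
            if best[i] is None or cand[:2] < best[i][:2]:
                best[i] = cand
    pieces = best[n][2]
    if all(p.startswith("!") for p in pieces):
        return None
    return pieces


def split_all(words, min_len=3):
    """Bootstrap: the word list serves as its own dictionary."""
    vocab = set(words)
    result = {}
    for w in words:
        pieces = split_compound(w, vocab, min_len)
        if pieces:
            result[w] = " ".join(pieces)
    return result


names = ["annamarie", "mariechristine", "johnsmith", "johnjosephsmith",
         "john", "smith", "anna", "marie", "mary", "christine"]

for original, split in split_all(names).items():
    print(original, "->", split)
# annamarie -> anna marie
# mariechristine -> marie christine
# johnsmith -> john smith
# johnjosephsmith -> john !joseph! smith
```

One caveat with this approach: any name that merely contains a shorter name would also be split and tagged (a hypothetical entry "annabel" would come out as "anna !bel!"), so the tagged output would need a manual check or a stricter rule.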
The script/macro would prove especially helpful for separating words in languages such as German, which have a large number of compound words.
I have a script in awk which does something similar, but it takes its words from an external dictionary, whereas here I need to bootstrap from the word list itself.
Any help given would be gratefully acknowledged.


