I am working with names and have a large file of names in which some entries are several words written together (up to 4 or 5), while the corresponding single forms are also present in the word list.
An example should make this clear.
The program should split the joined words in the list, using the single forms that are present in the same list. Thus:
mariechristine marie christine
johnsmith john smith
In a case such as johnjosephsmith, where the single form joseph is missing from the list, the program could tag the missing element and show the word as
john !joseph! smith
The script/macro would prove especially helpful in separating words in languages such as German, which have a large number of compound words.
I have a script in awk which does something similar, but it takes its words from an external dictionary, whereas here I need to bootstrap, i.e. use the word list itself as the dictionary.
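To make the intended behaviour concrete, here is a minimal Python sketch of the kind of thing I am after. It is only an illustration under stated assumptions, not a finished solution: the names `split_compound` and `split_word_list`, the `min_len` cutoff (to stop very short entries from causing spurious splits), and the rule that a split is only accepted when at least two known parts are found are all my own guesses. It builds the dictionary from the word list itself and prefers the split that leaves the fewest unmatched letters.

```python
def split_compound(word, vocab, min_len=3):
    """Split `word` into entries of `vocab`, tagging unmatched stretches as !...!.

    Returns the word unchanged when no convincing split is found.
    """
    n = len(word)
    # best[i] holds (unmatched_chars, piece_count, pieces) for the prefix word[:i].
    best = [(0, 0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(i):
            chunk = word[j:i]
            unmatched, count, pieces = best[j]
            # A chunk is "known" if it is a list entry other than the word itself.
            known = chunk != word and len(chunk) >= min_len and chunk in vocab
            cand = (unmatched + (0 if known else len(chunk)),
                    count + 1,
                    pieces + [(chunk, known)])
            # Prefer fewer unmatched letters, then fewer pieces.
            if best[i] is None or cand[:2] < best[i][:2]:
                best[i] = cand
    pieces = best[n][2]
    # Assumption: only trust a split that contains at least two known parts.
    if sum(1 for _, known in pieces if known) < 2:
        return word
    return " ".join(c if known else "!" + c + "!" for c, known in pieces)


def split_word_list(words, min_len=3):
    """Bootstrap: use the word list itself as the dictionary."""
    vocab = set(words)
    return {w: split_compound(w, vocab, min_len) for w in words}


if __name__ == "__main__":
    names = ["marie", "christine", "mariechristine",
             "john", "smith", "johnsmith", "johnjosephsmith"]
    for word, result in split_word_list(names).items():
        print(word, result)
```

With the sample list above, mariechristine and johnsmith come out split, johnjosephsmith comes out with the middle part tagged, and the single forms are left alone. A word whose parts overlap ambiguously, or which merely happens to contain two shorter entries, would still be split, so the output would need checking by hand.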
Any help given would be gratefully acknowledged.