Find splits of compound words within a dictionary file

Post by dictdoc » Wed May 02, 2012 11:39 am

Dear all,
I am working with names and have a large file of names in which some entries are written together as one word (up to 4 or 5 names), while their corresponding single forms are also present in the word list.
An example should make this clear:
annamarie
mariechristine
johnsmith
johnjoseph smith
john
smith
anna
marie
mary
christine

The program should split the words in the list based on the single forms which are present. Thus:
annamarie anna-marie
mariechristine marie christine
johnsmith john smith
johnjosephsmith

In the last case, since
joseph

is missing, the program could suitably tag the missing element and show the word as
john !joseph! smith

The script/macro would prove especially helpful in separating words in languages such as German, which have a large number of compound words.
I have an awk script which does something similar, but it takes words from an external dictionary, whereas here I need to bootstrap from the list itself.
Any help given would be gratefully acknowledged.
dictdoc

Re: Find splits of compound words within a dictionary file

Post by Mofi » Sun Jun 10, 2012 10:28 am

At first I thought this would not be possible without a dictionary database, so I did not think much more about the task.

But today I looked at it again, and I think I have found a well-working solution with the following script:

Code: Select all
if (UltraEdit.document.length > 0) {

   // Get all words from the file.
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.selectAll();
   var asWords = UltraEdit.activeDocument.selection.split("\r\n");
   UltraEdit.activeDocument.top();

   // If the last line is an empty line, remove it from the list.
   if (!asWords[asWords.length-1].length) asWords.pop();

   // Create a new array for included words and their positions in the word.
   var nWordCount = asWords.length;
   var asMultiWords = new Array(nWordCount);
   var nWordIndex = 0;
   while (nWordIndex < nWordCount) asMultiWords[nWordIndex++] = "";

   // Find out which words are included completely in other words.
   for (nWordIndex = 0; nWordIndex < nWordCount; nWordIndex++) {

      // Ignore words already detected as containing at least 1 other
      // word, since such a word is surely not a single form.
      if (asMultiWords[nWordIndex].length) continue;

      // Get actual word into a separate string.
      var sActWord = asWords[nWordIndex];

      // Record in which other words this word is included.
      for (var nIndex = 0; nIndex < nWordCount; nIndex++) {
         var nPos = asWords[nIndex].indexOf(sActWord);
         if (nPos < 0) continue;
         if (nIndex == nWordIndex) continue;
         if (nPos < 10) asMultiWords[nIndex] += "0";
         asMultiWords[nIndex] += nPos.toString() + "|" + sActWord + " ";
         // Note: "Words" longer than 99 characters are not supported by this script.
      }
   }

   // Create the result list.
   var sResult = "";
   for (nWordIndex = 0; nWordIndex < nWordCount; nWordIndex++) {

      // Ignore the words not containing any other word.
      if (!asMultiWords[nWordIndex].length) continue;

      // Add this word with other words included to the result
      sResult += asWords[nWordIndex] + " =";

      // Put the included words again into an array of strings.
      var asFoundWords = asMultiWords[nWordIndex].split(" ");
      asFoundWords.pop();

      // Build the result string as requested. This is quite complicated
      // because the order of the included words must be determined.
      // Each entry starts with the position of the included word (00 to
      // 99, longer strings are not supported), so sorting the array
      // arranges the included words in the correct order.
      // Parts of the word may also not be found on any other line.
      // The "word" can also be a string containing spaces, which must
      // be ignored to build the result correctly.
      asFoundWords.sort();
      var nLastPos = 0;
      for (nIndex = 0; nIndex < asFoundWords.length; nIndex++) {

         // Split up the string with position in word and included word.
         var sPos = asFoundWords[nIndex].substr(0,2);
         var sWord = asFoundWords[nIndex].substr(3);
         nPos = parseInt(sPos,10);  // Convert the position back to number.

         // Does this word start at the expected position in the main word?
         if (nPos != nLastPos) {
            // There is a part of the word not listed on any line.
            sActWord = asWords[nWordIndex];

            // Ignore spaces at the beginning of the not included part.
            while (nLastPos < sActWord.length) {
               if (sActWord[nLastPos] != ' ') break;
               nLastPos++;
            }

            // Ignore spaces at the end of the not included part.
            var nEndPos = nPos - 1;
            while (nEndPos > nLastPos) {
               if (sActWord[nEndPos] != ' ') break;
               nEndPos--;
            }

            // Was something other than spaces not included?
            if (nLastPos != ++nEndPos) {
               // Yes, include this string part enclosed in exclamation marks.
               sResult += " !" + sActWord.substring(nLastPos,nEndPos) + "!";
            }
            nLastPos = nPos;
         }
         // Append the included word and update the position for the next word.
         sResult += " " + sWord;
         nLastPos += sWord.length;
      }
      // After every word append a DOS line termination.
      sResult += "\r\n";
   }
   if (sResult.length) {  // Was any output built?
      UltraEdit.newFile();
      UltraEdit.activeDocument.write(sResult);
      UltraEdit.activeDocument.top();
   }
   else UltraEdit.messageBox("No word included in any other word.");
}

The result of this script on your input example is:

Code: Select all
annamarie = anna marie
mariechristine = marie christine
johnsmith = john smith
johnjoseph smith = john !joseph! smith
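
For anyone who wants to try the idea outside UltraEdit, the same substring approach can be sketched as plain JavaScript functions. This is only a minimal sketch under some assumptions: splitLines and splitCompounds are hypothetical helper names, not part of the UltraEdit scripting API, and the code assumes an engine that provides String.prototype.trim (Node.js, for example). It differs from the script above in a few points: it accepts DOS, Unix and old Mac line endings instead of only "\r\n", it sorts the found positions numerically so there is no 99 character limit, it skips found words that overlap an already accepted one, and it also tags an unlisted part at the end of a word, which the script above appears to leave out.

```javascript
// Hypothetical helper: split a text into lines, accepting \r\n, \r or \n,
// and drop an empty last line.
function splitLines(text) {
  var lines = text.split(/\r\n|\r|\n/);
  if (lines.length && !lines[lines.length - 1].length) lines.pop();
  return lines;
}

// Hypothetical helper: for every list entry containing other entries,
// return a line "entry = part part ...", with unlisted parts in !...!.
function splitCompounds(words) {
  var results = [];
  for (var i = 0; i < words.length; i++) {
    var word = words[i];

    // Record every other entry that occurs inside this word.
    var hits = [];
    for (var j = 0; j < words.length; j++) {
      if (!words[j].length || words[j] === word) continue;
      var pos = word.indexOf(words[j]);
      if (pos >= 0) hits.push({ pos: pos, text: words[j] });
    }
    if (!hits.length) continue;  // Not a compound of listed words.

    // Sort by position; at the same position prefer the longer entry.
    hits.sort(function (a, b) {
      return a.pos - b.pos || b.text.length - a.text.length;
    });

    var parts = [];
    var last = 0;
    for (var k = 0; k < hits.length; k++) {
      if (hits[k].pos < last) continue;  // Overlaps an accepted entry.
      // Text between two accepted entries is not listed on any line.
      var gap = word.substring(last, hits[k].pos).trim();
      if (gap.length) parts.push("!" + gap + "!");
      parts.push(hits[k].text);
      last = hits[k].pos + hits[k].text.length;
    }

    // Text after the last accepted entry is not listed either.
    var tail = word.substring(last).trim();
    if (tail.length) parts.push("!" + tail + "!");

    results.push(word + " = " + parts.join(" "));
  }
  return results;
}
```

Called as splitCompounds(splitLines(text)) on the example list from the first post, this gives the same four lines as shown above. Like the script, it only looks at the first occurrence of each entry inside a word.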