Sorting a file with frequency count on word length

Postby dictdoc » Thu Mar 21, 2013 9:47 pm

Hello,
I have a file which has the following structure

word space frequency

The file has around 30,000 headwords, each with its frequency. The words have different lengths. I need a script that sorts the file by headword length, shortest to longest, and then sorts each group of words of the same length by frequency, highest first.
At present I do this in Excel using the

Code:
=Len(text)

formula, but this is getting tedious.
I am giving below a sample input file

Code:
about 1903238
and 14291859
are 1487971
but 2994482
can 1915289
come 1541623
for 3296048
from 2207336
get 2081392
have 5930242
here 1558771
him 1571291
just 1756270
know 2221467
like 1845600
not 3091071
now 1453264
one 1988291
out 1812292
right 1410555
say 2345958
she 2123744
that 7834407
the 29962169
there 1957160
they 2684414
think 1398723
this 3814998
was 1399013
what 3327049
when 1465219
who 1543711
with 3983564
would 1346905
you 12345509
your 2329896

The expected output would be:

Code:
the 29962169
and 14291859
you 12345509
for 3296048
not 3091071
but 2994482
say 2345958
she 2123744
get 2081392
one 1988291
can 1915289
out 1812292
him 1571291
who 1543711
are 1487971
now 1453264
was 1399013
that 7834407
have 5930242
with 3983564
this 3814998
what 3327049
they 2684414
your 2329896
know 2221467
from 2207336
like 1845600
just 1756270
here 1558771
come 1541623
when 1465219
there 1957160
about 1903238
right 1410555
think 1398723
would 1346905

As you can see, the file has been sorted by word length and then, within each length, by frequency count in descending order.
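The ordering described above can be sketched as a two-key comparison in plain JavaScript. This is only an illustration, assuming every line has the form "word space frequency"; the function name compareByLengthThenFrequency is hypothetical and does not come from any posted script.

```javascript
// Hypothetical helper (an assumption for illustration): orders
// "word frequency" lines by word length ascending, then by
// frequency descending.
function compareByLengthThenFrequency(lineA, lineB) {
  const [wordA, freqA] = lineA.split(" ");
  const [wordB, freqB] = lineB.split(" ");
  if (wordA.length !== wordB.length) {
    return wordA.length - wordB.length; // shorter words first
  }
  return Number(freqB) - Number(freqA); // higher frequency first
}

// A few lines taken from the sample input.
const sample = [
  "about 1903238",
  "and 14291859",
  "come 1541623",
  "the 29962169",
  "that 7834407",
];

// Sort a copy so the original array stays untouched.
const sorted = sample.slice().sort(compareByLengthThenFrequency);
console.log(sorted.join("\n"));
```

Returning a negative, zero, or positive number lets Array.prototype.sort place both keys in a single pass, so no helper column like the Excel length formula is needed.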

Any help given would avoid the tedium of loading the file each time in Excel. Many thanks in advance.
dictdoc
Basic User
Posts: 19
Joined: Tue Jan 19, 2010 1:56 am

Re: Sorting a file with frequency count on word length

Postby Mofi » Fri Mar 22, 2013 2:12 am

Here is a script for that task which is not really optimized for speed.

Code:
function sortByWordLengthAndCount (sFirstWord,sSecondWord)
{
   // Get the length of the 2 words compared for the sort. The word ends at
   // the first space, so the position of the space equals the word length.
   var nWordLength1 = sFirstWord.indexOf(' ');
   var nWordLength2 = sSecondWord.indexOf(' ');
   // Is word 2 shorter than word 1?
   if (nWordLength2 < nWordLength1)
   {
      return 1;  // Word 1 must be sorted after word 2.
   }
   // Is word 2 longer than word 1?
   if (nWordLength2 > nWordLength1)
   {
      return -1; // Word 1 must be sorted before word 2.
   }
   // Words have identical length, compare the frequency count values.
   var nFrequency1 = parseInt(sFirstWord.substr(nWordLength1+1),10);
   var nFrequency2 = parseInt(sSecondWord.substr(nWordLength2+1),10);
   // Is the frequency of word 2 greater than the frequency of word 1?
   if (nFrequency2 > nFrequency1)
   {
      return 1;  // Word 1 must be sorted after word 2.
   }
   // Is the frequency of word 2 lower than the frequency of word 1?
   if (nFrequency2 < nFrequency1)
   {
      return -1; // Word 1 must be sorted before word 2.
   }
   // Compare the words (lines with identical frequency values). This is an
   // alphabetical compare for words with same length and same frequency value.
   if (sFirstWord > sSecondWord)
   {
      return 1;
   }
   if (sFirstWord < sSecondWord)
   {
      return -1;
   }
   return 0;     // Both lines are identical.
}

// =========================================================================

if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Define environment for this script.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();

   UltraEdit.activeDocument.selectAll();
   // Note: The split below expects DOS line terminators (CR+LF) in the file.
   var asLines = UltraEdit.activeDocument.selection.split("\r\n");
   // Remove last line string if it is an empty string.
   var bLastLineHasLineTerm = false;
   if (asLines[asLines.length-1].length == 0)
   {
      asLines.pop();
      bLastLineHasLineTerm = true;
   }
   // Sort the lines using the special sort criteria.
   asLines.sort(sortByWordLengthAndCount);
   // Join the lines to a block in user clipboard 9.
   UltraEdit.selectClipboard(9);
   UltraEdit.clipboardContent = asLines.join("\r\n");
   if (bLastLineHasLineTerm) UltraEdit.clipboardContent += "\r\n";
   // Paste the block over selection of entire content in active file.
   UltraEdit.activeDocument.paste();
   UltraEdit.clearClipboard();
   UltraEdit.selectClipboard(0);
   UltraEdit.activeDocument.top();
}
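Since the script is described as not really optimized for speed, here is a rough sketch of one possible variation that runs outside UltraEdit (for example in Node.js). It extracts the word length and frequency of each line once before sorting, rather than on every comparison, and it accepts both CR+LF and LF line terminators. The function name sortWordList is hypothetical, and the sketch assumes every line has the form "word space frequency"; whether it is noticeably faster on a 30,000-line file has not been measured.

```javascript
// Hypothetical function (an assumption for illustration, not the script
// above): sorts a block of "word frequency" lines by word length
// ascending, then frequency descending, then alphabetically.
function sortWordList(text) {
  // Detect the line terminator used in the text.
  const eol = text.includes("\r\n") ? "\r\n" : "\n";
  const lines = text.split(eol);

  // Remember whether the last line ended with a line terminator.
  const hadFinalEol = lines.length > 1 && lines[lines.length - 1] === "";
  if (hadFinalEol) {
    lines.pop();
  }

  // Extract the sort keys once per line instead of once per comparison.
  const records = lines.map((line) => {
    const space = line.indexOf(" ");
    return {
      line: line,
      length: space, // position of the space equals the word length
      frequency: parseInt(line.slice(space + 1), 10),
    };
  });

  records.sort((a, b) =>
    a.length - b.length ||       // shorter words first
    b.frequency - a.frequency || // higher frequency first
    (a.line < b.line ? -1 : a.line > b.line ? 1 : 0) // alphabetical tie-break
  );

  return records.map((r) => r.line).join(eol) + (hadFinalEol ? eol : "");
}

console.log(sortWordList("come 1541623\nand 14291859\nthe 29962169\n"));
```

The returned text keeps the original line terminator type and the final line terminator, if there was one.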

By the way: The BBCode code tags are not only for script or programming language code. The code tags should be used for every preformatted text, whether or not it is a code sequence. Whoever named the BBCode code tags most likely expected them to be used mainly for real code, which is true in most forums. But in a forum for a text editor there is often a need to post preformatted text which is not code.
Mofi
Grand Master
Posts: 4049
Joined: Thu Jul 29, 2004 11:00 pm
Location: Vienna

