Sorting requirements for the words within a color group in a wordfile

Post by rhapdog » Thu May 24, 2012 7:15 pm

Mofi, a few questions about Sort Language with your tool. I have been working hard on sorting, and trying to understand the exact rules for sorting. Perhaps I have misunderstood something somewhere.

In an XML_LANG file with <? in the word list, wouldn't '<' be ignored during the sort, so that the comparison starts at '?', while '?>' is sorted normally?

I was thinking this would sort to:
Code: Select all
/C1
<? ?>

However, your macro sorts it:
Code: Select all
?> <?


I was trying to follow your set of rules for the sorts, and my sort program ended up sorting it as '<? ?>', which does not match up with yours.

Also, another difference in the way it is sorted... when encountering the underscore character within a word (ASCII 95).... For example your macro sorts:
Code: Select all
EDITADD EDITUPDATE ERROR ERRORLOG ERR_GET ERR_PRINT ERR_QUIT EXCLUSIVE EXITNOW

and my program is sorting numerically by ASCII value, which produces:
Code: Select all
EDITADD EDITUPDATE ERR_GET ERR_PRINT ERR_QUIT ERROR ERRORLOG EXCLUSIVE EXITNOW


As I do not recall any issues with wordfiles when using your macro sort, I am wondering if the different way mine is sorting is going to be an issue. Is there a rule to cover this? Or will both be accepted by UE?

Now, if these were all lowercase letters, our programs would have sorted it the same way. There is no Nocase keyword, so this is supposed to be a case sensitive language.

In case you are wondering, this is from the 4gl.uew wordfile in the User Submitted Wordfiles. (Language name INFORMIX)

(I have the HTML_LANG, XML_LANG, and LATEX_LANG working now, but am having difficulty implementing the "** " substring lines, because if a line has "*", it places it in that line. I'll get it fixed before release.)

Re: Sorting requirements for the words within a color group in a wordfile

Post by Mofi » Fri May 25, 2012 12:57 am

For UltraEdit it does not matter in which order the words starting with the same character according to Nocase are listed, whether on a single line or across a sequence of consecutive lines. Therefore <? ?> and ?> <? as well as

EDITADD EDITUPDATE ERROR ERRORLOG ERR_GET ERR_PRINT ERR_QUIT EXCLUSIVE EXITNOW

EDITADD EDITUPDATE ERR_GET ERR_PRINT ERR_QUIT ERROR ERRORLOG EXCLUSIVE EXITNOW

result in the same behavior for syntax highlighting. A simple, correct example of a word list in a color group:

Code: Select all
continue const char cdecl case
unsigned
union
auto
break
else enum extern
default do double
for fortran
far float
goto
huge
if int
label long
near
pascal
register return
short signed sizeof static struct switch
typedef
void volatile
while

As you can see, the lines are not sorted alphabetically, and the words within a line are not always in alphabetical order either. Nevertheless, every word is highlighted correctly because all words starting with the same letter are grouped in one block.

Now an example in which the words far, float and static are not highlighted correctly.

Code: Select all
continue const char cdecl case
unsigned
union
auto
break
else enum extern
default do double
for fortran
goto
huge
if int
label long
near
pascal
register return static
short signed sizeof struct switch
typedef
void volatile
while
far float

There is already a line with for fortran. UltraEdit therefore expects all other words starting with the character f on that same line or on the lines directly following it, with no line starting with a different character in between. Because far float is not directly above or below the line with for fortran, these words are ignored. The word static is not highlighted because it is defined on a line whose first character is r instead of s.
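A minimal Python sketch of the block rule described above. It is not UltraEdit's actual parser; it only models the two failure cases from the example (a word on a line that starts with a different character, and a block for a character that was already closed). The function name find_ignored_words and the nocase parameter are hypothetical.

```python
def find_ignored_words(lines, nocase=False):
    """Return the words that this simplified model expects to be ignored."""
    def key(word):
        ch = word[0]
        # Assumption taken from the thread: Nocase folds only A-Z / a-z.
        return ch.lower() if nocase and ch.isascii() and ch.isalpha() else ch

    ignored = []
    closed = set()   # first characters whose block has already ended
    current = None   # first character of the block currently being read
    for line in lines:
        words = line.split()
        if not words:
            continue
        line_key = key(words[0])
        if line_key != current:
            if current is not None:
                closed.add(current)
            current = line_key
        for word in words:
            # Ignored: wrong first character for this line, or a reopened block.
            if key(word) != line_key or line_key in closed:
                ignored.append(word)
    return ignored


# Shortened version of the faulty list above.
faulty = ["for fortran", "goto", "register return static", "short signed", "far float"]
print(find_ignored_words(faulty))  # ['static', 'far', 'float']
```

Running such a check on a sorted word list before saving the wordfile would catch both kinds of mistake without requiring any particular order inside a block.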


I agree that <? ?> would be the strictly correct order, as < has a lower value in the ASCII table than ?. The result ?> <? of the SortLanguage macro is not 100% correct, but it does not matter for UltraEdit. The wrong order is caused by the macro temporarily moving < and </ to the end of a word, with an underscore as delimiter. I reworked my SortLanguage macro on 2012-05-29 so that these 2 strings are also sorted correctly.

It looks like your sorting algorithm ignores underscores within a word. The underscore has decimal value 95 in the ASCII table and therefore a higher value than any uppercase letter. ERR_GET ERR_PRINT to the left of ERROR can be explained only if ERR_GET ERR_PRINT are interpreted during the sort as ERRGET ERRPRINT. But then I do not understand why ERR_QUIT, perhaps interpreted as ERRQUIT during the sort, is also to the left of ERROR, as Q has a higher value than O.
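The orderings discussed here can be reproduced with a short Python sketch. Python compares strings by code point, which matches the ASCII table for these words. The third key, which ranks the underscore ahead of every letter, is only a hypothetical explanation for the other order; the thread does not confirm how that sort routine actually works.

```python
words = ("EDITADD EDITUPDATE ERR_GET ERR_PRINT ERR_QUIT "
         "ERROR ERRORLOG EXCLUSIVE EXITNOW").split()

# Strict code point order: '_' (95) sorts after 'O' (79).
by_code_point = sorted(words)

# Underscores ignored: ERR_GET is compared as ERRGET, and so on.
ignoring_underscore = sorted(words, key=lambda w: w.replace("_", ""))

# Hypothetical: the underscore ranks before every letter.
underscore_first = sorted(
    words, key=lambda w: [(-1 if c == "_" else ord(c)) for c in w]
)

print(" ".join(by_code_point))
# EDITADD EDITUPDATE ERROR ERRORLOG ERR_GET ERR_PRINT ERR_QUIT EXCLUSIVE EXITNOW
print(" ".join(ignoring_underscore))
# EDITADD EDITUPDATE ERR_GET ERROR ERRORLOG ERR_PRINT ERR_QUIT EXCLUSIVE EXITNOW
print(" ".join(underscore_first))
# EDITADD EDITUPDATE ERR_GET ERR_PRINT ERR_QUIT ERROR ERRORLOG EXCLUSIVE EXITNOW
```

Ignoring underscores places only ERR_GET before ERROR, so it does not reproduce the second order posted above, while ranking the underscore before letters does.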

Well, again, for UltraEdit the order of the words within a line starting all with the same character according to Nocase does not matter.

Re: Sorting requirements for the words within a color group in a wordfile

Post by rhapdog » Fri May 25, 2012 7:57 am

Mofi wrote:Well, again, for UltraEdit the order of the words within a line starting all with the same character according to Nocase does not matter.

Thank you for that. The examples I had posted did NOT have the Nocase keyword. I see what you mean about the underscore. Strange that it should sort that way. I used a sort that was part of the programming language. Since it will still work with UE, I don't see the need to "rewrite" the sort routine, at least for now.

May I assume the only difference between Nocase and NOT Nocase is whether or not small and capital letters of the same kind (A, a) can be on the same line? It seems logical.

I'll not worry about reworking anything except the problem with the "** " lines, which occurs whenever the same group contains a word starting with *, or * by itself.

Re: Sorting requirements for the words within a color group in a wordfile

Post by Mofi » Fri May 25, 2012 11:53 am

rhapdog wrote:May I assume the only difference between Nocase and NOT Nocase is whether or not small and capital letters of the same kind (A, a) can be on the same line?

This is the main difference regarding sorting the words and where the words can be listed within a color group. There are other features that also depend on Nocase, like auto-completion and auto-correction. Let's take a look at the following example (from javascript.uew):

Code: Select all
ASCIIToUTF8 ASCIIToUnicode
UTF8ToASCII
ansiToOem
ueReOn unicodeToASCII unixMacToDos unixReOn

Without Nocase in the first line, this listing is correct. With Nocase, all words starting with a lowercase character would not be highlighted correctly.

With Nocase present in the first line, the following works:
Code: Select all
ansiToOem ASCIIToUTF8 ASCIIToUnicode
UTF8ToASCII ueReOn unicodeToASCII unixMacToDos unixReOn

as does this:

Code: Select all
ansiToOem
ASCIIToUTF8 ASCIIToUnicode
ueReOn unicodeToASCII
UTF8ToASCII unixMacToDos unixReOn

But without Nocase in the first line, the words in the second line and the 2 words starting with lowercase u in the fourth line would not be highlighted correctly.

Regarding sorting of words with underscores:
It is not absolutely necessary to get the same order, strictly following the ASCII/ANSI table, that UltraEdit produces with my macro. But perhaps you can pass flags as parameters to the sort function so that the underscore is not ignored (or interpreted like a space character) during the sort.

What you should test with the built-in sort function is whether it uses a local sort. As wordfiles can contain not just words with ASCII characters but also words with ANSI characters, a local sort that takes the language rules of the application user's locale into account could produce a wrong result. Take for example the following German word list:

Code: Select all
Arbeit Ast
arbeiten
Äste
ähnlich

That is the word list with a case-sensitive sort strictly according to ANSI, as required by UltraEdit when Nocase is not present in the first line. With Nocase the words must be listed:

Code: Select all
Arbeit arbeiten Ast
Äste
ähnlich

My SortLanguage macro produces

Code: Select all
Arbeit arbeiten Ast
ähnlich Äste

which is not correct and results in Äste not being highlighted correctly. I ignored that flaw in the sorting algorithm of my macro set so as not to make it more complicated than necessary for most wordfiles and not to slow down the macros. But you can take it into account. Nocase is evaluated by UltraEdit only for [A-Za-z], not for special local letters.
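The two required layouts can be produced with a small Python sketch, assuming one line per first character and Nocase folding limited to A-Z as stated above. The helper names fold_ascii and layout are hypothetical, and the umlauts are written as escapes so that the first character is always the single precomposed code point.

```python
from itertools import groupby


def fold_ascii(text):
    # Lowercase only A-Z; letters such as \u00c4 (Ä) stay untouched.
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in text)


def layout(words, nocase):
    """Sort by code point and put each first-character block on its own line."""
    sort_key = fold_ascii if nocase else (lambda w: w)
    ordered = sorted(words, key=sort_key)
    return [" ".join(group)
            for _, group in groupby(ordered, key=lambda w: sort_key(w)[0])]


german = ["\u00e4hnlich", "Ast", "arbeiten", "\u00c4ste", "Arbeit"]
print(layout(german, nocase=False))
# ['Arbeit Ast', 'arbeiten', 'Äste', 'ähnlich']
print(layout(german, nocase=True))
# ['Arbeit arbeiten Ast', 'Äste', 'ähnlich']
```

Because the fold touches only A-Z, Äste and ähnlich stay in separate blocks even with Nocase.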

A case sensitive local sort for German words would result in

Code: Select all
Arbeit Ast Äste
ähnlich arbeiten

as Ä = A and ä = a.

A case insensitive local sort for German words would result in

Code: Select all
ähnlich Arbeit arbeiten Ast Äste

as Ä = ä = A = a.
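The two local sort results above can be imitated in Python without depending on an installed locale. This is only an emulation of the equivalences Ä = A and ä = a stated here, not a real German collation; the table UMLAUT_BASE and the function local_key are hypothetical names.

```python
# Map the two umlauts from the example to their base letters.
UMLAUT_BASE = str.maketrans({"\u00c4": "A", "\u00e4": "a"})

words = ["Arbeit", "Ast", "arbeiten", "\u00c4ste", "\u00e4hnlich"]


def local_key(word, ignore_case):
    base = word.translate(UMLAUT_BASE)
    return base.lower() if ignore_case else base


case_sensitive = sorted(words, key=lambda w: local_key(w, False))
case_insensitive = sorted(words, key=lambda w: local_key(w, True))

print(" ".join(case_sensitive))    # Arbeit Ast Äste ähnlich arbeiten
print(" ".join(case_insensitive))  # ähnlich Arbeit arbeiten Ast Äste
```

Neither result keeps Äste and ähnlich in blocks of their own, which is why a sort of this kind does not fit the wordfile rules.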

I'm quite sure that words starting with a character with a decimal value greater than 127 are very rare in wordfiles. I don't have any in my own wordfiles. There are only 20 *.uew files on the IDM server containing non-ASCII characters. Just 5 of them contain an ANSI character at the beginning of a word. 4 of those 5 wordfiles contain just ¬ as a word to highlight; the fifth file (ue-oaw.uew) contains several strings starting with «.
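A sort tool could check for such words up front. The following Python sketch is one possible way, assuming the word list lines of a color group have already been read and decoded; the function name is hypothetical and the sample lines are made up for illustration, not taken from any of the wordfiles mentioned.

```python
def words_with_non_ascii_start(lines):
    """Return every word whose first character has a value above 127."""
    return [word
            for line in lines
            for word in line.split()
            if ord(word[0]) > 127]


# Made-up sample lines for illustration only.
sample = ["Arbeit arbeiten Ast", "\u00c4ste", "\u00e4hnlich", "\u00ac"]
print(words_with_non_ascii_start(sample))  # ['Äste', 'ähnlich', '¬']
```

An empty result means a plain code point sort and a local sort cannot disagree on the first characters of the words in that list.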

Re: Sorting requirements for the words within a color group in a wordfile

Post by rhapdog » Fri May 25, 2012 6:47 pm

I was just reading further on this "sort" function in my programming language. It's not an ASCII sort.

The sort in my program does an ANSI comparison sort, which is controlled by the current Windows locale (not the code page used in UE, but the operating system locale).

My program with Nocase sorted your example:
Code: Select all
ähnlich
Arbeit arbeiten Ast

For some reason it totally deleted Äste, and I don't know why at this point.

I would have thought that ähnlich would be after arbeiten, as it has a higher ASCII value. Apparently, ANSI doesn't see it this way.

I'll have to look into the sorting myself, and see if there is an alternative. Removing Äste isn't exactly acceptable practice. It could be that the "sorting" could handle it, but for some reason my "adding" it to the temp list before sorting may have failed. I'll have to debug. If it's removing that one, it may be removing something else as well. That said, word counts have confirmed that no words were removed from any of my other test files (and there have been a lot of them).

It is possible, since a wordfile with such words in it would be extremely rare, that I may just do something crazy like give a disclaimer that it should not be used with wordfiles that contain ASCII characters greater than 127. Or, perhaps I'll figure this out before release. :)
