Strip file of all non-printable characters

Post by c4p0ne » Thu Mar 31, 2011 3:44 am

I'd actually like to be able to do two things (in Perl). One is to strip all non-printable characters from a file. The other is, instead of removing individual characters, to remove entire lines that contain one or more non-printable characters (even if those lines also have printable characters in them).

Oh, I wonder if there is a way to run through the file and, instead of merely deleting all lines with non-printables in them, dump those lines into a separate file? I guess I'd have to do the "reverse" of my question above, which is to strip the file of any lines that DO NOT contain at least one non-printable, and then simply do a save-as on the result....

::edit::

Oh, by the way, some files have NULL characters in them and UltraEdit opens those as BINARY files even though they are actually 99.9% text. Is there a way to prevent it from opening files in that mode?

Re: Strip file of all non-printable characters

Post by Mofi » Thu Mar 31, 2011 8:40 am

If there are more than 2 NULL bytes in the first 64 KB of a file, UltraEdit always reads it as a binary file. You can simply turn off hex editing mode to see the content of such a file in text edit mode. But if you then modify the file, all NULL bytes are replaced by spaces, unless the configuration setting Allow editing of text files with hex 00's without converting them to spaces is enabled at Advanced - Configuration - Editor - Advanced.
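For anyone who wants to check a file before opening it, here is a minimal Python sketch of the rule as described above. It only restates that rule (more than 2 NULL bytes in the first 64 KB); it is not UltraEdit's actual code, and the function name and parameters are made up for illustration.

```python
def looks_binary(data: bytes, window: int = 64 * 1024, max_nulls: int = 2) -> bool:
    """Hypothetical helper: True if the leading window holds more than max_nulls NULL bytes."""
    # Only the first `window` bytes are inspected, as in the rule described above.
    return data[:window].count(b"\x00") > max_nulls


two_nulls = b"a\x00b\x00c"
three_nulls = b"a\x00b\x00c\x00"
late_nulls = b"x" * (64 * 1024) + b"\x00" * 10  # NULLs only after the first 64 KB
```

With these samples, only three_nulls would be flagged; late_nulls is not, because its NULL bytes sit outside the inspected window.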

The Perl regular expression engine supports hexadecimal notation of bytes in the form \x00, so you can search for the characters you want to remove, ideally with Replace In Files to avoid loading the files as binary files. The Perl regular expression string [\x00-\x1F], for example, finds all bytes with a value lower than a space, but that also includes carriage return, line-feed, form-feed and horizontal tab.

A better choice is [\x00-\x08\x0B\x0E-\x1F]+, which finds control characters not usually used in plain text files (including the vertical tab character). The additional + matches one or more control characters in sequence, which makes the replace faster. You can use this search string with an empty replace string to remove all control characters.
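Outside UltraEdit, the same character class can be tried in a script. The following is a rough Python sketch, assuming the data is handled as raw bytes; the function name is invented for illustration.

```python
import re

# Same character class as the search string above, applied to raw bytes.
# Tab (09), line-feed (0A), form-feed (0C) and carriage return (0D) are kept.
CONTROL_RUN = re.compile(rb"[\x00-\x08\x0B\x0E-\x1F]+")


def strip_control_bytes(data: bytes) -> bytes:
    """Remove every run of the control bytes matched by CONTROL_RUN."""
    return CONTROL_RUN.sub(b"", data)


sample = b"abc\x00\x01def\tghi\r\n\x0bjkl\x1f\r\n"
cleaned = strip_control_bytes(sample)
```

Note that this class does not touch DEL (hex 7F) or bytes of 80 and above, just like the search string itself.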

The Perl regular expression search string ^.*[\x00-\x08\x0B\x0E-\x1F].*\r\n, used with an empty replace string, deletes entire DOS-terminated lines that contain at least one control character, although calling a file with such a byte stream a text file with lines is courageous.
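For the second part of the question (dumping the affected lines into a separate file instead of just deleting them), a script may be easier than two replace passes. This is only a sketch in Python, not an UltraEdit feature. The function name is hypothetical, and it assumes lines end in \r\n, \n or \r, which is what bytes.splitlines splits on.

```python
import re
from typing import Tuple

# One control byte anywhere in the line is enough to flag it.
CONTROL = re.compile(rb"[\x00-\x08\x0B\x0E-\x1F]")


def split_lines(data: bytes) -> Tuple[bytes, bytes]:
    """Return (clean, flagged): the lines without and with control bytes."""
    clean, flagged = [], []
    # keepends=True preserves each line's original terminator.
    for line in data.splitlines(keepends=True):
        (flagged if CONTROL.search(line) else clean).append(line)
    return b"".join(clean), b"".join(flagged)


sample = b"good line\r\nbad\x00line\r\nalso good\r\n"
clean, flagged = split_lines(sample)
```

Each of the two results can then be written to its own file opened in binary mode ("wb"), so the bytes are saved unchanged.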

Don't use these search strings on a UTF-16 encoded text file. That destroys the content of such a file.
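To illustrate why, here is a small Python sketch, assuming the replace runs on the raw bytes of a UTF-16 LE file. In that encoding every ASCII character carries a NULL byte, so a byte-level strip removes half of the file. The decode-first variant is just one possible workaround, and its function name is made up.

```python
import re

BYTE_CLASS = re.compile(rb"[\x00-\x08\x0B\x0E-\x1F]+")
TEXT_CLASS = re.compile(r"[\x00-\x08\x0B\x0E-\x1F]+")


def strip_control_text(raw: bytes, encoding: str) -> bytes:
    """Decode first, strip control characters, then encode back."""
    return TEXT_CLASS.sub("", raw.decode(encoding)).encode(encoding)


utf16 = "Hi\x01!".encode("utf-16-le")  # b"H\x00i\x00\x01\x00!\x00"

# The byte-level strip also removes the NULL half of every character.
byte_level = BYTE_CLASS.sub(b"", utf16)

# Working on decoded text removes only the real control character.
safe = strip_control_text(utf16, "utf-16-le")
```

Here byte_level is no longer valid UTF-16 LE (it has an odd number of bytes), while safe still decodes to the intended text.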

Re: Strip file of all non-printable characters

Post by c4p0ne » Fri Apr 01, 2011 5:14 am

Works, works, and works. Thanks again... Yes, I am indeed "courageous" :lol: :P. Unfortunately, I am dealing with some unique files which are MEANT to be text files, but because of the nature of what is being done (some of the data being streamed/dumped into the file has these special characters in it) they come out screwy. So the problem here is that the data is correct and cannot be changed. If I have a line which should never show up in a text file, like a bunch of wacky nulls separated by more wacky characters, well, it's got to be there. This is something UltraEdit actually wasn't built for, but it works GREAT NONETHELESS!

Thus, the least I can do (which I am doing now) is to try to separate all that "correct trash" from actual pure text, and dump it into what I am now officially calling my "courageous text files" directory :lol: Thanks!