First, how large is the file to check for similar lines - 50 KB, 5 MB or 500 MB?
Checking lines for similarity is not a trivial task. It looks like the script has to run this check for every line against all other lines below it, because it cannot be expected that similar lines start with the same character, so sorting the lines first would not help. For n lines that means n*(n-1)/2 line comparisons, which is why the file size matters.
On each line, all sequences of non-alphanumeric characters must be replaced by a space. The line, now containing only words made of letters and digits, must be split into a list of words. Next, all words with fewer than 4 characters must be removed from the list. Once this has been done for both the current line and the next line to compare it with, a loop is executed which counts how many whole words in the current line match, case-insensitively, whole words in the line to compare. If the number of equal words exceeds a threshold value, the compared line can be treated as similar, and the two similar lines are written to the output window, to a new file, or appended to the clipboard. This is repeated for the current line against every line below it.
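The steps above can be sketched in a few lines of Python. This is only an illustration of the approach, not a ready-made editor script. The function names, the threshold value of 3 and the restriction to ASCII letters and digits are assumptions made for the example. It also deviates in two small ways from the description: each line is converted to its word list only once up front, and the words are kept in a set, so a word repeated within a line is counted once.

```python
import re

MIN_WORD_LENGTH = 4  # words with fewer characters are ignored (from the description)
THRESHOLD = 3        # assumed value: more than 3 common words means "similar"


def significant_words(line):
    """Return the lowercased words of a line that have at least 4 characters."""
    # Replace every run of non-alphanumeric characters by a single space.
    cleaned = re.sub(r"[^0-9A-Za-z]+", " ", line)
    # Split into words, drop the short ones, lowercase for case-insensitive matching.
    return {w.lower() for w in cleaned.split() if len(w) >= MIN_WORD_LENGTH}


def find_similar_lines(lines, threshold=THRESHOLD):
    """Return index pairs (i, j), i < j, of lines sharing more than threshold words."""
    word_sets = [significant_words(line) for line in lines]
    pairs = []
    for i in range(len(lines)):
        # Compare the current line against all lines below it.
        for j in range(i + 1, len(lines)):
            if len(word_sets[i] & word_sets[j]) > threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    sample = [
        "The quick brown foxes jumped over lazy dogs",
        "QUICK brown foxes jumped, over the lazy cat!",
        "Completely unrelated text here",
    ]
    for i, j in find_similar_lines(sample):
        print(sample[i])
        print(sample[j])
```

The nested loop is the expensive part: its cost grows with the square of the number of lines, whatever language the script is written in.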
Okay, that's how the task could be done. It will take quite long to run, as lots of memory allocations/releases, regular expression replaces, and string compares must be done. It can be done with reasonable efficiency only if the number and length of the lines are not too large, so that as much as possible can be done in memory without accessing the file.