Author Topic: [BUG REPORT]MC10.1.0.2743:Search ASCII within OEM866 text file works wrong  (Read 14572 times)

1nd1g0

  • Junior Member
  • **
  • Posts: 16
    • View Profile
Hello Mathias,

I've isolated a bug. While the file search is parsing OEM 866 encoded text files containing lines of pure Latin (ASCII) and pure Cyrillic characters, the search stops on the second line of Cyrillic characters and does not proceed further.

When just the last character of the second Cyrillic line is removed, the search works correctly: the ASCII text at the end of the file is found!

Forcing the search to ASCII encoding solves the problem (but some of my text files are Unicode, so I really need Auto mode).

Thanks for the great software and your hard work,
Best regards.

Mathias (Author)

  • Administrator
  • VIP Member
  • *****
  • Posts: 4409
    • View Profile
    • Multi Commander
Do you mean it searches ASCII text files as Unicode if there are Cyrillic characters in them?

1nd1g0

The trigger is not only a magic number of 3+ bytes in the range 0x80..0xff within one line. Even 3 national characters, each followed by 0x0d 0x0a (CR+LF) at the line end or separated by spaces, will trigger the same error.

Let's use any national character codes, 0xef or 0xff for example. Any of the following byte sequences (in hex), anywhere in the file, will stop further search:

HEX: ef 0d 0a ef 0d 0a ef 0d 0a ...

OR

HEX: ef ef ef 0a 0d ...

OR

HEX: ff 20 20 20 20 ff 20 20 20 ff 20 20 ... (ASCII text encoded starting from here with bytes 0x00..0x7f will not be found)

All the sequences trigger wrong encoding detection. BUT!
HEX: ff 20 20 20 20 ff 20 20 20 ff 20 20 20 ... (ASCII text starting from here will be found correctly)

It's hard to blame the Unicode detection algorithm, but it looks like that may be the case.
I wonder how one can detect UTF without BOM markers and not mistake it for some ordinary 8-bit encoding. Unicode is more repetitive in structure due to the same number of bytes per character throughout the whole file. So when there is no Unicode "rhythm" in the file, could that be a hint to treat it as an 8-bit ASCII-based encoding (CP-..., OEM..., UTF-8...)?
« Last Edit: December 10, 2020, 21:53:09 by 1nd1g0 »
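
The "rhythm" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Multi Commander's actual detection code: the function name `guess_text_encoding`, the returned labels and the 40% / 5% thresholds are all made up for the example. It checks for a BOM first, then looks for NUL bytes concentrated at every second position (which only catches UTF-16 text that is mostly Latin), then tries strict UTF-8 validation, and otherwise falls back to "some 8-bit code page".

```python
import codecs


def guess_text_encoding(data: bytes) -> str:
    """Hypothetical heuristic; labels and thresholds are assumptions."""
    # 1. An explicit BOM settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # 2. UTF-16 "rhythm": NUL bytes concentrated at one byte parity.
    #    This only works for mostly-Latin text; it is a rough hint.
    even, odd = data[0::2], data[1::2]
    if even and odd:
        even_nul = even.count(0) / len(even)
        odd_nul = odd.count(0) / len(odd)
        if odd_nul > 0.4 and even_nul < 0.05:
            return "utf-16-le"
        if even_nul > 0.4 and odd_nul < 0.05:
            return "utf-16-be"

    # 3. Nothing above 0x7f: plain ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"

    # 4. Strict UTF-8 validation rejects most 8-bit code page text,
    #    because a lead byte such as 0xef must be followed by
    #    continuation bytes in the range 0x80..0xbf.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 5. No UTF structure found: treat as an 8-bit ASCII-based encoding.
    return "8-bit"


# The first byte sequence from this post is not valid UTF-8:
print(guess_text_encoding(bytes.fromhex("ef0d0aef0d0aef0d0a")))
print(guess_text_encoding("Hi there".encode("utf-16-le")))
```

With this ordering, the `ef 0d 0a ...` sequence ends up in the 8-bit bucket rather than being taken for UTF or binary, because 0xef followed by 0x0d can never be well-formed UTF-8.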

Mathias (Author)

If a file is missing a BOM, it is almost impossible to detect the file type with 100% accuracy all the time. There are so many corner cases.
One problem is that Unicode text in languages such as Japanese or Chinese can be detected as binary instead of text.
But it might be possible to improve the detection in some way.

The problem for me is finding test text data that fails on my machine, because searching ASCII files depends on the language settings of the computer.

In v10.2, which is coming soon, go to Core Settings, enable the app log and set the log level to debug.
When doing a content search you will get an entry in the log view showing what content type each file is detected as:
1 = ASCII, 2 = Unicode, 3 = UTF-8, 4 = binary

Press Ctrl+L to open that panel

===============
Actually, thinking about this again now, I think the failure is that it is matching the file as binary instead of ASCII. So let me know what the log output says.
« Last Edit: December 16, 2020, 09:26:33 by Mathias (Author) »

1nd1g0

I have a seemingly interesting idea for a (temporary?) workaround that needs no drastic refactoring. What do you think of a third search mode? There are two of them now: a single encoding, and automatic detection of all encodings. A temporary workaround could be a third mode: an interface for advanced users that allows them to enable or disable the internal detection cases with checkboxes, e.g.:

√ UNICODE
   √ UTF-8
      √ BOM
      × NO-BOM
   × UTF-16
      × BOM
      × NO-BOM
   × UCS-2 BE
      × BOM
      × NO-BOM
   × UCS-2 LE
      × BOM
      × NO-BOM
√ ASCII
× BINARY

Now there are fewer things to choose from and fewer false positives. But all the responsibility lies on the shoulders of the user, so there has to be a clear statement about that, so that non-advanced users do not flood the forum with complaints after misconfiguring it.
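
To make the proposal a bit more concrete, here is a minimal sketch of how such a checkbox set could gate the detection result. Everything in it is hypothetical (the detector names, `DetectionOptions`, `pick_format`); it says nothing about how MC is structured internally. The default set mirrors the ticked boxes in the list above: UTF-8 with BOM and ASCII enabled, everything else disabled.

```python
from dataclasses import dataclass, field

# Hypothetical detector names, one per leaf checkbox in the list above.
ALL_DETECTORS = (
    "utf8-bom", "utf8-nobom",
    "utf16-bom", "utf16-nobom",
    "ucs2be-bom", "ucs2be-nobom",
    "ucs2le-bom", "ucs2le-nobom",
    "ascii", "binary",
)


@dataclass
class DetectionOptions:
    # Detectors the user has ticked; default mirrors the example list.
    enabled: set = field(default_factory=lambda: {"utf8-bom", "ascii"})


def pick_format(candidates, options, fallback="ascii"):
    """Return the first candidate format the user has enabled.

    `candidates` is the detector's own ranking, best guess first.
    If every guess is disabled, use the fallback format.
    """
    for name in candidates:
        if name in options.enabled:
            return name
    return fallback


# The detector's first guess is "binary", but the user disabled it,
# so the file is searched as ASCII instead.
print(pick_format(["binary", "ascii"], DetectionOptions()))
```

The design choice here is that disabling a detector never skips the file; it only demotes that guess, so a misconfiguration degrades to the fallback format instead of silently dropping files from the search.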

1nd1g0

Here comes the log. If binary is format #4 (the last in the list), then that is indeed the case.

Code:
2020-12-19 14:39:33.626 [8452] Starting
2020-12-19 14:39:33.626 [Find Filter] Search Location #0 : D:\TEMP\_txt\
2020-12-19 14:39:33.626 [Find Filter] Rule #0 , 'File content' Contains (0x10) 'CanNotFindThis'
2020-12-19 14:39:33.627 [8452] Searching in folder : D:\TEMP\_txt
2020-12-19 14:39:33.627 [8452] Finished

2020-12-19 14:39:33.627 File Search - Content Search - File "D:\TEMP\_txt\LastLineWillNotBeFound.txt" detected as content format 4

The file is attached; the search text was "CanNotFindThis" (the log says the same). A screenshot of the settings is also attached.
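
For anyone checking many files, the "detected as content format N" lines can be pulled out of the log with a few lines of script. This is a sketch that assumes the log line layout shown above stays the same; the regular expression, the `ContentFormat` enum and `parse_detection` are illustrative names, and the number-to-name mapping is the one Mathias gave earlier in the thread (1 = ASCII, 2 = Unicode, 3 = UTF-8, 4 = binary).

```python
import re
from enum import IntEnum


class ContentFormat(IntEnum):
    # Mapping as stated earlier in this thread.
    ASCII = 1
    UNICODE = 2
    UTF8 = 3
    BINARY = 4


# Assumed layout, taken from the single log line posted above.
LOG_RE = re.compile(
    r'File "(?P<path>[^"]+)" detected as content format (?P<fmt>\d+)'
)


def parse_detection(line):
    """Return (path, ContentFormat) or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    # Raises ValueError for a format number outside 1..4.
    return m.group("path"), ContentFormat(int(m.group("fmt")))


line = (r'2020-12-19 14:39:33.627 File Search - Content Search - '
        r'File "D:\TEMP\_txt\LastLineWillNotBeFound.txt" '
        r'detected as content format 4')
path, fmt = parse_detection(line)
print(path, fmt.name)
```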

1nd1g0

Well, not really a solution, but worth mentioning. Ticking the "Search in binary files" checkbox makes MC find the text we need, but the checkbox clears itself after the search is done. And I presume Unicode text will not be found in text files falsely detected as binary, so it is not a viable solution. But it proves the point that an option to selectively disable some detectors, including binary, could be useful.

Mathias (Author)

It tries to detect whether the file is binary. If the sample of the file (the first 1000 bytes) contains a lot of non-ASCII characters, then it is treated as binary. I think the limit right now is 10%. With a 26-character file and 3 potential non-ASCII characters, that is over 10%.
So it is only very small files that get detected as binary this way. I will see if I can tweak it.
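
The rule described above can be written down as a small sketch to see why tiny files are hit. It is a reconstruction from the description only, not the actual MC code: the function name `looks_binary` is made up, and treating "non-ASCII" as "byte value 0x80 or above" is an assumption (the real check may also count control characters).

```python
def looks_binary(data: bytes, sample_size: int = 1000,
                 limit: float = 0.10) -> bool:
    """Sketch of the described rule: sample the start of the file and
    call it binary if more than `limit` of the bytes are non-ASCII.
    Assumption: 'non-ASCII' means a byte value >= 0x80."""
    sample = data[:sample_size]
    if not sample:
        return False
    non_ascii = sum(1 for b in sample if b >= 0x80)
    return non_ascii / len(sample) > limit


# 26-byte file with 3 national characters: 3/26 is about 11.5%.
small = b"\xef\r\n" * 3 + b"A" * 17
# The same 3 characters in a 100-byte file: 3%.
larger = b"\xef\r\n" * 3 + b"A" * 91

print(len(small), looks_binary(small))
print(len(larger), looks_binary(larger))
```

With a fixed percentage limit, a file of 26 bytes crosses 10% with only 3 such bytes, while the same 3 bytes in a 100-byte file stay well under it.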

1nd1g0

Since most binary files do not have text-like extensions (*.txt, *.log etc.), and MC has filters (with text file extensions filled in)... I'm just saying :) Of course, it's for the user to decide whether a filter should be used or not, and which one.