author | Jinkun Jang <jinkun.jang@samsung.com> | 2013-03-13 02:21:45 +0900 |
---|---|---|
committer | Jinkun Jang <jinkun.jang@samsung.com> | 2013-03-13 02:21:45 +0900 |
commit | 6013bfbbb6a80ba8529df02de472c6fe49a8d5dd (patch) | |
tree | aef757b2abf225284bf0ccb0df3a7b7aab6765aa /proginfo/txtvsbin.txt | |
parent | 3101b2a7be8f0e3cc6ff469ce2597945c862264b (diff) | |
Tizen 2.1 base

Refs: submit/tizen_2.2/20130714.155558, submit/tizen_2.1/20130425.061106, submit/tizen_2.1/20130425.000017, accepted/tizen_2.1/20130425.021229, 2.2_release, 2.2.1_release, 2.1b_release, tizen_2.2, tizen_2.1
Diffstat (limited to 'proginfo/txtvsbin.txt')
-rw-r--r-- | proginfo/txtvsbin.txt | 112 |
1 files changed, 112 insertions, 0 deletions
diff --git a/proginfo/txtvsbin.txt b/proginfo/txtvsbin.txt
new file mode 100644
index 0000000..6ba2805
--- /dev/null
+++ b/proginfo/txtvsbin.txt
@@ -0,0 +1,112 @@
+A Fast Method of Identifying Plain Text Files
+=============================================
+
+
+Introduction
+------------
+
+Given a file coming from an unknown source, it is generally impossible
+to conclude automatically, and with 100% accuracy, whether that file is
+a plain text file, without performing a heavy-duty semantic analysis on
+the file contents. It is, however, possible to obtain a fairly high
+degree of accuracy, by employing various simple heuristics.
+
+Previous versions of the zip tools were using a crude detection scheme,
+originally used by PKWare in its PKZip programs: if more than 80% (4/5)
+of the bytes are within the range [7..127], the file is labeled as plain
+text, otherwise it is labeled as binary. A prominent limitation of this
+scheme is the restriction to Latin-based alphabets. Other alphabets,
+like Greek, Cyrillic or Asian, make extensive use of the bytes within
+the range [128..255], and texts using these alphabets are most often
+mis-identified by this scheme; in other words, the rate of false
+negatives is sometimes too high, which means that the recall is low.
+Another weakness of this scheme is a reduced precision, due to the false
+positives that may occur when binary files containing a large amount of
+textual characters are mis-identified as plain text.
+
+In this article we propose a new detection scheme, with a much increased
+accuracy and precision, and a near-100% recall. This scheme is designed
+to work on ASCII and ASCII-derived alphabets, and it handles single-byte
+alphabets (ISO-8859, OEM, KOI-8, etc.), and variable-sized alphabets
+(DBCS, UTF-8, etc.). However, it cannot handle fixed-sized, multi-byte
+alphabets (UCS-2, UCS-4), nor UTF-16. The principle used by this scheme
+can easily be adapted to non-ASCII alphabets like EBCDIC.
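For comparison, the older 80% rule described in the Introduction can be sketched in Python as follows. This is only an illustration of the rule as stated in the text, not the PKZip or zip source code. The function name `looks_like_text_legacy` is hypothetical, and the handling of an empty input (classified as binary here) is an assumption, since the text does not specify it.

```python
def looks_like_text_legacy(data: bytes) -> bool:
    """Legacy rule: text if more than 80% (4/5) of the bytes are in [7..127]."""
    if not data:
        # Assumption: the source does not say how the old scheme treats
        # an empty file; it is classified as binary here.
        return False
    in_range = sum(1 for b in data if 7 <= b <= 127)
    # Integer arithmetic avoids floating-point rounding for the 4/5 threshold.
    return 5 * in_range > 4 * len(data)


english = "The quick brown fox.\n".encode("ascii")
greek = "Καλημέρα κόσμε\n".encode("iso-8859-7")

print(looks_like_text_legacy(english))  # True
print(looks_like_text_legacy(greek))    # False: a false negative
```

The Greek sample is genuine text, yet only its space and line feed fall inside [7..127], which is the low-recall problem the Introduction describes.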
+
+
+The Algorithm
+-------------
+
+The algorithm works by dividing the set of bytes [0..255] into three
+categories:
+- The white list of textual bytecodes:
+  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255
+- The gray list of tolerated bytecodes:
+  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC)
+- The black list of undesired, non-textual bytecodes:
+  0 (NUL) to 6, 14 to 25, 28 to 31.
+
+If a file contains at least one byte that belongs to the white list, and
+no byte that belongs to the black list, then the file is categorized as
+plain text. Otherwise, it is categorized as binary.
+
+
+Rationale
+---------
+
+The idea behind this algorithm relies on two observations.
+
+The first observation is that, although the full range of 7-bit codes
+(0..127) is properly specified by the ASCII standard, most control
+characters in the range 0..31 are not used in practice. The only
+widely-used, almost universally-portable control codes are 9 (TAB),
+10 (LF), and 13 (CR). There are a few more control codes that are
+recognized on a reduced range of platforms and text viewers/editors:
+7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), and 27 (ESC); but these
+codes are rarely (if ever) used alone, without being accompanied by
+some printable text. Even the newer, portable text formats, such as
+XML, avoid using control characters outside the list mentioned here.
+
+The second observation is that most of the binary files tend to contain
+control characters, especially 0 (NUL); even though the older text
+detection schemes observe the presence of non-ASCII codes from the range
+[128..255], the precision rarely has to suffer if this upper range is
+labeled as textual, because the files that are genuinely binary tend to
+contain both control characters, and codes from the upper range. On the
+other hand, the upper range needs to be labeled as textual, because it
+is being used by virtually all ASCII extensions. In particular, this
+range is being heavily used to encode non-Latin scripts.
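The three-list classification from "The Algorithm" can be sketched in Python as shown below. This is an illustrative sketch rather than the code used by the zip tools; the names `WHITE_LIST`, `GRAY_LIST`, `BLACK_LIST`, and `is_plain_text` are hypothetical. It assumes that SPACE is decimal 32 and that the gray-list codes 26 (SUB) and 27 (ESC) are not part of the black list, so that the three lists partition [0..255].

```python
# Byte categories, assuming SPACE = 32 and a gray list disjoint from the
# black list (names are hypothetical, chosen for this sketch).
WHITE_LIST = frozenset({9, 10, 13}) | frozenset(range(32, 256))
GRAY_LIST = frozenset({7, 8, 11, 12, 26, 27})
BLACK_LIST = frozenset(range(256)) - WHITE_LIST - GRAY_LIST


def is_plain_text(data: bytes) -> bool:
    """Text if at least one white-listed byte and no black-listed byte occur."""
    seen = set(data)  # only presence matters, not frequency
    if seen & BLACK_LIST:
        return False
    return bool(seen & WHITE_LIST)


text = "Καλημέρα κόσμε\n"
print(is_plain_text(text.encode("utf-8")))       # True
print(is_plain_text(text.encode("iso-8859-7")))  # True
print(is_plain_text(b"\x7fELF\x01\x01\x00"))     # False: contains 0 and 1
print(is_plain_text(b""))                        # False: no white-listed byte
print(is_plain_text(b"\x07\x1b"))                # False: gray-listed bytes only
```

Because the function looks only at which byte values are present, the same Greek sentence is classified identically in UTF-8 and ISO-8859-7.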
+
+Given the two observations, the plain text detection algorithm becomes
+straightforward. There must be at least some printable material, or
+some portable whitespace such as TAB, CR or LF, otherwise the file is
+not labeled as plain text. (The boundary case, when the file is empty,
+automatically falls into this category.) However, there must be no
+non-portable control characters, otherwise it's very likely that the
+intended reader of that file is a machine, rather than a human.
+
+Since there is no counting involved, other than simply observing the
+presence or the absence of some byte values, the algorithm produces
+uniform results on any particular text file, no matter what alphabet
+encoding is being used for that text. (In contrast, if counting were
+involved, it could be possible to obtain different results on a text
+encoded, say, using ISO-8859-2 versus UTF-8.) There is the category
+of plain text files that are "polluted" with one or a few black-listed
+codes, either by mistake, or by peculiar design considerations. In such
+cases, a scheme that tolerates a small percentage of black-listed codes
+would provide an increased recall (i.e. more true positives). This,
+however, incurs a reduced precision, since false positives are also more
+likely to appear in binary files that contain large chunks of textual
+data. "Polluted" plain text may, in fact, be regarded as binary, on
+which text conversions should not be performed. Under this premise, it
+is safe to say that the detection method provides a near-100% recall.
+
+Experiments have been run on a large set of files of various categories,
+including plain old texts, system logs, source code, formatted office
+documents, compiled object code, etcetera. The results confirm the
+optimistic assumptions about the high accuracy, precision and recall
+offered by this algorithm.
+
+
+--
+Cosmin Truta
+Last updated: 2005-Feb-27