Why does Scintilla have/need the SCI_SETCODEPAGE and SCI_GETCODEPAGE functions? No matter what character set or encoding I choose for a given editor tab, editor.getCodePage() always returns 65001. As far as I can tell, SCI_GETCODEPAGE only ever returns 65001. Even when I change the settings so that new documents are in something like OEM 855, which should be codepage 855, it returns 65001. Even when I go to cmd.exe, chcp 855, then launch a new instance of Notepad++ from that cmd.exe environment, it returns 65001. Even if I setCodePage(855), a subsequent editor.getCodePage() returns 65001.

Oh, from the Scintilla SCI_GETCODEPAGE docs: "Code page can be set to 65001 (UTF-8), 932 (Japanese Shift-JIS), 936 (Simplified Chinese GBK), 949 (Korean Unified Hangul Code), 950 (Traditional Chinese Big5), or 1361 (Korean Johab)." So it's there for choosing either UTF-8 or one of the Asian codepages. The docs also mention that 0 is valid (for disabling multibyte support). When I set 0, it reads back 0, and similarly for the six explicit values quoted. But if I choose a charset like Shift-JIS, it does not change the SCI_GETCODEPAGE readback, so the codepage and charset are separate entities.

Basically, it appears that if it's a real Unicode encoding (UTF8, UCS2), HexEditor will use that byte representation. But if it's a "character set", then it shows UTF8 bytes for non-ASCII characters, no matter how the file happens to be encoded on disk. I am still guessing this is based on how Scintilla handles the bytes internally.

Codes are used to represent one thing (e.g., a character) as something else (e.g., a number). ASCII and Unicode (e.g., UTF-8) are two common ways of coding characters as numbers. Unicode includes ASCII as well as nearly all other languages known to exist. Computers process and store everything as binary.
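To see concretely what those Scintilla code pages mean, here is a small standalone Python sketch (plain Python, not PythonScript; the codec names are Python's equivalents of the numbered code pages, which is an assumption of this sketch, not something stated in the thread). The same character becomes a different byte sequence under each code page:

```python
# Standalone sketch: the code pages quoted from the Scintilla docs are
# multibyte encodings; one character maps to different bytes under each.
# Codec names are Python's equivalents (assumption for illustration).
ch = "\u30a2"  # KATAKANA LETTER A

for label, codec in [("65001 (UTF-8)", "utf-8"),
                     ("932 (Shift-JIS)", "cp932"),
                     ("936 (GBK)", "gbk")]:
    print(label, "->", ch.encode(codec).hex(" "))
```

Running it shows three different byte sequences for the one character, which is why Scintilla needs to know which code page the buffer uses before it can walk character boundaries correctly.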
Using a file with your ➤ character (from the other post) in it, I started playing around. UTF8, UCS2-LE-BOM, and UCS2-BE-BOM all show up differently (and with the number of bytes and endianness that I would expect), but if you choose one of the 8-bit character set "encodings", it always uses the UTF-8 byte sequence for the ➤ character. So, it appears the Hex Editor works as expected for any of the UTF8/UCS2 encodings, but not with the character sets. Given that character sets are treated differently in the code (there aren't many NPPM_ or SCI_ messages dealing with character sets, compared to more with the Unicode encodings), it appears that a character set is mostly used during read-from-disk or write-to-disk, and not during the internal manipulation.

As said in Hex-Editor plugin failed to handle files other than UTF8 encoding:

> And I believe that the Scintilla editor object stores the text in memory as UTF8

So, we must remember that the Summary feature just looks into the Notepad++ buffer, which is UTF-8 encoded. Thus, this fact explains the Document length value seen for UCS-2 BE/LE BOM encoded files. However, the value of Characters (without line endings) is totally wrong for these encodings and seems to be, instead, the number of bytes of its corresponding UTF-8 file!

For instance, the Unicode values of the four characters of the text are 4F60 597D 55CE FF1F. As all these codes belong to the range [\x, each ideograph is coded with 3 bytes in a UTF-8 encoded file, as is the ？ fullwidth question mark. But, when converted to the UCS-2 BE/LE BOM encoding, the View > Summary option returns 12 (so 4 chars × 3 bytes) instead of the value 4, as correctly reported in the status bar after a Select All operation.
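The byte arithmetic above can be checked in plain standalone Python (again, not PythonScript) using the four code points quoted in the post:

```python
# Standalone check of the counts discussed above: a 4-character text
# whose Unicode code points are 4F60 597D 55CE FF1F.
text = "\u4f60\u597d\u55ce\uff1f"

chars = len(text)                           # true character count
ucs2_bytes = len(text.encode("utf-16-le"))  # on-disk size as UCS-2 (no BOM)
utf8_bytes = len(text.encode("utf-8"))      # size of the internal UTF-8 buffer

print(chars, ucs2_bytes, utf8_bytes)        # 4 8 12
```

The 12 here matches the wrong "Characters" value the Summary reports for the UCS-2 file: it is counting the UTF-8 bytes of the buffer, not the characters.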
Notepad++ is a text editor, and it will treat text as text characters, not as individual bytes. Notepad++ only guarantees what the encoding is when it's on the disk (for read or for write), not what it is in memory, nor what other plugins might do with the bytes from memory. Based on your results, it is apparent that the Hex Editor gets the contents of the file from the Scintilla editor object, not from the bytes on the disk. And I believe that the Scintilla editor object stores the text in memory as UTF8, so Hex Editor would see the same.

In theory, you could put in a feature request with the developer of HexEditor to allow it to use the real disk contents or the Scintilla-edited contents. However, the official repo ( Editor/) hasn't been updated in years – the author appears to have abandoned the plugin. Someone has provided a bugfix version, but he makes it clear that it's "unofficial", and I don't know whether or not he is actively taking feature requests. If you want a true hex editor, which doesn't hide the encoding, I suggest using a standalone one (possibly like HxD).

This explains the first of the major problems found while testing the Summary feature and mentioned at the very beginning of this long post of mine.