2011/12/23-2012/2/6 [Computer/SW/Languages/Python] UID:54272 Activity:nil
12/23 In Python, why is it that '好'=='\xe5\xa5\xbd' but u'好'!='\xe5\xa5\xbd' ?
      I'm really baffled. What is the encoding of '\xe5\xa5\xbd'?
      \_ '好' means '\xe5\xa5\xbd', which is just a string of bytes; it has
         length 3. Python doesn't know what encoding it's in. u'好' means
         u'\u597d', which is a string of Unicode characters; it has length 1,
         and Python recognizes it as a single Chinese character. However, it
         doesn't have any particular encoding! You have to encode it as a
         byte string before you can output it, and you can choose whatever
         encoding you want. u'好'.encode('utf-8') returns '\xe5\xa5\xbd'.
         See http://docs.python.org/howto/unicode.html
         \_ wow thanks. I always thought unicode == utf-8, boy I was so
            wrong. This is all very confusing.
            \_ dear dumbass:
               http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
               http://docs.python.org/library/codecs.html
               http://stackoverflow.com/questions/643694/utf-8-vs-unicode
            \_ If all you've used is UTF-8, you'd have no reason to suspect
               there are other Unicode encodings (and really, if UTF-8 had
               been designed first, there probably wouldn't be). Not knowing
               about them doesn't make you dumb.
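      \_ For the curious, here's roughly how the above looks at a Python 2
         prompt. This is just a sketch: it assumes a UTF-8 terminal, so that
         typing '好' produces the bytes '\xe5\xa5\xbd'; with a different
         terminal encoding you'd see different bytes.
             >>> '\xe5\xa5\xbd'.decode('utf-8')
             u'\u597d'
             >>> u'\u597d'.encode('utf-8')
             '\xe5\xa5\xbd'
             >>> u'\u597d' == '\xe5\xa5\xbd'
             False    # Python 2 tries to decode the bytes as ASCII, fails,
                      # and treats the two as unequal (with a UnicodeWarning)
             >>> u'\u597d' == '\xe5\xa5\xbd'.decode('utf-8')
             True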
12/25
|
docs.python.org/howto/unicode.html  (Release 1.03)

This HOWTO discusses Python 2.x's support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode.

In 1968, the American Standard Code for Information Interchange, better known by its acronym ASCII, was standardized. ASCII defined numeric codes for various characters, with the numeric values running from 0 to 127. For example, the lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented characters. This meant that languages which required accented characters couldn't be faithfully represented in ASCII. I remember looking at Apple BASIC programs, published in French-language publications in the mid-1980s, that had lines like these: PRINT "FICHIER EST COMPLETE." Those messages should contain accents, and they just look wrong to someone who can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could hold values ranging from 0 to 255. ASCII codes only went up to 127, so some machines assigned values between 128 and 255 to accented characters. Different machines had different codes, however, which led to problems exchanging files. Eventually various commonly used sets of values for the 128-255 range emerged. Some were true standards, defined by the International Standards Organization, and some were de facto conventions that were invented by one company or another and managed to catch on.

255 characters aren't very many. For example, you can't fit both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian into the 128-255 range because there are more than 127 such characters. You could write files using different codes (all your Russian files in a coding system called KOI8, all your French files in a different coding system called Latin-1), but what if you wanted to write a French document that quotes some Russian text? In the 1980s people began to want to solve this problem, and the Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters, with the goal of covering the alphabets of every human language. It turns out that even 16 bits isn't enough to meet that goal, and the modern Unicode specification uses a wider range of codes, 0 to 1,114,111 (0x10ffff in base 16). Unicode and ISO 10646 were originally separate efforts, but the specifications were merged with the 1.1 revision of Unicode. I don't think the average Python programmer needs to worry about the historical details.

A character is the smallest possible component of a text. Characters are abstractions, and vary depending on the language or context you're talking about. For example, the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet (they may even be the same in some fonts), but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points, e.g. 0061 for 'a'. Strictly speaking, U+12ca is a code point, which represents some particular character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
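A quick interactive sketch (Python 2) of the code-point idea described above; the unicodedata module is part of the standard library:

    >>> import unicodedata
    >>> ord(u'a')                    # the code point of 'a', as the HOWTO says
    97
    >>> hex(ord(u'\u12ca'))
    '0x12ca'
    >>> unicodedata.name(u'\u12ca')  # look up the character's official name
    'ETHIOPIC SYLLABLE WI'
    >>> unichr(0x12ca)               # and back from code point to character
    u'\u12ca'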
A character is represented on a screen or on paper by a set of graphical elements called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. This sequence needs to be represented as a set of bytes (meaning, values from 0 to 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.

The first encoding you might think of is an array of 32-bit integers. In this representation, the string "Python" would look like this:

       P           y           t           h           o           n
    0x50 00 00 00  79 00 00 00  74 00 00 00  68 00 00 00  6f 00 00 00  6e 00 00 00
       0  1  2  3   4  5  6  7   8  9 10 11  12 13 14 15  16 17 18 19  20 21 22 23

This representation is straightforward, but using it presents a number of problems. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by zero bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have megabytes of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable. Generally people don't use this encoding, instead choosing other encodings that are more efficient and convenient.

Encodings don't have to handle every possible Unicode character, and most encodings don't. For example, Python's default encoding is the 'ascii' encoding. The rules for converting a Unicode string into the ASCII encoding are simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code point.
2. If the code point is 128 or greater, the Unicode string can't be represented in this encoding (Python raises a UnicodeEncodeError exception in this case).

Latin-1, also known as ISO-8859-1, is a similar encoding: Unicode code points 0-255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.

Encodings don't have to be simple one-to-one mappings like Latin-1. Consider IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145 through 153. If you wanted to use EBCDIC as an encoding, you'd probably use some sort of lookup table to perform the conversion, but this is largely an internal detail.

UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit numbers are used in the encoding. UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes.
3. It's also unlikely that random 8-bit data will look like valid UTF-8.
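A small interactive sketch of those encodings in Python 2 (the exact traceback text may vary slightly between versions):

    >>> u'caf\xe9'.encode('latin-1')        # U+00E9 fits in Latin-1 as the single byte 0xe9
    'caf\xe9'
    >>> u'caf\xe9'.encode('utf-8')          # in UTF-8 it becomes a two-byte sequence
    'caf\xc3\xa9'
    >>> u'caf\xe9'.encode('ascii')          # ASCII can't represent it at all
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
    >>> u'\u597d'.encode('latin-1')         # and Latin-1 can't go past U+00FF
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'latin-1' codec can't encode character u'\u597d' in position 0: ordinal not in range(256)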
www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
Dan Fairs -- last modified Sep 30, 2010 04:21 PM

In the years I've been developing in Python, Unicode seems to be the topic which causes the greatest amount of confusion amongst developers. Hopefully much of this confusion should go away in Python 3, for reasons I'll come to at the end; but until then, the UnicodeDecodeError is the bane of many developers' lives.

Unicode and Encodings

OK, let's take a step away from text for a moment and think of a number - say, the number six. When I write it down, it looks like this: 6. I could also write it out as a word: six. Of course, if I were an ancient Roman (or possibly a clockmaker), I could have written this: VI. They all mean the same thing - the number six. In other words, we've 'encoded' our idea of the number six in our head in three different ways - three different encodings. The separation of the idea of 'the number six' from its actual representation is basically all Unicode is.

The Unicode Character Set (UCS) defines a set of things (loosely, a set of letters) that we can represent. How we represent each of those letters is called an encoding. In Unicode parlance, each of those 'things' (letters) is known as a 'code point'. Unicode separates the characters' meaning from their representation.

For historical reasons, the most common encoding (in Western Europe and the US, anyway) is ASCII. It's an encoding that uses 7 bits, which limits it to 128 possible values. That's enough to represent the characters that US English uses (letters in both cases, the numbers and punctuation), though not accented characters. Therefore, Unicode strings that only include code points that are in these 128 ASCII characters can be encoded as ASCII. Conversely, any ASCII-encoded string can be decoded to Unicode.

It's worth reiterating that terminology, as you come across it a lot: the transformation from Unicode to an encoding like ASCII is called 'encoding'. The transformation from ASCII back to Unicode is called 'decoding'.

    Unicode ---- encode ----> ASCII
    ASCII   ---- decode ----> Unicode

Non-ASCII encodings

Most people don't live in the US or Western Europe, and therefore have a requirement to store more characters than can be represented with ASCII. UTF-8, for example, uses a single byte for encoding all the ASCII values, then variable numbers of bytes to encode further characters.

Some terminology

Unicode-related terminology can get confusing. Here's a quick glossary:

* To encode: encoding (the verb) means to take a Unicode string and produce a byte string.
* To decode: decoding (the verb) means to take a byte string and produce a Unicode string.
* An encoding: an encoding (the noun) is a mapping that describes how to represent a Unicode character as a byte or series of bytes.

In other words, when you encode or decode, you need to specify the encoding that you're using.

Python, bytes and strings

You've probably noticed that there seem to be a couple of ways of writing down strings in Python. One looks like this: 'this is a string'. Another looks like this: u'this is a string'. There's a good chance that you also know that the second one of those is a Unicode string. But what is the first one, and what does it actually mean to 'be a Unicode string'? The first is a byte string: a sequence of bytes which is, by convention, treated as an ASCII representation of text. The whole Python standard library, and most third-party modules, happily deal with strings natively in this encoding. As long as you live in the US or Western Europe, that's probably fine for you. A Unicode string, on the other hand, is a sequence of Unicode code points, and can therefore contain any of the Unicode code points.
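To make the byte-string/Unicode-string distinction concrete, here's a short Python 2 session (nothing here is specific to the blog post, just the standard str and unicode types):

    >>> b = 'six'            # a byte string
    >>> type(b)
    <type 'str'>
    >>> u = u'six'           # a Unicode string
    >>> type(u)
    <type 'unicode'>
    >>> u.encode('ascii')    # encoding: Unicode -> bytes
    'six'
    >>> b.decode('ascii')    # decoding: bytes -> Unicode
    u'six'
    >>> b.decode('ascii') == u
    True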
It's possible that whatever you're using to edit the Python code (or just view it) might not be able to display the entire Unicode character set - for instance, a terminal usually has an encoding that it assumes data it's trying to display is in. There's a special notation, therefore, for representing arbitrary Unicode code points within a Python Unicode string: the \u and \U escapes. There's some subtlety here (see the Python string reference for further information), but you can simply think of the number after the \u (or \U) as the Unicode code point of the character. So, for example, the following Python string: u'\u0062' represents LATIN SMALL LETTER B, or more simply: u'b'.

To summarise then: the Unicode character set encompasses all characters that we may wish to represent.

Encoding and Decoding

Byte strings and Unicode strings provide methods to perform the encoding and decoding for you:

    >>> u'd'.encode('ascii')
    'd'

As you'd expect, the Unicode string has an 'encode' method. You tell Python which encoding you want ('ascii' in this case; there are lots more supported by Python - check the docs) using the first parameter to the encode() call.

    >>> 'b'.decode('ascii')
    u'b'

Here, we're telling Python to take the byte string 'b', decode it based on the ASCII decoder and return a Unicode string. Note that in both these previous cases, we didn't really need to specify 'ascii' manually, since Python uses that as a default.

UnicodeEncodeError

So, we've established that there are encodings which can represent Unicode, or more usually, a certain subset of the Unicode character set. We've already talked about how ASCII can only represent 128 characters. So, what happens if you have a Unicode string that contains code points outside those 128 characters? Let's try something all too familiar to UK users: the £ sign.

    >>> u'\xa3'.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

Boom. This is Python telling you that it encountered a character in the Unicode string which it can't represent in the requested encoding. There's a fair amount of information in the error: it's giving you the character that it's having problems with, what position it was at in the string, and (in the case of ASCII) it's telling you that the number it was expecting was in the range 0 - 127. So what do you do? Well, you've got a couple of options:

* Pick an encoding that does have a representation for the problematic character
* Use one of the error handling arguments to encode()

The first option is obviously ideal, although its practicality depends on what you're doing with the encoded data. If you're passing it to another system that (for example) requires its text files in ASCII format, you're stuck. In that case, you're left with the second option, the error handling arguments:

    >>> u'\xa3'.encode('ascii', 'backslashreplace')
    '\\xa3'

If you choose one of those options, you'll have to let the eventual consumer of your encoded text know how those replacements were made.

UnicodeDecodeError

This one is probably more familiar to most developers. Being a 7-bit representation, ASCII only has 128 characters, represented by the numbers 0 - 127. But what happens if we try to decode a byte string containing a byte that's not in the ASCII range - say, 'abc' followed by the byte 0x80? Python raises a UnicodeDecodeError, saying that it encountered the byte 0x80 (decimal 128, the one we added) at position 3 (counting from zero) in the source byte string, which was not in the range 0 - 127.
This is normally caused by using the incorrect encoding to try to decode a byte string to Unicode. So, for example, if you were given a UTF-8 byte string and tried to decode it as ASCII, then you might well see a UnicodeDecodeError. Why only "might"? Well, remember what I mentioned before - UTF-8 shares the first 128 characters with ASCII. That means that you can take a UTF-8 byte sequence and decode it with the ASCII decoder, and *as long as there are no characters outside the ASCII range* it will work. That's exactly how these bugs slip through: nobody notices the problem until the Japanese office complains the intranet is broken.

Unicode Coercion

If you try to interpolate a byte string with a Unicode string, or vice-versa, Python will try and convert the byte string to Unicode using the default (i.e. ASCII) encoding. So:

    >>> u'Hi' + ' there'
    u'Hi there'
    >>> u'Hi %s' % 'there'
    u'Hi there'
    >>> 'Hi %s' % u'there'
    u'Hi there'

These all work fine, because all the strings that we're working with can be represented with ASCII. Look what happens when we try a character which can't be represented with ASCII though:

    >>> u'Hi ' + chr(128)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position ...
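To round off the blog excerpt, here's a hypothetical Python 2 session showing the other built-in error handlers for encoding, plus the decode-versus-coercion behaviour just described (the variable name utf8_bytes is mine, and 'xmlcharrefreplace' only works when encoding):

    >>> u'\xa3100'.encode('ascii', 'ignore')            # drop the problematic character
    '100'
    >>> u'\xa3100'.encode('ascii', 'replace')           # substitute a question mark
    '?100'
    >>> u'\xa3100'.encode('ascii', 'xmlcharrefreplace') # keep an XML character reference
    '&#163;100'
    >>> utf8_bytes = '\xe5\xa5\xbd'                     # UTF-8 for u'\u597d'
    >>> utf8_bytes.decode('ascii')                      # wrong codec
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
    >>> utf8_bytes.decode('utf-8')                      # right codec
    u'\u597d'
    >>> u'Hi ' + utf8_bytes.decode('utf-8')             # decode explicitly before mixing with Unicode
    u'Hi \u597d'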
docs.python.org/library/codecs.html

This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling lookup process.

Search functions are expected to take one argument, the encoding name in all lower case letters, and return a CodecInfo object having the following attributes: name (the name of the encoding) plus the codec's functions or classes. encode and decode must be functions or methods which have the same interface as the encode()/decode() methods of Codec instances (see Codec Interface); the functions/methods are expected to work in a stateless mode.

codecs.lookup(encoding) looks up the codec info in the Python codec registry and returns a CodecInfo object as defined above. If not found in the registry's cache, the list of registered search functions is scanned.

codecs.register_error(name, error_handler) registers the error handling function error_handler under the name name. error_handler will be called during encoding and decoding in case of an error, when name is specified as the errors parameter. For encoding it will be called with a UnicodeEncodeError instance, which contains information about the location of the error. The error handler must either raise this or a different exception, or return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue. The encoder will encode the replacement and continue encoding the original input at the specified position. Negative position values will be treated as being relative to the end of the input string.

codecs.open(filename, mode[, encoding[, errors[, buffering]]]) opens an encoded file using the given mode and returns a wrapped version providing transparent encoding/decoding. The default file mode is 'r', meaning to open the file in read mode. Note: the wrapped version will only accept the object format defined by the codecs, i.e. Unicode objects for most built-in codecs. Output is also codec-dependent and will usually be Unicode as well. Note: files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values, and it means that no automatic conversion of '\n' is done on reading and writing. encoding specifies the encoding which is to be used for the file.

codecs.EncodedFile(file, input[, output[, errors]]) returns a wrapped version of file which provides transparent encoding translation. Strings written to the wrapped file are interpreted according to the given input encoding and then written to the original file as strings using the output encoding. The intermediate encoding will usually be Unicode but depends on the specified codecs.

The module also defines a set of constants for the various encodings of the Unicode byte order mark (BOM) used in UTF-16 and UTF-32 data streams to indicate the byte order used in the stream or file, and in UTF-8 as a Unicode signature.

The codecs module defines a set of base classes which define the interface and can also be used to easily write your own codecs for use in Python. Each codec has to define four interfaces to make it usable as a codec in Python: stateless encoder, stateless decoder, stream reader and stream writer. The stream reader and writer typically reuse the stateless encoder/decoder to implement the file protocols.

The Codec class defines the interface for stateless encoders/decoders. To simplify and standardize error handling, the encode() and decode() methods may implement different error handling schemes by providing the errors string argument. Codec.encode(input[, errors]) encodes the object input and returns a tuple (output object, length consumed).
While codecs are not restricted to use with Unicode, in a Unicode context, encoding converts a Unicode object to a plain string using a particular character set encoding (e.g., cp1252 or iso-8859-1). Use StreamCodec for codecs which have to keep state in order to make encoding/decoding efficient. The encoder must be able to handle zero length input and return an empty object of the output object type in this situation.

Codec.decode(input[, errors]) decodes the object input and returns a tuple (output object, length consumed). In a Unicode context, decoding converts a plain string encoded using a particular character set encoding to a Unicode object. input must be an object which provides the bf_getreadbuf buffer slot; Python strings, buffer objects and memory mapped files are examples of objects providing this slot. Use StreamCodec for codecs which have to keep state in order to make encoding/decoding efficient. The decoder must be able to handle zero length input and return an empty object of the output object type in this situation.

The IncrementalEncoder and IncrementalDecoder classes provide the basic interface for incremental encoding and decoding. Encoding/decoding the input isn't done with one call to the stateless encoder/decoder function, but with multiple calls to the encode()/decode() method of the incremental encoder/decoder. The incremental encoder/decoder keeps track of the encoding/decoding process during method calls. The joined output of calls to the encode()/decode() method is the same as if all the single inputs were joined into one, and this input was encoded/decoded with the stateless encoder/decoder.

The IncrementalEncoder class is used for encoding an input in multiple steps. It defines the methods which every incremental encoder must define in order to be compatible with the Python codec registry. All incremental encoders must provide the constructor interface; they are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.

The IncrementalDecoder class is used for decoding an input in multiple steps. It defines the methods which every incremental decoder must define in order to be compatible with the Python codec registry. All incremental decoders must provide the constructor interface; they are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry. If final is true, the decoder must decode the input completely and must flush all buffers. If this isn't possible (e.g. because of incomplete byte sequences at the end of the input), it must initiate error handling just like in the stateless case (which might raise an exception).

The StreamWriter class is a subclass of Codec and defines the methods which every stream writer must define in order to be compatible with the Python codec registry. All stream writers must provide the constructor interface; they are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry. stream must be a file-like object open for writing binary data. reset() flushes and resets the codec buffers used for keeping state; calling this method should ensure that the data on the output is put into a clean state that allows appending of new fresh data without having to rescan the whole stream to recover state.

The StreamReader class is a subclass of Codec and defines the methods which every stream reader must define in order to be compatible with the Python codec registry.
All stream readers must provide the constructor interface; they are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry. stream must be a file-like object open for reading (binary) data.

read() will never return more than chars characters, but it might return less, if there are not enough characters available. size indicates the approximate maximum number of bytes to read from the stream for decoding purposes. The default value -1 indicates to read and decode as much as possible; size is intended to prevent having to decode huge files in one step. firstline indicates that it would be sufficient to only return the first line, if there are decoding errors on later lines. The method should use a greedy read strategy, meaning that it should read as much data as is allowed within the definition of the encoding and the given size; e.g. if optional encoding endings or state markers are available on the stream, these should be read too.

readlines() reads all lines available on the input stream and returns them as a list of lines. Line-endings are implemented using the codec's decoder method and are included in the list entries if keepends is true. In addition to these methods, the StreamReader must also inherit all other methods and attributes from the underlying codec.
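Since the documentation excerpt above is fairly abstract, here's a minimal sketch of the codecs API in use (Python 2.5+; greeting.txt and the other names are just examples, not anything the docs prescribe):

    >>> import codecs, StringIO
    >>> codecs.lookup('UTF8').name                      # aliases resolve to a canonical name
    'utf-8'
    >>> f = codecs.open('greeting.txt', 'w', encoding='utf-8')
    >>> f.write(u'\u597d\n')                            # accepts unicode, writes UTF-8 bytes
    >>> f.close()
    >>> codecs.open('greeting.txt', 'r', encoding='utf-8').read()
    u'\u597d\n'
    >>> dec = codecs.getincrementaldecoder('utf-8')()   # incremental decoding, byte by byte
    >>> dec.decode('\xe5'), dec.decode('\xa5'), dec.decode('\xbd', final=True)
    (u'', u'', u'\u597d')
    >>> raw = StringIO.StringIO('\xe5\xa5\xbd\nhello\n')
    >>> codecs.getreader('utf-8')(raw).readlines()      # a StreamReader wrapping a byte stream
    [u'\u597d\n', u'hello\n']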
stackoverflow.com/questions/643694/utf-8-vs-unicode

(accepted answer, 19 votes) To expand on the answers others have given:

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.

Skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth. Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping; ISO-8859-1 and ISO-8859-15 (also known as ISO-Latin-1, latin1, and yes there are two different versions of the 8859 ISO standard as well).

But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work. There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.

The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines a few of these bits as flags: if they're set, then the next unit in a sequence of units is to be considered part of the same character. If they're not set, this unit represents one character fully. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, 4 in UTF-32), but other language characters can occupy six bytes or more.

Multi-byte encodings (I should say multi-unit after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).

Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 gives the best average space/processing performance when representing all living languages.
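As an aside, you can see the variable-width behaviour described in that answer directly from Python 2 (a sketch; lengths are in bytes, and the utf-32 codec needs Python 2.6+):

    >>> [len(u'A'.encode(e)) for e in ('utf-8', 'utf-16-le', 'utf-32-le')]
    [1, 2, 4]
    >>> [len(u'\u597d'.encode(e)) for e in ('utf-8', 'utf-16-le', 'utf-32-le')]
    [3, 2, 4]
    >>> [len(u'\U0001d11e'.encode(e)) for e in ('utf-8', 'utf-16-le')]   # outside the BMP: a surrogate pair in UTF-16
    [4, 4]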
The Unicode standard defines fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 become the same encoding, as you're unlikely to have to deal with multi-unit characters in UTF-32.

Tuukka Mustonen Aug 25 at 14:06

(comment) @Tuukka Errors in this posting are legion. ASCII didn't work for English, missing things like curly quotes, cent signs, accents, and a whole lot more - Unicode is not just about non-English. You can't UTF-encode any Unicode scalar value as this says: surrogates and the 66 other noncharacters are all forbidden. tchrist Aug 25 at 16:15

(25 votes) "Unicode" is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them. UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. The ASCII range is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U-0010FFFF, and indeed 4 bytes could cope with up to U-001FFFFF). jalf Mar 13 '09 at 17:34

(comment) @Chris: No, ISO-8859-1 is not UTF-8. Windows-1252 and ISO-8859-1 are mostly the same, but they differ between values 0x80 and 0x99 if I remember correctly, where ISO 8859-1 has a "hole" but CP1252 defines characters. Jon Skeet Mar 14 '09 at 13:02

(17 votes) They're not the same thing - UTF-8 is a particular way of encoding Unicode. There are lots of different encodings you can choose from depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know. thomasrutter Mar 14 '09 at 9:25

(comment) However, the point is that some editors propose to save the file as "Unicode" OR "UTF-8". So mentioning that "Unicode" in that case means UTF-16 is, I believe, necessary. serhio Jul 27 '10 at 10:13

(8 votes) Unicode only defines code points, that is, a number which represents a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others. Martin Cote

(comment) However, the point is that some editors propose to save the file as "Unicode" OR "UTF-8". So mentioning that "Unicode" in that case means UTF-16 is, I believe, necessary. sarsnake Mar 13 '09 at 20:09

(comment) Although there are three different UTF-16 encodings: the two explicit UTF-16LE and UTF-16BE, and the implicit UTF-16 where the endianness is specified with a BOM.

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. It is the single most common myth about Unicode, so if you thought that, don't feel bad. In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory: A -> 0100 0001. In Unicode, a letter maps to something called a code point which is still just a theoretical concept.
Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. OK, so say we have a string: Hello which, in Unicode, corresponds to these five code points: U+0048 U+0065 U+006C U+006C U+006F. The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. |
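A closing Python 2 sketch tying this together: the code points of 'Hello', and the "two bytes per character" idea (what became UCS-2 and, later, UTF-16). It also illustrates the earlier comment about the three UTF-16 variants; note that the implicit 'utf-16' codec writes a byte order mark in the machine's native order, and a little-endian machine is assumed here.

    >>> for ch in u'Hello':
    ...     print 'U+%04X' % ord(ch),
    ...
    U+0048 U+0065 U+006C U+006C U+006F
    >>> u'Hello'.encode('utf-16-be')          # "store those numbers in two bytes each"
    '\x00H\x00e\x00l\x00l\x00o'
    >>> u'Hello'.encode('utf-16-le')          # same numbers, opposite byte order
    'H\x00e\x00l\x00l\x00o\x00'
    >>> u'Hello'.encode('utf-16')             # implicit form: BOM first, then native-endian data
    '\xff\xfeH\x00e\x00l\x00l\x00o\x00'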