12/23 In Python, why is it that '好'=='\xe5\xa5\xbd' but
u'好'!='\xe5\xa5\xbd'? I'm really baffled. What
is the encoding of '\xe5\xa5\xbd'?
\_ '好' is stored as '\xe5\xa5\xbd', which is just a string of bytes; it has
length 3. Those bytes happen to be the UTF-8 encoding of the character
(your editor or terminal produced them), but Python doesn't know what
encoding they're in. u'好' means u'\u597d', which is a string of Unicode
characters; it has length 1, and Python recognizes it as a single
Chinese character. However, it doesn't have any particular encoding!
You have to encode it as a byte string before you can output it, and
you can choose whatever encoding you want. u'好'.encode('utf-8') returns
'\xe5\xa5\xbd'. When you compare the two, Python tries to decode the
byte string as ASCII, fails, and treats them as unequal.
See http://docs.python.org/howto/unicode.html
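To make the byte string vs. Unicode string distinction concrete, here is a minimal sketch. It is written to run under Python 3, which is an assumption on my part since the thread is about Python 2: there a plain '...' literal is already Unicode and byte strings need a b prefix. The variable names are just for illustration.

```python
# One Unicode character, written as an escape so the source file's
# own encoding doesn't matter.
text = u'\u597d'                    # the character 好, a single code point

# Encoding turns the character into bytes. Different encodings give
# different bytes for the same character.
utf8 = text.encode('utf-8')         # three bytes
utf16 = text.encode('utf-16-be')    # two bytes, big-endian, no byte-order mark

print(len(text), len(utf8), len(utf16))

# The bytes from the question are exactly the UTF-8 form.
print(utf8 == b'\xe5\xa5\xbd')

# Decoding with the right codec recovers the original character.
print(utf8.decode('utf-8') == text)
print(utf16.decode('utf-16-be') == text)
```

The point is that the length-1 string and the length-3 byte string are different kinds of object, and the bytes only mean 好 once you say which encoding to read them with.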
\_ wow thanks. I always thought unicode == utf-8, boy I was
so wrong. This is all very confusing.
\_ dear dumbass:
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
http://docs.python.org/library/codecs.html
http://stackoverflow.com/questions/643694/utf-8-vs-unicode
\_ If all you've used is UTF-8, you'd have no reason to
suspect there are other Unicode encodings (and really,
if UTF-8 had been designed first, there probably wouldn't
be). Not knowing about them doesn't make you dumb.