Character encoding problems in Python

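I'm fairly new to Python and keep running into encoding errors such as `UnicodeEncodeError: 'ascii' codec can't encode characters in xxx`, which has been a real headache.

First, a caveat: this error is not necessarily caused by your code. It may well come from your terminal or IDE.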

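## About character encoding in Python

Python represents text internally as Unicode. In Python 3 every `str` is a Unicode string. In Python 2 only `unicode` objects (such as `u'...'`) are, while a plain `'...'` literal is a byte string.

Consider the following code: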

```python
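s = '翟士丹'
# Python 3: the interpreter decodes the source file (UTF-8 by default), so s is a Unicode string.
# Python 2: s holds the raw bytes from the file; write u'翟士丹' and declare the file encoding to get a Unicode string.
# Either way, how the literal is decoded depends on the encoding of the source file.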
```

### About string encode and decode

encode: converts a Unicode string into bytes using the specified encoding

```
'zsd'.encode('utf-8')
# Encodes the Unicode string zsd as UTF-8
```

decode: converts a byte string in the specified encoding into a Unicode string

```
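# Python 2: str1 is a byte string holding GB2312-encoded text
str1 = '\xd6\xd0\xce\xc4'  # the GB2312 bytes of u'中文'
str1.decode('gb2312')
# Converts str1 to a Unicode string; equivalent to unicode(str1, 'gb2312')
# We know str1 is GB2312, but Python does not: to Python it is just a sequence of bytes.
# If those bytes are decoded with the wrong codec (or the default ascii one), you get an error or garbled text.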
```
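The round trip between text and bytes can be sketched as follows. This is a Python 3 illustration (the post itself mostly describes Python 2), where `str` plays the role of Python 2's `unicode` and `bytes` the role of Python 2's `str`. The variable names are made up for the example.

```python
# Python 3 sketch: str is Unicode text, bytes is encoded data.
text = '中文'

utf8_bytes = text.encode('utf-8')    # Unicode text -> UTF-8 bytes
gb_bytes = text.encode('gb2312')     # same text -> GB2312 bytes

# Decoding with the matching codec recovers the original text.
assert utf8_bytes.decode('utf-8') == text
assert gb_bytes.decode('gb2312') == text

# Decoding with the wrong codec fails (or, for some inputs, yields garbled text).
garbled = None
try:
    garbled = gb_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    print('wrong codec:', exc)
```

The same text produces different bytes under different encodings, which is why the decoder has to be told which one was used.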

### About unicode

In Python 2, Unicode strings can be constructed with the `unicode` function, e.g. `unicode('abcd')`. Calling `unicode('中文')`, however, raises the following error:

```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
```

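This happens because `unicode` decodes with the ASCII codec by default, and ASCII only covers the values 0-127. The bytes that make up `中文` are all above 127, so decoding them as ASCII fails. This error is very common on Python 2.x.

If your source file is UTF-8 encoded, passing the codec explicitly with `unicode('中文', 'utf-8')` avoids the error. The result is: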

```
In [4]: unicode('中文', 'utf-8')
Out[4]: u'\u4e2d\u6587'
```
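The same failure can be reproduced on Python 3, where there is no `unicode` function and the decoding step is written as `bytes.decode`. The sketch below assumes the bytes are the UTF-8 encoding of `中文`; `raw`, `error_message` and `decoded` are names chosen for illustration.

```python
raw = b'\xe4\xb8\xad\xe6\x96\x87'  # UTF-8 bytes of '中文'

error_message = None
try:
    raw.decode('ascii')            # ASCII cannot represent bytes above 127
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

decoded = raw.decode('utf-8')      # naming the right codec works
print(repr(decoded))
```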

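### About strings written as u'xxxx'

In Python 2, `u'xxxx'` denotes a Unicode string. When the interpreter runs the file, it decodes the literal according to the file's encoding, and the result is a Unicode string.

The difference between u'xxx' and 'xxx':

'xxx' is an 8-bit string. If your file is UTF-8 encoded, for example, the string holds the raw UTF-8 bytes.

u'xxx' is the string after decoding to Unicode. If your file is UTF-8 encoded, for example, Python uses the UTF-8 codec to decode the literal into a Unicode string.

For related material, see: <http://blog.163.com/yang_jianli/blog/static/1619900062013830104610633/>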
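One way to see the difference is to compare lengths. This is a Python 3 sketch in which `text` stands in for Python 2's `u'中文'` and `data` for a plain `'中文'` in a UTF-8 source file. Both names are illustrative.

```python
text = '中文'                  # Unicode string: a sequence of code points
data = text.encode('utf-8')   # byte string: the UTF-8 bytes of that text

print(len(text))    # number of characters
print(len(data))    # number of bytes
print(repr(data))   # the raw bytes, shown as escapes
```

Each of the two characters takes three bytes in UTF-8, so the byte string is three times as long as the Unicode string.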

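## About print

When print writes a string to the terminal, the string has to be encoded with the terminal's default encoding, otherwise the output may be garbled.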
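To make this concrete, here is a rough Python 3 sketch that simulates terminals with different encodings using an in-memory stream. `write_to_fake_terminal` is a hypothetical helper written for this post, not a standard function. It only approximates what a real terminal does.

```python
import io

def write_to_fake_terminal(text, encoding, errors='strict'):
    """Return the bytes a terminal with the given encoding would receive."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding,
                              errors=errors, newline='\n')
    print(text, file=stream)   # print encodes text with the stream's encoding
    stream.flush()
    return buffer.getvalue()

# A UTF-8 terminal receives the UTF-8 bytes.
utf8_out = write_to_fake_terminal('中文', 'utf-8')

# An ASCII-only terminal cannot represent the text, so print raises.
ascii_failed = False
try:
    write_to_fake_terminal('中文', 'ascii')
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)

# With errors='replace', unencodable characters become '?'.
replaced = write_to_fake_terminal('中文', 'ascii', errors='replace')
```

The ASCII case produces the same kind of `UnicodeEncodeError` mentioned at the top of the post, even though the code calling print is identical in both cases.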

Reference:

```
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.
```

On Linux you can check the terminal's default encoding with the `locale` command. Mine shows:

```
LANG="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_CTYPE="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_ALL=
```
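The encoding Python itself has picked up can also be inspected from inside the interpreter. A small Python 3 sketch using only standard library calls (the printed values depend on your environment, so none are shown here):

```python
import locale
import sys

# Encoding that print will use for standard output (may be None if stdout is redirected to a custom object).
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# Encoding suggested by the locale settings (LANG, LC_ALL, ...).
preferred = locale.getpreferredencoding(False)

# Default codec for str/bytes conversions when none is given.
default = sys.getdefaultencoding()

print('stdout encoding:', stdout_encoding)
print('locale preferred encoding:', preferred)
print('default encoding:', default)
```

If the stdout encoding turns out to be ASCII while your text contains Chinese characters, print is likely to fail regardless of how the source file is encoded.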

On why print raises this error, and why it does not fail on Windows:

http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20

https://docs.python.org/2/howto/unicode.html
