CPython源码分析|strip做了什么

前几日使用strip时遇到一个问题,在文档中找到了答案,好奇strip具体是如何实现的,就跟踪了下源码。那么一起看看吧。

问题描述

1
2
>>> 'abc_00005.md5'.strip('.md5')
'abc_0000'

本意是想去掉.md5,保留abc_00005,但实际结果却是abc_0000

stackoverflow上也有类似的提问:
https://stackoverflow.com/questions/7853914/strange-behaviour-of-python-strip-function

在官方文档中找到了答案:
https://docs.python.org/zh-cn/3/library/stdtypes.html#str.strip

根本原因是没有深刻理解strip的功能,导致误用。

源码分析

Python3中默认对字符串采用的是Unicode编码的str类型来表示,在Python文档中有描述:

Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points.

上一篇文章在Windows上编译CPython中有提到源码中Objects目录为内置数据类型实现,其中Objects/unicodeobject.c便是str数据类型相关的实现了。Unicode字符串的创建方式有多种,详见官方文档: Creating and accessing Unicode strings

我们从PyUnicode_New函数开始看起

1
2
3
4
5
6
7
8
PyObject *PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
{
.......
obj = PyObject_INIT(obj, &PyUnicode_Type);
if (obj == NULL)
return NULL;
.......
}

重点在PyObject_INIT这一句中的PyUnicode_Type,文档中有描述:

This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.

1
2
3
4
5
6
7
8
9
10
11
PyTypeObject PyUnicode_Type = {
PyVarObject_HEAD_INIT(&PyType_Type, 0)
"str", /* tp_name */
sizeof(PyUnicodeObject), /* tp_size */
......
0, /* tp_iternext */
unicode_methods, /* tp_methods */
......
unicode_new, /* tp_new */
PyObject_Del, /* tp_free */
};

其中的unicode_methods包含了str类型的数据的操作方法:

1
2
3
4
5
6
7
8
9
10
static PyMethodDef unicode_methods[] = {
UNICODE_ENCODE_METHODDEF
UNICODE_REPLACE_METHODDEF
UNICODE_SPLIT_METHODDEF
UNICODE_RSPLIT_METHODDEF
UNICODE_JOIN_METHODDEF
......
UNICODE_STRIP_METHODDEF
......
}

我这里关注的是strip方法,在Objects/clinic/unicodeobject.c.h中可以看到宏定义:

1
2
#define UNICODE_STRIP_METHODDEF    \
{"strip", (PyCFunction)unicode_strip, METH_FASTCALL, unicode_strip__doc__},

这里插播一句,在UNICODE_STRIP_METHODDEF上面可以看到关于strip的文档描述:

1
2
3
4
5
6
7
PyDoc_STRVAR(unicode_strip__doc__,
"strip($self, chars=None, /)\n"
"--\n"
"\n"
"Return a copy of the string with leading and trailing whitespace removed.\n"
"\n"
"If chars is given and not None, remove characters in chars instead.");

其对应在解释器中通过help函数查看的结果:

1
2
3
4
5
6
7
8
9
Python 3.7.7 (default, Jun 21 2020, 15:02:27) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> help(str.strip)
Help on method_descriptor:

strip(self, chars=None, /)
Return a copy of the string with leading and trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

言归正传,接下来看UNICODE_STRIP_METHODDEF宏定义中的unicode_strip函数:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
static PyObject *
unicode_strip(PyObject *self, PyObject *const *args, Py_ssize_t nargs)
{
PyObject *return_value = NULL;
PyObject *chars = Py_None;

if (!_PyArg_UnpackStack(args, nargs, "strip",
0, 1,
&chars)) {
goto exit;
}
return_value = unicode_strip_impl(self, chars);

exit:
return return_value;
}

主要调用了unicode_strip_impl函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
/*[clinic input]
str.strip as unicode_strip

chars: object = None
/

Return a copy of the string with leading and trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.
[clinic start generated code]*/

static PyObject *
unicode_strip_impl(PyObject *self, PyObject *chars)
/*[clinic end generated code: output=ca19018454345d57 input=385289c6f423b954]*/
{
return do_argstrip(self, BOTHSTRIP, chars);
}

BOTHSTRIP的宏定义如下:

1
2
3
#define LEFTSTRIP 0
#define RIGHTSTRIP 1
#define BOTHSTRIP 2

do_argstrip会根据参数的情况进入不同的处理流程:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
static PyObject *
do_argstrip(PyObject *self, int striptype, PyObject *sep)
{
if (sep != NULL && sep != Py_None) {
if (PyUnicode_Check(sep))
return _PyUnicode_XStrip(self, striptype, sep);
else {
PyErr_Format(PyExc_TypeError,
"%s arg must be None or str",
STRIPNAME(striptype));
return NULL;
}
}

return do_strip(self, striptype);
}

因为我出现的问题属于有参数的情况,会进入到_PyUnicode_XStrip中:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
/* externally visible for str.strip(unicode) */
PyObject *
_PyUnicode_XStrip(PyObject *self, int striptype, PyObject *sepobj)
/* externally visible for str.strip(unicode) */
PyObject *
_PyUnicode_XStrip(PyObject *self, int striptype, PyObject *sepobj)
{
void *data;
int kind;
Py_ssize_t i, j, len;
BLOOM_MASK sepmask;
Py_ssize_t seplen;

if (PyUnicode_READY(self) == -1 || PyUnicode_READY(sepobj) == -1)
return NULL;

kind = PyUnicode_KIND(self);
data = PyUnicode_DATA(self);
len = PyUnicode_GET_LENGTH(self);
seplen = PyUnicode_GET_LENGTH(sepobj);
sepmask = make_bloom_mask(PyUnicode_KIND(sepobj),
PyUnicode_DATA(sepobj),
seplen);

i = 0;
if (striptype != RIGHTSTRIP) {
while (i < len) {
Py_UCS4 ch = PyUnicode_READ(kind, data, i);
if (!BLOOM(sepmask, ch))
break;
if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
break;
i++;
}
}

j = len;
if (striptype != LEFTSTRIP) {
j--;
while (j >= i) {
Py_UCS4 ch = PyUnicode_READ(kind, data, j);
if (!BLOOM(sepmask, ch))
break;
if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
break;
j--;
}

j++;
}

return PyUnicode_Substring(self, i, j);
}

其中的kind是表示unicode对象里面真正的字节的存储方式:

1
2
3
4
5
6
7
8
9
10
enum PyUnicode_Kind {
/* String contains only wstr byte characters. This is only possible
when the string was created with a legacy API and _PyUnicode_Ready()
has not been called yet. */
PyUnicode_WCHAR_KIND = 0,
/* Return values of the PyUnicode_KIND() macro: */
PyUnicode_1BYTE_KIND = 1,
PyUnicode_2BYTE_KIND = 2,
PyUnicode_4BYTE_KIND = 4
};

len是用来获取原始Unicode字符串的长度,seplen是作为参数的字符串长度:

1
2
3
4
5
6
7
/* Returns the length of the unicode string. The caller has to make sure that
the string has it's canonical representation set before calling
this macro. Call PyUnicode_(FAST_)Ready to ensure that. */
#define PyUnicode_GET_LENGTH(op) \
(assert(PyUnicode_Check(op)), \
assert(PyUnicode_IS_READY(op)), \
((PyASCIIObject *)(op))->length)

sepmask是十分重要的一步,后面的计算均由此有关,但我暂时还没完全看懂其中的含义,只列出make_bloom_mask函数中的注释好了:

calculate simple bloom-style bitmask for a given unicode string

接下来会根据striptypesepmask的值决定如何处理,但都会进入到PyUnicode_FindChar中。

顺便看看do_strip函数,其适用于strip不存在任何参数的情况,do_strip_PyUnicode_XStrip的处理逻辑大体相同,但会判断字符串是否只包含ASCII字符,这可以加快处理速度,如果是,其中会通过_Py_ascii_whitespace数组索引来检查一个字符是否为空白字符

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
static PyObject *
do_strip(PyObject *self, int striptype)
{
Py_ssize_t len, i, j;

if (PyUnicode_READY(self) == -1)
return NULL;

len = PyUnicode_GET_LENGTH(self);

if (PyUnicode_IS_ASCII(self)) {
Py_UCS1 *data = PyUnicode_1BYTE_DATA(self);

i = 0;
if (striptype != RIGHTSTRIP) {
while (i < len) {
Py_UCS1 ch = data[i];
if (!_Py_ascii_whitespace[ch])
break;
i++;
}
}
......
}
else {
int kind = PyUnicode_KIND(self);
void *data = PyUnicode_DATA(self);
.....
}
return PyUnicode_Substring(self, i, j);
}

PyUnicode_FindChar使用给定的方向返回字符ch在str[start:end]中的第一个位置(方向==1表示向前搜索,方向==-1表示向后搜索),但随着字符串长度的增加,通过连续调用此函数引入的开销也会增加:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
Py_ssize_t
PyUnicode_FindChar(PyObject *str, Py_UCS4 ch,
Py_ssize_t start, Py_ssize_t end,
int direction)
{
int kind;
Py_ssize_t len, result;
if (PyUnicode_READY(str) == -1)
return -2;
len = PyUnicode_GET_LENGTH(str);
ADJUST_INDICES(start, end, len);
if (end - start < 1)
return -1;
kind = PyUnicode_KIND(str);
result = findchar(PyUnicode_1BYTE_DATA(str) + kind*start,
kind, end-start, ch, direction);
if (result == -1)
return -1;
else
return start + result;
}

在此函数中,会经由findchar进入到ucs1lib_find_char中,先看一下STRINGLIB(find_char)的写法:

1
#define STRINGLIB(F)             ucs1lib_##F

利用##连接符在宏定义中,将字符串ucs1lib_和经参数F传递过来的字符串连接起来,即ucs1lib_find_char:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
#if STRINGLIB_SIZEOF_CHAR == 1
# define MEMCHR_CUT_OFF 15
#else
# define MEMCHR_CUT_OFF 40
#endif

Py_LOCAL_INLINE(Py_ssize_t)
STRINGLIB(find_char)(const STRINGLIB_CHAR* s, Py_ssize_t n, STRINGLIB_CHAR ch)
{
const STRINGLIB_CHAR *p, *e;

p = s;
e = s + n;
if (n > MEMCHR_CUT_OFF) {
#if STRINGLIB_SIZEOF_CHAR == 1
p = memchr(s, ch, n);
if (p != NULL)
return (p - s);
return -1;
#else
/* use memchr if we can choose a needle without two many likely
false positives */
const STRINGLIB_CHAR *s1, *e1;
unsigned char needle = ch & 0xff;
/* If looking for a multiple of 256, we'd have too
many false positives looking for the '\0' byte in UCS2
and UCS4 representations. */
if (needle != 0) {
do {
void *candidate = memchr(p, needle,
(e - p) * sizeof(STRINGLIB_CHAR));
if (candidate == NULL)
return -1;
s1 = p;
p = (const STRINGLIB_CHAR *)
_Py_ALIGN_DOWN(candidate, sizeof(STRINGLIB_CHAR));
if (*p == ch)
return (p - s);
/* False positive */
p++;
if (p - s1 > MEMCHR_CUT_OFF)
continue;
if (e - p <= MEMCHR_CUT_OFF)
break;
e1 = p + MEMCHR_CUT_OFF;
while (p != e1) {
if (*p == ch)
return (p - s);
p++;
}
}
while (e - p > MEMCHR_CUT_OFF);
}
#endif
}
while (p < e) {
if (*p == ch)
return (p - s);
p++;
}
return -1;
}

其中的s为源字符串,n为查找长度,ch为查找字符,里面用了一系列的检查方式来提高搜索的效率。
_PyUnicode_XStrip处理完毕后进入到PyUnicode_Substring处理流程中,返回str的子串,从字符索引开始(包含)到字符索引结束(排除),其中判断是否只包含ASCII字符,如果不是,PyUnicode_FromKindAndData会根据PyUnicode_KIND返回的值来创建新的Unicode对象:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
PyObject*
PyUnicode_Substring(PyObject *self, Py_ssize_t start, Py_ssize_t end)
{
......
length = end - start;
if (PyUnicode_IS_ASCII(self)) {
data = PyUnicode_1BYTE_DATA(self);
return _PyUnicode_FromASCII((char*)(data + start), length);
}
else {
kind = PyUnicode_KIND(self);
data = PyUnicode_1BYTE_DATA(self);
return PyUnicode_FromKindAndData(kind,
data + kind * start,
length);
}
}

我这里会进入到_PyUnicode_FromASCII函数,通过memcpy函数进行收尾工作:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
PyObject*
_PyUnicode_FromASCII(const char *buffer, Py_ssize_t size)
{
const unsigned char *s = (const unsigned char *)buffer;
PyObject *unicode;
if (size == 1) {
#ifdef Py_DEBUG
assert((unsigned char)s[0] < 128);
#endif
return get_latin1_char(s[0]);
}
unicode = PyUnicode_New(size, 127);
if (!unicode)
return NULL;
memcpy(PyUnicode_1BYTE_DATA(unicode), s, size);
assert(_PyUnicode_CheckConsistency(unicode, 1));
return unicode;
}

再看看PyUnicode_FromKindAndData好了,就很简单的逻辑,直接用PyUnicode_KIND返回值做switch case:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
PyObject*
PyUnicode_FromKindAndData(int kind, const void *buffer, Py_ssize_t size)
{
if (size < 0) {
PyErr_SetString(PyExc_ValueError, "size must be positive");
return NULL;
}
switch (kind) {
case PyUnicode_1BYTE_KIND:
return _PyUnicode_FromUCS1(buffer, size);
case PyUnicode_2BYTE_KIND:
return _PyUnicode_FromUCS2(buffer, size);
case PyUnicode_4BYTE_KIND:
return _PyUnicode_FromUCS4(buffer, size);
default:
PyErr_SetString(PyExc_SystemError, "invalid kind");
return NULL;
}
}

_PyUnicode_FromUCS1为例可以看出处理逻辑与_PyUnicode_FromASCII函数大体一致:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
static PyObject*
_PyUnicode_FromUCS1(const Py_UCS1* u, Py_ssize_t size)
{
PyObject *res;
unsigned char max_char;

if (size == 0)
_Py_RETURN_UNICODE_EMPTY();
assert(size > 0);
if (size == 1)
return get_latin1_char(u[0]);

max_char = ucs1lib_find_max_char(u, u + size);
res = PyUnicode_New(size, max_char);
if (!res)
return NULL;
memcpy(PyUnicode_1BYTE_DATA(res), u, size);
assert(_PyUnicode_CheckConsistency(res, 1));
return res;
}

结束

到这里分析完strip相关的源码,处理流程就十分清晰了,算是解除了困惑,以后有空还是要多看看Python源码才行😀

参考

1、https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str
2、https://docs.python.org/zh-cn/3/c-api/unicode.html
3、https://docs.python.org/3/c-api/structures.html#c.PyMethodDef
4、https://www.python.org/dev/peps/pep-0393/
5、https://stackoverflow.com/a/38285655

0%