CPython源码分析|strip做了什么

发表于 2020-06-24 分类于 Python 本文字数： 11k 阅读时长 ≈ 10 分钟

前几日使用strip时遇到一个问题，在文档中找到了答案，好奇strip具体是如何实现的，就跟踪了下源码。那么一起看看吧。

问题描述

1 2	>>> 'abc_00005.md5'.strip('.md5') 'abc_0000'

本意是想去掉.md5，保留abc_00005，但实际结果却是abc_0000。

stackoverflow上也有类似的提问:
https://stackoverflow.com/questions/7853914/strange-behaviour-of-python-strip-function

在官方文档中找到了答案:
https://docs.python.org/zh-cn/3/library/stdtypes.html#str.strip

根本原因是没有深刻理解strip的功能，导致误用。

源码分析

Python3中默认对字符串采用的是Unicode编码的str类型来表示，在Python文档中有描述:

Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points.

上一篇文章在Windows上编译CPython中有提到源码中Objects目录为内置数据类型实现，其中Objects/unicodeobject.c便是str数据类型相关的实现了。Unicode字符串的创建方式有多种，详见官方文档: Creating and accessing Unicode strings

我们从PyUnicode_New函数开始看起

PyObject *PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
{
    .......
    obj = PyObject_INIT(obj, &PyUnicode_Type);
    if (obj == NULL)
        return NULL;
    .......
}

重点在PyObject_INIT这一句中的PyUnicode_Type，文档中有描述:

This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.

PyTypeObject PyUnicode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",              /* tp_name */
    sizeof(PyUnicodeObject),        /* tp_size */
    ......
    0,                  /* tp_iternext */
    unicode_methods,            /* tp_methods */
    ......
    unicode_new,            /* tp_new */
    PyObject_Del,           /* tp_free */
};

其中的unicode_methods包含了str类型的数据的操作方法:

static PyMethodDef unicode_methods[] = {
    UNICODE_ENCODE_METHODDEF
    UNICODE_REPLACE_METHODDEF
    UNICODE_SPLIT_METHODDEF
    UNICODE_RSPLIT_METHODDEF
    UNICODE_JOIN_METHODDEF
    ......
    UNICODE_STRIP_METHODDEF
    ......
}

我这里关注的是strip方法，在Objects/clinic/unicodeobject.c.h中可以看到宏定义:

1 2	#define UNICODE_STRIP_METHODDEF \ {"strip", (PyCFunction)unicode_strip, METH_FASTCALL, unicode_strip__doc__},

这里插播一句，在UNICODE_STRIP_METHODDEF上面可以看到关于strip的文档描述:

PyDoc_STRVAR(unicode_strip__doc__,
"strip($self, chars=None, /)\n"
"--\n"
"\n"
"Return a copy of the string with leading and trailing whitespace removed.\n"
"\n"
"If chars is given and not None, remove characters in chars instead.");

其对应在解释器中通过help函数查看的结果:

Python 3.7.7 (default, Jun 21 2020, 15:02:27) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> help(str.strip)
Help on method_descriptor:

strip(self, chars=None, /)
    Return a copy of the string with leading and trailing whitespace removed.

    If chars is given and not None, remove characters in chars instead.

言归正传，接下来看UNICODE_STRIP_METHODDEF宏定义中的unicode_strip函数:

static PyObject *
unicode_strip(PyObject *self, PyObject *const *args, Py_ssize_t nargs)
{
    PyObject *return_value = NULL;
    PyObject *chars = Py_None;

    if (!_PyArg_UnpackStack(args, nargs, "strip",
        0, 1,
        &chars)) {
        goto exit;
    }
    return_value = unicode_strip_impl(self, chars);

exit:
    return return_value;
}

主要调用了unicode_strip_impl函数

/*[clinic input]
str.strip as unicode_strip

    chars: object = None
    /

Return a copy of the string with leading and trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.
[clinic start generated code]*/

static PyObject *
unicode_strip_impl(PyObject *self, PyObject *chars)
/*[clinic end generated code: output=ca19018454345d57 input=385289c6f423b954]*/
{
    return do_argstrip(self, BOTHSTRIP, chars);
}

BOTHSTRIP的宏定义如下:

1
2
3

#define LEFTSTRIP 0
#define RIGHTSTRIP 1
#define BOTHSTRIP 2

do_argstrip会根据参数的情况进入不同的处理流程:

static PyObject *
do_argstrip(PyObject *self, int striptype, PyObject *sep)
{
    if (sep != NULL && sep != Py_None) {
        if (PyUnicode_Check(sep))
            return _PyUnicode_XStrip(self, striptype, sep);
        else {
            PyErr_Format(PyExc_TypeError,
                         "%s arg must be None or str",
                         STRIPNAME(striptype));
            return NULL;
        }
    }

    return do_strip(self, striptype);
}

因为我出现的问题属于有参数的情况，会进入到_PyUnicode_XStrip中:

/* externally visible for str.strip(unicode) */
PyObject *
_PyUnicode_XStrip(PyObject *self, int striptype, PyObject *sepobj)
/* externally visible for str.strip(unicode) */
PyObject *
_PyUnicode_XStrip(PyObject *self, int striptype, PyObject *sepobj)
{
    void *data;
    int kind;
    Py_ssize_t i, j, len;
    BLOOM_MASK sepmask;
    Py_ssize_t seplen;

    if (PyUnicode_READY(self) == -1 || PyUnicode_READY(sepobj) == -1)
        return NULL;

    kind = PyUnicode_KIND(self);
    data = PyUnicode_DATA(self);
    len = PyUnicode_GET_LENGTH(self);
    seplen = PyUnicode_GET_LENGTH(sepobj);
    sepmask = make_bloom_mask(PyUnicode_KIND(sepobj),
                              PyUnicode_DATA(sepobj),
                              seplen);

    i = 0;
    if (striptype != RIGHTSTRIP) {
        while (i < len) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, i);
            if (!BLOOM(sepmask, ch))
                break;
            if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
                break;
            i++;
        }
    }

    j = len;
    if (striptype != LEFTSTRIP) {
        j--;
        while (j >= i) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, j);
            if (!BLOOM(sepmask, ch))
                break;
            if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
                break;
            j--;
        }

        j++;
    }

    return PyUnicode_Substring(self, i, j);
}

其中的kind是表示unicode对象里面真正的字节的存储方式:

enum PyUnicode_Kind {
/* String contains only wstr byte characters.  This is only possible
   when the string was created with a legacy API and _PyUnicode_Ready()
   has not been called yet.  */
    PyUnicode_WCHAR_KIND = 0,
/* Return values of the PyUnicode_KIND() macro: */
    PyUnicode_1BYTE_KIND = 1,
    PyUnicode_2BYTE_KIND = 2,
    PyUnicode_4BYTE_KIND = 4
};

len是用来获取原始Unicode字符串的长度，seplen是作为参数的字符串长度:

/* Returns the length of the unicode string. The caller has to make sure that
   the string has it's canonical representation set before calling
   this macro.  Call PyUnicode_(FAST_)Ready to ensure that. */
#define PyUnicode_GET_LENGTH(op)                \
    (assert(PyUnicode_Check(op)),               \
     assert(PyUnicode_IS_READY(op)),            \
     ((PyASCIIObject *)(op))->length)

sepmask是十分重要的一步，后面的计算均由此有关，但我暂时还没完全看懂其中的含义，只列出make_bloom_mask函数中的注释好了:

calculate simple bloom-style bitmask for a given unicode string

接下来会根据striptype和sepmask的值决定如何处理，但都会进入到PyUnicode_FindChar中。

顺便看看do_strip函数，其适用于strip不存在任何参数的情况，do_strip与_PyUnicode_XStrip的处理逻辑大体相同，但会判断字符串是否只包含ASCII字符，这可以加快处理速度，如果是，其中会通过_Py_ascii_whitespace数组索引来检查一个字符是否为空白字符

static PyObject *
do_strip(PyObject *self, int striptype)
{
    Py_ssize_t len, i, j;

    if (PyUnicode_READY(self) == -1)
        return NULL;

    len = PyUnicode_GET_LENGTH(self);

    if (PyUnicode_IS_ASCII(self)) {
        Py_UCS1 *data = PyUnicode_1BYTE_DATA(self);

        i = 0;
        if (striptype != RIGHTSTRIP) {
            while (i < len) {
                Py_UCS1 ch = data[i];
                if (!_Py_ascii_whitespace[ch])
                    break;
                i++;
            }
        }
        ......
    }
    else {
        int kind = PyUnicode_KIND(self);
        void *data = PyUnicode_DATA(self);
        .....
    }
    return PyUnicode_Substring(self, i, j);
}

PyUnicode_FindChar使用给定的方向返回字符ch在str[start:end]中的第一个位置(方向==1表示向前搜索，方向==-1表示向后搜索)，但随着字符串长度的增加，通过连续调用此函数引入的开销也会增加:

Py_ssize_t
PyUnicode_FindChar(PyObject *str, Py_UCS4 ch,
                   Py_ssize_t start, Py_ssize_t end,
                   int direction)
{
    int kind;
    Py_ssize_t len, result;
    if (PyUnicode_READY(str) == -1)
        return -2;
    len = PyUnicode_GET_LENGTH(str);
    ADJUST_INDICES(start, end, len);
    if (end - start < 1)
        return -1;
    kind = PyUnicode_KIND(str);
    result = findchar(PyUnicode_1BYTE_DATA(str) + kind*start,
                      kind, end-start, ch, direction);
    if (result == -1)
        return -1;
    else
        return start + result;
}

在此函数中，会经由findchar进入到ucs1lib_find_char中，先看一下STRINGLIB(find_char)的写法:

1	#define STRINGLIB(F) ucs1lib_##F

利用##连接符在宏定义中，将字符串ucs1lib_和经参数F传递过来的字符串连接起来，即ucs1lib_find_char:

#if STRINGLIB_SIZEOF_CHAR == 1
#  define MEMCHR_CUT_OFF 15
#else
#  define MEMCHR_CUT_OFF 40
#endif

Py_LOCAL_INLINE(Py_ssize_t)
STRINGLIB(find_char)(const STRINGLIB_CHAR* s, Py_ssize_t n, STRINGLIB_CHAR ch)
{
    const STRINGLIB_CHAR *p, *e;

    p = s;
    e = s + n;
    if (n > MEMCHR_CUT_OFF) {
#if STRINGLIB_SIZEOF_CHAR == 1
        p = memchr(s, ch, n);
        if (p != NULL)
            return (p - s);
        return -1;
#else
        /* use memchr if we can choose a needle without two many likely
           false positives */
        const STRINGLIB_CHAR *s1, *e1;
        unsigned char needle = ch & 0xff;
        /* If looking for a multiple of 256, we'd have too
           many false positives looking for the '\0' byte in UCS2
           and UCS4 representations. */
        if (needle != 0) {
            do {
                void *candidate = memchr(p, needle,
                                         (e - p) * sizeof(STRINGLIB_CHAR));
                if (candidate == NULL)
                    return -1;
                s1 = p;
                p = (const STRINGLIB_CHAR *)
                        _Py_ALIGN_DOWN(candidate, sizeof(STRINGLIB_CHAR));
                if (*p == ch)
                    return (p - s);
                /* False positive */
                p++;
                if (p - s1 > MEMCHR_CUT_OFF)
                    continue;
                if (e - p <= MEMCHR_CUT_OFF)
                    break;
                e1 = p + MEMCHR_CUT_OFF;
                while (p != e1) {
                    if (*p == ch)
                        return (p - s);
                    p++;
                }
            }
            while (e - p > MEMCHR_CUT_OFF);
        }
#endif
    }
    while (p < e) {
        if (*p == ch)
            return (p - s);
        p++;
    }
    return -1;
}

其中的s为源字符串，n为查找长度，ch为查找字符，里面用了一系列的检查方式来提高搜索的效率。
_PyUnicode_XStrip处理完毕后进入到PyUnicode_Substring处理流程中，返回str的子串，从字符索引开始(包含)到字符索引结束(排除)，其中判断是否只包含ASCII字符，如果不是，PyUnicode_FromKindAndData会根据PyUnicode_KIND返回的值来创建新的Unicode对象:

PyObject*
PyUnicode_Substring(PyObject *self, Py_ssize_t start, Py_ssize_t end)
{
    ......
    length = end - start;
    if (PyUnicode_IS_ASCII(self)) {
        data = PyUnicode_1BYTE_DATA(self);
        return _PyUnicode_FromASCII((char*)(data + start), length);
    }
    else {
        kind = PyUnicode_KIND(self);
        data = PyUnicode_1BYTE_DATA(self);
        return PyUnicode_FromKindAndData(kind,
                                         data + kind * start,
                                         length);
    }
}

我这里会进入到_PyUnicode_FromASCII函数，通过memcpy函数进行收尾工作:

PyObject*
_PyUnicode_FromASCII(const char *buffer, Py_ssize_t size)
{
    const unsigned char *s = (const unsigned char *)buffer;
    PyObject *unicode;
    if (size == 1) {
#ifdef Py_DEBUG
        assert((unsigned char)s[0] < 128);
#endif
        return get_latin1_char(s[0]);
    }
    unicode = PyUnicode_New(size, 127);
    if (!unicode)
        return NULL;
    memcpy(PyUnicode_1BYTE_DATA(unicode), s, size);
    assert(_PyUnicode_CheckConsistency(unicode, 1));
    return unicode;
}

再看看PyUnicode_FromKindAndData好了，就很简单的逻辑，直接用PyUnicode_KIND返回值做switch case:

PyObject*
PyUnicode_FromKindAndData(int kind, const void *buffer, Py_ssize_t size)
{
    if (size < 0) {
        PyErr_SetString(PyExc_ValueError, "size must be positive");
        return NULL;
    }
    switch (kind) {
    case PyUnicode_1BYTE_KIND:
        return _PyUnicode_FromUCS1(buffer, size);
    case PyUnicode_2BYTE_KIND:
        return _PyUnicode_FromUCS2(buffer, size);
    case PyUnicode_4BYTE_KIND:
        return _PyUnicode_FromUCS4(buffer, size);
    default:
        PyErr_SetString(PyExc_SystemError, "invalid kind");
        return NULL;
    }
}

以_PyUnicode_FromUCS1为例可以看出处理逻辑与_PyUnicode_FromASCII函数大体一致:

static PyObject*
_PyUnicode_FromUCS1(const Py_UCS1* u, Py_ssize_t size)
{
    PyObject *res;
    unsigned char max_char;

    if (size == 0)
        _Py_RETURN_UNICODE_EMPTY();
    assert(size > 0);
    if (size == 1)
        return get_latin1_char(u[0]);

    max_char = ucs1lib_find_max_char(u, u + size);
    res = PyUnicode_New(size, max_char);
    if (!res)
        return NULL;
    memcpy(PyUnicode_1BYTE_DATA(res), u, size);
    assert(_PyUnicode_CheckConsistency(res, 1));
    return res;
}

结束

到这里分析完strip相关的源码，处理流程就十分清晰了，算是解除了困惑，以后有空还是要多看看Python源码才行😀

参考

1、https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str
2、https://docs.python.org/zh-cn/3/c-api/unicode.html
3、https://docs.python.org/3/c-api/structures.html#c.PyMethodDef
4、https://www.python.org/dev/peps/pep-0393/
5、https://stackoverflow.com/a/38285655