ablog

不器用で落着きのない技術者のメモ

pandas で df[df[列番号]==数値] と df[df[列番号].isin([数値])] のどちらが速いか(数値編)

サマリー

14MBの CSV ファイルを読んで 1,000 回 df[df[列番号]==数値] または df[df[列番号].isin([数値])] の実行時間を計測すると、isin のほうが 0.875s(=3.131s-2.256s)遅い結果になった。cProfile でプロファイリングすると df[df[列番号]==数値] の場合、algorithms.py#isin で時間を消費している。次点の pandas._libs.algos.take_2d_axis1_object_objectmethod 'read_low_memory' of 'pandas._libs.parsers.TextReader' objects は所要時間もほぼ同じで、ほぼ algorithms.py#isin の分(cutime:0.873s)だけ遅くなっている。

テストコードでは 100,000行(14MB)の CSV ファイル(|区切り)を pandas で読み込んで DataFrame を生成して df[df[3]==10] または df[df[3].isin([10])] を行っている。これを 1,000 回繰り返している。

コード 実行時間
df[df[3]==10]] 2.256s
df[df[3].isin([10])] 3.131s

データ

  • テストデータを入手
$ aws s3 cp --recursive s3://redshift-downloads/TPC-H/2.18/10GB/ .

$ ls -lh
total 11G
-rw-rw-r--. 1 ec2-user ec2-user 233M May 24  2022 customer.tbl
-rw-rw-r--. 1 ec2-user ec2-user 7.2G May 24  2022 lineitem.tbl
-rw-rw-r--. 1 ec2-user ec2-user 2.2K May 24  2022 nation.tbl
-rw-rw-r--. 1 ec2-user ec2-user 1.7G May 24  2022 orders.tbl
-rw-rw-r--. 1 ec2-user ec2-user 231M May 24  2022 part.tbl
-rw-rw-r--. 1 ec2-user ec2-user 1.2G May 24  2022 partsupp.tbl
-rw-rw-r--. 1 ec2-user ec2-user  384 May 24  2022 region.tbl
-rw-rw-r--. 1 ec2-user ec2-user  14M May 24  2022 supplier.tbl ★

$ wc -l supplier.tbl 
100000 supplier.tbl

$ head supplier.tbl 
1|Supplier#000000001| N kD4on9OM Ipw3,gf0JBoQDd7tgrzrddZ|17|27-918-335-1736|5755.94|each slyly above the careful
2|Supplier#000000002|89eJ5ksX3ImxJQBvxObC,|5|15-679-861-2259|4032.68| slyly bold instructions. idle dependen
3|Supplier#000000003|q1,G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3|1|11-383-516-1199|4192.40|blithely silent requests after the express dependencies are sl
4|Supplier#000000004|Bk7ah4CK8SYQTepEmvMkkgMwg|15|25-843-787-7479|4641.08|riously even requests above the exp
5|Supplier#000000005|Gcdm2rJRzl5qlTVzc|11|21-151-690-3663|-283.84|. slyly regular pinto bea
6|Supplier#000000006|tQxuVm7s7CnK|14|24-696-997-4969|1365.79|final accounts. regular dolphins use against the furiously ironic decoys. 
7|Supplier#000000007|s,4TicNGB4uO6PaSqNBUq|23|33-990-965-2201|6820.35|s unwind silently furiously regular courts. final requests are deposits. requests wake quietly blit
8|Supplier#000000008|9Sq4bBH2FQEmaFOocY45sRTxo6yuoG|17|27-498-742-3860|7627.85|al pinto beans. asymptotes haggl
9|Supplier#000000009|1KhUgZegwM3ua7dsYmekYBsK|10|20-403-398-8662|5302.37|s. unusual, even requests along the furiously regular pac
10|Supplier#000000010|Saygah3gYWMp72i PY|24|34-852-489-8585|3891.91|ing waters. regular requests ar

スクリプト

  • pandas_equal_operator.py … df[df[列番号]==数値] を計測するプログラム
#!/usr/bin/env python3

import pandas as pd
df = pd.read_csv('csv/supplier.tbl', delimiter='|', header=None)

for i in range(10000):
    df[df[3]==10]
  • pandas_isin.py … df[df[列番号].isin([数値])] を計測するプログラム
#!/usr/bin/env python3

import pandas as pd
df = pd.read_csv('csv/supplier.tbl', delimiter='|', header=None)

for i in range(10000):
    df[df[3].isin([10])]
  • cprofile2txt.py … cProfile のプロファイリング結果(バイナリ)をテキストに出力するスクリプト
#!/usr/bin/env python3
import sys
import pstats

cprof_binary_file = sys.argv[1]
cprof_sort_key = sys.argv[2]

sts = pstats.Stats(cprof_binary_file)
sts.strip_dirs().sort_stats(cprof_sort_key).print_stats()

計測(実行)

  • テストプログラムを実行
$ time python -m cProfile -o pandas_equal_operator.prof pandas_equal_operator.py
real    0m2.256s ★実行時間
user    0m2.157s
sys     0m0.201s

$ time python -m cProfile -o pandas_isin.prof pandas_isin.py
real    0m3.131s ★実行時間
user    0m3.046s
sys     0m0.175s
  • cProfile の結果をテキストに変換
$ python cprofile2txt.py pandas_equal_operator.prof tottime > cprof_pandas_equal_operator.txt
$ python cprofile2txt.py pandas_isin.prof tottime > cprof_pandas_isin.txt

プロファイリング結果の確認

df[df[列番号].isin([数値])] の場合は algorithms.py#isin の所要時間が最も長い(cutime:0.873s)、次点の pandas._libs.algos.take_2d_axis1_object_objectmethod 'read_low_memory' of 'pandas._libs.parsers.TextReader' objects は所要時間もほぼ同じなので、algorithms.py#isin の分(cutime:0.873s)だけ遅くなっている。

  • cprof_pandas_equal_operator.txt
$ head -20 cprof_pandas_equal_operator.txt
Wed Oct 11 10:06:16 2023    pandas_equal_operator.prof

         737298 function calls (712789 primitive calls) in 2.068 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.341    0.000    0.341    0.000 {pandas._libs.algos.take_2d_axis1_object_object}
        1    0.259    0.259    0.260    0.260 {method 'read_low_memory' of 'pandas._libs.parsers.TextReader' objects}
     1000    0.140    0.000    0.140    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
        1    0.129    0.129    2.068    2.068 pandas_equal_operator.py:1(<module>)
      408    0.060    0.000    0.060    0.000 {built-in method marshal.loads}
     1000    0.054    0.000    0.054    0.000 {built-in method _operator.eq}
     1000    0.031    0.000    0.031    0.000 {pandas._libs.algos.take_2d_axis1_int64_int64}
113830/111829    0.030    0.000    0.046    0.000 {built-in method builtins.isinstance}
     6002    0.026    0.000    0.026    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     1000    0.026    0.000    0.026    0.000 {pandas._libs.algos.take_2d_axis0_float64_float64}
     3018    0.025    0.000    0.025    0.000 {built-in method numpy.empty}
    80/78    0.024    0.000    0.030    0.000 {built-in method _imp.create_dynamic}
     1000    0.022    0.000    0.031    0.000 utils.py:239(maybe_convert_indices)
  • cprof_pandas_isin.txt
$ head -20 cprof_pandas_isin.txt 
Wed Oct 11 10:08:04 2023    pandas_isin.prof

         732298 function calls (707789 primitive calls) in 2.939 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.785    0.001    0.873    0.001 algorithms.py:457(isin) ★ここで時間がかかっている
     1000    0.353    0.000    0.353    0.000 {pandas._libs.algos.take_2d_axis1_object_object}
        1    0.283    0.283    0.285    0.285 {method 'read_low_memory' of 'pandas._libs.parsers.TextReader' objects}
     1000    0.145    0.000    0.145    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
        1    0.134    0.134    2.939    2.939 pandas_isin.py:1(<module>)
      408    0.064    0.000    0.064    0.000 {built-in method marshal.loads}
     1000    0.033    0.000    0.033    0.000 {pandas._libs.algos.take_2d_axis1_int64_int64}
106830/104829    0.031    0.000    0.051    0.000 {built-in method builtins.isinstance}
     6002    0.030    0.000    0.030    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     4018    0.030    0.000    0.030    0.000 {built-in method numpy.empty}
     3000    0.029    0.000    0.031    0.000 cast.py:550(maybe_promote)
     1000    0.028    0.000    0.028    0.000 {pandas._libs.algos.take_2d_axis0_float64_float64}
     1000    0.027    0.000    0.044    0.000 {pandas._libs.lib.infer_dtype}

時間を消費している algorithms.py の isin のコード

def isin(comps: ListLike, values: ListLike) -> npt.NDArray[np.bool_]:
    """
    Compute the isin boolean array.

    Parameters
    ----------
    comps : list-like
    values : list-like

    Returns
    -------
    ndarray[bool]
        Same length as `comps`.
    """
    if not is_list_like(comps):
        raise TypeError(
            "only list-like objects are allowed to be passed "
            f"to isin(), you passed a `{type(comps).__name__}`"
        )
    if not is_list_like(values):
        raise TypeError(
            "only list-like objects are allowed to be passed "
            f"to isin(), you passed a `{type(values).__name__}`"
        )

    if not isinstance(values, (ABCIndex, ABCSeries, ABCExtensionArray, np.ndarray)):
        orig_values = list(values)
        values = _ensure_arraylike(orig_values, func_name="isin-targets")

        if (
            len(values) > 0
            and values.dtype.kind in "iufcb"
            and not is_signed_integer_dtype(comps)
        ):
            # GH#46485 Use object to avoid upcast to float64 later
            # TODO: Share with _find_common_type_compat
            values = construct_1d_object_array_from_listlike(orig_values)

    elif isinstance(values, ABCMultiIndex):
        # Avoid raising in extract_array
        values = np.array(values)
    else:
        values = extract_array(values, extract_numpy=True, extract_range=True)

    comps_array = _ensure_arraylike(comps, func_name="isin")
    comps_array = extract_array(comps_array, extract_numpy=True)
    if not isinstance(comps_array, np.ndarray):
        # i.e. Extension Array
        return comps_array.isin(values)

    elif needs_i8_conversion(comps_array.dtype):
        # Dispatch to DatetimeLikeArrayMixin.isin
        return pd_array(comps_array).isin(values)
    elif needs_i8_conversion(values.dtype) and not is_object_dtype(comps_array.dtype):
        # e.g. comps_array are integers and values are datetime64s
        return np.zeros(comps_array.shape, dtype=bool)
        # TODO: not quite right ... Sparse/Categorical
    elif needs_i8_conversion(values.dtype):
        return isin(comps_array, values.astype(object))

    elif isinstance(values.dtype, ExtensionDtype):
        return isin(np.asarray(comps_array), np.asarray(values))

    # GH16012
    # Ensure np.isin doesn't get object types or it *may* throw an exception
    # Albeit hashmap has O(1) look-up (vs. O(logn) in sorted array),
    # isin is faster for small sizes
    if (
        len(comps_array) > _MINIMUM_COMP_ARR_LEN
        and len(values) <= 26
        and comps_array.dtype != object
    ):
        # If the values include nan we need to check for nan explicitly
        # since np.nan it not equal to np.nan
        if isna(values).any():

            def f(c, v):
                return np.logical_or(np.isin(c, v).ravel(), np.isnan(c))

        else:
            f = lambda a, b: np.isin(a, b).ravel()

    else:
        common = np_find_common_type(values.dtype, comps_array.dtype)
        values = values.astype(common, copy=False)
        comps_array = comps_array.astype(common, copy=False)
        f = htable.ismember

    return f(comps_array, values)

環境

$ cat /etc/system-release
Amazon Linux release 2023 (Amazon Linux)
$ uname -r
6.1.52-71.125.amzn2023.x86_64
$ pyenv versions
  system
  2.7.16
  2.7.18
* 3.11.5 (set by /home/ec2-user/.pyenv/version)
  • Pandas 2.1.1
$ python
Python 3.11.5 (main, Sep 13 2023, 01:12:39) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> print(pd.__version__)
2.1.1