サマリー
14MBの CSV ファイルを読んで 1,000 回 df[df[列番号]==数値]
または df[df[列番号].isin([数値])]
の実行時間を計測すると、isin のほうが 0.875s(=3.131s-2.256s)遅い結果になった。cProfile でプロファイリングすると df[df[列番号]==数値]
の場合、algorithms.py#isin
で時間を消費している。次点の pandas._libs.algos.take_2d_axis1_object_object
や method 'read_low_memory' of 'pandas._libs.parsers.TextReader' objects
は所要時間もほぼ同じで、ほぼ algorithms.py#isin
の分(cutime:0.873s)だけ遅くなっている。
テストコードでは 100,000行(14MB)の CSV ファイル(|
区切り)を pandas で読み込んで DataFrame を生成して df[df[3]==10]
または df[df[3].isin([10])]
を行っている。これを 1,000 回繰り返している。
コード | 実行時間 |
---|---|
df[df[3]==10]] |
2.256s |
df[df[3].isin([10])] |
3.131s |
データ
- テストデータを入手
$ aws s3 cp --recursive s3://redshift-downloads/TPC-H/2.18/10GB/ . $ ls -lh total 11G -rw-rw-r--. 1 ec2-user ec2-user 233M May 24 2022 customer.tbl -rw-rw-r--. 1 ec2-user ec2-user 7.2G May 24 2022 lineitem.tbl -rw-rw-r--. 1 ec2-user ec2-user 2.2K May 24 2022 nation.tbl -rw-rw-r--. 1 ec2-user ec2-user 1.7G May 24 2022 orders.tbl -rw-rw-r--. 1 ec2-user ec2-user 231M May 24 2022 part.tbl -rw-rw-r--. 1 ec2-user ec2-user 1.2G May 24 2022 partsupp.tbl -rw-rw-r--. 1 ec2-user ec2-user 384 May 24 2022 region.tbl -rw-rw-r--. 1 ec2-user ec2-user 14M May 24 2022 supplier.tbl ★ $ wc -l supplier.tbl 100000 supplier.tbl $ head supplier.tbl 1|Supplier#000000001| N kD4on9OM Ipw3,gf0JBoQDd7tgrzrddZ|17|27-918-335-1736|5755.94|each slyly above the careful 2|Supplier#000000002|89eJ5ksX3ImxJQBvxObC,|5|15-679-861-2259|4032.68| slyly bold instructions. idle dependen 3|Supplier#000000003|q1,G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3|1|11-383-516-1199|4192.40|blithely silent requests after the express dependencies are sl 4|Supplier#000000004|Bk7ah4CK8SYQTepEmvMkkgMwg|15|25-843-787-7479|4641.08|riously even requests above the exp 5|Supplier#000000005|Gcdm2rJRzl5qlTVzc|11|21-151-690-3663|-283.84|. slyly regular pinto bea 6|Supplier#000000006|tQxuVm7s7CnK|14|24-696-997-4969|1365.79|final accounts. regular dolphins use against the furiously ironic decoys. 7|Supplier#000000007|s,4TicNGB4uO6PaSqNBUq|23|33-990-965-2201|6820.35|s unwind silently furiously regular courts. final requests are deposits. requests wake quietly blit 8|Supplier#000000008|9Sq4bBH2FQEmaFOocY45sRTxo6yuoG|17|27-498-742-3860|7627.85|al pinto beans. asymptotes haggl 9|Supplier#000000009|1KhUgZegwM3ua7dsYmekYBsK|10|20-403-398-8662|5302.37|s. unusual, even requests along the furiously regular pac 10|Supplier#000000010|Saygah3gYWMp72i PY|24|34-852-489-8585|3891.91|ing waters. regular requests ar
スクリプト
- pandas_equal_operator.py …
df[df[列番号]==数値]
を計測するプログラム
#!/usr/bin/env python3 import pandas as pd df = pd.read_csv('csv/supplier.tbl', delimiter='|', header=None) for i in range(10000): df[df[3]==10]
- pandas_isin.py …
df[df[列番号].isin([数値])]
を計測するプログラム
#!/usr/bin/env python3 import pandas as pd df = pd.read_csv('csv/supplier.tbl', delimiter='|', header=None) for i in range(10000): df[df[3].isin([10])]
- cprofile2txt.py … cProfile のプロファイリング結果(バイナリ)をテキストに出力するスクリプト
#!/usr/bin/env python3 import sys import pstats cprof_binary_file = sys.argv[1] cprof_sort_key = sys.argv[2] sts = pstats.Stats(cprof_binary_file) sts.strip_dirs().sort_stats(cprof_sort_key).print_stats()
計測(実行)
- テストプログラムを実行
$ time python -m cProfile -o pandas_equal_operator.prof pandas_equal_operator.py real 0m2.256s ★実行時間 user 0m2.157s sys 0m0.201s $ time python -m cProfile -o pandas_isin.prof pandas_isin.py real 0m3.131s ★実行時間 user 0m3.046s sys 0m0.175s
- cProfile の結果をテキストに変換
$ python cprofile2txt.py pandas_equal_operator.prof tottime > cprof_pandas_equal_operator.txt $ python cprofile2txt.py pandas_isin.prof tottime > cprof_pandas_isin.txt
プロファイリング結果の確認
df[df[列番号].isin([数値])]
の場合は algorithms.py#isin
の所要時間が最も長い(cutime:0.873s)、次点の pandas._libs.algos.take_2d_axis1_object_object
や method 'read_low_memory' of 'pandas._libs.parsers.TextReader' objects
は所要時間もほぼ同じなので、algorithms.py#isin
の分(cutime:0.873s)だけ遅くなっている。
- cprof_pandas_equal_operator.txt
$ head -20 cprof_pandas_equal_operator.txt Wed Oct 11 10:06:16 2023 pandas_equal_operator.prof 737298 function calls (712789 primitive calls) in 2.068 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1000 0.341 0.000 0.341 0.000 {pandas._libs.algos.take_2d_axis1_object_object} 1 0.259 0.259 0.260 0.260 {method 'read_low_memory' of 'pandas._libs.parsers.TextReader' objects} 1000 0.140 0.000 0.140 0.000 {method 'nonzero' of 'numpy.ndarray' objects} 1 0.129 0.129 2.068 2.068 pandas_equal_operator.py:1(<module>) 408 0.060 0.000 0.060 0.000 {built-in method marshal.loads} 1000 0.054 0.000 0.054 0.000 {built-in method _operator.eq} 1000 0.031 0.000 0.031 0.000 {pandas._libs.algos.take_2d_axis1_int64_int64} 113830/111829 0.030 0.000 0.046 0.000 {built-in method builtins.isinstance} 6002 0.026 0.000 0.026 0.000 {method 'reduce' of 'numpy.ufunc' objects} 1000 0.026 0.000 0.026 0.000 {pandas._libs.algos.take_2d_axis0_float64_float64} 3018 0.025 0.000 0.025 0.000 {built-in method numpy.empty} 80/78 0.024 0.000 0.030 0.000 {built-in method _imp.create_dynamic} 1000 0.022 0.000 0.031 0.000 utils.py:239(maybe_convert_indices)
- cprof_pandas_isin.txt
$ head -20 cprof_pandas_isin.txt Wed Oct 11 10:08:04 2023 pandas_isin.prof 732298 function calls (707789 primitive calls) in 2.939 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1000 0.785 0.001 0.873 0.001 algorithms.py:457(isin) ★ここで時間がかかっている 1000 0.353 0.000 0.353 0.000 {pandas._libs.algos.take_2d_axis1_object_object} 1 0.283 0.283 0.285 0.285 {method 'read_low_memory' of 'pandas._libs.parsers.TextReader' objects} 1000 0.145 0.000 0.145 0.000 {method 'nonzero' of 'numpy.ndarray' objects} 1 0.134 0.134 2.939 2.939 pandas_isin.py:1(<module>) 408 0.064 0.000 0.064 0.000 {built-in method marshal.loads} 1000 0.033 0.000 0.033 0.000 {pandas._libs.algos.take_2d_axis1_int64_int64} 106830/104829 0.031 0.000 0.051 0.000 {built-in method builtins.isinstance} 6002 0.030 0.000 0.030 0.000 {method 'reduce' of 'numpy.ufunc' objects} 4018 0.030 0.000 0.030 0.000 {built-in method numpy.empty} 3000 0.029 0.000 0.031 0.000 cast.py:550(maybe_promote) 1000 0.028 0.000 0.028 0.000 {pandas._libs.algos.take_2d_axis0_float64_float64} 1000 0.027 0.000 0.044 0.000 {pandas._libs.lib.infer_dtype}
時間を消費している algorithms.py の isin のコード
def isin(comps: ListLike, values: ListLike) -> npt.NDArray[np.bool_]: """ Compute the isin boolean array. Parameters ---------- comps : list-like values : list-like Returns ------- ndarray[bool] Same length as `comps`. """ if not is_list_like(comps): raise TypeError( "only list-like objects are allowed to be passed " f"to isin(), you passed a `{type(comps).__name__}`" ) if not is_list_like(values): raise TypeError( "only list-like objects are allowed to be passed " f"to isin(), you passed a `{type(values).__name__}`" ) if not isinstance(values, (ABCIndex, ABCSeries, ABCExtensionArray, np.ndarray)): orig_values = list(values) values = _ensure_arraylike(orig_values, func_name="isin-targets") if ( len(values) > 0 and values.dtype.kind in "iufcb" and not is_signed_integer_dtype(comps) ): # GH#46485 Use object to avoid upcast to float64 later # TODO: Share with _find_common_type_compat values = construct_1d_object_array_from_listlike(orig_values) elif isinstance(values, ABCMultiIndex): # Avoid raising in extract_array values = np.array(values) else: values = extract_array(values, extract_numpy=True, extract_range=True) comps_array = _ensure_arraylike(comps, func_name="isin") comps_array = extract_array(comps_array, extract_numpy=True) if not isinstance(comps_array, np.ndarray): # i.e. Extension Array return comps_array.isin(values) elif needs_i8_conversion(comps_array.dtype): # Dispatch to DatetimeLikeArrayMixin.isin return pd_array(comps_array).isin(values) elif needs_i8_conversion(values.dtype) and not is_object_dtype(comps_array.dtype): # e.g. comps_array are integers and values are datetime64s return np.zeros(comps_array.shape, dtype=bool) # TODO: not quite right ... Sparse/Categorical elif needs_i8_conversion(values.dtype): return isin(comps_array, values.astype(object)) elif isinstance(values.dtype, ExtensionDtype): return isin(np.asarray(comps_array), np.asarray(values)) # GH16012 # Ensure np.isin doesn't get object types or it *may* throw an exception # Albeit hashmap has O(1) look-up (vs. O(logn) in sorted array), # isin is faster for small sizes if ( len(comps_array) > _MINIMUM_COMP_ARR_LEN and len(values) <= 26 and comps_array.dtype != object ): # If the values include nan we need to check for nan explicitly # since np.nan it not equal to np.nan if isna(values).any(): def f(c, v): return np.logical_or(np.isin(c, v).ravel(), np.isnan(c)) else: f = lambda a, b: np.isin(a, b).ravel() else: common = np_find_common_type(values.dtype, comps_array.dtype) values = values.astype(common, copy=False) comps_array = comps_array.astype(common, copy=False) f = htable.ismember return f(comps_array, values)
環境
$ cat /etc/system-release Amazon Linux release 2023 (Amazon Linux) $ uname -r 6.1.52-71.125.amzn2023.x86_64
- Python 3.11.5
$ pyenv versions system 2.7.16 2.7.18 * 3.11.5 (set by /home/ec2-user/.pyenv/version)
- Pandas 2.1.1
$ python Python 3.11.5 (main, Sep 13 2023, 01:12:39) [GCC 11.3.1 20221121 (Red Hat 11.3.1-4)] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pandas as pd >>> print(pd.__version__) 2.1.1