Spaghetti.ink

Loading Data in Chunks with Pandas

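When a dataset is too large, loading it all at once can exhaust available memory and trigger warnings or errors. We therefore usually process such large datasets in chunks.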

# List the methods available on the returned iterator object
dir(chunks_reader)
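
As a minimal, self-contained sketch of what that inspection turns up, the snippet below builds a tiny in-memory table and lists the public names on the object that `pd.read_table(..., iterator=True)` returns. The `sample` data and the names `reader`, `public_methods`, and `first_two` are made up for illustration and are not part of the original post.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real log file (one value per line)
sample = io.StringIO("a\nb\nc\nd\ne\n")

# iterator=True returns a reader object instead of a fully loaded DataFrame
reader = pd.read_table(sample, header=None, iterator=True)

# Keep only public names so the listing is easier to scan
public_methods = [name for name in dir(reader) if not name.startswith("_")]
print(public_methods)

# get_chunk(n) returns the next n rows as a DataFrame
first_two = reader.get_chunk(2)
print(first_two.shape)

# Release the underlying file handle when done
reader.close()
```

Among the names listed, `get_chunk` is the one the example below relies on.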

E.g. read 1,000,000 log records and extract specific fields

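import pandas as pd

# logPath (path to the log file) and getSpecificField (the per-line parser)
# are assumed to be defined elsewhere
chunks_reader = pd.read_table(logPath, header=None, iterator=True)

processed_chunks = []  # collect processed chunks and concatenate once at the end
count = 0
chunksize = 5000

while True:
    try:
        if count == 200:  # stop after 200 chunks (200 * 5000 = 1,000,000 rows)
            break
        chunk = chunks_reader.get_chunk(chunksize)
        # Parse each raw line (column 0) and expand the result into named columns
        log_chunk = chunk[0].map(getSpecificField).apply(pd.Series, index=['IP', 'Time', 'Method', 'Url', 'Status', 'Size', 'Reference', 'Proxy'])
        processed_chunks.append(log_chunk)
        count += 1
        if count % 10 == 0:
            print(count)  # progress report every 10 chunks
    except StopIteration:
        # The file ran out of rows before 200 chunks were read
        break

# A single concat avoids re-copying the accumulated data on every iteration
logs = pd.concat(processed_chunks) if processed_chunks else pd.DataFrame()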

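In the code above, we use chunks of 5,000 records as an example and read 200 chunks, for 1,000,000 records in total. After reading each chunk, we can process it first and then concatenate the processed chunks.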
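
The same read, process, concatenate pattern can also be sketched with the `chunksize` parameter, which makes `pd.read_table` return a reader that can be used directly in a `for` loop. This is an alternative illustration rather than the code from this post: the in-memory data `raw`, the parser `parse_line`, and the three column names are hypothetical stand-ins for a real log file and for `getSpecificField`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a log file: 25 lines of "ip method status"
raw = "\n".join(
    f"10.0.0.{i} GET {200 if i % 2 == 0 else 404}" for i in range(25)
)

def parse_line(line):
    """Hypothetical parser: split one raw line into its three fields."""
    return line.split(" ")

processed = []

# chunksize=10 yields DataFrames of up to 10 rows each
reader = pd.read_table(io.StringIO(raw), header=None, chunksize=10)
for chunk in reader:
    # Process each chunk on its own: parse column 0 into named columns
    fields = chunk[0].map(parse_line)
    parsed = pd.DataFrame(
        fields.tolist(),
        index=chunk.index,
        columns=["IP", "Method", "Status"],
    )
    processed.append(parsed)

# Concatenate the processed chunks once, after the loop
logs = pd.concat(processed)
print(logs.shape)
```

Collecting the processed chunks in a list and calling `pd.concat` once keeps memory copies down compared with concatenating inside the loop, and the `for` loop ends on its own when the file is exhausted, so no `StopIteration` handling is needed.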

This article is licensed by Frank under Attribution 4.0 International (CC BY 4.0).

Data Analysis, March 25, 2021
