The error shows that the machine does not have enough memory to read the entire
CSV into a DataFrame at once. Assuming you do not need the entire dataset in
memory all at once, one way to avoid the problem is to process the CSV in
chunks (by specifying the chunksize parameter):
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # placeholder: replace with your own per-chunk logic
The chunksize parameter specifies the number of rows per chunk.
(The last chunk may contain fewer than chunksize rows, of course.)
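As a concrete illustration, here is a minimal sketch of per-chunk aggregation, computing a mean without ever holding the full file in memory. The filename and the value column are hypothetical:
import pandas as pd

total = 0
row_count = 0
for chunk in pd.read_csv("big_file.csv", chunksize=10 ** 6):
    total += chunk["value"].sum()  # aggregate each chunk...
    row_count += len(chunk)        # ...and accumulate scalars only
print("mean of 'value':", total / row_count)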
Does your workflow require slicing, manipulating, and exporting?
If so, you can use dask.dataframe to slice, perform your calculations, and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API; a sketch follows.
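A minimal sketch of that workflow, assuming a CSV with a hypothetical amount column:
import dask.dataframe as dd

# dask reads the CSV lazily, in partitions, instead of all at once
ddf = dd.read_csv("big_file.csv")   # hypothetical filename
filtered = ddf[ddf["amount"] > 0]   # slice/manipulate lazily
# export writes one file per partition (out-0.csv, out-1.csv, ...)
filtered.to_csv("out-*.csv", index=False)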
Another great alternative is modin, because its functionality is nearly identical to pandas, yet it leverages distributed dataframe libraries such as dask.
The answers above already cover the topic well. Anyway, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had really good experience with it, but it is missing a lot of pandas features.
Edit: I got compression rates of around 1/10 of the original size, I think, depending on the kind of data of course. Important missing features were aggregates.
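For reference, a minimal sketch of the bcolz round trip. Note that bcolz is no longer actively maintained; ctable.fromdataframe and todataframe are its conversion helpers, and the sample read via nrows is just a hypothetical way to get a DataFrame to convert:
import bcolz
import pandas as pd

df = pd.read_csv("big_file.csv", nrows=10 ** 6)  # hypothetical sample
ct = bcolz.ctable.fromdataframe(df)              # compressed in-memory table
print(ct)                                        # repr shows nbytes vs cbytes (compression)
del df                                           # keep only the compressed copy
back = ct.todataframe()                          # convert back when needed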
You can read in the data as chunks and save each chunk as a pickle.
import pandas as pd
import pickle

in_path = ""          # path where the large file is
out_path = ""         # path to save the pickle files to
chunk_size = 400000   # size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i + 1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
In the next step, you read the pickles back in and combine them into your desired dataframe.
import glob
import pandas as pd

pickle_path = ""  # same path as out_path, i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

# DataFrame.append was removed in pandas 2.0; collect the pickled
# chunks in a list and concatenate them once instead.
frames = [pd.read_pickle(p) for p in data_p_files]
df = pd.concat(frames, ignore_index=True)
TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

# Note: concatenating every chunk rebuilds the full dataframe, so the
# final memory footprint is about the same as reading the file in one go;
# this mainly helps when you filter or reduce each chunk before appending.
df = pd.concat(dfList, sort=False)
Here is an example:
chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>,
                         iterator=True, low_memory=False):

    # Replace blank spaces in column names for SQL optimization
    chunk = chunk.rename(columns={c: c.replace(' ', '') for c in chunk.columns})

    # You can either:
    # 1) buffer the chunks in order to load your whole dataset later
    chunkTemp.append(chunk)

    # 2) do your processing over a chunk and store only the result of it
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]

    # buffering processed data
    queryTemp.append(query)

    #! NEVER do pd.concat or pd.DataFrame() inside a loop

print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

# concatenating processed data
query = pd.concat(queryTemp)
print(query)
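The warning about pd.concat inside a loop is worth spelling out: each concat copies all the data accumulated so far, so concatenating inside the loop is quadratic in the number of chunks, while appending to a list and concatenating once at the end is linear.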
I want to give a more comprehensive answer based on most of the potential solutions that have already been provided. I also want to point out one more potential aid that may help the reading process.
Option 1: dtypes
“dtypes” is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. By default, pandas tries to infer the dtypes of the data.
Referring to data structures: for every piece of data stored, a memory allocation takes place. At a basic level, refer to the values below (the table below illustrates values for the C programming language):
The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT = -32768
The maximum value of SHORT INT = 32767
The minimum value of INT = -2147483648
The maximum value of INT = 2147483647
The minimum value of CHAR = -128
The maximum value of CHAR = 127
The minimum value of LONG = -9223372036854775808
The maximum value of LONG = 9223372036854775807
Refer to this page to see the matching between NumPy and C types.
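You can also query the same bounds for the NumPy types directly; a small sketch:
import numpy as np

# Print the representable range of a few NumPy integer types.
for t in (np.int8, np.uint8, np.int16, np.int32, np.int64):
    info = np.iinfo(t)
    print(t.__name__, info.min, info.max)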
Let’s say you have an array of integers made up of single digits. You could, both theoretically and practically, assign it, say, a 16-bit integer type, but you would then allocate more memory than you actually need to store that array. To prevent this, you can set the dtype option on read_csv. You do not want to store the array items as long integers when you can actually fit them into an 8-bit integer (np.int8 or np.uint8).
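A minimal sketch of the idea; the file and column names are hypothetical:
import numpy as np
import pandas as pd

# Declaring dtypes up front skips inference and shrinks memory usage.
dtypes = {"user_id": np.int32, "rating": np.int8, "category": "category"}
df = pd.read_csv("big_file.csv", dtype=dtypes)
print(df.memory_usage(deep=True).sum(), "bytes")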
Option 2: Reading by chunks
Reading the data in chunks allows you to access a part of the data in memory, and you can apply preprocessing to your data and preserve the processed data rather than the raw data. It’d be much better if you combined this option with the first one, dtypes.
I want to point out the pandas cookbook sections for that process; you can find them here. Note those two sections there.
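A sketch of combining both options, with hypothetical file and column names again; only the reduced per-chunk results are kept:
import numpy as np
import pandas as pd

dtypes = {"user_id": np.int32, "rating": np.int8}
results = []
for chunk in pd.read_csv("big_file.csv", dtype=dtypes, chunksize=10 ** 6):
    # Keep only the rows you need from each chunk; raw chunks are discarded.
    results.append(chunk[chunk["rating"] >= 4])
processed = pd.concat(results, ignore_index=True)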
Option 3: Dask
Dask is a framework that is defined on Dask’s website as:
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
It was born to cover the necessary parts that pandas cannot reach. Dask is a powerful framework that gives you access to much more data by processing it in a distributed way.
You can use dask to preprocess your data as a whole; Dask takes care of the chunking part, so unlike with pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations until they are explicitly pushed by compute and/or persist (see the answer here for the difference).
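A sketch of that lazy-evaluation flow, with hypothetical column names; nothing is read or computed until .compute() is called:
import dask.dataframe as dd

ddf = dd.read_csv("big_file.csv")                          # lazy: builds a task graph only
mean_per_group = ddf.groupby("category")["amount"].mean()  # still lazy
result = mean_per_group.compute()                          # triggers the actual chunked work
print(result)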
Other Aids (Ideas)
An ETL flow designed for the data, keeping only what is needed from the raw data:
First, apply ETL to the whole data with frameworks like Dask or PySpark, and export the processed data.
Then, see if the processed data can fit in memory as a whole.
Consider increasing your RAM.
Consider working with that data on a cloud platform.
In addition to the answers above, for those who want to process a CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.
import glob
import d6tstack.combine_csv

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)
# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')  # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')  # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')  # slow but flexible
In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here’s a nice article comparing its functionality with pandas. It essentially exposes the same functions as pandas.
import modin.pandas as pd

df = pd.read_csv(CSV_FILE_NAME)  # drop-in replacement for pandas.read_csv
Before using the chunksize option, if you want to be sure about the process function that you want to write inside the chunking for-loop (as mentioned by @unutbu), you can simply use the nrows option:
small_df = pd.read_csv(filename, nrows=100)
Once you are sure that the process block is ready, you can put it in the chunking for-loop for the entire dataframe.
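Putting the two steps together, a minimal sketch; process here is a hypothetical placeholder for your own logic, and filename is the same file as above:
import pandas as pd

def process(df):
    # hypothetical placeholder for your own per-chunk logic
    return df[df["value"] > 0]

small_df = pd.read_csv(filename, nrows=100)  # 1) prototype on a small sample
process(small_df)                            # 2) verify the logic works

for chunk in pd.read_csv(filename, chunksize=10 ** 6):
    result = process(chunk)                  # 3) then run it chunk by chunk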