>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
...                    'B': [randint(1, 9) * 10 for x in range(10)],
...                    'C': [randint(1, 9) * 100 for x in range(10)]})
>>> df
   A   B    C
0  3  40  100
1  6  30  200
2  7  70  800
3  3  50  200
4  7  50  400
5  4  10  400
6  3  70  500
7  8  30  200
8  3  40  800
9  6  60  200

+------------+---------+--------+
|            |  A      |  B     |
+------------+---------+--------+
|     0      | 0.626386| 1.52325|----axis=1----->
+------------+---------+--------+
             |         |
             | axis=0  |
             ↓         ↓
It specifies the axis along which the means are computed. By default axis=0. This is consistent with numpy.mean when axis is specified explicitly (in numpy.mean, axis==None by default, which computes the mean over the flattened array), where axis=0 runs along the rows (namely, the index in pandas), and axis=1 runs along the columns. For added clarity, one may choose to specify axis='index' (instead of axis=0) or axis='columns' (instead of axis=1).
These answers do help explain this, but it still isn’t perfectly intuitive for a non-programmer (i.e. someone like me who is learning Python for the first time in context of data science coursework). I still find using the terms “along” or “for each” wrt to rows and columns to be confusing.
What makes more sense to me is to say it this way:
Axis 0 will act on all the ROWS in each COLUMN
Axis 1 will act on all the COLUMNS in each ROW
So a mean on axis 0 will be the mean of all the rows in each column, and a mean on axis 1 will be a mean of all the columns in each row.
Ultimately this is saying the same thing as @zhangxaochen and @Michael, but in a way that is easier for me to internalize.
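As a small sketch of that reading (a made-up frame, not the one from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# axis=0: act on all the rows in each column -> one mean per column
df.mean(axis=0)
# A     2.0
# B    20.0

# axis=1: act on all the columns in each row -> one mean per row
df.mean(axis=1)
# 0     5.5
# 1    11.0
# 2    16.5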
axis=0 means along “indexes”. It’s a row-wise operation.
Suppose we perform a concat() operation on dataframe1 & dataframe2:
we take the 1st row from dataframe1 and place it into the new DF, then we take another row from dataframe1 and put it into the new DF, and we repeat this process until we reach the bottom of dataframe1. Then we do the same for dataframe2.
Basically, stacking dataframe2 on top of dataframe1 or vice versa.
E.g. making a pile of books on a table or floor.
axis=1 means along “columns”. It’s a column-wise operation.
Suppose we perform a concat() operation on dataframe1 & dataframe2:
we take the 1st complete column (a.k.a. the 1st series) of dataframe1 and place it into the new DF, then we take the second column of dataframe1 and keep it adjacent to it (sideways), and we repeat this operation until all columns are finished. Then we repeat the same process on dataframe2.
Basically,
stacking dataframe2 sideways.
E.g. arranging books on a bookshelf.
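A small illustration of those two concat directions (two tiny made-up frames; the values are arbitrary):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# axis=0 (default): stack df2 below df1, like piling books
pd.concat([df1, df2], axis=0)
#    A  B
# 0  1  3
# 1  2  4
# 0  5  7
# 1  6  8

# axis=1: place df2 beside df1, like arranging books on a shelf
pd.concat([df1, df2], axis=1)
#    A  B  A  B
# 0  1  3  5  7
# 1  2  4  6  8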
More to it: arrays are a better representation of a nested n-dimensional structure than matrices, so the example below can help you visualize how axis plays an important role when you generalize to more than one dimension. Also, you can actually print/write/draw/visualize any n-dim array, but writing or visualizing the same in a matrix representation becomes impossible on paper beyond 3 dimensions.
axis refers to the dimension of the array, in the case of pd.DataFrames axis=0 is the dimension that points downwards and axis=1 the one that points to the right.
Example: Think of an ndarray with shape (3,5,7).
import numpy as np
a = np.ones((3,5,7))
a is a 3 dimensional ndarray, i.e. it has 3 axes (“axes” is plural of “axis”). The configuration of a will look like 3 slices of bread where each slice is of dimension 5-by-7. a[0,:,:] will refer to the 0-th slice, a[1,:,:] will refer to the 1-st slice etc.
a.sum(axis=0) will apply sum() along the 0-th axis of a. You will add all the slices and end up with one slice of shape (5,7).
a.sum(axis=0) is equivalent to
b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()
In a pd.DataFrame, axes work the same way as in numpy.arrays: axis=0 will apply sum() or any other reduction function for each column.
N.B. In @zhangxaochen’s answer, I find the phrases “along the rows” and “along the columns” slightly confusing. axis=0 should refer to “along each column”, and axis=1 “along each row”.
The easiest way for me to understand is to talk about whether you are calculating a statistic for each column (axis = 0) or each row (axis = 1). If you calculate a statistic, say a mean, with axis = 0 you will get that statistic for each column. So if each observation is a row and each variable is in a column, you would get the mean of each variable. If you set axis = 1 then you will calculate your statistic for each row. In our example, you would get the mean for each observation across all of your variables (perhaps you want the average of related measures).
axis = 0: by column = column-wise = along the rows
Let’s look at the table from Wiki. This is an IMF estimate of GDP from 2010 to 2019 for top ten countries.
1. Axis 1 will act for each row on all the columns. If you want to calculate the average (mean) GDP for EACH country over the decade (2010-2019), you need df.mean(axis=1). For example, to calculate the mean GDP of the United States from 2010 to 2019: df.loc['United States', '2010':'2019'].mean()
2. Axis 0 will act for each column on all the rows. If you want to calculate the average (mean) GDP of EACH year across all countries, you need df.mean(axis=0). For example, to calculate the mean GDP of the year 2015 for the United States, China, Japan, Germany and India: df.loc['United States':'India', '2015'].mean(axis=0)
Note: The above code will work only after setting “Country(or dependent territory)” column as the Index, using set_index method.
The designer of pandas, Wes McKinney, used to work intensively on finance data. Think of columns as stock names and index as daily prices. You can then guess what the default behavior is (i.e., axis=0) with respect to this finance data. axis=1 can be simply thought as ‘the other direction’.
For example, the statistics functions, such as mean(), sum(), describe(), count() all default to column-wise because it makes more sense to do them for each stock. sort_index(by=) also defaults to column. fillna(method='ffill') will fill along column because it is the same stock. dropna() defaults to row because you probably just want to discard the price on that day instead of throw away all prices of that stock.
Similarly, the square brackets indexing refers to the columns since it’s more common to pick a stock instead of picking a day.
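To make those defaults concrete, here is a small sketch with a made-up prices frame (stock names as columns, dates as the index); note it uses the modern .ffill() method rather than fillna(method='ffill'):
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {'AAPL': [100.0, np.nan, 102.0], 'MSFT': [50.0, 51.0, np.nan]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']))

prices.mean()      # one mean per stock: the column-wise default
prices.ffill()     # fill each stock's gap with its own previous price
prices.dropna()    # drop the days (rows) that have any missing price
prices['AAPL']     # square brackets pick a stock (a column)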
The problem with using axis= properly is that it is used for 2 main, different cases:
For computing an accumulated value, or rearranging (e. g. sorting) data.
For manipulating (“playing” with) entities (e. g. dataframes).
The main idea behind this answer is that for avoiding the confusion, we select either a number, or a name for specifying the particular axis, whichever is more clear, intuitive, and descriptive.
Pandas is based on NumPy, which is based on mathematics, particularly on n-dimensional matrices. Here is an image for common use of axes’ names in math in the 3-dimensional space:
This picture is for memorizing the axes’ ordinal numbers only:
0 for x-axis,
1 for y-axis, and
2 for z-axis.
The z-axis is only for panels; for dataframes we will restrict our interest to the green-colored, 2-dimensional basic plane with x-axis (0, vertical), and y-axis (1, horizontal).
It’s all for numbers as potential values of axis= parameter.
The names of axes are 'index' (you may use the alias 'rows') and 'columns', and for this explanation it is NOT important the relation between these names and ordinal numbers (of axes), as everybody knows what the words “rows” and “columns” mean (and everybody here — I suppose — knows what the word “index” in pandas means).
And now, my recommendation:
If you want to compute an accumulated value, you may compute it from values located along axis 0 (or along axis 1) — use axis=0 (or axis=1).
Similarly, if you want to rearrange values, use the axis number of the axis, along which are located data for rearranging (e.g. for sorting).
If you want to manipulate (e.g. concatenate) entities (e.g. dataframes) — use axis='index' (synonym: axis='rows') or axis='columns' to specify the resulting change — index (rows) or columns, respectively.
(For concatenating, you will obtain either a longer index (= more rows), or more columns, respectively.)
This is based on @Safak’s answer.
The best way to understand the axes in pandas/numpy is to create a 3d array and check the result of the sum function along the 3 different axes.
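A minimal sketch of that exercise (the shapes follow from summing away one axis at a time):
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4)

arr.sum(axis=0).shape   # (3, 4): the two outermost "slices" are added together
arr.sum(axis=1).shape   # (2, 4): rows collapsed within each slice
arr.sum(axis=2).shape   # (2, 3): columns collapsed within each slice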
Say your operation requires traversing from left to right/right to left in a dataframe; you are apparently merging columns, i.e. you are operating on various columns.
This is axis=1
Example
df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
df.mean(axis=1)
0 1.5
1 5.5
2 9.5
dtype: float64
df.drop(['A','B'],axis=1,inplace=True)
C D
0 2 3
1 6 7
2 10 11
The point to note here is that we are operating on columns.
Similarly, if your operation requires traversing from top to bottom/bottom to top in a dataframe, you are merging rows. This is axis=0.
My thinking: Axis = n, where n = 0, 1, etc., means that the matrix is collapsed (folded) along that axis. So in a 2D matrix, when you collapse along 0 (rows), you are really operating on one column at a time. Similarly for higher order matrices.
This is not the same as the normal reference to a dimension in a matrix, where 0 -> row and 1 -> column. Similarly for other dimensions in an N dimension array.
For pandas objects, axis = 0 stands for a row-wise operation and axis = 1 stands for a column-wise operation. This is different from numpy by definition; we can check the definitions in the numpy and pandas docs.
I will explicitly avoid using ‘row-wise’ or ‘along the columns’, since people may interpret them in exactly the wrong way.
Analogy first. Intuitively, you would expect that pandas.DataFrame.drop(axis='columns') drops a column from N columns and gives you (N – 1) columns. So you can pay NO attention to rows for now (and remove the word 'row' from your English dictionary). Vice versa, drop(axis='rows') works on rows.
In the same way, sum(axis='columns') works on multiple columns and gives you 1 column. Similarly, sum(axis='rows') results in 1 row. This is consistent with the simplest form of the definition, reducing a list of numbers to a single number.
In general, with axis='columns', you see columns, work on columns, and get columns. Forget rows.
With axis='rows', change perspective and work on rows.
0 and 1 are just aliases for 'rows' (or 'index') and 'columns'. It's the convention of matrix indexing.
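A quick sketch of that reading, using the spellings pandas actually accepts ('index' and 'columns', with 'rows' as an alias for 'index'):
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df.drop('B', axis='columns')   # drops the column 'B': one column left
df.drop(0, axis='index')       # drops the row labelled 0: one row left

df.sum(axis='columns')         # works across columns: one value per row
df.sum(axis='index')           # works down the rows: one value per column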
I have been trying to figure out the axis for the last hour as well. The language in all the above answers, and also the documentation is not at all helpful.
To answer the question as I understand it now: in Pandas, axis = 1 or 0 specifies which axis's headers you want to keep constant when applying the function.
Note: When I say headers, I mean index names
Expanding your example:
+------------+---------+--------+
| | A | B |
+------------+---------+---------
| X | 0.626386| 1.52325|
+------------+---------+--------+
| Y | 0.626386| 1.52325|
+------------+---------+--------+
For axis=1=columns: we keep the column headers constant and apply the mean function by changing the data.
To demonstrate, we keep the column headers constant as:
+------------+---------+--------+
| | A | B |
Now we populate one set of A and B values and then find the mean
| | 0.626386| 1.52325|
Then we populate next set of A and B values and find the mean
| | 0.626386| 1.52325|
Similarly, for axis=rows, we keep row headers constant, and keep changing the data:
To demonstrate, first fix the row headers:
+------------+
| X |
+------------+
| Y |
+------------+
Now populate first set of X and Y values and then find the mean
+------------+---------+
| X | 0.626386
+------------+---------+
| Y | 0.626386
+------------+---------+
Then populate the next set of X and Y values and then find the mean:
+------------+---------+
| X | 1.52325 |
+------------+---------+
| Y | 1.52325 |
+------------+---------+
In summary,
When axis=columns, you fix the column headers and change data, which will come from the different rows.
When axis=rows, you fix the row headers and change data, which will come from the different columns.
Their behaviours are, intriguingly, easier to understand with three-dimensional arrays than with two-dimensional ones.
In the Python packages numpy and pandas, the axis parameter in sum tells the library to aggregate all values that can be fetched in the form array[0, 0, …, i, …, 0], where i iterates through all possible values. The process is repeated with the position of i fixed while the indices of the other dimensions vary one after the other (starting from the right-most element). The result is an (n-1)-dimensional array.
In R, the MARGIN parameter lets the apply function aggregate all values that can be fetched in the form array[, … , i, … ,], where i iterates through all possible values. The process is not repeated once all values of i have been iterated over; therefore, the result is a simple vector.
Arrays are designed with the so-called axis=0 (rows positioned vertically) versus axis=1 (columns positioned horizontally). Axis refers to a dimension of the array.
I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.
Currently, I do the following:
data = pandas.read_csv('mydata.csv')
which gives something like:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
I’d like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.
It is not possible to write something like
observations = data[:'c']
features = data['c':]
I’m not sure what the best method is. Do I need a pd.Panel?
By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other hand, data['a':] is not permitted but data[0:] is.
Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]
.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.
Let’s assume we have a DataFrame with the following columns: foo, bar, quz, ant, cat, sat, dat.
# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat
.loc accepts the same slice notation that Python lists do for both row and columns. Slice notation being start:stop:step
# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat
# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar
# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat
# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned
# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar
# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat
# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat
You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z
# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
# foo ant
# w
# x
# y
Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.
The DataFrame.ix index is what you want to be accessing. It’s a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:
>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
b c d e
0 0.418762 0.042369 0.869203 0.972314
1 0.991058 0.510228 0.594784 0.534366
2 0.407472 0.259811 0.396664 0.894202
3 0.726168 0.139531 0.324932 0.906575
   a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476
As in your example, if you would like to extract columns a and d only (i.e. the 1st and the 4th columns), the iloc method of the pandas dataframe is what you need and can be used very effectively. All you need to know are the indices of the columns you would like to extract. For example:
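(The original example did not survive the formatting here; a minimal sketch of what it presumably looked like, selecting column positions 0 and 3 for a and d, is below.)
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))

# positional selection of the 1st and 4th columns (a and d)
subset = data.iloc[:, [0, 3]]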
Here’s how you could use different methods to do selective column slicing, including selective label based, index based and the selective ranges based column slicing.
In [37]: import pandas as pd
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))
In [44]: df
Out[44]:
a b c d e f g
0 0.409038 0.745497 0.890767 0.945890 0.014655 0.458070 0.786633
1 0.570642 0.181552 0.794599 0.036340 0.907011 0.655237 0.735268
2 0.568440 0.501638 0.186635 0.441445 0.703312 0.187447 0.604305
3 0.679125 0.642817 0.697628 0.391686 0.698381 0.936899 0.101806
In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing
Out[45]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing
Out[46]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [47]: df.iloc[:, 0:3] ## index based column ranges slicing
Out[47]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
### with 2 different column ranges, index based slicing:
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do: data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do: data[data.columns[:2]] and data[data.columns[2:]]
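If you need several positional ranges at once, one common idiom (my own addition, not from the answer above) is numpy's np.r_ to build the combined list of positions:
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))

# columns 0-1 plus column 4, by position
subset = data.iloc[:, np.r_[0:2, 4]]   # columns a, b and e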
I want to know if it is possible to use the pandas to_csv() function to add a dataframe to an existing csv file. The csv file has the same structure as the loaded data.
with open('my_csv.csv', 'a') as f:
df.to_csv(f, header=False)
If this was your csv, foo.csv:
,A,B,C
0,1,2,3
1,4,5,6
If you read that and then append, for example, df + 6:
In [1]: df = pd.read_csv('foo.csv', index_col=0)
In [2]: df
Out[2]:
A B C
0 1 2 3
1 4 5 6
In [3]: df + 6
Out[3]:
A B C
0 7 8 9
1 10 11 12
In [4]: with open('foo.csv', 'a') as f:
(df + 6).to_csv(f, header=False)
foo.csv becomes:
,A,B,C
0,1,2,3
1,4,5,6
0,7,8,9
1,10,11,12
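A side note (my own addition): to_csv accepts a mode argument, so the same append can be written without the explicit open():
import pandas as pd

df = pd.read_csv('foo.csv', index_col=0)
(df + 6).to_csv('foo.csv', mode='a', header=False)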
with open(filename, 'a') as f:
df.to_csv(f, header=f.tell()==0)
Create file unless exists, otherwise append
Add header if file is being created, otherwise skip it
A little helper function I use with some header checking safeguards to handle it all:
def appendDFToCSV_void(df, csvFilePath, sep=","):
import os
if not os.path.isfile(csvFilePath):
df.to_csv(csvFilePath, mode='a', index=False, sep=sep)
elif len(df.columns) != len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns):
raise Exception("Columns do not match!! Dataframe has " + str(len(df.columns)) + " columns. CSV file has " + str(len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns)) + " columns.")
elif not (df.columns == pd.read_csv(csvFilePath, nrows=1, sep=sep).columns).all():
raise Exception("Columns and column order of dataframe and csv file do not match!!")
else:
df.to_csv(csvFilePath, mode='a', index=False, sep=sep, header=False)
Initially starting with pyspark dataframes – I got type conversion errors (when converting to pandas dataframes and then appending to csv) given the schema/column types in my pyspark dataframes.
Solved the problem by forcing all columns in each df to be of type string and then appending this to csv as follows:
with open('testAppend.csv', 'a') as f:
df2.toPandas().astype(str).to_csv(f, header=False)
A bit late to the party but you can also use a context manager, if you’re opening and closing your file multiple times, or logging data, statistics, etc.
from contextlib import contextmanager
import pandas as pd
@contextmanager
def open_file(path, mode):
    file_to = open(path, mode)
    try:
        yield file_to
    finally:
        file_to.close()

## later
saved_df = pd.DataFrame(data)
with open_file('yourcsv.csv', 'r') as infile:
    saved_df.to_csv('yourcsv.csv', mode='a', header=False)
/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130:
DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype
option on import or set low_memory=False.
Why is the dtype option related to low_memory, and why would making it False help with this problem?
The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].
The reason you get this low_memory warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.
Dtype Guessing (very bad)
Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.
Consider the example of one file which has a column called user_id.
It contains 10 million rows where the user_id is always numbers.
Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.
Specifying dtypes (should always be done)
Adding
dtype={'user_id': int}
to the pd.read_csv() call will let pandas know, when it starts reading the file, that this column contains only integers.
Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified.
Example of broken data that breaks when dtypes are defined
import pandas as pd
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})
ValueError: invalid literal for long() with base 10: 'foobar'
We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.
Pandas extends this set of dtypes with its own:
'datetime64[ns, <tz>]', which is a time zone aware timestamp.
'category', which is essentially an enum (strings represented by integer keys, to save space).
'period[<freq>]', not to be confused with a timedelta; these objects are actually anchored to specific time periods.
'Sparse', 'Sparse[int]', 'Sparse[float]' are for sparse data, or 'data that has a lot of holes in it'. Instead of saving the NaN or None in the dataframe, it omits the objects, saving space.
'Interval' is a topic of its own, but its main use is for indexing. See more here.
'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' are all pandas-specific integers that are nullable, unlike the numpy variants.
'string' is a specific dtype for working with string data and gives access to the .str attribute on the series.
'boolean' is like the numpy 'bool' but it also supports missing data.
Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.
Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
Usage of converters
@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.
CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.
As for low_memory, it's True by default and isn't yet documented. I don't think it's relevant though. The error message is generic, so you shouldn't need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.
As mentioned earlier by firelynx, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type so that the data could still be loaded.
import numpy as np
import pandas as pd

def conv(val):
    if not val:
        return 0
    try:
        return np.float64(val)
    except:
        # fall back to 0 for anything that cannot be parsed as a float
        return np.float64(0)

df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})
I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn't bigger than your system memory, reboot, and clear the RAM before proceeding. If you're still running into errors, it's worth making sure your .csv file is OK; take a quick look in Excel and make sure there's no obvious corruption. Broken original data can wreak havoc…
I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues:
1. the file contained strange characters (fixed using encoding)
2. the datatype was not specified (fixed using dtype property)
3. Using the above I still faced an issue which was related to the file_format that could not be determined based on the filename (fixed using try .. except..)
df = pd.read_csv(csv_file,sep=';', encoding = 'ISO-8859-1',
names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})
from pathlib import Path

try:
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except:
    df['file_format'] = ''
Right now I’m importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don’t have to spend all that time waiting for the script to run?
df.to_pickle(file_name) # where to save it, usually as a .pkl
Then you can load it back using:
df = pd.read_pickle(file_name)
Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:
from pandas import HDFStore

store = HDFStore('store.h5')
store['df'] = df  # save it
store['df']       # load it
More advanced strategies are discussed in the cookbook.
Since 0.13 there’s also msgpack which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
Although there are already some answers I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.
They compare:
pickle: original ASCII data format
cPickle, a C library
pickle-p2: uses the newer binary format
json: standardlib json library
json-no-index: like json, but without index
msgpack: binary JSON alternative
CSV
hdfstore: HDF5 storage format
In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:
You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself
The source code for the test which they refer to is available online. Since this code did not work directly I made some minor changes, which you can get here: serialize.py
I got the following results:
They also mention that with the conversion of text data to categorical data the serialization is much faster. In their test about 10 times as fast (also see the test code).
Edit: The higher times for pickle than CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.
If I understand correctly, you’re already using pandas.read_csv() but would like to speed up the development process so that you don’t have to load the file in every time you edit your script, is that right? I have a few recommendations:
you could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to only load the top bit of the table, while you’re doing the development
use ipython for an interactive session, such that you keep the pandas table in memory as you edit and reload your script.
Updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format that is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data); see the sketch after this list.
You might also be interested in this answer on stackoverflow.
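A minimal sketch of the feather round trip mentioned in the list above (the file name is arbitrary, and feather support requires pyarrow to be installed):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 4), columns=list('abcd'))

df.to_feather('cached_frame.feather')         # write once after the slow CSV load
df = pd.read_feather('cached_frame.feather')  # fast reload on later runs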
Pickle works well!
import pandas as pd
df.to_pickle('123.pkl') #to save the dataframe, df to 123.pkl
df1 = pd.read_pickle('123.pkl') #to load 123.pkl back to the dataframe df
As already mentioned there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:
Warning The pickle module is not secure against erroneous or
maliciously constructed data. Never unpickle data received from an
untrusted or unauthenticated source.
Numpy file formats are pretty fast for numerical data
I prefer to use numpy files since they’re fast and easy to work with.
Here’s a simple benchmark for saving and loading a dataframe with 1 column of 1 million points.
import numpy as np
import pandas as pd
num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)
using ipython’s %%timeit magic function
%%timeit
with open('num.npy', 'wb') as np_file:
np.save(np_file, num_df)
the output is
100 loops, best of 3: 5.97 ms per loop
to load the data back into a dataframe
%%timeit
with open('num.npy', 'rb') as np_file:
data = np.load(np_file)
data_df = pd.DataFrame(data)
the output is
100 loops, best of 3: 5.12 ms per loop
NOT BAD!
CONS
There’s a problem if you save the numpy file using python 2 and then try opening using python 3 (or vice versa).
Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.
Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
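The list above is quoted from the pickle documentation; on the pandas side (my own illustration), a protocol can be chosen explicitly when saving, e.g. protocol 2 if an old Python 2 process still needs to read the file:
import pandas as pd

df = pd.DataFrame({'voltage': [0.1, 0.2, 0.3]})

# pickle protocol 2 can be read by Python 2.3+ as well as Python 3
df.to_pickle('voltage.pkl', protocol=2)
df_back = pd.read_pickle('voltage.pkl')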
The overall move has been to pyarrow/feather (deprecation warnings from pandas/msgpack). However I have a challenge with pyarrow being transient in specification: data serialized with pyarrow 0.15.1 cannot be deserialized with 0.16.0 (ARROW-7961). I’m using serialization to use redis, so I have to use a binary encoding.
I’ve retested various options (using jupyter notebook)
import sys, pickle, zlib, warnings, io
import pyarrow as pa
class foocls:
def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
def msgpack(out): return out.to_msgpack()
def pickle(out): return pickle.dumps(out)
def feather(out): return out.to_feather(io.BytesIO())
def parquet(out): return out.to_parquet(io.BytesIO())
warnings.filterwarnings("ignore")
for c in foocls.__dict__.values():
sbreak = True
try:
c(out)
print(c.__name__, "before serialization", sys.getsizeof(out))
print(c.__name__, sys.getsizeof(c(out)))
%timeit -n 50 c(out)
print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
%timeit -n 50 zlib.compress(c(out))
except TypeError as e:
if "not callable" in str(e): sbreak = False
else: raise
except (ValueError) as e: print(c.__name__, "ERROR", e)
finally:
if sbreak: print("=+=" * 30)
warnings.filterwarnings("default")
With following results for my data frame (in out jupyter variable)
pyarrow before serialization 533366
pyarrow 120805
1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pyarrow zlib 20517
2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
msgpack before serialization 533366
msgpack 109039
1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
msgpack zlib 16639
3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
pickle before serialization 533366
pickle 142121
733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pickle zlib 29477
3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather and parquet do not work for my data frame. I’m going to continue using pyarrow. However I will supplement with pickle (no compression). When writing to cache store pyarrow and pickle serialised forms. When reading from cache fallback to pickle if pyarrow deserialisation fails.
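A rough sketch of that write-both / read-with-fallback strategy (my own illustration: a plain dict stands in for the redis store, and pa.serialize/pa.deserialize are the pyarrow calls used above, which have since been deprecated in newer pyarrow):
import pickle
import pyarrow as pa

cache = {}  # stand-in for the redis store: key -> payload bytes

def cache_write(key, df):
    # primary form: pyarrow serialisation
    try:
        cache[key] = pa.serialize(df).to_buffer().to_pybytes()
    except Exception:
        pass
    # always keep an uncompressed pickle copy as the fallback
    cache[key + ':pickle'] = pickle.dumps(df)

def cache_read(key):
    try:
        return pa.deserialize(cache[key])
    except Exception:
        # fall back to pickle if pyarrow deserialisation fails (e.g. version mismatch)
        return pickle.loads(cache[key + ':pickle'])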
I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (equal number of records/rows) which sets a colour 'green' if Set == 'Z' and 'red' if Set equals anything else.
List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.
Example list comprehension:
df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit tests:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop
Another way in which this could be achieved is:
df['color'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')
The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.
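The code for this answer did not survive the formatting here; a minimal sketch of the idea (a row-wise apply whose conditions look at more than one column and return more than two values; the condition logic is hypothetical) could be:
import pandas as pd

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

def set_color(row):
    # conditions may combine several columns and return more than two values
    if row['Set'] == 'Z' and row['Type'] == 'A':
        return 'yellow'
    elif row['Set'] == 'Z':
        return 'green'
    elif row['Type'] == 'C':
        return 'purple'
    return 'red'

df['color'] = df.apply(set_color, axis=1)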
Maybe this has been possible with newer updates of Pandas (tested with pandas=1.0.5), but I think the following is the shortest and maybe best answer for the question, so far. You can use the .loc method and use one condition or several depending on your need.
Code Summary:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"
#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
Explanation:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
# df so far:
Type Set
0 A Z
1 B Z
2 B X
3 C Y
add a ‘color’ column and set all values to “red”
df['Color'] = "red"
Apply your single condition:
df.loc[(df['Set']=="Z"), 'Color'] = "green"
# df:
Type Set Color
0 A Z green
1 B Z green
2 B X red
3 C Y red
df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')
After that, df data frame looks like this:
>>> print(df)
Type Set color
0 A Z green
1 B Z green
2 B X red
3 C Y red
If you’re working with massive data, a memoized approach would be best:
# First create a dictionary of manually stored values
color_dict = {'Z':'red'}
# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}
# Next, merge the two
color_dict.update(color_dict_other)
# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)
This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4
E.g. memoize in a case of 10,000 rows with 2,500 or fewer distinct values.
The keys are Unicode dates and the values are integers. I would like to convert this into a pandas dataframe by having the dates and their corresponding values as two separate columns. Example: col1: Dates col2: DateValue (the dates are still Unicode and datevalues are still integers)
Any help in this direction would be much appreciated. I am unable to find resources on the pandas docs to help me with this.
I know one solution might be to convert each key-value pair in this dict, into a dict so the entire structure becomes a dict of dicts, and then we can add each row individually to the dataframe. But I want to know if there is an easier way and a more direct way to do this.
So far I have tried converting the dict into a series object but this doesn’t seem to maintain the relationship between the columns:
pd.DataFrame(d)
ValueError: If using all scalar values, you must pass an index
The error here arises because the DataFrame constructor is being called with scalar values (where it expects the values to be a list/dict/… i.e. to have multiple columns):
pd.DataFrame(d)
ValueError: If using all scalar values, you must pass an index
You could take the items from the dictionary (i.e. the key-value pairs):
In [11]: pd.DataFrame(d.items()) # or list(d.items()) in python 3
Out[11]:
0 1
0 2012-07-02 392
1 2012-07-06 392
2 2012-06-29 391
3 2012-06-28 391
...
In [12]: pd.DataFrame(d.items(), columns=['Date', 'DateValue'])
Out[12]:
Date DateValue
0 2012-07-02 392
1 2012-07-06 392
2 2012-06-29 391
But I think it makes more sense to pass the Series constructor:
In [21]: s = pd.Series(d, name='DateValue')
Out[21]:
2012-06-08 388
2012-06-09 388
2012-06-10 388
In [22]: s.index.name = 'Date'
In [23]: s.reset_index()
Out[23]:
Date DateValue
0 2012-06-08 388
1 2012-06-09 388
2 2012-06-10 388
When converting a dictionary into a pandas dataframe where you want the keys to be the columns of said dataframe and the values to be the row values, you can simply put brackets around the dictionary like this:
>>> dict_ = {'key 1': 'value 1', 'key 2': 'value 2', 'key 3': 'value 3'}
>>> pd.DataFrame([dict_])
key 1 key 2 key 3
0 value 1 value 2 value 3
It’s saved me some headaches so I hope it helps someone out there!
EDIT: In the pandas docs one option for the data parameter in the DataFrame constructor is a list of dictionaries. Here we’re passing a list with one dictionary in it.
d = {'Date': list(yourDict.keys()), 'Date_Values': list(yourDict.values())}
df = pandas.DataFrame(data=d)
If you don’t encapsulate yourDict.keys() inside of list(), then you will end up with all of your keys and values being placed in every row of every column.
I use the len function. It’s much faster than empty. len(df.index) is even faster.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def empty(df):
return df.empty
def lenz(df):
return len(df) == 0
def lenzi(df):
return len(df.index) == 0
'''
%timeit empty(df)
%timeit lenz(df)
%timeit lenzi(df)
10000 loops, best of 3: 13.9 µs per loop
100000 loops, best of 3: 2.34 µs per loop
1000000 loops, best of 3: 695 ns per loop
len on index seems to be faster
'''
I prefer going the long way round. These are the checks I perform to avoid using a try-except clause:
check that the variable is not None,
then check that it is a dataframe, and
make sure it is not empty.
Here DATA is the variable in question:
DATA is not None and isinstance(DATA, pd.DataFrame) and not DATA.empty
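Wrapped up as a small helper (my own framing of the same check; the function name is arbitrary):
import pandas as pd

def has_data(obj):
    # True only for a real, non-empty DataFrame
    return obj is not None and isinstance(obj, pd.DataFrame) and not obj.empty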
Consider the distinction between an empty dataframe with 0 rows and 0 columns and a dataframe whose rows contain only NaN and which therefore has at least 1 column. Arguably, they are not the same. The other answers are imprecise in that df.empty, len(df), or len(df.index) make no distinction: in both cases the index length is 0 and empty is True.
Examples
Example 1: An empty dataframe with 0 rows and 0 columns
In [1]: import pandas as pd
df1 = pd.DataFrame()
df1
Out[1]: Empty DataFrame
Columns: []
Index: []
In [2]: len(df1.index) # or len(df1)
Out[2]: 0
In [3]: df1.empty
Out[3]: True
Example 2: A dataframe which is emptied to 0 rows but still retains n columns
In [4]: df2 = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df2
Out[4]: AA BB
0 1 11
1 2 22
2 3 33
In [5]: df2 = df2[df2['AA'] == 5]
df2
Out[5]: Empty DataFrame
Columns: [AA, BB]
Index: []
In [6]: len(df2.index) # or len(df2)
Out[6]: 0
In [7]: df2.empty
Out[7]: True
Now, building on the previous examples, in which the index length is 0 and empty is True: when reading the length of the columns index for the first loaded dataframe df1, it returns 0 columns, which proves that it is indeed empty.
In [8]: len(df1.columns)
Out[8]: 0
In [9]: len(df2.columns)
Out[9]: 2
Critically, while the second dataframe df2 contains no data, it is not completely empty, because it still reports the number of columns that persist.
Why it matters
Let’s add a new column to these dataframes to understand the implications:
# As expected, the empty column displays 1 series
In [10]: df1['CC'] = [111, 222, 333]
df1
Out[10]: CC
0 111
1 222
2 333
In [11]: len(df1.columns)
Out[11]: 1
# Note the persisting series with rows containing `NaN` values in df2
In [12]: df2['CC'] = [111, 222, 333]
df2
Out[12]: AA BB CC
0 NaN NaN 111
1 NaN NaN 222
2 NaN NaN 333
In [13]: len(df2.columns)
Out[13]: 3
It is evident that the original columns in df2 have re-surfaced. Therefore, it is prudent to instead read the length of the columns index with len(pandas.core.frame.DataFrame.columns) to see if a dataframe is empty.
Practical solution
# New dataframe df
In [1]: df = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df
Out[1]: AA BB
0 1 11
1 2 22
2 3 33
# This data manipulation approach results in an empty df
# because of a subset of values that are not available (`NaN`)
In [2]: df = df[df['AA'] == 5]
df
Out[2]: Empty DataFrame
Columns: [AA, BB]
Index: []
# NOTE: the df is empty, BUT the columns are persistent
In [3]: len(df.columns)
Out[3]: 2
# And accordingly, the other answers on this page
In [4]: len(df.index) # or len(df)
Out[4]: 0
In [5]: df.empty
Out[5]: True
# SOLUTION: conditionally check for empty columns
In [6]: if len(df.columns) != 0: # <--- here
# Do something, e.g.
# drop any columns containing rows with `NaN`
# to make the df really empty
df = df.dropna(how='all', axis=1)
df
Out[6]: Empty DataFrame
Columns: []
Index: []
# Testing shows it is indeed empty now
In [7]: len(df.columns)
Out[7]: 0
Adding a new data series works as expected without the re-surfacing of empty columns (factually, without any series that were containing rows with only NaN):
In [8]: df['CC'] = [111, 222, 333]
df
Out[8]: CC
0 111
1 222
2 333
In [9]: len(df.columns)
Out[9]: 1
1) If a DataFrame has NaN and non-null values and you want to find out whether the DataFrame
is empty or not, then try this code.
2) When can this situation happen?
This situation happens when a single function is used to plot more than one DataFrame,
with the DataFrames passed as parameters. In such a situation the function tries to plot the data even
when a DataFrame is empty, and thus plots an empty figure!
It would make more sense to simply display a 'DataFrame has no data' message.
3) Why?
If a DataFrame is empty (i.e. contains no data at all; mind you, a DataFrame with only NaN values
is considered non-empty), then it is desirable not to plot but to put out a message instead:
Suppose we have two DataFrames, df1 and df2.
The function myfunc takes any DataFrame (df1 and df2 in this case) and prints a message
if the DataFrame is empty (instead of plotting):
     df1                    df2
col1  col2              col1  col2
NaN   2                 NaN   NaN
2     NaN               NaN   NaN
and the function:
def myfunc(df):
    if (df.count().sum()) > 0:  # count the total number of non-NaN values; equal to 0 if the DataFrame is empty
        print('not empty')
        df.plot(kind='barh')
    else:
        # display a message instead of plotting if it is empty
        print('empty')
The error message says that if you’re passing scalar values, you have to pass an index. So you can either not use scalar values for the columns — e.g. use a list:
>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
A B
0 2 3
or use scalar values and pass an index:
>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
A B
0 2 3
From the documentation on the ‘orient’ argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
The error message "ValueError: If using all scalar values, you must pass an index" says that you must pass an index.
This does not necessarily mean that passing an index makes pandas do what you want it to do.
When you pass an index, pandas will treat your dictionary keys as column names and the values as what the column should contain for each of the values in the index.
a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])
A B
1 2 3
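Passing a bigger index:
df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])

   A  B
1  2  3
2  2  3
3  2  3
4  2  3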
An index is usually automatically generated by a dataframe when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can however be more explicit about it
df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2
A B
0 2 3
1 2 3
2 2 3
3 2 3
The default index is 0 based though.
I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It’s easier to read for other developers. Pandas has a lot of caveats; don’t make other developers have to be experts in all of them in order to read your code.
The input does not have to be a list of records; it can also be a single dictionary:
pd.DataFrame.from_records({'a': 1, 'b': 2}, index=[0])

   a  b
0  1  2