+------------+----------+---------+
|            |    A     |    B    |
+------------+----------+---------+
|     0      | 0.626386 | 1.52325 |   ----axis=1----->
+------------+----------+---------+
                   |          |
                   |  axis=0  |
                   ↓          ↓
It specifies the axis along which the means are computed; by default, axis=0. This is consistent with numpy.mean when axis is specified explicitly (in numpy.mean, axis=None by default, which computes the mean over the flattened array): axis=0 runs along the rows (namely, the index in pandas) and axis=1 runs along the columns. For added clarity, one may choose to specify axis='index' (instead of axis=0) or axis='columns' (instead of axis=1).
These answers do help explain this, but it still isn't perfectly intuitive for a non-programmer (i.e. someone like me who is learning Python for the first time in the context of data science coursework). I still find using the terms "along" or "for each" with respect to rows and columns to be confusing.
What makes more sense to me is to say it this way:
Axis 0 will act on all the ROWS in each COLUMN
Axis 1 will act on all the COLUMNS in each ROW
So a mean on axis 0 will be the mean of all the rows in each column, and a mean on axis 1 will be a mean of all the columns in each row.
Ultimately this is saying the same thing as @zhangxaochen and @Michael, but in a way that is easier for me to internalize.
axis=0 means along “indexes”. It’s a row-wise operation.
Suppose we perform a concat() operation on dataframe1 and dataframe2:
we take the 1st row of dataframe1 and place it into the new DataFrame, then take the next row of dataframe1 and place it into the new DataFrame, repeating until we reach the bottom of dataframe1. Then we do the same with dataframe2.
Basically, we are stacking dataframe2 on top of dataframe1, or vice versa,
e.g. making a pile of books on a table or floor.
axis=1 means along “columns”. It’s a column-wise operation.
Suppose we perform a concat() operation on dataframe1 and dataframe2:
we take the 1st complete column (a.k.a. the 1st Series) of dataframe1 and place it into the new DataFrame, then take the second column of dataframe1 and place it adjacent to it (sideways), repeating until all of its columns are used. Then we repeat the same process for dataframe2.
Basically, we are stacking dataframe2 sideways,
e.g. arranging books on a bookshelf.
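A rough sketch of that difference with two toy frames (made-up data):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# axis=0 (index/rows): stack df2 below df1 -> 4 rows, 1 column
piled = pd.concat([df1, df2], axis=0)

# axis=1 (columns): place df2 next to df1 -> 2 rows, 2 columns
shelved = pd.concat([df1, df2], axis=1)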
Furthermore, arrays are a better representation of nested n-dimensional structures than matrices, so the example below can help you visualize how the axis plays an important role when you generalize to more than two dimensions. You can print/write/draw/visualize any n-dimensional array, but writing or visualizing the same thing in a matrix representation is impossible on paper beyond three dimensions.
axis refers to the dimension of the array, in the case of pd.DataFrames axis=0 is the dimension that points downwards and axis=1 the one that points to the right.
Example: Think of an ndarray with shape (3,5,7).
a = np.ones((3,5,7))
a is a 3 dimensional ndarray, i.e. it has 3 axes (“axes” is plural of “axis”). The configuration of a will look like 3 slices of bread where each slice is of dimension 5-by-7. a[0,:,:] will refer to the 0-th slice, a[1,:,:] will refer to the 1-st slice etc.
a.sum(axis=0) will apply sum() along the 0-th axis of a. You will add all the slices and end up with one slice of shape (5,7).
a.sum(axis=0) is equivalent to
b = np.zeros((5,7))
for i in range(5):
    for j in range(7):
        b[i,j] += a[:,i,j].sum()
In a pd.DataFrame, axes work the same way as in numpy.arrays: axis=0 will apply sum() or any other reduction function for each column.
N.B. In @zhangxaochen’s answer, I find the phrases “along the rows” and “along the columns” slightly confusing. axis=0 should refer to “along each column”, and axis=1 “along each row”.
The easiest way for me to understand is to talk about whether you are calculating a statistic for each column (axis = 0) or each row (axis = 1). If you calculate a statistic, say a mean, with axis = 0 you will get that statistic for each column. So if each observation is a row and each variable is in a column, you would get the mean of each variable. If you set axis = 1 then you will calculate your statistic for each row. In our example, you would get the mean for each observation across all of your variables (perhaps you want the average of related measures).
axis = 0: by column = column-wise = along the rows
Let’s look at the table from Wiki. This is an IMF estimate of GDP from 2010 to 2019 for top ten countries.
1. Axis 1 will act for each row on all the columns. If you want to calculate the average (mean) GDP for EACH country over the decade (2010-2019), you need to do df.mean(axis=1). For example, to calculate the mean GDP of the United States from 2010 to 2019: df.loc[['United States'], '2010':'2019'].mean(axis=1)
2. Axis 0 will act for each column on all the rows. If you want to calculate the average (mean) GDP for EACH year across all countries, you need to do df.mean(axis=0). For example, to calculate the mean GDP of the year 2015 for the United States, China, Japan, Germany and India: df.loc['United States':'India', '2015'].mean(axis=0)
Note: The above code will work only after setting “Country(or dependent territory)” column as the Index, using set_index method.
The designer of pandas, Wes McKinney, used to work intensively on finance data. Think of columns as stock names and index as daily prices. You can then guess what the default behavior is (i.e., axis=0) with respect to this finance data. axis=1 can be simply thought as ‘the other direction’.
For example, the statistics functions, such as mean(), sum(), describe(), count() all default to column-wise because it makes more sense to do them for each stock. sort_index(by=) also defaults to column. fillna(method='ffill') will fill along column because it is the same stock. dropna() defaults to row because you probably just want to discard the price on that day instead of throw away all prices of that stock.
Similarly, the square brackets indexing refers to the columns since it’s more common to pick a stock instead of picking a day.
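A small sketch of those defaults on a made-up price table (hypothetical tickers; one column per stock, one row per day):
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {'AAPL': [180.0, np.nan, 182.0], 'MSFT': [410.0, 412.0, np.nan]},
    index=pd.date_range('2024-01-01', periods=3),
)

prices.mean()                  # one mean per stock (column-wise default)
prices.fillna(method='ffill')  # fill forward within each stock's column
prices.dropna()                # drop the days that have any missing price
prices['AAPL']                 # square brackets pick a stock (a column)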
The difficulty with using axis= properly is that it is used in two quite different cases:
For computing an accumulated value, or rearranging (e.g. sorting) data.
For manipulating ("playing" with) entities (e.g. dataframes).
The main idea behind this answer is that for avoiding the confusion, we select either a number, or a name for specifying the particular axis, whichever is more clear, intuitive, and descriptive.
Pandas is based on NumPy, which is based on mathematics, particularly on n-dimensional matrices. Here is an image for common use of axes’ names in math in the 3-dimensional space:
This picture is for memorizing the axes’ ordinal numbers only:
0 for x-axis,
1 for y-axis, and
2 for z-axis.
The z-axis is only for panels; for dataframes we will restrict our interest to the green-colored, 2-dimensional basic plane with x-axis (0, vertical), and y-axis (1, horizontal).
It’s all for numbers as potential values of axis= parameter.
The names of the axes are 'index' (you may use the alias 'rows') and 'columns'. For this explanation, the relation between these names and the ordinal numbers of the axes is NOT important, as everybody knows what the words "rows" and "columns" mean (and everybody here, I suppose, knows what the word "index" means in pandas).
And now, my recommendation:
If you want to compute an accumulated value, you may compute it from values located along axis 0 (or along axis 1) — use axis=0 (or axis=1).
Similarly, if you want to rearrange values, use the axis number of the axis, along which are located data for rearranging (e.g. for sorting).
If you want to manipulate (e.g. concatenate) entities (e.g. dataframes) — use axis='index' (synonym: axis='rows') or axis='columns' to specify the resulting change — index (rows) or columns, respectively.
(For concatenating, you will obtain either a longer index (= more rows), or more columns, respectively.)
This is based on @Safak’s answer.
The best way to understand the axes in pandas/numpy is to create a 3d array and check the result of the sum function along the 3 different axes.
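For instance, a quick sketch:
import numpy as np

a = np.arange(2 * 3 * 4).reshape(2, 3, 4)

a.sum(axis=0).shape  # (3, 4): the first dimension is collapsed
a.sum(axis=1).shape  # (2, 4): the second dimension is collapsed
a.sum(axis=2).shape  # (2, 3): the third dimension is collapsed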
Say your operation requires traversing from left to right (or right to left) in a dataframe. You are apparently merging columns, i.e. you are operating on various columns.
This is axis=1.
Example
df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
df.mean(axis=1)
0 1.5
1 5.5
2 9.5
dtype: float64
df.drop(['A','B'],axis=1,inplace=True)
C D
0 2 3
1 6 7
2 10 11
The point to note here is that we are operating on columns.
Similarly, if your operation requires traversing from top to bottom/bottom to top in a dataframe, you are merging rows. This is axis=0.
My thinking : Axis = n, where n = 0, 1, etc. means that the matrix is collapsed (folded) along that axis. So in a 2D matrix, when you collapse along 0 (rows), you are really operating on one column at a time. Similarly for higher order matrices.
This is not the same as the normal reference to a dimension in a matrix, where 0 -> row and 1 -> column. Similarly for other dimensions in an N dimension array.
For pandas objects, axis=0 stands for a row-wise operation and axis=1 stands for a column-wise operation. This is different from numpy by definition; we can check the definitions in the numpy and pandas docs.
I will explicitly avoid using ‘row-wise’ or ‘along the columns’, since people may interpret them in exactly the wrong way.
Analogy first. Intuitively, you would expect that pandas.DataFrame.drop(axis='column') drops a column from N columns and gives you (N – 1) columns. So you can pay NO attention to rows for now (and remove word ‘row’ from your English dictionary.) Vice versa, drop(axis='row') works on rows.
In the same way, sum(axis='column') works on multiple columns and gives you 1 column. Similarly, sum(axis='row') results in 1 row. This is consistent with its simplest form of definition, reducing a list of numbers to a single number.
In general, with axis=column, you see columns, work on columns, and get columns. Forget rows.
With axis=row, change perspective and work on rows.
0 and 1 are just aliases for ‘row’ and ‘column’. It’s the convention of matrix indexing.
I have been trying to figure out the axis for the last hour as well. The language in all the above answers, and also the documentation is not at all helpful.
To answer the question as I understand it now, in Pandas, axis = 1 or 0 means which axis headers do you want to keep constant when applying the function.
Note: When I say headers, I mean index names
Expanding your example:
+------------+---------+--------+
| | A | B |
+------------+---------+---------
| X | 0.626386| 1.52325|
+------------+---------+--------+
| Y | 0.626386| 1.52325|
+------------+---------+--------+
For axis=1 (columns): we keep the column headers constant and apply the mean function by changing the data.
To demonstrate, we keep the column headers constant:
+------------+---------+--------+
| | A | B |
Now we populate one set of A and B values and then find the mean
| | 0.626386| 1.52325|
Then we populate next set of A and B values and find the mean
| | 0.626386| 1.52325|
Similarly, for axis=rows, we keep row headers constant, and keep changing the data:
To demonstrate, first fix the row headers:
+------------+
| X |
+------------+
| Y |
+------------+
Now populate first set of X and Y values and then find the mean
+------------+---------+
| X | 0.626386
+------------+---------+
| Y | 0.626386
+------------+---------+
Then populate the next set of X and Y values and then find the mean:
+------------+---------+
| X | 1.52325 |
+------------+---------+
| Y | 1.52325 |
+------------+---------+
In summary,
When axis=columns, you fix the column headers and change data, which will come from the different rows.
When axis=rows, you fix the row headers and change data, which will come from the different columns.
Their behaviours are, intriguingly, easier to understand with three-dimensional arrays than with two-dimensional ones.
In the Python packages numpy and pandas, the axis parameter in sum tells the library to compute the statistic (e.g. the sum or mean) over all values that can be fetched in the form array[0, 0, …, i, …, 0], where i iterates through all possible values. The process is then repeated with the position of i fixed while the indices of the other dimensions vary one after another (starting from the right-most one). The result is an (n-1)-dimensional array.
In R, the MARGIN parameter lets the apply function compute the statistic over all values that can be fetched in the form array[, …, i, …, ], where i iterates through all possible values. The process is not repeated once all values of i have been iterated through; therefore, the result is a simple vector.
Arrays are designed with axis=0 running vertically along the rows and axis=1 running horizontally along the columns. Axis refers to a dimension of the array.
I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.
Currently, I do the following:
data = pandas.read_csv('mydata.csv')
which gives something like:
data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
I’d like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.
It is not possible to write something like
observations = data[:'c']
features = data['c':]
I’m not sure what the best method is. Do I need a pd.Panel?
By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other side, data['a':] is not permitted but data[0:] is.
Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]
.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.
Let’s assume we have a DataFrame with the following columns: foo, bar, quz, ant, cat, sat, dat.
# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat
.loc accepts the same slice notation that Python lists do, for both rows and columns. Slice notation is start:stop:step.
# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat
# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar
# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat
# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned
# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar
# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat
# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat
You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z
# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
# foo ant
# w
# x
# y
Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.
The DataFrame.ix index is what you want to be accessing. It’s a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:
>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
b c d e
0 0.418762 0.042369 0.869203 0.972314
1 0.991058 0.510228 0.594784 0.534366
2 0.407472 0.259811 0.396664 0.894202
3 0.726168 0.139531 0.324932 0.906575
          a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476
As in your example, if you would like to extract only columns a and d (i.e. the 1st and the 4th columns), the iloc method of the pandas DataFrame is what you need and can be used very effectively. All you need to know are the indices of the columns you would like to extract. For example:
Here’s how you could use different methods to do selective column slicing, including selective label based, index based and the selective ranges based column slicing.
In [37]: import pandas as pd
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))
In [44]: df
Out[44]:
a b c d e f g
0 0.409038 0.745497 0.890767 0.945890 0.014655 0.458070 0.786633
1 0.570642 0.181552 0.794599 0.036340 0.907011 0.655237 0.735268
2 0.568440 0.501638 0.186635 0.441445 0.703312 0.187447 0.604305
3 0.679125 0.642817 0.697628 0.391686 0.698381 0.936899 0.101806
In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing
Out[45]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing
Out[46]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
In [47]: df.iloc[:, 0:3] ## index based column ranges slicing
Out[47]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
### with 2 different column ranges, index based slicing:
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]:
a b c
0 0.409038 0.745497 0.890767
1 0.570642 0.181552 0.794599
2 0.568440 0.501638 0.186635
3 0.679125 0.642817 0.697628
Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do: data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do: data[data.columns[:2]] and data[data.columns[2:]]
Do you know how to get the index or column of a DataFrame as a NumPy array or python list?
To get a NumPy array, you should use the values attribute:
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
A B
a 1 4
b 2 5
c 3 6
In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)
This accesses how the data is already stored, so there’s no need for a conversion.
Note: This attribute is also available for many other pandas’ objects.
In [3]: df['A'].values
Out[3]: array([1, 2, 3])
You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.
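For instance, a quick sketch:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])

df.index.tolist()   # ['a', 'b', 'c']
df['A'].tolist()    # [1, 2, 3]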
Deprecate your usage of .values in favour of these methods!
From v0.24.0 onwards, we will have two brand spanking new, preferred methods for obtaining NumPy arrays from Index, Series, and DataFrame objects: they are to_numpy(), and .array. Regarding usage, the docs mention:
We haven't removed or deprecated Series.values or
DataFrame.values, but we highly recommend using .array or
.to_numpy() instead.
For Series and Indexes backed by normal NumPy arrays, Series.array
will return a new arrays.PandasArray, which is a thin (no-copy)
wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially
useful on its own, but it does provide the same interface as any
extension array defined in pandas or by a third-party library.
So, to summarise, .array will return either
The existing ExtensionArray backing the Index/Series, or
If there is a NumPy array backing the series, a new ExtensionArray object is created as a thin wrapper over the underlying array.
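For instance (a small sketch; the exact wrapper class name varies across pandas versions):
import pandas as pd

s = pd.Series([1, 2, 3])

s.to_numpy()   # a plain numpy.ndarray: array([1, 2, 3])
s.array        # a thin, no-copy extension-array wrapper around the same data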
Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[…] with .values it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (like Categorical). For example, with PeriodIndex, .values
generates a new ndarray of period objects each time. […]
These two functions aim to improve the consistency of the API, which is a major step in the right direction.
Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.
If you are dealing with a multi-index dataframe, you may be interested in extracting only the column of one name of the multi-index. You can do this as
df.index.get_level_values('name_sub_index')
and of course name_sub_index must be an element of the FrozenList df.index.names
Below is a simple way to convert dataframe column into numpy array.
import numpy as np
import pandas as pd

df = pd.DataFrame(somedict)                     # somedict: your {column: values} data
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain])    # build an array from the 'label' column
ytrain_numpy is a numpy array.
I tried with to_numpy() but it gave me the below error:
TypeError: no supported conversion for types: (dtype('O'),) while doing Binary Relevance classification using Linear SVC.
to_numpy() was converting the DataFrame into a numpy array, but the inner elements' data type was list, which is why the above error was observed.
What’s the easiest way to add an empty column to a pandas DataFrame object? The best I’ve stumbled upon is something like
df['foo'] = df.apply(lambda _: '', axis=1)
Is there a less perverse method?
If I understand correctly, assignment should do the job:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})
>>> df
   A  B
0  1  2
1  2  3
2  3  4
>>> df["C"] = ""
>>> df["D"] = np.nan
>>> df
   A  B C   D
0  1  2   NaN
1  2  3   NaN
2  3  4   NaN
To add to DSM's answer and building on this associated question, I'd split the approach into two cases:
Adding a single column: Just assign empty values to the new column, e.g. df['C'] = np.nan
Adding multiple columns: I'd suggest using the .reindex(columns=[...]) method of pandas to add the new columns to the dataframe's column index. This also works for adding multiple new rows with .reindex(index=[...]). Note that newer versions of Pandas (v>0.20) allow you to specify an axis keyword rather than explicitly assigning to columns or rows (see the sketch below).
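A minimal sketch of that reindex approach (hypothetical column names):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4]})

# add several empty columns at once; existing columns are kept as-is,
# and the new ones are filled with NaN by default
df = df.reindex(columns=df.columns.tolist() + ["C", "D", "E"])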
Starting with v0.16.0, DF.assign() could be used to assign new columns (single/multiple) to a DF. These columns get inserted in alphabetical order at the end of the DF.
This becomes advantageous compared to simple assignment in cases wherein you want to perform a series of chained operations directly on the returned dataframe.
Consider the same DF sample demonstrated by @DSM:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df
Out[18]:
A B
0 1 2
1 2 3
2 3 4
df.assign(C="",D=np.nan)
Out[21]:
A B C D
0 1 2 NaN
1 2 3 NaN
2 3 4 NaN
Note that this returns a copy with all the previous columns along with the newly created ones. In order for the original DF to be modified accordingly, use it like df = df.assign(...), as it does not support an inplace operation currently.
The code below addresses the question "How do I add n empty columns to my existing dataframe?". In the interest of keeping solutions to similar problems in one place, I am adding it here.
Approach 1 (to create 64 additional columns with column names from 1-64)
m = list(range(1,65,1))
dd=pd.DataFrame(columns=m)
df.join(dd).replace(np.nan,'') #df is the dataframe that already exists
Approach 2 (to create 64 additional columns with column names from 1-64)
df['column'] = None #This works. This will create a new column with None type
df.column = None #This will work only when the column is already present in the dataframe
I want to know if it is possible to use the pandas to_csv() function to add a dataframe to an existing csv file. The csv file has the same structure as the loaded data.
with open('my_csv.csv', 'a') as f:
    df.to_csv(f, header=False)
If this was your csv, foo.csv:
,A,B,C
0,1,2,3
1,4,5,6
If you read that and then append, for example, df + 6:
In [1]: df = pd.read_csv('foo.csv', index_col=0)
In [2]: df
Out[2]:
A B C
0 1 2 3
1 4 5 6
In [3]: df + 6
Out[3]:
A B C
0 7 8 9
1 10 11 12
In [4]: with open('foo.csv', 'a') as f:
            (df + 6).to_csv(f, header=False)
foo.csv becomes:
,A,B,C
0,1,2,3
1,4,5,6
0,7,8,9
1,10,11,12
with open(filename, 'a') as f:
    df.to_csv(f, header=f.tell()==0)
Create file unless exists, otherwise append
Add header if file is being created, otherwise skip it
A little helper function I use with some header checking safeguards to handle it all:
def appendDFToCSV_void(df, csvFilePath, sep=","):
    import os
    if not os.path.isfile(csvFilePath):
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep)
    elif len(df.columns) != len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns):
        raise Exception("Columns do not match!! Dataframe has " + str(len(df.columns)) + " columns. CSV file has " + str(len(pd.read_csv(csvFilePath, nrows=1, sep=sep).columns)) + " columns.")
    elif not (df.columns == pd.read_csv(csvFilePath, nrows=1, sep=sep).columns).all():
        raise Exception("Columns and column order of dataframe and csv file do not match!!")
    else:
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep, header=False)
Initially starting with pyspark dataframes, I got type conversion errors (when converting to pandas dataframes and then appending to CSV), given the schema/column types in my pyspark dataframes.
I solved the problem by forcing all columns in each dataframe to be of type string and then appending to the CSV as follows:
with open('testAppend.csv', 'a') as f:
    df2.toPandas().astype(str).to_csv(f, header=False)
A bit late to the party but you can also use a context manager, if you’re opening and closing your file multiple times, or logging data, statistics, etc.
from contextlib import contextmanager
import pandas as pd

@contextmanager
def open_file(path, mode):
    file_to = open(path, mode)
    yield file_to
    file_to.close()

## later
saved_df = pd.DataFrame(data)
with open_file('yourcsv.csv', 'r') as infile:
    saved_df.to_csv('yourcsv.csv', mode='a', header=False)
If the DataFrame is huge, and the number of rows to drop is large as well, then a simple drop by index, df.drop(df.index[indexes_to_drop]), takes too much time.
In my case, I have a multi-indexed DataFrame of floats with 100M rows x 3 cols, and I need to remove 10k rows from it. The fastest method I found is, quite counterintuitively, to take the remaining rows.
Let indexes_to_drop be an array of positional indexes to drop ([1, 2, 4] in the question).
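A minimal sketch of that "take the remaining rows" idea, using the indexes_to_drop just described:
import numpy as np

# boolean mask that is False at the positions we want to drop
mask = np.ones(len(df), dtype=bool)
mask[indexes_to_drop] = False

df_remaining = df[mask]   # keep only the surviving rows in one pass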
You can also pass to DataFrame.drop the label itself (instead of Series of index labels):
In[17]: df
Out[17]:
a b c d e
one 0.456558 -2.536432 0.216279 -1.305855 -0.121635
two -1.015127 -0.445133 1.867681 2.179392 0.518801
In[18]: df.drop('one')
Out[18]:
a b c d e
two -1.015127 -0.445133 1.867681 2.179392 0.518801
Which is equivalent to:
In[19]: df.drop(df.index[[0]])
Out[19]:
a b c d e
two -1.015127 -0.445133 1.867681 2.179392 0.518801
Use the index of this unwanted dataframe to drop the rows from the original dataframe.
Example:
Suppose you have a dataframe df with many columns, including 'Age', which is an integer. Now let's say you want to drop all the rows where 'Age' is a negative number.
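A minimal sketch of that approach, using the 'Age' column described above:
# rows we do not want: negative ages
unwanted = df[df['Age'] < 0]

# drop them from the original dataframe by their index
df = df.drop(unwanted.index)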
Here is a slightly more specific example I would like to show. Say you have many duplicate entries in some of your rows. If you have string entries, you could easily use string methods to find all the indexes to drop.
In a comment on @theodros-zelleke's answer, @j-jones asked what to do if the index is not unique. I had to deal with such a situation. What I did was to rename the duplicates in the index before calling drop(), roughly like df.index = rename_duplicates(df.index),
where rename_duplicates() is a function I defined that goes through the elements of the index and renames the duplicates. I used the same renaming pattern that pd.read_csv() uses on columns, i.e., "%s.%d" % (name, count), where name is the name of the row and count is how many times it has occurred previously.
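The original rename_duplicates() is not reproduced here; a hypothetical sketch following the described "%s.%d" pattern could be:
def rename_duplicates(index):
    # repeated labels get a '.1', '.2', ... suffix; first occurrences stay unchanged
    counts = {}
    new_labels = []
    for name in index:
        if name in counts:
            counts[name] += 1
            new_labels.append("%s.%d" % (name, counts[name]))
        else:
            counts[name] = 0
            new_labels.append(name)
    return new_labels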
df = df.drop(df.index[[2,3]])
or
df.drop(df.index[[2,3]], inplace=True)
print(df)
df =
index column1
0 00
3 30
#This approach removes the rows as we wanted but the index remains unordered
Approach 2
df.drop(df.index[[2,3]], inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
df =
index column1
0 00
1 30
#This approach removes the rows as we wanted and resets the index.
/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130:
DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype
option on import or set low_memory=False.
Why is the dtype option related to low_memory, and why would making it False help with this problem?
The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].
The reason you get this low_memory warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.
Dtype Guessing (very bad)
Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.
Consider the example of one file which has a column called user_id.
It contains 10 million rows where the user_id is always numbers.
Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.
Specifying dtypes (should always be done)
adding
dtype={'user_id': int}
to the pd.read_csv() call will make pandas know when it starts reading the file, that this is only integers.
Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified.
Example of broken data that breaks when dtypes are defined
import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})
ValueError: invalid literal for long() with base 10: 'foobar'
We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.
Pandas extends this set of dtypes with its own:
'datetime64[ns, <tz>]', which is a time zone aware timestamp.
'category', which is essentially an enum (strings represented by integer keys to save space).
'period[<freq>]'. Not to be confused with a timedelta, these objects are actually anchored to specific time periods.
'Sparse', 'Sparse[int]', 'Sparse[float]' are for sparse data, or 'data that has a lot of holes in it'. Instead of saving the NaN or None in the dataframe it omits the objects, saving space.
‘Interval’ is a topic of its own but its main use is for indexing. See more here
‘Int8’, ‘Int16’, ‘Int32’, ‘Int64’, ‘UInt8’, ‘UInt16’, ‘UInt32’, ‘UInt64’ are all pandas specific integers that are nullable, unlike the numpy variant.
‘string’ is a specific dtype for working with string data and gives access to the .str attribute on the series.
‘boolean’ is like the numpy ‘bool’ but it also supports missing data.
Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.
Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.
Usage of converters
@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.
CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.
As for low_memory, it's True by default and isn't yet documented. I don't think it's relevant though. The error message is generic, so you shouldn't need to mess with low_memory anyway. Hope this helps, and let me know if you have further problems.
As mentioned earlier by firelynx, if dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type so that the data could still be loaded.
def conv(val):
    if not val:
        return 0
    try:
        return np.float64(val)
    except:
        return np.float64(0)

df = pd.read_csv(csv_file, converters={'COL_A': conv, 'COL_B': conv})
I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn't bigger than your system memory, reboot, and clear the RAM before proceeding. If you're still running into errors, it's worth making sure your .csv file is OK; take a quick look in Excel and make sure there's no obvious corruption. Broken original data can wreak havoc…
I was facing a similar issue when processing a huge csv file (6 million rows). I had three issues:
1. the file contained strange characters (fixed using encoding)
2. the datatype was not specified (fixed using dtype property)
3. Using the above, I still faced an issue related to the file_format, which could not be determined from the filename (fixed using try..except).
from pathlib import Path

df = pd.read_csv(csv_file, sep=';', encoding='ISO-8859-1',
                 names=['permission', 'owner_name', 'group_name', 'size', 'ctime', 'mtime', 'atime', 'filename', 'full_filename'],
                 dtype={'permission': str, 'owner_name': str, 'group_name': str, 'size': str, 'ctime': object, 'mtime': object, 'atime': object, 'filename': str, 'full_filename': str, 'first_date': object, 'last_date': object})

try:
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except:
    df['file_format'] = ''
Right now I’m importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don’t have to spend all that time waiting for the script to run?
df.to_pickle(file_name) # where to save it, usually as a .pkl
Then you can load it back using:
df = pd.read_pickle(file_name)
Note: before 0.11.1 save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:
store = pd.HDFStore('store.h5')
store['df'] = df # save it
store['df'] # load it
More advanced strategies are discussed in the cookbook.
Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
Although there are already some answers I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.
They compare:
pickle: original ASCII data format
cPickle, a C library
pickle-p2: uses the newer binary format
json: standardlib json library
json-no-index: like json, but without index
msgpack: binary JSON alternative
CSV
hdfstore: HDF5 storage format
In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:
You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself
The source code for the test which they refer to is available online. Since this code did not work directly I made some minor changes, which you can get here: serialize.py
I got the following results:
They also mention that with the conversion of text data to categorical data the serialization is much faster. In their test about 10 times as fast (also see the test code).
Edit: The higher times for pickle than CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.
If I understand correctly, you’re already using pandas.read_csv() but would like to speed up the development process so that you don’t have to load the file in every time you edit your script, is that right? I have a few recommendations:
you could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to only load the top bit of the table, while you’re doing the development
use ipython for an interactive session, such that you keep the pandas table in memory as you edit and reload your script.
Updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format, which is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data).
You might also be interested in this answer on stackoverflow.
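A quick sketch of that feather round trip (requires the pyarrow package; hypothetical file name):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

df.to_feather('df.feather')         # write the frame once
df = pd.read_feather('df.feather')  # load it back on later runs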
Pickle works well!
import pandas as pd
df.to_pickle('123.pkl') #to save the dataframe, df to 123.pkl
df1 = pd.read_pickle('123.pkl') #to load 123.pkl back to the dataframe df
As already mentioned there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:
Warning The pickle module is not secure against erroneous or
maliciously constructed data. Never unpickle data received from an
untrusted or unauthenticated source.
Numpy file formats are pretty fast for numerical data
I prefer to use numpy files since they’re fast and easy to work with.
Here’s a simple benchmark for saving and loading a dataframe with 1 column of 1million points.
import numpy as np
import pandas as pd
num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)
using ipython’s %%timeit magic function
%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)
the output is
100 loops, best of 3: 5.97 ms per loop
to load the data back into a dataframe
%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)
    data_df = pd.DataFrame(data)
the output is
100 loops, best of 3: 5.12 ms per loop
NOT BAD!
CONS
There’s a problem if you save the numpy file using python 2 and then try opening using python 3 (or vice versa).
Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.
Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.
Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
The overall move has been to pyarrow/feather (deprecation warnings from pandas/msgpack). However, I have a challenge with pyarrow being transient in specification: data serialized with pyarrow 0.15.1 cannot be deserialized with 0.16.0 (ARROW-7961). I'm using serialization for redis, so I have to use a binary encoding.
I’ve retested various options (using jupyter notebook)
import sys, pickle, zlib, warnings, io
import pyarrow as pa

class foocls:
    def pyarrow(out): return pa.serialize(out).to_buffer().to_pybytes()
    def msgpack(out): return out.to_msgpack()
    def pickle(out): return pickle.dumps(out)
    def feather(out): return out.to_feather(io.BytesIO())
    def parquet(out): return out.to_parquet(io.BytesIO())

warnings.filterwarnings("ignore")
for c in foocls.__dict__.values():
    sbreak = True
    try:
        c(out)
        print(c.__name__, "before serialization", sys.getsizeof(out))
        print(c.__name__, sys.getsizeof(c(out)))
        %timeit -n 50 c(out)
        print(c.__name__, "zlib", sys.getsizeof(zlib.compress(c(out))))
        %timeit -n 50 zlib.compress(c(out))
    except TypeError as e:
        if "not callable" in str(e): sbreak = False
        else: raise
    except (ValueError) as e: print(c.__name__, "ERROR", e)
    finally:
        if sbreak: print("=+=" * 30)
warnings.filterwarnings("default")
With following results for my data frame (in out jupyter variable)
pyarrow before serialization 533366
pyarrow 120805
1.03 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pyarrow zlib 20517
2.78 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
msgpack before serialization 533366
msgpack 109039
1.74 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
msgpack zlib 16639
3.05 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
pickle before serialization 533366
pickle 142121
733 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
pickle zlib 29477
3.81 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather ERROR feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
parquet ERROR Nested column branch had multiple children: struct<x: double, y: double>
=+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+==+=
feather and parquet do not work for my data frame, so I'm going to continue using pyarrow. However, I will supplement it with pickle (no compression): when writing to the cache store, I save both the pyarrow and pickle serialised forms; when reading from the cache, I fall back to pickle if pyarrow deserialisation fails (see the sketch below).
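A hedged sketch of that write-both / read-with-fallback strategy (hypothetical redis client r and key; pa.serialize() exists in the pyarrow versions discussed here but was removed in later releases):
import pickle
import pyarrow as pa

def cache_write(r, key, df):
    # store both serialised forms side by side
    r.set(key + ":arrow", pa.serialize(df).to_buffer().to_pybytes())
    r.set(key + ":pickle", pickle.dumps(df))

def cache_read(r, key):
    # prefer pyarrow; fall back to pickle if deserialisation fails
    try:
        return pa.deserialize(r.get(key + ":arrow"))
    except Exception:
        return pickle.loads(r.get(key + ":pickle"))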
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
  col1 col2
0    w    a
1    1    2
2    2  NaN
>>> df.replace({"col1": di})
  col1 col2
0    w    a
1    A    2
2    B  NaN
If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):
Exhaustive Mapping
In this case, the form is very simple:
df['col1'].map(di) # note: if the dictionary does not exhaustively map all
# entries then non-matched entries are changed to NaNs
Although map most commonly takes a function as its argument, it can alternatively take a dictionary or series: Documentation for Pandas.series.map
Non-Exhaustive Mapping
If you have a non-exhaustive mapping and wish to retain the existing variables for non-matches, you can add fillna:
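For example, a minimal sketch of that pattern, filling non-matches back in from the original column:
df['col1'] = df['col1'].map(di).fillna(df['col1'])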
Testing with %timeit, it appears that map is approximately 10x faster than replace.
Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp answer (linked above) for more extensive benchmarks and discussion.
There is a bit of ambiguity in your question. There are at least three interpretations:
the keys in di refer to index values
the keys in di refer to df['col1'] values
the keys in di refer to index locations (not the OP’s question, but thrown in for fun.)
Below is a solution for each case.
Case 1:
If the keys of di are meant to refer to index values, then you could use the update method:
df['col1'].update(pd.Series(di))
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {0: "A", 2: "B"}
# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)
yields
col1 col2
1 w a
2 B 30
0 A NaN
I’ve modified the values from your original post so it is clearer what update is doing.
Note how the keys in di are associated with index values. The order of the index values — that is, the index locations — does not matter.
Case 2:
If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
print(df)
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {10: "A", 20: "B"}
# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)
yields
col1 col2
1 w a
2 A 30
0 B NaN
Note how in this case the keys in di were changed to match values in df['col1'].
Case 3:
If the keys in di refer to index locations, then you could use
df['col1'].put(di.keys(), di.values())
since
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
di = {0: "A", 2: "B"}
# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)
yields
col1 col2
1 A a
2 10 30
0 B NaN
Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python’s 0-based indexing refer to the first and third locations.
Adding to this question, if you ever have more than one column to remap in a dataframe:
def remap(data, dict_labels):
    """
    This function takes in a dictionary of labels, dict_labels,
    and replaces the values (previously label-encoded) with the strings,
    e.g. dict_labels = {'col1': {1: 'A', 2: 'B'}}
    """
    for field, values in dict_labels.items():
        print("I am remapping %s" % field)
        data.replace({field: values}, inplace=True)
    print("DONE")
    return data
DSM has the accepted answer, but the coding doesn’t seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):
Given that map is faster than replace (@JohnE's solution), you need to be careful with non-exhaustive mappings where you intend to map specific values to NaN. The proper method in this case requires that you mask the Series when you .fillna, otherwise you undo the mapping to NaN.
import pandas as pd
import numpy as np
d = {'m': 'Male', 'f': 'Female', 'missing': np.NaN}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})
keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']
df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))
gender mapped
0 m Male
1 f Female
2 missing NaN
3 Male Male
4 U U
As an extension to what has been proposed by Nico Coallier (apply to multiple columns) and U10-Forward (using the apply style of methods), and summarising it into a one-liner, I propose the approach sketched below.
The .transform() method processes each column as a Series, contrary to .apply(), which passes the columns aggregated in a DataFrame.
Consequently, you can apply the Series method map().
Finally, and I discovered this behaviour thanks to U10, you can use the whole Series in the .get() expression (unless I have misunderstood its behaviour and it processes the Series sequentially instead of element-wise).
The .get(x, x) accounts for the values you did not mention in your mapping dictionary, which would otherwise be converted to NaN by the .map() method.
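A sketch of such a one-liner, with hypothetical column names and the di mapping used earlier in this thread:
df[['col1', 'col2']] = df[['col1', 'col2']].transform(lambda s: s.map(lambda x: di.get(x, x)))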
A more native pandas approach is to apply a replace function as below:
import re

def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

Once you have defined the function, you can apply it to your dataframe:
di = {"1": "A", "2": "B"}   # this regex-based approach works on text, so keys and values are strings
df['col1'] = df.apply(lambda row: multiple_replace(di, row['col1']), axis=1)
I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.
I’ve tried different methods from other questions but still can’t seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can’t be counted as anything else. Even if they have a “1” in another ethnicity column they are still counted as Hispanic, not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can’t be counted as a unique ethnicity (except for Hispanic). Hopefully this makes sense. Any help will be greatly appreciated.
It’s almost like doing a for loop through each row: if a record meets a criterion it is added to one list and eliminated from the original.
From the dataframe below I need to calculate a new column based on the following spec in SQL:
IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”
Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”
Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined
0 MOST JEFF E 0 0 0 0 0 1 White
1 CRUISE TOM E 0 0 0 1 0 0 White
2 DEPP JOHNNY 0 0 0 0 0 1 Unknown
3 DICAP LEO 0 0 0 0 0 1 Unknown
4 BRANDO MARLON E 0 0 0 0 0 0 White
5 HANKS TOM 0 0 0 0 0 1 Unknown
6 DENIRO ROBERT E 0 1 0 0 0 1 White
7 PACINO AL E 0 0 0 0 0 1 White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White
9 EASTWOOD CLINT E 0 0 0 0 0 1 White
The desired output is the same dataframe with an added race_label column, as shown in the answer below.
OK, two steps to this – first is to write a function that does the translation you want – I’ve put an example together based on your pseudo-code:
def label_race(row):
    if row['eri_hispanic'] == 1:
        return 'Hispanic'
    if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1:
        return 'Two Or More'
    if row['eri_nat_amer'] == 1:
        return 'A/I AK Native'
    if row['eri_asian'] == 1:
        return 'Asian'
    if row['eri_afr_amer'] == 1:
        return 'Black/AA'
    if row['eri_hawaiian'] == 1:
        return 'Haw/Pac Isl.'
    if row['eri_white'] == 1:
        return 'White'
    return 'Other'
You may want to go over this, but it seems to do the trick – notice that the parameter going into the function is considered to be a Series object labelled “row”.
Next, use the apply function in pandas to apply the function – e.g.
df.apply(lambda row: label_race(row), axis=1)
Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
If you’re happy with those results, then run it again, saving the results into a new column in your original dataframe.
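For example, assigning the result to a new race_label column (the column name used in the output below):
df['race_label'] = df.apply(lambda row: label_race(row), axis=1)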
The resultant dataframe looks like this (scroll to the right to see the new column):
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 MOST JEFF E 0 0 0 0 0 1 White White
1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic
2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White
3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White
4 BRANDO MARLON E 0 0 0 0 0 0 White Other
5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White
6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More
7 PACINO AL E 0 0 0 0 0 1 White White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl.
9 EASTWOOD CLINT E 0 0 0 0 0 1 White White
Since this is the first Google result for ‘pandas new column from others’, here’s a simple example:
import pandas as pd
# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
# a b
# 0 1 3
# 1 2 4
# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0 4
# 1 6
# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
# a b c
# 0 1 3 4
# 1 2 4 6
If you get a SettingWithCopyWarning, you can also do it this way:
fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'
The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:
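A minimal sketch of what that could look like, assuming the same ERI column names and race labels as in the label_race function above; the conditions are evaluated in order, so the first match wins, just like the if-ladder:
import numpy as np
conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(axis=1) > 1,
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]
choices = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']
# np.select picks the choice for the first condition that is True, row by row
df['race_label'] = np.select(conditions, choices, default='Other')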
.apply() takes in a function as the first parameter; pass in the label_race function like so:
df['race_label'] = df.apply(label_race, axis=1)
You don’t need to make a lambda function to pass in a function.
Try this:
df.loc[df['eri_white'] == 1, 'race_label'] = 'White'
df.loc[df['eri_hawaiian'] == 1, 'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer'] == 1, 'race_label'] = 'Black/AA'
df.loc[df['eri_asian'] == 1, 'race_label'] = 'Asian'
df.loc[df['eri_nat_amer'] == 1, 'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1, 'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic'] == 1, 'race_label'] = 'Hispanic'
df['race_label'] = df['race_label'].fillna('Other')
Output:
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian \
0 MOST JEFF E 0 0 0
1 CRUISE TOM E 0 0 0
2 DEPP JOHNNY NaN 0 0 0
3 DICAP LEO NaN 0 0 0
4 BRANDO MARLON E 0 0 0
5 HANKS TOM NaN 0 0 0
6 DENIRO ROBERT E 0 1 0
7 PACINO AL E 0 0 0
8 WILLIAMS ROBIN E 0 0 1
9 EASTWOOD CLINT E 0 0 0
eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 0 0 1 White White
1 1 0 0 White Hispanic
2 0 0 1 Unknown White
3 0 0 1 Unknown White
4 0 0 0 White Other
5 0 0 1 Unknown White
6 0 0 1 White Two Or More
7 0 0 1 White White
8 0 0 0 White Haw/Pac Isl.
9 0 0 1 White White
Use .loc instead of apply: it keeps the operation vectorized.
.loc works in a simple manner: build a boolean mask of the rows that satisfy the condition, then assign the value to exactly those rows.