The column names (which are strings) cannot be sliced in the manner you tried.
Here you have a couple of options. If you know from context which columns you want, you can select just those columns by passing a list into the __getitem__ syntax (the []’s).
df1 = df[['a','b']]
Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:
df1 = df.iloc[:,0:2] # Remember that Python does not slice inclusive of the ending index.
Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).
Sometimes, however, there are indexing conventions in Pandas that don’t do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This can happen with the second way of indexing, so you can append the copy() method to get a regular copy. Without it, changing what you think is the sliced object can sometimes alter the original object. It is always good to be on the lookout for this.
df1 = df.iloc[:, 0:2].copy() # To avoid the case where changing df1 also changes df
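As a minimal sketch of why the explicit copy() helps, the example below uses a small made-up DataFrame (the column names and values are assumptions for illustration only). It takes the first two columns by position, copies them, and then modifies the copy.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Take the first two columns by position and make an explicit copy.
df1 = df.iloc[:, 0:2].copy()

# Modify the copy; because of copy(), the original frame is not affected.
df1.loc[0, "a"] = 100

print(df.loc[0, "a"])   # value in the original frame
print(df1.loc[0, "a"])  # value in the copy
```

Without the copy(), whether the original changes depends on the pandas version and on the indexing operation, which is why being explicit is the safer habit.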
To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices you can use iloc together with the get_loc method of the DataFrame’s columns attribute to obtain the column positions.
{c: df.columns.get_loc(c) for c in df.columns} # maps each column name to its position
Now you can use this dictionary to access columns through names and using iloc.
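A short sketch of that idea follows. The DataFrame and the names col_pos and wanted are hypothetical, chosen only for this example. It builds a name-to-position mapping with get_loc and then selects by position with iloc.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Map each column name to its integer position.
col_pos = {name: df.columns.get_loc(name) for name in df.columns}

# Look up the positions by name, then select those columns with iloc.
wanted = ["a", "c"]
df1 = df.iloc[:, [col_pos[name] for name in wanted]]

print(col_pos)
print(df1)
```

This assumes the column names are unique; with duplicate names get_loc does not return a single integer.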
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])
df.head()
Out:
A B C D E F
R0 99 78 61 16 73 8
R1 62 27 30 80 7 76
R2 15 53 80 27 44 77
R3 75 65 47 30 84 86
R4 18 9 41 62 1 82
To get the columns from C to E (note that, unlike integer slicing, ‘E’ is included in the columns):
df.loc[:, 'C':'E']
Out:
C D E
R0 61 16 73
R1 30 80 7
R2 80 27 44
R3 47 30 84
R4 41 62 1
R5 5 58 0
...
Same works for selecting rows based on labels. Get the rows ‘R6’ to ‘R10’ from those columns:
df.loc['R6':'R10', 'C':'E']
Out:
C D E
R6 51 27 31
R7 83 19 18
R8 11 67 65
R9 78 27 29
R10 7 16 94
.loc also accepts a boolean array so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool) – True if the column name is in the list ['B', 'C', 'D']; False, otherwise.
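For illustration, here is a small sketch of passing that boolean array to .loc. It rebuilds a shorter version of the example frame (5 rows instead of 100, an assumption made to keep the output small); the names mask and subset are hypothetical.

```python
import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(5, 6)), columns=list("ABCDEF"))

# True where the column name is one of 'B', 'C', 'D'.
mask = df.columns.isin(list("BCD"))

# Keep all rows, and only the columns where the mask is True.
subset = df.loc[:, mask]

print(mask)
print(subset)
```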
Assuming your column names (df.columns) are ['index','a','b','c'], then the data you want is in the 3rd & 4th columns. If you don’t know their names when your script runs, you can do this:
newdf = df[df.columns[2:4]] # Remember, Python is 0-offset! The "3rd" entry is at slot 2.
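As EMS points out in his answer, df.ix slices columns a bit more concisely (note that .ix was deprecated in pandas 0.20 and removed in 1.0, so use .loc or .iloc in current versions), but the .columns slicing interface might be more natural because it uses the vanilla 1-D Python list indexing/slicing syntax.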
WARN: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, an Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of array optimized for lookup of its elements’ values. For df.index it’s for looking up rows by their label. The df.columns attribute is also a pd.Index array, for looking up columns by their labels.
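To make the distinction concrete, here is a small sketch. The values are made up for illustration, and the names col and row_labels are hypothetical. It shows that df['index'] and df.index are different objects, and that slicing df.columns works like slicing a list.

```python
import pandas as pd

# Hypothetical data with a column that is (unwisely) named 'index'.
df = pd.DataFrame({"index": [1, 2], "a": [2, 3], "b": [3, 4], "c": [4, 5]})

col = df["index"]      # the column named 'index' (a Series)
row_labels = df.index  # the row labels (a pd.Index), not the column

# df.columns is also a pd.Index, so it can be sliced by position.
newdf = df[df.columns[2:4]]

print(list(col))
print(list(row_labels))
print(list(newdf.columns))
```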
Answer 3
In [39]: df
Out[39]:
index a b c
0 1 2 3 4
1 2 3 4 5
In [40]: df1 = df[['b', 'c']]
In [41]: df1
Out[41]:
b c
0 3 4
1 4 5
I realize this question is quite old, but in the latest version of pandas there is an easy way to do exactly this. Column names (which are strings) can be sliced in whatever manner you like.
You could provide a list of columns to be dropped and get back the DataFrame with only the columns needed, using the drop() method of a Pandas DataFrame.
Just saying
colsToDrop = ['a']
df.drop(colsToDrop, axis=1)
would return a DataFrame with just the columns b and c.
Starting with 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex. So, the answer to your question is:
df1 = df.reindex(columns=['b','c'])
In prior versions, using .loc[list-of-labels] would work as long as at least 1 of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and now shows a warning message. The recommended alternative is to use .reindex().
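A brief sketch of how .reindex behaves, using a made-up frame (the data and the label 'z' are assumptions for illustration). Labels that exist are selected as usual, and a missing label produces a column of NaN rather than an error.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# All requested labels exist: behaves like a plain column selection.
df1 = df.reindex(columns=["b", "c"])

# 'z' does not exist in df: reindex adds it as a column filled with NaN.
df2 = df.reindex(columns=["b", "z"])

print(df1)
print(df2)
```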
df1 = pd.DataFrame()  # create an empty DataFrame
for index, row in df.iterrows():
    # copy columns 'A' and 'B' one row at a time (slow on large frames)
    df1.loc[index, 'A'] = df.loc[index, 'A']
    df1.loc[index, 'B'] = df.loc[index, 'B']
df1.head()
The approaches discussed in the answers above assume that the user either knows the column indices to drop or subset on, or wishes to subset the DataFrame using a range of columns (for instance ‘C’ : ‘E’). pandas.DataFrame.drop() is certainly an option for subsetting data based on a user-defined list of columns (though be careful to always work on a copy of the DataFrame and not to set inplace=True!).
Another option is to use df.columns.difference() (that is, pandas.Index.difference()), which does a set difference on column names and returns an Index containing the desired columns. That Index can then be used to select from the DataFrame.
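The original solution code is not shown here, so the following is only a sketch of the difference() approach on a made-up frame (the data and the name keep are assumptions). It removes column 'a' from the set of column names and selects what remains.

```python
import pandas as pd

# Hypothetical data; the columns are deliberately not in alphabetical order.
df = pd.DataFrame({"c": [5, 6], "a": [1, 2], "b": [3, 4]})

# Set difference on the column names: everything except 'a'.
keep = df.columns.difference(["a"])

# Select the remaining columns.
df1 = df[keep]

print(list(keep))
print(df1)
```

Note that difference() sorts its result by default, so the column order of df1 may differ from the order in df.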
I’ve seen several answers on that, but one point remained unclear to me. How would you select those columns of interest? The answer is that if you have them gathered in a list, you can just reference the columns using the list.
I have the following list/numpy array extracted_features, specifying 63 columns. The original dataset has 103 columns, and I would like to extract exactly those, then I would use
dataset[extracted_features]
This returns a DataFrame containing only the columns listed in extracted_features.
This is something you would use quite often in Machine Learning (more specifically, in feature selection). I would like to discuss other ways too, but I think that has already been covered by other stackoverflowers. Hope this has been helpful!
You can use pandas.DataFrame.filter method to either filter or reorder columns like this:
df1 = df.filter(['a', 'b'])
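As a small sketch of what filter can do, the example below uses a made-up frame (the column names are assumptions for illustration). Besides an explicit list of names, filter also accepts the like and regex keyword arguments for matching column names.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"a": [1], "b": [2], "ab": [3], "c": [4]})

# Explicit list: the result follows the order of the list, so this also reorders.
by_items = df.filter(items=["b", "a"])

# Substring match: columns whose name contains 'a'.
by_like = df.filter(like="a")

# Regular expression: columns whose name starts with 'a'.
by_regex = df.filter(regex="^a")

print(list(by_items.columns))
print(list(by_like.columns))
print(list(by_regex.columns))
```

Unlike df[['a', 'b']], filter silently skips names in items that are not present in the frame.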
Answer 16
df[['a','b']]            # select all rows of columns 'a' and 'b'
df.loc[0:10, ['a','b']]  # rows labeled 0 to 10 (inclusive), columns 'a' and 'b'
df.loc[0:10, 'a':'b']    # rows labeled 0 to 10 (inclusive), columns 'a' through 'b' (inclusive)
df.iloc[0:10, 3:5]       # row positions 0 to 9, column positions 3 and 4
df.iloc[3, 3:5]          # row position 3, column positions 3 and 4
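The difference in endpoint handling between .loc and .iloc is easy to miss, so here is a minimal sketch on a made-up frame with a default integer index (the data is an assumption for illustration). Label slices with .loc include the end label, while position slices with .iloc exclude the end position.

```python
import pandas as pd

# Hypothetical data: 20 rows with the default integer index 0..19.
df = pd.DataFrame({"a": range(20), "b": range(20), "c": range(20)})

# Label-based: both endpoints are included (rows 0..10, columns 'a'..'b').
by_label = df.loc[0:10, "a":"b"]

# Position-based: the end is excluded (rows 0..9, columns 0..1).
by_pos = df.iloc[0:10, 0:2]

print(by_label.shape)
print(by_pos.shape)
```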