标签归档:select

从列表中随机选择50个项目写入文件

问题:从列表中随机选择50个项目写入文件

到目前为止,我已经弄清楚了如何导入文件,创建新文件以及使列表随机化。

我在从列表中随机选择50个项目以写入文件时遇到麻烦吗?

def randomizer(input,output1='random_1.txt',output2='random_2.txt',output3='random_3.txt',output4='random_total.txt'):

#Input file 
    query=open(input,'r').read().split()
    dir,file=os.path.split(input)

    temp1 = os.path.join(dir,output1)
    temp2 = os.path.join(dir,output2)
    temp3 = os.path.join(dir,output3)
    temp4 = os.path.join(dir,output4)


    out_file4=open(temp4,'w')

    random.shuffle(query)

    for item in query:
        out_file4.write(item+'\n')   

因此,如果总随机文件为

example:

random_total = ['9','2','3','1','5','6','8','7','0','4']

我想要3个文件(out_file1 | 2 | 3),其中第一个随机集为3,第二个随机集为3,第三个随机集为3(对于此示例,但我要创建的文件应该有50个)

random_1 = ['9','2','3']
random_2 = ['1','5','6']
random_3 = ['8','7','0']

因此,不会包含最后一个“ 4”,这很好。

如何从随机选择的列表中选择50?

更好的是,如何从原始列表中随机选择50个?

So far I have figured out how to import the file, create new files, and randomize the list.

I’m having trouble selecting only 50 items from the list randomly to write to a file?

def randomizer(input,output1='random_1.txt',output2='random_2.txt',output3='random_3.txt',output4='random_total.txt'):

#Input file 
    query=open(input,'r').read().split()
    dir,file=os.path.split(input)

    temp1 = os.path.join(dir,output1)
    temp2 = os.path.join(dir,output2)
    temp3 = os.path.join(dir,output3)
    temp4 = os.path.join(dir,output4)


    out_file4=open(temp4,'w')

    random.shuffle(query)

    for item in query:
        out_file4.write(item+'\n')   

So if the total randomization file was

example:

random_total = ['9','2','3','1','5','6','8','7','0','4']

I would want 3 files (out_file1|2|3) with the first random set of 3, second random set of 3, and third random set of 3 (for this example, but the one I want to create should have 50)

random_1 = ['9','2','3']
random_2 = ['1','5','6']
random_3 = ['8','7','0']

So the last ‘4’ will not be included which is fine.

How can I select 50 from the list that I randomized ?

Even better, how could I select 50 at random from the original list ?


回答 0

如果列表按随机顺序排列,则可以只取前50个。

否则,使用

import random
random.sample(the_list, 50)

random.sample 帮助文字:

sample(self, population, k) method of random.Random instance
    Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(xrange(10000000), 60)

If the list is in random order, you can just take the first 50.

Otherwise, use

import random
random.sample(the_list, 50)

random.sample help text:

sample(self, population, k) method of random.Random instance
    Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(xrange(10000000), 60)

回答 1

选择随机项目的一种简单方法是先随机播放然后切片。

import random
a = [1,2,3,4,5,6,7,8,9]
random.shuffle(a)
print a[:4] # prints 4 random variables

One easy way to select random items is to shuffle then slice.

import random
a = [1,2,3,4,5,6,7,8,9]
random.shuffle(a)
print a[:4] # prints 4 random variables

回答 2

我认为random.choice()是更好的选择。

import numpy as np

mylist = [13,23,14,52,6,23]

np.random.choice(mylist, 3, replace=False)

该函数从列表中返回3个随机选择的值的数组

I think random.choice() is a better option.

import numpy as np

mylist = [13,23,14,52,6,23]

np.random.choice(mylist, 3, replace=False)

the function returns an array of 3 randomly chosen values from the list


回答 3

假设您的列表包含100个元素,并且您想随机选择50个元素。以下是要遵循的步骤:

  1. 导入库
  2. 为随机数生成器创建种子,我将其放在2
  3. 准备一个数字列表,从中随机抽取
  4. 从数字列表中进行随机选择

码:

from random import seed
from random import choice

seed(2)
numbers = [i for i in range(100)]

print(numbers)

for _ in range(50):
    selection = choice(numbers)
    print(selection)

Say your list has 100 elements and you want to pick 50 of them in a random way. Here are the steps to follow:

  1. Import the libraries
  2. Create the seed for random number generator, I have put it at 2
  3. Prepare a list of numbers from which to pick up in a random way
  4. Make the random choices from the numbers list

Code:

from random import seed
from random import choice

seed(2)
numbers = [i for i in range(100)]

print(numbers)

for _ in range(50):
    selection = choice(numbers)
    print(selection)

从列表或元组中明确选择项目

问题:从列表或元组中明确选择项目

我有以下Python列表(也可以是元组):

myList = ['foo', 'bar', 'baz', 'quux']

我可以说

>>> myList[0:3]
['foo', 'bar', 'baz']
>>> myList[::2]
['foo', 'baz']
>>> myList[1::2]
['bar', 'quux']

如何显式挑选索引没有特定模式的项目?例如,我要选择[0,2,3]。或者,从1000个很大的清单中,我要选择[87, 342, 217, 998, 500]。是否有一些Python语法可以做到这一点?看起来像这样:

>>> myBigList[87, 342, 217, 998, 500]

I have the following Python list (can also be a tuple):

myList = ['foo', 'bar', 'baz', 'quux']

I can say

>>> myList[0:3]
['foo', 'bar', 'baz']
>>> myList[::2]
['foo', 'baz']
>>> myList[1::2]
['bar', 'quux']

How do I explicitly pick out items whose indices have no specific patterns? For example, I want to select [0,2,3]. Or from a very big list of 1000 items, I want to select [87, 342, 217, 998, 500]. Is there some Python syntax that does that? Something that looks like:

>>> myBigList[87, 342, 217, 998, 500]

回答 0

list( myBigList[i] for i in [87, 342, 217, 998, 500] )

我将答案与python 2.5.2进行了比较:

  • 19.7微秒: [ myBigList[i] for i in [87, 342, 217, 998, 500] ]

  • 20.6 USEC: map(myBigList.__getitem__, (87, 342, 217, 998, 500))

  • 22.7 USEC: itemgetter(87, 342, 217, 998, 500)(myBigList)

  • 24.6 USEC: list( myBigList[i] for i in [87, 342, 217, 998, 500] )

请注意,在Python 3中,第1个已更改为与第4个相同。


另一种选择是以a开头,numpy.array它允许通过列表或a进行索引numpy.array

>>> import numpy
>>> myBigList = numpy.array(range(1000))
>>> myBigList[(87, 342, 217, 998, 500)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: invalid index
>>> myBigList[[87, 342, 217, 998, 500]]
array([ 87, 342, 217, 998, 500])
>>> myBigList[numpy.array([87, 342, 217, 998, 500])]
array([ 87, 342, 217, 998, 500])

tuple不工作方式相同那些片。

list( myBigList[i] for i in [87, 342, 217, 998, 500] )

I compared the answers with python 2.5.2:

  • 19.7 usec: [ myBigList[i] for i in [87, 342, 217, 998, 500] ]

  • 20.6 usec: map(myBigList.__getitem__, (87, 342, 217, 998, 500))

  • 22.7 usec: itemgetter(87, 342, 217, 998, 500)(myBigList)

  • 24.6 usec: list( myBigList[i] for i in [87, 342, 217, 998, 500] )

Note that in Python 3, the 1st was changed to be the same as the 4th.


Another option would be to start out with a numpy.array which allows indexing via a list or a numpy.array:

>>> import numpy
>>> myBigList = numpy.array(range(1000))
>>> myBigList[(87, 342, 217, 998, 500)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: invalid index
>>> myBigList[[87, 342, 217, 998, 500]]
array([ 87, 342, 217, 998, 500])
>>> myBigList[numpy.array([87, 342, 217, 998, 500])]
array([ 87, 342, 217, 998, 500])

The tuple doesn’t work the same way as those are slices.


回答 1

那这个呢:

from operator import itemgetter
itemgetter(0,2,3)(myList)
('foo', 'baz', 'quux')

What about this:

from operator import itemgetter
itemgetter(0,2,3)(myList)
('foo', 'baz', 'quux')

回答 2

它不是内置的,但是如果您愿意,可以创建一个将元组作为“索引”的list的子类:

class MyList(list):

    def __getitem__(self, index):
        if isinstance(index, tuple):
            return [self[i] for i in index]
        return super(MyList, self).__getitem__(index)


seq = MyList("foo bar baaz quux mumble".split())
print seq[0]
print seq[2,4]
print seq[1::2]

印刷

foo
['baaz', 'mumble']
['bar', 'quux']

It isn’t built-in, but you can make a subclass of list that takes tuples as “indexes” if you’d like:

class MyList(list):

    def __getitem__(self, index):
        if isinstance(index, tuple):
            return [self[i] for i in index]
        return super(MyList, self).__getitem__(index)


seq = MyList("foo bar baaz quux mumble".split())
print seq[0]
print seq[2,4]
print seq[1::2]

printing

foo
['baaz', 'mumble']
['bar', 'quux']

回答 3

也许列表理解是按顺序进行的:

L = ['a', 'b', 'c', 'd', 'e', 'f']
print [ L[index] for index in [1,3,5] ]

生成:

['b', 'd', 'f']

那是您要找的东西吗?

Maybe a list comprehension is in order:

L = ['a', 'b', 'c', 'd', 'e', 'f']
print [ L[index] for index in [1,3,5] ]

Produces:

['b', 'd', 'f']

Is that what you are looking for?


回答 4

>>> map(myList.__getitem__, (2,2,1,3))
('baz', 'baz', 'bar', 'quux')

您也可以创建自己的List类,该类支持将元组用作__getitem__要执行的操作的参数myList[(2,2,1,3)]

>>> map(myList.__getitem__, (2,2,1,3))
('baz', 'baz', 'bar', 'quux')

You can also create your own List class which supports tuples as arguments to __getitem__ if you want to be able to do myList[(2,2,1,3)].


回答 5

我只想指出,即使itemgetter的语法看起来也很整洁,但是在大型列表上执行时有点慢。

import timeit
from operator import itemgetter
start=timeit.default_timer()
for i in range(1000000):
    itemgetter(0,2,3)(myList)
print ("Itemgetter took ", (timeit.default_timer()-start))

物品获取者1.065209062149279

start=timeit.default_timer()
for i in range(1000000):
    myList[0],myList[2],myList[3]
print ("Multiple slice took ", (timeit.default_timer()-start))

多个切片花费0.6225321444745759

I just want to point out, even syntax of itemgetter looks really neat, but it’s kinda slow when perform on large list.

import timeit
from operator import itemgetter
start=timeit.default_timer()
for i in range(1000000):
    itemgetter(0,2,3)(myList)
print ("Itemgetter took ", (timeit.default_timer()-start))

Itemgetter took 1.065209062149279

start=timeit.default_timer()
for i in range(1000000):
    myList[0],myList[2],myList[3]
print ("Multiple slice took ", (timeit.default_timer()-start))

Multiple slice took 0.6225321444745759


回答 6

另一个可能的解决方案:

sek=[]
L=[1,2,3,4,5,6,7,8,9,0]
for i in [2, 4, 7, 0, 3]:
   a=[L[i]]
   sek=sek+a
print (sek)

Another possible solution:

sek=[]
L=[1,2,3,4,5,6,7,8,9,0]
for i in [2, 4, 7, 0, 3]:
   a=[L[i]]
   sek=sek+a
print (sek)

回答 7

当你有一个布尔numpy数组时,就像 mask

[mylist[i] for i in np.arange(len(mask), dtype=int)[mask]]

适用于任何序列或np.array的lambda:

subseq = lambda myseq, mask : [myseq[i] for i in np.arange(len(mask), dtype=int)[mask]]

newseq = subseq(myseq, mask)

like often when you have a boolean numpy array like mask

[mylist[i] for i in np.arange(len(mask), dtype=int)[mask]]

A lambda that works for any sequence or np.array:

subseq = lambda myseq, mask : [myseq[i] for i in np.arange(len(mask), dtype=int)[mask]]

newseq = subseq(myseq, mask)


Python:将元组/字典作为键,进行选择,排序

问题:Python:将元组/字典作为键,进行选择,排序

假设我有大量不同颜色的水果,例如24个蓝色香蕉,12个绿色苹果,0个蓝色草莓等等。我想将它们组织成Python的数据结构,以便于选择和排序。我的想法是将它们放入以元组为键的字典中,例如,

{ ('banana',    'blue' ): 24,
  ('apple',     'green'): 12,
  ('strawberry','blue' ): 0,
  ...
}

甚至字典,例如

{ {'fruit': 'banana',    'color': 'blue' }: 24,
  {'fruit': 'apple',     'color': 'green'}: 12,
  {'fruit': 'strawberry','color': 'blue' }: 0,
  ...
}

例如,我想检索所有蓝色水果或所有颜色的香蕉的列表,或按水果名称对这本词典进行排序。有没有办法做到这一点?

用元组作为键的字典可能不是处理这种情况的正确方法。

欢迎所有建议!

Suppose I have quantities of fruits of different colors, e.g., 24 blue bananas, 12 green apples, 0 blue strawberries and so on. I’d like to organize them in a data structure in Python that allows for easy selection and sorting. My idea was to put them into a dictionary with tuples as keys, e.g.,

{
    ('banana',    'blue' ): 24,
    ('apple',     'green'): 12,
    ('strawberry','blue' ): 0,
    # ...
}

or even dictionaries, e.g.,

{
    {'fruit': 'banana',    'color': 'blue' }: 24,
    {'fruit': 'apple',     'color': 'green'}: 12,
    {'fruit': 'strawberry','color': 'blue' }: 0,
    # ...
}

I’d like to retrieve a list of all blue fruit, or bananas of all colors, for example, or to sort this dictionary by the name of the fruit. Are there ways to do this in a clean way?

It might well be that dictionaries with tuples as keys are not the proper way to handle this situation.

All suggestions welcome!


回答 0

就个人而言,我喜欢python的一件事是tuple-dict组合。您在这里拥有的实际上是一个2d数组(其中x =水果名称,y =颜色),而且我通常是实现2d数组的元组字典的支持者,至少在诸如之类numpy的数据库不适合使用时。简而言之,我认为您有一个很好的方法。

请注意,如果不做一些额外的工作,就不能将字典用作字典中的键,因此这不是一个很好的解决方案。

也就是说,您还应该考虑namedtuple()。这样,您可以执行以下操作:

>>> from collections import namedtuple
>>> Fruit = namedtuple("Fruit", ["name", "color"])
>>> f = Fruit(name="banana", color="red")
>>> print f
Fruit(name='banana', color='red')
>>> f.name
'banana'
>>> f.color
'red'

现在您可以使用fruitcount字典:

>>> fruitcount = {Fruit("banana", "red"):5}
>>> fruitcount[f]
5

其他技巧:

>>> fruits = fruitcount.keys()
>>> fruits.sort()
>>> print fruits
[Fruit(name='apple', color='green'), 
 Fruit(name='apple', color='red'), 
 Fruit(name='banana', color='blue'), 
 Fruit(name='strawberry', color='blue')]
>>> fruits.sort(key=lambda x:x.color)
>>> print fruits
[Fruit(name='banana', color='blue'), 
 Fruit(name='strawberry', color='blue'), 
 Fruit(name='apple', color='green'), 
 Fruit(name='apple', color='red')]

与chmullig相呼应,要获得一个水果的所有颜色的列表,您必须过滤键,即

bananas = [fruit for fruit in fruits if fruit.name=='banana']

Personally, one of the things I love about python is the tuple-dict combination. What you have here is effectively a 2d array (where x = fruit name and y = color), and I am generally a supporter of the dict of tuples for implementing 2d arrays, at least when something like numpy or a database isn’t more appropriate. So in short, I think you’ve got a good approach.

Note that you can’t use dicts as keys in a dict without doing some extra work, so that’s not a very good solution.

That said, you should also consider namedtuple(). That way you could do this:

>>> from collections import namedtuple
>>> Fruit = namedtuple("Fruit", ["name", "color"])
>>> f = Fruit(name="banana", color="red")
>>> print f
Fruit(name='banana', color='red')
>>> f.name
'banana'
>>> f.color
'red'

Now you can use your fruitcount dict:

>>> fruitcount = {Fruit("banana", "red"):5}
>>> fruitcount[f]
5

Other tricks:

>>> fruits = fruitcount.keys()
>>> fruits.sort()
>>> print fruits
[Fruit(name='apple', color='green'), 
 Fruit(name='apple', color='red'), 
 Fruit(name='banana', color='blue'), 
 Fruit(name='strawberry', color='blue')]
>>> fruits.sort(key=lambda x:x.color)
>>> print fruits
[Fruit(name='banana', color='blue'), 
 Fruit(name='strawberry', color='blue'), 
 Fruit(name='apple', color='green'), 
 Fruit(name='apple', color='red')]

Echoing chmullig, to get a list of all colors of one fruit, you would have to filter the keys, i.e.

bananas = [fruit for fruit in fruits if fruit.name=='banana']

回答 1

最好的选择是创建一个简单的数据结构来对您所拥有的进行建模。然后,您可以将这些对象存储在一个简单的列表中,并根据需要进行排序/检索。

对于这种情况,我将使用以下类:

class Fruit:
    def __init__(self, name, color, quantity): 
        self.name = name
        self.color = color
        self.quantity = quantity

    def __str__(self):
        return "Name: %s, Color: %s, Quantity: %s" % \
     (self.name, self.color, self.quantity)

然后,您可以简单地构造“ Fruit”实例并将其添加到列表中,如下所示:

fruit1 = Fruit("apple", "red", 12)
fruit2 = Fruit("pear", "green", 22)
fruit3 = Fruit("banana", "yellow", 32)
fruits = [fruit3, fruit2, fruit1] 

简单的列表fruits将更加容易,混乱并且维护得更好。

一些使用示例:

下面的所有输出是运行给定代码段后的结果:

for fruit in fruits:
    print fruit

未排序清单:

显示:

Name: banana, Color: yellow, Quantity: 32
Name: pear, Color: green, Quantity: 22
Name: apple, Color: red, Quantity: 12

按名称按字母顺序排序:

fruits.sort(key=lambda x: x.name.lower())

显示:

Name: apple, Color: red, Quantity: 12
Name: banana, Color: yellow, Quantity: 32
Name: pear, Color: green, Quantity: 22

按数量排序:

fruits.sort(key=lambda x: x.quantity)

显示:

Name: apple, Color: red, Quantity: 12
Name: pear, Color: green, Quantity: 22
Name: banana, Color: yellow, Quantity: 32

颜色==红色:

red_fruit = filter(lambda f: f.color == "red", fruits)

显示:

Name: apple, Color: red, Quantity: 12

Your best option will be to create a simple data structure to model what you have. Then you can store these objects in a simple list and sort/retrieve them any way you wish.

For this case, I’d use the following class:

class Fruit:
    def __init__(self, name, color, quantity): 
        self.name = name
        self.color = color
        self.quantity = quantity

    def __str__(self):
        return "Name: %s, Color: %s, Quantity: %s" % \
     (self.name, self.color, self.quantity)

Then you can simply construct “Fruit” instances and add them to a list, as shown in the following manner:

fruit1 = Fruit("apple", "red", 12)
fruit2 = Fruit("pear", "green", 22)
fruit3 = Fruit("banana", "yellow", 32)
fruits = [fruit3, fruit2, fruit1] 

The simple list fruits will be much easier, less confusing, and better-maintained.

Some examples of use:

All outputs below is the result after running the given code snippet followed by:

for fruit in fruits:
    print fruit

Unsorted list:

Displays:

Name: banana, Color: yellow, Quantity: 32
Name: pear, Color: green, Quantity: 22
Name: apple, Color: red, Quantity: 12

Sorted alphabetically by name:

fruits.sort(key=lambda x: x.name.lower())

Displays:

Name: apple, Color: red, Quantity: 12
Name: banana, Color: yellow, Quantity: 32
Name: pear, Color: green, Quantity: 22

Sorted by quantity:

fruits.sort(key=lambda x: x.quantity)

Displays:

Name: apple, Color: red, Quantity: 12
Name: pear, Color: green, Quantity: 22
Name: banana, Color: yellow, Quantity: 32

Where color == red:

red_fruit = filter(lambda f: f.color == "red", fruits)

Displays:

Name: apple, Color: red, Quantity: 12

回答 2

数据库,词典的字典,词典列表的字典,命名为tuple(这是一个子类),sqlite,冗余…我不敢相信自己的眼睛。还有什么 ?

“很可能以元组为键的字典不是处理这种情况的正确方法。”

“我的直觉是数据库对于OP的需求而言是过大的;”

是的 我想

因此,我认为,一个元组列表就足够了:

from operator import itemgetter

li = [  ('banana',     'blue'   , 24) ,
        ('apple',      'green'  , 12) ,
        ('strawberry', 'blue'   , 16 ) ,
        ('banana',     'yellow' , 13) ,
        ('apple',      'gold'   , 3 ) ,
        ('pear',       'yellow' , 10) ,
        ('strawberry', 'orange' , 27) ,
        ('apple',      'blue'   , 21) ,
        ('apple',      'silver' , 0 ) ,
        ('strawberry', 'green'  , 4 ) ,
        ('banana',     'brown'  , 14) ,
        ('strawberry', 'yellow' , 31) ,
        ('apple',      'pink'   , 9 ) ,
        ('strawberry', 'gold'   , 0 ) ,
        ('pear',       'gold'   , 66) ,
        ('apple',      'yellow' , 9 ) ,
        ('pear',       'brown'  , 5 ) ,
        ('strawberry', 'pink'   , 8 ) ,
        ('apple',      'purple' , 7 ) ,
        ('pear',       'blue'   , 51) ,
        ('chesnut',    'yellow',  0 )   ]


print set( u[1] for u in li ),': all potential colors'
print set( c for f,c,n in li if n!=0),': all effective colors'
print [ c for f,c,n in li if f=='banana' ],': all potential colors of bananas'
print [ c for f,c,n in li if f=='banana' and n!=0],': all effective colors of bananas'
print

print set( u[0] for u in li ),': all potential fruits'
print set( f for f,c,n in li if n!=0),': all effective fruits'
print [ f for f,c,n in li if c=='yellow' ],': all potential fruits being yellow'
print [ f for f,c,n in li if c=='yellow' and n!=0],': all effective fruits being yellow'
print

print len(set( u[1] for u in li )),': number of all potential colors'
print len(set(c for f,c,n in li if n!=0)),': number of all effective colors'
print len( [c for f,c,n in li if f=='strawberry']),': number of potential colors of strawberry'
print len( [c for f,c,n in li if f=='strawberry' and n!=0]),': number of effective colors of strawberry'
print

# sorting li by name of fruit
print sorted(li),'  sorted li by name of fruit'
print

# sorting li by number 
print sorted(li, key = itemgetter(2)),'  sorted li by number'
print

# sorting li first by name of color and secondly by name of fruit
print sorted(li, key = itemgetter(1,0)),'  sorted li first by name of color and secondly by name of fruit'
print

结果

set(['blue', 'brown', 'gold', 'purple', 'yellow', 'pink', 'green', 'orange', 'silver']) : all potential colors
set(['blue', 'brown', 'gold', 'purple', 'yellow', 'pink', 'green', 'orange']) : all effective colors
['blue', 'yellow', 'brown'] : all potential colors of bananas
['blue', 'yellow', 'brown'] : all effective colors of bananas

set(['strawberry', 'chesnut', 'pear', 'banana', 'apple']) : all potential fruits
set(['strawberry', 'pear', 'banana', 'apple']) : all effective fruits
['banana', 'pear', 'strawberry', 'apple', 'chesnut'] : all potential fruits being yellow
['banana', 'pear', 'strawberry', 'apple'] : all effective fruits being yellow

9 : number of all potential colors
8 : number of all effective colors
6 : number of potential colors of strawberry
5 : number of effective colors of strawberry

[('apple', 'blue', 21), ('apple', 'gold', 3), ('apple', 'green', 12), ('apple', 'pink', 9), ('apple', 'purple', 7), ('apple', 'silver', 0), ('apple', 'yellow', 9), ('banana', 'blue', 24), ('banana', 'brown', 14), ('banana', 'yellow', 13), ('chesnut', 'yellow', 0), ('pear', 'blue', 51), ('pear', 'brown', 5), ('pear', 'gold', 66), ('pear', 'yellow', 10), ('strawberry', 'blue', 16), ('strawberry', 'gold', 0), ('strawberry', 'green', 4), ('strawberry', 'orange', 27), ('strawberry', 'pink', 8), ('strawberry', 'yellow', 31)]   sorted li by name of fruit

[('apple', 'silver', 0), ('strawberry', 'gold', 0), ('chesnut', 'yellow', 0), ('apple', 'gold', 3), ('strawberry', 'green', 4), ('pear', 'brown', 5), ('apple', 'purple', 7), ('strawberry', 'pink', 8), ('apple', 'pink', 9), ('apple', 'yellow', 9), ('pear', 'yellow', 10), ('apple', 'green', 12), ('banana', 'yellow', 13), ('banana', 'brown', 14), ('strawberry', 'blue', 16), ('apple', 'blue', 21), ('banana', 'blue', 24), ('strawberry', 'orange', 27), ('strawberry', 'yellow', 31), ('pear', 'blue', 51), ('pear', 'gold', 66)]   sorted li by number

[('apple', 'blue', 21), ('banana', 'blue', 24), ('pear', 'blue', 51), ('strawberry', 'blue', 16), ('banana', 'brown', 14), ('pear', 'brown', 5), ('apple', 'gold', 3), ('pear', 'gold', 66), ('strawberry', 'gold', 0), ('apple', 'green', 12), ('strawberry', 'green', 4), ('strawberry', 'orange', 27), ('apple', 'pink', 9), ('strawberry', 'pink', 8), ('apple', 'purple', 7), ('apple', 'silver', 0), ('apple', 'yellow', 9), ('banana', 'yellow', 13), ('chesnut', 'yellow', 0), ('pear', 'yellow', 10), ('strawberry', 'yellow', 31)]   sorted li first by name of color and secondly by name of fruit

Database, dict of dicts, dictionary of list of dictionaries, named tuple (it’s a subclass), sqlite, redundancy… I didn’t believe my eyes. What else ?

“It might well be that dictionaries with tuples as keys are not the proper way to handle this situation.”

“my gut feeling is that a database is overkill for the OP’s needs; “

Yeah! I thought

So, in my opinion, a list of tuples is plenty enough :

from operator import itemgetter

li = [  ('banana',     'blue'   , 24) ,
        ('apple',      'green'  , 12) ,
        ('strawberry', 'blue'   , 16 ) ,
        ('banana',     'yellow' , 13) ,
        ('apple',      'gold'   , 3 ) ,
        ('pear',       'yellow' , 10) ,
        ('strawberry', 'orange' , 27) ,
        ('apple',      'blue'   , 21) ,
        ('apple',      'silver' , 0 ) ,
        ('strawberry', 'green'  , 4 ) ,
        ('banana',     'brown'  , 14) ,
        ('strawberry', 'yellow' , 31) ,
        ('apple',      'pink'   , 9 ) ,
        ('strawberry', 'gold'   , 0 ) ,
        ('pear',       'gold'   , 66) ,
        ('apple',      'yellow' , 9 ) ,
        ('pear',       'brown'  , 5 ) ,
        ('strawberry', 'pink'   , 8 ) ,
        ('apple',      'purple' , 7 ) ,
        ('pear',       'blue'   , 51) ,
        ('chesnut',    'yellow',  0 )   ]


print set( u[1] for u in li ),': all potential colors'
print set( c for f,c,n in li if n!=0),': all effective colors'
print [ c for f,c,n in li if f=='banana' ],': all potential colors of bananas'
print [ c for f,c,n in li if f=='banana' and n!=0],': all effective colors of bananas'
print

print set( u[0] for u in li ),': all potential fruits'
print set( f for f,c,n in li if n!=0),': all effective fruits'
print [ f for f,c,n in li if c=='yellow' ],': all potential fruits being yellow'
print [ f for f,c,n in li if c=='yellow' and n!=0],': all effective fruits being yellow'
print

print len(set( u[1] for u in li )),': number of all potential colors'
print len(set(c for f,c,n in li if n!=0)),': number of all effective colors'
print len( [c for f,c,n in li if f=='strawberry']),': number of potential colors of strawberry'
print len( [c for f,c,n in li if f=='strawberry' and n!=0]),': number of effective colors of strawberry'
print

# sorting li by name of fruit
print sorted(li),'  sorted li by name of fruit'
print

# sorting li by number 
print sorted(li, key = itemgetter(2)),'  sorted li by number'
print

# sorting li first by name of color and secondly by name of fruit
print sorted(li, key = itemgetter(1,0)),'  sorted li first by name of color and secondly by name of fruit'
print

result

set(['blue', 'brown', 'gold', 'purple', 'yellow', 'pink', 'green', 'orange', 'silver']) : all potential colors
set(['blue', 'brown', 'gold', 'purple', 'yellow', 'pink', 'green', 'orange']) : all effective colors
['blue', 'yellow', 'brown'] : all potential colors of bananas
['blue', 'yellow', 'brown'] : all effective colors of bananas

set(['strawberry', 'chesnut', 'pear', 'banana', 'apple']) : all potential fruits
set(['strawberry', 'pear', 'banana', 'apple']) : all effective fruits
['banana', 'pear', 'strawberry', 'apple', 'chesnut'] : all potential fruits being yellow
['banana', 'pear', 'strawberry', 'apple'] : all effective fruits being yellow

9 : number of all potential colors
8 : number of all effective colors
6 : number of potential colors of strawberry
5 : number of effective colors of strawberry

[('apple', 'blue', 21), ('apple', 'gold', 3), ('apple', 'green', 12), ('apple', 'pink', 9), ('apple', 'purple', 7), ('apple', 'silver', 0), ('apple', 'yellow', 9), ('banana', 'blue', 24), ('banana', 'brown', 14), ('banana', 'yellow', 13), ('chesnut', 'yellow', 0), ('pear', 'blue', 51), ('pear', 'brown', 5), ('pear', 'gold', 66), ('pear', 'yellow', 10), ('strawberry', 'blue', 16), ('strawberry', 'gold', 0), ('strawberry', 'green', 4), ('strawberry', 'orange', 27), ('strawberry', 'pink', 8), ('strawberry', 'yellow', 31)]   sorted li by name of fruit

[('apple', 'silver', 0), ('strawberry', 'gold', 0), ('chesnut', 'yellow', 0), ('apple', 'gold', 3), ('strawberry', 'green', 4), ('pear', 'brown', 5), ('apple', 'purple', 7), ('strawberry', 'pink', 8), ('apple', 'pink', 9), ('apple', 'yellow', 9), ('pear', 'yellow', 10), ('apple', 'green', 12), ('banana', 'yellow', 13), ('banana', 'brown', 14), ('strawberry', 'blue', 16), ('apple', 'blue', 21), ('banana', 'blue', 24), ('strawberry', 'orange', 27), ('strawberry', 'yellow', 31), ('pear', 'blue', 51), ('pear', 'gold', 66)]   sorted li by number

[('apple', 'blue', 21), ('banana', 'blue', 24), ('pear', 'blue', 51), ('strawberry', 'blue', 16), ('banana', 'brown', 14), ('pear', 'brown', 5), ('apple', 'gold', 3), ('pear', 'gold', 66), ('strawberry', 'gold', 0), ('apple', 'green', 12), ('strawberry', 'green', 4), ('strawberry', 'orange', 27), ('apple', 'pink', 9), ('strawberry', 'pink', 8), ('apple', 'purple', 7), ('apple', 'silver', 0), ('apple', 'yellow', 9), ('banana', 'yellow', 13), ('chesnut', 'yellow', 0), ('pear', 'yellow', 10), ('strawberry', 'yellow', 31)]   sorted li first by name of color and secondly by name of fruit

回答 3

在这种情况下,字典可能不是您应该使用的字典。功能更全的库将是更好的选择。可能是真实的数据库。最简单的是sqlite。您可以通过传递字符串’:memory:’而不是文件名来将整个内容保留在内存中。

如果您确实想继续沿着这条路径前进,则可以使用键或值中的额外属性来完成。但是,字典不能是另一本字典的键,而元组可以。该文档说明了允许的内容。它必须是一个不可变的对象,其中包括仅包含字符串和数字的字符串,数字和元组(以及递归仅包含那些类型的更多元组…)。

您可以使用做第一个示例d = {('apple', 'red') : 4},但是要查询所需的内容将非常困难。您需要执行以下操作:

#find all apples
apples = [d[key] for key in d.keys() if key[0] == 'apple']

#find all red items
red = [d[key] for key in d.keys() if key[1] == 'red']

#the red apple
redapples = d[('apple', 'red')]

A dictionary probably isn’t what you should be using in this case. A more full featured library would be a better alternative. Probably a real database. The easiest would be sqlite. You can keep the whole thing in memory by passing in the string ‘:memory:’ instead of a filename.

If you do want to continue down this path, you can do it with the extra attributes in the key or the value. However a dictionary can’t be the key to a another dictionary, but a tuple can. The docs explain what’s allowable. It must be an immutable object, which includes strings, numbers and tuples that contain only strings and numbers (and more tuples containing only those types recursively…).

You could do your first example with d = {('apple', 'red') : 4}, but it’ll be very hard to query for what you want. You’d need to do something like this:

#find all apples
apples = [d[key] for key in d.keys() if key[0] == 'apple']

#find all red items
red = [d[key] for key in d.keys() if key[1] == 'red']

#the red apple
redapples = d[('apple', 'red')]

回答 4

使用键作为元组时,只需使用给定的第二个组件过滤键并对其进行排序:

blue_fruit = sorted([k for k in data.keys() if k[1] == 'blue'])
for k in blue_fruit:
  print k[0], data[k] # prints 'banana 24', etc

排序之所以有效,是因为如果元组的组成部分具有自然顺序,则它们具有自然顺序。

使用键作为完全成熟的对象,只需按即可过滤k.color == 'blue'

您不能真正将dicts用作键,但是可以创建一个最简单的类,例如class Foo(object): pass,并向其动态添加任何属性:

k = Foo()
k.color = 'blue'

这些实例可以用作字典键,但要注意其可变性!

With keys as tuples, you just filter the keys with given second component and sort it:

blue_fruit = sorted([k for k in data.keys() if k[1] == 'blue'])
for k in blue_fruit:
  print k[0], data[k] # prints 'banana 24', etc

Sorting works because tuples have natural ordering if their components have natural ordering.

With keys as rather full-fledged objects, you just filter by k.color == 'blue'.

You can’t really use dicts as keys, but you can create a simplest class like class Foo(object): pass and add any attributes to it on the fly:

k = Foo()
k.color = 'blue'

These instances can serve as dict keys, but beware their mutability!


回答 5

您可能有一个词典,其中的条目是其他词典的列表:

fruit_dict = dict()
fruit_dict['banana'] = [{'yellow': 24}]
fruit_dict['apple'] = [{'red': 12}, {'green': 14}]
print fruit_dict

输出:

{‘香蕉’:[{‘黄色’:24}],’苹果’:[{‘红色’:12},{‘绿色’:14}]}

编辑:正如eumiro指出的那样,您可以使用词典字典:

fruit_dict = dict()
fruit_dict['banana'] = {'yellow': 24}
fruit_dict['apple'] = {'red': 12, 'green': 14}
print fruit_dict

输出:

{‘香蕉’:{‘黄色’:24},’苹果’:{‘绿色’:14,’红色’:12}}

You could have a dictionary where the entries are a list of other dictionaries:

fruit_dict = dict()
fruit_dict['banana'] = [{'yellow': 24}]
fruit_dict['apple'] = [{'red': 12}, {'green': 14}]
print fruit_dict

Output:

{‘banana’: [{‘yellow’: 24}], ‘apple’: [{‘red’: 12}, {‘green’: 14}]}

Edit: As eumiro pointed out, you could use a dictionary of dictionaries:

fruit_dict = dict()
fruit_dict['banana'] = {'yellow': 24}
fruit_dict['apple'] = {'red': 12, 'green': 14}
print fruit_dict

Output:

{‘banana’: {‘yellow’: 24}, ‘apple’: {‘green’: 14, ‘red’: 12}}


回答 6

从类似Trie的数据结构中有效提取此类数据。它还允许快速排序。内存效率可能不会那么好。

传统的trie将单词的每个字母存储为树中的节点。但是在您的情况下,您的“字母”是不同的。您正在存储字符串而不是字符。

它可能看起来像这样:

root:                Root
                     /|\
                    / | \
                   /  |  \     
fruit:       Banana Apple Strawberry
              / |      |     \
             /  |      |      \
color:     Blue Yellow Green  Blue
            /   |       |       \
           /    |       |        \
end:      24   100      12        0

看到这个链接:在Python中的特里

This type of data is efficiently pulled from a Trie-like data structure. It also allows for fast sorting. The memory efficiency might not be that great though.

A traditional trie stores each letter of a word as a node in the tree. But in your case your “alphabet” is different. You are storing strings instead of characters.

it might look something like this:

root:                Root
                     /|\
                    / | \
                   /  |  \     
fruit:       Banana Apple Strawberry
              / |      |     \
             /  |      |      \
color:     Blue Yellow Green  Blue
            /   |       |       \
           /    |       |        \
end:      24   100      12        0

see this link: trie in python


回答 7

您要独立使用两个键,因此有两个选择:

  1. 有两个类型的字典作为存储冗余数据{'banana' : {'blue' : 4, ...}, .... }{'blue': {'banana':4, ...} ...}。然后,搜索和排序很容易,但是您必须确保同时修改字典。

  2. 将其仅存储一个字典,然后编写对其进行迭代的函数,例如:

    d = {'banana' : {'blue' : 4, 'yellow':6}, 'apple':{'red':1} }
    
    blueFruit = [(fruit,d[fruit]['blue']) if d[fruit].has_key('blue') for fruit in d.keys()]

You want to use two keys independently, so you have two choices:

  1. Store the data redundantly with two dicts as {'banana' : {'blue' : 4, ...}, .... } and {'blue': {'banana':4, ...} ...}. Then, searching and sorting is easy but you have to make sure you modify the dicts together.

  2. Store it just one dict, and then write functions that iterate over them eg.:

    d = {'banana' : {'blue' : 4, 'yellow':6}, 'apple':{'red':1} }
    
    blueFruit = [(fruit,d[fruit]['blue']) if d[fruit].has_key('blue') for fruit in d.keys()]
    

在pandas数据框中选择多个列

问题:在pandas数据框中选择多个列

我在不同的列中有数据,但是我不知道如何提取数据以将其保存在另一个变量中。

index  a   b   c
1      2   3   4
2      3   4   5

如何选择'a''b'并将其保存到df1?

我试过了

df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']

似乎没有任何工作。

I have data in different columns but I don’t know how to extract it to save it in another variable.

index  a   b   c
1      2   3   4
2      3   4   5

How do I select 'a', 'b' and save it in to df1?

I tried

df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']

None seem to work.


回答 0

列名(字符串)无法按照您尝试的方式进行切片。

在这里,您有两个选择。如果您从上下文中知道要切出哪些变量,则可以通过将列表传递给__getitem__语法([])来仅返回那些列的视图。

df1 = df[['a','b']]

或者,如果需要对它们进行数字索引而不是按其名称进行索引(例如,您的代码应在不知道前两列名称的情况下自动执行此操作),则可以执行以下操作:

df1 = df.iloc[:,0:2] # Remember that Python does not slice inclusive of the ending index.

此外,您应该熟悉Pandas对象与该对象副本的视图概念。上述方法中的第一个将在内存中返回所需子对象(所需切片)的新副本。

但是,有时熊猫中有一些索引约定不执行此操作,而是给您一个新变量,该变量仅引用与原始对象中的子对象或切片相同的内存块。第二种索引编制方式会发生这种情况,因此您可以使用copy()函数对其进行修改以获取常规副本。发生这种情况时,更改您认为是切片对象的内容有时会更改原始对象。始终对此保持警惕。

df1 = df.iloc[0,0:2].copy() # To avoid the case where changing df1 also changes df

要使用iloc,您需要知道列位置(或索引)。由于列位置可能会改变,而不是硬编码索引,则可以使用ilocget_loc功能的columns数据框对象的方法来获得列索引。

{df.columns.get_loc(c):c for idx, c in enumerate(df.columns)}

现在,您可以使用此字典通过名称和使用来访问列iloc

The column names (which are strings) cannot be sliced in the manner you tried.

Here you have a couple of options. If you know from context which variables you want to slice out, you can just return a view of only those columns by passing a list into the __getitem__ syntax (the []’s).

df1 = df[['a','b']]

Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:

df1 = df.iloc[:,0:2] # Remember that Python does not slice inclusive of the ending index.

Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).

Sometimes, however, there are indexing conventions in Pandas that don’t do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This will happen with the second way of indexing, so you can modify it with the copy() function to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. Always good to be on the look out for this.

df1 = df.iloc[0,0:2].copy() # To avoid the case where changing df1 also changes df

To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices, you can use iloc along with get_loc function of columns method of dataframe object to obtain column indices.

{df.columns.get_loc(c):c for idx, c in enumerate(df.columns)}

Now you can use this dictionary to access columns through names and using iloc.


回答 1

从0.11.0版本开始,可以按照您尝试使用.loc索引器的方式对列进行切片:

df.loc[:, 'C':'E']

等价于

df[['C', 'D', 'E']]  # or df.loc[:, ['C', 'D', 'E']]

C通过返回列E


随机生成的DataFrame的演示:

import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)), 
                  columns=list('ABCDEF'), 
                  index=['R{}'.format(i) for i in range(100)])
df.head()

Out: 
     A   B   C   D   E   F
R0  99  78  61  16  73   8
R1  62  27  30  80   7  76
R2  15  53  80  27  44  77
R3  75  65  47  30  84  86
R4  18   9  41  62   1  82

要从C到E获得列(请注意,与整数切片不同,列中包含’E’):

df.loc[:, 'C':'E']

Out: 
      C   D   E
R0   61  16  73
R1   30  80   7
R2   80  27  44
R3   47  30  84
R4   41  62   1
R5    5  58   0
...

同样适用于基于标签选择行。从这些列中获取行“ R6”至“ R10”:

df.loc['R6':'R10', 'C':'E']

Out: 
      C   D   E
R6   51  27  31
R7   83  19  18
R8   11  67  65
R9   78  27  29
R10   7  16  94

.loc还接受一个布尔数组,因此您可以选择在数组中对应条目为的列True。例如,如果列名称在列表中,则df.columns.isin(list('BCD'))返回array([False, True, True, True, False, False], dtype=bool)-True ['B', 'C', 'D'];错误,否则。

df.loc[:, df.columns.isin(list('BCD'))]

Out: 
      B   C   D
R0   78  61  16
R1   27  30  80
R2   53  80  27
R3   65  47  30
R4    9  41  62
R5   78   5  58
...

As of version 0.11.0, columns can be sliced in the manner you tried using the .loc indexer:

df.loc[:, 'C':'E']

is equivalent of

df[['C', 'D', 'E']]  # or df.loc[:, ['C', 'D', 'E']]

and returns columns C through E.


A demo on a randomly generated DataFrame:

import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)), 
                  columns=list('ABCDEF'), 
                  index=['R{}'.format(i) for i in range(100)])
df.head()

Out: 
     A   B   C   D   E   F
R0  99  78  61  16  73   8
R1  62  27  30  80   7  76
R2  15  53  80  27  44  77
R3  75  65  47  30  84  86
R4  18   9  41  62   1  82

To get the columns from C to E (note that unlike integer slicing, ‘E’ is included in the columns):

df.loc[:, 'C':'E']

Out: 
      C   D   E
R0   61  16  73
R1   30  80   7
R2   80  27  44
R3   47  30  84
R4   41  62   1
R5    5  58   0
...

Same works for selecting rows based on labels. Get the rows ‘R6’ to ‘R10’ from those columns:

df.loc['R6':'R10', 'C':'E']

Out: 
      C   D   E
R6   51  27  31
R7   83  19  18
R8   11  67  65
R9   78  27  29
R10   7  16  94

.loc also accepts a boolean array so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool) – True if the column name is in the list ['B', 'C', 'D']; False, otherwise.

df.loc[:, df.columns.isin(list('BCD'))]

Out: 
      B   C   D
R0   78  61  16
R1   27  30  80
R2   53  80  27
R3   65  47  30
R4    9  41  62
R5   78   5  58
...

回答 2

假设列名(df.columns)为['index','a','b','c'],则所需数据在第3列和第4列中。如果在脚本运行时不知道它们的名称,则可以执行此操作

newdf = df[df.columns[2:4]] # Remember, Python is 0-offset! The "3rd" entry is at slot 2.

正如EMS在他的回答中指出的那样,df.ix切片列更加简洁,但是.columns切片界面可能更自然,因为它使用了香草1-D python列表索引/切片语法。

警告:这'index'DataFrame列的坏名称。该标签也用于真实df.index属性Index数组。因此,您的列由返回,df['index']而真正的DataFrame索引由返回df.index。An Index是一种特殊的Series优化方法,用于查找其元素的值。对于df.index,它用于按标签查找行。该df.columns属性也是一个pd.Index数组,用于按标签查找列。

Assuming your column names (df.columns) are ['index','a','b','c'], then the data you want is in the 3rd & 4th columns. If you don’t know their names when your script runs, you can do this

newdf = df[df.columns[2:4]] # Remember, Python is 0-offset! The "3rd" entry is at slot 2.

As EMS points out in his answer, df.ix slices columns a bit more concisely, but the .columns slicing interface might be more natural because it uses the vanilla 1-D python list indexing/slicing syntax.

WARN: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, a Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of Series optimized for lookup of it’s elements’ values. For df.index it’s for looking up rows by their label. That df.columns attribute is also a pd.Index array, for looking up columns by their labels.


回答 3

In [39]: df
Out[39]: 
   index  a  b  c
0      1  2  3  4
1      2  3  4  5

In [40]: df1 = df[['b', 'c']]

In [41]: df1
Out[41]: 
   b  c
0  3  4
1  4  5
In [39]: df
Out[39]: 
   index  a  b  c
0      1  2  3  4
1      2  3  4  5

In [40]: df1 = df[['b', 'c']]

In [41]: df1
Out[41]: 
   b  c
0  3  4
1  4  5

回答 4

我知道这个问题已经很老了,但是在最新版本的熊猫中,有一种简单的方法可以做到这一点。列名(即字符串)可以按您喜欢的任何方式进行切片。

columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)

I realize this question is quite old, but in the latest version of pandas there is an easy way to do exactly this. Column names (which are strings) can be sliced in whatever manner you like.

columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)

回答 5

您可以提供要删除的列的列表,然后仅使用drop()Pandas DataFrame上的函数返回带有所需列的DataFrame。

只是说

colsToDrop = ['a']
df.drop(colsToDrop, axis=1)

将返回仅包含列b和的DataFrame c

drop方法在此处记录

You could provide a list of columns to be dropped and return back the DataFrame with only the columns needed using the drop() function on a Pandas DataFrame.

Just saying

colsToDrop = ['a']
df.drop(colsToDrop, axis=1)

would return a DataFrame with just the columns b and c.

The drop method is documented here.


回答 6

有了熊猫

机智列名称

dataframe[['column1','column2']]

通过iloc和具有索引号的特定列进行选择:

dataframe.iloc[:,[1,2]]

与loc列名称可以像

dataframe.loc[:,['column1','column2']]

With pandas,

wit column names

dataframe[['column1','column2']]

to select by iloc and specific columns with index number:

dataframe.iloc[:,[1,2]]

with loc column names can be used like

dataframe.loc[:,['column1','column2']]

回答 7

我发现此方法非常有用:

# iloc[row slicing, column slicing]
surveys_df.iloc [0:3, 1:4]

可以在这里找到更多详细信息

I found this method to be very useful:

# iloc[row slicing, column slicing]
surveys_df.iloc [0:3, 1:4]

More details can be found here


回答 8

从0.21.0开始,不推荐使用.loc[]使用带有一个或多个缺少标签的列表,而推荐使用.reindex。因此,您的问题的答案是:

df1 = df.reindex(columns=['b','c'])

在以前的版本中,.loc[list-of-labels]只要找到至少一个键就可以使用using (否则将引发KeyError)。此行为已弃用,现在显示警告消息。推荐的替代方法是使用.reindex()

索引和选择数据中了解更多信息

Starting with 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex. So, the answer to your question is:

df1 = df.reindex(columns=['b','c'])

In prior versions, using .loc[list-of-labels] would work as long as at least 1 of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and now shows a warning message. The recommended alternative is to use .reindex().

Read more at Indexing and Selecting Data


回答 9

您可以使用熊猫。我创建了DataFrame:

    import pandas as pd
    df = pd.DataFrame([[1, 2,5], [5,4, 5], [7,7, 8], [7,6,9]], 
                      index=['Jane', 'Peter','Alex','Ann'],
                      columns=['Test_1', 'Test_2', 'Test_3'])

数据框:

           Test_1  Test_2  Test_3
    Jane        1       2       5
    Peter       5       4       5
    Alex        7       7       8
    Ann         7       6       9

要按名称选择1列或更多列:

    df[['Test_1','Test_3']]

           Test_1  Test_3
    Jane        1       5
    Peter       5       5
    Alex        7       8
    Ann         7       9

您还可以使用:

    df.Test_2

和哟列 Test_2

    Jane     2
    Peter    4
    Alex     7
    Ann      6

您也可以使用从这些行中选择列和行.loc()。这称为“切片”。请注意,我从列Test_1Test_3

    df.loc[:,'Test_1':'Test_3']

“切片”为:

            Test_1  Test_2  Test_3
     Jane        1       2       5
     Peter       5       4       5
     Alex        7       7       8
     Ann         7       6       9

如果你只是想PeterAnn来自列Test_1Test_3

    df.loc[['Peter', 'Ann'],['Test_1','Test_3']]

你得到:

           Test_1  Test_3
    Peter       5       5
    Ann         7       9

You can use pandas. I create the DataFrame:

    import pandas as pd
    df = pd.DataFrame([[1, 2,5], [5,4, 5], [7,7, 8], [7,6,9]], 
                      index=['Jane', 'Peter','Alex','Ann'],
                      columns=['Test_1', 'Test_2', 'Test_3'])

The DataFrame:

           Test_1  Test_2  Test_3
    Jane        1       2       5
    Peter       5       4       5
    Alex        7       7       8
    Ann         7       6       9

To select 1 or more columns by name:

    df[['Test_1','Test_3']]

           Test_1  Test_3
    Jane        1       5
    Peter       5       5
    Alex        7       8
    Ann         7       9

You can also use:

    df.Test_2

And yo get column Test_2

    Jane     2
    Peter    4
    Alex     7
    Ann      6

You can also select columns and rows from these rows using .loc(). This is called “slicing”. Notice that I take from column Test_1to Test_3

    df.loc[:,'Test_1':'Test_3']

The “Slice” is:

            Test_1  Test_2  Test_3
     Jane        1       2       5
     Peter       5       4       5
     Alex        7       7       8
     Ann         7       6       9

And if you just want Peter and Ann from columns Test_1 and Test_3:

    df.loc[['Peter', 'Ann'],['Test_1','Test_3']]

You get:

           Test_1  Test_3
    Peter       5       5
    Ann         7       9

回答 10

如果要按行索引和列名获取一个元素,则可以像那样进行df['b'][0]。它像成像一样简单。

或者,您也可以df.ix[0,'b']混合使用索引和标签。

注意:由于ix不推荐使用v0.20 ,而推荐使用loc/ iloc

If you want to get one element by row index and column name, you can do it just like df['b'][0]. It is as simple as you can image.

Or you can use df.ix[0,'b'],mixed usage of index and label.

Note: Since v0.20 ix has been deprecated in favour of loc / iloc.


回答 11

一种不同而又简单的方法:迭代行

使用iterows

 df1= pd.DataFrame() #creating an empty dataframe
 for index,i in df.iterrows():
    df1.loc[index,'A']=df.loc[index,'A']
    df1.loc[index,'B']=df.loc[index,'B']
    df1.head()

One different and easy approach : iterating rows

using iterows

 df1= pd.DataFrame() #creating an empty dataframe
 for index,i in df.iterrows():
    df1.loc[index,'A']=df.loc[index,'A']
    df1.loc[index,'B']=df.loc[index,'B']
    df1.head()

回答 12

以上响应中讨论的不同方法是基于以下假设:用户知道要放下或子集化的列索引,或者用户希望使用一定范围的列(例如,在“ C”与“ E”之间)对数据帧进行子集化。pandas.DataFrame.drop()当然是基于用户定义的列列表对数据进行子集化的选项(尽管您必须谨慎使用始终使用dataframe的副本,并且inplace参数不应设置为True!)

另一种选择是使用pandas.columns.difference(),它对列名进行设置上的区别,并返回包含所需列的数组的索引类型。以下是解决方案:

df = pd.DataFrame([[2,3,4],[3,4,5]],columns=['a','b','c'],index=[1,2])
columns_for_differencing = ['a']
df1 = df.copy()[df.columns.difference(columns_for_differencing)]
print(df1)

输出为: b c 1 3 4 2 4 5

The different approaches discussed in above responses are based on the assumption that either the user knows column indices to drop or subset on, or the user wishes to subset a dataframe using a range of columns (for instance between ‘C’ : ‘E’). pandas.DataFrame.drop() is certainly an option to subset data based on a list of columns defined by user (though you have to be cautious that you always use copy of dataframe and inplace parameters should not be set to True!!)

Another option is to use pandas.columns.difference(), which does a set difference on column names, and returns an index type of array containing desired columns. Following is the solution:

df = pd.DataFrame([[2,3,4],[3,4,5]],columns=['a','b','c'],index=[1,2])
columns_for_differencing = ['a']
df1 = df.copy()[df.columns.difference(columns_for_differencing)]
print(df1)

The output would be: b c 1 3 4 2 4 5


回答 13

您也可以使用df.pop()

>>> df = pd.DataFrame([('falcon', 'bird',    389.0),
...                    ('parrot', 'bird',     24.0),
...                    ('lion',   'mammal',   80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal 

>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object

>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN

让我知道这是否对您有帮助,请使用df.pop(c)

you can also use df.pop()

>>> df = pd.DataFrame([('falcon', 'bird',    389.0),
...                    ('parrot', 'bird',     24.0),
...                    ('lion',   'mammal',   80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal 

>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object

>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN

let me know if this helps so for you , please use df.pop(c)


回答 14

我已经看到了一些答案,但是仍然不清楚。您将如何选择那些感兴趣的列?答案是,如果将它们收集在列表中,则可以使用列表引用列。

print(extracted_features.shape)
print(extracted_features)

(63,)
['f000004' 'f000005' 'f000006' 'f000014' 'f000039' 'f000040' 'f000043'
 'f000047' 'f000048' 'f000049' 'f000050' 'f000051' 'f000052' 'f000053'
 'f000054' 'f000055' 'f000056' 'f000057' 'f000058' 'f000059' 'f000060'
 'f000061' 'f000062' 'f000063' 'f000064' 'f000065' 'f000066' 'f000067'
 'f000068' 'f000069' 'f000070' 'f000071' 'f000072' 'f000073' 'f000074'
 'f000075' 'f000076' 'f000077' 'f000078' 'f000079' 'f000080' 'f000081'
 'f000082' 'f000083' 'f000084' 'f000085' 'f000086' 'f000087' 'f000088'
 'f000089' 'f000090' 'f000091' 'f000092' 'f000093' 'f000094' 'f000095'
 'f000096' 'f000097' 'f000098' 'f000099' 'f000100' 'f000101' 'f000103']

我有以下list / numpy数组extracted_features,指定63列。原始数据集有103列,我想准确提取出这些列,然后使用

dataset[extracted_features]

你最终会得到这个

您将在机器学习中(特别是在功能选择中)经常使用此功能。我也想讨论其他方式,但是我认为其他stackoverflowers已经对此进行了讨论。希望这对您有所帮助!

I’ve seen several answers on that, but on remained unclear to me. How would you select those columns of interest? The answer to that is that if you have them gathered in a list, you can just reference the columns using the list.

Example

print(extracted_features.shape)
print(extracted_features)

(63,)
['f000004' 'f000005' 'f000006' 'f000014' 'f000039' 'f000040' 'f000043'
 'f000047' 'f000048' 'f000049' 'f000050' 'f000051' 'f000052' 'f000053'
 'f000054' 'f000055' 'f000056' 'f000057' 'f000058' 'f000059' 'f000060'
 'f000061' 'f000062' 'f000063' 'f000064' 'f000065' 'f000066' 'f000067'
 'f000068' 'f000069' 'f000070' 'f000071' 'f000072' 'f000073' 'f000074'
 'f000075' 'f000076' 'f000077' 'f000078' 'f000079' 'f000080' 'f000081'
 'f000082' 'f000083' 'f000084' 'f000085' 'f000086' 'f000087' 'f000088'
 'f000089' 'f000090' 'f000091' 'f000092' 'f000093' 'f000094' 'f000095'
 'f000096' 'f000097' 'f000098' 'f000099' 'f000100' 'f000101' 'f000103']

I have the following list/numpy array extracted_features, specifying 63 columns. The original dataset has 103 columns, and I would like to extract exactly those, then I would use

dataset[extracted_features]

And you will end up with this

This something you would use quite often in Machine Learning (more specifically, in feature selection). I would like to discuss other ways too, but I think that has already been covered by other stackoverflowers. Hope this’ve been helpful!


回答 15

您可以使用pandas.DataFrame.filtermethod来过滤或重新排序列,如下所示:

df1 = df.filter(['a', 'b'])

You can use pandas.DataFrame.filter method to either filter or reorder columns like this:

df1 = df.filter(['a', 'b'])

回答 16

df[['a','b']] # select all rows of 'a' and 'b'column 
df.loc[0:10, ['a','b']] # index 0 to 10 select column 'a' and 'b'
df.loc[0:10, ['a':'b']] # index 0 to 10 select column 'a' to 'b'
df.iloc[0:10, 3:5] # index 0 to 10 and column 3 to 5
df.iloc[3, 3:5] # index 3 of column 3 to 5
df[['a','b']] # select all rows of 'a' and 'b'column 
df.loc[0:10, ['a','b']] # index 0 to 10 select column 'a' and 'b'
df.loc[0:10, ['a':'b']] # index 0 to 10 select column 'a' to 'b'
df.iloc[0:10, 3:5] # index 0 to 10 and column 3 to 5
df.iloc[3, 3:5] # index 3 of column 3 to 5