How to iterate over rows in a DataFrame in Pandas?

I have a DataFrame from pandas:

import pandas as pd
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)

Output:

   c1   c2
0  10  100
1  11  110
2  12  120

Now I want to iterate over the rows of this frame. For every row I want to be able to access its elements (values in cells) by the name of the columns. For example:

for row in df.rows:
    print(row['c1'], row['c2'])

Is it possible to do that in pandas?

I found this similar question. But it does not give me the answer I need. For example, it is suggested there to use:

for date, row in df.T.iteritems():

or

for row in df.iterrows():

But I do not understand what the row object is and how I can work with it.


Answer 0

DataFrame.iterrows is a generator which yields both the index and the row (as a Series):

import pandas as pd
import numpy as np

df = pd.DataFrame([{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}])

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output: 
   10 100
   11 110
   12 120
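
Since the question asks what the row object actually is, here is a minimal sketch using the same df as above. It suggests that each row is a pandas Series keyed by column name, and that the first element of each pair is the row's index label. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}])

# iterrows() yields (index_label, row) pairs; each row is a Series
# whose own index holds the column names.
first_index, first_row = next(df.iterrows())

row_type = type(first_row).__name__   # class name of the row object
row_labels = list(first_row.index)    # the column names
c1_value = first_row['c1']            # access a cell by column name

print(first_index, row_type, row_labels, c1_value)
```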

Answer 1

How to iterate over rows in a DataFrame in Pandas?

Answer: DON’T*!

Iteration in pandas is an anti-pattern, and is something you should only do when you have exhausted every other option. You should not use any function with “iter” in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use DataFrame.to_string().

Do you want to compute something? In that case, search for methods in this order (list modified from here):

  1. Vectorization
  2. Cython routines
  3. List Comprehensions (vanilla for loop)
  4. DataFrame.apply(): i)  Reductions that can be performed in cython, ii) Iteration in python space
  5. DataFrame.itertuples() and items() (called iteritems() before pandas 2.0)
  6. DataFrame.iterrows()

iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority
The docs page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed […].

* It’s actually a little more complicated than “don’t”. df.iterrows() is the correct answer to this question, but “vectorize your ops” is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you’re not sure whether you need an iterative solution, you probably don’t. PS: To know more about my rationale for writing this answer, skip to the very bottom.


Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are “vectorised” by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom cython extensions.
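
As a rough illustration of what "vectorized" means here, the sketch below adds two columns and counts matching rows without any Python-level loop. The column names and the threshold 105 are assumptions made up for the example; the iterrows() loop is included only for contrast.

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# Vectorized arithmetic: operates on whole columns at once.
df['total'] = df['c1'] + df['c2']

# Vectorized comparison producing a boolean mask, followed by a reduction.
mask = df['c2'] > 105
count_large = int(mask.sum())

# Equivalent row-by-row loop, shown only for contrast.
loop_total = [row['c1'] + row['c2'] for _, row in df.iterrows()]
```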


Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you’re trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common pandas tasks.

The formula is simple,

# iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]
# iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw python.
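
A concrete instance of the two-column formula might look like the sketch below. The function label and its pass/fail rule are hypothetical, invented purely to stand in for "your business logic".

```python
import pandas as pd

df = pd.DataFrame({'name': ['ann', 'bob', 'cy'], 'score': [71, 48, 90]})

def label(name, score):
    # Hypothetical business rule, for illustration only.
    status = 'pass' if score >= 50 else 'fail'
    return f'{name.title()}: {status}'

# zip() walks the two columns in lockstep without building a Series per row.
df['label'] = [label(n, s) for n, s in zip(df['name'], df['score'])]
```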

Caveats
List comprehensions assume that your data is easy to work with – what that means is your data types are consistent and you don’t have NaNs, but this cannot always be guaranteed.

  1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
  2. When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy() as the latter implicitly upcasts data to the most common type. As an example, if A is integer and B is float, to_numpy() will cast the entire array to float, and if A is numeric and B is string, the result is a generic object array, which may not be what you want. Fortunately zipping your columns together is the most straightforward workaround to this.
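
The upcasting caveat can be checked with a small sketch like the one below. The frame and its column names are assumptions for the example; the point is only to compare the dtype chosen by to_numpy() with what zip() hands back per column.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5], 'C': ['x', 'y']})

# int + float columns: to_numpy() picks one common dtype, so the ints become floats.
numeric_rows = df[['A', 'B']].to_numpy()
numeric_dtype = str(numeric_rows.dtype)               # 'float64'

# numeric + string columns: the common dtype is the generic object dtype.
mixed_dtype = str(df[['A', 'C']].to_numpy().dtype)    # 'object'

# zip() reads each column separately, so values from A stay integers.
zipped_types = [(type(a).__name__, type(b).__name__) for a, b in zip(df['A'], df['B'])]
```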

* YMMV for the reasons outlined in the Caveats section above.


An Obvious Example

Let’s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.

[Figure: benchmark plot comparing the performance of the methods discussed above]

Benchmarking code, for your reference.

I should mention, however, that it isn’t always this cut and dry. Sometimes the answer to “what is the best method for an operation” is “it depends on your data”. My advice is to test out different approaches on your data before settling on one.
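
If you want to try this on your own data, a minimal timing sketch could look like the following. It is not the benchmark code referenced above; the frame size, the function names and the number of runs are all assumptions, and the absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(0, 100, 1_000), 'B': rng.integers(0, 100, 1_000)})

def vectorized(d):
    return d['A'] + d['B']

def list_comp(d):
    return pd.Series([a + b for a, b in zip(d['A'], d['B'])], index=d.index)

def with_iterrows(d):
    return pd.Series([row['A'] + row['B'] for _, row in d.iterrows()], index=d.index)

# All three approaches should agree before their timings are worth comparing.
results_match = (
    vectorized(df).tolist() == list_comp(df).tolist() == with_iterrows(df).tolist()
)

# Time five runs of each approach on the same frame.
timings = {
    f.__name__: timeit.timeit(lambda f=f: f(df), number=5)
    for f in (vectorized, list_comp, with_iterrows)
}
for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f}s for 5 runs')
```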


Further Reading

* Pandas string methods are “vectorized” in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.


Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form “How can I iterate over my df to do X?”, showing code that calls iterrows() while doing something inside a for loop. Here is why. A new user of the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I’m not trying to start a war of iteration vs vectorization, but I want new users to be informed when developing solutions to their problems with this library.


Answer 2

First consider if you really need to iterate over rows in a DataFrame. See this answer for alternatives.

If you still need to iterate over rows, you can use methods below. Note some important caveats which are not mentioned in any of the other answers.

itertuples() is supposed to be faster than iterrows()

But be aware, according to the docs (pandas 0.24.2 at the moment):

  • iterrows: dtype might not match from row to row

    Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()

  • iterrows: Do not modify rows

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

    Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2)
    
  • itertuples:

    The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
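
The dtype caveat for iterrows can be seen in a small sketch along these lines. The column names int_col and float_col are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.5, 2.5]})

# iterrows(): each row becomes a single Series, so the int is upcast to float.
_, row = next(df.iterrows())
iterrows_value = row['int_col']

# itertuples(): each value keeps the type of the column it came from.
tup = next(df.itertuples(index=False))
itertuples_value = tup.int_col

print(type(iterrows_value).__name__, type(itertuples_value).__name__)
```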

See pandas docs on iteration for more details.


Answer 3

You should use df.iterrows(). Though iterating row-by-row is not especially efficient since Series objects have to be created.


Answer 4

While iterrows() is a good option, sometimes itertuples() can be much faster:

import pandas as pd
from numpy.random import randn, randint

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000),'N': randint(100, 1000, (1000)), 'x': 'x'})

%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop

%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop

Answer 5

You can also use df.apply() to iterate over rows and access multiple columns for a function.

docs: DataFrame.apply()

def valuation_formula(x, y):
    return x * y * 0.5

df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
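
To run the snippet above end to end, a sketch with some made-up x and y values follows. Because this particular formula is plain arithmetic, the same function can also be called on whole columns; that second call is an addition for comparison, not part of the original suggestion.

```python
import pandas as pd

df = pd.DataFrame({'x': [2.0, 4.0, 6.0], 'y': [10.0, 20.0, 30.0]})

def valuation_formula(x, y):
    return x * y * 0.5

# Row-wise: the lambda is called once per row.
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)

# Column-wise: the same function applied to whole columns in one call.
df['price_vectorized'] = valuation_formula(df['x'], df['y'])
```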

Answer 6

You can use df.iloc as follows:

for i in range(len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])

Answer 7

I was looking for how to iterate over rows AND columns and ended up here, so:

for i, row in df.iterrows():
    for j, column in row.items():  # Series.iteritems() was removed in pandas 2.0
        print(column)

Answer 8

You can write your own iterator that yields namedtuples:

from collections import namedtuple

def myiter(d, cols=None):
    if cols is None:
        # use every column
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        # pick the requested columns by position
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    # one namedtuple class shared by all rows
    n = namedtuple('MyTuple', cols)

    for line in v:
        yield n(*line)

This is directly comparable to pd.DataFrame.itertuples. I’m aiming at performing the same task with more efficiency.


For the given dataframe with my function:

list(myiter(df))

[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]

Or with pd.DataFrame.itertuples:

list(df.itertuples(index=False))

[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]

A comprehensive test
We test making all columns available and subsetting the columns.

from timeit import timeit

import numpy as np
import pandas as pd

def iterfullA(d):
    return list(myiter(d))

def iterfullB(d):
    return list(d.itertuples(index=False))

def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)

for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);



Answer 9

How to iterate efficiently?

If you really have to iterate a pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.

In short:

  • As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 of them. See point (3)
  • Otherwise, use df.itertuples() unless your columns have special characters such as spaces or '-'. See point (2)
  • It is possible to use itertuples() even if your dataframe has strange column names by using the last example. See point (4)
  • Only use iterrows() if you cannot use the previous solutions. See point (1)

Different methods to iterate over rows in a pandas dataframe:

Generate a random dataframe with a million rows and 4 columns:

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
print(df)

1) The usual iterrows() is convenient but damn slow:

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))

2) The default itertuples() is already much faster, but it doesn’t work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot simply be converted to a Python variable name):

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

3) itertuples() with name=None is even faster, but not really convenient as you have to define a variable per column:

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for (_, col1, col2, col3, col4) in df.itertuples(name=None):
    result += max(col2, col3)

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))

4) Finally, the named itertuples() is slower than the previous point but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange.

start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))

Output:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519

This article is a very interesting comparison between iterrows and itertuples
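
To see why points (2) and (4) differ, the sketch below builds a tiny frame with one awkward column name (the frame itself is an assumption for the example). It shows the positional name that itertuples() substitutes, and the get_loc() lookup that works regardless of the name.

```python
import pandas as pd

df = pd.DataFrame({'My Col-Name is very Strange': [1, 2], 'B': [3, 4]})

# The first column name is not a valid Python identifier, so itertuples()
# replaces it with a positional name; 'B' is kept as-is.
first = next(df.itertuples(index=False))
field_names = first._fields

# Positional lookup via get_loc() works whatever the column is called.
pos = df.columns.get_loc('My Col-Name is very Strange')
strange_values = [row[pos] for row in df.itertuples(index=False)]

print(field_names, strange_values)
```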


Answer 10

To loop over all rows in a dataframe you can use:

for x in range(len(date_example.index)):
    print(date_example['Date'].iloc[x])

Answer 11

for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])

Answer 12

Sometimes a useful pattern is:

# Borrowing @KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
# The to_dict call results in a list of dicts
# where each row_dict is a dictionary with k:v pairs of columns:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)

Which results in:

{'col1': 1, 'col2': 0.1}
{'col1': 2, 'col2': 0.2}

Answer 13

To loop all rows in a dataframe and use values of each row conveniently, namedtuples can be converted to ndarrays. For example:

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

Iterating over the rows:

for row in df.itertuples(index=False, name='Pandas'):
    print(np.asarray(row))

results in:

[ 1.   0.1]
[ 2.   0.2]

Please note that if index=True, the index is added as the first element of the tuple, which may be undesirable for some applications.


Answer 14

There is a way to iterate through rows while getting a DataFrame in return, and not a Series. I don’t see anyone mentioning that you can pass the index as a list for the row to be returned as a DataFrame:

for i in range(len(df)):
    row = df.iloc[[i]]

Note the usage of double brackets. This returns a DataFrame with a single row.
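
A quick way to see the difference between single and double brackets is a sketch like this one, using a small frame modelled on the one in the question:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

as_series = df.iloc[0]      # single brackets: one row as a Series
as_frame = df.iloc[[0]]     # double brackets: a one-row DataFrame

kinds = (type(as_series).__name__, type(as_frame).__name__)
frame_shape = as_frame.shape

print(kinds, frame_shape)
```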


Answer 15

For both viewing and modifying values, I would use iterrows(). In a for loop and by using tuple unpacking (see the example: i, row), I use the row for only viewing the value and use i with the loc method when I want to modify values. As stated in previous answers, here you should not modify something you are iterating over.

for i, row in df.iterrows():
    if row['A'] == 'Old_Value':          # read through the row copy
        df.loc[i, 'A'] = 'New_Value'     # write through the DataFrame itself

Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value', it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
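
The copy-versus-view point can be checked with a small sketch. The two-column frame below is an assumption; it mixes strings and integers so that iterrows() has to hand back a copy of each row.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Old_Value', 'Keep'], 'n': [1, 2]})

for i, row in df.iterrows():
    if row['A'] == 'Old_Value':
        row['A'] = 'Ignored'            # changes only the temporary row copy
        df.loc[i, 'A'] = 'New_Value'    # writes to the DataFrame itself

final_values = df['A'].tolist()
print(final_values)
```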


Answer 16

I know I’m late to the answering party, but I just wanted to add to @cs95’s answer above, which I believe should be the accepted answer. In his answer, he shows that pandas vectorization far outperforms other pandas methods for computing stuff with dataframes.

I wanted to add that if you first convert the dataframe to a numpy array and then use vectorization, it’s even faster than pandas dataframe vectorization, (and that includes the time to turn it back into a dataframe series).

If you add the following functions to @cs95’s benchmark code, this becomes pretty evident:

def np_vectorization(df):
    np_arr = df.to_numpy()
    return pd.Series(np_arr[:,0] + np_arr[:,1], index=df.index)

def just_np_vectorization(df):
    np_arr = df.to_numpy()
    return np_arr[:,0] + np_arr[:,1]



Answer 17

You can also do numpy indexing for even greater speed ups. It’s not really iterating but works much better than iteration for certain applications.

subset = row['c1'][0:5]
all = row['c1'][:]

You may also want to cast it to an array. These indexes/selections are supposed to act like Numpy arrays already but I ran into issues and needed to cast

np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224,224) ) #resize every image in an hdf5 file

Answer 18

There are many ways to iterate over the rows of a pandas dataframe. One very simple and intuitive way is:

df=pd.DataFrame({'A':[1,2,3], 'B':[4,5,6],'C':[7,8,9]})
print(df)
for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i,1])
    # For printing more than one column
    print(df.iloc[i,[0,2]])

Answer 19

This example uses iloc to isolate each value in the data frame.

import pandas as pd

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

mjr = pd.DataFrame({'a': a, 'b': b})

size = mjr.shape

for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])

Answer 20

Some libraries (e.g. a Java interop library that I use) require values to be passed in one row at a time, for example when streaming data. To replicate the streaming nature, I ‘stream’ my dataframe values one by one with the class below, which comes in handy from time to time.

from typing import Dict, List

class DataFrameReader:
  def __init__(self, df):
    self._df = df
    self._row = None
    self._columns = df.columns.tolist()
    self.reset()
    self.row_index = 0

  def __getattr__(self, key):
    return self.__getitem__(key)

  def read(self) -> bool:
    self._row = next(self._iterator, None)
    self.row_index += 1
    return self._row is not None

  def columns(self):
    return self._columns

  def reset(self) -> None:
    self._iterator = self._df.itertuples()

  def get_index(self):
    return self._row[0]

  def index(self):
    return self._row[0]

  def to_dict(self, columns: List[str] = None):
    return self.row(columns=columns)

  def tolist(self, cols) -> List[object]:
    return [self.__getitem__(c) for c in cols]

  def row(self, columns: List[str] = None) -> Dict[str, object]:
    cols = set(self._columns if columns is None else columns)
    return {c : self.__getitem__(c) for c in self._columns if c in cols}

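  def __getitem__(self, key) -> object:
    # the df index of the row is at position 0, so column positions are offset by 1
    try:
        if isinstance(key, list):
            # a list of column names returns a list of values
            return [self._row[self._columns.index(k) + 1] for k in key]
        return self._row[self._columns.index(key) + 1]
    except (ValueError, TypeError, IndexError):
        # unknown column name, or no row has been read yet
        return None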

  def __next__(self) -> 'DataFrameReader':
    if self.read():
        return self
    else:
        raise StopIteration

  def __iter__(self) -> 'DataFrameReader':
    return self

It can be used like this:

for row in DataFrameReader(df):
  print(row.my_column_name)
  print(row.to_dict())
  print(row['my_column_name'])
  print(row.tolist(['my_column_name']))

It preserves the value/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply and Cython as indicated above, but it is necessary in some circumstances.


回答 21

简而言之

  • 尽可能使用向量化
  • 如果操作无法向量化-使用列表推导
  • 如果您需要一个代表整个行的对象,请使用itertuples
  • 如果上述操作太慢-请尝试swifter.apply
  • 如果仍然太慢-请尝试Cython例程

详细资料 该视频中的

基准测试 熊猫DataFrame中行的迭代基准

In short

  • Use vectorization if possible
  • If operation can’t be vectorized – use list comprehensions
  • If you need a single object representing entire row – use itertuples
  • If the above is too slow – try swifter.apply
  • If it’s still too slow – try Cython routine

Details in this video

Benchmark: Benchmark of iteration over rows in a pandas DataFrame
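
As a rough illustration of the first three bullets, the sketch below computes the same per-row sum three ways on the small frame from the question. The variable names are made up for this example, and the three approaches are only meant to show the shape of each technique, not to serve as a benchmark.

```python
import pandas as pd

df = pd.DataFrame([{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}])

# 1. Vectorized: operate on whole columns at once.
totals_vectorized = (df['c1'] + df['c2']).tolist()

# 2. List comprehension over the columns that are needed.
totals_listcomp = [a + b for a, b in zip(df['c1'], df['c2'])]

# 3. itertuples: one namedtuple per row, with columns available as attributes.
totals_itertuples = [row.c1 + row.c2 for row in df.itertuples(index=False)]

print(totals_vectorized)
print(totals_listcomp)
print(totals_itertuples)
```

All three lists hold the same values; the difference only starts to matter as the number of rows grows.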


如何获取列表中的元素数量?

问题:如何获取列表中的元素数量?

考虑以下:

items = []
items.append("apple")
items.append("orange")
items.append("banana")

# FAKE METHOD:
items.amount()  # Should return 3

如何获取列表中的元素数量items

Consider the following:

items = []
items.append("apple")
items.append("orange")
items.append("banana")

# FAKE METHOD:
items.amount()  # Should return 3

How do I get the number of elements in the list items?


回答 0

len()函数可以与Python中的几种不同类型一起使用-内置类型和库类型。例如:

>>> len([1,2,3])
3

官方2.x文档在这里:len()
官方3.x文档在这里:len()

The len() function can be used with several different types in Python – both built-in types and library types. For example:

>>> len([1,2,3])
3

Official 2.x documentation is here: len()
Official 3.x documentation is here: len()


回答 1

如何获得列表的大小?

要查找列表的大小,请使用内置函数len

items = []
items.append("apple")
items.append("orange")
items.append("banana")

现在:

len(items)

返回3。

说明

Python中的所有内容都是一个对象,包括列表。在C实现中,所有对象都有某种头。

列表和其他类似的内置对象在Python中具有“大小”,尤其是具有一个名为的属性ob_size,其中缓存了对象中元素的数量。因此,检查列表中对象的数量非常快。

但是,如果您要检查列表大小是否为零,请不要使用len-而是将列表放在布尔值上下文中-如果为空,则将其视为False,否则将其视为True

来自文档

len(s)

返回对象的长度(项目数)。参数可以是序列(例如字符串,字节,元组,列表或范围)或集合(例如字典,集合或冻结集合)。

len与实施__len__,从数据模型文档

object.__len__(self)

调用以实现内置函数len()。应该返回对象的长度,即> = 0的整数。而且,在Boolean上下文中,未定义__nonzero__()[在Python 2或__bool__()Python 3中]方法且其__len__()方法返回零的对象被视为false。

我们还可以看到这__len__是一种列表方法:

items.__len__()

返回3。

内建类型,你可以得到len的(长)

实际上,我们看到我们可以为所有描述的类型获取此信息:

>>> all(hasattr(cls, '__len__') for cls in (str, bytes, tuple, list, 
                                            range, dict, set, frozenset))
True

请勿len用于测试空列表或非空列表

当然,要测试特定长度,只需测试是否相等:

if len(items) == required_length:
    ...

但是在测试零长度列表或反数列表时有一种特殊情况。在这种情况下,请勿测试是否相等。

另外,请勿执行以下操作:

if len(items): 
    ...

相反,只需执行以下操作:

if items:     # Then we have some items, not empty!
    ...

要么

if not items: # Then we have an empty list!
    ...

在这里解释原因,但总之,if items或者if not items更具可读性和性能。

How to get the size of a list?

To find the size of a list, use the builtin function, len:

items = []
items.append("apple")
items.append("orange")
items.append("banana")

And now:

len(items)

returns 3.

Explanation

Everything in Python is an object, including lists. All objects have a header of some sort in the C implementation.

Lists and other similar builtin objects with a “size” in Python, in particular, have an attribute called ob_size, where the number of elements in the object is cached. So checking the number of objects in a list is very fast.

But if you’re checking whether the list size is zero or not, don’t use len – instead, put the list in a boolean context – it is treated as False if empty, True otherwise.

From the docs

len(s)

Return the length (the number of items) of an object. The argument may be a sequence (such as a string, bytes, tuple, list, or range) or a collection (such as a dictionary, set, or frozen set).

len is implemented with __len__, from the data model docs:

object.__len__(self)

Called to implement the built-in function len(). Should return the length of the object, an integer >= 0. Also, an object that doesn’t define a __nonzero__() method [in Python 2, or a __bool__() method in Python 3] and whose __len__() method returns zero is considered to be false in a Boolean context.

And we can also see that __len__ is a method of lists:

items.__len__()

returns 3.
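
To make the data model rule above concrete, here is a minimal sketch with a made-up container class, Playlist, which is purely an assumption for illustration. It defines only __len__, so both len() and truth-value testing fall back to that method.

```python
class Playlist:
    """Hypothetical container used only for illustration."""

    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __len__(self):
        # len(playlist) calls this method; it must return an int >= 0.
        return len(self._tracks)


empty = Playlist([])
three = Playlist(["a", "b", "c"])

print(len(three))   # 3
print(bool(empty))  # False: __len__ returns 0 and no __bool__ is defined
print(bool(three))  # True
```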

Builtin types you can get the len (length) of

And in fact we see we can get this information for all of the described types:

>>> all(hasattr(cls, '__len__') for cls in (str, bytes, tuple, list, 
                                            range, dict, set, frozenset))
True

Do not use len to test for an empty or nonempty list

To test for a specific length, of course, simply test for equality:

if len(items) == required_length:
    ...

But there’s a special case for testing for a zero length list or the inverse. In that case, do not test for equality.

Also, do not do:

if len(items): 
    ...

Instead, simply do:

if items:     # Then we have some items, not empty!
    ...

or

if not items: # Then we have an empty list!
    ...

I explain why here but in short, if items or if not items is both more readable and more performant.


回答 2

虽然由于“开箱即用”功能在意义上更有意义,所以这可能没有用,但是一个相当简单的技巧是使用length属性创建类:

class slist(list):
    @property
    def length(self):
        return len(self)

您可以这样使用它:

>>> l = slist(range(10))
>>> l.length
10
>>> print(l)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

从本质上讲,它与列表对象完全相同,其附加好处是具有OOP友好length属性。

和往常一样,您的里程可能会有所不同。

This may not be very useful, since it would make more sense as “out of the box” functionality, but a fairly simple hack is to build a class with a length property:

class slist(list):
    @property
    def length(self):
        return len(self)

You can use it like so:

>>> l = slist(range(10))
>>> l.length
10
>>> print(l)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Essentially, it’s exactly identical to a list object, with the added benefit of having an OOP-friendly length property.

As always, your mileage may vary.


回答 3

此外,len您还可以使用operator.length_hint(需要Python 3.4+)。对于一个法线而言,list两者都是等效的,但length_hint可以获取列表迭代器的长度,这在某些情况下可能很有用:

>>> from operator import length_hint
>>> l = ["apple", "orange", "banana"]
>>> len(l)
3
>>> length_hint(l)
3

>>> list_iterator = iter(l)
>>> len(list_iterator)
TypeError: object of type 'list_iterator' has no len()
>>> length_hint(list_iterator)
3

但是length_hint根据定义,它只是一个“提示”,因此大多数时候len会更好。

我已经看到了一些建议访问的答案__len__。在处理类似的内置类时list,这是可以的,但可能会导致自定义类出现问题,因为len(和length_hint)实现了一些安全检查。例如,两者都不允许负长度或超过某个值(该sys.maxsize值)的长度。因此,使用len函数而不是__len__方法总是更安全!

Besides len you can also use operator.length_hint (requires Python 3.4+). For a normal list both are equivalent, but length_hint makes it possible to get the length of a list-iterator, which could be useful in certain circumstances:

>>> from operator import length_hint
>>> l = ["apple", "orange", "banana"]
>>> len(l)
3
>>> length_hint(l)
3

>>> list_iterator = iter(l)
>>> len(list_iterator)
TypeError: object of type 'list_iterator' has no len()
>>> length_hint(list_iterator)
3

But length_hint is by definition only a “hint”, so most of the time len is better.
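
A small sketch of why the result is only a hint: a list iterator can report how many items remain, but a generator cannot, so length_hint falls back to a default. The variable names are made up for this example.

```python
from operator import length_hint

fruits = ["apple", "orange", "banana"]

it = iter(fruits)
next(it)                          # consume one element
remaining = length_hint(it)       # list iterators report how many items are left

gen = (f.upper() for f in fruits)
unknown = length_hint(gen)        # generators give no hint, so the default 0 is returned
fallback = length_hint(gen, 10)   # a caller-supplied default is returned instead

print(remaining, unknown, fallback)
```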

I’ve seen several answers suggesting accessing __len__. This is all right when dealing with built-in classes like list, but it could lead to problems with custom classes, because len (and length_hint) implement some safety checks. For example, neither allows negative lengths or lengths that exceed a certain value (the sys.maxsize value). So it’s always safer to use the len function instead of the __len__ method!
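
The safety check can be seen with a deliberately broken class. BrokenLength below is a hypothetical example, not something from a real library: calling __len__ directly returns the bad value, while len() rejects it.

```python
class BrokenLength:
    """Hypothetical class whose __len__ violates the protocol."""

    def __len__(self):
        return -1


obj = BrokenLength()

direct = obj.__len__()   # the raw method call performs no validation

try:
    len(obj)
    checked = "no error"
except ValueError as exc:
    checked = f"ValueError: {exc}"

print(direct)
print(checked)
```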


回答 4

通过前面给出的示例来回答您的问题:

items = []
items.append("apple")
items.append("orange")
items.append("banana")

print(items.__len__())

To answer your question using the same example given previously:

items = []
items.append("apple")
items.append("orange")
items.append("banana")

print(items.__len__())

回答 5

并且为了完整性(主要是教育性的),可以不使用该len()功能。我不认为这是一个很好的选择。不要像在PYTHON中那样编程,但这是学习算法的目的。

def count(list):
    item_count = 0
    for item in list[:]:
        item_count += 1
    return item_count

count([1,2,3,4,5])

(中的冒号list[:]是隐式的,因此也是可选的。)

对于新程序员来说,这里的教训是:您无法在不计算点的情况下获得列表中的项目数。问题就变成了:什么时候该计数它们呢?例如,诸如套接字的连接系统调用之类的高性能代码(用C编写)connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);不会计算元素的长度(将责任归于调用代码)。请注意,地址的长度被传递以节省首先计算长度的步骤吗?另一个选择:通过计算,在将项目添加到传递的对象中时,跟踪项目的数量可能很有意义。请注意,这会占用更多的内存空间。请参阅Naftuli Kay的答案

跟踪长度以提高性能,同时占用更多内存空间的示例。请注意,我从不使用len()函数,因为会跟踪长度:

class MyList(object):
    def __init__(self):
        self._data = []
        self.length = 0 # length tracker that takes up memory but makes length op O(1) time


        # the implicit iterator in a list class
    def __iter__(self):
        for elem in self._data:
            yield elem

    def add(self, elem):
        self._data.append(elem)
        self.length += 1

    def remove(self, elem):
        self._data.remove(elem)
        self.length -= 1

mylist = MyList()
mylist.add(1)
mylist.add(2)
mylist.add(3)
print(mylist.length) # 3
mylist.remove(3)
print(mylist.length) # 2

And for completeness (primarily educational), it is possible to count the items without using the len() function. I would not recommend this as a good option (DO NOT PROGRAM LIKE THIS IN PYTHON), but it serves a purpose for learning algorithms.

def count(list):
    item_count = 0
    for item in list[:]:
        item_count += 1
    return item_count

count([1,2,3,4,5])

(The slice list[:] only makes a shallow copy of the list before iterating; it is optional, and iterating over list directly gives the same count.)

The lesson here for new programmers is: You can’t get the number of items in a list without counting them at some point. The question becomes: when is a good time to count them? For example, high-performance code like the connect system call for sockets (written in C) connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);, does not calculate the length of elements (giving that responsibility to the calling code). Notice that the length of the address is passed along to save the step of counting the length first? Another option: computationally, it might make sense to keep track of the number of items as you add them within the object that you pass. Mind that this takes up more space in memory. See Naftuli Kay‘s answer.

Example of keeping track of the length to improve performance while taking up more space in memory. Note that I never use the len() function because the length is tracked:

class MyList(object):
    def __init__(self):
        self._data = []
        self.length = 0 # length tracker that takes up memory but makes length op O(1) time


        # the implicit iterator in a list class
    def __iter__(self):
        for elem in self._data:
            yield elem

    def add(self, elem):
        self._data.append(elem)
        self.length += 1

    def remove(self, elem):
        self._data.remove(elem)
        self.length -= 1

mylist = MyList()
mylist.add(1)
mylist.add(2)
mylist.add(3)
print(mylist.length) # 3
mylist.remove(3)
print(mylist.length) # 2
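
If a hand-rolled container like this is wanted anyway, one possible refinement is to expose the tracked counter through __len__, so callers can still use the ordinary len() built-in. This is only a sketch; CountedList is a made-up name, and since the built-in list already stores its own size (as noted in an earlier answer), the example is purely illustrative.

```python
class CountedList:
    """Illustrative wrapper that tracks its own length."""

    def __init__(self):
        self._data = []
        self._length = 0

    def add(self, elem):
        self._data.append(elem)
        self._length += 1

    def remove(self, elem):
        # list.remove raises ValueError if elem is absent,
        # so the counter is only decremented after a successful removal.
        self._data.remove(elem)
        self._length -= 1

    def __len__(self):
        # Lets len(counted) return the tracked value.
        return self._length

    def __iter__(self):
        return iter(self._data)


counted = CountedList()
for value in (1, 2, 3):
    counted.add(value)
counted.remove(3)
print(len(counted))
print(list(counted))
```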

回答 6

len()实际工作方式而言,这是其C实现

static PyObject *
builtin_len(PyObject *module, PyObject *obj)
/*[clinic end generated code: output=fa7a270d314dfb6c input=bc55598da9e9c9b5]*/
{
    Py_ssize_t res;

    res = PyObject_Size(obj);
    if (res < 0) {
        assert(PyErr_Occurred());
        return NULL;
    }
    return PyLong_FromSsize_t(res);
}

Py_ssize_t是对象可以具有的最大长度。PyObject_Size()是一个返回对象大小的函数。如果无法确定对象的大小,则返回-1。在这种情况下,将执行以下代码块:

if (res < 0) {
        assert(PyErr_Occurred());
        return NULL;
    }

结果引发了异常。否则,将执行以下代码块:

return PyLong_FromSsize_t(res);

res这是一个C整数,将转换为python long并返回。longs自python 3起,所有python整数都存储。

In terms of how len() actually works, this is its C implementation:

static PyObject *
builtin_len(PyObject *module, PyObject *obj)
/*[clinic end generated code: output=fa7a270d314dfb6c input=bc55598da9e9c9b5]*/
{
    Py_ssize_t res;

    res = PyObject_Size(obj);
    if (res < 0) {
        assert(PyErr_Occurred());
        return NULL;
    }
    return PyLong_FromSsize_t(res);
}

Py_ssize_t is a signed integer type that CPython uses for sizes and indexes; its maximum value is the largest length an object can have. PyObject_Size() is a function that returns the size of an object. If it cannot determine the size of an object, it returns -1. In that case, this code block will be executed:

if (res < 0) {
        assert(PyErr_Occurred());
        return NULL;
    }

And an exception is raised as a result. Otherwise, this code block will be executed:

return PyLong_FromSsize_t(res);

res, which is a C integer, is converted into a Python integer object and returned. In Python 3, all Python integers are of the arbitrary-precision type that Python 2 called long.
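
Seen from Python, the error branch in the C code surfaces as an exception. A minimal sketch, assuming a hypothetical helper named safe_len that is not part of any library:

```python
def safe_len(obj):
    """Return len(obj), or None when the object has no length."""
    try:
        return len(obj)
    except TypeError:
        # Raised for objects that do not implement __len__, e.g. ints or generators.
        return None


print(safe_len([1, 2, 3]))
print(safe_len(42))
print(safe_len(x for x in range(3)))
```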


如何按字典值对字典列表进行排序?

问题:如何按字典值对字典列表进行排序?

我有一个字典列表,希望每个项目都按特定的属性值排序。

考虑下面的数组,

[{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]

当排序name,应该成为

[{'name':'Bart', 'age':10}, {'name':'Homer', 'age':39}]

I have a list of dictionaries and want each item to be sorted by a specific property values.

Take into consideration the array below,

[{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]

When sorted by name, should become

[{'name':'Bart', 'age':10}, {'name':'Homer', 'age':39}]

回答 0

使用密钥而不是cmp看起来更干净:

newlist = sorted(list_to_be_sorted, key=lambda k: k['name']) 

或如JFSebastian和其他人所建议的,

from operator import itemgetter
newlist = sorted(list_to_be_sorted, key=itemgetter('name')) 

为了完整性(如fitzgeraldsteele的评论中指出的那样),请添加reverse=True降序排列

newlist = sorted(list_to_be_sorted, key=itemgetter('name'), reverse=True)

It may look cleaner using a key instead of a cmp:

newlist = sorted(list_to_be_sorted, key=lambda k: k['name']) 

or as J.F.Sebastian and others suggested,

from operator import itemgetter
newlist = sorted(list_to_be_sorted, key=itemgetter('name')) 

For completeness (as pointed out in comments by fitzgeraldsteele), add reverse=True to sort descending

newlist = sorted(list_to_be_sorted, key=itemgetter('name'), reverse=True)

回答 1

import operator

通过key =’name’对字典列表进行排序:

list_of_dicts.sort(key=operator.itemgetter('name'))

按照key =’age’对字典列表进行排序:

list_of_dicts.sort(key=operator.itemgetter('age'))

import operator

To sort the list of dictionaries by key=’name’:

list_of_dicts.sort(key=operator.itemgetter('name'))

To sort the list of dictionaries by key=’age’:

list_of_dicts.sort(key=operator.itemgetter('age'))

回答 2

my_list = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]

# Python 2 only: list.sort() no longer accepts a comparison function in Python 3
my_list.sort(lambda x,y : cmp(x['name'], y['name']))

my_list 现在将成为您想要的。

(3年后)进行编辑以添加:

新的key论点更加有效和整洁。更好的答案现在看起来像:

my_list = sorted(my_list, key=lambda k: k['name'])

…IMO比operator.itemgetterymmv 更容易理解。

my_list = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]

# Python 2 only: list.sort() no longer accepts a comparison function in Python 3
my_list.sort(lambda x,y : cmp(x['name'], y['name']))

my_list will now be what you want.

(3 years later) Edited to add:

The new key argument is more efficient and neater. A better answer now looks like:

my_list = sorted(my_list, key=lambda k: k['name'])

…the lambda is, IMO, easier to understand than operator.itemgetter, but YMMV.


回答 3

如果要按多个键对列表进行排序,可以执行以下操作:

my_list = [{'name':'Homer', 'age':39}, {'name':'Milhouse', 'age':10}, {'name':'Bart', 'age':10} ]
sortedlist = sorted(my_list , key=lambda elem: "%02d %s" % (elem['age'], elem['name']))

它相当骇人听闻,因为它依赖于将值转换为单个字符串表示形式进行比较,但是它对于包括负数在内的数字也可以正常工作(尽管如果使用数字,则需要使用零填充来适当格式化字符串)

If you want to sort the list by multiple keys you can do the following:

my_list = [{'name':'Homer', 'age':39}, {'name':'Milhouse', 'age':10}, {'name':'Bart', 'age':10} ]
sortedlist = sorted(my_list , key=lambda elem: "%02d %s" % (elem['age'], elem['name']))

It is rather hackish, since it relies on converting the values into a single string representation for comparison, and it only orders numbers correctly when they are non-negative and zero-padded to the same width (negative numbers do not compare correctly as strings). Using a tuple such as (elem['age'], elem['name']) as the key avoids these problems.
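
A sketch of the tuple-based alternative, which compares the values themselves rather than their string forms. The list is the one from this answer; the result names are made up for illustration.

```python
from operator import itemgetter

people = [
    {'name': 'Homer', 'age': 39},
    {'name': 'Milhouse', 'age': 10},
    {'name': 'Bart', 'age': 10},
]

# itemgetter with several keys returns a tuple, so ties on age are broken by name.
by_age_then_name = sorted(people, key=itemgetter('age', 'name'))

# Mixed directions: oldest first, ties broken by name in ascending order.
oldest_first = sorted(people, key=lambda p: (-p['age'], p['name']))

print([p['name'] for p in by_age_then_name])
print([p['name'] for p in oldest_first])
```

Negating the number is a simple way to reverse one field while leaving the others ascending; it only applies to numeric fields.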


回答 4

import operator
a_list_of_dicts.sort(key=operator.itemgetter('name'))

‘key’用于按任意值排序,’itemgetter’将该值设置为每个项目的’name’属性。

import operator
a_list_of_dicts.sort(key=operator.itemgetter('name'))

‘key’ is used to sort by an arbitrary value and ‘itemgetter’ sets that value to each item’s ‘name’ attribute.


回答 5

a = [{'name':'Homer', 'age':39}, ...]

# This changes the list a
a.sort(key=lambda k : k['name'])

# This returns a new list (a is not modified)
sorted(a, key=lambda k : k['name']) 

回答 6

我想你的意思是:

[{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]

排序如下:

# Python 2 only: the cmp argument and the cmp() built-in were removed in Python 3
sorted(l,cmp=lambda x,y: cmp(x['name'],y['name']))

I guess you’ve meant:

[{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]

This would be sorted like this:

# Python 2 only: the cmp argument and the cmp() built-in were removed in Python 3
sorted(l,cmp=lambda x,y: cmp(x['name'],y['name']))
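
In Python 3, sorted() has no cmp argument and the cmp() built-in is gone. If a comparison function is really wanted, functools.cmp_to_key can adapt it into a key. The following is a sketch under that assumption; compare_by_name is a made-up helper.

```python
from functools import cmp_to_key


def compare_by_name(x, y):
    # Same contract as Python 2's cmp(): negative, zero, or positive.
    return (x['name'] > y['name']) - (x['name'] < y['name'])


simpsons = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age': 10}]
result = sorted(simpsons, key=cmp_to_key(compare_by_name))
print(result)
```

For a plain lookup like this one, a key function such as lambda d: d['name'] is shorter; cmp_to_key is mainly useful when porting existing comparison functions.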

回答 7

您可以使用自定义比较函数,也可以传入一个计算自定义排序键的函数。通常,这样做效率更高,因为每个项只计算一次密钥,而比较函数将被调用多次。

您可以这样进行:

def mykey(adict): return adict['name']
x = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age':10}]
sorted(x, key=mykey)

但是标准库包含用于获取任意对象项的通用例程:itemgetter。因此,请尝试以下操作:

from operator import itemgetter
x = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age':10}]
sorted(x, key=itemgetter('name'))

You could use a custom comparison function, or you could pass in a function that calculates a custom sort key. That’s usually more efficient as the key is only calculated once per item, while the comparison function would be called many more times.

You could do it this way:

def mykey(adict): return adict['name']
x = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age':10}]
sorted(x, key=mykey)

But the standard library contains a generic routine for getting items of arbitrary objects: itemgetter. So try this instead:

from operator import itemgetter
x = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age':10}]
sorted(x, key=itemgetter('name'))

回答 8

使用Perl的Schwartzian变换,

py = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]

sort_on = "name"
decorated = [(dict_[sort_on], dict_) for dict_ in py]
decorated.sort()
result = [dict_ for (key, dict_) in decorated]

>>> result
[{'age': 10, 'name': 'Bart'}, {'age': 39, 'name': 'Homer'}]

有关Perl Schwartzian变换的更多信息

在计算机科学中,Schwartzian变换是一种Perl编程习惯用法,用于提高对项目列表进行排序的效率。当排序实际上是基于元素的某个属性(键)的排序时,此惯用法适用于基于比较的排序,其中计算该属性是一项应执行最少次数的密集操作。Schwartzian转换的显着之处在于它不使用命名的临时数组。

Using Schwartzian transform from Perl,

py = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]

do

sort_on = "name"
decorated = [(dict_[sort_on], dict_) for dict_ in py]
decorated.sort()
result = [dict_ for (key, dict_) in decorated]

gives

>>> result
[{'age': 10, 'name': 'Bart'}, {'age': 39, 'name': 'Homer'}]

More on Perl Schwartzian transform

In computer science, the Schwartzian transform is a Perl programming idiom used to improve the efficiency of sorting a list of items. This idiom is appropriate for comparison-based sorting when the ordering is actually based on the ordering of a certain property (the key) of the elements, where computing that property is an intensive operation that should be performed a minimal number of times. The Schwartzian Transform is notable in that it does not use named temporary arrays.
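
One caveat when using this decorate-sort-undecorate pattern in Python 3: if two items share the same key, the tuple comparison falls through to the dicts themselves, which raises TypeError because dicts cannot be ordered. A possible workaround, sketched here with an extra duplicate entry added for illustration, is to include the original position as a tie-breaker.

```python
py = [
    {'name': 'Homer', 'age': 39},
    {'name': 'Bart', 'age': 10},
    {'name': 'Bart', 'age': 35},   # duplicate name, added to show the tie case
]

sort_on = "name"

# Decorate with (key, original index, item); the index breaks ties,
# so the dicts themselves are never compared.
decorated = [(d[sort_on], i, d) for i, d in enumerate(py)]
decorated.sort()
result = [d for _, _, d in decorated]

print(result)
```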


回答 9

您必须实现自己的比较功能,该功能将通过名称键的值比较字典。请参阅从PythonInfo Wiki对Mini-HOW TO进行排序

You have to implement your own comparison function that will compare the dictionaries by values of name keys. See Sorting Mini-HOW TO from PythonInfo Wiki


回答 10

有时我们需要使用lower()例如

lists = [{'name':'Homer', 'age':39},
  {'name':'Bart', 'age':10},
  {'name':'abby', 'age':9}]

lists = sorted(lists, key=lambda k: k['name'])
print(lists)
# [{'name':'Bart', 'age':10}, {'name':'Homer', 'age':39}, {'name':'abby', 'age':9}]

lists = sorted(lists, key=lambda k: k['name'].lower())
print(lists)
# [ {'name':'abby', 'age':9}, {'name':'Bart', 'age':10}, {'name':'Homer', 'age':39}]

Sometimes we need to use lower() to make the sort case-insensitive, for example:

lists = [{'name':'Homer', 'age':39},
  {'name':'Bart', 'age':10},
  {'name':'abby', 'age':9}]

lists = sorted(lists, key=lambda k: k['name'])
print(lists)
# [{'name':'Bart', 'age':10}, {'name':'Homer', 'age':39}, {'name':'abby', 'age':9}]

lists = sorted(lists, key=lambda k: k['name'].lower())
print(lists)
# [ {'name':'abby', 'age':9}, {'name':'Bart', 'age':10}, {'name':'Homer', 'age':39}]

回答 11

这是另一种通用解决方案-它按键和值对dict的元素进行排序。它的优点-无需指定键,并且如果某些词典中缺少某些键,它将仍然有效。

def sort_key_func(item):
    """ helper function used to sort list of dicts

    :param item: dict
    :return: sorted list of tuples (k, v)
    """
    pairs = []
    for k, v in item.items():
        pairs.append((k, v))
    return sorted(pairs)
sorted(A, key=sort_key_func)

Here is an alternative general solution: it sorts the dicts by their (key, value) pairs. Its advantage is that there is no need to specify keys, and it still works if some keys are missing from some of the dictionaries.

def sort_key_func(item):
    """ helper function used to sort list of dicts

    :param item: dict
    :return: sorted list of tuples (k, v)
    """
    pairs = []
    for k, v in item.items():
        pairs.append((k, v))
    return sorted(pairs)
sorted(A, key=sort_key_func)
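
If the goal is to sort on one specific key that may be missing from some dictionaries, another option is to build the sort key with dict.get. This is a sketch with made-up data; placing records without the key at the end is an assumption about the desired behaviour.

```python
records = [
    {'name': 'Homer', 'age': 39},
    {'name': 'Maggie'},            # no 'age' key
    {'name': 'Bart', 'age': 10},
]


def age_key(record):
    age = record.get('age')
    # (False, age) sorts before (True, 0), so records without an age go last.
    return (age is None, age if age is not None else 0)


result = sorted(records, key=age_key)
print([r['name'] for r in result])
```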

回答 12

使用pandas包是另一种方法,尽管它的大规模运行比其他人提出的更传统的方法要慢得多:

import pandas as pd

listOfDicts = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]
df = pd.DataFrame(listOfDicts)
df = df.sort_values('name')
sorted_listOfDicts = df.T.to_dict().values()

以下是一些小型词典和大型(100k +)字典的一些基准值:

setup_large = "listOfDicts = [];\
[listOfDicts.extend(({'name':'Homer', 'age':39}, {'name':'Bart', 'age':10})) for _ in range(50000)];\
from operator import itemgetter;import pandas as pd;\
df = pd.DataFrame(listOfDicts);"

setup_small = "listOfDicts = [];\
listOfDicts.extend(({'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}));\
from operator import itemgetter;import pandas as pd;\
df = pd.DataFrame(listOfDicts);"

method1 = "newlist = sorted(listOfDicts, key=lambda k: k['name'])"
method2 = "newlist = sorted(listOfDicts, key=itemgetter('name')) "
method3 = "df = df.sort_values('name');\
sorted_listOfDicts = df.T.to_dict().values()"

import timeit
t = timeit.Timer(method1, setup_small)
print('Small Method LC: ' + str(t.timeit(100)))
t = timeit.Timer(method2, setup_small)
print('Small Method LC2: ' + str(t.timeit(100)))
t = timeit.Timer(method3, setup_small)
print('Small Method Pandas: ' + str(t.timeit(100)))

t = timeit.Timer(method1, setup_large)
print('Large Method LC: ' + str(t.timeit(100)))
t = timeit.Timer(method2, setup_large)
print('Large Method LC2: ' + str(t.timeit(100)))
t = timeit.Timer(method3, setup_large)
print('Large Method Pandas: ' + str(t.timeit(1)))

#Small Method LC: 0.000163078308105
#Small Method LC2: 0.000134944915771
#Small Method Pandas: 0.0712950229645
#Large Method LC: 0.0321750640869
#Large Method LC2: 0.0206089019775
#Large Method Pandas: 5.81405615807

Using the pandas package is another method, though its runtime at large scale is much slower than the more traditional methods proposed by others:

import pandas as pd

listOfDicts = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}]
df = pd.DataFrame(listOfDicts)
df = df.sort_values('name')
sorted_listOfDicts = df.T.to_dict().values()

Here are some benchmark values for a tiny list and a large (100k+) list of dicts:

setup_large = "listOfDicts = [];\
[listOfDicts.extend(({'name':'Homer', 'age':39}, {'name':'Bart', 'age':10})) for _ in range(50000)];\
from operator import itemgetter;import pandas as pd;\
df = pd.DataFrame(listOfDicts);"

setup_small = "listOfDicts = [];\
listOfDicts.extend(({'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}));\
from operator import itemgetter;import pandas as pd;\
df = pd.DataFrame(listOfDicts);"

method1 = "newlist = sorted(listOfDicts, key=lambda k: k['name'])"
method2 = "newlist = sorted(listOfDicts, key=itemgetter('name')) "
method3 = "df = df.sort_values('name');\
sorted_listOfDicts = df.T.to_dict().values()"

import timeit
t = timeit.Timer(method1, setup_small)
print('Small Method LC: ' + str(t.timeit(100)))
t = timeit.Timer(method2, setup_small)
print('Small Method LC2: ' + str(t.timeit(100)))
t = timeit.Timer(method3, setup_small)
print('Small Method Pandas: ' + str(t.timeit(100)))

t = timeit.Timer(method1, setup_large)
print('Large Method LC: ' + str(t.timeit(100)))
t = timeit.Timer(method2, setup_large)
print('Large Method LC2: ' + str(t.timeit(100)))
t = timeit.Timer(method3, setup_large)
print('Large Method Pandas: ' + str(t.timeit(1)))

#Small Method LC: 0.000163078308105
#Small Method LC2: 0.000134944915771
#Small Method Pandas: 0.0712950229645
#Large Method LC: 0.0321750640869
#Large Method LC2: 0.0206089019775
#Large Method Pandas: 5.81405615807

回答 13

如果你不需要原来listdictionaries,你可以用修改就地sort()使用自定义按键功能的方法。

按键功能:

def get_name(d):
    """ Return the value of a key in a dictionary. """

    return d["name"]

list进行排序:

data_one = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age': 10}]

就地排序:

data_one.sort(key=get_name)

如果您需要原始的list,请调用将sorted()函数传递给的函数list和键函数,然后将返回的排序list后的变量分配给新变量:

data_two = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age': 10}]
new_data = sorted(data_two, key=get_name)

印刷data_onenew_data

>>> print(data_one)
[{'name': 'Bart', 'age': 10}, {'name': 'Homer', 'age': 39}]
>>> print(new_data)
[{'name': 'Bart', 'age': 10}, {'name': 'Homer', 'age': 39}]

If you do not need the original list of dictionaries, you could modify it in-place with sort() method using a custom key function.

Key function:

def get_name(d):
    """ Return the value of a key in a dictionary. """

    return d["name"]

The list to be sorted:

data_one = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age': 10}]

Sorting it in-place:

data_one.sort(key=get_name)

If you need the original list, call the sorted() function passing it the list and the key function, then assign the returned sorted list to a new variable:

data_two = [{'name': 'Homer', 'age': 39}, {'name': 'Bart', 'age': 10}]
new_data = sorted(data_two, key=get_name)

Printing data_one and new_data.

>>> print(data_one)
[{'name': 'Bart', 'age': 10}, {'name': 'Homer', 'age': 39}]
>>> print(new_data)
[{'name': 'Bart', 'age': 10}, {'name': 'Homer', 'age': 39}]

回答 14

假设我有一本D包含以下内容的字典。要进行排序,只需使用sort中的key参数来传递自定义函数,如下所示:

D = {'eggs': 3, 'ham': 1, 'spam': 2}
def get_count(tuple):
    return tuple[1]

sorted(D.items(), key = get_count, reverse=True)
# or
sorted(D.items(), key = lambda x: x[1], reverse=True)  # avoiding get_count function call

检查这个出来。

Let’s say I have a dictionary D with the elements below. To sort it, just use the key argument of sorted to pass a custom function, as below:

D = {'eggs': 3, 'ham': 1, 'spam': 2}
def get_count(tuple):
    return tuple[1]

sorted(D.items(), key = get_count, reverse=True)
# or
sorted(D.items(), key = lambda x: x[1], reverse=True)  # avoiding get_count function call

Check this out.


回答 15

我一直是lambda过滤器的忠实拥护者,但是如果您考虑时间复杂性,则不是最佳选择

第一选择

sorted_list = sorted(list_to_sort, key= lambda x: x['name'])
# returns list of values

第二选择

import operator

list_to_sort.sort(key=operator.itemgetter('name'))
#edits the list, does not return a new list

快速比较执行时间

# First option
python3.6 -m timeit -s "list_to_sort = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}, {'name':'Faaa', 'age':57}, {'name':'Errr', 'age':20}]" -s "sorted_l=[]" "sorted_l = sorted(list_to_sort, key=lambda e: e['name'])"

1000000次循环,最好为3:每个循环0.736微秒

# Second option 
python3.6 -m timeit -s "list_to_sort = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}, {'name':'Faaa', 'age':57}, {'name':'Errr', 'age':20}]" -s "sorted_l=[]" -s "import operator" "list_to_sort.sort(key=operator.itemgetter('name'))"

1000000次循环,最好为3:每个循环0.438微秒

I have been a big fan of using a lambda as the key; however, it is not the best option if you are considering execution time.

First option

sorted_list = sorted(list_to_sort, key= lambda x: x['name'])
# returns list of values

Second option

import operator

list_to_sort.sort(key=operator.itemgetter('name'))
#edits the list, does not return a new list

Fast comparison of exec times

# First option
python3.6 -m timeit -s "list_to_sort = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}, {'name':'Faaa', 'age':57}, {'name':'Errr', 'age':20}]" -s "sorted_l=[]" "sorted_l = sorted(list_to_sort, key=lambda e: e['name'])"

1000000 loops, best of 3: 0.736 usec per loop

# Second option 
python3.6 -m timeit -s "list_to_sort = [{'name':'Homer', 'age':39}, {'name':'Bart', 'age':10}, {'name':'Faaa', 'age':57}, {'name':'Errr', 'age':20}]" -s "sorted_l=[]" -s "import operator" "list_to_sort.sort(key=operator.itemgetter('name'))"

1000000 loops, best of 3: 0.438 usec per loop


回答 16

如果需要考虑性能,我会使用内置函数operator.itemgetter来代替lambda手工函数,而使用内置函数来代替。该itemgetter功能似乎比lambda根据我的测试快约20%。

https://wiki.python.org/moin/PythonSpeed

同样,内置函数比手工生成的等效函数运行得更快。例如,map(operator.add,v1,v2)比map(lambda x,y:x + y,v1,v2)快。

这是使用lambdavs 进行排序速度的比较itemgetter

import operator
import random
import string

# create a list of 100 dicts with random 8-letter names and random ages from 0 to 100.
l = [{'name': ''.join(random.choices(string.ascii_lowercase, k=8)), 'age': random.randint(0, 100)} for i in range(100)]

# Test the performance with a lambda function sorting on name
%timeit sorted(l, key=lambda x: x['name'])
13 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Test the performance with itemgetter sorting on name
%timeit sorted(l, key=operator.itemgetter('name'))
10.7 µs ± 38.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Check that each technique produces same sort order
sorted(l, key=lambda x: x['name']) == sorted(l, key=operator.itemgetter('name'))
True

两种技术都以相同的顺序对列表进行排序(通过执行代码块中的final语句进行验证),但是一种方法要快一些。

If performance is a concern, I would use operator.itemgetter instead of lambda as built-in functions perform faster than hand-crafted functions. The itemgetter function seems to perform approximately 20% faster than lambda based on my testing.

From https://wiki.python.org/moin/PythonSpeed:

Likewise, the builtin functions run faster than hand-built equivalents. For example, map(operator.add, v1, v2) is faster than map(lambda x,y: x+y, v1, v2).

Here is a comparison of sorting speed using lambda vs itemgetter.

import operator
import random
import string

# create a list of 100 dicts with random 8-letter names and random ages from 0 to 100.
l = [{'name': ''.join(random.choices(string.ascii_lowercase, k=8)), 'age': random.randint(0, 100)} for i in range(100)]

# Test the performance with a lambda function sorting on name
%timeit sorted(l, key=lambda x: x['name'])
13 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Test the performance with itemgetter sorting on name
%timeit sorted(l, key=operator.itemgetter('name'))
10.7 µs ± 38.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Check that each technique produces same sort order
sorted(l, key=lambda x: x['name']) == sorted(l, key=operator.itemgetter('name'))
True

Both techniques sort the list in the same order (verified by execution of the final statement in the code block) but one is a little faster.
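Outside IPython, where %timeit is not available, the same comparison can be sketched with the standard timeit module. The helper names make_records and compare_sort_keys are made up for this illustration, and timings vary by machine, so no numbers are shown.

```python
import operator
import random
import string
import timeit


def make_records(n=100, seed=0):
    """Build n dicts with random 8-letter names and ages from 0 to 100."""
    rng = random.Random(seed)
    return [
        {"name": "".join(rng.choices(string.ascii_lowercase, k=8)),
         "age": rng.randint(0, 100)}
        for _ in range(n)
    ]


def compare_sort_keys(records, number=1000):
    """Return (lambda_seconds, itemgetter_seconds) for `number` sorts each."""
    by_lambda = timeit.timeit(
        lambda: sorted(records, key=lambda x: x["name"]), number=number)
    by_getter = timeit.timeit(
        lambda: sorted(records, key=operator.itemgetter("name")), number=number)
    return by_lambda, by_getter


records = make_records()
print(compare_sort_keys(records, number=100))
```

Whichever key function turns out faster on a given machine, both produce the same ordering, so the choice is purely about speed and readability.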


Answer 17

You may use the following code

sorted_dct = sorted(dct_name.items(), key = lambda x : x[1])
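A small worked example may help here. Note that the one-liner returns a list of (key, value) tuples rather than a dict. The contents of dct_name below are made up, and the name as_dict is only illustrative.

```python
# Sample data; the contents of dct_name are made up for illustration.
dct_name = {"pear": 3, "apple": 1, "fig": 2}

# sorted() over items() yields a list of (key, value) tuples ordered by value.
sorted_dct = sorted(dct_name.items(), key=lambda x: x[1])

# On Python 3.7+ dicts keep insertion order, so this dict iterates by value.
as_dict = dict(sorted_dct)

print(sorted_dct)
print(as_dict)
```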

How to print without a newline or space?

Question: How to print without a newline or space?

I'd like to do it in Python. What I'd like to do is shown by this example in C:

In C:

#include <stdio.h>

int main() {
    int i;
    for (i=0; i<10; i++) printf(".");
    return 0;
}

Output:

..........

In Python:

>>> for i in range(10): print('.')
.
.
.
.
.
.
.
.
.
.
>>> print('.', '.', '.', '.', '.', '.', '.', '.', '.', '.')
. . . . . . . . . .

In Python print will add a \n or space, how can I avoid that? Now, it’s just an example, don’t tell me I can first build a string then print it. I’d like to know how to “append” strings to stdout.


Answer 0

In Python 3, you can use the sep= and end= parameters of the print function:

To not add a newline to the end of the string:

print('.', end='')

To not add a space between all the function arguments you want to print:

print('a', 'b', 'c', sep='')

You can pass any string to either parameter, and you can use both parameters at the same time.

If you are having trouble with buffering, you can flush the output by adding flush=True keyword argument:

print('.', end='', flush=True)

Python 2.6 and 2.7

From Python 2.6 onward, you can import the print function from Python 3 using the __future__ module:

from __future__ import print_function

which allows you to use the Python 3 solution above.

However, note that the flush keyword is not available in the version of the print function imported from __future__ in Python 2; it only works in Python 3, more specifically 3.3 and later. In earlier versions you’ll still need to flush manually with a call to sys.stdout.flush(). You’ll also have to rewrite all other print statements in the file where you do this import.

Or you can use sys.stdout.write()

import sys
sys.stdout.write('.')

You may also need to call

sys.stdout.flush()

to ensure stdout is flushed immediately.
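Putting the Python 3 pieces together, here is a minimal sketch of "appending" dots to an output stream. The helper emit_dots and its stream parameter are names invented for this example, and an in-memory buffer stands in for stdout so the result can be inspected.

```python
import io
import sys


def emit_dots(count, stream=None):
    """Write `count` dots to the stream with no separators and no newline."""
    if stream is None:
        stream = sys.stdout
    for _ in range(count):
        # end='' suppresses the newline; flush=True pushes each dot out at once.
        print('.', end='', file=stream, flush=True)


# Capture the output in memory to show that nothing is added between dots.
buf = io.StringIO()
emit_dots(10, stream=buf)
print(repr(buf.getvalue()))
```

Calling emit_dots(10) with no stream writes straight to the terminal, which matches the C loop in the question.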


Answer 1

It should be as simple as described at this link by Guido van Rossum (the syntax shown is the Python 2 print statement):

Re: How does one print without a c/r ?

http://legacy.python.org/search/hypermail/python-1992/0115.html

Is it possible to print something but not automatically have a carriage return appended to it ?

Yes, append a comma after the last argument to print. For instance, this loop prints the numbers 0..9 on a line separated by spaces. Note the parameterless “print” that adds the final newline:

>>> for i in range(10):
...     print i,
... else:
...     print
...
0 1 2 3 4 5 6 7 8 9
>>> 

Answer 2

Note: The title of this question used to be something like “How to printf in python?”

Since people may come here looking for it based on the title, Python also supports printf-style substitution:

>>> strings = [ "one", "two", "three" ]
>>>
>>> for i in xrange(3):
...     print "Item %d: %s" % (i, strings[i])
...
Item 0: one
Item 1: two
Item 2: three

And, you can handily multiply string values:

>>> print "." * 10
..........
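The session above uses Python 2 syntax (xrange and the print statement). A rough Python 3 equivalent, assuming the same strings list, could look like this:

```python
strings = ["one", "two", "three"]

# The % operator formats like C's printf; enumerate() supplies the index.
lines = ["Item %d: %s" % (i, s) for i, s in enumerate(strings)]
for line in lines:
    print(line)

# Multiplying a string repeats it, so no loop is needed for a row of dots.
dots = "." * 10
print(dots)
```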

Answer 3

Use the python3-style print function for python2.6+ (will also break any existing keyworded print statements in the same file.)

# for python2 to use the print() function, removing the print keyword
from __future__ import print_function
for x in xrange(10):
    print('.', end='')

To not ruin all your python2 print keywords, create a separate printf.py file

# printf.py

from __future__ import print_function

def printf(fmt, *args):
    # fmt is a printf-style format string; avoid naming it `str`, which shadows the built-in
    print(fmt % args, end='')

Then, use it in your file

from printf import printf
for x in xrange(10):
    printf('.')
print 'done'
#..........done

More examples showing printf style

printf('hello %s', 'world')
printf('%i %f', 10, 3.14)
#hello world10 3.140000

Answer 4

How to print on the same line:

import sys
for i in xrange(0,10):
   sys.stdout.write(".")
   sys.stdout.flush()

Answer 5

The new (as of Python 3.x) print function has an optional end parameter that lets you modify the ending character:

print("HELLO", end="")
print("HELLO")

Output:

HELLOHELLO

There’s also sep for separator:

print("HELLO", "HELLO", "HELLO", sep="")

Output:

HELLOHELLOHELLO

If you wanted to use this in Python 2.x just add this at the start of your file:

from __future__ import print_function


Answer 6

Using functools.partial to create a new function called printf

>>> import functools

>>> printf = functools.partial(print, end="")

>>> printf("Hello world\n")
Hello world

Easy way to wrap a function with default parameters.
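One detail worth knowing about this approach: a keyword default set through functools.partial can still be overridden on an individual call. The sketch below writes to an in-memory buffer (an assumption made only so the result is easy to inspect).

```python
import functools
import io

# printf behaves like print but ends with nothing instead of a newline.
printf = functools.partial(print, end="")

buf = io.StringIO()
printf("Hello", file=buf)              # no newline appended
printf(" world", end="!\n", file=buf)  # a per-call end= overrides the default
print(buf.getvalue(), end="")
```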


Answer 7

In Python 2, you can add a trailing comma at the end of the print statement so that it won't start a new line.


Answer 8

In Python 3+, print is a function. When you call

print('hello world')

Python translates it to

print('hello world', end='\n')

You can change end to whatever you want.

print('hello world', end='')
print('hello world', end=' ')

Answer 9

python 2.6+:

from __future__ import print_function # needs to be first statement in file
print('.', end='')

python 3:

print('.', end='')

python <= 2.5:

import sys
sys.stdout.write('.')

if extra space is OK after each print, in python 2

print '.',

misleading in python 2 – avoid:

print('.'), # avoid this if you want to remain sane
# this makes it look like print is a function but it is not
# this is the `,` creating a tuple and the parentheses enclose an expression
# to see the problem, try:
print('.', 'x'), # this will print `('.', 'x') `

Answer 10

You can try:

import sys
import time
# Keeps the initial message in buffer.
sys.stdout.write("\rfoobar bar black sheep")
sys.stdout.flush()
# Wait 2 seconds
time.sleep(2)
# Replace the message with a new one.
sys.stdout.write("\r"+'hahahahaaa             ')
sys.stdout.flush()
# Finish the line by printing a newline.
sys.stdout.write('\n')
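The hard-coded trailing spaces above are there to blank out what is left of the longer first message. A sketch of the same idea that computes the padding instead is shown below. The helper overwrite_line and its parameters are hypothetical names, and the behavior of '\r' depends on the terminal.

```python
import sys
import time


def overwrite_line(message, previous_length=0, stream=None):
    """Redraw the current terminal line with `message`.

    Returns the length of `message` so the next call can blank out leftovers.
    """
    if stream is None:
        stream = sys.stdout
    # '\r' moves the cursor to column 0; ljust pads over a longer old message.
    stream.write("\r" + message.ljust(previous_length))
    stream.flush()
    return len(message)


if __name__ == "__main__":
    shown = overwrite_line("foobar bar black sheep")
    time.sleep(1)
    shown = overwrite_line("hahahahaaa", shown)
    sys.stdout.write("\n")
```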

Answer 11

You can do the same in python3 as follows :

#!/usr/bin/env python3

i = 0
while i<10 :
    print('.',end='')
    i = i+1

and execute it with python filename.py or python3 filename.py


Answer 12

i recently had the same problem..

i solved it by doing:

import sys, os

# reopen stdout with "newline=None".
# in this mode,
# input:  accepts any newline character, outputs as '\n'
# output: '\n' converts to os.linesep

sys.stdout = os.fdopen(sys.stdout.fileno(), "w", newline=None)

for i in range(1,10):
        print(i)

this works on both unix and windows … have not tested it on macosx …

hth


Answer 13

@lenooh satisfied my query. I discovered this article while searching for 'python suppress newline'. I'm using IDLE3 on Raspberry Pi to develop Python 3.2 for PuTTY. I wanted to create a progress bar on the PuTTY command line. I didn't want the page scrolling away. I wanted a horizontal line to reassure the user that the program hasn't crunched to a halt or been sent to lunch on a merry infinite loop: an interactive 'leave me be, I'm doing fine, but this may take some time' message, like a progress bar in text.

The print('Skimming for', search_string, '\b! .001', end='') initializes the message by preparing for the next screen-write, which will print three backspaces as ⌫⌫⌫ rubout and then a period, wiping off ‘001’ and extending the line of periods. After search_string parrots user input, the \b! trims the exclamation point of my search_string text to back over the space which print() otherwise forces, properly placing the punctuation. That’s followed by a space and the first ‘dot’ of the ‘progress bar’ which I’m simulating. Unnecessarily, the message is also then primed with the page number (formatted to a length of three with leading zeros) to take notice from the user that progress is being processed and which will also reflect the count of periods we will later build out to the right.

import sys

page = 1
done = False
results = []  # bulk output list (renamed so it does not shadow the built-in `list`)
search_string = input('Search for? ')
print('Skimming for', search_string, '\b! .001', end='')
sys.stdout.flush()  # make sure the text appears even though end='' suppressed the newline
while page:
    # some stuff…
    # search, scrub, and build the bulk output list, count items,
    # set the done flag to True when finished
    done = page >= 5  # placeholder condition (an assumption) so the sketch terminates
    page = page + 1
    sys.stdout.write('\b\b\b.'+format(page, '03'))  # <-- here's the progress bar meat
    sys.stdout.flush()
    if done:  # flag alternative to break, exit or quit
        print('\nSorting', len(results), 'items')
        page = 0  # exits the 'while page' loop
results.sort()
for item in results:
    print(item)
# print footers here

The progress bar meat is in the sys.stdout.write('\b\b\b.'+format(page, '03')) line. First, to erase to the left, it backs up the cursor over the three numeric characters with the ‘\b\b\b’ as ⌫⌫⌫ rubout and drops a new period to add to the progress bar length. Then it writes three digits of the page it has progressed to so far. Because sys.stdout.write() waits for a full buffer or the output channel to close, the sys.stdout.flush() forces the immediate write. sys.stdout.flush() is built into the end of print() which is bypassed with print(txt, end='' ). Then the code loops through its mundane time intensive operations while it prints nothing more until it returns here to wipe three digits back, add a period and write three digits again, incremented.

The three digits wiped and rewritten is by no means necessary – it’s just a flourish which exemplifies sys.stdout.write() versus print(). You could just as easily prime with a period and forget the three fancy backslash-b ⌫ backspaces (of course not writing formatted page counts as well) by just printing the period bar longer by one each time through – without spaces or newlines using just the sys.stdout.write('.'); sys.stdout.flush() pair.

Please note that the Raspberry Pi IDLE3 Python shell does not honor the backspace as ⌫ rubout but instead prints a space, creating an apparent list of fractions instead.

—(o=8> wiz
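To see what the backspace trick leaves on screen without needing a terminal, here is a small simulation. The functions progress_chunk and render are invented for this sketch, and render models the backspace in a simplified way (it deletes the previous character, which gives the same visible result here because every erased digit is immediately rewritten).

```python
def progress_chunk(page):
    """Characters written per step: back over 3 digits, add a dot, new counter."""
    return '\b\b\b.' + format(page, '03')


def render(text):
    """Simulate a terminal in which a backspace removes the previous character."""
    shown = []
    for ch in text:
        if ch == '\b':
            if shown:
                shown.pop()
        else:
            shown.append(ch)
    return ''.join(shown)


stream = '.001'  # the initial dot and page counter from the first print() call
for page in (2, 3, 4):
    stream += progress_chunk(page)
print(render(stream))
```

Each step makes the row of dots one character longer while the three-digit counter stays at the right-hand end.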


Answer 14

You will notice that all the above answers are correct. But I wanted a shortcut so that I don't have to write the end='' parameter every time.

You could define a function like

def Print(*args,sep='',end='',file=None,flush=False):
    print(*args,sep=sep,end=end,file=file,flush=flush)

It accepts any number of positional arguments, and it also accepts the other print parameters, such as file and flush, under the same names.


Answer 15

Many of these answers seem a little complicated. In Python 3.x you simply do this:

print(<expr>, <expr>, ..., <expr>, end=" ")

The default value of end is "\n". We are simply changing it to a space or you can also use end="" (no space) to do what printf normally does.


Answer 16

You want to print something in a for loop, but you don't want it printed on a new line every time. For example:

 for i in range (0,5):
   print "hi"

 OUTPUT:
    hi
    hi
    hi
    hi
    hi

But you want it to print like this: hi hi hi hi hi? Just add a comma after print "hi" (this is Python 2 syntax).

Example:

 for i in range (0,5):
   print "hi",

 OUTPUT:
    hi hi hi hi hi


Answer 17

Or have a function like:

import sys

def Print(s):
   # write the text form of s with no trailing newline or space
   return sys.stdout.write(str(s))

Then now:

for i in range(10): # or `xrange` for python 2 version
   Print(i)

Outputs:

0123456789

Answer 18

for i in xrange(0,10): print '\b.',

This worked in both 2.7.8 & 2.5.2 (Canopy and OSX terminal, respectively) — no module imports or time travel required.


Answer 19

There are generally two ways to do this:

Print without newline in Python 3.x

Append nothing after the print statement and remove ‘\n’ by using end='' as:

>>> print('hello')
hello  # appending '\n' automatically
>>> print('world')
world # with previous '\n' world comes down

# solution is:
>>> print('hello', end='');print(' world'); # end with anything like end='-' or end=" " but not '\n'
hello world # it seem correct output

Another Example in Loop:

for i in range(1,10):
    print(i, end='.')

Print without newline in Python 2.x

A trailing comma tells the print statement not to append \n (a space is emitted before the next printed item instead).

>>> print "hello",; print" world"
hello world

Another Example in Loop:

for i in range(1,10):
    print "{} .".format(i),

Hope this helps.


Answer 20

…you do not need to import any library. Just use the backspace character (Python 2 syntax):

BS = u'\u0008'  # the Unicode backspace character (same as '\b')
for i in range(10): print(BS + "."),

On a terminal that honors backspace, this hides both the newline and the separating space (^_^)*


Why is reading lines from stdin much slower in C++ than Python?

Question: Why is reading lines from stdin much slower in C++ than Python?

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I’m not yet an expert Pythonista, please tell me if I’m doing something wrong or if I’m misunderstanding something.


(TLDR answer: include the statement: cin.sync_with_stdio(false) or just use fgets instead.

TLDR results: scroll all the way down to the bottom of my question and look at the table.)


C++ code:

#include <iostream>
#include <string>
#include <time.h>

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    };

    sec = (int) time(NULL) - start;
    cerr << "Read " << line_count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = line_count / sec;
        cerr << " LPS: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp

Python Equivalent:

#!/usr/bin/env python
import time
import sys

count = 0
start_time = time.time()

for line in sys.stdin:
    count += 1

delta_sec = int(time.time() - start_time)
if delta_sec > 0:
    lines_per_sec = int(round(count/delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
       lines_per_sec))

Here are my results:

$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$ cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000

I should note that I tried this both under Mac OS X v10.6.8 (Snow Leopard) and Linux 2.6.32 (Red Hat Linux 6.2). The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP:   Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in  1 seconds. LPS: 5570000

Tiny benchmark addendum and recap

For completeness, I thought I’d update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here’s the comparison, with several solutions/approaches:

Implementation      Lines per second
python (default)           3,571,428
cin (default/naive)          819,672
cin (no sync)             12,500,000
fgets                     14,285,714
wc (not fair comparison)  54,644,808
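The wc row is marked as not a fair comparison, presumably because wc only counts newlines and never builds a string per line. A rough Python sketch of that style of counting is shown below. This is not one of the programs benchmarked in the table; count_lines and chunk_size are names chosen for illustration, and no timing is claimed for it.

```python
import io


def count_lines(stream, chunk_size=1 << 20):
    """Count newline bytes in a binary stream, reading fixed-size chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of input
            break
        total += chunk.count(b"\n")
    return total


# In-memory sample; for real input, pass sys.stdin.buffer instead.
sample = io.BytesIO(b"first\nsecond\nthird\n")
print(count_lines(sample))
```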

Answer 0

By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

std::ios_base::sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn’t be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn’t a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method.
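The effect of buffer size on the number of low-level reads can be illustrated in Python, purely as an analogy and not as a description of how any C++ library implements cin. The class CountingRaw and the function count_lines_and_reads are made up for this sketch. Each call to readinto stands in for one read system call.

```python
import io


class CountingRaw(io.RawIOBase):
    """Raw byte source that counts how often the layer above asks for data."""

    def __init__(self, data):
        super().__init__()
        self._source = io.BytesIO(data)
        self.reads = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        self.reads += 1  # stands in for one read() system call
        return self._source.readinto(buffer)


def count_lines_and_reads(data, buffer_size):
    """Return (line_count, raw_read_count) for a given buffer size."""
    raw = CountingRaw(data)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    lines = sum(1 for _ in reader)
    return lines, raw.reads


data = b"some line of text\n" * 1000
print(count_lines_and_reads(data, buffer_size=1))      # tiny buffer: many raw reads
print(count_lines_and_reads(data, buffer_size=65536))  # large buffer: very few
```

Both calls see the same 1000 lines. Only the number of trips to the underlying source changes, which is the cost the answer attributes to unbuffered, character-at-a-time input.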


Answer 1

Just out of curiosity I’ve taken a look at what happens under the hood, and I’ve used dtruss/strace on each test.

C++

./a.out < in
Saw 6512403 lines in 8 seconds.  Crunch speed: 814050

syscalls sudo dtruss -c ./a.out < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            6
pread                                           8
mprotect                                       17
mmap                                           22
stat64                                         30
read_nocancel                               25958

Python

./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402

syscalls sudo dtruss -c ./a.py < in

CALL                                        COUNT
__mac_syscall                                   1
<snip>
open                                            5
pread                                           8
mprotect                                       17
mmap                                           21
stat64                                         29

Answer 2

我在这里落后了几年,但是:

在原始帖子的“编辑4/5/6”中,您正在使用以下结构:

$ /usr/bin/time cat big_file | program_to_benchmark

这有两种不同的错误方式:

  1. 您实际上是在定时执行cat,而不是基准测试。显示的“用户”和“系统” CPU使用率time是的cat,而不是基准测试程序。更糟糕的是,“实时”时间也不一定准确。根据cat本地操作系统中和管道的实现,有可能cat在读取器进程完成其工作之前写入最终的巨型缓冲区并退出。

  2. 使用cat是不必要的,实际上会适得其反;您正在添加活动部件。如果您使用的是足够老的系统(例如,具有单个CPU,并且-在某些代计算机中-I / O比CPU快),则仅cat运行一个事实就可以使结果显色。您还必须遵守输入和输出缓冲以及其他处理的所有cat要求。(如果我是Randal Schwartz,这可能会为您赢得“猫的无用使用”奖。

更好的构造是:

$ /usr/bin/time program_to_benchmark < big_file

在此语句中,外壳程序将打开big_file,并将其作为已打开的文件描述符传递给您的程序(time然后,实际上将其作为子进程执行到该程序)。所读取文件的100%严格是您要进行基准测试的程序的责任。这使您可以真正了解其性能,而不会产生虚假的并发症。

我会提到两个可能但实际上是错误的“修复程序”,这些也可以考虑(但我对它们进行了“不同”的编号,因为这些并不是原始帖子中出现的错误):

答:您可以通过仅定时执行程序来“修复”此问题:

$ cat big_file | /usr/bin/time program_to_benchmark

B.或通过计时整个管道:

$ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'

由于与#2相同的原因,它们是错误的:它们仍在cat不必要地使用。我提到它们的原因有几个:

  • 对于对POSIX shell的I / O重定向功能不完全满意的人来说,它们更“自然”

  • 可能存在的情况cat 需要(例如:要读取的文件需要某种特权来访问,并且不希望授予该特权的程序进行基准测试:sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output

  • 实际上,在现代机器上,cat管道中添加的内容可能没有任何实际意义。

但是我有些犹豫地说那最后一件事。如果我们检查“ Edit 5”中的最后一个结果-

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU ...

-这声称cat在测试期间消耗了74%的CPU; 而实际上1.34 / 1.83约为74%。也许运行:

$ /usr/bin/time wc -l < temp_big_file

只会花剩下的0.49秒!可能不需要:cat这里必须支付read()从文件“磁盘”(实际上是缓冲区高速缓存)传输文件的系统调用(或等效调用),以及为将文件传递到的管道写操作wc。正确的测试仍然必须进行这些read()调用。只有写到管道和读到管道调用将被保存,并且这些调用应该非常便宜。

尽管如此,我预计您将能够测量出两者之间的差异cat file | wc -lwc -l < file并找到明显的差异(两位数百分比)。每个较慢的测试在绝对时间内都会付出类似的代价。但是,这只占其总时间的一小部分。

实际上,我在Linux 3.13(Ubuntu 14.04)系统上对1.5 GB的垃圾文件进行了一些快速测试,获得了这些结果(这些结果实际上是“最好的3个”结果;当然,在启动缓存之后):

$ time wc -l < /tmp/junk
real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
$ time cat /tmp/junk | wc -l
real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
$ time sh -c 'cat /tmp/junk | wc -l'
real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)

请注意,这两个管道结果声称比实际的挂钟时间花费了更多的CPU时间(user + sys)。这是因为我正在使用shell(bash)的内置“ time”命令,该命令可以识别管道。我在多核计算机上,流水线中的各个进程可以使用各个核,因此,CPU时间的累积要比实时更快。通过使用,/usr/bin/time我看到的CPU时间比实时时间要短-表明它只能计时单个管道元素在其命令行上传递给它的时间。而且,shell的输出/usr/bin/time仅提供毫秒,而仅提供百分之一秒。

因此,在的效率水平上wc -l,将cat产生巨大的差异:409/283 = 1.453或45.3%多的实时,和775/280 = 2.768,或多177%的CPU使用!在我的随机情况下,它是同时存在的测试箱。

我要补充一点,这些测试样式之间至少存在另一个显着差异,我不能说这是好处还是错误;您必须自己决定:

运行时cat big_file | /usr/bin/time my_program,您的程序正在以正好由发送的速度从管道接收输入cat,并且块的大小不得大于编写的速度cat

运行时/usr/bin/time my_program < big_file,程序会收到一个指向实际文件的打开文件描述符。当您的程序(在许多情况下,该语言是使用其编写的语言的I / O库)在提供引用常规文件的文件描述符时可能会采取不同的操作。它可能用于mmap(2)将输入文件映射到其地址空间,而不是使用显式的read(2)系统调用。与运行cat二进制文件的少量费用相比,这些差异可能会对基准测试结果产生更大的影响。

当然,如果同一程序在两种情况下的执行情况显着不同,这将是一个有趣的基准结果。它确实表明该程序或其I / O库正在做一些有趣的事情,例如使用mmap()。因此,在实践中最好同时使用两种基准。也许将cat结果小幅折算以“原谅”其运行成本cat

I’m a few years behind here, but:

In ‘Edit 4/5/6’ of the original post, you are using the construction:

$ /usr/bin/time cat big_file | program_to_benchmark

This is wrong in a couple of different ways:

  1. You’re actually timing the execution of `cat`, not your benchmark. The ‘user’ and ‘sys’ CPU usage displayed by `time` are those of `cat`, not your benchmarked program. Even worse, the ‘real’ time is also not necessarily accurate. Depending on the implementation of `cat` and of pipelines in your local OS, it is possible that `cat` writes a final giant buffer and exits long before the reader process finishes its work.

  2. Use of `cat` is unnecessary and in fact counterproductive; you’re adding moving parts. If you were on a sufficiently old system (e.g. with a single CPU and, in certain generations of computers, I/O faster than CPU), the mere fact that `cat` was running could substantially color the results. You are also subject to whatever input and output buffering and other processing `cat` may do. (This would likely earn you a ‘Useless Use Of Cat’ award if I were Randal Schwartz.)

A better construction would be:

$ /usr/bin/time program_to_benchmark < big_file

In this statement it is the shell which opens big_file, passing it to your program (well, actually to `time` which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you’re trying to benchmark. This gets you a real reading of its performance without spurious complications.

I will mention two possible, but actually wrong, ‘fixes’ which could also be considered (but I ‘number’ them differently as these are not things which were wrong in the original post):

A. You could ‘fix’ this by timing only your program:

$ cat big_file | /usr/bin/time program_to_benchmark

B. or by timing the entire pipeline:

$ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'

These are wrong for the same reasons as #2: they’re still using `cat` unnecessarily. I mention them for a few reasons:

  • they’re more ‘natural’ for people who aren’t entirely comfortable with the I/O redirection facilities of the POSIX shell

  • there may be cases where `cat` is needed (e.g.: the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: `sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output`)

  • in practice, on modern machines, the added `cat` in the pipeline is probably of no real consequence

But I say that last thing with some hesitation. If we examine the last result in ‘Edit 5’ —

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU ...

— this claims that `cat` consumed 74% of the CPU during the test; and indeed 1.34/1.83 is approximately 74%. Perhaps a run of:

$ /usr/bin/time wc -l < temp_big_file

would have taken only the remaining .49 seconds! Probably not: `cat` here had to pay for the read() system calls (or equivalent) which transferred the file from ‘disk’ (actually buffer cache), as well as the pipe writes to deliver them to `wc`. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.

Still, I predict you would be able to measure the difference between `cat file | wc -l` and `wc -l < file` and find a noticeable (2-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time; which would however amount to a smaller fraction of its larger total time.

In fact I did some quick tests with a 1.5 gigabyte file of garbage, on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually ‘best of 3’ results; after priming the cache, of course):

$ time wc -l < /tmp/junk
real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
$ time cat /tmp/junk | wc -l
real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
$ time sh -c 'cat /tmp/junk | wc -l'
real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)

Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I’m using the shell (bash)’s built-in ‘time’ command, which is cognizant of the pipeline; and I’m on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime — showing that it can only time the single pipeline element passed to it on its command line. Also, the shell’s output gives milliseconds while /usr/bin/time only gives hundredths of a second.

So at the efficiency level of `wc -l`, the `cat` makes a huge difference: 407 / 280 = 1.453, or 45.3% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used! That is on my random it-was-there-at-the-time test box.
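To make the arithmetic explicit, here is a small Python sketch that recomputes the overhead from the bash `time` figures listed above. The dictionary and function names are mine, chosen only for illustration.

```python
# Timings in seconds, copied from the bash `time` results above.
direct = {"real": 0.280, "cpu": 0.280}   # wc -l < /tmp/junk
piped = {"real": 0.407, "cpu": 0.775}    # cat /tmp/junk | wc -l


def overhead(slow, fast):
    """Return (ratio, extra_percent) of slow relative to fast."""
    ratio = slow / fast
    return ratio, (ratio - 1.0) * 100.0


real_ratio, real_extra = overhead(piped["real"], direct["real"])
cpu_ratio, cpu_extra = overhead(piped["cpu"], direct["cpu"])

print(f"wall clock: x{real_ratio:.3f} ({real_extra:.1f}% more)")
print(f"cpu time:   x{cpu_ratio:.3f} ({cpu_extra:.1f}% more)")
```

The same helper can be pointed at the `sh -c` row to confirm that timing the whole pipeline through a subshell gives nearly identical overhead.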

I should add that there is at least one other significant difference between these styles of testing, and I can’t say whether it is a benefit or fault; you have to decide this yourself:

When you run `cat big_file | /usr/bin/time my_program`, your program is receiving input from a pipe, at precisely the pace sent by `cat`, and in chunks no larger than written by `cat`.

When you run `/usr/bin/time my_program < big_file`, your program receives an open file descriptor to the actual file. Your program — or in many cases the I/O libraries of the language in which it was written — may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space, instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running the `cat` binary.

Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the `cat` result by some small factor to “forgive” the cost of running `cat` itself.
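A program can tell which of the two situations it is in by inspecting its standard input. The following is a minimal Python sketch of that check, assuming a POSIX-like system; `describe_fd` is a hypothetical helper name, not part of any library.

```python
import os
import stat


def describe_fd(fd):
    """Classify a file descriptor as 'regular file', 'pipe', or 'other'."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        return "regular file"   # e.g. my_program < big_file
    if stat.S_ISFIFO(mode):
        return "pipe"           # e.g. cat big_file | my_program
    return "other"              # terminal, socket, device, ...
```

Calling `describe_fd(0)` at the start of a benchmarked script and logging the result to stderr is one way to record which input path a given measurement actually exercised.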


回答 3

我在Mac上使用g ++在计算机上重现了原始结果。

while循环之前将以下语句添加到C ++版本,使其与Python版本内联:

std::ios_base::sync_with_stdio(false);
char buffer[1048576];
std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

sync_with_stdio将速度提高到2秒,并且设置更大的缓冲区将其降低到1秒。

I reproduced the original result on my computer using g++ on a Mac.

Adding the following statements to the C++ version just before the while loop brings it in line with the Python version:

std::ios_base::sync_with_stdio(false);
char buffer[1048576];
std::cin.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

sync_with_stdio improved speed to 2 seconds, and setting a larger buffer brought it down to 1 second.


回答 4

getlinescanf如果您不关心文件加载时间或正在加载小型文本文件,则流操作符可以很方便。但是,如果性能是您所关心的,那么您实际上应该只是将整个文件缓冲到内存中(假设它将适合)。

这是一个例子:

//open file in binary mode
std::fstream file( filename, std::ios::in|::std::ios::binary );
if( !file ) return NULL;

//read the size...
file.seekg(0, std::ios::end);
size_t length = (size_t)file.tellg();
file.seekg(0, std::ios::beg);

//read into memory buffer, then close it.
char *filebuf = new char[length+1];
file.read(filebuf, length);
filebuf[length] = '\0'; //make it null-terminated
file.close();

如果需要,可以将流包装在该缓冲区周围,以便更方便地进行访问,如下所示:

std::istrstream header(&filebuf[0], length);

另外,如果您控制文件,请考虑使用平面二进制数据格式而不是文本。读写更加可靠,因为您不必处理所有空白。它也更小且解析速度更快。

getline, stream operators, scanf, can be convenient if you don’t care about file loading time or if you are loading small text files. But, if the performance is something you care about, you should really just buffer the entire file into memory (assuming it will fit).

Here’s an example:

//open file in binary mode
std::fstream file( filename, std::ios::in|::std::ios::binary );
if( !file ) return NULL;

//read the size...
file.seekg(0, std::ios::end);
size_t length = (size_t)file.tellg();
file.seekg(0, std::ios::beg);

//read into memory buffer, then close it.
char *filebuf = new char[length+1];
file.read(filebuf, length);
filebuf[length] = '\0'; //make it null-terminated
file.close();

If you want, you can wrap a stream around that buffer for more convenient access like this:

std::istrstream header(&filebuf[0], length);

Also, if you are in control of the file, consider using a flat binary data format instead of text. It’s more reliable to read and write because you don’t have to deal with all the ambiguities of whitespace. It’s also smaller and much faster to parse.


回答 5

对于我来说,以下代码比到目前为止发布的其他代码更快:(Visual Studio 2013,64位,500 MB文件,行长统一为[0,1000))。

const int buffer_size = 500 * 1024;  // Too large/small buffer is not good.
std::vector<char> buffer(buffer_size);
int size;
while ((size = fread(buffer.data(), sizeof(char), buffer_size, stdin)) > 0) {
    line_count += count_if(buffer.begin(), buffer.begin() + size, [](char ch) { return ch == '\n'; });
}

它比我的所有Python尝试都要多2倍。

The following code was faster for me than the other code posted here so far: (Visual Studio 2013, 64-bit, 500 MB file with line length uniformly in [0, 1000)).

const int buffer_size = 500 * 1024;  // Too large/small buffer is not good.
std::vector<char> buffer(buffer_size);
int size;
while ((size = fread(buffer.data(), sizeof(char), buffer_size, stdin)) > 0) {
    line_count += count_if(buffer.begin(), buffer.begin() + size, [](char ch) { return ch == '\n'; });
}

It beats all my Python attempts by more than a factor of 2.
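For comparison, the same idea (read fixed-size chunks and count newline bytes instead of splitting lines) can be sketched in Python. This is not the Python code the answer was measured against, and no speed claim is made for it; `count_newlines` and the chunk size are illustrative choices.

```python
import io


def count_newlines(stream, chunk_size=512 * 1024):
    """Count newline bytes in a binary stream, reading fixed-size chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # an empty bytes object means end of input
            break
        total += chunk.count(b"\n")
    return total


# Small in-memory demonstration; for real input pass sys.stdin.buffer.
sample = io.BytesIO(b"first\nsecond\nthird\n")
print(count_newlines(sample, chunk_size=4))
```

Like the C++ version, this counts newline characters, so a final line without a trailing newline is not included.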


回答 6

顺便说一句,C ++版本的行数比Python版本的行数大1个原因是,仅当尝试读取超出eof的值时,才会设置eof标志。因此正确的循环将是:

while (cin) {
    getline(cin, input_line);

    if (!cin.eof())
        line_count++;
};

By the way, the reason the line count for the C++ version is one greater than the count for the Python version is that the eof flag only gets set when an attempt is made to read beyond eof. So the correct loop would be:

while (cin) {
    getline(cin, input_line);

    if (!cin.eof())
        line_count++;
};

回答 7

在您的第二个示例(带有scanf())的情况下,这样做仍然较慢的原因可能是因为scanf(“%s”)解析了字符串并查找了任何空格字符(空格,制表符,换行符)。

同样,是的,CPython进行了一些缓存以避免硬盘读取。

In your second example (with scanf()), the reason it is still slower may be that scanf("%s") parses the string and looks for any whitespace character (space, tab, newline).

Also, yes, CPython does some caching to avoid harddisk reads.


回答 8

答案的第一要素:<iostream>缓慢。该死的慢。scanf如下所示,我获得了巨大的性能提升,但是它仍然比Python慢​​两倍。

#include <iostream>
#include <time.h>
#include <cstdio>

using namespace std;

int main() {
    char buffer[10000];
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    int read = 1;
    while(read > 0) {
        read = scanf("%s", buffer);
        line_count++;
    };
    sec = (int) time(NULL) - start;
    line_count--;
    cerr << "Saw " << line_count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = line_count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } 
    else
        cerr << endl;
    return 0;
}

A first element of an answer: <iostream> is slow. Damn slow. I get a huge performance boost with scanf as in the below, but it is still two times slower than Python.

#include <iostream>
#include <time.h>
#include <cstdio>

using namespace std;

int main() {
    char buffer[10000];
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    int read = 1;
    while(read > 0) {
        read = scanf("%s", buffer);
        line_count++;
    };
    sec = (int) time(NULL) - start;
    line_count--;
    cerr << "Saw " << line_count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = line_count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } 
    else
        cerr << endl;
    return 0;
}

回答 9

好吧,我看到在您的第二个解决方案中,您从切换cinscanf,这是我要向您提出的第一个建议(cin是sloooooooooooow)。现在,如果您从切换scanffgets,则会看到性能的另一提升:fgets是用于字符串输入的最快的C ++函数。

顺便说一句,不知道同步的事情,很好。但是您仍然应该尝试fgets

Well, I see that in your second solution you switched from cin to scanf, which was the first suggestion I was going to make (cin is sloooooooooooow). Now, if you switch from scanf to fgets, you should see another boost in performance: fgets is the fastest C++ function for string input.

BTW, didn’t know about that sync thing, nice. But you should still try fgets.


重命名熊猫列

问题:重命名熊猫列

我有一个使用熊猫和列标签的DataFrame,我需要对其进行编辑以替换原始列标签。

我想A在原始列名称为的DataFrame 中更改列名称:

['$a', '$b', '$c', '$d', '$e'] 

['a', 'b', 'c', 'd', 'e'].

我已经将编辑后的列名存储在列表中,但是我不知道如何替换列名。

I have a DataFrame using pandas and column labels that I need to edit to replace the original column labels.

I’d like to change the column names in a DataFrame A where the original column names are:

['$a', '$b', '$c', '$d', '$e'] 

to

['a', 'b', 'c', 'd', 'e'].

I have the edited column names stored in a list, but I don’t know how to replace the column names.


回答 0

只需将其分配给.columns属性:

>>> df = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
>>> df.columns = ['a', 'b']
>>> df
   a   b
0  1  10
1  2  20

Just assign it to the .columns attribute:

>>> df = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
>>> df.columns = ['a', 'b']
>>> df
   a   b
0  1  10
1  2  20
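One thing to keep in mind with direct assignment is that the list must contain exactly one label per column. A minimal sketch of what happens otherwise (the variable `mismatch_error` is only there for illustration):

```python
import pandas as pd

df = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})

try:
    df.columns = ['a']          # one name for two columns
except ValueError as exc:
    mismatch_error = exc        # pandas reports a length mismatch
else:
    mismatch_error = None

df.columns = ['a', 'b']         # correct length: labels are replaced in order
print(list(df.columns))
```

The failed assignment leaves the original labels in place, so the second assignment starts from an unchanged DataFrame.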

回答 1

重命名特定列

使用该df.rename()函数并引用要重命名的列。并非所有列都必须重命名:

df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
# Or rename the existing DataFrame (rather than creating a copy) 
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)

最小代码示例

df = pd.DataFrame('x', index=range(3), columns=list('abcde'))
df

   a  b  c  d  e
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

下列方法均起作用并产生相同的输出:

df2 = df.rename({'a': 'X', 'b': 'Y'}, axis=1)  # new method
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis='columns')
df2 = df.rename(columns={'a': 'X', 'b': 'Y'})  # old method  

df2

   X  Y  c  d  e
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

切记将结果分配回去,因为修改未就位。或者,指定inplace=True

df.rename({'a': 'X', 'b': 'Y'}, axis=1, inplace=True)
df

   X  Y  c  d  e
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

从v0.25版开始,如果指定errors='raise'了无效的“要重命名的列” ,您还可以指定引发错误。参见v0.25 rename()文档


REASSIGN列标题

df.set_axis()axis=1inplace=False一起使用(返回副本)。

df2 = df.set_axis(['V', 'W', 'X', 'Y', 'Z'], axis=1, inplace=False)
df2

   V  W  X  Y  Z
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

这将返回一个副本,但是您可以通过设置来就地修改DataFrame inplace=True(这是版本<= 0.24的默认行为,但将来可能会更改)。

您还可以直接分配标题:

df.columns = ['V', 'W', 'X', 'Y', 'Z']
df

   V  W  X  Y  Z
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

RENAME SPECIFIC COLUMNS

Use the df.rename() function and refer to the columns to be renamed. Not all the columns have to be renamed:

df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
# Or rename the existing DataFrame (rather than creating a copy) 
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)

Minimal Code Example

df = pd.DataFrame('x', index=range(3), columns=list('abcde'))
df

   a  b  c  d  e
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

The following methods all work and produce the same output:

df2 = df.rename({'a': 'X', 'b': 'Y'}, axis=1)  # new method
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis='columns')
df2 = df.rename(columns={'a': 'X', 'b': 'Y'})  # old method  

df2

   X  Y  c  d  e
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

Remember to assign the result back, as the modification is not done in place. Alternatively, specify inplace=True:

df.rename({'a': 'X', 'b': 'Y'}, axis=1, inplace=True)
df

   X  Y  c  d  e
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

From v0.25, you can also specify errors='raise' to raise errors if an invalid column-to-rename is specified. See v0.25 rename() docs.
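A short sketch of the difference, assuming pandas 0.25 or later; the label 'q' is deliberately absent from the frame:

```python
import pandas as pd

df = pd.DataFrame('x', index=range(3), columns=list('abcde'))

# Default (errors='ignore'): the unknown label 'q' is silently skipped.
ignored = df.rename(columns={'a': 'X', 'q': 'Y'})

# errors='raise': the same mapping is rejected with a KeyError.
try:
    df.rename(columns={'a': 'X', 'q': 'Y'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(list(ignored.columns), raised)
```

Using errors='raise' is a cheap way to catch typos in the old column names.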


REASSIGN COLUMN HEADERS

Use df.set_axis() with axis=1 and inplace=False (to return a copy).

df2 = df.set_axis(['V', 'W', 'X', 'Y', 'Z'], axis=1, inplace=False)
df2

   V  W  X  Y  Z
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

This returns a copy, but you can modify the DataFrame in-place by setting inplace=True (this is the default behaviour for versions <=0.24 but is likely to change in the future).
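On recent pandas releases the situation is different: to my knowledge the inplace argument of set_axis was deprecated in 1.5 and removed in 2.0, so passing it raises an error there. A sketch assuming pandas 1.0 or later, where set_axis returns a new object by default:

```python
import pandas as pd

df = pd.DataFrame('x', index=range(3), columns=list('abcde'))

# No inplace argument: set_axis returns a new DataFrame here.
df2 = df.set_axis(['V', 'W', 'X', 'Y', 'Z'], axis=1)

print(list(df.columns))   # original labels are untouched
print(list(df2.columns))
```

If the code has to run on both old and new versions, assigning to df.columns directly avoids the version difference altogether.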

You can also assign headers directly:

df.columns = ['V', 'W', 'X', 'Y', 'Z']
df

   V  W  X  Y  Z
0  x  x  x  x  x
1  x  x  x  x  x
2  x  x  x  x  x

回答 2

rename方法可以带有一个函数,例如:

In [11]: df.columns
Out[11]: Index([u'$a', u'$b', u'$c', u'$d', u'$e'], dtype=object)

In [12]: df.rename(columns=lambda x: x[1:], inplace=True)

In [13]: df.columns
Out[13]: Index([u'a', u'b', u'c', u'd', u'e'], dtype=object)

The rename method can take a function, for example:

In [11]: df.columns
Out[11]: Index([u'$a', u'$b', u'$c', u'$d', u'$e'], dtype=object)

In [12]: df.rename(columns=lambda x: x[1:], inplace=True)

In [13]: df.columns
Out[13]: Index([u'a', u'b', u'c', u'd', u'e'], dtype=object)
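Note that x[1:] drops the first character of every label, whether or not it is a '$'. If some columns might lack the prefix, a slightly more defensive variant is sketched below; the sample frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['$a', '$b', 'c'])

# x[1:] would turn 'c' into ''; lstrip('$') only removes leading '$' signs.
cleaned = df.rename(columns=lambda name: name.lstrip('$'))

print(list(cleaned.columns))
```

Without inplace=True, rename returns a new DataFrame and leaves df as it was.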

回答 3

使用文本数据中所述

df.columns = df.columns.str.replace('$','')

As documented in Working with text data:

df.columns = df.columns.str.replace('$','')
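Because '$' is also a regular-expression metacharacter, and the default of the regex argument has differed between pandas versions, it may be safer to ask for a literal replacement explicitly. A sketch assuming pandas 0.23 or later (where, as far as I know, the regex keyword is available):

```python
import pandas as pd

df = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})

# regex=False treats '$' as a plain character, not as the regex
# end-of-string anchor.
df.columns = df.columns.str.replace('$', '', regex=False)

print(list(df.columns))
```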

回答 4

熊猫0.21+答案

0.21版中对列重命名进行了一些重大更新。

  • rename方法添加了axis可以设置为columns或的参数1。此更新使该方法与其他pandas API匹配。它仍然具有indexcolumns参数,但是您不再被迫使用它们。
  • set_axis方法inplace设置为False可以使所有的索引或列标签与命名列表。

熊猫的例子0.21+

构造样本DataFrame:

df = pd.DataFrame({'$a':[1,2], '$b': [3,4], 
                   '$c':[5,6], '$d':[7,8], 
                   '$e':[9,10]})

   $a  $b  $c  $d  $e
0   1   3   5   7   9
1   2   4   6   8  10

renameaxis='columns'或一起使用axis=1

df.rename({'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'}, axis='columns')

要么

df.rename({'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'}, axis=1)

两者都导致以下结果:

   a  b  c  d   e
0  1  3  5  7   9
1  2  4  6  8  10

仍然可以使用旧的方法签名:

df.rename(columns={'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'})

rename函数还接受将应用于每个列名称的函数。

df.rename(lambda x: x[1:], axis='columns')

要么

df.rename(lambda x: x[1:], axis=1)

set_axis与列表一起使用inplace=False

您可以为该set_axis方法提供一个列表,该列表的长度等于列(或索引)的数量。当前,inplace默认值为True,但在将来的版本inplace中将默认为False

df.set_axis(['a', 'b', 'c', 'd', 'e'], axis='columns', inplace=False)

要么

df.set_axis(['a', 'b', 'c', 'd', 'e'], axis=1, inplace=False)

为什么不使用df.columns = ['a', 'b', 'c', 'd', 'e']

像这样直接分配列没有错。这是一个完美的解决方案。

using的优点set_axis是它可以用作方法链的一部分,并返回DataFrame的新副本。没有它,您将不得不在重新分配列之前将链的中间步骤存储到另一个变量。

# new for pandas 0.21+ (parentheses allow the chain to span several lines)
(df.some_method1()
   .some_method2()
   .set_axis(columns, axis=1, inplace=False)
   .some_method3())

# old way
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()

Pandas 0.21+ Answer

There have been some significant updates to column renaming in version 0.21.

  • The rename method has added the axis parameter which may be set to columns or 1. This update makes this method match the rest of the pandas API. It still has the index and columns parameters but you are no longer forced to use them.
  • The set_axis method with the inplace set to False enables you to rename all the index or column labels with a list.

Examples for Pandas 0.21+

Construct sample DataFrame:

df = pd.DataFrame({'$a':[1,2], '$b': [3,4], 
                   '$c':[5,6], '$d':[7,8], 
                   '$e':[9,10]})

   $a  $b  $c  $d  $e
0   1   3   5   7   9
1   2   4   6   8  10

Using rename with axis='columns' or axis=1

df.rename({'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'}, axis='columns')

or

df.rename({'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'}, axis=1)

Both result in the following:

   a  b  c  d   e
0  1  3  5  7   9
1  2  4  6  8  10

It is still possible to use the old method signature:

df.rename(columns={'$a':'a', '$b':'b', '$c':'c', '$d':'d', '$e':'e'})

The rename function also accepts functions that will be applied to each column name.

df.rename(lambda x: x[1:], axis='columns')

or

df.rename(lambda x: x[1:], axis=1)

Using set_axis with a list and inplace=False

You can supply a list to the set_axis method that is equal in length to the number of columns (or index). Currently, inplace defaults to True, but inplace will be defaulted to False in future releases.

df.set_axis(['a', 'b', 'c', 'd', 'e'], axis='columns', inplace=False)

or

df.set_axis(['a', 'b', 'c', 'd', 'e'], axis=1, inplace=False)

Why not use df.columns = ['a', 'b', 'c', 'd', 'e']?

There is nothing wrong with assigning columns directly like this. It is a perfectly good solution.

The advantage of using set_axis is that it can be used as part of a method chain and that it returns a new copy of the DataFrame. Without it, you would have to store your intermediate steps of the chain to another variable before reassigning the columns.

# new for pandas 0.21+ (parentheses allow the chain to span several lines)
(df.some_method1()
   .some_method2()
   .set_axis(columns, axis=1, inplace=False)
   .some_method3())

# old way
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()
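The placeholder methods above can be swapped for real ones to see the chain run end to end. This is only a sketch: the particular steps (sort_values, assign, reset_index) are arbitrary choices of mine, and it assumes a pandas version in which set_axis returns a new DataFrame when called without inplace.

```python
import pandas as pd

df = pd.DataFrame({'$a': [1, 2, 3], '$b': [30, 20, 10]})

result = (
    df.sort_values('$b')                 # any earlier step in the chain
      .set_axis(['a', 'b'], axis=1)      # relabel without leaving the chain
      .assign(total=lambda d: d['a'] + d['b'])
      .reset_index(drop=True)
)

print(result)
```

The later steps can already use the new labels, and df itself keeps its original column names.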

回答 5

由于只想删除所有列名中的$符号,因此可以执行以下操作:

df = df.rename(columns=lambda x: x.replace('$', ''))

要么

df.rename(columns=lambda x: x.replace('$', ''), inplace=True)

Since you only want to remove the $ sign in all column names, you could just do:

df = df.rename(columns=lambda x: x.replace('$', ''))

OR

df.rename(columns=lambda x: x.replace('$', ''), inplace=True)

回答 6

df.columns = ['a', 'b', 'c', 'd', 'e']

它将按照您提供的顺序用您提供的名称替换现有名称。

df.columns = ['a', 'b', 'c', 'd', 'e']

It will replace the existing names with the names you provide, in the order you provide.


回答 7

old_names = ['$a', '$b', '$c', '$d', '$e'] 
new_names = ['a', 'b', 'c', 'd', 'e']
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)

这样,您可以根据需要手动编辑new_names。当您只需要重命名几列以纠正拼写错误,重音符号,删除特殊字符等时,效果很好。

old_names = ['$a', '$b', '$c', '$d', '$e'] 
new_names = ['a', 'b', 'c', 'd', 'e']
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)

This way you can manually edit the new_names as you wish. Works great when you need to rename only a few columns to correct misspellings, accents, remove special characters etc.


回答 8

一线或管道解决方案

我将专注于两件事:

  1. OP明确指出

    我已经将编辑后的列名存储在列表中,但是我不知道如何替换列名。

    我不想解决如何替换'$'或删除每个列标题中的第一个字符的问题。OP已完成此步骤。相反,我想集中精力用columns给定替换列名称列表的新对象替换现有对象。

  2. df.columns = newnew新列名称的列表在哪里就变得很简单。这种方法的缺点是,它需要编辑现有数据框的columns属性,并且无法内联完成。我将展示一些通过流水执行此操作而不编辑现有数据框的方法。


设置1
为了着重于需要使用现有列表重命名替换列名称,我将创建一个df具有初始列名称和不相关的新列名称的新示例数据框。

df = pd.DataFrame({'Jack': [1, 2], 'Mahesh': [3, 4], 'Xin': [5, 6]})
new = ['x098', 'y765', 'z432']

df

   Jack  Mahesh  Xin
0     1       3    5
1     2       4    6

解决方案1
pd.DataFrame.rename

已经有人说过,如果您有一个字典将旧的列名映射到新的列名,则可以使用pd.DataFrame.rename

d = {'Jack': 'x098', 'Mahesh': 'y765', 'Xin': 'z432'}
df.rename(columns=d)

   x098  y765  z432
0     1     3     5
1     2     4     6

但是,您可以轻松创建该词典并将其包含在对的调用中rename。以下内容利用了以下事实:迭代时df,我们迭代每个列名。

# given just a list of new column names
df.rename(columns=dict(zip(df, new)))

   x098  y765  z432
0     1     3     5
1     2     4     6

如果您原始的列名是唯一的,那么这很好。但是,如果不是这样,那么就会崩溃。


设置2个
非唯一列

df = pd.DataFrame(
    [[1, 3, 5], [2, 4, 6]],
    columns=['Mahesh', 'Mahesh', 'Xin']
)
new = ['x098', 'y765', 'z432']

df

   Mahesh  Mahesh  Xin
0       1       3    5
1       2       4    6

解决方案2
pd.concat使用keys参数

首先,请注意当我们尝试使用解决方案1时会发生什么:

df.rename(columns=dict(zip(df, new)))

   y765  y765  z432
0     1     3     5
1     2     4     6

我们没有将new列表映射为列名。我们最终重复了y765。相反,我们可以在遍历的列时使用函数的keys参数。pd.concatdf

pd.concat([c for _, c in df.items()], axis=1, keys=new) 

   x098  y765  z432
0     1     3     5
1     2     4     6

解决方案3
重建。仅当dtype所有列都有一个时,才应使用此选项。否则,您最终将dtype object获得所有列,并且将它们转换回需要更多的词典工作。

dtype

pd.DataFrame(df.values, df.index, new)

   x098  y765  z432
0     1     3     5
1     2     4     6

混合的 dtype

pd.DataFrame(df.values, df.index, new).astype(dict(zip(new, df.dtypes)))

   x098  y765  z432
0     1     3     5
1     2     4     6

解决方案4
这是使用transpose和的花招set_indexpd.DataFrame.set_index允许我们设置内联索引,但没有对应的set_columns。这样我们就可以转置,然后再set_index转回。但是,此处适用解决方案3 的相同警告dtype与混合dtype警告。

dtype

df.T.set_index(np.asarray(new)).T

   x098  y765  z432
0     1     3     5
1     2     4     6

混合的 dtype

df.T.set_index(np.asarray(new)).T.astype(dict(zip(new, df.dtypes)))

   x098  y765  z432
0     1     3     5
1     2     4     6

解决方案5在循环
使用的每个元素中使用a 在此解决方案中,我们传递一个lambda来接受但忽略它。它也需要一个但并不期望。取而代之的是,将迭代器指定为默认值,然后我可以使用该迭代器一次遍历一个迭代器,而无需考虑is 的值。lambdapd.DataFrame.renamenew
xyx

df.rename(columns=lambda x, y=iter(new): next(y))

   x098  y765  z432
0     1     3     5
1     2     4     6

正如人们在sopython聊天中向我指出的那样,如果*x和之间添加一个,则y可以保护我的y变量。不过,在这种情况下,我认为它不需要保护。仍然值得一提。

df.rename(columns=lambda x, *, y=iter(new): next(y))

   x098  y765  z432
0     1     3     5
1     2     4     6

One line or Pipeline solutions

I’ll focus on two things:

  1. OP clearly states

    I have the edited column names stored in a list, but I don’t know how to replace the column names.

    I do not want to solve the problem of how to replace '$' or strip the first character off of each column header. OP has already done this step. Instead I want to focus on replacing the existing columns object with a new one given a list of replacement column names.

  2. df.columns = new, where new is the list of new column names, is as simple as it gets. The drawback of this approach is that it requires editing the existing dataframe’s columns attribute and it isn’t done inline. I’ll show a few ways to perform this via pipelining without editing the existing dataframe.


Setup 1
To focus on the need to rename or replace column names with a pre-existing list, I’ll create a new sample dataframe df with initial column names and unrelated new column names.

df = pd.DataFrame({'Jack': [1, 2], 'Mahesh': [3, 4], 'Xin': [5, 6]})
new = ['x098', 'y765', 'z432']

df

   Jack  Mahesh  Xin
0     1       3    5
1     2       4    6

Solution 1
pd.DataFrame.rename

It has been said already that if you had a dictionary mapping the old column names to new column names, you could use pd.DataFrame.rename.

d = {'Jack': 'x098', 'Mahesh': 'y765', 'Xin': 'z432'}
df.rename(columns=d)

   x098  y765  z432
0     1     3     5
1     2     4     6

However, you can easily create that dictionary and include it in the call to rename. The following takes advantage of the fact that when iterating over df, we iterate over each column name.

# given just a list of new column names
df.rename(columns=dict(zip(df, new)))

   x098  y765  z432
0     1     3     5
1     2     4     6

This works great if your original column names are unique. But if they are not, then this breaks down.


Setup 2
non-unique columns

df = pd.DataFrame(
    [[1, 3, 5], [2, 4, 6]],
    columns=['Mahesh', 'Mahesh', 'Xin']
)
new = ['x098', 'y765', 'z432']

df

   Mahesh  Mahesh  Xin
0       1       3    5
1       2       4    6

Solution 2
pd.concat using the keys argument

First, notice what happens when we attempt to use solution 1:

df.rename(columns=dict(zip(df, new)))

   y765  y765  z432
0     1     3     5
1     2     4     6

We didn’t map the new list as the column names. We ended up repeating y765. Instead, we can use the keys argument of the pd.concat function while iterating through the columns of df.

pd.concat([c for _, c in df.items()], axis=1, keys=new) 

   x098  y765  z432
0     1     3     5
1     2     4     6
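The reason solution 1 repeats y765 can be seen without pandas at all: the intermediate dictionary cannot hold two entries for the same old name. A minimal sketch using the same labels as above:

```python
old = ['Mahesh', 'Mahesh', 'Xin']
new = ['x098', 'y765', 'z432']

# A dict keeps one value per key, so the later pair overwrites the earlier one.
mapping = dict(zip(old, new))
print(mapping)
```

Both 'Mahesh' columns are then renamed through the single surviving entry, which is why 'x098' never appears.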

Solution 3
Reconstruct. This should only be used if you have a single dtype for all columns. Otherwise, you’ll end up with dtype object for all columns and converting them back requires more dictionary work.

Single dtype

pd.DataFrame(df.values, df.index, new)

   x098  y765  z432
0     1     3     5
1     2     4     6

Mixed dtype

pd.DataFrame(df.values, df.index, new).astype(dict(zip(new, df.dtypes)))

   x098  y765  z432
0     1     3     5
1     2     4     6

Solution 4
This is a gimmicky trick with transpose and set_index. pd.DataFrame.set_index allows us to set an index inline but there is no corresponding set_columns. So we can transpose, then set_index, and transpose back. However, the same single dtype versus mixed dtype caveat from solution 3 applies here.

Single dtype

df.T.set_index(np.asarray(new)).T

   x098  y765  z432
0     1     3     5
1     2     4     6

Mixed dtype

df.T.set_index(np.asarray(new)).T.astype(dict(zip(new, df.dtypes)))

   x098  y765  z432
0     1     3     5
1     2     4     6

Solution 5
Use a lambda in pd.DataFrame.rename that cycles through each element of new
In this solution, we pass a lambda that takes x but then ignores it. It also takes a y but doesn’t expect it. Instead, an iterator is given as a default value and I can then use that to cycle through one at a time without regard to what the value of x is.

df.rename(columns=lambda x, y=iter(new): next(y))

   x098  y765  z432
0     1     3     5
1     2     4     6

And as pointed out to me by the folks in sopython chat, if I add a * in between x and y, I can protect my y variable. Though, in this context I don’t believe it needs protecting. It is still worth mentioning.

df.rename(columns=lambda x, *, y=iter(new): next(y))

   x098  y765  z432
0     1     3     5
1     2     4     6
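The trick depends on a plain Python rule: a default argument is evaluated once, when the function is defined, so every call shares the same iterator. A pandas-free sketch of that behaviour, including the caveat that the iterator can only be walked once (the names `renamer`, `labels` and `exhausted` are illustrative):

```python
new = ['x098', 'y765', 'z432']

# The default y=iter(new) is evaluated once, when the lambda is created,
# so every call advances the same iterator.
renamer = lambda x, y=iter(new): next(y)

labels = [renamer(old) for old in ['Mahesh', 'Mahesh', 'Xin']]
print(labels)

# The iterator is now exhausted: a further call raises StopIteration.
try:
    renamer('extra')
    exhausted = False
except StopIteration:
    exhausted = True
```

In practice this means a fresh lambda (and a fresh iter(new)) is needed for every rename call.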

回答 9

列名称与系列名称

我想解释一下幕后发生的事情。

数据框是一组系列。

系列又是对 numpy.array

numpy.array有财产 .name

这是系列的名称。很少有人会尊重大熊猫的这一属性,但它会在某些地方徘徊,并可以用来破解某些大熊猫的行为。

命名列列表

这里有很多答案都谈到该df.columns属性list实际上是一个Series。这意味着它具有.name属性。

如果您决定填写各列的名称,则会发生这种情况Series

df.columns = ['column_one', 'column_two']
df.columns.names = ['name of the list of columns']
df.index.names = ['name of the index']

name of the list of columns     column_one  column_two
name of the index       
0                                    4           1
1                                    5           2
2                                    6           3

请注意,索引的名称总是低一列。

that绕的神器

.name属性有时会持续存在。如果设置df.columns = ['one', 'two']df.one.name则将为'one'

如果您设置,df.one.name = 'three'那么df.columns仍然会给您['one', 'two'],并df.one.name会给您'three'

pd.DataFrame(df.one) 将返回

    three
0       1
1       2
2       3

因为pandas重用.name了已经定义的Series

多级列名称

熊猫有做多层列名的方法。没有太多魔术,但是我也想在答案中涵盖这一点,因为我看不到有人在这里进行这项工作。

    |one            |
    |one      |two  |
0   |  4      |  1  |
1   |  5      |  2  |
2   |  6      |  3  |

通过将列设置为列表很容易实现,如下所示:

df.columns = [['one', 'one'], ['one', 'two']]

Column names vs Names of Series

I would like to explain a bit what happens behind the scenes.

Dataframes are a set of Series.

Series in turn wrap a numpy.array

Series (unlike the underlying numpy.array) have a property .name

This is the name of the series. It is seldom that pandas respects this attribute, but it lingers in places and can be used to hack some pandas behaviors.

Naming the list of columns

A lot of answers here talk about the df.columns attribute being a list when in fact it is an Index. This means it has a .name attribute.

This is what happens if you decide to fill in the name of the columns Index:

df.columns = ['column_one', 'column_two']
df.columns.names = ['name of the list of columns']
df.index.names = ['name of the index']

name of the list of columns     column_one  column_two
name of the index       
0                                    4           1
1                                    5           2
2                                    6           3

Note that the name of the index always comes one row lower.

Artifacts that linger

The .name attribute lingers on sometimes. If you set df.columns = ['one', 'two'] then the df.one.name will be 'one'.

If you set df.one.name = 'three' then df.columns will still give you ['one', 'two'], and df.one.name will give you 'three'

BUT

pd.DataFrame(df.one) will return

    three
0       1
1       2
2       3

Because pandas reuses the .name of the already defined Series.
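
A small sketch of that reuse, built on a standalone Series rather than on df.one (an assumption made here so the example does not depend on how a particular pandas version caches column access):

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="three")

# Building a frame from a named Series reuses the Series name as the column label.
frame = pd.DataFrame(s)
print(list(frame.columns))

# Series.to_frame() accepts an explicit label when a different one is wanted.
relabelled = s.to_frame(name="one")
print(list(relabelled.columns))
```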

Multi level column names

Pandas has ways of doing multi layered column names. There is not so much magic involved but I wanted to cover this in my answer too since I don’t see anyone picking up on this here.

    |one            |
    |one      |two  |
0   |  4      |  1  |
1   |  5      |  2  |
2   |  6      |  3  |

This is easily achievable by setting columns to lists, like this:

df.columns = [['one', 'one'], ['one', 'two']]
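
For comparison, here is a hedged sketch that builds the same two-level header explicitly with pd.MultiIndex.from_arrays; the starting labels "a" and "b" are placeholders, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6], "b": [1, 2, 3]})

# One array per level: the first holds the top-level labels, the second the lower level.
df.columns = pd.MultiIndex.from_arrays([["one", "one"], ["one", "two"]])

# Each column is now addressed by a tuple of labels, one per level.
second = df[("one", "two")]
print(df.columns.tolist())
print(second.tolist())
```

Spelling out the MultiIndex makes the intent visible to a reader who does not know that a list of lists is interpreted as one list per level.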

Answer 10

If you’ve got the dataframe, df.columns dumps everything into a list you can manipulate and then reassign into your dataframe as the names of columns…

old_columns = list(df.columns)
new_columns = [name.replace("$", "") for name in old_columns]
df.rename(columns=dict(zip(old_columns, new_columns)), inplace=True)
df.head() #to validate the output

Best way? IDK. A way – yes.

A better way of evaluating all the main techniques put forward in the answers to the question is below, using cProfile to gauge execution time. @kadee, @kaitlyn, and @eumiro had the functions with the fastest execution times, though these functions are so fast that we are comparing the rounding of .000 and .001 seconds for all the answers. Moral: my answer above likely isn’t the ‘Best’ way.

import pandas as pd
import cProfile, pstats, re

old_names = ['$a', '$b', '$c', '$d', '$e']
new_names = ['a', 'b', 'c', 'd', 'e']
col_dict = {'$a': 'a', '$b': 'b','$c':'c','$d':'d','$e':'e'}

df = pd.DataFrame({'$a':[1,2], '$b': [10,20],'$c':['bleep','blorp'],'$d':[1,2],'$e':['texa$','']})

df.head()

def eumiro(df,nn):
    df.columns = nn
    #This direct renaming approach is duplicated in methodology in several other answers: 
    return df

def lexual1(df):
    return df.rename(columns=col_dict)

def lexual2(df,col_dict):
    return df.rename(columns=col_dict, inplace=True)

def Panda_Master_Hayden(df):
    return df.rename(columns=lambda x: x[1:], inplace=True)

def paulo1(df):
    return df.rename(columns=lambda x: x.replace('$', ''))

def paulo2(df):
    return df.rename(columns=lambda x: x.replace('$', ''), inplace=True)

def migloo(df,on,nn):
    return df.rename(columns=dict(zip(on, nn)), inplace=True)

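def kadee(df):
    # regex=False treats '$' as a literal character, not as an end-of-string anchor
    return df.columns.str.replace('$', '', regex=False)

def awo(df):
    old_columns = list(df.columns)
    new_columns = [name.replace("$", "") for name in old_columns]
    return df.rename(columns=dict(zip(old_columns, new_columns)), inplace=True)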

def kaitlyn(df):
    df.columns = [col.strip('$') for col in df.columns]
    return df

print('eumiro')
cProfile.run('eumiro(df,new_names)')
print('lexual1')
cProfile.run('lexual1(df)')
print('lexual2')
cProfile.run('lexual2(df,col_dict)')
print('andy hayden')
cProfile.run('Panda_Master_Hayden(df)')
print('paulo1')
cProfile.run('paulo1(df)')
print('paulo2')
cProfile.run('paulo2(df)')
print('migloo')
cProfile.run('migloo(df,old_names,new_names)')
print('kadee')
cProfile.run('kadee(df)')
print('awo')
cProfile.run('awo(df)')
print('kaitlyn')
cProfile.run('kaitlyn(df)')
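
Because several of the functions above rename the shared df in place, later calls may see columns that are already renamed. The following is only a sketch of one way around that, using timeit on a fresh frame per run; make_frame, by_assignment and by_rename are hypothetical names introduced for this illustration:

```python
import timeit

import pandas as pd


def make_frame():
    # Fresh frame for every run, so in-place renames cannot leak between timings.
    return pd.DataFrame({"$a": [1, 2], "$b": [10, 20], "$c": ["bleep", "blorp"]})


def by_assignment(df):
    df.columns = [col.strip("$") for col in df.columns]
    return df


def by_rename(df):
    return df.rename(columns=lambda name: name.replace("$", ""))


candidates = {"assignment": by_assignment, "rename": by_rename}
timings = {}
for label, func in candidates.items():
    # The timing includes building the frame, which costs the same for each candidate.
    timings[label] = timeit.timeit(lambda: func(make_frame()), number=200)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f}s for 200 runs")
```

The absolute numbers depend on the machine and the pandas version, so they are best read as a relative comparison.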

Answer 11

Let’s say this is your dataframe.

You can rename the columns using two methods.

  1. Using dataframe.columns=[#list]

    df.columns=['a','b','c','d','e']
    

    The limitation of this method is that even if only one column has to be changed, the full list of columns has to be passed. Also, this method cannot be used on index labels. For example, if you passed this:

    df.columns = ['a','b','c','d']
    

    This raises an error: ValueError: Length mismatch: Expected axis has 5 elements, new values have 4 elements.

  2. Another method is the Pandas rename() method which is used to rename any index, column or row

    df = df.rename(columns={'$a':'a'})
    

Similarly, you can change any rows or columns.
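
The two methods can be tried side by side. This is a minimal sketch on a made-up one-row frame; the row label "first" is a placeholder chosen for the illustration:

```python
import pandas as pd

df = pd.DataFrame({"$a": [1], "$b": [2], "$c": [3], "$d": [4], "$e": [5]})

# Assigning a list of the wrong length is rejected.
try:
    df.columns = ["a", "b", "c", "d"]
except ValueError as exc:
    print("ValueError:", exc)

# rename() changes only the labels that appear in the mapping.
one_column = df.rename(columns={"$a": "a"})
print(list(one_column.columns))

# The same method relabels rows through the index argument.
one_row = df.rename(index={0: "first"})
print(list(one_row.index))
```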


Answer 12

df = pd.DataFrame({'$a': [1], '$b': [1], '$c': [1], '$d': [1], '$e': [1]})

If your new list of columns is in the same order as the existing columns, the assignment is simple:

new_cols = ['a', 'b', 'c', 'd', 'e']
df.columns = new_cols
>>> df
   a  b  c  d  e
0  1  1  1  1  1

If you had a dictionary keyed on old column names to new column names, you could do the following:

d = {'$a': 'a', '$b': 'b', '$c': 'c', '$d': 'd', '$e': 'e'}
df.columns = df.columns.map(lambda col: d[col])  # Or `.map(d.get)` as pointed out by @PiRSquared.
>>> df
   a  b  c  d  e
0  1  1  1  1  1

If you don’t have a list or dictionary mapping, you could strip the leading $ symbol via a list comprehension:

df.columns = [col[1:] if col[0] == '$' else col for col in df]

Answer 13


Answer 14

Let’s understand renaming through a small example…

1. Renaming columns using a mapping:

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) #creating a df with column names A and B
df.rename({"A": "new_a", "B": "new_b"},axis='columns',inplace =True) #renaming column A to 'new_a' and B to 'new_b'

output:
   new_a  new_b
0      1      4
1      2      5
2      3      6

2. Renaming index/row names using a mapping:

df.rename({0: "x", 1: "y", 2: "z"},axis='index',inplace =True) #row names are replaced by 'x','y','z'.

output:
   new_a  new_b
x      1      4
y      2      5
z      3      6

Answer 15

Another way we could replace the original column labels is by stripping the unwanted characters (here ‘$’) from the original column labels.

This could have been done by running a for loop over df.columns and appending the stripped columns to df.columns.

Instead, we can do this neatly in a single statement by using a list comprehension like the one below:

df.columns = [col.strip('$') for col in df.columns]

(The strip method in Python removes every leading and trailing occurrence of the given characters from the string.)
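
Since strip works on both ends of the string, it can remove more than a single leading '$'. The sketch below compares it with two narrower alternatives on some made-up labels:

```python
names = ["$a", "$$b", "c$", "$d$"]

# strip('$') removes every leading and trailing '$'.
both_ends = [name.strip("$") for name in names]
# lstrip('$') removes leading '$' characters only.
leading_only = [name.lstrip("$") for name in names]
# Slicing after a startswith check removes at most one leading '$'.
first_only = [name[1:] if name.startswith("$") else name for name in names]

print(both_ends)     # ['a', 'b', 'c', 'd']
print(leading_only)  # ['a', 'b', 'c$', 'd$']
print(first_only)    # ['a', '$b', 'c$', 'd$']
```

Which variant is appropriate depends on whether a trailing or repeated '$' is meaningful in the data.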


Answer 16

Really simple: just use

df.columns = ['Name1', 'Name2', 'Name3'...]

and it will assign the column names by the order you put them


Answer 17

You could use str.slice for that:

df.columns = df.columns.str.slice(1)
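
One caveat worth a quick sketch: str.slice(1) drops the first character of every label, whether or not it is a '$'. The labels below are made up to show the difference from str.lstrip:

```python
import pandas as pd

columns = pd.Index(["$a", "$b", "c"])

# str.slice(1) drops the first character of every label, with or without a '$'.
sliced = columns.str.slice(1)
# str.lstrip('$') removes only leading '$' characters.
stripped = columns.str.lstrip("$")

print(list(sliced))    # ['a', 'b', '']
print(list(stripped))  # ['a', 'b', 'c']
```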

Answer 18

I know this question and answer have been chewed to death. But I referred to it for inspiration for one of the problems I was having. I was able to solve it using bits and pieces from different answers, hence providing my response in case anyone needs it.

My method is generic: you can add additional delimiters by appending more characters to the delimiters= variable, which future-proofs it.

Working Code:

import pandas as pd
import re


df = pd.DataFrame({'$a':[1,2], '$b': [3,4],'$c':[5,6], '$d': [7,8], '$e': [9,10]})

delimiters = '$'
matchPattern = '|'.join(map(re.escape, delimiters))
df.columns = [re.split(matchPattern, i)[1] for i in df.columns ]

Output:

>>> df
   $a  $b  $c  $d  $e
0   1   3   5   7   9
1   2   4   6   8  10

>>> df
   a  b  c  d   e
0  1  3  5  7   9
1  2  4  6  8  10
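
A variation on the same idea, offered only as a sketch: re.split(...)[1] assumes every column name contains a delimiter, so the version below swaps in re.sub, which leaves other names alone. The extra '#' delimiter and the column names are assumptions made for the illustration:

```python
import re

import pandas as pd

df = pd.DataFrame({"$a": [1, 2], "b#": [3, 4], "c": [5, 6]})

# Every character in this string is treated as a delimiter to remove.
delimiters = "$#"
match_pattern = "|".join(map(re.escape, delimiters))

# re.sub leaves names without a delimiter unchanged,
# where indexing the result of re.split would fail.
df.columns = [re.sub(match_pattern, "", name) for name in df.columns]
print(list(df.columns))  # ['a', 'b', 'c']
```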

Answer 19

Note that these approaches do not work for a MultiIndex. For a MultiIndex, you need to do something like the following:

>>> df = pd.DataFrame({('$a','$x'):[1,2], ('$b','$y'): [3,4], ('e','f'):[5,6]})
>>> df
   $a $b  e
   $x $y  f
0  1  3  5
1  2  4  6
>>> rename = {('$a','$x'):('a','x'), ('$b','$y'):('b','y')}
>>> df.columns = pd.MultiIndex.from_tuples([
        rename.get(item, item) for item in df.columns.tolist()])
>>> df
   a  b  e
   x  y  f
0  1  3  5
1  2  4  6
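
When no explicit mapping is at hand, the same from_tuples step can be driven by a rule instead. This is a sketch under the assumption that the goal is simply to drop leading '$' characters at every level:

```python
import pandas as pd

df = pd.DataFrame({("$a", "$x"): [1, 2], ("$b", "$y"): [3, 4], ("e", "f"): [5, 6]})

# Clean each label inside each column tuple, then rebuild the MultiIndex.
cleaned = [tuple(label.lstrip("$") for label in item) for item in df.columns.tolist()]
df.columns = pd.MultiIndex.from_tuples(cleaned)

print(df.columns.tolist())  # [('a', 'x'), ('b', 'y'), ('e', 'f')]
```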

Answer 20

Another option is to rename using a regular expression:

import pandas as pd
import re

df = pd.DataFrame({'$a':[1,2], '$b':[3,4], '$c':[5,6]})

df = df.rename(columns=lambda x: re.sub(r'\$','',x))
>>> df
   a  b  c
0  1  3  5
1  2  4  6

Answer 21

If you have to deal with loads of columns named by a providing system that is out of your control, I came up with the following approach, which combines a general rule and specific replacements in one go.

First create a dictionary from the dataframe column names, using a regular expression to throw away certain suffixes of the column names, and then add specific replacements to the dictionary to name core columns as expected later in the receiving database.

This is then applied to the dataframe in one go.

mapping=dict(zip(df.columns,df.columns.str.replace(r'(:S$|:C1$|:L$|:D$|\.Serial:L$)','',regex=True)))
mapping['brand_timeseries:C1']='BTS'
mapping['respid:L']='RespID'
mapping['country:C1']='CountryID'
mapping['pim1:D']='pim_actual'
df.rename(columns=mapping, inplace=True)
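
To make the two stages concrete, here is a small self-contained sketch. The columns "age:S" and "device.Serial:L" are invented for the illustration, and the variable is called mapping so that the built-in dict is not shadowed:

```python
import pandas as pd

df = pd.DataFrame(columns=["brand_timeseries:C1", "respid:L", "age:S", "device.Serial:L"])

# Stage 1: a general rule that strips the known type suffixes.
suffixes = r"(:S$|:C1$|:L$|:D$|\.Serial:L$)"
mapping = dict(zip(df.columns, df.columns.str.replace(suffixes, "", regex=True)))

# Stage 2: specific overrides for the core columns.
mapping["brand_timeseries:C1"] = "BTS"
mapping["respid:L"] = "RespID"

df = df.rename(columns=mapping)
print(list(df.columns))  # ['BTS', 'RespID', 'age', 'device']
```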

Answer 22

In addition to the solution already provided, you can replace all the columns while you are reading the file. We can use names and header=0 to do that.

First, we create a list of the names that we like to use as our column names:

import pandas as pd

ufo_cols = ['city', 'color reported', 'shape reported', 'state', 'time']

ufo = pd.read_csv('link to the file you are using', names = ufo_cols, header = 0)

In this case, all the column names will be replaced with the names you have in your list.
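
A runnable sketch of the same call, with an in-memory string standing in for the file (the CSV content and the new names are made up for the illustration):

```python
import io

import pandas as pd

# Stand-in for a real file: one header row followed by one data row.
csv_text = "City,Colors Reported,State\nIthaca,RED,NY\n"
new_names = ["city", "colors_reported", "state"]

# header=0 marks the first row as the header to discard; names supplies the replacements.
ufo = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

print(list(ufo.columns))
print(len(ufo))
```

Without header=0, the original header row would be read as a data row.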


Answer 23

Here’s a nifty little function I like to use to cut down on typing:

def rename(data, oldnames, newname):
    if isinstance(oldnames, str):  # input can be a string or list of strings
        oldnames = [oldnames]  # when renaming multiple columns
        newname = [newname]  # make sure you pass the corresponding list of new names
    for name, new in zip(oldnames, newname):
        oldvar = [c for c in data.columns if name in c]
        if len(oldvar) == 0:
            raise ValueError("Sorry, couldn't find that column in the dataset")
        if len(oldvar) > 1:  # doesn't have to be an exact match
            print("Found multiple columns that matched " + str(name) + " :")
            for i, c in enumerate(oldvar):
                print(str(i) + ": " + str(c))
            ind = input('please enter the index of the column you would like to rename: ')
            oldvar = oldvar[int(ind)]
        else:  # exactly one match
            oldvar = oldvar[0]
        data = data.rename(columns={oldvar: new})
    return data

Here is an example of how it works:

In [2]: df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=['col1','col2','omg','idk'])
#first list = existing variables
#second list = new names for those variables
In [3]: df = rename(df, ['col','omg'],['first','ohmy']) 
Found multiple columns that matched col :
0: col1
1: col2

please enter the index of the column you would like to rename: 0

In [4]: df.columns
Out[4]: Index(['first', 'col2', 'ohmy', 'idk'], dtype='object')

Answer 24

Renaming columns in pandas is an easy task.

df.rename(columns = {'$a':'a','$b':'b','$c':'c','$d':'d','$e':'e'},inplace = True)

Answer 25

Assuming you can use regular expressions, this solution removes the need to type out the new names by hand: it keeps the first run of word characters found in each column name.

import pandas as pd
import re

srch=re.compile(r"\w+")

data=pd.read_csv("CSV_FILE.csv")
cols=data.columns
new_cols=[srch.search(col).group() for col in cols]
data.columns=new_cols

Determine the type of an object?

Question: Determine the type of an object?

Is there a simple way to determine if a variable is a list, dictionary, or something else? I am getting an object back that may be either type and I need to be able to tell the difference.


Answer 0

There are two built-in functions that help you identify the type of an object. You can use type() if you need the exact type of an object, and isinstance() to check an object’s type against something. Usually, you want to use isinstance() most of the time since it is very robust and also supports type inheritance.


To get the actual type of an object, you use the built-in type() function. Passing an object as the only parameter will return the type object of that object:

>>> type([]) is list
True
>>> type({}) is dict
True
>>> type('') is str
True
>>> type(0) is int
True

This of course also works for custom types:

>>> class Test1 (object):
        pass
>>> class Test2 (Test1):
        pass
>>> a = Test1()
>>> b = Test2()
>>> type(a) is Test1
True
>>> type(b) is Test2
True

Note that type() will only return the immediate type of the object, but won’t be able to tell you about type inheritance.

>>> type(b) is Test1
False

To cover that, you should use the isinstance function. This of course also works for built-in types:

>>> isinstance(b, Test1)
True
>>> isinstance(b, Test2)
True
>>> isinstance(a, Test1)
True
>>> isinstance(a, Test2)
False
>>> isinstance([], list)
True
>>> isinstance({}, dict)
True

isinstance() is usually the preferred way to ensure the type of an object because it will also accept derived types. So unless you actually need the type object (for whatever reason), using isinstance() is preferred over type().

The second parameter of isinstance() also accepts a tuple of types, so it’s possible to check for multiple types at once. isinstance will then return True if the object is of any of those types:

>>> isinstance([], (tuple, list, set))
True

Answer 1

You can do that using type():

>>> a = []
>>> type(a)
<type 'list'>
>>> f = ()
>>> type(f)
<type 'tuple'>

Answer 2

It might be more Pythonic to use a try...except block. That way, if you have a class which quacks like a list, or quacks like a dict, it will behave properly regardless of what its type really is.

To clarify, the preferred method of “telling the difference” between variable types is with something called duck typing: as long as the methods (and return types) that a variable responds to are what your subroutine expects, treat it like what you expect it to be. For example, if you have a class that overloads the bracket operators with __getitem__ and __setitem__, but uses some funny internal scheme, it would be appropriate for it to behave as a dictionary if that’s what it’s trying to emulate.

The other problem with the type(A) is type(B) checking is that if A is a subclass of B, it evaluates to false when, programmatically, you would hope it would be true. If an object is a subclass of a list, it should work like a list: checking the type as presented in the other answer will prevent this. (isinstance will work, however).
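
A minimal sketch of the try...except style described above; sum_values and Scores are hypothetical names introduced only for this illustration:

```python
def sum_values(container):
    """Sum the values of a dict-like object, or the items of any other iterable."""
    try:
        values = container.values()  # works for anything that quacks like a dict
    except AttributeError:
        values = container  # otherwise treat the object itself as the iterable
    return sum(values)


class Scores:
    """Not a dict subclass, but it offers the one method sum_values relies on."""

    def values(self):
        return [10, 20]


print(sum_values({"a": 1, "b": 2}))  # 3
print(sum_values([1, 2, 3]))         # 6
print(sum_values(Scores()))          # 30
```

No type is ever inspected: the function only cares whether the object responds to values().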


Answer 3

On instances of object you also have the:

__class__

attribute. Here is a sample taken from Python 3.3 console

>>> str = "str"
>>> str.__class__
<class 'str'>
>>> i = 2
>>> i.__class__
<class 'int'>
>>> class Test():
...     pass
...
>>> a = Test()
>>> a.__class__
<class '__main__.Test'>

Beware that in Python 3.x and in new-style classes (optionally available since Python 2.2) class and type have been merged, and this can sometimes lead to unexpected results. Mainly for this reason, my favorite way of testing types/classes is the isinstance built-in function.


Answer 4

Determine the type of a Python object

Determine the type of an object with type

>>> obj = object()
>>> type(obj)
<class 'object'>

Although it works, avoid double underscore attributes like __class__ – they’re not semantically public, and, while perhaps not in this case, the builtin functions usually have better behavior.

>>> obj.__class__ # avoid this!
<class 'object'>

type checking

Is there a simple way to determine if a variable is a list, dictionary, or something else? I am getting an object back that may be either type and I need to be able to tell the difference.

Well that’s a different question, don’t use type – use isinstance:

def foo(obj):
    """given a string with items separated by spaces, 
    or a list or tuple, 
    do something sensible
    """
    if isinstance(obj, str):
        obj = obj.split()
    return _foo_handles_only_lists_or_tuples(obj)

This covers the case where your user might be doing something clever or sensible by subclassing str – according to the principle of Liskov Substitution, you want to be able to use subclass instances without breaking your code – and isinstance supports this.

Use Abstractions

Even better, you might look for a specific Abstract Base Class from collections or numbers:

from collections.abc import Iterable
from numbers import Number

def bar(obj):
    """does something sensible with an iterable of numbers, 
    or just one number
    """
    if isinstance(obj, Number): # make it a 1-tuple
        obj = (obj,)
    if not isinstance(obj, Iterable):
        raise TypeError('obj must be either a number or iterable of numbers')
    return _bar_sensible_with_iterable(obj)
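
The helper _bar_sensible_with_iterable is left undefined above, so here is a self-contained sketch of the same check in which tuple(obj) stands in for the real work; as_numbers is a hypothetical name:

```python
from collections.abc import Iterable
from numbers import Number


def as_numbers(obj):
    """Return a tuple of numbers from either a single number or an iterable of numbers."""
    if isinstance(obj, Number):  # make it a 1-tuple
        obj = (obj,)
    if not isinstance(obj, Iterable):
        raise TypeError("obj must be either a number or iterable of numbers")
    return tuple(obj)


print(as_numbers(3))         # (3,)
print(as_numbers([1, 2.5]))  # (1, 2.5)
```

Note that the Iterable check says nothing about the element types; a string would pass it too, so stricter validation is needed if that matters.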

Or Just Don’t explicitly Type-check

Or, perhaps best of all, use duck-typing, and don’t explicitly type-check your code. Duck-typing supports Liskov Substitution with more elegance and less verbosity.

def baz(obj):
    """given an obj, a dict (or anything with an .items method) 
    do something sensible with each key-value pair
    """
    for key, value in obj.items():
        _baz_something_sensible(key, value)

Conclusion

  • Use type to actually get an instance’s class.
  • Use isinstance to explicitly check for actual subclasses or registered abstractions.
  • And just avoid type-checking where it makes sense.

Answer 5

You can use type() or isinstance().

>>> type([]) is list
True

Be warned that you can clobber list or any other type by assigning a variable in the current scope of the same name.

>>> the_d = {}
>>> t = lambda x: "aight" if type(x) is dict else "NOPE"
>>> t(the_d)
'aight'
>>> dict = "dude."
>>> t(the_d)
'NOPE'

Above we see that dict gets reassigned to a string, therefore the test:

type({}) is dict

…fails.

To get around this and use type() more cautiously (on Python 3 the __builtin__ module is named builtins):

>>> import __builtin__
>>> the_d = {}
>>> type({}) is dict
True
>>> dict =""
>>> type({}) is dict
False
>>> type({}) is __builtin__.dict
True
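
On Python 3 the same precaution might look like the sketch below. The shadowing is confined to a function so it does not affect the rest of the module; check_with_shadowed_name is a hypothetical name:

```python
import builtins


def check_with_shadowed_name(obj):
    dict = "dude."  # local name that hides the builtin inside this function
    by_local_name = type(obj) is dict
    by_builtins = type(obj) is builtins.dict
    return by_local_name, by_builtins


print(check_with_shadowed_name({}))  # (False, True)
```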

Answer 6

While the question is pretty old, I stumbled across this while finding out a proper way myself, and I think it still needs clarifying, at least for Python 2.x (I did not check on Python 3, but since the issue arises with classic classes, which are gone in that version, it probably doesn’t matter).

Here I’m trying to answer the title’s question: how can I determine the type of an arbitrary object? Other suggestions about using or not using isinstance are fine in many comments and answers, but I’m not addressing those concerns.

The main issue with the type() approach is that it doesn’t work properly with old-style instances:

class One:
    pass

class Two:
    pass


o = One()
t = Two()

o_type = type(o)
t_type = type(t)

print "Are o and t instances of the same class?", o_type is t_type

Executing this snippet would yield:

Are o and t instances of the same class? True

Which, I argue, is not what most people would expect.

The __class__ approach is the closest to correctness, but it won’t work in one crucial case: when the passed-in object is an old-style class (not an instance!), since those objects lack such an attribute.

This is the smallest snippet of code I could think of that satisfies such legitimate question in a consistent fashion:

#!/usr/bin/env python
from types import ClassType
#we adopt the null object pattern in the (unlikely) case
#that __class__ is None for some strange reason
_NO_CLASS=object()
def get_object_type(obj):
    obj_type = getattr(obj, "__class__", _NO_CLASS)
    if obj_type is not _NO_CLASS:
        return obj_type
    # AFAIK the only situation where this happens is an old-style class
    obj_type = type(obj)
    if obj_type is not ClassType:
        raise ValueError("Could not determine object '{}' type.".format(obj_type))
    return obj_type

Answer 7

Be careful using isinstance, because bool is a subclass of int:

>>> isinstance(True, bool)
True
>>> isinstance(True, int)
True

but with type:

>>> type(True) == bool
True
>>> type(True) == int
False
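
If the goal is to accept integers but reject booleans while still allowing other int subclasses, one possible sketch is the following (is_plain_int is a hypothetical helper, not a standard function):

```python
def is_plain_int(value):
    """True for int instances that are not bool."""
    return isinstance(value, int) and not isinstance(value, bool)


print(issubclass(bool, int))  # True
print(is_plain_int(1))        # True
print(is_plain_int(True))     # False
```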

Answer 8

As an aside to the previous answers, it’s worth mentioning the existence of collections.abc which contains several abstract base classes (ABCs) that complement duck-typing.

For example, instead of explicitly checking if something is a list with:

isinstance(my_obj, list)

you could, if you’re only interested in seeing if the object you have allows getting items, use collections.abc.Sequence:

from collections.abc import Sequence
isinstance(my_obj, Sequence) 

If you’re strictly interested in objects that allow getting, setting and deleting items (i.e. mutable sequences), you’d opt for collections.abc.MutableSequence.

Many other ABCs are defined there: Mapping for objects that can be used as maps, Iterable, Callable, et cetera. A full list of all these can be seen in the documentation for collections.abc.
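A short sketch of how these checks behave for a few built-in types. The helper name `describe` and the selection of ABCs are made up for illustration:

```python
from collections.abc import Mapping, MutableSequence, Sequence


def describe(obj):
    """Return the names of the ABCs (from a small, fixed selection) that obj satisfies."""
    abcs = (Sequence, MutableSequence, Mapping)
    return [abc.__name__ for abc in abcs if isinstance(obj, abc)]


print(describe([1, 2, 3]))   # list: Sequence and MutableSequence
print(describe((1, 2, 3)))   # tuple: Sequence only
print(describe("abc"))       # str is also a Sequence
print(describe({"a": 1}))    # dict: Mapping
```

Note that `str` counts as a `Sequence`, so a check meant for list-like containers may need to exclude strings explicitly.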


回答 9

通常,您可以从具有类名称的对象中提取字符串,

str_class = object.__class__.__name__

并进行比较

if str_class == 'dict':
    # blablabla..
elif str_class == 'customclass':
    # blebleble..

In general, you can get the class name of an object as a string,

str_class = obj.__class__.__name__

and use it for comparison:

if str_class == 'dict':
    pass  # blablabla..
elif str_class == 'customclass':
    pass  # blebleble..

回答 10

在许多实际情况下,而不是使用typeisinstance也可以使用@functools.singledispatch,这是用来定义的通用功能功能实现用于不同类型的同一操作的多个函数构成)。

换句话说,当您具有如下代码时,您将希望使用它:

def do_something(arg):
    if isinstance(arg, int):
        ... # some code specific to processing integers
    if isinstance(arg, str):
        ... # some code specific to processing strings
    if isinstance(arg, list):
        ... # some code specific to processing lists
    ...  # etc

这是一个如何工作的小例子:

from functools import singledispatch


@singledispatch
def say_type(arg):
    raise NotImplementedError(f"I don't work with {type(arg)}")


@say_type.register
def _(arg: int):
    print(f"{arg} is an integer")


@say_type.register
def _(arg: bool):
    print(f"{arg} is a boolean")
>>> say_type(0)
0 is an integer
>>> say_type(False)
False is a boolean
>>> say_type(dict())
# long error traceback ending with:
NotImplementedError: I don't work with <class 'dict'>

另外,我们可以使用抽象类一次覆盖几种类型:

from collections.abc import Sequence


@say_type.register
def _(arg: Sequence):
    print(f"{arg} is a sequence!")
>>> say_type([0, 1, 2])
[0, 1, 2] is a sequence!
>>> say_type((1, 2, 3))
(1, 2, 3) is a sequence!

In many practical cases, instead of using type or isinstance you can also use @functools.singledispatch, which is used to define generic functions (a function composed of multiple functions implementing the same operation for different types).

In other words, you would want to use it when you have code like the following:

def do_something(arg):
    if isinstance(arg, int):
        ... # some code specific to processing integers
    if isinstance(arg, str):
        ... # some code specific to processing strings
    if isinstance(arg, list):
        ... # some code specific to processing lists
    ...  # etc

Here is a small example of how it works:

from functools import singledispatch


@singledispatch
def say_type(arg):
    raise NotImplementedError(f"I don't work with {type(arg)}")


@say_type.register
def _(arg: int):
    print(f"{arg} is an integer")


@say_type.register
def _(arg: bool):
    print(f"{arg} is a boolean")

>>> say_type(0)
0 is an integer
>>> say_type(False)
False is a boolean
>>> say_type(dict())
# long error traceback ending with:
NotImplementedError: I don't work with <class 'dict'>

Additionally, we can use abstract classes to cover several types at once:

from collections.abc import Sequence


@say_type.register
def _(arg: Sequence):
    print(f"{arg} is a sequence!")

>>> say_type([0, 1, 2])
[0, 1, 2] is a sequence!
>>> say_type((1, 2, 3))
(1, 2, 3) is a sequence!
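For reference, here is a self-contained variant of the example above that returns strings instead of printing, which makes the dispatch order easier to check. It is only a sketch: the function name `type_label` is hypothetical, and it uses the explicit `register(type)` form, which also works on Python versions before 3.7, where annotation-based registration is not available.

```python
from collections.abc import Sequence
from functools import singledispatch


@singledispatch
def type_label(arg):
    # Fallback for types without a registered implementation.
    raise NotImplementedError(f"I don't work with {type(arg)}")


@type_label.register(int)
def _(arg):
    return "integer"


@type_label.register(bool)
def _(arg):
    return "boolean"


@type_label.register(Sequence)
def _(arg):
    return "sequence"


print(type_label(0))          # integer
print(type_label(False))      # boolean: bool is more specific than int
print(type_label([0, 1, 2]))  # sequence
print(type_label("abc"))      # sequence: str is a Sequence too
```

The most specific registered type wins, so `False` is handled by the `bool` implementation even though it is also an `int`.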

回答 11

type()是比更好的解决方案isinstance(),特别是在booleans

TrueFalse只是关键字,平均10Python编写的。从而,

isinstance(True, int)

isinstance(False, int)

都回来了True。两个布尔值都是整数的实例。type()但是,它更聪明:

type(True) == int

返回False

type() is a better solution than isinstance(), particularly for booleans:

True and False are keywords for the two bool values, and bool is a subclass of int, so they behave like 1 and 0 in Python. Thus,

isinstance(True, int)

and

isinstance(False, int)

both return True, because both booleans are instances of int. type(), however, is stricter:

type(True) == int

returns False.
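The behavior follows from `bool` being a subclass of `int`. A minimal sketch; `is_strict_int` is a hypothetical helper showing one possible way to keep subclass-friendly `isinstance` checks while still rejecting booleans:

```python
# bool is a subclass of int, which is why isinstance() accepts True as an int.
print(issubclass(bool, int))   # True
print(isinstance(True, int))   # True
print(type(True) is int)       # False: the exact type is bool
print(True == 1, True + True)  # True 2


def is_strict_int(value):
    """Hypothetical helper: accept ints but reject booleans."""
    return isinstance(value, int) and not isinstance(value, bool)


print(is_strict_int(5), is_strict_int(True))  # True False
```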


如何从Python字典中删除键?

问题:如何从Python字典中删除键?

从字典中删除键时,我使用:

if 'key' in my_dict:
    del my_dict['key']

有没有一种方法可以做到这一点?

When deleting a key from a dictionary, I use:

if 'key' in my_dict:
    del my_dict['key']

Is there a one line way of doing this?


回答 0

要删除键而不管它是否在字典中,请使用以下两个参数的形式dict.pop()

my_dict.pop('key', None)

my_dict[key]如果key字典中存在,则返回,None否则返回。如果第二个参数未指定(即my_dict.pop('key'))并且key不存在,KeyError则引发a。

要删除肯定存在的密钥,您还可以使用

del my_dict['key']

KeyError如果密钥不在字典中,则将引发a 。

To delete a key regardless of whether it is in the dictionary, use the two-argument form of dict.pop():

my_dict.pop('key', None)

This will return my_dict['key'] if 'key' exists in the dictionary, and None otherwise. If the second parameter is not specified (i.e. my_dict.pop('key')) and 'key' does not exist, a KeyError is raised.

To delete a key that is guaranteed to exist, you can also use

del my_dict['key']

This will raise a KeyError if the key is not in the dictionary.
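The three behaviors described above can be checked with a small sketch (the dictionary contents are arbitrary):

```python
my_dict = {'key': 1, 'other': 2}

# Two-argument pop: returns the value if present, the default otherwise.
print(my_dict.pop('key', None))   # 1
print(my_dict.pop('key', None))   # None (already removed, no error)

# One-argument pop raises KeyError for a missing key.
try:
    my_dict.pop('key')
except KeyError:
    print("pop('key') raised KeyError")

# del raises KeyError for a missing key as well.
try:
    del my_dict['key']
except KeyError:
    print("del raised KeyError")

print(my_dict)  # {'other': 2}
```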


回答 1

专门回答“是否有一种统一的方法?”

if 'key' in my_dict: del my_dict['key']

…嗯,你问过 ;-)

你应该考虑,虽然,从删除对象的这种方式dict不是原子 -它是可能的,'key'可能是在my_dict该过程中if的语句,但是可以删除之前del被执行,在这种情况下del将失败,KeyError。鉴于此,最安全的使用dict.pop方式是

try:
    del my_dict['key']
except KeyError:
    pass

当然,这绝对不是单线的。

Specifically to answer “is there a one line way of doing this?”

if 'key' in my_dict: del my_dict['key']

…well, you asked ;-)

You should consider, though, that this way of deleting an object from a dict is not atomic: it is possible that 'key' may be in my_dict during the if statement, but may be deleted before del is executed, in which case del will fail with a KeyError. Given this, it would be safest to either use dict.pop or something along the lines of

try:
    del my_dict['key']
except KeyError:
    pass

which, of course, is definitely not a one-liner.
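If the four-line try/except feels heavy, the standard library's `contextlib.suppress` (available since Python 3.4) expresses the same thing more compactly. A small sketch, not part of the original answer:

```python
from contextlib import suppress

my_dict = {'key': 1}

# Equivalent to the try/except above: a KeyError raised inside the block is ignored.
with suppress(KeyError):
    del my_dict['key']

# Running it again on the now-missing key is harmless.
with suppress(KeyError):
    del my_dict['key']

print(my_dict)  # {}
```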


回答 2

我花了一些时间弄清楚究竟my_dict.pop("key", None)在做什么。因此,我将其添加为答案以节省其他Google搜索时间:

pop(key[, default])

如果key在字典中,请删除它并返回其值,否则返回default。如果未提供默认值并且字典中没有KeyError则引发a。

文献资料

It took me some time to figure out what exactly my_dict.pop("key", None) is doing. So I’ll add this as an answer to save others Googling time:

pop(key[, default])

If key is in the dictionary, remove it and return its value, else return default. If default is not given and key is not in the dictionary, a KeyError is raised.

Documentation


回答 3

del my_dict[key]my_dict.pop(key)在键存在时从字典中删除键要快一些

>>> import timeit
>>> setup = "d = {i: i for i in range(100000)}"

>>> timeit.timeit("del d[3]", setup=setup, number=1)
1.79e-06
>>> timeit.timeit("d.pop(3)", setup=setup, number=1)
2.09e-06
>>> timeit.timeit("d2 = {key: val for key, val in d.items() if key != 3}", setup=setup, number=1)
0.00786

但是,当密钥不存在时,它会if key in my_dict: del my_dict[key]比稍快一点my_dict.pop(key, None)。两者都至少比快三倍deltry/ except语句:

>>> timeit.timeit("if 'missing key' in d: del d['missing key']", setup=setup)
0.0229
>>> timeit.timeit("d.pop('missing key', None)", setup=setup)
0.0426
>>> try_except = """
... try:
...     del d['missing key']
... except KeyError:
...     pass
... """
>>> timeit.timeit(try_except, setup=setup)
0.133

del my_dict[key] is slightly faster than my_dict.pop(key) for removing a key from a dictionary when the key exists:

>>> import timeit
>>> setup = "d = {i: i for i in range(100000)}"

>>> timeit.timeit("del d[3]", setup=setup, number=1)
1.79e-06
>>> timeit.timeit("d.pop(3)", setup=setup, number=1)
2.09e-06
>>> timeit.timeit("d2 = {key: val for key, val in d.items() if key != 3}", setup=setup, number=1)
0.00786

But when the key doesn’t exist, if key in my_dict: del my_dict[key] is slightly faster than my_dict.pop(key, None). Both are at least three times faster than del in a try/except statement:

>>> timeit.timeit("if 'missing key' in d: del d['missing key']", setup=setup)
0.0229
>>> timeit.timeit("d.pop('missing key', None)", setup=setup)
0.0426
>>> try_except = """
... try:
...     del d['missing key']
... except KeyError:
...     pass
... """
>>> timeit.timeit(try_except, setup=setup)
0.133

回答 4

如果您需要在一行代码中从字典中删除很多键,我认为使用map()非常简洁且Python可读:

myDict = {'a':1,'b':2,'c':3,'d':4}
map(myDict.pop, ['a','c']) # The list of keys to remove
>>> myDict
{'b': 2, 'd': 4}

并且,如果您需要在弹出字典中没有的值的地方捕获错误,请在map()中使用lambda,如下所示:

map(lambda x: myDict.pop(x,None), ['a', 'c', 'e'])
[1, 3, None] # pop returns
>>> myDict
{'b': 2, 'd': 4}

或中的python3,您必须改为使用列表推导:

[myDict.pop(x, None) for x in ['a', 'c', 'e']]

有用。即使myDict没有“ e”键,“ e”也不会引起错误。

If you need to remove a lot of keys from a dictionary in one line of code, I think using map() is quite succinct and readable (this form relies on Python 2, where map() runs eagerly and returns a list):

>>> myDict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
>>> map(myDict.pop, ['a', 'c'])  # the list of keys to remove
[1, 3]
>>> myDict
{'b': 2, 'd': 4}

And if you need to avoid errors when popping a key that isn’t in the dictionary, use a lambda inside map() like this:

>>> myDict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
>>> map(lambda x: myDict.pop(x, None), ['a', 'c', 'e'])
[1, 3, None]
>>> myDict
{'b': 2, 'd': 4}

In Python 3, map() returns a lazy iterator, so nothing is removed until the iterator is consumed; use a list comprehension (or a plain for loop) instead:

[myDict.pop(x, None) for x in ['a', 'c', 'e']]

This works, and 'e' does not cause an error even though myDict has no 'e' key.
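A sketch of the Python 3 behavior, assuming the same `myDict` as above; the name `keys_to_remove` is only for illustration. It shows that a bare `map()` call removes nothing, while an explicit loop does:

```python
myDict = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
keys_to_remove = ['a', 'c', 'e']  # 'e' is deliberately absent

# A bare map() call is lazy in Python 3 and removes nothing until consumed.
lazy = map(lambda k: myDict.pop(k, None), keys_to_remove)
print(len(myDict))  # 4

# An explicit loop makes the side effect obvious.
for key in keys_to_remove:
    myDict.pop(key, None)

print(myDict)  # {'b': 2, 'd': 4}
```

A plain loop is usually preferable to a list comprehension here, since the comprehension builds a list of return values only to throw it away.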


回答 5

您可以使用字典理解来创建新字典,并删除该键:

>>> my_dict = {k: v for k, v in my_dict.items() if k != 'key'}

您可以按条件删除。如果key不存在,则没有错误。

You can use a dictionary comprehension to create a new dictionary with that key removed:

>>> my_dict = {k: v for k, v in my_dict.items() if k != 'key'}

This also lets you remove entries by an arbitrary condition, and no error is raised if the key doesn’t exist.
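As an illustration of deleting by condition, here is a small sketch with made-up data and an arbitrary filter on the values:

```python
my_dict = {'a': 1, 'b': -2, 'c': 3, 'd': -4}

# Keep only the entries whose value is non-negative; the condition is arbitrary.
my_dict = {k: v for k, v in my_dict.items() if v >= 0}
print(my_dict)  # {'a': 1, 'c': 3}

# Filtering on a key that is not present simply changes nothing.
my_dict = {k: v for k, v in my_dict.items() if k != 'missing'}
print(my_dict)  # {'a': 1, 'c': 3}
```

Keep in mind that this rebinds the name to a new dictionary; any other reference to the original object still sees the unfiltered contents.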


回答 6

使用“ del”关键字:

del dict[key]

Using the “del” keyword:

del my_dict[key]

回答 7

我们可以通过以下几种方法从Python字典中删除键。

使用del关键字;这几乎与您所采用的方法相同-

 myDict = {'one': 100, 'two': 200, 'three': 300 }
 print(myDict)  # {'one': 100, 'two': 200, 'three': 300}
 if myDict.get('one') : del myDict['one']
 print(myDict)  # {'two': 200, 'three': 300}

要么

我们可以像下面这样:

但是请记住,在此过程中,它实际上不会从字典中删除任何键,而不会从该字典中排除特定的键。另外,我观察到它返回的字典与的顺序不同myDict

myDict = {'one': 100, 'two': 200, 'three': 300, 'four': 400, 'five': 500}
{key:value for key, value in myDict.items() if key != 'one'}

如果我们在外壳中运行它,它将执行类似的操作{'five': 500, 'four': 400, 'three': 300, 'two': 200}-请注意,它与的顺序不同myDict。再次,如果我们尝试打印myDict,那么我们可以看到所有键,包括通过这种方法从字典中排除的键。但是,我们可以通过将以下语句分配给变量来创建新字典:

var = {key:value for key, value in myDict.items() if key != 'one'}

现在,如果我们尝试打印它,它将遵循父命令:

print(var) # {'two': 200, 'three': 300, 'four': 400, 'five': 500}

要么

使用pop()方法。

myDict = {'one': 100, 'two': 200, 'three': 300}
print(myDict)

if myDict.get('one') : myDict.pop('one')
print(myDict)  # {'two': 200, 'three': 300}

del和之间的区别在于pop,使用pop()方法,我们实际上可以根据需要存储键的值,如下所示:

myDict = {'one': 100, 'two': 200, 'three': 300}
if myDict.get('one') : var = myDict.pop('one')
print(myDict) # {'two': 200, 'three': 300}
print(var)    # 100

如果您觉得有用,请叉要点以备将来参考。

We can delete a key from a Python dictionary using any of the following approaches.

Using the del keyword; this is almost the same as the approach in the question:

myDict = {'one': 100, 'two': 200, 'three': 300}
print(myDict)  # {'one': 100, 'two': 200, 'three': 300}
if 'one' in myDict: del myDict['one']
print(myDict)  # {'two': 200, 'three': 300}

Or

We can use a dict comprehension, as shown below.

Keep in mind that this does not actually delete any key from the dictionary; it builds a new dictionary that excludes the specific key. In addition, I observed that the returned dictionary was not ordered the same as myDict.

myDict = {'one': 100, 'two': 200, 'three': 300, 'four': 400, 'five': 500}
{key:value for key, value in myDict.items() if key != 'one'}

If we run it in the shell, it evaluates to something like {'five': 500, 'four': 400, 'three': 300, 'two': 200}; notice that the order differs from myDict (on Python 3.7 and later, where dicts preserve insertion order, the order matches myDict). Again, if we print myDict, we still see all keys, including the one we excluded with this approach. However, we can create a new dictionary by assigning the expression to a variable:

var = {key:value for key, value in myDict.items() if key != 'one'}

Now if we print it, it follows the order of the original dictionary:

print(var) # {'two': 200, 'three': 300, 'four': 400, 'five': 500}

Or

Using the pop() method.

myDict = {'one': 100, 'two': 200, 'three': 300}
print(myDict)

if 'one' in myDict: myDict.pop('one')
print(myDict)  # {'two': 200, 'three': 300}

The difference between del and pop is that with the pop() method we can store the removed key’s value if needed, like this:

myDict = {'one': 100, 'two': 200, 'three': 300}
if 'one' in myDict: var = myDict.pop('one')
print(myDict) # {'two': 200, 'three': 300}
print(var)    # 100

Fork this gist for future reference, if you find this useful.


回答 8

如果您想要非常冗长,可以使用异常处理:

try: 
    del dict[key]

except KeyError: pass

但是,pop()如果键不存在,这比方法要慢。

my_dict.pop('key', None)

几个键无关紧要,但是如果重复执行此操作,则后一种方法是更好的选择。

最快的方法是这样的:

if 'key' in dict: 
    del myDict['key']

但是此方法很危险,因为如果'key'在两行之间将其删除,KeyError则会引发a。

You can use exception handling if you want to be very verbose:

try:
    del my_dict['key']
except KeyError:
    pass

If the key doesn’t exist, however, this is slower than the pop() method:

my_dict.pop('key', None)

It won’t matter for a few keys, but if you’re doing this repeatedly, then the latter method is a better bet.

The fastest approach is this:

if 'key' in my_dict:
    del my_dict['key']

But this method is dangerous because if 'key' is removed in between the two lines (for example by another thread), a KeyError will be raised.


回答 9

我更喜欢不变的版本

foo = {
    1:1,
    2:2,
    3:3
}
removeKeys = [1,2]
def woKeys(dct, keyIter):
    return {
        k:v
        for k,v in dct.items() if k not in keyIter
    }

>>> print(woKeys(foo, removeKeys))
{3: 3}
>>> print(foo)
{1: 1, 2: 2, 3: 3}

I prefer a non-mutating version that leaves the original dict untouched:

foo = {
    1:1,
    2:2,
    3:3
}
removeKeys = [1,2]
def woKeys(dct, keyIter):
    return {
        k:v
        for k,v in dct.items() if k not in keyIter
    }

>>> print(woKeys(foo, removeKeys))
{3: 3}
>>> print(foo)
{1: 1, 2: 2, 3: 3}

回答 10

另一种方法是通过使用items()+ dict理解

items()结合dict理解也可以帮助我们完成键-值对删除的任务,但是它具有不适合就地使用dict的缺点。实际上,如果创建了一个新字典,除了我们不希望包含的密钥之外。

test_dict = {"sai" : 22, "kiran" : 21, "vinod" : 21, "sangam" : 21} 

# Printing dictionary before removal 
print ("dictionary before performing remove is : " + str(test_dict)) 

# Using items() + dict comprehension to remove a dict. pair 
# removes  vinod
new_dict = {key:val for key, val in test_dict.items() if key != 'vinod'} 

# Printing dictionary after removal 
print ("dictionary after remove is : " + str(new_dict)) 

输出:

dictionary before performing remove is : {'sai': 22, 'kiran': 21, 'vinod': 21, 'sangam': 21}
dictionary after remove is : {'sai': 22, 'kiran': 21, 'sangam': 21}

Another way is to use items() with a dict comprehension.

items() combined with a dict comprehension can also remove a key-value pair, but it has the drawback of not working in place: a new dict is created that contains everything except the key we don’t wish to include.

test_dict = {"sai" : 22, "kiran" : 21, "vinod" : 21, "sangam" : 21} 

# Printing dictionary before removal 
print ("dictionary before performing remove is : " + str(test_dict)) 

# Using items() + dict comprehension to remove a dict. pair 
# removes  vinod
new_dict = {key:val for key, val in test_dict.items() if key != 'vinod'} 

# Printing dictionary after removal 
print ("dictionary after remove is : " + str(new_dict)) 

Output:

dictionary before performing remove is : {'sai': 22, 'kiran': 21, 'vinod': 21, 'sangam': 21}
dictionary after remove is : {'sai': 22, 'kiran': 21, 'sangam': 21}

回答 11

单键过滤

  • 如果my_dict中存在“ key”,则返回“ key”并将其从my_dict中删除
  • 如果my_dict中不存在“键”,则返回None

这将改变my_dict(可变)

my_dict.pop('key', None)

按键上有多个过滤器

生成一个新的字典(不可变的)

dic1 = {
    "x":1,
    "y": 2,
    "z": 3
}

def func1(item):
    return  item[0]!= "x" and item[0] != "y"

print(
    dict(
        filter(
            lambda item: item[0] != "x" and item[0] != "y", 
            dic1.items()
            )
    )
)

Single filter on key

  • return my_dict['key'] and remove 'key' from my_dict if 'key' exists in my_dict
  • return None if 'key' doesn’t exist in my_dict

This changes my_dict in place (mutating):

my_dict.pop('key', None)

Multiple filters on keys

This generates a new dict and leaves the original unchanged:

dic1 = {
    "x":1,
    "y": 2,
    "z": 3
}

def func1(item):
    # item is a (key, value) pair; keep it unless the key is "x" or "y"
    return item[0] != "x" and item[0] != "y"

print(dict(filter(func1, dic1.items())))  # {'z': 3}

为什么是string.join(list)而不是list.join(string)?

问题:为什么是string.join(list)而不是list.join(string)?

这一直使我感到困惑。看起来这样会更好:

my_list = ["Hello", "world"]
print(my_list.join("-"))
# Produce: "Hello-world"

比这个:

my_list = ["Hello", "world"]
print("-".join(my_list))
# Produce: "Hello-world"

是否有特定原因?

This has always confused me. It seems like this would be nicer:

my_list = ["Hello", "world"]
print(my_list.join("-"))
# Produce: "Hello-world"

Than this:

my_list = ["Hello", "world"]
print("-".join(my_list))
# Produce: "Hello-world"

Is there a specific reason it is like this?


回答 0

这是因为任何可迭代项都可以连接(例如,列表,元组,字典,集合),但是结果和“连接器” 必须是字符串。

例如:

'_'.join(['welcome', 'to', 'stack', 'overflow'])
'_'.join(('welcome', 'to', 'stack', 'overflow'))
'welcome_to_stack_overflow'

使用字符串以外的其他东西会引发以下错误:

TypeError:序列项0:预期的str实例,找到的int

It’s because any iterable can be joined (e.g., list, tuple, dict, set), but its items and the “joiner” must be strings, as is the result.

For example:

>>> '_'.join(['welcome', 'to', 'stack', 'overflow'])
'welcome_to_stack_overflow'
>>> '_'.join(('welcome', 'to', 'stack', 'overflow'))
'welcome_to_stack_overflow'

Joining items that are not strings will raise the following error:

TypeError: sequence item 0: expected str instance, int found
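A short sketch of both points: joining works for any iterable of strings, and non-string items need converting first. The sample data is arbitrary:

```python
# Any iterable of strings can be joined, not just lists.
print('_'.join(['a', 'b']))               # list: a_b
print('_'.join(('a', 'b')))               # tuple: a_b
print('_'.join({'x': 1, 'y': 2}))         # dict: joins the keys, x_y
print('_'.join(c.upper() for c in 'ab'))  # generator: A_B

numbers = [1, 2, 3]
try:
    '_'.join(numbers)
except TypeError as exc:
    print(exc)  # sequence item 0: expected str instance, int found

# Converting each item to str first avoids the error.
print('_'.join(map(str, numbers)))        # 1_2_3
```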


回答 1

这在String方法中进行了讨论……最终在Python-Dev中实现,并被Guido接受。该线程始于1999年6月,并str.join包含在2000年9月发布的Python 1.6中(并支持Unicode)。Python 2.0(受支持的str方法,包括join)于2000年10月发布。

  • 此线程中提出了四个选项:
    • str.join(seq)
    • seq.join(str)
    • seq.reduce(str)
    • join 作为内置功能
  • Guido不仅希望支持lists,tuples,而且还支持所有序列/可迭代对象。
  • seq.reduce(str) 对于新来者来说很难。
  • seq.join(str) 从序列到str / unicode引入了意外的依赖关系。
  • join()因为内置函数仅支持特定的数据类型。因此,使用内置的命名空间是不好的。如果join()支持许多数据类型,则创建优化的实现将很困难,如果使用该__add__方法实现,则为O(n²)。
  • 分隔符(sep)不应省略。显式胜于隐式。

此线程中没有其他原因。

以下是一些其他想法(我自己和我朋友的想法):

  • Unicode支持即将到来,但这不是最终的。当时,UTF-8最有可能取代UCS2 / 4。要计算UTF-8字符串的总缓冲区长度,需要知道字符编码规则。
  • 那时,Python已经决定了通用的序列接口规则,用户可以在其中创建类似序列的(可迭代)类。但是Python直到2.2才支持扩展内置类型。那时,很难提供基本的可迭代类(在另一条评论中提到)。

Guido的决定记录在历史邮件中,决定str.join(seq)

有趣,但看起来确实正确!巴里,去吧…-
吉多·范·罗苏姆(Guido van Rossum)

This was discussed in the “String methods… finally” thread in the Python-Dev archive, and was accepted by Guido. The thread began in June 1999, and str.join was included in Python 1.6, which was released in September 2000 (and supported Unicode). Python 2.0, which also supported str methods including join, was released in October 2000.

  • There were four options proposed in this thread:
    • str.join(seq)
    • seq.join(str)
    • seq.reduce(str)
    • join as a built-in function
  • Guido wanted to support not only lists and tuples, but all sequences/iterables.
  • seq.reduce(str) is difficult for newcomers.
  • seq.join(str) introduces an unexpected dependency from sequences to str/unicode.
  • join() as a built-in function would support only specific data types, so using the built-in namespace for it is not good. If join() supported many data types, creating an optimized implementation would be difficult, and an implementation based on the __add__ method would be O(n²).
  • The separator string (sep) should not be omitted. Explicit is better than implicit.

There are no other reasons offered in this thread.
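The O(n²) remark about an `__add__`-based join can be illustrated with a naive sketch. This is not how CPython implements anything; it only contrasts repeated concatenation with `str.join`, and it makes no timing claims, since CPython can optimize some concatenations:

```python
from functools import reduce

words = ['welcome', 'to', 'stack', 'overflow']
sep = '_'

# join() built from repeated '+': every step creates a new, longer string,
# so the total amount of copying can grow quadratically with the input size.
via_add = reduce(lambda acc, item: acc + sep + item, words)

# str.join receives all the pieces at once and can build the result in one go.
via_join = sep.join(words)

print(via_add == via_join)  # True
```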

Here are some additional thoughts (my own, and my friend’s):

  • Unicode support was coming, but it was not final. At that time UTF-8 seemed the most likely candidate to replace UCS2/4. To calculate the total buffer length of UTF-8 strings, join needs to know the character encoding rules.
  • At that time, Python had already decided on a common sequence interface rule where a user could create a sequence-like (iterable) class. But Python didn’t support extending built-in types until 2.2, so it was difficult to provide a basic iterable class (which is mentioned in another comment).

Guido’s decision is recorded in a historical mail, deciding on str.join(seq):

Funny, but it does seem right! Barry, go for it…
–Guido van Rossum


回答 2

因为该join()方法位于字符串类中,而不是列表类中?

我同意这看起来很有趣。

参见http://www.faqs.org/docs/diveintopython/odbchelper_join.html

历史记录。当我第一次学习Python时,我期望join是一个列表方法,它将分隔符作为参数。很多人都有相同的感觉,join方法背后还有一个故事。在Python 1.6之前,字符串没有所有这些有用的方法。有一个单独的字符串模块,其中包含所有字符串函数。每个函数都将字符串作为第一个参数。这些功能被认为很重要,足以放在字符串本身上,这对于诸如lower,upper和split这样的功能是有意义的。但是许多铁杆Python程序员反对使用新的join方法,认为它应该是列表的方法,或者根本不应该移动,而只是保留旧字符串模块的一部分(仍然有很多方法)里面有用的东西)。

— Mark Pilgrim,深入Python

Because the join() method is in the string class, instead of the list class?

I agree it looks funny.

See http://www.faqs.org/docs/diveintopython/odbchelper_join.html:

Historical note. When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Lots of people feel the same way, and there’s a story behind the join method. Prior to Python 1.6, strings didn’t have all these useful methods. There was a separate string module which contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn’t move at all but simply stay a part of the old string module (which still has lots of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead.

— Mark Pilgrim, Dive into Python


回答 3

我同意起初这是违反直觉的,但是有充分的理由。Join不能成为列表的方法,因为:

  • 它也必须适用于不同的可迭代对象(元组,生成器等)
  • 在不同类型的字符串之间它必须具有不同的行为。

实际上有两种连接方法(Python 3.0):

>>> b"".join
<built-in method join of bytes object at 0x00A46800>
>>> "".join
<built-in method join of str object at 0x00A28D40>

如果join是列表的一种方法,则它必须检查其参数以确定要调用的参数。而且您不能将byte和str结合在一起,因此它们现在的用法很有意义。

I agree that it’s counterintuitive at first, but there’s a good reason. Join can’t be a method of a list because:

  • it must work for different iterables too (tuples, generators, etc.)
  • it must have different behavior between different types of strings.

There are actually two join methods (Python 3.0):

>>> b"".join
<built-in method join of bytes object at 0x00A46800>
>>> "".join
<built-in method join of str object at 0x00A28D40>

If join were a method of a list, then it would have to inspect its arguments to decide which one of them to call. And you can’t join bytes and str together, so the way they have it now makes sense.
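A quick sketch of the two join methods and of what happens when the types are mixed (the sample values are arbitrary):

```python
print(", ".join(["a", "b"]))     # str join: a, b
print(b", ".join([b"a", b"b"]))  # bytes join: b'a, b'

# Mixing the two types is rejected in both directions.
try:
    ", ".join([b"a", b"b"])
except TypeError as exc:
    print("TypeError:", exc)

try:
    b", ".join(["a", "b"])
except TypeError as exc:
    print("TypeError:", exc)
```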


回答 4

为什么用它string.join(list)代替list.join(string)

这是因为join是“字符串”方法!它从任何迭代创建一个字符串。如果我们将方法卡在列表中,那么当我们拥有非列表的可迭代对象时该怎么办?

如果您有一个字符串元组怎么办?如果这是一种list方法,则必须将每个这样的字符串迭代器都转换为,list然后才能将元素连接到单个字符串中!例如:

some_strings = ('foo', 'bar', 'baz')

让我们推出自己的列表连接方法:

class OurList(list): 
    def join(self, s):
        return s.join(self)

并使用它,请注意,我们必须首先从每个可迭代对象创建一个列表,以将该字符串连接到该可迭代对象,从而浪费内存和处理能力:

>>> l = OurList(some_strings) # step 1, create our list
>>> l.join(', ') # step 2, use our list join method!
'foo, bar, baz'

因此,我们看到我们必须添加一个额外的步骤来使用我们的列表方法,而不仅仅是使用内置的字符串方法:

>>> ' | '.join(some_strings) # a single step!
'foo | bar | baz'

生成器性能警告

Python用于创建最终字符串的算法str.join实际上必须传递两次迭代,因此,如果为其提供生成器表达式,则必须先将其具体化为列表,然后才能创建最终字符串。

因此,尽管绕过生成器通常比列表理解更好,但这str.join是一个exceptions:

>>> import timeit
>>> min(timeit.repeat(lambda: ''.join(str(i) for i in range(10) if i)))
3.839168446022086
>>> min(timeit.repeat(lambda: ''.join([str(i) for i in range(10) if i])))
3.339879313018173

但是,该str.join操作在语义上仍然是“字符串”操作,因此将其放在str对象上而不是在其他可迭代对象上还是有意义的。

Why is it string.join(list) instead of list.join(string)?

This is because join is a “string” method! It creates a string from any iterable. If we stuck the method on lists, what about when we have iterables that aren’t lists?

What if you have a tuple of strings? If this were a list method, you would have to convert every such iterable of strings to a list before you could join the elements into a single string! For example:

some_strings = ('foo', 'bar', 'baz')

Let’s roll our own list join method:

class OurList(list): 
    def join(self, s):
        return s.join(self)

And to use it, note that we have to first create a list from each iterable to join the strings in that iterable, wasting both memory and processing power:

>>> l = OurList(some_strings) # step 1, create our list
>>> l.join(', ') # step 2, use our list join method!
'foo, bar, baz'

So we see we have to add an extra step to use our list method, instead of just using the builtin string method:

>>> ' | '.join(some_strings) # a single step!
'foo | bar | baz'

Performance Caveat for Generators

The algorithm Python uses to create the final string with str.join actually has to pass over the iterable twice, so if you provide it a generator expression, it has to materialize it into a list first before it can create the final string.

Thus, while passing around generators is usually better than list comprehensions, str.join is an exception:

>>> import timeit
>>> min(timeit.repeat(lambda: ''.join(str(i) for i in range(10) if i)))
3.839168446022086
>>> min(timeit.repeat(lambda: ''.join([str(i) for i in range(10) if i])))
3.339879313018173

Nevertheless, the str.join operation is still semantically a “string” operation, so it still makes sense to have it on the str object rather than on miscellaneous iterables.


回答 5

将其视为拆分的自然正交运算。

我明白为什么它适用于任何可迭代的,所以不能简单地执行只是在列表中。

为了提高可读性,我想用该语言查看它,但我认为这实际上是不可行的-如果可迭代性是一个接口,则可以将其添加到该接口中,但这只是一个约定,因此没有中央方法将其添加到可迭代的事物集中。

Think of it as the natural orthogonal operation to split.

I understand why it is applicable to anything iterable and so can’t easily be implemented just on list.

For readability, I’d like to see it in the language but I don’t think that is actually feasible – if iterability were an interface then it could be added to the interface but it is just a convention and so there’s no central way to add it to the set of things which are iterable.


回答 6

主要是因为a的结果someString.join()是字符串。

序列(列表或元组等)不会出现在结果中,而只是一个字符串。因为结果是一个字符串,所以作为字符串的方法是有意义的。

Primarily because the result of a someString.join() is a string.

The sequence (list or tuple or whatever) doesn’t appear in the result, just a string. Because the result is a string, it makes sense as a method of a string.


回答 7

- 在“-”中。join(my_list)声明您正在从列表的连接元素转换为字符串。它以结果为导向。(为便于记忆和理解)

我制作了一个methods_of_string的详尽备忘单,供您参考。

string_methonds_44 = {
    'convert': ['join','split', 'rsplit','splitlines', 'partition', 'rpartition'],
    'edit': ['replace', 'lstrip', 'rstrip', 'strip'],
    'search': ['endswith', 'startswith', 'count', 'index', 'find','rindex', 'rfind',],
    'condition': ['isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isnumeric','isidentifier',
                  'islower','istitle', 'isupper','isprintable', 'isspace', ],
    'text': ['lower', 'upper', 'capitalize', 'title', 'swapcase',
             'center', 'ljust', 'rjust', 'zfill', 'expandtabs','casefold'],
    'encode': ['translate', 'maketrans', 'encode'],
    'format': ['format', 'format_map']}

The "-" in "-".join(my_list) declares that you are converting to a string by joining the elements of a list. It is result-oriented (just an aid for memory and understanding).

I made an exhaustive cheat sheet of string methods for your reference:

string_methods_44 = {
    'convert': ['join','split', 'rsplit','splitlines', 'partition', 'rpartition'],
    'edit': ['replace', 'lstrip', 'rstrip', 'strip'],
    'search': ['endswith', 'startswith', 'count', 'index', 'find','rindex', 'rfind',],
    'condition': ['isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isnumeric','isidentifier',
                  'islower','istitle', 'isupper','isprintable', 'isspace', ],
    'text': ['lower', 'upper', 'capitalize', 'title', 'swapcase',
             'center', 'ljust', 'rjust', 'zfill', 'expandtabs','casefold'],
    'encode': ['translate', 'maketrans', 'encode'],
    'format': ['format', 'format_map']}

回答 8

两者都不好。

string.join(xs,delimit)表示字符串模块知道列表的存在,而列表列表却没有任何业务意义,因为字符串模块仅适用于字符串。

list.join(delimit)更好一点,因为我们习惯于将字符串作为基本类型(从语言上讲,它们是)。但是,这意味着需要动态调度连接,因为在a.split("\n") python编译器,可能不知道a是什么,因此需要查找它(类似于vtable查找),如果您花很多时间这样做,这会很昂贵。次。

如果python运行时编译器知道列表是内置模块,则它可以跳过动态查找并将意图直接编码为字节码,否则,它需要动态地解析“ a”的“ join”,这可能是多层的每次调用的继承关系(因为两次调用之间,join的含义可能已更改,因为python是一种动态语言)。

可悲的是,这是抽象的最终缺陷。无论您选择哪种抽象,您的抽象都仅在您要解决的问题的背景下才有意义,因此,当您开始将它们胶合在一起时,您将永远无法获得与基础意识形态相一致的一致抽象而不将它们包装在与您的意识形态相符的视图中。知道了这一点,python的方法更灵活,因为它更便宜,您可以自己制作包装器或自己的预处理器,为此要花更多的钱才能使它看起来“更漂亮”。

Neither is nice.

string.join(xs, delimit) means that the string module is aware of the existence of a list, which it has no business knowing about, since the string module only works with strings.

list.join(delimit) is a bit nicer because we’re so used to strings being a fundamental type (and linguistically speaking, they are). However, this means that join needs to be dispatched dynamically, because in the arbitrary context of a.split("\n") the Python compiler might not know what a is and will need to look it up (analogous to a vtable lookup), which is expensive if you do it many times.

If the Python runtime compiler knows that list is a built-in module, it can skip the dynamic lookup and encode the intent into the bytecode directly, whereas otherwise it needs to dynamically resolve “join” of “a”, which may be up several layers of inheritance per call (since between calls the meaning of join may have changed, because Python is a dynamic language).

Sadly, this is the ultimate flaw of abstraction: no matter what abstraction you choose, it will only make sense in the context of the problem you’re trying to solve. As such, you can never have a consistent abstraction that doesn’t become inconsistent with underlying ideologies as you start gluing them together, unless you wrap them in a view that is consistent with your ideology. Knowing this, Python’s approach is more flexible since it’s cheaper; it’s up to you to pay more to make it look “nicer”, either by making your own wrapper or your own preprocessor.


回答 9

变量my_list"-"都是对象。具体来说,它们分别是类list和的实例str。该join函数属于该类str。因此,使用语法"-".join(my_list)是因为对象"-"my_list作为输入。

The variables my_list and "-" are both objects. Specifically, they’re instances of the classes list and str, respectively. The join function belongs to the class str. Therefore, the syntax "-".join(my_list) is used because the object "-" is taking my_list as an input.


如何从列表中随机选择一个项目?

问题:如何从列表中随机选择一个项目?

假设我有以下列表:

foo = ['a', 'b', 'c', 'd', 'e']

从此列表中随机检索项目的最简单方法是什么?

Assume I have the following list:

foo = ['a', 'b', 'c', 'd', 'e']

What is the simplest way to retrieve an item at random from this list?


回答 0

采用 random.choice()

import random

foo = ['a', 'b', 'c', 'd', 'e']
print(random.choice(foo))

对于密码安全的随机选择(例如,用于从单词列表生成密码短语),请使用secrets.choice()

import secrets

foo = ['battery', 'correct', 'horse', 'staple']
print(secrets.choice(foo))

secrets是Python 3.6中的新功能,在旧版本的Python上,您可以使用random.SystemRandom此类:

import random

secure_random = random.SystemRandom()
print(secure_random.choice(foo))

Use random.choice()

import random

foo = ['a', 'b', 'c', 'd', 'e']
print(random.choice(foo))

For cryptographically secure random choices (e.g. for generating a passphrase from a wordlist), use secrets.choice():

import secrets

foo = ['battery', 'correct', 'horse', 'staple']
print(secrets.choice(foo))

secrets is new in Python 3.6. On older versions of Python you can use the random.SystemRandom class:

import random

secure_random = random.SystemRandom()
print(secure_random.choice(foo))
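If the same code has to run on both old and new interpreters, the two options above can be combined. This is only a sketch; the name secure_choice is made up for the example and is not part of either module:

```python
foo = ['battery', 'correct', 'horse', 'staple']

try:
    # Python 3.6+: the secrets module is available.
    from secrets import choice as secure_choice
except ImportError:
    # Older Pythons: fall back to the OS-backed generator in random.
    import random
    secure_choice = random.SystemRandom().choice

print(secure_choice(foo))
```

Either branch draws from the operating system's randomness source, so the caller does not need to know which one was taken.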

Answer 1

If you want to randomly select more than one item from a list, or select an item from a set, I’d recommend using random.sample instead. Note that passing a set directly was deprecated in Python 3.9 and raises TypeError since Python 3.11, so convert the set to a sequence first.

import random
group_of_items = {1, 2, 3, 4}               # a set must be converted to a sequence before sampling.
num_to_select = 2                           # set the number to select here.
list_of_random_items = random.sample(list(group_of_items), num_to_select)
first_random_item = list_of_random_items[0]
second_random_item = list_of_random_items[1]

If you’re only pulling a single item from a list though, choice is less clunky, as using sample would have the syntax random.sample(some_list, 1)[0] instead of random.choice(some_list).

Unfortunately, choice only returns a single item and only works on sequences (such as lists or tuples), although random.choice(tuple(some_set)) may be an option for getting a single item from a set.
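A minimal sketch of that set workaround, with some_set filled with made-up values for illustration:

```python
import random

some_set = {"a", "b", "c"}

# choice needs something indexable, so build a tuple from the set first.
item = random.choice(tuple(some_set))

print(item)
```

Building the tuple costs time proportional to the size of the set on every call, so for repeated picks it may be worth converting once and reusing the tuple.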

EDIT: Using Secrets

As many have pointed out, if you require cryptographically secure random samples, you should use the secrets module:

import secrets                              # imports secure module.
secure_random = secrets.SystemRandom()      # creates a secure random object.
group_of_items = {1, 2, 3, 4}               # a set must be converted to a sequence before sampling.
num_to_select = 2                           # set the number to select here.
list_of_random_items = secure_random.sample(list(group_of_items), num_to_select)
first_random_item = list_of_random_items[0]
second_random_item = list_of_random_items[1]

EDIT: Pythonic One-Liner

If you want a more pythonic one-liner for selecting multiple items, you can use unpacking.

import random
first_random_item, second_random_item = random.sample(list(group_of_items), 2)

Answer 2

If you also need the index, use random.randrange

from random import randrange
random_index = randrange(len(foo))
print(foo[random_index])

Answer 3

As of Python 3.6 you can use the secrets module, which is preferable to the random module for cryptography or security uses.

To print a random element from a list:

import secrets
foo = ['a', 'b', 'c', 'd', 'e']
print(secrets.choice(foo))

To print a random index:

print(secrets.randbelow(len(foo)))

For details, see PEP 506.


Answer 4

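I propose a script that removes randomly picked items from a collection until it is empty:

Maintain a set and remove a randomly picked element (with choice) until the set is empty.

import random

s = set(range(1, 6))

while s:
    # choice needs a sequence, so convert the set to a list first
    s.remove(random.choice(list(s)))
    print(s)

Three runs give three different answers (the output below is from Python 2; Python 3 prints the same sets as {1, 3, 4, 5} and the empty set as set()):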

>>> 
set([1, 3, 4, 5])
set([3, 4, 5])
set([3, 4])
set([4])
set([])
>>> 
set([1, 2, 3, 5])
set([2, 3, 5])
set([2, 3])
set([2])
set([])

>>> 
set([1, 2, 3, 5])
set([1, 2, 3])
set([1, 2])
set([1])
set([])
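A possible alternative, not taken from the answer above: if the goal is only to consume all items in random order, shuffling a list once and popping from the end avoids rebuilding a list from the set on every iteration. The names items and removed are made up for this sketch:

```python
import random

items = list(range(1, 6))
random.shuffle(items)          # one in-place shuffle up front

removed = []
while items:
    # Popping from the end of a list is cheap and needs no search.
    removed.append(items.pop())

print(removed)
```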

Answer 5

import random

foo = ['a', 'b', 'c', 'd', 'e']
number_of_samples = 1

In Python 2 and 3 (sampling without replacement):

random_items = random.sample(population=foo, k=number_of_samples)

In Python 3.6+ (sampling with replacement):

random_items = random.choices(population=foo, k=number_of_samples)
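The two functions differ in more than the Python version they need: sample never repeats an item, while choices (added in Python 3.6) draws with replacement. With a single item the results look alike, but for larger k the difference shows. A short sketch, with the variable names chosen only for illustration:

```python
import random

foo = ['a', 'b', 'c', 'd', 'e']

# sample: without replacement, so k cannot exceed len(foo)
# and every item appears at most once.
without_replacement = random.sample(foo, k=5)

# choices: with replacement, so k may exceed len(foo)
# and items can repeat.
with_replacement = random.choices(foo, k=8)

print(without_replacement)
print(with_replacement)
```

Both calls return a list, even when k is 1.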

Answer 6

numpy solution: numpy.random.choice

For this question, it works the same as the accepted answer (import random; random.choice()), but I added it because the programmer may have imported numpy already (like me), and there are some differences between the two methods that may matter for your actual use case.

import numpy as np    
np.random.choice(foo) # randomly selects a single item

For reproducibility, you can do:

np.random.seed(123)
np.random.choice(foo) # first call will always return 'c'

For samples of one or more items, returned as an array, pass the size argument:

np.random.choice(foo, size=5)                 # sample with replacement (default)
np.random.choice(foo, size=5, replace=False)  # sample without replacement

Answer 7

How to randomly select an item from a list?

Assume I have the following list:

foo = ['a', 'b', 'c', 'd', 'e']  

What is the simplest way to retrieve an item at random from this list?

If you want close to truly random, then I suggest secrets.choice from the standard library (New in Python 3.6.):

>>> from secrets import choice         # Python 3.6+ only
>>> choice(list('abcde'))
'c'

The above is equivalent to my former recommendation, using a SystemRandom object from the random module with the choice method – available earlier in Python 2:

>>> import random                      # Python 2 compatible
>>> sr = random.SystemRandom()
>>> foo = list('abcde')
>>> foo
['a', 'b', 'c', 'd', 'e']

And now:

>>> sr.choice(foo)
'd'
>>> sr.choice(foo)
'e'
>>> sr.choice(foo)
'a'
>>> sr.choice(foo)
'b'
>>> sr.choice(foo)
'a'
>>> sr.choice(foo)
'c'
>>> sr.choice(foo)
'c'

If you want a deterministic pseudorandom selection, use the choice function (which is actually a bound method on a Random object):

>>> random.choice
<bound method Random.choice of <random.Random object at 0x800c1034>>

It seems random, but it’s actually not, which we can see if we reseed it repeatedly:

>>> random.seed(42); random.choice(foo), random.choice(foo), random.choice(foo)
('d', 'a', 'b')
>>> random.seed(42); random.choice(foo), random.choice(foo), random.choice(foo)
('d', 'a', 'b')
>>> random.seed(42); random.choice(foo), random.choice(foo), random.choice(foo)
('d', 'a', 'b')
>>> random.seed(42); random.choice(foo), random.choice(foo), random.choice(foo)
('d', 'a', 'b')
>>> random.seed(42); random.choice(foo), random.choice(foo), random.choice(foo)
('d', 'a', 'b')

A comment:

This is not about whether random.choice is truly random or not. If you fix the seed, you will get the reproducible results — and that’s what seed is designed for. You can pass a seed to SystemRandom, too. sr = random.SystemRandom(42)

Well, yes you can pass it a “seed” argument, but you’ll see that the SystemRandom object simply ignores it:

def seed(self, *args, **kwds):
    "Stub method.  Not used for a system random number generator."
    return None
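The difference can be checked directly. In this sketch (variable names are made up for illustration), two random.Random generators given the same seed make identical picks, while two SystemRandom objects given the same "seed" do not track each other:

```python
import random

foo = ['a', 'b', 'c', 'd', 'e']

# Two seeded generators with the same seed produce the same picks.
a = random.Random(42)
b = random.Random(42)
seeded_a = [a.choice(foo) for _ in range(10)]
seeded_b = [b.choice(foo) for _ in range(10)]
print(seeded_a == seeded_b)

# SystemRandom accepts the argument but ignores it, so these two
# 128-bit draws come straight from the operating system.
x = random.SystemRandom(42).getrandbits(128)
y = random.SystemRandom(42).getrandbits(128)
print(x == y)
```

A collision between the two 128-bit values is possible in principle but has a probability of about 2**-128.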

Answer 8

If you need the index, just use:

import random
foo = ['a', 'b', 'c', 'd', 'e']
random_index = int(random.random() * len(foo))  # pick the index once so it matches the item
print(random_index)
print(foo[random_index])

random.choice does the same :)


Answer 9

This is the code with a variable that defines the random index:

import random

foo = ['a', 'b', 'c', 'd', 'e']
randomindex = random.randint(0, len(foo) - 1)
print(foo[randomindex])
# print(randomindex)

This is the code without the variable:

import random

foo = ['a', 'b', 'c', 'd', 'e']
print(foo[random.randint(0, len(foo) - 1)])

And this is the code in the shortest and smartest way to do it:

import random

foo = ['a', 'b', 'c', 'd', 'e']
print(random.choice(foo))

(Python 2.7)


Answer 10

The following code shows how to get the same items on every run by seeding the generator. You can also specify how many samples you want to extract.
The sample method returns a new list containing elements from the population while leaving the original population unchanged. The resulting list is in selection order, so all sub-slices are also valid random samples.

import random

foo = ['a', 'b', 'c', 'd', 'e']
random.seed(0)  # omit the seed call if you want different results on each run
print(random.sample(foo, 3))  # 3 is the number of samples you want to retrieve

Output:['d', 'e', 'a']

Answer 11

Random item selection:

import random

my_list = [1, 2, 3, 4, 5]
num_selections = 2

new_list = random.sample(my_list, num_selections)

To preserve the order of the list, you could do:

randIndex = random.sample(range(len(my_list)), num_selections)  # pick distinct positions
randIndex.sort()                                                # restore original order
new_list = [my_list[i] for i in randIndex]

Duplicate of https://stackoverflow.com/a/49682832/4383027


Answer 12

We can also do this using randint.

from random import randint
l= ['a','b','c']

def get_rand_element(l):
    if l:
        return l[randint(0,len(l)-1)]
    else:
        return None

get_rand_element(l)

Answer 13

You could just:

from random import randint

foo = ["a", "b", "c", "d", "e"]

print(foo[randint(0, len(foo) - 1)])  # randint includes both endpoints