分类目录归档:知识问答

如何在Python中将字符串转换为整数?

问题:如何在Python中将字符串转换为整数?

我有一个来自MySQL查询的元组,像这样:

T1 = (('13', '17', '18', '21', '32'),
      ('07', '11', '13', '14', '28'),
      ('01', '05', '06', '08', '15', '16'))

我想将所有字符串元素转换为整数,然后将它们放回列表列表中:

T2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

我试图用它来实现它,eval但是还没有得到令人满意的结果。

I have a tuple of tuples from a MySQL query like this:

T1 = (('13', '17', '18', '21', '32'),
      ('07', '11', '13', '14', '28'),
      ('01', '05', '06', '08', '15', '16'))

I’d like to convert all the string elements into integers and put them back into a list of lists:

T2 = [[13, 17, 18, 21, 32], [7, 11, 13, 14, 28], [1, 5, 6, 8, 15, 16]]

I tried to achieve it with eval but didn’t get any decent result yet.


回答 0

int()是Python标准的内置函数,用于将字符串转换为整数值。您使用一个包含数字作为参数的字符串来调用它,它返回转换为整数的数字:

print (int("1") + 1)

上面的照片2

如果您知道列表T1的结构(它仅包含列表,仅一个级别),则可以在Python 2中执行此操作:

T2 = [map(int, x) for x in T1]

在Python 3中:

T2 = [list(map(int, x)) for x in T1]

int() is the Python standard built-in function to convert a string into an integer value. You call it with a string containing a number as the argument, and it returns the number converted to an integer:

print (int("1") + 1)

The above prints 2.

If you know the structure of your list, T1 (that it simply contains lists, only one level), you could do this in Python 2:

T2 = [map(int, x) for x in T1]

In Python 3:

T2 = [list(map(int, x)) for x in T1]

回答 1

您可以通过列表理解来做到这一点:

T2 = [[int(column) for column in row] for row in T1]

内部列表理解([int(column) for column in row])建立一个listint期从序列int-able物体,如小数字符串中row。外部列表推导([... for row in T1]))生成一个内部列表推导的结果的列表,该结果适用于中的每个项目T1

如果任何行包含无法通过转换的对象,则代码段将失败int。如果要处理包含非十进制字符串的行,则需要一个更智能的函数。

如果您知道行的结构,则可以使用对行函数的调用来替换内部列表理解。例如。

T2 = [parse_a_row_of_T1(row) for row in T1]

You can do this with a list comprehension:

T2 = [[int(column) for column in row] for row in T1]

The inner list comprehension ([int(column) for column in row]) builds a list of ints from a sequence of int-able objects, like decimal strings, in row. The outer list comprehension ([... for row in T1])) builds a list of the results of the inner list comprehension applied to each item in T1.

The code snippet will fail if any of the rows contain objects that can’t be converted by int. You’ll need a smarter function if you want to process rows containing non-decimal strings.

If you know the structure of the rows, you can replace the inner list comprehension with a call to a function of the row. Eg.

T2 = [parse_a_row_of_T1(row) for row in T1]

回答 2

我宁愿只使用理解列表:

[[int(y) for y in x] for x in T1]

I would rather prefer using only comprehension lists:

[[int(y) for y in x] for x in T1]

回答 3

代替put int( ),put float( )可以让您将小数与整数一起使用。

Instead of putting int( ), put float( ) which will let you use decimals along with integers.


回答 4

到目前为止,我都同意所有人的回答,但是问题是,如果您没有所有整数,它们将崩溃。

如果要排除非整数,则

T1 = (('13', '17', '18', '21', '32'),
      ('07', '11', '13', '14', '28'),
      ('01', '05', '06', '08', '15', '16'))
new_list = list(list(int(a) for a in b) for b in T1 if a.isdigit())

这仅产生实际数字。我不使用直接列表推导的原因是因为列表推导会泄漏其内部变量。

I would agree with everyones answers so far but the problem is is that if you do not have all integers they will crash.

If you wanted to exclude non-integers then

T1 = (('13', '17', '18', '21', '32'),
      ('07', '11', '13', '14', '28'),
      ('01', '05', '06', '08', '15', '16'))
new_list = list(list(int(a) for a in b) for b in T1 if a.isdigit())

This yields only actual digits. The reason I don’t use direct list comprehensions is because list comprehension leaks their internal variables.


回答 5

T3=[]

for i in range(0,len(T1)):
    T3.append([])
    for j in range(0,len(T1[i])):
        b=int(T1[i][j])
        T3[i].append(b)

print T3
T3=[]

for i in range(0,len(T1)):
    T3.append([])
    for j in range(0,len(T1[i])):
        b=int(T1[i][j])
        T3[i].append(b)

print T3

回答 6

尝试这个。

x = "1"

x是一个字符串,因为它周围带有引号,但其中带有数字。

x = int(x)

由于x的数字为1,因此我可以将其变成整数。

要查看字符串是否为数字,可以执行此操作。

def is_number(var):
    try:
        if var == int(var):
            return True
    except Exception:
        return False

x = "1"

y = "test"

x_test = is_number(x)

print(x_test)

它应该打印到IDLE True,因为x是一个数字。

y_test = is_number(y)

print(y_test)

它应该打印为IDLE False,因为y中没有数字。

Try this.

x = "1"

x is a string because it has quotes around it, but it has a number in it.

x = int(x)

Since x has the number 1 in it, I can turn it in to a integer.

To see if a string is a number, you can do this.

def is_number(var):
    try:
        if var == int(var):
            return True
    except Exception:
        return False

x = "1"

y = "test"

x_test = is_number(x)

print(x_test)

It should print to IDLE True because x is a number.

y_test = is_number(y)

print(y_test)

It should print to IDLE False because y in not a number.


回答 7

使用列表推导:

t2 = [map(int, list(l)) for l in t1]

Using list comprehensions:

t2 = [map(int, list(l)) for l in t1]

回答 8

在Python 3.5.1中,这些工作如下:

c = input('Enter number:')
print (int(float(c)))
print (round(float(c)))

Enter number:  4.7
4
5

乔治。

In Python 3.5.1 things like these work:

c = input('Enter number:')
print (int(float(c)))
print (round(float(c)))

and

Enter number:  4.7
4
5

George.


回答 9

查看此功能

def parse_int(s):
    try:
        res = int(eval(str(s)))
        if type(res) == int:
            return res
    except:
        return

然后

val = parse_int('10')  # Return 10
val = parse_int('0')  # Return 0
val = parse_int('10.5')  # Return 10
val = parse_int('0.0')  # Return 0
val = parse_int('Ten')  # Return None

您也可以检查

if val == None:  # True if input value can not be converted
    pass  # Note: Don't use 'if not val:'

See this function

def parse_int(s):
    try:
        res = int(eval(str(s)))
        if type(res) == int:
            return res
    except:
        return

Then

val = parse_int('10')  # Return 10
val = parse_int('0')  # Return 0
val = parse_int('10.5')  # Return 10
val = parse_int('0.0')  # Return 0
val = parse_int('Ten')  # Return None

You can also check

if val == None:  # True if input value can not be converted
    pass  # Note: Don't use 'if not val:'

回答 10

适用于Python 2的另一个功能解决方案:

from functools import partial

map(partial(map, int), T1)

不过,Python 3会有些混乱:

list(map(list, map(partial(map, int), T1)))

我们可以用包装纸解决

def oldmap(f, iterable):
    return list(map(f, iterable))

oldmap(partial(oldmap, int), T1)

Yet another functional solution for Python 2:

from functools import partial

map(partial(map, int), T1)

Python 3 will be a little bit messy though:

list(map(list, map(partial(map, int), T1)))

we can fix this with a wrapper

def oldmap(f, iterable):
    return list(map(f, iterable))

oldmap(partial(oldmap, int), T1)

回答 11

如果只是元组的元组,类似 rows=[map(int, row) for row in rows]就可以解决。(在其中有一个列表推导和对map(f,lst)的调用,该调用等于[f in a lst]中的f(a)。)

如果由于某种原因在数据库中有类似的东西,Eval 不是您想要做的__import__("os").unlink("importantsystemfile")。始终验证您的输入(如果没有其他问题,如果输入错误,则会引发int()异常)。

If it’s only a tuple of tuples, something like rows=[map(int, row) for row in rows] will do the trick. (There’s a list comprehension and a call to map(f, lst), which is equal to [f(a) for a in lst], in there.)

Eval is not what you want to do, in case there’s something like __import__("os").unlink("importantsystemfile") in your database for some reason. Always validate your input (if with nothing else, the exception int() will raise if you have bad input).


回答 12

您可以执行以下操作:

T1 = (('13', '17', '18', '21', '32'),  
     ('07', '11', '13', '14', '28'),  
     ('01', '05', '06', '08', '15', '16'))  
new_list = list(list(int(a) for a in b if a.isdigit()) for b in T1)  
print(new_list)  

You can do something like this:

T1 = (('13', '17', '18', '21', '32'),  
     ('07', '11', '13', '14', '28'),  
     ('01', '05', '06', '08', '15', '16'))  
new_list = list(list(int(a) for a in b if a.isdigit()) for b in T1)  
print(new_list)  

回答 13

我想分享一个似乎此处未提及的可用选项:

rumpy.random.permutation(x)

将生成数组x的随机排列。不完全是您的要求,但这是解决类似问题的潜在方法。

I want to share an available option that doesn’t seem to be mentioned here yet:

rumpy.random.permutation(x)

Will generate a random permutation of array x. Not exactly what you asked for, but it is a potential solution to similar questions.


使用pandas GroupBy获取每个组的统计信息(例如计数,均值等)?

问题:使用pandas GroupBy获取每个组的统计信息(例如计数,均值等)?

我有一个数据框,df并且从中使用了几列groupby

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

通过以上方法,我几乎得到了所需的表(数据框)。缺少的是另外一列,其中包含每个组中的行数。换句话说,我有意思,但我也想知道有多少个数字被用来获得这些价值。例如,在第一组中有8个值,在第二组中有10个值,依此类推。

简而言之:如何获取数据框的分组统计信息?

I have a data frame df and I use several columns from it to groupby:

df['col1','col2','col3','col4'].groupby(['col1','col2']).mean()

In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains number of rows in each group. In other words, I have mean but I also would like to know how many number were used to get these means. For example in the first group there are 8 values and in the second one 10 and so on.

In short: How do I get group-wise statistics for a dataframe?


回答 0

groupby对象上,该agg函数可以列出一个列表,以一次应用多种聚合方法。这应该给您您需要的结果:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])

On groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])

回答 1

快速回答:

获取每个组的行数的最简单方法是调用.size(),它返回一个Series

df.groupby(['col1','col2']).size()


通常,您希望此结果为DataFrame(而不是Series),因此您可以执行以下操作:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


如果您想了解如何计算每组的行数和其他统计信息,请继续阅读下面的内容。


详细的例子:

考虑以下示例数据框:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

首先让我们.size()用来获取行数:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

然后让我们使用.size().reset_index(name='counts')来获取行数:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


包括结果以获取更多统计信息

当您要计算分组数据的统计信息时,通常如下所示:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

由于嵌套的列标签,并且行计数是基于每列的,因此上面的结果有点令人讨厌。

为了获得对输出的更多控制权,我通常将统计信息拆分为单独的汇总,然后使用进行合并join。看起来像这样:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



脚注

下面显示了用于生成测试数据的代码:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


免责声明:

如果您要聚合的某些列具有空值,那么您真的希望将组行计数视为每列的独立聚合。否则,您可能会误认为实际上有多少记录用于计算均值之类的东西,因为熊猫会NaN在均值计算中丢弃条目而不会告诉您。

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


If you want to find out how to calculate the row counts and other statistics for each group continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let’s use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let’s use .size().reset_index(name='counts') to get the row counts:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.


回答 2

一种功能统治一切: GroupBy.describe

返回countmeanstd,和其他有用的统计每个组。

df.groupby(['col1', 'col2'])['col3', 'col4'].describe()

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

from IPython.display import display

with pd.option_context('precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

要获取特定的统计信息,只需选择它们,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe适用于多列(更改['C']为(['C', 'D']或完全删除),看看会发生什么,结果是一个MultiIndexed列数据框)。

您还将获得不同的字符串数据统计信息。这是一个例子

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

有关更多信息,请参见文档

One Function to Rule Them All: GroupBy.describe

Returns count, mean, std, and other useful statistics per-group.

df.groupby(['col1', 'col2'])['col3', 'col4'].describe()

# Setup
np.random.seed(0)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

from IPython.display import display

with pd.option_context('precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B                                                     
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87

To get specific statistics, just select them,

df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B                     
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148

describe works for multiple columns (change ['C'] to ['C', 'D']—or remove it altogether—and see what happens, the result is a MultiIndexed columned dataframe).

You also get different statistics for string data. Here’s an example,

df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)

with pd.option_context('precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D                
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B                                                                              
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15

For more information, see the documentation.


回答 3

我们可以使用groupby和count轻松地做到这一点。但是,我们应该记住使用reset_index()。

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

We can easily do it by using groupby and count. But, we should remember to use reset_index().

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()

回答 4

要获取多个统计信息,请折叠索引并保留列名:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

生成:

**在此处输入图片说明**

To get multiple stats, collapse the index, and retain column names:

df = df.groupby(['col1','col2']).agg(['mean', 'count'])
df.columns = [ ' '.join(str(i) for i in col) for col in df.columns]
df.reset_index(inplace=True)
df

Produces:

**enter image description here**


回答 5

创建一个组对象并调用如下示例所示的方法:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

Create a group object and call methods like below example:

grp = df.groupby(['col1',  'col2',  'col3']) 

grp.max() 
grp.mean() 
grp.describe() 

回答 6

请尝试此代码

new_column=df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).count()
df['count_it']=new_column
df

我认为该代码将添加一个名为“ count it”的列,每个列的计数

Please try this code

new_column=df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).count()
df['count_it']=new_column
df

I think that code will add a column called ‘count it’ which count of each group


在Python中使用try-except-else是否是一种好习惯?

问题:在Python中使用try-except-else是否是一种好习惯?

在Python中,我不时看到该块:

try:
   try_this(whatever)
except SomeException as exception:
   #Handle exception
else:
   return something

try-except-else存在的原因是什么?

我不喜欢这种编程,因为它使用异常来执行流控制。但是,如果它包含在语言中,则一定有充分的理由,不是吗?

据我了解,异常不是错误,并且仅应将其用于特殊情况(例如,我尝试将文件写入磁盘,并且没有更多空间,或者我没有权限),而不是流控制。

通常,我将异常处理为:

something = some_default_value
try:
    something = try_this(whatever)
except SomeException as exception:
    #Handle exception
finally:
    return something

或者,如果发生异常,我真的不想返回任何东西,那么:

try:
    something = try_this(whatever)
    return something
except SomeException as exception:
    #Handle exception

From time to time in Python, I see the block:

try:
   try_this(whatever)
except SomeException as exception:
   #Handle exception
else:
   return something

What is the reason for the try-except-else to exist?

I do not like that kind of programming, as it is using exceptions to perform flow control. However, if it is included in the language, there must be a good reason for it, isn’t it?

It is my understanding that exceptions are not errors, and that they should only be used for exceptional conditions (e.g. I try to write a file into disk and there is no more space, or maybe I do not have permission), and not for flow control.

Normally I handle exceptions as:

something = some_default_value
try:
    something = try_this(whatever)
except SomeException as exception:
    #Handle exception
finally:
    return something

Or if I really do not want to return anything if an exception happens, then:

try:
    something = try_this(whatever)
    return something
except SomeException as exception:
    #Handle exception

回答 0

“我不知道它是否出于无知,但我不喜欢这种编程,因为它使用异常来执行流控制。”

在Python世界中,使用异常进行流控制是常见且正常的。

甚至Python核心开发人员也将异常用于流控制,并且该样式已在语言中大量使用(即,迭代器协议使用StopIteration发出信号以终止循环)。

此外,try-except样式用于防止某些“跨越式”构造固有的竞争条件。例如,测试os.path.exists会导致信息在您使用时已过时。同样,Queue.full返回的信息可能已过时。在这种情况下,try-except-else样式将产生更可靠的代码。

“据我了解,异常不是错误,它们仅应用于特殊情况”

在其他一些语言中,该规则反映了图书馆所反映的文化规范。该“规则”还部分基于这些语言的性能考虑。

Python的文化规范有些不同。在许多情况下,必须对控制流使用exceptions。另外,在Python中使用异常不会像在某些编译语言中那样降低周围的代码和调用代码的速度(即CPython已经在每一步实现了用于异常检查的代码,而不管您是否实际使用异常)。

换句话说,您理解“exceptions是为了exceptions”是一条在其他语言中有意义的规则,但不适用于Python。

“但是,如果它本身包含在语言中,那一定有充分的理由,不是吗?”

除了帮助避免竞争条件外,异常对于在循环外拉出错误处理也非常有用。这是解释语言中的必要优化,这些语言通常不会具有自动循环不变的代码运动

另外,在通常情况下,异常可以大大简化代码,在正常情况下,处理问题的能力与问题发生的地方相距甚远。例如,通常有用于业务逻辑的顶级用户界面代码调用代码,而后者又调用低级例程。低级例程中出现的情况(例如数据库访问中唯一键的重复记录)只能以顶级代码处理(例如,要求用户提供与现有键不冲突的新键)。对此类控制流使用异常可以使中级例程完全忽略该问题,并将其与流控制的这一方面很好地分离。

这里有一篇关于异常必不可少的不错的博客文章

另外,请参见此堆栈溢出答案:异常真的是异常错误吗?

“ try-except-else存在的原因是什么?”

其他条款本身很有趣。它在没有exceptions的情况下运行,但是在最终条款之前。这是其主要目的。

如果没有else子句,那么在最终确定之前运行其他代码的唯一选择就是将代码添加到try子句的笨拙做法。这很笨拙,因为它冒着在代码中引发异常的危险,而这些异常本来不会受到try块的保护。

在完成之前运行其他不受保护的代码的用例很少出现。因此,不要期望在已发布的代码中看到很多示例。这有点罕见。

else子句的另一个用例是执行在没有异常发生时必须发生的动作以及在处理异常时不发生的动作。例如:

recip = float('Inf')
try:
    recip = 1 / f(x)
except ZeroDivisionError:
    logging.info('Infinite result')
else:
    logging.info('Finite result')

另一个示例发生在单元测试赛跑者中:

try:
    tests_run += 1
    run_testcase(case)
except Exception:
    tests_failed += 1
    logging.exception('Failing test case: %r', case)
    print('F', end='')
else:
    logging.info('Successful test case: %r', case)
    print('.', end='')

最后,在尝试块中最常用的else子句是为了美化一些(在相同的缩进级别上对齐exceptions结果和非exceptions结果)。此用法始终是可选的,并非严格必要。

“I do not know if it is out of ignorance, but I do not like that kind of programming, as it is using exceptions to perform flow control.”

In the Python world, using exceptions for flow control is common and normal.

Even the Python core developers use exceptions for flow-control and that style is heavily baked into the language (i.e. the iterator protocol uses StopIteration to signal loop termination).

In addition, the try-except-style is used to prevent the race-conditions inherent in some of the “look-before-you-leap” constructs. For example, testing os.path.exists results in information that may be out-of-date by the time you use it. Likewise, Queue.full returns information that may be stale. The try-except-else style will produce more reliable code in these cases.

“It my understanding that exceptions are not errors, they should only be used for exceptional conditions”

In some other languages, that rule reflects their cultural norms as reflected in their libraries. The “rule” is also based in-part on performance considerations for those languages.

The Python cultural norm is somewhat different. In many cases, you must use exceptions for control-flow. Also, the use of exceptions in Python does not slow the surrounding code and calling code as it does in some compiled languages (i.e. CPython already implements code for exception checking at every step, regardless of whether you actually use exceptions or not).

In other words, your understanding that “exceptions are for the exceptional” is a rule that makes sense in some other languages, but not for Python.

“However, if it is included in the language itself, there must be a good reason for it, isn’t it?”

Besides helping to avoid race-conditions, exceptions are also very useful for pulling error-handling outside loops. This is a necessary optimization in interpreted languages which do not tend to have automatic loop invariant code motion.

Also, exceptions can simplify code quite a bit in common situations where the ability to handle an issue is far removed from where the issue arose. For example, it is common to have top level user-interface code calling code for business logic which in turn calls low-level routines. Situations arising in the low-level routines (such as duplicate records for unique keys in database accesses) can only be handled in top-level code (such as asking the user for a new key that doesn’t conflict with existing keys). The use of exceptions for this kind of control-flow allows the mid-level routines to completely ignore the issue and be nicely decoupled from that aspect of flow-control.

There is a nice blog post on the indispensibility of exceptions here.

Also, see this Stack Overflow answer: Are exceptions really for exceptional errors?

“What is the reason for the try-except-else to exist?”

The else-clause itself is interesting. It runs when there is no exception but before the finally-clause. That is its primary purpose.

Without the else-clause, the only option to run additional code before finalization would be the clumsy practice of adding the code to the try-clause. That is clumsy because it risks raising exceptions in code that wasn’t intended to be protected by the try-block.

The use-case of running additional unprotected code prior to finalization doesn’t arise very often. So, don’t expect to see many examples in published code. It is somewhat rare.

Another use-case for the else-clause is to perform actions that must occur when no exception occurs and that do not occur when exceptions are handled. For example:

recip = float('Inf')
try:
    recip = 1 / f(x)
except ZeroDivisionError:
    logging.info('Infinite result')
else:
    logging.info('Finite result')

Another example occurs in unittest runners:

try:
    tests_run += 1
    run_testcase(case)
except Exception:
    tests_failed += 1
    logging.exception('Failing test case: %r', case)
    print('F', end='')
else:
    logging.info('Successful test case: %r', case)
    print('.', end='')

Lastly, the most common use of an else-clause in a try-block is for a bit of beautification (aligning the exceptional outcomes and non-exceptional outcomes at the same level of indentation). This use is always optional and isn’t strictly necessary.


回答 1

try-except-else存在的原因是什么?

一个try块可以处理预期的错误。该except块应该只捕获您准备处理的异常。如果您处理了意外错误,则您的代码可能会做错事情并隐藏错误。

else如果没有错误,将执行一个子句,通过不执行该代码try块中的代码,可以避免捕获意外错误。同样,捕获意外错误可能会隐藏错误。

例如:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
else:
    return something

“ try,except”套件有两个可选子句,elsefinally。所以实际上是try-except-else-finally

else仅在try块中没有异常的情况下才会评估。它使我们能够简化下面更复杂的代码:

no_error = None
try:
    try_this(whatever)
    no_error = True
except SomeException as the_exception:
    handle(the_exception)
if no_error:
    return something

因此,如果将a else与替代方案(可能会产生错误)进行比较,我们会发现它减少了代码行,并且我们可以拥有更具可读性,可维护性和更少错误的代码库。

finally

finally 即使使用return语句对另一行进行评估,它也将执行。

用伪代码分解

这可能有助于以尽可能小的形式展示所有功能并带有注释来分解此内容。假定此语法在语法上正确(但除非定义名称,否则不可运行)伪代码在函数中。

例如:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle_SomeException(the_exception)
    # Handle a instance of SomeException or a subclass of it.
except Exception as the_exception:
    generic_handle(the_exception)
    # Handle any other exception that inherits from Exception
    # - doesn't include GeneratorExit, KeyboardInterrupt, SystemExit
    # Avoid bare `except:`
else: # there was no exception whatsoever
    return something()
    # if no exception, the "something()" gets evaluated,
    # but the return will not be executed due to the return in the
    # finally block below.
finally:
    # this block will execute no matter what, even if no exception,
    # after "something" is eval'd but before that value is returned
    # but even if there is an exception.
    # a return here will hijack the return functionality. e.g.:
    return True # hijacks the return in the else clause above

的确,我们可以将代码包含在else块中的代码中try,如果没有异常,它将在其中运行,但是如果该代码本身引发了我们正在捕获的异常,该怎么办?将其留在try块中将隐藏该错误。

我们希望最小化try块中的代码行,以避免捕获我们未曾想到的异常,其原理是,如果我们的代码失败,我们希望它大声失败。这是最佳做法

据我了解,异常不是错误

在Python中,大多数exceptions都是错误。

我们可以使用pydoc查看异常层次结构。例如,在Python 2中:

$ python -m pydoc exceptions

或Python 3:

$ python -m pydoc builtins

将给我们层次结构。我们可以看到大多数Exception错误都是错误的,尽管Python使用其中的一些错误来结束for循环(StopIteration)。这是Python 3的层次结构:

BaseException
    Exception
        ArithmeticError
            FloatingPointError
            OverflowError
            ZeroDivisionError
        AssertionError
        AttributeError
        BufferError
        EOFError
        ImportError
            ModuleNotFoundError
        LookupError
            IndexError
            KeyError
        MemoryError
        NameError
            UnboundLocalError
        OSError
            BlockingIOError
            ChildProcessError
            ConnectionError
                BrokenPipeError
                ConnectionAbortedError
                ConnectionRefusedError
                ConnectionResetError
            FileExistsError
            FileNotFoundError
            InterruptedError
            IsADirectoryError
            NotADirectoryError
            PermissionError
            ProcessLookupError
            TimeoutError
        ReferenceError
        RuntimeError
            NotImplementedError
            RecursionError
        StopAsyncIteration
        StopIteration
        SyntaxError
            IndentationError
                TabError
        SystemError
        TypeError
        ValueError
            UnicodeError
                UnicodeDecodeError
                UnicodeEncodeError
                UnicodeTranslateError
        Warning
            BytesWarning
            DeprecationWarning
            FutureWarning
            ImportWarning
            PendingDeprecationWarning
            ResourceWarning
            RuntimeWarning
            SyntaxWarning
            UnicodeWarning
            UserWarning
    GeneratorExit
    KeyboardInterrupt
    SystemExit

有评论者问:

假设您有一个可对外部API进行ping的方法,并且想在API包装器之外的类上处理异常,那么您是否只是从方法中的except子句中返回e,其中e是异常对象?

不,您不返回该异常,只需将其重新引发raise以保留堆栈跟踪即可。

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    raise

或者,在Python 3中,您可以引发新的异常并通过异常链接保留回溯:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    raise DifferentException from the_exception

我在这里详细回答

What is the reason for the try-except-else to exist?

A try block allows you to handle an expected error. The except block should only catch exceptions you are prepared to handle. If you handle an unexpected error, your code may do the wrong thing and hide bugs.

An else clause will execute if there were no errors, and by not executing that code in the try block, you avoid catching an unexpected error. Again, catching an unexpected error can hide bugs.

Example

For example:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
else:
    return something

The “try, except” suite has two optional clauses, else and finally. So it’s actually try-except-else-finally.

else will evaluate only if there is no exception from the try block. It allows us to simplify the more complicated code below:

no_error = None
try:
    try_this(whatever)
    no_error = True
except SomeException as the_exception:
    handle(the_exception)
if no_error:
    return something

so if we compare an else to the alternative (which might create bugs) we see that it reduces the lines of code and we can have a more readable, maintainable, and less buggy code-base.

finally

finally will execute no matter what, even if another line is being evaluated with a return statement.

Broken down with pseudo-code

It might help to break this down, in the smallest possible form that demonstrates all features, with comments. Assume this syntactically correct (but not runnable unless the names are defined) pseudo-code is in a function.

For example:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle_SomeException(the_exception)
    # Handle a instance of SomeException or a subclass of it.
except Exception as the_exception:
    generic_handle(the_exception)
    # Handle any other exception that inherits from Exception
    # - doesn't include GeneratorExit, KeyboardInterrupt, SystemExit
    # Avoid bare `except:`
else: # there was no exception whatsoever
    return something()
    # if no exception, the "something()" gets evaluated,
    # but the return will not be executed due to the return in the
    # finally block below.
finally:
    # this block will execute no matter what, even if no exception,
    # after "something" is eval'd but before that value is returned
    # but even if there is an exception.
    # a return here will hijack the return functionality. e.g.:
    return True # hijacks the return in the else clause above

It is true that we could include the code in the else block in the try block instead, where it would run if there were no exceptions, but what if that code itself raises an exception of the kind we’re catching? Leaving it in the try block would hide that bug.

We want to minimize lines of code in the try block to avoid catching exceptions we did not expect, under the principle that if our code fails, we want it to fail loudly. This is a best practice.

It is my understanding that exceptions are not errors

In Python, most exceptions are errors.

We can view the exception hierarchy by using pydoc. For example, in Python 2:

$ python -m pydoc exceptions

or Python 3:

$ python -m pydoc builtins

Will give us the hierarchy. We can see that most kinds of Exception are errors, although Python uses some of them for things like ending for loops (StopIteration). This is Python 3’s hierarchy:

BaseException
    Exception
        ArithmeticError
            FloatingPointError
            OverflowError
            ZeroDivisionError
        AssertionError
        AttributeError
        BufferError
        EOFError
        ImportError
            ModuleNotFoundError
        LookupError
            IndexError
            KeyError
        MemoryError
        NameError
            UnboundLocalError
        OSError
            BlockingIOError
            ChildProcessError
            ConnectionError
                BrokenPipeError
                ConnectionAbortedError
                ConnectionRefusedError
                ConnectionResetError
            FileExistsError
            FileNotFoundError
            InterruptedError
            IsADirectoryError
            NotADirectoryError
            PermissionError
            ProcessLookupError
            TimeoutError
        ReferenceError
        RuntimeError
            NotImplementedError
            RecursionError
        StopAsyncIteration
        StopIteration
        SyntaxError
            IndentationError
                TabError
        SystemError
        TypeError
        ValueError
            UnicodeError
                UnicodeDecodeError
                UnicodeEncodeError
                UnicodeTranslateError
        Warning
            BytesWarning
            DeprecationWarning
            FutureWarning
            ImportWarning
            PendingDeprecationWarning
            ResourceWarning
            RuntimeWarning
            SyntaxWarning
            UnicodeWarning
            UserWarning
    GeneratorExit
    KeyboardInterrupt
    SystemExit

A commenter asked:

Say you have a method which pings an external API and you want to handle the exception at a class outside the API wrapper, do you simply return e from the method under the except clause where e is the exception object?

No, you don’t return the exception, just reraise it with a bare raise to preserve the stacktrace.

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    raise

Or, in Python 3, you can raise a new exception and preserve the backtrace with exception chaining:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    raise DifferentException from the_exception

I elaborate in my answer here.


回答 2

Python不赞成将异常仅用于特殊情况的想法,实际上,习惯用法是“要求宽恕,而不是允许”。这意味着将异常作为流程控制的常规部分是完全可以接受的,并且实际上是受到鼓励的。

通常,这是一件好事,因为以这种方式工作有助于避免某些问题(显而易见的示例是,通常避免出现竞争条件),并且它倾向于使代码更具可读性。

假设您遇到这样一种情况,您需要处理一些用户输入,但是已经处理了默认输入。该try: ... except: ... else: ...结构使代码易于阅读:

try:
   raw_value = int(input())
except ValueError:
   value = some_processed_value
else: # no error occured
   value = process_value(raw_value)

与其他语言可能的工作方式进行比较:

raw_value = input()
if valid_number(raw_value):
    value = process_value(int(raw_value))
else:
    value = some_processed_value

注意优点。无需检查该值是否有效并单独对其进行分析,只需完成一次即可。代码也遵循更合理的顺序,首先是主代码路径,然后是“如果不起作用,请执行此操作”。

该示例自然有点虚构,但它显示了这种结构的情况。

Python doesn’t subscribe to the idea that exceptions should only be used for exceptional cases, in fact the idiom is ‘ask for forgiveness, not permission’. This means that using exceptions as a routine part of your flow control is perfectly acceptable, and in fact, encouraged.

This is generally a good thing, as working this way helps avoid some issues (as an obvious example, race conditions are often avoided), and it tends to make code a little more readable.

Imagine you have a situation where you take some user input which needs to be processed, but have a default which is already processed. The try: ... except: ... else: ... structure makes for very readable code:

try:
   raw_value = int(input())
except ValueError:
   value = some_processed_value
else: # no error occured
   value = process_value(raw_value)

Compare to how it might work in other languages:

raw_value = input()
if valid_number(raw_value):
    value = process_value(int(raw_value))
else:
    value = some_processed_value

Note the advantages. There is no need to check the value is valid and parse it separately, they are done once. The code also follows a more logical progression, the main code path is first, followed by ‘if it doesn’t work, do this’.

The example is naturally a little contrived, but it shows there are cases for this structure.


回答 3

在python中使用try-except-else是否是一种好习惯?

答案是它取决于上下文。如果您这样做:

d = dict()
try:
    item = d['item']
except KeyError:
    item = 'default'

它表明您不太了解Python。此功能封装在dict.get方法中:

item = d.get('item', 'default')

try/ except块是写什么都可以有效地在一行用原子方法执行的视觉上更多混乱和冗长的方式。在其他情况下,这是正确的。

但是,这并不意味着我们应该避免所有异常处理。在某些情况下,最好避免比赛条件。不要检查文件是否存在,只需尝试将其打开,然后捕获相应的IOError。为了简单起见,请尝试将其封装或分解为适当的名称。

阅读PythonZen,了解其中存在一些紧绷的原则,并且要警惕过于依赖其中任何一条语句的教条。

Is it a good practice to use try-except-else in python?

The answer to this is that it is context dependent. If you do this:

d = dict()
try:
    item = d['item']
except KeyError:
    item = 'default'

It demonstrates that you don’t know Python very well. This functionality is encapsulated in the dict.get method:

item = d.get('item', 'default')

The try/except block is a much more visually cluttered and verbose way of writing what can be efficiently executing in a single line with an atomic method. There are other cases where this is true.

However, that does not mean that we should avoid all exception handling. In some cases it is preferred to avoid race conditions. Don’t check if a file exists, just attempt to open it, and catch the appropriate IOError. For the sake of simplicity and readability, try to encapsulate this or factor it out as apropos.

Read the Zen of Python, understanding that there are principles that are in tension, and be wary of dogma that relies too heavily on any one of the statements in it.


回答 4

请参见以下示例,该示例说明了有关try-except-else-finally的所有信息:

for i in range(3):
    try:
        y = 1 / i
    except ZeroDivisionError:
        print(f"\ti = {i}")
        print("\tError report: ZeroDivisionError")
    else:
        print(f"\ti = {i}")
        print(f"\tNo error report and y equals {y}")
    finally:
        print("Try block is run.")

实施它并获得:

    i = 0
    Error report: ZeroDivisionError
Try block is run.
    i = 1
    No error report and y equals 1.0
Try block is run.
    i = 2
    No error report and y equals 0.5
Try block is run.

See the following example which illustrate everything about try-except-else-finally:

for i in range(3):
    try:
        y = 1 / i
    except ZeroDivisionError:
        print(f"\ti = {i}")
        print("\tError report: ZeroDivisionError")
    else:
        print(f"\ti = {i}")
        print(f"\tNo error report and y equals {y}")
    finally:
        print("Try block is run.")

Implement it and come by:

    i = 0
    Error report: ZeroDivisionError
Try block is run.
    i = 1
    No error report and y equals 1.0
Try block is run.
    i = 2
    No error report and y equals 0.5
Try block is run.

回答 5

您应谨慎使用finally块,因为它与try中使用else块的功能不同,除了。无论try的结果如何,都将运行finally块。

In [10]: dict_ = {"a": 1}

In [11]: try:
   ....:     dict_["b"]
   ....: except KeyError:
   ....:     pass
   ....: finally:
   ....:     print "something"
   ....:     
something

正如所有人都指出的那样,使用else块会使您的代码更具可读性,并且仅在未引发异常时运行

In [14]: try:
             dict_["b"]
         except KeyError:
             pass
         else:
             print "something"
   ....:

You should be careful about using the finally block, as it is not the same thing as using an else block in the try, except. The finally block will be run regardless of the outcome of the try except.

In [10]: dict_ = {"a": 1}

In [11]: try:
   ....:     dict_["b"]
   ....: except KeyError:
   ....:     pass
   ....: finally:
   ....:     print "something"
   ....:     
something

As everyone has noted using the else block causes your code to be more readable, and only runs when an exception is not thrown

In [14]: try:
             dict_["b"]
         except KeyError:
             pass
         else:
             print "something"
   ....:

回答 6

每当您看到以下内容时:

try:
    y = 1 / x
except ZeroDivisionError:
    pass
else:
    return y

甚至这个:

try:
    return 1 / x
except ZeroDivisionError:
    return None

考虑一下这个:

import contextlib
with contextlib.suppress(ZeroDivisionError):
    return 1 / x

Whenever you see this:

try:
    y = 1 / x
except ZeroDivisionError:
    pass
else:
    return y

Or even this:

try:
    return 1 / x
except ZeroDivisionError:
    return None

Consider this instead:

import contextlib
with contextlib.suppress(ZeroDivisionError):
    return 1 / x

回答 7

只是因为没有人发表过这一意见,我会说

避免使用else条款,因为大多数人都不熟悉这些条款try/excepts

与关键字tryexcept和和不同finally,该else子句的含义不言而喻。它的可读性较差。因为它不经常使用,所以它将导致阅读您的代码的人想要仔细检查文档,以确保他们了解正在发生的事情。

(我之所以写此答案,恰恰是因为我try/except/else在代码库中找到了a ,它导致了wtf时刻并迫使我进行了谷歌搜索)。

因此,无论我在哪里看到类似OP示例的代码:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
else:
    # do some more processing in non-exception case
    return something

我宁愿重构为

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    return  # <1>
# do some more processing in non-exception case  <2>
return something
  • <1>明确的回报清楚地表明,在exceptions情况下,我们已经完成工作

  • <2>作为一个很好的次要副作用,该else块中的代码以前经过了一个级别的确定。

Just because no-one else has posted this opinion, I would say

avoid else clauses in try/excepts because they’re unfamiliar to most people

Unlike the keywords try, except, and finally, the meaning of the else clause isn’t self-evident; it’s less readable. Because it’s not used very often, it’ll cause people that read your code to want to double-check the docs to be sure they understand what’s going on.

(I’m writing this answer precisely because I found a try/except/else in my codebase and it caused a wtf moment and forced me to do some googling).

So, wherever I see code like the OP example:

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
else:
    # do some more processing in non-exception case
    return something

I would prefer to refactor to

try:
    try_this(whatever)
except SomeException as the_exception:
    handle(the_exception)
    return  # <1>
# do some more processing in non-exception case  <2>
return something
  • <1> explicit return, clearly shows that, in the exception case, we are finished working

  • <2> as a nice minor side-effect, the code that used to be in the else block is dedented by one level.


回答 8

这是我关于如何理解Python中try-except-else-finally块的简单代码段:

def div(a, b):
    try:
        a/b
    except ZeroDivisionError:
        print("Zero Division Error detected")
    else:
        print("No Zero Division Error")
    finally:
        print("Finally the division of %d/%d is done" % (a, b))

让我们尝试div 1/1:

div(1, 1)
No Zero Division Error
Finally the division of 1/1 is done

让我们尝试div 1/0

div(1, 0)
Zero Division Error detected
Finally the division of 1/0 is done

This is my simple snippet on howto understand try-except-else-finally block in Python:

def div(a, b):
    try:
        a/b
    except ZeroDivisionError:
        print("Zero Division Error detected")
    else:
        print("No Zero Division Error")
    finally:
        print("Finally the division of %d/%d is done" % (a, b))

Let’s try div 1/1:

div(1, 1)
No Zero Division Error
Finally the division of 1/1 is done

Let’s try div 1/0

div(1, 0)
Zero Division Error detected
Finally the division of 1/0 is done

回答 9

OP,您是正确的。 在Python中try / except之后的else很难看。它导致另一个不需要的流控制对象:

try:
    x = blah()
except:
    print "failed at blah()"
else:
    print "just succeeded with blah"

完全清楚的等效项是:

try:
    x = blah()
    print "just succeeded with blah"
except:
    print "failed at blah()"

这比else子句清楚得多。try / except之后的else不经常编写,因此花一点时间弄清楚其含义是什么。

仅仅因为您可以做某事,并不意味着您应该做某事。

语言已经添加了许多功能,因为有人认为它可能派上用场。麻烦的是,功能越多,事物的清晰度和显而易见性就越差,这是因为人们通常不使用那些钟声和口哨声。

这里只有我的5美分。我必须走到后面,清理掉大学一年级开发人员写的很多代码,这些开发人员认为他们很聪明,并希望以超级严格,超级高效的方式编写代码,这只会使事情变得一团糟以尝试稍后阅读/修改。我每天对可读性进行投票,而星期日则两次。

OP, YOU ARE CORRECT. The else after try/except in Python is ugly. it leads to another flow-control object where none is needed:

try:
    x = blah()
except:
    print "failed at blah()"
else:
    print "just succeeded with blah"

A totally clear equivalent is:

try:
    x = blah()
    print "just succeeded with blah"
except:
    print "failed at blah()"

This is far clearer than an else clause. The else after try/except is not frequently written, so it takes a moment to figure what the implications are.

Just because you CAN do a thing, doesn’t mean you SHOULD do a thing.

Lots of features have been added to languages because someone thought it might come in handy. Trouble is, the more features, the less clear and obvious things are because people don’t usually use those bells and whistles.

Just my 5 cents here. I have to come along behind and clean up a lot of code written by 1st-year out of college developers who think they’re smart and want to write code in some uber-tight, uber-efficient way when that just makes it a mess to try and read / modify later. I vote for readability every day and twice on Sundays.


如何使用PIL调整图像大小并保持其纵横比?

问题:如何使用PIL调整图像大小并保持其纵横比?

有什么明显的方法可以实现我所缺少的吗?我只是想制作缩略图。

Is there an obvious way to do this that I’m missing? I’m just trying to make thumbnails.


回答 0

定义最大大小。然后,通过计算调整大小比例min(maxwidth/width, maxheight/height)

适当的大小是oldsize*ratio

当然,还有一个库方法可以做到这一点:method Image.thumbnail
以下是PIL文档中的一个(经过编辑的)示例。

import os, sys
import Image

size = 128, 128

for infile in sys.argv[1:]:
    outfile = os.path.splitext(infile)[0] + ".thumbnail"
    if infile != outfile:
        try:
            im = Image.open(infile)
            im.thumbnail(size, Image.ANTIALIAS)
            im.save(outfile, "JPEG")
        except IOError:
            print "cannot create thumbnail for '%s'" % infile

Define a maximum size. Then, compute a resize ratio by taking min(maxwidth/width, maxheight/height).

The proper size is oldsize*ratio.

There is of course also a library method to do this: the method Image.thumbnail.
Below is an (edited) example from the PIL documentation.

import os, sys
import Image

size = 128, 128

for infile in sys.argv[1:]:
    outfile = os.path.splitext(infile)[0] + ".thumbnail"
    if infile != outfile:
        try:
            im = Image.open(infile)
            im.thumbnail(size, Image.ANTIALIAS)
            im.save(outfile, "JPEG")
        except IOError:
            print "cannot create thumbnail for '%s'" % infile

回答 1

该脚本将使用PIL(Python影像库)将图像(somepic.jpg)调整为300像素的宽度,并且高度与新宽度成比例。它通过确定原始宽度(img.size [0])的300个像素的百分比,然后将原始高度(img.size [1])乘以该百分比,来实现此目的。将“ basewidth”更改为任何其他数字以更改图像的默认宽度。

from PIL import Image

basewidth = 300
img = Image.open('somepic.jpg')
wpercent = (basewidth/float(img.size[0]))
hsize = int((float(img.size[1])*float(wpercent)))
img = img.resize((basewidth,hsize), Image.ANTIALIAS)
img.save('sompic.jpg') 

This script will resize an image (somepic.jpg) using PIL (Python Imaging Library) to a width of 300 pixels and a height proportional to the new width. It does this by determining what percentage 300 pixels is of the original width (img.size[0]) and then multiplying the original height (img.size[1]) by that percentage. Change “basewidth” to any other number to change the default width of your images.

from PIL import Image

basewidth = 300
img = Image.open('somepic.jpg')
wpercent = (basewidth/float(img.size[0]))
hsize = int((float(img.size[1])*float(wpercent)))
img = img.resize((basewidth,hsize), Image.ANTIALIAS)
img.save('sompic.jpg') 

回答 2

我还建议使用PIL的缩略图方法,因为它可以消除您的所有比率麻烦。

不过,有一个重要提示:替换

im.thumbnail(size)

im.thumbnail(size,Image.ANTIALIAS)

默认情况下,PIL使用Image.NEAREST过滤器来调整大小,这会导致性能良好但质量较差。

I also recommend using PIL’s thumbnail method, because it removes all the ratio hassles from you.

One important hint, though: Replace

im.thumbnail(size)

with

im.thumbnail(size,Image.ANTIALIAS)

by default, PIL uses the Image.NEAREST filter for resizing which results in good performance, but poor quality.


回答 3

基于@tomvon,我完成了以下操作(选择您的情况):

a)调整高度我知道新的宽度,所以我需要新的高度

new_width  = 680
new_height = new_width * height / width 

b)调整宽度我知道新的高度,所以我需要新的宽度

new_height = 680
new_width  = new_height * width / height

然后:

img = img.resize((new_width, new_height), Image.ANTIALIAS)

Based in @tomvon, I finished using the following (pick your case):

a) Resizing height (I know the new width, so I need the new height)

new_width  = 680
new_height = new_width * height / width 

b) Resizing width (I know the new height, so I need the new width)

new_height = 680
new_width  = new_height * width / height

Then just:

img = img.resize((new_width, new_height), Image.ANTIALIAS)

回答 4

PIL已经可以选择裁剪图像

img = ImageOps.fit(img, size, Image.ANTIALIAS)

PIL already has the option to crop an image

img = ImageOps.fit(img, size, Image.ANTIALIAS)

回答 5

from PIL import Image

img = Image.open('/your image path/image.jpg') # image extension *.png,*.jpg
new_width  = 200
new_height = 300
img = img.resize((new_width, new_height), Image.ANTIALIAS)
img.save('output image name.png') # format may what you want *.png, *jpg, *.gif
from PIL import Image

img = Image.open('/your image path/image.jpg') # image extension *.png,*.jpg
new_width  = 200
new_height = 300
img = img.resize((new_width, new_height), Image.ANTIALIAS)
img.save('output image name.png') # format may what you want *.png, *jpg, *.gif

回答 6

如果您尝试保持相同的宽高比,那么是否不按原始尺寸的某个百分比调整尺寸?

例如,原始尺寸的一半

half = 0.5
out = im.resize( [int(half * s) for s in im.size] )

If you are trying to maintain the same aspect ratio, then wouldn’t you resize by some percentage of the original size?

For example, half the original size

half = 0.5
out = im.resize( [int(half * s) for s in im.size] )

回答 7

from PIL import Image
from resizeimage import resizeimage

def resize_file(in_file, out_file, size):
    with open(in_file) as fd:
        image = resizeimage.resize_thumbnail(Image.open(fd), size)
    image.save(out_file)
    image.close()

resize_file('foo.tif', 'foo_small.jpg', (256, 256))

我使用这个库:

pip install python-resize-image
from PIL import Image
from resizeimage import resizeimage

def resize_file(in_file, out_file, size):
    with open(in_file) as fd:
        image = resizeimage.resize_thumbnail(Image.open(fd), size)
    image.save(out_file)
    image.close()

resize_file('foo.tif', 'foo_small.jpg', (256, 256))

I use this library:

pip install python-resize-image

回答 8

如果您不希望/不需要使用枕头打开图像,请使用以下命令:

from PIL import Image

new_img_arr = numpy.array(Image.fromarray(img_arr).resize((new_width, new_height), Image.ANTIALIAS))

If you don’t want / don’t have a need to open image with Pillow, use this:

from PIL import Image

new_img_arr = numpy.array(Image.fromarray(img_arr).resize((new_width, new_height), Image.ANTIALIAS))

回答 9

只需使用更现代的包装器更新此问题,该库即可包装Pillow(PIL的一个分支) https://pypi.org/project/python-resize-image/

允许你做这样的事情:

from PIL import Image
from resizeimage import resizeimage

fd_img = open('test-image.jpeg', 'r')
img = Image.open(fd_img)
img = resizeimage.resize_width(img, 200)
img.save('test-image-width.jpeg', img.format)
fd_img.close()

在上面的链接中堆了更多示例。

Just updating this question with a more modern wrapper This library wraps Pillow (a fork of PIL) https://pypi.org/project/python-resize-image/

Allowing you to do something like this :-

from PIL import Image
from resizeimage import resizeimage

fd_img = open('test-image.jpeg', 'r')
img = Image.open(fd_img)
img = resizeimage.resize_width(img, 200)
img.save('test-image-width.jpeg', img.format)
fd_img.close()

Heaps more examples in the above link.


回答 10

我试图调整幻灯片视频的某些图像的大小,因此,我不仅要一个最大尺寸,而且要一个最大宽度一个最大高度(视频帧的大小)。
总是有可能拍摄人像视频…
Image.thumbnail方法很有前途,但我无法将其放大到较小的图像。

因此,在找不到在此处(或其他位置)执行此操作的明显方法之后,我编写了此函数并将其放在此处以供以后使用:

from PIL import Image

def get_resized_img(img_path, video_size):
    img = Image.open(img_path)
    width, height = video_size  # these are the MAX dimensions
    video_ratio = width / height
    img_ratio = img.size[0] / img.size[1]
    if video_ratio >= 1:  # the video is wide
        if img_ratio <= video_ratio:  # image is not wide enough
            width_new = int(height * img_ratio)
            size_new = width_new, height
        else:  # image is wider than video
            height_new = int(width / img_ratio)
            size_new = width, height_new
    else:  # the video is tall
        if img_ratio >= video_ratio:  # image is not tall enough
            height_new = int(width / img_ratio)
            size_new = width, height_new
        else:  # image is taller than video
            width_new = int(height * img_ratio)
            size_new = width_new, height
    return img.resize(size_new, resample=Image.LANCZOS)

I was trying to resize some images for a slideshow video and because of that, I wanted not just one max dimension, but a max width and a max height (the size of the video frame).
And there was always the possibility of a portrait video…
The Image.thumbnail method was promising, but I could not make it upscale a smaller image.

So after I couldn’t find an obvious way to do that here (or at some other places), I wrote this function and put it here for the ones to come:

from PIL import Image

def get_resized_img(img_path, video_size):
    img = Image.open(img_path)
    width, height = video_size  # these are the MAX dimensions
    video_ratio = width / height
    img_ratio = img.size[0] / img.size[1]
    if video_ratio >= 1:  # the video is wide
        if img_ratio <= video_ratio:  # image is not wide enough
            width_new = int(height * img_ratio)
            size_new = width_new, height
        else:  # image is wider than video
            height_new = int(width / img_ratio)
            size_new = width, height_new
    else:  # the video is tall
        if img_ratio >= video_ratio:  # image is not tall enough
            height_new = int(width / img_ratio)
            size_new = width, height_new
        else:  # image is taller than video
            width_new = int(height * img_ratio)
            size_new = width_new, height
    return img.resize(size_new, resample=Image.LANCZOS)

回答 11

一种用于保持约束比率并通过最大宽度/高度的简单方法。不是最漂亮的,但是可以完成工作并且很容易理解:

def resize(img_path, max_px_size, output_folder):
    with Image.open(img_path) as img:
        width_0, height_0 = img.size
        out_f_name = os.path.split(img_path)[-1]
        out_f_path = os.path.join(output_folder, out_f_name)

        if max((width_0, height_0)) <= max_px_size:
            print('writing {} to disk (no change from original)'.format(out_f_path))
            img.save(out_f_path)
            return

        if width_0 > height_0:
            wpercent = max_px_size / float(width_0)
            hsize = int(float(height_0) * float(wpercent))
            img = img.resize((max_px_size, hsize), Image.ANTIALIAS)
            print('writing {} to disk'.format(out_f_path))
            img.save(out_f_path)
            return

        if width_0 < height_0:
            hpercent = max_px_size / float(height_0)
            wsize = int(float(width_0) * float(hpercent))
            img = img.resize((max_px_size, wsize), Image.ANTIALIAS)
            print('writing {} to disk'.format(out_f_path))
            img.save(out_f_path)
            return

这是一个使用此功能运行批处理图像大小调整的python脚本

A simple method for keeping constrained ratios and passing a max width / height. Not the prettiest but gets the job done and is easy to understand:

def resize(img_path, max_px_size, output_folder):
    with Image.open(img_path) as img:
        width_0, height_0 = img.size
        out_f_name = os.path.split(img_path)[-1]
        out_f_path = os.path.join(output_folder, out_f_name)

        if max((width_0, height_0)) <= max_px_size:
            print('writing {} to disk (no change from original)'.format(out_f_path))
            img.save(out_f_path)
            return

        if width_0 > height_0:
            wpercent = max_px_size / float(width_0)
            hsize = int(float(height_0) * float(wpercent))
            img = img.resize((max_px_size, hsize), Image.ANTIALIAS)
            print('writing {} to disk'.format(out_f_path))
            img.save(out_f_path)
            return

        if width_0 < height_0:
            hpercent = max_px_size / float(height_0)
            wsize = int(float(width_0) * float(hpercent))
            img = img.resize((max_px_size, wsize), Image.ANTIALIAS)
            print('writing {} to disk'.format(out_f_path))
            img.save(out_f_path)
            return

Here’s a python script that uses this function to run batch image resizing.


回答 12

已通过“ tomvon”更新了以上答案

from PIL import Image

img = Image.open(image_path)

width, height = img.size[:2]

if height > width:
    baseheight = 64
    hpercent = (baseheight/float(img.size[1]))
    wsize = int((float(img.size[0])*float(hpercent)))
    img = img.resize((wsize, baseheight), Image.ANTIALIAS)
    img.save('resized.jpg')
else:
    basewidth = 64
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.ANTIALIAS)
    img.save('resized.jpg')

Have updated the answer above by “tomvon”

from PIL import Image

img = Image.open(image_path)

width, height = img.size[:2]

if height > width:
    baseheight = 64
    hpercent = (baseheight/float(img.size[1]))
    wsize = int((float(img.size[0])*float(hpercent)))
    img = img.resize((wsize, baseheight), Image.ANTIALIAS)
    img.save('resized.jpg')
else:
    basewidth = 64
    wpercent = (basewidth/float(img.size[0]))
    hsize = int((float(img.size[1])*float(wpercent)))
    img = img.resize((basewidth,hsize), Image.ANTIALIAS)
    img.save('resized.jpg')

回答 13

我的丑陋例子。

函数获取文件,例如:“ pic [0-9a-z]。[extension]”,将其大小调整为120×120,将节移动到中心并保存到“ ico [0-9a-z]。[extension]”,使用纵向和景观:

def imageResize(filepath):
    from PIL import Image
    file_dir=os.path.split(filepath)
    img = Image.open(filepath)

    if img.size[0] > img.size[1]:
        aspect = img.size[1]/120
        new_size = (img.size[0]/aspect, 120)
    else:
        aspect = img.size[0]/120
        new_size = (120, img.size[1]/aspect)
    img.resize(new_size).save(file_dir[0]+'/ico'+file_dir[1][3:])
    img = Image.open(file_dir[0]+'/ico'+file_dir[1][3:])

    if img.size[0] > img.size[1]:
        new_img = img.crop( (
            (((img.size[0])-120)/2),
            0,
            120+(((img.size[0])-120)/2),
            120
        ) )
    else:
        new_img = img.crop( (
            0,
            (((img.size[1])-120)/2),
            120,
            120+(((img.size[1])-120)/2)
        ) )

    new_img.save(file_dir[0]+'/ico'+file_dir[1][3:])

My ugly example.

Function get file like: “pic[0-9a-z].[extension]”, resize them to 120×120, moves section to center and save to “ico[0-9a-z].[extension]”, works with portrait and landscape:

def imageResize(filepath):
    from PIL import Image
    file_dir=os.path.split(filepath)
    img = Image.open(filepath)

    if img.size[0] > img.size[1]:
        aspect = img.size[1]/120
        new_size = (img.size[0]/aspect, 120)
    else:
        aspect = img.size[0]/120
        new_size = (120, img.size[1]/aspect)
    img.resize(new_size).save(file_dir[0]+'/ico'+file_dir[1][3:])
    img = Image.open(file_dir[0]+'/ico'+file_dir[1][3:])

    if img.size[0] > img.size[1]:
        new_img = img.crop( (
            (((img.size[0])-120)/2),
            0,
            120+(((img.size[0])-120)/2),
            120
        ) )
    else:
        new_img = img.crop( (
            0,
            (((img.size[1])-120)/2),
            120,
            120+(((img.size[1])-120)/2)
        ) )

    new_img.save(file_dir[0]+'/ico'+file_dir[1][3:])

回答 14

我以这种方式调整了图像的大小,并且效果很好

from io import BytesIO
from django.core.files.uploadedfile import InMemoryUploadedFile
import os, sys
from PIL import Image


def imageResize(image):
    outputIoStream = BytesIO()
    imageTemproaryResized = imageTemproary.resize( (1920,1080), Image.ANTIALIAS) 
    imageTemproaryResized.save(outputIoStream , format='PNG', quality='10') 
    outputIoStream.seek(0)
    uploadedImage = InMemoryUploadedFile(outputIoStream,'ImageField', "%s.jpg" % image.name.split('.')[0], 'image/jpeg', sys.getsizeof(outputIoStream), None)

    ## For upload local folder
    fs = FileSystemStorage()
    filename = fs.save(uploadedImage.name, uploadedImage)

I resizeed the image in such a way and it’s working very well

from io import BytesIO
from django.core.files.uploadedfile import InMemoryUploadedFile
import os, sys
from PIL import Image


def imageResize(image):
    outputIoStream = BytesIO()
    imageTemproaryResized = imageTemproary.resize( (1920,1080), Image.ANTIALIAS) 
    imageTemproaryResized.save(outputIoStream , format='PNG', quality='10') 
    outputIoStream.seek(0)
    uploadedImage = InMemoryUploadedFile(outputIoStream,'ImageField', "%s.jpg" % image.name.split('.')[0], 'image/jpeg', sys.getsizeof(outputIoStream), None)

    ## For upload local folder
    fs = FileSystemStorage()
    filename = fs.save(uploadedImage.name, uploadedImage)

回答 15

我还将添加一个调整大小的版本,以保持宽高比固定。在这种情况下,它将根据初始宽高比asp_rat(为float(!))调整高度以匹配新图像的宽度。但是,要将宽度调整为高度,您只需要注释一条线,然后在else循环中取消注释另一条线即可。您会在哪里看到。

您不需要分号(;),我保留它们只是为了提醒自己我经常使用的语言的语法。

from PIL import Image

img_path = "filename.png";
img = Image.open(img_path);     # puts our image to the buffer of the PIL.Image object

width, height = img.size;
asp_rat = width/height;

# Enter new width (in pixels)
new_width = 50;

# Enter new height (in pixels)
new_height = 54;

new_rat = new_width/new_height;

if (new_rat == asp_rat):
    img = img.resize((new_width, new_height), Image.ANTIALIAS); 

# adjusts the height to match the width
# NOTE: if you want to adjust the width to the height, instead -> 
# uncomment the second line (new_width) and comment the first one (new_height)
else:
    new_height = round(new_width / asp_rat);
    #new_width = round(new_height * asp_rat);
    img = img.resize((new_width, new_height), Image.ANTIALIAS);

# usage: resize((x,y), resample)
# resample filter -> PIL.Image.BILINEAR, PIL.Image.NEAREST (default), PIL.Image.BICUBIC, etc..
# https://pillow.readthedocs.io/en/3.1.x/reference/Image.html#PIL.Image.Image.resize

# Enter the name under which you would like to save the new image
img.save("outputname.png");

并且,它完成了。我尽力将其记录在案,因此很明显。

我希望这可能对那里的人有帮助!

I will also add a version of the resize that keeps the aspect ratio fixed. In this case, it will adjust the height to match the width of the new image, based on the initial aspect ratio, asp_rat, which is float (!). But, to adjust the width to the height, instead, you just need to comment one line and uncomment the other in the else loop. You will see, where.

You do not need the semicolons (;), I keep them just to remind myself of syntax of languages I use more often.

from PIL import Image

img_path = "filename.png";
img = Image.open(img_path);     # puts our image to the buffer of the PIL.Image object

width, height = img.size;
asp_rat = width/height;

# Enter new width (in pixels)
new_width = 50;

# Enter new height (in pixels)
new_height = 54;

new_rat = new_width/new_height;

if (new_rat == asp_rat):
    img = img.resize((new_width, new_height), Image.ANTIALIAS); 

# adjusts the height to match the width
# NOTE: if you want to adjust the width to the height, instead -> 
# uncomment the second line (new_width) and comment the first one (new_height)
else:
    new_height = round(new_width / asp_rat);
    #new_width = round(new_height * asp_rat);
    img = img.resize((new_width, new_height), Image.ANTIALIAS);

# usage: resize((x,y), resample)
# resample filter -> PIL.Image.BILINEAR, PIL.Image.NEAREST (default), PIL.Image.BICUBIC, etc..
# https://pillow.readthedocs.io/en/3.1.x/reference/Image.html#PIL.Image.Image.resize

# Enter the name under which you would like to save the new image
img.save("outputname.png");

And, it is done. I tried to document it as much as I can, so it is clear.

I hope it might be helpful to someone out there!


回答 16

打开你的图片文件

from PIL import Image
im = Image.open("image.png")

使用PIL Image.resize(size,resample = 0)方法,在其中用图像的(宽度,高度)替换大小为2元组。

这将以原始尺寸显示图像:

display(im.resize((int(im.size[0]),int(im.size[1])), 0) )

这将以1/2尺寸显示图像:

display(im.resize((int(im.size[0]/2),int(im.size[1]/2)), 0) )

这将以1/3的大小显示图像:

display(im.resize((int(im.size[0]/3),int(im.size[1]/3)), 0) )

这将以1/4的大小显示图像:

display(im.resize((int(im.size[0]/4),int(im.size[1]/4)), 0) )

Open your image file

from PIL import Image
im = Image.open("image.png")

Use PIL Image.resize(size, resample=0) method, where you substitute (width, height) of your image for the size 2-tuple.

This will display your image at original size:

display(im.resize((int(im.size[0]),int(im.size[1])), 0) )

This will display your image at 1/2 the size:

display(im.resize((int(im.size[0]/2),int(im.size[1]/2)), 0) )

This will display your image at 1/3 the size:

display(im.resize((int(im.size[0]/3),int(im.size[1]/3)), 0) )

This will display your image at 1/4 the size:

display(im.resize((int(im.size[0]/4),int(im.size[1]/4)), 0) )

etc etc


回答 17

from PIL import Image
from resizeimage import resizeimage

def resize_file(in_file, out_file, size):
    with open(in_file) as fd:
        image = resizeimage.resize_thumbnail(Image.open(fd), size)
    image.save(out_file)
    image.close()

resize_file('foo.tif', 'foo_small.jpg', (256, 256))
from PIL import Image
from resizeimage import resizeimage

def resize_file(in_file, out_file, size):
    with open(in_file) as fd:
        image = resizeimage.resize_thumbnail(Image.open(fd), size)
    image.save(out_file)
    image.close()

resize_file('foo.tif', 'foo_small.jpg', (256, 256))

回答 18

您可以通过以下代码调整图片大小:

From PIL import Image
img=Image.open('Filename.jpg') # paste image in python folder
print(img.size())
new_img=img.resize((400,400))
new_img.save('new_filename.jpg')

You can resize image by below code:

From PIL import Image
img=Image.open('Filename.jpg') # paste image in python folder
print(img.size())
new_img=img.resize((400,400))
new_img.save('new_filename.jpg')

如何在列表中找到重复项并使用它们创建另一个列表?

问题:如何在列表中找到重复项并使用它们创建另一个列表?

如何在Python列表中找到重复项并创建另一个重复项列表?该列表仅包含整数。

How can I find the duplicates in a Python list and create another list of the duplicates? The list only contains integers.


回答 0

要删除重复项,请使用set(a)。要打印副本,类似:

a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])

## [1, 2, 5]

请注意,这Counter并不是特别有效(计时),并且在这里可能会过大。set会表现更好。此代码按源顺序计算唯一元素的列表:

seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

或者,更简洁地说:

seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]    

我不建议您使用后者,因为not seen.add(x)这样做并不明显(set add()方法总是返回None,因此需要not)。

要计算没有库的重复元素列表:

seen = {}
dupes = []

for x in a:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

如果列表元素不可散列,则不能使用集合/字典,而必须求助于二​​次时间解(将每个解比较)。例如:

a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print no_dupes # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print dupes # [[1], [3]]

To remove duplicates use set(a). To print duplicates, something like:

a = [1,2,3,2,1,5,6,5,5,5]

import collections
print([item for item, count in collections.Counter(a).items() if count > 1])

## [1, 2, 5]

Note that Counter is not particularly efficient (timings) and probably overkill here. set will perform better. This code computes a list of unique elements in the source order:

seen = set()
uniq = []
for x in a:
    if x not in seen:
        uniq.append(x)
        seen.add(x)

or, more concisely:

seen = set()
uniq = [x for x in a if x not in seen and not seen.add(x)]    

I don’t recommend the latter style, because it is not obvious what not seen.add(x) is doing (the set add() method always returns None, hence the need for not).

To compute the list of duplicated elements without libraries:

seen = {}
dupes = []

for x in a:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

If list elements are not hashable, you cannot use sets/dicts and have to resort to a quadratic time solution (compare each with each). For example:

a = [[1], [2], [3], [1], [5], [3]]

no_dupes = [x for n, x in enumerate(a) if x not in a[:n]]
print no_dupes # [[1], [2], [3], [5]]

dupes = [x for n, x in enumerate(a) if x in a[:n]]
print dupes # [[1], [3]]

回答 1

>>> l = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in l if l.count(x) > 1])
set([1, 4, 5])
>>> l = [1,2,3,4,4,5,5,6,1]
>>> set([x for x in l if l.count(x) > 1])
set([1, 4, 5])

回答 2

您不需要计数,只需要查看之前是否曾查看过该项目即可。改编这个问题的答案,这一问题:

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  # adds all elements it doesn't know yet to seen and all other to seen_twice
  seen_twice = set( x for x in seq if x in seen or seen_add(x) )
  # turn the set into a list (as requested)
  return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]

以防万一速度很重要,以下是一些时间安排:

# file: test.py
import collections

def thg435(l):
    return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in l if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def RiteshKumar(l):
    return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
    seen = set()
    seen2 = set()
    seen_add = seen.add
    seen2_add = seen2.add
    for item in L:
        if item in seen:
            seen2_add(item)
        else:
            seen_add(item)
    return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100

结果如下:(做得很好@JohnLaRooy!)

$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop

有趣的是,除了计时本身之外,使用pypy时排名也略有变化。最有趣的是,基于计数器的方法从pypy的优化中受益匪浅,而我建议的方法缓存方法似乎几乎没有效果。

$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop

显然,这种影响与输入数据的“重复性”有关。我已经设置l = [random.randrange(1000000) for i in xrange(10000)]并得到以下结果:

$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop

You don’t need the count, just whether or not the item was seen before. Adapted that answer to this problem:

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  # adds all elements it doesn't know yet to seen and all other to seen_twice
  seen_twice = set( x for x in seq if x in seen or seen_add(x) )
  # turn the set into a list (as requested)
  return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]

Just in case speed matters, here are some timings:

# file: test.py
import collections

def thg435(l):
    return [x for x, y in collections.Counter(l).items() if y > 1]

def moooeeeep(l):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in l if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def RiteshKumar(l):
    return list(set([x for x in l if l.count(x) > 1]))

def JohnLaRooy(L):
    seen = set()
    seen2 = set()
    seen_add = seen.add
    seen2_add = seen2.add
    for item in L:
        if item in seen:
            seen2_add(item)
        else:
            seen_add(item)
    return list(seen2)

l = [1,2,3,2,1,5,6,5,5,5]*100

Here are the results: (well done @JohnLaRooy!)

$ python -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
10000 loops, best of 3: 74.6 usec per loop
$ python -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 91.3 usec per loop
$ python -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 266 usec per loop
$ python -mtimeit -s 'import test' 'test.RiteshKumar(test.l)'
100 loops, best of 3: 8.35 msec per loop

Interestingly, besides the timings itself, also the ranking slightly changes when pypy is used. Most interestingly, the Counter-based approach benefits hugely from pypy’s optimizations, whereas the method caching approach I have suggested seems to have almost no effect.

$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
100000 loops, best of 3: 17.8 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
10000 loops, best of 3: 23 usec per loop
$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
10000 loops, best of 3: 39.3 usec per loop

Apparantly this effect is related to the “duplicatedness” of the input data. I have set l = [random.randrange(1000000) for i in xrange(10000)] and got these results:

$ pypy -mtimeit -s 'import test' 'test.moooeeeep(test.l)'
1000 loops, best of 3: 495 usec per loop
$ pypy -mtimeit -s 'import test' 'test.JohnLaRooy(test.l)'
1000 loops, best of 3: 499 usec per loop
$ pypy -mtimeit -s 'import test' 'test.thg435(test.l)'
1000 loops, best of 3: 1.68 msec per loop

回答 3

您可以使用iteration_utilities.duplicates

>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]

或者,如果您只希望每个重复项之一,可以将其与iteration_utilities.unique_everseen

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]

它还可以处理不可散列的元素(但是以性能为代价):

>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]

>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]

这是这里仅有的其他几种方法可以处理的。

基准测试

我做了一个快速基准测试,其中包含这里提到的大多数(但不是全部)方法。

第一个基准测试仅包含一小部分列表长度,因为某些方法具有 O(n**2)行为。

在曲线图中,y轴表示时间,因此值越低越好。它还绘制了对数-对数,因此可以更好地可视化各种值:

在此处输入图片说明

除去这些O(n**2)方法,我做了另一个基准测试,列表中最多有500万个元素:

在此处输入图片说明

如您所见 iteration_utilities.duplicates方法比其他任何方法甚至链式方法都快unique_everseen(duplicates(...))都快也比其他方法快或同样快。

这里要注意的另一件有趣的事情是,对于小名单,熊猫方法非常慢,但可以轻松竞争更长的名单。

但是,由于这些基准测试表明大多数方法的性能大致相同,因此使用哪种方法都无关紧要(除了具有O(n**2)运行时的三种方法外)。

from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]   

def georg_set3(it):
    seen = {}
    dupes = []

    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)

    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()

    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)

    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s=set(l)
    d=[]
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))

基准1

from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep, 
    F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')

b.plot()

基准2

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, 
    F1Rumors, Edward, wordsmith, firelynx,
    yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()

免责声明

1这来自我编写的第三方库iteration_utilities

You can use iteration_utilities.duplicates:

>>> from iteration_utilities import duplicates

>>> list(duplicates([1,1,2,1,2,3,4,2]))
[1, 1, 2, 2]

or if you only want one of each duplicate this can be combined with iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen

>>> list(unique_everseen(duplicates([1,1,2,1,2,3,4,2])))
[1, 2]

It can also handle unhashable elements (however at the cost of performance):

>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]

>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]

That’s something that only a few of the other approaches here can handle.

Benchmarks

I did a quick benchmark containing most (but not all) of the approaches mentioned here.

The first benchmark included only a small range of list-lengths because some approaches have O(n**2) behavior.

In the graphs the y-axis represents the time, so a lower value means better. It’s also plotted log-log so the wide range of values can be visualized better:

enter image description here

Removing the O(n**2) approaches I did another benchmark up to half a million elements in a list:

enter image description here

As you can see the iteration_utilities.duplicates approach is faster than any of the other approaches and even chaining unique_everseen(duplicates(...)) was faster or equally fast than the other approaches.

One additional interesting thing to note here is that the pandas approaches are very slow for small lists but can easily compete for longer lists.

However as these benchmarks show most of the approaches perform roughly equally, so it doesn’t matter much which one is used (except for the 3 that had O(n**2) runtime).

from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools

def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]   

def georg_set3(it):
    seen = {}
    dupes = []

    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    seen_twice = set( x for x in seq if x in seen or seen_add(x) )
    # turn the set into a list (as requested)
    return list( seen_twice )

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)

    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()

    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)

    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s=set(l)
    d=[]
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))

Benchmark 1

from simple_benchmark import benchmark
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count, moooeeeep, 
    F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')

b.plot()

Benchmark 2

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, 
    F1Rumors, Edward, wordsmith, firelynx,
    yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()

Disclaimer

1 This is from a third-party library I have written: iteration_utilities.


回答 4

我在寻找相关问题时遇到了这个问题-想知道为什么没人提供基于生成器的解决方案?解决此问题的方法是:

>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]

我担心可伸缩性,因此测试了几种方法,包括幼稚的项目,这些项目在小列表上都可以很好地工作,但是随着列表变大就可怕地扩展(注意-使用timeit会更好,但这只是说明性的)。

我加入了@moooeeeep进行比较(这是非常快的:如果输入列表是完全随机的,则是最快的),而itertools方法对于大多数已排序的列表来说甚至更快…现在包括@firelynx的pandas方法-慢,但没有很可怕,而且很简单。注意-对于大型的有序列表,sort / tee / zip方法在我的机器上始终是最快的,对于混洗的列表,moooeeeep最快,但是您的里程可能会有所不同。

优点

  • 使用相同的代码非常容易地测试“任何”重复项

假设条件

  • 重复应仅报告一次
  • 重复的订单不需要保留
  • 重复项可能在列表中的任何位置

最快的解决方案,一百万个条目:

def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

经过测试的方法

import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s,r = set(),set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k,g in itertools.izip(x[:-1],x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn,c):
    try:
        if fn(c).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def getDupes(fn,c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
for location in (50,250000,500000,750000,999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
        deltas = []
        for FIRST in (True,False):
            for i in xrange(0, 5):
                c = range(0,1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time()-start)
            print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
        print
    print

“所有重复”测试的结果是一致的,在此数组中找到“第一个”重复项,然后是“所有”重复项:

Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499  

当列表首先被洗牌时,排序的价格显而易见-效率显着下降,@ moooeeeep方法占主导地位,set和dict方法相似,但表现较差:

Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540  

I came across this question whilst looking in to something related – and wonder why no-one offered a generator based solution? Solving this problem would be:

>>> print list(getDupes_9([1,2,3,2,1,5,6,5,5,5]))
[1, 2, 5]

I was concerned with scalability, so tested several approaches, including naive items that work well on small lists, but scale horribly as lists get larger (note- would have been better to use timeit, but this is illustrative).

I included @moooeeeep for comparison (it is impressively fast: fastest if the input list is completely random) and an itertools approach that is even faster again for mostly sorted lists… Now includes pandas approach from @firelynx — slow, but not horribly so, and simple. Note – sort/tee/zip approach is consistently fastest on my machine for large mostly ordered lists, moooeeeep is fastest for shuffled lists, but your mileage may vary.

Advantages

  • very quick simple to test for ‘any’ duplicates using the same code

Assumptions

  • Duplicates should be reported once only
  • Duplicate order does not need to be preserved
  • Duplicate might be anywhere in the list

Fastest solution, 1m entries:

def getDupes(c):
        '''sort/tee/izip'''
        a, b = itertools.tee(sorted(c))
        next(b, None)
        r = None
        for k, g in itertools.izip(a, b):
            if k != g: continue
            if k != r:
                yield k
                r = k

Approaches tested

import itertools
import time
import random

def getDupes_1(c):
    '''naive'''
    for i in xrange(0, len(c)):
        if c[i] in c[:i]:
            yield c[i]

def getDupes_2(c):
    '''set len change'''
    s = set()
    for i in c:
        l = len(s)
        s.add(i)
        if len(s) == l:
            yield i

def getDupes_3(c):
    '''in dict'''
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

def getDupes_4(c):
    '''in set'''
    s,r = set(),set()
    for i in c:
        if i not in s:
            s.add(i)
        elif i not in r:
            r.add(i)
            yield i

def getDupes_5(c):
    '''sort/adjacent'''
    c = sorted(c)
    r = None
    for i in xrange(1, len(c)):
        if c[i] == c[i - 1]:
            if c[i] != r:
                yield c[i]
                r = c[i]

def getDupes_6(c):
    '''sort/groupby'''
    def multiple(x):
        try:
            x.next()
            x.next()
            return True
        except:
            return False
    for k, g in itertools.ifilter(lambda x: multiple(x[1]), itertools.groupby(sorted(c))):
        yield k

def getDupes_7(c):
    '''sort/zip'''
    c = sorted(c)
    r = None
    for k, g in zip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_8(c):
    '''sort/izip'''
    c = sorted(c)
    r = None
    for k, g in itertools.izip(c[:-1],c[1:]):
        if k == g:
            if k != r:
                yield k
                r = k

def getDupes_9(c):
    '''sort/tee/izip'''
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in itertools.izip(a, b):
        if k != g: continue
        if k != r:
            yield k
            r = k

def getDupes_a(l):
    '''moooeeeep'''
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all other to seen_twice
    for x in l:
        if x in seen or seen_add(x):
            yield x

def getDupes_b(x):
    '''iter*/sorted'''
    x = sorted(x)
    def _matches():
        for k,g in itertools.izip(x[:-1],x[1:]):
            if k == g:
                yield k
    for k, n in itertools.groupby(_matches()):
        yield k

def getDupes_c(a):
    '''pandas'''
    import pandas as pd
    vc = pd.Series(a).value_counts()
    i = vc[vc > 1].index
    for _ in i:
        yield _

def hasDupes(fn,c):
    try:
        if fn(c).next(): return True    # Found a dupe
    except StopIteration:
        pass
    return False

def getDupes(fn,c):
    return list(fn(c))

STABLE = True
if STABLE:
    print 'Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array'
else:
    print 'Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array'
for location in (50,250000,500000,750000,999999):
    for test in (getDupes_2, getDupes_3, getDupes_4, getDupes_5, getDupes_6,
                 getDupes_8, getDupes_9, getDupes_a, getDupes_b, getDupes_c):
        print 'Test %-15s:%10d - '%(test.__doc__ or test.__name__,location),
        deltas = []
        for FIRST in (True,False):
            for i in xrange(0, 5):
                c = range(0,1000000)
                if STABLE:
                    c[0] = location
                else:
                    c.append(location)
                    random.shuffle(c)
                start = time.time()
                if FIRST:
                    print '.' if location == test(c).next() else '!',
                else:
                    print '.' if [location] == list(test(c)) else '!',
                deltas.append(time.time()-start)
            print ' -- %0.3f  '%(sum(deltas)/len(deltas)),
        print
    print

The results for the ‘all dupes’ test were consistent, finding “first” duplicate then “all” duplicates in this array:

Finding FIRST then ALL duplicates, single dupe of "nth" placed element in 1m element array
Test set len change :    500000 -  . . . . .  -- 0.264   . . . . .  -- 0.402  
Test in dict        :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.250  
Test in set         :    500000 -  . . . . .  -- 0.163   . . . . .  -- 0.249  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.159   . . . . .  -- 0.229  
Test sort/groupby   :    500000 -  . . . . .  -- 0.860   . . . . .  -- 1.286  
Test sort/izip      :    500000 -  . . . . .  -- 0.165   . . . . .  -- 0.229  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.145   . . . . .  -- 0.206  *
Test moooeeeep      :    500000 -  . . . . .  -- 0.149   . . . . .  -- 0.232  
Test iter*/sorted   :    500000 -  . . . . .  -- 0.160   . . . . .  -- 0.221  
Test pandas         :    500000 -  . . . . .  -- 0.493   . . . . .  -- 0.499  

When the lists are shuffled first, the price of the sort becomes apparent – the efficiency drops noticeably and the @moooeeeep approach dominates, with set & dict approaches being similar but lessor performers:

Finding FIRST then ALL duplicates, single dupe of "n" included in randomised 1m element array
Test set len change :    500000 -  . . . . .  -- 0.321   . . . . .  -- 0.473  
Test in dict        :    500000 -  . . . . .  -- 0.285   . . . . .  -- 0.360  
Test in set         :    500000 -  . . . . .  -- 0.309   . . . . .  -- 0.365  
Test sort/adjacent  :    500000 -  . . . . .  -- 0.756   . . . . .  -- 0.823  
Test sort/groupby   :    500000 -  . . . . .  -- 1.459   . . . . .  -- 1.896  
Test sort/izip      :    500000 -  . . . . .  -- 0.786   . . . . .  -- 0.845  
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.743   . . . . .  -- 0.804  
Test moooeeeep      :    500000 -  . . . . .  -- 0.234   . . . . .  -- 0.311  *
Test iter*/sorted   :    500000 -  . . . . .  -- 0.776   . . . . .  -- 0.840  
Test pandas         :    500000 -  . . . . .  -- 0.539   . . . . .  -- 0.540  

回答 5

使用熊猫:

>>> import pandas as pd
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> pd.Series(a)[pd.Series(a).duplicated()].values
array([1, 3, 3])

Using pandas:

>>> import pandas as pd
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> pd.Series(a)[pd.Series(a).duplicated()].values
array([1, 3, 3])

回答 6

collections.Counter是python 2.7中的新功能:


Python 2.5.4 (r254:67916, May 31 2010, 15:03:39) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print [x for x, y in collections.Counter(a).items() if y > 1]
Type "help", "copyright", "credits" or "license" for more information.
  File "", line 1, in 
AttributeError: 'module' object has no attribute 'Counter'
>>> 

在早期版本中,您可以改用常规dict:

a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]

collections.Counter is new in python 2.7:


Python 2.5.4 (r254:67916, May 31 2010, 15:03:39) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print [x for x, y in collections.Counter(a).items() if y > 1]
Type "help", "copyright", "credits" or "license" for more information.
  File "", line 1, in 
AttributeError: 'module' object has no attribute 'Counter'
>>> 

In an earlier version you can use a conventional dict instead:

a = [1,2,3,2,1,5,6,5,5,5]
d = {}
for elem in a:
    if elem in d:
        d[elem] += 1
    else:
        d[elem] = 1

print [x for x, y in d.items() if y > 1]

回答 7

这是一个简洁明了的解决方案-

for x in set(li):
    li.remove(x)

li = list(set(li))

Here’s a neat and concise solution –

for x in set(li):
    li.remove(x)

li = list(set(li))

回答 8

如果不转换为列表,可能最简单的方法如下所示。 当他们要求不使用布景时,这可能在面试中很有用

a=[1,2,3,3,3]
dup=[]
for each in a:
  if each not in dup:
    dup.append(each)
print(dup)

=======否则将获得2个单独的唯一值和重复值列表

a=[1,2,3,3,3]
uniques=[]
dups=[]

for each in a:
  if each not in uniques:
    uniques.append(each)
  else:
    dups.append(each)
print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)

Without converting to list and probably the simplest way would be something like below. This may be useful during a interview when they ask not to use sets

a=[1,2,3,3,3]
dup=[]
for each in a:
  if each not in dup:
    dup.append(each)
print(dup)

======= else to get 2 separate lists of unique values and duplicate values

a=[1,2,3,3,3]
uniques=[]
dups=[]

for each in a:
  if each not in uniques:
    uniques.append(each)
  else:
    dups.append(each)
print("Unique values are below:")
print(uniques)
print("Duplicate values are below:")
print(dups)

回答 9

我会用熊猫做这件事,因为我经常使用熊猫

import pandas as pd
a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()

[3,6]

可能不是很有效,但是肯定比许多其他答案要少的代码,所以我想我会做出贡献

I would do this with pandas, because I use pandas a lot

import pandas as pd
a = [1,2,3,3,3,4,5,6,6,7]
vc = pd.Series(a).value_counts()
vc[vc > 1].index.tolist()

Gives

[3,6]

Probably isn’t very efficient, but it sure is less code than a lot of the other answers, so I thought I would contribute


回答 10

接受的答案的第三个示例给出了错误的答案,并且没有尝试重复。这是正确的版本:

number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set

the third example of the accepted answer give an erroneous answer and does not attempt to give duplicates. Here is the correct version :

number_lst = [1, 1, 2, 3, 5, ...]

seen_set = set()
duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
unique_set = seen_set - duplicate_set

回答 11

如何通过检查出现次数来简单地循环遍历列表中的每个元素,然后将它们添加到集合中,然后将其打印出来。希望这可以帮助某人。

myList  = [2 ,4 , 6, 8, 4, 6, 12];
newList = set()

for i in myList:
    if myList.count(i) >= 2:
        newList.add(i)

print(list(newList))
## [4 , 6]

How about simply loop through each element in the list by checking the number of occurrences, then adding them to a set which will then print the duplicates. Hope this helps someone out there.

myList  = [2 ,4 , 6, 8, 4, 6, 12];
newList = set()

for i in myList:
    if myList.count(i) >= 2:
        newList.add(i)

print(list(newList))
## [4 , 6]

回答 12

我们可以使用itertools.groupby来查找所有具有重复项的项目:

from itertools import groupby

myList  = [2, 4, 6, 8, 4, 6, 12]
# when the list is sorted, groupby groups by consecutive elements which are similar
for x, y in groupby(sorted(myList)):
    #  list(y) returns all the occurences of item x
    if len(list(y)) > 1:
        print x  

输出将是:

4
6

We can use itertools.groupby in order to find all the items that have dups:

from itertools import groupby

myList  = [2, 4, 6, 8, 4, 6, 12]
# when the list is sorted, groupby groups by consecutive elements which are similar
for x, y in groupby(sorted(myList)):
    #  list(y) returns all the occurences of item x
    if len(list(y)) > 1:
        print x  

The output will be:

4
6

回答 13

我想在列表中查找重复项的最有效方法是:

from collections import Counter

def duplicates(values):
    dups = Counter(values) - Counter(set(values))
    return list(dups.keys())

print(duplicates([1,2,3,6,5,2]))

它使用Counter所有元素和所有唯一元素。将第一个与第二个相减将只保留重复项。

I guess the most effective way to find duplicates in a list is:

from collections import Counter

def duplicates(values):
    dups = Counter(values) - Counter(set(values))
    return list(dups.keys())

print(duplicates([1,2,3,6,5,2]))

It uses the Counter the all the elements and all unique elements. Subtracting the first one with the second will leave out only the duplicates.


回答 14

有点晚了,但可能对某些人有所帮助。对于一个庞大的清单,我发现这对我有用。

l=[1,2,3,5,4,1,3,1]
s=set(l)
d=[]
for x in l:
    if x in s:
        s.remove(x)
    else:
        d.append(x)
d
[1,3,1]

只显示所有重复项并保留顺序。

A bit late, but maybe helpful for some. For a largish list, I found this worked for me.

l=[1,2,3,5,4,1,3,1]
s=set(l)
d=[]
for x in l:
    if x in s:
        s.remove(x)
    else:
        d.append(x)
d
[1,3,1]

Shows just and all duplicates and preserves order.


回答 15

在Python中一次迭代即可找到重复对象的非常简单快捷的方法是:

testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

testListDict = {}

for item in testList:
  try:
    testListDict[item] += 1
  except:
    testListDict[item] = 1

print testListDict

输出如下:

>>> print testListDict
{'blue': 3, 'green': 1, 'red': 2}

这以及我博客中的更多内容http://www.howtoprogramwithpython.com

Very simple and quick way of finding dupes with one iteration in Python is:

testList = ['red', 'blue', 'red', 'green', 'blue', 'blue']

testListDict = {}

for item in testList:
  try:
    testListDict[item] += 1
  except:
    testListDict[item] = 1

print testListDict

Output will be as follows:

>>> print testListDict
{'blue': 3, 'green': 1, 'red': 2}

This and more in my blog http://www.howtoprogramwithpython.com


回答 16

我要进入这个讨论的很晚了。即使,我也想用一个衬纸来解决这个问题。因为那是Python的魅力。如果我们只想将重复项放入单独的列表(或任何集合)中,我建议按照以下步骤进行操作。说我们有一个重复的列表,我们可以将其称为“目标”

    target=[1,2,3,4,4,4,3,5,6,8,4,3]

现在,如果要获取重复项,可以使用一种衬纸,如下所示:

    duplicates=dict(set((x,target.count(x)) for x in filter(lambda rec : target.count(rec)>1,target)))

这段代码会将重复的记录作为键并作为值存入字典’duplicates’中。’duplicate’字典如下所示:

    {3: 3, 4: 4} #it saying 3 is repeated 3 times and 4 is 4 times

如果只想将所有重复的记录都放在一个列表中,那么它的代码又要短得多:

    duplicates=filter(lambda rec : target.count(rec)>1,target)

输出将是:

    [3, 4, 4, 4, 3, 4, 3]

这在python 2.7.x +版本中完美工作

I am entering much much late in to this discussion. Even though, I would like to deal with this problem with one liners . Because that’s the charm of Python. if we just want to get the duplicates in to a separate list (or any collection),I would suggest to do as below.Say we have a duplicated list which we can call as ‘target’

    target=[1,2,3,4,4,4,3,5,6,8,4,3]

Now if we want to get the duplicates,we can use the one liner as below:

    duplicates=dict(set((x,target.count(x)) for x in filter(lambda rec : target.count(rec)>1,target)))

This code will put the duplicated records as key and count as value in to the dictionary ‘duplicates’.’duplicate’ dictionary will look like as below:

    {3: 3, 4: 4} #it saying 3 is repeated 3 times and 4 is 4 times

If you just want all the records with duplicates alone in a list, its again much shorter code:

    duplicates=filter(lambda rec : target.count(rec)>1,target)

Output will be:

    [3, 4, 4, 4, 3, 4, 3]

This works perfectly in python 2.7.x + versions


回答 17

如果您不愿意编写自己的算法或使用库,则使用Python 3.8一线式:

l = [1,2,3,2,1,5,6,5,5,5]

res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]

print(res)

打印项目和计数:

[(1, 2), (2, 2), (5, 4)]

groupby具有分组功能,因此您可以以不同的方式定义分组,并Tuple根据需要返回其他字段。

groupby 很懒,所以不要太慢。

Python 3.8 one-liner if you don’t care to write your own algorithm or use libraries:

l = [1,2,3,2,1,5,6,5,5,5]

res = [(x, count) for x, g in groupby(sorted(l)) if (count := len(list(g))) > 1]

print(res)

Prints item and count:

[(1, 2), (2, 2), (5, 4)]

groupby takes a grouping function so you can define your groupings in different ways and return additional Tuple fields as needed.

groupby is lazy so it shouldn’t be too slow.


回答 18

其他一些测试。当然可以…

set([x for x in l if l.count(x) > 1])

…太贵了 使用下一个最终方法大约要快500倍(更长的数组会带来更好的结果):

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()

仅2个循环,没有非常昂贵的l.count()操作。

例如,下面是比较这些方法的代码。代码如下,这是输出:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

测试代码:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)    

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()

Some other tests. Of course to do…

set([x for x in l if l.count(x) > 1])

…is too costly. It’s about 500 times faster (the more long array gives better results) to use the next final method:

def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()

Only 2 loops, no very costly l.count() operations.

Here is a code to compare the methods for example. The code is below, here is the output:

dups_count: 13.368s # this is a function which uses l.count()
dups_count_dict: 0.014s # this is a final best function (of the 3 functions)
dups_count_counter: 0.024s # collections.Counter

The testing code:

import numpy as np
from time import time
from collections import Counter

class TimerCounter(object):
    def __init__(self):
        self._time_sum = 0

    def start(self):
        self.time = time()

    def stop(self):
        self._time_sum += time() - self.time

    def get_time_sum(self):
        return self._time_sum


def dups_count(l):
    return set([x for x in l if l.count(x) > 1])


def dups_count_dict(l):
    d = {}

    for item in l:
        if item not in d:
            d[item] = 0

        d[item] += 1

    result_d = {key: val for key, val in d.iteritems() if val > 1}

    return result_d.keys()


def dups_counter(l):
    counter = Counter(l)    

    result_d = {key: val for key, val in counter.iteritems() if val > 1}

    return result_d.keys()



def gen_array():
    np.random.seed(17)
    return list(np.random.randint(0, 5000, 10000))


def assert_equal_results(*results):
    primary_result = results[0]
    other_results = results[1:]

    for other_result in other_results:
        assert set(primary_result) == set(other_result) and len(primary_result) == len(other_result)


if __name__ == '__main__':
    dups_count_time = TimerCounter()
    dups_count_dict_time = TimerCounter()
    dups_count_counter = TimerCounter()

    l = gen_array()

    for i in range(3):
        dups_count_time.start()
        result1 = dups_count(l)
        dups_count_time.stop()

        dups_count_dict_time.start()
        result2 = dups_count_dict(l)
        dups_count_dict_time.stop()

        dups_count_counter.start()
        result3 = dups_counter(l)
        dups_count_counter.stop()

        assert_equal_results(result1, result2, result3)

    print 'dups_count: %.3f' % dups_count_time.get_time_sum()
    print 'dups_count_dict: %.3f' % dups_count_dict_time.get_time_sum()
    print 'dups_count_counter: %.3f' % dups_count_counter.get_time_sum()

回答 19

方法1:

list(set([val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]))

说明: [idx的val,如果input_list [idx + 1:]中的val,则enumerate(input_list)中的val]是一个列表推导,如果从当前位置的列表中存在相同的元素,则返回一个元素。

例如:input_list = [42,31,42,31,3,31,31,5,6,6,6,6,6,7,42]

从列表42中的第一个元素开始,索引为0,它会检查input_list [1:]中是否存在元素42(即,从索引1到列表末尾),因为input_list [1:]中存在42 ,它将返回42。

然后转到具有索引1的下一个元素31,并检查input_list [2:]中是否存在元素31(即,从索引2到列表末尾),因为input_list [2:]中存在31,它将返回31。

类似地,它遍历列表中的所有元素,并且仅将重复/重复的元素返回到列表中。

然后,因为我们有重复项,所以在列表中,我们需要从每个重复项中选择一个,即删除重复项中的重复项,然后调用python内置的名为set()的python,它会删除重复项,

然后我们剩下一个集合,而不是列表,因此要从一个集合转换为列表,我们使用typecasting,list(),并将元素集转换为一个列表。

方法2:

def dupes(ilist):
    temp_list = [] # initially, empty temporary list
    dupe_list = [] # initially, empty duplicate list
    for each in ilist:
        if each in temp_list: # Found a Duplicate element
            if not each in dupe_list: # Avoid duplicate elements in dupe_list
                dupe_list.append(each) # Add duplicate element to dupe_list
        else: 
            temp_list.append(each) # Add a new (non-duplicate) to temp_list

    return dupe_list

说明: 在这里,我们首先创建两个空列表。然后继续遍历列表的所有元素,以查看它是否存在于temp_list中(最初为空)。如果temp_list中没有它,则使用append方法将其添加到temp_list中。

如果它已经存在于temp_list中,则意味着该列表的当前元素是重复的,因此我们需要使用append方法将其添加到dupe_list中。

Method 1:

list(set([val for idx, val in enumerate(input_list) if val in input_list[idx+1:]]))

Explanation: [val for idx, val in enumerate(input_list) if val in input_list[idx+1:]] is a list comprehension, that returns an element, if the same element is present from it’s current position, in list, the index.

Example: input_list = [42,31,42,31,3,31,31,5,6,6,6,6,6,7,42]

starting with the first element in list, 42, with index 0, it checks if the element 42, is present in input_list[1:] (i.e., from index 1 till end of list) Because 42 is present in input_list[1:], it will return 42.

Then it goes to the next element 31, with index 1, and checks if element 31 is present in the input_list[2:] (i.e., from index 2 till end of list), Because 31 is present in input_list[2:], it will return 31.

similarly it goes through all the elements in the list, and will return only the repeated/duplicate elements into a list.

Then because we have duplicates, in a list, we need to pick one of each duplicate, i.e. remove duplicate among duplicates, and to do so, we do call a python built-in named set(), and it removes the duplicates,

Then we are left with a set, but not a list, and hence to convert from a set to list, we use, typecasting, list(), and that converts the set of elements to a list.

Method 2:

def dupes(ilist):
    temp_list = [] # initially, empty temporary list
    dupe_list = [] # initially, empty duplicate list
    for each in ilist:
        if each in temp_list: # Found a Duplicate element
            if not each in dupe_list: # Avoid duplicate elements in dupe_list
                dupe_list.append(each) # Add duplicate element to dupe_list
        else: 
            temp_list.append(each) # Add a new (non-duplicate) to temp_list

    return dupe_list

Explanation: Here We create two empty lists, to start with. Then keep traversing through all the elements of the list, to see if it exists in temp_list (initially empty). If it is not there in the temp_list, then we add it to the temp_list, using append method.

If it already exists in temp_list, it means, that the current element of the list is a duplicate, and hence we need to add it to dupe_list using append method.


回答 20

raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4,]

clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
    try:
        clean_list.remove(item)
    except ValueError:
        duplicated_items.append(item)


print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]

您基本上可以通过转换为set(clean_list)来删除重复项,然后对其进行迭代raw_list,同时从item清除列表中删除每个重复项以在中出现raw_list。如果item未找到,ValueError则捕获引发的Exception并将其item添加到duplicated_items列表中。

如果需要重复项的索引,则只需enumerate列出并使用索引即可。(for index, item in enumerate(raw_list):),速度更快,并针对大型列表(如数千个元素以上)进行了优化

raw_list = [1,2,3,3,4,5,6,6,7,2,3,4,2,3,4,1,3,4,]

clean_list = list(set(raw_list))
duplicated_items = []

for item in raw_list:
    try:
        clean_list.remove(item)
    except ValueError:
        duplicated_items.append(item)


print(duplicated_items)
# [3, 6, 2, 3, 4, 2, 3, 4, 1, 3, 4]

You basically remove duplicates by converting to set (clean_list), then iterate the raw_list, while removing each item in the clean list for occurrence in raw_list. If item is not found, the raised ValueError Exception is caught and the item is added to duplicated_items list.

If the index of duplicated items is needed, just enumerate the list and play around with the index. (for index, item in enumerate(raw_list):) which is faster and optimised for large lists (like thousands+ of elements)


回答 21

使用list.count()列表中的方法找出给定列表中的重复元素

arr=[]
dup =[]
for i in range(int(input("Enter range of list: "))):
    arr.append(int(input("Enter Element in a list: ")))
for i in arr:
    if arr.count(i)>1 and i not in dup:
        dup.append(i)
print(dup)

use of list.count() method in the list to find out the duplicate elements of a given list

arr=[]
dup =[]
for i in range(int(input("Enter range of list: "))):
    arr.append(int(input("Enter Element in a list: ")))
for i in arr:
    if arr.count(i)>1 and i not in dup:
        dup.append(i)
print(dup)

回答 22

单线,乐趣无穷,并且需要一个声明。

(lambda iterable: reduce(lambda (uniq, dup), item: (uniq, dup | {item}) if item in uniq else (uniq | {item}, dup), iterable, (set(), set())))(some_iterable)

one-liner, for fun, and where a single statement is required.

(lambda iterable: reduce(lambda (uniq, dup), item: (uniq, dup | {item}) if item in uniq else (uniq | {item}, dup), iterable, (set(), set())))(some_iterable)

回答 23

list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()
[(lset.add(item), list2.append(item))
 for item in list2 if item not in lset]
print list(lset)
list2 = [1, 2, 3, 4, 1, 2, 3]
lset = set()
[(lset.add(item), list2.append(item))
 for item in list2 if item not in lset]
print list(lset)

回答 24

一线解决方案:

set([i for i in list if sum([1 for a in list if a == i]) > 1])

One line solution:

set([i for i in list if sum([1 for a in list if a == i]) > 1])

回答 25

这里有很多答案,但是我认为这是一种相对易读且易于理解的方法:

def get_duplicates(sorted_list):
    duplicates = []
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            duplicates.append(x)
        last = x
    return set(duplicates)

笔记:

  • 如果您希望保留重复计数,请取消强制转换为底部的“ set”以获取完整列表
  • 如果您更喜欢使用生成器,请用yield x和底部的return语句替换duplicates.append(x)(可以稍后进行设置)

There are a lot of answers up here, but I think this is relatively a very readable and easy to understand approach:

def get_duplicates(sorted_list):
    duplicates = []
    last = sorted_list[0]
    for x in sorted_list[1:]:
        if x == last:
            duplicates.append(x)
        last = x
    return set(duplicates)

Notes:

  • If you wish to preserve duplication count, get rid of the cast to ‘set’ at the bottom to get the full list
  • If you prefer to use generators, replace duplicates.append(x) with yield x and the return statement at the bottom (you can cast to set later)

回答 26

这是一个快速生成器,它使用字典将每个元素存储为具有布尔值的键,以检查是否已产生重复项。

对于具有所有元素都是可哈希类型的列表:

def gen_dupes(array):
    unique = {}
    for value in array:
        if value in unique and unique[value]:
            unique[value] = False
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]

对于可能包含列表的列表:

def gen_dupes(array):
    unique = {}
    for value in array:
        is_list = False
        if type(value) is list:
            value = tuple(value)
            is_list = True

        if value in unique and unique[value]:
            unique[value] = False
            if is_list:
                value = list(value)

            yield value
        else:
            unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]

Here’s a fast generator that uses a dict to store each element as a key with a boolean value for checking if the duplicate item has already been yielded.

For lists with all elements that are hashable types:

def gen_dupes(array):
    unique = {}
    for value in array:
        if value in unique and unique[value]:
            unique[value] = False
            yield value
        else:
            unique[value] = True

array = [1, 2, 2, 3, 4, 1, 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, 1, 6]

For lists that might contain lists:

def gen_dupes(array):
    unique = {}
    for value in array:
        is_list = False
        if type(value) is list:
            value = tuple(value)
            is_list = True

        if value in unique and unique[value]:
            unique[value] = False
            if is_list:
                value = list(value)

            yield value
        else:
            unique[value] = True

array = [1, 2, 2, [1, 2], 3, 4, [1, 2], 5, 2, 6, 6]
print(list(gen_dupes(array)))
# => [2, [1, 2], 6]

回答 27

def removeduplicates(a):
  seen = set()

  for i in a:
    if i not in seen:
      seen.add(i)
  return seen 

print(removeduplicates([1,1,2,2]))
def removeduplicates(a):
  seen = set()

  for i in a:
    if i not in seen:
      seen.add(i)
  return seen 

print(removeduplicates([1,1,2,2]))

回答 28

使用toolz时

from toolz import frequencies, valfilter

a = [1,2,2,3,4,5,4]
>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2,4] 

When using toolz:

from toolz import frequencies, valfilter

a = [1,2,2,3,4,5,4]
>>> list(valfilter(lambda count: count > 1, frequencies(a)).keys())
[2,4] 

回答 29

这是我必须这样做的方法,因为我向自己提出挑战,不要使用其他方法:

def dupList(oldlist):
    if type(oldlist)==type((2,2)):
        oldlist=[x for x in oldlist]
    newList=[]
    newList=newList+oldlist
    oldlist=oldlist
    forbidden=[]
    checkPoint=0
    for i in range(len(oldlist)):
        #print 'start i', i
        if i in forbidden:
            continue
        else:
            for j in range(len(oldlist)):
                #print 'start j', j
                if j in forbidden:
                    continue
                else:
                    #print 'after Else'
                    if i!=j: 
                        #print 'i,j', i,j
                        #print oldlist
                        #print newList
                        if oldlist[j]==oldlist[i]:
                            #print 'oldlist[i],oldlist[j]', oldlist[i],oldlist[j]
                            forbidden.append(j)
                            #print 'forbidden', forbidden
                            del newList[j-checkPoint]
                            #print newList
                            checkPoint=checkPoint+1
    return newList

因此您的示例工作方式为:

>>>a = [1,2,3,3,3,4,5,6,6,7]
>>>dupList(a)
[1, 2, 3, 4, 5, 6, 7]

this is the way I had to do it because I challenged myself not to use other methods:

def dupList(oldlist):
    if type(oldlist)==type((2,2)):
        oldlist=[x for x in oldlist]
    newList=[]
    newList=newList+oldlist
    oldlist=oldlist
    forbidden=[]
    checkPoint=0
    for i in range(len(oldlist)):
        #print 'start i', i
        if i in forbidden:
            continue
        else:
            for j in range(len(oldlist)):
                #print 'start j', j
                if j in forbidden:
                    continue
                else:
                    #print 'after Else'
                    if i!=j: 
                        #print 'i,j', i,j
                        #print oldlist
                        #print newList
                        if oldlist[j]==oldlist[i]:
                            #print 'oldlist[i],oldlist[j]', oldlist[i],oldlist[j]
                            forbidden.append(j)
                            #print 'forbidden', forbidden
                            del newList[j-checkPoint]
                            #print newList
                            checkPoint=checkPoint+1
    return newList

so your sample works as:

>>>a = [1,2,3,3,3,4,5,6,6,7]
>>>dupList(a)
[1, 2, 3, 4, 5, 6, 7]

随机播放DataFrame行

问题:随机播放DataFrame行

我有以下DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

从csv文件读取DataFrame。所有具有Type1的行都在最上面,然后是具有Type2 的行,然后是具有Type3 的行,依此类推。

我想重新整理DataFrame行的顺序,以便将所有行Type混合在一起。可能的结果可能是:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

我该如何实现?

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame’s rows, so that all Type‘s are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?


回答 0

使用Pandas的惯用方式是使用.sample数据框的方法对所有行进行采样而无需替换:

df.sample(frac=1)

frac关键字参数指定的行的分数到随机样品中返回,所以frac=1装置返回所有行(随机顺序)。


注意: 如果您希望就地改组数据帧并重置索引,则可以执行例如

df = df.sample(frac=1).reset_index(drop=True)

在此,指定drop=True可防止.reset_index创建包含旧索引条目的列。

后续注解:尽管上面的操作似乎并不就位,但是python / pandas足够聪明,不会为经过改组的对象做另一个malloc。也就是说,即使参考对象已更改(我的意思id(df_old)是与相同id(df_new)),底层C对象仍然相同。为了证明确实如此,您可以运行一个简单的内存探查器:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).


Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.

Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)


回答 1

您可以为此简单地使用sklearn

from sklearn.utils import shuffle
df = shuffle(df)

You can simply use sklearn for this

from sklearn.utils import shuffle
df = shuffle(df)

回答 2

您可以通过使用改组后的索引建立索引来改组数据帧的行。为此,您可以使用np.random.permutation(但np.random.choice也可以):

In [12]: df = pd.read_csv(StringIO(s), sep="\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

如果要像示例中那样将索引的编号始终保持为1、2,..,n,则只需重置索引即可: df_shuffled.reset_index(drop=True)

You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can eg use np.random.permutation (but np.random.choice is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep="\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want to keep the index numbered from 1, 2, .., n as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True)


回答 3

TL; DRnp.random.shuffle(ndarray)可以胜任。
所以,在你的情况下

np.random.shuffle(DataFrame.values)

DataFrame在后台,使用NumPy ndarray作为数据持有者。(您可以从DataFrame源代码检查)

因此,如果使用np.random.shuffle(),它将沿多维数组的第一个轴随机排列数组。但是DataFrame遗体的索引仍然没有改组。

虽然,有一些要考虑的问题。

  • 函数不返回任何内容。如果要保留原始对象的副本,则必须这样做,然后再传递给该函数。
  • sklearn.utils.shuffle(),如用户tj89所建议的那样,可以指定random_state其他选项来控制输出。您可能需要出于开发目的。
  • sklearn.utils.shuffle()是比较快的。但洗牌的轴信息(索引,列)DataFrame与沿ndarray它包含的内容。

基准结果

sklearn.utils.shuffle()和之间np.random.shuffle()

ndarray

nd = sklearn.utils.shuffle(nd)

0.10793248389381915秒 快8倍

np.random.shuffle(nd)

0.8897626010002568秒

数据框

df = sklearn.utils.shuffle(df)

0.3183923360193148秒 快3倍

np.random.shuffle(df.values)

0.9357550159329548秒

结论:如果可以将轴信息(索引,列)与ndarray一起改组,请使用sklearn.utils.shuffle()。否则,使用np.random.shuffle()

使用的代码

import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)

TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case

np.random.shuffle(DataFrame.values)

DataFrame, under the hood, uses NumPy ndarray as data holder. (You can check from DataFrame source code)

So if you use np.random.shuffle(), it would shuffles the array along the first axis of a multi-dimensional array. But index of the DataFrame remains unshuffled.

Though, there are some points to consider.

  • function returns none. In case you want to keep a copy of the original object, you have to do so before you pass to the function.
  • sklearn.utils.shuffle(), as user tj89 suggested, can designate random_state along with another option to control output. You may want that for dev purpose.
  • sklearn.utils.shuffle() is faster. But WILL SHUFFLE the axis info(index, column) of the DataFrame along with the ndarray it contains.

Benchmark result

between sklearn.utils.shuffle() and np.random.shuffle().

ndarray

nd = sklearn.utils.shuffle(nd)

0.10793248389381915 sec. 8x faster

np.random.shuffle(nd)

0.8897626010002568 sec

DataFrame

df = sklearn.utils.shuffle(df)

0.3183923360193148 sec. 3x faster

np.random.shuffle(df.values)

0.9357550159329548 sec

Conclusion: If it is okay to axis info(index, column) to be shuffled along with ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle()

used code

import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)


回答 4

(我没有足够的声誉在最高职位上对此发表评论,所以我希望其他人可以为我这样做。)第一种方法引起了人们的关注:

df.sample(frac=1)

进行深拷贝或只是更改数据框。我运行了以下代码:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

我的结果是:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

这意味着该方法返回上一个注释中建议的相同对象。因此,此方法的确可以制作随机的副本

(I don’t have enough reputation to comment this on the top post, so I hope someone else can do that for me.) There was a concern raised that the first method:

df.sample(frac=1)

made a deep copy or just changed the dataframe. I ran the following code:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.


回答 5

还有用的是,如果将其用于Machine_learning并且希望始终分离相同的数据,则可以使用:

df.sample(n=len(df), random_state=42)

这样可以确保您的随机选择始终可复制

What is also useful, if you use it for Machine_learning and want to seperate always the same data, you could use:

df.sample(n=len(df), random_state=42)

this makes sure, that you keep your random choice always replicatable


回答 6

AFAIK最简单的解决方案是:

df_shuffled = df.reindex(np.random.permutation(df.index))

AFAIK the simplest solution is:

df_shuffled = df.reindex(np.random.permutation(df.index))

回答 7

通过取样阵列中的这种情况下,洗牌大熊猫数据帧索引和随机那么它的顺序来设置所述阵列的数据帧的索引。现在根据索引对数据帧进行排序。这是您经过改组的数据框

import random
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = [i for i in range(df.shape[0])]
random.shuffle(index)
df.set_index([index]).sort_index()

输出

    a   b
0   2   6
1   1   5
2   3   7
3   4   8

在上面的代码中将数据框插入我的位置。

shuffle the pandas data frame by taking a sample array in this case index and randomize its order then set the array as an index of data frame. Now sort the data frame according to index. Here goes your shuffled dataframe

import random
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = [i for i in range(df.shape[0])]
random.shuffle(index)
df.set_index([index]).sort_index()

output

    a   b
0   2   6
1   1   5
2   3   7
3   4   8

Insert you data frame in the place of mine in above code .


回答 8

这是另一种方式:

df['rnd'] = np.random.rand(len(df)) df = df.sort_values(by='rnd', inplace=True).drop('rnd', axis=1)

Here is another way:

df['rnd'] = np.random.rand(len(df)) df = df.sort_values(by='rnd', inplace=True).drop('rnd', axis=1)


python文件扩展名.pyc .pyd .pyo代表什么?

问题:python文件扩展名.pyc .pyd .pyo代表什么?

这些python文件扩展名是什么意思?

  • .pyc
  • .pyd
  • .pyo

它们之间有什么区别,它们是如何从* .py文件生成的?

What do these python file extensions mean?

  • .pyc
  • .pyd
  • .pyo

What are the differences between them and how are they generated from a *.py file?


回答 0

  1. .py:这通常是您编写的输入源代码。
  2. .pyc:这是编译后的字节码。如果您导入模块,则python将构建一个*.pyc包含字节码的文件,以使以后再次(更快)地再次导入它。
  3. .pyo:这是在Python 3.5之前用于*.pyc通过优化(-O)标志创建的文件的文件格式。(请参阅下面的注释)
  4. .pyd:这基本上是Windows dll文件。http://docs.python.org/faq/windows.html#is-a-pyd-file-the-same-as-a-dll

另外,对于某些.pyc与vs 有关的讨论.pyo,请查看:http : //www.network-theory.co.uk/docs/pytut/CompiledPythonfiles.html(我已复制了下面的重要部分)

  • 当使用-O标志调用Python解释器时,将生成优化的代码并将其存储在’.pyo’文件中。目前,优化器没有太大帮助。它仅删除断言语句。当使用-O时,所有字节码都被优化;.pyc文件将被忽略,.py文件将被编译为优化的字节码。
  • 将两个-O标志传递给Python解释器(-OO)将导致字节码编译器执行优化,这在极少数情况下可能会导致程序故障。当前仅从__doc__字节码中删除了字符串,从而生成了更紧凑的“ .pyo”文件。由于某些程序可能依赖于这些程序的可用性,因此只有在知道自己在做什么的情况下才应使用此选项。
  • 从’.pyc’或’.pyo’文件中读取程序比从’.py’文件中读取程序运行得更快。关于“ .pyc”或“ .pyo”文件,唯一更快的是它们的加载速度。
  • 通过在命令行中给出脚本名称来运行脚本时,该脚本的字节码永远不会写入“ .pyc”或“ .pyo”文件。因此,可以通过将脚本的大部分代码移至模块并使用较小的引导脚本来导入该模块来减少脚本的启动时间。也可以直接在命令行上命名“ .pyc”或“ .pyo”文件。

注意:

2015年15月15日,Python 3.5版本实现了PEP-488,并删除了.pyo文件。这意味着.pyc文件代表未优化和优化的字节码。

  1. .py: This is normally the input source code that you’ve written.
  2. .pyc: This is the compiled bytecode. If you import a module, python will build a *.pyc file that contains the bytecode to make importing it again later easier (and faster).
  3. .pyo: This was a file format used before Python 3.5 for *.pyc files that were created with optimizations (-O) flag. (see the note below)
  4. .pyd: This is basically a windows dll file. http://docs.python.org/faq/windows.html#is-a-pyd-file-the-same-as-a-dll

Also for some further discussion on .pyc vs .pyo, take a look at: http://www.network-theory.co.uk/docs/pytut/CompiledPythonfiles.html (I’ve copied the important part below)

  • When the Python interpreter is invoked with the -O flag, optimized code is generated and stored in ‘.pyo’ files. The optimizer currently doesn’t help much; it only removes assert statements. When -O is used, all bytecode is optimized; .pyc files are ignored and .py files are compiled to optimized bytecode.
  • Passing two -O flags to the Python interpreter (-OO) will cause the bytecode compiler to perform optimizations that could in some rare cases result in malfunctioning programs. Currently only __doc__ strings are removed from the bytecode, resulting in more compact ‘.pyo’ files. Since some programs may rely on having these available, you should only use this option if you know what you’re doing.
  • A program doesn’t run any faster when it is read from a ‘.pyc’ or ‘.pyo’ file than when it is read from a ‘.py’ file; the only thing that’s faster about ‘.pyc’ or ‘.pyo’ files is the speed with which they are loaded.
  • When a script is run by giving its name on the command line, the bytecode for the script is never written to a ‘.pyc’ or ‘.pyo’ file. Thus, the startup time of a script may be reduced by moving most of its code to a module and having a small bootstrap script that imports that module. It is also possible to name a ‘.pyc’ or ‘.pyo’ file directly on the command line.

Note:

On 2015-09-15 the Python 3.5 release implemented PEP-488 and eliminated .pyo files. This means that .pyc files represent both unoptimized and optimized bytecode.


回答 1

  • .py-常规脚本
  • .py3-(很少使用)Python3脚本。Python3脚本通常以“ .py”而不是“ .py3”结尾,但是我已经看过几次了
  • .pyc-编译脚本(字节码)
  • .pyo -优化的pyc文件(Python3.5的,Python将只使用PYC而非杓和PYC)
  • .pyw-在没有控制台的情况下以窗口模式运行的Python脚本;用pythonw.exe执行
  • .pyx -Cython src转换为C / C ++
  • .pyd -Windows DLL编写的Python脚本
  • .pxd -Cython脚本,等效于C / C ++头
  • .pxi -MyPy存根
  • .pyi-存根文件(PEP 484
  • .pyz -Python脚本存档(PEP 441); 这是一个在标准Python脚本标头之后包含二进制格式的压缩Python脚本(ZIP)的脚本
  • .pywz-适用于MS-Windows的Python脚本存档(PEP 441); 这是一个在标准Python脚本标头之后包含二进制格式的压缩Python脚本(ZIP)的脚本
  • .py [cod] -“ .gitignore”中的通配符表示文件可能是“ .pyc”,“。pyo”或“ .pyd”。
  • .pth-路径配置文件;其内容是要添加到的其他项(每行一个)sys.path。参见site模块。

可以在http://dcjtech.info/topic/python-file-extensions/找到更多其他Python文件扩展名的列表(大多数是罕见的和非正式的)。

  • .py – Regular script
  • .py3 – (rarely used) Python3 script. Python3 scripts usually end with “.py” not “.py3”, but I have seen that a few times
  • .pyc – compiled script (Bytecode)
  • .pyo – optimized pyc file (As of Python3.5, Python will only use pyc rather than pyo and pyc)
  • .pyw – Python script to run in Windowed mode, without a console; executed with pythonw.exe
  • .pyx – Cython src to be converted to C/C++
  • .pyd – Python script made as a Windows DLL
  • .pxd – Cython script which is equivalent to a C/C++ header
  • .pxi – MyPy stub
  • .pyi – Stub file (PEP 484)
  • .pyz – Python script archive (PEP 441); this is a script containing compressed Python scripts (ZIP) in binary form after the standard Python script header
  • .pywz – Python script archive for MS-Windows (PEP 441); this is a script containing compressed Python scripts (ZIP) in binary form after the standard Python script header
  • .py[cod] – wildcard notation in “.gitignore” that means the file may be “.pyc”, “.pyo”, or “.pyd”.
  • .pth – a path configuration file; its contents are additional items (one per line) to be added to sys.path. See site module.

A larger list of additional Python file-extensions (mostly rare and unofficial) can be found at http://dcjtech.info/topic/python-file-extensions/


错误:“’dict’对象没有属性’iteritems’”

问题:错误:“’dict’对象没有属性’iteritems’”

我正在尝试使用NetworkX读取Shapefile并使用该函数write_shp()生成将包含节点和边的Shapefile,但是当我尝试运行代码时,出现以下错误:

Traceback (most recent call last):   File
"C:/Users/Felipe/PycharmProjects/untitled/asdf.py", line 4, in
<module>
    nx.write_shp(redVial, "shapefiles")   File "C:\Python34\lib\site-packages\networkx\readwrite\nx_shp.py", line
192, in write_shp
    for key, data in e[2].iteritems(): AttributeError: 'dict' object has no attribute 'iteritems'

我正在使用Python 3.4,并通过pip install安装了NetworkX。

在发生此错误之前,它已经给我另一个提示“ xrange不存在”或类似名称,因此我进行了查找,然后将其更改xrangerangenx_shp.py文件,似乎可以解决该问题。

根据我的阅读,它可能与Python版本(Python2 vs Python3)有关。

I’m trying to use NetworkX to read a Shapefile and use the function write_shp() to generate the Shapefiles that will contain the nodes and edges, but when I try to run the code it gives me the following error:

Traceback (most recent call last):   File
"C:/Users/Felipe/PycharmProjects/untitled/asdf.py", line 4, in
<module>
    nx.write_shp(redVial, "shapefiles")   File "C:\Python34\lib\site-packages\networkx\readwrite\nx_shp.py", line
192, in write_shp
    for key, data in e[2].iteritems(): AttributeError: 'dict' object has no attribute 'iteritems'

I’m using Python 3.4 and installed NetworkX via pip install.

Before this error it had already given me another one that said “xrange does not exist” or something like that, so I looked it up and just changed xrange to range in the nx_shp.py file, which seemed to solve it.

From what I’ve read it could be related to the Python version (Python2 vs Python3).


回答 0

正如您在python3中一样,请使用dict.items()代替dict.iteritems()

iteritems() 已在python3中删除,因此您无法再使用此方法。

看一下Python 3.0 Wiki的“ 内置更改”部分,其中指出:

删除dict.iteritems()dict.iterkeys()dict.itervalues()

相反:使用dict.items()dict.keys()dict.values() 分别。

As you are in python3 , use dict.items() instead of dict.iteritems()

iteritems() was removed in python3, so you can’t use this method anymore.

Take a look at Python 3.0 Wiki Built-in Changes section, where it is stated:

Removed dict.iteritems(), dict.iterkeys(), and dict.itervalues().

Instead: use dict.items(), dict.keys(), and dict.values() respectively.


回答 1

Python2中,我们有.items().iteritems()在字典中。dict.items()返回字典中的元组列表[(k1,v1),(k2,v2),...]。它复制了字典中的所有元组并创建了新列表。如果字典很大,则对内存的影响很大。

因此,他们dict.iteritems()在更高版本的Python2中创建了代码。此返回的迭代器对象。未复制整个词典,因此内存消耗较少。使用人Python2被教导要使用dict.iteritems()的,而不是.items()如下面的代码解释了效率。

import timeit

d = {i:i*2 for i in xrange(10000000)}  
start = timeit.default_timer()
for key,value in d.items():
    tmp = key + value #do something like print
t1 = timeit.default_timer() - start

start = timeit.default_timer()
for key,value in d.iteritems():
    tmp = key + value
t2 = timeit.default_timer() - start

输出:

Time with d.items(): 9.04773592949
Time with d.iteritems(): 2.17707300186

Python3,他们想使之更有效率,所以感动dictionary.iteritems()dict.items(),并删除.iteritems(),因为它不再需要。

您已经使用过dict.iteritems()Python3所以失败了。尝试使用dict.items()具有与相同功能dict.iteritems()Python2。这是从Python2到的一点点迁移问题Python3

In Python2, we had .items() and .iteritems() in dictionaries. dict.items() returned list of tuples in dictionary [(k1,v1),(k2,v2),...]. It copied all tuples in dictionary and created new list. If dictionary is very big, there is very big memory impact.

So they created dict.iteritems() in later versions of Python2. This returned iterator object. Whole dictionary was not copied so there is lesser memory consumption. People using Python2 are taught to use dict.iteritems() instead of .items() for efficiency as explained in following code.

import timeit

d = {i:i*2 for i in xrange(10000000)}  
start = timeit.default_timer()
for key,value in d.items():
    tmp = key + value #do something like print
t1 = timeit.default_timer() - start

start = timeit.default_timer()
for key,value in d.iteritems():
    tmp = key + value
t2 = timeit.default_timer() - start

Output:

Time with d.items(): 9.04773592949
Time with d.iteritems(): 2.17707300186

In Python3, they wanted to make it more efficient, so moved dictionary.iteritems() to dict.items(), and removed .iteritems() as it was no longer needed.

You have used dict.iteritems() in Python3 so it has failed. Try using dict.items() which has the same functionality as dict.iteritems() of Python2. This is a tiny bit migration issue from Python2 to Python3.


回答 2

我有一个类似的问题(使用3.5),每天损失1/2,但这是可行的-我退休了,只是学习Python,所以我可以帮助我的孙子(12)。

mydict2={'Atlanta':78,'Macon':85,'Savannah':72}
maxval=(max(mydict2.values()))
print(maxval)
mykey=[key for key,value in mydict2.items()if value==maxval][0]
print(mykey)
YEILDS; 
85
Macon

I had a similar problem (using 3.5) and lost 1/2 a day to it but here is a something that works – I am retired and just learning Python so I can help my grandson (12) with it.

mydict2={'Atlanta':78,'Macon':85,'Savannah':72}
maxval=(max(mydict2.values()))
print(maxval)
mykey=[key for key,value in mydict2.items()if value==maxval][0]
print(mykey)
YEILDS; 
85
Macon

回答 3

在Python2 中,该功能dictionary.iteritems()dictionary.items()Python3中的效率更高,该功能dictionary.iteritems()已迁移到dictionary.items()iteritems()已删除。因此,您将收到此错误。

dict.items()在Python3中使用,与Python2相同dict.iteritems()

In Python2, dictionary.iteritems() is more efficient than dictionary.items() so in Python3, the functionality of dictionary.iteritems() has been migrated to dictionary.items() and iteritems() is removed. So you are getting this error.

Use dict.items() in Python3 which is same as dict.iteritems() of Python2.


回答 4

这样做的目的.iteritems()是通过在循环时一次产生一个结果来使用较少的存储空间。我不确定为什么Python 3版本不支持,iteritems()尽管事实证明它比.items()

如果要包含同时支持PY版本2和3的代码,

try:
    iteritems
except NameError:
    iteritems = items

如果您在其他系统上部署项目并且不确定PY版本,这将有所帮助。

The purpose of .iteritems() was to use less memory space by yielding one result at a time while looping. I am not sure why Python 3 version does not support iteritems()though it’s been proved to be efficient than .items()

If you want to include a code that supports both the PY version 2 and 3,

try:
    iteritems
except NameError:
    iteritems = items

This can help if you deploy your project in some other system and you aren’t sure about the PY version.


回答 5

正如RafaelC回答的那样,Python 3重命名了dict.iteritems-> dict.items。尝试使用其他软件包版本。这将列出可用的软件包:

python -m pip install yourOwnPackageHere==

然后重新运行您要在==之后尝试安装/切换版本的版本

As answered by RafaelC, Python 3 renamed dict.iteritems -> dict.items. Try a different package version. This will list available packages:

python -m pip install yourOwnPackageHere==

Then rerun with the version you will try after == to install/switch version


控制台中的文本进度栏[关闭]

问题:控制台中的文本进度栏[关闭]

我编写了一个简单的控制台应用程序,使用ftplib从FTP服务器上载和下载文件。

我希望该应用程序向用户展示其下载/上传进度的一些可视化;每次下载数据块时,我都希望它提供进度更新,即使它只是数字表示形式(如百分比)。

重要的是,我要避免擦除前一行中已打印到控制台的所有文本(即,我不想在打印更新的进度时“清除”整个终端)。

这似乎是一项相当普通的任务-如何在保留先前程序输出的同时,制作进度条或类似的可视化内容输出到控制台?

I wrote a simple console app to upload and download files from an FTP server using the ftplib.

I would like the app to show some visualization of its download/upload progress for the user; each time a data chunk is downloaded, I would like it to provide a progress update, even if it’s just a numeric representation like a percentage.

Importantly, I want to avoid erasing all the text that’s been printed to the console in previous lines (i.e. I don’t want to “clear” the entire terminal while printing the updated progress).

This seems a fairly common task – how can I go about making a progress bar or similar visualization that outputs to my console while preserving prior program output?


回答 0

一个简单的,可定制的进度条

以下是我经常使用的许多答案的汇总(不需要导入)。

# Print iterations progress
def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█', printEnd = "\r"):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
        printEnd    - Optional  : end character (e.g. "\r", "\r\n") (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = printEnd)
    # Print New Line on Complete
    if iteration == total: 
        print()

注意:这是针对Python 3的;有关在Python 2中使用此功能的详细信息,请参见注释。

样品用量

import time

# A List of Items
items = list(range(0, 57))
l = len(items)

# Initial call to print 0% progress
printProgressBar(0, l, prefix = 'Progress:', suffix = 'Complete', length = 50)
for i, item in enumerate(items):
    # Do stuff...
    time.sleep(0.1)
    # Update Progress Bar
    printProgressBar(i + 1, l, prefix = 'Progress:', suffix = 'Complete', length = 50)

样本输出:

Progress: |█████████████████████████████████████████████-----| 90.0% Complete

更新资料

注释中讨论了一个选项,该选项允许进度条动态调整为终端窗口宽度。虽然我不建议这样做,但是这里有一个实现此功能的要点(并注意了一些警告)。

A Simple, Customizable Progress Bar

Here’s an aggregate of many of the answers below that I use regularly (no imports required).

# Print iterations progress
def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█', printEnd = "\r"):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
        printEnd    - Optional  : end character (e.g. "\r", "\r\n") (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = printEnd)
    # Print New Line on Complete
    if iteration == total: 
        print()

Note: This is for Python 3; see the comments for details on using this in Python 2.

Sample Usage

import time

# A List of Items
items = list(range(0, 57))
l = len(items)

# Initial call to print 0% progress
printProgressBar(0, l, prefix = 'Progress:', suffix = 'Complete', length = 50)
for i, item in enumerate(items):
    # Do stuff...
    time.sleep(0.1)
    # Update Progress Bar
    printProgressBar(i + 1, l, prefix = 'Progress:', suffix = 'Complete', length = 50)

Sample Output:

Progress: |█████████████████████████████████████████████-----| 90.0% Complete

Update

There was discussion in the comments regarding an option that allows the progress bar to adjust dynamically to the terminal window width. While I don’t recommend this, here’s a gist that implements this feature (and notes the caveats).


回答 1

写入“ \ r”会将光标移回该行的开头。

这将显示一个百分比计数器:

import time
import sys

for i in range(100):
    time.sleep(1)
    sys.stdout.write("\r%d%%" % i)
    sys.stdout.flush()

Writing ‘\r’ will move the cursor back to the beginning of the line.

This displays a percentage counter:

import time
import sys

for i in range(100):
    time.sleep(1)
    sys.stdout.write("\r%d%%" % i)
    sys.stdout.flush()

回答 2

tqdm:在一秒钟内将进度表添加到循环中

>>> import time
>>> from tqdm import tqdm
>>> for i in tqdm(range(100)):
...     time.sleep(1)
... 
|###-------| 35/100  35% [elapsed: 00:35 left: 01:05,  1.00 iters/sec]

tqdm repl会话

tqdm: add a progress meter to your loops in a second:

>>> import time
>>> from tqdm import tqdm
>>> for i in tqdm(range(100)):
...     time.sleep(1)
... 
|###-------| 35/100  35% [elapsed: 00:35 left: 01:05,  1.00 iters/sec]

tqdm repl session


回答 3

将a写入\r控制台。这是一个“回车符”,它导致其后的所有文本在行的开头被回显。就像是:

def update_progress(progress):
    print '\r[{0}] {1}%'.format('#'*(progress/10), progress)

这会给你类似的东西: [ ########## ] 100%

Write a \r to the console. That is a “carriage return” which causes all text after it to be echoed at the beginning of the line. Something like:

def update_progress(progress):
    print '\r[{0}] {1}%'.format('#'*(progress/10), progress)

which will give you something like: [ ########## ] 100%


回答 4

它少于10行代码。

要点在这里:https//gist.github.com/vladignatyev/06860ec2040cb497f0f3

import sys


def progress(count, total, suffix=''):
    bar_len = 60
    filled_len = int(round(bar_len * count / float(total)))

    percents = round(100.0 * count / float(total), 1)
    bar = '=' * filled_len + '-' * (bar_len - filled_len)

    sys.stdout.write('[%s] %s%s ...%s\r' % (bar, percents, '%', suffix))
    sys.stdout.flush()  # As suggested by Rom Ruben

在此处输入图片说明

It is less than 10 lines of code.

The gist here: https://gist.github.com/vladignatyev/06860ec2040cb497f0f3

import sys


def progress(count, total, suffix=''):
    bar_len = 60
    filled_len = int(round(bar_len * count / float(total)))

    percents = round(100.0 * count / float(total), 1)
    bar = '=' * filled_len + '-' * (bar_len - filled_len)

    sys.stdout.write('[%s] %s%s ...%s\r' % (bar, percents, '%', suffix))
    sys.stdout.flush()  # As suggested by Rom Ruben

enter image description here


回答 5

尝试由Python的Mozart,Armin Ronacher编写的点击库。

$ pip install click # both 2 and 3 compatible

要创建一个简单的进度条:

import click

with click.progressbar(range(1000000)) as bar:
    for i in bar:
        pass 

看起来是这样的:

# [###-------------------------------]    9%  00:01:14

自定义您的内心内容:

import click, sys

with click.progressbar(range(100000), file=sys.stderr, show_pos=True, width=70, bar_template='(_(_)=%(bar)sD(_(_| %(info)s', fill_char='=', empty_char=' ') as bar:
    for i in bar:
        pass

自定义外观:

(_(_)===================================D(_(_| 100000/100000 00:00:02

还有更多选项,请参阅API文档

 click.progressbar(iterable=None, length=None, label=None, show_eta=True, show_percent=None, show_pos=False, item_show_func=None, fill_char='#', empty_char='-', bar_template='%(label)s [%(bar)s] %(info)s', info_sep=' ', width=36, file=None, color=None)

Try the click library written by the Mozart of Python, Armin Ronacher.

$ pip install click # both 2 and 3 compatible

To create a simple progress bar:

import click

with click.progressbar(range(1000000)) as bar:
    for i in bar:
        pass 

This is what it looks like:

# [###-------------------------------]    9%  00:01:14

Customize to your hearts content:

import click, sys

with click.progressbar(range(100000), file=sys.stderr, show_pos=True, width=70, bar_template='(_(_)=%(bar)sD(_(_| %(info)s', fill_char='=', empty_char=' ') as bar:
    for i in bar:
        pass

Custom look:

(_(_)===================================D(_(_| 100000/100000 00:00:02

There are even more options, see the API docs:

 click.progressbar(iterable=None, length=None, label=None, show_eta=True, show_percent=None, show_pos=False, item_show_func=None, fill_char='#', empty_char='-', bar_template='%(label)s [%(bar)s] %(info)s', info_sep=' ', width=36, file=None, color=None)

回答 6

我意识到我已经迟到了,但是这是我写的一个有点百胜风格(红帽)的东西(这里并不是100%的准确性,但是如果您使用进度条来达到该准确性水平,那么您反正):

import sys

def cli_progress_test(end_val, bar_length=20):
    for i in xrange(0, end_val):
        percent = float(i) / end_val
        hashes = '#' * int(round(percent * bar_length))
        spaces = ' ' * (bar_length - len(hashes))
        sys.stdout.write("\rPercent: [{0}] {1}%".format(hashes + spaces, int(round(percent * 100))))
        sys.stdout.flush()

应该产生如下内容:

Percent: [##############      ] 69%

…括号保持固定,仅散列增加。

作为装饰者,这可能会更好。再过一天

I realize I’m late to the game, but here’s a slightly Yum-style (Red Hat) one I wrote (not going for 100% accuracy here, but if you’re using a progress bar for that level of accuracy, then you’re WRONG anyway):

import sys

def cli_progress_test(end_val, bar_length=20):
    for i in xrange(0, end_val):
        percent = float(i) / end_val
        hashes = '#' * int(round(percent * bar_length))
        spaces = ' ' * (bar_length - len(hashes))
        sys.stdout.write("\rPercent: [{0}] {1}%".format(hashes + spaces, int(round(percent * 100))))
        sys.stdout.flush()

Should produce something looking like this:

Percent: [##############      ] 69%

… where the brackets stay stationary and only the hashes increase.

This might work better as a decorator. For another day…


回答 7

检查此库:clint

它具有很多功能,包括进度条:

from time import sleep  
from random import random  
from clint.textui import progress  
if __name__ == '__main__':
    for i in progress.bar(range(100)):
        sleep(random() * 0.2)

    for i in progress.dots(range(100)):
        sleep(random() * 0.2)

链接提供了其功能的快速概述

Check this library: clint

it has a lot of features including a progress bar:

from time import sleep  
from random import random  
from clint.textui import progress  
if __name__ == '__main__':
    for i in progress.bar(range(100)):
        sleep(random() * 0.2)

    for i in progress.dots(range(100)):
        sleep(random() * 0.2)

this link provides a quick overview of its features


回答 8

这是一个用Python编写的进度条的好例子:http : //nadiana.com/animated-terminal-progress-bar-in-python

但是,如果您想自己编写。您可以使用该curses模块使事情变得容易:)

也许不是诅咒这个词更容易。但是,如果您想创建一个成熟的cui,那么curses会为您处理很多事情。

[edit]由于旧的链接已失效,因此我设置了自己的Python Progressbar版本,请在此处获取:https : //github.com/WoLpH/python-progressbar

Here’s a nice example of a progressbar written in Python: http://nadiana.com/animated-terminal-progress-bar-in-python

But if you want to write it yourself. You could use the curses module to make things easier :)

[edit] Perhaps easier is not the word for curses. But if you want to create a full-blown cui than curses takes care of a lot of stuff for you.

[edit] Since the old link is dead I have put up my own version of a Python Progressbar, get it here: https://github.com/WoLpH/python-progressbar


回答 9

import time,sys

for i in range(100+1):
    time.sleep(0.1)
    sys.stdout.write(('='*i)+(''*(100-i))+("\r [ %d"%i+"% ] "))
    sys.stdout.flush()

输出

[29%] ===================

import time,sys

for i in range(100+1):
    time.sleep(0.1)
    sys.stdout.write(('='*i)+(''*(100-i))+("\r [ %d"%i+"% ] "))
    sys.stdout.flush()

output

[ 29% ] ===================


回答 10

并且,只是添加到堆中,这是您可以使用的对象

import sys

class ProgressBar(object):
    DEFAULT_BAR_LENGTH = 65
    DEFAULT_CHAR_ON  = '='
    DEFAULT_CHAR_OFF = ' '

    def __init__(self, end, start=0):
        self.end    = end
        self.start  = start
        self._barLength = self.__class__.DEFAULT_BAR_LENGTH

        self.setLevel(self.start)
        self._plotted = False

    def setLevel(self, level):
        self._level = level
        if level < self.start:  self._level = self.start
        if level > self.end:    self._level = self.end

        self._ratio = float(self._level - self.start) / float(self.end - self.start)
        self._levelChars = int(self._ratio * self._barLength)

    def plotProgress(self):
        sys.stdout.write("\r  %3i%% [%s%s]" %(
            int(self._ratio * 100.0),
            self.__class__.DEFAULT_CHAR_ON  * int(self._levelChars),
            self.__class__.DEFAULT_CHAR_OFF * int(self._barLength - self._levelChars),
        ))
        sys.stdout.flush()
        self._plotted = True

    def setAndPlot(self, level):
        oldChars = self._levelChars
        self.setLevel(level)
        if (not self._plotted) or (oldChars != self._levelChars):
            self.plotProgress()

    def __add__(self, other):
        assert type(other) in [float, int], "can only add a number"
        self.setAndPlot(self._level + other)
        return self
    def __sub__(self, other):
        return self.__add__(-other)
    def __iadd__(self, other):
        return self.__add__(other)
    def __isub__(self, other):
        return self.__add__(-other)

    def __del__(self):
        sys.stdout.write("\n")

if __name__ == "__main__":
    import time
    count = 150
    print "starting things:"

    pb = ProgressBar(count)

    #pb.plotProgress()
    for i in range(0, count):
        pb += 1
        #pb.setAndPlot(i + 1)
        time.sleep(0.01)
    del pb

    print "done"

结果是:

starting things:
  100% [=================================================================]
done

这通常被认为是“顶上的”,但是当您经常使用它时非常方便

and, just to add to the pile, here’s an object you can use

import sys

class ProgressBar(object):
    DEFAULT_BAR_LENGTH = 65
    DEFAULT_CHAR_ON  = '='
    DEFAULT_CHAR_OFF = ' '

    def __init__(self, end, start=0):
        self.end    = end
        self.start  = start
        self._barLength = self.__class__.DEFAULT_BAR_LENGTH

        self.setLevel(self.start)
        self._plotted = False

    def setLevel(self, level):
        self._level = level
        if level < self.start:  self._level = self.start
        if level > self.end:    self._level = self.end

        self._ratio = float(self._level - self.start) / float(self.end - self.start)
        self._levelChars = int(self._ratio * self._barLength)

    def plotProgress(self):
        sys.stdout.write("\r  %3i%% [%s%s]" %(
            int(self._ratio * 100.0),
            self.__class__.DEFAULT_CHAR_ON  * int(self._levelChars),
            self.__class__.DEFAULT_CHAR_OFF * int(self._barLength - self._levelChars),
        ))
        sys.stdout.flush()
        self._plotted = True

    def setAndPlot(self, level):
        oldChars = self._levelChars
        self.setLevel(level)
        if (not self._plotted) or (oldChars != self._levelChars):
            self.plotProgress()

    def __add__(self, other):
        assert type(other) in [float, int], "can only add a number"
        self.setAndPlot(self._level + other)
        return self
    def __sub__(self, other):
        return self.__add__(-other)
    def __iadd__(self, other):
        return self.__add__(other)
    def __isub__(self, other):
        return self.__add__(-other)

    def __del__(self):
        sys.stdout.write("\n")

if __name__ == "__main__":
    import time
    count = 150
    print "starting things:"

    pb = ProgressBar(count)

    #pb.plotProgress()
    for i in range(0, count):
        pb += 1
        #pb.setAndPlot(i + 1)
        time.sleep(0.01)
    del pb

    print "done"

results in:

starting things:
  100% [=================================================================]
done

This would most commonly be considered to be “over the top”, but it’s handy when you’re using it a lot


回答 11

安装tqdm。(pip install tqdm)并按以下方式使用它:

import time
from tqdm import tqdm
for i in tqdm(range(1000)):
    time.sleep(0.01)

这是一个10秒钟的进度条,它将输出以下内容:

47%|██████████████████▊                     | 470/1000 [00:04<00:05, 98.61it/s]

Install tqdm.(pip install tqdm) and use it as follows:

import time
from tqdm import tqdm
for i in tqdm(range(1000)):
    time.sleep(0.01)

That’s a 10 seconds progress bar that’ll output something like this:

47%|██████████████████▊                     | 470/1000 [00:04<00:05, 98.61it/s]

回答 12

在Python命令行不在任何IDE或开发环境中)运行此命令

>>> import threading
>>> for i in range(50+1):
...   threading._sleep(0.5)
...   print "\r%3d" % i, ('='*i)+('-'*(50-i)),

在我的Windows系统上工作正常。

Run this at the Python command line (not in any IDE or development environment):

>>> import threading
>>> for i in range(50+1):
...   threading._sleep(0.5)
...   print "\r%3d" % i, ('='*i)+('-'*(50-i)),

Works fine on my Windows system.


回答 13


回答 14

我正在使用reddit的进度。我喜欢它,因为它可以在一行中为每个项目打印进度,并且不应该从程序中删除打印输出。

编辑:固定链接

I am using progress from reddit. I like it because it can print progress for every item in one line, and it shouldn’t erase printouts from the program.

Edit: fixed link


回答 15

尝试安装此软件包 pip install progressbar2

import time
import progressbar

for i in progressbar.progressbar(range(100)):
    time.sleep(0.02)

进度条github:https : //github.com/WoLpH/python-progressbar

Try to install this package: pip install progressbar2 :

import time
import progressbar

for i in progressbar.progressbar(range(100)):
    time.sleep(0.02)

progresssbar github: https://github.com/WoLpH/python-progressbar


回答 16

根据上述答案以及有关CLI进度栏的其他类似问题,我认为我对所有这些答案都有一个通用的答案。在https://stackoverflow.com/a/15860757/2254146上进行检查

总之,代码是这样的:

import time, sys

# update_progress() : Displays or updates a console progress bar
## Accepts a float between 0 and 1. Any int will be converted to a float.
## A value under 0 represents a 'halt'.
## A value at 1 or bigger represents 100%
def update_progress(progress):
    barLength = 10 # Modify this to change the length of the progress bar
    status = ""
    if isinstance(progress, int):
        progress = float(progress)
    if not isinstance(progress, float):
        progress = 0
        status = "error: progress var must be float\r\n"
    if progress < 0:
        progress = 0
        status = "Halt...\r\n"
    if progress >= 1:
        progress = 1
        status = "Done...\r\n"
    block = int(round(barLength*progress))
    text = "\rPercent: [{0}] {1}% {2}".format( "#"*block + "-"*(barLength-block), progress*100, status)
    sys.stdout.write(text)
    sys.stdout.flush()

好像

百分比:[###########] 99.0%

based on the above answers and other similar questions about CLI progress bar, I think I got a general common answer to all of them. Check it at https://stackoverflow.com/a/15860757/2254146

In summary, the code is this:

import time, sys

# update_progress() : Displays or updates a console progress bar
## Accepts a float between 0 and 1. Any int will be converted to a float.
## A value under 0 represents a 'halt'.
## A value at 1 or bigger represents 100%
def update_progress(progress):
    barLength = 10 # Modify this to change the length of the progress bar
    status = ""
    if isinstance(progress, int):
        progress = float(progress)
    if not isinstance(progress, float):
        progress = 0
        status = "error: progress var must be float\r\n"
    if progress < 0:
        progress = 0
        status = "Halt...\r\n"
    if progress >= 1:
        progress = 1
        status = "Done...\r\n"
    block = int(round(barLength*progress))
    text = "\rPercent: [{0}] {1}% {2}".format( "#"*block + "-"*(barLength-block), progress*100, status)
    sys.stdout.write(text)
    sys.stdout.flush()

Looks like

Percent: [##########] 99.0%


回答 17

我建议使用tqdm- https : //pypi.python.org/pypi/tqdm- 这样可以很容易地将任何可迭代项或进程转换为进度条,并处理所有与所需终端相关的问题。

在文档中:“ tqdm可以轻松支持回调/挂钩和手动更新。这是urllib的示例”

import urllib
from tqdm import tqdm

def my_hook(t):
  """
  Wraps tqdm instance. Don't forget to close() or __exit__()
  the tqdm instance once you're done with it (easiest using `with` syntax).

  Example
  -------

  >>> with tqdm(...) as t:
  ...     reporthook = my_hook(t)
  ...     urllib.urlretrieve(..., reporthook=reporthook)

  """
  last_b = [0]

  def inner(b=1, bsize=1, tsize=None):
    """
    b  : int, optional
        Number of blocks just transferred [default: 1].
    bsize  : int, optional
        Size of each block (in tqdm units) [default: 1].
    tsize  : int, optional
        Total size (in tqdm units). If [default: None] remains unchanged.
    """
    if tsize is not None:
        t.total = tsize
    t.update((b - last_b[0]) * bsize)
    last_b[0] = b
  return inner

eg_link = 'http://www.doc.ic.ac.uk/~cod11/matryoshka.zip'
with tqdm(unit='B', unit_scale=True, miniters=1,
          desc=eg_link.split('/')[-1]) as t:  # all optional kwargs
    urllib.urlretrieve(eg_link, filename='/dev/null',
                       reporthook=my_hook(t), data=None)

I recommend using tqdm – https://pypi.python.org/pypi/tqdm – which makes it simple to turn any iterable or process into a progress bar, and handles all messing about with terminals needed.

From the documentation: “tqdm can easily support callbacks/hooks and manual updates. Here’s an example with urllib”

import urllib
from tqdm import tqdm

def my_hook(t):
  """
  Wraps tqdm instance. Don't forget to close() or __exit__()
  the tqdm instance once you're done with it (easiest using `with` syntax).

  Example
  -------

  >>> with tqdm(...) as t:
  ...     reporthook = my_hook(t)
  ...     urllib.urlretrieve(..., reporthook=reporthook)

  """
  last_b = [0]

  def inner(b=1, bsize=1, tsize=None):
    """
    b  : int, optional
        Number of blocks just transferred [default: 1].
    bsize  : int, optional
        Size of each block (in tqdm units) [default: 1].
    tsize  : int, optional
        Total size (in tqdm units). If [default: None] remains unchanged.
    """
    if tsize is not None:
        t.total = tsize
    t.update((b - last_b[0]) * bsize)
    last_b[0] = b
  return inner

eg_link = 'http://www.doc.ic.ac.uk/~cod11/matryoshka.zip'
with tqdm(unit='B', unit_scale=True, miniters=1,
          desc=eg_link.split('/')[-1]) as t:  # all optional kwargs
    urllib.urlretrieve(eg_link, filename='/dev/null',
                       reporthook=my_hook(t), data=None)

回答 18

一个非常简单的解决方案是将以下代码放入循环中:

将其放在文件的正文(即顶部)中:

import sys

把它放在循环体中:

sys.stdout.write("-") # prints a dash for each iteration of loop
sys.stdout.flush() # ensures bar is displayed incrementally

A very simple solution is to put this code into your loop:

Put this in the body (i.e. top) of your file:

import sys

Put this in the body of your loop:

sys.stdout.write("-") # prints a dash for each iteration of loop
sys.stdout.flush() # ensures bar is displayed incrementally

回答 19

import sys
def progresssbar():
         for i in range(100):
            time.sleep(1)
            sys.stdout.write("%i\r" % i)

progressbar()

注意:如果您在互动式拦截器中运行此命令,则会打印出额外的数字

import sys
def progresssbar():
         for i in range(100):
            time.sleep(1)
            sys.stdout.write("%i\r" % i)

progressbar()

NOTE: if you run this in interactive interepter you get extra numbers printed out


回答 20

大声笑我只是为此写了很多东西,代码要记住,在执行块ascii时不能使用unicode,我使用cp437

import os
import time
def load(left_side, right_side, length, time):
    x = 0
    y = ""
    print "\r"
    while x < length:
        space = length - len(y)
        space = " " * space
        z = left + y + space + right
        print "\r", z,
        y += "█"
        time.sleep(time)
        x += 1
    cls()

你这样称呼它

print "loading something awesome"
load("|", "|", 10, .01)

所以看起来像这样

loading something awesome
|█████     |

lol i just wrote a whole thingy for this heres the code keep in mind you cant use unicode when doing block ascii i use cp437

import os
import time
def load(left_side, right_side, length, time):
    x = 0
    y = ""
    print "\r"
    while x < length:
        space = length - len(y)
        space = " " * space
        z = left + y + space + right
        print "\r", z,
        y += "█"
        time.sleep(time)
        x += 1
    cls()

and you call it like so

print "loading something awesome"
load("|", "|", 10, .01)

so it looks like this

loading something awesome
|█████     |

回答 21

有了以上出色的建议,我就会得出进度条。

但是我想指出一些缺点

  1. 每次刷新进度条时,它将在新行开始

    print('\r[{0}]{1}%'.format('#' * progress* 10, progress))  

    像这样:
    [] 0%
    [#] 10%
    [##] 20%
    [###] 30%

2.当“ ###”变长时,方括号“]”和右侧的百分数右移。
3.如果表达式“ progress / 10”不能返回整数,则会发生错误。

并且以下代码将解决上述问题。

def update_progress(progress, total):  
    print('\r[{0:10}]{1:>2}%'.format('#' * int(progress * 10 /total), progress), end='')

With the great advices above I work out the progress bar.

However I would like to point out some shortcomings

  1. Every time the progress bar is flushed, it will start on a new line

    print('\r[{0}]{1}%'.format('#' * progress* 10, progress))  
    

    like this:
    [] 0%
    [#]10%
    [##]20%
    [###]30%

2.The square bracket ‘]’ and the percent number on the right side shift right as the ‘###’ get longer.
3. An error will occur if the expression ‘progress / 10’ can not return an integer.

And the following code will fix the problem above.

def update_progress(progress, total):  
    print('\r[{0:10}]{1:>2}%'.format('#' * int(progress * 10 /total), progress), end='')

回答 22

python终端进度条的代码

import sys
import time

max_length = 5
at_length = max_length
empty = "-"
used = "%"

bar = empty * max_length

for i in range(0, max_length):
    at_length -= 1

    #setting empty and full spots
    bar = used * i
    bar = bar+empty * at_length

    #\r is carriage return(sets cursor position in terminal to start of line)
    #\0 character escape

    sys.stdout.write("[{}]\0\r".format(bar))
    sys.stdout.flush()

    #do your stuff here instead of time.sleep
    time.sleep(1)

sys.stdout.write("\n")
sys.stdout.flush()

Code for python terminal progress bar

import sys
import time

max_length = 5
at_length = max_length
empty = "-"
used = "%"

bar = empty * max_length

for i in range(0, max_length):
    at_length -= 1

    #setting empty and full spots
    bar = used * i
    bar = bar+empty * at_length

    #\r is carriage return(sets cursor position in terminal to start of line)
    #\0 character escape

    sys.stdout.write("[{}]\0\r".format(bar))
    sys.stdout.flush()

    #do your stuff here instead of time.sleep
    time.sleep(1)

sys.stdout.write("\n")
sys.stdout.flush()

回答 23

我写了一个简单的进度栏:

def bar(total, current, length=10, prefix="", filler="#", space=" ", oncomp="", border="[]", suffix=""):
    if len(border) != 2:
        print("parameter 'border' must include exactly 2 symbols!")
        return None

    print(prefix + border[0] + (filler * int(current / total * length) +
                                      (space * (length - int(current / total * length)))) + border[1], suffix, "\r", end="")
    if total == current:
        if oncomp:
            print(prefix + border[0] + space * int(((length - len(oncomp)) / 2)) +
                  oncomp + space * int(((length - len(oncomp)) / 2)) + border[1], suffix)
        if not oncomp:
            print(prefix + border[0] + (filler * int(current / total * length) +
                                        (space * (length - int(current / total * length)))) + border[1], suffix)

如您所见,它具有:条形的长度,前缀和后缀,填充符,空格,条形图在100%(oncomp)上的文本和边框

这里有个例子:

from time import sleep, time
start_time = time()
for i in range(10):
    pref = str((i+1) * 10) + "% "
    complete_text = "done in %s sec" % str(round(time() - start_time))
    sleep(1)
    bar(10, i + 1, length=20, prefix=pref, oncomp=complete_text)

进行中:

30% [######              ]

完成:

100% [   done in 9 sec   ] 

i wrote a simple progressbar:

def bar(total, current, length=10, prefix="", filler="#", space=" ", oncomp="", border="[]", suffix=""):
    if len(border) != 2:
        print("parameter 'border' must include exactly 2 symbols!")
        return None

    print(prefix + border[0] + (filler * int(current / total * length) +
                                      (space * (length - int(current / total * length)))) + border[1], suffix, "\r", end="")
    if total == current:
        if oncomp:
            print(prefix + border[0] + space * int(((length - len(oncomp)) / 2)) +
                  oncomp + space * int(((length - len(oncomp)) / 2)) + border[1], suffix)
        if not oncomp:
            print(prefix + border[0] + (filler * int(current / total * length) +
                                        (space * (length - int(current / total * length)))) + border[1], suffix)

as you can see, it have: length of bar, prefix and suffix, filler, space, text in bar on 100%(oncomp) and borders

here an example:

from time import sleep, time
start_time = time()
for i in range(10):
    pref = str((i+1) * 10) + "% "
    complete_text = "done in %s sec" % str(round(time() - start_time))
    sleep(1)
    bar(10, i + 1, length=20, prefix=pref, oncomp=complete_text)

out in progress:

30% [######              ]

out on complete:

100% [   done in 9 sec   ] 

回答 24

整理一下我在这里找到的一些想法,并增加估计的剩余时间:

import datetime, sys

start = datetime.datetime.now()

def print_progress_bar (iteration, total):

    process_duration_samples = []
    average_samples = 5

    end = datetime.datetime.now()

    process_duration = end - start

    if len(process_duration_samples) == 0:
        process_duration_samples = [process_duration] * average_samples

    process_duration_samples = process_duration_samples[1:average_samples-1] + [process_duration]
    average_process_duration = sum(process_duration_samples, datetime.timedelta()) / len(process_duration_samples)
    remaining_steps = total - iteration
    remaining_time_estimation = remaining_steps * average_process_duration

    bars_string = int(float(iteration) / float(total) * 20.)
    sys.stdout.write(
        "\r[%-20s] %d%% (%s/%s) Estimated time left: %s" % (
            '='*bars_string, float(iteration) / float(total) * 100,
            iteration,
            total,
            remaining_time_estimation
        ) 
    )
    sys.stdout.flush()
    if iteration + 1 == total:
        print 


# Sample usage

for i in range(0,300):
    print_progress_bar(i, 300)

Putting together some of the ideas I found here, and adding estimated time left:

import datetime, sys

start = datetime.datetime.now()

def print_progress_bar (iteration, total):

    process_duration_samples = []
    average_samples = 5

    end = datetime.datetime.now()

    process_duration = end - start

    if len(process_duration_samples) == 0:
        process_duration_samples = [process_duration] * average_samples

    process_duration_samples = process_duration_samples[1:average_samples-1] + [process_duration]
    average_process_duration = sum(process_duration_samples, datetime.timedelta()) / len(process_duration_samples)
    remaining_steps = total - iteration
    remaining_time_estimation = remaining_steps * average_process_duration

    bars_string = int(float(iteration) / float(total) * 20.)
    sys.stdout.write(
        "\r[%-20s] %d%% (%s/%s) Estimated time left: %s" % (
            '='*bars_string, float(iteration) / float(total) * 100,
            iteration,
            total,
            remaining_time_estimation
        ) 
    )
    sys.stdout.flush()
    if iteration + 1 == total:
        print 


# Sample usage

for i in range(0,300):
    print_progress_bar(i, 300)

回答 25

对于python 3:

def progress_bar(current_value, total):
    increments = 50
    percentual = ((current_value/ total) * 100)
    i = int(percentual // (100 / increments ))
    text = "\r[{0: <{1}}] {2}%".format('=' * i, increments, percentual)
    print(text, end="\n" if percentual == 100 else "")

For python 3:

def progress_bar(current_value, total):
    increments = 50
    percentual = ((current_value/ total) * 100)
    i = int(percentual // (100 / increments ))
    text = "\r[{0: <{1}}] {2}%".format('=' * i, increments, percentual)
    print(text, end="\n" if percentual == 100 else "")

回答 26

好了,这里的代码行得通,我在发布之前对其进行了测试:

import sys
def prg(prog, fillchar, emptchar):
    fillt = 0
    emptt = 20
    if prog < 100 and prog > 0:
        prog2 = prog/5
        fillt = fillt + prog2
        emptt = emptt - prog2
        sys.stdout.write("\r[" + str(fillchar)*fillt + str(emptchar)*emptt + "]" + str(prog) + "%")
        sys.stdout.flush()
    elif prog >= 100:
        prog = 100
        prog2 = prog/5
        fillt = fillt + prog2
        emptt = emptt - prog2
        sys.stdout.write("\r[" + str(fillchar)*fillt + str(emptchar)*emptt + "]" + str(prog) + "%" + "\nDone!")
        sys.stdout.flush()
    elif prog < 0:
        prog = 0
        prog2 = prog/5
        fillt = fillt + prog2
        emptt = emptt - prog2
        sys.stdout.write("\r[" + str(fillchar)*fillt + str(emptchar)*emptt + "]" + str(prog) + "%" + "\nHalted!")
        sys.stdout.flush()

优点:

  • 20个字符栏(每5个字符(1个数字))
  • 自定义填充字符
  • 自定义空字符
  • 停止(0以下的任何数字)
  • 完成(100和大于100的任何数字)
  • 进度计数(0-100(以下及以上用于特殊功能))
  • 条形旁边的百分比数字,它是一行

缺点:

  • 仅支持整数(不过,可以通过将除法设置为整数除法来对其进行修改以支持它们,因此只需更改prog2 = prog/5为即可prog2 = int(prog/5)

Well here is code that works and I tested it before posting:

import sys
def prg(prog, fillchar, emptchar):
    fillt = 0
    emptt = 20
    if prog < 100 and prog > 0:
        prog2 = prog/5
        fillt = fillt + prog2
        emptt = emptt - prog2
        sys.stdout.write("\r[" + str(fillchar)*fillt + str(emptchar)*emptt + "]" + str(prog) + "%")
        sys.stdout.flush()
    elif prog >= 100:
        prog = 100
        prog2 = prog/5
        fillt = fillt + prog2
        emptt = emptt - prog2
        sys.stdout.write("\r[" + str(fillchar)*fillt + str(emptchar)*emptt + "]" + str(prog) + "%" + "\nDone!")
        sys.stdout.flush()
    elif prog < 0:
        prog = 0
        prog2 = prog/5
        fillt = fillt + prog2
        emptt = emptt - prog2
        sys.stdout.write("\r[" + str(fillchar)*fillt + str(emptchar)*emptt + "]" + str(prog) + "%" + "\nHalted!")
        sys.stdout.flush()

Pros:

  • 20 character bar (1 character for every 5 (number wise))
  • Custom fill characters
  • Custom empty characters
  • Halt (any number below 0)
  • Done (100 and any number above 100)
  • Progress count (0-100 (below and above used for special functions))
  • Percentage number next to bar, and it’s a single line

Cons:

  • Supports integers only (It can be modified to support them though, by making the division an integer division, so just change prog2 = prog/5 to prog2 = int(prog/5))

回答 27

这是我的Python 3解决方案:

import time
for i in range(100):
    time.sleep(1)
    s = "{}% Complete".format(i)
    print(s,end=len(s) * '\b')

对于字符串中的每个字符,’\ b’是一个反斜杠。这在Windows cmd窗口中不起作用。

Here’s my Python 3 solution:

import time
for i in range(100):
    time.sleep(1)
    s = "{}% Complete".format(i)
    print(s,end=len(s) * '\b')

‘\b’ is a backslash, for each character in your string. This does not work within the Windows cmd window.


回答 28

Greenstick 2.7的功能:

def printProgressBar (iteration, total, prefix = '', suffix = '',decimals = 1, length = 100, fill = '#'):

percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
filledLength = int(length * iteration // total)
bar = fill * filledLength + '-' * (length - filledLength)
print'\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix),
sys.stdout.flush()
# Print New Line on Complete                                                                                                                                                                                                              
if iteration == total:
    print()

function from Greenstick for 2.7:

def printProgressBar (iteration, total, prefix = '', suffix = '',decimals = 1, length = 100, fill = '#'):

percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
filledLength = int(length * iteration // total)
bar = fill * filledLength + '-' * (length - filledLength)
print'\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix),
sys.stdout.flush()
# Print New Line on Complete                                                                                                                                                                                                              
if iteration == total:
    print()

回答 29

python模块progressbar是一个不错的选择。这是我的典型代码:

import time
import progressbar

widgets = [
    ' ', progressbar.Percentage(),
    ' ', progressbar.SimpleProgress(format='(%(value_s)s of %(max_value_s)s)'),
    ' ', progressbar.Bar('>', fill='.'),
    ' ', progressbar.ETA(format_finished='- %(seconds)s  -', format='ETA: %(seconds)s', ),
    ' - ', progressbar.DynamicMessage('loss'),
    ' - ', progressbar.DynamicMessage('error'),
    '                          '
]

bar = progressbar.ProgressBar(redirect_stdout=True, widgets=widgets)
bar.start(100)
for i in range(100):
    time.sleep(0.1)
    bar.update(i + 1, loss=i / 100., error=i)
bar.finish()

The python module progressbar is a nice choice. Here is my typical code:

import time
import progressbar

widgets = [
    ' ', progressbar.Percentage(),
    ' ', progressbar.SimpleProgress(format='(%(value_s)s of %(max_value_s)s)'),
    ' ', progressbar.Bar('>', fill='.'),
    ' ', progressbar.ETA(format_finished='- %(seconds)s  -', format='ETA: %(seconds)s', ),
    ' - ', progressbar.DynamicMessage('loss'),
    ' - ', progressbar.DynamicMessage('error'),
    '                          '
]

bar = progressbar.ProgressBar(redirect_stdout=True, widgets=widgets)
bar.start(100)
for i in range(100):
    time.sleep(0.1)
    bar.update(i + 1, loss=i / 100., error=i)
bar.finish()

用Python解压缩文件

问题:用Python解压缩文件

我通读了zipfile文档,但不明白如何解压缩文件,只能解压缩文件。如何将zip文件的所有内容解压缩到同一目录中?

I read through the zipfile documentation, but couldn’t understand how to unzip a file, only how to zip a file. How do I unzip all the contents of a zip file into the same directory?


回答 0

import zipfile
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

差不多了!

import zipfile
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

That’s pretty much it!


回答 1

如果您使用的是Python 3.2或更高版本:

import zipfile
with zipfile.ZipFile("file.zip","r") as zip_ref:
    zip_ref.extractall("targetdir")

您不需要使用closetry / catch,因为它使用了 上下文管理器构造。

If you are using Python 3.2 or later:

import zipfile
with zipfile.ZipFile("file.zip","r") as zip_ref:
    zip_ref.extractall("targetdir")

You dont need to use the close or try/catch with this as it uses the context manager construction.


回答 2

extractall如果您使用的是Python 2.6+,请使用该方法

zip = ZipFile('file.zip')
zip.extractall()

Use the extractall method, if you’re using Python 2.6+

zip = ZipFile('file.zip')
zip.extractall()

回答 3

您也只能导入ZipFile

from zipfile import ZipFile
zf = ZipFile('path_to_file/file.zip', 'r')
zf.extractall('path_to_extract_folder')
zf.close()

适用于Python 2Python 3

You can also import only ZipFile:

from zipfile import ZipFile
zf = ZipFile('path_to_file/file.zip', 'r')
zf.extractall('path_to_extract_folder')
zf.close()

Works in Python 2 and Python 3.


回答 4

尝试这个 :


import zipfile
def un_zipFiles(path):
    files=os.listdir(path)
    for file in files:
        if file.endswith('.zip'):
            filePath=path+'/'+file
            zip_file = zipfile.ZipFile(filePath)
            for names in zip_file.namelist():
                zip_file.extract(names,path)
            zip_file.close() 

path:解压缩文件的路径

try this :


import zipfile
def un_zipFiles(path):
    files=os.listdir(path)
    for file in files:
        if file.endswith('.zip'):
            filePath=path+'/'+file
            zip_file = zipfile.ZipFile(filePath)
            for names in zip_file.namelist():
                zip_file.extract(names,path)
            zip_file.close() 

path : unzip file’s path


回答 5

import os 
zip_file_path = "C:\AA\BB"
file_list = os.listdir(path)
abs_path = []
for a in file_list:
    x = zip_file_path+'\\'+a
    print x
    abs_path.append(x)
for f in abs_path:
    zip=zipfile.ZipFile(f)
    zip.extractall(zip_file_path)

如果文件不是zip,则不包含对该文件的验证。如果文件夹包含非.zip文件,它将失败。

import os 
zip_file_path = "C:\AA\BB"
file_list = os.listdir(path)
abs_path = []
for a in file_list:
    x = zip_file_path+'\\'+a
    print x
    abs_path.append(x)
for f in abs_path:
    zip=zipfile.ZipFile(f)
    zip.extractall(zip_file_path)

This does not contain validation for the file if its not zip. If the folder contains non .zip file it will fail.