Tag Archive: Python

Define a lambda expression that raises an Exception

Question: Define a lambda expression that raises an Exception

How can I write a lambda expression that’s equivalent to:

def x():
    raise Exception()

The following is not allowed:

y = lambda : raise Exception()

Answer 0

There is more than one way to skin a Python:

y = lambda: (_ for _ in ()).throw(Exception('foobar'))

Lambdas don't accept statements. Since raise ex is a statement, you could write a general purpose raiser:

def raise_(ex):
    raise ex

y = lambda: raise_(Exception('foobar'))

But if your goal is to avoid a def, this obviously doesn’t cut it. It does, however, allow you to conditionally raise exceptions, e.g.:

y = lambda x: 2*x if x < 10 else raise_(Exception('foobar'))

Alternatively you can raise an exception without defining a named function. All you need is a strong stomach (and 2.x for the given code):

type(lambda:0)(type((lambda:0).func_code)(
  1,1,1,67,'|\0\0\202\1\0',(),(),('x',),'','',1,''),{}
)(Exception())

And a python3 strong stomach solution:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
)(Exception())

Thanks @WarrenSpencer for pointing out a very simple answer if you don’t care which exception is raised: y = lambda: 1/0.


Answer 1

How about:

lambda x: exec('raise(Exception(x))')

Answer 2

Actually, there is a way, but it’s very contrived.

You can create a code object using the compile() built-in function. This allows you to use the raise statement (or any other statement, for that matter), but it raises another challenge: executing the code object. The usual way would be to use the exec statement, but that leads you back to the original problem, namely that you can’t execute statements in a lambda (or an eval(), for that matter).

The solution is a hack. Callables like the result of a lambda expression all have an attribute __code__, which can actually be replaced. So, if you create a callable and replace its __code__ value with the code object from above, you get something that can be evaluated without using statements. Achieving all this, though, results in very obscure code:

map(lambda x, y, z: x.__setattr__(y, z) or x, [lambda: 0], ["__code__"], [compile("raise Exception", "", "single")])[0]()

The above does the following:

  • the compile() call creates a code object that raises the exception;

  • the lambda: 0 returns a callable that does nothing but return the value 0 — this is used to execute the above code object later;

  • the lambda x, y, z creates a function that calls the __setattr__ method of the first argument with the remaining arguments, AND RETURNS THE FIRST ARGUMENT! This is necessary, because __setattr__ itself returns None;

  • the map() call takes the result of lambda: 0, and using the lambda x, y, z replaces its __code__ object with the result of the compile() call. The result of this map operation is a list with one entry, the one returned by lambda x, y, z, which is why we need this lambda: if we used __setattr__ right away, we would lose the reference to the lambda: 0 object!

  • finally, the first (and only) element of the list returned by the map() call is executed, resulting in the code object being called, ultimately raising the desired exception.

It works (tested in Python 2.6), but it’s definitely not pretty.

One last note: if you have access to the types module (which would require using the import statement before your eval), then you can shorten this code down a bit: using types.FunctionType() you can create a function that will execute the given code object, so you won’t need the hack of creating a dummy function with lambda: 0 and replacing the value of its __code__ attribute.


Answer 3

Functions created with lambda forms cannot contain statements.
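
For instance (a minimal demonstration), placing the raise statement in a lambda body is rejected by the parser:

>>> y = lambda: raise Exception()
  File "<stdin>", line 1
    y = lambda: raise Exception()
                ^
SyntaxError: invalid syntax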


Answer 4

If all you want is a lambda expression that raises an arbitrary exception, you can accomplish this with an illegal expression. For instance, lambda x: [][0] will attempt to access the first element in an empty list, which will raise an IndexError.

PLEASE NOTE: This is a hack, not a feature. Do not use this in any (non code-golf) code that another human being might see or use.


Answer 5

I’d like to give an explanation of the UPDATE 3 of the answer provided by Marcelo Cantos:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
)(Exception())

Explanation

lambda: 0 is an instance of the builtins.function class.
type(lambda: 0) is the builtins.function class.
(lambda: 0).__code__ is a code object.
A code object is an object which holds the compiled bytecode among other things. It is defined here in CPython https://github.com/python/cpython/blob/master/Include/code.h. Its methods are implemented here https://github.com/python/cpython/blob/master/Objects/codeobject.c. We can run the help on the code object:

Help on code object:

class code(object)
 |  code(argcount, kwonlyargcount, nlocals, stacksize, flags, codestring,
 |        constants, names, varnames, filename, name, firstlineno,
 |        lnotab[, freevars[, cellvars]])
 |  
 |  Create a code object.  Not for the faint of heart.

type((lambda: 0).__code__) is the code class.
So when we say

type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b'')

we are calling the constructor of the code object with the following arguments:

  • argcount=1
  • kwonlyargcount=0
  • nlocals=1
  • stacksize=1
  • flags=67
  • codestring=b'|\0\202\1\0'
  • constants=()
  • names=()
  • varnames=('x',)
  • filename=''
  • name=''
  • firstlineno=1
  • lnotab=b''

You can read about what the arguments mean in the definition of the PyCodeObject https://github.com/python/cpython/blob/master/Include/code.h. The value of 67 for the flags argument is for example CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE.

The most important argument is the codestring, which contains the instruction opcodes. Let’s see what they mean.

>>> import dis
>>> dis.dis(b'|\0\202\1\0')
          0 LOAD_FAST                0 (0)
          2 RAISE_VARARGS            1
          4 <0>

The documentation of opcodes can be found here: https://docs.python.org/3.8/library/dis.html#python-bytecode-instructions. The first byte is the opcode for LOAD_FAST, the second byte is its argument, i.e. 0.

LOAD_FAST(var_num)
    Pushes a reference to the local co_varnames[var_num] onto the stack.

So we push the reference to x onto the stack. varnames is a tuple of strings containing only 'x'. We will push the only argument of the function we are defining onto the stack.

The next byte is the opcode for RAISE_VARARGS and the next byte is its argument i.e. 1.

RAISE_VARARGS(argc)
    Raises an exception using one of the 3 forms of the raise statement, depending on the value of argc:
        0: raise (re-raise previous exception)
        1: raise TOS (raise exception instance or type at TOS)
        2: raise TOS1 from TOS (raise exception instance or type at TOS1 with __cause__ set to TOS)

The TOS is the top-of-stack. Since we pushed the first argument (x) of our function onto the stack and argc is 1, we will raise x if it is an exception instance, or make an instance of x and raise it otherwise.

The last byte i.e. 0 is not used. It is not a valid opcode. It might as well not be there.

Going back to the code snippet we are analyzing:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
)(Exception())

We called the constructor of the code object:

type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b'')

We pass the code object and an empty dictionary to the constructor of a function object:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
)

Let’s call help on a function object to see what the arguments mean.

Help on class function in module builtins:

class function(object)
 |  function(code, globals, name=None, argdefs=None, closure=None)
 |  
 |  Create a function object.
 |  
 |  code
 |    a code object
 |  globals
 |    the globals dictionary
 |  name
 |    a string that overrides the name from the code object
 |  argdefs
 |    a tuple that specifies the default argument values
 |  closure
 |    a tuple that supplies the bindings for free variables

We then call the constructed function passing an Exception instance as an argument. Consequently we called a lambda function which raises an exception. Let’s run the snippet and see that it indeed works as intended.

>>> type(lambda: 0)(type((lambda: 0).__code__)(
...     1,0,1,1,67,b'|\0\202\1\0',(),(),('x',),'','',1,b''),{}
... )(Exception())
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "", line 1, in 
Exception

Improvements

We saw that the last byte of the bytecode is useless. Let’s not clutter this complicated expression needlessly. Let’s remove that byte. Also, if we want to golf a little, we could omit the instantiation of Exception and instead pass the Exception class as an argument. Those changes would result in the following code:

type(lambda: 0)(type((lambda: 0).__code__)(
    1,0,1,1,67,b'|\0\202\1',(),(),('x',),'','',1,b''),{}
)(Exception)

When we run it we will get the same result as before. It’s just shorter.


What exactly is Python's file.flush() doing?

Question: What exactly is Python's file.flush() doing?

I found this in the Python documentation for File Objects:

flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior.

So my question is: what exactly is Python’s flush doing? I thought that it forces to write data to the disk, but now I see that it doesn’t. Why?


Answer 0

There are typically two levels of buffering involved:

  1. Internal buffers
  2. Operating system buffers

The internal buffers are buffers created by the runtime/library/language that you’re programming against and are meant to speed things up by avoiding system calls for every write. Instead, when you write to a file object, you write into its buffer, and whenever the buffer fills up, the data is written to the actual file using system calls.

However, due to the operating system buffers, this might not mean that the data is written to disk. It may just mean that the data is copied from the buffers maintained by your runtime into the buffers maintained by the operating system.

If you write something, and it ends up in the buffer (only), and the power is cut to your machine, that data is not on disk when the machine turns off.

So, in order to help with that you have the flush and fsync methods, on their respective objects.

The first, flush, will simply write out any data that lingers in a program buffer to the actual file. Typically this means that the data will be copied from the program buffer to the operating system buffer.

Specifically what this means is that if another process has that same file open for reading, it will be able to access the data you just flushed to the file. However, it does not necessarily mean it has been “permanently” stored on disk.

To do that, you need to call the os.fsync method which ensures all operating system buffers are synchronized with the storage devices they’re for, in other words, that method will copy data from the operating system buffers to the disk.

Typically you don’t need to bother with either method, but if you’re in a scenario where paranoia about what actually ends up on disk is a good thing, you should make both calls as instructed.
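
A minimal sketch of that pattern (the file name is just illustrative):

import os

with open("data.txt", "w") as f:
    f.write("important data")
    f.flush()             # program buffer -> OS buffer
    os.fsync(f.fileno())  # OS buffer -> disk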


Addendum in 2018.

Note that disks with cache mechanisms are now much more common than back in 2013, so now there are even more levels of caching and buffering involved. I assume these buffers will be handled by the sync/flush calls as well, but I don’t really know.


Answer 1

Because the operating system may not do so. The flush operation forces the file data into the file cache in RAM, and from there it’s the OS’s job to actually send it to the disk.


Answer 2

It flushes the internal buffer, which is supposed to cause the OS to write out the buffer to the file.[1] Python uses the OS’s default buffering unless you configure it to do otherwise.

But sometimes the OS still chooses not to cooperate. Especially with wonderful things like write-delays in Windows/NTFS. Basically the internal buffer is flushed, but the OS buffer is still holding on to it. So you have to tell the OS to write it to disk with os.fsync() in those cases.

[1] http://docs.python.org/library/stdtypes.html


Answer 3

Basically, flush() cleans out your RAM buffer; its real power is that it lets you continue to write to it afterwards – but it shouldn’t be thought of as the best or safest way to write to a file. It’s flushing your RAM for more data to come, that is all. If you want to ensure data gets written to a file safely, use close() instead.
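
For what it's worth, a with block gives you that close() behaviour automatically (a small illustrative sketch):

with open("out.txt", "w") as f:
    f.write("data")
# the file is flushed and closed here, when the block exits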


Getting the number of elements in an iterator in Python

Question: Getting the number of elements in an iterator in Python

Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?


Answer 0

No. It’s not possible.

Example:

import random

def gen(n):
    for i in xrange(n):
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)

The length of the iterator is unknown until you iterate through it.


Answer 1

This code should work:

>>> iter = (i for i in range(50))
>>> sum(1 for _ in iter)
50

Although it does iterate through each item and count them, it is the fastest way to do so.

It also works for when the iterator has no item:

>>> sum(1 for _ in range(0))
0

Of course, it runs forever for an infinite input, so remember that iterators can be infinite:

>>> sum(1 for _ in itertools.count())
[nothing happens, forever]

Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.


Answer 2

No, any method will require you to resolve every result. You can do

iter_length = len(list(iterable))

but running that on an infinite iterator will of course never return. It also will consume the iterator and it will need to be reset if you want to use the contents.

Telling us what real problem you’re trying to solve might help us find you a better way to accomplish your actual goal.

Edit: Using list() will read the whole iterable into memory at once, which may be undesirable. Another way is to do

sum(1 for _ in iterable)

as another person posted. That will avoid keeping it in memory.


Answer 3

You cannot (unless the type of a particular iterator implements some specific methods that make it possible).

Generally, you can count iterator items only by consuming the iterator. This is probably one of the most efficient ways:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(For Python 3.x replace itertools.izip with zip).


Answer 4

Kinda. You could check the __length_hint__ method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it’s an undocumented implementation detail (following message in thread) that could very well vanish or summon nasal demons instead.

Otherwise, no. Iterators are just objects that only expose the next() method. You can call it as many times as required and they may or may not eventually raise StopIteration. Luckily, this behaviour is most of the time transparent to the coder. :)
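
For what it's worth, since Python 3.4 the standard library also exposes this hint through operator.length_hint, but it is still only a hint, not a guarantee:

>>> import operator
>>> operator.length_hint([1, 2, 3])              # uses __len__ when available
3
>>> operator.length_hint(iter(range(10)))        # range iterators provide a hint
10
>>> operator.length_hint(x for x in range(10))   # generators don't; the default is 0
0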


Answer 5

I like the cardinality package for this; it is very lightweight and tries to use the fastest possible implementation available depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

The actual count() implementation is as follows:

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

Answer 6

So, for those who would like to know the summary of that discussion: the final top scores for counting a 50-million-element generator expression using:

  • len(list(gen)),
  • len([_ for _ in gen]),
  • sum(1 for _ in gen),
  • ilen(gen) (from more_itertool),
  • reduce(lambda c, i: c + 1, gen, 0),

sorted by performance of execution (including memory consumption), may surprise you:

1: test_list.py:8: 0.492 KiB

gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))

('list, sec', 1.9684218849870376)

2: test_list_compr.py:8: 0.867 KiB

gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])

('list_compr, sec', 2.5885991149989422)

3: test_sum.py:8: 0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()

('sum, sec', 3.441088170016883)

4: more_itertools/more.py:413: 1.266 KiB

d = deque(enumerate(iterable, 1), maxlen=1)

test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)

('ilen, sec', 9.812256851990242)

5: test_reduce.py:8: 0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)

('reduce, sec', 13.436614598002052)

So, len(list(gen)) is the fastest and the least memory-consuming option.


Answer 7

An iterator is just an object with a pointer to the next object to be read by some kind of buffer or stream; it’s like a LinkedList where you don’t know how many things you have until you iterate through them. Iterators are meant to be efficient because all they do is tell you what is next by reference instead of using indexing (but as you saw, you lose the ability to see how many entries remain).
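
A tiny illustration of that "pointer to the next object" view:

>>> it = iter([10, 20, 30])
>>> next(it)   # advances the pointer; there is no way to ask how many items remain
10
>>> next(it)
20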


Answer 8

Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.

Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I’m a contributor to PySAM and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not a method of counting alignments in constant time.
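
A rough sketch of that extrapolation idea in plain Python (not the pysam API; names are illustrative):

import os

def estimate_total(f, items_read):
    """Extrapolate the total item count from the current file position."""
    pos = f.tell()                        # bytes consumed so far
    size = os.fstat(f.fileno()).st_size   # total file size in bytes
    return int(items_read * size / pos) if pos else 0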


Answer 9

A quick benchmark:

import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):           
    return sum(1 for _ in iterable)

iter = lambda y: (x for x in xrange(y))

%timeit count_iter_items(iter(1000))
%timeit count_lencheck(iter(1000))
%timeit count_sum(iter(1000))

The results:

10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop

I.e. the simple count_iter_items is the way to go.

Adjusting this for python3:

61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 10

There are two ways to get the length of “something” on a computer.

The first way is to store a count – this requires anything that touches the file/data to modify it (or a class that only exposes interfaces — but it boils down to the same thing).

The other way is to iterate over it and count how big it is.
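
A minimal sketch of the first approach, as a wrapper that maintains the count itself (names are illustrative):

class CountingIterator:
    """Wrap an iterator and count items as they are consumed."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)  # propagates StopIteration when exhausted
        self.count += 1
        return value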


Answer 11

It’s common practice to put this type of information in the file header, and for pysam to give you access to this. I don’t know the format, but have you checked the API?

As others have said, you can’t know the length from the iterator.


Answer 12

This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.

An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.


Answer 13

Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.

$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(items)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
... 
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(items)
>>> items.count
48

Answer 14

def count_iter(iter):
    sum = 0
    for _ in iter: sum += 1
    return sum

Answer 15

Presumably, you want to count the number of items without iterating through them, so that the iterator is not exhausted and you can use it again later. This is possible with copy or deepcopy:

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

The output is “Finding the length did not exhaust the iterator!”.

Optionally (and unadvisedly), you can shadow the built-in len function as follows:

import copy

def len(obj, *, len=len):
    try:
        if hasattr(obj, "__len__"):
            r = len(obj)
        elif hasattr(obj, "__next__"):
            r = sum(1 for _ in copy.copy(obj))
        else:
            r = len(obj)
    finally:
        pass
    return r

OSError: [Errno 2] No such file or directory when using python subprocess in Django

Question: OSError: [Errno 2] No such file or directory when using python subprocess in Django

I am trying to run a program to make some system calls inside Python code using subprocess.call() which throws the following error:

Traceback (most recent call last):
      File "<console>", line 1, in <module>
      File "/usr/lib/python2.7/subprocess.py", line 493, in call
      return Popen(*popenargs, **kwargs).wait()
      File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
      File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
      raise child_exception
      OSError: [Errno 2] No such file or directory

My actual Python code is as follows:

url = "/media/videos/3cf02324-43e5-4996-bbdf-6377df448ae4.mp4"
real_path = "/home/chanceapp/webapps/chanceapp/chanceapp"+url
fake_crop_path = "/home/chanceapp/webapps/chanceapp/chanceapp/fake1"+url
fake_rotate_path = "/home/chanceapp/webapps/chanceapp.chanceapp/fake2"+url
crop = "ffmpeg -i %s -vf "%(real_path)+"crop=400:400:0:0 "+ "-strict -2 %s"%(fake_crop_path)
rotate = "ffmpeg -i %s -vf "%(fake_crop_path)+"transpose=1 "+"%s"%(fake_rotate_path)
move_rotated = "mv"+" %s"%(fake_rotate_path)+" %s"%(real_path)
delete_cropped = "rm "+"%s"%(fake_crop_path)
#system calls:
subprocess.call(crop)

Can I get some relevant advice on how to solve this?


Answer 0

Use shell=True if you’re passing a string to subprocess.call.

From the docs:

If passing a single string, either shell must be True or else the string must simply name the program to be executed without specifying any arguments.

subprocess.call(crop, shell=True)

or:

import shlex
subprocess.call(shlex.split(crop))

Answer 1

Can’t upvote, so I’ll repost @jfs’s comment because I think it should be more visible.

@AnneTheAgile: shell=True is not required. Moreover you should not use it unless it is necessary (see @valid’s comment). You should pass each command-line argument as a separate list item instead, e.g., use ['command', 'arg 1', 'arg 2'] instead of "command 'arg 1' 'arg 2'". – jfs Mar 3 ’15 at 10:02
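
Applied to the crop command from the question, the list form would look something like this (a sketch; real_path and fake_crop_path are the variables from the question):

import subprocess

subprocess.call([
    "ffmpeg", "-i", real_path,
    "-vf", "crop=400:400:0:0",
    "-strict", "-2", fake_crop_path,
])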


Answer 2

No such file or directory can also be raised if you try to pass a file argument to Popen wrapped in double quotes.

For example:

call_args = ['mv', '"path/to/file with spaces.txt"', 'somewhere']

In this case, you need to remove double-quotes.

call_args = ['mv', 'path/to/file with spaces.txt', 'somewhere']

How to add a constant column in a Spark DataFrame?

Question: How to add a constant column in a Spark DataFrame?

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:

dt.withColumn('new_column', 10).head(5)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-50-a6d0257ca2be> in <module>()
      1 dt = (messages
      2     .select(messages.fromuserid, messages.messagetype, floor(messages.datetime/(1000*60*5)).alias("dt")))
----> 3 dt.withColumn('new_column', 10).head(5)

/Users/evanzamir/spark-1.4.1/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
   1166         [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
   1167         """
-> 1168         return self.select('*', col.alias(colName))
   1169 
   1170     @ignore_unicode_prefix

AttributeError: 'int' object has no attribute 'alias'

It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the number I want (10 in this case):

dt.withColumn('new_column', dt.messagetype - dt.messagetype + 10).head(5)
[Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=47019141, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=49746356, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=93506471, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=80488242, messagetype=1, dt=4809600.0, new_column=10)]

This is supremely hacky, right? I assume there is a more legit way to do this?


Answer 0

Spark 2.2+

Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala):

import org.apache.spark.sql.functions.typedLit

df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))

Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):

The second argument for DataFrame.withColumn should be a Column so you have to use a literal:

from pyspark.sql.functions import lit

df.withColumn('new_column', lit(10))

If you need complex columns you can build these using blocks like array:

from pyspark.sql.functions import array, create_map, struct

df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))

Exactly the same methods can be used in Scala.

import org.apache.spark.sql.functions.{array, lit, map, struct}

df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))

To provide names for structs use either alias on each field:

df.withColumn(
    "some_struct",
    struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
 )

or cast on the whole object

df.withColumn(
    "some_struct", 
    struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
 )

It is also possible, although slower, to use a UDF.

Note:

The same constructs can be used to pass constant arguments to UDFs or SQL functions.
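
For example, in PySpark (a sketch; add_n is a hypothetical UDF, and the column name "id" is illustrative):

from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import IntegerType

add_n = udf(lambda x, n: x + n, IntegerType())  # hypothetical UDF
df = df.withColumn("id_plus_10", add_n(col("id"), lit(10)))  # the constant goes through lit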


Answer 1

In Spark 2.2 there are two ways to add a constant value to a column in a DataFrame:

1) Using lit

2) Using typedLit.

The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map.

Sample DataFrame:

val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1")

+---+----+
| id|col1|
+---+----+
|  0|   a|
|  1|   b|
+---+----+

1) Using lit: adding a constant string value in a new column named newcol:

import org.apache.spark.sql.functions.lit
val newdf = df.withColumn("newcol",lit("myval"))

Result:

+---+----+------+
| id|col1|newcol|
+---+----+------+
|  0|   a| myval|
|  1|   b| myval|
+---+----+------+

2) Using typedLit:

import org.apache.spark.sql.functions.typedLit
df.withColumn("newcol", typedLit(("sample", 10, .044)))

Result:

+---+----+-----------------+
| id|col1|           newcol|
+---+----+-----------------+
|  0|   a|[sample,10,0.044]|
|  1|   b|[sample,10,0.044]|
|  2|   c|[sample,10,0.044]|
+---+----+-----------------+

Why can't non-default arguments follow default arguments?

Question: Why can't non-default arguments follow default arguments?

Why does this piece of code throw a SyntaxError?

  >>> def fun1(a="who is you", b="True", x, y):
...     print a,b,x,y
... 
  File "<stdin>", line 1
SyntaxError: non-default argument follows default argument

While the following piece of code runs without visible errors:

>>> def fun1(x, y, a="who is you", b="True"):
...     print a,b,x,y
... 

Answer 0

All required parameters must be placed before any default arguments. Simply because they are mandatory, whereas default arguments are not. Syntactically, it would be impossible for the interpreter to decide which values match which arguments if mixed modes were allowed. A SyntaxError is raised if the arguments are not given in the correct order:

Let us take a look at keyword arguments, using your function.

def fun1(a="who is you", b="True", x, y):
...     print a,b,x,y

Suppose it were allowed to declare a function as above. Then, with that declaration, we could make the following (regular) positional or keyword argument calls:

func1("ok a", "ok b", 1)  # Is 1 assigned to x or ?
func1(1)                  # Is 1 assigned to a or ?
func1(1, 2)               # ?

How would you suggest the interpreter assign values to the parameters in these calls, and how would default arguments be used along with keyword arguments?

>>> def fun1(x, y, a="who is you", b="True"):
...     print a,b,x,y
... 

Reference: O’Reilly – Core Python
This function, by contrast, makes syntactically correct use of default arguments for the above function calls. Keyword argument calls prove useful for providing out-of-order positional arguments, but, coupled with default arguments, they can also be used to “skip over” missing arguments as well.


Answer 1

SyntaxError: non-default argument follows default argument

If you were to allow this, the default arguments would be rendered useless because you would never be able to use their default values, since the non-default arguments come after.

In Python 3 however, you may do the following:

def fun1(a="who is you", b="True", *, x, y):
    pass

which makes x and y keyword only so you can do this:

fun1(x=2, y=2)

This works because there is no longer any ambiguity. Note you still can’t do fun1(2, 2) (that would set the default arguments).


Answer 2

Let me clarify two points here:

  • First, a non-default argument should not follow a default argument; it means you can’t define (a=b, c) in a function. The order for defining parameters in a function is:
    • positional parameters or non-default parameters, i.e. (a, b, c)
    • keyword parameters or default parameters, i.e. (a="b", r="j")
    • the var-positional parameter, i.e. (*args)
    • the var-keyword parameter, i.e. (**kwargs)

def example(a, b, c=None, r="w", d=[], *ae, **ab):

(a, b) are positional parameters

(c=None) is an optional/default parameter

(r="w") is a keyword/default parameter

(d=[]) is a default parameter with a mutable (list) default value

(*ae) is the var-positional parameter (it collects extra positional arguments)

(**ab) is the var-keyword parameter

  • Second, if I try something like this: def example(a, b, c=a, d=b):

The arguments a and b are not defined yet when the default values are saved; Python computes and saves default values when you define the function.

c and d cannot refer to them, because a and b do not exist at that moment (they exist only when the function is executed).

So something like "a, a=b" is not allowed in a parameter list.
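
The classic demonstration that defaults are evaluated at function definition time (this example is adapted from the official Python tutorial):

>>> i = 5
>>> def f(arg=i):
...     print(arg)
...
>>> i = 6
>>> f()
5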


Answer 3

Required arguments (the ones without defaults) must be at the start, to allow client code to supply only those two. If the optional arguments were at the start, it would be confusing:

fun1("who is who", 3, "jack")

What would that do in your first example? In the last, x is “who is who”, y is 3 and a = “jack”.


Find column whose name contains a specific string

Question: Find column whose name contains a specific string

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I’m searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).

I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I’ve tried to find ways to do this, to no avail. Any tips?


Answer 0

Just iterate over DataFrame.columns; this is an example in which you will end up with a list of column names that match:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)

Output:

['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']

Explanation:

  1. df.columns returns a list of column names
  2. [col for col in df.columns if 'spike' in col] iterates over the list df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is a list comprehension.

If you only want the resulting data set with the columns that match you can do this:

df2 = df.filter(regex='spike')
print(df2)

Output:

   spike-2  spiked-in
0        1          7
1        2          8
2        3          9

Answer 1

This answer uses the DataFrame.filter method to do this without list comprehension:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)

print(df.filter(like='spike').columns)

Will output just ‘spike-2’. You can also use regex, as some people suggested in comments above:

print(df.filter(regex='spike|spke').columns)

Will output both columns: [‘spike-2’, ‘hey spke’]


Answer 2

You can also use df.columns[df.columns.str.contains(pat = 'spike')]

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

colNames = df.columns[df.columns.str.contains(pat = 'spike')] 

print(colNames)

This will output the column names: 'spike-2', 'spiked-in'

More about pandas.Series.str.contains.


Answer 3

# select columns containing 'spike'
df.filter(like='spike', axis=1)

You can also select by name or by regular expression. Refer to: pandas.DataFrame.filter


Answer 4

df.loc[:,df.columns.str.contains("spike")]

Answer 5

You also can use this code:

spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]

Answer 6

Getting name and subsetting based on Start, Contains, and Ends:

# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html




import pandas as pd



data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)



print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist() 
print("Contains")
print(colNames_contains)



print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist() 
print("Starts")
print(colNames_starts)



print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist() 
print("Ends")
print(colNames_ends)



print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)



print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)



print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)

Python TypeError:格式字符串的参数不足

问题:Python TypeError:格式字符串的参数不足

这是输出。我相信这些是 utf-8 字符串……其中一些可能是 NoneType,但它在遇到那样的值之前就立即失败了……

instr = "'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % softname, procversion, int(percent), exe, description, company, procurl

TypeError:格式字符串的参数不足

可明明是 7 个参数对 7 个占位符?

Here’s the output. These are utf-8 strings I believe… some of these can be NoneType but it fails immediately, before ones like that…

instr = "'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % softname, procversion, int(percent), exe, description, company, procurl

TypeError: not enough arguments for format string

It's 7 for 7 though?


回答 0

请注意,用 % 格式化字符串的语法正逐渐过时。如果您的 Python 版本支持,则应编写:

instr = "'{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}'".format(softname, procversion, int(percent), exe, description, company, procurl)

这也可以修复您碰巧遇到的错误。

Note that the % syntax for formatting strings is becoming outdated. If your version of Python supports it, you should write:

instr = "'{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}'".format(softname, procversion, int(percent), exe, description, company, procurl)

This also fixes the error that you happened to have.
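
For illustration, a minimal sketch with made-up stand-in values (softname, percent, etc. are invented here, not the OP's data) showing that .format() fills the placeholders positionally:

softname, procversion, percent = "FooApp", "1.2", 85.0
exe, description, company, procurl = "foo.exe", "demo", "Acme Inc", "http://example.com"

instr = "'{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}'".format(
    softname, procversion, int(percent), exe, description, company, procurl)
print(instr)
# 'FooApp', '1.2', '85', 'foo.exe', 'demo', 'Acme Inc', 'http://example.com'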


回答 1

您需要将格式参数放入元组(添加括号):

instr = "'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % (softname, procversion, int(percent), exe, description, company, procurl)

您当前拥有的等同于以下内容:

instr = ("'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % softname), procversion, int(percent), exe, description, company, procurl

例:

>>> "%s %s" % 'hello', 'world'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
>>> "%s %s" % ('hello', 'world')
'hello world'

You need to put the format arguments into a tuple (add parentheses):

instr = "'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % (softname, procversion, int(percent), exe, description, company, procurl)

What you currently have is equivalent to the following:

instr = ("'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % softname), procversion, int(percent), exe, description, company, procurl

Example:

>>> "%s %s" % 'hello', 'world'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
>>> "%s %s" % ('hello', 'world')
'hello world'

回答 2

当我在格式字符串中把 % 用作百分号字符时,出现了相同的错误。解决办法是把它写成两个:%%。

I got the same error when using % as a percent character in my format string. The solution is to double it up: %%.
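
A minimal sketch of the doubled percent sign:

# %% renders as a single literal % inside a %-formatted string
print("progress: %d%% done" % 42)
# progress: 42% done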


pandas DataFrame 获取每个组的第一行

问题:pandas DataFrame 获取每个组的第一行

我有DataFrame下面的熊猫。

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                'value'  : ["first","second","second","first",
                            "second","first","third","fourth",
                            "fifth","second","fifth","first",
                            "first","second","third","fourth","fifth"]})

我想按 ["id", "value"] 对其分组,并获取每个分组的第一行。

        id   value
0        1   first
1        1  second
2        1  second
3        2   first
4        2  second
5        3   first
6        3   third
7        3  fourth
8        3   fifth
9        4  second
10       4   fifth
11       5   first
12       6   first
13       6  second
14       6   third
15       7  fourth
16       7   fifth

预期结果

    id   value
     1   first
     2   first
     3   first
     4  second
     5  first
     6  first
     7  fourth

我尝试了以下操作,但它只给出了 DataFrame 的第一行。感谢任何相关的帮助。

In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

I have a pandas DataFrame like following.

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                'value'  : ["first","second","second","first",
                            "second","first","third","fourth",
                            "fifth","second","fifth","first",
                            "first","second","third","fourth","fifth"]})

I want to group this by ["id", "value"] and get the first row of each group.

        id   value
0        1   first
1        1  second
2        1  second
3        2   first
4        2  second
5        3   first
6        3   third
7        3  fourth
8        3   fifth
9        4  second
10       4   fifth
11       5   first
12       6   first
13       6  second
14       6   third
15       7  fourth
16       7   fifth

Expected outcome

    id   value
     1   first
     2   first
     3   first
     4  second
     5  first
     6  first
     7  fourth

I tried following which only gives the first row of the DataFrame. Any help regarding this is appreciated.

In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

回答 0

>>> df.groupby('id').first()
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

如果需要id作为列:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

要获取每组的前 n 条记录,可以使用 head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth
>>> df.groupby('id').first()
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get the first n records of each group, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth

回答 1

这将为您提供每组的第二行(零索引,nth(0)与first()相同):

df.groupby('id').nth(1) 

文档:http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group

This will give you the second row of each group (zero indexed, nth(0) is the same as first()):

df.groupby('id').nth(1) 

Documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group
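
As a sketch with the df from the question, assuming an older pandas where nth returns a frame indexed by the group key (in recent pandas versions the original row index is kept instead); groups with a single row, like id 5, have no second row and are simply dropped:

>>> df.groupby('id').nth(1)
     value
id
1   second
2   second
3    third
4    fifth
6   second
7    fifth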


回答 2

如果您需要获取第一行,我建议使用 .nth(0) 而不是 .first()。

它们的区别在于对 NaN 的处理方式:.nth(0) 会返回组的第一行,无论该行中的值是什么;而 .first() 最终返回的是每列中第一个非 NaN 的值。

例如,如果您的数据集是:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
            'value'  : ["first","second","third", np.NaN,
                        "second","first","second","third",
                        "fourth","first","second"]})

>>> df.groupby('id').nth(0)
    value
id        
1    first
2    NaN
3    first
4    first

>>> df.groupby('id').first()
    value
id        
1    first
2    second
3    first
4    first

I’d suggest to use .nth(0) rather than .first() if you need to get the first row.

The difference between them is how they handle NaNs, so .nth(0) will return the first row of group no matter what are the values in this row, while .first() will eventually return the first not NaN value in each column.

E.g. if your dataset is :

import numpy as np
import pandas as pd

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
            'value'  : ["first","second","third", np.NaN,
                        "second","first","second","third",
                        "fourth","first","second"]})

>>> df.groupby('id').nth(0)
    value
id        
1    first
2    NaN
3    first
4    first

And

>>> df.groupby('id').first()
    value
id        
1    first
2    second
3    first
4    first

回答 3

也许这就是你想要的

import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'],   ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)
df
                pop
state1 county1   12
       county2   15
       county3   65
       county4   42
state2 county1   78
       county2   67
       county3   55
       county4   31
df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

> Out[29]: 
                pop
state1 county3   65
       county4   42
       county2   15
state2 county1   78
       county2   67
       county3   55

maybe this is what you want

import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'],   ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)
df
                pop
state1 county1   12
       county2   15
       county3   65
       county4   42
state2 county1   78
       county2   67
       county3   55
       county4   31
df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

> Out[29]: 
                pop
state1 county3   65
       county4   42
       county2   15
state2 county1   78
       county2   67
       county3   55

回答 4

如果只需要每个组的第一行,可以用 drop_duplicates 来处理;注意该函数的默认参数是 keep='first'。

df.drop_duplicates('id')
Out[1027]: 
    id   value
0    1   first
3    2   first
5    3   first
9    4  second
11   5   first
12   6   first
15   7  fourth

If you only need the first row from each group, we can do it with drop_duplicates; notice the function's default is keep='first'.

df.drop_duplicates('id')
Out[1027]: 
    id   value
0    1   first
3    2   first
5    3   first
9    4  second
11   5   first
12   6   first
15   7  fourth
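
Conversely, a small sketch showing that keep='last' yields the last row of each group instead:

df.drop_duplicates('id', keep='last')
#     id   value
# 2    1  second
# 4    2  second
# 8    3   fifth
# 10   4   fifth
# 11   5   first
# 14   6   third
# 16   7   fifth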

python 中的 n-gram、四元组、五元组、六元组?

问题:python 中的 n-gram、四元组、五元组、六元组?

我正在寻找一种将文本拆分为n-gram的方法。通常我会做类似的事情:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams

我知道 nltk 只提供二元组(bigram)和三元组(trigram),但有没有办法把我的文本拆分成四元组、五元组甚至百元组?

谢谢!

I’m looking for a way to split a text into n-grams. Normally I would do something like:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams?

Thanks!


回答 0

其他用户已经给出了基于原生 Python 的出色答案。但这里给出 nltk 的做法(以防 OP 因为重新发明 nltk 库里已有的功能而被扣分)。

nltk 中有一个人们很少使用的 ngram 模块。这并不是因为读取 ngram 很困难,而是因为当 n > 3 时,基于 ngram 训练模型会导致严重的数据稀疏。

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
  print grams

Great native python based answers given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module that people seldom use in nltk. It's not because it's hard to read ngrams, but training a model based on ngrams where n > 3 will result in much data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
  print grams

回答 1

我很惊讶这还没有出现:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]

In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']

I’m surprised that this hasn’t shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]

In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']

回答 2

仅使用nltk工具

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n ):
    n_grams = ngrams(word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]

输出示例

get_ngrams('This is the simplest text i could think of', 3 )

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

为了使ngram保持数组格式,只需删除 ' '.join

Using only nltk tools

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n ):
    n_grams = ngrams(word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]

Example output

get_ngrams('This is the simplest text i could think of', 3 )

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

In order to keep the ngrams in array format just remove ' '.join


回答 3

这是做n-gram的另一种简单方法

>>> import nltk
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

Here is another simple way to do n-grams:

>>> import nltk
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

回答 4

对于需要二元组或三元组的场景,大家已经回答得很好了;但如果您需要句子的所有 n-gram(everygram),可以使用 nltk.util.everygrams:

>>> from nltk.util import everygrams

>>> message = "who let the dogs out"

>>> msg_split = message.split()

>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

如果您有长度限制,例如最长只到三元组(n 最大为 3),则可以用 max_len 参数来指定。

>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

您可以修改 max_len 参数来得到任意的 n-gram,例如四元、五元、六元甚至百元。

前面提到的那些方案也可以改造来实现同样的效果,但这个方案要直接得多。

欲了解更多信息,请点击这里

而且,当您只需要某个特定的 n-gram(例如 bigram 或 trigram)时,可以使用 M.A.Hassan 的答案中提到的 nltk.util.ngrams。

People have already answered pretty nicely for the scenario where you need bigrams or trigrams but if you need everygram for the sentence in that case you can use nltk.util.everygrams

>>> from nltk.util import everygrams

>>> message = "who let the dogs out"

>>> msg_split = message.split()

>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

In case you have a limit, like trigrams where the max length should be 3, you can use the max_len param to specify it.

>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

You can just modify the max_len param to achieve whatever gram you need, i.e. four-gram, five-gram, six-gram or even hundred-gram.

The previously mentioned solutions can be modified to implement this as well, but this solution is much more straightforward.

For further reading click here

And when you just need a specific gram like bigram or trigram etc you can use the nltk.util.ngrams as mentioned in M.A.Hassan’s answer.
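
For instance, a hedged sketch of restricting everygrams to only 4- to 6-grams via its min_len parameter (available in recent NLTK versions; the exact output order can differ across versions):

>>> list(everygrams(msg_split, min_len=4, max_len=6))
[('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]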


回答 5

您可以使用 itertools 轻松编写自己的函数来实现这一点:

# Python 2 code: on Python 3, drop izip and use the built-in zip instead
from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]

You can easily whip up your own function to do this using itertools:

# Python 2 code: on Python 3, drop izip and use the built-in zip instead
from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]

回答 6

使用 Python 内置的 zip() 构建二元组的一种更优雅的方法。只需用 split() 将原始字符串转换为列表,然后把该列表正常传入一次,再偏移一个元素传入一次即可。

string = "I really like python, it's pretty awesome."

def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])

def find_ngrams(s, n):
  input_list = s.split(" ")
  return zip(*[input_list[i:] for i in range(n)])

find_bigrams(string)

[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]

A more elegant approach to build bigrams with Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once normally and once offset by one element.

string = "I really like python, it's pretty awesome."

def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])

def find_ngrams(s, n):
  input_list = s.split(" ")
  return zip(*[input_list[i:] for i in range(n)])

find_bigrams(string)

[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]

回答 7

我从未用过 nltk,但曾在一个小的课程项目中做过 N-gram。如果您想统计字符串中所有 N-gram 出现的频率,可以用下面这种方法。D 会给出您的 N-gram 的直方图。

N = 3  # n-gram order
D = dict()
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts)-N+1): # N-grams; +1 so the final n-gram is counted
    key = tuple(strparts[i:i+N])
    D[key] = D.get(key, 0) + 1     # avoids the bare try/except

I have never dealt with nltk but did N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. D would give you the histogram of your N-grams.

N = 3  # n-gram order
D = dict()
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts)-N+1): # N-grams; +1 so the final n-gram is counted
    key = tuple(strparts[i:i+N])
    D[key] = D.get(key, 0) + 1     # avoids the bare try/except
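
A shorter equivalent sketch using collections.Counter, which handles missing keys for you:

import collections

N = 3
string = 'whatever string whatever string whatever'
strparts = string.split()
D = collections.Counter(tuple(strparts[i:i+N]) for i in range(len(strparts)-N+1))
print(D)
# Counter({('whatever', 'string', 'whatever'): 2, ('string', 'whatever', 'string'): 1})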

回答 8

对于四元组(four-grams),NLTK 中已有现成支持,下面这段代码可以帮助您实现:

 from nltk.collocations import *
 import nltk
 #You should tokenize your text
 text = "I do not like green eggs and ham, I do not like them Sam I am!"
 tokens = nltk.wordpunct_tokenize(text)
 fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
 for fourgram, freq in fourgrams.ngram_fd.items():  
       print fourgram, freq

希望对您有所帮助。

For four_grams it is already in NLTK, here is a piece of code that can help you toward this:

 from nltk.collocations import *
 import nltk
 #You should tokenize your text
 text = "I do not like green eggs and ham, I do not like them Sam I am!"
 tokens = nltk.wordpunct_tokenize(text)
 fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
 for fourgram, freq in fourgrams.ngram_fd.items():  
       print fourgram, freq

I hope it helps.


回答 9

您可以使用sklearn.feature_extraction.text.CountVectorizer

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

输出:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

您可以把 ngram_size 设置为任何正整数。也就是说,您可以将文本拆分成四元组、五元组甚至百元组。

You can use sklearn.feature_extraction.text.CountVectorizer:

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

You can set ngram_size to any positive integer. I.e. you can split a text into four-grams, five-grams or even hundred-grams.
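
One version caveat, stated as an assumption about newer releases: recent scikit-learn versions removed get_feature_names in favor of get_feature_names_out, so on those versions the print line becomes:

print('{1}-grams: {0}'.format(vect.get_feature_names_out(), ngram_size))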


回答 10

如果效率是个问题,而且您必须构建多个不同的 n-gram(如您所说,最多到一百元),又想使用纯 Python,我会这样做:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an itirator over the n-grams given a listTokens"""
    shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams # if join in generator : (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngramRange) given a listTokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

用法:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

速度与 NLTK 大致相同:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

重新发布我以前的答案

If efficiency is an issue and you have to build multiple different n-grams (up to a hundred as you say), but you want to use pure python I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an itirator over the n-grams given a listTokens"""
    shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams # if join in generator : (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngramRange) given a listTokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repost from my previous answer.


回答 11

Nltk 很棒,但对某些项目来说有时是一种额外负担:

import re
def tokenize(text, ngrams=1):
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]

使用示例:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]

Nltk is great, but sometimes it is overhead for some projects:

import re
def tokenize(text, ngrams=1):
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]

Example use:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]

回答 12

您可以使用下面的代码获得所有 4 到 6 元的 n-gram,而无需任何其他软件包:

from itertools import chain

def get_m_2_ngrams(input_list, min, max):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)

输出如下:

I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams

您可以在此博客上找到更多详细信息

You can get all 4- to 6-grams using the code below, without any other package:

from itertools import chain

def get_m_2_ngrams(input_list, min, max):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)

the output is below:

I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams

you can find more detail on this blog


回答 13

大约七年后,这里有一个使用 collections.deque 的更优雅的答案:

import collections
import itertools

def ngrams(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    for window, word in zip(itertools.cycle((d,)), words):
        print(' '.join(window))
        d.append(word)
    print(' '.join(d))  # the last window is only complete after the loop

words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']

输出:

In [15]: ngrams(words, 3)
I am become
am become death,
become death, the
death, the destroyer
the destroyer of
destroyer of worlds

In [16]: ngrams(words, 4)
I am become death,
am become death, the
become death, the destroyer
death, the destroyer of
the destroyer of worlds

In [17]: ngrams(words, 1)
I
am
become
death,
the
destroyer
of
worlds

In [18]: ngrams(words, 2)
I am
am become
become death,
death, the
the destroyer
destroyer of
of worlds

After about seven years, here’s a more elegant answer using collections.deque:

import collections
import itertools

def ngrams(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    for window, word in zip(itertools.cycle((d,)), words):
        print(' '.join(window))
        d.append(word)
    print(' '.join(d))  # the last window is only complete after the loop

words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']

Output:

In [15]: ngrams(words, 3)
I am become
am become death,
become death, the
death, the destroyer
the destroyer of
destroyer of worlds

In [16]: ngrams(words, 4)
I am become death,
am become death, the
become death, the destroyer
death, the destroyer of
the destroyer of worlds

In [17]: ngrams(words, 1)
I
am
become
death,
the
destroyer
of
worlds

In [18]: ngrams(words, 2)
I am
am become
become death,
death, the
the destroyer
destroyer of
of worlds

回答 14

如果您想要一个内存占用恒定、适用于大字符串的纯迭代器解决方案:

from typing import Iterable
import itertools
import re

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Skip first words
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))  

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

测试:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

输出:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']

If you want a pure iterator solution for large strings with constant memory usage:

from typing import Iterable
import itertools
import re

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Skip first words
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))  

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

Output:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']