Category Archives: Q&A

Getting the number of elements in an iterator in Python

Question: Getting the number of elements in an iterator in Python

Is there an efficient way to know how many elements are in an iterator in Python, in general, without iterating through each and counting?


Answer 0

No. It’s not possible.

Example:

import random

def gen(n):
    for i in range(n):
        if random.randint(0, 1) == 0:
            yield i

iterator = gen(10)

The length of the iterator is unknown until you iterate through it.


Answer 1

This code should work:

>>> it = (i for i in range(50))
>>> sum(1 for _ in it)
50

Although it does iterate through each item and count them, it is the fastest way to do so.

It also works when the iterator has no items:

>>> sum(1 for _ in range(0))
0

Of course, it runs forever for an infinite input, so remember that iterators can be infinite:

>>> sum(1 for _ in itertools.count())
[nothing happens, forever]

Also, be aware that the iterator will be exhausted by doing this, and further attempts to use it will see no elements. That’s an unavoidable consequence of the Python iterator design. If you want to keep the elements, you’ll have to store them in a list or something.
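
If you do want this count but need to guard against the infinite case, one option is to cap it. A minimal sketch, assuming a cap of one million items is acceptable for your data:

import itertools

def count_capped(iterable, cap=1000000):
    # Counts at most `cap` items, so an infinite iterator cannot hang us.
    return sum(1 for _ in itertools.islice(iterable, cap))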


Answer 2

No, any method will require you to resolve every result. You can do

iter_length = len(list(iterable))

but running that on an infinite iterator will of course never return. It also will consume the iterator and it will need to be reset if you want to use the contents.

Telling us what real problem you’re trying to solve might help us find you a better way to accomplish your actual goal.

Edit: Using list() will read the whole iterable into memory at once, which may be undesirable. Another way is to do

sum(1 for _ in iterable)

as another person posted. That will avoid keeping it in memory.
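
If you need both the count and the contents afterwards, itertools.tee is sometimes suggested; a sketch follows. Note that tee buffers every item the slower branch has not yet consumed, so this does not actually avoid keeping the data in memory:

import itertools

it = iter(range(5))
it, spare = itertools.tee(it)
n = sum(1 for _ in spare)   # exhausting the copy makes tee buffer all items for `it`
print(n, list(it))          # 5 [0, 1, 2, 3, 4]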


Answer 3

You cannot (except when the type of a particular iterator implements some specific method that makes it possible).

Generally, you can count iterator items only by consuming the iterator. One of the most efficient ways is probably:

import itertools
from collections import deque

def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    """
    counter = itertools.count()
    deque(itertools.izip(iterable, counter), maxlen=0)  # (consume at C speed)
    return next(counter)

(For Python 3.x replace itertools.izip with zip).


Answer 4

Kinda. You could check the __length_hint__ method, but be warned that (at least up to Python 3.4, as gsnedders helpfully points out) it's an undocumented implementation detail (following message in thread) that could very well vanish or summon nasal demons instead.

Otherwise, no. An iterator is just an object that exposes only the next() method. You can call it as many times as required, and it may or may not eventually raise StopIteration. Luckily, this behaviour is transparent to the coder most of the time. :)
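
A quick illustration on CPython (operator.length_hint is the public wrapper added in Python 3.4 by PEP 424; the result is only a hint and may be wrong, or zero, for arbitrary iterators):

>>> import operator
>>> it = iter([1, 2, 3])
>>> it.__length_hint__()
3
>>> next(it)
1
>>> operator.length_hint(it)
2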


Answer 5

I like the cardinality package for this; it is very lightweight and tries to use the fastest implementation available depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2

The actual count() implementation is as follows:

def count(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

Answer 6

So, for those who would like to know the summary of that discussion: the final top scores for counting a 50-million-item generator expression using

  • len(list(gen)),
  • len([_ for _ in gen]),
  • sum(1 for _ in gen),
  • ilen(gen) (from more_itertools),
  • reduce(lambda c, i: c + 1, gen, 0),

sorted by performance of execution (including memory consumption), may surprise you:

1: test_list.py:8: 0.492 KiB

gen = (i for i in data*1000); t0 = monotonic(); len(list(gen))

('list, sec', 1.9684218849870376)

2: test_list_compr.py:8: 0.867 KiB

gen = (i for i in data*1000); t0 = monotonic(); len([i for i in gen])

('list_compr, sec', 2.5885991149989422)

3: test_sum.py:8: 0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); sum(1 for i in gen); t1 = monotonic()

('sum, sec', 3.441088170016883)

4: more_itertools/more.py:413: 1.266 KiB

d = deque(enumerate(iterable, 1), maxlen=1)

test_ilen.py:10: 0.875 KiB
gen = (i for i in data*1000); t0 = monotonic(); ilen(gen)

('ilen, sec', 9.812256851990242)

5: test_reduce.py:8: 0.859 KiB

gen = (i for i in data*1000); t0 = monotonic(); reduce(lambda counter, i: counter + 1, gen, 0)

('reduce, sec', 13.436614598002052)

So len(list(gen)) is the fastest, and in these measurements it also reports the smallest memory footprint.


Answer 7

An iterator is just an object with a pointer to the next object to be read, by some kind of buffer or stream. It's like a LinkedList, where you don't know how many things you have until you iterate through them. Iterators are meant to be efficient because all they do is tell you what comes next by reference, instead of using indexing (but, as you saw, you lose the ability to see how many entries remain).
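
A minimal sketch of that idea, using the iterator protocol directly:

it = iter([10, 20, 30])
print(next(it))   # 10: the iterator only knows what comes next,
print(next(it))   # 20: not how many items remain after it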


Answer 8

Regarding your original question, the answer is still that there is no way in general to know the length of an iterator in Python.

Given that your question is motivated by an application of the pysam library, I can give a more specific answer: I'm a contributor to pysam, and the definitive answer is that SAM/BAM files do not provide an exact count of aligned reads. Nor is this information easily available from a BAM index file. The best one can do is to estimate the approximate number of alignments by using the location of the file pointer after reading a number of alignments and extrapolating based on the total size of the file. This is enough to implement a progress bar, but not a method for counting alignments in constant time.
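
A rough sketch of that extrapolation idea, using a plain binary file handle and a hypothetical parse_records() generator in place of the real pysam API:

import itertools
import os

def estimate_record_count(path, parse_records, sample=10000):
    # Count a sample of records, then extrapolate from the file-pointer
    # position relative to the total file size. Assumes records are
    # roughly uniform in size.
    total_bytes = os.path.getsize(path)
    with open(path, "rb") as fh:
        n = sum(1 for _ in itertools.islice(parse_records(fh), sample))
        consumed = fh.tell()
    return round(n * total_bytes / consumed) if consumed else 0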


Answer 9

A quick benchmark:

import collections
import itertools

def count_iter_items(iterable):
    counter = itertools.count()
    collections.deque(itertools.izip(iterable, counter), maxlen=0)
    return next(counter)

def count_lencheck(iterable):
    if hasattr(iterable, '__len__'):
        return len(iterable)

    d = collections.deque(enumerate(iterable, 1), maxlen=1)
    return d[0][0] if d else 0

def count_sum(iterable):           
    return sum(1 for _ in iterable)

iter = lambda y: (x for x in xrange(y))

%timeit count_iter_items(iter(1000))
%timeit count_lencheck(iter(1000))
%timeit count_sum(iter(1000))

The results:

10000 loops, best of 3: 37.2 µs per loop
10000 loops, best of 3: 47.6 µs per loop
10000 loops, best of 3: 61 µs per loop

I.e. the simple count_iter_items is the way to go.

Adjusting this for Python 3 (itertools.izip becomes zip, xrange becomes range):

61.9 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.4 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.6 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 10

There are two ways to get the length of “something” on a computer.

The first way is to store a count; this requires anything that touches the file/data to modify the count (or a class that only exposes interfaces, but it boils down to the same thing).

The other way is to iterate over it and count how big it is.
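
A minimal sketch of the first approach: a wrapper class (the name is made up for illustration) that keeps a running count as items pass through it:

class CountingIterator:
    def __init__(self, iterable):
        self._it = iter(iterable)
        self.count = 0              # number of items yielded so far

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)      # propagates StopIteration when exhausted
        self.count += 1
        return value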


Answer 11

It’s common practice to put this type of information in the file header, and for pysam to give you access to this. I don’t know the format, but have you checked the API?

As others have said, you can’t know the length from the iterator.


Answer 12

This is against the very definition of an iterator, which is a pointer to an object, plus information about how to get to the next object.

An iterator does not know how many more times it will be able to iterate until terminating. This could be infinite, so infinity might be your answer.


Answer 13

Although it’s not possible in general to do what’s been asked, it’s still often useful to have a count of how many items were iterated over after having iterated over them. For that, you can use jaraco.itertools.Counter or similar. Here’s an example using Python 3 and rwt to load the package.

$ rwt -q jaraco.itertools -- -q
>>> import jaraco.itertools
>>> items = jaraco.itertools.Counter(range(100))
>>> _ = list(items)
>>> items.count
100
>>> import random
>>> def gen(n):
...     for i in range(n):
...         if random.randint(0, 1) == 0:
...             yield i
... 
>>> items = jaraco.itertools.Counter(gen(100))
>>> _ = list(items)
>>> items.count
48

Answer 14

def count_iter(iterable):
    count = 0
    for _ in iterable:
        count += 1
    return count

Answer 15

Presumably, you want to count the number of items without iterating through them, so that the iterator is not exhausted and you can use it again later. This is possible with copy.copy or copy.deepcopy:

import copy

def get_iter_len(iterator):
    return sum(1 for _ in copy.copy(iterator))

###############################################

iterator = range(0, 10)
print(get_iter_len(iterator))

if len(tuple(iterator)) > 1:
    print("Finding the length did not exhaust the iterator!")
else:
    print("oh no! it's all gone")

The output is "Finding the length did not exhaust the iterator!" (Note that this works here because a range object is a copyable iterable; copy.copy raises a TypeError on an actual generator object.)

Optionally (and inadvisably), you can shadow the built-in len function as follows:

import copy

def len(obj, *, len=len):
    if hasattr(obj, "__len__"):
        return len(obj)
    if hasattr(obj, "__next__"):
        return sum(1 for _ in copy.copy(obj))
    return len(obj)

How do I add a constant column in a Spark DataFrame?

Question: How do I add a constant column in a Spark DataFrame?

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:

dt.withColumn('new_column', 10).head(5)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-50-a6d0257ca2be> in <module>()
      1 dt = (messages
      2     .select(messages.fromuserid, messages.messagetype, floor(messages.datetime/(1000*60*5)).alias("dt")))
----> 3 dt.withColumn('new_column', 10).head(5)

/Users/evanzamir/spark-1.4.1/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
   1166         [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
   1167         """
-> 1168         return self.select('*', col.alias(colName))
   1169 
   1170     @ignore_unicode_prefix

AttributeError: 'int' object has no attribute 'alias'

It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the number I want (10 in this case):

dt.withColumn('new_column', dt.messagetype - dt.messagetype + 10).head(5)
[Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=47019141, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=49746356, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=93506471, messagetype=1, dt=4809600.0, new_column=10),
 Row(fromuserid=80488242, messagetype=1, dt=4809600.0, new_column=10)]

This is supremely hacky, right? I assume there is a more legit way to do this?


Answer 0

Spark 2.2+

Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala):

import org.apache.spark.sql.functions.typedLit

df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))

Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):

The second argument for DataFrame.withColumn should be a Column so you have to use a literal:

from pyspark.sql.functions import lit

df.withColumn('new_column', lit(10))

If you need complex columns you can build these using blocks like array:

from pyspark.sql.functions import array, create_map, struct

df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))

Exactly the same methods can be used in Scala.

import org.apache.spark.sql.functions.{array, lit, map, struct}

df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))

To provide names for structs, either use alias on each field:

df.withColumn(
    "some_struct",
    struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
 )

or cast on the whole object:

df.withColumn(
    "some_struct", 
    struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
 )

It is also possible, although slower, to use a UDF.
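
For example, a minimal PySpark sketch; the input column name id is an assumption here (any existing column works, since the UDF ignores its argument):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

ten = udf(lambda _: 10, IntegerType())      # a UDF that always returns 10
df.withColumn('new_column', ten(df['id']))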

Note:

The same constructs can be used to pass constant arguments to UDFs or SQL functions.


Answer 1

In Spark 2.2 there are two ways to add a constant value to a column in a DataFrame:

1) Using lit

2) Using typedLit.

The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map.

Sample DataFrame:

val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1")

+---+----+
| id|col1|
+---+----+
|  0|   a|
|  1|   b|
|  2|   c|
+---+----+

1) Using lit: adding a constant string value in a new column named newcol:

import org.apache.spark.sql.functions.lit
val newdf = df.withColumn("newcol",lit("myval"))

Result:

+---+----+------+
| id|col1|newcol|
+---+----+------+
|  0|   a| myval|
|  1|   b| myval|
|  2|   c| myval|
+---+----+------+

2) Using typedLit:

import org.apache.spark.sql.functions.typedLit
df.withColumn("newcol", typedLit(("sample", 10, .044)))

Result:

+---+----+-----------------+
| id|col1|           newcol|
+---+----+-----------------+
|  0|   a|[sample,10,0.044]|
|  1|   b|[sample,10,0.044]|
|  2|   c|[sample,10,0.044]|
+---+----+-----------------+

Why can't non-default arguments follow default arguments?

Question: Why can't non-default arguments follow default arguments?

Why does this piece of code throw a SyntaxError?

  >>> def fun1(a="who is you", b="True", x, y):
...     print a,b,x,y
... 
  File "<stdin>", line 1
SyntaxError: non-default argument follows default argument

While the following piece of code runs without visible errors:

>>> def fun1(x, y, a="who is you", b="True"):
...     print a,b,x,y
... 

Answer 0

All required parameters must be placed before any default arguments, simply because they are mandatory, whereas default arguments are not. Syntactically, it would be impossible for the interpreter to decide which values match which arguments if mixed modes were allowed. A SyntaxError is raised if the arguments are not given in the correct order:

Let us take a look at keyword arguments, using your function.

def fun1(a="who is you", b="True", x, y):
    print a,b,x,y

Suppose it were allowed to declare a function as above. Then, with that declaration, we could make the following (regular) positional or keyword argument calls:

func1("ok a", "ok b", 1)  # Is 1 assigned to x or ?
func1(1)                  # Is 1 assigned to a or ?
func1(1, 2)               # ?

How would you suggest the interpreter assign values to the parameters in these calls, and how would default arguments be used together with keyword arguments?

>>> def fun1(x, y, a="who is you", b="True"):
...     print a,b,x,y
... 

Reference: O'Reilly, Core Python.
This function, by contrast, makes use of default arguments in a way that is syntactically correct for the above function calls. Keyword-argument calls prove useful for providing positional arguments out of order, and, coupled with default arguments, they can also be used to "skip over" missing arguments.


Answer 1

SyntaxError: non-default argument follows default argument

If you were to allow this, the default arguments would be rendered useless because you would never be able to use their default values, since the non-default arguments come after.

In Python 3 however, you may do the following:

def fun1(a="who is you", b="True", *, x, y):
    pass

which makes x and y keyword-only, so you can do this:

fun1(x=2, y=2)

This works because there is no longer any ambiguity. Note that you still can't do fun1(2, 2): the two positional values would bind to a and b, and the keyword-only x and y would still be missing.


Answer 2

Let me clarify two points here:

  • Firstly, a non-default argument may not follow a default argument; that is, you can't define (a=b, c) in a function. The correct order for defining parameters in a function is (see also the sketch at the end of this answer):
    • positional (non-default) parameters, i.e. (a, b, c)
    • keyword (default) parameters, i.e. (a="b", r="j")
    • the var-positional parameter, i.e. (*args)
    • the var-keyword parameter, i.e. (**kwargs)

def example(a, b, c=None, r="w", d=[], *ae, **ab):

(a, b) are positional parameters

(c=None) is an optional (default) parameter

(r="w") is a keyword (default) parameter

(d=[]) is a default parameter whose default value is a list (note: a mutable default is shared across calls)

(*ae) is the var-positional parameter

(**ab) is the var-keyword parameter

  • Secondly, if I try something like this: def example(a, b, c=a, d=b):

the names are not defined yet when the default values are saved. Python computes and saves default values when you define the function,

and at that moment a and b do not exist (a parameter only gets a value when the function is executed),

so defaults like "c=a, d=b" are not allowed in a parameter list.
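
A small sketch of the full Python 3 parameter ordering; all names are made up for illustration (keyword-only parameters, which come after *args or a bare *, are a separate, fifth kind):

def full_signature(a, b, c=None, *args, kw_only, kw_def=1, **kwargs):
    # positional, then default, then var-positional, then keyword-only,
    # then var-keyword
    return a, b, c, args, kw_only, kw_def, kwargs

full_signature(1, 2, 3, 4, kw_only=5)    # works
# full_signature(1, 2)                   # TypeError: missing 'kw_only'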


Answer 3

Required arguments (the ones without defaults) must be at the start, so that client code can supply only those. If the optional arguments were at the start, it would be confusing:

fun1("who is who", 3, "jack")

What would that do in your first example? In your second example, x is "who is who", y is 3, and a is "jack".


Find columns whose name contains a specific string

Question: Find columns whose name contains a specific string

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I’m searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).

I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I’ve tried to find ways to do this, to no avail. Any tips?


Answer 0

Just iterate over DataFrame.columns; here is an example in which you end up with a list of the matching column names:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)

Output:

['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']

Explanation:

  1. df.columns returns a list of column names
  2. [col for col in df.columns if 'spike' in col] iterates over the list df.columns with the variable col and adds it to the resulting list if col contains 'spike'. This syntax is list comprehension.

If you only want the resulting data set with the columns that match, you can do this:

df2 = df.filter(regex='spike')
print(df2)

Output:

   spike-2  spiked-in
0        1          7
1        2          8
2        3          9

Answer 1

This answer uses the DataFrame.filter method to do this without list comprehension:

import pandas as pd

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)

print(df.filter(like='spike').columns)

Will output just 'spike-2'. You can also use regex, as some people suggested in comments above:

print(df.filter(regex='spike|spke').columns)

Will output both columns: ['spike-2', 'hey spke']


Answer 2

You can also use df.columns[df.columns.str.contains(pat = 'spike')]

data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)

colNames = df.columns[df.columns.str.contains(pat = 'spike')] 

print(colNames)

This will output the column names: 'spike-2', 'spiked-in'

More about pandas.Series.str.contains.


Answer 3

# select columns containing 'spike'
df.filter(like='spike', axis=1)

You can also select by name or by regular expression. Refer to: pandas.DataFrame.filter


Answer 4

df.loc[:, df.columns.str.contains("spike")]

Answer 5

You also can use this code:

spike_cols = [x for x in df.columns[df.columns.str.contains('spike')]]

Answer 6

Getting name and subsetting based on Start, Contains, and Ends:

# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html

import pandas as pd

data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)

print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)

print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)

print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)

print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike', axis=1)
print("Starts")
print(df_subset_start)

print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike', axis=1)
print("Contains")
print(df_subset_contains)

print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$', axis=1)
print("Ends")
print(df_subset_ends)

Python TypeError: not enough arguments for format string

Question: Python TypeError: not enough arguments for format string

Here’s the output. These are utf-8 strings I believe… some of these can be NoneType but it fails immediately, before ones like that…

instr = "'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % softname, procversion, int(percent), exe, description, company, procurl

TypeError: not enough arguments for format string

It's 7 for 7 though?


Answer 0

Note that the % syntax for formatting strings is becoming outdated. If your version of Python supports it, you should write:

instr = "'{0}', '{1}', '{2}', '{3}', '{4}', '{5}', '{6}'".format(softname, procversion, int(percent), exe, description, company, procurl)

This also fixes the error that you happened to have.
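
On Python 3.6+ an f-string is even more direct; a sketch with the same variable names:

instr = f"'{softname}', '{procversion}', '{int(percent)}', '{exe}', '{description}', '{company}', '{procurl}'"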


Answer 1

You need to put the format arguments into a tuple (add parentheses):

instr = "'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % (softname, procversion, int(percent), exe, description, company, procurl)

What you currently have is equivalent to the following:

intstr = ("'%s', '%s', '%d', '%s', '%s', '%s', '%s'" % softname), procversion, int(percent), exe, description, company, procurl

Example:

>>> "%s %s" % 'hello', 'world'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: not enough arguments for format string
>>> "%s %s" % ('hello', 'world')
'hello world'

Answer 2

I got the same error when using % as a percent character in my format string. The solution is to double it up as %%.
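
A quick illustration:

>>> "%d%% complete" % 50
'50% complete'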


Pandas DataFrame: get the first row of each group

Question: Pandas DataFrame: get the first row of each group

I have a pandas DataFrame like the following.

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
                'value'  : ["first","second","second","first",
                            "second","first","third","fourth",
                            "fifth","second","fifth","first",
                            "first","second","third","fourth","fifth"]})

I want to group this by ["id", "value"] and get the first row of each group.

        id   value
0        1   first
1        1  second
2        1  second
3        2   first
4        2  second
5        3   first
6        3   third
7        3  fourth
8        3   fifth
9        4  second
10       4   fifth
11       5   first
12       6   first
13       6  second
14       6   third
15       7  fourth
16       7   fifth

Expected outcome

    id   value
     1   first
     2   first
     3   first
     4  second
     5  first
     6  first
     7  fourth

I tried the following, which only gives the first row of the DataFrame. Any help regarding this is appreciated.

In [25]: for index, row in df.iterrows():
   ....:     df2 = pd.DataFrame(df.groupby(['id','value']).reset_index().ix[0])

Answer 0

>>> df.groupby('id').first()
     value
id        
1    first
2    first
3    first
4   second
5    first
6    first
7   fourth

If you need id as a column:

>>> df.groupby('id').first().reset_index()
   id   value
0   1   first
1   2   first
2   3   first
3   4  second
4   5   first
5   6   first
6   7  fourth

To get the first n records, you can use head():

>>> df.groupby('id').head(2).reset_index(drop=True)
    id   value
0    1   first
1    1  second
2    2   first
3    2  second
4    3   first
5    3   third
6    4  second
7    4   fifth
8    5   first
9    6   first
10   6  second
11   7  fourth
12   7   fifth

Answer 1

This will give you the second row of each group (zero indexed, nth(0) is the same as first()):

df.groupby('id').nth(1) 

Documentation: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group


Answer 2

I'd suggest using .nth(0) rather than .first() if you need to get the first row.

The difference between them is how they handle NaNs: .nth(0) will return the first row of the group no matter what the values in that row are, while .first() will return the first non-NaN value in each column.

E.g. if your dataset is:

df = pd.DataFrame({'id' : [1,1,1,2,2,3,3,3,3,4,4],
            'value'  : ["first","second","third", np.NaN,
                        "second","first","second","third",
                        "fourth","first","second"]})

>>> df.groupby('id').nth(0)
    value
id        
1    first
2    NaN
3    first
4    first

And

>>> df.groupby('id').first()
    value
id        
1    first
2    second
3    first
4    first

Answer 3

Maybe this is what you want:

import pandas as pd
idx = pd.MultiIndex.from_product([['state1','state2'],   ['county1','county2','county3','county4']])
df = pd.DataFrame({'pop': [12,15,65,42,78,67,55,31]}, index=idx)
                pop
state1 county1   12
       county2   15
       county3   65
       county4   42
state2 county1   78
       county2   67
       county3   55
       county4   31
df.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('pop', ascending=False)).groupby(level=0).head(3)

> Out[29]: 
                pop
state1 county3   65
       county4   42
       county2   15
state2 county1   78
       county2   67
       county3   55

Answer 4

If you only need the first row from each group, you can do it with drop_duplicates. Note that the function's default is keep='first'.

df.drop_duplicates('id')
Out[1027]: 
    id   value
0    1   first
3    2   first
5    3   first
9    4  second
11   5   first
12   6   first
15   7  fourth

n-grams in Python: four-grams, five-grams, six-grams?

Question: n-grams in Python: four-grams, five-grams, six-grams?

I’m looking for a way to split a text into n-grams. Normally I would do something like:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams?

Thanks!


Answer 0

Great native-Python-based answers have been given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module that people seldom use in nltk. It's not because ngrams are hard to read, but because training a model based on ngrams where n > 3 results in a lot of data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
  print(grams)

Answer 1

I’m surprised that this hasn’t shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

In [37]: for gram in grams: print(gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']

Answer 2

Using only nltk tools

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n ):
    n_grams = ngrams(word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]

Example output

get_ngrams('This is the simplest text i could think of', 3 )

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

In order to keep the ngrams as tuples, just remove the ' '.join.


Answer 3

Here is another simple way to do n-grams:

>>> import nltk
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

Answer 4

People have already answered pretty nicely for the scenario where you need bigrams or trigrams, but if you need every n-gram of the sentence, you can use nltk.util.everygrams:

>>> from nltk.util import everygrams

>>> message = "who let the dogs out"

>>> msg_split = message.split()

>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]

In case you have a limit, as with trigrams where the max length should be 3, you can use the max_len param to specify it:

>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]

You can just modify the max_len param to get whatever n-gram you want: four-gram, five-gram, six-gram, or even hundred-gram.

The previously mentioned solutions can be modified to do the same, but this one is much more straightforward.

And when you just need a specific gram, like bigram or trigram, you can use nltk.util.ngrams as mentioned in M.A.Hassan's answer.


Answer 5

You can easily whip up your own function to do this using itertools:

from itertools import izip, islice, tee  # Python 2; on Python 3, use the builtin zip instead of izip
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]

Answer 6

A more elegant approach is to build bigrams with Python's built-in zip(). Simply convert the original string into a list with split(), then pass the list once normally and once offset by one element.

string = "I really like python, it's pretty awesome."

def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])

def find_ngrams(s, n):
  input_list = s.split(" ")
  return zip(*[input_list[i:] for i in range(n)])

find_bigrams(string)

[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
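
For completeness, the generalized find_ngrams variant can be called the same way (on Python 3, wrap it in list(), since zip is lazy); a sketch of its output for n=3:

>>> list(find_ngrams(string, 3))
[('I', 'really', 'like'), ('really', 'like', 'python,'), ('like', 'python,', "it's"), ('python,', "it's", 'pretty'), ("it's", 'pretty', 'awesome.')]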

回答 7

我从未用过nltk,但在一个小的课程项目中做过N-grams。如果要统计字符串中所有N-gram出现的频率,可以采用这种方法。D会给出N元组的直方图。

D = dict()
N = 3  # n-gram size
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts)-N+1): # N-grams; +1 so the last gram is counted
    key = tuple(strparts[i:i+N])
    D[key] = D.get(key, 0) + 1

I have never dealt with nltk but did N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. D would give you the histogram of your N-grams.

D = dict()
N = 3  # n-gram size
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts)-N+1): # N-grams; +1 so the last gram is counted
    key = tuple(strparts[i:i+N])
    D[key] = D.get(key, 0) + 1
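
A more idiomatic sketch of the same frequency count, assuming word tokens and using collections.Counter:

from collections import Counter

def ngram_counts(text, n):
    tokens = text.split()
    # zip over shifted views of the token list yields each n-gram once
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Example: ngram_counts('to be or not to be', 2)
# Counter({('to', 'be'): 2, ('be', 'or'): 1, ('or', 'not'): 1, ('not', 'to'): 1})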

回答 8

对于四元组(four-grams),NLTK中已经有现成支持,下面这段代码可以帮助您实现这一目标:

 from nltk.collocations import *
 import nltk
 #You should tokenize your text
 text = "I do not like green eggs and ham, I do not like them Sam I am!"
 tokens = nltk.wordpunct_tokenize(text)
 fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
 for fourgram, freq in fourgrams.ngram_fd.items():  
       print fourgram, freq  # Python 2 print; on Python 3 use print(fourgram, freq)

希望对您有所帮助。

For four_grams it is already in NLTK, here is a piece of code that can help you toward this:

 from nltk.collocations import *
 import nltk
 #You should tokenize your text
 text = "I do not like green eggs and ham, I do not like them Sam I am!"
 tokens = nltk.wordpunct_tokenize(text)
 fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
 for fourgram, freq in fourgrams.ngram_fd.items():  
       print fourgram, freq  # Python 2 print; on Python 3 use print(fourgram, freq)

I hope it helps.


回答 9

您可以使用sklearn.feature_extraction.text.CountVectorizer

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

输出:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

您可以将ngram_size设置为任何正整数。也就是说,您可以将文本拆分为四元组、五元组甚至一百元组。

You can use sklearn.feature_extraction.text.CountVectorizer:

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

You can set ngram_size to any positive integer. I.e. you can split a text into four-grams, five-grams or even hundred-grams.
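
One caveat: newer scikit-learn releases renamed this accessor (get_feature_names was deprecated in 1.0 and removed in 1.2, if I recall correctly), so on a recent install the last line would be:

print('{1}-grams: {0}'.format(vect.get_feature_names_out(), ngram_size))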


回答 10

如果效率是一个问题,而您必须构建多个不同的n-gram(如您所说,最多到一百元组),但又想使用纯python,我会这样做:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an itirator over the n-grams given a listTokens"""
    shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams # if join in generator : (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngramRange) given a listTokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

用法:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

〜与NLTK相同的速度:

import nltk
%%timeit
input_list = 'test the ngrams iterator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams iterator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams iterator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams iterator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

重新发布我以前的答案

If efficiency is an issue and you have to build multiple different n-grams (up to a hundred as you say), but you want to use pure python I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an itirator over the n-grams given a listTokens"""
    shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams # if join in generator : (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngramRange) given a listTokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage :

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams iterator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams iterator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams iterator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams iterator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repost from my previous answer.


回答 11

NLTK很棒,但对某些项目来说有时显得过重:

import re
def tokenize(text, ngrams=1):
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]  # Python 2 xrange; use range on Python 3

使用示例:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]

Nltk is great, but sometimes it is an overhead for some projects:

import re
def tokenize(text, ngrams=1):
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]  # Python 2 xrange; use range on Python 3

Example use:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]

回答 12

无需安装其他软件包,使用以下代码即可得到所有4到6元组:

from itertools import chain

def get_m_2_ngrams(input_list, min_n, max_n):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min_n, max_n+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)

输出如下:

I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams

您可以在此博客上找到更多详细信息

You can get all 4- to 6-grams using the code below, without any other packages:

from itertools import chain

def get_m_2_ngrams(input_list, min_n, max_n):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min_n, max_n+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)

the output is below:

I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams

you can find more detail on this blog


回答 13

大约七年之后,这里有一个使用collections.deque的更优雅的答案:

import collections
import itertools

def ngrams(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    for window, word in zip(itertools.cycle((d,)), words):
        print(' '.join(window))
        d.append(word)

words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']

输出:

In [15]: ngrams(words, 3)                                                                                                                                                                                                                     
I am become
am become death,
become death, the
death, the destroyer
the destroyer of

In [16]: ngrams(words, 4)                                                                                                                                                                                                                     
I am become death,
am become death, the
become death, the destroyer
death, the destroyer of

In [17]: ngrams(words, 1)                                                                                                                                                                                                                     
I
am
become
death,
the
destroyer
of

In [18]: ngrams(words, 2)                                                                                                                                                                                                                     
I am
am become
become death,
death, the
the destroyer
destroyer of

After about seven years, here’s a more elegant answer using collections.deque:

import collections
import itertools

def ngrams(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    for window, word in zip(itertools.cycle((d,)), words):
        print(' '.join(window))
        d.append(word)

words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']

Output:

In [15]: ngrams(words, 3)                                                                                                                                                                                                                     
I am become
am become death,
become death, the
death, the destroyer
the destroyer of

In [16]: ngrams(words, 4)                                                                                                                                                                                                                     
I am become death,
am become death, the
become death, the destroyer
death, the destroyer of

In [17]: ngrams(words, 1)                                                                                                                                                                                                                     
I
am
become
death,
the
destroyer
of

In [18]: ngrams(words, 2)                                                                                                                                                                                                                     
I am
am become
become death,
death, the
the destroyer
destroyer of
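
One quirk of the sketch above: the loop prints before appending, so the final window ('destroyer of worlds' for n=3) is never emitted. A hypothetical ngrams_all variant that prints it too:

import collections

def ngrams_all(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    for word in words[n:]:
        print(' '.join(d))
        d.append(word)
    print(' '.join(d))  # emit the final window as well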

回答 14

如果您想要一个适用于大字符串、内存占用恒定的纯迭代器解决方案:

from typing import Iterable  
import itertools
import re

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Skip first words
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))  

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

测试:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

输出:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']

If you want a pure iterator solution for large strings with constant memory usage:

from typing import Iterable  
import itertools
import re

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [ 
        map(lambda m: m.group(0), re.finditer(token_regex, input)) 
        for n in range(ngram_size) 
    ]
    # Skip first words
    for n in range(1, ngram_size): list(map(next, input_iters[n:]))  

    output_iter = itertools.starmap( 
        lambda *args: " ".join(args),  
        zip(*input_iters) 
    ) 
    return output_iter

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))

Output:

['If you want a pure',
 'you want a pure iterator',
 'want a pure iterator solution',
 'a pure iterator solution for',
 'pure iterator solution for large',
 'iterator solution for large strings',
 'solution for large strings with',
 'for large strings with constant',
 'large strings with constant memory',
 'strings with constant memory usage']

用sklearn缩放的pandas数据框列

问题:用sklearn缩放的pandas数据框列

我有一个带有混合类型列的pandas数据框,我想将sklearn的min_max_scaler应用于某些列。理想情况下,我想就地进行这些转换,但还没有找到一种方法来进行。我编写了以下有效的代码:

import pandas as pd
import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()

dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
min_max_scaler = preprocessing.MinMaxScaler()

def scaleColumns(df, cols_to_scale):
    for col in cols_to_scale:
        df[col] = pd.DataFrame(min_max_scaler.fit_transform(pd.DataFrame(df[col])),columns=[col])
    return df

dfTest

    A   B   C
0    14.00   103.02  big
1    90.20   107.26  small
2    90.95   110.35  big
3    96.27   114.23  small
4    91.21   114.68  small

scaled_df = scaleColumns(dfTest,['A','B'])
scaled_df

A   B   C
0    0.000000    0.000000    big
1    0.926219    0.363636    small
2    0.935335    0.628645    big
3    1.000000    0.961407    small
4    0.938495    1.000000    small

我很好奇这是否是进行此转换的首选/最有效的方法。有没有一种使用df.apply的更好办法?

我也很惊讶以下代码无法正常工作:

bad_output = min_max_scaler.fit_transform(dfTest['A'])

如果我将整个数据帧传递给缩放器,则它可以工作:

dfTest2 = dfTest.drop('C', axis = 1)
good_output = min_max_scaler.fit_transform(dfTest2)
good_output

我很困惑为什么将序列(Series)传给缩放器会失败。在上面的完整工作代码中,我本希望只将一个序列传给缩放器,然后把数据框的列设置为缩放后的序列。我已经在其他几个地方看到有人问这个问题,但没有找到好的答案。如果能帮助我理解这里发生了什么,将不胜感激!

I have a pandas dataframe with mixed type columns, and I’d like to apply sklearn’s min_max_scaler to some of the columns. Ideally, I’d like to do these transformations in place, but haven’t figured out a way to do that yet. I’ve written the following code that works:

import pandas as pd
import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()

dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
min_max_scaler = preprocessing.MinMaxScaler()

def scaleColumns(df, cols_to_scale):
    for col in cols_to_scale:
        df[col] = pd.DataFrame(min_max_scaler.fit_transform(pd.DataFrame(df[col])),columns=[col])
    return df

dfTest

    A   B   C
0    14.00   103.02  big
1    90.20   107.26  small
2    90.95   110.35  big
3    96.27   114.23  small
4    91.21   114.68  small

scaled_df = scaleColumns(dfTest,['A','B'])
scaled_df

A   B   C
0    0.000000    0.000000    big
1    0.926219    0.363636    small
2    0.935335    0.628645    big
3    1.000000    0.961407    small
4    0.938495    1.000000    small

I’m curious if this is the preferred/most efficient way to do this transformation. Is there a way I could use df.apply that would be better?

I’m also surprised I can’t get the following code to work:

bad_output = min_max_scaler.fit_transform(dfTest['A'])

If I pass an entire dataframe to the scaler it works:

dfTest2 = dfTest.drop('C', axis = 1)
good_output = min_max_scaler.fit_transform(dfTest2)
good_output

I’m confused why passing a series to the scaler fails. In my full working code above I had hoped to just pass a series to the scaler then set the dataframe column = to the scaled series. I’ve seen this question asked a few other places, but haven’t found a good answer. Any help understanding what’s going on here would be greatly appreciated!


回答 0

我不确定pandas的早期版本是否不支持这样做,但现在以下代码段对我来说效果很好,无需使用apply即可得到您想要的结果:

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler


>>> scaler = MinMaxScaler()

>>> dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
                           'B':[103.02,107.26,110.35,114.23,114.68],
                           'C':['big','small','big','small','small']})

>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])

>>> dfTest
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small

I am not sure if previous versions of pandas prevented this, but now the following snippet works perfectly for me and produces exactly what you want, without having to use apply.

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler


>>> scaler = MinMaxScaler()

>>> dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
                           'B':[103.02,107.26,110.35,114.23,114.68],
                           'C':['big','small','big','small','small']})

>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])

>>> dfTest
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
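
As a follow-up, the fitted scaler can also undo the transformation; a minimal sketch, reusing the scaler and column names from the example above:

>>> dfTest[['A', 'B']] = scaler.inverse_transform(dfTest[['A', 'B']])
>>> dfTest[['A', 'B']].round(2)
       A       B
0  14.00  103.02
1  90.20  107.26
2  90.95  110.35
3  96.27  114.23
4  91.21  114.68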

回答 1

像这样?

dfTest = pd.DataFrame({
           'A':[14.00,90.20,90.95,96.27,91.21],
           'B':[103.02,107.26,110.35,114.23,114.68], 
           'C':['big','small','big','small','small']
         })
dfTest[['A','B']] = dfTest[['A','B']].apply(
                           lambda x: MinMaxScaler().fit_transform(x))
dfTest

    A           B           C
0   0.000000    0.000000    big
1   0.926219    0.363636    small
2   0.935335    0.628645    big
3   1.000000    0.961407    small
4   0.938495    1.000000    small

Like this?

dfTest = pd.DataFrame({
           'A':[14.00,90.20,90.95,96.27,91.21],
           'B':[103.02,107.26,110.35,114.23,114.68], 
           'C':['big','small','big','small','small']
         })
dfTest[['A','B']] = dfTest[['A','B']].apply(
                           lambda x: MinMaxScaler().fit_transform(x))
dfTest

    A           B           C
0   0.000000    0.000000    big
1   0.926219    0.363636    small
2   0.935335    0.628645    big
3   1.000000    0.961407    small
4   0.938495    1.000000    small

回答 2

正如pir在评论中提到的那样,.apply(lambda el: scale.fit_transform(el))方法将产生以下警告:

DeprecationWarning:将一维数组作为数据传入的做法在0.17中已弃用,并将在0.19中引发ValueError。如果数据只有单个特征,请使用X.reshape(-1, 1)重塑数据;如果只包含单个样本,请使用X.reshape(1, -1)。

将您的列转换为numpy数组应该可以完成这项工作(我更喜欢StandardScaler):

from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

dfTest[['A','B','C']] = scale.fit_transform(dfTest[['A','B','C']].as_matrix())

编辑 2018年11月(已针对pandas 0.23.4测试)-

正如Rob Murray在评论中提到的,在当前(v0.23.4)版本的pandas中,.as_matrix()会返回FutureWarning。因此,应将其替换为.values:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit_transform(dfTest[['A','B']].values)

编辑 2019年5月(已针对pandas 0.24.2测试)-

正如joelostblom在评论中提到的那样:"自0.24.0起,建议使用.to_numpy()代替.values。"

更新的示例:

import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dfTest = pd.DataFrame({
               'A':[14.00,90.20,90.95,96.27,91.21],
               'B':[103.02,107.26,110.35,114.23,114.68],
               'C':['big','small','big','small','small']
             })
dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A','B']].to_numpy())
dfTest
      A         B      C
0 -1.995290 -1.571117    big
1  0.436356 -0.603995  small
2  0.460289  0.100818    big
3  0.630058  0.985826  small
4  0.468586  1.088469  small

As it is being mentioned in pir’s comment – the .apply(lambda el: scale.fit_transform(el)) method will produce the following warning:

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

Converting your columns to numpy arrays should do the job (I prefer StandardScaler):

from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

dfTest[['A','B','C']] = scale.fit_transform(dfTest[['A','B','C']].as_matrix())

Edit Nov 2018 (Tested for pandas 0.23.4)–

As Rob Murray mentions in the comments, in the current (v0.23.4) version of pandas .as_matrix() returns FutureWarning. Therefore, it should be replaced by .values:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit_transform(dfTest[['A','B']].values)

Edit May 2019 (Tested for pandas 0.24.2)–

As joelostblom mentions in the comments, “Since 0.24.0, it is recommended to use .to_numpy() instead of .values.”

Updated example:

import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dfTest = pd.DataFrame({
               'A':[14.00,90.20,90.95,96.27,91.21],
               'B':[103.02,107.26,110.35,114.23,114.68],
               'C':['big','small','big','small','small']
             })
dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A','B']].to_numpy())
dfTest
      A         B      C
0 -1.995290 -1.571117    big
1  0.436356 -0.603995  small
2  0.460289  0.100818    big
3  0.630058  0.985826  small
4  0.468586  1.088469  small

回答 3

df = pd.DataFrame(scale.fit_transform(df.values), columns=df.columns, index=df.index)

这样应该不会产生弃用警告。

df = pd.DataFrame(scale.fit_transform(df.values), columns=df.columns, index=df.index)

This should work without deprecation warnings.


回答 4

您可以只使用pandas来完成此操作:

In [235]:
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
df = dfTest[['A', 'B']]
df_norm = (df - df.min()) / (df.max() - df.min())
print df_norm
print pd.concat((df_norm, dfTest.C),1)

          A         B
0  0.000000  0.000000
1  0.926219  0.363636
2  0.935335  0.628645
3  1.000000  0.961407
4  0.938495  1.000000
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small

You can do it using pandas only:

In [235]:
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],'B':[103.02,107.26,110.35,114.23,114.68], 'C':['big','small','big','small','small']})
df = dfTest[['A', 'B']]
df_norm = (df - df.min()) / (df.max() - df.min())
print df_norm
print pd.concat((df_norm, dfTest.C),1)

          A         B
0  0.000000  0.000000
1  0.926219  0.363636
2  0.935335  0.628645
3  1.000000  0.961407
4  0.938495  1.000000
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small

回答 5

我知道这是一个很老的评论,但仍然:

不要使用单括号(dfTest['A']),而应使用双括号(dfTest[['A']])

即:min_max_scaler.fit_transform(dfTest[['A']])

我相信这会取得理想的结果。

I know it’s a very old comment, but still:

Instead of using single bracket (dfTest['A']), use double brackets (dfTest[['A']]).

i.e: min_max_scaler.fit_transform(dfTest[['A']]).

I believe this will give the desired result.
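
If you'd rather express "scale only these columns" declaratively, a hedged sketch with sklearn.compose.ColumnTransformer (available since scikit-learn 0.20) does the same column selection:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
                       'B': [103.02, 107.26, 110.35, 114.23, 114.68],
                       'C': ['big', 'small', 'big', 'small', 'small']})

# Scale A and B; pass C through untouched
ct = ColumnTransformer([('minmax', MinMaxScaler(), ['A', 'B'])],
                       remainder='passthrough')
scaled = ct.fit_transform(dfTest)  # numpy array: scaled A, B, then C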


Python中两个日期之间的差异

问题:Python中两个日期之间的差异

我有两个不同的日期,我想知道它们之间的天数差异。日期格式为YYYY-MM-DD。

我有一个可以给日期加上或减去给定天数的函数:

def addonDays(a, x):
   ret = time.strftime("%Y-%m-%d",time.localtime(time.mktime(time.strptime(a,"%Y-%m-%d"))+x*3600*24+3600))      
   return ret

其中A是日期,x是我要添加的天数。结果是另一个日期。

我需要一个函数:传入两个日期,返回以天为单位的日期差整数。

I have two different dates and I want to know the difference in days between them. The format of the date is YYYY-MM-DD.

I have a function that can ADD or SUBTRACT a given number to a date:

def addonDays(a, x):
   ret = time.strftime("%Y-%m-%d",time.localtime(time.mktime(time.strptime(a,"%Y-%m-%d"))+x*3600*24+3600))      
   return ret

where A is the date and x the number of days I want to add. And the result is another date.

I need a function where I can give two dates and the result would be an int with date difference in days.


回答 0

使用-运算符求两个datetime对象之间的差,然后取其days成员。

from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

Use - to get the difference between two datetime objects and take the days member.

from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
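
On Python 3.7+, a slightly shorter sketch can skip strptime, since the question's YYYY-MM-DD strings are exactly the ISO format that date.fromisoformat expects:

from datetime import date

def days_between(d1, d2):
    # "YYYY-MM-DD" is the ISO format, so no format string is needed
    return abs((date.fromisoformat(d2) - date.fromisoformat(d1)).days)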

回答 1

另一个简短的解决方案:

from datetime import date

def diff_dates(date1, date2):
    return abs(date2-date1).days

def main():
    d1 = date(2013,1,1)
    d2 = date(2013,9,13)
    result1 = diff_dates(d2, d1)
    print '{} days between {} and {}'.format(result1, d1, d2)
    print ("Happy programmer's day!")

main()

Another short solution:

from datetime import date

def diff_dates(date1, date2):
    return abs(date2-date1).days

def main():
    d1 = date(2013,1,1)
    d2 = date(2013,9,13)
    result1 = diff_dates(d2, d1)
    print '{} days between {} and {}'.format(result1, d1, d2)
    print ("Happy programmer's day!")

main()

回答 2

我尝试了上面larsmans发布的代码,但是有两个问题:

1)原样的代码会抛出mauguerra提到的错误;
2)如果将代码更改为以下内容:

...
    d1 = d1.strftime("%Y-%m-%d")
    d2 = d2.strftime("%Y-%m-%d")
    return abs((d2 - d1).days)

这会将您的datetime对象转换为字符串,但有两个问题:

1)尝试执行d2 - d1会失败,因为不能对字符串使用减号运算符;2)上述答案的第一行说的是要对两个datetime对象使用-运算符,而这里却恰好把它们转换成了字符串。

我发现您实际上只需要以下内容:

import datetime

end_date = datetime.datetime.utcnow()
start_date = end_date - datetime.timedelta(days=8)
difference_in_days = abs((end_date - start_date).days)

print difference_in_days

I tried the code posted by larsmans above but, there are a couple of problems:

1) The code as-is will throw the error mentioned by mauguerra; 2) if you change the code to the following:

...
    d1 = d1.strftime("%Y-%m-%d")
    d2 = d2.strftime("%Y-%m-%d")
    return abs((d2 - d1).days)

This will convert your datetime objects to strings, but there are two problems:

1) Trying to do d2 - d1 will fail, as you cannot use the minus operator on strings, and 2) the first line of the above answer said you want to use the - operator on two datetime objects, but here you just converted them to strings.

What I found is that you literally only need the following:

import datetime

end_date = datetime.datetime.utcnow()
start_date = end_date - datetime.timedelta(days=8)
difference_in_days = abs((end_date - start_date).days)

print difference_in_days

回答 3

试试这个:

data=pd.read_csv(r'C:\Users\Desktop\Data Exploration.csv')  # raw string so the backslashes aren't treated as escapes
data.head(5)
first=data['1st Gift']
last=data['Last Gift']
maxi=data['Largest Gift']
l_1=np.mean(first)-3*np.std(first)
u_1=np.mean(first)+3*np.std(first)


m=np.abs(data['1st Gift']-np.mean(data['1st Gift']))>3*np.std(data['1st Gift'])
pd.value_counts(m)
l=first[m]
data.loc[:,'1st Gift'][m==True]=np.mean(data['1st Gift'])+3*np.std(data['1st Gift'])
data['1st Gift'].head()




m=np.abs(data['Last Gift']-np.mean(data['Last Gift']))>3*np.std(data['Last Gift'])
pd.value_counts(m)
l=last[m]
data.loc[:,'Last Gift'][m==True]=np.mean(data['Last Gift'])+3*np.std(data['Last Gift'])
data['Last Gift'].head()

Try this:

data=pd.read_csv(r'C:\Users\Desktop\Data Exploration.csv')  # raw string so the backslashes aren't treated as escapes
data.head(5)
first=data['1st Gift']
last=data['Last Gift']
maxi=data['Largest Gift']
l_1=np.mean(first)-3*np.std(first)
u_1=np.mean(first)+3*np.std(first)


m=np.abs(data['1st Gift']-np.mean(data['1st Gift']))>3*np.std(data['1st Gift'])
pd.value_counts(m)
l=first[m]
data.loc[:,'1st Gift'][m==True]=np.mean(data['1st Gift'])+3*np.std(data['1st Gift'])
data['1st Gift'].head()




m=np.abs(data['Last Gift']-np.mean(data['Last Gift']))>3*np.std(data['Last Gift'])
pd.value_counts(m)
l=last[m]
data.loc[:,'Last Gift'][m==True]=np.mean(data['Last Gift'])+3*np.std(data['Last Gift'])
data['Last Gift'].head()

回答 4

pd.date_range('2019-01-01', '2019-02-01').shape[0]

(注意:date_range的两个端点都包含在内,因此这里得到32,比(d2 - d1).days的31多1。)

pd.date_range('2019-01-01', '2019-02-01').shape[0]

(Note: date_range includes both endpoints, so this gives 32, one more than the 31 returned by (d2 - d1).days.)


如何防止Google Colab断开连接?

问题:如何防止Google Colab断开连接?

问:是否可以通过编程方式防止Google Colab在超时时断开连接?

下面介绍导致笔记本计算机自动断开连接的情况:

Google Colab笔记本的空闲超时为90分钟,绝对超时为12小时。这意味着,如果用户在超过90分钟的时间内未与其Google Colab笔记本互动,则其实例将自动终止。另外,Colab实例的最大生存期为12小时。

自然,我们希望自动将最大值从实例中挤出,而不必不断地手动与之交互。在这里,我将假定常见的系统要求:

  • Ubuntu 18 LTS / Windows 10 / Mac操作系统
  • 对于基于Linux的系统,请使用流行的DE,例如Gnome 3或Unity
  • Firefox或Chromium浏览器

我要在这里指出,这种行为并未违反Google Colab的使用条款,尽管根据其常见问题解答并不鼓励这样做(简而言之:从道德上讲,如果您并不真正需要,就不应占用所有GPU)。


我当前的解决方案非常愚蠢:

  • 首先,我关闭屏幕保护程序,因此我的屏幕始终保持打开状态。
  • 我有一个Arduino开发板,所以我把它做成了一个rubber ducky USB,让它在我睡觉时模拟简单的用户交互(只是因为我手头正好有它,本来是用于其他用例的)。

有更好的方法吗?

Q: Is there any way to programmatically prevent Google Colab from disconnecting on a timeout?

The following describes the conditions causing a notebook to automatically disconnect:

Google Colab notebooks have an idle timeout of 90 minutes and absolute timeout of 12 hours. This means, if user does not interact with his Google Colab notebook for more than 90 minutes, its instance is automatically terminated. Also, maximum lifetime of a Colab instance is 12 hours.

Naturally, we want to automatically squeeze the maximum out of the instance, without having to manually interact with it constantly. Here I will assume commonly seen system requirements:

  • Ubuntu 18 LTS / Windows 10 / Mac Operating systems
  • In case of Linux-based systems, using popular DEs like Gnome 3 or Unity
  • Firefox or Chromium browsers

I should point out here that such behavior does not violate Google Colab’s Terms of Use, although it is not encouraged according to their FAQ (in short: morally it is not okay to use up all of the GPUs if you don’t really need it).


My current solution is very dumb:

  • First, I turn the screensaver off, so my screen is always on.
  • I have an Arduino board, so I just turned it into a rubber ducky usb and make it emulate primitive user interaction while I sleep (just because I have it at hand for other use-cases).

Are there better ways?


回答 0

编辑: 显然,该解决方案非常简单,并且不需要任何JavaScript。只需在底部创建具有以下行的新单元格:

while True:pass

现在将单元格保持在运行顺序中,以便无限循环不会停止,从而使会话保持活动状态。

旧方法: 设置一个JavaScript定时器,每60秒点击一次connect按钮。在您的Web浏览器中用Ctrl + Shift + I打开开发人员工具,然后单击控制台选项卡,在控制台提示符下输入以下代码。(在Mac上,请按Option + Command + I)

function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);

Edit: Apparently the solution is very easy, and doesn’t need any JavaScript. Just create a new cell at the bottom having the following line:

while True:pass

now keep the cell in the run sequence so that the infinite loop won’t stop and thus keep your session alive.
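
One caveat with while True:pass is that it keeps a CPU core spinning at 100%; a gentler sketch that keeps the cell running just as well is:

import time

while True:
    time.sleep(60)  # the cell stays "running" without burning CPU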

Old method: Set a javascript interval to click on the connect button every 60 seconds. Open developer-settings (in your web-browser) with Ctrl+Shift+I then click on console tab and type this on the console prompt. (for mac press Option+Command+I)

function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);

回答 1

由于现在将连接按钮的ID更改为“ colab-connect-button”,因此可以使用以下代码来继续单击该按钮。

function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("colab-connect-button").click()
}
setInterval(ClickConnect,60000)

如果仍然无法解决问题,请按照以下步骤操作:

  1. 右键单击连接按钮(位于colab的右上方)
  2. 点击检查
  3. 获取按钮的HTML ID并替换为以下代码
function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("Put ID here").click() // Change id here
}
setInterval(ClickConnect,60000)

Since the id of the connect button is now changed to “colab-connect-button”, the following code can be used to keep clicking on the button.

function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("colab-connect-button").click()
}
setInterval(ClickConnect,60000)

If this still doesn’t work, then follow the steps given below:

  1. Right-click on the connect button (on the top-right side of the colab)
  2. Click on inspect
  3. Get the HTML id of the button and substitute in the following code
function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("Put ID here").click() // Change id here
}
setInterval(ClickConnect,60000)

回答 2

嗯,这对我有用-

在控制台中运行以下代码,它将防止您断开连接。按Ctrl + Shift + i打开检查器视图,然后进入控制台选项卡。

function ClickConnect(){
    console.log("Working"); 
    document.querySelector("colab-toolbar-button#connect").click() 
}
setInterval(ClickConnect,60000)

如何防止Google Colab断开连接

Well this is working for me –

run the following code in the console and it will prevent you from disconnecting. Press Ctrl + Shift + i to open the inspector view, then go to the console.

function ClickConnect(){
    console.log("Working"); 
    document.querySelector("colab-toolbar-button#connect").click() 
}
setInterval(ClickConnect,60000)

How to prevent google colab from disconnecting


回答 3

对我而言,以下示例:

  • document.querySelector("#connect").click() 要么
  • document.querySelector("colab-toolbar-button#connect").click() 要么
  • document.querySelector("colab-connect-button").click()

抛出错误。

我必须将它们改写如下:

版本1:

function ClickConnect(){
  console.log("Connnect Clicked - Start"); 
  document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
  console.log("Connnect Clicked - End"); 
};
setInterval(ClickConnect, 60000)

版本2: 如果您希望能够停止该功能,请使用以下新代码:

var startClickConnect = function startClickConnect(){
    var clickConnect = function clickConnect(){
        console.log("Connnect Clicked - Start");
        document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
        console.log("Connnect Clicked - End"); 
    };

    var intervalId = setInterval(clickConnect, 60000);

    var stopClickConnectHandler = function stopClickConnect() {
        console.log("Connnect Clicked Stopped - Start");
        clearInterval(intervalId);
        console.log("Connnect Clicked Stopped - End");
    };

    return stopClickConnectHandler;
};

var stopClickConnect = startClickConnect();

为了停止,请调用:

stopClickConnect();

For me the following examples:

  • document.querySelector("#connect").click() or
  • document.querySelector("colab-toolbar-button#connect").click() or
  • document.querySelector("colab-connect-button").click()

were throwing errors.

I had to adapt them to the following:

Version 1:

function ClickConnect(){
  console.log("Connnect Clicked - Start"); 
  document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
  console.log("Connnect Clicked - End"); 
};
setInterval(ClickConnect, 60000)

Version 2: If you would like to be able to stop the function, here is the new code:

var startClickConnect = function startClickConnect(){
    var clickConnect = function clickConnect(){
        console.log("Connnect Clicked - Start");
        document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
        console.log("Connnect Clicked - End"); 
    };

    var intervalId = setInterval(clickConnect, 60000);

    var stopClickConnectHandler = function stopClickConnect() {
        console.log("Connnect Clicked Stopped - Start");
        clearInterval(intervalId);
        console.log("Connnect Clicked Stopped - End");
    };

    return stopClickConnectHandler;
};

var stopClickConnect = startClickConnect();

In order to stop, call:

stopClickConnect();

回答 4

使用pynput在您的PC上创建如下python代码:

from pynput.mouse import Button, Controller
import time

mouse = Controller()

while True:
    mouse.click(Button.left, 1)
    time.sleep(30)

在您的电脑上运行此代码,然后将鼠标指针悬停在目录结构(Colab左侧面板的文件部分)中的任意目录上。此代码会每30秒单击一次该目录,使其每30秒展开、收起一次,这样您的会话就不会过期。重要:您必须在自己的PC上运行此代码。

create a python code in your pc with pynput

from pynput.mouse import Button, Controller
import time

mouse = Controller()

while True:
    mouse.click(Button.left, 1)
    time.sleep(30)

Run this code on your desktop, then point the mouse arrow at any directory in the directory structure (Colab's left panel, file section). This code will keep clicking on that directory every 30 seconds, so it will expand and shrink every 30 seconds and your session will not expire. Important: you have to run this code on your PC.


回答 5

我没有去单击"连接"按钮,而是通过单击"评论"按钮来使会话保持活动状态。(2020年8月)

function ClickConnect(){

console.log("Working"); 
document.querySelector("#comments > span").click() 
}
setInterval(ClickConnect,5000)

Instead of clicking the connect button, I just click the comment button to keep my session alive. (August 2020)

function ClickConnect(){

console.log("Working"); 
document.querySelector("#comments > span").click() 
}
setInterval(ClickConnect,5000)

回答 6

我使用宏程序定期单击RAM/Disk按钮,以便整夜训练模型。诀窍是配置宏程序,以较短的间隔连续两次单击RAM/Disk这个Colab工具栏按钮,这样即使运行时断开连接,它也会重新连接(第一次单击用于关闭对话框,第二次单击用于重新连接)。但是,您仍然必须整夜让笔记本电脑保持打开,或许还要固定住Colab标签页。

I use a Macro Program to periodically click on the RAM/Disk button to train the model all night. The trick is to configure a macro program to click on the Ram/Disk Colab Toolbar Button twice with a short interval between the two clicks so that even if the Runtime gets disconnected it will reconnect back. (the first click used to close the dialog box and the second click used to RECONNECT). However, you still have to leave your laptop open all night and maybe pin the Colab tab.


回答 7

借助某些脚本,以上答案或许效果很好。对于不用脚本应对这种烦人的断连,我有一个解决方案(或者说一种技巧),尤其适用于程序必须从google驱动器读取数据的情况,例如训练深度学习网络模型。此时用脚本做重连操作没有用:一旦与colab断开连接,程序就死了,您必须再次手动连接到Google驱动器,模型才能重新读取数据集,而脚本不会做这件事。
我已经测试过很多次,效果很好。
当您用浏览器(我用Chrome)在colab页面上运行程序时,请记住:程序一旦开始运行,就不要对浏览器做任何操作,例如切换到其他网页、打开或关闭另一个网页等等,把它放在那里等程序运行结束即可。您可以切换到pycharm等其他软件继续写代码,但不要切换到另一个网页。我不知道为什么打开、关闭或切换到其他页面会导致google colab页面的连接问题,但每次我去动浏览器(比如做一些搜索),与colab的连接很快就会断开。

The above answers, with the help of some scripts, may work well. I have a solution (or a kind of trick) for that annoying disconnection without scripts, especially when your program must read data from your google drive, like training a deep learning network model, where using scripts to do the reconnect operation is of no use: once you disconnect from colab, the program is just dead, and you have to manually connect to your google drive again to make your model able to read the dataset, which the scripts will not do.
I’ve already tested it many times and it works well.
When you run a program on the colab page with a browser (I use Chrome), just remember not to do any operation in your browser once your program starts running, like switching to other webpages or opening or closing another webpage; just leave it alone there and wait for your program to finish running. You can switch to another software, like pycharm, to keep writing your code, but do not switch to another webpage. I don’t know why opening or closing or switching to other pages causes the connection problem of the google colab page, but each time I bothered my browser, like doing some search job, my connection to colab would soon break down.


回答 8

尝试这个:

function ClickConnect(){
  console.log("Working"); 
  document
    .querySelector("#top-toolbar > colab-connect-button")
    .shadowRoot
    .querySelector("#connect")
    .click()
}

setInterval(ClickConnect,60000)

Try this:

function ClickConnect(){
  console.log("Working"); 
  document
    .querySelector("#top-toolbar > colab-connect-button")
    .shadowRoot
    .querySelector("#connect")
    .click()
}

setInterval(ClickConnect,60000)

回答 9

使用python selenium

from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import time   

driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver')

notebook_url = ''
driver.get(notebook_url)

# run all cells
driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.F9)
time.sleep(5)

# click to stay connected
start_time = time.time()
current_time = time.time()
max_time = 11*59*60 #12hours

while (current_time - start_time) < max_time:
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    driver.find_element_by_xpath('//*[@id="top-toolbar"]/colab-connect-button').click()
    time.sleep(30)
    current_time = time.time()

Using python selenium

from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import time   

driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver')

notebook_url = ''
driver.get(notebook_url)

# run all cells
driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.F9)
time.sleep(5)

# click to stay connected
start_time = time.time()
current_time = time.time()
max_time = 11*59*60 #12hours

while (current_time - start_time) < max_time:
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    driver.find_element_by_xpath('//*[@id="top-toolbar"]/colab-connect-button').click()
    time.sleep(30)
    current_time = time.time()

回答 10

我认为那些JavaScript解决方案已不再有效。我过去是在笔记本内部用以下代码来实现的:

    from IPython.display import display, HTML
    js = ('<script>function ConnectButton(){ '
           'console.log("Connect pushed"); '
           'document.querySelector("#connect").click()} '
           'setInterval(ConnectButton,3000);</script>')
    display(HTML(js))

首次执行全部运行时(在启动JavaScript或Python代码之前),控制台将显示:

Connected to 
wss://colab.research.google.com/api/kernels/0e1ce105-0127-4758-90e48cf801ce01a3/channels?session_id=5d8...

然而,每次运行这段JavaScript时,您都会看到console.log的部分,而click的部分只会给出:

Connect pushed

Uncaught TypeError: Cannot read property 'click' of null
 at ConnectButton (<anonymous>:1:92)

其他人提出按钮名称已更改为#colab-connect-button,但那样会产生相同的错误。

运行时启动后,该按钮会变为显示RAM/DISK,并出现一个下拉菜单。单击下拉菜单会创建一个之前不存在于DOM中的新<DIV class=goog menu...>元素,其中包含"连接到托管运行时"和"连接到本地运行时"两个选项。如果控制台窗口已打开并显示元素,则在单击下拉元素时可以看到这个DIV出现。只需在新出现的窗口中的两个选项之间移动鼠标焦点,就会向DOM添加额外的元素;一旦鼠标失去焦点,它们就会从DOM中被完全删除,甚至无需单击。

I don’t believe the JavaScript solutions work anymore. I was doing it from within my notebook with:

    from IPython.display import display, HTML
    js = ('<script>function ConnectButton(){ '
           'console.log("Connect pushed"); '
           'document.querySelector("#connect").click()} '
           'setInterval(ConnectButton,3000);</script>')
    display(HTML(js))

When you first do a Run all (before the JavaScript or Python code has started), the console displays:

Connected to 
wss://colab.research.google.com/api/kernels/0e1ce105-0127-4758-90e48cf801ce01a3/channels?session_id=5d8...

However, every time the JavaScript runs, you see the console.log portion, but the click portion simply gives:

Connect pushed

Uncaught TypeError: Cannot read property 'click' of null
 at ConnectButton (<anonymous>:1:92)

Others suggested the button name has changed to #colab-connect-button, but that gives the same error.

After the runtime is started, the button is changed to show RAM/DISK, and a drop down is presented. Clicking on the drop down creates a new <DIV class=goog menu...> that was not shown in the DOM previously, with 2 options "Connect to hosted runtime" and "Connect to local runtime". If the console window is open and showing elements, you can see this DIV appear when you click the dropdown element. Simply moving the mouse focus between the two options in the new window that appears adds additional elements to the DOM; as soon as the mouse loses focus, they are removed from the DOM completely, even without clicking.


回答 11

我尝试了上面的代码,但它们对我不起作用。这是我重新连接的JS代码。

let interval = setInterval(function(){
let ok = document.getElementById('ok');
if(ok != null){
   console.log("Connect pushed");
ok.click();
}},60000)

您可以用同样的方式(在浏览器的控制台中)运行它。如果要停止脚本,可以输入clearInterval(interval);要再次运行,则再次执行setInterval(interval)。

我希望这可以帮助你。

I tried the codes above but they did not work for me. So here is my JS code for reconnecting.

let interval = setInterval(function(){
let ok = document.getElementById('ok');
if(ok != null){
   console.log("Connect pushed");
ok.click();
}},60000)

You can use it in the same way (run it on the console of your browser). If you want to stop the script, you can enter clearInterval(interval); if you want to run it again, call setInterval(interval) again.

I hope this helps you.


回答 12

更新了一个。这个对我有用。

function ClickConnect(){
console.log("Working"); 
document.querySelector("paper-icon-button").click()
}
const myjob = setInterval(ClickConnect, 60000)

如果它对您不起作用,请尝试运行以下命令将其清除:

clearInterval(myjob)

Updated one. it works for me.

function ClickConnect(){
console.log("Working"); 
document.querySelector("paper-icon-button").click()
}
const myjob = setInterval(ClickConnect, 60000)

If it isn’t working for you guys, try clearing it by running:

clearInterval(myjob)

回答 13

这对我有用(似乎他们更改了按钮的类名或ID):

function ClickConnect(){
    console.log("Working"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)

This one worked for me (it seems like they changed the button classname or id) :

function ClickConnect(){
    console.log("Working"); 
    document.querySelector("colab-connect-button").click() 
}
setInterval(ClickConnect,60000)

回答 14

得票最多的答案对我当然有用,但它会让"管理会话"窗口一次又一次地弹出。
我通过在浏览器控制台中像下面这样自动单击刷新按钮解决了这个问题:

function ClickRefresh(){
    console.log("Clicked on refresh button"); 
    document.querySelector("paper-icon-button").click()
}
setInterval(ClickRefresh, 60000)

随时在此要点上为此贡献更多代码片段https://gist.github.com/Subangkar/fd1ef276fd40dc374a7c80acc247613e

The most voted answer certainly works for me, but it makes the Manage session window pop up again and again.
I’ve solved that by auto-clicking the refresh button using the browser console like below:

function ClickRefresh(){
    console.log("Clicked on refresh button"); 
    document.querySelector("paper-icon-button").click()
}
setInterval(ClickRefresh, 60000)

Feel free to contribute more snippets for this at this gist https://gist.github.com/Subangkar/fd1ef276fd40dc374a7c80acc247613e


回答 15

也许以前的许多解决方案都已不再起作用。例如,下面这段代码虽然有效,但会不断在Colab中创建新的代码单元。毫无疑问,创建一堆代码单元会带来不便:如果运行几个小时后创建了太多代码单元而RAM不足,浏览器可能会冻结。

下面的代码会反复创建代码单元:

function ClickConnect(){
console.log("Working"); 
document.querySelector("colab-toolbar-button").click() 
}setInterval(ClickConnect,60000)

但我发现下面的代码可以正常工作,而且不会引起任何问题。在Colab笔记本标签页中,同时按下Ctrl + Shift + i,然后将以下代码粘贴到控制台中。120000毫秒的间隔就足够了。

function ClickConnect(){
console.log("Working"); 
document.querySelector("colab-toolbar-button#connect").click() 
}setInterval(ClickConnect,120000)

我已在2020年11月在firefox中测试了此代码。它也将在chrome上工作。

Perhaps many of the previous solutions are no longer working. For example, the code below does work, but it keeps creating new code cells in Colab. Undoubtedly, creating a bunch of code cells is an inconvenience: if too many code cells are created over some hours of running and there is not enough RAM, the browser may freeze.

The code below repeatedly creates code cells:

function ClickConnect(){
console.log("Working"); 
document.querySelector("colab-toolbar-button").click() 
}setInterval(ClickConnect,60000)

But I found the code below works and doesn’t cause any problems. In the Colab notebook tab, press Ctrl + Shift + i and paste the code below into the console. An interval of 120000 ms is enough.

function ClickConnect(){
console.log("Working"); 
document.querySelector("colab-toolbar-button#connect").click() 
}setInterval(ClickConnect,120000)

I have tested this code in firefox, in November 2020. It will work on chrome too.


回答 16

我建议使用JQuery(似乎Co-lab默认包含JQuery)。

function ClickConnect(){
  console.log("Working");
  $("colab-toolbar-button").click();
}
setInterval(ClickConnect,60000);

I would recommend using JQuery (It seems that Co-lab includes JQuery by default).

function ClickConnect(){
  console.log("Working");
  $("colab-toolbar-button").click();
}
setInterval(ClickConnect,60000);

回答 17

这些JavaScript函数存在问题:

function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("colab-connect-button").click()
}
setInterval(ClickConnect,60000)

它们在按钮真正被单击之前就在控制台上打印"Clicked on connect button"。从这个帖子里不同的答案可以看出,自Google Colab上线以来,connect按钮的id已经改过好几次,将来也可能再改。因此,如果您从这个帖子里复制一个旧答案,它可能会打印"Clicked on connect button",但实际上并没有点到按钮。当然,如果单击不起作用,控制台上会显示一个错误,但万一您恰好没看到呢?所以最好这样写:

function ClickConnect(){
    document.querySelector("colab-connect-button").click()
    console.log("Clicked on connect button"); 
}
setInterval(ClickConnect,60000)

您肯定会看到它是否真正起作用。

I have a problem with these javascript functions:

function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("colab-connect-button").click()
}
setInterval(ClickConnect,60000)

They print “Clicked on connect button” on the console before the button is actually clicked. As you can see from different answers in this thread, the id of the connect button has changed a couple of times since Google Colab was launched. And it could change in the future as well. So if you copy an old answer from this thread, it may say “Clicked on connect button” but it may actually not do that. Of course, if the clicking doesn’t work it will print an error on the console, but what if you don’t happen to see it? So you had better do this:

function ClickConnect(){
    document.querySelector("colab-connect-button").click()
    console.log("Clicked on connect button"); 
}
setInterval(ClickConnect,60000)

And you’ll definitely see if it truly works or not.


回答 18

function ClickConnect()
{
    console.log("Working...."); 
    document.querySelector("paper-button#comments").click()
}
setInterval(ClickConnect,600)

这对我有用,但请谨慎使用。

快乐学习:)

function ClickConnect()
{
    console.log("Working...."); 
    document.querySelector("paper-button#comments").click()
}
setInterval(ClickConnect,600)

this worked for me but use wisely

happy learning :)


回答 19

以下最新解决方案适用于我:

function ClickConnect(){
  colab.config
  console.log("Connnect Clicked - Start"); 
  document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
  console.log("Connnect Clicked - End");
};
setInterval(ClickConnect, 60000)

the following LATEST solution works for me:

function ClickConnect(){
  colab.config
  console.log("Connnect Clicked - Start"); 
  document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
  console.log("Connnect Clicked - End");
};
setInterval(ClickConnect, 60000)

回答 20

下面的JavaScript对我有用。感谢@artur.k.space。

function ColabReconnect() {
    var dialog = document.querySelector("colab-dialog.yes-no-dialog");
    var dialogTitle = dialog && dialog.querySelector("div.content-area>h2");
    if (dialogTitle && dialogTitle.innerText == "Runtime disconnected") {
        dialog.querySelector("paper-button#ok").click();
        console.log("Reconnecting...");
    } else {
        console.log("ColabReconnect is in service.");
    }
}
timerId = setInterval(ColabReconnect, 60000);

在Colab笔记本中,同时按下Ctrl + Shift + i键,将脚本复制并粘贴到控制台提示行中,然后在关闭编辑器之前按Enter。

这样,该功能将每60秒检查一次,以查看是否显示了屏幕连接对话框,如果显示,则该功能将ok自动为您单击该按钮。

The javascript below works for me. Credits to @artur.k.space.

function ColabReconnect() {
    var dialog = document.querySelector("colab-dialog.yes-no-dialog");
    var dialogTitle = dialog && dialog.querySelector("div.content-area>h2");
    if (dialogTitle && dialogTitle.innerText == "Runtime disconnected") {
        dialog.querySelector("paper-button#ok").click();
        console.log("Reconnecting...");
    } else {
        console.log("ColabReconnect is in service.");
    }
}
timerId = setInterval(ColabReconnect, 60000);

In the Colab notebook, click on Ctrl + Shift + the i key simultaneously. Copy and paste the script into the prompt line. Then hit Enter before closing the editor.

By doing so, the function will check every 60 seconds to see if the onscreen connection dialog is shown, and if it is, the function would then click the ok button automatically for you.
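Since the interval id is stored in timerId, you can stop the checker at any point from the same console:

clearInterval(timerId);  // stop the auto-reconnect checker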


回答 21

Well, I am not a Python guy, nor do I know what the actual use of this 'Colab' is; I use it as a build system, lol. I used to set up SSH forwarding in it, then put this code in and just leave it running, and yeah, it works.

import getpass
authtoken = getpass.getpass()

The getpass.getpass() call simply blocks waiting for input, so the notebook keeps an active cell running.

回答 22

This code keeps clicking "Refresh folder" in the file explorer pane.

function ClickRefresh(){
  console.log("Working"); 
  document.querySelector("[icon='colab:folder-refresh']").click()
}
const myjob = setInterval(ClickRefresh, 60000)

回答 23

GNU Colab lets you run a standard persistent desktop environment on top of a Colaboratory instance.

Indeed, it contains a mechanism to keep machines from being killed for idling.

Here's a video demonstration.


回答 24

You can also use Python to press the arrow keys (note that pyautogui controls the keyboard of the machine it runs on, so this script runs on your local machine). I added a little bit of randomness in the following code as well.

import time
from random import shuffle

from pyautogui import press

array = ["left", "right", "up", "down"]

while True:
    shuffle(array)  # randomize the order in which the four keys are pressed
    time.sleep(10)
    press(array[0])
    press(array[1])
    press(array[2])
    press(array[3])

回答 25

Just run the code below after the cell you want to protect from data loss.

!python

This starts an interactive Python shell, so the cell keeps running. To exit this mode, type

exit()

回答 26

I was looking for a solution until I found a Python 3 script that moves the mouse back and forth and clicks, always in the same place, but that is enough to fool Colab into thinking I am active on the notebook, so it does not disconnect.

import threading
import time

import mouse
import numpy as np

def move_mouse():
    while True:
        random_row = np.random.random_sample() * 100
        random_col = np.random.random_sample() * 10
        random_time = np.random.random_sample() * np.random.random_sample() * 100
        mouse.wheel(1000)   # scroll down and back up
        mouse.wheel(-1000)
        mouse.move(random_row, random_col, absolute=False, duration=0.2)
        mouse.move(-random_row, -random_col, absolute=False, duration=0.2)
        mouse.click(mouse.LEFT)  # actually perform the left click
        time.sleep(random_time)


x = threading.Thread(target=move_mouse)
x.start()

You need to install the required packages: sudo -H pip3 install <package_name>. You just need to run it (on your local machine) with sudo (as it takes control of the mouse), and it should work, letting you take full advantage of Colab's 12-hour sessions.

Credits: For those using Colab (Pro): Preventing Session from disconnecting due to inactivity


回答 27

function ClickConnect(){
    console.log("Clicked on connect button"); 
    document.querySelector("#connect").click() // change the id here if your button differs
}
setInterval(ClickConnect,60000)

Try the code above; it worked for me :)