标签归档:Python

词形化与词干的区别是什么?

问题:词形化与词干的区别是什么?

什么时候使用每个?

另外… NLTK词素化是否取决于词类?如果不是,它会更准确吗?

When do I use each ?

Also…is the NLTK lemmatization dependent upon Parts of Speech? Wouldn’t it be more accurate if it was?


回答 0

简短而密集:http : //nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

词干和词根化的目的都是将单词的屈折形式和有时与派生相关的形式减少为通用的基本形式。

但是,这两个词的风格不同。词干通常是指粗略的启发式过程,该过程会切掉单词的结尾,以期在大多数时间正确实现此目标,并且通常包括删除派生词缀。词法词化通常是指使用单词的词汇和词法分析来正确处理事情,通常旨在仅去除词尾变化,并返回单词的基数或字典形式,这被称为引理。

从NLTK文档:

引词化和词干化是规范化的特殊情况。他们为一组相关的单词形式确定规范的代表。

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma .

From the NLTK docs:

Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.


回答 1

合法化阻止密切相关。不同之处在于,词干分析器在不了解上下文的情况下对单个单词进行操作,因此无法根据词性区分具有不同含义的单词。但是,茎杆通常更易于实现和运行得更快,并且降低的精度对于某些应用可能并不重要。

例如:

  1. “更好”一词的引理是“好”。由于需要字典查找,因此干掉了该链接。

  2. 单词“ walk”是单词“ walking”的基本形式,因此,它在词干和词根化方面均匹配。

  3. 根据上下文,“开会”一词可以是名词的基本形式,也可以是动词(“见面”)的形式,例如“在我们上次见面”或“我们明天再见面”。与词根提取不同,词条分解原则上可以根据上下文选择适当的词条。

资料来源https : //zh.wikipedia.org/wiki/合法化

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

For instance:

  1. The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

  2. The word “walk” is the base form for word “walking”, and hence this is matched in both stemming and lemmatisation.

  3. The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context, e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.

Source: https://en.wikipedia.org/wiki/Lemmatisation


回答 2

有两个方面可以显示它们的差异:

  1. 词干将返回一个字,这不必是完全相同的字的形态根的杆。即使词干本身不是有效的词根,通常只要相关词映射到相同的词干就足够了,而在词法化过程中,它将返回词的字典形式,该词必须是有效的词。

  2. lemmatisation,单词的语音部分,应首先确定和归一化的规则将是语音的不同部分不同,而词干上的单个字的运行不需要的上下文的知识,并且因此其具有不同的字之间不能区分含义取决于词性。

参考http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization

There are two aspects to show their differences:

  1. A stemmer will return the stem of a word, which needn’t be identical to the morphological root of the word. It usually sufficient that related words map to the same stem,even if the stem is not in itself a valid root, while in lemmatisation, it will return the dictionary form of a word, which must be a valid word.

  2. In lemmatisation, the part of speech of a word should be first determined and the normalisation rules will be different for different part of speech, while the stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

Reference http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization


回答 3

词干和词根化的目的都是为了减少形态变异。这与更通用的“术语合并”过程相反,后者也可以处理词汇语义,句法或正字法变化。

词干和词根化的真正区别有三点:

  1. 词干将单词形式简化为(伪)词干,而词义化将单词形式简化为在语言上有效的词缀。这种差异在形态更为复杂的语言中显而易见,但对于许多IR应用而言可能无关紧要。

  2. 引理化仅处理拐点变化,而词干化也可以处理导数变化。

  3. 在实现方面,词元化通常更为复杂(尤其是对于形态复杂的语言),并且通常需要某种词典。另一方面,可以通过相当简单的基于规则的方法来实现令人满意的词干。

词法标记器也可以支持词法化,以消除同音异义词的歧义。

The purpose of both stemming and lemmatization is to reduce morphological variation. This is in contrast to the the more general “term conflation” procedures, which may also address lexico-semantic, syntactic, or orthographic variations.

The real difference between stemming and lemmatization is threefold:

  1. Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas. This difference is apparent in languages with more complex morphology, but may be irrelevant for many IR applications;

  2. Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance;

  3. In terms of implementation, lemmatization is usually more sophisticated (especially for morphologically complex languages) and usually requires some sort of lexica. Satisfatory stemming, on the other hand, can be achieved with rather simple rule-based approaches.

Lemmatization may also be backed up by a part-of-speech tagger in order to disambiguate homonyms.


回答 4

正如MYYN所指出的那样,词干提取是将词尾的,有时是衍生词的词缀去除为所有原始词都可能与之相关的基本形式的过程。词法化与获得单个单词有关,该单词使您可以将一堆变形的表格组合在一起。这比阻止更难,因为它需要考虑上下文(因此要考虑单词的含义),而阻止则忽略上下文。

至于何时使用一个或另一个,则取决于您的应用程序在多大程度上取决于正确理解上下文中单词的含义。如果您要进行机器翻译,则可能需要进行词形化处理,以避免错误翻译单词。如果您要对10亿个文档进行信息检索,而您有99%的查询(从1-3个字不等),那么您就可以满足要求。

对于NLTK,WordNetLemmatizer确实会使用词性,尽管您必须提供(否则默认为名词)。传递给它“鸽子”和“ v”会产生“潜水”,而传递给“鸽子”和“ n”会产生“鸽子”。

As MYYN pointed out, stemming is the process of removing inflectional and sometimes derivational affixes to a base form that all of the original words are probably related to. Lemmatization is concerned with obtaining the single word that allows you to group together a bunch of inflected forms. This is harder than stemming because it requires taking the context into account (and thus the meaning of the word), while stemming ignores context.

As for when you would use one or the other, it’s a matter of how much your application depends on getting the meaning of a word in context correct. If you’re doing machine translation, you probably want lemmatization to avoid mistranslating a word. If you’re doing information retrieval over a billion documents with 99% of your queries ranging from 1-3 words, you can settle for stemming.

As for NLTK, the WordNetLemmatizer does use the part of speech, though you have to provide it (otherwise it defaults to nouns). Passing it “dove” and “v” yields “dive” while “dove” and “n” yields “dove”.


回答 5

一个示例驱动的解释,说明了词源化和词干之间的区别:

词法化处理将“汽车”与“汽车”匹配以及将“汽车”与“汽车”匹配。

阻止处理将“ car”匹配到“ cars”

词法化意味着模糊词匹配的范围更广,仍然由相同的子系统处理。它暗示了用于引擎内低级处理的某些技术,也可能反映了工程上对术语的偏爱。

[…]以FAST为例,他们的词素化引擎不仅处理诸如单数或复数之类的基本单词变体,而且还处理诸如“热”匹配“暖”之类的词库运算符。

这并不是说其他​​引擎当然不会处理同义词,但是底层实现可能与处理基本词干的引擎不在同一个子系统中。

http://www.ideaeng.com/stemming-lemmatization-0601

An example-driven explanation on the differenes between lemmatization and stemming:

Lemmatization handles matching “car” to “cars” along with matching “car” to “automobile”.

Stemming handles matching “car” to “cars” .

Lemmatization implies a broader scope of fuzzy word matching that is still handled by the same subsystems. It implies certain techniques for low level processing within the engine, and may also reflect an engineering preference for terminology.

[…] Taking FAST as an example, their lemmatization engine handles not only basic word variations like singular vs. plural, but also thesaurus operators like having “hot” match “warm”.

This is not to say that other engines don’t handle synonyms, of course they do, but the low level implementation may be in a different subsystem than those that handle base stemming.

http://www.ideaeng.com/stemming-lemmatization-0601


回答 6

ianacl
但我认为词干是一个粗略的黑客人们用它来获得所有不同形式的同一个单词到它不必是本身就是一个合法的字基本形式
有点像波特施特默尔可以使用简单的正则表达式来消除常见字后缀

词法化将单词还原为实际的基本形式,在不规则动词的情况下,该词看起来可能与输入词
完全不同,例如Morpha之类的东西,它使用FST将名词和动词带入其基本形式

ianacl
but i think Stemming is a rough hack people use to get all the different forms of the same word down to a base form which need not be a legit word on its own
Something like the Porter Stemmer can uses simple regexes to eliminate common word suffixes

Lemmatization brings a word down to its actual base form which, in the case of irregular verbs, might look nothing like the input word
Something like Morpha which uses FSTs to bring nouns and verbs to their base form


回答 7

词干只是去除或阻止单词的最后几个字符,通常会导致错误的含义和拼写。词法化会考虑上下文,并将单词转换为其有意义的基本形式,即词法。有时,同一个词可以有多个不同的引词。我们应该在特定的上下文中为单词识别词性(POS)标签。以下示例说明了所有差异和用例:

  1. 如果您对“ 关怀 ” 一词进行词缀化,它将返回“ 关怀 ”。如果阻止,它将返回“ Car ”,这是错误的。
  2. 如果在动词上下文中对单词“ Stripes ”进行词缀化,它将返回“ Strip ”。如果在名词上下文中对其进行词缀化,它将返回“ Stripe ”。如果您只是阻止它,它将仅返回’ Strip ‘。
  3. 无论您是词干化还是词干化(例如走路,跑步,游泳走路,跑步,游泳等)您都会得到相同的结果。
  4. 归类化在计算上很昂贵,因为它涉及查找表,而不涉及查找表。如果您的数据集很大并且性能是一个问题,请使用Stemming。请记住,您也可以将自己的规则添加到“词干”中。如果准确性至高无上,并且数据集不那么庞大,请使用Lemmatization。

Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:

  1. If you lemmatize the word ‘Caring‘, it would return ‘Care‘. If you stem, it would return ‘Car‘ and this is erroneous.
  2. If you lemmatize the word ‘Stripes‘ in verb context, it would return ‘Strip‘. If you lemmatize it in noun context, it would return ‘Stripe‘. If you just stem it, it would just return ‘Strip‘.
  3. You would get same results whether you lemmatize or stem words such as walking, running, swimming… to walk, run, swim etc.
  4. Lemmatization is computationally expensive since it involves look-up tables and what not. If you have large dataset and performance is an issue, go with Stemming. Remember you can also add your own rules to Stemming. If accuracy is paramount and dataset isn’t humongous, go with Lemmatization.

回答 8

词干处理是删除给定单词的最后几个字符以获得较短形式的过程,即使该形式没有任何意义。

例子,

"beautiful" -> "beauti"
"corpora" -> "corpora"

可以非常快速地完成茎梗。

反之,词法化是根据单词的字典含义将给定单词转换为其基本形式的过程。

例子,

"beautiful" -> "beauty"
"corpora" -> "corpus"

合法化要花费比阻止更多的时间。

Stemming is the process of removing the last few characters of a given word, to obtain a shorter form, even if that form doesn’t have any meaning.

Examples,

"beautiful" -> "beauti"
"corpora" -> "corpora"

Stemming can be done very quickly.

Lemmatization on the other hand, is the process of converting the given word into it’s base form according to the dictionary meaning of the word.

Examples,

"beautiful" -> "beauty"
"corpora" -> "corpus"

Lemmatization takes more time than stemming.


对于Django 2.0,在urls.py中使用path()或url()更好吗?

问题:对于Django 2.0,在urls.py中使用path()或url()更好吗?

在django在线类中,讲师让我们使用该url()函数调用视图并使用urlpatterns列表中的正则表达式。我在YouTube上看到了其他示例。例如

from django.contrib import admin
from django.urls import include
from django.conf.urls import url

urlpatterns = [
    path('admin/', admin.site.urls),
    url(r'^polls/', include('polls.urls')),
]


#and in polls/urls.py

urlpatterns = [        
    url(r'^$', views.index, name="index"),
]

但是,在阅读Django教程时,他们path()改用例如:

from django.urls import path
from . import views

urlpatterns = [
    path('', views.index, name="index"),        
]

此外,正则表达式似乎不适用于该path()函数,因为使用path(r'^$', views.index, name="index")将找不到mysite.com/polls/视图。

使用path()没有正则表达式匹配的正确方法是正确的吗?是url()更强大,但更复杂,所以他们正在使用path()与开始我们吗?还是针对不同工作使用不同工具的情况?

In a django online course, the instructor has us use the url() function to call views and utilize regular expressions in the urlpatterns list. I’ve seen other examples on youtube of this. e.g.

from django.contrib import admin
from django.urls import include
from django.conf.urls import url

urlpatterns = [
    path('admin/', admin.site.urls),
    url(r'^polls/', include('polls.urls')),
]


#and in polls/urls.py

urlpatterns = [        
    url(r'^$', views.index, name="index"),
]

However, in going through the Django tutorial, they use path() instead e.g.:

from django.urls import path
from . import views

urlpatterns = [
    path('', views.index, name="index"),        
]

Furthermore regular expressions don’t seem to work with the path() function as using a path(r'^$', views.index, name="index") won’t find the mysite.com/polls/ view.

Is using path() without regex matching the proper way going forward? Is url() more powerful but more complicated so they’re using path() to start us out with? Or is it a case of different tools for different jobs?


回答 0

从Django文档获取url

url(regex, view, kwargs=None, name=None)此函数是的别名django.urls.re_path()。在将来的版本中可能不推荐使用。

path和之间的主要区别re_pathpath使用不带正则表达式的路由

您可以re_path用于复杂的正则表达式调用,也可以仅path用于更简单的查找

From Django documentation for url

url(regex, view, kwargs=None, name=None) This function is an alias to django.urls.re_path(). It’s likely to be deprecated in a future release.

Key difference between path and re_path is that path uses route without regex

You can use re_path for complex regex calls and use just path for simpler lookups


回答 1

django.urls.path()功能允许使用更简单,更易读的URL路由语法。例如,此示例来自先前的Django版本:

url(r'^articles/(?P<year>[0-9]{4})/$', views.year_archive)

可以写成:

path('articles/<int:year>/', views.year_archive)

django.conf.urls.url() 以前版本中的功能现在可以作为django.urls.re_path()。保留旧位置是为了向后兼容,而不会很快淘汰。django.conf.urls.include()现在django.urls可以从导入旧功能,因此您可以使用:

from django.urls import include, path, re_path

URLconfs中。进一步阅读django doc

The new django.urls.path() function allows a simpler, more readable URL routing syntax. For example, this example from previous Django releases:

url(r'^articles/(?P<year>[0-9]{4})/$', views.year_archive)

could be written as:

path('articles/<int:year>/', views.year_archive)

The django.conf.urls.url() function from previous versions is now available as django.urls.re_path(). The old location remains for backwards compatibility, without an imminent deprecation. The old django.conf.urls.include() function is now importable from django.urls so you can use:

from django.urls import include, path, re_path

in the URLconfs. For further reading django doc


回答 2

pathDjango 2.0只是几周前才发布的新功能。大多数教程都不会针对新语法进行更新。

当然,这应该是一种更简单的处理方式。我不会说URL更强大,但是您应该能够以任何一种格式表示模式。

path is simply new in Django 2.0, which was only released a couple of weeks ago. Most tutorials won’t have been updated for the new syntax.

It was certainly supposed to be a simpler way of doing things; I wouldn’t say that URL is more powerful though, you should be able to express patterns in either format.


回答 3

正则表达式似乎不适用于path()具有以下参数的函数:path(r'^$', views.index, name="index")

应该是这样的:path('', views.index, name="index")

第一个参数必须为空才能输入正则表达式。

Regular expressions don’t seem to work with the path() function with the following arguments: path(r'^$', views.index, name="index").

It should be like this: path('', views.index, name="index").

The 1st argument must be blank to enter a regular expression.


回答 4

Path是Django 2.0的新功能。在这里解释:https : //docs.djangoproject.com/en/2.0/releases/2.0/#whats-new-2-0

看起来更像pythonic方式,并允许在传递给视图的参数中不使用正则表达式…例如,您可以使用int()函数。

Path is a new feature of Django 2.0. Explained here : https://docs.djangoproject.com/en/2.0/releases/2.0/#whats-new-2-0

Look like more pythonic way, and enable to not use regular expression in argument you pass to view… you can ue int() function for exemple.


回答 5

从v2.0开始,许多用户正在使用path,但是我们可以使用path或url。例如,在django 2.1.1中,可以通过url映射到函数

from django.contrib import admin
from django.urls import path

from django.contrib.auth import login
from posts.views import post_home
from django.conf.urls import url

urlpatterns = [
    path('admin/', admin.site.urls),
    url(r'^posts/$', post_home, name='post_home'),

]

其中posts是应用程序,而post_home是views.py中的函数

From v2.0 many users are using path, but we can use either path or url. For example in django 2.1.1 mapping to functions through url can be done as follows

from django.contrib import admin
from django.urls import path

from django.contrib.auth import login
from posts.views import post_home
from django.conf.urls import url

urlpatterns = [
    path('admin/', admin.site.urls),
    url(r'^posts/$', post_home, name='post_home'),

]

where posts is an application & post_home is a function in views.py


如何获取父目录位置

问题:如何获取父目录位置

这段代码是在b.py中获取templates / blog1 / page.html:

path = os.path.join(os.path.dirname(__file__), os.path.join('templates', 'blog1/page.html'))

但我想获取父目录位置:

aParent
   |--a
   |  |---b.py
   |      |---templates
   |              |--------blog1
   |                         |-------page.html
   |--templates
          |--------blog1
                     |-------page.html

以及如何获取父位置

谢谢

更新:

这是对的:

dirname=os.path.dirname
path = os.path.join(dirname(dirname(__file__)), os.path.join('templates', 'blog1/page.html'))

要么

path = os.path.abspath(os.path.join(os.path.dirname(__file__),".."))

this code is get the templates/blog1/page.html in b.py:

path = os.path.join(os.path.dirname(__file__), os.path.join('templates', 'blog1/page.html'))

but i want to get the parent dir location:

aParent
   |--a
   |  |---b.py
   |      |---templates
   |              |--------blog1
   |                         |-------page.html
   |--templates
          |--------blog1
                     |-------page.html

and how to get the aParent location

thanks

updated:

this is right:

dirname=os.path.dirname
path = os.path.join(dirname(dirname(__file__)), os.path.join('templates', 'blog1/page.html'))

or

path = os.path.abspath(os.path.join(os.path.dirname(__file__),".."))

回答 0

您可以重复应用dirname来爬高:dirname(dirname(file))。但是,这只能到达根包。如果有问题,请使用os.path.abspathdirname(dirname(abspath(file)))

You can apply dirname repeatedly to climb higher: dirname(dirname(file)). This can only go as far as the root package, however. If this is a problem, use os.path.abspath: dirname(dirname(abspath(file))).


回答 1

os.path.abspath不会验证任何内容,因此,如果我们已经向其追加字符串,__file__则无需理会dirname或加入其中的任何一个。只需将其__file__视为目录并开始爬山:

# climb to __file__'s parent's parent:
os.path.abspath(__file__ + "/../../")

这远不os.path.abspath(os.path.join(os.path.dirname(__file__),".."))及易处理dirname(dirname(__file__))。攀登两个以上的水平开始变得荒谬。

但是,由于我们知道要爬多少层,我们可以用一个简单的小函数来清理它:

uppath = lambda _path, n: os.sep.join(_path.split(os.sep)[:-n])

# __file__ = "/aParent/templates/blog1/page.html"
>>> uppath(__file__, 1)
'/aParent/templates/blog1'
>>> uppath(__file__, 2)
'/aParent/templates'
>>> uppath(__file__, 3)
'/aParent'

os.path.abspath doesn’t validate anything, so if we’re already appending strings to __file__ there’s no need to bother with dirname or joining or any of that. Just treat __file__ as a directory and start climbing:

# climb to __file__'s parent's parent:
os.path.abspath(__file__ + "/../../")

That’s far less convoluted than os.path.abspath(os.path.join(os.path.dirname(__file__),"..")) and about as manageable as dirname(dirname(__file__)). Climbing more than two levels starts to get ridiculous.

But, since we know how many levels to climb, we could clean this up with a simple little function:

uppath = lambda _path, n: os.sep.join(_path.split(os.sep)[:-n])

# __file__ = "/aParent/templates/blog1/page.html"
>>> uppath(__file__, 1)
'/aParent/templates/blog1'
>>> uppath(__file__, 2)
'/aParent/templates'
>>> uppath(__file__, 3)
'/aParent'

回答 2

相对路径pathlibPython 3.4+中的模块一起使用:

from pathlib import Path

Path(__file__).parent

您可以使用多个调用来parent进一步进行以下操作:

Path(__file__).parent.parent

作为指定parent两次的替代方法,可以使用:

Path(__file__).parents[1]

Use relative path with the pathlib module in Python 3.4+:

from pathlib import Path

Path(__file__).parent

You can use multiple calls to parent to go further in the path:

Path(__file__).parent.parent

As an alternative to specifying parent twice, you can use:

Path(__file__).parents[1]

回答 3

os.path.dirname(os.path.abspath(__file__))

应该给你通往的道路a

但是,如果b.py当前执行的是文件,则只需执行以下操作即可

os.path.abspath(os.path.join('templates', 'blog1', 'page.html'))
os.path.dirname(os.path.abspath(__file__))

Should give you the path to a.

But if b.py is the file that is currently executed, then you can achieve the same by just doing

os.path.abspath(os.path.join('templates', 'blog1', 'page.html'))

回答 4

os.pardir是一种更好的方法../,更具可读性。

import os
print os.path.abspath(os.path.join(given_path, os.pardir))  

这将返回给定路径的父路径

os.pardir is a better way for ../ and more readable.

import os
print os.path.abspath(os.path.join(given_path, os.pardir))  

This will return the parent path of the given_path


回答 5

一种简单的方法可以是:

import os
current_dir =  os.path.abspath(os.path.dirname(__file__))
parent_dir = os.path.abspath(current_dir + "/../")
print parent_dir

A simple way can be:

import os
current_dir =  os.path.abspath(os.path.dirname(__file__))
parent_dir = os.path.abspath(current_dir + "/../")
print parent_dir

回答 6

可能是加入两个..文件夹,以获取父文件夹的父文件夹?

path = os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)),"..",".."))

May be join two .. folder, to get parent of the parent folder?

path = os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)),"..",".."))

回答 7

使用以下命令跳到上一个文件夹:

os.chdir(os.pardir)

如果您需要多次跳转,那么在这种情况下,使用简单的装饰器是一个好而简单的解决方案。

Use the following to jump to previous folder:

os.chdir(os.pardir)

If you need multiple jumps a good and easy solution will be to use a simple decorator in this case.


回答 8

这是另一个相对简单的解决方案:

  • 不使用dirname()(在“ file.txt”这样的一级参数或“ ..”这样的相对父级上不能按预期工作)
  • 不使用abspath()(避免对当前工作目录进行任何假设),而是保留路径的相对字符

它只是使用normpathjoin

def parent(p):
    return os.path.normpath(os.path.join(p, os.path.pardir))

# Example:
for p in ['foo', 'foo/bar/baz', 'with/trailing/slash/', 
        'dir/file.txt', '../up/', '/abs/path']:
    print parent(p)

结果:

.
foo/bar
with/trailing
dir
..
/abs

Here is another relatively simple solution that:

  • does not use dirname() (which does not work as expected on one level arguments like “file.txt” or relative parents like “..”)
  • does not use abspath() (avoiding any assumptions about the current working directory) but instead preserves the relative character of paths

it just uses normpath and join:

def parent(p):
    return os.path.normpath(os.path.join(p, os.path.pardir))

# Example:
for p in ['foo', 'foo/bar/baz', 'with/trailing/slash/', 
        'dir/file.txt', '../up/', '/abs/path']:
    print parent(p)

Result:

.
foo/bar
with/trailing
dir
..
/abs

回答 9

我认为用这个更好:

os.path.realpath(__file__).rsplit('/', X)[0]


In [1]: __file__ = "/aParent/templates/blog1/page.html"

In [2]: os.path.realpath(__file__).rsplit('/', 3)[0]
Out[3]: '/aParent'

In [4]: __file__ = "/aParent/templates/blog1/page.html"

In [5]: os.path.realpath(__file__).rsplit('/', 1)[0]
Out[6]: '/aParent/templates/blog1'

In [7]: os.path.realpath(__file__).rsplit('/', 2)[0]
Out[8]: '/aParent/templates'

In [9]: os.path.realpath(__file__).rsplit('/', 3)[0]
Out[10]: '/aParent'

I think use this is better:

os.path.realpath(__file__).rsplit('/', X)[0]


In [1]: __file__ = "/aParent/templates/blog1/page.html"

In [2]: os.path.realpath(__file__).rsplit('/', 3)[0]
Out[3]: '/aParent'

In [4]: __file__ = "/aParent/templates/blog1/page.html"

In [5]: os.path.realpath(__file__).rsplit('/', 1)[0]
Out[6]: '/aParent/templates/blog1'

In [7]: os.path.realpath(__file__).rsplit('/', 2)[0]
Out[8]: '/aParent/templates'

In [9]: os.path.realpath(__file__).rsplit('/', 3)[0]
Out[10]: '/aParent'

回答 10

我试过了:

import os
os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe()))), os.pardir))

I tried:

import os
os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe()))), os.pardir))

测试列表是否共享python中的任何项目

问题:测试列表是否共享python中的任何项目

我想检查一个列表中的任何项目是否存在于另一个列表中。我可以使用下面的代码简单地做到这一点,但是我怀疑可能有一个库函数可以做到这一点。如果没有,是否有更多的pythonic方法可以达到相同的结果。

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....: 

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....: 

回答 0

简短答案:使用not set(a).isdisjoint(b),通常是最快的。

有测试四种常见的方式,如果两个列表ab共享任何项目。第一种选择是将两个都转换为集合并检查它们的交集,如下所示:

bool(set(a) & set(b))

由于集合是使用Python中的哈希表存储的,因此可以搜索它们O(1)(有关Python中运算符复杂性的更多信息,请参见此处)。从理论上讲,这是O(n+m)对平均nm在列表中的对象ab。但是1)它必须首先从列表中创建集合,这可能花费不可忽略的时间量; 2)它假定哈希冲突在您的数据中很少。

第二种方法是使用生成器表达式对列表执行迭代,例如:

any(i in a for i in b)

这允许就地搜索,因此不会为中间变量分配新的内存。它也可以在第一个发现上解决。但是in操作员始终O(n)在列表中(请参阅此处)。

另一个建议的选项是混合访问列表中的一个,转换另一个集合,然后测试该集合的成员资格,如下所示:

a = set(a); any(i in a for i in b)

第四种方法是利用isdisjoint()(冻结)集合的方法(请参阅此处),例如:

not set(a).isdisjoint(b)

如果您搜索的元素在数组的开头附近(例如已排序),则倾向于使用生成器表达式,因为集合交集方法必须为中间变量分配新的内存:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

这是此示例的执行时间与列表大小的关系图:

请注意,两个轴都是对数的。这代表了生成器表达式的最佳情况。可以看出,该isdisjoint()方法对于非常小的列表大小更好,而生成器表达式对于更大的列表大小更好。

另一方面,由于搜索是从混合表达式和生成器表达式的开头开始的,因此,如果共享元素系统地位于数组的末尾(或者两个列表都不共享任何值),则不相交和集合交集方法比生成器表达式和混合方法快得多。

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668

有趣的是,对于较大的列表大小,生成器表达式要慢得多。这仅适用于1000次重复,而不是前一个数字的100000次。当没有共享任何元素时,此设置也很合适,并且是不相交和设置相交方法的最佳情况。

这是两个使用随机数的分析(而不是操纵设置以偏爱一种或多种技术):

分享的可能性很高:元素是从中随机抽取的[1, 2*len(a)]。分享机会低:元素是从中随机抽取的[1, 1000*len(a)]

到目前为止,该分析假设两个列表的大小相同。如果有两个不同大小的列表,例如a小得多,isdisjoint()总是更快:

确保a列表较小,否则性能会降低。在此实验中,a列表大小设置为常量5

综上所述:

  • 如果列表很小(<10个元素),not set(a).isdisjoint(b)则总是最快的。
  • 如果列表中的元素已排序或具有可以利用的规则结构,则生成器表达式any(i in a for i in b)在大列表大小时最快。
  • not set(a).isdisjoint(b)用来测试设置的交集,它总是比快bool(set(a) & set(b))
  • 混合“遍历列表,按条件测试” a = set(a); any(i in a for i in b)通常比其他方法慢。
  • 当涉及到不共享元素的列表时,生成器表达式和混合函数比其他两种方法要慢得多。

在大多数情况下,使用该isdisjoint()方法是最好的方法,因为生成器表达式的执行时间会更长,因为在没有共享任何元素时效率非常低。

Short answer: use not set(a).isdisjoint(b), it’s generally the fastest.

There are four common ways to test if two lists a and b share any items. The first option is to convert both to sets and check their intersection, as such:

bool(set(a) & set(b))

Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But 1) it must first create sets out of the lists, which can take a non-negligible amount of time, and 2) it supposes that hashing collisions are sparse among your data.

The second way to do it is using a generator expression performing iteration on the lists, such as:

any(i in a for i in b)

This allows to search in-place, so no new memory is allocated for intermediary variables. It also bails out on the first find. But the in operator is always O(n) on lists (see here).

Another proposed option is an hybridto iterate through one of the list, convert the other one in a set and test for membership on this set, like so:

a = set(a); any(i in a for i in b)

A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:

not set(a).isdisjoint(b)

If the elements you search are near the beginning of an array (e.g. it is sorted), the generator expression is favored, as the sets intersection method have to allocate new memory for the intermediary variables:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

Here’s a graph of the execution time for this example in function of list size:

Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.

On the other hand, as the search begins with the beginning for the hybrid and generator expression, if the shared element are systematically at the end of the array (or both lists does not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668

It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well when when no elements are shared, and is the best case for the disjoint and set intersection approaches.

Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):

High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].

Up to now, this analysis supposed both lists are of the same size. In case of two lists of different sizes, for example a is much smaller, isdisjoint() is always faster:

Make sure that the a list is the smaller, otherwise the performance decreases. In this experiment, the a list size was set constant to 5.

In summary:

  • If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
  • If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest on large list sizes;
  • Test the set intersection with not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
  • The hybrid “iterate through list, test on set” a = set(a); any(i in a for i in b) is generally slower than other methods.
  • The generator expression and the hybrid are much slower than the two other approaches when it comes to lists without sharing elements.

In most cases, using the isdisjoint() method is the best approach as the generator expression will take much longer to execute, as it is very inefficient when no elements are shared.


回答 1

def lists_overlap3(a, b):
    return bool(set(a) & set(b))

注意:以上假设您想要布尔值作为答案。如果您只需要在if语句中使用表达式,则只需使用if set(a) & set(b):

def lists_overlap3(a, b):
    return bool(set(a) & set(b))

Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an if statement, just use if set(a) & set(b):


回答 2

def lists_overlap(a, b):
  sb = set(b)
  return any(el in sb for el in a)

这是渐近最优的(最坏情况O(n + m)),并且由于any的短路,可能比交叉点方法更好。

例如:

lists_overlap([3,4,5], [1,2,3])

到达后将立即返回True 3 in sb

编辑:另一种变化(感谢Dave Kirby):

def lists_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

这依赖于imap用C实现的迭代器,而不是生成器理解。它还sb.__contains__用作映射功能。我不知道这会带来多少性能差异。它仍然会短路。

def lists_overlap(a, b):
  sb = set(b)
  return any(el in sb for el in a)

This is asymptotically optimal (worst case O(n + m)), and might be better than the intersection approach due to any‘s short-circuiting.

E.g.:

lists_overlap([3,4,5], [1,2,3])

will return True as soon as it gets to 3 in sb

EDIT: Another variation (with thanks to Dave Kirby):

def lists_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

This relies on imap‘s iterator, which is implemented in C, rather than a generator comprehension. It also uses sb.__contains__ as the mapping function. I don’t know how much performance difference this makes. It will still short-circuit.


回答 3

您还可以将其any与列表理解一起使用:

any([item in a for item in b])

You could also use any with list comprehension:

any([item in a for item in b])

回答 4

在python 2.6或更高版本中,您可以执行以下操作:

return not frozenset(a).isdisjoint(frozenset(b))

In python 2.6 or later you can do:

return not frozenset(a).isdisjoint(frozenset(b))

回答 5

您可以使用任何内置函数/ wa generator表达式:

def list_overlap(a,b): 
     return any(i for i in a if i in b)

正如John和Lie所指出的那样,当对于两个列表共享的每个i bool(i)== False时,这都会给出错误的结果。它应该是:

return any(i in b for i in a)

You can use the any built in function /w a generator expression:

def list_overlap(a,b): 
     return any(i for i in a if i in b)

As John and Lie have pointed out this gives incorrect results when for every i shared by the two lists bool(i) == False. It should be:

return any(i in b for i in a)

回答 6

这个问题已经很老了,但是我注意到当人们在参数集合与列表时,没有人想到将它们一起使用。按照Soravux的示例,

清单的最坏情况:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234

列表的最佳情况是:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707

因此,遍历一个列表以查看它是否在集合中比遍历两个列表更快,这是有意义的,因为检查数字是否在集合中需要固定的时间,而通过遍历列表进行检查所花费的时间与长度成正比。名单。

因此,我的结论是遍历一个列表,并检查它是否在set中

This question is pretty old, but I noticed that while people were arguing sets vs. lists, that no one thought of using them together. Following Soravux’s example,

Worst case for lists:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234

And the best case for lists:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707

So even faster than iterating through two lists is iterating though a list to see if it’s in a set, which makes sense since checking if a number is in a set takes constant time while checking by iterating through a list takes time proportional to the length of the list.

Thus, my conclusion is that iterate through a list, and check if it’s in a set.


回答 7

如果您不在乎重叠的元素是什么,则只需检查len合并列表与合并为一组的列表的即可。如果有重叠的元素,则集合将更短:

len(set(a+b+c))==len(a+b+c) 如果没有重叠,则返回True。

if you don’t care what the overlapping element might be, you can simply check the len of the combined list vs. the lists combined as a set. If there are overlapping elements, the set will be shorter:

len(set(a+b+c))==len(a+b+c) returns True, if there is no overlap.


回答 8

我将以一种功能性的编程风格来介绍另一个:

any(map(lambda x: x in a, b))

说明:

map(lambda x: x in a, b)

返回在其中的元素布尔值的列表b中找到a。然后any将该列表传递给,该列表仅返回True是否有任何元素True

I’ll throw another one in with a functional programming style:

any(map(lambda x: x in a, b))

Explanation:

map(lambda x: x in a, b)

returns a list of booleans where elements of b are found in a. That list is then passed to any, which simply returns True if any elements are True.


如何恢复到Anaconda中的先前软件包?

问题:如何恢复到Anaconda中的先前软件包?

如果我做

conda info pandas

我可以看到所有可用的软件包。

pandas今天上午将其更新为最新版本,但是现在我需要恢复到以前的版本。我试过了

conda update pandas 0.13.1

但这没用。如何指定要使用的版本?

If I do

conda info pandas

I can see all of the packages available.

I updated my pandas to the latest this morning, but I need to revert to a prior version now. I tried

conda update pandas 0.13.1

but that didn’t work. How do I specify which version to use?


回答 0

我不得不改用该install函数:

conda install pandas=0.13.1

I had to use the install function instead:

conda install pandas=0.13.1

回答 1

对于希望还原最近安装的软件包的情况,该软件包对依赖项进行了一些更改(例如tensorflow),可以通过以下方法“回滚”到较早的安装状态:

conda list --revisions
conda install --revision [revision number]

第一个命令显示以前的安装版本(带有依赖项),第二个命令还原到revision number您指定的版本。

请注意,如果您希望(重新)安装更高版本,则可能必须顺序重新安装所有中间版本。如果您的版本为23,重新安装了版本20,并希望返回,则可能必须运行每个版本:

conda install --revision 21
conda install --revision 22
conda install --revision 23

For the case that you wish to revert a recently installed package that made several changes to dependencies (such as tensorflow), you can “roll back” to an earlier installation state via the following method:

conda list --revisions
conda install --revision [revision number]

The first command shows previous installation revisions (with dependencies) and the second reverts to whichever revision number you specify.

Note that if you wish to (re)install a later revision, you may have to sequentially reinstall all intermediate versions. If you had been at revision 23, reinstalled revision 20 and wish to return, you may have to run each:

conda install --revision 21
conda install --revision 22
conda install --revision 23

如何避免在Python中显式的“自我”?

问题:如何避免在Python中显式的“自我”?

我通过遵循一些pygame教程来学习Python 。

在其中我发现了关键字self的广泛使用,并且主要来自Java背景,我发现自己一直忘记输入self。例如,代替self.rect.centerx我输入rect.centerx,因为对我来说,rect已经是该类的成员变量。

Java的并行的我能想到的这种情况是有前缀成员变量的所有引用与

我是否在所有成员变量前面加上self前缀,还是有一种方法可以声明它们,从而避免这样做呢?

即使我的建议不是pythonic,我仍然想知道是否有可能。

我看了这些相关的SO问题,但它们并不能完全回答我的要求:

I have been learning Python by following some pygame tutorials.

Therein I found extensive use of the keyword self, and coming from a primarily Java background, I find that I keep forgetting to type self. For example, instead of self.rect.centerx I would type rect.centerx, because, to me, rect is already a member variable of the class.

The Java parallel I can think of for this situation is having to prefix all references to member variables with this.

Am I stuck prefixing all member variables with self, or is there a way to declare them that would allow me to avoid having to do so?

Even if what I am suggesting isn’t pythonic, I’d still like to know if it is possible.

I have taken a look at these related SO questions, but they don’t quite answer what I am after:


回答 0

Python需要指定self。 结果是,即使没有看到完整的类定义,也永远不会混淆什么是成员,什么不是成员。这会导致有用的属性,例如:您不能添加意外遮蔽非成员并从而破坏代码的成员。

一个极端的例子:您可以编写类而不知道它可能具有哪些基类,并且始终知道您是否正在访问成员:

class A(some_function()):
  def f(self):
    self.member = 42
    self.method()

这就是完整的代码!(some_function返回用作基础的类型。)

另一个是动态组合类的方法的:

class B(object):
  pass

print B()
# <__main__.B object at 0xb7e4082c>

def B_init(self):
  self.answer = 42
def B_str(self):
  return "<The answer is %s.>" % self.answer
# notice these functions require no knowledge of the actual class
# how hard are they to read and realize that "members" are used?

B.__init__ = B_init
B.__str__ = B_str

print B()
# <The answer is 42.>

请记住,这两个例子都是极端的,您不会每天看到它们,我也不建议您经常编写这样的代码,但是它们确实显示了明确要求自我的各个方面。

Python requires specifying self. The result is there’s never any confusion over what’s a member and what’s not, even without the full class definition visible. This leads to useful properties, such as: you can’t add members which accidentally shadow non-members and thereby break code.

One extreme example: you can write a class without any knowledge of what base classes it might have, and always know whether you are accessing a member or not:

class A(some_function()):
  def f(self):
    self.member = 42
    self.method()

That’s the complete code! (some_function returns the type used as a base.)

Another, where the methods of a class are dynamically composed:

class B(object):
  pass

print B()
# <__main__.B object at 0xb7e4082c>

def B_init(self):
  self.answer = 42
def B_str(self):
  return "<The answer is %s.>" % self.answer
# notice these functions require no knowledge of the actual class
# how hard are they to read and realize that "members" are used?

B.__init__ = B_init
B.__str__ = B_str

print B()
# <The answer is 42.>

Remember, both of these examples are extreme and you won’t see them every day, nor am I suggesting you should often write code like this, but they do clearly show aspects of self being explicitly required.


回答 1

先前的答案基本上都是“您不能”或“您不应”的变体。我同意后一种观点,但从技术上来说,这个问题尚未得到解答。

此外,出于合理的原因,有人可能想要按照实际问题的要求去做某事。我有时遇到的一件事是冗长的数学方程式,其中使用长名称会使方程式无法识别。以下是在固定示例中如何执行此操作的几种方法:

import numpy as np
class MyFunkyGaussian() :
    def __init__(self, A, x0, w, s, y0) :
        self.A = float(A)
        self.x0 = x0
        self.w = w
        self.y0 = y0
        self.s = s

    # The correct way, but subjectively less readable to some (like me) 
    def calc1(self, x) :
        return (self.A/(self.w*np.sqrt(np.pi))/(1+self.s*self.w**2/2)
                * np.exp( -(x-self.x0)**2/self.w**2)
                * (1+self.s*(x-self.x0)**2) + self.y0 )

    # The correct way if you really don't want to use 'self' in the calculations
    def calc2(self, x) :
        # Explicity copy variables
        A, x0, w, y0, s = self.A, self.x0, self.w, self.y0, self.s
        sqrt, exp, pi = np.sqrt, np.exp, np.pi
        return ( A/( w*sqrt(pi) )/(1+s*w**2/2)
                * exp( -(x-x0)**2/w**2 )
                * (1+s*(x-x0)**2) + y0 )

    # Probably a bad idea...
    def calc3(self, x) :
        # Automatically copy every class vairable
        for k in self.__dict__ : exec(k+'= self.'+k)
        sqrt, exp, pi = np.sqrt, np.exp, np.pi
        return ( A/( w*sqrt(pi) )/(1+s*w**2/2)
                * exp( -(x-x0)**2/w**2 )
                * (1+s*(x-x0)**2) + y0 )

g = MyFunkyGaussian(2.0, 1.5, 3.0, 5.0, 0.0)
print(g.calc1(0.5))
print(g.calc2(0.5))
print(g.calc3(0.5))

第三个例子-即使用for k in self.__dict__ : exec(k+'= self.'+k)基本上就是问题的实质所在,但是让我清楚一点,我认为这通常不是一个好主意。

欲了解更多信息,并通过类变量,甚至函数的方式进行迭代,看答案和讨论这个问题。有关动态命名变量的其他方法的讨论以及为什么通常这样做不是一个好主意,请参阅此博客文章。

更新:似乎没有办法在Python3中的函数中动态更新或更改局部变量,因此calc3和类似的变体不再可能。我现在能想到的唯一与python3兼容的解决方案是使用globals

def calc4(self, x) :
        # Automatically copy every class variable in globals
        globals().update(self.__dict__)
        sqrt, exp, pi = np.sqrt, np.exp, np.pi
        return ( A/( w*sqrt(pi) )/(1+s*w**2/2)
                * exp( -(x-x0)**2/w**2 )
                * (1+s*(x-x0)**2) + y0 )

总体而言,这将是可怕的做法。

Previous answers are all basically variants of “you can’t” or “you shouldn’t”. While I agree with the latter sentiment, the question is technically still unanswered.

Furthermore, there are legitimate reasons why someone might want to do something along the lines of what the actual question is asking. One thing I run into sometimes is lengthy math equations where using long names makes the equation unrecognizable. Here are a couple ways of how you could do this in a canned example:

import numpy as np
class MyFunkyGaussian() :
    def __init__(self, A, x0, w, s, y0) :
        self.A = float(A)
        self.x0 = x0
        self.w = w
        self.y0 = y0
        self.s = s

    # The correct way, but subjectively less readable to some (like me) 
    def calc1(self, x) :
        return (self.A/(self.w*np.sqrt(np.pi))/(1+self.s*self.w**2/2)
                * np.exp( -(x-self.x0)**2/self.w**2)
                * (1+self.s*(x-self.x0)**2) + self.y0 )

    # The correct way if you really don't want to use 'self' in the calculations
    def calc2(self, x) :
        # Explicity copy variables
        A, x0, w, y0, s = self.A, self.x0, self.w, self.y0, self.s
        sqrt, exp, pi = np.sqrt, np.exp, np.pi
        return ( A/( w*sqrt(pi) )/(1+s*w**2/2)
                * exp( -(x-x0)**2/w**2 )
                * (1+s*(x-x0)**2) + y0 )

    # Probably a bad idea...
    def calc3(self, x) :
        # Automatically copy every class vairable
        for k in self.__dict__ : exec(k+'= self.'+k)
        sqrt, exp, pi = np.sqrt, np.exp, np.pi
        return ( A/( w*sqrt(pi) )/(1+s*w**2/2)
                * exp( -(x-x0)**2/w**2 )
                * (1+s*(x-x0)**2) + y0 )

g = MyFunkyGaussian(2.0, 1.5, 3.0, 5.0, 0.0)
print(g.calc1(0.5))
print(g.calc2(0.5))
print(g.calc3(0.5))

The third example – i.e. using for k in self.__dict__ : exec(k+'= self.'+k) is basically what the question is actually asking for, but let me be clear that I don’t think it is generally a good idea.

For more info, and ways to iterate through class variables, or even functions, see answers and discussion to this question. For a discussion of other ways to dynamically name variables, and why this is usually not a good idea see this blog post.

UPDATE: There appears to be no way to dynamically update or change locals in a function in Python3, so calc3 and similar variants are no longer possible. The only python3 compatible solution I can think of now is to use globals:

def calc4(self, x) :
        # Automatically copy every class variable in globals
        globals().update(self.__dict__)
        sqrt, exp, pi = np.sqrt, np.exp, np.pi
        return ( A/( w*sqrt(pi) )/(1+s*w**2/2)
                * exp( -(x-x0)**2/w**2 )
                * (1+s*(x-x0)**2) + y0 )

Which, again, would be a terrible practice in general.


回答 2

实际上self不是关键字,它只是Python中实例方法的第一个参数的常规名称。而且第一个参数不能被跳过,因为它是方法知道该类的哪个实例被调用的唯一机制。

Actually self is not a keyword, it’s just the name conventionally given to the first parameter of instance methods in Python. And that first parameter can’t be skipped, as it’s the only mechanism a method has of knowing which instance of your class it’s being called on.


回答 3

您可以使用任何想要的名称,例如

class test(object):
    def function(this, variable):
        this.variable = variable

甚至

class test(object):
    def function(s, variable):
        s.variable = variable

但您仍然无法使用范围的名称。

我不建议您使用与自己不同的东西,除非您有令人信服的理由,因为这会使有经验的pythonista陌生。

You can use whatever name you want, for example

class test(object):
    def function(this, variable):
        this.variable = variable

or even

class test(object):
    def function(s, variable):
        s.variable = variable

but you are stuck with using a name for the scope.

I do not recommend you use something different to self unless you have a convincing reason, as it would make it alien for experienced pythonistas.


回答 4

是的,您必须始终指定self,因为根据python哲学,显式要比隐式好。

您还将发现使用python进行编程的方式与使用Java进行编程的方式非常不同,因此,self由于您没有在对象内部投影所有内容,因此使用的趋势会减少。相反,您可以更多地使用模块级功能,可以更好地对其进行测试。

顺便说说。我最初讨厌它,现在讨厌相反的东西。缩进驱动的流量控制也是如此。

yes, you must always specify self, because explicit is better than implicit, according to python philosophy.

You will also find out that the way you program in python is very different from the way you program in java, hence the use of self tends to decrease because you don’t project everything inside the object. Rather, you make larger use of module-level function, which can be better tested.

by the way. I hated it at first, now I hate the opposite. same for indented-driven flow control.


回答 5

“自身”是类的当前对象实例的常规占位符。当您要引用类中的对象的属性,字段或方法时,就好像在引用“自身”一样使用它。但是,为了使它简短一些,Python编程领域中的某个人开始使用“ self”,其他领域则使用“ this”,但是它们使它成为无法替换的关键字。我宁愿使用“它”来增加代码的可读性。这是Python的优点之一-您可以自由选择对象实例的占位符,而不是“自身”。自我示例:

class UserAccount():    
    def __init__(self, user_type, username, password):
        self.user_type = user_type
        self.username = username            
        self.password = encrypt(password)        

    def get_password(self):
        return decrypt(self.password)

    def set_password(self, password):
        self.password = encrypt(password)

现在我们用“其”替换“自我”:

class UserAccount():    
    def __init__(its, user_type, username, password):
        its.user_type = user_type
        its.username = username            
        its.password = encrypt(password)        

    def get_password(its):
        return decrypt(its.password)

    def set_password(its, password):
        its.password = encrypt(password)

现在哪个更易读?

The “self” is the conventional placeholder of the current object instance of a class. Its used when you want to refer to the object’s property or field or method inside a class as if you’re referring to “itself”. But to make it shorter someone in the Python programming realm started to use “self” , other realms use “this” but they make it as a keyword which cannot be replaced. I rather used “its” to increase the code readability. Its one of the good things in Python – you have a freedom to choose your own placeholder for the object’s instance other than “self”. Example for self:

class UserAccount():    
    def __init__(self, user_type, username, password):
        self.user_type = user_type
        self.username = username            
        self.password = encrypt(password)        

    def get_password(self):
        return decrypt(self.password)

    def set_password(self, password):
        self.password = encrypt(password)

Now we replace ‘self’ with ‘its’:

class UserAccount():    
    def __init__(its, user_type, username, password):
        its.user_type = user_type
        its.username = username            
        its.password = encrypt(password)        

    def get_password(its):
        return decrypt(its.password)

    def set_password(its, password):
        its.password = encrypt(password)

which is more readable now?


回答 6

self是python语法的一部分,用于访问对象的成员,因此恐怕您会受其束缚

self is part of the python syntax to access members of objects, so I’m afraid you’re stuck with it


回答 7

实际上,您可以使用Armin Ronacher演讲“ 5年的坏主意”中的食谱“自卑自我”(用Google搜索)。

这是一个非常聪明的秘方,几乎所有阿明·罗纳赫(Armin Ronacher)的著作都如此,但我认为这个主意并不吸引人。我想我更愿意在C#/ Java中对此进行明确说明。

更新。链接到“坏主意食谱”:https//speakerdeck.com/mitsuhiko/5-years-of-bad-ideas?slide = 58

Actually you can use recipe “Implicit self” from Armin Ronacher presentation “5 years of bad ideas” ( google it).

It’s a very clever recipe, as almost everything from Armin Ronacher, but I don’t think this idea is very appealing. I think I’d prefer explicit this in C#/Java.

Update. Link to “bad idea recipe”: https://speakerdeck.com/mitsuhiko/5-years-of-bad-ideas?slide=58


回答 8

是的,自我很乏味。但是,更好吗?

class Test:

    def __init__(_):
        _.test = 'test'

    def run(_):
        print _.test

Yeah, self is tedious. But, is it better?

class Test:

    def __init__(_):
        _.test = 'test'

    def run(_):
        print _.test

回答 9

来自:自我地狱-更多有状态的功能。

混合方法效果最好 您所有实际进行计算的类方法都应移到闭包中,并且清理语法的扩展应保留在类中。将闭包塞入类,将类像命名空间一样对待。闭包本质上是静态函数,因此甚至在类中也不需要self *。

From: Self Hell – More stateful functions.

…a hybrid approach works best. All of your class methods that actually do computation should be moved into closures, and extensions to clean up syntax should be kept in classes. Stuff the closures into classes, treating the class much like a namespace. The closures are essentially static functions, and so do not require selfs*, even in the class…


回答 10

我认为,如果有一个“成员”语句和“全局”语句,那将更容易且更具可读性,因此您可以告诉解释器哪些是类的对象成员。

I think that it would be easier and more readable if there was a statement “member” just as there is “global” so you can tell the interpreter which are the objects members of the class.


python中的json.dump()和json.dumps()有什么区别?

问题:python中的json.dump()和json.dumps()有什么区别?

我在官方文档中进行了搜索,以查找python中的json.dump()和json.dumps()之间的区别。显然,它们与文件写入选项有关。
但是,它们之间的详细区别是什么?在什么情况下,一个比另一个具有更多的优势?

I searched in this official document to find difference between the json.dump() and json.dumps() in python. It is clear that they are related with file write option.
But what is the detailed difference between them and in what situations one has more advantage than other?


回答 0

除了文档所说的内容外,没有什么可添加的。如果要将JSON转储到文件/套接字或其他文件中,则应使用dump()。如果只需要它作为字符串(用于打印,解析或其他操作),则使用dumps()(转储字符串)

正如Antii Haapala在此答案中提到的,在ensure_ascii行为上有一些细微的差异。这主要是由于底层write()函数是如何工作的,因为它是对块而不是整个字符串进行操作。检查他的答案以获取更多详细信息。

json.dump()

将obj作为JSON格式的流序列化到fp(支持.write()的类似文件的对象

如果ensure_ascii为False,则写入fp的某些块可能是unicode实例

json.dumps()

将obj序列化为JSON格式的str

如果sure_ascii为False,则结果可能包含非ASCII字符,并且返回值可能是unicode实例

There isn’t much else to add other than what the docs say. If you want to dump the JSON into a file/socket or whatever, then you should go with dump(). If you only need it as a string (for printing, parsing or whatever) then use dumps() (dump string)

As mentioned by Antti Haapala in this answer, there are some minor differences on the ensure_ascii behaviour. This is mostly due to how the underlying write() function works, being that it operates on chunks rather than the whole string. Check his answer for more details on that.

json.dump()

Serialize obj as a JSON formatted stream to fp (a .write()-supporting file-like object

If ensure_ascii is False, some chunks written to fp may be unicode instances

json.dumps()

Serialize obj to a JSON formatted str

If ensure_ascii is False, the result may contain non-ASCII characters and the return value may be a unicode instance


回答 1

与功能s取字符串参数。其他则采用文件流。

The functions with an s take string parameters. The others take file streams.


回答 2

在内存使用和速度上。

调用时,jsonstr = json.dumps(mydata)它首先在内存中创建数据的完整副本,然后才将file.write(jsonstr)其复制到磁盘。因此,这是一种更快的方法,但是如果要保存大量数据,则可能会成为问题。

当调用json.dump(mydata, file)-不带’s’时,不使用新的内存,因为数据是按块转储的。但是整个过程要慢大约2倍。

来源:我检查了json.dump()和的源代码,json.dumps()还测试了两个变量,它们测量了time.time()htop中的时间并观察了它们的内存使用情况。

In memory usage and speed.

When you call jsonstr = json.dumps(mydata) it first creates a full copy of your data in memory and only then you file.write(jsonstr) it to disk. So this is a faster method but can be a problem if you have a big piece of data to save.

When you call json.dump(mydata, file) — without ‘s’, new memory is not used, as the data is dumped by chunks. But the whole process is about 2 times slower.

Source: I checked the source code of json.dump() and json.dumps() and also tested both the variants measuring the time with time.time() and watching the memory usage in htop.


回答 3

Python 2的一个显着差异是,如果您使用ensure_ascii=False,则dump可以将UTF-8编码的数据正确写入文件中(除非您使用的扩展名不是UTF-8的8位字符串):

dumps另一方面,with ensure_ascii=False可以产生a strunicode仅取决于您用于字符串的类型:

使用此转换表将obj序列化为JSON格式的str。如果sure_ascii为False,则结果可能包含非ASCII字符,并且返回值可能是unicodeinstance

(强调我的)。请注意,它可能仍然是一个str实例。

因此,如果不检查返回的格式以及可能使用的格式,就无法使用其返回值将结构保存到文件中unicode.encode

当然,这在Python 3中不再是有效的问题,因为不再存在这种8位/ Unicode的混淆。


至于loadVS loadsload认为整个文件是一个JSON文件,所以你不能用它来从单个文件读取多个新行限制JSON文件。

One notable difference in Python 2 is that if you’re using ensure_ascii=False, dump will properly write UTF-8 encoded data into the file (unless you used 8-bit strings with extended characters that are not UTF-8):

dumps on the other hand, with ensure_ascii=False can produce a str or unicode just depending on what types you used for strings:

Serialize obj to a JSON formatted str using this conversion table. If ensure_ascii is False, the result may contain non-ASCII characters and the return value may be a unicode instance.

(emphasis mine). Note that it may still be a str instance as well.

Thus you cannot use its return value to save the structure into file without checking which format was returned and possibly playing with unicode.encode.

This of course is not valid concern in Python 3 any more, since there is no more this 8-bit/Unicode confusion.


As for load vs loads, load considers the whole file to be one JSON document, so you cannot use it to read multiple newline limited JSON documents from a single file.


在Python3中按索引访问dict_keys元素

问题:在Python3中按索引访问dict_keys元素

我正在尝试通过其索引访问dict_key的元素:

test = {'foo': 'bar', 'hello': 'world'}
keys = test.keys()  # dict_keys object

keys.index(0)
AttributeError: 'dict_keys' object has no attribute 'index'

我想得到foo

与:

keys[0]
TypeError: 'dict_keys' object does not support indexing

我怎样才能做到这一点?

I’m trying to access a dict_key’s element by its index:

test = {'foo': 'bar', 'hello': 'world'}
keys = test.keys()  # dict_keys object

keys.index(0)
AttributeError: 'dict_keys' object has no attribute 'index'

I want to get foo.

same with:

keys[0]
TypeError: 'dict_keys' object does not support indexing

How can I do this?


回答 0

list()而是调用字典:

keys = list(test)

在Python 3中,该dict.keys()方法返回一个字典视图对象,它作为一个集合。直接迭代字典也会产生键,因此将字典转换为列表会得到所有键的列表:

>>> test = {'foo': 'bar', 'hello': 'world'}
>>> list(test)
['foo', 'hello']
>>> list(test)[0]
'foo'

Call list() on the dictionary instead:

keys = list(test)

In Python 3, the dict.keys() method returns a dictionary view object, which acts as a set. Iterating over the dictionary directly also yields keys, so turning a dictionary into a list results in a list of all the keys:

>>> test = {'foo': 'bar', 'hello': 'world'}
>>> list(test)
['foo', 'hello']
>>> list(test)[0]
'foo'

回答 1

不是完整的答案,但可能是有用的提示。如果它确实是您想要的第一项*,那么

next(iter(q))

比快得多

list(q)[0]

对于大词典,因为不必将整个内容存储在内存中。

对于10.000.000物品,我发现它快了将近40.000倍。

*如果dict只是Python 3.6之前的伪随机项目,则第一项(此后它在标准实现中已订购,尽管不建议依赖它)。

Not a full answer but perhaps a useful hint. If it is really the first item you want*, then

next(iter(q))

is much faster than

list(q)[0]

for large dicts, since the whole thing doesn’t have to be stored in memory.

For 10.000.000 items I found it to be almost 40.000 times faster.

*The first item in case of a dict being just a pseudo-random item before Python 3.6 (after that it’s ordered in the standard implementation, although it’s not advised to rely on it).


回答 2

我想要第一个字典项的“键”和“值”对。我用下面的代码。

 key, val = next(iter(my_dict.items()))

I wanted “key” & “value” pair of a first dictionary item. I used the following code.

 key, val = next(iter(my_dict.items()))

回答 3

test = {'foo': 'bar', 'hello': 'world'}
ls = []
for key in test.keys():
    ls.append(key)
print(ls[0])

将键附加到静态定义的列表然后对其进行索引的常规方式

test = {'foo': 'bar', 'hello': 'world'}
ls = []
for key in test.keys():
    ls.append(key)
print(ls[0])

Conventional way of appending the keys to a statically defined list and then indexing it for same


回答 4

在许多情况下,这可能是XY问题。为什么要按位置索引字典键?您真的需要吗?直到最近,字典甚至还没有在Python中排序,因此访问第一个元素是任意的。

我刚刚将一些Python 2代码翻译为Python 3:

keys = d.keys()
for (i, res) in enumerate(some_list):
    k = keys[i]
    # ...

这不是很漂亮,但也不是很糟糕。起初,我正要用可怕的东西代替它

    k = next(itertools.islice(iter(keys), i, None))

在我意识到这一切写成更好之前

for (k, res) in zip(d.keys(), some_list):

效果很好。

我相信在许多其他情况下,可以避免按位置索引字典关键字。尽管字典在Python 3.7中是有序的,但是依靠它并不是很漂亮。上面的代码仅起作用,因为的内容some_list是最近从的内容中产生的d

如果您确实需要disk_keys按索引访问元素,请仔细看一下代码。也许您不需要。

In many cases, this may be an XY Problem. Why are you indexing your dictionary keys by position? Do you really need to? Until recently, dictionaries were not even ordered in Python, so accessing the first element was arbitrary.

I just translated some Python 2 code to Python 3:

keys = d.keys()
for (i, res) in enumerate(some_list):
    k = keys[i]
    # ...

which is not pretty, but not very bad either. At first, I was about to replace it by the monstrous

    k = next(itertools.islice(iter(keys), i, None))

before I realised this is all much better written as

for (k, res) in zip(d.keys(), some_list):

which works just fine.

I believe that in many other cases, indexing dictionary keys by position can be avoided. Although dictionaries are ordered in Python 3.7, relying on that is not pretty. The code above only works because the contents of some_list had been recently produced from the contents of d.

Have a hard look at your code if you really need to access a disk_keys element by index. Perhaps you don’t need to.


回答 5

试试这个

keys = [next(iter(x.keys())) for x in test]
print(list(keys))

结果看起来像这样。[‘foo’,’hello’]

您可以在此处找到更多可能的解决方案。

Try this

keys = [next(iter(x.keys())) for x in test]
print(list(keys))

The result looks like this. [‘foo’, ‘hello’]

You can find more possible solutions here.


上个月的python日期

问题:上个月的python日期

我正在尝试使用python获取上个月的日期。这是我尝试过的:

str( time.strftime('%Y') ) + str( int(time.strftime('%m'))-1 )

但是,这种方法很糟糕,原因有两个:首先,它返回2012年2月的20122(而不是201202),其次它将返回0而不是1月的12。

我已经用bash解决了这个麻烦

echo $(date -d"3 month ago" "+%G%m%d")

我认为,如果bash为此目的提供了一种内置方式,那么功能更强大的python应该比强迫编写自己的脚本来实现此目标更好。我当然可以做类似的事情:

if int(time.strftime('%m')) == 1:
    return '12'
else:
    if int(time.strftime('%m')) < 10:
        return '0'+str(time.strftime('%m')-1)
    else:
        return str(time.strftime('%m') -1)

我没有测试过此代码,也不想使用它(除非我找不到其他方法:/)

谢谢你的帮助!

I am trying to get the date of the previous month with python. Here is what i’ve tried:

str( time.strftime('%Y') ) + str( int(time.strftime('%m'))-1 )

However, this way is bad for 2 reasons: First it returns 20122 for the February of 2012 (instead of 201202) and secondly it will return 0 instead of 12 on January.

I have solved this trouble in bash with

echo $(date -d"3 month ago" "+%G%m%d")

I think that if bash has a built-in way for this purpose, then python, much more equipped, should provide something better than forcing writing one’s own script to achieve this goal. Of course i could do something like:

if int(time.strftime('%m')) == 1:
    return '12'
else:
    if int(time.strftime('%m')) < 10:
        return '0'+str(time.strftime('%m')-1)
    else:
        return str(time.strftime('%m') -1)

I have not tested this code and i don’t want to use it anyway (unless I can’t find any other way:/)

Thanks for your help!


回答 0

datetime和datetime.timedelta类是您的朋友。

  1. 找到今天。
  2. 用它来查找本月的第一天。
  3. 使用timedelta备份一天,直到上个月的最后一天。
  4. 打印您要查找的YYYYMM字符串。

像这样:

 import datetime
 today = datetime.date.today()
 first = today.replace(day=1)
 lastMonth = first - datetime.timedelta(days=1)
 print(lastMonth.strftime("%Y%m"))

201202 打印。

datetime and the datetime.timedelta classes are your friend.

  1. find today.
  2. use that to find the first day of this month.
  3. use timedelta to backup a single day, to the last day of the previous month.
  4. print the YYYYMM string you’re looking for.

Like this:

 import datetime
 today = datetime.date.today()
 first = today.replace(day=1)
 lastMonth = first - datetime.timedelta(days=1)
 print(lastMonth.strftime("%Y%m"))

201202 is printed.


回答 1

您应该使用dateutil。这样,您就可以使用relativedelta,它是timedelta的改进版本。

>>> import datetime 
>>> import dateutil.relativedelta
>>> now = datetime.datetime.now()
>>> print now
2012-03-15 12:33:04.281248
>>> print now + dateutil.relativedelta.relativedelta(months=-1)
2012-02-15 12:33:04.281248

You should use dateutil. With that, you can use relativedelta, it’s an improved version of timedelta.

>>> import datetime 
>>> import dateutil.relativedelta
>>> now = datetime.datetime.now()
>>> print now
2012-03-15 12:33:04.281248
>>> print now + dateutil.relativedelta.relativedelta(months=-1)
2012-02-15 12:33:04.281248

回答 2

from datetime import date, timedelta

first_day_of_current_month = date.today().replace(day=1)
last_day_of_previous_month = first_day_of_current_month - timedelta(days=1)

print "Previous month:", last_day_of_previous_month.month

要么:

from datetime import date, timedelta

prev = date.today().replace(day=1) - timedelta(days=1)
print prev.month
from datetime import date, timedelta

first_day_of_current_month = date.today().replace(day=1)
last_day_of_previous_month = first_day_of_current_month - timedelta(days=1)

print "Previous month:", last_day_of_previous_month.month

Or:

from datetime import date, timedelta

prev = date.today().replace(day=1) - timedelta(days=1)
print prev.month

回答 3

bgporter的答案为基础

def prev_month_range(when = None): 
    """Return (previous month's start date, previous month's end date)."""
    if not when:
        # Default to today.
        when = datetime.datetime.today()
    # Find previous month: https://stackoverflow.com/a/9725093/564514
    # Find today.
    first = datetime.date(day=1, month=when.month, year=when.year)
    # Use that to find the first day of this month.
    prev_month_end = first - datetime.timedelta(days=1)
    prev_month_start = datetime.date(day=1, month= prev_month_end.month, year= prev_month_end.year)
    # Return previous month's start and end dates in YY-MM-DD format.
    return (prev_month_start.strftime('%Y-%m-%d'), prev_month_end.strftime('%Y-%m-%d'))

Building on bgporter’s answer.

def prev_month_range(when = None): 
    """Return (previous month's start date, previous month's end date)."""
    if not when:
        # Default to today.
        when = datetime.datetime.today()
    # Find previous month: https://stackoverflow.com/a/9725093/564514
    # Find today.
    first = datetime.date(day=1, month=when.month, year=when.year)
    # Use that to find the first day of this month.
    prev_month_end = first - datetime.timedelta(days=1)
    prev_month_start = datetime.date(day=1, month= prev_month_end.month, year= prev_month_end.year)
    # Return previous month's start and end dates in YY-MM-DD format.
    return (prev_month_start.strftime('%Y-%m-%d'), prev_month_end.strftime('%Y-%m-%d'))

回答 4

它非常容易和简单。做这个

from dateutil.relativedelta import relativedelta
from datetime import datetime

today_date = datetime.today()
print "todays date time: %s" %today_date

one_month_ago = today_date - relativedelta(months=1)
print "one month ago date time: %s" % one_month_ago
print "one month ago date: %s" % one_month_ago.date()

输出如下:$ python2.7 main.py

todays date time: 2016-09-06 02:13:01.937121
one month ago date time: 2016-08-06 02:13:01.937121
one month ago date: 2016-08-06

Its very easy and simple. Do this

from dateutil.relativedelta import relativedelta
from datetime import datetime

today_date = datetime.today()
print "todays date time: %s" %today_date

one_month_ago = today_date - relativedelta(months=1)
print "one month ago date time: %s" % one_month_ago
print "one month ago date: %s" % one_month_ago.date()

Here is the output: $python2.7 main.py

todays date time: 2016-09-06 02:13:01.937121
one month ago date time: 2016-08-06 02:13:01.937121
one month ago date: 2016-08-06

回答 5

对于到达这里并希望获得上个月的第一天和最后一天的人:

from datetime import date, timedelta

last_day_of_prev_month = date.today().replace(day=1) - timedelta(days=1)

start_day_of_prev_month = date.today().replace(day=1) - timedelta(days=last_day_of_prev_month.day)

# For printing results
print("First day of prev month:", start_day_of_prev_month)
print("Last day of prev month:", last_day_of_prev_month)

输出:

First day of prev month: 2019-02-01
Last day of prev month: 2019-02-28

For someone who got here and looking to get both the first and last day of the previous month:

from datetime import date, timedelta

last_day_of_prev_month = date.today().replace(day=1) - timedelta(days=1)

start_day_of_prev_month = date.today().replace(day=1) - timedelta(days=last_day_of_prev_month.day)

# For printing results
print("First day of prev month:", start_day_of_prev_month)
print("Last day of prev month:", last_day_of_prev_month)

Output:

First day of prev month: 2019-02-01
Last day of prev month: 2019-02-28

回答 6

def prev_month(date=datetime.datetime.today()):
    if date.month == 1:
        return date.replace(month=12,year=date.year-1)
    else:
        try:
            return date.replace(month=date.month-1)
        except ValueError:
            return prev_month(date=date.replace(day=date.day-1))
def prev_month(date=datetime.datetime.today()):
    if date.month == 1:
        return date.replace(month=12,year=date.year-1)
    else:
        try:
            return date.replace(month=date.month-1)
        except ValueError:
            return prev_month(date=date.replace(day=date.day-1))

回答 7

只是为了好玩,一个使用divmod的纯数学答案。由于相乘,效率很低,也可以对月份数进行简单检查(如果等于12,则增加年份等)

year = today.year
month = today.month

nm = list(divmod(year * 12 + month + 1, 12))
if nm[1] == 0:
    nm[1] = 12
    nm[0] -= 1
pm = list(divmod(year * 12 + month - 1, 12))
if pm[1] == 0:
    pm[1] = 12
    pm[0] -= 1

next_month = nm
previous_month = pm

Just for fun, a pure math answer using divmod. Pretty inneficient because of the multiplication, could do just as well a simple check on the number of month (if equal to 12, increase year, etc)

year = today.year
month = today.month

nm = list(divmod(year * 12 + month + 1, 12))
if nm[1] == 0:
    nm[1] = 12
    nm[0] -= 1
pm = list(divmod(year * 12 + month - 1, 12))
if pm[1] == 0:
    pm[1] = 12
    pm[0] -= 1

next_month = nm
previous_month = pm

回答 8

使用Pendulum非常完整的库,我们有了subtract方法(而不是“ subStract”):

import pendulum
today = pendulum.datetime.today()  # 2020, january
lastmonth = today.subtract(months=1)
lastmonth.strftime('%Y%m')
# '201912'

我们看到它可以应对跳跃的岁月。

反向等效为add

https://pendulum.eustace.io/docs/#addition-and-subtraction

With the Pendulum very complete library, we have the subtract method (and not “subStract”):

import pendulum
today = pendulum.datetime.today()  # 2020, january
lastmonth = today.subtract(months=1)
lastmonth.strftime('%Y%m')
# '201912'

We see that it handles jumping years.

The reverse equivalent is add.

https://pendulum.eustace.io/docs/#addition-and-subtraction


回答 9

以@JF Sebastian的注释为基础,您可以将replace()函数链接起来以返回一个“月”。由于一个月不是固定的时间段,因此此解决方案尝试返回到上个月的同一日期,这当然不能在所有月份都有效。在这种情况下,此算法默认为上个月的最后一天。

from datetime import datetime, timedelta

d = datetime(2012, 3, 31) # A problem date as an example

# last day of last month
one_month_ago = (d.replace(day=1) - timedelta(days=1))
try:
    # try to go back to same day last month
    one_month_ago = one_month_ago.replace(day=d.day)
except ValueError:
    pass
print("one_month_ago: {0}".format(one_month_ago))

输出:

one_month_ago: 2012-02-29 00:00:00

Building off the comment of @J.F. Sebastian, you can chain the replace() function to go back one “month”. Since a month is not a constant time period, this solution tries to go back to the same date the previous month, which of course does not work for all months. In such a case, this algorithm defaults to the last day of the prior month.

from datetime import datetime, timedelta

d = datetime(2012, 3, 31) # A problem date as an example

# last day of last month
one_month_ago = (d.replace(day=1) - timedelta(days=1))
try:
    # try to go back to same day last month
    one_month_ago = one_month_ago.replace(day=d.day)
except ValueError:
    pass
print("one_month_ago: {0}".format(one_month_ago))

Output:

one_month_ago: 2012-02-29 00:00:00

回答 10

如果要在LINUX / UNIX环境中查看EXE类型文件中的ASCII字母,请尝试“ od -c’filename’| more”

您可能会得到很多无法识别的项目,但它们都会全部显示出来,并且将显示HEX表示形式,并且ASCII等效字符(如果适用)将跟随十六进制代码行。在您知道的已编译代码上尝试一下。您可能会在其中识别出一些东西。

If you want to look at the ASCII letters in a EXE type file in a LINUX/UNIX Environment, try “od -c ‘filename’ |more”

You will likely get a lot of unrecognizable items, but they will all be presented, and the HEX representations will be displayed, and the ASCII equivalent characters (if appropriate) will follow the line of hex codes. Try it on a compiled piece of code that you know. You might see things in it you recognize.


回答 11

有一个高级库dateparser可以确定给定自然语言的过去日期,并返回相应的Python datetime对象

from dateparser import parse
parse('4 months ago')

There is a high level library dateparser that can determine the past date given natural language, and return the corresponding Python datetime object

from dateparser import parse
parse('4 months ago')

计算大熊猫数量的最有效方法是什么?

问题:计算大熊猫数量的最有效方法是什么?

我有一个大的(约1200万行)数据帧df,说:

df.columns = ['word','documents','frequency']

因此,以下及时运行:

word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']

但是,这要花费很长的时间才能运行:

Occurrences_of_Words = word_grouping[['word']].count().reset_index()

我在这里做错了什么?有没有更好的方法来计算大型数据框中的出现次数?

df.word.describe()

运行良好,所以我真的没想到这个Occurrences_of_Words数据框会花费很长时间。

ps:如果答案很明显,并且您觉得有必要因提出这个问题而对我不利,请同时提供答案。谢谢。

I have a large (about 12M rows) dataframe df with say:

df.columns = ['word','documents','frequency']

So the following ran in a timely fashion:

word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']

However, this is taking an unexpected long time to run:

Occurrences_of_Words = word_grouping[['word']].count().reset_index()

What am I doing wrong here? Is there a better way to count occurences in a large dataframe?

df.word.describe()

ran pretty well, so I really did not expect this Occurrences_of_Words dataframe to take very long to build.

ps: If the answer is obvious and you feel the need to penalize me for asking this question, please include the answer as well. thank you.


回答 0

我认为df['word'].value_counts()应该服务。通过跳过groupby机制,您可以节省一些时间。我不知道为什么count要慢于max。两者都需要一些时间来避免丢失值。(与相比size。)

无论如何,对value_counts进行了专门优化以处理像您的单词这样的对象类型,因此我怀疑您会做得更好。

I think df['word'].value_counts() should serve. By skipping the groupby machinery, you’ll save some time. I’m not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)

In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you’ll do much better than that.


回答 1

当您想统计pandas dataFrame中一列中分类数据的频率时,请使用: df['Column_Name'].value_counts()

来源

When you want to count the frequency of categorical data in a column in pandas dataFrame use: df['Column_Name'].value_counts()

Source.


回答 2

只是先前答案的补充。别忘了,在处理实际数据时,可能会有空值,因此使用选项将默认值包括在内也很有用dropna=False默认值为True

一个例子:

>>> df['Embarked'].value_counts(dropna=False)
S      644
C      168
Q       77
NaN      2

Just an addition to the previous answers. Let’s not forget that when dealing with real data there might be null values, so it’s useful to also include those in the counting by using the option dropna=False (default is True)

An example:

>>> df['Embarked'].value_counts(dropna=False)
S      644
C      168
Q       77
NaN      2