标签归档:Python

如何仅在内存中运行Django的测试数据库?

问题:如何仅在内存中运行Django的测试数据库?

我的Django单元测试需要很长时间才能运行,因此我正在寻找加快速度的方法。我正在考虑安装SSD,但我知道它也有缺点。当然,我的代码可以做一些事情,但是我正在寻找结构上的修复方法。由于每次都需要重建/向南迁移数据库,因此即使运行单个测试也很慢。所以这是我的主意…

由于我知道测试数据库总是很小,所以为什么不能仅将系统配置为始终将整个测试数据库保留在RAM中?绝对不要触摸磁盘。如何在Django中配置它?我宁愿继续使用MySQL,因为这是我在生产中使用的方式,但是如果使用SQLite  3或其他方法可以简化这一点,我会采用这种方式。

SQLite或MySQL是否可以选择完全在内存中运行?应该可以配置RAM磁盘,然后配置测试数据库以将其数据存储在其中,但是我不确定如何告诉Django / MySQL为特定数据库使用不同的数据目录,特别是因为它不断被删除并重新创建每次运行。(我在Mac FWIW上。)

My Django unit tests take a long time to run, so I’m looking for ways to speed that up. I’m considering installing an SSD, but I know that has its downsides too. Of course, there are things I could do with my code, but I’m looking for a structural fix. Even running a single test is slow since the database needs to be rebuilt / south migrated every time. So here’s my idea…

Since I know the test database will always be quite small, why can’t I just configure the system to always keep the entire test database in RAM? Never touch the disk at all. How do I configure this in Django? I’d prefer to keep using MySQL since that’s what I use in production, but if SQLite 3 or something else makes this easy, I’d go that way.

Does SQLite or MySQL have an option to run entirely in memory? It should be possible to configure a RAM disk and then configure the test database to store its data there, but I’m not sure how to tell Django / MySQL to use a different data directory for a certain database, especially since it keeps getting erased and recreated each run. (I’m on a Mac FWIW.)


回答 0

如果在运行测试时将数据库引擎设置为sqlite3,则Django将使用内存数据库

settings.py在运行测试时使用如下代码将引擎设置为sqlite:

if 'test' in sys.argv:
    DATABASE_ENGINE = 'sqlite3'

或在Django 1.2中:

if 'test' in sys.argv:
    DATABASES['default'] = {'ENGINE': 'sqlite3'}

最后在Django 1.3和1.4中:

if 'test' in sys.argv:
    DATABASES['default'] = {'ENGINE': 'django.db.backends.sqlite3'}

(到Django 1.3并不一定要有完整的后端路径,但是可以使设置向前兼容。)

您还可以添加以下行,以防南向迁移出现问题:

    SOUTH_TESTS_MIGRATE = False

If you set your database engine to sqlite3 when you run your tests, Django will use a in-memory database.

I’m using code like this in my settings.py to set the engine to sqlite when running my tests:

if 'test' in sys.argv:
    DATABASE_ENGINE = 'sqlite3'

Or in Django 1.2:

if 'test' in sys.argv:
    DATABASES['default'] = {'ENGINE': 'sqlite3'}

And finally in Django 1.3 and 1.4:

if 'test' in sys.argv:
    DATABASES['default'] = {'ENGINE': 'django.db.backends.sqlite3'}

(The full path to the backend isn’t strictly necessary with Django 1.3, but makes the setting forward compatible.)

You can also add the following line, in case you are having problems with South migrations:

    SOUTH_TESTS_MIGRATE = False

回答 1

我通常为测试创建一个单独的设置文件,并在测试命令中使用它,例如

python manage.py test --settings=mysite.test_settings myapp

它有两个好处:

  1. 您不必检查testsys.argv中的任何此类神奇词,test_settings.py只需

    from settings import *
    
    # make tests faster
    SOUTH_TESTS_MIGRATE = False
    DATABASES['default'] = {'ENGINE': 'django.db.backends.sqlite3'}

    或者,您可以根据需要进一步调整它,将测试设置与生产设置完全分开。

  2. 另一个好处是您可以使用生产数据库引擎而不是sqlite3进行测试,从而避免了细微的错误,因此在开发使用时

    python manage.py test --settings=mysite.test_settings myapp

    并在提交代码之前运行一次

    python manage.py test myapp

    只是为了确保所有测试都通过了。

I usually create a separate settings file for tests and use it in test command e.g.

python manage.py test --settings=mysite.test_settings myapp

It has two benefits:

  1. You don’t have to check for test or any such magic word in sys.argv, test_settings.py can simply be

    from settings import *
    
    # make tests faster
    SOUTH_TESTS_MIGRATE = False
    DATABASES['default'] = {'ENGINE': 'django.db.backends.sqlite3'}
    

    Or you can further tweak it for your needs, cleanly separating test settings from production settings.

  2. Another benefit is that you can run test with production database engine instead of sqlite3 avoiding subtle bugs, so while developing use

    python manage.py test --settings=mysite.test_settings myapp
    

    and before committing code run once

    python manage.py test myapp
    

    just to be sure that all test are really passing.


回答 2

MySQL支持称为“ MEMORY”的存储引擎,您可以在数据库config(settings.py)中对其进行配置,如下所示:

    'USER': 'root',                      # Not used with sqlite3.
    'PASSWORD': '',                  # Not used with sqlite3.
    'OPTIONS': {
        "init_command": "SET storage_engine=MEMORY",
    }

请注意,MEMORY存储引擎不支持blob /文本列,因此,如果您使用django.db.models.TextField此功能,则将无法使用。

MySQL supports a storage engine called “MEMORY”, which you can configure in your database config (settings.py) as such:

    'USER': 'root',                      # Not used with sqlite3.
    'PASSWORD': '',                  # Not used with sqlite3.
    'OPTIONS': {
        "init_command": "SET storage_engine=MEMORY",
    }

Note that the MEMORY storage engine doesn’t support blob / text columns, so if you’re using django.db.models.TextField this won’t work for you.


回答 3

我无法回答您的主要问题,但是您可以做一些事情来加快速度。

首先,请确保您的MySQL数据库已设置为使用InnoDB。然后,它可以使用事务在每次测试之前回滚db的状态,以我的经验,这导致了极大的提速。您可以在settings.py中传递数据库init命令(Django 1.2语法):

DATABASES = {
    'default': {
            'ENGINE':'django.db.backends.mysql',
            'HOST':'localhost',
            'NAME':'mydb',
            'USER':'whoever',
            'PASSWORD':'whatever',
            'OPTIONS':{"init_command": "SET storage_engine=INNODB" } 
        }
    }

其次,您不需要每次都运行South迁移。设置SOUTH_TESTS_MIGRATE = False在你的settings.py和数据库将与普通的执行syncdb,这将是比通过所有历史悠久的迁移运行更快创建。

I can’t answer your main question, but there are a couple of things that you can do to speed things up.

Firstly, make sure that your MySQL database is set up to use InnoDB. Then it can use transactions to rollback the state of the db before each test, which in my experience has led to a massive speed-up. You can pass a database init command in your settings.py (Django 1.2 syntax):

DATABASES = {
    'default': {
            'ENGINE':'django.db.backends.mysql',
            'HOST':'localhost',
            'NAME':'mydb',
            'USER':'whoever',
            'PASSWORD':'whatever',
            'OPTIONS':{"init_command": "SET storage_engine=INNODB" } 
        }
    }

Secondly, you don’t need to run the South migrations each time. Set SOUTH_TESTS_MIGRATE = False in your settings.py and the database will be created with plain syncdb, which will be much quicker than running through all the historic migrations.


回答 4

您可以进行两次调整:

  • 使用事务表:在每个TestCase之后,将使用数据库回滚来设置初始固定装置状态。
  • 将您的数据库数据目录放在ramdisk上:就数据库创建而言,您将获得很多收益,并且运行测试会更快。

我正在使用这两种技巧,我很高兴。

如何在Ubuntu上为MySQL设置它:

$ sudo service mysql stop
$ sudo cp -pRL /var/lib/mysql /dev/shm/mysql

$ vim /etc/mysql/my.cnf
# datadir = /dev/shm/mysql
$ sudo service mysql start

当心,这只是为了测试,从内存中重新启动数据库后,丢失了!

You can do double tweaking:

  • use transactional tables: initial fixtures state will be set using database rollback after every TestCase.
  • put your database data dir on ramdisk: you will gain much as far as database creation is concerned and also running test will be faster.

I’m using both tricks and I’m quite happy.

How to set up it for MySQL on Ubuntu:

$ sudo service mysql stop
$ sudo cp -pRL /var/lib/mysql /dev/shm/mysql

$ vim /etc/mysql/my.cnf
# datadir = /dev/shm/mysql
$ sudo service mysql start

Beware, it’s just for testing, after reboot your database from memory is lost!


回答 5

另一种方法:让另一个MySQL实例在使用RAM磁盘的tempf中运行。这篇博客文章中的说明:加速MySQL以在Django中进行测试

优点:

  • 您使用与生产服务器完全相同的数据库
  • 无需更改默认的mysql配置

Another approach: have another instance of MySQL running in a tempfs that uses a RAM Disk. Instructions in this blog post: Speeding up MySQL for testing in Django.

Advantages:

  • You use the exactly same database that your production server uses
  • no need to change your default mysql configuration

回答 6

通过扩展Anurag的答案,我通过创建相同的test_settings并将以下内容添加到manage.py来简化了过程

if len(sys.argv) > 1 and sys.argv[1] == "test":
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.test_settings")
else:
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

似乎更干净,因为sys已经导入并且manage.py仅通过命令行使用,因此无需弄乱设置

Extending on Anurag’s answer I simplified the process by creating the same test_settings and adding the following to manage.py

if len(sys.argv) > 1 and sys.argv[1] == "test":
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.test_settings")
else:
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")

seems cleaner since sys is already imported and manage.py is only used via command line, so no need to clutter up settings


回答 7

在您的下方使用 setting.py

DATABASES['default']['ENGINE'] = 'django.db.backends.sqlite3'

Use below in your setting.py

DATABASES['default']['ENGINE'] = 'django.db.backends.sqlite3'

如何将键值元组列表转换成字典?

问题:如何将键值元组列表转换成字典?

我有一个列表,看起来像:

[('A', 1), ('B', 2), ('C', 3)]

我想把它变成一个像这样的字典:

{'A': 1, 'B': 2, 'C': 3}

最好的方法是什么?

编辑:我的元组列表实际上更像是:

[(A, 12937012397), (BERA, 2034927830), (CE, 2349057340)]

I have a list that looks like:

[('A', 1), ('B', 2), ('C', 3)]

I want to turn it into a dictionary that looks like:

{'A': 1, 'B': 2, 'C': 3}

What’s the best way to go about this?

EDIT: My list of tuples is actually more like:

[(A, 12937012397), (BERA, 2034927830), (CE, 2349057340)]

回答 0

这给了我与尝试拆分列表并压缩列表相同的错误。ValueError:字典更新序列元素#0的长度为1916;2个为必填项

那是你的实际问题。

答案是列表中的元素与您认为的不一样。如果键入,myList[0]您会发现列表的第一个元素不是二元组,例如('A', 1),而是1916长度的iterable。

一旦您真正有了原始问题(myList = [('A',1),('B',2),...])中所述表格的列表,您所需要做的就是dict(myList)

This gives me the same error as trying to split the list up and zip it. ValueError: dictionary update sequence element #0 has length 1916; 2 is required

THAT is your actual question.

The answer is that the elements of your list are not what you think they are. If you type myList[0] you will find that the first element of your list is not a two-tuple, e.g. ('A', 1), but rather a 1916-length iterable.

Once you actually have a list in the form you stated in your original question (myList = [('A',1),('B',2),...]), all you need to do is dict(myList).


回答 1

>>> dict([('A', 1), ('B', 2), ('C', 3)])
{'A': 1, 'C': 3, 'B': 2}
>>> dict([('A', 1), ('B', 2), ('C', 3)])
{'A': 1, 'C': 3, 'B': 2}

回答 2

你有尝试过吗?

>>> l=[('A',1), ('B',2), ('C',3)]
>>> d=dict(l)
>>> d
{'A': 1, 'C': 3, 'B': 2}

Have you tried this?

>>> l=[('A',1), ('B',2), ('C',3)]
>>> d=dict(l)
>>> d
{'A': 1, 'C': 3, 'B': 2}

回答 3

这是处理重复的元组“键”的方法:

# An example
l = [('A', 1), ('B', 2), ('C', 3), ('A', 5), ('D', 0), ('D', 9)]

# A solution
d = dict()
[d [t [0]].append(t [1]) if t [0] in list(d.keys()) 
 else d.update({t [0]: [t [1]]}) for t in l]
d

OUTPUT: {'A': [1, 5], 'B': [2], 'C': [3], 'D': [0, 9]}

Here is a way to handle duplicate tuple “keys”:

# An example
l = [('A', 1), ('B', 2), ('C', 3), ('A', 5), ('D', 0), ('D', 9)]

# A solution
d = dict()
[d [t [0]].append(t [1]) if t [0] in list(d.keys()) 
 else d.update({t [0]: [t [1]]}) for t in l]
d

OUTPUT: {'A': [1, 5], 'B': [2], 'C': [3], 'D': [0, 9]}

回答 4

使用字典推导的另一种方式

>>> t = [('A', 1), ('B', 2), ('C', 3)]
>>> d = { i:j for i,j in t }
>>> d
{'A': 1, 'B': 2, 'C': 3}

Another way using dictionary comprehensions,

>>> t = [('A', 1), ('B', 2), ('C', 3)]
>>> d = { i:j for i,j in t }
>>> d
{'A': 1, 'B': 2, 'C': 3}

回答 5

如果Tuple没有重复键,则很简单。

tup = [("A",0),("B",3),("C",5)]
dic = dict(tup)
print(dic)

如果元组具有键重复。

tup = [("A",0),("B",3),("C",5),("A",9),("B",4)]
dic = {}
for i, j in tup:
    dic.setdefault(i,[]).append(j)
print(dic)

If Tuple has no key repetitions, it’s Simple.

tup = [("A",0),("B",3),("C",5)]
dic = dict(tup)
print(dic)

If tuple has key repetitions.

tup = [("A",0),("B",3),("C",5),("A",9),("B",4)]
dic = {}
for i, j in tup:
    dic.setdefault(i,[]).append(j)
print(dic)

回答 6

l=[['A', 1], ['B', 2], ['C', 3]]
d={}
for i,j in l:
d.setdefault(i,j)
print(d)
l=[['A', 1], ['B', 2], ['C', 3]]
d={}
for i,j in l:
d.setdefault(i,j)
print(d)

如何像使用Sublime Text一样围绕PyCharm中的选定文本

问题:如何像使用Sublime Text一样围绕PyCharm中的选定文本

有没有一种方法可以配置PyCharm,使其能够仅通过键入括号键就可以将所选代码括在括号中,就像我们使用SublimText 2一样?

Is there a way to configure PyCharm to be able to surround selected code with parenthesis by just typing on the parenthesis key, like when we use SublimText 2?


回答 0

我想你想要类似的东西

Settings | Editor | General | Smart Keys -> Surround selection on typing quote or brace

I think you want something like

Settings | Editor | General | Smart Keys -> Surround selection on typing quote or brace


回答 1

PyCharm 4.0可以选择Surround With...,选择代码段并按

ctrl+ alt+T

或在Mac上:+ +T

选项1应该为您提供所需的功能:

PyCharm 4.0 has the option to Surround With..., by selecting your code snippet and pressing

ctrl + alt + T

or on Mac: + + T

Option 1 should provide you with the functionality you are looking for:


回答 2

Windows:打开pycharm并选择文件,设置,编辑器,智能键,在列表中,您将选中“键入引号或花括号时的环绕选择”,然后应用。 在此处输入图片说明

智能钥匙的pycharm位置的图像

Windows: open pycharm and select file, settings, Editor, Smart Keys, in the list you will check “Surround selection on typing quote or brace”, then apply. enter image description here

Image of pycharm location of smart keys


如何使用gensim的word2vec模型和python计算句子相似度

问题:如何使用gensim的word2vec模型和python计算句子相似度

根据Gensim Word2Vec,我可以使用gensim包中的word2vec模型来计算2个单词之间的相似度。

例如

trained_model.similarity('woman', 'man') 
0.73723527

但是,word2vec模型无法预测句子相似度。我在gensim中发现了具有句子相似性的LSI模型,但似乎无法与word2vec模型结合使用。我拥有的每个句子的语料库长度不是很长(少于10个字)。那么,有没有简单的方法可以达到目标呢?

According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words.

e.g.

trained_model.similarity('woman', 'man') 
0.73723527

However, the word2vec model fails to predict the sentence similarity. I find out the LSI model with sentence similarity in gensim, but, which doesn’t seem that can be combined with word2vec model. The length of corpus of each sentence I have is not very long (shorter than 10 words). So, are there any simple ways to achieve the goal?


回答 0

这实际上是您要问的一个非常具有挑战性的问题。计算句子相似度需要建立句子的语法模型,了解等效结构(例如“昨天他去商店”和“昨天他去商店”),不仅要在代词和动词上找到相似性,还要在句子中找到相似性。专有名词,在许多真实的文本示例中找到统计共现/关系,等等。

您可以尝试的最简单的方法-尽管我不知道这样做的效果如何,并且肯定不会给您带来最佳效果-首先,请删除所有“停止”字词(例如“ the”,“ an”等等),然后对两个句子中的单词运行word2vec,将一个句子中的向量求和,将另一个句子中的向量求和,然后找出两者之间的区别总和。通过将它们加起来而不是按单词进行区分,您至少不会受到单词顺序的限制。话虽这么说,这将以多种方式失败,而且无论如何都不是一个好的解决方案(尽管对这个问题的好的解决方案几乎总是涉及一定数量的NLP,机器学习和其他聪明才智)。

因此,简短的答案是,不,没有简单的方法可以做到这一点(至少不能很好地做到这一点)。

This is actually a pretty challenging problem that you are asking. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. “he walked to the store yesterday” and “yesterday, he walked to the store”), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurences / relationships in lots of real textual examples, etc.

The simplest thing you could try — though I don’t know how well this would perform and it would certainly not give you the optimal results — would be to first remove all “stop” words (words like “the”, “an”, etc. that don’t add much meaning to the sentence) and then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you’ll at least not be subject to word order. That being said, this will fail in lots of ways and isn’t a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).

So, short answer is, no, there’s no easy way to do this (at least not to do it well).


回答 1

由于您正在使用gensim,因此您可能应该使用doc2vec实现。doc2vec是word2vec在短语,句子和文档级别的扩展。这是一个非常简单的扩展,描述如下

http://cs.stanford.edu/~quocle/paragraph_vector.pdf

Gensim非常好,因为它直观,快速且灵活。很棒的是,您可以从word2vec官方页面上获取预训练的单词嵌入,并且gensim的Doc2Vec模型的syn0层暴露出来,以便您可以使用这些高质量的向量来植入单词嵌入!

GoogleNews-vectors-negative300.bin.gz(与Google Code链接)

我认为gensim绝对是在向量空间中嵌入句子的最简单的工具(到目前为止,对我来说也是最好的)。

除了上面的Le&Mikolov的论文中提出的技术外,还有其他的从句到向量技术。斯坦福大学的Socher和Manning无疑是该领域最著名的两位研究人员。他们的工作基于构成原则-句子的语义来自:

1. semantics of the words

2. rules for how these words interact and combine into phrases

他们已经提出了一些这样的模型(变得越来越复杂),以介绍如何使用构图来构建句子级的表示形式。

2011年- 展开递归自动编码器(非常简单。如有兴趣,请从此处开始)

2012- 矩阵向量神经网络

2013- 神经张量网络

2015年- 树LSTM

他的论文都可以在socher.org上找到。其中一些模型可用,但是我仍然建议gensim的doc2vec。一方面,2011 URAE并不是特别强大。此外,它还经过预训练,适用于释义news-y数据。他提供的代码不允许您重新训练网络。您也无法交换不同的单词向量,因此您陷入了Turian的2011年pre2之前的word2vec嵌入。这些向量肯定不在word2vec或GloVe的水平上。

尚未与Tree LSTM合作,但似乎很有希望!

tl; dr是的,请使用gensim的doc2vec。但是其他方法确实存在!

Since you’re using gensim, you should probably use it’s doc2vec implementation. doc2vec is an extension of word2vec to the phrase-, sentence-, and document-level. It’s a pretty simple extension, described here

http://cs.stanford.edu/~quocle/paragraph_vector.pdf

Gensim is nice because it’s intuitive, fast, and flexible. What’s great is that you can grab the pretrained word embeddings from the official word2vec page and the syn0 layer of gensim’s Doc2Vec model is exposed so that you can seed the word embeddings with these high quality vectors!

GoogleNews-vectors-negative300.bin.gz (as linked in Google Code)

I think gensim is definitely the easiest (and so far for me, the best) tool for embedding a sentence in a vector space.

There exist other sentence-to-vector techniques than the one proposed in Le & Mikolov’s paper above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this area. Their work has been based on the principle of compositionally – semantics of the sentence come from:

1. semantics of the words

2. rules for how these words interact and combine into phrases

They’ve proposed a few such models (getting increasingly more complex) for how to use compositionality to build sentence-level representations.

2011 – unfolding recursive autoencoder (very comparatively simple. start here if interested)

2012 – matrix-vector neural network

2013 – neural tensor network

2015 – Tree LSTM

his papers are all available at socher.org. Some of these models are available, but I’d still recommend gensim’s doc2vec. For one, the 2011 URAE isn’t particularly powerful. In addition, it comes pretrained with weights suited for paraphrasing news-y data. The code he provides does not allow you to retrain the network. You also can’t swap in different word vectors, so you’re stuck with 2011’s pre-word2vec embeddings from Turian. These vectors are certainly not on the level of word2vec’s or GloVe’s.

Haven’t worked with the Tree LSTM yet, but it seems very promising!

tl;dr Yeah, use gensim’s doc2vec. But other methods do exist!


回答 2

如果使用word2vec,则需要计算每个句子/文档中所有单词的平均向量,并在向量之间使用余弦相似度:

import numpy as np
from scipy import spatial

index2word_set = set(model.wv.index2word)

def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

计算相似度:

s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)

> 0.915479828613

If you are using word2vec, you need to calculate the average vector for all words in every sentence/document and use cosine similarity between vectors:

import numpy as np
from scipy import spatial

index2word_set = set(model.wv.index2word)

def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

Calculate similarity:

s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)

> 0.915479828613

回答 3

您可以使用Word Mover的距离算法。这是有关WMD简单描述

#load word2vec model, here GoogleNews is used
model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
#two sample sentences 
s1 = 'the first sentence'
s2 = 'the second text'

#calculate distance between two sentences using WMD algorithm
distance = model.wmdistance(s1, s2)

print ('distance = %.3f' % distance)

ps:如果您遇到有关导入pyemd库的错误,则可以使用以下命令进行安装:

pip install pyemd

you can use Word Mover’s Distance algorithm. here is an easy description about WMD.

#load word2vec model, here GoogleNews is used
model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)
#two sample sentences 
s1 = 'the first sentence'
s2 = 'the second text'

#calculate distance between two sentences using WMD algorithm
distance = model.wmdistance(s1, s2)

print ('distance = %.3f' % distance)

P.s.: if you face an error about import pyemd library, you can install it using following command:

pip install pyemd

回答 4

一旦计算了两组单词向量的总和,就应该取向量之间的余弦,而不是diff。余弦可以通过对两个向量的点积进行归一化来计算。因此,字数不是一个因素。

Once you compute the sum of the two sets of word vectors, you should take the cosine between the vectors, not the diff. The cosine can be computed by taking the dot product of the two vectors normalized. Thus, the word count is not a factor.


回答 5

文档中有一项功能,可获取单词列表并比较它们的相似性。

s1 = 'This room is dirty'
s2 = 'dirty and disgusting room' #corrected variable name

distance = model.wv.n_similarity(s1.lower().split(), s2.lower().split())

There is a function from the documentation taking a list of words and comparing their similarities.

s1 = 'This room is dirty'
s2 = 'dirty and disgusting room' #corrected variable name

distance = model.wv.n_similarity(s1.lower().split(), s2.lower().split())

回答 6

我想更新现有的解决方案,以帮助将要计算句子的语义相似性的人们。

第1步:

使用gensim加载合适的模型并计算句子中单词的单词向量并将其存储为单词列表

步骤2:计算句子向量

句子之间语义相似度的计算以前很困难,但是最近提出了一篇名为“句子嵌入的简单但难以理解的基线 ”的论文,该论文提出了一种简单的方法,即计算句子中单词向量的加权平均值,然后将其删除平均向量在其第一个主成分上的投影。这里,单词w的权重为a /(a + p(w)),其中a为参数,而p(w)为(估计的)单词频率,称为平滑逆频率该方法的性能明显更好。

一个简单的代码使用SIF计算句子矢量(平滑逆频率)在已经给出了本文提出的方法在这里

步骤3:使用sklearn cosine_similarity为句子加载两个向量并计算相似度。

这是计算句子相似度的最简单有效的方法。

I would like to update the existing solution to help the people who are going to calculate the semantic similarity of sentences.

Step 1:

Load the suitable model using gensim and calculate the word vectors for words in the sentence and store them as a word list

Step 2 : Computing the sentence vector

The calculation of semantic similarity between sentences was difficult before but recently a paper named “A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS” was proposed which suggests a simple approach by computing the weighted average of word vectors in the sentence and then remove the projections of the average vectors on their first principal component.Here the weight of a word w is a/(a + p(w)) with a being a parameter and p(w) the (estimated) word frequency called smooth inverse frequency.this method performing significantly better.

A simple code to calculate the sentence vector using SIF(smooth inverse frequency) the method proposed in the paper has been given here

Step 3: using sklearn cosine_similarity load two vectors for the sentences and compute the similarity.

This is the most simple and efficient method to compute the sentence similarity.


回答 7

我正在使用以下方法,效果很好。首先,您需要运行POSTagger,然后过滤句子以摆脱停用词(行列式,连词等)。我建议使用TextBlob APTagger。然后,通过获取句子中每个单词向量的均值来构建word2vec。Gemsim word2vec中的n_similarity方法通过允许传递两组单词进行比较来实现此目的。

I am using the following method and it works well. You first need to run a POSTagger and then filter your sentence to get rid of the stop words (determinants, conjunctions, …). I recommend TextBlob APTagger. Then you build a word2vec by taking the mean of each word vector in the sentence. The n_similarity method in Gemsim word2vec does exactly that by allowing to pass two sets of words to compare.


回答 8

Word2Vec的扩展旨在解决比较短语或句子等较长文本的问题。其中之一是para2vec或doc2vec。

“句子和文档的分布式表示形式” http://cs.stanford.edu/~quocle/paragraph_vector.pdf

http://rare-technologies.com/doc2vec-tutorial/

There are extensions of Word2Vec intended to solve the problem of comparing longer pieces of text like phrases or sentences. One of them is paragraph2vec or doc2vec.

“Distributed Representations of Sentences and Documents” http://cs.stanford.edu/~quocle/paragraph_vector.pdf

http://rare-technologies.com/doc2vec-tutorial/


回答 9

Gensim段落嵌入实现了一个称为Doc2Vec的模型。

IPython笔记本提供了不同的教程:

另一种方法将依赖Word2VecWord Mover的距离(WMD),如本教程所示:

另一种解决方案是依靠平均向量:

from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess    

def tidy_sentence(sentence, vocabulary):
    return [word for word in simple_preprocess(sentence) if word in vocabulary]    

def compute_sentence_similarity(sentence_1, sentence_2, model_wv):
    vocabulary = set(model_wv.index2word)    
    tokens_1 = tidy_sentence(sentence_1, vocabulary)    
    tokens_2 = tidy_sentence(sentence_2, vocabulary)    
    return model_wv.n_similarity(tokens_1, tokens_2)

wv = KeyedVectors.load('model.wv', mmap='r')
sim = compute_sentence_similarity('this is a sentence', 'this is also a sentence', wv)
print(sim)

最后,如果您可以运行Tensorflow,则可以尝试:https ://tfhub.dev/google/universal-sentence-encoder/2

Gensim implements a model called Doc2Vec for paragraph embedding.

There are different tutorials presented as IPython notebooks:

Another method would rely on Word2Vec and Word Mover’s Distance (WMD), as shown in this tutorial:

An alternative solution would be to rely on average vectors:

from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess    

def tidy_sentence(sentence, vocabulary):
    return [word for word in simple_preprocess(sentence) if word in vocabulary]    

def compute_sentence_similarity(sentence_1, sentence_2, model_wv):
    vocabulary = set(model_wv.index2word)    
    tokens_1 = tidy_sentence(sentence_1, vocabulary)    
    tokens_2 = tidy_sentence(sentence_2, vocabulary)    
    return model_wv.n_similarity(tokens_1, tokens_2)

wv = KeyedVectors.load('model.wv', mmap='r')
sim = compute_sentence_similarity('this is a sentence', 'this is also a sentence', wv)
print(sim)

Finally, if you can run Tensorflow, you may try: https://tfhub.dev/google/universal-sentence-encoder/2


回答 10

我已经尝试了先前答案提供的方法。它是可行的,但是它的主要缺点是句子越长,相似度越大(为了计算相似度,我使用任意两个句子的两个均值嵌入的余弦值),因为单词越多,语义效果就越积极将添加到句子中。

我想我应该改变我的主意,用句子,而不是嵌入在研究文章这个

I have tried the methods provided by the previous answers. It works, but the main drawback of it is that the longer the sentences the larger similarity will be(to calculate the similarity I use the cosine score of the two mean embeddings of any two sentences) since the more the words the more positive semantic effects will be added to the sentence.

I thought I should change my mind and use the sentence embedding instead as studied in this paper and this.


回答 11

Facebook研究小组发布了一个名为InferSent Results的新解决方案,其代码已发布在Github上,请检查其回购。太棒了 我打算使用它。 https://github.com/facebookresearch/InferSent

他们的论文 https://arxiv.org/abs/1705.02364 摘要:许多现代的NLP系统都依赖词嵌入作为基本特征,而词嵌入以前是在大型语料库上以无监督方式进行训练的。然而,为更大的文本块(例如句子)获得嵌入的努力并没有那么成功。学习句子的无监督表示的几种尝试还没有达到令人满意的性能,因此不能被广泛采用。在本文中,我们展示了使用斯坦福自然语言推理数据集的监督数据训练的通用句子表示在各种传输任务上如何始终能够胜过诸如SkipThought向量之类的无监督方法。就像计算机视觉如何使用ImageNet获取功能,然后将其转移到其他任务中一样,我们的工作倾向于表明自然语言推理是否适合将学习转移到其他NLP任务。我们的编码器是公开可用的。

Facebook Research group released a new solution called InferSent Results and code are published on Github, check their repo. It is pretty awesome. I am planning to use it. https://github.com/facebookresearch/InferSent

their paper https://arxiv.org/abs/1705.02364 Abstract: Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.


回答 12

如果不使用Word2Vec,我们还有其他模型可以使用BERT进行嵌入。以下是参考链接 https://github.com/UKPLab/sentence-transformers

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import scipy.spatial

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating a food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']
query_embeddings = embedder.encode(queries)

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))

其他链接以关注 https://github.com/hanxiao/bert-as-service

If not using Word2Vec we have other model to find it using BERT for embed. Below are reference link https://github.com/UKPLab/sentence-transformers

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import scipy.spatial

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating a food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']
query_embeddings = embedder.encode(queries)

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))

Other Link to follow https://github.com/hanxiao/bert-as-service


TensorFlow中Variable和get_variable之间的区别

问题:TensorFlow中Variable和get_variable之间的区别

据我所知,Variable是变量的默认操作,get_variable主要用于权重分配。

一方面,有人建议在需要变量时使用get_variable而不是原始Variable操作。另一方面,我只get_variable在TensorFlow的官方文档和演示中看到了任何使用。

因此,我想了解有关如何正确使用这两种机制的一些经验法则。是否有任何“标准”原则?

As far as I know, Variable is the default operation for making a variable, and get_variable is mainly used for weight sharing.

On the one hand, there are some people suggesting using get_variable instead of the primitive Variable operation whenever you need a variable. On the other hand, I merely see any use of get_variable in TensorFlow’s official documents and demos.

Thus I want to know some rules of thumb on how to correctly use these two mechanisms. Are there any “standard” principles?


回答 0

我建议始终使用tf.get_variable(...)-如果您需要随时共享变量,例如在multi-gpu设置中(请参见multi-gpu CIFAR示例),它将使您更轻松地重构代码。没有不利的一面。

tf.Variable是低级的。在某些时候tf.get_variable()不存在,因此某些代码仍使用低级方式。

I’d recommend to always use tf.get_variable(...) — it will make it way easier to refactor your code if you need to share variables at any time, e.g. in a multi-gpu setting (see the multi-gpu CIFAR example). There is no downside to it.

Pure tf.Variable is lower-level; at some point tf.get_variable() did not exist so some code still uses the low-level way.


回答 1

tf.Variable是一个类,有多种创建tf.Variable的方法,包括tf.Variable.__init__tf.get_variable

tf.Variable.__init__:创建一个带有initial_value的新变量。

W = tf.Variable(<initial-value>, name=<optional-name>)

tf.get_variable:获取具有这些参数的现有变量或创建一个新变量。您也可以使用初始化程序。

W = tf.get_variable(name, shape=None, dtype=tf.float32, initializer=None,
       regularizer=None, trainable=True, collections=None)

使用初始化器如xavier_initializer

W = tf.get_variable("W", shape=[784, 256],
       initializer=tf.contrib.layers.xavier_initializer())

更多信息在这里

tf.Variable is a class, and there are several ways to create tf.Variable including tf.Variable.__init__ and tf.get_variable.

tf.Variable.__init__: Creates a new variable with initial_value.

W = tf.Variable(<initial-value>, name=<optional-name>)

tf.get_variable: Gets an existing variable with these parameters or creates a new one. You can also use initializer.

W = tf.get_variable(name, shape=None, dtype=tf.float32, initializer=None,
       regularizer=None, trainable=True, collections=None)

It’s very useful to use initializers such as xavier_initializer:

W = tf.get_variable("W", shape=[784, 256],
       initializer=tf.contrib.layers.xavier_initializer())

More information here.


回答 2

我可以发现彼此之间的两个主要区别:

  1. 首先,tf.Variable它将始终创建一个新变量,而从图中tf.get_variable获取具有指定参数的现有变量,如果不存在,则创建一个新变量。

  2. tf.Variable 要求指定一个初始值。

重要的是要阐明该函数tf.get_variable在名称前加上当前变量作用域以执行重用检查。例如:

with tf.variable_scope("one"):
    a = tf.get_variable("v", [1]) #a.name == "one/v:0"
with tf.variable_scope("one"):
    b = tf.get_variable("v", [1]) #ValueError: Variable one/v already exists
with tf.variable_scope("one", reuse = True):
    c = tf.get_variable("v", [1]) #c.name == "one/v:0"

with tf.variable_scope("two"):
    d = tf.get_variable("v", [1]) #d.name == "two/v:0"
    e = tf.Variable(1, name = "v", expected_shape = [1]) #e.name == "two/v_1:0"

assert(a is c)  #Assertion is true, they refer to the same object.
assert(a is d)  #AssertionError: they are different objects
assert(d is e)  #AssertionError: they are different objects

最后一个断言错误很有趣:在相同范围内具有相同名称的两个变量应该是相同变量。但是,如果你测试变量的名字de你会发现,Tensorflow改变变量的名称e

d.name   #d.name == "two/v:0"
e.name   #e.name == "two/v_1:0"

I can find two main differences between one and the other:

  1. First is that tf.Variable will always create a new variable, whereas tf.get_variable gets an existing variable with specified parameters from the graph, and if it doesn’t exist, creates a new one.

  2. tf.Variable requires that an initial value be specified.

It is important to clarify that the function tf.get_variable prefixes the name with the current variable scope to perform reuse checks. For example:

with tf.variable_scope("one"):
    a = tf.get_variable("v", [1]) #a.name == "one/v:0"
with tf.variable_scope("one"):
    b = tf.get_variable("v", [1]) #ValueError: Variable one/v already exists
with tf.variable_scope("one", reuse = True):
    c = tf.get_variable("v", [1]) #c.name == "one/v:0"

with tf.variable_scope("two"):
    d = tf.get_variable("v", [1]) #d.name == "two/v:0"
    e = tf.Variable(1, name = "v", expected_shape = [1]) #e.name == "two/v_1:0"

assert(a is c)  #Assertion is true, they refer to the same object.
assert(a is d)  #AssertionError: they are different objects
assert(d is e)  #AssertionError: they are different objects

The last assertion error is interesting: Two variables with the same name under the same scope are supposed to be the same variable. But if you test the names of variables d and e you will realize that Tensorflow changed the name of variable e:

d.name   #d.name == "two/v:0"
e.name   #e.name == "two/v_1:0"

回答 3

另一个不同之处在于 ('variable_store',)集合中,而另一个不在。

请查看源代码

def _get_default_variable_store():
  store = ops.get_collection(_VARSTORE_KEY)
  if store:
    return store[0]
  store = _VariableStore()
  ops.add_to_collection(_VARSTORE_KEY, store)
  return store

让我说明一下:

import tensorflow as tf
from tensorflow.python.framework import ops

embedding_1 = tf.Variable(tf.constant(1.0, shape=[30522, 1024]), name="word_embeddings_1", dtype=tf.float32) 
embedding_2 = tf.get_variable("word_embeddings_2", shape=[30522, 1024])

graph = tf.get_default_graph()
collections = graph.collections

for c in collections:
    stores = ops.get_collection(c)
    print('collection %s: ' % str(c))
    for k, store in enumerate(stores):
        try:
            print('\t%d: %s' % (k, str(store._vars)))
        except:
            print('\t%d: %s' % (k, str(store)))
    print('')

输出:

collection ('__variable_store',): 0: {'word_embeddings_2': <tf.Variable 'word_embeddings_2:0' shape=(30522, 1024) dtype=float32_ref>}

Another difference lies in that one is in ('variable_store',) collection but the other is not.

Please see the source code:

def _get_default_variable_store():
  store = ops.get_collection(_VARSTORE_KEY)
  if store:
    return store[0]
  store = _VariableStore()
  ops.add_to_collection(_VARSTORE_KEY, store)
  return store

Let me illustrate that:

import tensorflow as tf
from tensorflow.python.framework import ops

embedding_1 = tf.Variable(tf.constant(1.0, shape=[30522, 1024]), name="word_embeddings_1", dtype=tf.float32) 
embedding_2 = tf.get_variable("word_embeddings_2", shape=[30522, 1024])

graph = tf.get_default_graph()
collections = graph.collections

for c in collections:
    stores = ops.get_collection(c)
    print('collection %s: ' % str(c))
    for k, store in enumerate(stores):
        try:
            print('\t%d: %s' % (k, str(store._vars)))
        except:
            print('\t%d: %s' % (k, str(store)))
    print('')

The output:

collection ('__variable_store',): 0: {'word_embeddings_2': <tf.Variable 'word_embeddings_2:0' shape=(30522, 1024) dtype=float32_ref>}


如何使用NLTK标记器消除标点符号?

问题:如何使用NLTK标记器消除标点符号?

我刚刚开始使用NLTK,但我不太了解如何从文本中获取单词列表。如果使用nltk.word_tokenize(),则会得到单词和标点的列表。我只需要这些词。我如何摆脱标点符号?同样word_tokenize不适用于多个句子:点号会添加到最后一个单词中。

I’m just starting to use NLTK and I don’t quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn’t work with multiple sentences: dots are added to the last word.


回答 0

看看nltk 在此处提供的其他标记化选项。例如,您可以定义一个令牌生成器,该令牌生成器将字母数字字符序列选作令牌,并丢弃其他所有内容:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

输出:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

回答 1

您实际上并不需要NLTK来删除标点符号。您可以使用简单的python将其删除。对于字符串:

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)

或对于unicode:

import string
translate_table = dict((ord(char), None) for char in string.punctuation)   
s.translate(translate_table)

然后在令牌生成器中使用此字符串。

PS字符串模块还有一些其他可以删除的元素集(例如数字)。

You do not really need NLTK to remove punctuation. You can remove it with simple python. For strings:

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)

Or for unicode:

import string
translate_table = dict((ord(char), None) for char in string.punctuation)   
s.translate(translate_table)

and then use this string in your tokenizer.

P.S. string module have some other sets of elements that can be removed (like digits).


回答 2

下面的代码将删除所有标点符号以及非字母字符。从他们的书中复制。

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words=[word.lower() for word in words if word.isalpha()]

print(words)

输出

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

Below code will remove all punctuation marks as well as non alphabetic characters. Copied from their book.

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words=[word.lower() for word in words if word.isalpha()]

print(words)

output

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

回答 3

正如注释中所注意到的那样,因为word_tokenize()仅适用于单个句子,所以它以send_tokenize()开头。您可以使用filter()过滤出标点符号。如果您有一个unicode字符串,请确保它是一个unicode对象(而不是使用“ utf-8”之类的编码编码的“ str”)。

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print filter(lambda word: word not in ',-', tokens)

As noticed in comments start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have an unicode strings make sure that is a unicode object (not a ‘str’ encoded with some encoding like ‘utf-8’).

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print filter(lambda word: word not in ',-', tokens)

回答 4

我只使用了以下代码,删除了所有标点符号:

tokens = nltk.wordpunct_tokenize(raw)

type(tokens)

text = nltk.Text(tokens)

type(text)  

words = [w.lower() for w in text if w.isalpha()]

I just used the following code, which removed all the punctuation:

tokens = nltk.wordpunct_tokenize(raw)

type(tokens)

text = nltk.Text(tokens)

type(text)  

words = [w.lower() for w in text if w.isalpha()]

回答 5

我认为您需要某种正则表达式匹配(以下代码在Python 3中):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)

输出:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

在大多数情况下应该可以正常使用,因为它可以删除标点符号,同时保留“ n’t”之类的令牌,而这些令牌不能从regex令牌生成器(如)获得wordpunct_tokenize

I think you need some sort of regular expression matching (the following code is in Python 3):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)

Output:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

Should work well in most cases since it removes punctuation while preserving tokens like “n’t”, which can’t be obtained from regex tokenizers such as wordpunct_tokenize.


回答 6

真诚的问,这是什么字?如果您假设一个单词仅由字母字符组成,那您就错了,因为如果在标记化之前删除标点符号,can't诸如的单词将被破坏成碎片(例如cant),这很可能会对程序产生负面影响。

因此,解决方案是先标记化然后删除标点标记

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

……然后,如果你愿意,你可以替换某些标记,如'mam

Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong since words such as can't will be destroyed into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.

Hence the solution is to tokenise and then remove punctuation tokens.

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

…and then if you wish, you can replace certain tokens such as 'm with am.


回答 7

我使用以下代码删除标点符号:

import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print tokens
    print words

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

而且,如果您想检查令牌是否为有效的英语单词,则可能需要PyEnchant

教程:

 import enchant
 d = enchant.Dict("en_US")
 d.check("Hello")
 d.check("Helo")
 d.suggest("Helo")

I use this code to remove punctuation:

import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print tokens
    print words

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

And If you want to check whether a token is a valid English word or not, you may need PyEnchant

Tutorial:

 import enchant
 d = enchant.Dict("en_US")
 d.check("Hello")
 d.check("Helo")
 d.suggest("Helo")

回答 8

删除标点符号(它将删除和标点符号处理的一部分,使用下面的代码)

        tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
        text_string = text_string.translate(tbl) #text_string don't have punctuation
        w = word_tokenize(text_string)  #now tokenize the string 

样本输入/输出:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']

Remove punctuaion(It will remove . as well as part of punctuation handling using below code)

        tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
        text_string = text_string.translate(tbl) #text_string don't have punctuation
        w = word_tokenize(text_string)  #now tokenize the string 

Sample Input/Output:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']


回答 9

只需添加@rmalouf的解决方案,就不会包含任何数字,因为\ w +等效于[a-zA-Z0-9_]

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Just adding to the solution by @rmalouf, this will not include any numbers because \w+ is equivalent to [a-zA-Z0-9_]

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

回答 10

您可以在没有nltk(python 3.x)的情况下一行完成此操作。

import string
string_text= string_text.translate(str.maketrans('','',string.punctuation))

You can do it in one line without nltk (python 3.x).

import string
string_text= string_text.translate(str.maketrans('','',string.punctuation))

如何将xml字符串转换为字典?

问题:如何将xml字符串转换为字典?

我有一个程序可以从套接字读取xml文档。我将xml文档存储在一个字符串中,我想将其直接转换为Python字典,就像在Django的simplejson库中一样。

举个例子:

str ="<?xml version="1.0" ?><person><name>john</name><age>20</age></person"
dic_xml = convert_to_dic(str)

然后dic_xml看起来像{'person' : { 'name' : 'john', 'age' : 20 } }

I have a program that reads an xml document from a socket. I have the xml document stored in a string which I would like to convert directly to a Python dictionary, the same way it is done in Django’s simplejson library.

Take as an example:

str ="<?xml version="1.0" ?><person><name>john</name><age>20</age></person"
dic_xml = convert_to_dic(str)

Then dic_xml would look like {'person' : { 'name' : 'john', 'age' : 20 } }


回答 0

这是某人创建的一个很棒的模块。我已经使用过几次了。 http://code.activestate.com/recipes/410469-xml-as-dictionary/

这是网站上的代码,以防链接损坏。

from xml.etree import cElementTree as ElementTree

class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            if element:
                # treat like dict
                if len(element) == 1 or element[0].tag != element[1].tag:
                    self.append(XmlDictConfig(element))
                # treat like list
                elif element[0].tag == element[1].tag:
                    self.append(XmlListConfig(element))
            elif element.text:
                text = element.text.strip()
                if text:
                    self.append(text)


class XmlDictConfig(dict):
    '''
    Example usage:

    >>> tree = ElementTree.parse('your_file.xml')
    >>> root = tree.getroot()
    >>> xmldict = XmlDictConfig(root)

    Or, if you want to use an XML string:

    >>> root = ElementTree.XML(xml_string)
    >>> xmldict = XmlDictConfig(root)

    And then use xmldict for what it is... a dict.
    '''
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            if element:
                # treat like dict - we assume that if the first two tags
                # in a series are different, then they are all different.
                if len(element) == 1 or element[0].tag != element[1].tag:
                    aDict = XmlDictConfig(element)
                # treat like list - we assume that if the first two tags
                # in a series are the same, then the rest are the same.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself 
                    aDict = {element[0].tag: XmlListConfig(element)}
                # if the tag has attributes, add those to the dict
                if element.items():
                    aDict.update(dict(element.items()))
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a 
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})

用法示例:

tree = ElementTree.parse('your_file.xml')
root = tree.getroot()
xmldict = XmlDictConfig(root)

//或者,如果要使用XML字符串:

root = ElementTree.XML(xml_string)
xmldict = XmlDictConfig(root)

This is a great module that someone created. I’ve used it several times. http://code.activestate.com/recipes/410469-xml-as-dictionary/

Here is the code from the website just in case the link goes bad.

from xml.etree import cElementTree as ElementTree

class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            if element:
                # treat like dict
                if len(element) == 1 or element[0].tag != element[1].tag:
                    self.append(XmlDictConfig(element))
                # treat like list
                elif element[0].tag == element[1].tag:
                    self.append(XmlListConfig(element))
            elif element.text:
                text = element.text.strip()
                if text:
                    self.append(text)


class XmlDictConfig(dict):
    '''
    Example usage:

    >>> tree = ElementTree.parse('your_file.xml')
    >>> root = tree.getroot()
    >>> xmldict = XmlDictConfig(root)

    Or, if you want to use an XML string:

    >>> root = ElementTree.XML(xml_string)
    >>> xmldict = XmlDictConfig(root)

    And then use xmldict for what it is... a dict.
    '''
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            if element:
                # treat like dict - we assume that if the first two tags
                # in a series are different, then they are all different.
                if len(element) == 1 or element[0].tag != element[1].tag:
                    aDict = XmlDictConfig(element)
                # treat like list - we assume that if the first two tags
                # in a series are the same, then the rest are the same.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself 
                    aDict = {element[0].tag: XmlListConfig(element)}
                # if the tag has attributes, add those to the dict
                if element.items():
                    aDict.update(dict(element.items()))
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a 
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})

Example usage:

tree = ElementTree.parse('your_file.xml')
root = tree.getroot()
xmldict = XmlDictConfig(root)

//Or, if you want to use an XML string:

root = ElementTree.XML(xml_string)
xmldict = XmlDictConfig(root)

回答 1

xmltodict(完全公开:我写了它)确实做到了:

xmltodict.parse("""
<?xml version="1.0" ?>
<person>
  <name>john</name>
  <age>20</age>
</person>""")
# {u'person': {u'age': u'20', u'name': u'john'}}

xmltodict (full disclosure: I wrote it) does exactly that:

xmltodict.parse("""
<?xml version="1.0" ?>
<person>
  <name>john</name>
  <age>20</age>
</person>""")
# {u'person': {u'age': u'20', u'name': u'john'}}

回答 2

以下XML-to-Python-dict片段分析了此XML-to-JSON“规范”之后的实体以及属性。这是处理XML所有情况的最通用的解决方案。

from collections import defaultdict

def etree_to_dict(t):
    d = {t.tag: {} if t.attrib else None}
    children = list(t)
    if children:
        dd = defaultdict(list)
        for dc in map(etree_to_dict, children):
            for k, v in dc.items():
                dd[k].append(v)
        d = {t.tag: {k:v[0] if len(v) == 1 else v for k, v in dd.items()}}
    if t.attrib:
        d[t.tag].update(('@' + k, v) for k, v in t.attrib.items())
    if t.text:
        text = t.text.strip()
        if children or t.attrib:
            if text:
              d[t.tag]['#text'] = text
        else:
            d[t.tag] = text
    return d

它用于:

from xml.etree import cElementTree as ET
e = ET.XML('''
<root>
  <e />
  <e>text</e>
  <e name="value" />
  <e name="value">text</e>
  <e> <a>text</a> <b>text</b> </e>
  <e> <a>text</a> <a>text</a> </e>
  <e> text <a>text</a> </e>
</root>
''')

from pprint import pprint
pprint(etree_to_dict(e))

此示例的输出(根据上面链接的“规范”)应为:

{'root': {'e': [None,
                'text',
                {'@name': 'value'},
                {'#text': 'text', '@name': 'value'},
                {'a': 'text', 'b': 'text'},
                {'a': ['text', 'text']},
                {'#text': 'text', 'a': 'text'}]}}

不一定很漂亮,但它是明确的,而更简单的XML输入会导致更简单的JSON。:)


更新资料

如果要进行相反的操作从JSON / dict发出XML字符串,则可以使用:

try:
  basestring
except NameError:  # python3
  basestring = str

def dict_to_etree(d):
    def _to_etree(d, root):
        if not d:
            pass
        elif isinstance(d, basestring):
            root.text = d
        elif isinstance(d, dict):
            for k,v in d.items():
                assert isinstance(k, basestring)
                if k.startswith('#'):
                    assert k == '#text' and isinstance(v, basestring)
                    root.text = v
                elif k.startswith('@'):
                    assert isinstance(v, basestring)
                    root.set(k[1:], v)
                elif isinstance(v, list):
                    for e in v:
                        _to_etree(e, ET.SubElement(root, k))
                else:
                    _to_etree(v, ET.SubElement(root, k))
        else:
            raise TypeError('invalid type: ' + str(type(d)))
    assert isinstance(d, dict) and len(d) == 1
    tag, body = next(iter(d.items()))
    node = ET.Element(tag)
    _to_etree(body, node)
    return ET.tostring(node)

pprint(dict_to_etree(d))

The following XML-to-Python-dict snippet parses entities as well as attributes following this XML-to-JSON “specification”. It is the most general solution handling all cases of XML.

from collections import defaultdict

def etree_to_dict(t):
    d = {t.tag: {} if t.attrib else None}
    children = list(t)
    if children:
        dd = defaultdict(list)
        for dc in map(etree_to_dict, children):
            for k, v in dc.items():
                dd[k].append(v)
        d = {t.tag: {k:v[0] if len(v) == 1 else v for k, v in dd.items()}}
    if t.attrib:
        d[t.tag].update(('@' + k, v) for k, v in t.attrib.items())
    if t.text:
        text = t.text.strip()
        if children or t.attrib:
            if text:
              d[t.tag]['#text'] = text
        else:
            d[t.tag] = text
    return d

It is used:

from xml.etree import cElementTree as ET
e = ET.XML('''
<root>
  <e />
  <e>text</e>
  <e name="value" />
  <e name="value">text</e>
  <e> <a>text</a> <b>text</b> </e>
  <e> <a>text</a> <a>text</a> </e>
  <e> text <a>text</a> </e>
</root>
''')

from pprint import pprint
pprint(etree_to_dict(e))

The output of this example (as per above-linked “specification”) should be:

{'root': {'e': [None,
                'text',
                {'@name': 'value'},
                {'#text': 'text', '@name': 'value'},
                {'a': 'text', 'b': 'text'},
                {'a': ['text', 'text']},
                {'#text': 'text', 'a': 'text'}]}}

Not necessarily pretty, but it is unambiguous, and simpler XML inputs result in simpler JSON. :)


Update

If you want to do the reverse, emit an XML string from a JSON/dict, you can use:

try:
  basestring
except NameError:  # python3
  basestring = str

def dict_to_etree(d):
    def _to_etree(d, root):
        if not d:
            pass
        elif isinstance(d, basestring):
            root.text = d
        elif isinstance(d, dict):
            for k,v in d.items():
                assert isinstance(k, basestring)
                if k.startswith('#'):
                    assert k == '#text' and isinstance(v, basestring)
                    root.text = v
                elif k.startswith('@'):
                    assert isinstance(v, basestring)
                    root.set(k[1:], v)
                elif isinstance(v, list):
                    for e in v:
                        _to_etree(e, ET.SubElement(root, k))
                else:
                    _to_etree(v, ET.SubElement(root, k))
        else:
            raise TypeError('invalid type: ' + str(type(d)))
    assert isinstance(d, dict) and len(d) == 1
    tag, body = next(iter(d.items()))
    node = ET.Element(tag)
    _to_etree(body, node)
    return ET.tostring(node)

pprint(dict_to_etree(d))

回答 3

这个轻量级的版本虽然不可配置,但是很容易根据需要进行定制,并且可以在旧的python中工作。它也是严格的-意味着无论属性是否存在,结果都是相同的。

import xml.etree.ElementTree as ET

from copy import copy

def dictify(r,root=True):
    if root:
        return {r.tag : dictify(r, False)}
    d=copy(r.attrib)
    if r.text:
        d["_text"]=r.text
    for x in r.findall("./*"):
        if x.tag not in d:
            d[x.tag]=[]
        d[x.tag].append(dictify(x,False))
    return d

所以:

root = ET.fromstring("<erik><a x='1'>v</a><a y='2'>w</a></erik>")

dictify(root)

结果是:

{'erik': {'a': [{'x': '1', '_text': 'v'}, {'y': '2', '_text': 'w'}]}}

This lightweight version, while not configurable, is pretty easy to tailor as needed, and works in old pythons. Also it is rigid – meaning the results are the same regardless of the existence of attributes.

import xml.etree.ElementTree as ET

from copy import copy

def dictify(r,root=True):
    if root:
        return {r.tag : dictify(r, False)}
    d=copy(r.attrib)
    if r.text:
        d["_text"]=r.text
    for x in r.findall("./*"):
        if x.tag not in d:
            d[x.tag]=[]
        d[x.tag].append(dictify(x,False))
    return d

So:

root = ET.fromstring("<erik><a x='1'>v</a><a y='2'>w</a></erik>")

dictify(root)

Results in:

{'erik': {'a': [{'x': '1', '_text': 'v'}, {'y': '2', '_text': 'w'}]}}

回答 4

PicklingTools库的最新版本(1.3.0和1.3.1)支持将XML转换为Python dict的工具。

可从此处下载文件: PicklingTools 1.3.1

没有为转换颇有几分文档在这里:文档中详细的所有XML和Python字典之间转换时将产生的决定和问题描述(也有一些边缘情况:属性,列表,匿名列表,匿名多数转换器无法处理的dict,eval等)。通常,这些转换器易于使用。如果“ example.xml”包含:

<top>
  <a>1</a>
  <b>2.2</b>
  <c>three</c>
</top>

然后将其转换为字典:

>>> from xmlloader import *
>>> example = file('example.xml', 'r')   # A document containing XML
>>> xl = StreamXMLLoader(example, 0)     # 0 = all defaults on operation
>>> result = xl.expect XML()
>>> print result
{'top': {'a': '1', 'c': 'three', 'b': '2.2'}}

有一些可以在C ++和Python中进行转换的工具:C ++和Python可以进行相同的转换,但是C ++的速度要快60倍左右

The most recent versions of the PicklingTools libraries (1.3.0 and 1.3.1) support tools for converting from XML to a Python dict.

The download is available here: PicklingTools 1.3.1

There is quite a bit of documentation for the converters here: the documentation describes in detail all of the decisions and issues that will arise when converting between XML and Python dictionaries (there are a number of edge cases: attributes, lists, anonymous lists, anonymous dicts, eval, etc. that most converters don’t handle). In general, though, the converters are easy to use. If an ‘example.xml’ contains:

<top>
  <a>1</a>
  <b>2.2</b>
  <c>three</c>
</top>

Then to convert it to a dictionary:

>>> from xmlloader import *
>>> example = file('example.xml', 'r')   # A document containing XML
>>> xl = StreamXMLLoader(example, 0)     # 0 = all defaults on operation
>>> result = xl.expect XML()
>>> print result
{'top': {'a': '1', 'c': 'three', 'b': '2.2'}}

There are tools for converting in both C++ and Python: the C++ and Python do indentical conversion, but the C++ is about 60x faster


回答 5

您可以使用lxml轻松完成此操作。首先安装它:

[sudo] pip install lxml

这是我编写的递归函数,可以为您完成繁重的工作:

from lxml import objectify as xml_objectify


def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    return xml_to_dict_recursion(xml_objectify.fromstring(xml_str))

xml_string = """<?xml version="1.0" encoding="UTF-8"?><Response><NewOrderResp>
<IndustryType>Test</IndustryType><SomeData><SomeNestedData1>1234</SomeNestedData1>
<SomeNestedData2>3455</SomeNestedData2></SomeData></NewOrderResp></Response>"""

print xml_to_dict(xml_string)

以下变体保留了父键/元素:

def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library, see http://lxml.de/ """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:  # if empty dict returned
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    xml_obj = objectify.fromstring(xml_str)
    return {xml_obj.tag: xml_to_dict_recursion(xml_obj)}

如果只想返回一个子树并将其转换为dict,则可以使用Element.find()获取该子树,然后对其进行转换:

xml_obj.find('.//')  # lxml.objectify.ObjectifiedElement instance

请在此处查看lxml文档。我希望这有帮助!

You can do this quite easily with lxml. First install it:

[sudo] pip install lxml

Here is a recursive function I wrote that does the heavy lifting for you:

from lxml import objectify as xml_objectify


def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    return xml_to_dict_recursion(xml_objectify.fromstring(xml_str))

xml_string = """<?xml version="1.0" encoding="UTF-8"?><Response><NewOrderResp>
<IndustryType>Test</IndustryType><SomeData><SomeNestedData1>1234</SomeNestedData1>
<SomeNestedData2>3455</SomeNestedData2></SomeData></NewOrderResp></Response>"""

print xml_to_dict(xml_string)

The below variant preserves the parent key / element:

def xml_to_dict(xml_str):
    """ Convert xml to dict, using lxml v3.4.2 xml processing library, see http://lxml.de/ """
    def xml_to_dict_recursion(xml_object):
        dict_object = xml_object.__dict__
        if not dict_object:  # if empty dict returned
            return xml_object
        for key, value in dict_object.items():
            dict_object[key] = xml_to_dict_recursion(value)
        return dict_object
    xml_obj = objectify.fromstring(xml_str)
    return {xml_obj.tag: xml_to_dict_recursion(xml_obj)}

If you want to only return a subtree and convert it to dict, you can use Element.find() to get the subtree and then convert it:

xml_obj.find('.//')  # lxml.objectify.ObjectifiedElement instance

See the lxml docs here. I hope this helps!


回答 6

免责声明:此经过修改的XML解析器受到Adam Clark 的启发。原始XML解析器适用于大多数简单情况。但是,它不适用于某些复杂的XML文件。我逐行调试了代码,最后解决了一些问题。如果您发现一些错误,请告诉我。我很高兴修复它。

class XmlDictConfig(dict):  
    '''   
    Note: need to add a root into if no exising    
    Example usage:
    >>> tree = ElementTree.parse('your_file.xml')
    >>> root = tree.getroot()
    >>> xmldict = XmlDictConfig(root)
    Or, if you want to use an XML string:
    >>> root = ElementTree.XML(xml_string)
    >>> xmldict = XmlDictConfig(root)
    And then use xmldict for what it is... a dict.
    '''
    def __init__(self, parent_element):
        if parent_element.items():
            self.updateShim( dict(parent_element.items()) )
        for element in parent_element:
            if len(element):
                aDict = XmlDictConfig(element)
            #   if element.items():
            #   aDict.updateShim(dict(element.items()))
                self.updateShim({element.tag: aDict})
            elif element.items():    # items() is specialy for attribtes
                elementattrib= element.items()
                if element.text:           
                    elementattrib.append((element.tag,element.text ))     # add tag:text if there exist
                self.updateShim({element.tag: dict(elementattrib)})
            else:
                self.updateShim({element.tag: element.text})

    def updateShim (self, aDict ):
        for key in aDict.keys():   # keys() includes tag and attributes
            if key in self:
                value = self.pop(key)
                if type(value) is not list:
                    listOfDicts = []
                    listOfDicts.append(value)
                    listOfDicts.append(aDict[key])
                    self.update({key: listOfDicts})
                else:
                    value.append(aDict[key])
                    self.update({key: value})
            else:
                self.update({key:aDict[key]})  # it was self.update(aDict)    

Disclaimer: This modified XML parser was inspired by Adam Clark The original XML parser works for most of simple cases. However, it didn’t work for some complicated XML files. I debugged the code line by line and finally fixed some issues. If you find some bugs, please let me know. I am glad to fix it.

class XmlDictConfig(dict):  
    '''   
    Note: need to add a root into if no exising    
    Example usage:
    >>> tree = ElementTree.parse('your_file.xml')
    >>> root = tree.getroot()
    >>> xmldict = XmlDictConfig(root)
    Or, if you want to use an XML string:
    >>> root = ElementTree.XML(xml_string)
    >>> xmldict = XmlDictConfig(root)
    And then use xmldict for what it is... a dict.
    '''
    def __init__(self, parent_element):
        if parent_element.items():
            self.updateShim( dict(parent_element.items()) )
        for element in parent_element:
            if len(element):
                aDict = XmlDictConfig(element)
            #   if element.items():
            #   aDict.updateShim(dict(element.items()))
                self.updateShim({element.tag: aDict})
            elif element.items():    # items() is specialy for attribtes
                elementattrib= element.items()
                if element.text:           
                    elementattrib.append((element.tag,element.text ))     # add tag:text if there exist
                self.updateShim({element.tag: dict(elementattrib)})
            else:
                self.updateShim({element.tag: element.text})

    def updateShim (self, aDict ):
        for key in aDict.keys():   # keys() includes tag and attributes
            if key in self:
                value = self.pop(key)
                if type(value) is not list:
                    listOfDicts = []
                    listOfDicts.append(value)
                    listOfDicts.append(aDict[key])
                    self.update({key: listOfDicts})
                else:
                    value.append(aDict[key])
                    self.update({key: value})
            else:
                self.update({key:aDict[key]})  # it was self.update(aDict)    

回答 7

def xml_to_dict(node):
    u''' 
    @param node:lxml_node
    @return: dict 
    '''

    return {'tag': node.tag, 'text': node.text, 'attrib': node.attrib, 'children': {child.tag: xml_to_dict(child) for child in node}}
def xml_to_dict(node):
    u''' 
    @param node:lxml_node
    @return: dict 
    '''

    return {'tag': node.tag, 'text': node.text, 'attrib': node.attrib, 'children': {child.tag: xml_to_dict(child) for child in node}}

回答 8

最容易使用的XML XML解析器是ElementTree(从2.5x开始,在标准库xml.etree.ElementTree中)。我认为没有什么可以完全满足您的要求。使用ElementTree编写某些内容来完成您想要的事情,这很简单,但是为什么要转换为字典,为什么不直接使用ElementTree。

The easiest to use XML parser for Python is ElementTree (as of 2.5x and above it is in the standard library xml.etree.ElementTree). I don’t think there is anything that does exactly what you want out of the box. It would be pretty trivial to write something to do what you want using ElementTree, but why convert to a dictionary, and why not just use ElementTree directly.


回答 9

来自http://code.activestate.com/recipes/410469-xml-as-dictionary/的代码效果很好,但是,如果在层次结构中的给定位置存在多个相同的元素,它将覆盖它们。

我在两者之间添加了一个垫片,以查看在self.update()之前该元素是否已经存在。如果是这样,则弹出现有条目并从现有条目和新条目中创建一个列表。随后的所有重复项都将添加到列表中。

不知道是否可以更妥善地处理此问题,但它的工作原理是:

import xml.etree.ElementTree as ElementTree

class XmlDictConfig(dict):
    def __init__(self, parent_element):
        if parent_element.items():
            self.updateShim(dict(parent_element.items()))
        for element in parent_element:
            if len(element):
                aDict = XmlDictConfig(element)
                if element.items():
                    aDict.updateShim(dict(element.items()))
                self.updateShim({element.tag: aDict})
            elif element.items():
                self.updateShim({element.tag: dict(element.items())})
            else:
                self.updateShim({element.tag: element.text.strip()})

    def updateShim (self, aDict ):
        for key in aDict.keys():
            if key in self:
                value = self.pop(key)
                if type(value) is not list:
                    listOfDicts = []
                    listOfDicts.append(value)
                    listOfDicts.append(aDict[key])
                    self.update({key: listOfDicts})

                else:
                    value.append(aDict[key])
                    self.update({key: value})
            else:
                self.update(aDict)

The code from http://code.activestate.com/recipes/410469-xml-as-dictionary/ works well, but if there are multiple elements that are the same at a given place in the hierarchy it just overrides them.

I added a shim between that looks to see if the element already exists before self.update(). If so, pops the existing entry and creates a lists out of the existing and the new. Any subsequent duplicates are added to the list.

Not sure if this can be handled more gracefully, but it works:

import xml.etree.ElementTree as ElementTree

class XmlDictConfig(dict):
    def __init__(self, parent_element):
        if parent_element.items():
            self.updateShim(dict(parent_element.items()))
        for element in parent_element:
            if len(element):
                aDict = XmlDictConfig(element)
                if element.items():
                    aDict.updateShim(dict(element.items()))
                self.updateShim({element.tag: aDict})
            elif element.items():
                self.updateShim({element.tag: dict(element.items())})
            else:
                self.updateShim({element.tag: element.text.strip()})

    def updateShim (self, aDict ):
        for key in aDict.keys():
            if key in self:
                value = self.pop(key)
                if type(value) is not list:
                    listOfDicts = []
                    listOfDicts.append(value)
                    listOfDicts.append(aDict[key])
                    self.update({key: listOfDicts})

                else:
                    value.append(aDict[key])
                    self.update({key: value})
            else:
                self.update(aDict)

回答 10

从@ K3 — rnc 响应(最适合我),我添加了一些小修改以从XML文本中获得OrderedDict(有时顺序很重要):

def etree_to_ordereddict(t):
d = OrderedDict()
d[t.tag] = OrderedDict() if t.attrib else None
children = list(t)
if children:
    dd = OrderedDict()
    for dc in map(etree_to_ordereddict, children):
        for k, v in dc.iteritems():
            if k not in dd:
                dd[k] = list()
            dd[k].append(v)
    d = OrderedDict()
    d[t.tag] = OrderedDict()
    for k, v in dd.iteritems():
        if len(v) == 1:
            d[t.tag][k] = v[0]
        else:
            d[t.tag][k] = v
if t.attrib:
    d[t.tag].update(('@' + k, v) for k, v in t.attrib.iteritems())
if t.text:
    text = t.text.strip()
    if children or t.attrib:
        if text:
            d[t.tag]['#text'] = text
    else:
        d[t.tag] = text
return d

在@ K3 — rnc示例中,可以使用它:

from xml.etree import cElementTree as ET
e = ET.XML('''
<root>
  <e />
  <e>text</e>
  <e name="value" />
  <e name="value">text</e>
  <e> <a>text</a> <b>text</b> </e>
  <e> <a>text</a> <a>text</a> </e>
  <e> text <a>text</a> </e>
</root>
''')

from pprint import pprint
pprint(etree_to_ordereddict(e))

希望能帮助到你 ;)

From @K3—rnc response (the best for me) I’ve added a small modifications to get an OrderedDict from an XML text (some times order matters):

def etree_to_ordereddict(t):
d = OrderedDict()
d[t.tag] = OrderedDict() if t.attrib else None
children = list(t)
if children:
    dd = OrderedDict()
    for dc in map(etree_to_ordereddict, children):
        for k, v in dc.iteritems():
            if k not in dd:
                dd[k] = list()
            dd[k].append(v)
    d = OrderedDict()
    d[t.tag] = OrderedDict()
    for k, v in dd.iteritems():
        if len(v) == 1:
            d[t.tag][k] = v[0]
        else:
            d[t.tag][k] = v
if t.attrib:
    d[t.tag].update(('@' + k, v) for k, v in t.attrib.iteritems())
if t.text:
    text = t.text.strip()
    if children or t.attrib:
        if text:
            d[t.tag]['#text'] = text
    else:
        d[t.tag] = text
return d

Following @K3—rnc example, you can use it:

from xml.etree import cElementTree as ET
e = ET.XML('''
<root>
  <e />
  <e>text</e>
  <e name="value" />
  <e name="value">text</e>
  <e> <a>text</a> <b>text</b> </e>
  <e> <a>text</a> <a>text</a> </e>
  <e> text <a>text</a> </e>
</root>
''')

from pprint import pprint
pprint(etree_to_ordereddict(e))

Hope it helps ;)


回答 11

这是ActiveState解决方案的链接-以及代码再次消失的代码。

==================================================
xmlreader.py:
==================================================
from xml.dom.minidom import parse


class NotTextNodeError:
    pass


def getTextFromNode(node):
    """
    scans through all children of node and gathers the
    text. if node has non-text child-nodes, then
    NotTextNodeError is raised.
    """
    t = ""
    for n in node.childNodes:
    if n.nodeType == n.TEXT_NODE:
        t += n.nodeValue
    else:
        raise NotTextNodeError
    return t


def nodeToDic(node):
    """
    nodeToDic() scans through the children of node and makes a
    dictionary from the content.
    three cases are differentiated:
    - if the node contains no other nodes, it is a text-node
    and {nodeName:text} is merged into the dictionary.
    - if the node has the attribute "method" set to "true",
    then it's children will be appended to a list and this
    list is merged to the dictionary in the form: {nodeName:list}.
    - else, nodeToDic() will call itself recursively on
    the nodes children (merging {nodeName:nodeToDic()} to
    the dictionary).
    """
    dic = {} 
    for n in node.childNodes:
    if n.nodeType != n.ELEMENT_NODE:
        continue
    if n.getAttribute("multiple") == "true":
        # node with multiple children:
        # put them in a list
        l = []
        for c in n.childNodes:
            if c.nodeType != n.ELEMENT_NODE:
            continue
        l.append(nodeToDic(c))
            dic.update({n.nodeName:l})
        continue

    try:
        text = getTextFromNode(n)
    except NotTextNodeError:
            # 'normal' node
            dic.update({n.nodeName:nodeToDic(n)})
            continue

        # text node
        dic.update({n.nodeName:text})
    continue
    return dic


def readConfig(filename):
    dom = parse(filename)
    return nodeToDic(dom)





def test():
    dic = readConfig("sample.xml")

    print dic["Config"]["Name"]
    print
    for item in dic["Config"]["Items"]:
    print "Item's Name:", item["Name"]
    print "Item's Value:", item["Value"]

test()



==================================================
sample.xml:
==================================================
<?xml version="1.0" encoding="UTF-8"?>

<Config>
    <Name>My Config File</Name>

    <Items multiple="true">
    <Item>
        <Name>First Item</Name>
        <Value>Value 1</Value>
    </Item>
    <Item>
        <Name>Second Item</Name>
        <Value>Value 2</Value>
    </Item>
    </Items>

</Config>



==================================================
output:
==================================================
My Config File

Item's Name: First Item
Item's Value: Value 1
Item's Name: Second Item
Item's Value: Value 2

Here’s a link to an ActiveState solution – and the code in case it disappears again.

==================================================
xmlreader.py:
==================================================
from xml.dom.minidom import parse


class NotTextNodeError:
    pass


def getTextFromNode(node):
    """
    scans through all children of node and gathers the
    text. if node has non-text child-nodes, then
    NotTextNodeError is raised.
    """
    t = ""
    for n in node.childNodes:
    if n.nodeType == n.TEXT_NODE:
        t += n.nodeValue
    else:
        raise NotTextNodeError
    return t


def nodeToDic(node):
    """
    nodeToDic() scans through the children of node and makes a
    dictionary from the content.
    three cases are differentiated:
    - if the node contains no other nodes, it is a text-node
    and {nodeName:text} is merged into the dictionary.
    - if the node has the attribute "method" set to "true",
    then it's children will be appended to a list and this
    list is merged to the dictionary in the form: {nodeName:list}.
    - else, nodeToDic() will call itself recursively on
    the nodes children (merging {nodeName:nodeToDic()} to
    the dictionary).
    """
    dic = {} 
    for n in node.childNodes:
    if n.nodeType != n.ELEMENT_NODE:
        continue
    if n.getAttribute("multiple") == "true":
        # node with multiple children:
        # put them in a list
        l = []
        for c in n.childNodes:
            if c.nodeType != n.ELEMENT_NODE:
            continue
        l.append(nodeToDic(c))
            dic.update({n.nodeName:l})
        continue

    try:
        text = getTextFromNode(n)
    except NotTextNodeError:
            # 'normal' node
            dic.update({n.nodeName:nodeToDic(n)})
            continue

        # text node
        dic.update({n.nodeName:text})
    continue
    return dic


def readConfig(filename):
    dom = parse(filename)
    return nodeToDic(dom)





def test():
    dic = readConfig("sample.xml")

    print dic["Config"]["Name"]
    print
    for item in dic["Config"]["Items"]:
    print "Item's Name:", item["Name"]
    print "Item's Value:", item["Value"]

test()



==================================================
sample.xml:
==================================================
<?xml version="1.0" encoding="UTF-8"?>

<Config>
    <Name>My Config File</Name>

    <Items multiple="true">
    <Item>
        <Name>First Item</Name>
        <Value>Value 1</Value>
    </Item>
    <Item>
        <Name>Second Item</Name>
        <Value>Value 2</Value>
    </Item>
    </Items>

</Config>



==================================================
output:
==================================================
My Config File

Item's Name: First Item
Item's Value: Value 1
Item's Name: Second Item
Item's Value: Value 2

回答 12

在某一时刻,我不得不解析和编写仅包含没有属性的元素的XML,因此从XML到dict的1:1映射很容易。如果别人也不需要属性,这就是我想出的:

def xmltodict(element):
    if not isinstance(element, ElementTree.Element):
        raise ValueError("must pass xml.etree.ElementTree.Element object")

    def xmltodict_handler(parent_element):
        result = dict()
        for element in parent_element:
            if len(element):
                obj = xmltodict_handler(element)
            else:
                obj = element.text

            if result.get(element.tag):
                if hasattr(result[element.tag], "append"):
                    result[element.tag].append(obj)
                else:
                    result[element.tag] = [result[element.tag], obj]
            else:
                result[element.tag] = obj
        return result

    return {element.tag: xmltodict_handler(element)}


def dicttoxml(element):
    if not isinstance(element, dict):
        raise ValueError("must pass dict type")
    if len(element) != 1:
        raise ValueError("dict must have exactly one root key")

    def dicttoxml_handler(result, key, value):
        if isinstance(value, list):
            for e in value:
                dicttoxml_handler(result, key, e)
        elif isinstance(value, basestring):
            elem = ElementTree.Element(key)
            elem.text = value
            result.append(elem)
        elif isinstance(value, int) or isinstance(value, float):
            elem = ElementTree.Element(key)
            elem.text = str(value)
            result.append(elem)
        elif value is None:
            result.append(ElementTree.Element(key))
        else:
            res = ElementTree.Element(key)
            for k, v in value.items():
                dicttoxml_handler(res, k, v)
            result.append(res)

    result = ElementTree.Element(element.keys()[0])
    for key, value in element[element.keys()[0]].items():
        dicttoxml_handler(result, key, value)
    return result

def xmlfiletodict(filename):
    return xmltodict(ElementTree.parse(filename).getroot())

def dicttoxmlfile(element, filename):
    ElementTree.ElementTree(dicttoxml(element)).write(filename)

def xmlstringtodict(xmlstring):
    return xmltodict(ElementTree.fromstring(xmlstring).getroot())

def dicttoxmlstring(element):
    return ElementTree.tostring(dicttoxml(element))

At one point I had to parse and write XML that only consisted of elements without attributes so a 1:1 mapping from XML to dict was possible easily. This is what I came up with in case someone else also doesnt need attributes:

def xmltodict(element):
    if not isinstance(element, ElementTree.Element):
        raise ValueError("must pass xml.etree.ElementTree.Element object")

    def xmltodict_handler(parent_element):
        result = dict()
        for element in parent_element:
            if len(element):
                obj = xmltodict_handler(element)
            else:
                obj = element.text

            if result.get(element.tag):
                if hasattr(result[element.tag], "append"):
                    result[element.tag].append(obj)
                else:
                    result[element.tag] = [result[element.tag], obj]
            else:
                result[element.tag] = obj
        return result

    return {element.tag: xmltodict_handler(element)}


def dicttoxml(element):
    if not isinstance(element, dict):
        raise ValueError("must pass dict type")
    if len(element) != 1:
        raise ValueError("dict must have exactly one root key")

    def dicttoxml_handler(result, key, value):
        if isinstance(value, list):
            for e in value:
                dicttoxml_handler(result, key, e)
        elif isinstance(value, basestring):
            elem = ElementTree.Element(key)
            elem.text = value
            result.append(elem)
        elif isinstance(value, int) or isinstance(value, float):
            elem = ElementTree.Element(key)
            elem.text = str(value)
            result.append(elem)
        elif value is None:
            result.append(ElementTree.Element(key))
        else:
            res = ElementTree.Element(key)
            for k, v in value.items():
                dicttoxml_handler(res, k, v)
            result.append(res)

    result = ElementTree.Element(element.keys()[0])
    for key, value in element[element.keys()[0]].items():
        dicttoxml_handler(result, key, value)
    return result

def xmlfiletodict(filename):
    return xmltodict(ElementTree.parse(filename).getroot())

def dicttoxmlfile(element, filename):
    ElementTree.ElementTree(dicttoxml(element)).write(filename)

def xmlstringtodict(xmlstring):
    return xmltodict(ElementTree.fromstring(xmlstring).getroot())

def dicttoxmlstring(element):
    return ElementTree.tostring(dicttoxml(element))

回答 13

@dibrovsd:如果xml具有多个具有相同名称的标签,则解决方案将不起作用

根据您的想法,我对代码进行了一些修改,并将其编写为常规节点而不是root用户:

from collections import defaultdict
def xml2dict(node):
    d, count = defaultdict(list), 1
    for i in node:
        d[i.tag + "_" + str(count)]['text'] = i.findtext('.')[0]
        d[i.tag + "_" + str(count)]['attrib'] = i.attrib # attrib gives the list
        d[i.tag + "_" + str(count)]['children'] = xml2dict(i) # it gives dict
     return d

@dibrovsd: Solution will not work if the xml have more than one tag with same name

On your line of thought, I have modified the code a bit and written it for general node instead of root:

from collections import defaultdict
def xml2dict(node):
    d, count = defaultdict(list), 1
    for i in node:
        d[i.tag + "_" + str(count)]['text'] = i.findtext('.')[0]
        d[i.tag + "_" + str(count)]['attrib'] = i.attrib # attrib gives the list
        d[i.tag + "_" + str(count)]['children'] = xml2dict(i) # it gives dict
     return d

回答 14

我修改了我的口味的答案之一,并使用同一标签处理多个值,例如考虑以下保存在XML.xml文件中的xml代码。

     <A>
        <B>
            <BB>inAB</BB>
            <C>
                <D>
                    <E>
                        inABCDE
                    </E>
                    <E>value2</E>
                    <E>value3</E>
                </D>
                <inCout-ofD>123</inCout-ofD>
            </C>
        </B>
        <B>abc</B>
        <F>F</F>
    </A>

和在python中

import xml.etree.ElementTree as ET




class XMLToDictionary(dict):
    def __init__(self, parentElement):
        self.parentElement = parentElement
        for child in list(parentElement):
            child.text = child.text if (child.text != None) else  ' '
            if len(child) == 0:
                self.update(self._addToDict(key= child.tag, value = child.text.strip(), dict = self))
            else:
                innerChild = XMLToDictionary(parentElement=child)
                self.update(self._addToDict(key=innerChild.parentElement.tag, value=innerChild, dict=self))

    def getDict(self):
        return {self.parentElement.tag: self}

    class _addToDict(dict):
        def __init__(self, key, value, dict):
            if not key in dict:
                self.update({key: value})
            else:
                identical = dict[key] if type(dict[key]) == list else [dict[key]]
                self.update({key: identical + [value]})


tree = ET.parse('./XML.xml')
root = tree.getroot()
parseredDict = XMLToDictionary(root).getDict()
print(parseredDict)

输出是

{'A': {'B': [{'BB': 'inAB', 'C': {'D': {'E': ['inABCDE', 'value2', 'value3']}, 'inCout-ofD': '123'}}, 'abc'], 'F': 'F'}}

I have modified one of the answers to my taste and to work with multiple values with the same tag for example consider the following xml code saved in XML.xml file

     <A>
        <B>
            <BB>inAB</BB>
            <C>
                <D>
                    <E>
                        inABCDE
                    </E>
                    <E>value2</E>
                    <E>value3</E>
                </D>
                <inCout-ofD>123</inCout-ofD>
            </C>
        </B>
        <B>abc</B>
        <F>F</F>
    </A>

and in python

import xml.etree.ElementTree as ET




class XMLToDictionary(dict):
    def __init__(self, parentElement):
        self.parentElement = parentElement
        for child in list(parentElement):
            child.text = child.text if (child.text != None) else  ' '
            if len(child) == 0:
                self.update(self._addToDict(key= child.tag, value = child.text.strip(), dict = self))
            else:
                innerChild = XMLToDictionary(parentElement=child)
                self.update(self._addToDict(key=innerChild.parentElement.tag, value=innerChild, dict=self))

    def getDict(self):
        return {self.parentElement.tag: self}

    class _addToDict(dict):
        def __init__(self, key, value, dict):
            if not key in dict:
                self.update({key: value})
            else:
                identical = dict[key] if type(dict[key]) == list else [dict[key]]
                self.update({key: identical + [value]})


tree = ET.parse('./XML.xml')
root = tree.getroot()
parseredDict = XMLToDictionary(root).getDict()
print(parseredDict)

the output is

{'A': {'B': [{'BB': 'inAB', 'C': {'D': {'E': ['inABCDE', 'value2', 'value3']}, 'inCout-ofD': '123'}}, 'abc'], 'F': 'F'}}

回答 15

我有一个递归方法,可从lxml元素获取字典

    def recursive_dict(element):
        return (element.tag.split('}')[1],
                dict(map(recursive_dict, element.getchildren()),
                     **element.attrib))

I have a recursive method to get a dictionary from a lxml element

    def recursive_dict(element):
        return (element.tag.split('}')[1],
                dict(map(recursive_dict, element.getchildren()),
                     **element.attrib))

如何覆盖和扩展基本的Django管理模板?

问题:如何覆盖和扩展基本的Django管理模板?

如何覆盖管理模板(例如admin / index.html),同时扩展它(请参阅https://docs.djangoproject.com/en/dev/ref/contrib/admin/#overriding-vs-replacing -an-admin-template)?

首先-我知道这个问题已经被问过并回答过(请参阅Django:覆盖和扩展应用程序模板),但是正如答案所言,如果您使用的是app_directories模板加载器(这是大多数时间)。

我当前的解决方法是制作副本并从中扩展,而不是直接从管理模板扩展。这很好用,但是确实很混乱,并且在管理模板更改时增加了额外的工作。

它可能会想到一些针对模板的自定义扩展标签,但如果已有解决方案,我不想重新发明轮子。

附带说明一下:有人知道Django本身是否可以解决此问题?

How do I override an admin template (e.g. admin/index.html) while at the same time extending it (see https://docs.djangoproject.com/en/dev/ref/contrib/admin/#overriding-vs-replacing-an-admin-template)?

First – I know that this question has been asked and answered before (see Django: Overriding AND extending an app template) but as the answer says it isn’t directly applicable if you’re using the app_directories template loader (which is most of the time).

My current workaround is to make copies and extend from them instead of extending directly from the admin templates. This works great but it’s really confusing and adds extra work when the admin templates change.

It could think of some custom extend-tag for the templates but I don’t want to reinvent the wheel if there already exists a solution.

On a side note: Does anybody know if this problem will be addressed by Django itself?


回答 0

更新

阅读适用于您的Django版本的文档。例如

https://docs.djangoproject.com/zh-CN/1.11/ref/contrib/admin/#admin-overriding-templates https://docs.djangoproject.com/en/2.0/ref/contrib/admin/#admin-overriding -模板

2011年的原始答案:

大约一年半以前,我遇到了同样的问题,并且在djangosnippets.org上找到了一个不错的模板加载器,可以轻松实现这一目的。它允许您在特定应用程序中扩展模板,从而使您能够创建自己的admin / index.html,从而从管理应用程序扩展admin / index.html模板。像这样:

{% extends "admin:admin/index.html" %}

{% block sidebar %}
    {{block.super}}
    <div>
        <h1>Extra links</h1>
        <a href="https://stackoverflow.com/admin/extra/">My extra link</a>
    </div>
{% endblock %}

我在我网站上的博客文章中提供了有关如何使用此模板加载器的完整示例。

Update:

Read the Docs for your version of Django. e.g.

https://docs.djangoproject.com/en/1.11/ref/contrib/admin/#admin-overriding-templates https://docs.djangoproject.com/en/2.0/ref/contrib/admin/#admin-overriding-templates https://docs.djangoproject.com/en/3.0/ref/contrib/admin/#admin-overriding-templates

Original answer from 2011:

I had the same issue about a year and a half ago and I found a nice template loader on djangosnippets.org that makes this easy. It allows you to extend a template in a specific app, giving you the ability to create your own admin/index.html that extends the admin/index.html template from the admin app. Like this:

{% extends "admin:admin/index.html" %}

{% block sidebar %}
    {{block.super}}
    <div>
        <h1>Extra links</h1>
        <a href="/admin/extra/">My extra link</a>
    </div>
{% endblock %}

I’ve given a full example on how to use this template loader in a blog post on my website.


回答 1

对于最新版本的Django 1.8,无需进行符号链接,将admin / templates复制到您的项目文件夹中,也无需按照上面的答案建议安装中间件。这是做什么的:

  1. 创建以下树结构(官方文档推荐)

    your_project
         |-- your_project/
         |-- myapp/
         |-- templates/
              |-- admin/
                  |-- myapp/
                      |-- change_form.html  <- do not misspell this

注意:此文件的位置并不重要。您可以将其放入您的应用程序中,并且仍然可以使用。只要它的位置可以被django发现。更重要的是,HTML文件的名称必须与django提供的原始HTML文件名相同。

  1. 将此模板路径添加到您的settings.py中

    TEMPLATES = [
        {
            'BACKEND': 'django.template.backends.django.DjangoTemplates',
            'DIRS': [os.path.join(BASE_DIR, 'templates')], # <- add this line
            'APP_DIRS': True,
            'OPTIONS': {
                'context_processors': [
                    'django.template.context_processors.debug',
                    'django.template.context_processors.request',
                    'django.contrib.auth.context_processors.auth',
                    'django.contrib.messages.context_processors.messages',
                ],
            },
        },
    ]
  2. 确定名称和要覆盖的块。这是通过查看django的admin / templates目录来完成的。我正在使用virtualenv,因此对我来说,路径在这里:

    ~/.virtualenvs/edge/lib/python2.7/site-packages/django/contrib/admin/templates/admin

在此示例中,我要修改添加新用户表单。该视图负责的模板是change_form.html。打开change_form.html并找到要扩展的{%block%}。

  1. 您的change_form.html中,编写如下内容:

    {% extends "admin/change_form.html" %}
    {% block field_sets %}
         {# your modification here #}
    {% endblock %}
  2. 加载您的页面,您应该看到更改

As for Django 1.8 being the current release, there is no need to symlink, copy the admin/templates to your project folder, or install middlewares as suggested by the answers above. Here is what to do:

  1. create the following tree structure(recommended by the official documentation)

    your_project
         |-- your_project/
         |-- myapp/
         |-- templates/
              |-- admin/
                  |-- myapp/
                      |-- change_form.html  <- do not misspell this
    

Note: The location of this file is not important. You can put it inside your app and it will still work. As long as its location can be discovered by django. What’s more important is the name of the HTML file has to be the same as the original HTML file name provided by django.

  1. Add this template path to your settings.py:

    TEMPLATES = [
        {
            'BACKEND': 'django.template.backends.django.DjangoTemplates',
            'DIRS': [os.path.join(BASE_DIR, 'templates')], # <- add this line
            'APP_DIRS': True,
            'OPTIONS': {
                'context_processors': [
                    'django.template.context_processors.debug',
                    'django.template.context_processors.request',
                    'django.contrib.auth.context_processors.auth',
                    'django.contrib.messages.context_processors.messages',
                ],
            },
        },
    ]
    
  2. Identify the name and block you want to override. This is done by looking into django’s admin/templates directory. I am using virtualenv, so for me, the path is here:

    ~/.virtualenvs/edge/lib/python2.7/site-packages/django/contrib/admin/templates/admin
    

In this example, I want to modify the add new user form. The template responsiblve for this view is change_form.html. Open up the change_form.html and find the {% block %} that you want to extend.

  1. In your change_form.html, write somethings like this:

    {% extends "admin/change_form.html" %}
    {% block field_sets %}
         {# your modification here #}
    {% endblock %}
    
  2. Load up your page and you should see the changes


回答 2

如果您需要覆盖admin/index.html,则可以设置的index_template参数AdminSite

例如

# urls.py
...
from django.contrib import admin

admin.site.index_template = 'admin/my_custom_index.html'
admin.autodiscover()

并将您的模板放在 <appname>/templates/admin/my_custom_index.html

if you need to overwrite the admin/index.html, you can set the index_template parameter of the AdminSite.

e.g.

# urls.py
...
from django.contrib import admin

admin.site.index_template = 'admin/my_custom_index.html'
admin.autodiscover()

and place your template in <appname>/templates/admin/my_custom_index.html


回答 3

django至少使用1.5,您可以定义要用于特定模板的模板modeladmin

参见https://docs.djangoproject.com/en/1.5/ref/contrib/admin/#custom-template-options

你可以做类似的事情

class Myadmin(admin.ModelAdmin):
    change_form_template = 'change_form.htm'

通过change_form.html扩展为简单的html模板admin/change_form.html(如果您想从头开始,则不这样做)

With django 1.5 (at least) you can define the template you want to use for a particular modeladmin

see https://docs.djangoproject.com/en/1.5/ref/contrib/admin/#custom-template-options

You can do something like

class Myadmin(admin.ModelAdmin):
    change_form_template = 'change_form.htm'

With change_form.html being a simple html template extending admin/change_form.html (or not if you want to do it from scratch)


回答 4

Chengs的答案是正确的,根据管理文档,并非所有的管理模板都可以通过这种方式覆盖:https ://docs.djangoproject.com/en/1.9/ref/contrib/admin/#overriding-admin-templates

可以针对每个应用或模型覆盖的模板

并非每个应用程序或每个模型都可以覆盖contrib / admin / templates / admin中的每个模板。以下可以:

app_index.html
change_form.html
change_list.html
delete_confirmation.html
object_history.html

对于无法以这种方式覆盖的模板,您仍然可以在整个项目中覆盖它们。只需新版本放在您的template / admin目录中即可。这对于创建自定义404和500页特别有用

我必须覆盖admin的login.html,因此必须将覆盖的模板放在此文件夹结构中:

your_project
 |-- your_project/
 |-- myapp/
 |-- templates/
      |-- admin/
          |-- login.html  <- do not misspell this

(没有admin中的myapp子文件夹)我没有足够的声誉来评论Cheng的帖子,这就是为什么我不得不将其编写为新答案。

Chengs’s answer is correct, howewer according to the admin docs not every admin template can be overwritten this way: https://docs.djangoproject.com/en/1.9/ref/contrib/admin/#overriding-admin-templates

Templates which may be overridden per app or model

Not every template in contrib/admin/templates/admin may be overridden per app or per model. The following can:

app_index.html
change_form.html
change_list.html
delete_confirmation.html
object_history.html

For those templates that cannot be overridden in this way, you may still override them for your entire project. Just place the new version in your templates/admin directory. This is particularly useful to create custom 404 and 500 pages

I had to overwrite the login.html of the admin and therefore had to put the overwritten template in this folder structure:

your_project
 |-- your_project/
 |-- myapp/
 |-- templates/
      |-- admin/
          |-- login.html  <- do not misspell this

(without the myapp subfolder in the admin) I do not have enough repution for commenting on Cheng’s post this is why I had to write this as new answer.


回答 5

最好的方法是将Django管理模板放入您的项目中。因此,您的模板将在其中,templates/admin而库存的Django管理模板将在其中template/django_admin。然后,您可以执行以下操作:

templates / admin / change_form.html

{% extends 'django_admin/change_form.html' %}

Your stuff here

如果您担心要使库存模板保持最新状态,则可以在svn外部组件或类似工具中包含它们。

The best way to do it is to put the Django admin templates inside your project. So your templates would be in templates/admin while the stock Django admin templates would be in say template/django_admin. Then, you can do something like the following:

templates/admin/change_form.html

{% extends 'django_admin/change_form.html' %}

Your stuff here

If you’re worried about keeping the stock templates up to date, you can include them with svn externals or similar.


回答 6

我找不到官方的Django文档中的单个答案或某个部分,而该部分没有覆盖/扩展默认管理模板所需的全部信息,因此,我正在将此答案作为完整指南编写,希望对您有所帮助为将来的其他人。

假设标准的Django项目结构为:

mysite-container/         # project container directory
    manage.py
    mysite/               # project package
        __init__.py
        admin.py
        apps.py
        settings.py
        urls.py
        wsgi.py
    app1/
    app2/
    ...
    static/
    templates/

这是您需要做的:

  1. 在中mysite/admin.py,创建以下子类AdminSite

    from django.contrib.admin import AdminSite
    
    
    class CustomAdminSite(AdminSite):
        # set values for `site_header`, `site_title`, `index_title` etc.
        site_header = 'Custom Admin Site'
        ...
    
        # extend / override admin views, such as `index()`
        def index(self, request, extra_context=None):
            extra_context = extra_context or {}
    
            # do whatever you want to do and save the values in `extra_context`
            extra_context['world'] = 'Earth'
    
            return super(CustomAdminSite, self).index(request, extra_context)
    
    
    custom_admin_site = CustomAdminSite()

    确保导入custom_admin_siteadmin.py你的应用程序并注册它的模型在您的自定义管理网站显示它们(如果你愿意的话)。

  2. 在中mysite/apps.py,创建的子类,AdminConfig并从上一步中将其设置default_siteadmin.CustomAdminSite

    from django.contrib.admin.apps import AdminConfig
    
    
    class CustomAdminConfig(AdminConfig):
        default_site = 'admin.CustomAdminSite'
  3. mysite/settings.py,替换django.admin.siteINSTALLED_APPSapps.CustomAdminConfig(来自之前的步骤自定义管理应用程序配置)。

  4. 在中mysite/urls.pyadmin.site.urls从管理URL 替换为custom_admin_site.urls

    from .admin import custom_admin_site
    
    
    urlpatterns = [
        ...
        path('admin/', custom_admin_site.urls),
        # for Django 1.x versions: url(r'^admin/', include(custom_admin_site.urls)),
        ...
    ]
  5. templates目录中创建要修改的模板,并保持docs中指定的默认Django管理模板目录结构。例如,如果要修改admin/index.html,请创建文件templates/admin/index.html

    所有现有模板都可以通过这种方式进行修改,并且它们的名称和结构可以在Django的源代码中找到

  6. 现在,您可以通过从头开始编写模板来覆盖模板,也可以对其进行扩展,然后覆盖/扩展特定的块。

    例如,如果您想将所有内容保持原样但要覆盖该content块(在索引页面上列出了您注册的应用及其模型),则将以下内容添加到中templates/admin/index.html

    {% extends 'admin/index.html' %}
    
    {% block content %}
      <h1>
        Hello, {{ world }}!
      </h1>
    {% endblock %}

    要保留块的原始内容,请{{ block.super }}在希望显示原始内容的任何位置添加:

    {% extends 'admin/index.html' %}
    
    {% block content %}
      <h1>
        Hello, {{ world }}!
      </h1>
      {{ block.super }}
    {% endblock %}

    您还可以通过修改extrastyleextrahead块来添加自定义样式和脚本。

I couldn’t find a single answer or a section in the official Django docs that had all the information I needed to override/extend the default admin templates, so I’m writing this answer as a complete guide, hoping that it would be helpful for others in the future.

Assuming the standard Django project structure:

mysite-container/         # project container directory
    manage.py
    mysite/               # project package
        __init__.py
        admin.py
        apps.py
        settings.py
        urls.py
        wsgi.py
    app1/
    app2/
    ...
    static/
    templates/

Here’s what you need to do:

  1. In mysite/admin.py, create a sub-class of AdminSite:

    from django.contrib.admin import AdminSite
    
    
    class CustomAdminSite(AdminSite):
        # set values for `site_header`, `site_title`, `index_title` etc.
        site_header = 'Custom Admin Site'
        ...
    
        # extend / override admin views, such as `index()`
        def index(self, request, extra_context=None):
            extra_context = extra_context or {}
    
            # do whatever you want to do and save the values in `extra_context`
            extra_context['world'] = 'Earth'
    
            return super(CustomAdminSite, self).index(request, extra_context)
    
    
    custom_admin_site = CustomAdminSite()
    

    Make sure to import custom_admin_site in the admin.py of your apps and register your models on it to display them on your customized admin site (if you want to).

  2. In mysite/apps.py, create a sub-class of AdminConfig and set default_site to admin.CustomAdminSite from the previous step:

    from django.contrib.admin.apps import AdminConfig
    
    
    class CustomAdminConfig(AdminConfig):
        default_site = 'admin.CustomAdminSite'
    
  3. In mysite/settings.py, replace django.admin.site in INSTALLED_APPS with apps.CustomAdminConfig (your custom admin app config from the previous step).

  4. In mysite/urls.py, replace admin.site.urls from the admin URL to custom_admin_site.urls

    from .admin import custom_admin_site
    
    
    urlpatterns = [
        ...
        path('admin/', custom_admin_site.urls),
        # for Django 1.x versions: url(r'^admin/', include(custom_admin_site.urls)),
        ...
    ]
    
  5. Create the template you want to modify in your templates directory, maintaining the default Django admin templates directory structure as specified in the docs. For example, if you were modifying admin/index.html, create the file templates/admin/index.html.

    All of the existing templates can be modified this way, and their names and structures can be found in Django’s source code.

  6. Now you can either override the template by writing it from scratch or extend it and then override/extend specific blocks.

    For example, if you wanted to keep everything as-is but wanted to override the content block (which on the index page lists the apps and their models that you registered), add the following to templates/admin/index.html:

    {% extends 'admin/index.html' %}
    
    {% block content %}
      <h1>
        Hello, {{ world }}!
      </h1>
    {% endblock %}
    

    To preserve the original contents of a block, add {{ block.super }} wherever you want the original contents to be displayed:

    {% extends 'admin/index.html' %}
    
    {% block content %}
      <h1>
        Hello, {{ world }}!
      </h1>
      {{ block.super }}
    {% endblock %}
    

    You can also add custom styles and scripts by modifying the extrastyle and extrahead blocks.


回答 7

我同意克里斯·普拉特(Chris Pratt)的观点。但我认为最好将符号链接创建到原始Django文件夹,并将其放置在管理模板中:

ln -s /usr/local/lib/python2.7/dist-packages/django/contrib/admin/templates/admin/ templates/django_admin

如您所见,它取决于python版本和Django的安装文件夹。因此,将来或在生产服务器上,您可能需要更改路径。

I agree with Chris Pratt. But I think it’s better to create the symlink to original Django folder where the admin templates place in:

ln -s /usr/local/lib/python2.7/dist-packages/django/contrib/admin/templates/admin/ templates/django_admin

and as you can see it depends on python version and the folder where the Django installed. So in future or on a production server you might need to change the path.


回答 8

这个站点有一个与我的Django 1.7配置一起使用的简单解决方案。

首先:在项目的template /目录中,创建一个名为admin_src的符号链接到已安装的Django模板。对我来说,使用virtualenv在Dreamhost上,我的“源” Django管理模板位于:

~/virtualenvs/mydomain/lib/python2.7/site-packages/django/contrib/admin/templates/admin

第二:在模板/中创建管理目录

所以我项目的template /目录现在看起来像这样:

/templates/
   admin
   admin_src -> [to django source]
   base.html
   index.html
   sitemap.xml
   etc...

第三:在新的template / admin /目录中,创建具有以下内容的base.html文件:

{% extends "admin_src/base.html" %}

{% block extrahead %}
<link rel='shortcut icon' href='{{ STATIC_URL }}img/favicon-admin.ico' />
{% endblock %}

第四:将您的admin favicon-admin.ico添加到您的静态根img文件夹中。

做完了 简单。

This site had a simple solution that worked with my Django 1.7 configuration.

FIRST: Make a symlink named admin_src in your project’s template/ directory to your installed Django templates. For me on Dreamhost using a virtualenv, my “source” Django admin templates were in:

~/virtualenvs/mydomain/lib/python2.7/site-packages/django/contrib/admin/templates/admin

SECOND: Create an admin directory in templates/

So my project’s template/ directory now looked like this:

/templates/
   admin
   admin_src -> [to django source]
   base.html
   index.html
   sitemap.xml
   etc...

THIRD: In your new template/admin/ directory create a base.html file with this content:

{% extends "admin_src/base.html" %}

{% block extrahead %}
<link rel='shortcut icon' href='{{ STATIC_URL }}img/favicon-admin.ico' />
{% endblock %}

FOURTH: Add your admin favicon-admin.ico into your static root img folder.

Done. Easy.


回答 9

对于应用程序索引,将此行添加到某个常见的py文件(例如url.py)中

admin.site.index_template = 'admin/custom_index.html'

对于应用程序模块索引:将此行添加到admin.py

admin.AdminSite.app_index_template = "servers/servers-home.html"

对于更改列表:将此行添加到admin类:

change_list_template = "servers/servers_changelist.html"

对于应用程序模块表单模板:将此行添加到您的管理类中

change_form_template = "servers/server_changeform.html"

等等,并在同一管理员的模块类中查找其他

for app index add this line to somewhere common py file like url.py

admin.site.index_template = 'admin/custom_index.html'

for app module index : add this line to admin.py

admin.AdminSite.app_index_template = "servers/servers-home.html"

for change list : add this line to admin class:

change_list_template = "servers/servers_changelist.html"

for app module form template : add this line to your admin class

change_form_template = "servers/server_changeform.html"

etc. and find other in same admin’s module classes


回答 10

您可以使用django-overextends,它为Django提供了循环模板继承。

它来自Mezzanine CMS,Stephen从中将其提取到独立的Django扩展中。

您可以在夹层文档中的“覆盖与扩展模板”(http:/mezzanine.jupo.org/docs/content-architecture.html#overriding-vs-extending-templates)中找到更多信息。

有关更深入的信息,请参见Stephens Blog“ Django的循环模板继承”(http:/blog.jupo.org/2012/05/17/circular-template-inheritance-for-django)。

在Google网上论坛中,讨论(https://groups.google.com/forum/#!topic/mezzanine-users/sUydcf_IZkQ)开始了该功能的开发。

注意:

我没有添加两个以上链接的声誉。但是我认为这些链接提供了有趣的背景信息。因此,我在“ http(s):”之后留了一个斜线。也许信誉较高的人可以修复链接并删除此注释。

You can use django-overextends, which provides circular template inheritance for Django.

It comes from the Mezzanine CMS, from where Stephen extracted it into a standalone Django extension.

More infos you find in “Overriding vs Extending Templates” (http:/mezzanine.jupo.org/docs/content-architecture.html#overriding-vs-extending-templates) inside the Mezzanine docs.

For deeper insides look at Stephens Blog “Circular Template Inheritance for Django” (http:/blog.jupo.org/2012/05/17/circular-template-inheritance-for-django).

And in Google Groups the discussion (https:/groups.google.com/forum/#!topic/mezzanine-users/sUydcf_IZkQ) which started the development of this feature.

Note:

I don’t have the reputation to add more than 2 links. But I think the links provide interesting background information. So I just left out a slash after “http(s):”. Maybe someone with better reputation can repair the links and remove this note.


在Python中将参数添加到给定的URL

问题:在Python中将参数添加到给定的URL

假设给了我一个URL。
它可能已经具有GET参数(例如http://example.com/search?q=question),也可能没有(例如http://example.com/)。

现在我需要为其添加一些参数{'lang':'en','tag':'python'}。在第一种情况下,我将拥有,http://example.com/search?q=question&lang=en&tag=python而在第二种情况下- http://example.com/search?lang=en&tag=python

有什么标准的方法可以做到这一点吗?

Suppose I was given a URL.
It might already have GET parameters (e.g. http://example.com/search?q=question) or it might not (e.g. http://example.com/).

And now I need to add some parameters to it like {'lang':'en','tag':'python'}. In the first case I’m going to have http://example.com/search?q=question&lang=en&tag=python and in the second — http://example.com/search?lang=en&tag=python.

Is there any standard way to do this?


回答 0

urlliburlparse模块有几个怪癖。这是一个工作示例:

try:
    import urlparse
    from urllib import urlencode
except: # For Python 3
    import urllib.parse as urlparse
    from urllib.parse import urlencode

url = "http://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}

url_parts = list(urlparse.urlparse(url))
query = dict(urlparse.parse_qsl(url_parts[4]))
query.update(params)

url_parts[4] = urlencode(query)

print(urlparse.urlunparse(url_parts))

ParseResult,结果urlparse()是只读的,我们需要把它转换成list之前,我们可以尝试修改其数据。

There are a couple of quirks with the urllib and urlparse modules. Here’s a working example:

try:
    import urlparse
    from urllib import urlencode
except: # For Python 3
    import urllib.parse as urlparse
    from urllib.parse import urlencode

url = "http://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}

url_parts = list(urlparse.urlparse(url))
query = dict(urlparse.parse_qsl(url_parts[4]))
query.update(params)

url_parts[4] = urlencode(query)

print(urlparse.urlunparse(url_parts))

ParseResult, the result of urlparse(), is read-only and we need to convert it to a list before we can attempt to modify its data.


回答 1

为什么

我对本页上的所有解决方案都不满意(请问,我们最喜欢的复制粘贴内容在哪里?),所以我根据此处的答案写了自己的解决方案。它试图变得完整和更加Pythonic。我为参数中的dictbool值添加了一个处理程序,以使其对消费者端(JS)更友好,但是它们仍然是可选的,您可以将其删除。

这个怎么运作

测试1:添加新参数,处理数组和布尔值:

url = 'http://stackoverflow.com/test'
new_params = {'answers': False, 'data': ['some','values']}

add_url_params(url, new_params) == \
    'http://stackoverflow.com/test?data=some&data=values&answers=false'

测试2:重写现有的参数,处理DICT值:

url = 'http://stackoverflow.com/test/?question=false'
new_params = {'question': {'__X__':'__Y__'}}

add_url_params(url, new_params) == \
    'http://stackoverflow.com/test/?question=%7B%22__X__%22%3A+%22__Y__%22%7D'

谈话很便宜。给我看代码。

代码本身。我试图详细描述它:

from json import dumps

try:
    from urllib import urlencode, unquote
    from urlparse import urlparse, parse_qsl, ParseResult
except ImportError:
    # Python 3 fallback
    from urllib.parse import (
        urlencode, unquote, urlparse, parse_qsl, ParseResult
    )


def add_url_params(url, params):
    """ Add GET params to provided URL being aware of existing.

    :param url: string of target URL
    :param params: dict containing requested params to be added
    :return: string with updated URL

    >> url = 'http://stackoverflow.com/test?answers=true'
    >> new_params = {'answers': False, 'data': ['some','values']}
    >> add_url_params(url, new_params)
    'http://stackoverflow.com/test?data=some&data=values&answers=false'
    """
    # Unquoting URL first so we don't loose existing args
    url = unquote(url)
    # Extracting url info
    parsed_url = urlparse(url)
    # Extracting URL arguments from parsed URL
    get_args = parsed_url.query
    # Converting URL arguments to dict
    parsed_get_args = dict(parse_qsl(get_args))
    # Merging URL arguments dict with new params
    parsed_get_args.update(params)

    # Bool and Dict values should be converted to json-friendly values
    # you may throw this part away if you don't like it :)
    parsed_get_args.update(
        {k: dumps(v) for k, v in parsed_get_args.items()
         if isinstance(v, (bool, dict))}
    )

    # Converting URL argument to proper query string
    encoded_get_args = urlencode(parsed_get_args, doseq=True)
    # Creating new parsed result object based on provided with new
    # URL arguments. Same thing happens inside of urlparse.
    new_url = ParseResult(
        parsed_url.scheme, parsed_url.netloc, parsed_url.path,
        parsed_url.params, encoded_get_args, parsed_url.fragment
    ).geturl()

    return new_url

请注意,可能会有一些问题,如果您发现一个问题,请告诉我,我们会做的更好

Why

I’ve been not satisfied with all the solutions on this page (come on, where is our favorite copy-paste thing?) so I wrote my own based on answers here. It tries to be complete and more Pythonic. I’ve added a handler for dict and bool values in arguments to be more consumer-side (JS) friendly, but they are yet optional, you can drop them.

How it works

Test 1: Adding new arguments, handling Arrays and Bool values:

url = 'http://stackoverflow.com/test'
new_params = {'answers': False, 'data': ['some','values']}

add_url_params(url, new_params) == \
    'http://stackoverflow.com/test?data=some&data=values&answers=false'

Test 2: Rewriting existing args, handling DICT values:

url = 'http://stackoverflow.com/test/?question=false'
new_params = {'question': {'__X__':'__Y__'}}

add_url_params(url, new_params) == \
    'http://stackoverflow.com/test/?question=%7B%22__X__%22%3A+%22__Y__%22%7D'

Talk is cheap. Show me the code.

Code itself. I’ve tried to describe it in details:

from json import dumps

try:
    from urllib import urlencode, unquote
    from urlparse import urlparse, parse_qsl, ParseResult
except ImportError:
    # Python 3 fallback
    from urllib.parse import (
        urlencode, unquote, urlparse, parse_qsl, ParseResult
    )


def add_url_params(url, params):
    """ Add GET params to provided URL being aware of existing.

    :param url: string of target URL
    :param params: dict containing requested params to be added
    :return: string with updated URL

    >> url = 'http://stackoverflow.com/test?answers=true'
    >> new_params = {'answers': False, 'data': ['some','values']}
    >> add_url_params(url, new_params)
    'http://stackoverflow.com/test?data=some&data=values&answers=false'
    """
    # Unquoting URL first so we don't loose existing args
    url = unquote(url)
    # Extracting url info
    parsed_url = urlparse(url)
    # Extracting URL arguments from parsed URL
    get_args = parsed_url.query
    # Converting URL arguments to dict
    parsed_get_args = dict(parse_qsl(get_args))
    # Merging URL arguments dict with new params
    parsed_get_args.update(params)

    # Bool and Dict values should be converted to json-friendly values
    # you may throw this part away if you don't like it :)
    parsed_get_args.update(
        {k: dumps(v) for k, v in parsed_get_args.items()
         if isinstance(v, (bool, dict))}
    )

    # Converting URL argument to proper query string
    encoded_get_args = urlencode(parsed_get_args, doseq=True)
    # Creating new parsed result object based on provided with new
    # URL arguments. Same thing happens inside of urlparse.
    new_url = ParseResult(
        parsed_url.scheme, parsed_url.netloc, parsed_url.path,
        parsed_url.params, encoded_get_args, parsed_url.fragment
    ).geturl()

    return new_url

Please be aware that there may be some issues, if you’ll find one please let me know and we will make this thing better


回答 2

如果字符串可以具有任意数据(例如,需要对与号,斜线等字符进行编码),则要使用URL编码。

查看urllib.urlencode:

>>> import urllib
>>> urllib.urlencode({'lang':'en','tag':'python'})
'lang=en&tag=python'

在python3中:

from urllib import parse
parse.urlencode({'lang':'en','tag':'python'})

You want to use URL encoding if the strings can have arbitrary data (for example, characters such as ampersands, slashes, etc. will need to be encoded).

Check out urllib.urlencode:

>>> import urllib
>>> urllib.urlencode({'lang':'en','tag':'python'})
'lang=en&tag=python'

In python3:

from urllib import parse
parse.urlencode({'lang':'en','tag':'python'})

回答 3

您还可以使用furl模块https://github.com/gruns/furl

>>> from furl import furl
>>> print furl('http://example.com/search?q=question').add({'lang':'en','tag':'python'}).url
http://example.com/search?q=question&lang=en&tag=python

You can also use the furl module https://github.com/gruns/furl

>>> from furl import furl
>>> print furl('http://example.com/search?q=question').add({'lang':'en','tag':'python'}).url
http://example.com/search?q=question&lang=en&tag=python

回答 4

将其外包给经过战斗测试的请求库

这就是我要做的:

from requests.models import PreparedRequest
url = 'http://example.com/search?q=question'
params = {'lang':'en','tag':'python'}
req = PreparedRequest()
req.prepare_url(url, params)
print(req.url)

Outsource it to the battle tested requests library.

This is how I will do it:

from requests.models import PreparedRequest
url = 'http://example.com/search?q=question'
params = {'lang':'en','tag':'python'}
req = PreparedRequest()
req.prepare_url(url, params)
print(req.url)

回答 5

如果您使用请求lib

import requests
...
params = {'tag': 'python'}
requests.get(url, params=params)

If you are using the requests lib:

import requests
...
params = {'tag': 'python'}
requests.get(url, params=params)

回答 6

是的:使用urllib

从文档中的示例中:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.geturl() # Prints the final URL with parameters.
>>> print f.read() # Prints the contents

Yes: use urllib.

From the examples in the documentation:

>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.geturl() # Prints the final URL with parameters.
>>> print f.read() # Prints the contents

回答 7

基于这个答案,简单案例的一线式(Python 3代码):

from urllib.parse import urlparse, urlencode


url = "https://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}

url += ('&' if urlparse(url).query else '?') + urlencode(params)

要么:

url += ('&', '?')[urlparse(url).query == ''] + urlencode(params)

Based on this answer, one-liner for simple cases (Python 3 code):

from urllib.parse import urlparse, urlencode


url = "https://stackoverflow.com/search?q=question"
params = {'lang':'en','tag':'python'}

url += ('&' if urlparse(url).query else '?') + urlencode(params)

or:

url += ('&', '?')[urlparse(url).query == ''] + urlencode(params)

回答 8

我发现这比两个最重要的答案更为优雅:

from urllib.parse import urlencode, urlparse, parse_qs

def merge_url_query_params(url: str, additional_params: dict) -> str:
    url_components = urlparse(url)
    original_params = parse_qs(url_components.query)
    # Before Python 3.5 you could update original_params with 
    # additional_params, but here all the variables are immutable.
    merged_params = {**original_params, **additional_params}
    updated_query = urlencode(merged_params, doseq=True)
    # _replace() is how you can create a new NamedTuple with a changed field
    return url_components._replace(query=updated_query).geturl()

assert merge_url_query_params(
    'http://example.com/search?q=question',
    {'lang':'en','tag':'python'},
) == 'http://example.com/search?q=question&lang=en&tag=python'

我在最重要的答案中不喜欢的最重要的事情(尽管如此,它们还是不错的):

  • Łukasz:必须记住queryURL组件中的索引
  • Sapphire64:创建更新版本的非常冗长的方法 ParseResult

我的响应不好的是dict使用了拆包的神奇合并,但是由于我对可变性的偏见,我更喜欢更新现有字典。

I find this more elegant than the two top answers:

from urllib.parse import urlencode, urlparse, parse_qs

def merge_url_query_params(url: str, additional_params: dict) -> str:
    url_components = urlparse(url)
    original_params = parse_qs(url_components.query)
    # Before Python 3.5 you could update original_params with 
    # additional_params, but here all the variables are immutable.
    merged_params = {**original_params, **additional_params}
    updated_query = urlencode(merged_params, doseq=True)
    # _replace() is how you can create a new NamedTuple with a changed field
    return url_components._replace(query=updated_query).geturl()

assert merge_url_query_params(
    'http://example.com/search?q=question',
    {'lang':'en','tag':'python'},
) == 'http://example.com/search?q=question&lang=en&tag=python'

The most important things I dislike in the top answers (they are nevertheless good):

  • Łukasz: having to remember the index at which the query is in the URL components
  • Sapphire64: the very verbose way of creating the updated ParseResult

What’s bad about my response is the magically looking dict merge using unpacking, but I prefer that to updating an already existing dictionary because of my prejudice against mutability.


回答 9

我喜欢Łukasz版本,但是由于在这种情况下使用urllib和urllparse函数有些尴尬,因此我认为执行以下操作更简单:

params = urllib.urlencode(params)

if urlparse.urlparse(url)[4]:
    print url + '&' + params
else:
    print url + '?' + params

I liked Łukasz version, but since urllib and urllparse functions are somewhat awkward to use in this case, I think it’s more straightforward to do something like this:

params = urllib.urlencode(params)

if urlparse.urlparse(url)[4]:
    print url + '&' + params
else:
    print url + '?' + params

回答 10

使用各种urlparse功能urllib.urlencode()将组合字典上的现有URL拆开,然后urlparse.urlunparse()将其重新组合在一起。

或只取结果urllib.urlencode()并将其适当地连接到URL。

Use the various urlparse functions to tear apart the existing URL, urllib.urlencode() on the combined dictionary, then urlparse.urlunparse() to put it all back together again.

Or just take the result of urllib.urlencode() and concatenate it to the URL appropriately.


回答 11

还有一个答案:

def addGetParameters(url, newParams):
    (scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
    queryList = urlparse.parse_qsl(query, keep_blank_values=True)
    for key in newParams:
        queryList.append((key, newParams[key]))
    return urlparse.urlunparse((scheme, netloc, path, params, urllib.urlencode(queryList), fragment))

Yet another answer:

def addGetParameters(url, newParams):
    (scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
    queryList = urlparse.parse_qsl(query, keep_blank_values=True)
    for key in newParams:
        queryList.append((key, newParams[key]))
    return urlparse.urlunparse((scheme, netloc, path, params, urllib.urlencode(queryList), fragment))

回答 12

这是我的实现方法。

import urllib

params = urllib.urlencode({'lang':'en','tag':'python'})
url = ''
if request.GET:
   url = request.url + '&' + params
else:
   url = request.url + '?' + params    

像魅力一样工作。但是,我希望有一种更清洁的方法来实现此目的。

实现上述内容的另一种方法是将其放入方法中。

import urllib

def add_url_param(request, **params):
   new_url = ''
   _params = dict(**params)
   _params = urllib.urlencode(_params)

   if _params:
      if request.GET:
         new_url = request.url + '&' + _params
      else:
         new_url = request.url + '?' + _params
   else:
      new_url = request.url

   return new_ur

Here is how I implemented it.

import urllib

params = urllib.urlencode({'lang':'en','tag':'python'})
url = ''
if request.GET:
   url = request.url + '&' + params
else:
   url = request.url + '?' + params    

Worked like a charm. However, I would have liked a more cleaner way to implement this.

Another way of implementing the above is put it in a method.

import urllib

def add_url_param(request, **params):
   new_url = ''
   _params = dict(**params)
   _params = urllib.urlencode(_params)

   if _params:
      if request.GET:
         new_url = request.url + '&' + _params
      else:
         new_url = request.url + '?' + _params
   else:
      new_url = request.url

   return new_ur

回答 13

在python 2.5中

import cgi
import urllib
import urlparse

def add_url_param(url, **params):
    n=3
    parts = list(urlparse.urlsplit(url))
    d = dict(cgi.parse_qsl(parts[n])) # use cgi.parse_qs for list values
    d.update(params)
    parts[n]=urllib.urlencode(d)
    return urlparse.urlunsplit(parts)

url = "http://stackoverflow.com/search?q=question"
add_url_param(url, lang='en') == "http://stackoverflow.com/search?q=question&lang=en"

In python 2.5

import cgi
import urllib
import urlparse

def add_url_param(url, **params):
    n=3
    parts = list(urlparse.urlsplit(url))
    d = dict(cgi.parse_qsl(parts[n])) # use cgi.parse_qs for list values
    d.update(params)
    parts[n]=urllib.urlencode(d)
    return urlparse.urlunsplit(parts)

url = "http://stackoverflow.com/search?q=question"
add_url_param(url, lang='en') == "http://stackoverflow.com/search?q=question&lang=en"

如何初始化基类(超级)?

问题:如何初始化基类(超级)?

在Python中,请考虑以下代码:

>>> class SuperClass(object):
    def __init__(self, x):
        self.x = x

>>> class SubClass(SuperClass):
    def __init__(self, y):
        self.y = y
        # how do I initialize the SuperClass __init__ here?

如何SuperClass __init__在子类中初始化?我正在关注Python教程,但并未涵盖该内容。当我在Google上搜索时,发现了不止一种方法。处理此问题的标准方法是什么?

In Python, consider I have the following code:

>>> class SuperClass(object):
    def __init__(self, x):
        self.x = x

>>> class SubClass(SuperClass):
    def __init__(self, y):
        self.y = y
        # how do I initialize the SuperClass __init__ here?

How do I initialize the SuperClass __init__ in the subclass? I am following the Python tutorial and it doesn’t cover that. When I searched on Google, I found more than one way of doing. What is the standard way of handling this?


回答 0

Python(直到版本3)支持“旧式”和新式类。新样式类派生自object您使用的类,并通过调用它们的基类super(),例如

class X(object):
  def __init__(self, x):
    pass

  def doit(self, bar):
    pass

class Y(X):
  def __init__(self):
    super(Y, self).__init__(123)

  def doit(self, foo):
    return super(Y, self).doit(foo)

因为python知道旧样式和新样式的类,所以有不同的方法可以调用基本方法,这就是为什么您找到了多种方法的原因。

为了完整起见,老式类使用基类显式调用基方法,即

def doit(self, foo):
  return X.doit(self, foo)

但是由于您不应该再使用旧样式,因此我不会对此太在意。

Python 3只知道新型类(无论您是否派生自新object)。

Python (until version 3) supports “old-style” and new-style classes. New-style classes are derived from object and are what you are using, and invoke their base class through super(), e.g.

class X(object):
  def __init__(self, x):
    pass

  def doit(self, bar):
    pass

class Y(X):
  def __init__(self):
    super(Y, self).__init__(123)

  def doit(self, foo):
    return super(Y, self).doit(foo)

Because python knows about old- and new-style classes, there are different ways to invoke a base method, which is why you’ve found multiple ways of doing so.

For completeness sake, old-style classes call base methods explicitly using the base class, i.e.

def doit(self, foo):
  return X.doit(self, foo)

But since you shouldn’t be using old-style anymore, I wouldn’t care about this too much.

Python 3 only knows about new-style classes (no matter if you derive from object or not).


回答 1

SuperClass.__init__(self, x)

要么

super(SubClass,self).__init__( x )

会起作用(我更喜欢第二个,因为它更加遵守DRY原则)。

参见此处:http : //docs.python.org/reference/datamodel.html#basic-customization

Both

SuperClass.__init__(self, x)

or

super(SubClass,self).__init__( x )

will work (I prefer the 2nd one, as it adheres more to the DRY principle).

See here: http://docs.python.org/reference/datamodel.html#basic-customization


回答 2

从python 3.5.2开始,您可以使用:

class C(B):
def method(self, arg):
    super().method(arg)    # This does the same thing as:
                           # super(C, self).method(arg)

https://docs.python.org/3/library/functions.html#super

As of python 3.5.2, you can use:

class C(B):
def method(self, arg):
    super().method(arg)    # This does the same thing as:
                           # super(C, self).method(arg)

https://docs.python.org/3/library/functions.html#super


回答 3

如何初始化基类(超级)?

class SuperClass(object):
    def __init__(self, x):
        self.x = x

class SubClass(SuperClass):
    def __init__(self, y):
        self.y = y

使用一个super对象来确保您以方法解析顺序获取下一个方法(作为绑定方法)。在Python 2中,您需要传递类名并self传递super以查找绑定的__init__方法:

 class SubClass(SuperClass):
      def __init__(self, y):
          super(SubClass, self).__init__('x')
          self.y = y

在Python 3中,有一点魔术使参数变得super不必要了-附带的好处是它的运行速度更快:

 class SubClass(SuperClass):
      def __init__(self, y):
          super().__init__('x')
          self.y = y

像下面这样对父级进行硬编码可防止您使用协作式多重继承:

 class SubClass(SuperClass):
      def __init__(self, y):
          SuperClass.__init__(self, 'x') # don't do this
          self.y = y

请注意,它__init__可能仅返回None -它旨在就地修改对象。

东西 __new__

还有另一种初始化实例的方法-这是Python中不可变类型的子类的唯一方法。所以,如果你想子类它需要strtuple或其他不可变对象。

您可能会认为这是一个类方法,因为它获得了隐式类参数。但这实际上是一种静态方法。所以,你需要打电话__new__cls明确。

我们通常从返回实例__new__,因此,您也需要在基类中__new__通过调用super基类的。因此,如果您同时使用两种方法:

class SuperClass(object):
    def __new__(cls, x):
        return super(SuperClass, cls).__new__(cls)
    def __init__(self, x):
        self.x = x

class SubClass(object):
    def __new__(cls, y):
        return super(SubClass, cls).__new__(cls)

    def __init__(self, y):
        self.y = y
        super(SubClass, self).__init__('x')

Python 3回避了由于__new__是静态方法而导致的超级调用的怪异之处,但是您仍然需要传递cls给非绑定__new__方法:

class SuperClass(object):
    def __new__(cls, x):
        return super().__new__(cls)
    def __init__(self, x):
        self.x = x

class SubClass(object):
    def __new__(cls, y):
        return super().__new__(cls)
    def __init__(self, y):
        self.y = y
        super().__init__('x')

How do I initialize the base (super) class?

class SuperClass(object):
    def __init__(self, x):
        self.x = x

class SubClass(SuperClass):
    def __init__(self, y):
        self.y = y

Use a super object to ensure you get the next method (as a bound method) in the method resolution order. In Python 2, you need to pass the class name and self to super to lookup the bound __init__ method:

 class SubClass(SuperClass):
      def __init__(self, y):
          super(SubClass, self).__init__('x')
          self.y = y

In Python 3, there’s a little magic that makes the arguments to super unnecessary – and as a side benefit it works a little faster:

 class SubClass(SuperClass):
      def __init__(self, y):
          super().__init__('x')
          self.y = y

Hardcoding the parent like this below prevents you from using cooperative multiple inheritance:

 class SubClass(SuperClass):
      def __init__(self, y):
          SuperClass.__init__(self, 'x') # don't do this
          self.y = y

Note that __init__ may only return None – it is intended to modify the object in-place.

Something __new__

There’s another way to initialize instances – and it’s the only way for subclasses of immutable types in Python. So it’s required if you want to subclass str or tuple or another immutable object.

You might think it’s a classmethod because it gets an implicit class argument. But it’s actually a staticmethod. So you need to call __new__ with cls explicitly.

We usually return the instance from __new__, so if you do, you also need to call your base’s __new__ via super as well in your base class. So if you use both methods:

class SuperClass(object):
    def __new__(cls, x):
        return super(SuperClass, cls).__new__(cls)
    def __init__(self, x):
        self.x = x

class SubClass(object):
    def __new__(cls, y):
        return super(SubClass, cls).__new__(cls)

    def __init__(self, y):
        self.y = y
        super(SubClass, self).__init__('x')

Python 3 sidesteps a little of the weirdness of the super calls caused by __new__ being a static method, but you still need to pass cls to the non-bound __new__ method:

class SuperClass(object):
    def __new__(cls, x):
        return super().__new__(cls)
    def __init__(self, x):
        self.x = x

class SubClass(object):
    def __new__(cls, y):
        return super().__new__(cls)
    def __init__(self, y):
        self.y = y
        super().__init__('x')