Python: find elements in one list that are not in the other [duplicate]

Question: Python: find elements in one list that are not in the other [duplicate]


I need to compare two lists in order to create a new list of specific elements found in one list but not in the other. For example:

main_list=[]
list_1=["a", "b", "c", "d", "e"]
list_2=["a", "f", "c", "m"] 

I want to loop through list_1 and append to main_list all the elements from list_2 that are not found in list_1.

The result should be:

main_list=["f", "m"]

How can I do it with python?


Answer 0


TL;DR:
SOLUTION (1)

import numpy as np
main_list = np.setdiff1d(list_2,list_1)
# yields the elements in `list_2` that are NOT in `list_1`

SOLUTION (2) You want a sorted list

def setdiff_sorted(array1,array2,assume_unique=False):
    ans = np.setdiff1d(array1,array2,assume_unique).tolist()
    if assume_unique:
        return sorted(ans)
    return ans
main_list = setdiff_sorted(list_2,list_1)




EXPLANATIONS:
(1) You can use NumPy’s setdiff1d (array1,array2,assume_unique=False).

assume_unique asks the user IF the arrays ARE ALREADY UNIQUE.
If False, then the unique elements are determined first.
If True, the function will assume that the elements are already unique AND function will skip determining the unique elements.

This yields the unique values in array1 that are not in array2. assume_unique is False by default.

If you are concerned with the unique elements (based on the response of Chinny84), then simply use (where assume_unique=False => the default value):

import numpy as np
list_1 = ["a", "b", "c", "d", "e"]
list_2 = ["a", "f", "c", "m"] 
main_list = np.setdiff1d(list_2,list_1)
# yields the elements in `list_2` that are NOT in `list_1`


(2) For those who want answers to be sorted, I’ve made a custom function:

import numpy as np
def setdiff_sorted(array1,array2,assume_unique=False):
    ans = np.setdiff1d(array1,array2,assume_unique).tolist()
    if assume_unique:
        return sorted(ans)
    return ans

To get the answer, run:

main_list = setdiff_sorted(list_2,list_1)

SIDE NOTES:
(a) Solution 2 (custom function setdiff_sorted) returns a list (compared to an array in solution 1).

(b) If you aren’t sure if the elements are unique, just use the default setting of NumPy’s setdiff1d in both solutions (1) and (2). What can be an example of a complication? See note (c).

(c) Things will be different if either of the two lists is not unique.
Say list_2 is not unique: list_2 = ["a", "f", "c", "m", "m"]. Keep list_1 as is: list_1 = ["a", "b", "c", "d", "e"].
Keeping assume_unique at its default value yields ["f", "m"] (in both solutions). HOWEVER, if you set assume_unique=True, both solutions give ["f", "m", "m"]. Why? Because the user ASSUMED that the elements are unique. Hence, IT IS BETTER TO KEEP assume_unique at its default value. Note that both answers are sorted.
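
As a quick demonstration of side note (c), here is a small runnable sketch against the lists above (expected results shown as comments):

import numpy as np

list_1 = ["a", "b", "c", "d", "e"]
list_2 = ["a", "f", "c", "m", "m"]   # list_2 is deliberately not unique

print(np.setdiff1d(list_2, list_1))                      # ['f' 'm'] -- deduplicated and sorted
print(np.setdiff1d(list_2, list_1, assume_unique=True))  # ['f' 'm' 'm'] -- duplicates survive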


Answer 1


You can use sets:

main_list = list(set(list_2) - set(list_1))

Output:

>>> list_1=["a", "b", "c", "d", "e"]
>>> list_2=["a", "f", "c", "m"]
>>> set(list_2) - set(list_1)
set(['m', 'f'])
>>> list(set(list_2) - set(list_1))
['m', 'f']

Per @JonClements’ comment, here is a tidier version:

>>> list_1=["a", "b", "c", "d", "e"]
>>> list_2=["a", "f", "c", "m"]
>>> list(set(list_2).difference(list_1))
['m', 'f']

Answer 2


Not sure why the above explanations are so complicated when you have native methods available:

main_list = list(set(list_2)-set(list_1))

Answer 3


Use a list comprehension like this:

main_list = [item for item in list_2 if item not in list_1]

Output:

>>> list_1 = ["a", "b", "c", "d", "e"]
>>> list_2 = ["a", "f", "c", "m"] 
>>> 
>>> main_list = [item for item in list_2 if item not in list_1]
>>> main_list
['f', 'm']

Edit:

As mentioned in the comments below, with large lists the above is not the ideal solution. When that’s the case, a better option is to convert list_1 to a set first:

set_1 = set(list_1)  # this reduces the lookup time from O(n) to O(1)
main_list = [item for item in list_2 if item not in set_1]
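If you want to see the difference yourself, here is a rough timing sketch (the list sizes and repetition count are arbitrary assumptions):

import timeit

setup = "list_1 = list(range(10000)); list_2 = list(range(5000, 15000))"

# naive version: each "not in list_1" test scans the whole list, O(n*m) overall
print(timeit.timeit("[x for x in list_2 if x not in list_1]", setup, number=10))

# set version: each membership test is O(1) on average
print(timeit.timeit("s = set(list_1); [x for x in list_2 if x not in s]", setup, number=10))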

Answer 4


If you want a one-liner solution (ignoring imports) that only requires O(max(n, m)) work for inputs of length n and m, not O(n * m) work, you can do so with the itertools module:

from itertools import filterfalse

main_list = list(filterfalse(set(list_1).__contains__, list_2))

This takes advantage of the functional functions taking a callback function on construction, allowing it to create the callback once and reuse it for every element without needing to store it somewhere (because filterfalse stores it internally); list comprehensions and generator expressions can do this, but it’s ugly.†

That gets the same results in a single line as:

main_list = [x for x in list_2 if x not in list_1]

with the speed of:

set_1 = set(list_1)
main_list = [x for x in list_2 if x not in set_1]

Of course, if the comparisons are intended to be positional, so:

list_1 = [1, 2, 3]
list_2 = [2, 3, 4]

should produce:

main_list = [2, 3, 4]

(because no value in list_2 has a match at the same index in list_1), you should definitely go with Patrick’s answer, which involves no temporary lists or sets (even with sets being roughly O(1), they have a higher “constant” factor per check than simple equality checks) and involves O(min(n, m)) work, less than any other answer, and if your problem is position sensitive, is the only correct solution when matching elements appear at mismatched offsets.

†: The way to do the same thing with a list comprehension as a one-liner would be to abuse nested looping to create and cache value(s) in the “outermost” loop, e.g.:

main_list = [x for set_1 in (set(list_1),) for x in list_2 if x not in set_1]

which also gives a minor performance benefit on Python 3 (because now set_1 is locally scoped in the comprehension code, rather than looked up from nested scope for each check; on Python 2 that doesn’t matter, because Python 2 doesn’t use closures for list comprehensions; they operate in the same scope they’re used in).
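
For concreteness, a small runnable sketch of the filterfalse approach above, using the question’s lists:

from itertools import filterfalse

list_1 = ["a", "b", "c", "d", "e"]
list_2 = ["a", "f", "c", "m"]

# set(list_1).__contains__ is built once and reused for every element;
# filterfalse keeps the elements for which the predicate returns False
main_list = list(filterfalse(set(list_1).__contains__, list_2))
print(main_list)  # ['f', 'm']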


Answer 5

main_list=[]
list_1=["a", "b", "c", "d", "e"]
list_2=["a", "f", "c", "m"]

for i in list_2:
    if i not in list_1:
        main_list.append(i)

print(main_list)

output:

['f', 'm']

Answer 6


I would zip the lists together to compare them element by element.

main_list = [b for a, b in zip(list1, list2) if a != b]

Answer 7


I used two methods and found one more useful than the other. Here is my answer:

My input data:

crkmod_mpp = ['M13','M18','M19','M24']
testmod_mpp = ['M13','M14','M15','M16','M17','M18','M19','M20','M21','M22','M23','M24']

Method 1: np.setdiff1d. I like this approach over the other because it preserves the position

test= list(np.setdiff1d(testmod_mpp,crkmod_mpp))
print(test)
['M15', 'M16', 'M22', 'M23', 'M20', 'M14', 'M17', 'M21']

Method 2: Though it gives the same answer as Method 1, it disturbs the order

test = list(set(testmod_mpp).difference(set(crkmod_mpp)))
print(test)
['POA23', 'POA15', 'POA17', 'POA16', 'POA22', 'POA18', 'POA24', 'POA21']

Method 1, np.setdiff1d, meets my requirements perfectly. This answer is for information.


Answer 8


If the number of occurrences should be taken into account, you probably need to use something like collections.Counter:

list_1=["a", "b", "c", "d", "e"]
list_2=["a", "f", "c", "m"] 
from collections import Counter
cnt1 = Counter(list_1)
cnt2 = Counter(list_2)
final = [key for key, counts in cnt2.items() if cnt1.get(key, 0) != counts]

>>> final
['f', 'm']

As promised, this can also handle a differing number of occurrences as a “difference”:

list_1=["a", "b", "c", "d", "e", 'a']
cnt1 = Counter(list_1)
cnt2 = Counter(list_2)
final = [key for key, counts in cnt2.items() if cnt1.get(key, 0) != counts]

>>> final
['a', 'f', 'm']
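
A related sketch (note the semantics differ slightly from the comprehension above): Counter subtraction keeps only positive counts, so it behaves like a true multiset difference, preserving surplus duplicates from list_2:

from collections import Counter

list_1 = ["a", "b", "c", "d", "e"]
list_2 = ["a", "f", "c", "m", "m"]

# subtraction removes one matching occurrence per element of list_1
final = list((Counter(list_2) - Counter(list_1)).elements())
print(final)  # ['f', 'm', 'm']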

Answer 9


From ser1 remove items present in ser2.

Input

import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

Solution

ser1[~ser1.isin(ser2)]


How do I specify the type of a function in my type hints?

Question: How do I specify the type of a function in my type hints?


I want to use type hints in my current Python 3.5 project. My function should receive a function as parameter.

How can I specify the type function in my type hints?

import typing

def my_function(name:typing.AnyStr, func: typing.Function) -> None:
    # However, typing.Function does not exist.
    # How can I specify the type function for the parameter `func`?

    # do some processing
    pass

I checked PEP 483, but could not find a function type hint there.


Answer 0


As @jonrsharpe noted in a comment, this can be done with typing.Callable:

from typing import AnyStr, Callable

def my_function(name: AnyStr, func: Callable) -> None:

Issue is, Callable on its own is translated to Callable[..., Any], which means:

A callable takes any number of/type of arguments and returns a value of any type. In most cases, this isn’t what you want since you’ll allow pretty much any function to be passed. You want the function parameters and return types to be hinted too.

That’s why many types in typing have been overloaded to support sub-scripting which denotes these extra types. So if, for example, you had a function sum that takes two ints and returns an int:

def sum(a: int, b: int) -> int: return a+b

Your annotation for it would be:

Callable[[int, int], int]

that is, the parameters are sub-scripted in the outer subscription with the return type as the second element in the outer subscription. In general:

Callable[[ParamType1, ParamType2, .., ParamTypeN], ReturnType]
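
As a minimal illustration (the function names here are invented for the example), a parameter annotated with that Callable signature could be used like this:

from typing import Callable

def apply_twice(func: Callable[[int, int], int], a: int, b: int) -> int:
    # func must accept two ints and return an int
    return func(func(a, b), b)

def sum_two(a: int, b: int) -> int:
    return a + b

print(apply_twice(sum_two, 1, 2))  # 5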

Answer 1


Another interesting point to note is that you can use the built-in function type() to get the type of a built-in function and use that. So you could have

def f(my_function: type(abs)) -> int:
    return my_function(100)

Or something of that form


String formatting: named parameters?

Question: String formatting: named parameters?


I know it’s a really simple question, but I have no idea how to google it.

how can I do

print '<a href="%s">%s</a>' % (my_url)

So that my_url is used twice? I assume I have to “name” the %s and then use a dict in the params, but I’m not sure of the proper syntax?


just FYI, I’m aware I can just use my_url twice in the params, but that’s not the point :)


Answer 0


In Python 2.6+ and Python 3, you might choose to use the newer string formatting method.

print('<a href="{0}">{0}</a>'.format(my_url))

which saves you from repeating the argument, or

print('<a href="{url}">{url}</a>'.format(url=my_url))

if you want named parameters.

print('<a href="{}">{}</a>'.format(my_url, my_url))

which is strictly positional, and only comes with the caveat that format() arguments follow Python rules where unnamed args must come first, followed by named arguments, followed by *args (a sequence like a list or tuple) and then **kwargs (a dict keyed with strings, if you know what’s good for you). The interpolation points are determined first by substituting the named values at their labels, and then positionally from what’s left. So, you can also do this…

print('<a href="{not_my_url}">{}</a>'.format(my_url, my_url, not_my_url=her_url))

But not this (it raises a SyntaxError, because positional arguments cannot follow keyword arguments)…

print('<a href="{not_my_url}">{}</a>'.format(my_url, not_my_url=her_url, my_url))

Answer 1

print '<a href="%(url)s">%(url)s</a>' % {'url': my_url}

Answer 2


Solution in Python 3.6+

Python 3.6 introduces literal string formatting, so that you can format the named parameters without repeating any of them outside the string:

print(f'<a href="{my_url:s}">{my_url:s}</a>')

This will evaluate my_url, so if it’s not defined you will get a NameError. In fact, instead of my_url, you can write an arbitrary Python expression, as long as it evaluates to a string (because of the :s formatting code). If you want a string representation for the result of an expression that might not be a string, replace :s by !s, just like with regular, pre-literal string formatting.

For details on literal string formatting, see PEP 498, where it was first introduced.


Answer 3


You will get addicted to this syntax.

Also, C# 6.0 and EcmaScript developers will find this syntax familiar.

In [1]: print '{firstname} {lastname}'.format(firstname='Mehmet', lastname='Ağa')
Mehmet Ağa

In [2]: print '{firstname} {lastname}'.format(**dict(firstname='Mehmet', lastname='Ağa'))
Mehmet Ağa

Answer 4

For building HTML pages, you want to use a templating engine, not simple string interpolation.


Answer 5


As well as the dictionary way, it may be useful to know the following format:

print '<a href="%s">%s</a>' % (my_url, my_url)

Here it’s a tad redundant, and the dictionary way is certainly less error prone when modifying the code, but it’s still possible to use tuples for multiple insertions. The first %s is substituted for the first element in the tuple, the second %s is substituted for the second element in the tuple, and so on for each element in the tuple.


Difference between BeautifulSoup and Scrapy crawler?

Question: Difference between BeautifulSoup and Scrapy crawler?


I want to make a website that shows the comparison between amazon and e-bay product price. Which of these will work better and why? I am somewhat familiar with BeautifulSoup but not so much with Scrapy crawler.


Answer 0


Scrapy is a web-spider or web scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling.

While

BeautifulSoup is a parsing library which also does a pretty good job of fetching contents from a URL and allows you to parse certain parts of them without any hassle. It only fetches the contents of the URL that you give and then stops. It does not crawl unless you manually put it inside an infinite loop with certain criteria.

In simple words, with Beautiful Soup you can build something similar to Scrapy. Beautiful Soup is a library while Scrapy is a complete framework.

Source
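
To make the contrast concrete, here is a minimal sketch of the BeautifulSoup side (the URL and selector are placeholders; fetching is done with requests, since BeautifulSoup itself only parses):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")
for heading in soup.select("h1"):
    print(heading.get_text(strip=True))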


Answer 1


I think both are good… I’m doing a project right now that uses both. First I scrape all the pages using Scrapy and save them in a MongoDB collection using its pipelines, also downloading the images that exist on each page. After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.

If you don’t know which product pages you want, a good tool will be Scrapy, since you can use its crawlers to run over the whole amazon/ebay website looking for products without writing an explicit for loop.

Take a look at the scrapy documentation, it’s very simple to use.


Answer 2


Both are used to parse data.

Scrapy:

  • Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
  • But it has some limitations when data comes from JavaScript or is loaded dynamically; we can overcome that by using packages like Splash, Selenium, etc.

BeautifulSoup:

  • Beautiful Soup is a Python library for pulling data out of HTML and XML files.

  • We can use this package for getting data from JavaScript or dynamically loaded pages.

Scrapy with BeautifulSoup is one of the best combos we can work with for scraping static and dynamic content


Answer 3


The way I do it is to use the eBay/Amazon APIs rather than scrapy, and then parse the results using BeautifulSoup.

The APIs give you an official way of getting the same data that you would have got from a scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.


Answer 4


Scrapy: It is a web scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things scrapy takes care of for us are below.

  • Feed exports: It basically allows us to save data in various formats like CSV,JSON,jsonlines and XML.
  • Asynchronous scraping: Scrapy uses twisted framework which gives us power to visit multiple urls at once where each request is processed in non blocking way(Basically we don’t have to wait for a request to finish before sending another request).
  • Selectors: This is where we can compare scrapy with beautiful soup. Selectors are what allow us to select particular data from the webpage like heading, certain div with a class name etc.). Scrapy uses lxml for parsing which is extremely fast than beautiful soup.
  • Setting proxy,user agent ,headers etc: scrapy allows us to set and rotate proxy,and other headers dynamically.

  • Item Pipelines: Pipelines enable us to process data after extraction. For example we can configure pipeline to push data to your mysql server.

  • Cookies: scrapy automatically handles cookies for us.

etc.

TLDR: scrapy is a framework that provides everything that one might need to build large scale crawls. It provides various features that hide complexity of crawling the webs. one can simply start writing web crawlers without worrying about the setup burden.

Beautiful soup: Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and old. Unlike scrapy, you cannot use Beautiful Soup on its own to make crawlers. You will need other libraries like requests, urllib etc. to make crawlers with bs4. Again, this means you would need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML etc. If you want to speed things up, you will have to use other libraries like multiprocessing.

To sum up.

  • Scrapy is a rich framework that you can use to start writing crawlers without any hassle.

  • Beautiful soup is a library that you can use to parse a webpage. It cannot be used alone to scrape web.

You should definitely use scrapy for your amazon and e-bay product price comparison website. You could build a database of URLs and run the crawler every day (cron jobs, Celery for scheduling crawls) and update the prices in your database. This way your website will always pull from the database, and the crawler and database will act as individual components.


Answer 5


BeautifulSoup is a library that lets you extract information from a web page.

Scrapy on the other hand is a framework, which does the above thing and many more things you probably need in your scraping project like pipelines for saving data.

You can check this blog to get started with Scrapy https://www.inkoop.io/blog/web-scraping-using-python-and-scrapy/


Answer 6


Using scrapy you can save tons of code and start with structured programming. If you don’t like any of scrapy’s pre-written methods, then BeautifulSoup can be used in place of the scrapy method. A big project takes advantage of both.


Answer 7


The differences are many and selection of any tool/technology depends on individual needs.

Few major differences are:

  1. BeautifulSoup is comparatively easier to learn than Scrapy.
  2. The extensions, support, community is larger for Scrapy than for BeautifulSoup.
  3. Scrapy should be considered as a Spider while BeautifulSoup is a Parser.

How to set the "camera position" for 3D plots using python/matplotlib?

Question: How to set the "camera position" for 3D plots using python/matplotlib?


I’m learning how to use mplot3d to produce nice plots of 3d data and I’m pretty happy so far. What I am trying to do at the moment is a little animation of a rotating surface. For that purpose, I need to set a camera position for the 3D projection. I guess this must be possible since a surface can be rotated using the mouse when using matplotlib interactively. But how can I do this from a script? I found a lot of transforms in mpl_toolkits.mplot3d.proj3d but I could not find out how to use these for my purpose and I didn’t find any example for what I’m trying to do.


Answer 0


By “camera position,” it sounds like you want to adjust the elevation and the azimuth angle that you use to view the 3D plot. You can set this with ax.view_init. I’ve used the below script to first create the plot, then I determined a good elevation, or elev, from which to view my plot. I then adjusted the azimuth angle, or azim, to vary the full 360deg around my plot, saving the figure at each instance (and noting which azimuth angle as I saved the plot). For a more complicated camera pan, you can adjust both the elevation and angle to achieve the desired effect.

    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D

    fig = plt.figure()
    ax = Axes3D(fig)
    # xx, yy, zz are your data arrays
    ax.scatter(xx, yy, zz, marker='o', s=20, c="goldenrod", alpha=0.6)
    for ii in range(0, 360, 1):
        ax.view_init(elev=10., azim=ii)
        plt.savefig("movie%d.png" % ii)
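
As an alternative sketch (assuming the fig and ax from above, and that the pillow writer is installed), matplotlib’s animation API can produce the rotation as a single file instead of saving individual frames:

import matplotlib.animation as animation

def rotate(angle):
    # called once per frame with the next azimuth angle
    ax.view_init(elev=10., azim=angle)

rot_animation = animation.FuncAnimation(fig, rotate, frames=range(0, 360, 2), interval=50)
rot_animation.save("rotation.gif", writer="pillow")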

Answer 1


What would be handy would be to apply the Camera position to a new plot. So I plot, then move the plot around with the mouse changing the distance. Then try to replicate the view including the distance on another plot. I find that axx.ax.get_axes() gets me an object with the old .azim and .elev.

IN PYTHON…

axx=ax1.get_axes()
azm=axx.azim
ele=axx.elev
dst=axx.dist       # ALWAYS GIVES 10
#dst=ax1.axes.dist # ALWAYS GIVES 10
#dst=ax1.dist      # ALWAYS GIVES 10

Later 3d graph…

ax2.view_init(elev=ele, azim=azm) #Works!
ax2.dist=dst                       # works but always 10 from axx

EDIT 1… OK, camera position is the wrong way of thinking concerning the .dist value. It rides on top of everything as a kind of hacky scalar multiplier for the whole graph.

This works for the magnification/zoom of the view:

xlm=ax1.get_xlim3d() #These are two tuples
ylm=ax1.get_ylim3d() #we use them in the next
zlm=ax1.get_zlim3d() #graph to reproduce the magnification from mousing
axx=ax1.get_axes()
azm=axx.azim
ele=axx.elev

Later Graph…

ax2.view_init(elev=ele, azim=azm) #Reproduce view
ax2.set_xlim3d(xlm[0],xlm[1])     #Reproduce magnification
ax2.set_ylim3d(ylm[0],ylm[1])     #...
ax2.set_zlim3d(zlm[0],zlm[1])     #...

Common use-cases for pickle in Python

Question: Common use-cases for pickle in Python


I’ve looked at the pickle documentation, but I don’t understand where pickle is useful.

What are some common use-cases for pickle?


Answer 0


Some uses that I have come across:

1) saving a program’s state data to disk so that it can carry on where it left off when restarted (persistence)

2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)

3) storing python objects in a database

4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).

There are some issues with the last one – two identical objects can be pickled and result in different strings – or even the same object pickled twice can have different representations. This is because the pickle can include reference count information.

To emphasise @lunaryorn’s comment – you should never unpickle a string from an untrusted source, since a carefully crafted pickle could execute arbitrary code on your system. For example see https://blog.nelhage.com/2011/03/exploiting-pickle/
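
As a minimal sketch of use case 4 (memoization keyed on pickled arguments; keep the caveat above in mind that identical objects can pickle to different strings):

import pickle
from functools import wraps

def memoize(func):
    cache = {}
    @wraps(func)
    def wrapper(*args, **kwargs):
        # pickle.dumps returns bytes, which are hashable, so unhashable
        # arguments (lists, dicts, ...) can still serve as a cache key
        key = pickle.dumps((args, sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper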


Answer 1


Minimal round-trip example:

>>> import pickle
>>> class Anon(object):
...     pass
...
>>> a = Anon()
>>> a.foo = 'bar'
>>> pickled = pickle.dumps(a)
>>> unpickled = pickle.loads(pickled)
>>> unpickled.foo
'bar'

Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you’d have to dig quite deep into the source) is ZODB: http://svn.zope.org/

Otherwise, PyPI mentions several: http://pypi.python.org/pypi?:action=search&term=pickle&submit=search

I have personally seen several examples of pickled objects being sent over the network as an easy to use network transfer protocol.


Answer 2


Pickling is absolutely necessary for distributed and parallel computing.

Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn’t pickle, you can’t send it to the other resources on another process, computer, etc. Also see here for a good example.

To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.

And, yes, people use pickling to save the state of a calculation, or your ipython session, or whatever.
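
A tiny sketch of why dill helps here (the standard-library pickle refuses lambdas, while dill serializes them):

import dill

square = lambda x: x * x      # plain pickle.dumps(square) would raise an error
payload = dill.dumps(square)
restored = dill.loads(payload)
print(restored(4))            # 16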


Answer 3


I have used it in one of my projects. If the app was terminated while working (it did a lengthy task and processed lots of data), I needed to save the whole data structure and reload it after the app was run again. I used cPickle for this, as speed was crucial and the size of the data was really big.


Answer 4


Pickle is like “Save As..” and “Open..” for your data structures and classes. Let’s say I want to save my data structures so that they persist between program runs.

Saving:

import pickle

with open("save.p", "wb") as f:
    pickle.dump(myStuff, f)

Loading:

from collections import defaultdict

try:
    with open("save.p", "rb") as f:
        myStuff = pickle.load(f)
except (OSError, pickle.UnpicklingError):
    myStuff = defaultdict(dict)

Now I don’t have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.


Answer 5


For the beginner (as is the case with me) it’s really hard to understand why use pickle in the first place when reading the official documentation. It’s maybe because the docs imply that you already know the whole purpose of serialization. Only after reading the general description of serialization have I understood the reason for this module and its common use cases. Also broad explanations of serialization disregarding a particular programming language may help: https://stackoverflow.com/a/14482962/4383472, What is serialization?, https://stackoverflow.com/a/3984483/4383472


Answer 6


To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.


Answer 7


I can tell you the uses I use it for and have seen it used for:

  • Game profile saves
  • Game data saves like lives and health
  • Previous records of, say, numbers input to a program

Those are the ones I use it for at least


Answer 8


I used pickling while web scraping a website; I wanted to store more than 8000k URLs and process them as fast as possible, so I used pickling because its output quality is very high.

You can easily get back to a URL and to where you stopped, and fetch the URL details very quickly, to resume the process.


How can I save my secret keys and passwords securely in my version control system?

Question: How can I save my secret keys and passwords securely in my version control system?


I keep important settings like the hostnames and ports of development and production servers in my version control system. But I know that it’s bad practice to keep secrets (like private keys and database passwords) in a VCS repository.

But passwords–like any other setting–seem like they should be versioned. So what is the proper way to keep passwords version controlled?

I imagine it would involve keeping the secrets in their own “secrets settings” file and having that file encrypted and version controlled. But what technologies? And how to do this properly? Is there a better way entirely to go about it?


I ask the question generally, but in my specific instance I would like to store secret keys and passwords for a Django/Python site using git and github.

Also, an ideal solution would do something magical when I push/pull with git–e.g., if the encrypted passwords file changes a script is run which asks for a password and decrypts it into place.


EDIT: For clarity, I am asking about where to store production secrets.

Answer 0


You’re exactly right to want to encrypt your sensitive settings file while still maintaining the file in version control. As you mention, the best solution would be one in which Git will transparently encrypt certain sensitive files when you push them so that locally (i.e. on any machine which has your certificate) you can use the settings file, but Git or Dropbox or whoever is storing your files under VC does not have the ability to read the information in plaintext.

Tutorial on Transparent Encryption/Decryption during Push/Pull

This gist https://gist.github.com/873637 shows a tutorial on how to use the Git’s smudge/clean filter driver with openssl to transparently encrypt pushed files. You just need to do some initial setup.

Summary of How it Works

You’ll basically be creating a .gitencrypt folder containing 3 bash scripts,

clean_filter_openssl 
smudge_filter_openssl 
diff_filter_openssl 

which are used by Git for decryption, encryption, and supporting Git diff. A master passphrase and salt (fixed!) is defined inside these scripts and you MUST ensure that .gitencrypt is never actually pushed. Example clean_filter_openssl script:

#!/bin/bash

SALT_FIXED=<your-salt> # 24 or less hex characters
PASS_FIXED=<your-passphrase>

openssl enc -base64 -aes-256-ecb -S $SALT_FIXED -k $PASS_FIXED

Similar for smudge_filter_open_ssl and diff_filter_oepnssl. See Gist.

Your repo with sensitive information should have a .gitattributes file (unencrypted and included in the repo) which references the .gitencrypt directory (which contains everything Git needs to encrypt/decrypt the project transparently) and which is present on your local machine.

.gitattributes contents:

* filter=openssl diff=openssl
[merge]
    renormalize = true

Finally, you will also need to add the following content to your .git/config file

[filter "openssl"]
    smudge = ~/.gitencrypt/smudge_filter_openssl
    clean = ~/.gitencrypt/clean_filter_openssl
[diff "openssl"]
    textconv = ~/.gitencrypt/diff_filter_openssl

Now, when you push the repository containing your sensitive information to a remote repository, the files will be transparently encrypted. When you pull from a local machine which has the .gitencrypt directory (containing your passphrase), the files will be transparently decrypted.

Notes

I should note that this tutorial does not describe a way to only encrypt your sensitive settings file. This will transparently encrypt the entire repository that is pushed to the remote VC host and decrypt the entire repository so it is entirely decrypted locally. To achieve the behavior you want, you could place sensitive files for one or many projects in one sensitive_settings_repo. You could investigate how this transparent encryption technique works with Git submodules http://git-scm.com/book/en/Git-Tools-Submodules if you really need the sensitive files to be in the same repository.

The use of a fixed passphrase could theoretically lead to brute-force vulnerabilities if attackers had access to many encrypted repos/files. IMO, the probability of this is very low. As a note at the bottom of this tutorial mentions, not using a fixed passphrase will result in local versions of a repo on different machines always showing that changes have occurred with ‘git status’.


Answer 1


Heroku pushes the use of environment variables for settings and secret keys:

The traditional approach for handling such config vars is to put them under source – in a properties file of some sort. This is an error-prone process, and is especially complicated for open source apps which often have to maintain separate (and private) branches with app-specific configurations.

A better solution is to use environment variables, and keep the keys out of the code. On a traditional host or working locally you can set environment vars in your bashrc. On Heroku, you use config vars.

With Foreman and .env files, Heroku provides an enviable toolchain to export, import and synchronise environment variables.


Personally, I believe it’s wrong to save secret keys alongside code. It’s fundamentally inconsistent with source control, because the keys are for services extrinsic to the code. The one boon would be that a developer can clone HEAD and run the application without any setup. However, suppose a developer checks out a historic revision of the code. Their copy will include last year’s database password, so the application will fail against today’s database.

With the Heroku method above, a developer can checkout last year’s app, configure it with today’s keys, and run it successfully against today’s database.
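
In Python this pattern boils down to reading os.environ instead of a checked-in file; a minimal sketch (the variable names here are illustrative):

import os

SECRET_KEY = os.environ["SECRET_KEY"]  # fail fast if the variable is missing
DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost/dev")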


Answer 2


The cleanest way in my opinion is to use environment variables. You won’t have to deal with .dist files for example, and the project state on the production environment would be the same as your local machine’s.

I recommend reading The Twelve-Factor App‘s config chapter, the others too if you’re interested.


Answer 3


An option would be to put project-bound credentials into an encrypted container (TrueCrypt or Keepass) and push it.

Update as answer from my comment below:

Interesting question btw. I just found this: github.com/shadowhand/git-encrypt which looks very promising for automatic encryption


Answer 4


I suggest using configuration files for that, and not versioning them.

You can however version examples of the files.

I don’t see any problem of sharing development settings. By definition it should contain no valuable data.


Answer 5


BlackBox was recently released by StackExchange and while I have yet to use it, it seems to exactly address the problems and support the features requested in this question.

From the description on https://github.com/StackExchange/blackbox:

Safely store secrets in a VCS repo (i.e. Git or Mercurial). These commands make it easy for you to GPG encrypt specific files in a repo so they are “encrypted at rest” in your repository. However, the scripts make it easy to decrypt them when you need to view or edit them, and decrypt them for use in production.


Answer 6


Since asking this question I have settled on a solution, which I use when developing small application with a small team of people.

git-crypt

git-crypt uses GPG to transparently encrypt files when their names match certain patterns. For instance, if you add to your .gitattributes file…

*.secret.* filter=git-crypt diff=git-crypt

…then a file like config.secret.json will always be pushed to remote repos with encryption, but remain unencrypted on your local file system.

If you want to add a new GPG key (a person) to the repo which can decrypt the protected files, then run git-crypt add-gpg-user <gpg_user_key>. This creates a new commit. The new user will be able to decrypt subsequent commits.


Answer 7


I ask the question generally, but in my specific instance I would like to store secret keys and passwords for a Django/Python site using git and github.

No, just don’t, even if it’s your private repo and you never intend to share it, don’t.

You should create a local_settings.py put it on VCS ignore and in your settings.py do something like

# importing is enough; the imported names become module-level settings
from local_settings import DATABASES, SECRET_KEY

If your secrets settings are that versatile, I am eager to say you’re doing something wrong


Answer 8


EDIT: I assume you want to keep track of your previous passwords versions – say, for a script that would prevent password reusing etc.

I think GnuPG is the best way to go – it’s already used in one git-related project (git-annex) to encrypt repository contents stored on cloud services. GnuPG (gnu pgp) provides a very strong key-based encryption.

  1. You keep a key on your local machine.
  2. You add ‘mypassword’ to ignored files.
  3. On pre-commit hook you encrypt the mypassword file into the mypassword.gpg file tracked by git and add it to the commit.
  4. On post-merge hook you just decrypt mypassword.gpg into mypassword.

Now if your ‘mypassword’ file did not change, then encrypting it will result in the same ciphertext and it won’t be added to the index (no redundancy). The slightest modification of mypassword results in a radically different ciphertext, and mypassword.gpg in the staging area differs a lot from the one in the repository, and thus will be added to the commit. Even if the attacker gets hold of your gpg key, he still needs to brute-force the password. If the attacker gets access to the remote repository with the ciphertext, he can compare a bunch of ciphertexts, but their number won’t be sufficient to give him any non-negligible advantage.

Later on you can use .gitattributes to provide on-the-fly decryption for quick git diff of your password.

Also you can have separate keys for different types of passwords etc.


Answer 9

Usually, I separate the password out into a config file, and make a dist version of it.

/yourapp
    main.py
    default.cfg.dist

And when I run main.py, I put the real password into the copied default.cfg.

P.S. When you work with git or hg, you can ignore *.cfg files by adding them to .gitignore or .hgignore.


Answer 10


Provide a way to override the config

This is the best way to manage a set of sane defaults for the config you check in, without requiring the config to be complete or to contain things like hostnames and credentials. There are a few ways to override default configs.

Environment variables (as others have already mentioned) are one way of doing it.

The best way is to look for an external config file that overrides the default config values. This allows you to manage the external configs via a configuration management system like Chef, Puppet or Cfengine. Configuration management is the standard answer for the management of configs separate from the codebase so you don’t have to do a release to update the config on a single host or a group of hosts.

FYI: Encrypting creds is not always a best practice, especially in a place with limited resources. It may be the case that encrypting creds will gain you no additional risk mitigation and simply add an unnecessary layer of complexity. Make sure you do the proper analysis before making a decision.


Answer 11

Encrypt the passwords file, using for example GPG. Add the keys on your local machine and on your server. Decrypt the file and put it outside your repo folders.

I use a passwords.conf, located in my homefolder. On every deploy this file gets updated.


Answer 12


No, private keys and passwords do not fall under revision control. There is no reason to burden everyone with read access to your repository with knowing sensitive service credentials used in production, when most likely not all of them should have access to those services.

Starting with Django 1.4, your Django projects now ship with a project.wsgi module that defines the application object and it’s a perfect place to start enforcing the use of a project.local settings module that contains site-specific configurations.

This settings module is ignored from revision control, but its presence is required when running your project instance as a WSGI application, typical for production environments. This is how it should look:

import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "project.local")

# This application object is used by the development server
# as well as any WSGI server configured to use this file.
from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()

Now you can have a local.py module whose owner and group can be configured so that only authorized personnel and the Django processes can read the file's contents.


回答 13

如果您需要用 VCS 管理机密,至少应将它们保存在与实际代码分开的第二个仓库中。这样您可以授予团队成员访问源代码仓库的权限,而他们不会看到您的凭证。此外,把该仓库托管在其他位置(例如托管在带加密文件系统的自有服务器上,而不是 GitHub 上),检出到生产系统时可以使用 git-submodule 之类的工具。

If you need VCS for your secrets, you should at least keep them in a second repository separated from your actual code. That way you can give your team members access to the source code repository without them seeing your credentials. Furthermore, host this repository somewhere else (e.g. on your own server with an encrypted filesystem, not on GitHub), and to check it out on the production system you could use something like git-submodule.


回答 14

另一种方法是完全避免在版本控制系统中保存机密,转而使用 HashiCorp 的 Vault 之类的工具:一种支持密钥轮换与审计、带有 API 和内置加密的机密存储。

Another approach could be to completely avoid saving secrets in version control systems and instead use a tool like Vault from HashiCorp: a secret store with key rolling and auditing, an API, and embedded encryption.


回答 15

我是这样做的:

  • 将所有秘密作为环境变量保存在 $HOME/.secrets 中(权限为 go-r),并由 $HOME/.bashrc 加载(这样即使当着别人的面打开 .bashrc,他们也看不到这些秘密)
  • 配置文件作为模板存储在VCS中,例如config.properties存储为config.properties.tmpl
  • 模板文件包含机密的占位符,例如:

    my.password=##MY_PASSWORD##

  • 在应用部署时运行一个脚本,把模板文件转换为目标文件,将占位符替换为对应环境变量的值,例如把 ##MY_PASSWORD## 替换为 $MY_PASSWORD 的值。

This is what I do:

  • Keep all secrets as env vars in $HOME/.secrets (go-r perms) that $HOME/.bashrc sources (this way if you open .bashrc in front of someone, they won’t see the secrets)
  • Configuration files are stored in VCS as templates, such as config.properties stored as config.properties.tmpl
  • The template files contain a placeholder for the secret, such as:

    my.password=##MY_PASSWORD##

  • On application deployment, a script is run that transforms the template file into the target file, replacing the placeholders with the values of the corresponding environment variables, such as changing ##MY_PASSWORD## to the value of $MY_PASSWORD. (A sketch of such a script follows.)
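
A sketch of such a deployment script, assuming placeholders of the form ##NAME## whose values live in environment variables of the same name:

import os
import re

def render_template(src="config.properties.tmpl", dst="config.properties"):
    with open(src) as f:
        text = f.read()
    # Replace every ##NAME## placeholder with the value of $NAME;
    # a missing variable raises KeyError, failing the deploy loudly.
    text = re.sub(r"##(\w+)##", lambda m: os.environ[m.group(1)], text)
    with open(dst, "w") as f:
        f.write(text)

render_template()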


回答 16

如果您的系统提供了此功能,则可以使用EncFS。因此,您可以将加密的数据保留为存储库的子文件夹,同时为应用程序提供对已装入数据的解密视图。由于加密是透明的,因此在拉或推上不需要任何特殊操作。

但是,它将需要挂载EncFS文件夹,这可以由您的应用程序根据存储在版本化文件夹之外的其他位置的密码(例如,环境变量)来完成。

You could use EncFS if your system provides it. That way you could keep your encrypted data as a subfolder of your repository, while providing your application a decrypted view of the data mounted alongside. As the encryption is transparent, no special operations are needed on pull or push.

It would, however, require mounting the EncFS folder, which your application could do based on a password stored elsewhere, outside the versioned folders (e.g. in an environment variable).


如何使用Python检查单词是否为英语单词?

问题:如何使用Python检查单词是否为英语单词?

我想检查Python程序中英语词典中是否有单词。

我相信可以使用nltk wordnet接口,但是我不知道如何将其用于如此简单的任务。

def is_english_word(word):
    pass # how do I implement is_english_word?

is_english_word(token.lower())

将来,我可能想检查单词的单数形式是否在字典中(例如,properties -> property -> 英语单词)。我该如何实现?

I want to check in a Python program if a word is in the English dictionary.

I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.

def is_english_word(word):
    pass # how do I implement is_english_word?

is_english_word(token.lower())

In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> English word). How would I achieve that?


回答 0

若想获得(强得多的)功能和灵活性,请使用专门的拼写检查库,例如 PyEnchant。它有一份教程,您也可以直接上手:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

PyEnchant 自带了一些词典(en_GB、en_US、de_DE、fr_FR),但如果您需要更多语言,它可以使用任意 OpenOffice 词典。

似乎还有一个名为 inflect 的复数化库,但我不知道它好不好用。

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There’s a tutorial, or you could just dive straight in:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.

There appears to be a pluralisation library called inflect, but I’ve no idea whether it’s any good.
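
A rough sketch of the is_english_word helper from the question built on PyEnchant, with inflect's singular_noun handling the plural case; treat the combination as untested glue code rather than a recipe:

import enchant
import inflect

d = enchant.Dict("en_US")
p = inflect.engine()

def is_english_word(word):
    if d.check(word):
        return True
    # singular_noun() returns the singular form, or False if the word
    # doesn't look like a plural noun.
    singular = p.singular_noun(word)
    return bool(singular) and d.check(singular)

print(is_english_word("properties"))  # True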


回答 1

它与 WordNet 配合得并不好,因为 WordNet 并不包含所有英语单词。另一种基于 NLTK、但不依赖 enchant 的做法是使用 NLTK 的 words 语料库:

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True

It won't work well with WordNet, because WordNet does not contain all English words. Another possibility based on NLTK, without enchant, is NLTK's words corpus:

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
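
Note that words.words() returns a plain list, so each in test above is a linear scan; if you are checking many words, it is presumably worth building a set once:

from nltk.corpus import words

# Build the set once; every membership test is then a cheap hash lookup.
english_words = set(words.words())

def is_english_word(word):
    return word in english_words

print(is_english_word("would"))  # True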

回答 2

使用NLTK

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    print("Not an English Word")
else:
    print("English Word")

如果您在安装wordnet时遇到问题或想要尝试其他方法,则应该参考本文

Using NLTK:

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    print("Not an English Word")
else:
    print("English Word")

You should refer to this article if you have trouble installing wordnet or want to try other approaches.
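
Wrapped into the is_english_word shape the question asks for (a minimal sketch, assuming the WordNet data has already been fetched with nltk.download('wordnet')):

from nltk.corpus import wordnet

def is_english_word(word):
    # WordNet knows a word if it has at least one synset.
    return bool(wordnet.synsets(word))

print(is_english_word("hello"))   # True
print(is_english_word("qwzrtx"))  # False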


回答 3

使用集合存储单词列表,因为查找它们会更快:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be true if you have a good english_words.txt

为了回答问题的第二部分:复数本来就会出现在一个好的单词列表里;但如果出于某种原因想把它们专门从列表中排除,确实可以写一个函数来处理。不过英语的复数规则非常棘手,所以我宁愿一开始就把复数留在单词列表里。

至于在哪里找到英语单词列表,我只是用谷歌搜索“英语单词列表”就找到了几个。这是其中之一:http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt 。如果想专门针对英式或美式英语,可以相应地搜索。

Using a set to store the word list because looking them up will be faster:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be true if you have a good english_words.txt

To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I’d just include the plurals in the word list to begin with.

As to where to find English word lists, I found several just by Googling “English word list”. Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.
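
For the second part of the question (properties -> property), one option is NLTK's WordNetLemmatizer, which maps plural nouns to their singular form. A hedged sketch, reusing the english_words set from above and assuming nltk plus its WordNet data are installed:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def is_english_word_or_plural(word, english_words):
    word = word.lower()
    # Check the word itself, then its lemmatized (singular) noun form.
    return (word in english_words
            or lemmatizer.lemmatize(word, pos="n") in english_words)

# lemmatizer.lemmatize("properties", pos="n") -> "property"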


回答 4

对于更快的基于NLTK的解决方案,您可以对单词集进行哈希处理以避免线性搜索。

from nltk.corpus import words as nltk_words
def is_english_word(word):
    # creation of this dictionary would be done outside of 
    #     the function because you only need to do it once.
    dictionary = dict.fromkeys(nltk_words.words(), None)
    try:
        x = dictionary[word]
        return True
    except KeyError:
        return False

For a faster NLTK-based solution you could hash the set of words to avoid a linear search.

from nltk.corpus import words as nltk_words
def is_english_word(word):
    # creation of this dictionary would be done outside of 
    #     the function because you only need to do it once.
    dictionary = dict.fromkeys(nltk_words.words(), None)
    try:
        x = dictionary[word]
        return True
    except KeyError:
        return False
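
As the comment in the code notes, the dictionary should really be built once, outside the function; hoisted to module level, the try/except dance also collapses into a plain membership test. A sketch of that refactor:

from nltk.corpus import words as nltk_words

# Built once at import time; dict membership is a hash lookup.
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    return word in dictionary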

回答 5

我发现有 3 种基于软件包的解决方案:pyenchant、wordnet 和语料库(自定义的或来自 nltk 的)。在 Win64 + Python 3 环境下,Pyenchant 不容易安装;WordNet 表现也不太好,因为它的语料库并不完整。所以我选择了 @Sadik 的答案,并用 set(words.words()) 来加速。

第一:

pip3 install nltk
python3

import nltk
nltk.download('words')

然后:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
>>True

I found there are 3 package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on Win64 with Python 3, and WordNet doesn't work very well because its corpus isn't complete. So I chose the solution answered by @Sadik, and used set(words.words()) to speed it up.

First:

pip3 install nltk
python3

import nltk
nltk.download('words')

Then:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
>>True

回答 6

使用pyEnchant.checker SpellChecker:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return False if ((len(errors) > 4) or len(quote.split()) < 3) else True

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True

With pyEnchant.checker SpellChecker:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return False if ((len(errors) > 4) or len(quote.split()) < 3) else True

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True

回答 7

对于语义Web方法,您可以以RDF格式针对WordNet运行sparql查询。基本上只使用urllib模块发出GET请求并以JSON格式返回结果,然后使用python’json’模块进行解析。如果不是英文单词,您将不会获得任何结果。

另外,您可以查询Wiktionary的API

For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request and get the results back in JSON format, then parse them with Python's json module. If it's not an English word, you'll get no results.

As another idea, you could query Wiktionary’s API.
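
A sketch of the Wiktionary idea using the generic MediaWiki API; the endpoint is the standard one, but the exact response handling below is an assumption, not code tested against Wiktionary:

import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    params = urllib.parse.urlencode(
        {"action": "query", "titles": word, "format": "json"})
    url = "https://en.wiktionary.org/w/api.php?" + params
    with urllib.request.urlopen(url) as resp:
        pages = json.load(resp)["query"]["pages"]
    # The MediaWiki API reports missing pages under the sentinel id "-1".
    return "-1" not in pages

print(in_wiktionary("hello"))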


回答 8

对于所有Linux / Unix用户

如果您的操作系统使用 Linux 内核,有一种简单的方法可以获取英式/美式词典中的所有单词。/usr/share/dict 目录下有一个 words 文件,此外还有更具体的 american-english 和 british-english 文件,它们包含对应语言的全部单词。任何编程语言都能读取这些文件,所以我觉得您可能想了解这一点。

现在,对 Python 用户来说,下面的代码会把词典中的每个单词读入一个集合 words:

import re

with open("/usr/share/dict/words") as word_file:
    words = set(re.sub(r"[^\w]", " ", word_file.read()).split())

def is_word(word):
    return word.lower() in words

is_word("tarts")            ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False

希望这可以帮助!!!

For All Linux/Unix Users

If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access them from every programming language, which is why I thought you might want to know about this.

Now, for Python users specifically, the code below reads every single word into a set called words:

import re

with open("/usr/share/dict/words") as word_file:
    words = set(re.sub(r"[^\w]", " ", word_file.read()).split())

def is_word(word):
    return word.lower() in words

is_word("tarts")            ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False

Hope this helps!!!


pandas 多列唯一值

问题:pandas 多列唯一值

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

返回 'Col1' 和 'Col2' 的唯一值的最佳方法是什么?

所需的输出是

'Bob', 'Joe', 'Bill', 'Mary', 'Steve'
df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

What is the best way to return the unique values of ‘Col1’ and ‘Col2’?

The desired output is

'Bob', 'Joe', 'Bill', 'Mary', 'Steve'

回答 0

pd.unique 从输入数组或DataFrame列或索引返回唯一值。

此函数的输入必须是一维的,因此将需要合并多列。最简单的方法是选择所需的列,然后在展平的NumPy数组中查看值。整个操作如下所示:

>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

请注意,ravel() 是一个数组方法,它(在可能时)返回多维数组的一个视图。参数 'K' 告诉该方法按元素在内存中的存储顺序展平数组(pandas 通常以 Fortran 连续顺序存储底层数组,列先于行)。这可能比该方法默认的 'C' 顺序快得多。


另一种方法是选择列并将其传递给np.unique

>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

此处不需要使用 ravel(),因为 np.unique 可以处理多维数组。即便如此,它也可能比 pd.unique 慢,因为它使用基于排序的算法而不是哈希表来识别唯一值。

对于较大的DataFrame,速度上的差异非常大(尤其是在只有少数唯一值的情况下):

>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop

pd.unique returns the unique values from an input array, or DataFrame column or index.

The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:

>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.


An alternative way is to select the columns and pass them to np.unique:

>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

There is no need to use ravel() here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values.

The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):

>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop

回答 1

我建立了一个 DataFrame,它的列中放了一些简单的字符串:

>>> df
   a  b
0  a  g
1  b  h
2  d  a
3  e  e

您可以连接感兴趣的列并调用unique函数:

>>> pandas.concat([df['a'], df['b']]).unique()
array(['a', 'b', 'd', 'e', 'g', 'h'], dtype=object)

I have set up a DataFrame with a few simple strings in its columns:

>>> df
   a  b
0  a  g
1  b  h
2  d  a
3  e  e

You can concatenate the columns you are interested in and call unique function:

>>> pandas.concat([df['a'], df['b']]).unique()
array(['a', 'b', 'd', 'e', 'g', 'h'], dtype=object)

回答 2

In [5]: set(df.Col1).union(set(df.Col2))
Out[5]: {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}

或者:

set(df.Col1) | set(df.Col2)
In [5]: set(df.Col1).union(set(df.Col2))
Out[5]: {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}

Or:

set(df.Col1) | set(df.Col2)

回答 3

使用 numpy v1.13+ 的更新解法:如果传入多个列,需要在 np.unique 中指定 axis,否则数组会被隐式展平。(注意,axis=0 返回的是唯一的行组合,而不是扁平的唯一值数组。)

import numpy as np

np.unique(df[['col1', 'col2']], axis=0)

此更改于 2016 年 11 月引入:https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be

An updated solution using numpy v1.13+: if you use multiple columns, you need to specify the axis in np.unique, otherwise the array is implicitly flattened. (Note that with axis=0 this returns the unique rows, rather than a flat array of unique values.)

import numpy as np

np.unique(df[['col1', 'col2']], axis=0)

This change was introduced Nov 2016: https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be


回答 4

非 pandas 解决方案:使用 set()。

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1' : ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
              'Col2' : ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
               'Col3' : np.random.random(5)})

print(df)

print(set(df.Col1.append(df.Col2).values))

输出:

   Col1   Col2      Col3
0   Bob    Joe  0.201079
1   Joe  Steve  0.703279
2  Bill    Bob  0.722724
3  Mary    Bob  0.093912
4   Joe  Steve  0.766027
set(['Steve', 'Bob', 'Bill', 'Joe', 'Mary'])

Non-pandas solution: using set().

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1' : ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
              'Col2' : ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
               'Col3' : np.random.random(5)})

print(df)

print(set(df.Col1.append(df.Col2).values))

Output:

   Col1   Col2      Col3
0   Bob    Joe  0.201079
1   Joe  Steve  0.703279
2  Bill    Bob  0.722724
3  Mary    Bob  0.093912
4   Joe  Steve  0.766027
set(['Steve', 'Bob', 'Bill', 'Joe', 'Mary'])
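
Note that Series.append, used above, was deprecated and later removed from pandas; on current versions the same idea can be written with pd.concat (a hedged equivalent):

import pandas as pd

unique_names = set(pd.concat([df.Col1, df.Col2]))
print(unique_names)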

回答 5

献给那些热爱 pandas、apply,当然还有 lambda 函数的人:

df['Col3'] = df[['Col1', 'Col2']].apply(lambda x: ''.join(x), axis=1)

For those of us that love all things pandas, apply, and of course lambda functions:

df['Col3'] = df[['Col1', 'Col2']].apply(lambda x: ''.join(x), axis=1)

回答 6

这是另一种方式:


import numpy as np
set(np.concatenate(df.values))

Here's another way:


import numpy as np
set(np.concatenate(df.values))

回答 7

list(set(df[['Col1', 'Col2']].as_matrix().reshape((1,-1)).tolist()[0]))

输出将是 ['Mary', 'Joe', 'Steve', 'Bob', 'Bill']

list(set(df[['Col1', 'Col2']].as_matrix().reshape((1,-1)).tolist()[0]))

The output will be ['Mary', 'Joe', 'Steve', 'Bob', 'Bill']


ImportError:没有名为dateutil.parser的模块

问题:ImportError:没有名为dateutil.parser的模块

在 Python 程序中导入 pandas 时,我收到以下错误:

monas-mbp:book mona$ sudo pip install python-dateutil
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
Cleaning up...
monas-mbp:book mona$ python t1.py
No module named dateutil.parser
Traceback (most recent call last):
  File "t1.py", line 4, in <module>
    import pandas as pd
  File "/Library/Python/2.7/site-packages/pandas/__init__.py", line 6, in <module>
    from . import hashtable, tslib, lib
  File "tslib.pyx", line 31, in init pandas.tslib (pandas/tslib.c:48782)
ImportError: No module named dateutil.parser

程序如下:

import codecs 
from math import sqrt
import numpy as np
import pandas as pd

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},

         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},

         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},

         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},

         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},

         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},

         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},

         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }



class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id


    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print ("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))




    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)


    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator


    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
       """Give list of recommendations"""
       recommendations = {}
       # first get list of users  ordered by nearness
       nearest = self.computeNearestNeighbor(user)
       #
       # now get the ratings for the user
       #
       userRatings = self.data[user]
       #
       # determine the total distance
       totalDistance = 0.0
       for i in range(self.k):
          totalDistance += nearest[i][1]
       # now iterate through the k nearest neighbors
       # accumulating their ratings
       for i in range(self.k):
          # compute slice of pie 
          weight = nearest[i][1] / totalDistance
          # get the name of the person
          name = nearest[i][0]
          # get the ratings for this person
          neighborRatings = self.data[name]
          # get the name of the person
          # now find bands neighbor rated that user didn't
          for artist in neighborRatings:
             if not artist in userRatings:
                if artist not in recommendations:
                   recommendations[artist] = (neighborRatings[artist]
                                              * weight)
                else:
                   recommendations[artist] = (recommendations[artist]
                                              + neighborRatings[artist]
                                              * weight)
       # now make list from dictionary
       recommendations = list(recommendations.items())
       recommendations = [(self.convertProductID2name(k), v)
                          for (k, v) in recommendations]
       # finally sort and return
       recommendations.sort(key=lambda artistTuple: artistTuple[1],
                            reverse = True)
       # Return the first n items
       return recommendations[:self.n]

r = recommender(users)
# The author implementation
r.loadBookDB('/Users/mona/Downloads/BX-Dump/')

ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")



pivot_rating = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')

I am receiving the following error when importing pandas in a Python program

monas-mbp:book mona$ sudo pip install python-dateutil
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
Cleaning up...
monas-mbp:book mona$ python t1.py
No module named dateutil.parser
Traceback (most recent call last):
  File "t1.py", line 4, in <module>
    import pandas as pd
  File "/Library/Python/2.7/site-packages/pandas/__init__.py", line 6, in <module>
    from . import hashtable, tslib, lib
  File "tslib.pyx", line 31, in init pandas.tslib (pandas/tslib.c:48782)
ImportError: No module named dateutil.parser

Also here’s the program:

import codecs 
from math import sqrt
import numpy as np
import pandas as pd

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},

         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},

         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},

         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},

         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},

         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},

         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},

         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }



class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id


    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print ("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))




    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)


    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator


    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
       """Give list of recommendations"""
       recommendations = {}
       # first get list of users  ordered by nearness
       nearest = self.computeNearestNeighbor(user)
       #
       # now get the ratings for the user
       #
       userRatings = self.data[user]
       #
       # determine the total distance
       totalDistance = 0.0
       for i in range(self.k):
          totalDistance += nearest[i][1]
       # now iterate through the k nearest neighbors
       # accumulating their ratings
       for i in range(self.k):
          # compute slice of pie 
          weight = nearest[i][1] / totalDistance
          # get the name of the person
          name = nearest[i][0]
          # get the ratings for this person
          neighborRatings = self.data[name]
          # get the name of the person
          # now find bands neighbor rated that user didn't
          for artist in neighborRatings:
             if not artist in userRatings:
                if artist not in recommendations:
                   recommendations[artist] = (neighborRatings[artist]
                                              * weight)
                else:
                   recommendations[artist] = (recommendations[artist]
                                              + neighborRatings[artist]
                                              * weight)
       # now make list from dictionary
       recommendations = list(recommendations.items())
       recommendations = [(self.convertProductID2name(k), v)
                          for (k, v) in recommendations]
       # finally sort and return
       recommendations.sort(key=lambda artistTuple: artistTuple[1],
                            reverse = True)
       # Return the first n items
       return recommendations[:self.n]

r = recommender(users)
# The author implementation
r.loadBookDB('/Users/mona/Downloads/BX-Dump/')

ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")



pivot_rating = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')

回答 0

在Ubuntu上,您可能需要先安装软件包管理器pip

sudo apt-get install python-pip

然后使用以下命令安装python-dateutil软件包:

sudo pip install python-dateutil

On Ubuntu you may need to install the package manager pip first:

sudo apt-get install python-pip

Then install the python-dateutil package with:

sudo pip install python-dateutil
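
Once the install succeeds, a quick sanity check that the module the traceback complained about is importable (the date string is arbitrary):

from dateutil import parser

# If this runs without an ImportError, the dependency pandas needs is in place.
print(parser.parse("2013-11-02 14:30"))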

回答 1

您可以在https://pypi.python.org/pypi/python-dateutil中找到dateutil包。将其解压缩到某个地方并运行命令:

python setup.py install

这对我有效!

You can find the dateutil package at https://pypi.python.org/pypi/python-dateutil. Extract it to somewhere and run the command:

python setup.py install

It worked for me!


回答 2

对于Python 3:

pip3 install python-dateutil

For Python 3:

pip3 install python-dateutil

回答 3

对于 Python 3 及以上版本,请使用:

sudo apt-get install python3-dateutil

For Python 3 and above, use:

sudo apt-get install python3-dateutil

回答 4

如果使用的是virtualenv,请确保从virtualenv内部运行pip 。

$ which pip
/Library/Frameworks/Python.framework/Versions/Current/bin/pip
$ find . -name pip -print
./flask/bin/pip
./flask/lib/python2.7/site-packages/pip
$ ./flask/bin/pip install python-dateutil

If you’re using a virtualenv, make sure that you are running pip from within the virtualenv.

$ which pip
/Library/Frameworks/Python.framework/Versions/Current/bin/pip
$ find . -name pip -print
./flask/bin/pip
./flask/lib/python2.7/site-packages/pip
$ ./flask/bin/pip install python-dateutil

回答 5

没有一种解决方案对我有用。如果您使用的是PIP,请执行以下操作:

pip install pycrypto==2.6.1

None of the solutions worked for me. If you are using PIP do:

pip install pycrypto==2.6.1


回答 6

在 Ubuntu 18.04 上,针对 Python 2:

sudo apt-get install python-dateutil

On Ubuntu 18.04, for Python 2:

sudo apt-get install python-dateutil

回答 7

我在 macOS 上也遇到了同样的问题,安装 python-dateutil 之后问题就解决了。

I had the same issue on macOS, and installing python-dateutil fixed it for me.


回答 8

如果您使用Pipenv,则可能需要将此添加到您的Pipfile

[packages]
python-dateutil = "*"

If you are using Pipenv, you may need to add this to your Pipfile:

[packages]
python-dateutil = "*"
