SOLUTION (1) You can use NumPy's setdiff1d:
import numpy as np
main_list = np.setdiff1d(list_2, list_1)
# yields the elements in `list_2` that are NOT in `list_1`
SOLUTION (2) You want a sorted list:
def setdiff_sorted(array1, array2, assume_unique=False):
    ans = np.setdiff1d(array1, array2, assume_unique).tolist()
    if assume_unique:
        return sorted(ans)
    return ans
main_list = setdiff_sorted(list_2,list_1)
EXPLANATIONS: (1) You can use NumPy's setdiff1d(array1, array2, assume_unique=False).
assume_unique tells the function whether the arrays are already unique.
If False (the default), the unique elements are determined first.
If True, the function assumes the elements are already unique and skips that step.
This yields the unique values in array1 that are not in array2.
If you are concerned with the unique elements (based on the response of Chinny84), then simply use the default, assume_unique=False:
import numpy as np
list_1 = ["a", "b", "c", "d", "e"]
list_2 = ["a", "f", "c", "m"]
main_list = np.setdiff1d(list_2,list_1)
# yields the elements in `list_2` that are NOT in `list_1`
(2)
For those who want answers to be sorted, I’ve made a custom function:
import numpy as np

def setdiff_sorted(array1, array2, assume_unique=False):
    ans = np.setdiff1d(array1, array2, assume_unique).tolist()
    if assume_unique:
        return sorted(ans)
    return ans
To get the answer, run:
main_list = setdiff_sorted(list_2,list_1)
SIDE NOTES:
(a) Solution 2 (custom function setdiff_sorted) returns a list (compared to an array in solution 1).
(b) If you aren’t sure whether the elements are unique, just use the default setting of NumPy’s setdiff1d in both solutions 1 and 2. What can be an example of a complication? See note (c).
(c) Things will be different if either of the two lists is not unique.
Say list_2 is not unique: list_2 = ["a", "f", "c", "m", "m"]. Keep list_1 as is: list_1 = ["a", "b", "c", "d", "e"]. With the default value of assume_unique, both solutions yield ["f", "m"]. HOWEVER, if you set assume_unique=True, both solutions give ["f", "m", "m"]. Why? Because the user ASSUMED that the elements are unique. Hence, IT IS BETTER TO KEEP assume_unique at its default value. Note that both answers are sorted.
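To make note (c) concrete, here is a minimal sketch using the duplicate "m":
import numpy as np

list_1 = ["a", "b", "c", "d", "e"]
list_2 = ["a", "f", "c", "m", "m"]  # note the duplicate "m"

print(np.setdiff1d(list_2, list_1))                      # ['f' 'm'] -- duplicates removed first
print(np.setdiff1d(list_2, list_1, assume_unique=True))  # ['f' 'm' 'm'] -- duplicate kept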
main_list = [item for item in list_2 if item not in list_1]
Output:
>>> list_1 = ["a", "b", "c", "d", "e"]
>>> list_2 = ["a", "f", "c", "m"]
>>>
>>> main_list = [item for item in list_2 if item not in list_1]
>>> main_list
['f', 'm']
Edit:
As mentioned in the comments below, with large lists the above is not the ideal solution. When that’s the case, a better option is converting list_1 to a set first:
set_1 = set(list_1) # this reduces the lookup time from O(n) to O(1)
main_list = [item for item in list_2 if item not in set_1]
If you want a one-liner solution (ignoring imports) that only requires O(max(n, m)) work for inputs of length n and m, not O(n * m) work, you can do so with the itertools module:
from itertools import filterfalse
main_list = list(filterfalse(set(list_1).__contains__, list_2))
This takes advantage of the functional functions taking a callback function on construction, allowing it to create the callback once and reuse it for every element without needing to store it somewhere (because filterfalse stores it internally); list comprehensions and generator expressions can do this, but it’s ugly.†
That gets the same results in a single line as:
main_list = [x for x in list_2 if x not in list_1]
with the speed of:
set_1 = set(list_1)
main_list = [x for x in list_2 if x not in set_1]
Of course, if the comparisons are intended to be positional, so:
list_1 = [1, 2, 3]
list_2 = [2, 3, 4]
should produce:
main_list = [2, 3, 4]
(because no value in list_2 has a match at the same index in list_1), you should definitely go with Patrick’s answer, which involves no temporary lists or sets (even with sets being roughly O(1), they have a higher “constant” factor per check than simple equality checks) and involves O(min(n, m)) work, less than any other answer, and if your problem is position sensitive, is the only correct solution when matching elements appear at mismatched offsets.
†: The way to do the same thing with a list comprehension as a one-liner would be to abuse nested looping to create and cache value(s) in the “outermost” loop, e.g.:
main_list = [x for set_1 in (set(list_1),) for x in list_2 if x not in set_1]
which also gives a minor performance benefit on Python 3 (because now set_1 is locally scoped in the comprehension code, rather than looked up from nested scope for each check; on Python 2 that doesn’t matter, because Python 2 doesn’t use closures for list comprehensions; they operate in the same scope they’re used in).
Answer 5
main_list = []
list_1 = ["a", "b", "c", "d", "e"]
list_2 = ["a", "f", "c", "m"]

for i in list_2:
    if i not in list_1:
        main_list.append(i)

print(main_list)
Method 1 (np.setdiff1d) meets my requirements perfectly; this answer is just for information.
Answer 8
If the number of occurrences should be taken into account, you probably want to use something like collections.Counter:
from collections import Counter

list_1 = ["a", "b", "c", "d", "e"]
list_2 = ["a", "f", "c", "m"]

cnt1 = Counter(list_1)
cnt2 = Counter(list_2)
final = [key for key, counts in cnt2.items() if cnt1.get(key, 0) != counts]

>>> final
['f', 'm']
As promised, this also treats a different number of occurrences as a “difference”:
list_1 = ["a", "b", "c", "d", "e", 'a']

cnt1 = Counter(list_1)
cnt2 = Counter(list_2)
final = [key for key, counts in cnt2.items() if cnt1.get(key, 0) != counts]

>>> final
['a', 'f', 'm']
I want to use type hints in my current Python 3.5 project. My function should receive a function as parameter.
How can I specify the type function in my type hints?
import typing
def my_function(name:typing.AnyStr, func: typing.Function) -> None:
# However, typing.Function does not exist.
# How can I specify the type function for the parameter `func`?
# do some processing
pass
I checked PEP 483, but could not find a function type hint there.
The issue is that Callable on its own is translated to Callable[..., Any], which means:
A callable takes any number of/type of arguments and returns a value of any type. In most cases, this isn’t what you want since you’ll allow pretty much any function to be passed. You want the function parameters and return types to be hinted too.
That’s why many types in typing have been overloaded to support sub-scripting which denotes these extra types. So if, for example, you had a function sum that takes two ints and returns an int:
def sum(a: int, b: int) -> int: return a+b
Your annotation for it would be:
Callable[[int, int], int]
that is, the parameters are sub-scripted in the outer subscription with the return type as the second element in the outer subscription. In general: Callable[[ParamType1, ParamType2, ...], ReturnType].
Another interesting point to note is that you can use the built-in function type() to get the type of a built-in function and use that.
So you could have something like the sketch below.
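For instance, a minimal sketch (the function and its body are illustrative, not from the original answer):
def do_twice(func: type(abs), argument: int) -> int:
    # type(abs) evaluates to builtins.builtin_function_or_method, so this
    # hint only describes built-in functions, not user-defined ones
    return func(func(argument))

do_twice(abs, -3)  # 3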
This approach is strictly positional, and only comes with the caveat that format() arguments follow Python rules: unnamed args must come first, followed by named arguments, followed by *args (a sequence like a list or tuple) and then **kwargs (a dict keyed with strings, if you know what’s good for you).
The interpolation points are determined first by substituting the named values at their labels, and then positionally from what’s left.
So, you can also do this…
Python 3.6 introduces literal string formatting, so that you can format the named parameters without any repeating any of your named parameters outside the string:
print(f'<a href="{my_url:s}">{my_url:s}</a>')
This will evaluate my_url, so if it’s not defined you will get a NameError. In fact, instead of my_url, you can write an arbitrary Python expression, as long as it evaluates to a string (because of the :s formatting code). If you want a string representation for the result of an expression that might not be a string, replace :s by !s, just like with regular, pre-literal string formatting.
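A quick sketch of that fallback (my_url here is deliberately not a string):
my_url = 42  # an int, not a str
print(f'<a href="{my_url!s}">{my_url!s}</a>')  # !s applies str() before formatting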
For details on literal string formatting, see PEP 498, where it was first introduced.
Answer 3
You will get addicted to the syntax.
Also, C# 6.0 and EcmaScript developers will find this syntax familiar.
In [1]: print '{firstname} {lastname}'.format(firstname='Mehmet', lastname='Ağa')
Mehmet Ağa
In [2]: print '{firstname} {lastname}'.format(**dict(firstname='Mehmet', lastname='Ağa'))
Mehmet Ağa
As well as the dictionary way, it may be useful to know the following format:
print '<a href="%s">%s</a>' % (my_url, my_url)
Here it’s a tad redundant, and the dictionary way is certainly less error prone when modifying the code, but it’s still possible to use tuples for multiple insertions. The first %s is substituted for the first element in the tuple, the second %s is substituted for the second element in the tuple, and so on for each element in the tuple.
I want to make a website that shows the comparison between amazon and e-bay product price.
Which of these will work better and why? I am somewhat familiar with BeautifulSoup but not so much with Scrapy crawler.
Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling.
While
BeautifulSoup is a parsing library which also does a pretty good job of fetching contents from URL and allows you to parse certain parts of them without any hassle. It only fetches the contents of the URL that you give and then stops. It does not crawl unless you manually put it inside an infinite loop with certain criteria.
In simple words, with Beautiful Soup you can build something similar to Scrapy.
Beautiful Soup is a library while Scrapy is a complete framework.
I think both are good; I’m working on a project right now that uses both. First I scrape all the pages using Scrapy and save them to a MongoDB collection using its pipelines, also downloading the images that exist on each page.
After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.
If you don’t know which product pages you want, a good tool will be Scrapy, since you can use its crawlers to run over the whole Amazon/eBay website looking for the products without writing an explicit for loop.
Take a look at the Scrapy documentation; it’s very simple to use.
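For a sense of how little code a basic spider takes, here is a minimal sketch (the URL and CSS selectors are placeholders, not a real site layout):
import scrapy

class PriceSpider(scrapy.Spider):
    name = "prices"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        # yield one item per product block; selectors are hypothetical
        for product in response.css("div.product"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }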
The way I do it is to use the eBay/Amazon APIs rather than Scrapy, and then parse the results using BeautifulSoup.
The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.
Scrapy
It is a web scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.
Feed exports: It basically allows us to save data in various formats like CSV, JSON, JSON Lines, and XML.
Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, with each request processed in a non-blocking way (basically, we don’t have to wait for one request to finish before sending another).
Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage (like a heading, or a certain div with a class name, etc.). Scrapy uses lxml for parsing, which is extremely fast compared to Beautiful Soup.
Setting proxies, user agents, headers, etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.
Item pipelines: Pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to a MySQL server.
Cookies: Scrapy automatically handles cookies for us.
etc.
TLDR: Scrapy is a framework that provides everything one might need to build large-scale crawlers. It provides various features that hide the complexity of crawling the web. One can simply start writing web crawlers without worrying about the setup burden.
Beautiful Soup
Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and old. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries like requests or urllib to make crawlers with BS4. Again, this means you would need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML, etc. If you want to speed things up, you will have to use other libraries like multiprocessing.
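To make the contrast concrete, here is a minimal fetch-and-parse sketch with requests and BS4 (the URL and selector are placeholders):
import requests
from bs4 import BeautifulSoup

# fetch one page ourselves, then hand the HTML to Beautiful Soup to parse
html = requests.get("https://example.com/products").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

for tag in soup.select("span.price"):  # hypothetical selector
    print(tag.get_text(strip=True))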
To sum up:
Scrapy is a rich framework that you can use to start writing crawlers without any hassle.
Beautiful Soup is a library that you can use to parse a webpage. It cannot be used alone to scrape the web.
You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs and run the crawler every day (cron jobs, or Celery for scheduling crawls) and update the prices in your database. This way your website will always pull from the database, and the crawler and the database will act as individual components.
BeautifulSoup is a library that lets you extract information from a web page.
Scrapy on the other hand is a framework, which does the above thing and many more things you probably need in your scraping project like pipelines for saving data.
Using Scrapy you can save tons of code and start with structured programming. If you don’t like any of Scrapy’s pre-written methods, then BeautifulSoup can be used in place of a Scrapy method.
A big project takes advantage of both.
I’m learning how to use mplot3d to produce nice plots of 3d data and I’m pretty happy so far. What I am trying to do at the moment is a little animation of a rotating surface. For that purpose, I need to set a camera position for the 3D projection. I guess this must be possible since a surface can be rotated using the mouse when using matplotlib interactively. But how can I do this from a script?
I found a lot of transforms in mpl_toolkits.mplot3d.proj3d but I could not find out how to use these for my purpose and I didn’t find any example for what I’m trying to do.
By “camera position,” it sounds like you want to adjust the elevation and the azimuth angle that you use to view the 3D plot. You can set this with ax.view_init. I’ve used the script below to first create the plot, then I determined a good elevation, or elev, from which to view my plot. I then adjusted the azimuth angle, or azim, through the full 360 degrees around my plot, saving the figure at each step (and noting the azimuth angle as I saved each plot). For a more complicated camera pan, you can adjust both the elevation and the angle to achieve the desired effect.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(xx, yy, zz, marker='o', s=20, c="goldenrod", alpha=0.6)  # xx, yy, zz are your data arrays

for ii in range(0, 360, 1):
    ax.view_init(elev=10., azim=ii)
    plt.savefig("movie%d.png" % ii)
What would be handy would be to apply the Camera position to a new plot.
So I plot, then move the plot around with the mouse changing the distance. Then try to replicate the view including the distance on another plot.
I find that axx = ax1.get_axes() gets me an object with the old .azim and .elev.
ax2.view_init(elev=ele, azim=azm) #Works!
ax2.dist=dst # works but always 10 from axx
EDIT 1…
OK, camera position is the wrong way of thinking about the .dist value. It rides on top of everything as a kind of hacky scalar multiplier for the whole graph.
This works for the magnification/zoom of the view:
xlm = ax1.get_xlim3d()  # these are two tuples
ylm = ax1.get_ylim3d()  # we use them in the next
zlm = ax1.get_zlim3d()  # graph to reproduce the magnification from mousing
axx = ax1.get_axes()
azm = axx.azim
ele = axx.elev
1) saving a program’s state data to disk so that it can carry on where it left off when restarted (persistence)
2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)
3) storing python objects in a database
4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).
There are some issues with the last one – two identical objects can be pickled and result in different strings – or even the same object pickled twice can have different representations. This is because the pickle can include reference count information.
To emphasise @lunaryorn’s comment – you should never unpickle a string from an untrusted source, since a carefully crafted pickle could execute arbitrary code on your system. For example see https://blog.nelhage.com/2011/03/exploiting-pickle/
Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you’d have to dig quite deep into the source) is ZODB:
http://svn.zope.org/
Pickling is absolutely necessary for distributed and parallel computing.
Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn’t pickle, you can’t send it to the other resources on another process, computer, etc. Also see here for a good example.
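As a minimal illustration of why this matters (the function here is hypothetical), multiprocessing pickles the mapped function and its arguments to ship them to worker processes:
from multiprocessing import Pool

def square(x):  # defined at module level, so it pickles cleanly
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]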
To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.
And, yes, people use pickling to save the state of a calculation, or your IPython session, or whatever.
I have used it in one of my projects. If the app was terminated during its work (it did a lengthy task and processed lots of data), I needed to save the whole data structure and reload it after the app was run again. I used cPickle for this, as speed was crucial and the data was really big.
Pickle is like “Save As..” and “Open..” for your data structures and classes. Let’s say I want to save my data structures so that it is persistent between program runs.
Saving:
import pickle

with open("save.p", "wb") as f:
    pickle.dump(myStuff, f)
Loading:
from collections import defaultdict

try:
    with open("save.p", "rb") as f:
        myStuff = pickle.load(f)
except FileNotFoundError:
    myStuff = defaultdict(dict)
Now I don’t have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.
To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.
I used pickling while web scraping a website where I wanted to store more than 8,000k URLs and process them as fast as possible. I used pickling because it preserves the saved state very faithfully: you can easily get back to a URL and to the exact point where you stopped, and fetch the stored URL details very quickly when resuming the process.
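A minimal sketch of that checkpointing pattern (names and URLs are illustrative):
import pickle

# periodically checkpoint crawl state so an interrupted job can resume
state = {"done": {"https://example.com/page1"},
         "todo": ["https://example.com/page2"]}

with open("crawl_state.p", "wb") as f:
    pickle.dump(state, f)

with open("crawl_state.p", "rb") as f:
    state = pickle.load(f)  # pick up where the crawl left off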
I keep important settings like the hostnames and ports of development and production servers in my version control system. But I know that it’s bad practice to keep secrets (like private keys and database passwords) in a VCS repository.
But passwords–like any other setting–seem like they should be versioned. So what is the proper way to keep passwords version controlled?
I imagine it would involve keeping the secrets in their own “secrets settings” file and having that file encrypted and version controlled. But what technologies? And how to do this properly? Is there a better way entirely to go about it?
I ask the question generally, but in my specific instance I would like to store secret keys and passwords for a Django/Python site using git and github.
Also, an ideal solution would do something magical when I push/pull with git–e.g., if the encrypted passwords file changes a script is run which asks for a password and decrypts it into place.
EDIT: For clarity, I am asking about where to store production secrets.
You’re exactly right to want to encrypt your sensitive settings file while still maintaining the file in version control. As you mention, the best solution would be one in which Git will transparently encrypt certain sensitive files when you push them so that locally (i.e. on any machine which has your certificate) you can use the settings file, but Git or Dropbox or whoever is storing your files under VC does not have the ability to read the information in plaintext.
Tutorial on Transparent Encryption/Decryption during Push/Pull
This gist https://gist.github.com/873637 shows a tutorial on how to use Git’s smudge/clean filter driver with openssl to transparently encrypt pushed files. You just need to do some initial setup.
Summary of How it Works
You’ll basically be creating a .gitencrypt folder containing 3 bash scripts, which are used by Git for decryption, encryption, and supporting git diff. A master passphrase and salt (fixed!) are defined inside these scripts, and you MUST ensure that .gitencrypt is never actually pushed.
Example clean_filter_openssl script:
#!/bin/bash
SALT_FIXED=<your-salt> # 24 or less hex characters
PASS_FIXED=<your-passphrase>
openssl enc -base64 -aes-256-ecb -S $SALT_FIXED -k $PASS_FIXED
Similar for smudge_filter_openssl and diff_filter_openssl. See the Gist.
Your repo with sensitive information should have a .gitattributes file (unencrypted and included in the repo) which references the .gitencrypt directory (which contains everything Git needs to encrypt/decrypt the project transparently) and which is present on your local machine.
Now, when you push the repository containing your sensitive information to a remote repository, the files will be transparently encrypted. When you pull from a local machine which has the .gitencrypt directory (containing your passphrase), the files will be transparently decrypted.
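For illustration only, the wiring might look roughly like this (the filter name and paths are assumptions; the gist has the authoritative versions):
# .gitattributes
* filter=openssl diff=openssl

# .git/config (or set with `git config`)
[filter "openssl"]
    smudge = ~/.gitencrypt/smudge_filter_openssl
    clean = ~/.gitencrypt/clean_filter_openssl
[diff "openssl"]
    textconv = ~/.gitencrypt/diff_filter_openssl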
Notes
I should note that this tutorial does not describe a way to only encrypt your sensitive settings file. This will transparently encrypt the entire repository that is pushed to the remote VC host and decrypt the entire repository so it is entirely decrypted locally. To achieve the behavior you want, you could place sensitive files for one or many projects in one sensitive_settings_repo. You could investigate how this transparent encryption technique works with Git submodules http://git-scm.com/book/en/Git-Tools-Submodules if you really need the sensitive files to be in the same repository.
The use of a fixed passphrase could theoretically lead to brute-force vulnerabilities if attackers had access to many encrypted repos/files. IMO, the probability of this is very low. As a note at the bottom of this tutorial mentions, not using a fixed passphrase will result in local versions of a repo on different machines always showing that changes have occurred with ‘git status’.
The traditional approach for handling such config vars is to put them under source control – in a properties file of some sort. This is an error-prone process, and is especially complicated for open source apps, which often have to maintain separate (and private) branches with app-specific configurations.
A better solution is to use environment variables, and keep the keys out of the code. On a traditional host or working locally you can set environment vars in your bashrc. On Heroku, you use config vars.
With Foreman and .env files Heroku provide an enviable toolchain to export, import and synchronise environment variables.
Personally, I believe it’s wrong to save secret keys alongside code. It’s fundamentally inconsistent with source control, because the keys are for services extrinsic to the code. The one boon would be that a developer can clone HEAD and run the application without any setup. However, suppose a developer checks out a historic revision of the code. Their copy will include last year’s database password, so the application will fail against today’s database.
With the Heroku method above, a developer can checkout last year’s app, configure it with today’s keys, and run it successfully against today’s database.
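A minimal sketch of reading such a config var from the environment (the variable name is illustrative):
import os

# set DATABASE_PASSWORD in your bashrc, .env file, or Heroku config vars
DATABASE_PASSWORD = os.environ["DATABASE_PASSWORD"]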
The cleanest way in my opinion is to use environment variables. You won’t have to deal with .dist files for example, and the project state on the production environment would be the same as your local machine’s.
I recommend reading The Twelve-Factor App‘s config chapter, the others too if you’re interested.
BlackBox was recently released by StackExchange and while I have yet to use it, it seems to exactly address the problems and support the features requested in this question.
Safely store secrets in a VCS repo (i.e. Git or Mercurial). These
commands make it easy for you to GPG encrypt specific files in a repo
so they are “encrypted at rest” in your repository. However, the
scripts make it easy to decrypt them when you need to view or edit
them, and decrypt them for use in production.
git-crypt uses GPG to transparently encrypt files when their names match certain patterns. For instance, if you add to your .gitattributes file…
*.secret.* filter=git-crypt diff=git-crypt
…then a file like config.secret.json will always be pushed to remote repos with encryption, but remain unencrypted on your local file system.
If you want to add a new GPG key (a person) to your repo which can decrypt the protected files, then run git-crypt add-gpg-user <gpg_user_key>. This creates a new commit. The new user will be able to decrypt subsequent commits.
I ask the question generally, but in my specific instance I would like
to store secret keys and passwords for a Django/Python site using git
and github.
No, just don’t, even if it’s your private repo and you never intend to share it, don’t.
You should create a local_settings.py, put it on your VCS ignore list, and in your settings.py do something like the sketch below.
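A common form of that pattern (a sketch; adjust the module path to your project):
# at the bottom of settings.py
try:
    from local_settings import *  # local, untracked overrides
except ImportError:
    pass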
EDIT: I assume you want to keep track of your previous password versions – say, for a script that would prevent password reuse, etc.
I think GnuPG is the best way to go – it’s already used in one git-related project (git-annex) to encrypt repository contents stored on cloud services. GnuPG (gnu pgp) provides a very strong key-based encryption.
You keep a key on your local machine.
You add ‘mypassword’ to ignored files.
On pre-commit hook you encrypt the mypassword file into the mypassword.gpg file tracked by git and add it to the commit.
On post-merge hook you just decrypt mypassword.gpg into mypassword.
Now if your ‘mypassword’ file did not change, then encrypting it will result in the same ciphertext and it won’t be added to the index (no redundancy). The slightest modification of mypassword results in radically different ciphertext, and mypassword.gpg in the staging area will differ a lot from the one in the repository, and thus will be added to the commit. Even if the attacker gets hold of your gpg key, he still needs to brute-force the password. If the attacker gets access to the remote repository with the ciphertext, he can compare a bunch of ciphertexts, but their number won’t be sufficient to give him any non-negligible advantage.
Later on you can use .gitattributes to provide on-the-fly decryption for a quick git diff of your password.
Also you can have separate keys for different types of passwords etc.
This is the best way to manage a set of sane defaults for the config you check in, without requiring the config to be complete or to contain things like hostnames and credentials. There are a few ways to override default configs.
Environment variables (as others have already mentioned) are one way of doing it.
The best way is to look for an external config file that overrides the default config values. This allows you to manage the external configs via a configuration management system like Chef, Puppet or Cfengine. Configuration management is the standard answer for the management of configs separate from the codebase so you don’t have to do a release to update the config on a single host or a group of hosts.
FYI: Encrypting creds is not always a best practice, especially in a place with limited resources. It may be the case that encrypting creds will gain you no additional risk mitigation and simply add an unnecessary layer of complexity. Make sure you do the proper analysis before making a decision.
Encrypt the passwords file, using for example GPG. Add the keys on your local machine and on your server. Decrypt the file and put it outside your repo folders.
I use a passwords.conf file, located in my home folder. On every deploy this file gets updated.
No, private keys and passwords do not fall under revision control. There is no reason to burden everyone with read access to your repository with knowing sensitive service credentials used in production, when most likely not all of them should have access to those services.
Starting with Django 1.4, your Django projects now ship with a project.wsgi module that defines the application object, and it’s a perfect place to start enforcing the use of a project.local settings module that contains site-specific configurations.
This settings module is ignored from revision control, but its presence is required when running your project instance as a WSGI application, as is typical for production environments. This is how it should look:
import os
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "project.local")
# This application object is used by the development server
# as well as any WSGI server configured to use this file.
from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()
Now you can have a local.py module whose owner and group can be configured so that only authorized personnel and the Django processes can read the file’s contents.
If you need VCS for your secrets, you should at least keep them in a second repository separated from your actual code. That way you can give your team members access to the source-code repository and they won’t see your credentials. Furthermore, host this repository somewhere else (e.g. on your own server with an encrypted filesystem, not on GitHub), and for checking it out to the production system you could use something like git-submodule.
Another approach could be to completely avoid saving secrets in version control systems and instead use a tool like Vault from HashiCorp: a secret store with key rolling and auditing, with an API and embedded encryption.
Answer 15
Here’s what I do:
Keep all secrets as env vars in $HOME/.secrets (go-r perms) that $HOME/.bashrc sources (this way if you open .bashrc in front of someone, they won’t see the secrets)
Configuration files are stored in VCS as templates, such as config.properties stored as config.properties.tmpl
The template files contain a placeholder for the secret, such as:
my.password=##MY_PASSWORD##
On application deployment, a script is run that transforms the template file into the target file, replacing placeholders with the values of environment variables, such as changing ##MY_PASSWORD## to the value of $MY_PASSWORD.
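A minimal sketch of such a deploy step in Python (file and variable names follow the example above):
import os

# render config.properties from its template, substituting the secret
with open("config.properties.tmpl") as tmpl:
    content = tmpl.read().replace("##MY_PASSWORD##", os.environ["MY_PASSWORD"])

with open("config.properties", "w") as out:
    out.write(content)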
You could use EncFS if your system provides it. Thus you could keep your encrypted data as a subfolder of your repository, while providing your application a decrypted view of the data mounted alongside. As the encryption is transparent, no special operations are needed on pull or push.
It would, however, need to mount the EncFS folders, which could be done by your application based on a password stored elsewhere outside the versioned folders (e.g. in environment variables).
I want to check in a Python program if a word is in the English dictionary.
I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.
def is_english_word(word):
    pass  # how do I implement is_english_word?

is_english_word(token.lower())
In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> english word). How would I achieve that?
It won’t work well with WordNet, because WordNet does not contain all English words.
Another possibility based on NLTK without enchant is NLTK’s words corpus
>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
Answer 2
With NLTK:
from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    pass  # not an English word
else:
    pass  # an English word
You should refer to this article if you have trouble installing wordnet or want to try other approaches.
Answer 3
Using a set to store the word list because looking them up will be faster:
with open("english_words.txt") as word_file:
english_words = set(word.strip().lower() for word in word_file)
def is_english_word(word):
return word.lower() in english_words
print is_english_word("ham") # should be true if you have a good english_words.txt
To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I’d just include the plurals in the word list to begin with.
For a faster NLTK-based solution you could hash the set of words to avoid a linear search.
from nltk.corpus import words as nltk_words

def is_english_word(word):
    # creation of this dictionary would be done outside of
    # the function because you only need to do it once.
    dictionary = dict.fromkeys(nltk_words.words(), None)
    try:
        x = dictionary[word]
        return True
    except KeyError:
        return False
I find that there are 3 package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). pyenchant couldn’t be installed easily on win64 with py3. wordnet doesn’t work very well because its corpus isn’t complete. So for me, I chose the solution answered by @Sadik, and used set(words.words()) to speed it up.
from nltk.corpus import words

setofwords = set(words.words())
print("hello" in setofwords)
# True
Answer 6
Using pyEnchant.checker’s SpellChecker:
from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return False if ((len(errors) > 4) or len(quote.split()) < 3) else True
print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
For a semantic-web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request and return the results in JSON format, then parse them using Python’s json module. If it’s not an English word you’ll get no results.
If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access this from every programming language, which is why I thought you might want to know about it.
Now, for Python users specifically, the code below assigns the list words the value of every single word:
import re

with open("/usr/share/dict/words", "r") as word_file:
    words = re.sub(r"[^\w]", " ", word_file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts")            ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False
pd.unique returns the unique values from an input array, or DataFrame column or index.
The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:
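>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)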
Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method’s default 'C' order.
An alternative way is to select the columns and pass them to np.unique:
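>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)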
There is no need to use ravel() here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values.
The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):
>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop
An updated solution: NumPy v1.13+ requires specifying the axis in np.unique when using multiple columns; otherwise the array is implicitly flattened.
import numpy as np
np.unique(df[['col1', 'col2']], axis=0)
For reference, the sample DataFrame from the question, and the desired result:
   Col1   Col2      Col3
0   Bob    Joe  0.201079
1   Joe  Steve  0.703279
2  Bill    Bob  0.722724
3  Mary    Bob  0.093912
4   Joe  Steve  0.766027

set(['Steve', 'Bob', 'Bill', 'Joe', 'Mary'])
I am receiving the following error when importing pandas in a Python program
monas-mbp:book mona$ sudo pip install python-dateutil
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
Cleaning up...
monas-mbp:book mona$ python t1.py
No module named dateutil.parser
Traceback (most recent call last):
File "t1.py", line 4, in <module>
import pandas as pd
File "/Library/Python/2.7/site-packages/pandas/__init__.py", line 6, in <module>
from . import hashtable, tslib, lib
File "tslib.pyx", line 31, in init pandas.tslib (pandas/tslib.c:48782)
ImportError: No module named dateutil.parser
Also here’s the program:
import codecs
from math import sqrt
import numpy as np
import pandas as pd
users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
"Norah Jones": 4.5, "Phoenix": 5.0,
"Slightly Stoopid": 1.5,
"The Strokes": 2.5, "Vampire Weekend": 2.0},
"Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
"Deadmau5": 4.0, "Phoenix": 2.0,
"Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
"Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
"Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
"Slightly Stoopid": 1.0},
"Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
"Deadmau5": 4.5, "Phoenix": 3.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 2.0},
"Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
"Norah Jones": 4.0, "The Strokes": 4.0,
"Vampire Weekend": 1.0},
"Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0,
"Norah Jones": 5.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 4.0},
"Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
"Norah Jones": 3.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.0, "The Strokes": 5.0},
"Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
"Phoenix": 4.0, "Slightly Stoopid": 2.5,
"The Strokes": 3.0}
}
class recommender:
def __init__(self, data, k=1, metric='pearson', n=5):
""" initialize recommender
currently, if data is dictionary the recommender is initialized
to it.
For all other data types of data, no initialization occurs
k is the k value for k nearest neighbor
metric is which distance formula to use
n is the maximum number of recommendations to make"""
self.k = k
self.n = n
self.username2id = {}
self.userid2name = {}
self.productid2name = {}
# for some reason I want to save the name of the metric
self.metric = metric
if self.metric == 'pearson':
self.fn = self.pearson
#
# if data is dictionary set recommender data to it
#
if type(data).__name__ == 'dict':
self.data = data
def convertProductID2name(self, id):
"""Given product id number return product name"""
if id in self.productid2name:
return self.productid2name[id]
else:
return id
def userRatings(self, id, n):
"""Return n top ratings for user with id"""
print ("Ratings for " + self.userid2name[id])
ratings = self.data[id]
print(len(ratings))
ratings = list(ratings.items())
ratings = [(self.convertProductID2name(k), v)
for (k, v) in ratings]
# finally sort and return
ratings.sort(key=lambda artistTuple: artistTuple[1],
reverse = True)
ratings = ratings[:n]
for rating in ratings:
print("%s\t%i" % (rating[0], rating[1]))
def loadBookDB(self, path=''):
"""loads the BX book dataset. Path is where the BX files are
located"""
self.data = {}
i = 0
#
# First load book ratings into self.data
#
f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
for line in f:
i += 1
#separate line into fields
fields = line.split(';')
user = fields[0].strip('"')
book = fields[1].strip('"')
rating = int(fields[2].strip().strip('"'))
if user in self.data:
currentRatings = self.data[user]
else:
currentRatings = {}
currentRatings[book] = rating
self.data[user] = currentRatings
f.close()
#
# Now load books into self.productid2name
# Books contains isbn, title, and author among other fields
#
f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
for line in f:
i += 1
#separate line into fields
fields = line.split(';')
isbn = fields[0].strip('"')
title = fields[1].strip('"')
author = fields[2].strip().strip('"')
title = title + ' by ' + author
self.productid2name[isbn] = title
f.close()
#
# Now load user info into both self.userid2name and
# self.username2id
#
f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
for line in f:
i += 1
#print(line)
#separate line into fields
fields = line.split(';')
userid = fields[0].strip('"')
location = fields[1].strip('"')
if len(fields) > 3:
age = fields[2].strip().strip('"')
else:
age = 'NULL'
if age != 'NULL':
value = location + ' (age: ' + age + ')'
else:
value = location
self.userid2name[userid] = value
self.username2id[location] = userid
f.close()
print(i)
def pearson(self, rating1, rating2):
sum_xy = 0
sum_x = 0
sum_y = 0
sum_x2 = 0
sum_y2 = 0
n = 0
for key in rating1:
if key in rating2:
n += 1
x = rating1[key]
y = rating2[key]
sum_xy += x * y
sum_x += x
sum_y += y
sum_x2 += pow(x, 2)
sum_y2 += pow(y, 2)
if n == 0:
return 0
# now compute denominator
denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
* sqrt(sum_y2 - pow(sum_y, 2) / n))
if denominator == 0:
return 0
else:
return (sum_xy - (sum_x * sum_y) / n) / denominator
def computeNearestNeighbor(self, username):
"""creates a sorted list of users based on their distance to
username"""
distances = []
for instance in self.data:
if instance != username:
distance = self.fn(self.data[username],
self.data[instance])
distances.append((instance, distance))
# sort based on distance -- closest first
distances.sort(key=lambda artistTuple: artistTuple[1],
reverse=True)
return distances
def recommend(self, user):
"""Give list of recommendations"""
recommendations = {}
# first get list of users ordered by nearness
nearest = self.computeNearestNeighbor(user)
#
# now get the ratings for the user
#
userRatings = self.data[user]
#
# determine the total distance
totalDistance = 0.0
for i in range(self.k):
totalDistance += nearest[i][1]
# now iterate through the k nearest neighbors
# accumulating their ratings
for i in range(self.k):
# compute slice of pie
weight = nearest[i][1] / totalDistance
# get the name of the person
name = nearest[i][0]
# get the ratings for this person
neighborRatings = self.data[name]
# get the name of the person
# now find bands neighbor rated that user didn't
for artist in neighborRatings:
if not artist in userRatings:
if artist not in recommendations:
recommendations[artist] = (neighborRatings[artist]
* weight)
else:
recommendations[artist] = (recommendations[artist]
+ neighborRatings[artist]
* weight)
# now make list from dictionary
recommendations = list(recommendations.items())
recommendations = [(self.convertProductID2name(k), v)
for (k, v) in recommendations]
# finally sort and return
recommendations.sort(key=lambda artistTuple: artistTuple[1],
reverse = True)
# Return the first n items
return recommendations[:self.n]
r = recommender(users)
# The author implementation
r.loadBookDB('/Users/mona/Downloads/BX-Dump/')
ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")
pivot_rating = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')