Tag Archives: Python

How do I sort a list of strings?

Question: How do I sort a list of strings?

What is the best way of creating an alphabetically sorted list in Python?


Answer 0

Basic answer:

mylist = ["b", "C", "A"]
mylist.sort()

This modifies your original list (i.e. sorts in-place). To get a sorted copy of the list, without changing the original, use the sorted() function:

for x in sorted(mylist):
    print x

However, the examples above are a bit naive, because they don’t take locale into account, and perform a case-sensitive sorting. You can take advantage of the optional parameter key to specify custom sorting order (the alternative, using cmp, is a deprecated solution, as it has to be evaluated multiple times – key is only computed once per element).

So, to sort according to the current locale, taking language-specific rules into account (cmp_to_key is a helper function from functools):

import locale
from functools import cmp_to_key

sorted(mylist, key=cmp_to_key(locale.strcoll))

And finally, if you need to, you can specify a custom locale for sorting:

import locale
from functools import cmp_to_key

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8') # vary depending on your lang/locale
assert sorted((u'Ab', u'ad', u'aa'),
  key=cmp_to_key(locale.strcoll)) == [u'aa', u'Ab', u'ad']

Last note: you will see examples of case-insensitive sorting which use the lower() method – those are incorrect, because they work only for the ASCII subset of characters. Those two are wrong for any non-English data:

# this is incorrect!
mylist.sort(key=lambda x: x.lower())
# alternative notation, a bit faster, but still wrong
mylist.sort(key=str.lower)
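
For what it's worth, Python 3's str.casefold handles more Unicode case mappings than lower() (the German ß, for example), though it still does not replace locale-aware collation. A minimal sketch:

mylist = ["b", "C", "A", "straße"]
# casefold() is a more aggressive lower(): "straße".casefold() == "strasse"
print(sorted(mylist, key=str.casefold))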

Answer 1

It is also worth noting the sorted() function:

for x in sorted(list):
    print x

This returns a new, sorted version of a list without changing the original list.


Answer 2

list.sort()

It really is that simple :)


Answer 3

The proper way to sort strings is:

import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8') # vary depending on your lang/locale
assert sorted((u'Ab', u'ad', u'aa'), cmp=locale.strcoll) == [u'aa', u'Ab', u'ad']

# Without using locale.strcoll you get:
assert sorted((u'Ab', u'ad', u'aa')) == [u'Ab', u'aa', u'ad']

The previous example of mylist.sort(key=lambda x: x.lower()) will work fine for ASCII-only contexts. (Note that the cmp argument shown above is Python 2 only; in Python 3, use key=cmp_to_key(locale.strcoll) as in the earlier answer.)


Answer 4

Please use the sorted() function in Python 3:

items = ["love", "like", "play", "cool", "my"]
sorted(items)

Answer 5

But how does this handle language-specific sorting rules? Does it take locale into account?

No, list.sort() is a generic sorting function. If you want to sort according to the Unicode rules, you’ll have to define a custom sort key function. You can try using the pyuca module, but I don’t know how complete it is.
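
As a hedged sketch of that suggestion (this assumes pyuca's documented Collator API; the package must be installed separately, and the word list is illustrative):

from pyuca import Collator

c = Collator()
# sort_key turns each string into a Unicode collation key
print(sorted(['cafe', 'caff', 'café'], key=c.sort_key))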


Answer 6

Old question, but if you want to do locale-aware sorting without setting locale.LC_ALL you can do so by using the PyICU library as suggested by this answer:

import icu # PyICU

def sorted_strings(strings, locale=None):
    if locale is None:
       return sorted(strings)
    collator = icu.Collator.createInstance(icu.Locale(locale))
    return sorted(strings, key=collator.getSortKey)

Then call with e.g.:

new_list = sorted_strings(list_of_strings, "de_DE.utf8")

This worked for me without installing any locales or changing other system settings.

(This was already suggested in a comment above, but I wanted to give it more prominence, because I missed it myself at first.)


Answer 7

Suppose s = "ZWzaAd"

To sort the string above, a simple solution is the one below.

print ''.join(sorted(s))

Answer 8

Or maybe:

names = ['Jasmine', 'Alberto', 'Ross', 'dig-dog']
print ("The solution for this is about this names being sorted:",sorted(names, key=lambda name:name.lower()))

Answer 9

l = ['abc', 'cd', 'xy', 'ba', 'dc']
l.sort()
print(l)

Result

['abc', 'ba', 'cd', 'dc', 'xy']


Answer 10

It is simple: https://trinket.io/library/trinkets/5db81676e4

scores = '54 - Alice,35 - Bob,27 - Carol,27 - Chuck,05 - Craig,30 - Dan,27 - Erin,77 - Eve,14 - Fay,20 - Frank,48 - Grace,61 - Heidi,03 - Judy,28 - Mallory,05 - Olivia,44 - Oscar,34 - Peggy,30 - Sybil,82 - Trent,75 - Trudy,92 - Victor,37 - Walter'

scores = scores.split(',')
for x in sorted(scores):
    print(x)


How do I save a Python interactive session?

Question: How do I save a Python interactive session?

I find myself frequently using Python’s interpreter to work with databases, files, etc — basically a lot of manual formatting of semi-structured data. I don’t properly save and clean up the useful bits as often as I would like. Is there a way to save my input into the shell (db connections, variable assignments, little for loops and bits of logic) — some history of the interactive session? If I use something like script I get too much stdout noise. I don’t really need to pickle all the objects — though if there is a solution that does that, it would be OK. Ideally I would just be left with a script that ran as the one I created interactively, and I could just delete the bits I didn’t need. Is there a package that does this, or a DIY approach?

UPDATE: I am really amazed at the quality and usefulness of these packages. For those with a similar itch:

  • IPython — should have been using this for ages, kind of what I had in mind
  • reinteract — very impressive, I want to learn more about visualization and this seems like it will shine there. Sort of a gtk/gnome desktop app that renders graphs inline. Imagine a hybrid shell + graphing calculator + mini eclipse. Source distribution here: http://www.reinteract.org/trac/wiki/GettingIt . Built fine on Ubuntu, integrates into gnome desktop, Windows and Mac installers too.
  • bpython — extremely cool, lots of nice features, autocomplete(!), rewind, one keystroke save to file, indentation, well done. Python source distribution, pulled a couple of dependencies from sourceforge.

I am converted, these really fill a need between interpreter and editor.


Answer 0

IPython is extremely useful if you like using interactive sessions. For example, for your use case there is the %save magic command: you just input %save my_useful_session 10-20 23 to save input lines 10 to 20 and 23 to my_useful_session.py (to help with this, every line is prefixed by its number).

Furthermore, the documentation states:

This function uses the same syntax as %history for input ranges, then saves the lines to the filename you specify.

This allows you, for example, to reference older sessions, such as:

%save current_session ~0/
%save previous_session ~1/

Look at the videos on the presentation page to get a quick overview of the features.


Answer 1

http://www.andrewhjon.es/save-interactive-python-session-history

import readline
readline.write_history_file('/home/ahj/history')

Answer 2

There is a way to do it. Store the file in ~/.pystartup

# Add auto-completion and a stored history file of commands to your Python
# interactive interpreter. Requires Python 2.0+, readline. Autocomplete is
# bound to the Esc key by default (you can change it - see readline docs).
#
# Store the file in ~/.pystartup, and set an environment variable to point
# to it:  "export PYTHONSTARTUP=/home/user/.pystartup" in bash.
#
# Note that PYTHONSTARTUP does *not* expand "~", so you have to put in the
# full path to your home directory.

import atexit
import os
import readline
import rlcompleter

historyPath = os.path.expanduser("~/.pyhistory")

def save_history(historyPath=historyPath):
    import readline
    readline.write_history_file(historyPath)

if os.path.exists(historyPath):
    readline.read_history_file(historyPath)

atexit.register(save_history)
del os, atexit, readline, rlcompleter, save_history, historyPath

and then set the environment variable PYTHONSTARTUP in your shell (e.g. in ~/.bashrc):

export PYTHONSTARTUP=$HOME/.pystartup

You can also add this to get autocomplete for free:

readline.parse_and_bind('tab: complete')

Please note that this will only work on *nix systems, as readline is only available on Unix platforms.


Answer 3

If you are using IPython, you can save all your previous commands to a file using the magic function %history with the -f parameter, e.g.:

%history -f /tmp/history.py

Answer 4

After installing IPython and opening an IPython session by running the command:

ipython

from your command line, just run the following IPython 'magic' command to automatically log your entire IPython session:

%logstart

This will create a uniquely named .py file and store your session for later use as an interactive IPython session or for use in the script(s) of your choosing.
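
As a further illustration (the file name and mode here are just examples), %logstart also accepts a log file name and a logging mode, so the session can be appended to a named script:

%logstart my_session.py append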


Answer 5

Also, reinteract gives you a notebook-like interface to a Python session.


Answer 6

In addition to IPython, a similar utility, bpython, has a "save the code you've entered to a file" feature.


Answer 7

I had to struggle to find an answer, as I was very new to the IPython environment.

This will work.

If your IPython session looks like this:

In [1] : import numpy as np
....
In [135]: counter=collections.Counter(mapusercluster[3])
In [136]: counter
Out[136]: Counter({2: 700, 0: 351, 1: 233})

To save lines 1 through 135, use this command in the same IPython session:

In [137]: %save test.py 1-135

This will save all your Python statements to the file test.py in your current directory (where you started IPython).


Answer 8

There is %history magic for printing and saving the input history (and optionally the output).

To store your current session to a file named my_history.py:

>>> %hist -f my_history.py

IPython stores both the commands you enter and the results they produce. You can easily go through previous commands with the up and down arrow keys, or access your history in more sophisticated ways.

You can use the %history magic function to examine past input and output. Input history from previous sessions is saved in a database, and IPython can be configured to save output history.

Several other magic functions can use your input history, including %edit, %rerun, %recall, %macro, %save and %pastebin. You can use a standard format to refer to lines:

%pastebin 3 18-20 ~1/1-5

This will take line 3 and lines 18 to 20 from the current session, and lines 1-5 from the previous session.

See %history? for the docstring and more examples.

Also, be sure to explore the capabilities of %store magic for lightweight persistence of variables in IPython.

Stores variables, aliases and macros in IPython’s database.

d = {'a': 1, 'b': 2}
%store d  # stores the variable
del d

%store -r d  # Refresh the variable from IPython's database.
>>> d
{'a': 1, 'b': 2}

To automatically restore stored variables on startup, specify c.StoreMagic.autorestore = True in ipython_config.py.


Answer 9

Just putting another suggestion in the bowl: Spyder.

It has a history log and a variable explorer. If you have worked with MATLAB, you'll see the similarities.


Answer 10

As far as Linux goes, one can use the script command to record the whole session. It is part of the util-linux package, so it should be on most Linux systems. You can create an alias or function that calls script -c python, and the session will be saved to a typescript file. For instance, here's a reprint of one such file.

$ cat typescript                                                                                                      
Script started on Sat 14 May 2016 08:30:08 AM MDT
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'Hello Pythonic World'
Hello Pythonic World
>>> 

Script done on Sat 14 May 2016 08:30:42 AM MDT

A small disadvantage here is that script records everything, even line feeds and backspaces. So you may want to use col to clean up the output (see this post on Unix & Linux Stack Exchange).


Answer 11

The %history command is awesome, but unfortunately it won't let you save things that were %paste'd into the session. To do that, I think you have to run %logstart at the beginning (although I haven't confirmed this works).

What I like to do is

%history -o -n -p -f filename.txt

which will save the output, line numbers, and '>>>' before each input (the o, n, and p options). See the docs for %history here.


Answer 12

There is another option: PySlices. In the "wxPython 2.8 docs, demos and tools" distribution there is an open-source program named "pyslices".

You can use it like an editor, and it also supports being used like a console, executing each line like an interactive interpreter with immediate echo.

All the blocks of code, and the results of each block, are automatically recorded into a txt file.

The results are logged just after the corresponding block of code. Very convenient.


Answer 13

If you use bpython, all your command history is by default saved to ~/.pythonhist.

To save the commands for later reuse, you can copy them to a Python script file:

$ cp ~/.pythonhist mycommands.py

Then edit that file to clean it up, and put it on the Python path (the global or virtual environment's site-packages, the current directory, a mention in a *.pth file, or some other way).

To include the commands into your shell, just import them from the saved file:

>>> from mycommands import *

Answer 14

Some comments were asking how to save all of the IPython inputs at once. For the %save magic in IPython, you can save all of the commands programmatically as shown below, avoiding the prompt message and avoiding having to specify the input numbers:

currentLine = len(In) - 1
%save -f my_session 1-$currentLine

The -f option forces file replacement, and len(In) - 1 is the number of the current input prompt in IPython, allowing you to save the whole session programmatically.


Answer 15

For those using Spacemacs, and the IPython that comes with python-layer, the %save magic creates a lot of unwanted output, because of the auto-completion commands constantly working in the background, such as:

len(all_suffixes)
';'.join(__PYTHON_EL_get_completions('''len'''))
';'.join(__PYTHON_EL_get_completions('''all_substa'''))
len(all_substantives_w_suffixes)
';'.join(__PYTHON_EL_get_completions('''len'''))
';'.join(__PYTHON_EL_get_completions('''all'''))
';'.join(__PYTHON_EL_get_completions('''all_'''))
';'.join(__PYTHON_EL_get_completions('''all_w'''))
';'.join(__PYTHON_EL_get_completions('''all_wo'''))
';'.join(__PYTHON_EL_get_completions('''all_wor'''))
';'.join(__PYTHON_EL_get_completions('''all_word'''))
';'.join(__PYTHON_EL_get_completions('''all_words'''))
len(all_words_w_logograms)
len(all_verbs)

To avoid this, just save the IPython buffer the way you normally save any other buffer: SPC f s.


Answer 16

I'd like to suggest another way to maintain a Python session, through tmux on Linux. You run tmux and attach yourself to the session you opened (if you are not attached after opening it directly). You execute python and do whatever you are doing in it. Then you detach from the session. Detaching from a tmux session does not close the session; it remains open.

Pros of this method: you can attach to this session from any other device (provided you can SSH to your PC).

Cons of this method: it does not relinquish the resources used by the open Python session until you actually exit the Python interpreter.


Answer 17

To save input and output on XUbuntu:

  1. In XWindows, run IPython from the Xfce terminal app
  2. Click Terminal in the top menu bar and look for "save contents" in the dropdown

I find this saves the input and output, going all the way back to when I opened the terminal. This is not IPython-specific, and would work with SSH sessions or other tasks run from the terminal window.


Setting y-axis limits in matplotlib

Question: Setting y-axis limits in matplotlib

I need help with setting the limits of y-axis on matplotlib. Here is the code that I tried, unsuccessfully.

import matplotlib.pyplot as plt

plt.figure(1, figsize = (8.5,11))
plt.suptitle('plot title')
ax = []
aPlot = plt.subplot(321, axisbg = 'w', title = "Year 1")
ax.append(aPlot)
plt.plot(paramValues,plotDataPrice[0], color = '#340B8C', 
     marker = 'o', ms = 5, mfc = '#EB1717')
plt.xticks(paramValues)
plt.ylabel('Average Price')
plt.xlabel('Mark-up')
plt.grid(True)
plt.ylim((25,250))

With the data I have for this plot, I get y-axis limits of 20 and 200. However, I want the limits 20 and 250.


Answer 0

Try this. It works for subplots too.

axes = plt.gca()
axes.set_xlim([xmin,xmax])
axes.set_ylim([ymin,ymax])

Answer 1

Your code works also for me. However, another workaround can be to get the plot’s axis and then change only the y-values:

x1,x2,y1,y2 = plt.axis()
plt.axis((x1,x2,25,250))


Answer 2

One thing you can do is to set your axis range by yourself by using matplotlib.pyplot.axis.

matplotlib.pyplot.axis

from matplotlib import pyplot as plt
plt.axis([0, 10, 0, 20])

0, 10 is the x-axis range; 0, 20 is the y-axis range.

Or you can use matplotlib.pyplot.xlim or matplotlib.pyplot.ylim:

matplotlib.pyplot.ylim

plt.ylim(-2, 2)
plt.xlim(0,10)

Answer 3

You can instantiate an object from matplotlib.pyplot.axes and call the set_ylim() on it. It would be something like this:

import matplotlib.pyplot as plt
axes = plt.axes()
axes.set_ylim([0, 1])

Answer 4

This worked at least in matplotlib version 2.2.2:

plt.axis([None, None, 0, 100])

This is probably a nice way to set only some of the bounds, for example just xmin and ymax, leaving the others on autoscale.


Answer 5

To add to @Hima’s answer, if you want to modify a current x or y limit you could use the following.

import numpy as np # you probably already do this, so no extra overhead
import matplotlib.pyplot as plt

fig, axes = plt.subplots()
axes.plot(data[:,0], data[:,1])
xlim = axes.get_xlim()
# example of how to zoom out by a factor of 0.1
factor = 0.1
new_xlim = (xlim[0] + xlim[1])/2 + np.array((-0.5, 0.5)) * (xlim[1] - xlim[0]) * (1 + factor)
axes.set_xlim(new_xlim)

I find this particularly useful when I want to zoom out or zoom in just a little from the default plot settings.


Answer 6

This should work. Your code works for me, like for Tamás and Manoj Govindan. It looks like you could try to update Matplotlib. If you can’t update Matplotlib (for instance if you have insufficient administrative rights), maybe using a different backend with matplotlib.use() could help.


Answer 7

Just for fine-tuning: if you want to set only one of the axis boundaries and leave the other unchanged, you can use one or more of the following statements.

plt.xlim(right=xmax) #xmax is your value
plt.xlim(left=xmin) #xmin is your value
plt.ylim(top=ymax) #ymax is your value
plt.ylim(bottom=ymin) #ymin is your value

Take a look at the documentation for xlim and for ylim


Answer 8

If another axes (generated by code below the code shown in the question) shares its range with the first axes, make sure you set the range after the last plot call on that axes.
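
A minimal sketch of that gotcha, assuming two subplots that share a y-axis (the data is illustrative):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot([0, 1, 2], [10, 200, 30])
ax2.plot([0, 1, 2], [5, 50, 500])  # autoscaling here can override ax1's limits
ax1.set_ylim(25, 250)              # so set the limits after the last shared plot
plt.show()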


Convert a list of characters into a string

Question: Convert a list of characters into a string

If I have a list of chars:

a = ['a','b','c','d']

How do I convert it into a single string?

a = 'abcd'

Answer 0

Use the join method of the empty string to join all of the strings together with the empty string in between, like so:

>>> a = ['a', 'b', 'c', 'd']
>>> ''.join(a)
'abcd'
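
For illustration, the same method joins with any separator string, not just the empty one:

>>> a = ['a', 'b', 'c', 'd']
>>> '-'.join(a)
'a-b-c-d'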

Answer 1

This works in many popular languages like JavaScript and Ruby, why not in Python?

>>> ['a', 'b', 'c'].join('')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'join'

Strangely enough, in Python the join method is on the str class:

# this is the Python way
"".join(['a','b','c','d'])

Why is join not a method on the list object, as in JavaScript and other popular scripting languages? It is one example of how the Python community thinks. Since join returns a string, it should be placed in the string class, not in the list class, so the str.join(list) method means: join the list into a new string using str as a separator (in this case str is an empty string).

Somehow I got to love this way of thinking after a while. I can complain about a lot of things in Python design, but not about its coherence.


Answer 2

If your Python interpreter is old (1.5.2, for example, which is common on some older Linux distributions), you may not have join() available as a method on any old string object, and you will instead need to use the string module. Example:

a = ['a', 'b', 'c', 'd']

try:
    b = ''.join(a)

except AttributeError:
    import string
    b = string.join(a, '')

The string b will be 'abcd'.


Answer 3

This may be the fastest way:

>>> from array import array
>>> a = ['a','b','c','d']
>>> array('B', map(ord,a)).tostring()
'abcd'

Answer 4

The reduce function also works (in Python 3, import it from functools):

import operator
h=['a','b','c','d']
reduce(operator.add, h)
'abcd'

Answer 5

If the list contains numbers, you can use map() with join().

E.g.:

>>> arr = [3, 30, 34, 5, 9]
>>> ''.join(map(str, arr))
3303459

Answer 6

h = ['a','b','c','d','e','f']
g = ''
for f in h:
    g = g + f

>>> g
'abcdef'

Answer 7

Besides str.join, which is the most natural way, a possibility is to use io.StringIO and abuse writelines to write all the elements in one go:

import io

a = ['a','b','c','d']

out = io.StringIO()
out.writelines(a)
print(out.getvalue())

prints:

abcd

When using this approach with a generator function or an iterable which isn’t a tuple or a list, it saves the temporary list creation that join does to allocate the right size in one go (and a list of 1-character strings is very expensive memory-wise).

If you’re low in memory and you have a lazily-evaluated object as input, this approach is the best solution.
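
A short sketch of that generator case (the generator here is illustrative):

import io

def chars():  # a lazy source of 1-character strings
    for c in 'abcd':
        yield c

out = io.StringIO()
out.writelines(chars())  # consumes the generator without building a list
print(out.getvalue())    # abcd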


Answer 8

You could also use operator.concat() like this:

>>> from operator import concat
>>> a = ['a', 'b', 'c', 'd']
>>> reduce(concat, a)
'abcd'

If you’re using Python 3 you need to prepend:

>>> from functools import reduce

since the builtin reduce() has been removed from Python 3 and now lives in functools.reduce().


Convert a JSON string to a dict using Python

Question: Convert a JSON string to a dict using Python

I’m a little bit confused with JSON in Python. To me, it seems like a dictionary, and for that reason I’m trying to do that:

{
    "glossary":
    {
        "title": "example glossary",
        "GlossDiv":
        {
            "title": "S",
            "GlossList":
            {
                "GlossEntry":
                {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef":
                    {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

But when I do print dict(json), it gives an error.

How can I transform this string into a structure and then call json["title"] to obtain “example glossary”?


Answer 0

json.loads()

import json

d = json.loads(j)  # j is the JSON string shown in the question
print d['glossary']['title']

Answer 1

When I started using json, I was confused and unable to figure it out for some time, but finally I got what I wanted. Here is the simple solution:

import json
m = {'id': 2, 'name': 'hussain'}
n = json.dumps(m)
o = json.loads(n)
print(o['id'], o['name'])

Answer 2

Use simplejson or cjson for speedups:

import simplejson as json

json.loads(obj)

or:

import cjson

cjson.decode(obj)

Answer 3

If you trust the data source, you can use eval to convert your string into a dictionary (note that this works here only because the string uses Python literals such as True; real JSON uses true, false, and null, which eval will not accept):

eval(your_json_format_string)

Example:

>>> x = "{'a' : 1, 'b' : True, 'c' : 'C'}"
>>> y = eval(x)

>>> print x
{'a' : 1, 'b' : True, 'c' : 'C'}
>>> print y
{'a': 1, 'c': 'C', 'b': True}

>>> print type(x), type(y)
<type 'str'> <type 'dict'>

>>> print y['a'], type(y['a'])
1 <type 'int'>

>>> print y['b'], type(y['b'])
True <type 'bool'>

>>> print y['c'], type(y['c'])
C <type 'str'>
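
A safer variant of the same idea is the standard library's ast.literal_eval, which accepts only Python literals and so cannot run arbitrary code. A minimal sketch:

import ast

x = "{'a': 1, 'b': True, 'c': 'C'}"
y = ast.literal_eval(x)  # parses literals only; raises on anything else
print(type(y), y['a'])   # <class 'dict'> 1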

How do I get a random number between two floats?

Question: How do I get a random number between two floats?

randrange(start, stop) only takes integer arguments. So how would I get a random number between two float values?


Answer 0

Use random.uniform(a, b):

>>> import random
>>> random.uniform(1.5, 1.9)
1.8733202628557872

Answer 1

random.uniform(a, b) appears to be what you're looking for. From the docs:

Return a random floating point number N such that a <= N <= b for a <= b and b <= N <= a for b < a.

See here.


Answer 2

If you want to generate a random float with N digits to the right of the decimal point, you can do this:

round(random.uniform(1,2), N)

The second argument is the number of decimal places.


Answer 3

Most commonly, you’d use:

import random
random.uniform(a, b) # range [a, b) or [a, b] depending on floating-point rounding

Python provides other distributions if you need them.

If you have numpy imported already, you can use its equivalent:

import numpy as np
np.random.uniform(a, b) # range [a, b)

Again, if you need another distribution, numpy provides the same distributions as python, as well as many additional ones.
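
For illustration, a couple of the other distributions alluded to (standard library and NumPy; the parameters are just examples):

import random
import numpy as np

print(random.gauss(0.0, 1.0))      # normal distribution, mean 0, stddev 1
print(np.random.normal(0.0, 1.0))  # NumPy equivalent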


How do I read CSV data into a record array in NumPy?

Question: How do I read CSV data into a record array in NumPy?

I wonder if there is a direct way to import the contents of a CSV file into a record array, much in the way that R’s read.table(), read.delim(), and read.csv() family imports data to R’s data frame?

Or is the best way to use csv.reader() and then apply something like numpy.core.records.fromrecords()?


Answer 0

You can use Numpy’s genfromtxt() method to do so, by setting the delimiter kwarg to a comma.

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

More information on the function can be found at its respective documentation.


Answer 1

I would recommend the read_csv function from the pandas library:

import pandas as pd
df=pd.read_csv('myfile.csv', sep=',',header=None)
df.values
array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

This gives a pandas DataFrame – allowing many useful data manipulation functions which are not directly available with numpy record arrays.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table…


I would also recommend genfromtxt. However, since the question asks for a record array, as opposed to a normal array, the dtype=None parameter needs to be added to the genfromtxt call:

Given an input file, myfile.csv:

1.0, 2, 3
4, 5.5, 6

import numpy as np
np.genfromtxt('myfile.csv',delimiter=',')

gives an array:

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

and

np.genfromtxt('myfile.csv',delimiter=',',dtype=None)

gives a record array:

array([(1.0, 2.0, 3), (4.0, 5.5, 6)], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

This has the advantage that files with multiple data types (including strings) can be easily imported.


Answer 2

I timed the

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

versus

import csv
import numpy as np
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = delimiter,
                           quotechar = '"')
    data = [data for data in data_iter]
data_array = np.asarray(data, dtype = <whatever options>)

on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.

I would recommend the csv-list comprehension method, as it most likely relies on pre-compiled libraries and not on the interpreter as much as NumPy. I suspect the pandas method would have similar interpreter overhead.


Answer 3

You can also try recfromcsv() which can guess data types and return a properly formatted record array.
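
A hedged sketch of that (np.recfromcsv shipped with older NumPy releases; newer releases point you to genfromtxt(..., dtype=None) instead, and the filename is illustrative):

import numpy as np

# Guesses a dtype per column and returns a record array
my_data = np.recfromcsv('myfile.csv', delimiter=',')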


Answer 4

Having tried both ways, NumPy and Pandas, I found that using pandas has a lot of advantages:

  • Faster
  • Less CPU usage
  • 1/3 RAM usage compared to NumPy genfromtxt

This is my test code:

$ for f in test_pandas.py test_numpy_csv.py ; do  /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps

23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps

test_numpy_csv.py

from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')

test_pandas.py

from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')

Data file:

du -h ~/me/notebook/train.csv
 59M    /home/hvn/me/notebook/train.csv

With NumPy and pandas at versions:

$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2

Answer 5

You can use this code to send CSV file data into an array:

import numpy as np
csv = np.genfromtxt('test.csv', delimiter=",")
print(csv)

Answer 6

Using numpy.loadtxt

A quite simple method, but it requires all the elements to be numeric (float, int, and so on):

import numpy as np 
data = np.loadtxt('c:\\1.csv',delimiter=',',skiprows=0)  

Answer 7

This is the easiest way:

import csv

with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

Now each entry in data is a row, represented as a list, so you have a 2D structure. It saved me so much time.


Answer 8

I tried this:

import pandas as p
import numpy as n

closingValue = p.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)

Answer 9

I would suggest using tables (pip3 install tables). You can save your .csv file to .h5 using pandas (pip3 install pandas),

import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()

You can then easily, and in less time even for huge amounts of data, load your data into a NumPy array.

import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()

# Data in NumPy format
data = data.values

Answer 10

This works like a charm:

import csv
with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))

import numpy as np
data = np.array(data, dtype=float)  # np.float was removed in newer NumPy releases; plain float works

UnicodeDecodeError when reading a CSV file into Pandas with Python

Question: UnicodeDecodeError when reading a CSV file into Pandas with Python

I’m running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error…

   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

The source/creation of these files all come from the same place. What’s the best way to correct this to proceed with the import?


Answer 0

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.

You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).

See relevant Pandas documentation, python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.

To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see its man page), file -i (Linux), or file -I (OS X).
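
As a hedged Python-side alternative (chardet is a third-party package that must be installed separately; the filename is illustrative), you can guess the encoding before handing the file to pandas:

import chardet
import pandas as pd

with open('file.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))  # sample the first ~100 KB

print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
df = pd.read_csv('file.csv', encoding=guess['encoding'])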


Answer 1

Simplest of all Solutions:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

Alternate Solution:

  • Open the csv file in Sublime text editor.
  • Save the file in utf-8 format.

In Sublime, click File -> Save with Encoding -> UTF-8.

Then, you can read your file as usual:

import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

Other encoding types you may try are:

encoding = "cp1252"
encoding = "ISO-8859-1"

Answer 2

Pandas allows you to specify the encoding, but does not allow you to ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method; the right approach depends on the actual use case.

  1. You know the encoding, and there is no encoding error in the file. Great: you just have to specify the encoding:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
    
  2. You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. OK, you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code point):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
    
  3. You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF8 file that has been edited with a non-UTF8 editor and which contains some lines with a different encoding. Pandas has no provision for special error processing, but Python's open function has (assuming Python 3), and read_csv accepts a file-like object. Typical values for the errors parameter here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequences:

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
    pd.read_csv(input_fd, ...)
    

Answer 3

with open('filename.csv') as f:
   print(f)

After executing this code you will find the encoding of 'filename.csv'; then execute code like the following:

data=pd.read_csv('filename.csv', encoding="encoding as you found earlier")

There you go.


Answer 4

In my case, a file had UCS-2 LE BOM encoding, according to Notepad++. For Python, that is encoding="utf_16_le".

Hopefully this helps someone find an answer a bit faster.


Answer 5

In my case this worked for python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False) 

And for python 3, only:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False) 

Answer 6

Try specifying engine='python'. It worked for me, but I'm still trying to figure out why.

df = pd.read_csv(input_file_path,...engine='python')

回答 7

我发布这个答案是为了提供一个更新的解决方案,并解释这个问题为什么会出现。假设您是从数据库或Excel工作簿中获取这些数据的。如果数据里有特殊字符,例如La Cañada Flintridge city,那么除非使用UTF-8编码导出数据,否则就会引入错误:La Cañada Flintridge city会变成La Ca\xf1ada Flintridge city。如果您在使用pandas.read_csv时没有对默认参数做任何调整,就会遇到以下错误

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

幸运的是,有一些解决方案。

选项1:修复导出环节。确保使用UTF-8编码导出。

选项2:如果您无法修复导出环节,而又必须使用pandas.read_csv,请确保加上参数engine='python'。默认情况下pandas使用engine='C',它非常适合读取大型、干净的文件,但一旦遇到意外内容就会崩溃。以我的经验,设置encoding='utf-8'从未解决过这个UnicodeDecodeError。另外,您并不需要使用error_bad_lines;不过,如果您真的需要,它仍然是一个选项。

pd.read_csv(<your file>, engine='python')

选项3:我个人首选的解决方案,就是用原生Python读取文件。

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header separately
    # decode it as 'utf-8', strip the line ending, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

希望这可以帮助人们第一次遇到这个问题。

I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you’re going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you’ll hit the following error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

Fortunately, there are a few solutions.

Option 1, fix the exporting. Be sure to use UTF-8 encoding.

Option 2, if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the following parameter: engine='python'. By default, pandas uses engine='C', which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use error_bad_lines; however, that is still an option if you REALLY need it.

pd.read_csv(<your file>, engine='python')

Option 3 is my preferred solution personally: read the file using vanilla Python.

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header separately
    # decode it as 'utf-8', strip the line ending, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

Hope this helps people encountering this issue for the first time.


回答 8

在这个问题上挣扎了一段时间,由于它是第一个搜索结果,我想在这里发一下。给pandas的read_csv加上encoding="iso-8859-1"参数没有用,换成其他任何编码也不行,始终抛出UnicodeDecodeError。

如果您向pd.read_csv()传递的是文件句柄,那么encoding参数需要在open文件时指定,而不是放在read_csv中。事后看来很明显,但追查起来却是个隐蔽的错误。

Struggled with this a while and thought I'd post on this question as it's the first search result. Adding the encoding="iso-8859-1" tag to pandas read_csv didn't work, nor did any other encoding; it kept giving a UnicodeDecodeError.

If you’re passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
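
A minimal sketch of the point above; the filename and encoding are hypothetical:

    import pandas as pd

    # when passing an open handle, the encoding must be set on open();
    # read_csv's own encoding argument no longer applies
    with open('data.csv', encoding='iso-8859-1') as fh:
        df = pd.read_csv(fh)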


回答 9

这个回答可以算是CSV编码问题的万金油。如果您的表头出现类似下面这样奇怪的编码问题:

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

那么您的CSV文件开头就有一个字节顺序标记(BOM)字符。下面这个回答解决了这个问题:

Python读取csv:BOM嵌入到第一个键中

解决方案是使用encoding="utf-8-sig"加载CSV:

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])

希望这对某人有帮助。

This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:

Python read csv – BOM embedded into the first key

The solution is to load the CSV with encoding="utf-8-sig":

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])

Hopefully this helps someone.
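
The same fix applies when reading with pandas; a short sketch with a hypothetical filename:

    import pandas as pd

    # utf-8-sig strips the BOM, so the first column comes out as 'id', not '\ufeffid'
    df = pd.read_csv('data.csv', encoding='utf-8-sig')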


回答 10

我来为这个旧帖子发布一则更新。我找到了一个可行的解决方案,但需要逐个打开文件。我在LibreOffice中打开csv文件,选择另存为 > 编辑过滤器设置,在下拉菜单中选择UTF8编码。然后我在data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig")中添加了encoding="utf-8-sig"。

希望这对某人有帮助。

I am posting an update to this old thread. I found one solution that worked, but requires opening each file. I opened my csv file in LibreOffice, chose Save As > edit filter settings. In the drop-down menu I chose UTF8 encoding. Then I added encoding="utf-8-sig" to the data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig").

Hope this helps someone.


回答 11

我无法打开从网上银行下载的简体中文CSV文件,我尝试过latin1、iso-8859-1和cp1252,但都无济于事。

但是pd.read_csv("", encoding='gbk')就解决了问题。

I had trouble opening a CSV file in simplified Chinese downloaded from an online bank; I tried latin1, iso-8859-1, and cp1252, all to no avail.

But pd.read_csv("", encoding='gbk') simply does the job.


回答 12

请尝试添加

encoding='unicode_escape'

这会有所帮助,对我有效。另外,请确保使用正确的分隔符和列名。

您可以先只加载1000行,以便快速加载文件。

Please try to add

encoding='unicode_escape'

This will help. Worked for me. Also, make sure you’re using the correct delimiter and column names.

You can start with loading just 1000 rows to load the file quickly.
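
A sketch combining both suggestions; filename and delimiter are hypothetical:

    import pandas as pd

    # nrows keeps the first experiments fast while you test encodings and delimiters
    df = pd.read_csv('big_export.csv', encoding='unicode_escape', sep=',', nrows=1000)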


回答 13

我正在使用Jupyter notebook。以我为例,文件显示的格式不对,'encoding'选项不起作用。于是我把CSV另存为utf-8格式,就可以正常读取了。

I am using Jupyter notebook. And in my case, it was showing the file in the wrong format. The 'encoding' option was not working. So I saved the csv in utf-8 format, and it works.


回答 14

尝试这个:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)

看起来它会处理好编码,而无需通过参数显式指定

Try this:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)

Looks like it will take care of the encoding without explicitly expressing it through an argument.


回答 15

在传递给pandas之前,请先检查编码。这会让您变慢,但是……

with open(path, 'r') as f:
    encoding = f.encoding 

df = pd.read_csv(path,sep=sep, encoding=encoding)

在python 3.7中

Check the encoding before you pass to pandas. It will slow you down, but…

with open(path, 'r') as f:
    encoding = f.encoding 

df = pd.read_csv(path,sep=sep, encoding=encoding)

In python 3.7


回答 16

我遇到的另一个导致相同错误的重要问题是:

_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")

^此行导致相同的错误,因为我在用read_csv()方法读取Excel文件。请用read_excel()来读取.xlsx文件。

Another important issue that I faced which resulted in the same error was:

_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")

^This line resulted in the same error because I was reading an Excel file using the read_csv() method. Use read_excel() for reading .xlsx files.
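
A sketch of the corrected call. Note the raw string: a plain "C:\Users\..." literal would itself fail with a unicode escape error because of \U:

    import pandas as pd

    # read_excel, not read_csv, for Excel workbooks (.xlsx needs the openpyxl engine)
    _values = pd.read_excel(r"C:\Users\Mujeeb\Desktop\file.xlsx")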


将元组扩展为参数

问题:将元组扩展为参数

有没有一种方法可以把Python元组展开为函数的实际参数?

例如,这里由expand()施展魔法:

some_tuple = (1, "foo", "bar")

def myfun(number, str1, str2):
    return (number * 2, str1 + str2, str2 + str1)

myfun(expand(some_tuple)) # (2, "foobar", "barfoo")

我知道可以将其定义myfunmyfun((a, b, c)),但是当然可能会有遗留代码。谢谢

Is there a way to expand a Python tuple into a function – as actual parameters?

For example, here expand() does the magic:

some_tuple = (1, "foo", "bar")

def myfun(number, str1, str2):
    return (number * 2, str1 + str2, str2 + str1)

myfun(expand(some_tuple)) # (2, "foobar", "barfoo")

I know one could define myfun as myfun((a, b, c)), but of course there may be legacy code. Thanks


回答 0

myfun(*some_tuple)完全符合您的要求。*运算符只是把元组(或任何可迭代对象)解包,并将其作为位置参数传递给函数。请阅读有关解包参数的更多信息。

myfun(*some_tuple) does exactly what you request. The * operator simply unpacks the tuple (or any iterable) and passes them as the positional arguments to the function. Read more about unpacking arguments.
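
A runnable sketch using the question's own example:

    some_tuple = (1, "foo", "bar")

    def myfun(number, str1, str2):
        return (number * 2, str1 + str2, str2 + str1)

    # the * operator unpacks the tuple into three positional arguments
    print(myfun(*some_tuple))  # (2, 'foobar', 'barfoo')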


回答 1

请注意,您还可以扩展参数列表的一部分:

myfun(1, *("foo", "bar"))

Note that you can also expand part of the argument list:

myfun(1, *("foo", "bar"))

回答 2

看一下Python教程的第4.7.3和4.7.4节。它讨论将元组作为参数传递。

我还将考虑使用命名参数(并传递字典),而不是使用元组并传递序列。当位置不直观或有多个参数时,我发现使用位置参数是一种不好的做法。

Take a look at the Python tutorial section 4.7.3 and 4.7.4. It talks about passing tuples as arguments.

I would also consider using named parameters (and passing a dictionary) instead of using a tuple and passing a sequence. I find the use of positional arguments to be a bad practice when the positions are not intuitive or there are multiple parameters.
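
A sketch of that suggestion, reusing the question's function and unpacking a dict into keyword arguments with **:

    def myfun(number, str1, str2):
        return (number * 2, str1 + str2, str2 + str1)

    kwargs = {"number": 1, "str1": "foo", "str2": "bar"}
    # ** unpacks the dict into named parameters, so argument order no longer matters
    print(myfun(**kwargs))  # (2, 'foobar', 'barfoo')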


回答 3

这是函数式编程的方法:它把元组展开功能从语法糖提升为一个普通函数:

apply_tuple = lambda f, t: f(*t)

用法示例:

from toolz import * 
from operator import add, eq

apply_tuple = curry(apply_tuple)

thread_last(
    [(1,2), (3,4)],
    (map, apply_tuple(add)),
    list,
    (eq, [3, 7])
)
# Prints 'True'

从长远来看,用curry重新定义apply_tuple可以省去大量的partial调用。

This is the functional programming method. It lifts the tuple expansion feature out of syntax sugar:

apply_tuple = lambda f, t: f(*t)

Example usage:

from toolz import * 
from operator import add, eq

apply_tuple = curry(apply_tuple)

thread_last(
    [(1,2), (3,4)],
    (map, apply_tuple(add)),
    list,
    (eq, [3, 7])
)
# Prints 'True'

The curry redefinition of apply_tuple saves a lot of partial calls in the long run.


生成器表达式与列表理解

问题:生成器表达式与列表理解

什么时候应该使用生成器表达式,什么时候应该在Python中使用列表推导?

# Generator expression
(x*2 for x in range(256))

# List comprehension
[x*2 for x in range(256)]

When should you use generator expressions and when should you use list comprehensions in Python?

# Generator expression
(x*2 for x in range(256))

# List comprehension
[x*2 for x in range(256)]

回答 0

John的答案很好(当您要迭代多次时,列表理解会更好)。但是,还应注意,如果要使用任何列表方法,都应使用列表。例如,以下代码将不起作用:

def gen():
    return (something for something in get_some_stuff())

print gen()[:2]     # generators don't support indexing or slicing
print [5,6] + gen() # generators can't be added to lists

基本上,如果您要做的只是迭代一次,则使用生成器表达式。如果要存储和使用生成的结果,则最好使用列表理解功能。

由于性能是选择彼此的最常见原因,所以我的建议是不要担心它,而只选择一个即可。如果您发现程序运行速度太慢,则只有这样,您才应回去担心调整代码。

John’s answer is good (that list comprehensions are better when you want to iterate over something multiple times). However, it’s also worth noting that you should use a list if you want to use any of the list methods. For example, the following code won’t work:

def gen():
    return (something for something in get_some_stuff())

print gen()[:2]     # generators don't support indexing or slicing
print [5,6] + gen() # generators can't be added to lists

Basically, use a generator expression if all you’re doing is iterating once. If you want to store and use the generated results, then you’re probably better off with a list comprehension.

Since performance is the most common reason to choose one over the other, my advice is to not worry about it and just pick one; if you find that your program is running too slowly, then and only then should you go back and worry about tuning your code.


回答 1

遍历生成器表达式或列表理解,效果是一样的。但是,列表理解会先在内存中创建整个列表,而生成器表达式则是即时地逐项生成,因此您可以把它用于非常大的(甚至是无限的!)序列。

Iterating over the generator expression or the list comprehension will do the same thing. However, the list comprehension will create the entire list in memory first while the generator expression will create the items on the fly, so you are able to use it for very large (and also infinite!) sequences.
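
A small sketch of the infinite case, where only a generator expression is possible:

    import itertools

    # a generator expression over an endless source; a list comprehension here would never finish
    squares = (x * x for x in itertools.count())
    print(next(squares), next(squares), next(squares))  # 0 1 4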


回答 2

当结果需要被多次迭代、或速度至关重要时,请使用列表推导。当范围很大或无限时,请使用生成器表达式。

有关更多信息,请参见生成器表达式和列表推导。

Use list comprehensions when the result needs to be iterated over multiple times, or where speed is paramount. Use generator expressions where the range is large or infinite.

See Generator expressions and list comprehensions for more info.


回答 3

重要的一点是:列表理解会创建一个新列表,而生成器创建的是一个可迭代对象,会在您逐项消费时即时地“过滤”源数据。

假设您有一个名为“hugefile.txt”的2TB日志文件,并且想要所有以单词“ENTRY”开头的行的内容和长度。

因此,您尝试通过编写列表理解来开始:

logfile = open("hugefile.txt","r")
entry_lines = [(line,len(line)) for line in logfile if line.startswith("ENTRY")]

这样会一口气读取整个文件,处理每一行,并把匹配的行存进一个数组。这个数组因此最多可能包含2TB的内容:那会占用大量RAM,对您的目的而言可能不切实际。

因此,我们可以改用生成器对内容应用“过滤器”。在我们开始遍历结果之前,不会实际读取任何数据。

logfile = open("hugefile.txt","r")
entry_lines = ((line,len(line)) for line in logfile if line.startswith("ENTRY"))

甚至没有从我们的文件中读取任何一行。实际上,假设我们想进一步过滤结果:

long_entries = ((line,length) for (line,length) in entry_lines if length > 80)

仍未读取任何内容,但是我们现在指定了两个生成器,它们将根据需要对数据起作用。

让我们将过滤后的行写到另一个文件中:

outfile = open("filtered.txt","a")
for entry,length in long_entries:
    outfile.write(entry)

现在我们才开始读取输入文件。随着for循环不断请求后续行,long_entries生成器向entry_lines生成器请求行,并只返回长度大于80个字符的行;而entry_lines生成器又(按上述条件过滤地)向logfile迭代器请求行,后者才真正读取文件。

因此,不是把数据以完整列表的形式“推送”给输出函数,而是给输出函数一种仅在需要时才“拉取”数据的方式。在我们的场景中,这要高效得多,但灵活性稍差。生成器是单向的、只能遍历一次:从日志文件读出的数据会被立即丢弃,因此我们无法回到前一行。另一方面,用完数据之后,我们也不必操心如何保留它。

The important point is that the list comprehension creates a new list. The generator creates an iterable object that will “filter” the source material on-the-fly as you consume the bits.

Imagine you have a 2TB log file called “hugefile.txt”, and you want the content and length for all the lines that start with the word “ENTRY”.

So you try starting out by writing a list comprehension:

logfile = open("hugefile.txt","r")
entry_lines = [(line,len(line)) for line in logfile if line.startswith("ENTRY")]

This slurps up the whole file, processes each line, and stores the matching lines in your array. This array could therefore contain up to 2TB of content. That’s a lot of RAM, and probably not practical for your purposes.

So instead we can use a generator to apply a “filter” to our content. No data is actually read until we start iterating over the result.

logfile = open("hugefile.txt","r")
entry_lines = ((line,len(line)) for line in logfile if line.startswith("ENTRY"))

Not even a single line has been read from our file yet. In fact, say we want to filter our result even further:

long_entries = ((line,length) for (line,length) in entry_lines if length > 80)

Still nothing has been read, but we’ve specified now two generators that will act on our data as we wish.

Lets write out our filtered lines to another file:

outfile = open("filtered.txt","a")
for entry,length in long_entries:
    outfile.write(entry)

Now we read the input file. As our for loop continues to request additional lines, the long_entries generator demands lines from the entry_lines generator, returning only those whose length is greater than 80 characters. And in turn, the entry_lines generator requests lines (filtered as indicated) from the logfile iterator, which in turn reads the file.

So instead of “pushing” data to your output function in the form of a fully-populated list, you’re giving the output function a way to “pull” data only when it’s needed. In our case this is much more efficient, but not quite as flexible. Generators are one way, one pass; the data from the log file we’ve read gets immediately discarded, so we can’t go back to a previous line. On the other hand, we don’t have to worry about keeping data around once we’re done with it.


回答 4

生成器表达式的好处是它使用较少的内存,因为它不会立即构建整个列表。当列表是中间变量时,最好使用生成器表达式,例如对结果求和或根据结果创建字典。

例如:

sum(x*2 for x in xrange(256))

dict( (k, some_func(k)) for k in some_list_of_keys )

这样做的好处是列表不会完全生成,因此使用的内存很少(而且应该更快)

但是,当所需的最终产品是列表时,应该使用列表推导。使用生成器表达式不会节省任何内存,因为您终究需要生成出来的整个列表。而且您还能获得使用sorted、reversed等任何列表函数的好处。

例如:

reversed( [x*2 for x in xrange(256)] )

The benefit of a generator expression is that it uses less memory since it doesn’t build the whole list at once. Generator expressions are best used when the list is an intermediary, such as summing the results, or creating a dict out of the results.

For example:

sum(x*2 for x in xrange(256))

dict( (k, some_func(k)) for k in some_list_of_keys )

The advantage there is that the list isn’t completely generated, and thus little memory is used (and should also be faster)

You should, though, use list comprehensions when the desired final product is a list. You are not going to save any memory using generator expressions, since you want the generated list. You also get the benefit of being able to use any of the list functions like sorted or reversed.

For example:

reversed( [x*2 for x in xrange(256)] )

回答 5

从可变对象(如列表)创建生成器时要注意:生成器求值时依据的是使用它那一刻列表的状态,而不是创建它那一刻的状态:

>>> mylist = ["a", "b", "c"]
>>> gen = (elem + "1" for elem in mylist)
>>> mylist.clear()
>>> for x in gen: print (x)
# nothing

如果您的列表(或列表内的可变对象)有可能被修改,而您需要的是创建生成器那一刻的状态,那么就需要改用列表推导。

When creating a generator from a mutable object (like a list) be aware that the generator will get evaluated on the state of the list at time of using the generator, not at time of the creation of the generator:

>>> mylist = ["a", "b", "c"]
>>> gen = (elem + "1" for elem in mylist)
>>> mylist.clear()
>>> for x in gen: print (x)
# nothing

If there is any chance of your list getting modified (or a mutable object inside that list) but you need the state at creation of the generator you need to use a list comprehension instead.


回答 6

有时,itertools中的tee函数可以帮您脱身:它为同一个生成器返回多个可以独立使用的迭代器。

Sometimes you can get away with the tee function from itertools, it returns multiple iterators for the same generator that can be used independently.
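
A minimal sketch of tee:

    from itertools import tee

    gen = (x * 2 for x in range(5))
    a, b = tee(gen)  # two independent iterators over the same generator
    print(sum(a))    # 20
    print(list(b))   # [0, 2, 4, 6, 8]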


回答 7

我正在使用Hadoop Mincemeat模块。我认为这是一个值得注意的好例子:

import mincemeat

def mapfn(k,v):
    for w in v:
        yield 'sum',w
        #yield 'count',1


def reducefn(k,v): 
    r1=sum(v)
    r2=len(v)
    print r2
    m=r1/r2
    std=0
    for i in range(r2):
       std+=pow(abs(v[i]-m),2)  
    res=pow((std/r2),0.5)
    return r1,r2,res

在这里,生成器从文本文件(最大可达15GB)中取出数字,并借助Hadoop的map-reduce对这些数字做简单的数学运算。如果我用的不是yield,而是列表理解,那么计算总和与平均值会花费长得多的时间(更不用说空间复杂度了)。

Hadoop是充分利用生成器全部优点的一个好例子。

I’m using the Hadoop Mincemeat module. I think this is a great example to take a note of:

import mincemeat

def mapfn(k,v):
    for w in v:
        yield 'sum',w
        #yield 'count',1


def reducefn(k,v): 
    r1=sum(v)
    r2=len(v)
    print r2
    m=r1/r2
    std=0
    for i in range(r2):
       std+=pow(abs(v[i]-m),2)  
    res=pow((std/r2),0.5)
    return r1,r2,res

Here the generator gets numbers out of a text file (as big as 15GB) and applies simple math on those numbers using Hadoop’s map-reduce. If I had not used the yield function, but instead a list comprehension, it would have taken a much longer time calculating the sums and average (not to mention the space complexity).

Hadoop is a great example for using all the advantages of Generators.