Tag archive: Python

Pandas split column of lists into multiple columns

Question: Pandas split column of lists into multiple columns

I have a pandas DataFrame with one column:

import pandas as pd

df = pd.DataFrame(
    data={
        "teams": [
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
        ]
    }
)

print(df)

Output:

       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

How can I split this column of lists into two columns?


Answer 0

You can use the DataFrame constructor with the lists produced by to_list:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print (df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
print (df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for a new DataFrame:

df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
print (df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

The solution with apply(pd.Series) is very slow:

#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [121]: %timeit df2['teams'].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answer 1

A much simpler solution:

pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])

This yields:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

If you wanted to split a column of delimited strings rather than lists, you could similarly do:

pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
             columns=['team1', 'team2'])
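
As a concrete illustration of the delimited-string case, here is a minimal sketch. The "-" delimiter and the sample values are assumptions made up for this example. Since str.split(..., expand=True) already returns a DataFrame, this variant renames the columns directly instead of rebuilding the frame from .values, which also keeps the original index.

```python
import pandas as pd

# Hypothetical sample data: each cell is a "-"-delimited string, not a list.
df = pd.DataFrame({"teams": ["SF-NYG", "DAL-PHI", "GB-CHI"]})

# expand=True returns one column per piece, labelled 0, 1, ...
parts = df["teams"].str.split("-", expand=True)
parts.columns = ["team1", "team2"]

print(parts)
```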

Answer 2

This solution preserves the index of the df2 DataFrame, unlike any solution that uses tolist():

df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']

Here’s the result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
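
The index difference is easy to miss with the default 0..n-1 index, so the sketch below uses a made-up non-default index (an assumption for illustration only). It suggests that a tolist()-based solution can also keep the index, provided index=df2.index is passed to the constructor.

```python
import pandas as pd

# Hypothetical frame with a non-default index to make the difference visible.
df2 = pd.DataFrame(
    {"teams": [["SF", "NYG"], ["DAL", "PHI"], ["GB", "CHI"]]},
    index=[10, 20, 30],
)

# Without index=..., the new frame gets a fresh 0..n-1 index.
reset = pd.DataFrame(df2["teams"].tolist(), columns=["team1", "team2"])

# Passing index=df2.index keeps the rows aligned with df2.
kept = pd.DataFrame(df2["teams"].tolist(), index=df2.index, columns=["team1", "team2"])

print(reset.index.tolist())  # [0, 1, 2]
print(kept.index.tolist())   # [10, 20, 30]
```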

Answer 3

There seems to be a way that is syntactically simpler, and therefore easier to remember, than the solutions proposed so far. I’m assuming that the column is called ‘meta’ in a DataFrame df and holds whitespace-delimited strings:

df2 = pd.DataFrame(df['meta'].str.split().values.tolist())

Answer 4

Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:

pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

Timings:

In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [2]: %timeit df2['teams'].apply(pd.Series)

8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
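
One property of the dict-based construction that may be worth knowing is how it behaves when the lists have different lengths. The sketch below uses made-up ragged data (an assumption, not part of the original answer) and writes the dict comprehension as the equivalent dict(enumerate(item)).

```python
import pandas as pd

# Hypothetical ragged data: the second row has only one team.
df2 = pd.DataFrame({"teams": [["SF", "NYG"], ["DAL"], ["GB", "CHI"]]})

# Each list becomes a {position: value} dict; missing positions become NaN.
wide = pd.DataFrame(
    [dict(enumerate(item)) for item in df2["teams"].tolist()],
    index=df2.index,
)
print(wide)
```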

Answer 5

The above solutions didn’t work for me because I have NaN observations in my DataFrame. In my case, df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index) raises:

object of type 'float' has no len()

I solved this using a list comprehension. Here is a reproducible example:

import pandas as pd
import numpy as np
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
            ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2.loc[2,'teams'] = np.nan
df2.loc[4,'teams'] = np.nan
df2

output:

       teams
0  [SF, NYG]
1  [SF, NYG]
2        NaN
3  [SF, NYG]
4        NaN
5  [SF, NYG]
6  [SF, NYG]

df2['team1']=np.nan
df2['team2']=np.nan

Solving with a list comprehension:

for i in [0,1]:
    df2['team{}'.format(str(i+1))]=[k[i] if isinstance(k,list) else k for k in df2['teams']]

df2

yields:

       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2        NaN   NaN   NaN
3  [SF, NYG]    SF   NYG
4        NaN   NaN   NaN
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG
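
A possible alternative to the per-column loop, sketched under the assumption that every non-missing cell is a two-element list: pad the missing cells with a same-length placeholder first, after which the constructor-based approach no longer fails. The sample data and the name padded are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing cell in the list column.
df2 = pd.DataFrame({"teams": [["SF", "NYG"], np.nan, ["DAL", "PHI"]]})

# Replace non-list cells with a same-length placeholder so every row has two items.
padded = df2["teams"].apply(lambda v: v if isinstance(v, list) else [np.nan, np.nan])

df2[["team1", "team2"]] = pd.DataFrame(padded.tolist(), index=df2.index)
print(df2)
```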

Answer 6

List comprehension

A simple implementation with a list comprehension (my favorite):

df = pd.DataFrame([pd.Series(x) for x in df.teams])
df.columns = ['team_{}'.format(x+1) for x in df.columns]

timing on output:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.71 ms

output:

  team_1 team_2
0     SF    NYG
1     SF    NYG
2     SF    NYG
3     SF    NYG
4     SF    NYG
5     SF    NYG
6     SF    NYG

Answer 7

Here’s another solution using transform and set_axis:

>>> (df['teams']
       .transform([lambda x:x[0], lambda x:x[1]])
       .set_axis(['team1','team2'],
                  axis=1)
    )

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
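
A related option that avoids lambdas altogether is the .str accessor, whose indexing also works on lists stored in an object column. This is a minimal sketch with made-up sample data, not a replacement for the answer above.

```python
import pandas as pd

# Hypothetical sample data: a column whose cells are two-element lists.
df = pd.DataFrame({"teams": [["SF", "NYG"], ["DAL", "PHI"]]})

# .str[i] picks element i from each list in the column.
out = pd.DataFrame({"team1": df["teams"].str[0], "team2": df["teams"].str[1]})
print(out)
```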

How are POST and GET variables handled in Python?

Question: How are POST and GET variables handled in Python?

In PHP you can just use $_POST for POST and $_GET for GET (Query string) variables. What’s the equivalent in Python?


Answer 0

Suppose you’re posting an HTML form with this:

<input type="text" name="username">

If using raw cgi:

import cgi
form = cgi.FieldStorage()
print(form["username"])

If using Django, Pylons or Pyramid:

print(request.GET['username'])  # for GET form method
print(request.POST['username'])  # for POST form method

Using Turbogears, Cherrypy:

from cherrypy import request
print(request.params['username'])

Web.py:

form = web.input()
print(form.username)

Werkzeug (and Flask, which is built on it):

print(request.args['username'])  # for GET form method
print(request.form['username'])  # for POST form method

If using Cherrypy or Turbogears, you can also define your handler function taking a parameter directly:

def index(self, username):
    print(username)

Google App Engine:

class SomeHandler(webapp2.RequestHandler):
    def post(self):
        name = self.request.get('username') # this will get the value from the field named username
        self.response.write(name) # this will write on the document

So you really will have to choose one of those frameworks.


Answer 1

I know this is an old question. Yet it’s surprising that no good answer was given.

First of all, the question is completely valid without mentioning a framework. The context is finding the equivalent of a PHP language feature. Although there are many ways to get the query string parameters in Python, the framework variables are just conveniently populated. In PHP, $_GET and $_POST are also convenience variables. They are parsed from the query string (QUERY_STRING) and php://input respectively.

In Python, the equivalents would be os.getenv('QUERY_STRING') and sys.stdin.read(). Remember to import the os and sys modules.

We have to be careful with the word “CGI” here, especially when talking about two languages and their commonalities when interfacing with a web server. 1. CGI, as a protocol, defines the data transport mechanism in the HTTP protocol. 2. Python can be configured to run as a CGI-script in Apache. 3. The CGI module in Python offers some convenience functions.

Since the HTTP protocol is language-independent, and Apache’s CGI extension is also language-independent, getting the GET and POST parameters should differ only in syntax across languages.

Here’s the Python routine to populate a GET dictionary:

import os

GET = {}
args = os.getenv("QUERY_STRING", "").split('&')

for arg in args:
    t = arg.split('=', 1)  # split on the first '=' only
    if len(t) > 1:
        k, v = t
        GET[k] = v

and for POST:

import sys

POST = {}
args = sys.stdin.read().split('&')

for arg in args:
    t = arg.split('=', 1)  # split on the first '=' only
    if len(t) > 1:
        k, v = t
        POST[k] = v

You can now access the fields as following:

print(GET.get('user_id'))
print(POST.get('user_name'))

I must also point out that the CGI module doesn’t work well. Consider this HTTP request:

POST /test.py?user_id=6

user_name=Bob&age=30

Using cgi.FieldStorage().getvalue('user_id') will cause a null pointer exception because the module blindly checks the POST data, ignoring the fact that a POST request can carry GET parameters too.
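
The manual splitting above does not decode percent-escapes or "+" signs. A minimal sketch of the same idea using the standard library's urllib.parse.parse_qs is shown below. The helper name parse_form and the sample request values are assumptions made for illustration; in a CGI script the two arguments would come from os.environ.get("QUERY_STRING", "") and sys.stdin.read().

```python
from urllib.parse import parse_qs

def parse_form(query_string, body):
    """Return (GET, POST) dicts mapping each name to a list of decoded values."""
    # keep_blank_values=True keeps fields such as "e=" instead of dropping them.
    get = parse_qs(query_string, keep_blank_values=True)
    post = parse_qs(body, keep_blank_values=True)
    return get, post

# Hypothetical request: POST /test.py?user_id=6 with a urlencoded body.
GET, POST = parse_form("user_id=6", "user_name=Bob+Smith&age=30")
print(GET["user_id"][0])     # 6
print(POST["user_name"][0])  # Bob Smith
```

Note that parse_qs maps every name to a list, because a field may appear more than once in a request.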


Answer 2

I’ve found nosklo’s answer very extensive and useful! For those, like myself, who might find accessing the raw request data directly also useful, I would like to add the way to do that:

import os, sys

# the query string, which contains the raw GET data
# (For example, for http://example.com/myscript.py?a=b&c=d&e
# this is "a=b&c=d&e")
os.getenv("QUERY_STRING")

# the raw POST data
sys.stdin.read()

Answer 3

They are stored in the cgi FieldStorage object.

import cgi
form = cgi.FieldStorage()

print("The user entered %s" % form.getvalue("uservalue"))

Answer 4

It somewhat depends on what you use as a CGI framework, but they are available in dictionaries accessible to the program. I’d point you to the docs, but I’m not getting through to python.org right now. But this note on mail.python.org will give you a first pointer. Look at the cgi and urllib Python modules for more.

Update

Okay, that link is broken. Here’s the basic WSGI reference.
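
To make the WSGI pointer a little more concrete, here is a minimal sketch of a WSGI callable that reads one GET field and one POST field from the environ dictionary. The field names, the app function and the hand-built environ are assumptions for illustration; a real server supplies a much fuller environ.

```python
import io
from urllib.parse import parse_qs

def app(environ, start_response):
    """Minimal WSGI app that echoes one GET and one POST field."""
    get = parse_qs(environ.get("QUERY_STRING", ""))
    # CONTENT_LENGTH may be missing or empty, so fall back to 0.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length).decode("utf-8")
    post = parse_qs(body)

    user_id = get.get("user_id", [""])[0]
    user_name = post.get("user_name", [""])[0]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [f"{user_id}:{user_name}".encode("utf-8")]

# Call the app directly with a hand-built environ (no server needed).
raw = b"user_name=Bob&age=30"
environ = {
    "REQUEST_METHOD": "POST",
    "QUERY_STRING": "user_id=6",
    "CONTENT_LENGTH": str(len(raw)),
    "wsgi.input": io.BytesIO(raw),
}
result = app(environ, lambda status, headers: None)
print(result)  # [b'6:Bob']
```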


Answer 5

Python is only a language. To get GET and POST data, you need a web framework or toolkit written in Python. Django is one and, as Charlie points out, the cgi and urllib standard modules are others. Also available are Turbogears, Pylons, CherryPy, web.py, mod_python, fastcgi, etc.

In Django, your view functions receive a request argument which has request.GET and request.POST. Other frameworks will do it differently.


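Is the list of Python reserved words and builtins available in a library?

Question: Is the list of Python reserved words and builtins available in a library?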

Is the list of Python reserved words and builtins available in a library? I want to do something like:

 from x.y import reserved_words_and_builtins

 if x in reserved_words_and_builtins:
     x += '_'

Answer 0

To verify that a string is a keyword you can use keyword.iskeyword; to get the list of reserved keywords you can use keyword.kwlist:

>>> import keyword
>>> keyword.iskeyword('break')
True
>>> keyword.kwlist
['False', 'None', 'True', 'and', 'as', 'assert', 'break', 'class', 'continue', 'def', 
 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 
 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 
 'while', 'with', 'yield']

If you want to include built-in names as well (Python 3), then check the builtins module:

>>> import builtins
>>> dir(builtins)
['ArithmeticError', 'AssertionError', 'AttributeError',
 'BaseException', 'BlockingIOError', 'BrokenPipeError', 'BufferError', 'BytesWarning',
 'ChildProcessError', 'ConnectionAbortedError', 'ConnectionError',
 'ConnectionRefusedError', 'ConnectionResetError', 'DeprecationWarning', 'EOFError',
 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FileExistsError',
 'FileNotFoundError', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError',
 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError',
 'InterruptedError', 'IsADirectoryError', 'KeyError', 'KeyboardInterrupt', 'LookupError',
 'MemoryError', 'NameError', 'None', 'NotADirectoryError', 'NotImplemented',
 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning',
 'PermissionError', 'ProcessLookupError', 'RecursionError', 'ReferenceError',
 'ResourceWarning', 'RuntimeError', 'RuntimeWarning', 'StopAsyncIteration',
 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit',
 'TabError', 'TimeoutError', 'True', 'TypeError', 'UnboundLocalError',
 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError',
 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError', '_',
 '__build_class__', '__debug__', '__doc__', '__import__', '__loader__', '__name__',
 '__package__', '__spec__', 'abs', 'all', 'any', 'ascii', 'bin', 'bool',
 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'compile', 'complex',
 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval',
 'exec', 'exit', 'filter', 'float', 'format', 'frozenset', 'getattr',
 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int',
 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'map',
 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow',
 'print', 'property', 'quit', 'range', 'repr', 'reversed', 'round', 'set',
 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple',
 'type', 'vars', 'zip']

For Python 2 you’ll need to use the __builtin__ module:

>>> import __builtin__
>>> dir(__builtin__)
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BufferError', 'BytesWarning', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError', 'None', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'ReferenceError', 'RuntimeError', 'RuntimeWarning', 'StandardError', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'WindowsError', 'ZeroDivisionError', '_', '__debug__', '__doc__', '__import__', '__name__', '__package__', 'abs', 'all', 'any', 'apply', 'basestring', 'bin', 'bool', 'buffer', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'cmp', 'coerce', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'execfile', 'exit', 'file', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'intern', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'long', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'raw_input', 'reduce', 'reload', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'unichr', 'unicode', 'vars', 'xrange', 'zip']
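
Putting the two modules together, the check from the question could be sketched as follows for Python 3. The names RESERVED and safe_name are made up for this example.

```python
import builtins
import keyword

# Union of reserved keywords and the names exposed by the builtins module.
RESERVED = set(keyword.kwlist) | set(dir(builtins))

def safe_name(name):
    """Append an underscore when `name` clashes with a keyword or builtin."""
    return name + "_" if name in RESERVED else name

print(safe_name("class"))  # class_
print(safe_name("list"))   # list_
print(safe_name("teams"))  # teams
```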

pandas: How do I split text in a column into multiple rows?

Question: pandas: How do I split text in a column into multiple rows?

I’m working with a large csv file and the next to last column has a string of text that I want to split by a specific delimiter. I was wondering if there is a simple way to do this using pandas or python?

CustNum  CustomerName     ItemQty  Item   Seatblocks                 ItemExt
32363    McCartney, Paul      3     F04    2:218:10:4,6                   60
31316    Lennon, John        25     F01    1:13:36:1,12 1:13:37:1,13     300

I want to split by the space (' ') and then the colon (':') in the Seatblocks column, but each cell would result in a different number of columns. I have a function to rearrange the columns so the Seatblocks column is at the end of the sheet, but I’m not sure what to do from there. I can do it in Excel with the built-in text-to-columns function and a quick macro, but my dataset has too many records for Excel to handle.

Ultimately, I want to take records such as John Lennon’s and create multiple lines, with the info from each set of seats on a separate line.


Answer 0

This splits the Seatblocks by space and gives each its own row.

In [43]: df
Out[43]: 
   CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      300

In [44]: s = df['Seatblocks'].str.split(' ').apply(Series, 1).stack()

In [45]: s.index = s.index.droplevel(-1) # to line up with df's index

In [46]: s.name = 'Seatblocks' # needs a name to join

In [47]: s
Out[47]: 
0    2:218:10:4,6
1    1:13:36:1,12
1    1:13:37:1,13
Name: Seatblocks, dtype: object

In [48]: del df['Seatblocks']

In [49]: df.join(s)
Out[49]: 
   CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
0    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
1    31316     Lennon, John       25  F01      300  1:13:36:1,12
1    31316     Lennon, John       25  F01      300  1:13:37:1,13

Or, to give each colon-separated string its own column:

In [50]: df.join(s.apply(lambda x: Series(x.split(':'))))
Out[50]: 
   CustNum     CustomerName  ItemQty Item  ItemExt  0    1   2     3
0    32363  McCartney, Paul        3  F04       60  2  218  10   4,6
1    31316     Lennon, John       25  F01      300  1   13  36  1,12
1    31316     Lennon, John       25  F01      300  1   13  37  1,13

This is a little ugly, but maybe someone will chime in with a prettier solution.
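
On newer pandas versions (0.25 or later, where explode is available), the same reshaping can be sketched more directly. This is an alternative technique, not the method used in the answer above; the frame below reproduces only three of the question's columns, and the names rows and parts are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "CustNum": [32363, 31316],
    "CustomerName": ["McCartney, Paul", "Lennon, John"],
    "Seatblocks": ["2:218:10:4,6", "1:13:36:1,12 1:13:37:1,13"],
})

# One row per space-separated seat block; the other columns are repeated.
rows = (
    df.assign(Seatblocks=df["Seatblocks"].str.split(" "))
      .explode("Seatblocks")
      .reset_index(drop=True)  # unique index, so the join below stays one-to-one
)

# Optionally spread the colon-separated parts into their own columns.
parts = rows["Seatblocks"].str.split(":", expand=True)
print(rows.join(parts))
```

Resetting the index before the join matters: explode repeats the original index labels, and joining two frames on a duplicated index would multiply the rows.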


Answer 1

Unlike Dan, I consider his answer quite elegant… but unfortunately it is also very inefficient. Since the question mentioned “a large csv file”, let me suggest timing Dan’s solution in a shell:

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print(df['col'].apply(lambda x : pd.Series(x.split(' '))).head())"

… compared to this alternative:

time python -c "import pandas as pd;
from numpy import concatenate;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print(pd.DataFrame(concatenate(df['col'].apply( lambda x : [x.split(' ')]))).head())"

… and this:

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print(pd.DataFrame(dict(zip(range(3), [df['col'].apply(lambda x : x.split(' ')[i]) for i in range(3)]))).head())"

The second simply refrains from allocating 100 000 Series, and this is enough to make it around 10 times faster. But the third solution, which somewhat ironically wastes a lot of calls to str.split() (it is called once per column per row, so three times as often as in the other two solutions), is around 40 times faster than the first, because it even avoids instantiating the 100 000 lists. And yes, it is certainly a little ugly…

EDIT: this answer shows how to use tolist() and avoid the need for a lambda. The result is something like

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print(pd.DataFrame(df.col.str.split().tolist()).head())"

which is even more efficient than the third solution, and certainly much more elegant.

EDIT: the even simpler

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print(pd.DataFrame(list(df.col.str.split())).head())"

works too, and is almost as efficient.

EDIT: even simpler! And it handles NaNs (but is less efficient):

time python -c "import pandas as pd;
df = pd.DataFrame(['a b c']*100000, columns=['col']);
print(df.col.str.split(expand=True).head())"
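
As a minimal Python 3 sketch (the five-row frame and the names via_lists and via_expand are illustrative, not from the answer), the two shortest variants can be run side by side on a small column to check that they build the same table:

```python
import pandas as pd

# Small illustrative frame; the answer above times 100 000 rows instead.
df = pd.DataFrame(['a b c'] * 5, columns=['col'])

# Variant 1: split into lists, then build a DataFrame from the list of lists.
via_lists = pd.DataFrame(df['col'].str.split().tolist(), index=df.index)

# Variant 2: let pandas expand the split pieces into columns directly.
via_expand = df['col'].str.split(expand=True)

print(via_lists.head())
print(via_expand.head())
```

Which one is faster on real data is worth measuring with %timeit, since the numbers quoted in this thread come from older pandas versions.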

Answer 2

import pandas as pd
import numpy as np

df = pd.DataFrame({'ItemQty': {0: 3, 1: 25}, 
                   'Seatblocks': {0: '2:218:10:4,6', 1: '1:13:36:1,12 1:13:37:1,13'}, 
                   'ItemExt': {0: 60, 1: 300}, 
                   'CustomerName': {0: 'McCartney, Paul', 1: 'Lennon, John'}, 
                   'CustNum': {0: 32363, 1: 31316}, 
                   'Item': {0: 'F04', 1: 'F01'}}, 
                    columns=['CustNum','CustomerName','ItemQty','Item','Seatblocks','ItemExt'])

print (df)
   CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
0    32363  McCartney, Paul        3  F04               2:218:10:4,6       60
1    31316     Lennon, John       25  F01  1:13:36:1,12 1:13:37:1,13      300

Another similar solution uses method chaining with reset_index and rename:

print (df.drop('Seatblocks', axis=1)
             .join
             (
             df.Seatblocks
             .str
             .split(expand=True)
             .stack()
             .reset_index(drop=True, level=1)
             .rename('Seatblocks')           
             ))

   CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
0    32363  McCartney, Paul        3  F04       60  2:218:10:4,6
1    31316     Lennon, John       25  F01      300  1:13:36:1,12
1    31316     Lennon, John       25  F01      300  1:13:37:1,13

If the column contains no NaN values, the fastest solution is a list comprehension passed to the DataFrame constructor:

df = pd.DataFrame(['a b c']*100000, columns=['col'])

In [141]: %timeit (pd.DataFrame(dict(zip(range(3), [df['col'].apply(lambda x : x.split(' ')[i]) for i in range(3)]))))
1 loop, best of 3: 211 ms per loop

In [142]: %timeit (pd.DataFrame(df.col.str.split().tolist()))
10 loops, best of 3: 87.8 ms per loop

In [143]: %timeit (pd.DataFrame(list(df.col.str.split())))
10 loops, best of 3: 86.1 ms per loop

In [144]: %timeit (df.col.str.split(expand=True))
10 loops, best of 3: 156 ms per loop

In [145]: %timeit (pd.DataFrame([ x.split() for x in df['col'].tolist()]))
10 loops, best of 3: 54.1 ms per loop

But if the column contains NaN, only str.split with the parameter expand=True works; it returns a DataFrame (documentation), which explains why it is slower:

df = pd.DataFrame(['a b c']*10, columns=['col'])
df.loc[0] = np.nan
print (df.head())
     col
0    NaN
1  a b c
2  a b c
3  a b c
4  a b c

print (df.col.str.split(expand=True))
     0     1     2
0  NaN  None  None
1    a     b     c
2    a     b     c
3    a     b     c
4    a     b     c
5    a     b     c
6    a     b     c
7    a     b     c
8    a     b     c
9    a     b     c
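
If the speed of the list comprehension matters but the column may contain NaN, one possible workaround (an assumption of this sketch, not something the timings above cover) is to substitute an empty list for non-string cells; the DataFrame constructor then pads the short rows with missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['a b c'] * 4, columns=['col'])
df.loc[0, 'col'] = np.nan

# Non-string cells (here NaN) become empty lists instead of raising AttributeError.
rows = [x.split() if isinstance(x, str) else [] for x in df['col'].tolist()]

# Rows shorter than the widest row are padded with missing values.
out = pd.DataFrame(rows, index=df.index)
print(out)
```

The isinstance check costs a little per row, so this variant should be timed against str.split(expand=True) before relying on it.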

Answer 3

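Another approach would be like this:

import numpy as np

# one list of seat blocks per row
temp = df['Seatblocks'].str.split(' ')
# repeat each row once per seat block it contains
df = df.reindex(df.index.repeat(temp.apply(len)))
# flatten the lists into a single column aligned with the repeated rows
df['new_Seatblocks'] = np.hstack(temp)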
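
On pandas 0.25 or later, the same row duplication can also be expressed with DataFrame.explode, which is a different technique from the reindex/repeat approach above. A minimal sketch, using a cut-down two-column version of the example data (an assumption made for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'CustNum': [32363, 31316],
    'Seatblocks': ['2:218:10:4,6', '1:13:36:1,12 1:13:37:1,13'],
})

# Turn each cell into a list of seat blocks, then emit one row per list element.
out = df.assign(Seatblocks=df['Seatblocks'].str.split(' ')).explode('Seatblocks')
print(out)
```

The original index labels are repeated for the duplicated rows, as in the outputs shown in the earlier answers; reset_index(drop=True) would renumber them.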

Answer 4

You can also use groupby(), with no need for join and stack().

Using the example data above:

import pandas as pd
import numpy as np


df = pd.DataFrame({'ItemQty': {0: 3, 1: 25}, 
                   'Seatblocks': {0: '2:218:10:4,6', 1: '1:13:36:1,12 1:13:37:1,13'}, 
                   'ItemExt': {0: 60, 1: 300}, 
                   'CustomerName': {0: 'McCartney, Paul', 1: 'Lennon, John'}, 
                   'CustNum': {0: 32363, 1: 31316}, 
                   'Item': {0: 'F04', 1: 'F01'}}, 
                    columns=['CustNum','CustomerName','ItemQty','Item','Seatblocks','ItemExt']) 
print(df)

   CustNum     CustomerName  ItemQty Item                 Seatblocks  ItemExt
0  32363    McCartney, Paul  3        F04  2:218:10:4,6               60     
1  31316    Lennon, John     25       F01  1:13:36:1,12 1:13:37:1,13  300  


#first define a function: given a Series of string, split each element into a new series
def split_series(ser,sep):
    return pd.Series(ser.str.cat(sep=sep).split(sep=sep)) 
#test the function, 
split_series(pd.Series(['a b','c']),sep=' ')
0    a
1    b
2    c
dtype: object

df2=(df.groupby(df.columns.drop('Seatblocks').tolist()) #group by all but one column
          ['Seatblocks'] #select the column to be split
          .apply(split_series,sep=' ') # split 'Seatblocks' in each group
         .reset_index(drop=True,level=-1).reset_index()) #remove extra index created

print(df2)
   CustNum     CustomerName  ItemQty Item  ItemExt    Seatblocks
0    31316     Lennon, John       25  F01      300  1:13:36:1,12
1    31316     Lennon, John       25  F01      300  1:13:37:1,13
2    32363  McCartney, Paul        3  F04       60  2:218:10:4,6

Answer 5

This seems a far easier method than those suggested elsewhere in this thread.

split rows in pandas dataframe


Iterating over a numpy array

Question: Iterating over a numpy array

Is there a less verbose alternative to this:

for x in range(array.shape[0]):
    for y in range(array.shape[1]):
        do_stuff(x, y)

I came up with this:

for x, y in itertools.product(*map(range, array.shape)):
    do_stuff(x, y)

Which saves one indentation, but is still pretty ugly.

I’m hoping for something that looks like this pseudocode:

for x, y in array.indices:
    do_stuff(x, y)

Does anything like that exist?


Answer 0

I think you’re looking for numpy.ndenumerate.

>>> import numpy
>>> a = numpy.array([[1,2],[3,4],[5,6]])
>>> for (x,y), value in numpy.ndenumerate(a):
...     print(x, y)
...
0 0
0 1
1 0
1 1
2 0
2 1

Regarding performance: it is a bit slower than a list comprehension.

X = np.zeros((100, 100, 100))

%timeit list([((i,j,k), X[i,j,k]) for i in range(X.shape[0]) for j in range(X.shape[1]) for k in range(X.shape[2])])
1 loop, best of 3: 376 ms per loop

%timeit list(np.ndenumerate(X))
1 loop, best of 3: 570 ms per loop

If you are worried about the performance you could optimise a bit further by looking at the implementation of ndenumerate, which does 2 things, converting to an array and looping. If you know you have an array, you can call the .coords attribute of the flat iterator.

a = X.flat
%timeit list([(a.coords, x) for x in a.flat])
1 loop, best of 3: 305 ms per loop
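
NumPy's documentation describes coords as the N-dimensional index of the iterator's current position, that is, of the element the next call to next() will return. Under that reading, one way to pair each value with its own index is to read coords before advancing. A minimal sketch (the small array and the names it and pairs are illustrative):

```python
import numpy as np

X = np.arange(6).reshape(2, 3)

it = X.flat
pairs = []
for _ in range(X.size):
    coords = it.coords   # index of the element the next call will return
    value = next(it)     # advance the flat iterator by one element
    pairs.append((coords, value))

print(pairs[:3])
```

This produces the same (index, value) pairs as numpy.ndenumerate; whether it is actually faster depends on the NumPy version and should be measured.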

Answer 1

If you only need the indices, you could try numpy.ndindex:

>>> a = numpy.arange(9).reshape(3, 3)
>>> [(x, y) for x, y in numpy.ndindex(a.shape)]
[(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]

Answer 2

See nditer:

import numpy as np
Y = np.array([3,4,5,6])
for y in np.nditer(Y, op_flags=['readwrite']):
    y += 3

# Y is now array([6, 7, 8, 9])

y = 3 would not work, because it only rebinds the loop variable; use in-place operations such as y *= 0 followed by y += 3 (or y[...] = 3) instead.
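
To illustrate assignment through the iterator, here is a minimal sketch assuming NumPy 1.15 or later, where nditer can be used as a context manager; the array is the one from the answer, and the value 3 is arbitrary:

```python
import numpy as np

Y = np.array([3, 4, 5, 6])

# The context manager ensures any pending write-back is flushed on exit.
with np.nditer(Y, op_flags=['readwrite']) as it:
    for y in it:
        y[...] = 3   # writes through to Y; plain "y = 3" would only rebind y

print(Y)
```

For a plain element-wise assignment like this, Y[...] = 3 without any loop does the same job; nditer is mainly useful when the loop body is more involved.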


How to change any data type into a string in Python

Question: How to change any data type into a string in Python

How can I change any data type into a string in Python?


Answer 0

myvariable = 4
mystring = str(myvariable)  # '4'

Alternatively, try repr:

mystring = repr(myvariable) # '4'

This is called “conversion” in Python, and is quite common.
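
For an integer the two calls give the same text, but they differ for other types. A small sketch with illustrative variable names:

```python
number = 4
value = 'spam'

print(str(number), repr(number))   # identical text for an int
print(str(value))                  # the text itself
print(repr(value))                 # the text wrapped in quotes

as_str = str(value)
as_repr = repr(value)
```

So str is usually the one to reach for when the result is shown to a user, and repr when debugging.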


Answer 1

str is meant to produce a string representation of the object’s data. If you’re writing your own class and you want str to work for you, add:

def __str__(self):
    return "Some descriptive string"

print(str(myObj)) will call myObj.__str__().

repr is a similar function, which generally produces information about the class. For most core library objects, repr produces the class name (and sometimes some class information) between angle brackets. repr is used, for example, when you just type your object into your interactions pane, without using print or anything else.

You can define the behavior of repr for your own objects just like you can define the behavior of str:

def __repr__(self):
    return "Some descriptive string"

Typing >>> myObj in your interactions pane, or calling repr(myObj), will result in a call to myObj.__repr__().
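
Putting both methods on one class makes the difference visible. A minimal sketch; the Point class and its format strings are hypothetical, chosen only for illustration:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        # Readable form, used by str() and print()
        return "({}, {})".format(self.x, self.y)

    def __repr__(self):
        # Unambiguous form, used by repr() and the interactive prompt
        return "Point(x={!r}, y={!r})".format(self.x, self.y)


p = Point(1, 2)
print(str(p))
print(repr(p))
print([p])   # containers show their items using __repr__
```

If only __repr__ is defined, str() falls back to it, so defining __repr__ first is a common choice.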


Answer 2

I see all answers recommend using str(object). It might fail if your object contains non-ASCII characters, and you will see an error like ordinal not in range(128). This was the case for me while converting a list of strings in a language other than English.

I resolved it by using unicode(object). (This applies to Python 2; in Python 3, str is already Unicode and unicode() no longer exists.)


Answer 3

str(object) will do the trick.

If you want to alter the way an object is stringified, define a __str__(self) method for the object’s class. Such a method has to return a str or unicode object.


Answer 4

Use the str built-in:

x = str(something)

Examples:

>>> str(1)
'1'
>>> str(1.0)
'1.0'
>>> str([])
'[]'
>>> str({})
'{}'

...

From the documentation:

Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.


Answer 5

With str(x). However, every data type can define its own string conversion, so this might not be what you want.


Answer 6

You can use %s as shown below:

>>> "%s" %([])
'[]'

Answer 7

Just use str – for example:

>>> str([])
'[]'

Answer 8

Use formatting:

"%s" % (x)

Example:

import time; x = time.ctime(); s = "%s" % (x,); print(s)  # s rather than str, to avoid shadowing the built-in

Output: Thu Jan 11 20:40:05 2018
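
One detail of %-formatting is that a tuple on the right-hand side is unpacked as the argument list, so a value that might itself be a tuple is safer wrapped in a one-element tuple. A short sketch comparing it with str.format and f-strings (variable names are illustrative; f-strings assume Python 3.6 or later):

```python
import time

x = time.ctime()
old_style = "%s" % (x,)        # trailing comma makes a one-element tuple
new_style = "{}".format(x)
f_string = f"{x}"

pair = (1, 2)
pair_text = "%s" % (pair,)     # without the wrapping tuple this raises TypeError
print(old_style, pair_text)
```

All three styles call str() on the value by default, so they agree with str(x) for ordinary objects.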


Answer 9

Be careful if you really want to “change” the data type. Like in other cases (e.g. changing the iterator in a for loop) this might bring up unexpected behaviour:

>>> dct = {1:3, 2:1}
>>> len(str(dct))
12
>>> print(str(dct))
{1: 3, 2: 1}
>>> l = ["all","colours"]
>>> len(str(l))
18

Python unittests in Jenkins?

Question: Python unittests in Jenkins?

How do you get Jenkins to execute Python unittest cases? Is it possible to get JUnit-style XML output from the built-in unittest package?


Answer 0

sample tests:

tests.py:

# tests.py

import random
try:
    import unittest2 as unittest
except ImportError:
    import unittest

class SimpleTest(unittest.TestCase):
    @unittest.skip("demonstrating skipping")
    def test_skipped(self):
        self.fail("shouldn't happen")

    def test_pass(self):
        self.assertEqual(10, 7 + 3)

    def test_fail(self):
        self.assertEqual(11, 7 + 3)

JUnit with pytest

run the tests with:

pytest --junitxml=results.xml tests.py

results.xml:

<?xml version="1.0" encoding="utf-8"?>
<testsuite errors="0" failures="1" name="pytest" skips="1" tests="2" time="0.097">
    <testcase classname="tests.SimpleTest" name="test_fail" time="0.000301837921143">
        <failure message="test failure">self = &lt;tests.SimpleTest testMethod=test_fail&gt;

    def test_fail(self):
&gt;       self.assertEqual(11, 7 + 3)
E       AssertionError: 11 != 10

tests.py:16: AssertionError</failure>
    </testcase>
    <testcase classname="tests.SimpleTest" name="test_pass" time="0.000109910964966"/>
    <testcase classname="tests.SimpleTest" name="test_skipped" time="0.000164031982422">
        <skipped message="demonstrating skipping" type="pytest.skip">/home/damien/test-env/lib/python2.6/site-packages/_pytest/unittest.py:119: Skipped: demonstrating skipping</skipped>
    </testcase>
</testsuite>

JUnit with nose

run the tests with:

nosetests --with-xunit

nosetests.xml:

<?xml version="1.0" encoding="UTF-8"?>
<testsuite name="nosetests" tests="3" errors="0" failures="1" skip="1">
    <testcase classname="tests.SimpleTest" name="test_fail" time="0.000">
        <failure type="exceptions.AssertionError" message="11 != 10">
            <![CDATA[Traceback (most recent call last):
File "/opt/python-2.6.1/lib/python2.6/site-packages/unittest2-0.5.1-py2.6.egg/unittest2/case.py", line 340, in run
testMethod()
File "/home/damien/tests.py", line 16, in test_fail
self.assertEqual(11, 7 + 3)
File "/opt/python-2.6.1/lib/python2.6/site-packages/unittest2-0.5.1-py2.6.egg/unittest2/case.py", line 521, in assertEqual
assertion_func(first, second, msg=msg)
File "/opt/python-2.6.1/lib/python2.6/site-packages/unittest2-0.5.1-py2.6.egg/unittest2/case.py", line 514, in _baseAssertEqual
raise self.failureException(msg)
AssertionError: 11 != 10
]]>
        </failure>
    </testcase>
    <testcase classname="tests.SimpleTest" name="test_pass" time="0.000"></testcase>
    <testcase classname="tests.SimpleTest" name="test_skipped" time="0.000">
        <skipped type="nose.plugins.skip.SkipTest" message="demonstrating skipping">
            <![CDATA[SkipTest: demonstrating skipping
]]>
        </skipped>
    </testcase>
</testsuite>

JUnit with nose2

You would need to use the nose2.plugins.junitxml plugin. You can configure nose2 with a config file like you would normally do, or with the --plugin command-line option.

run the tests with:

nose2 --plugin nose2.plugins.junitxml --junit-xml tests

nose2-junit.xml:

<testsuite errors="0" failures="1" name="nose2-junit" skips="1" tests="3" time="0.001">
  <testcase classname="tests.SimpleTest" name="test_fail" time="0.000126">
    <failure message="test failure">Traceback (most recent call last):
  File "/Users/damien/Work/test2/tests.py", line 18, in test_fail
    self.assertEqual(11, 7 + 3)
AssertionError: 11 != 10
</failure>
  </testcase>
  <testcase classname="tests.SimpleTest" name="test_pass" time="0.000095" />
  <testcase classname="tests.SimpleTest" name="test_skipped" time="0.000058">
    <skipped />
  </testcase>
</testsuite>

JUnit with unittest-xml-reporting

Append the following to tests.py

if __name__ == '__main__':
    import xmlrunner
    unittest.main(testRunner=xmlrunner.XMLTestRunner(output='test-reports'))

run the tests with:

python tests.py

test-reports/TEST-SimpleTest-20131001140629.xml:

<?xml version="1.0" ?>
<testsuite errors="1" failures="0" name="SimpleTest-20131001140629" tests="3" time="0.000">
    <testcase classname="SimpleTest" name="test_pass" time="0.000"/>
    <testcase classname="SimpleTest" name="test_fail" time="0.000">
        <error message="11 != 10" type="AssertionError">
<![CDATA[Traceback (most recent call last):
  File "tests.py", line 16, in test_fail
    self.assertEqual(11, 7 + 3)
AssertionError: 11 != 10
]]>     </error>
    </testcase>
    <testcase classname="SimpleTest" name="test_skipped" time="0.000">
        <skipped message="demonstrating skipping" type="skip"/>
    </testcase>
    <system-out>
<![CDATA[]]>    </system-out>
    <system-err>
<![CDATA[]]>    </system-err>
</testsuite>
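
Jenkins reads these files itself, but when a report does not show up as expected it can help to inspect it locally first. A small sketch using only the standard library; the embedded XML is a trimmed copy in the shape of the nose2 report above, and the variable names are illustrative:

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of the reports above (illustrative, not real output).
report = """<testsuite errors="0" failures="1" name="nose2-junit" skips="1" tests="3" time="0.001">
  <testcase classname="tests.SimpleTest" name="test_fail" time="0.000126">
    <failure message="test failure">AssertionError: 11 != 10</failure>
  </testcase>
  <testcase classname="tests.SimpleTest" name="test_pass" time="0.000095" />
  <testcase classname="tests.SimpleTest" name="test_skipped" time="0.000058">
    <skipped />
  </testcase>
</testsuite>"""

suite = ET.fromstring(report)

# A test case counts as failed if it has a <failure> or an <error> child.
failed = [case.get("name") for case in suite.iter("testcase")
          if case.find("failure") is not None or case.find("error") is not None]
skipped = [case.get("name") for case in suite.iter("testcase")
           if case.find("skipped") is not None]

print(suite.get("tests"), failed, skipped)
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(report). Note that the runners above disagree on attribute names (skips versus skip) and on whether an assertion failure is reported as a failure or an error, so checking child elements is more portable than trusting the counters.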

Answer 1

I would second using nose. Basic XML reporting is now built in. Just use the --with-xunit command line option and it will produce a nosetests.xml file. For example:

nosetests --with-xunit

Then add a “Publish JUnit test result report” post build action, and fill in the “Test report XMLs” field with nosetests.xml (assuming that you ran nosetests in $WORKSPACE).


Answer 2

You can install the unittest-xml-reporting package to add a test runner that generates XML to the built-in unittest.

We use pytest, which has XML output built in (it’s a command line option).

Either way, executing the unit tests can be done by running a shell command.


Answer 3

I used nosetests. There are add-ons that output the XML for Jenkins.


Answer 4

When using buildout, we use collective.xmltestreport to produce JUnit-style XML output; perhaps its source code or the module itself could be of help.


Answer 5

python -m pytest --junit-xml=pytest_unit.xml source_directory/test/unit || true # tests may fail

Run this as a shell step from Jenkins; you can then collect the report in pytest_unit.xml as an artifact.


How to include package data with setuptools/distribute?

Question: How to include package data with setuptools/distribute?

When using setuptools, I can not get the installer to pull in any package_data files. Everything I’ve read says that the following is the correct way to do it. Can someone please advise?

setup(
   name='myapp',
   packages=find_packages(),
   package_data={
      'myapp': ['data/*.txt'],
   },
   include_package_data=True,
   zip_safe=False,
   install_requires=['distribute'],
)

where myapp/data/ is the location of the data files.


Answer 0

I realize that this is an old question, but for people finding their way here via Google: package_data is a low-down, dirty lie. It is only used when building binary packages (python setup.py bdist ...) but not when building source packages (python setup.py sdist ...). This is, of course, ridiculous: one would expect that building a source distribution would result in a collection of files that could be sent to someone else to build the binary distribution.

In any case, using MANIFEST.in will work both for binary and for source distributions.


Answer 1

I just had this same issue. The solution was simply to remove include_package_data=True.

After reading here, I realized that include_package_data aims to include files from version control, as opposed to merely “include package data” as the name implies. From the docs:

The data files [of include_package_data] must be under CVS or Subversion control

If you want finer-grained control over what files are included (for example, if you have documentation files in your package directories and want to exclude them from installation), then you can also use the package_data keyword.

Taking that argument out fixed it, which is coincidentally why it also worked when you switched to distutils, since it doesn’t take that argument.


Answer 2

Following @Joe’s recommendation to remove the include_package_data=True line also worked for me.

To elaborate a bit more, I have no MANIFEST.in file. I use Git and not CVS.

Repository takes this kind of shape:

/myrepo
    - .git/
    - setup.py
    - myproject
        - __init__.py
        - some_mod
            - __init__.py
            - animals.py
            - rocks.py
        - config
            - __init__.py
            - settings.py
            - other_settings.special
            - cool.huh
            - other_settings.xml
        - words
            - __init__.py
            - word_set.txt

setup.py:

from setuptools import setup, find_packages
import os.path

setup(
    name='myproject',
    version="4.19",
    packages=find_packages(),
    # package_dir={'mypkg': 'src/mypkg'},  # didn't use this.
    package_data={
        # If any package contains files matching these patterns, include them:
        '': ['*.txt', '*.xml', '*.special', '*.huh'],
    },

    # Oddly enough, include_package_data=True prevented package_data from working.
    # include_package_data=True,  # Commented out.
    data_files=[
        # ('bitmaps', ['bm/b1.gif', 'bm/b2.gif']),
        ('/opt/local/myproject/etc', ['myproject/config/settings.py', 'myproject/config/other_settings.special']),
        ('/opt/local/myproject/etc', [os.path.join('myproject/config', 'cool.huh')]),
        ('/opt/local/myproject/etc', [os.path.join('myproject/config', 'other_settings.xml')]),
        ('/opt/local/myproject/data', [os.path.join('myproject/words', 'word_set.txt')]),
    ],

    install_requires=[
        'jsonschema',
        'logging',
    ],

    entry_points={
        'console_scripts': [
            # Blah...
        ],
    },
)

I run python setup.py sdist for a source distrib (haven’t tried binary).

Then, inside a brand new virtual environment, I have a myproject-4.19.tar.gz file, and I use

(venv) pip install ~/myproject-4.19.tar.gz
...

And other than everything getting installed to my virtual environment’s site-packages, those special data files get installed to /opt/local/myproject/data and /opt/local/myproject/etc.


Answer 3

include_package_data=True worked for me.

If you use git, remember to include setuptools-git in install_requires. Far less boring than having a Manifest or including all paths in package_data (in my case it’s a Django app with all kinds of static files).

(Pasted from the comment I made, since k3-rnc mentioned it’s actually helpful as is.)


Answer 4

Update: This answer is old and the information is no longer valid. All setup.py configs should use import setuptools. I’ve added a more complete answer at https://stackoverflow.com/a/49501350/64313


I solved this by switching to distutils. Looks like distribute is deprecated and/or broken.

from distutils.core import setup

setup(
   name='myapp',
   packages=['myapp'],
   package_data={
      'myapp': ['data/*.txt'],
   },
)

Answer 5

Ancient question, and yet… package management in Python really leaves a lot to be desired. I had the use case of installing with pip locally into a specified directory, and was surprised that neither the package_data nor the data_files paths worked out. I was not keen on adding yet another file to the repo, so I ended up leveraging data_files and the setup.py option --install-data; something like this:

pip install . --install-option="--install-data=$PWD/package" -t package  

Answer 6

Moving the folder containing the package data into the module folder solved the problem for me.

See this question: MANIFEST.in ignored on “python setup.py install” – no data files installed?


Answer 7

I had the same problem for a couple of days but even this thread wasn’t able to help me as everything was confusing. So I did my research and found the following solution:

Basically in this case, you should do:

from setuptools import setup

setup(
   name='myapp',
   packages=['myapp'],
   package_dir={'myapp':'myapp'}, # the one line where all the magic happens
   package_data={
      'myapp': ['data/*.txt'],
   },
)

The full explanation is in another Stack Overflow answer.
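
Once a package is installed, a quick way to confirm that its data files came along is to read one of them through the package itself. This is a rough sketch under stated assumptions: it builds a throwaway package on disk so that it runs on its own, and the names demo_pkg_with_data and data/hello.txt are hypothetical stand-ins for a real package such as myapp.

```python
import pkgutil
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; all names here are hypothetical
# and exist only for this illustration.
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_pkg_with_data"
(package_dir / "data").mkdir(parents=True)
(package_dir / "__init__.py").write_text("")
(package_dir / "data" / "hello.txt").write_text("hello from package data\n")

# Make the package importable, as it would be after installation.
sys.path.insert(0, str(root))

# pkgutil.get_data returns the file's bytes, resolved relative to the package.
content = pkgutil.get_data("demo_pkg_with_data", "data/hello.txt")
print(content.decode("utf-8").strip())
```

If the data file had been left out of the installation, the pkgutil.get_data call would fail with an error instead of returning the contents.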


Answer 8

Just remove the line:

include_package_data=True,

from your setup script, and it will work fine. (Tested just now with latest setuptools.)


Answer 9

Using setup.cfg (setuptools ≥ 30.3.0)

Starting with setuptools 30.3.0 (released 2016-12-08), you can keep your setup.py very small and move the configuration to a setup.cfg file. With this approach, you could put your package data in an [options.package_data] section:

[options.package_data]
* = *.txt, *.rst
hello = *.msg

In this case, your setup.py can be as short as:

from setuptools import setup
setup()

For more information, see configuring setup using setup.cfg files.

There is some talk of deprecating setup.cfg in favour of pyproject.toml as proposed in PEP 518, but this is still provisional as of 2020-02-21.
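
To see how the [options.package_data] section maps onto the package_data dictionary from the setup.py examples, the section can be read with the standard-library configparser. This is only an illustration of the file format, assuming comma-separated patterns as in the example above; it is not how setuptools itself is implemented.

```python
import configparser

# The same section as in the setup.cfg example above.
SETUP_CFG = """\
[options.package_data]
* = *.txt, *.rst
hello = *.msg
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

# Split each comma-separated value into a list of glob patterns.
package_data = {
    package: [pattern.strip() for pattern in patterns.split(",")]
    for package, patterns in parser["options.package_data"].items()
}
print(package_data)
```

The resulting dictionary has the same shape as the package_data argument passed to setup() in the earlier answers.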


Where does this come from: -*- coding: utf-8 -*-

Question: Where does this come from: -*- coding: utf-8 -*-

Python recognizes the following as an instruction that defines a file’s encoding:

# -*- coding: utf-8 -*-

I have definitely seen this kind of instruction before (-*- var: value -*-). Where does it come from? What is the full specification, e.g. can the value include spaces, special symbols, newlines, even -*- itself?

My program will be writing plain text files and I’d like to include some metadata in them using this format.


Answer 0

This way of specifying the encoding of a Python file comes from PEP 0263 – Defining Python Source Code Encodings.

It is also recognized by GNU Emacs (see Python Language Reference, 2.1.4 Encoding declarations), though I don’t know if it was the first program to use that syntax.
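
As far as Python is concerned, the declaration is matched by a regular expression on the first or second line of the file, so the -*- markers are only there for the benefit of editors. The sketch below uses the expression as quoted in PEP 263 (worth checking against the PEP itself) together with tokenize.detect_encoding from the standard library. It also shows that, for Python, the encoding name cannot contain spaces.

```python
import io
import re
import tokenize

# Pattern as quoted in PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

line = "# -*- coding: utf-8 -*-"
match = CODING_RE.match(line)
declared = match.group(1) if match else None
print(declared)

# The standard library applies the same rule when reading source files.
source = b"# -*- coding: utf-8 -*-\nx = 1\n"
detected, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
print(detected)

# A value containing a space is cut at the first character outside the allowed set.
truncated = CODING_RE.match("# coding: utf 8").group(1)
print(truncated)
```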


Answer 1

# -*- coding: utf-8 -*- is a Python 2 thing. In Python 3+, the default encoding of source files is already UTF-8 and that line is useless.

See: Should I use encoding declaration in Python 3?

pyupgrade is a tool you can run on your code to remove those comments and other no-longer-useful leftovers from Python 2, like having all your classes inherit from object.


Answer 2

These are so-called file-local variables, which Emacs understands and sets accordingly. See the corresponding section in the Emacs manual; you can define them in either the header or the footer of a file.
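
For a program that wants to store its own metadata in this style, a header line of the form -*- name: value; name: value -*- can be read back with a few lines of code. The function below is a simplified, hypothetical sketch, not a reimplementation of the rules Emacs applies: it assumes values contain neither a semicolon nor the -*- marker.

```python
def parse_local_variables(line):
    """Parse '-*- name: value; other: value -*-' into a dict (simplified sketch)."""
    start = line.find("-*-")
    end = line.find("-*-", start + 3) if start != -1 else -1
    if start == -1 or end == -1:
        return {}
    body = line[start + 3:end]
    variables = {}
    for item in body.split(";"):
        # Split each 'name: value' pair at the first colon only.
        name, separator, value = item.partition(":")
        if separator and name.strip():
            variables[name.strip()] = value.strip()
    return variables


header = "# -*- coding: utf-8; mode: python -*-"
parsed = parse_local_variables(header)
print(parsed)
```

Lines without a complete pair of markers simply yield an empty dictionary.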


Answer 3

In PyCharm, I’d leave it out. It turns off the UTF-8 indicator at the bottom with a warning that the encoding is hard-coded. Don’t think you need the PyCharm comment mentioned above.


How to update Python?

Question: How to update Python?

I have version 2.7 installed from early 2012. I can’t find any consensus on whether I should completely uninstall and wipe this version before putting on the latest version.

“Soft”-removing old versions? Hard-removing/wiping old versions? Installing over top?

I’ve seen somewhere a special install/upgrade process using a “segmenting” method of Python installations, keeping different versions separate and apart, but functional. Not sure if this is the standard, de facto way.

I also wonder if Revo gets too overzealous and may cause issues with wiping out still-needed remnants, like environment/PATH variables.

(Win7 x64, 32-bit Python)


Answer 0

UPDATE: 2018-07-06

This post is now nearly 5 years old! Python-2.7 will stop receiving official updates from python.org in 2020. Also, Python-3.7 has been released. Check out Python-Future on how to make your Python-2 code compatible with Python-3. For updating conda, the documentation now recommends using conda update --all in each of your conda environments to update all packages and the Python executable for that version. Also, since they changed their name to Anaconda, I don’t know if the Windows registry keys are still the same.

UPDATE: 2017-03-24

There have been no updates to Python(x,y) since June of 2015, so I think it’s safe to assume it has been abandoned.

UPDATE: 2016-11-11

As @cxw comments below, these answers are for the same bit-versions, and by bit-version I mean 64-bit vs. 32-bit. For example, these answers would apply to updating from 64-bit Python-2.7.10 to 64-bit Python-2.7.11, ie: the same bit-version. While it is possible to install two different bit versions of Python together, it would require some hacking, so I’ll save that exercise for the reader. If you don’t want to hack, I suggest that if switching bit-versions, remove the other bit-version first.
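
If it is unclear which interpreter and bit-version is currently active, the interpreter can report this itself. A minimal sketch using only the standard library (the pointer-size check is a common idiom, not something taken from the installers discussed here):

```python
import platform
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build.
pointer_bits = struct.calcsize("P") * 8
version = "{}.{}.{}".format(*sys.version_info[:3])

print("executable:", sys.executable)
print("version:", version)
print("build:", "{}-bit".format(pointer_bits))
print("platform:", platform.platform())
```

Running this with each installed interpreter shows which one a given command or file association actually starts.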

UPDATES: 2016-05-16
  • Anaconda and MiniConda can be used with an existing Python installation by disabling the options to alter the Windows PATH and Registry. After extraction, create a symlink to conda in your bin or install conda from PyPI. Then create another symlink called conda-activate to activate in the Anaconda/Miniconda root bin folder. Now Anaconda/Miniconda is just like Ruby RVM. Just use conda-activate root to enable Anaconda/Miniconda.
  • Portable Python is no longer being developed or maintained.

TL;DR

  • Using Anaconda or miniconda, then just execute conda update --all to keep each conda environment updated,
  • same major version of official Python (e.g. 2.7.5), just install over old (e.g. 2.7.4),
  • different major version of official Python (e.g. 3.3), install side-by-side with old, set paths/associations to point to dominant (e.g. 2.7), shortcut to other (e.g. in BASH $ ln /c/Python33/python.exe python3).

The answer depends:

  1. If OP has 2.7.x and wants to install newer version of 2.7.x, then

    • if using the MSI installer from the official Python website, just install over the old version; the installer will issue a warning that it will remove and replace the older version; looking in “installed programs” in “control panel” before and after confirms that the old version has been replaced by the new version; newer versions of 2.7.x are backwards compatible so this is completely safe and therefore IMHO multiple versions of 2.7.x should never be necessary.
    • if building from source, then you should probably build in a fresh, clean directory, and then point your path to the new build once it passes all tests and you are confident that it has been built successfully, but you may wish to keep the old build around because building from source may occasionally have issues. See my guide for building Python x64 on Windows 7 with SDK 7.0.
    • if installing from a distribution such as Python(x,y), see their website. Python(x,y) has been abandoned. I believe that updates can be handled from within Python(x,y) with their package manager, but updates are also included on their website. I could not find a specific reference so perhaps someone else can speak to this. Similar to ActiveState and probably Enthought, Python (x,y) clearly states it is incompatible with other installations of Python:

      It is recommended to uninstall any other Python distribution before installing Python(x,y)

    • Enthought Canopy uses an MSI and will install either into Program Files\Enthought or home\AppData\Local\Enthought\Canopy\App for all users or per user respectively. Newer installations are updated by using the built in update tool. See their documentation.
    • ActiveState also uses an MSI so newer installations can be installed on top of older ones. See their installation notes.

      Other Python 2.7 Installations On Windows, ActivePython 2.7 cannot coexist with other Python 2.7 installations (for example, a Python 2.7 build from python.org). Uninstall any other Python 2.7 installations before installing ActivePython 2.7.

    • Sage recommends that you install it into a virtual machine, and provides an Oracle VirtualBox image file that can be used for this purpose. Upgrades are handled internally by issuing the sage -upgrade command.
    • Anaconda can be updated by using the conda command:

      conda update --all
      

      Anaconda/Miniconda lets users create environments to manage multiple Python versions including Python-2.6, 2.7, 3.3, 3.4 and 3.5. The root Anaconda/Miniconda installations are currently based on either Python-2.7 or Python-3.5.

      Anaconda will likely disrupt any other Python installations. Installation uses MSI installer. [UPDATE: 2016-05-16] Anaconda and Miniconda now use .exe installers and provide options to disable Windows PATH and Registry alterations.

      Therefore Anaconda/Miniconda can be installed without disrupting existing Python installations depending on how it was installed and the options that were selected during installation. If the .exe installer is used and the options to alter Windows PATH and Registry are not disabled, then any previous Python installations will be disabled, but simply uninstalling the Anaconda/Miniconda installation should restore the original Python installation, except maybe the Windows Registry Python\PythonCore keys.

      Anaconda/Miniconda makes the following registry edits regardless of the installation options: HKCU\Software\Python\ContinuumAnalytics\ with the following keys: Help, InstallPath, Modules and PythonPath – official Python registers these keys too, but under Python\PythonCore. Also uninstallation info is registered for Anaconda\Miniconda. Unless you select the “Register with Windows” option during installation, it doesn’t create PythonCore, so integrations like Python Tools for Visual Studio do not automatically see Anaconda/Miniconda. If the option to register Anaconda/Miniconda is enabled, then I think your existing Python Windows Registry keys will be altered and uninstallation will probably not restore them.

    • WinPython updates, I think, can be handled through the WinPython Control Panel.
    • PortablePython is no longer being developed. It had no update method. Possibly updates could be unzipped into a fresh directory and then App\lib\site-packages and App\Scripts could be copied to the new installation, but if this didn’t work then reinstalling all packages might have been necessary. Use pip list to see what packages were installed and their versions. Some were installed by PortablePython. Use easy_install pip to install pip if it wasn’t installed.
  2. If OP has 2.7.x and wants to install a different version, e.g. <=2.6.x or >=3.x.x, then installing different versions side-by-side is fine. You must choose which version of Python (if any) to associate with *.py files and which you want on your path, although you should be able to set up shells with different paths if you use BASH. AFAIK 2.7.x is backwards compatible with 2.6.x, so IMHO side-by-side installs is not necessary, however Python-3.x.x is not backwards compatible, so my recommendation would be to put Python-2.7 on your path and have Python-3 be an optional version by creating a shortcut to its executable called python3 (this is a common setup on Linux). The official Python default install path on Windows is

    • C:\Python33 for 3.3.x (latest 2013-07-29)
    • C:\Python32 for 3.2.x
    • &c.
    • C:\Python27 for 2.7.x (latest 2013-07-29)
    • C:\Python26 for 2.6.x
    • &c.
  3. If OP is not updating Python, but merely updating packages, they may wish to look into virtualenv to keep the different versions of packages specific to their development projects separate. Pip is also a great tool to update packages. If packages use binary installers I usually uninstall the old package before installing the new one.
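
Before replacing an installation, it can help to record which packages it contains, much like pip list does. A small sketch, assuming Python 3.8 or newer, where importlib.metadata is part of the standard library (it is not available in the 2.7 installation from the question):

```python
from importlib import metadata

# Collect (name, version) for every distribution visible to this interpreter.
installed = [
    (dist.metadata.get("Name"), dist.version)
    for dist in metadata.distributions()
    if dist.metadata.get("Name")
]
installed.sort(key=lambda item: item[0].lower())

for name, version in installed[:10]:
    print(name, version)
```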

I hope this clears up any confusion.


Answer 1

The best solution is to install the different Python versions in multiple paths.

eg. C:\Python27 for 2.7, and C:\Python33 for 3.3.

Read this for more info: How to run multiple Python versions on Windows


Answer 2

  • Official Python .msi installers are designed to replace:

    • any previous micro release (in x.y.z, z is “micro”) because they are guaranteed to be backward-compatible and binary-compatible
    • a “snapshot” (built from source) installation with any micro version
  • A snapshot installer is designed to replace any snapshot with a lower micro version.

(See responsible code for 2.x, for 3.x)

Any other versions are not necessarily compatible and are thus installed alongside the existing one. If you wish to uninstall the old version, you’ll need to do that manually. And also uninstall any 3rd-party modules you had for it:

  • If you installed any modules from bdist_wininst packages (Windows .exes), uninstall them before uninstalling the version, or the uninstaller might not work correctly if it has custom logic
  • modules installed with setuptools/pip that reside in Lib\site-packages can just be deleted afterwards
  • packages that you installed per-user, if any, reside in %APPDATA%/Python/PythonXY/site-packages and can likewise be deleted
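
The two site-packages locations mentioned above can be printed by the interpreter in question, which avoids guessing the paths. A minimal sketch using the standard-library site and sysconfig modules:

```python
import site
import sysconfig

# Directory where pip/setuptools place packages for this interpreter.
main_site_packages = sysconfig.get_paths()["purelib"]

# Per-user location (on Windows, typically under %APPDATA%\Python\PythonXY\site-packages).
user_site_packages = site.getusersitepackages()

print(main_site_packages)
print(user_site_packages)
```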

Answer 3

I have always just installed the new version on top and never had any issues. Do make sure that your path is updated to point to the new version though.