Tag Archives: Python

How can I get dict from sqlite query?

Question: How can I get dict from sqlite query?

import sqlite3

db = sqlite3.connect("test.sqlite")
res = db.execute("select * from table")

With iteration I get lists corresponding to the rows.

for row in res:
    print row

I can get the names of the columns

col_name_list = [tuple[0] for tuple in res.description]

But is there some function or setting to get dictionaries instead of lists?

{'col1': 'value', 'col2': 'value'}

Or do I have to do it myself?


Answer 0

You could use row_factory, as in the example in the docs:

import sqlite3

def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

con = sqlite3.connect(":memory:")
con.row_factory = dict_factory
cur = con.cursor()
cur.execute("select 1 as a")
print cur.fetchone()["a"]

or follow the advice that’s given right after this example in the docs:

If returning a tuple doesn’t suffice and you want name-based access to columns, you should consider setting row_factory to the highly-optimized sqlite3.Row type. Row provides both index-based and case-insensitive name-based access to columns with almost no memory overhead. It will probably be better than your own custom dictionary-based approach or even a db_row based solution.
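
For illustration, here is a minimal sketch of the access styles sqlite3.Row provides (the one-row query is made up):

import sqlite3

con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row
cur = con.cursor()
cur.execute("select 1 as a, 2 as b")
row = cur.fetchone()

print(row[0])      # index-based access: 1
print(row["a"])    # name-based access: 1
print(row["A"])    # column names are case-insensitive: 1
print(row.keys())  # ['a', 'b']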


Answer 1

I thought I'd answer this question even though the answer is partly mentioned in both Adam Schmideg's and Alex Martelli's answers, so that others like me who have the same question can find the answer easily.

import sqlite3

conn = sqlite3.connect(":memory:")

#This is the important part, here we are setting row_factory property of
#connection object to sqlite3.Row(sqlite3.Row is an implementation of
#row_factory)
conn.row_factory = sqlite3.Row
c = conn.cursor()
c.execute('select * from stocks')

result = c.fetchall()
#returns a list of sqlite3.Row objects; each item behaves like a
#dictionary and represents a row of the table

Answer 2

Even using the sqlite3.Row class, you still can't use string formatting in the form of:

print "%(id)i - %(name)s: %(value)s" % row

In order to get past this, I use a helper function that takes the row and converts to a dictionary. I only use this when the dictionary object is preferable to the Row object (e.g. for things like string formatting where the Row object doesn’t natively support the dictionary API as well). But use the Row object all other times.

def dict_from_row(row):
    return dict(zip(row.keys(), row))       
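
A quick usage sketch (assuming row is a sqlite3.Row fetched from a table with id, name and value columns):

row_dict = dict_from_row(row)
print "%(id)i - %(name)s: %(value)s" % row_dict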

Answer 3

After you connect to SQLite: con = sqlite3.connect(.....) it is sufficient to just run:

con.row_factory = sqlite3.Row

Voila!


Answer 4

From PEP 249:

Question: 

   How can I construct a dictionary out of the tuples returned by
   .fetch*():

Answer:

   There are several existing tools available which provide
   helpers for this task. Most of them use the approach of using
   the column names defined in the cursor attribute .description
   as basis for the keys in the row dictionary.

   Note that the reason for not extending the DB API specification
   to also support dictionary return values for the .fetch*()
   methods is that this approach has several drawbacks:

   * Some databases don't support case-sensitive column names or
     auto-convert them to all lowercase or all uppercase
     characters.

   * Columns in the result set which are generated by the query
     (e.g.  using SQL functions) don't map to table column names
     and databases usually generate names for these columns in a
     very database specific way.

   As a result, accessing the columns through dictionary keys
   varies between databases and makes writing portable code
   impossible.

So yes, do it yourself.
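
For completeness, a minimal sketch of the .description-based approach the FAQ describes (the helper name is made up; it should work with any DB-API cursor):

def rows_as_dicts(cursor):
    # column names come from cursor.description, per the DB API
    cols = [d[0] for d in cursor.description]
    return [dict(zip(cols, row)) for row in cursor.fetchall()]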


Answer 5

Shorter version:

db.row_factory = lambda c, r: dict([(col[0], r[idx]) for idx, col in enumerate(c.description)])

Answer 6

Fastest on my tests:

conn.row_factory = lambda c, r: dict(zip([col[0] for col in c.description], r))
c = conn.cursor()

%timeit c.execute('SELECT * FROM table').fetchall()
19.8 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

vs:

conn.row_factory = lambda c, r: dict([(col[0], r[idx]) for idx, col in enumerate(c.description)])
c = conn.cursor()

%timeit c.execute('SELECT * FROM table').fetchall()
19.4 µs ± 75.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You decide :)


Answer 7

Similar to the before-mentioned solutions, but the most compact:

db.row_factory = lambda C, R: { c[0]: R[i] for i, c in enumerate(C.description) }

Answer 8

As mentioned in @gandalf's answer, one has to use conn.row_factory = sqlite3.Row, but the results are not directly dictionaries. One has to add an additional "cast" to dict in the last loop:

import sqlite3
conn = sqlite3.connect(":memory:")
conn.execute('create table t (a text, b text, c text)')
conn.execute('insert into t values ("aaa", "bbb", "ccc")')
conn.execute('insert into t values ("AAA", "BBB", "CCC")')
conn.row_factory = sqlite3.Row
c = conn.cursor()
c.execute('select * from t')
for r in c.fetchall():
    print(dict(r))

# {'a': 'aaa', 'b': 'bbb', 'c': 'ccc'}
# {'a': 'AAA', 'b': 'BBB', 'c': 'CCC'}

Answer 9

I think you were on the right track. Let’s keep this very simple and complete what you were trying to do:

import sqlite3
db = sqlite3.connect("test.sqlite3")
cur = db.cursor()
res = cur.execute("select * from table").fetchall()
data = dict(zip([c[0] for c in cur.description], res[0]))

print(data)

The downside is .fetchall(), which is murder on your memory consumption if your table is very large. But for trivial applications dealing with a mere few thousand rows of text and numeric columns, this simple approach is good enough.

For serious stuff, you should look into row factories, as proposed in many other answers.


Answer 10

Or you could convert the sqlite3.Rows to a dictionary as follows. This will give a dictionary with a list for each row.

def from_sqlite_Row_to_dict(list_with_rows):
    ''' Turn a list with sqlite3.Row objects into a dictionary'''
    d = {}  # the dictionary to be filled with the row data and to be returned

    for i, row in enumerate(list_with_rows):  # iterate through the sqlite3.Row objects
        l = []  # for each Row use a separate list
        for col in range(0, len(row)):  # copy over the row data (i.e. column data) to a list
            l.append(row[col])
        d[i] = l  # add the list to the dictionary
    return d

Answer 11

A generic alternative, using just three lines:

def select_column_and_value(db, sql, parameters=()):
    execute = db.execute(sql, parameters)
    fetch = execute.fetchone()
    return {k[0]: v for k, v in list(zip(execute.description, fetch))}

con = sqlite3.connect('/mydatabase.db')
c = con.cursor()
print(select_column_and_value(c, 'SELECT * FROM things WHERE id=?', (id,)))

But if your query returns nothing, it will result in an error. In this case…

def select_column_and_value(self, sql, parameters=()):
    execute = self.execute(sql, parameters)
    fetch = execute.fetchone()

    if fetch is None:
        return {k[0]: None for k in execute.description}

    return {k[0]: v for k, v in list(zip(execute.description, fetch))}

or

def select_column_and_value(self, sql, parameters=()):
    execute = self.execute(sql, parameters)
    fetch = execute.fetchone()

    if fetch is None:
        return {}

    return {k[0]: v for k, v in list(zip(execute.description, fetch))}

Answer 12

import sqlite3

db = sqlite3.connect('mydatabase.db')
cursor = db.execute('SELECT * FROM students ORDER BY CREATE_AT')
studentList = cursor.fetchall()

columnNames = list(map(lambda x: x[0], cursor.description)) #students table column names list
studentsAssoc = {} #Assoc format is dictionary similarly


#THIS IS ASSOC PROCESS
for lineNumber, student in enumerate(studentList):
    studentsAssoc[lineNumber] = {}

    for columnNumber, value in enumerate(student):
        studentsAssoc[lineNumber][columnNames[columnNumber]] = value


print(studentsAssoc)

The result is definitely correct, but I do not know if it is the best approach.


Answer 13

Dictionaries in python provide arbitrary access to their elements. So any dictionary with "names", although it might be informative on one hand (i.e. what the field names are), "un-orders" the fields, which might be unwanted.

Best approach is to get the names in a separate list and then combine them with the results by yourself, if needed.

try:
    mycursor = self.memconn.cursor()
    mycursor.execute('''SELECT * FROM maintbl;''')
    # first get the names, because they will be lost after retrieval of rows
    names = list(map(lambda x: x[0], mycursor.description))
    manyrows = mycursor.fetchall()

    return manyrows, names

Also remember that the names, in all approaches, are the names you provided in the query, not the names in the database. The exception is SELECT * FROM.

If your only concern is to get the results using a dictionary, then definitely use the conn.row_factory = sqlite3.Row (already stated in another answer).


How can I save an image with PIL?

Question: How can I save an image with PIL?

I have just done some image processing using the Python Imaging Library (PIL), using a post I found earlier to perform Fourier transforms of images, and I can't get the save function to work. The whole code works fine but it just won't save the resulting image:

from PIL import Image
import numpy as np

i = Image.open("C:/Users/User/Desktop/mesh.bmp")
i = i.convert("L")
a = np.asarray(i)
b = np.abs(np.fft.rfft2(a))
j = Image.fromarray(b)
j.save("C:/Users/User/Desktop/mesh_trans",".bmp")

The error I get is the following:

save_handler = SAVE[string.upper(format)] # unknown format
    KeyError: '.BMP'

How can I save an image with Python's PIL?


Answer 0

The error regarding the file extension has been handled: you either use BMP (without the dot) or pass the output name with the extension already. Now, to handle the other error, you need to properly modify your data in the frequency domain so it can be saved as an integer image; PIL is telling you that it doesn't accept float data to save as BMP.

Here is a suggestion (with other minor modifications, like using fftshift and numpy.array instead of numpy.asarray) for doing the conversion for proper visualization:

import sys
import numpy
from PIL import Image

img = Image.open(sys.argv[1]).convert('L')

im = numpy.array(img)
fft_mag = numpy.abs(numpy.fft.fftshift(numpy.fft.fft2(im)))

visual = numpy.log(fft_mag)
visual = (visual - visual.min()) / (visual.max() - visual.min())

result = Image.fromarray((visual * 255).astype(numpy.uint8))
result.save('out.bmp')

Answer 1

You should be able to simply let PIL get the filetype from the extension, i.e. use:

j.save("C:/Users/User/Desktop/mesh_trans.bmp")

Answer 2

Try removing the . before the .bmp (it isn’t matching BMP as expected). As you can see from the error, the save_handler is upper-casing the format you provided and then looking for a match in SAVE. However the corresponding key in that object is BMP (instead of .BMP).

I don’t know a great deal about PIL, but from some quick searching around it seems that it is a problem with the mode of the image. Changing the definition of j to:

j = Image.fromarray(b, mode='RGB')

Seemed to work for me (however note that I have very little knowledge of PIL, so I would suggest using @mmgp’s solution as s/he clearly knows what they are doing :) ). For the types of mode, I used this page – hopefully one of the choices there will work for you.


Answer 3

I know that this is old, but I've found that (while using Pillow) opening the file in binary mode with open(fp, 'wb') and then saving the file will work. E.g:

with open(fp, 'wb') as f:
    result.save(f)

fp being the file path, of course.


What rules does Pandas use to generate a view versus a copy?

Question: What rules does Pandas use to generate a view versus a copy?

I’m confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.

If I have, for example,

df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))

I understand that a query returns a copy so that something like

foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40

will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as

df.iloc[3] = 70

or

df.ix[1,'B':'E'] = 222

will change df. But I’m lost when it comes to more complicated cases. For example,

df[df.C <= df.B] = 7654321

changes df, but

df[df.C <= df.B].ix[:,'B':'E']

does not.

Is there a simple rule that Pandas is using that I’m just missing? What’s going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I’m attempting to do in the last example above)?


Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I’ve also read through the “Related” questions on this topic, but I’m still missing the simple rule Pandas is using, and how I’d apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.


Answer 0

Here are the rules, with subsequent ones overriding earlier ones:

  • All operations generate a copy

  • If inplace=True is provided, it will modify in-place; only some operations support this

  • An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.

  • An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for .query; this will always return a copy as it's evaluated by numexpr.)

  • An indexer that gets on a multiple-dtyped object is always a copy.

Your example of chained indexing

df[df.C <= df.B].loc[:,'B':'E']

is not guaranteed to work (and thus you should never do this).

Instead do:

df.loc[df.C <= df.B, 'B':'E']

as this is faster and will always work

The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
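
As a concrete sketch of that recommendation, applied to the dataframe from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 8), columns=list('ABCDEFGH'), index=range(1, 9))

# A single .loc call selects and sets in one operation, so it always
# hits the original frame rather than a possibly temporary copy:
df.loc[df.C <= df.B, 'B':'E'] = 7654321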


What exactly does += do in Python?

Question: What exactly does += do in Python?

I need to know what += does in Python. It's that simple. I would also appreciate links to definitions of other shorthand tools in Python.


Answer 0

In Python, += is syntactic sugar for the __iadd__ special method, or __add__ or __radd__ if __iadd__ isn't present. The __iadd__ method of a class can do anything it wants. The list object implements it and uses it to iterate over an iterable object, appending each element to itself in the same way that the list's extend method does.

Here’s a simple custom class that implements the __iadd__ special method. You initialize the object with an int, then can use the += operator to add a number. I’ve added a print statement in __iadd__ to show that it gets called. Also, __iadd__ is expected to return an object, so I returned the addition of itself plus the other number which makes sense in this case.

>>> class Adder(object):
        def __init__(self, num=0):
            self.num = num

        def __iadd__(self, other):
            print 'in __iadd__', other
            self.num = self.num + other
            return self.num

>>> a = Adder(2)
>>> a += 3
in __iadd__ 3
>>> a
5

Hope this helps.


Answer 1

+= adds another value with the variable’s value and assigns the new value to the variable.

>>> x = 3
>>> x += 2
>>> print x
5

-=, *=, /= do the same for subtraction, multiplication and division.
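
A quick illustration of those (Python 3 shown, where / is true division):

>>> x = 12
>>> x -= 2    # x = x - 2  -> 10
>>> x *= 3    # x = x * 3  -> 30
>>> x /= 5    # x = x / 5  -> 6.0
>>> x
6.0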


Answer 2

x += 5 is not exactly the same as saying x = x + 5 in Python.

Note here:

In [1]: x = [2,3,4]    
In [2]: y = x    
In [3]: x += 7,8,9    
In [4]: x
Out[4]: [2, 3, 4, 7, 8, 9]    
In [5]: y
Out[5]: [2, 3, 4, 7, 8, 9]    
In [6]: x += [44,55]    
In [7]: x
Out[7]: [2, 3, 4, 7, 8, 9, 44, 55]    
In [8]: y
Out[8]: [2, 3, 4, 7, 8, 9, 44, 55]    
In [9]: x = x + [33,22]    
In [10]: x
Out[10]: [2, 3, 4, 7, 8, 9, 44, 55, 33, 22]    
In [11]: y
Out[11]: [2, 3, 4, 7, 8, 9, 44, 55]

See for reference: Why does += behave unexpectedly on lists?


Answer 3

+= adds a number to a variable, changing the variable itself in the process (whereas + would not). Similar to this, there are the following that also modifies the variable:

  • -=, subtracts a value from variable, setting the variable to the result
  • *=, multiplies the variable and a value, making the outcome the variable
  • /=, divides the variable by the value, making the outcome the variable
  • %=, performs modulus on the variable, with the variable then being set to the result of it

There may be others. I am not a Python programmer.


Answer 4

It adds the right operand to the left. x += 2 means x = x + 2

It can also add elements to a list — see this SO thread.


Answer 5

It is not a mere syntactic shortcut. Try this:

x=[]                   # empty list
x += "something"       # iterates over the string and appends to list
print(x)               # ['s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g']

versus

x=[]                   # empty list
x = x + "something"    # TypeError: can only concatenate list (not "str") to list

This illustrates that += invokes the __iadd__ list method but + invokes __add__, and these do different things with lists.


Answer 6

Notionally, a += b "adds" b to a, storing the result in a. This simplistic description would describe the += operator in many languages.

However the simplistic description raises a couple of questions.

  1. What exactly do we mean by “adding”?
  2. What exactly do we mean by “storing the result in a”? python variables don’t store values directly they store references to objects.

In python the answers to both of these questions depend on the data type of a.


So what exactly does “adding” mean?

  • For numbers it means numeric addition.
  • For lists, tuples, strings etc it means concatenation.

Note that for lists += is more flexible than +, the + operator on a list requires another list, but the += operator will accept any iterable.


So what does “storing the value in a” mean?

If the object is mutable then it is encouraged (but not required) to perform the modification in-place. So a points to the same object it did before but that object now has different content.

If the object is immutable then it obviously can't perform the modification in-place. Some mutable objects may also not have an implementation of an in-place "add" operation. In this case the variable "a" will be updated to point to a new object containing the result of the addition operation.

Technically this is implemented by looking for __iadd__ first; if that is not implemented then __add__ is tried and finally __radd__.


Care is required when using += in python on variables where we are not certain of the exact type and in particular where we are not certain if the type is mutable or not. For example consider the following code.

def dostuff(a):
    b = a
    a += (3,4)
    print(repr(a)+' '+repr(b))

dostuff((1,2))
dostuff([1,2])

When we invoke dostuff with a tuple then the tuple is copied as part of the += operation and so b is unaffected. However when we invoke it with a list the list is modified in place, so both a and b are affected.

In python 3, similar behaviour is observed with the “bytes” and “bytearray” types.


Finally note that reassignment happens even if the object is not replaced. This doesn’t matter much if the left hand side is simply a variable but it can cause confusing behaviour when you have an immutable collection referring to mutable collections for example:

a = ([1,2],[3,4])
a[0] += [5]

In this case [5] will successfully be added to the list referred to by a[0] but then afterwards an exception will be raised when the code tries and fails to reassign a[0].
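
A short sketch of that last point: the in-place list extend succeeds before the tuple item assignment fails.

a = ([1, 2], [3, 4])
try:
    a[0] += [5]    # the list is extended in place, then reassigning a[0] raises
except TypeError as e:
    print(e)       # 'tuple' object does not support item assignment
print(a[0])        # [1, 2, 5] -- the list was modified anyway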


Answer 7

The short answer is += can be translated as “add whatever is to the right of the += to the variable on the left of the +=”.

Ex. If you have a = 10 then a += 5 would be: a = a + 5

So "a" is now equal to 15.


Answer 8

Note that x += y is not the same as x = x + y in some situations where an additional operator is included, because of operator precedence combined with the fact that the right-hand side is always evaluated first, e.g.

>>> x = 2
>>> x += 2 and 1
>>> x
3

>>> x = 2
>>> x = x + 2 and 1
>>> x
1

Note how the first case expands to:

>>> x = 2
>>> x = x + (2 and 1)
>>> x
3

You are more likely to encounter this in the ‘real world’ with other operators, e.g.

x *= 2 + 1 == x = x * (2 + 1) != x = x * 2 + 1


Answer 9

+= is just a shortcut for writing

number = 4
number = number + 1

So instead you would write

numbers = 4
numbers += 1

Both ways are correct, but example two helps you write a little less code.


Answer 10

As others also said, the += operator is a shortcut. An example:

var = 1
var = var + 1
#var = 2

It could also be written like so:

var = 1
var += 1
#var = 2

So instead of writing the first example, you can just write the second one, which would work just fine.


Answer 11

Remember when you used to sum on your old calculator, say 2 and 3: every time you hit =, you saw 3 added to the total. += does a similar job. Example:

>>> orange = 2
>>> orange += 3
>>> print(orange)
5
>>> orange +=3
>>> print(orange)
8

Answer 12

I’m seeing a lot of answers that don’t bring up using += with multiple integers.

One example:

x -= 1 + 3

This would be similar to:

x = x - (1 + 3)

and not:

x = (x - 1) + 3

Answer 13

According to the documentation

x += y is equivalent to x = operator.iadd(x, y). Another way to put it is to say that z = operator.iadd(x, y) is equivalent to the compound statement z = x; z += y.

So x += 3 is the same as x = x + 3.

x = 2
x += 3
print(x)

will output 5.

Notice that there’s also


Convert a 1D array into a 2D array in numpy

Question: Convert a 1D array into a 2D array in numpy

I want to convert a 1-dimensional array into a 2-dimensional array by specifying the number of columns in the 2D array. Something that would work like this:

> import numpy as np
> A = np.array([1,2,3,4,5,6])
> B = vec2matrix(A,ncol=2)
> B
array([[1, 2],
       [3, 4],
       [5, 6]])

Does numpy have a function that works like my made-up function “vec2matrix”? (I understand that you can index a 1D array like a 2D array, but that isn’t an option in the code I have – I need to make this conversion.)


Answer 0

You want to reshape the array.

B = np.reshape(A, (-1, 2))

where -1 infers the size of the new dimension from the size of the input array.


Answer 1

You have two options:

  • If you no longer want the original shape, the easiest is just to assign a new shape to the array

    a.shape = (a.size//ncols, ncols)
    

    You can replace the a.size//ncols with -1 to compute the proper shape automatically. Make sure that a.shape[0]*a.shape[1] == a.size, else you'll run into some problems.

  • You can get a new array with the np.reshape function, that works mostly like the version presented above

    new = np.reshape(a, (-1, ncols))
    

    When it's possible, new will be just a view of the initial array a, meaning that the data are shared. In some cases, though, the new array will be a copy instead. Note that np.reshape also accepts an optional keyword order that lets you switch from row-major C order to column-major Fortran order. np.reshape is the function version of the a.reshape method.

If you can't respect the requirement a.shape[0]*a.shape[1] == a.size, you're stuck with having to create a new array. You can use the np.resize function, mixing it with np.reshape, such as:

>>> a = np.arange(9)
>>> np.resize(a, 10).reshape(5, 2)

Answer 2

Try something like:

B = np.reshape(A,(-1,ncols))

You’ll need to make sure that you can divide the number of elements in your array by ncols though. You can also play with the order in which the numbers are pulled into B using the order keyword.


Answer 3

If your sole purpose is to convert a 1d array X to a 2d array just do:

X = np.reshape(X,(1, X.size))

Answer 4

import numpy as np
array = np.arange(8) 
print("Original array : \n", array)
array = np.arange(8).reshape(2, 4)
print("New array : \n", array)

Answer 5

some_array.shape = (1,)+some_array.shape

or get a new one

another_array = numpy.reshape(some_array, (1,)+some_array.shape)

This will make the dimensions +1, which is equivalent to adding a bracket at the outermost level.


Answer 6

You can use flatten() from the numpy package.

import numpy as np
a = np.array([[1, 2],
       [3, 4],
       [5, 6]])
a_flat = a.flatten()
print(f"original array: {a} \nflattened array = {a_flat}")

Output:

original array: [[1 2]
 [3 4]
 [5 6]] 
flattened array = [1 2 3 4 5 6]

Answer 7

Change 1D array into 2D array without using Numpy.

l = [i for i in range(1,21)]
part = 3
new = []
start, end = 0, part


while end <= len(l):
    temp = []
    for i in range(start, end):
        temp.append(l[i])
    new.append(temp)
    start += part
    end += part
print("new values:  ", new)


# for uneven cases
temp = []
while start < len(l):
    temp.append(l[start])
    start += 1
if temp:
    new.append(temp)  # append the leftover chunk once
print("new values for uneven cases:   ", new)

How to convert 'false' to 0 and 'true' to 1 in Python

Question: How to convert 'false' to 0 and 'true' to 1 in Python

Is there a way to convert true of type unicode to 1 and false of type unicode to 0 (in Python)?

For example: x == 'true' and type(x) == unicode

I want x = 1

PS: I don't want to use if-else.


Answer 0

Use int() on a boolean test:

x = int(x == 'true')

int() turns the boolean into 1 or 0. Note that any value not equal to 'true' will result in 0 being returned.
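
For example, checking both branches:

>>> int('true' == 'true')   # equal, so True -> 1
1
>>> int('True' == 'true')   # the comparison is case-sensitive, so False -> 0
0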


Answer 1

If B is a Boolean array, write

B = B*1

(A bit code golfy.)


Answer 2

You can use x.astype('uint8') where x is your Boolean array.
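
A small sketch of that:

import numpy as np

x = np.array([True, False, True])
print(x.astype('uint8'))   # [1 0 1]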


Answer 3

Here's yet another solution to your problem:

def to_bool(s):
    return 1 - sum(map(ord, s)) % 2
    # return 1 - sum(s.encode('ascii')) % 2  # Alternative for Python 3

It works because the sum of the ASCII codes of 'true' is 448, which is even, while the sum of the ASCII codes of 'false' is 523 which is odd.


The funny thing about this solution is that its result is pretty random if the input is not one of 'true' or 'false'. Half of the time it will return 0, and the other half 1. The variant using encode will raise an encoding error if the input is not ASCII (thus increasing the undefined-ness of the behaviour).


Seriously, I believe the most readable, and fastest, solution is to use an if:

def to_bool(s):
    return 1 if s == 'true' else 0

See some microbenchmarks:

In [14]: def most_readable(s):
    ...:     return 1 if s == 'true' else 0

In [15]: def int_cast(s):
    ...:     return int(s == 'true')

In [16]: def str2bool(s):
    ...:     try:
    ...:         return ['false', 'true'].index(s)
    ...:     except (ValueError, AttributeError):
    ...:         raise ValueError()

In [17]: def str2bool2(s):
    ...:     try:
    ...:         return ('false', 'true').index(s)
    ...:     except (ValueError, AttributeError):
    ...:         raise ValueError()

In [18]: def to_bool(s):
    ...:     return 1 - sum(s.encode('ascii')) % 2

In [19]: %timeit most_readable('true')
10000000 loops, best of 3: 112 ns per loop

In [20]: %timeit most_readable('false')
10000000 loops, best of 3: 109 ns per loop

In [21]: %timeit int_cast('true')
1000000 loops, best of 3: 259 ns per loop

In [22]: %timeit int_cast('false')
1000000 loops, best of 3: 262 ns per loop

In [23]: %timeit str2bool('true')
1000000 loops, best of 3: 343 ns per loop

In [24]: %timeit str2bool('false')
1000000 loops, best of 3: 325 ns per loop

In [25]: %timeit str2bool2('true')
1000000 loops, best of 3: 295 ns per loop

In [26]: %timeit str2bool2('false')
1000000 loops, best of 3: 277 ns per loop

In [27]: %timeit to_bool('true')
1000000 loops, best of 3: 607 ns per loop

In [28]: %timeit to_bool('false')
1000000 loops, best of 3: 612 ns per loop

Notice how the if solution is at least 2.5 times faster than all the other solutions. It does not make sense to make avoiding ifs a requirement, except if this is some kind of homework (in which case you shouldn't have asked this in the first place).


Answer 4

If you need a general-purpose conversion from a string which per se is not a bool, you had better write a routine similar to the one depicted below. In keeping with the spirit of duck typing, I have not silently passed the error but converted it as appropriate for the current scenario.

>>> def str2bool(st):
        try:
            return ['false', 'true'].index(st.lower())
        except (ValueError, AttributeError):
            raise ValueError('no Valid Conversion Possible')


>>> str2bool('garbaze')

Traceback (most recent call last):
  File "<pyshell#106>", line 1, in <module>
    str2bool('garbaze')
  File "<pyshell#105>", line 5, in str2bool
    raise ValueError('no Valid Conversion Possible')
ValueError: no Valid Conversion Possible
>>> str2bool('false')
0
>>> str2bool('True')
1

Answer 5

bool to int: x = (x == 'true') + 0

Now the x contains 1 if x == 'true' else 0.

Note: x == 'true' returns a bool, which is then typecast to an int (1 if the bool is True, else 0) when added to 0.


Answer 6

Only with this (JavaScript):

const a = true;
const b = false;

console.log(+a); // 1
console.log(+b); // 0


Argparse: required argument 'y' if 'x' is present

Question: Argparse: required argument 'y' if 'x' is present

I have a requirement as follows:

./xyifier --prox --lport lport --rport rport

For the argument prox, I use action='store_true' to check if it is present or not. I do not require any of the arguments. But, if --prox is set I require rport and lport as well. Is there an easy way of doing this with argparse without writing custom conditional coding?

More Code:

non_int.add_argument('--prox', action='store_true', help='Flag to turn on proxy')
non_int.add_argument('--lport', type=int, help='Listen Port.')
non_int.add_argument('--rport', type=int, help='Proxy port.')

Answer 0

No, there isn’t any option in argparse to make mutually inclusive sets of options.

The simplest way to deal with this would be:

if args.prox and (args.lport is None or args.rport is None):
    parser.error("--prox requires --lport and --rport.")

Answer 1

You’re talking about having conditionally required arguments. Like @borntyping said you could check for the error and do parser.error(), or you could just apply a requirement related to --prox when you add a new argument.

A simple solution for your example could be:

import sys

non_int.add_argument('--prox', action='store_true', help='Flag to turn on proxy')
non_int.add_argument('--lport', required='--prox' in sys.argv, type=int)
non_int.add_argument('--rport', required='--prox' in sys.argv, type=int)

This way required receives either True or False depending on whether the user has used --prox. This also guarantees that --lport and --rport have independent behavior from each other.


Answer 2

How about using the parser.parse_known_args() method and then adding the --lport and --rport args as required args if --prox is present?

# just add --prox arg now
non_int = argparse.ArgumentParser(description="stackoverflow question", 
                                  usage="%(prog)s [-h] [--prox --lport port --rport port]")
non_int.add_argument('--prox', action='store_true', 
                     help='Flag to turn on proxy, requires additional args lport and rport')
opts, rem_args = non_int.parse_known_args()
if opts.prox:
    non_int.add_argument('--lport', required=True, type=int, help='Listen Port.')
    non_int.add_argument('--rport', required=True, type=int, help='Proxy port.')
    # use options and namespace from first parsing
    non_int.parse_args(rem_args, namespace = opts)

Also keep in mind that you can supply the namespace opts generated after the first parsing while parsing the remaining arguments the second time. That way, in the end, after all the parsing is done, you'll have a single namespace with all the options.

Drawbacks:

  • If --prox is not present the other two dependent options aren’t even present in the namespace. Although based on your use-case, if --prox is not present, what happens to the other options is irrelevant.
  • Need to modify usage message as parser doesn’t know full structure
  • --lport and --rport don’t show up in help message

Answer 3

Do you use lport when prox is not set? If not, why not make lport and rport arguments of prox? e.g.

parser.add_argument('--prox', nargs=2, type=int, help='Prox: listen and proxy ports')

That saves your users typing. It is just as easy to test if args.prox is not None: as if args.prox:.


回答 4

接受的答案对我很有用!由于所有代码都未经测试就被破坏,这就是我测试接受答案的方式。parser.error()不会引发argparse.ArgumentError错误,而是退出该过程。您必须进行测试SystemExit

与pytest

import pytest
from . import parse_arguments  # code that rasises parse.error()


def test_args_parsed_raises_error():
    with pytest.raises(SystemExit):
        parse_arguments(["argument that raises error"])

有单元测试

from unittest import TestCase
from . import parse_arguments  # code that rasises parse.error()

class TestArgs(TestCase):

    def test_args_parsed_raises_error():
        with self.assertRaises(SystemExit) as cm:
            parse_arguments(["argument that raises error"])

启发自:使用unittest测试argparse-退出错误

The accepted answer worked great for me! Since all code is broken without tests, here is how I tested the accepted answer. parser.error() does not raise an argparse.ArgumentError; instead it exits the process. You have to test for SystemExit.

with pytest

import pytest
from . import parse_arguments  # code that raises parser.error()


def test_args_parsed_raises_error():
    with pytest.raises(SystemExit):
        parse_arguments(["argument that raises error"])

with unittest

from unittest import TestCase
from . import parse_arguments  # code that raises parser.error()

class TestArgs(TestCase):

    def test_args_parsed_raises_error(self):
        with self.assertRaises(SystemExit) as cm:
            parse_arguments(["argument that raises error"])

Inspired by: Using unittest to test argparse – exit errors


Python – what exactly is sklearn.pipeline.Pipeline?

Question: What exactly is sklearn.pipeline.Pipeline?

I can’t figure out how the sklearn.pipeline.Pipeline works exactly.

There are a few explanations in the docs. For example, what do they mean by:

Pipeline of transforms with a final estimator.

To make my question clearer, what are steps? How do they work?

Edit

Thanks to the answers I can make my question clearer:

When I call pipeline and pass, as steps, two transformers and one estimator, e.g:

pipln = Pipeline([("trsfm1",transformer_1),
                  ("trsfm2",transformer_2),
                  ("estmtr",estimator)])

What happens when I call this?

pipln.fit()
OR
pipln.fit_transform()

I can’t figure out how an estimator can be a transformer and how a transformer can be fitted.


Answer 0

Transformer in scikit-learn – some class that has fit and transform methods, or a fit_transform method.

Predictor – some class that has fit and predict methods, or a fit_predict method.

Pipeline is just an abstract notion, it's not some existing ml algorithm. Often in ML tasks you need to perform a sequence of different transformations (find a set of features, generate new features, select only some good features) of the raw dataset before applying a final estimator.

Here is a good example of Pipeline usage. It gives you a single interface for all 3 steps of transformation and the resulting estimator. It encapsulates transformers and predictors inside, and now you can do something like:

    vect = CountVectorizer()
    tfidf = TfidfTransformer()
    clf = SGDClassifier()

    vX = vect.fit_transform(Xtrain)
    tfidfX = tfidf.fit_transform(vX)
    clf.fit(tfidfX, ytrain)  # ytrain holds the training labels

    # Now evaluate all steps on test set (transform only, no refitting)
    vX = vect.transform(Xtest)
    tfidfX = tfidf.transform(vX)
    predicted = clf.predict(tfidfX)

With just:

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)

With pipelines you can easily perform a grid-search over a set of parameters for each step of this meta-estimator, as described in the link above. All steps except the last one must be transforms; the last step can be a transformer or a predictor.

Answer to the edit: When you call pipln.fit(), each transformer inside the pipeline is fitted on the outputs of the previous transformer (the first transformer is learned on the raw dataset). The last estimator may be a transformer or a predictor. You can call fit_transform() on the pipeline only if your last estimator is a transformer (one that implements fit_transform, or transform and fit separately); you can call fit_predict() or predict() on the pipeline only if your last estimator is a predictor. So you just can’t call fit_transform or transform on a pipeline whose last step is a predictor.
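
To make those mechanics concrete, here is a simplified sketch of what a pipeline does internally; it is illustrative only, not sklearn’s actual implementation:

class SimplePipeline(object):
    """Illustrative sketch only -- sklearn's real Pipeline does much more."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) tuples

    def fit(self, X, y=None):
        for name, step in self.steps[:-1]:
            # each transformer is fitted on the output of the previous one
            X = step.fit_transform(X, y)
        self.steps[-1][1].fit(X, y)  # final estimator (transformer or predictor)
        return self

    def predict(self, X):
        for name, step in self.steps[:-1]:
            X = step.transform(X)  # no refitting at prediction time
        return self.steps[-1][1].predict(X)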


回答 1

我认为M0rkHaV的想法是正确的。Scikit-learn的Pipeline类是一个有用的工具,用于把多个不同的转换器连同一个估算器封装到同一个对象中,这样您只需调用一次那些重要的方法(fit()、predict()等)。让我们分解两个主要部分:

  1. 转换器(Transformer)是同时实现fit()和transform()的类。您可能熟悉一些sklearn预处理工具,例如TfidfVectorizer和Binarizer。如果查看这些预处理工具的文档,就会发现它们实现了这两种方法。我觉得很酷的是,有些估算器也可以用作转换步骤,例如LinearSVC!

  2. 估算器(Estimator)是同时实现fit()和predict()的类。您会发现许多分类器和回归模型都实现了这两种方法,因此您可以轻松地测试许多不同的模型。也可以使用另一个转换器作为最终估算器(即,它不一定实现predict(),但肯定实现fit())。这只意味着您将无法调用predict()。

至于您的编辑:让我们来看一个基于文本的示例。使用LabelBinarizer,我们希望将标签列表转换为二进制值列表。

bin = LabelBinarizer()  #first we initialize

vec = ['cat', 'dog', 'dog', 'dog'] #we have our label list we want binarized

现在,当二值化器在某些数据上完成拟合后,它将拥有一个名为classes_的结构,其中包含该转换器“知道”的唯一类别。如果不先调用fit(),二值化器就不知道数据是什么样子,因此调用transform()没有任何意义。在尝试拟合数据之前打印类别列表,就可以验证这一点。

print bin.classes_  

尝试此操作时出现以下错误:

AttributeError: 'LabelBinarizer' object has no attribute 'classes_'

但是,当您在vec列表上拟合二值化器之后:

bin.fit(vec)

然后再试一次

print bin.classes_

我得到以下内容:

['cat' 'dog']


print bin.transform(vec)

现在,在对vec对象调用transform之后,我们得到以下信息:

[[0]
 [1]
 [1]
 [1]]

至于用作转换器的估算器,让我们以DecisionTree分类器作为特征提取器的示例。决策树之所以出色有很多原因,但就我们的目的而言,重要的是它们能够对其认为对预测有用的特征进行排名。当您对决策树调用transform()时,它将接收您的输入数据,并找出它认为最重要的特征。因此,您可以把它理解为将数据矩阵(n行×m列)转换为较小的矩阵(n行×k列),其中k列是决策树找到的k个最重要的特征。

I think that M0rkHaV has the right idea. Scikit-learn’s pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc). Let’s break down the two major components:

  1. Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you’ll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC!

  2. Estimators are classes that implement both fit() and predict(). You’ll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn’t necessarily implement predict(), but definitely implements fit()). All this means is that you wouldn’t be able to call predict().

As for your edit: let’s go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.

bin = LabelBinarizer()  #first we initialize

vec = ['cat', 'dog', 'dog', 'dog'] #we have our label list we want binarized

Now, when the binarizer is fitted on some data, it will have a structure called classes_ that contains the unique classes that the transformer ‘knows’ about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn’t make any sense. You can see this by printing out the list of classes before trying to fit the data.

print bin.classes_  

I get the following error when trying this:

AttributeError: 'LabelBinarizer' object has no attribute 'classes_'

But when you fit the binarizer on the vec list:

bin.fit(vec)

and try again

print bin.classes_

I get the following:

['cat' 'dog']


print bin.transform(vec)

And now, after calling transform on the vec object, we get the following:

[[0]
 [1]
 [1]
 [1]]

As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature-extractor. Decision Trees are great for a lot of reasons, but for our purposes, what’s important is that they have the ability to rank features that the tree found useful for predicting. When you call transform() on a Decision Tree, it will take your input data and find what it thinks are the most important features. So you can think of it transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.
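
Note that recent scikit-learn versions removed the transform() method from tree-based estimators; the same idea is expressed today by wrapping the model in SelectFromModel. A sketch of a tree as feature extractor (the iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# the fitted tree ranks features; SelectFromModel keeps only the important ones
selector = SelectFromModel(DecisionTreeClassifier())
X_reduced = selector.fit_transform(X, y)  # n rows by k columns, k <= m
print(X.shape, X_reduced.shape)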


回答 2

ML算法通常处理表格数据。您可能需要在ML算法之前和之后对该数据进行预处理和后处理。管道是链接这些数据处理步骤的一种方式。

什么是ML管道,它们如何工作?

管道是转换数据的一系列步骤。它源自古老的“管道和过滤器”设计模式(例如,您可以想到带有管道符“|”或重定向运算符“>”的unix bash命令)。但是,管道是代码中的对象。因此,您可能为每个过滤器(也就是每个管道步骤)编写一个类,再用另一个类将这些步骤组合成最终的管道。有些管道可能将其他管道串联或并联组合、具有多个输入或输出,等等。我们喜欢将机器学习管道视为:

  • 管道和过滤器。管道的各个步骤处理数据,并管理可从数据中学习的内部状态。
  • 组合(Composites)。管道可以嵌套:例如,整个管道可以被视为另一个管道中的单个管道步骤。管道步骤不一定是管道,但根据定义,管道本身至少是一个管道步骤。
  • 有向无环图(DAG)。管道步骤的输出可以发送到许多其他步骤,然后生成的输出可以重新组合,依此类推。旁注:尽管管道是非循环的,但它们可以逐个处理多个条目,并且如果它们的状态发生变化(例如:每次使用fit_transform方法),则它们可以被视为随时间循环展开、保持其状态(类似RNN)。在将管道投入生产并用更多数据训练它们时,这是一种看待在线学习的有趣方式。

Scikit-Learn管道的方法

管道(或管道中的步骤)必须具有以下两种方法:

  • “fit”:学习数据并获取状态(例如:神经网络的权重就是这种状态)
  • “transform”(或“predict”):实际处理数据并生成预测。

也可以调用以下方法将两者链接起来:

  • “fit_transform”:拟合然后转换数据,但只需一次遍历;当这两种方法必须直接先后执行时,这允许潜在的代码优化。

sklearn.pipeline.Pipeline类的问题

Scikit-Learn的“管道和过滤器”设计模式非常漂亮。但是如何将其用于深度学习,AutoML和复杂的生产级管道?

Scikit-Learn于2007年首次发布,那是一个前深度学习时代。然而,它是最著名、采用最广泛的机器学习库之一,并且仍在增长。最重要的是,它使用“管道和过滤器”设计模式作为软件体系结构风格——这正是Scikit-Learn如此出色的原因,此外它还提供了现成可用的算法。但是,在执行以下操作时它存在很多问题,而我们在2020年应该已经能够做到这些:

  • 自动机器学习(AutoML),
  • 深度学习管道,
  • 更复杂的机器学习管道。

我们为那些Scikit-Learn问题找到的解决方案

当然,Scikit-Learn非常方便且结构精良。但是,它需要一次刷新。以下是我们借助Neuraxle给出的解决方案,让Scikit-Learn在现代计算项目中保持新鲜和可用!

通过Neuraxle提供的其他管道方法和功能

注意:如果管道的某个步骤不需要使用fit或transform方法之一,则它可以从NonFittableMixinNonTransformableMixin继承,以提供这些方法之一的默认实现而不执行任何操作。

首先,管道或其步骤还可以选择性地定义以下方法:

  • “setup”:将在其每个步骤上调用“setup”方法。例如,如果某个步骤包含TensorFlow、PyTorch或Keras神经网络,这些步骤可以在拟合之前,在“setup”方法中创建它们的神经图并将其注册到GPU。出于多种原因,不建议直接在步骤的构造函数中创建图,例如,在自动为您搜索最佳超参数的自动机器学习算法中,这些步骤可能会在使用不同超参数多次运行之前被复制。
  • “teardown”:与“setup”方法相反,它清理资源。

默认提供以下方法,用于管理超参数:

  • “get_hyperparams”:返回超参数的字典。如果您的管道包含更多管道(嵌套管道),则超参数的键将用双下划线“__”分隔符链接。
  • “set_hyperparams”:允许您以与获取时相同的格式设置新的超参数。
  • “get_hyperparams_space”:允许您获取超参数空间;如果您定义过,它就不会为空。因此,这里与“get_hyperparams”的唯一区别是,您得到的值是统计分布而不是精确值。例如,层数这个超参数可以是RandInt(1, 3),表示1到3层。您可以对此字典调用.rvs()随机选择一个值,并将其发送给“set_hyperparams”来尝试用它训练。
  • “set_hyperparams_space”:可用于使用与“get_hyperparams_space”中相同的超参数分布类来设置新空间。

有关建议的解决方案的更多信息,请阅读上面带有链接的大列表中的条目。

ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.

What are ML pipelines and how do they work?

A pipeline is a series of steps in which data is transformed. It comes from the old “pipe and filter” design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Machine Learning pipelines as:

  • Pipe and filters. The pipeline’s steps process data, and they manage their inner state which can be learned from the data.
  • Composites. Pipelines can be nested: for example a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
  • Directed Acyclic Graphs (DAG). A pipeline step’s output may be sent to many other steps, and then the resulting outputs can be recombined, and so on. Side note: although pipelines are acyclic, they can process multiple items one by one, and if their state changes (e.g.: using the fit_transform method each time), then they can be viewed as recurrently unfolding through time, keeping their states (think of an RNN). That’s an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.

Methods of a Scikit-Learn Pipeline

Pipelines (or steps in the pipeline) must have those two methods:

  • “fit” to learn on the data and acquire state (e.g.: a neural network’s neural weights are such state)
  • “transform” (or “predict”) to actually process the data and generate a prediction.

It’s also possible to call this method to chain both:

  • “fit_transform” to fit and then transform the data, but in one pass, which allows for potential code optimizations when the two methods must be done one after the other directly (see the sketch below).
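
To make that two-method contract concrete, a minimal hand-rolled step might look like this generic sketch (not tied to scikit-learn or Neuraxle):

class StandardizeStep(object):
    """A minimal 'pipe and filter' step: learns state in fit, uses it in transform."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / float(len(X))  # inner state learned from the data
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]

    def fit_transform(self, X, y=None):
        # fit then transform in one pass, leaving room for optimization
        return self.fit(X, y).transform(X)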

Problems of the sklearn.pipeline.Pipeline class

Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?

Scikit-Learn had its first release in 2007, in the pre-deep-learning era. However, it’s one of the best-known and most widely adopted machine learning libraries, and it is still growing. On top of all that, it uses the Pipe and Filter design pattern as a software architectural style – it’s what makes Scikit-Learn so fabulous, added to the fact it provides algorithms ready for use. However, it has massive issues when it comes to doing the following, which we should be able to do in 2020 already:

  • Automatic Machine Learning (AutoML),
  • Deep Learning Pipelines,
  • More complex Machine Learning pipelines.

Solutions that we’ve Found to Those Scikit-Learn’s Problems

For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and usable within modern computing projects!

Additional pipeline methods and features offered through Neuraxle

Note: if a step of a pipeline doesn’t need to have one of the fit or transform methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.

As a starter, it is possible for pipelines or their steps to also optionally define those methods:

  • “setup”, which will call the “setup” method on each of its steps. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the steps could create their neural graphs and register them to the GPU in the “setup” method before fit. It is discouraged to create the graphs directly in the constructors of the steps for several reasons, for example if the steps are copied before running many times with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
  • “teardown”, which is the opposite of the “setup” method: it clears resources.

The following methods are provided by default to allow for managing hyperparameters:

  • “get_hyperparams” will return you a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameter keys are chained with double-underscore “__” separators (illustrated after this list).
  • “set_hyperparams” will allow you to set new hyperparameters in the same format as when you get them.
  • “get_hyperparams_space” allows you to get the space of hyperparameters, which will not be empty if you defined one. So, the only difference with “get_hyperparams” here is that you’ll get statistical distributions as values instead of precise values. For instance, one hyperparameter for the number of layers could be a RandInt(1, 3), which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it.
  • “set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.
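
This double-underscore convention for nested hyperparameters is the same one scikit-learn itself uses; for instance, a grid search over the text pipeline from Answer 0 could look like the following sketch (the step names ‘tfidf’ and ‘clf’ are reused from that answer):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'tfidf__use_idf': (True, False),  # <step name>__<parameter name>
    'clf__alpha': (1e-2, 1e-3),
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
# grid_search.fit(Xtrain, ytrain) would then try every parameter combination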

For more info on our suggested solutions, read the entries in the big list with links above.


使用熊猫将字符串前缀添加到字符串列中的每个值

问题:使用熊猫将字符串前缀添加到字符串列中的每个值

我想(优雅地)在pandas数据帧某一列的每个值的开头附加一个字符串。我已经大致弄清楚该怎么做,目前正在使用:

df.ix[(df['col'] != False), 'col'] = 'str' + df.ix[(df['col'] != False), 'col']

这看起来相当不优雅——您是否知道其他方法(最好还能把该字符串添加到该列为0或NaN的行中)?

如果还不够清楚,我想把:

    col 
1     a
2     0

变成:

       col 
1     stra
2     str0

I would like to append a string to the start of each value in a said column of a pandas dataframe (elegantly). I already figured out how to kind-of do this and I am currently using:

df.ix[(df['col'] != False), 'col'] = 'str' + df.ix[(df['col'] != False), 'col']

This seems one hell of an inelegant thing to do – do you know any other way (which maybe also adds the character to rows where that column is 0 or NaN)?

In case this is yet unclear, I would like to turn:

    col 
1     a
2     0

into:

       col 
1     stra
2     str0

回答 0

df['col'] = 'str' + df['col'].astype(str)

例:

>>> df = pd.DataFrame({'col':['a',0]})
>>> df
  col
0   a
1   0
>>> df['col'] = 'str' + df['col'].astype(str)
>>> df
    col
0  stra
1  str0
df['col'] = 'str' + df['col'].astype(str)

Example:

>>> df = pd.DataFrame({'col':['a',0]})
>>> df
  col
0   a
1   0
>>> df['col'] = 'str' + df['col'].astype(str)
>>> df
    col
0  stra
1  str0
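
One caveat: astype(str) also converts missing values, so a NaN cell becomes the literal string ‘strnan’. If you would rather leave NaN untouched, a small sketch using a mask (np is numpy, imported as usual):

>>> import numpy as np
>>> df = pd.DataFrame({'col': ['a', 0, np.nan]})
>>> mask = df['col'].notnull()
>>> df.loc[mask, 'col'] = 'str' + df.loc[mask, 'col'].astype(str)
>>> df
    col
0  stra
1  str0
2   NaN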

回答 1

另外,您也可以结合format使用apply(或者更好,使用f字符串)。如果还想添加后缀或操作元素本身,我觉得这样可读性更高:

df = pd.DataFrame({'col':['a', 0]})

df['col'] = df['col'].apply(lambda x: "{}{}".format('str', x))

这也会产生所需的输出:

    col
0  stra
1  str0

如果您使用的是Python 3.6+,则还可以使用f字符串:

df['col'] = df['col'].apply(lambda x: f"str{x}")

产生相同的输出。

f字符串版本几乎与@RomanPekar的解决方案(python 3.6.4)一样快:

df = pd.DataFrame({'col':['a', 0]*200000})

%timeit df['col'].apply(lambda x: f"str{x}")
117 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit 'str' + df['col'].astype(str)
112 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

但是,使用format的确要慢得多:

%timeit df['col'].apply(lambda x: "{}{}".format('str', x))
185 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As an alternative, you can also use an apply combined with format (or better with f-strings) which I find slightly more readable if one e.g. also wants to add a suffix or manipulate the element itself:

df = pd.DataFrame({'col':['a', 0]})

df['col'] = df['col'].apply(lambda x: "{}{}".format('str', x))

which also yields the desired output:

    col
0  stra
1  str0

If you are using Python 3.6+, you can also use f-strings:

df['col'] = df['col'].apply(lambda x: f"str{x}")

yielding the same output.

The f-string version is almost as fast as @RomanPekar’s solution (python 3.6.4):

df = pd.DataFrame({'col':['a', 0]*200000})

%timeit df['col'].apply(lambda x: f"str{x}")
117 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit 'str' + df['col'].astype(str)
112 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using format, however, is indeed far slower:

%timeit df['col'].apply(lambda x: "{}{}".format('str', x))
185 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

回答 2

您可以使用pandas.Series.map:

df['col'].map('str{}'.format)

它将在您的所有值之前加上“str”这个词。

You can use pandas.Series.map :

df['col'].map('str{}'.format)

It will prepend the word “str” to all your values.
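
Note that map returns a new Series instead of modifying the column in place, so you still assign the result back; a quick sketch:

>>> df = pd.DataFrame({'col': ['a', 0]})
>>> df['col'] = df['col'].map('str{}'.format)
>>> df
    col
0  stra
1  str0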


回答 3

如果使用dtype=str加载表文件,
或将列类型转换为字符串df['a'] = df['a'].astype(str),
则可以使用以下方法:

df['a'] = 'col' + df['a'].str[:]

这种方法允许对df中的字符串进行前缀、追加和取子串操作。
适用于Pandas v0.23.4、v0.24.1,不了解更早的版本。

If you load your table file with dtype=str
or convert the column type to string df['a'] = df['a'].astype(str)
then you can use such an approach:

df['a'] = 'col' + df['a'].str[:]

This approach allows prepending, appending, and taking substrings of strings in the df.
Works on Pandas v0.23.4, v0.24.1. Don’t know about earlier versions.


回答 4

.loc的另一种解决方案:

df = pd.DataFrame({'col': ['a', 0]})
df.loc[df.index, 'col'] = 'string' + df['col'].astype(str)

这没有上述解决方案快(每个循环慢1ms以上),但在需要条件更改时可能有用,例如:

mask = (df['col'] == 0)
df.loc[mask, 'col'] = 'string' + df['col'].astype(str)

Another solution with .loc:

df = pd.DataFrame({'col': ['a', 0]})
df.loc[df.index, 'col'] = 'string' + df['col'].astype(str)

This is not as quick as solutions above (>1ms per loop slower) but may be useful in case you need conditional change, like:

mask = (df['col'] == 0)
df.loc[mask, 'col'] = 'string' + df['col'].astype(str)

如何在python抽象类中创建抽象属性

问题:如何在python抽象类中创建抽象属性

在以下代码中,我创建了一个基础抽象类Base。我希望所有从Base继承的类都提供name属性,因此我把这个属性设为@abstractmethod。

然后,我创建了Base的一个子类,名为Base_1,它旨在提供一些功能,但仍保持抽象。Base_1中没有name属性,但是python在实例化该类的对象时没有报错。该如何创建抽象属性?

from abc import ABCMeta, abstractmethod
class Base(object):
    __metaclass__ = ABCMeta
    def __init__(self, strDirConfig):
        self.strDirConfig = strDirConfig

    @abstractmethod
    def _doStuff(self, signals):
        pass

    @property    
    @abstractmethod
    def name(self):
        #this property will be supplied by the inheriting classes
        #individually
        pass


class Base_1(Base):
    __metaclass__ = ABCMeta
    # this class does not provide the name property, should raise an error
    def __init__(self, strDirConfig):
        super(Base_1, self).__init__(strDirConfig)

    def _doStuff(self, signals):
        print 'Base_1 does stuff'


class C(Base_1):
    @property
    def name(self):
        return 'class C'


if __name__ == '__main__':
    b1 = Base_1('abc')  

In the following code, I create a base abstract class Base. I want all the classes that inherit from Base to provide the name property, so I made this property an @abstractmethod.

Then I created a subclass of Base, called Base_1, which is meant to supply some functionality, but still remain abstract. There is no name property in Base_1, but nevertheless python instantiates an object of that class without an error. How does one create abstract properties?

from abc import ABCMeta, abstractmethod
class Base(object):
    __metaclass__ = ABCMeta
    def __init__(self, strDirConfig):
        self.strDirConfig = strDirConfig

    @abstractmethod
    def _doStuff(self, signals):
        pass

    @property    
    @abstractmethod
    def name(self):
        #this property will be supplied by the inheriting classes
        #individually
        pass


class Base_1(Base):
    __metaclass__ = ABCMeta
    # this class does not provide the name property, should raise an error
    def __init__(self, strDirConfig):
        super(Base_1, self).__init__(strDirConfig)

    def _doStuff(self, signals):
        print 'Base_1 does stuff'


class C(Base_1):
    @property
    def name(self):
        return 'class C'


if __name__ == '__main__':
    b1 = Base_1('abc')  

回答 0

从Python 3.3开始,一个错误得到了修复,这意味着property()装饰器应用于抽象方法时,现在可以被正确地标识为抽象。

注意:顺序很重要,您必须在@abstractmethod之前使用@property。

Python 3.3以上版本:(python docs):

class C(ABC):
    @property
    @abstractmethod
    def my_abstract_property(self):
        ...

Python 2:(python docs)

class C(object):
    __metaclass__ = ABCMeta  # Python 2 has no abc.ABC; use the metaclass directly

    @abstractproperty
    def my_abstract_property(self):
        pass

Since Python 3.3, a bug has been fixed, meaning the property() decorator is now correctly identified as abstract when applied to an abstract method.

Note: Order matters, you have to use @property before @abstractmethod

Python 3.3+: (python docs):

class C(ABC):
    @property
    @abstractmethod
    def my_abstract_property(self):
        ...

Python 2: (python docs)

class C(object):
    __metaclass__ = ABCMeta  # Python 2 has no abc.ABC; use the metaclass directly

    @abstractproperty
    def my_abstract_property(self):
        pass
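
A quick sketch of what this buys you on Python 3.4+ (where abc.ABC is available): a subclass that forgets to define the property can no longer be instantiated. The class names here are illustrative:

from abc import ABC, abstractmethod

class Base(ABC):
    @property
    @abstractmethod
    def name(self):
        ...

class Incomplete(Base):
    pass  # forgot to define 'name'

Incomplete()  # raises TypeError: Can't instantiate abstract class Incomplete ...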

回答 1

在Python 3.3之前,您不能嵌套@abstractmethod和@property。

使用@abstractproperty创建抽象属性(文档)。

from abc import ABCMeta, abstractmethod, abstractproperty

class Base(object):
    # ...
    @abstractproperty
    def name(self):
        pass

该代码现在引发正确的异常:

Traceback (most recent call last):
  File "foo.py", line 36, in <module>
    b1 = Base_1('abc')
TypeError: Can't instantiate abstract class Base_1 with abstract methods name

Until Python 3.3, you cannot nest @abstractmethod and @property.

Use @abstractproperty to create abstract properties (docs).

from abc import ABCMeta, abstractmethod, abstractproperty

class Base(object):
    # ...
    @abstractproperty
    def name(self):
        pass

The code now raises the correct exception:

Traceback (most recent call last):
  File "foo.py", line 36, in 
    b1 = Base_1('abc')  
TypeError: Can't instantiate abstract class Base_1 with abstract methods name

回答 2

基于上面James的回答:

import sys
from abc import abstractmethod, abstractproperty

def compatibleabstractproperty(func):
    if sys.version_info > (3, 3):
        return property(abstractmethod(func))
    else:
        return abstractproperty(func)

并将其用作装饰器:

@compatibleabstractproperty
def env(self):
    raise NotImplementedError()

Based on James’ answer above:

import sys
from abc import abstractmethod, abstractproperty

def compatibleabstractproperty(func):
    if sys.version_info > (3, 3):
        return property(abstractmethod(func))
    else:
        return abstractproperty(func)

and use it as a decorator:

@compatibleabstractproperty
def env(self):
    raise NotImplementedError()
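
For context, a sketch of the decorator applied inside a class that works on both Python 2 and 3; the ABC base-class trick and the Environment class name are illustrative, and compatibleabstractproperty is the function defined above:

from abc import ABCMeta

ABC = ABCMeta(str('ABC'), (object,), {})  # cross-version abstract base class

class Environment(ABC):
    @compatibleabstractproperty
    def env(self):
        raise NotImplementedError()

# Environment() now raises TypeError until a subclass defines 'env'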