标签归档:Python

Python速度测试-时差-毫秒

问题:Python速度测试-时差-毫秒

为了快速测试一段代码,在Python中进行两次比较的正确方法是什么?我尝试阅读API文档。我不确定我是否了解timedelta。

到目前为止,我有以下代码:

from datetime import datetime

tstart = datetime.now()
print t1

# code to speed test

tend = datetime.now()
print t2
# what am I missing?
# I'd like to print the time diff here

What is the proper way to compare 2 times in Python in order to speed test a section of code? I tried reading the API docs. I’m not sure I understand the timedelta thing.

So far I have this code:

from datetime import datetime

tstart = datetime.now()
print t1

# code to speed test

tend = datetime.now()
print t2
# what am I missing?
# I'd like to print the time diff here

回答 0

datetime.timedelta 只是两个日期时间之间的差…所以就像一段时间,以天/秒/微秒为单位

>>> import datetime
>>> a = datetime.datetime.now()
>>> b = datetime.datetime.now()
>>> c = b - a

>>> c
datetime.timedelta(0, 4, 316543)
>>> c.days
0
>>> c.seconds
4
>>> c.microseconds
316543

请注意,它c.microseconds仅返回timedelta的微秒部分!出于计时目的,请始终使用c.total_seconds()

您可以使用datetime.timedelta进行各种数学运算,例如:

>>> c / 10
datetime.timedelta(0, 0, 431654)

不过,查看CPU时间而不是墙上时钟时间可能更有用……虽然这取决于操作系统,但在类Unix系统下,请检查“ time”命令。

datetime.timedelta is just the difference between two datetimes … so it’s like a period of time, in days / seconds / microseconds

>>> import datetime
>>> a = datetime.datetime.now()
>>> b = datetime.datetime.now()
>>> c = b - a

>>> c
datetime.timedelta(0, 4, 316543)
>>> c.days
0
>>> c.seconds
4
>>> c.microseconds
316543

Be aware that c.microseconds only returns the microseconds portion of the timedelta! For timing purposes always use c.total_seconds().

You can do all sorts of maths with datetime.timedelta, eg:

>>> c / 10
datetime.timedelta(0, 0, 431654)

It might be more useful to look at CPU time instead of wallclock time though … that’s operating system dependant though … under Unix-like systems, check out the ‘time’ command.


回答 1

从Python 2.7开始,有了timedelta.total_seconds()方法。因此,要获得经过的毫秒数:

>>> import datetime
>>> a = datetime.datetime.now()
>>> b = datetime.datetime.now()
>>> delta = b - a
>>> print delta
0:00:05.077263
>>> int(delta.total_seconds() * 1000) # milliseconds
5077

Since Python 2.7 there’s the timedelta.total_seconds() method. So, to get the elapsed milliseconds:

>>> import datetime
>>> a = datetime.datetime.now()
>>> b = datetime.datetime.now()
>>> delta = b - a
>>> print delta
0:00:05.077263
>>> int(delta.total_seconds() * 1000) # milliseconds
5077

回答 2

您可能要改用timeit模块

You might want to use the timeit module instead.


回答 3

您还可以使用:

import time

start = time.clock()
do_something()
end = time.clock()
print "%.2gs" % (end-start)

或者您可以使用python分析器

You could also use:

import time

start = time.clock()
do_something()
end = time.clock()
print "%.2gs" % (end-start)

Or you could use the python profilers.


回答 4

我知道这很晚了,但实际上我真的很喜欢使用:

import time
start = time.time()

##### your timed code here ... #####

print "Process time: " + (time.time() - start)

time.time()从纪元开始,您可以得到秒数。因为这是标准时间(以秒为单位),所以您可以简单地从结束时间中减去开始时间来获得处理时间(以秒为单位)。time.clock()对基准测试非常有用,但是如果您想知道过程花费了多长时间,我发现它毫无用处。例如,说“我的过程需要10秒”比说“我的过程需要10个处理器时钟单位”要直观得多。

>>> start = time.time(); sum([each**8.3 for each in range(1,100000)]) ; print (time.time() - start)
3.4001404476250935e+45
0.0637760162354
>>> start = time.clock(); sum([each**8.3 for each in range(1,100000)]) ; print (time.clock() - start)
3.4001404476250935e+45
0.05

在上面的第一个示例中,显示的时间time.clock()为0.05,而time.time()为0.06377

>>> start = time.clock(); time.sleep(1) ; print "process time: " + (time.clock() - start)
process time: 0.0
>>> start = time.time(); time.sleep(1) ; print "process time: " + (time.time() - start)
process time: 1.00111794472

在第二个示例中,即使进程睡眠了一秒钟,处理器时间也以某种方式显示为“ 0”。time.time()正确显示多于1秒。

I know this is late, but I actually really like using:

import time
start = time.time()

##### your timed code here ... #####

print "Process time: " + (time.time() - start)

time.time() gives you seconds since the epoch. Because this is a standardized time in seconds, you can simply subtract the start time from the end time to get the process time (in seconds). time.clock() is good for benchmarking, but I have found it kind of useless if you want to know how long your process took. For example, it’s much more intuitive to say “my process takes 10 seconds” than it is to say “my process takes 10 processor clock units”

>>> start = time.time(); sum([each**8.3 for each in range(1,100000)]) ; print (time.time() - start)
3.4001404476250935e+45
0.0637760162354
>>> start = time.clock(); sum([each**8.3 for each in range(1,100000)]) ; print (time.clock() - start)
3.4001404476250935e+45
0.05

In the first example above, you are shown a time of 0.05 for time.clock() vs 0.06377 for time.time()

>>> start = time.clock(); time.sleep(1) ; print "process time: " + (time.clock() - start)
process time: 0.0
>>> start = time.time(); time.sleep(1) ; print "process time: " + (time.time() - start)
process time: 1.00111794472

In the second example, somehow the processor time shows “0” even though the process slept for a second. time.time() correctly shows a little more than 1 second.


回答 5

以下代码应显示时间说明…

from datetime import datetime

tstart = datetime.now()

# code to speed test

tend = datetime.now()
print tend - tstart

The following code should display the time detla…

from datetime import datetime

tstart = datetime.now()

# code to speed test

tend = datetime.now()
print tend - tstart

回答 6

您可以简单地打印出差异:

print tend - tstart

You could simply print the difference:

print tend - tstart

回答 7

我不是Python程序员,但我确实知道如何使用Google,这就是我发现的内容:您使用“-”运算符。要完成您的代码:

from datetime import datetime

tstart = datetime.now()

# code to speed test

tend = datetime.now()
print tend - tstart

此外,看起来您可以使用strftime()函数格式化时间跨度计算以呈现时间,但是这会让您感到高兴。

I am not a Python programmer, but I do know how to use Google and here’s what I found: you use the “-” operator. To complete your code:

from datetime import datetime

tstart = datetime.now()

# code to speed test

tend = datetime.now()
print tend - tstart

Additionally, it looks like you can use the strftime() function to format the timespan calculation in order to render the time however makes you happy.


回答 8

time.time()/ datetime可以快速使用,但并不总是100%精确。出于这个原因,我喜欢使用其中一个std lib 分析器(尤其是hotshot)来找出问题所在。

time.time() / datetime is good for quick use, but is not always 100% precise. For that reason, I like to use one of the std lib profilers (especially hotshot) to find out what’s what.


回答 9

您可能需要研究配置文件模块。您会更好地了解减速的位置,并且大部分工作将完全自动化。

You may want to look into the profile modules. You’ll get a better read out of where your slowdowns are, and much of your work will be full-on automated.


回答 10

这是一个模仿Matlab / Octave tic toc函数的自定义函数。

使用示例:

time_var = time_me(); # get a variable with the current timestamp

... run operation ...

time_me(time_var); # print the time difference (e.g. '5 seconds 821.12314 ms')

功能:

def time_me(*arg):
    if len(arg) != 0: 
        elapsedTime = time.time() - arg[0];
        #print(elapsedTime);
        hours = math.floor(elapsedTime / (60*60))
        elapsedTime = elapsedTime - hours * (60*60);
        minutes = math.floor(elapsedTime / 60)
        elapsedTime = elapsedTime - minutes * (60);
        seconds = math.floor(elapsedTime);
        elapsedTime = elapsedTime - seconds;
        ms = elapsedTime * 1000;
        if(hours != 0):
            print ("%d hours %d minutes %d seconds" % (hours, minutes, seconds)) 
        elif(minutes != 0):
            print ("%d minutes %d seconds" % (minutes, seconds))
        else :
            print ("%d seconds %f ms" % (seconds, ms))
    else:
        #print ('does not exist. here you go.');
        return time.time()

Here is a custom function that mimic’s Matlab’s/Octave’s tic toc functions.

Example of use:

time_var = time_me(); # get a variable with the current timestamp

... run operation ...

time_me(time_var); # print the time difference (e.g. '5 seconds 821.12314 ms')

Function :

def time_me(*arg):
    if len(arg) != 0: 
        elapsedTime = time.time() - arg[0];
        #print(elapsedTime);
        hours = math.floor(elapsedTime / (60*60))
        elapsedTime = elapsedTime - hours * (60*60);
        minutes = math.floor(elapsedTime / 60)
        elapsedTime = elapsedTime - minutes * (60);
        seconds = math.floor(elapsedTime);
        elapsedTime = elapsedTime - seconds;
        ms = elapsedTime * 1000;
        if(hours != 0):
            print ("%d hours %d minutes %d seconds" % (hours, minutes, seconds)) 
        elif(minutes != 0):
            print ("%d minutes %d seconds" % (minutes, seconds))
        else :
            print ("%d seconds %f ms" % (seconds, ms))
    else:
        #print ('does not exist. here you go.');
        return time.time()

回答 11

您可以像这样使用timeit测试名为module.py的脚本。

$ python -mtimeit -s 'import module'

You could use timeit like this to test a script named module.py

$ python -mtimeit -s 'import module'

回答 12

《箭头》:Python的更好日期和时间

import arrow
start_time = arrow.utcnow()
end_time = arrow.utcnow()
(end_time - start_time).total_seconds()  # senconds
(end_time - start_time).total_seconds() * 1000  # milliseconds

Arrow: Better dates & times for Python

import arrow
start_time = arrow.utcnow()
end_time = arrow.utcnow()
(end_time - start_time).total_seconds()  # senconds
(end_time - start_time).total_seconds() * 1000  # milliseconds

如何将NumPy数组标准化到一定范围内?

问题:如何将NumPy数组标准化到一定范围内?

在对音频或图像阵列进行一些处理之后,需要先在一定范围内对其进行标准化,然后才能将其写回到文件中。可以这样完成:

# Normalize audio channels to between -1.0 and +1.0
audio[:,0] = audio[:,0]/abs(audio[:,0]).max()
audio[:,1] = audio[:,1]/abs(audio[:,1]).max()

# Normalize image to between 0 and 255
image = image/(image.max()/255.0)

有没有那么繁琐,方便的函数方式来做到这一点?matplotlib.colors.Normalize()似乎无关。

After doing some processing on an audio or image array, it needs to be normalized within a range before it can be written back to a file. This can be done like so:

# Normalize audio channels to between -1.0 and +1.0
audio[:,0] = audio[:,0]/abs(audio[:,0]).max()
audio[:,1] = audio[:,1]/abs(audio[:,1]).max()

# Normalize image to between 0 and 255
image = image/(image.max()/255.0)

Is there a less verbose, convenience function way to do this? matplotlib.colors.Normalize() doesn’t seem to be related.


回答 0

audio /= np.max(np.abs(audio),axis=0)
image *= (255.0/image.max())

使用/=*=可以消除中间的临时阵列,从而节省了一些内存。乘法比除法便宜,所以

image *= 255.0/image.max()    # Uses 1 division and image.size multiplications

比…快一点

image /= image.max()/255.0    # Uses 1+image.size divisions

由于我们在这里使用基本的numpy方法,因此我认为这是尽可能有效的numpy解决方案。


就地操作不会更改容器数组的dtype。由于所需的标准化值是浮点型,因此在执行就地操作之前,audioand image数组需要具有浮点dtype。如果它们还不是浮点dtype,则需要使用进行转换astype。例如,

image = image.astype('float64')
audio /= np.max(np.abs(audio),axis=0)
image *= (255.0/image.max())

Using /= and *= allows you to eliminate an intermediate temporary array, thus saving some memory. Multiplication is less expensive than division, so

image *= 255.0/image.max()    # Uses 1 division and image.size multiplications

is marginally faster than

image /= image.max()/255.0    # Uses 1+image.size divisions

Since we are using basic numpy methods here, I think this is about as efficient a solution in numpy as can be.


In-place operations do not change the dtype of the container array. Since the desired normalized values are floats, the audio and image arrays need to have floating-point point dtype before the in-place operations are performed. If they are not already of floating-point dtype, you’ll need to convert them using astype. For example,

image = image.astype('float64')

回答 1

如果数组同时包含正数和负数,我将使用:

import numpy as np

a = np.random.rand(3,2)

# Normalised [0,1]
b = (a - np.min(a))/np.ptp(a)

# Normalised [0,255] as integer: don't forget the parenthesis before astype(int)
c = (255*(a - np.min(a))/np.ptp(a)).astype(int)        

# Normalised [-1,1]
d = 2.*(a - np.min(a))/np.ptp(a)-1

如果数组包含nan,则一种解决方案是将其删除为:

def nan_ptp(a):
    return np.ptp(a[np.isfinite(a)])

b = (a - np.nanmin(a))/nan_ptp(a)

但是,根据上下文,您可能需要nan不同的对待。例如,插值,用例如0代替,或引发错误。

最后,值得一提的是,即使不是OP的问题,也要标准化

e = (a - np.mean(a)) / np.std(a)

If the array contains both positive and negative data, I’d go with:

import numpy as np

a = np.random.rand(3,2)

# Normalised [0,1]
b = (a - np.min(a))/np.ptp(a)

# Normalised [0,255] as integer: don't forget the parenthesis before astype(int)
c = (255*(a - np.min(a))/np.ptp(a)).astype(int)        

# Normalised [-1,1]
d = 2.*(a - np.min(a))/np.ptp(a)-1

If the array contains nan, one solution could be to just remove them as:

def nan_ptp(a):
    return np.ptp(a[np.isfinite(a)])

b = (a - np.nanmin(a))/nan_ptp(a)

However, depending on the context you might want to treat nan differently. E.g. interpolate the value, replacing in with e.g. 0, or raise an error.

Finally, worth mentioning even if it’s not OP’s question, standardization:

e = (a - np.mean(a)) / np.std(a)

回答 2

您也可以使用重新缩放sklearn。优势在于,除了对数据进行均值居中之外,还可以调整标准差的归一化,并且可以在任一轴上,通过要素或按记录进行校准。

from sklearn.preprocessing import scale
X = scale( X, axis=0, with_mean=True, with_std=True, copy=True )

关键词参数axiswith_meanwith_std是自我解释,并且在默认状态显示。如果该参数copy设置为,则执行就地操作False这里的文件

You can also rescale using sklearn. The advantages are that you can adjust normalize the standard deviation, in addition to mean-centering the data, and that you can do this on either axis, by features, or by records.

from sklearn.preprocessing import scale
X = scale( X, axis=0, with_mean=True, with_std=True, copy=True )

The keyword arguments axis, with_mean, with_std are self explanatory, and are shown in their default state. The argument copy performs the operation in-place if it is set to False. Documentation here.


回答 3

您可以使用“ i”版本(如idiv中的imul ..),它看起来还不错:

image /= (image.max()/255.0)

在另一种情况下,您可以编写一个函数来通过colums标准化n维数组:

def normalize_columns(arr):
    rows, cols = arr.shape
    for col in xrange(cols):
        arr[:,col] /= abs(arr[:,col]).max()

You can use the “i” (as in idiv, imul..) version, and it doesn’t look half bad:

image /= (image.max()/255.0)

For the other case you can write a function to normalize an n-dimensional array by colums:

def normalize_columns(arr):
    rows, cols = arr.shape
    for col in xrange(cols):
        arr[:,col] /= abs(arr[:,col]).max()

回答 4

您正在尝试最小-最大比例缩放audio介于-1和+1 image之间以及0和255之间的值。

使用sklearn.preprocessing.minmax_scale,应该可以轻松解决您的问题。

例如:

audio_scaled = minmax_scale(audio, feature_range=(-1,1))

shape = image.shape
image_scaled = minmax_scale(image.ravel(), feature_range=(0,255)).reshape(shape)

注意:不要与将向量的范数(长度)缩放到某个值(通常为1)的操作相混淆,该操作通常也称为归一化。

You are trying to min-max scale the values of audio between -1 and +1 and image between 0 and 255.

Using sklearn.preprocessing.minmax_scale, should easily solve your problem.

e.g.:

audio_scaled = minmax_scale(audio, feature_range=(-1,1))

and

shape = image.shape
image_scaled = minmax_scale(image.ravel(), feature_range=(0,255)).reshape(shape)

note: Not to be confused with the operation that scales the norm (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.


回答 5

一个简单的解决方案是使用sklearn.preprocessing库提供的缩放器。

scaler = sk.MinMaxScaler(feature_range=(0, 250))
scaler = scaler.fit(X)
X_scaled = scaler.transform(X)
# Checking reconstruction
X_rec = scaler.inverse_transform(X_scaled)

错误X_rec-X将为零。您可以根据需要调整feature_range,甚至可以使用标准缩放器sk.StandardScaler()

A simple solution is using the scalers offered by the sklearn.preprocessing library.

scaler = sk.MinMaxScaler(feature_range=(0, 250))
scaler = scaler.fit(X)
X_scaled = scaler.transform(X)
# Checking reconstruction
X_rec = scaler.inverse_transform(X_scaled)

The error X_rec-X will be zero. You can adjust the feature_range for your needs, or even use a standart scaler sk.StandardScaler()


回答 6

我尝试按照此操作,但出现了错误

TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''

numpy我试图正常化阵列是一个integer数组。似乎他们不赞成在版本>中进行类型转换1.10,而您必须使用它numpy.true_divide()来解决该问题。

arr = np.array(img)
arr = np.true_divide(arr,[255.0],out=None)

img是一个PIL.Image对象。

I tried following this, and got the error

TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''

The numpy array I was trying to normalize was an integer array. It seems they deprecated type casting in versions > 1.10, and you have to use numpy.true_divide() to resolve that.

arr = np.array(img)
arr = np.true_divide(arr,[255.0],out=None)

img was an PIL.Image object.


在Python中显示带有两位小数的浮点数

问题:在Python中显示带有两位小数的浮点数

我有一个带浮点参数的函数(通常是整数或具有一位有效数字的十进制数),我需要在字符串中输出具有两位小数位的值(5-> 5.00、5.5-> 5.50等)。如何在Python中做到这一点?

I have a function taking float arguments (generally integers or decimals with one significant digit), and I need to output the values in a string with two decimal places (5 -> 5.00, 5.5 -> 5.50, etc). How can I do this in Python?


回答 0

您可以为此使用字符串格式运算符:

>>> '%.2f' % 1.234
'1.23'
>>> '%.2f' % 5.0
'5.00'

运算符的结果是一个字符串,因此您可以将其存储在变量中,进行打印等。

You could use the string formatting operator for that:

>>> '%.2f' % 1.234
'1.23'
>>> '%.2f' % 5.0
'5.00'

The result of the operator is a string, so you can store it in a variable, print etc.


回答 1

由于这篇文章可能会在这里出现一段时间,因此我们还要指出python 3语法:

"{:.2f}".format(5)

Since this post might be here for a while, lets also point out python 3 syntax:

"{:.2f}".format(5)

回答 2

f字符串格式:

这是Python 3.6中的新功能-照常将字符串放在引号中,并f'...以与r'...原始字符串相同的方式加上前缀。然后,将任何要放入字符串,变量,数字,大括号内的内容放入其中-Python会f'some string text with a {variable} or {number} within that text'像以前的字符串格式化方法那样进行求值,只是该方法更具可读性。

>>> a = 3.141592
>>> print(f'My number is {a:.2f} - look at the nice rounding!')

My number is 3.14 - look at the nice rounding!

您可以在此示例中看到,我们以与以前的字符串格式化方法相似的方式用小数位格式化。

NB a可以是数字,变量甚至是表达式,例如f'{3*my_func(3.14):02f}'

展望未来,使用新代码,我更喜欢f字符串而不是常见的%s或str.format()方法,因为f字符串可以更容易阅读,并且通常更快

f-string formatting:

This was new in Python 3.6 – the string is placed in quotation marks as usual, prepended with f'... in the same way you would r'... for a raw string. Then you place whatever you want to put within your string, variables, numbers, inside braces f'some string text with a {variable} or {number} within that text' – and Python evaluates as with previous string formatting methods, except that this method is much more readable.

>>> foobar = 3.141592
>>> print(f'My number is {foobar:.2f} - look at the nice rounding!')

My number is 3.14 - look at the nice rounding!

You can see in this example we format with decimal places in similar fashion to previous string formatting methods.

NB foobar can be an number, variable, or even an expression eg f'{3*my_func(3.14):02f}'.

Going forward, with new code I prefer f-strings over common %s or str.format() methods as f-strings can be far more readable, and are often much faster.


回答 3

字符串格式:

print "%.2f" % 5

String formatting:

print "%.2f" % 5

回答 4

使用python字符串格式。

>>> "%0.2f" % 3
'3.00'

Using python string formatting.

>>> "%0.2f" % 3
'3.00'

回答 5

字符串格式:

a = 6.789809823
print('%.2f' %a)

要么

print ("{0:.2f}".format(a)) 

舍入函数可以使用:

print(round(a, 2))

round()的好处是,我们可以将结果存储到另一个变量中,然后将其用于其他目的。

b = round(a, 2)
print(b)

String Formatting:

a = 6.789809823
print('%.2f' %a)

OR

print ("{0:.2f}".format(a)) 

Round Function can be used:

print(round(a, 2))

Good thing about round() is that, we can store this result to another variable, and then use it for other purposes.

b = round(a, 2)
print(b)

回答 6

最短的Python 3语法:

n = 5
print(f'{n:.2f}')

Shortest Python 3 syntax:

n = 5
print(f'{n:.2f}')

回答 7

如果您实际上想更改数字本身,而不是只显示不同的数字,请使用format()

将其格式化为2位小数:

format(value, '.2f')

例:

>>> format(5.00000, '.2f')
'5.00'

If you actually want to change the number itself instead of only displaying it differently use format()

Format it to 2 decimal places:

format(value, '.2f')

example:

>>> format(5.00000, '.2f')
'5.00'

回答 8

我知道这是一个古老的问题,但我一直在努力寻找答案。这是我想出的:

Python 3:

>>> num_dict = {'num': 0.123, 'num2': 0.127}
>>> "{0[num]:.2f}_{0[num2]:.2f}".format(num_dict) 
0.12_0.13

I know it is an old question, but I was struggling finding the answer myself. Here is what I have come up with:

Python 3:

>>> num_dict = {'num': 0.123, 'num2': 0.127}
>>> "{0[num]:.2f}_{0[num2]:.2f}".format(num_dict) 
0.12_0.13

回答 9

使用Python 3语法:

print('%.2f' % number)

Using Python 3 syntax:

print('%.2f' % number)

回答 10

如果要在调用输入时获得一个小数点后两位数限制的浮点值,

看看这个〜

a = eval(format(float(input()), '.2f'))   # if u feed 3.1415 for 'a'.
print(a)                                  # output 3.14 will be printed.

If you want to get a floating point value with two decimal places limited at the time of calling input,

Check this out ~

a = eval(format(float(input()), '.2f'))   # if u feed 3.1415 for 'a'.
print(a)                                  # output 3.14 will be printed.

NameError:全局名称“ unicode”未定义-在Python 3中

问题:NameError:全局名称“ unicode”未定义-在Python 3中

我正在尝试使用一个名为bidi的Python包。在此程序包(algorithm.py)的模块中,尽管它是程序包的一部分,但仍有一些行会给我带来错误。

以下是这些行:

# utf-8 ? we need unicode
if isinstance(unicode_or_str, unicode):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True

这是错误消息:

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    bidi_text = get_display(reshaped_text)
  File "C:\Python33\lib\site-packages\python_bidi-0.3.4-py3.3.egg\bidi\algorithm.py",   line 602, in get_display
    if isinstance(unicode_or_str, unicode):
NameError: global name 'unicode' is not defined

我应该如何重新编写代码的这一部分,使其可以在Python3中使用?另外,如果有人在Python 3中使用了bidi软件包,请让我知道他们是否发现了类似的问题。我感谢您的帮助。

I am trying to use a Python package called bidi. In a module in this package (algorithm.py) there are some lines that give me error, although it is part of the package.

Here are the lines:

# utf-8 ? we need unicode
if isinstance(unicode_or_str, unicode):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True

and here is the error message:

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    bidi_text = get_display(reshaped_text)
  File "C:\Python33\lib\site-packages\python_bidi-0.3.4-py3.3.egg\bidi\algorithm.py",   line 602, in get_display
    if isinstance(unicode_or_str, unicode):
NameError: global name 'unicode' is not defined

How should I re-write this part of the code so it works in Python3? Also if anyone have used bidi package with Python 3 please let me know if they have found similar problems or not. I appreciate your help.


回答 0

Python 3将unicode类型重命名为str,旧str类型已替换为bytes

if isinstance(unicode_or_str, str):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True

您可能需要阅读Python 3 porting HOWTO以获得更多此类详细信息。还有Lennart Regebro的“ 移植到Python 3:深入指南”,可免费在线获得。

最后但并非最不重要的一点是,您可以尝试使用该2to3工具查看如何为您翻译代码。

Python 3 renamed the unicode type to str, the old str type has been replaced by bytes.

if isinstance(unicode_or_str, str):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True

You may want to read the Python 3 porting HOWTO for more such details. There is also Lennart Regebro’s Porting to Python 3: An in-depth guide, free online.

Last but not least, you could just try to use the 2to3 tool to see how that translates the code for you.


回答 1

如果您需要让脚本像我一样继续在python2和3上工作,这可能会对某人有所帮助

import sys
if sys.version_info[0] >= 3:
    unicode = str

然后可以例如

foo = unicode.lower(foo)

If you need to have the script keep working on python2 and 3 as I did, this might help someone

import sys
if sys.version_info[0] >= 3:
    unicode = str

and can then just do for example

foo = unicode.lower(foo)

回答 2

您可以使用6个库同时支持Python 2和3:

import six
if isinstance(value, six.string_types):
    handle_string(value)

You can use the six library to support both Python 2 and 3:

import six
if isinstance(value, six.string_types):
    handle_string(value)

回答 3

希望您使用的是Python 3,默认情况下Str是unicode,所以请Unicode用String Str函数替换函数。

if isinstance(unicode_or_str, str):    ##Replaces with str
    text = unicode_or_str
    decoded = False

Hope you are using Python 3 , Str are unicode by default, so please Replace Unicode function with String Str function.

if isinstance(unicode_or_str, str):    ##Replaces with str
    text = unicode_or_str
    decoded = False

Python日期时间与时间模块之间的差异

问题:Python日期时间与时间模块之间的差异

我试图弄清楚datetimetime模块之间的区别,以及每个模块的用途。

我知道datetime提供日期和时间。该time模块的用途是什么?

将理解示例,并且将特别关注与时区有关的差异。

I am trying to figure out the differences between the datetime and time modules, and what each should be used for.

I know that datetime provides both dates and time. What is the use of the time module?

Examples would be appreciated and differences concerning timezones would especially be of interest.


回答 0

time模块主要用于处理unix时间戳;表示为一个浮点数,以距unix纪元的秒数​​为单位。该datetime模块可以支持许多相同的操作,但是提供了更多的面向对象的类型集,并且对时区的支持有限。

the time module is principally for working with unix time stamps; expressed as a floating point number taken to be seconds since the unix epoch. the datetime module can support many of the same operations, but provides a more object oriented set of types, and also has some limited support for time zones.


回答 1

坚决time防止DST歧义。

专门使用系统time模块而不是datetime模块,以防止夏令时(DST)引起歧义

转换为任何时间格式(包括本地时间)都非常容易:

import time
t = time.time()

time.strftime('%Y-%m-%d %H:%M %Z', time.localtime(t))
'2019-05-27 12:03 CEST'

time.strftime('%Y-%m-%d %H:%M %Z', time.gmtime(t))
'2019-05-27 10:03 GMT'

time.time()是一个浮点数,表示自系统纪元以来的时间(以秒为单位)。time.time()非常适合明确的时间戳记。

如果系统另外运行了网络时间协议(NTP)守护程序,那么最终将获得相当可靠的时基。

这是该模块的文档time

Stick to time to prevent DST ambiguity.

Use exclusively the system time module instead of the datetime module to prevent ambiguity issues with daylight savings time (DST).

Conversion to any time format, including local time, is pretty easy:

import time
t = time.time()

time.strftime('%Y-%m-%d %H:%M %Z', time.localtime(t))
'2019-05-27 12:03 CEST'

time.strftime('%Y-%m-%d %H:%M %Z', time.gmtime(t))
'2019-05-27 10:03 GMT'

time.time() is a floating point number representing the time in seconds since the system epoch. time.time() is ideal for unambiguous time stamping.

If the system additionally runs the network time protocol (NTP) dæmon, one ends up with a pretty solid time base.

Here is the documentation of the time module.


回答 2

当您只需要特定记录的时间时,可以使用时间模块-假设您每天有一个单独的表/文件用于交易,那么您只需要时间。但是,时间数据类型通常用于存储2个时间点之间的时间

这也可以使用datetime完成,但是如果我们只处理特定日期的时间,则可以使用time模块。

日期时间用于存储特定数据和记录时间。就像在出租公司里一样。截止日期将是datetime数据类型。

The time module can be used when you just need the time of a particular record – like lets say you have a seperate table/file for the transactions for each day, then you would just need the time. However the time datatype is usually used to store the time difference between 2 points of time.

This can also be done using datetime, but if we are only dealing with time for a particular day, then time module can be used.

Datetime is used to store a particular data and time for a record. Like in a rental agency. The due date would be a datetime datatype.


回答 3

如果您对时区感兴趣,则应考虑使用pytz。

If you are interested in timezones, you should consider the use of pytz.


如何更新SQLAlchemy行条目?

问题:如何更新SQLAlchemy行条目?

假设表有三列:usernamepasswordno_of_logins

当用户尝试登录时,将使用查询之类的内容检查条目

user = User.query.filter_by(username=form.username.data).first()

如果密码匹配,他将继续。我想做的是计算用户登录的次数。因此,无论他何时成功登录,我都希望增加该no_of_logins字段并将其存储回用户表。我不确定如何使用SqlAlchemy运行更新查询。

Assume table has three columns: username, password and no_of_logins.

When user tries to login, it’s checked for an entry with a query like

user = User.query.filter_by(username=form.username.data).first()

If password matches, he proceeds further. What I would like to do is count how many times the user logged in. Thus whenever he successfully logs in, I would like to increment the no_of_logins field and store it back to the user table. I’m not sure how to run update query with SqlAlchemy.


回答 0

user.no_of_logins += 1
session.commit()
user.no_of_logins += 1
session.commit()

回答 1

有几种UPDATE使用方法sqlalchemy

1) user.no_of_logins += 1
   session.commit()

2) session.query().\
       filter(User.username == form.username.data).\
       update({"no_of_logins": (User.no_of_logins +1)})
   session.commit()

3) conn = engine.connect()
   stmt = User.update().\
       values(no_of_logins=(User.no_of_logins + 1)).\
       where(User.username == form.username.data)
   conn.execute(stmt)

4) setattr(user, 'no_of_logins', user.no_of_logins+1)
   session.commit()

There are several ways to UPDATE using sqlalchemy

1) user.no_of_logins += 1
   session.commit()

2) session.query().\
       filter(User.username == form.username.data).\
       update({"no_of_logins": (User.no_of_logins +1)})
   session.commit()

3) conn = engine.connect()
   stmt = User.update().\
       values(no_of_logins=(User.no_of_logins + 1)).\
       where(User.username == form.username.data)
   conn.execute(stmt)

4) setattr(user, 'no_of_logins', user.no_of_logins+1)
   session.commit()

回答 2

举例说明澄清已接受答案的重要问题

直到我自己玩弄它之前,我都不了解它,所以我认为还会有其他人感到困惑。假设您正在id == 6no_of_logins == 30启动时的用户和用户一起工作。

# 1 (bad)
user.no_of_logins += 1
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 2 (bad)
user.no_of_logins = user.no_of_logins + 1
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 3 (bad)
setattr(user, 'no_of_logins', user.no_of_logins + 1)
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 4 (ok)
user.no_of_logins = User.no_of_logins + 1
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

# 5 (ok)
setattr(user, 'no_of_logins', User.no_of_logins + 1)
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

重点

通过引用类而不是实例,您可以使SQLAlchemy更加智能地进行增量,使其在数据库端而不是Python端发生。在数据库中执行此操作会更好,因为它不易受到数据损坏的影响(例如,两个客户端尝试同时进行增量操作,而最终结果仅是一次增量而不是两次增量)。我认为,如果您设置了锁或提高了隔离级别,则可以在Python中进行增量操作,但是如果不必这样做,为什么还要打扰呢?

注意事项

如果您要通过产生SQL之类的代码来增加两次SET no_of_logins = no_of_logins + 1,那么您将需要提交或至少刷新两次增加之间的增量,否则总共将只获得一个增量:

# 6 (bad)
user.no_of_logins = User.no_of_logins + 1
user.no_of_logins = User.no_of_logins + 1
session.commit()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

# 7 (ok)
user.no_of_logins = User.no_of_logins + 1
session.flush()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6
user.no_of_logins = User.no_of_logins + 1
session.commit()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

Examples to clarify the important issue in accepted answer’s comments

I didn’t understand it until I played around with it myself, so I figured there would be others who were confused as well. Say you are working on the user whose id == 6 and whose no_of_logins == 30 when you start.

# 1 (bad)
user.no_of_logins += 1
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 2 (bad)
user.no_of_logins = user.no_of_logins + 1
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 3 (bad)
setattr(user, 'no_of_logins', user.no_of_logins + 1)
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 4 (ok)
user.no_of_logins = User.no_of_logins + 1
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

# 5 (ok)
setattr(user, 'no_of_logins', User.no_of_logins + 1)
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

The point

By referencing the class instead of the instance, you can get SQLAlchemy to be smarter about incrementing, getting it to happen on the database side instead of the Python side. Doing it within the database is better since it’s less vulnerable to data corruption (e.g. two clients attempt to increment at the same time with a net result of only one increment instead of two). I assume it’s possible to do the incrementing in Python if you set locks or bump up the isolation level, but why bother if you don’t have to?

A caveat

If you are going to increment twice via code that produces SQL like SET no_of_logins = no_of_logins + 1, then you will need to commit or at least flush in between increments, or else you will only get one increment in total:

# 6 (bad)
user.no_of_logins = User.no_of_logins + 1
user.no_of_logins = User.no_of_logins + 1
session.commit()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

# 7 (ok)
user.no_of_logins = User.no_of_logins + 1
session.flush()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6
user.no_of_logins = User.no_of_logins + 1
session.commit()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

回答 3

借助user=User.query.filter_by(username=form.username.data).first()声明,您将获得指定用户的user变量。

现在,您可以像一样user.no_of_logins += 1更改新对象变量的值,并使用session的commit方法保存更改。

With the help of user=User.query.filter_by(username=form.username.data).first() statement you will get the specified user in user variable.

Now you can change the value of the new object variable like user.no_of_logins += 1 and save the changes with the session‘s commit method.


回答 4

我写了电报bot,并且在更新行时遇到了一些问题。如果您有模型,请使用此示例

def update_state(chat_id, state):
    try:
        value = Users.query.filter(Users.chat_id == str(chat_id)).first()
        value.state = str(state)
        db.session.flush()
        db.session.commit()
        #db.session.close()
    except:
        print('Error in def update_state')

为什么要使用db.session.flush()?这就是为什么>>> SQLAlchemy:flush()和commit()有什么区别?

I wrote telegram bot, and have some problem with update rows. Use this example, if you have Model

def update_state(chat_id, state):
    try:
        value = Users.query.filter(Users.chat_id == str(chat_id)).first()
        value.state = str(state)
        db.session.flush()
        db.session.commit()
        #db.session.close()
    except:
        print('Error in def update_state')

Why use db.session.flush()? That’s why >>> SQLAlchemy: What’s the difference between flush() and commit()?


创建两个熊猫数据框列的字典的最有效方法是什么?

问题:创建两个熊猫数据框列的字典的最有效方法是什么?

组织以下熊猫数据框的最有效方法是什么:

数据=

Position    Letter
1           a
2           b
3           c
4           d
5           e

变成字典一样alphabet[1 : 'a', 2 : 'b', 3 : 'c', 4 : 'd', 5 : 'e']

What is the most efficient way to organise the following pandas Dataframe:

data =

Position    Letter
1           a
2           b
3           c
4           d
5           e

into a dictionary like alphabet[1 : 'a', 2 : 'b', 3 : 'c', 4 : 'd', 5 : 'e']?


回答 0

In [9]: pd.Series(df.Letter.values,index=df.Position).to_dict()
Out[9]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

速度比较(使用Wouter方法)

In [6]: df = pd.DataFrame(randint(0,10,10000).reshape(5000,2),columns=list('AB'))

In [7]: %timeit dict(zip(df.A,df.B))
1000 loops, best of 3: 1.27 ms per loop

In [8]: %timeit pd.Series(df.A.values,index=df.B).to_dict()
1000 loops, best of 3: 987 us per loop
In [9]: pd.Series(df.Letter.values,index=df.Position).to_dict()
Out[9]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

Speed comparion (using Wouter’s method)

In [6]: df = pd.DataFrame(randint(0,10,10000).reshape(5000,2),columns=list('AB'))

In [7]: %timeit dict(zip(df.A,df.B))
1000 loops, best of 3: 1.27 ms per loop

In [8]: %timeit pd.Series(df.A.values,index=df.B).to_dict()
1000 loops, best of 3: 987 us per loop

回答 1

我找到了解决问题的更快方法,至少在使用以下方法的大型数据集上: df.set_index(KEY).to_dict()[VALUE]

50,000行的证明:

df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)

%timeit dict(zip(df.A,df.B))
%timeit pd.Series(df.A.values,index=df.B).to_dict()
%timeit df.set_index('A').to_dict()['B']

输出:

100 loops, best of 3: 7.04 ms per loop  # WouterOvermeire
100 loops, best of 3: 9.83 ms per loop  # Jeff
100 loops, best of 3: 4.28 ms per loop  # Kikohs (me)

I found a faster way to solve the problem, at least on realistically large datasets using: df.set_index(KEY).to_dict()[VALUE]

Proof on 50,000 rows:

df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)

%timeit dict(zip(df.A,df.B))
%timeit pd.Series(df.A.values,index=df.B).to_dict()
%timeit df.set_index('A').to_dict()['B']

Output:

100 loops, best of 3: 7.04 ms per loop  # WouterOvermeire
100 loops, best of 3: 9.83 ms per loop  # Jeff
100 loops, best of 3: 4.28 ms per loop  # Kikohs (me)

回答 2

在Python 3.6中,最快的方法仍然是WouterOvermeire。Kikohs的提议比其他两个方案要慢。

import timeit

setup = '''
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)
'''

timeit.Timer('dict(zip(df.A,df.B))', setup=setup).repeat(7,500)
timeit.Timer('pd.Series(df.A.values,index=df.B).to_dict()', setup=setup).repeat(7,500)
timeit.Timer('df.set_index("A").to_dict()["B"]', setup=setup).repeat(7,500)

结果:

1.1214002349999777 s  # WouterOvermeire
1.1922008498571748 s  # Jeff
1.7034366211428602 s  # Kikohs

In Python 3.6 the fastest way is still the WouterOvermeire one. Kikohs’ proposal is slower than the other two options.

import timeit

setup = '''
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)
'''

timeit.Timer('dict(zip(df.A,df.B))', setup=setup).repeat(7,500)
timeit.Timer('pd.Series(df.A.values,index=df.B).to_dict()', setup=setup).repeat(7,500)
timeit.Timer('df.set_index("A").to_dict()["B"]', setup=setup).repeat(7,500)

Results:

1.1214002349999777 s  # WouterOvermeire
1.1922008498571748 s  # Jeff
1.7034366211428602 s  # Kikohs

回答 3

TL; DR

>>> import pandas as pd
>>> df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})
>>> dict(sorted(df.values.tolist())) # Sort of sorted... 
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
>>> from collections import OrderedDict
>>> OrderedDict(df.values.tolist())
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])

在长

解决方案说明: dict(sorted(df.values.tolist()))

鉴于:

df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})

[出]:

 Letter Position
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

尝试:

# Get the values out to a 2-D numpy array, 
df.values

[出]:

array([['a', 1],
       ['b', 2],
       ['c', 3],
       ['d', 4],
       ['e', 5]], dtype=object)

然后(可选):

# Dump it into a list so that you can sort it using `sorted()`
sorted(df.values.tolist()) # Sort by key

要么:

# Sort by value:
from operator import itemgetter
sorted(df.values.tolist(), key=itemgetter(1))

[出]:

[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5]]

最后,将2个元素的列表转换成字典。

dict(sorted(df.values.tolist())) 

[出]:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

有关

回答@sbradbio评论:

如果一个特定的键有多个值,而您想保留所有值,那么这不是最有效,但最直观的方法是:

from collections import defaultdict
import pandas as pd

multivalue_dict = defaultdict(list)

df = pd.DataFrame({'Position':[1,2,4,4,4], 'Letter':['a', 'b', 'd', 'e', 'f']})

for idx,row in df.iterrows():
    multivalue_dict[row['Position']].append(row['Letter'])

[出]:

>>> print(multivalue_dict)
defaultdict(list, {1: ['a'], 2: ['b'], 4: ['d', 'e', 'f']})

TL;DR

>>> import pandas as pd
>>> df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})
>>> dict(sorted(df.values.tolist())) # Sort of sorted... 
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
>>> from collections import OrderedDict
>>> OrderedDict(df.values.tolist())
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])

In Long

Explaining solution: dict(sorted(df.values.tolist()))

Given:

df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})

[out]:

 Letter Position
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

Try:

# Get the values out to a 2-D numpy array, 
df.values

[out]:

array([['a', 1],
       ['b', 2],
       ['c', 3],
       ['d', 4],
       ['e', 5]], dtype=object)

Then optionally:

# Dump it into a list so that you can sort it using `sorted()`
sorted(df.values.tolist()) # Sort by key

Or:

# Sort by value:
from operator import itemgetter
sorted(df.values.tolist(), key=itemgetter(1))

[out]:

[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5]]

Lastly, cast the list of list of 2 elements into a dict.

dict(sorted(df.values.tolist())) 

[out]:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

Related

Answering @sbradbio comment:

If there are multiple values for a specific key and you would like to keep all of them, it’s the not the most efficient but the most intuitive way is:

from collections import defaultdict
import pandas as pd

multivalue_dict = defaultdict(list)

df = pd.DataFrame({'Position':[1,2,4,4,4], 'Letter':['a', 'b', 'd', 'e', 'f']})

for idx,row in df.iterrows():
    multivalue_dict[row['Position']].append(row['Letter'])

[out]:

>>> print(multivalue_dict)
defaultdict(list, {1: ['a'], 2: ['b'], 4: ['d', 'e', 'f']})

熊猫加入问题:列重叠但未指定后缀

问题:熊猫加入问题:列重叠但未指定后缀

我有以下2个数据帧:

df_a =

     mukey  DI  PI
0   100000  35  14
1  1000005  44  14
2  1000006  44  14
3  1000007  43  13
4  1000008  43  13

df_b = 
    mukey  niccdcd
0  190236        4
1  190237        6
2  190238        7
3  190239        4
4  190240        7

当我尝试加入这两个数据框时:

join_df = df_a.join(df_b,on='mukey',how='left')

我得到错误:

*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')

为什么会这样呢?数据帧确实具有通用的“ mukey”值。

I have following 2 data frames:

df_a =

     mukey  DI  PI
0   100000  35  14
1  1000005  44  14
2  1000006  44  14
3  1000007  43  13
4  1000008  43  13

df_b = 
    mukey  niccdcd
0  190236        4
1  190237        6
2  190238        7
3  190239        4
4  190240        7

When I try to join these 2 dataframes:

join_df = df_a.join(df_b,on='mukey',how='left')

I get the error:

*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')

Why is this so? The dataframes do have common ‘mukey’ values.


回答 0

您发布的数据片段中的错误有点神秘,因为没有通用值,所以联接操作失败,因为这些值不重叠,这需要您在左侧和右侧提供后缀:

In [173]:

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
       mukey_left  DI  PI  mukey_right  niccdcd
index                                          
0          100000  35  14          NaN      NaN
1         1000005  44  14          NaN      NaN
2         1000006  44  14          NaN      NaN
3         1000007  43  13          NaN      NaN
4         1000008  43  13          NaN      NaN

merge 之所以有效,是因为它没有此限制:

In [176]:

df_a.merge(df_b, on='mukey', how='left')
Out[176]:
     mukey  DI  PI  niccdcd
0   100000  35  14      NaN
1  1000005  44  14      NaN
2  1000006  44  14      NaN
3  1000007  43  13      NaN
4  1000008  43  13      NaN

Your error on the snippet of data you posted is a little cryptic, in that because there are no common values, the join operation fails because the values don’t overlap it requires you to supply a suffix for the left and right hand side:

In [173]:

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
       mukey_left  DI  PI  mukey_right  niccdcd
index                                          
0          100000  35  14          NaN      NaN
1         1000005  44  14          NaN      NaN
2         1000006  44  14          NaN      NaN
3         1000007  43  13          NaN      NaN
4         1000008  43  13          NaN      NaN

merge works because it doesn’t have this restriction:

In [176]:

df_a.merge(df_b, on='mukey', how='left')
Out[176]:
     mukey  DI  PI  niccdcd
0   100000  35  14      NaN
1  1000005  44  14      NaN
2  1000006  44  14      NaN
3  1000007  43  13      NaN
4  1000008  43  13      NaN

回答 1

.join()函数正在使用index传递的参数数据集的,因此您应该改用set_index或使用.mergefunction。

请找到适合您的情况的两个示例:

join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')

要么

join_df = df_a.merge(df_b, on='mukey', how='left')

The .join() function is using the index of the passed as argument dataset, so you should use set_index or use .merge function instead.

Please find the two examples that should work in your case:

join_df = LS_sgo.join(MSU_pi.set_index('mukey'), on='mukey', how='left')

or

join_df = df_a.merge(df_b, on='mukey', how='left')

回答 2

此错误表明两个表具有1个或多个具有相同列名的列名。错误消息翻译为:“我可以在两个表中看到同一列,但是您没有告诉我在将其中一个引入之前重命名了任何一个”

您要么要删除一列,然后再使用del df [‘column name’]从另一列中引入,要么使用lsuffix来重写原始列,或者使用rsuffix重命名要引入的列。

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')

This error indicates that the two tables have the 1 or more column names that have the same column name. The error message translates to: “I can see the same column in both tables but you haven’t told me to rename either before bringing one of them in”

You either want to delete one of the columns before bringing it in from the other on using del df[‘column name’], or use lsuffix to re-write the original column, or rsuffix to rename the one that is being brought it.

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')

回答 3

主要是join是专门用于基于索引的联接,而不是基于属性名称的联接,因此在两个不同的数据框中更改属性名称,然后尝试联接,它们将被联接,否则会引发此错误

Mainly join is used exclusively to join based on the index,not on the attribute names,so change the attributes names in two different dataframes,then try to join,they will be joined,else this error is raised


如何将tsv文件加载到Pandas DataFrame中?

问题:如何将tsv文件加载到Pandas DataFrame中?

我是python和pandas的新手。我正在尝试将tsv文件加载到熊猫中DataFrame

这是我正在尝试的错误:

>>> df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 318, in __init__
    raise PandasError('DataFrame constructor not properly called!')
PandasError: DataFrame constructor not properly called!

I’m new to python and pandas. I’m trying to get a tsv file loaded into a pandas DataFrame.

This is what I’m trying and the error I’m getting:

>>> df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 318, in __init__
    raise PandasError('DataFrame constructor not properly called!')
PandasError: DataFrame constructor not properly called!

回答 0

:由于17.0 from_csv气馁:使用pd.read_csv替代

该文档列出了一个.from_csv函数,该函数似乎可以执行您想要的操作:

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t')

如果您有标题,则可以传递header=0

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t', header=0)

Note: As of 17.0 from_csv is discouraged: use pd.read_csv instead

The documentation lists a .from_csv function that appears to do what you want:

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t')

If you have a header, you can pass header=0.

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t', header=0)

回答 1

从17.0开始from_csv不建议使用。

使用pd.read_csv(fpath, sep='\t')pd.read_table(fpath)

As of 17.0 from_csv is discouraged.

Use pd.read_csv(fpath, sep='\t') or pd.read_table(fpath).


回答 2

使用read_table(filepath)。默认分隔符是制表符

Use read_table(filepath). The default separator is tab


回答 3

试试这个

df = pd.read_csv("rating-data.tsv",sep='\t')
df.head()

您实际上需要修复sep参数。

Try this

df = pd.read_csv("rating-data.tsv",sep='\t')
df.head()

You actually need to fix the sep parameter.


回答 4

打开文件,另存为.csv,然后应用

df = pd.read_csv('apps.csv', sep='\t')

对于任何其他格式,只需更改sep标记

open file, save as .csv and then apply

df = pd.read_csv('apps.csv', sep='\t')

for any other format also, just change the sep tag


回答 5

df = pd.read_csv('filename.csv', sep='\t', header=0)

您可以通过指定分隔符和标头将tsv文件直接加载到pandas数据框中。

df = pd.read_csv('filename.csv', sep='\t', header=0)

You can load the tsv file directly into pandas data frame by specifying delimitor and header.


如何从psycopg2游标获取列名列表?

问题:如何从psycopg2游标获取列名列表?

我想要一种直接从所选列名直接生成列标签的通用方法,并且回想起看到python的psycopg2模块支持此功能的情况。

I would like a general way to generate column labels directly from the selected column names, and recall seeing that python’s psycopg2 module supports this feature.


回答 0

摘自Mark Lutz的“ Programming Python”:

curs.execute("Select * FROM people LIMIT 0")
colnames = [desc[0] for desc in curs.description]

From “Programming Python” by Mark Lutz:

curs.execute("Select * FROM people LIMIT 0")
colnames = [desc[0] for desc in curs.description]

回答 1

您可以做的另一件事是创建一个游标,您可以使用该游标通过列名来引用您的列(这是导致我首先进入此页面的需要):

import psycopg2
from psycopg2.extras import RealDictCursor

ps_conn = psycopg2.connect(...)
ps_cursor = psql_conn.cursor(cursor_factory=RealDictCursor)

ps_cursor.execute('select 1 as col_a, 2 as col_b')
my_record = ps_cursor.fetchone()
print (my_record['col_a'],my_record['col_b'])

>> 1, 2

Another thing you can do is to create a cursor with which you will be able to reference your columns by their names (that’s a need which led me to this page in the first place):

import psycopg2
from psycopg2.extras import RealDictCursor

ps_conn = psycopg2.connect(...)
ps_cursor = psql_conn.cursor(cursor_factory=RealDictCursor)

ps_cursor.execute('select 1 as col_a, 2 as col_b')
my_record = ps_cursor.fetchone()
print (my_record['col_a'],my_record['col_b'])

>> 1, 2

回答 2

在单独的查询中获取列名,可以查询information_schema.columns表。

#!/usr/bin/env python3

import psycopg2

if __name__ == '__main__':
  DSN = 'host=YOUR_DATABASE_HOST port=YOUR_DATABASE_PORT dbname=YOUR_DATABASE_NAME user=YOUR_DATABASE_USER'

  column_names = []

  with psycopg2.connect(DSN) as connection:
      with connection.cursor() as cursor:
          cursor.execute("select column_name from information_schema.columns where table_schema = 'YOUR_SCHEMA_NAME' and table_name='YOUR_TABLE_NAME'")
          column_names = [row[0] for row in cursor]

  print("Column names: {}\n".format(column_names))

在与数据行相同的查询中获取列名,可以使用游标的描述字段:

#!/usr/bin/env python3

import psycopg2

if __name__ == '__main__':
  DSN = 'host=YOUR_DATABASE_HOST port=YOUR_DATABASE_PORT dbname=YOUR_DATABASE_NAME user=YOUR_DATABASE_USER'

  column_names = []
  data_rows = []

  with psycopg2.connect(DSN) as connection:
    with connection.cursor() as cursor:
      cursor.execute("select field1, field2, fieldn from table1")
      column_names = [desc[0] for desc in cursor.description]
      for row in cursor:
        data_rows.append(row)

  print("Column names: {}\n".format(column_names))

To get the column names in a separate query, you can query the information_schema.columns table.

#!/usr/bin/env python3

import psycopg2

if __name__ == '__main__':
  DSN = 'host=YOUR_DATABASE_HOST port=YOUR_DATABASE_PORT dbname=YOUR_DATABASE_NAME user=YOUR_DATABASE_USER'

  column_names = []

  with psycopg2.connect(DSN) as connection:
      with connection.cursor() as cursor:
          cursor.execute("select column_name from information_schema.columns where table_schema = 'YOUR_SCHEMA_NAME' and table_name='YOUR_TABLE_NAME'")
          column_names = [row[0] for row in cursor]

  print("Column names: {}\n".format(column_names))

To get column names in the same query as data rows, you can use the description field of the cursor:

#!/usr/bin/env python3

import psycopg2

if __name__ == '__main__':
  DSN = 'host=YOUR_DATABASE_HOST port=YOUR_DATABASE_PORT dbname=YOUR_DATABASE_NAME user=YOUR_DATABASE_USER'

  column_names = []
  data_rows = []

  with psycopg2.connect(DSN) as connection:
    with connection.cursor() as cursor:
      cursor.execute("select field1, field2, fieldn from table1")
      column_names = [desc[0] for desc in cursor.description]
      for row in cursor:
        data_rows.append(row)

  print("Column names: {}\n".format(column_names))

回答 3

如果要从数据库查询中获取命名元组obj,可以使用以下代码段:

from collections import namedtuple

def create_record(obj, fields):
    ''' given obj from db returns named tuple with fields mapped to values '''
    Record = namedtuple("Record", fields)
    mappings = dict(zip(fields, obj))
    return Record(**mappings)

cur.execute("Select * FROM people")
colnames = [desc[0] for desc in cur.description]
rows = cur.fetchall()
cur.close()
result = []
for row in rows:
    result.append(create_record(row, colnames))

这使您可以将记录值当作类属性来访问,即

record.id,record.other_table_column_name等。

甚至更短

from psycopg2.extras import NamedTupleCursor
with cursor(cursor_factory=NamedTupleCursor) as cur:
   cur.execute("Select * ...")
   return cur.fetchall()

If you want to have a named tuple obj from db query you can use the following snippet:

from collections import namedtuple

def create_record(obj, fields):
    ''' given obj from db returns named tuple with fields mapped to values '''
    Record = namedtuple("Record", fields)
    mappings = dict(zip(fields, obj))
    return Record(**mappings)

cur.execute("Select * FROM people")
colnames = [desc[0] for desc in cur.description]
rows = cur.fetchall()
cur.close()
result = []
for row in rows:
    result.append(create_record(row, colnames))

This allows you to access record values as if they were class properties i.e.

record.id, record.other_table_column_name, etc.

or even shorter

from psycopg2.extras import NamedTupleCursor
with cursor(cursor_factory=NamedTupleCursor) as cur:
   cur.execute("Select * ...")
   return cur.fetchall()

回答 4

执行SQL查询后,编写以下2.7中编写的python脚本

total_fields = len(cursor.description)    
fields_names = [i[0] for i in cursor.description   
    Print fields_names

After executing SQL query write following python script written in 2.7

total_fields = len(cursor.description)    
fields_names = [i[0] for i in cursor.description   
    Print fields_names

回答 5

我注意到,您必须cursor.fetchone()在查询后使用来获取中cursor.description(即[desc[0] for desc in curs.description])中的列列表

I have noticed that you must use cursor.fetchone() after the query to get the list of columns in cursor.description (i.e in [desc[0] for desc in curs.description])


回答 6

我也曾经遇到过类似的问题。我用一个简单的技巧来解决这个问题。假设列表中有列名,例如

col_name = ['a', 'b', 'c']

然后您可以执行以下操作

for row in cursor.fetchone():
    print zip(col_name, row)

I also used to face similar issue. I use a simple trick to solve this. Suppose you have column names in a list like

col_name = ['a', 'b', 'c']

Then you can do following

for row in cursor.fetchone():
    print zip(col_name, row)

回答 7

 # You can use this function
 def getColumns(cursorDescription):
     columnList = []
     for tupla in cursorDescription:
         columnList.append(tupla[0])
     return columnList 
 # You can use this function
 def getColumns(cursorDescription):
     columnList = []
     for tupla in cursorDescription:
         columnList.append(tupla[0])
     return columnList 

回答 8

#!/usr/bin/python
import psycopg2
#note that we have to import the Psycopg2 extras library!
import psycopg2.extras
import sys

def main():
    conn_string = "host='localhost' dbname='my_database' user='postgres' password='secret'"
    # print the connection string we will use to connect
    print "Connecting to database\n ->%s" % (conn_string)

    # get a connection, if a connect cannot be made an exception will be raised here
    conn = psycopg2.connect(conn_string)

    # conn.cursor will return a cursor object, you can use this query to perform queries
    # note that in this example we pass a cursor_factory argument that will
    # dictionary cursor so COLUMNS will be returned as a dictionary so we
    # can access columns by their name instead of index.
    cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

    # tell postgres to use more work memory
    work_mem = 2048

    # by passing a tuple as the 2nd argument to the execution function our
    # %s string variable will get replaced with the order of variables in
    # the list. In this case there is only 1 variable.
    # Note that in python you specify a tuple with one item in it by placing
    # a comma after the first variable and surrounding it in parentheses.
    cursor.execute('SET work_mem TO %s', (work_mem,))

    # Then we get the work memory we just set -> we know we only want the
    # first ROW so we call fetchone.
    # then we use bracket access to get the FIRST value.
    # Note that even though we've returned the columns by name we can still
    # access columns by numeric index as well - which is really nice.
    cursor.execute('SHOW work_mem')

    # Call fetchone - which will fetch the first row returned from the
    # database.
    memory = cursor.fetchone()

    # access the column by numeric index:
    # even though we enabled columns by name I'm showing you this to
    # show that you can still access columns by index and iterate over them.
    print "Value: ", memory[0]

    # print the entire row 
    print "Row: ", memory

if __name__ == "__main__":
    main()
#!/usr/bin/python
import psycopg2
#note that we have to import the Psycopg2 extras library!
import psycopg2.extras
import sys

def main():
    conn_string = "host='localhost' dbname='my_database' user='postgres' password='secret'"
    # print the connection string we will use to connect
    print "Connecting to database\n ->%s" % (conn_string)

    # get a connection, if a connect cannot be made an exception will be raised here
    conn = psycopg2.connect(conn_string)

    # conn.cursor will return a cursor object, you can use this query to perform queries
    # note that in this example we pass a cursor_factory argument that will
    # dictionary cursor so COLUMNS will be returned as a dictionary so we
    # can access columns by their name instead of index.
    cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

    # tell postgres to use more work memory
    work_mem = 2048

    # by passing a tuple as the 2nd argument to the execution function our
    # %s string variable will get replaced with the order of variables in
    # the list. In this case there is only 1 variable.
    # Note that in python you specify a tuple with one item in it by placing
    # a comma after the first variable and surrounding it in parentheses.
    cursor.execute('SET work_mem TO %s', (work_mem,))

    # Then we get the work memory we just set -> we know we only want the
    # first ROW so we call fetchone.
    # then we use bracket access to get the FIRST value.
    # Note that even though we've returned the columns by name we can still
    # access columns by numeric index as well - which is really nice.
    cursor.execute('SHOW work_mem')

    # Call fetchone - which will fetch the first row returned from the
    # database.
    memory = cursor.fetchone()

    # access the column by numeric index:
    # even though we enabled columns by name I'm showing you this to
    # show that you can still access columns by index and iterate over them.
    print "Value: ", memory[0]

    # print the entire row 
    print "Row: ", memory

if __name__ == "__main__":
    main()