I would like to create a string buffer to do lots of processing, format it, and finally write the buffer to a text file using C-style sprintf functionality in Python. Because of conditional statements, I can't write the values directly to the file.
Edit, to clarify my question: buf is a big buffer that contains all of these strings, formatted using sprintf.
Going by your examples, buf will only contain current values, not older ones.
e.g. first I wrote A = something, B = something into buf, and later C = something was appended to the same buf; but in your Python answers buf contains only the last value, which is not what I want – I want buf to hold all the printfs I have done since the beginning, like in C.
Answer 0
Python provides a % operator for this.
>>> a = 5
>>> b = "hello"
>>> buf = "A = %d\n , B = %s\n" % (a, b)
>>> print buf
A = 5
, B = hello
>>> c = 10
>>> buf = "C = %d\n" % c
>>> print buf
C = 10
See this reference for all supported format specifiers.
>>> import StringIO
>>> buf = StringIO.StringIO()
>>> buf.write("A = %d, B = %s\n" % (3, "bar"))
>>> buf.write("C=%d\n" % 5)
>>> print(buf.getvalue())
A = 3, B = bar
C=5
To insert into a very long string it is nice to use names for the different arguments, instead of hoping they are in the right positions. This also makes it easier to replace multiple recurrences.
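For example (a minimal sketch; the dict keys are placeholders), the %(name)s syntax takes a mapping, so arguments are matched by name and can recur:
>>> values = {'a': 5, 'b': "hello"}
>>> buf = "A = %(a)d\n , B = %(b)s\n , B again = %(b)s\n" % values
>>> print buf
A = 5
 , B = hello
 , B again = hello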
This is probably the closest translation from your C code to Python code.
A = 1
B = "hello"
buf = "A = %d\n , B= %s\n" % (A, B)
c = 2
buf += "C=%d\n" % c
f = open('output.txt', 'w')
print >> f, buf  # write the accumulated buffer, not just c
f.close()
The % operator in Python does almost exactly the same thing as C's sprintf. You can also print the string to a file directly. If there are lots of these formatted string fragments involved, it might be wise to use a StringIO object to speed up processing time.
So instead of doing +=, do this:
import cStringIO
buf = cStringIO.StringIO()
...
print >> buf, "A = %d\n , B= %s\n" % (A, B)
...
print >> buf, "C=%d\n" % c
...
print >> f, buf.getvalue()
Two approaches are to write to a string buffer or to write lines to a list and join them later. I think the StringIO approach is more pythonic, but it didn't work before Python 2.6.
from io import StringIO
with StringIO() as s:
    print("Hello", file=s)
    print("Goodbye", file=s)
    # And later...
    with open('myfile', 'w') as f:
        f.write(s.getvalue())
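The list alternative mentioned above looks like this (a minimal sketch):
lines = []
lines.append("Hello")
lines.append("Goodbye")
# And later...
with open('myfile', 'w') as f:
    f.write('\n'.join(lines) + '\n')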
You can also use these without a ContextManager (s = StringIO()). Currently, I'm using a context manager class with a print function. This fragment might be useful for inserting debugging or odd paging requirements:
class Report:
    # a minimal stand-in for the "usual init/enter/exit" (path is a hypothetical parameter)
    def __init__(self, path='report.txt'):
        self.f = open(path, 'w')
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        self.f.close()
    def print(self, *args, **kwargs):
        with StringIO() as s:
            print(*args, **kwargs, file=s)
            out = s.getvalue()
            self.f.write(out)  # ... stuff with out
with Report() as r:
r.print(f"This is {datetime.date.today()}!", 'Yikes!', end=':')
I have manipulated some data using pandas and now I want to carry out a batch save back to the database. This requires me to convert the dataframe into an array of tuples, with each tuple corresponding to a “row” of the dataframe.
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

import pandas as pd
import numpy as np
from numpy import random  # needed by creator() below

def tuple_comp(df): return [tuple(x) for x in df.to_numpy()]
def iter_namedtuples(df): return list(df.itertuples(index=False))
def iter_tuples(df): return list(df.itertuples(index=False, name=None))
def records(df): return df.to_records(index=False).tolist()
def zipmap(df): return list(zip(*map(df.get, df)))

funcs = [tuple_comp, iter_namedtuples, iter_tuples, records, zipmap]
for func in funcs:
    b.add_function()(func)

def creator(n):
    return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})

@b.add_arguments('Rows in DataFrame')
def argument_provider():
    for n in (10 ** (np.arange(4, 11) / 2)).astype(int):
        yield n, creator(n)

r = b.run()
Check the results
r.to_pandas_dataframe().pipe(lambda d: d.div(d.min(1),0))
         tuple_comp  iter_namedtuples  iter_tuples   records    zipmap
100        2.905662          6.626308     3.450741  1.469471  1.000000
316        4.612692          4.814433     2.375874  1.096352  1.000000
1000       6.513121          4.106426     1.958293  1.000000  1.316303
3162       8.446138          4.082161     1.808339  1.000000  1.533605
10000      8.424483          3.621461     1.651831  1.000000  1.558592
31622      7.813803          3.386592     1.586483  1.000000  1.515478
100000     7.050572          3.162426     1.499977  1.000000  1.480131
Motivation
Many data sets are large enough that we need to concern ourselves with speed/efficiency. So I offer this solution in that spirit. It happens to also be succinct.
For the sake of comparison, let's drop the index column.
It happens to also be flexible if we wanted to deal with a specific subset of columns. We’ll assume the columns we’ve already displayed are the subset we want.
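For example (a sketch; 'A' and 'B' stand in for whatever subset you want), the zipmap idea restricted to chosen columns:
cols = ['A', 'B']
tuples = list(zip(*map(df.get, cols)))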
The idea of setting the datetime column as the index axis is to aid in converting the Timestamp values to their corresponding datetime.datetime equivalents, by making use of the convert_datetime64 argument of DF.to_records, which does so for a DateTimeIndex dataframe.
This returns a recarray, which can then be turned into a list using .tolist().
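As a sketch (assuming a 'data_date' column of Timestamps, which is a hypothetical name here; note convert_datetime64 was deprecated in pandas 0.24 and later removed):
subset = df.set_index('data_date')
records = subset.to_records(convert_datetime64=True).tolist()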
More generalized solution depending on the use case would be:
df.to_records().tolist() # Supply index=False to exclude index
This answer doesn’t add any answers that aren’t already discussed, but here are some speed results. I think this should resolve questions that came up in the comments. All of these look like they are O(n), based on these three values.
TL;DR: tuples = list(df.itertuples(index=False, name=None)) and tuples = list(zip(*[df[c].values.tolist() for c in df])) are tied for the fastest.
I did a quick speed test on results for three suggestions here:
The zip answer from @pirsquared: tuples = list(zip(*[df[c].values.tolist() for c in df]))
The accepted answer from @wes-mckinney: tuples = [tuple(x) for x in df.values]
The itertuples answer from @ksindi with the name=None suggestion from @Axel: tuples = list(df.itertuples(index=False, name=None))
from numpy import random
import pandas as pd
def create_random_df(n):
    return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})
Small size:
df = create_random_df(10000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
1.66 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
15.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.74 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Larger:
df = create_random_df(1000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
202 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.52 s ± 98.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
209 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As much patience as I have:
df = create_random_df(10000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
1.78 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.68 s ± 96.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The zip version and the itertuples version are within each other's confidence intervals. I suspect that they are doing the same thing under the hood.
These speed tests are probably irrelevant though. Pushing the limits of my computer’s memory doesn’t take a huge amount of time, and you really shouldn’t be doing this on a large data set. Working with those tuples after doing this will end up being really inefficient. It’s unlikely to be a major bottleneck in your code, so just stick with the version you think is most readable.
Answer 7
# try this one:
tuples = list(zip(data_set["data_date"], data_set["data_1"], data_set["data_2"]))
print(tuples)
I’m trying to do a “hello world” with new boto3 client for AWS.
The use-case I have is fairly simple: get object from S3 and save it to the file.
In boto 2.X I would do it like this:
import boto
key = boto.connect_s3().get_bucket('foo').get_key('foo')
key.get_contents_to_filename('/tmp/foo')
In boto 3, I can't find a clean way to do the same thing, so I'm manually iterating over the "Streaming" object:
import boto3
key = boto3.resource('s3').Object('fooo', 'docker/my-image.tar.gz').get()
with open('/tmp/my-image.tar.gz', 'wb') as f:
    chunk = key['Body'].read(1024*8)
    while chunk:
        f.write(chunk)
        chunk = key['Body'].read(1024*8)
or
import boto3
key = boto3.resource('s3').Object('fooo', 'docker/my-image.tar.gz').get()
with open('/tmp/my-image.tar.gz', 'wb') as f:
    for chunk in iter(lambda: key['Body'].read(4096), b''):
        f.write(chunk)
And it works fine. I was wondering: is there any "native" boto3 function that will do the same task?
There is a customization that went into Boto3 recently which helps with this (among other things). It is currently exposed on the low-level S3 client, and can be used like this:
s3_client = boto3.client('s3')
open('hello.txt', 'w').write('Hello, world!')
# Upload the file to S3
s3_client.upload_file('hello.txt', 'MyBucket', 'hello-remote.txt')
# Download the file from S3
s3_client.download_file('MyBucket', 'hello-remote.txt', 'hello2.txt')
print(open('hello2.txt').read())
These functions will automatically handle reading/writing files as well as doing multipart uploads in parallel for large files.
Note that s3_client.download_file won't create a directory. You can create it with pathlib.Path('/path/to/file.txt').parent.mkdir(parents=True, exist_ok=True).
This by itself isn’t tremendously better than the client in the accepted answer (although the docs say that it does a better job retrying uploads and downloads on failure) but considering that resources are generally more ergonomic (for example, the s3 bucket and object resources are nicer than the client methods) this does allow you to stay at the resource layer without having to drop down.
Resources generally can be created in the same way as clients, and they take all or most of the same arguments and just forward them to their internal clients.
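For example (a minimal sketch; bucket and key names are placeholders), the same transfer methods exist on the resource objects:
import boto3

s3 = boto3.resource('s3')
s3.Bucket('MyBucket').upload_file('hello.txt', 'hello-remote.txt')
s3.Bucket('MyBucket').download_file('hello-remote.txt', 'hello2.txt')
# the Object resource works too:
s3.Object('MyBucket', 'hello-remote.txt').download_file('hello3.txt')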
Answer 2
For those of you who would like to simulate the boto2-style set_contents_from_string method, you can try:
import boto3
from cStringIO import StringIO
s3c = boto3.client('s3')
contents = 'My string to save to S3 object'
target_bucket = 'hello-world.by.vor'
target_file = 'data/hello.txt'
fake_handle = StringIO(contents)
# notice if you do fake_handle.read() it reads like a file handle
s3c.put_object(Bucket=target_bucket, Key=target_file, Body=fake_handle.read())
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
Answer 3
# Preface: File is json with contents: {'name': 'Android', 'status': 'ERROR'}
import boto3
import io
import json
s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'key-to-file.json')
data = io.BytesIO()
obj.download_fileobj(data)
# data now holds the object's bytes. Converting it to a dict:
new_dict = json.loads(data.getvalue().decode("utf-8"))
print(new_dict['status'])
# Should print "ERROR"
When you want to read a file with a different configuration than the default one, feel free to use either mpu.aws.s3_download(s3path, destination) directly or the copy-pasted code:
import os
import boto3


def s3_download(source, destination,
                exists_strategy='raise',
                profile_name=None):
    """
    Copy a file from an S3 source to a local destination.

    Parameters
    ----------
    source : str
        Path starting with s3://, e.g. 's3://bucket-name/key/foo.bar'
    destination : str
    exists_strategy : {'raise', 'replace', 'abort'}
        What is done when the destination already exists?
    profile_name : str, optional
        AWS profile

    Raises
    ------
    botocore.exceptions.NoCredentialsError
        Botocore is not able to find your credentials. Either specify
        profile_name or add the environment variables AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN.
        See https://boto3.readthedocs.io/en/latest/guide/configuration.html
    """
    exists_strategies = ['raise', 'replace', 'abort']
    if exists_strategy not in exists_strategies:
        raise ValueError('exists_strategy \'{}\' is not in {}'
                         .format(exists_strategy, exists_strategies))
    session = boto3.Session(profile_name=profile_name)
    s3 = session.resource('s3')
    bucket_name, key = _s3_path_split(source)
    if os.path.isfile(destination):
        if exists_strategy == 'raise':
            raise RuntimeError('File \'{}\' already exists.'
                               .format(destination))
        elif exists_strategy == 'abort':
            return
    s3.Bucket(bucket_name).download_file(key, destination)
from collections import namedtuple
S3Path = namedtuple("S3Path", ["bucket_name", "key"])
def _s3_path_split(s3_path):
    """
    Split an S3 path into bucket and key.

    Parameters
    ----------
    s3_path : str

    Returns
    -------
    splitted : (str, str)
        (bucket, key)

    Examples
    --------
    >>> _s3_path_split('s3://my-bucket/foo/bar.jpg')
    S3Path(bucket_name='my-bucket', key='foo/bar.jpg')
    """
    if not s3_path.startswith("s3://"):
        raise ValueError(
            "s3_path is expected to start with 's3://', "
            "but was {}".format(s3_path)
        )
    bucket_key = s3_path[len("s3://"):]
    bucket_name, key = bucket_key.split("/", 1)
    return S3Path(bucket_name, key)
Answer 5
Note: I'm assuming you have configured authentication separately. The code below downloads a single object from an S3 bucket.
import boto3

# initiate the s3 resource
s3 = boto3.resource('s3')
# download the object to a file
s3.Bucket('mybucket').download_file('hello.txt', '/tmp/hello.txt')
I’m using the Mock library to test my application, but I want to assert that some function was not called. Mock docs talk about methods like mock.assert_called_with and mock.assert_called_once_with, but I didn’t find anything like mock.assert_not_called or something related to verify mock was NOT called.
I could go with something like the following, though it doesn’t seem cool nor pythonic:
def test_something():
    # some actions
    with patch('something') as my_var:
        try:
            # args are not important. func should never be called in this test
            my_var.assert_called_with(some, args)
        except AssertionError:
            pass  # this error being raised means it's ok
    # other stuff
Any ideas how to accomplish this?
Answer 0
This should work for your case:
assert not my_var.called, 'method should not have been called'
Sample:
>>> mock=Mock()
>>> mock.a()
<Mock name='mock.a()' id='4349129872'>
>>> assert not mock.b.called, 'b was called and should not have been'
>>> assert not mock.a.called, 'a was called and should not have been'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError: a was called and should not have been
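As a side note, newer versions (unittest.mock in Python 3.5+, or the mock backport) provide assert_not_called() directly; the exact failure message varies slightly by version:
>>> from unittest.mock import Mock
>>> m = Mock()
>>> m.assert_not_called()
>>> m()
<Mock name='mock()' id='4349129872'>
>>> m.assert_not_called()
Traceback (most recent call last):
  ...
AssertionError: Expected 'mock' to not have been called. Called 1 times.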
You can check the called attribute, but if your assertion fails, the next thing you’ll want to know is something about the unexpected call, so you may as well arrange for that information to be displayed from the start. Using unittest, you can check the contents of call_args_list instead:
self.assertItemsEqual(my_var.call_args_list, [])
When it fails, it gives a message like this:
AssertionError: Element counts were not equal:
First has 0, Second has 1: call('first argument', 4)
In your example we can simply assert that the mock_method.called property is False, which means that the method was not called.
import unittest
from unittest import mock
import my_module

class A(unittest.TestCase):
    def setUp(self):
        self.message = "Method should not be called. Called {times} times!"

    @mock.patch("my_module.method_to_mock")
    def test(self, mock_method):
        my_module.method_to_mock()
        self.assertFalse(mock_method.called,
                         self.message.format(times=mock_method.call_count))
Consuming a call object is easy, since you can compare it with a tuple of length 2 where the first component is a tuple containing all the positional arguments of the related call, while the second component is a dictionary of the keyword arguments.
>>> from unittest.mock import MagicMock
>>> m = MagicMock()
>>> m(42)
<MagicMock name='mock()' id='139675158423872'>
>>> ((42,),) in m.call_args_list
True
>>> m(42, foo='bar')
<MagicMock name='mock()' id='139675158423872'>
>>> ((42,), {'foo': 'bar'}) in m.call_args_list
True
>>> m(foo='bar')
<MagicMock name='mock()' id='139675158423872'>
>>> ((), {'foo': 'bar'}) in m.call_args_list
True
So, a way to address the specific problem of the OP is
def test_something():
    with patch('something') as my_var:
        assert ((some, args),) not in my_var.call_args_list
Note that this way, instead of just checking if a mocked callable has been called, via MagicMock.called, you can now check if it has been called with a specific set of arguments.
That’s useful. Say you want to test a function that takes a list and call another function, compute(), for each of the value of the list only if they satisfy a specific condition.
You can now mock compute, and test if it has been called on some value but not on others.
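A sketch of that scenario (process_list, compute, and the condition are hypothetical names; patching '__main__.compute' assumes everything lives in the script being run, otherwise patch 'yourmodule.compute'):
from unittest.mock import patch

def compute(value):
    ...  # expensive work we want to avoid in the test

def process_list(values):
    for v in values:
        if v > 0:  # the "specific condition"
            compute(v)

def test_process_list():
    with patch('__main__.compute') as mock_compute:
        process_list([-1, 2])
        # compute was called with 2, but never with -1
        assert ((2,),) in mock_compute.call_args_list
        assert ((-1,),) not in mock_compute.call_args_list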
Here’s a relevant example from the itertools module docs:
import itertools
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)
For Python 2, you need itertools.izip instead of zip:
import itertools
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return itertools.izip(a, b)
How this works:
First, two parallel iterators, a and b, are created (the tee() call), both pointing to the first element of the original iterable. The second iterator, b, is moved 1 step forward (the next(b, None) call). At this point a points to s0 and b points to s1. Both a and b can traverse the original iterator independently – the izip function takes the two iterators and makes pairs of the returned elements, advancing both iterators at the same pace.
One caveat: the tee() function produces two iterators that can advance independently of each other, but it comes at a cost. If one of the iterators advances further than the other, then tee() needs to keep the consumed elements in memory until the second iterator consumes them too (it cannot 'rewind' the original iterator). Here it doesn't matter because one iterator is only 1 step ahead of the other, but in general it's easy to use a lot of memory this way.
And since tee() can take an n parameter, this can also be used for more than two parallel iterators:
def threes(iterator):
    "s -> (s0,s1,s2), (s1,s2,s3), (s2,s3,s4), ..."
    a, b, c = itertools.tee(iterator, 3)
    next(b, None)
    next(c, None)
    next(c, None)
    return zip(a, b, c)
Answer 1
Roll your own!
def pairwise(iterable):
    it = iter(iterable)
    a = next(it, None)
    for b in it:
        yield (a, b)
        a = b
Since the_list[1:] actually creates a copy of the whole list (excluding its first element), and zip() creates a list of tuples immediately when called, in total three copies of your list are created. If your list is very large, you might prefer
from itertools import izip, islice
for current_item, next_item in izip(the_list, islice(the_list, 1, None)):
    print(current_item, next_item)
from more_itertools import pairwise
for current, nxt in pairwise(your_iterable):
    print(f'Current = {current}, next = {nxt}')
Docs for more-itertools
Under the hood this code is the same as that in the other answers, but I much prefer imports when available.
If you don’t already have it installed then:
pip install more-itertools
Example
For instance, if you had the Fibonacci sequence, you could calculate the ratios of subsequent pairs as:
from more_itertools import pairwise
fib = [1, 1, 2, 3, 5, 8, 13]
for current, nxt in pairwise(fib):
    ratio = current / nxt
    print(f'Current = {current}, next = {nxt}, ratio = {ratio}')
Answer 6
Using a list comprehension to make pairs from the list:
the_list = [1, 2, 3, 4]
pairs = [[the_list[i], the_list[i + 1]] for i in range(len(the_list) - 1)]
for [current_item, next_item] in pairs:
    print(current_item, next_item)
Output:
(1, 2)
(2, 3)
(3, 4)
Answer 7
I am really surprised nobody has mentioned the shorter, simpler and most importantly general solution:
Python 3:
from itertools import islice
def n_wise(iterable, n):
    return zip(*(islice(iterable, i, None) for i in range(n)))
Python 2:
from itertools import izip, islice
def n_wise(iterable, n):
    return izip(*(islice(iterable, i, None) for i in xrange(n)))
It works for pairwise iteration by passing n=2, but can handle any higher number:
>>> for a, b in n_wise('Hello!', 2):
...     print(a, b)
H e
e l
l l
l o
o !
>>> for a, b, c, d in n_wise('Hello World!', 4):
...     print(a, b, c, d)
H e l l
e l l o
l l o
l o W
o W o
W o r
W o r l
o r l d
r l d !
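One caveat of mine: the islice trick needs an iterable you can slice from several independent starting points, so it suits sequences like strings and lists. For a one-shot iterator you could combine it with itertools.tee (a sketch, Python 3 shown):
from itertools import islice, tee

def n_wise_iter(iterable, n):
    # tee() yields n independent iterators; advance the i-th copy i steps
    return zip(*(islice(it, i, None) for i, it in enumerate(tee(iterable, n))))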
Answer 8
A basic solution:
def neighbors(seq):
    i = 0
    while i + 1 < len(seq):
        yield (seq[i], seq[i + 1])
        i += 1

for (x, y) in neighbors(the_list):
    print(x, y)
Answer 9
code = '0016364ee0942aa7cc04a8189ef3'
# Getting the current and next item
print [code[idx]+code[idx+1] for idx in range(len(code)-1)]
# Getting the pair
print [code[idx*2]+code[idx*2+1] for idx in range(len(code)/2)]
I tried to use multiple assignment as shown below to initialize variables, but I was confused by the behavior: I expected the value lists to be reassigned separately. I mean that b[0] and c[0] should equal 0 as before.
a=b=c=[0,3,5]
a[0]=1
print(a)
print(b)
print(c)
Result is:
[1, 3, 5]
[1, 3, 5]
[1, 3, 5]
Is that correct? What should I use for multiple assignment?
How is this different from the following?
If you’re coming to Python from a language in the C/Java/etc. family, it may help you to stop thinking about a as a “variable”, and start thinking of it as a “name”.
a, b, and c aren’t different variables with equal values; they’re different names for the same identical value. Variables have types, identities, addresses, and all kinds of stuff like that.
Names don’t have any of that. Values do, of course, and you can have lots of names for the same value.
If you give Notorious B.I.G. a hot dog,* Biggie Smalls and Chris Wallace have a hot dog. If you change the first element of a to 1, the first elements of b and c are 1.
If you want to know if two names are naming the same object, use the is operator:
>>> a=b=c=[0,3,5]
>>> a is b
True
You then ask:
what is different from this?
d=e=f=3
e=4
print('f:',f)
print('e:',e)
Here, you’re rebinding the name e to the value 4. That doesn’t affect the names d and f in any way.
In your previous version, you were assigning to a[0], not to a. So, from the point of view of a[0], you’re rebinding a[0], but from the point of view of a, you’re changing it in-place.
You can use the id function, which gives you some unique number representing the identity of an object, to see exactly which object is which even when is can’t help:
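For example (the transcript is missing here, so this is a sketch of the session being described; the exact id values will differ on your machine):
>>> a = b = c = [0, 3, 5]
>>> id(a[0]), id(b[0])
(4297261120, 4297261120)
>>> a[0] = 1
>>> id(a[0]), id(b[0])
(4297261216, 4297261216)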
Notice that a[0] has changed from 4297261120 to 4297261216—it’s now a name for a different value. And b[0] is also now a name for that same new value. That’s because a and b are still naming the same object.
Under the covers, a[0]=1 is actually calling a method on the list object. (It’s equivalent to a.__setitem__(0, 1).) So, it’s not really rebinding anything at all. It’s like calling my_object.set_something(1). Sure, likely the object is rebinding an instance attribute in order to implement this method, but that’s not what’s important; what’s important is that you’re not assigning anything, you’re just mutating the object. And it’s the same with a[0]=1.
user570826 asked:
What if we have, a = b = c = 10
That’s exactly the same situation as a = b = c = [1, 2, 3]: you have three names for the same value.
But in this case, the value is an int, and ints are immutable. In either case, you can rebind a to a different value (e.g., a = "Now I'm a string!"), but that won't affect the original value, which b and c will still be names for. The difference is that with a list, you can change the value [1, 2, 3] into [1, 2, 3, 4] by doing, e.g., a.append(4); since that's actually changing the value that b and c are names for, b will now be [1, 2, 3, 4]. There's no way to change the value 10 into anything else. 10 is 10 forever, just like Claudia the vampire is 5 forever (at least until she's replaced by Kirsten Dunst).
* Warning: Do not give Notorious B.I.G. a hot dog. Gangsta rap zombies should never be fed after midnight.
Answer 1
*cough cough*
>>> a,b,c = (1,2,3)
>>> a
1
>>> b
2
>>> c
3
>>> a,b,c = ({'test':'a'},{'test':'b'},{'test':'c'})
>>> a
{'test': 'a'}
>>> b
{'test': 'b'}
>>> c
{'test': 'c'}
>>>
Yes, that’s the expected behavior. a, b and c are all set as labels for the same list. If you want three different lists, you need to assign them individually. You can either repeat the explicit list, or use one of the numerous ways to copy a list:
b = a[:] # this does a shallow copy, which is good enough for this case
import copy
c = copy.deepcopy(a) # this does a deep copy, which matters if the list contains mutable objects
Assignment statements in Python do not copy objects – they bind the name to an object, and an object can have as many labels as you set. In your first edit, changing a[0], you’re updating one element of the single list that a, b, and c all refer to. In your second, changing e, you’re switching e to be a label for a different object (4 instead of 3).
In Python, everything is an object, including "simple" variable types (int, float, etc.).
When you change a variable's value, you actually change its pointer, and when you compare two variables (with is), you compare their pointers.
(To be clear, a pointer here means the address in memory where a value is stored.)
As a result, when you change an inner value in place, you change it in memory, and that affects all the variables that point to that address.
For your example, when you do:
a = b = 5
a and b point to the same address in memory, which contains the value 5. But when you do:
a = 6
it doesn't affect b, because a now points to another memory location that contains 6, while b still points to the memory address that contains 5.
But when you do:
a = b = [1,2,3]
a and b again point to the same location. The difference is that if you change one of the list's values in place:
a[0] = 2
it changes the value in the memory that a points to; but a still points to the same address as b, and as a result b changes as well.
Answer 4
You can use id(name) to check whether two names refer to the same object:
>>> a = b = c = [0, 3, 5]
>>> print(id(a), id(b), id(c))
46268488 46268488 46268488
>>> a = [1, 8, 5]
>>> print(id(a), id(b), id(c))
139423880 46268488 46268488
>>> print(a, b, c)
[1, 8, 5] [1, 3, 5] [1, 3, 5]
Integers are immutable, so you cannot change the value without creating a new object:
>>> x = y = z = 1
>>> print(id(x), id(y), id(z))
507081216 507081216 507081216
>>> x = 2
>>> print(id(x), id(y), id(z))
507081248 507081216 507081216
>>> print(x, y, z)
2 1 1
Simply put, in the first case, you are assigning multiple names to a list. Only one copy of list is created in memory and all names refer to that location. So changing the list using any of the names will actually modify the list in memory.
In the second case, multiple copies of same value are created in memory. So each copy is independent of one another.
Answer 7
What you need is:
a, b, c = [0,3,5] # Unpack the list, now a, b, and c are ints
a = 1 # `a` did equal 0, not [0,3,5]
print(a)
print(b)
print(c)
Answer 8
Code that does what I need might look like this:
# test
aux=[[0 for n in range(3)] for i in range(4)]
print('aux:',aux)
# initialization
a,b,c,d=[[0 for n in range(3)] for i in range(4)]
# changing values
a[0]=1
d[2]=5
print('a:',a)
print('b:',b)
print('c:',c)
print('d:',d)
I am plotting two similar trajectories in matplotlib and I’d like to plot each of the lines with partial transparency so that the red (plotted second) doesn’t obscure the blue.
It really depends on what functions you're using to plot the lines, but try to see if the one you're using takes an alpha value and set it to something like 0.5. If that doesn't work, try getting the line objects and setting their alpha values directly.
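For example (a minimal sketch with made-up data), using the alpha keyword, and setting it on the existing line objects as the fallback:
import matplotlib.pyplot as plt

xs = range(100)
blue_ys = [x * 0.5 for x in xs]
red_ys = [x * 0.5 + 2 for x in xs]

plt.plot(xs, blue_ys, color='blue', alpha=0.5)
plt.plot(xs, red_ys, color='red', alpha=0.5)

# or, after the fact, on the Line2D objects themselves:
for line in plt.gca().get_lines():
    line.set_alpha(0.5)

plt.show()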
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
From the NLTK docs:
Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.
Lemmatisation is closely related to stemming. The difference is that a
stemmer operates on a single word without knowledge of the context,
and therefore cannot discriminate between words which have different
meanings depending on part of speech. However, stemmers are typically
easier to implement and run faster, and the reduced accuracy may not
matter for some applications.
For instance:
The word “better” has “good” as its lemma. This link is missed by
stemming, as it requires a dictionary look-up.
The word “walk” is the base form for word “walking”, and hence this
is matched in both stemming and lemmatisation.
The word “meeting” can be either the base form of a noun or a form
of a verb (“to meet”) depending on the context, e.g., “in our last
meeting” or “We are meeting again tomorrow”. Unlike stemming,
lemmatisation can in principle select the appropriate lemma
depending on the context.
A stemmer will return the stem of a word, which needn't be identical to the morphological root of the word. It is usually sufficient that related words map to the same stem, even if the stem is not in itself a valid root. In lemmatisation, the dictionary form of the word is returned, which must be a valid word.
In lemmatisation, the part of speech of a word should be determined first, and the normalisation rules differ for different parts of speech. The stemmer, by contrast, operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on the part of speech.
The purpose of both stemming and lemmatization is to reduce morphological variation. This is in contrast to the more general "term conflation" procedures, which may also address lexico-semantic, syntactic, or orthographic variations.
The real difference between stemming and lemmatization is threefold:
Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas. This difference is apparent in languages with more complex morphology, but may be irrelevant for many IR applications;
Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance;
In terms of implementation, lemmatization is usually more sophisticated (especially for morphologically complex languages) and usually requires some sort of lexica. Satisfactory stemming, on the other hand, can be achieved with rather simple rule-based approaches.
Lemmatization may also be backed up by a part-of-speech tagger in order to disambiguate homonyms.
As MYYN pointed out, stemming is the process of removing inflectional and sometimes derivational affixes to a base form that all of the original words are probably related to. Lemmatization is concerned with obtaining the single word that allows you to group together a bunch of inflected forms. This is harder than stemming because it requires taking the context into account (and thus the meaning of the word), while stemming ignores context.
As for when you would use one or the other, it’s a matter of how much your application depends on getting the meaning of a word in context correct. If you’re doing machine translation, you probably want lemmatization to avoid mistranslating a word. If you’re doing information retrieval over a billion documents with 99% of your queries ranging from 1-3 words, you can settle for stemming.
As for NLTK, the WordNetLemmatizer does use the part of speech, though you have to provide it (otherwise it defaults to nouns). Passing it “dove” and “v” yields “dive” while “dove” and “n” yields “dove”.
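For example (assuming NLTK and its WordNet data are installed):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('dove', pos='v'))  # 'dive'
print(lemmatizer.lemmatize('dove', pos='n'))  # 'dove'
print(lemmatizer.lemmatize('dove'))           # pos defaults to 'n', so also 'dove'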
An example-driven explanation of the differences between lemmatization and stemming:
Lemmatization handles matching “car” to “cars” along
with matching “car” to “automobile”.
Stemming handles matching "car" to "cars".
Lemmatization implies a broader scope of fuzzy word matching that is
still handled by the same subsystems. It implies certain techniques
for low level processing within the engine, and may also reflect an
engineering preference for terminology.
[…] Taking FAST as an example,
their lemmatization engine handles not only basic word variations like
singular vs. plural, but also thesaurus operators like having “hot”
match “warm”.
This is not to say that other engines don’t handle synonyms, of course
they do, but the low level implementation may be in a different
subsystem than those that handle base stemming.
IANACL (I am not a computational linguist), but I think stemming is a rough hack people use to get all the different forms of the same word down to a base form, which need not be a legitimate word on its own.
Something like the Porter Stemmer can use simple regexes to eliminate common word suffixes.
Lemmatization brings a word down to its actual base form which, in the case of irregular verbs, might look nothing like the input word.
Something like Morpha, which uses FSTs to bring nouns and verbs to their base form.
Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:
If you lemmatize the word ‘Caring‘, it would return ‘Care‘. If you stem, it would return ‘Car‘ and this is erroneous.
If you lemmatize the word ‘Stripes‘ in verb context, it would return ‘Strip‘. If you lemmatize it in noun context, it would return ‘Stripe‘. If you just stem it, it would just return ‘Strip‘.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming… to walk, run, swim, etc.
Lemmatization is computationally expensive since it involves look-up tables and what not. If you have a large dataset and performance is an issue, go with stemming. Remember, you can also add your own rules to stemming. If accuracy is paramount and the dataset isn't humongous, go with lemmatization.
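A sketch of the examples above using NLTK (assuming NLTK and the WordNet data are installed; the exact stems depend on the stemmer you pick):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word, pos in [('caring', 'v'), ('stripes', 'v'), ('stripes', 'n'), ('walking', 'v')]:
    print(word, '| stem:', stemmer.stem(word),
          '| lemma:', lemmatizer.lemmatize(word, pos=pos))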
In a django online course, the instructor has us use the url() function to call views and utilize regular expressions in the urlpatterns list. I’ve seen other examples on youtube of this.
e.g.
from django.contrib import admin
from django.urls import include
from django.conf.urls import url
urlpatterns = [
path('admin/', admin.site.urls),
url(r'^polls/', include('polls.urls')),
]
#and in polls/urls.py
urlpatterns = [
url(r'^$', views.index, name="index"),
]
However, in going through the Django tutorial, they use path() instead e.g.:
from django.urls import path
from . import views
urlpatterns = [
path('', views.index, name="index"),
]
Furthermore, regular expressions don't seem to work with the path() function, as using path(r'^$', views.index, name="index") won't find the mysite.com/polls/ view.
Is using path() without regex matching the proper way going forward? Is url() more powerful but more complicated so they’re using path() to start us out with? Or is it a case of different tools for different jobs?
The django.conf.urls.url() function from previous versions is now available as django.urls.re_path(). The old location remains for backwards compatibility, without an imminent deprecation. The old django.conf.urls.include() function is now importable from django.urls so you can use:
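For example (a minimal sketch; re_path() accepts the same regexes that url() did):
from django.urls import include, re_path

urlpatterns = [
    re_path(r'^polls/', include('polls.urls')),
]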
path is simply new in Django 2.0, which was only released a couple of weeks ago. Most tutorials won’t have been updated for the new syntax.
It was certainly supposed to be a simpler way of doing things; I wouldn't say that url() is more powerful though, you should be able to express patterns in either format.
From v2.0 many users are using path, but we can use either path or url.
For example in django 2.1.1
mapping to functions through url can be done as follows
from django.contrib import admin
from django.urls import path
from django.contrib.auth import login
from posts.views import post_home
from django.conf.urls import url
urlpatterns = [
path('admin/', admin.site.urls),
url(r'^posts/$', post_home, name='post_home'),
]
where posts is an application & post_home is a function in views.py
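For comparison, here is a sketch of the same mapping using path() converters instead of regexes (post_detail and the second route are hypothetical):
from django.urls import path
from posts.views import post_home, post_detail  # post_detail is hypothetical

urlpatterns = [
    path('posts/', post_home, name='post_home'),
    # <int:pk> plays the role of a regex group like (?P<pk>[0-9]+)
    path('posts/<int:pk>/', post_detail, name='post_detail'),
]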
You can apply dirname repeatedly to climb higher: dirname(dirname(__file__)). This can only go as far as the root package, however. If this is a problem, use os.path.abspath: dirname(dirname(abspath(__file__))).
os.path.abspath doesn’t validate anything, so if we’re already appending strings to __file__ there’s no need to bother with dirname or joining or any of that. Just treat __file__ as a directory and start climbing:
# climb to __file__'s parent's parent:
os.path.abspath(__file__ + "/../../")
That’s far less convoluted than os.path.abspath(os.path.join(os.path.dirname(__file__),"..")) and about as manageable as dirname(dirname(__file__)). Climbing more than two levels starts to get ridiculous.
But, since we know how many levels to climb, we could clean this up with a simple little function:
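A sketch of such a helper (the name climb is mine):
import os

def climb(path, levels=1):
    # treat `path` as a directory and climb the given number of parent levels
    return os.path.abspath(path + "/.." * levels)

# e.g. two levels above the current file:
# climb(__file__, 2)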